Explaining Privacy Models Used in Anonymization (k-Anonymity, l-Diversity, t-Closeness)

February 12, 2023
Privacy
Anonymization
Organizations use the personal data they collect from customers (for example, analyzing it or sharing it for business purposes) while complying both with the privacy policies their users agreed to and with data privacy regulations such as the GDPR and the CCPA. Achieving both goals, using the data for business while conforming to the regulations, is tricky. Privacy models such as k-Anonymity, l-Diversity, and t-Closeness are metrics that help by quantifying the extent to which personal data is de-identified.

What is Anonymization?

Anonymization is a data processing method whereby the subjects of the original data can never be re-identified. When you anonymize data, you want the result not to be traceable back to any particular individual. How exactly can you do that?

Let's say you have a tabular dataset like below.

| Name   | Gender | Age | Disease   |
|--------|--------|-----|-----------|
| Bob    | male   | 32  | flu       |
| Alice  | female | 38  | flu       |
| George | male   | 41  | norovirus |
| Smith  | male   | 48  | flu       |

If the data is published publicly without any processing, the sensitive data tied to particular individuals is obviously disclosed, because the names alone can easily identify the corresponding individuals. Attributes that identify individuals on their own, such as a person's name, are often referred to as explicit identifiers or direct identifiers. Explicit identifiers must be completely masked or deleted, like below.

| Name | Gender | Age | Disease   |
|------|--------|-----|-----------|
| *    | male   | 32  | flu       |
| *    | female | 38  | flu       |
| *    | male   | 41  | norovirus |
| *    | male   | 48  | flu       |

Then, is that enough to avoid re-identification? Unfortunately, no, if certain conditions are met. For example, if an adversary knows that the dataset contains Alice (and that she is a woman), he or she can tell that the second record is Alice's, because it is the only row whose gender attribute is female. Attributes that identify individuals with the help of other information, like gender, are referred to as quasi-identifiers (abbreviated as QIs; in this example the age column is also a quasi-identifier). You should generalize or delete QIs in some way.

| Name | Gender | Age | Disease   |
|------|--------|-----|-----------|
| *    | *      | 32  | flu       |
| *    | *      | 38  | flu       |
| *    | male   | 41  | norovirus |
| *    | male   | 48  | flu       |
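
To make the steps so far concrete, here is a minimal Python sketch (assuming pandas is available). The DataFrame simply mirrors the toy tables above, and the column names are only illustrative.

```python
import pandas as pd

# Toy data mirroring the tables above
df = pd.DataFrame({
    "Name": ["Bob", "Alice", "George", "Smith"],
    "Gender": ["male", "female", "male", "male"],
    "Age": [32, 38, 41, 48],
    "Disease": ["flu", "flu", "norovirus", "flu"],
})

# Suppress the explicit identifier completely
df["Name"] = "*"

# Suppress the Gender quasi-identifier for the two younger rows so that
# Alice can no longer be singled out as the only female record
df.loc[df["Age"] < 40, "Gender"] = "*"

print(df)
```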

Again, is that enough? Possibly yes, if the adversary has no more information than described above, but we cannot guarantee the data is 100% safe from re-identification, because you never know what adversaries know.

Apart from privacy, what about data usability? Removing all or part of the data inevitably causes data loss, in which case you may be unable to achieve what you originally wanted to do with the anonymized dataset (e.g., data analysis). Yes, there is a clear trade-off between privacy and usability. Ultimately, 100% privacy can be obtained by removing all the data, but there is no point in doing that, since we can no longer use the data for its original purpose.

k-Anonymity, l-Diversity, and t-Closeness are metrics that can be used to measure the extent to which processed data is obfuscated. Based on your data use cases, you may set target k, l, and t values for the processed dataset.

We'll see them one by one in the following sections.

k-Anonymity

Definition:

A tabular dataset is said to satisfy k-Anonymity if each sequence of values in the QIs appears with at least k occurrences in the dataset.

Think about the example dataset from the previous section. The dataset has two QIs, gender and age. If you process the data as below, it now satisfies k-Anonymity with k = 2.

| Name | Gender | Age   | Disease   |
|------|--------|-------|-----------|
| *    | *      | 30-39 | flu       |
| *    | *      | 30-39 | flu       |
| *    | male   | 40-49 | norovirus |
| *    | male   | 40-49 | flu       |

Take a look at the combination of gender and age in every row. The first and second rows now have the same QI values, and so do the third and fourth. That means each sequence of values in the QIs appears with at least k = 2 occurrences in the dataset. In this case, even if adversaries know that George (the third row) is included and that he is in his mid-40s, they cannot eliminate the possibility that the fourth record is George's.
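
As a quick sanity check, here is a small Python sketch (again assuming pandas) that computes the k of a table as the size of the smallest group of rows sharing the same QI values. The helper name k_of is made up for this example.

```python
import pandas as pd

# The generalized k = 2 table from above
df = pd.DataFrame({
    "Gender": ["*", "*", "male", "male"],
    "Age": ["30-39", "30-39", "40-49", "40-49"],
    "Disease": ["flu", "flu", "norovirus", "flu"],
})

def k_of(table: pd.DataFrame, quasi_identifiers: list[str]) -> int:
    """k is the size of the smallest group of rows sharing the same QI values."""
    return int(table.groupby(quasi_identifiers).size().min())

print(k_of(df, ["Gender", "Age"]))  # -> 2
```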

k-Anonymity itself is a model for protecting privacy, not a method of anonymization. There are several anonymization algorithms that achieve k-Anonymity, such as Datafly, Mondrian, and Incognito. Check those keywords if you are interested.

l-Diversity

Actually, k-Anonymity has some drawbacks (as you may have noticed, the example above still contains a vulnerability). If an adversary is fairly certain that Bob (the first row) is in the dataset and knows he is in his mid-30s, the adversary can deduce that Bob probably has the flu, because everyone aged 30-39 in the dataset has the flu. The problem here is that k-Anonymity ignores the Disease values. You want the disease attribute not to be generalized, but at the same time not to be revealed. This type of information is called a sensitive attribute.

l-Diversity was proposed to address this kind of sensitive-attribute disclosure, for use in conjunction with k-Anonymity.

Definition:

A q*-block is l-diverse if it contains at least l "well-represented" values for the sensitive attribute S. A table is l-diverse if every q*-block is l-diverse.

A q*-block is a set of tuples in the dataset whose QI attribute values are identical. The k-anonymized example above has two q*-blocks: one comprises the first and second tuples, the other the third and fourth.

All rows in the first q*-block share the same disease value, flu, so this q*-block is l-diverse only for l = 1. The second q*-block has two different disease values, flu and norovirus, so it is l-diverse for l = 2. If you want the whole table to be l-diverse with l = 2, you would have to suppress the first two rows entirely, like below:

| Name | Gender | Age   | Disease   |
|------|--------|-------|-----------|
| *    | *      | *     | *         |
| *    | *      | *     | *         |
| *    | male   | 40-49 | norovirus |
| *    | male   | 40-49 | flu       |
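
Following the same pattern, here is a hedged sketch of an l-Diversity check, using the simplest "distinct values" reading of "well-represented" (entropy and recursive variants also exist). The function name l_of is only illustrative.

```python
import pandas as pd

# The k = 2 anonymized table from the k-Anonymity section
df = pd.DataFrame({
    "Gender": ["*", "*", "male", "male"],
    "Age": ["30-39", "30-39", "40-49", "40-49"],
    "Disease": ["flu", "flu", "norovirus", "flu"],
})

def l_of(table: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> int:
    """l is the smallest number of distinct sensitive values found in any q*-block."""
    return int(table.groupby(quasi_identifiers)[sensitive].nunique().min())

print(l_of(df, ["Gender", "Age"], "Disease"))  # -> 1: the all-flu block limits l
```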

t-Closeness

l-Diversity still has some problems. Even if you process a dataset to satisfy given k-Anonymity and l-Diversity targets, the distribution of the sensitive attribute within a q*-block can end up badly skewed compared with the dataset as a whole.

Suppose a dataset has exactly the same columns as the example before but contains 1,000 records, 99% of them flu and 1% norovirus. One might not mind being known to have been diagnosed with the flu, but one may well mind being suspected of having norovirus, since there are only ten norovirus records. In this case, l-Diversity with l = 2 is hard to achieve. And even if you do satisfy it, the dataset may become skewed in the process. Think of a q*-block consisting of five records with flu and five records with norovirus. That q*-block has a 5:5 (flu:norovirus) distribution of the disease attribute, which is far from the distribution in the overall table (i.e., 99:1).

t-Closeness was proposed to resolve this skewed-distribution issue.

Definition:

An equivalence class is said to have t-closeness if the distance between the distribution of a sensitive attribute in this class and the distribution of the attribute in the whole table is no more than a threshold t. A table is said to have t-closeness if all equivalence classes have t-closeness.

The definition of an equivalence class is the same as that of a q*-block. There are several ways to measure the distance between the two distributions; the paper in which t-Closeness was proposed uses the Earth Mover's Distance (EMD) metric.
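
As a rough illustration of the idea (not the paper's full EMD machinery), here is a sketch for a categorical sensitive attribute with equal ground distances between values, where the EMD reduces to half of the L1 difference between a block's distribution and the table-wide one. The function name t_of is made up for this example.

```python
import pandas as pd

def t_of(table: pd.DataFrame, quasi_identifiers: list[str], sensitive: str) -> float:
    """Largest distance between a q*-block's sensitive-value distribution
    and the whole-table distribution (categorical case: half the L1 distance)."""
    overall = table[sensitive].value_counts(normalize=True)
    worst = 0.0
    for _, block in table.groupby(quasi_identifiers):
        dist = block[sensitive].value_counts(normalize=True)
        # Align the block's distribution with the table-wide one before differencing
        diff = dist.reindex(overall.index, fill_value=0.0) - overall
        worst = max(worst, 0.5 * float(diff.abs().sum()))
    return worst

# The k = 2 table from the k-Anonymity section
df = pd.DataFrame({
    "Gender": ["*", "*", "male", "male"],
    "Age": ["30-39", "30-39", "40-49", "40-49"],
    "Disease": ["flu", "flu", "norovirus", "flu"],
})
print(t_of(df, ["Gender", "Age"], "Disease"))  # -> 0.25
```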

Some attribute characteristics make l-Diversity a better fit and others make t-Closeness a better fit. You should carefully decide which columns each model is applied to, considering the data distribution and the range of possible values.

Beyond k-Anonymity, l-Diversity, t-Closeness

k-Anonymity and its extensions express the extent to which datasets are anonymized in a reasonably quantitative way, but they must be tuned to each dataset to strike a balance between acceptable data loss and privacy gain.

Differential privacy is a different approach to anonymization. It defines a limit on the amount of privacy loss you can accept from data processing and, compared with the k-Anonymity family, focuses on the process rather than the resulting dataset.

I won't go into the details of differential privacy in this article, but it is regarded as a state-of-the-art approach to anonymization these days.

References

  1. L. Sweeney. k-Anonymity: A Model for Protecting Privacy. International Journal on Uncertainty, Fuzziness and Knowledge-Based Systems, 10(5), 2002, pp. 557-570.
  2. A. Machanavajjhala, J. Gehrke, D. Kifer, M. Venkitasubramaniam. l-Diversity: Privacy Beyond k-Anonymity. In Proceedings of ICDE 2006, p. 24.
  3. N. Li, T. Li, S. Venkatasubramanian. t-Closeness: Privacy Beyond k-Anonymity and l-Diversity. In Proceedings of ICDE 2007, IEEE Computer Society, pp. 106-115.
  4. C. Dwork. Differential Privacy. In Automata, Languages and Programming (ICALP 2006), Lecture Notes in Computer Science, vol. 4052, Springer, Berlin, Heidelberg, 2006.