9.1. K-anonymity

K-anonymity is a data anonymization technique that protects the identities of individuals in a dataset. Its goal is to ensure that each record is indistinguishable from at least k - 1 other records with respect to a set of quasi-identifier attributes, so that every combination of quasi-identifier values is shared by at least k records. Quasi-identifiers are attributes that do not identify an individual on their own but can do so when combined, such as age, ZIP code, and gender.
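The definition above can be checked mechanically by counting how many records share each combination of quasi-identifier values. The following is a minimal sketch using pandas; the helper name `is_k_anonymous` and the toy columns `age`, `zip`, and `disease` are assumptions made for illustration and are not part of AIJack.

```python
import pandas as pd


def is_k_anonymous(df: pd.DataFrame, quasi_identifiers: list[str], k: int) -> bool:
    """Return True if every quasi-identifier combination occurs at least k times."""
    # Each group is one equivalence class: records with identical quasi-identifiers.
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool((group_sizes >= k).all())


# Hypothetical toy table: "age" and "zip" act as quasi-identifiers.
records = pd.DataFrame(
    {
        "age": [30, 30, 41, 41, 41],
        "zip": ["123", "123", "456", "456", "789"],
        "disease": ["flu", "cold", "flu", "flu", "cold"],
    }
)

# The record with age 41 and zip "789" is unique, so the table is not 2-anonymous.
print(is_k_anonymous(records, ["age", "zip"], k=2))  # False
```

Dropping the last record, or generalizing its `zip` value so that it matches another group, would make the same check pass for k = 2.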

To achieve K-anonymity, the values of the quasi-identifiers are generalized or suppressed so that the records fall into groups of at least k whose quasi-identifier values are identical. An attacker who knows a target's quasi-identifiers can then narrow the anonymized dataset down only to a group of at least k candidate records, not to a single individual's record.
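As an illustration of generalization, the sketch below replaces the quasi-identifiers of each group with one shared value: the group mean for continuous columns and a joined label for categorical ones. This mirrors the shape of the anonymized output shown in this section, but it is an assumption about one reasonable strategy, not a description of AIJack's internals. The helper `generalize_groups`, the group assignment, and the toy columns are hypothetical, and suppression (dropping records or values) is not shown.

```python
import pandas as pd


def generalize_groups(df, group_ids, quasi_identifiers, is_continuous):
    """Give all records in a group the same quasi-identifier values."""
    out = df.copy()
    groups = pd.Series(list(group_ids), index=df.index)
    for column in quasi_identifiers:
        if is_continuous[column]:
            # Continuous values are replaced by the mean of their group.
            out[column] = df[column].groupby(groups).transform("mean")
        else:
            # Categorical values are replaced by the sorted distinct values, joined by "_".
            labels = df[column].groupby(groups).agg(
                lambda s: "_".join(sorted(s.unique()))
            )
            out[column] = groups.map(labels)
    return out


# Hypothetical toy table; the group assignment is supplied by hand here.
people = pd.DataFrame(
    {
        "age": [25, 29, 40, 44],
        "city": ["Kyoto", "Osaka", "Tokyo", "Tokyo"],
        "disease": ["flu", "cold", "flu", "cold"],
    }
)
generalized = generalize_groups(
    people,
    group_ids=[0, 0, 1, 1],
    quasi_identifiers=["age", "city"],
    is_continuous={"age": True, "city": False},
)
print(generalized)
```

After this step the first two records share the values 27.0 and "Kyoto_Osaka", and the last two share 42.0 and "Tokyo", so each group contains two records with identical quasi-identifiers while the sensitive column is left unchanged.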

AIJack supports the Mondrian algorithm, a greedy partitioning method that efficiently anonymizes tabular data to satisfy K-anonymity.

import pandas as pd

from aijack.defense.kanonymity import Mondrian

# This test code is based on https://github.com/glassonion1/anonypy

data = [
    [6, "1", "test1", "x", 20],
    [6, "1", "test1", "x", 30],
    [8, "2", "test2", "x", 50],
    [8, "2", "test3", "w", 45],
    [8, "1", "test2", "y", 35],
    [4, "2", "test3", "y", 20],
    [4, "1", "test3", "y", 20],
    [2, "1", "test3", "z", 22],
    [2, "2", "test3", "y", 32],
]

columns = ["col1", "col2", "col3", "col4", "col5"]
feature_columns = ["col1", "col2", "col3"]
is_continuous_map = {
    "col1": True,
    "col2": False,
    "col3": False,
    "col4": False,
    "col5": True,
}
sensitive_column = "col4"

df = pd.DataFrame(data=data, columns=columns)
df

	col1	col2	col3	col4	col5
0	6	1	test1	x	20
1	6	1	test1	x	30
2	8	2	test2	x	50
3	8	2	test3	w	45
4	8	1	test2	y	35
5	4	2	test3	y	20
6	4	1	test3	y	20
7	2	1	test3	z	22
8	2	2	test3	y	32

mondrian = Mondrian(k=2)
adf_ignore_unused_features = mondrian.anonymize(
    df, feature_columns, sensitive_column, is_continuous_map
)

adf_ignore_unused_features

	col1	col2	col3	col4
0	3.000000	1	test3	z
1	3.000000	1	test3	y
2	3.000000	2	test3	y
3	3.000000	2	test3	y
4	6.666667	1	test1_test2	x
5	6.666667	1	test1_test2	x
6	6.666667	1	test1_test2	y
7	8.000000	2	test2_test3	x
8	8.000000	2	test2_test3	w
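For intuition about how a Mondrian-style algorithm arrives at groups like those above, the following sketch recursively splits the records at the median of a quasi-identifier and accepts a split only when both halves keep at least k records. It is a simplified illustration under stated assumptions, not AIJack's implementation: it handles numeric quasi-identifiers only, compares raw (unnormalized) value ranges, and expects a unique index. The function name `mondrian_partitions` and the toy columns are hypothetical.

```python
import pandas as pd


def mondrian_partitions(df, quasi_identifiers, k):
    """Split rows into partitions of at least k rows using median cuts."""
    finished = []
    pending = [df.index]
    while pending:
        part = pending.pop()
        rows = df.loc[part, quasi_identifiers]
        # Try columns from the widest to the narrowest value range.
        spans = (rows.max() - rows.min()).sort_values(ascending=False)
        for column in spans.index:
            median = rows[column].median()
            left = rows.index[rows[column] <= median]
            right = rows.index[rows[column] > median]
            # A cut is allowed only if both sides still hold at least k rows.
            if len(left) >= k and len(right) >= k:
                pending.extend([left, right])
                break
        else:
            # No allowable cut exists, so this partition is final.
            finished.append(part)
    return finished


# Hypothetical toy table with two numeric quasi-identifiers.
table = pd.DataFrame(
    {
        "age": [21, 23, 34, 36, 52, 58],
        "zip": [100, 100, 101, 101, 102, 102],
    }
)
partitions = mondrian_partitions(table, ["age", "zip"], k=2)
for part in partitions:
    print(list(part))
```

Each returned partition can then be generalized so that its records share the same quasi-identifier values. Because a partition is only split when both halves have at least k rows, every final partition satisfies the size requirement as long as the input has at least k rows.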