9.1. K-anonymity
K-anonymity is a privacy-preserving technique used in data anonymization to protect the identities of individuals in a dataset. Its goal is to ensure that each record in the dataset is indistinguishable from at least k-1 other records with respect to a set of quasi-identifier attributes, so that every record belongs to a group of at least k records. Quasi-identifiers are attributes that do not identify anyone on their own but, when combined, could lead to the identification of an individual.
To achieve K-anonymity, the values of the quasi-identifiers are generalized (replaced with coarser values) or suppressed (removed) until every combination of quasi-identifier values that appears in the dataset is shared by at least k records. An attacker who knows a target's quasi-identifiers can then narrow the target down to a group of at least k candidate records, but cannot pinpoint which record in the anonymized dataset belongs to that individual.
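The definition above can be checked directly: group the records by their quasi-identifiers and confirm that the smallest group has at least k members. The following is a minimal sketch using pandas. The toy table, the `age`, `zip`, and `disease` column names, and the helper `is_k_anonymous` are illustrative assumptions and are not part of AIJack.

```python
import pandas as pd


def is_k_anonymous(df, quasi_identifiers, k):
    """Return True if every quasi-identifier combination occurs at least k times."""
    group_sizes = df.groupby(quasi_identifiers).size()
    return bool(group_sizes.min() >= k)


# Hypothetical toy data: age and zip are quasi-identifiers, disease is sensitive.
raw = pd.DataFrame(
    {
        "age": [23, 27, 35, 38],
        "zip": ["13053", "13068", "14850", "14853"],
        "disease": ["flu", "cold", "flu", "asthma"],
    }
)

# Generalize: bucket ages into decades and keep only the first three zip digits.
generalized = raw.copy()
generalized["age"] = (raw["age"] // 10 * 10).astype(str) + "s"
generalized["zip"] = raw["zip"].str[:3] + "**"

print(is_k_anonymous(raw, ["age", "zip"], k=2))          # False
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # True
```

In the raw table every record is unique on (`age`, `zip`), so it is only 1-anonymous. After generalization each combination is shared by two records, which satisfies k=2 at the cost of coarser values.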
AIJack supports the Mondrian algorithm, a greedy partitioning method that K-anonymizes tabular data.
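To give a feel for how Mondrian-style partitioning works, here is a simplified sketch for numerical attributes only. It is not AIJack's implementation: the function names `mondrian_partition` and `generalize` and the sample records are hypothetical, and it ranks attributes by raw value range, whereas the published algorithm uses normalized ranges. The idea is to recursively cut the data at the median of the widest attribute, as long as both sides keep at least k records.

```python
def mondrian_partition(records, k):
    """Recursively split records (tuples of numbers) into groups of size >= k."""
    n_dims = len(records[0])
    # Value range of each attribute within the current group.
    spans = [
        max(r[d] for r in records) - min(r[d] for r in records)
        for d in range(n_dims)
    ]
    # Try the attributes from widest to narrowest range.
    for dim in sorted(range(n_dims), key=lambda d: -spans[d]):
        ordered = sorted(records, key=lambda r: r[dim])
        median = ordered[len(ordered) // 2][dim]
        left = [r for r in ordered if r[dim] < median]
        right = [r for r in ordered if r[dim] >= median]
        # A cut is allowed only if both sides still hold at least k records.
        if len(left) >= k and len(right) >= k:
            return mondrian_partition(left, k) + mondrian_partition(right, k)
    # No allowable cut: this group becomes one equivalence class.
    return [records]


def generalize(partition):
    """Summarize each attribute by its (min, max) range within the partition."""
    n_dims = len(partition[0])
    return tuple(
        (min(r[d] for r in partition), max(r[d] for r in partition))
        for d in range(n_dims)
    )


# Illustrative records with two numerical quasi-identifiers.
records = [
    (6, 20), (6, 30), (8, 50), (8, 45), (8, 35),
    (4, 20), (4, 20), (2, 22), (2, 32),
]
partitions = mondrian_partition(records, k=2)
for part in partitions:
    print(len(part), "records, generalized to", generalize(part))
```

Every resulting partition contains at least k records, so publishing the generalized ranges in place of the original values yields a k-anonymous table. Categorical attributes need a different splitting and generalization rule, which this sketch omits.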
import pandas as pd
from aijack.defense.kanonymity import Mondrian

# This test code is based on https://github.com/glassonion1/anonypy
data = [
    [6, "1", "test1", "x", 20],
    [6, "1", "test1", "x", 30],
    [8, "2", "test2", "x", 50],
    [8, "2", "test3", "w", 45],
    [8, "1", "test2", "y", 35],
    [4, "2", "test3", "y", 20],
    [4, "1", "test3", "y", 20],
    [2, "1", "test3", "z", 22],
    [2, "2", "test3", "y", 32],
]
columns = ["col1", "col2", "col3", "col4", "col5"]

# Quasi-identifier columns to be generalized.
feature_columns = ["col1", "col2", "col3"]

# True marks a numerical (continuous) column, False a categorical one.
is_continuous_map = {
    "col1": True,
    "col2": False,
    "col3": False,
    "col4": False,
    "col5": True,
}

# Sensitive attribute; its values appear unchanged in the output shown below.
sensitive_column = "col4"

df = pd.DataFrame(data=data, columns=columns)
df
| | col1 | col2 | col3 | col4 | col5 |
|---|---|---|---|---|---|
| 0 | 6 | 1 | test1 | x | 20 |
| 1 | 6 | 1 | test1 | x | 30 |
| 2 | 8 | 2 | test2 | x | 50 |
| 3 | 8 | 2 | test3 | w | 45 |
| 4 | 8 | 1 | test2 | y | 35 |
| 5 | 4 | 2 | test3 | y | 20 |
| 6 | 4 | 1 | test3 | y | 20 |
| 7 | 2 | 1 | test3 | z | 22 |
| 8 | 2 | 2 | test3 | y | 32 |
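# Anonymize with k=2: every combination of quasi-identifier values in the
# result should be shared by at least two rows.
mondrian = Mondrian(k=2)
# col5 is neither a feature nor the sensitive column, so it does not appear
# in the output shown below.
adf_ignore_unused_features = mondrian.anonymize(
    df, feature_columns, sensitive_column, is_continuous_map
)
adf_ignore_unused_features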
| | col1 | col2 | col3 | col4 |
|---|---|---|---|---|
| 0 | 3.000000 | 1 | test3 | z |
| 1 | 3.000000 | 1 | test3 | y |
| 2 | 3.000000 | 2 | test3 | y |
| 3 | 3.000000 | 2 | test3 | y |
| 4 | 6.666667 | 1 | test1_test2 | x |
| 5 | 6.666667 | 1 | test1_test2 | x |
| 6 | 6.666667 | 1 | test1_test2 | y |
| 7 | 8.000000 | 2 | test2_test3 | x |
| 8 | 8.000000 | 2 | test2_test3 | w |
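As a sanity check, the anonymized table above can be grouped by its quasi-identifier columns to confirm that no group is smaller than k=2. The sketch below retypes the values from the displayed output rather than calling AIJack, so it runs on its own; the variable names `adf` and `group_sizes` are illustrative.

```python
import pandas as pd

# Quasi-identifier columns copied from the anonymized output shown above.
adf = pd.DataFrame(
    {
        "col1": [3.0, 3.0, 3.0, 3.0, 6.666667, 6.666667, 6.666667, 8.0, 8.0],
        "col2": ["1", "1", "2", "2", "1", "1", "1", "2", "2"],
        "col3": ["test3"] * 4 + ["test1_test2"] * 3 + ["test2_test3"] * 2,
    }
)

# Count how many rows share each combination of quasi-identifier values.
group_sizes = adf.groupby(["col1", "col2", "col3"]).size()
print(group_sizes)
print("smallest group:", group_sizes.min())
```

The nine rows fall into four groups, and the smallest contains two rows, which matches the requested k=2.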