9.1. K-anonymity

K-anonymity is a privacy-preserving technique used in data anonymization to protect the identities of individuals in a dataset. Its goal is to ensure that each record is indistinguishable from at least k - 1 other records with respect to a set of quasi-identifier attributes, so that every combination of quasi-identifier values appearing in the dataset is shared by at least k records. Quasi-identifiers are attributes that do not identify an individual on their own but can do so when combined, for example age, postal code, and gender.

To achieve K-anonymity, the values of the quasi-identifiers are generalized (replaced with coarser values, such as a range instead of an exact number) or suppressed (removed) until every record shares its quasi-identifier values with at least k - 1 other records. An attacker who knows a target's quasi-identifiers can then narrow the search to a group of at least k records, but cannot single out that individual's record within the group.
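As a rough illustration of generalization, the sketch below uses pandas on a made-up table. The column names (`age`, `zip_code`, `diagnosis`), the values, and the helper `smallest_group_size` are assumptions for this example and are not part of AIJack. A table is k-anonymous for a given set of quasi-identifiers when the smallest group of rows sharing the same quasi-identifier values has at least k members.

```python
import pandas as pd


def smallest_group_size(df, quasi_identifiers):
    """Return the size of the smallest group of rows that share
    the same values in all quasi-identifier columns."""
    return int(df.groupby(quasi_identifiers).size().min())


# Hypothetical records: age and zip_code act as quasi-identifiers.
records = pd.DataFrame(
    {
        "age": [23, 27, 24, 36, 38, 31],
        "zip_code": ["13053", "13068", "13053", "14850", "14853", "14850"],
        "diagnosis": ["flu", "cold", "flu", "asthma", "flu", "cold"],
    }
)
quasi_identifiers = ["age", "zip_code"]

generalized = records.copy()
# Generalize age: replace the exact value with a decade-wide range.
decade = generalized["age"] // 10 * 10
generalized["age"] = decade.astype(str) + "-" + (decade + 9).astype(str)
# Generalize zip_code: keep the first two digits, mask the rest.
generalized["zip_code"] = generalized["zip_code"].str[:2] + "***"

print(smallest_group_size(records, quasi_identifiers))      # 1
print(smallest_group_size(generalized, quasi_identifiers))  # 3
```

In the raw table every record is unique (smallest group of size 1), whereas the generalized table has a smallest group of size 3, so it satisfies K-anonymity for any k up to 3.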

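AIJack supports the Mondrian algorithm, which efficiently anonymizes tabular data by recursively partitioning the records into groups of at least k and generalizing the quasi-identifiers within each group.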
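To give an intuition for how Mondrian forms its groups, the following is a simplified sketch, not AIJack's implementation. It assumes numerical quasi-identifiers only, tries the column with the widest value range first, cuts at the median, and keeps a cut only when both sides retain at least k rows. The function names and the sample table are hypothetical.

```python
import pandas as pd


def mondrian_partitions(df, quasi_identifiers, k):
    """Recursively split df into partitions of at least k rows each
    (simplified: numerical quasi-identifiers only)."""
    # Value range of each quasi-identifier within this partition.
    spans = {c: df[c].max() - df[c].min() for c in quasi_identifiers}
    # Try the widest column first, then fall back to narrower ones.
    for col in sorted(quasi_identifiers, key=lambda c: spans[c], reverse=True):
        median = df[col].median()
        left = df[df[col] < median]
        right = df[df[col] >= median]
        # A cut is allowed only if both halves still hold at least k rows.
        if len(left) >= k and len(right) >= k:
            return mondrian_partitions(
                left, quasi_identifiers, k
            ) + mondrian_partitions(right, quasi_identifiers, k)
    # No allowable cut: this partition is final.
    return [df]


def anonymize(df, quasi_identifiers, k):
    """Replace each quasi-identifier by its mean within the partition."""
    out = df.copy()
    out[quasi_identifiers] = out[quasi_identifiers].astype(float)
    for part in mondrian_partitions(df, quasi_identifiers, k):
        for col in quasi_identifiers:
            out.loc[part.index, col] = part[col].mean()
    return out


# Hypothetical table with two numerical quasi-identifiers.
people = pd.DataFrame(
    {
        "age": [25, 27, 31, 35, 42, 47, 51, 58],
        "zip_code": [13053, 13068, 13053, 13068, 14850, 14853, 14850, 14853],
    }
)
print(anonymize(people, ["age", "zip_code"], k=2))
```

Each resulting partition has at least k rows, and replacing the quasi-identifiers in a partition with a shared summary (here the mean) makes its rows indistinguishable from one another. AIJack's `Mondrian` class, used next, also handles categorical columns.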

import pandas as pd

from aijack.defense.kanonymity import Mondrian
# This test code is based on https://github.com/glassonion1/anonypy

data = [
    [6, "1", "test1", "x", 20],
    [6, "1", "test1", "x", 30],
    [8, "2", "test2", "x", 50],
    [8, "2", "test3", "w", 45],
    [8, "1", "test2", "y", 35],
    [4, "2", "test3", "y", 20],
    [4, "1", "test3", "y", 20],
    [2, "1", "test3", "z", 22],
    [2, "2", "test3", "y", 32],
]

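columns = ["col1", "col2", "col3", "col4", "col5"]
# Quasi-identifier columns that will be generalized
feature_columns = ["col1", "col2", "col3"]
# True marks a numerical column, False a categorical one
is_continuous_map = {
    "col1": True,
    "col2": False,
    "col3": False,
    "col4": False,
    "col5": True,
}
# Sensitive attribute, kept alongside the generalized quasi-identifiers
sensitive_column = "col4"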

df = pd.DataFrame(data=data, columns=columns)
df
   col1  col2   col3  col4  col5
0     6     1  test1     x    20
1     6     1  test1     x    30
2     8     2  test2     x    50
3     8     2  test3     w    45
4     8     1  test2     y    35
5     4     2  test3     y    20
6     4     1  test3     y    20
7     2     1  test3     z    22
8     2     2  test3     y    32
mondrian = Mondrian(k=2)
adf_ignore_unused_features = mondrian.anonymize(
    df, feature_columns, sensitive_column, is_continuous_map
)
adf_ignore_unused_features
       col1  col2         col3  col4
0  3.000000     1        test3     z
1  3.000000     1        test3     y
2  3.000000     2        test3     y
3  3.000000     2        test3     y
4  6.666667     1  test1_test2     x
5  6.666667     1  test1_test2     x
6  6.666667     1  test1_test2     y
7  8.000000     2  test2_test3     x
8  8.000000     2  test2_test3     w
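As a quick sanity check, the result can be grouped by the quasi-identifier columns to confirm that every group has at least k = 2 rows. The sketch below re-enters the anonymized table by hand from the output shown above (so `col1` carries the rounded value 6.666667) rather than calling AIJack; the same `groupby` call should apply to the real `adf_ignore_unused_features` DataFrame.

```python
import pandas as pd

# Anonymized table transcribed from the output above (not recomputed).
anonymized = pd.DataFrame(
    {
        "col1": [3.0] * 4 + [6.666667] * 3 + [8.0] * 2,
        "col2": ["1", "1", "2", "2", "1", "1", "1", "2", "2"],
        "col3": ["test3"] * 4 + ["test1_test2"] * 3 + ["test2_test3"] * 2,
        "col4": ["z", "y", "y", "y", "x", "x", "y", "x", "w"],
    }
)

# Count how many rows share each combination of quasi-identifier values.
group_sizes = anonymized.groupby(["col1", "col2", "col3"]).size()
print(group_sizes)
print("smallest group:", group_sizes.min())
```

The four groups contain 2, 2, 3, and 2 rows, so the table is 2-anonymous with respect to `col1`, `col2`, and `col3`.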