{
"cells": [
{
"cell_type": "markdown",
"id": "525fc252-192d-41e0-9357-c3cdd61e7540",
"metadata": {},
"source": [
"# K-anonymity\n",
"\n",
"K-anonymity is a privacy-preserving technique used in data anonymization to protect the identities of individuals in a dataset. Its goal is to ensure that each record is indistinguishable from at least k-1 other records with respect to a set of quasi-identifier attributes, so that every combination of quasi-identifier values appears in at least k records. Quasi-identifiers are attributes that do not identify an individual on their own but can do so when combined, such as ZIP code, birth date, and gender.\n",
"\n",
"To achieve K-anonymity, the values of the quasi-identifiers are generalized (replaced with coarser values such as ranges or sets) or suppressed (removed) until every record shares its quasi-identifier values with at least k-1 other records. An attacker who knows a target's quasi-identifiers can then narrow the target down only to a group of at least k records, not to a single record.\n",
"\n",
"AIJack supports the [Mondrian](https://ieeexplore.ieee.org/document/1617393) algorithm, a greedy method that recursively partitions the records along quasi-identifier values and generalizes each resulting group of at least k records."
]
},
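{
"cell_type": "markdown",
"id": "kanonymity-check-sketch",
"metadata": {},
"source": [
"The definition above can be checked directly: group the rows by the quasi-identifier columns and confirm that the smallest group has at least k rows. The snippet below is a minimal sketch of that check using plain pandas. The helper `is_k_anonymous` and the toy table are hypothetical illustrations and are not part of AIJack.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def is_k_anonymous(df, quasi_identifiers, k):\n",
"    # Hypothetical helper: count the rows that share each combination of\n",
"    # quasi-identifier values and require the smallest count to reach k.\n",
"    group_sizes = df.groupby(quasi_identifiers).size()\n",
"    return bool(group_sizes.min() >= k)\n",
"\n",
"\n",
"# Toy table (made up for illustration): age and zip are already generalized.\n",
"toy = pd.DataFrame(\n",
"    {\n",
"        \"age\": [\"20-29\", \"20-29\", \"30-39\", \"30-39\", \"30-39\"],\n",
"        \"zip\": [\"130**\", \"130**\", \"148**\", \"148**\", \"148**\"],\n",
"        \"disease\": [\"flu\", \"cold\", \"flu\", \"flu\", \"asthma\"],\n",
"    }\n",
")\n",
"\n",
"print(is_k_anonymous(toy, [\"age\", \"zip\"], k=2))  # True: groups have 2 and 3 rows\n",
"print(is_k_anonymous(toy, [\"age\", \"zip\"], k=3))  # False: one group has only 2 rows\n",
"```\n",
"\n",
"The same check can be applied to any anonymized table by passing its quasi-identifier columns."
]
},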
{
"cell_type": "code",
"execution_count": null,
"id": "bc4b336e-75c4-4fac-8b16-9e7af35fa233",
"metadata": {
"tags": []
},
"outputs": [],
"source": [
"import pandas as pd\n",
"\n",
"from aijack.defense.kanonymity import Mondrian"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "4939daa5-c27b-49e0-8243-f3f4c89d73bb",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>col1</th><th>col2</th><th>col3</th><th>col4</th><th>col5</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>6</td><td>1</td><td>test1</td><td>x</td><td>20</td></tr>\n",
"    <tr><th>1</th><td>6</td><td>1</td><td>test1</td><td>x</td><td>30</td></tr>\n",
"    <tr><th>2</th><td>8</td><td>2</td><td>test2</td><td>x</td><td>50</td></tr>\n",
"    <tr><th>3</th><td>8</td><td>2</td><td>test3</td><td>w</td><td>45</td></tr>\n",
"    <tr><th>4</th><td>8</td><td>1</td><td>test2</td><td>y</td><td>35</td></tr>\n",
"    <tr><th>5</th><td>4</td><td>2</td><td>test3</td><td>y</td><td>20</td></tr>\n",
"    <tr><th>6</th><td>4</td><td>1</td><td>test3</td><td>y</td><td>20</td></tr>\n",
"    <tr><th>7</th><td>2</td><td>1</td><td>test3</td><td>z</td><td>22</td></tr>\n",
"    <tr><th>8</th><td>2</td><td>2</td><td>test3</td><td>y</td><td>32</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" col1 col2 col3 col4 col5\n",
"0 6 1 test1 x 20\n",
"1 6 1 test1 x 30\n",
"2 8 2 test2 x 50\n",
"3 8 2 test3 w 45\n",
"4 8 1 test2 y 35\n",
"5 4 2 test3 y 20\n",
"6 4 1 test3 y 20\n",
"7 2 1 test3 z 22\n",
"8 2 2 test3 y 32"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This test code is based on https://github.com/glassonion1/anonypy\n",
"\n",
"data = [\n",
"    [6, \"1\", \"test1\", \"x\", 20],\n",
"    [6, \"1\", \"test1\", \"x\", 30],\n",
"    [8, \"2\", \"test2\", \"x\", 50],\n",
"    [8, \"2\", \"test3\", \"w\", 45],\n",
"    [8, \"1\", \"test2\", \"y\", 35],\n",
"    [4, \"2\", \"test3\", \"y\", 20],\n",
"    [4, \"1\", \"test3\", \"y\", 20],\n",
"    [2, \"1\", \"test3\", \"z\", 22],\n",
"    [2, \"2\", \"test3\", \"y\", 32],\n",
"]\n",
"\n",
"columns = [\"col1\", \"col2\", \"col3\", \"col4\", \"col5\"]\n",
"# Quasi-identifier columns that Mondrian generalizes\n",
"feature_columns = [\"col1\", \"col2\", \"col3\"]\n",
"# True marks a continuous (numeric) column, False a categorical one\n",
"is_continuous_map = {\n",
"    \"col1\": True,\n",
"    \"col2\": False,\n",
"    \"col3\": False,\n",
"    \"col4\": False,\n",
"    \"col5\": True,\n",
"}\n",
"# Sensitive attribute; its values are left unchanged\n",
"sensitive_column = \"col4\"\n",
"\n",
"df = pd.DataFrame(data=data, columns=columns)\n",
"df"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "99d45579-22fb-4e38-bd81-bc4b7747c951",
"metadata": {
"tags": []
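},
"outputs": [],
"source": [
"# Require every quasi-identifier combination to be shared by at least k=2 rows.\n",
"# col5 is neither a feature nor the sensitive column, so it does not appear in the result.\n",
"mondrian = Mondrian(k=2)\n",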
"adf_ignore_unused_features = mondrian.anonymize(\n",
" df, feature_columns, sensitive_column, is_continuous_map\n",
")"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "dad46817-01e4-41e4-b306-1ae5d094302f",
"metadata": {
"tags": []
},
"outputs": [
{
"data": {
"text/html": [
"<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n",
"    <tr><th></th><th>col1</th><th>col2</th><th>col3</th><th>col4</th></tr>\n",
"  </thead>\n",
"  <tbody>\n",
"    <tr><th>0</th><td>3.000000</td><td>1</td><td>test3</td><td>z</td></tr>\n",
"    <tr><th>1</th><td>3.000000</td><td>1</td><td>test3</td><td>y</td></tr>\n",
"    <tr><th>2</th><td>3.000000</td><td>2</td><td>test3</td><td>y</td></tr>\n",
"    <tr><th>3</th><td>3.000000</td><td>2</td><td>test3</td><td>y</td></tr>\n",
"    <tr><th>4</th><td>6.666667</td><td>1</td><td>test1_test2</td><td>x</td></tr>\n",
"    <tr><th>5</th><td>6.666667</td><td>1</td><td>test1_test2</td><td>x</td></tr>\n",
"    <tr><th>6</th><td>6.666667</td><td>1</td><td>test1_test2</td><td>y</td></tr>\n",
"    <tr><th>7</th><td>8.000000</td><td>2</td><td>test2_test3</td><td>x</td></tr>\n",
"    <tr><th>8</th><td>8.000000</td><td>2</td><td>test2_test3</td><td>w</td></tr>\n",
"  </tbody>\n",
"</table>"
],
"text/plain": [
" col1 col2 col3 col4\n",
"0 3.000000 1 test3 z\n",
"1 3.000000 1 test3 y\n",
"2 3.000000 2 test3 y\n",
"3 3.000000 2 test3 y\n",
"4 6.666667 1 test1_test2 x\n",
"5 6.666667 1 test1_test2 x\n",
"6 6.666667 1 test1_test2 y\n",
"7 8.000000 2 test2_test3 x\n",
"8 8.000000 2 test2_test3 w"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"adf_ignore_unused_features"
]
},
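{
"cell_type": "markdown",
"id": "partition-summary-sketch",
"metadata": {},
"source": [
"The anonymized table is consistent with each partition being summarized column by column: continuous columns by their mean and categorical columns by joining the distinct values with `_`. The snippet below is a minimal sketch of that summarization for one partition, the three original rows whose `col2` is 1 and whose `col1` is 6, 6, and 8. The helper `summarize_partition` is a hypothetical illustration inferred from the output above, not AIJack's actual implementation.\n",
"\n",
"```python\n",
"import pandas as pd\n",
"\n",
"\n",
"def summarize_partition(partition, feature_columns, is_continuous_map):\n",
"    # Hypothetical helper: give every row in the partition the same\n",
"    # quasi-identifier values so the rows become indistinguishable.\n",
"    summarized = partition.copy()\n",
"    for col in feature_columns:\n",
"        if is_continuous_map[col]:\n",
"            # Continuous column: replace each value with the partition mean.\n",
"            summarized[col] = partition[col].mean()\n",
"        else:\n",
"            # Categorical column: join the sorted distinct values with \"_\".\n",
"            summarized[col] = \"_\".join(sorted(partition[col].unique()))\n",
"    return summarized\n",
"\n",
"\n",
"# Rows 0, 1, and 4 of the original table (col5 omitted).\n",
"partition = pd.DataFrame(\n",
"    [[6, \"1\", \"test1\", \"x\"], [6, \"1\", \"test1\", \"x\"], [8, \"1\", \"test2\", \"y\"]],\n",
"    columns=[\"col1\", \"col2\", \"col3\", \"col4\"],\n",
")\n",
"result = summarize_partition(\n",
"    partition,\n",
"    feature_columns=[\"col1\", \"col2\", \"col3\"],\n",
"    is_continuous_map={\"col1\": True, \"col2\": False, \"col3\": False},\n",
")\n",
"print(result)\n",
"```\n",
"\n",
"All three rows end up with `col1` equal to 20/3 (about 6.666667), `col2` equal to `1`, and `col3` equal to `test1_test2`, which matches rows 4 to 6 of the anonymized table, while `col4` keeps its original values."
]
},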
{
"cell_type": "code",
"execution_count": null,
"id": "552bfa42-be85-442d-9568-992d08d5b919",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}