One of the biggest concerns with releasing a dataset is the risk that an attacker can identify the owners of particular records. Masking or removing unique identifiers, like names and Social Security Numbers, can reduce that risk substantially, but it may still not be enough. Harvard professor Latanya Sweeney reported that 87% of the U.S. population can be identified using only their gender, date of birth, and 5-digit ZIP code [1].
To prevent privacy breaches and comply with the Health Insurance Portability and Accountability Act (HIPAA), you must also de-identify such “quasi-identifiers” (along with “key identifiers” like name and SSN) to the point where the risk of row re-identification is statistically moot [2]. IRI data masking software is designed to do that, but how do you assess the results? This article shows how to credibly score the re-identification risk of data masked with IRI FieldShield, CellShield, or Voracity using the free ARX risk analysis application. Subsequent articles on API-level integration and re-masking recommendations are planned.
Creating Project & Loading Data
Before any meaningful analysis can be performed, an ARX project must be created and loaded with data. In our case, we are analyzing the re-ID risk of a delimited file masked by FieldShield. To create an ARX project, select File -> New Project… from its dropdown menu. A wizard will prompt you for a name and a description. Once you create the project, load the data by selecting File -> Import Data… and following the wizard’s instructions to build a table view.
Quasi-identifiers are a subset of attributes which can be used to identify a record in a dataset. ARX algorithms quantify the risk associated with different combinations of every attribute. To use the quasi-identifier utility, navigate to the “Analyze risk” tab above the data display table.
Select the “Quasi-identifiers” tab to display a list of all attribute subsets which ARX suspects to be quasi-identifiers. By default, ARX uses every attribute that it finds within the dataset to generate the subsets. However, you can manually toggle which attributes are included by using the checklist at the bottom of the screen.
Once you have selected the attributes you wish to analyze, ARX shows the quasi-identifiers alongside two parameters: “Distinction” and “Separation.”
Distinction represents the ratio between the number of distinct value combinations for the quasi-identifiers and the total number of records. Separation represents the ratio between the number of record pairs that differ in at least one quasi-identifier value and the total number of ways that two different records can be paired [3]. In general, higher distinction and separation values indicate that the quasi-identifiers are more likely to identify a record.
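The two ratios can be reproduced outside of ARX as a sanity check. The following is a minimal sketch based only on the definitions above, not on ARX's own code; the function name `distinction_and_separation`, the sample rows, and the column names are hypothetical.

```python
from collections import Counter
from math import comb


def distinction_and_separation(rows, quasi_identifiers):
    """Return (distinction, separation) as fractions between 0 and 1.

    Assumes at least two rows, so that at least one pair of records exists.
    """
    # Reduce every record to its combination of quasi-identifier values.
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    n = len(keys)
    groups = Counter(keys)

    # Distinction: distinct value combinations / total records.
    distinction = len(groups) / n

    # Separation: pairs that differ in at least one value / all possible pairs.
    total_pairs = comb(n, 2)
    identical_pairs = sum(comb(size, 2) for size in groups.values())
    separation = (total_pairs - identical_pairs) / total_pairs
    return distinction, separation


# Hypothetical masked records, for illustration only.
rows = [
    {"gender": "F", "age": 34, "zip": "32901"},
    {"gender": "F", "age": 34, "zip": "32901"},
    {"gender": "M", "age": 51, "zip": "32901"},
    {"gender": "M", "age": 29, "zip": "32940"},
]

d, s = distinction_and_separation(rows, ["gender", "age", "zip"])
print(round(d, 3), round(s, 3))  # 0.75 0.833
```

In this toy table, gender alone yields a distinction of 0.5, while the three attributes together yield 0.75, which illustrates why combinations of attributes need to be examined rather than single columns.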
Once you have figured out which subset of attributes will serve as quasi-identifiers, you need to return to the dataset and label each attribute appropriately. Navigate to the top left tab labeled “Configure transformation” and left-click on one of the column names to bring up the “Data Transformation” tab on the right.
Every attribute can be labeled as “Insensitive,” “Sensitive,” “Quasi-Identifying,” or “Identifying” by modifying its type in the Data Transformation tab. Unique IDs, like names or Social Security Numbers, will necessarily have distinction and separation values of 100%, and should not be selected as quasi-identifiers, but rather as fully fledged (“key”) identifiers. Any information which you do not want linked to a person through attribute disclosure should be labeled as “Sensitive.” Sensitive information should only be marked as a quasi-identifier if another dataset contains the same information and can be used to perform joins that will reveal unique records based on the sensitive information.
Once you have tagged all of the attributes, you can perform a more accurate risk analysis in the “Analyze risk” tab. Select the “Re-identification risks” tab to bring up a GUI showing the number of records that are currently at risk of being re-identified with the given quasi-identifiers. The risk is calculated for 3 different attacker models:
- Prosecutor – the attacker is targeting a specific record using background knowledge of the target.
- Journalist – the attacker has no background knowledge on any person in the dataset, and is trying to randomly re-identify a record.
- Marketer – the attacker is only interested in re-identifying as many records as possible, regardless of whether the records are correctly re-identified.
The score for your dataset is influenced by 3 factors: the number of records at risk, the highest risk that a record will be re-identified, and the success rate for re-identifying a randomly selected record.
On the bottom right of this pane, you can modify the risk threshold by double-clicking or dragging the “highest risk” widget (circled in red in the screenshot). The risk threshold represents the highest acceptable risk of a record being re-identified. Any record which has a higher risk of re-identification than the risk threshold will be counted towards the number of records at risk. HIPAA does not define a specific risk threshold, although in practice it has often been set at 20% [4].
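To make the three measures and the threshold concrete, here is a minimal sketch of a prosecutor-style calculation. It assumes the common convention that a record's risk is 1 divided by the number of records sharing its quasi-identifier values, and that the success rate is the average of those risks; whether ARX computes its gauges in exactly this way is an assumption. The function name, sample rows, and column names are hypothetical.

```python
from collections import Counter


def prosecutor_risk_summary(rows, quasi_identifiers, threshold=0.2):
    """Summarize re-identification risk under an assumed prosecutor model."""
    # Group records by their quasi-identifier values (equivalence classes).
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)

    # Assumed per-record risk: 1 / size of the record's equivalence class.
    risks = [1 / class_sizes[key] for key in keys]

    return {
        # Fraction of records whose risk exceeds the threshold.
        "records_at_risk": sum(risk > threshold for risk in risks) / len(risks),
        # Risk of the most exposed record (1.0 means a unique record).
        "highest_risk": max(risks),
        # Average risk, read here as the success rate for a random record.
        "success_rate": sum(risks) / len(risks),
    }


# Hypothetical masked output: classes of 6, 3, and 1 records.
rows = (
    [{"age_band": "30-39", "zip3": "329"}] * 6
    + [{"age_band": "40-49", "zip3": "329"}] * 3
    + [{"age_band": "80+", "zip3": "327"}]
)

summary = prosecutor_risk_summary(rows, ["age_band", "zip3"], threshold=0.2)
print({name: round(value, 2) for name, value in summary.items()})
```

With the 20% threshold, four of the ten sample records count as at risk: the three records in the class of size 3 (risk of about 33%) and the single unique record (risk of 100%). The six records in the largest class have a risk of about 17% and fall below the threshold.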
Thus, to ensure maximum safety, you should aim to keep your dataset “in the green” by reducing the number of records at risk, either by decreasing the sample size of your dataset or by applying more or different masking functions to reduce record uniqueness.
If the score for your dataset is not optimal, you can diagnose the extent of the problem in two further panes, which show the estimated number of unique records in your dataset and the distribution of records actually at risk.
Records with a unique combination of quasi-identifier values have the potential for 100% re-identification. Select the “Population uniques” tab at the bottom of the screen. The graph shows the estimated percentage of unique records within your dataset as a function of the fraction of the dataset selected. Each line represents a different statistical model used to approximate the number of unique records.
Distribution of Risk
The distribution of risk is a graph which shows the relationship between the records at risk and the fraction of records that are revealed from the dataset. The more records are suppressed (excluded from the dataset), the lower the risk. The risk threshold specified in the previous “Re-identification risks” tab will affect the number of records counted as at risk.
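One way to read such a distribution is as a table of risk levels and the share of records at or below each level. The sketch below builds that table under the same assumption as a prosecutor-style calculation (a record's risk is 1 divided by the size of its equivalence class); it is an illustration, not a reproduction of the ARX chart, and the function name and sample data are hypothetical.

```python
from collections import Counter


def risk_distribution(rows, quasi_identifiers):
    """Return (risk, share_of_records, cumulative_share) tuples, lowest risk first."""
    # Group records by their quasi-identifier values (equivalence classes).
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    class_sizes = Counter(keys)
    n = len(keys)

    # Count how many records sit in classes of each size.
    records_per_size = Counter(class_sizes[key] for key in keys)

    distribution = []
    cumulative = 0.0
    # Larger classes mean lower risk, so iterate from the largest size down.
    for size in sorted(records_per_size, reverse=True):
        share = records_per_size[size] / n
        cumulative += share
        distribution.append((1 / size, share, cumulative))
    return distribution


# Hypothetical masked output: classes of 6, 3, and 1 records.
rows = (
    [{"age_band": "30-39", "zip3": "329"}] * 6
    + [{"age_band": "40-49", "zip3": "329"}] * 3
    + [{"age_band": "80+", "zip3": "327"}]
)

for risk, share, cumulative in risk_distribution(rows, ["age_band", "zip3"]):
    print(f"risk <= {risk:.0%}: {share:.0%} of records, {cumulative:.0%} cumulative")
```

In this sample, suppressing the single unique record would lower the highest remaining risk from 100% to about 33%, which is the effect of suppression described above.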
ARX is a powerful tool for re-ID risk scoring, with several visual components which can also be used to help diagnose the source of the risk. In the near future, IRI will leverage the ARX API in IRI Workbench, and thus in FieldShield or Voracity job flows. Until then, we hope this article serves as a good tutorial on how to perform your own risk scoring on IRI-masked output.