The unauthorized use or misuse of personally identifiable information (PII), such as a name, social security number, date of birth, mother’s maiden name, or place of birth, can result in identity theft and other crimes related to impersonation, not to mention embarrassment, inconvenience, and expense. Organizations that collect or process PII without protecting it face serious legal and financial ramifications, which is why more of them are focusing on data risk mitigation.
The “Guide to Protecting the Confidentiality of Personally Identifiable Information” by Erika McCallister goes into great detail about protecting PII, and the processes of pseudonymization and de-identification.
According to Wikipedia, and a few other online sources, pseudonymization is the process of “removing the association between data and the subject of that data, and adding an association between the data and an alternative identifier.” It is the process of “depersonalizing” the data so that any identifying fields within a record are replaced by one or more artificial identifiers.
In other words, personal data is removed from a database record or file and replaced with a pseudonym (a pseudo-name, or fake name) to protect the real one. The replacement values are chosen to look realistic. Using a fake name can help protect sensitive PII from misuse because it removes the individual’s association with the remaining data in the record or otherwise on hand.
Thus, the United States Department of Health and Human Services’ Health Information Knowledgebase maintains that “using pseudo-identifiers can assist in compliance with HIPAA regulations regarding suppression of patient identification information.” Article 4(3b) of the European General Data Protection Regulation (GDPR) considers pseudonymization similarly compliant so long as “the data can no longer be attributed to a specific data subject without the use of additional information, [and] as long as such additional information is kept separately and subject to technical and organizational measures to ensure non-attribution to an identified or identifiable individual.”
Pseudonymization can be performed in two ways:
1) Unrecoverable. This method uses a single-column source or ‘set’ file containing first names, cities, or other values that are listed and available for random selection in place of the original value. Because there is no association between the original and fake values, there is no way to reverse this process, even if you want to reveal the original.
2) Recoverable. This method, which would not be considered compliant with the GDPR, involves a tabular relationship between the source data and its pseudonym. In practice, a two-column set file containing both real and fake values constitutes a lookup table that can be used both to display pseudonyms and to restore the original values later through a reverse lookup.
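The two methods above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular product implements set files: the list FAKE_NAMES stands in for a single-column set file, the dictionary returned by build_lookup stands in for a two-column one, and all function names are hypothetical.

```python
import random

# Hypothetical stand-in for a single-column set file of replacement names.
FAKE_NAMES = ["Alex", "Blair", "Casey", "Devon", "Emery"]


def pseudonymize_unrecoverable(names, fake_values, rng):
    """Replace each name with a randomly chosen fake value; no mapping is kept."""
    return [rng.choice(fake_values) for _ in names]


def build_lookup(names, fake_values):
    """Pair each distinct real name with a distinct fake one (two-column table)."""
    distinct = sorted(set(names))
    if len(distinct) > len(fake_values):
        raise ValueError("not enough fake values to cover every real name")
    return dict(zip(distinct, fake_values))


def pseudonymize_recoverable(names, lookup):
    """Replace each name with its assigned pseudonym from the lookup table."""
    return [lookup[name] for name in names]


def restore(pseudonyms, lookup):
    """Reverse lookup: recover the real names from their pseudonyms."""
    reverse = {fake: real for real, fake in lookup.items()}
    return [reverse[p] for p in pseudonyms]


originals = ["Maria", "John", "Maria", "Wei"]

# Unrecoverable: random picks, nothing links the output back to the input.
masked_once = pseudonymize_unrecoverable(originals, FAKE_NAMES, random.Random())

# Recoverable: the lookup table must be stored and protected separately.
lookup = build_lookup(originals, FAKE_NAMES)
masked = pseudonymize_recoverable(originals, lookup)
restored = restore(masked, lookup)
```

Note that in the recoverable sketch the same real name always receives the same pseudonym, which preserves consistency across records, while the unrecoverable sketch may give the same person different pseudonyms on each occurrence.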
IRI recognized the value of this de-identification or pseudonymization method long ago when using set files in its test data generation tool, RowGen. Test data quality is improved, without breaching privacy, when real-looking names replace actual names.
In practice, pseudonymization jobs can be complicated by the introduction of new values in the source; new substitute values need to exist to cover them, and they must be assigned in such a way that re-identification is still prevented. One such remedy, documented here, is to use hashed name values stored in a .set file.
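One way to picture the hashed-value idea is a lookup table keyed by a digest of each name rather than by the name itself, so the table never stores real names and a new source value simply receives the next unused substitute. The sketch below is an assumption about how such a table could work, not the documented remedy itself; the keyed digest (HMAC-SHA256), the SECRET_KEY value, and the function names are all hypothetical choices made for illustration.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored apart from the data.
SECRET_KEY = b"replace-with-a-secret-key"


def name_digest(name, key=SECRET_KEY):
    """Return a keyed SHA-256 digest (hex) of a trimmed, lower-cased name."""
    normalized = name.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def pseudonym_for(name, table, unused):
    """Return the pseudonym for a name, assigning a fresh one to a new name."""
    digest = name_digest(name)
    if digest not in table:
        if not unused:
            raise ValueError("ran out of substitute values")
        table[digest] = unused.pop(0)
    return table[digest]


table = {}                           # digest -> pseudonym; holds no real names
unused = ["Alex", "Blair", "Casey"]  # substitutes not yet assigned

first = pseudonym_for("Maria", table, unused)
second = pseudonym_for("John", table, unused)
again = pseudonym_for(" maria ", table, unused)  # same person, same pseudonym
```

A keyed digest is used here instead of a plain hash because names come from a small, guessable vocabulary; without the key, anyone holding the table could hash a list of common names and match the results.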
For Extra Security
Security can be compromised when someone can still guess the target individual’s real identity, perhaps because there are too many other identifying elements in the record. In these cases, it makes sense to apply other protections to the remaining fields. Another consideration in choosing a pseudonymization method is reversibility: the extent to which the real data can be recovered, or the ease with which that can be accomplished. It may therefore make sense to pseudonymize or mask the data outright, with no means of restoring the original values.
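As a rough illustration of protecting the remaining fields, the sketch below coarsens two other identifying elements in a record whose name has already been pseudonymized. The field names, the ISO date format, and the choice to keep only the birth year and the first three ZIP digits are assumptions made for the example, not requirements drawn from any regulation.

```python
def generalize_record(record):
    """Return a copy with identifying fields coarsened; other fields unchanged."""
    masked = dict(record)
    # Assumes an ISO date string (YYYY-MM-DD); keep the year only.
    masked["date_of_birth"] = record["date_of_birth"][:4]
    # Keep the first three ZIP digits and blank out the rest.
    masked["zip_code"] = record["zip_code"][:3] + "**"
    return masked


record = {
    "name": "Alex",  # already a pseudonym
    "date_of_birth": "1984-07-16",
    "zip_code": "32901",
    "diagnosis": "asthma",
}
result = generalize_record(record)
```

Because the discarded digits and dates are not stored anywhere, this kind of masking cannot be undone, which matches the case where no means of restoring the original values is wanted.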