IRI Blog Articles

Diving Deeper into Data Management



What is Data Pseudonymization?

by Jeff Simpson

This blog thread deals with pseudonymization as one method of de-identifying or anonymizing sensitive data.

The unauthorized use or misuse of our personally identifiable information (PII), such as name, Social Security number, date of birth, mother’s maiden name, and place of birth, can result in identity theft and other crimes related to impersonation, not to mention embarrassment, inconvenience and expense. Organizations that collect PII but do not protect it face serious legal and financial ramifications, which is why more of them are focusing on data risk mitigation.

Hospitals, government agencies, corporations, financial institutions and others that maintain client and patient records containing PII must comply with data privacy laws like the 1974 Privacy Act and the 1996 Health Insurance Portability and Accountability Act (HIPAA). Some of these organizations must follow not only the statutory regulations, but also requirements specific to their part of the industry and their own internal business rules. A litany of data breaches and fines is chronicled at

The “Guide to Protecting the Confidentiality of Personally Identifiable Information” by Erika McCallister goes into great detail about protecting PII, and the processes of pseudonymization and de-identification.

According to Wikipedia and a few other online sources, pseudonymization is the process of “removing the association between data and the subject of that data, and adding an association between the data and an alternative identifier.” It is the process of “depersonalizing” the data so that any identifying fields within a record are replaced by one or more artificial identifiers.

Personal data is removed from a record and replaced with a pseudonym (pseudo-name or fake name) to protect the sensitive name. The replacement values are chosen to look realistic. Using a fake name can help protect sensitive PII from misuse because it removes the individual’s association with the remaining data in the record. Thus, the United States Department of Health and Human Services’ Health Information Knowledgebase maintains that “using pseudo-identifiers can assist in compliance with HIPAA regulations regarding suppression of patient identification information.”

An Example of Pseudonymization

IRI FieldShield software provides two options for source field pseudonymization in the context of protecting PII:

1) Unrecoverable. This method uses a single-column source or ‘set’ file containing first names, cities, or other values, any of which can be selected at random to replace the original value. Because there is no association between the original and fake values, there is no way to reverse this process if you want to reveal the original.

2) Recoverable. This method creates a two-column set file that pairs real and fake data to establish a look-up table. The table can be used both to display pseudonyms and to restore the original values later through a reverse lookup.
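The difference between the two options can be sketched in a few lines of Python. This is only an illustration of the general technique, not FieldShield syntax or its implementation: the function names, the in-memory lists standing in for set files, and the sample names are all assumptions made for the example.

```python
import random


def pseudonymize_unrecoverable(values, pool, rng=None):
    """Replace every value with a random pick from a one-column pool.

    No mapping is kept, so the originals cannot be restored.
    """
    rng = rng or random.Random()
    return [rng.choice(pool) for _ in values]


def build_lookup(real_values, pool, rng=None):
    """Build a two-column table: each distinct real value gets its own fake value."""
    rng = rng or random.Random()
    distinct_real = sorted(set(real_values))
    unique_pool = list(dict.fromkeys(pool))  # drop duplicates so the mapping is one-to-one
    if len(unique_pool) < len(distinct_real):
        raise ValueError("pool has too few distinct values for a one-to-one mapping")
    fakes = rng.sample(unique_pool, len(distinct_real))
    return dict(zip(distinct_real, fakes))


def pseudonymize_recoverable(values, lookup):
    """Replace each value with its pseudonym from the lookup table."""
    return [lookup[v] for v in values]


def restore(pseudonyms, lookup):
    """Reverse lookup: recover the original values from their pseudonyms."""
    reverse = {fake: real for real, fake in lookup.items()}
    return [reverse[p] for p in pseudonyms]


if __name__ == "__main__":
    names = ["Alice", "Bob", "Alice"]        # hypothetical source field
    pool = ["Pat", "Sam", "Lee", "Kim"]      # hypothetical set-file contents
    table = build_lookup(names, pool)
    masked = pseudonymize_recoverable(names, table)
    print(masked, restore(masked, table))
```

In the recoverable case the look-up table is itself sensitive, since anyone holding it can reverse the pseudonyms; it would need to be stored and protected separately from the pseudonymized data.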

IRI recognized the value of this de-identification or pseudonymization method long ago when using set files in its test data generation tool, RowGen.  Test data quality is improved, without breaching privacy, when real-looking names replace actual names.

For Extra Security

Security can be compromised when someone can still guess the target individual’s real identity, perhaps because there are too many other identifying elements in the record. In these cases, it makes sense to apply other protections to the remaining fields. Another consideration with pseudonymization is reversibility: the extent to which the real data can be recovered, and the ease with which that can be done. It may therefore make sense to pseudonymize or mask the data outright, with no means of restoring the original values.
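As a rough illustration of protecting the remaining fields, the Python sketch below masks all but the last four digits of a Social Security number and reduces a birth date to its year. Both helpers and their parameters are hypothetical, and the choice of which fields to mask and how much to keep is an assumption for the example, not a compliance recommendation.

```python
from datetime import date


def mask_ssn(ssn, keep_last=4, mask_char="*"):
    """Mask every digit except the last `keep_last`, leaving separators in place."""
    total_digits = sum(c.isdigit() for c in ssn)
    seen = 0
    out = []
    for c in ssn:
        if c.isdigit():
            seen += 1
            # Keep a digit only if it is among the final `keep_last` digits.
            out.append(c if seen > total_digits - keep_last else mask_char)
        else:
            out.append(c)
    return "".join(out)


def generalize_birth_date(iso_date):
    """Reduce an ISO date (YYYY-MM-DD) to the year only; raises ValueError if invalid."""
    return str(date.fromisoformat(iso_date).year)


if __name__ == "__main__":
    print(mask_ssn("123-45-6789"))              # ***-**-6789
    print(generalize_birth_date("1980-07-15"))  # 1980
```

Neither transformation keeps a mapping back to the original value, so both are irreversible in the sense described above.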
