Protecting patient privacy is critically important, yet researchers have reidentified deidentified data using only a few additional data points, casting doubt on the effectiveness of existing federally required data security methods and sharing protocols
Clinical laboratories and anatomic pathologists know that the data generated by their diagnostic and testing services constitute most of a patient’s personal health record (PHR). They also know that federal law requires them to secure their patients’ protected health information (PHI), and that any threat to the security of that data endangers medical laboratories and healthcare practices as well.
Therefore, recent coverage in The Guardian, which reported on how easily so-called “deidentified data” can be reidentified with just a few additional data points, should be of particular interest to clinical laboratory and health network managers and stakeholders.
Risky Balance Between Data Sharing and Privacy
In December 2017, University of Melbourne (UM) researchers Chris Culnane, PhD, Benjamin Rubinstein, and Vanessa Teague, PhD, published a report with the Cornell University Library detailing how they reidentified data listed in an open dataset of Australian medical billing records.
“We found that patients can be re-identified, without decryption, through a process of linking the unencrypted parts of the record with known information about the individual such as medical procedures and year of birth,” Culnane stated in a UM news release. “This shows the surprising ease with which de-identification can fail, highlighting the risky balance between data sharing and privacy.”
In a similar study published in Scientific Reports, Yves-Alexandre de Montjoye, PhD, a computational privacy researcher, used location data on 1.5 million people from a mobile phone dataset collected over 15 months and identified 95% of the people in the anonymized dataset using just four unique data points. With only two unique data points, he could identify 50% of the people in the dataset.
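To get an intuition for why so few points suffice, consider the “unicity” of a trace: how often a handful of known observations matches exactly one person in a dataset. The toy Python sketch below simulates that measurement on entirely synthetic data; the population size, number of cell towers, and sampling parameters are invented for illustration and do not reflect de Montjoye’s actual dataset or methodology.

```python
"""Toy illustration of trace 'unicity': how often a few known
(location, hour) points single out exactly one person. Synthetic data only."""
import random

random.seed(0)

N_PEOPLE = 10_000        # hypothetical population
N_CELLS = 500            # hypothetical cell-tower locations
N_HOURS = 24 * 7         # one week at hourly resolution
POINTS_PER_TRACE = 60    # observations recorded per person

# Each person's trace is a set of (cell_id, hour) observations.
traces = [
    {(random.randrange(N_CELLS), random.randrange(N_HOURS))
     for _ in range(POINTS_PER_TRACE)}
    for _ in range(N_PEOPLE)
]

def unicity(k, trials=500):
    """Fraction of sampled people whose k known points match no one else."""
    unique = 0
    for _ in range(trials):
        person = random.randrange(N_PEOPLE)
        known = random.sample(sorted(traces[person]), k)
        matches = sum(1 for trace in traces if all(p in trace for p in known))
        unique += (matches == 1)
    return unique / trials

for k in (1, 2, 4):
    print(f"{k} known point(s): {unicity(k):.0%} of sampled traces are unique")
```

The exact percentages depend entirely on the made-up parameters; the point of the sketch is only that uniqueness climbs rapidly with each additional known point.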
“Location data is a fingerprint. It’s a piece of information that’s likely to exist across a broad range of data sets and could potentially be used as a global identifier,” de Montjoye told The Guardian.
The problem is exacerbated by the fact that everything we do online these days generates data—much of it open to the public. “If you want to be a functioning member of society, you have no ability to restrict the amount of data that’s being vacuumed out of you to a meaningful level,” Chris Vickery, a security researcher and Director of Cyber Risk Research at UpGuard, told The Guardian.
This privacy vulnerability isn’t restricted to users of the Internet and social media. In 2013, Latanya Sweeney, PhD, Professor and Director at Harvard’s Data Privacy Lab, performed a similar analysis on approximately 579 participants in the Personal Genome Project who had provided their zip code, date of birth, and gender for inclusion in the dataset. Of those analyzed, she was able to name 42% of the individuals, and the Personal Genome Project later confirmed 97% of her submitted names, according to Forbes.
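The mechanics of this kind of linkage attack are simple enough to sketch. The Python example below joins a hypothetical “deidentified” release to a hypothetical named public list (such as a voter roll) on the same three quasi-identifiers Sweeney used: zip code, date of birth, and gender. Every record, name, and column value below is fabricated for illustration; this is not her method or data.

```python
"""Minimal sketch of a quasi-identifier linkage attack: joining a
'deidentified' release to a named public list. All records are fabricated."""
import pandas as pd

# "Deidentified" release: names removed, but quasi-identifiers remain.
deidentified = pd.DataFrame({
    "zip": ["02139", "60614", "94110"],
    "dob": ["1985-03-12", "1990-07-01", "1978-11-30"],
    "gender": ["F", "M", "F"],
    "diagnosis": ["C50.9", "E11.9", "J45.909"],  # the sensitive attribute
})

# Named public dataset (for example, a voter roll or self-published profile).
public = pd.DataFrame({
    "name": ["Alice Smith", "Bob Jones"],
    "zip": ["02139", "60614"],
    "dob": ["1985-03-12", "1990-07-01"],
    "gender": ["F", "M"],
})

# The "attack" is nothing more than an inner join on the shared quasi-identifiers.
reidentified = deidentified.merge(public, on=["zip", "dob", "gender"])
print(reidentified[["name", "zip", "dob", "gender", "diagnosis"]])
```

In this fabricated example, two of the three deidentified records link to a name; in practice, the match rate depends on how unusual each quasi-identifier combination is within the population.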
These studies reveal that, regardless of attempts to create security standards such as the Privacy Rule of the Health Insurance Portability and Accountability Act of 1996 (HIPAA), the sheer amount of data available on the Internet makes it relatively easy to reidentify data that has been deidentified.
The Future of Privacy in Big Data
“Open publication of deidentified records like health, census, tax or Centrelink data is bound to fail, as it is trying to achieve two inconsistent aims: the protection of individual privacy and publication of detailed individual records,” Dr. Teague noted in the UM news release. “We need a much more controlled release in a secure research environment, as well as the ability to provide patients greater control and visibility over their data.”
While studies continue to mount showing how vulnerable deidentified information can be, there has been little movement to fix the issue. Nevertheless, clinical laboratories should carefully consider any decision to sell anonymized (also known as blinded) patient data for data-mining purposes. The data may still contain enough identifying information to be used inappropriately. (See Dark Daily, “Coverage of Alexion Investigation Highlights the Risk to Clinical Laboratories That Sell Blinded Medical Data,” June 21, 2017.)
Should regulators and governments address the issue, clinical laboratories and healthcare providers could face more stringent regulations on the sharing of data—both identified and deidentified—as well as increased liability and responsibility for its governance and safekeeping.
Until then, healthcare professionals and researchers should consider the implications of deidentification—for both patients and businesses—should the data they share be used in unexpected and potentially malicious ways.
—Jon Stone
Related Information:
‘Data Is a Fingerprint’: Why You Aren’t as Anonymous as You Think Online
Research Reveals De-Identified Patient Data Can Be Re-Identified
The Simple Process of Re-Identifying Patients in Public Health Records
Harvard Professor Re-Identifies Anonymous Volunteers in DNA Study