Jake Ottenwaelder and Thenushaa Kandiah discuss technical designations and how these definitions translate across regulatory frameworks.

The world of data privacy is changing faster than ever. As legislative frameworks evolve to meet growing technical innovations, the precision and specificity of our privacy vocabulary must adapt to these changes in real time.

A widely cited study found that 87% of the U.S. population is identifiable through three publicly available data points: gender, date of birth, and zip code. Since its publication in 2000, we’ve seen new regulations, exponential growth in consumer data collection, and maturing machine learning tools – all of which complicate and intensify data privacy risks.

This article illustrates the differences between three important terms in major data privacy laws – de-identification, anonymization, and pseudonymization – from a risk-based perspective; analyzes their place within the General Data Protection Regulation (GDPR), California Consumer Privacy Act (CCPA), and Health Insurance Portability and Accountability Act (HIPAA); and delves into their implications for modern identifiability risks.

Understanding Designations

Across global regulatory landscapes, the terms de-identification, anonymization, and pseudonymization define the legal boundaries of data utility and privacy. While many of us may use these terms interchangeably in casual discourse, they carry distinct legal meanings with varying obligations under different laws.

What is anonymization according to the GDPR?

Recital 26 of the GDPR classifies anonymized data as, “information which does not relate to an identified or identifiable natural person or to personal data rendered anonymous in such a manner that the data subject is not or no longer identifiable.”

Key characteristics of anonymized data

Anonymization considers all types of personal data, including direct identifiers, indirect identifiers, and interaction data.
Anonymization must be achieved through technical means which result in permanent and irreversible data modifications.
Business processes or access limitations are not sufficient to achieve anonymization.
Because anonymization is irreversible, anonymized data is no longer considered personal data and falls outside the regulatory scope of the GDPR.

What is pseudonymization according to the GDPR?

Pseudonymized data is “personal data [that] can no longer be attributed to a specific data subject without the use of additional information[…]” as stated in Art. 4(5) of the GDPR.

Key characteristics of pseudonymized data

Pseudonymization considers direct and indirect personal identifiers.
Pseudonymization can be achieved through technical approaches.
Pseudonymization leverages business processes and access controls.
Pseudonymization is generally reversible, so pseudonymized data is still considered to be and managed as personal data.

What is de-identification according to the CCPA?

De-identified data is data that “cannot reasonably identify, relate to, describe, be capable of being associated with, or be linked, directly or indirectly, to a particular consumer,” according to Cal. Civ. Code § 1798.140(o)(2).

Key characteristics of de-identified data

De-identification considers both direct and indirect personal identifiers.
De-identification must result in data that cannot be reasonably linked back to an individual.
De-identification can leverage business processes and access controls to bolster the position of de-identified data.
De-identification may be achieved through reversible or non-reversible data manipulations.

What is de-identification according to HIPAA?

According to 45 CFR § 164.514(a)-(c) of HIPAA, de-identified protected health information (PHI) is health information that “does not identify an individual and with respect to which there is no reasonable basis to believe that the information can be used to identify an individual.” HIPAA offers two formal pathways for de-identifying PHI:

Expert Determination: A person with knowledge of statistical and scientific principles for de-identification applies those methods, determines that the “risk is very small” that the information could be used, “alone or in combination” with other available data, to identify an individual, and documents the methods and results.
Safe Harbor: Within a dataset, 18 specific identifiers are removed from the record, such as name, geographic subdivisions, date elements, contact information, identification numbers, biometrics, IP address, etc.

What is the difference?

Among these designations, the key distinguishing factors are perspective, relativity, permanence, and re-identification risk.

	Pseudonymization & De-identification	Anonymization
Perspective	Data within an environment	All data in the universe
Relativity	Relative to internal data housed within an organization	Relative to external data from infinite possible sources
Permanence	Reversible	Irreversible
Risk	Higher risk: Re-identification possible with access to additional data or keys	Lower risk: Re-identification is not reasonably possible

Understanding Technical Approaches

Privacy professionals leverage a variety of technical methods to achieve each standard. We’ll work from the least to the greatest level of anonymization, using a hypothetical patient medical record as an example.

Jane Doe lives in a rural region of California, and she has ALS.

Field	Value
Full Name	Jane Doe
Email	janedoe@email.com
Zip Code	89010
Date of Birth	04/07/1995
MRN	e47ecd89d61
Ethnicity	Caucasian
Diagnosis	G12.21

What does the Safe Harbor approach look like?

Safe Harbor requires the removal of 18 direct identifiers to achieve de-identification. This framework uses suppression, or the explicit removal of these direct identifiers.

Field	Original Value	Safe Harbor Record
Full Name	Jane Doe	null
Email	janedoe@email.com	null
Zip Code	89010	null
Date of Birth	04/07/1995	null
MRN	e47ecd89d61	null
Ethnicity	Caucasian	Caucasian
Diagnosis	G12.21	G12.21

Of note, Safe Harbor does not account for indirect identifiers, which can be used in combination with other identifiers to re-identify an individual. Therefore, the remaining ethnicity and diagnosis code (G12.21, the International Classification of Diseases code for Jane’s condition) retain re-identification risk.
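As a minimal sketch of suppression, the snippet below nulls out the identifier fields in Jane’s record and leaves the rest untouched. The field names and the `SAFE_HARBOR_FIELDS` set are hypothetical and cover only the identifiers present in this example, not the full list of 18; the actual rule also carries conditions that this sketch does not model.

```python
# Hypothetical field names; only the identifiers present in this record are listed.
SAFE_HARBOR_FIELDS = {"full_name", "email", "zip_code", "date_of_birth", "mrn"}


def suppress(record, fields=SAFE_HARBOR_FIELDS):
    """Return a copy of the record with the listed fields set to None."""
    return {key: (None if key in fields else value) for key, value in record.items()}


jane = {
    "full_name": "Jane Doe",
    "email": "janedoe@email.com",
    "zip_code": "89010",
    "date_of_birth": "04/07/1995",
    "mrn": "e47ecd89d61",
    "ethnicity": "Caucasian",
    "diagnosis": "G12.21",
}

safe_harbor_record = suppress(jane)

# The indirect identifiers survive suppression unchanged.
remaining = {k: v for k, v in safe_harbor_record.items() if v is not None}
```

Here `remaining` holds only the ethnicity and diagnosis fields, which is exactly the residual information an attacker could combine with other sources.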

What does pseudonymization look like?

Pseudonymization replaces identifiers in the field value with pseudonyms. This can be achieved through hashing or encryption. In this case, the patient’s name, email, and medical record number (MRN) have been hashed and replaced, leaving the remaining fields intact.

Field	Original Value	Pseudonymized Record
Full Name	Jane Doe	e2a79383c1347633a78fd9
Email	janedoe@email.com	9fb56cf4972a0155919
Zip Code	89010	89010
Date of Birth	04/07/1995	04/07/1995
MRN	e47ecd89d61	a84b192dd4584d1aef0
Ethnicity	Caucasian	Caucasian
Diagnosis	G12.21	G12.21

Despite pseudonymization, re-identification risks remain. Other available datasets, combined with this record, could re-identify the patient. It is also possible to reverse-engineer pseudonymized data by mapping hashed pseudonyms back to the original values.
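To illustrate the reverse-engineering risk, the sketch below contrasts a plain, unkeyed hash with a keyed hash (HMAC). It is an illustration under stated assumptions rather than a recommended design: the key value, the candidate name list, and the function names are all hypothetical, and a real deployment would keep the key in a secrets manager, separate from the data.

```python
import hashlib
import hmac

# Hypothetical key; in practice it would be stored apart from the dataset.
SECRET_KEY = b"example-key-held-outside-the-dataset"


def unkeyed_pseudonym(value: str) -> str:
    """Plain SHA-256 of the value; anyone can recompute it."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def keyed_pseudonym(value: str, key: bytes = SECRET_KEY) -> str:
    """HMAC-SHA256 of the value; recomputing it requires the key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# An attacker hashes a list of candidate names and builds a lookup table.
candidate_names = ["John Smith", "Jane Doe", "Alex Kim"]
lookup = {unkeyed_pseudonym(name): name for name in candidate_names}

leaked_unkeyed = unkeyed_pseudonym("Jane Doe")
recovered = lookup.get(leaked_unkeyed)          # maps back to "Jane Doe"

leaked_keyed = keyed_pseudonym("Jane Doe")
not_recovered = lookup.get(leaked_keyed)        # None: the table does not match
```

Both variants are deterministic, so records can still be joined on the pseudonym. The keyed version only moves the risk to whoever holds the key, which is why pseudonymized data continues to be treated as personal data.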

What does de-identification look like?

De-identification is achieved by masking identifiers and generalizing the data to reduce re-identification risk. Masking obscures the data values (e.g., “Jane Doe” changed to “nil”), while generalization hides identifiers in broader data categories (e.g., date of birth reduced to birth month and year).

Field	Original Value	De-Identified Record
Full Name	Jane Doe	nil
Email	janedoe@email.com	9fb56cf4972a0155919
Zip Code	89010	89000
Date of Birth	04/07/1995	04/1995
MRN	e47ecd89d61	a84b192dd4584d1aef0
Ethnicity	Caucasian	Caucasian
Diagnosis	G12.21	G12.21

De-identified data can still be mapped between tables using internal identifiers. Most often, though, it cannot be reversed or mapped to original data.
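The masking and generalization steps in the table can be sketched as follows. The helper names and the choice to keep the first three zip digits are assumptions made to reproduce the example values. The email and MRN are left as they arrive, on the assumption that they were pseudonymized in an earlier step.

```python
def mask(_value: str) -> str:
    """Replace a value with a fixed placeholder."""
    return "nil"


def generalize_zip(zip_code: str, keep: int = 3) -> str:
    """Keep the leading digits and zero out the rest, e.g. 89010 -> 89000."""
    return zip_code[:keep] + "0" * (len(zip_code) - keep)


def generalize_dob(dob: str) -> str:
    """Reduce MM/DD/YYYY to MM/YYYY."""
    month, _day, year = dob.split("/")
    return f"{month}/{year}"


def de_identify(record: dict) -> dict:
    """Mask the name and generalize zip code and date of birth."""
    out = dict(record)
    out["full_name"] = mask(record["full_name"])
    out["zip_code"] = generalize_zip(record["zip_code"])
    out["date_of_birth"] = generalize_dob(record["date_of_birth"])
    return out


record = {
    "full_name": "Jane Doe",
    "email": "9fb56cf4972a0155919",   # pseudonym from the earlier step
    "zip_code": "89010",
    "date_of_birth": "04/07/1995",
    "mrn": "a84b192dd4584d1aef0",     # pseudonym from the earlier step
    "ethnicity": "Caucasian",
    "diagnosis": "G12.21",
}

de_identified = de_identify(record)
```

Generalization is lossy: once the day of birth and the last two zip digits are dropped, they cannot be recovered from the output alone.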

What does anonymization look like?

Anonymization requires a range of technical capabilities, including:

Masking: Replacing sensitive data with functional substitutes
Generalization: Reducing precise data elements to ambiguous groups
Field suppression: Removing field values to prevent data transfer
Noise addition: Introducing randomized elements to obscure data values
Differential privacy: Ensuring that the presence or absence of any one individual’s record has only a limited effect on results computed from the dataset

Field	Original Value	Anonymized Record
Full Name	Jane Doe	nil
Email	janedoe@email.com	nil
Zip Code	89010	89000
Date of Birth	04/07/1995	20-30
MRN	e47ecd89d61	nil
Ethnicity	Caucasian	Caucasian
Diagnosis	G12.21	G12.20

These changes are irreversible, so the data cannot be mapped back to individuals and re-identification risk is significantly reduced. We’ve applied some of these techniques to Jane’s record:

Field Suppression: Name, Email, MRN
Generalization: Date of Birth, Zip Code, Diagnosis

More advanced techniques, such as noise addition and differential privacy, go beyond standard de-identification. However, individuals can still be re-identified after some of these techniques have been applied.
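Noise addition is easiest to see on an aggregate rather than on a single record. The sketch below adds Laplace noise to a count, which is one common way differential privacy is applied to counting queries. The true count, the epsilon value, and the function names are all made up for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential draws with mean `scale`
    follows a Laplace distribution centred on zero.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(true_count: int, epsilon: float, rng: random.Random) -> float:
    """Release a count with Laplace noise scaled to sensitivity / epsilon.

    Adding or removing one person changes a count by at most 1,
    so the sensitivity of a counting query is 1.
    """
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(42)        # seeded only so the sketch is repeatable
als_patients_in_region = 12    # made-up true count
released = noisy_count(als_patients_in_region, epsilon=0.5, rng=rng)
```

A smaller epsilon means more noise and stronger protection, at the cost of a less accurate released value. This is the same utility trade-off that applies to the other techniques.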

Understanding Re-Identification Risk

We operate under a few basic assumptions:

Regulations are interchangeable: Each set of regulations sets different standards and operates in different geographic scopes.
Risk is zero: There is always risk.
Solutions are evergreen: Technology changes, and data changes.

They are all wrong. Here’s what can happen if a bad actor gets hold of your dataset.

How can bad actors re-identify data?

Linkage attacks occur when indirect personal identifiers can be linked across tables to gain more information about the data subject and potentially re-identify them. In Jane’s case, her medical record could be linked to public voter registration lists based on her rural zip code or date of birth.
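A linkage attack amounts to a join on shared indirect identifiers. The sketch below uses entirely made-up rows and hypothetical field names to show how a medical record without a name can be tied to a named voter record when the zip code and date of birth match exactly one person.

```python
from collections import defaultdict

# Made-up medical records with direct identifiers already removed.
medical_records = [
    {"zip_code": "89010", "date_of_birth": "04/07/1995", "diagnosis": "G12.21"},
    {"zip_code": "90001", "date_of_birth": "11/23/1980", "diagnosis": "E11.9"},
]

# Made-up public voter list.
voter_list = [
    {"name": "Jane Doe", "zip_code": "89010", "date_of_birth": "04/07/1995"},
    {"name": "John Roe", "zip_code": "90001", "date_of_birth": "01/02/1975"},
    {"name": "Ann Poe", "zip_code": "90001", "date_of_birth": "06/15/1988"},
]


def link(records, public, keys):
    """Return (name, diagnosis) pairs where the keys match exactly one person."""
    index = defaultdict(list)
    for person in public:
        index[tuple(person[k] for k in keys)].append(person)

    matches = []
    for rec in records:
        candidates = index.get(tuple(rec[k] for k in keys), [])
        if len(candidates) == 1:  # an unambiguous match re-identifies the record
            matches.append((candidates[0]["name"], rec["diagnosis"]))
    return matches


linked = link(medical_records, voter_list, keys=("zip_code", "date_of_birth"))
```

In this toy data only Jane’s record links, because hers is the only combination of zip code and date of birth that appears in both sources.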

Background attacks refer to when publicly available data, or background knowledge, is applied to a dataset to determine specific characteristics or to re-identify an individual. For example, if Jane participated in the ALS Ice Bucket Challenge and disclosed her condition in a Facebook post, a bad actor could use this public information to match the medical record to her identity.

Inferential attacks involve using statistical methods to calculate the distribution of data and find any outliers in the dataset. By monitoring the dataset over time, bad actors can identify new individuals added to the dataset by deciphering the impact of new information on the statistical distribution of the dataset. If Jane’s record were added to a new dataset, a bad actor could re-identify her by analyzing how the addition altered statistical characteristics.
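One simple form of this is a differencing attack on published averages. The sketch below assumes the attacker sees the mean age and the record count before and after exactly one record is added. The ages and the function name are made up.

```python
def infer_added_value(mean_before: float, n_before: int,
                      mean_after: float, n_after: int) -> float:
    """Recover the value of the newly added record(s) from two published means.

    total_after - total_before equals the sum of the added values; when
    exactly one record was added, that sum is the record's value.
    """
    return mean_after * n_after - mean_before * n_before


ages_before = [42, 57, 63, 38]       # made-up ages of existing patients
ages_after = ages_before + [30]      # one new patient joins the dataset

mean_before = sum(ages_before) / len(ages_before)
mean_after = sum(ages_after) / len(ages_after)

inferred_age = infer_added_value(
    mean_before, len(ages_before), mean_after, len(ages_after)
)
```

No individual row is ever exposed here. The new patient’s age falls out of two aggregate statistics, which is why releases need to be assessed over time and not only one snapshot at a time.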

Singling out can take on two forms: unique record identification and outlier exploitation. With unique record identification, bad actors can single out identifying records that are so statistically unique within the dataset that they can only belong to one individual, even without the use of external data. Outlier exploitation involves targeting records with extreme or unusual values that make them highly distinguishable from the rest of the dataset. Because Jane has a rare medical condition and lives in a rural zip code, she could be singled out as a unique record or outlier.
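A quick way to test for unique records is to count how many rows share each combination of indirect identifiers. The sketch below flags any row whose combination appears fewer than `k` times. The rows, field names, and threshold are illustrative assumptions.

```python
from collections import Counter

# Made-up rows after generalization; the first row stands in for Jane.
records = [
    {"zip_code": "89000", "age_range": "20-30", "diagnosis": "G12.20"},
    {"zip_code": "90000", "age_range": "40-50", "diagnosis": "E11.9"},
    {"zip_code": "90000", "age_range": "40-50", "diagnosis": "E11.9"},
    {"zip_code": "94100", "age_range": "30-40", "diagnosis": "I10"},
    {"zip_code": "94100", "age_range": "30-40", "diagnosis": "I10"},
]


def singled_out(rows, quasi_identifiers, k=2):
    """Return rows whose quasi-identifier combination occurs fewer than k times."""
    def key(row):
        return tuple(row[q] for q in quasi_identifiers)

    group_sizes = Counter(key(row) for row in rows)
    return [row for row in rows if group_sizes[key(row)] < k]


unique_rows = singled_out(records, ["zip_code", "age_range", "diagnosis"])
```

Even after generalization, the first row is the only one with its combination of values, so it can be singled out without any external data.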

How does AI complicate re-identification risks?

As machine learning proliferates, these attacks become even more likely.

Linkage attacks: Improvements to fuzzy matching and unstructured data exploration
Background attacks: Greater efficiency of mass public data evaluation
Inferential attacks: Bolstered inference power to determine membership
Singling out: Broader range of dimensions available for comparison

Understanding Trade-Offs

Data Utility

Safe Harbor is the easiest to implement because HIPAA offers a “rulebook” of identifiers to remove, and no large technical processes are required to suppress these identifiers in a dataset. Data utility can be maintained based on indirect or aggregated identifiers, but linkability risk remains.

Pseudonymization is broadly implemented with relative ease for non-production environments or internal use. Linkability risk remains since hashed or otherwise modified pseudonyms can be linked back to original identifiers. The key trade-off is that pseudonymized data retains a higher level of data utility, which holds benefits for business or research purposes.

De-identification falls into the middle range of anonymization level and data utility. It offers lower implementation costs, and it’s easy to scale through simple algorithms performed locally on datasets. De-identified data has lower linkability risks and lower data utility compared to pseudonymized data.

Anonymization is the most conservative, risk-averse technical approach, with the highest level of anonymization and the lowest level of data utility. Implementation costs are high, and anonymized datasets must be maintained as data changes over time. Planning is key for anonymized data. Before moving forward with anonymization, it must be decided where data utility is needed and what metrics an organization requires.

However, “anonymized data” is more a theoretical concept than a tangible standard. No level of anonymization carries a true zero risk of re-identification; a truly anonymous dataset is an empty one. While true anonymization is theoretically possible, it is not practical for most organizations that need to maintain a level of data utility.

Legal Obligations

Under HIPAA, expert determination can verify that the “risk is very small” that data, alone or in combination with “reasonably available information,” can identify a data subject. Safe Harbor-compliant datasets do not meet the legal requirements for anonymization under the GDPR, since Safe Harbor does not account for indirect identifiers or interaction-level data.

Like Safe Harbor datasets, pseudonymized data does not meet the legal requirements for anonymization under the GDPR. Because the re-identification risk is high and the level of anonymization is low, pseudonymized data must be treated and managed as personal data under the GDPR.

Under CCPA, de-identified data must not “reasonably be used” to infer information about or link to an individual, and a business can use “technical safeguards” and “business processes” to prevent re-identification and to prohibit attempts to re-identify the data. While de-identified data is compliant under CCPA, it still does not meet the legal requirements set by the GDPR.

Under GDPR, anonymized data must “not relate” to a natural person, and the data subject is “no longer identifiable” after the data has been rendered anonymous. Anonymized data is no longer considered personal information, so it can be freely used for internal business purposes under GDPR.

Understanding Implications

The discussion above points to three main considerations when choosing which technique or standard to align your dataset with.

Each regulation holds a different sphere of influence, but it is always safest to meet the highest standards for your consumer population. This takes into account not only the type or nature of the data in your dataset, but also considers where your consumer base is located and what regulations may apply to them.
Different terms can denote different legal standards: GDPR anonymization is not the same as HIPAA or CCPA de-identification, as the latter two still maintain a tolerance for “very low” or “reasonably low” risk for re-identification, whereas the former does not. It is important to clarify and align the de-identification actions on your data with the standard you’re trying to comply with.
Different levels of data modification mean weighing competing priorities: completing de-identification or robust anonymization on a dataset may check the legal compliance box, but it can leave out organizational stakeholders that require a level of data utility from the dataset. It is essential to loop in product, marketing, and analytics teams to ensure that the data is still useful to the organization while maintaining the necessary compliance.

Want to learn more?

This article was originally published in collaboration with the California Lawyers Association Privacy Law Section as a companion article to the webinar of the same name hosted on March 26, 2026.

The full webinar is available via InReach.

Spot the difference: A technical analysis of data anonymization between CCPA, GDPR, and HIPAA