Data about human beings are very rarely anonymous. Even if the data do not contain information such as name or address, the data may still pose privacy risks to the people they are about, especially if it is possible to re-identify individuals within the data you are using for your research.

The table below can be used to determine privacy risk categories for different types of data. There will always be grey areas when looking at privacy risk, particularly when considering the vulnerability of the research subjects, the sensitivity of the information and the re-identifiability of the data; if in doubt, opt for a higher risk category. Also, not all of the data in your research will have the same level of privacy risk and the privacy risks for each type of data may change as you clean, recode and modify the data from a raw to processed form. Assess the risks for each separate data asset with consideration for how each data asset changes during the various stages of the research life cycle.

Once you’ve determined the privacy risk category for each data asset that you will use in your research, you can use these categories to inform your choices about how data can be de-identified, safely used by students/interns, safely transported physically, securely transferred digitally, and stored securely.


Red Data

Privacy Risk: Very high
Description:
• Data from vulnerable subjects that are about sensitive topics and are either fully identifiable or easily re-identifiable
Impact of a Breach:
• Serious harm to research subjects and/or serious damage to the reputation of the VU could occur
• The likelihood that such harm/damage would occur after a breach is very high
Examples:
• Video interviews with children talking about abuse
• Raw transcripts of interviews with refugees about their home country
• Open text responses (e.g. diary-type feedback) from patients with mental or physical health problems
• Open text responses or detailed interviews with employees describing their satisfaction with their employer
• Raw neuroimages of vulnerable subjects that have not been de-faced
• Genetic data from vulnerable subjects that indicate risk for disease or disorders

Orange Data

Privacy Risk: High
Description:
• Fully identifiable data about benign topics from non-vulnerable subjects
OR
• Data about benign topics from vulnerable subjects that are fully identifiable or could fairly easily be re-identified
OR
• Data about sensitive topics from non-vulnerable subjects that are fully identifiable or could fairly easily be re-identified
Impact of a Breach:
• Serious harm to research subjects and/or serious damage to the reputation of the VU could occur
• The likelihood that such harm/damage would occur after a breach is moderate, but if it does occur the consequences would be severe
Examples:
• Key files containing names and contact information of research subjects (vulnerable or not)
• Data containing the date of birth and 6-digit postal code of research subjects (vulnerable or not)
• Video observations of children playing
• Video observations of team-building activities
• Raw neuroimages of non-vulnerable subjects that have not been de-faced
• Raw questionnaire data about sensitive topics and/or from vulnerable subjects containing detailed demographic information
• Genetic data from non-vulnerable subjects

Yellow Data

Privacy Risk: Moderate
Description:
• Data about benign topics from non-vulnerable subjects that could fairly easily be re-identified
OR
• Data that can only be re-identified with great effort and that are about:
  - benign topics from vulnerable subjects;
  - sensitive topics from non-vulnerable subjects;
    OR
  - both
Impact of a Breach:
• The severity of harm to research subjects and/or damage to the reputation of the VU is moderate to high, but the likelihood that such harm or damage would occur is low
Examples:
• IP and MAC addresses of research subjects (vulnerable or not)
• Raw questionnaire data about benign topics from non-vulnerable subjects containing detailed demographic information
• Questionnaire data about sensitive topics and/or vulnerable populations that have been processed to make re-identification more difficult
• Video recordings with faces blurred and voices modified
• Transcripts of interviews in which identifying information has been replaced with pseudonyms
• Repeated physical measurements that include the dates and times at which the measurements occurred
• Neuroimaging data from vulnerable subjects that have been de-faced
• Extensive kinematic measurements from vulnerable subjects, or kinematic measurements used to identify sensitive information such as abnormal movement patterns

Green Data

Privacy Risk: Low
Description:
• Data about benign topics from non-vulnerable subjects that can only be re-identified with great effort
Impact of a Breach:
• Harm to research subjects is minimal and the likelihood that harm would occur is very low
• Damage to the reputation of the VU may still occur, although it is less likely and the impact would be lower
Examples:
• Blue data in which a random identification code is attached to each record so that every record can be re-identified with the help of a key file
• Data that contain unique records for some or all research subjects:
  - Neuroimaging data from non-vulnerable subjects that have been de-faced
  - Extensive kinematic measurements from non-vulnerable subjects
  - Any other measurements that contain sufficient information to create a unique profile for one or more research subjects
  - Questionnaires about benign topics answered by non-vulnerable research subjects that have been processed to be less identifiable, but which still contain demographic information about each subject

Blue Data

Privacy Risk: Little to no risk
Description:
• Data that cannot be re-identified at all, regardless of the vulnerability of the subjects or the sensitivity of the information
Impact of a Breach:
• Research subjects suffer no direct harm and the VU will suffer no damage to its reputation**
Examples:
• Highly variable physical measurements, e.g. blood pressure, heart rate, blood glucose, body temperature
• Likert scale responses in questionnaire data
• Coded qualitative data
• Summary statistics

NB: If any of the above examples are part of a record about a research subject that contains data from a higher risk category, then the data are not anonymous. They are only anonymous if they are not linkable to the higher risk data.
**NB: Although research subjects will not be directly harmed, the conclusions drawn from research results or the misuse of published research software can impact the wider population to which the research subjects belong. Such ethical considerations should be discussed with the FGB Scientific and Ethical Review Board.


Important Factors in Privacy Risk


  1. The vulnerability of the research subjects
  2. The sensitivity of the information being used
  3. The ease with which research subjects can be re-identified in the data
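
The way these three factors combine into the colour categories in the tables above can be sketched as a small decision function. This is an illustrative reading of the tables, not official terminology: the function name and the re-identifiability labels ("full", "easy", "effort", "none") are my own.

```python
def risk_category(reidentifiability: str, vulnerable: bool, sensitive: bool) -> str:
    """Map the three privacy-risk factors to a colour category.

    reidentifiability: "full" (fully identifiable), "easy" (fairly easily
    re-identified), "effort" (re-identifiable only with great effort),
    or "none" (cannot be re-identified). Labels are illustrative.
    """
    if reidentifiability == "none":
        return "blue"        # cannot be re-identified, regardless of other factors
    if reidentifiability in ("full", "easy"):
        if vulnerable and sensitive:
            return "red"     # vulnerable subjects + sensitive topics + (re-)identifiable
        if reidentifiability == "full" or vulnerable or sensitive:
            return "orange"  # the three orange combinations in the table
        return "yellow"      # fairly easily re-identified + benign + non-vulnerable
    # re-identifiable only with great effort:
    if vulnerable or sensitive:
        return "yellow"      # benign+vulnerable, sensitive+non-vulnerable, or both
    return "green"           # great effort + benign + non-vulnerable
```

When in doubt between two categories, the guidance above says to opt for the higher one; a function like this can only encode the clear-cut cases.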

Confidentiality Risk versus Privacy Risk

You may also need to consider whether your data need to be kept confidential. Even if your data are not about human subjects, they may be confidential, e.g. business secrets or intellectual property. If you are working with a third party, especially a business, they may require you to keep their data confidential. To help you assess the potential confidentiality risks in your research, see the last section of this data classification tool on confidentiality. If at least one of your answers indicates a high confidentiality risk, your confidentiality risk is high. If none of your answers indicate a high risk, but at least one indicates a medium risk, your confidentiality risk is medium. You can generally map the confidentiality risks to the privacy risks as:

  • High confidentiality risk = Red or orange data
  • Moderate confidentiality risk = Yellow data
  • Low confidentiality risk = Green or blue data

If you are working with data to which both privacy and confidentiality apply, categorize the data in the highest applicable risk category (e.g. if the confidentiality risk is low but the privacy risk is very high, choose the red data category).
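
This "highest category wins" rule amounts to taking the maximum over an ordered scale. A minimal sketch, in which the ordering and the function name are illustrative assumptions:

```python
# Risk categories ordered from lowest to highest (per the mapping above).
ORDER = ["blue", "green", "yellow", "orange", "red"]

def combined_category(privacy: str, confidentiality: str) -> str:
    """Return the higher of the two risk categories (illustrative helper)."""
    return max(privacy, confidentiality, key=ORDER.index)
```

For example, `combined_category("red", "green")` yields `"red"`: a very high privacy risk dominates a low confidentiality risk.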


  1. The legal definition of pseudonymization in the GDPR is quite strict; essentially, under the GDPR, pseudonymous data become anonymous data once the additional information necessary to re-identify them is deleted. Most real-life situations do not meet this strict definition. For example, a dataset with no directly identifying data may still contain indirectly identifying data that can be used to single out unique records, which could then be re-identified using publicly available information or contextual information. Through data processing, you may further de-identify this dataset so that there are no more unique records, at which point the only way to re-identify your participants is through an identification code and a key file. Both the former and the latter versions of this dataset would generally be considered pseudonymized, but under the GDPR only the latter is legally pseudonymized. An important takeaway is that even if someone says their data are pseudonymous, you may want to investigate to what extent the data are pseudonymous: do they mean the GDPR's strict definition of pseudonymized, or do they mean that directly identifying data are not present in the dataset? This is an important consideration when assessing your privacy risks because it affects the re-identifiability of the data.↩︎
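
The identification-code-plus-key-file setup described in this footnote can be sketched as follows. This is a simplified illustration, not a prescribed procedure: the function name, the `subject_code` field, and the record layout are all assumptions for the example.

```python
import secrets

def pseudonymize(records, id_field="name"):
    """Replace the direct identifier in each record with a random code.

    Returns the pseudonymized records plus a key file (code -> identifier).
    The key file must be stored separately and securely; whoever holds it
    can re-identify the participants. Illustrative sketch only.
    """
    codes = {}           # identifier -> random code (reused for repeat records)
    pseudonymized = []
    for record in records:
        record = dict(record)                  # copy, so the input is not mutated
        identifier = record.pop(id_field)      # remove the direct identifier
        if identifier not in codes:
            codes[identifier] = secrets.token_hex(4)   # random 8-character code
        record["subject_code"] = codes[identifier]
        pseudonymized.append(record)
    key_file = {code: ident for ident, code in codes.items()}
    return pseudonymized, key_file
```

Note that the output is pseudonymized, not anonymous: as the footnote explains, the records may still contain indirectly identifying data, and the key file itself is a high-risk asset (see the orange data examples above).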