FGB Security Tips: Assessing Privacy Risks

Data about human beings are very rarely anonymous. Even if the data do not contain information such as name or address, the data may still pose privacy risks to the people they are about, especially if it is possible to re-identify individuals with the data you are using in your research.

The table below can be used to determine privacy risk categories for different types of data. There will always be grey areas when looking at privacy risk, particularly when considering the vulnerability of the research subjects, the sensitivity of the information and the re-identifiability of the data; if in doubt, opt for a higher risk category. Also, not all of the data in your research will have the same level of privacy risk and the privacy risks for each type of data may change as you clean, code and modify the data from a raw to processed form. Assess the risks for each separate data asset with consideration for how each data asset changes during the various stages of the research life cycle.

Once you’ve determined the privacy risk category for each type of data you will use in your research, you can use these categories to inform your choices about how data can be de-identified, safely used by students/interns, safely transported physically, securely transferred digitally, and stored securely.

Red Data

Privacy Risk	Very high-risk
Description	• Data from vulnerable subjects, that are about sensitive topics and are either fully identifiable or are easily re-identified
Impact of a Breach	• Serious harm to research subjects and/or serious damage to the reputation of the VU could occur • Likelihood that such harm/damages could occur after a breach is very high
Examples	• Video interviews with children talking about abuse • Raw transcripts of interviews with refugees about their home country • Open text responses (e.g. diary-type feedback) from patients with mental or physical health problems • Open text responses or detailed interviews with employees describing their satisfaction with their employer • Raw neuroimages of vulnerable subjects that haven’t been de-faced • Genetic data from vulnerable subjects that indicate risk for disease or disorders

Orange Data

Privacy Risk	High-risk
Description	• Fully identifiable data about benign information from non-vulnerable subjects OR • Data that are fully identifiable or that could be fairly easily re-identified that are about benign topics from vulnerable subjects OR • Data that are fully identifiable or that could be fairly easily re-identified that are about sensitive topics from non-vulnerable subjects
Impact of a Breach	• Serious harm to research subjects and/or serious damage to the reputation of the VU could occur • Likelihood that such harm/damages would occur after a breach is moderate, but if the harm/damage does occur the consequences would be severe
Examples	• Key files containing names and contact information of research subjects (vulnerable or not) • Data containing date of birth and 6-digit postal code of research subjects (vulnerable or not) • Video observations of children playing • Video observations of team-building activities • Raw neuroimages of non-vulnerable subjects that haven’t been de-faced • Raw questionnaire data about sensitive topics and/or from vulnerable subjects containing detailed demographic information • Genetic data from non-vulnerable subjects

Yellow Data

Privacy Risk	Moderate-risk
Description	• Data that could be fairly easily re-identified that are about benign topics and from non-vulnerable subjects OR • Data that can only be re-identified with great effort and that are about: - benign topics from vulnerable subjects; - sensitive topics from non-vulnerable subjects; OR - both
Impact of a Breach	• Severity of harm to research subjects and/or damage to the reputation of the VU is moderate to high, but the likelihood that such harms or damages would occur is low
Examples	• IP- and MAC-addresses of research subjects (vulnerable or not) • Raw questionnaire data about benign topics from non-vulnerable subjects containing detailed demographic information • Questionnaire data about sensitive topics and/or vulnerable populations that have been processed to make re-identification more difficult • Video recordings with faces blurred and voices modified • Transcripts of interviews in which the identifying information is replaced with pseudonyms • Repeated physical measurements that include the dates and times measurement occurred • Neuroimaging from vulnerable subjects that has been de-faced • Extensive kinematic measurements of vulnerable subjects or that are used to identify sensitive information such as abnormal movement patterns

Green Data

Privacy Risk	Low-risk
Description	• Data that can only be re-identified with great effort and that are about benign topics from non-vulnerable subjects
Impact of a Breach	• Harm to research subjects is minimal and the likelihood that harm would occur is very low • Damage to the reputation of the VU may still occur although it is less likely and the impact would be lower
Examples	• Blue data in which a random identification code is attached to each record so that every record can be re-identified with the help of a key file • Data that contain unique records for some or all research subjects: - Neuro-imaging from non-vulnerable subjects that has been de-faced - Extensive kinematic measurements from non-vulnerable subjects - Any other measurements that contain sufficient information to create a unique profile for one or more research subjects - Questionnaires about benign topics and answered by non-vulnerable research subjects that have been processed to be less identifiable, but which still contain demographic information about each subject

Blue Data

Privacy Risk	Little to no risk
Description	• Data that cannot be re-identified whatsoever, regardless of the vulnerability of the subjects or the sensitivity of the information
Impact of a Breach	• Research subjects suffer no direct harms and the VU will suffer no damage to its reputation**
Examples	• Highly variable physical measurements, e.g. blood pressure, heart rate, blood glucose, body temperature • Likert scale responses in questionnaire data • Coded qualitative data • Summary statistics
NB: If any of the above examples are part of a record about a research subject that contains data from a higher risk category, then the data are not anonymous. They are only anonymous if they are not linkable to the higher risk data.

**NB: Although research subjects will not be directly harmed, the conclusions drawn from research results or the misuse of published research software can impact the wider population to which the research subjects belong. Such ethical considerations should be discussed with the FGB Scientific and Ethical Review Board.
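The Description rows of the tables above can be read as a decision rule over three factors: the vulnerability of the subjects, the sensitivity of the topic, and how easily the data can be re-identified. The Python sketch below is one possible reading of that rule, not an official FGB tool. The names `Reid` and `privacy_category` are hypothetical, and two assumptions are made: "easily" (red) and "fairly easily" (orange, yellow) re-identified are treated as a single level, and data that are both sensitive and from vulnerable subjects but re-identifiable only with great effort are treated as yellow.

```python
from enum import Enum


class Reid(Enum):
    """Re-identifiability levels (hypothetical names for this sketch)."""
    FULL = "fully identifiable"
    EASY = "fairly easily re-identified"
    HARD = "re-identifiable only with great effort"
    NONE = "cannot be re-identified"


def privacy_category(vulnerable: bool, sensitive: bool, reid: Reid) -> str:
    """Return the colour category suggested by the Description rows."""
    if reid is Reid.NONE:
        # Blue: not re-identifiable, regardless of subjects or topic.
        return "blue"
    if reid is Reid.HARD:
        # Green only if the topic is benign and the subjects are non-vulnerable.
        return "yellow" if (vulnerable or sensitive) else "green"
    # From here on the data are fully identifiable or fairly easily re-identified.
    if vulnerable and sensitive:
        return "red"
    if vulnerable or sensitive:
        return "orange"
    # Benign topic, non-vulnerable subjects.
    return "orange" if reid is Reid.FULL else "yellow"


# Example: video observations of children playing (vulnerable subjects, benign topic).
print(privacy_category(vulnerable=True, sensitive=False, reid=Reid.FULL))
```

A rule like this only gives a starting point. The grey areas described above still call for judgement, and if in doubt the higher category applies.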

Important Factors in Privacy Risk

The vulnerability of the research subjects:

Vulnerable research subjects have an additional risk of harm (socially, physically, emotionally, financially) if their personal information is made public. The greater the vulnerability of the research subjects, the greater the potential for serious harm.
Vulnerable research subjects include, but are not limited to:
- children
- people who identify as LGBTQIA2S+
- refugees
- ethnic or religious minorities
The vulnerability of the research subjects can also depend on the context of the research, e.g. employees in organizational psychology research; students in learning analytics research. These contextual risks can also compound the risks for research subjects with other “typical” vulnerability characteristics, e.g. employees who are immigrants.

The sensitivity of the information being used:

Sensitive data include “special” data types that receive extra legal attention under the General Data Protection Regulation (GDPR):
- race or ethnicity
- political opinions
- religious or philosophical beliefs
- trade union membership
- genetic data
- biometric data used to identify a person (such as fingerprints or iris scans)
- health data
- data about sexuality or sexual activity
Data that are defined as “special” by the GDPR may not necessarily be considered sensitive by the general public (e.g. normal physical measurements in average, healthy people are considered health data under the GDPR). If data are “special”, there are additional legal rules that must be followed, regardless of whether the data are deemed sensitive.
Sensitive data are also any information that is considered sensitive by the general public, such as:
- employment status
- income and other financial data
- student grades and performance
- location data
Data may also be more sensitive because the research subjects are more vulnerable, e.g. a refugee describing their experiences in their home country.

The ease with which research subjects can be re-identified in the data:

Data are personal data if they are directly identifying or indirectly identifiable:
- Directly identifying data are what most people think of as personal data: name, contact information, facial images, etc. This information isn’t always directly identifying (e.g. a common name such as Jan Smit), but, regardless, it’s generally agreed that these types of data should be handled with extra care.
- Indirectly identifiable data can also be referred to as pseudonymous data. The ease with which a research subject could be re-identified with indirectly identifiable data depends on several factors including:
  - How much information has been collected about each research subject?
  - How specific is the information about each research subject?
  - Could the data be linked to publicly available information, such as social media profiles, thus enabling re-identification?
As long as data are identifiable, they cannot be referred to as anonymous data and the legal rules of the GDPR must be followed.
Not all identifiable data pose the same level of privacy risk: oftentimes, raw data pose higher privacy risks because of greater re-identifiability (e.g. video recordings), and therefore require additional data protection measures (such as high security storage); as the data are cleaned, coded and analysed they become less identifiable (e.g. coded interactions) and require fewer data protection measures. De-identification is therefore an important part of data processing that can be used to protect the privacy of your research participants, since it’s not usually possible to change the vulnerability of the research subjects or the sensitivity of the research topics.
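One rough way to explore the questions above about how much and how specific the information is, is to count how many records have a unique combination of indirectly identifying values. The sketch below is illustrative only: the function name `unique_record_share`, the field names and the sample records are all hypothetical, and a low share of unique records does not by itself show that data are anonymous.

```python
from collections import Counter


def unique_record_share(records, quasi_identifiers):
    """Return the fraction of records whose combination of values for the
    given fields occurs exactly once in the dataset."""
    keys = [tuple(record[field] for field in quasi_identifiers) for record in records]
    if not keys:
        return 0.0
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(keys)


# Made-up records for illustration only.
records = [
    {"birth_year": 1990, "postcode": "1081", "gender": "f"},
    {"birth_year": 1990, "postcode": "1081", "gender": "f"},
    {"birth_year": 1985, "postcode": "1012", "gender": "m"},
    {"birth_year": 1972, "postcode": "1081", "gender": "f"},
]

# More fields per subject usually means more unique records.
print(unique_record_share(records, ["gender"]))
print(unique_record_share(records, ["birth_year", "postcode", "gender"]))
```

Records that are unique on such fields are the ones most exposed to linkage with outside sources, so coarsening or removing those fields during processing is one way to lower the risk category.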

Confidentiality Risk versus Privacy Risk

You may also need to consider whether your data need to be kept confidential. Even if your data are not about human subjects, they may be confidential, e.g. business secrets or intellectual property. If you are working with a third party, especially a business, they may require you to keep the data confidential. To help you assess the potential confidentiality risks in your research, see the last section of this data classification tool on confidentiality. If at least one of your answers gives you a high risk for confidentiality, your confidentiality risk is high. If none of your answers are high risk, but you have at least one medium risk answer, your confidentiality risk is medium. You can generally map the confidentiality risks to the privacy risks as:

High confidentiality risk = Red or orange data
Moderate confidentiality risk = Yellow data
Low confidentiality risk = Green or blue data

If you are working with data where both privacy and confidentiality apply, categorize the data in the higher of the two risk categories (e.g. if the confidentiality risk is low but the privacy risk is very high, choose the red data category).
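The confidentiality rules in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not part of the data classification tool: the names are hypothetical, answers are assumed to be recorded as "low", "medium" or "high" (with "medium" corresponding to moderate in the mapping), a set of answers with no high or medium entries is assumed to be low, and where the mapping gives two colours the lower one is used as a minimum so that you can still choose the higher one.

```python
# Colours ordered from lowest to highest risk, following the tables above.
RISK_ORDER = ["blue", "green", "yellow", "orange", "red"]

# Assumption: the lowest colour of each mapped range serves as a minimum.
CONFIDENTIALITY_FLOOR = {"low": "blue", "medium": "yellow", "high": "orange"}


def confidentiality_risk(answers):
    """Overall confidentiality risk: the highest level among the answers."""
    if "high" in answers:
        return "high"
    if "medium" in answers:
        return "medium"
    return "low"


def combined_category(privacy_colour, answers):
    """Return the higher of the privacy colour and the confidentiality minimum."""
    floor = CONFIDENTIALITY_FLOOR[confidentiality_risk(answers)]
    return max(privacy_colour, floor, key=RISK_ORDER.index)


# Low confidentiality risk but very high privacy risk: the data stay red.
print(combined_category("red", ["low", "low"]))
```

When the confidentiality risk is high, the result is only a lower bound: whether the data should be treated as orange or red remains a judgement call.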