Data about human beings are very rarely anonymous. Even if the data do not contain information such as name or address, the data may still pose privacy risks to the people they are about, especially if it is possible to re-identify individuals with the data you are using in your research.
The table below can be used to determine privacy risk categories for different types of data. There will always be grey areas when looking at privacy risk, particularly when considering the vulnerability of the research subjects, the sensitivity of the information and the re-identifiability of the data; if in doubt, opt for a higher risk category. Also, not all of the data in your research will have the same level of privacy risk and the privacy risks for each type of data may change as you clean, code and modify the data from a raw to processed form. Assess the risks for each separate data asset with consideration for how each data asset changes during the various stages of the research life cycle.
Once you’ve determined the privacy risk category for each type of data you will use in your research, you can use these categories to inform your choices about how data can be de-identified, safely used by students/interns, safely transported physically, securely transferred digitally, and stored securely.
Privacy Risk | Very high-risk |
Description | • Data from vulnerable subjects, that are about sensitive topics and are either fully identifiable or are easily re-identified |
Impact of a Breach | • Serious harm to research subjects and/or serious damage to the reputation of the VU could occur • Likelihood that such harm/damages could occur after a breach is very high |
Examples | • Video interviews with children talking about abuse • Raw transcripts of interviews with refugees about their home country • Open text responses (e.g. diary-type feedback) from patients with mental or physical health problems • Open text responses or detailed interviews with employees describing their satisfaction with their employer • Raw neuroimages of vulnerable subjects that haven’t been de-faced • Genetic data from vulnerable subjects that indicates risk for disease or disorders |
Privacy Risk | High-risk |
Description | • Fully identifiable data about benign information from non-vulnerable subjects OR • Data that are fully identifiable or that could be fairly easily re-identified that are about benign topics from vulnerable subjects OR • Data that are fully identifiable or that could be fairly easily re-identified that are about sensitive topics from non-vulnerable subjects |
Impact of a Breach | • Serious harm to research subjects and/or serious damage to the reputation of the VU could occur • Likelihood that such harm/damages would occur after a breach is moderate, but if the harm/damage does occur the consequences would be severe |
Examples | • Key files containing names and contact information of research subjects (vulnerable or not) • Data containing date of birth and 6-digit postal code of research subjects (vulnerable or not) • Video observations of children playing • Video observations of team-building activities • Raw neuroimages of non-vulnerable subjects that haven’t been de-faced • Raw questionnaire data about sensitive topics and/or from vulnerable subjects containing detailed demographic information • Genetic data from non-vulnerable subjects |
Privacy Risk | Moderate-risk |
Description | • Data that could be fairly easily re-identified that are about benign topics and from non-vulnerable subjects OR • Data that can only be re-identified with great effort and that are about: - benign topics from vulnerable subjects; - sensitive topics from non-vulnerable subjects; OR - both |
Impact of a Breach | • Severity of harm to research subjects and/or damage to the reputation of the VU is moderate to high, but the likelihood that such harms or damages would occur is low |
Examples | • IP- and MAC-addresses of research subjects (vulnerable or not) • Raw questionnaire data about benign topics from non-vulnerable subjects containing detailed demographic information • Questionnaire data about sensitive topics and/or vulnerable populations that have been processed to make re-identification more difficult • Video recordings with faces blurred and voices modified • Transcripts of interviews in which the identifying information is replaced with pseudonyms • Repeated physical measurements that include the dates and times measurement occurred • Neuroimaging from vulnerable subjects that has been de-faced • Extensive kinematic measurements of vulnerable subjects or that are used to identify sensitive information such as abnormal movement patterns |
Privacy Risk | Low-risk |
Description | • Data that can only be re-identified with great effort and that are about benign topics from non-vulnerable subjects |
Impact of a Breach | • Harm to research subjects is minimal and the likelihood that harm would occur is very low • Damage to the reputation of the VU may still occur although it is less likely and the impact would be lower |
Examples | • Blue data in which a random identification code is attached to each record so that every record can be re-identified with the help of a key file • Data that contain unique records for some or all research subjects: - Neuro-imaging from non-vulnerable subjects that has been de-faced - Extensive kinematic measurements from non-vulnerable subjects - Any other measurements that contain sufficient information to create a unique profile for one or more research subjects - Questionnaires about benign topics and answered by non-vulnerable research subjects that have been processed to be less identifiable, but which still contain demographic information about each subject |
Privacy Risk | Little to no risk |
Description | • Data that cannot be re-identified whatsoever, regardless of the vulnerability of the subjects or the sensitivity of the information |
Impact of a Breach | • Research subjects suffer no direct harms and the VU will suffer no damage to its reputation |
Examples | • Highly variable physical measurements, e.g. blood pressure, heart rate, blood glucose, body temperature • Likert scale responses in questionnaire data • Coded qualitative data • Summary statistics NB: If any of the above examples are part of a record about a research subject that contains data from a higher risk category, then the data are not anonymous. They are only anonymous if they are not linkable to the higher risk data. |
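The descriptions in the table can be read as a decision rule over three properties of a data asset: the vulnerability of the subjects, the sensitivity of the topic, and how easily the records can be re-identified. The sketch below is one possible reading of that rule, not an official tool. The function name, the enum, and the identifiability labels (`"none"`, `"great_effort"`, `"fairly_easy"`, `"full"`) are hypothetical. It also assumes that "easily" and "fairly easily" re-identified describe the same tier, and that "both" in the moderate-risk row means sensitive topics from vulnerable subjects.

```python
from enum import IntEnum


class PrivacyRisk(IntEnum):
    """Risk categories from the table, ordered from lowest to highest."""
    LITTLE_TO_NONE = 0
    LOW = 1
    MODERATE = 2
    HIGH = 3
    VERY_HIGH = 4


# Hypothetical labels for how easily records can be re-identified.
IDENTIFIABILITY = ("none", "great_effort", "fairly_easy", "full")


def classify_privacy_risk(vulnerable: bool, sensitive: bool,
                          identifiability: str) -> PrivacyRisk:
    """Map the three properties onto a category, following the Description rows."""
    if identifiability not in IDENTIFIABILITY:
        raise ValueError(f"unknown identifiability: {identifiability!r}")

    # Data that cannot be re-identified at all: vulnerability and
    # sensitivity no longer matter.
    if identifiability == "none":
        return PrivacyRisk.LITTLE_TO_NONE

    # Number of aggravating factors present (0, 1 or 2).
    aggravating = int(vulnerable) + int(sensitive)

    if identifiability == "great_effort":
        # Low only for benign topics from non-vulnerable subjects.
        return PrivacyRisk.MODERATE if aggravating else PrivacyRisk.LOW

    if identifiability == "fairly_easy":
        return (PrivacyRisk.MODERATE,
                PrivacyRisk.HIGH,
                PrivacyRisk.VERY_HIGH)[aggravating]

    # Fully identifiable data are at least high-risk.
    return PrivacyRisk.VERY_HIGH if aggravating == 2 else PrivacyRisk.HIGH
```

For example, `classify_privacy_risk(vulnerable=False, sensitive=True, identifiability="great_effort")` returns `PrivacyRisk.MODERATE`. A rule like this cannot resolve the grey areas mentioned above; if the inputs themselves are uncertain, choose the more cautious value for each.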
Vulnerable research subjects have an additional risk of harm (socially, physically, emotionally, financially) if their personal information is made public. The greater the vulnerability of the research subjects, the greater the potential for serious harm.
Vulnerable research subjects include, but are not limited to:
The vulnerability of the research subjects can also depend on the context of the research, e.g. employees in organizational psychology research; students in learning analytics research. These contextual risks can also compound the risks for research subjects with other “typical” vulnerability characteristics, e.g. employees who are immigrants.
Data are personal data if they are directly identifying or indirectly identifiable:
As long as data are identifiable, they cannot be referred to as anonymous data and the legal rules of the GDPR must be followed.
Not all identifiable data pose the same level of privacy risk: oftentimes, raw data pose higher privacy risks because of greater re-identifiability (e.g. video recordings), and therefore require additional data protection measures (such as high security storage); as the data are cleaned, coded and analysed they become less identifiable (e.g. coded interactions) and require fewer data protection measures. De-identification is therefore an important part of data processing that can be used to protect the privacy of your research participants, since it’s not usually possible to change the vulnerability of the research subjects or the sensitivity of the research topics.
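One common de-identification step is the one mentioned in the low-risk examples: replacing direct identifiers with a random identification code and keeping the link in a separate key file. The sketch below illustrates the idea only. The function name, the `name` field, and the `subject_code` field are hypothetical, and a real project would also need to handle indirect identifiers such as date of birth or postal code.

```python
import secrets


def pseudonymise(records, id_field="name"):
    """Replace a direct identifier with a random code.

    Returns (coded_records, key_file). The key file maps each original
    identifier to its code and should be stored separately from the data.
    """
    key_file = {}       # original identifier -> random code
    used_codes = set()  # guards against the (unlikely) duplicate code
    coded_records = []

    for record in records:
        identifier = record[id_field]
        if identifier not in key_file:
            code = secrets.token_hex(4)
            while code in used_codes:
                code = secrets.token_hex(4)
            used_codes.add(code)
            key_file[identifier] = code

        # Copy every field except the direct identifier, then add the code.
        coded = {k: v for k, v in record.items() if k != id_field}
        coded["subject_code"] = key_file[identifier]
        coded_records.append(coded)

    return coded_records, key_file


records = [
    {"name": "Subject A", "score": 12},
    {"name": "Subject B", "score": 7},
    {"name": "Subject A", "score": 15},
]
coded_records, key_file = pseudonymise(records)
```

Data coded in this way can still be re-identified with the key file, so they remain personal data rather than anonymous data. The key file itself contains the direct identifiers and belongs in a higher risk category than the coded data.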
You may also need to consider whether your data need to be kept confidential. Even if your data are not about human subjects, they may be confidential, e.g. business secrets or intellectual property. If you are working with a third party, especially a business, they may require you to keep the data confidential. To help you assess the potential confidentiality risks in your research, see the last section of this data classification tool on confidentiality. If at least one of your answers gives you a high risk for confidentiality, your confidentiality risk is high. If none of your answers are high risk, but you have at least one medium risk answer, your confidentiality risk is medium. You can generally map the confidentiality risks to the privacy risks as:
If you are working with data where both privacy and confidentiality apply, categorize the data in the highest applicable risk category (e.g. if the confidentiality risk is low but the privacy risk is very high, choose the red data category).
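The two rules above (the highest answer determines the confidentiality risk, and the highest category wins when privacy and confidentiality both apply) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, answers are assumed to be the strings `"low"`, `"medium"` or `"high"`, and a set of answers with no medium or high entries is assumed to mean low risk. The mapping from confidentiality risk to privacy risk categories is not reproduced here, so `highest_category` expects the confidentiality risk to have been translated to a privacy category already.

```python
CONFIDENTIALITY_LEVELS = ("low", "medium", "high")

# Privacy risk categories from the table, ordered from lowest to highest.
PRIVACY_CATEGORIES = (
    "little to no risk",
    "low-risk",
    "moderate-risk",
    "high-risk",
    "very high-risk",
)


def confidentiality_risk(answers):
    """Overall confidentiality risk: the highest level among the answers."""
    for answer in answers:
        if answer not in CONFIDENTIALITY_LEVELS:
            raise ValueError(f"unknown answer: {answer!r}")
    if "high" in answers:
        return "high"
    if "medium" in answers:
        return "medium"
    return "low"  # assumption: no medium or high answers means low risk


def highest_category(privacy_category, mapped_confidentiality_category):
    """Return the higher of two categories expressed on the privacy scale."""
    return max(
        privacy_category,
        mapped_confidentiality_category,
        key=PRIVACY_CATEGORIES.index,
    )
```

For example, `confidentiality_risk(["low", "medium", "low"])` returns `"medium"`, and `highest_category("very high-risk", "low-risk")` returns `"very high-risk"`.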