Risk Assessment and De-identification
Activity: That's PII
If you haven’t already, play Level 6 in the League of Data game: https://lod.sshopencloud.eu/LodGame/ It helps you think about what sort of information you should anonymize. Source: SSHOC. (2020). Data Publication Challenge [video game]. Social Sciences and Humanities Open Cloud (SSHOC) League of Data (LOD). https://lod.sshopencloud.eu/
--------------
Now that we’ve talked about ethics and sharing, one key component involved in preparing data for sharing is conducting a risk assessment. This can also be called assessing the risk of disclosure. Disclosure occurs when an individual participant in your study is directly or indirectly identified. You don’t want that to happen! It is very often the researcher’s responsibility to conduct a risk assessment and act accordingly to adequately protect their data.
Here are some resources you may find helpful:
- UNC-Wilmington facilitated a webinar in 2022 introducing disclosure risk, as well as how to assess and remediate it. The recording (1 hour) and PPT slides are available: https://library.uncw.edu/love_data_week#risk
- The Sensitive Data Expert Group of the Portage Network in Canada developed this Human Participant Research Data Risk Matrix (version 2, as of 10/1/2020), which breaks down data classification levels by how they should be handled at different stages of the research data process (including data sharing).
- The University of Michigan Open Data DEIA Toolkit (version 1.0, last updated July 2021) lists recommended resources related to DEIA at different stages of the research data process (including data sharing).
- The Poverty Action Lab (2020) explains and provides examples of relevant identifier variables for various processes of de-identification, including:
- Encoding identifiers with anonymous ID numbers
- Removing or partitioning identifiers by separating identified and de-identified data
- Redacting values
- Aggregating variables
- Geographic masking (jittering or displacement) by offsetting spatial data points in a systematically random manner
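To make the first two approaches above concrete, here is a minimal Python sketch of encoding identifiers with anonymous IDs while partitioning the linkage information from the de-identified data. The field names, records, and ID format are invented for illustration and are not part of the J-PAL guide:

```python
# Hypothetical participant records; all names and fields are invented.
records = [
    {"name": "Alice Smith", "age": 34, "response": "Sample answer A"},
    {"name": "Bob Jones", "age": 29, "response": "Sample answer B"},
]

# Build a linkage table mapping names to anonymous IDs, and a
# de-identified copy of the data with the name field removed.
# The linkage table would be stored separately, under restricted access.
linkage = {}
deidentified = []
for rec in records:
    # Sequential IDs are simple for a demo, but they can leak enrollment
    # order; randomly generated IDs are safer in practice.
    anon_id = linkage.setdefault(rec["name"], f"P{len(linkage) + 1:04d}")
    deid = {k: v for k, v in rec.items() if k != "name"}
    deid["participant_id"] = anon_id
    deidentified.append(deid)
```

The key design point is the separation: the de-identified file can be shared, while the linkage table stays behind whatever access controls your protocol requires.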
You will need to decide what information to keep for its usefulness in analysis and what can be removed or altered (or, ideally, never collected in the first place) while still keeping the data useful for reuse by others.
Here are some basic methods that can be done to anonymize or protect the data:
- Remove, mask, generalize, or pseudonymize identifiers (ex: occupation instead of specific position title)
- Reduce precision/detail of identifiers (ex: birth year/decade instead of date of birth)
- Recode geographic areas to higher levels (ex: county/region instead of zip code)
- Remove or mask unique outliers
- Recategorize and collapse codes if some coded categories are too small (ex: 9 = “other”)
- Create ranges/bins for variables (ex: ages 0-9 and 10-19, or “child” and “teenager”)
- Combine variables (ex: school in North Carolina, an aggregated location, instead of East Chapel Hill High School, an individual place name)
- Report in the aggregate instead of at individual levels
- If a variable is sensitive and not essential, consider removing it altogether
- If you use “find” and “replace” techniques on qualitative data, be aware this may not catch everything!
- If you make replacements like pseudonyms or removals, use brackets to mark these interventions in the cleaned copy
- Check for potential “hidden” metadata, such as metadata in image files or in NVivo analysis files
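Several of the methods above (reducing precision, recoding geography, creating bins) can be sketched in a few lines of Python. The records, field names, and cut-offs here are illustrative assumptions, not a prescribed standard:

```python
# Toy records; fields and values are invented for illustration.
rows = [
    {"dob": "1987-04-12", "zip": "27514", "age": 34},
    {"dob": "2001-11-03", "zip": "27510", "age": 20},
]

def generalize(row):
    """Return a copy of the row with identifiers reduced in precision."""
    decade = row["age"] // 10 * 10
    return {
        "birth_year": row["dob"][:4],          # full date of birth -> year only
        "zip3": row["zip"][:3] + "XX",         # 5-digit zip -> 3-digit area
        "age_band": f"{decade}-{decade + 9}",  # exact age -> 10-year bin
    }

cleaned = [generalize(r) for r in rows]
```

In a real project the bin widths and geographic levels would follow your risk assessment (and any applicable standard, such as HIPAA Safe Harbor), not fixed values like these.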
Beyond what is listed above, know that there are more complex “Statistical Disclosure Control” methods for de-identification. These should be employed if you are especially concerned about risk from indirect identifiers; ask your IRB office to direct you toward these “Expert Determination” methods rather than relying only on “Safe Harbor” standards for direct identifiers.
Remember FAIR? Part of Reusability is writing documentation. There is value in documenting the steps you took in anonymizing your data—not revealing what specifically you changed from what into what, but recording the thematic processes you applied, such as removing the names of municipalities and deciding that municipalities will now appear as distinct numbers. Here is an example of documentation about a de-identification process done on a research project.
How possible is it to re-identify data that has been anonymized? Here’s a 2019 article examining how feasible re-identification of a dataset can be: Estimating the success of re-identifications in incomplete datasets using generative models
You may want to double check on your risk assessment after you conduct de-identification. Here are some quick checks you can make:
- Check the consent form
- Check the codebook, data dictionary, or other documentation of variables and data elements to get a review of potential leftover identifiers. Is there documentation about which variables were transformed or de-identified?
- Review data for remaining direct identifiers, especially dates and geography
- Look for indirect identifiers that could possibly link to external datasets
- Run descriptive statistics and crosstabs on quantitative data to see if there are any unique cases and combinations of variables
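The last check above—looking for unique combinations of variables—can be sketched as a simple frequency count over quasi-identifiers. This is a minimal Python illustration; the records and field names are assumptions, and real projects often use dedicated statistical disclosure control tools instead:

```python
from collections import Counter

# Illustrative de-identified records; field names are invented.
rows = [
    {"age_band": "30-39", "county": "Orange", "occupation": "teacher"},
    {"age_band": "30-39", "county": "Orange", "occupation": "teacher"},
    {"age_band": "20-29", "county": "Durham", "occupation": "nurse"},
]

# Count how many records share each combination of quasi-identifiers.
quasi = ("age_band", "county", "occupation")
counts = Counter(tuple(r[q] for q in quasi) for r in rows)

# Any combination held by only one record is a unique case that could
# potentially be linked to a participant via an external dataset.
unique_cases = [combo for combo, n in counts.items() if n == 1]
```

Flagged unique cases are candidates for further generalization, collapsing, or removal before the data is shared.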
Activity: De-identify a transcript
Download or make a copy of one of the sample interview transcripts in this Raw Interviews folder—you do not have edit access to these originals. Read through the transcript and try to make changes to anonymize the text.
As you are making changes, document your process. If you like, you can use this Open Data Release Form (you will have to download or make a copy of the Open Data Release Toolkit document, as you don’t have edit access).
Once you’re done, compare your changes against the transcripts in the Cleaned Interviews folder.
--------------------------
Sources
- Darragh, J. (2022, October 11). Sharing Human Participants Data: Challenges and Potential Solutions [Presentation]. Duke University Libraries.
- Darragh, J., Hofelich Mohr, A., Hunt, S., Woodbrook, R., Fearon, D., Moore, J., & Hadley, H. (2020, March 2). Human Subjects Data Essentials Data Curation Primer. Data Curation Network GitHub Repository.
- Dunning, T., & Camp, E. (2015). 0_ Camp_Anonymization Protocol [data documentation]. Brokers, voters, and clientelism: The puzzle of distributive politics. Qualitative Data Repository. https://doi.org/10.5064/F6Z60KZB/0NR0VZ.
- Emmelhainz, C., & Sackmann, A. (2022, February 15). What do I do with all of this text? Cleaning and coding data for qualitative analysis [Presentation]. UC Berkeley Love Data Week 2022. https://docs.google.com/presentation/d/1Twsi1Zmt2kXscXsFGKz-gQNBUEk1CDmp6FpXrFmXpU0/edit?usp=sharing
- Finkle, E. (2016). Open Data Release Toolkit: Privacy Edition [Guide, version 1.2]. DataSF. https://datasf.org/resources/open-data-release-toolkit/
- ICPSR. (n.d.). Guide to Social Science Data Preparation and Archiving: Best Practice Throughout the Data Life Cycle, 6th Edition [Report]. https://www.icpsr.umich.edu/web/pages/deposit/guide/index.html
- Kopper, S., Sautmann, A., & Turitto, J. (2020, January). J-Pal Guide to De-identifying Data. Abdul Latif Jameel Poverty Action Lab. https://www.povertyactionlab.org/resource/data-de-identification
- Office for Civil Rights (OCR). (2012, November 26). Guidance Regarding Methods for De-identification of Protected Health Information in Accordance with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule [Guidance]. U.S. Department of Health & Human Services. https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- Qualitative Data Repository. (n.d.). De-Identification. https://qdr.syr.edu/guidance/human-participants/deidentification
- Tarrant, D., Thereaux, O., & Mezeklieva, V. (2020, June). Anonymising data in times of crisis [Report]. Open Data Institute (ODI). https://theodi.org/article/anonymising-data-in-times-of-crisis/
- UK Data Service. (n.d.). Research Data Management: Anonymisation. https://ukdataservice.ac.uk/learning-hub/research-data-management/#anonymisation