Learning Data Ethics for Open Data Sharing

Risk Assessment and De-identification

Activity: That's PII

If you haven’t already, play Level 6 in the League of Data game (https://lod.sshopencloud.eu/LodGame/). It helps you think about what sort of information you should anonymize.
Source: SSHOC. (2020). Data Publication Challenge [video game]. Social Sciences and Humanities Open Cloud (SSHOC) League of Data (LOD). https://lod.sshopencloud.eu/
--------------

Now that we’ve talked about ethics and sharing, let’s turn to a key part of preparing data for sharing: conducting a risk assessment, also called assessing the risk of disclosure. Disclosure occurs when an individual participant in your study is directly or indirectly identified. You don’t want that to happen! It is very often the researcher’s responsibility to conduct a risk assessment and act on it to adequately protect their data.

Here are some resources you may find helpful:
Once you have done a basic risk assessment to determine what sensitive information your dataset contains, you can decide whether to restrict access to the dataset or anonymize it (i.e., remove the sensitive information). Restricted access can be a viable option: it preserves the original variables, and the data are released only to approved users who agree to conditions that protect the study participants’ confidentiality. A restricted-use agreement may require the prospective researcher to summarize their research question, explain why they need access to the dataset, and provide a data protection plan describing how they will safeguard the data for a specified period and then destroy or return the data once the study is complete.

You will need to decide what information to keep because it is useful for analysis, and what can be removed or altered (or, ideally, never collected in the first place) while still leaving the data useful for reuse by others.

Here are some basic methods you can use to anonymize or protect the data:

Check out this simplified infographic “5 Things to check for data deidentification” (note that further on the page is some sample R code you can use for step 5).

Beyond what is listed above, know that there are more complex “Statistical Disclosure Control” methods for de-identification, which should be employed if you are especially concerned about risk from indirect identifiers; ask your IRB office to direct you towards these “Expert Determination” methods rather than just “Safe Harbor” standards for direct identifiers.

Remember FAIR? Part of Reusability is writing documentation. There is value in documenting the steps you took to anonymize your data. The point is not to reveal exactly what you changed from what into what, but to describe the general processes you applied, such as removing the names of municipalities and deciding that the municipalities will now appear as distinct numbers. Here is an example of documentation about a deidentification process done on a research project.
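The municipality example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `recode_categories`, the column names, and the municipality names are all hypothetical, and the shuffle is there so that the codes do not simply follow alphabetical order.

```python
import random

def recode_categories(records, column):
    """Replace each distinct value in `column` with an arbitrary integer code.

    Returns the recoded records and the mapping used. The mapping is the
    re-identification key, so it should be stored securely and not published.
    """
    values = sorted({row[column] for row in records})
    random.shuffle(values)  # avoid codes that mirror alphabetical order
    mapping = {value: code for code, value in enumerate(values, start=1)}
    recoded = [{**row, column: mapping[row[column]]} for row in records]
    return recoded, mapping

# Hypothetical sample data; the municipality names are invented.
records = [
    {"respondent": 1, "municipality": "Alder Town", "age_group": "30-39"},
    {"respondent": 2, "municipality": "Birch Village", "age_group": "40-49"},
    {"respondent": 3, "municipality": "Alder Town", "age_group": "20-29"},
    {"respondent": 4, "municipality": "Cedar City", "age_group": "30-39"},
]

recoded, mapping = recode_categories(records, "municipality")
for row in recoded:
    print(row)
```

In your documentation you would then state only that municipality names were replaced with arbitrary numbers, without publishing the mapping itself.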

How possible is it to re-identify data that has been anonymized? Here’s a 2019 article estimating how likely re-identification is, even in an incomplete dataset: “Estimating the success of re-identifications in incomplete datasets using generative models”


You may want to double-check your risk assessment after you conduct de-identification. Here are some quick checks you can make:
  1. Check the consent form.
  2. Check the codebook, data dictionary, or other documentation of variables and data elements to review potential leftover identifiers. Is there documentation about which variables were transformed or de-identified?
  3. Review the data for remaining direct identifiers, especially dates and geography.
  4. Look for indirect identifiers that could possibly link to external datasets.
  5. Run descriptive statistics and crosstabs on quantitative data to see if there are any unique cases and combinations of variables.
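As a rough illustration of check 5, the sketch below counts how many records share each combination of indirect identifiers and reports the combinations that fall below a threshold. The function name `small_cells`, the variable names, the sample records, and the threshold `k=2` are assumptions made for this example; an appropriate threshold depends on your data and on your institution's guidance.

```python
from collections import Counter

def small_cells(records, quasi_identifiers, k=2):
    """Return the value combinations shared by fewer than k records."""
    counts = Counter(
        tuple(row[name] for name in quasi_identifiers) for row in records
    )
    return {combo: n for combo, n in counts.items() if n < k}

# Hypothetical sample data.
records = [
    {"age_group": "30-39", "sex": "F", "region": "North"},
    {"age_group": "30-39", "sex": "F", "region": "North"},
    {"age_group": "40-49", "sex": "M", "region": "South"},
    {"age_group": "40-49", "sex": "M", "region": "South"},
    {"age_group": "70-79", "sex": "F", "region": "South"},
]

risky = small_cells(records, ["age_group", "sex", "region"])
for combo, n in risky.items():
    print(combo, "appears", n, "time(s)")
```

Here the only combination flagged is the single respondent aged 70-79 in the South region. A unique combination like this is a candidate for recoding into broader categories or for suppression.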
---------------------------------

Activity: De-identify a transcript

Download or make a copy of one of the sample interview transcripts in this Raw Interviews folder (you do not have edit access to the originals).

Read through the transcript and try to make changes to anonymize the text.

As you are making changes, document your processes. You can use this Open Data Release Form if you like (you will have to download or make a copy of this Open Data Release Toolkit document, as you don’t have edit access).

Once you’re done, compare your changes against the transcripts in the Cleaned Interviews folder.
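If you would like to try part of this activity programmatically, here is a minimal Python sketch of placeholder substitution that also keeps a change log. The function name `redact`, the sample sentence, the names in it, and the placeholder labels are all invented for illustration. Note that a search like this only finds terms you already know about, so it does not replace a careful read-through.

```python
import re

def redact(text, replacements):
    """Replace each listed term with its placeholder.

    Returns the redacted text and a log counting how often each placeholder
    was used. The log deliberately omits the original terms.
    """
    log = {}
    for term, placeholder in replacements.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        text, count = pattern.subn(lambda match: placeholder, text)
        log[placeholder] = log.get(placeholder, 0) + count
    return text, log

# Hypothetical transcript excerpt; list longer terms before shorter ones.
transcript = "I met Maria Lopez at Riverside Clinic. Maria said it was busy."
replacements = {
    "Maria Lopez": "[PARTICIPANT 1]",
    "Maria": "[PARTICIPANT 1]",
    "Riverside Clinic": "[CLINIC A]",
}

cleaned, log = redact(transcript, replacements)
print(cleaned)
print(log)
```

The log records which kinds of replacements were made and how often, which is the level of detail suited to a release form, while the list of original terms stays private.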
 --------------------------
