Digital Humanities Research Institute: Binghamton 2019

Data Cleaning: OpenRefine

DATA CLEANING: OPENREFINE
With Nancy Um

All datasets require a certain amount of cleaning and humanities datasets, in particular, can be particularly messy. Common errors include spelling or formatting inconsistencies, and trailing white spaces that are hard for the human eye to pick up. OpenRefine is a free and effective tool that is designed to aid in cleaning data. It can also be used to reformat data by splitting, combining, or transposing columns, rows, and cells. This lesson will walk through some of OpenRefine’s major functions. If time permits, participants will have the opportunity clean their own datasets as well.

WORKING WITH OPENREFINE

1. Download the Sample Dataset, AM_data.csv. Save it to your desktop.
2. Download OpenRefine 3.1 (download).
3. Open the application, the icon looks like a diamond.
4. Click "Create Project" (OpenRefine considers each cleaning job to be a "project" and will save your work even after you close the application).
5. Select the file from your desktop AM_data.csv.
6. The data will appear in the open window. Click "Create Project" again (in the upper right corner).
7. Let's begin by working with the column entitled "Name". Click on the arrow to the left of the word "Name". Select "Facet" and choose "Text Facet". If it tells you that you have exceeded the limit, increase the limit to 5000.
8. On the left, all of the names in this dataset will appear, clustered alphabetically. Scroll down and take a look at this list. Do you see any possible errors? For instance are Abhey Singh and abhey Singh the same person? How about Abhey?
9. If you wish to edit any of the values, hover over the word and click edit and change as needed.
10. Now let's work with address column. Click the arrow next to "Address". When the menu drops down click "Edit Cells" and then select "Cluster and Edit". Take a look at the clusters and decide if these terms should be clustered. Do they represent distinct values? If not, click the merge box. When you are done with the list, click "Merge Selected." These values will be changed.
11. Now click on "Pickup Done Date." Click "Edit Column". Select "Split into Several Columns". The separator is a space. Now you have two separate columns, one for date and the other for time. You should rename the new column!

CASE STUDY IN MESSY DATA

This dataset was scraped from Twitter using Tweepy, a Python script. It contains a series of tweets from December 2018. For details on how to use Tweepy, see the CUNY DHRI curriculum: https://github.com/DHRI-Curriculum/twitter-api

1. Download the dataset AD_data.csv to your desktop.
2. Open it up with Excel and take a look at it. Note that most of the data is clustered in the fifth column under the heading "User." In order to understand this data more clearly, we are going to have to break that column into many columns.
3. We can start by breaking up this column into smaller parts. Let's just do that in Excel.
4. Select the column "User", then click "Data" and select "Text to Columns." In the window that appears click "Delimited". and "Next". For the delimiter, click "Comma." Notice how most of the fields are separated by commas. Click "Next" and then "Finish". When it asks if you want to replace values in the cells, click "yes". The data that appeared under the Column "User" should now be distributed over several columns.
5. Name the newly created columns, with titles such as ID1, ID2, Name, Screen Name, location 1, location 2, URL, Description, Followers, Friends
6. Now let's try to run it through OpenRefine to clean it further.
7. Download OpenRefine (download).
8. Open the application, the icon looks like a diamond.
9. Click "Create Project" (OpenRefine considers each cleaning job to be a "project" and will save your work even when you leave).
10. Select the file from your desktop AD_data.csv.
11. The data will appear. Click "Create Project" again (in the upper right corner).
12. Let's clean up the column "Name".
13. Click on the small arrow next to "Name".
14. Select "Edit Cells" and then "Transform".
15. Use a GREL expression to clean this column. Try value.split(" ")[1]. Now the extraneous text before the value in each row of this column has been removed. 
16. Try the same expression for other columns. Click on the tab that says "History" in order to simply reuse an earlier expression.
17. If you wish to remove the single quotation marks around each word, try replace(value,"'",""). You can try this for each column.
18. For more GREL expressions, see https://github.com/OpenRefine/OpenRefine/wiki/GREL-Functions

MORE EXPERIENCE WITH OPENREFINE

If you would like to practice using OpenRefine, walk through Miriam Posner’s excellent tutorial, which includes a sample dataset.
Miriam Posner, “Get Started with OpenRefine,” Introduction to Digital Humanities, UCLA, Fall 2017.
http://miriamposner.com/classes/dh101f17/tutorials-guides/data-manipulation/get-started-with-openrefine/

ADDITIONAL RESOURCES

For more resources and advanced functions in OpenRefine, consult these two excellent tutorials:

This page references: