If you’re here, you’ve probably already setup a new project and created this notebook file - but if not, go to the square on the top right of RStudio to create a new project.


This document steps through an OpenRefine tutorial (see https://goo.gl/mQC7oe) adapted from other materials, in R…it demonstrates how fantastic OpenRefine is in many respects!

This resource is adapted from: 1. http://enipedia.tudelft.nl/wiki/OpenRefine_Tutorial#Clean_up_country_names (under a CC-By-SA 3.0 license) C.B.Davis, A. Chmieliauskas, G.P.J. Dijkema, I. Nikolic (2015), Enipedia, http://enipedia.tudelft.nl, Energy & Industry group, Faculty of Technology, Policy and Management, TU Delft, Delft, The Netherlands. 2. And http://schoolofdata.org/handbook/recipes/cleaning-data-with-refine/ Both under a CC-By-SA license. To cite, Knight, S., (2016). Cleaning Data with Refine,


Open Refine (previously Google Refine) is a data cleaning software that uses your web browser as an interface. This means it will look like it runs on the internet but all your data remains on your machine and you do not need internet connection to work with it. The main aim of Refine is to help you exploring and cleaning your data before you use it further. It is built for large datasets – so don’t worry as long as your spreadsheets can keep the information: Refine can as well. To work with your data in Refine you need to start a new project: Walkthrough: Creating a Refine project 1. Start Refine – this will open a browser window pointing to http://127.0.0.1:3333 if this doesn’t happen open the link with your browser directly 2. Create a new project: On the left tab select the “Create Project” tab:

  1. Click on “Choose Files” to choose your downloaded file and click on “next” – you can also use the URL to the CSV directly if your data is hosted on the web.
  2. You will get a preview on how refine will interpret your data – if you have selected a well formatted CSV or other file: this should be pretty automatic.
  3. Review the preview carefully to make sure the data looks right. Double check character encoding. Much, but not all data uses UTF-8 these days, but make sure you don’t see any funny characters in preview.
  4. You may want to turn off “guess data types”, particularly if you have data that contains leading zeros in numbers or identifiers which are significant.
  5. Name your project in the box on the top right side and click on “Create Project”

  6. The project will open in the project view, this is the basic interface you are going to work with: by default refine shows only 10 rows of data, you can change this on the bar above the data rows. Also you can use the navigation on the right to see the next or previous rows. You now have successfully created your first Refine project. Remember: although it runs in a web-browser, the Refine server is still on your machine – all the data is there (so no worries if you handle sensitive information)

#The following code roughly mirrors this bit in OpenRefine (stringsAsFactors=F is broadly equivalent to telling OpenRefine not to predict the datatypes)
unis <- read.csv("https://goo.gl/EJxa20", header=T, stringsAsFactors=F)

Once we created our project, let’s go and explore the data and the Refine interface a bit. Using Refine might be intimidating at first, since it seems so different from spreadsheets, once you get used to it you will notice how easily you can do things with it. One of the commonly used functions in spreadsheets is sorting and filtering data – to figure out minima, maxima or things about certain categories. Refine can do the same thing. Walkthrough: Sorting rows 1. Refine handles data similar to a spreadsheet: you have rows, columns and cells – a cell is a field defined by a row and a column. 2. To sort your rows based on a specific column click on the small downward triangle next to the column

  1. Select “Sort…” to open the sorting dialog
  2. You can select what to sort the values as and then what order to sort in. (We’ll sort in text, since for now we only have text columns)
  3. Click “OK” and your rows will be sorted based on the column
  4. To undo the sort, click on the column options again, select “sort” then “remove sort”
#probably more useful than sorting, is to view the head or tail of a file
head(unis)
unis[1:100,]
tail(unis,100)
tail(unis)
colnames(unis)
 [1] "university"   "endowment"    "numFaculty"   "numDoctoral" 
 [5] "country"      "numStaff"     "established"  "numPostgrad" 
 [9] "numUndergrad" "numStudents" 
#we can also sort it
unis <- unis[with(unis, order(country, university)), ]
head(unis)
unis

Walkthrough: Facetting rows based on a column 1. Select the column options for the column you want to facet with 2. Select “Facet”

  1. You can facet differently for text, numbers or dates – let’s facet as text – click on “Text facet”
  2. This will open a facet in the left bar

  3. Now select one or more of the choices and you’ll see how your data rows are limited to just those selected.
  4. Of course you can add more than one facet and thus filter more than once.

#in R this is probably most easily achieved using the subset function
#see http://www.statmethods.net/management/subset.html 
subset(unis, university == "Cardiff University" | university == "Acadia University", select = c("university", "country", "established"))