This is, by no means, an exhaustive treatment on how to clean data in R. Rather, it is meant to get you started and give you some tools and insight that will help you to get a handle on the data you are getting ready to analyze. In that light, the first lesson is that, if at all possible, keep your data in R.
Until you become familiar with the various features of R and what is available to you there, you will likely be tempted to go back to earlier habits such as cleaning the data in Excel or a similar spreadsheet. There, be dragons.
If you use a spreadsheet and modify data by hand, you are much more likely to commit one of the unrecoverable cardinal sins of data cleaning and manipulation: sorting a single row or column independently of the rest, or losing track of the modifications you have made.
Sorting a single column will cause your data to lose context. At that point, the only thing that you can do with it is report summary statistics and call it a day.
This is because a data file should tell a story, or rather, many stories simultaneously. If you sort even one of the variables without including the others, it is similar to substituting one person’s information into someone else’s story. It fundamentally changes the story for each observation and destroys the value of the data.
The second issue that I have with spreadsheet programs is that it is easy to lose track of the changes you have made. Keeping a record of every change is tedious, and most people do not take the time to note which observations were recoded at any given point. As a result, it is generally impossible to roll back individual changes if you discover later that you should not have made them.
Using a program like R is far superior to a spreadsheet because sorting individual columns and rows will not result in changes to the original data, unless you go to extra effort to make it so. Additionally, any changes you do make can be recorded in a script. You may then choose whether you will save the cleaned and reformatted data, or simply run the cleaning script each time you perform analyses on that set of data. The first option will give you consistency and the second will give you more flexibility. Both are common.
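As a rough sketch of those two workflows (the file and object names here are hypothetical):

# Option 1: run the cleaning script once and keep the cleaned data
source("clean_survey.R")                  # builds a cleaned data frame, e.g., survey
save(survey, file = "survey_clean.rda")   # later sessions simply load("survey_clean.rda")

# Option 2: re-run the cleaning script at the top of every analysis
source("clean_survey.R")                  # always start from the raw file, then analyze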
The act of cleaning, formatting, and transforming data is likely the most fundamental task of any analyst or data scientist. If you are interviewing or working with someone who claims that they do not clean data, leave. They are either a prima donna, or a plug-and-play analyst who fails at simple tasks like checking the assumptions or validity of the tests and procedures they are running. (Never assume that an analysis worked just because it ran.) They will waste your time and money.
That brief rant aside, it is a common truism that data science projects, and analytics in general, consist of 80% data cleaning and preparation and 20% analysis. In many cases, this is a best case scenario. You will frequently run into situations where the original intent of the data collection is no longer possible and you are being asked to “rescue” what remains. In other cases, you will have data that were never meant for the uses you have planned. But, you can format them in a manner that will allow you to make the inquiry that you are planning.
Even in ideal circumstances, like a downloaded data set that was already cleaned, you will frequently find it necessary to thoroughly vet the data you have received before you can be sure that it is usable.
What follows is a simplified process and a catalog of tools - most of them in base R - that should come in handy when evaluating a set of data. At the moment, this is a first pass; I will add more as time and necessity allow.
As with any other skill, it helps to have some sort of structure in mind when you are approaching a new data set. Generally speaking, we can organize data cleaning into five basic processes.
In reality, there is no strict order in which these processes must be applied, and some should be iterated throughout. Still, this should serve as a rough guide to how you may wish to think about making sense of the data that were dumped in your lap and getting them ready for use.
The data set used here is a portion of an actual survey that was collected on the Monterey peninsula in 2013. The topic was issues around undocumented migration. If you would like to follow along, the data are accessible through this link.
Version control is a primary concern for anyone who has modified the original data file and then found that they could not undo some boneheaded mistake. If you are human, then version control is for you.
The idea behind version control is to have multiple archived versions of the data on hand so that you may roll back changes that you have made and no longer wish to keep, or, more radically, to start fresh with the original data.
If you use Dropbox or GitHub, you are already using their built-in version control systems. But, even then, it is a good idea to have multiple versions of your data, provided it is not a big data project.
As you proceed through the steps of data cleaning, it is a good idea to periodically create stop points where you can check your data integrity and make sure that you haven’t inadvertently made a fundamental change to the data.
Saving the data as an R object on disk can also achieve the same result. When doing this, it helps to incorporate the date, and sometimes the time, into the file name.
save(mydata, file="datafile_mo_day_yr_2400hrs.rda")
Make it easier on yourself by automating this: use the paste function to add the current date and time (Sys.time) to the file name.
save(mydata, file = paste0("data_", format(Sys.time(), "%Y_%m_%d_%H%M"), ".rda"))
The other part of version control is keeping track of changes that are made to the data. This is actually an automatic part of data cleaning, if you are working in R.
To keep track of the modifications that you make to the data, save the script you use when doing so. This will allow you to modify the data again, with comparatively little effort. Such scripts come in very handy if data are updated when you download them at a later time. If you are working with “live” data, then you will need a script for preprocessing before static analyses are run.
You will also benefit from having a data cleaning script that is associated with a particular data set if you change your mind about any given part of the data cleaning process. It is easy enough to “turn off” any portion of a cleaning script by making it a comment.
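For example, a cleaning script might contain lines like the following (the variable names are hypothetical); commenting out a line turns that step off while preserving a record of it:

mydata$age[mydata$age == 999] <- NA       # recode the missing-data flag to NA
# mydata$income <- log(mydata$income)     # turned off: keep income on its raw scale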
Archival was listed first because of its ongoing role in the cleaning process; it should happen early and often. The real first step, however, should have already taken place at this point: you need to get to know the data.
Start by reading through the data description in the data dictionary. If there is no data dictionary, make one as best you can; it will save you from having to puzzle out what the various variables are at a later date.
When you read the data dictionary, check for some critical items:
After reading carefully through the data dictionary, it is up to you to verify that categorical variables are read into R as factors and that continuous variables are read in as numbers (dbl or int). It is very common to have to tell R that variables coded with numbers to indicate categories are actually factors, and it is similarly normal to have to clean a continuous variable before R will read it as a dbl or int.
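Both fixes usually amount to a one-line conversion. Here is a minimal sketch, assuming hypothetical variables named party (coded 1, 2, 3) and income (read in as character because of dollar signs and commas):

mydata$party <- factor(mydata$party, levels = c(1, 2, 3),
                       labels = c("Democrat", "Republican", "Other"))  # numeric codes to a labeled factor
mydata$income <- as.numeric(gsub("[$,]", "", mydata$income))           # strip symbols, then convert to numeric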
This, of course, means examining the data once it has been ported into R. Some important functions for reviewing the data include:
Getting a quick read on how well the data read into R:

dim() – The “shape” of the data: for data frames, how many rows and columns are present
head() – View the first six rows of the data object
tail() – View the last six rows of the data object
class() – What type of object is the data reading in as? (e.g., data frame, matrix, list, …)

Double check the variable names:

names() – List the variable names, in the order that they appear
ls() – List the variable names, in alphabetical order

Look a little more closely into the values present in each variable and how each is classified (categorical or numeric):

str() – The internal structure of the data, with examples of what is present in each variable
summary() – A summary of each of the variables in the data

Consider plotting the numeric variables.
boxplot(mydata[ , sapply(mydata, is.numeric)])
It is also helpful to get to know the categorical variables. You can view frequency tables for just the categorical variables.
lapply(mydata[, sapply(mydata, is.factor)], table)
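Pulled together, a quick first pass over a freshly loaded data set might look like this:

dim(mydata)       # number of rows and columns
class(mydata)     # should be a data frame
head(mydata)      # eyeball the first six rows
names(mydata)     # variable names, in order
str(mydata)       # the type and a preview of each variable
summary(mydata)   # ranges, counts, and NAs for every variable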
Once you feel that you have a pretty good idea of what is going on in the data, it is probably time to make some changes. A few base R functions that come in handy at this stage include:

tolower() and toupper() – Convert character values to all lower case or all upper case
grep() – Find the values that match a given pattern
as.numeric() – Convert a variable to numeric
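As a rough sketch of how these might be used, with hypothetical variable names:

mydata$city <- tolower(mydata$city)                 # standardize case: "Salinas", "SALINAS" -> "salinas"
grep("refused", mydata$comment, value = TRUE)       # pull out the responses that contain "refused"
mydata$zip <- as.numeric(as.character(mydata$zip))  # a factor must pass through character before as.numeric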