Adapted from: “Tricks for cleaning your data in R” by Christine Zhang
Christine Zhang is a writer for the Baltimore Sun, and has created some amazing data-related journalism.
Last edited: September 18th, 2018 by Kyle Monahan
kyle.m.monahan at gmail dot com
Before starting, ensure the associated .R file (ENV170_Tutorial#3_Final.R) and the following data files are in the same folder:
If you are looking at this from the S:/ drive, you’ll find them here:
S:\Classes\Env170_EnviDataAnalysisViz\Week 3\Class 6\Tutorial #3
For more information on how to access the S:/ drive from home, see the link here:
[go.tufts.edu/mountdrive]
Note: If you aren’t affiliated with Tufts University and would like to access this data, please email me at the email above. Alternatively, you can download the data from the original source, and use this document to perform the analysis.
First, we need to tell R that our files are saved in the same location. In order to do this, you should click through the following, starting on the top of the screen at the menu: Session --> Set Working Directory --> To Source File Location
Alteratively, we could use the Files tab to navigate to the source file (ENV170_Tutorial#3_Final.md), using the three dots (...) icon to change the directory. Then we can go to Files Tab > More > Set As Working Directory. This is what we did when we worked on Tutorial #2, but the first approach is much faster. Setting your working directory is like setting up the desk for your analysis: bringing together your Thanks to Christine Zhang for saving us all time!
At this point, if you can see the data files listed above under Getting Started in the Files Tab of RStudio, you’re in the correct workspace.
This is especially important to do when you first start out on a project, so RStudio will remember where it should look for files in the Files tab.
Importing raw data from local data sources is a great way to start cleaning and working with data in RStudio.
The pacakge dplyr contains a huge number of functions that we will use for data cleaning. Remember that a package is like an app for your iPhone - it adds functions and extends the functionality of RStudio in predictable ways. To learn more about dplyr, just type ??dplyr in the Console.
Cleaning strings, or text data, is another important part of working with data in RStudio/Excel. In this tutorial, we will be using RStudio for our analysis, but the approach is similar in Excel.
Describe best practices for data cleaning and entry into RStudio.
Apply best practices arrange variables and observations in RStudio.
Learn to clean real-life datasets for import into a statistical program.
Understand the key differences between RStudio and Excel, and when to use each program.
Learn the usefulness of the tidy package and the classic approach to data analysis in R (AKA What I wish someone taught me earlier)
Prepare each student for working with their own datasets in RStudio
The goals of this tutorial are to gain experience in importing data, cleaning data, and preparing data for analysis in RStudio using the R statistical programming language. In this document, you’ll use examples of real-life “messy” datasets, the problems they present for analysis in R, and the “tidy” solutions to these problems.
You might note we keep referring to
tidydata, originally developed by Hadley Wickham. This is a specific type of structure to data formatting and analysis, and is what we worked towards in our previous assignments. Little did you know, in both of our previous assignments we have been learning a set of principles that are not only important to analysis in R, but to analysis more broadly. They are, to start: 1) Each variable forms a column 2) Each observation forms a row 3) Each type of observational unit forms a table For more information ontidydata guidelines, see here: https://cran.r-project.org/web/packages/tidyr/vignettes/tidy-data.html
In this section of the tutorial, we will clean and organize the data for import into R. We will be working with the Boston Employee Earnings Report, drawn from the city’s (Open Data Portal)[https://data.boston.gov/]. In the words of Christine:
" The Employee Earnings Report is one of the more interesting ones, because it gives payroll data for every person on the municipal payroll.
It’s where the Boston Globe gets stories like these every year:
“64 City of Boston workers earn more than $250,000” (February 6, 2016)
“Police detective tops Boston’s payroll with a total of over $403,000” (February 14, 2017)
Let’s take at the February 14 story from this year. The story begins:
“A veteran police detective took home more than $403,000 in earnings last year, topping the list of Boston’s highest-paid employees in 2016, newly released city payroll data show.”
What if we wanted to check this data using the Employee Earnings Report?"
They even have a great graphic, showing the total annual earnings, which we could reproduce, if we had the source data. Where can we find this source data? [TASK]: Think to yourself - where would you look for this data? How would you search?
To find this data source, I followed our troubleshooting guide [https://tufts.box.com/v/EffectiveProblemSolving]:
Searched for the object I was looking for in Google: boston open data portal earnings report
Found the source, in the second result: [https://data.boston.gov/dataset/employee-earnings-report]
This source will work well! You could reproduce this tutorial with the 2017 report as well, but we will stick with the 2016 data. [TASK]: Scroll down on the page above and look at the additional info - they even provide metadata! A rare find.
If we wanted to check on the validity of the claims made by the Boston Globe, or even investigate further claims ourselves, we first need to import the data.
If you had your own dataset as a file on your computer, you could use the same process of:
Organize your data (and metadata) in a folder on your H drive for your project.
Set the location of your research data as the working directory in RStudio as in section 1.1 in this tutorial.
Import the data… as below!
Now that RStudio can “see” the data, we can import the data by using the read.csv() function. This is in the base R functionality, so we don’t have to import a package to use it. Either copy and paste or type the following into RStudio:
salary <- read.csv('employee-earnings-report-2016.csv')
You’ll see the data has been read into RStudio, under the Environment tab. We can see the number of obs. or observations, the number of variables, etc. We got our data in! Let’s see how clean and tidy the data is.
If we want, we can investigate the first six rows in this set of data, called salary with the following code:
head(salary)
This gives us the first few observations of the data that we have imported.
But we need to work with the data. The variable names, such as NAME, DEPARTMENT_NAME, etc. are in capitals. We would really like to organize and standardize these labels. To do this, we first need to select out our variables of interest using the select() package, just as we did in the in-class activity using the RTutor software to investigate the Clean Air Act.
First, to use the package dplyr, we load the package. note that we
# install.packages('dplyr') # if you don't already have the 'dplyr' package
library('dplyr') # load the dplyr package
## Warning: package 'dplyr' was built under R version 3.4.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
salary.selected <- select(salary, NAME, DEPARTMENT_NAME, TOTAL.EARNINGS) # The data frame (salary), and then the variables to select
In my output, some errors warning me that the package dplyr was build under a different version. This is a common error to see, and means that we need to update our version of dplyr. The objects that are masked means that there is another package with the same function name loaded, and we would have to specify the package:
base::intersectto use the function. We will dive into that more later.
Note in your Environment that an object appears, called salary.selected. Let’s look at that.
head(salary.selected)
But these are still captialized! We will have to adjust these. To access the column names of a dataframe, we can use the names() function.
names(salary.selected) <- tolower(names(salary.selected)) # change variable names to lowercase
Now we have replaced the column names of salary.selected with lowercase names, using the tolower() function. Nice! We can check that with the function below. This makes it much easier for typing.
head(salary.selected)
We might want to sort the dataset by relevant variables, such as total earnings. We can do that with the arrange function.
salary.sort <- arrange(salary.selected, # dataset to sort
total.earnings) # variable to sort by
## Warning: package 'bindrcpp' was built under R version 3.4.4
head(salary.sort)
The arrange function is a part of dplyr, so we could also run:
salary.sort <- dplyr::arrange(salary.selected, # dataset to sort
total.earnings) # variable to sort by
head(salary.sort)
We can always specify the package we are working with as: package::function() as we mentioned previously in class. This is useful if we are using multiple functions with the same name, and can happen in more advanced coding projects. This is the reason we were warned about masking before.
So back to our results. What went wrong with this? How can the highest salary in Boston be $1000.00? It’s clear there are much higher salaries.
Well, in order to investigate total earnings and why it is sorting incorrectly, we have to look at the class. The class stores what type of data the variable (column) is holding.
class(salary.selected$total.earnings)
## [1] "factor"
It’s a factor variable, when it should be a numeric variable.
To address this, we will have to use another package. Package alert - try typing ?stringr to learn more about what this package, called stringr does.
Type
?stringrto investigate this package.
It’s a wrapper around a package that helps us manuipulate text information (A.K.A. strings). A wrapper just means it’s a function around a function - it provides easier access to more complex coding functions. It’s easier to identify a candy bar by the wrapper rather than what’s inside, and that’s what a wrapper here does - wraps around the core functions to increase usability.
In order to convert our factor into a numeric variable, we have to clean out the characters that aren’t numbers. This means the commas and dollar signs ($).
To clean out the commas in the dataset we can take the following approach:
# install.packages('stringr') # if you don't already have the 'stringr' package
library('stringr') # load the stringr package
## Warning: package 'stringr' was built under R version 3.4.4
salary.selected$total.earnings <- str_replace(
salary.selected$total.earnings, # column we want to search
pattern = ',', # what to find
replacement = '' # what to replace it with
)
head(salary.selected)
And we can see that the commas are gone! Let’s try to replace the dollar signs ($) with the same approach:
salary.selected$total.earnings <- str_replace(
salary.selected$total.earnings, # column we want to search
pattern = '$', # what to find
replacement = '' # what to replace it with
)
head(salary.selected)
But that didn’t work! Why not?
Well the dollar sign, like many symbols, is a special character. These special characters have to be dealt with in a … well … special manner. Other special characters are * . ? { } [ ] / | etc. These are fairly complicated in practice and important to consider removing in order to have tidy data.
In order to remove these, we have to escape the special symbol, so R treats it as a text to search for and to disregard the special attributes of $. To escape a character, we can use the double slash \\.
salary.selected$total.earnings <- str_replace(
salary.selected$total.earnings, # column we want to search
pattern = '\\$', # what to find
replacement = '' # what to replace it with
)
head(salary.selected)
Great! So now we have removed the commas, and the dollar signs. We are on the way to clean data. Let’s try to sort the salaries again:
salary.sort <- arrange(salary.selected,
total.earnings)
head(salary.sort)
This still didn’t work! Why not? Let’s check the class of total.earnings again.
class(salary.selected$total.earnings)
## [1] "character"
So it’s a character vector. Still not numeric, as we would expect salaries to be.
In order to force R to treat it as a number, we can use the following code:
salary.selected$total.earnings <- as.numeric(salary.selected$total.earnings)
The code is equivalent to right clicking on a column of data in Excel and going to Format > Currency.
class(salary.selected$total.earnings)
## [1] "numeric"
Finally, numeric! Now sort with arrange:
salary.sort <- arrange(salary.selected,
total.earnings)
head(salary.sort) # ascending order by default
But we would like to sort by descending, to see the highest paid employees:
salary.sort <- arrange(salary.selected,
desc(total.earnings)) # descending order
head(salary.sort) # Waiman Lee from the Boston PD is the highest paid city employee
We find that the highest paid employee in the Boston city workforce is Waiman Lee.
Now would be a good time to introduce
%>%, known as the pipe operator.%>%is an extremely valuable tool in R, because it allows functions to be chained rather than nested.%>%looks strange but can be read as “then”—it tells R to do whatever comes after it to the stuff comes before it. You can think of it as a pipe moving the output from one function right into the next function.
salary.average <- salary.selected %>% # take the salary.selected data frame, THEN
group_by(department_name) %>% # group by department_name, THEN
summarise(average.earnings = mean(total.earnings)) # calculate the mean of total.earnings for each department_name and name this average.earnings
head(salary.average) # first six rows of average salary by department (alphabetical order)
Without piping the code would be more compact but less readable:
summarise(group_by(salary.selected, department_name), average.earnings = mean(total.earnings))
head(salary.average) # first six rows of average salary by department (alphabetical order)
Okay, so let’s investigate where the Boston Police Department sits. We can use our pipe (%>%) and a filter for the variable we want.
The nice part about filter is that it is very efficent - just like searching in a database!
salary.average %>% filter(department_name == 'Boston Police Department')
Let’s say we wanted to compare a persons salary with their average salary for their department. To do that, we would have to merge the data.
salary.merged <- merge(x = salary.sort, y = salary.average, by = 'department_name')
head(salary.merged)
We can also take a look at a scatter plot to compare them:
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.4.4
ggplot(salary.merged, aes(x=total.earnings, y=average.earnings)) +
geom_point()
Well that isn’t helpful at all! We will have to reshape data. To learn about that, we will switch to a new dataset - on unemployment rates.
Here’s a dataset on unemployment rates by country from 2012 to 2016, from the International Monetary Fund’s World Economic Outlook database, available here:
[https://www.imf.org/external/pubs/ft/weo/2017/01/weodata/index.aspx]
When you download the dataset, it comes in an Excel file. We can use the read_excel() from the readxl package to load the file into R.
# install.packages('readxl') # if you don't already have the 'readxl' package
library('readxl') # load the readxl package
## Warning: package 'readxl' was built under R version 3.4.4
unemployment <- read_excel('unemployment.xlsx')
## Warning in strptime(x, format, tz = tz): unknown timezone 'default/America/
## New_York'
We need to change the format from “wide” to “long,” meaning that the columns (2012, 2013, 2014, 2015, 2016) will be converted into a new variable, which we’ll call Year, with repeated values for each country. And the unemployment rates will be put into a new variable, which we’ll call Rate.Unemployed.
We’d like the data to look like this:
To do this, we’ll use the gather() function in tidyr to create a new data frame, unemployment.long.
We load the tidyr package using library(‘tidyr’):
# install.packages('tidyr') # if you don't already have the 'tidyr' package
library('tidyr') # load the tidyr package
## Warning: package 'tidyr' was built under R version 3.4.4
unemployment.long <- gather(unemployment, # data to reshape
Year, # column we want to create from the rows
Rate.Unemployed, # the values of interest
-Country # already a column in the data
)
head(unemployment.long)
Great, we now have created a long dataset. But there is an issue! Rate.Unemployed is not a numeric variable, it is a <chr> or character variable. We can see that in the head() output, or check class():
class(unemployment.long$Rate.Unemployed) ## "character", not "numeric"
## [1] "character"
We can use the same function as before… as.numeric()
unemployment.long$Rate.Unemployed <- as.numeric(unemployment.long$Rate.Unemployed)
## Warning: NAs introduced by coercion
Oh no! It says “NAs introducted by coercion”. What does this mean? Use your troubleshooting guide to find out!
Now Rate.Unemployed is a numeric variable! Nice! We can look at the distrubution of this variable now.
ggplot(data=unemployment.long, aes(Rate.Unemployed)) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 15 rows containing non-finite values (stat_bin).
This is technically a panel dataset, where we have information on each group (country) over time (year). We might want to look at unemployment across countries, and we can use the arrange().
unemployment.long <- arrange(unemployment.long, # data frame to sort
Country, Year) # variables to sort by
We can also do this with a pipe operator:
unemployment.long <- unemployment.long %>% # Take the unemployment.long data frame, THEN
arrange(Country, Year) # sort it by Country and then Year.
head(unemployment.long, 5) # First five rows of the data
For Albania, the percentage point change in unemployment rate from 2012 to 2013 would be 16 - 13.4 = 2.5 percentage points. What if I wanted the year-over-year change in unemployment rate for every country?
This is an example where having a tidy dataset really helps. We can use the mutate() function in dplyr to create a new variable, Change, which is the difference between Rate.Unemployed and lag(Rate.Unemployed) (the default for lag() is 1 position, which is good for us since we want the change from the previous year).
unemployment.long <- unemployment.long %>%
mutate(Change = Rate.Unemployed - lag(Rate.Unemployed))
We can inspect the first five rows:
head(unemployment.long, 5)
We might want to look at the last five rows of data as well. It’s always important to check out both sides. Let’s take a look, using the tail function.
tail(unemployment.long, 5)
Wow - why does Vietnam have a -18.493 percentage point change in 2012?
Perhaps we should have grouped by country!
unemployment.long <- unemployment.long %>%
group_by(Country) %>%
mutate(Change = Rate.Unemployed - lag(Rate.Unemployed))
tail(unemployment.long, 5)
This is why these tidy data operators are so important.
We will use the attendees of the original workshop (thanks, Christine!) to work with some real data!
attendees <- read.csv('attendees.csv', stringsAsFactors = F)
head(attendees)
Let’s take a look at the age distrution of the group:
table(attendees$Age.group)
##
## 18-29 30 - 39 30-39
## 4 1 7
Note that 30 - 39 and 30-39 should be the same variable. We have to fix this!
To do this, we can use an ifelse() statement:
attendees$Age.group <- ifelse(attendees$Age.group == '30 - 39', # if attendees$Age.group == '30 - 39'
'30-39', # replace attendees$Age.group with '30-39'
attendees$Age.group) # otherwise, keep attendees$Age.group values the same
table(attendees$Age.group)
##
## 18-29 30-39
## 4 8
Now let’s take a look at the professional status of attendees:
table(attendees$Choose.your.status)
##
## Nonprofit, Academic, Government
## 3
## Nonprofit, Academic, Government Early Bird
## 1
## Professional
## 3
## Student
## 5
“Nonprofit, Academic, Government” and “Nonprofit, Academic, Government Early Bird” seem to be the same. We can use ifelse() (and the R designation | for “or”) to combine these two categories into one big category, “Nonprofit/Gov”. Let’s create a new variable, status, for our simplified categorization.
The symbol | is also used for or in logic, structured query languages, and many other uses. It’s great to know!
attendees$status <- ifelse(attendees$Choose.your.status. == 'Nonprofit, Academic, Government' |
attendees$Choose.your.status. == 'Nonprofit, Academic, Government Early Bird',
'Nonprofit/Gov',
attendees$Choose.your.status.)
table(attendees$status)
##
## Nonprofit/Gov Professional Student
## 4 3 5
For each of these, submit all code as text in a word document or an .R script, if applicable. The more you document your approach, the more partial credit can be given, so submitting code is highly suggested.
You might have noted that we obtained metadata on the Boston Earnings Report 2011 to 2016. Navigate to the data source link and state the license of this dataset. Could we use this dataset as a part of our final products and safely reproduce it in poster form? Find a source that talks about the license and cite it.
Let’s say you were just told that the values in salary$REGULAR are all much higher, and the period should be replaced with a comma. Starting from the initial dataframe salary from the read.csv function, how would you convert the values to a larger number, replacing the period with a comma? Hint: You have to remove the existing commas first!
Caluclate the mean of salary$total.earnings for each title. Hint: You’ll see to use the group_by() function. Report the code you used and the value.
What is the average salary for the Assessing Department? Report the code you used and the value.
Who has the highest salary in the Assessing Department? Did they get overtime pay? Report the code you used and the value.
What postal code has the highest total.earnings? Report the code you used and the value.
How would you use ifelse() and | to create a new variable in the attendees data (let’s call it status2) that has just two categories, “Student” and “Other”?
How would you rename the variables in the attendees data to make them easier to work with? Report the code you would use.
List at least two (2) other interesting questions we could answer by exploring the unemployment dataset from section 5.0. No code is needed here.
What does the error “NAs introduced by coercion” mean?