Introduction to R and RStudio: Data Wrangling (dplyr) and Tidy Data (tidyr)

What is R and RStudio?:

When someone refers to R, they could mean two things: 1) the open source statistical programming language and 2) a software environment in which to use the language. R (the language) is comparable to a coding language like Python. It allows programmers to put complex statistical (and visual) ideas into practice. The “command line interface” included within R (the software package) gives statisticans and data scientists much more flexibility compared to other statistical softwares such as Stata, SPSS, or Minitab. Rather than clicking buttons, you must enter code which tells the software exactly what you want it to do. While this feels unnecessary to most of us, it becomes incredibly valuable when you start carrying out more complex analyses.

The original R (software) interface is straight out of the 1990s and is pretty intimidating to look at. It consists of a single “command line interface,” and is difficult to understand unless you are intimatly familiar with the R language. Because of this, I (and most people at this point) prefer RStudio.

RStudio is effectively an “add on.” It repackages the R software in a much more user friendly (and 21st century-feeling) application. Note: RStudio requires that R (the original software) is already installed on your computer.

Pre-downloading note: these two programs are not that large and you should not have any issues computer-memory wise. That being said, reach out to me at zellespe@grinnell.edu if this is and issue and I will help you with remote options.

Downloading R:

Use this link to download the original R software (do this first!): #https://mirror.las.iastate.edu/CRAN/

You will be prompted to download the correct software for your operating system (Windows, MacOS, or Linux). Once you select your operating system.

For Mac users, you will be prompted to download the correct version of the software for your OS version. This file will have a .pkg extension.

For Windows users, you will be prompted to “install R for the first time.” Click this link, and then select “Download R 3.6.3 for Windows.”

If you have any issues with this, please email me at “zellespe@grinnell.edu” and I will try to set you on the right path!

Downloading RStudio:

Go to this link and download the “free desktop” version of RStudio: #https://rstudio.com/products/rstudio/download/

Opening the Software:

Open R first. After installation is complete and you have opened the software, you will see the command line interface that I mentioned before. Close the window. You don’t need to see that again thanks to RStudio.

Next, open RStudio. This is what you will be working in. You will see that there are 3-4 tabs/boxes/panels that appear within the app. The one marked “consol” is where you can write independent lines of code. The box marked “environment” is where you see any data that you have loaded into RStudio. The box marked “files/plots/packages/etc.” is a catch all window that displays any graphical output that you create, any packages that you have loaded (we’ll get to that), and a help menu.

There is a 4th window that will appear once you load this file (that you’re reading now) into RStudio. This file type is called an RMD (R MarkDown) file. It allows you to combine written notes and R code into a single document. If you are reading this as an html file (I sent two links: one RMD and one html/PDF), try opening the RMD file. Find the file and “open with” RStudio. It should appear as a 4th box within RStudio. This will make it much easier to run the provided code.

Important note: If you loaded these instructions into RStudio as an RMD, the green “play” symbol at the top right had corner of each code box allows you to run the included code. Alternatively, you can copy and paste the code from the html document (found in the light gray or pink boxes) into the console in RStudio.

If you copy the code into the console, you can use the “up” arrow key on your keyboard to recall the previous line of code.

Lets get started.

Packages

Since R is open source, there are very few built in functions. Because of this, we must rely on packages. Packages are created by R users (professors, data scientists, and even private companies) in order to accomplish specific tasks in an efficient way. We will be using two packages in this exercise: dplyr (the most common data cleaning package) and tidyr (a common package used for rearranging data).

Make sure that the following packages are loaded: Use “install.packages(”PACKAGENAME“)” to ensure that the packages are downloaded

This code ensures that the packages are downloaded

library(tidyr)     # contains tools to tidy data
library(dplyr)     # contains tools for data wrangling
library(ggformula) # for plotting / graphing
library(readr)     # a package for parsing data; you should already have this from reading in your initial data

This code ensures that the packages are loaded in Rs working library.

Datasets in R

Let’s start at the begining: loading data.

I prefer a package called “readr” to the base R version of loading data. Normally data would be stored on your computer, but this exercise relies on online data. It will automatically download using these commands:

UBSprices <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/UBSprices.csv", as.is = TRUE)
polls <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/rcp-polls.csv", na.strings = "--", as.is = TRUE)
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)

After importing the files, you should see them in the “environment” box. Click on them to open the datasheets. This is what data looks like in RStudio.

Tidying your data with tidyr

Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this exercise, we focus on how a data set should be formatted for analysis in R.

1. Definition of a tidy data set

In R, it is easiest to work with data that follow five basic rules:

Every variable is stored in its own column.
Every observation is stored in its own row—that is, every row corresponds to a single case.
Each value of a variable is stored in a cell of the table.
Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
There should be no extraneous information (footnotes, table titles, etc.).

A data set satisfying these rules is said to be tidy, a term popularized by Hadley Wickham.

Note: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.

This tutorial will describe the following tidyr commands, which can be thought of as verbs for tidying data: (Note: there are many more of these type of commands within the tidyr package)

Command	Meaning
`gather`	collapses multiple columns into two columns
`separate`	splits compound variables into individual columns

2. Tidying longitudinal data (`gather`)

UBS is an international bank that reports prices of various staples in major cities every three years. The data set in the UBSprices data set contains prices of a 1 kg bag of rice in 2003 and 2009 in major world cities. The data set was extracted from the alr4 R package.

head(UBSprices)

##        city rice2003 rice2009
## 1 Amsterdam        9       11
## 2    Athens       19       27
## 3  Auckland        9       13
## 4   Bangkok       25       27
## 5 Barcelona       10        8
## 6    Berlin       16       17

This data set is not tidy because each row contains two cases: the city in 2003 and the city in 2009. Additionally, the column names 2003 and 2009 contain the year, which should be the value of a variable. In order to tidy these data, we need to

Reorganize the data so that each row corresponds to a city in a specific year.
Create a single variable for the price of rice.
Add a variable for year.

To do this, we will use the gather function in the tidyr package. gather collapses multiple columns into two columns: a key column and a value column. The key will be the new variable containing the old column names and the value will contain the information recorded in the cells of the collapsed columns.

In our example, we want to collapse rice2003 and rice2009 into the key-value pair year and price. To do this, we use the following command:

tidy_ubs <- gather(data = UBSprices, key = year, value = price, rice2003, rice2009)
head(tidy_ubs)

##        city     year price
## 1 Amsterdam rice2003     9
## 2    Athens rice2003    19
## 3  Auckland rice2003     9
## 4   Bangkok rice2003    25
## 5 Barcelona rice2003    10
## 6    Berlin rice2003    16

Remarks

The first argument passed to gather should be the data frame being tidied. This is true for all of the tidyr functions we discuss in this tutorial.
After specifying the data frame, the next two arguments specify the column names you wish to give to two new columns. One column is called the key and the other is called the values.
After the first three arguments, specify the columns that you wish to collapse, separated by commas. Notice that the original column names are now listed in the key column and the original cell values are now all in one column.

To finish tidying these data, we need to modify the year column by removing the word rice from each cell.

To do this, we can use the parse_number function in the readr package. This function drops any non-numeric characters before or after the first number in a character string. We now have a data set that we can call tidy.

tidy_ubs$year <- parse_number(tidy_ubs$year)
head(tidy_ubs)

##        city year price
## 1 Amsterdam 2003     9
## 2    Athens 2003    19
## 3  Auckland 2003     9
## 4   Bangkok 2003    25
## 5 Barcelona 2003    10
## 6    Berlin 2003    16

Remark

This data set started in a relatively tidy form, so it may be difficult to see the benefit of tidying it. Tidy data are typically required for summarizing and plotting data in R. For example, consider making a side-by-side boxplot using ggformula (we learn more about plotting data in the ggplot and ggformula tutorials).

Note: the following code uses a technique known as “piping” (the %>% symbol). Piping allows users to shorten the amount of code written in functions by “piping” instructions from one line of code into the next. This is a relatively advanced technique and you are not expected to know how to do this by the end of the exercise.

tidy_ubs %>%
  gf_boxplot(price ~ factor(year)) %>%
  gf_labs(x = "year")

This was straightforward since tidy_ubs was already tidy, but would have required extra manipulation in the original format.

3. Tidying pollster data (`separate + gather`)

The polls data set contains the results of various presidential polls conducted during July 2016, and was scraped from RealClear Politics.

polls

##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5 - 7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Here, the data set is not tidy because

The Date column contains both the beginning and end dates. These should be stored in separate columns.
The Sample column contains two variables: the number of people in the sample and the population that was sampled, likely voters (LV) or registered voters (RV). These should be stored in separate columns.
The last four column names are values of candidate and party variables, which should be stored in their own columns.

To break a single character column into multiple new columns we use the separate function in the tidyr package.

To begin, let’s break the Date column into Begin and End columns:

tidy_polls <- separate(data = polls, col = Date, into = c("Begin", "End"), sep = " - ")
tidy_polls

##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5  7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Remarks

The first argument passed to separate is the name of the data frame to be tidied.
The second argument, col, specifies the name of the column to be split.
The third argument, into, specifies the names of the new columns. Note that since these are specific column names we are creating, they should be given in quotes.
R will try to guess how the values should be separated by searching for non-alphanumeric values; however, if there are multiple non-alphanumeric values this may fail. In this example, if we did not specify that sep = " - ", then R would erroneously use \ as the separator. To manually specify the separator between columns we can place the character(s) in quotes.
In sep = " - ", the spaces around - avoid excess whitespace in the resulting cell values.

We also need to separate the Sample column into size and population columns.

tidy_polls <- separate(data = tidy_polls, col = Sample, into = c("size", "population"), sep = " ")
tidy_polls

##                     Poll Begin  End size population MoE Clinton..D.
## 1               Monmouth  7/14 7/16  688         LV 3.7          45
## 2                CNN/ORC  7/13 7/16  872         RV 3.5          42
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0          42
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1          41
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5          40
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA          40
## 7       McClatchy/Marist   7/5  7/9 1053         RV 3.0          40
##   Trump..R. Johnson..L. Stein..G.
## 1        43           5         1
## 2        37          13         5
## 3        38           8         5
## 4        35          11         6
## 5        37           5         2
## 6        36           6         2
## 7        35          10         5

Next, we need to gather the last four columns into a candidate variable.

tidy_polls <- gather(data = tidy_polls, key = candidate, value = percentage, 7:10)
head(tidy_polls)

##                     Poll Begin  End size population MoE   candidate
## 1               Monmouth  7/14 7/16  688         LV 3.7 Clinton..D.
## 2                CNN/ORC  7/13 7/16  872         RV 3.5 Clinton..D.
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0 Clinton..D.
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1 Clinton..D.
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5 Clinton..D.
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA Clinton..D.
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Notice that instead of writing out the column names (Clinton..D., Trump..R., etc.) we can simply specify the column numbers—here 7:10 specifies that we are gathering columns 7 through 10.

Finally, we need to separate the candidate names from the political party.

tidy_polls <- separate(tidy_polls, candidate, into= c("candidate", "party"))

## Warning: Expected 2 pieces. Additional pieces discarded in 28 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].

head(tidy_polls)

##                     Poll Begin  End size population MoE candidate party
## 1               Monmouth  7/14 7/16  688         LV 3.7   Clinton     D
## 2                CNN/ORC  7/13 7/16  872         RV 3.5   Clinton     D
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0   Clinton     D
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1   Clinton     D
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5   Clinton     D
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA   Clinton     D
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Remark

In the last command we let R guess which separator to use. This worked, but resulted in a warning message—we’re lucky that it worked! There are many situations where the separator is too complex for R to guess correctly and it cannot be specified using a simple character in quotes. In such cases we need to use regular expressions (see the tutorial IntroStrings) to aid our data tidying, but that’s a topic for another tutorial. The important thing to note here is that you should always check that separate worked as you expected, don’t blindly trust it!

4. Homework Questions Part 1

the following two questions are from part 2 and use UBSprices

How are the number of rows adjusted by using the gather command? Use the dim(UBSprices) command to determine how many rows are in the UBSprices data set and dim(tidy_ubs) to determine how many are in the tidy_ubs data set).
How many rows would there be if used the gather command and the original UBSprices data set had five columns of years: rice2003, rice2006, rice2009, rice2012, and rice2015?

5. Additional Resources

RStudio’s data wrangling cheat sheet provides a nice summary of how to reshape data sets and a quick reminder of the definition of tidy data.
The tidyr vignette provides additional examples and elaborates on the capabilities of the tidyr package.

Data Wrangling in dplyr

Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.

Verb(s)	Meaning
`filter` and `slice`	pick specific observations (i.e. specific rows)
`arrange`	sort rows by a particular variable
`mutate`	add new calculated columns to a data frame
`summarize`	aggregate many rows into a single row

In this example we will explore how to use each of these functions, as well as how to combine them with the group_by function for groupwise manipulations.

To begin, let’s make sure that our data set and the dplyr package are loaded

library(dplyr)

Data: The file college2015.csv contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant. This data is not online and will have to be loaded manually. Use the “import dataset” dropdown button in the “environment” box. Select the college2015 data from your computer and import. It will appear in your environment.

colleges <- read.csv("~/Shared/F18MAT310/r-tutorials-master/data-wrangling/data/colleges2015.csv")
library(dplyr)

Note: The above code will not work for you. This is only included to create the RMD file. You will need to manually load the data as previously described.

To get a feel for what data are available, look at the first six rows

head(colleges)

##   unitid                             college    type       city state
## 1 100654            Alabama A & M University  public     Normal    AL
## 2 100663 University of Alabama at Birmingham  public Birmingham    AL
## 3 100690                  Amridge University private Montgomery    AL
## 4 100706 University of Alabama in Huntsville  public Huntsville    AL
## 5 100724            Alabama State University  public Montgomery    AL
## 6 100751           The University of Alabama  public Tuscaloosa    AL
##      region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1 Southeast        0.8989      17         17       4051 18888   0.2914
## 2 Southeast        0.8673      23         26      11200 19990   0.5377
## 3 Southeast            NA      NA         NA        322 12300   0.6667
## 4 Southeast        0.8062      25         26       5525 20306   0.4835
## 5 Southeast        0.5125      17         17       5354 17400   0.2517
## 6 Southeast        0.5655      25         27      28692 26717   0.6665
##   FYretention fedloan    debt
## 1      0.6314  0.8204 33611.5
## 2      0.8016  0.5397 23117.0
## 3      0.3750  0.7629 26995.0
## 4      0.8098  0.4728 24738.0
## 5      0.6219  0.8735 33452.0
## 6      0.8700  0.4148 24000.0

the last six rows

tail(colleges)

##      unitid                            college    type         city state
## 1771 443058            Family of Faith College private      Shawnee    OK
## 1772 443340       Williamson Christian College private     Franklin    TN
## 1773 444398 Baptist University of the Americas private  San Antonio    TX
## 1774 444413         Beis Medrash Heichal Dovid private Far Rockaway    NY
## 1775 444778      New Life Theological Seminary private    Charlotte    NC
## 1776 444990        Flagler College-Tallahassee private  Tallahassee    FL
##         region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1771 Southwest        0.6667      NA         NA         19 21795   1.0000
## 1772 Southeast            NA      NA         NA         72 20737   0.0000
## 1773 Southwest            NA      NA         NA        189 14542   0.2941
## 1774  Mid East        1.0000      NA         NA         96 16000   0.1000
## 1775 Southeast            NA      NA         NA        122 24685   0.4000
## 1776 Southeast            NA      NA         NA        469    NA       NA
##      FYretention fedloan  debt
## 1771          NA  0.2500    NA
## 1772      0.2500  0.4156    NA
## 1773      0.6190  0.0000    NA
## 1774      0.6667  0.0000    NA
## 1775      0.0000  0.8224    NA
## 1776          NA  0.6370 23638

and the structure of the data frame.

str(colleges)

## 'data.frame':    1776 obs. of  15 variables:
##  $ unitid       : int  100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
##  $ college      : Factor w/ 1760 levels "Abilene Christian University",..: 8 1411 35 1412 9 1334 63 67 66 132 ...
##  $ type         : Factor w/ 2 levels "private","public": 2 2 1 2 2 2 2 2 2 1 ...
##  $ city         : Factor w/ 1040 levels "Aberdeen","Abilene",..: 669 89 614 433 614 944 48 614 51 89 ...
##  $ state        : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ region       : Factor w/ 10 levels "Far West","Great Lakes",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ admissionRate: num  0.899 0.867 NA 0.806 0.512 ...
##  $ ACTmath      : int  17 23 NA 25 17 25 NA 20 26 25 ...
##  $ ACTenglish   : int  17 26 NA 26 17 27 NA 21 28 27 ...
##  $ undergrads   : int  4051 11200 322 5525 5354 28692 2999 4322 19761 1181 ...
##  $ cost         : int  18888 19990 12300 20306 17400 26717 NA 16556 23788 44512 ...
##  $ gradRate     : num  0.291 0.538 0.667 0.483 0.252 ...
##  $ FYretention  : num  0.631 0.802 0.375 0.81 0.622 ...
##  $ fedloan      : num  0.82 0.54 0.763 0.473 0.874 ...
##  $ debt         : num  33612 23117 26995 24738 33452 ...

6. Filtering rows

To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the colleges data set using the following code:

wi <- filter(colleges, state == "WI")
head(wi)

##   unitid                     college    type      city state      region
## 1 238193             Alverno College private Milwaukee    WI Great Lakes
## 2 238324              Bellin College private Green Bay    WI Great Lakes
## 3 238333              Beloit College private    Beloit    WI Great Lakes
## 4 238430 Cardinal Stritch University private Milwaukee    WI Great Lakes
## 5 238458          Carroll University private  Waukesha    WI Great Lakes
## 6 238476            Carthage College private   Kenosha    WI Great Lakes
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.7887      19         19       1833 30496   0.3852      0.7173
## 2        0.5556      25         24        285    NA   0.6786      0.7500
## 3        0.6769      26         28       1244 48236   0.7815      0.9228
## 4        0.8423      22         22       2680 37563   0.4162      0.7099
## 5        0.8114      24         24       3024 37963   0.5629      0.7598
## 6        0.7018      24         24       2874 44910   0.6501      0.7841
##   fedloan  debt
## 1  0.8784 33110
## 2  0.8145 18282
## 3  0.5784 26500
## 4  0.7178 27875
## 5  0.7108 27000
## 6  0.8048 27000

Remarks

The first argument given to filter is always the data frame (this is true for all the core functions in dplyr), followed by logical tests that the returned cases must pass. In our example, the test was whether the school was in Wisconsin, which is written as state == "WI".
We have to use == to indicate equality because = is equivalent to <-.
When testing character variables, be sure to use quotes to specify the value of the variable that you are testing.
To specify multiple tests, use a comma to separate the tests (think of the comma as the word “and”). For example,
```
```
smallWI <- filter(colleges, state == "WI", undergrads < 2000)
```
```
returns only those rows corresponding to schools in Wisconsin with fewer than 2,000 undergraduate students.
To specify that at least one test must be passed, use the | character instead of the comma. For example, the below test checks whether a college is in Wisconsin or Minnesota or Iowa, so it returns all of the colleges in Wisconsin, Minnesota, and Iowa.
```
```
WiMnIa <- filter(colleges, state == "WI" | state == "MN" | state == "IA")
```
```
You can use both | and , to specify multiple tests. For example, we can return all colleges with fewer than 2,000 undergraduate students in Wisconsin, Minnesota, and Iowa.
```
```
smallWIM <- filter(colleges, state == "WI" | state == "MN" | state == "IA", undergrads < 2000)
```
```
Common comparison operators for the tests include: >, >=, <, <=, != (not equal), and == (equal).
To remove rows with missing values, use the R command na.omit. For example,
colleges <- na.omit(colleges)

will reduce the data set to only rows with no missing values.
```
```
colleges <- filter(colleges, !is.na(cost))
``` 
```
will eliminate only rows with NA in the cost column.

Questions:

How many Maryland colleges are in the colleges data frame? (The abbreviation for Maryland is MD.)
How many private Maryland colleges with under 5000 undergraduates are in the colleges data frame?

7. Slicing rows

To extract rows 10 through 16 from the colleges data frame we use the slice function.

slice(colleges, 10:16)

##   unitid                       college    type         city state
## 1 100937   Birmingham Southern College private   Birmingham    AL
## 2 101073     Concordia College Alabama private        Selma    AL
## 3 101189           Faulkner University private   Montgomery    AL
## 4 101435            Huntingdon College private   Montgomery    AL
## 5 101453 Heritage Christian University private     Florence    AL
## 6 101480 Jacksonville State University  public Jacksonville    AL
## 7 101541                Judson College private       Marion    AL
##      region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1 Southeast        0.6422      25         27       1181 44512   0.6192
## 2 Southeast            NA      NA         NA        523 17655   0.2115
## 3 Southeast            NA      NA         NA       2358 28485   0.2287
## 4 Southeast        0.6279      20         22       1100 31433   0.4319
## 5 Southeast            NA      NA         NA         67 21160   0.0000
## 6 Southeast        0.8326      21         22       7195 19202   0.3083
## 7 Southeast        0.7388      20         23        331 27815   0.4051
##   FYretention fedloan  debt
## 1      0.8037  0.4939 27000
## 2      0.4103  0.9100 26500
## 3      0.5000  0.7427 23750
## 4      0.6196  0.7227 27000
## 5      1.0000  0.4839    NA
## 6      0.7112  0.6811 23500
## 7      0.5974  0.7110 26000

Remarks

To select consecutive rows, create a vector of the row indices by separating the first and last row numbers with a :.
To select non-consecutive rows, create a vector manually by concatenating the row numbers using c(). For example, to select the 2nd, 18th, and 168th rows use slice(colleges, c(2, 18, 168)).

8. Arranging rows

To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange function.

costDF <- arrange(colleges, cost)
head(costDF)

##   unitid                                   college    type        city
## 1 197027     United States Merchant Marine Academy  public Kings Point
## 2 176336              Southeastern Baptist College private      Laurel
## 3 241951 Escuela de Artes Plasticas de Puerto Rico  public    San Juan
## 4 241216               Atlantic University College private    Guaynabo
## 5 241377              Caribbean University-Bayamon private     Bayamon
## 6 243221     University of Puerto Rico-Rio Piedras  public    San Juan
##   state               region admissionRate ACTmath ACTenglish undergrads
## 1    NY U.S. Service Schools            NA      NA         NA        958
## 2    MS            Southeast            NA      NA         NA         37
## 3    PR       Outlying Areas        0.7154      NA         NA        529
## 4    PR       Outlying Areas            NA      NA         NA       1365
## 5    PR       Outlying Areas            NA      NA         NA       1572
## 6    PR       Outlying Areas        0.5248      NA         NA      11834
##   cost gradRate FYretention fedloan debt
## 1 6603   0.7365      0.9733  0.0780 4211
## 2 6753   0.6875      1.0000  0.0000   NA
## 3 7248   0.4127      0.8382  0.0000   NA
## 4 7695   0.3891      0.7200  0.1053 5000
## 5 8006   0.2166      0.7951  0.2210 9000
## 6 8020   0.4748      0.8968  0.0966 5500

Remarks

By default, arrange assumes that we want the data arranged in ascending order by the specified variable(s).
To arrange the rows in descending order, wrap the variable name in the desc function. For example, to arrange the data frame from most to least expensive we would use the following command:
```
```
costDF <- arrange(colleges, desc(cost))
```
```
To arrange a data frame by the values of multiple variables, list the variables in a comma separated list. The order of the variables specifies the order in which the data frame will be arranged. For example,
```
```
actDF <- arrange(colleges, desc(ACTmath), desc(ACTenglish))
```

reorders **colleges** first by the median ACT math score (in descending order) and then by the ACT english score (in descending order)
```

Questions

What school is most expensive?
What school has the least expensive tuition in Wisconsin?

9. Mutating data (adding new columns)

Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.

colleges <- mutate(colleges, admissionPct = 100 * admissionRate)

Remarks

After specifying the data frame, give the name of the new variable and it’s definition. Notice that we need to use = to assign the value of the new variable.
To add multiple variables once, separate the list of new variables by commas. For example, we can also add percentage versions of FYretention and gradRate.

colleges <- mutate(colleges, FYretentionPct = 100 * FYretention,
                   gradPct = 100 * gradRate)

10. Summarizing rows

To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize command. (Note that you can also use the British spelling: summarise.) For example, to calculate the median cost of all 1776 colleges in our data set we run the following command:

summarize(colleges, medianCost = median(cost, na.rm = TRUE))

##   medianCost
## 1      29849

Remarks

As with all of the functions we have seen, the first argument should be the name of the data frame.
We add na.rm = TRUE here to remove any missing values in the cost column before the calculation. Many functions, including this summarize function, will return an error if there are missing values (blanks, NAs or NaNs) in your data.
summarize returns a data frame, with one row and one column.
We can ask for multiple aggregations in one line of code by simply using a comma separated list. For example, we can calculate the five number summary of cost for all 1776 colleges in our data set

summarize(colleges, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))

##    min    Q1 median    Q3   max
## 1 6603 19831  29849 41180 62636

Notice that even when multiple statistics are calculated, the result is a data frame with one row and the number of columns correspond to the number of summary statistics.

11. Groupwise manipulation

Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by function, and then we can use any of the above functions. Most often group_by is paired with summarise or mutate.

Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.

colleges_by_type <- group_by(colleges, type)

Remarks

After specifying the data frame, list the categorical variable(s) defining the groups.

Multiple variables can be used to specify the groups. For example, to specify groups by state and type, we would run the following command:
```
```
colleges_state_type <- group_by(colleges, state, type)
```
```

Combining `group_by` with other commands

Once we have a grouped data frame, we can obtain summaries by group via summarize. For example, the five number summary of cost by institution type is obtained below

summarize(colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))

## # A tibble: 2 x 6
##   type      min     Q1 median     Q3   max
##   <fct>   <int>  <dbl>  <dbl>  <dbl> <int>
## 1 private  6753 28788.  37302 45728. 62636
## 2 public   6603 16822   19303 21909  33208

We can also calculate new variables within groups, such as the standardized cost of attendance within each state:

colleges_by_state <- group_by(colleges, state)
colleges_by_state <- mutate(colleges_by_state, 
                            mean.cost = mean(cost, na.rm = TRUE), 
                            sd.cost = sd(cost, na.rm = TRUE),
                            std.cost = (cost - mean.cost) / sd.cost)
head(colleges_by_state)

## # A tibble: 6 x 21
## # Groups:   state [1]
##   unitid college type  city  state region admissionRate ACTmath ACTenglish
##    <int> <fct>   <fct> <fct> <fct> <fct>          <dbl>   <int>      <int>
## 1 100654 Alabam… publ… Norm… AL    South…         0.899      17         17
## 2 100663 Univer… publ… Birm… AL    South…         0.867      23         26
## 3 100690 Amridg… priv… Mont… AL    South…        NA          NA         NA
## 4 100706 Univer… publ… Hunt… AL    South…         0.806      25         26
## 5 100724 Alabam… publ… Mont… AL    South…         0.512      17         17
## 6 100751 The Un… publ… Tusc… AL    South…         0.566      25         27
## # … with 12 more variables: undergrads <int>, cost <int>, gradRate <dbl>,
## #   FYretention <dbl>, fedloan <dbl>, debt <dbl>, admissionPct <dbl>,
## #   FYretentionPct <dbl>, gradPct <dbl>, mean.cost <dbl>, sd.cost <dbl>,
## #   std.cost <dbl>

Remarks

mutate allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.
The group_by function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. Notice that only the first 10 rows print when we print the data frame in the console by typing colleges_by_state, and the width of the console determines how many variables are shown.
To print all columns we can convert the results back to a data.frame using the as.data.frame function. Try running head(as.data.frame(colleges_by_state)).
You can also use the viewer by running the command View(colleges_by_state).
Another option is to select a reduced number of columns to print.

12. Homework Questions Part 2

Filter the rows for colleges in Great Lakes or Plains regions.
Arrange the subset from part #1 to reveal what school has the highest first-year retention rate in this reduced data set.

13. Additional Resources

RStudio’s data wrangling cheat sheet provides a nice summary of the functions in the dplyr package, including those covered in this tutorial.
The introductory vignette to dplyr provides an example of wrangling a data set consisting of 336,776 flights that departed from New York City in 2013.
Roger Peng’s video overview of the dplyr package.

Introduction to R and RStudio: Data Wrangling (dplyr) and Tidy Data (tidyr)

Peter Zelles and Shonda Kuiper

3/31/2020

1. Definition of a tidy data set

2. Tidying longitudinal data (`gather`)

3. Tidying pollster data (`separate + gather`)

4. Homework Questions Part 1

5. Additional Resources

6. Filtering rows

7. Slicing rows

8. Arranging rows

9. Mutating data (adding new columns)

10. Summarizing rows

11. Groupwise manipulation

Combining `group_by` with other commands

12. Homework Questions Part 2

13. Additional Resources

Introduction to R and RStudio: Data Wrangling (dplyr) and Tidy Data (tidyr)

Peter Zelles and Shonda Kuiper

3/31/2020

1. Definition of a tidy data set

2. Tidying longitudinal data (gather)

3. Tidying pollster data (separate + gather)

4. Homework Questions Part 1

5. Additional Resources

6. Filtering rows

7. Slicing rows

8. Arranging rows

9. Mutating data (adding new columns)

10. Summarizing rows

11. Groupwise manipulation

Combining group_by with other commands

12. Homework Questions Part 2

13. Additional Resources

2. Tidying longitudinal data (`gather`)

3. Tidying pollster data (`separate + gather`)

Combining `group_by` with other commands