What is R and RStudio?:
When someone refers to R, they could mean two things: 1) the open source statistical programming language and 2) a software environment in which to use the language. R (the language) is comparable to a coding language like Python. It allows programmers to put complex statistical (and visual) ideas into practice. The “command line interface” included within R (the software package) gives statisticans and data scientists much more flexibility compared to other statistical softwares such as Stata, SPSS, or Minitab. Rather than clicking buttons, you must enter code which tells the software exactly what you want it to do. While this feels unnecessary to most of us, it becomes incredibly valuable when you start carrying out more complex analyses.
The original R (software) interface is straight out of the 1990s and is pretty intimidating to look at. It consists of a single “command line interface,” and is difficult to understand unless you are intimatly familiar with the R language. Because of this, I (and most people at this point) prefer RStudio.
RStudio is effectively an “add on.” It repackages the R software in a much more user friendly (and 21st century-feeling) application. Note: RStudio requires that R (the original software) is already installed on your computer.
Pre-downloading note: these two programs are not that large and you should not have any issues computer-memory wise. That being said, reach out to me at zellespe@grinnell.edu if this is and issue and I will help you with remote options.
Downloading R:
Use this link to download the original R software (do this first!): #https://mirror.las.iastate.edu/CRAN/
You will be prompted to download the correct software for your operating system (Windows, MacOS, or Linux). Once you select your operating system.
For Mac users, you will be prompted to download the correct version of the software for your OS version. This file will have a .pkg extension.
For Windows users, you will be prompted to “install R for the first time.” Click this link, and then select “Download R 3.6.3 for Windows.”
If you have any issues with this, please email me at “zellespe@grinnell.edu” and I will try to set you on the right path!
Downloading RStudio:
Go to this link and download the “free desktop” version of RStudio: #https://rstudio.com/products/rstudio/download/
Opening the Software:
Open R first. After installation is complete and you have opened the software, you will see the command line interface that I mentioned before. Close the window. You don’t need to see that again thanks to RStudio.
Next, open RStudio. This is what you will be working in. You will see that there are 3-4 tabs/boxes/panels that appear within the app. The one marked “consol” is where you can write independent lines of code. The box marked “environment” is where you see any data that you have loaded into RStudio. The box marked “files/plots/packages/etc.” is a catch all window that displays any graphical output that you create, any packages that you have loaded (we’ll get to that), and a help menu.
There is a 4th window that will appear once you load this file (that you’re reading now) into RStudio. This file type is called an RMD (R MarkDown) file. It allows you to combine written notes and R code into a single document. If you are reading this as an html file (I sent two links: one RMD and one html/PDF), try opening the RMD file. Find the file and “open with” RStudio. It should appear as a 4th box within RStudio. This will make it much easier to run the provided code.
Important note: If you loaded these instructions into RStudio as an RMD, the green “play” symbol at the top right had corner of each code box allows you to run the included code. Alternatively, you can copy and paste the code from the html document (found in the light gray or pink boxes) into the console in RStudio.
If you copy the code into the console, you can use the “up” arrow key on your keyboard to recall the previous line of code.
Lets get started.
Packages
Since R is open source, there are very few built in functions. Because of this, we must rely on packages. Packages are created by R users (professors, data scientists, and even private companies) in order to accomplish specific tasks in an efficient way. We will be using two packages in this exercise: dplyr (the most common data cleaning package) and tidyr (a common package used for rearranging data).
Make sure that the following packages are loaded: Use “install.packages(”PACKAGENAME“)” to ensure that the packages are downloaded
This code ensures that the packages are downloaded
library(tidyr) # contains tools to tidy data
library(dplyr) # contains tools for data wrangling
library(ggformula) # for plotting / graphing
library(readr) # a package for parsing data; you should already have this from reading in your initial data
This code ensures that the packages are loaded in Rs working library.
Datasets in R
Let’s start at the begining: loading data.
I prefer a package called “readr” to the base R version of loading data. Normally data would be stored on your computer, but this exercise relies on online data. It will automatically download using these commands:
UBSprices <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/UBSprices.csv", as.is = TRUE)
polls <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/rcp-polls.csv", na.strings = "--", as.is = TRUE)
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)
After importing the files, you should see them in the “environment” box. Click on them to open the datasheets. This is what data looks like in RStudio.
Tidying your data with tidyr
Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this exercise, we focus on how a data set should be formatted for analysis in R.
In R, it is easiest to work with data that follow five basic rules:
A data set satisfying these rules is said to be tidy, a term popularized by Hadley Wickham.
Note: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.
This tutorial will describe the following tidyr commands, which can be thought of as verbs for tidying data: (Note: there are many more of these type of commands within the tidyr package)
| Command | Meaning |
|---|---|
gather |
collapses multiple columns into two columns |
separate |
splits compound variables into individual columns |
gather)UBS is an international bank that reports prices of various staples in major cities every three years. The data set in the UBSprices data set contains prices of a 1 kg bag of rice in 2003 and 2009 in major world cities. The data set was extracted from the alr4 R package.
head(UBSprices)
## city rice2003 rice2009
## 1 Amsterdam 9 11
## 2 Athens 19 27
## 3 Auckland 9 13
## 4 Bangkok 25 27
## 5 Barcelona 10 8
## 6 Berlin 16 17
This data set is not tidy because each row contains two cases: the city in 2003 and the city in 2009. Additionally, the column names 2003 and 2009 contain the year, which should be the value of a variable. In order to tidy these data, we need to
To do this, we will use the gather function in the tidyr package. gather collapses multiple columns into two columns: a key column and a value column. The key will be the new variable containing the old column names and the value will contain the information recorded in the cells of the collapsed columns.
In our example, we want to collapse rice2003 and rice2009 into the key-value pair year and price. To do this, we use the following command:
tidy_ubs <- gather(data = UBSprices, key = year, value = price, rice2003, rice2009)
head(tidy_ubs)
## city year price
## 1 Amsterdam rice2003 9
## 2 Athens rice2003 19
## 3 Auckland rice2003 9
## 4 Bangkok rice2003 25
## 5 Barcelona rice2003 10
## 6 Berlin rice2003 16
Remarks
gather should be the data frame being tidied. This is true for all of the tidyr functions we discuss in this tutorial.To finish tidying these data, we need to modify the year column by removing the word rice from each cell.
To do this, we can use the parse_number function in the readr package. This function drops any non-numeric characters before or after the first number in a character string. We now have a data set that we can call tidy.
tidy_ubs$year <- parse_number(tidy_ubs$year)
head(tidy_ubs)
## city year price
## 1 Amsterdam 2003 9
## 2 Athens 2003 19
## 3 Auckland 2003 9
## 4 Bangkok 2003 25
## 5 Barcelona 2003 10
## 6 Berlin 2003 16
Remark
This data set started in a relatively tidy form, so it may be difficult to see the benefit of tidying it. Tidy data are typically required for summarizing and plotting data in R. For example, consider making a side-by-side boxplot using ggformula (we learn more about plotting data in the ggplot and ggformula tutorials).
Note: the following code uses a technique known as “piping” (the %>% symbol). Piping allows users to shorten the amount of code written in functions by “piping” instructions from one line of code into the next. This is a relatively advanced technique and you are not expected to know how to do this by the end of the exercise.
tidy_ubs %>%
gf_boxplot(price ~ factor(year)) %>%
gf_labs(x = "year")
This was straightforward since tidy_ubs was already tidy, but would have required extra manipulation in the original format.
separate + gather)The polls data set contains the results of various presidential polls conducted during July 2016, and was scraped from RealClear Politics.
polls
## Poll Date Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 - 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 - 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 - 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 - 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 - 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 - 7/11 837 RV NA 40 36
## 7 McClatchy/Marist 7/5 - 7/9 1053 RV 3.0 40 35
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
## 7 10 5
Here, the data set is not tidy because
Date column contains both the beginning and end dates. These should be stored in separate columns.Sample column contains two variables: the number of people in the sample and the population that was sampled, likely voters (LV) or registered voters (RV). These should be stored in separate columns.candidate and party variables, which should be stored in their own columns.To break a single character column into multiple new columns we use the separate function in the tidyr package.
To begin, let’s break the Date column into Begin and End columns:
tidy_polls <- separate(data = polls, col = Date, into = c("Begin", "End"), sep = " - ")
tidy_polls
## Poll Begin End Sample MoE Clinton..D. Trump..R.
## 1 Monmouth 7/14 7/16 688 LV 3.7 45 43
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 42 37
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 42 38
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 41 35
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 40 37
## 6 Associated Press-GfK 7/7 7/11 837 RV NA 40 36
## 7 McClatchy/Marist 7/5 7/9 1053 RV 3.0 40 35
## Johnson..L. Stein..G.
## 1 5 1
## 2 13 5
## 3 8 5
## 4 11 6
## 5 5 2
## 6 6 2
## 7 10 5
Remarks
separate is the name of the data frame to be tidied.col, specifies the name of the column to be split.into, specifies the names of the new columns. Note that since these are specific column names we are creating, they should be given in quotes.sep = " - ", then R would erroneously use \ as the separator. To manually specify the separator between columns we can place the character(s) in quotes.sep = " - ", the spaces around - avoid excess whitespace in the resulting cell values. We also need to separate the Sample column into size and population columns.
tidy_polls <- separate(data = tidy_polls, col = Sample, into = c("size", "population"), sep = " ")
tidy_polls
## Poll Begin End size population MoE Clinton..D.
## 1 Monmouth 7/14 7/16 688 LV 3.7 45
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 42
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 42
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 41
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 40
## 6 Associated Press-GfK 7/7 7/11 837 RV NA 40
## 7 McClatchy/Marist 7/5 7/9 1053 RV 3.0 40
## Trump..R. Johnson..L. Stein..G.
## 1 43 5 1
## 2 37 13 5
## 3 38 8 5
## 4 35 11 6
## 5 37 5 2
## 6 36 6 2
## 7 35 10 5
Next, we need to gather the last four columns into a candidate variable.
tidy_polls <- gather(data = tidy_polls, key = candidate, value = percentage, 7:10)
head(tidy_polls)
## Poll Begin End size population MoE candidate
## 1 Monmouth 7/14 7/16 688 LV 3.7 Clinton..D.
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 Clinton..D.
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 Clinton..D.
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 Clinton..D.
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 Clinton..D.
## 6 Associated Press-GfK 7/7 7/11 837 RV NA Clinton..D.
## percentage
## 1 45
## 2 42
## 3 42
## 4 41
## 5 40
## 6 40
Notice that instead of writing out the column names (Clinton..D., Trump..R., etc.) we can simply specify the column numbers—here 7:10 specifies that we are gathering columns 7 through 10.
Finally, we need to separate the candidate names from the political party.
tidy_polls <- separate(tidy_polls, candidate, into= c("candidate", "party"))
## Warning: Expected 2 pieces. Additional pieces discarded in 28 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(tidy_polls)
## Poll Begin End size population MoE candidate party
## 1 Monmouth 7/14 7/16 688 LV 3.7 Clinton D
## 2 CNN/ORC 7/13 7/16 872 RV 3.5 Clinton D
## 3 ABC News/Wash Post 7/11 7/14 816 RV 4.0 Clinton D
## 4 NBC News/Wall St. Jrnl 7/9 7/13 1000 RV 3.1 Clinton D
## 5 Economist/YouGov 7/9 7/11 932 RV 4.5 Clinton D
## 6 Associated Press-GfK 7/7 7/11 837 RV NA Clinton D
## percentage
## 1 45
## 2 42
## 3 42
## 4 41
## 5 40
## 6 40
Remark
In the last command we let R guess which separator to use. This worked, but resulted in a warning message—we’re lucky that it worked! There are many situations where the separator is too complex for R to guess correctly and it cannot be specified using a simple character in quotes. In such cases we need to use regular expressions (see the tutorial IntroStrings) to aid our data tidying, but that’s a topic for another tutorial. The important thing to note here is that you should always check that separate worked as you expected, don’t blindly trust it!
the following two questions are from part 2 and use UBSprices
How are the number of rows adjusted by using the gather command? Use the dim(UBSprices) command to determine how many rows are in the UBSprices data set and dim(tidy_ubs) to determine how many are in the tidy_ubs data set).
How many rows would there be if used the gather command and the original UBSprices data set had five columns of years: rice2003, rice2006, rice2009, rice2012, and rice2015?
RStudio’s data wrangling cheat sheet provides a nice summary of how to reshape data sets and a quick reminder of the definition of tidy data.
The tidyr vignette provides additional examples and elaborates on the capabilities of the tidyr package.
Data Wrangling in dplyr
Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.
| Verb(s) | Meaning |
|---|---|
filter and slice |
pick specific observations (i.e. specific rows) |
arrange |
sort rows by a particular variable |
mutate |
add new calculated columns to a data frame |
summarize |
aggregate many rows into a single row |
In this example we will explore how to use each of these functions, as well as how to combine them with the group_by function for groupwise manipulations.
To begin, let’s make sure that our data set and the dplyr package are loaded
library(dplyr)
Data: The file college2015.csv contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant. This data is not online and will have to be loaded manually. Use the “import dataset” dropdown button in the “environment” box. Select the college2015 data from your computer and import. It will appear in your environment.
colleges <- read.csv("~/Shared/F18MAT310/r-tutorials-master/data-wrangling/data/colleges2015.csv")
library(dplyr)
Note: The above code will not work for you. This is only included to create the RMD file. You will need to manually load the data as previously described.
To get a feel for what data are available, look at the first six rows
head(colleges)
## unitid college type city state
## 1 100654 Alabama A & M University public Normal AL
## 2 100663 University of Alabama at Birmingham public Birmingham AL
## 3 100690 Amridge University private Montgomery AL
## 4 100706 University of Alabama in Huntsville public Huntsville AL
## 5 100724 Alabama State University public Montgomery AL
## 6 100751 The University of Alabama public Tuscaloosa AL
## region admissionRate ACTmath ACTenglish undergrads cost gradRate
## 1 Southeast 0.8989 17 17 4051 18888 0.2914
## 2 Southeast 0.8673 23 26 11200 19990 0.5377
## 3 Southeast NA NA NA 322 12300 0.6667
## 4 Southeast 0.8062 25 26 5525 20306 0.4835
## 5 Southeast 0.5125 17 17 5354 17400 0.2517
## 6 Southeast 0.5655 25 27 28692 26717 0.6665
## FYretention fedloan debt
## 1 0.6314 0.8204 33611.5
## 2 0.8016 0.5397 23117.0
## 3 0.3750 0.7629 26995.0
## 4 0.8098 0.4728 24738.0
## 5 0.6219 0.8735 33452.0
## 6 0.8700 0.4148 24000.0
the last six rows
tail(colleges)
## unitid college type city state
## 1771 443058 Family of Faith College private Shawnee OK
## 1772 443340 Williamson Christian College private Franklin TN
## 1773 444398 Baptist University of the Americas private San Antonio TX
## 1774 444413 Beis Medrash Heichal Dovid private Far Rockaway NY
## 1775 444778 New Life Theological Seminary private Charlotte NC
## 1776 444990 Flagler College-Tallahassee private Tallahassee FL
## region admissionRate ACTmath ACTenglish undergrads cost gradRate
## 1771 Southwest 0.6667 NA NA 19 21795 1.0000
## 1772 Southeast NA NA NA 72 20737 0.0000
## 1773 Southwest NA NA NA 189 14542 0.2941
## 1774 Mid East 1.0000 NA NA 96 16000 0.1000
## 1775 Southeast NA NA NA 122 24685 0.4000
## 1776 Southeast NA NA NA 469 NA NA
## FYretention fedloan debt
## 1771 NA 0.2500 NA
## 1772 0.2500 0.4156 NA
## 1773 0.6190 0.0000 NA
## 1774 0.6667 0.0000 NA
## 1775 0.0000 0.8224 NA
## 1776 NA 0.6370 23638
and the structure of the data frame.
str(colleges)
## 'data.frame': 1776 obs. of 15 variables:
## $ unitid : int 100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
## $ college : Factor w/ 1760 levels "Abilene Christian University",..: 8 1411 35 1412 9 1334 63 67 66 132 ...
## $ type : Factor w/ 2 levels "private","public": 2 2 1 2 2 2 2 2 2 1 ...
## $ city : Factor w/ 1040 levels "Aberdeen","Abilene",..: 669 89 614 433 614 944 48 614 51 89 ...
## $ state : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ region : Factor w/ 10 levels "Far West","Great Lakes",..: 8 8 8 8 8 8 8 8 8 8 ...
## $ admissionRate: num 0.899 0.867 NA 0.806 0.512 ...
## $ ACTmath : int 17 23 NA 25 17 25 NA 20 26 25 ...
## $ ACTenglish : int 17 26 NA 26 17 27 NA 21 28 27 ...
## $ undergrads : int 4051 11200 322 5525 5354 28692 2999 4322 19761 1181 ...
## $ cost : int 18888 19990 12300 20306 17400 26717 NA 16556 23788 44512 ...
## $ gradRate : num 0.291 0.538 0.667 0.483 0.252 ...
## $ FYretention : num 0.631 0.802 0.375 0.81 0.622 ...
## $ fedloan : num 0.82 0.54 0.763 0.473 0.874 ...
## $ debt : num 33612 23117 26995 24738 33452 ...
To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the colleges data set using the following code:
wi <- filter(colleges, state == "WI")
head(wi)
## unitid college type city state region
## 1 238193 Alverno College private Milwaukee WI Great Lakes
## 2 238324 Bellin College private Green Bay WI Great Lakes
## 3 238333 Beloit College private Beloit WI Great Lakes
## 4 238430 Cardinal Stritch University private Milwaukee WI Great Lakes
## 5 238458 Carroll University private Waukesha WI Great Lakes
## 6 238476 Carthage College private Kenosha WI Great Lakes
## admissionRate ACTmath ACTenglish undergrads cost gradRate FYretention
## 1 0.7887 19 19 1833 30496 0.3852 0.7173
## 2 0.5556 25 24 285 NA 0.6786 0.7500
## 3 0.6769 26 28 1244 48236 0.7815 0.9228
## 4 0.8423 22 22 2680 37563 0.4162 0.7099
## 5 0.8114 24 24 3024 37963 0.5629 0.7598
## 6 0.7018 24 24 2874 44910 0.6501 0.7841
## fedloan debt
## 1 0.8784 33110
## 2 0.8145 18282
## 3 0.5784 26500
## 4 0.7178 27875
## 5 0.7108 27000
## 6 0.8048 27000
Remarks
filter is always the data frame (this is true for all the core functions in dplyr), followed by logical tests that the returned cases must pass. In our example, the test was whether the school was in Wisconsin, which is written as state == "WI".== to indicate equality because = is equivalent to <-.To specify multiple tests, use a comma to separate the tests (think of the comma as the word “and”). For example,
```
smallWI <- filter(colleges, state == "WI", undergrads < 2000)
```
returns only those rows corresponding to schools in Wisconsin with fewer than 2,000 undergraduate students.
To specify that at least one test must be passed, use the | character instead of the comma. For example, the below test checks whether a college is in Wisconsin or Minnesota or Iowa, so it returns all of the colleges in Wisconsin, Minnesota, and Iowa.
```
WiMnIa <- filter(colleges, state == "WI" | state == "MN" | state == "IA")
```You can use both | and , to specify multiple tests. For example, we can return all colleges with fewer than 2,000 undergraduate students in Wisconsin, Minnesota, and Iowa.
```
smallWIM <- filter(colleges, state == "WI" | state == "MN" | state == "IA", undergrads < 2000)
```>, >=, <, <=, != (not equal), and == (equal).To remove rows with missing values, use the R command na.omit. For example,
colleges <- na.omit(colleges)
will reduce the data set to only rows with no missing values.
```
colleges <- filter(colleges, !is.na(cost))
```
will eliminate only rows with NA in the cost column.
Questions:
To extract rows 10 through 16 from the colleges data frame we use the slice function.
slice(colleges, 10:16)
## unitid college type city state
## 1 100937 Birmingham Southern College private Birmingham AL
## 2 101073 Concordia College Alabama private Selma AL
## 3 101189 Faulkner University private Montgomery AL
## 4 101435 Huntingdon College private Montgomery AL
## 5 101453 Heritage Christian University private Florence AL
## 6 101480 Jacksonville State University public Jacksonville AL
## 7 101541 Judson College private Marion AL
## region admissionRate ACTmath ACTenglish undergrads cost gradRate
## 1 Southeast 0.6422 25 27 1181 44512 0.6192
## 2 Southeast NA NA NA 523 17655 0.2115
## 3 Southeast NA NA NA 2358 28485 0.2287
## 4 Southeast 0.6279 20 22 1100 31433 0.4319
## 5 Southeast NA NA NA 67 21160 0.0000
## 6 Southeast 0.8326 21 22 7195 19202 0.3083
## 7 Southeast 0.7388 20 23 331 27815 0.4051
## FYretention fedloan debt
## 1 0.8037 0.4939 27000
## 2 0.4103 0.9100 26500
## 3 0.5000 0.7427 23750
## 4 0.6196 0.7227 27000
## 5 1.0000 0.4839 NA
## 6 0.7112 0.6811 23500
## 7 0.5974 0.7110 26000
Remarks
:.c(). For example, to select the 2nd, 18th, and 168th rows use slice(colleges, c(2, 18, 168)).To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange function.
costDF <- arrange(colleges, cost)
head(costDF)
## unitid college type city
## 1 197027 United States Merchant Marine Academy public Kings Point
## 2 176336 Southeastern Baptist College private Laurel
## 3 241951 Escuela de Artes Plasticas de Puerto Rico public San Juan
## 4 241216 Atlantic University College private Guaynabo
## 5 241377 Caribbean University-Bayamon private Bayamon
## 6 243221 University of Puerto Rico-Rio Piedras public San Juan
## state region admissionRate ACTmath ACTenglish undergrads
## 1 NY U.S. Service Schools NA NA NA 958
## 2 MS Southeast NA NA NA 37
## 3 PR Outlying Areas 0.7154 NA NA 529
## 4 PR Outlying Areas NA NA NA 1365
## 5 PR Outlying Areas NA NA NA 1572
## 6 PR Outlying Areas 0.5248 NA NA 11834
## cost gradRate FYretention fedloan debt
## 1 6603 0.7365 0.9733 0.0780 4211
## 2 6753 0.6875 1.0000 0.0000 NA
## 3 7248 0.4127 0.8382 0.0000 NA
## 4 7695 0.3891 0.7200 0.1053 5000
## 5 8006 0.2166 0.7951 0.2210 9000
## 6 8020 0.4748 0.8968 0.0966 5500
Remarks
arrange assumes that we want the data arranged in ascending order by the specified variable(s).To arrange the rows in descending order, wrap the variable name in the desc function. For example, to arrange the data frame from most to least expensive we would use the following command:
```
costDF <- arrange(colleges, desc(cost))
```To arrange a data frame by the values of multiple variables, list the variables in a comma separated list. The order of the variables specifies the order in which the data frame will be arranged. For example,
```
actDF <- arrange(colleges, desc(ACTmath), desc(ACTenglish))
```
reorders **colleges** first by the median ACT math score (in descending order) and then by the ACT english score (in descending order)Questions
Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.
colleges <- mutate(colleges, admissionPct = 100 * admissionRate)
Remarks
= to assign the value of the new variable.FYretention and gradRate.colleges <- mutate(colleges, FYretentionPct = 100 * FYretention,
gradPct = 100 * gradRate)
To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize command. (Note that you can also use the British spelling: summarise.) For example, to calculate the median cost of all 1776 colleges in our data set we run the following command:
summarize(colleges, medianCost = median(cost, na.rm = TRUE))
## medianCost
## 1 29849
Remarks
na.rm = TRUE here to remove any missing values in the cost column before the calculation. Many functions, including this summarize function, will return an error if there are missing values (blanks, NAs or NaNs) in your data.summarize returns a data frame, with one row and one column.cost for all 1776 colleges in our data setsummarize(colleges,
min = min(cost, na.rm = TRUE),
Q1 = quantile(cost, .25, na.rm = TRUE),
median = median(cost, na.rm = TRUE),
Q3 = quantile(cost, .75, na.rm = TRUE),
max = max(cost, na.rm = TRUE))
## min Q1 median Q3 max
## 1 6603 19831 29849 41180 62636
Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by function, and then we can use any of the above functions. Most often group_by is paired with summarise or mutate.
Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.
colleges_by_type <- group_by(colleges, type)
Remarks
Multiple variables can be used to specify the groups. For example, to specify groups by state and type, we would run the following command:
```
colleges_state_type <- group_by(colleges, state, type)
```group_by with other commandsOnce we have a grouped data frame, we can obtain summaries by group via summarize. For example, the five number summary of cost by institution type is obtained below
summarize(colleges_by_type,
min = min(cost, na.rm = TRUE),
Q1 = quantile(cost, .25, na.rm = TRUE),
median = median(cost, na.rm = TRUE),
Q3 = quantile(cost, .75, na.rm = TRUE),
max = max(cost, na.rm = TRUE))
## # A tibble: 2 x 6
## type min Q1 median Q3 max
## <fct> <int> <dbl> <dbl> <dbl> <int>
## 1 private 6753 28788. 37302 45728. 62636
## 2 public 6603 16822 19303 21909 33208
We can also calculate new variables within groups, such as the standardized cost of attendance within each state:
colleges_by_state <- group_by(colleges, state)
colleges_by_state <- mutate(colleges_by_state,
mean.cost = mean(cost, na.rm = TRUE),
sd.cost = sd(cost, na.rm = TRUE),
std.cost = (cost - mean.cost) / sd.cost)
head(colleges_by_state)
## # A tibble: 6 x 21
## # Groups: state [1]
## unitid college type city state region admissionRate ACTmath ACTenglish
## <int> <fct> <fct> <fct> <fct> <fct> <dbl> <int> <int>
## 1 100654 Alabam… publ… Norm… AL South… 0.899 17 17
## 2 100663 Univer… publ… Birm… AL South… 0.867 23 26
## 3 100690 Amridg… priv… Mont… AL South… NA NA NA
## 4 100706 Univer… publ… Hunt… AL South… 0.806 25 26
## 5 100724 Alabam… publ… Mont… AL South… 0.512 17 17
## 6 100751 The Un… publ… Tusc… AL South… 0.566 25 27
## # … with 12 more variables: undergrads <int>, cost <int>, gradRate <dbl>,
## # FYretention <dbl>, fedloan <dbl>, debt <dbl>, admissionPct <dbl>,
## # FYretentionPct <dbl>, gradPct <dbl>, mean.cost <dbl>, sd.cost <dbl>,
## # std.cost <dbl>
Remarks
mutate allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.group_by function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. Notice that only the first 10 rows print when we print the data frame in the console by typing colleges_by_state, and the width of the console determines how many variables are shown.data.frame using the as.data.frame function. Try running head(as.data.frame(colleges_by_state)).View(colleges_by_state).select a reduced number of columns to print.RStudio’s data wrangling cheat sheet provides a nice summary of the functions in the dplyr package, including those covered in this tutorial.
The introductory vignette to dplyr provides an example of wrangling a data set consisting of 336,776 flights that departed from New York City in 2013.
Roger Peng’s video overview of the dplyr package.