What is R and RStudio?:

When someone refers to R, they could mean two things: 1) the open source statistical programming language and 2) a software environment in which to use the language. R (the language) is comparable to a coding language like Python. It allows programmers to put complex statistical (and visual) ideas into practice. The “command line interface” included within R (the software package) gives statisticans and data scientists much more flexibility compared to other statistical softwares such as Stata, SPSS, or Minitab. Rather than clicking buttons, you must enter code which tells the software exactly what you want it to do. While this feels unnecessary to most of us, it becomes incredibly valuable when you start carrying out more complex analyses.

The original R (software) interface is straight out of the 1990s and is pretty intimidating to look at. It consists of a single “command line interface,” and is difficult to understand unless you are intimatly familiar with the R language. Because of this, I (and most people at this point) prefer RStudio.

RStudio is effectively an “add on.” It repackages the R software in a much more user friendly (and 21st century-feeling) application. Note: RStudio requires that R (the original software) is already installed on your computer.

Pre-downloading note: these two programs are not that large and you should not have any issues computer-memory wise. That being said, reach out to me at if this is and issue and I will help you with remote options.

Downloading R:

Use this link to download the original R software (do this first!): #https://mirror.las.iastate.edu/CRAN/

You will be prompted to download the correct software for your operating system (Windows, MacOS, or Linux). Once you select your operating system.

For Mac users, you will be prompted to download the correct version of the software for your OS version. This file will have a .pkg extension.

For Windows users, you will be prompted to “install R for the first time.” Click this link, and then select “Download R 3.6.3 for Windows.”

If you have any issues with this, please email me at “” and I will try to set you on the right path!

Downloading RStudio:

Go to this link and download the “free desktop” version of RStudio: #https://rstudio.com/products/rstudio/download/

Opening the Software:

Open R first. After installation is complete and you have opened the software, you will see the command line interface that I mentioned before. Close the window. You don’t need to see that again thanks to RStudio.

Next, open RStudio. This is what you will be working in. You will see that there are 3-4 tabs/boxes/panels that appear within the app. The one marked “consol” is where you can write independent lines of code. The box marked “environment” is where you see any data that you have loaded into RStudio. The box marked “files/plots/packages/etc.” is a catch all window that displays any graphical output that you create, any packages that you have loaded (we’ll get to that), and a help menu.

There is a 4th window that will appear once you load this file (that you’re reading now) into RStudio. This file type is called an RMD (R MarkDown) file. It allows you to combine written notes and R code into a single document. If you are reading this as an html file (I sent two links: one RMD and one html/PDF), try opening the RMD file. Find the file and “open with” RStudio. It should appear as a 4th box within RStudio. This will make it much easier to run the provided code.

Important note: If you loaded these instructions into RStudio as an RMD, the green “play” symbol at the top right had corner of each code box allows you to run the included code. Alternatively, you can copy and paste the code from the html document (found in the light gray or pink boxes) into the console in RStudio.

If you copy the code into the console, you can use the “up” arrow key on your keyboard to recall the previous line of code.

Lets get started.

Packages

Since R is open source, there are very few built in functions. Because of this, we must rely on packages. Packages are created by R users (professors, data scientists, and even private companies) in order to accomplish specific tasks in an efficient way. We will be using two packages in this exercise: dplyr (the most common data cleaning package) and tidyr (a common package used for rearranging data).

Make sure that the following packages are loaded: Use “install.packages(”PACKAGENAME“)” to ensure that the packages are downloaded

This code ensures that the packages are downloaded

library(tidyr)     # contains tools to tidy data
library(dplyr)     # contains tools for data wrangling
library(ggformula) # for plotting / graphing
library(readr)     # a package for parsing data; you should already have this from reading in your initial data

This code ensures that the packages are loaded in Rs working library.

Datasets in R

Let’s start at the begining: loading data.

I prefer a package called “readr” to the base R version of loading data. Normally data would be stored on your computer, but this exercise relies on online data. It will automatically download using these commands:

UBSprices <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/UBSprices.csv", as.is = TRUE)
polls <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/rcp-polls.csv", na.strings = "--", as.is = TRUE)
airlines <- read.csv("https://raw.githubusercontent.com/ds4stats/r-tutorials/master/tidying-data/data/airline-safety.csv", as.is = TRUE)

After importing the files, you should see them in the “environment” box. Click on them to open the datasheets. This is what data looks like in RStudio.

Tidying your data with tidyr

Data sets are stored in tabular format and there are many possible ways to organize tabular data. Some organizational schemes are designed to be easily read on the page (or screen), while others are designed to be easily used in analysis. In this exercise, we focus on how a data set should be formatted for analysis in R.

1. Definition of a tidy data set

In R, it is easiest to work with data that follow five basic rules:

  1. Every variable is stored in its own column.
  2. Every observation is stored in its own row—that is, every row corresponds to a single case.
  3. Each value of a variable is stored in a cell of the table.
  4. Values should not contain units. Rather, units should be specified in the supporting documentation for the data set, often called a codebook.
  5. There should be no extraneous information (footnotes, table titles, etc.).

A data set satisfying these rules is said to be tidy, a term popularized by Hadley Wickham.

Note: Most of the time data that violate rules 4 and 5 are obviously not tidy, and there are easy ways to exclude footnotes and titles in spreadsheets by simply omitting the offending rows. This tutorial focuses on the “sneakier” form of untidiness that violates at least one of the first three rules.

This tutorial will describe the following tidyr commands, which can be thought of as verbs for tidying data: (Note: there are many more of these type of commands within the tidyr package)

Command Meaning
gather collapses multiple columns into two columns
separate splits compound variables into individual columns

2. Tidying longitudinal data (gather)

UBS is an international bank that reports prices of various staples in major cities every three years. The data set in the UBSprices data set contains prices of a 1 kg bag of rice in 2003 and 2009 in major world cities. The data set was extracted from the alr4 R package.

head(UBSprices)
##        city rice2003 rice2009
## 1 Amsterdam        9       11
## 2    Athens       19       27
## 3  Auckland        9       13
## 4   Bangkok       25       27
## 5 Barcelona       10        8
## 6    Berlin       16       17

This data set is not tidy because each row contains two cases: the city in 2003 and the city in 2009. Additionally, the column names 2003 and 2009 contain the year, which should be the value of a variable. In order to tidy these data, we need to

  1. Reorganize the data so that each row corresponds to a city in a specific year.
  2. Create a single variable for the price of rice.
  3. Add a variable for year.

To do this, we will use the gather function in the tidyr package. gather collapses multiple columns into two columns: a key column and a value column. The key will be the new variable containing the old column names and the value will contain the information recorded in the cells of the collapsed columns.

In our example, we want to collapse rice2003 and rice2009 into the key-value pair year and price. To do this, we use the following command:

tidy_ubs <- gather(data = UBSprices, key = year, value = price, rice2003, rice2009)
head(tidy_ubs)
##        city     year price
## 1 Amsterdam rice2003     9
## 2    Athens rice2003    19
## 3  Auckland rice2003     9
## 4   Bangkok rice2003    25
## 5 Barcelona rice2003    10
## 6    Berlin rice2003    16

Remarks

To finish tidying these data, we need to modify the year column by removing the word rice from each cell.

To do this, we can use the parse_number function in the readr package. This function drops any non-numeric characters before or after the first number in a character string. We now have a data set that we can call tidy.

tidy_ubs$year <- parse_number(tidy_ubs$year)
head(tidy_ubs)
##        city year price
## 1 Amsterdam 2003     9
## 2    Athens 2003    19
## 3  Auckland 2003     9
## 4   Bangkok 2003    25
## 5 Barcelona 2003    10
## 6    Berlin 2003    16

Remark

This data set started in a relatively tidy form, so it may be difficult to see the benefit of tidying it. Tidy data are typically required for summarizing and plotting data in R. For example, consider making a side-by-side boxplot using ggformula (we learn more about plotting data in the ggplot and ggformula tutorials).

Note: the following code uses a technique known as “piping” (the %>% symbol). Piping allows users to shorten the amount of code written in functions by “piping” instructions from one line of code into the next. This is a relatively advanced technique and you are not expected to know how to do this by the end of the exercise.

tidy_ubs %>%
  gf_boxplot(price ~ factor(year)) %>%
  gf_labs(x = "year")

This was straightforward since tidy_ubs was already tidy, but would have required extra manipulation in the original format.

3. Tidying pollster data (separate + gather)

The polls data set contains the results of various presidential polls conducted during July 2016, and was scraped from RealClear Politics.

polls
##                     Poll        Date  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth 7/14 - 7/16  688 LV 3.7          45        43
## 2                CNN/ORC 7/13 - 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post 7/11 - 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl  7/9 - 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov  7/9 - 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK  7/7 - 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5 - 7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Here, the data set is not tidy because

To break a single character column into multiple new columns we use the separate function in the tidyr package.

To begin, let’s break the Date column into Begin and End columns:

tidy_polls <- separate(data = polls, col = Date, into = c("Begin", "End"), sep = " - ")
tidy_polls
##                     Poll Begin  End  Sample MoE Clinton..D. Trump..R.
## 1               Monmouth  7/14 7/16  688 LV 3.7          45        43
## 2                CNN/ORC  7/13 7/16  872 RV 3.5          42        37
## 3     ABC News/Wash Post  7/11 7/14  816 RV 4.0          42        38
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000 RV 3.1          41        35
## 5       Economist/YouGov   7/9 7/11  932 RV 4.5          40        37
## 6   Associated Press-GfK   7/7 7/11  837 RV  NA          40        36
## 7       McClatchy/Marist   7/5  7/9 1053 RV 3.0          40        35
##   Johnson..L. Stein..G.
## 1           5         1
## 2          13         5
## 3           8         5
## 4          11         6
## 5           5         2
## 6           6         2
## 7          10         5

Remarks

We also need to separate the Sample column into size and population columns.

tidy_polls <- separate(data = tidy_polls, col = Sample, into = c("size", "population"), sep = " ")
tidy_polls
##                     Poll Begin  End size population MoE Clinton..D.
## 1               Monmouth  7/14 7/16  688         LV 3.7          45
## 2                CNN/ORC  7/13 7/16  872         RV 3.5          42
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0          42
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1          41
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5          40
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA          40
## 7       McClatchy/Marist   7/5  7/9 1053         RV 3.0          40
##   Trump..R. Johnson..L. Stein..G.
## 1        43           5         1
## 2        37          13         5
## 3        38           8         5
## 4        35          11         6
## 5        37           5         2
## 6        36           6         2
## 7        35          10         5

Next, we need to gather the last four columns into a candidate variable.

tidy_polls <- gather(data = tidy_polls, key = candidate, value = percentage, 7:10)
head(tidy_polls)
##                     Poll Begin  End size population MoE   candidate
## 1               Monmouth  7/14 7/16  688         LV 3.7 Clinton..D.
## 2                CNN/ORC  7/13 7/16  872         RV 3.5 Clinton..D.
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0 Clinton..D.
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1 Clinton..D.
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5 Clinton..D.
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA Clinton..D.
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Notice that instead of writing out the column names (Clinton..D., Trump..R., etc.) we can simply specify the column numbers—here 7:10 specifies that we are gathering columns 7 through 10.

Finally, we need to separate the candidate names from the political party.

tidy_polls <- separate(tidy_polls, candidate, into= c("candidate", "party"))
## Warning: Expected 2 pieces. Additional pieces discarded in 28 rows [1, 2,
## 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, ...].
head(tidy_polls)
##                     Poll Begin  End size population MoE candidate party
## 1               Monmouth  7/14 7/16  688         LV 3.7   Clinton     D
## 2                CNN/ORC  7/13 7/16  872         RV 3.5   Clinton     D
## 3     ABC News/Wash Post  7/11 7/14  816         RV 4.0   Clinton     D
## 4 NBC News/Wall St. Jrnl   7/9 7/13 1000         RV 3.1   Clinton     D
## 5       Economist/YouGov   7/9 7/11  932         RV 4.5   Clinton     D
## 6   Associated Press-GfK   7/7 7/11  837         RV  NA   Clinton     D
##   percentage
## 1         45
## 2         42
## 3         42
## 4         41
## 5         40
## 6         40

Remark

In the last command we let R guess which separator to use. This worked, but resulted in a warning message—we’re lucky that it worked! There are many situations where the separator is too complex for R to guess correctly and it cannot be specified using a simple character in quotes. In such cases we need to use regular expressions (see the tutorial IntroStrings) to aid our data tidying, but that’s a topic for another tutorial. The important thing to note here is that you should always check that separate worked as you expected, don’t blindly trust it!

4. Homework Questions Part 1

the following two questions are from part 2 and use UBSprices

  1. How are the number of rows adjusted by using the gather command? Use the dim(UBSprices) command to determine how many rows are in the UBSprices data set and dim(tidy_ubs) to determine how many are in the tidy_ubs data set).

  2. How many rows would there be if used the gather command and the original UBSprices data set had five columns of years: rice2003, rice2006, rice2009, rice2012, and rice2015?

5. Additional Resources

Data Wrangling in dplyr

Data manipulation is central to data analysis and is often the most time consuming portion of an analysis. The dplyr package contains a suite of functions to make data manipulation easier. The core functions of the dplyr package can be thought of as verbs for data manipulation.

Verb(s) Meaning
filter and slice pick specific observations (i.e. specific rows)
arrange sort rows by a particular variable
mutate add new calculated columns to a data frame
summarize aggregate many rows into a single row

In this example we will explore how to use each of these functions, as well as how to combine them with the group_by function for groupwise manipulations.

To begin, let’s make sure that our data set and the dplyr package are loaded

library(dplyr)

Data: The file college2015.csv contains information on predominantly bachelor’s-degree granting institutions from 2015 that might be of interest to a college applicant. This data is not online and will have to be loaded manually. Use the “import dataset” dropdown button in the “environment” box. Select the college2015 data from your computer and import. It will appear in your environment.

colleges <- read.csv("~/Shared/F18MAT310/r-tutorials-master/data-wrangling/data/colleges2015.csv")
library(dplyr)

Note: The above code will not work for you. This is only included to create the RMD file. You will need to manually load the data as previously described.

To get a feel for what data are available, look at the first six rows

head(colleges)
##   unitid                             college    type       city state
## 1 100654            Alabama A & M University  public     Normal    AL
## 2 100663 University of Alabama at Birmingham  public Birmingham    AL
## 3 100690                  Amridge University private Montgomery    AL
## 4 100706 University of Alabama in Huntsville  public Huntsville    AL
## 5 100724            Alabama State University  public Montgomery    AL
## 6 100751           The University of Alabama  public Tuscaloosa    AL
##      region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1 Southeast        0.8989      17         17       4051 18888   0.2914
## 2 Southeast        0.8673      23         26      11200 19990   0.5377
## 3 Southeast            NA      NA         NA        322 12300   0.6667
## 4 Southeast        0.8062      25         26       5525 20306   0.4835
## 5 Southeast        0.5125      17         17       5354 17400   0.2517
## 6 Southeast        0.5655      25         27      28692 26717   0.6665
##   FYretention fedloan    debt
## 1      0.6314  0.8204 33611.5
## 2      0.8016  0.5397 23117.0
## 3      0.3750  0.7629 26995.0
## 4      0.8098  0.4728 24738.0
## 5      0.6219  0.8735 33452.0
## 6      0.8700  0.4148 24000.0

the last six rows

tail(colleges)
##      unitid                            college    type         city state
## 1771 443058            Family of Faith College private      Shawnee    OK
## 1772 443340       Williamson Christian College private     Franklin    TN
## 1773 444398 Baptist University of the Americas private  San Antonio    TX
## 1774 444413         Beis Medrash Heichal Dovid private Far Rockaway    NY
## 1775 444778      New Life Theological Seminary private    Charlotte    NC
## 1776 444990        Flagler College-Tallahassee private  Tallahassee    FL
##         region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1771 Southwest        0.6667      NA         NA         19 21795   1.0000
## 1772 Southeast            NA      NA         NA         72 20737   0.0000
## 1773 Southwest            NA      NA         NA        189 14542   0.2941
## 1774  Mid East        1.0000      NA         NA         96 16000   0.1000
## 1775 Southeast            NA      NA         NA        122 24685   0.4000
## 1776 Southeast            NA      NA         NA        469    NA       NA
##      FYretention fedloan  debt
## 1771          NA  0.2500    NA
## 1772      0.2500  0.4156    NA
## 1773      0.6190  0.0000    NA
## 1774      0.6667  0.0000    NA
## 1775      0.0000  0.8224    NA
## 1776          NA  0.6370 23638

and the structure of the data frame.

str(colleges)
## 'data.frame':    1776 obs. of  15 variables:
##  $ unitid       : int  100654 100663 100690 100706 100724 100751 100812 100830 100858 100937 ...
##  $ college      : Factor w/ 1760 levels "Abilene Christian University",..: 8 1411 35 1412 9 1334 63 67 66 132 ...
##  $ type         : Factor w/ 2 levels "private","public": 2 2 1 2 2 2 2 2 2 1 ...
##  $ city         : Factor w/ 1040 levels "Aberdeen","Abilene",..: 669 89 614 433 614 944 48 614 51 89 ...
##  $ state        : Factor w/ 54 levels "AK","AL","AR",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ region       : Factor w/ 10 levels "Far West","Great Lakes",..: 8 8 8 8 8 8 8 8 8 8 ...
##  $ admissionRate: num  0.899 0.867 NA 0.806 0.512 ...
##  $ ACTmath      : int  17 23 NA 25 17 25 NA 20 26 25 ...
##  $ ACTenglish   : int  17 26 NA 26 17 27 NA 21 28 27 ...
##  $ undergrads   : int  4051 11200 322 5525 5354 28692 2999 4322 19761 1181 ...
##  $ cost         : int  18888 19990 12300 20306 17400 26717 NA 16556 23788 44512 ...
##  $ gradRate     : num  0.291 0.538 0.667 0.483 0.252 ...
##  $ FYretention  : num  0.631 0.802 0.375 0.81 0.622 ...
##  $ fedloan      : num  0.82 0.54 0.763 0.473 0.874 ...
##  $ debt         : num  33612 23117 26995 24738 33452 ...

6. Filtering rows

To extract the rows only for colleges and universities in a specific state we use the filter function. For example, we can extract the colleges in Wisconsin from the colleges data set using the following code:

wi <- filter(colleges, state == "WI")
head(wi)
##   unitid                     college    type      city state      region
## 1 238193             Alverno College private Milwaukee    WI Great Lakes
## 2 238324              Bellin College private Green Bay    WI Great Lakes
## 3 238333              Beloit College private    Beloit    WI Great Lakes
## 4 238430 Cardinal Stritch University private Milwaukee    WI Great Lakes
## 5 238458          Carroll University private  Waukesha    WI Great Lakes
## 6 238476            Carthage College private   Kenosha    WI Great Lakes
##   admissionRate ACTmath ACTenglish undergrads  cost gradRate FYretention
## 1        0.7887      19         19       1833 30496   0.3852      0.7173
## 2        0.5556      25         24        285    NA   0.6786      0.7500
## 3        0.6769      26         28       1244 48236   0.7815      0.9228
## 4        0.8423      22         22       2680 37563   0.4162      0.7099
## 5        0.8114      24         24       3024 37963   0.5629      0.7598
## 6        0.7018      24         24       2874 44910   0.6501      0.7841
##   fedloan  debt
## 1  0.8784 33110
## 2  0.8145 18282
## 3  0.5784 26500
## 4  0.7178 27875
## 5  0.7108 27000
## 6  0.8048 27000

Remarks

Questions:

  1. How many Maryland colleges are in the colleges data frame? (The abbreviation for Maryland is MD.)
  2. How many private Maryland colleges with under 5000 undergraduates are in the colleges data frame?

7. Slicing rows

To extract rows 10 through 16 from the colleges data frame we use the slice function.

slice(colleges, 10:16)
##   unitid                       college    type         city state
## 1 100937   Birmingham Southern College private   Birmingham    AL
## 2 101073     Concordia College Alabama private        Selma    AL
## 3 101189           Faulkner University private   Montgomery    AL
## 4 101435            Huntingdon College private   Montgomery    AL
## 5 101453 Heritage Christian University private     Florence    AL
## 6 101480 Jacksonville State University  public Jacksonville    AL
## 7 101541                Judson College private       Marion    AL
##      region admissionRate ACTmath ACTenglish undergrads  cost gradRate
## 1 Southeast        0.6422      25         27       1181 44512   0.6192
## 2 Southeast            NA      NA         NA        523 17655   0.2115
## 3 Southeast            NA      NA         NA       2358 28485   0.2287
## 4 Southeast        0.6279      20         22       1100 31433   0.4319
## 5 Southeast            NA      NA         NA         67 21160   0.0000
## 6 Southeast        0.8326      21         22       7195 19202   0.3083
## 7 Southeast        0.7388      20         23        331 27815   0.4051
##   FYretention fedloan  debt
## 1      0.8037  0.4939 27000
## 2      0.4103  0.9100 26500
## 3      0.5000  0.7427 23750
## 4      0.6196  0.7227 27000
## 5      1.0000  0.4839    NA
## 6      0.7112  0.6811 23500
## 7      0.5974  0.7110 26000

Remarks

8. Arranging rows

To sort the rows by total cost, from the least expensive to the most expensive, we use the arrange function.

costDF <- arrange(colleges, cost)
head(costDF)
##   unitid                                   college    type        city
## 1 197027     United States Merchant Marine Academy  public Kings Point
## 2 176336              Southeastern Baptist College private      Laurel
## 3 241951 Escuela de Artes Plasticas de Puerto Rico  public    San Juan
## 4 241216               Atlantic University College private    Guaynabo
## 5 241377              Caribbean University-Bayamon private     Bayamon
## 6 243221     University of Puerto Rico-Rio Piedras  public    San Juan
##   state               region admissionRate ACTmath ACTenglish undergrads
## 1    NY U.S. Service Schools            NA      NA         NA        958
## 2    MS            Southeast            NA      NA         NA         37
## 3    PR       Outlying Areas        0.7154      NA         NA        529
## 4    PR       Outlying Areas            NA      NA         NA       1365
## 5    PR       Outlying Areas            NA      NA         NA       1572
## 6    PR       Outlying Areas        0.5248      NA         NA      11834
##   cost gradRate FYretention fedloan debt
## 1 6603   0.7365      0.9733  0.0780 4211
## 2 6753   0.6875      1.0000  0.0000   NA
## 3 7248   0.4127      0.8382  0.0000   NA
## 4 7695   0.3891      0.7200  0.1053 5000
## 5 8006   0.2166      0.7951  0.2210 9000
## 6 8020   0.4748      0.8968  0.0966 5500

Remarks

Questions

  1. What school is most expensive?
  2. What school has the least expensive tuition in Wisconsin?

9. Mutating data (adding new columns)

Data sets often do not contain the exact variables we need, but contain all of the information necessary to calculate the needed variables. In this case, we can use the mutate function to add a new column to a data frame that is calculated from other variables. For example, we may wish to report percentages rather than proportions for the admissions rate.

colleges <- mutate(colleges, admissionPct = 100 * admissionRate)

Remarks

colleges <- mutate(colleges, FYretentionPct = 100 * FYretention,
                   gradPct = 100 * gradRate)

10. Summarizing rows

To create summary statistics for columns within the data set we must aggregate all of the rows using the summarize command. (Note that you can also use the British spelling: summarise.) For example, to calculate the median cost of all 1776 colleges in our data set we run the following command:

summarize(colleges, medianCost = median(cost, na.rm = TRUE))
##   medianCost
## 1      29849

Remarks

summarize(colleges, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
##    min    Q1 median    Q3   max
## 1 6603 19831  29849 41180 62636

11. Groupwise manipulation

Often it is of interest to manipulate data within groups. For example, we might be more interested in creating separate summaries for each state, or for private and public colleges. To do this we must first tell R what groups are of interest using the group_by function, and then we can use any of the above functions. Most often group_by is paired with summarise or mutate.

Let’s first consider comparing the cost of private and public colleges. First, we must specify that the variable type defines the groups of interest.

colleges_by_type <- group_by(colleges, type)

Remarks

Combining group_by with other commands

Once we have a grouped data frame, we can obtain summaries by group via summarize. For example, the five number summary of cost by institution type is obtained below

summarize(colleges_by_type, 
          min = min(cost, na.rm = TRUE), 
          Q1 = quantile(cost, .25, na.rm = TRUE), 
          median = median(cost, na.rm = TRUE), 
          Q3 = quantile(cost, .75, na.rm = TRUE), 
          max = max(cost, na.rm = TRUE))
## # A tibble: 2 x 6
##   type      min     Q1 median     Q3   max
##   <fct>   <int>  <dbl>  <dbl>  <dbl> <int>
## 1 private  6753 28788.  37302 45728. 62636
## 2 public   6603 16822   19303 21909  33208

We can also calculate new variables within groups, such as the standardized cost of attendance within each state:

colleges_by_state <- group_by(colleges, state)
colleges_by_state <- mutate(colleges_by_state, 
                            mean.cost = mean(cost, na.rm = TRUE), 
                            sd.cost = sd(cost, na.rm = TRUE),
                            std.cost = (cost - mean.cost) / sd.cost)
head(colleges_by_state)
## # A tibble: 6 x 21
## # Groups:   state [1]
##   unitid college type  city  state region admissionRate ACTmath ACTenglish
##    <int> <fct>   <fct> <fct> <fct> <fct>          <dbl>   <int>      <int>
## 1 100654 Alabam… publ… Norm… AL    South…         0.899      17         17
## 2 100663 Univer… publ… Birm… AL    South…         0.867      23         26
## 3 100690 Amridg… priv… Mont… AL    South…        NA          NA         NA
## 4 100706 Univer… publ… Hunt… AL    South…         0.806      25         26
## 5 100724 Alabam… publ… Mont… AL    South…         0.512      17         17
## 6 100751 The Un… publ… Tusc… AL    South…         0.566      25         27
## # … with 12 more variables: undergrads <int>, cost <int>, gradRate <dbl>,
## #   FYretention <dbl>, fedloan <dbl>, debt <dbl>, admissionPct <dbl>,
## #   FYretentionPct <dbl>, gradPct <dbl>, mean.cost <dbl>, sd.cost <dbl>,
## #   std.cost <dbl>

Remarks

  • mutate allows you to use variables defined earlier to calculate a new variable. This is how std.cost was calculated.
  • The group_by function returns an object of class c("grouped_df", "tbl_df", "tbl", "data.frame"), which looks confusing, but essentially allows the data frame to be printed neatly. Notice that only the first 10 rows print when we print the data frame in the console by typing colleges_by_state, and the width of the console determines how many variables are shown.
  • To print all columns we can convert the results back to a data.frame using the as.data.frame function. Try running head(as.data.frame(colleges_by_state)).
  • You can also use the viewer by running the command View(colleges_by_state).
  • Another option is to select a reduced number of columns to print.

12. Homework Questions Part 2

  1. Filter the rows for colleges in Great Lakes or Plains regions.
  2. Arrange the subset from part #1 to reveal what school has the highest first-year retention rate in this reduced data set.

13. Additional Resources