1 Version Notes

Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:

R v. 3.5.1
RStudio v. 1.1.456
Document v. 2.0
Created: 2019-01-03
Updated: 2019-01-17
Added packages dplyr, tidyr functions

2 Review: Dates & Times

Standard, Unambiguous Format: Dates are best stored in “YYYY-MM-DD” format.

This is especially true if dates are ordered naively, e.g. alphabetically

The right way versus alternatives. Source: XKCD

The Epoch: 12:00 AM, January 1, 1970 is the time by which all time-related objects are compared and stored.

Times preceding 1970-01-01 may be stored as negative numbers
Times following 1970-01-01 may be stored as positive numbers

Datetime Classes: Time-related variables may be one of several classes:

POSIXlt: Stores datetimes as components parts, e.g. days, months, hours, and minutes
POSIXct: Stores datetimes as total seconds relative to 1970-01-01
Date: Stores datetimes as total days relative to 1970-01-01
Difftime: Stores intervals of time between date and datetime objects

Package “Lubridate”: A unified, easy-to-use package for working with dates and times in R.

if(!require(lubridate)){install.packages("lubridate")}
library(lubridate)

Formatting in “Lubridate”: Lubridate’s functions make it easy to format dates and times.

Use functions specifying day, month, year, etc. for easy conversion, e.g. mdy() for “Aug. 10, 1989”
Recall that y is “year”, m is “month”, d is “day”, h is “hour”, etc.

Extracting Time Components: Extract individual time components, e.g. weekdays, months, etc.

Function year(), month(), and day(), e.g., extract year, month, and day, respectively

Rounding Dates: Date rounding lets you round to the nearest, minimum, or maximum unit of time.

Function round_date() rounds to the nearest unit specified
Function floor_date() rounds to the minimum possible value of the unit specified
Function ceiling_date() rounds to the maximum possible value of the unit specified

3 Data Manipulation

“I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology then it’s not a problem.” (Aaron Koblin)

What is data manipulation?: Basically, the process of organizing data per your needs.

3.1 On Package “dplyr”

A grammar for data manipulation: Package “dplyr” purports to use a form of manipulation-related grammer.

Similar to Leland Wilkinson’s The Grammar of Graphics, the underpinnings of package “ggplot2”
Fundamentally important package for the Tidyverse’s myriad libraries and extensions
Informally, “dplyr” functions are referred to as “verbs”
Focuses on 5 key “verbs” and lesser “helper functions”
Coded in C++, so extremely fast compared to base R
Design for data frames; matrices are out of luck

3.2 Installing “dplyr”

Run the Following: This code detects if “dplyr” is present, installing if not, and loads the “dplyr” package.

if(!require(dplyr)){install.packages("dplyr")}
library(dplyr)

3.3 On Tibbles

Tibbles: Special kinds of data frames used in “dplyr” and other Tidyverse packages.

These data structures are made with function tbl_df()
Tibbles only print enough data to fit onto your RStudio console
Unlike standard data frames, tibbles provide data dimensions, variable names, and variable classes
There is little difference in how data frames and tibbles function, only new features

A Data Frame:

as.data.frame(mtcars)

##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

A Tibble:

tbl_df(mtcars)

## # A tibble: 32 x 11
##      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
##  * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
##  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
##  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
##  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
##  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
##  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
##  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
##  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
##  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
## 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
## # ... with 22 more rows

3.4 Passing the Pipe

Piping: The process of passing output from one function as input to another function.

To do this, we use the “Pipe Operator”, or %>%, to chain functions
Piping always passes a data frame from left to right
Piped data can move leftward only for assignment (<-)

For example, we can pipe multiple filter() operations from package “dplyr”:

mtcars %>%
  filter(mpg > 23) %>%
  filter(cyl < 6) %>%
  filter(hp < 90 )

##    mpg cyl  disp hp drat    wt  qsec vs am gear carb
## 1 24.4   4 146.7 62 3.69 3.190 20.00  1  0    4    2
## 2 32.4   4  78.7 66 4.08 2.200 19.47  1  1    4    1
## 3 30.4   4  75.7 52 4.93 1.615 18.52  1  1    4    2
## 4 33.9   4  71.1 65 4.22 1.835 19.90  1  1    4    1
## 5 27.3   4  79.0 66 4.08 1.935 18.90  1  1    4    1

Note: Once you’ve called the name of the data frame (here, mtcars), you don’t need to call it again!

4 Essential “dplyr” Verbs

Fundamental Verbs: The following are the most critical dplyr functions.

4.1 Select (Columns)

Function select() is used for dimension reduction, or eliminating unwanted variables.

Simply input or pipe a data frame or tibble to select()
List all variable names to preserve, no quotes needed!
Include variables in a sequence using :, e.g. mpg:wt
Exclude variables by putting - before their names, e.g. -drat

head(mtcars)

##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

mtcars %>% 
  select(mpg, cyl, hp, wt, am)     # Calling data frame first, then piping

##                      mpg cyl  hp    wt am
## Mazda RX4           21.0   6 110 2.620  1
## Mazda RX4 Wag       21.0   6 110 2.875  1
## Datsun 710          22.8   4  93 2.320  1
## Hornet 4 Drive      21.4   6 110 3.215  0
## Hornet Sportabout   18.7   8 175 3.440  0
## Valiant             18.1   6 105 3.460  0
## Duster 360          14.3   8 245 3.570  0
## Merc 240D           24.4   4  62 3.190  0
## Merc 230            22.8   4  95 3.150  0
## Merc 280            19.2   6 123 3.440  0
## Merc 280C           17.8   6 123 3.440  0
## Merc 450SE          16.4   8 180 4.070  0
## Merc 450SL          17.3   8 180 3.730  0
## Merc 450SLC         15.2   8 180 3.780  0
## Cadillac Fleetwood  10.4   8 205 5.250  0
## Lincoln Continental 10.4   8 215 5.424  0
## Chrysler Imperial   14.7   8 230 5.345  0
## Fiat 128            32.4   4  66 2.200  1
## Honda Civic         30.4   4  52 1.615  1
## Toyota Corolla      33.9   4  65 1.835  1
## Toyota Corona       21.5   4  97 2.465  0
## Dodge Challenger    15.5   8 150 3.520  0
## AMC Javelin         15.2   8 150 3.435  0
## Camaro Z28          13.3   8 245 3.840  0
## Pontiac Firebird    19.2   8 175 3.845  0
## Fiat X1-9           27.3   4  66 1.935  1
## Porsche 914-2       26.0   4  91 2.140  1
## Lotus Europa        30.4   4 113 1.513  1
## Ford Pantera L      15.8   8 264 3.170  1
## Ferrari Dino        19.7   6 175 2.770  1
## Maserati Bora       15.0   8 335 3.570  1
## Volvo 142E          21.4   4 109 2.780  1

select(.data = mtcars, mpg, cyl, hp, wt, am)     # Using select() without piping

##                      mpg cyl  hp    wt am
## Mazda RX4           21.0   6 110 2.620  1
## Mazda RX4 Wag       21.0   6 110 2.875  1
## Datsun 710          22.8   4  93 2.320  1
## Hornet 4 Drive      21.4   6 110 3.215  0
## Hornet Sportabout   18.7   8 175 3.440  0
## Valiant             18.1   6 105 3.460  0
## Duster 360          14.3   8 245 3.570  0
## Merc 240D           24.4   4  62 3.190  0
## Merc 230            22.8   4  95 3.150  0
## Merc 280            19.2   6 123 3.440  0
## Merc 280C           17.8   6 123 3.440  0
## Merc 450SE          16.4   8 180 4.070  0
## Merc 450SL          17.3   8 180 3.730  0
## Merc 450SLC         15.2   8 180 3.780  0
## Cadillac Fleetwood  10.4   8 205 5.250  0
## Lincoln Continental 10.4   8 215 5.424  0
## Chrysler Imperial   14.7   8 230 5.345  0
## Fiat 128            32.4   4  66 2.200  1
## Honda Civic         30.4   4  52 1.615  1
## Toyota Corolla      33.9   4  65 1.835  1
## Toyota Corona       21.5   4  97 2.465  0
## Dodge Challenger    15.5   8 150 3.520  0
## AMC Javelin         15.2   8 150 3.435  0
## Camaro Z28          13.3   8 245 3.840  0
## Pontiac Firebird    19.2   8 175 3.845  0
## Fiat X1-9           27.3   4  66 1.935  1
## Porsche 914-2       26.0   4  91 2.140  1
## Lotus Europa        30.4   4 113 1.513  1
## Ford Pantera L      15.8   8 264 3.170  1
## Ferrari Dino        19.7   6 175 2.770  1
## Maserati Bora       15.0   8 335 3.570  1
## Volvo 142E          21.4   4 109 2.780  1

mtcars %>%
  select(mpg:wt, -disp, -drat)     # From "mpg" to "wt", excluding "drat" and "disp"

##                      mpg cyl  hp    wt
## Mazda RX4           21.0   6 110 2.620
## Mazda RX4 Wag       21.0   6 110 2.875
## Datsun 710          22.8   4  93 2.320
## Hornet 4 Drive      21.4   6 110 3.215
## Hornet Sportabout   18.7   8 175 3.440
## Valiant             18.1   6 105 3.460
## Duster 360          14.3   8 245 3.570
## Merc 240D           24.4   4  62 3.190
## Merc 230            22.8   4  95 3.150
## Merc 280            19.2   6 123 3.440
## Merc 280C           17.8   6 123 3.440
## Merc 450SE          16.4   8 180 4.070
## Merc 450SL          17.3   8 180 3.730
## Merc 450SLC         15.2   8 180 3.780
## Cadillac Fleetwood  10.4   8 205 5.250
## Lincoln Continental 10.4   8 215 5.424
## Chrysler Imperial   14.7   8 230 5.345
## Fiat 128            32.4   4  66 2.200
## Honda Civic         30.4   4  52 1.615
## Toyota Corolla      33.9   4  65 1.835
## Toyota Corona       21.5   4  97 2.465
## Dodge Challenger    15.5   8 150 3.520
## AMC Javelin         15.2   8 150 3.435
## Camaro Z28          13.3   8 245 3.840
## Pontiac Firebird    19.2   8 175 3.845
## Fiat X1-9           27.3   4  66 1.935
## Porsche 914-2       26.0   4  91 2.140
## Lotus Europa        30.4   4 113 1.513
## Ford Pantera L      15.8   8 264 3.170
## Ferrari Dino        19.7   6 175 2.770
## Maserati Bora       15.0   8 335 3.570
## Volvo 142E          21.4   4 109 2.780

Helper Functions for Select: Package “dplyr” verbs have “helper functions”, e.g.

Functions starts_with() and ends_with() help select well-labeled variables
Functions contains() and matches() help select variables containing patterns

4.2 Filter (Rows)

Function filter() is used to preserve only rows that meet specified criteria.

Use comparators like <, >, <=, >=, ==, and !=
Use multiple comparator statements in a single call to filter()

mtcars %>% 
  filter(mpg > 23)     # Calling data frame first, then piping

##    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## 1 24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## 2 32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## 3 30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## 4 33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## 5 27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## 6 26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## 7 30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2

filter(mtcars, wt < 2.1)     # Using select() without piping

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## 2 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## 3 27.3   4 79.0  66 4.08 1.935 18.90  1  1    4    1
## 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

filter(mtcars, 
       mpg > 23, 
       wt < 2.1)     # Filter on multiple conditions

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## 2 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## 3 27.3   4 79.0  66 4.08 1.935 18.90  1  1    4    1
## 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2

4.3 Mutate

Function mutate() easily creates new variables from pre-existing ones.

Like variables from the data frame, new variable names don’t need quotes
The new name of the variable is leftward, the formula is rightward
Separate the new name and variable formula with =
Multiple mutations may occur in a single mutate() call

Example: Converting mtcars weight units (wt) from 1,000 lbs. to 1 lb.

mtcars %>%
  select(-disp:-drat, 
         -qsec:-am) %>%              # Reducing variables with select()
  mutate(wt_lbs = wt * 1000) %>%     # Mutating variable "wt": "wt_lbs"
  as_tibble()                        # Coercing to tibble

## # A tibble: 32 x 6
##      mpg   cyl    wt  gear  carb wt_lbs
##    <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
##  1  21       6  2.62     4     4   2620
##  2  21       6  2.88     4     4   2875
##  3  22.8     4  2.32     4     1   2320
##  4  21.4     6  3.22     3     1   3215
##  5  18.7     8  3.44     3     2   3440
##  6  18.1     6  3.46     3     1   3460
##  7  14.3     8  3.57     3     4   3570
##  8  24.4     4  3.19     4     2   3190
##  9  22.8     4  3.15     4     2   3150
## 10  19.2     6  3.44     4     4   3440
## # ... with 22 more rows

4.4 Arrange

Function arrange() reorders entire rows based on values for one or more specified variables.

Specified variables need no "" or $
The first variable specified is the one to be rearranged
The second variable specified is the values by which to arrange
One variable may be arranged per multiple variables and values
Use function desc() for descending order

Example: Arranging mtcars by mileage, mpg, then mileage (mpg) and cylinders (cyl).

my_cars <- mtcars %>% 
  select(mpg:wt, -disp, -drat) %>%
  filter(wt < 3.2)

my_cars %>% arrange(mpg)          # Arrange by "mpg"

##     mpg cyl  hp    wt
## 1  15.8   8 264 3.170
## 2  19.7   6 175 2.770
## 3  21.0   6 110 2.620
## 4  21.0   6 110 2.875
## 5  21.4   4 109 2.780
## 6  21.5   4  97 2.465
## 7  22.8   4  93 2.320
## 8  22.8   4  95 3.150
## 9  24.4   4  62 3.190
## 10 26.0   4  91 2.140
## 11 27.3   4  66 1.935
## 12 30.4   4  52 1.615
## 13 30.4   4 113 1.513
## 14 32.4   4  66 2.200
## 15 33.9   4  65 1.835

my_cars %>% arrange(cyl, mpg)     # Arrange by "cyl", then "mpg"

##     mpg cyl  hp    wt
## 1  21.4   4 109 2.780
## 2  21.5   4  97 2.465
## 3  22.8   4  93 2.320
## 4  22.8   4  95 3.150
## 5  24.4   4  62 3.190
## 6  26.0   4  91 2.140
## 7  27.3   4  66 1.935
## 8  30.4   4  52 1.615
## 9  30.4   4 113 1.513
## 10 32.4   4  66 2.200
## 11 33.9   4  65 1.835
## 12 19.7   6 175 2.770
## 13 21.0   6 110 2.620
## 14 21.0   6 110 2.875
## 15 15.8   8 264 3.170

4.5 Mid-Tutorial Practice

Instructions: Run the following code to download the 2018 Hancock Airport renovation data.

if(!require(readr)){install.packages("readr")}
library(readr)

url <- "https://tinyurl.com/y85g52dg"
hancock <- read_csv(url)

Explore Your Data: Run the following code to explore your data.

class(hancock)
dim(hancock)
colnames(hancock)
glimpse(hancock)
summary(hancock)

Challenges: Complete the challenges using only “dplyr” verbs.

Connect all “dplyr” verbs with the pipe operator, %>%.

Filter: Only some companies in variable name have race and gender data.
- Use filter() to preserve the following companies:
  - Longhouse Construction
  - Patricia Electric
  - Quality Structures
  - Schalk & Son
  - Stone Bridge Iron & Steel
- Store your updated data frame in an object using assignment, <-
- It may be helpful which companies to exclude, viz.:
  - John W. Danforth Company
  - Environmental Services
  - Niagara Erecting
Select: We can reduce some unwanted or unnecessary dimensions, like detailed location data.
- Use select() to preserve the following variables
  - name
  - title
  - sex
  - race
  - net
  - county
  - state
- Again, store your data using assignment, <-
Arrange: We can gain better insight by sorting the our data.
- Use arrange() to sort by:
  - Company name, or variable name
  - Ethnicity, or variable race
  - Worker county, or variable county
  - Net pay for period, or variable net
- Again, store your data using assignment, <-
Simplify: Print your resulting data frame as a tibble using as_tibble(); resave.
Chain: Beginning with hancock, chain all above steps into a single command!
- Use the pipe operator, %>%, to chain together operations
- Remember that you only need to call the name of the data frame once
- What goes into and comes out of a “dplyr verb” must be a data frame or tibble

4.6 Grouping

Function group_by() will group observations, by one or more specified variables, for further manipulation.

Functions “piped” (%>%) after group_by() will transform each group individually
All variables are treated categorically whether of class character or factor
Refrain from attempting to group by numeric or other continuous data
Specified variables need no "", $, or =

Importing New Data: Grouping Hancock Airport and Lakeview Amphitheater renovations worker records for summary.

library(readr)
library(dplyr)

url <- "https://tinyurl.com/y9vla57e"      # Assign URLs for cleaner code

workers <- read_csv(url) %>%               # Read in URL data, pipe forward
  mutate(zip = as.character(zip)) %>%      # Coerce "zip" to class character
  select(project:race)                     # Reduce variables, assign to "workers"

Grouping by Project: Use piping (%>%) to chain group_by().

workers %>%
  filter(!is.na(project)) %>%        # Remove rows missing values for "project"
  group_by(project)                  # Group by "project"

Grouping by Project & Race: Use piping (%>%) to chain group_by().

workers %>%
  filter(!is.na(project),
         !is.na(race)) %>%        # Remove rows missing values for "project"
  group_by(project, race)         # Group by "project" and "race"

Takeaway: Function group_by() is part bridging function, part partitioning function.

On it’s own, it doesn’t do anything, it is a means to an end
Functions piped after group_by() will then act independently on each group
Function summarize() was built to inherit group_by() output
Limited application for non-statistical functions

4.7 Summarizing

Function summarize() creates custom summaries using summarizing functions as arguments.

The formula for creating a summary variable is: variable = transformation
- The leftward variable may take any alphanumeric name
- The rightward transformation takes any expression, functions, etc.

Example: Call summarize() for the median() and mean() of variable gross.

workers %>%
  filter(!is.na(gross)) %>%                   # Filter rows missing "gross" (optional)
  summarize(median_gross = median(gross),     # Call median() on gross; name "median_gross"
            mean_gross = mean(gross))         # Call mean() on gross; name "mean_gross"

## # A tibble: 1 x 2
##   median_gross mean_gross
##          <dbl>      <dbl>
## 1        1216.      1242.

Spaces, Numbers in Variable Names: Use backticks, ``, to allow spaces, numbers, etc. in variable names:

workers %>%
  filter(!is.na(gross)) %>%                     # Filter rows missing "gross" (optional)
  summarize(`Median Gross` = median(gross),     # Name `Median Gross`
            `Mean Gross` = mean(gross))         # Name `Mean Gross`

## # A tibble: 1 x 2
##   `Median Gross` `Mean Gross`
##            <dbl>        <dbl>
## 1          1216.        1242.

5 Group-Summarize Operations

A Powerful Combination: So powerful, it warranted it’s own section.

“Group-” or “Group By-Summarize” operations are powerful analytical techniques
Pipe group_by() and specified grouping variables to summarize()
Perform statistical operations on grouped observations independently
Summarized output does not preserve the rest of your data

Median Gross by Race: We’ll combine group_by(), summarize(), and median().

workers %>%
  filter(!is.na(race),
         !is.na(gross)) %>%                         # Filter missing "race", "gross"
  group_by(race) %>%                                # Group by "race"
  summarize(med_gross = median(gross)) %>%          # Summarize with median() on "gross"
  arrange(desc(med_gross))                          # Arrange by descending order

## # A tibble: 6 x 2
##   race        med_gross
##   <chr>           <dbl>
## 1 Native          1470 
## 2 Hispanic        1464.
## 3 Asian           1394.
## 4 White           1297.
## 5 Black            974 
## 6 Multiracial      825.

Median Gross by Company, Race: We’ll apply the same techniques as above, adding name.

workers %>%
  filter(!is.na(race),
         !is.na(gross)) %>%                         # Filter missing "race", "gross"
  group_by(name, race) %>%                          # Group by "race"
  summarize(med_gross = median(gross)) %>%          # Summarize with median() on "gross"
  arrange(name, race)                               # Arrange company, race, gross

## # A tibble: 50 x 3
## # Groups:   name [20]
##    name            race     med_gross
##    <chr>           <chr>        <dbl>
##  1 AM Electric     White         630 
##  2 Atlas Fence     Black         829.
##  3 Atlas Fence     White        1148 
##  4 Burn Bros       Black         651.
##  5 Burn Bros       Hispanic     1434.
##  6 Burn Bros       White        1042.
##  7 Davis Ulmer     Black         760 
##  8 Davis Ulmer     White         760 
##  9 EJ Construction Asian        1514.
## 10 EJ Construction Native       1037 
## # ... with 40 more rows

Note: The more variables input in group_by(), the more sophisticated the output.

5.1 Practice: Grouping & Summarizing

Instructions: Finish the following pipeline to get:

The mean() of variable gross
Group by variables project and sex

workers %>%
  filter(!is.na(sex),
         !is.na(gross)) %>% 
  group_by()

Question: Which is higher? The mean gross payment for males at project “Hancock” or “Lakeview”?

6 Join Functions

A Family of Functions: Several “join”, or merging, functions exist in dplyr.

“Join” functions always end in *_join(), e.g. inner_join()
Each “join” function either mutates or filters merged data
- Mutating Joins add one or more columns to the primary table
- Filtering Joins return data that are filtered, not modified
The base R equivalent of *_join() functions is merge()
Function left_join() is the workhorse “join” function
See further information on Stat 545 and RStudio

Loading Location Data: Use variable zip to find additional location data.

if(!require(zipcode)){install.packages("zipcode")}
library(zipcode)

data(zipcode)

Joining Location Data: Once loaded, we can use left_join() on variable zip.

locations <- left_join(workers,                          # Indicate data to be joined
                       zipcode,                          # Indicate data to join
                       by = "zip") %>%                   # Indicate "key" variable
  select(project:ending, ssn, gross, city:longitude)     # Reduce variables

6.1 Practice: Joining Data

Join More Location Data: Use the new city and state data to add FIPS codes.

if(!require(noncensus)){install.packages("noncensus")}
library(noncensus)

data(zip_codes)

Instructions: Use function left_join() to merge locations and zip_codes.

Remember to use by = to specify the variable(s) by which to join

locations %>%
  left_join()

7 Tidy Data

Tidy Data are data that are arranged to conform to 3 core principles:

Every observation has a row
Every variable has a column
Every table contains relevant, appropriate variables

You can read more about “Tidy Data” in Hadley Wickham’s 2012 treatise, “Tidy Data”.

Tidyverse Packages will often operate optimally or exclusively using “Tidy Data” format.

7.1 Example: Untidy Data

Suppose we track the weights (lbs.) of three pets over three weeks. This is known as wide format.

owner	pet	week_1	week_2	week_3
Amit	Marshmallow	56.2	56.7	56.6
Sophia	Pebbles	14.4	14.3	14.9
Greyson	Gary Laser-Eyes	38.1	41.1	49.3

Question: Can you see why these data are “untidy”?

“Week” is one variable, but here it has multiple columns
By converting these data to long format, we tidy them

7.2 Example: Tidy Data

Suppose we tidy the above data. This is known as long format.

owner	pet	week	weight
Amit	Marshmallow	1	56.2
Sophia	Pebbles	2	14.4
Greyson	Gary Laser-Eyes	3	38.1
Amit	Marshmallow	1	56.7
Sophia	Pebbles	2	14.3
Greyson	Gary Laser-Eyes	3	41.1
Amit	Marshmallow	1	56.6
Sophia	Pebbles	2	14.9
Greyson	Gary Laser-Eyes	3	49.3

Question: Why are the data now “tidy”?

Note how the dataset becomes longer as values now repeat more often
Hence, tidy data are often long

7.3 Package “tidyr”

Package tidyr is meant for one purpose: tidying data.

Two very important functions: spread() and gather()
These functions are more than transposing data, they tidy them

if(!require(tidyr)){install.packages("tidyr")}
library(tidyr)

7.4 Spreading Data

Function spread() creates multiple new columns from key-pair data.

subset <- workers %>%
  select(project:race) %>%
  filter(!is.na(ending))

head(subset)

## # A tibble: 6 x 12
##   project name  ending     zip   ssn   class hours  rate gross   net sex  
##   <chr>   <chr> <date>     <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Lakevi~ Ajay~ 2015-08-02 14612 3888  Jour~    40  18.5  740   511. <NA> 
## 2 Lakevi~ Ajay~ 2015-08-02 13205 2807  Jour~    32  28.7  918.  587. <NA> 
## 3 Lakevi~ Ajay~ 2015-08-02 14428 7762  Jour~    40  33.7 1348  1026. <NA> 
## 4 Lakevi~ Ajay~ 2015-08-02 14549 9759  Jour~    26  33.7  876.  891. <NA> 
## 5 Lakevi~ Ajay~ 2015-08-09 14433 5632  Jour~    32  17    544   384. <NA> 
## 6 Lakevi~ Ajay~ 2015-08-09 14612 3888  Jour~    40  18.5  740   658. <NA> 
## # ... with 1 more variable: race <chr>

wrong <- subset %>% 
  spread(key = ending,        # Unique values in `ending` become columns
         value = hours)       # Variable `hours` is redistributed

wrong[, 40:44]                # View 4 of 46 new columns added

## # A tibble: 3,621 x 5
##    `2015-08-09` `2015-08-16` `2015-08-23` `2015-08-30` `2015-09-06`
##           <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
##  1           NA           NA           NA           NA           NA
##  2           NA           NA           NA           NA           NA
##  3           40           NA           NA           NA           NA
##  4           NA           NA           NA           NA           NA
##  5           32           32           NA           NA           NA
##  6           40           NA           NA           NA           NA
##  7           16           NA           NA           NA           NA
##  8           NA           42           NA           NA           NA
##  9           NA           28           NA           NA           NA
## 10           NA           42           NA           NA           NA
## # ... with 3,611 more rows

Takeaway: I generally avoid spread() but it has its uses.

We often recieve data that are “untidy”, “spread”, or “wide”
This format has it’s uses for certain functions in R and other tooling

7.5 Gathering (Tidying) Data

Function gather() tidies “wide” datasets.

Create an Untidy Data Frame: Recreate the previous example and assign to pets.

pets <- data_frame(owner = c("Amit", "Sophia", "Greyson"), 
                   pet = c("Marshmallow", "Pebbles", "Gary Laser-Eyes"),
                   week_1 = c(56.2, 14.4, 38.1),
                   week_2 = c(56.7, 14.3, 41.1),
                   week_3 = c(56.6, 14.9, 49.3))

print(pets)

## # A tibble: 3 x 5
##   owner   pet             week_1 week_2 week_3
##   <chr>   <chr>            <dbl>  <dbl>  <dbl>
## 1 Amit    Marshmallow       56.2   56.7   56.6
## 2 Sophia  Pebbles           14.4   14.3   14.9
## 3 Greyson Gary Laser-Eyes   38.1   41.1   49.3

Tidy that Data Frame: Use gather() and the notation as guidance.

pets %>%                      # Pipe "pets" into spread()
  gather(key = week,          # Name the variable that takes gathered column names
         value = weight,      # Name the variable that takes gathered values
         week_1:week_3)       # Name the columns to gather

## # A tibble: 9 x 4
##   owner   pet             week   weight
##   <chr>   <chr>           <chr>   <dbl>
## 1 Amit    Marshmallow     week_1   56.2
## 2 Sophia  Pebbles         week_1   14.4
## 3 Greyson Gary Laser-Eyes week_1   38.1
## 4 Amit    Marshmallow     week_2   56.7
## 5 Sophia  Pebbles         week_2   14.3
## 6 Greyson Gary Laser-Eyes week_2   41.1
## 7 Amit    Marshmallow     week_3   56.6
## 8 Sophia  Pebbles         week_3   14.9
## 9 Greyson Gary Laser-Eyes week_3   49.3

8 SQL with “dplyr”

SQL, or Structured Query Language (often pronounced “sequel”), is a specialized language and is:

“SQL is over 40 years old, and is used by pretty much every database in existence.” (“dplyr” vignette)
Used mainly for relational databases, collections of datasets that are related by unique IDs
One of the most important languages in data analytics, given the prevalence of databases
Practically domain-specific, i.e. used almost exclusively for a single purpose

8.1 Relational Databases

All relational databases are made up of separate tables that connect each other by using unique IDs.

Observe the following tables:

Table A: Names & IDs

Name	ID
Harshita	001
Alex	002
Fred	003
Shannon	004
Jamison	005
Luis	006
Kennedy	007

Table B: Likert Scale Responses on Nicholas Cage

ID	Cage Rating
001	7
002	6
003	7
004	1
005	2
006	7
007	4

Table C: Civil Service Exam Scores

ID	Score
001	96
002	99
003	97
004	NA
005	41
006	75
007	79

These tables demonstrate how relational databases function.

In Table A, or Cage Table, Likert responses range from one to seven:

“7” indicates “strongly agree”
“1” indicates “stongly disagree”
Statement: “Nicholas Cage is the only actor to evolve the craft since Marlon Brando”.

However, we cannot yet determine the most vehement Cage fans, since Cage Table only contains ID numbers, not names.

Hence, we must find a related table to better make sense of these data - specifically Table A.

merged_tables <- left_join(name_table,
                           cage_table, 
                           by = "ID")     # Use one of dplyr's *_join() functions
print(merged_tables)

## # A tibble: 7 x 3
##   Name     ID    `Cage Rating`
##   <chr>    <chr>         <dbl>
## 1 Harshita 001               7
## 2 Alex     002               6
## 3 Fred     003               7
## 4 Shannon  004               1
## 5 Jamison  005               2
## 6 Luis     006               7
## 7 Kennedy  007               4

Now that these tables are merged, we can determine who exactly recognized Nicholas Cage’s genius.

8.2 Relating More than One Table

One need not stop at simply identifying Cage advocates by name. By relating additonal databases, we may:

Reveal additional information about observations or instances
Study the relationships between variables regardless of the individuals that connect the data

Here, we merge Table C with the already merged Table A and Table B to create master_table.

master_table <- left_join(merged_tables,
                          exam_table, 
                          by = "ID")

print(master_table)

## # A tibble: 7 x 4
##   Name     ID    `Cage Rating` Score
##   <chr>    <chr>         <dbl> <dbl>
## 1 Harshita 001               7    96
## 2 Alex     002               6    99
## 3 Fred     003               7    97
## 4 Shannon  004               1    NA
## 5 Jamison  005               2    41
## 6 Luis     006               7    75
## 7 Kennedy  007               4    79

In sum, these databases are only relatable through the use of unique IDs.

Ignoring the “who”, we can study the relationship between variables. The following attempts to model civil service exam scores in relation to appreciation for Nicholas Cage’s art.

head(master_table)

## # A tibble: 6 x 4
##   Name     ID    `Cage Rating` Score
##   <chr>    <chr>         <dbl> <dbl>
## 1 Harshita 001               7    96
## 2 Alex     002               6    99
## 3 Fred     003               7    97
## 4 Shannon  004               1    NA
## 5 Jamison  005               2    41
## 6 Luis     006               7    75

model <- lm(master_table$Score ~ master_table$`Cage Rating`)

print(model)

## 
## Call:
## lm(formula = master_table$Score ~ master_table$`Cage Rating`)
## 
## Coefficients:
##                (Intercept)  master_table$`Cage Rating`  
##                     32.690                       8.814

Fascinating! There seems to be no significant relationship between utter devotion to Nicholas Cage and exam scores. Still, be careful out there.

8.3 SQL with “dplyr” Functions

Interestingly, many functions we’ve seen in “dplyr” are the same functions in SQL:

SELECT is similar to select()
INNER JOIN and OUTER JOIN, are similar to inner_join() and outer_join()
GROUP BY is simialr to group_by()

However, we do not need to learn SQL (though it is recommended) to use it. Instead we may use standard “dplyr” functions that:

Allow connecting and disconneting to databases
Read in only the values queried, allowing work with data too big for a single machine, even with parallel processing
Convert “dplyr” code into SQL or other native database languages

8.4 Connecting to Databases

The first step to querying data is to connect to a database. Consider the following:

Using SQL via “dplyr” requires additional packages, e.g. “dbplyr” and “RMySQL”
SQL comes in different flavors, so using the right function is critical
- src_sqlite() for SQLite
- src_mysql() for MySQL and MariaDB
- src_postgres() for PostgreSQL
Learn more about SQL and “dplyr” by typing the expression: vignette("databases", packages = "dplyr")

First, we must install and load the “DBI” package, which helps R translate different languages like SQL.

if(!require(dbplyr)){install.packages("dbplyr")}     # Package "DBI" installed with "dbplyr"
if(!require(RMySQL)){install.packages("RMySQL")}
if(!require(RSQLite)){install.packages("RSQLite")}

library(dbplyr)
library(RMySQL)
library(RSQLite)

The following uses a free practice SQL database, courtesy of RStudio.

To learn more about these arguments, view package documentation with the expression: ?dbplyr, or visit the package “dbplyr” vignette, which provides the following examples.

In lieu of accessing a live SQL database, we may use dbConnect() to connect to our machine’s internal memory.

library(dplyr)

con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")

8.5 Storing in Databases

Connecting the local memory to simulate a practice database, we may store nycflights13 using function copy_to().

if(!require(nycflights13)){install.packages("nycflights13")}
library(nycflights13)

copy_to(con, nycflights13::flights, "flights",    # The vignette author emphasizes starting with the right indexing.
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  )
)

8.6 Referencing Databases

The database has no name, though. To reference it, use function tbl() and assignment, or <-.

flights_db <- tbl(con, "flights")

8.7 Querying Databases

Now that you can reference flights_db by name, you can extract data from it like any other object in the environment.

names(flights_db)
class(flights_db)

print(flights_db)

Note that “dplyr” functions that have equivalent functions in SQL are far more likely to operate as expected.

However, “dbplyr” is optimized for select(), one of the most common tasks for analysts in generating SQL queries.

flights_db %>%
  select(year, month, day, carrier, air_time) %>%
  group_by(carrier) %>%
  summarize(total_at = sum(air_time))

8.8 Don’t Neglect, Disconnect!

When you’ve finished with your queries, don’t forget to disconnect. As a courtesy, don’t hog others’ bandwidth.

dbDisconnect(conn = con)

9 Applied Practice

Preparation: Resimulate the nycflights13 data if you disconnected during the previous section:

con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")

if(!require(dplyr)){install.packages("dplyr")}
if(!require(nycflights13)){install.packages("nycflights13")}

library(dplyr)
library(nycflights13)

con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")

copy_to(con, nycflights13::flights, "flights",    # The vignette author emphasizes starting with the right indexing.
  temporary = FALSE, 
  indexes = list(
    c("year", "month", "day"), 
    "carrier", 
    "tailnum",
    "dest"
  )
)

flights_db <- tbl(con, "flights")

Instructions: Use “dplyr” functions with object flights_db to determine the following summary statistics.

What are the mean departure delays, or dep_delay, when grouping by carrier.
What is the maximum distance among trips when grouping by month?
What is the mean and median dep_delay and arr_delay when grouping by month?
Bonus: What should you do once you’ve finished generating queries in a connected a database?

Intro to R: Data Manipulation

Jamison R. Crawford, MPA

January 31, 2019