Latest Versions & Updates: This markdown document was built using the following versions of R and RStudio:
dplyr, tidyr functionsStandard, Unambiguous Format: Dates are best stored in “YYYY-MM-DD” format.
The right way versus alternatives. Source: XKCD
The Epoch: 12:00 AM, January 1, 1970 is the time by which all time-related objects are compared and stored.
Datetime Classes: Time-related variables may be one of several classes:
Package “Lubridate”: A unified, easy-to-use package for working with dates and times in R.
if(!require(lubridate)){install.packages("lubridate")}
library(lubridate)
Formatting in “Lubridate”: Lubridate’s functions make it easy to format dates and times.
mdy() for “Aug. 10, 1989”y is “year”, m is “month”, d is “day”, h is “hour”, etc.Extracting Time Components: Extract individual time components, e.g. weekdays, months, etc.
year(), month(), and day(), e.g., extract year, month, and day, respectivelyRounding Dates: Date rounding lets you round to the nearest, minimum, or maximum unit of time.
round_date() rounds to the nearest unit specifiedfloor_date() rounds to the minimum possible value of the unit specifiedceiling_date() rounds to the maximum possible value of the unit specified“I think you can have a ridiculously enormous and complex data set, but if you have the right tools and methodology then it’s not a problem.” (Aaron Koblin)
What is data manipulation?: Basically, the process of organizing data per your needs.
A grammar for data manipulation: Package “dplyr” purports to use a form of manipulation-related grammer.
Run the Following: This code detects if “dplyr” is present, installing if not, and loads the “dplyr” package.
if(!require(dplyr)){install.packages("dplyr")}
library(dplyr)
Tibbles: Special kinds of data frames used in “dplyr” and other Tidyverse packages.
tbl_df()A Data Frame:
as.data.frame(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
A Tibble:
tbl_df(mtcars)
## # A tibble: 32 x 11
## mpg cyl disp hp drat wt qsec vs am gear carb
## * <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4
## 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
## 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
## 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
## 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
## 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
## 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
## 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
## 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
## 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
## # ... with 22 more rows
Piping: The process of passing output from one function as input to another function.
%>%, to chain functions<-)For example, we can pipe multiple filter() operations from package “dplyr”:
mtcars %>%
filter(mpg > 23) %>%
filter(cyl < 6) %>%
filter(hp < 90 )
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 4 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
Note: Once you’ve called the name of the data frame (here, mtcars), you don’t need to call it again!
Fundamental Verbs: The following are the most critical dplyr functions.
Function select() is used for dimension reduction, or eliminating unwanted variables.
select():, e.g. mpg:wt- before their names, e.g. -drathead(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars %>%
select(mpg, cyl, hp, wt, am) # Calling data frame first, then piping
## mpg cyl hp wt am
## Mazda RX4 21.0 6 110 2.620 1
## Mazda RX4 Wag 21.0 6 110 2.875 1
## Datsun 710 22.8 4 93 2.320 1
## Hornet 4 Drive 21.4 6 110 3.215 0
## Hornet Sportabout 18.7 8 175 3.440 0
## Valiant 18.1 6 105 3.460 0
## Duster 360 14.3 8 245 3.570 0
## Merc 240D 24.4 4 62 3.190 0
## Merc 230 22.8 4 95 3.150 0
## Merc 280 19.2 6 123 3.440 0
## Merc 280C 17.8 6 123 3.440 0
## Merc 450SE 16.4 8 180 4.070 0
## Merc 450SL 17.3 8 180 3.730 0
## Merc 450SLC 15.2 8 180 3.780 0
## Cadillac Fleetwood 10.4 8 205 5.250 0
## Lincoln Continental 10.4 8 215 5.424 0
## Chrysler Imperial 14.7 8 230 5.345 0
## Fiat 128 32.4 4 66 2.200 1
## Honda Civic 30.4 4 52 1.615 1
## Toyota Corolla 33.9 4 65 1.835 1
## Toyota Corona 21.5 4 97 2.465 0
## Dodge Challenger 15.5 8 150 3.520 0
## AMC Javelin 15.2 8 150 3.435 0
## Camaro Z28 13.3 8 245 3.840 0
## Pontiac Firebird 19.2 8 175 3.845 0
## Fiat X1-9 27.3 4 66 1.935 1
## Porsche 914-2 26.0 4 91 2.140 1
## Lotus Europa 30.4 4 113 1.513 1
## Ford Pantera L 15.8 8 264 3.170 1
## Ferrari Dino 19.7 6 175 2.770 1
## Maserati Bora 15.0 8 335 3.570 1
## Volvo 142E 21.4 4 109 2.780 1
select(.data = mtcars, mpg, cyl, hp, wt, am) # Using select() without piping
## mpg cyl hp wt am
## Mazda RX4 21.0 6 110 2.620 1
## Mazda RX4 Wag 21.0 6 110 2.875 1
## Datsun 710 22.8 4 93 2.320 1
## Hornet 4 Drive 21.4 6 110 3.215 0
## Hornet Sportabout 18.7 8 175 3.440 0
## Valiant 18.1 6 105 3.460 0
## Duster 360 14.3 8 245 3.570 0
## Merc 240D 24.4 4 62 3.190 0
## Merc 230 22.8 4 95 3.150 0
## Merc 280 19.2 6 123 3.440 0
## Merc 280C 17.8 6 123 3.440 0
## Merc 450SE 16.4 8 180 4.070 0
## Merc 450SL 17.3 8 180 3.730 0
## Merc 450SLC 15.2 8 180 3.780 0
## Cadillac Fleetwood 10.4 8 205 5.250 0
## Lincoln Continental 10.4 8 215 5.424 0
## Chrysler Imperial 14.7 8 230 5.345 0
## Fiat 128 32.4 4 66 2.200 1
## Honda Civic 30.4 4 52 1.615 1
## Toyota Corolla 33.9 4 65 1.835 1
## Toyota Corona 21.5 4 97 2.465 0
## Dodge Challenger 15.5 8 150 3.520 0
## AMC Javelin 15.2 8 150 3.435 0
## Camaro Z28 13.3 8 245 3.840 0
## Pontiac Firebird 19.2 8 175 3.845 0
## Fiat X1-9 27.3 4 66 1.935 1
## Porsche 914-2 26.0 4 91 2.140 1
## Lotus Europa 30.4 4 113 1.513 1
## Ford Pantera L 15.8 8 264 3.170 1
## Ferrari Dino 19.7 6 175 2.770 1
## Maserati Bora 15.0 8 335 3.570 1
## Volvo 142E 21.4 4 109 2.780 1
mtcars %>%
select(mpg:wt, -disp, -drat) # From "mpg" to "wt", excluding "drat" and "disp"
## mpg cyl hp wt
## Mazda RX4 21.0 6 110 2.620
## Mazda RX4 Wag 21.0 6 110 2.875
## Datsun 710 22.8 4 93 2.320
## Hornet 4 Drive 21.4 6 110 3.215
## Hornet Sportabout 18.7 8 175 3.440
## Valiant 18.1 6 105 3.460
## Duster 360 14.3 8 245 3.570
## Merc 240D 24.4 4 62 3.190
## Merc 230 22.8 4 95 3.150
## Merc 280 19.2 6 123 3.440
## Merc 280C 17.8 6 123 3.440
## Merc 450SE 16.4 8 180 4.070
## Merc 450SL 17.3 8 180 3.730
## Merc 450SLC 15.2 8 180 3.780
## Cadillac Fleetwood 10.4 8 205 5.250
## Lincoln Continental 10.4 8 215 5.424
## Chrysler Imperial 14.7 8 230 5.345
## Fiat 128 32.4 4 66 2.200
## Honda Civic 30.4 4 52 1.615
## Toyota Corolla 33.9 4 65 1.835
## Toyota Corona 21.5 4 97 2.465
## Dodge Challenger 15.5 8 150 3.520
## AMC Javelin 15.2 8 150 3.435
## Camaro Z28 13.3 8 245 3.840
## Pontiac Firebird 19.2 8 175 3.845
## Fiat X1-9 27.3 4 66 1.935
## Porsche 914-2 26.0 4 91 2.140
## Lotus Europa 30.4 4 113 1.513
## Ford Pantera L 15.8 8 264 3.170
## Ferrari Dino 19.7 6 175 2.770
## Maserati Bora 15.0 8 335 3.570
## Volvo 142E 21.4 4 109 2.780
Helper Functions for Select: Package “dplyr” verbs have “helper functions”, e.g.
starts_with() and ends_with() help select well-labeled variablescontains() and matches() help select variables containing patternsFunction filter() is used to preserve only rows that meet specified criteria.
<, >, <=, >=, ==, and !=filter()mtcars %>%
filter(mpg > 23) # Calling data frame first, then piping
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## 2 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## 3 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 4 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 5 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 6 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## 7 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
filter(mtcars, wt < 2.1) # Using select() without piping
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 3 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 4 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
filter(mtcars,
mpg > 23,
wt < 2.1) # Filter on multiple conditions
## mpg cyl disp hp drat wt qsec vs am gear carb
## 1 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## 2 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## 3 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## 4 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
Function mutate() easily creates new variables from pre-existing ones.
=mutate() callExample: Converting mtcars weight units (wt) from 1,000 lbs. to 1 lb.
mtcars %>%
select(-disp:-drat,
-qsec:-am) %>% # Reducing variables with select()
mutate(wt_lbs = wt * 1000) %>% # Mutating variable "wt": "wt_lbs"
as_tibble() # Coercing to tibble
## # A tibble: 32 x 6
## mpg cyl wt gear carb wt_lbs
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 21 6 2.62 4 4 2620
## 2 21 6 2.88 4 4 2875
## 3 22.8 4 2.32 4 1 2320
## 4 21.4 6 3.22 3 1 3215
## 5 18.7 8 3.44 3 2 3440
## 6 18.1 6 3.46 3 1 3460
## 7 14.3 8 3.57 3 4 3570
## 8 24.4 4 3.19 4 2 3190
## 9 22.8 4 3.15 4 2 3150
## 10 19.2 6 3.44 4 4 3440
## # ... with 22 more rows
Function arrange() reorders entire rows based on values for one or more specified variables.
"" or $desc() for descending orderExample: Arranging mtcars by mileage, mpg, then mileage (mpg) and cylinders (cyl).
my_cars <- mtcars %>%
select(mpg:wt, -disp, -drat) %>%
filter(wt < 3.2)
my_cars %>% arrange(mpg) # Arrange by "mpg"
## mpg cyl hp wt
## 1 15.8 8 264 3.170
## 2 19.7 6 175 2.770
## 3 21.0 6 110 2.620
## 4 21.0 6 110 2.875
## 5 21.4 4 109 2.780
## 6 21.5 4 97 2.465
## 7 22.8 4 93 2.320
## 8 22.8 4 95 3.150
## 9 24.4 4 62 3.190
## 10 26.0 4 91 2.140
## 11 27.3 4 66 1.935
## 12 30.4 4 52 1.615
## 13 30.4 4 113 1.513
## 14 32.4 4 66 2.200
## 15 33.9 4 65 1.835
my_cars %>% arrange(cyl, mpg) # Arrange by "cyl", then "mpg"
## mpg cyl hp wt
## 1 21.4 4 109 2.780
## 2 21.5 4 97 2.465
## 3 22.8 4 93 2.320
## 4 22.8 4 95 3.150
## 5 24.4 4 62 3.190
## 6 26.0 4 91 2.140
## 7 27.3 4 66 1.935
## 8 30.4 4 52 1.615
## 9 30.4 4 113 1.513
## 10 32.4 4 66 2.200
## 11 33.9 4 65 1.835
## 12 19.7 6 175 2.770
## 13 21.0 6 110 2.620
## 14 21.0 6 110 2.875
## 15 15.8 8 264 3.170
Instructions: Run the following code to download the 2018 Hancock Airport renovation data.
if(!require(readr)){install.packages("readr")}
library(readr)
url <- "https://tinyurl.com/y85g52dg"
hancock <- read_csv(url)
Explore Your Data: Run the following code to explore your data.
class(hancock)
dim(hancock)
colnames(hancock)
glimpse(hancock)
summary(hancock)
Challenges: Complete the challenges using only “dplyr” verbs.
%>%.Filter: Only some companies in variable name have race and gender data.
Use filter() to preserve the following companies:
<-It may be helpful which companies to exclude, viz.:
Select: We can reduce some unwanted or unnecessary dimensions, like detailed location data.
Use select() to preserve the following variables
nametitlesexracenetcountystateAgain, store your data using assignment, <-
Arrange: We can gain better insight by sorting the our data.
Use arrange() to sort by:
nameracecountynetAgain, store your data using assignment, <-
Simplify: Print your resulting data frame as a tibble using as_tibble(); resave.
Chain: Beginning with hancock, chain all above steps into a single command!
%>%, to chain together operationsFunction group_by() will group observations, by one or more specified variables, for further manipulation.
%>%) after group_by() will transform each group individuallycharacter or factornumeric or other continuous data"", $, or =Importing New Data: Grouping Hancock Airport and Lakeview Amphitheater renovations worker records for summary.
library(readr)
library(dplyr)
url <- "https://tinyurl.com/y9vla57e" # Assign URLs for cleaner code
workers <- read_csv(url) %>% # Read in URL data, pipe forward
mutate(zip = as.character(zip)) %>% # Coerce "zip" to class character
select(project:race) # Reduce variables, assign to "workers"
Grouping by Project: Use piping (%>%) to chain group_by().
workers %>%
filter(!is.na(project)) %>% # Remove rows missing values for "project"
group_by(project) # Group by "project"
Grouping by Project & Race: Use piping (%>%) to chain group_by().
workers %>%
filter(!is.na(project),
!is.na(race)) %>% # Remove rows missing values for "project"
group_by(project, race) # Group by "project" and "race"
Takeaway: Function group_by() is part bridging function, part partitioning function.
group_by() will then act independently on each groupsummarize() was built to inherit group_by() outputFunction summarize() creates custom summaries using summarizing functions as arguments.
variable = transformation
variable may take any alphanumeric nametransformation takes any expression, functions, etc.Example: Call summarize() for the median() and mean() of variable gross.
workers %>%
filter(!is.na(gross)) %>% # Filter rows missing "gross" (optional)
summarize(median_gross = median(gross), # Call median() on gross; name "median_gross"
mean_gross = mean(gross)) # Call mean() on gross; name "mean_gross"
## # A tibble: 1 x 2
## median_gross mean_gross
## <dbl> <dbl>
## 1 1216. 1242.
Spaces, Numbers in Variable Names: Use backticks, ``, to allow spaces, numbers, etc. in variable names:
workers %>%
filter(!is.na(gross)) %>% # Filter rows missing "gross" (optional)
summarize(`Median Gross` = median(gross), # Name `Median Gross`
`Mean Gross` = mean(gross)) # Name `Mean Gross`
## # A tibble: 1 x 2
## `Median Gross` `Mean Gross`
## <dbl> <dbl>
## 1 1216. 1242.
A Powerful Combination: So powerful, it warranted it’s own section.
group_by() and specified grouping variables to summarize()Median Gross by Race: We’ll combine group_by(), summarize(), and median().
workers %>%
filter(!is.na(race),
!is.na(gross)) %>% # Filter missing "race", "gross"
group_by(race) %>% # Group by "race"
summarize(med_gross = median(gross)) %>% # Summarize with median() on "gross"
arrange(desc(med_gross)) # Arrange by descending order
## # A tibble: 6 x 2
## race med_gross
## <chr> <dbl>
## 1 Native 1470
## 2 Hispanic 1464.
## 3 Asian 1394.
## 4 White 1297.
## 5 Black 974
## 6 Multiracial 825.
Median Gross by Company, Race: We’ll apply the same techniques as above, adding name.
workers %>%
filter(!is.na(race),
!is.na(gross)) %>% # Filter missing "race", "gross"
group_by(name, race) %>% # Group by "race"
summarize(med_gross = median(gross)) %>% # Summarize with median() on "gross"
arrange(name, race) # Arrange company, race, gross
## # A tibble: 50 x 3
## # Groups: name [20]
## name race med_gross
## <chr> <chr> <dbl>
## 1 AM Electric White 630
## 2 Atlas Fence Black 829.
## 3 Atlas Fence White 1148
## 4 Burn Bros Black 651.
## 5 Burn Bros Hispanic 1434.
## 6 Burn Bros White 1042.
## 7 Davis Ulmer Black 760
## 8 Davis Ulmer White 760
## 9 EJ Construction Asian 1514.
## 10 EJ Construction Native 1037
## # ... with 40 more rows
Note: The more variables input in group_by(), the more sophisticated the output.
Instructions: Finish the following pipeline to get:
mean() of variable grossproject and sexworkers %>%
filter(!is.na(sex),
!is.na(gross)) %>%
group_by()
Question: Which is higher? The mean gross payment for males at project “Hancock” or “Lakeview”?
A Family of Functions: Several “join”, or merging, functions exist in dplyr.
*_join(), e.g. inner_join()*_join() functions is merge()left_join() is the workhorse “join” functionLoading Location Data: Use variable zip to find additional location data.
if(!require(zipcode)){install.packages("zipcode")}
library(zipcode)
data(zipcode)
Joining Location Data: Once loaded, we can use left_join() on variable zip.
locations <- left_join(workers, # Indicate data to be joined
zipcode, # Indicate data to join
by = "zip") %>% # Indicate "key" variable
select(project:ending, ssn, gross, city:longitude) # Reduce variables
Join More Location Data: Use the new city and state data to add FIPS codes.
if(!require(noncensus)){install.packages("noncensus")}
library(noncensus)
data(zip_codes)
Instructions: Use function left_join() to merge locations and zip_codes.
by = to specify the variable(s) by which to joinlocations %>%
left_join()
Tidy Data are data that are arranged to conform to 3 core principles:
You can read more about “Tidy Data” in Hadley Wickham’s 2012 treatise, “Tidy Data”.
Tidyverse Packages will often operate optimally or exclusively using “Tidy Data” format.
Suppose we track the weights (lbs.) of three pets over three weeks. This is known as wide format.
| owner | pet | week_1 | week_2 | week_3 |
|---|---|---|---|---|
| Amit | Marshmallow | 56.2 | 56.7 | 56.6 |
| Sophia | Pebbles | 14.4 | 14.3 | 14.9 |
| Greyson | Gary Laser-Eyes | 38.1 | 41.1 | 49.3 |
Question: Can you see why these data are “untidy”?
Suppose we tidy the above data. This is known as long format.
| owner | pet | week | weight |
|---|---|---|---|
| Amit | Marshmallow | 1 | 56.2 |
| Sophia | Pebbles | 2 | 14.4 |
| Greyson | Gary Laser-Eyes | 3 | 38.1 |
| Amit | Marshmallow | 1 | 56.7 |
| Sophia | Pebbles | 2 | 14.3 |
| Greyson | Gary Laser-Eyes | 3 | 41.1 |
| Amit | Marshmallow | 1 | 56.6 |
| Sophia | Pebbles | 2 | 14.9 |
| Greyson | Gary Laser-Eyes | 3 | 49.3 |
Question: Why are the data now “tidy”?
Package tidyr is meant for one purpose: tidying data.
spread() and gather()if(!require(tidyr)){install.packages("tidyr")}
library(tidyr)
Function spread() creates multiple new columns from key-pair data.
subset <- workers %>%
select(project:race) %>%
filter(!is.na(ending))
head(subset)
## # A tibble: 6 x 12
## project name ending zip ssn class hours rate gross net sex
## <chr> <chr> <date> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Lakevi~ Ajay~ 2015-08-02 14612 3888 Jour~ 40 18.5 740 511. <NA>
## 2 Lakevi~ Ajay~ 2015-08-02 13205 2807 Jour~ 32 28.7 918. 587. <NA>
## 3 Lakevi~ Ajay~ 2015-08-02 14428 7762 Jour~ 40 33.7 1348 1026. <NA>
## 4 Lakevi~ Ajay~ 2015-08-02 14549 9759 Jour~ 26 33.7 876. 891. <NA>
## 5 Lakevi~ Ajay~ 2015-08-09 14433 5632 Jour~ 32 17 544 384. <NA>
## 6 Lakevi~ Ajay~ 2015-08-09 14612 3888 Jour~ 40 18.5 740 658. <NA>
## # ... with 1 more variable: race <chr>
wrong <- subset %>%
spread(key = ending, # Unique values in `ending` become columns
value = hours) # Variable `hours` is redistributed
wrong[, 40:44] # View 4 of 46 new columns added
## # A tibble: 3,621 x 5
## `2015-08-09` `2015-08-16` `2015-08-23` `2015-08-30` `2015-09-06`
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 NA NA NA NA NA
## 2 NA NA NA NA NA
## 3 40 NA NA NA NA
## 4 NA NA NA NA NA
## 5 32 32 NA NA NA
## 6 40 NA NA NA NA
## 7 16 NA NA NA NA
## 8 NA 42 NA NA NA
## 9 NA 28 NA NA NA
## 10 NA 42 NA NA NA
## # ... with 3,611 more rows
Takeaway: I generally avoid spread() but it has its uses.
Function gather() tidies “wide” datasets.
Create an Untidy Data Frame: Recreate the previous example and assign to pets.
pets <- data_frame(owner = c("Amit", "Sophia", "Greyson"),
pet = c("Marshmallow", "Pebbles", "Gary Laser-Eyes"),
week_1 = c(56.2, 14.4, 38.1),
week_2 = c(56.7, 14.3, 41.1),
week_3 = c(56.6, 14.9, 49.3))
print(pets)
## # A tibble: 3 x 5
## owner pet week_1 week_2 week_3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Amit Marshmallow 56.2 56.7 56.6
## 2 Sophia Pebbles 14.4 14.3 14.9
## 3 Greyson Gary Laser-Eyes 38.1 41.1 49.3
Tidy that Data Frame: Use gather() and the notation as guidance.
pets %>% # Pipe "pets" into spread()
gather(key = week, # Name the variable that takes gathered column names
value = weight, # Name the variable that takes gathered values
week_1:week_3) # Name the columns to gather
## # A tibble: 9 x 4
## owner pet week weight
## <chr> <chr> <chr> <dbl>
## 1 Amit Marshmallow week_1 56.2
## 2 Sophia Pebbles week_1 14.4
## 3 Greyson Gary Laser-Eyes week_1 38.1
## 4 Amit Marshmallow week_2 56.7
## 5 Sophia Pebbles week_2 14.3
## 6 Greyson Gary Laser-Eyes week_2 41.1
## 7 Amit Marshmallow week_3 56.6
## 8 Sophia Pebbles week_3 14.9
## 9 Greyson Gary Laser-Eyes week_3 49.3
SQL, or Structured Query Language (often pronounced “sequel”), is a specialized language and is:
All relational databases are made up of separate tables that connect each other by using unique IDs.
Observe the following tables:
Table A: Names & IDs
| Name | ID |
|---|---|
| Harshita | 001 |
| Alex | 002 |
| Fred | 003 |
| Shannon | 004 |
| Jamison | 005 |
| Luis | 006 |
| Kennedy | 007 |
Table B: Likert Scale Responses on Nicholas Cage
| ID | Cage Rating |
|---|---|
| 001 | 7 |
| 002 | 6 |
| 003 | 7 |
| 004 | 1 |
| 005 | 2 |
| 006 | 7 |
| 007 | 4 |
Table C: Civil Service Exam Scores
| ID | Score |
|---|---|
| 001 | 96 |
| 002 | 99 |
| 003 | 97 |
| 004 | NA |
| 005 | 41 |
| 006 | 75 |
| 007 | 79 |
These tables demonstrate how relational databases function.
In Table A, or Cage Table, Likert responses range from one to seven:
However, we cannot yet determine the most vehement Cage fans, since Cage Table only contains ID numbers, not names.
Hence, we must find a related table to better make sense of these data - specifically Table A.
merged_tables <- left_join(name_table,
cage_table,
by = "ID") # Use one of dplyr's *_join() functions
print(merged_tables)
## # A tibble: 7 x 3
## Name ID `Cage Rating`
## <chr> <chr> <dbl>
## 1 Harshita 001 7
## 2 Alex 002 6
## 3 Fred 003 7
## 4 Shannon 004 1
## 5 Jamison 005 2
## 6 Luis 006 7
## 7 Kennedy 007 4
Now that these tables are merged, we can determine who exactly recognized Nicholas Cage’s genius.
One need not stop at simply identifying Cage advocates by name. By relating additonal databases, we may:
Here, we merge Table C with the already merged Table A and Table B to create master_table.
master_table <- left_join(merged_tables,
exam_table,
by = "ID")
print(master_table)
## # A tibble: 7 x 4
## Name ID `Cage Rating` Score
## <chr> <chr> <dbl> <dbl>
## 1 Harshita 001 7 96
## 2 Alex 002 6 99
## 3 Fred 003 7 97
## 4 Shannon 004 1 NA
## 5 Jamison 005 2 41
## 6 Luis 006 7 75
## 7 Kennedy 007 4 79
In sum, these databases are only relatable through the use of unique IDs.
Ignoring the “who”, we can study the relationship between variables. The following attempts to model civil service exam scores in relation to appreciation for Nicholas Cage’s art.
head(master_table)
## # A tibble: 6 x 4
## Name ID `Cage Rating` Score
## <chr> <chr> <dbl> <dbl>
## 1 Harshita 001 7 96
## 2 Alex 002 6 99
## 3 Fred 003 7 97
## 4 Shannon 004 1 NA
## 5 Jamison 005 2 41
## 6 Luis 006 7 75
model <- lm(master_table$Score ~ master_table$`Cage Rating`)
print(model)
##
## Call:
## lm(formula = master_table$Score ~ master_table$`Cage Rating`)
##
## Coefficients:
## (Intercept) master_table$`Cage Rating`
## 32.690 8.814
Fascinating! There seems to be no significant relationship between utter devotion to Nicholas Cage and exam scores. Still, be careful out there.
Interestingly, many functions we’ve seen in “dplyr” are the same functions in SQL:
SELECT is similar to select()INNER JOIN and OUTER JOIN, are similar to inner_join() and outer_join()GROUP BY is simialr to group_by()However, we do not need to learn SQL (though it is recommended) to use it. Instead we may use standard “dplyr” functions that:
The first step to querying data is to connect to a database. Consider the following:
src_sqlite() for SQLitesrc_mysql() for MySQL and MariaDBsrc_postgres() for PostgreSQLvignette("databases", packages = "dplyr")First, we must install and load the “DBI” package, which helps R translate different languages like SQL.
if(!require(dbplyr)){install.packages("dbplyr")} # Package "DBI" installed with "dbplyr"
if(!require(RMySQL)){install.packages("RMySQL")}
if(!require(RSQLite)){install.packages("RSQLite")}
library(dbplyr)
library(RMySQL)
library(RSQLite)
The following uses a free practice SQL database, courtesy of RStudio.
To learn more about these arguments, view package documentation with the expression: ?dbplyr, or visit the package “dbplyr” vignette, which provides the following examples.
In lieu of accessing a live SQL database, we may use dbConnect() to connect to our machine’s internal memory.
library(dplyr)
con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")
Connecting the local memory to simulate a practice database, we may store nycflights13 using function copy_to().
if(!require(nycflights13)){install.packages("nycflights13")}
library(nycflights13)
copy_to(con, nycflights13::flights, "flights", # The vignette author emphasizes starting with the right indexing.
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
The database has no name, though. To reference it, use function tbl() and assignment, or <-.
flights_db <- tbl(con, "flights")
Now that you can reference flights_db by name, you can extract data from it like any other object in the environment.
names(flights_db)
class(flights_db)
print(flights_db)
Note that “dplyr” functions that have equivalent functions in SQL are far more likely to operate as expected.
However, “dbplyr” is optimized for select(), one of the most common tasks for analysts in generating SQL queries.
flights_db %>%
select(year, month, day, carrier, air_time) %>%
group_by(carrier) %>%
summarize(total_at = sum(air_time))
When you’ve finished with your queries, don’t forget to disconnect. As a courtesy, don’t hog others’ bandwidth.
dbDisconnect(conn = con)
Preparation: Resimulate the nycflights13 data if you disconnected during the previous section:
con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")
if(!require(dplyr)){install.packages("dplyr")}
if(!require(nycflights13)){install.packages("nycflights13")}
library(dplyr)
library(nycflights13)
con <- DBI::dbConnect(RMySQL::RMySQL(), dbname = ":memory:")
copy_to(con, nycflights13::flights, "flights", # The vignette author emphasizes starting with the right indexing.
temporary = FALSE,
indexes = list(
c("year", "month", "day"),
"carrier",
"tailnum",
"dest"
)
)
flights_db <- tbl(con, "flights")
Instructions: Use “dplyr” functions with object flights_db to determine the following summary statistics.
dep_delay, when grouping by carrier.distance among trips when grouping by month?dep_delay and arr_delay when grouping by month?