From Data to Plotting Using the Tidyverse

Adam Goodkind

1/25/2018

Preliminaries

How these slides were made

Required Installations

  • tidyverse
  • magrittr
install.packages(c("tidyverse", "magrittr"))
library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

The tidyverse

  • What is it?
  • What’s included?

Data for today

Pre-installed datasets

data()

all datasets list

The mtcars dataset

mtcars
##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2

Viewing datasets

mtcars
print(mtcars)
str(mtcars)
head(mtcars) # first 6 rows by default

glimpse(mtcars) 
## Observations: 32
## Variables: 11
## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19....
## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, ...
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 1...
## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, ...
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.9...
## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3...
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 2...
## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, ...
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, ...
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, ...
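
A few more base R helpers are handy alongside the viewers above. This is a minimal sketch using only functions that ship with R; the choice of n = 3 is arbitrary.

```r
# Inspect the shape and column names before printing everything
dim(mtcars)          # number of rows, number of columns
names(mtcars)        # column names

head(mtcars, n = 3)  # first 3 rows instead of the default 6
tail(mtcars, n = 3)  # last 3 rows
summary(mtcars$mpg)  # min, quartiles, mean and max of one column
```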

Viewing datasets in RStudio

Click on the dataset's name in the Environment pane

This runs the following code:

View(mtcars)

The tidyverse

What is the tidyverse?

  • Designed (originally) by Hadley Wickham
  • A suite of tools that emphasize consistency and transparency
  • Smart defaults
  • Open source and constantly expanded/improved
  • Packages for importing, formatting, analyzing, modeling and plotting
  • All of these can be connected(!)

The pipe %>%

Sends the value of the left-hand side (LHS) expression to the first argument of the right-hand side (RHS) function

sum(1:8) %>%
  sqrt()
## [1] 6
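
To make the pipe concrete, here is a short sketch comparing a nested call with its piped equivalent. The variable names (nested, piped, rounded) are made up for illustration.

```r
library(magrittr)

# Nested call: read from the inside out
nested <- sqrt(sum(1:8))

# Piped call: read from left to right
piped <- 1:8 %>% sum() %>% sqrt()

identical(nested, piped)  # both compute the same value

# The pipe fills the first argument; other arguments are passed as usual
rounded <- c(1.234, 5.678) %>% round(digits = 1)
rounded
```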

Principles of tidy-ness

  • Each type of observation gets its own data frame
  • Variables make up the columns
  • Observations make up the rows
  • Values go into cells

Making data tidy

  • Human-readable data
  • Tidying up
# table4a ships with tidyr; it has one column per year (`1999`, `2000`)
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
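
In tidyr 1.0.0 and later, pivot_longer() is the recommended replacement for gather(). A sketch of the same reshaping, using tidyr's built-in table4a dataset as the wide table (an assumption; substitute your own data). The name long is only for illustration.

```r
library(tidyr)

# table4a: one row per country, one column per year
table4a

# Stack the two year columns into a "year" column and a "cases" column
long <- pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),
  names_to = "year",
  values_to = "cases"
)
long
```

Each row of the result is now one observation: a country in a given year.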

Data Manipulation

Selecting data

filter(mtcars, mpg > 30)
##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
## 2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
## 3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
## 4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
filter(mtcars, mpg > 30) %>%
  select(mpg, cyl, hp)
##    mpg cyl  hp
## 1 32.4   4  66
## 2 30.4   4  52
## 3 33.9   4  65
## 4 30.4   4 113
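
filter() and select() accept more than one condition or column. A small sketch, with thresholds and variable names chosen only for illustration:

```r
library(dplyr)

# Comma-separated conditions are combined with AND
efficient_manual <- filter(mtcars, mpg > 30, am == 1)

# Use | for OR
extreme <- filter(mtcars, mpg > 30 | hp > 300)

# select() can also drop columns with a minus sign
slim <- select(efficient_manual, -vs, -am)

nrow(efficient_manual)
nrow(extreme)
names(slim)
```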

Ordering data

filter(mtcars, mpg > 30) %>%
  select(mpg, cyl, hp) %>%
  arrange(hp)
##    mpg cyl  hp
## 1 30.4   4  52
## 2 33.9   4  65
## 3 32.4   4  66
## 4 30.4   4 113
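
arrange() sorts in ascending order by default. A sketch of descending order and tie-breaking, reusing the pipeline above (by_hp and tie_break are illustrative names):

```r
library(dplyr)

# desc() reverses the order: highest horsepower first
by_hp <- mtcars %>%
  filter(mpg > 30) %>%
  select(mpg, cyl, hp) %>%
  arrange(desc(hp))
by_hp

# Ties in the first column are broken by the next one
tie_break <- mtcars %>%
  filter(mpg > 30) %>%
  select(mpg, cyl, hp) %>%
  arrange(mpg, desc(hp))
tie_break
```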

Summarizing data

filter(mtcars, mpg > 20) %>%
  group_by(am) %>%
  summarize(avg_hp = mean(hp))
## # A tibble: 2 x 2
##      am avg_hp
##   <dbl>  <dbl>
## 1     0   91.0
## 2     1   87.5
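
summarize() can compute several summaries at once. A sketch extending the example above; the column names n and max_hp are my own additions:

```r
library(dplyr)

stats <- mtcars %>%
  filter(mpg > 20) %>%
  group_by(am) %>%
  summarize(
    n = n(),            # number of cars in each group
    avg_hp = mean(hp),  # same average as above
    max_hp = max(hp)    # largest horsepower in each group
  )
stats
```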

We can also add a summary to our data frame

filter(mtcars, mpg > 31) %>%
  select(mpg, cyl, hp) %>%
  mutate(avg_hp = mean(hp))
##    mpg cyl hp avg_hp
## 1 32.4   4 66   65.5
## 2 33.9   4 65   65.5
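
Combined with group_by(), mutate() adds a per-group summary to every row. A sketch; the column names avg_hp_cyl and hp_vs_avg are made up for illustration:

```r
library(dplyr)

with_group_avg <- mtcars %>%
  select(mpg, cyl, hp) %>%
  group_by(cyl) %>%
  mutate(
    avg_hp_cyl = mean(hp),        # average hp within each cylinder group
    hp_vs_avg = hp - avg_hp_cyl   # new columns can build on earlier ones
  ) %>%
  ungroup()                       # drop the grouping when done

head(with_group_avg)
```

Unlike summarize(), this keeps all 32 rows.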

Summarizing data

mtcars %>%
  count(cyl)
## # A tibble: 3 x 2
##     cyl     n
##   <dbl> <int>
## 1     4    11
## 2     6     7
## 3     8    14
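
count() is shorthand for group_by() followed by summarize(n = n()). A sketch showing the equivalence plus two common variations (variable names are illustrative):

```r
library(dplyr)

by_count <- mtcars %>% count(cyl)

# The long-hand version gives the same counts
by_hand <- mtcars %>%
  group_by(cyl) %>%
  summarize(n = n())

# sort = TRUE puts the largest group first
sorted <- mtcars %>% count(cyl, sort = TRUE)
sorted

# Counting combinations of two variables
combos <- mtcars %>% count(cyl, am)
combos
```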

Fun with strings of text

library(stringr)
starwars <- dplyr::starwars
glimpse(starwars)
## Observations: 87
## Variables: 13
## $ name       <chr> "Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader", ...
## $ height     <int> 172, 167, 96, 202, 150, 178, 165, 97, 183, 182, 188...
## $ mass       <dbl> 77.0, 75.0, 32.0, 136.0, 49.0, 120.0, 75.0, 32.0, 8...
## $ hair_color <chr> "blond", NA, NA, "none", "brown", "brown, grey", "b...
## $ skin_color <chr> "fair", "gold", "white, blue", "white", "light", "l...
## $ eye_color  <chr> "blue", "yellow", "red", "yellow", "brown", "blue",...
## $ birth_year <dbl> 19.0, 112.0, 33.0, 41.9, 19.0, 52.0, 47.0, NA, 24.0...
## $ gender     <chr> "male", NA, NA, "male", "female", "male", "female",...
## $ homeworld  <chr> "Tatooine", "Tatooine", "Naboo", "Tatooine", "Alder...
## $ species    <chr> "Human", "Droid", "Droid", "Human", "Human", "Human...
## $ films      <list> [<"Revenge of the Sith", "Return of the Jedi", "Th...
## $ vehicles   <list> [<"Snowspeeder", "Imperial Speeder Bike">, <>, <>,...
## $ starships  <list> [<"X-wing", "Imperial shuttle">, <>, <>, "TIE Adva...
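
A few stringr functions in action. This is a sketch on a hand-typed vector of four names copied from the glimpse() output above, so it does not depend on the installed version of the starwars data; sw_names is an illustrative name.

```r
library(stringr)

sw_names <- c("Luke Skywalker", "C-3PO", "R2-D2", "Darth Vader")

str_length(sw_names)             # number of characters in each name
str_to_upper(sw_names)           # convert to upper case
str_detect(sw_names, "\\d")      # TRUE where the name contains a digit
str_replace(sw_names, " ", "_")  # replace the first space with an underscore
str_subset(sw_names, "-")        # keep only names containing a hyphen
```

All stringr functions start with str_ and take the string as their first argument, so they work well with the pipe.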

Applying a function to each column (or row)

mtcars %>% 
  map(sum)
## $mpg
## [1] 642.9
## 
## $cyl
## [1] 198
## 
## $disp
## [1] 7383.1
## 
## $hp
## [1] 4694
## 
## $drat
## [1] 115.09
## 
## $wt
## [1] 102.952
## 
## $qsec
## [1] 571.16
## 
## $vs
## [1] 14
## 
## $am
## [1] 13
## 
## $gear
## [1] 118
## 
## $carb
## [1] 90
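
map() always returns a list. When every result is a single number, map_dbl() returns a named numeric vector instead, and pmap_dbl() is one way to work row by row. A sketch; the variable names and the hp-per-mpg ratio are only for illustration:

```r
library(purrr)

# One sum per column, as a named numeric vector rather than a list
col_sums <- map_dbl(mtcars, sum)
col_sums[["hp"]]

# Formula shorthand: .x stands for the current column
col_ranges <- map_dbl(mtcars, ~ max(.x) - min(.x))
col_ranges[["mpg"]]

# Row-wise: pmap_dbl() passes the columns of each row as named arguments
hp_per_mpg <- pmap_dbl(mtcars[c("mpg", "hp")], function(mpg, hp) hp / mpg)
head(hp_per_mpg)
```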

Plotting our data

Basics of ggplot

  • Creates a standard grammar of graphics
  • Separates the data from the appearance

In base R

plot(mtcars$mpg, type='p', col='red')

Elements of a ggplot function

  • First part is the data and the aesthetic mappings (which variable goes where)
ggplot(data=mtcars, aes(x=mpg, y=hp)) + ...
  • Then we add a geom to decide how the data will appear
... + geom_point()

Putting a plot together and highlighting factors

ggplot(mtcars, aes(x=mpg, y=hp)) + 
  geom_point(aes(color=factor(am), size=wt))
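
More layers can be added with + in the same way. A sketch that adds readable labels to the plot above; the label text is my own wording, based on the variable descriptions in ?mtcars.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point(aes(color = factor(am), size = wt)) +
  labs(
    x = "Miles per gallon",
    y = "Gross horsepower",
    color = "Transmission (0 = automatic, 1 = manual)",
    size = "Weight (1000 lbs)"
  )

p  # printing the object draws the plot
```

Storing the plot in a variable makes it easy to keep adding layers later.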

Plots can also summarize data

ggplot(mtcars, aes(x=factor(gear), y=disp)) + 
  geom_boxplot()
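
The box in each boxplot spans the first to third quartile, with a line at the median. A sketch computing the same numbers with dplyr (disp_by_gear and its column names are illustrative):

```r
library(dplyr)

disp_by_gear <- mtcars %>%
  group_by(gear) %>%
  summarize(
    q1 = quantile(disp, 0.25),   # bottom edge of the box
    median = median(disp),       # line inside the box
    q3 = quantile(disp, 0.75)    # top edge of the box
  )
disp_by_gear
```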

Adding a trendline is often useful

ggplot(mtcars, aes(x=mpg, y=hp)) + 
  geom_point() +
  stat_smooth(method='lm')
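
The line drawn with method='lm' is an ordinary linear regression of y on x. A sketch that fits the same model directly to read off its coefficients, and hides the confidence band with se = FALSE (fit and p are illustrative names):

```r
library(ggplot2)

# Fit the model behind the trendline: hp as a linear function of mpg
fit <- lm(hp ~ mpg, data = mtcars)
coef(fit)  # intercept and slope of the line

# se = FALSE removes the shaded confidence band around the line
p <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point() +
  stat_smooth(method = "lm", se = FALSE)
p
```

The slope is negative: cars with higher mpg tend to have lower horsepower.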