From Data to Plotting Using the Tidyverse

Adam Goodkind

R User Group - 1/25/2018

Preliminaries

How these slides were made

Required Installations

  • tidyverse
  • magrittr
install.packages(c("tidyverse", "magrittr"))
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.1
## Warning: Installed Rcpp (0.12.13) different from Rcpp used to build dplyr (0.12.11).
## Please reinstall dplyr to avoid random crashes or undefined behavior.
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Warning: package 'tidyr' was built under R version 3.4.1
## Warning: package 'readr' was built under R version 3.4.1
## Warning: package 'purrr' was built under R version 3.4.1
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats
library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract

The tidyverse

  • What is it?
    • “Opinionated” collection of R packages (open source)
    • Designed for data science
    • Emphasize consistency and transparency
    • Smart defaults
  • What’s included?

Getting some data

Importing data

# Base R
dat <- read.csv('example.csv')

# Tidyverse
dat <- read_csv('example.csv')
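  • read_csv() handles parsing errors and missing data more gracefully
  • read_csv() is typically faster than read.csv()
  • Both create a “data frame” by default (read_csv() returns a tibble, the tidyverse flavor of data frame)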
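
A minimal sketch of the comparison above. It assumes a small throwaway CSV written to a temporary file; the file and its columns are hypothetical and stand in for the `example.csv` on the slide.

```r
library(readr)

# Hypothetical two-row CSV written to a temp file, for illustration only
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,height", "Yoda,66", "BB8,NA"), tmp)

base_dat <- read.csv(tmp)                      # base R reader
tidy_dat <- read_csv(tmp, col_types = cols())  # readr reader; cols() lets readr guess types quietly

# Both results are data frames; read_csv() additionally returns a tibble
is.data.frame(base_dat)
is.data.frame(tidy_dat)
inherits(tidy_dat, "tbl_df")

# Both readers treat the literal string NA as a missing value
is.na(base_dat$height[2])
is.na(tidy_dat$height[2])
```

Printing `tidy_dat` shows the column types under the header, which is one of the conveniences a tibble adds over a plain data frame.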

Pre-installed datasets

data()

[Image: list of all available datasets]

The starwars dataset

starwars
## # A tibble: 87 x 13
##                  name height  mass    hair_color  skin_color eye_color
##                 <chr>  <int> <dbl>         <chr>       <chr>     <chr>
##  1     Luke Skywalker    172    77         blond        fair      blue
##  2              C-3PO    167    75          <NA>        gold    yellow
##  3              R2-D2     96    32          <NA> white, blue       red
##  4        Darth Vader    202   136          none       white    yellow
##  5        Leia Organa    150    49         brown       light     brown
##  6          Owen Lars    178   120   brown, grey       light      blue
##  7 Beru Whitesun lars    165    75         brown       light      blue
##  8              R5-D4     97    32          <NA>  white, red       red
##  9  Biggs Darklighter    183    84         black       light     brown
## 10     Obi-Wan Kenobi    182    77 auburn, white        fair blue-gray
## # ... with 77 more rows, and 7 more variables: birth_year <dbl>,
## #   gender <chr>, homeworld <chr>, species <chr>, films <list>,
## #   vehicles <list>, starships <list>

Viewing datasets (data frames)

> starwars
> print(starwars)
> str(starwars)
> head(starwars) # first 6 rows (the default)

summary(starwars) 
##      name               height           mass             homeworld 
##  Length:87          Min.   : 66.0   Min.   :  15.00   Naboo    :11  
##  Class :character   1st Qu.:167.0   1st Qu.:  55.60   Tatooine :10  
##  Mode  :character   Median :180.0   Median :  79.00   Alderaan : 3  
##                     Mean   :174.4   Mean   :  97.31   Coruscant: 3  
##                     3rd Qu.:191.0   3rd Qu.:  84.50   Kamino   : 3  
##                     Max.   :264.0   Max.   :1358.00   (Other)  :47  
##                     NA's   :6       NA's   :28        NA's     :10

Viewing datasets in RStudio

Click on the variable name in the Environment pane

This executes the following code:

View(mtcars)

The tidyverse

The pipe %>%

Sends the output of the left-hand side (LHS) expression to the first argument of the right-hand side (RHS) function

sum(1:8) %>%
  sqrt()
## [1] 6
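
To make the rule concrete, here is a small sketch showing that a piped chain and the equivalent nested call give the same result. The variable names `nested` and `piped` are illustrative only.

```r
library(magrittr)

# Nested form: read from the inside out
nested <- sqrt(sum(1:8))

# Piped form: read left to right; each result becomes the
# first argument of the next function
piped <- 1:8 %>%
  sum() %>%
  sqrt()

identical(nested, piped)
```

The piped form tends to pay off as chains get longer, because the steps appear in the order they run.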

Principles of tidy-ness

  • Variables make up the columns
  • Observations make up the rows
  • Values go into cells

Making data tidy

  • Human-readable data

  • Tidying up
# table4a is an example dataset that ships with tidyr
table4a %>% 
  gather(`1999`, `2000`, key = "year", value = "cases")
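
A self-contained sketch of the same tidying step, assuming a tiny hypothetical wide table (`wide`, with made-up countries and counts) so the before and after shapes are easy to inspect. It also shows `pivot_longer()`, which tidyr 1.0.0 and later provide for the same job; that assumes a reasonably recent tidyr is installed.

```r
library(tidyr)

# Hypothetical "human-readable" table: one column per year
wide <- data.frame(
  country = c("A", "B"),
  `1999`  = c(10, 20),
  `2000`  = c(30, 40),
  check.names = FALSE  # keep the year column names as-is
)

# gather(): the year columns collapse into a key/value pair
long <- gather(wide, `1999`, `2000`, key = "year", value = "cases")

# pivot_longer(): same reshaping with the newer interface (tidyr >= 1.0.0)
long2 <- pivot_longer(wide, cols = c(`1999`, `2000`),
                      names_to = "year", values_to = "cases")

dim(wide)   # 2 rows, 3 columns
dim(long)   # 4 rows, 3 columns
```

Each row of `long` is now one observation (a country in a year), which matches the tidy principles listed earlier.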

Data Manipulation

Selecting data

starwars %>% filter(height < 90)
## # A tibble: 3 x 13
##                    name height  mass hair_color skin_color eye_color
##                   <chr>  <int> <dbl>      <chr>      <chr>     <chr>
## 1                  Yoda     66    17      white      green     brown
## 2 Wicket Systri Warrick     88    20      brown      brown     brown
## 3         Ratts Tyerell     79    15       none grey, blue   unknown
## # ... with 7 more variables: birth_year <dbl>, gender <chr>,
## #   homeworld <fctr>, species <chr>, films <list>, vehicles <list>,
## #   starships <list>
starwars %>%
  filter(height < 90) %>%
  select(name, gender, species, height)
## # A tibble: 3 x 4
##                    name gender        species height
##                   <chr>  <chr>          <chr>  <int>
## 1                  Yoda   male Yoda's species     66
## 2 Wicket Systri Warrick   male           Ewok     88
## 3         Ratts Tyerell   male         Aleena     79

Your turn

  • How would I select only the characters with red eye color, and just display their names?
starwars %>% filter(eye_color == "red") %>% select(name)
## # A tibble: 5 x 1
##          name
##         <chr>
## 1       R2-D2
## 2       R5-D4
## 3       IG-88
## 4       Bossk
## 5 Nute Gunray

Ordering data

filter(starwars, height < 90) %>%
  select(name, gender, species, height) %>%
  arrange(height)
## # A tibble: 3 x 4
##                    name gender        species height
##                   <chr>  <chr>          <chr>  <int>
## 1                  Yoda   male Yoda's species     66
## 2         Ratts Tyerell   male         Aleena     79
## 3 Wicket Systri Warrick   male           Ewok     88

Your turn

  • How would I arrange all of the characters in alphabetical order by name?
starwars %>% arrange(name)
## # A tibble: 87 x 13
##                   name height  mass hair_color          skin_color
##                  <chr>  <int> <dbl>      <chr>               <chr>
##  1              Ackbar    180    83       none        brown mottle
##  2          Adi Gallia    184    50       none                dark
##  3    Anakin Skywalker    188    84      blond                fair
##  4        Arvel Crynyd     NA    NA      brown                fair
##  5         Ayla Secura    178    55       none                blue
##  6 Bail Prestor Organa    191    NA      black                 tan
##  7       Barriss Offee    166    50      black              yellow
##  8                 BB8     NA    NA       none                none
##  9      Ben Quadinaros    163    65       none grey, green, yellow
## 10  Beru Whitesun lars    165    75      brown               light
## # ... with 77 more rows, and 8 more variables: eye_color <chr>,
## #   birth_year <dbl>, gender <chr>, homeworld <fctr>, species <chr>,
## #   films <list>, vehicles <list>, starships <list>

Summarizing data

starwars %>%
  na.omit() %>%
  group_by(species) %>%
  summarize(avg_mass = mean(mass))
## # A tibble: 11 x 2
##         species  avg_mass
##           <chr>     <dbl>
##  1       Cerean  82.00000
##  2         Ewok  20.00000
##  3       Gungan  66.00000
##  4        Human  81.01111
##  5      Kel Dor  80.00000
##  6     Mirialan  53.10000
##  7 Mon Calamari  83.00000
##  8   Trandoshan 113.00000
##  9      Twi'lek  55.00000
## 10      Wookiee 112.00000
## 11       Zabrak  80.00000
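
Note that `na.omit()` drops every row that has a missing value in any column, not just in `mass`, which is why so few species survive. The sketch below contrasts that with `mean(..., na.rm = TRUE)`, using a small hypothetical tibble (`toy`) so the effect is easy to follow.

```r
library(dplyr)

# Hypothetical data: one missing mass, two missing hair colors
toy <- tibble(
  species = c("Human", "Human", "Droid", "Droid"),
  mass    = c(80, NA, 32, 75),
  hair    = c("brown", "black", NA, NA)
)

# na.omit() removes any row with an NA anywhere, so both Droids vanish
strict <- toy %>%
  na.omit() %>%
  group_by(species) %>%
  summarize(avg_mass = mean(mass))

# na.rm = TRUE only skips missing values of the column being averaged
lenient <- toy %>%
  group_by(species) %>%
  summarize(avg_mass = mean(mass, na.rm = TRUE))

strict
lenient
```

Which behavior is right depends on the analysis; the point is only that the two approaches can return different groups and different averages.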

Adding summary data to our data frame

starwars %>%
  na.omit() %>%
  group_by(species, gender) %>%
  select(name, gender, species, mass) %>%
  mutate(avg_mass = mean(mass))
## # A tibble: 29 x 5
## # Groups:   species, gender [12]
##                  name gender species  mass  avg_mass
##                 <chr>  <chr>   <chr> <dbl>     <dbl>
##  1     Luke Skywalker   male   Human    77  85.94667
##  2        Darth Vader   male   Human   136  85.94667
##  3        Leia Organa female   Human    49  56.33333
##  4          Owen Lars   male   Human   120  85.94667
##  5 Beru Whitesun lars female   Human    75  56.33333
##  6  Biggs Darklighter   male   Human    84  85.94667
##  7     Obi-Wan Kenobi   male   Human    77  85.94667
##  8   Anakin Skywalker   male   Human    84  85.94667
##  9          Chewbacca   male Wookiee   112 112.00000
## 10           Han Solo   male   Human    80  85.94667
## # ... with 19 more rows
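
The difference between `summarize()` and `mutate()` on grouped data can be sketched with a small hypothetical tibble (`toy`, with made-up names and masses): `summarize()` returns one row per group, while `mutate()` keeps every row and repeats the group value.

```r
library(dplyr)

# Hypothetical data for illustration
toy <- tibble(
  name    = c("a", "b", "c"),
  species = c("Human", "Human", "Droid"),
  mass    = c(80, 60, 32)
)

# One row per species
summarized <- toy %>%
  group_by(species) %>%
  summarize(avg_mass = mean(mass))

# One row per character, with the species average attached
mutated <- toy %>%
  group_by(species) %>%
  mutate(avg_mass = mean(mass)) %>%
  ungroup()  # drop the grouping so later steps are not grouped by accident

nrow(summarized)
nrow(mutated)
```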

Your turn

  • How would I add a column that is the sum of height and mass?
starwars %>%
  mutate(height_plus_mass = height + mass) %>%
  select(name, height, mass, height_plus_mass)
## # A tibble: 87 x 4
##                  name height  mass height_plus_mass
##                 <chr>  <int> <dbl>            <dbl>
##  1     Luke Skywalker    172    77              249
##  2              C-3PO    167    75              242
##  3              R2-D2     96    32              128
##  4        Darth Vader    202   136              338
##  5        Leia Organa    150    49              199
##  6          Owen Lars    178   120              298
##  7 Beru Whitesun lars    165    75              240
##  8              R5-D4     97    32              129
##  9  Biggs Darklighter    183    84              267
## 10     Obi-Wan Kenobi    182    77              259
## # ... with 77 more rows

Different types of variables

Factor                    Scalar/Numeric
discrete                  continuous
Example:                  Example:
Human, Droid, Wookiee     0.7, 1.2, 3.4

Can convert between types

nums <- c(0.7, 1.2, 3.4)
factor_nums <- as.factor(nums)
levels(factor_nums)
## [1] "0.7" "1.2" "3.4"
nums + 1
## [1] 1.7 2.2 4.4
factor_nums + 1
## Warning in Ops.factor(factor_nums, 1): '+' not meaningful for factors
## [1] NA NA NA
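
Converting back from a factor to numbers has a common pitfall: `as.numeric()` on a factor returns the internal level codes, not the original values. A short sketch, reusing the `nums` example from the slide (the names `codes` and `recovered` are illustrative):

```r
nums <- c(0.7, 1.2, 3.4)
factor_nums <- as.factor(nums)

# Level codes (position of each value among the levels), not the values
codes <- as.numeric(factor_nums)

# Going through character first recovers the original numbers
recovered <- as.numeric(as.character(factor_nums))

codes
recovered
```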

Plotting our data

Basics of ggplot

  • Creates a standard grammar of graphics
  • Separates the data variables from the appearance variables
# Base R
plot(starwars$height, type='p', col='red', pch=16)

Elements of a ggplot function

  • First part is the data and the aesthetic mappings (which variable goes on which axis)
ggplot(data=starwars, aes(x=height, y=mass)) + ...
  • Then we decide how the data will appear
... + geom_point(), + geom_boxplot(), etc.
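
The two parts can be sketched as separate steps. This assumes the `starwars` data from dplyr; the object names `base_plot` and `scatter` are illustrative only.

```r
library(ggplot2)

# Step 1: data plus aesthetic mapping; printed on its own this draws empty axes
base_plot <- ggplot(data = dplyr::starwars, aes(x = height, y = mass))

# Step 2: add a geom to decide how the data will appear
scatter <- base_plot + geom_point()

# Printing the object renders the plot (rows with missing values are
# dropped with a warning)
scatter
```

Because a plot is an ordinary object, the same `base_plot` can be reused with different geoms without repeating the data and mapping.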

Putting a plot together and highlighting factors

ggplot(starwars, aes(x=height, y=mass)) + 
  geom_point(aes(color=gender), size=5) 

Plots can also summarize data

ggplot(subset(starwars, species %in% c('Droid', 'Human', "Gungan")),
       aes(x=species, y=height)) + 
  geom_boxplot()

Adding a trendline is often useful

ggplot(starwars, aes(x=height, y=mass)) + 
  geom_point(size=5) +
  stat_smooth(method='lm')
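
A linear trendline is sensitive to extreme values, and the `summary()` output earlier shows a maximum mass of 1358. The sketch below refits the line after dropping that extreme; the cutoff of 1000 is an arbitrary assumption chosen for illustration, not a recommendation.

```r
library(dplyr)
library(ggplot2)

# Keep characters below an assumed cutoff; filter() also drops rows
# where mass is NA
no_outlier <- starwars %>%
  filter(mass < 1000)

p <- ggplot(no_outlier, aes(x = height, y = mass)) +
  geom_point(size = 5) +
  stat_smooth(method = "lm")

p
```

Comparing this plot with the one above gives a quick sense of how much a single point can pull the fitted line.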

We can even pipe from dplyr to ggplot

starwars %>%
  filter(species == 'Human' & gender %in% c('male', 'female')) %>%
  ggplot(aes(mass)) +
  geom_histogram() +
  facet_grid(. ~ gender)

Your turn

  • Use geom_bar to plot the count of different species with a home world of Naboo or Tatooine
starwars %>%
  filter(homeworld %in% c("Naboo", "Tatooine")) %>%
  ggplot(aes(species)) +
  geom_bar() +
  facet_grid(. ~ homeworld)

Takeaways

  • The tidyverse is useful and accessible
  • The tidyverse has smart defaults
  • The tidyverse can standardize workflow from data import to analysis/visualization
  • The pipe (%>%) can link many functions
  • Tidy data is useful for more complex statistics and modeling
  • ggplot makes plotting more intuitive
    • Allows for focused control of each element (data, plot type, color, etc.)
    • Highlight the important relationships
  • R can be FUN!
  • Questions/comments: a.goodkind@u.northwestern.edu