Dplyr Basics

library(tidyverse)
## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(openintro)
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
## 
##     diamonds
## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Five Verbs

Dplyr mostly uses five “verbs.”

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

glimpse(diamonds)
## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <fct> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <fct> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d. Cut, color and clarity should be simple character variables. Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
       color = as.character(color),
       clarity = as.character(clarity),
       ppc = price/carat) %>% 
glimpse()
## Observations: 53,920
## Variables: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very G...
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J",...
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400...
summary(d)
##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:53920       Length:53920       Length:53920      
##  1st Qu.:0.4000   Class :character   Class :character   Class :character  
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7977                                                           
##  3rd Qu.:1.0400                                                           
##  Max.   :5.0100                                                           
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 3.730  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3931   Mean   : 5.732  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##        y                z              ppc       
##  Min.   : 3.680   Min.   : 1.07   Min.   : 1051  
##  1st Qu.: 4.720   1st Qu.: 2.91   1st Qu.: 2478  
##  Median : 5.710   Median : 3.53   Median : 3495  
##  Mean   : 5.735   Mean   : 3.54   Mean   : 4008  
##  3rd Qu.: 6.540   3rd Qu.: 4.04   3rd Qu.: 4949  
##  Max.   :58.900   Max.   :31.80   Max.   :17829

Exercise

Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)
##    clarity             color                ppc       
##  Length:4780        Length:4780        Min.   : 5001  
##  Class :character   Class :character   1st Qu.: 5753  
##  Mode  :character   Mode  :character   Median : 6758  
##                                        Mean   : 7104  
##                                        3rd Qu.: 8091  
##                                        Max.   :17078

Example of group_by and summarize

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc),
            IQR_ppc = IQR(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))

Task 1.

library(openintro)
cc <- countyComplete %>%
  select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
  mutate(name = as.character(name),
         state = as.character(state),
         total_income = per_capita_income * pop2010)
glimpse(cc)
## Observations: 3,143
## Variables: 8
## $ name              <chr> "Autauga County", "Baldwin County", "Barbour...
## $ state             <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 2...
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16...
## $ area              <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622...
## $ density           <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 1...
## $ bachelors         <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16...
## $ total_income      <dbl> 1340700328, 4824372285, 435879875, 456420970...

Task 2.

Answer

cc %>% 
  group_by(state) %>% 
  summarize(SPCI1 = sum(total_income)/sum(pop2010),
            SPCI2 = mean(total_income/pop2010)
            ) %>% 
  arrange(desc(SPCI1))-> states
glimpse(states)
## Observations: 51
## Variables: 3
## $ state <chr> "District of Columbia", "Connecticut", "Maryland", "New ...
## $ SPCI1 <dbl> 42078.00, 36792.41, 34907.87, 34853.30, 33982.91, 32293....
## $ SPCI2 <dbl> 42078.00, 34873.25, 31459.08, 34391.19, 33547.07, 25557....