library(tidyverse)

## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.1.0     ✔ purrr   0.2.5
## ✔ tibble  1.4.2     ✔ dplyr   0.7.8
## ✔ tidyr   0.8.2     ✔ stringr 1.3.1
## ✔ readr   1.1.1     ✔ forcats 0.3.0

## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

Five Verbs

Dplyr mostly uses five “verbs.”

select - Pick columns
filter - Pick rows
mutate - Create new columns
arrange - sort
summarize + group_by - Create summary stats

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us ability to see everything.

glimpse(diamonds)

## Observations: 53,940
## Variables: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color   <fct> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <fct> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d. Cut, color and clarity should be simple character variables. Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
       color = as.character(color),
       clarity = as.character(clarity),
       ppc = price/carat) %>% 
glimpse()

## Observations: 53,920
## Variables: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very G...
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J",...
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI...
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400...

summary(d)

##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:53920       Length:53920       Length:53920      
##  1st Qu.:0.4000   Class :character   Class :character   Class :character  
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7977                                                           
##  3rd Qu.:1.0400                                                           
##  Max.   :5.0100                                                           
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 3.730  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3931   Mean   : 5.732  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##        y                z              ppc       
##  Min.   : 3.680   Min.   : 1.07   Min.   : 1051  
##  1st Qu.: 4.720   1st Qu.: 2.91   1st Qu.: 2478  
##  Median : 5.710   Median : 3.53   Median : 3495  
##  Mean   : 5.735   Mean   : 3.54   Mean   : 4008  
##  3rd Qu.: 6.540   3rd Qu.: 4.04   3rd Qu.: 4949  
##  Max.   :58.900   Max.   :31.80   Max.   :17829

Exercise

Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)

##    clarity             color                ppc       
##  Length:4780        Length:4780        Min.   : 5001  
##  Class :character   Class :character   1st Qu.: 5753  
##  Mode  :character   Mode  :character   Median : 6758  
##                                        Mean   : 7104  
##                                        3rd Qu.: 8091  
##                                        Max.   :17078

Example of group_by and summarize

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc),
            IQR_ppc = IQR(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))

Task 1.

Get the countyComplete dataset from the openIntro package.
Create a new dataset cc keeping the following variables. state, name, pop20101, bachelors, per_capita_income, area, density,
Create a new variable total_income using pop2010 and per_capita_income.
Glimpse your dataset and run a summary to make sure everything is OK.

library(openintro)
cc <- countyComplete %>%
  select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
  mutate(name = as.character(name),
         state = as.character(state),
         total_income = per_capita_income * pop2010)
glimpse(cc)

## Observations: 3,143
## Variables: 8
## $ name              <chr> "Autauga County", "Baldwin County", "Barbour...
## $ state             <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 2...
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16...
## $ area              <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622...
## $ density           <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 1...
## $ bachelors         <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16...
## $ total_income      <dbl> 1340700328, 4824372285, 435879875, 456420970...

Task 2.

Create a states dataset using group_by and summarize.
In this dataset create average per capita income for each state with two different methods.
SPCI1 is the simple average of the county per capita income values.
SPCI2 is constructed by adding up total income and population from the county values and dividing the two at the state level. Examine the difference between SPCI1 and SPCI2.

Answer

cc %>% 
  group_by(state) %>% 
  summarize(SPCI1 = sum(total_income)/sum(pop2010),
            SPCI2 = mean(total_income/pop2010)
            ) %>% 
  arrange(desc(SPCI1))-> states
glimpse(states)

## Observations: 51
## Variables: 3
## $ state <chr> "District of Columbia", "Connecticut", "Maryland", "New ...
## $ SPCI1 <dbl> 42078.00, 36792.41, 34907.87, 34853.30, 33982.91, 32293....
## $ SPCI2 <dbl> 42078.00, 34873.25, 31459.08, 34391.19, 33547.07, 25557....

Dplyr Basics

Five Verbs

Exercise

Solution

Exercise

Solution

Example of group_by and summarize

Task 1.

Task 2.

Answer