Dplyr Basics

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Five Verbs

Dplyr mostly uses five “verbs.”

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d.

Use filter to eliminate any diamonds with x, y, or z = 0.

Cut, color and clarity should be simple character variables instead of factors.

Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
       color = as.character(color),
       clarity = as.character(clarity),
       ppc = price/carat) %>% 
glimpse()
## Rows: 53,920
## Columns: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "V…
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "F…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS2…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…
summary(d)
##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:53920       Length:53920       Length:53920      
##  1st Qu.:0.4000   Class :character   Class :character   Class :character  
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7977                                                           
##  3rd Qu.:1.0400                                                           
##  Max.   :5.0100                                                           
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 3.730  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3931   Mean   : 5.732  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##        y                z              ppc       
##  Min.   : 3.680   Min.   : 1.07   Min.   : 1051  
##  1st Qu.: 4.720   1st Qu.: 2.91   1st Qu.: 2478  
##  Median : 5.710   Median : 3.53   Median : 3495  
##  Mean   : 5.735   Mean   : 3.54   Mean   : 4008  
##  3rd Qu.: 6.540   3rd Qu.: 4.04   3rd Qu.: 4949  
##  Max.   :58.900   Max.   :31.80   Max.   :17829

Exercise

Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)
##    clarity             color                ppc       
##  Length:4780        Length:4780        Min.   : 5001  
##  Class :character   Class :character   1st Qu.: 5753  
##  Mode  :character   Mode  :character   Median : 6758  
##                                        Mean   : 7104  
##                                        3rd Qu.: 8091  
##                                        Max.   :17078

Example of group_by and summarize

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc),
            IQR_ppc = IQR(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))
## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.