Dplyr Basics

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Five Verbs

Dplyr mostly uses five “verbs.”

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

glimpse(diamonds)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d.

Use filter to eliminate any diamonds with x, y, or z = 0.

Cut, color and clarity should be simple character variables instead of factors.

Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
       color = as.character(color),
       clarity = as.character(clarity),
       ppc = price/carat) %>% 
glimpse()
## Rows: 53,920
## Columns: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "V…
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "F…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS2…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…
summary(d)
##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:53920       Length:53920       Length:53920      
##  1st Qu.:0.4000   Class :character   Class :character   Class :character  
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7977                                                           
##  3rd Qu.:1.0400                                                           
##  Max.   :5.0100                                                           
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 3.730  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3931   Mean   : 5.732  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##        y                z              ppc       
##  Min.   : 3.680   Min.   : 1.07   Min.   : 1051  
##  1st Qu.: 4.720   1st Qu.: 2.91   1st Qu.: 2478  
##  Median : 5.710   Median : 3.53   Median : 3495  
##  Mean   : 5.735   Mean   : 3.54   Mean   : 4008  
##  3rd Qu.: 6.540   3rd Qu.: 4.04   3rd Qu.: 4949  
##  Max.   :58.900   Max.   :31.80   Max.   :17829

Exercise

Drop the variables x, y, and z from d. Run str to make sure you succeeded.

Solution

d = d %>% 
  select(-c(x,y,z))

str(d)
## tibble [53,920 × 8] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53920] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr [1:53920] "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr [1:53920] "E" "E" "E" "I" ...
##  $ clarity: chr [1:53920] "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num [1:53920] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53920] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53920] 326 326 327 334 335 336 336 337 337 338 ...
##  $ ppc    : num [1:53920] 1417 1552 1422 1152 1081 ...

Exercise

Use dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)
##    clarity             color                ppc       
##  Length:4780        Length:4780        Min.   : 5001  
##  Class :character   Class :character   1st Qu.: 5753  
##  Mode  :character   Mode  :character   Median : 6758  
##                                        Mean   : 7104  
##                                        3rd Qu.: 8091  
##                                        Max.   :17078

Example of group_by and summarize

Create a new dataframe dcells. There is a row for every combination of cut, color, and clarity. The other variables with obvious definitions are mean_ppc, mean_carat, sd_ppc, and sd_carat. Sort the dataframe in descending order of mean_ppc.

Solution

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))
## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.

Structure

Run str() on the dataframe.

Solution

str(dcells)
## tibble [276 × 8] (S3: tbl_df/tbl/data.frame)
##  $ cut       : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
##  $ color     : chr [1:276] "D" "D" "D" "D" ...
##  $ clarity   : chr [1:276] "IF" "IF" "IF" "IF" ...
##  $ count     : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
##  $ mean_ppc  : num [1:276] 11347 10877 10099 9034 5817 ...
##  $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
##  $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
##  $ sd_ppc    : num [1:276] 4800 5248 5960 4055 3565 ...

Exercise

Rerun the code to create dcells2 leaving out ungroup(). Run str() on decells2 and compare.

Solution

dcells2 <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc)) %>%
  arrange(desc(mean_ppc))
## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.
str(dcells2)
## gropd_df [276 × 8] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ cut       : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
##  $ color     : chr [1:276] "D" "D" "D" "D" ...
##  $ clarity   : chr [1:276] "IF" "IF" "IF" "IF" ...
##  $ count     : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
##  $ mean_ppc  : num [1:276] 11347 10877 10099 9034 5817 ...
##  $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
##  $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
##  $ sd_ppc    : num [1:276] 4800 5248 5960 4055 3565 ...
##  - attr(*, "groups")= tibble [35 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ cut  : chr [1:35] "Fair" "Fair" "Fair" "Fair" ...
##   ..$ color: chr [1:35] "D" "E" "F" "G" ...
##   ..$ .rows: list<int> [1:35] 
##   .. ..$ : int [1:8] 7 21 29 80 87 125 146 190
##   .. ..$ : int [1:7] 9 66 104 136 157 177 276
##   .. ..$ : int [1:8] 11 13 44 100 108 164 175 275
##   .. ..$ : int [1:8] 25 129 138 151 185 200 234 273
##   .. ..$ : int [1:7] 40 82 91 93 123 148 268
##   .. ..$ : int [1:7] 32 121 173 174 196 219 269
##   .. ..$ : int [1:7] 101 205 209 226 243 263 271
##   .. ..$ : int [1:8] 2 55 76 81 154 193 197 248
##   .. ..$ : int [1:8] 51 57 72 189 202 211 230 238
##   .. ..$ : int [1:8] 37 83 84 147 179 192 198 272
##   .. ..$ : int [1:8] 27 47 75 79 139 150 167 270
##   .. ..$ : int [1:8] 10 109 113 155 168 225 242 250
##   .. ..$ : int [1:8] 60 86 128 181 228 231 254 259
##   .. ..$ : int [1:8] 35 178 183 184 207 217 235 265
##   .. ..$ : int [1:8] 4 17 24 126 203 208 215 249
##   .. ..$ : int [1:8] 20 110 114 162 191 201 213 239
##   .. ..$ : int [1:8] 42 70 90 118 132 144 149 229
##   .. ..$ : int [1:8] 39 46 63 111 158 169 186 218
##   .. ..$ : int [1:8] 78 97 143 156 195 210 214 233
##   .. ..$ : int [1:8] 43 106 135 170 224 241 245 261
##   .. ..$ : int [1:8] 94 127 160 171 180 204 258 266
##   .. ..$ : int [1:8] 3 12 15 18 130 141 166 227
##   .. ..$ : int [1:8] 5 54 67 74 134 145 172 256
##   .. ..$ : int [1:8] 14 16 19 22 53 107 117 257
##   .. ..$ : int [1:8] 28 36 38 52 88 98 131 253
##   .. ..$ : int [1:8] 33 48 49 77 137 212 246 262
##   .. ..$ : int [1:8] 26 50 71 89 216 223 252 255
##   .. ..$ : int [1:8] 23 30 31 62 102 103 140 267
##   .. ..$ : int [1:8] 1 45 105 115 120 124 163 260
##   .. ..$ : int [1:8] 6 116 119 152 182 199 206 247
##   .. ..$ : int [1:8] 8 58 61 68 99 161 165 232
##   .. ..$ : int [1:8] 34 56 69 95 142 159 194 264
##   .. ..$ : int [1:8] 64 85 92 133 187 220 236 237
##   .. ..$ : int [1:8] 59 65 73 96 112 221 240 244
##   .. ..$ : int [1:8] 41 122 153 176 188 222 251 274
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Implication

Calculate the median of mean_ppc using summarize() for both dcells and dcells2.

Solution

dcells %>% 
  summarize(median(mean_ppc),n())
## # A tibble: 1 × 2
##   `median(mean_ppc)` `n()`
##                <dbl> <int>
## 1              3927.   276
dcells2 %>% 
  summarize(median(mean_ppc),n())
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 35 × 4
## # Groups:   cut [5]
##    cut   color `median(mean_ppc)` `n()`
##    <chr> <chr>              <dbl> <int>
##  1 Fair  D                  4223.     8
##  2 Fair  E                  3932.     7
##  3 Fair  F                  4074.     8
##  4 Fair  G                  3764.     8
##  5 Fair  H                  4148.     7
##  6 Fair  I                  3750.     7
##  7 Fair  J                  3307.     7
##  8 Good  D                  4052.     8
##  9 Good  E                  3576.     8
## 10 Good  F                  3797.     8
## # ℹ 25 more rows