library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.2     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Five Verbs

Dplyr mostly uses five “verbs.”

select - Pick columns
filter - Pick rows
mutate - Create new columns
arrange - sort
summarize + group_by - Create summary stats

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us the ability to see everything. You can only use View() in an interactive session.

glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d.

Use filter to eliminate any diamonds with x, y, or z = 0.

Cut, color and clarity should be simple character variables instead of factors.

Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
       color = as.character(color),
       clarity = as.character(clarity),
       ppc = price/carat) %>% 
glimpse()

## Rows: 53,920
## Columns: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "V…
## $ color   <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "F…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS2…
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…

summary(d)

##      carat            cut               color             clarity         
##  Min.   :0.2000   Length:53920       Length:53920       Length:53920      
##  1st Qu.:0.4000   Class :character   Class :character   Class :character  
##  Median :0.7000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :0.7977                                                           
##  3rd Qu.:1.0400                                                           
##  Max.   :5.0100                                                           
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 3.730  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  949   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3931   Mean   : 5.732  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5323   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##        y                z              ppc       
##  Min.   : 3.680   Min.   : 1.07   Min.   : 1051  
##  1st Qu.: 4.720   1st Qu.: 2.91   1st Qu.: 2478  
##  Median : 5.710   Median : 3.53   Median : 3495  
##  Mean   : 5.735   Mean   : 3.54   Mean   : 4008  
##  3rd Qu.: 6.540   3rd Qu.: 4.04   3rd Qu.: 4949  
##  Max.   :58.900   Max.   :31.80   Max.   :17829

Exercise

Drop the variables x, y, and z from d. Run str to make sure you succeeded.

Solution

d = d %>% 
  select(-c(x,y,z))

str(d)

## tibble [53,920 × 8] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53920] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr [1:53920] "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr [1:53920] "E" "E" "E" "I" ...
##  $ clarity: chr [1:53920] "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num [1:53920] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53920] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53920] 326 326 327 334 335 336 336 337 337 338 ...
##  $ ppc    : num [1:53920] 1417 1552 1422 1152 1081 ...

Exercise

Use dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)

##    clarity             color                ppc       
##  Length:4780        Length:4780        Min.   : 5001  
##  Class :character   Class :character   1st Qu.: 5753  
##  Mode  :character   Mode  :character   Median : 6758  
##                                        Mean   : 7104  
##                                        3rd Qu.: 8091  
##                                        Max.   :17078

Example of group_by and summarize

Create a new dataframe dcells. There is a row for every combination of cut, color, and clarity. The other variables with obvious definitions are mean_ppc, mean_carat, sd_ppc, and sd_carat. Sort the dataframe in descending order of mean_ppc.

Solution

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))

## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.

Structure

Run str() on the dataframe.

Solution

str(dcells)

## tibble [276 × 8] (S3: tbl_df/tbl/data.frame)
##  $ cut       : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
##  $ color     : chr [1:276] "D" "D" "D" "D" ...
##  $ clarity   : chr [1:276] "IF" "IF" "IF" "IF" ...
##  $ count     : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
##  $ mean_ppc  : num [1:276] 11347 10877 10099 9034 5817 ...
##  $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
##  $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
##  $ sd_ppc    : num [1:276] 4800 5248 5960 4055 3565 ...

Exercise

Rerun the code to create dcells2 leaving out ungroup(). Run str() on decells2 and compare.

Solution

dcells2 <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc)) %>%
  arrange(desc(mean_ppc))

## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.

str(dcells2)

## gropd_df [276 × 8] (S3: grouped_df/tbl_df/tbl/data.frame)
##  $ cut       : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
##  $ color     : chr [1:276] "D" "D" "D" "D" ...
##  $ clarity   : chr [1:276] "IF" "IF" "IF" "IF" ...
##  $ count     : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
##  $ mean_ppc  : num [1:276] 11347 10877 10099 9034 5817 ...
##  $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
##  $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
##  $ sd_ppc    : num [1:276] 4800 5248 5960 4055 3565 ...
##  - attr(*, "groups")= tibble [35 × 3] (S3: tbl_df/tbl/data.frame)
##   ..$ cut  : chr [1:35] "Fair" "Fair" "Fair" "Fair" ...
##   ..$ color: chr [1:35] "D" "E" "F" "G" ...
##   ..$ .rows: list<int> [1:35] 
##   .. ..$ : int [1:8] 7 21 29 80 87 125 146 190
##   .. ..$ : int [1:7] 9 66 104 136 157 177 276
##   .. ..$ : int [1:8] 11 13 44 100 108 164 175 275
##   .. ..$ : int [1:8] 25 129 138 151 185 200 234 273
##   .. ..$ : int [1:7] 40 82 91 93 123 148 268
##   .. ..$ : int [1:7] 32 121 173 174 196 219 269
##   .. ..$ : int [1:7] 101 205 209 226 243 263 271
##   .. ..$ : int [1:8] 2 55 76 81 154 193 197 248
##   .. ..$ : int [1:8] 51 57 72 189 202 211 230 238
##   .. ..$ : int [1:8] 37 83 84 147 179 192 198 272
##   .. ..$ : int [1:8] 27 47 75 79 139 150 167 270
##   .. ..$ : int [1:8] 10 109 113 155 168 225 242 250
##   .. ..$ : int [1:8] 60 86 128 181 228 231 254 259
##   .. ..$ : int [1:8] 35 178 183 184 207 217 235 265
##   .. ..$ : int [1:8] 4 17 24 126 203 208 215 249
##   .. ..$ : int [1:8] 20 110 114 162 191 201 213 239
##   .. ..$ : int [1:8] 42 70 90 118 132 144 149 229
##   .. ..$ : int [1:8] 39 46 63 111 158 169 186 218
##   .. ..$ : int [1:8] 78 97 143 156 195 210 214 233
##   .. ..$ : int [1:8] 43 106 135 170 224 241 245 261
##   .. ..$ : int [1:8] 94 127 160 171 180 204 258 266
##   .. ..$ : int [1:8] 3 12 15 18 130 141 166 227
##   .. ..$ : int [1:8] 5 54 67 74 134 145 172 256
##   .. ..$ : int [1:8] 14 16 19 22 53 107 117 257
##   .. ..$ : int [1:8] 28 36 38 52 88 98 131 253
##   .. ..$ : int [1:8] 33 48 49 77 137 212 246 262
##   .. ..$ : int [1:8] 26 50 71 89 216 223 252 255
##   .. ..$ : int [1:8] 23 30 31 62 102 103 140 267
##   .. ..$ : int [1:8] 1 45 105 115 120 124 163 260
##   .. ..$ : int [1:8] 6 116 119 152 182 199 206 247
##   .. ..$ : int [1:8] 8 58 61 68 99 161 165 232
##   .. ..$ : int [1:8] 34 56 69 95 142 159 194 264
##   .. ..$ : int [1:8] 64 85 92 133 187 220 236 237
##   .. ..$ : int [1:8] 59 65 73 96 112 221 240 244
##   .. ..$ : int [1:8] 41 122 153 176 188 222 251 274
##   .. ..@ ptype: int(0) 
##   ..- attr(*, ".drop")= logi TRUE

Implication

Calculate the median of mean_ppc using summarize() for both dcells and dcells2.

Solution

dcells %>% 
  summarize(median(mean_ppc),n())

## # A tibble: 1 × 2
##   `median(mean_ppc)` `n()`
##                <dbl> <int>
## 1              3927.   276

dcells2 %>% 
  summarize(median(mean_ppc),n())

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.

## # A tibble: 35 × 4
## # Groups:   cut [5]
##    cut   color `median(mean_ppc)` `n()`
##    <chr> <chr>              <dbl> <int>
##  1 Fair  D                  4223.     8
##  2 Fair  E                  3932.     7
##  3 Fair  F                  4074.     8
##  4 Fair  G                  3764.     8
##  5 Fair  H                  4148.     7
##  6 Fair  I                  3750.     7
##  7 Fair  J                  3307.     7
##  8 Good  D                  4052.     8
##  9 Good  E                  3576.     8
## 10 Good  F                  3797.     8
## # ℹ 25 more rows

Dplyr Basics

Five Verbs

Exercise

Solution

Exercise

Solution

Exercise

Solution

Example of group_by and summarize

Solution

Structure

Solution

Exercise

Solution

Implication

Solution