## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.2 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Dplyr mostly uses five “verbs.”
Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).
Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us the ability to see everything. You can only use View() in an interactive session.
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.
Create a new version of diamonds, called d.
Use filter to eliminate any diamonds with x, y, or z = 0.
Cut, color and clarity should be simple character variables instead of factors.
Make a new variable ppc, the price per carat.
d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
color = as.character(color),
clarity = as.character(clarity),
ppc = price/carat) %>%
glimpse()## Rows: 53,920
## Columns: 11
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very Good", "V…
## $ color <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J", "J", "F…
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI1", "VS2…
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…
## carat cut color clarity
## Min. :0.2000 Length:53920 Length:53920 Length:53920
## 1st Qu.:0.4000 Class :character Class :character Class :character
## Median :0.7000 Mode :character Mode :character Mode :character
## Mean :0.7977
## 3rd Qu.:1.0400
## Max. :5.0100
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 3.730
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 949 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3931 Mean : 5.732
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5323 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
## y z ppc
## Min. : 3.680 Min. : 1.07 Min. : 1051
## 1st Qu.: 4.720 1st Qu.: 2.91 1st Qu.: 2478
## Median : 5.710 Median : 3.53 Median : 3495
## Mean : 5.735 Mean : 3.54 Mean : 4008
## 3rd Qu.: 6.540 3rd Qu.: 4.04 3rd Qu.: 4949
## Max. :58.900 Max. :31.80 Max. :17829
Drop the variables x, y, and z from d. Run str to make sure you succeeded.
## tibble [53,920 × 8] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53920] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : chr [1:53920] "Ideal" "Premium" "Good" "Premium" ...
## $ color : chr [1:53920] "E" "E" "E" "I" ...
## $ clarity: chr [1:53920] "SI2" "SI1" "VS1" "VS2" ...
## $ depth : num [1:53920] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53920] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53920] 326 326 327 334 335 336 336 337 337 338 ...
## $ ppc : num [1:53920] 1417 1552 1422 1152 1081 ...
Use dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.
## clarity color ppc
## Length:4780 Length:4780 Min. : 5001
## Class :character Class :character 1st Qu.: 5753
## Mode :character Mode :character Median : 6758
## Mean : 7104
## 3rd Qu.: 8091
## Max. :17078
Create a new dataframe dcells. There is a row for every combination of cut, color, and clarity. The other variables with obvious definitions are mean_ppc, mean_carat, sd_ppc, and sd_carat. Sort the dataframe in descending order of mean_ppc.
dcells <- d %>%
group_by(cut,color,clarity) %>%
summarize(count = n(),
mean_ppc = mean(ppc),
mean_carat = mean(carat),
median_ppc = median(ppc),
sd_ppc = sd(ppc)) %>%
ungroup() %>% # Important
arrange(desc(mean_ppc))## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.
Run str() on the dataframe.
## tibble [276 × 8] (S3: tbl_df/tbl/data.frame)
## $ cut : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
## $ color : chr [1:276] "D" "D" "D" "D" ...
## $ clarity : chr [1:276] "IF" "IF" "IF" "IF" ...
## $ count : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
## $ mean_ppc : num [1:276] 11347 10877 10099 9034 5817 ...
## $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
## $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
## $ sd_ppc : num [1:276] 4800 5248 5960 4055 3565 ...
Rerun the code to create dcells2 leaving out ungroup(). Run str() on decells2 and compare.
dcells2 <- d %>%
group_by(cut,color,clarity) %>%
summarize(count = n(),
mean_ppc = mean(ppc),
mean_carat = mean(carat),
median_ppc = median(ppc),
sd_ppc = sd(ppc)) %>%
arrange(desc(mean_ppc))## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.
## gropd_df [276 × 8] (S3: grouped_df/tbl_df/tbl/data.frame)
## $ cut : chr [1:276] "Very Good" "Good" "Premium" "Ideal" ...
## $ color : chr [1:276] "D" "D" "D" "D" ...
## $ clarity : chr [1:276] "IF" "IF" "IF" "IF" ...
## $ count : int [1:276] 23 9 10 28 27 43 3 67 3 4 ...
## $ mean_ppc : num [1:276] 11347 10877 10099 9034 5817 ...
## $ mean_carat: num [1:276] 0.803 0.787 0.708 0.616 0.576 ...
## $ median_ppc: num [1:276] 10202 14932 11057 7162 3657 ...
## $ sd_ppc : num [1:276] 4800 5248 5960 4055 3565 ...
## - attr(*, "groups")= tibble [35 × 3] (S3: tbl_df/tbl/data.frame)
## ..$ cut : chr [1:35] "Fair" "Fair" "Fair" "Fair" ...
## ..$ color: chr [1:35] "D" "E" "F" "G" ...
## ..$ .rows: list<int> [1:35]
## .. ..$ : int [1:8] 7 21 29 80 87 125 146 190
## .. ..$ : int [1:7] 9 66 104 136 157 177 276
## .. ..$ : int [1:8] 11 13 44 100 108 164 175 275
## .. ..$ : int [1:8] 25 129 138 151 185 200 234 273
## .. ..$ : int [1:7] 40 82 91 93 123 148 268
## .. ..$ : int [1:7] 32 121 173 174 196 219 269
## .. ..$ : int [1:7] 101 205 209 226 243 263 271
## .. ..$ : int [1:8] 2 55 76 81 154 193 197 248
## .. ..$ : int [1:8] 51 57 72 189 202 211 230 238
## .. ..$ : int [1:8] 37 83 84 147 179 192 198 272
## .. ..$ : int [1:8] 27 47 75 79 139 150 167 270
## .. ..$ : int [1:8] 10 109 113 155 168 225 242 250
## .. ..$ : int [1:8] 60 86 128 181 228 231 254 259
## .. ..$ : int [1:8] 35 178 183 184 207 217 235 265
## .. ..$ : int [1:8] 4 17 24 126 203 208 215 249
## .. ..$ : int [1:8] 20 110 114 162 191 201 213 239
## .. ..$ : int [1:8] 42 70 90 118 132 144 149 229
## .. ..$ : int [1:8] 39 46 63 111 158 169 186 218
## .. ..$ : int [1:8] 78 97 143 156 195 210 214 233
## .. ..$ : int [1:8] 43 106 135 170 224 241 245 261
## .. ..$ : int [1:8] 94 127 160 171 180 204 258 266
## .. ..$ : int [1:8] 3 12 15 18 130 141 166 227
## .. ..$ : int [1:8] 5 54 67 74 134 145 172 256
## .. ..$ : int [1:8] 14 16 19 22 53 107 117 257
## .. ..$ : int [1:8] 28 36 38 52 88 98 131 253
## .. ..$ : int [1:8] 33 48 49 77 137 212 246 262
## .. ..$ : int [1:8] 26 50 71 89 216 223 252 255
## .. ..$ : int [1:8] 23 30 31 62 102 103 140 267
## .. ..$ : int [1:8] 1 45 105 115 120 124 163 260
## .. ..$ : int [1:8] 6 116 119 152 182 199 206 247
## .. ..$ : int [1:8] 8 58 61 68 99 161 165 232
## .. ..$ : int [1:8] 34 56 69 95 142 159 194 264
## .. ..$ : int [1:8] 64 85 92 133 187 220 236 237
## .. ..$ : int [1:8] 59 65 73 96 112 221 240 244
## .. ..$ : int [1:8] 41 122 153 176 188 222 251 274
## .. ..@ ptype: int(0)
## ..- attr(*, ".drop")= logi TRUE
Calculate the median of mean_ppc using summarize() for both dcells and dcells2.
## # A tibble: 1 × 2
## `median(mean_ppc)` `n()`
## <dbl> <int>
## 1 3927. 276
## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 35 × 4
## # Groups: cut [5]
## cut color `median(mean_ppc)` `n()`
## <chr> <chr> <dbl> <int>
## 1 Fair D 4223. 8
## 2 Fair E 3932. 7
## 3 Fair F 4074. 8
## 4 Fair G 3764. 8
## 5 Fair H 4148. 7
## 6 Fair I 3750. 7
## 7 Fair J 3307. 7
## 8 Good D 4052. 8
## 9 Good E 3576. 8
## 10 Good F 3797. 8
## # ℹ 25 more rows