Harold Nelson
2025-06-17
Dplyr mostly uses five “verbs.”
Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).
Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us ability to see everything.
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.
Create a new version of diamonds, called d. Make a new variable ppc, the price per carat.
## Rows: 53,940
## Columns: 11
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z ppc
## Min. : 0.000 Min. : 1051
## 1st Qu.: 2.910 1st Qu.: 2478
## Median : 3.530 Median : 3495
## Mean : 3.539 Mean : 4008
## 3rd Qu.: 4.040 3rd Qu.: 4950
## Max. :31.800 Max. :17829
##
Drop the variable table from d. Run str() to verify.
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
## $ ppc : num [1:53940] 1417 1552 1422 1152 1081 ...
Eliminate all diamonds from d where x or y or z is not greater than zero.
Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.
## clarity color ppc
## VS2 :1164 D: 501 Min. : 5001
## SI1 :1007 E: 557 1st Qu.: 5753
## VS1 : 929 F: 903 Median : 6758
## VVS2 : 623 G:1298 Mean : 7104
## SI2 : 521 H: 769 3rd Qu.: 8091
## VVS1 : 369 I: 544 Max. :17078
## (Other): 167 J: 208
Create a summary table dcells. There is a row for every combination of cut, color, and clarity. Each row contains the count of diamonds, the mean of ppc, the mean of carat, the median of ppc, the sd of ppc, and the IQR of ppc.
After ungrouping, display the table in descending order of ppc.
dcells <- d %>%
group_by(cut,color,clarity) %>%
summarize(count = n(),
mean_ppc = mean(ppc),
mean_carat = mean(carat),
median_ppc = median(ppc),
sd_ppc = sd(ppc),
IQR_ppc = IQR(ppc)) %>%
ungroup() %>% # Important
arrange(desc(mean_ppc))
## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.
## # A tibble: 276 × 9
## cut color clarity count mean_ppc mean_carat median_ppc sd_ppc IQR_ppc
## <ord> <ord> <ord> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Very Good D IF 23 11347. 0.803 10202. 4800. 8796.
## 2 Good D IF 9 10877. 0.787 14932. 5248. 8671.
## 3 Premium D IF 10 10099. 0.708 11057. 5960. 11043.
## 4 Ideal D IF 28 9034. 0.616 7162. 4055. 4623.
## 5 Premium E IF 27 5817. 0.576 3657. 3565. 4573.
## 6 Very Good E IF 43 5706. 0.579 3900 3365. 6152.
## 7 Fair D VVS1 3 5690. 0.607 3584 4404. 4009.
## 8 Very Good F IF 67 5544. 0.607 3202. 3593. 6951.
## 9 Fair E VVS1 3 5447. 0.64 4921. 2772. 2734.
## 10 Good H IF 4 5394. 0.935 5100. 2664. 3411.
## # ℹ 266 more rows
load("ctycmplt.Rdata")
cc <- ctycmplt %>%
select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
mutate(total_income = per_capita_income * pop2010)
glimpse(cc)
## Rows: 3,143
## Columns: 8
## $ name <fct> Autauga County, Baldwin County, Barbour County, Bibb…
## $ state <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama…
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916, 205…
## $ area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 776…
## $ density <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 195.7, 57…
## $ bachelors <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16.1, 10.8…
## $ total_income <dbl> 1340700328, 4824372285, 435879875, 456420970, 120777…
Create a states dataset using group_by and summarize.
In this dataset create average per capita income for each state with two different methods.
SPCI1 is the simple average of the county per capita income values.
SPCI2 is constructed by adding up total income and population from the county values and dividing the two at the state level. Examine the difference between SPCI1 and SPCI2.
cc %>%
group_by(state) %>%
summarize(SPCI1 = sum(total_income)/sum(pop2010),
SPCI2 = mean(total_income/pop2010)
) %>%
arrange(desc(SPCI1))-> states
glimpse(states)
## Rows: 51
## Columns: 3
## $ state <fct> District of Columbia, Connecticut, Maryland, New Jersey, Massach…
## $ SPCI1 <dbl> 42078.00, 36792.41, 34907.87, 34853.30, 33982.91, 32293.45, 3142…
## $ SPCI2 <dbl> 42078.00, 34873.25, 31459.08, 34391.19, 33547.07, 25557.64, 2890…