Five Verbs

Dplyr mostly uses five “verbs.”

select - Pick columns
filter - Pick rows
mutate - Create new columns
arrange - sort
summarize + group_by - Create summary stats

Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).

Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us ability to see everything.

glimpse(diamonds)

## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.

Exercise

Create a new version of diamonds, called d. Make a new variable ppc, the price per carat.

Solution

d <- diamonds %>%
mutate(ppc = price/carat)

glimpse(d)

## Rows: 53,940
## Columns: 11
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
## $ ppc     <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400.000, 14…

summary(d)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z               ppc       
##  Min.   : 0.000   Min.   : 1051  
##  1st Qu.: 2.910   1st Qu.: 2478  
##  Median : 3.530   Median : 3495  
##  Mean   : 3.539   Mean   : 4008  
##  3rd Qu.: 4.040   3rd Qu.: 4950  
##  Max.   :31.800   Max.   :17829  
##

Exercise

Drop the variable table from d. Run str() to verify.

Solution

d = d %>% 
  select(-table)

str(d)

## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
##  $ ppc    : num [1:53940] 1417 1552 1422 1152 1081 ...

Exercise

Eliminate all diamonds from d where x or y or z is not greater than zero.

Solution

d = d %>% 
  filter(x > 0, y > 0, z > 0)

Exercise

Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.

Solution

d1 = d %>% 
  filter( cut == "Ideal" & ppc > 5000) %>% 
  select( clarity, color, ppc)
summary(d1)

##     clarity     color         ppc       
##  VS2    :1164   D: 501   Min.   : 5001  
##  SI1    :1007   E: 557   1st Qu.: 5753  
##  VS1    : 929   F: 903   Median : 6758  
##  VVS2   : 623   G:1298   Mean   : 7104  
##  SI2    : 521   H: 769   3rd Qu.: 8091  
##  VVS1   : 369   I: 544   Max.   :17078  
##  (Other): 167   J: 208

Example of group_by and summarize

Create a summary table dcells. There is a row for every combination of cut, color, and clarity. Each row contains the count of diamonds, the mean of ppc, the mean of carat, the median of ppc, the sd of ppc, and the IQR of ppc.

After ungrouping, display the table in descending order of ppc.

Solution

dcells <- d %>%
  group_by(cut,color,clarity) %>%
  summarize(count = n(),
            mean_ppc = mean(ppc),
            mean_carat = mean(carat),
            median_ppc = median(ppc),
            sd_ppc = sd(ppc),
            IQR_ppc = IQR(ppc)) %>%
  ungroup() %>% # Important
  arrange(desc(mean_ppc))

## `summarise()` has grouped output by 'cut', 'color'. You can override using the
## `.groups` argument.

dcells

## # A tibble: 276 × 9
##    cut       color clarity count mean_ppc mean_carat median_ppc sd_ppc IQR_ppc
##    <ord>     <ord> <ord>   <int>    <dbl>      <dbl>      <dbl>  <dbl>   <dbl>
##  1 Very Good D     IF         23   11347.      0.803     10202.  4800.   8796.
##  2 Good      D     IF          9   10877.      0.787     14932.  5248.   8671.
##  3 Premium   D     IF         10   10099.      0.708     11057.  5960.  11043.
##  4 Ideal     D     IF         28    9034.      0.616      7162.  4055.   4623.
##  5 Premium   E     IF         27    5817.      0.576      3657.  3565.   4573.
##  6 Very Good E     IF         43    5706.      0.579      3900   3365.   6152.
##  7 Fair      D     VVS1        3    5690.      0.607      3584   4404.   4009.
##  8 Very Good F     IF         67    5544.      0.607      3202.  3593.   6951.
##  9 Fair      E     VVS1        3    5447.      0.64       4921.  2772.   2734.
## 10 Good      H     IF          4    5394.      0.935      5100.  2664.   3411.
## # ℹ 266 more rows

Task

Download the ctycmplt.Rdata file from Moodle and load it.
Create a new dataset cc keeping the following variables. state, name, pop2010, bachelors, per_capita_income, area, density,
Create a new variable total_income using pop2010 and per_capita_income.
Glimpse your dataset and run a summary to make sure everything is OK.

Solution

load("ctycmplt.Rdata")
cc <- ctycmplt %>%
  select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
  mutate(total_income = per_capita_income * pop2010)
glimpse(cc)

## Rows: 3,143
## Columns: 8
## $ name              <fct> Autauga County, Baldwin County, Barbour County, Bibb…
## $ state             <fct> Alabama, Alabama, Alabama, Alabama, Alabama, Alabama…
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 20947, 11…
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16916, 205…
## $ area              <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622.81, 776…
## $ density           <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 195.7, 57…
## $ bachelors         <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16.1, 10.8…
## $ total_income      <dbl> 1340700328, 4824372285, 435879875, 456420970, 120777…

Task 2.

Create a states dataset using group_by and summarize.
In this dataset create average per capita income for each state with two different methods.
SPCI1 is the simple average of the county per capita income values.
SPCI2 is constructed by adding up total income and population from the county values and dividing the two at the state level. Examine the difference between SPCI1 and SPCI2.

Answer

cc %>% 
  group_by(state) %>% 
  summarize(SPCI1 = sum(total_income)/sum(pop2010),
            SPCI2 = mean(total_income/pop2010)
            ) %>% 
  arrange(desc(SPCI1))-> states
glimpse(states)

## Rows: 51
## Columns: 3
## $ state <fct> District of Columbia, Connecticut, Maryland, New Jersey, Massach…
## $ SPCI1 <dbl> 42078.00, 36792.41, 34907.87, 34853.30, 33982.91, 32293.45, 3142…
## $ SPCI2 <dbl> 42078.00, 34873.25, 31459.08, 34391.19, 33547.07, 25557.64, 2890…

Dplyr Basics

Five Verbs

Exercise

Solution

Exercise

Solution

Exercise

Solution

Exercise

Solution

Example of group_by and summarize

Solution

Task

Solution

Task 2.

Answer