## ── Attaching packages ────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.1.0 ✔ purrr 0.2.5
## ✔ tibble 1.4.2 ✔ dplyr 0.7.8
## ✔ tidyr 0.8.2 ✔ stringr 1.3.1
## ✔ readr 1.1.1 ✔ forcats 0.3.0
## ── Conflicts ───────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Please visit openintro.org for free statistics materials
##
## Attaching package: 'openintro'
## The following object is masked from 'package:ggplot2':
##
## diamonds
## The following objects are masked from 'package:datasets':
##
## cars, trees
Dplyr mostly uses five “verbs.”
Some basic behavior of R is altered. All dataframes touched by dplyr become tbl’s (pronounced tibble).
Typing the name of a dataframe is prevented from overwhelming the console by dumping the entire dataframe. Both rows and columns are restricted.
Use of View() and glimpse() give us ability to see everything.
## Observations: 53,940
## Variables: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut <fct> Ideal, Premium, Good, Premium, Good, Very Good, Very G...
## $ color <fct> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, ...
## $ clarity <fct> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
Piping with %>% is almost always used when dplyr is in use, although it is not strictly necessary.
Create a new version of diamonds, called d. Cut, color and clarity should be simple character variables. Make a new variable ppc, the price per carat.
d <- diamonds %>%
filter(x>0, y>0, z>0) %>%
mutate(cut = as.character(cut),
color = as.character(color),
clarity = as.character(clarity),
ppc = price/carat) %>%
glimpse()
## Observations: 53,920
## Variables: 11
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, ...
## $ cut <chr> "Ideal", "Premium", "Good", "Premium", "Good", "Very G...
## $ color <chr> "E", "E", "E", "I", "J", "J", "I", "H", "E", "H", "J",...
## $ clarity <chr> "SI2", "SI1", "VS1", "VS2", "SI2", "VVS2", "VVS1", "SI...
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, ...
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54...
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339,...
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, ...
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, ...
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, ...
## $ ppc <dbl> 1417.391, 1552.381, 1421.739, 1151.724, 1080.645, 1400...
## carat cut color clarity
## Min. :0.2000 Length:53920 Length:53920 Length:53920
## 1st Qu.:0.4000 Class :character Class :character Class :character
## Median :0.7000 Mode :character Mode :character Mode :character
## Mean :0.7977
## 3rd Qu.:1.0400
## Max. :5.0100
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 3.730
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 949 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3931 Mean : 5.732
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5323 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
## y z ppc
## Min. : 3.680 Min. : 1.07 Min. : 1051
## 1st Qu.: 4.720 1st Qu.: 2.91 1st Qu.: 2478
## Median : 5.710 Median : 3.53 Median : 3495
## Mean : 5.735 Mean : 3.54 Mean : 4008
## 3rd Qu.: 6.540 3rd Qu.: 4.04 3rd Qu.: 4949
## Max. :58.900 Max. :31.80 Max. :17829
Use Dplyr to create a subset d1 of d with cut = “Ideal” and ppc greater than $5,000. Keep only the columns clarity, color, and ppc.
## clarity color ppc
## Length:4780 Length:4780 Min. : 5001
## Class :character Class :character 1st Qu.: 5753
## Mode :character Mode :character Median : 6758
## Mean : 7104
## 3rd Qu.: 8091
## Max. :17078
library(openintro)
cc <- countyComplete %>%
select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
mutate(name = as.character(name),
state = as.character(state),
total_income = per_capita_income * pop2010)
glimpse(cc)
## Observations: 3,143
## Variables: 8
## $ name <chr> "Autauga County", "Baldwin County", "Barbour...
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ pop2010 <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 2...
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16...
## $ area <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622...
## $ density <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 1...
## $ bachelors <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16...
## $ total_income <dbl> 1340700328, 4824372285, 435879875, 456420970...
Create a states dataset using group_by and summarize.
In this dataset create average per capita income for each state with two different methods.
SPCI1 is the simple average of the county per capita income values.
SPCI2 is constructed by adding up total income and population from the county values and dividing the two at the state level. Examine the difference between SPCI1 and SPCI2.
cc %>%
group_by(state) %>%
summarize(SPCI1 = sum(total_income)/sum(pop2010),
SPCI2 = mean(total_income/pop2010)
) %>%
arrange(desc(SPCI1))-> states
glimpse(states)
## Observations: 51
## Variables: 3
## $ state <chr> "District of Columbia", "Connecticut", "Maryland", "New ...
## $ SPCI1 <dbl> 42078.00, 36792.41, 34907.87, 34853.30, 33982.91, 32293....
## $ SPCI2 <dbl> 42078.00, 34873.25, 31459.08, 34391.19, 33547.07, 25557....