Notes on RFDS 3

library(tidyverse)

## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr

## Conflicts with tidy packages ----------------------------------------------

## filter(): dplyr, stats
## lag():    dplyr, stats

load("~/Dropbox/RProjects/Oly Weather/olywthr.rdata")

Exercise

Download the file demo1 from Moodle and inspect it. Then write R code using dplyr to recreate it from olywthr.

Answer

olywthr %>% 
  filter(yr %in% c(2015, 2016), mo == 2, dy >= 14, dy <= 29) %>%  # Feb 14-29 of 2015 and 2016
  select(DATE, TMIN) %>% 
  mutate(TMIN_C = (5 / 9) * (TMIN - 32)) %>%                      # Fahrenheit to Celsius
  arrange(TMIN) -> demo1
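
The same pipeline can be tried without the olywthr file. The sketch below is only an illustration: toy_wthr is a made-up stand-in that borrows the column names used above (DATE, yr, mo, dy, TMIN), and its values are invented. It also uses between() from dplyr as an alternative way to write the day range.

```r
library(dplyr)

# Hypothetical stand-in for olywthr: same column names, invented values.
toy_wthr <- tibble(
  DATE = as.Date(c("2014-02-20", "2015-02-13", "2015-02-14",
                   "2016-02-29", "2016-03-01")),
  yr   = c(2014, 2015, 2015, 2016, 2016),
  mo   = c(2, 2, 2, 2, 3),
  dy   = c(20, 13, 14, 29, 1),
  TMIN = c(30, 41, 50, 32, 45)
)

toy_demo1 <- toy_wthr %>%
  # keep Feb 14-29 of 2015 and 2016; between() is inclusive at both ends
  filter(yr %in% c(2015, 2016), mo == 2, between(dy, 14, 29)) %>%
  select(DATE, TMIN) %>%
  mutate(TMIN_C = (5 / 9) * (TMIN - 32)) %>%   # Fahrenheit to Celsius
  arrange(TMIN)                                # coldest first

toy_demo1
```

Only the 2015-02-14 and 2016-02-29 rows pass all three conditions, and arrange() puts the colder one first.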

Exercise

The dataframe demo2, available in Moodle, was constructed from the diamonds dataframe found in ggplot2. Download and inspect it. Then write dplyr statements to recreate it. Note: ppc stands for dollars per carat.

Answer

diamonds %>% 
   filter(cut == "Ideal") %>%
   select(cut,color,clarity,carat,price) %>% 
   mutate(dollars_per_carat = price/carat) %>% 
   group_by(cut,color,clarity) %>% 
   summarize(med_ppc = mean(dollars_per_carat),
             cell_count = n()) %>% 
   ungroup() %>% 
   arrange(med_ppc) -> demo2
glimpse(demo2)

## Observations: 56
## Variables: 5
## $ cut        <ord> Ideal, Ideal, Ideal, Ideal, Ideal, Ideal, Ideal, Id...
## $ color      <ord> J, I, J, D, I, I, E, H, F, I, G, D, H, E, H, D, J, ...
## $ clarity    <ord> VVS1, IF, IF, I1, VVS1, I1, I1, VVS1, I1, VVS2, I1,...
## $ med_ppc    <dbl> 2612.756, 2722.393, 2774.645, 2898.389, 2954.667, 3...
## $ cell_count <int> 29, 95, 25, 13, 179, 17, 18, 326, 42, 178, 16, 738,...
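
A caution on the column name: med_ppc suggests a median, but the answer above computes it with mean(). These notes do not say which statistic the Moodle file uses, so the sketch below computes both side by side for comparison against the downloaded demo2. The names mean_ppc, median_ppc and ppc_by_cell are made up for this illustration, and ggplot2::diamonds is spelled out because openintro also ships a dataset called diamonds.

```r
library(dplyr)

ppc_by_cell <- ggplot2::diamonds %>%
  filter(cut == "Ideal") %>%
  mutate(dollars_per_carat = price / carat) %>%
  group_by(cut, color, clarity) %>%
  summarize(mean_ppc   = mean(dollars_per_carat),    # what the answer above computes
            median_ppc = median(dollars_per_carat),  # what the name med_ppc suggests
            cell_count = n()) %>%
  ungroup() %>%
  arrange(mean_ppc)

glimpse(ppc_by_cell)
```

Whichever column matches the values in the downloaded file is the statistic that was actually used.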

Exercise

Get the countyComplete dataset from the openintro package.
Create a new dataset cc keeping the following variables: state, name, pop2010, bachelors, per_capita_income, area, density.
Use the as.character() function to convert the factors state and name to ordinary character variables.
Create a new variable total_income using pop2010 and per_capita_income.
Glimpse your dataset and run a summary to make sure everything is OK.

Answer

library(openintro)

## Please visit openintro.org for free statistics materials

## 
## Attaching package: 'openintro'

## The following object is masked from 'package:ggplot2':
## 
##     diamonds

## The following objects are masked from 'package:datasets':
## 
##     cars, trees

cc <- countyComplete %>%
  select(name,state,pop2010,per_capita_income,area,density,bachelors) %>%
  mutate(name = as.character(name),
         state = as.character(state),
         total_income = per_capita_income * pop2010)
glimpse(cc)

## Observations: 3,143
## Variables: 8
## $ name              <chr> "Autauga County", "Baldwin County", "Barbour...
## $ state             <chr> "Alabama", "Alabama", "Alabama", "Alabama", ...
## $ pop2010           <dbl> 54571, 182265, 27457, 22915, 57322, 10914, 2...
## $ per_capita_income <dbl> 24568, 26469, 15875, 19918, 21070, 20289, 16...
## $ area              <dbl> 594.44, 1589.78, 884.88, 622.58, 644.78, 622...
## $ density           <dbl> 91.8, 114.6, 31.0, 36.8, 88.9, 17.5, 27.0, 1...
## $ bachelors         <dbl> 21.7, 26.8, 13.5, 10.0, 12.5, 12.0, 11.0, 16...
## $ total_income      <dbl> 1340700328, 4824372285, 435879875, 456420970...

summary(cc)

##      name              state              pop2010        per_capita_income
##  Length:3143        Length:3143        Min.   :     82   Min.   : 7772    
##  Class :character   Class :character   1st Qu.:  11104   1st Qu.:19030    
##  Mode  :character   Mode  :character   Median :  25857   Median :21773    
##                                        Mean   :  98233   Mean   :22505    
##                                        3rd Qu.:  66699   3rd Qu.:24814    
##                                        Max.   :9818605   Max.   :64381    
##       area             density          bachelors      total_income      
##  Min.   :     2.0   Min.   :    0.0   Min.   : 3.70   Min.   :3.462e+06  
##  1st Qu.:   430.7   1st Qu.:   16.9   1st Qu.:13.10   1st Qu.:2.203e+08  
##  Median :   615.6   Median :   45.2   Median :16.90   Median :5.370e+08  
##  Mean   :  1123.7   Mean   :  259.3   Mean   :19.03   Mean   :2.687e+09  
##  3rd Qu.:   924.0   3rd Qu.:  113.8   3rd Qu.:22.60   3rd Qu.:1.503e+09  
##  Max.   :145504.8   Max.   :69467.5   Max.   :71.00   Max.   :2.685e+11
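
When there are more than a couple of factor columns, converting them one at a time gets tedious. A possible shortcut, assuming dplyr 1.0 or later (where across() and where() are available), is sketched below on a small made-up data frame. toy_counties and its values are invented for illustration and are not taken from countyComplete.

```r
library(dplyr)

# Hypothetical data: two factor columns and one numeric column.
toy_counties <- tibble(
  state   = factor(c("Washington", "Oregon")),
  name    = factor(c("Thurston County", "Lane County")),
  pop2010 = c(1000, 2000)
)

# Convert every factor column to character; other columns are left alone.
toy_cc <- toy_counties %>%
  mutate(across(where(is.factor), as.character))

glimpse(toy_cc)
```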

Exercise

Create a states dataset using group_by and summarize.
In this dataset, compute the average per capita income for each state using two different methods:
mean_county_pci is the simple average of the county per capita income values.
state_pci is constructed by adding up total income and population from the county values and dividing the two at the state level.

Answer

cc %>% 
  group_by(state) %>% 
  summarize(state_pci = sum(total_income) / sum(pop2010),  # pooled: state income / state population
            mean_county_pci = mean(per_capita_income)      # simple average of the county values
            ) %>% 
  arrange(desc(mean_county_pci)) -> states
glimpse(states)

## Observations: 51
## Variables: 3
## $ state           <chr> "District of Columbia", "Connecticut", "New Je...
## $ state_pci       <dbl> 42078.00, 36792.41, 34853.30, 33982.91, 28703....
## $ mean_county_pci <dbl> 42078.00, 34873.25, 34391.19, 33547.07, 32741....
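
The two columns agree for the District of Columbia, which is a single county, and differ elsewhere because state_pci weights each county by its population while mean_county_pci treats every county equally. A small worked example, using a made-up state with two invented counties (toy_state is hypothetical), shows the arithmetic:

```r
library(dplyr)

# Hypothetical state "A": a small rich county and a large poorer one.
toy_state <- tibble(
  state             = c("A", "A"),
  pop2010           = c(100, 900),
  per_capita_income = c(50000, 20000)
) %>%
  mutate(total_income = per_capita_income * pop2010)

toy_summary <- toy_state %>%
  group_by(state) %>%
  summarize(
    # (50000*100 + 20000*900) / (100 + 900) = 23000
    state_pci       = sum(total_income) / sum(pop2010),
    # (50000 + 20000) / 2 = 35000
    mean_county_pci = mean(per_capita_income),
    # same as state_pci: a population-weighted mean of the county values
    weighted_pci    = weighted.mean(per_capita_income, pop2010)
  )

toy_summary
```

The simple average is pulled up by the small rich county, while the pooled figure stays close to the income of the county where most people live.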

Exercise

Create a subset, pnw, of cc consisting of counties from Washington, Oregon and Idaho. Produce an appropriate graphic to compare the counties in these states on the basis of the percent of people holding a bachelor's degree.

Answer

cc %>% 
  filter(state %in% c("Washington","Oregon","Idaho")) -> pnw
  
pnw %>% ggplot(aes(x=state,y=bachelors)) + 
  geom_boxplot() +
  coord_flip()

# or

pnw %>% ggplot(aes(x=bachelors,fill=state)) + 
  geom_density() +
  facet_wrap(~state,ncol=1)
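
Either graphic is easier to read when the states appear in a meaningful order, and a small table of medians is a useful companion. The sketch below shows one way to do both. It runs on toy_pnw, a made-up data frame with invented values standing in for pnw, and uses reorder() from base R to sort the boxes by median.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for pnw: three counties per state, invented values.
toy_pnw <- tibble(
  state     = rep(c("Washington", "Oregon", "Idaho"), each = 3),
  bachelors = c(20, 30, 25,  15, 22, 18,  12, 14, 25)
)

# Numeric companion to the plot: median and county count per state.
toy_medians <- toy_pnw %>%
  group_by(state) %>%
  summarize(median_bachelors = median(bachelors),
            counties = n()) %>%
  arrange(desc(median_bachelors))

# Boxplot with states ordered by their median instead of alphabetically.
p <- toy_pnw %>%
  ggplot(aes(x = reorder(state, bachelors, FUN = median), y = bachelors)) +
  geom_boxplot() +
  coord_flip() +
  labs(x = "state", y = "percent with a bachelor's degree")

toy_medians
p
```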

Review 5.1.3
