Notes Mar 3

Harold Nelson

3/3/2021

Setup

Get the earnings data from Tidy Tuesday of 2/23 and load the tidyverse.

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3     ✓ purrr   0.3.4
## ✓ tibble  3.0.6     ✓ dplyr   1.0.4
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.4.0     ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
earn <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/earn.csv')
## 
## ── Column specification ────────────────────────────────────────────────────────
## cols(
##   sex = col_character(),
##   race = col_character(),
##   ethnic_origin = col_character(),
##   age = col_character(),
##   year = col_double(),
##   quarter = col_double(),
##   n_persons = col_double(),
##   median_weekly_earn = col_double()
## )

Glimpse earn and get a summary.

Answer

glimpse(earn)
## Rows: 4,224
## Columns: 8
## $ sex                <chr> "Both Sexes", "Both Sexes", "Both Sexes", "Both Se…
## $ race               <chr> "All Races", "All Races", "All Races", "All Races"…
## $ ethnic_origin      <chr> "All Origins", "All Origins", "All Origins", "All …
## $ age                <chr> "16 years and over", "16 years and over", "16 year…
## $ year               <dbl> 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 20…
## $ quarter            <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1,…
## $ n_persons          <dbl> 96821000, 99798000, 101385000, 100120000, 98329000…
## $ median_weekly_earn <dbl> 754, 740, 740, 752, 755, 753, 753, 764, 769, 771, …
summary(earn)
##      sex                race           ethnic_origin          age           
##  Length:4224        Length:4224        Length:4224        Length:4224       
##  Class :character   Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character   Mode  :character  
##                                                                             
##                                                                             
##                                                                             
##       year         quarter       n_persons         median_weekly_earn
##  Min.   :2010   Min.   :1.00   Min.   :   103000   Min.   : 318.0    
##  1st Qu.:2012   1st Qu.:1.75   1st Qu.:  2614000   1st Qu.: 605.0    
##  Median :2015   Median :2.50   Median :  7441000   Median : 755.0    
##  Mean   :2015   Mean   :2.50   Mean   : 16268338   Mean   : 762.2    
##  3rd Qu.:2018   3rd Qu.:3.25   3rd Qu.: 17555250   3rd Qu.: 911.0    
##  Max.   :2020   Max.   :4.00   Max.   :118358000   Max.   :1709.0

Create Factors

Convert the character variables to factors. Get tables of each of these.

Answer

earn = earn %>% 
  mutate(sex = factor(sex),
         race = factor(race),
         ethnic_origin = factor(ethnic_origin),
         age = factor(age))

table(earn$sex)
## 
## Both Sexes        Men      Women 
##       1408       1408       1408
table(earn$race)
## 
##                 All Races                     Asian Black or African American 
##                      2244                       660                       660 
##                     White 
##                       660
table(earn$ethnic_origin)
## 
##        All Origins Hispanic or Latino 
##               3564                660
table(earn$age)
## 
##    16 to 19 years    16 to 24 years 16 years and over    20 to 24 years 
##               132               660               660               132 
##    25 to 34 years    25 to 54 years 25 years and over    35 to 44 years 
##               132               660               660               132 
##    45 to 54 years    55 to 64 years 55 years and over 65 years and over 
##               132               132               660               132

Whole Population

These counts show that not all combinations of possible values exist.

There is a combination that defines the population as a whole.

Define this combination and use it to create an overall time-series of median weekly earnings.

Use year as the time variable for your first attempt.

Answer

whole.pop = earn %>% 
  filter(sex == "Both Sexes",
         race == "All Races",
         age == "16 years and over",
         ethnic_origin == "All Origins")

head(whole.pop)
## # A tibble: 6 x 8
##   sex     race   ethnic_origin age       year quarter n_persons median_weekly_e…
##   <fct>   <fct>  <fct>         <fct>    <dbl>   <dbl>     <dbl>            <dbl>
## 1 Both S… All R… All Origins   16 year…  2010       1  96821000              754
## 2 Both S… All R… All Origins   16 year…  2010       2  99798000              740
## 3 Both S… All R… All Origins   16 year…  2010       3 101385000              740
## 4 Both S… All R… All Origins   16 year…  2010       4 100120000              752
## 5 Both S… All R… All Origins   16 year…  2011       1  98329000              755
## 6 Both S… All R… All Origins   16 year…  2011       2 100593000              753
whole.pop %>% 
  ggplot(aes(year,median_weekly_earn)) +
  geom_point()

How can we expand the time dimension to spread out the quarters?

Answer

whole.pop = earn %>% 
  filter(sex == "Both Sexes",
         race == "All Races",
         age == "16 years and over",
         ethnic_origin == "All Origins") %>% 
  mutate(qtime = year + (quarter - 1) * .25)

head(whole.pop)
## # A tibble: 6 x 9
##   sex   race  ethnic_origin age    year quarter n_persons median_weekly_e… qtime
##   <fct> <fct> <fct>         <fct> <dbl>   <dbl>     <dbl>            <dbl> <dbl>
## 1 Both… All … All Origins   16 y…  2010       1  96821000              754 2010 
## 2 Both… All … All Origins   16 y…  2010       2  99798000              740 2010.
## 3 Both… All … All Origins   16 y…  2010       3 101385000              740 2010.
## 4 Both… All … All Origins   16 y…  2010       4 100120000              752 2011.
## 5 Both… All … All Origins   16 y…  2011       1  98329000              755 2011 
## 6 Both… All … All Origins   16 y…  2011       2 100593000              753 2011.
whole.pop %>% 
  ggplot(aes(qtime,median_weekly_earn)) +
  geom_point()

Comparison

We can now use this as a platform to compare the earnings of a subset of the population to those of the population as a whole. Note that all conceivable subset exist.

Start by creating a dataframe for people in the 16 - 24 age group. Modify the code used for creating the whole population dataframe.

Create a time-series plot to show how the earnings of this subset compares with the whole population. Color the subset points red.

Answer

young = earn %>% 
  filter(sex == "Both Sexes",
         race == "All Races",
         age == "16 to 24 years",
         ethnic_origin == "All Origins") %>% 
  mutate(qtime = year + (quarter - 1) * .25)



whole.pop %>% 
  ggplot(aes(qtime,median_weekly_earn)) +
  geom_point() +
  geom_point(data = young,color="red")

## Middle Age

Repeat the exercise for the 45 to 54 years age group.

Answer

middle_age = earn %>% 
  filter(sex == "Both Sexes",
         race == "All Races",
         age == "45 to 54 years",
         ethnic_origin == "All Origins") %>% 
  mutate(qtime = year + (quarter - 1) * .25)



whole.pop %>% 
  ggplot(aes(qtime,median_weekly_earn)) +
  geom_point() +
  geom_point(data = middle_age,color="red")

An Alternative Comparison

We can also compare the earnings time-series for subsets of the population defined by the values of one of our factors.

As an example, compare earnings by sex for middle-aged people.

Answer

middle_age_by_sex = earn %>% 
  filter(race == "All Races",
         age == "45 to 54 years",
         ethnic_origin == "All Origins") %>% 
  mutate(qtime = year + (quarter - 1) * .25)



middle_age_by_sex %>% 
  ggplot(aes(qtime,median_weekly_earn,color = sex)) +
  geom_point()