Harold Nelson
3/3/2021
Get the earnings data from Tidy Tuesday of 2/23 and load the tidyverse.
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.0.6 ✓ dplyr 1.0.4
## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
earn <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-02-23/earn.csv')
##
## ── Column specification ────────────────────────────────────────────────────────
## cols(
## sex = col_character(),
## race = col_character(),
## ethnic_origin = col_character(),
## age = col_character(),
## year = col_double(),
## quarter = col_double(),
## n_persons = col_double(),
## median_weekly_earn = col_double()
## )
## Rows: 4,224
## Columns: 8
## $ sex <chr> "Both Sexes", "Both Sexes", "Both Sexes", "Both Se…
## $ race <chr> "All Races", "All Races", "All Races", "All Races"…
## $ ethnic_origin <chr> "All Origins", "All Origins", "All Origins", "All …
## $ age <chr> "16 years and over", "16 years and over", "16 year…
## $ year <dbl> 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 20…
## $ quarter <dbl> 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1,…
## $ n_persons <dbl> 96821000, 99798000, 101385000, 100120000, 98329000…
## $ median_weekly_earn <dbl> 754, 740, 740, 752, 755, 753, 753, 764, 769, 771, …
## sex race ethnic_origin age
## Length:4224 Length:4224 Length:4224 Length:4224
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
## year quarter n_persons median_weekly_earn
## Min. :2010 Min. :1.00 Min. : 103000 Min. : 318.0
## 1st Qu.:2012 1st Qu.:1.75 1st Qu.: 2614000 1st Qu.: 605.0
## Median :2015 Median :2.50 Median : 7441000 Median : 755.0
## Mean :2015 Mean :2.50 Mean : 16268338 Mean : 762.2
## 3rd Qu.:2018 3rd Qu.:3.25 3rd Qu.: 17555250 3rd Qu.: 911.0
## Max. :2020 Max. :4.00 Max. :118358000 Max. :1709.0
Convert the character variables to factors. Get tables of each of these.
earn = earn %>%
mutate(sex = factor(sex),
race = factor(race),
ethnic_origin = factor(ethnic_origin),
age = factor(age))
table(earn$sex)
##
## Both Sexes Men Women
## 1408 1408 1408
##
## All Races Asian Black or African American
## 2244 660 660
## White
## 660
##
## All Origins Hispanic or Latino
## 3564 660
##
## 16 to 19 years 16 to 24 years 16 years and over 20 to 24 years
## 132 660 660 132
## 25 to 34 years 25 to 54 years 25 years and over 35 to 44 years
## 132 660 660 132
## 45 to 54 years 55 to 64 years 55 years and over 65 years and over
## 132 132 660 132
These counts show that not all combinations of possible values exist.
There is a combination that defines the population as a whole.
Define this combination and use it to create an overall time-series of median weekly earnings.
Use year as the time variable for your first attempt.
whole.pop = earn %>%
filter(sex == "Both Sexes",
race == "All Races",
age == "16 years and over",
ethnic_origin == "All Origins")
head(whole.pop)
## # A tibble: 6 x 8
## sex race ethnic_origin age year quarter n_persons median_weekly_e…
## <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl>
## 1 Both S… All R… All Origins 16 year… 2010 1 96821000 754
## 2 Both S… All R… All Origins 16 year… 2010 2 99798000 740
## 3 Both S… All R… All Origins 16 year… 2010 3 101385000 740
## 4 Both S… All R… All Origins 16 year… 2010 4 100120000 752
## 5 Both S… All R… All Origins 16 year… 2011 1 98329000 755
## 6 Both S… All R… All Origins 16 year… 2011 2 100593000 753
How can we expand the time dimension to spread out the quarters?
whole.pop = earn %>%
filter(sex == "Both Sexes",
race == "All Races",
age == "16 years and over",
ethnic_origin == "All Origins") %>%
mutate(qtime = year + (quarter - 1) * .25)
head(whole.pop)
## # A tibble: 6 x 9
## sex race ethnic_origin age year quarter n_persons median_weekly_e… qtime
## <fct> <fct> <fct> <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Both… All … All Origins 16 y… 2010 1 96821000 754 2010
## 2 Both… All … All Origins 16 y… 2010 2 99798000 740 2010.
## 3 Both… All … All Origins 16 y… 2010 3 101385000 740 2010.
## 4 Both… All … All Origins 16 y… 2010 4 100120000 752 2011.
## 5 Both… All … All Origins 16 y… 2011 1 98329000 755 2011
## 6 Both… All … All Origins 16 y… 2011 2 100593000 753 2011.
We can now use this as a platform to compare the earnings of a subset of the population to those of the population as a whole. Note that all conceivable subset exist.
Start by creating a dataframe for people in the 16 - 24 age group. Modify the code used for creating the whole population dataframe.
Create a time-series plot to show how the earnings of this subset compares with the whole population. Color the subset points red.
young = earn %>%
filter(sex == "Both Sexes",
race == "All Races",
age == "16 to 24 years",
ethnic_origin == "All Origins") %>%
mutate(qtime = year + (quarter - 1) * .25)
whole.pop %>%
ggplot(aes(qtime,median_weekly_earn)) +
geom_point() +
geom_point(data = young,color="red")
## Middle Age
Repeat the exercise for the 45 to 54 years age group.
middle_age = earn %>%
filter(sex == "Both Sexes",
race == "All Races",
age == "45 to 54 years",
ethnic_origin == "All Origins") %>%
mutate(qtime = year + (quarter - 1) * .25)
whole.pop %>%
ggplot(aes(qtime,median_weekly_earn)) +
geom_point() +
geom_point(data = middle_age,color="red")
We can also compare the earnings time-series for subsets of the population defined by the values of one of our factors.
As an example, compare earnings by sex for middle-aged people.