## ── Attaching packages ─────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1 ✓ purrr 0.3.3
## ✓ tibble 2.1.3 ✓ dplyr 0.8.4
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
A cohort is a collection of people born (or otherwise began) at the same time.
Naturally the size of a cohort is largest at the beginning and naturally shrinks over time. In demographics, birth cohorts experience losses as people die.
In the last graph of the preceding lecture, we noticed that some cohorts grew over time. The number of women aged 35-39 in 2012 should be smaller than the number of women aged 30-34 in 2007. The graph contradicts this assumption. What cane we see in the data that could shed some light on this?
Create a plot of the size over time of the cohort of women who were aged 30-34 in 2007.
QNat0718 = Nat0718 %>%
mutate(Age = as.character(Age)) %>%
filter((Year == 2007 & Age == "30-34") |
(Year == 2012 & Age == "35-39") |
(Year == 2017 & Age == "40-44")) %>%
group_by(Year) %>%
summarize(Fpop = sum(Fpop)) %>%
ungroup() %>%
select(Year, Fpop) %>%
mutate(Fpop = Fpop/1000000)
QNat0718 %>%
ggplot(aes(x = factor(Year), y = Fpop)) +
geom_point() +
ggtitle("Women 30 to 34 in 2007")
Replicate the plot for women who were 15 to 19 in 2007.
QNat0718 = Nat0718 %>%
mutate(Age = as.character(Age)) %>%
filter((Year == 2007 & Age == "15-19") |
(Year == 2012 & Age == "20-24") |
(Year == 2017 & Age == "25-29")) %>%
group_by(Year) %>%
summarize(Fpop = sum(Fpop)) %>%
ungroup() %>%
select(Year, Fpop) %>%
mutate(Fpop = Fpop/1000000)
QNat0718 %>%
ggplot(aes(x = factor(Year), y = Fpop)) +
geom_point() +
ggtitle("Women 15 to 19 in 2007")
Compare the racial composition of women 15 to 19 in 2007 with the racial composition of women 25 to 29 in 2017.
RCNat0718 = Nat0718 %>%
mutate(Age = as.character(Age)) %>%
filter((Year == 2007 & Age == "15-19") |
(Year == 2017 & Age == "25-29")) %>%
group_by(Year,Race) %>%
summarize(Fpop = sum(Fpop)) %>%
ungroup() %>%
mutate(Fpop = Fpop/1000000)
RCNat0718 %>%
ggplot(aes(x = Race, y = Fpop)) +
geom_point() +
facet_wrap(~Year) +
ggtitle("Women 15 to 19 in 2007")
The dataframe RCNat0718 has the information we want but its structure can be improved. Look at the table as it is now.
## # A tibble: 8 x 3
## Year Race Fpop
## <dbl> <fct> <dbl>
## 1 2007 AmInd 0.182
## 2 2007 Black 1.83
## 3 2007 White 8.19
## 4 2007 A-PI 0.529
## 5 2017 AmInd 0.187
## 6 2017 Black 1.91
## 7 2017 White 8.45
## 8 2017 A-PI 0.923
For each race, the counts for 2007 and 2017 are in separate rows. To do the arithmetic we want for comparison, we need to have the 2007 and 2017 counts in one row for each race. This is a common problem in data analysis. There are at least two ways to solve this problem. We can use dplyr commands to get a single row of data for each race. The following code does this.
Yr2007 = RCNat0718 %>%
filter(Year == 2007)
Yr2017 = RCNat0718 %>%
filter(Year == 2017) %>%
rename(Fpop17 = Fpop) %>%
select(Fpop17)
both = cbind(Yr2007,Yr2017)
both = both %>%
select(-Year) %>%
mutate(dpop = Fpop17 - Fpop,
pct_growth= dpop/Fpop,
pct_contrib = dpop/sum(dpop))
both
## Race Fpop Fpop17 dpop pct_growth pct_contrib
## 1 AmInd 0.181783 0.186706 0.004923 0.02708174 0.006680653
## 2 Black 1.829855 1.912390 0.082535 0.04510467 0.112002378
## 3 White 8.190463 8.445926 0.255463 0.03119030 0.346670665
## 4 A-PI 0.529225 0.923208 0.393983 0.74445274 0.534646304
In this view, it is possible to see that the growth in the A-PI category has the largest absolute change and accounts for 53% of the growth in the cohort.
The package tidyr has a function pivot_wider, which will perform this manipulation without lower level commands.
both = RCNat0718 %>%
pivot_wider(names_from = Year, values_from = Fpop)
colnames(both ) = c("Race","Fpop","Fpop17")
both = both %>%
mutate(dpop = Fpop17 - Fpop,
pct_growth= dpop/Fpop,
pct_contrib = dpop/sum(dpop))
both
## # A tibble: 4 x 6
## Race Fpop Fpop17 dpop pct_growth pct_contrib
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 AmInd 0.182 0.187 0.00492 0.0271 0.00668
## 2 Black 1.83 1.91 0.0825 0.0451 0.112
## 3 White 8.19 8.45 0.255 0.0312 0.347
## 4 A-PI 0.529 0.923 0.394 0.744 0.535