Cohorts

Libraries and Data

library(tidyverse)
## ── Attaching packages ─────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.4
## ✓ tidyr   1.0.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
load("/cloud/project/Nat0718.Rdata")

A cohort is a collection of people born (or otherwise began) at the same time.

Naturally the size of a cohort is largest at the beginning and naturally shrinks over time. In demographics, birth cohorts experience losses as people die.

In the last graph of the preceding lecture, we noticed that some cohorts grew over time. The number of women aged 35-39 in 2012 should be smaller than the number of women aged 30-34 in 2007. The graph contradicts this assumption. What cane we see in the data that could shed some light on this?

Build a Cohort

Create a plot of the size over time of the cohort of women who were aged 30-34 in 2007.

Answer

QNat0718 = Nat0718 %>% 
  mutate(Age = as.character(Age)) %>% 
  filter((Year == 2007 & Age == "30-34") |
         (Year == 2012 & Age == "35-39") |
         (Year == 2017 & Age == "40-44")) %>% 
  group_by(Year) %>% 
  summarize(Fpop = sum(Fpop)) %>% 
  ungroup() %>% 
  select(Year, Fpop) %>% 
  mutate(Fpop = Fpop/1000000)

QNat0718 %>% 
  ggplot(aes(x = factor(Year), y = Fpop)) +
  geom_point() + 
  ggtitle("Women 30 to 34 in 2007")

Another Cohort.

Replicate the plot for women who were 15 to 19 in 2007.

Answer

QNat0718 = Nat0718 %>% 
  mutate(Age = as.character(Age)) %>% 
  filter((Year == 2007 & Age == "15-19") |
         (Year == 2012 & Age == "20-24") |
         (Year == 2017 & Age == "25-29")) %>% 
  group_by(Year) %>% 
  summarize(Fpop = sum(Fpop)) %>% 
  ungroup() %>% 
  select(Year, Fpop) %>% 
  mutate(Fpop = Fpop/1000000)

QNat0718 %>% 
  ggplot(aes(x = factor(Year), y = Fpop)) +
  geom_point() + 
  ggtitle("Women 15 to 19 in 2007")

Compare

Compare the racial composition of women 15 to 19 in 2007 with the racial composition of women 25 to 29 in 2017.

RCNat0718 = Nat0718 %>% 
  mutate(Age = as.character(Age)) %>% 
  filter((Year == 2007 & Age == "15-19") |
         (Year == 2017 & Age == "25-29")) %>% 
  group_by(Year,Race) %>% 
  summarize(Fpop = sum(Fpop)) %>% 
  ungroup() %>% 
  mutate(Fpop = Fpop/1000000)

RCNat0718 %>% 
  ggplot(aes(x = Race, y = Fpop)) +
  geom_point() + 
  facet_wrap(~Year) +
  ggtitle("Women 15 to 19 in 2007")

Improved Cmparison

The dataframe RCNat0718 has the information we want but its structure can be improved. Look at the table as it is now.

RCNat0718
## # A tibble: 8 x 3
##    Year Race   Fpop
##   <dbl> <fct> <dbl>
## 1  2007 AmInd 0.182
## 2  2007 Black 1.83 
## 3  2007 White 8.19 
## 4  2007 A-PI  0.529
## 5  2017 AmInd 0.187
## 6  2017 Black 1.91 
## 7  2017 White 8.45 
## 8  2017 A-PI  0.923

For each race, the counts for 2007 and 2017 are in separate rows. To do the arithmetic we want for comparison, we need to have the 2007 and 2017 counts in one row for each race. This is a common problem in data analysis. There are at least two ways to solve this problem. We can use dplyr commands to get a single row of data for each race. The following code does this.

Yr2007 = RCNat0718 %>% 
  filter(Year == 2007)

Yr2017 = RCNat0718 %>% 
  filter(Year == 2017) %>% 
  rename(Fpop17 = Fpop) %>% 
  select(Fpop17)

both = cbind(Yr2007,Yr2017)

both = both %>% 
  select(-Year) %>% 
  mutate(dpop = Fpop17 - Fpop,
         pct_growth= dpop/Fpop,
         pct_contrib = dpop/sum(dpop))

both
##    Race     Fpop   Fpop17     dpop pct_growth pct_contrib
## 1 AmInd 0.181783 0.186706 0.004923 0.02708174 0.006680653
## 2 Black 1.829855 1.912390 0.082535 0.04510467 0.112002378
## 3 White 8.190463 8.445926 0.255463 0.03119030 0.346670665
## 4  A-PI 0.529225 0.923208 0.393983 0.74445274 0.534646304

In this view, it is possible to see that the growth in the A-PI category has the largest absolute change and accounts for 53% of the growth in the cohort.

The Widyverse Way

The package tidyr has a function pivot_wider, which will perform this manipulation without lower level commands.

both = RCNat0718 %>% 
  pivot_wider(names_from = Year, values_from = Fpop)

colnames(both ) = c("Race","Fpop","Fpop17")

both = both %>% 
  mutate(dpop = Fpop17 - Fpop,
         pct_growth= dpop/Fpop,
         pct_contrib = dpop/sum(dpop))
both
## # A tibble: 4 x 6
##   Race   Fpop Fpop17    dpop pct_growth pct_contrib
##   <fct> <dbl>  <dbl>   <dbl>      <dbl>       <dbl>
## 1 AmInd 0.182  0.187 0.00492     0.0271     0.00668
## 2 Black 1.83   1.91  0.0825      0.0451     0.112  
## 3 White 8.19   8.45  0.255       0.0312     0.347  
## 4 A-PI  0.529  0.923 0.394       0.744      0.535