Fertility Data Exploration

Harold Nelson

10/1/2020

Setup

Load the usual packages.

library(tidyverse)
library(knitr)

Load Data

Load the file Nat0718.Rdata from your working directory after downloading it from Moodle.

Answer

load("Nat0718.Rdata")

Basic ID Task

Use glipmse() and/or str() to identify the categorical and quantitative variables in this dataframe.

Answer

glimpse(Nat0718)
## Rows: 1,152
## Columns: 8
## $ Region <fct> NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE…
## $ Race   <fct> AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd,…
## $ Age    <fct> 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19,…
## $ Year   <dbl> 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 20…
## $ Births <dbl> 281, 279, 258, 233, 269, 284, 212, 157, 163, 123, 117, 160, 58…
## $ Fpop   <dbl> 15483, 16310, 17173, 17600, 17035, 16940, 16635, 16337, 16373,…
## $ yr     <dbl> 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 7, 8, 9, 10, 11, …
## $ Rate   <dbl> 0.018148938, 0.017106070, 0.015023584, 0.013238636, 0.01579101…

There are three clearly categorical variables: Region, Race, and Age (Age bracket).

The remaining variables are quantitative. yr and Year have the same information. Rate is redundant since it can be computed from Fpop and Births.

Integrity Tests

Do some checking of the data for reasonableness.

summary(Nat0718)
##  Region      Race          Age           Year          Births      
##  NE:288   AmInd:288   15-19  :192   Min.   :2007   Min.   :    39  
##  MW:288   Black:288   20-24  :192   1st Qu.:2010   1st Qu.:  2382  
##  S :288   White:288   25-29  :192   Median :2012   Median : 12266  
##  W :288   A-PI :288   30-34  :192   Mean   :2012   Mean   : 41623  
##                       35-39  :192   3rd Qu.:2015   3rd Qu.: 36796  
##                       40-44  :192   Max.   :2018   Max.   :337413  
##                       (Other):  0                                  
##       Fpop               yr             Rate        
##  Min.   :  13280   Min.   : 7.00   Min.   :0.00277  
##  1st Qu.:  85274   1st Qu.: 9.75   1st Qu.:0.01975  
##  Median : 233955   Median :12.50   Median :0.05096  
##  Mean   : 656517   Mean   :12.50   Mean   :0.05652  
##  3rd Qu.:1109400   3rd Qu.:15.25   3rd Qu.:0.08964  
##  Max.   :3070092   Max.   :18.00   Max.   :0.14825  
## 

The numbers do vary widely but this is reasonable because of the variation in specific population sizes and the variation in birth rates for soecific groups.

Births by Year

Create a dataframe births_by_year containing the variables yr and total_births.

Display the dataframe.

Answer

births_by_year = Nat0718 %>% 
  group_by (yr) %>% 
  summarize(total_births = sum(Births))
## `summarise()` ungrouping output (override with `.groups` argument)
births_by_year
## # A tibble: 12 x 2
##       yr total_births
##    <dbl>        <dbl>
##  1     7      4302685
##  2     8      4234280
##  3     9      4117747
##  4    10      3987164
##  5    11      3942006
##  6    12      3941412
##  7    13      3920911
##  8    14      3976864
##  9    15      3967072
## 10    16      3934579
## 11    17      3844260
## 12    18      3780401

It is somewhat surprising that the total number if births in the US has declined. Has the number of women been declining in this time period. Redo the previous dataframe and include the total number of women. Also compute the ratio of births to women. Print this dataframe so we can see all three variables.

Answer

births_by_year = Nat0718 %>% 
  group_by (yr) %>% 
  summarize(total_births = sum(Births),
            total_women = sum(Fpop)) %>%
  mutate(ratio = total_births/total_women)
## `summarise()` ungrouping output (override with `.groups` argument)
births_by_year
## # A tibble: 12 x 4
##       yr total_births total_women  ratio
##    <dbl>        <dbl>       <dbl>  <dbl>
##  1     7      4302685    62292084 0.0691
##  2     8      4234280    62359858 0.0679
##  3     9      4117747    62373024 0.0660
##  4    10      3987164    62374964 0.0639
##  5    11      3942006    62517048 0.0631
##  6    12      3941412    62744930 0.0628
##  7    13      3920911    62939772 0.0623
##  8    14      3976864    63356565 0.0628
##  9    15      3967072    63606765 0.0624
## 10    16      3934579    63613014 0.0619
## 11    17      3844260    63958243 0.0601
## 12    18      3780401    64171698 0.0589

Basic Plots

Create plots of each of these three variables against time.

Answer

base = births_by_year %>% ggplot(aes(x = yr))

base + geom_point(aes(y = total_births)) + ggtitle("Births")

base + geom_point(aes(y = total_women)) + ggtitle("Women")

base + geom_point(aes(y = ratio)) + ggtitle("Births per Woman")

Age Structure

Could this be happening because there are fewer women in age groups at which births are most common. Let’s look at how age enters into this picture.

Compute births per woman for each of the age groups in our data. Combine all years, races, and regions.

Display the dataframe.

Answer

age_group_ratios = Nat0718 %>% 
  group_by(Age) %>% 
  summarize(total_births = sum(Births),
            total_women = sum(Fpop)) %>%
  mutate(ratio = total_births/total_women)
## `summarise()` ungrouping output (override with `.groups` argument)
age_group_ratios
## # A tibble: 6 x 4
##   Age   total_births total_women  ratio
##   <fct>        <dbl>       <dbl>  <dbl>
## 1 15-19      3628252   125889729 0.0288
## 2 20-24     10858973   129208011 0.0840
## 3 25-29     13746835   129397220 0.106 
## 4 30-34     12342495   124330266 0.0993
## 5 35-39      6053616   122765369 0.0493
## 6 40-44      1319210   124717370 0.0106

The differences in births per woman across age groups suggests that a change in the age structure may be relevant. How could we pursue this question?

Breakout Exercise

My Answer

Note that Nat0718 itself has the value of our ratio computed for every cell defined by Year, Race and Region.

Do a scatterplot for Rate against Year for the age group 25-29.

Nat0718 %>% 
  filter(Age == "25-29") %>% 
  ggplot(aes(x = yr, y = Rate)) + 
  geom_point() +
  ggtitle("25-29")

What’s the story behind the grouping? How do race and region play into this picture?

Breakout Exercise

My Answer(s)

First, Look at Race

Nat0718 %>% 
  filter(Age == "25-29") %>% 
  ggplot(aes(x = yr, y = Rate, color = Race)) + 
  geom_point() +
  ggtitle("25-29 color = Race")

Let’s look at Region.

Nat0718 %>% 
  filter(Age == "25-29") %>% 
  ggplot(aes(x = yr, y = Rate, color = Region)) + 
  geom_point() +
  ggtitle("25-29")

Let’s facet

Nat0718 %>% 
  filter(Age == "25-29") %>% 
  ggplot(aes(x = yr, y = Rate)) + 
  geom_point(color = "blue") +
  geom_smooth(method = "lm", color = "red", size = .5) +
  ggtitle("25-29") +
  facet_grid(Race ~ Region)
## `geom_smooth()` using formula 'y ~ x'