Harold Nelson
10/1/2020
Load the file Nat0718.Rdata from your working directory after downloading it from Moodle.
Use glipmse() and/or str() to identify the categorical and quantitative variables in this dataframe.
## Rows: 1,152
## Columns: 8
## $ Region <fct> NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE, NE…
## $ Race <fct> AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd, AmInd,…
## $ Age <fct> 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19, 15-19,…
## $ Year <dbl> 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 20…
## $ Births <dbl> 281, 279, 258, 233, 269, 284, 212, 157, 163, 123, 117, 160, 58…
## $ Fpop <dbl> 15483, 16310, 17173, 17600, 17035, 16940, 16635, 16337, 16373,…
## $ yr <dbl> 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 7, 8, 9, 10, 11, …
## $ Rate <dbl> 0.018148938, 0.017106070, 0.015023584, 0.013238636, 0.01579101…
There are three clearly categorical variables: Region, Race, and Age (Age bracket).
The remaining variables are quantitative. yr and Year have the same information. Rate is redundant since it can be computed from Fpop and Births.
Do some checking of the data for reasonableness.
## Region Race Age Year Births
## NE:288 AmInd:288 15-19 :192 Min. :2007 Min. : 39
## MW:288 Black:288 20-24 :192 1st Qu.:2010 1st Qu.: 2382
## S :288 White:288 25-29 :192 Median :2012 Median : 12266
## W :288 A-PI :288 30-34 :192 Mean :2012 Mean : 41623
## 35-39 :192 3rd Qu.:2015 3rd Qu.: 36796
## 40-44 :192 Max. :2018 Max. :337413
## (Other): 0
## Fpop yr Rate
## Min. : 13280 Min. : 7.00 Min. :0.00277
## 1st Qu.: 85274 1st Qu.: 9.75 1st Qu.:0.01975
## Median : 233955 Median :12.50 Median :0.05096
## Mean : 656517 Mean :12.50 Mean :0.05652
## 3rd Qu.:1109400 3rd Qu.:15.25 3rd Qu.:0.08964
## Max. :3070092 Max. :18.00 Max. :0.14825
##
The numbers do vary widely but this is reasonable because of the variation in specific population sizes and the variation in birth rates for soecific groups.
Create a dataframe births_by_year containing the variables yr and total_births.
Display the dataframe.
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 2
## yr total_births
## <dbl> <dbl>
## 1 7 4302685
## 2 8 4234280
## 3 9 4117747
## 4 10 3987164
## 5 11 3942006
## 6 12 3941412
## 7 13 3920911
## 8 14 3976864
## 9 15 3967072
## 10 16 3934579
## 11 17 3844260
## 12 18 3780401
It is somewhat surprising that the total number if births in the US has declined. Has the number of women been declining in this time period. Redo the previous dataframe and include the total number of women. Also compute the ratio of births to women. Print this dataframe so we can see all three variables.
births_by_year = Nat0718 %>%
group_by (yr) %>%
summarize(total_births = sum(Births),
total_women = sum(Fpop)) %>%
mutate(ratio = total_births/total_women)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 12 x 4
## yr total_births total_women ratio
## <dbl> <dbl> <dbl> <dbl>
## 1 7 4302685 62292084 0.0691
## 2 8 4234280 62359858 0.0679
## 3 9 4117747 62373024 0.0660
## 4 10 3987164 62374964 0.0639
## 5 11 3942006 62517048 0.0631
## 6 12 3941412 62744930 0.0628
## 7 13 3920911 62939772 0.0623
## 8 14 3976864 63356565 0.0628
## 9 15 3967072 63606765 0.0624
## 10 16 3934579 63613014 0.0619
## 11 17 3844260 63958243 0.0601
## 12 18 3780401 64171698 0.0589
Create plots of each of these three variables against time.
base = births_by_year %>% ggplot(aes(x = yr))
base + geom_point(aes(y = total_births)) + ggtitle("Births")
Could this be happening because there are fewer women in age groups at which births are most common. Let’s look at how age enters into this picture.
Compute births per woman for each of the age groups in our data. Combine all years, races, and regions.
Display the dataframe.
age_group_ratios = Nat0718 %>%
group_by(Age) %>%
summarize(total_births = sum(Births),
total_women = sum(Fpop)) %>%
mutate(ratio = total_births/total_women)
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 4
## Age total_births total_women ratio
## <fct> <dbl> <dbl> <dbl>
## 1 15-19 3628252 125889729 0.0288
## 2 20-24 10858973 129208011 0.0840
## 3 25-29 13746835 129397220 0.106
## 4 30-34 12342495 124330266 0.0993
## 5 35-39 6053616 122765369 0.0493
## 6 40-44 1319210 124717370 0.0106
The differences in births per woman across age groups suggests that a change in the age structure may be relevant. How could we pursue this question?
Note that Nat0718 itself has the value of our ratio computed for every cell defined by Year, Race and Region.
Do a scatterplot for Rate against Year for the age group 25-29.
Nat0718 %>%
filter(Age == "25-29") %>%
ggplot(aes(x = yr, y = Rate)) +
geom_point() +
ggtitle("25-29")
What’s the story behind the grouping? How do race and region play into this picture?
First, Look at Race
Nat0718 %>%
filter(Age == "25-29") %>%
ggplot(aes(x = yr, y = Rate, color = Race)) +
geom_point() +
ggtitle("25-29 color = Race")
Let’s look at Region.