Fertility Study

Get Libraries and Raw Data

library(tidyverse)
## ── Attaching packages ────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.1       ✔ purrr   0.2.5  
## ✔ tibble  2.0.1       ✔ dplyr   0.8.0.1
## ✔ tidyr   0.8.1       ✔ stringr 1.4.0  
## ✔ readr   1.1.1       ✔ forcats 0.3.0
## ── Conflicts ───────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(stringr)
Nat0306 <- read_delim("~/Dropbox/RProjects/Fertility Study/Nat0306.txt","\t", escape_double = FALSE)
## Parsed with column specification:
## cols(
##   Notes = col_character(),
##   State = col_character(),
##   StateCode = col_character(),
##   Age = col_character(),
##   AgeCode = col_character(),
##   Year = col_integer(),
##   YearCode = col_integer(),
##   Births = col_integer(),
##   FPop = col_character(),
##   Rate = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 66 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected   actual    file                                     expected   <int> <chr> <chr>      <chr>     <chr>                                    actual 1  1637 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… file 2  1638 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… row 3  1639 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… col 4  1640 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… expected 5  1641 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na…
## ... ................. ... ........................................................................... ........ ........................................................................... ...... ........................................................................... .... ........................................................................... ... ........................................................................... ... ........................................................................... ........ ...........................................................................
## See problems(...) for more details.
Nat0717 <- read_delim("~/Dropbox/RProjects/Fertility Study/Nat0717.txt","\t", escape_double = FALSE)
## Parsed with column specification:
## cols(
##   Notes = col_character(),
##   State = col_character(),
##   StateCode = col_character(),
##   Age = col_character(),
##   AgeCode = col_character(),
##   Year = col_integer(),
##   YearCode = col_integer(),
##   Births = col_integer(),
##   FPop = col_character(),
##   Rate = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 73 parsing failures.
## row # A tibble: 5 x 5 col     row col   expected   actual    file                                     expected   <int> <chr> <chr>      <chr>     <chr>                                    actual 1  4539 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… file 2  4540 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… row 3  4541 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… col 4  4542 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na… expected 5  4543 <NA>  10 columns 1 columns '~/Dropbox/RProjects/Fertility Study/Na…
## ... ................. ... ........................................................................... ........ ........................................................................... ...... ........................................................................... .... ........................................................................... ... ........................................................................... ... ........................................................................... ........ ...........................................................................
## See problems(...) for more details.

Parsing Errors ??

Look at the data.

Solution

Select the variables we want. Convert the character variable Rate to numeric and eliminate the rows where this variable is na. Then combine the two dataframes using rbind.

We could probably do the rbind first and save a few lines of typing. Per-preocesing the two separate files eliminates the possibility of problems because of differences between these two files, which came from different queries.

Nat0306 %>% 
  select(State,Year,Age,Rate) %>% 
  mutate(Rate = as.numeric(Rate)) %>% 
  filter(!is.na(Rate)) -> Nat0306
## Warning: NAs introduced by coercion
glimpse(Nat0306)
## Observations: 1,224
## Variables: 4
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Al…
## $ Year  <int> 2003, 2004, 2005, 2006, 2003, 2004, 2005, 2006, 2003, 2004…
## $ Age   <chr> "15-19 years", "15-19 years", "15-19 years", "15-19 years"…
## $ Rate  <dbl> 51.40, 51.02, 48.06, 51.83, 113.67, 113.48, 116.90, 121.36…
Nat0717 %>% 
  select(State,Year,Age,Rate) %>% 
  mutate(Rate = as.numeric(Rate)) %>% 
  filter(!is.na(Rate)) -> Nat0717
## Warning: NAs introduced by coercion
glimpse(Nat0717)
## Observations: 3,366
## Variables: 4
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Al…
## $ Year  <int> 2007, 2008, 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016…
## $ Age   <chr> "15-19 years", "15-19 years", "15-19 years", "15-19 years"…
## $ Rate  <dbl> 52.07, 50.53, 48.30, 43.63, 40.53, 39.15, 34.32, 32.02, 30…
both = rbind(Nat0306,Nat0717)
summary(both)
##     State                Year          Age                 Rate       
##  Length:4590        Min.   :2003   Length:4590        Min.   :  3.93  
##  Class :character   1st Qu.:2006   Class :character   1st Qu.: 29.38  
##  Mode  :character   Median :2010   Mode  :character   Median : 60.67  
##                     Mean   :2010                      Mean   : 65.21  
##                     3rd Qu.:2014                      3rd Qu.:102.11  
##                     Max.   :2017                      Max.   :184.05

Create the Total Fertility Rate (TFR)

Now create a new dataframe with one row for each combination of State and Year. Create the variable TFR.

both %>% 
  group_by(State,Year) %>% 
  summarize(TFR = 5*sum(Rate)/1000) %>% 
  ungroup() -> SYTFR
glimpse(SYTFR)
## Observations: 765
## Variables: 3
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "Al…
## $ Year  <int> 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 2012…
## $ TFR   <dbl> 1.92205, 1.91695, 1.93785, 2.00555, 2.04290, 2.02050, 1.94…

Examine the Results

Let’s look at the results. First try side-by-side boxplots to look at the changes over time. Note that we have to cast Year as a factor to get our boxplots.

Solution

SYTFR %>% 
  ggplot(aes(x = factor(Year),y=TFR)) + geom_boxplot()

There are some distinct groups of years. For analysis purposes, I would group them as follows:

Scatterplot.

Solution

SYTFR %>% ggplot(aes(x= Year, y = TFR)) + geom_point()

Explanation?

Probably the great recession caused a drop in the birth rate.

The next step in our analysis is to introduce per capita personal income data from the Bureau of Economic Analysis. This is available in the file CA1.csv in Moodle. Import this data.

Solution

CA1 <- read_csv("~/Dropbox/RProjects/Fertility Study/CA1.csv")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   GeoFIPS = col_integer(),
##   Region = col_integer(),
##   LineCode = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## row # A tibble: 4 x 5 col     row col    expected  actual                         file                expected   <int> <chr>  <chr>     <chr>                          <chr>               actual 1  9595 GeoFI… an integ… Note: See the included footno… '~/Dropbox/RProjec… file 2  9596 GeoFI… an integ… CA1: Personal Income Summary:… '~/Dropbox/RProjec… row 3  9597 GeoFI… an integ… Last updated: November 17, 20… '~/Dropbox/RProjec… col 4  9598 GeoFI… an integ… Source: U.S. Department of Co… '~/Dropbox/RProjec…

Error Messages?

Answer

Looking at the results of the import process, we can see that there are some invalid records at the bottom of the file. These can be eliminated by requiring that GeoFIPS be numeric. We also want to keep only the rows for statewide values. These can be indentified as having GeoName values containing the string “state”. The income data we want can be identified as having LineCode values of 3. In our output, we want only the name of the state and the income data in columns 8 through 54.

OK, Do that

Solution

CA2 <- CA1 %>% 
  filter(LineCode == 3, str_detect(GeoName,"state")) %>%
  select(GeoName,8:54) %>%
  rename(State = GeoName)

Look at CA2

Solution

Use View

Now we need to remove the extra text from the variable State. All rows have a string " state total" and some have an asterisk. These are removed in separate steps. The asterisk needs two leading backslashes because it has special significance in regular expressions. This is called “escaping.”

Do that

Solution

CA3 <- CA2 %>%
  mutate(State = gsub(" state total","",State)) %>% 
  mutate(State = gsub("\\*","",State))

Examine

Use View.

Conclusion

Now we need to use the gather function from tidyr to put the column names indicating years into a column named Year with the income values from those columns in a single column named PCInc. In addition, we need to make Year and PCInc numeric. We only want data for the years 2003 through 2015.

Look at the documentation on gather() in Chapter 12 of RFDS.

Do that

Solution

CA4 <- CA3 %>% 
  gather(Year,PCInc,-1) %>% 
  mutate(Year = as.numeric(Year),
         PCInc = as.numeric(PCInc)) %>%   filter(Year >= 2003)
glimpse(CA4)
## Observations: 663
## Variables: 3
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", …
## $ Year  <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003…
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 38187, 46…

Now we can join the fertility data with the income data and examine the relationship between the two.

AllData = inner_join(CA4,SYTFR)
## Joining, by = c("State", "Year")
glimpse(AllData)
## Observations: 663
## Variables: 4
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California", …
## $ Year  <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003…
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 38187, 46…
## $ TFR   <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.14480, 1.90…

We can now look at the relationship between per capita personal income and the TFR.

Do that with a scatterplot

Solution

AllData %>% ggplot(aes(x=PCInc,y=TFR)) + geom_point()

The plot suggests something counterituitive - Higher income leads to reduced fertility. This clashes with our explanation for the decline in fertility beginning with the great recession.

Let’s add time to this graphic by mapping color to Year.

Do that

Solution

AllData %>% ggplot(aes(x=PCInc,y=TFR,color=Year)) + geom_point()

It seems that there may be some correlation between time and our income variable. Review data on real personal income and the CPI in FRED. https://fred.stlouisfed.org/ is the link.

Let’s import the CPI data downloaded from FRED.

Do that

Solution

CPIData <- read_csv("~/Dropbox/RProjects/Fertility Study/CPIData.csv")
## Parsed with column specification:
## cols(
##   Year = col_integer(),
##   CPI = col_double()
## )

Now we need to join the CPI Data with AllData and adjust the per capita income data for inflation.

Do that

Solution

AllData2 = inner_join(AllData,CPIData)
## Joining, by = "Year"
AllData2 %>% mutate(RealPCInc = PCInc/CPI) -> AllData2
glimpse(AllData2)
## Observations: 663
## Variables: 6
## $ State     <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Californi…
## $ Year      <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, …
## $ PCInc     <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 38187…
## $ TFR       <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.14480, …
## $ CPI       <dbl> 184, 184, 184, 184, 184, 184, 184, 184, 184, 184, 184,…
## $ RealPCInc <dbl> 144.0924, 193.4293, 153.0761, 138.4076, 191.4891, 190.…

Now redo the original scatterplot using the adjusted per capita income data.

Do that

Solution

AllData2 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Year)) + geom_point()