library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
library(stringr)
Nat0306 <- read_delim("~/Dropbox/RProjects/Fertility Study/Nat0306.txt","\t", escape_double = FALSE)
## Parsed with column specification:
## cols(
## Notes = col_character(),
## State = col_character(),
## StateCode = col_character(),
## Age = col_character(),
## AgeCode = col_character(),
## Year = col_integer(),
## YearCode = col_integer(),
## Births = col_integer(),
## FPop = col_character(),
## Rate = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 66 parsing failures.
## # A tibble: 5 x 5
##     row col   expected   actual
##   <int> <chr> <chr>      <chr>
## 1  1637 <NA>  10 columns 1 columns
## 2  1638 <NA>  10 columns 1 columns
## 3  1639 <NA>  10 columns 1 columns
## 4  1640 <NA>  10 columns 1 columns
## 5  1641 <NA>  10 columns 1 columns
## # ... with 1 more variable: file <chr>
## See problems(...) for more details.
Nat0715 <- read_delim("~/Dropbox/RProjects/Fertility Study/Nat0715.txt","\t", escape_double = FALSE)
## Parsed with column specification:
## cols(
## Notes = col_character(),
## State = col_character(),
## StateCode = col_character(),
## Year = col_integer(),
## YearCode = col_integer(),
## Age = col_character(),
## AgeCode = col_character(),
## Births = col_integer(),
## FPop = col_character(),
## Rate = col_character()
## )
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 69 parsing failures.
## # A tibble: 5 x 5
##     row col   expected   actual
##   <int> <chr> <chr>      <chr>
## 1  3713 <NA>  10 columns 1 columns
## 2  3714 <NA>  10 columns 1 columns
## 3  3715 <NA>  10 columns 1 columns
## 4  3716 <NA>  10 columns 1 columns
## 5  3717 <NA>  10 columns 1 columns
## # ... with 1 more variable: file <chr>
## See problems(...) for more details.
Look at the data.
Select the variables we want. Convert the character variable Rate to numeric and eliminate the rows where it is NA. Then combine the two dataframes using rbind.
We could probably do the rbind first and save a few lines of typing, but pre-processing the two files separately avoids problems caused by differences between them, since they came from different queries.
Nat0306 %>%
select(State,Year,Age,Rate) %>%
mutate(Rate = as.numeric(Rate)) %>%
filter(!is.na(Rate)) -> Nat0306
## Warning in evalq(as.numeric(Rate), <environment>): NAs introduced by
## coercion
glimpse(Nat0306)
## Observations: 1,224
## Variables: 4
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "...
## $ Year <int> 2003, 2004, 2005, 2006, 2003, 2004, 2005, 2006, 2003, 20...
## $ Age <chr> "15-19 years", "15-19 years", "15-19 years", "15-19 year...
## $ Rate <dbl> 51.40, 51.02, 48.06, 51.83, 113.67, 113.48, 116.90, 121....
Nat0715 %>%
select(State,Year,Age,Rate) %>%
mutate(Rate = as.numeric(Rate)) %>%
filter(!is.na(Rate)) -> Nat0715
## Warning in evalq(as.numeric(Rate), <environment>): NAs introduced by
## coercion
glimpse(Nat0715)
## Observations: 2,754
## Variables: 4
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "...
## $ Year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2008, 2008, 2008, 20...
## $ Age <chr> "15-19 years", "20-24 years", "25-29 years", "30-34 year...
## $ Rate <dbl> 52.07, 123.78, 117.79, 78.00, 31.43, 5.51, 50.53, 121.63...
both = rbind(Nat0306,Nat0715)
summary(both)
## State Year Age Rate
## Length:3978 Min. :2003 Length:3978 Min. : 3.93
## Class :character 1st Qu.:2006 Class :character 1st Qu.: 30.56
## Mode :character Median :2009 Mode :character Median : 60.91
## Mean :2009 Mean : 65.96
## 3rd Qu.:2012 3rd Qu.:102.77
## Max. :2015 Max. :184.05
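As an aside, bind_rows() from dplyr is an alternative to rbind() that matches columns by name rather than by position, which guards against column-order differences like the ones visible in the two raw imports above. A hedged equivalent of the step just shown:
# Same result as rbind() here, but columns are matched by name:
both <- bind_rows(Nat0306, Nat0715)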
Now create a new dataframe with one row for each combination of State and Year, containing the variable TFR (total fertility rate). Each age-specific rate is births per 1,000 women in a five-year age group, so summing the rates, multiplying by 5, and dividing by 1,000 gives the expected number of births per woman if she experienced the current age-specific rates throughout her childbearing years.
both %>%
group_by(State,Year) %>%
summarize(TFR = 5*sum(Rate)/1000) %>%
ungroup() -> SYTFR
glimpse(SYTFR)
## Observations: 663
## Variables: 3
## $ State <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama", "...
## $ Year <int> 2003, 2004, 2005, 2006, 2007, 2008, 2009, 2010, 2011, 20...
## $ TFR <dbl> 1.92205, 1.91695, 1.93785, 2.00555, 2.04290, 2.02050, 1....
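As a quick hedged sanity check on the formula (not part of the original analysis), recompute the TFR by hand for one state-year and compare it with the corresponding row of SYTFR; Alabama in 2003 should come out at about 1.92:
# Recompute Alabama's 2003 TFR directly from the age-specific rates
both %>%
  filter(State == "Alabama", Year == 2003) %>%
  summarize(TFR = 5 * sum(Rate) / 1000)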
Let’s look at the results. First try side-by-side boxplots to look at the changes over time. Note that we have to cast Year as a factor to get our boxplots.
SYTFR %>%
ggplot(aes(x = factor(Year),y=TFR)) + geom_boxplot()
There are some distinct groups of years. For analysis purposes, we will formalize this grouping into periods below using case_when(). A scatterplot of TFR against Year shows the same pattern.
SYTFR %>% ggplot(aes(x= Year, y = TFR)) + geom_point()
The Great Recession likely contributed to the drop in the birth rate that begins around 2008.
The next step in our analysis is to introduce per capita personal income data from the Bureau of Economic Analysis. This is available in the file CA1.csv in Moodle. Import this data.
CA1 <- read_csv("~/Dropbox/RProjects/Fertility Study/CA1.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## GeoFIPS = col_integer(),
## Region = col_integer(),
## LineCode = col_integer()
## )
## See spec(...) for full column specifications.
## Warning in rbind(names(probs), probs_f): number of columns of result is not
## a multiple of vector length (arg 1)
## Warning: 4 parsing failures.
## # A tibble: 4 x 5
##     row col     expected
##   <int> <chr>   <chr>
## 1  9595 GeoFIPS an integer
## 2  9596 GeoFIPS an integer
## 3  9597 GeoFIPS an integer
## 4  9598 GeoFIPS an integer
## # ... with 2 more variables: actual <chr>, file <chr>
Looking at the results of the import process, we can see that there are some invalid records at the bottom of the file; these can be eliminated by requiring that GeoFIPS be numeric. We also want to keep only the rows for statewide values, which can be identified by GeoName values containing the string “state”. The income data we want has LineCode equal to 3. In our output, we keep only the name of the state and the income data in columns 8 through 54.
CA2 <- CA1 %>%
filter(LineCode == 3, str_detect(GeoName,"state")) %>%
select(GeoName,8:54) %>%
rename(State = GeoName)
Use View() to inspect CA2.
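A quick sanity check, assuming the BEA file covers the 50 states plus the District of Columbia and that columns 8 through 54 hold 47 annual income columns: CA2 should have 51 rows and 48 columns.
# Expect 51 rows (50 states plus DC) and 48 columns (State plus 47 years)
dim(CA2)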
Now we need to remove the extra text from the variable State. Every row contains the string " state total" and some rows also have an asterisk. These are removed in separate steps. The asterisk needs two leading backslashes because it has special significance in regular expressions; this is called “escaping.”
CA3 <- CA2 %>%
mutate(State = gsub(" state total","",State)) %>%
mutate(State = gsub("\\*","",State))
Use View() to check the result.
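stringr (already loaded) offers an alternative to gsub(); a hedged equivalent in which fixed() treats the asterisk as a literal character, so no backslash escaping is needed (CA3_alt is just an illustrative name):
# Same cleanup using stringr; fixed("*") avoids regular-expression escaping
CA3_alt <- CA2 %>%
  mutate(State = str_replace(State, " state total", ""),
         State = str_replace(State, fixed("*"), ""))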
Now we need to use the gather() function from tidyr to move the year column names into a column named Year, with the income values from those columns in a single column named PCInc. In addition, we need to make Year and PCInc numeric. We only want data for the years 2003 through 2015; since the file ends at 2015, filtering on Year >= 2003 is enough.
Look at the documentation on gather() in Chapter 12 of R for Data Science (R4DS).
CA4 <- CA3 %>%
  gather(Year, PCInc, -1) %>%
  mutate(Year = as.numeric(Year),
         PCInc = as.numeric(PCInc)) %>%
  filter(Year >= 2003)
glimpse(CA4)
## Observations: 663
## Variables: 3
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"...
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 38187, ...
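In tidyr 1.0 and later, pivot_longer() supersedes gather(). A hedged equivalent of the reshaping step above (CA4_alt is just an illustrative name):
# Reshape year columns into Year/PCInc pairs with pivot_longer()
CA4_alt <- CA3 %>%
  pivot_longer(-State, names_to = "Year", values_to = "PCInc") %>%
  mutate(Year = as.numeric(Year),
         PCInc = as.numeric(PCInc)) %>%
  filter(Year >= 2003)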
Now we can join the fertility data with the income data and examine the relationship between the two.
AllData = inner_join(CA4,SYTFR)
## Joining, by = c("State", "Year")
glimpse(AllData)
## Observations: 663
## Variables: 4
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "California"...
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 38187, ...
## $ TFR <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.14480, 1....
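Because inner_join() silently drops non-matching rows, a hedged check that nothing was lost is to look for fertility rows with no income match; this should return zero rows.
# Fertility observations with no matching income observation (expect none)
anti_join(SYTFR, CA4, by = c("State", "Year"))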
We can now look at the relationship between per capita personal income and the TFR.
AllData %>% ggplot(aes(x=PCInc,y=TFR)) + geom_point()
The plot suggests something counterintuitive: higher income is associated with reduced fertility. This clashes with our explanation for the decline in fertility beginning with the Great Recession.
Let’s add time to this graphic by mapping color to Year.
AllData %>% ggplot(aes(x=PCInc,y=TFR,color=Year)) + geom_point()
Since nominal incomes rise over time, there may be some correlation between time and our income variable. Review data on real personal income and the CPI at FRED (https://fred.stlouisfed.org/).
Let’s import the CPI data downloaded from FRED.
CPIData <- read_csv("~/Dropbox/RProjects/Fertility Study/CPIData.csv")
## Parsed with column specification:
## cols(
## Year = col_integer(),
## CPI = col_double()
## )
Now we need to join CPIData with AllData and adjust the per capita income data for inflation.
AllData2 = inner_join(AllData,CPIData)
## Joining, by = "Year"
AllData2 %>% mutate(RealPCInc = PCInc/CPI) -> AllData2
glimpse(AllData2)
## Observations: 663
## Variables: 6
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Califor...
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003...
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 381...
## $ TFR <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.14480...
## $ CPI <dbl> 184, 184, 184, 184, 184, 184, 184, 184, 184, 184, 18...
## $ RealPCInc <dbl> 144.0924, 193.4293, 153.0761, 138.4076, 191.4891, 19...
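A side note: dividing by the CPI gives income per index point rather than dollars. Multiplying by 100 would express it in constant base-period dollars, assuming the CPI series is indexed to 100 in its base period; either way, the rescaling does not change the shape of any relationship we plot or model.
# Optional rescaling to constant (base-period) dollars; purely cosmetic here
AllData2 %>% mutate(RealPCInc_dollars = PCInc / CPI * 100) %>% glimpse()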
Now redo the original scatterplot using the adjusted per capita income data.
AllData2 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Year)) + geom_point()
There is still a counterintuitive relationship. How can we explain this?
Let’s repeat the previous graph with a facet by year.
AllData2 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Year)) + geom_point() + geom_smooth() + facet_wrap(~Year)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Looking back at the side-by-side boxplot, recall the grouping we saw there.
We can use the case_when() function to create a variable that captures this grouping.
AllData3 <- AllData2 %>%
mutate(Period = case_when(
.$Year < 2006 ~ "Early",
.$Year < 2009 ~ "Peak",
.$Year < 2012 ~ "Decline",
TRUE ~ "Late"),
Period = factor(Period,levels=c("Early","Peak","Decline","Late"))
)
str(AllData3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 663 obs. of 7 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Year : num 2003 2003 2003 2003 2003 ...
## $ PCInc : num 26513 35591 28166 25467 35234 ...
## $ TFR : num 1.92 2.28 2.4 2.05 2.12 ...
## $ CPI : num 184 184 184 184 184 184 184 184 184 184 ...
## $ RealPCInc: num 144 193 153 138 191 ...
## $ Period : Factor w/ 4 levels "Early","Peak",..: 1 1 1 1 1 1 1 1 1 1 ...
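In current versions of dplyr the .$ prefix is no longer needed inside mutate(); a hedged equivalent of the Period construction (AllData3_alt is just an illustrative name):
# Same Period variable without the .$ idiom
AllData3_alt <- AllData2 %>%
  mutate(Period = case_when(
           Year < 2006 ~ "Early",
           Year < 2009 ~ "Peak",
           Year < 2012 ~ "Decline",
           TRUE ~ "Late"),
         Period = factor(Period, levels = c("Early", "Peak", "Decline", "Late")))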
Now we can repeat the plot above with these groupings. We can also do a simple smoothing plot using the new variable to distinguish among periods.
AllData3 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Period)) + geom_point() +
geom_smooth() + facet_wrap(~Period)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Here is another look at this without a facet or the points.
AllData3 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Period)) +
geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Turning off the confidence interval shading might be better.
AllData3 %>% ggplot(aes(x=RealPCInc,y=TFR,color=Period)) +
geom_smooth(level=0.0)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
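A more direct way to suppress the ribbon is se = FALSE, which should produce the same curves:
# Suppress the confidence ribbon explicitly rather than via level = 0
AllData3 %>% ggplot(aes(x = RealPCInc, y = TFR, color = Period)) +
  geom_smooth(se = FALSE)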
The impact of the peak years is clear. There is some separation of the curves at lower levels of RealPCInc, but the separation is weaker at higher levels of RealPCInc.
Let’s try a regression model that includes the factor Period. R will automatically create dummy variables for the Period factor.
FertMod = lm(TFR~RealPCInc + Period,data = AllData3)
summary(FertMod)
##
## Call:
## lm(formula = TFR ~ RealPCInc + Period, data = AllData3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.39463 -0.10693 -0.02544 0.09289 0.53061
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.4110913 0.0401444 60.061 < 2e-16 ***
## RealPCInc -0.0021371 0.0002127 -10.047 < 2e-16 ***
## PeriodPeak 0.0802308 0.0199591 4.020 6.50e-05 ***
## PeriodDecline -0.0684196 0.0199052 -3.437 0.000625 ***
## PeriodLate -0.1215957 0.0188969 -6.435 2.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1735 on 658 degrees of freedom
## Multiple R-squared: 0.2809, Adjusted R-squared: 0.2765
## F-statistic: 64.26 on 4 and 658 DF, p-value: < 2.2e-16
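To see the dummy variables R builds for the factor Period, with “Early” as the baseline level absorbed by the intercept, inspect the design matrix; a quick sketch:
# First rows of the design matrix: intercept, RealPCInc, and three Period dummies
head(model.matrix(~ RealPCInc + Period, data = AllData3))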
Let’s look at a few specific states in specific years.
str(AllData3)
## Classes 'tbl_df', 'tbl' and 'data.frame': 663 obs. of 7 variables:
## $ State : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
## $ Year : num 2003 2003 2003 2003 2003 ...
## $ PCInc : num 26513 35591 28166 25467 35234 ...
## $ TFR : num 1.92 2.28 2.4 2.05 2.12 ...
## $ CPI : num 184 184 184 184 184 184 184 184 184 184 ...
## $ RealPCInc: num 144 193 153 138 191 ...
## $ Period : Factor w/ 4 levels "Early","Peak",..: 1 1 1 1 1 1 1 1 1 1 ...
AllData3 %>%
filter(State %in% c("Connecticut","Utah","Washington","Alabama"),
Year %in% c(2003,2008,2015)) %>%
select(State,Year, RealPCInc,TFR) %>%
arrange(State,Year)
## # A tibble: 12 x 4
## State Year RealPCInc TFR
## <chr> <dbl> <dbl> <dbl>
## 1 Alabama 2003 144.0924 1.92205
## 2 Alabama 2008 155.3049 2.02050
## 3 Alabama 2015 160.4729 1.83050
## 4 Connecticut 2003 247.1087 1.90130
## 5 Connecticut 2008 284.4639 1.84720
## 6 Connecticut 2015 289.9062 1.60500
## 7 Utah 2003 139.4402 2.62225
## 8 Utah 2008 157.6370 2.64900
## 9 Utah 2015 165.8656 2.28530
## 10 Washington 2003 187.3424 1.91165
## 11 Washington 2008 208.0983 2.02355
## 12 Washington 2015 218.9909 1.81110
The striking fact is that the different states have very different numbers for both of these variables. Moreover, the differences persist over time.
To make the numbers more comparable and more suitable for model building, we frequently recast them in terms of percentage change from a base period. Here the base period is 2003. We will carry this out in distinct steps.
Starting with AllData3, create a dataframe baseValues with the following variables: State, B_TFR (the value of TFR in 2003), and B_RealPCInc (the value of RealPCInc in 2003).
AllData3 %>%
filter(Year == 2003) %>%
mutate(B_TFR = TFR, B_RealPCInc = RealPCInc) %>%
select(State, B_TFR, B_RealPCInc) -> baseValues
glimpse(baseValues)
## Observations: 51
## Variables: 3
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif...
## $ B_TFR <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.144...
## $ B_RealPCInc <dbl> 144.0924, 193.4293, 153.0761, 138.4076, 191.4891, ...
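A hedged shortcut: select() can rename on the fly, so the mutate() step could be folded in (baseValues_alt is just an illustrative name for the same result):
# Filter to 2003 and rename while selecting
baseValues_alt <- AllData3 %>%
  filter(Year == 2003) %>%
  select(State, B_TFR = TFR, B_RealPCInc = RealPCInc)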
Join the baseValues table to AllData3, producing AllData4. Glimpse AllData4 to make sure it looks OK.
AllData4 = inner_join(AllData3,baseValues)
## Joining, by = "State"
glimpse(AllData4)
## Observations: 663
## Variables: 9
## $ State <chr> "Alabama", "Alaska", "Arizona", "Arkansas", "Calif...
## $ Year <dbl> 2003, 2003, 2003, 2003, 2003, 2003, 2003, 2003, 20...
## $ PCInc <dbl> 26513, 35591, 28166, 25467, 35234, 35132, 45468, 3...
## $ TFR <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.144...
## $ CPI <dbl> 184, 184, 184, 184, 184, 184, 184, 184, 184, 184, ...
## $ RealPCInc <dbl> 144.0924, 193.4293, 153.0761, 138.4076, 191.4891, ...
## $ Period <fctr> Early, Early, Early, Early, Early, Early, Early, ...
## $ B_TFR <dbl> 1.92205, 2.28325, 2.39805, 2.04710, 2.11665, 2.144...
## $ B_RealPCInc <dbl> 144.0924, 193.4293, 153.0761, 138.4076, 191.4891, ...
In AllData4, create PD_RealPCInc and PD_TFR. These are the differences from the base values expressed as proportions of the base (multiply by 100 to read them as percentages).
AllData4 %>%
mutate(PD_RealPCInc = (RealPCInc - B_RealPCInc)/B_RealPCInc,
PD_TFR = (TFR - B_TFR)/B_TFR) -> AllData4
Examine the results graphically. Your choice!
AllData4 %>% ggplot(aes(x=Year,y=PD_TFR)) + geom_point()
AllData4 %>% ggplot(aes(x=Year,y=PD_RealPCInc)) + geom_point()
AllData4 %>% ggplot(aes(x=RealPCInc,y=PD_TFR)) + geom_point()
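One more view worth trying, a hedged suggestion using only variables already in AllData4: plot the two difference series against each other, colored by Period.
AllData4 %>% ggplot(aes(x = PD_RealPCInc, y = PD_TFR, color = Period)) + geom_point()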