Research Question: A recession is defined as two consecutive quarters of negative growth as measured by a country's Gross Domestic Product. What are the effects of a US recession? Specifically, do U.S. recessions have an impact on the number of births in the United States? I found a data set with US recession dates and a data set of the number of births in the US, and I plan to see if there is any correlation between the two. I will mainly focus on the birth data to determine if there are any changes related to a recession. I will look specifically at the year after a recession to see if the recession changed parents' decisions to have children.
Birth Data:
library(stringr)
library(knitr)
data <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv"))
births <- as.data.frame(data)
head(births)
## year month date_of_month day_of_week births
## 1 1994 1 1 6 8096
## 2 1994 1 2 7 7772
## 3 1994 1 3 1 10142
## 4 1994 1 4 2 11248
## 5 1994 1 5 3 11053
## 6 1994 1 6 4 11406
Recession data:
library(stringr)
reces <- read.csv("/Users/christinakasman/Desktop/Real_time_decision_rules.csv")
head(reces)
## Date.of.release Date.described index declaration our.dates NBER.dates
## 1 May-68 1967 3.8 NA NA
## 2 Aug-68 1968 1.8 NA NA
## 3 Feb-69 1968 2.3 NA NA
## 4 May-69 1968 6.3 NA NA
## 5 Nov-68 1968 1.2 NA NA
## 6 Aug-69 1969 13.0 NA NA
## hyperlink
## 1
## 2
## 3
## 4
## 5
## 6
Grouping the recession index by year:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
reces2 <- reces %>% group_by(Date.described) %>% summarise(index = sum(index))
reces2
## # A tibble: 51 x 2
## Date.described index
## <int> <dbl>
## 1 1967 3.8
## 2 1968 11.6
## 3 1969 168.4
## 4 1970 378.9
## 5 1971 41.7
## 6 1972 2.5
## 7 1973 102.6
## 8 1974 398.1
## 9 1975 143.4
## 10 1976 5.1
## # ... with 41 more rows
Need to subset for years to match birth data (1994-2002)
recesfin <- reces2[c(28:36), c(1:2)]
names(recesfin) = c("year", "index")
recesfin
## # A tibble: 9 x 2
## year index
## <int> <dbl>
## 1 1994 13.7
## 2 1995 86.9
## 3 1996 44.3
## 4 1997 16.7
## 5 1998 12.8
## 6 1999 8.8
## 7 2000 26.4
## 8 2001 204.9
## 9 2002 81.2
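Selecting rows 28 to 36 by position works only while the row order of reces2 stays fixed. A minimal sketch of the same subset chosen by year value follows. The name reces2_demo is hypothetical and holds only the yearly values printed above, not the full 51-row table.

```r
library(dplyr)

# Hypothetical stand-in for reces2, built from the values shown in the output above
reces2_demo <- tibble(
  Date.described = c(1967:1976, 1994:2002),
  index = c(3.8, 11.6, 168.4, 378.9, 41.7, 2.5, 102.6, 398.1, 143.4, 5.1,
            13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9, 81.2)
)

# Keep the years by value rather than by row position, then rename the key column
recesfin_demo <- reces2_demo %>%
  filter(Date.described >= 1994, Date.described <= 2002) %>%
  rename(year = Date.described)

recesfin_demo
```

Filtering on the year keeps the subset correct even if rows are added to or removed from the recession file.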
What are the cases, and how many are there? The cases in the birth data set are daily birth counts, identified by year, month, and day of month. There are 39,722,137 births recorded from 1994 to 2003.
sum(births$births)
## [1] 39722137
Describe the method of data collection. The data is collected by each state and the National Center for Health Statistics (NCHS) from standard collection of birth certificate forms.
What type of study is this (observational/experiment)? This is an observational study - data from 1994-2003
If not, provide a citation/link.
What is the response variable, and what type is it (numerical/categorical)? Response variable is number of births - numerical
What is the explanatory variable, and what type is it (numerical/categorical)? Explanatory variable is the recession index for each year - numerical
Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
#summary statistic for number of a births per day
summary(births$births)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 6443 8844 11620 10880 12270 14540
library(dplyr)
library(tidyr)
births$date <- paste(births$month, births$year)
births1 <- births %>% group_by(date) %>% summarise(monthlybirths = sum(as.numeric(births)))
births1[order(as.Date(births1$date, format="%M%Y")),]
## # A tibble: 120 x 2
## date monthlybirths
## <chr> <dbl>
## 1 1 1994 320705
## 2 10 1994 330172
## 3 11 1994 319397
## 4 12 1994 326748
## 5 2 1994 301327
## 6 3 1994 339736
## 7 4 1994 317392
## 8 5 1994 330295
## 9 6 1994 329737
## 10 7 1994 345862
## # ... with 110 more rows
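The rows above come out in text order (1, 10, 11, 12, 2, ...) rather than calendar order. In R date formats, %M is the minutes specifier and %m is the month, so the format string used above does not sort by month. A minimal sketch of a chronological sort follows. It assumes the same "month year" label form, and births1_demo is a hypothetical six-row stand-in that uses values from the output above.

```r
# Hypothetical stand-in for births1, using six of the rows printed above
births1_demo <- data.frame(
  date = c("1 1994", "10 1994", "11 1994", "12 1994", "2 1994", "3 1994"),
  monthlybirths = c(320705, 330172, 319397, 326748, 301327, 339736),
  stringsAsFactors = FALSE
)

# Prepend a day of month so as.Date() receives a complete date;
# %d = day, %m = month, %Y = four-digit year
parsed <- as.Date(paste(1, births1_demo$date), format = "%d %m %Y")

# Reorder the rows by the parsed dates
births1_sorted <- births1_demo[order(parsed), ]
births1_sorted
```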
To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.
qqnorm(births1$monthlybirths)
qqline(births1$monthlybirths)
#Normal distribution
hist(births1$monthlybirths, main = "Births per month from 1994-2003", xlab = "Births")
#Births per month show a normal distribution
Summary of statistics of births per month
summary(births1$monthlybirths)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 291500 319700 330600 331000 342800 364200
Grouping total births by year:
library(dplyr)
births2 <- births %>% group_by(year) %>% summarise(Totalbirths = sum(births))
births2
## # A tibble: 10 x 2
## year Totalbirths
## <int> <int>
## 1 1994 3952767
## 2 1995 3899589
## 3 1996 3891494
## 4 1997 3880894
## 5 1998 3941553
## 6 1999 3959417
## 7 2000 4058814
## 8 2001 4025933
## 9 2002 4021726
## 10 2003 4089950
Merge data together:
merge1 <- merge(recesfin, births2, by= "year")
merge1
## year index Totalbirths
## 1 1994 13.7 3952767
## 2 1995 86.9 3899589
## 3 1996 44.3 3891494
## 4 1997 16.7 3880894
## 5 1998 12.8 3941553
## 6 1999 8.8 3959417
## 7 2000 26.4 4058814
## 8 2001 204.9 4025933
## 9 2002 81.2 4021726
Births per year
library(ggplot2)
ggplot(births2, aes(year, Totalbirths, color = Totalbirths)) + geom_point(shape = 16, size = 2, show.legend = FALSE) +
theme_minimal() +
xlab("Year") +
ylab("Births")
Recession index per year
library(ggplot2)
ggplot(recesfin, aes(year, index, color = index)) + geom_point(shape = 16, size = 2, show.legend = FALSE) +
theme_minimal() +
xlab("Year") +
ylab("Recession Index")
We want to look at total births as a function of the recession index:
plot(merge1$Totalbirths ~ merge1$index)
lm1 <- lm(merge1$Totalbirths~merge1$index, data = merge1)
summary(lm1)
##
## Call:
## lm(formula = merge1$Totalbirths ~ merge1$index, data = merge1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71024 -63750 8563 16981 110028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3939261.2 29024.1 135.724 3.11e-13 ***
## merge1$index 360.8 357.0 1.011 0.346
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 64050 on 7 degrees of freedom
## Multiple R-squared: 0.1273, Adjusted R-squared: 0.002669
## F-statistic: 1.021 on 1 and 7 DF, p-value: 0.3458
cor(merge1$index, merge1$Totalbirths)
## [1] 0.3568412
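The three conditions listed earlier (linearity, nearly normal residuals, constant variability) apply to the residuals of the fitted model, so they can be checked on the model itself. The following is a rough sketch of such a check, not part of the original analysis. The names merge_demo and fit_demo are hypothetical, and the values are copied from the merged table above.

```r
# Yearly values copied from the merged table above
merge_demo <- data.frame(
  year = 1994:2002,
  index = c(13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9, 81.2),
  Totalbirths = c(3952767, 3899589, 3891494, 3880894, 3941553,
                  3959417, 4058814, 4025933, 4021726)
)

# Same model as above, written with bare column names and the data argument
fit_demo <- lm(Totalbirths ~ index, data = merge_demo)
res <- resid(fit_demo)

# (1) linearity and (3) constant variability: residuals against fitted values
plot(fitted(fit_demo), res, xlab = "Fitted births", ylab = "Residuals")
abline(h = 0, lty = 2)

# (2) nearly normal residuals: normal Q-Q plot of the residuals
qqnorm(res)
qqline(res)
```

With only nine yearly observations these plots give a rough impression at best.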
Read in the births for the year after each index year:
yearafter <- read.csv("/Users/christinakasman/Desktop/yearafter.csv")
yearafter
## Year.after index year.after.births
## 1 1994 13.7 3899589
## 2 1995 86.9 3891494
## 3 1996 44.3 3880894
## 4 1997 16.7 3941553
## 5 1998 12.8 3959417
## 6 1999 8.8 4058814
## 7 2000 26.4 4025933
## 8 2001 204.9 4021726
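A table like this can also be derived in R instead of being prepared in a separate CSV file. The sketch below assumes the yearly totals and index values shown earlier. The names births_demo, reces_demo, and yearafter_demo are hypothetical, and dplyr's lead() shifts each year's births up by one row.

```r
library(dplyr)

# Yearly birth totals and recession index values copied from the tables above
births_demo <- data.frame(
  year = 1994:2003,
  Totalbirths = c(3952767, 3899589, 3891494, 3880894, 3941553,
                  3959417, 4058814, 4025933, 4021726, 4089950)
)
reces_demo <- data.frame(
  year = 1994:2002,
  index = c(13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9, 81.2)
)

yearafter_demo <- births_demo %>%
  arrange(year) %>%
  mutate(year.after.births = lead(Totalbirths)) %>%  # births in year + 1
  inner_join(reces_demo, by = "year") %>%
  select(year, index, year.after.births)

yearafter_demo
```

This version yields nine rows because the 2002 index can be paired with 2003 births. The CSV read in above stops at 2001.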
plot(yearafter$year.after.births ~ yearafter$index)
The plot does not show a linear pattern, and it does not follow the original hypothesis (as the recession index increases, births in the following year will decrease).
cor(yearafter$year.after.births, yearafter$index)
## [1] 0.1458522
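The correlation between the recession index and births in the following year is very low at 0.15 (lower than the 0.36 correlation with births in the same year), and it is positive rather than negative. The data therefore do not support my hypothesis, and I cannot reject the null hypothesis of no relationship.
For future testing I would use a larger data sample.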
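A significance test on the year-after correlation shows how little eight yearly observations can establish. The sketch below uses base R's cor.test() on the values printed in the yearafter table. The name yearafter_demo is hypothetical, and this test was not part of the original analysis.

```r
# Values copied from the yearafter table above
yearafter_demo <- data.frame(
  index = c(13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9),
  year.after.births = c(3899589, 3891494, 3880894, 3941553,
                        3959417, 4058814, 4025933, 4021726)
)

# Pearson correlation test: estimate, p-value, and 95% confidence interval
ct <- cor.test(yearafter_demo$index, yearafter_demo$year.after.births)
ct$estimate
ct$p.value
ct$conf.int
```

A wide confidence interval here would indicate that the sample is too small to distinguish a weak relationship from none.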