Research Question: A recession is commonly defined as two consecutive quarters of negative growth in a country's Gross Domestic Product. What are the effects of a U.S. recession? Specifically, do U.S. recessions have an impact on the number of births in the United States? I found a data set with U.S. recession dates and a data set of the number of births in the U.S., and I plan to see whether the two are correlated. I will focus mainly on the birth data to determine whether there are any changes related to a recession. In particular, I will look at the year after a recession to see whether the recession changed parents' decisions to have children.

Birth Data:

library(stringr)
library(knitr)
data <- read.csv(url("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv"))
births <- as.data.frame(data)
head(births)
##   year month date_of_month day_of_week births
## 1 1994     1             1           6   8096
## 2 1994     1             2           7   7772
## 3 1994     1             3           1  10142
## 4 1994     1             4           2  11248
## 5 1994     1             5           3  11053
## 6 1994     1             6           4  11406

Recession data:

library(stringr)
reces <- read.csv("/Users/christinakasman/Desktop/Real_time_decision_rules.csv")
head(reces)
##   Date.of.release Date.described index declaration our.dates NBER.dates
## 1          May-68           1967   3.8                    NA         NA
## 2          Aug-68           1968   1.8                    NA         NA
## 3          Feb-69           1968   2.3                    NA         NA
## 4          May-69           1968   6.3                    NA         NA
## 5          Nov-68           1968   1.2                    NA         NA
## 6          Aug-69           1969  13.0                    NA         NA
##   hyperlink
## 1          
## 2          
## 3          
## 4          
## 5          
## 6

Summing the recession index by year:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
reces2 <- reces %>% group_by(Date.described) %>% summarise(index = sum(index))
reces2
## # A tibble: 51 x 2
##    Date.described index
##             <int> <dbl>
##  1           1967   3.8
##  2           1968  11.6
##  3           1969 168.4
##  4           1970 378.9
##  5           1971  41.7
##  6           1972   2.5
##  7           1973 102.6
##  8           1974 398.1
##  9           1975 143.4
## 10           1976   5.1
## # ... with 41 more rows

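Subset the recession index to the years 1994-2002 for comparison with the birth data:

# Select rows by year value rather than by row position
recesfin <- reces2 %>%
  filter(Date.described >= 1994, Date.described <= 2002) %>%
  rename(year = Date.described)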
recesfin
## # A tibble: 9 x 2
##    year index
##   <int> <dbl>
## 1  1994  13.7
## 2  1995  86.9
## 3  1996  44.3
## 4  1997  16.7
## 5  1998  12.8
## 6  1999   8.8
## 7  2000  26.4
## 8  2001 204.9
## 9  2002  81.2

What are the cases, and how many are there? Each case in the birth data set is the number of births on one calendar day, identified by year, month, and day of the month. In total, 39,722,137 births were recorded from 1994 through 2003.

sum(births$births)
## [1] 39722137

Describe the method of data collection. The data are collected by each state and the National Center for Health Statistics (NCHS) from standard birth certificate forms.

What type of study is this (observational/experiment)? This is an observational study using data from 1994-2003.

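If not, provide a citation/link. The birth data (CDC/NCHS, 1994-2003) come from the FiveThirtyEight data repository on GitHub: https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_1994-2003_CDC_NCHS.csv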

What is the response variable, and what type is it (numerical/categorical)? The response variable is the number of births (numerical).

What is the explanatory variable, and what type is it (numerical/categorical)? The explanatory variable is the yearly recession index (numerical).

Provide summary statistics relevant to your research question. For example, if you’re comparing means across groups provide means, SDs, sample sizes of each group. This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

#summary statistics for the number of births per day
summary(births$births)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6443    8844   11620   10880   12270   14540
library(dplyr)
library(tidyr)
births$date <- paste(births$month, births$year)
births1 <- births %>% group_by(date) %>% summarise(monthlybirths = sum(as.numeric(births)))
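# Caution: "%M" is the minutes specifier (month is "%m"), so this call does not
# put the months in chronological order; within 1994 the output is in character order
births1[order(as.Date(births1$date, format="%M%Y")),]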
## # A tibble: 120 x 2
##       date monthlybirths
##      <chr>         <dbl>
##  1  1 1994        320705
##  2 10 1994        330172
##  3 11 1994        319397
##  4 12 1994        326748
##  5  2 1994        301327
##  6  3 1994        339736
##  7  4 1994        317392
##  8  5 1994        330295
##  9  6 1994        329737
## 10  7 1994        345862
## # ... with 110 more rows
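
The monthly table above lists months in character order (1, 10, 11, 12, 2, ...) rather than calendar order. A minimal sketch of one way to sort such labels chronologically is shown here, assuming the same "month year" strings; the data frame `monthly` and the column `parsed` are hypothetical names, and the values are the ten 1994 rows copied from the printed output (August and September are not shown there, so they are omitted).

```r
# Monthly totals copied from the printed output (1994 rows only)
monthly <- data.frame(
  date = c("1 1994", "10 1994", "11 1994", "12 1994", "2 1994",
           "3 1994", "4 1994", "5 1994", "6 1994", "7 1994"),
  monthlybirths = c(320705, 330172, 319397, 326748, 301327,
                    339736, 317392, 330295, 329737, 345862),
  stringsAsFactors = FALSE
)

# Prepend a day so each string is a complete date,
# and parse with %m (month) instead of %M (minute)
monthly$parsed <- as.Date(paste(1, monthly$date), format = "%d %m %Y")

# Reorder the rows by the parsed date
monthly_sorted <- monthly[order(monthly$parsed), ]
monthly_sorted
```

Grouping by the numeric `year` and `month` columns directly would avoid string parsing altogether; the date-based version is shown only because it stays closest to the original call.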

To assess whether the linear model is reliable, we need to check for (1) linearity, (2) nearly normal residuals, and (3) constant variability.

qqnorm(births1$monthlybirths)
qqline(births1$monthlybirths)

#Histogram of monthly births
hist(births1$monthlybirths, main = "Births per month from 1994-2003", xlab = "Births")

#Births per month appear approximately normally distributed

Summary statistics of births per month:

summary(births1$monthlybirths)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  291500  319700  330600  331000  342800  364200

Grouping total births by year:

library(dplyr)
births2 <- births %>% group_by(year) %>% summarise(Totalbirths = sum(births))
births2
## # A tibble: 10 x 2
##     year Totalbirths
##    <int>       <int>
##  1  1994     3952767
##  2  1995     3899589
##  3  1996     3891494
##  4  1997     3880894
##  5  1998     3941553
##  6  1999     3959417
##  7  2000     4058814
##  8  2001     4025933
##  9  2002     4021726
## 10  2003     4089950

Merge data together:

merge1 <- merge(recesfin, births2, by= "year")
merge1
##   year index Totalbirths
## 1 1994  13.7     3952767
## 2 1995  86.9     3899589
## 3 1996  44.3     3891494
## 4 1997  16.7     3880894
## 5 1998  12.8     3941553
## 6 1999   8.8     3959417
## 7 2000  26.4     4058814
## 8 2001 204.9     4025933
## 9 2002  81.2     4021726

Births per year

library(ggplot2)
ggplot(births2, aes(year, Totalbirths, color = Totalbirths)) + geom_point(shape = 16, size = 2, show.legend = FALSE) +
  theme_minimal() +
  xlab("Year") +
  ylab("Births") 

Recession index per year

library(ggplot2)
ggplot(recesfin, aes(year, index, color = index)) + geom_point(shape = 16, size = 2, show.legend = FALSE)  +
  theme_minimal() +
  xlab("Year") +
  ylab("Recession Index") 

We want to look at total births as a function of the recession index:

plot(Totalbirths ~ index, data = merge1)

# Refer to columns by name and let the data argument supply them
lm1 <- lm(Totalbirths ~ index, data = merge1)
summary(lm1)
## 
## Call:
## lm(formula = Totalbirths ~ index, data = merge1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71024 -63750   8563  16981 110028 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 3939261.2    29024.1 135.724 3.11e-13 ***
## index           360.8      357.0   1.011    0.346    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 64050 on 7 degrees of freedom
## Multiple R-squared:  0.1273, Adjusted R-squared:  0.002669 
## F-statistic: 1.021 on 1 and 7 DF,  p-value: 0.3458
cor(merge1$index, merge1$Totalbirths)
## [1] 0.3568412
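
The conditions named earlier in this report (linearity, nearly normal residuals, constant variability) apply to the residuals of the fitted model rather than to the raw birth counts. The following is a minimal sketch of how those checks might be run on the same-year regression. It rebuilds the merged table from the values printed above so that it runs on its own; the object name `fit` is hypothetical, and with only nine observations the plots can suggest problems but cannot confirm that the conditions hold.

```r
# Yearly values copied from the merged table printed in this report
merge1 <- data.frame(
  year = 1994:2002,
  index = c(13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9, 81.2),
  Totalbirths = c(3952767, 3899589, 3891494, 3880894, 3941553,
                  3959417, 4058814, 4025933, 4021726)
)

# Same model as in the report: total births as a function of the recession index
fit <- lm(Totalbirths ~ index, data = merge1)

# (1) Linearity and (3) constant variability:
# residuals should scatter evenly around zero with no curve or funnel shape
plot(fitted(fit), resid(fit),
     xlab = "Fitted births", ylab = "Residuals",
     main = "Residuals vs fitted")
abline(h = 0, lty = 2)

# (2) Nearly normal residuals: points should fall close to the reference line
qqnorm(resid(fit))
qqline(resid(fit))
```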

Read in the data pairing each year's recession index with births in the following year:

yearafter <- read.csv("/Users/christinakasman/Desktop/yearafter.csv")
yearafter
##   Year.after index year.after.births
## 1       1994  13.7           3899589
## 2       1995  86.9           3891494
## 3       1996  44.3           3880894
## 4       1997  16.7           3941553
## 5       1998  12.8           3959417
## 6       1999   8.8           4058814
## 7       2000  26.4           4025933
## 8       2001 204.9           4021726
plot(yearafter$year.after.births ~ yearafter$index)

The plot shows no clear linear relationship. It does not support the original hypothesis that births in the following year decrease as the recession index increases.

cor(yearafter$year.after.births, yearafter$index)
## [1] 0.1458522
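
The year-after table above was read from a separately prepared CSV file. As a hedged alternative, the same pairing can be derived in R by shifting the yearly birth totals up by one row. This is only a sketch: `yearafter_sketch` and `r_after` are hypothetical names, and the input values are copied from the merged table printed earlier in this report.

```r
# Yearly values copied from the merged table printed in this report (1994-2002)
merge1 <- data.frame(
  year = 1994:2002,
  index = c(13.7, 86.9, 44.3, 16.7, 12.8, 8.8, 26.4, 204.9, 81.2),
  Totalbirths = c(3952767, 3899589, 3891494, 3880894, 3941553,
                  3959417, 4058814, 4025933, 4021726)
)

# Pair each year's index with the NEXT year's births:
# drop the last index row and the first births row
n <- nrow(merge1)
yearafter_sketch <- data.frame(
  year = merge1$year[-n],
  index = merge1$index[-n],
  year.after.births = merge1$Totalbirths[-1]
)
yearafter_sketch

# Correlation between the index and the following year's births
r_after <- cor(yearafter_sketch$year.after.births, yearafter_sketch$index)
r_after

# cor.test() additionally reports a p-value and a confidence interval
cor.test(yearafter_sketch$year.after.births, yearafter_sketch$index)
```

Deriving the lagged column in code keeps the pairing reproducible and avoids a manual copy step where rows could be shifted incorrectly.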

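The correlation between the recession index and births in the following year is very low at 0.15, lower than the same-year correlation of 0.36, and it is positive rather than negative. The same-year regression slope is also not statistically significant (p = 0.346). The data therefore do not support my hypothesis that births decrease after a recession.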

For future testing I would use a larger data sample, since this analysis rests on only eight or nine yearly observations.