Part 1 - Introduction

Question: What day of the week do parents prefer to give birth?

A FiveThirtyEight article reported that some parents do not want to have a child on a Friday, fearing that the birthday would fall on Friday the 13th. It makes one wonder whether people really are that superstitious. Would it really be bad to give birth on Friday the 13th?

Part 2 - Data

Cases: What are the cases? (Remember: case = units of observation or units of experiment)

There are 5479 cases in the data. Each case is one calendar day of U.S. births between 2000 and 2014, with the number of births recorded for that day. The FiveThirtyEight article mentioned that some people induce their labor to make sure that they avoid giving birth on a particular day.
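As a quick sanity check on the number of cases, the calendar days in the period can be counted in base R. This is a minimal sketch that assumes the data has exactly one row per calendar day, which the source does not state explicitly.

```r
# Count the calendar days from 2000-01-01 through 2014-12-31 (inclusive).
# Assumption: the dataset has one row per calendar day.
days <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
n_days <- length(days)
n_days  # 5479, the number of observations reported by str(usbirths)
```

Fifteen years of 365 days plus four leap days (2000, 2004, 2008, 2012) gives the same figure.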

Data collection: Describe how the data were collected.

The dataset contains U.S. births data for the years 2000 to 2014, as provided by the Social Security Administration.

Type of study: What is the type of study, observational or an experiment? Explain how you’ve arrived at your conclusion using information on the sampling and/or experimental design.

This study is observational because it is based on existing records collected by the Social Security Administration; no treatment was assigned to any group.

Scope of inference - generalizability: Identify the population of interest, and whether the findings from this analysis can be generalized to that population, or, if not, a subsection of that population. Explain why or why not. Also discuss any potential sources of bias that might prevent generalizability.

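The population of interest is parents giving birth in the United States. The data cover daily U.S. births from 2000 to 2014, so the findings can be generalized to U.S. births in that period, but not necessarily to other countries or other years. One potential source of bias is that daily counts cannot separate parents’ choices from other influences on timing, such as when induced deliveries happen to be scheduled. The results are likely to interest first-time parents or parents who are about to have a child.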

Dependent Variable

The dependent variable is births, the number of births on a given day. It is numeric.

Independent Variable

You should have two independent variables, one quantitative and one qualitative. The independent variables are the dates on which people give birth and whether or not people avoid giving birth on Friday the 13th.

Part 3 - Exploratory data analysis: Perform relevant descriptive statistics, including summary statistics and visualization of the data. Also address what the exploratory data analysis suggests about your research question.

str(usbirths)
## 'data.frame':    5479 obs. of  5 variables:
##  $ year         : int  2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
##  $ month        : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ date_of_month: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ day_of_week  : int  6 7 1 2 3 4 5 6 7 1 ...
##  $ births       : int  9083 8006 11363 13032 12558 12466 12516 8934 7949 11668 ...
summary(usbirths)
##       year          month        date_of_month    day_of_week     births     
##  Min.   :2000   Min.   : 1.000   Min.   : 1.00   Min.   :1    Min.   : 5728  
##  1st Qu.:2003   1st Qu.: 4.000   1st Qu.: 8.00   1st Qu.:2    1st Qu.: 8740  
##  Median :2007   Median : 7.000   Median :16.00   Median :4    Median :12343  
##  Mean   :2007   Mean   : 6.523   Mean   :15.73   Mean   :4    Mean   :11350  
##  3rd Qu.:2011   3rd Qu.:10.000   3rd Qu.:23.00   3rd Qu.:6    3rd Qu.:13082  
##  Max.   :2014   Max.   :12.000   Max.   :31.00   Max.   :7    Max.   :16081
summary(usbirths$day_of_week)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       1       2       4       4       6       7
require(tidyverse)
## Loading required package: tidyverse
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1     v purrr   0.3.3
## v tibble  2.1.3     v dplyr   0.8.3
## v tidyr   1.0.0     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
us_births1 <- usbirths %>% 
  unite(col=yeardate,year,month,date_of_month, sep = "-" )
head(us_births1)
##   yeardate day_of_week births
## 1 2000-1-1           6   9083
## 2 2000-1-2           7   8006
## 3 2000-1-3           1  11363
## 4 2000-1-4           2  13032
## 5 2000-1-5           3  12558
## 6 2000-1-6           4  12466
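The day_of_week codes are not labelled in the data, so it helps to check which weekday each code stands for. The sketch below takes the first three dates shown by head() and computes their weekday numbers in base R; the variable names are illustrative and not part of the original analysis.

```r
# First three dates from head(us_births1), written with zero-padded months and days.
first_dates <- as.Date(c("2000-01-01", "2000-01-02", "2000-01-03"))

# as.POSIXlt()$wday counts from Sunday: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(first_dates)$wday

# Recode so that 1 = Monday, ..., 7 = Sunday.
iso_weekday <- ifelse(wday == 0, 7, wday)
iso_weekday  # 6 7 1
```

These values match the day_of_week column for the same rows (6, 7, 1), which suggests that the data codes Monday as 1 and Sunday as 7.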

Sum of Births based on Day of the Week

us_births1 %>%
  group_by(day_of_week)%>%
  summarise(Sum_of_Births = sum(births))  
## # A tibble: 7 x 2
##   day_of_week Sum_of_Births
##         <int>         <int>
## 1           1       9316001
## 2           2      10274874
## 3           3      10109130
## 4           4      10045436
## 5           5       9850199
## 6           6       6704495
## 7           7       5886889
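To make the totals easier to compare, they can be turned into percentage shares. This is a sketch that copies the seven sums from the table above; the Mon to Sun labels are an assumption based on 2000-01-01, a Saturday, being coded as day 6.

```r
# Sums copied from the table above (days 1 to 7).
# Assumption: day 1 = Monday, ..., day 7 = Sunday.
sum_of_births <- c(Mon = 9316001, Tue = 10274874, Wed = 10109130,
                   Thu = 10045436, Fri = 9850199, Sat = 6704495, Sun = 5886889)

# Percentage of all births that fall on each day of the week.
share_pct <- round(100 * sum_of_births / sum(sum_of_births), 1)
share_pct

# Day of the week with the smallest total.
lowest_day <- names(which.min(sum_of_births))
lowest_day
```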
# geom_col() stacks the daily counts, so each bar's height is the total for that day of the week
ggplot(us_births1, aes(fill=factor(day_of_week), y=births, x=factor(day_of_week))) + 
    geom_col() +
    ggtitle("Day of Week vs. Births")

Sum of Births based on Date of the Month

usbirths %>%
  group_by(date_of_month) %>%
  summarise(Sum_of_Births = sum(births))  
## # A tibble: 31 x 2
##    date_of_month Sum_of_Births
##            <int>         <int>
##  1             1       2003627
##  2             2       2030447
##  3             3       2042441
##  4             4       2004785
##  5             5       2036185
##  6             6       2037729
##  7             7       2063416
##  8             8       2061652
##  9             9       2044600
## 10            10       2066154
## # ... with 21 more rows
# geom_col() stacks the daily counts, so each bar's height is the total for that date of the month
ggplot(usbirths, aes(fill=date_of_month, y=births, x=date_of_month)) + 
    geom_col() +
    ggtitle("Date of Month vs. Births")

plot(usbirths$day_of_week, usbirths$births)

qqnorm(usbirths$births)
qqline(usbirths$births)

Part 4 - Inference: If your data fails some conditions and you can’t use a theoretical method, then you should use simulation. If you can use both methods, then you should use both methods. It is your responsibility to figure out the appropriate methodology.

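Null hypothesis: The day of the week does not matter; the number of births is not associated with the day of the week.

Alternative hypothesis: The day of the week matters; the number of births is associated with the day of the week.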

Correlation of births and day of the week

cor(usbirths$day_of_week, usbirths$births)
## [1] -0.6934876
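  • Births and day of the week have a negative correlation (about -0.69): the daily number of births tends to fall as the day code rises from 1 to 7.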
usb <- lm(births ~ day_of_week, data = usbirths)
summary(usb)
## 
## Call:
## lm(formula = births ~ day_of_week, data = usbirths)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7359.7 -1328.3   -97.4  1347.3  5011.3 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14574.97      50.62  287.95   <2e-16 ***
## day_of_week  -806.26      11.32  -71.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1676 on 5477 degrees of freedom
## Multiple R-squared:  0.4809, Adjusted R-squared:  0.4808 
## F-statistic:  5074 on 1 and 5477 DF,  p-value: < 2.2e-16
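Two numbers in this output can be checked by hand. With a single predictor, the multiple R-squared is the square of the correlation coefficient, and the fitted line gives a predicted number of births for each day code. The sketch below uses only values printed above; the variable names are illustrative.

```r
# Correlation reported by cor() above.
r <- -0.6934876
r_squared <- r^2
round(r_squared, 4)  # 0.4809, the Multiple R-squared in the summary

# Coefficients copied from the summary(usb) output.
intercept <- 14574.97
slope <- -806.26

# Predicted births for each day code under the fitted straight line.
day <- 1:7
predicted <- intercept + slope * day
data.frame(day_of_week = day, predicted_births = predicted)
```

The line predicts a steady decline of about 806 births per day code, whereas the totals by day of the week are fairly level for days 1 to 5 and drop for days 6 and 7, so a straight line is only a rough description of the pattern.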
plot( usbirths$births~usbirths$day_of_week)
abline(usb)

plot(usb$residuals ~ usb$fitted.values)  # residuals against fitted values rather than the response
abline(h = 0, lty = 3)  # adds a horizontal dotted line at y = 0
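Because day_of_week is a category rather than a quantity, a complementary check is a chi-square goodness-of-fit test on the totals by day. This is a sketch under stated assumptions, not part of the original analysis: it copies the seven sums from Part 3, assumes one row per calendar day with Monday coded as 1, and treats births as independent draws that would be spread in proportion to how often each weekday occurs in the period.

```r
# Total births for day codes 1 to 7, copied from the Part 3 table.
sum_of_births <- c(9316001, 10274874, 10109130, 10045436,
                   9850199, 6704495, 5886889)

# Count how many times each weekday occurs from 2000-01-01 to 2014-12-31.
days <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
wday <- as.POSIXlt(days)$wday              # 0 = Sunday, ..., 6 = Saturday
iso_weekday <- ifelse(wday == 0, 7, wday)  # 1 = Monday, ..., 7 = Sunday
days_per_weekday <- as.vector(table(iso_weekday))  # ordered 1 to 7

# Goodness-of-fit test: expected births proportional to the number of each weekday.
gof <- chisq.test(x = sum_of_births,
                  p = days_per_weekday / sum(days_per_weekday))
gof$statistic
gof$p.value
```

With roughly 62 million births, even small departures from the expected split produce a very small p-value, so the size of the differences between days matters more than the p-value itself.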

Part 5 - Conclusion

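There seem to be fewer births on days 6 and 7 of the week. Because 2000-01-01, a Saturday, is coded as day 6, these days are Saturday and Sunday, and the day with the fewest births is Sunday. Friday (day 5) has a total close to those of the other weekdays, so the totals by day of the week do not show that parents avoid Fridays. With the given data, the hypothesis that the day of the week matters appears to be supported: the slope for day_of_week is negative and has a very small p-value.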
