An article I read on FiveThirtyEight discussed parents who do not want to give birth on a Friday for fear that the date would fall on Friday the 13th. It makes one wonder whether people are really that superstitious. Would it really be bad if someone gave birth to a child on Friday the 13th?
There are 5479 cases in the data, one for each day. They cover U.S. births between 2000 and 2014. The FiveThirtyEight article mentioned that some people induce their labor to make sure that they avoid giving birth on a particular day.
The file US_births_2000-2014_SSA.csv contains U.S. births data for the years 2000 to 2014, as provided by the Social Security Administration.
This study is observational because it is based on data that the Social Security Administration collected, not on an experiment.
The people most likely to be interested in this data are first-time parents or parents who are about to have a child.
Since the data is observational, one cannot establish causal links between the variables.
The data comes from the FiveThirtyEight GitHub repository.
- https://github.com/fivethirtyeight/data/tree/master/births
- https://github.com/fivethirtyeight/data/blob/master/births/US_births_2000-2014_SSA.csv
- https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv
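The report does not show how `usbirths` was created. A minimal sketch of one way to load it, assuming the raw CSV link above is still reachable, is given here. The helper name `load_births` is hypothetical and is not part of the original analysis.

```r
# Raw CSV link taken from the data source list above.
births_url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"

# Hypothetical helper: reads the births file from a URL or a local path
# and checks that the columns used in this report are present.
load_births <- function(src = births_url) {
  births <- read.csv(src)
  expected <- c("year", "month", "date_of_month", "day_of_week", "births")
  stopifnot(all(expected %in% names(births)))
  births
}

# usbirths <- load_births()   # needs network access
```

Passing a local file path instead of the URL lets the same helper work offline with a downloaded copy.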
The dependent variable is births, the number of births on each day, which is numeric.
There are two independent variables, one quantitative and one qualitative: the date on which people give birth, and whether or not that date is a Friday the 13th.
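The qualitative variable is described but not built anywhere in the code. A small sketch of how it could be derived from the two date columns follows. It assumes the coding 1 = Monday through 7 = Sunday, which is consistent with January 1, 2000 (a Saturday) being coded 6 in the data. The data frame `oct2000` is an illustrative calendar only and carries no birth counts.

```r
# Calendar for October 2000. The 1st was a Sunday, coded 7 under the
# assumed scheme (1 = Monday, ..., 7 = Sunday).
oct2000 <- data.frame(
  year = 2000L,
  month = 10L,
  date_of_month = 1:31,
  day_of_week = ((0:30 + 6) %% 7) + 1
)

# Qualitative variable: TRUE only when the date is the 13th and a Friday (coded 5).
oct2000$friday_13 <- oct2000$date_of_month == 13 & oct2000$day_of_week == 5

# Show the flagged row(s).
oct2000[oct2000$friday_13, ]
```

The same one-line condition could be applied to `usbirths` directly, since it already has `date_of_month` and `day_of_week` columns.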
str(usbirths)
## 'data.frame': 5479 obs. of 5 variables:
## $ year : int 2000 2000 2000 2000 2000 2000 2000 2000 2000 2000 ...
## $ month : int 1 1 1 1 1 1 1 1 1 1 ...
## $ date_of_month: int 1 2 3 4 5 6 7 8 9 10 ...
## $ day_of_week : int 6 7 1 2 3 4 5 6 7 1 ...
## $ births : int 9083 8006 11363 13032 12558 12466 12516 8934 7949 11668 ...
summary(usbirths)
## year month date_of_month day_of_week births
## Min. :2000 Min. : 1.000 Min. : 1.00 Min. :1 Min. : 5728
## 1st Qu.:2003 1st Qu.: 4.000 1st Qu.: 8.00 1st Qu.:2 1st Qu.: 8740
## Median :2007 Median : 7.000 Median :16.00 Median :4 Median :12343
## Mean :2007 Mean : 6.523 Mean :15.73 Mean :4 Mean :11350
## 3rd Qu.:2011 3rd Qu.:10.000 3rd Qu.:23.00 3rd Qu.:6 3rd Qu.:13082
## Max. :2014 Max. :12.000 Max. :31.00 Max. :7 Max. :16081
summary(usbirths$day_of_week)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1 2 4 4 6 7
library(tidyverse)
## -- Attaching packages ------------------------------------------------------------------------------------------ tidyverse 1.3.0 --
## v ggplot2 3.2.1 v purrr 0.3.3
## v tibble 2.1.3 v dplyr 0.8.3
## v tidyr 1.0.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.4.0
## -- Conflicts --------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
# Combine year, month, and date_of_month into a single "year-month-day" column
us_births1 <- usbirths %>%
  unite(col = yeardate, year, month, date_of_month, sep = "-")
head(us_births1)
## yeardate day_of_week births
## 1 2000-1-1 6 9083
## 2 2000-1-2 7 8006
## 3 2000-1-3 1 11363
## 4 2000-1-4 2 13032
## 5 2000-1-5 3 12558
## 6 2000-1-6 4 12466
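The united `yeardate` column is still text. As a sketch, it can be parsed into a real date, which also gives a way to check how `day_of_week` is coded. The rows are the six shown in the output above. The column names `date` and `iso_day` are introduced here for illustration, and the 1 = Monday through 7 = Sunday coding is an assumption being tested, not something the report states.

```r
# The six rows printed by head(us_births1).
us_births_sample <- data.frame(
  yeardate = c("2000-1-1", "2000-1-2", "2000-1-3",
               "2000-1-4", "2000-1-5", "2000-1-6"),
  day_of_week = c(6, 7, 1, 2, 3, 4),
  births = c(9083, 8006, 11363, 13032, 12558, 12466)
)

# Parse the text into Date objects.
us_births_sample$date <- as.Date(us_births_sample$yeardate, format = "%Y-%m-%d")

# as.POSIXlt()$wday counts 0 = Sunday, ..., 6 = Saturday.
# Recode Sunday to 7 so the scale runs 1 = Monday, ..., 7 = Sunday.
wday <- as.POSIXlt(us_births_sample$date)$wday
us_births_sample$iso_day <- ifelse(wday == 0, 7, wday)

# Compare the recomputed weekday with the day_of_week column.
all(us_births_sample$iso_day == us_births_sample$day_of_week)
```

If the comparison holds, day 6 is Saturday and day 7 is Sunday in this data set.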
us_births1 %>%
group_by(day_of_week)%>%
summarise(Sum_of_Births = sum(births))
## # A tibble: 7 x 2
## day_of_week Sum_of_Births
## <int> <int>
## 1 1 9316001
## 2 2 10274874
## 3 3 10109130
## 4 4 10045436
## 5 5 9850199
## 6 6 6704495
## 7 7 5886889
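The totals are easier to read with day names attached. The sketch below reuses the seven sums from the table above and assumes that day 1 is Monday and day 7 is Sunday. The names `totals`, `share`, `lowest_day`, and `weekend_ratio` are illustrative.

```r
# Sums of births by day_of_week, copied from the table above.
totals <- c(9316001, 10274874, 10109130, 10045436, 9850199, 6704495, 5886889)
names(totals) <- c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")

# Share of all births that falls on each day, in percent.
share <- round(100 * totals / sum(totals), 1)

# Day with the fewest births overall.
lowest_day <- names(which.min(totals))

# Average weekend total relative to the average weekday total.
weekend_ratio <- mean(totals[c("Sat", "Sun")]) /
  mean(totals[c("Mon", "Tue", "Wed", "Thu", "Fri")])

share
lowest_day
weekend_ratio
```

A ratio below 1 means the weekend days have fewer births than the weekdays on average.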
ggplot(us_births1, aes(fill=day_of_week, y=births, x=day_of_week)) +
geom_bar(position="dodge", stat="identity") +
ggtitle("Day_of_Week Vs. Births")
usbirths %>%
group_by(date_of_month) %>%
summarise(Sum_of_Births = sum(births))
## # A tibble: 31 x 2
## date_of_month Sum_of_Births
## <int> <int>
## 1 1 2003627
## 2 2 2030447
## 3 3 2042441
## 4 4 2004785
## 5 5 2036185
## 6 6 2037729
## 7 7 2063416
## 8 8 2061652
## 9 9 2044600
## 10 10 2066154
## # ... with 21 more rows
ggplot(usbirths, aes(fill=date_of_month, y=births, x=date_of_month)) +
  geom_bar(position="dodge", stat="identity") +
  ggtitle("Date_of_Month Vs. Births")
plot(usbirths$day_of_week, usbirths$births)
qqnorm(usbirths$births)
qqline(usbirths$births)
Null hypothesis: the day of the week does not matter for the number of births.
Alternative hypothesis: the day of the week does matter for the number of births.
Correlation of births and day of the week
cor(usbirths$day_of_week, usbirths$births)
## [1] -0.6934876
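In a simple linear regression with one predictor, the R-squared equals the square of this correlation, so the two results can be checked against each other. The sketch below squares the reported value and then confirms the identity on the ten rows visible in the `str()` output. `sample_births` is only that small excerpt, not the full data.

```r
# Correlation reported above for the full data.
r <- -0.6934876
r_squared <- r^2
round(r_squared, 4)

# The identity itself, checked on the first ten rows shown by str(usbirths).
sample_births <- data.frame(
  day_of_week = c(6, 7, 1, 2, 3, 4, 5, 6, 7, 1),
  births = c(9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 7949, 11668)
)
fit <- lm(births ~ day_of_week, data = sample_births)

# TRUE when the squared correlation matches the model's R-squared.
all.equal(cor(sample_births$day_of_week, sample_births$births)^2,
          summary(fit)$r.squared)
```

The rounded square of the reported correlation agrees with the Multiple R-squared of 0.4809 in the regression summary.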
usb <- lm(births ~ day_of_week, data = usbirths)
summary(usb)
##
## Call:
## lm(formula = births ~ day_of_week, data = usbirths)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7359.7 -1328.3 -97.4 1347.3 5011.3
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14574.97 50.62 287.95 <2e-16 ***
## day_of_week -806.26 11.32 -71.23 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1676 on 5477 degrees of freedom
## Multiple R-squared: 0.4809, Adjusted R-squared: 0.4808
## F-statistic: 5074 on 1 and 5477 DF, p-value: < 2.2e-16
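The model above treats `day_of_week` as a number, so it fits one straight line through days 1 to 7. Because the day codes are really labels, one alternative is to treat the day as a factor and estimate a separate mean for each day. This is a sketch of that alternative, not the method used in this report. It runs on the ten rows visible in the `str()` output, so its estimates are illustrative only, and the day labels assume 1 = Monday through 7 = Sunday.

```r
# First ten rows shown by str(usbirths).
sample_births <- data.frame(
  day_of_week = c(6, 7, 1, 2, 3, 4, 5, 6, 7, 1),
  births = c(9083, 8006, 11363, 13032, 12558, 12466, 12516, 8934, 7949, 11668)
)

# Treat the day code as a category with one level per day.
sample_births$day <- factor(
  sample_births$day_of_week,
  levels = 1:7,
  labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun")
)

# Numeric predictor (as in the report) versus categorical predictor.
fit_numeric <- lm(births ~ day_of_week, data = sample_births)
fit_factor <- lm(births ~ day, data = sample_births)

# F test for any difference in mean births between days.
anova(fit_factor)

# The factor model can never explain less variance than the straight line.
c(numeric = summary(fit_numeric)$r.squared,
  factor = summary(fit_factor)$r.squared)
```

On the full data, the same two `lm()` calls with `data = usbirths` would show how much the straight-line assumption costs.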
plot( usbirths$births~usbirths$day_of_week)
abline(usb)
plot(usb$residuals ~ usb$fitted.values) # residuals against fitted values, the usual diagnostic
abline(h = 0, lty = 3) # adds a horizontal dashed line at y = 0
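There seem to be fewer births on the 6th and 7th days of the week. January 1, 2000 was a Saturday and is coded 6, so these days are Saturday and Sunday, and the day with the fewest births is Sunday. Friday (day 5) has a total close to the other weekdays, so the drop by itself does not show that people avoid giving birth on a Friday. With the given data, the hypothesis that the day of the week matters does seem to be supported, although the observational data cannot say why.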
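The day-of-week totals do not address Friday the 13th itself. One way to look at it, sketched here under assumptions, is to compare the average number of births on Fridays that fall on the 13th with the average on all other Fridays. The helper `friday_13_comparison` is hypothetical, it assumes Friday is coded 5, and it has not been run on the full data for this report.

```r
# Hypothetical helper. Expects the columns date_of_month, day_of_week
# and births, with Friday coded as 5.
friday_13_comparison <- function(births_df) {
  # Keep only Fridays so the comparison is like for like.
  fridays <- births_df[births_df$day_of_week == 5, ]
  on_13th <- fridays$date_of_month == 13
  c(
    friday_13_mean = mean(fridays$births[on_13th]),
    other_friday_mean = mean(fridays$births[!on_13th]),
    n_friday_13 = sum(on_13th)
  )
}

# friday_13_comparison(usbirths)   # requires the full data set
```

Restricting the comparison to Fridays keeps the large weekday versus weekend gap from hiding any Friday the 13th effect.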