Are Fewer Children Born on Weekends?

Analysis by Brian Weinfeld

During a recent discussion with some proud new parents, they told me an anecdote told to them by their doctor. The doctor had told them that more children were born during the week than during the weekend. This idea immediately intrigued me as logically the day of the week should have no affect on when a child is born. I decided to investigate.

birthData <- read.table(file='https://raw.githubusercontent.com/brian-cuny/rassignment3/master/Births.csv', header=TRUE, sep=',', stringsAsFactors=FALSE)
head(birthData,n=10)
##     X       date births  wday year month day day_of_year day_of_week
## 1   1 1969-01-01   8486   Wed 1969     1   1           1           3
## 2   2 1969-01-02   9002 Thurs 1969     1   2           2           4
## 3   3 1969-01-03   9542   Fri 1969     1   3           3           5
## 4   4 1969-01-04   8960   Sat 1969     1   4           4           6
## 5   5 1969-01-05   8390   Sun 1969     1   5           5           7
## 6   6 1969-01-06   9560   Mon 1969     1   6           6           1
## 7   7 1969-01-07   9738  Tues 1969     1   7           7           2
## 8   8 1969-01-08   9734   Wed 1969     1   8           8           3
## 9   9 1969-01-09   9434 Thurs 1969     1   9           9           4
## 10 10 1969-01-10  10042   Fri 1969     1  10          10           5

I began by pulling data on births in the United States from 1969 to 1988. The raw data showed that the number of births on any given day in this set was appoximately normal centered around a mean of 9649 with a standard deviation of 1127.315 births.

summary(birthData$births)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6675    8792    9622    9649   10510   12851
sd(birthData$births)
## [1] 1127.315
ggplot(birthData, aes(x=births)) + geom_histogram(aes(y=..density.., fill=..count..), bins=50) + scale_fill_gradient('Count', low='#DCDCDC', high='#7C7C7C') + stat_function(fun=dnorm, color='red', args=list(mean=mean(birthData$births), sd=sd(birthData$births)), aes(size=2), show.legend=FALSE) + theme(legend.position='none') + scale_y_continuous('Proportion', labels=scales::comma) + scale_x_continuous('Births', labels=scales::comma) + labs(title='Frequency of Births in Single Day (1969-1988)', caption='Data From: http://vincentarelbundock.github.io/Rdatasets/') 

I removed the unneeded columns and then agregated the data to show the total number of births for each day of the week for each year. In this data set 1 through 5 represented Monday through Friday while 6 and 7 represented Saturday and Sunday respsectively.

workingData <- birthData[,c(5,9,3)]
combinedBirthData <- aggregate(births ~ day_of_week + year, workingData, sum)[,c(2,1,3)]
head(combinedBirthData,n=10)
##    year day_of_week births
## 1  1969           1 526638
## 2  1969           2 549754
## 3  1969           3 537910
## 4  1969           4 517742
## 5  1969           5 531220
## 6  1969           6 483710
## 7  1969           7 451604
## 8  1970           1 545956
## 9  1970           2 573734
## 10 1970           3 552794

I then created a scatter plot of the data comparing births with day of the week and drew a line through the mean of each day’s data. Visual inspection of the graph seemed to indicate that there was a significant dip in births on Saturday and Sunday. I also plotted each year seperately in case one year’s data was skewing the overall results. Visual inspection of that data supported the initial observation.

original <- ggplot(combinedBirthData, aes(x=factor(day_of_week), y=births)) + geom_point(aes(color=factor(year))) + 
  labs(title='Births By Day of Week (1969-1988) With Trend Line', caption='Data From: http://vincentarelbundock.github.io/Rdatasets/',x='Day Of Week', y='Births', color='Year')

original + stat_summary(aes(y=births,group=1,size=1.5), fun.y=mean, geom='line',group=1, show.legend=FALSE) + scale_x_discrete(labels=c('Mon', 'Tues','Wed','Thurs','Fri','Sat','Sun')) 

original + facet_wrap(~year) + theme(legend.position='none') + scale_x_discrete(labels=c('M','T','W',"Th",'F','Sa','Su'))

Although visual inspection seemed to indicate a difference in birth rates it is important to perform a test to ensure that this is supportred statistically. An ANOVA Test would identify if there were a statistically significant difference between the number of births for each day of the week across all the data. However, I first wanted to ensure that the data had no outliers. I created a boxplot showing the range of births for each day of the week. This confirmed there were no outliers that needed to be addressed prior to the ANOVA test.

ggplot(combinedBirthData, aes(x=factor(day_of_week), y=births)) + geom_boxplot() + scale_x_discrete(labels=c('Mon', 'Tues','Wed','Thurs','Fri','Sat','Sun')) + 
  labs(title='Births By Day of Week (1969-1988)', caption='Data From: http://vincentarelbundock.github.io/Rdatasets/',x='Day Of Week', y='Births') 

summary(aov(births ~ day_of_week, data=workingData))
##               Df    Sum Sq   Mean Sq F value Pr(>F)    
## day_of_week    1 2.519e+09 2.519e+09    2721 <2e-16 ***
## Residuals   7303 6.763e+09 9.260e+05                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value of appoximately \(2\times10^{-16}\) I can confidently say that there was a statistically significant difference between the birth rates of babies when compared to the day of the week.

Further research yielded a perfectly reasonable explanation for my results. Improvements in medical technology have allowed doctors to schedule deliveries in advance and induce labor at the appropriate time. Combine this fact with the fact that doctors are more likely to take the weekends off and this leads to a logical explanation for the data. Doctors take weekends off so they schedule deliveries for the weekdays.

With that mystery solved I turned to another observation I noticed from visual inspection of the data. It appeared that the discrepency between weekday and weekend births are growing more pronounced as the years went by. Had this trend been growing during the examined time frame?

I began by writing a function to aid in creating a new data frame that identied the proportion of births that occurred during the week days.

weekDayFunction <- function(data){
  sum(data[which(data[,2] %in% 1:5),]$births) / sum(data$births)
}

ratioData <- ddply(workingData, .variables='year', .fun=weekDayFunction)
names(ratioData) <- c('year', 'proportion_weekday')
head(ratioData, n=10)
##    year proportion_weekday
## 1  1969          0.7400879
## 2  1970          0.7413371
## 3  1971          0.7430011
## 4  1972          0.7394419
## 5  1973          0.7438612
## 6  1974          0.7453826
## 7  1975          0.7458690
## 8  1976          0.7459045
## 9  1977          0.7452324
## 10 1978          0.7439112
tail(ratioData, n=10)
##    year proportion_weekday
## 11 1979          0.7458443
## 12 1980          0.7471125
## 13 1981          0.7488962
## 14 1982          0.7510158
## 15 1983          0.7498953
## 16 1984          0.7514446
## 17 1985          0.7552771
## 18 1986          0.7585414
## 19 1987          0.7604594
## 20 1988          0.7589383

I then plotted each year’s proportion of week day births with a regression line. The regresssion line appears to show a clear growing trend in this proportion. I performed a least-square regression analysis.

ggplot(ratioData, aes(x=year, y=proportion_weekday)) + geom_point(aes(color=proportion_weekday)) + scale_color_gradient(low='blue', high='red') + labs(title='Proportion of Births during Weekday (1969-1988)', x='Year', y='Proprotion of Babies Born', caption='Data From: http://vincentarelbundock.github.io/Rdatasets/') + theme(legend.position='none') + scale_y_continuous(labels=scales::percent) + geom_smooth(method='lm')

ratioData$year <- ratioData$year - 1969
summary(lm(formula = proportion_weekday ~ year, data=ratioData))
## 
## Call:
## lm(formula = proportion_weekday ~ year, data = ratioData)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.003671 -0.002081  0.000556  0.001560  0.004048 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.388e-01  9.750e-04  757.72  < 2e-16 ***
## year        9.810e-04  8.773e-05   11.18 1.56e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.002262 on 18 degrees of freedom
## Multiple R-squared:  0.8742, Adjusted R-squared:  0.8672 
## F-statistic:   125 on 1 and 18 DF,  p-value: 1.558e-09
cor(ratioData$proportion_weekday, ratioData$year)
## [1] 0.9349669
\[\widehat{proportion}=.0981(year)+73.88\]

The least squares regression line above where \(year\) represents the number of the years since 1969 and \(\widehat{proportion}\) represents the proportion of weekday births indicates that there is an increase of appoximately \(\frac{1}{10}\)th of a percentage point per year (or 1 percentage point per decade) in the rate of weekday births compared to weekend births. With a p-value of appoximately \(1.56\times10^{-9}\) it can safely be concluded that this is a statistically significant trend. The correlation \(r=0.9349669\) supports this conclusion.

It can be stated with strong confidence that weekday births have been steadily increasing in the United States in the years 1969 to 1988. It can also be stated with strong confidence that there is a statitically significant increase in the number of babies born during the week than during the weekend.

There are a number of follow up questions I would be interested in pursuing: