How to choose a birthday
FiveThirtyEight has a couple of nice datasets on the number of births for each day. In this article, https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/, they try to see whether fewer people are born on Friday the 13th. However, what I’m interested in is whether one can simply predict births given the day and month. The dataset has everything one needs to find out, so let’s build a real bulky multiple regression and see if it does anything.
Preliminary Data Analysis
Below is a glance at what the data looks like. We’re also going to convert the month, date of the month, and day of the week to factors so that the regression treats them as categorical values.
df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv")
df$month <- as.factor(df$month)
df$date_of_month <- as.factor(df$date_of_month)
df$day_of_week <- as.factor(df$day_of_week)
head(df)
##   year month date_of_month day_of_week births
## 1 2000     1             1           6   9083
## 2 2000     1             2           7   8006
## 3 2000     1             3           1  11363
## 4 2000     1             4           2  13032
## 5 2000     1             5           3  12558
## 6 2000     1             6           4  12466
summary(df)
##       year          month      date_of_month  day_of_week     births
##  Min.   :2000   1      : 465   1      : 180   1:783       Min.   : 5728
##  1st Qu.:2003   3      : 465   2      : 180   2:783       1st Qu.: 8740
##  Median :2007   5      : 465   3      : 180   3:783       Median :12343
##  Mean   :2007   7      : 465   4      : 180   4:782       Mean   :11350
##  3rd Qu.:2011   8      : 465   5      : 180   5:782       3rd Qu.:13082
##  Max.   :2014   10     : 465   6      : 180   6:783       Max.   :16081
##                 (Other):2689   (Other):4399   7:783
We have data for the year, the month, the date of the month, the day of the week, and the number of recorded births on that day. The data spans 15 years (2000-2014), and there are 5479 observations in total.
Below is a basic scatter plot of all birth counts.
It’s immediately apparent that there are two different types of counts: low and high. The lower counts tend to be about 4000 births below the higher ones, and there’s not much in between. The second thing you might notice is the wavy pattern it follows, which repeats 15 times, once per year. I’d be fairly confident in predicting that’s related to the month.
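The observation count can be sanity-checked against the calendar. The sketch below is not part of the original analysis; it only counts the days from 2000-01-01 through 2014-12-31 in base R, under the assumption that the dataset has exactly one row per calendar day.

```r
# One entry per calendar day in the period covered by the dataset
# (assumption: the file has exactly one row per day).
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")

n_days  <- length(dates)
n_years <- length(unique(format(dates, "%Y")))
# Leap days are the February 29ths that fall inside the range
n_leap  <- sum(format(dates, "%m-%d") == "02-29")

n_days                    # 5479
n_years * 365 + n_leap    # 15 * 365 + 4 = 5479
```

The two numbers agree with the 5479 rows reported above, which suggests no days are missing or duplicated.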
Let’s break it down further and examine births by each other data category, starting with year below.
y <- aggregate(df$births, by=list(Category=df$year), FUN=sum) #Aggregates the births within each year
plot(y, xlab = 'year', ylab = "births")
So it definitely varies by year, with more births occurring around 2006-2008 and fewer after 2010. There’s about a 10% difference between the max and min.
Below is a plot of births by month, not aggregated.
While there is considerable variation between days within each month, we can tell that births on average tend to increase from May up through September, before dropping back down in October; one could engage in plenty of speculation as to why. Another, less obvious detail here is that December has a much lower minimum than any other month (you’ll see why in the next plot).
Now let’s take a look at day of the month.
Immediately, we can see that the 13th and the 31st tend to have lower averages; the 31st doesn’t happen every month, and I guess people really don’t like giving birth on Friday the 13th? Another thing you might notice is the low minimum value on the 25th, which I can almost guarantee is because of Christmas, and which would explain the December minimum we saw on the last graph. Also, the 9th has a noticeably larger maximum, and I’m not sure why.
Now let’s look at it by day of the week.
*To clarify, Monday here is 1 and Sunday is 7.
There are a few interesting things about this one. First is that there are significantly fewer births on Saturdays and Sundays than during the rest of the week. I’m not sure why that is; maybe people really don’t want to ruin their weekends with childbirth, or perhaps the hospitals are just bad at keeping tabs on the weekends. But if the latter were true, wouldn’t we see a big bump in the number recorded on Mondays?
It was at this point that the author briefly asked Google why this might occur, and it was suggested that hospitals schedule c-sections and induced labors away from the weekend. This might explain the Christmas thing as well. There is also a noticeable dip in Monday births, though; maybe they schedule them for midweek?
The second big thing of note here is the number of outliers on Monday through Friday. I’m not sure why this is; perhaps holidays have a greater influence on weekdays, or long weekends pull down Fridays and Mondays. Hopefully this won’t throw off the regression.
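The weekday coding can be cross-checked against the first rows of the data. The following is a small sketch in base R, not taken from the original analysis: it recomputes the day of the week for the first six dates in the file and can be compared with the `day_of_week` column in the `head(df)` output.

```r
# First six dates in the dataset, per the head(df) output
dates <- seq(as.Date("2000-01-01"), by = "day", length.out = 6)

# POSIXlt stores weekdays as 0 = Sunday, ..., 6 = Saturday;
# remap Sunday to 7 to get the Monday = 1, ..., Sunday = 7 coding.
wday <- as.POSIXlt(dates)$wday
day_of_week <- ifelse(wday == 0, 7, wday)

data.frame(date = dates, day_of_week = day_of_week)
```

January 1st, 2000 was a Saturday, so the sequence starts at 6, matching the first row of the data.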
Regression
Given that we’re treating all of this as categorical data, the multiple linear regression model is going to have many variables: one dummy variable for each month, day of the month, and day of the week beyond the first level of each, which works out to 47 predictors plus an intercept. We’re going to leave year out of it, because I’m more interested in how births are affected by the time of year than by which year it is.
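To see how the factors turn into regression variables, here is a minimal sketch that rebuilds only the calendar columns (no birth counts) and inspects the design matrix. The data frame `cal` and the matrix `X` are names made up for this illustration, and the calendar is assumed to match the dataset's one-row-per-day layout.

```r
# Rebuild just the calendar columns to inspect the design; no births needed.
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
lt <- as.POSIXlt(dates)
cal <- data.frame(
  month         = factor(lt$mon + 1),                       # 1-12
  date_of_month = factor(lt$mday),                          # 1-31
  day_of_week   = factor(ifelse(lt$wday == 0, 7, lt$wday))  # Monday = 1
)

# Each factor contributes (number of levels - 1) dummy columns;
# the first level of each is absorbed into the intercept.
X <- model.matrix(~ month + date_of_month + day_of_week, data = cal)

ncol(X) - 1          # dummy variables: 11 + 30 + 6 = 47
nrow(X) - ncol(X)    # residual degrees of freedom: 5479 - 48 = 5431
```

Because January, the 1st of the month, and Monday are the first levels, they form the baseline that every coefficient is measured against.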
model <- lm(births ~ month + date_of_month + day_of_week, data = df)
summary(model)
##
## Call:
## lm(formula = births ~ month + date_of_month + day_of_week, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6210.7 -306.9 35.8 413.9 3098.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11260.42 76.56 147.087 < 2e-16 ***
## month2 234.40 55.45 4.227 2.40e-05 ***
## month3 234.66 54.03 4.343 1.43e-05 ***
## month4 118.80 54.54 2.178 0.029436 *
## month5 251.25 54.03 4.650 3.40e-06 ***
## month6 578.25 54.54 10.602 < 2e-16 ***
## month7 823.54 54.03 15.242 < 2e-16 ***
## month8 1003.69 54.03 18.577 < 2e-16 ***
## month9 1104.29 54.54 20.247 < 2e-16 ***
## month10 486.41 54.03 9.003 < 2e-16 ***
## month11 220.10 54.54 4.036 5.52e-05 ***
## month12 286.03 54.03 5.294 1.24e-07 ***
## date_of_month2 169.81 86.85 1.955 0.050593 .
## date_of_month3 185.16 86.84 2.132 0.033042 *
## date_of_month4 -58.29 86.84 -0.671 0.502147
## date_of_month5 134.37 86.84 1.547 0.121863
## date_of_month6 173.17 86.84 1.994 0.046201 *
## date_of_month7 279.39 86.85 3.217 0.001303 **
## date_of_month8 322.36 86.84 3.712 0.000208 ***
## date_of_month9 248.44 86.85 2.861 0.004243 **
## date_of_month10 316.90 86.84 3.649 0.000266 ***
## date_of_month11 215.67 86.84 2.483 0.013040 *
## date_of_month12 336.02 86.84 3.869 0.000110 ***
## date_of_month13 -36.08 86.84 -0.416 0.677786
## date_of_month14 350.92 86.85 4.041 5.40e-05 ***
## date_of_month15 352.07 86.84 4.054 5.10e-05 ***
## date_of_month16 326.50 86.85 3.760 0.000172 ***
## date_of_month17 347.00 86.84 3.996 6.54e-05 ***
## date_of_month18 346.65 86.84 3.992 6.65e-05 ***
## date_of_month19 296.27 86.84 3.412 0.000651 ***
## date_of_month20 426.04 86.84 4.906 9.57e-07 ***
## date_of_month21 367.07 86.85 4.227 2.41e-05 ***
## date_of_month22 263.25 86.84 3.031 0.002445 **
## date_of_month23 131.52 86.85 1.514 0.129966
## date_of_month24 -88.38 86.84 -1.018 0.308845
## date_of_month25 -237.46 86.84 -2.734 0.006271 **
## date_of_month26 -59.38 86.84 -0.684 0.494189
## date_of_month27 160.69 86.84 1.850 0.064321 .
## date_of_month28 213.35 86.85 2.457 0.014057 *
## date_of_month29 221.82 88.28 2.513 0.012007 *
## date_of_month30 250.05 88.87 2.814 0.004914 **
## date_of_month31 -169.61 101.62 -1.669 0.095163 .
## day_of_week2 1226.71 41.65 29.454 < 2e-16 ***
## day_of_week3 1014.26 41.64 24.356 < 2e-16 ***
## day_of_week4 949.11 41.66 22.783 < 2e-16 ***
## day_of_week5 697.05 41.66 16.733 < 2e-16 ***
## day_of_week6 -3336.14 41.64 -80.112 < 2e-16 ***
## day_of_week7 -4381.01 41.65 -105.188 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 823.8 on 5431 degrees of freedom
## Multiple R-squared: 0.8756, Adjusted R-squared: 0.8745
## F-statistic: 813.4 on 47 and 5431 DF, p-value: < 2.2e-16
So above we see the breakdown of the model. If you look at the bottom, you’ll see an adjusted R-squared of 0.875, which means the model explains about 87% of the variance in daily births, so it’s pretty good at predicting the number of births given the data. The main reason it does this well is that it sets the intercept (the baseline prediction for a Monday that falls on January 1st) close to the mean number of births, and differentiates the rest primarily by the day of the week. If you look at the bottom of the coefficient list, the days of the week are extremely significant, and they shift the prediction by a much larger magnitude than either the month or the day of the month. That’s not to say that the month and the day of the month don’t matter, though. In fact, the months (barring April) are highly significant, as are many of the days of the month to a lesser extent. The last thing worth mentioning is the residual statistics (shown up top), which have a median pretty close to 0 (good), but somewhat lopsided quartiles and a minimum about twice as far from 0 as the maximum. We’ll pull that apart later.
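As a worked example of how the model builds a prediction, the sketch below adds up coefficients copied from the summary output for two hand-picked days. The variable names are made up for the illustration; the numbers are the estimates printed above.

```r
# Coefficients copied from the summary output above.
intercept <- 11260.42   # baseline: a Monday on January 1st (all reference levels)
month9    <- 1104.29    # September
month12   <- 286.03     # December
date20    <- 426.04     # 20th of the month
date25    <- -237.46    # 25th of the month
tuesday   <- 1226.71    # day_of_week2
sunday    <- -4381.01   # day_of_week7

# A Tuesday that falls on September 20th
tue_sep_20 <- intercept + month9 + date20 + tuesday
# A Sunday that falls on December 25th
sun_dec_25 <- intercept + month12 + date25 + sunday

tue_sep_20   # 14017.46
sun_dec_25   # 6927.98
```

The gap of roughly 7000 births between the two days comes mostly from the day-of-week terms, which is the point made above about their magnitude.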
Below is a scatterplot of predictions against actual births.
predictions <- predict(model, newdata = df[, 2:4])  # predict() takes newdata, not data
plot(df$births, predictions, xlab = "Actual Births", ylab = "Predicted Births")
abline(0,1, col = 'red')
*The red line shows where the points would fall if predictions matched actual births exactly (100% accuracy).
As we can see, it’s pretty close for the most part. There are two big blobs (weekends and weekdays) that are right about where they should be. Noticeable, however, are the outliers to the left of the upper blob; these are low-to-mid birth days that were wrongly predicted as high birth days. These are likely the low birth days that occurred on weekdays, which are outliers (see the births vs. day of the week plot). We’ll examine the residuals further.
Residuals Analysis
Below is a plot of fitted values against residual values.
As we could see in the previous plot, we have two big blobs for weekends and weekdays, and a bunch of low-valued residuals under the weekday blob. Residuals seem fairly normally distributed for the weekends, but I’m not sure if that holds for weekdays given those outliers. One thing to note is that there are probably fewer than 50 outliers sitting below a weekday blob that should contain about 4000 residual points. Let’s examine them further with a histogram.
Overall, it does look fairly normal. There is a long tail of outliers to the left, but the vast majority of the residuals are normally distributed.
Let’s see what the Q-Q plot looks like.
Looks great in the middle but pretty ugly on the left end. I’m not sure how to interpret this, given that there should be about 5000 residuals on the Q-Q line and only a few outliers below it in comparison. Let’s take a look at a residuals vs. leverage plot and see if the outliers actually make much of a difference.
As it turns out, those outliers really don’t make much of a difference overall. The plot marks out a couple of influential outliers, but overall you can’t even see the Cook’s distance lines.
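The four diagnostic plots in this section could be produced along the following lines. This is only a sketch: to stay self-contained it fits a smaller model to simulated data (`sim` and `fit` are hypothetical stand-ins, and every number in the simulation is made up), so the pictures will not match the ones described above. With the real data, the same calls would be applied to `model`.

```r
# Simulated stand-in for the births data (NOT the real dataset):
# a weekday level, a weekend drop, and Gaussian noise. All numbers are made up.
set.seed(1)
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
lt <- as.POSIXlt(dates)
sim <- data.frame(
  month       = factor(lt$mon + 1),
  day_of_week = factor(ifelse(lt$wday == 0, 7, lt$wday))  # Monday = 1
)
is_weekend <- sim$day_of_week %in% c("6", "7")
sim$births <- 12000 - 4000 * is_weekend + rnorm(nrow(sim), sd = 800)

fit <- lm(births ~ month + day_of_week, data = sim)  # hypothetical stand-in model
res <- resid(fit)

# Fitted values against residuals
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, col = "red")

# Histogram of residuals
hist(res, breaks = 50, main = "Histogram of residuals", xlab = "Residual")

# Normal Q-Q plot with a reference line
qqnorm(res)
qqline(res, col = "red")

# Residuals vs leverage, with Cook's distance contours
plot(fit, which = 5)
```

Since the simulated noise is Gaussian with no holiday effects, these plots show what the diagnostics look like without the left-tail outliers seen in the real residuals.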
Conclusions
Yep, you can predict the expected number of births on any given day fairly well given the day of the week, the month, and the day of the month. There are definitely days for which this model does not work, as shown by the outlier residuals. I suspect that if one were to also take into consideration holidays, days off work for hospitals, and apparently Friday the 13th, then the model would sort those out appropriately. But overall, this has a fairly good chance of predicting the number of births on any given day, at least for the years 2000-2014.
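One way to start on that idea is to build indicator columns for the special days. The sketch below is an untested suggestion rather than part of the analysis above: it derives two hypothetical flags, `is_christmas` and `is_friday_13`, from the calendar alone.

```r
# Calendar for the period covered by the dataset
dates <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
lt <- as.POSIXlt(dates)

# Hypothetical indicator columns (names invented for this sketch).
# POSIXlt: mon is 0-11, mday is 1-31, wday is 0 = Sunday ... 5 = Friday.
flags <- data.frame(
  date         = dates,
  is_christmas = (lt$mon + 1) == 12 & lt$mday == 25,
  is_friday_13 = lt$wday == 5 & lt$mday == 13
)

# How many days each flag marks over the 15 years
colSums(flags[, c("is_christmas", "is_friday_13")])
```

If the rows line up with the dataset, these columns could be added to `df` and to the model formula as extra terms; whether that actually absorbs the outlier residuals would need to be checked.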