US Birthday Prediction

How to choose a birthday

FiveThirtyEight has a couple of nice datasets on the number of births for each day. In this article, https://fivethirtyeight.com/features/some-people-are-too-superstitious-to-have-a-baby-on-friday-the-13th/, they try to see if fewer people are born on Friday the 13th. However, what I’m interested in is whether one can simply predict births given the day and month. The dataset has everything one needs to find out, so let’s make a real bulky multiple regression and see if it does anything.

Preliminary Data Analysis

Below is a glance at what the data looks like. We’re also going to change the month, date of the month, and day of the week to factors so we can use them in the regression as categorical variables.

df <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv")

df$month <- as.factor(df$month)
df$date_of_month <- as.factor(df$date_of_month)
df$day_of_week <- as.factor(df$day_of_week)

head(df)

##   year month date_of_month day_of_week births
## 1 2000     1             1           6   9083
## 2 2000     1             2           7   8006
## 3 2000     1             3           1  11363
## 4 2000     1             4           2  13032
## 5 2000     1             5           3  12558
## 6 2000     1             6           4  12466

summary(df)

##       year          month      date_of_month  day_of_week     births     
##  Min.   :2000   1      : 465   1      : 180   1:783       Min.   : 5728  
##  1st Qu.:2003   3      : 465   2      : 180   2:783       1st Qu.: 8740  
##  Median :2007   5      : 465   3      : 180   3:783       Median :12343  
##  Mean   :2007   7      : 465   4      : 180   4:782       Mean   :11350  
##  3rd Qu.:2011   8      : 465   5      : 180   5:782       3rd Qu.:13082  
##  Max.   :2014   10     : 465   6      : 180   6:783       Max.   :16081  
##                 (Other):2689   (Other):4399   7:783

We have data for the year, the month, the date of the month, the day of the week, and the number of recorded births on that day. The data spans 15 years and there are 5479 observations in total.
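As a quick sanity check on that observation count, here is a small sketch that counts calendar days without touching the dataset. It assumes the data has exactly one row per day from 2000-01-01 through 2014-12-31, which is what the summary suggests but which I haven't verified row by row.

```r
# Count the calendar days from 2000-01-01 through 2014-12-31 (inclusive).
# Assumption: the dataset has exactly one row per calendar day in this range.
days <- seq(as.Date("2000-01-01"), as.Date("2014-12-31"), by = "day")
n_days <- length(days)
n_days  # 5479

# The same count by hand: 15 years of 365 days,
# plus one leap day each for 2000, 2004, 2008 and 2012.
by_hand <- 15 * 365 + 4
by_hand  # 5479
```

Both counts agree with the 5479 observations reported above.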

Below is a basic scatter plot of all birth counts.

plot(df$births)

It’s immediately apparent that there are two different types of counts: low and high. The lower counts tend to be about 4000 births below the higher ones, and there’s not much in between. The second thing you might notice is the wavy pattern it follows, which repeats 15 times, once for each year in the data. I’d be fairly confident in predicting that’s related to the month.

Let’s break it down further and examine births by each other data category, starting with year below.

y <- aggregate(df$births, by=list(Category=df$year), FUN=sum) #Aggregates the births within each year
plot(y, xlab = 'year', ylab = "births")

So it definitely varies by year, with more births occurring around 2006-2008, and fewer past 2010. There’s about a 10% difference between the max and min.

Below is a plot of births by month, not aggregated.

x <- df$month 
plot(x, df$births, xlab = 'month', ylab = "births")

While there is significant variance between days within each month, we can tell that births on average tend to increase from May up through September, before dropping back down in October; one could engage in plenty of speculation as to why. Another less obvious detail here is that December has a much lower minimum than any other month (you’ll see why in the next plot).

Now let’s take a look at day of the month.

x <- df$date_of_month 
plot(x, df$births, xlab = 'date of month', ylab = "births")

Immediately, we can see that the 13th and the 31st tend to have lower averages; the 31st doesn’t happen every month, and I guess people really don’t like giving birth on Friday the 13th? Another thing you might notice is the low minimum value on the 25th, which I can almost guarantee is because of Christmas, and which would explain the December minimum we saw on the last graph. Also, the 9th has a noticeably larger maximum, and I’m not sure why.

Now let’s look at it by day of the week.

x <- df$day_of_week 
plot(x, df$births, xlab = 'day of week', ylab = "births")

*to clarify, Monday here is 1, and Sunday is 7.
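If you want to double-check that coding rather than take my word for it, the sketch below rebuilds the dates for the six rows printed by head(df) earlier and compares the dataset's day_of_week with R's ISO weekday number. The data frame name `first_rows` is just something I made up for illustration; the values are copied from that output.

```r
# First six rows of the dataset, copied from the head(df) output.
# `first_rows` is a hypothetical name used only for this check.
first_rows <- data.frame(
  year = 2000,
  month = 1,
  date_of_month = 1:6,
  day_of_week = c(6, 7, 1, 2, 3, 4)
)

# Rebuild real calendar dates from the three date columns.
dates <- as.Date(paste(first_rows$year, first_rows$month,
                       first_rows$date_of_month, sep = "-"))

# "%u" is the ISO 8601 weekday number: Monday = 1, ..., Sunday = 7.
iso_weekday <- as.integer(format(dates, "%u"))

# TRUE if the dataset uses the same Monday = 1 coding for these rows.
coding_matches <- all(iso_weekday == first_rows$day_of_week)
coding_matches  # TRUE
```

January 1st, 2000 was a Saturday, which is why the first row carries a 6.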

There are a few interesting things about this one. First, there are significantly fewer births on Saturdays and Sundays than during the rest of the week. I’m not sure why that is; maybe people really don’t want to ruin their weekends with childbirth, or perhaps the hospitals are just bad at keeping tabs on the weekends. But if the latter were true, wouldn’t we see a big bump in the number recorded on Mondays?

It was at this point that the author briefly asked Google why this might occur, and it suggested that hospitals schedule C-sections and induced labors away from the weekend. This might explain the Christmas thing as well. There is also a noticeable dip in Monday births, though; maybe they get scheduled for midweek?

The second big thing of note here is the number of outliers on Monday through Friday. I’m not sure why this is; perhaps holidays have a greater influence on weekdays, or long weekends pull down Fridays and Mondays. Hopefully this won’t throw off the regression.

Regression

Given that we’re treating all of this as categorical data, the multiple linear regression model here is going to have many variables; specifically, one dummy variable for each level of month, day of the month, and day of the week, except for one baseline level of each. We’re going to leave year out of it, because I’m more interested in how births are affected by the time of year than by which year it is.
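To get a feel for just how many variables that is, here is a rough sketch that builds the design matrix for the same formula on a made-up grid of factor levels (no birth data involved). It assumes R's default treatment contrasts, where the first level of each factor becomes the baseline; `grid`, `X` and `n_coef` are names I picked for illustration.

```r
# Every combination of the three factors, using the same level sets as the data.
# (Some combinations, like February 31st, are not real dates; that does not
# matter for counting columns.)
grid <- expand.grid(
  month = factor(1:12),
  date_of_month = factor(1:31),
  day_of_week = factor(1:7)
)

# The design matrix lm() would build for this formula, assuming the default
# treatment contrasts (first level of each factor is the baseline).
X <- model.matrix(~ month + date_of_month + day_of_week, data = grid)

# 1 intercept + 11 months + 30 dates + 6 weekdays
n_coef <- ncol(X)
n_coef  # 48

head(colnames(X), 3)  # "(Intercept)" "month2" "month3"

# Residual degrees of freedom for the 5479 observations in the dataset.
5479 - n_coef  # 5431
```

So month 1, date 1 and day 1 (Monday) get no coefficient of their own; they are folded into the intercept, and every other coefficient is a difference from that baseline.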

model <- lm(births ~ month + date_of_month + day_of_week, data=df)
summary(model)

## 
## Call:
## lm(formula = births ~ month + date_of_month + day_of_week, data = df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6210.7  -306.9    35.8   413.9  3098.1 
## 
## Coefficients:
##                 Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)     11260.42      76.56  147.087  < 2e-16 ***
## month2            234.40      55.45    4.227 2.40e-05 ***
## month3            234.66      54.03    4.343 1.43e-05 ***
## month4            118.80      54.54    2.178 0.029436 *  
## month5            251.25      54.03    4.650 3.40e-06 ***
## month6            578.25      54.54   10.602  < 2e-16 ***
## month7            823.54      54.03   15.242  < 2e-16 ***
## month8           1003.69      54.03   18.577  < 2e-16 ***
## month9           1104.29      54.54   20.247  < 2e-16 ***
## month10           486.41      54.03    9.003  < 2e-16 ***
## month11           220.10      54.54    4.036 5.52e-05 ***
## month12           286.03      54.03    5.294 1.24e-07 ***
## date_of_month2    169.81      86.85    1.955 0.050593 .  
## date_of_month3    185.16      86.84    2.132 0.033042 *  
## date_of_month4    -58.29      86.84   -0.671 0.502147    
## date_of_month5    134.37      86.84    1.547 0.121863    
## date_of_month6    173.17      86.84    1.994 0.046201 *  
## date_of_month7    279.39      86.85    3.217 0.001303 ** 
## date_of_month8    322.36      86.84    3.712 0.000208 ***
## date_of_month9    248.44      86.85    2.861 0.004243 ** 
## date_of_month10   316.90      86.84    3.649 0.000266 ***
## date_of_month11   215.67      86.84    2.483 0.013040 *  
## date_of_month12   336.02      86.84    3.869 0.000110 ***
## date_of_month13   -36.08      86.84   -0.416 0.677786    
## date_of_month14   350.92      86.85    4.041 5.40e-05 ***
## date_of_month15   352.07      86.84    4.054 5.10e-05 ***
## date_of_month16   326.50      86.85    3.760 0.000172 ***
## date_of_month17   347.00      86.84    3.996 6.54e-05 ***
## date_of_month18   346.65      86.84    3.992 6.65e-05 ***
## date_of_month19   296.27      86.84    3.412 0.000651 ***
## date_of_month20   426.04      86.84    4.906 9.57e-07 ***
## date_of_month21   367.07      86.85    4.227 2.41e-05 ***
## date_of_month22   263.25      86.84    3.031 0.002445 ** 
## date_of_month23   131.52      86.85    1.514 0.129966    
## date_of_month24   -88.38      86.84   -1.018 0.308845    
## date_of_month25  -237.46      86.84   -2.734 0.006271 ** 
## date_of_month26   -59.38      86.84   -0.684 0.494189    
## date_of_month27   160.69      86.84    1.850 0.064321 .  
## date_of_month28   213.35      86.85    2.457 0.014057 *  
## date_of_month29   221.82      88.28    2.513 0.012007 *  
## date_of_month30   250.05      88.87    2.814 0.004914 ** 
## date_of_month31  -169.61     101.62   -1.669 0.095163 .  
## day_of_week2     1226.71      41.65   29.454  < 2e-16 ***
## day_of_week3     1014.26      41.64   24.356  < 2e-16 ***
## day_of_week4      949.11      41.66   22.783  < 2e-16 ***
## day_of_week5      697.05      41.66   16.733  < 2e-16 ***
## day_of_week6    -3336.14      41.64  -80.112  < 2e-16 ***
## day_of_week7    -4381.01      41.65 -105.188  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 823.8 on 5431 degrees of freedom
## Multiple R-squared:  0.8756, Adjusted R-squared:  0.8745 
## F-statistic: 813.4 on 47 and 5431 DF,  p-value: < 2.2e-16

So above we see the breakdown of the model. If you look at the bottom, you’ll see an Adjusted R-squared of 0.875, which means the model explains about 87% of the variance in daily births, so it is pretty good at predicting the number of births given the data. The main reason it does so well is that the intercept (the fitted value for a Monday that falls on the 1st of January, the baseline level of each factor) sits very close to the mean number of births, and the rest is differentiated primarily by the day of the week. If you look at the bottom of the list above, the days of the week are extremely significant, and shift the prediction by a far larger magnitude than either the month or the day of the month. That’s not to say that the day of the month and the month don’t matter, though. In fact, the months (barring April, which only clears the 5% level) are highly significant, as are many of the days of the month to a lesser extent. The last thing worth mentioning is the residual statistics (shown up top), which have a median pretty close to 0 (good), but somewhat asymmetric 1st and 3rd quartiles (-306.9 vs. 413.9) and a minimum (-6210.7) about twice as far from zero as the maximum (3098.1). We’ll pull that apart later.
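To make the coefficient table a bit more concrete, here is a small worked example of how a prediction is assembled. The numbers are copied from the summary output above; the choice of day (a 20th of September, once as a Tuesday and once as a Sunday) is a hypothetical I picked for illustration, and the variable names are mine.

```r
# Coefficients copied from the summary(model) output.
intercept <- 11260.42  # baseline: month 1 (January), date 1, day_of_week 1 (Monday)
month9    <- 1104.29   # September, relative to January
date20    <- 426.04    # the 20th, relative to the 1st
tuesday   <- 1226.71   # day_of_week2, relative to Monday
sunday    <- -4381.01  # day_of_week7, relative to Monday

# Predicted births for a hypothetical September 20th that falls on a Tuesday.
pred_tuesday <- intercept + month9 + date20 + tuesday
pred_tuesday  # 14017.46

# The same date if it fell on a Sunday instead.
pred_sunday <- intercept + month9 + date20 + sunday
pred_sunday   # 8409.74

# The weekday swap alone moves the prediction by about 5600 births,
# far more than any month or date-of-month coefficient does.
pred_tuesday - pred_sunday  # 5607.72
```

With the fitted model in hand, predict(model, newdata = ...) should give the same numbers up to the rounding in the printed coefficients.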

Below is a scatterplot of predictions against actual births.

predictions <- predict(model, newdata = df[,2:4]) # the argument is `newdata`; an argument named `data` would be silently ignored
plot(df$births, predictions, xlab = "Actual Births", ylab = "Predicted Births")
abline(0,1, col = 'red')

*The red line shows where the points would fall if predicted and actual births were identical (100% accuracy)

As we can see, it’s pretty close for the most part. There are two big blobs (weekends and weekdays) that are right about where they should be. Noticeable, however, are the outliers to the left of the upper blob; these are low-to-mid birth days which were wrongly predicted as high birth days. These are likely the low birth days that occurred on weekdays, which are outliers (see the births vs. day of the week plot). We’ll examine the residuals further.

Residuals Analysis

Below is a plot of fitted values against residual values.

plot(model$fitted.values,model$residuals)
abline(h = 0, col = "red")

As we could see in the plot before, we have two big blobs for weekends and weekdays, and a bunch of low-value residuals under the weekday blob. Residuals seem fairly normally distributed for the weekends, but I’m not sure if that holds for weekdays given those outliers. One thing to note is that there are probably fewer than 50 outliers, sitting below a blob of roughly 4000 weekday residual points. Let’s examine them further with a histogram.

hist(model$residuals, breaks = 50)

Overall, it does look fairly normal. There is a long tail of outliers to the left, but the vast majority of the residuals look normally distributed.

Let’s see what the Q-Q plot looks like.

qqnorm(model$residuals)
qqline(model$residuals, col = 'red')

Looks great in the middle but pretty ugly on the left end. I’m not sure how to interpret this, given that there should be about 5000 residuals on the Q-Q line and only a few outliers below it in comparison. Let’s take a look at a residuals vs leverage plot and see if the outliers actually make much of a difference.

plot(model, which = 5)

As it turns out, those outliers really don’t make much of a difference overall. It marks out a couple influential outliers, but overall you can’t even see the Cook’s distance lines.
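One way to see which days those outliers actually are is to sort the residuals and look at the dates attached to the worst ones. The sketch below reloads the same CSV and refits the same model so it can run on its own; `daily`, `fit` and `worst` are names I made up, and I'm not claiming anything about what the list will contain.

```r
# Reload the data and refit the same model as above so this runs standalone.
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"
daily <- read.csv(url)
for (col in c("month", "date_of_month", "day_of_week")) {
  daily[[col]] <- as.factor(daily[[col]])
}
fit <- lm(births ~ month + date_of_month + day_of_week, data = daily)

# Attach each day's residual (actual minus predicted births).
daily$residual <- residuals(fit)

# The ten days the model over-predicted the most (most negative residuals).
worst <- head(daily[order(daily$residual), ], 10)
worst[, c("year", "month", "date_of_month", "day_of_week", "births", "residual")]
```

If the holiday hunch is right, I'd expect the same few calendar dates to show up here year after year.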

Conclusions

Yep, you can predict the expected number of births on any given day fairly well given the day of the week, the month, and the day of the month. There are definitely days for which this model does not work, as shown by the outlier residuals. I suspect that if one were to also take into consideration holidays, days off work for hospitals, and apparently Friday the 13th, then the model would sort those out appropriately. But overall, this has a fairly good chance of predicting the number of births on any given day, at least for the years 2000-2014.
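For anyone who wants to try the holiday idea, here is a minimal sketch of what that might look like. It only flags three fixed-date US holidays (January 1st, July 4th and December 25th), which is my own simplifying assumption; floating holidays like Thanksgiving would need a proper calendar. The names `daily`, `fixed_holiday`, `base_fit` and `holiday_fit` are made up for this example, and I haven't reported what the comparison shows.

```r
url <- "https://raw.githubusercontent.com/fivethirtyeight/data/master/births/US_births_2000-2014_SSA.csv"
daily <- read.csv(url)

# Assumption: only three fixed-date holidays are flagged.
# Done before the factor conversion, while month and date are still numeric.
daily$fixed_holiday <-
  (daily$month == 1  & daily$date_of_month == 1) |
  (daily$month == 7  & daily$date_of_month == 4) |
  (daily$month == 12 & daily$date_of_month == 25)

for (col in c("month", "date_of_month", "day_of_week")) {
  daily[[col]] <- as.factor(daily[[col]])
}

# The original model, and the same model with the holiday flag added.
base_fit <- lm(births ~ month + date_of_month + day_of_week, data = daily)
holiday_fit <- lm(births ~ month + date_of_month + day_of_week + fixed_holiday,
                  data = daily)

# Compare adjusted R-squared, and look at the estimated holiday effect.
c(base = summary(base_fit)$adj.r.squared,
  with_holiday = summary(holiday_fit)$adj.r.squared)
coef(holiday_fit)["fixed_holidayTRUE"]
```

A single shared flag assumes all three holidays shift births by the same amount; giving each holiday its own indicator would relax that.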