I got the data for my friend’s birthdays through facebook. First, I saved the calendar with the birthdays into an .ics (iCalendar) file. I then used projectwizards.com to translate that file into excel format. In excel I then separated the names from the birthdays using the ‘text to columns’ function under the ‘data’ tab, at the same time stripping off extra characters in the data. That resulted in a two column file containing the ‘Name’ and ‘Birthdate’ of each of my friends.


Let’s Get Started

Upload the csv file and install some packages and we are on our way.

getwd()
## [1] "/Users/jacob/Downloads"
bd <- read.csv("Friend's_Birthdays.csv")

install.packages('ggplot2', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz/src/contrib:
##   cannot download all files
## Warning: package 'ggplot2' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz/bin/macosx/mavericks/contrib/3.3:
##   cannot download all files
library(ggplot2)
install.packages('lubridate', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz/src/contrib:
##   cannot download all files
## Warning: package 'lubridate' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz/bin/macosx/mavericks/contrib/3.3:
##   cannot download all files
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
install.packages('dplyr', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz/src/contrib:
##   cannot download all files
## Warning: package 'dplyr' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz/bin/macosx/mavericks/contrib/3.3:
##   cannot download all files
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Quick Summary

summary(bd)
##             Name        Birthdate  
##  Jacob Dickey :  2   5/24/17 :  8  
##  Julia Weber  :  2   12/10/16:  6  
##  Aaron Dickey :  1   6/22/17 :  6  
##  Aaron Orr    :  1   2/12/17 :  5  
##  Aaron Zell   :  1   5/20/17 :  5  
##  Abbey Buroker:  1   7/13/17 :  5  
##  (Other)      :553   (Other) :526

One thing I noticed right off the bat was that the birth year of each friend was not included. Since I pulled the data from the facebook calendar, the birthdates noted were for this year and part of last.

Also, there are only 558 observations when I have 636 facebook friends. This means that there are 78 birthdays that are not accounted for. This is likely because those friends chose not to share their birthday on facebook.

I also noticed that there were two records for ‘Jacob Dickey’ and ‘Julia Weber’. The records for ‘Jacob Dickey’ are valid as I happen to have two friends on facebook that share my name. This also means that I am not included in this dataset. The two records for Julia Weber however, are for the same person. This will cause her birthday count to be off by one. Another thing I notice is that May 24th is the most popular birthday of my friends with a count of 8.


Quick Fix Date Formatting

bd$Birthdate <- as.Date(bd$Birthdate,
  format = "%m/%d/%y")

The format of the dates are adapted so that R studio can read them properly. They were in the formate of (09/17/2017) and they were changed to the format of (2017-09-17).


Closer Look

qplot(data = bd, x = Birthdate, binwidth = 1) + 
  scale_x_date(date_minor_breaks = "1 day", date_labels = "%b %d")
## Warning: Removed 3 rows containing non-finite values (stat_bin).

One thing that I notice is that there is a period of about a week and a half where there are at least two birthdays per day a little after the ‘July 01’ break. Inversely there is a few days in a row just before ‘Jan 01’ where there are no birthdays at all, therefore additionally not everyday has a birthday being celebrated.


Objective

After looking at the summary of the data, some questions came to mind that I wanted answers for.

Questions:

  1. What month has the most birthdays?

  2. How many people share a birthday with me?

  3. What season in the next year has the most birthdays?

  4. What is the average number of birthdays per week?

  5. How many days are there where none of my friends have a birthday?


1. What month has the most birthdays?

bm <- format(bd$Birthdate, "%m")
bd <- subset(bd, !is.na(Birthdate))
bm <- format(bd$Birthdate, "%m")

To start off I extracted the birth months of the “Birthdate” field and put them into a variable called bm (BirthMonth). Then I noticed something off about the data, there were some values with “NA”. To fix this problem, I simply omitted the 3 observations with “NA” values.

qplot(x = bm, 
      color = I('black'), fill = I('orange'), 
      xlab = 'Month', 
      ylab = 'Number of Birthdays') +
  geom_abline(intercept = 49, slope = 0, color = 'black', size = 1, lty = 'dashed')

mytable <- table(bm)
names(mytable) <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun' , 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
mytable
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 
##  34  46  41  43  57  52  63  35  42  49  46  50
max(mytable)
## [1] 63

Next, I plotted a histogram of the distribution of birthdays by month and also a table of values to get the exact numbers. I used the ‘names’ function to make the table more readable. I can tell clearly on both histogram and table that July has the most birthdays. It has 63 of the 558 total birthdays (11.3%) which is 36.1% more than the average of 49 birthdays per month (8.3%).

min(mytable)
## [1] 34

Inversely, January has the least number of birthdays with 34 (5.8%) which is 30.1% lower than the average.

mytable[mytable > 49]
## May Jun Jul Dec 
##  57  52  63  50
mytable[mytable < 49]
## Jan Feb Mar Apr Aug Sep Nov 
##  34  46  41  43  35  42  46
mytable[mytable == 49]
## Oct 
##  49

The higher percent difference from the average was above by 6%. This is also reflected by the lower number of months above average (4) than below (7).


2. How many people share a birthday with me?

table(format(bd$Birthdate, "%m %d") == "09 17")
## 
## FALSE  TRUE 
##   556     2

This one was easy to find out. I again formatted the “Birthdate” variable so that it was month then day. I then tabled the result to find a count of boolean responses to whether they match my birthday. Here the answer is two.

Side Note: I actually happen to already know who the two are as one is my cousin Eris Harcourt and the other my friend Allison Gaffney. All three of us were born on the exact same day, and Allison and I were actually delivered in the same hospital!


3. What season in the next year has the most birthdays?

fs <- format(bd$Birthdate, "%y %m %d")

min(bd$Birthdate)
## [1] "2016-10-21"
max(bd$Birthdate)
## [1] "2017-10-20"
Fall <- table(fs <= "16 12 21" | fs >= "17 09 22") 
Winter <- table(fs >= "16 12 21" & fs <= "17 03 20") 
Spring <- table(fs >= "17 03 20" & fs <= "17 06 21") 
Summer <- table(fs >= "17 06 21" & fs <= "17 09 22") 

Fall
## 
## FALSE  TRUE 
##   403   155
Winter 
## 
## FALSE  TRUE 
##   441   117
Spring 
## 
## FALSE  TRUE 
##   407   151
Summer
## 
## FALSE  TRUE 
##   416   142
Fall + Winter + Spring + Summer
## 
## FALSE  TRUE 
##  1667   565

To solve this, I reformatted the date so that the year was before the month then day into a variable called fs (FormatSeasons). This way it was easily divided as they were also numerically ordered. I divided them including the day of the season changes each way because technically the solstices and equinoxes occur at a specific time that day so the day itself would consist of two seasons.

The season that has the most birthdays is Fall, closely followed by Spring.

table(fs == "16 12 21") + 
  table(fs == "17 03 20") + 
  table(fs == "17 06 21") + 
  table(fs == "17 09 22")
## 
## FALSE  TRUE 
##  2225     7

Looks like the count is correct as I get a count of 565 which is equal to the 558 birthdays plus the 7 birthdays occuring on solstices or equinoxes that were counted twice.


4. What is the average number of birthdays per week?

ggplot(bd, aes(x = Birthdate)) +
  geom_histogram(binwidth = 7, color = 'orange', fill = 'black') +
  stat_bin(aes(y=..count.., label=..count..), 
             geom="text", vjust = -0.5, binwidth = 7) +
  scale_x_date(date_minor_breaks = "1 week", date_labels = "%b %d") +
  scale_y_continuous(breaks = seq(0, 25, 1)) +
  xlab("Birth Month") + 
  geom_abline(intercept = 10.7, slope = 0, 
              color = I('#afafaf'), size = 0.75, lty = 'solid')

The average number of birthdays per week is 11 (or more exactly, 10.7). Here I made the bins for the histogram to represent the weeks of the dataset. What I found interesting was that you can more easily see the two observations I had about the first histogram. The week just before Jan. 1 has the lowest number of birthdays with 3, and the week a little after Jul. 1 has 4 consecutive weeks with 12 birthdays or more.


5. How many days are there where none of my friends have a birthday?

n_distinct(bd$Birthdate) == 365
## [1] FALSE
365 - n_distinct(bd$Birthdate)
## [1] 80

Again, as also shown by the first histogram, it is apparent that not every day has a birthday. By counting the distinct values of the ‘Birthdate’ variable, we can then determine how many days have birthdays and conclude the rest don’t. There are 285 days with birthdays meaning there are 80 days without, meaning 22% of the year doesn’t have a birthday.