I got the data for my friend’s birthdays through facebook. First, I saved the calendar with the birthdays into an .ics (iCalendar) file. I then used projectwizards.com to translate that file into excel format. In excel I then separated the names from the birthdays using the ‘text to columns’ function under the ‘data’ tab, at the same time stripping off extra characters in the data. That resulted in a two column file containing the ‘Name’ and ‘Birthdate’ of each of my friends.
Upload the csv file and install some packages and we are on our way.
getwd()
## [1] "/Users/jacob/Downloads"
bd <- read.csv("Friend's_Birthdays.csv")
install.packages('ggplot2', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz/src/contrib:
## cannot download all files
## Warning: package 'ggplot2' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/ggplot2_2.1.0.tgz/bin/macosx/mavericks/contrib/3.3:
## cannot download all files
library(ggplot2)
install.packages('lubridate', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz/src/contrib:
## cannot download all files
## Warning: package 'lubridate' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/lubridate_1.6.0.tgz/bin/macosx/mavericks/contrib/3.3:
## cannot download all files
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
install.packages('dplyr', repos = 'https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz')
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz/src/contrib:
## cannot download all files
## Warning: package 'dplyr' is not available (for R version 3.3.1)
## Warning: unable to access index for repository https://cran.rstudio.com/bin/macosx/mavericks/contrib/3.3/dplyr_0.5.0.tgz/bin/macosx/mavericks/contrib/3.3:
## cannot download all files
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
summary(bd)
## Name Birthdate
## Jacob Dickey : 2 5/24/17 : 8
## Julia Weber : 2 12/10/16: 6
## Aaron Dickey : 1 6/22/17 : 6
## Aaron Orr : 1 2/12/17 : 5
## Aaron Zell : 1 5/20/17 : 5
## Abbey Buroker: 1 7/13/17 : 5
## (Other) :553 (Other) :526
One thing I noticed right off the bat was that the birth year of each friend was not included. Since I pulled the data from the facebook calendar, the birthdates noted were for this year and part of last.
Also, there are only 558 observations when I have 636 facebook friends. This means that there are 78 birthdays that are not accounted for. This is likely because those friends chose not to share their birthday on facebook.
I also noticed that there were two records for ‘Jacob Dickey’ and ‘Julia Weber’. The records for ‘Jacob Dickey’ are valid as I happen to have two friends on facebook that share my name. This also means that I am not included in this dataset. The two records for Julia Weber however, are for the same person. This will cause her birthday count to be off by one. Another thing I notice is that May 24th is the most popular birthday of my friends with a count of 8.
bd$Birthdate <- as.Date(bd$Birthdate,
format = "%m/%d/%y")
The format of the dates are adapted so that R studio can read them properly. They were in the formate of (09/17/2017) and they were changed to the format of (2017-09-17).
qplot(data = bd, x = Birthdate, binwidth = 1) +
scale_x_date(date_minor_breaks = "1 day", date_labels = "%b %d")
## Warning: Removed 3 rows containing non-finite values (stat_bin).
One thing that I notice is that there is a period of about a week and a half where there are at least two birthdays per day a little after the ‘July 01’ break. Inversely there is a few days in a row just before ‘Jan 01’ where there are no birthdays at all, therefore additionally not everyday has a birthday being celebrated.
After looking at the summary of the data, some questions came to mind that I wanted answers for.
Questions:
What month has the most birthdays?
How many people share a birthday with me?
What season in the next year has the most birthdays?
What is the average number of birthdays per week?
How many days are there where none of my friends have a birthday?
bm <- format(bd$Birthdate, "%m")
bd <- subset(bd, !is.na(Birthdate))
bm <- format(bd$Birthdate, "%m")
To start off I extracted the birth months of the “Birthdate” field and put them into a variable called bm (BirthMonth). Then I noticed something off about the data, there were some values with “NA”. To fix this problem, I simply omitted the 3 observations with “NA” values.
qplot(x = bm,
color = I('black'), fill = I('orange'),
xlab = 'Month',
ylab = 'Number of Birthdays') +
geom_abline(intercept = 49, slope = 0, color = 'black', size = 1, lty = 'dashed')
mytable <- table(bm)
names(mytable) <- c('Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun' , 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec')
mytable
## Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 34 46 41 43 57 52 63 35 42 49 46 50
max(mytable)
## [1] 63
Next, I plotted a histogram of the distribution of birthdays by month and also a table of values to get the exact numbers. I used the ‘names’ function to make the table more readable. I can tell clearly on both histogram and table that July has the most birthdays. It has 63 of the 558 total birthdays (11.3%) which is 36.1% more than the average of 49 birthdays per month (8.3%).
min(mytable)
## [1] 34
Inversely, January has the least number of birthdays with 34 (5.8%) which is 30.1% lower than the average.
mytable[mytable > 49]
## May Jun Jul Dec
## 57 52 63 50
mytable[mytable < 49]
## Jan Feb Mar Apr Aug Sep Nov
## 34 46 41 43 35 42 46
mytable[mytable == 49]
## Oct
## 49
The higher percent difference from the average was above by 6%. This is also reflected by the lower number of months above average (4) than below (7).
fs <- format(bd$Birthdate, "%y %m %d")
min(bd$Birthdate)
## [1] "2016-10-21"
max(bd$Birthdate)
## [1] "2017-10-20"
Fall <- table(fs <= "16 12 21" | fs >= "17 09 22")
Winter <- table(fs >= "16 12 21" & fs <= "17 03 20")
Spring <- table(fs >= "17 03 20" & fs <= "17 06 21")
Summer <- table(fs >= "17 06 21" & fs <= "17 09 22")
Fall
##
## FALSE TRUE
## 403 155
Winter
##
## FALSE TRUE
## 441 117
Spring
##
## FALSE TRUE
## 407 151
Summer
##
## FALSE TRUE
## 416 142
Fall + Winter + Spring + Summer
##
## FALSE TRUE
## 1667 565
To solve this, I reformatted the date so that the year was before the month then day into a variable called fs (FormatSeasons). This way it was easily divided as they were also numerically ordered. I divided them including the day of the season changes each way because technically the solstices and equinoxes occur at a specific time that day so the day itself would consist of two seasons.
The season that has the most birthdays is Fall, closely followed by Spring.
table(fs == "16 12 21") +
table(fs == "17 03 20") +
table(fs == "17 06 21") +
table(fs == "17 09 22")
##
## FALSE TRUE
## 2225 7
Looks like the count is correct as I get a count of 565 which is equal to the 558 birthdays plus the 7 birthdays occuring on solstices or equinoxes that were counted twice.
ggplot(bd, aes(x = Birthdate)) +
geom_histogram(binwidth = 7, color = 'orange', fill = 'black') +
stat_bin(aes(y=..count.., label=..count..),
geom="text", vjust = -0.5, binwidth = 7) +
scale_x_date(date_minor_breaks = "1 week", date_labels = "%b %d") +
scale_y_continuous(breaks = seq(0, 25, 1)) +
xlab("Birth Month") +
geom_abline(intercept = 10.7, slope = 0,
color = I('#afafaf'), size = 0.75, lty = 'solid')
The average number of birthdays per week is 11 (or more exactly, 10.7). Here I made the bins for the histogram to represent the weeks of the dataset. What I found interesting was that you can more easily see the two observations I had about the first histogram. The week just before Jan. 1 has the lowest number of birthdays with 3, and the week a little after Jul. 1 has 4 consecutive weeks with 12 birthdays or more.
n_distinct(bd$Birthdate) == 365
## [1] FALSE
365 - n_distinct(bd$Birthdate)
## [1] 80
Again, as also shown by the first histogram, it is apparent that not every day has a birthday. By counting the distinct values of the ‘Birthdate’ variable, we can then determine how many days have birthdays and conclude the rest don’t. There are 285 days with birthdays meaning there are 80 days without, meaning 22% of the year doesn’t have a birthday.