This short project is part of Udacity’s “Data Analysis with R” course. It aims at exploring and drawing conclusions from a dataset containing birthdays of Facebook friends.
Since I closed my Facebook account many years ago, I used a mock dataset provided by Udacity. These data do not contain any information wih regards to birth year, only day and month. This will limit the scope of this analysis somewhat, but the essential work will remain the same.
I first load the data and look at its structure.
fb <- read.csv('./birthdaysExample.csv', stringsAsFactors = FALSE)
str(fb)
## 'data.frame': 1033 obs. of 1 variable:
## $ dates: chr "11/25/14" "6/8/14" "9/12/14" "5/26/14" ...
As expected, all the dates have been imported as strings, so we now need to do some work to turn them into dates. The lubridate package makes this task very easy:
fb$dates <- mdy(fb$dates)
str(fb)
## 'data.frame': 1033 obs. of 1 variable:
## $ dates: POSIXct, format: "2014-11-25" "2014-06-08" ...
Before going further, we can take a look at the distribution of birthdays by week along the year, using a histogram:
binw_days <- 7
qplot(x = dates, data = fb, binwidth = 24 * 3600 * binw_days,
color = I("black"),
fill = I("thistle")) + # Convert bin width to seconds
theme_minimal()
There are some peaks and valleys such as in March (peak) and July/August (valley) but no obvious pattern.
Note: The data contains a year (2014) but this only corresponds to the year when the data was generated and should be disregarded.
Let’s look for the days with the most birthdays:
fb_grp <- fb %>% group_by(dates) %>% summarise(count = n())
max_count <- max(fb_grp$count)
# Most birthdays:
print(fb_grp[fb_grp$count == max_count, ])
## Source: local data frame [3 x 2]
##
## dates count
## (time) (int)
## 1 2014-02-06 8
## 2 2014-05-22 8
## 3 2014-07-16 8
3 of the days days have 8 birthdays each.
Out of curiosity, let’s see how many ‘friends’ share my birthday (February 23rd):
print(fb_grp[fb_grp$dates == ymd("2014-02-23"), ])
## Source: local data frame [1 x 2]
##
## dates count
## (time) (int)
## 1 2014-02-23 7
So my birthday is actually a fairly busy day, with 7 people sharing it.
Are there any days without birthdays?
# Number of unique dates in the dataset:
length(unique(fb$dates))
## [1] 348
According to this, there are 17 days where nobody in the sample was born. To see them, we need to generate a vector of all the days in the year, then look at the set differences between this vector and the one containing birthdays.
Note: In theory and to make sure we capture every possible day, we have to do this for a leap year (such as 2016). However the dataset uses 2014 as the year for all birthdays and therefore does not contain any on February 29th, so we can ignore this detail.
all_days <- seq(ymd("2014-01-01", tz = "UTC"), ymd("2014-12-31", tz = "UTC"), 3600 * 24)
no_bdays <- as.POSIXct(setdiff(all_days, fb$dates), tz = "UTC", origin = "1970-01-01")
print(no_bdays)
## [1] "2014-02-08 UTC" "2014-02-21 UTC" "2014-02-22 UTC" "2014-03-06 UTC"
## [5] "2014-04-16 UTC" "2014-04-21 UTC" "2014-05-03 UTC" "2014-05-24 UTC"
## [9] "2014-06-26 UTC" "2014-08-03 UTC" "2014-08-06 UTC" "2014-08-23 UTC"
## [13] "2014-11-11 UTC" "2014-11-13 UTC" "2014-12-06 UTC" "2014-12-13 UTC"
## [17] "2014-12-23 UTC"
We will now look into the data in more detail by breaking it down by month and calendar day. To that end, let’s transform the dataset:
# Create a new dataset with individual columns for months and days:
fb_md <- fb %>% mutate(b_month = month(dates), b_day = day(dates)) %>% select(-dates)
head(fb_md)
## b_month b_day
## 1 11 25
## 2 6 8
## 3 9 12
## 4 5 26
## 5 2 20
## 6 6 19
How are birthdays distributed by month?
qplot(x = b_month, data = fb_md, binwidth = 1,
color = I('black'), fill = I('thistle')) +
scale_x_continuous(limits = c(0, 13), breaks = seq(1, 12, 1)) +
xlab("Month") + ylab("Birthday Count") +
theme_minimal()
March is the month with the most birthdays. Let’s count them:
sum(fb_md$b_month == 3)
## [1] 98
May and December seem tied for last position:
sum(fb_md$b_month == 5)
## [1] 72
sum(fb_md$b_month == 12)
## [1] 72
And indeed they are, at 72 each.
How are birthdays distributed by day in the month? Are there any patterns (we would expect not)?
qplot(x = b_day, data = fb_md, binwidth = 1,
color = I('black'), fill = I('thistle')) +
scale_x_continuous(limits = c(0, 32), breaks = seq(1, 31, 1)) +
xlab("Day of the Month") + ylab("Birthday Count") +
theme_minimal()
Obviously the 31st has less birthdays than any other day because only 7 months out of 12 have that many days. There is a large peak on the 14th which is quite surprising. It could just be due to random noise, but it might also not be a coincidence that the data was generated in 2014. Could it be that somewhere in the generation process, the day and year got mixed up? We cannot answer this question without knowing more about how the dataset was created.
From this dataset of 1033 ‘friends’, we were able to draw the following conclusions: