The following is a brief EDA study, required for Problem Set 3 of Udacity’s Data Analysis in R. It concerns my Facebook friends’ birthdays. Since I do not use Facebook, I studied the supplied birthdays example file. The following code imports the data and uses functions from the lubridate library to create various columns in a data frame, ‘birthdays’.

library(lubridate)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 3')
birthdays = read.csv('birthdaysExample.csv')
birthdays$mdy = mdy(birthdays$dates) # parse the 'dates' column (month/day/year strings)
birthdays$day = day(birthdays$mdy)
birthdays$month = month(birthdays$mdy,label=TRUE)
birthdays$year = year(birthdays$mdy)
birthdays$yday = yday(birthdays$mdy) # day of the year
birthdays$wday = wday(birthdays$mdy,label=TRUE) # day of the week as a label (week starts on Sunday)
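
As a quick sanity check on the derived columns, the same values can be reproduced for a single date with base R alone. This is only a sketch: the date string is one that appears in the example file, the variable names are made up for illustration, and base R's `format()` is used as a stand-in for the lubridate helpers.

```r
# Parse a month/day/two-digit-year string, the format used in the example file.
d <- as.Date("2/6/14", format = "%m/%d/%y")

# Base R counterparts of lubridate's day(), yday() and year().
day_of_month <- as.integer(format(d, "%d"))  # day of the month
day_of_year  <- as.integer(format(d, "%j"))  # day of the year
year_value   <- as.integer(format(d, "%Y"))  # four-digit year

print(c(day = day_of_month, yday = day_of_year, year = year_value))
```

The day of the year for 2/6/14 comes out as 37, which agrees with the yday column in the output further down.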

I produced various histograms, below, showing birthdays by month, day, day of the week, day of the year, and so on.

Below I answer several questions about the data set, creating intermediate data frames as needed.

How many people share your birthday?

ANS: 2

dplyr::filter(birthdays,day==20 & month=='Jan')
##     dates        mdy day month year yday wday
## 1 1/20/14 2014-01-20  20   Jan 2014   20  Mon
## 2 1/20/14 2014-01-20  20   Jan 2014   20  Mon

Which month contains the most birthdays?

ANS: Inspecting the column df2$ones below shows that March has the most birthdays (98).

I created a column, ones, and filled it with 1 for each row of the original data frame. Grouping by month and summing this column gives the total number of birthdays in each month.

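birthdays$ones = rep(1,nrow(birthdays)) # helper column: 1 for every row
df1 = dplyr::group_by(birthdays,month)
# Sum only the helper column; summarise_each() and funs() are deprecated,
# and summing the non-numeric columns fails in current dplyr versions.
df2 = dplyr::summarise(df1,ones=sum(ones))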

How many birthdays are in each month?

ANS: The number of birthdays in each month (Jan to Dec) is 89, 79, 98, 81, 72, 93, 86, 91, 96, 89, 87, 72.

df2$ones
##  [1] 89 79 98 81 72 93 86 91 96 89 87 72
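
The monthly counts can also be obtained without a helper column. The sketch below uses base R's `table()` on a small made-up data frame; `toy` and its values are assumptions for illustration and are not taken from the example file.

```r
# Toy data standing in for the 'birthdays' data frame (hypothetical values).
toy <- data.frame(
  month = factor(c("Jan", "Mar", "Mar", "Feb", "Mar", "Jan"),
                 levels = month.abb)
)

# table() counts rows per factor level, so no column of ones is needed.
month_counts <- table(toy$month)

# Name of the month with the highest count.
busiest <- names(month_counts)[which.max(month_counts)]

print(month_counts)
print(busiest)
```

Because the factor carries all twelve month levels, months with no birthdays show up with a count of zero instead of being dropped.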

Which day of the year has the most birthdays?

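ANS: The maximum number of birthdays on a single day is 8, and three days reach it. One of them is 6-Feb (day 37 of the year), which has 8 rows in the output below. The values 135 and 188 returned by which() are row positions in df4, not days of the year, and df4 has only 348 rows, so they do not identify the other two days: the lookups for 15-May and 7-Jul return just 1 and 5 birthdays. The remaining two dates should be read from df4$yday[df4$ones==8].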

df3 = dplyr::group_by(birthdays,yday)
df4 = dplyr::summarise(df3,ones=sum(ones)) # birthdays per day of the year (summarise_each/funs are deprecated)
max(df4$ones) # maximum number of birthdays, result 8
## [1] 8
which(df4$ones==8) # row positions in df4 (not necessarily yday values): 37, 135, 188
## [1]  37 135 188
subset(birthdays,birthdays$yday==37) # find date corresponding to yday=37, result 2/6/14
##       dates        mdy day month year yday  wday ones
## 159  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 277  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 311  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 367  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 408  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 843  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 974  2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
## 1009 2/6/14 2014-02-06   6   Feb 2014   37 Thurs    1
subset(birthdays,birthdays$yday==135) # yday=135 is 5/15/14, but it has only 1 birthday (135 was a row position)
##       dates        mdy day month year yday  wday ones
## 180 5/15/14 2014-05-15  15   May 2014  135 Thurs    1
subset(birthdays,birthdays$yday==188) # yday=188 is 7/7/14, but it has only 5 birthdays (188 was a row position)
##      dates        mdy day month year yday wday ones
## 388 7/7/14 2014-07-07   7   Jul 2014  188  Mon    1
## 529 7/7/14 2014-07-07   7   Jul 2014  188  Mon    1
## 748 7/7/14 2014-07-07   7   Jul 2014  188  Mon    1
## 767 7/7/14 2014-07-07   7   Jul 2014  188  Mon    1
## 987 7/7/14 2014-07-07   7   Jul 2014  188  Mon    1
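
Note that which() returns row positions, and these equal yday values only when no day of the year is missing from the grouped table. A minimal sketch of reading the days directly follows; `toy_counts` is a made-up stand-in for df4, with day 3 deliberately missing so that positions and days diverge.

```r
# Toy per-day counts standing in for df4 (hypothetical values).
toy_counts <- data.frame(
  yday = c(1, 2, 4, 5, 7),
  ones = c(2, 8, 1, 8, 3)
)

max_count <- max(toy_counts$ones)

# Row positions of the maximum: this is what which() returns.
rows_at_max <- which(toy_counts$ones == max_count)

# Days of the year stored at those positions.
days_at_max <- toy_counts$yday[rows_at_max]

# Convert day-of-year back to a calendar date in 2014 (day 1 = 2014-01-01).
dates_at_max <- as.Date(days_at_max - 1, origin = "2014-01-01")

print(rows_at_max)
print(days_at_max)
print(dates_at_max)
```

In the toy data the second maximum sits in row 4 but belongs to day 5, which is the same kind of mismatch that affects the lookups for 135 and 188 above.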

Do you have at least 365 friends that have birthdays on every day of the year?

ANS: No. The histogram of birthdays by day of the year has gaps. Also, df4 has one row per distinct day of the year on which a birthday occurs, and nrow(df4) = 348 < 365, so 17 days of the year have no birthday.

nrow(df4)
## [1] 348
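
To see which days have no birthday, rather than only how many, the observed days can be compared against the full range of days. The sketch below uses a made-up vector `toy_ydays` and a 7-day "year" for brevity; with the real data the range would be 1:365 and the vector would be birthdays$yday.

```r
# Hypothetical day-of-year values, standing in for birthdays$yday.
toy_ydays <- c(1, 2, 2, 4, 5, 5, 7)

# Full range of days to check (1:365 for the real 2014 data).
all_days <- 1:7

# Days in the range that never occur in the data.
missing_days <- setdiff(all_days, toy_ydays)

# Number of distinct days covered, the counterpart of nrow(df4).
n_covered <- length(unique(toy_ydays))

print(missing_days)
print(n_covered)
```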