The following is a brief EDA study, required for problem set 3 of Udacity’s Data Analysis in R. It concerns my Facebook friends’ birthdays. Since I do not use Facebook, I studied the supplied birthdays example file instead. The following code imports the data and uses lubridate functions to add several date-derived columns to a data frame, ‘birthdays’.
library(lubridate)
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
setwd('/Users/christopherkaalund/Documents/Study/Udacity Data Science/Data Analysis in R/Problem Set 3')
birthdays = read.csv('birthdaysExample.csv')
birthdays$mdy = mdy(birthdays$dates) # the file's date column is named 'dates'
birthdays$day = day(birthdays$mdy)
birthdays$month = month(birthdays$mdy,label=TRUE)
birthdays$year = year(birthdays$mdy)
birthdays$yday = yday(birthdays$mdy) # day of the year
birthdays$wday = wday(birthdays$mdy,label=TRUE) # day of the week, as a labelled factor
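As a quick check of what these lubridate helpers return, here is a minimal sketch that parses a single date string in the same month/day/year form as the example file. The variable name `d` is only for illustration and is not part of the original analysis.

```r
library(lubridate)

# Parse one date in month/day/two-digit-year form, as in the example file
d <- mdy("2/6/14")

day(d)    # day of the month: 6
month(d)  # month as a number: 2 (label = TRUE returns an ordered factor instead)
year(d)   # 2014
yday(d)   # day of the year: 37 (31 days in January + 6)
```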
I produced various histograms, below, showing birthdays by month, day, day of the week, day of the year, and so on.
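The plotting code is not reproduced here, so the following is only a sketch of how such histograms might be drawn with ggplot2. The data frame `toy` is a hypothetical stand-in for `birthdays`, and the plot names `p_yday` and `p_month` are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the birthdays data frame built above
toy <- data.frame(
  yday  = c(37, 37, 135, 188, 188, 300),
  month = factor(c("Feb", "Feb", "May", "Jul", "Jul", "Oct"), levels = month.abb)
)

# One bar per day of the year; binwidth = 1 keeps empty days visible as gaps
p_yday <- ggplot(toy, aes(x = yday)) +
  geom_histogram(binwidth = 1) +
  labs(x = "Day of the year", y = "Number of birthdays")

# month is a factor, so geom_bar() counts rows per level
p_month <- ggplot(toy, aes(x = month)) +
  geom_bar() +
  labs(x = "Month", y = "Number of birthdays")
```

Printing `p_yday` or `p_month` at the console renders the plot.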
Here I answer the questions about the data set, creating several summary data frames along the way.
ANS: Inspecting the column df2$ones below shows that the largest number of birthdays occurs in March (98 birthdays).
I created a column, ones, filled with a one for each row of the original data frame. Grouping by month and summing this column gives the total number of birthdays in each month.
ones = rep(1, nrow(birthdays)) # one per row, so that summing counts rows
birthdays$ones = ones
df1 = dplyr::group_by(birthdays,month)
df2 = dplyr::summarise_each(df1,funs(sum))
ANS: The number of birthdays in each month (Jan to Dec) is 89, 79, 98, 81, 72, 93, 86, 91, 96, 89, 87, 72.
df2$ones
## [1] 89 79 98 81 72 93 86 91 96 89 87 72
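summarise_each() and funs() were deprecated in later dplyr releases, and the helper column of ones can be avoided by counting rows directly. The following is a minimal sketch of that alternative, not the method used above; `toy` and `per_month` are hypothetical names, and the data are made up for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the month column of the birthdays data frame
toy <- data.frame(
  month = factor(c("Jan", "Mar", "Mar", "Dec", "Mar"), levels = month.abb)
)

# count() groups by month and tallies the rows in one step;
# the result has one row per month that appears, with the tally in column n
per_month <- count(toy, month)
per_month

# The month with the most birthdays, read from the month column itself
as.character(per_month$month[which.max(per_month$n)])
```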
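ANS: The maximum number of birthdays on a single day is 8, and three days reach it. 6-Feb is one of them (8 rows below). The other two cannot be read off directly from the which() result: it returns row positions in df4, and because df4 skips days with no birthdays, positions 135 and 188 are not yday 135 and 188. The subsets below confirm this, showing only 1 birthday on 15-May and 5 on 7-Jul.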
df3 = dplyr::group_by(birthdays,yday)
df4 = dplyr::summarise_each(df3,funs(sum))
max(df4$ones) # maximum number of birthdays, result 8
## [1] 8
which(df4$ones==8) # row positions in df4 (not yday values) where the count is 8: 37, 135, 188
## [1] 37 135 188
subset(birthdays,birthdays$yday==37) # find date corresponding to yday=37, result 2/6/14
## dates mdy day month year yday wday ones
## 159 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 277 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 311 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 367 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 408 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 843 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 974 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
## 1009 2/6/14 2014-02-06 6 Feb 2014 37 Thurs 1
subset(birthdays,birthdays$yday==135) # rows with yday=135: 5/15/14, only 1 birthday
## dates mdy day month year yday wday ones
## 180 5/15/14 2014-05-15 15 May 2014 135 Thurs 1
subset(birthdays,birthdays$yday==188) # rows with yday=188: 7/7/14, 5 birthdays
## dates mdy day month year yday wday ones
## 388 7/7/14 2014-07-07 7 Jul 2014 188 Mon 1
## 529 7/7/14 2014-07-07 7 Jul 2014 188 Mon 1
## 748 7/7/14 2014-07-07 7 Jul 2014 188 Mon 1
## 767 7/7/14 2014-07-07 7 Jul 2014 188 Mon 1
## 987 7/7/14 2014-07-07 7 Jul 2014 188 Mon 1
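Because a per-day summary has no row for days without birthdays, a row position and a day-of-year value can differ. The sketch below illustrates the distinction on made-up data; `counts` is a hypothetical data frame shaped like df4, and its values are not taken from the birthdays file.

```r
# Hypothetical per-day summary shaped like df4: some days of the year are absent
counts <- data.frame(
  yday = c(1, 2, 5, 9, 12),
  ones = c(3, 8, 1, 8, 2)
)

# Row positions where the count equals the maximum
rows <- which(counts$ones == max(counts$ones))
rows                 # 2 4

# Day-of-year values at those positions; these differ from the
# positions once any day is missing from the summary
counts$yday[rows]    # 2 9

# Convert a day of the year back to a calendar date in 2014
peak_dates <- as.Date(counts$yday[rows] - 1, origin = "2014-01-01")
peak_dates
```

Applied to the real summary, the same indexing, `df4$yday[df4$ones == max(df4$ones)]`, should give the day-of-year values to pass to subset().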
ANS: No, there is not a birthday on every day of the year. The histogram by day of the year has gaps, and nrow(df4) = 348 < 365, so 17 days have no birthdays.
nrow(df4)
## [1] 348
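The days without any birthday can also be listed explicitly with a set difference. This is a minimal base R sketch on made-up values; `days_present` and `missing_days` are hypothetical names, and a 10-day range stands in for the full year.

```r
# Hypothetical vector of day-of-year values that have at least one birthday
days_present <- c(1, 2, 3, 5, 8)

# Days in a short 10-day "year" with no birthday
# (for the real data the range would be 1:365, since 2014 is not a leap year)
missing_days <- setdiff(1:10, days_present)
missing_days          # 4 6 7 9 10
length(missing_days)  # 5
```

For the real summary, `setdiff(1:365, df4$yday)` would be expected to return the 17 empty days implied by nrow(df4) = 348.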