Problem Set 3, Facebook Friends
Load the Required Packages
#install.packages("lubridate")
library(ggplot2)
library(lubridate)
library(gridExtra)
## Loading required package: grid
library(plyr)
##
## Attaching package: 'plyr'
##
## The following object is masked from 'package:lubridate':
##
## here
library(dplyr)
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:plyr':
##
## arrange, desc, failwith, id, mutate, summarise, summarize
##
## The following objects are masked from 'package:lubridate':
##
## intersect, setdiff, union
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Load the Data
fb <- read.csv('/Users/michaelreinhard/Google Drive/R/birthdaysExample.csv')
names(fb)
## [1] "dates"
head(fb)
## dates
## 1 11/25/14
## 2 6/8/14
## 3 9/12/14
## 4 5/26/14
## 5 2/20/14
## 6 6/19/14
summary(fb)
## dates
## 2/6/14 : 8
## 5/22/14: 8
## 7/16/14: 8
## 1/14/14: 7
## 2/2/14 : 7
## 2/23/14: 7
## (Other):988
Preliminary Data Preparation
dat.gd <- mdy(fb$dates)
head(dat.gd)
## [1] "2014-11-25 UTC" "2014-06-08 UTC" "2014-09-12 UTC" "2014-05-26 UTC"
## [5] "2014-02-20 UTC" "2014-06-19 UTC"
fb <- data.frame( Value=dat.gd, Year=year(dat.gd), Month=month(dat.gd), Day=day(dat.gd),WeeDa=wday(dat.gd,label=T, abbr=T))
head(fb)
## Value Year Month Day WeeDa
## 1 2014-11-25 2014 11 25 Tues
## 2 2014-06-08 2014 6 8 Sun
## 3 2014-09-12 2014 9 12 Fri
## 4 2014-05-26 2014 5 26 Mon
## 5 2014-02-20 2014 2 20 Thurs
## 6 2014-06-19 2014 6 19 Thurs
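For anyone without lubridate installed, the same preparation step can be sketched in base R. This is only an illustration on the six raw strings printed by head(fb) above; the names raw, d, and parts are my own and are not part of the original analysis.

```r
# The six raw strings shown by head(fb) before conversion
raw <- c("11/25/14", "6/8/14", "9/12/14", "5/26/14", "2/20/14", "6/19/14")

# %m/%d/%y reads month/day/two-digit year
d <- as.Date(raw, format = "%m/%d/%y")

# Pull the components out as integers
parts <- data.frame(
  Value = d,
  Year  = as.integer(format(d, "%Y")),
  Month = as.integer(format(d, "%m")),
  Day   = as.integer(format(d, "%d"))
)
print(parts)
```

Storing Month and Day as integers means later comparisons can be made against numbers rather than strings.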
So, here are my questions:
1) How many people have my birthday?
2) How many birthdays are in each month?
3) Which day of the year has the most birthdays?
4) Does someone have a birthday on each day of the year?
1. How many people have my birthday?
head(fb)
## Value Year Month Day WeeDa
## 1 2014-11-25 2014 11 25 Tues
## 2 2014-06-08 2014 6 8 Sun
## 3 2014-09-12 2014 9 12 Fri
## 4 2014-05-26 2014 5 26 Mon
## 5 2014-02-20 2014 2 20 Thurs
## 6 2014-06-19 2014 6 19 Thurs
myBday <- subset(fb, Month == "6" & Day == "15")
head(myBday)
## Value Year Month Day WeeDa
## 675 2014-06-15 2014 6 15 Sun
## 758 2014-06-15 2014 6 15 Sun
So there are two people in the data set who share my birthday.
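If only the count is needed, the matching rows do not have to be extracted at all. A minimal sketch, assuming a small made-up data frame (toy is hypothetical and is not the real fb data):

```r
# Toy data: six made-up birthdays, two of them on June 15
toy <- data.frame(
  Month = c(11, 6, 9, 6, 2, 6),
  Day   = c(25, 8, 12, 15, 20, 15)
)

# A logical vector sums to the number of TRUE values
n_match <- sum(toy$Month == 6 & toy$Day == 15)
print(n_match)
```

On the real data the same expression would be applied to fb$Month and fb$Day.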
2) How many birthdays are in each month?
OK, I am sure it is a horribly inefficient solution, but this at least works:
nrow(subset(fb, Month == "1"))
## [1] 89
nrow(subset(fb, Month == "2"))
## [1] 79
nrow(subset(fb, Month == "3"))
## [1] 98
nrow(subset(fb, Month == "4"))
## [1] 81
nrow(subset(fb, Month == "5"))
## [1] 72
nrow(subset(fb, Month == "6"))
## [1] 93
nrow(subset(fb, Month == "7"))
## [1] 86
nrow(subset(fb, Month == "8"))
## [1] 91
nrow(subset(fb, Month == "9"))
## [1] 96
nrow(subset(fb, Month == "10"))
## [1] 89
nrow(subset(fb, Month == "11"))
## [1] 87
nrow(subset(fb, Month == "12"))
## [1] 72
There has to be a better way, though. How about table?
table(fb$Month)
##
## 1 2 3 4 5 6 7 8 9 10 11 12
## 89 79 98 81 72 93 86 91 96 89 87 72
max(table(fb$Month))
## [1] 98
Now don’t I feel foolish?
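Once the counts are in a single vector, the busiest and quietest months can be read off by name. The sketch below types in the twelve counts reported by table(fb$Month) above; the names month_counts, busiest, and quietest are my own.

```r
# Monthly counts copied from the table(fb$Month) output
month_counts <- c(89, 79, 98, 81, 72, 93, 86, 91, 96, 89, 87, 72)
names(month_counts) <- month.abb  # built-in "Jan", "Feb", ...

# which.max returns the position of the first maximum
busiest <- names(which.max(month_counts))

# Comparing against min() keeps every month tied for the minimum
quietest <- names(month_counts)[month_counts == min(month_counts)]

print(busiest)
print(quietest)
```

Comparing against min() or max() rather than using which.min() or which.max() matters when there are ties, as there are here for the quietest month.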
3) Which day of the year has the most birthdays?
After spending a lot of time on the analysis below, I realized that the summary and head functions run on the data set at the very beginning had already told me the answer, but I include all this for educational value.
The first thing we have to do is create a variable for each unique month-day combination in the data set. Then I run table() and max() on that variable.
The first idea that seemed promising was to turn each of the original values in the data set into a factor variable. Then a simple summary command would give the number of times each unique date occurred.
fb$Value.Fac <- as.factor(fb$Value)
summary(fb$Value.Fac)
## 2014-02-06 2014-05-22 2014-07-16 2014-01-14 2014-02-02 2014-02-23
## 8 8 8 7 7 7
## 2014-04-14 2014-07-08 2014-08-18 2014-08-27 2014-09-14 2014-09-29
## 7 7 7 7 7 7
## 2014-01-09 2014-03-02 2014-03-16 2014-03-19 2014-03-21 2014-04-08
## 6 6 6 6 6 6
## 2014-06-10 2014-06-25 2014-06-30 2014-08-26 2014-09-01 2014-09-24
## 6 6 6 6 6 6
## 2014-11-03 2014-11-05 2014-11-17 2014-12-18 2014-01-03 2014-01-19
## 6 6 6 6 5 5
## 2014-01-22 2014-01-27 2014-02-10 2014-02-13 2014-02-24 2014-02-27
## 5 5 5 5 5 5
## 2014-03-13 2014-03-28 2014-04-23 2014-04-26 2014-06-06 2014-06-09
## 5 5 5 5 5 5
## 2014-06-17 2014-07-07 2014-07-19 2014-08-05 2014-08-07 2014-08-21
## 5 5 5 5 5 5
## 2014-09-16 2014-09-20 2014-10-14 2014-10-28 2014-11-23 2014-12-09
## 5 5 5 5 5 5
## 2014-12-14 2014-12-28 2014-01-01 2014-01-11 2014-01-13 2014-01-26
## 5 5 4 4 4 4
## 2014-02-16 2014-02-17 2014-03-09 2014-03-12 2014-03-20 2014-03-22
## 4 4 4 4 4 4
## 2014-03-24 2014-04-06 2014-04-12 2014-05-04 2014-05-07 2014-05-08
## 4 4 4 4 4 4
## 2014-05-18 2014-05-19 2014-05-28 2014-06-01 2014-06-02 2014-06-19
## 4 4 4 4 4 4
## 2014-06-23 2014-07-06 2014-07-18 2014-07-20 2014-08-14 2014-08-17
## 4 4 4 4 4 4
## 2014-08-31 2014-09-09 2014-09-10 2014-09-19 2014-09-21 2014-09-23
## 4 4 4 4 4 4
## 2014-10-04 2014-10-07 2014-10-11 2014-10-13 2014-10-21 2014-10-23
## 4 4 4 4 4 4
## 2014-10-29 2014-11-06 2014-11-08 (Other)
## 4 4 4 538
From this it was easy to see that three days, February 6, May 22 and July 16 were tied for having the most birthdays at 8. What concerns me about this is that the table ends with (Other) having the value of 538. I assume that means there are 538 cases left, but I would like to be sure.
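One way to check the reading of (Other) is to reproduce it on a tiny factor. summary() on a factor takes a maxsum argument (100 by default): when there are more levels than that, it shows the most frequent ones and lumps the remainder into (Other). The toy dates below are made up for illustration.

```r
# Toy factor: 5 distinct dates, 8 observations in total
dates <- factor(c(rep("2014-02-06", 3), rep("2014-05-22", 2),
                  "2014-01-01", "2014-03-09", "2014-07-16"))

# Ask for at most 3 entries: the top 2 levels plus "(Other)"
s <- summary(dates, maxsum = 3)
print(s)

# The entries, including "(Other)", add up to the number of observations
print(sum(s) == length(dates))
```

By the same logic, the 538 in the output above should be the number of rows that fall on the less frequent dates, and all the entries together should add up to the number of rows in fb.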
So, I looked at the histogram to see if it comported with my theory of what the numbers meant.
plot <- ggplot(fb, aes(Value.Fac))
plot + geom_histogram()
## Warning: position_stack requires constant width: output may be incorrect
This was good enough to show that I was probably interpreting the 538 number correctly, but it was hard to interpret since they were not in order. So I tried making it an ordered factor.
#fb$Value.Ord.Fac <- as.orderedfactor(fb$Value.Fac) # no such function
plot <- ggplot(fb, aes(sort(Value.Fac)))
plot + geom_histogram()
## Warning: position_stack requires constant width: output may be incorrect
Which looks exactly the same. What about if the data were in the original date format?
plot <- ggplot(fb, aes(sort(Value)))
plot + geom_histogram()
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
This certainly orders them, and now it seems clear that the reason for the curious appearance of the histogram before was that the variable was being sorted as a number with the months sorted first and then, within each month, the days being sorted. It is curious that there appear to be no gaps in the birthday data when it is sorted by date since, in looking at the summaries, it was clear that there were many days on which there were no birthdays. I investigate the possibility this could be due to the size of the bins being so broad as to cover the gaps.
plot <- ggplot(fb, aes(sort(Value.Fac)))
plot + geom_histogram(binwidth = 1)
## Warning: position_stack requires constant width: output may be incorrect
So now I am reasonably confident that I have identified the dates with the most birthdays.
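The same answer can be computed directly, without reading it off a plot. A minimal sketch on made-up dates (toy_dates, key, counts, and top are hypothetical names): build a month-day key with format(), tabulate it, and keep every key tied for the maximum.

```r
# Toy data: two dates occur twice, two occur once
toy_dates <- as.Date(c("2014-02-06", "2014-02-06", "2014-05-22",
                       "2014-05-22", "2014-07-16", "2014-11-25"))

# One key per month-day combination, ignoring the year
key <- format(toy_dates, "%m-%d")

# Count how often each key occurs
counts <- table(key)

# Keep all keys tied for the highest count
top <- counts[counts == max(counts)]
print(top)
```

On the real data, key would be built from fb$Value instead of toy_dates.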
4) Does at least one birthday occur on each day of the year?
By inspection of the histogram produced before, we can see that there are some days without birthdays, quite a few in fact. The gaps in the histogram could be one day or a whole block of days, after all.
Grid <- as.data.frame(expand.grid(unique(fb$Month), unique(fb$Day)))
plotGrid <- ggplot(Grid, aes(Var2))
plotGrid + geom_histogram()
I am giving up, but what I would like to do is make a histogram of all the calendar dates in the year and map the existing dates in the data set to that. I would like to see the actual gaps in births mapped out over real time.
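A possible way to get there in base R, sketched on made-up dates: generate every calendar day of 2014 with seq(), then count the observed dates against that full list so that days with no birthdays show up as zeros. All names below (toy_dates, all_days, day_labels, per_day, missing_days) are my own.

```r
# Toy data: four made-up birthdays covering three distinct days
toy_dates <- as.Date(c("2014-01-01", "2014-01-01", "2014-01-03", "2014-12-31"))

# Every calendar day in 2014
all_days <- seq(as.Date("2014-01-01"), as.Date("2014-12-31"), by = "day")
day_labels <- format(all_days, "%Y-%m-%d")

# Using all days as factor levels keeps the zero-count days in the table
per_day <- table(factor(format(toy_dates, "%Y-%m-%d"), levels = day_labels))

# Days on which nobody in the data has a birthday
missing_days <- all_days[as.vector(per_day) == 0]

print(length(missing_days))
print(head(missing_days))
```

With the real data, fb$Value would replace toy_dates, and per_day could be passed to barplot() to draw the gaps over the year.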
head(order(grid$Var1))
??sort()
summary(grid$Var2)
dat.df$MonthFac <- as.factor(dat.df$Month)
pMonth <- ggplot(dat.df, aes(MonthFac))
pMonth + geom_histogram()
dat.df$uniqueDay <- paste(dat.df$Month, dat.df$Day, sep = "/")
dat.df$uniqueDay.Num <- 0.01*dat.df$Day + dat.df$Month
table(dat.df$uniqueDay.Num)
ddply(dat.df, .(id), summarise, noDays = length(unique(dat.df)))
head(dat.df$uniqueDay.Num)
head(dat.df)
pUniqueDay <- ggplot(dat.df, aes(uniqueDay.Num))
pUniqueDay + geom_histogram(binwidth = 0.01)
pUniqueDay + geom_bar() + xlim()
head(dat.df$uniqueDay)
Discarded ideas
Ok, so there are two people with my birthday, June 15th. Is that a lot or a little? How many times does the typical date occur as a birthday?
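A rough benchmark is the average number of birthdays per calendar day. The sketch below uses 1033 as the total, which is the sum of the twelve monthly counts reported by table(fb$Month), and assumes a 365-day year.

```r
# Total birthdays: sum of the monthly counts from table(fb$Month)
n_birthdays <- 1033
days_in_year <- 365

# Average number of birthdays per calendar day
avg_per_day <- n_birthdays / days_in_year
print(round(avg_per_day, 2))
```

At roughly 2.8 birthdays per day on average, two people on June 15 is slightly below the mean.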
To define each day of the year would require the creation of a new variable with Python code, I expect. That would at least require some internet research but it is something to think about for the future. Maybe two nested for loops?
But we could make a plot, a histogram, of the frequency of each date in the data set.
Abandoned code for future reference:
dat.gd <- mdy(fb$dates)
Both of these functions do the same thing, though the WeeDa = wday() argument in data.frame() orders the days beginning with Sunday, while the customized factor() call starts the days of the week on Monday.
dat.df <- data.frame( Value=dat.gd, Year=year(dat.gd), Month=month(dat.gd), Day=day(dat.gd),WeeDa=wday(dat.gd,label=T, abbr=T))
dat.df$WeeDa <- factor(dat.df$WeeDa, levels=c('Mon', 'Tues', 'Wed', 'Thurs', 'Fri', 'Sat', 'Sun'), ordered=T)
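The effect of the re-levelling can be seen on a small example. The sketch below uses the six weekday labels printed by head(fb) and assumes the original factor is ordered Sunday first; days and monday_first are hypothetical names.

```r
# Weekday labels from head(fb), as an ordered factor starting on Sunday
days <- factor(c("Tues", "Sun", "Fri", "Mon", "Thurs", "Thurs"),
               levels = c("Sun", "Mon", "Tues", "Wed", "Thurs", "Fri", "Sat"),
               ordered = TRUE)

# Same values, but with the levels rearranged to start on Monday
monday_first <- factor(days,
                       levels = c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"),
                       ordered = TRUE)

print(levels(days))
print(levels(monday_first))

# Only the ordering changes, so Sunday now sorts last instead of first
print(days[2] < days[4])
print(monday_first[2] < monday_first[4])
```

The values themselves are untouched; only comparisons, sorting, and the axis order in plots are affected.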
To get a rough idea how typical that is I create a simple histogram.
p <- ggplot(dat.df, aes(Month))
p + geom_bar(binwidth = 1, color = "white")
dDays <- ggplot(dat.df, aes(Day))
dDays + geom_bar(binwidth = 1)
Attempts for number 3: first I tried finding the Cartesian product, but that only gave me the list of unique combinations without giving me the count of the number of times each occurred.
eachDay <- c(fb$Month %x% fb$Day)
summary(eachDay)
max(eachDay)
grid <- as.data.frame(expand.grid(unique(fb$Month), unique(fb$Day)))
grid
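What expand.grid() lacks is the count for each combination. A two-way table() of Month against Day provides it, and as.data.frame() turns the result into one row per combination with a Freq column. A minimal sketch on made-up data (toy and counts are hypothetical names):

```r
# Toy data: six made-up birthdays, with two dates repeated
toy <- data.frame(
  Month = c(6, 6, 6, 2, 2, 11),
  Day   = c(15, 15, 8, 20, 20, 25)
)

# Cross-tabulate Month by Day, then flatten to one row per combination
counts <- as.data.frame(table(Month = toy$Month, Day = toy$Day))

# Drop the month-day combinations that never occur
counts <- counts[counts$Freq > 0, ]

# Most frequent combinations first
counts <- counts[order(-counts$Freq), ]
print(counts)
```

Unlike the grid of unique values, this keeps only the combinations that actually occur together in the data, along with how often each one occurs.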