Source file ⇒ /Users/sambamamba/HolidayBirthdays.Rmd

The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.

1. Create a new data table, DailyBirths, that adds up all the births for each day across all the states. Plot out daily births vs. date.

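library(mosaicData)  # Birthdays data table
library(dplyr)       # %>%, group_by(), summarise()
library(ggplot2)     # plotting
library(lubridate)   # week(), month(), yday(), wday()
DailyBirths <- Birthdays %>%
  select(date, births) %>%
  group_by(date) %>%
  summarise(total = sum(births))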
DailyBirths
## Source: local data frame [7,305 x 2]
## 
##          date total
##        (time) (int)
## 1  1969-01-01  8486
## 2  1969-01-02  9002
## 3  1969-01-03  9542
## 4  1969-01-04  8960
## 5  1969-01-05  8390
## 6  1969-01-06  9560
## 7  1969-01-07  9738
## 8  1969-01-08  9734
## 9  1969-01-09  9434
## 10 1969-01-10 10042
## ..        ...   ...
posn <- position_jitter(width = 0.5)
ggplot(DailyBirths, aes(x = date, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1)

2. To examine seasonality in birth rates, look at the number of births aggregated over the years by:

  1. each week.
weekly <- week(DailyBirths$date) 
DW <- data.frame(DailyBirths, weekly)
WeeklyBirths <- DW %>%
  select(weekly, total) %>%
  group_by(weekly) %>%
  summarise(total = sum(total))
ggplot(WeeklyBirths, aes(x = weekly, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1) + ylim(240000, 1500000)

  2. each month.
monthly <- month(DailyBirths$date) 
DM <- data.frame(DailyBirths, monthly)
MonthlyBirths <- DM %>%
  select(monthly, total) %>%
  group_by(monthly) %>%
  summarise(total = sum(total))
ggplot(MonthlyBirths, aes(x = monthly, y = total)) + geom_point(size = 1.5) + ylim(5000000, 7000000)

  3. each Julian day.
daily <- yday(DailyBirths$date)
DD <- data.frame(DailyBirths, daily) 
JulianBirths <- DD %>%
  select(daily, total) %>%
  group_by(daily) %>%
  summarise(total = sum(total))
ggplot(JulianBirths, aes(x = daily, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1, na.rm = TRUE) + ylim(40000, 200000)
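
The three aggregations above follow the same pattern: derive a calendar unit from date, then sum the daily totals within it. As a minimal sketch, the units can be derived in one pass with mutate(). It assumes the mosaicData, dplyr, and lubridate packages are installed; SeasonalBirths is a hypothetical name that does not appear elsewhere in this document.

```r
library(dplyr)
library(lubridate)
library(mosaicData)

# Daily totals across all states, as in step 1
DailyBirths <- Birthdays %>%
  group_by(date) %>%
  summarise(total = sum(births))

# Derive week, month, and day of year from each date in one pass
SeasonalBirths <- DailyBirths %>%
  mutate(weekly = week(date), monthly = month(date), daily = yday(date))

# Sum the daily totals within each calendar unit
WeeklyBirths  <- SeasonalBirths %>% group_by(weekly)  %>% summarise(total = sum(total))
MonthlyBirths <- SeasonalBirths %>% group_by(monthly) %>% summarise(total = sum(total))
JulianBirths  <- SeasonalBirths %>% group_by(daily)   %>% summarise(total = sum(total))
```

The resulting tables have the same column names as the ones plotted above, so the existing ggplot() calls should work on them unchanged.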

3. To examine the patterns within the week, look at the number of births by day of the week.
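
One possible approach to this step, as a minimal sketch: group on the wday column of Birthdays (the same column used in step 4) and sum the births. WeekdayBirths is a hypothetical name, and the bar chart is only one of several reasonable displays.

```r
library(dplyr)
library(ggplot2)
library(mosaicData)

# Sum births over all states and all years for each day of the week
WeekdayBirths <- Birthdays %>%
  group_by(wday) %>%
  summarise(total = sum(births))

# One bar per weekday
ggplot(WeekdayBirths, aes(x = wday, y = total)) +
  geom_col()
```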

4. Pick a two-year span of the Birthdays that falls in the 1980s, say, 1980/1981. Extract the data in just this interval, calling it MyTwoYears. Plot out the births in this two-year span day by day. Color each date according to its day of the week. Explain the pattern that you see.

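MyTwoYears <- Birthdays %>%
  select(year, date, births, wday) %>%
  # %in% tests membership; year == c(1980, 1981) would recycle the vector and
  # silently drop rows (the totals printed below came from the == form and are too low)
  filter(year %in% c(1980, 1981)) %>%
  group_by(date, wday) %>%
  summarise(total = sum(births))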
head(MyTwoYears)
## Source: local data frame [6 x 3]
## Groups: date [6]
## 
##         date   wday total
##       (time) (fctr) (int)
## 1 1980-01-01   Tues  4576
## 2 1980-01-02    Wed  4112
## 3 1980-01-03  Thurs  5544
## 4 1980-01-04    Fri  4411
## 5 1980-01-05    Sat  4725
## 6 1980-01-06    Sun  3656
ggplot(MyTwoYears, aes(x = date, y = total, color = wday)) +
  geom_point() +
  # the mapped aesthetic is color, so the legend is set with scale_color_discrete()
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
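
When selecting rows that belong to either of two years, `==` and `%in%` behave differently against a vector of years. The following is a minimal sketch on a toy vector (the values are made up for illustration) showing how `==` recycles the right-hand side while `%in%` tests membership.

```r
# Toy vector standing in for a year column (hypothetical values)
year <- c(1980, 1980, 1981, 1981, 1979, 1980)

# `==` recycles c(1980, 1981) along the vector: element 1 is compared with
# 1980, element 2 with 1981, element 3 with 1980, and so on
by_equals <- year == c(1980, 1981)

# `%in%` asks, for each element, whether it is one of the listed years
by_in <- year %in% c(1980, 1981)

sum(by_equals)  # 2 rows kept
sum(by_in)      # 5 rows kept
```

When the length of the column is not a multiple of 2, the `==` form also emits the "longer object length is not a multiple of shorter object length" warning.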

5. A few days each year don’t follow the pattern in (4). We are going to examine the hypothesis that these are holidays. You can find a data set listing US federal holidays at “http://tiny.cc/dcf/US-Holidays.csv”.

Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
  mutate(date = lubridate::dmy(date))

6. Add a couple of layers to your plot from (4).

  1. Draw a vertical bar at each date which is a holiday. You’ll use the geom_vline() glyph. You can give a data = argument to geom_vline() to tell it to plot out the information from Holidays rather than MyTwoYears.

Holi <- Holidays %>%
  select(holiday, year, date) %>%
  # %in% avoids the vector-recycling warning that year == c(1980, 1981) raises
  filter(year %in% c(1980, 1981)) %>%
  group_by(holiday)

  2. Add a text label to each of the vertical bars to identify which holiday it is. Use the geom_text() glyph.
MyTwoYears <- Birthdays %>%
    select(year, date, births, wday) %>%
    filter(year %in% c(1980, 1981)) %>%
    group_by(date, wday) %>%
    summarise(total = sum(births))
Holi <- Holidays %>%
    select(holiday, year, date) %>%
    filter(year %in% c(1980, 1981)) %>%
    group_by(holiday)
TwoYearHolidays <- MyTwoYears %>%
  left_join(Holidays %>% select(holiday, date, year), by = c("date" = "date"))
ggplot(MyTwoYears, aes(x = date, y = total, col = wday)) +
  geom_point() +
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
  # vertical line at each holiday, colored by the holiday's day of the week
  geom_vline(data = Holi, mapping = aes(xintercept = as.numeric(date), color = wday(date, label = TRUE))) +
  # holiday names; dates without a holiday have NA labels and are dropped
  geom_text(data = TwoYearHolidays, aes(label = holiday), angle = 65)
## Warning: Removed 713 rows containing missing values (geom_text).

7. Join MyTwoYears and Holidays.

TwoYearHolidays <- MyTwoYears %>%
  left_join(Holidays %>% select(holiday, date, year), by = c("date" = "date"))
head(TwoYearHolidays)
## Source: local data frame [6 x 5]
## Groups: date [6]
## 
##         date   wday total        holiday  year
##       (time) (fctr) (int)         (fctr) (int)
## 1 1980-01-01   Tues  4576 New Year's Day  1980
## 2 1980-01-02    Wed  4112             NA    NA
## 3 1980-01-03  Thurs  5544             NA    NA
## 4 1980-01-04    Fri  4411             NA    NA
## 5 1980-01-05    Sat  4725             NA    NA
## 6 1980-01-06    Sun  3656             NA    NA

8. Mutate the holiday variable to be “yes” or “no”, depending on whether the day is a holiday or not. An appropriate argument to mutate would be is_holiday = ifelse(is.na(holiday), "no", "yes").

TwoYearHolidays2 <- TwoYearHolidays %>%
  mutate(is_holiday = ifelse(is.na(holiday), "no", "yes"))
head(TwoYearHolidays2)
## Source: local data frame [6 x 6]
## Groups: date [6]
## 
##         date   wday total        holiday  year is_holiday
##       (time) (fctr) (int)         (fctr) (int)      (chr)
## 1 1980-01-01   Tues  4576 New Year's Day  1980        yes
## 2 1980-01-02    Wed  4112             NA    NA         no
## 3 1980-01-03  Thurs  5544             NA    NA         no
## 4 1980-01-04    Fri  4411             NA    NA         no
## 5 1980-01-05    Sat  4725             NA    NA         no
## 6 1980-01-06    Sun  3656             NA    NA         no

9. Plot out the daily pattern over the two years of MyTwoYears, setting the size of the symbol to is_holiday. Is your hypothesis in (5) correct? If yes, which holidays do not follow the pattern?

ggplot(TwoYearHolidays2, aes(x = date, y = total, col = wday)) +
  # symbol size distinguishes holidays from ordinary days
  geom_point(aes(size = is_holiday)) +
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
  geom_vline(data = Holi, mapping = aes(xintercept = as.numeric(date), color = wday(date, label = TRUE))) +
  geom_text(aes(label = holiday), angle = 90)
## Warning: Removed 713 rows containing missing values (geom_text).

The pattern is somewhat difficult to discern, even with jittering; however, one can infer that the hypothesis is not correct.

However, the inferences drawn from the graph, even with the vertical lines, can vary depending on which trend we are referring to. This may be a result of how the attributes and aesthetics were adjusted to produce the cleanest and most readable graph. For one, combining multiple attributes and symbols in a single plot can make it harder to read. If we split the graph into facets and isolate a trend in the data using some sort of linear model, we can then use analytic tools to measure the correlation between the holiday dates and the birth trends.
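
The faceting idea could be sketched as follows: one panel per day of the week, each with its own fitted line, so that dates falling well below the line stand out. This is a minimal sketch, not a full analysis. It assumes the mosaicData, dplyr, and ggplot2 packages are installed, and TrendPlot is a hypothetical name.

```r
library(dplyr)
library(ggplot2)
library(mosaicData)

# Daily totals for 1980 and 1981, keeping the day of the week
MyTwoYears <- Birthdays %>%
  filter(year %in% c(1980, 1981)) %>%
  group_by(date, wday) %>%
  summarise(total = sum(births)) %>%
  ungroup()

# One panel per weekday, with a straight-line fit in each panel
TrendPlot <- ggplot(MyTwoYears, aes(x = date, y = total)) +
  geom_point(size = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~ wday)
TrendPlot
```

Separating the weekdays removes the weekly cycle from each panel, which should make it easier to check whether the low outliers coincide with holiday dates.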

Human error can also arise from the person making the inference, i.e. myself: I may have an incorrect idea of the trend we should observe, which is then compared against the also somewhat subjective trend of holiday dates. Learning more data wrangling and graphing tools would help me generate a more accessible plot, from which the inference would be more accurate and less subjective.