Source file ⇒ /Users/sambamamba/HolidayBirthdays.Rmd
The data table Birthdays in the mosaicData package gives the number of births recorded on each day of the year in each state from 1969 to 1988.
1. Create a new data table, DailyBirths, that adds up all the births for each day across all the states. Plot out daily births vs. date.
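library(mosaicData)  # provides the Birthdays table
library(dplyr)
library(ggplot2)
library(lubridate)   # week(), month(), yday() used below
DailyBirths <- Birthdays %>%
  select(date, births) %>%
  group_by(date) %>%
  summarise(total = sum(births))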
DailyBirths
## Source: local data frame [7,305 x 2]
##
## date total
## (time) (int)
## 1 1969-01-01 8486
## 2 1969-01-02 9002
## 3 1969-01-03 9542
## 4 1969-01-04 8960
## 5 1969-01-05 8390
## 6 1969-01-06 9560
## 7 1969-01-07 9738
## 8 1969-01-08 9734
## 9 1969-01-09 9434
## 10 1969-01-10 10042
## .. ... ...
posn <- position_jitter(width = 0.5)
ggplot(DailyBirths, aes(x = date, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1)
2. To examine seasonality in birth rates, look at the number of births aggregated over the years by week of the year, month of the year, and Julian day (day of the year):
weekly <- week(DailyBirths$date)
DW <- data.frame(DailyBirths, weekly)
WeeklyBirths <- DW %>%
select(weekly, total) %>%
group_by(weekly) %>%
summarise(total = sum(total))
ggplot(WeeklyBirths, aes(x = weekly, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1) + ylim(240000, 1500000)
monthly <- month(DailyBirths$date)
DM <- data.frame(DailyBirths, monthly)
MonthlyBirths <- DM %>%
select(monthly, total) %>%
group_by(monthly) %>%
summarise(total = sum(total))
ggplot(MonthlyBirths, aes(x = monthly, y = total)) + geom_point(size = 1.5) + ylim(5000000, 7000000)
daily <- yday(DailyBirths$date)
DD <- data.frame(DailyBirths, daily)
JulianBirths <- DD %>%
select(daily, total) %>%
group_by(daily) %>%
summarise(total = sum(total))
ggplot(JulianBirths, aes(x = daily, y = total)) + geom_point(position = posn, alpha = 0.8, size = 1, na.rm = TRUE) + ylim(40000, 200000)
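The three aggregations above share one pattern: derive a calendar key from the date, group by it, and sum. As a minimal sketch of that pattern in a single pipeline, the following uses mutate() instead of building an intermediate data frame. The `toy` table and its constant totals are made up for illustration and stand in for DailyBirths; `by_month` is a hypothetical name.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for DailyBirths: one made-up "birth" per day, 1980-1981
toy <- data.frame(
  date  = seq(as.Date("1980-01-01"), as.Date("1981-12-31"), by = "day"),
  total = 1
)

# Derive the key, group by it, and sum, all in one pipeline
by_month <- toy %>%
  mutate(monthly = month(date)) %>%
  group_by(monthly) %>%
  summarise(total = sum(total))

by_month
```

Swapping month() for week() or yday() should give the weekly and Julian-day versions in the same way.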
3. To examine the patterns within the week, look at the number of births by day of the week.
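One possible way to approach this, sketched under the assumption that the wday column of Birthdays (the same column selected in step 4) is the day-of-week variable to group by. The name `WeekdayBirths` is hypothetical and not part of the assignment.

```r
library(mosaicData)  # provides Birthdays
library(dplyr)
library(ggplot2)

# Sum births over all states and all years for each day of the week
WeekdayBirths <- Birthdays %>%
  group_by(wday) %>%
  summarise(total = sum(births))

ggplot(WeekdayBirths, aes(x = wday, y = total)) +
  geom_point(size = 1.5)
```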
4. Pick a two-year span of the Birthdays that falls in the 1980s, say, 1980/1981. Extract out the data just in this interval, calling it MyTwoYears. Plot out the births in this two-year span day by day. Color each date according to its day of the week. Explain the pattern that you see.
MyTwoYears <- Birthdays %>%
select(year, date, births, wday) %>%
filter(year==c(1980,1981)) %>%
group_by(date, wday) %>%
summarise(total = sum(births))
head(MyTwoYears)
## Source: local data frame [6 x 3]
## Groups: date [6]
##
## date wday total
## (time) (fctr) (int)
## 1 1980-01-01 Tues 4576
## 2 1980-01-02 Wed 4112
## 3 1980-01-03 Thurs 5544
## 4 1980-01-04 Fri 4411
## 5 1980-01-05 Sat 4725
## 6 1980-01-06 Sun 3656
ggplot(MyTwoYears, aes(x = date, y = total, color = wday)) +
  geom_point() +
  # the mapped aesthetic is color, so the legend is set through the color scale
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"))
The graph suggests that fewer people are born on Saturdays and many are born on Tuesdays; relatively few appear to be born on Mondays. There also seems to be a loosely periodic, roughly sinusoidal trend. Births are low during January-February 1980, high in the fall of 1980 (August-November), low again in the spring of 1981 (March-May), and peak again in October-November 1981. The periodic pattern is worth noting because it comes from observed data, namely the Birthdays data set.
Also, births on Saturdays are tightly clustered and low, at most roughly 4000 per Saturday, while Wednesdays and Thursdays show more births, as the vertical ordering of the colors in the graph indicates.
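One detail in the filter used for MyTwoYears may be worth checking. `year == c(1980, 1981)` compares element by element and recycles the two-element vector along the rows, so it can drop rows that do belong to one of the two years, whereas `%in%` tests membership. A minimal sketch of the difference on a made-up data frame (`toy`, `eq_rows`, and `in_rows` are hypothetical names, and the numbers are not from Birthdays):

```r
library(dplyr)

# Made-up data: two rows per year
toy <- data.frame(
  year   = c(1980, 1980, 1981, 1981, 1982, 1982),
  births = c(10, 20, 30, 40, 50, 60)
)

# == lines up c(1980, 1981, 1980, 1981, ...) against the rows one by one
eq_rows <- toy %>% filter(year == c(1980, 1981))

# %in% keeps every row whose year is 1980 or 1981
in_rows <- toy %>% filter(year %in% c(1980, 1981))

nrow(eq_rows)  # 2
nrow(in_rows)  # 4
```

If the same effect applies to Birthdays, the daily totals computed with `==` would cover only part of the states on each date.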
5. A few days each year don’t follow the pattern in (4). We are going to examine the hypothesis that these are holidays. You can find a data set listing US federal holidays at “http://tiny.cc/dcf/US-Holidays.csv”.
Holidays <- read.csv("http://tiny.cc/dcf/US-Holidays.csv") %>%
mutate(date = lubridate::dmy(date))
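The call to lubridate::dmy() assumes the date strings are written in day-month-year order. A short sketch of why the order matters, using a made-up string rather than a value taken from the CSV (`d_dmy` and `d_mdy` are hypothetical names):

```r
library(lubridate)

# The same made-up string read under two different orderings
d_dmy <- dmy("4/7/1980")  # day first: 4 July 1980
d_mdy <- mdy("4/7/1980")  # month first: 7 April 1980

d_dmy
d_mdy
```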
6. Add a couple of layers to your plot from (4). 1. Draw a vertical bar at each date which is a holiday. You’ll use the geom_vline() glyph. You can give a data = argument to geom_vline() to tell it to plot out the information from Holidays rather than MyTwoYears.
Holi <- Holidays %>%
  select(holiday, year, date) %>%
  # %in% tests membership; == would recycle c(1980, 1981) along the rows
  filter(year %in% c(1980, 1981)) %>%
  group_by(holiday)
2. Add a text label to each of the vertical bars to identify which holiday it is. You’ll use the geom_text glyph.
TwoYearHolidays <- MyTwoYears %>%
left_join(Holidays %>% select(holiday, date, year), by = c("date" = "date"))
ggplot(MyTwoYears, aes(x = date, y = total, col = wday)) +
  geom_point() +
  # the mapped aesthetic is color, so the legend is set through the color scale
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
  geom_vline(data = Holi, mapping = aes(xintercept = as.numeric(date), color = wday(date, label = TRUE))) +
  geom_text(data = TwoYearHolidays, aes(label = holiday, angle = 65))
## Warning: Removed 713 rows containing missing values (geom_text).
7. Join MyTwoYears and Holidays.
TwoYearHolidays <- MyTwoYears %>%
left_join(Holidays %>% select(holiday, date, year), by = c("date" = "date"))
head(TwoYearHolidays)
## Source: local data frame [6 x 5]
## Groups: date [6]
##
## date wday total holiday year
## (time) (fctr) (int) (fctr) (int)
## 1 1980-01-01 Tues 4576 New Year's Day 1980
## 2 1980-01-02 Wed 4112 NA NA
## 3 1980-01-03 Thurs 5544 NA NA
## 4 1980-01-04 Fri 4411 NA NA
## 5 1980-01-05 Sat 4725 NA NA
## 6 1980-01-06 Sun 3656 NA NA
8. Mutate the holiday variable to be “yes” or “no”, depending on whether the day is a holiday or not. An appropriate argument to mutate would be is_holiday = ifelse(is.na(holiday), "no", "yes").
TwoYearHolidays2 <- TwoYearHolidays %>%
mutate(is_holiday = ifelse(is.na(holiday), "no", "yes"))
head(TwoYearHolidays2)
## Source: local data frame [6 x 6]
## Groups: date [6]
##
## date wday total holiday year is_holiday
## (time) (fctr) (int) (fctr) (int) (chr)
## 1 1980-01-01 Tues 4576 New Year's Day 1980 yes
## 2 1980-01-02 Wed 4112 NA NA no
## 3 1980-01-03 Thurs 5544 NA NA no
## 4 1980-01-04 Fri 4411 NA NA no
## 5 1980-01-05 Sat 4725 NA NA no
## 6 1980-01-06 Sun 3656 NA NA no
9. Plot out the daily pattern over the two years of MyTwoYears, setting the size of the symbol to is_holiday. Is your hypothesis in (5) correct? If yes, which holidays do not follow the pattern?
ggplot(TwoYearHolidays2, aes(x = date, y = total, col = wday)) +
  geom_point() +
  # the mapped aesthetic is color, so the legend is set through the color scale
  scale_color_discrete(name = "Days of Week",
                       labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")) +
  geom_vline(data = Holi, mapping = aes(xintercept = as.numeric(date), color = wday(date, label = TRUE))) +
  geom_text(data = TwoYearHolidays2, aes(label = holiday, angle = 90, size = is_holiday))
## Warning: Removed 713 rows containing missing values (geom_text).
The pattern is somewhat difficult to discern, even with jittering. However, one can infer that the hypothesis is not correct.
The hypothesis in (5) states that, given a trend in the births over the two-year timespan, the holiday dates would not follow the observed trend. In (4), we saw a somewhat periodic trend, with regions of high points and low points in the plot. In the graph above, the vertical lines mark the holidays in the two-year span and are colored by the day of the week, as the legend describes.
Consider the bell-shaped curve between May-June 1980 and January 1981: the increase in births there coincides with a stretch that contains many holidays. A similar clustering of high births around holiday dates occurs between May 1981 and December 1981. Since the regions with many births seem to line up with the regions where holiday dates are concentrated, we can infer that the holiday dates roughly follow the pattern of the data points described in (4). Since the hypothesis assumes that the holidays don’t follow the pattern in (4), the hypothesis appears to be incorrect.
More specifically, however, I would say that Thanksgiving, Christmas, and New Year are, relative to the other holiday dates, slightly off from the periodicity of the holiday dates and the trend of the data points. The other holidays more closely follow the pattern of births.
However, the inferences drawn from the graph, even with the vertical lines, can vary depending on which trend we are referring to. This may result from how the attributes and aesthetics were revised to produce the cleanest and most readable graph. For one, combining multiple attributes and symbols can make the graph harder to read. If we split the graph into facets and isolated a trend in the data using some sort of linear model, we could use analytic tools to compare the holiday date trends with the birth trends.
Human error can also arise on the part of the person making the inference, i.e. myself: I may have an incorrect idea of what trend we should observe, and I am comparing it against the also somewhat subjective trend of holiday dates. Learning more data wrangling and graphing tools would help me generate a more accessible plot, from which the inference would be more accurate and less subjective.
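The faceting idea mentioned above can be sketched as follows. This is only an illustration under stated assumptions: `toy` is a made-up stand-in for TwoYearHolidays2 with invented totals and a single holiday, the weekday is encoded as an ISO weekday number rather than the wday factor from Birthdays, and the point sizes are arbitrary choices.

```r
library(ggplot2)

# Hypothetical stand-in for TwoYearHolidays2: four weeks of made-up totals
toy <- data.frame(date = seq(as.Date("1980-01-01"), by = "day", length.out = 28))
toy$wday       <- format(toy$date, "%u")              # ISO weekday number, "1" = Monday
toy$total      <- 4000 + 100 * as.integer(toy$wday)   # invented values
toy$is_holiday <- ifelse(toy$date == as.Date("1980-01-01"), "yes", "no")

# One panel per weekday; holidays are drawn as larger points
p <- ggplot(toy, aes(x = date, y = total)) +
  geom_point(aes(size = is_holiday)) +
  scale_size_manual(values = c(no = 1, yes = 3)) +
  facet_wrap(~ wday, nrow = 1)

p
```

With one panel per weekday, a holiday can be compared against other days of the same weekday instead of against the whole cloud of points.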