Forks First

Fork that repo!!! I use the GitHub website for the forking and copy the URL to my clipboard. Then I switch into my terminal…

# cd datasciencecoursera/Reproducible_Research
# git clone "myrepo_url"
# cd RepData_PeerAssessment1
# unzip activity.zip

Simple, right? Switch to wherever you want your local clone to live; clone there; switch into the cloned repo; unzip the data.
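If you’d rather stay in R for the whole thing, the unzip step works from there too. A little sketch, assuming your working directory is already the cloned repo:

unzip("activity.zip")             # extracts activity.csv next to the zip
list.files(pattern = "activity")  # sanity check that the csv landed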

Read In

Now I can actually get to all that tasty data! And finally get back to my turf in RStudio :)

setwd("~/datasciencecoursera/Reproducible_Research/RepData_PeerAssessment1/")
# don't want to type that more than once
datf <- data.table::fread("activity.csv")
str(datf)
## Classes 'data.table' and 'data.frame':   17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
##  - attr(*, ".internal.selfref")=<externalptr>

All read in, and I have a general idea of the structure of the data.
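A couple of optional sanity checks before summarising (just a sketch, poking at the date column and those NAs we already spotted in str()):

range(datf$date)           # first and last day recorded (ISO dates sort fine as text)
length(unique(datf$date))  # how many distinct days we have
anyNA(datf$steps)          # yep, NAs are in there; more on those later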

Histogram of steps taken per day with Mean and Median

Now I need to summarise the data as the total number of steps per day. I really like dplyr and magrittr for this kind of thing.

library(magrittr)
library(dplyr)
daily_total <- datf %>% group_by(date) %>% summarise(steps = sum(steps))
summaries <- daily_total %>% summarise(avg = mean(steps, na.rm = T), med = median(steps, na.rm = T))

I went ahead and made another data.frame for my summary statistics. This is definitely overkill, but it makes my plotting a little easier. Now let’s visualize the daily totals via ggplot2.

library(ggplot2)
theme_set(theme_light()) # I hate the grey default background
ggplot(daily_total, aes(x = steps)) +
    geom_histogram(aes(y = ..density..), bins = 20, fill = "grey") +
    geom_density(color = "black") +
    geom_vline(data = summaries, aes(xintercept = avg), color = "blue", linetype = 2, alpha = .5, size = 2) +
    geom_label(data = summaries, aes(label = round(avg), x = avg), y = .000125, hjust = 1, color = "blue") +
    geom_vline(data = summaries, aes(xintercept = med), color = "red", linetype = 3, alpha = .5, size = 2) +
    geom_label(data = summaries, aes(label = round(med), x = med),y = .000135, hjust = 0, color = "red") +
    labs(title = "#2 Histogram and Density of daily step totals\n#3 Mean in blue; Median in red",
         y = "Density", x = "Steps")

Jeez, the mean and the median are right on top of one another. I’ve tried to show that using alpha = .5, varying linetype =, and of course the geom_label() annotations. I prefer the curve of geom_density(), but didn’t want to lose points for not having a histogram. Note the change to geom_histogram(aes(y = ..density..)); this is a must to get the histogram and the density curve on the same scale (i.e. proportion of total, instead of raw counts). There is a lot of information in this little figure!
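If you want to convince yourself the density scale really does what I claim (bar heights are per unit of x, so the bar areas add up to one), here’s a rough check with base hist(). Its breaks won’t match ggplot2’s bins exactly, but the idea carries over:

h <- hist(daily_total$steps, breaks = 20, plot = FALSE)  # hist() drops the NA days itself
sum(h$density * diff(h$breaks))                          # bar areas should sum to ~1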

Time series of average steps taken

We need to now do a summary grouped by interval instead of date. I also did another summary data.frame and set up a color variable to help identify the maximum step interval.

interval_total <- datf %>% group_by(interval) %>% summarise(total = sum(steps, na.rm = T))
summaries2 <- interval_total %>% filter(total == max(total))
interval_total %<>% mutate(color = total == summaries2$total)

ggplot(interval_total, aes(x = interval, y = total)) +
    geom_path(color = "grey") +
    geom_point(aes( color = color), size = 2) +
    scale_color_brewer(palette = "Set1", name = "Is Max?") +
    geom_label(data = summaries2, aes(label = interval), hjust = -.5) +
    geom_smooth(color = "black", se = F) +
    labs(title = "Which interval has the most step activity?\n 830th - 835th minute or 1:50p - 1:55p",
         y = "Total Steps", x = "Minutes since midnight (in 5 min intervals)")

The black line shows the geom_smooth() loess line of best fit and confirms that the most activity happens around 1:50pm.
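If you’d rather have the smooth’s peak as a number instead of eyeballing the plot, you could fit the loess outside of ggplot2 as a rough cross-check (geom_smooth()’s defaults may differ slightly):

fit <- loess(total ~ interval, data = interval_total)  # roughly what geom_smooth() does here
interval_total$interval[which.max(predict(fit))]       # interval where the fitted curve peaks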

Working on the NA problem

Just how many NA values are lurking in our datf?

table(complete.cases(datf))
## 
## FALSE  TRUE 
##  2304 15264

We are missing a total of 2304 out of 17568 observations, or ~13%. Seems like a lot, so let’s approximate those missing values with the mean values (na.rm = TRUE, of course) from their respective time intervals.
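Before settling on that, it’s worth a quick look at how the NAs are spread out, i.e. whether they’re scattered across days or come as whole missing days. A small dplyr sketch to count missing intervals per day:

datf %>%
    group_by(date) %>%
    summarise(n_missing = sum(is.na(steps))) %>%  # missing intervals per day
    count(n_missing)                              # how many days have each count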

# per-interval means to fill the NAs (heads up: the column is named med but holds the mean)
na_fill <- datf %>% group_by(interval) %>% summarise(med = mean(steps, na.rm = T))
datf2 <- merge(datf, na_fill, by = "interval")
datf2 %<>% mutate(steps = ifelse(is.na(steps), med, steps))

Let’s see how our data imputation shifted the mean and median, as we visualized in our first plot.

daily_total2 <- datf2 %>% group_by(date) %>% summarise(steps = sum(steps))
summaries3 <- daily_total2 %>% summarise(avg = mean(steps, na.rm = T), med = median(steps, na.rm = T))
ggplot(daily_total2, aes(x = steps)) +
    geom_histogram(aes(y = ..density..), bins = 20, fill = "grey") +
    geom_density(color = "black") +
    geom_vline(data = summaries3, aes(xintercept = avg), color = "blue", linetype = 2, alpha = .5, size = 2) +
    geom_label(data = summaries3, aes(label = round(avg), x = avg), y = .000125, hjust = 1, color = "blue") +
    geom_vline(data = summaries3, aes(xintercept = med), color = "red", linetype = 3, alpha = .5, size = 2) +
    geom_label(data = summaries3, aes(label = round(med), x = med),y = .000135, hjust = 0, color = "red") +
    labs(title = "#7 AFTER IMPUTING: Histogram and Density of daily step totals\nMean in blue; Median in red",
         y = "Density", x = "Steps")

Well, that didn’t change things too much; I guess imputing in this case doesn’t make a huge difference. It tightened the distribution a little bit, but the mean and median values are essentially unchanged (okay, you’re right, the median increased by 1, big whoop-didoo!). We wouldn’t expect the mean to change at all, since the interval means are exactly what we replaced our NAs with.
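To put actual numbers behind “essentially unchanged”, here’s a quick side-by-side of the two summary tables we already built (just a sketch using bind_rows()):

bind_rows(before = summaries, after = summaries3, .id = "imputation")  # compare avg and med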

Are weekends for relaxing or stepping?

To test this theory we need to make a new variable called weekend that will be TRUE if the day is “Saturday” or “Sunday” and FALSE otherwise.

library(lubridate)
str(datf2)
## 'data.frame':    17568 obs. of  4 variables:
##  $ interval: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ steps   : num  1.72 0 0 47 0 ...
##  $ date    : chr  "2012-10-01" "2012-10-02" "2012-10-03" "2012-10-04" ...
##  $ med     : num  1.72 1.72 1.72 1.72 1.72 ...
datf2$date %<>% ymd()
datf2 %<>% mutate(day = wday(date, label = T), weekend = grepl("S(at|un)", day))

weekdays <- filter(datf2, weekend == F) %>% group_by(interval) %>% summarise(total = sum(steps)) %>% mutate(day = "Weekday")
weekday_sum <- weekdays %>% filter(total == max(total)) %>% mutate(weekend = F)
weekends <- filter(datf2, weekend == T) %>% group_by(interval) %>% summarise(total = sum(steps)) %>% mutate(day = "Weekend")
weekend_sum <- weekends %>% filter(total == max(total)) %>% mutate(weekend = T)

days <- rbind(weekdays, weekends)
sums <- rbind(weekday_sum, weekend_sum)
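As an aside, the same weekday/weekend totals could be built in one pipeline instead of two filter/rbind passes; a sketch that should give the same numbers as days, just in a different column order:

days_alt <- datf2 %>%
    mutate(day = ifelse(weekend, "Weekend", "Weekday")) %>%  # overwrite the Sat/Sun labels
    group_by(day, interval) %>%
    summarise(total = sum(steps))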

ggplot(days, aes(x = interval, y = total)) +
    geom_path(color = "grey") +
    geom_point(aes(color = day), size = 2) +
    scale_color_brewer(palette = "Set1", name = "Day of Week:") +
    geom_smooth(color = "black", se = F) +
    facet_wrap(~day, nrow = 2, labeller = "label_value") +
    labs(title = "When are people more active Weekday or Weekend?\nDuring the week, guess people are relaxing more on the weekend, go figure",
         y = "Total Steps", x = "Minutes since midnight (in 5 min intervals)")

I am genuinely a little surprised that people are less active on the weekends. Since I myself work at a computer most of the day, I relish the freedom the weekends bring to get outside and do something physical.

Anywho, jeez, that was a lot of code. Sorry to whoever is grading this that I didn’t get it in on time, but I was out of town this weekend (oh the irony). Hope it was a pleasant read, and please let me know any detailed feedback. Cheers, Nate.