Fork that repo!!! I use the GitHub website for the forking and copy the URL to my clipboard. Then I switch into my terminal…
# cd datasciencecoursera/Reproducible_Research
# git clone "myrepo_url"
# cd RepData_PeerAssessment1
# unzip activity.zip
Simple, right? Switch to the directory where you want your local copy; clone there; switch into the cloned repo; unzip the data.
Now I can actually get to all that tasty data! And finally get back to my turf in RStudio :)
setwd("~/datasciencecoursera/Reproducible_Research/RepData_PeerAssessment1/")
# don't want to type that more than once
datf <- data.table::fread("activity.csv")
str(datf)
## Classes 'data.table' and 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : chr "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
## - attr(*, ".internal.selfref")=<externalptr>
All read in and I have a general idea of the structure of the data.
Now I need to summarise the data as total number of steps per day. I really like dplyr and magrittr for this kind of thing.
library(magrittr)
library(dplyr)
daily_total <- datf %>% group_by(date) %>% summarise(steps = sum(steps))
summaries <- daily_total %>% summarise(avg = mean(steps, na.rm = T), med = median(steps, na.rm = T))
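One thing worth knowing about the sum(steps) above: without na.rm, any day that contains an NA gets an NA total, and the later mean(..., na.rm = TRUE) then skips that day entirely. Here is a minimal sketch of that behaviour in base R (so it runs without any packages); the toy data frame and its values are made up for illustration.

```r
# Toy data (made up): day "b" has one missing reading
toy <- data.frame(
  date  = c("a", "a", "b", "b", "c", "c"),
  steps = c(10, 20, NA, 5, 1, 2)
)

# sum() without na.rm returns NA for any day containing an NA
daily <- tapply(toy$steps, toy$date, sum)

# na.rm = TRUE here drops the NA day from the average
avg_daily <- mean(daily, na.rm = TRUE)

# With na.rm = TRUE inside sum(), day "b" would be counted as 5 instead
daily_rm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(daily)
print(avg_daily)
print(daily_rm)
```

Which behaviour you want depends on the question; dropping incomplete days keeps partial days from dragging the average down.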
I went ahead and made another data.frame for my summary statistics. This is definitely overkill but makes my plotting a little easier. Now let’s visualize the daily totals via ggplot2.
library(ggplot2)
theme_set(theme_light()) # I hate the grey default background
ggplot(daily_total, aes(x = steps)) +
geom_histogram(aes(y = ..density..), bins = 20, fill = "grey") +
geom_density(color = "black") +
geom_vline(data = summaries, aes(xintercept = avg), color = "blue", linetype = 2, alpha = .5, size = 2) +
geom_label(data = summaries, aes(label = round(avg), x = avg), y = .000125, hjust = 1, color = "blue") +
geom_vline(data = summaries, aes(xintercept = med), color = "red", linetype = 3, alpha = .5, size = 2) +
geom_label(data = summaries, aes(label = round(med), x = med),y = .000135, hjust = 0, color = "red") +
labs(title = "#2 Histogram and Density of daily step totals\n#3 Mean in blue; Median in red",
y = "Density", x = "Steps")
Jeez, the mean and the median are right on top of one another. I’ve tried to show that using alpha = .5, varying linetype =, and of course the geom_label() annotations. I prefer the curve of geom_density(), but didn’t want to lose points for not having a histogram. Note the change in geom_histogram(aes(y = ..density..)); this is a must to get the bars and the curve on the same density scale (i.e. proportion of the total scaled by bin width, instead of raw counts). There is a lot of information in this little figure!
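To make the density scale a bit less magical, here is a small sketch using base R's hist() rather than ggplot2 (my substitution, just so it runs without packages). It checks that a bar's density is its count divided by n times the bin width, so the bars integrate to 1 just like the density curve. The data are random numbers, not the step totals.

```r
set.seed(1)  # made-up data, not the real daily totals
x <- rnorm(200, mean = 10000, sd = 4000)

# Compute the histogram without drawing it
h <- hist(x, breaks = 20, plot = FALSE)
width <- diff(h$breaks)

# density = count / (number of observations * bin width)
manual_density <- h$counts / (length(x) * width)

# Compare against what hist() reports, and check the total area
same_scale <- isTRUE(all.equal(h$density, manual_density))
total_area <- sum(h$density * width)

print(same_scale)
print(total_area)
```

That is also why the y values in the plot are tiny: with bins thousands of steps wide, each bar's height has to be small for the total area to be 1.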
We need to now do a summary grouped by interval instead of date. I also did another summary data.frame and set up a color variable to help identify the maximum step interval.
interval_total <- datf %>% group_by(interval) %>% summarise(total = sum(steps, na.rm = T))
summaries2 <- interval_total %>% filter(total == max(total))
interval_total %<>% mutate(color = total == summaries2$total)
ggplot(interval_total, aes(x = interval, y = total)) +
geom_path(color = "grey") +
geom_point(aes( color = color), size = 2) +
scale_color_brewer(palette = "Set1", name = "Is Max?") +
geom_label(data = summaries2, aes(label = interval), hjust = -.5) +
geom_smooth(color = "black", se = F) +
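labs(title = "Which interval has the most step activity?\nInterval 835, i.e. around 8:35am",
y = "Total Steps", x = "Time of day (HHMM interval label, 5 min intervals)")
The black line shows the geom_smooth() loess line of best fit and confirms that the most activity happens around 8:35am. Careful reading the axis: the interval values are clock labels (835 means 08:35), not minutes since midnight, so the labels jump from 55 to 100 at each hour.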
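Here is a small sketch for turning an interval label into a clock time, assuming the labels are HHMM codes (so 835 means 08:35, not minute 835). That reading is my assumption; checking unique(datf$interval) for a jump from 55 straight to 100 would confirm it. The helper name interval_to_clock is hypothetical.

```r
# Hypothetical helper: convert an HHMM interval label to "HH:MM"
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  hours   <- interval %/% 100   # leading digits are the hour
  minutes <- interval %% 100    # last two digits are the minute
  sprintf("%02d:%02d", hours, minutes)
}

# Same idea, but as true minutes since midnight
interval_to_minutes <- function(interval) {
  interval <- as.integer(interval)
  (interval %/% 100) * 60 + interval %% 100
}

clock <- interval_to_clock(c(0, 55, 100, 835, 2355))
mins  <- interval_to_minutes(c(0, 55, 100, 835, 2355))
print(clock)
print(mins)
```

Plotting against interval_to_minutes() instead of the raw label would also remove the little gaps the raw labels leave between 55 and 100 in each hour.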
Just how many NA values are lurking in our datf?
table(complete.cases(datf))
##
## FALSE TRUE
## 2304 15264
We are missing step counts in a total of 2304 out of 17568 observations, or ~13%. Seems like a lot, so let’s approximate those missing values with the mean values (na.rm = TRUE of course) from their respective time interval.
# heads up: the column is called med, but it holds the interval *mean*
na_fill <- datf %>% group_by(interval) %>% summarise(med = mean(steps, na.rm = T))
datf2 <- merge(datf, na_fill, by = "interval")
datf2 %<>% mutate(steps = ifelse(is.na(steps), med, steps))
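For anyone who wants to poke at the imputation step on its own, here is a minimal base R sketch of the same idea on made-up data. It uses ave() instead of the merge() above (my substitution), which has the side benefit of keeping the rows in their original order, whereas merge() sorts by interval by default.

```r
# Toy data (made up): two intervals, three days, two missing readings
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(NA, 4, 2, 6, 4, NA)
)

# Per-interval mean, ignoring NAs, repeated for every row of that interval
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(s) mean(s, na.rm = TRUE))

# Fill only the missing readings; observed values are left alone
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

print(toy)
```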
Let’s see how our data imputation shifted the mean and median, using the same kind of plot as before.
daily_total2 <- datf2 %>% group_by(date) %>% summarise(steps = sum(steps))
summaries3 <- daily_total2 %>% summarise(avg = mean(steps, na.rm = T), med = median(steps, na.rm = T))
ggplot(daily_total2, aes(x = steps)) +
geom_histogram(aes(y = ..density..), bins = 20, fill = "grey") +
geom_density(color = "black") +
geom_vline(data = summaries3, aes(xintercept = avg), color = "blue", linetype = 2, alpha = .5, size = 2) +
geom_label(data = summaries3, aes(label = round(avg), x = avg), y = .000125, hjust = 1, color = "blue") +
geom_vline(data = summaries3, aes(xintercept = med), color = "red", linetype = 3, alpha = .5, size = 2) +
geom_label(data = summaries3, aes(label = round(med), x = med),y = .000135, hjust = 0, color = "red") +
labs(title = "#7 AFTER IMPUTING: Histogram and Density of daily step totals\nMean in blue; Median in red",
y = "Density", x = "Steps")
Well, that didn’t change things too much; I guess imputing in this case doesn’t make a huge difference. It tightened the distribution a little bit, but the mean and median values are essentially unchanged (okay, you’re right, the median increased by 1, big whoop-dee-doo!). We wouldn’t expect the mean to change at all, since the interval means are the values we replaced our NAs with.
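Why the mean stays put can be sketched with a toy example. The assumption here (mine, not checked against the real data) is that the NAs come as whole missing days; 2304 happens to be 8 × 288, which is what eight fully missing days of 5-minute readings would look like. A fully missing day filled with interval means ends up with a total equal to the average daily total, so it cannot move the mean.

```r
# Toy data (made up): 3 days x 2 intervals, day "d3" entirely missing
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 30, 20, 40, NA, NA)
)

# Daily totals before imputing: d3 is NA and gets dropped from the mean
before <- tapply(toy$steps, toy$date, sum)
mean_before <- mean(before, na.rm = TRUE)

# Impute with interval means, then recompute the daily totals
fill <- ave(toy$steps, toy$interval,
            FUN = function(s) mean(s, na.rm = TRUE))
toy$filled <- ifelse(is.na(toy$steps), fill, toy$steps)
after <- tapply(toy$filled, toy$date, sum)
mean_after <- mean(after)

print(mean_before)
print(mean_after)
```

If the NAs were scattered inside otherwise complete days, this would not hold exactly, so treat it as an explanation for this pattern of missingness only.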
Next question: are people more active on weekdays or on weekends? To test this we need to make a new variable called weekend that will be TRUE if the day is “Saturday” or “Sunday” and FALSE otherwise.
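A side note before the real code: matching on day names depends on the day labels being in English. As a hedged alternative, here is a base R sketch that builds the same kind of flag from the numeric day of the week, which does not depend on locale. The example dates are picked by me for illustration.

```r
# Friday through Monday, chosen for illustration
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt's wday field runs from 0 (Sunday) to 6 (Saturday)
day_number <- as.POSIXlt(dates)$wday

# TRUE for Saturday or Sunday, FALSE otherwise
weekend <- day_number %in% c(0, 6)

print(data.frame(dates, day_number, weekend))
```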
library(lubridate)
str(datf2)
## 'data.frame': 17568 obs. of 4 variables:
## $ interval: int 0 0 0 0 0 0 0 0 0 0 ...
## $ steps : num 1.72 0 0 47 0 ...
## $ date : chr "2012-10-01" "2012-10-02" "2012-10-03" "2012-10-04" ...
## $ med : num 1.72 1.72 1.72 1.72 1.72 ...
datf2$date %<>% ymd()
datf2 %<>% mutate(day = wday(date, label = T), weekend = grepl("S(at|un)", day))
weekdays <- filter(datf2, weekend == F) %>% group_by(interval) %>% summarise(total = sum(steps)) %>% mutate(day = "Weekday")
weekday_sum <- weekdays %>% filter(total == max(total)) %>% mutate(weekend = F)
weekends <- filter(datf2, weekend == T) %>% group_by(interval) %>% summarise(total = sum(steps)) %>% mutate(day = "Weekend")
weekend_sum <- weekends %>% filter(total == max(total)) %>% mutate(weekend = T)
days <- rbind(weekdays, weekends)
sums <- rbind(weekday_sum, weekend_sum)
ggplot(days, aes(x = interval, y = total)) +
geom_path(color = "grey") +
geom_point(aes(color = day), size = 2) +
scale_color_brewer(palette = "Set1", name = "Day of Week:") +
geom_smooth(color = "black", se = F) +
facet_wrap(~day, nrow = 2, labeller = "label_value") +
labs(title = "When are people more active Weekday or Weekend?\nDuring the week, guess people are relaxing more on the weekend, go figure",
y = "Total Steps", x = "Time of day (HHMM interval label, 5 min intervals)")
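I am genuinely a little surprised that people look less active on the weekends. Since I myself work at a computer most of the day, I relish the freedom the weekends bring to get outside and do something physical. One caveat: these are total steps, and there are simply more weekdays than weekend days in the data, so the weekday panel gets a head start; a per-day average would be a fairer comparison.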
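To see how much totals can mislead when the groups have different numbers of days, here is a toy sketch in base R. The data are made up so that every day is exactly equally active; the only difference is five weekdays versus two weekend days.

```r
# Made-up data: each day logs the same 100 steps in one interval
toy <- data.frame(
  day_type = c(rep("Weekday", 5), rep("Weekend", 2)),
  steps    = rep(100, 7)
)

# Totals make weekdays look far more active...
totals <- tapply(toy$steps, toy$day_type, sum)

# ...while per-day means show the two groups are identical
means <- tapply(toy$steps, toy$day_type, mean)

print(totals)
print(means)
```

In the pipeline above, the equivalent change would be swapping sum(steps) for mean(steps) in the weekday and weekend summaries; I have not rerun the plot that way, so I can't say how the comparison would come out.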
Anywho, jeez, that was a lot of code. Sorry to whoever is grading this that I didn’t get it in on time, but I was out of town this weekend (oh the irony). Hope it was a pleasant read, and please let me know any detailed feedback. Cheers, Nate.