Thank you for reading my markdown file for the first Peer Assessment in Reproducible Research.
First we load “activity.csv” into dataframe ‘activity’, which contains three columns (steps, date, and interval). I load pander, which I have not used before, because it allows me to set the diplay digits easiy. I load ggplot2 for the final two-panel chart.
library(pander)
library(ggplot2)
panderOptions('digits',2)
activity <- read.csv("activity.csv", stringsAsFactors = FALSE)
activity$date <- as.Date(activity$date)
I could use median(), but median is just a special case of the 50th quantile, so I used that for fun! I tried different bin values but ultimately settled on 10 bins (the default returns 5 bins).
steps.perday <- aggregate(activity["steps"], by=activity["date"],FUN="sum")
steps.perday.median <- quantile(steps.perday$steps, probs = 0.5, na.rm=TRUE)
steps.perday.mean <-mean(steps.perday$steps, na.rm=TRUE)
hist(steps.perday$steps, 10, freq=TRUE, main = "Histogram (NAs not treated)",
ylab = "Frequency (No of Days)", xlab="Steps", col="green")
In regard to the total number of steps per day over the period:
Now we consider the pattern over an average day. There are 288 5-minute intervals in each day (288 = 24 hours * 60 minutes / 5 minutes). For each 5-minute interval, this plots the average across all days.
steps.perinterval <- aggregate(activity["steps"], by=activity["interval"],
FUN="mean", na.rm = TRUE)
plot(steps.perinterval, type="l", main="average daily activity pattern")
interval.maxsteps <- max(steps.perinterval$steps)
interval.max <- steps.perinterval$interval[steps.perinterval$steps == max(steps.perinterval$steps)]
On average across all the days in the datase, the 5-minute interval which constains the maximum number of steps is interval 835 which contains (on average) 206.17 steps.
I used a very simple strategy to replace missing values (NAs): the NA is replaced by the average number of steps for that interval. For example, interval = 0 (the first interval) has an average of 1.717 steps, such that it will replace all NAs the occur during interval = 0.
number.nas <- sum(is.na(activity$steps))
df.merged <- merge(activity, steps.perinterval, by.x= "interval", by.y= "interval", all.x = TRUE)
suppressWarnings(df.merged$steps.x[is.na(df.merged$steps.x)] <- df.merged$steps.y)
index <- with(df.merged, order(date, interval))
new.table <- df.merged[index,]
newsteps.perday <- aggregate(new.table["steps.x"], by=new.table["date"],FUN="sum")
hist(newsteps.perday$steps, 10, freq=TRUE, main = "Histogram (NAs replaced)",
ylab = "Frequency (No of Days)", xlab="Steps", col="green")
newsteps.perday.median <- quantile(newsteps.perday$steps, probs = 0.5, na.rm=TRUE)
newsteps.perday.mean <- mean(newsteps.perday$steps, na.rm=TRUE)
In regard to the impact of replacing missing values, we can see:
In regard to the total number of steps per day over the period:
Finally, we essentially parse the above “average daily activity pattern” into two components: weekday average versus weekend average (for each 5-minute interval). We can observe:
new.table$weekday <- weekdays(new.table$date)
v.weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
v.weekend <- c("Saturday", "Sunday")
new.table$weekday[new.table$weekday %in% v.weekdays] <- "Weekday"
new.table$weekday[new.table$weekday %in% v.weekend] <- "Weekend"
new.table$weekday <- factor(new.table$weekday)
n.steps.perday <- aggregate(new.table["steps.x"], by=new.table[c("interval","weekday")],FUN="sum")
ggplot(n.steps.perday, aes(x=interval, y=steps.x)) +
geom_line() +
facet_grid(weekday ~ .)