Author: Sheng Li
activity <- read.csv("activity.csv", header = TRUE)
activity$date <- as.Date(as.character(activity$date))
summary(activity)
## steps date interval
## Min. : 0.00 Min. :2012-10-01 Min. : 0.0
## 1st Qu.: 0.00 1st Qu.:2012-10-16 1st Qu.: 588.8
## Median : 0.00 Median :2012-10-31 Median :1177.5
## Mean : 37.38 Mean :2012-10-31 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:2012-11-15 3rd Qu.:1766.2
## Max. :806.00 Max. :2012-11-30 Max. :2355.0
## NA's :2304
This analysis first examines the total number of steps taken per day when the missing values are kept in the dataset and skipped during summation (na.rm = TRUE), so that a day with no recorded data gets a total of 0. As seen in the histogram, the mean (red line) is lower than the median (dashed blue line) because of the days with a total of 0 steps.
library(plyr)
library(ggplot2)
stepsPerDayNA <- ddply(activity, "date", summarise, totalSteps = sum(steps, na.rm = TRUE))
cuts1 <- data.frame(Thresholds="Mean", vals = mean(stepsPerDayNA$totalSteps))
cuts2 <- data.frame(Thresholds="Median", vals = median(stepsPerDayNA$totalSteps))
cuts <- rbind(cuts1,cuts2)
ggplot(data = stepsPerDayNA, aes(x = totalSteps)) + geom_histogram() +
geom_vline(data=cuts, aes(xintercept=vals, linetype=Thresholds, colour = Thresholds), show.legend = TRUE) +
xlab("Total number of steps") + ggtitle("Total Number of Steps Taken Per Day (include missing values)")
The calculation shows that the mean total number of steps taken per day is 9354 steps, whereas the median total number of steps taken per day is 10395 steps.
mean(stepsPerDayNA$totalSteps)
## [1] 9354.23
median(stepsPerDayNA$totalSteps)
## [1] 10395
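The gap between the mean and the median comes from how `sum()` treats a day whose readings are all missing. A minimal sketch on made-up data (the `toy` data frame and both result names are hypothetical, not part of the activity dataset) shows that `na.rm = TRUE` turns such a day into a total of 0, while the default keeps it as `NA`:

```r
# Hypothetical toy data: day "b" has no recorded steps at all.
toy <- data.frame(
  date  = rep(c("a", "b", "c"), each = 2),
  steps = c(10, 20, NA, NA, 30, 50)
)

# With na.rm = TRUE the all-missing day is reported as a total of 0.
totalsZero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Without na.rm the all-missing day stays NA.
totalsNA <- tapply(toy$steps, toy$date, sum)

print(totalsZero)
print(totalsNA)

# The artificial zero pulls the mean down; dropping the NA day does not.
print(mean(totalsZero))
print(mean(totalsNA, na.rm = TRUE))
```

The zero total is an artifact of the summation, not a reading from the device, which is worth keeping in mind when interpreting the left-most bar of the histogram.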
Alternatively, the analysis can ignore the missing values in the dataset. A day with missing readings then has a total of NA and is dropped when the mean and median are computed. In the histogram, the mean and the median total number of steps taken per day cannot be told apart because the red line and the dashed blue line coincide.
stepsPerDay <- ddply(activity, "date", summarise, totalSteps = sum(steps))
cuts1 <- data.frame(Thresholds="Mean", vals = mean(stepsPerDay$totalSteps, na.rm = TRUE))
cuts2 <- data.frame(Thresholds="Median", vals = median(stepsPerDay$totalSteps, na.rm = TRUE))
cuts <- rbind(cuts1,cuts2)
ggplot(data = stepsPerDay, aes(x = totalSteps)) + geom_histogram() +
geom_vline(data=cuts, aes(xintercept=vals, linetype=Thresholds, colour = Thresholds), show.legend = TRUE) +
xlab("Total number of steps") + ggtitle("Total Number of Steps Taken Per Day (exclude missing values)")
The calculation reveals that the mean total number of steps taken per day is 10766 steps, whereas the median total number of steps taken per day is 10765 steps.
mean(stepsPerDay$totalSteps, na.rm = TRUE)
## [1] 10766.19
median(stepsPerDay$totalSteps, na.rm = TRUE)
## [1] 10765
First, I construct a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
intervalavg <- ddply(activity, "interval", summarise, avgSteps = mean(steps, na.rm = TRUE))
summary(intervalavg)
## interval avgSteps
## Min. : 0.0 Min. : 0.000
## 1st Qu.: 588.8 1st Qu.: 2.486
## Median :1177.5 Median : 34.113
## Mean :1177.5 Mean : 37.383
## 3rd Qu.:1766.2 3rd Qu.: 52.835
## Max. :2355.0 Max. :206.170
ggplot(data=intervalavg, aes(x=interval, y=avgSteps)) + geom_line() +
ggtitle("Average Daily Activity Pattern Per 5-min Interval") + xlab("Interval (24-hours)") +
ylab("Average Number of Steps Taken")
Averaged across all days in the dataset, the 5-minute interval with the highest number of steps is interval 835, with about 206 steps on average.
intervalavg[which.max(intervalavg$avgSteps),]
## interval avgSteps
## 104 835 206.1698
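The interval identifiers run from 0 to 2355, which suggests that they encode clock time as HHMM. Under that assumption, a small helper (the name `intervalToClock` is made up for this sketch) converts an identifier into a readable time of day:

```r
# Assumes the identifier is HHMM: hundreds are hours, the remainder is minutes.
intervalToClock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

print(intervalToClock(c(0, 835, 2355)))
```

Read this way, the peak at interval 835 would correspond to the five minutes starting at 08:35.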
First, I report that the dataset contains 2304 missing values, that is, 2304 rows in which the number of steps is NA.
table((is.na(activity$steps)))
##
## FALSE TRUE
## 15264 2304
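Before choosing an imputation strategy it can help to see where the missing values sit. The following sketch, on made-up data with hypothetical names (`toy`, `naPerColumn`, `naPerDay`), counts the NAs per column and per day; a per-day count equal to the number of intervals in a day means that the whole day is missing.

```r
# Hypothetical toy data: three intervals per day, the first day entirely missing.
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 0, 12, 7)
)

# Number of missing values in each column.
naPerColumn <- colSums(is.na(toy))

# Number of missing step readings on each day.
naPerDay <- tapply(is.na(toy$steps), toy$date, sum)

print(naPerColumn)
print(naPerDay)
```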
I then fill in all of the missing values and create a new dataset that is identical to the original except that the missing data are filled in. Each missing value is replaced with the average number of steps for its 5-minute interval.
averages <- aggregate(x=list(steps=activity$steps), by=list(interval=activity$interval), mean, na.rm=TRUE)
# Return the recorded value, or the interval average when the value is missing
fill_value <- function(steps, interval) {
  if (!is.na(steps))
    filled <- steps
  else
    filled <- averages[averages$interval == interval, "steps"]
  return(filled)
}
activityFilled <- activity
activityFilled$steps <- mapply(fill_value, activityFilled$steps, activityFilled$interval)
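The same imputation can also be written without a row-by-row function. As a sketch under the assumption that a vectorised form is acceptable, base R's `ave()` returns the group mean for every row, and only the missing entries are then overwritten. The data and the names `toy`, `intervalMean`, `isMissing` and `toyFilled` are made up for illustration.

```r
# Hypothetical toy data: two intervals observed on three days, first day missing.
toy <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, NA, 2, 10, 4, 20)
)

# For each row, the mean of its interval group, ignoring missing values.
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Copy the data and overwrite only the missing entries.
toyFilled <- toy
isMissing <- is.na(toyFilled$steps)
toyFilled$steps[isMissing] <- intervalMean[isMissing]

print(toyFilled)
```

Recorded values are left untouched, so the result should match what the `mapply` version produces on the same input.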
Next, I make a histogram of the total number of steps taken each day. Again, the plot shows that the mean and the median total number of steps taken per day cannot be distinguished since the red line and the dashed blue line are aligned together.
stepsPerDayFilled <- ddply(activityFilled, "date", summarise, totalSteps = sum(steps))
cuts1 <- data.frame(Thresholds="Mean", vals = mean(stepsPerDayFilled$totalSteps))
cuts2 <- data.frame(Thresholds="Median", vals = median(stepsPerDayFilled$totalSteps))
cuts <- rbind(cuts1,cuts2)
ggplot(data = stepsPerDayFilled, aes(x = totalSteps)) + geom_histogram() +
geom_vline(data=cuts, aes(xintercept=vals, linetype=Thresholds, colour = Thresholds), show.legend = TRUE) +
xlab("Total number of steps") + ggtitle("Total Number of Steps Taken Per Day (missing values filled)")
After filling in the missing values, the calculation reveals that the mean and the median total number of steps taken per day are both 10766 steps.
mean(stepsPerDayFilled$totalSteps)
## [1] 10766.19
median(stepsPerDayFilled$totalSteps)
## [1] 10766.19
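One plausible reading of the identical mean and median, assuming the missing values fall on whole days: every such day receives the same imputed interval profile, so its total equals the mean of the observed daily totals, and a run of identical totals in the middle of the distribution pins the median to that value. A minimal sketch on made-up data (all names hypothetical):

```r
# Hypothetical toy data: two intervals per day, day "d2" entirely missing.
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3", "d4"), each = 2),
  interval = rep(c(0, 5), times = 4),
  steps    = c(2, 10, NA, NA, 4, 20, 6, 30)
)

# Interval averages over the observed days, repeated for every row.
profile <- ave(toy$steps, toy$interval,
               FUN = function(x) mean(x, na.rm = TRUE))

# Fill the missing readings with the interval averages.
filledSteps <- ifelse(is.na(toy$steps), profile, toy$steps)

# Daily totals after filling.
dailyTotals <- tapply(filledSteps, toy$date, sum)

print(dailyTotals)
print(mean(dailyTotals))
print(median(dailyTotals))
```

In this toy case the filled day's total equals the mean of the three observed totals, and the mean and median of the filled totals coincide.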
First, I create a new factor variable in the dataset with two levels, “weekday” and “weekend,” to indicate whether a given date is a weekday or weekend day.
dayofWeek <- ifelse(weekdays(activityFilled$date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
activityFilled$day <- as.factor(dayofWeek)
dayActivity <- ddply(activityFilled, c("interval","day"), summarise, avgSteps=mean(steps))
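Note that `weekdays()` returns day names in the language of the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch, using made-up dates and hypothetical names (`dates`, `wday`, `dayType`), relies on the numeric weekday from `as.POSIXlt()` instead:

```r
# Hypothetical sample dates: a Friday, Saturday, Sunday and Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is 0 for Sunday through 6 for Saturday, whatever the locale.
wday <- as.POSIXlt(dates)$wday

# Two-level factor with an explicit level order.
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))

print(data.frame(date = dates, dayType = dayType))
```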
Next, I construct a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). According to the time series plot, the user was more active on weekdays than on weekends during the morning, between intervals 800 and 1000. The plot also indicates that the user was typically an early riser, since activity usually began shortly after interval 500. After interval 1000, however, the activity level was higher on weekends, perhaps because the user had to work on weekdays.
ggplot(dayActivity, aes(interval, avgSteps)) + geom_line(aes(colour=day)) + facet_grid(day ~ .) +
ggtitle("Average Daily Activity Per 5-Min Interval (Weekday vs Weekend)") + xlab("interval (24-hours)") +
ylab("Average Number of Steps Taken")