Coursera: Reproducible Research

Week 2: Course Project 1

Set Up the Working Directory

setwd("~/GitHub/RepData_PeerAssessment1")

1. Code for reading in the dataset and/or processing the data

Unzip the data

unzip(zipfile="activity.zip")

Read Data

data <- read.csv("activity.csv")

2. Histogram of the total number of steps taken each day

library(ggplot2)
total.steps <- tapply(data$steps, data$date, FUN=sum, na.rm=TRUE)
qplot(total.steps, binwidth=1000, xlab="total number of steps taken each day")

3. Mean and median number of steps taken each day

mean(total.steps, na.rm=TRUE)

## [1] 9354.23

median(total.steps, na.rm=TRUE)

## [1] 10395

4. Time series plot of the average number of steps taken

library(ggplot2)
averages <- aggregate(x=list(steps=data$steps), by=list(interval=data$interval),
                      FUN=mean, na.rm=TRUE)
ggplot(data=averages, aes(x=interval, y=steps)) +
    geom_line() +
    xlab("5-minute interval") +
    ylab("average number of steps taken")

5. The 5-minute interval that, on average, contains the maximum number of steps

averages[which.max(averages$steps),]

##     interval    steps
## 104      835 206.1698
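The interval identifier is easier to read as a clock time. The sketch below assumes the identifiers encode the time of day as HHMM, which the original text does not state, although it is consistent with interval 835 sitting in row 104 when each hour holds twelve 5-minute intervals. The helper name `interval.to.time` is hypothetical.

```r
# Assumption: interval identifiers encode clock time as HHMM (e.g. 835 = 08:35)
interval.to.time <- function(interval) {
    # Integer division gives the hour, the remainder gives the minute
    sprintf("%02d:%02d", interval %/% 100, interval %% 100)
}

print(interval.to.time(c(0, 55, 100, 835)))
```

Under that assumption, the busiest interval on average starts at 08:35.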

6. Code to describe and show a strategy for imputing missing data

There are many days/intervals with missing values (coded as NA). The presence of missing values may introduce bias into some calculations or summaries of the data.

missing <- is.na(data$steps)
# Total missing
table(missing)

## missing
## FALSE  TRUE 
## 15264  2304

Each missing value is replaced with the mean for its 5-minute interval, averaged across all days.

# Replace each missing value with the mean value of its 5-minute interval
fill.value <- function(steps, interval) {
    # Observed values are returned unchanged
    if (!is.na(steps))
        return(steps)
    # Missing values fall back to the mean of the matching interval
    averages[averages$interval == interval, "steps"]
}
filled.data <- data
filled.data$steps <- mapply(fill.value, filled.data$steps, filled.data$interval)
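The same strategy can also be written without a row-by-row helper. The following is a minimal sketch on a toy data frame (the values are made up and are not the activity dataset); it uses base R's `ave()` to compute the per-interval mean and `ifelse()` to substitute it only where `steps` is NA.

```r
# Toy stand-in for the activity data (hypothetical values, not the real dataset)
toy <- data.frame(
    steps    = c(NA, 10, 4, NA, 20, 6),
    interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of the observed steps within each interval, repeated for every row
interval.mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; substitute the interval mean only where steps is NA
toy$steps.filled <- ifelse(is.na(toy$steps), interval.mean, toy$steps)
print(toy)
```

Because `ave()` returns a vector aligned with the input rows, no lookup into a separate table of averages is needed.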

7. Histogram of the total number of steps taken each day after missing values are imputed

total.steps <- tapply(filled.data$steps, filled.data$date, FUN=sum)
qplot(total.steps, binwidth=1000, xlab="total number of steps taken each day")

mean(total.steps)

## [1] 10766.19

median(total.steps)

## [1] 10766.19

The mean and median are both higher after imputing the missing data, for the following reason.

In the original data, some days have NA for every interval.
Because the daily totals were computed with na.rm=TRUE, the total for each such day is 0 rather than NA.
After each missing value is replaced with the mean for its interval, these days no longer contribute 0 totals to the histogram of total steps per day, so both the mean and the median rise.
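The effect of na.rm=TRUE on a day with no observations can be reproduced on a small scale. This sketch uses a hypothetical two-day data frame in which day "A" has only missing values; the names `with.zeros` and `with.na` are illustrative.

```r
# Hypothetical two-day example: day "A" has no recorded steps at all
toy <- data.frame(
    steps = c(NA, NA, 100, 250),
    date  = c("A", "A", "B", "B")
)

# With na.rm=TRUE the empty sum for day A is 0, which looks like a real total
with.zeros <- tapply(toy$steps, toy$date, FUN = sum, na.rm = TRUE)

# Without na.rm the day stays NA and can be excluded from the summaries
with.na <- tapply(toy$steps, toy$date, FUN = sum)

print(with.zeros)
print(with.na)
print(mean(with.zeros))             # the 0 for day A pulls the mean down
print(mean(with.na, na.rm = TRUE))  # mean over days that have data
```

Here the first mean is 175 and the second is 350, which mirrors how the zero totals lower the summary statistics in the original data.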

8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

Determine whether each measurement falls on a weekday or a weekend, using the dataset with the filled-in values.

weekday.or.weekend <- function(date) {
    day <- weekdays(date)
    if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
        return("weekday")
    else if (day %in% c("Saturday", "Sunday"))
        return("weekend")
    else
        stop("invalid date")
}
filled.data$date <- as.Date(filled.data$date)
filled.data$day <- sapply(filled.data$date, FUN=weekday.or.weekend)
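Note that `weekdays()` returns day names in the current locale, so the English names above only match when R runs in an English locale. A possible locale-independent variant is sketched below; the function name `classify.day` and the sample dates are hypothetical. It relies on the `wday` field of `as.POSIXlt()`, which is 0 for Sunday through 6 for Saturday.

```r
# Classify dates without relying on locale-specific day names.
# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday.
classify.day <- function(date) {
    wday <- as.POSIXlt(date)$wday
    ifelse(wday %in% c(0, 6), "weekend", "weekday")
}

# Hypothetical sample dates: a Friday, a Saturday, a Sunday, and a Monday
sample.dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
print(classify.day(sample.dates))
```

This version is vectorized, so it can be applied to a whole date column directly instead of through `sapply()`.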

Panel plot of the average number of steps taken per 5-minute interval on weekdays and on weekends.

averages <- aggregate(steps ~ interval + day, data=filled.data, mean)
ggplot(averages, aes(interval, steps)) + geom_line() + facet_grid(day ~ .) +
    xlab("5-minute interval") + ylab("Number of steps")