Coursera: Reproducible Research

Week 2: Course Project 1


Setup Working Directorty

setwd("~/GitHub/RepData_PeerAssessment1")

1. Code for reading in the dataset and/or processing the data

Unzip the data

unzip(zipfile="activity.zip")

Read Data

data <- read.csv("activity.csv")

2. Histogram of the total number of steps taken each day

library(ggplot2)
total.steps <- tapply(data$steps, data$date, FUN=sum, na.rm=TRUE)
qplot(total.steps, binwidth=1000, xlab="total number of steps taken each day")

3. Mean and median number of steps taken each day

mean(total.steps, na.rm=TRUE)
## [1] 9354.23
median(total.steps, na.rm=TRUE)
## [1] 10395

4. Time series plot of the average number of steps taken

library(ggplot2)
averages <- aggregate(x=list(steps=data$steps), by=list(interval=data$interval),
                      FUN=mean, na.rm=TRUE)
ggplot(data=averages, aes(x=interval, y=steps)) +
    geom_line() +
    xlab("5-minute interval") +
    ylab("average number of steps taken")

5. The 5-minute interval that, on average, contains the maximum number of steps

averages[which.max(averages$steps),]
##     interval    steps
## 104      835 206.1698

6. Code to describe and show a strategy for imputing missing data

There are many days/intervals where there are missing values (coded as NA).The presence of missing days may introduce bias into some calculations or summaries of the data.

missing <- is.na(data$steps)
# Total missing
table(missing)
## missing
## FALSE  TRUE 
## 15264  2304

All of the missing values are filled in with mean value for that 5-minute interval.

# Replace each missing value with the mean value of its 5-minute interval
fill.value <- function(steps, interval) {
    filled <- NA
    if (!is.na(steps))
        filled <- c(steps)
    else
        filled <- (averages[averages$interval==interval, "steps"])
    return(filled)
}
filled.data <- data
filled.data$steps <- mapply(fill.value, filled.data$steps, filled.data$interval)

7. Histogram of the total number of steps taken each day after missing values are imputed

total.steps <- tapply(filled.data$steps, filled.data$date, FUN=sum)
qplot(total.steps, binwidth=1000, xlab="total number of steps taken each day")

mean(total.steps)
## [1] 10766.19
median(total.steps)
## [1] 10766.19

Mean and median values are higher after imputing missing data. The reason is

  • In the original data, there are some days with steps values NA for any interval.
  • The total number of steps taken in such days are set to 0s by default.
  • After replacing missing steps values with the mean steps of associated interval value, these 0 values are removed from the histogram of total number of steps taken each day.

8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

  • Calculating day of the week for each measurement in the dataset.
  • Using the dataset with the filled-in values.
weekday.or.weekend <- function(date) {
    day <- weekdays(date)
    if (day %in% c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))
        return("weekday")
    else if (day %in% c("Saturday", "Sunday"))
        return("weekend")
    else
        stop("invalid date")
}
filled.data$date <- as.Date(filled.data$date)
filled.data$day <- sapply(filled.data$date, FUN=weekday.or.weekend)

panel plot containing plots of average number of steps taken on weekdays and weekends.

averages <- aggregate(steps ~ interval + day, data=filled.data, mean)
ggplot(averages, aes(interval, steps)) + geom_line() + facet_grid(day ~ .) +
    xlab("5-minute interval") + ylab("Number of steps")