Reproducible Research: Activity Monitoring

Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device, which collects data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) for one anonymous individual and records the number of steps taken in each 5-minute interval of each day.

Below we describe and show all steps taken to investigate the data set, including R code.

Loading and preprocessing the data

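First we set our working directory, unzip the data, read the unzipped CSV file into a data frame, and convert the date variable from text to a date class with lubridate's ymd(). (Recent lubridate releases return a Date object here; older releases returned POSIXct.)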

suppressMessages(library(lubridate))
setwd("~/git/RepData_PeerAssessment1")
unzip("activity.zip")
df.activity <- read.csv("activity.csv")
df.activity$date <- ymd(df.activity$date)
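As a quick sanity check on the loading step, the sketch below parses a three-row stand-in for activity.csv and confirms the column classes. The rows and the names csv.text and df.toy are invented for illustration; as.Date() is used as a base-R counterpart of ymd(), assuming the dates are written as YYYY-MM-DD.

```r
# Toy stand-in for activity.csv (values invented for illustration)
csv.text <- "steps,date,interval
NA,2012-10-01,0
5,2012-10-02,0
12,2012-10-02,5"

df.toy <- read.csv(text = csv.text, stringsAsFactors = FALSE)

# Base-R counterpart of ymd(): parse ISO-formatted text into a Date
df.toy$date <- as.Date(df.toy$date, format = "%Y-%m-%d")

# Inspect the resulting column classes
str(df.toy)
```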

What is the mean total number of steps taken per day?

We calculate the total number of steps taken per day and plot this in a histogram.

df.steps.per.day <- aggregate(steps ~ date, data = df.activity, sum)
hist(df.steps.per.day$steps, xlab="Steps per day", main="Histogram of steps per day (raw data)")
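One detail worth keeping in mind: the formula interface of aggregate() uses na.action = na.omit by default, so rows with missing steps are dropped before summing and a day with no recorded steps at all vanishes from the result. The toy data below (invented values, hypothetical names) illustrates the difference from a sum with na.rm = TRUE, which keeps such a day but reports it as zero.

```r
# Toy data: the first day is entirely missing, the second is complete
df.toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)

# Formula interface: NA rows are omitted, so 2012-10-01 does not appear
per.day.formula <- aggregate(steps ~ date, data = df.toy, sum)

# tapply with na.rm = TRUE keeps the day but counts it as 0 steps
per.day.tapply <- tapply(df.toy$steps, df.toy$date, sum, na.rm = TRUE)

per.day.formula
per.day.tapply
```

Which behavior is preferable depends on the question; for a mean over days with data, dropping empty days (as the report does) avoids pulling the average toward zero.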

The mean and median of the total number of steps taken per day are calculated as follows.

mean.steps.per.day <- mean(df.steps.per.day$steps)
median.steps.per.day <- median(df.steps.per.day$steps)
mean.steps.per.day

## [1] 10766.19

median.steps.per.day

## [1] 10765

What is the average daily activity pattern?

Next, we create a time series plot of the average number of steps taken in each 5-minute interval, averaged across all days.

df.average.steps.per.interval <- aggregate(steps ~ interval, data = df.activity, mean)
plot(df.average.steps.per.interval, type="l", main="Average number of steps by time of day")

The 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps is:

df.average.steps.per.interval$interval[which.max(df.average.steps.per.interval$steps)]

## [1] 835
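To read an interval identifier as a clock time, a small helper can be used. The sketch below assumes that identifiers encode the time as hour * 100 + minute (so 835 would mean 08:35); the report does not state this encoding, and interval.to.time is a hypothetical helper name.

```r
# Assumption: interval identifiers encode clock time as hour * 100 + minute
interval.to.time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval.to.time(c(0, 55, 100, 835, 2355))
# [1] "00:00" "00:55" "01:00" "08:35" "23:55"
```

Under that assumption, plotting against the raw identifier leaves gaps on the x-axis (for example between 55 and 100), so converting to minutes since midnight may give a more faithful time axis.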

Imputing missing values

Total number of rows with missing values in the dataset:

sum(apply(df.activity, 1, function(row) any(is.na(row))))

## [1] 2304
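It can also help to see where the missing values sit before choosing an imputation method. The following sketch, on invented toy data with hypothetical names, counts NAs per column and computes the share of missing step counts per day; a share of 1 marks a day with no recorded steps at all.

```r
# Toy data: the first day has no recorded steps, the second is complete
df.toy <- data.frame(
  steps    = c(NA, NA, 3, 7),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5)
)

# Number of missing values in each column
na.per.column <- colSums(is.na(df.toy))

# Fraction of missing step counts per day (1 = whole day missing)
na.share.per.day <- tapply(is.na(df.toy$steps), df.toy$date, mean)

na.per.column
na.share.per.day
```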

We now create a new dataset, df.activity.imputed, in which missing values are imputed by linear interpolation between the closest non-NA rows:

df.activity.imputed <- df.activity
df.activity.imputed$steps <- approxfun(seq_along(df.activity$steps), df.activity$steps, method="linear", rule=2)(seq_along(df.activity$steps))
if(any(is.na(df.activity.imputed))) {print("WARN: NA values in df.activity.imputed")}
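To make the behavior of this call concrete, here is the same approxfun() pattern applied to a short invented vector. Interior gaps are filled along the straight line between the neighboring known values, and rule=2 fills leading and trailing gaps with the nearest known value.

```r
# Toy vector with a leading gap, an interior gap and a trailing gap
x <- c(NA, 2, NA, NA, 8, NA)
idx <- seq_along(x)

# approxfun() ignores the NA points and interpolates between the rest;
# rule = 2 extends the first/last known value to the ends
filled <- approxfun(idx, x, method = "linear", rule = 2)(idx)

filled
# [1] 2 2 4 6 8 8
```

Note that a long gap bounded by two small values is filled with small values throughout, whatever the typical activity at those times of day.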

Histogram of the total number of steps taken each day after imputation:

df.steps.per.day.imputed <- aggregate(steps ~ date, data = df.activity.imputed, sum)
hist(df.steps.per.day.imputed$steps, xlab="Steps per day", main="Histogram of steps per day (after imputation)")

Mean and median total number of steps taken per day after imputation:

mean.steps.per.day.imputed <- mean(df.steps.per.day.imputed$steps)
median.steps.per.day.imputed <- median(df.steps.per.day.imputed$steps)
mean.steps.per.day.imputed

## [1] 9354.23

median.steps.per.day.imputed

## [1] 10395

Recall the non-imputed versions of these metrics:

mean.steps.per.day

## [1] 10766.19

median.steps.per.day

## [1] 10765

We conclude that linear interpolation of the missing data lowers the mean total number of steps per day by 1411.96 (from 10766.19 to 9354.23) and the median by 370 (from 10765 to 10395).
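A back-of-the-envelope check suggests one possible reason for the drop. The 2304 missing rows correspond to exactly 8 days' worth of 5-minute intervals. If those rows form 8 whole days, and if interpolation gives those days close to zero steps (both are assumptions, not verified here), the imputed mean should be roughly the raw mean scaled by 53/61. The variable names below are illustrative.

```r
mean.raw     <- 10766.19        # mean over days with data (reported above)
days.total   <- 31 + 30         # October + November
missing.rows <- 2304            # rows with NA (reported above)
rows.per.day <- 24 * 60 / 5     # 288 five-minute intervals per day

# Assumption: the missing rows form whole days
days.missing  <- missing.rows / rows.per.day   # 8
days.observed <- days.total - days.missing     # 53

# Assumption: imputed days contribute (nearly) zero steps
expected.mean <- mean.raw * days.observed / days.total
round(expected.mean, 2)
# [1] 9354.23
```

The result matches the imputed mean reported above, which is consistent with the interpolated days adding almost no steps rather than typical daily totals.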

Differences in activity patterns between weekdays and weekends

We create a new factor variable in the df.activity.imputed dataset with two levels, “weekday” and “weekend”, indicating whether a given date is a weekday or a weekend day.

weekorweekend <- function(date) {
    # Vectorized: classifies a whole vector of dates in one call
    ifelse(weekdays(date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
}
df.activity.imputed$weekorweekend <- as.factor(weekorweekend(df.activity.imputed$date))
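Because weekdays() returns day names in the current locale, the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent sketch, using the numeric day of week from as.POSIXlt() (0 = Sunday, 6 = Saturday), could look as follows; is.weekend is a hypothetical helper and the dates are illustrative.

```r
# Numeric day of week: 0 = Sunday, ..., 6 = Saturday (independent of locale)
is.weekend <- function(date) as.POSIXlt(date)$wday %in% c(0, 6)

# Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

day.type <- factor(ifelse(is.weekend(dates), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
day.type
# [1] weekday weekend weekend weekday
# Levels: weekday weekend
```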

Finally, we make a panel plot containing a time series plot of the 5-minute interval and the average number of steps taken, averaged across all weekday days or weekend days.

df.average.steps.per.interval.and.weekorweekday.imputed <- aggregate(steps ~ interval + weekorweekend, data = df.activity.imputed, mean)
suppressMessages(library(ggplot2))
ggplot(data = df.average.steps.per.interval.and.weekorweekday.imputed, aes(x=interval, y=steps)) + facet_grid(weekorweekend ~ .) + geom_line() + ggtitle("Average activity pattern weekdays vs weekends")

We see some clear differences between weekday and weekend days.

Conclusion

Using a few simple chunks of R code (all shown inline) we’ve been able to process a zipped data file with human activity data and gain insight into the following questions:

What is the mean total number of steps taken per day?
What is the average daily activity pattern?
How can missing values be imputed?
Are there differences in activity patterns between weekdays and weekends?