Introduction

This report analyzes data from a personal activity monitor worn by an anonymous individual. The data were collected during October and November 2012 and record the number of steps taken in each 5-minute interval throughout the day. Variables in the data set include:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: the date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
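
The interval identifiers in this data set appear to encode the time of day as hour and minute (for example, 835 for 08:35) rather than a running count. That reading is an assumption based on the values that occur in the data. A minimal sketch of converting an identifier to minutes since midnight and to a clock label, where interval_to_minutes and interval_to_label are hypothetical helper names:

```r
# Assumption: identifiers are HHMM without leading zeros (0, 5, ..., 55, 100, ..., 2355)
interval_to_minutes <- function(interval) {
  (interval %/% 100) * 60 + interval %% 100
}

interval_to_label <- function(interval) {
  sprintf("%02d:%02d",
          as.integer(interval %/% 100),
          as.integer(interval %% 100))
}

interval_to_minutes(c(0, 55, 100, 835))  # 0 55 60 515
interval_to_label(c(0, 55, 100, 835))    # "00:00" "00:55" "01:00" "08:35"
```

Under this assumption, consecutive identifiers are not evenly spaced (55 is followed by 100), which matters when an identifier is used directly as a plot axis.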

Loading and processing the data

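# Download and unpack the data only if it is not already present
if (!file.exists("activity.csv")) {
    # mode = "wb" keeps the zip archive intact on Windows
    download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", destfile = "activity.zip", mode = "wb")
    unzip("activity.zip")
}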

# Loading Libraries
library(dplyr)
library(knitr)

# Load data
act <- read.csv('activity.csv')
act <- data.frame('steps'=as.integer(act$steps),  
                  'date'=as.Date(act$date),  
                  'interval'=as.integer(act$interval))

Initial Analysis: Mean, Median, Maximum

Mean and Median for the total steps taken per day:

Note that the mean is somewhat lower than the median. This is largely because days on which every measurement is missing sum to 0 when na.rm = TRUE is used, and these zero totals pull the mean down.

act_steps <- tapply(act$steps, act$date, sum, na.rm=T)
mean(act_steps)

## [1] 9354.23

median(act_steps)

## [1] 10395
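
To see how missing data affects these summaries, it helps to separate days with no recorded values from days with genuine totals. The sketch below uses a small made-up data frame (toy is illustrative, not the real data) to show that sum(..., na.rm = TRUE) turns an all-NA day into a total of 0, whereas dropping such days first keeps them out of the mean:

```r
# Toy data: three days, two intervals each; the second day has no recorded values
toy <- data.frame(
  steps = c(10L, 20L, NA, NA, 30L, 50L),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2))
)

# Daily totals computed as in the report: an all-NA day becomes 0
totals_na_rm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Flag days on which every measurement is missing
all_missing <- tapply(is.na(toy$steps), toy$date, all)

# Daily totals restricted to days with at least one recorded value
totals_observed <- totals_na_rm[!all_missing]

mean(totals_na_rm)     # (30 + 0 + 80) / 3
mean(totals_observed)  # (30 + 80) / 2
```

Which of the two means is appropriate depends on whether a day without data should count as a day without steps.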

What is the interval with the maximum number of steps?

The single measurement with the most steps falls in interval 615 on 2012-11-27, with a value of 806 steps.

maximum <- which(act$steps == max(act$steps, na.rm = TRUE))
act[maximum,]

##       steps       date interval
## 16492   806 2012-11-27      615

This value is not an isolated outlier, as the table of measurements above the 99.95th percentile (the top 0.05%) shows:

# Threshold for the top 0.05% of measurements (99.95th percentile)
top_0005 <- quantile(act$steps, 0.9995, na.rm = TRUE)
kable(act[which(act$steps > top_0005),], caption='Top 0.05% max steps per interval', align = "c")

Top 0.05% max steps per interval
row      steps   date         interval
3277     802     2012-10-12   900
4136     786     2012-10-15   835
10194    785     2012-11-05   925
14024    785     2012-11-18   1635
14201    789     2012-11-19   720
15745    785     2012-11-24   1600
16487    794     2012-11-27   550
16492    806     2012-11-27   615

Visualizing the data:

Frequency of steps taken by measurement:

The skew of the data toward 0 is apparent in the histogram of steps per 5-minute measurement:

hist(act$steps, main="Frequency of Steps Taken", xlab="steps", ylab="frequency")

Frequency of total steps taken by day

The plot of total steps taken per day shows that, although most individual measurements are 0, the daily totals are roughly Gaussian:

hist(aggregate(steps ~ date, act, sum)$steps, main ="Sum of Steps per Day", xlab = "Steps per Day")

Daily Activity pattern

ptrn <- tapply(act$steps, act$interval, mean, na.rm = T)
plot(ptrn, type="l", main = "Daily Activity Pattern", ylab="mean steps", xlab = "interval (position in day)")
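
The peak of the averaged pattern can also be read off programmatically rather than from the plot. A minimal sketch on made-up data (toy and its values are illustrative only), relying on the fact that tapply() labels each mean with its group, so names() holds the interval identifiers:

```r
# Toy data: two days, three intervals each
toy <- data.frame(
  steps    = c(0L, 40L, 10L, NA, 60L, 20L),
  interval = rep(c(0L, 5L, 10L), times = 2)
)

# Mean steps per interval across days, ignoring missing values
ptrn_toy <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

peak_pos      <- which.max(ptrn_toy)                     # position along the plot's x-axis
peak_interval <- as.integer(names(ptrn_toy)[peak_pos])   # interval identifier at that position
peak_value    <- unname(ptrn_toy[peak_pos])              # mean steps at the peak
```

Keeping the position and the identifier apart avoids confusing, say, the 105th interval of the day with the interval labelled 105.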

Imputing missing values

Missing values are imputed with the mean number of steps for the corresponding interval:

find the mean for each interval
in a for loop, if steps is NA, take the mean steps for that interval; otherwise, keep the observed step value
collect the observed and mean step values in a vector
replace the original steps column with this vector, so that interval means stand in for the NA values

# Mean steps for each interval (the formula interface drops NA rows)
ags <- aggregate(steps ~ interval, data = act, FUN = mean)

# Start from the observed steps; as.numeric() leaves room for fractional means
na_fill <- as.numeric(act$steps)
for (i in seq_len(nrow(act))) {
    if (is.na(act$steps[i])) {
        # Replace a missing value with the mean for that row's interval
        na_fill[i] <- ags$steps[ags$interval == act$interval[i]]
    }
}

act_new <- act
act_new$steps <- na_fill
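
The same imputation can also be expressed without a loop. A minimal sketch using base R's ave(), shown on a made-up data frame (toy and toy_new are illustrative names, not part of the analysis above):

```r
# Toy data: three days, two intervals each, with two missing values
toy <- data.frame(
  steps    = c(NA, 4L, 2L, NA, 6L, 8L),
  interval = rep(c(0L, 5L), times = 3)
)

# ave() returns, for every row, the mean of that row's interval group
interval_mean <- ave(as.numeric(toy$steps), toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values and fill only the missing ones
toy_new <- toy
toy_new$steps <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

toy_new$steps  # 4 4 2 6 6 8
```

Because ave() keeps the original row order, no lookup by interval is needed, and the result can be checked with anyNA(toy_new$steps).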

Mean, Median of new dataset:

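The new mean and median are larger than in the original data set because days that consisted entirely of NA values, which previously counted as totals of 0, now sum to the total of the interval means. Those imputed days all share the same total, which is why the median now equals the mean.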

act_new_steps <- tapply(act_new$steps, act_new$date, FUN = sum)
mean(act_new_steps)

## [1] 10766.19

median(act_new_steps)

## [1] 10766.19

This is also visible in the frequency plot for the sum of steps per day: the only change is in the central bucket, 10,000-15,000 steps, because NA values were imputed with mean values.

hist(act_new_steps, main = "New total steps per day", xlab="steps per day")

Are there differences in activity patterns between weekdays and weekends?

The daily pattern for weekends is similar to that for weekdays, but noisier. In both patterns, there is a large jump in steps taken around the 105th interval, followed by a dip for the rest of the day. On weekends, the intervals after the 105th contain much more noise. Also, more steps are generally taken on weekends.
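
The comparison above depends on labelling each date as a weekday or a weekend day and averaging steps per interval within each group. The plotting code for this section is not shown in the report, so the following is only a sketch of that step on made-up data, assuming the same steps, date, and interval columns as act_new; the day_type column and the toy values are hypothetical:

```r
# Toy data: Friday, Saturday, Sunday, Monday; two intervals per day
toy <- data.frame(
  steps    = c(10, 20, 30, 40, 50, 70, 5, 15),
  date     = as.Date(rep(c("2012-10-05", "2012-10-06",
                           "2012-10-07", "2012-10-08"), each = 2)),
  interval = rep(c(0L, 5L), times = 4)
)

# $wday is 0 for Sunday and 6 for Saturday, independent of the locale
wday <- as.POSIXlt(toy$date)$wday
toy$day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                       levels = c("weekday", "weekend"))

# Mean steps per interval, separately for weekdays and weekends
pattern <- aggregate(steps ~ interval + day_type, data = toy, FUN = mean)
pattern
```

Using the numeric day of the week instead of weekdays() avoids depending on the language of the day names. The two groups in pattern can then be drawn as separate line plots for comparison.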

Activity Monitoring

Ben McCary

2016-05-15