ACTIVITY MONITORING ANALYSIS

Loading and preprocessing the data

First we load the data and set the class of each column, then look at the structure of the resulting data frame.

# Read the raw data, declaring the class of each column up front
activity <- read.csv("activity.csv", header = TRUE, colClasses = c("numeric", "Date", 
    "integer"))
str(activity)
## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

What is the mean total number of steps taken per day?

library(ggplot2)
library(plyr)
# Total steps per day, ignoring missing values
perDay <- ddply(activity, "date", summarise, totalSteps = sum(steps, na.rm = TRUE))
# Histogram of the daily totals
qplot(totalSteps, data = perDay, binwidth = 500)

Figure: histogram of total steps per day (binwidth = 500).

with(perDay, mean(totalSteps, na.rm = T))
## [1] 9354
with(perDay, median(totalSteps, na.rm = T))
## [1] 10395
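
One detail worth checking in the totals above: sum(x, na.rm = TRUE) returns 0 when every value in x is NA, so a day with no recordings is counted as a zero-step day and pulls the mean down. The sketch below illustrates this on a made-up two-day data frame (toy is hypothetical, not the real data) and uses base R's tapply in place of ddply.

```r
# Hypothetical toy data: day 1 has no recordings, day 2 has two intervals
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, 20)
)
day <- format(toy$date)  # character key for grouping

# Same idea as the ddply call: per-day totals with NAs dropped
totals <- tapply(toy$steps, day, sum, na.rm = TRUE)

# Flag days on which every value is missing
allMissing <- tapply(is.na(toy$steps), day, all)

meanWithZeros    <- mean(totals)               # the all-NA day counts as 0
meanObservedOnly <- mean(totals[!allMissing])  # the all-NA day is left out
```

On this toy data the two means differ (15 versus 30), which is the kind of gap to look for when comparing results before and after imputation.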

What is the average daily activity pattern?

# Mean steps for each interval, averaged across all days
perInterval <- ddply(activity, "interval", summarise, meanSteps = mean(steps, 
    na.rm = TRUE))
qplot(interval, meanSteps, data = perInterval, geom = "line")

Figure: line plot of mean steps per interval, averaged across all days.

with(perInterval, interval[which.max(meanSteps)])
## [1] 835
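
The value 835 is easier to read as a time of day. The sketch below assumes the interval column packs hours and minutes into one integer, so that 835 means 08:35; this is an assumption about the dataset's encoding that the output above does not itself confirm. intervalToTime is a hypothetical helper, not part of the original analysis.

```r
# Hypothetical helper: convert an HHMM-style integer to an "HH:MM" string.
# Assumes the interval is encoded as hours * 100 + minutes.
intervalToTime <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

peak <- intervalToTime(835L)
print(peak)
```

If that encoding holds, the x-axis of the line plot is not strictly linear in time, since the values jump from 55 to 100 at each hour boundary.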

Imputing missing values

Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.

with(activity, sum(is.na(steps)))
## [1] 2304
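
The total alone does not say how the missing values are spread across days. A minimal sketch for counting them per day, shown on made-up data (toy is hypothetical) and using base R only:

```r
# Hypothetical toy data: two missing values on day 1, one on day 2
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 10, NA)
)

# Number of missing step counts for each day
naPerDay <- tapply(is.na(toy$steps), format(toy$date), sum)
print(naPerDay)
```

The same call with activity in place of toy would show whether the missing values are scattered or concentrated in whole days.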

Here I use the impute function from the Hmisc package, which fills each missing value with the mean of the observed values of that variable (steps).

library(Hmisc, warn.conflicts = F)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
completeActivity <- transform(activity, steps = impute(steps, mean))
byDay <- ddply(completeActivity, "date", summarise, totalSteps = sum(steps, 
    na.rm = T))
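
For readers without Hmisc, the same overall-mean fill can be sketched in base R. This is meant as a rough equivalent of impute(steps, mean) rather than a replacement for it, and the steps vector here is made up for illustration.

```r
# Hypothetical vector with two missing values
steps <- c(NA, 10, 20, NA, 30)

# Replace each NA with the mean of the observed values
observedMean <- mean(steps, na.rm = TRUE)
filled <- ifelse(is.na(steps), observedMean, steps)
print(filled)
```

Every missing entry receives the same value, so this approach ignores the time-of-day pattern seen in the interval plot; filling with per-interval means would be one alternative.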

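After imputing the missing values, both the mean and the median of the daily totals changed: the mean rose from 9354 to 10766 and the median from 10395 to 10766.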

qplot(totalSteps, data = byDay, binwidth = 500)

Figure: histogram of total steps per day after imputation (binwidth = 500).

with(byDay, mean(totalSteps, na.rm = T))
## [1] 10766
with(byDay, median(totalSteps, na.rm = T))
## [1] 10766
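
The identical mean and median can be checked with arithmetic on figures already reported above. The sketch assumes 288 five-minute intervals per day (24 * 60 / 5) and that the 2304 missing values fall on whole days. Neither is shown directly in the output, so treat this as a plausibility check rather than a proof.

```r
# Figures taken from the output earlier in this report
nObs          <- 17568  # rows in the data frame
nMissing      <- 2304   # missing step counts
meanWithZeros <- 9354   # mean daily total before imputation (rounded)

# Assumption: one row per 5-minute interval, 288 per day
intervalsPerDay <- 24 * 60 / 5
nDays        <- nObs / intervalsPerDay
nMissingDays <- nMissing / intervalsPerDay  # assumes whole days are missing

# Spread the observed total over only the days that have data
totalSteps       <- meanWithZeros * nDays
meanObservedDays <- totalSteps / (nDays - nMissingDays)
print(round(meanObservedDays))
```

Under those assumptions each fully missing day receives 288 copies of the overall interval mean, so its imputed total equals the mean of the observed days, about 10766. Several identical days at that value sitting in the middle of the distribution would explain why the median lands on the same number.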

Are there differences in activity patterns between weekdays and weekends?

The weekdays() function is used here to flag weekend dates, working from the dataset with the filled-in missing values.

# weekdays() returns day names in the session locale; English names are assumed here
completeActivity$weekends <- factor(weekdays(completeActivity$date) %in% c("Saturday", 
    "Sunday"), levels = c(FALSE, TRUE), labels = c("Weekday", "Weekend"))
perInterval <- ddply(completeActivity, .(weekends, interval), summarise, meanSteps = mean(steps))
qplot(interval, meanSteps, data = perInterval, geom = "line", facets = . ~ weekends)
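
Because weekdays() returns names in the session's locale, matching on "Saturday" and "Sunday" only works where day names are in English. A locale-independent sketch uses the numeric weekday from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. isWeekend is a hypothetical helper and the dates are chosen for illustration.

```r
# Hypothetical helper: TRUE for Saturday (6) or Sunday (0), regardless of locale
isWeekend <- function(date) {
  as.POSIXlt(date)$wday %in% c(0, 6)
}

# Friday through Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Explicit levels keep the labels correct even if only one day type is present
dayType <- factor(isWeekend(dates), levels = c(FALSE, TRUE),
                  labels = c("Weekday", "Weekend"))
print(dayType)
```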

Figure: line plots of mean steps per interval, in side-by-side panels for weekdays and weekends.