Loading and preprocessing the data

Load the CSV and take a peek at the first few rows.

activity <- read.csv("activity.csv")
head(activity)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

A quick summary of the data.

summary(activity)
##      steps               date          interval   
##  Min.   :  0.0   2012-10-01:  288   Min.   :   0  
##  1st Qu.:  0.0   2012-10-02:  288   1st Qu.: 589  
##  Median :  0.0   2012-10-03:  288   Median :1178  
##  Mean   : 37.4   2012-10-04:  288   Mean   :1178  
##  3rd Qu.: 12.0   2012-10-05:  288   3rd Qu.:1766  
##  Max.   :806.0   2012-10-06:  288   Max.   :2355  
##  NA's   :2304    (Other)   :15840
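
The summary shows the date column being treated as categorical labels rather than as dates. If date arithmetic is needed later, it can be converted first. A minimal sketch on a made-up data frame with the same three columns (the name `toy` and its values are illustrative, not taken from activity.csv):

```r
# Toy data frame mimicking the structure of activity.csv (values are made up).
toy <- data.frame(
  steps    = c(NA, 0, 12),
  date     = c("2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 0, 5),
  stringsAsFactors = FALSE
)

# Convert the ISO-formatted date strings to Date objects.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Inspect the column classes after conversion.
str(toy)
```

The same `as.Date()` call should work on the real column whether `read.csv` returned it as a factor or as character.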

What is the mean total number of steps taken per day?

I like plyr’s ddply for summarizing.

library(plyr)
# Total steps per day; with na.rm = TRUE, a day that has only NA values sums to 0.
daily.activity <- ddply(activity, .(date), summarize, total.steps = sum(steps, na.rm = TRUE))
daily.steps.hist <- hist(daily.activity$total.steps, main = "Steps Histogram", xlab = "Steps per day", breaks = 10)

[Figure: Steps Histogram, steps per day]

Let’s look at the mean and median.

mean(daily.activity$total.steps, na.rm = TRUE)
## [1] 9354
median(daily.activity$total.steps, na.rm = TRUE)
## [1] 10395
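
One thing to keep in mind when reading these numbers: `sum(x, na.rm = TRUE)` returns 0 when every element of `x` is NA, so a day with no recorded values counts as a zero-step day rather than being dropped. The sketch below illustrates the effect on made-up data; it uses base R's `tapply` instead of `ddply`, and `sum.or.na` is a hypothetical helper written for this example.

```r
# Toy data: day 1 is entirely missing, day 2 has real measurements.
toy <- data.frame(
  steps = c(NA, NA, 10, 20),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")
)

# sum(..., na.rm = TRUE) over an all-NA day yields 0, not NA.
totals.zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Hypothetical alternative: keep a day as NA when it has no observed values.
sum.or.na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
totals.na <- tapply(toy$steps, toy$date, sum.or.na)

mean(totals.zero)              # the all-NA day pulls the mean down
mean(totals.na, na.rm = TRUE)  # the all-NA day is excluded
```

Which treatment is appropriate depends on the question being asked; the point is only that the two give different means.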

What is the average daily activity pattern?

To get the daily activity pattern we take the number of steps taken for each of the day’s intervals, averaged over all days.

avg.interval.steps <- ddply(activity, .(interval), summarize, avg.steps = mean(steps, na.rm = TRUE))
plot(avg.interval.steps, type = "l", xlab = "Interval", ylab = "Average Steps", main = "Daily Activity Pattern")

[Figure: Daily Activity Pattern, average steps by interval]

The pattern seems to make sense. Activity is low at first, during sleeping hours, followed by a flurry of activity in the morning and then a gradual decline through the afternoon and evening as people wind down for the day.
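
To pin down where the morning peak sits, one option is `which.max` on the averaged data. The sketch below runs on a made-up stand-in for `avg.interval.steps` (the values are invented), and it assumes the interval labels encode clock time as HHMM, which the 0 to 2355 range in the summary suggests.

```r
# Toy stand-in for avg.interval.steps (values are made up).
toy.avg <- data.frame(
  interval  = c(0, 5, 830, 835, 2355),
  avg.steps = c(2, 1, 90, 120, 3)
)

# Row with the highest average step count.
peak <- toy.avg[which.max(toy.avg$avg.steps), ]

# Assuming HHMM-coded labels, format the peak interval as "HH:MM".
peak.time <- sprintf("%02d:%02d", peak$interval %/% 100, peak$interval %% 100)
peak.time
```

If the HHMM assumption holds, the x axis of the line plot also jumps from 55 to 100 at each hour boundary, which is worth remembering when reading the slopes.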

Imputing missing values

How many rows have missing steps values?

sum(is.na(activity$steps))
## [1] 2304
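
With 288 intervals per day, 2304 is exactly 8 × 288, which would be consistent with eight whole days being missing, although the count alone does not prove it. A quick way to check is to count the NA values per date. The sketch below does this on made-up data using base R; the variable names are illustrative.

```r
# Toy data: day 1 is entirely missing, day 2 is missing one of two values.
toy <- data.frame(
  steps = c(NA, NA, 10, NA),
  date  = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")
)

# Number of missing step values on each date.
na.per.day <- tapply(is.na(toy$steps), toy$date, sum)

# Number of rows on each date.
rows.per.day <- tapply(toy$steps, toy$date, length)

# Dates on which every interval is missing.
all.missing.days <- names(na.per.day)[na.per.day == rows.per.day]
all.missing.days
```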

Let’s break it up into two sets: complete and missing.

activity.missing <- activity[is.na(activity$steps), ]
activity.complete <- activity[complete.cases(activity), ]

We fill in the missing data using the average interval values computed earlier.

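# merge() joins on the shared "interval" column, adding avg.steps to each missing row.
activity.missing.filled <- merge(activity.missing, avg.interval.steps)
# Overwrite steps with the interval average and restore the original column order.
activity.missing <- transform(activity.missing.filled, steps = avg.steps)[, c("steps", "date", "interval")]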

Then we combine it with the complete data to form a new dataset.

activity.filled <- rbind(activity.complete,activity.missing)
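
Note that splitting and re-binding changes the row order of the data. As a possible alternative (not the approach used above), base R's `ave` can compute the per-interval mean in place, so the rows stay where they were. A minimal sketch on made-up data:

```r
# Toy data: day 1 is entirely missing; days 2 and 3 are complete.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 40),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# For every row, the mean of the observed steps in that row's interval.
interval.mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched.
toy.filled <- toy
toy.filled$steps <- ifelse(is.na(toy$steps), interval.mean, toy$steps)
toy.filled
```

Both approaches fill each gap with the same interval average; the difference is only in row order and in how many intermediate data frames are created.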

Now we recompute the daily totals and compare the histogram to the earlier one.

daily.activity.filled <- ddply(activity.filled, .(date), summarize, total.steps = sum(steps, na.rm = TRUE))
daily.steps.hist <- hist(daily.activity.filled$total.steps, main = "Steps Histogram \n (missing data imputed from interval averages)", xlab = "Steps per day", breaks = 10)

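[Figure: histogram of steps per day after filling in missing values]

Note how the lowermost bin is now much smaller. In the first histogram, any day whose step values were all NA summed to 0 because of na.rm=TRUE and fell into the lowest bin; with the gaps filled, those days now carry a typical daily total instead.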

Let’s look at the mean and median.

mean(daily.activity.filled$total.steps)
## [1] 10766
median(daily.activity.filled$total.steps)
## [1] 10766
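
The mean and median coming out identical is less surprising than it looks. If the missing values occur only as whole days (an assumption, not something verified here), every filled day gets the same total: the sum of the interval averages, which equals the mean daily total of the observed days. Several identical days at exactly the mean make it likely that the median lands on one of them. The sketch below checks the underlying arithmetic on made-up data.

```r
# Toy data: day 1 is entirely missing; days 2 and 3 are complete.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 40),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean steps per interval over the observed rows, summed over a day.
interval.means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)
filled.day.total <- sum(interval.means)

# Mean of the daily totals over the days that have data.
observed <- toy[!is.na(toy$steps), ]
observed.day.mean <- mean(tapply(observed$steps, observed$date, sum))

# The two quantities agree when every observed day is complete.
all.equal(filled.day.total, observed.day.mean)
```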

Are there differences in activity patterns between weekdays and weekends?

We add a weekday column, then an is.weekend column.

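# weekdays() returns day names in the current locale; the comparison below assumes English names.
activity.filled <- transform(activity.filled, weekday = weekdays(as.POSIXct(date)))
# factor() orders the levels FALSE, TRUE, so FALSE becomes "Weekday" and TRUE becomes "Weekend".
activity.filled <- cbind(activity.filled, is.weekend = factor(activity.filled$weekday %in% c("Saturday", "Sunday"), labels = c("Weekday", "Weekend")))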
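
Because `weekdays()` returns locale-dependent names, comparing against "Saturday" and "Sunday" can silently fail on a non-English system. One locale-independent option is the numeric day of week stored by `as.POSIXlt`. A minimal sketch on a handful of dates from the study period (the variable names are illustrative):

```r
# Friday through Monday of the first week in the data.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the day of week as an integer: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(dates)$wday

# Explicit levels keep the FALSE/TRUE to label mapping unambiguous.
is.weekend <- factor(wday %in% c(0, 6),
                     levels = c(FALSE, TRUE),
                     labels = c("Weekday", "Weekend"))
data.frame(dates, wday, is.weekend)
```

Setting `levels` explicitly also guards against the case where a subset contains only weekdays or only weekends, in which a `labels` vector of length two would no longer match the single observed level.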

We compute the per-interval average separately for weekends and weekdays.

interval.steps <- ddply(activity.filled, .(interval, is.weekend), summarize, avg.steps = mean(steps, na.rm = TRUE))
library(lattice)
xyplot(avg.steps ~ interval | is.weekend, interval.steps, type = "l", layout = c(1, 2))

[Figure: average steps by interval, weekday and weekend panels]

Looks like weekend mornings aren’t as hectic as weekday mornings, although the rest of the weekend day is more active.