Load the CSV and take a peek at the first few rows.
activity <- read.csv("activity.csv")
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
A quick summary of the data.
summary(activity)
## steps date interval
## Min. : 0.0 2012-10-01: 288 Min. : 0
## 1st Qu.: 0.0 2012-10-02: 288 1st Qu.: 589
## Median : 0.0 2012-10-03: 288 Median :1178
## Mean : 37.4 2012-10-04: 288 Mean :1178
## 3rd Qu.: 12.0 2012-10-05: 288 3rd Qu.:1766
## Max. :806.0 2012-10-06: 288 Max. :2355
## NA's :2304 (Other) :15840
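I like plyr’s ddply for summarizing. Here it totals the steps for each day, and a histogram shows how those daily totals are distributed.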
library(plyr)
daily.activity <- ddply(activity, .(date), summarize, total.steps = sum(steps, na.rm = TRUE))
daily.steps.hist <- hist(daily.activity$total.steps, main = "Steps Histogram", xlab = "Steps per day", breaks = 10)
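For readers who would rather not depend on plyr, the same per-day total can probably be had from base R's aggregate(). The following is a minimal sketch on a made-up toy data frame (`toy` and `daily.toy` are illustrative names, not part of the analysis); `na.action = na.pass` keeps the NA rows so that `na.rm = TRUE` inside sum() decides how they are handled, which mirrors the ddply call above.

```r
# Toy stand-in for the activity data (values are made up for illustration).
toy <- data.frame(
  steps = c(1, 2, NA, 4, 5, 6),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# Sum steps within each date. na.pass stops aggregate() from dropping NA rows
# up front; na.rm = TRUE is forwarded to sum().
daily.toy <- aggregate(steps ~ date, data = toy, FUN = sum,
                       na.rm = TRUE, na.action = na.pass)
names(daily.toy)[2] <- "total.steps"
print(daily.toy)
```

The result has one row per date with a total.steps column, so it should be usable with the hist() call in the same way.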
Let’s look at the mean and median.
mean(daily.activity$total.steps, na.rm = TRUE)
## [1] 9354
median(daily.activity$total.steps, na.rm = TRUE)
## [1] 10395
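One thing to keep in mind when reading these two numbers: sum() with na.rm = TRUE returns 0 for a day whose steps are all missing, so such a day counts as a zero-step day rather than as a missing day. The sketch below shows the effect on a made-up toy data set (all names and values here are illustrative assumptions, not taken from activity.csv).

```r
# Toy data (made up): day "b" has no recorded steps at all.
toy <- data.frame(
  date  = rep(c("a", "b", "c"), each = 2),
  steps = c(10, 20, NA, NA, 30, 40)
)

# With na.rm = TRUE the all-NA day gets a total of 0.
totals <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)
print(totals)

# That zero pulls the mean of the daily totals down.
mean.with.zero.days <- mean(totals)

# Alternative: mark all-NA days as NA so they drop out of the mean.
all.na <- tapply(is.na(toy$steps), toy$date, all)
totals.na <- ifelse(all.na, NA, totals)
mean.without.empty.days <- mean(totals.na, na.rm = TRUE)

print(c(mean.with.zero.days, mean.without.empty.days))
```

Which treatment is right depends on the question being asked; the point is only that the two choices give different summaries.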
To get the daily activity pattern, we average the number of steps in each 5-minute interval across all days.
avg.interval.steps <- ddply(activity, .(interval), summarize, avg.steps = mean(steps, na.rm = TRUE))
plot(avg.interval.steps, type = "l", xlab = "Interval", ylab = "Average Steps", main = "Daily Activity Pattern")
The pattern seems to make sense. Activity is low at first while people are asleep, then there is a flurry of activity in the morning, followed by a gradual decline through the afternoon and evening as people wind down for the day.
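A caveat when reading the x-axis, assuming the interval identifiers encode clock time as HHMM (the 0 to 2355 range in the summary and the 288 rows per day suggest this, but the data description is not shown here): the values jump from 55 to 100 at every hour, so the axis is not evenly spaced in time. A minimal sketch of converting such identifiers to minutes since midnight, with a made-up sample and a hypothetical helper `to.minutes`:

```r
# Hypothetical sample of interval identifiers, assumed to be HHMM clock times.
interval <- c(0, 5, 55, 100, 1230, 2355)

# Hypothetical helper: the hundreds are hours, the remainder is minutes.
to.minutes <- function(x) (x %/% 100) * 60 + x %% 100

minutes <- to.minutes(interval)

# Human-readable labels for the same intervals.
labels <- sprintf("%02d:%02d",
                  as.integer(interval %/% 100),
                  as.integer(interval %% 100))

print(data.frame(interval, minutes, labels))
```

Plotting against minutes instead of the raw identifier would remove the gaps, if the HHMM assumption holds.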
How many rows have missing steps values?
sum(is.na(activity$steps))
## [1] 2304
Let’s break it up into two sets: complete and missing.
activity.missing <- activity[is.na(activity$steps),]
activity.complete<-activity[complete.cases(activity),]
We fill in the missing data using the average interval values computed earlier.
activity.missing.filled<-merge(activity.missing,avg.interval.steps)
activity.missing <- transform(activity.missing.filled,steps=avg.steps)[,c("interval","steps","date")]
Then we combine it with the complete data to form a new dataset.
activity.filled <- rbind(activity.complete,activity.missing)
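The split, merge, and rbind steps leave the rows in a different order from the original data. That does not matter for the summaries that follow, but if row order needs to be preserved, the same fill can be sketched in place with match(). The toy data frame and the names `avg` and `filled` below are made up for illustration.

```r
# Toy stand-in for the activity data (values are made up).
toy <- data.frame(
  steps    = c(NA, 4, 6, NA, 2, 8),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Per-interval means; the formula interface drops NA rows by default.
avg <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Look up each missing row's interval in the table of means and fill it in.
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- avg$steps[match(filled$interval[missing], avg$interval)]

print(filled)
```

Rows that already had a value are untouched, and the dates and intervals stay in their original positions.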
Compare the histogram of the filled data with the earlier one.
daily.activity.filled <- ddply(activity.filled, .(date), summarize, total.steps = sum(steps, na.rm = TRUE))
daily.steps.hist <- hist(daily.activity.filled$total.steps, main = "Steps Histogram \n (missing data imputed from interval avg)", xlab = "Steps per day", breaks = 10)
Note how the lowermost bin is now much smaller, most likely because days whose steps were all missing no longer total 0.
Let’s look at the mean and median of the filled data.
mean(daily.activity.filled$total.steps)
## [1] 10766
median(daily.activity.filled$total.steps)
## [1] 10766
We add a weekday column, then an is.weekend column.
activity.filled <- transform(activity.filled, weekday = weekdays(as.POSIXct(date)))
activity.filled$is.weekend <- factor(activity.filled$weekday %in% c("Saturday", "Sunday"), labels = c("Weekday", "Weekend"))
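Note that weekdays() returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A locale-independent variant can be sketched with the numeric day of the week from POSIXlt; the example dates and the names `wday` and `day.type` are illustrative assumptions.

```r
# Example dates; 2012-10-06 and 2012-10-07 are a Saturday and a Sunday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# POSIXlt stores the day of the week as 0 (Sunday) through 6 (Saturday),
# regardless of the locale used for day names.
wday <- as.POSIXlt(dates)$wday

# Explicit levels keep the label order fixed even if one class is absent.
day.type <- factor(wday %in% c(0, 6),
                   levels = c(FALSE, TRUE),
                   labels = c("Weekday", "Weekend"))

print(data.frame(dates, day.type))
```

Setting the levels explicitly also guards against the labels being misassigned when the data happens to contain only weekdays or only weekends.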
We get the per-interval average for weekends and weekdays.
interval.steps <- ddply(activity.filled, .(interval, is.weekend), summarize, avg.steps = mean(steps, na.rm = TRUE))
library(lattice)
xyplot(avg.steps ~ interval | is.weekend, interval.steps, type = "l", layout = c(1, 2))
Looks like weekend mornings aren’t as hectic as weekday mornings, although the rest of the day is more active on weekends.