It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data set covers two months of readings from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.
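As a quick sanity check on the expected size of the data set, the arithmetic can be sketched as below. The variable names are illustrative, and the row count assumes one record for every 5-minute interval of every day in October and November 2012, which the description implies but does not state outright.

```r
# Illustrative arithmetic only; these names are not part of the assignment.
intervals_per_day <- (24 * 60) / 5   # number of 5-minute intervals in 24 hours
days <- 31 + 30                      # days in October plus days in November
expected_rows <- intervals_per_day * days

intervals_per_day
expected_rows
```

Comparing `expected_rows` with `nrow(data)` after loading the file is one way to confirm that no intervals are absent altogether.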
unzip("activity.zip")
data <- read.table("activity.csv", header=TRUE, quote="\"", sep=",", na.strings="NA")
stepsperday <- tapply(data$steps, data$date, sum, na.rm=TRUE)
hist(stepsperday,xlab="Steps", main="Steps per Day")
meanDailySteps <- mean(stepsperday)
medianDailySteps <- median(stepsperday)
The mean number of daily steps is 9354.23.
The median number of daily steps is 10395.
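One detail worth keeping in mind when reading these figures: `sum(..., na.rm=TRUE)` returns 0 for a day on which every reading is missing, so such a day counts as a zero-step day in the mean and median. The sketch below uses made-up toy data (not the assignment data) to show the effect, and contrasts it with the formula interface of `aggregate()`, which drops rows with missing values before summing. Whether the real data contain fully missing days is not checked here.

```r
# Toy data (made up): day "b" has no recorded steps at all.
toy <- data.frame(
  date  = c("a", "a", "b", "b", "c", "c"),
  steps = c(10, 20, NA, NA, 30, 40)
)

# sum(..., na.rm = TRUE) turns the all-NA day into a total of 0.
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# aggregate() with a formula omits NA rows first, so day "b" disappears.
totals_drop <- aggregate(steps ~ date, data = toy, FUN = sum)

mean(totals_zero)        # (30 + 0 + 70) / 3
mean(totals_drop$steps)  # (30 + 70) / 2
```

Neither choice is wrong in itself, but the two give different summary statistics, so it helps to state which one a reported mean is based on.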
stepsByInterval <- aggregate(steps ~ interval, data=data,FUN="mean")
plot(stepsByInterval$interval, stepsByInterval$steps, type="l", xlab="Interval",ylab="Steps",main="Mean steps by interval")
maxStepInterval <- stepsByInterval[which.max(stepsByInterval$steps),]
The 5-minute interval with the highest mean number of steps, averaged across all days, is interval 835.
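An interval identifier such as 835 is easier to interpret as a clock time. The helper below is a hypothetical sketch, and it assumes that the identifiers encode the time of day as hour * 100 + minute (so 835 would mean 08:35). That encoding is an assumption and should be checked against the data before relying on it.

```r
# Hypothetical helper; assumes interval = hour * 100 + minute.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 55, 100, 835))
```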
nacount <- sum(is.na(data$steps))
There are a total of 2304 rows with missing data for steps.
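With 288 five-minute intervals in a day, 2304 missing rows is exactly 2304 / 288 = 8 days' worth of readings, which suggests (but does not prove) that whole days may be missing. A per-day count of missing values would settle this; the sketch below shows the idea on made-up toy data with three intervals per day.

```r
# Toy data (made up): three readings per day, day "d2" entirely missing.
toy <- data.frame(
  date  = rep(c("d1", "d2", "d3"), each = 3),
  steps = c(5, NA, 7, NA, NA, NA, 1, 2, 3)
)

# Count missing readings per day.
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day

# Days on which every interval is missing (3 intervals per day in this toy set).
names(na_per_day)[na_per_day == 3]
```

On the real data the same call would be `tapply(is.na(data$steps), data$date, sum)`, with 288 in place of 3.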
I am using the mice package with its linear regression method ("norm.predict") to predict the missing values.
library(mice)
impData <- mice(data, maxit=10,method="norm.predict")
completeData <- complete(impData,action=1)
cstepsperday <- tapply(completeData$steps, completeData$date, sum, na.rm=TRUE)
hist(cstepsperday,xlab="Steps", main="Steps per Day")
cmeanDailySteps <- round(mean(cstepsperday), digits=4)
cmedianDailySteps <- median(cstepsperday)
The mean daily steps for the complete data set is 10716.97 and the median is 10395. Imputing the missing data using predictive linear regression has increased the mean daily steps from 9354.23; the reported median is the same as the value reported before imputation.
I was skeptical that the imputation method should shift the summary statistics this much, so I repeated the process using other imputation methods, and all of them yielded results in the same range. After this sanity check, I kept the linear regression method because it is the one I understand best.
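For comparison, a much simpler imputation is to replace each missing value with the mean of the same interval across the other days. This is not the method used above, and not necessarily one of the alternatives that were tried; it is only a base-R sketch on made-up toy data, useful as a baseline when sanity-checking a model-based imputation.

```r
# Toy data (made up): two intervals observed on three days.
toy <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(2, 10, NA, 20, 4, NA)
)

# Mean of each interval over the non-missing days, aligned row by row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; fill gaps with the interval mean.
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
toy$steps_filled
```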
Use the timeDate library to add a factor column to completeData indicating whether each date falls on a weekday or a weekend.
library(timeDate)
completeData$date <- as.Date(completeData$date)
completeData$day.type <- "weekday"
weekends <- isWeekend(completeData$date)
completeData$day.type[weekends] <- "weekend"
completeData$day.type <- as.factor(completeData$day.type)
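The same classification can be sketched in base R without an extra package. The version below is an alternative, not the code used in this report; it relies on `as.POSIXlt()$wday`, which numbers days from 0 (Sunday) to 6 (Saturday) and does not depend on the locale's day names. The dates are a small illustrative sample.

```r
# Illustrative sample: Friday through Monday in October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday and 6 for Saturday, regardless of locale.
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

data.frame(dates, day_type)
```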
Now we can plot the mean steps by time interval, with weekdays and weekends in separate panels.
library(ggplot2)
cstepsByInterval <- aggregate(steps ~ interval + day.type, data=completeData,FUN="mean")
ggplot(cstepsByInterval,aes(x= interval, y=steps)) + geom_line(color="steelblue") + facet_grid(day.type ~ .) + labs(x="Interval", y="Average Steps")