It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them.

This analysis makes use of data from a personal activity monitoring device that records data at 5-minute intervals throughout the day. The dataset covers two months of observations from an anonymous individual, collected during October and November 2012, and contains the number of steps taken in each 5-minute interval of each day.

Data

The data for this analysis resides here:

The variables included in this dataset are:

Loading and preprocessing the data

data <- read.csv(unz("activity.zip","activity.csv"))
data$date <- as.Date(data$date)

What is mean total number of steps taken per day?

x <- aggregate(steps~date,data,sum)[,2]
hist(x,breaks=20,col = "aquamarine3",main = "Steps per Day Histogram",xlab = "Sum of Steps per Day")

mean(x)
median(x)

What is the average daily activity pattern?

x <- aggregate(steps~interval,data,mean)
plot(x$interval, x$steps, type="l", main = "Average daily activity pattern", xlab="5 min interval", ylab="Average steps taken")

x[x$steps==max(x$steps),]

Imputing missing values

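# Total number of missing step values:
sum(is.na(data$steps))
# Missing values per date; na.action = NULL keeps the NA rows so they can be counted:
x <- aggregate(steps~date,data=data, function(y) {sum(is.na(y))}, na.action = NULL)
colnames(x) <- c("date","numberNAs")
library(knitr)
# Show only the dates that have missing values:
kable(x[x$numberNAs>0,],format="markdown")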

      date          numberNAs
 1    2012-10-01          288
 8    2012-10-08          288
32    2012-11-01          288
35    2012-11-04          288
40    2012-11-09          288
41    2012-11-10          288
45    2012-11-14          288
61    2012-11-30          288

We can observe that the NA values appear in full-day clusters rather than being randomly scattered across the dataset. On 8 specific dates, the monitoring data for the entire day (all 288 intervals) is missing.
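
A quick arithmetic check shows why 288 missing values correspond to exactly one full day of 5-minute intervals. The variable names below are illustrative and are not used elsewhere in this analysis.

```r
# Number of 5-minute intervals in a 24-hour day
intervals_per_day <- 24 * 60 / 5
intervals_per_day
# [1] 288

# Eight fully missing days would account for this many NA values
missing_days <- 8
missing_days * intervals_per_day
# [1] 2304
```

If those eight days are the only source of missing data, the total returned by sum(is.na(data$steps)) should match the second figure.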

I propose filling these empty days with the results from part 3, the average daily activity pattern. The average for each interval will be rounded to the nearest whole number, since step counts are integers.
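
To make the idea concrete, here is a minimal sketch of the same strategy on toy data, using match() to look up each interval's average. The data frame toy and its values are hypothetical, and this is an illustrative alternative rather than the code used in this analysis.

```r
# Toy data (hypothetical values): 2 intervals per day, second day fully missing
toy <- data.frame(
  steps    = c(10, 20, NA, NA, 14, 24),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Rounded mean steps per interval; the formula interface drops NA rows by default
avg <- aggregate(steps ~ interval, toy, mean)
avg$steps <- round(avg$steps)

# Look up the interval average for every row, then use it only where steps is NA
filled <- toy
lookup <- avg$steps[match(filled$interval, avg$interval)]
missing <- is.na(filled$steps)
filled$steps[missing] <- lookup[missing]
filled$steps
# [1] 10 20 12 22 14 24
```

The observed rows are left untouched; only the two NA rows receive the interval averages 12 and 22.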

# Vector of rounded average step values per interval, as calculated in part 3:
averageValSteps <- round(aggregate(steps~interval,data,mean)[,2])
# Vector of dates with missing data, shown in the table above:
dates           <- x[x$numberNAs>0,1]
# Vector of distinct intervals:
intervals       <- unique(data$interval)
# Pair each interval with its average, then cross join with the missing dates
# (merge() with no shared columns returns every interval/date combination):
filledData      <- as.data.frame(cbind(averageValSteps,intervals))
filledData      <- merge(filledData,dates)
filledData      <- filledData[,c(1,3,2)]
colnames(filledData) <- c("steps","date","interval")
# Combine the observed rows with the filled rows and restore chronological order:
fullData <- rbind(data[!is.na(data$steps),],filledData)
fullData <- fullData[order(fullData$date,fullData$interval),]
rownames(fullData) <- seq_len(nrow(fullData))
x <- aggregate(steps~date,fullData,sum)[,2]
hist(x,breaks=20,col = "chartreuse3",main = "Steps per Day Histogram",xlab = "Sum of Steps per Day")

mean(x)
median(x)

As expected, replacing the missing values with interval averages pulled the mean and median toward the total number of steps in a day made up entirely of those averages: 10762.
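
The effect can be illustrated with a small sketch on made-up daily totals; the numbers below are hypothetical and are not taken from the dataset. Adding days whose total equals the observed mean leaves the mean unchanged and moves the median to that value.

```r
# Hypothetical daily totals for days with complete data
observed <- c(8000, 10000, 12000, 15000)
mean(observed)
# [1] 11250
median(observed)
# [1] 11000

# Add two imputed days, each set to the observed mean
imputed <- c(observed, rep(mean(observed), 2))
mean(imputed)
# [1] 11250
median(imputed)
# [1] 11250
```

In this analysis the imputed daily total comes from rounded interval averages, so it need not equal the observed mean exactly, and the mean can shift slightly as well.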

Are there differences in activity patterns between weekdays and weekends?

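# Abbreviated day names depend on the session locale; this assumes English names:
x <- factor(weekdays(fullData[,2],abbreviate=TRUE))
# Collapse the seven day names into two levels:
levels(x) <- list(weekday=c("Mon","Tue","Wed","Thu","Fri"),weekend=c("Sat","Sun"))
fullData[,"day"] <- x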
x <- aggregate(steps~interval+day,fullData,mean)
library(lattice)
xyplot(steps~interval|day,data=x,layout=c(1,2),type="l")
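
Because weekdays() returns day names in the session locale, the classification above relies on English abbreviations. A locale-independent sketch, assuming the "%u" format code (weekday as a number, Monday = 1) is available on the platform, could look like this. The vector example_dates is hypothetical.

```r
# Hypothetical dates: a Friday, a Saturday, a Sunday, and a Monday
example_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# "%u" gives the weekday number as text: "1" = Monday ... "7" = Sunday
day_number <- as.integer(format(example_dates, "%u"))

# Days 6 and 7 are Saturday and Sunday
day <- factor(ifelse(day_number >= 6, "weekend", "weekday"),
              levels = c("weekday", "weekend"))
as.character(day)
# [1] "weekday" "weekend" "weekend" "weekday"
```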

The weekday pattern resembles the overall average of steps per interval much more closely than the weekend pattern does. This difference would have been important to take into account when imputing the missing values in part 4.
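
One way to take the day type into account would be to impute each missing value with the average for its interval and day type, instead of the overall interval average. The following is a minimal sketch on toy data, not part of the analysis above; the data frame toy and its values are hypothetical.

```r
# Toy data (hypothetical values): one interval, observed on two weekdays and
# two weekend days, plus one missing weekday value and one missing weekend value
toy <- data.frame(
  steps    = c(100, 120, 20, 40, NA, NA),
  interval = rep(0, 6),
  day      = factor(c("weekday", "weekday", "weekend", "weekend",
                      "weekday", "weekend"))
)

# Mean steps within each interval/day-type group, ignoring NAs
group_mean <- ave(toy$steps, toy$interval, toy$day,
                  FUN = function(s) mean(s, na.rm = TRUE))

# Replace only the missing entries with their rounded group mean
missing <- is.na(toy$steps)
toy$steps[missing] <- round(group_mean[missing])
toy$steps
# [1] 100 120  20  40 110  30
```

The missing weekday value is filled with the weekday average (110) and the missing weekend value with the weekend average (30), so the two patterns are not blended together.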