Data Analysis: Human Steps Recognition

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This analysis makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The dataset consists of two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval of each day.

Data

The data for this analysis resides here:

Dataset: Activity monitoring data [52K]

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which measurement was taken
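
As a rough consistency check on this description, the expected size of the dataset can be worked out from the recording scheme. The sketch below assumes the device logged every 5-minute slot on every calendar day of October and November 2012; the variable names are illustrative and not part of the original analysis.

```r
# All calendar days in the two-month recording window
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")

# Number of 5-minute slots in a 24-hour day
intervals_per_day <- (24 * 60) / 5

# Expected number of rows if no slot was skipped (an assumption)
expected_rows <- length(days) * intervals_per_day
expected_rows
```

Comparing this figure with nrow(data) after loading would show whether any slots are absent from the file altogether, as opposed to present but coded as NA.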

Loading and preprocessing the data

Reading from zipped file:

data <- read.csv(unz("activity.zip","activity.csv"))

Converting the “date” variable to Date format:

data$date <- as.Date(data$date)

What is the mean total number of steps taken per day?

Total number of steps taken per day:

x <- aggregate(steps~date,data,sum)[,2]

Histogram of total number of steps taken per day:

hist(x,breaks=20,col = "aquamarine3",main = "Steps per Day Histogram",xlab = "Sum of Steps per Day")

Mean of total number of steps per day: 10766.19

mean(x)

Median of total number of steps per day: 10765

median(x)
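
One detail worth keeping in mind for these figures: with the formula interface, aggregate() drops rows containing NA by default (na.action = na.omit), so days with no recorded steps are left out of the totals rather than counted as zero. The toy data frame below is made up purely to illustrate the difference; it is not taken from the dataset.

```r
# Made-up example: three days, the middle one entirely missing
toy <- data.frame(
  steps = c(10, 20, NA, NA, 5, 15),
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2))
)

# Formula interface: rows with NA are removed first, so the all-NA day disappears
totals_dropped <- aggregate(steps ~ date, toy, sum)[, 2]

# Alternative: keep every day and sum with na.rm = TRUE, so the all-NA day becomes 0
totals_zeroed <- as.vector(tapply(toy$steps, toy$date, sum, na.rm = TRUE))

mean(totals_dropped)  # averages over the 2 days that have data
mean(totals_zeroed)   # averages over all 3 days, one of them counted as 0
```

The first approach, which is the one used in this analysis, avoids dragging the mean down with artificial zero-step days.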

What is the average daily activity pattern?

Average number of steps taken per interval:

x <- aggregate(steps~interval,data,mean)

Time series plot of the different intervals vs. average number of steps taken:

plot(x$interval, x$steps, type="l", main = "Average daily activity pattern", xlab="5 min interval", ylab="Average steps taken")

5-minute interval with the maximum average number of steps (interval, steps): (835, 206.169811320755)

x[x$steps==max(x$steps),]
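
To read an interval identifier such as 835 as a time of day, one option is to treat it as hours and minutes packed into a single number. This encoding is an assumption: the dataset description only calls the column an identifier, although a value like 835 is consistent with it. The helper name interval_to_time is hypothetical.

```r
# Hypothetical helper: format an interval identifier as a clock time,
# ASSUMING identifiers are encoded as hour * 100 + minute
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 835, 2355))
```

Under that assumption, the peak interval would correspond to the 5 minutes starting at 08:35.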

Imputing missing values

Total number of NAs: 2304

sum(is.na(data$steps))

Let’s look at the number of NAs that appear each day:

x <- aggregate(steps~date,data=data, function(y) {sum(is.na(y))}, na.action = NULL)
colnames(x) <- c("date","numberNAs")

Let’s just display the days that have more than 0 NAs:

library(knitr)
kable(x[x$numberNAs>0,],format="markdown")

	date	numberNAs
1	2012-10-01	288
8	2012-10-08	288
32	2012-11-01	288
35	2012-11-04	288
40	2012-11-09	288
41	2012-11-10	288
45	2012-11-14	288
61	2012-11-30	288

We can observe that NA values appear in full-day clusters rather than being randomly scattered across the dataset: on 8 specific dates, the monitoring data for the entire day is missing.
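
The full-day claim can be checked with a little arithmetic on the table: each affected day should be missing exactly as many values as there are 5-minute slots in a day, and the per-day counts should add up to the total number of NAs reported earlier. A minimal sketch, with the counts copied by hand from the table and illustrative variable names:

```r
# NA counts for the 8 affected days, copied from the table above
na_per_day <- rep(288, 8)

# Number of 5-minute slots in a 24-hour day
intervals_per_day <- (24 * 60) / 5

all(na_per_day == intervals_per_day)  # TRUE: each affected day is missing every slot
sum(na_per_day)                       # 2304, matching the total number of NAs
```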

I propose filling these empty days with the results from the average daily activity pattern section above: each missing interval receives that interval's average across all days, rounded to the nearest integer.

Generating filled data for NA dates, filling in every interval observation with the average daily activity pattern:

# Vector of average step values per interval, as computed in the daily activity pattern section:
averageValSteps <- round(aggregate(steps~interval,data,mean)[,2])
# Vector of dates with missing data, as displayed in the table above:
dates           <- x[x$numberNAs>0,1]
# Vector of different intervals:
intervals       <- unique(data$interval)
# Merging them all together:
filledData      <- as.data.frame(cbind(averageValSteps,intervals))
filledData      <- merge(filledData,dates)
filledData      <- filledData[,c(1,3,2)]
colnames(filledData) <- c("steps","date","interval")
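
The same imputation idea can also be written as a direct lookup, which may be easier to follow and does not rely on the missing values falling on whole days. The sketch below runs on a made-up toy data frame, not on the real dataset, and is an alternative formulation rather than the code used in this analysis.

```r
# Made-up example: day 1 is entirely missing, days 2 and 3 are complete
toy <- data.frame(
  steps    = c(NA, NA, 2, 4, 6, 8),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0, 5), times = 3)
)

# Mean steps per interval over the days that have data; names are the interval ids
interval_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Replace each NA with the rounded mean of its own interval
filled  <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- unname(
  round(interval_means[as.character(filled$interval[missing])])
)

filled$steps
```

Here interval 0 averages 4 and interval 5 averages 6, so the two missing slots of the first day become 4 and 6.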

Binding filled data with the original incomplete data:

fullData <- rbind(data[!is.na(data$steps),],filledData)
fullData <- fullData[order(fullData$date),]
rownames(fullData) <- NULL  # reset row names to 1..n after reordering

Histogram of total number of steps taken per day:

x <- aggregate(steps~date,fullData,sum)[,2]
hist(x,breaks=20,col = "chartreuse3",main = "Steps per Day Histogram",xlab = "Sum of Steps per Day")

Mean of total number of steps per day: 10765.64

mean(x)

Median of total number of steps per day: 10762

median(x)

As expected, replacing the missing values with interval averages pulled the mean and median toward the total number of steps for a day made up entirely of average values: 10762
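
The shift in the mean can be reproduced by hand from the figures reported above. The calculation below assumes 61 days in total (October plus November), of which 8 were imputed with a daily total of 10762 and the remaining 53 have the original mean of 10766.19; the variable names are illustrative.

```r
total_days    <- 61        # October (31) + November (30)
imputed_days  <- 8         # days that were entirely NA
observed_days <- total_days - imputed_days

observed_mean <- 10766.19  # mean daily total before imputation
imputed_total <- 10762     # daily total of a day filled with rounded interval averages

# Weighted mean of observed and imputed days
new_mean <- (observed_days * observed_mean + imputed_days * imputed_total) / total_days
round(new_mean, 2)
```

The result agrees with the mean reported after imputation. The small drop comes from rounding each interval average before summing, which makes an imputed day total slightly less than the original mean.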

Are there differences in activity patterns between weekdays and weekends?

Creating a factor vector from fullData with 2 levels: “weekday” and “weekend”

# Note: weekdays() returns names in the current locale; the labels below assume an English locale
x <- factor(weekdays(fullData$date,abbreviate=TRUE))
levels(x) <- list(weekday=c("Mon","Tue","Wed","Thu","Fri"),weekend=c("Sat","Sun"))

Adding the factor as a new column of fullData, giving 2 types of days: weekday and weekend.

fullData[,"day"] <- x
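
Because weekdays() returns day names in the session's locale, matching on "Sat" and "Sun" only works when R runs in English. A locale-independent sketch is shown below; it uses the numeric day of the week from as.POSIXlt() (0 = Sunday, 6 = Saturday) and is an alternative to the code above, not the approach this analysis was run with. The three example dates are chosen only for illustration.

```r
# Example dates: a Monday, a Saturday and a Sunday in October 2012
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# Numeric day of the week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday
wday <- as.POSIXlt(dates)$wday

# Two-level factor with the same level names as in the analysis
day_type <- factor(
  ifelse(wday %in% c(0, 6), "weekend", "weekday"),
  levels = c("weekday", "weekend")
)
day_type
```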

Average number of steps taken per interval, per type of day:

x <- aggregate(steps~interval+day,fullData,mean)

Plotting using lattice:

library(lattice)
xyplot(steps~interval|day,data=x,layout=c(1,2),type="l")

Weekdays show a pattern much closer to the overall average of steps taken per interval than weekends do. This would have been important to take into account when imputing the missing values in the imputation section above.
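
One way to take the type of day into account would be to impute each missing value with the average for the same interval and the same type of day. The sketch below illustrates the idea on made-up numbers with a single interval; it was not part of this analysis, and the column name filled is hypothetical.

```r
# Made-up example: one interval, with weekday and weekend levels that differ a lot
toy <- data.frame(
  steps    = c(10, NA, 30, 100, NA, 300),
  interval = rep(0, 6),
  day      = factor(rep(c("weekday", "weekend"), each = 3))
)

# Within each interval/day-type group, replace NAs with the group mean
toy$filled <- ave(
  toy$steps, toy$interval, toy$day,
  FUN = function(v) {
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    v
  }
)
toy$filled
```

In this toy case the missing weekday value becomes 20 and the missing weekend value becomes 200, whereas a single overall interval average would have assigned 110 to both.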