This example involves analysis of activity data from a activity monitor (e.g. a fitbit) worn daily. The individual’s steps were measured in 5 second intervals. Some data is missing and so it was imputed using a function I show within this document. The imputation function was chosen to randomly generate numbers within a reasonable distribution of the mean of this data set. You will notice that most of the imputed values are within 1-2 standard deviations of the mean. There are other ways to imput missing data, but rather than just adding data to the mean, I used a statistical function which infers activity based on the data distribution.
The first question was how to load and prepare the data. I did so as follows:
activity <- read.csv("~/Documents/Reproducible Research/activity.csv", na.strings="NA")
#mean number of steps taken will need to exclude NA
activity$steps <- as.numeric(activity$steps)
activity$interval <- as.numeric(activity$interval)
A histogram of the raw data as well as summary statistics (including mean and median), were included in the initial analyses:
StepsTotal <- aggregate(steps ~ date, data=activity, sum, na.rm=TRUE)
hist(x=StepsTotal$steps, col = "blue", breaks=50, xlab="Steps Per Day", main="Daily Number of Steps", cex=1, cex.lab=0.75, cex.axis=0.75, cex.main=0.95, cex.sub=0.75, font=2, font.lab=2)
summary(StepsTotal$steps)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 8841 10760 10770 13290 21190
Next a time-series plot was made which shows the 5-minute interval average of number of steps taken and averaged across all days. You will also see which 5-minute interval across all days contained the maximum number of steps. I also calculate the number of missing values for steps.
library(lattice)
activity$date <- as.character(activity$date)
time_series <- tapply(activity$steps, activity$interval, mean, na.rm = TRUE)
plot(row.names(time_series), time_series, type = "l", xlab = "5 minute Intervals", ylab="Average All Days", main="Average Steps Taken", col="blue", cex=1, cex.axis=0.75, cex.lab=0.75, cex.main=0.95, font.lab=2, font=2)
max_interval <- which.max(time_series) #queries time series for max steps
names(max_interval) #gives the 5-min interval with max steps
## [1] "835"
sum(is.na(activity)) #calculates the total number of missing values
## [1] 2304
Next to give more accurate assessment for time-series analyses I created a function to impute data for the steps. This is a randomized function that generates data points within a subset of already present points. See below. Also see Statistics reference site of Columbia University: http://www.stat.columbia.edu/~gelman/arm/missing.pdf
#call the function random.imp to make random values to add to the missing data points in steps
source("~/Documents/Reproducible Research/imputeValues.R")
#R function to imput missing values for daily steps analysis
random.imp <- function (a){
missing <- is.na(a)
n.missing <- sum(missing)
a.obs <- a[!missing]
imputed <- a
imputed[missing] <- sample(a.obs, n.missing, replace=TRUE)
return(imputed)
}
After the function is called, the data can be appended as a new column to the dataframe. This allows me to analyze the same data set using imputed data. I also report the new mean and median based on the imputed data. Its always a good idea to keep all of your data in case later one wants to determine whether the imputed data was appropriate or another inference technique should be used.
df.steps <- random.imp(activity$steps) #now you have data to add to missing points
activity$imputed = paste(random.imp(activity$steps)) #pasted a column into the dataframe with the randomly imputed data based on original steps data
####repeat of analysis with imputed data
activity$imputed <- as.numeric(activity$imputed)
StepsTotal <- aggregate(imputed ~ date, data=activity, sum, na.rm=TRUE)
hist(x=StepsTotal$imputed, col = "blue", breaks=50, xlab="Steps Per Day + Imputed", main="Daily Number of Steps with Imputed Values", cex=1, cex.lab=0.75, cex.axis=0.75, cex.main=0.95, cex.sub=0.75, font=2, font.lab=2)
summary(StepsTotal$imputed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 9293 10570 10720 12880 21190
Next, using the real and imputed data, I determine a time-series for the number of steps taken on average on the weekend versus the weekday by this individual.
activity$date <- as.Date(activity$date) #now you need to change activity date to POSIXt date object
activity$date <- strptime(paste(activity$date), format="%Y-%m-%d", tz="UTC")
activity$weekday <- paste(weekdays(activity$date)) #add the day of the week based on time-stamp
weekdays <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
activity$weekday = as.factor(ifelse(is.element(weekdays(as.Date(activity$date)), weekdays), "Weekday", "Weekend"))
StepsImputedInterval <- aggregate(imputed ~ interval + weekday, activity, mean)
library(lattice)
xyplot(StepsImputedInterval$imputed ~ StepsImputedInterval$interval | StepsImputedInterval$weekday, main="Average Steps (Imputed) per Day by Interval", xlab="Interval (5 seconds each)", ylab="Imputed Steps", layout=c(1,2), type="l", cex=1, cex.axis=0.75, font=2, font.lab=2, font.main=2, font.sub=2, font.lab=2)
From the analysis one can tell that this individual is more active on the weekend. Interestingly though they seem to have certain hours during the week-day in which they are more active. Its important to keep in mind whether the fitness data was gathered at the same time each day – did the individual wake at the same time, or not? Just a few caveats to think about when looking at fitness and health data.