This report analyzes data from a personal activity monitor worn by an anonymous individual. The data were collected during October and November 2012 and record the number of steps taken in each 5-minute interval throughout the day. The variables in the data set are steps, date, and interval.
if(!file.exists("activity.csv")){
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip", destfile = "activity.zip")
unzip("activity.zip")
}
# Loading Libraries
library(dplyr)
library(knitr)
# Load data
act <- read.csv('activity.csv')
act <- data.frame('steps'=as.integer(act$steps),
'date'=as.Date(act$date),
'interval'=as.integer(act$interval))
Note that the mean is somewhat lower than the median because of the 0 totals among the daily sums: with na.rm = TRUE, a day whose values are all missing sums to 0, which pulls the mean down.
act_steps <- tapply(act$steps, act$date, sum, na.rm=T)
mean(act_steps)
## [1] 9354.23
median(act_steps)
## [1] 10395
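The effect of `na.rm = TRUE` on a day with no recorded values can be seen in a minimal sketch. The `toy` data frame and the `totals_*` names below are hypothetical and are not taken from activity.csv.

```r
# Hypothetical toy data: the first day has no recorded values,
# the second day is fully recorded
toy <- data.frame(
  steps = c(NA, NA, 10L, 20L),
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"))
)

# sum(..., na.rm = TRUE) over an all-NA day returns 0 rather than NA
totals_with_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# The formula interface of aggregate() omits NA rows by default,
# so the empty day is dropped instead of counted as 0
totals_dropped <- aggregate(steps ~ date, data = toy, FUN = sum)

mean(totals_with_zero)      # averages over both days, including the 0
mean(totals_dropped$steps)  # averages over the recorded day only
```

Which of the two behaviors is preferable depends on whether a day without data should count as a day with no steps.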
The highest number of steps in a single measurement is 806, recorded in interval 615 on 2012-11-27.
maximum <- which(act$steps == max(act$steps[!is.na(act$steps)]))
act[maximum,]
## steps date interval
## 16492 806 2012-11-27 615
This is by no means an outlier, as illustrated by the table of the top 0.05% of step counts per interval:
top_001 <- quantile(act$steps, 0.9995, na.rm = T)
kable(act[which(act$steps > top_001),], caption='Top 0.05% max steps per interval', align = "c")
|  | steps | date | interval |
|---|---|---|---|
| 3277 | 802 | 2012-10-12 | 900 |
| 4136 | 786 | 2012-10-15 | 835 |
| 10194 | 785 | 2012-11-05 | 925 |
| 14024 | 785 | 2012-11-18 | 1635 |
| 14201 | 789 | 2012-11-19 | 720 |
| 15745 | 785 | 2012-11-24 | 1600 |
| 16487 | 794 | 2012-11-27 | 550 |
| 16492 | 806 | 2012-11-27 | 615 |
This skewing of the data toward 0 is apparent in the histogram of steps recorded per interval:
hist(act$steps, main="Frequency of Steps Taken", xlab="steps", ylab="frequency")
The plot of total steps taken per day shows that, though the vast majority of the measurements are 0, the distribution of steps taken per day is roughly Gaussian:
hist(aggregate(steps ~ date, act, sum)$steps, main ="Sum of Steps per Day", xlab = "Steps per Day")
ptrn <- tapply(act$steps, act$interval, mean, na.rm = T)
plot(ptrn, type="l", main = "Fig 3: Daily Activity Pattern", ylab="steps", xlab = "interval")
The missing values are imputed with the mean number of steps for the corresponding interval:
# Mean steps per interval (rows with NA steps are omitted by the formula interface)
ags <- aggregate(steps ~ interval, data = act, FUN = mean)
# Copy the data and replace each missing value with its interval mean
act_new <- act
missing <- is.na(act_new$steps)
act_new$steps[missing] <- ags$steps[match(act_new$interval[missing], ags$interval)]
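As a cross-check on the imputation, the same idea can be sketched with base R's `ave()`, which applies a function within groups. This is an alternative illustration on hypothetical toy data, not the method used in this report.

```r
# Hypothetical toy data: interval 0 has two missing values, interval 5 has none
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 8),
  interval = c(0L, 5L, 0L, 5L, 0L, 5L)
)

# Within each interval, replace NA with the mean of the observed values
filled <- ave(toy$steps, toy$interval,
              FUN = function(x) replace(x, is.na(x), mean(x, na.rm = TRUE)))

filled          # observed values are unchanged, NAs take the interval mean
anyNA(filled)   # no missing values should remain
```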
The new mean and median are larger than in the original data set because NA values are now equal to the mean of each interval, so days that previously summed to 0 now contribute a typical daily total.
act_new_steps <- tapply(act_new$steps, act_new$date, FUN = sum)
mean(act_new_steps)
## [1] 10766.19
median(act_new_steps)
## [1] 10766.19
This is also visible in the histogram of total steps per day: the only change is in the central bucket, 10,000-15,000 steps, because NA values were imputed with mean values.
hist(act_new_steps, main = "New total steps per day", xlab="steps per day")
The daily pattern for weekends is similar to that for weekdays, but noisier. In both patterns there is a large jump in steps taken around the 105th interval, followed by a dip for the rest of the day. On weekends, the intervals after the 105th contain much more noise. There are also generally more steps taken on weekends.
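One way such a weekday/weekend comparison could be computed is sketched below. The `toy` data frame and the `is_weekend`, `day_type`, and `pattern` names are hypothetical; the sketch assumes a data frame with the same columns as the imputed data set.

```r
# Hypothetical toy data: 2012-10-05 is a Friday, 2012-10-06 is a Saturday
toy <- data.frame(
  steps    = c(0, 10, 20, 40),
  date     = as.Date(c("2012-10-05", "2012-10-05", "2012-10-06", "2012-10-06")),
  interval = c(0L, 5L, 0L, 5L)
)

# wday is 0 for Sunday and 6 for Saturday, independent of the locale
is_weekend <- as.POSIXlt(toy$date)$wday %in% c(0, 6)
toy$day_type <- factor(ifelse(is_weekend, "weekend", "weekday"))

# Mean steps per interval within each day type
pattern <- aggregate(steps ~ interval + day_type, data = toy, FUN = mean)
pattern
```

Using `wday` rather than `weekdays()` avoids depending on the language of the day names returned in a given locale.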