This report looks at data collected from the activity monitoring device of an anonymous individual over a two-month period (October 2012 and November 2012). The device recorded the number of steps taken in five-minute intervals throughout each day.
The data was downloaded from the link given on 2014-09-06 at 17:15:00 PST. The archive was unzipped into the working directory, where the following code reads it.
unzip("./activity.zip")
activitydata <- read.csv("./activity.csv")
Sum the steps per day and create a new data frame.
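##Uses sort(unique()) rather than levels() so that this also works when read.csv
##returns the date column as character (the default since R 4.0.0)
date <- as.Date(sort(unique(activitydata$date)))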
steps <- tapply(activitydata$steps, activitydata$date, sum, na.rm = TRUE)
spddata <- data.frame(date, steps, row.names = NULL)
Average the number of steps over a given interval of time and create a data frame.
dailyact <- aggregate(activitydata$steps, by = list(interval = activitydata$interval), mean, na.rm = TRUE)
colnames(dailyact) <- c("interval", "steps")
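##Converts the interval identifier (e.g. 835) to a POSIXct time (08:35) for plotting;
##tz = "UTC" avoids a label shift with date_format(), which defaults to UTC in recent versions of scales
dailyact$interval <- as.POSIXct(strptime(sprintf("%04d", dailyact$interval), "%H%M", tz = "UTC"))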
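The interval identifiers encode clock time as an integer (835 stands for 08:35), which is why they are zero-padded before being parsed. The following is a minimal sketch of that conversion on a few made-up values. The vector `example_intervals` is hypothetical, and `tz = "UTC"` is an assumption made here so that the result does not depend on the local time zone.

```r
## Hypothetical interval identifiers standing for 00:05, 08:35 and 23:55
example_intervals <- c(5L, 835L, 2355L)
## Zero-pad to four characters so that "%H%M" can parse them
padded <- sprintf("%04d", example_intervals)
## Parse as times on the current date (tz = "UTC" is an assumption)
parsed <- as.POSIXct(strptime(padded, "%H%M", tz = "UTC"))
## Format back to a readable clock label
format(parsed, "%H:%M")
## [1] "00:05" "08:35" "23:55"
```

Without the padding, a value such as 5 would be passed to strptime as "5", which does not match the four-digit "%H%M" layout.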
I created a histogram of the total steps taken per day.
hist(spddata$steps, col = "blue", main = "Total Steps Taken Per Day", xlab = "Number of Steps" )
I used the mean function to calculate the mean total number of steps taken per day.
mean(spddata$steps)
## [1] 9354
Apply the median function to calculate the median value.
median(spddata$steps)
## [1] 10395
For this part I took the data frame created earlier that shows the average number of steps over a given interval, and I plotted that data using the ggplot2 package.
require(ggplot2)
## Loading required package: ggplot2
require(scales)
## Loading required package: scales
timeplot <- ggplot(dailyact, aes(interval, steps))
timeplot + geom_line() + scale_x_datetime(labels = date_format("%H%M")) + ggtitle("Average Daily Activity")
##Prints interval with highest number of steps
format(dailyact$interval[which.max(dailyact$steps)], "%H%M")
## [1] "0835"
##Counts the missing values in the data set
sum(is.na(activitydata))
## [1] 2304
During exploratory analysis I noted that the average number of steps per interval varies with the day of the week, so I chose to develop code that replaces each NA value, row by row, with the corresponding weekend or weekday interval average.
The first step is to process the data so that there is a factor variable that identifies whether a day falls on a weekend or weekday.
##Adds a weekend/weekday factor so that NA values can later be replaced with the
##average for the same interval and the same part of the week.
##Create a vector that identifies the day of the week for each date entry
dow <- weekdays(as.Date(activitydata$date))
##Create an empty character vector to store the weekend/weekday identifiers
ww <- character(length(dow))
##Labels each entry as "weekend" if its day of the week is Saturday or Sunday,
##and as "weekday" otherwise, storing the result in the vector ww
for(i in seq_along(dow)){
  if(dow[i] %in% c("Saturday", "Sunday")){
    ww[i] <- "weekend"
  } else {
    ww[i] <- "weekday"
  }
}
activitydata$WEorWD <- factor(ww)
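The same labelling can be done without a loop. The sketch below is an alternative, not the code used for this report: it classifies a few hypothetical dates with ifelse, and it uses the "%u" weekday number instead of day names so that the result does not depend on the locale. The names `example_dates`, `day_number` and `part_of_week` are made up for illustration.

```r
## Hypothetical dates: a Friday, a Saturday, a Sunday and a Monday
example_dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
## "%u" gives the weekday as a number (1 = Monday ... 7 = Sunday)
day_number <- as.integer(format(example_dates, "%u"))
## Days 6 and 7 are the weekend; everything else is a weekday
part_of_week <- factor(ifelse(day_number >= 6, "weekend", "weekday"))
as.character(part_of_week)
## [1] "weekday" "weekend" "weekend" "weekday"
```

Comparing against "Saturday" and "Sunday" as in the loop above works only when weekdays() returns English day names.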
Next, I created two data frames that hold the mean number of steps for each five-minute interval, one for weekend days and one for weekdays.
##Subsets weekend data
WEdf <- subset(activitydata, activitydata$WEorWD == "weekend")
##Finds mean value for five minute intervals and returns a data frame
WEdf <- aggregate(WEdf$steps, by = list(interval = WEdf$interval), mean, na.rm = TRUE)
##Renames column named "x" to "steps"
colnames(WEdf) <- c("interval", "steps")
##Repeats process for weekday values
WDdf <- subset(activitydata, activitydata$WEorWD == "weekday")
WDdf <- aggregate(WDdf$steps, by = list(interval = WDdf$interval), mean, na.rm = TRUE)
colnames(WDdf) <- c("interval", "steps")
Finally, I created a new data frame in which each missing value is replaced with the interval average for the part of the week in which the day falls.
##Creates copy of data frame with missing values
noNAdata <- activitydata
##Looks for NA values
##Checks the part-of-week indicator to determine which interval average to
##apply
##Locates and replaces NA with corresponding average step value
for(i in seq_len(nrow(activitydata))){
  if(is.na(activitydata$steps[i])){
    if(activitydata$WEorWD[i] == "weekend"){
      m <- which(WEdf$interval == activitydata$interval[i])
      noNAdata$steps[i] <- WEdf$steps[m]
    } else {
      m <- which(WDdf$interval == activitydata$interval[i])
      noNAdata$steps[i] <- WDdf$steps[m]
    }
  }
}
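For comparison, the row-by-row replacement can also be written in vectorized form. The sketch below is an alternative under stated assumptions, not the method used above: it relies on base R's ave() to compute the mean for every combination of interval and part of week, then fills only the missing rows. The data frame `toy` and its values are hypothetical.

```r
## Hypothetical data: two intervals, each observed on weekdays and weekends
toy <- data.frame(
  steps    = c(10, 30, NA, 100, 200, NA, 4, NA, 50, NA),
  interval = c(0, 0, 0, 5, 5, 5, 0, 0, 5, 5),
  WEorWD   = c(rep("weekday", 6), rep("weekend", 4))
)
## Mean of the observed steps within each interval / part-of-week group,
## repeated for every row of that group
group_mean <- ave(toy$steps, toy$interval, toy$WEorWD,
                  FUN = function(x) mean(x, na.rm = TRUE))
## Replace only the missing entries with their group mean
filled <- toy
is_missing <- is.na(toy$steps)
filled$steps[is_missing] <- group_mean[is_missing]
filled$steps
## [1]  10  30  20 100 200 150   4   4  50  50
```

A group with no observed values would have a mean of NaN, so its missing entries would remain unfilled.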
I repeated the process used previously to create a data frame that stores the total number of steps per day.
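##Uses sort(unique()) rather than levels() so that this also works when the
##date column is character rather than factor
date <- as.Date(sort(unique(noNAdata$date)))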
steps <- tapply(noNAdata$steps, noNAdata$date, sum, na.rm = TRUE)
SPDAdata <- data.frame(date, steps, row.names = NULL)
Create a new histogram using this data.
hist(SPDAdata$steps, col = "green", main = "Total Steps Taken Per Day", xlab = "Number of Steps" )
I applied the mean function to the new data set, in which the missing values have been replaced.
mean(SPDAdata$steps)
## [1] 10762
I applied the median function to the same data set.
median(SPDAdata$steps)
## [1] 10571
The mean increased by 1,408 steps between the original data and the data with the missing values replaced.
mean(spddata$steps) - mean(SPDAdata$steps)
## [1] -1408
There is a smaller increase in the median value of 176 steps.
median(spddata$steps) - median(SPDAdata$steps)
## [1] -176
The missing values prevented me from calculating a true average. Because the daily totals were computed with na.rm = TRUE, every missing value contributed zero steps to its day's sum, so the original mean understates the true average number of steps per day.
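The effect can be reproduced on a small made-up example. In this sketch, which uses hypothetical values and hypothetical names (`toy_steps`, `toy_day`), a day with no observations at all receives a total of 0 when the sum is taken with na.rm = TRUE, and that zero pulls the mean of the daily totals down.

```r
## Hypothetical step counts: day "A" is partly missing, day "B" is entirely missing
toy_steps <- c(100, NA, 50, NA, NA, NA)
toy_day   <- c("A", "A", "A", "B", "B", "B")
## With na.rm = TRUE a day without observations gets a total of 0
totals_zero <- tapply(toy_steps, toy_day, sum, na.rm = TRUE)
## Returning NA for all-missing days keeps them out of the mean instead
totals_na <- tapply(toy_steps, toy_day, function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
})
mean(totals_zero)
## [1] 75
mean(totals_na, na.rm = TRUE)
## [1] 150
```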
The data set already contains a factor variable that indicates whether a day falls on a weekend or a weekday. To compare the activity patterns of weekends and weekdays, I took subsets of the data, found the average number of steps taken in each interval, and then combined the results into a single data frame.
##Subsets weekend data
weekend <- subset(noNAdata, noNAdata$WEorWD == "weekend")
##Subsets weekday data
weekday <- subset(noNAdata, noNAdata$WEorWD == "weekday")
##Calculates mean values for steps by interval
weekend <- aggregate(weekend$steps, by = list(interval = weekend$interval), mean)
weekday <- aggregate(weekday$steps, by = list(interval = weekday$interval), mean)
##Adds weekend and weekday identifiers
weekend <- cbind(weekend, c("weekend"))
weekday <- cbind(weekday, c("weekday"))
##Renames columns to improve readability and ensure both data frames have identical names
colnames(weekend) <- c("interval", "steps", "WEorWD")
colnames(weekday) <- c("interval", "steps", "WEorWD")
##Combines the two data frames into one
WDWEave <- rbind(weekend, weekday)
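The subset, aggregate and rbind steps can also be collapsed into one call. The sketch below is an alternative shown on hypothetical data (`toy`), not the code used for this report: the formula interface of aggregate() groups by interval and part of week at once and returns a data frame that already has readable column names.

```r
## Hypothetical data: two intervals, two observations per part of week
toy <- data.frame(
  steps    = c(10, 30, 100, 200, 4, 8, 50, 70),
  interval = c(0, 0, 5, 5, 0, 0, 5, 5),
  WEorWD   = rep(c("weekday", "weekend"), each = 4)
)
## One row per interval / part-of-week combination, holding the mean steps
combined <- aggregate(steps ~ interval + WEorWD, data = toy, FUN = mean)
combined$steps
## [1]  20 150   6  60
```

Note that the formula interface drops rows with missing values by default, which makes no difference once the NA values have been replaced.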
Using the ggplot2 package, I created a panel plot comparing the activity patterns of weekends and weekdays.
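##Converts the interval identifier to a POSIXct time for plotting;
##tz = "UTC" avoids a label shift with date_format(), which defaults to UTC in recent versions of scales
WDWEave$interval <- as.POSIXct(strptime(sprintf("%04d", WDWEave$interval), "%H%M", tz = "UTC"))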
g <- ggplot(WDWEave, aes(interval, steps))
g + geom_line() + scale_x_datetime(labels= date_format("%H%M")) +
facet_grid(WEorWD~.) + ggtitle("Average Daily Activity Pattern")
On average, the measured activity level during the day was higher on weekends than on weekdays.