Reproducible Research: Peer Assessment 1

Understanding the data

This report looks at data collected from an activity monitoring device worn by an anonymous individual over a two-month period (October 2012 and November 2012). The device recorded the number of steps taken in five-minute intervals throughout each day.

Loading and preprocessing the data

Loading

The data was downloaded from the given link on 2014-09-06 at 17:15:00 PST. The archive is unzipped into the working directory so that the following code can read the data.

unzip("./activity.zip")
activitydata <- read.csv("./activity.csv")

Processing

Sum the steps per day and create a new data frame.

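##Total steps per day; tapply() names each total by its date
steps <- tapply(activitydata$steps, activitydata$date, sum, na.rm = TRUE)

##Taking the dates from names(steps) works whether or not date was read as a factor
date <- as.Date(names(steps))

spddata <- data.frame(date, steps, row.names = NULL)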

Average the number of steps in each five-minute interval across all days and create a data frame.

dailyact <- aggregate(activitydata$steps, by = list(interval =  activitydata$interval), mean, na.rm = TRUE)

colnames(dailyact) <- c("interval", "steps")

##Convert interval identifiers such as 835 to a date-time (POSIXct) for plotting
dailyact$interval <- as.POSIXct(strptime(sprintf("%04d", dailyact$interval), "%H%M"))
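
The interval identifiers are integers such as 5, 835, or 2355 that encode a clock time without leading zeros. As a small illustration on made-up values (the vector ids is hypothetical and not part of the data set), zero-padding with sprintf() is what allows the "%H%M" format to parse them:

```r
##Hypothetical interval identifiers in the same style as the data set
ids <- c(0, 5, 835, 2355)

##Pad to four digits so that 5 becomes "0005" (00:05) rather than "5"
padded <- sprintf("%04d", ids)

##Parse as hour and minute; the date part defaults to the current day
times <- strptime(padded, "%H%M", tz = "UTC")

format(times, "%H:%M")
```

Without the padding, an identifier such as 5 would not contain the four digits that "%H%M" needs to be read as 00:05.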

What is mean total number of steps taken per day?

I created a histogram of the total steps taken per day.

hist(spddata$steps, col = "blue", main = "Total Steps Taken Per Day", xlab = "Number of Steps" )

[Figure: histogram of total steps taken per day]

I used the mean function to calculate the mean total number of steps taken per day.

mean(spddata$steps)

## [1] 9354

I used the median function to calculate the median total number of steps taken per day.

median(spddata$steps)

## [1] 10395

What is the average daily activity pattern?

For this part I took the data frame created earlier that shows the average number of steps over a given interval, and I plotted that data using the ggplot2 package.

require(ggplot2)

## Loading required package: ggplot2

require(scales)

## Loading required package: scales

timeplot <- ggplot(dailyact, aes(interval, steps)) 

timeplot + geom_line() + scale_x_datetime(labels= date_format("%H%M")) + ggtitle ("Average Daily Activity")

[Figure: line plot of average steps per five-minute interval, "Average Daily Activity"]

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

##Prints the interval with the highest average number of steps, formatted as HHMM
format(dailyact$interval[which.max(dailyact$steps)], "%H%M")

## [1] "0835"

Imputing missing values

Total number of missing values:

sum(is.na(activitydata))

## [1] 2304
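
The count above covers every column of the data frame. As a quick way to see where missing values sit, they can be tallied per column and per day. The sketch below uses a made-up data frame (toy is hypothetical and only stands in for activitydata):

```r
##Made-up stand-in for activitydata: two days with three intervals each
toy <- data.frame(
        steps    = c(NA, NA, 3, 0, NA, 7),
        date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
        interval = rep(c(0, 5, 10), times = 2)
)

##Number of NA values in each column
na_per_col <- colSums(is.na(toy))
na_per_col

##Number of NA values per day, which shows whether whole days are missing
na_per_day <- tapply(is.na(toy$steps), toy$date, sum)
na_per_day
```

In this toy example all three missing values are in the steps column, two on the first day and one on the second.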

Process for replacing missing values

After noting during exploratory analysis that the average number of steps per interval varies with the day of the week, I chose to develop code that replaces the NA values row by row with the corresponding weekend or weekday interval average, depending on the day of the week.

The first step is to process the data so that there is a factor variable that identifies whether a day falls on a weekend or weekday.

##Add a weekend/weekday factor to the data so that interval averages can be 
##computed separately for weekends and weekdays and then used to replace 
##the NA values. 

##Create a vector that identifies the day of the week for each date entry

dow <- weekdays(as.Date(activitydata$date))

##Create an empty vector to store the weekend/weekday identifiers

ww <- c() 

##Categorizes each day of the week, storing the label in the vector ww.
##The if/else statement labels a day as "weekend" if it matches "Saturday" or 
##"Sunday"; otherwise it labels it "weekday".
##Note: weekdays() returns day names in the current locale, so this comparison 
##assumes an English locale.

for(i in seq_along(dow)){
        if(dow[i] %in% c("Saturday", "Sunday")){
                ww[i] <- "weekend"
        }
        else{
                ww[i] <- "weekday"
        }
}

activitydata$WEorWD <- factor(ww)
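
The same labels can be produced without a loop and without relying on English day names. This is a minimal sketch on made-up dates, not the code used for the results in this report; the names dates and daynum are hypothetical:

```r
##Hypothetical dates; 2012-10-06 was a Saturday and 2012-10-07 a Sunday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

##"%u" gives the weekday as a number: 1 = Monday ... 7 = Sunday
daynum <- as.integer(format(dates, "%u"))

##Days numbered 6 and 7 are the weekend
ww <- factor(ifelse(daynum >= 6, "weekend", "weekday"))
ww
```

Because the weekday number does not depend on the locale, this version gives the same result on systems that are not set to English.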

Next, I created two data frames that show the mean number of steps for each five-minute interval, one for days that fall on a weekend and one for days that fall on a weekday.

##Subsets weekend data
WEdf <- subset(activitydata, activitydata$WEorWD == "weekend") 

##Finds mean value for five minute intervals and returns a data frame 
WEdf <- aggregate(WEdf$steps, by = list(interval = WEdf$interval), mean, na.rm = TRUE) 

##Renames column named "x" to "steps" 
colnames(WEdf) <- c("interval", "steps")

##Repeats process for weekday values
WDdf <- subset(activitydata, activitydata$WEorWD == "weekday")

WDdf <- aggregate(WDdf$steps, by = list(interval = WDdf$interval), mean, na.rm = TRUE) 

colnames(WDdf) <- c("interval", "steps")

Finally, I create a new data frame that has the missing values replaced with the interval average determined by when the day falls in the week.

##Creates copy of data frame with missing values
noNAdata <- activitydata

##Looks for NA values
##Checks the period of the week indicator to determine which interval average to 
##apply
##Locates and replaces NA with corresponding average step value

for(i in seq_len(nrow(activitydata))){
        if(is.na(activitydata$steps[i])){
                if(activitydata$WEorWD[i] == "weekend"){
                        m <- which(WEdf$interval == activitydata$interval[i]) 
                        noNAdata$steps[i] <- WEdf$steps[m]
                }
                if(activitydata$WEorWD[i] == "weekday"){
                        m <- which(WDdf$interval == activitydata$interval[i]) 
                        noNAdata$steps[i] <- WDdf$steps[m]
                }
        }
}
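
The row-by-row replacement can also be written without a loop. The sketch below works on made-up data with the same column names; it uses the base R function ave() as a swapped-in technique and is not the code that produced the results in this report:

```r
##Made-up stand-in for activitydata with the weekend/weekday factor added
toy <- data.frame(
        steps    = c(NA, 4, 10, 6, NA, 2),
        interval = c(0, 0, 0, 5, 5, 5),
        WEorWD   = factor(c("weekday", "weekday", "weekend",
                            "weekday", "weekday", "weekend"))
)

##Mean of the non-missing steps within each interval/WEorWD group,
##returned as a vector aligned with the rows of toy
groupmean <- ave(toy$steps, toy$interval, toy$WEorWD,
                 FUN = function(x) mean(x, na.rm = TRUE))

##Replace only the missing entries with their group mean
filled <- toy
filled$steps[is.na(toy$steps)] <- groupmean[is.na(toy$steps)]
filled$steps
```

Here the two missing weekday values are replaced by 4 and 6, the weekday averages of intervals 0 and 5 in the toy data.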

What is the impact of imputing missing data on the estimates of the total daily number of steps?

I repeated the process used previously to create a data frame that stores the total number of steps per day.

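##Total steps per day after replacing missing values
steps <- tapply(noNAdata$steps, noNAdata$date, sum, na.rm = TRUE)

##Taking the dates from names(steps) works whether or not date was read as a factor
date <- as.Date(names(steps))

SPDAdata <- data.frame(date, steps, row.names = NULL)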

Create a new histogram using this data.

hist(SPDAdata$steps, col = "green", main = "Total Steps Taken Per Day", xlab = "Number of Steps" )

[Figure: histogram of total steps taken per day after replacing missing values]

I applied the mean function to the new data set created by replacing the missing values.

mean(SPDAdata$steps)

## [1] 10762

I applied the median function to the new data set created by replacing the missing values.

median(SPDAdata$steps)

## [1] 10571

Do these values differ from the estimates from the first part of the assignment?

The mean increased by 1,408 steps after the missing values were replaced, compared with the mean calculated at the beginning.

mean(spddata$steps) - mean(SPDAdata$steps)

## [1] -1408

There is a smaller increase in the median value of 176 steps.

median(spddata$steps) - median(SPDAdata$steps)

## [1] -176

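Having missing values prevented me from calculating a true average. Because sum() with na.rm = TRUE returns 0 when every value is missing, days with no recorded data were counted as days with zero steps in the first calculation, which pulled the mean down. The true average number of steps is therefore likely higher than the value first calculated.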
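
The effect of na.rm = TRUE on a day with no data can be seen in a small made-up example (the vectors allmissing and daytotals are hypothetical):

```r
##A made-up day on which every interval is missing
allmissing <- c(NA_integer_, NA_integer_, NA_integer_)

##With na.rm = TRUE the sum of nothing is 0, so the day looks inactive
sum(allmissing, na.rm = TRUE)

##Without na.rm the day total stays NA and can be left out of the mean
daytotals <- c(sum(allmissing), 10000, 12000)
mean(daytotals, na.rm = TRUE)

##Counting the missing day as zero lowers the mean
mean(c(0, 10000, 12000))
```

In this example the mean over the two recorded days is 11000, while treating the missing day as zero gives a mean of about 7333.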

Are there differences in activity patterns between weekdays and weekends?

I have already created a factor variable in the data set that indicates whether a day falls on a weekend or weekday. In order to compare the activity patterns of the weekends v. weekdays, I took subsets of the data, found the average number of steps taken in each interval, and then I combined that data into a single data frame.

##Subsets weekend data
weekend <- subset(noNAdata, noNAdata$WEorWD == "weekend")

##Subsets weekday data
weekday <- subset(noNAdata, noNAdata$WEorWD  == "weekday")

##Calculates mean values for steps by interval
weekend <- aggregate(weekend$steps, by = list(interval = weekend$interval), mean)

weekday <- aggregate(weekday$steps, by = list(interval = weekday$interval), mean)

##Adds weekend and weekday identifiers
weekend <- cbind(weekend, c("weekend"))
weekday <- cbind(weekday, c("weekday"))

##Renames columns to improve readability and ensure both data frames have identical names
colnames(weekend) <- c("interval", "steps", "WEorWD")
colnames(weekday) <- c("interval", "steps", "WEorWD")

##Combines the two data frames into one
WDWEave <- rbind(weekend, weekday)
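
The subset, aggregate, and rbind steps can also be combined into one call by grouping on both variables at once. This is a minimal sketch on made-up data using the formula interface of aggregate(); it is an alternative, not the code used for the results in this report:

```r
##Made-up stand-in for noNAdata
toy <- data.frame(
        steps    = c(2, 4, 10, 6, 8, 2),
        interval = c(0, 0, 0, 5, 5, 5),
        WEorWD   = c("weekday", "weekday", "weekend",
                     "weekday", "weekday", "weekend")
)

##One row per interval/WEorWD combination, holding the mean of steps
ave_toy <- aggregate(steps ~ interval + WEorWD, data = toy, FUN = mean)
ave_toy
```

The result already has the columns interval, WEorWD, and steps, so no renaming or row binding is needed before plotting.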

Using the ggplot2 package, I created a panel plot comparing the activity patterns of weekends and weekdays.

WDWEave$interval <- sprintf("%04d", WDWEave$interval)

##Convert to POSIXct, the date-time class that scale_x_datetime() works with
WDWEave$interval <- as.POSIXct(strptime(WDWEave$interval, "%H%M"))

g <- ggplot(WDWEave, aes(interval, steps))

g + geom_line() + scale_x_datetime(labels= date_format("%H%M")) + 
facet_grid(WEorWD~.) + ggtitle("Average Daily Activity Pattern")

[Figure: panel plot of average steps per interval for weekdays and weekends, "Average Daily Activity Pattern"]

On average, the measured activity level during the day was higher on weekends than on weekdays.