Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

Below is the code necessary to read in the data from the provided .zip file. Additionally, I have included all necessary libraries for the graphs and transformations ahead.
The data variable that is created with the data from the activity.csv file is stripped of the NA values here. The data set will be reimplemented later on to impute the NA values.

#setwd("~/Documents/Docs/R Programming CourseEra/RepData_PeerAssessment1")
setwd("/Users/seankrinik/Dropbox/Personal Files/R Programming CourseEra/RepData_PeerAssessment1")
#Install necessary packages:
library(lubridate); library(dplyr); library(ggplot2); library(gridExtra); library(timeDate)
#assuming the raw data is within the current working directory:
unzip("activity.zip")
data <- read.csv("activity.csv")
data <- filter(data, !is.na(data$steps)) #Get rid of the NA values

What is mean total number of steps taken per day?

The mean total number of steps per day can be seen in the printed table below, along with a histogram showing the frequency of total step numerical values for each day:

total_steps <- with(data, aggregate(steps, by = list(ymd(date)), sum))
total_steps <- rename(total_steps, Day = Group.1, Steps = x)
#
mean_med <- data.frame(mean(total_steps$Steps), as.numeric(median(total_steps$Steps)), row.names = "Steps")
names(mean_med) <-c("Mean of total steps", "Median of total steps")
mean_med

##       Mean of total steps Median of total steps
## Steps            10766.19                 10765

#Plot
p <- ggplot(total_steps, aes(Steps))
p + geom_histogram(bins = length(total_steps$Day)) + 
    xlab("Total Steps") + ylab("Frequency") +
    ggtitle("Histogram of Total Steps Per Day") + guides(fill = F)

What is the average daily activity pattern?

Below you will see the maximum average daily steps in a given 5-minute video. Additionally, you will see a graphical representation of the daily activity by 5-minute interval.

daily_activity <- aggregate(steps~interval, data = data, mean)
paste("Maximum Average Steps in Interval: ", round(daily_activity[which.max(daily_activity$steps), ]$steps, digits = 2))

## [1] "Maximum Average Steps in Interval:  206.17"

paste("Maximum Average Steps - Interval Number: ", daily_activity[which.max(daily_activity$steps), ]$interval)

## [1] "Maximum Average Steps - Interval Number:  835"

p1 <- ggplot(daily_activity, aes(x = interval, y = steps)) 
p1 + geom_line() + ggtitle("Daily Activity Time Series") + xlab("Interval Number") + ylab("Average Steps \n(per day, by interval)")

Conclusion:

In examining the time series graph, interval 0 represents midnight of the day and interval 2355 (x-limit) represents the final 5-minute interval of the day. Thus, you can see most of the steps throughout the day occur in the late morning and throughout the mid-day intervals, on average.

Imputing missing values

The data set from activity.csv is reloaded to include the NA values omitted previously. Now, how do we interpret these NA values? The method below gets the indicies of the NA values then creates a new vector with replacement values from the mean daily values. Using this vector of mean daily values by interval, the intervals of the NA indicies are then matched and replaced with the mean values. Once the replacement is achieved, the data is mutated back into the full data set.
In summary, the method peels off the steps column from the data frame, changes the NA values to their respective mean values by interval, then the column is put pack into the data frame.
You’ll find an amended total daily steps histogram that now includes the imputed values.

#Simply reload data with NA values now:
data_na <- read.csv("activity.csv")
na_values <- as.numeric(sum(is.na(data_na$steps))) #number of NA's
paste("Number of 'NA' Values: ", na_values) #Print to console

## [1] "Number of 'NA' Values:  2304"

#Use the mean of each interval from the daily activity data set
na_vec <- which(is.na(data_na$steps)) #get indicies of NA values in dataset
new_data <- replace(data_na$steps,na_vec,daily_activity[match(data_na$interval, daily_activity$interval), ]$steps) #replace the indicies of NA values with the matched intervals from the daily_activity data frame that has the daily means.

## Warning in x[list] <- values: number of items to replace is not a multiple
## of replacement length

mean_na_data <- mutate(data_na, steps = new_data) #steps column now has replaced data in steps column

# test NA's
table(is.na(mean_na_data$steps))

## 
## FALSE 
## 17568

# FALSE :: 17568 

# Recalculate Totals with new data set:
total_imputed <- with(mean_na_data, aggregate(steps, by = list(ymd(date)), sum))
total_imputed <- rename(total_imputed, Day = Group.1, Steps = x)

#Mean and Median of Total Steps Imputed
mean_med_na <- data.frame(mean(total_imputed$Steps), as.numeric(median(total_imputed$Steps)), row.names = "Steps")
names(mean_med_na) <-c("Mean of total steps", "Median of total steps")
mean_med_na

##       Mean of total steps Median of total steps
## Steps            10766.19              10766.19

### Compare Boxplots: 
par(mfrow = c(1,2))
boxplot(total_steps$Steps, main = "Boxplot for Total Steps \n(NA Omitted)", ylab = "Total Steps")
boxplot(total_imputed$Steps, main = "Boxplot for Total Steps \n(NA Imputed)", ylab = "Total Steps")

#Quantiles: NA Omitted 
summary(total_steps$Steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    8841   10765   10766   13294   21194

#Quantiles: NA Imputed
summary(total_imputed$Steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      41    9819   10766   10766   12811   21194

#Histogram with imputed missing data:
p <- ggplot(total_imputed, aes(Steps))
p + geom_histogram(bins = length(total_imputed$Day)) + 
    xlab("Total Steps") + ylab("Frequency") +
    ggtitle("Histogram of Total Steps Per Day \nWith Mean Imputed NA Values ") + guides(fill = F)

###Conclusion:

As you can see by the comparison boxplots and printed quantiles, the imputed values cause a change in the quantiles of the total steps per day. Additionally, you’ll see frequencies on the histogram have increased since there are more intervals with step values.
By imputing the NA’s, we tighten the quantiles of the total steps per day translating to adding more density toward the mean of the total steps. However, as seen in the boxplot for the imputed values, we’ve caused the 1st and 3rd quartiles to be more central to the median (result of adding density toward the mean), but many of our values have become outliers. Imputing the mean data for the NA’s seems to have skewed our data slightly introducing more error.

Are there differences in activity patterns between weekdays and weekends?

daily_activity_wk <- mutate(mean_na_data, weekday = date)
daily_activity_wk$weekday <- isWeekday(as.timeDate(daily_activity_wk$weekday))
daily_activity_wk <- aggregate(steps~interval+weekday, data = daily_activity_wk, mean)
daily_activity_wk$weekday <- factor(daily_activity_wk$weekday, levels = c(TRUE, FALSE), labels = c("Weekday", "Weekend"))

p2 <- ggplot(daily_activity_wk, aes(x = interval, y = steps))
p2 + facet_grid(weekday~.) + geom_line(aes(color = weekday), show.legend = F) + guides(fill = F)+ ggtitle("Average Number of Steps Taken on Weekday/Weekend") + xlab("Interval") + ylab("Average Number of Steps")

Conclusion:

Examining the two graphs shows us that there are changes in the consistency of steps throughout the day, as well as the frequency of high step counts. Weekends seem to show more activity later at night, and more consistent numbers of steps throughout the day. Weekdays give us a very high peak for steps in the late morning but as the day progresses, the number of steps tends to not be as consistent. Additionally, late night activity diminishes more quickly on average on weekdays versus weekends where there is more late night activity as seen in the high interval numbers on the blue plot.

Reproducible Research: Peer Assessment 1

Sean Krinik

Loading and preprocessing the data

What is mean total number of steps taken per day?

What is the average daily activity pattern?

Conclusion:

Imputing missing values

Are there differences in activity patterns between weekdays and weekends?

Conclusion: