This R Markdown document is Bill Seliger’s submission for Reproducible Research Peer Assessment 1. The course assignment can be found here: https://class.coursera.org/repdata-010/human_grading/view/courses/973511/assessments/3/submissions
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval each day. The original dataset can be found here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip
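The variables included in this dataset are steps (the number of steps taken in a 5-minute interval, with missing values coded as NA), date (the date on which the measurement was taken), and interval (an identifier for the 5-minute interval in which the measurement was taken).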
Set the working directory on my local computer
setwd("C:/Users/rr046302/Documents/Bill's Stuff/Coursera/Reproducible Research/RepData_PeerAssessment1")
I load the packages used in this analysis (dplyr, ggplot2, and lattice) and set the scipen option to suppress scientific notation. This code chunk uses echo=FALSE and warning=FALSE to suppress its echo and warnings.
require("dplyr") ## dplyr is used for structuring the data for analysis
require("ggplot2") ## ggplot2 is required for several plots
require("lattice") ## lattice plot is required for the weekday-weekend plot
options(scipen = 999) ## eliminate scientific notation
Then I read the data from the local zipped file into the object activity and convert it to a tbl class.
activity <- read.csv(unz("activity.zip","activity.csv")) ## read in the data
activity <- tbl_df(activity) ## structure the data as a tbl class
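For readers without a local copy of the zip file, the read step could be preceded by a download guard. This is a minimal sketch and not part of the original submission: the helper name `fetch_activity_zip` is hypothetical, and the URL is the dataset link quoted in the introduction.

```r
# Hypothetical helper (not in the original submission): download the zipped
# dataset only when it is not already present at the destination path.
fetch_activity_zip <- function(dest = "activity.zip",
                               url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip") {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the binary zip intact on Windows
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}
```

Calling `fetch_activity_zip()` before `read.csv(unz(...))` would leave an existing file untouched and fetch it otherwise.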
For this part of the assignment, you can ignore the missing values in the dataset. Make a histogram of the total number of steps taken each day
First I aggregate the number of steps per day, using the group_by and summarise functions from dplyr. I then use the hist function to create the histogram plot.
activity_days <- activity %>% group_by(date) %>% summarise(total.steps = sum(steps))
hist(activity_days$total.steps, breaks = 25, main = "Histogram of Total Steps per Day")
mean((activity_days$total.steps), na.rm = TRUE)
## [1] 10766.19
median((activity_days$total.steps), na.rm = TRUE)
## [1] 10765
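The `na.rm = TRUE` in the calls above matters because `sum(steps)` returns `NA` for any day that contains missing values, so those days drop out of the mean and median rather than counting as zero-step days. A toy illustration with made-up numbers (the data frame `toy` is hypothetical, not the activity data):

```r
toy <- data.frame(
  date  = c("d1", "d1", "d2", "d2"),
  steps = c(100, 200, NA, NA)
)
# Default sum(): the all-NA day stays NA
totals_na <- tapply(toy$steps, toy$date, sum)
# sum(na.rm = TRUE): the all-NA day silently becomes 0
totals_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(totals_na, na.rm = TRUE)  # averages over d1 only: 300
mean(totals_zero)              # averages over d1 and a spurious zero day: 150
```

Using `sum(steps, na.rm = TRUE)` in the daily aggregation would therefore pull the mean down by treating unobserved days as days with no activity.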
Create a factor of the interval - time of day - so that we can aggregate based on it
activity$interval.factor <- as.factor(activity$interval)
Calculate the average number of steps for each interval using the group_by and summarise functions
activity_interval <- activity %>% group_by(interval.factor) %>%
summarise(mean.steps = mean(steps, na.rm =TRUE))
activity_interval$interval <- as.numeric(as.character(activity_interval$interval.factor))
plot(activity_interval$interval, activity_interval$mean.steps, type = "l", xaxt="n",
xlab = "<-----------------Morning 5-minute interval Night----------------->",
ylab = "mean steps", main = "Daily Activity Pattern", sub = "Average steps recorded for October-November 2012")
axis(1, at = seq(100, 2300, by = 100), las = 2)
max_steps_interval <- which.max(activity_interval$mean.steps)
print(activity_interval[max_steps_interval,])
## Source: local data frame [1 x 3]
##
## interval.factor mean.steps interval
## 1 835 206.1698 835
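The interval identifier appears to encode a clock time, so 835 would correspond to 08:35 in the morning. The report does not state this explicitly, although it is consistent with the axis ticks from 100 to 2300 used in the plot. A small hypothetical helper, `interval_to_time`, sketches the conversion under that assumption:

```r
# Hypothetical helper: treat the identifier as HHMM (an assumption) and
# format it as "HH:MM".
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 835, 2355))
# "00:00" "08:35" "23:55"
```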
There are a number of observations where there are missing values (coded as NA). The presence of missing data may introduce bias into some calculations or summaries of the data.
sum(is.na(activity$steps))
## [1] 2304
After reviewing the data I found that the NAs fall on a specific set of dates for which no observations were recorded. In other words, every date that has any observations has no NAs, and there are 8 days with no observations at all; every value for those 8 days is NA. I perform exploratory data analysis that will suggest an imputation strategy.
First I create a day-of-week variable and order its levels Monday through Sunday, so that the weekdays appear before the weekend in plots
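One way to check a claim like this is to count the NAs per date and confirm that every date is either fully observed or fully missing. A sketch on made-up data (the `toy` data frame, with only three intervals per day, is hypothetical):

```r
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3),
  steps = c(NA, NA, NA, 0, 12, 7, 5, 0, 3)
)
na_per_date   <- tapply(is.na(toy$steps), toy$date, sum)
rows_per_date <- tapply(toy$steps, toy$date, length)

# TRUE when each date is either fully observed or fully missing
all(na_per_date == 0 | na_per_date == rows_per_date)

# Dates with no observations at all
names(na_per_date)[na_per_date > 0]
```

Applied to `activity$steps` and `activity$date`, the same pattern should list 8 dates with 288 NAs each if the description holds, since 2304 / 8 = 288 intervals per day.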
activity$weekday <- weekdays(as.Date(activity$date))
activity$weekday <- factor(activity$weekday, levels= c("Monday",
"Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
The number of missing observations is not the same across all weekdays. Here I calculate the number of missing observations for each day of the week
activity_day_NA <- activity %>% group_by(weekday) %>% summarise(sum(is.na(steps)))
print(activity_day_NA)
## Source: local data frame [7 x 2]
##
## weekday sum(is.na(steps))
## 1 Monday 576
## 2 Tuesday 0
## 3 Wednesday 288
## 4 Thursday 288
## 5 Friday 576
## 6 Saturday 288
## 7 Sunday 288
In performing exploratory data analysis on the dataset I found that there are differences between the different weekdays for which we have observations.
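To show the variation in steps across each day of week/interval I first calculate the mean steps for each weekday/interval combination and then create a facet plot of those means for each weekday
activity_day <- activity %>% group_by(weekday, interval.factor) %>%
summarise(mean.steps = mean(steps, na.rm = TRUE)) ## activity_day must exist before it is plotted
activity_day$interval <- as.numeric(as.character(activity_day$interval.factor))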
ggplot(data=activity_day, aes(x=interval, y=mean.steps)) + geom_line() + facet_wrap(~weekday) +
labs(title = "Mean steps per Interval for each day of the Week")
I will use the following strategy to impute missing values - calculate the average number of steps for each day of week/interval combination and complete the dataset by substituting this data for the NAs.
First I calculate the interval average for each weekday for which we have observations
activity_day <- activity %>% group_by(weekday, interval.factor) %>%
summarise(mean.steps = mean(steps, na.rm =TRUE))
I then merge the original data table, activity, with the activity_day dataframe, which holds the average steps for each interval/weekday combination. Next I create the variable impute.steps with an ifelse statement: it keeps the original steps value where one was recorded and substitutes the weekday/interval average where the original value is NA.
activity_impute <- merge(activity, activity_day, by=c("weekday","interval.factor"))
activity_impute$impute.steps <- ifelse(is.na(activity_impute$steps),
activity_impute$mean.steps, activity_impute$steps)
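The same fill-with-group-mean idea can be seen on a few made-up rows. Note that this sketch swaps the merge used above for base R's `ave()`, purely for brevity; the `toy` data frame is hypothetical.

```r
toy <- data.frame(
  weekday  = c("Monday", "Monday", "Monday", "Tuesday", "Tuesday"),
  interval = c(0, 0, 0, 0, 0),
  steps    = c(10, NA, 30, 4, NA)
)
# Mean of the observed steps within each weekday/interval group
group_mean <- ave(toy$steps, toy$weekday, toy$interval,
                  FUN = function(v) mean(v, na.rm = TRUE))
# Keep observed values; fill NAs with the group mean
toy$impute.steps <- ifelse(is.na(toy$steps), group_mean, toy$steps)
toy$impute.steps
# 10 20 30 4 4
```

The missing Monday value becomes the Monday mean (20) and the missing Tuesday value becomes the Tuesday mean (4), while observed values are untouched.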
Using the imputed data, I again aggregate the number of steps per day with the group_by and summarise functions from dplyr and then use the hist function to create the histogram plot
activity_impute_mean <- activity_impute %>% group_by(date) %>%
summarise(total.steps = sum(impute.steps))
hist(activity_impute_mean$total.steps, breaks = 25,
main = "Histogram of Total Steps per Day using Imputed Data")
mean(activity_impute_mean$total.steps)
## [1] 10821.21
median(activity_impute_mean$total.steps)
## [1] 11015
Because I used a granular imputation strategy, i.e. imputing at each day of the week and interval, and the number of missing observations varies across the days of the week, the imputation does change the mean and median steps across the entire data set.
Here I show the mean total steps per day for each weekday prior to imputation. These weekday means are the same after imputation, since each missing day is filled with its own weekday's averages. Because the number of missing days varies across the weekdays, the imputation shifts the overall histogram, mean, and median
activity_day_mean <- activity %>% group_by (date, weekday) %>% summarise(total.steps = sum(steps)) %>%
group_by (weekday) %>% summarise(mean.steps = round(mean(total.steps, na.rm = TRUE),0))
print(activity_day_mean)
## Source: local data frame [7 x 2]
##
## weekday mean.steps
## 1 Monday 9975
## 2 Tuesday 8950
## 3 Wednesday 11791
## 4 Thursday 8213
## 5 Friday 12360
## 6 Saturday 12535
## 7 Sunday 12278
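These weekday means, together with the NA counts per weekday reported earlier, roughly account for the shift in the overall mean. The sketch below assumes the data cover all 61 calendar days of October and November 2012 (so 53 observed days plus the 8 missing ones) and that each missing day's imputed total equals its weekday's mean daily total. All figures are copied from the outputs above.

```r
observed_days <- 61 - 8    # assumption: 61 calendar days, 8 fully missing
observed_mean <- 10766.19  # mean of daily totals before imputation

# Rounded weekday means for the weekdays that have missing days
weekday_mean <- c(Monday = 9975, Wednesday = 11791, Thursday = 8213,
                  Friday = 12360, Saturday = 12535, Sunday = 12278)
# Missing days per weekday = NA count / 288 intervals per day
missing_days <- c(Monday = 576, Wednesday = 288, Thursday = 288,
                  Friday = 576, Saturday = 288, Sunday = 288) / 288

approx_mean <- (observed_days * observed_mean + sum(missing_days * weekday_mean)) /
  (observed_days + sum(missing_days))
approx_mean
```

The result lands within one step of the reported post-imputation mean of 10821.21, and a small gap is consistent with the rounding in the weekday means. The mean rises because the eight imputed days average more steps than the observed days do.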
I believe this imputation strategy is valid and supported by the data provided for this assignment.
The weekdays() function may be of some help for this part. Use the dataset with the filled-in missing values.
Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
activity_impute <- activity_impute %>%
mutate(weekend = factor(ifelse(weekday %in% c("Saturday", "Sunday"), "weekend", "weekday"))) ## a two-level factor, as the assignment asks
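Because `weekdays()` returns day names in the session's locale, the comparison with "Saturday" and "Sunday" only works in an English locale. A locale-independent alternative, sketched here as an aside rather than the method used in this report, relies on `as.POSIXlt()`, whose `wday` field numbers the days 0 to 6 starting on Sunday. The helper name `is_weekend` is hypothetical.

```r
# Hypothetical helper: TRUE for Saturdays and Sundays, regardless of locale
is_weekend <- function(date) {
  as.POSIXlt(as.Date(date))$wday %in% c(0, 6)  # 0 = Sunday, 6 = Saturday
}

# Friday, Saturday, Sunday, Monday
dates <- c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08")
factor(ifelse(is_weekend(dates), "weekend", "weekday"),
       levels = c("weekday", "weekend"))
```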
Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.
activity_impute_mean <- activity_impute %>% group_by(weekend, interval) %>%
summarise(mean.steps = mean(impute.steps))
xyplot(mean.steps ~ interval | weekend, data = activity_impute_mean,
type = "l", layout = c(1,2), xlab = "Interval", ylab = "Number of Steps",
main = "Average Steps by 5-minute Interval for Weekends and Weekdays")
There are clear differences in activity between weekends and weekdays. This is plausible, since many people are more active on weekends than they are during the week.