This project was the first peer assessment for Reproducible Research, the fifth course in the Johns Hopkins Data Science Specialization.

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because statistical methods and software for processing and interpreting them are lacking.

This assignment uses data from a personal activity monitoring device that records the number of steps taken in 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) of measurements from one anonymous individual.
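
As a quick sanity check, the description above implies a specific number of observations. The sketch below only does the arithmetic; the variable names are illustrative, and it assumes every day in the period contributes a full set of 5-minute intervals.

```r
# Expected size of the dataset, derived from the description above
days_in_period    <- 31 + 30          # October + November 2012
intervals_per_day <- 24 * 60 / 5      # 5-minute intervals in one day
expected_rows     <- days_in_period * intervals_per_day
expected_rows
```

Once the data are loaded, comparing this value with nrow(activity) is a cheap way to confirm that no intervals were dropped.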

Loading and preprocessing the data

library(lubridate)
library(dplyr)
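# Skip the download and parse when `activity` is already loaded in this session
if (!exists("activity")) {
    url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
    datafile <- "activity.zip"
    if (!file.exists("activity.csv")) {
        download.file(url, datafile, mode = "wb")
        unzip(datafile)
    }
    # The archive contains activity.csv; derive its name from the zip file name
    activity <- read.csv(sub("\\.zip$", ".csv", datafile))
    activity$date <- ymd(as.character(activity$date))  # parse dates as Date objects
}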

What is the mean total number of steps taken per day?

# summarise() returns one row per day, so each day is counted once in the histogram
daily_steps <- activity %>% 
               group_by(date) %>% 
               summarise(total = sum(steps, na.rm = TRUE))
hist(daily_steps$total, col = "orchid3", xlab = "Total Steps", 
     main = "Histogram of Daily Steps")

mean(daily_steps$total)
## [1] 9354.23
median(daily_steps$total)
## [1] 10395
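
One caveat when reading these figures: sum(..., na.rm = TRUE) returns 0 when every value is missing, so a day with no recordings enters the totals as a zero-step day rather than being left out. A minimal sketch on made-up data (the data frame toy and its values are illustrative, not taken from the dataset):

```r
# Toy data (made up): day "a" is fully recorded, day "b" is entirely missing
toy <- data.frame(day = c("a", "a", "b", "b"), steps = c(10, 20, NA, NA))

totals <- tapply(toy$steps, toy$day, sum, na.rm = TRUE)
totals          # day "b" totals 0, not NA
mean(totals)    # the zero is included in the mean

# Count recorded intervals per day; a count of 0 marks a fully missing day
recorded <- tapply(!is.na(toy$steps), toy$day, sum)
names(recorded)[recorded == 0]
```

The same per-day count applied to activity would show whether any days are fully missing and therefore counted as zero.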

What is the average daily activity pattern?

ave_daily_steps <- activity %>% 
                   group_by(interval) %>% 
                   mutate(average_day = mean(steps, na.rm = TRUE))
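# ave_daily_steps has one row per observation; keep one row per interval
# so the line is not retraced for every day
interval_means <- distinct(ungroup(ave_daily_steps), interval, average_day)
plot(interval_means$interval, interval_means$average_day, type = "l", 
     col = "orchid3", xlab = "Interval", ylab = "Average Number of Steps Taken", 
     main = "Steps Taken During an Average Day")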

maxsteps <- ave_daily_steps$interval[ave_daily_steps$average_day ==  
                                     max(ave_daily_steps$average_day)]
max(ave_daily_steps$average_day) # highest average of steps taken
## [1] 206.1698
maxsteps[1] # interval which has the highest average of steps taken
## [1] 835
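
To read an interval label such as 835 as a time of day, the sketch below assumes the labels encode clock time as hour * 100 + minute, so 835 would mean 08:35. That encoding is an assumption here, and interval_to_clock is a hypothetical helper, not part of the original analysis.

```r
# Assumption: interval labels are hour * 100 + minute (835 means 08:35)
interval_to_clock <- function(interval) {
    interval <- as.integer(interval)
    sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_clock(c(0, 835, 2355))
```

If the assumption holds, the busiest average interval starts at 08:35 in the morning.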

Imputing missing values

sum(!complete.cases(activity)) # number of intervals with NA value
## [1] 2304

I will fill in the missing values with the average number of steps for that interval over the observation period.
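
The rule can be sketched on a small made-up data frame. This version uses base R's ave() to compute the per-interval mean for every row, which is one possible way to express the idea and not necessarily how the analysis below does it; toy, interval_mean, and filled are illustrative names.

```r
# Toy data (made up): three days of two intervals each, with two gaps
toy <- data.frame(
    interval = c(0, 5, 0, 5, 0, 5),
    steps    = c(2, NA, 4, 10, NA, 20)
)

# Mean of the observed values within each interval, repeated on every row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries with their interval mean
filled  <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- interval_mean[missing]
filled$steps
```

Because ave() groups by the interval column itself, this form does not depend on the rows being in any particular order.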

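comp_activity <- activity # copy original data before making changes
# ave_daily_steps has the same row order as activity, so the interval mean
# for each missing row sits at the same position
comp_activity$steps[is.na(comp_activity$steps)] <-  
    ave_daily_steps$average_day[is.na(comp_activity$steps)]
comp_daily_steps <- comp_activity %>% 
                    group_by(date) %>% 
                    mutate(total = sum(steps))
# mutate() repeats each daily total on every row, so keep one total per date
comp_totals <- distinct(comp_daily_steps, date, total)$total
hist(comp_totals, col = "orchid3", xlab = "Total Steps", 
     main = "Histogram of Daily Steps from Completed Dataset")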

mean(comp_daily_steps$total)
## [1] 10766.19
median(comp_daily_steps$total)
## [1] 10766.19
mean(comp_daily_steps$total) - mean(daily_steps$total)
## [1] 1411.959
median(comp_daily_steps$total) - median(daily_steps$total)
## [1] 371.1887

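Filling in the missing data raised the estimated mean number of daily steps by about 1,412 and the median by about 371. The 2,304 missing intervals equal eight full days of 288 intervals, and the first analysis summed steps with na.rm = TRUE, so a day with no recorded values counts as a zero-step day there, which pulls both estimates down. As expected, the median was affected less than the mean.

Are there differences in activity patterns between weekends and weekdays?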

wk_activity <- comp_daily_steps
wk_activity$weekend <- factor(weekdays(wk_activity$date, abbreviate = FALSE) 
                              %in% c("Saturday", "Sunday"), 
                              labels = c("weekday", "weekend"))
par(mfrow = c(2, 1))
# summarise() gives one row per interval and day type, so each line is drawn once
ave_wk_steps <- wk_activity %>% 
                group_by(interval, weekend) %>% 
                summarise(average_day = mean(steps))
plot(ave_wk_steps$interval[ave_wk_steps$weekend == "weekday"],  
     ave_wk_steps$average_day[ave_wk_steps$weekend == "weekday"], 
     type = "l", col = "orchid3", xlab = "Interval", ylim = c(0, 250),
     ylab = "Average Number of Steps Taken", 
     main = "Steps Taken During an Average Weekday")
plot(ave_wk_steps$interval[ave_wk_steps$weekend == "weekend"],  
     ave_wk_steps$average_day[ave_wk_steps$weekend == "weekend"], 
     type = "l", col = "orchid3", xlab = "Interval", ylim = c(0, 250),
     ylab = "Average Number of Steps Taken", 
     main = "Steps Taken During an Average Weekend Day")

Compared to the average weekday, weekend activity begins later and is more consistent throughout the day.
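
A note on reproducibility: the weekday/weekend split above matches on the English names "Saturday" and "Sunday", and weekdays() returns names in the session's locale. The sketch below shows one locale-independent alternative using the numeric day of the week from base R; the dates and the names is_weekend and day_type are illustrative.

```r
# Four consecutive dates from the observation period: Friday through Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, whatever the locale
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)
day_type <- factor(ifelse(is_weekend, "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
day_type
```

Setting the factor levels explicitly also keeps both labels available when a subset happens to contain only one day type.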