Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

Reading in the activity CSV file:

activity <- read.csv(unzip("activity.zip"))

What is mean total number of steps taken per day?

Calculating the total number of steps taken each day, then the mean and median of those daily totals:

total <- tapply(activity$steps, activity$date, sum)
hist(total, main = "Histogram of Total steps per day", xlab = "Total Steps")

mean(total, na.rm = TRUE)

## [1] 10766.19

median(total, na.rm = TRUE)

## [1] 10765
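The `na.rm = TRUE` argument matters here because `sum` returns `NA` for any day that contains a missing measurement, so those days would otherwise make the mean and median `NA` as well. A minimal sketch on a made-up three-day data frame (the `toy` values are illustrative and not taken from the dataset):

```r
# Hypothetical toy data: day "b" has one missing measurement
toy <- data.frame(
  date  = c("a", "a", "b", "b", "c", "c"),
  steps = c(10, 20, NA, 5, 30, 40)
)

# Per-day totals; any day containing an NA sums to NA
toy_total <- tapply(toy$steps, toy$date, sum)

# Without na.rm the NA propagates; with it, day "b" is dropped
mean(toy_total)
mean(toy_total, na.rm = TRUE)
```

In this sketch the mean is taken over days "a" and "c" only, which mirrors how days with missing data are excluded from the figures above.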

What is the average daily activity pattern?

Using the dplyr package to calculate the mean number of steps per interval across all days:

library(dplyr)
intervalmean <- activity %>% group_by(interval) %>% summarise(mean = mean(steps, na.rm = TRUE))

Plotting the average number of steps per interval across all days using the ggplot2 package:

library(ggplot2)
ggplot(intervalmean, aes(interval, mean)) + geom_line() + ylab("average steps taken")

which.max(intervalmean$mean)

## [1] 104

intervalmean[104,]

## # A tibble: 1 x 2
##   interval  mean
##      <int> <dbl>
## 1      835  206.

The maximum average number of steps (about 206) occurs in the 104th 5-minute interval of the day, labelled 835, which corresponds to 8:35 am.
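The step from the interval label to a clock time can be sketched as below. This assumes the labels encode the time as hour * 100 + minute, which is consistent with label 835 being the 104th interval; `interval_to_time` is a hypothetical helper and not part of the original analysis.

```r
# Hypothetical helper: convert interval labels (hour * 100 + minute) to "HH:MM"
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 835, 2355))

# Cross-check from the row position: 103 intervals of 5 minutes precede the 104th
minutes_before <- (104 - 1) * 5
c(hours = minutes_before %/% 60, minutes = minutes_before %% 60)
```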

Imputing missing values

Calculating total number of missing rows in dataset:

sum(is.na(activity$steps))

## [1] 2304

The output above shows that 2304 of the 17568 measurements in the sample are missing. This is about 13% of the dataset, a large enough share that it could have a significant impact on our analysis.
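The arithmetic behind that percentage can be checked directly. The sketch below uses the two counts quoted in the text and additionally assumes 288 five-minute intervals per day (24 * 60 / 5) to express the counts in days.

```r
n_missing <- 2304    # missing measurements, from the output above
n_total   <- 17568   # total measurements, as stated in the text

# Share of measurements that are missing, in percent
round(100 * n_missing / n_total, 1)

# Assumption: one day holds 24 * 60 / 5 = 288 five-minute intervals
intervals_per_day <- 24 * 60 / 5
n_missing / intervals_per_day   # missing data expressed in full days
n_total / intervals_per_day     # length of the study in days
```

The missing measurements add up to exactly 8 days' worth of intervals out of 61.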

For this dataset we are going to replace every NA value with the mean for its interval, computed across all days.

Obtain a logical vector that is TRUE wherever steps is NA:

impute_values <- is.na(activity$steps)

Create a repeating vector of interval means to fill our missing data:

impute_mean <- rep(intervalmean[["mean"]],8)

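Note that we repeat the interval mean vector 8 times, corresponding to the 8 dates with missing data in the dataset. This lines up correctly only because each of those dates is missing all of its intervals and the rows are ordered by date and then by interval.

Complete our missing data: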

activity_complete <- activity
activity_complete$steps[impute_values] <- impute_mean
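Repeating the mean vector depends on the missing values covering whole days in row order. A possible alternative, sketched here on made-up data and not used in this analysis, looks up each missing row's interval in the table of means, so it also works when NAs are scattered. The `toy` data frame and its values are hypothetical.

```r
# Hypothetical toy data: three days, three intervals each, NAs scattered
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(1, NA, 3, 3, 4, NA, NA, 6, 9)
)

# Mean per interval, ignoring missing values; names are the interval labels
toy_means <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill each missing row with the mean of its own interval
missing <- is.na(toy$steps)
toy_complete <- toy
toy_complete$steps[missing] <- unname(toy_means[as.character(toy$interval[missing])])

toy_complete$steps
```

The lookup by name keeps each imputed value tied to its interval, whatever the row order.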

Calculating total number of steps taken per day with our complete dataset:

total_complete <- tapply(activity_complete$steps, activity_complete$date, sum)
hist(total_complete, main = "Histogram of Total steps per day", xlab = "Total Steps")

mean(total_complete, na.rm = TRUE)

## [1] 10766.19

median(total_complete, na.rm = TRUE)

## [1] 10766.19

The mean has not changed in the imputed dataset (10766.19), while the median has shifted slightly from 10765 to 10766.19. This is because we filled each missing value with the mean for its interval across all days, so every imputed day has a total equal to the overall daily mean, and one of those days now sits at the median. Imputing with means also lowers the variance of our dataset, which may not be desirable. Other methods of imputation may better reflect the variability of the data; however, they were not chosen for this analysis.
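The effect of mean imputation on the mean and the variance can be seen in a small sketch. The vector `x` is made up for illustration and is not drawn from the activity data.

```r
# Hypothetical vector with two missing values
x <- c(2, 4, NA, 6, NA)

# Replace each NA with the mean of the observed values
x_filled <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)

# The mean is unchanged by the imputation
mean(x, na.rm = TRUE)
mean(x_filled)

# The variance shrinks, because the filled values add no spread
var(x, na.rm = TRUE)
var(x_filled)
```

Here the mean stays at 4 while the sample variance falls from 4 to 2.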

Are there differences in activity patterns between weekdays and weekends?

Converting the date variable from its text representation (factor or character, depending on the R version) to Date format using the lubridate package:

library(lubridate)
activity_complete$date <- ymd(activity_complete$date)

Use the weekdays() function to find the day of the week, then add a two-level weekend/weekday factor variable to our data frame:

DayoftheWeek <- weekdays(activity_complete$date)
weekendlogical <- DayoftheWeek %in% c("Saturday", "Sunday")
activity_complete$weekendweekday <- factor(ifelse(weekendlogical, "weekend", "weekday"))
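Since weekdays() returns day names in the current locale, matching on "Saturday" and "Sunday" assumes an English locale. A locale-independent variant, offered here as a sketch and not as the method used above, relies on the numeric weekday from as.POSIXlt(). The `dates` vector is a made-up example.

```r
# Hypothetical example dates: Friday through Monday in October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is 0 for Sunday through 6 for Saturday, in any locale
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Same two-level factor as in the analysis
day_type <- factor(ifelse(is_weekend, "weekend", "weekday"))
day_type
```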

Making our panel plot with the 5-minute intervals on the x-axis and the average number of steps, taken over all weekday days or all weekend days, on the y-axis:

plotdata <- aggregate(activity_complete$steps, list(activity_complete$interval, activity_complete$weekendweekday), mean)
names(plotdata) <- c("interval", "weekendweekday", "averagesteps")
p <- ggplot(data = plotdata, aes(x = interval, y = averagesteps)) + geom_line()
p + facet_wrap(~weekendweekday)
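The aggregation step can also be written with the formula interface of aggregate(), which names the output columns directly. The sketch below uses a made-up data frame; note that the formula interface drops rows with missing values by default, which makes no difference once the data has been imputed.

```r
# Hypothetical toy data: two intervals, two observations per group
toy <- data.frame(
  interval       = rep(c(0, 5), times = 4),
  weekendweekday = factor(rep(c("weekday", "weekend"), each = 4)),
  steps          = c(1, 2, 3, 4, 10, 20, 30, 40)
)

# Mean steps for each interval within each day type
toy_plot <- aggregate(steps ~ interval + weekendweekday, data = toy, FUN = mean)
toy_plot
```

The result has one row per interval and day type, with the averaged column keeping the name steps.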