Reading in activity csv file
activity <- read.csv(unzip("activity.zip"))
Calculating total mean number of steps taken each day:
total <- tapply(activity$steps, activity$date, sum)
hist(total, main = "Histogram of Total steps per day", xlab = "Total Steps")
mean(total, na.rm = TRUE)
## [1] 10766.19
median(total,na.rm = TRUE)
## [1] 10765
Using dplyr package to calculate mean steps per interval across all days:
intervalmean <- activity %>% group_by(interval) %>% summarise(mean=mean(steps, na.rm = TRUE))
Plotting average number of steps per interval across all days using ggplot2 package
ggplot(intervalmean, aes(interval, mean)) + geom_line() + ylab("average steps taken")
which.max(intervalmean$mean)
## [1] 104
intervalmean[104,]
## # A tibble: 1 x 2
## interval mean
## <int> <dbl>
## 1 835 206.
The maximum average number of steps (206) occurs during the 104th 5 minute interval of the day which corresponds to 8.35am.
Calculating total number of missing rows in dataset:
sum(is.na(activity$steps))
## [1] 2304
From the above analysis we can see that 2304 measurements out of the total 17568 measurements in the sample are missing. This represents 13% of our dataset which is very high and can potentially have a significant impact on our analysis.
For this dataset we are going to impute all NA values with the mean for that interval.
Obtain a logical vector where NAs correspond to true:
impute_values <- is.na(activity$steps)
Create a repeating vector of interval means to fill our missing data:
impute_mean <- rep(intervalmean[["mean"]],8)
Note that we repeat the interval mean vector 8 times corresponding to 8 missing dates in the dataset. Complete our missing data:
activity_complete <- activity
activity_complete$steps[impute_values] <- impute_mean
Calculating total number of steps taken per day with our complete dataset:
total_complete <- tapply(activity_complete$steps, activity$date, sum)
hist(total_complete, main = "Histogram of Total steps per day", xlab = "Total Steps")
mean(total_complete, na.rm = TRUE)
## [1] 10766.19
median(total_complete,na.rm = TRUE)
## [1] 10766.19
Mean and median values have not changed in the imputed dataset compared to the original dataset. This is becuase we used the mean value for each interval across all days. This lowers the variance of our dataset which may not be desireable. Other methods of imputation may better reflect the variablility of the data, however they were not chosen for this analysis.
Converting date variable from factor to date format using the lubridate package:
activity_complete$date <- ymd(activity_complete$date)
Use weekdays() function to find day of the week and then add a two level weekend/weekday factor variable to our data frame:
DayoftheWeek <- weekdays(activity_complete$date)
weekendlogical <- DayoftheWeek %in% c("Saturday", "Sunday")
activity_complete$weekendweekday <- factor(ifelse(weekendlogical==TRUE, "weekend", "weekday"))
Making our panel plot showing showing 5-minute intervals on the x-axis and average steps taken over all weekend days or weekday days.
plotdata <- aggregate(activity_complete$steps, list(activity_complete$interval, activity_complete$weekendweekday), mean)
names(plotdata) <- c("interval", "weekendweekday", "averagesteps")
p <- ggplot(data = plotdata, aes(x = interval, y = averagesteps)) + geom_line()
p + facet_wrap(~weekendweekday)