First we load the data and set the class of each column. Let's look at the structure of the data.
activity <- read.csv("activity.csv", header = TRUE,
                     colClasses = c("numeric", "Date", "integer"))
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : num NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
library(ggplot2)
library(plyr)
# Total steps per day; with na.rm = TRUE, a day that contains only NAs sums to 0
perDay <- ddply(activity, "date", summarise, totalSteps = sum(steps, na.rm = TRUE))
qplot(totalSteps, data = perDay, binwidth = 500)
with(perDay, mean(totalSteps, na.rm = TRUE))
## [1] 9354
with(perDay, median(totalSteps, na.rm = TRUE))
## [1] 10395
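One detail of the daily totals above is worth illustrating: `sum(steps, na.rm = TRUE)` returns 0 for a day on which every observation is missing, so such days enter the mean and median as zeros. The sketch below uses a small made-up data frame (`toy`, not the real activity data) to compare that behaviour with the formula interface of `aggregate()`, which drops NA rows before summing. It shows one possible alternative and is not the method used in this report.

```r
# Hypothetical toy data: day 1 is entirely missing, days 2 and 3 are observed.
toy <- data.frame(
  date  = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  steps = c(NA, NA, 10, 20, 30, 50)
)

# Summing with na.rm = TRUE turns the all-NA day into a total of 0.
withZeros <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# aggregate() with a formula omits NA rows first (na.action = na.omit),
# so the all-NA day does not appear in the result at all.
observedOnly <- aggregate(steps ~ date, data = toy, FUN = sum)

mean(withZeros)           # about 36.7: the zero pulls the mean down
mean(observedOnly$steps)  # 55: only the two observed days count
```

Which version is appropriate depends on whether a fully missing day should count as a day with no steps or be left out of the summary.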
# Mean steps for each 5-minute interval, averaged across all days
perInterval <- ddply(activity, "interval", summarise,
                     meanSteps = mean(steps, na.rm = TRUE))
qplot(interval, meanSteps, data = perInterval, geom = "line")
with(perInterval, interval[which.max(meanSteps)])
## [1] 835
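The value 835 is an interval identifier rather than a count of minutes. Assuming the identifiers encode clock time as HHMM without leading zeros (0, 5, ..., 55, 100, 105, ...), which matches the values shown by `str()` but is not stated explicitly in the data, a small helper can turn them into readable times. The function name `intervalToClock` is made up for this illustration.

```r
# Assumption: interval identifiers are clock times written as HHMM without
# leading zeros, not minutes since midnight.
intervalToClock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))  # e.g. 835 -> "0835"
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

intervalToClock(c(0, 835, 2355))
# "00:00" "08:35" "23:55"
```

Under that assumption, the most active interval on average starts at 08:35.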
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
with(activity, sum(is.na(steps)))
## [1] 2304
Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
Create a new dataset that is equal to the original dataset but with the missing data filled in.
Here I will use the impute function from the Hmisc package. Called with mean, it replaces every missing value of steps with the overall mean of the observed values of that variable.
library(Hmisc, warn.conflicts = FALSE)
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
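# Replace each NA in steps with the overall mean of the observed values
completeActivity <- transform(activity, steps = impute(steps, mean))
# Recompute the daily totals on the filled-in data
byDay <- ddply(completeActivity, "date", summarise,
               totalSteps = sum(steps, na.rm = TRUE))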
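For readers without Hmisc, the effect of mean imputation can be sketched in base R. The helper name `fillWithMean` and the sample vector are made up for this illustration. Note that `impute()` also marks its result with an "impute" class, which this sketch does not reproduce.

```r
# Replace every NA in a numeric vector with the mean of the observed values.
fillWithMean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

steps <- c(NA, 2, 4, NA, 6)  # hypothetical values; observed mean is 4
fillWithMean(steps)
# 4 2 4 4 6
```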
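After imputing the missing values, both the mean and the median of the daily totals changed: the mean rose from 9354 to 10766 and the median from 10395 to 10766.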
qplot(totalSteps, data = byDay, binwidth = 500)
with(byDay, mean(totalSteps, na.rm = TRUE))
## [1] 10766
with(byDay, median(totalSteps, na.rm = TRUE))
## [1] 10766
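The assignment text also suggests filling gaps with the mean for the same 5-minute interval, which keeps the daily activity pattern instead of flattening every gap to one value. A minimal base R sketch of that alternative follows. The data frame `toy` and the column `stepsFilled` are made up for the illustration, and this is not the strategy used for the results above.

```r
# Hypothetical toy data: two intervals observed on three days, with gaps.
toy <- data.frame(
  interval = rep(c(0, 5), times = 3),
  steps    = c(NA, 10, 2, NA, 4, 30)
)

# ave() returns, for every row, the mean of the observed steps in its interval.
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values and use the interval mean only where steps is missing.
toy$stepsFilled <- ifelse(is.na(toy$steps), intervalMean, toy$steps)
toy$stepsFilled
# 3 10 2 20 4 30
```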
The weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part.
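# weekdays() returns day names in the current locale; English names are assumed here
completeActivity$weekends <- weekdays(completeActivity$date) %in% c("Saturday", "Sunday")
# The factor levels of a logical are FALSE, TRUE, so they map to Weekday, Weekend
completeActivity$weekends <- as.factor(completeActivity$weekends)
levels(completeActivity$weekends) <- c("Weekday", "Weekend")
# Mean steps per interval, computed separately for weekdays and weekends
perInterval <- ddply(completeActivity, .(weekends, interval), summarise, meanSteps = mean(steps))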
qplot(interval, meanSteps, data = perInterval, geom = "line", facets = . ~ weekends)
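Because `weekdays()` returns names in the session's locale, the comparison against "Saturday" and "Sunday" silently labels every day a weekday on a non-English system. One locale-independent alternative, sketched here on a few made-up dates, uses the numeric day of the week from `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday. The variable names are hypothetical.

```r
# Friday, Saturday, Sunday, Monday (1 October 2012 was a Monday).
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is numeric and does not depend on the locale: 0 = Sunday, 6 = Saturday.
isWeekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Setting the levels explicitly keeps the order fixed even if one type is absent.
dayType <- factor(ifelse(isWeekend, "Weekend", "Weekday"),
                  levels = c("Weekday", "Weekend"))
as.character(dayType)
# "Weekday" "Weekend" "Weekend" "Weekday"
```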