Load the given csv file. Transform the 'date' column to POSIXct objects.
df <- read.csv("./data/activity.csv", sep = ",", stringsAsFactors = FALSE)
df$date <- as.POSIXct(strptime(df$date, "%Y-%m-%d"))
Use the aggregate function to get the total number of steps taken per day. Then draw a histogram to see the distribution of the total number of steps taken per day.
stepsSum1 <- aggregate(. ~ date, data = df[, c(1, 2)], FUN = sum)
# Option 2 Use sapply to pass the na.rm parameter to tapply function
# stepsSum1 <- sapply('steps', function(i) tapply(df[[i]], df$date, sum,
# na.rm=TRUE))
# Index by the NA status of the steps column itself, not of the whole data frame
hist(stepsSum1$steps[!is.na(stepsSum1$steps)], ylim = c(0, 30), xlab = "", main = "")
title(main = "Total number of steps taken per day", xlab = "Number of steps")
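The two options shown above are not interchangeable, and a small sketch may make the difference concrete. The formula interface of aggregate uses na.action = na.omit by default, so rows with missing steps are removed before summing and a day with no observed values disappears from the result. tapply with na.rm = TRUE keeps such a day and reports a total of 0, which would pull the histogram and the mean downwards. The data frame `toy` and the day labels below are made up for illustration.

```r
# Toy data (hypothetical): day "b" has only missing step counts
toy <- data.frame(
  steps = c(1, 2, NA, NA, 5, NA),
  date  = c("a", "a", "b", "b", "c", "c"),
  stringsAsFactors = FALSE
)

# Formula interface: na.action defaults to na.omit, so rows with NA are
# removed before summing and day "b" is dropped from the result
viaAggregate <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps every day; an all-NA day sums to 0
viaTapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(viaAggregate)
print(viaTapply)
```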
Calculate the mean total number of steps taken per day.
stepsMean1 <- mean(stepsSum1$steps)
stepsMean1
## [1] 10766
Calculate the median total number of steps taken per day.
stepsMedian1 <- median(stepsSum1$steps)
stepsMedian1
## [1] 10765
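As in the previous task, use aggregate, but change the grouping variable to 'interval' and take the mean instead of the sum. Plot the result as a time series and mark the interval with the maximum average number of steps.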
stepsMeanInterval <- aggregate(. ~ interval, data = df[, c(1, 3)], FUN = mean)
# stepsMean2 <- sapply('steps', function(i) tapply(df[[i]], df$interval,
# mean, na.rm=TRUE))
plot(stepsMeanInterval, type = "l", main = "", xlab = "", ylab = "")
# Use a name that does not shadow the base function max()
maxIdx <- which.max(stepsMeanInterval$steps)
points(x = stepsMeanInterval$interval[maxIdx], y = stepsMeanInterval$steps[maxIdx],
col = "red")
text(x = stepsMeanInterval$interval[maxIdx], y = stepsMeanInterval$steps[maxIdx],
col = "red", labels = paste("the maximum number of steps: ", round(stepsMeanInterval$steps[maxIdx],
digits = 2)), pos = 4, cex = 0.8)
title(main = "The Average Daily Activity Pattern", xlab = "Interval", ylab = "Steps Taken")
I am going to replace each missing value with the mean value for the same interval. First, I locate the missing values and look up their intervals. Next, I look up the mean number of steps for those intervals, taken across all days, and use it in place of the missing values. Because the previous task already produced the average daily activity pattern, I can use that data.frame (stepsMeanInterval) directly.
Use the complete.cases function to count the incomplete cases.
sum(!complete.cases(df))
## [1] 2304
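complete.cases counts rows with a missing value in any column, so it does not say which column is responsible. A minimal sketch on made-up data (the data frame `toy` is hypothetical) shows how colSums(is.na(...)) breaks the count down by column:

```r
# Toy data (hypothetical): only the steps column contains missing values
toy <- data.frame(
  steps    = c(NA, 3, NA, 7),
  date     = c("d1", "d1", "d2", "d2"),
  interval = c(0, 5, 0, 5)
)

# Number of missing values in each column
naPerColumn <- colSums(is.na(toy))

# Number of rows with at least one missing value
incompleteRows <- sum(!complete.cases(toy))

print(naPerColumn)
print(incompleteRows)
```

When only one column has missing values, as in this toy example, the two counts agree.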
Replace each missing value with the corresponding value from the result of the previous task, i.e. the mean for that 5-minute interval. The dataset 'df2' is the result.
incomplete <- !complete.cases(df)
df2 <- data.frame(df)
for (i in seq(nrow(df2))) {
if (incomplete[i]) {
df2[i, 1] <- stepsMeanInterval$steps[which(stepsMeanInterval$interval ==
df2[i, 3])]
}
}
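The row-by-row loop above can also be written without a loop. The following is a sketch of one possible vectorized form on made-up data, not a replacement for the code above: match() looks up, for every row with a missing value, the position of its interval in the table of interval means. The names `toyDf`, `toyMeans` and `toyImputed` are hypothetical.

```r
# Toy data (hypothetical): two days, two intervals, two missing values
toyDf <- data.frame(
  steps    = c(NA, 4, 2, NA),
  date     = c("d1", "d1", "d2", "d2"),
  interval = c(0, 5, 0, 5)
)

# Interval means over the observed values only (rows with NA are dropped)
toyMeans <- aggregate(steps ~ interval, data = toyDf, FUN = mean)

toyImputed <- toyDf
missing <- is.na(toyImputed$steps)

# For each missing row, find the row of toyMeans with the same interval
lookup <- match(toyImputed$interval[missing], toyMeans$interval)
toyImputed$steps[missing] <- toyMeans$steps[lookup]

print(toyImputed)
```

Only the rows flagged in `missing` are overwritten, so observed values stay untouched, as in the loop.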
Check that df2 has no missing values and that the missing values were replaced correctly.
table(complete.cases(df2)) # no FALSE exists
##
## TRUE
## 17568
head(cbind(df[incomplete, ], imputation = df2[incomplete, 1]))
## steps date interval imputation
## 1 NA 2012-10-01 0 1.71698
## 2 NA 2012-10-01 5 0.33962
## 3 NA 2012-10-01 10 0.13208
## 4 NA 2012-10-01 15 0.15094
## 5 NA 2012-10-01 20 0.07547
## 6 NA 2012-10-01 25 2.09434
head(stepsMeanInterval)
## interval steps
## 1 0 1.71698
## 2 5 0.33962
## 3 10 0.13208
## 4 15 0.15094
## 5 20 0.07547
## 6 25 2.09434
As in the previous task, draw a histogram for the imputed dataset showing the total number of steps taken per day.
stepsSumImputed <- aggregate(. ~ date, df2[, c(1, 2)], FUN = sum)
hist(stepsSumImputed$steps, main = "", xlab = "")
title(main = "Total number of steps taken per day (after imputation)", xlab = "Number of steps")
Calculate and report the mean and median total number of steps taken per day.
stepsSum3 <- aggregate(. ~ date, df2[, c(1, 2)], FUN = sum)
stepsMean3 <- mean(stepsSum3$steps)
stepsMedian3 <- median(stepsSum3$steps)
cat("Mean number of steps taken per day (before imputation): ", stepsMean1,
"\n", "Mean number of steps taken per day (after imputation): ", stepsMean3)
## Mean number of steps taken per day (before imputation): 10766
## Mean number of steps taken per day (after imputation): 10766
cat("Median number of steps taken per day (before imputation): ", stepsMedian1,
"\n", "Median number of steps taken per day (after imputation): ", stepsMedian3)
## Median number of steps taken per day (before imputation): 10765
## Median number of steps taken per day (after imputation): 10766
The mean number of steps per day is the same before and after imputation, and the median changes by only one step. Therefore, imputation does not really affect the overall mean and median values.
To see how imputation affects the daily totals, we plot the total number of steps per day before and after imputation. The days whose values were missing are filled in with the interval means, which makes the total number of steps for each of those days equal to the overall mean of the daily totals, i.e. 10766.19 (the blue horizontal line).
plot(stepsSumImputed$date, stepsSumImputed$steps, type = "l", lwd = 3, xlab = "",
ylab = "")
lines(stepsSum1$date, stepsSum1$steps, type = "l", col = "red") # before imputation
# Draw the overall mean of the daily totals rather than relying on the first row
abline(h = stepsMean3, col = "blue")
legend("topright", pch = "_", col = c("black", "red", "blue"), legend = c("After Imputation",
"Before Imputation", "Average"))
title(xlab = "Date", ylab = "Steps", main = "Total daily number of steps")
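The reason the imputed days land exactly on the overall mean can be checked with a little arithmetic. If a day is missing entirely, its imputed total is the sum of the interval means, and when every other day is complete that sum equals the mean of the daily totals. The sketch below assumes missing values occur only as whole days, which is what the plot suggests for this dataset; the data frame `toy` and all values in it are made up.

```r
# Toy data (hypothetical): 3 days x 2 intervals, day "d2" entirely missing
toy <- data.frame(
  steps    = c(10, 30, NA, NA, 20, 40),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Daily totals before imputation: d1 = 40, d3 = 60 (d2 is dropped)
dailyBefore <- aggregate(steps ~ date, data = toy, FUN = sum)

# Interval means over observed days: interval 0 -> 15, interval 5 -> 35
intervalMeans <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Fill each gap with the mean of its interval
imputed <- toy
gaps <- is.na(imputed$steps)
imputed$steps[gaps] <- intervalMeans$steps[match(imputed$interval[gaps],
                                                 intervalMeans$interval)]

# Daily totals after imputation: d2 becomes 15 + 35 = 50
dailyAfter <- aggregate(steps ~ date, data = imputed, FUN = sum)

print(mean(dailyBefore$steps))  # 50
print(dailyAfter$steps)         # 40 50 60
print(mean(dailyAfter$steps))   # 50
```

Adding days that sit exactly at the mean leaves the mean unchanged, and it can move the median towards the mean, which is consistent with the before and after values reported above.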
Use the weekdays function to transform the date to weekdays, and then add one more column to differentiate workdays from weekends. Finally, plot the average number of steps per interval, separately for workdays and weekends. We can see that the patterns of steps for workdays and weekends are different. Note that the comparison with "Sat" and "Sun" assumes an English locale.
df2$weekday <- weekdays(df2$date, abbreviate = TRUE)
df2$weekday1 <- ifelse(df2$weekday == "Sun" | df2$weekday == "Sat", "weekend",
"workday")
df2$weekday1 <- factor(df2$weekday1)
stepsMeanInterval2 <- aggregate(. ~ interval + weekday1, data = df2[, c(1, 3,
5)], FUN = mean)
library(lattice)
xyplot(steps ~ interval | weekday1, data = stepsMeanInterval2, layout = c(1,
2), type = "l")
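Because weekdays() returns day names in the current locale, the comparison with "Sat" and "Sun" above may silently label every day a workday on a non-English system. One possible locale-independent alternative, shown here as a sketch on a few made-up dates, uses the numeric wday field of POSIXlt, which is 0 for Sunday and 6 for Saturday. The variable names `dates`, `wday` and `dayType` are hypothetical.

```r
# Four consecutive dates (hypothetical sample): Friday to Monday
dates <- as.POSIXct(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"),
                    tz = "UTC")

# wday is numeric (0 = Sunday, ..., 6 = Saturday) and does not depend on locale
wday <- as.POSIXlt(dates)$wday

# Classify Saturday and Sunday as weekend, everything else as workday
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "workday"))

print(data.frame(date = dates, wday = wday, dayType = dayType))
```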