Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

Load the given CSV file and convert the 'date' column to POSIXct objects.

df <- read.csv("./data/activity.csv", sep = ",", stringsAsFactors = FALSE)
df$date <- as.POSIXct(strptime(df$date, "%Y-%m-%d"))

What is mean total number of steps taken per day?

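Use the aggregate function to compute the total number of steps taken per day. The formula interface drops rows with missing values by default, so days whose values are all missing are excluded. Then draw a histogram to see the distribution of the daily totals.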

stepsSum1 <- aggregate(. ~ date, data = df[, c(1, 2)], FUN = sum)
# Option 2 Use sapply to pass the na.rm parameter to tapply function
# stepsSum1 <- sapply('steps', function(i) tapply(df[[i]], df$date, sum,
# na.rm=TRUE))

# aggregate() has already dropped the NA rows, so no extra filtering is needed
hist(stepsSum1$steps, ylim = c(0, 30), xlab = "", main = "")
title(main = "Total number of steps taken per day", xlab = "Number of steps")

plot of chunk stepsMeanPerDay

Calculate the mean total number of steps taken per day.

stepsMean1 <- mean(stepsSum1$steps)
stepsMean1
## [1] 10766

Calculate the median total number of steps taken per day.

stepsMedian1 <- median(stepsSum1$steps)
stepsMedian1
## [1] 10765
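
The mean and median above depend on how days with only missing values are treated, and the two options shown earlier (the aggregate formula interface and tapply with na.rm=TRUE) behave differently. Here is a minimal sketch of that difference on toy data; the names toy, byFormula and byTapply are hypothetical and exist only for this illustration.

```r
# Toy data (hypothetical): day "d1" is entirely missing, "d2" and "d3" are complete.
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 30, 40),
  date  = rep(c("d1", "d2", "d3"), each = 2)
)

# Formula interface: the default na.action = na.omit removes NA rows before
# grouping, so "d1" does not appear in the result at all.
byFormula <- aggregate(steps ~ date, data = toy, FUN = sum)

# tapply with na.rm = TRUE keeps "d1" and reports a total of 0 for it.
byTapply <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

mean(byFormula$steps)  # averages over the two observed days only
mean(byTapply)         # averages over three days, one of them counted as 0
```

With the tapply variant, all-missing days enter the statistics as zero-step days, which pulls the mean down; the formula interface leaves them out.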

What is the average daily activity pattern?

This is similar to the previous task, except that the grouping variable is 'interval' and the aggregation function is the mean.

stepsMeanInterval <- aggregate(. ~ interval, data = df[, c(1, 3)], FUN = mean)
# stepsMean2 <- sapply('steps', function(i) tapply(df[[i]], df$interval,
# mean, na.rm=TRUE))
plot(stepsMeanInterval, type = "l", main = "", xlab = "", ylab = "")
# Row index of the interval with the highest mean (named maxIdx to avoid masking base::max)
maxIdx <- which.max(stepsMeanInterval$steps)
points(x = stepsMeanInterval$interval[maxIdx], y = stepsMeanInterval$steps[maxIdx], 
    col = "red")
text(x = stepsMeanInterval$interval[maxIdx], y = stepsMeanInterval$steps[maxIdx], 
    col = "red", labels = paste("the maximum number of steps: ", round(stepsMeanInterval$steps[maxIdx], 
        digits = 2)), pos = 4, cex = 0.8)
title(main = "The Average Daily Activity Pattern", xlab = "Interval", ylab = "Steps Taken")

plot of chunk stepsMeanForIntervals

Imputing missing values

Strategy

I am going to replace each missing value with the mean number of steps for the same interval. First, I locate the missing values and look up their intervals. Next, I look up the mean for each of those intervals and use it in place of the missing value. Because the previous task already produced the average daily activity pattern, I can use that data.frame (stepsMeanInterval) directly.

Use the complete.cases function to count the incomplete cases, i.e. the rows that contain a missing value.

sum(!complete.cases(df))
## [1] 2304

Replace each missing value with the mean for its 5-minute interval, taken from the result of the previous task (stepsMeanInterval). The resulting dataset is 'df2'.

incomplete <- !complete.cases(df)
df2 <- data.frame(df)
for (i in seq(nrow(df2))) {
    if (incomplete[i]) {
        df2[i, 1] <- stepsMeanInterval$steps[which(stepsMeanInterval$interval == 
            df2[i, 3])]
    }
}
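
The row-by-row loop above can also be written without a loop. The following is a minimal sketch of that idea on toy data, not a replacement for the code above; the names toy, toyMeans, toyImputed and isMissing are hypothetical and exist only for this illustration.

```r
# Toy stand-ins (hypothetical) for df and stepsMeanInterval.
toy <- data.frame(
  steps    = c(NA, 4, NA, 8),
  interval = c(0, 0, 5, 5)
)
toyMeans <- aggregate(steps ~ interval, data = toy, FUN = mean)

# match() gives, for each missing row, the position of its interval in toyMeans,
# so all missing values can be filled in with one vectorized assignment.
toyImputed <- toy
isMissing <- is.na(toyImputed$steps)
toyImputed$steps[isMissing] <-
  toyMeans$steps[match(toyImputed$interval[isMissing], toyMeans$interval)]

toyImputed
```

On a dataset of this size the loop is fast enough, so the vectorized form is mainly a matter of readability.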

Check that df2 has no missing values and that the missing values were replaced correctly.

table(complete.cases(df2))  # no FALSE exists
## 
##  TRUE 
## 17568
head(cbind(df[incomplete, ], imputation = df2[incomplete, 1]))
##   steps       date interval imputation
## 1    NA 2012-10-01        0    1.71698
## 2    NA 2012-10-01        5    0.33962
## 3    NA 2012-10-01       10    0.13208
## 4    NA 2012-10-01       15    0.15094
## 5    NA 2012-10-01       20    0.07547
## 6    NA 2012-10-01       25    2.09434
head(stepsMeanInterval)
##   interval   steps
## 1        0 1.71698
## 2        5 0.33962
## 3       10 0.13208
## 4       15 0.15094
## 5       20 0.07547
## 6       25 2.09434

Similar to the previous task, draw a histogram for the imputed dataset showing the total number of steps taken per day.

stepsSumImputed <- aggregate(. ~ date, df2[, c(1, 2)], FUN = sum)
hist(stepsSumImputed$steps, main = "", xlab = "")
title(main = "Total number of steps taken per day (after imputation)", xlab = "Number of steps")

plot of chunk stepsMeanPerDayNoNA

Calculate and report the mean and median total number of steps taken per day.

stepsSum3 <- aggregate(. ~ date, df2[, c(1, 2)], FUN = sum)
stepsMean3 <- mean(stepsSum3$steps)
stepsMedian3 <- median(stepsSum3$steps)

cat("Mean number of steps taken per day (before imputation): ", stepsMean1, 
    "\n", "Mean number of steps taken per day (after imputation): ", stepsMean3)
## Mean number of steps taken per day (before imputation):  10766 
##  Mean number of steps taken per day (after imputation):  10766
cat("Median number of steps taken per day (before imputation): ", stepsMedian1, 
    "\n", "Median number of steps taken per day (after imputation): ", stepsMedian3)
## Median number of steps taken per day (before imputation):  10765 
##  Median number of steps taken per day (after imputation):  10766

The mean and median values per day are almost the same before and after imputation. Therefore, imputation does not really affect the overall mean and median.

Some personal observations

To see how imputation affects the daily totals, we plot the total number of steps per day before and after imputation. The days that had missing values receive imputed values whose daily total equals the overall mean, i.e. 10766.19 (the blue horizontal line).

plot(stepsSumImputed$date, stepsSumImputed$steps, type = "l", lwd = 3, xlab = "", 
    ylab = "")
lines(stepsSum1$date, stepsSum1$steps, type = "l", col = "red")  # before imputation
abline(h = stepsMean1, col = "blue")  # overall mean of the daily totals before imputation
legend("topright", pch = "_", col = c("black", "red", "blue"), legend = c("After Imputation", 
    "Before Imputation", "Average"))
title(xlab = "Date", ylab = "Steps", main = "Total daily number of steps")

plot of chunk myOwnObservation
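
The flat segments on the blue line follow from simple arithmetic, provided the missing values cover whole days (an assumption suggested by the plot, not checked here): an all-missing day receives the sum of the interval means, and that sum equals the mean of the observed daily totals. Below is a small sketch on toy data; the names toy, intervalMeans, dailyTotals and imputedDayTotal are hypothetical.

```r
# Toy data (hypothetical): 3 days x 2 intervals, with day "d1" entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, 10, 20, 30, 40),
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean per interval and total per day, both computed from observed rows only.
intervalMeans <- aggregate(steps ~ interval, data = toy, FUN = mean)
dailyTotals   <- aggregate(steps ~ date, data = toy, FUN = sum)

# Total assigned to an all-missing day after interval-mean imputation.
imputedDayTotal <- sum(intervalMeans$steps)

# The imputed total coincides with the mean of the observed daily totals.
all.equal(imputedDayTotal, mean(dailyTotals$steps))
```

This also explains why the mean is unchanged by imputation while the median can shift slightly: the added days all sit exactly at the mean.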

Are there differences in activity patterns between weekdays and weekends?

Use the weekdays function to transform the dates to weekday names, and then add one more column that differentiates workdays from weekends. Finally, plot the mean steps against the intervals, with one panel for workdays and one for weekends. We can see that the patterns of steps for workdays and weekends are different.

df2$weekday <- weekdays(df2$date, abbreviate = TRUE)
df2$weekday1 <- ifelse(df2$weekday == "Sun" | df2$weekday == "Sat", "weekend", 
    "workday")
df2$weekday1 <- factor(df2$weekday1)
stepsMeanInterval2 <- aggregate(. ~ interval + weekday1, data = df2[, c(1, 3, 
    5)], FUN = mean)

library(lattice)
xyplot(steps ~ interval | weekday1, data = stepsMeanInterval2, layout = c(1, 
    2), type = "l")

plot of chunk weekdays
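
One caveat about the classification above: weekdays() returns names in the current locale, so the comparison with "Sat" and "Sun" only works in an English locale. A possible locale-independent alternative, sketched here on a few sample dates, uses the numeric weekday stored in POSIXlt; the names dates, wday and dayType are hypothetical.

```r
# Sample dates (hypothetical): 2012-10-06 is a Saturday, 2012-10-07 a Sunday,
# 2012-10-08 a Monday.
dates <- as.POSIXct(c("2012-10-06", "2012-10-07", "2012-10-08"), tz = "UTC")

# POSIXlt stores the weekday as an integer: 0 = Sunday, ..., 6 = Saturday.
wday <- as.POSIXlt(dates)$wday

# Classify without relying on locale-specific day names.
dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "workday"),
                  levels = c("workday", "weekend"))
dayType
```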