This is my R Markdown file for the Reproducible Research project.
First, I read the dataset “activity.csv”, convert the “date” column to the Date class, and convert the “interval” column to numeric for later use.
activity <- read.csv("activity.csv")
activity$date <- as.Date(as.character(activity$date))
activity$interval <- as.numeric(activity$interval)
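As a possible alternative to converting the columns after reading, the classes can be declared while reading. This is only a sketch: the inline `csv_text` is made-up sample data in the same three-column layout, and `toy` is a hypothetical name, not part of the analysis.

```r
# Made-up sample in the same layout as activity.csv (steps, date, interval).
csv_text <- 'steps,date,interval
NA,"2012-10-01",0
7,"2012-10-02",5'

# colClasses sets the class of each column at read time, so no later
# as.Date()/as.numeric() conversion is needed.
toy <- read.csv(text = csv_text,
                colClasses = c("integer", "Date", "numeric"))
str(toy)
```

With the real file, `text = csv_text` would be replaced by the file name.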
Second, I calculate the total number of steps taken each day with “tapply” and draw a histogram of the totals.
steps_per_day <- with(activity, tapply(steps, date, sum))
hist(steps_per_day, col = "red")
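To illustrate how the per-day totals behave when a day has missing values, here is a small sketch on made-up data (the `toy` data frame and its values are assumptions for illustration only).

```r
# Toy data (made up): two days, three intervals each; the first day is all NA.
toy <- data.frame(
  date     = as.Date(rep(c("2012-10-01", "2012-10-02"), each = 3)),
  interval = rep(c(0, 5, 10), times = 2),
  steps    = c(NA, NA, NA, 4, 0, 6)
)

# tapply splits `steps` by `date` and applies sum() to each group.
toy_totals <- with(toy, tapply(steps, date, sum))
print(toy_totals)

# A day containing NA sums to NA, so it drops out of the histogram and out of
# mean()/median() once na.rm = TRUE is applied to the totals.
toy_mean <- mean(toy_totals, na.rm = TRUE)
```

Because `sum` is called without `na.rm = TRUE`, a day with no recorded steps stays `NA` instead of being counted as a zero-step day.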
Third, I calculate the mean and median of the total steps per day.
steps_mean <- mean(steps_per_day, na.rm = TRUE)
steps_mean
## [1] 10766.19
steps_median <- median(steps_per_day, na.rm = TRUE)
steps_median
## [1] 10765
The mean and median of the total number of steps taken per day are 10766.19 and 10765.
Fourth, I calculate the average number of steps in each 5-minute interval across all days and draw a time-series plot.
pattern <- with(activity, tapply(steps, interval, mean, na.rm = TRUE))
plot(names(pattern), pattern, type = "l", xlab = "5-minute interval", ylab = "average number of steps")
title("average daily activity pattern")
Fifth, I find the 5-minute interval with the highest average number of steps.
max_steps <- max(pattern, na.rm = TRUE)
max_steps
## [1] 206.1698
max_interval <- names(which.max(pattern))
max_interval
## [1] "835"
The 5-minute interval with the highest average is 835, with an average of 206.1698113 steps.
Sixth, I fill each missing value with the average of the daily means of the following two days. For 2012-11-09 and 2012-11-30 I use the two preceding days instead, because their following days are missing or have no values. The result is a new dataset with the missing values filled in.
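The interval label can be easier to read as a clock time. The sketch below assumes the labels encode hours and minutes as HHMM without leading zeros (so 835 would mean 08:35); `interval_to_clock` is a hypothetical helper, not part of the analysis.

```r
# Hypothetical helper: turn an interval label such as "835" into "08:35",
# assuming the label is hours * 100 + minutes.
interval_to_clock <- function(label) {
  value <- as.integer(label)
  sprintf("%02d:%02d", value %/% 100L, value %% 100L)
}

interval_to_clock("835")
interval_to_clock(c("0", "55", "2355"))
```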
num_miss <- sum(is.na(activity$steps))
steps_per_day_mean <- with(activity, tapply(steps, date, mean))
olddata <- activity
activity_num <- as.numeric(nrow(activity))
for (i in 1:activity_num) {
  if (is.na(activity[i, "steps"])) {
    day <- activity[i, "date"]
    # These two dates lack usable following days, so look back instead of forward
    offsets <- if (as.character(day) %in% c("2012-11-09", "2012-11-30")) c(-1, -2) else c(1, 2)
    # Average of the daily means of the two chosen neighbouring days
    activity[i, "steps"] <- mean(steps_per_day_mean[as.character(day + offsets)])
  }
}
newdata <- activity
The total number of missing values is 2304.
Seventh, I draw a histogram of the total steps per day using the new dataset and compare it with the unfilled one.
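The neighbour-based fill could also be written without hard-coding the exceptional dates. This is only a sketch of that idea, not the method used above: `neighbour_fill` is a hypothetical helper and the values in `day_mean` are made up.

```r
# Daily means keyed by date; NA marks a day with no data (values made up).
day_mean <- c("2012-11-07" = 13, "2012-11-08" = 11, "2012-11-09" = NA,
              "2012-11-10" = NA, "2012-11-11" = 44, "2012-11-12" = 37)

# Hypothetical helper: average of the next two days, falling back to the
# previous two days when either following day is missing or out of range.
neighbour_fill <- function(day, means) {
  key <- function(offset) as.character(as.Date(day) + offset)
  after  <- means[c(key(1), key(2))]
  before <- means[c(key(-1), key(-2))]
  if (!anyNA(after)) mean(after) else mean(before)
}

neighbour_fill("2012-11-10", day_mean)  # both following days are available
neighbour_fill("2012-11-09", day_mean)  # 2012-11-10 is NA, so it looks back
```

A date outside the range of `means` indexes to `NA`, which is why the last day of a dataset would fall back to the preceding days automatically.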
steps_per_day_new <- with(newdata, tapply(steps, date, sum))
hist(steps_per_day_new, col = "blue")
steps_mean <- mean(steps_per_day, na.rm = TRUE)
steps_mean
## [1] 10766.19
steps_mean_new <- mean(steps_per_day_new, na.rm = TRUE)
steps_mean_new
steps_median <- median(steps_per_day, na.rm = TRUE)
steps_median
## [1] 10765
steps_median_new <- median(steps_per_day_new, na.rm = TRUE)
steps_median_new
The mean and median of the total number of steps taken per day in the unfilled data are 10766.19 and 10765. The filled-data values must be computed from steps_per_day_new. The figures previously reported for the filled data (10766.19 and 10765) were computed from steps_per_day, so they matched the unfilled values by construction and do not show that imputing has no impact. Because each filled day receives the average of two neighbouring days rather than the overall daily mean, the filled-data mean and median may differ from the unfilled ones.
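To show how such a comparison can be laid out, and how a neighbour-based fill can move both statistics, here is a sketch on made-up daily totals. The vectors `before` and `after` are assumptions for illustration and are not taken from the dataset.

```r
# Made-up daily totals: one day missing, then the same days with the gap
# filled by the average of the two following days, (30 + 50) / 2 = 40.
before <- c(10, NA, 30, 50)
after  <- c(10, 40, 30, 50)

# Side-by-side summary of the unfilled and filled totals.
comparison <- data.frame(
  dataset = c("unfilled", "filled"),
  mean    = c(mean(before, na.rm = TRUE), mean(after)),
  median  = c(median(before, na.rm = TRUE), median(after))
)
print(comparison)
```

In this toy case the mean moves from 30 to 32.5 and the median from 30 to 35, because the filled value is not equal to the mean of the observed days.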
Eighth, I split the dataset into weekends and weekdays and draw a panel plot comparing the two time series.
Sys.setlocale(category = "LC_ALL", locale = "english")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
# Label each row by day type; weekdays() is vectorized, so no loop is needed
activity$day <- ifelse(weekdays(activity$date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
weekend_activity <- subset(activity, day == "weekend")
weekday_activity <- subset(activity, day == "weekday")
par(mfrow = c(2,1))
weekend_pattern <- with(weekend_activity, tapply(steps, interval, mean, na.rm = TRUE))
plot(names(weekend_pattern), weekend_pattern, type = "l", xlab = "5-minute interval", ylab = "average number of weekend steps")
title("weekend")
weekday_pattern <- with(weekday_activity, tapply(steps, interval, mean, na.rm = TRUE))
plot(names(weekday_pattern), weekday_pattern, type = "l", xlab = "5-minute interval", ylab = "average number of weekday steps")
title("weekday")
There is a difference between weekdays and weekends. On weekdays, activity is concentrated around 8 a.m., the time of the commute to work, and is lower for the rest of the day. On weekends, activity is spread more evenly throughout the day.
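The weekday/weekend split above relies on English day names, which is why the locale is set first. A locale-independent alternative, sketched here on a few sample dates (the `dates` vector and `day_type` name are assumptions for illustration), uses the numeric weekday from `as.POSIXlt`.

```r
# Sample dates: Friday, Saturday, Sunday, Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday counts days from Sunday (0) to Saturday (6), so it does
# not depend on the language of the day names.
wday <- as.POSIXlt(dates)$wday
day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
print(data.frame(dates, day_type))
```

With this approach the `Sys.setlocale` call would no longer be needed for the classification step.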