echo = TRUE # The code is displayed
activity <- read.csv(unz("activity.zip", "activity.csv"))
sapply(activity, class)
## steps date interval
## "integer" "factor" "integer"
activity$date is converted from factor to Date class. I used the “zoo” package to manage the time series, which is regularly spaced. The data were collected over 61 days.
activity$date <- as.Date(activity$date, format = "%Y-%m-%d")
suppressWarnings(library(zoo))
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
is.regular(activity$date)
unique(activity$date)
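As a cross-check that does not depend on zoo, the regular spacing of the daily index can be sketched in base R: the gaps between consecutive unique dates should all be identical. The vector `dates` below is made-up toy data for illustration, not the activity data.

```r
# Toy dates (hypothetical, not the activity data): three days, two readings per day
dates <- as.Date(c("2012-10-01", "2012-10-01",
                   "2012-10-02", "2012-10-02",
                   "2012-10-03", "2012-10-03"))

days <- unique(dates)            # one entry per calendar day
gaps <- as.numeric(diff(days))   # spacing between consecutive days, in days

# The daily index is regularly spaced when every gap is identical
regular <- length(unique(gaps)) == 1
length(days)
regular
```

Applied to activity$date, the same two steps would give the number of distinct days and confirm that no day is skipped.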
The result is calculated and stored in the object “steps_day”. Missing values are removed. Since the resulting table is quite large, I set {r, results = “hide”}. Approximately 13% of the values in “steps” are missing.
colMeans(is.na(activity))
steps_day <- aggregate(steps ~ date, na.rm = TRUE, data = activity, FUN = sum)
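To make the handling of missing values concrete, here is a minimal sketch on made-up data (the data frame `toy` is hypothetical). It relies on the documented default of the formula method of aggregate(), `na.action = na.omit`, which drops incomplete rows before FUN is applied.

```r
# Toy data (hypothetical): one missing reading on day 1, day 3 entirely missing
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03")),
  steps = c(10L, NA, 5L, 7L, NA)
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))
na_share

# Rows with NA are dropped before summing, so a day without
# any valid reading does not appear in the result at all
totals <- aggregate(steps ~ date, data = toy, FUN = sum)
totals
```

In this toy case the third day is absent from `totals` rather than reported as zero or NA, which matters when reading a per-day plot.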
Missing values are not filled in for this plot.
plot(steps_day, type = "h", lwd = 10, lend = "square")
Since these calculations return two large tables, the results are not shown.
aggregate(steps ~ date, data = activity, FUN = mean)
aggregate(steps ~ date, data = activity, FUN = median)
plot(aggregate(steps ~ interval, data = activity, FUN = mean), type = "l")
The base function max() should rapidly give us the answer.
max(activity$steps, na.rm = TRUE)
## [1] 806
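Note that max() on the raw column returns the largest single 5-minute count. If the quantity of interest is instead the interval with the highest average across days (an assumption about the question being asked), a sketch along these lines could locate it. The data frame `toy` is made up for illustration.

```r
# Toy data (hypothetical): two days, three intervals per day
toy <- data.frame(
  interval = rep(c(0L, 5L, 10L), times = 2),
  steps    = c(0L, 40L, 10L,
               2L, 60L, 70L)
)

# Largest single observation, which is what max() reports
largest <- max(toy$steps, na.rm = TRUE)

# Mean steps per interval, then the row holding the largest mean
by_interval <- aggregate(steps ~ interval, data = toy, FUN = mean)
peak <- by_interval[which.max(by_interval$steps), ]
peak
```

In the toy data the largest single count falls in interval 10, while the highest average belongs to interval 5, so the two questions can have different answers.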
Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.
sum(is.na(activity))
## [1] 2304
I substitute each NA with a fixed value, equal to the overall mean of the variable activity$steps.
activity2 <- activity
sapply(activity2, class)
## steps date interval
## "integer" "Date" "integer"
activity2$steps[is.na(activity2$steps)] <- mean(na.omit(activity$steps))
activity2$date <- as.Date(activity2$date, format = "%Y-%m-%d")
steps_day2 <- aggregate(steps ~ date, na.rm = TRUE, data = activity2, FUN = sum)
par(mfrow = c(1, 2))
plot(steps_day, type = "h", lwd = 5, lend = "square", main = "With NAs")
abline(h = seq(0, 20000, 2500), lty = "dashed")
plot(steps_day2, type = "h", lwd = 5, lend = "square", main = "NAs filled")
abline(h = seq(0, 20000, 2500), lty = "dashed")
dev.off()
## null device
## 1
Filling the NAs makes the distribution more homogeneous. However, this operation could hide interesting patterns, such as inactivity on particular days of the week.
aggregate(steps ~ date, data = activity, FUN = mean)
aggregate(steps ~ date, data = activity, FUN = median)
aggregate(steps ~ date, data = activity2, FUN = mean)
aggregate(steps ~ date, data = activity2, FUN = median)
The results, which I do not report because of their length, suggest that the strategy adopted to fill the missing values may not be adequate. Indeed, new bias patterns clearly appear in the calculations performed on the activity2 dataset.
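One alternative that could be tried, shown here only as a sketch and not as the method used in this report, is to fill each NA with the mean of its own 5-minute interval rather than with a single overall mean. The data frame `toy` is made up, and whether this reduces the bias on the real data has not been checked here.

```r
# Toy data (hypothetical): two intervals observed over three days
toy <- data.frame(
  interval = rep(c(0L, 5L), times = 3),
  steps    = c(0, 100, NA, 120, 2, NA)
)

# Mean of the observed values within each interval, repeated for every row
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing entries; observed values are left untouched
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- interval_mean[missing]
filled
```

Here the missing reading in the quiet interval is filled with 1 and the one in the busy interval with 110, so the fill follows the daily activity profile instead of flattening it.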
The weekdays() function may be of some help for this part. Use the dataset with the filled-in missing values.
activity2$weekday <- factor(format(activity2$date, "%A"))
levels(activity2$weekday) <- list(
  weekday = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
  weekend = c("Saturday", "Sunday"))
par(mfrow = c(2, 1))
with(activity2[activity2$weekday == "weekend",], plot(aggregate(steps ~ interval, FUN = mean), type = "l", main = "Weekends"))
with(activity2[activity2$weekday == "weekday",], plot(aggregate(steps ~ interval, FUN = mean), type = "l", main = "Weekdays"))
dev.off()
## null device
## 1
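Because format(date, "%A") returns day names in the session locale, matching on English names assumes an English locale. A locale-independent variant can be sketched with the numeric day of week stored in POSIXlt. The vector `dates` is toy data, and `day_type` is a name chosen here for illustration.

```r
# Toy dates (hypothetical): Friday 5 October 2012 through Monday 8 October 2012
dates <- as.Date("2012-10-05") + 0:3

# POSIXlt stores the day of week as 0 (Sunday) to 6 (Saturday),
# regardless of the language of the session
wday <- as.POSIXlt(dates)$wday

day_type <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))
data.frame(date = dates, day_type)
```

The resulting two-level factor can be used in the same way as activity2$weekday when subsetting for the two panels.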