First, the CSV file is read into an R data frame.
activity.data <- read.csv('activity.csv', header = TRUE)
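As a minimal sketch of a stricter way to load the data, the column types can be declared up front so that dates are parsed as `Date` objects instead of text. The temporary file, its three rows, and the names `toy.path` and `toy.data` are hypothetical stand-ins, and the column order is an assumption rather than something checked against the real `activity.csv`.

```r
# Hypothetical three-row file standing in for activity.csv.
toy.path <- tempfile(fileext = ".csv")
writeLines(c('"steps","date","interval"',
             'NA,"2012-10-01",0',
             '12,"2012-10-01",5',
             '7,"2012-10-02",0'),
           toy.path)

# Declare the column types instead of letting read.csv guess them.
toy.data <- read.csv(toy.path, header = TRUE,
                     colClasses = c("integer", "Date", "integer"))
str(toy.data)

# Remove the temporary file once it has been read.
unlink(toy.path)
```

With the dates already stored as `Date`, later calls such as `as.Date()` become no-ops rather than conversions.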
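Then, steps are summed over each day, ignoring missing values. Note that with na.rm = TRUE, a day whose values are all missing gets a total of 0 rather than NA. In addition, steps are averaged across days for each 5-minute interval.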
daily.steps <- rowsum(activity.data$steps, group = activity.data$date,
                      na.rm = TRUE)
av.interval.steps <- aggregate(activity.data$steps,
                               by = list(Interval = activity.data$interval),
                               FUN = mean, na.rm = TRUE)
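To make the effect of `na.rm` on the daily totals concrete, here is a small sketch on a hypothetical two-day data frame (`toy` and its values are invented for illustration). It contrasts `rowsum()`, which keeps a fully missing day with a total of 0, with the formula interface of `aggregate()`, which drops rows with missing values before grouping and therefore omits that day.

```r
# Hypothetical data: day 1 has no recorded steps at all, day 2 has one gap.
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  steps = c(NA, NA, NA, 10, NA, 20)
)

# rowsum() with na.rm = TRUE: the all-missing day is kept, with a total of 0.
totals.rowsum <- rowsum(toy$steps, group = toy$date, na.rm = TRUE)

# Formula interface: rows with NA are omitted first, so day 1 disappears.
totals.formula <- aggregate(steps ~ date, data = toy, FUN = sum)

print(totals.rowsum)
print(totals.formula)
```

Which behaviour is preferable depends on the question being asked. Keeping the zero totals pulls the mean and median down, while dropping those days summarises only the days with recorded data.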
Here is a histogram showing the distribution of the number of steps taken daily in the dataset, and the code necessary to plot it:
hist(daily.steps, breaks = 20, main = "Daily activity (missing values ignored)",
     xlab = "Total number of steps taken per day", col = "deeppink")
We then calculate the mean and median total number of steps taken per day:
my.mean <- mean(daily.steps)
my.median <- median(daily.steps)
The total number of steps taken per day has a mean of 9354.2295082 steps and a median of 10395 steps.
Here is a plot of the number of steps taken in each 5-minute interval, averaged across all days:
plot(av.interval.steps$Interval, av.interval.steps$x, type = 'l', col = "dodgerblue",
     xlab = "5-minute interval (HHMM identifier)",
     ylab = "Average steps taken (missing values ignored)")
Then we identify the 5-minute interval with the maximum average number of steps.
my.interval.idx <- which(av.interval.steps$x == max(av.interval.steps$x))
my.interval <- av.interval.steps$Interval[my.interval.idx]
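On average across all the days in the dataset, the 5-minute interval containing the maximum number of steps is interval 835. The interval identifiers appear to encode clock time as HHMM rather than elapsed minutes (the x axis of the weekday and weekend plots is drawn up to 2500, well past the 1440 minutes in a day), so this most likely corresponds to 08:35.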
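The lookup of the peak interval can also be sketched with `which.max()`, followed by a conversion of the identifier to a clock time. The averages in `toy.av` are invented for illustration, and the conversion assumes that the identifier encodes the time as HHMM, which is an assumption about the dataset rather than something stated in it.

```r
# Hypothetical interval averages, using the same column names as
# av.interval.steps; the values of x are made up.
toy.av <- data.frame(Interval = c(830, 835, 840),
                     x = c(177.3, 206.2, 195.9))

# which.max() returns the position of the first maximum only.
peak.interval <- toy.av$Interval[which.max(toy.av$x)]

# Assumption: the identifier is HHMM, so 835 stands for 08:35.
peak.clock <- sprintf("%02d:%02d",
                      as.integer(peak.interval %/% 100),
                      as.integer(peak.interval %% 100))
print(peak.clock)
```

Unlike `which(x == max(x))`, `which.max()` reports a single index even when several intervals tie for the maximum, so the choice between the two depends on whether ties matter.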
We identify missing values and calculate their total number in the dataset:
my.nas <- which(is.na(activity.data$steps))
my.nas.number <- length(my.nas)
There are 2304 missing data points in the dataset.
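Before choosing an imputation strategy, it can help to see how the missing values are spread over the days. The following sketch uses a hypothetical three-day data frame (`toy` and its values are invented) to count missing values in total and per day.

```r
# Hypothetical data: day 1 is entirely missing, day 3 has a single gap.
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  steps = c(NA, NA, 4, 9, 0, NA)
)

# Total number of missing values; equivalent to length(which(is.na(...))).
na.total <- sum(is.na(toy$steps))

# Number of missing values within each day.
na.per.day <- tapply(is.na(toy$steps), toy$date, sum)
print(na.per.day)
```

A per-day count that equals the number of intervals in a day indicates a day with no recorded data at all, which is handled very differently by summing and by averaging.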
We impute each missing data point with the rounded (integer) mean number of steps over the non-missing data points for the corresponding 5-minute interval, as calculated above. The imputed values are written to a copy of the dataset, leaving the original untouched.
To proceed with imputation, for each missing data point, we look up its interval identifier, retrieve the mean for that interval, round it, and store the result in the copy:
my.imputed.activity.data <- activity.data
for (i in seq_along(my.nas)) {
  # Row index and interval identifier of the current missing data point
  my.cur.idx <- my.nas[i]
  my.cur.interval <- activity.data$interval[my.cur.idx]
  # Look up the mean for that interval, round it, and store it in the copy
  my.av.interval <- which(av.interval.steps$Interval == my.cur.interval)
  my.new.data.pt <- round(av.interval.steps$x[my.av.interval])
  my.imputed.activity.data$steps[my.cur.idx] <- my.new.data.pt
}
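The same imputation can be sketched without an explicit loop by using `match()` to look up each missing row's interval in the table of interval means. The data frame `toy` and its values are hypothetical; the logic is intended to mirror the loop above, not to replace it.

```r
# Hypothetical data: day 1 is entirely missing.
toy <- data.frame(
  date = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps = c(NA, NA, 2, 10, 6, 20)
)

# Mean steps per interval over the non-missing values.
toy.av <- aggregate(toy$steps, by = list(Interval = toy$interval),
                    FUN = mean, na.rm = TRUE)

# Work on a copy and fill every missing row in a single vectorised step.
toy.imputed <- toy
missing <- is.na(toy$steps)
lookup <- match(toy$interval[missing], toy.av$Interval)
toy.imputed$steps[missing] <- round(toy.av$x[lookup])
print(toy.imputed)
```

Because `match()` returns one position per missing row, this form also behaves sensibly when there are no missing values at all: the assignment simply touches zero rows.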
We re-calculate the daily steps using the data with imputed missing values:
imputed.daily.steps <- rowsum(my.imputed.activity.data$steps,
                              group = my.imputed.activity.data$date)
Now, here is a histogram showing the distribution of the number of steps taken daily after imputation, and the code necessary to plot it:
hist(imputed.daily.steps, breaks = 20, main = "Daily activity (imputed missing values)",
     xlab = "Total number of steps taken per day", col = "deeppink")
We then calculate the updated mean and median total number of steps taken per day:
my.mean.imp <- mean(imputed.daily.steps)
my.median.imp <- median(imputed.daily.steps)
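After imputation of missing values, the total number of steps taken per day has a mean of 10765.6393443 steps and a median of 10762 steps.
Both values are higher than the original ones: the mean rises from 9354.2295082 to 10765.6393443 and the median from 10395 to 10762. The likely cause is that missing values contributed nothing to the daily totals in the first analysis, whereas imputation adds the “missing steps” back. The effect is strongest for days with no recorded data at all, which were counted as 0 steps before imputation. The 2304 missing points equal exactly 8 × 288, the number of 5-minute intervals in 8 full days, which is consistent with whole days being missing.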
We then determine the day of the week on which each measurement was taken, and expand the imputed table with a weekstatus factor indicating whether that day was a weekday or a weekend day.
# weekdays() returns locale-dependent names; this assumes an English locale
my.dotw <- weekdays(as.Date(activity.data$date))
my.weekends <- which(my.dotw %in% c("Saturday", "Sunday"))
my.imputed.activity.data$weekstatus <-
  factor(rep("weekday", length(my.dotw)),
         levels = c("weekday", "weekend"), ordered = TRUE)
my.imputed.activity.data$weekstatus[my.weekends] <- "weekend"
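Since `weekdays()` returns day names in the language of the current locale, matching on "Saturday" and "Sunday" only works in an English locale. A locale-independent variant can be sketched with the numeric day of the week from `as.POSIXlt()`. The dates and the names `toy.dates`, `day.number` and `toy.status` are hypothetical.

```r
# Hypothetical dates: a Friday, a Saturday, a Sunday and a Monday.
toy.dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# $wday is numeric and locale-independent: 0 = Sunday, ..., 6 = Saturday.
day.number <- as.POSIXlt(toy.dates)$wday

# Build the two-level factor directly from the numeric day of the week.
toy.status <- factor(ifelse(day.number %in% c(0, 6), "weekend", "weekday"),
                     levels = c("weekday", "weekend"))
print(data.frame(date = toy.dates, status = toy.status))
```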
We first calculate the average number of steps per interval for each type of day:
av.interval.steps.weekStatus <- aggregate(my.imputed.activity.data$steps,
    by = list(Interval = my.imputed.activity.data$interval,
              DayType = my.imputed.activity.data$weekstatus),
    FUN = mean)
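For a side-by-side comparison, the same averages can also be arranged in a wide layout, with one row per interval and one column per type of day. The sketch below uses `tapply()` on a hypothetical data frame (`toy` and its values are invented); it is an alternative arrangement of the result, not the layout used in the rest of this analysis.

```r
# Hypothetical imputed data: two intervals observed on weekdays and weekends.
toy <- data.frame(
  interval = rep(c(0, 5), times = 4),
  weekstatus = factor(rep(c("weekday", "weekday", "weekend", "weekend"),
                          each = 2),
                      levels = c("weekday", "weekend")),
  steps = c(2, 10, 4, 20, 0, 6, 2, 8)
)

# Matrix of means: rows are intervals, columns are day types.
toy.wide <- tapply(toy$steps, list(toy$interval, toy$weekstatus), mean)
print(toy.wide)
```

In this layout each column can be passed directly to `plot()` or `lines()`, which avoids subsetting by index when drawing the two curves.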
Then we plot the time series for weekdays and weekends, using base graphics:
my.we.idx <- which(av.interval.steps.weekStatus$DayType == "weekend")
par(mfrow = c(2,1))
par(cex = 0.6)
par(mar = c(0, 0, 0, 0), oma = c(4, 4, 5, 1))
par(mgp = c(2, 0.6, 0))
plot(av.interval.steps.weekStatus$Interval[my.we.idx],
     av.interval.steps.weekStatus$x[my.we.idx], type = 'l',
     col = "dodgerblue", axes = FALSE)
box(col = "black")
axis(2, at = seq(0,150, 25),las=2)
mtext("Weekend", side = 3, line = -1.5, adj = 0.5, cex = 1)
plot(av.interval.steps.weekStatus$Interval[-my.we.idx],
     av.interval.steps.weekStatus$x[-my.we.idx], type = 'l',
     col = "dodgerblue", axes = FALSE)
box(col = "black")
axis(2, at = seq(0,200, 25),las=2)
mtext("Weekday", side = 3, line = -1.5, adj = 0.5, cex = 1)
axis(1, at = seq(0,2500, 500),las=1)
mtext("5-minute interval (HHMM identifier)", side = 1, outer = TRUE, line = 2.2)
mtext("Average steps taken per time interval", side = 2, outer = TRUE, line = 2.2)
On average, weekday activity shows one pronounced peak, whereas weekend activity is spread more evenly over the day.