The activity data was collected by an anonymous individual wearing a personal activity monitoring device for two months (from 2012-10-01 to 2012-11-30) and the number of steps was recorded in 5-minutes intervals.
First, we make sure that all code chunks are visible:
opts_chunk$set(echo=TRUE)
The data file activity.csv is extracted from the compressed file activity.zip and then loaded into the data frame data.
The interval data are transformed from the orignial format (hhmm, no digits for leading zeros)
into minutes passed since the beginning of the day (00:00).
In this way the intervals are spread evenly over the day
(and gaps in the original data like 50, 55, 100, 105 are transformed into 50, 55, 60, 65).
The structure of the data frame and a summary statistic are displayed.
data <- read.csv(unz("activity.zip", "activity.csv"), colClasses=c("integer", "Date", "integer"))
data$interval <- 60*floor((data$interval+1)/100) + (data$interval %% 100)
str(data)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: num 0 5 10 15 20 25 30 35 40 45 ...
summary(data)
## steps date interval
## Min. : 0.0 Min. :2012-10-01 Min. : 0
## 1st Qu.: 0.0 1st Qu.:2012-10-16 1st Qu.: 359
## Median : 0.0 Median :2012-10-31 Median : 718
## Mean : 37.4 Mean :2012-10-31 Mean : 718
## 3rd Qu.: 12.0 3rd Qu.:2012-11-15 3rd Qu.:1076
## Max. :806.0 Max. :2012-11-30 Max. :1435
## NA's :2304
The total steps per day are summed up using the tapply function, and the mean and median are determined.
total_steps <- tapply(data$steps, data$date, sum, na.rm=T)
step_mean <- mean(total_steps)
step_mean
## [1] 9354
step_median <- median(total_steps)
step_median
## [1] 10395
The total steps per day are displayed as a histogram. The mean value of the total number of steps taken per day (9354.23) is highlighted by a vertical red line, the median (1.0395 × 104) by a vertical blue line. The mean is shifted to the left relative to the median.
hist(total_steps, breaks=11,
xlab="number of steps per day",
main="Histogram of total steps per day")
abline(v=step_mean, col="red", lwd=3)
abline(v=step_median, col="blue", lwd=3)
legend(x="topright", legend=c("mean","median"), col=c("red","blue"), bty="n", lwd=3)
To generate an average daily activity pattern the mean of each 5-minutes interval over all days is determined using the tapply function.
The activity pattern is plotted as a time series.
avg_steps <- tapply(data$steps, data$interval, mean, na.rm=T)
hours <- as.numeric(names(avg_steps))/60
plot(hours, avg_steps, type="l", axes=F,
xlab="time of day (h)", ylab="average number of steps in 5-min interval",
main="Daily activity pattern")
axis(2)
axis(1, at=0:6*4, labels=paste(0:6*4,":00", sep=""))
The maximum number of steps occurs in the 5-minutes interval starting at
max_act_num <- which(avg_steps==max(avg_steps))
max_act_int <- data$interval[max_act_num]
sprintf("%02d:%02d", floor(max_act_int/60), max_act_int %% 60)
## [1] "08:35"
The maximum number of steps occurs in the 104th 5-minutes interval of the day starting at 8:35.
There are many missing values in the data set (see the data summary statistic at the top of this document), to be exact 2304 missing values:
sum(is.na(data))
## [1] 2304
The daily activity pattern can be used to impute these missing values. For every missing value in the orignial data set
the average number of steps in that 5-minutes interval is used and a new data frame impute is created.
This procedure should be valid if the person has daily routine, i.e. an activity pattern that is similar over multiple days.
Instead of missing values this data set now contains a typical value of that 5-minutes interval.
impute <- transform(data, steps=ifelse(is.na(steps), avg_steps, steps))
summary(impute)
## steps date interval
## Min. : 0.0 Min. :2012-10-01 Min. : 0
## 1st Qu.: 0.0 1st Qu.:2012-10-16 1st Qu.: 359
## Median : 0.0 Median :2012-10-31 Median : 718
## Mean : 37.4 Mean :2012-10-31 Mean : 718
## 3rd Qu.: 27.0 3rd Qu.:2012-11-15 3rd Qu.:1076
## Max. :806.0 Max. :2012-11-30 Max. :1435
Now, using the data set with imputed values, the total steps per day are again summed up using the tapply function,
and the mean and median are determined.
total_impsteps <- tapply(impute$steps, impute$date, sum, na.rm=T)
impstep_mean <- mean(total_impsteps)
impstep_mean
## [1] 10766
impstep_median <- median(total_impsteps)
impstep_median
## [1] 10766
The total steps per day are displayed as a histogram. The mean value of the total number of steps taken per day (1.0766 × 104) is highlighted by a vertical red line, the median (1.0766 × 104) by a vertical blue line. The mean and the median overlap, and the peak of days with no recorded steps is gone. Both values have increased compared to the original data set. The increase of the mean, however, is much stronger.
hist(total_impsteps, breaks=11,
xlab="number of steps per day",
sub="(missing values imputed)",
main="Histogram of total steps per day")
abline(v=impstep_mean, col="red", lwd=3)
abline(v=impstep_median, col="blue", lwd=3, lty=2)
legend(x="topright", legend=c("mean","median"), col=c("red","blue"), bty="n", lwd=3)
Due to imputation the total sum of steps in these two month increases from 570608 to 6.5674 × 105.
sum(data$steps, na.rm=TRUE)
## [1] 570608
sum(impute$steps)
## [1] 656738
In order to identify differences between weekdays and weekends a daily acitity pattern is generated for both types of days.
First the data is classified as recorded either on a weekday or on a weekend, and this information is stored in week.
Then the data is aggregated by 5-minutes interval and weekday/-end.
Using the ggplot function a panel plot contrasting the weekday and weekend activity is produced.
week <- factor(weekdays(impute$date) %in% c("Saturday","Sunday"),
labels=c("weekday","weekend"), ordered=FALSE)
impsteps <- aggregate(impute$steps, by=list(interval=impute$interval, weekday=week), mean)
library(ggplot2)
g <- ggplot(impsteps, aes(interval/60, x))
g + geom_line() + facet_grid(weekday ~ .) +
scale_x_continuous(breaks=0:6*4, labels=paste(0:6*4,":00", sep="")) +
theme_bw() +
labs(y="average number of steps in 5-min interval") +
labs(x="time of day (h)") +
labs(title="Daily activity pattern")
Whereas peak and morning activity is highest during weekdays, the overall activity is higher on weekends. In light of the differences between weekdays and weekends it might be a better imputation strategy to use weekend interval averages for imputing weekend activity and weekday interval averages for imputing weekday activity.