This peer-assessed work is part of the evaluation for the ‘Reproducible Research’ MOOC.
This assignment uses data from a personal activity monitoring device, which collects data at 5-minute intervals throughout the day. The dataset covers two months of measurements (October and November 2012) from an anonymous individual and records the number of steps taken in each 5-minute interval of each day.
The questions that this analysis aims to answer are:
A good practice is to check that the input data files are present and to download them if they are not. The results of this step are not shown, for brevity. The date column is also converted from type character to Date, using the format ‘year-month-day’.
options(scipen = 1, digits = 2) # Favor fixed over scientific notation; print 2 significant digits
# Load required libraries
if (!require(lubridate)) {stop('Package lubridate must be installed before proceeding.')}
if (!require(dplyr)) {stop('Package dplyr must be installed before proceeding.')}
if (!require(ggplot2)) {stop('Package ggplot2 must be installed before proceeding.')}
if (!require(gridExtra)) {stop('Package gridExtra must be installed before proceeding.')}
# Download and process data
downloadedFilename <- 'activity.zip'
dataFilename <- 'activity.csv'
if (!file.exists(dataFilename)) {
  dataUrl <- paste0('https://d396qusza40orc.cloudfront.net/repdata/data/', downloadedFilename)
  # mode = 'wb' keeps the zip archive intact on Windows
  download.file(dataUrl, destfile = downloadedFilename, mode = 'wb')
  unzip(downloadedFilename, files = dataFilename)
}
activity <- read.csv(dataFilename, stringsAsFactors = FALSE, na.strings = 'NA',
                     colClasses = c('numeric', 'character', 'numeric'))
# Convert date from character to Date
activity <- activity %>% mutate(date = ymd(date))
For this exercise the data were grouped by date, then a histogram was plotted. Vertical lines were added at the mean and median values:
steps <- activity %>% group_by(date) %>% summarise(totalsteps = sum(steps, na.rm = TRUE))
# Median and mean
meansteps <- mean(steps$totalsteps, na.rm = TRUE)
mediansteps <- median(steps$totalsteps, na.rm = TRUE)
# save plot in hist1
hist1 <- ggplot(aes(x = totalsteps), data = steps) +
  # show.legend replaces the deprecated show_guide argument
  geom_histogram(fill = 'blue', col = 'black', binwidth = 800, show.legend = TRUE) +
  # constant intercepts are passed directly so each line is drawn only once
  geom_vline(xintercept = meansteps, color = 'red') +
  geom_vline(xintercept = mediansteps, color = '#009900') +
  labs(x = 'Total steps measured each day (bin=800 steps)',
       y = 'Number of days with X step count',
       title = 'Frequency of total steps measured each day') +
  annotate('text', label = paste('Mean=', round(meansteps, 2)), x = meansteps - 4000, y = 10, color = 'red') +
  annotate('text', label = paste('Median=', mediansteps), x = mediansteps + 4000, y = 10, color = '#009900')
# plot hist1
plot(hist1)
From the histogram we can conclude:
Finally, the mean and median of the total daily steps are 9354.23 and 10395, respectively.
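One detail of the daily totals is worth illustrating: `sum()` with `na.rm = TRUE` returns 0 for a group whose values are all NA. If any day in the data consists entirely of missing values, it is counted as a zero-step day, which pulls the mean down. The sketch below shows the effect on a small made-up data frame (the `toy` values and the `totals_*` names are hypothetical, not part of the analysis):

```r
# Toy data (hypothetical values): day 1 has no recorded steps, day 2 does.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 100, 200)
)

# sum() with na.rm = TRUE gives 0 for the all-NA day,
# so that day enters the mean as a zero-step day.
totals_with_zero <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Alternative: keep all-NA days as NA so they can be excluded later.
totals_with_na <- tapply(toy$steps, toy$date, function(x) {
  if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
})

mean_with_zero <- mean(as.numeric(totals_with_zero))               # (0 + 300) / 2
mean_without   <- mean(as.numeric(totals_with_na), na.rm = TRUE)   # 300 / 1
```

Which treatment is appropriate depends on whether a day without measurements should count as a day without steps.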
Make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
ts <- activity %>% group_by(interval) %>% summarise(meansteps = mean(steps, na.rm = TRUE))
ggplot(data = ts, aes(x = interval, y = meansteps)) +
  geom_line(color = 'blue', lwd = 1) +
  geom_smooth(lty = 2, method = 'lm') +
  labs(x = '5-min interval throughout all days',
       y = 'Average steps measured',
       title = 'Average steps for all 5-min intervals throughout all days')
# Obtain interval with maximum avg of steps
maxinterval <- ts[which(ts$meansteps == max(ts$meansteps)), ]
From the graph we conclude that measured physical activity starts at around interval 530, which corresponds to roughly 5:30am, and peaks at interval 835 (8:35am) with an average of 206.17 steps. One hypothesis is that this period of greater activity reflects the individual's commute to work.
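Reading interval identifiers as clock times can be made explicit with a small helper. This is a minimal sketch that assumes the identifier encodes hour * 100 + minute (so 835 means 8:35); the function name `interval_to_clock` is hypothetical:

```r
# Convert an interval identifier such as 835 into a clock string "08:35".
# Assumption: the identifier encodes hour * 100 + minute.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock_labels <- interval_to_clock(c(0, 530, 835, 2355))
print(clock_labels)
```

Under the same assumption, the identifiers are not evenly spaced on a numeric axis (55 is followed by 100), so a time-based axis may give a more faithful plot.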
The only variable with NAs is steps. The other variables, which identify when each measurement was taken, are all complete. There are:
sum(is.na(activity$steps))
## [1] 2304
rows with NA in variable steps.
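The claim that only steps contains missing values can be checked column by column. The sketch below uses a small made-up data frame with the same three columns (the `toy` values are hypothetical) rather than the real data:

```r
# Toy data frame (hypothetical values) with the same columns as 'activity'.
toy <- data.frame(
  steps    = c(NA, 0, 12, NA),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Number of NAs in every column at once.
na_per_column <- colSums(is.na(toy))

# Share of rows with a missing step count.
na_share <- mean(is.na(toy$steps))

print(na_per_column)
print(na_share)
```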
Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
Create a new dataset that is equal to the original dataset but with the missing data filled in.
The strategy is to calculate the mean for each interval for all days, and replace the NA with such mean for that particular interval.
# clone original data frame activity and store it in 'imputed' variable
imputed <- activity
# add column 'meansteps' from the ts data frame created in the code chunk for the previous question
imputed <- inner_join(x = imputed, y = ts, by = 'interval')
# assign meansteps value to NA values in 'steps' column
imputed$steps <- ifelse(is.na(imputed$steps), imputed$meansteps, imputed$steps)
# verify there are no NAs left
sum(is.na(imputed$steps))
## [1] 0
# print head of 'imputed' to show there are no NAs left
head(imputed)
##   steps       date interval meansteps
## 1 1.717 2012-10-01        0     1.717
## 2 0.340 2012-10-01        5     0.340
## 3 0.132 2012-10-01       10     0.132
## 4 0.151 2012-10-01       15     0.151
## 5 0.075 2012-10-01       20     0.075
## 6 2.094 2012-10-01       25     2.094
The strategy was implemented using the data frame ‘ts’ from the previous question, since it already holds the mean of each interval across all days. It is then just a matter of joining the copy of the original data frame with the data frame of interval means, and replacing each NA in ‘steps’ with the value in column ‘meansteps’.
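The same interval-mean strategy can also be sketched without a join, using base R's `ave()` to compute a per-group mean aligned with the original rows. This is an alternative illustration on made-up data (the `toy` values and the `steps_filled` column are hypothetical), not the code used in this report:

```r
# Toy data (hypothetical values): two intervals observed on three days.
toy <- data.frame(
  steps    = c(NA, 4, 10, NA, 20, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of the non-missing steps within each interval, repeated per row.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Replace only the missing values; observed values stay untouched.
toy$steps_filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)

print(toy)
```

Because `ave()` returns a vector in the original row order, no extra column has to be carried around or dropped afterwards.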
# Summarize IMPUTED data
imputedsteps <- imputed %>% group_by(date) %>% summarise(totalsteps = sum(steps, na.rm = TRUE))
# Median and mean of IMPUTED data
meanimputedsteps <- mean(imputedsteps$totalsteps, na.rm = TRUE)
medianimputedsteps <- median(imputedsteps$totalsteps, na.rm = TRUE)
# Build plot
hist2 <- ggplot(aes(x = totalsteps), data = imputedsteps) +
  # show.legend replaces the deprecated show_guide argument
  geom_histogram(fill = 'blue', col = 'black', binwidth = 800, show.legend = TRUE) +
  geom_vline(xintercept = meanimputedsteps, color = 'red') +
  geom_vline(xintercept = medianimputedsteps, color = '#009900') +
  labs(x = 'Total steps measured each day (bin=800 steps)',
       y = 'Number of days with X step count',
       title = 'Frequency of total steps measured each day') +
  annotate('text', label = paste('Mean=', round(meanimputedsteps, 2)), x = meanimputedsteps - 7000, y = 10, color = 'red') +
  annotate('text', label = paste('Median=', round(medianimputedsteps, 2)), x = medianimputedsteps + 7000, y = 10, color = '#009900')
# Plot (gridExtra >= 2.0.0 uses 'top' instead of 'main' for the title)
grid.arrange(hist1, hist2, ncol = 2, top = 'Comparison of histograms for original data (left) and imputed data (right)')
These are the interesting observations of both imputed and original data and their plots, given the imputing strategy devised:
# Add column weekday to original data, with weekend=0 and weekday=1
activity$weekday <- ifelse(wday(activity$date) %in% c(1, 7), 0, 1)
# group by weekday flag and interval, then summarise with mean
differences <- summarise(group_by(activity, weekday, interval), meansteps = mean(steps, na.rm = TRUE))
head(filter(activity, weekday==0))
##   steps       date interval weekday
## 1     0 2012-10-06        0       0
## 2     0 2012-10-06        5       0
## 3     0 2012-10-06       10       0
## 4     0 2012-10-06       15       0
## 5     0 2012-10-06       20       0
## 6     0 2012-10-06       25       0
ggplot(differences, aes(x = interval, y = meansteps, group = weekday, color = factor(weekday))) +
  geom_line() +
  geom_smooth(method = "lm", se = FALSE, lty = 2, lwd = 1) +
  scale_color_discrete(name = "Weekday",
                       labels = c("Weekend", "Weekday")) +
  facet_wrap(~weekday, nrow = 1, ncol = 2)
This comparison was constructed from the original data (not the imputed data) by averaging the steps in each interval separately for weekdays and for weekends, on the assumption that a person's activity through the day follows a different pattern on each.
From the regression lines in the chart, activity stays almost flat during weekdays, whereas on weekends it increases substantially, even though the maximum number of steps is still taken at around 8:30 on weekdays. This could be explained by a sedentary workplace routine during weekdays and increased outdoor activity during weekends, but a deeper analysis and further data gathering (such as consumption patterns and geographical data) would be needed before drawing this conclusion.
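The weekday/weekend split behind this comparison relies on lubridate's `wday()`, whose default numbering treats 1 as Sunday and 7 as Saturday. A minimal base R sketch of the same classification, assuming ISO weekday numbers from `format(date, "%u")` (Monday = 1, Sunday = 7), looks like this; the `dates` vector and the `day_type` name are hypothetical:

```r
# A Friday, Saturday, Sunday and Monday from October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# "%u" yields the ISO weekday number: 1 = Monday, ..., 7 = Sunday.
iso_day <- as.integer(format(dates, "%u"))

# Days 6 and 7 are the weekend; everything else is a weekday.
day_type <- factor(ifelse(iso_day >= 6, "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

print(data.frame(date = dates, iso_day = iso_day, day_type = day_type))
```

A labelled factor like this avoids having to remember which of 0 and 1 stands for the weekend when reading the plot legend.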
This concludes the small analysis for the 1st Peer-assessed course project for the ‘Reproducible Research’ MOOC.