Reproducible Research: Peer Assessment 1

Loading and preprocessing the data

Perform these steps to make sure the data is available:

if (! file.exists("activity.zip")) {
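    # mode = "wb" makes sure the zip archive is written as a binary file on every platform
    download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",
                  destfile = "activity.zip", mode = "wb")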
}
if (! file.exists("activity.csv")) {
    unzip("activity.zip", files = "activity.csv")
}

# We use this for the plots:
library(ggplot2)

The activity data is stored in a CSV file with three columns (variables steps, date and interval):

activity <- read.csv("activity.csv")

What is mean total number of steps taken per day?

Before we begin: Thinking about NA values

The activity data is special in that a day’s steps data is either fully available or fully not available (NA). There is no day with both steps data and NA.

We have to remove the days without any steps data before drawing the histogram, or we have to ensure that the total sum of all NA values is NA and not zero.

The reason is that if we simply interpreted days without data as 0 we would make a big mistake: If you don’t know if your patient has a fever (or if you don’t know if there has been any activity) you cannot simply assume that there is no fever (or that there has been no activity). Not knowing is very different from knowing that nothing happened! Further down this page we will attempt to fill the gaps and infer the missing steps data from what we do know.

Here is the number of days without any data:

length(unique(activity[is.na(activity$steps), "date"]))

## [1] 8

That means we should have that many fewer daily step totals.
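The claim that every day is either fully recorded or fully missing can be checked directly. The following is a minimal sketch on a small made-up data frame; the helper name na_share_per_day and the toy values are assumptions for illustration and are not part of the original analysis.

```r
# Hypothetical helper: share of NA step values for each day
na_share_per_day <- function(df) {
  tapply(is.na(df$steps), df$date, mean)
}

# Toy data (made up): day "d1" is fully missing, day "d2" is fully recorded
toy <- data.frame(
  date  = rep(c("d1", "d2"), each = 3),
  steps = c(NA, NA, NA, 10, 0, 25)
)

share <- na_share_per_day(toy)

# TRUE if every day is either complete (share 0) or completely missing (share 1)
all_or_nothing <- all(share %in% c(0, 1))

# Number of days without any data
n_missing_days <- sum(share == 1)
```

Calling the same helper on the activity data frame should return only 0s and 1s if the claim holds for the real data.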

Note that if a day had a mix of NAs and numeric values, simply ignoring the NAs (as with na.rm = TRUE) would discard information. Compare a day with 100 numeric measurements to a day with 99 NAs and only 1 measurement: if you dropped the NAs, which for a sum is the same as interpreting them as 0, you would think the person almost didn’t move at all on the second day, when in truth you just don’t know. The missing measurements might actually have been much larger in total.

So each time we remove NA values we remove information about measurement uncertainty, and if we forget that we did that when we draw our conclusions we may end up with completely unwarranted certainty about our results!

1 - Calculate the total number of steps taken per day

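The formula interface of the R command aggregate omits rows with NA values by default (na.action = na.omit), so days without any data do not appear in the result at all. Otherwise we would have to remove them beforehand: sum() of nothing but NA values returns NA, and with na.rm = TRUE it returns 0 instead of NA, which would turn the missing days into days with zero steps.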

# Sum up all steps for every "date" in the activity data frame
daily_steps_total <- aggregate(steps ~ date, activity, FUN = sum)
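The NA behaviour described above can be seen on a tiny example. This is a sketch with made-up values (the toy data frame is an assumption, not part of the activity data); it only uses base R functions.

```r
# Toy data (made up): the first day has no data at all, the second is complete
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 10, 20, 30)
)

first_day <- toy$steps[toy$date == "2012-10-01"]

# Without na.rm the sum of an all-NA vector stays NA
sum_plain <- sum(first_day)

# With na.rm = TRUE the NAs are dropped and the empty sum is 0
sum_na_rm <- sum(first_day, na.rm = TRUE)

# The formula interface drops NA rows first (na.action = na.omit),
# so the all-NA day does not show up in the result
toy_totals <- aggregate(steps ~ date, toy, FUN = sum)
```

Here toy_totals has a single row for the second day, which is the behaviour we rely on for the daily totals.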

2 - If you do not understand the difference between a histogram and a barplot, research the difference between them. Make a histogram of the total number of steps taken each day

# Calculate the optimal number of bins using nclass.FD (Freedman-Diaconis choice based on the
# inter-quartile range (IQR))
breaks <- pretty(range(daily_steps_total$steps), n = nclass.FD(daily_steps_total$steps), min.n = 1)
bwidth <- breaks[2] - breaks[1]

g <- ggplot(daily_steps_total, aes(x = steps))
g <- g + geom_histogram(colour = "white", aes(fill = ..count..), binwidth = bwidth)
g <- g + labs(x = "Steps")
g <- g + labs(y = "Count")
g <- g + labs(title = "Histogram of the total number of steps per day")
g <- g + guides(fill = guide_legend(title = "Count"))
g
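For readers unfamiliar with the Freedman-Diaconis rule used in the comment above: it derives a bin width from the inter-quartile range and the sample size, and the number of bins follows from the range of the data. The sketch below recomputes this by hand on made-up values (the vector x and the fd_ names are assumptions for illustration) and compares it with nclass.FD. Note that pretty() afterwards rounds the breaks to convenient values, so the final number of bins in the plot can differ.

```r
# Made-up sample values, only for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 30)

# Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
fd_width <- 2 * IQR(x) / length(x)^(1/3)

# Number of bins needed to cover the range of the data
fd_bins_manual <- ceiling(diff(range(x)) / fd_width)

# Built-in equivalent from the grDevices package
fd_bins_builtin <- nclass.FD(x)
```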

3 - Calculate and report the mean and median of the total number of steps taken per day

Mean of the total number of steps taken per day:

mean(daily_steps_total$steps)

## [1] 10766.19

Median of the total number of steps taken per day:

median(daily_steps_total$steps)

## [1] 10765

What is the average daily activity pattern?

Make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):

steps_interval_mean <- aggregate(steps ~ interval, activity, FUN = mean)

# The *interval* vector is discontinuous, because it was constructed by concatenating the
# time in hours and minutes, for example 02:45 became interval 245 and 21:10 became 2110.
# Since the activities were recorded every 5 minutes and begin with 0 the last minute
# number for every hour is 55, and hours are from 0 to 23.
# We cannot simply take this hour-minute combination as continuous number for an x-axis,
# so we will create a continuous sequence running alongside instead:
steps_interval_mean$x <- seq_along(steps_interval_mean$interval)

# We want about 12 ticks/labels on the x axis, not more. We have to calculate the
# positions manually because we plot against the x vector but want the
# interval values as labels
l_steps <- nrow(steps_interval_mean)
divider <- round(l_steps / 12)

g <- ggplot(steps_interval_mean, aes(x = x, y = steps))
g <- g + geom_line()
# x is numeric, so a continuous scale is needed for the custom breaks and labels
g <- g + scale_x_continuous(breaks=steps_interval_mean$x[seq(1, l_steps, divider)],
                            labels=steps_interval_mean$interval[seq(1, l_steps, divider)])
g <- g + theme(axis.text.x = element_text(angle = 90, vjust = .2, hjust = 1))
g <- g + labs(x = "Interval (time of day - HHMM)")
g <- g + labs(y = "Mean Number of Steps")
g
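An alternative to the running index used above is to convert the HHMM interval identifier into minutes since midnight, which is also a continuous axis. This is only a sketch of that idea, not what the plot above uses; the variable names are assumptions.

```r
# Interval identifiers in the format of the data set (HHMM without leading zeros)
interval <- c(0, 55, 100, 245, 835, 2110, 2355)

# The hundreds are the hours, the remainder are the minutes
minutes_since_midnight <- (interval %/% 100) * 60 + interval %% 100
```

With this conversion 02:45 (interval 245) becomes 165 and 21:10 (interval 2110) becomes 1270, so consecutive intervals are always 5 apart.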

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

# Show the location of the maximum in the plot
max_steps <- max(steps_interval_mean$steps)
max_steps_interval <- steps_interval_mean$interval[which(steps_interval_mean$steps == max_steps)]
# We have to use the sequential "x" vector for positioning, not "interval"
max_steps_x <- steps_interval_mean[steps_interval_mean$interval == max_steps_interval, "x"]

g <- ggplot(steps_interval_mean, aes(x = x, y = steps))
g <- g + geom_line()
# x is numeric, so a continuous scale is needed for the custom breaks and labels
g <- g + scale_x_continuous(breaks=steps_interval_mean$x[seq(1, l_steps, divider)],
                            labels=steps_interval_mean$interval[seq(1, l_steps, divider)])
g <- g + theme(axis.text.x = element_text(angle = 90, vjust = .2, hjust = 1))
g <- g + labs(x = "Interval (time of day - HHMM)")
g <- g + labs(y = "Mean Number of Steps")
g <- g + geom_vline(xintercept = max_steps_x, color="blue")
g <- g + annotate("text", x = 65 + max_steps_x, y = max_steps,
                  label = paste("Max. average steps:",
                                format(round(max_steps, 2), nsmall = 2),
                                "\nin interval:",
                                max_steps_interval),
                  color="blue", hjust = 0, vjust = 1)
g

Interval:

max_steps_interval

## [1] 835
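The lookup of the maximum can also be written with which.max(), and the HHMM identifier can be formatted as a clock time for reporting. The sketch below uses made-up averages (toy_means and its values are assumptions for illustration, not the real per-interval means).

```r
# Made-up per-interval averages, only for illustration
toy_means <- data.frame(
  interval = c(825, 830, 835, 840),
  steps    = c(155.4, 177.3, 206.2, 195.9)
)

# which.max() returns the row index of the first maximum
peak_row <- which.max(toy_means$steps)
peak_interval <- toy_means$interval[peak_row]

# Format the HHMM identifier as a clock time
peak_label <- sprintf("%02d:%02d", peak_interval %/% 100, peak_interval %% 100)
```

For interval 835 the label reads 08:35, which is easier to interpret than the raw identifier.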

Imputing missing values

Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

sum(is.na(activity$steps))

## [1] 2304

Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.:

I have chosen to

Create a vector with the average number of steps per interval across all days
Use those values for any interval (regardless of day) that has no value (is NA)

We already calculated those per-interval averages for the time series plot above.

The downside of my approach is that it adds non-integer data to what is naturally integer data (number of steps per interval). The upside is that the imputed data lies exactly on the per-interval average and does not change the mean of the daily totals of our complete activity data set, so “mean activity per day” as an important performance number remains as it was.

Create a new dataset that is equal to the original dataset but with the missing data filled in.

# Create a copy of the original data set
filled <- activity

# Create a vector of indices for all rows that have NA as steps value
which <- which(is.na(activity$steps))

# Replace all NA steps values with the average number of steps for the same interval.
# match() looks up, for every NA row, the position of its interval in the averages
filled[which, "steps"] <- steps_interval_mean$steps[match(activity[which, "interval"],
                                                          steps_interval_mean$interval)]
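The imputation strategy can be tried out on a small example. The following sketch uses a made-up data frame with three days and three intervals (all toy_ names and values are assumptions for illustration) and looks up the per-interval average with match(), which does not depend on the NA rows forming complete days in a particular order.

```r
# Toy data (made up): day "d2" has no data at all
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 3),
  interval = rep(c(0, 5, 10), times = 3),
  steps    = c(1, 2, 3, NA, NA, NA, 3, 6, 9)
)

# Average steps per interval across the days that have data
toy_interval_mean <- aggregate(steps ~ interval, toy, FUN = mean)

# Fill every NA with the average of its own interval
toy_filled <- toy
na_rows <- which(is.na(toy$steps))
toy_filled$steps[na_rows] <-
  toy_interval_mean$steps[match(toy$interval[na_rows], toy_interval_mean$interval)]

# Daily totals before and after imputing
totals_before <- aggregate(steps ~ date, toy, FUN = sum)$steps
totals_after  <- aggregate(steps ~ date, toy_filled, FUN = sum)$steps
```

The imputed day gets a total equal to the mean of the observed daily totals, so mean(totals_after) equals mean(totals_before).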

Make a histogram of the total number of steps taken each day and Calculate and report the mean and median total number of steps taken per day.

We repeat what we did for the very first question above for the imputed data set:

daily_steps_filled_total <- aggregate(steps ~ date, filled, FUN = sum)

# Calculate the optimal number of bins using nclass.FD (Freedman-Diaconis choice based on the
# inter-quartile range (IQR))
breaks_filled <- pretty(range(daily_steps_filled_total$steps),
                        n = nclass.FD(daily_steps_filled_total$steps),
                        min.n = 1)
bwidth_filled <- breaks_filled[2] - breaks_filled[1]

g <- ggplot(daily_steps_filled_total, aes(x = steps))
g <- g + geom_histogram(colour = "white", aes(fill = ..count..), binwidth = bwidth_filled)
g <- g + labs(x = "Steps")
g <- g + labs(y = "Count")
g <- g + labs(title = "Histogram of the total number of steps per day")
g <- g + guides(fill = guide_legend(title = "Count"))
g

Do these values differ from the estimates from the first part of the assignment?

The mean does not differ at all, because we filled all missing days with per-interval mean values, which add up to the mean daily total. The median was very close to the mean of our data to begin with. After imputing, eight days have a total exactly equal to the mean, so the median now coincides with it.

Mean of the total number of steps taken per day:

mean(daily_steps_filled_total$steps)

## [1] 10766.19

Median of the total number of steps taken per day:

median(daily_steps_filled_total$steps)

## [1] 10766.19
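Why the mean stays put while the median moves onto the mean can be illustrated with a few made-up daily totals. This is a sketch under the assumption that every imputed day gets exactly the mean of the observed totals, as is the case for days that are missing completely; the numbers are not from the activity data.

```r
# Made-up observed daily totals: mean 12, median 10
observed_totals <- c(4, 8, 10, 12, 26)

# Every imputed day receives the mean of the observed totals
imputed_totals <- rep(mean(observed_totals), 4)

all_totals <- c(observed_totals, imputed_totals)

# Adding values equal to the mean leaves the mean unchanged
mean_after <- mean(all_totals)

# With several values sitting exactly on the mean, the median lands there too
median_after <- median(all_totals)
```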

What is the impact of imputing missing data on the estimates of the total daily number of steps?

The following plot shows the previous histogram of daily step totals before adjustment for NA overlaid in red on top of the blue histogram of NA-adjusted daily step totals. Only the bin containing the mean (and median) is different.

This can be explained by two things:

We replaced the missing values with the mean values (per interval).
All missing values are entire days (you can use table(activity[which, "date"]) to see this)

Therefore all data we added to replace the NAs created daily total step numbers that lie exactly on the mean of the totals of steps per day that we calculated from the data that we have - see the answer to the first question above.

g <- ggplot(daily_steps_filled_total, aes(x = steps))
g <- g + geom_histogram(colour = "white", fill = "steelblue3", binwidth = bwidth_filled)
g <- g + geom_histogram(data = daily_steps_total, alpha = 0.3, fill = "red", binwidth = bwidth)
g <- g + labs(x = "Steps")
g <- g + labs(y = "Count")
g <- g + labs(title = "Overlaid histograms of the total number of steps per day")
g <- g + guides(fill = "none")
g

Are there differences in activity patterns between weekdays and weekends?

A choice: We will use the original data, not the imputed data. It is more honest: we simply have no data for the missing days, so including them in this analysis could lead to erroneous conclusions. If we want to be as accurate as possible with the result of this question then we have to stick to the data that we actually have.

Using the imputed data, which assumes all days to be equal, would skew the result of our plot below. Since the imputed values are the same for weekdays and weekends, they would remove some of the difference between weekdays and weekends that we are trying to find.

Also, the imputation itself would have to take the weekday/weekend difference into account to be neutral to the result, and that difference is exactly what we are only now trying to find. So this step has to be performed on the original data.

Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

# Make sure we get English names for the weekdays for the string match.
# Note: "English" is a Windows locale name; on Linux or macOS use e.g. "C" instead
Sys.setlocale("LC_TIME", "English")

## [1] "English_United States.1252"

# Create a boolean vector indicating weekends (TRUE) and weekdays (FALSE)
days <- weekdays(as.Date(activity$date)) %in% c('Saturday','Sunday')

# Create a new factor variable in the activity data set
activity$weekday <- factor(days, labels = c("weekday", "weekend"))
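The string match on weekday names depends on the locale. A locale-independent alternative is to use the numeric day of the week from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. The sketch below shows this on four hand-picked dates (2012-10-05 was a Friday); the variable names are assumptions and this is not the method used above.

```r
# Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is numeric (0 = Sunday, ..., 6 = Saturday) and does not depend on the locale
is_weekend <- as.POSIXlt(dates)$wday %in% c(0, 6)

# Fix the level order explicitly so that FALSE maps to "weekday"
day_type <- factor(is_weekend, levels = c(FALSE, TRUE),
                   labels = c("weekday", "weekend"))
```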

Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.

# Creates a data frame with steps grouped by interval and weekday
steps_interval_weekday <- aggregate(steps ~ interval + weekday, activity, FUN = mean)

# The *interval* vector is discontinuous, because it was constructed by concatenating the
# time in hours and minutes, for example 02:45 became interval 245 and 21:10 became 2110.
# Since the activities were recorded every 5 minutes and begin with 0 the last minute
# number for every hour is 55, and hours are from 0 to 23.
# We cannot simply take this hour-minute combination as continuous number for an x-axis,
# so we will create a continuous sequence running alongside instead:
# In addition, we have TWO interval sequences, one for weekdays, one for weekends, so
# we cannot just create one sequence from 0 to the end but have to create two of
# equal length for each part.
vlength <- length(steps_interval_weekday$interval) / 2
steps_interval_weekday$x <- rep(seq_along(steps_interval_weekday$interval[1:vlength]), 2)

# We want about 12 ticks/labels on the x axis, not more. We have to calculate the
# positions manually because we plot against the x vector but want the
# interval values as labels
l_steps <- nrow(steps_interval_weekday) / 2
divider <- round(l_steps / 12)

# Create a plot of interval and steps, with...
g <- ggplot(steps_interval_weekday, aes(x = x, y = steps))
# x is numeric, so a continuous scale is needed for the custom breaks and labels
g <- g + scale_x_continuous(breaks=steps_interval_weekday$x[seq(1, l_steps, divider)],
                            labels=steps_interval_weekday$interval[seq(1, l_steps, divider)])
# ...subplots for each value of "weekday"
g <- g + facet_wrap(~ weekday, ncol = 1)
g <- g + geom_line()
g <- g + theme(axis.text.x=element_text(angle=90,vjust=.2,hjust=1))
g <- g + labs(x = "Interval (time of day - HHMM)")
g <- g + labs(y = "Mean Number of Steps")
g

Compared to weekends, on weekdays more activity takes place in the early morning. On weekends activity is more evenly distributed over the day and starts later, probably because people sleep longer on weekends.