Reproducible Research: Peer Assessment 1

This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The dataset covers two months (October and November 2012) for one anonymous individual and records the number of steps taken in each 5-minute interval of each day.

Loading and preprocessing the data

  1. Set the environment.
  2. The data file is downloaded if it is not present.
  3. It is uncompressed and read as a data frame (i.e. read.csv()).
  4. The steps column is read as numeric.
  5. The date column is read as Date.
  6. The interval column is read as numeric.

# Libraries for plotting (ggplot2) and transforming data (plyr).
library(ggplot2)
library(plyr)

# Read the data, setting the type of each column (steps, date, interval)
data <- read.csv("activity.csv", colClasses = c("numeric", "Date", "numeric"))
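
The download and uncompress steps listed above do not appear in the code shown. A minimal sketch of how they might look is given here; the helper name ensure_activity_csv, the archive name activity.zip, and the url argument are all assumptions, and the real download URL is not stated in this report.

```r
# Hypothetical helper (not part of the original analysis): make sure the CSV
# exists locally, downloading and unzipping the archive only when needed.
ensure_activity_csv <- function(csv = "activity.csv", zip = "activity.zip", url = NULL) {
    if (file.exists(csv)) {
        return(csv)  # already uncompressed, nothing to do
    }
    if (!file.exists(zip)) {
        if (is.null(url)) {
            stop("No local copy of the data and no download URL supplied")
        }
        # mode = "wb" keeps the binary archive intact on Windows
        download.file(url, destfile = zip, mode = "wb")
    }
    # Extract next to the expected CSV location
    unzip(zip, exdir = dirname(csv))
    csv
}
```

The returned path could then be passed to read.csv() in place of the literal file name.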

What is mean total number of steps taken per day?

The histogram below shows the total number of steps taken each day, plotted with a bin width of 1500 steps.

# Total steps per day; na.pass keeps days with missing values (their total is NA)
byDay <- aggregate(steps ~ date, data, sum, na.action = na.pass)
# Label this dataset so it can be told apart from the imputed one later
byDay <- cbind(byDay, label = rep("with.na", nrow(byDay)))
ggplot(byDay, aes(x = steps)) +
    geom_histogram(binwidth = 1500, colour = "black", fill = "white") +
    labs(title = "Steps Taken per Day", x = "Number of Steps", y = "Frequency")

[Figure: histogram of total steps taken per day (chunk steps_per_day)]

We then compute the mean and the median of the total number of steps taken per day.
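
The computation of these two summaries is not shown in the report. A minimal sketch, run here on a made-up toy data frame with the same columns as activity.csv (the names toy, toyByDay, mean_steps and median_steps are illustrative only):

```r
# Toy data with the same columns as activity.csv (values are made up).
toy <- data.frame(
    steps    = c(10, 20, NA, NA, 5, 15),
    date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
    interval = rep(c(0, 5), times = 3)
)

# na.pass keeps days that contain missing values; their daily total becomes NA.
toyByDay <- aggregate(steps ~ date, toy, sum, na.action = na.pass)

# Days with an NA total are excluded from the summaries via na.rm = TRUE.
mean_steps   <- mean(toyByDay$steps, na.rm = TRUE)
median_steps <- median(toyByDay$steps, na.rm = TRUE)
```

Applied to the real data, the same two calls on byDay$steps would give the reported mean and median.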

What is the average daily activity pattern?

The plot below shows the number of steps in each 5-minute interval, averaged across all days, against the interval identifier.

# Mean steps per interval across all days (rows with missing steps are dropped)
byInterval <- aggregate(steps ~ interval, data, mean, na.rm = TRUE)
ggplot(byInterval, aes(x = interval, y = steps)) +
    geom_line() +
    labs(title = "Average Number of Steps per Interval", x = "Interval", y = "Number of steps")

[Figure: line plot of average steps per interval (chunk steps_per_interval)]

The 5-minute interval with the highest average number of steps is interval 835.
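
The lookup behind this result is not shown. A minimal sketch on made-up toy data follows; the names toy, toyByInterval, peak and peak_clock are illustrative, and the conversion to a clock time assumes that interval identifiers encode the time of day as HHMM, which this report does not state.

```r
# Toy data: two days, four intervals each (values are made up).
toy <- data.frame(
    steps    = c(0, 2, 150, 200, 0, 4, 170, 210),
    interval = rep(c(0, 5, 1200, 1205), times = 2)
)

# Mean steps per interval across days.
toyByInterval <- aggregate(steps ~ interval, toy, mean)

# which.max() returns the row index of the largest mean.
peak <- toyByInterval$interval[which.max(toyByInterval$steps)]

# Assumption: identifiers are HHMM, so 1205 would read as 12:05.
peak_clock <- sprintf("%02d:%02d", as.integer(peak %/% 100), as.integer(peak %% 100))
```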

Imputing missing values

Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data.

The total number of missing values in the dataset is: 2304
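
A count like this can be obtained with is.na(). The sketch below uses a made-up toy data frame; the variable names are illustrative only.

```r
# Toy data with the same columns as activity.csv (values are made up).
toy <- data.frame(
    steps    = c(10, NA, 20, NA, NA, 5),
    date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
    interval = rep(c(0, 5), times = 3)
)

# Number of rows with a missing step count.
n_missing <- sum(is.na(toy$steps))

# Missing values per column, to confirm that only steps is affected.
missing_by_column <- colSums(is.na(toy))

# Number of distinct days that contain at least one missing value.
n_incomplete_days <- length(unique(toy$date[is.na(toy$steps)]))
```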

To fill in the missing values, we replace each one with the mean for the same interval across all days, rounded to the nearest whole number.

# For each row with a missing step count, substitute the rounded interval mean
data.impute <- adply(data, 1, function(x) {
    if (is.na(x$steps)) {
        x$steps <- round(byInterval$steps[byInterval$interval == x$interval])
    }
    x
})
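
Row-by-row processing with adply can be slow on larger tables. As an alternative (not the method used in this report), the same substitution can be sketched in vectorized base R with match(); the toy data and the names toy, means and toy.impute are illustrative only.

```r
# Toy data: three days, two intervals each (values are made up).
toy <- data.frame(
    steps    = c(10, NA, 20, 31, NA, 5),
    interval = rep(c(0, 5), times = 3)
)

# Interval means; the formula interface drops rows with missing steps by default.
means <- aggregate(steps ~ interval, toy, mean)

# Look up the mean for each missing row's interval and fill it in, rounded.
missing <- is.na(toy$steps)
toy.impute <- toy
toy.impute$steps[missing] <- round(
    means$steps[match(toy$interval[missing], means$interval)]
)
```

Here the mean is 15 for interval 0 and 18 for interval 5, so the two missing rows receive 18 and 15 respectively.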

This gives the following histogram of the total number of steps taken each day, again plotted with a bin width of 1500 steps.

# Total steps per day after imputation
byDay.impute <- aggregate(steps ~ date, data.impute, sum)
# Label this dataset so it can be told apart from the original one
byDay.impute <- cbind(byDay.impute, label = rep("without.na", nrow(byDay.impute)))
ggplot(byDay.impute, aes(x = steps)) +
    geom_histogram(binwidth = 1500, colour = "black", fill = "white") +
    labs(title = "Steps Taken per Day", x = "Number of Steps", y = "Frequency")

[Figure: histogram of total steps taken per day after imputation (chunk complete_steps_per_day)]

After imputation, the mean and the median of the daily totals shift slightly.
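
One way to set the two pairs of summaries side by side is a small comparison table. The sketch below uses made-up daily totals, not the values from this report; before, after and comparison are illustrative names.

```r
# Toy daily totals (made up): one day is missing before imputation.
before <- c(30, NA, 20)
after  <- c(30, 26, 20)  # same days after a hypothetical imputation

# One row per dataset, using the same labels as the histograms.
comparison <- data.frame(
    dataset = c("with.na", "without.na"),
    mean    = c(mean(before, na.rm = TRUE), mean(after)),
    median  = c(median(before, na.rm = TRUE), median(after))
)
```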

The two histograms are shown together below for comparison.

byDay.all <- rbind(byDay, byDay.impute)
# Relabel explicitly; factor() works whether label is stored as character or factor
byDay.all$label <- factor(byDay.all$label, levels = c("with.na", "without.na"),
    labels = c("With NA", "Without NA"))
ggplot(byDay.all, aes(x = steps, fill = label)) +
    geom_histogram(binwidth = 1500, colour = "black", alpha = 0.2) +
    labs(title = "Steps Taken per Day", x = "Number of Steps", y = "Frequency") +
    theme(legend.position = "bottom")

[Figure: histograms of daily totals with and without missing values (chunk day_compare)]

Are there differences in activity patterns between weekdays and weekends?

To compare weekdays and weekends using the dataset with imputed values, we follow these steps:

  1. Subset the data into two parts: weekends (Saturday and Sunday) and weekdays (Monday through Friday).
  2. Obtain the average steps per interval for each subset.
  3. Plot the two subsets for comparison.

# Use the C locale so that weekdays() returns English day names
Sys.setlocale(locale = "C")
# We obtain the two subsets
data.weekend <- subset(data.impute, weekdays(date) %in% c("Saturday", "Sunday"))
data.weekday <- subset(data.impute, !weekdays(date) %in% c("Saturday", "Sunday"))

# Obtain the average steps per interval for each dataset
data.weekend <- aggregate(steps ~ interval, data.weekend, mean)
data.weekday <- aggregate(steps ~ interval, data.weekday, mean)

# Add a label to each subset for plotting
data.weekend <- cbind(data.weekend, day = "weekend")
data.weekday <- cbind(data.weekday, day = "weekday")
# Combine the subsets and set the display labels
data.week <- rbind(data.weekend, data.weekday)
data.week$day <- factor(data.week$day, levels = c("weekend", "weekday"),
    labels = c("Weekend", "Weekday"))

ggplot(data.week, aes(x = interval, y = steps)) +
    geom_line() +
    facet_grid(day ~ .) +
    labs(x = "Interval", y = "Number of steps")

[Figure: average steps per interval, weekend and weekday panels (chunk weekday_compare)]
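
The three steps above can also be sketched without splitting the table, by adding a single factor column and aggregating on it. This is an alternative to the code shown, not the report's method. It classifies days with format(date, "%u") (1 = Monday to 7 = Sunday), which does not depend on the locale; the toy data and the names toy and byIntervalDay are illustrative only.

```r
# Toy data: one interval on four consecutive days (values are made up).
# 2012-10-05 is a Friday, so the dates are Fri, Sat, Sun, Mon.
toy <- data.frame(
    steps    = c(10, 40, 60, 20),
    date     = as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08")),
    interval = 0
)

# ISO weekday number: 6 and 7 are Saturday and Sunday.
is_weekend <- as.integer(format(toy$date, "%u")) >= 6
toy$day <- factor(ifelse(is_weekend, "Weekend", "Weekday"),
    levels = c("Weekend", "Weekday"))

# Mean steps per interval within each day type.
byIntervalDay <- aggregate(steps ~ interval + day, toy, mean)
```

The resulting table already has the day column needed for facet_grid(day ~ .).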

We observe more activity on weekends than on weekdays. This may be because weekday activity mostly takes place at work, whereas weekend activity takes place in more varied settings.