Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting them.

This assignment makes use of data from a personal activity monitoring device that collects data at 5-minute intervals throughout the day. The data cover two months, October and November 2012, from an anonymous individual and include the number of steps taken in each 5-minute interval of each day.

The data for this assignment can be downloaded from the course web site.

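The variables included in this dataset are:

steps: number of steps taken in a 5-minute interval (missing values are recorded as NA)
date: date on which the measurement was taken
interval: identifier of the 5-minute interval in which the measurement was taken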

The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
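
The observation count can be checked by hand. A minimal sketch of the arithmetic, assuming every 5-minute interval of every day in October and November has a row (the variable names are illustrative and do not come from the assignment):

```r
# Number of 5-minute intervals in one day
intervals_per_day <- 24 * 60 / 5      # 288

# October has 31 days and November has 30
days_in_study <- 31 + 30              # 61

# Expected number of rows if every interval of every day is present
expected_rows <- intervals_per_day * days_in_study
expected_rows                         # 17568
```

The result matches the 17,568 observations stated above.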

Set Up Environment

library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(VIM)
library(mice)
set.seed(1010)

Questions

1. Code for reading in the dataset and/or processing the data

activity <- read_csv("activity.csv")
## Parsed with column specification:
## cols(
##   steps = col_integer(),
##   date = col_date(format = ""),
##   interval = col_integer()
## )

I used the read_csv function from the readr package to import the data. The function parses the date column automatically, whereas read.csv requires that date values be coerced from class character to class Date.
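
For comparison, the base R route might look like the sketch below. The two rows are made-up values in the same column layout, supplied inline so the example runs without activity.csv, and the name activity_base is illustrative.

```r
# Two hypothetical rows in the same layout as activity.csv
csv_text <- "steps,date,interval\nNA,2012-10-01,0\n12,2012-10-02,5"

# read.csv leaves the date column as text
# (stringsAsFactors = FALSE keeps it character on R versions before 4.0)
activity_base <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(activity_base$date)    # "character"

# Explicit coercion from character to Date
activity_base$date <- as.Date(activity_base$date, format = "%Y-%m-%d")
class(activity_base$date)    # "Date"
```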

2. Histogram of the total number of steps taken each day

activity %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>% 
  ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) + 
  ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") + 
  ylab("Frequency")

Zero values don’t really make sense here. It’s unlikely that someone would take no steps during a day. It’s more likely the participant forgot to use the device on a couple of days.
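
When judging whether a zero is genuine, it may help to recall how sum treats a day with no recorded readings. The sketch below uses a hypothetical vector, not the assignment data:

```r
# Hypothetical day on which the device recorded nothing
unrecorded_day <- c(NA_integer_, NA_integer_, NA_integer_)

# Without na.rm the total is NA, so the day can be told apart from a true zero
total_default <- sum(unrecorded_day)
total_default                        # NA

# With na.rm = TRUE the same day is reported as zero steps
total_na_rm <- sum(unrecorded_day, na.rm = TRUE)
total_na_rm                          # 0
```

Leaving na.rm at its default, as the histogram code above does, keeps unrecorded days as NA rather than turning them into zero-step days.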

3. Mean and median number of steps taken each day

Mean steps per 5-minute interval (mean of the daily means), missing values removed

activity %>% group_by(date) %>% 
  summarise(meanStepsPerDay = mean(steps, na.rm = TRUE)) %>% 
  summarise(meanSteps = mean(meanStepsPerDay, na.rm = TRUE))
## # A tibble: 1 × 1
##   meanSteps
##       <dbl>
## 1   37.3826
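
The value above is an average per 5-minute interval, not a daily total. As a rough sketch, it can be scaled to a per-day figure under the assumption (not verified here) that every day with data has all of its 5-minute intervals recorded; the variable names are illustrative:

```r
mean_per_interval <- 37.3826          # value reported above
intervals_per_day <- 24 * 60 / 5      # 288 five-minute intervals per day

# Assumption: each day that has data has all 288 intervals recorded
mean_daily_total <- mean_per_interval * intervals_per_day
mean_daily_total                      # roughly 10766
```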

Median steps per 5-minute interval (median of the daily medians), missing values removed

activity %>% group_by(date) %>% 
  summarise(medianStepsPerDay = median(steps, na.rm = TRUE)) %>%
  summarise(medianSteps = median(medianStepsPerDay, na.rm = TRUE))
## # A tibble: 1 × 1
##   medianSteps
##         <dbl>
## 1           0

The data set contains missing values, so they must be removed (na.rm = TRUE) before computing the mean and median; otherwise the results would be NA.
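
The behaviour can be seen on a small hypothetical vector. The second half of the sketch suggests why na.rm = TRUE also appears in the second summarise call: a group with no observed values still yields a missing result.

```r
steps_example <- c(0, 0, 47, NA)      # hypothetical readings

mean(steps_example)                   # NA
median(steps_example)                 # NA
mean(steps_example, na.rm = TRUE)     # 15.66667
median(steps_example, na.rm = TRUE)   # 0

# A group in which every value is missing
all_missing <- c(NA_real_, NA_real_)
mean(all_missing, na.rm = TRUE)       # NaN
median(all_missing, na.rm = TRUE)     # NA
```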

4. Time series plot of the average number of steps taken

activity %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = meanSteps)) + geom_line() + 
  ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")

Notice the breaks in the time series graph, which mark the dates with missing values.
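
One way to list the dates behind such gaps is sketched below in base R. The data frame toy and its values are hypothetical stand-ins for the activity data.

```r
# Hypothetical stand-in for the activity data
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  steps = c(NA, NA, 0, 12)
)

# TRUE for dates on which every reading is missing
all_missing_by_date <- tapply(is.na(toy$steps), toy$date, all)

# Dates that would have no daily mean to plot
gap_dates <- names(all_missing_by_date)[all_missing_by_date]
gap_dates    # "2012-10-01"
```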

5. The 5-minute interval that, on average, contains the maximum number of steps

activity %>% group_by(interval) %>% 
  summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
  filter(meanByInterval == max(meanByInterval))
## # A tibble: 1 × 2
##   interval meanByInterval
##      <int>          <dbl>
## 1      835       206.1698

On average, interval 835 contains the maximum number of steps.
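
The identifier is easier to read as a clock time. The sketch below assumes the identifier encodes the start of the interval as HHMM without leading zeros; that reading is consistent with the identifiers shown in this report but is not stated in it. The helper interval_to_clock is hypothetical.

```r
# Assumption: the identifier encodes clock time as HHMM without leading zeros
interval_to_clock <- function(interval) {
  hours   <- as.integer(interval %/% 100)   # leading digits are the hour
  minutes <- as.integer(interval %% 100)    # last two digits are the minutes
  sprintf("%02d:%02d", hours, minutes)
}

interval_to_clock(c(835L, 40L, 2310L))   # "08:35" "00:40" "23:10"
```

Under that assumption, interval 835 corresponds to the five minutes starting at 08:35.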

The 5-minute interval that, on average, contains the minimum number of steps

activity %>% group_by(interval) %>% 
  summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
  filter(meanByInterval == min(meanByInterval))
## # A tibble: 19 × 2
##    interval meanByInterval
##       <int>          <dbl>
## 1        40              0
## 2       120              0
## 3       155              0
## 4       200              0
## 5       205              0
## 6       215              0
## 7       220              0
## 8       230              0
## 9       240              0
## 10      245              0
## 11      300              0
## 12      305              0
## 13      310              0
## 14      315              0
## 15      350              0
## 16      355              0
## 17      415              0
## 18      500              0
## 19     2310              0

Nineteen intervals tie for the minimum, each with an average of zero steps.
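
Filtering with == returns every tied interval, which is why nineteen rows appear. A small base R sketch with hypothetical averages shows the difference from which.min, which reports only the first match:

```r
# Hypothetical interval averages, named by interval identifier
mean_by_interval <- c(`35` = 0.5, `40` = 0, `45` = 1.2, `120` = 0)

# Comparing against the minimum keeps all ties
all_minima <- names(mean_by_interval)[mean_by_interval == min(mean_by_interval)]
all_minima      # "40" "120"

# which.min stops at the first minimum
first_minimum <- names(which.min(mean_by_interval))
first_minimum   # "40"
```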

6. Code to describe and show a strategy for imputing missing data

Missing values by variable

md.pattern(activity)
##       date interval steps     
## 15264    1        1     1    0
##  2304    1        1     0    1
##          0        0  2304 2304
(missing <- sum(is.na(activity)))
## [1] 2304

There are 2304 missing values in the data set. All of the missing values occur in the steps variable.

Missing Values as percent of total, percent each column

missingPercent <- sum(is.na(activity))/(dim(activity)[1]*dim(activity)[2]) * 100 
pMiss <- function(x) { sum(is.na(x)) / length(x) * 100}
(missingPercentCol <- apply(activity, 2, pMiss))
##    steps     date interval 
## 13.11475  0.00000  0.00000

The missing values represent about 4.37 percent of all cells in the data set and about 13.11 percent of the steps variable.
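
Both percentages follow directly from the counts reported earlier. A short check, with illustrative variable names:

```r
n_rows    <- 17568    # observations in the data set
n_cols    <- 3        # steps, date, interval
n_missing <- 2304     # missing values, all in steps

# Share of all cells that are missing
pct_of_cells <- n_missing / (n_rows * n_cols) * 100
pct_of_cells          # 4.371585

# Share of the steps column that is missing
pct_of_steps <- n_missing / n_rows * 100
pct_of_steps          # 13.11475

# Missing readings expressed in days of 288 five-minute intervals
days_equivalent <- n_missing / 288
days_equivalent       # 8
```

The last figure equals eight full days of readings, which would be consistent with whole days being unrecorded, although the count alone does not confirm that.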

Missing values visualization

aggr(activity, numbers = TRUE)

The aggr function from the VIM package summarizes the preceding missing-value calculations in a single visualization.

Missing values strategy: take complete cases only

activityNoMissing <- activity[complete.cases(activity),]

We chose to delete the cases with missing values. We initially considered median imputation; however, the median number of steps computed in question 3 was zero. Adding more zeros to the data set made less sense than simply deleting the incomplete cases.
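
To illustrate the concern, here is a sketch of one possible form of median imputation, grouped by interval. The grouping is an assumption, since the text does not say which variant was considered, and the data frame toy holds hypothetical values. This is not the strategy adopted in this report.

```r
# Hypothetical readings: mostly zeros, as is typical for step counts
toy <- data.frame(
  interval = c(0, 0, 0, 5, 5, 5, 5),
  steps    = c(0, 0, NA, 0, 0, 30, NA)
)

# Median of the observed values within each interval, repeated per row
interval_median <- ave(toy$steps, toy$interval,
                       FUN = function(v) median(v, na.rm = TRUE))

# Fill only the missing readings with their interval's median
toy$steps_imputed <- ifelse(is.na(toy$steps), interval_median, toy$steps)
toy$steps_imputed    # 0 0 0 0 0 30 0
```

Both missing readings are filled with zero, which adds no information beyond more zeros.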

7. Histogram of the total number of steps taken each day after missing values are removed

activityNoMissing %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>% 
  ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) + 
  ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") + 
  ylab("Frequency")

7a. Time series plot of average number of steps taken after missing values are removed

activityNoMissing %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = meanSteps)) + geom_line() + 
  ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")

The mean and median are the same as those presented in question 3, because that calculation already excluded the missing values.
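
The equivalence can be checked on a small hypothetical data frame: dropping incomplete rows first gives the same statistics as passing na.rm = TRUE.

```r
# Hypothetical readings with one missing value
toy <- data.frame(interval = c(0, 5, 10, 15), steps = c(4, NA, 0, 11))

# Keep only rows without missing values
toy_complete <- toy[complete.cases(toy), ]

same_mean   <- identical(mean(toy$steps, na.rm = TRUE), mean(toy_complete$steps))
same_median <- identical(median(toy$steps, na.rm = TRUE), median(toy_complete$steps))
c(same_mean, same_median)    # TRUE TRUE
```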

8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

By Weekday vs. Weekend

t <- activityNoMissing %>% mutate(dayOfWeek = weekdays(date)) %>%
  mutate(Weekend = ifelse(dayOfWeek == "Saturday" | dayOfWeek == "Sunday", "Weekend", "Weekday"))
## By Weekday vs. Weekend 
t %>% 
  group_by(Weekend, interval) %>% mutate(meanStepsInterval = mean(steps)) %>%
  ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
  facet_wrap(~Weekend) + ggtitle("Mean Steps by Interval: Weekday vs. Weekend") + 
  xlab("Interval") + ylab("Mean Steps")
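
Because weekdays returns day names in the current locale, comparing against "Saturday" and "Sunday" works only where day names are in English. A locale-independent alternative, offered as a sketch with hypothetical dates, uses the numeric weekday from as.POSIXlt:

```r
# Friday through Monday, 5 to 8 October 2012
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday is 0 for Sunday through 6 for Saturday, regardless of locale
day_number <- as.POSIXlt(dates)$wday

day_type <- ifelse(day_number %in% c(0, 6), "Weekend", "Weekday")
day_type    # "Weekday" "Weekend" "Weekend" "Weekday"
```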

By Days of the Week

t %>%
  group_by(dayOfWeek, interval) %>% mutate(meanStepsInterval = mean(steps)) %>%
  ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
  facet_wrap(~dayOfWeek) + ggtitle("Mean Steps by Interval: By Day") +
  xlab("Interval") + ylab("Mean Steps")