Questions

1. Code for reading in the dataset and/or processing the data

activity <- read_csv("activity.csv")

## Parsed with column specification:
## cols(
##   steps = col_integer(),
##   date = col_date(format = ""),
##   interval = col_integer()
## )

I used the read_csv function from the readr package to import the data. The function parses dates automatically, whereas read.csv requires that date values be coerced from class character to class Date.
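For comparison, the base R route can be sketched as follows. The inline CSV text and the names `csv_text` and `raw` are illustrative assumptions, not part of the original analysis; the point is only that the date column arrives as character and needs an explicit `as.Date` call.

```r
# Toy CSV text standing in for activity.csv (values are illustrative only).
csv_text <- "steps,date,interval\n0,2012-10-02,0\n5,2012-10-02,5\n"

# Base R reads the date column as plain character strings.
raw <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class_before <- class(raw$date)

# Explicit coercion to Date, the step that read_csv handles on import.
raw$date <- as.Date(raw$date, format = "%Y-%m-%d")
class_after <- class(raw$date)
```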

2. Histogram of the total number of steps taken each day

activity %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>% 
  ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) + 
  ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") + 
  ylab("Frequency")

Totals at or near zero don’t really make sense here. It’s unlikely that someone would take almost no steps during a day. It’s more likely the participant forgot to use the device on a couple of days.
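When reading the low end of the histogram, it may help to remember how sum treats missing values. The sketch below uses made-up vectors (`day_all_missing` and `day_recorded` are hypothetical names): without na.rm, a day with missing intervals totals NA rather than zero, while na.rm = TRUE would turn a fully missing day into a zero.

```r
# Two toy "days" of interval counts; the values are illustrative only.
day_all_missing <- c(NA_integer_, NA_integer_, NA_integer_)
day_recorded    <- c(0L, 12L, 30L)

total_default <- sum(day_all_missing)                # NA: missing values propagate
total_na_rm   <- sum(day_all_missing, na.rm = TRUE)  # 0: nothing left to add
total_ok      <- sum(day_recorded)                   # 42
```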

3. Mean and median number of steps taken each day

Mean steps per 5-minute interval, averaged over days, missing values removed

activity %>% group_by(date) %>% 
  summarise(meanStepsPerDay = mean(steps, na.rm = TRUE)) %>% 
  summarise(meanSteps = mean(meanStepsPerDay, na.rm = TRUE))

## # A tibble: 1 × 1
##   meanSteps
##       <dbl>
## 1   37.3826

Median of the daily median steps per 5-minute interval, missing values removed

activity %>% group_by(date) %>% 
  summarise(medianStepsPerDay = median(steps, na.rm = TRUE)) %>%
  summarise(medianSteps = median(medianStepsPerDay, na.rm = TRUE))

## # A tibble: 1 × 1
##   medianSteps
##         <dbl>
## 1           0

There are missing values in the data set, so the mean and median functions need na.rm = TRUE; otherwise the result would be NA.
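The pipelines above summarise per-interval values. If the question is instead read as asking for the mean and median of the daily totals, a minimal base R sketch could look like this. The data frame `toy` and its values are invented for illustration, so the numbers it produces say nothing about the real data set.

```r
# Toy data: three days with two intervals each; the first day is all missing.
toy <- data.frame(
  date  = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02",
                    "2012-10-02", "2012-10-03", "2012-10-03")),
  steps = c(NA, NA, 10, 30, 0, 100)
)

# Total steps per day; a day with missing intervals totals NA.
daily_totals <- tapply(toy$steps, format(toy$date), sum)

# Mean and median of the daily totals, skipping the NA day.
mean_daily   <- mean(daily_totals, na.rm = TRUE)
median_daily <- median(daily_totals, na.rm = TRUE)
```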

4. Time series plot of the average number of steps taken

activity %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = meanSteps)) + geom_line() + 
  ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")

Notice the breaks in the time series graph, which fall on the days where every value is missing.
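The gaps likely come from how mean behaves when nothing is left after removing missing values. A small sketch with a made-up vector (`all_missing` is a hypothetical name) shows the difference between NA and NaN in that case; the line geom then has no finite value to draw for that date.

```r
# A toy day in which every interval is missing.
all_missing <- c(NA_real_, NA_real_)

mean_default <- mean(all_missing)                # NA: missing values propagate
mean_na_rm   <- mean(all_missing, na.rm = TRUE)  # NaN: nothing left to average
```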

5. The 5-minute interval that, on average, contains the maximum number of steps

activity %>% group_by(interval) %>% 
  summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
  filter(meanByInterval == max(meanByInterval))

## # A tibble: 1 × 2
##   interval meanByInterval
##      <int>          <dbl>
## 1      835       206.1698

Interval 835 contains, on average, the maximum number of steps (about 206).

The 5-minute interval that, on average, contains the minimum number of steps

activity %>% group_by(interval) %>% 
  summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
  filter(meanByInterval == min(meanByInterval))

## # A tibble: 19 × 2
##    interval meanByInterval
##       <int>          <dbl>
##  1       40              0
##  2      120              0
##  3      155              0
##  4      200              0
##  5      205              0
##  6      215              0
##  7      220              0
##  8      230              0
##  9      240              0
## 10      245              0
## 11      300              0
## 12      305              0
## 13      310              0
## 14      315              0
## 15      350              0
## 16      355              0
## 17      415              0
## 18      500              0
## 19     2310              0

Nineteen intervals tie for the minimum; each averages zero steps.
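The tie matters for how the minimum is looked up. As a sketch with invented interval means (`interval_means` is a hypothetical named vector), which.min reports only the first minimum, whereas comparing against min keeps every tied interval, which is what the filter call above does.

```r
# Toy mean steps per interval, named by interval id; values are illustrative.
interval_means <- c("0" = 1.5, "5" = 0, "10" = 0, "15" = 4.2)

# which.min() returns the position of the first minimum only.
first_min <- names(interval_means)[which.min(interval_means)]

# Comparing against min() keeps all intervals that share the minimum.
all_min <- names(interval_means)[interval_means == min(interval_means)]
```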

6. Code to describe and show a strategy for imputing missing data

Missing values by variable

md.pattern(activity)

##       date interval steps     
## 15264    1        1     1    0
##  2304    1        1     0    1
##          0        0  2304 2304

(missing <- sum(is.na(activity)))

## [1] 2304

There are 2304 missing values in the data set. All of the missing values occur in the steps variable.

Missing values as a percent of the total and of each column

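# Percent of all cells (rows x columns) that are missing
missingPercent <- sum(is.na(activity)) / (nrow(activity) * ncol(activity)) * 100
# Percent of missing values within a single column
pMiss <- function(x) { sum(is.na(x)) / length(x) * 100 }
(missingPercentCol <- apply(activity, 2, pMiss))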

##    steps     date interval 
## 13.11475  0.00000  0.00000

The missing values represent about 4.37 percent of all cells in the data set and about 13.11 percent of the steps variable.
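The two percentages can be checked by hand from the counts in the md.pattern output. The sketch below only redoes that arithmetic; the variable names are illustrative.

```r
# Counts taken from the md.pattern output above.
n_missing <- 2304
n_rows    <- 15264 + 2304  # complete rows plus rows with a missing value
n_cols    <- 3             # steps, date, interval

pct_of_cells <- n_missing / (n_rows * n_cols) * 100  # share of all cells
pct_of_steps <- n_missing / n_rows * 100             # share of the steps column
```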

Missing values visualization

aggr(activity, numbers = TRUE)

The aggr function from the VIM package summarizes the preceding missing-value calculations in a single visualization.

Missing values strategy: take complete cases only

activityNoMissing <- activity[complete.cases(activity),]

We chose to delete the cases with missing values. We initially thought a median imputation was in order; however, the median computed in section 3 was zero. Adding more zeros to the data set made less sense than simply deleting the incomplete cases.
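The trade-off can be illustrated on a toy data frame. Everything in this sketch (`toy`, `complete`, `imputed`, and the values) is an assumption made for illustration: median imputation fills the gaps with the zero median, whereas complete.cases simply drops the incomplete rows.

```r
# Toy data with two missing step counts; values are illustrative only.
toy <- data.frame(
  steps    = c(NA, 0, 0, 47, NA, 0),
  interval = c(0, 5, 10, 15, 20, 25)
)

# Strategy used in this report: keep only rows without missing values.
complete <- toy[complete.cases(toy), ]

# Alternative that was considered: replace each NA with the median (zero here).
imputed <- toy
imputed$steps[is.na(imputed$steps)] <- median(toy$steps, na.rm = TRUE)

zeros_before <- sum(toy$steps == 0, na.rm = TRUE)
zeros_after  <- sum(imputed$steps == 0)
```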

7. Histogram of the total number of steps taken each day after missing values are removed

activityNoMissing %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>% 
  ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) + 
  ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") + 
  ylab("Frequency")

7a. Time series plot of average number of steps taken after missing values are removed

activityNoMissing %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
  ggplot(aes(x = date, y = meanSteps)) + geom_line() + 
  ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")

Mean and median steps are the same as those presented in section 3 above, because the missing values were already removed there in order to make the computations.

8. Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends

By Weekday vs. Weekend

t <- activityNoMissing %>% mutate(dayOfWeek = weekdays(date)) %>%
  mutate(Weekend = ifelse(dayOfWeek == "Saturday" | dayOfWeek == "Sunday", "Weekend", "Weekday"))

## By Weekday vs. Weekend 
t %>% 
  group_by(Weekend, interval) %>% summarise(meanStepsInterval = mean(steps)) %>%
  ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
  facet_wrap(~Weekend) +ggtitle("Mean Steps by Interval: Weekday vs. Weekend") + 
  xlab("Interval") + ylab("Mean Steps")
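One caveat with the classification above is that weekdays returns day names in the current locale, so the comparison with "Saturday" and "Sunday" assumes an English locale. A locale-independent alternative, sketched here with a few hand-picked dates (`dates`, `day_number`, and `day_type` are hypothetical names), relies on the numeric weekday from as.POSIXlt.

```r
# Friday through Monday in the first full week of October 2012.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday counts weekdays from Sunday = 0 to Saturday = 6.
day_number <- as.POSIXlt(dates)$wday

# Sunday (0) and Saturday (6) are the weekend; everything else is a weekday.
day_type <- ifelse(day_number %in% c(0, 6), "Weekend", "Weekday")
```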

By Days of the Week

t %>%
  group_by(dayOfWeek, interval) %>% summarise(meanStepsInterval = mean(steps)) %>%
  ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
  facet_wrap(~dayOfWeek) +ggtitle("Mean Steps by Interval: By Day") +
  xlab("Interval") + ylab("Mean Steps")

Reproducible Research - Project Assignment 1

Mark Blackmore

July 28, 2017

Introduction

Set Up Environment