It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval each day.
The data for this assignment can be downloaded from the course web site.
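The variables included in this dataset are steps (the number of steps taken in a 5-minute interval, with missing values coded as NA), date (the date on which the measurement was taken), and interval (the identifier of the 5-minute interval in which the measurement was taken).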
The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
library(readr)
library(tidyr)
library(dplyr)
library(ggplot2)
library(VIM)
library(mice)
set.seed(1010)
activity <- read_csv("activity.csv")
## Parsed with column specification:
## cols(
## steps = col_integer(),
## date = col_date(format = ""),
## interval = col_integer()
## )
I used the read_csv function from the readr package to import the data. It parses the date column automatically, whereas read.csv requires that date values be coerced from class character to class Date.
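For comparison, the base R route might look like the following minimal sketch. The inline CSV text is a made-up three-row stand-in for activity.csv (an assumption for illustration, not the real data), and the date column is converted explicitly with as.Date.

```r
# Toy stand-in for activity.csv (hypothetical rows, same three columns).
csv_text <- "steps,date,interval
NA,2012-10-01,0
0,2012-10-02,5
117,2012-10-02,10"

# read.csv does not parse dates, so the column arrives as character
# and is converted explicitly afterwards.
toy <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy$date)   # class before coercion
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")
class(toy$date)   # class after coercion
```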
activity %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>%
ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) +
ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") +
ylab("Frequency")
Zero values don’t really make sense here. It’s unlikely that someone would take no steps during a day. It’s more likely the participant forgot to use the device on a couple of days.
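One way to check whether a very low daily total reflects real inactivity or missing readings is to count the recorded intervals per day. The sketch below uses made-up data (toy_date and toy_steps are hypothetical names). It also shows that sum() returns NA for a day with missing readings unless na.rm = TRUE is set, in which case a fully missing day collapses to 0.

```r
toy_date  <- rep(c("2012-10-01", "2012-10-02"), each = 3)
toy_steps <- c(NA, NA, NA, 0, 47, 12)

# Without na.rm, a day containing NA sums to NA.
total_raw <- tapply(toy_steps, toy_date, sum)
# With na.rm = TRUE, a day with no readings at all sums to 0.
total_rm <- tapply(toy_steps, toy_date, sum, na.rm = TRUE)
# The number of non-missing readings per day tells the two cases apart.
n_recorded <- tapply(!is.na(toy_steps), toy_date, sum)

total_raw
total_rm
n_recorded
```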
activity %>% group_by(date) %>%
summarise(meanStepsPerDay = mean(steps, na.rm = TRUE)) %>%
summarise(meanSteps = mean(meanStepsPerDay, na.rm = TRUE))
## # A tibble: 1 × 1
## meanSteps
## <dbl>
## 1 37.3826
activity %>% group_by(date) %>%
summarise(medianStepsPerDay = median(steps, na.rm = TRUE)) %>%
summarise(medianSteps = median(medianStepsPerDay, na.rm = TRUE))
## # A tibble: 1 × 1
## medianSteps
## <dbl>
## 1 0
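There are missing values in the data set, so mean and median are called with na.rm = TRUE; otherwise the result would be NA. Note that both figures are computed from the 5-minute interval values within each day, so they describe steps per interval rather than total steps per day.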
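The effect of na.rm can be seen on a small made-up vector (the values are hypothetical, not taken from the data set):

```r
x <- c(0, 0, NA, 10, 0)

mean(x)                  # NA, because one value is missing
mean(x, na.rm = TRUE)    # 2.5, the mean of the four observed values
median(x, na.rm = TRUE)  # 0, because more than half of the observed values are 0
```

A median of zero is what one would expect whenever more than half of the readings are zero, which seems plausible for 5-minute intervals that include sleep and sedentary time.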
activity %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
ggplot(aes(x = date, y = meanSteps)) + geom_line() +
ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")
Notice the breaks in the time series graph, which also highlight the missing values.
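The breaks are consistent with days on which every reading is missing: with na.rm = TRUE, the mean of an all-NA vector is NaN, so there is no value to draw for that date. A minimal illustration with made-up values:

```r
# A hypothetical day on which no interval was recorded.
all_missing <- c(NA_integer_, NA_integer_, NA_integer_)

# After the NAs are removed nothing is left, so the mean is NaN.
day_mean <- mean(all_missing, na.rm = TRUE)
is.nan(day_mean)
```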
activity %>% group_by(interval) %>%
summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
filter(meanByInterval == max(meanByInterval))
## # A tibble: 1 × 2
## interval meanByInterval
## <int> <dbl>
## 1 835 206.1698
On average, interval 835 contains the maximum number of steps (about 206).
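If the interval identifier encodes clock time as hours and minutes (an assumption, though it is consistent with identifiers such as 2310 appearing in the data), it can be turned into a readable time. The helper interval_to_time below is hypothetical and not part of the original analysis.

```r
# Assumes identifiers of the form HHMM, e.g. 835 means 08:35.
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 835, 2310))
```

Under that assumption, interval 835 corresponds to 08:35 in the morning.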
activity %>% group_by(interval) %>%
summarize(meanByInterval = mean(steps, na.rm = TRUE)) %>%
filter(meanByInterval == min(meanByInterval))
## # A tibble: 19 × 2
## interval meanByInterval
## <int> <dbl>
## 1 40 0
## 2 120 0
## 3 155 0
## 4 200 0
## 5 205 0
## 6 215 0
## 7 220 0
## 8 230 0
## 9 240 0
## 10 245 0
## 11 300 0
## 12 305 0
## 13 310 0
## 14 315 0
## 15 350 0
## 16 355 0
## 17 415 0
## 18 500 0
## 19 2310 0
Nineteen intervals share the minimum average of zero steps.
md.pattern(activity)
## date interval steps
## 15264 1 1 1 0
## 2304 1 1 0 1
## 0 0 2304 2304
(missing <- sum(is.na(activity)))
## [1] 2304
There are 2304 missing values in the data set. All of the missing values occur in the steps variable.
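The per-column counts can be cross-checked without the mice package by summing is.na() over columns. The toy data frame below is made up for illustration and only mimics the structure of the real data.

```r
toy <- data.frame(
  steps    = c(NA, 0, 12, NA),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# is.na() on a data frame returns a logical matrix; colSums counts TRUEs per column.
na_per_column <- colSums(is.na(toy))
na_per_column
```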
missingPercent <- sum(is.na(activity))/(dim(activity)[1]*dim(activity)[2]) * 100
pMiss <- function(x) { sum(is.na(x)) / length(x) * 100}
(missingPercentCol <-apply(activity, 2, pMiss))
## steps date interval
## 13.11475 0.00000 0.00000
The missing values represent about 4.37 percent of all values in the data set and about 13.11 percent of the steps variable.
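Both percentages follow directly from the counts reported above: 2304 missing values, 17,568 rows, and 3 columns. The arithmetic can be checked as:

```r
n_missing <- 2304
n_rows    <- 17568
n_cols    <- 3

# Share of missing cells among all cells in the data set.
pct_of_all_cells <- n_missing / (n_rows * n_cols) * 100
# Share of missing values within the steps column alone.
pct_of_steps <- n_missing / n_rows * 100

round(c(pct_of_all_cells, pct_of_steps), 2)
```

Note also that 2304 is exactly 8 x 288, the equivalent of eight full days of 5-minute intervals.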
aggr(activity, numbers = TRUE)
We can quickly represent the previous missing values calculations with this visualization from the VIM package.
activityNoMissing <- activity[complete.cases(activity),]
We chose to delete the cases with missing values. We initially thought a median imputation was in order; however, the median value for steps computed above was zero. Adding more zeros to the data set made less sense than simply deleting the missing cases.
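To see why median imputation only adds zeros, consider a made-up vector in which most readings are zero, as is typical for 5-minute step counts. This is only a sketch of the idea on hypothetical values, not the procedure used in the analysis.

```r
steps <- c(0, 0, 0, NA, 25, 0, NA, 110)

# Median of the observed values; zero here because most readings are zero.
med <- median(steps, na.rm = TRUE)

# Median imputation replaces each NA with that median.
imputed <- ifelse(is.na(steps), med, steps)
sum(imputed == 0)   # number of zeros after imputation

# Deleting the missing cases instead leaves the observed values untouched.
dropped <- steps[!is.na(steps)]
sum(dropped == 0)   # number of zeros after deletion
```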
activityNoMissing %>% group_by(date) %>% summarise(stepsPerDay = sum(steps)) %>%
ggplot(aes(x = stepsPerDay)) + geom_histogram(bins = 15) +
ggtitle("Histogram: Total Number of Steps Per Day") + xlab("Steps Per Day") +
ylab("Frequency")
activityNoMissing %>% group_by(date) %>% summarise(meanSteps = mean(steps, na.rm = TRUE)) %>%
ggplot(aes(x = date, y = meanSteps)) + geom_line() +
ggtitle("Mean Steps by Date") + xlab("Date") + ylab("Mean Steps")
The mean and median steps are the same as those presented in 3 above, where the missing values were removed in order to make the computations.
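That equivalence holds because dropping incomplete rows first and passing na.rm = TRUE discard the same observations when only one column has missing values. A small check on made-up data:

```r
toy <- data.frame(
  steps    = c(NA, 4, 0, NA, 20),
  interval = c(0, 5, 10, 15, 20)
)

# Listwise deletion, as done with complete.cases() in the analysis.
toy_complete <- toy[complete.cases(toy), ]

same_mean   <- mean(toy$steps, na.rm = TRUE) == mean(toy_complete$steps)
same_median <- median(toy$steps, na.rm = TRUE) == median(toy_complete$steps)
c(same_mean, same_median)
```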
t <- activityNoMissing %>% mutate(dayOfWeek = weekdays(date)) %>%
mutate(Weekend = ifelse(dayOfWeek == "Saturday" | dayOfWeek == "Sunday", "Weekend", "Weekday"))
## By Weekday vs. Weekend
t %>%
group_by(Weekend, interval) %>% summarise(meanStepsInterval = mean(steps)) %>%
ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
facet_wrap(~Weekend) +ggtitle("Mean Steps by Interval: Weekday vs. Weekend") +
xlab("Interval") + ylab("Mean Steps")
t %>%
group_by(dayOfWeek, interval) %>% summarise(meanStepsInterval = mean(steps)) %>%
ggplot(aes(x = interval, y = meanStepsInterval)) + geom_line() +
facet_wrap(~dayOfWeek) +ggtitle("Mean Steps by Interval: By Day") +
xlab("Interval") + ylab("Mean Steps")
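Because weekdays() returns day names in the current locale, the comparison against "Saturday" and "Sunday" only works as intended in an English locale. A locale-independent sketch, assuming the dates are of class Date, uses the numeric weekday from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. The helper name day_type is hypothetical.

```r
day_type <- function(date) {
  wday <- as.POSIXlt(date)$wday   # 0 = Sunday, ..., 6 = Saturday
  ifelse(wday %in% c(0, 6), "Weekend", "Weekday")
}

# Friday through Monday of the first week in the study period.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
day_type(dates)
```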