knitr::opts_chunk$set(fig.width=8, fig.height=5, echo=TRUE, warning=FALSE) #global parameters
library(readr)
library(dplyr)
library(zoo)
It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These type of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized both because the raw data are hard to obtain and there is a lack of statistical methods and software for processing and interpreting the data.
path <- getwd()
filename = "activity.csv"
if (!file.exists(filename)){
urlzip <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip"
download.file(urlzip, destfile = "./activity.zip" )
unzip("./activity.zip", exdir = path )
}
cat("The working directory is ",path)
## The working directory is C:/Users/jeff/Documents/R/Repro_Research/RepData_Project2
Data wrangling is comprised of obtaining, loading, exploring, cleaning, processing, etc. In this code chunk we do the following: * Set the path to match the working directory * read filename that was unzipped above and indicate the header is present * print the record of the activity file
activity_main <- read.csv("./activity.csv", header = TRUE)
head(activity_main,5)
It is always a good idea to not manipulate the original data file you upload (activity_main), so we make a copy:
activity <- activity_main
activity <- activity_main[!is.na(activity[, 1]), ]
act_count <- count(activity[, 0])
act_mean <- mean(activity[, 1])
act_sum <- sum(activity[, 1])
cat("Mean number of steps =",act_mean)
## Mean number of steps = 37.3826
cat("Total number of steps =",act_sum)
## Total number of steps = 570608
act_count
dim(activity) # Dimension
## [1] 15264 3
str(activity) # Structure
## 'data.frame': 15264 obs. of 3 variables:
## $ steps : int 0 0 0 0 0 0 0 0 0 0 ...
## $ date : chr "2012-10-02" "2012-10-02" "2012-10-02" "2012-10-02" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
summary(activity) # Summary
## steps date interval
## Min. : 0.00 Length:15264 Min. : 0.0
## 1st Qu.: 0.00 Class :character 1st Qu.: 588.8
## Median : 0.00 Mode :character Median :1177.5
## Mean : 37.38 Mean :1177.5
## 3rd Qu.: 12.00 3rd Qu.:1766.2
## Max. :806.00 Max. :2355.0
Now let’s answer the question, “What is mean total number of steps taken per day?”
First, let’s calculate the total number of steps per day, which is different than the calculation for total number of steps we did previously. Also, we’ll look ahead and note that we might want two options: * one with missing values (our original activity data) * one without missing values (activity0, the activity data exclusing NAs) * note that there are several ways to exclude the NAs * using the is.na() function, which queries the data for NAs and record the value ‘TRUE’ if or each ‘NA’ found, * negating is.na(), i.e.!is.na (we’ll use this here), * using ‘na.rm’, which is a logical value indicating whether ‘NA’ values should be stripped before the computation proceeds, and * using the complete.cases() function (we’ll use this later), which is used to return a logical vector with cases which are complete, i.e., no NAs.
Note that by applying a filter !is.na as a filter, we can turn it off by commenting it out
steps_by_date <- select(activity, steps, date, interval) %>%
filter(!is.na(steps)) %>% # removed filter to include NAs as 0-step days in histogram
group_by(date) %>%
summarize(total_steps = sum(steps, na.rm = TRUE)) #has the same effect as `!is.na`
Second, using the total number of steps per day calculation, we can now compute the mean and median number of steps per day. Here, we assume that each date represents a day, and that there are mulutiple periods recored each day, i.e., the 5-minute intervals.
Here, we use the complete.case() function as, previously mentioned to “remove” missing values
steps_by_date <- activity[complete.cases(activity),] %>%
group_by(date) %>%
summarize(total_steps=sum(steps))
steps_by_date
Next, we calculate and report the mean and median of the total number of steps taken per day. We compute them using the mean() and median() functions and name the computation, MeanMedian_Raw, for the unimputed data, and MeanMedian_Imputed (used later).
MeanMedian_Raw <-
activity[complete.cases(activity),] %>%
group_by(date) %>%
summarize(average_steps=mean(steps),median_steps=median(steps)) %>%
as.data.frame(MeanMedian_Raw)
mean_total_steps <- mean(steps_by_date$total_steps)
median_total_steps <- median(steps_by_date$total_steps)
cat("The mean of the total number of steps taken per day is ", mean_total_steps)
## The mean of the total number of steps taken per day is 10766.19
cat("The median of the total number of steps taken per day is ", median_total_steps)
## The median of the total number of steps taken per day is 10765
Here, we use R’s base function to plot a histogram of the total steps per day. We also add aesthetics, including * A plot title (main) * x and y labels * customer colors * breaks, which is a single number, treated as a suggestion as it gives pretty breakpoints
We also generate ablines using the abline() function, which can be used to add vertical, horizontal or regression lines to a graph. We add the vertical lines to the histogram we created earlier at the mean and median points. This also demonstrates how additional plotting objects can be added to an existing plot.
hist(steps_by_date$total_steps,
main = "Histogram of total steps per day",
xlab = "Total steps per day",
ylab = "Frequency [number of days]",
breaks = 20,
border = "slateblue4",
col = "dodgerblue"
)
abline(v = mean_total_steps, lwd = 1, lty = 2, col = "red")
abline(v = median_total_steps, lwd = 1, lty = 2, col = "red")
Here, we answer the question, “What is the average daily activity pattern?”
To do so, we create a quasi time-series plot (the plot is technical a scatterplot, as the activity data is not properly formatted with a time-based index, which we could do using the xts-package). However, we will apply a 100-interval moving average to investigate how we might impute missing values with a mean value, for example.
We’ll also customize the plot with the following aethetics: * A line type scatterplot using type = "l" * a wider line width, using lwd * a custom color, royalblue4, using col * a plot title, using main * x- and y-axis labels, with xlab and ylab, respectively * the 100-period moving average line, and * a customized legend
Finally, we will add vertical and horizontal lines, using abline(), representing the point where the maximum average steps occur.
steps_by_interval <- select(activity, steps, date, interval) %>%
group_by(interval) %>%
summarize(average_steps = mean(steps, na.rm = TRUE))
dim0 <- dim(steps_by_interval)
cat("The dimension of steps_by_interval is: ", dim0)
## The dimension of steps_by_interval is: 288 2
Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
x <- steps_by_interval[steps_by_interval$average_steps==max(steps_by_interval$average_steps),1]
cat("The 5-minute interval with the maximum steps is: ", as.character(x))
## The 5-minute interval with the maximum steps is: 835
new_steps_100ma <- zoo::rollmean(activity["steps"], k = 100, fill = NA)
plot(steps_by_interval$interval, steps_by_interval$average_steps,
type = "l",
lwd = 3,
col = 'royalblue4',
main = 'Average Steps by 5-minute Interval', cex = .85,
xlab = "Interval",
ylab = "Average number of steps")
lines(new_steps_100ma, lwd = 4, col = 'red2')
legend('topright', deaths.xts,
c('Steps', '100-period MA'),
fill = c('blue2', 'red2', 'green2'), cex = .85,
box.lty = 1)
max_average_steps <- max(steps_by_interval$average_steps)
max_average_steps_interval <- steps_by_interval[
steps_by_interval$average_steps == max_average_steps,
]$interval
abline(v = max_average_steps_interval, lwd = 1, lty = 2, col = "red")
abline(h = max_average_steps, lwd = 1, lty = 2, col = "red")
The moving average plot shows that there is an overall pattern to the daily total average steps, five or six periods. We can estimate this as 2000/5 = 500 periods. We might also consider the first 500 as a warmup period, to reach steady state in the 100-period MA, giving us points at 1000, 1500, 2000, and 2500
steps_by_interval[steps_by_interval$interval %in% c(1000,1500,2000,2500),]
head(activity[which(is.na(activity$steps) & activity$interval %in% c(1000,1500,2000,2500)),], n = 30)
Here, we investigate the effect of imputing missing steps values,
Considering that steps are derived electronically, zeros are valid entries. Although the missing values, NAs, are curious, we will develop a strategy to impute them. Here, we can use the pattern we noted with the 100-period MA, in particular, (500, 1000, 1500, 2000, 2500), since zeros are valid.
First, we calculate and report the total number of missing values, using the is.na() function.
x<-sum(is.na(activity[, 1]))
cat("Number of rows with missing values =",as.character(x))
## Number of rows with missing values = 0
Here, we use the mean() function for a given interval, which we calculated above, to replace missing steps values (NAs). The object, steps_by_interval, contains means of steps for each interval (see above).
steps_by_interval[steps_by_interval$interval %in% c(500,1000,1500,2000),]
steps_by_interval <- as.data.frame(steps_by_interval)
Next, we create an object to hold activity and imputed activity.
Imputed <- activity
Next, we find rows with missing (NA) data and replace them with the mean we calculate for the given for the intervals.
for (i in 1:nrow(Imputed)) {
if (is.na(Imputed$steps[i])){
n <- steps_by_date[steps_by_date$interval==Imputed$interval[i],average_steps]
Imputed$steps[i]<- as.integer(round(n))
}
}
Now, we are ready to compare the data we just imputed with the work accomplished earlier. We proceed as follows. * Populate the Imputed object with NAs imputed * Redefine the data with NAs present * Redefine the data with NAs removed * Generate histograms for each set of data
steps_by_date_imp <- Imputed %>%
group_by(date) %>%
summarize(imp_total_steps=sum(steps))
steps_by_date_na <- select(activity_main, steps, date, interval) %>%
#filter(!is.na(steps)) %>% # removed filter to include NAs as 0-step days in histogram
group_by(date) %>%
summarize(total_steps = sum(steps, na.rm = TRUE)) #has the same effect as `!is.na`
steps_by_date <- select(activity, steps, date, interval) %>%
filter(!is.na(steps)) %>% # removed filter to include NAs as 0-step days in histogram
group_by(date) %>%
summarize(total_steps = sum(steps, na.rm = TRUE)) #has the same effect as `!is.na`
par(mfrow=c(1,3))
hist(steps_by_date_imp$imp_total_steps,
main = "Histogram of Inputed Total Steps per Day (Avg/Interval)",
xlab = "Total steps per day",
ylab = "Frequency [number of days]",
cex.main = 1.1,
cex.lab = 1.1,
breaks = 20,
border = "slateblue4",
col = "slateblue3"
)
hist(steps_by_date$total_steps,
main = "Histogram of Imputed Total Steps per Day (NAs Removed)",
xlab = "Total steps per day",
ylab = "Frequency [number of days]",
cex.main = 1.1,
cex.lab = 1.1,
breaks = 20,
border = "cadetblue4",
col = "cadetblue3"
)
hist(steps_by_date_na$total_steps,
main = "Histogram of Total Steps per Day",
xlab = "Total steps per day",
ylab = "Frequency [number of days]",
cex.main = 1.1,
cex.lab = 1.1,
breaks = 20,
border = "dodgerblue4",
col = "dodgerblue3"
)
par(mfrow=c(1,1))
The first two histograms demonstrates the imputation and removal of the NAs present in the third histogram. Zeros are still present, since zero is a valid entry.
Now we answer the question,“Are there differences in activity patterns between weekdays and weekends?”
MeanMedian_Imputed <-
Imputed %>%
group_by(date) %>%
summarize(average_steps=mean(steps),median_steps=median(steps))
MeanMedian_Imputed
There is no significant difference
summary(MeanMedian_Raw,2:3)
## date average_steps median_steps
## Length:53 Min. : 0.1424 Min. :0
## Class :character 1st Qu.:30.6979 1st Qu.:0
## Mode :character Median :37.3785 Median :0
## Mean :37.3826 Mean :0
## 3rd Qu.:46.1597 3rd Qu.:0
## Max. :73.5903 Max. :0
summary(MeanMedian_Imputed,2:3)
## date average_steps median_steps
## Length:53 Min. : 0.1424 Min. :0
## Class :character 1st Qu.:30.6979 1st Qu.:0
## Mode :character Median :37.3785 Median :0
## Mean :37.3826 Mean :0
## 3rd Qu.:46.1597 3rd Qu.:0
## Max. :73.5903 Max. :0
Now, we answer the question, “Are there differences in activity patterns between weekdays and weekends?”
For this part the weekdays() function may be of some help here. Use the dataset with the filled-in missing values for this part. Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
First, we define the names of the days in a week and corresponding descriptors, weekday and weekend. We also attend to the necessary data imputations along the line of our previous work.
DayOfWeek <- data.frame(
name=c("Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"),
part=c("weekday","weekday","weekday","weekday","weekday","weekend","weekend"))
Imputed$date<-as.Date(Imputed$date)
## Create the column
Imputed$WeekPart <- as.factor(weekdays(Imputed$date))
## Set the value
Imputed$WeekPart <- DayOfWeek[Imputed$WeekPart,2]
We also make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). To do this, we: * Calculate and store the means by group * Construct a lattice plot for weekend and weekdays
PlotSummary <- Imputed %>%
group_by(WeekPart,interval) %>%
summarize(StepsMean=mean(steps))
## `summarise()` has grouped output by 'WeekPart'. You can override using the `.groups` argument.
## Use the lattice graphics for easier multiple panels
library(lattice)
xyplot(PlotSummary$StepsMean ~ PlotSummary$interval | PlotSummary$WeekPart,
data=PlotSummary, type="l", layout=c(1:2),
ylab="Number of Steps (mean)",
xlab="Interval")
Examination of the two plots indicate that the weekday and weekend activity is roughly the same over the same time intervals.