file <- "./activity.csv"
activity <- read.csv(file)
First, get rid of NA’s.
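The line that creates “meanDaily”, the data frame used throughout the rest of the analysis, does not appear in the code shown. Here is a minimal sketch of this step, assuming “meanDaily” is simply “activity” with the rows where steps is NA dropped. The toy data frame and its values are hypothetical stand-ins for activity.csv.

```r
# Toy stand-in for activity.csv (hypothetical values)
toy <- data.frame(
  steps    = c(NA, NA, 3, 7, 0, 12),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Keep only the rows whose steps value is not NA
toyClean <- toy[!is.na(toy$steps), ]
nrow(toyClean)
```

On the real data, the assumed equivalent would be `meanDaily <- activity[!is.na(activity$steps), ]`.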
Now we split the resulting data frame by date and use lapply to sum the steps for each day. The result is a list whose length equals the number of days that have non-NA values for steps.
daily <- split(meanDaily, meanDaily$date)
numSteps <- lapply(daily, function(x) {
  # Total steps recorded on this date
  sum(x$steps)
})
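As a cross-check on the split-and-lapply approach, the same per-day totals can be sketched with tapply, which groups and sums in one call. This is an alternative for comparison, not the method used above, and the toy data are hypothetical.

```r
# Hypothetical data: two days, two intervals each
toy <- data.frame(
  steps = c(3, 7, 0, 12),
  date  = c("2012-10-02", "2012-10-02", "2012-10-03", "2012-10-03")
)

# split + lapply, as in the text
viaSplit <- as.numeric(lapply(split(toy, toy$date), function(x) sum(x$steps)))

# tapply: sum steps within each date
viaTapply <- tapply(toy$steps, toy$date, sum)

# Both approaches should give the same daily totals
all.equal(viaSplit, as.numeric(viaTapply))
```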
Next, convert the list into a numeric vector and plot a histogram of it. We include the mean and median in a legend.
numSteps <- as.numeric(numSteps)
hist(numSteps, xlab = "Number of Steps", main = "Histogram: Total Steps per Day")
Average <- mean(numSteps)
Median <- median(numSteps)
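# Pair each label with its value so the legend shows two entries, not four
legend("topright", legend = c(paste("Average:", round(Average, 2)), paste("Median:", round(Median, 2))))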
Split the “meanDaily” data frame by interval, then take the mean of steps taken during that interval across the dates in “meanDaily”.
meanDaily$interval <- factor(meanDaily$interval)
interval <- split(meanDaily, meanDaily$interval)
avgSteps <- lapply(interval, function(x) {
  # Mean steps in this interval across all dates
  mean(x$steps)
})
Draw the time-series plot, then find the most active interval and show it in a legend.
plot(x = as.numeric(levels(meanDaily$interval)), y = as.numeric(avgSteps), type = "l", main = "Average Daily Activity Pattern", xlab = "Intervals: 00:00 to 24:00", ylab = "Average # of Steps")
intervals <- as.numeric(names(avgSteps))
mostActive <- as.numeric(avgSteps)
mat <- cbind(intervals, mostActive)
library(dplyr)
# tbl_dt() is no longer part of dplyr; filter() works on a plain data frame
mat <- as.data.frame(mat)
mostActive <- filter(mat, mostActive == max(mostActive))
mostActiveInterval <- mostActive$intervals
legend("topright", legend = paste("Most Active Interval =", mostActiveInterval))
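The same lookup can be sketched in base R with which.max, which returns the position of the largest value. This is an alternative to the dplyr route above, and the interval averages here are hypothetical.

```r
# Hypothetical interval averages, named by interval
avgToy <- c("0" = 2, "5" = 8, "10" = 4)

# Position of the largest average, then the interval label at that position
peak <- which.max(avgToy)
peakInterval <- as.numeric(names(avgToy)[peak])
peakInterval
```

Note that which.max returns only the first maximum, whereas the filter call keeps every row that ties for the maximum.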
First, identify the rows with NA values and output how many there are: 2304, which is 288 * 8 (i.e., steps is NA for all 288 intervals on 8 days). Then replace each NA with the average for its interval, taken from “avgSteps”.
bad <- is.na(activity$steps)
sum(bad)
## [1] 2304
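# Look up each missing row's interval average by name instead of relying on
# recycling; unlist() also keeps steps a numeric column rather than a list
activity$steps[bad] <- unlist(avgSteps)[as.character(activity$interval[bad])]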
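The replacement step can be sketched on a small example. It assumes, as in the analysis, that the interval averages are stored under names equal to the interval labels. The toy values are hypothetical.

```r
# Hypothetical interval averages, named by interval as in avgSteps
avgToy <- list("0" = 2, "5" = 8)

# Toy data with two fully missing days
toy <- data.frame(
  steps    = c(NA, NA, 3, 7, NA, NA),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

badToy <- is.na(toy$steps)

# Match each missing row to the average for its own interval, by name
lookup <- unlist(avgToy)
toy$steps[badToy] <- unname(lookup[as.character(toy$interval[badToy])])
toy$steps
```

Matching by name keeps the replacement correct even if the missing rows are not whole days in interval order.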
To produce a histogram of steps per day, proceed as in the first part of the project.
activity$date <- factor(activity$date)
daily2 <- split(activity, activity$date)
numSteps2 <- lapply(daily2, function(x) {
  # Total steps on this date, with NA's replaced
  sum(as.numeric(x$steps))
})
numSteps2 <- as.numeric(numSteps2)
hist(numSteps2, xlab = "Number of Steps", main = "Histogram: Total Steps per Day (NA's Replaced)")
Average2 <- mean(numSteps2)
Median2 <- median(numSteps2)
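# Pair each label with its value so the legend shows two entries, not four
legend("topright", legend = c(paste("Average:", round(Average2, 2)), paste("Median:", round(Median2, 2))))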
We see that the average hasn’t changed. This makes sense: each missing day was filled with the per-interval averages, so its daily total equals the average daily total, and adding values equal to the mean leaves the mean where it was. It also isn’t surprising that the median has shifted slightly upward to match the average because:
1. The 8 filled-in days all share the same total, namely the average, so that value now occurs 8 times and is much more likely to be the median; and
2. The median was very close to the mean to begin with.
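Both observations can be checked with a small numeric sketch: appending values equal to the mean leaves the mean unchanged and can pull the median onto it. The daily totals below are hypothetical.

```r
# Hypothetical daily totals for five days with data
totals <- c(8000, 9500, 10000, 12500, 15000)

# Three "missing" days filled in with the mean daily total
filled <- c(totals, rep(mean(totals), 3))

# Mean before and after filling
c(mean(totals), mean(filled))

# Median before and after filling
c(median(totals), median(filled))
```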
The following identifies the particular day in the week for each date in the data frame “activity”.
activity$date <- as.Date(activity$date)
library(dplyr)
weekdays <- weekdays(activity$date)
weekdays <- as.list(weekdays)
The following code creates a factor “classification”, which has two levels, “Weekday” and “Weekend”.
classification <- lapply(weekdays, function(x) {
  # Day names depend on the session locale; this assumes English names
  if (x != "Saturday" && x != "Sunday") "Weekday" else "Weekend"
})
classification <- as.factor(as.character(classification))
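The classification can also be sketched in vectorized form, without the list round trip. This variant is an alternative, not the approach used above: it relies on format(x, "%u"), which returns the weekday as a number (1 = Monday, 7 = Sunday) and so does not depend on English day names. The dates are hypothetical.

```r
# Hypothetical dates: Friday, Saturday, Sunday, Monday
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# Weekday number, 1 (Monday) through 7 (Sunday)
dayNum <- as.integer(format(dates, "%u"))

# Days 6 and 7 are the weekend
classToy <- factor(ifelse(dayNum >= 6, "Weekend", "Weekday"))
table(classToy)
```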
The factor “classification” is then appended as the fourth column (“V4”) in the data frame “activity”. This will make it easier to filter (from the dplyr package) two data frames, one for weekdays (“wkday”) and one for weekends (“wkend”).
activity$V4 <- classification
wkday <- filter(activity, V4 == "Weekday")
wkend <- filter(activity, V4 == "Weekend")
Now, we have the data frames we need to proceed as we did in the time-series plot above. We follow virtually the same steps to produce the plots as follows.
# Weekdays: mean steps per interval
wkday$interval <- factor(wkday$interval)
interval_wkday <- split(wkday, wkday$interval)
avgSteps_wkday <- lapply(interval_wkday, function(x) {
  mean(as.numeric(x$steps))
})
# Weekends: mean steps per interval
wkend$interval <- factor(wkend$interval)
interval_wkend <- split(wkend, wkend$interval)
avgSteps_wkend <- lapply(interval_wkend, function(x) {
  mean(as.numeric(x$steps))
})
# mai (inches) and mar (lines) both set the margins, so only mar is kept
par(mfcol = c(2,1), mar = c(2,4,1,2))
plot(x = as.numeric(levels(wkday$interval)), y = as.numeric(avgSteps_wkday), type = "l", ylab = "Mean # Steps: Weekdays", ylim = c(0,250), xlab = "")
plot(x = as.numeric(levels(wkend$interval)), y = as.numeric(avgSteps_wkend), type = "l", ylab = "Mean # Steps: Weekends", xlab = "Intervals: 00:00 to 24:00", ylim = c(0,250))
Note that we gave both plots identical y-limits so that the two panels can be compared directly.