1. Load the data
data <- read.csv("activity.csv")
head(data)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
2. Process/transform the data (if necessary) into a format suitable for your analysis
data$date <- as.Date(data$date, "%Y-%m-%d")
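A quick check after the conversion can catch malformed dates early, because `as.Date()` returns `NA` for strings that do not match the format instead of raising an error. The sketch below uses a made-up vector (`raw_dates` is illustrative, not taken from activity.csv):

```r
# Toy stand-in for the date column; the third value is deliberately malformed
raw_dates <- c("2012-10-01", "2012-10-02", "2012/10/03")

parsed <- as.Date(raw_dates, format = "%Y-%m-%d")

# Strings that do not match the format become NA rather than an error
bad <- raw_dates[is.na(parsed)]
print(bad)
```

If `bad` is empty for the real column, every date was parsed successfully.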
1. Calculate the total number of steps taken per day
library(dplyr, warn.conflicts = FALSE)
dataByDate <- data %>% group_by(date) %>% summarise(sum(steps, na.rm = TRUE))
2. Make a histogram of the total number of steps taken each day
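colnames(dataByDate) <- c("Date", "Sum")
# Days whose values are all NA sum to 0 under na.rm = TRUE; drop them here.
# Note that this would also drop any day with a genuine total of 0 steps.
dataByDate.f <- filter(dataByDate, Sum != 0)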
hist(dataByDate.f$Sum, main = "Histogram of Total Number of Steps Taken per Day", xlab = "total number of steps taken per day", col = "green")
3. Calculate and report the mean and median of the total number of steps taken per day
round(mean(dataByDate.f$Sum))
## [1] 10766
round(median(dataByDate.f$Sum))
## [1] 10765
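Excluding days with no recorded steps, the mean total number of steps per day is 10766 and the median is 10765 (both rounded to the nearest step).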
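Filtering on a zero sum cannot tell a day with no measurements from a day with a true total of zero. One possible alternative, sketched here in base R on made-up data (the `toy` data frame is hypothetical), is to drop the missing rows before summing:

```r
# Hypothetical data: day 1 is entirely missing, day 2 is a genuine zero day
toy <- data.frame(
  steps = c(NA, NA, 0, 0, 10, 20),
  date  = as.Date(c("2012-10-01", "2012-10-01",
                    "2012-10-02", "2012-10-02",
                    "2012-10-03", "2012-10-03"))
)

# Remove missing rows first, then total the remaining steps per day
observed <- toy[!is.na(toy$steps), ]
daily <- aggregate(steps ~ date, data = observed, FUN = sum)
print(daily)
```

With this approach the all-missing day disappears from the result, while the zero-step day is kept.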
1. Make a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis).
dataByInterval <- data %>% group_by(interval) %>% summarise(mean(steps, na.rm = TRUE))
colnames(dataByInterval) = c("Interval", "Mean")
plot(dataByInterval$Mean ~ dataByInterval$Interval, type = "l", main = "Average Number of Steps Taken", xlab = "5-minute interval", ylab = "average steps across all days", col = "red")
2. Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?
filter(dataByInterval, Mean == max(dataByInterval$Mean))$Interval
## [1] 835
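The 5-minute interval with identifier 835 contains the maximum number of steps on average across all days. Note that 835 is the interval's label, not its ordinal position within the day.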
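The interval identifiers look like clock times written as HHMM (0, 5, ..., 55, 100, ...), although the output above does not confirm this. Under that assumption, a small helper (`interval_to_clock` is a hypothetical name) can turn an identifier into a readable time:

```r
# Assumption: the identifier encodes hours * 100 + minutes
interval_to_clock <- function(interval) {
  hours   <- as.integer(interval %/% 100)
  minutes <- as.integer(interval %% 100)
  sprintf("%02d:%02d", hours, minutes)
}

clock <- interval_to_clock(c(0, 55, 100, 835, 2355))
print(clock)
```

Read this way, identifier 835 would correspond to the interval starting at 08:35.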
1. Calculate and report the total number of missing values in the dataset.
sum(is.na(data$steps))
## [1] 2304
The total number of missing values in the dataset is 2304.
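It can also help to see which columns contain missing values and what share of rows is affected. A minimal sketch on made-up data (the `toy` data frame is illustrative only):

```r
# Hypothetical data with the same column names as the dataset
toy <- data.frame(
  steps    = c(NA, 3, NA, 7),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0, 5, 0, 5)
)

# Count of missing values in each column
na_per_column <- colSums(is.na(toy))

# Proportion of rows with a missing steps value
na_share <- mean(is.na(toy$steps))

print(na_per_column)
print(na_share)
```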
2. Devise a strategy for filling in all of the missing values in the dataset. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.
The strategy used here replaces each NA with the mean number of steps for the corresponding 5-minute interval, averaged across all days (the Mean column of dataByInterval computed above).
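# Build the imputed steps vector row by row
fillNA <- numeric()
for (i in seq_len(nrow(data))) {
  temp <- data[i, ]
  if (is.na(temp$steps)) {
    # Missing: use the across-day mean for this row's interval
    steps <- subset(dataByInterval, Interval == temp$interval)$Mean
  } else {
    # Observed: keep the recorded value
    steps <- temp$steps
  }
  fillNA <- c(fillNA, steps)
}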
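The same interval-mean imputation can be written without an explicit loop. The sketch below is an alternative, not the code used in this report. It relies on base R's `ave()` and runs on made-up data (`toy` is hypothetical):

```r
# Hypothetical data: two intervals observed over three days
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 8),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Per-row mean of the row's interval, ignoring missing values
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values; substitute the interval mean where steps is NA
filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
print(filled)
```

On a dataset of this size the vectorised form also avoids repeatedly growing a vector inside the loop.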
3. Create a new dataset that is equal to the original dataset but with the missing data filled in.
newData <- data
newData$steps <- fillNA
head(newData)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
4. Make a histogram of the total number of steps taken each day, and calculate and report the mean and median total number of steps taken per day. Do these values differ from the estimates from the first part of the assignment? What is the impact of imputing missing data on the estimates of the total daily number of steps?
newDataByDate <- newData %>% group_by(date) %>% summarise(sum(steps))
colnames(newDataByDate) = c("Date", "Sum")
hist(newDataByDate$Sum, main = "Histogram of Total Number of Steps Taken per Day", xlab = "total number of steps taken per day", col = "green")
round(mean(newDataByDate$Sum))
## [1] 10766
round(median(newDataByDate$Sum))
## [1] 10766
After filling in the missing values, the mean is unchanged at 10766 and the median increases by 1, from 10765 to 10766. Imputing with interval means therefore has very little impact on these estimates of the total daily number of steps.
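One possible explanation for the unchanged mean is that the missing values cover whole days. The source does not state this, but 2304 equals 8 × 288, the number of 5-minute intervals in 8 days. In that case each imputed day receives a total equal to the mean of the observed daily totals. A sketch on made-up data (`toy` is hypothetical) illustrates the effect:

```r
# Hypothetical data: days 1 and 2 observed, day 3 entirely missing
toy <- data.frame(
  steps    = c(2, 4, 4, 8, NA, NA),
  day      = rep(1:3, each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean daily total before imputation, skipping the all-missing day
observed_totals <- tapply(toy$steps, toy$day, sum)
mean_before <- mean(observed_totals, na.rm = TRUE)

# Impute with interval means, then recompute the daily totals
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))
filled <- ifelse(is.na(toy$steps), interval_mean, toy$steps)
filled_totals <- tapply(filled, toy$day, sum)
mean_after <- mean(filled_totals)

print(c(before = mean_before, after = mean_after))
```

Because the imputed day's total sits exactly at the mean, adding such days leaves the mean where it was and can pull the median toward it.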
1. Create a new factor variable in the dataset with two levels - “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.
# weekdays() returns locale-dependent names; the comparison assumes an English locale
day <- weekdays(newData$date)
level <- character(nrow(newData))
for (i in seq_len(nrow(newData))) {
  if (day[i] %in% c("Saturday", "Sunday")) {
    level[i] <- "Weekend"
  } else {
    level[i] <- "Weekday"
  }
}
newData <- mutate(newData, weekday.level = as.factor(level))
head(newData)
## steps date interval weekday.level
## 1 1.7169811 2012-10-01 0 Weekday
## 2 0.3396226 2012-10-01 5 Weekday
## 3 0.1320755 2012-10-01 10 Weekday
## 4 0.1509434 2012-10-01 15 Weekday
## 5 0.0754717 2012-10-01 20 Weekday
## 6 2.0943396 2012-10-01 25 Weekday
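Comparing against the English day names only works when R runs in an English locale. A locale-independent alternative, sketched on a few made-up dates, uses the numeric day of the week from `as.POSIXlt()`, where 0 is Sunday and 6 is Saturday:

```r
# Illustrative dates: a Monday, a Friday, a Saturday and a Sunday
dates <- as.Date(c("2012-10-01", "2012-10-05", "2012-10-06", "2012-10-07"))

# wday is numeric (0 = Sunday, ..., 6 = Saturday), so it does not depend on the locale
wday <- as.POSIXlt(dates)$wday

level <- factor(ifelse(wday %in% c(0, 6), "Weekend", "Weekday"),
                levels = c("Weekday", "Weekend"))
print(level)
```

This is vectorised, so no loop over the rows is needed.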
2. Make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis).
newDataByDay <- newData %>% group_by(interval, weekday.level) %>% summarise(mean(steps))
colnames(newDataByDay) = c("interval", "weekday.level", "mean")
library(ggplot2)
ggplot(newDataByDay, aes(interval, mean)) +
  geom_line(col = "red") +
  facet_grid(weekday.level ~ .) +
  labs(x = "5-minute interval", y = "average steps across weekday levels", title = "Average Number of Steps Taken")
The plot above shows a slight difference in activity patterns between weekdays and weekends, although the overall trend across the day is much the same in both panels.
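A visual comparison can be backed up with a few numbers, such as the peak interval and the overall average for each level. The sketch below uses made-up values in a data frame with the same column names as newDataByDay (`toy` and its numbers are hypothetical, not results from this analysis):

```r
# Hypothetical per-interval means for each level
toy <- data.frame(
  interval      = rep(c(0, 5, 10), times = 2),
  weekday.level = rep(c("Weekday", "Weekend"), each = 3),
  mean          = c(1, 9, 2, 3, 4, 5)
)

by_level <- split(toy, toy$weekday.level)

# Interval with the highest average steps in each level
peaks <- sapply(by_level, function(d) d$interval[which.max(d$mean)])

# Average of the interval means in each level
averages <- sapply(by_level, function(d) mean(d$mean))

print(peaks)
print(averages)
```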