This project uses data from a personal activity monitoring device that counts the number of steps taken by an anonymous individual in 5-minute intervals throughout the day. The data were collected over two months (October and November 2012).
The variables included in this dataset are:
steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken
The dataset is stored in a comma-separated values (CSV) file with a total of 17,568 observations.
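As a quick sanity check on the stated size, the number of observations should equal the number of days times the number of 5-minute intervals per day. The sketch below assumes the recording runs from 2012-10-01 through 2012-11-30 inclusive; the end date is an assumption based on "October and November 2012".

```r
# Assumed recording window: all of October and November 2012
first_day <- as.Date("2012-10-01")
last_day  <- as.Date("2012-11-30")   # assumption, not stated in the description

n_days <- as.integer(last_day - first_day) + 1   # inclusive day count
intervals_per_day <- (24 * 60) / 5               # 5-minute slots in 24 hours

n_days * intervals_per_day   # 61 * 288 = 17568
```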
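In this assignment, I load the data, summarize the total number of steps per day, examine the average daily activity pattern, impute the missing values, and compare weekday and weekend activity. The analysis uses the following packages: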
library(dplyr)
library(xtable)
library(chron)
library(lattice)
Because the file “activity.csv” is in CSV format, the data are read with the function read.csv():
activity <- read.csv("activity.csv")
activity$date <- as.Date(activity$date, format="%Y-%m-%d")
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
StepsPerDay <- activity %>% na.omit() %>% group_by(date) %>% summarise(TotSteps = sum(steps))
StepsPerDay
## # A tibble: 53 x 2
## date TotSteps
## <date> <int>
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
## 7 2012-10-09 12811
## 8 2012-10-10 9900
## 9 2012-10-11 10304
## 10 2012-10-12 17382
## # ... with 43 more rows
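Dropping the NA rows before grouping removes days that have no measurements at all, which is presumably why the summary has 53 rows rather than one row per calendar day. The toy data below (made up for illustration) contrasts this with sum(..., na.rm = TRUE), which would keep such days and report a total of 0.

```r
library(dplyr)

# Made-up miniature data set: day 1 has no measurements, day 2 is complete
toy <- data.frame(
  steps    = c(NA, NA, 10L, 20L),
  date     = as.Date(c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02")),
  interval = c(0L, 5L, 0L, 5L)
)

# Approach used in this report: drop NA rows first, so 2012-10-01 disappears
dropped <- toy %>% na.omit() %>% group_by(date) %>% summarise(TotSteps = sum(steps))

# Alternative: keep every day; an all-NA day then gets a total of 0
kept <- toy %>% group_by(date) %>% summarise(TotSteps = sum(steps, na.rm = TRUE))

nrow(dropped)  # 1
nrow(kept)     # 2
```

A spurious total of 0 would pull the mean and median down, so dropping the empty days is the safer choice for these summary statistics.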
plot(StepsPerDay$date, StepsPerDay$TotSteps, type="h", lwd=5, col="red", xlab="Days", ylab="Number of steps", main="Total number of steps taken each day")
mean(StepsPerDay$TotSteps)
## [1] 10766.19
median(StepsPerDay$TotSteps)
## [1] 10765
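Note that plot(..., type = "h") draws one vertical line per date, that is, a time series of daily totals. If a histogram in the strict sense is wanted (the distribution of the daily totals), hist() can be applied to the totals instead. A minimal sketch using only the first ten daily totals printed above; the bin width of 5000 is an arbitrary choice.

```r
# First ten daily totals as printed in the StepsPerDay output
tot_steps <- c(126, 11352, 12116, 13294, 15420, 11015, 12811, 9900, 10304, 17382)

# hist() bins the totals; plot = FALSE returns the bin counts without drawing
h <- hist(tot_steps, breaks = seq(0, 20000, by = 5000), plot = FALSE)
h$counts  # 1 1 6 2

# Drawing version with the same bins
hist(tot_steps, breaks = seq(0, 20000, by = 5000),
     col = "red", xlab = "Total steps per day",
     main = "Distribution of total steps per day")
```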
MeanInterval <- aggregate(activity$steps, by=list(interval=activity$interval), FUN = mean, na.rm = TRUE)
plot(MeanInterval$interval, MeanInterval$x, type = "l", col = "blue", xlab = "5-minute intervals", ylab = "Average number of steps", main = "Average number of steps per 5-minute interval across all days")
MeanInterval[which.max(MeanInterval$x),]
## interval x
## 104 835 206.1698
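The interval identifier appears to encode the clock time as hours and minutes: interval 835 sits in row 104, and (104 - 1) * 5 = 515 minutes, which is 8 h 35 min. Under that assumption it can be turned into a readable time. A small sketch; the helper name interval_to_time is made up for illustration.

```r
# Hypothetical helper: convert an HHMM-style interval id to "HH:MM"
interval_to_time <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

interval_to_time(c(0, 45, 835))  # "00:00" "00:45" "08:35"
```

Read this way, the most active interval on average starts at 08:35.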
sum(is.na(activity$steps))
## [1] 2304
To fill in the missing values, I use dplyr to replace each NA with the mean number of steps for its 5-minute interval, averaged across all days (taken from the table MeanInterval).
Create a new dataset that is equal to the original dataset but with the missing data filled in.
activityFull <- activity %>% group_by(date) %>% mutate(steps = ifelse(is.na(steps), MeanInterval$x[match(interval, MeanInterval$interval)], steps))
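To see how the match() lookup works, here is the same idea on a made-up data set with two intervals and three days, one of which is entirely missing. All values are illustrative.

```r
library(dplyr)

# Made-up data: 2012-10-01 is fully missing, the other two days are complete
toy <- data.frame(
  steps    = c(NA, NA, 10, 30, 20, 50),
  date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)),
  interval = rep(c(0L, 5L), times = 3)
)

# Per-interval mean over the days that have data (same call shape as MeanInterval)
toy_means <- aggregate(toy$steps, by = list(interval = toy$interval),
                       FUN = mean, na.rm = TRUE)

# match() finds, for each row, the position of its interval in the lookup table
toy_full <- toy %>%
  mutate(steps = ifelse(is.na(steps),
                        toy_means$x[match(interval, toy_means$interval)],
                        steps))

toy_full$steps[1:2]  # 15 40
```

The imputed day totals 55 steps, which equals the mean total of the two complete days ((40 + 70) / 2). This is why filling entirely missing days with interval means leaves the mean daily total unchanged.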
4. Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day.
Do these values differ from the estimates from the first part of the assignment?
What is the impact of imputing missing data on the estimates of the total daily number of steps?
StepsPerDayFull <- activityFull %>% group_by(date) %>% summarise(TotSteps = sum(steps))
plot(StepsPerDayFull$date, StepsPerDayFull$TotSteps, type="h", lwd=5, col="red", xlab="Days", ylab="Number of steps", main="Total number of steps taken each day (imputed data set)")
Recap <- xtable(cbind(c("Without \"NA\"", "With \"NA\""), c(mean(StepsPerDayFull$TotSteps), mean(StepsPerDay$TotSteps)), c(median(StepsPerDayFull$TotSteps), median(StepsPerDay$TotSteps))))
colnames(Recap) <- c("Type of data set", "Mean", "Median")
rownames(Recap) <- NULL
print(Recap, include.rownames=FALSE, type="html")
| Type of data set | Mean | Median |
|---|---|---|
| Without “NA” | 10766.1886792453 | 10766.1886792453 |
| With “NA” | 10766.1886792453 | 10765 |
Here “Without NA” refers to the imputed data set and “With NA” to the original data, where rows with missing values were dropped before summing. The mean is identical in both cases (10766.19), while the median rises slightly from 10765 to 10766.19. The 2304 missing values correspond to 8 entirely missing days (2304 / 288), and each imputed day receives the same total, equal to the mean daily total. Imputation therefore leaves the mean unchanged and moves the median onto it.
# as.POSIXlt()$wday is 0 for Sunday and 6 for Saturday, independent of the locale
# (weekdays() returns locale-dependent names such as "Samstag" or "Saturday")
activityFull <- activityFull %>% mutate(WE = ifelse(as.POSIXlt(date)$wday %in% c(0, 6), "weekend", "weekday"))
# With the "chron" package:
# activityFulltest <- activityFull %>% mutate(WE = ifelse(is.weekend(date), "weekend", "weekday"))
table(activityFull$WE)
##
## weekday weekend
## 12960 4608
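Because weekdays() returns day names in the session's locale, a numeric weekday is a more portable basis for the weekend flag. The sketch below uses as.POSIXlt()$wday (0 is Sunday, 6 is Saturday) and checks the counts above, assuming the recording window is 2012-10-01 through 2012-11-30 with 288 intervals per day.

```r
# Assumed recording window (the end date is an assumption)
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")

# $wday is 0 for Sunday ... 6 for Saturday, regardless of locale
day_type <- ifelse(as.POSIXlt(days)$wday %in% c(0, 6), "weekend", "weekday")

# Rows per category at 288 intervals per day
table(day_type) * 288
```

Under these assumptions there are 45 weekdays and 16 weekend days, which gives 12960 and 4608 rows.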
MeanIntervalFull <- aggregate(activityFull$steps, by=list(interval=activityFull$interval, WE=activityFull$WE), FUN = mean)
xyplot(x ~ interval | as.factor(WE), data = MeanIntervalFull, type = "l", xlab = "Interval", ylab = "Number of steps", main = "Average number of steps per interval on weekdays and weekends", layout = c(1,2))