Study of a personal activity monitoring

Introduction

This project makes use of data from a personal activity monitoring device which counts the number of steps taken by an anonymous individual in 5 minutes intervals throughout the day. The dataset was collected within a period of two months (October and November 2012).

The variables included in this dataset are:

steps: Number of steps taking in a 5-minute interval (missing values are coded as NA)

date: The date on which the measurement was taken in YYYY-MM-DD format

interval: Identifier for the 5-minute interval in which measurement was taken

The dataset is stored in a comma-separated-value (CSV) file with a total of 17,568 observations.

Within this Assignment, I address the following issues:

Loading and preprocessing the data:

library(dplyr)
library(xtable)
library(chron)
library(lattice)

According to the extension of the file “activity.csv”, the data read using the function read.csv():

activity <- read.csv("activity.csv")
activity$date <- as.Date(activity$date, format="%Y-%m-%d")
str(activity)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Date, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

What is mean total number of steps taken per day?

First calculate the total number of steps taken per day:

StepsPerDay <- activity %>% na.omit() %>% group_by(date) %>% summarise(TotSteps = sum(steps))
StepsPerDay

## # A tibble: 53 x 2
##    date       TotSteps
##    <date>        <int>
##  1 2012-10-02      126
##  2 2012-10-03    11352
##  3 2012-10-04    12116
##  4 2012-10-05    13294
##  5 2012-10-06    15420
##  6 2012-10-07    11015
##  7 2012-10-09    12811
##  8 2012-10-10     9900
##  9 2012-10-11    10304
## 10 2012-10-12    17382
## # ... with 43 more rows

Then, make a histogram of the total number of steps taken each day:

plot(StepsPerDay$date, StepsPerDay$TotSteps, type="h", lwd=5, col="red", xlab="Days", ylab="Number of steps", main="Total number of steps taken each day")

And finally, calculate and report the mean and median of the total number of steps taken per day

mean(StepsPerDay$TotSteps)

## [1] 10766.19

median(StepsPerDay$TotSteps)

## [1] 10765

What is the average daily activity pattern?

Make a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)

MeanInterval <- aggregate(activity$steps, by=list(interval=activity$interval), FUN = mean, na.rm = TRUE)
plot(MeanInterval$interval, MeanInterval$x, type = "l", col = "blue", xlab = "5-minute intervals", ylab = "Average nummber of steps", main = "Average number of steps averaged across all days")

Which 5-minute interval, on average across all the days in the dataset, contains the maximum number of steps?

MeanInterval[which.max(MeanInterval$x),]

##     interval        x
## 104      835 206.1698

Imputing missing values

Calculate and report the total number of missing values in the dataset (i.e. the total number of rows with NAs)

sum(is.na(activity$steps))

## [1] 2304

As a strategy for filling in all of the missing values in the dataset I use dplyr to group the data according to the day and replace the NA with the average number of step across all days of its corresponding interval (from the table MeanInterval).
Create a new dataset that is equal to the original dataset but with the missing data filled in.

activityFull <- activity %>% group_by(date) %>% mutate(steps = ifelse(is.na(steps), MeanInterval$x[match(interval, MeanInterval$interval)], steps))

4.Make a histogram of the total number of steps taken each day and calculate and report the mean and median total number of steps taken per day.

Do these values differ from the estimates from the first part of the assignment?

What is the impact of imputing missing data on the estimates of the total daily number of steps?

StepsPerDayFull <- activityFull %>% group_by(date) %>% summarise(TotSteps = sum(steps))
plot(StepsPerDayFull$date, StepsPerDayFull$TotSteps, type="h", lwd=5, col="red", xlab="Days", ylab="Number of steps", main="Total number of steps taken each day obtained from a full data set")

Recap <- xtable(cbind(c("Without \"NA\"", "With \"NA\""), c(mean(StepsPerDayFull$TotSteps), mean(StepsPerDay$TotSteps)), c(median(StepsPerDayFull$TotSteps), median(StepsPerDay$TotSteps))))
colnames(Recap) <- c("Type of data set", "Mean", "Median")
rownames(Recap) <- NULL
print(Recap, include.rownames=FALSE, type="html")

Type of data set	Mean	Median
Without “NA”	10766.1886792453	10766.1886792453
With “NA”	10766.1886792453	10765

Are there differences in activity patterns between weekdays and weekends?

Create a new factor variable in the dataset with two levels “weekday” and “weekend” indicating whether a given date is a weekday or weekend day.

activityFull <- activityFull %>% mutate(WE = weekdays(date)) %>% mutate(WE = ifelse(WE == "Samstag"|WE == "Sonntag", "weekend", "weekday"))

# With the library "Chron":
# activityFulltest <- activityFull %>% mutate(WE = ifelse(is.weekend(date), "weekend", "weekday")

table(activityFull$WE)

## 
## weekday weekend 
##   12960    4608

Make a panel plot containing a time series plot (i.e. type = “l”) of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis). See the README file in the GitHub repository to see an example of what this plot should look like using simulated data.

MeanIntervalFull <- aggregate(activityFull$steps, by=list(interval=activityFull$interval, WE=activityFull$WE), FUN = mean)
xyplot(x ~ interval | as.factor(WE), data = MeanIntervalFull, type = "l", xlab = "Interval", ylab = "Number of steps", main = "The average number of steps across all weekday days or weekend days", layout = c(1,2))