Reproducible Research: Peer Assessment 1

Loading and preprocessing the data
What is mean total number of steps taken per day?
What is the average daily activity pattern?
Imputing missing values
Are there differences in activity patterns between weekdays and weekends?

Loading and preprocessing the data

The source data for this assessment is in the activity.zip file.

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken

Unzip and load source data:

measurements <- read.csv(unz("activity.zip", "activity.csv"))
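After loading, a quick check of the column types can catch surprises early. The sketch below is only an illustration: it uses a small inline CSV with made-up values as a stand-in for activity.csv (the names csvText and toy are hypothetical) and converts the date column to the Date class.

```r
# Toy stand-in for activity.csv: same three columns, invented values.
csvText <- "steps,date,interval
NA,2012-10-01,0
12,2012-10-02,0
7,2012-10-02,5"

toy <- read.csv(text = csvText, stringsAsFactors = FALSE)

# Parse the date column explicitly so later date handling does not
# depend on whether read.csv produced factors or character strings.
toy$date <- as.Date(toy$date, format = "%Y-%m-%d")

# Show column types and the first values.
str(toy)
```

The same as.Date call could be applied to measurements$date, assuming the dates follow the YYYY-MM-DD format described above.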

What is mean total number of steps taken per day?

Calculate the total number of steps taken per day and plot a histogram of the totals:

stepsPerDay <- aggregate(steps ~ date, measurements, sum)
hist(stepsPerDay$steps, main = "Steps per day", xlab = "Steps", col = "green", breaks = 8)
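One detail of the aggregation is worth illustrating: the formula interface of aggregate omits rows with missing values by default (na.action = na.omit), so a day whose observations are all NA does not appear in the result at all. The sketch below shows this on a made-up data frame (toy and the labels d1 to d3 are hypothetical).

```r
# Toy data: day d1 has only missing values, d3 has one missing value.
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 30, NA),
  date  = c("d1", "d1", "d2", "d2", "d3", "d3")
)

# Rows with NA in steps are dropped before summing.
perDay <- aggregate(steps ~ date, toy, sum)

# d1 is absent from the result; d2 and d3 each total 30.
print(perDay)
```

This matters when comparing results before and after imputation, because the number of days entering the mean and the median can change.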

Calculate the mean and median of the total number of steps taken per day:

meanStepsPerDay <- mean(stepsPerDay$steps)
medianStepsPerDay <- median(stepsPerDay$steps)

The mean is 10766.19.
The median is 10765.

The mean and the median are very close to each other, which suggests that the distribution of daily totals is roughly symmetric and that the mean is a representative summary of a typical day.

What is the average daily activity pattern?

A time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):

stepsInterval <- aggregate(steps ~ interval, measurements, mean)
plot(stepsInterval$interval, stepsInterval$steps, type="l", xlab = "5 min - interval", ylab = "Average steps", main = "Average Daily Activity Pattern", col = "green")

The 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps:

stepsInterval$interval[which.max(stepsInterval$steps)]

## [1] 835
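The identifier is easier to read as a clock time. The sketch below assumes that interval identifiers encode the start of the interval as HHMM without leading zeros (so 835 would mean 08:35); this encoding is an assumption about the dataset, and intervalToClock is a hypothetical helper name.

```r
# Assumption: the identifier is hours * 100 + minutes (HHMM without padding).
intervalToClock <- function(interval) {
  hours   <- as.integer(interval %/% 100)
  minutes <- as.integer(interval %% 100)
  sprintf("%02d:%02d", hours, minutes)
}

# Convert a few identifiers, including the peak interval found above.
intervalToClock(c(0, 55, 835, 2355))
```

Under that assumption, the peak of average activity falls in the five minutes starting at 08:35.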

Imputing missing values

The total number of missing values in the dataset is:

sum(is.na(measurements$steps))

## [1] 2304

Fill in all of the missing values by creating a new dataset that is equal to the original dataset but with the missing data filled in.

Strategy: replace every NA value with 0.

measurementsWithoutNAs <- measurements
measurementsWithoutNAs[is.na(measurementsWithoutNAs$steps), "steps"] <- 0

Calculate the total number of steps taken per day for the filled-in dataset and plot a histogram of the totals:

stepsPerDayNoNAs <- aggregate(steps ~ date, measurementsWithoutNAs, sum)
hist(stepsPerDayNoNAs$steps, main = "Steps per day", xlab = "Steps", col = "blue", breaks = 8)

Calculate the mean and median of the total number of steps taken per day:

meanStepsPerDayNoNAs <- mean(stepsPerDayNoNAs$steps)
medianStepsPerDayNoNAs <- median(stepsPerDayNoNAs$steps)

The mean is 9354.23.
The median is 10395.

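These values differ from the estimates in the first part of the assignment. The result depends on the strategy used to fill in the NAs. Because every NA was replaced with 0, any day whose observations were all missing, and which aggregate therefore dropped in the first part, now appears with a total of 0 steps. The distribution of daily totals is shifted to the left, toward 0, and both the mean and the median decrease.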
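Replacing NA with 0 is only one possible strategy. As a point of comparison, the sketch below fills each missing value with the mean of its 5-minute interval instead. This is a swapped-in alternative, not the method used in this report, and it runs on a made-up data frame (toy, intervalMeans and filled are hypothetical names).

```r
# Toy data: two intervals (0 and 5), each with one missing observation.
toy <- data.frame(
  steps    = c(NA, 4, 10, NA, 20, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean of each interval over the non-missing observations.
intervalMeans <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Copy the data and replace each NA with the mean of its interval,
# looking the mean up by the interval identifier used as a name.
filled <- toy
missing <- is.na(filled$steps)
filled$steps[missing] <- intervalMeans[as.character(filled$interval[missing])]

print(filled)
```

Filling with interval means tends to keep the daily totals near the typical day rather than pulling them toward 0, which is one reason the choice of strategy changes the summary statistics.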

Are there differences in activity patterns between weekdays and weekends?

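Create a new factor variable in the dataset with two levels, "weekday" and "weekend", indicating whether a given date is a weekday or a weekend day.
The wday component of POSIXlt encodes the day of the week as a number: 0 is Sunday, 1 is Monday, and so on up to 6 for Saturday.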

measurementsWithoutNAs$day <- as.POSIXlt(measurementsWithoutNAs$date)$wday
measurementsWithoutNAs$dayType <- as.factor(ifelse(measurementsWithoutNAs$day == 0 | measurementsWithoutNAs$day == 6, "weekend", "weekday"))
measurementsWithoutNAs <- subset(measurementsWithoutNAs, select = -c(day))

head(measurementsWithoutNAs)

##   steps       date interval dayType
## 1     0 2012-10-01        0 weekday
## 2     0 2012-10-01        5 weekday
## 3     0 2012-10-01       10 weekday
## 4     0 2012-10-01       15 weekday
## 5     0 2012-10-01       20 weekday
## 6     0 2012-10-01       25 weekday
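The first rows all fall on a single weekday, so they do not show the weekend branch. The sketch below checks the same classification logic on three hand-picked dates (a Monday, a Saturday and a Sunday); the vectors dates, wday and dayType are hypothetical names used only for this illustration.

```r
# 2012-10-01 is a Monday, 2012-10-06 a Saturday, 2012-10-07 a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# wday is numeric (0 = Sunday ... 6 = Saturday), so it does not depend
# on the locale the way day names from weekdays() do.
wday <- as.POSIXlt(dates)$wday

dayType <- factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
                  levels = c("weekday", "weekend"))

print(data.frame(date = dates, wday = wday, dayType = dayType))
```

Using the numeric wday value rather than day names keeps the result the same on systems with a non-English locale.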

Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis):

weekdaysData <- measurementsWithoutNAs[measurementsWithoutNAs$dayType == "weekday",]
weekendsData <- measurementsWithoutNAs[measurementsWithoutNAs$dayType == "weekend",]
stepsIntervalWeekdays <- aggregate(steps ~ interval, weekdaysData, mean)
stepsIntervalWeekends <- aggregate(steps ~ interval, weekendsData, mean)

par(mfrow = c(2, 1))

plot(stepsIntervalWeekdays, type = "l", col = "green", main = "Weekdays")
plot(stepsIntervalWeekends, type = "l", col = "red", main = "Weekends")
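The two subsets and two aggregate calls can also be expressed as a single aggregation over both grouping variables. The sketch below shows that variant on a made-up data frame (toy and byType are hypothetical names); it is an alternative formulation, not a change to the analysis above.

```r
# Toy data with the same columns as the filled-in dataset.
toy <- data.frame(
  steps    = c(0, 10, 20, 30, 4, 8),
  interval = c(0, 0, 5, 5, 0, 5),
  dayType  = factor(c("weekday", "weekday", "weekday", "weekday",
                      "weekend", "weekend"))
)

# One row per (interval, dayType) combination with the mean steps.
byType <- aggregate(steps ~ interval + dayType, toy, mean)
print(byType)

# One panel per day type, stacked vertically.
par(mfrow = c(2, 1))
for (type in levels(byType$dayType)) {
  panel <- byType[byType$dayType == type, ]
  plot(panel$interval, panel$steps, type = "l", main = type,
       xlab = "Interval", ylab = "Average steps")
}
```

Keeping both day types in one data frame makes it easier to use the same axis labels for both panels and to add further groupings later.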

As the plots show, the activity patterns on weekdays and weekends differ. On weekends, activity is spread almost uniformly across the day (excluding the night). On weekdays, there is a clear peak in the morning.