The source data for this assessment is in the activity.zip file.
The variables included in this dataset are steps, date, and interval.
Unzip and load source data:
measurements <- read.csv(unz("activity.zip", "activity.csv"))
Calculate the total number of steps taken per day:
stepsPerDay <- aggregate(steps ~ date, measurements, sum)
hist(stepsPerDay$steps, main = "Steps per day", xlab = "Steps", col = "green", breaks = 8)
Calculate the mean and median of the total number of steps taken per day:
meanStepsPerDay <- mean(stepsPerDay$steps)
medianStepsPerDay <- median(stepsPerDay$steps)
The mean is 10766.19.
The median is 10765.
The mean and the median are very close to each other, so the mean is a representative summary of the daily totals.
A time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis):
stepsInterval <- aggregate(steps ~ interval, measurements, mean)
plot(stepsInterval$interval, stepsInterval$steps, type="l", xlab = "5 min - interval", ylab = "Average steps", main = "Average Daily Activity Pattern", col = "green")
The 5-minute interval that, on average across all the days in the dataset, contains the maximum number of steps:
stepsInterval$interval[which.max(stepsInterval$steps)]
## [1] 835
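The interval identifier is easier to read as a clock time. The following is a minimal sketch, assuming the identifiers encode the time of day as hour * 100 + minute, so that 835 would mean 08:35. The report does not state this encoding, and `intervalToClock` is a hypothetical helper introduced only for illustration.

```r
# Hypothetical helper: convert an interval identifier to "HH:MM".
# Assumption (not stated in the report): identifiers encode hour * 100 + minute.
intervalToClock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

intervalToClock(c(0, 5, 835, 2355))
```

If the assumption holds, the busiest interval corresponds to the five minutes starting at 08:35.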
The total number of missing values in the dataset is:
nrow(measurements[is.na(measurements$steps),])
## [1] 2304
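The same count can be obtained without subsetting the data frame. The sketch below uses a small made-up data frame (`toy`, not the real measurements) to show a few equivalent ways of summarising missing values.

```r
# Toy data standing in for the real measurements (values are made up).
toy <- data.frame(
  steps    = c(NA, 3, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

naCount     <- sum(is.na(toy$steps))   # number of missing step counts
naPerColumn <- colSums(is.na(toy))     # missing values in every column
naShare     <- mean(is.na(toy$steps))  # fraction of rows with missing steps

naCount
naPerColumn
naShare
```

`colSums(is.na(...))` is useful for confirming that only the steps column contains missing values before choosing a filling strategy.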
Fill in all of the missing values in the dataset by creating a new dataset that is equal to the original dataset but with the missing data filled in.
Strategy: replace every NA value with 0.
measurementsWithoutNAs <- measurements
measurementsWithoutNAs[is.na(measurementsWithoutNAs$steps), "steps"] <- 0
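Replacing NAs with 0 is only one possible strategy. As an alternative that this report does not use, the sketch below fills each missing value with the mean of the same 5-minute interval across the other days. The data frame `toy` and its values are made up for illustration.

```r
# Toy data: three days, two intervals per day, two missing values.
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 10),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3)
)

# Mean of each interval over the days where it was observed,
# repeated so that it lines up row by row with the data.
intervalMean <- ave(toy$steps, toy$interval,
                    FUN = function(x) mean(x, na.rm = TRUE))

# Keep observed values and use the interval mean only where steps is NA.
imputed <- toy
imputed$steps <- ifelse(is.na(toy$steps), intervalMean, toy$steps)
imputed
```

Unlike the zero fill, this kind of imputation keeps the filled values close to the typical activity level for that time of day.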
Calculate the total number of steps taken per day:
stepsPerDayNoNAs <- aggregate(steps ~ date, measurementsWithoutNAs, sum)
hist(stepsPerDayNoNAs$steps, main = "Steps per day", xlab = "Steps", col = "blue", breaks = 8)
Calculate the mean and median of the total number of steps taken per day:
meanStepsPerDayNoNAs <- mean(stepsPerDayNoNAs$steps)
medianStepsPerDayNoNAs <- median(stepsPerDayNoNAs$steps)
The mean is 9354.23.
The median is 10395.
These values differ from the estimates in the first part of the assignment. With a formula, aggregate() drops rows whose steps value is NA, so the first estimates ignored the missing measurements, whereas here they count as zero steps. The result therefore depends on how the NAs are filled: with 0 as the fill value, the daily totals shift left, toward 0, which lowers both the mean and the median.
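The effect can be reproduced on a small example. The sketch below uses made-up numbers and assumes, purely for illustration, that one whole day is missing. The report does not show how the missing values are actually distributed across days.

```r
# Toy data: the middle day has no observed steps at all.
toy <- data.frame(
  steps = c(10, 20, NA, NA, 30, 40),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2)
)

# The formula interface omits NA rows by default, so 2012-10-02 disappears.
perDay <- aggregate(steps ~ date, toy, sum)

# After the zero fill the same day comes back with a total of 0.
zeroFilled <- toy
zeroFilled$steps[is.na(zeroFilled$steps)] <- 0
perDayZero <- aggregate(steps ~ date, zeroFilled, sum)

mean(perDay$steps)      # mean over the 2 days that have data
mean(perDayZero$steps)  # mean over all 3 days, one of them with total 0
```

In this toy case the total number of steps is unchanged, but it is divided by more days, so the mean drops.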
Create a new factor variable in the dataset with two levels, "weekday" and "weekend", indicating whether a given date is a weekday or weekend day.
In the wday field of POSIXlt, 0 is Sunday, 1 is Monday, and so on up to 6 for Saturday.
measurementsWithoutNAs$day <- as.POSIXlt(measurementsWithoutNAs$date)$wday
measurementsWithoutNAs$dayType <- as.factor(ifelse(measurementsWithoutNAs$day == 0 | measurementsWithoutNAs$day == 6, "weekend", "weekday"))
measurementsWithoutNAs <- subset(measurementsWithoutNAs, select = -c(day))
head(measurementsWithoutNAs)
##   steps       date interval dayType
## 1     0 2012-10-01        0 weekday
## 2     0 2012-10-01        5 weekday
## 3     0 2012-10-01       10 weekday
## 4     0 2012-10-01       15 weekday
## 5     0 2012-10-01       20 weekday
## 6     0 2012-10-01       25 weekday
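The classification can also be wrapped in a small function, which avoids the temporary day column. This is a minimal sketch, and `dayTypeOf` is a hypothetical helper name, not part of the original analysis.

```r
# Hypothetical helper: label dates as "weekday" or "weekend".
dayTypeOf <- function(dates) {
  wday <- as.POSIXlt(as.Date(dates))$wday  # 0 = Sunday, ..., 6 = Saturday
  factor(ifelse(wday %in% c(0, 6), "weekend", "weekday"),
         levels = c("weekday", "weekend"))
}

# A Monday, a Saturday and a Sunday.
dayTypeOf(c("2012-10-01", "2012-10-06", "2012-10-07"))
```

Working with the numeric wday value rather than day names from weekdays() keeps the result independent of the system locale, and setting the levels explicitly guarantees that both levels exist even when the input contains only one day type.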
Make a panel plot containing a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis):
weekdaysData <- measurementsWithoutNAs[measurementsWithoutNAs$dayType == "weekday",]
weekendsData <- measurementsWithoutNAs[measurementsWithoutNAs$dayType == "weekend",]
stepsIntervalWeekdays <- aggregate(steps ~ interval, weekdaysData, mean)
stepsIntervalWeekends <- aggregate(steps ~ interval, weekendsData, mean)
par(mfrow = c(2, 1))
plot(stepsIntervalWeekdays, type = "l", col = "green", main = "Weekdays")
plot(stepsIntervalWeekends, type = "l", col = "red", main = "Weekends")
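The two subsets and two aggregate() calls can be collapsed into one call that groups by both interval and day type. The sketch below shows this on a made-up data frame (`toy`). The name `stepsByType` is introduced here only for illustration.

```r
# Toy data: two intervals, two weekday observations and one weekend
# observation per interval (values are made up).
toy <- data.frame(
  steps    = c(1, 3, 5, 7, 10, 20),
  interval = c(0, 5, 0, 5, 0, 5),
  dayType  = factor(c("weekday", "weekday", "weekday", "weekday",
                      "weekend", "weekend"))
)

# One row per (interval, dayType) combination with the mean number of steps.
stepsByType <- aggregate(steps ~ interval + dayType, toy, mean)
stepsByType
```

The resulting long-format data frame holds the same averages as the two separate data frames, and each dayType group can be passed to the same plot() calls, or the whole frame to a panel-plotting function.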
As the plots show, the activity patterns differ between weekends and weekdays. On weekends, activity is spread almost uniformly across the day (excluding the night). On weekdays, there is a clear peak in the morning.