This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The dataset consists of two months of data from an anonymous individual, collected during October and November 2012, and includes the number of steps taken in each 5-minute interval each day.
The data are loaded with read.csv().
# Libraries for plotting (ggplot2) and transforming data (plyr).
library(ggplot2)
library(plyr)
# Read the data (assumes activity.csv is already in the working directory)
data <- read.csv("activity.csv", colClasses = c("numeric", "Date", "numeric"))
Below is the histogram of the total number of steps taken each day, plotted with a bin width of 1500 steps.
byDay <- aggregate(steps ~ date, data, sum, na.action = na.pass)
# Add a label so that this dataset can be told apart from the imputed one later
byDay <- cbind(byDay, label = rep("with.na", nrow(byDay)))
ggplot(byDay, aes(x = steps)) + geom_histogram(binwidth = 1500, colour = "black",
fill = "white") + labs(title = "Steps Taken per Day", x = "Number of Steps",
y = "Frequency")
We then compute the mean and median of the total number of steps taken per day.
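The mean and median of the daily totals can be computed directly from the aggregated data frame. The sketch below is only illustrative: it uses a small made-up data frame (`toy`, a hypothetical stand-in for `activity.csv`) so that it runs on its own, and the names `toyByDay`, `stepsMean` and `stepsMedian` are assumptions rather than part of the original analysis.

```r
# Hypothetical toy data standing in for activity.csv (values are made up)
toy <- data.frame(
    steps    = c(0, 10, 20, NA, NA, NA, 5, 15, 40),
    date     = as.Date(rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 3)),
    interval = rep(c(0, 5, 10), times = 3)
)

# Same aggregation as for byDay: days that contain NA get an NA total
toyByDay <- aggregate(steps ~ date, toy, sum, na.action = na.pass)

# Ignore the NA totals when summarising
stepsMean   <- mean(toyByDay$steps, na.rm = TRUE)
stepsMedian <- median(toyByDay$steps, na.rm = TRUE)
```

With the real data, the same two calls would be applied to `byDay$steps`.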
Below is the plot of the average number of steps taken in each 5-minute interval, averaged across all days and plotted against the interval identifier.
byInterval <- aggregate(steps ~ interval, data, mean, na.rm = TRUE)
ggplot(byInterval, aes(x = interval, y = steps)) + geom_line() + labs(title = "Average Number of Steps per Interval",
x = "Interval", y = "Number of steps")
The 5-minute interval that, on average across all days, contains the maximum number of steps is: 835
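One way to look up this interval is with `which.max()`. The following is a minimal sketch on made-up per-interval averages; `toyByInterval` and `maxInterval` are hypothetical names, and the values are not taken from the real data.

```r
# Hypothetical per-interval averages (made-up values)
toyByInterval <- data.frame(
    interval = c(0, 5, 10, 15),
    steps    = c(1.5, 7.25, 3.0, 0.5)
)

# which.max() returns the row index of the largest average;
# we use it to pick the matching interval identifier
maxInterval <- toyByInterval$interval[which.max(toyByInterval$steps)]
```

Applied to the real data, the same expression would use `byInterval` in place of `toyByInterval`.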
Note that there are a number of days/intervals where there are missing
values (coded as NA). The presence of missing days may introduce
bias into some calculations or summaries of the data.
The total number of missing values in the dataset is: 2304
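A count like this can be obtained by summing the logical vector returned by `is.na()`. The sketch below uses a made-up data frame (`toy`, hypothetical) rather than the real data; `naTotal` and `naRows` are illustrative names.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
    steps    = c(0, NA, 20, NA, 5),
    interval = c(0, 5, 10, 15, 20)
)

# TRUE counts as 1, so the sum is the number of missing step counts
naTotal <- sum(is.na(toy$steps))

# Number of rows with a missing value in any column
naRows <- sum(!complete.cases(toy))
```

With the real data, `sum(is.na(data$steps))` would give the corresponding total.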
To fill in the missing values, we replace each one with the mean for the same 5-minute interval across all days, rounded to the nearest whole number.
# Replace each missing step count with the rounded mean for its interval
data.impute <- adply(data, 1, function(x) {
    if (is.na(x$steps)) {
        x$steps <- round(byInterval$steps[byInterval$interval == x$interval])
    }
    x
})
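The same imputation can also be written without a row-by-row loop. The sketch below is an alternative in base R, not the approach used above: it looks up each missing row's interval mean with `match()`. It runs on a made-up data frame, and `toy`, `toyByInterval`, `toyImpute`, `missing` and `idx` are hypothetical names.

```r
# Hypothetical toy data (values are made up)
toy <- data.frame(
    steps    = c(2, NA, 6, 4, 8, NA),
    interval = c(0, 5, 10, 0, 5, 10)
)

# Mean steps per interval; rows with NA are dropped before averaging
toyByInterval <- aggregate(steps ~ interval, toy, mean, na.rm = TRUE)

# Vectorised replacement: find the interval mean for every missing row
toyImpute <- toy
missing   <- is.na(toyImpute$steps)
idx       <- match(toyImpute$interval[missing], toyByInterval$interval)
toyImpute$steps[missing] <- round(toyByInterval$steps[idx])
```

On a dataset of this size either version works; the vectorised form mainly avoids calling a function once per row.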
This yields the following histogram of the total number of steps taken each day, plotted with a bin width of 1500 steps.
byDay.impute <- aggregate(steps ~ date, data.impute, sum)
# Add a label so that this dataset can be told apart from the one containing NA
byDay.impute <- cbind(byDay.impute, label = rep("without.na", nrow(byDay.impute)))
ggplot(byDay.impute, aes(x = steps)) + geom_histogram(binwidth = 1500, colour = "black",
fill = "white") + labs(title = "Steps Taken per Day", x = "Number of Steps",
y = "Frequency")
We observe that the mean and the median have both shifted slightly.
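A side-by-side summary makes such a shift easy to read. The sketch below uses made-up daily totals (`totalsWithNA` and `totalsImputed` are hypothetical vectors, not the real results) to show one way of building the comparison.

```r
# Hypothetical daily totals (made-up values); NA marks a day with missing data
totalsWithNA  <- c(100, NA, 300, 200)
totalsImputed <- c(100, 210, 300, 200)

# One row per dataset, with its mean and median
comparison <- data.frame(
    dataset = c("with NA", "imputed"),
    mean    = c(mean(totalsWithNA, na.rm = TRUE), mean(totalsImputed)),
    median  = c(median(totalsWithNA, na.rm = TRUE), median(totalsImputed))
)
```

With the real data, the two vectors would be `byDay$steps` and `byDay.impute$steps`.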
Below, both histograms are shown in a single plot for comparison.
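byDay.all <- rbind(byDay, byDay.impute)
# Convert the label to a factor with readable names (works whether label is character or factor)
byDay.all$label <- factor(byDay.all$label, levels = c("with.na", "without.na"), labels = c("With NA", "Without NA"))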
ggplot(byDay.all, aes(x = steps, fill = label)) + geom_histogram(binwidth = 1500,
colour = "black", alpha = 0.2) + labs(title = "Steps Taken per Day", x = "Number of Steps",
y = "Frequency") + theme(legend.position = "bottom")
To compare activity patterns between weekdays and weekends, using the dataset with filled-in missing values, we proceed as follows:
# Use the C locale so that weekdays() returns English day names
Sys.setlocale(locale = "C")
# We obtain the two subsets
data.weekend <- subset(data.impute, weekdays(date) %in% c("Saturday", "Sunday"))
data.weekday <- subset(data.impute, !weekdays(date) %in% c("Saturday", "Sunday"))
# Obtain the average steps per interval for each dataset
data.weekend <- aggregate(steps ~ interval, data.weekend, mean)
data.weekday <- aggregate(steps ~ interval, data.weekday, mean)
# Add a label for plotting
data.weekend <- cbind(data.weekend, day = "weekend")
data.weekday <- cbind(data.weekday, day = "weekday")
# Combine the subsets and specify the factor levels
data.week <- rbind(data.weekend, data.weekday)
data.week$day <- factor(data.week$day, levels = c("weekend", "weekday"), labels = c("Weekend", "Weekday"))
ggplot(data.week, aes(x = interval, y = steps)) + geom_line() + facet_grid(day ~
.) + labs(x = "Interval", y = "Number of steps")
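As an alternative to building two subsets, the day type can be added as a factor column and the averages computed in a single `aggregate()` call. The sketch below is an assumption-laden illustration on made-up data: `toy`, `isWeekend` and `toyWeek` are hypothetical names, and it uses `format(date, "%u")` (ISO weekday number, 1 = Monday to 7 = Sunday) so that the result does not depend on the locale.

```r
# Hypothetical toy data (values are made up);
# 2012-10-05 is a Friday, 2012-10-06 a Saturday, 2012-10-08 a Monday
toy <- data.frame(
    steps    = c(10, 20, 100, 200, 30, 40),
    date     = as.Date(c("2012-10-05", "2012-10-05", "2012-10-06",
                         "2012-10-06", "2012-10-08", "2012-10-08")),
    interval = c(0, 5, 0, 5, 0, 5)
)

# Weekday numbers 6 and 7 are Saturday and Sunday
isWeekend <- format(toy$date, "%u") %in% c("6", "7")
toy$day   <- factor(ifelse(isWeekend, "Weekend", "Weekday"),
                    levels = c("Weekend", "Weekday"))

# Average steps per interval within each day type
toyWeek <- aggregate(steps ~ interval + day, toy, mean)
```

The resulting data frame has the same shape as `data.week` and could be passed to the same `facet_grid(day ~ .)` plot.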
We observe that there tends to be more activity on weekends than on weekdays. This could be because weekday activity mostly takes place at work, whereas weekend activity tends to take place in more varied settings.