First of all, let’s unzip the file, store it in a variable, and take a look at it.
unzip(zipfile="activity.zip")
dat <- read.csv("activity.csv")
head(dat)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
It appears we have some missing values. Let’s process it by filtering out these rows for now.
data <- subset(dat, !is.na(steps))
head(data)
## steps date interval
## 289 0 2012-10-02 0
## 290 0 2012-10-02 5
## 291 0 2012-10-02 10
## 292 0 2012-10-02 15
## 293 0 2012-10-02 20
## 294 0 2012-10-02 25
All set!
This can be calculated using the tapply function, which will sum for us the steps per day. Here are the total number of steps for the first few days.
totalPerDay <- tapply(data$steps, data$date, sum)
head(totalPerDay)
## 2012-10-01 2012-10-02 2012-10-03 2012-10-04 2012-10-05 2012-10-06
## NA 126 11352 12116 13294 15420
Now we plot a nice histogram (not a barplot) that indicates the distribution of the total numbers.
hist(totalPerDay, breaks = 10, col = "red", xlab = "Total Steps",
main = "Histogram of Total Numbers of steps")
A summary tells us the mean is 10770 and the median is 10760.
summary(totalPerDay)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 41 8841 10760 10770 13290 21190 8
Again we can use tapply here, this time grouping per interval and looking for the mean rather than the sum. Let’s take a look at the mean values for the first few intervals.
meanPerInterval <- tapply(data$steps, data$interval, mean)
head(meanPerInterval)
## 0 5 10 15 20 25
## 1.7169811 0.3396226 0.1320755 0.1509434 0.0754717 2.0943396
Plotting time!
plot(unique(data$interval), meanPerInterval, type="l", col="red",
main ="Average number of Steps during an interval, accross all days",
xlab = "Interval per 5 minutes", ylab = "Mean")
That huge peak draws our attention… With the help of the max function, we can subset out the 5 minutes with the highest average number of staps, which appears to be ~206. On 5 minutes. Crazy stuff.
meanPerInterval[meanPerInterval == max(meanPerInterval)]
## 835
## 206.1698
Let us return now to the data with the missing values (stored in the variable dat). How many values are actually missing?
sum(is.na(dat$steps))
## [1] 2304
Ok, the impute function of the Hisc package can be used to replace these 2304 NAs with the mean. We first take a look at that value, and then apply the impute function to create a new dataframe without NAs: imputedData.
library(Hmisc)
mean(dat$steps, na.rm = TRUE)
## [1] 37.3826
imputedData <- dat
imputedData$steps <- impute(dat$steps, mean)
If we check imputedData, all NAs will be removed (by this mean).
head(imputedData)
## steps date interval
## 1 37.3826 2012-10-01 0
## 2 37.3826 2012-10-01 5
## 3 37.3826 2012-10-01 10
## 4 37.3826 2012-10-01 15
## 5 37.3826 2012-10-01 20
## 6 37.3826 2012-10-01 25
Now let’s do exactly the same as before: creating a histogram and retrieve the mean and median by the summary function.
totalPerDayI <- tapply(imputedData$steps, imputedData$date, sum)
hist(totalPerDayI, breaks = 10, col = "red", xlab = "Total Steps",
main = "Histogram of Total Numbers of steps (imputed data")
summary(totalPerDayI)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 41 9819 10770 10770 12810 21190
Clearly, the center bar in the histogram has grown. All the NA values are now mean values, which explains the higher estimates for mean values according to this histogram. The rest of the histogram remains the same. Unsurprisingly, however, if we look at the summary, the mean was not affected by imputing, as we only added mean values. It is still 10770. But, the median is affected. By adding only mean values, the center value actually has been shifted to a mean value now, thus 10770.
Now we’ll use the imputed dataset to discover differences between weekdays and weekends. First we add a factor variable with weekday and weekend as levels. We do this by using the weekdays function on the dates, loop through theses values to change them, and then add them to imputedData as factors. As I live in Belgium, the daynames are in Dutch.
days <- weekdays(as.Date(imputedData$date))
for(i in 1:length(days)) {
if(days[i] %in% c("zaterdag", "zondag")) days[i] = "weekend"
else days[i] = "weekday"
}
imputedData$day <- as.factor(days)
Before plotting we calculate the averages using the aggregate function. This will calculate the average number of steps per daytype (weekend or weekday) and per interval. Let’s take a look at these averages.
library(ggplot2)
averages <- aggregate(steps ~ interval + day, data=imputedData, mean)
head(averages)
## interval day steps
## 1 0 weekday 7.006569
## 2 5 weekday 5.384347
## 3 10 weekday 5.139902
## 4 15 weekday 5.162124
## 5 20 weekday 5.073235
## 6 25 weekday 6.295458
Plotting time! I’ll use ggplot because yolo.
g <- ggplot(averages, aes(interval, steps))
g + geom_line(color="blue") + facet_grid(day ~ .) +
xlab("Interval") + ylab("Average steps")
One can see how these plots ‘add up’ to the previous (red) line plot. Then again, that plot is now ‘splitted’ into weekdays and weekends.