It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
Below are the results from Reproducible Data Project Assignment 1, written in R Markdown and compiled with knitr.
library(knitr)
opts_chunk$set(echo = TRUE)
library(lubridate)
library(dplyr)
library(ggplot2)
unzip("./activity.zip")
activity <- read.csv("activity.csv", header = TRUE)
activity$date <- ymd(activity$date)
str(activity)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Date, format: "2012-10-01" "2012-10-01" ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
head(activity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
First filter the NAs out of the data, then use group_by and summarize to total the steps for each date.
activity_nona <- filter(activity, !is.na(steps))
grp <- group_by(activity_nona, date)
numsteps <- summarize(grp, steps = sum(steps))
head(numsteps)
## # A tibble: 6 x 2
## date steps
## <date> <int>
## 1 2012-10-02 126
## 2 2012-10-03 11352
## 3 2012-10-04 12116
## 4 2012-10-05 13294
## 5 2012-10-06 15420
## 6 2012-10-07 11015
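Filtering out the NA rows before grouping means that a day with no recorded values drops out of the summary entirely. The sketch below uses hypothetical toy data and base R's `tapply` (rather than the dplyr pipeline above) to contrast this with summing under `na.rm = TRUE`, which would keep such a day but report a total of 0.

```r
# Toy data (hypothetical): the first day is entirely missing,
# the second day is fully observed.
toy <- data.frame(
  date  = rep(c("2012-10-01", "2012-10-02"), each = 3),
  steps = c(NA, NA, NA, 10, 0, 20),
  stringsAsFactors = FALSE
)

# Approach used in this report: drop NA rows first,
# so the all-NA day disappears from the daily totals.
observed <- toy[!is.na(toy$steps), ]
totals_filtered <- tapply(observed$steps, observed$date, sum)

# Alternative: keep every row and sum with na.rm = TRUE.
# The all-NA day is kept, but with a total of 0.
totals_narm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(totals_filtered)
print(totals_narm)
```

A total of 0 for an unrecorded day would pull the mean and median down, which is one reason to prefer dropping those rows at this stage.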
Use ggplot to plot a histogram of the daily totals.
g <- ggplot(numsteps, aes(x = steps))
g + geom_histogram(colour = "palevioletred1", fill = "palevioletred4", binwidth = 500)+
ggtitle("Histogram of the number of steps taken per day")+
xlab("Steps taken per day")+
ylab("Frequency of occurrence")
meansteps <- mean(numsteps$steps)
mediansteps <- median(numsteps$steps)
Printing the results
meansteps
## [1] 10766.19
mediansteps
## [1] 10765
The mean number of steps taken per day is 10766 and the median number of steps taken per day is 10765.
First calculate the average number of steps taken in each 5-minute interval across all days.
grp2 <- group_by(activity_nona, interval)
numsteps2 <- summarize(grp2, steps = mean(steps))
Use ggplot to plot the time series plot.
g <- ggplot(numsteps2, aes(x=interval, y=steps))
g + geom_line(color = "slateblue3")
Use which.max to find the 5-minute interval with the highest average number of steps.
numsteps2[which.max(numsteps2$steps),]
## # A tibble: 1 x 2
## interval steps
## <int> <dbl>
## 1 835 206.1698
The 5-minute interval labeled 835 has the highest average, at about 206 steps.
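To read an interval label as a time of day, a small helper can split it into hours and minutes. This is a sketch under an assumption that the report does not state: that a label such as 835 encodes the clock time as HHMM rather than a count of minutes. The function name `interval_to_clock` is hypothetical.

```r
# Assumption: an interval label encodes hour and minute as HHMM,
# so 835 would mean 08:35. The helper name is hypothetical.
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock_times <- interval_to_clock(c(0, 55, 835, 2355))
print(clock_times)
```

Under that assumption, the peak interval 835 would correspond to 08:35.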
Summarizing the missing values.
sum(is.na(activity$steps))
## [1] 2304
The number of missing values is 2304.
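Fill each missing value with the mean number of steps for the same 5-minute interval, averaged over the days where it was recorded.
activity_full <- activity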
na_rows <- is.na(activity_full$steps)
avgbyint <- tapply(activity_full$steps, activity_full$interval, mean, na.rm=TRUE)
activity_full$steps[na_rows] <- avgbyint[as.character(activity_full$interval[na_rows])]
head(activity_full)
## steps date interval
## 1 1.7169811 2012-10-01 0
## 2 0.3396226 2012-10-01 5
## 3 0.1320755 2012-10-01 10
## 4 0.1509434 2012-10-01 15
## 5 0.0754717 2012-10-01 20
## 6 2.0943396 2012-10-01 25
Check that no missing values remain.
sum(is.na(activity_full$steps)) == 0
## [1] TRUE
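The same interval-mean imputation can be tried on a small example to see exactly which values get filled in. The sketch below uses hypothetical toy data and base R's `ave()` as an alternative to the `tapply` lookup above; the helper name `impute_mean` is an assumption made for illustration.

```r
# Toy data (hypothetical): two intervals observed on three days,
# with one missing value in each interval.
toy <- data.frame(
  interval = c(0, 5, 0, 5, 0, 5),
  steps    = c(NA, 4, 2, NA, 6, 8)
)

# Replace the NAs in a vector with the mean of its observed values.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies impute_mean within each interval group and
# returns the result in the original row order.
toy$steps_filled <- ave(toy$steps, toy$interval, FUN = impute_mean)

print(toy)
```

Interval 0 has observed values 2 and 6, so its NA becomes 4; interval 5 has observed values 4 and 8, so its NA becomes 6.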
Use group_by and summarize to total the steps for each date, this time including the imputed values.
grp3 <- group_by(activity_full, date)
numsteps3 <- summarize(grp3, steps = sum(steps))
head(numsteps3)
## # A tibble: 6 x 2
## date steps
## <date> <dbl>
## 1 2012-10-01 10766.19
## 2 2012-10-02 126.00
## 3 2012-10-03 11352.00
## 4 2012-10-04 12116.00
## 5 2012-10-05 13294.00
## 6 2012-10-06 15420.00
Use ggplot to draw the histogram again
g3 <- ggplot(numsteps3, aes(x = steps))
g3 + geom_histogram(colour = "lightsalmon1", fill = "lightsalmon4", binwidth = 500)+
ggtitle("Histogram of the number of steps taken per day; with imputed values")+
xlab("Steps taken per day")+
ylab("Frequency of occurrence")
Calculate the mean and median of the total number of steps taken per day again
meansteps3 <- mean(numsteps3$steps)
mediansteps3 <- median(numsteps3$steps)
Printing the results
meansteps3
## [1] 10766.19
mediansteps3
## [1] 10766.19
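The mean number of steps taken per day is 10766 and the median number of steps taken per day is also 10766. Imputing values into the NA fields left the mean unchanged and made the median equal to it: a day whose values were all missing, such as 2012-10-01, receives an imputed total of 10766.19, which is exactly the mean.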
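A quick arithmetic check helps explain why the median lands exactly on the mean. The sketch below assumes the missing values fall on whole days, which is consistent with 2012-10-01 receiving a total of 10766.19 but is not verified here; the variable names are illustrative.

```r
# Counts taken from the output above.
total_rows     <- 17568
missing_values <- 2304

# Number of 5-minute intervals in a 24-hour day.
intervals_per_day <- 24 * 60 / 5

# Days covered by the data set.
total_days <- total_rows / intervals_per_day

# Assumption: NAs occur only as complete days.
missing_days <- missing_values / intervals_per_day

print(c(intervals_per_day = intervals_per_day,
        total_days        = total_days,
        missing_days      = missing_days))
```

Under that assumption, 8 of the 61 days get a daily total equal to the sum of the interval means, which is the mean of the observed daily totals. Adding values equal to the mean leaves the mean unchanged and makes it likely that the middle value of the sorted totals is one of them.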
Write a function that labels each date as a weekday or a weekend day.
weekend <- function(actdate) {
ifelse(weekdays(actdate) %in% c("Saturday", "Sunday"), "weekend", "weekday")
}
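Because `weekdays()` returns day names in the current locale, comparing against the English names can misclassify every date on a non-English system. A locale-independent variant is sketched below using the numeric `wday` field of base R's `POSIXlt`; the function name `day_type` is hypothetical and is not part of the analysis above.

```r
# Hypothetical locale-independent alternative to weekend().
# as.POSIXlt()$wday is numeric: 0 = Sunday, ..., 6 = Saturday.
day_type <- function(d) {
  wday <- as.POSIXlt(d)$wday
  ifelse(wday %in% c(0, 6), "weekend", "weekday")
}

# Friday through Monday around the first weekend in the data.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))
labels <- day_type(dates)
print(data.frame(date = dates, week_type = labels))
```

On an English locale both versions should give the same labels; the numeric version simply avoids depending on that setting.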
Add a new column, using the function above, to indicate whether each date is a weekday or a weekend day.
activity_full <- mutate(activity_full, week_type = weekend(date))
head(activity_full)
## steps date interval week_type
## 1 1.7169811 2012-10-01 0 weekday
## 2 0.3396226 2012-10-01 5 weekday
## 3 0.1320755 2012-10-01 10 weekday
## 4 0.1509434 2012-10-01 15 weekday
## 5 0.0754717 2012-10-01 20 weekday
## 6 2.0943396 2012-10-01 25 weekday
First calculate the average number of steps taken in each 5-minute interval across the days, by week type.
grp4 <- group_by(activity_full, week_type, interval)
numsteps4 <- summarize(grp4, steps = mean(steps))
head(numsteps4)
## Source: local data frame [6 x 3]
## Groups: week_type [1]
##
## week_type interval steps
## <chr> <int> <dbl>
## 1 weekday 0 2.25115304
## 2 weekday 5 0.44528302
## 3 weekday 10 0.17316562
## 4 weekday 15 0.19790356
## 5 weekday 20 0.09895178
## 6 weekday 25 1.59035639
Then draw a two-panel time series plot of the average steps per 5-minute interval, with one panel for each week type.
g <- ggplot(numsteps4, aes(x=interval, y=steps, color = week_type))
g + geom_line()+
facet_grid(week_type~.)
On weekdays, activity peaks in the morning and rises again, more modestly, in the evening, which might correspond to the start and end of office hours. On weekends, activity is spread more evenly through the day, perhaps because there is no work schedule.