It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.
This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of observations from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.
The data for this assignment can be downloaded from the course web site:
Dataset: Activity monitoring data [52K]
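The variables included in this dataset are: steps (the number of steps taken in a 5-minute interval, with missing values coded as NA), date (the date of the measurement, in YYYY-MM-DD format), and interval (the identifier of the 5-minute interval in which the measurement was taken).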
# set the working directory (machine-specific path; adjust as needed)
setwd("/home/pcbrom/Dropbox/Trabalho e Estudo/Cursos Livres/Reproducible Research/CurseProject")
# unzip file
unzip("activity.zip")
# read data
db = read.csv("activity.csv")
# see str
str(db)
## 'data.frame': 17568 obs. of 3 variables:
## $ steps : int NA NA NA NA NA NA NA NA NA NA ...
## $ date : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ interval: int 0 5 10 15 20 25 30 35 40 45 ...
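The 17568 rows reported by `str()` are consistent with one row per 5-minute interval per day. A minimal sketch of that arithmetic, assuming the data cover every day of October and November 2012 and that no interval is absent from the file:

```r
# Assumed study period: every day of October and November 2012
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")
n_days <- length(days)             # 31 days in October + 30 in November
intervals_per_day <- 24 * 60 / 5   # number of 5-minute slots in one day
expected_rows <- n_days * intervals_per_day
expected_rows                      # 61 * 288 = 17568
```

If the count had not matched, that would point to missing rows rather than missing values, which would need different handling from the `NA` entries in `steps`.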
# convert date to Date
db$date = as.Date(db$date, "%Y-%m-%d")
# view summary
library(knitr)
kable(summary(db), caption = "Summary of Data", align = "c")
| steps | date | interval |
|---|---|---|
| Min. : 0.00 | Min. :2012-10-01 | Min. : 0.0 |
| 1st Qu.: 0.00 | 1st Qu.:2012-10-16 | 1st Qu.: 588.8 |
| Median : 0.00 | Median :2012-10-31 | Median :1177.5 |
| Mean : 37.38 | Mean :2012-10-31 | Mean :1177.5 |
| 3rd Qu.: 12.00 | 3rd Qu.:2012-11-15 | 3rd Qu.:1766.2 |
| Max. :806.00 | Max. :2012-11-30 | Max. :2355.0 |
| NA’s :2304 | NA | NA |
# ggplot2 provides qplot(); Hmisc provides impute()
library(ggplot2)
library(Hmisc)
# aggregate steps by date
totSteps = tapply(db$steps, db$date, FUN = sum, na.rm = TRUE)
# (with na.rm = TRUE, a day with only NA values gets a total of 0)
# view the shape of the distribution
p1 = qplot(totSteps, binwidth = 1000,
main = "Total Number Of Steps Taken\nEach Day",
xlab = "Total of Steps", ylab = "Frequency")
p1
# get mean
mean(totSteps, na.rm = TRUE)
## [1] 9354.23
# get median
median(totSteps, na.rm = TRUE)
## [1] 10395
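One likely reason the mean sits well below the median is that `sum(..., na.rm = TRUE)` returns 0 for a day whose values are all `NA`, so such days enter the histogram and the mean as zero-step days. A small sketch with made-up numbers (the `toy` data frame and the `sum_or_na` helper are hypothetical, not part of the assignment data) shows the effect and one alternative that keeps those days as `NA`:

```r
# Hypothetical toy data: day "b" has no recorded values at all
toy <- data.frame(
  day   = c("a", "a", "b", "b", "c", "c"),
  steps = c(10, 20, NA, NA, 5, NA)
)

# With na.rm = TRUE an all-NA day sums to 0
tot_zero <- tapply(toy$steps, toy$day, sum, na.rm = TRUE)

# Alternative: return NA when a day has no observed values
sum_or_na <- function(x) if (all(is.na(x))) NA_real_ else sum(x, na.rm = TRUE)
tot_na <- tapply(toy$steps, toy$day, sum_or_na)

mean(tot_zero)              # (30 + 0 + 5) / 3: all-NA day counted as 0 steps
mean(tot_na, na.rm = TRUE)  # (30 + 5) / 2 = 17.5: all-NA day excluded
```

Which behaviour is preferable depends on the question being asked; the point here is only that the two choices give different summaries.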
# get averages
averages = aggregate(x = list(steps = db$steps),
by = list(interval = db$interval),
FUN = mean, na.rm = TRUE)
# plot the average daily activity pattern
qplot(interval, steps, data = averages, xlab = "5 minute interval",
ylab = "Average number of steps taken", main = "Steps vs. Interval",
geom = "line")
# max value
kable(averages[which.max(averages$steps), ], align = "c", caption = "Max Value")
| | interval | steps |
|---|---|---|
| 104 | 835 | 206.1698 |
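The interval identifiers run from 0 to 2355 rather than 0 to 1435, which suggests they encode clock time as hours and minutes (835 as 08:35) rather than minutes since midnight. Under that assumption, which the source does not state explicitly, a small helper (`interval_to_clock`, a hypothetical name) can turn an identifier into a readable time:

```r
# Hypothetical helper: assumes `interval` is encoded as HHMM (e.g. 835 -> 08:35)
interval_to_clock <- function(interval) {
  hours   <- interval %/% 100   # leading digits are the hour
  minutes <- interval %% 100    # last two digits are the minute
  sprintf("%02d:%02d", as.integer(hours), as.integer(minutes))
}

interval_to_clock(c(0, 835, 2355))
## [1] "00:00" "08:35" "23:55"
```

Read this way, the most active 5-minute interval on average would start at 08:35 in the morning.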
Before imputing the missing values, we should examine the density of the distribution to decide whether the mean or the median is the more appropriate fill value. In the figure "Total Number Of Steps Taken Each Day (original data)" the distribution looks like a mixture of distributions; if we consider only the component with the higher mean, it is fairly symmetric, so imputation by the mean is recommended.
# quick density view
d1 = qplot(totSteps, geom = "density",
xlab = "Total of Steps", ylab = "Density",
main = "Total Number Of Steps Taken\nEach Day (original data)")
d1
# create a secondary copy of the data
db2 = db
# impute data
db2$steps = impute(db$steps, mean)
# aggregate steps by date
totSteps2 = tapply(db2$steps, db2$date, FUN = sum)
# view the shape of the distribution
p2 = qplot(totSteps2, binwidth = 1000,
xlab = "Total of Steps", ylab = "Frequency",
main = "Total Number Of Steps Taken\nEach Day (imputed by mean)")
# density comparison
d2 = qplot(totSteps2, geom = "density",
xlab = "Total of Steps", ylab = "Density",
main = "Total Number Of Steps Taken\nEach Day (imputed by mean)")
library(gridExtra)
grid.arrange(p1, p2, d1, d2, ncol = 2, nrow = 2)
# get mean
mean(totSteps2)
## [1] 10766.19
# get median
median(totSteps2)
## [1] 10766.19
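The mean and median coincide after imputation, which is what one would expect if the missing values fall on whole days: 2304 missing values correspond to 2304 / 288 = 8 days' worth of intervals, and every fully imputed day receives the same total, 288 times the overall interval mean. The sketch below reproduces the effect on made-up data (the `toy` data frame is hypothetical), using base R in place of `Hmisc::impute`:

```r
# Hypothetical toy data: 5 days of 4 intervals each; days 2 and 4 are entirely missing
toy <- data.frame(
  day   = rep(1:5, each = 4),
  steps = c(1, 2, 3, 4,  NA, NA, NA, NA,  2, 2, 2, 6,  NA, NA, NA, NA,  0, 4, 4, 0)
)

# Impute every NA with the overall mean of the observed values
overall_mean <- mean(toy$steps, na.rm = TRUE)
toy$filled <- ifelse(is.na(toy$steps), overall_mean, toy$steps)

# Daily totals: the two imputed days both equal 4 * overall_mean
totals <- tapply(toy$filled, toy$day, sum)
totals
mean(totals)
median(totals)
```

In this toy case both summaries equal 10, the total assigned to each imputed day. Filling with the overall mean therefore pulls the centre of the distribution toward a single repeated value and narrows its spread.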
In fact, there is a visible difference between the weekday and weekend activity patterns.
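# create weekdays (the names returned depend on the session locale)
db2$weekdays = weekdays(db2$date)
# create dayType from the numeric day of the week (0 = Sunday, 6 = Saturday),
# which does not depend on the locale
db2$dayType = ifelse(as.POSIXlt(db2$date)$wday %in% c(0, 6),
"Weekend", "Weekday")
# average steps per interval within each day type
averages2 = aggregate(steps ~ interval + dayType, data = db2, FUN = mean)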
qplot(interval, steps, data = averages2, geom = "line", facets = . ~ dayType,
xlab = "5 minute interval", ylab = "average number of steps")
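When comparing the two panels it helps to know how many days feed each average. A quick sketch, assuming the data cover every day of October and November 2012, counts the days of each type using the numeric day of the week:

```r
# Assumed study period: every day of October and November 2012
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")

# POSIXlt$wday: 0 = Sunday, ..., 6 = Saturday (independent of the locale)
is_weekend <- as.POSIXlt(days)$wday %in% c(0, 6)
day_type <- ifelse(is_weekend, "Weekend", "Weekday")

# Number of days of each type in the period
table(day_type)
```

Under this assumption the period contains 45 weekdays and 16 weekend days, so the weekend curve averages over far fewer days and may look noisier for that reason alone.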