Activity Monitoring Data

Introduction

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device. This device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in 5-minute intervals each day.

The data for this assignment can be downloaded from the course web site:

Dataset: Activity monitoring data [52K]

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
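
The observation count can be cross-checked against the sampling scheme: 61 calendar days (October 1 to November 30, 2012) with one record per 5-minute interval. A quick sketch of that arithmetic, using hypothetical variable names that are not part of the analysis:

```r
# number of calendar days in the study period, both endpoints included
n_days = as.integer(as.Date("2012-11-30") - as.Date("2012-10-01")) + 1

# number of 5-minute intervals in a 24-hour day
n_intervals = 24 * 60 / 5

# expected number of rows in the CSV file
n_days * n_intervals
```

The product should agree with the 17,568 observations stated for the dataset.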

Loading and preprocessing the data

# set wd
setwd("/home/pcbrom/Dropbox/Trabalho e Estudo/Cursos Livres/Reproducible Research/CurseProject")

# unzip file
unzip("activity.zip")

# read data
db = read.csv("activity.csv")

# see str
str(db)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

# convert date to Date
db$date = as.Date(db$date, "%Y-%m-%d")

# view summary
require(knitr)
kable(summary(db), caption = "Summary of Data", align = "c")

Summary of Data
steps	date	interval
Min. : 0.00	Min. :2012-10-01	Min. : 0.0
1st Qu.: 0.00	1st Qu.:2012-10-16	1st Qu.: 588.8
Median : 0.00	Median :2012-10-31	Median :1177.5
Mean : 37.38	Mean :2012-10-31	Mean :1177.5
3rd Qu.: 12.00	3rd Qu.:2012-11-15	3rd Qu.:1766.2
Max. :806.00	Max. :2012-11-30	Max. :2355.0
NA’s :2304	NA	NA

What is the mean total number of steps taken per day?

# ggplot2 provides qplot(); Hmisc provides impute()
library(ggplot2)
library(Hmisc)

# aggregate steps by date
totSteps = tapply(db$steps, db$date, FUN = sum, na.rm = T)

# view the shape of the distribution
p1 = qplot(totSteps, binwidth = 1000, 
           main = "Total Number Of Steps Taken\nEach Day", 
           xlab = "Total of Steps", ylab = "Frequency")
p1

# get mean
mean(totSteps, na.rm = T)

## [1] 9354.23

# get median
median(totSteps, na.rm = T)

## [1] 10395
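
One caveat about the totals: `sum(..., na.rm = TRUE)` returns 0 for a day whose values are all missing, so any such day enters the mean and median above as a zero-step day rather than being skipped. A minimal sketch on made-up data (the `toy` data frame and the `tot_*` names are assumptions for illustration, not part of the dataset) shows the difference:

```r
# toy data: day "a" is fully observed, day "b" is entirely missing
toy = data.frame(day = c("a", "a", "b", "b"),
                 steps = c(10, 20, NA, NA))

# with na.rm = TRUE the all-NA day gets a total of 0
tot_zero = tapply(toy$steps, toy$day, FUN = sum, na.rm = TRUE)

# without na.rm the all-NA day stays NA and can be dropped afterwards
tot_na = tapply(toy$steps, toy$day, FUN = sum)

mean(tot_zero)              # averages over both days, counting "b" as 0
mean(tot_na, na.rm = TRUE)  # averages over day "a" only
```

Which convention is appropriate depends on whether a fully missing day should count as a day with no activity.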

What is the average daily activity pattern?

# get averages
averages = aggregate(x = list(steps = db$steps), 
                     by = list(interval = db$interval), 
                     FUN = mean, na.rm = T)

# view geometry
qplot(interval, steps, data = averages, xlab = "5 minute interval", 
      ylab = "Average number of steps taken", main = "Steps vs. Interval", 
      geom = "line")

# max value
kable(averages[which.max(averages$steps), ], align = "c", caption = "Max Value")

Max Value
	interval	steps
104	835	206.1698
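
The interval identifiers appear to encode clock time as HHMM rather than minutes since midnight: the maximum is 2355, whereas 288 five-minute slots counted in minutes would end at 1435. Under that assumption, the peak interval 835 corresponds to 08:35. A small sketch of the conversion, with a hypothetical helper name:

```r
# hypothetical helper: format an HHMM-style interval id as "HH:MM"
interval_to_time = function(interval) {
  hhmm = sprintf("%04d", as.integer(interval))  # pad to four digits
  paste0(substr(hhmm, 1, 2), ":", substr(hhmm, 3, 4))
}

interval_to_time(c(0, 5, 835, 2355))
```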

Imputing missing values

Before imputing, we should examine the density of the distribution to decide whether the mean or the median is the more appropriate fill value. In the figure "Total Number Of Steps Taken Each Day (original data)" we see a mixture of distributions. If we consider only the component of the mixture with the highest mean, we get something fairly symmetrical, so imputation by the mean is recommended.

# rapid density view
d1 = qplot(totSteps, geom = "density", 
           xlab = "Total of Steps", ylab = "Density",
           main = "Total Number Of Steps Taken\nEach Day (original data)")
d1

# create secondary db
db2 = db

# impute data
db2$steps = impute(db$steps, mean)

# aggregate steps by date
totSteps2 = tapply(db2$steps, db2$date, FUN = sum)

# view the shape of the distribution
p2 = qplot(totSteps2, binwidth = 1000, 
           xlab = "Total of Steps", ylab = "Frequency",
           main = "Total Number Of Steps Taken\nEach Day (imputed by mean)")

# density comparison
d2 = qplot(totSteps2, geom = "density", 
           xlab = "Total of Steps", ylab = "Density",
           main = "Total Number Of Steps Taken\nEach Day (imputed by mean)")

require(gridExtra)
grid.arrange(p1, p2, d1, d2, ncol = 2, nrow = 2)

# get mean
mean(totSteps2)

## [1] 10766.19

# get median
median(totSteps2)

## [1] 10766.19
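
The mean and median coincide after imputation, which is consistent with the missing values falling on whole days: 2304 NAs is exactly 8 days of 288 intervals, and a day filled entirely with the overall interval mean has a total of 288 times that mean. The arithmetic below uses only figures reported in this document; that the NAs occur as complete days is an assumption and is not verified here.

```r
n_intervals = 24 * 60 / 5   # 288 five-minute intervals per day

# 2304 missing values correspond to this many complete days
2304 / n_intervals

# per-interval mean implied by an imputed daily total of 10766.19
10766.19 / n_intervals
```

The second value rounds to 37.38, the mean of steps in the summary table. If several days share that identical total near the centre of the distribution, the median can plausibly land on the same value as the mean.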

Are there differences in activity patterns between weekdays and weekends?

In fact we have a visible difference between Weekday and Weekend.

# create weekdays (day names depend on the session locale)
db2$weekdays = weekdays(db2$date)

# create dayType; wday is locale-independent (0 = Sunday, 6 = Saturday)
db2$dayType = ifelse(as.POSIXlt(db2$date)$wday %in% c(0, 6),
                     "Weekend", "Weekday")

averages2 = aggregate(steps ~ interval + dayType, mean, data = db2)

qplot(interval, steps, data = averages2, geom = "line", facets = . ~ dayType,
      xlab = "5 minute interval", ylab = "average number of steps")
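
Day names returned by `weekdays()` depend on the session locale ("domingo" and "sábado" under a Portuguese locale), so a weekend test is more portable when it is based on the numeric day of the week. A minimal sketch on a few dates from the study period (the `dates` and `is_weekend` names are hypothetical):

```r
# 2012-10-01 was a Monday, so the 6th and 7th fall on a weekend
dates = as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# wday counts days since Sunday: 0 = Sunday, ..., 6 = Saturday
is_weekend = as.POSIXlt(dates)$wday %in% c(0, 6)

ifelse(is_weekend, "Weekend", "Weekday")
```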