Reproducible Research: Activity Monitoring

1. Loading and preprocessing the data

The first thing we need to do is take a peek at the contents of the zip archive.

fileName <- as.character(unzip("activity.zip", list=T)$Name)
print(fileName)

## [1] "activity.csv"

Since the archive contains just one file, we can load the data without further steps.

1.1. Loading the data

Reading the data into R:

data <- read.csv(unz("activity.zip", fileName))

Let's take a quick look at the data, just to make sure it's all properly loaded.

head(data)

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

We're good to go.

1.2. Preprocessing the data

Next, let's check the class of each variable in our data frame.

str(data)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : Factor w/ 61 levels "2012-10-01","2012-10-02",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

Dates are stored as factors, so we need to take care of that using the lubridate package.

library(lubridate)
data$date <- ymd(data$date)
str(data)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : POSIXct, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

Looking good.
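
If lubridate is not available, base R can do a similar conversion. The sketch below uses a small made-up data frame (toy is a hypothetical stand-in for our data) and assumes the dates are in year-month-day format, as they are above. Note that it gives a Date column rather than the POSIXct shown in the output above.

```r
# Toy stand-in for the activity data; the date column is a factor, as in str(data).
toy <- data.frame(
  steps    = c(NA, 12L, 7L),
  date     = factor(c("2012-10-01", "2012-10-01", "2012-10-02")),
  interval = c(0L, 5L, 0L)
)

# Go through character so the conversion clearly works on the labels.
toy$date <- as.Date(as.character(toy$date), format = "%Y-%m-%d")

class(toy$date)
```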

As a personal note, I tend to prefer data in long format rather than wide format, so we will leave the layout as it is.

I also read that such format is preferable when dealing with time series in R.

2. What is mean total number of steps taken per day?

2.1. Calculate the total number of steps taken per day

In order to perform the necessary transformations and summarise the data we will use the dplyr package.

library(dplyr)

## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

We need to group our data by day, so that R knows we want to summarise data by day afterwards.

Then we summarise the data using the sum of all the steps taken, obtaining the total number of steps taken each day.

As suggested, we are ignoring missing values for now.

stepsByDay <- data %>% 
                  na.omit() %>% 
                      group_by(date) %>% 
                          summarise(totalSteps=sum(steps))

Find the first 10 rows of the final result below.

head(stepsByDay, n=10)

## Source: local data frame [10 x 2]
## 
##          date totalSteps
## 1  2012-10-02        126
## 2  2012-10-03      11352
## 3  2012-10-04      12116
## 4  2012-10-05      13294
## 5  2012-10-06      15420
## 6  2012-10-07      11015
## 7  2012-10-09      12811
## 8  2012-10-10       9900
## 9  2012-10-11      10304
## 10 2012-10-12      17382

str(stepsByDay)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53 obs. of  2 variables:
##  $ date      : POSIXct, format: "2012-10-02" "2012-10-03" ...
##  $ totalSteps: int  126 11352 12116 13294 15420 11015 12811 9900 10304 17382 ...
##  - attr(*, "drop")= logi TRUE
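
It is worth seeing what na.omit() does to days with no data at all. The sketch below uses base R on a made-up data frame (toy and its values are hypothetical) to contrast dropping the missing rows first with summing using na.rm=TRUE. With na.omit() the all-missing day disappears, which would explain why we get 53 days rather than 61; with na.rm=TRUE the same day would be kept with a total of zero.

```r
# Three made-up days; the first one has no recorded steps at all.
toy <- data.frame(
  steps = c(NA, NA, 10, 20, 5, NA),
  date  = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  stringsAsFactors = FALSE
)

# Dropping incomplete rows first: the all-missing day vanishes from the result.
dropped <- aggregate(steps ~ date, data = na.omit(toy), FUN = sum)

# Summing with na.rm = TRUE: the all-missing day is kept, with a total of 0.
kept <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

print(dropped)
print(kept)
```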

2.2. Make a histogram of the total number of steps taken each day

We are now ready to make a histogram of the total number of steps taken each day using our summarised data.

library(ggplot2)
qplot(totalSteps, 
      data=stepsByDay, 
      geom="histogram", 
      binwidth=3000, 
      main="Histogram of the Total Number of Steps Taken Each Day\n", 
      ylab="Count of Days\n", 
      xlab="\nTotal Number of Steps per Day")

[Figure: histogram of the total number of steps taken each day]

The histogram above, like all the remaining plots in this analysis, was made using the ggplot2 package.

2.3. Report the mean and median of the total number of steps per day

Here we will summarise our data further.

We use the previously calculated total number of steps per day to compute the daily mean and median.

Please note that, unlike previous exercises, no grouping is required here.

stepsByDayCentralMeasures <- stepsByDay %>% 
                                 summarise(meanStepsByDay=as.numeric(mean(totalSteps)), 
                                           medianStepsByDay=as.numeric(median(totalSteps)))
print(stepsByDayCentralMeasures)

## Source: local data frame [1 x 2]
## 
##   meanStepsByDay medianStepsByDay
## 1       10766.19            10765

The mean is about 10766.19 steps per day and the median is 10765 steps per day.

3. What is the average daily activity pattern?

3.1. Make a time series plot

As required, we will use the 5-minute interval and the average number of steps taken, averaged across all days.

To do this, we need to regroup our data using the identifier of the 5-minute interval.

Afterwards we summarise our data, obtaining the average number of steps, across all days, for each interval.

Note that we are still ignoring missing values at this point.

stepsByInterval <- data %>%
                       na.omit() %>%
                           group_by(interval) %>%
                               summarise(averageSteps=mean(steps))

Let's take a look at the first 10 lines of the summarised data.

head(stepsByInterval, n=10)

## Source: local data frame [10 x 2]
## 
##    interval averageSteps
## 1         0    1.7169811
## 2         5    0.3396226
## 3        10    0.1320755
## 4        15    0.1509434
## 5        20    0.0754717
## 6        25    2.0943396
## 7        30    0.5283019
## 8        35    0.8679245
## 9        40    0.0000000
## 10       45    1.4716981

With the data in the proper format, plotting our time series is right around the corner.

qplot(x=interval,
      y=averageSteps,
      data=stepsByInterval,
      geom="line",
      main="Average number of steps taken for each Interval, across all days\n",
      xlab="\nInterval",
      ylab="Number of Steps\n")

[Figure: time series of the average number of steps taken in each interval, across all days]

3.2. Which 5-minute interval contains the maximum number of steps

To answer this we will use base R, with no additional packages. We can simply subset our data.

First we need to discover which line contains our maximum value.

maxIndex <- which.max(stepsByInterval$averageSteps)

Once we know the index of our max value, we can use it to subset our data frame.

maxSteps <- stepsByInterval[maxIndex, ]
print(maxSteps)

## Source: local data frame [1 x 2]
## 
##   interval averageSteps
## 1      835     206.1698

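Therefore, the 5-minute interval with the maximum number of steps is the one labelled 835. Assuming the identifiers encode clock time as HHMM, as they appear to in this dataset, that is the interval starting at 08:35 rather than at the 835th minute of the day.

During this interval we observe about 206.17 steps on average, across all days.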
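
To make the interval identifier easier to read, we can turn it into a clock time. This is only a sketch: interval_to_clock is a hypothetical helper, and it assumes the identifiers encode hours and minutes as HHMM (so 835 means 08:35), which the data shown here does not prove on its own.

```r
# Hypothetical helper: turn an HHMM-style interval identifier into "HH:MM".
# Assumes identifiers such as 5, 100 or 835 mean 00:05, 01:00 and 08:35.
interval_to_clock <- function(interval) {
  padded <- sprintf("%04d", as.integer(interval))  # left-pad with zeros to 4 digits
  paste0(substr(padded, 1, 2), ":", substr(padded, 3, 4))
}

interval_to_clock(c(0, 5, 835))
```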

4. Imputing missing values

4.1. Total number of missing values

The total number of missing values can be computed by applying the following expression to our raw data.

We can do this because R treats FALSE as 0 and TRUE as 1 when summing logical values.

sumMissing <- sum(is.na(data$steps))

There are a total of 2304 missing values.

Alternatively we can sum the number of incomplete cases.

sum(!complete.cases(data))

## [1] 2304
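
The coercion of logical values can be seen on a tiny example. The data frame below is made up for illustration (toy is hypothetical); colSums() is used as one way to get the count of missing values for every column at once.

```r
# Made-up data: two of the four step counts are missing.
toy <- data.frame(steps = c(NA, 3, NA, 7), interval = c(0, 5, 10, 15))

# is.na() returns a logical vector; sum() counts the TRUE values.
missingSteps <- sum(is.na(toy$steps))

# Applied to the whole data frame, colSums() gives one count per column.
missingByColumn <- colSums(is.na(toy))

print(missingSteps)
print(missingByColumn)
```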

4.2. Devise a strategy for filling in all of the missing values

We will use the mean for the 5-minute interval, as activity is highly variable throughout the day.

We can subset our data so we get a data frame with just incomplete cases.

Then, we join the resulting table with the average number of steps by interval and use it to fill our missing data.

missingValues <- data[which(is.na(data$steps)), ] 
missingValues <- missingValues %>% 
                     inner_join(stepsByInterval, by="interval") %>% 
                         mutate(steps=averageSteps) %>% 
                             select(-averageSteps)

The result is a new table with missing values filled with the mean for the specific 5-minute interval in which they occur.

head(missingValues)

##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25

4.3. Create a new dataset with the missing data filled in

Now we need to replace the missing values in the original dataset with the values derived from our strategy.

We replicate our original data in a new table.

newData <- data

Then we fill the missing values with the values from our new table.

newData[which(is.na(newData$steps)), 1] <- missingValues[ , 1]
head(newData)

##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25

Mission accomplished.

sum(!complete.cases(newData))

## [1] 0
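
The replacement above relies on the joined table keeping the same row order as the missing rows in the original data. As an alternative sketch that does not depend on row order, we can look up each missing row's interval mean by name. The data and object names below (toy, intervalMeans, filled) are made up for illustration.

```r
# Made-up data: two intervals over three days, first day entirely missing.
toy <- data.frame(
  steps    = c(NA, NA, 4, 6, 8, 2),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval, ignoring missing values; names are the interval ids.
intervalMeans <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Fill each missing row by looking up the mean for its own interval.
filled <- toy
isMissing <- is.na(filled$steps)
filled$steps[isMissing] <- unname(intervalMeans[as.character(filled$interval[isMissing])])

print(filled)
```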

4.4. Measuring the impact of imputing missing data

We need to group our new data by day as we did before.

newStepsByDay <- newData %>% 
                     group_by(date) %>% 
                         summarise(totalSteps=sum(steps))

And the new histogram can be found below.

library(ggplot2)
qplot(totalSteps, 
      data=newStepsByDay, 
      geom="histogram", 
      binwidth=3000, 
      main="Histogram of the Total Number of Steps Taken Each Day\n", 
      ylab="Count of Days\n", 
      xlab="\nTotal Number of Steps per Day")

[Figure: histogram of the total number of steps taken each day, after imputation]

Not surprisingly, variability decreased and the distribution appears to be “thinner”, converging towards the center.

This happens because missing values appear in missing days: entire days for which there is no data.

Since we are replacing entire days with the same values, these days all end up with equal total steps.

tapply(missingValues$steps, as.factor(missingValues$date), sum)

## 2012-10-01 2012-10-08 2012-11-01 2012-11-04 2012-11-09 2012-11-10 
##   10766.19   10766.19   10766.19   10766.19   10766.19   10766.19 
## 2012-11-14 2012-11-30 
##   10766.19   10766.19

And this is why our distribution is now more centered: the total steps for these days are all equal to the daily mean.
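
Why the daily mean, exactly? When every observed day has a value for every interval, the sum of the interval means equals the mean of the daily totals. A small check on made-up numbers (toy is hypothetical, with two complete days and one missing day):

```r
# Two complete days and one entirely missing day, two intervals each.
toy <- data.frame(
  date     = rep(c("d1", "d2", "d3"), each = 2),
  interval = rep(c(0, 5), times = 3),
  steps    = c(10, 20, 30, 40, NA, NA),
  stringsAsFactors = FALSE
)
observed <- na.omit(toy)

# Mean per interval across the observed days, and total per observed day.
intervalMeans <- tapply(observed$steps, observed$interval, mean)
dailyTotals   <- tapply(observed$steps, observed$date, sum)

# A day filled with interval means gets this total ...
imputedDayTotal <- sum(intervalMeans)
# ... which matches the mean of the observed daily totals.
meanDailyTotal <- mean(dailyTotals)

print(c(imputedDayTotal, meanDailyTotal))
```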

Regarding the effect on the mean and median:

newStepsByDayCentralMeasures <- newStepsByDay %>% 
                                    summarise(meanStepsByDay=as.numeric(mean(totalSteps)), 
                                              medianStepsByDay=as.numeric(median(totalSteps)))
rbind(stepsByDayCentralMeasures, newStepsByDayCentralMeasures)

## Source: local data frame [2 x 2]
## 
##   meanStepsByDay medianStepsByDay
## 1       10766.19         10765.00
## 2       10766.19         10766.19

After filling in the missing data, the mean holds the same value while the median converged to the mean.
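
The same effect shows up on any set of numbers: adding extra values equal to the mean leaves the mean where it was and tends to pull the median towards it. A short illustration with made-up daily totals:

```r
# Made-up daily totals: the mean is 6, the median is 5.5.
totals <- c(1, 5, 6, 12)

# Add three "imputed" days, each equal to the original mean.
withImputed <- c(totals, rep(mean(totals), 3))

# Compare mean and median before and after.
before <- c(mean = mean(totals), median = median(totals))
after  <- c(mean = mean(withImputed), median = median(withImputed))

print(rbind(before, after))
```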

5. Differences in activity patterns

We want to explore the differences in activity patterns between weekdays and weekends.

We will use the dataset with the missing values filled in for this part.

5.1. Create a new factor variable in the dataset

We use the dplyr package to create a new column indicating whether a given date is a weekday or a weekend day.

After that, we convert the new column from character to factor so it can be used for faceting later on.

newData <- newData %>% 
               mutate(weekdays=ifelse(weekdays(date) == "Saturday" | 
                                      weekdays(date) == "Sunday", 
                                      "weekend", 
                                      "weekday"))
newData$weekdays <- as.factor(newData$weekdays)
str(newData)

## 'data.frame':    17568 obs. of  4 variables:
##  $ steps   : num  1.717 0.3396 0.1321 0.1509 0.0755 ...
##  $ date    : POSIXct, format: "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...
##  $ weekdays: Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
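
One caveat: weekdays() returns day names in the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A sketch of a locale-independent alternative in base R is shown below; it uses the numeric day of the week from as.POSIXlt(), where 0 is Sunday and 6 is Saturday. The dates and object names are made up for illustration.

```r
# Three example dates: a Monday, a Saturday and a Sunday.
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# Numeric day of the week: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
dayNumber <- as.POSIXlt(dates)$wday

# Label each date and store the result as a factor.
dayType <- factor(ifelse(dayNumber %in% c(0, 6), "weekend", "weekday"))

print(data.frame(dates, dayNumber, dayType))
```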

5.2. Make a panel plot containing the time series

First thing to do is to get our data ready for plotting.

Note that this time we have an additional feature to group by: weekdays.

newStepsByInterval <- newData %>%
                          group_by(interval, weekdays) %>%
                              summarise(averageSteps=mean(steps))
head(newStepsByInterval, n=10)

## Source: local data frame [10 x 3]
## Groups: interval
## 
##    interval weekdays averageSteps
## 1         0  weekday  2.251153040
## 2         0  weekend  0.214622642
## 3         5  weekday  0.445283019
## 4         5  weekend  0.042452830
## 5        10  weekday  0.173165618
## 6        10  weekend  0.016509434
## 7        15  weekday  0.197903564
## 8        15  weekend  0.018867925
## 9        20  weekday  0.098951782
## 10       20  weekend  0.009433962

qplot(x=interval,
      y=averageSteps,
      data=newStepsByInterval,
      geom="line",
      main="Average number of steps taken for each Interval, across all days\n",
      xlab="\nInterval",
      ylab="Number of Steps\n",
      facets=weekdays ~ .)

[Figure: panel plot of the average number of steps per interval, weekdays vs. weekend days]

The activity patterns are effectively distinct for weekdays and weekend days.