Reproducible Research: Peer Assessment 1

Background

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

This assignment makes use of data from a personal activity monitoring device. The device collects data at 5-minute intervals throughout the day. The data consist of two months of measurements from an anonymous individual, collected during October and November 2012, and include the number of steps taken in each 5-minute interval of each day.

The variables included in this dataset are:

steps: Number of steps taken in a 5-minute interval (missing values are coded as NA)
date: The date on which the measurement was taken, in YYYY-MM-DD format
interval: Identifier for the 5-minute interval in which the measurement was taken

The dataset is stored in a comma-separated-value (CSV) file and there are a total of 17,568 observations in this dataset.
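
The stated number of observations can be cross-checked with a little arithmetic. The sketch below uses only figures given in the text (5-minute intervals, all of October and November 2012); the variable names are illustrative.

```r
# Sanity check on the dataset size, using only facts stated in the text.
intervals_per_day <- 24 * 60 / 5   # number of 5-minute intervals in one day
days <- 31 + 30                    # days in October plus days in November
expected_rows <- intervals_per_day * days
expected_rows                      # 17568
```

The result matches the 17,568 observations reported for the dataset, which suggests that every interval of every day has a row, even when the step count itself is missing.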

The first step is to read the CSV file containing the data into R.

act_data <- read.csv("activity.csv", header = TRUE, sep = ",", stringsAsFactors = FALSE)

We quickly inspect the data, starting with its structure:

str(act_data)

## 'data.frame':    17568 obs. of  3 variables:
##  $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
##  $ date    : chr  "2012-10-01" "2012-10-01" "2012-10-01" "2012-10-01" ...
##  $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

The first line of this output also gives the dimensions of the data: 17,568 observations of 3 variables.

The first 5 rows:

head(act_data, 5)

##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20

The last 5 rows:

tail(act_data, 5)

##       steps       date interval
## 17564    NA 2012-11-30     2335
## 17565    NA 2012-11-30     2340
## 17566    NA 2012-11-30     2345
## 17567    NA 2012-11-30     2350
## 17568    NA 2012-11-30     2355

The date column was read as character strings, so we convert it to the Date class:

act_data$date <- as.Date(act_data$date, "%Y-%m-%d")

Data Analysis

1. What is the mean total number of steps taken per day?

library (dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

mean_steps <- act_data %>% group_by(date) %>%
      summarize(total.steps = sum(steps, na.rm = TRUE),
                mean.steps = mean(steps, na.rm = TRUE))

Now we can plot a histogram of the total number of steps per day.

library(ggplot2)
m <- ggplot(mean_steps, aes(x = total.steps))
m + geom_histogram(binwidth = 2500) +
      theme(axis.text = element_text(size = 13),
            axis.title = element_text(size = 14)) +
      labs(x = "Total steps/day", y = "Number of Occurrences")

From the histogram we can see a slight negative skew in the distribution of the data, with an abnormally high count in the first bar. That bar is inflated because sum() with the NA values removed returns 0 for a day on which every value is missing.
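
The way sum() treats a day with no recorded values can be reproduced on a small example. The sketch below uses made-up data and a hypothetical helper, total_or_na, which is not part of the original analysis; it only illustrates one possible way to keep such days out of the first histogram bar.

```r
# Toy data (hypothetical values): day "b" has no recorded steps at all.
toy <- data.frame(
  date  = rep(c("a", "b", "c"), each = 2),
  steps = c(10L, 20L, NA, NA, 5L, NA)
)

# With na.rm = TRUE, the sum over a day containing only NAs is 0.
totals_na_rm <- tapply(toy$steps, toy$date, sum, na.rm = TRUE)

# Hypothetical helper: keep the total undefined when the whole day is missing.
total_or_na <- function(x) {
  if (all(is.na(x))) NA_integer_ else sum(x, na.rm = TRUE)
}
totals_safe <- tapply(toy$steps, toy$date, total_or_na)

totals_na_rm   # a = 30, b = 0,  c = 5
totals_safe    # a = 30, b = NA, c = 5
```

Plotting totals like totals_safe would leave fully missing days out of the histogram instead of counting them as days with zero steps.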

Next we compute the five-number summary, together with the mean, of the daily mean number of steps per 5-minute interval:

summary(mean_steps$mean.steps)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  0.1424 30.7000 37.3800 37.3800 46.1600 73.5900       8

The summary reports 8 NA's: these are days with no recorded steps at all, for which the daily mean is undefined.

2. What is the daily activity pattern?

The data are grouped by 5-minute interval, after which the mean number of steps is calculated for each interval across all days, as illustrated below.

mean_int <- act_data %>% group_by(interval) %>%
      summarize(mean.steps = mean(steps, na.rm = T))
n <- ggplot(mean_int, aes(x = interval, y = mean.steps))
n + geom_line() + theme(axis.text = element_text(size = 12), 
      axis.title = element_text(size = 14, face = "bold")) + 
      labs(y = "Mean number of steps") + labs(x = "Interval")

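The mean number of steps peaks between intervals 500 and 1000. Judging from the identifiers seen in the data (0, 5, ..., 2355), the interval label encodes the time of day as hours and minutes, so this corresponds to the period between 05:00 and 10:00.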
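
The exact interval with the highest mean can be read off with which.max() rather than estimated from the plot. The sketch below runs on made-up values, not the real results, and label_to_time is a hypothetical helper that assumes the interval label is the time of day written as hours followed by minutes.

```r
# Toy per-interval means (hypothetical values, not the real results).
toy_int <- data.frame(
  interval   = c(0L, 5L, 10L, 15L),
  mean.steps = c(1.5, 7.25, 3.0, 0.5)
)

# Row holding the largest mean number of steps.
peak <- toy_int[which.max(toy_int$mean.steps), ]
peak$interval   # 5

# Hypothetical helper: turn an interval label such as 835 into "08:35",
# assuming the label is hours * 100 + minutes.
label_to_time <- function(interval) {
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}
label_to_time(c(0L, 500L, 1000L, 2355L))   # "00:00" "05:00" "10:00" "23:55"
```

Applied to mean_int, the same which.max() call would return the single busiest 5-minute interval of the average day.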

3. Imputing missing values

mean(is.na(act_data$steps))

## [1] 0.1311475

Approximately 13% of the steps values are missing, as shown above. In absolute terms:

sum(is.na(act_data$steps))

## [1] 2304

Let's check for missing values in the mean.steps column of mean_int, where we stored the mean number of steps for each 5-minute interval:

sum(is.na(mean_int$mean.steps))

## [1] 0

Now let us duplicate the data as follows:

new_act_data <- act_data

To fill in the missing values, we check at each row whether the steps column is NA. When it is, we take the interval of that row (index), look up this interval in mean_int, and store the matching row in a temporary variable, value. Finally, we take the column of interest from value, mean.steps, and assign it to the corresponding position in new_act_data. A for loop runs through all the rows.

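for (i in seq_len(nrow(new_act_data))) {
      if (is.na(new_act_data$steps[i])) {
            # look up the mean of the interval this row belongs to
            index <- new_act_data$interval[i]
            value <- subset(mean_int, interval == index)
            # replace the missing value with that interval mean
            new_act_data$steps[i] <- value$mean.steps
      }
}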
tail(new_act_data)

##           steps       date interval
## 17563 2.6037736 2012-11-30     2330
## 17564 4.6981132 2012-11-30     2335
## 17565 3.3018868 2012-11-30     2340
## 17566 0.6415094 2012-11-30     2345
## 17567 0.2264151 2012-11-30     2350
## 17568 1.0754717 2012-11-30     2355
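
The same imputation can also be written without a loop. The following is a minimal sketch of a vectorized alternative on made-up data, using match() to look up the interval mean for every missing row at once; the toy and filled objects are illustrative and not part of the original analysis.

```r
# Toy data (hypothetical values): interval 0 has missing step counts.
toy <- data.frame(
  steps    = c(NA, 4, 2, 6, NA, 8),
  interval = c(0L, 5L, 0L, 5L, 0L, 5L)
)

# Mean steps per interval; the formula interface drops NA rows by default.
toy_means <- aggregate(steps ~ interval, data = toy, FUN = mean)
names(toy_means)[names(toy_means) == "steps"] <- "mean.steps"

# For every missing row, find the position of its interval in toy_means
# and copy the corresponding mean in a single assignment.
missing <- is.na(toy$steps)
filled <- toy
filled$steps[missing] <-
  toy_means$mean.steps[match(toy$interval[missing], toy_means$interval)]

filled$steps   # 2 4 2 6 2 8
```

On a data set of this size the loop is fast enough, so the vectorized form is mainly a matter of brevity.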

Grouping the data by date we can construct the histogram.

new_mean <- new_act_data %>% group_by(date) %>%
      summarize(total.steps = sum(steps, na.rm = T))

g <- ggplot(new_mean, aes(x=total.steps))
g + geom_histogram(binwidth = 2500) + theme(axis.text = element_text(size = 12),
      axis.title = element_text(size = 14)) + labs(y = "Frequency") + labs(x = "Total steps/day")

The abnormally tall bar on the left is no longer inflated, because the days with missing values now carry imputed totals instead of 0. The data now exhibit a negatively skewed distribution around the mean.

4. Are there differences in activity patterns between weekdays and weekends?

We now explore whether the activity patterns differ between weekdays and weekends. The comparison below is visual; no formal significance test is carried out.

new_act_data$day <- ifelse(weekdays(new_act_data$date) %in% c("Saturday", "Sunday"), "weekend", "weekday")
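
Note that weekdays() returns day names in the language of the current locale, so comparing against "Saturday" and "Sunday" only works in an English locale. A minimal sketch of a locale-independent alternative, on a few example dates, uses the numeric day of the week from as.POSIXlt(); the variable names are illustrative.

```r
# Four consecutive dates from the study period: Friday to Monday.
dates <- as.Date(c("2012-10-05", "2012-10-06", "2012-10-07", "2012-10-08"))

# as.POSIXlt()$wday is numeric: 0 = Sunday, 1 = Monday, ..., 6 = Saturday.
wday_num <- as.POSIXlt(dates)$wday
day_type <- ifelse(wday_num %in% c(0, 6), "weekend", "weekday")

day_type   # "weekday" "weekend" "weekend" "weekday"
```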

Next we create two subsets, one containing the weekend and one containing the weekday data:

wend <- filter(new_act_data, day == "weekend")
wday <- filter(new_act_data, day == "weekday")

Since the day column is lost during the grouping, we add it again to the wend and wday data frames. Lastly, we merge both data sets into one named new_int.

wend <- wend %>%
      group_by(interval) %>%
      summarize(mean.steps = mean(steps)) 
wend$day <- "weekend"

wday <- wday %>%
      group_by(interval) %>%
      summarize(mean.steps = mean(steps)) 
wday$day <- "weekday"

new_int <- rbind(wend, wday)
new_int$day <- as.factor(new_int$day)
new_int$day <- relevel(new_int$day, "weekend")
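
The split, summarize, and rbind steps can also be collapsed into one grouping over both interval and day, which keeps the day column automatically. The sketch below shows the idea in base R on made-up data; with dplyr, the equivalent would be to call group_by(interval, day) before summarize().

```r
# Toy data (hypothetical values) with the day type already assigned.
toy <- data.frame(
  steps    = c(2, 4, 6, 8, 10, 12, 14, 16),
  interval = rep(c(0L, 5L), times = 4),
  day      = rep(c("weekday", "weekend"), each = 4)
)

# One mean per combination of interval and day type.
toy_int <- aggregate(steps ~ interval + day, data = toy, FUN = mean)
names(toy_int)[names(toy_int) == "steps"] <- "mean.steps"

# Put "weekend" first, mirroring the relevel() call used for new_int.
toy_int$day <- factor(toy_int$day, levels = c("weekend", "weekday"))

toy_int
```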

The two-panel plot is now created, using the day column as a factor to separate the weekday from the weekend time series.

g <- ggplot (new_int, aes (interval, mean.steps))
g + geom_line() + facet_grid (day~.) + theme(axis.text = element_text(size = 12), 
      axis.title = element_text(size = 14)) + labs(y = "Number of Steps") + labs(x = "Interval")

There is a marked difference between weekday and weekend activity, with the weekend showing more activity. The variance during the weekends is lower than during weekdays.
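
Statements about the level and variance of activity can be backed with numbers by summarizing the interval means per day type. The sketch below uses made-up values, so its output says nothing about the real data; it only shows one way such a check could look when applied to a data frame shaped like new_int.

```r
# Toy interval means (hypothetical values, not the real results).
toy_int <- data.frame(
  day        = rep(c("weekday", "weekend"), each = 3),
  mean.steps = c(0, 10, 50, 15, 20, 25)
)

# Mean and variance of the interval means, computed separately per day type.
stats <- sapply(
  split(toy_int$mean.steps, toy_int$day),
  function(x) c(mean = mean(x), variance = var(x))
)

stats   # rows: mean, variance; columns: weekday, weekend
```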

Reproducible Research: Peer Assessment 1 - Activity Data

Edzai C. Zvobwo

February 11, 2017