Missing Values Matter

Synopsis

This is the first project for the Reproducible Research course, which is part of Coursera’s Data Science Specialization and the final course of Coursera’s Data Science: Foundations Using R Specialization.

The project analyzes data from a personal activity monitoring device.

It is now possible to collect a large amount of data about personal movement using activity monitoring devices such as a Fitbit, Nike Fuelband, or Jawbone Up. These types of devices are part of the “quantified self” movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. But these data remain under-utilized, both because the raw data are hard to obtain and because there is a lack of statistical methods and software for processing and interpreting the data.

The tasks to be completed
- Code for reading in the data set and/or processing the data
- Histogram of the total number of steps taken each day
- Mean and median number of steps taken each day
- Time series plot of the average number of steps taken
- The 5-minute interval that, on average, contains the maximum number of steps
- Code to describe and show a strategy for imputing missing data
- Histogram of the total number of steps taken each day after missing values are imputed
- Panel plot comparing the average number of steps taken per 5-minute interval across weekdays and weekends
- All of the R code needed to reproduce the results (numbers, plots, etc.) in the report

Setup

Load packages & data


library("data.table"); library("dplyr"); library("lattice")
data_dir <- "./data/0904_DS-RR-w2_Activity"
zip_file <- file.path(data_dir, "activity.zip")
# create the data directory first, otherwise download.file() has nowhere to write
if (!dir.exists(data_dir)) dir.create(data_dir, recursive = TRUE)
# download the archive only if it is not already present
if (!file.exists(zip_file)) {
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2Factivity.zip",
                destfile = zip_file,
                method = "curl")
}
# extract activity.csv only if it has not been extracted yet
if (!file.exists(file.path(data_dir, "activity.csv"))) {
  unzip(zip_file, exdir = data_dir)
}

Reading and preprocessing the data

Read the data

activity <- fread("./data/0904_DS-RR-w2_Activity/activity.csv")

Process/transform the data into a format suitable for your analysis

Because the fread function is used, the data are already almost in a suitable format.

It only remains to rename the variables and inspect the structure.

activity <- activity %>% rename(NumberOfSteps = steps, Date = date,
                                FiveMinuteInterval = interval)
str(activity)

Classes 'data.table' and 'data.frame':  17568 obs. of  3 variables:
 $ NumberOfSteps     : int  NA NA NA NA NA NA NA NA NA NA ...
 $ Date              : IDate, format: "2012-10-01" "2012-10-01" ...
 $ FiveMinuteInterval: int  0 5 10 15 20 25 30 35 40 45 ...
 - attr(*, ".internal.selfref")=<externalptr>

There are a total of 17 568 observations in this data set.
The variables included in this data set are:
- steps (renamed NumberOfSteps): number of steps taken in a 5-minute interval (missing values are coded as NA)
- date (renamed Date): the date on which the measurement was taken, in YYYY-MM-DD format
- interval (renamed FiveMinuteInterval): identifier for the 5-minute interval in which the measurement was taken

What is mean total number of steps taken per day?

For this part of the assignment, the missing values in the data set can be ignored

Make a histogram of the total number of steps taken each day

Construct a new data set with the total number of steps per day, then draw a histogram.

TDsteps <- activity %>%
        filter(!is.na(NumberOfSteps)) %>%
        group_by(Date) %>%
        summarise(DailySteps = sum(NumberOfSteps))
histogram( ~ DailySteps, TDsteps, col = "cornflowerblue",
           main = "Histogram of the total number of steps taken each day")
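
The choice to drop NA rows before summing, rather than summing with na.rm = TRUE, affects which days enter the histogram. The following is a minimal base R sketch on made-up toy data (the day labels and step counts are hypothetical, not taken from the activity data set), showing that the second approach would report a total of 0 for a day without any recorded steps:

```r
# Toy data (hypothetical values): day "B" has no recorded steps at all
toy <- data.frame(
  Date          = rep(c("A", "B", "C"), each = 2),
  NumberOfSteps = c(100, 200, NA, NA, 50, 150)
)

# Option 1: drop NA rows first, then sum per day (as done in this report)
observed <- toy[!is.na(toy$NumberOfSteps), ]
dropped  <- tapply(observed$NumberOfSteps, observed$Date, sum)

# Option 2: keep all rows and let sum() skip the NAs
kept <- tapply(toy$NumberOfSteps, toy$Date, sum, na.rm = TRUE)

dropped        # only days A and C are present
kept           # day B is present with a total of 0
mean(dropped)  # 250
mean(kept)     # pulled down by the artificial 0 for day B
```

With the first option a day without data simply does not appear; with the second it appears as a zero-step day and lowers the mean and the median.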

Calculate and report the mean and median total number of steps taken per day

Apply mean and median to the variable DailySteps from the data set TDsteps.

TDsteps %>%
        summarise(meanTDsteps = mean(DailySteps),
                  medianTDsteps = median(DailySteps))

# A tibble: 1 x 2
  meanTDsteps medianTDsteps
        <dbl>         <int>
1      10766.         10765

The mean of the total number of steps taken per day is 10 766
The median of the total number of steps taken per day is 10 765

What is the average daily activity pattern?

Make a time series plot (i.e. type = "l") of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all days (y-axis)

Construct a new data set with the average number of steps per interval, then draw the plot.

AvgDsteps <- activity %>%
        filter(!is.na(NumberOfSteps)) %>%
        group_by(FiveMinuteInterval) %>%
        summarise(AvgIntSteps = mean(NumberOfSteps))
xyplot(AvgIntSteps ~ FiveMinuteInterval, AvgDsteps, type = "l",
       main = "Time series plot of the average number of steps taken")

Which 5-minute interval, on average across all the days in the data set, contains the maximum number of steps?

Filter the data set AvgDsteps for the row with the highest average.

AvgDsteps %>% filter(AvgIntSteps == max(AvgIntSteps)) %>%
        rename(maxInterval = FiveMinuteInterval, maxSteps = AvgIntSteps)

# A tibble: 1 x 2
  maxInterval maxSteps
        <int>    <dbl>
1         835     206.

Interval 835 contains the maximum average number of steps, about 206.
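
The interval identifiers run from 0 to 2355 in steps such as 0, 5, ..., 55, which suggests that they encode the time of day as hours and minutes (HHMM) rather than a count of minutes. Under that assumption, a small helper (interval_to_clock is a hypothetical name, not part of the original analysis) can translate an identifier into a clock time:

```r
# Assumption: the identifier encodes HHMM, e.g. 835 means 08:35
interval_to_clock <- function(interval) {
  interval <- as.integer(interval)
  sprintf("%02d:%02d", interval %/% 100L, interval %% 100L)
}

clock <- interval_to_clock(c(0, 835, 2355))
clock
```

If the assumption holds, the most active interval starts at 08:35 in the morning.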

Imputing missing values

Note that there are a number of days/intervals where there are missing values (coded as NA). The presence of missing days may introduce bias into some calculations or summaries of the data

Calculate and report the total number of missing values in the data set (i.e. the total number of rows with NAs)

TNAs <- sum(is.na(activity))

The total number of missing values is 2304
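
Note that sum(is.na(...)) counts missing cells, whereas the task asks for rows with NAs. The two numbers agree when only one column contains missing values, as appears to be the case here, but they can differ in general. A minimal sketch on hypothetical toy data:

```r
# Toy data (hypothetical): the first row is missing in both columns
toy <- data.frame(
  steps    = c(NA, 5, NA),
  interval = c(NA, 5, 10)
)

na_cells <- sum(is.na(toy))            # counts every missing cell
na_rows  <- sum(!complete.cases(toy))  # counts rows with at least one NA

na_cells
na_rows
```

Here na_cells is 3 while na_rows is 2, so complete.cases is the safer choice when several columns may contain NAs.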

Devise a strategy for filling in all of the missing values in the data set. The strategy does not need to be sophisticated. For example, you could use the mean/median for that day, or the mean for that 5-minute interval, etc.

Strategy: Replace each NA with the mean for the given 5-minute interval (AvgIntSteps)

Create a new data set that is equal to the original data set but with the missing data filled in

Create an auxiliary data set containing only the rows with NAs, loop over its rows to fill them in from AvgDsteps, and then combine them with the remaining rows of the original data set.

### create an auxiliary data set with only `NA`s ###
NAs <- activity %>% filter(is.na(NumberOfSteps))

### replace each `NA` value with data from `AvgDsteps`,
#   i.e. with the mean for the given 5-minute interval ###
for (i in seq_len(nrow(NAs))) {
  NAs$NumberOfSteps[i] <-
          AvgDsteps[which(NAs$FiveMinuteInterval[i] ==
                                  AvgDsteps$FiveMinuteInterval),]$AvgIntSteps
}
### combine new `NAs` w/ remaining data in the original data set `activity` ###
noNAs <- rbind(NAs, activity %>% filter(!is.na(NumberOfSteps)))
### function `summary` will show whether there are `NA`s in the new data set `noNAs` ###
summary(noNAs)

 NumberOfSteps         Date            FiveMinuteInterval
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0    
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8    
 Median :  0.00   Median :2012-10-31   Median :1177.5    
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5    
 3rd Qu.: 27.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2    
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0

There are no NAs in the new data set noNAs
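
The same strategy can also be written without a loop. The following is a sketch of one possible vectorized alternative in base R, not the method used above; it runs on hypothetical toy data and looks up each missing row's interval mean by name. Unlike the rbind approach, it keeps the original row order.

```r
# Toy data (hypothetical values): two intervals, two missing observations
toy <- data.frame(
  NumberOfSteps      = c(NA, 10, 4, NA, 20, 6),
  FiveMinuteInterval = c(0, 5, 0, 5, 5, 0)
)

# mean of the observed values per interval
# interval 0: (4 + 6) / 2 = 5, interval 5: (10 + 20) / 2 = 15
observed       <- toy[!is.na(toy$NumberOfSteps), ]
interval_means <- tapply(observed$NumberOfSteps,
                         observed$FiveMinuteInterval, mean)

# fill each NA with the mean of its own interval, looked up by name
imputed <- toy
missing <- is.na(imputed$NumberOfSteps)
imputed$NumberOfSteps[missing] <- unname(
  interval_means[as.character(imputed$FiveMinuteInterval[missing])]
)

imputed
```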

Make a histogram of the total number of steps taken each day and…

Construct a new data set with the total number of steps per day, then draw a histogram.

TDsteps <- noNAs %>%
        group_by(Date) %>%
        summarise(DailySteps = sum(NumberOfSteps))
histogram( ~ DailySteps, TDsteps, col = "cornflowerblue",
           main = "Histogram of the total number of steps\ntaken each day after missing values are imputed")

4a. … Calculate and report the mean and median total number of steps taken per day

4b. Do these values differ from the estimates from the first part of the assignment?

Apply mean and median to the variable DailySteps from the data set TDsteps.

TDsteps %>%
        summarise(meanTDsteps = mean(DailySteps),
                  medianTDsteps = median(DailySteps))

# A tibble: 1 x 2
  meanTDsteps medianTDsteps
        <dbl>         <dbl>
1      10766.        10766.

The mean of the total number of steps taken per day is 10 766, the same as in the first part of the assignment.
The median of the total number of steps taken per day is 10 766, slightly different from the 10 765 found in the first part of the assignment.

4c. What is the impact of imputing missing data on the estimates of the total daily number of steps?

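Imputing the missing data leaves the mean of the total daily number of steps unchanged and shifts the median slightly, so that the mean and the median now coincide. The 2304 missing values equal 8 × 288 intervals, which suggests that whole days are missing; every such day then receives the same imputed total (the sum of the interval means), which places the median on the mean.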
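
Why the mean stays put while the median moves can be illustrated on a small example. The sketch below uses hypothetical toy data with two intervals per day and assumes that a whole day is missing while all other days are complete; under that assumption the imputed day receives a total equal to the mean of the observed daily totals.

```r
# Toy data (hypothetical values): days d1-d3 are complete, d4 is entirely missing
toy <- data.frame(
  Date               = rep(c("d1", "d2", "d3", "d4"), each = 2),
  FiveMinuteInterval = rep(c(0, 5), times = 4),
  NumberOfSteps      = c(10, 20, 30, 60, 20, 100, NA, NA)
)

# daily totals before imputation (NA rows dropped): 30, 90, 120
observed <- toy[!is.na(toy$NumberOfSteps), ]
before   <- tapply(observed$NumberOfSteps, observed$Date, sum)

# impute with interval means: interval 0 -> 20, interval 5 -> 60
interval_means <- tapply(observed$NumberOfSteps,
                         observed$FiveMinuteInterval, mean)
filled  <- toy
missing <- is.na(filled$NumberOfSteps)
filled$NumberOfSteps[missing] <- unname(
  interval_means[as.character(filled$FiveMinuteInterval[missing])]
)

# daily totals after imputation: 30, 90, 120 and 80 for d4
after <- tapply(filled$NumberOfSteps, filled$Date, sum)

c(mean = mean(before), median = median(before))  # 80 and 90
c(mean = mean(after),  median = median(after))   # 80 and 85
```

The imputed day adds one more observation exactly at the mean, so the mean is unchanged and the median is pulled toward it.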

Are there differences in activity patterns between weekdays and weekends?

Use the data set with the filled-in missing values for this part

Create a new factor variable in the data set with two levels – “weekday” and “weekend” indicating whether a given date is a weekday or weekend day

Create the variable Wdays as numbers from 1 (Sunday) to 7 (Saturday), then replace the values 2 to 6 with weekday and the values 1 and 7 with weekend.

noNAs <- noNAs %>% mutate(Wdays = wday(Date)) %>%
      mutate(Wdays = if_else(Wdays == 1 | Wdays == 7, "weekend", "weekday")) %>%
      mutate(Wdays = as.factor(Wdays))  
str(noNAs)

Classes 'data.table' and 'data.frame':  17568 obs. of  4 variables:
 $ NumberOfSteps     : num  1.717 0.3396 0.1321 0.1509 0.0755 ...
 $ Date              : IDate, format: "2012-10-01" "2012-10-01" ...
 $ FiveMinuteInterval: int  0 5 10 15 20 25 30 35 40 45 ...
 $ Wdays             : Factor w/ 2 levels "weekday","weekend": 1 1 1 1 1 1 1 1 1 1 ...
 - attr(*, ".internal.selfref")=<externalptr>

There is a new factor variable Wdays with two levels: weekday and weekend
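
For readers who prefer not to depend on data.table's wday, the same classification can be sketched in base R. This is an illustrative alternative, not the code used above; it relies on as.POSIXlt, whose wday component runs from 0 (Sunday) to 6 (Saturday), so it does not depend on locale-specific day names.

```r
# 2012-10-01 was a Monday, 2012-10-06 a Saturday, 2012-10-07 a Sunday
dates <- as.Date(c("2012-10-01", "2012-10-06", "2012-10-07"))

# numeric day of the week: 0 = Sunday, ..., 6 = Saturday
day_index <- as.POSIXlt(dates)$wday

# two-level factor, with the same level names as in the report
day_type <- factor(ifelse(day_index %in% c(0, 6), "weekend", "weekday"),
                   levels = c("weekday", "weekend"))

data.frame(dates, day_index, day_type)
```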

Make a panel plot containing a time series plot of the 5-minute interval (x-axis) and the average number of steps taken, averaged across all weekday days or weekend days (y-axis)

Construct a new data set with the average number of steps per interval and day type, then draw a panel plot.

AvgDsteps <- noNAs %>%
        group_by(Wdays, FiveMinuteInterval) %>%
        summarise(AvgIntSteps = mean(NumberOfSteps))
xyplot(AvgIntSteps ~ FiveMinuteInterval | Wdays, data = AvgDsteps, type = "l",
       layout = c(1,2), xlab = "Interval", ylab = "Number of steps",
       main = "Panel plot comparing the average number of steps\ntaken per 5-minute interval across weekdays and weekends")

Session Info (for reproducibility)

R version 4.0.3 (2020-10-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] lattice_0.20-41   dplyr_1.0.2       data.table_1.13.4

loaded via a namespace (and not attached):
 [1] knitr_1.30       magrittr_2.0.1   tidyselect_1.1.0 R6_2.5.0        
 [5] rlang_0.4.9      fansi_0.4.1      stringr_1.4.0    tools_4.0.3     
 [9] grid_4.0.3       xfun_0.19        utf8_1.1.4       cli_2.2.0       
[13] htmltools_0.5.0  ellipsis_0.3.1   assertthat_0.2.1 yaml_2.2.1      
[17] digest_0.6.27    tibble_3.0.4     lifecycle_0.2.0  crayon_1.3.4    
[21] purrr_0.3.4      vctrs_0.3.5      glue_1.4.2       evaluate_0.14   
[25] rmarkdown_2.5    stringi_1.5.3    compiler_4.0.3   pillar_1.4.7    
[29] generics_0.1.0   pkgconfig_2.0.3