Data Imputation of Activity data

Objectives

The objective of this exercise is to perform the following operations:
- Read a CSV file
- Analyse the Activity data set
- Devise a strategy for imputing all the missing values in the dataset
- Prepare an Rmd file containing the above operations
- Publish the file on RPubs

Code & Output

knitr Global Options

# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=TRUE, message=TRUE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')

Load Libraries

library(dplyr)  
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read Data
Read the activity.csv file and display the first six rows of data.

cat("\014")

setwd("D:/R-BA/R-Scripts/data")
dfrActivityCSV <- read.csv("./activity.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrActivityCSV)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

Data Preparation and Imputation

Count the number of NA records before the Data Imputation process

dfrcount <- sum(is.na(dfrActivityCSV))  
dfrcount
## [1] 2304
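
The total above counts NA cells across every column of the data frame. As a minimal sketch of how the count could be broken down per column, the example below uses a small made-up data frame (dfrToy and its values are assumptions for illustration, not the activity data):

```r
# Toy data frame standing in for dfrActivityCSV (values are made up)
dfrToy <- data.frame(
  steps    = c(NA, 4, NA, 10),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

# Number of NA values in each column
naPerColumn <- colSums(is.na(dfrToy))
naPerColumn

# Number of NA values in the steps column only
naSteps <- sum(is.na(dfrToy$steps))
naSteps
```

If only one column contains missing values, the per-column count for that column equals the overall total.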

Remove the rows whose steps value is NA from the original dataset and store the result in a new data frame

dfrActivityMean <- subset(dfrActivityCSV, !is.na(steps))

Calculate the mean number of steps for each interval, rounded to two decimal places

dfrActivityMean <- summarise(group_by(dfrActivityMean, interval), steps=round(mean(steps),digits = 2))
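
For readers without dplyr, the same per-interval mean can be sketched in base R with aggregate(). The data frame dfrToy and its values are made up for illustration; the formula interface of aggregate() drops rows with NA by default, so no separate subsetting step is needed in this sketch:

```r
# Toy data (made up): two intervals, one missing steps value
dfrToy <- data.frame(
  steps    = c(2, 4, NA, 10, 20),
  interval = c(0, 0, 0, 5, 5)
)

# Mean steps per interval, rounded to two decimal places.
# The formula method uses na.action = na.omit, so the NA row is ignored.
dfrToyMean <- aggregate(
  steps ~ interval,
  data = dfrToy,
  FUN  = function(x) round(mean(x), digits = 2)
)
dfrToyMean
```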

Function to replace a missing steps value with the mean for its interval

# Return the interval mean when step is NA, otherwise return step unchanged
ImputeNA <- function(step, interval) {
  ifelse(is.na(step), dfrActivityMean[dfrActivityMean$interval == interval, ]$steps, step)
}

Call the ImputeNA function for every row using mapply, from the apply family of functions

dfrActivityNew <- dfrActivityCSV
dfrActivityNew$steps <- mapply(ImputeNA, dfrActivityNew$steps, dfrActivityNew$interval)
head(dfrActivityNew)
##   steps       date interval
## 1  1.72 2012-10-01        0
## 2  0.34 2012-10-01        5
## 3  0.13 2012-10-01       10
## 4  0.15 2012-10-01       15
## 5  0.08 2012-10-01       20
## 6  2.09 2012-10-01       25
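
The mapply call above looks up the interval mean once per row. A vectorised alternative, sketched here on made-up data, uses match() to find the lookup position for all rows at once. The names dfrToy, dfrToyMean and dfrToyNew are hypothetical and the values are not from the activity data:

```r
# Toy data (made up): steps is missing in rows 1 and 4
dfrToy <- data.frame(
  steps    = c(NA, 4, 2, NA, 10, 20),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Interval means from the non-missing rows
# interval 0: (2 + 10) / 2 = 6, interval 5: (4 + 20) / 2 = 12
dfrToyMean <- aggregate(steps ~ interval, data = dfrToy, FUN = mean)

# Position of each row's interval in the lookup table
idx <- match(dfrToy$interval, dfrToyMean$interval)

# Fill only the missing rows with the mean for their interval
dfrToyNew <- dfrToy
naRows <- is.na(dfrToyNew$steps)
dfrToyNew$steps[naRows] <- dfrToyMean$steps[idx[naRows]]
dfrToyNew
```

This should give the same kind of result as the row-wise approach while avoiding a function call per record.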

Count the number of NA records after the Data Imputation process

dfrcount <- sum(is.na(dfrActivityNew))
dfrcount
## [1] 0

Write the cleaned data to a new CSV file

write.csv(dfrActivityNew, "ActivityCleaned.csv", row.names=FALSE)
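
One way to confirm that a written file matches the data frame is to read it back and compare. The sketch below does this with a made-up two-row data frame and a temporary file; dfrToy, tmpFile and dfrCheck are hypothetical names, not part of the original script:

```r
# Toy data (made up) standing in for the cleaned data frame
dfrToy <- data.frame(
  steps    = c(1.72, 0.34),
  date     = c("2012-10-01", "2012-10-01"),
  interval = c(0L, 5L),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read it back
tmpFile <- tempfile(fileext = ".csv")
write.csv(dfrToy, tmpFile, row.names = FALSE)
dfrCheck <- read.csv(tmpFile, header = TRUE, stringsAsFactors = FALSE)

# Compare the round-tripped data with the original
roundTripOk <- isTRUE(all.equal(dfrToy, dfrCheck))
roundTripOk

# Remove the temporary file
unlink(tmpFile)
```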

Summary Report

How many NA records were there before the imputation process?
- The Activity.csv file had 17569 records, of which 2304 were NA.
Which records still contain NA after the imputation process?
- None of the records contain NA after the imputation process.

Objective

The objective of imputing the missing data was met: every NA in steps was replaced with the mean for its interval, using a user-defined function and mapply from the apply family.