The objective of this exercise is to perform the following operations:
- Read a CSV file
- Analyse the Activity data set
- Devise a strategy for imputing all the missing values in the dataset
- Prepare an RMD file containing the above operations
- Publish the file on RPubs
knitr Global Options
# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=TRUE, message=TRUE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
Load Libraries
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Read Data
Read the activity.csv file and display the first six rows of data.
cat("\014")
setwd("D:/R-BA/R-Scripts/data")
dfrActivityCSV <- read.csv("./activity.csv", header=TRUE, stringsAsFactors=FALSE)
head(dfrActivityCSV)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
Data Preparation and Imputation
Count the number of NA values before the data imputation process
dfrcount <- sum(is.na(dfrActivityCSV))
dfrcount
## [1] 2304
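The total above counts NA cells across every column. A per-column breakdown shows which columns actually contain the missing values. The following is a minimal sketch on a toy data frame; `dfrToy`, `intTotalNA`, and `vecNAByColumn` are hypothetical names and the values are made up, not taken from activity.csv.

```r
# Toy stand-in for the activity data (hypothetical values, not the real file)
dfrToy <- data.frame(
  steps    = c(NA, 4, NA, 10, 6, 3),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2),
  stringsAsFactors = FALSE
)

# Total NA cells across the whole data frame (same idea as dfrcount)
intTotalNA <- sum(is.na(dfrToy))

# NA cells broken down by column
vecNAByColumn <- colSums(is.na(dfrToy))
print(vecNAByColumn)
```

If only the steps column reports a non-zero count, imputing that one column is enough to clear every NA.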
Remove the rows whose steps value is NA from the original dataset and store the result in a new data frame
dfrActivityMean <- subset(dfrActivityCSV, !is.na(steps))
Calculate the mean number of steps for each interval, rounded to two decimal places
dfrActivityMean <- summarise(group_by(dfrActivityMean, interval), steps=round(mean(steps),digits = 2))
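As a cross-check on the per-interval means, the same calculation can be sketched in base R. This is an alternative, not the method used in this report: the formula interface of `aggregate()` drops rows with NA by default, so no separate filtering step is needed. `dfrToy` and `dfrToyMean` are hypothetical names with made-up values.

```r
# Toy stand-in for the activity data (hypothetical values, not the real file)
dfrToy <- data.frame(
  steps    = c(NA, 4, NA, 10, 6, 3),
  interval = rep(c(0, 5, 10), times = 2)
)

# Mean of steps per interval; rows with NA steps are omitted by default
dfrToyMean <- aggregate(steps ~ interval, data = dfrToy, FUN = mean)

# Round to two decimal places, as in the dplyr version
dfrToyMean$steps <- round(dfrToyMean$steps, digits = 2)
print(dfrToyMean)
```

Note that an interval whose values are all NA would not appear in the result at all, so this sketch assumes every interval has at least one observed value.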
Function to replace a missing (NA) steps value with the mean for its interval
ImputeNA <- function(step, interval)
{
  # Return the interval mean when step is NA; otherwise keep the original value
  ifelse(is.na(step), dfrActivityMean[dfrActivityMean$interval == interval, ]$steps, step)
}
Call the ImputeNA function on every row using mapply, from the apply family of functions
dfrActivityNew <- dfrActivityCSV
dfrActivityNew$steps <- mapply(ImputeNA, dfrActivityNew$steps, dfrActivityNew$interval)
head(dfrActivityNew)
## steps date interval
## 1 1.72 2012-10-01 0
## 2 0.34 2012-10-01 5
## 3 0.13 2012-10-01 10
## 4 0.15 2012-10-01 15
## 5 0.08 2012-10-01 20
## 6 2.09 2012-10-01 25
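The row-by-row `mapply` call performs one lookup per record. A vectorized variant of the same idea can be sketched with base R's `ave()`, which returns the group mean aligned to each row. This is an alternative, not the approach used in this report; `imputeByGroupMean` and `dfrToy` are hypothetical names, and unlike the report's version the means here are not rounded.

```r
# Replace NA entries in `values` with the mean of their group
imputeByGroupMean <- function(values, groups) {
  # ave() returns, for each element, the mean of its group (NAs ignored)
  groupMeans <- ave(values, groups, FUN = function(x) mean(x, na.rm = TRUE))
  # Keep observed values; fill only the missing ones
  ifelse(is.na(values), groupMeans, values)
}

# Toy stand-in for the activity data (hypothetical values, not the real file)
dfrToy <- data.frame(
  steps    = c(NA, 4, NA, 10, 6, 3),
  interval = rep(c(0, 5, 10), times = 2)
)

dfrToy$steps <- imputeByGroupMean(dfrToy$steps, dfrToy$interval)
print(dfrToy)
```

If a group contains only NA values, its mean is NaN and those rows stay missing, so a check such as `anyNA()` after imputation is still worthwhile.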
Count the number of NA values after the data imputation process
dfrcount <- sum(is.na(dfrActivityNew))
dfrcount
## [1] 0
Write the cleaned data to a new CSV file
write.csv(dfrActivityNew, "ActivityCleaned.csv", row.names=FALSE)
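To confirm that the written file round-trips cleanly, it can be read back and checked. The following is a minimal sketch using a temporary file and a toy data frame; `dfrToyClean`, `strPath`, and `dfrReadBack` are hypothetical names and the values are illustrative only.

```r
# Toy cleaned data (hypothetical values), standing in for dfrActivityNew
dfrToyClean <- data.frame(
  steps    = c(1.72, 0.34, 0.13),
  date     = rep("2012-10-01", 3),
  interval = c(0L, 5L, 10L),
  stringsAsFactors = FALSE
)

# Write to a temporary file so the sketch does not touch the working directory
strPath <- tempfile(fileext = ".csv")
write.csv(dfrToyClean, strPath, row.names = FALSE)

# Read the file back and confirm the row count and the absence of NA values
dfrReadBack <- read.csv(strPath, header = TRUE, stringsAsFactors = FALSE)
blnSameRows <- nrow(dfrReadBack) == nrow(dfrToyClean)
blnNoNA     <- !anyNA(dfrReadBack)
```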
What is the number of NA records before the data imputation process?
- activity.csv had 17568 records (17569 lines in the file, counting the header row), of which 2304 had NA in the steps column.
After the imputation process, which records still have NA values?
- None of the records have NA values after the imputation process.
The missing-data strategy, replacing each NA steps value with the mean for its interval, was implemented successfully using a user-defined function and mapply from the apply family.