This report works with a large personal-movement dataset collected by a monitoring device that records data at 5-minute intervals throughout the day. The steps column of the dataset contains missing (NA) values, and the code below replaces each NA value with the mean number of steps for that particular 5-minute interval.
knitr Global Options
# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=TRUE, message=TRUE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')
Working Directory
setwd("D:/R-BA/R-scripts")
Load Libraries
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Read Data
# Reading the activity .csv file
dfrActivity <- read.csv("./data/activity.csv", header=TRUE, stringsAsFactors=FALSE)
intRowCount <- nrow(dfrActivity)
head(dfrActivity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
Total rows in the activity file: 17568
Checking for NA Values
1. Whether any exist (TRUE/FALSE)
2. Which columns contain them, and how many
any(is.na(dfrActivity))
## [1] TRUE
colSums(is.na(dfrActivity))
## steps date interval
## 2304 0 0
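The two checks above can also be expressed as a share of rows rather than a raw count. The following is a minimal sketch on a made-up toy dataframe, not the real activity file; the names dfrToy, intNaCount and numNaShare are hypothetical and only mirror the naming style used in this report.

```r
# Toy dataframe standing in for the activity data (values are made up)
dfrToy <- data.frame(
  steps    = c(NA, 4, NA, 10),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5)
)

# Number of NA values in each column
intNaCount <- colSums(is.na(dfrToy))

# Share of rows with a missing value in each column
numNaShare <- colMeans(is.na(dfrToy))

print(intNaCount)
print(numNaShare)
```

Because is.na() returns a logical matrix for a dataframe, colSums() counts the TRUE entries and colMeans() gives their proportion.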
Creating a dataframe that stores the mean steps for each interval, rounded up with ceiling
dfrActivityInterval <-summarise(group_by(dfrActivity,interval), mean(steps,na.rm=TRUE))
names(dfrActivityInterval)[2] <- "stepsMean"
dfrActivityInterval$stepsMean <- ceiling(dfrActivityInterval$stepsMean)
head(dfrActivityInterval)
## # A tibble: 6 × 2
## interval stepsMean
## <int> <dbl>
## 1 0 2
## 2 5 1
## 3 10 1
## 4 15 1
## 5 20 1
## 6 25 3
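To make the grouping step easier to follow, here is a small sketch of the same idea in base R on made-up data. It is an illustration only, assuming a toy dataframe dfrToy (hypothetical) and using tapply instead of the dplyr calls above.

```r
# Toy data: two intervals, three days each (values are made up)
dfrToy <- data.frame(
  steps    = c(NA, 3, 4, NA, 0, 1),
  interval = c(0, 0, 0, 5, 5, 5)
)

# Mean of steps within each interval, ignoring the NA values
numMeans <- tapply(dfrToy$steps, dfrToy$interval, mean, na.rm = TRUE)

# Round every mean up to the next whole number, as the report does
numMeansCeil <- ceiling(numMeans)

print(numMeansCeil)
```

Note that ceiling always rounds up, so an interval with a mean of 0.5 steps is stored as 1. Using round instead would be another reasonable choice; the report uses ceiling.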
Replacing the NA values in the steps column with the character string "NA" to facilitate the ifelse operation (this converts the column to character)
class(dfrActivity$steps)
## [1] "integer"
testind <- which(is.na(dfrActivity$steps))
dfrActivity$steps[testind] <- "NA"
head(dfrActivity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
class(dfrActivity$steps)
## [1] "character"
Performing an inner join of the main dataframe with the dataframe holding the interval means
dfrJoined <- inner_join(dfrActivity, dfrActivityInterval)
## Joining, by = "interval"
head(dfrJoined)
## steps date interval stepsMean
## 1 NA 2012-10-01 0 2
## 2 NA 2012-10-01 5 1
## 3 NA 2012-10-01 10 1
## 4 NA 2012-10-01 15 1
## 5 NA 2012-10-01 20 1
## 6 NA 2012-10-01 25 3
A function getVal that returns the interval mean where steps is "NA" and the original steps value otherwise
getVal <- function(p_steps,p_stepsMean){
v_steps <- ifelse(p_steps == "NA",p_stepsMean,p_steps)
return(v_steps)
}
Using mapply to apply getVal element-wise to the steps and stepsMean columns so that each row gets the correct steps value
dfrJoined$steps <- mapply(getVal,dfrJoined$steps,dfrJoined$stepsMean)
head(dfrJoined)
## steps date interval stepsMean
## 1 2 2012-10-01 0 2
## 2 1 2012-10-01 5 1
## 3 1 2012-10-01 10 1
## 4 1 2012-10-01 15 1
## 5 1 2012-10-01 20 1
## 6 3 2012-10-01 25 3
Changing the class of steps column back to numeric
class(dfrJoined$steps)
## [1] "character"
dfrJoined$steps <- as.numeric(dfrJoined$steps)
class(dfrJoined$steps)
## [1] "numeric"
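The round trip through the character class can be avoided by testing for missing values with is.na directly. The sketch below is one possible alternative on made-up data, not the method used in this report: it assumes a toy dataframe dfrToy and a helper vector numStepsMean (both hypothetical) and uses base R's ave to compute the per-interval mean for every row, so no join or type conversion is needed.

```r
# Toy data: two intervals observed on three days (values are made up)
dfrToy <- data.frame(
  steps    = c(NA, 3, 4, NA, 0, 1),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), times = 2),
  interval = c(0, 0, 0, 5, 5, 5)
)

# ave() returns one value per row: here the rounded-up mean of steps
# for that row's interval, computed while ignoring NA values
numStepsMean <- ave(
  dfrToy$steps, dfrToy$interval,
  FUN = function(x) ceiling(mean(x, na.rm = TRUE))
)

# Replace only the missing entries; observed values are kept as they are
dfrToy$steps <- ifelse(is.na(dfrToy$steps), numStepsMean, dfrToy$steps)

print(dfrToy)
class(dfrToy$steps)
```

Since ifelse is already vectorised, this version needs neither mapply nor as.numeric, and the steps column stays numeric throughout.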
Final dataframe: removing the unnecessary column
head(dfrJoined)
## steps date interval stepsMean
## 1 2 2012-10-01 0 2
## 2 1 2012-10-01 5 1
## 3 1 2012-10-01 10 1
## 4 1 2012-10-01 15 1
## 5 1 2012-10-01 20 1
## 6 3 2012-10-01 25 3
dfrImputedActivityData <- select(dfrJoined, -stepsMean)
head(dfrImputedActivityData)
## steps date interval
## 1 2 2012-10-01 0
## 2 1 2012-10-01 5
## 3 1 2012-10-01 10
## 4 1 2012-10-01 15
## 5 1 2012-10-01 20
## 6 3 2012-10-01 25
Comparing the final dataframe with the dataframe read from the activity .csv file and checking for any remaining NA values
1. Initial Dataframe
2. Final Dataframe
3. Checking for NA Values
head(dfrActivity)
## steps date interval
## 1 NA 2012-10-01 0
## 2 NA 2012-10-01 5
## 3 NA 2012-10-01 10
## 4 NA 2012-10-01 15
## 5 NA 2012-10-01 20
## 6 NA 2012-10-01 25
head(dfrImputedActivityData)
## steps date interval
## 1 2 2012-10-01 0
## 2 1 2012-10-01 5
## 3 1 2012-10-01 10
## 4 1 2012-10-01 15
## 5 1 2012-10-01 20
## 6 3 2012-10-01 25
any(is.na(dfrImputedActivityData))
## [1] FALSE
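Beyond confirming that no NA values remain, it can be useful to check that the imputation left the observed values untouched. The following is a minimal sketch on made-up data; dfrBefore, dfrAfter, blnObserved and intImputed are hypothetical names. In this report dfrActivity$steps is overwritten with the string "NA" earlier on, so a check like this would need a copy of the data taken before that step.

```r
# Toy "before" and "after" dataframes (values are made up)
dfrBefore <- data.frame(
  steps    = c(NA, 3, 4, NA, 0, 1),
  interval = c(0, 0, 0, 5, 5, 5)
)
dfrAfter <- data.frame(
  steps    = c(4, 3, 4, 1, 0, 1),
  interval = c(0, 0, 0, 5, 5, 5)
)

# Rows where a step count was actually observed
blnObserved <- !is.na(dfrBefore$steps)

stopifnot(
  nrow(dfrBefore) == nrow(dfrAfter),   # no rows gained or lost
  !anyNA(dfrAfter$steps),              # nothing left missing
  all(dfrBefore$steps[blnObserved] == dfrAfter$steps[blnObserved])  # observed values unchanged
)

# Number of values that were filled in
intImputed <- sum(!blnObserved)
print(intImputed)
```

stopifnot halts with an error as soon as one condition is false, which makes it a convenient guard at the end of an imputation script.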
Writing the imputed data into a new file
write.csv(dfrImputedActivityData, "ModifiedActivityData.csv", row.names=FALSE)
1. Initially the data had 2304 NA values
2. Data imputation logic was applied to replace all the NA values with the mean steps for the corresponding 5-minute interval, rounded up
3. The final dataframe had 0 NA values.