Analysis Of Activity Data

Overview of the Project

A large dataset on personal movement is provided, collected from an activity monitoring device that records the number of steps at 5-minute intervals throughout the day. The steps column contains missing (NA) values. The code in this report replaces each NA value with the mean number of steps for that particular 5-minute interval, rounded up to a whole number.

Code & Output

knitr Global Options

# for development
knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=TRUE, warning=TRUE, message=TRUE, cache=FALSE, tidy=FALSE, fig.path='figures/')
# for production
#knitr::opts_chunk$set(echo=TRUE, eval=TRUE, error=FALSE, warning=FALSE, message=FALSE, cache=FALSE, tidy=FALSE, fig.path='figures/')

Working Directory

setwd("D:/R-BA/R-scripts")

Load Libraries

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Read Data

# Reading the activity .csv file
dfrActivity <- read.csv("./data/activity.csv", header=TRUE, stringsAsFactors=FALSE)
intRowCount <- nrow(dfrActivity)
head(dfrActivity)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

Total Rows Of Activity File: 17568

Checking for NA values
1. Whether any exist (TRUE/FALSE)
2. Which columns contain them, and how many

any(is.na(dfrActivity))   
## [1] TRUE
colSums(is.na(dfrActivity))
##    steps     date interval 
##     2304        0        0

Creating a dataframe that stores the mean number of steps for each interval, rounded up with ceiling

dfrActivityInterval <- summarise(group_by(dfrActivity, interval), stepsMean = mean(steps, na.rm=TRUE))
dfrActivityInterval$stepsMean <- ceiling(dfrActivityInterval$stepsMean)
head(dfrActivityInterval)
## # A tibble: 6 × 2
##   interval stepsMean
##      <int>     <dbl>
## 1        0         2
## 2        5         1
## 3       10         1
## 4       15         1
## 5       20         1
## 6       25         3
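
To make the per-interval mean concrete, here is a minimal sketch on toy data. The values and the names dfrToy and dfrToyMeans are made up for illustration and are not part of the activity dataset. It uses base R aggregate instead of dplyr so that it runs without extra packages; the formula interface drops rows with NA steps by default, which plays the same role as na.rm=TRUE above.

```r
# Toy data (hypothetical values): two days, three intervals, one missing reading
dfrToy <- data.frame(
  steps    = c(NA, 4, 0, 3, 5, 1),
  date     = rep(c("2012-10-01", "2012-10-02"), each = 3),
  interval = rep(c(0, 5, 10), times = 2)
)

# Mean steps per interval; rows with NA steps are omitted by the formula interface
dfrToyMeans <- aggregate(steps ~ interval, data = dfrToy, FUN = mean)
names(dfrToyMeans)[2] <- "stepsMean"

# Round each interval mean up to a whole number of steps
dfrToyMeans$stepsMean <- ceiling(dfrToyMeans$stepsMean)
dfrToyMeans
```

Interval 0 has only one observed value (3), so its mean is 3; intervals 5 and 10 average to 4.5 and 0.5, which ceiling rounds up to 5 and 1.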

Replacing the NA values in the steps column with the character string "NA" to facilitate the ifelse operation. Note that this coerces the whole column from integer to character.

class(dfrActivity$steps)
## [1] "integer"
testind <- which(is.na(dfrActivity$steps))
dfrActivity$steps[testind] <- "NA"
head(dfrActivity)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
class(dfrActivity$steps)
## [1] "character"

Performing an inner join of the main dataframe with the dataframe holding the interval means, matched on interval

dfrJoined <- inner_join(dfrActivity, dfrActivityInterval)
## Joining, by = "interval"
head(dfrJoined)
##   steps       date interval stepsMean
## 1    NA 2012-10-01        0         2
## 2    NA 2012-10-01        5         1
## 3    NA 2012-10-01       10         1
## 4    NA 2012-10-01       15         1
## 5    NA 2012-10-01       20         1
## 6    NA 2012-10-01       25         3

A function getVal that returns the interval mean when steps is the string "NA" and the existing steps value otherwise

getVal <- function(p_steps,p_stepsMean){
  v_steps <- ifelse(p_steps == "NA",p_stepsMean,p_steps)
  return(v_steps)
}

Using mapply to apply getVal element by element over the steps and stepsMean columns, so that each row gets the correct step value

dfrJoined$steps <- mapply(getVal,dfrJoined$steps,dfrJoined$stepsMean)
head(dfrJoined)
##   steps       date interval stepsMean
## 1     2 2012-10-01        0         2
## 2     1 2012-10-01        5         1
## 3     1 2012-10-01       10         1
## 4     1 2012-10-01       15         1
## 5     1 2012-10-01       20         1
## 6     3 2012-10-01       25         3

Changing the class of steps column back to numeric

class(dfrJoined$steps)
## [1] "character"
dfrJoined$steps <- as.numeric(dfrJoined$steps)
class(dfrJoined$steps)
## [1] "numeric"
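
The round trip through character (NA to "NA", mapply, then as.numeric) could probably be avoided, because ifelse is vectorised and is.na can test a numeric column directly. The following is a minimal sketch of that alternative on toy data, not the approach used in this report; dfrToyJoined and its values are hypothetical stand-ins for the joined dataframe.

```r
# Toy data (hypothetical values) standing in for the joined dataframe
dfrToyJoined <- data.frame(
  steps     = c(NA, 4, 0, NA),
  interval  = c(0, 5, 10, 15),
  stepsMean = c(2, 1, 1, 3)
)

# Where steps is missing take the interval mean, otherwise keep the observed value
dfrToyJoined$steps <- ifelse(is.na(dfrToyJoined$steps),
                             dfrToyJoined$stepsMean,
                             dfrToyJoined$steps)

dfrToyJoined$steps
class(dfrToyJoined$steps)
```

Because the column never becomes character, no conversion back to numeric is needed afterwards.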

Final dataframe: removing the unnecessary column

head(dfrJoined)
##   steps       date interval stepsMean
## 1     2 2012-10-01        0         2
## 2     1 2012-10-01        5         1
## 3     1 2012-10-01       10         1
## 4     1 2012-10-01       15         1
## 5     1 2012-10-01       20         1
## 6     3 2012-10-01       25         3
dfrImputedActivityData <- select(dfrJoined, -stepsMean) 
head(dfrImputedActivityData)
##   steps       date interval
## 1     2 2012-10-01        0
## 2     1 2012-10-01        5
## 3     1 2012-10-01       10
## 4     1 2012-10-01       15
## 5     1 2012-10-01       20
## 6     3 2012-10-01       25

Comparing the final dataframe with the dataframe read from the activity .csv file and checking for any remaining NA values
1. Initial Dataframe
2. Final Dataframe
3. Checking for NA Values

head(dfrActivity)
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
head(dfrImputedActivityData)
##   steps       date interval
## 1     2 2012-10-01        0
## 2     1 2012-10-01        5
## 3     1 2012-10-01       10
## 4     1 2012-10-01       15
## 5     1 2012-10-01       20
## 6     3 2012-10-01       25
any(is.na(dfrImputedActivityData))
## [1] FALSE
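
Beyond confirming that no NA values remain, it may be worth checking that the observed values were left unchanged. The sketch below assumes a hypothetical helper, checkImputation, that compares an original steps vector with an imputed one. In this report dfrActivity$steps was overwritten with "NA" strings, so the original vector would have to come from a fresh read of the file or from a copy taken before that step.

```r
# Hypothetical helper: TRUE when every gap is filled and observed values are untouched
checkImputation <- function(p_original, p_imputed) {
  observed <- !is.na(p_original)
  !any(is.na(p_imputed)) && all(p_imputed[observed] == p_original[observed])
}

# Gaps filled, observed values (4 and 0) preserved
okResult <- checkImputation(c(NA, 4, 0, NA), c(2, 4, 0, 3))

# Second observed value changed from 4 to 5, so the check fails
badResult <- checkImputation(c(NA, 4, 0, NA), c(2, 5, 0, 3))

okResult
badResult
```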

Writing the imputed data into a new file

write.csv(dfrImputedActivityData, "ModifiedActivityData.csv", row.names=FALSE)

—————————————————-

Summary Report

—————————————————

1. Initially the data had 2304 NA values
2. Data imputation logic was applied to replace every NA value with the mean number of steps for its interval, rounded up.
3. The final dataframe had 0 NA values.