Data Imputation and Analysis of Activity Data

Objectives

The primary objective of this project is to check the following skills of the participants attending the Business Analytics Using R workshop:
- Basic R Concepts
- Reading a File / Writing a File
- Data Imputation

Problem Definition

Provide R code to read the activity.csv file and impute its missing values:   
- Read the data file    
- Data Cleaning and Imputation
- Writing a file  

Dataset

Dataset : activity.csv
The variables included in this dataset are:
  - steps
      Number of steps taken in a 5-minute interval (missing values are coded as NA)
  - date
      The date on which the measurement was taken in YYYY-MM-DD format
  - interval
      Identifier for the 5-minute interval in which measurement was taken

The dataset contains a total of 17,568 observations.

Overview of Project

Nowadays, companies collect data about personal movement using activity monitoring devices such as Fitbit, Nike Fuelband, or Jawbone Up.  
The activity data is a large set of readings from one such personal monitoring device.
The device collects data at 5-minute intervals throughout the
day. The data covers two months from an anonymous
individual, collected during October and November 2012,
and includes the number of steps taken in each 5-minute interval of each day.
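
The figures above can be cross-checked with a little arithmetic. The sketch below assumes the device logged every 5-minute interval on every day from 1 October through 30 November 2012; the text implies this but does not state it outright, so treat the date range as an assumption.

```r
# Assumed date range: 1 October 2012 through 30 November 2012, inclusive.
days <- seq(as.Date("2012-10-01"), as.Date("2012-11-30"), by = "day")

# Number of 5-minute intervals in one day.
intervals_per_day <- 24 * 60 / 5

# Expected row count if every interval of every day is present.
expected_rows <- length(days) * intervals_per_day
print(expected_rows)
```

Under that assumption the result is 61 days of 288 intervals each, which agrees with the 17,568 observations listed in the Dataset section.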

Code & Output

Set working Directory

# Point R at the folder that contains activity.csv
setwd("D:/Welingkar/Trim 3/R/Assignment")

Read Data

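# Read the raw activity data, keeping the date column as character (not factor)
dfractivity_raw <- read.csv("./activity.csv", header = TRUE, stringsAsFactors = FALSE)
# Store the total number of rows for the report text
intRowCount <- nrow(dfractivity_raw)
# Preview the first six rows
head(dfractivity_raw)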
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25

Total rows in the activity file: 17568
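
As an optional variation, the column types can be declared when the file is read, so that the date column arrives as a Date rather than as text. This is only a sketch: it writes a small made-up CSV to a temporary file (the `toy_path` and `toy` names and the values are illustrative, not taken from activity.csv) so that it runs without the real data.

```r
# Write a small made-up CSV so the example runs without activity.csv.
toy_path <- tempfile(fileext = ".csv")
writeLines(c("steps,date,interval",
             "NA,2012-10-01,0",
             "12,2012-10-01,5",
             "3,2012-10-02,0"),
           toy_path)

# Declare the column classes up front; "Date" expects YYYY-MM-DD text.
toy <- read.csv(toy_path,
                header = TRUE,
                na.strings = "NA",
                colClasses = c("integer", "Date", "integer"))

# Inspect the resulting column types.
str(toy)
```

Declaring the types this way makes a malformed column fail at read time instead of silently becoming character data.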

Find the Number of Missing Records

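# Count the rows that contain at least one NA
Num_miss <- sum(!complete.cases(dfractivity_raw))

# Count the rows with no NA in any column
Num_complete <- sum(complete.cases(dfractivity_raw))

# Fraction of rows with missing data (multiply by 100 for the percentage)
num_per_miss <- Num_miss/(Num_miss+Num_complete)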

Total number of rows with missing data: 2304
Total number of rows with complete data: 15264
13.1147541% of the rows have missing data
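
The percentage follows directly from the two counts reported above. The sketch below reproduces that arithmetic and then shows, on a small made-up data frame (`toy` is illustrative, not the real data), how `colSums(is.na(...))` reveals which columns hold the missing values.

```r
# Counts taken from the report above.
num_miss <- 2304
num_complete <- 15264

# Percentage of rows with at least one missing value.
pct_miss <- 100 * num_miss / (num_miss + num_complete)
print(round(pct_miss, 7))

# Toy data frame with made-up values, in the same layout as activity.csv.
toy <- data.frame(
  steps    = c(NA, 4, NA, 7),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

# Number of NA values in each column.
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

A per-column count is a useful complement to `complete.cases()`, which only reports whether a row has a gap somewhere.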

Data Imputation Process

Method 1 - Compute the mean number of steps for each 5-minute interval (base R)

# Average the steps within each interval, ignoring NA values
dfr_mean <- aggregate(list(steps = dfractivity_raw$steps), by = list(interval = dfractivity_raw$interval), FUN = mean, na.rm = TRUE)

Method 2 - Compute the mean number of steps for each 5-minute interval (dplyr)

# Load dplyr for group_by() and summarise()
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.3.3
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Keep only the rows where steps was recorded
dfr_mean <- subset(dfractivity_raw, !is.na(steps))
# Average the recorded steps within each interval
dfr_mean <- summarise(group_by(dfr_mean, interval), steps = mean(steps))
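
One way to gain confidence in the per-interval means is to compute them a second way and compare. The sketch below uses only base R so that it runs without dplyr: it compares `aggregate()` with `tapply()` on a small made-up data frame (`toy` and its values are illustrative).

```r
# Toy data with made-up values: two intervals, three days.
toy <- data.frame(
  steps    = c(NA, 4, 2, 8, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Per-interval mean with aggregate(), ignoring NA values.
by_aggregate <- aggregate(list(steps = toy$steps),
                          by = list(interval = toy$interval),
                          FUN = mean, na.rm = TRUE)

# The same per-interval mean with tapply().
by_tapply <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# Both approaches should agree interval by interval.
means_agree <- isTRUE(all.equal(by_aggregate$steps, as.numeric(by_tapply)))
print(by_aggregate)
print(means_agree)
```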

Method 1 - Data Imputation

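# Return the interval mean from dfr_mean when a step count is missing;
# otherwise return the observed step count unchanged
replace_na <- function(step, interval) {
  ifelse(is.na(step), dfr_mean[dfr_mean$interval == interval, ]$steps, step)
}

# Work on a copy so the raw data stays untouched, then impute row by row
dfr_activity_new <- dfractivity_raw
dfr_activity_new$steps <- mapply(replace_na, dfr_activity_new$steps, dfr_activity_new$interval)
head(dfr_activity_new)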
##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25
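
The row-by-row `mapply()` call works, but the same imputation can also be written in vectorized form. The following is a minimal sketch of that alternative using `ave()`, shown on made-up data (`toy`, `interval_mean`, and `toy_imputed` are illustrative names); it is not the method used in this report.

```r
# Toy data with made-up values: two intervals over three days.
toy <- data.frame(
  steps    = c(NA, 4, 2, 8, NA, 6),
  date     = rep(c("2012-10-01", "2012-10-02", "2012-10-03"), each = 2),
  interval = rep(c(0, 5), times = 3),
  stringsAsFactors = FALSE
)

# For every row, the mean of the observed steps in that row's interval.
interval_mean <- ave(toy$steps, toy$interval,
                     FUN = function(x) mean(x, na.rm = TRUE))

# Fill only the missing entries; observed values are left as they are.
toy_imputed <- toy
missing <- is.na(toy_imputed$steps)
toy_imputed$steps[missing] <- interval_mean[missing]
print(toy_imputed)
```

Because `ave()` returns a vector aligned with the original rows, no lookup into a separate table of means is needed.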

Data Check

# Count the rows that still contain an NA after imputation
Num_Miss1 <- sum(!complete.cases(dfr_activity_new))
# Count the rows that are now complete
Num_Comp1 <- sum(complete.cases(dfr_activity_new))

Number of records after cleaning the data: 17568
Number of records with NA values: 0
Number of complete records: 17568
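
Beyond counting NA values, it can be worth confirming that imputation left the observed values alone. The sketch below illustrates such a check on two small made-up data frames (`before` and `after` are illustrative stand-ins for the raw and imputed data).

```r
# Made-up data before and after imputation.
before <- data.frame(steps = c(NA, 4, 2, 8), interval = c(0, 5, 0, 5))
after  <- data.frame(steps = c(2, 4, 2, 8),  interval = c(0, 5, 0, 5))

# Rows whose step count was observed in the raw data.
observed <- !is.na(before$steps)

# Three sanity checks on the imputed data.
checks <- c(
  no_missing_left    = !anyNA(after$steps),
  row_count_same     = nrow(before) == nrow(after),
  observed_unchanged = all(before$steps[observed] == after$steps[observed])
)
print(checks)
```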

Writing the Data in File

# Save the imputed data to a new CSV file, without the row-name column
write.csv(dfr_activity_new, "Activity_New.csv", row.names = FALSE)
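
A quick way to confirm that a file was written as intended is to read it back and compare it with the data in memory. The sketch below does this with a temporary file and made-up values (`out_path`, `toy`, and `round_trip` are illustrative names), so it does not touch Activity_New.csv.

```r
# Made-up imputed data in the same layout as the output file.
toy <- data.frame(
  steps    = c(1.5, 4, 2),
  date     = c("2012-10-01", "2012-10-01", "2012-10-01"),
  interval = c(0L, 5L, 10L),
  stringsAsFactors = FALSE
)

# Write to a temporary file, then read the file back.
out_path <- tempfile(fileext = ".csv")
write.csv(toy, out_path, row.names = FALSE)
round_trip <- read.csv(out_path, stringsAsFactors = FALSE)

# Compare column names, row count, and the step values.
same_names <- identical(names(round_trip), names(toy))
same_rows  <- nrow(round_trip) == nrow(toy)
same_steps <- isTRUE(all.equal(round_trip$steps, toy$steps))
print(c(same_names, same_rows, same_steps))
```

With `row.names = FALSE` the file holds only the three data columns, so the column names survive the round trip unchanged.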

Summary

The activity data file had 17568 records in total, of which 2304 (13.1147541%) had missing values.  
Each missing value was replaced with the mean number of steps for its 5-minute interval.  
After data cleaning and imputation there are 17568 records, all of them complete and none with missing values.

Conclusion

This was a useful exercise: it introduced data imputation and gave hands-on practice in cleaning and imputing data with R.