The primary objective of this project is to check the following skills of the participants attending the Business Analytics Using R workshop:
- Basic R Concepts
- Reading a file / Writing a file
- Data Imputation
Provide R code to read the activity.csv file and do imputation
- Read the data file
- Data Cleaning and Imputation
- Writing a file
Dataset : activity.csv
The variables included in this dataset are:
- steps
Number of steps taken in a 5-minute interval (missing values are coded as NA)
- date
The date on which the measurement was taken in YYYY-MM-DD format
- interval
Identifier for the 5-minute interval in which measurement was taken
There are a total of 17,568 observations in this dataset.
Nowadays companies collect data about personal movement using activity monitoring devices such as Fitbit, Nike Fuelband or Jawbone Up.
The activity data is a large amount of data from a personal activity monitoring device.
This device collects data at 5-minute intervals throughout the
day. The data consists of two months of data from an anonymous
individual, collected during the months of October and November 2012,
and includes the number of steps taken in 5-minute intervals each day.
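The observation count can be cross-checked against this description. The following is a minimal sketch, assuming the device recorded every 5-minute interval on every day from 2012-10-01 through 2012-11-30 with no gaps; the variable names are illustrative and not part of the assignment code.

```r
# Assumption: one record per 5-minute interval, every day,
# from 1 October 2012 through 30 November 2012.
n_days <- as.integer(as.Date("2012-11-30") - as.Date("2012-10-01")) + 1

# Number of 5-minute intervals in a 24-hour day
intervals_per_day <- (24 * 60) / 5

# Expected number of rows in the dataset under the assumption above
expected_rows <- n_days * intervals_per_day
expected_rows
```

Under that assumption the product is 61 days times 288 intervals, which agrees with the 17,568 observations stated for the dataset.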
Set Working Directory
# Point R at the folder that contains activity.csv (adjust the path for your machine)
setwd("D:/Welingkar/Trim 3/R/Assignment")
Read Data
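# Read the raw activity data; keep the date column as character rather than factor
dfractivity_raw <- read.csv("./activity.csv", header = TRUE, stringsAsFactors = FALSE)
# Store the total number of rows for reporting
intRowCount <- nrow(dfractivity_raw)
# Preview the first six rows
head(dfractivity_raw)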
##   steps       date interval
## 1    NA 2012-10-01        0
## 2    NA 2012-10-01        5
## 3    NA 2012-10-01       10
## 4    NA 2012-10-01       15
## 5    NA 2012-10-01       20
## 6    NA 2012-10-01       25
Total rows in the activity file: 17568
Find the Number of Missing Records
# Count rows with at least one missing value, and rows with none
Num_miss <- sum(!complete.cases(dfractivity_raw))
Num_complete <- sum(complete.cases(dfractivity_raw))
# Percentage of rows with missing data
num_per_miss <- 100 * Num_miss / (Num_miss + Num_complete)
Total number of rows where data is not available: 2304
Total number of rows having complete data: 15264
13.1147541% of the data is missing
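Row-level counts do not show which column holds the missing values. As a minimal sketch, the per-column count can be obtained with `colSums(is.na(...))`; the `toy` data frame below is a made-up stand-in for activity.csv, and its values are purely illustrative.

```r
# Toy data frame standing in for activity.csv (values are invented for illustration)
toy <- data.frame(
  steps    = c(NA, 4, NA, 10),
  date     = c("2012-10-01", "2012-10-01", "2012-10-02", "2012-10-02"),
  interval = c(0, 5, 0, 5),
  stringsAsFactors = FALSE
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))

# Percentage of rows with at least one missing value
pct_rows_missing <- 100 * mean(!complete.cases(toy))

na_per_column
pct_rows_missing
```

Running the same two expressions on `dfractivity_raw` would show whether the missing values are confined to the `steps` column.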
Method 1 - Find the mean number of steps for each 5-minute interval
# Average steps per interval in base R; na.rm = TRUE drops missing values before averaging
dfr_mean <- aggregate(list(steps = dfractivity_raw$steps), by = list(interval = dfractivity_raw$interval), FUN = mean, na.rm = TRUE)
Method 2 - Find the mean number of steps for each 5-minute interval
# Same calculation using dplyr
library(dplyr)
# Keep only rows where steps is observed, then average by interval
dfr_mean <- subset(dfractivity_raw, !is.na(steps))
dfr_mean <- summarise(group_by(dfr_mean, interval), steps = mean(steps))
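Since both methods are meant to produce the same interval means, it can be useful to cross-check the result with an independent calculation. The sketch below uses base R only and compares `aggregate()` against `tapply()` on a small invented data frame; `toy`, `by_aggregate` and `by_tapply` are illustrative names, not part of the assignment code.

```r
# Toy data: interval 0 has one observed value, interval 5 has three
toy <- data.frame(
  steps    = c(NA, 4, 2, 10, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Interval means via aggregate(), ignoring missing values
by_aggregate <- aggregate(list(steps = toy$steps),
                          by = list(interval = toy$interval),
                          FUN = mean, na.rm = TRUE)

# The same means via tapply(), as an independent check
by_tapply <- tapply(toy$steps, toy$interval, mean, na.rm = TRUE)

# TRUE when both approaches agree
means_agree <- isTRUE(all.equal(by_aggregate$steps, as.numeric(by_tapply)))
means_agree
```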
Method 1 - Data Imputation
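# Return the interval mean from dfr_mean when a step count is missing; otherwise keep the observed value
replace_na <- function(step, interval) {
  ifelse(is.na(step), dfr_mean[dfr_mean$interval == interval, ]$steps, step)
}
# Work on a copy so the raw data stays unchanged
dfr_activity_new <- dfractivity_raw
# Apply the replacement to each (steps, interval) pair
dfr_activity_new$steps <- mapply(replace_na, dfr_activity_new$steps, dfr_activity_new$interval)
head(dfr_activity_new)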
##       steps       date interval
## 1 1.7169811 2012-10-01        0
## 2 0.3396226 2012-10-01        5
## 3 0.1320755 2012-10-01       10
## 4 0.1509434 2012-10-01       15
## 5 0.0754717 2012-10-01       20
## 6 2.0943396 2012-10-01       25
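The row-by-row `mapply()` call performs a lookup for every record. A vectorised alternative, shown here only as a sketch and not as the method used in this report, fills all missing values in one step with `match()`. The data frame `toy` and the names `interval_means`, `imputed` and `missing` are hypothetical.

```r
# Toy data: two missing step counts in interval 0
toy <- data.frame(
  steps    = c(NA, 4, 2, 10, NA, 6),
  interval = c(0, 5, 0, 5, 0, 5)
)

# Mean steps per interval; the formula interface omits NA rows by default
interval_means <- aggregate(steps ~ interval, data = toy, FUN = mean)

# Copy the data, then replace only the missing entries
imputed <- toy
missing <- is.na(imputed$steps)
imputed$steps[missing] <- interval_means$steps[
  match(imputed$interval[missing], interval_means$interval)
]

imputed
```

Observed values are left untouched because the assignment is restricted to the positions flagged in `missing`.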
Data Check
# Recount incomplete and complete rows after imputation
Num_Miss1 <- sum(!complete.cases(dfr_activity_new))
Num_Comp1 <- sum(complete.cases(dfr_activity_new))
Number of records after cleaning the data: 17568
Number of records with NA values: 0
Number of complete records: 17568
Writing the Data to a File
# Save the imputed data; row.names = FALSE omits the row-number column
write.csv(dfr_activity_new, "Activity_New.csv", row.names = FALSE)
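One way to confirm that a file was written as intended is to read it back and compare it with the original data frame. The following is a minimal sketch on invented data, writing to a temporary file rather than to Activity_New.csv; `toy`, `out_path` and `read_back` are illustrative names.

```r
# Toy data frame standing in for the imputed activity data
toy <- data.frame(
  steps    = c(1.5, 4),
  date     = c("2012-10-01", "2012-10-01"),
  interval = c(0L, 5L),
  stringsAsFactors = FALSE
)

# Write to a temporary file so no real output is overwritten
out_path <- tempfile(fileext = ".csv")
write.csv(toy, out_path, row.names = FALSE)

# Read the file back and compare shape, column names and values
read_back <- read.csv(out_path, stringsAsFactors = FALSE)
same_shape <- identical(dim(read_back), dim(toy))
same_names <- identical(names(read_back), names(toy))
same_steps <- isTRUE(all.equal(read_back$steps, toy$steps))

# Clean up the temporary file
unlink(out_path)
```

Without `row.names = FALSE`, the file would contain an extra leading column of row numbers, so the column names would no longer match on read-back.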
The activity data file had a total of 17568 records, of which 13.1147541% had missing data.
To fill in the missing data, each missing value was replaced with the mean for that 5-minute interval.
After data cleaning and imputation there are 17568 records, of which 0 have missing values and all 17568 are complete.
It was a good exercise that helped in learning about data imputation and in carrying out data cleaning and imputation in R.