Introduction
This project is part of the Practical Machine Learning course in the Johns Hopkins Data Science Specialization. The purpose of the project is to choose an appropriate machine learning model for the given data set and to make predictions on the test data set.
Project description
Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available on the Weight Lifting Exercise Dataset website.
Project goal
The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Data
The training data and the 20 test cases for this project are provided as two .csv files (pml-training.csv and pml-testing.csv); the download URLs are given in the Data Wrangling section below. The outcome to predict is the classe variable, and any of the other variables may be used as predictors.
Data Wrangling
Required Libraries
The following libraries are imported. Not all of them end up being used in the final analysis, but they are loaded anyway.
if (!require(tidyr)) {
  install.packages("tidyr")
  library(tidyr)
}
if (!require(dplyr)) {
  install.packages("dplyr")
  library(dplyr)
}
if (!require(caret)) {
  install.packages("caret")
  library(caret)
}
if (!require(glmnet)) {
  install.packages("glmnet")
  library(glmnet)
}
if (!require(VIM)) {
  install.packages("VIM")
  library(VIM)
}
if (!require(randomForest)) {
  install.packages("randomForest")
  library(randomForest)
}
if (!require(rpart)) {
  install.packages("rpart")
  library(rpart)
}
if (!require(rpart.plot)) {
  install.packages("rpart.plot")
  library(rpart.plot)
}
if (!require(rattle)) {
  install.packages("rattle")
  library(rattle)
}
## Loading required package: rattle
## Loading required package: tibble
## Loading required package: bitops
##
## Attaching package: 'bitops'
## The following object is masked from 'package:Matrix':
##
## %&%
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
## The following object is masked from 'package:VIM':
##
## wine
Required data sets
# download the datasets
train_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_file <- paste(getwd(),"train_data.csv", sep = "/")
test_file <- paste(getwd(),"test_data.csv",sep = "/")
if (!file.exists(train_file)) {
  download.file(url = train_url, destfile = train_file)
}
if (!file.exists(test_file)) {
  download.file(url = test_url, destfile = test_file)
}
Import the data sets by reading the .csv files; '#DIV/0!', empty strings, and 'NA' are treated as missing values.
# Importing Data
training <- read.csv("train_data.csv",
                     na.strings = c('#DIV/0!', '', 'NA'),
                     stringsAsFactors = FALSE)
testing <- read.csv("test_data.csv",
                    na.strings = c('#DIV/0!', '', 'NA'),
                    stringsAsFactors = FALSE)
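As a quick sanity check (my own addition, not part of the original assignment code), the dimensions of the imported data frames can be inspected before any cleaning:
# Illustrative sanity check: confirm both files imported with the expected shape
dim(training)   # 19622 observations of raw sensor data
dim(testing)    # the 20 held-out test cases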
Changing variable classes
# training data: change variable class
training$new_window <- as.factor(training$new_window)
training$kurtosis_yaw_belt <- as.numeric(training$kurtosis_yaw_belt)
training$skewness_yaw_belt <- as.numeric(training$skewness_yaw_belt)
training$kurtosis_yaw_dumbbell <- as.numeric(training$kurtosis_yaw_dumbbell)
training$skewness_yaw_dumbbell <- as.numeric(training$skewness_yaw_dumbbell)
training$cvtd_timestamp <- as.factor(training$cvtd_timestamp)
# testing data: change variable class
testing$new_window <- as.factor(testing$new_window)
testing$kurtosis_yaw_belt <- as.numeric(testing$kurtosis_yaw_belt)
testing$skewness_yaw_belt <- as.numeric(testing$skewness_yaw_belt)
testing$kurtosis_yaw_dumbbell <- as.numeric(testing$kurtosis_yaw_dumbbell)
testing$skewness_yaw_dumbbell <- as.numeric(testing$skewness_yaw_dumbbell)
testing$cvtd_timestamp <- as.factor(testing$cvtd_timestamp)
Several skewness and kurtosis variables were coerced from character to numeric, and the new_window and cvtd_timestamp variables were coerced to factor, since the time stamps are categorical in nature. The class conversions applied to the training set were replicated on the test set.
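The same coercions could also be written more compactly; the sketch below is an equivalent alternative using dplyr (assuming dplyr >= 1.0 for across()), not the code used above.
# Equivalent, more compact class coercion using dplyr (illustrative alternative)
coerce_classes <- function(df) {
  df %>%
    mutate(
      new_window     = as.factor(new_window),
      cvtd_timestamp = as.factor(cvtd_timestamp),
      across(c(kurtosis_yaw_belt, skewness_yaw_belt,
               kurtosis_yaw_dumbbell, skewness_yaw_dumbbell),
             as.numeric)
    )
}
training <- coerce_classes(training)
testing  <- coerce_classes(testing)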
Handling Missing Values
For a better understanding of the missing values, a visualisation from the VIM package can be used to show where missing values occur in the data set.
# The plot shows several variables with high proportions of missing data, with some variables nearly missing entirely.
aggr(training)
To quantify how much data is missing in the training data set:
# How much data is missing?
sum(is.na(training))/(dim(training)[1]*dim(training)[2])
## [1] 0.6131835
To understand the missing values by variable:
# Missing values fraction by column / variable
missCol <- apply(training, 2, function(x) sum(is.na(x)/length(x)))
hist(missCol, main = "Missing Data by Variable")
To check how many predictors have more than 90% of their values missing:
missIndCol <- which(missCol > 0.9)
#Number of predictors > 90% missing
length(missIndCol)
## [1] 100
Sixty-one percent of the total data is missing, and one hundred variables have more than ninety percent of their values missing. These variables were removed, along with unnecessary columns such as the row number and the raw timestamps.
# Remove variables
## remove missing variables from training and test set
train_missing_values <- training[, -missIndCol]
test_missing_values <- testing[,-missIndCol]
## remove raw count variable and raw time stamps
train_clean <- train_missing_values[,-c(1,3,4)]
test_clean <- test_missing_values[,-c(1,3,4)]
# The plot confirms that no missing data remains after removing the sparse variables.
aggr(train_clean)
To confirm that no missing values or incomplete cases remain:
sum(!complete.cases(train_clean))
## [1] 0
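As an extra check (a small sketch I added, not in the original analysis), the cleaned training and test sets should share the same predictor columns, differing only in the outcome column classe versus the test-case identifier in the downloaded test file.
# Illustrative check: cleaned sets should differ only in the final column
setdiff(names(train_clean), names(test_clean))
setdiff(names(test_clean), names(train_clean))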
Machine Learning Model
Decision tree model
First, I split the train_clean data into a training set (train_set) and a testing set (test_set). Even though the terminology is confusing, since the original testing data is reserved for the final 20 predictions, this split lets me evaluate the models on data they were not trained on.
# creating training set and testing test from train_clean data set
in_train <- createDataPartition(
  train_clean$classe, p = 0.70,
  list = FALSE
)
train_set <- train_clean[in_train, ]
test_set <- train_clean[-in_train, ]
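A quick check (illustrative, not part of the original code) confirms that createDataPartition preserved the class proportions across the split:
# Illustrative sanity check: the stratified split should keep similar class proportions
round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(test_set$classe)), 3)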
Here, I fit a decision tree model:
set.seed(1967)
fit_DT <- rpart(classe~., data = train_set, method = "class")
fancyRpartPlot(fit_DT)
predict_DT <- predict(fit_DT, newdata = test_set, type="class")
conf_matrix_DT <- confusionMatrix(table(predict_DT, test_set$classe))
conf_matrix_DT
## Confusion Matrix and Statistics
##
##
## predict_DT A B C D E
## A 1606 170 0 0 0
## B 22 812 85 55 0
## C 46 148 926 110 45
## D 0 9 14 710 83
## E 0 0 1 89 954
##
## Overall Statistics
##
## Accuracy : 0.851
## 95% CI : (0.8416, 0.86)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8111
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9594 0.7129 0.9025 0.7365 0.8817
## Specificity 0.9596 0.9659 0.9282 0.9785 0.9813
## Pos Pred Value 0.9043 0.8337 0.7263 0.8701 0.9138
## Neg Pred Value 0.9835 0.9334 0.9783 0.9499 0.9736
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2729 0.1380 0.1573 0.1206 0.1621
## Detection Prevalence 0.3018 0.1655 0.2167 0.1387 0.1774
## Balanced Accuracy 0.9595 0.8394 0.9154 0.8575 0.9315
plot(conf_matrix_DT$table, col = conf_matrix_DT$byClass,
main = paste("Decision Tree Model: Predictive Accuracy =",
round(conf_matrix_DT$overall['Accuracy'], 4)))
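From the confusion matrix above, an estimate of the decision tree's out of sample error can be read off directly (a one-line sketch I added):
# Estimated out-of-sample error of the decision tree = 1 - hold-out accuracy
1 - conf_matrix_DT$overall['Accuracy']
# roughly 0.149, given the 0.851 accuracy reported above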
10-fold Cross-validation Random Forest
For this project a random forest model is used for the final predictions. To get an accurate estimate of its performance, the random forest is trained with 10-fold cross-validation. This takes a bit of time to run, so be patient.
set.seed(1967)
modFor <- train(classe ~ .,
                data = train_clean,
                method = "rf",
                trControl = trainControl(method = "cv", number = 10, verboseIter = FALSE),
                na.action = na.pass)
modFor
## Random Forest
##
## 19622 samples
## 56 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 17658, 17660, 17660, 17661, 17660, 17661, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9864436 0.9828469
## 40 0.9988278 0.9985173
## 78 0.9980125 0.9974860
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 40.
getTrainPerf(modFor)
## TrainAccuracy TrainKappa method
## 1 0.998726 0.9983886 rf
plot(modFor, main="RF Model Accuracy by number of predictors" ,cex=4)
Cross-validated accuracy is nearly 100%, and the expected out of sample error is less than 0.2%. I see no reason to seek a better model. I anticipate a small reduction in accuracy when the model is applied to our testing observations.
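To make the "less than 0.2%" figure concrete, the expected out of sample error can be derived from the cross-validated accuracy (a minimal sketch using the getTrainPerf() result shown above):
# Expected out-of-sample error from the 10-fold cross-validated accuracy
1 - getTrainPerf(modFor)$TrainAccuracy
# approximately 0.0013 (about 0.13%), consistent with the statement above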
Prediction and Visualization
prediction_test <- predict(modFor, newdata = test_clean)
prediction_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
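For the course quiz, the 20 predictions can be written to individual text files. The helper below is only a sketch; the function name and file naming scheme are my own choice, not something mandated by the assignment.
# Illustrative helper (assumed naming scheme): write one text file per test case
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i],
                file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_predictions(prediction_test)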
Conclusion
Data wrangling and treating missing observations were essential to this analysis. The result of these actions was a training set of complete cases, and no further pre-processing was required. A random forest model fit to this training data with 10-fold cross-validation and caret's default tuning produced nearly 100% cross-validated accuracy in prediction, a positive result.