Applications of machine learning are now ubiquitous. In this project we work with data from the Human Activity Recognition study (http://groupware.les.inf.puc-rio.br/har), whose authors investigated the use of computing to evaluate “proper” exercise form (possibly allowing computers to replace personal trainers to help us become better, faster, stronger).
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, machine learning algorithms and techniques will be used to build a model, and we will check whether that model can predict the way an exercise is being performed. A successful model would also support the argument that trainers could be replaced by machines that “correct” exercising technique with greater accuracy.
In the study referenced above, the data were obtained by attaching sensors (inertial measurement units) to both the study participants and the weights, to measure motion as the exercises were performed. Each participant was instructed to perform an exercise in five different ways (one “correct” way and four “incorrect” ways).
#Loading the required packages
library(caret) # For model training, cross-validation tuning and confusion matrices
library(caTools) # For splitting the training data into training/validation sets
library(randomForest) # For fitting random forest models
library(rpart) # For classification and regression trees (CART)
library(rpart.plot) # For plotting the output of CART models
library(e1071) # Required by caret's cross-validation and tuning routines
library(rattle) # Another library for visually appealing CART plots
Now that the libraries are loaded, let’s download the data. Many cells in the raw files contain ‘#DIV/0!’, so we pass that string (along with empty strings) to the na.strings argument so they are read as NA.
rm(list=ls()) # Clearing all objects from the workspace
setwd("C:/Users/rruj/Desktop") # Setting the working directory
#Downloading and reading the data
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(url(trainURL), header = T, na.strings = c("NA","#DIV/0!",""))
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(url(testURL), header = T, na.strings = c("NA","#DIV/0!",""))
dim(train) # Checking the dimensions of the training set
## [1] 19622 160
dim(test) # Checking the dimensions of the testing set
## [1] 20 160
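Before dropping anything, it helps to see how the missingness is distributed across columns. A quick base-R tabulation (a small check, not in the original write-up) shows that columns are either essentially complete or almost entirely NA, so the 70% cutoff used below separates them cleanly.
naFrac <- colMeans(is.na(train)) # Fraction of NA values in each column
table(naFrac > 0.7) # Columns are either nearly complete or nearly all NA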
NAcount <- apply(train, 2, function(x) sum(is.na(x))) # Counts the number of NA's in each column
train <- train[,!NAcount/nrow(train) >= 0.7] # Dropping columns that are at least 70% NA
test <- test[,names(test) %in% names(train)] # Keeping the same columns in the test set
train <- train[,-c(1:7)] # Dropping the first seven metadata columns (row id, user name, timestamps, windows)
test <- test[,-c(1:7)]
dim(train) # Checking the dimensions of the training set
## [1] 19622 53
dim(test) # Checking the dimensions of the testing set
## [1] 20 52
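As an extra sanity check (an optional step, not part of the original workflow), caret’s nearZeroVar() can confirm that none of the remaining predictors are degenerate:
nzv <- nearZeroVar(train, saveMetrics = TRUE) # Variance diagnostics per column
sum(nzv$nzv) # Expected to be 0 after the cleaning above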
We are down to just 53 columns from the initial 160. Good going! Let’s now split the training data into two parts, using the standard 70%-30% split. The library used here is caTools, a wonderful library for splitting datasets while preserving the class balance of the target column.
set.seed(144) # Setting seed to ensure the results are reproducible.
split <- sample.split(train$classe, SplitRatio = 0.7)
training <- train[split,]
testing <- train[!split,]
dim(training) # Checking the dimensions of the training set
## [1] 13735 53
dim(testing) # Checking the dimensions of the testing set
## [1] 5887 53
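To verify that sample.split preserved the class balance, we can compare the class proportions in the two partitions (a quick check, not in the original report):
round(prop.table(table(training$classe)), 3) # Class shares in the training split
round(prop.table(table(testing$classe)), 3) # Should be nearly identical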
Now that we are done with cleaning the data, let’s jump into building a prediction model. We will try the CART model from the rpart library first. To start, we constrain the tree with a large minimum bucket size (minbucket = 2000), which keeps it small and interpretable.
CART <- rpart(classe ~ ., data = training, minbucket = 2000) # Creating the prediction model
predictionsCART <- predict(CART, newdata = testing, type = "class") # Making predictions
confusionMatrix(predictionsCART, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1317 513 279 290 142
## B 0 0 0 0 0
## C 0 0 0 0 0
## D 218 453 587 659 402
## E 139 173 161 16 538
##
## Overall Statistics
##
## Accuracy : 0.427
## 95% CI : (0.4144, 0.4398)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.266
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7867 0.0000 0.0000 0.6829 0.49723
## Specificity 0.7095 1.0000 1.0000 0.6627 0.89823
## Pos Pred Value 0.5183 NaN NaN 0.2842 0.52386
## Neg Pred Value 0.8933 0.8065 0.8255 0.9142 0.88807
## Prevalence 0.2844 0.1935 0.1745 0.1639 0.18379
## Detection Rate 0.2237 0.0000 0.0000 0.1119 0.09139
## Detection Prevalence 0.4316 0.0000 0.0000 0.3939 0.17445
## Balanced Accuracy 0.7481 0.5000 0.5000 0.6728 0.69773
Not bad! We have 42.7% accuracy on the out-of-sample data. Let’s take a closer look at what our model looks like.
fancyRpartPlot(CART)
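If rattle fails to load (it has system-level dependencies), prp() from the already-loaded rpart.plot package draws an equivalent tree; this is an alternative rendering, not the figure discussed below.
prp(CART) # Alternative tree plot using rpart.plot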
Pretty complicated, eh? Anyway, we want a model with better predictive capability, so we will compromise on interpretability. Let’s tune the model: we will use 10-fold cross-validation (via caret, with the e1071 library behind the scenes) to find the optimum value of the cp parameter for rpart.
numFolds <- trainControl(method = "cv", number = 10)
cpGrid <- expand.grid(.cp = seq(0.0001, 0.001, 0.0001))
train(classe ~ ., data = training, method = "rpart", trControl = numFolds, tuneGrid = cpGrid)
## CART
##
## 13735 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 12361, 12361, 12362, 12361, 12362, 12363, ...
##
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa Accuracy SD Kappa SD
## 1e-04 0.9248614 0.9049472 0.01363328 0.01725461
## 2e-04 0.9261717 0.9066079 0.01347175 0.01705217
## 3e-04 0.9255172 0.9057736 0.01320146 0.01670743
## 4e-04 0.9242795 0.9042140 0.01273306 0.01611984
## 5e-04 0.9219492 0.9012614 0.01289269 0.01631309
## 6e-04 0.9185281 0.8969309 0.01134797 0.01436154
## 7e-04 0.9164170 0.8942519 0.01150834 0.01455500
## 8e-04 0.9110287 0.8874383 0.01295398 0.01636988
## 9e-04 0.9105914 0.8868838 0.01244165 0.01571792
## 1e-03 0.9053496 0.8802634 0.01280944 0.01617831
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 2e-04.
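Rather than reading the optimal cp off the printed table, it can also be extracted programmatically. This assumes the train() call above is assigned to an object (hypothetical name cvCART):
cvCART <- train(classe ~ ., data = training, method = "rpart", trControl = numFolds, tuneGrid = cpGrid) # Same call as above, stored this time
cvCART$bestTune # The tuning value caret selected (cp = 2e-04 in this run)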
# It is clearly seen that the model has the highest accuracy with cp = 0.0002. We will set it as our cp value in the rpart model.
tunedCART <- rpart(classe ~ ., data = training, method = "class", cp = 0.0002)
predictionstunedCART <- predict(tunedCART, newdata = testing, type = "class")
confusionMatrix(testing$classe, predictionstunedCART)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1613 34 12 7 8
## B 45 1027 36 11 20
## C 10 46 952 11 8
## D 10 24 40 876 15
## E 4 25 17 17 1019
##
## Overall Statistics
##
## Accuracy : 0.9321
## 95% CI : (0.9253, 0.9384)
## No Information Rate : 0.2857
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.914
## Mcnemar's Test P-Value : 0.0008454
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9590 0.8884 0.9007 0.9501 0.9523
## Specificity 0.9855 0.9763 0.9845 0.9821 0.9869
## Pos Pred Value 0.9636 0.9017 0.9270 0.9078 0.9418
## Neg Pred Value 0.9836 0.9728 0.9784 0.9907 0.9894
## Prevalence 0.2857 0.1964 0.1795 0.1566 0.1818
## Detection Rate 0.2740 0.1745 0.1617 0.1488 0.1731
## Detection Prevalence 0.2844 0.1935 0.1745 0.1639 0.1838
## Balanced Accuracy 0.9722 0.9324 0.9426 0.9661 0.9696
Bingo! The out-of-sample accuracy has moved up to 93%. We could play around with the parameters further, but let’s first check whether other machine learning algorithms can do better. I will now use randomForest and check the accuracy of that model.
#random forest
set.seed(144)
rf <- randomForest(classe ~ ., data = training, ntree = 500, nodesize = 1, importance = TRUE)
predictRF <- predict(rf, newdata = testing)
confusionMatrix(predictRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 0 1135 3 0 0
## C 0 3 1022 7 0
## D 1 0 2 958 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9968
## 95% CI : (0.995, 0.9981)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9959
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9965 0.9951 0.9927 0.9982
## Specificity 0.9998 0.9994 0.9979 0.9990 1.0000
## Pos Pred Value 0.9994 0.9974 0.9903 0.9948 1.0000
## Neg Pred Value 0.9998 0.9992 0.9990 0.9986 0.9996
## Prevalence 0.2844 0.1935 0.1745 0.1639 0.1838
## Detection Rate 0.2842 0.1928 0.1736 0.1627 0.1835
## Detection Prevalence 0.2844 0.1933 0.1753 0.1636 0.1835
## Balanced Accuracy 0.9996 0.9979 0.9965 0.9959 0.9991
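Equivalently, the expected out-of-sample error is 1 minus the accuracy above; it can be pulled straight from the confusionMatrix object (a small addition, reusing the call above):
cmRF <- confusionMatrix(predictRF, testing$classe) # Same comparison as above, stored
1 - as.numeric(cmRF$overall["Accuracy"]) # Estimated out-of-sample error, about 0.0032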
plot(rf)
We see that randomForest has done a very good job on the out-of-sample predictions, with an accuracy of 99.7%. The plot shows that the error rate stabilizes after roughly 100 trees.
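That leveling-off is also visible numerically in the out-of-bag error stored in the fit (err.rate is a standard component of a randomForest object; the exact values depend on the seed):
oobError <- rf$err.rate[, "OOB"] # Out-of-bag error after each additional tree
round(oobError[c(50, 100, 500)], 4) # The error barely improves beyond ~100 trees
Now let’s check how randomForest has ranked the variables.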
#Plotting variable importance for the 15 most important variables
varImpPlot(rf, n.var = 15)
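The same ranking is available as a table via importance(); type = 1 requests the mean decrease in accuracy, which is stored because the model was fit with importance = TRUE. A minimal sketch:
imp <- importance(rf, type = 1) # Mean decrease in accuracy for each predictor
head(imp[order(-imp[, 1]), , drop = FALSE], 10) # The ten most influential variables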
#Plot of how often each variable was used in the randomForest model (top 10 variables)
vu <- varUsed(rf, count = TRUE) # Number of splits on each variable across the forest
vusorted <- sort(vu, decreasing = TRUE, index.return = TRUE) # Sorting so the most-used variables come first
dotchart(vusorted$x[1:10], names(rf$forest$xlevels[vusorted$ix[1:10]]), main = "Variable used count", xlab = "Count")
Predictions <- predict(rf, newdata = test) # Predicting the 20 held-out test cases (the default type returns the predicted class)
# Helper that writes each prediction to its own text file for submission
pml_write_files <- function(x){
  n <- length(x)
  for(i in 1:n){
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(Predictions)
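A final sanity check before uploading the files (not part of the original script):
length(Predictions) # Should be 20, one per test case
table(Predictions) # Distribution of the predicted classes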