In this project we use data from motion sensors (accelerometers) placed on dumbbells and on participants' bodies (arms, forearms, belts) to monitor movements during a weight lifting exercise (“Unilateral Dumbbell Biceps Curl”). The participants were observed during recording and their performance was assigned to one of 5 categories: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). This variable is stored as “classe” in the training data set. Our task is to use machine learning techniques to find a good model that predicts how a weight lifting exercise was performed from the motion sensor data. We use the caret package in R to develop the models and compare their accuracy.
## Loading required package: lattice
## Loading required package: ggplot2
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## Loading required package: bitops
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:data.table':
##
## between, last
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
The data files (training and test sets) were downloaded from the course website and read into R data frames. The data were cleaned by removing columns not relevant for prediction (those not describing movements) and columns with missing data:
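setwd("~/datasciencecoursera/macine_learning/project")
train <- read.csv("pml-training.csv")
test <- read.csv("pml-testing.csv")
# Save the outcome first: it is not numeric and would be dropped below
classe <- train$classe
# Keep only columns without missing values, then only numeric columns
train <- train[, colSums(is.na(train)) == 0]
train <- train[, sapply(train, is.numeric)]
train$classe <- classe
# Drop the first four remaining columns, which do not describe movements
train <- subset(train, select = -c(1:4))
# Do the same for test, counting missing values in test itself (not in train)
test <- test[, colSums(is.na(test)) == 0]
test <- test[, sapply(test, is.numeric)]
test <- subset(test, select = -c(1:4))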
dim(train)
## [1] 19622 53
dim(test)
## [1] 20 53
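The cleaning above filters the training and test sets separately. A more defensive variant derives the predictor columns once from the training set and reuses the same names for the test set, so both are guaranteed to share columns. The following is a minimal base-R sketch on toy data frames; `toy_train`, `toy_test`, `keep` and the toy column values are made up for illustration and are not part of the report.

```r
# Toy stand-ins for the real data (hypothetical values)
toy_train <- data.frame(
  X         = 1:4,                     # bookkeeping column, not a movement
  roll_belt = c(1.4, 1.5, 1.3, 1.6),
  kurtosis  = c(NA, NA, 0.2, NA),      # mostly missing summary column
  user_name = c("a", "b", "a", "b"),   # non-numeric column
  classe    = factor(c("A", "B", "A", "C"))
)
toy_test <- data.frame(
  X          = 1:2,
  roll_belt  = c(1.2, 1.7),
  kurtosis   = c(NA, NA),
  user_name  = c("a", "b"),
  problem_id = 1:2
)

# Columns that are both complete and numeric in the training set
complete <- colSums(is.na(toy_train)) == 0
is_num   <- sapply(toy_train, is.numeric)
keep     <- setdiff(names(toy_train)[complete & is_num], "X")

# Apply the same column selection to both sets
clean_train <- toy_train[, c(keep, "classe"), drop = FALSE]
clean_test  <- toy_test[, keep, drop = FALSE]

names(clean_train)
names(clean_test)
```

Selecting by name rather than by position also keeps the selection stable if the column order differs between the two files.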
The training data are split into a training set (70%) and a validation set (30%). Correlations between the predictors can be visualized with the corrplot package.
Plot <- cor(traindata[, -length(names(traindata))])
corrplot(Plot, method="color", main = "Correlation matrix for predictors")
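The report does not show how `traindata` and `valdata` were created. One common way to do this with caret is `createDataPartition`, which samples within each outcome class so that the class proportions are roughly preserved in both parts. The sketch below uses the built-in `iris` data and assumes this is approximately what was done; the seed and the `toy_` names are illustrative.

```r
library(caret)

set.seed(1234)  # arbitrary seed, only to make the sketch reproducible

# Row indices for a stratified 70% sample of the outcome
in_train <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)

toy_traindata <- iris[in_train, ]
toy_valdata   <- iris[-in_train, ]

# Class counts in both parts
table(toy_traindata$Species)
table(toy_valdata$Species)
```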
The mode of exercise execution was then predicted using three methods: classification tree, random forest and bagging. The methods were then compared for accuracy.
A classification tree splits the observations into groups based on the predictor values and evaluates the homogeneity of the outcome within each group. If homogeneity is low, the group is split again into new groups.
## user system elapsed
## 85.055 2.100 87.212
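The call that fitted `modelfit1` is not shown in the report; the timing above is presumably its `system.time` output. With caret, a classification tree is typically fitted with `method = "rpart"`. The sketch below is self-contained on `iris`, and the presumed report call appears only as a comment because it is an assumption.

```r
library(caret)

set.seed(1234)  # arbitrary seed for the sketch

# Presumed shape of the call in the report (an assumption, not shown in the source):
# system.time(modelfit1 <- train(classe ~ ., data = traindata, method = "rpart"))

# Same idea on a small built-in data set
timing <- system.time(
  toy_tree <- train(Species ~ ., data = iris, method = "rpart")
)
timing

# The fitted rpart tree is stored in the finalModel element
print(toy_tree$finalModel)
```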
fancyRpartPlot(modelfit1$finalModel)
#summary(modelfit1)
#Estimate the performance of the model on the validation data
pred1<-predict(modelfit1, valdata)
cm1 = confusionMatrix(pred1, valdata$classe)
cm1
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1530 486 493 452 168
## B 35 379 31 164 145
## C 105 274 502 348 302
## D 0 0 0 0 0
## E 4 0 0 0 467
##
## Overall Statistics
##
## Accuracy : 0.489
## 95% CI : (0.4762, 0.5019)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3311
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9140 0.3327 0.4893 0.0000 0.43161
## Specificity 0.6203 0.9210 0.7882 1.0000 0.99917
## Pos Pred Value 0.4890 0.5027 0.3279 NaN 0.99151
## Neg Pred Value 0.9478 0.8519 0.8797 0.8362 0.88641
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.18386
## Detection Rate 0.2600 0.0644 0.0853 0.0000 0.07935
## Detection Prevalence 0.5317 0.1281 0.2602 0.0000 0.08003
## Balanced Accuracy 0.7671 0.6269 0.6388 0.5000 0.71539
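The accuracy of the tree model was low (0.489), and the tree never predicted Class D. A random forest grows many decision trees, each on a bootstrap sample of the training data and with a random subset of the predictors considered at each split, and classifies by majority vote among the trees. Aggregating many trees reduces the overfitting of a single tree to the training set.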
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
##
## The following object is masked from 'package:dplyr':
##
## combine
## user system elapsed
## 540.933 3.840 544.825
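The random forest fit itself is also not shown. The following sketch illustrates how such a model can be trained with caret's `method = "rf"`. The 5-fold cross-validation and the number of trees are assumptions chosen to keep the example fast, not the report's actual settings.

```r
library(caret)

set.seed(1234)  # arbitrary seed for the sketch

# 5-fold cross-validation as the resampling scheme (an assumed setting)
ctrl <- trainControl(method = "cv", number = 5)

# ntree is passed through to randomForest(); 100 trees keeps the sketch quick
toy_rf <- train(Species ~ ., data = iris, method = "rf",
                trControl = ctrl, ntree = 100)

# Resampled accuracy for each candidate mtry (predictors tried per split)
toy_rf$results[, c("mtry", "Accuracy")]
```

Cross-validation with few folds is usually much faster than caret's default bootstrap resampling, which may matter given the run time reported above.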
The accuracy for different numbers of randomly selected predictors, and the importance of the individual predictors, can be plotted:
plot(modelfit2,main="Random Forest: Accuracy vs number of predictors")
#summary(modelfit2)
pred2<-predict(modelfit2, valdata)
cm2 = confusionMatrix(pred2, valdata$classe)
plot(varImp(modelfit2), top=10, main= "Random forest -Top ten predictors")
cm2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 8 0 0 0
## B 0 1130 6 1 0
## C 0 1 1016 5 2
## D 0 0 4 956 3
## E 0 0 0 2 1077
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9921 0.9903 0.9917 0.9954
## Specificity 0.9981 0.9985 0.9984 0.9986 0.9996
## Pos Pred Value 0.9952 0.9938 0.9922 0.9927 0.9981
## Neg Pred Value 1.0000 0.9981 0.9979 0.9984 0.9990
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1920 0.1726 0.1624 0.1830
## Detection Prevalence 0.2858 0.1932 0.1740 0.1636 0.1833
## Balanced Accuracy 0.9991 0.9953 0.9943 0.9951 0.9975
Bagging is short for bootstrap aggregating: the cases are resampled with replacement, a model is fitted to each resample, and the predictions of all models are aggregated by majority vote.
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
##
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## user system elapsed
## 2081.310 33.865 2115.329
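The call that fitted `modelfit3` is not shown; the variable importance output further down indicates that caret's `treebag` method was used. To make the idea of bagging concrete, here is a hand-written sketch with `rpart` on the `iris` data. It is an illustration of the principle, not the `treebag` implementation, and the number of resamples is an arbitrary choice.

```r
library(rpart)

set.seed(1234)  # arbitrary seed for the sketch
n_bags <- 25    # number of bootstrap resamples (arbitrary choice)
n      <- nrow(iris)

# Fit one tree per bootstrap resample and collect its predictions for all rows
votes <- sapply(seq_len(n_bags), function(b) {
  boot_rows <- sample(n, replace = TRUE)
  fit <- rpart(Species ~ ., data = iris[boot_rows, ], method = "class")
  as.character(predict(fit, newdata = iris, type = "class"))
})

# Aggregate: majority vote across the trees for each row
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))

# Agreement with the true labels (optimistic: the same rows were used for fitting)
mean(bagged_pred == iris$Species)
```

Unlike a random forest, every tree here may use all predictors at every split; only the cases are resampled.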
#summary(modelfit3)
pred3<-predict(modelfit3, valdata)
cm3 = confusionMatrix(pred3, valdata$classe)
varImp(modelfit3)
## treebag variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## yaw_belt 85.56
## pitch_forearm 74.48
## pitch_belt 73.23
## magnet_dumbbell_y 65.06
## magnet_dumbbell_z 61.35
## roll_forearm 59.15
## accel_dumbbell_y 51.32
## roll_dumbbell 44.42
## magnet_dumbbell_x 42.08
## magnet_belt_y 38.35
## accel_belt_z 37.13
## magnet_belt_z 35.12
## yaw_arm 28.47
## accel_forearm_x 28.20
## accel_dumbbell_z 26.15
## accel_arm_x 23.97
## magnet_forearm_z 23.96
## total_accel_belt 22.97
## total_accel_dumbbell 22.02
plot(varImp(modelfit3), top = 10, main = "Bagging - top ten predictors")
cm3
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 15 1 1 0
## B 2 1114 4 4 0
## C 0 3 1016 6 6
## D 0 4 5 952 8
## E 0 3 0 1 1068
##
## Overall Statistics
##
## Accuracy : 0.9893
## 95% CI : (0.9863, 0.9918)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9865
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9781 0.9903 0.9876 0.9871
## Specificity 0.9960 0.9979 0.9969 0.9965 0.9992
## Pos Pred Value 0.9899 0.9911 0.9855 0.9825 0.9963
## Neg Pred Value 0.9995 0.9947 0.9979 0.9976 0.9971
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1893 0.1726 0.1618 0.1815
## Detection Prevalence 0.2870 0.1910 0.1752 0.1647 0.1822
## Balanced Accuracy 0.9974 0.9880 0.9936 0.9920 0.9931
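To compare the three methods side by side, the accuracy can be recomputed directly from the confusion matrices printed above as the sum of the diagonal divided by the total count. The matrices below are copied from that output; the helper names `make_cm` and `accuracy` are introduced here for illustration.

```r
# Confusion matrices copied from the output above
# (rows = prediction, columns = reference)
classes <- c("A", "B", "C", "D", "E")
make_cm <- function(values) {
  matrix(values, nrow = 5, byrow = TRUE, dimnames = list(classes, classes))
}

cm_tree <- make_cm(c(1530, 486, 493, 452, 168,
                       35, 379,  31, 164, 145,
                      105, 274, 502, 348, 302,
                        0,   0,   0,   0,   0,
                        4,   0,   0,   0, 467))
cm_rf   <- make_cm(c(1674,    8,    0,   0,    0,
                        0, 1130,    6,   1,    0,
                        0,    1, 1016,   5,    2,
                        0,    0,    4, 956,    3,
                        0,    0,    0,   2, 1077))
cm_bag  <- make_cm(c(1672,   15,    1,   1,    0,
                        2, 1114,    4,   4,    0,
                        0,    3, 1016,   6,    6,
                        0,    4,    5, 952,    8,
                        0,    3,    0,   1, 1068))

# Accuracy = correctly classified cases / all cases
accuracy <- function(cm) sum(diag(cm)) / sum(cm)

acc <- sapply(list(tree = cm_tree, rf = cm_rf, bagging = cm_bag), accuracy)
round(acc, 4)

# Estimated out-of-sample error on the validation set
round(1 - acc, 4)
```

On the validation set this gives an estimated out-of-sample error of about 0.54% for the random forest, 1.07% for bagging and 51.1% for the single tree.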
Finally, the three models are used on the test data to predict the performance:
treemodel<-predict(modelfit1, newdata=test)
summary(treemodel)
## A B C D E
## 11 0 9 0 0
rfmodel<-predict(modelfit2, newdata=test)
summary(rfmodel)
## A B C D E
## 7 8 1 1 3
bagmodel<-predict(modelfit3, newdata=test)
summary(bagmodel)
## A B C D E
## 7 8 1 1 3
treemodel
## [1] C A C A A C C A A A C C C A C A A A A C
## Levels: A B C D E
rfmodel
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
bagmodel
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
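The agreement between the models on the 20 test cases can be quantified directly. The sketch below re-enters the predictions printed above as factors (the helper `to_factor` and the `_pred` names are introduced here for illustration) and computes the share of cases on which two models agree.

```r
lv <- c("A", "B", "C", "D", "E")
to_factor <- function(s) factor(strsplit(s, " ")[[1]], levels = lv)

# Predictions copied from the output above
tree_pred <- to_factor("C A C A A C C A A A C C C A C A A A A C")
rf_pred   <- to_factor("B A B A A E D B A A B C B A E E A B B B")
bag_pred  <- to_factor("B A B A A E D B A A B C B A E E A B B B")

# Share of the 20 test cases on which two models agree
mean(rf_pred == bag_pred)
mean(tree_pred == rf_pred)

# Cross-tabulation of tree predictions against random forest predictions
table(tree = tree_pred, rf = rf_pred)
```

The random forest and bagging models give identical predictions for all 20 cases, while the single tree agrees with them on only 8 of 20, which is consistent with its much lower validation accuracy.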