Practical machine learning

Introduction

In this project we use data from motion sensors (accelerometers) placed on dumbbells and on the participants' bodies (arms, forearms and belts) to monitor movements during a weight lifting exercise (“Unilateral Dumbbell Biceps Curl”). The participants were observed while the data were recorded, and their performance was assigned to one of five categories: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). This variable is stored as “classe” in the training data set. Our task is to use machine learning techniques to find a good model that predicts how a weight lifting exercise was performed from the motion sensor data. We use the caret package in R to develop the models and compare their accuracy.

## Loading required package: lattice
## Loading required package: ggplot2
## Rattle: A free graphical interface for data mining with R.
## Version 3.4.1 Copyright (c) 2006-2014 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## Loading required package: bitops
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:data.table':
## 
##     between, last
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

The data files (training and test set) were downloaded from the course website and read into R data frames. The data were cleaned by removing columns that contain missing values and columns that are not relevant for the prediction because they do not describe movements:

setwd("~/datasciencecoursera/macine_learning/project")
train<-read.csv("pml-training.csv")
test<-read.csv("pml-testing.csv")

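# Keep the outcome aside, because the numeric filter below would drop it
classe <- train$classe
# Drop columns that contain missing values
train <- train[, colSums(is.na(train)) == 0]
# Keep numeric columns only, then restore the outcome
train <- train[, sapply(train, is.numeric)]
train$classe <- classe
# Drop the first four remaining columns, which do not describe movement
train <- subset(train, select = -c(1:4))
# Do the same for the test set, using its own missing-value pattern
test <- test[, colSums(is.na(test)) == 0]
test <- test[, sapply(test, is.numeric)]
test <- subset(test, select = -c(1:4))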
dim(train)

## [1] 19622    53

dim(test)

## [1] 20 53

The training data were split into a training set (70%, traindata) and a validation set (30%, valdata). Correlations between the predictors can be visualized with the corrplot package:

Plot <- cor(traindata[, -length(names(traindata))])
corrplot(Plot, method="color", main = "Correlation matrix for predictors")
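The code for the 70/30 split is not shown in this report. A minimal sketch of how it could be done with caret's createDataPartition follows. The seed and the small stand-in data frame are assumptions made so that the example runs on its own; in the report the input would be the cleaned train data frame.

```r
library(caret)

set.seed(1234)  # assumed seed, not taken from the report

# Stand-in for the cleaned training data: two predictors and a five-level outcome
train <- data.frame(
  x1 = rnorm(200),
  x2 = rnorm(200),
  classe = factor(rep(LETTERS[1:5], each = 40))
)

# Stratified 70/30 split on the outcome
inTrain <- createDataPartition(train$classe, p = 0.70, list = FALSE)
traindata <- train[inTrain, ]
valdata <- train[-inTrain, ]

dim(traindata)
dim(valdata)
```

createDataPartition samples within each level of classe, so both sets keep roughly the class proportions of the full data.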

The way each exercise was performed was then predicted with three methods: classification tree, random forest and bagging. The methods were then compared on accuracy.

Classification tree

This method splits the observations into groups based on the values of the predictors and evaluates the homogeneity of the outcome within each group. If the homogeneity is low, the group is split again into new groups.
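To make "homogeneity" concrete, the sketch below computes the Gini impurity of a group of class labels, which is the default splitting criterion for classification in rpart. That the tree in this report was fitted with rpart is an assumption based on the later call to fancyRpartPlot; the helper gini and the toy label vectors are hypothetical.

```r
# Gini impurity: 0 for a pure group, larger for a more mixed group
gini <- function(labels) {
  p <- table(labels) / length(labels)
  1 - sum(p^2)
}

pure  <- c("A", "A", "A", "A")
mixed <- c("A", "B", "C", "D")
gini(pure)   # 0
gini(mixed)  # 0.75

# A split is useful when the size-weighted impurity of the children
# is lower than the impurity of the parent group
parent <- c("A", "A", "B", "B")
left   <- c("A", "A")
right  <- c("B", "B")
weighted_children <- (length(left) * gini(left) + length(right) * gini(right)) /
  length(parent)
gini(parent) - weighted_children  # impurity reduction of this split: 0.5
```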

##    user  system elapsed 
##  85.055   2.100  87.212

fancyRpartPlot(modelfit1$finalModel)

#summary(modelfit1)

#Estimate the performance of the model on the validation data
pred1<-predict(modelfit1, valdata)
cm1 = confusionMatrix(pred1, valdata$classe)
cm1

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1530  486  493  452  168
##          B   35  379   31  164  145
##          C  105  274  502  348  302
##          D    0    0    0    0    0
##          E    4    0    0    0  467
## 
## Overall Statistics
##                                           
##                Accuracy : 0.489           
##                  95% CI : (0.4762, 0.5019)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3311          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9140   0.3327   0.4893   0.0000  0.43161
## Specificity            0.6203   0.9210   0.7882   1.0000  0.99917
## Pos Pred Value         0.4890   0.5027   0.3279      NaN  0.99151
## Neg Pred Value         0.9478   0.8519   0.8797   0.8362  0.88641
## Prevalence             0.2845   0.1935   0.1743   0.1638  0.18386
## Detection Rate         0.2600   0.0644   0.0853   0.0000  0.07935
## Detection Prevalence   0.5317   0.1281   0.2602   0.0000  0.08003
## Balanced Accuracy      0.7671   0.6269   0.6388   0.5000  0.71539
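The overall statistics can be checked by hand from the confusion matrix. The sketch below re-enters the counts printed above and recomputes accuracy, Cohen's kappa and the per-class sensitivity in base R; the variable names are illustrative only.

```r
# Confusion matrix of the classification tree (rows: prediction, columns: reference)
cm <- matrix(
  c(1530, 486, 493, 452, 168,
      35, 379,  31, 164, 145,
     105, 274, 502, 348, 302,
       0,   0,   0,   0,   0,
       4,   0,   0,   0, 467),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                     # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement expected by chance
kappa_hat <- (accuracy - expected) / (1 - expected)
sensitivity <- diag(cm) / colSums(cm)             # correct predictions per reference class

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
round(sensitivity, 4)
```

The row for Class D contains only zeros, so the tree never predicts that class and its sensitivity for D is 0.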

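The tree model reached an accuracy of only 0.489 on the validation set and never predicted Class D.

Random forest

A random forest grows many decision trees, each on a bootstrap sample of the training data and with a random subset of the predictors considered at each split, and classifies by majority vote over the trees. Averaging over many trees reduces the overfitting that a single tree is prone to.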
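The model fitting calls are not shown in this report. The sketch below shows the general pattern with caret's train function, using method = "rf" on the built-in iris data as a stand-in so that it runs on its own. The seed, the stand-in data and the use of caret's default resampling are assumptions. The same pattern with method = "rpart" or method = "treebag" would presumably give the tree and bagging models.

```r
library(caret)

set.seed(2015)  # assumed seed, not taken from the report

# Stand-in data: iris with its outcome renamed to match the report
dat <- iris
names(dat)[5] <- "classe"

inTrain <- createDataPartition(dat$classe, p = 0.70, list = FALSE)
traindata <- dat[inTrain, ]
valdata <- dat[-inTrain, ]

# Fit a random forest and record the time it takes
timing <- system.time(
  modelfit2 <- train(classe ~ ., data = traindata, method = "rf")
)
timing

# Estimate performance on the held-out validation set
pred2 <- predict(modelfit2, valdata)
cm2 <- confusionMatrix(pred2, valdata$classe)
cm2$overall["Accuracy"]
```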

## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## 
## The following object is masked from 'package:dplyr':
## 
##     combine

##    user  system elapsed 
## 540.933   3.840 544.825

The resampled accuracy as a function of the number of randomly selected predictors, and the importance of the individual predictors, can be plotted:

plot(modelfit2,main="Random Forest: Accuracy vs number of predictors")

#summary(modelfit2)
pred2<-predict(modelfit2, valdata)

cm2 = confusionMatrix(pred2, valdata$classe)
plot(varImp(modelfit2), top=10, main= "Random forest -Top ten predictors")

cm2

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    8    0    0    0
##          B    0 1130    6    1    0
##          C    0    1 1016    5    2
##          D    0    0    4  956    3
##          E    0    0    0    2 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9931          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9921   0.9903   0.9917   0.9954
## Specificity            0.9981   0.9985   0.9984   0.9986   0.9996
## Pos Pred Value         0.9952   0.9938   0.9922   0.9927   0.9981
## Neg Pred Value         1.0000   0.9981   0.9979   0.9984   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1920   0.1726   0.1624   0.1830
## Detection Prevalence   0.2858   0.1932   0.1740   0.1636   0.1833
## Balanced Accuracy      0.9991   0.9953   0.9943   0.9951   0.9975

Bagging

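Bagging is short for bootstrap aggregation. The cases are resampled with replacement, a model is fitted to each resample, and the predictions of all models are combined, by majority vote in the case of classification.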
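The idea can be sketched by hand with rpart trees on the built-in iris data. This is an illustration of the technique, not the treebag implementation used for the model in this report; the seed, the number of trees and the data are assumptions.

```r
library(rpart)

set.seed(42)  # assumed seed for reproducibility

# Hold out 45 of the 150 iris rows for evaluation
test_idx <- sample(nrow(iris), 45)
train_set <- iris[-test_idx, ]
test_set <- iris[test_idx, ]

n_trees <- 25

# Fit one tree per bootstrap resample and collect its predictions
# (one column of votes per tree, one row per held-out case)
votes <- sapply(seq_len(n_trees), function(b) {
  boot <- train_set[sample(nrow(train_set), replace = TRUE), ]
  fit <- rpart(Species ~ ., data = boot, method = "class")
  as.character(predict(fit, newdata = test_set, type = "class"))
})

# Aggregate: majority vote across the trees for each held-out case
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

accuracy <- mean(bagged == test_set$Species)
accuracy
```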

## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## 
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize

##     user   system  elapsed 
## 2081.310   33.865 2115.329

#summary(modelfit3)
pred3<-predict(modelfit3, valdata)

cm3 = confusionMatrix(pred3, valdata$classe)
varImp(modelfit3)

## treebag variable importance
## 
##   only 20 most important variables shown (out of 52)
## 
##                      Overall
## roll_belt             100.00
## yaw_belt               85.56
## pitch_forearm          74.48
## pitch_belt             73.23
## magnet_dumbbell_y      65.06
## magnet_dumbbell_z      61.35
## roll_forearm           59.15
## accel_dumbbell_y       51.32
## roll_dumbbell          44.42
## magnet_dumbbell_x      42.08
## magnet_belt_y          38.35
## accel_belt_z           37.13
## magnet_belt_z          35.12
## yaw_arm                28.47
## accel_forearm_x        28.20
## accel_dumbbell_z       26.15
## accel_arm_x            23.97
## magnet_forearm_z       23.96
## total_accel_belt       22.97
## total_accel_dumbbell   22.02

plot(varImp(modelfit3), top = 10, main = "Bagging - top ten predictors")

cm3

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1672   15    1    1    0
##          B    2 1114    4    4    0
##          C    0    3 1016    6    6
##          D    0    4    5  952    8
##          E    0    3    0    1 1068
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9893          
##                  95% CI : (0.9863, 0.9918)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9865          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9988   0.9781   0.9903   0.9876   0.9871
## Specificity            0.9960   0.9979   0.9969   0.9965   0.9992
## Pos Pred Value         0.9899   0.9911   0.9855   0.9825   0.9963
## Neg Pred Value         0.9995   0.9947   0.9979   0.9976   0.9971
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2841   0.1893   0.1726   0.1618   0.1815
## Detection Prevalence   0.2870   0.1910   0.1752   0.1647   0.1822
## Balanced Accuracy      0.9974   0.9880   0.9936   0.9920   0.9931

Finally, the three models are used on the test data to predict how each exercise was performed:

treemodel<-predict(modelfit1, newdata=test)
summary(treemodel)

##  A  B  C  D  E 
## 11  0  9  0  0

rfmodel<-predict(modelfit2, newdata=test)
summary(rfmodel)

## A B C D E 
## 7 8 1 1 3

bagmodel<-predict(modelfit3, newdata=test)
summary(bagmodel)

## A B C D E 
## 7 8 1 1 3

treemodel

##  [1] C A C A A C C A A A C C C A C A A A A C
## Levels: A B C D E

rfmodel

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

bagmodel

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
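The agreement between the three sets of test predictions can be counted directly. The sketch below re-enters the predictions printed above as factors; the helper to_factor is hypothetical.

```r
lv <- LETTERS[1:5]
to_factor <- function(s) factor(strsplit(s, " ")[[1]], levels = lv)

treemodel <- to_factor("C A C A A C C A A A C C C A C A A A A C")
rfmodel   <- to_factor("B A B A A E D B A A B C B A E E A B B B")
bagmodel  <- to_factor("B A B A A E D B A A B C B A E E A B B B")

# Number of test cases (out of 20) on which each pair of models agrees
agreement <- c(
  rf_vs_bagging = sum(rfmodel == bagmodel),
  tree_vs_rf    = sum(treemodel == rfmodel)
)
agreement

# Cross-tabulate the tree predictions against the random forest predictions
table(tree = treemodel, rf = rfmodel)
```

Random forest and bagging agree on all 20 cases, while the tree agrees with the random forest on only 8 of them.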

tgjoen

November 20, 2015

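In conclusion, random forest gave the best prediction and the smallest estimated out-of-sample error of the three methods: its accuracy on the validation set was 0.9946, compared with 0.9893 for bagging and 0.489 for the classification tree.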
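The out-of-sample error of each model can be estimated as one minus its accuracy on the validation set. The short sketch below uses the accuracies reported in the confusion matrix output of this report; the variable names are illustrative.

```r
# Validation accuracies as reported in the confusion matrix output
accuracy <- c(classification_tree = 0.4890, random_forest = 0.9946, bagging = 0.9893)

# Estimated out-of-sample error
oos_error <- 1 - accuracy
round(oos_error, 4)

# Expected number of misclassified cases among the 20 test cases
expected_misses <- 20 * oos_error
round(expected_misses, 2)

# Model with the smallest estimated error
names(which.min(oos_error))
```

Under these estimates, random forest and bagging would each be expected to misclassify fewer than one of the 20 test cases, which is consistent with the two models returning identical test predictions.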