Machine Learning Project

Executive summary

The purpose of this project is to predict how well users perform a particular activity, using data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Activity quality is recorded in the ‘classe’ variable, which has 5 levels: A (the exercise performed correctly) and B through E (four common execution mistakes).

Download & read data

The first step is to gather the training data and the test data from the provided links. It is important to define the strings that should be converted to NA values, such as the “#DIV/0!” string that Excel sometimes shows.

url.training="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url.test="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.training,destfile="training.csv", method = "curl")
download.file(url.test, destfile = "test.csv", method="curl")
training=read.csv("training.csv", na.strings = c("#DIV/0!", "", "NA"))
dim(training)

## [1] 19622   160

test=read.csv("test.csv", na.strings = c("#DIV/0!", "", "NA"))
dim(test)

## [1]  20 160
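To illustrate what the na.strings argument does, here is a minimal sketch on a made-up three-row CSV. The column names and values are hypothetical; only the “#DIV/0!” marker comes from the data described above. Without na.strings the affected column is read as text, while with it the column is numeric and the markers become NA.

```r
# Hypothetical CSV content; only the "#DIV/0!" marker is taken from the report.
csv.text <- "id,kurtosis\n1,0.52\n2,#DIV/0!\n3,\n"

# Default import: the error marker forces the whole column to be non-numeric
raw <- read.csv(text = csv.text)
class(raw$kurtosis)

# Import with the same na.strings used above: markers and blanks become NA
clean <- read.csv(text = csv.text, na.strings = c("#DIV/0!", "", "NA"))
class(clean$kurtosis)
sum(is.na(clean$kurtosis))
```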

Once the data has been imported into R, the first column is removed from both datasets, since it only contains the row number.

training[,1]=NULL
test[,1]=NULL

Divide training & validation sets

To estimate out-of-sample performance and guard against overfitting, a hold-out validation approach is used. The training dataset is divided into two parts: the first (~60%) is used to train the models, and the second (~40%) is used to evaluate their performance. This second dataset is called ‘validation’.

library (caret)

## Warning: package 'caret' was built under R version 3.2.5

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.2.4

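# Stratified 60/40 split on classe; no seed is set, so the exact split varies between runs
in.train=createDataPartition(y=training$classe, p=0.6, list=FALSE)
validation=training[-in.train,]
training=training[in.train,]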
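
The idea behind a stratified split can be sketched in base R. This is only an illustration on made-up labels, not the code caret runs internally: 60% of the rows are drawn within each class, so the class proportions are roughly the same in both parts. The seed and the class sizes are arbitrary assumptions.

```r
set.seed(123)  # arbitrary seed, used only to make this sketch repeatable

# Made-up class labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 30, 40)))

# Draw 60% of the row indices within each class
by.class <- split(seq_along(classe), classe)
train.idx <- unlist(lapply(by.class, function(rows) {
  rows[sample.int(length(rows), size = round(0.6 * length(rows)))]
}), use.names = FALSE)

train.labels <- classe[train.idx]
valid.labels <- classe[-train.idx]

table(train.labels)  # class counts in the training part
table(valid.labels)  # class counts in the validation part
```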

Cleaning Data

The first step in predicting the quality of the movement is to identify all the variables that are not relevant to the model.

Clean up near-zero-variance variables

The first cleanup approach is to identify the variables whose variance is zero or close to zero, using the following code. Once these variables have been identified, they are removed from all the datasets.

nearzero=nearZeroVar(training, saveMetrics =F)
training=training[,-nearzero]
validation=validation[,-nearzero]
test=test[,-nearzero]
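
As a rough sketch of the kind of check involved, the helper below flags a variable when its most common value dominates the second most common one and it has few distinct values. The function name is_near_zero_var is hypothetical, and the thresholds 95/5 and 10 mirror what caret documents as the defaults of freqCut and uniqueCut; this is an approximation for illustration, not caret's implementation.

```r
# Hypothetical helper: approximate near-zero-variance check for one variable
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # A single distinct value (or none) means zero variance
  if (length(counts) <= 1) return(TRUE)
  # Ratio of the most common value to the second most common value
  freq_ratio <- counts[[1]] / counts[[2]]
  # Distinct values as a percentage of the number of observations
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique < unique_cut
}

is_near_zero_var(c(rep("no", 99), "yes"))  # one value dominates
is_near_zero_var(1:100)                    # every value is distinct
is_near_zero_var(rep(1, 100))              # constant column
```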

Clean up variables with too many NAs

The second approach is to remove the variables that consist largely of NA values. Every variable with more than 30% NA values in the training set is classified as irrelevant to the model, and hence removed from the datasets.

# Fraction of NA values in each column of the training set
na.perc=colMeans(is.na(training))
# Columns where more than 30% of the values are NA
na.var.col=which(na.perc>0.3)
training=training[,-na.var.col]
validation=validation[,-na.var.col]
test=test[,-na.var.col]
# Drop column 58 of test: the position that classe occupies in training, which is not a predictor
test[,58]=NULL
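
The NA filter can be checked on a toy data frame. The data below are made up, and the sketch selects the columns to keep by name rather than negating the ones to drop; that choice is an assumption of this example, made because a negative index built from an empty vector would select no columns at all.

```r
# Made-up data: column b is 40% NA, column c is 20% NA
toy <- data.frame(
  a = c(1, 2, 3, 4, 5),
  b = c(NA, NA, 3, 4, 5),
  c = c(NA, 2, 3, 4, 5)
)

# Fraction of NA values per column
na.frac <- colMeans(is.na(toy))

# Keep the columns with at most 30% NA values
keep <- names(na.frac)[na.frac <= 0.3]
toy.clean <- toy[, keep, drop = FALSE]
names(toy.clean)
```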

Machine Learning Models

Once the relevant variables have been selected, several machine learning algorithms (decision trees, random forests, and boosting) are fitted to the training dataset.

Decision Trees

The first and simplest approach is the decision tree algorithm.

library(rpart)
tree.model=rpart(classe ~ ., data=training, method="class")
tree.pred=predict(tree.model,newdata = validation, type="class")
tree.matrix=confusionMatrix(tree.pred,validation$classe)
tree.matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2150   68   10    2    0
##          B   78 1383  156   35    0
##          C    4   54 1177  111   62
##          D    0   13   13  941   85
##          E    0    0   12  197 1295
## 
## Overall Statistics
##                                          
##                Accuracy : 0.8853         
##                  95% CI : (0.878, 0.8923)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.8548         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9633   0.9111   0.8604   0.7317   0.8981
## Specificity            0.9857   0.9575   0.9643   0.9831   0.9674
## Pos Pred Value         0.9641   0.8372   0.8359   0.8945   0.8610
## Neg Pred Value         0.9854   0.9782   0.9703   0.9492   0.9768
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2740   0.1763   0.1500   0.1199   0.1651
## Detection Prevalence   0.2842   0.2106   0.1795   0.1341   0.1917
## Balanced Accuracy      0.9745   0.9343   0.9124   0.8574   0.9327

plot(tree.matrix$table, main="Decision Tree Confusion Matrix")

As the confusion matrix shows, the accuracy of this model is reasonable (88.53%) but leaves room for improvement. The estimated out-of-sample error is 11.47% (1 minus the accuracy on the validation set). For this reason, more complex models are fitted next.
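
The accuracy and error figures can be reproduced by hand from the confusion matrix printed above. The sketch below retypes that table (the object name tree.table is introduced here for illustration) and computes the accuracy as the share of cases on the diagonal, plus the per-class sensitivity.

```r
# Decision tree confusion matrix, copied from the output above
# (rows = predictions, columns = reference classes)
tree.table <- matrix(
  c(2150,   68,   10,    2,    0,
      78, 1383,  156,   35,    0,
       4,   54, 1177,  111,   62,
       0,   13,   13,  941,   85,
       0,    0,   12,  197, 1295),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(tree.table)) / sum(tree.table)
# Estimated out-of-sample error
error <- 1 - accuracy
# Sensitivity per class: correct predictions over the reference total
sensitivity <- diag(tree.table) / colSums(tree.table)

round(accuracy, 4)  # 0.8853
round(error, 4)     # 0.1147
round(sensitivity, 4)
```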

Random Forests

The second model is a random forest using the 57 predictors.

random.model=train(classe~., data=training, method="rf", prox=T)

## Loading required package: randomForest

## randomForest 4.6-10

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

random.pred=predict(random.model, newdata=validation)
random.matrix=confusionMatrix(random.pred,validation$classe)
random.matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    1    0    0    0
##          B    0 1517    1    0    0
##          C    0    0 1365    5    0
##          D    0    0    2 1281    1
##          E    0    0    0    0 1441
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9987          
##                  95% CI : (0.9977, 0.9994)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9984          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9993   0.9978   0.9961   0.9993
## Specificity            0.9998   0.9998   0.9992   0.9995   1.0000
## Pos Pred Value         0.9996   0.9993   0.9964   0.9977   1.0000
## Neg Pred Value         1.0000   0.9998   0.9995   0.9992   0.9998
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1933   0.1740   0.1633   0.1837
## Detection Prevalence   0.2846   0.1935   0.1746   0.1637   0.1837
## Balanced Accuracy      0.9999   0.9996   0.9985   0.9978   0.9997

plot(random.matrix$table, main="Random Forest Confusion Matrix")

As the confusion matrix shows, the accuracy of this model is very high (99.87%). The estimated out-of-sample error is 0.13%.
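
The Kappa statistic in the output corrects the accuracy for the agreement expected by chance. A minimal sketch of the standard Cohen's kappa formula, applied to the random forest table above, is shown here; the object names are introduced for illustration only.

```r
# Random forest confusion matrix, copied from the output above
rf.table <- matrix(
  c(2232,    1,    0,    0,    0,
       0, 1517,    1,    0,    0,
       0,    0, 1365,    5,    0,
       0,    0,    2, 1281,    1,
       0,    0,    0,    0, 1441),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(rf.table)
# Observed agreement (the accuracy)
observed <- sum(diag(rf.table)) / n
# Agreement expected by chance, from the row and column totals
expected <- sum(rowSums(rf.table) * colSums(rf.table)) / n^2
# Cohen's kappa
kappa.rf <- (observed - expected) / (1 - expected)

round(kappa.rf, 4)  # 0.9984
```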

Boosting

Finally, a third model is built using gradient boosting (GBM).

boosting.model=train(classe~., data=training, method="gbm", verbose=F)

## Loading required package: gbm

## Loading required package: survival

## 
## Attaching package: 'survival'

## The following object is masked from 'package:caret':
## 
##     cluster

## Loading required package: splines

## Loading required package: parallel

## Loaded gbm 2.1.1

## Loading required package: plyr

## Warning: package 'plyr' was built under R version 3.2.5

boosting.pred=predict(boosting.model, newdata=validation)
boosting.matrix=confusionMatrix(boosting.pred,validation$classe)
boosting.matrix

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    1    0    0    0
##          B    0 1510    1    0    0
##          C    0    3 1358    5    0
##          D    0    4    9 1278    3
##          E    0    0    0    3 1439
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9963          
##                  95% CI : (0.9947, 0.9975)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9953          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9927   0.9938   0.9979
## Specificity            0.9998   0.9998   0.9988   0.9976   0.9995
## Pos Pred Value         0.9996   0.9993   0.9941   0.9876   0.9979
## Neg Pred Value         1.0000   0.9987   0.9985   0.9988   0.9995
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1925   0.1731   0.1629   0.1834
## Detection Prevalence   0.2846   0.1926   0.1741   0.1649   0.1838
## Balanced Accuracy      0.9999   0.9973   0.9957   0.9957   0.9987

plot(boosting.matrix$table, main="Boosting Confusion Matrix")

The accuracy of this model is also very high (99.63%), but slightly lower than that of the random forest. The estimated out-of-sample error is 0.37%.
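
The 95% intervals in the outputs help when comparing two accuracies this close: the random forest interval (0.9977, 0.9994) and the boosting interval (0.9947, 0.9975) do not overlap. caret's documentation describes the accuracy interval as coming from binom.test; assuming that holds for the version used here, a comparable interval can be sketched from the boosting table as follows.

```r
# Boosting confusion matrix, copied from the output above
boost.table <- matrix(
  c(2232,    1,    0,    0,    0,
       0, 1510,    1,    0,    0,
       0,    3, 1358,    5,    0,
       0,    4,    9, 1278,    3,
       0,    0,    0,    3, 1439),
  nrow = 5, byrow = TRUE
)

correct <- sum(diag(boost.table))  # correctly classified cases
total <- sum(boost.table)          # all validation cases

# Exact binomial confidence interval for the accuracy
ci <- binom.test(correct, total, conf.level = 0.95)$conf.int
round(correct / total, 4)
round(ci, 4)
```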

Create Predictions

Since the random forest model had the lowest validation error, it is the one used to predict classe for the test dataset. Finally, the predicted results are saved in a .csv file.

final.pred=predict(random.model,newdata=test)

data.frame(final.pred)

##    final.pred
## 1           B
## 2           A
## 3           B
## 4           A
## 5           A
## 6           E
## 7           D
## 8           B
## 9           A
## 10          A
## 11          B
## 12          C
## 13          B
## 14          A
## 15          E
## 16          E
## 17          A
## 18          B
## 19          B
## 20          B

write.csv(final.pred, file="predictions.csv", row.names=F)
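
One possible refinement, sketched here on made-up predictions, is to write an explicit data frame so that the file carries a case number and a named column. The names toy.pred, problem, and the temporary file path are assumptions of this example, not part of the analysis above.

```r
# Hypothetical predictions standing in for final.pred
toy.pred <- factor(c("B", "A", "E"), levels = LETTERS[1:5])

# Pair each prediction with its case number
out <- data.frame(problem = seq_along(toy.pred), classe = toy.pred)

# Write to a temporary file and read it back to check the layout
path <- tempfile(fileext = ".csv")
write.csv(out, file = path, row.names = FALSE)
back <- read.csv(path, stringsAsFactors = FALSE)
back
```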