Alejandro Fraga, June 2016
One thing people regularly do these days is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Load the data from the individuals who participated in the study and the libraries needed to perform the analysis:
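The loading chunk itself is not reproduced in this write-up, so a minimal sketch follows. The file names pml-training.csv and pml-testing.csv and the na.strings values are assumptions rather than part of the original code; the data frame names train_data and test match the ones used later.
# Minimal loading sketch (file names and na.strings values are assumed)
library(caret)
library(randomForest)
library(dplyr)
train_data <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))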
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, we need to do some data preparation:
# Columns full of NAs are not useful, so drop them
relevantFeatures <- names(test[,colSums(is.na(test)) == 0])[8:59]
# Keep only the relevant features that appear in the test cases
train_data <- train_data[,c(relevantFeatures,"classe")]
test <- test[,c(relevantFeatures,"problem_id")]
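As a quick sanity check, the cleaned frames should now hold the 52 sensor features plus the outcome or identifier column (a small sketch; the exact row counts depend on the downloaded files):
# Sanity check on the cleaned data (sketch)
dim(train_data) # expected: 52 features + classe = 53 columns
dim(test)       # expected: 20 rows, 52 features + problem_id = 53 columns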
I will hold out 25% of the data set for testing:
set.seed(246)
inTrain = createDataPartition(train_data$classe, p = 0.75, list = F)
training = train_data[inTrain,]
testing = train_data[-inTrain,]
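A quick check confirms the 75/25 split preserved the class proportions (a small sketch):
# Class proportions should be nearly identical in both partitions (sketch)
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)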
To simplify the analysis, let's remove from the training set those features that are highly correlated (>90%):
outcome = which(names(training) == "classe")
highCorrCols = findCorrelation(abs(cor(training[,-outcome])),0.90)
# highCorrFeatures holds the names of the highly correlated features
highCorrFeatures = names(training)[highCorrCols]
training = training[,-highCorrCols]
outcome = which(names(training) == "classe")
str(outcome)
## int 46
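The names of the removed features are stored in the highCorrFeatures variable created above and can be printed directly (a quick sketch):
# Show which highly correlated features were dropped (sketch)
highCorrFeatures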
From this analysis, I found that the highly correlated features are: accel_belt_z, roll_belt, accel_belt_y, accel_belt_x, gyros_arm_y, gyros_forearm_z, and gyros_dumbbell_x.
As we learned, the Random Forest method handles non-linear features well, as is the case in this study, and it also reduces overfitting. I will also train a k-nearest neighbors (KNN) model to identify which algorithm provides better accuracy.
First, I will use Random Forest to discover the most important features.
featuresRF = randomForest(training[,-outcome], training[,outcome], importance = T)
importanceRF = data.frame(featuresRF$importance)
impFeatures = order(-importanceRF$MeanDecreaseGini)
# Take a 5% sample of the training set for the feature plot
inImp = createDataPartition(training$classe, p = 0.05, list = F)
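The top-ranked features can be read off the importance table ordered above (a small sketch):
# Four most important features by mean decrease in Gini (sketch)
rownames(importanceRF)[impFeatures[1:4]]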
The feature plot for the 4 most important features (pitch_belt, yaw_belt, total_accel_belt, gyros_belt_x) is shown below:
featurePlot(training[inImp,impFeatures[1:4]],training$classe[inImp], plot = "pairs")
# Fit the Random Forest model
# ctrlRF sets up out-of-bag resampling for caret::train; here the forest is fit directly with randomForest()
ctrlRF = trainControl(method = "oob")
modelRF <- randomForest(classe ~ ., data=training)
RFPredTrain <- predict(modelRF, newdata=training, type="class")
RFAccuracyTrain <- confusionMatrix(RFPredTrain, training$classe)
RFAccuracyTrain
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
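The perfect accuracy above is measured on the same data the forest was fit on, so it is optimistic. The randomForest object also reports an out-of-bag (OOB) error estimate, which is a less optimistic check (a quick sketch):
# Printing the model object shows the OOB estimate of error rate (sketch)
print(modelRF)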
Next, I will train a model using the k-nearest neighbors (KNN) algorithm for comparison:
# Developing KNN model
ctrlKNN = trainControl(method = "adaptive_cv")
modelKNN = train(classe ~ ., training, method = "knn", trControl = ctrlKNN)
KNNPredTrain <- predict(modelKNN, newdata=training)
KNNAccuracyTrain <- confusionMatrix(KNNPredTrain, training$classe)
KNNAccuracyTrain
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4115 74 16 13 16
## B 16 2664 46 4 38
## C 22 43 2458 91 30
## D 27 32 27 2281 40
## E 5 35 20 23 2582
##
## Overall Statistics
##
## Accuracy : 0.958
## 95% CI : (0.9546, 0.9612)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9469
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9833 0.9354 0.9575 0.9457 0.9542
## Specificity 0.9887 0.9912 0.9847 0.9898 0.9931
## Pos Pred Value 0.9719 0.9624 0.9297 0.9477 0.9689
## Neg Pred Value 0.9933 0.9846 0.9910 0.9894 0.9897
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2796 0.1810 0.1670 0.1550 0.1754
## Detection Prevalence 0.2877 0.1881 0.1796 0.1635 0.1811
## Balanced Accuracy 0.9860 0.9633 0.9711 0.9677 0.9736
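The number of neighbors selected by the adaptive cross-validation can be inspected on the caret object (a quick sketch):
# Tuning parameter (k) chosen by train() for the KNN model (sketch)
modelKNN$bestTune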
As we can see, the Random Forest model provides better accuracy than the k-nearest neighbors model. Next, I provide the confusion matrix for the Random Forest model applied to the held-out testing set:
PredTest <- predict(modelRF, testing)
AccuracyTest <- confusionMatrix(PredTest, testing$classe)
AccuracyTest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 1 0 0 0
## B 0 946 2 0 0
## C 0 2 853 9 0
## D 0 0 0 795 1
## E 0 0 0 0 900
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.995, 0.9983)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9968 0.9977 0.9888 0.9989
## Specificity 0.9997 0.9995 0.9973 0.9998 1.0000
## Pos Pred Value 0.9993 0.9979 0.9873 0.9987 1.0000
## Neg Pred Value 1.0000 0.9992 0.9995 0.9978 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1929 0.1739 0.1621 0.1835
## Detection Prevalence 0.2847 0.1933 0.1762 0.1623 0.1835
## Balanced Accuracy 0.9999 0.9982 0.9975 0.9943 0.9994
Based on the two models used, I conclude that Random Forest provides the best outcome prediction, with approximately 0.997 accuracy on the held-out testing set.
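The expected out-of-sample error follows directly from the testing-set accuracy (a small sketch):
# Expected out-of-sample error from the held-out testing set (sketch)
1 - AccuracyTest$overall["Accuracy"] # roughly 0.003 given the accuracy above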
# Run the model against the 20 test cases
genPredictions <- predict(modelRF, newdata=test, type="class")
# The following function generates the prediction files requested in the assignment
predmodel_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    fname = paste0("problem_id_",i,".txt")
    write.table(x[i],file=fname,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
predmodel_write_files(genPredictions)
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har