Machine Learning Report

Synopsis

This report uses the Weight Lifting Exercise Dataset, available at http://groupware.les.inf.puc-rio.br/har, which uses several sensors to measure how well participants execute an exercise. The dataset classifies each execution into 5 classes: class “A” represents a well-executed exercise, and the remaining classes “B”, “C”, “D” and “E” correspond to different forms of bad execution. The following sections explain the steps taken to build the model and obtain the predictions.

Data Load

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

# Create the target directory if it does not exist yet
dir.create("./data", showWarnings = FALSE)

download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv","./data/pml-training.csv", mode="wb")

download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv","./data/pml-testing.csv", mode="wb")

The data are loaded, converting empty strings, “NA” and the spreadsheet error marker “#DIV/0!” to NA.

#Data Loading
training <- read.csv("./data/pml-training.csv", na.strings=c("", "NA", "#DIV/0!"))
testing <- read.csv("./data/pml-testing.csv", na.strings=c("", "NA", "#DIV/0!"))
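
As a small illustration of what the na.strings argument does, the sketch below parses a made-up three-row CSV (the column names a and b and all values are hypothetical, not taken from the real files). Without the extra markers, column b would be read as text instead of numbers.

```r
# Toy CSV text standing in for the real file (values are made up).
csv_text <- "a,b,classe\n1.5,#DIV/0!,A\nNA,2.0,B\n,3.5,C"

toy <- read.csv(text = csv_text, na.strings = c("", "NA", "#DIV/0!"))

# Both a and b are parsed as numeric because every marker became NA.
str(toy)
colSums(is.na(toy))
```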

The following libraries were loaded to run the steps below.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(rpart)

Data Preparation

Data Cleaning

Features in which half or more of the values are missing are discarded.

#Keep variables in which fewer than 50% of the values are missing
training2 <- training[, colMeans(is.na(training)) < 0.5]

The resulting data set contains no rows with NA values.

sum(!complete.cases(training2))

## [1] 0
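
The missing-value filter can be checked on a toy data frame. This is a minimal sketch with made-up columns (mostly_na, half_na and complete are hypothetical names); it uses the same "fraction of NA below 0.5" rule as the code above, so a column that is exactly half missing is also dropped.

```r
toy <- data.frame(
  mostly_na = c(NA, NA, NA, 4),   # 75% missing: dropped
  half_na   = c(1, NA, 3, NA),    # exactly 50% missing: dropped (0.5 is not < 0.5)
  complete  = c(1, 2, 3, 4),      # 0% missing: kept
  classe    = c("A", "B", "A", "B")
)

na_fraction <- colMeans(is.na(toy))            # share of NA per column
kept <- toy[, na_fraction < 0.5, drop = FALSE]

names(kept)
sum(!complete.cases(kept))                     # rows that still contain an NA
```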

Pre-Processing

To keep features with large numeric ranges from dominating scale-sensitive models such as the SVM, the numeric data are centered at zero and scaled to unit variance. Centering and scaling do not reduce skewness.

preObj <- preProcess(training2, method = c("center", "scale"))
training1 <- predict(preObj, training2)
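
For a single numeric column, centering and scaling amount to subtracting the mean and dividing by the standard deviation. The sketch below shows this arithmetic in base R on made-up values; it illustrates the transformation only and is not caret's internal code.

```r
x <- c(2, 4, 6, 8)            # made-up measurements

# Center at zero, then scale by the sample standard deviation.
z <- (x - mean(x)) / sd(x)

mean(z)                       # zero up to floating-point error
sd(z)                         # one up to floating-point error

# Base R's scale() applies the same transformation.
all.equal(as.numeric(scale(x)), z)
```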

Data Partition

60% of the data is used to train the model and the remaining 40% is used to test it.

#Create data partition
trainingC <- data.frame(training1)
inTrain <- createDataPartition(y = trainingC$classe, p = 0.6, list = FALSE )
train <- trainingC[inTrain,]
test <- trainingC[-inTrain,]
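
The idea behind a stratified 60/40 split is to sample 60% of the rows within each class, so that both parts keep the class proportions. The base R sketch below shows that idea on made-up labels; it is a simplified stand-in, not the algorithm used by createDataPartition, and it sets the seed before sampling as an assumption about how one would make the split reproducible.

```r
set.seed(1904)  # set before sampling so the split can be reproduced

# Made-up class labels: 50 x A, 30 x B, 20 x C.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 60% of the row indices within each class.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.6 * length(idx)))]
}), use.names = FALSE))

table(classe[in_train])   # class counts in the training part
table(classe[-in_train])  # class counts in the test part
```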

Model Selection

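The following seed was used for model training. It is set after the data partition above was created, so the partition itself is not covered by it.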

set.seed(1904)

Feature selection

Of the 160 initial features, 60 remain for selection.

dim(train)

## [1] 11776    60

To reduce redundancy, highly correlated features were filtered out. For each pair of features with an absolute correlation above 0.75, findCorrelation flags one of the two (the one with the larger mean absolute correlation) for removal.

# calculate correlation matrix
trainingC1 <- train[,-c(1:6,60)]
correlationMatrix <- cor(trainingC1)
highlyCorrelated <- findCorrelation(correlationMatrix, cutoff=0.75)
tc2 <- trainingC1[,-highlyCorrelated]
str(tc2)

## 'data.frame':    11776 obs. of  33 variables:
##  $ num_window          : num  -1.69 -1.69 -1.69 -1.69 -1.69 ...
##  $ yaw_belt            : num  -0.874 -0.874 -0.874 -0.874 -0.874 ...
##  $ gyros_belt_x        : num  0.027 0.123 0.027 0.123 0.123 ...
##  $ gyros_belt_y        : num  -0.506 -0.506 -0.506 -0.25 -0.506 ...
##  $ gyros_belt_z        : num  0.458 0.458 0.458 0.458 0.458 ...
##  $ magnet_belt_x       : num  -0.913 -0.975 -0.897 -0.96 -0.929 ...
##  $ magnet_belt_y       : num  0.149 0.401 0.177 0.177 0.149 ...
##  $ roll_arm            : num  -2 -2 -2 -2 -2 ...
##  $ pitch_arm           : num  0.884 0.884 0.884 0.871 0.864 ...
##  $ yaw_arm             : num  -2.25 -2.25 -2.25 -2.25 -2.25 ...
##  $ total_accel_arm     : num  0.807 0.807 0.807 0.807 0.807 ...
##  $ gyros_arm_y         : num  0.302 0.278 0.278 0.267 0.267 ...
##  $ gyros_arm_z         : num  -0.523 -0.523 -0.523 -0.487 -0.487 ...
##  $ accel_arm_y         : num  0.695 0.705 0.705 0.714 0.714 ...
##  $ magnet_arm_x        : num  -1.26 -1.26 -1.26 -1.28 -1.27 ...
##  $ magnet_arm_z        : num  0.641 0.632 0.632 0.611 0.62 ...
##  $ roll_dumbbell       : num  -0.154 -0.153 -0.157 -0.15 -0.153 ...
##  $ pitch_dumbbell      : num  -1.61 -1.62 -1.61 -1.61 -1.61 ...
##  $ yaw_dumbbell        : num  -1.05 -1.05 -1.05 -1.05 -1.05 ...
##  $ total_accel_dumbbell: num  2.28 2.28 2.28 2.28 2.28 ...
##  $ gyros_dumbbell_y    : num  -0.108 -0.108 -0.108 -0.108 -0.108 ...
##  $ magnet_dumbbell_z   : num  -0.793 -0.786 -0.779 -0.815 -0.829 ...
##  $ roll_forearm        : num  -0.0502 -0.0512 -0.0512 -0.0539 -0.0549 ...
##  $ pitch_forearm       : num  -2.65 -2.65 -2.65 -2.65 -2.65 ...
##  $ yaw_forearm         : num  -1.67 -1.67 -1.66 -1.66 -1.66 ...
##  $ total_accel_forearm : num  0.128 0.128 0.128 0.128 0.128 ...
##  $ gyros_forearm_x     : num  -0.197 -0.213 -0.197 -0.213 -0.213 ...
##  $ gyros_forearm_z     : num  -0.0976 -0.0976 -0.0862 -0.0976 -0.0976 ...
##  $ accel_forearm_x     : num  1.4 1.4 1.43 1.39 1.42 ...
##  $ accel_forearm_z     : num  -1.15 -1.16 -1.14 -1.15 -1.15 ...
##  $ magnet_forearm_x    : num  0.852 0.849 0.849 0.852 0.849 ...
##  $ magnet_forearm_y    : num  0.538 0.551 0.546 0.54 0.548 ...
##  $ magnet_forearm_z    : num  0.223 0.215 0.204 0.215 0.207 ...

tc3 <- data.frame(train[,5:6],tc2, classe = train[,60])
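
The correlation filter can be illustrated on synthetic data. The sketch below is a simplified base R approximation of the pairwise rule (drop the member of each highly correlated pair that has the larger mean absolute correlation); it is not caret's findCorrelation implementation, and the columns a, a_copy and b are made up.

```r
set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a      = a,
  a_copy = a + rnorm(200, sd = 0.1),  # nearly a duplicate of a
  b      = rnorm(200)                 # generated independently of a
)

cm <- abs(cor(toy))
diag(cm) <- 0                         # ignore self-correlation

# Pairs whose absolute correlation exceeds the cutoff (each pair listed once).
high <- which(cm > 0.75 & upper.tri(cm), arr.ind = TRUE)

# For each pair, flag the column with the larger mean absolute correlation.
mean_abs <- colMeans(cm)
drop <- unique(apply(high, 1, function(p) p[which.max(mean_abs[p])]))

kept <- names(toy)[-drop]
kept
```

Only one of a and a_copy survives, while b is kept because it is not strongly correlated with either.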

Model selection

Since the data set contains both numeric and factor variables, two models were chosen:

Support Vector Machine
Random Forest

Results

Model training

Repeated cross-validation with 10 folds and 3 repetitions was applied.

cv.ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, 
                        classProbs = TRUE)
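
With 10 folds and 3 repetitions, each candidate model is fitted and evaluated on 30 resamples, and every row is held out exactly once per repetition. The base R sketch below shows how such hold-out sets could be generated for a made-up sample size n; the naming scheme is hypothetical, and unlike caret this sketch does not stratify the folds by class.

```r
set.seed(1904)
n <- 100        # made-up number of rows
k <- 10         # folds
repeats <- 3    # repetitions

resamples <- list()
for (r in seq_len(repeats)) {
  # Assign each row to one of k folds at random (new shuffle per repetition).
  fold_id <- sample(rep(seq_len(k), length.out = n))
  for (f in seq_len(k)) {
    # Rows held out for evaluation in this fold; the rest are used for fitting.
    resamples[[sprintf("Fold%02d.Rep%d", f, r)]] <- which(fold_id == f)
  }
}

length(resamples)               # number of fit/evaluate rounds
table(lengths(resamples))       # hold-out size per round
```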

Next, both models (SVM and Random Forest) are trained.

Support Vector Machine:

model_svm <- train(classe ~ ., method = "svmRadial", trControl = cv.ctrl, data = tc3, verbose = FALSE)

## Loading required package: kernlab

## Warning: package 'kernlab' was built under R version 3.2.4

## 
## Attaching package: 'kernlab'

## The following object is masked from 'package:ggplot2':
## 
##     alpha

Random Forest:

model_rf <- train(classe ~ ., method = "rf", trControl = cv.ctrl, data = tc3)

## Loading required package: randomForest

## Warning: package 'randomForest' was built under R version 3.2.5

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

Model results

Next, the results of both methods (SVM and Random Forest) are compared to decide which one performs better. To get a more realistic estimate of performance, the held-out data sample (test) is used.

Support Vector Machine results:

res <- predict(model_svm, test)

confusionMatrix(res,test$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2208   89   89    5    0
##          B   23 1320  494   29    0
##          C    1  108  584  179    7
##          D    0    1  201  955   42
##          E    0    0    0  118 1393
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8233          
##                  95% CI : (0.8147, 0.8317)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7756          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9892   0.8696  0.42690   0.7426   0.9660
## Specificity            0.9674   0.9137  0.95446   0.9628   0.9816
## Pos Pred Value         0.9235   0.7074  0.66439   0.7965   0.9219
## Neg Pred Value         0.9956   0.9669  0.88747   0.9502   0.9923
## Prevalence             0.2845   0.1935  0.17436   0.1639   0.1838
## Detection Rate         0.2814   0.1682  0.07443   0.1217   0.1775
## Detection Prevalence   0.3047   0.2378  0.11203   0.1528   0.1926
## Balanced Accuracy      0.9783   0.8916  0.69068   0.8527   0.9738
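
The headline numbers can be recomputed by hand from the confusion matrix. The sketch below re-enters the SVM counts printed above (rows are predictions, columns are reference labels) and derives the overall accuracy and the per-class sensitivity; the variable names are illustrative. It reproduces the reported accuracy of 0.8233 and the class C sensitivity of 0.4269.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the SVM confusion matrix above.
cm_svm <- matrix(
  c(2208,   89,  89,   5,    0,
      23, 1320, 494,  29,    0,
       1,  108, 584, 179,    7,
       0,    1, 201, 955,   42,
       0,    0,   0, 118, 1393),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy: correctly classified rows divided by all rows.
accuracy_svm <- sum(diag(cm_svm)) / sum(cm_svm)

# Sensitivity per class: correct predictions divided by actual class members.
sensitivity_svm <- diag(cm_svm) / colSums(cm_svm)

round(accuracy_svm, 4)
round(sensitivity_svm, 4)
```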

Random Forest results:

res <- predict(model_rf, test)

confusionMatrix(res,test$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    2    0    0    0
##          B    0 1514    0    0    0
##          C    0    2 1368    1    0
##          D    0    0    0 1285    0
##          E    0    0    0    0 1442
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9994          
##                  95% CI : (0.9985, 0.9998)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9992          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9974   1.0000   0.9992   1.0000
## Specificity            0.9996   1.0000   0.9995   1.0000   1.0000
## Pos Pred Value         0.9991   1.0000   0.9978   1.0000   1.0000
## Neg Pred Value         1.0000   0.9994   1.0000   0.9998   1.0000
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1930   0.1744   0.1638   0.1838
## Detection Prevalence   0.2847   0.1930   0.1747   0.1638   0.1838
## Balanced Accuracy      0.9998   0.9987   0.9998   0.9996   1.0000
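
The out-of-sample error is one minus the accuracy on the held-out data. As a worked check, the sketch below recomputes it from the random forest counts printed above, where 5 of the 7846 test rows are misclassified; the variable names are illustrative.

```r
# Correct (diagonal) counts copied from the random forest matrix above.
correct_rf <- c(A = 2232, B = 1514, C = 1368, D = 1285, E = 1442)

# Off-diagonal cells: 2 (predicted A, actual B), 2 (predicted C, actual B),
# 1 (predicted C, actual D).
misclassified_rf <- 2 + 2 + 1

n_test <- sum(correct_rf) + misclassified_rf
accuracy_rf <- sum(correct_rf) / n_test
oos_error_rf <- 1 - accuracy_rf

round(accuracy_rf, 4)
signif(oos_error_rf, 1)
```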

With an accuracy of almost 100% (out-of-sample error = 6e-04), the random forest model performs considerably better than the SVM, which achieves only 82% accuracy and has problems especially with class C sensitivity (0.427). The