Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite the authors, as they have been very generous in allowing their data to be used for this kind of assignment.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
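For reproducibility, both files can be fetched directly from those URLs before loading (a minimal sketch; it assumes a writable working directory and the same filenames used by the read.csv calls below):
# Download the CSVs once if they are not already present
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainURL, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testURL, "pml-testing.csv")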
# Load the data, treating blank, "NA", and "#DIV/0!" strings as missing values;
# stringsAsFactors = TRUE keeps user_name and classe as factors (the read.csv default changed in R 4.0)
library(caret)  # train, createDataPartition, confusionMatrix
Test <- read.csv("pml-testing.csv", na.strings = c("", "NA", "#DIV/0!"), stringsAsFactors = TRUE)
Train <- read.csv("pml-training.csv", na.strings = c("", "NA", "#DIV/0!"), stringsAsFactors = TRUE)
Before attempting to tidy the data, I explored the dataset to get a better understanding of what is available.
# Dimensions of data
dim(Train)
head(Train)
summary(Train)
sapply(Train, class)
table(Train$classe)
duplicated(colnames(Train))
The training data has 19622 instances with 160 attributes, which is worth reducing. I experimented with removing columns that have a high percentage of NA values; thresholds of either 50% or 95% NAs both left 60 attributes.
CTrain <- Train[, -which(colMeans(is.na(Train)) > .5)]
dim(CTrain)
## [1] 19622 60
CTrain <- Train[, -which(colMeans(is.na(Train)) > .95)]
dim(CTrain)
## [1] 19622 60
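The two thresholds agree because the NA fractions are essentially bimodal: each column is either nearly complete or almost entirely NA. A quick check makes this visible (a sketch; naFrac is a name introduced here):
# Count columns by their fraction of missing values
naFrac <- colMeans(is.na(Train))
table(cut(naFrac, breaks = c(-0.01, 0.5, 0.95, 1)))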
I continued exploring the data, looking at how the observations are distributed across the users and across the classe outcome. I then removed the superfluous identifier columns to streamline the data and improve the modeling predictions.
levels(CTrain$user_name)
## [1] "adelmo" "carlitos" "charles" "eurico" "jeremy" "pedro"
Upercent <- prop.table(table(CTrain$user_name)) * 100
cbind(freq = table(CTrain$user_name), percentage = Upercent)
## freq percentage
## adelmo 3892 19.83488
## carlitos 3112 15.85975
## charles 3536 18.02059
## eurico 3070 15.64570
## jeremy 3402 17.33768
## pedro 2610 13.30140
plot(CTrain$user_name, main = "Observations per participant")
levels(CTrain$classe)
## [1] "A" "B" "C" "D" "E"
Cpercent <- prop.table(table(CTrain$classe)) * 100
cbind(freq = table(CTrain$classe), percentage = Cpercent)
## freq percentage
## A 5580 28.43747
## B 3797 19.35073
## C 3422 17.43961
## D 3216 16.38977
## E 3607 18.38243
plot(CTrain$classe, main = "Observations per classe")
### Remove likely unnecessary columns
TempTrain <- !names(CTrain) %in%
  c('X', 'user_name', 'raw_timestamp_part_1', 'raw_timestamp_part_2',
    'cvtd_timestamp', 'new_window')
CTrain <- CTrain[, TempTrain]
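Dropping the six identifier columns should leave 54 columns, i.e. 53 predictors plus the classe outcome, which matches the model summary further down (a quick sanity check):
# Expect 19622 rows and 54 columns (53 predictors + classe)
dim(CTrain)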
Basic partitioning of the training data into training, cross-validation, and hold-out sets, leaving the actual testing data untouched.
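Both the partitioning and the model fits are stochastic, so setting a seed first makes the run reproducible (the seed value itself is an arbitrary assumption; the original run did not record one):
# Arbitrary seed for reproducible partitions and fits
set.seed(32123)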
# 60% of the cleaned data for training
inTrain <- createDataPartition(CTrain$classe, p = 0.6)[[1]]
CrossV <- CTrain[-inTrain, ]
PTrain <- CTrain[inTrain, ]
# Split the remaining 40% into cross-validation (75%) and hold-out (25%) sets
inCV <- createDataPartition(CrossV$classe, p = 0.75)[[1]]
CrossVtest <- CrossV[-inCV, ]
CrossV <- CrossV[inCV, ]
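The resulting split is roughly 60% training, 30% cross-validation, and 10% hold-out, which can be confirmed directly (a sketch):
# Share of the cleaned data in each partition
round(c(train = nrow(PTrain), cv = nrow(CrossV), holdout = nrow(CrossVtest)) / nrow(CTrain), 2)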
I attempted three different models. The Random Forest had the highest accuracy and the lowest in-sample and out-of-sample error, so it was selected. The model also performed excellently in cross-validation tests.
# Fit the candidate models on the training partition only, so CrossV stays out of sample
mod1 <- train(classe ~ ., data = PTrain, method = "rf")
mod2 <- train(classe ~ ., data = PTrain, method = "gbm")
mod3 <- train(classe ~ ., data = PTrain, method = "lda")
pred1 <- predict(mod1, CrossV)
pred2 <- predict(mod2, CrossV)
pred3 <- predict(mod3, CrossV)
### Confusion Matrices
confusionMatrix(pred1, CrossV$classe)
confusionMatrix(pred2, CrossV$classe)
confusionMatrix(pred3, CrossV$classe)
### Create Combination Model
Cmodel <- data.frame(pred1, pred2, classe=CrossV$classe)
Cmodel2 <- data.frame(pred2, pred3, classe=CrossV$classe)
Cmodel3 <- data.frame(pred1, pred2, pred3, classe=CrossV$classe)
CmodelFit <- train(classe ~ ., method="rf", data=Cmodel)
CmodelFit2 <- train(classe ~ ., method="rf", data=Cmodel2)
CmodelFit3 <- train(classe ~ ., method="rf", data=Cmodel3)
#### in-sample error
CmodelFitIn <- predict(CmodelFit, Cmodel)
CmodelFitIn2 <- predict(CmodelFit2, Cmodel2)
CmodelFitIn3 <- predict(CmodelFit3, Cmodel3)
confusionMatrix(CmodelFitIn, Cmodel$classe)
confusionMatrix(CmodelFitIn2, Cmodel2$classe)
confusionMatrix(CmodelFitIn3, Cmodel3$classe)
#### out-of-sample error on the hold-out set
pred1test <- predict(mod1, CrossVtest)
pred3test <- predict(mod3, CrossVtest)
confusionMatrix(pred1test, CrossVtest$classe)
confusionMatrix(pred3test, CrossVtest$classe)
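To score the stacked model on the hold-out set as well, the base-model predictions on CrossVtest have to be assembled under the same column names used to build Cmodel3 (a sketch; testDF is a name introduced here):
# Base-model predictions on the hold-out set, named to match Cmodel3
testDF <- data.frame(pred1 = predict(mod1, CrossVtest),
                     pred2 = predict(mod2, CrossVtest),
                     pred3 = predict(mod3, CrossVtest))
confusionMatrix(predict(CmodelFit3, testDF), CrossVtest$classe)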
The Random Forest model with 5-fold cross-validation will be used as the predictor rather than the other methods or a combination of methods, for two reasons:
* RF handles large numbers of inputs well, including when the interactions between them are unknown.
* RF provides a built-in out-of-bag (OOB) estimate that approximates the out-of-sample error rate.
I then explored the model with various visualizations, including a ranking of the most important variables.
RFmodel <- train(classe ~ ., data = CTrain, method = "rf",
                 trControl = trainControl(method = "cv", number = 5))
save(RFmodel, file = "RFmodel2.Rda")
print(RFmodel)
## Random Forest
##
## 19622 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15697, 15699, 15697, 15698, 15697
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9962287 0.9952296
## 27 0.9984711 0.9980662
## 53 0.9960759 0.9950359
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
plot(RFmodel)
varImp(RFmodel)
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 61.803
## pitch_forearm 37.701
## yaw_belt 31.640
## magnet_dumbbell_z 28.210
## magnet_dumbbell_y 27.872
## pitch_belt 27.056
## roll_forearm 22.823
## accel_dumbbell_y 12.473
## roll_dumbbell 10.433
## accel_belt_z 10.374
## magnet_dumbbell_x 10.314
## accel_forearm_x 9.818
## total_accel_dumbbell 8.911
## accel_dumbbell_z 7.930
## magnet_forearm_z 7.043
## magnet_belt_z 6.868
## magnet_belt_y 6.436
## magnet_belt_x 5.569
## roll_arm 5.141
plot(varImp(RFmodel))
plot(varImp(RFmodel), main = "Importance of Top 30 Variables", top = 30)
plot(varImp(RFmodel), main = "Importance of Top 15 Variables", top = 15)
plot(varImp(RFmodel), main = "Importance of Top 10 Variables", top = 10)
RFmodel$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.13%
## Confusion matrix:
## A B C D E class.error
## A 5578 1 0 0 1 0.0003584229
## B 4 3790 2 1 0 0.0018435607
## C 0 5 3417 0 0 0.0014611338
## D 0 0 7 3207 2 0.0027985075
## E 0 0 0 2 3605 0.0005544774
Overall, this model has a low error rate, consistently under 0.16%.
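The OOB estimate can also be pulled programmatically from the underlying randomForest object (a sketch; err.rate is the per-tree error matrix that randomForest stores, with an "OOB" column):
# OOB error after the final tree, as a percentage
oob <- RFmodel$finalModel$err.rate[RFmodel$finalModel$ntree, "OOB"]
round(oob * 100, 2)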
I intersected the cleaned training data's column names with the testing data to ensure the same predictor set, then ran predictions against the testing data.
CTest <- Test[ , intersect(names(CTrain), names(Test))]
RFpredict <- predict(RFmodel, CTest)
# Test has no classe column, so inspect the cross-validated confusion matrix instead
confusionMatrix(RFmodel, norm = "none")
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are un-normalized aggregated counts)
##
## Reference
## Prediction A B C D E
## A 5578 6 0 0 0
## B 1 3788 6 0 0
## C 0 2 3416 10 0
## D 0 1 0 3205 2
## E 1 0 0 1 3605
##
## Accuracy (average) : 0.9985
The cross-validated accuracy remains high, consistently above 99.8%.
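Finally, the 20 test-case predictions can be listed against their identifiers (a sketch; problem_id is the case-id column that pml-testing.csv carries in place of classe):
# Predicted classe for each of the 20 test cases
data.frame(problem_id = Test$problem_id, prediction = RFpredict)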