Practical Machine Learning Project

Project introduction

In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants as training data, and then use cross-validation to select a good model. The test data are then used to validate the final chosen model.

The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.

Data exploration

rm(list=ls())
library(caret)
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
library(randomForest)
library(gbm)

set.seed(12345)
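# stringsAsFactors=TRUE keeps classe a factor on R >= 4.0, where the default changed to FALSE
TrainDT <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),header=TRUE,stringsAsFactors=TRUE)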
dim(TrainDT)

## [1] 19622   160

#str(TrainDT)
TestDT <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),header=TRUE,stringsAsFactors=TRUE)
dim(TestDT)

## [1]  20 160

#str(TestDT)

# Count the missing or empty cells in each column. Many columns have more than 19000 missing values, so drop them.
MisColChkTrain <- colSums(is.na(TrainDT) | TrainDT=="")
# Logical indexing avoids the pitfall of -which(...) dropping every column when nothing matches
CleanedTrainDT <- TrainDT[, MisColChkTrain<=19000]
# Drop the first 7 variables, which are not used as predictors
CleanedTrainDT <- CleanedTrainDT[,-c(1:7)]
dim(CleanedTrainDT)

## [1] 19622    53

# Do the same for the test data: drop the columns that are missing in all 20 rows
MisColChkTest <- colSums(is.na(TestDT) | TestDT=="")
CleanedTestDT <- TestDT[, MisColChkTest<20]
# Drop the first 7 variables, which are not used as predictors
CleanedTestDT <- CleanedTestDT[,-c(1:7)]
dim(CleanedTestDT)

## [1] 20 53
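
The column filter above can be illustrated on a small made-up data frame. This is only a sketch: the helper name `drop_sparse_columns`, the 0.95 threshold, and the toy values are assumptions chosen for illustration, not part of the project code.

```r
# Hypothetical helper: keep only the columns whose share of missing or
# empty cells is at most max_missing.
drop_sparse_columns <- function(df, max_missing = 0.95) {
  is_missing <- is.na(df) | df == ""      # logical matrix, one cell per value
  missing_share <- colMeans(is_missing)   # fraction of missing cells per column
  df[, missing_share <= max_missing, drop = FALSE]
}

# Toy data (made up): two usable columns, one all-empty, one all-NA.
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48, 1.45),
  kurtosis  = c("", "", "", ""),
  max_roll  = c(NA, NA, NA, NA),
  classe    = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)

cleaned <- drop_sparse_columns(toy)
names(cleaned)
```

Using a share of missing cells instead of a fixed count such as 19000 lets the same rule apply to both the training data and the much smaller test data.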

Model selection

# Create a partition of the cleaned training data set
PartitionDT <- createDataPartition(CleanedTrainDT$classe, p=0.7, list=FALSE)
Train1 <- CleanedTrainDT[PartitionDT,]
Test1 <- CleanedTrainDT[-PartitionDT,]
dim(Train1)

## [1] 13737    53

dim(Test1)

## [1] 5885   53
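
When the outcome is a factor, `createDataPartition` samples within each class so that both parts keep roughly the same class mix. The idea can be sketched in base R as follows. This is a simplified illustration on made-up labels, not caret's actual implementation.

```r
set.seed(12345)

# Toy outcome with unequal class sizes (made-up data, not the project data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them.
train_idx <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

table(classe[train_idx])    # class counts in the training part
table(classe[-train_idx])   # class counts in the held-out part
```

Each class contributes 70% of its rows to the training part, so a rare class cannot end up missing from either part by chance.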

#Classification tree modeling
TrainControl <- trainControl(method="cv", number=5)
model_CT <- train(classe~., data=Train1, method="rpart", trControl=TrainControl)
fancyRpartPlot(model_CT$finalModel)

# Predict on the held-out partition; confusionMatrix() takes the predictions first, then the true labels
Test1Pred <- predict(model_CT,newdata=Test1)
CT <- confusionMatrix(Test1Pred,Test1$classe)
CT$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1525  484  499  423  153
##          B   29  385   37  187  159
##          C  116  270  490  354  289
##          D    0    0    0    0    0
##          E    4    0    0    0  481

CT$overall[1]

##  Accuracy 
## 0.4895497
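
The accuracy figure can be checked by hand: it is the sum of the diagonal of the confusion matrix divided by the number of rows in Test1. The sketch below copies the counts from the table above. The baseline comparison is my own addition for context, using the 1674 cases whose true class is A.

```r
# Correct predictions per class (the diagonal of the confusion matrix).
correct <- c(A = 1525, B = 385, C = 490, D = 0, E = 481)
n_total <- 5885   # rows in Test1

accuracy <- sum(correct) / n_total
accuracy

# For comparison: accuracy of always predicting the most common class (A).
baseline <- 1674 / n_total
baseline
```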

# Random forests modeling
model_RF <- train(classe~., data=Train1, method="rf", trControl=TrainControl, verbose=FALSE)
print(model_RF)

## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10989, 10990, 10990, 10989 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9898813  0.9871995
##   27    0.9898085  0.9871076
##   52    0.9832572  0.9788195
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

plot(model_RF,main="Random forest model accuracy by number of predictors")
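
The selection rule reported in the output ("largest value" of accuracy) can be reproduced from the resampling table. The sketch below retypes the printed numbers into a data frame; it does not refit the model.

```r
# Cross-validated results copied from the printed model summary.
cv_results <- data.frame(
  mtry     = c(2, 27, 52),
  Accuracy = c(0.9898813, 0.9898085, 0.9832572),
  Kappa    = c(0.9871995, 0.9871076, 0.9788195)
)

# Pick the row with the highest cross-validated accuracy.
best <- cv_results[which.max(cv_results$Accuracy), ]
best$mtry

# Cross-validated estimate of the error rate for that setting.
cv_error <- 1 - best$Accuracy
cv_error
```

The gap between mtry = 2 and mtry = 27 is tiny, so the choice between them matters little here. Using all 52 predictors at every split is slightly worse.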

# Predict on the held-out partition; confusionMatrix() takes the predictions first, then the true labels
Test1Pred <- predict(model_RF,newdata=Test1)
RF <- confusionMatrix(Test1Pred,Test1$classe)
RF$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1673    4    0    0    0
##          B    1 1134    5    0    0
##          C    0    1 1021   21    0
##          D    0    0    0  941    0
##          E    0    0    0    2 1082

RF$overall[1]

##  Accuracy 
## 0.9942226
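
The held-out accuracy translates directly into an estimated out-of-sample error. As a rough sketch, it also gives the chance of getting all 20 test cases right. That second number assumes the 20 cases are independent and that each is classified correctly with the held-out accuracy, which is a simplification of my own.

```r
accuracy <- 0.9942226          # accuracy on Test1, from the output above

# Estimated out-of-sample error rate.
oos_error <- 1 - accuracy
oos_error

# Rough probability that all 20 test cases are predicted correctly,
# assuming independent cases with the same per-case accuracy.
p_all_20 <- accuracy^20
p_all_20
```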

Applying random forests model to the validation data

result <- predict(model_RF, newdata=CleanedTestDT)
result

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
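
To make the predictions easier to inspect, they can be paired with a case number and tabulated. The sketch below retypes the predictions printed above. The `problem_id` numbering from 1 to 20 and the `submission` data frame are assumptions for illustration.

```r
# Predictions copied from the output above.
predicted <- factor(
  c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
    "B", "C", "B", "A", "E", "E", "A", "B", "B", "B"),
  levels = c("A", "B", "C", "D", "E")
)

# Assumed layout: one row per test case, numbered 1 to 20.
submission <- data.frame(problem_id = seq_along(predicted), classe = predicted)

# How often each class is predicted.
table(submission$classe)
```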

Conclusion

The classification tree performs poorly: its accuracy on the held-out partition is only about 49%, and it never predicts class D. A single tree is therefore too weak to separate the five classes with these predictors. The random forest, in contrast, reaches an accuracy of about 99.4% on the same partition. I also tried gradient boosting; the code and results are not shown here because of limited space. Overall, the random forest was the best prediction method, so it was the model used to predict the validation data.