In this project, my goal is to use accelerometer data from the belt, forearm, arm, and dumbbell of 6 participants as training data, and to use cross-validation to select a good model. The test data are then used to validate the final chosen model.
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har.
rm(list=ls())
library(caret);library(rpart);library(rpart.plot);library(RColorBrewer);library(rattle);library(randomForest);library(gbm)
set.seed(12345)
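# stringsAsFactors=TRUE keeps classe a factor on R >= 4.0, where the default changed to FALSE
TrainDT <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"),header=TRUE,stringsAsFactors=TRUE)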
dim(TrainDT)
## [1] 19622 160
#str(TrainDT)
TestDT <- read.csv(url("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"),header=TRUE,stringsAsFactors=TRUE)
dim(TestDT)
## [1] 20 160
#str(TestDT)
# Count missing (NA or empty) values per column. Many columns have more than 19000 missing values and are dropped; the first 7 columns (identifiers, timestamps, window info) are also dropped because they are not useful as predictors
MisColChkTrain <- (colSums(is.na(TrainDT) |TrainDT==""))
CleanedTrainDT <- TrainDT[, -which(MisColChkTrain>19000)]
CleanedTrainDT <-CleanedTrainDT[,-c(1:7)]
dim(CleanedTrainDT)
## [1] 19622 53
# Apply the same check to the test data: drop every column that is missing in all 20 rows, then drop the first 7 columns as above
MisColChkTest <- (colSums(is.na(TestDT) |TestDT==""))
CleanedTestDT <- TestDT[, -which(MisColChkTest==20)]
CleanedTestDT <-CleanedTestDT[,-c(1:7)]
dim(CleanedTestDT)
## [1] 20 53
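Dropping columns with `-which(condition)` works here because some columns meet the condition, but the same idiom drops every column when none do, since `-integer(0)` selects nothing. A minimal sketch of a logical-index alternative on a toy data frame follows. The names `toy` and `drop_sparse_cols` and the thresholds are assumptions for illustration, not part of the analysis above.

```r
# Toy data frame standing in for the raw data (hypothetical values).
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 1),
  c = c("", "", "", "x"),
  d = c("p", "q", "r", "s"),
  stringsAsFactors = FALSE
)

# Drop columns whose share of NA or empty-string cells exceeds `max_missing`.
# Logical indexing keeps all columns when none qualify, unlike -which().
drop_sparse_cols <- function(df, max_missing = 0.9) {
  missing_share <- colMeans(is.na(df) | df == "")
  df[, missing_share <= max_missing, drop = FALSE]
}

cleaned <- drop_sparse_cols(toy, max_missing = 0.5)
names(cleaned)
```

With a threshold that no column exceeds, the function returns the data frame unchanged, which is the case where `-which()` would silently return zero columns.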
# Create partition of the cleaned training data set
PartitionDT <- createDataPartition(CleanedTrainDT$classe, p=0.7, list=FALSE)
Train1 <- CleanedTrainDT[PartitionDT,]
Test1 <- CleanedTrainDT[-PartitionDT,]
dim(Train1)
## [1] 13737 53
dim(Test1)
## [1] 5885 53
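The 70/30 split above is drawn within each class, so both partitions keep roughly the same class mix. The idea can be sketched in base R on a toy outcome. The vector below is hypothetical data, and this is an illustration of stratified sampling, not a reimplementation of `createDataPartition`.

```r
set.seed(12345)  # any fixed seed works; this mirrors the value used above

# Toy outcome with unbalanced classes (hypothetical data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them.
idx_by_class <- split(seq_along(classe), classe)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.7 * length(i)))]
}))

train_y <- classe[train_idx]
test_y  <- classe[-train_idx]

# Class proportions should be nearly identical in both partitions.
round(prop.table(table(train_y)), 2)
round(prop.table(table(test_y)), 2)
```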
#Classification tree modeling
TrainControl <- trainControl(method="cv", number=5)
model_CT <- train(classe~., data=Train1, method="rpart", trControl=TrainControl)
fancyRpartPlot(model_CT$finalModel)
TrainPred <- predict(model_CT,newdata=Test1)
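# Note: the arguments are given as (actual, predicted), so in the printed table rows are the actual classes and columns are the predicted classes, despite the labels; accuracy is unaffected
CT <- confusionMatrix(Test1$classe,TrainPred)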
CT$table
## Reference
## Prediction A B C D E
## A 1525 29 116 0 4
## B 484 385 270 0 0
## C 499 37 490 0 0
## D 423 187 354 0 0
## E 153 159 289 0 481
CT$overall[1]
## Accuracy
## 0.4895497
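The reported accuracy can be checked by hand from the table: it is the sum of the diagonal divided by the total count. The sketch below re-enters the table values printed above, with rows taken as actual classes and columns as predicted classes, which follows from the argument order in the `confusionMatrix` call. The object name `ct` is an assumption for illustration.

```r
# Confusion table copied from the printed output (rows = actual, columns = predicted).
ct <- matrix(
  c(1525,  29, 116, 0,   4,
     484, 385, 270, 0,   0,
     499,  37, 490, 0,   0,
     423, 187, 354, 0,   0,
     153, 159, 289, 0, 481),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Accuracy = correctly classified rows / all rows.
accuracy <- sum(diag(ct)) / sum(ct)
accuracy

# Share of each actual class that the tree classifies correctly.
round(diag(ct) / rowSums(ct), 3)

# How often each class is predicted; the D column sums to zero.
colSums(ct)
```

The zero column for D shows that the tree never predicts that class, which accounts for a large part of the low accuracy.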
# Random forests modeling
model_RF <- train(classe~., data=Train1, method="rf", trControl=TrainControl, verbose=FALSE)
print(model_RF)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10990, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9898813 0.9871995
## 27 0.9898085 0.9871076
## 52 0.9832572 0.9788195
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
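The "Summary of sample sizes" line follows from the fold arithmetic: 13737 rows split into 5 folds gives held-out folds of 2747 or 2748 rows, so each model is fitted on 10990 or 10989 rows. A minimal sketch with plain random fold assignment is shown below. caret's own fold creation may also balance classes across folds, which this sketch does not attempt, and the variable names are assumptions.

```r
set.seed(12345)
n <- 13737   # rows in Train1, from the output above
k <- 5       # number of folds, as in trainControl(number=5)

# Assign each row to one of k folds of near-equal size (not stratified).
fold_id <- sample(rep_len(1:k, n))

held_out   <- as.vector(table(fold_id))   # rows held out in each fold
trained_on <- n - held_out                # rows used for fitting in each fold
sort(trained_on)
```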
plot(model_RF,main="Random forest model accuracy by number of predictors")
TrainPred <- predict(model_RF,newdata=Test1)
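# Note: as before, the arguments are (actual, predicted), so rows of the printed table are the actual classes
RF <- confusionMatrix(Test1$classe,TrainPred)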
RF$table
## Reference
## Prediction A B C D E
## A 1673 1 0 0 0
## B 4 1134 1 0 0
## C 0 5 1021 0 0
## D 0 0 21 941 2
## E 0 0 0 0 1082
RF$overall[1]
## Accuracy
## 0.9942226
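The held-out accuracy also gives an estimate of the out-of-sample error, which is one minus the accuracy. The sketch below recomputes both from the diagonal of the table above and adds an exact binomial confidence interval using base R's `binom.test`. The variable names are assumptions for illustration.

```r
correct <- 1673 + 1134 + 1021 + 941 + 1082   # diagonal of the table above
total   <- 5885                              # rows in Test1

accuracy  <- correct / total
oos_error <- 1 - accuracy   # estimated out-of-sample error rate
round(c(accuracy = accuracy, error = oos_error), 4)

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int
ci
```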
result <- predict(model_RF, newdata=CleanedTestDT)
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The classification tree is not accurate enough (about 49% on the held-out partition), and it never predicts class D. This reflects the limits of a single tree rather than a lack of signal in the predictors, since the random forest reaches about 99.4% accuracy on the same held-out data with the same predictors. I also tried gradient boosting; the code and results are not shown here because of limited space. Overall, the random forest is the best prediction method, so the random forest model was used to predict the 20 test cases.