# Read the data, treating both "NA" and "#DIV/0!" as missing values
test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
# Keep train columns that are at most 60% NA, and test columns that are not entirely NA
train <- train[, colSums(is.na(train)) <= nrow(train) * 0.6]
test <- test[, colSums(is.na(test)) != nrow(test)]
If the count of NAs in a column equals the number of rows, the column is entirely NA and carries no information. We therefore drop such columns from the test set, and drop any training column that is more than 60% NA.
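As a toy illustration of the column-wise NA counting (made-up data, not the project files):

df <- data.frame(a = c(1, 2, 3), b = c(NA, NA, NA), c = c(1, NA, 3))
colSums(is.na(df))                   # a = 0, b = 3, c = 1
df[, colSums(is.na(df)) != nrow(df)] # drops b, which is entirely NA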
library(caret)
## Warning: package 'caret' was built under R version 3.2.1
## Loading required package: lattice
## Loading required package: ggplot2
col.NZV <- nearZeroVar(train, saveMetrics = TRUE) # per-column metrics, including an nzv flag
nzv.columns <- rownames(col.NZV[col.NZV$nzv, ])   # names of the near-zero-variance columns
drop.columns <- c("X", "problem_id", nzv.columns)
test <- test[, !colnames(test) %in% drop.columns]
classe <- train$classe                             # set the response aside
train <- train[, colnames(train) %in% names(test)] # keep only the predictors shared with test
train <- data.frame(train, classe)                 # re-attach the response
Drop the unnecessary columns: the observation id (X), problem_id, and the near-zero-variance columns. Since the test set contains fewer variables than the training set, we fit the model using only the variables that appear in both data sets.
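As a quick sanity check (optional; the expected result is an assumption about the cleaned data, not output from the original run), the only training column absent from the test set should now be the response:

setdiff(colnames(train), colnames(test)) # should return just "classe"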
set.seed(123)
folds <- createFolds(train$user_name, k=5, list = TRUE)
library(rpart)
# tree <- train(classe ~ ., data = training, method = "rpart") # caret's train() seemed to throw an error here, so we call rpart directly
# tree <- rpart(classe ~ ., data = training, method = "class")
# rattle::fancyRpartPlot(tree)
# tree.predict <- predict(tree, newdata = validation, type = "class")
# confusionMatrix(validation$classe, tree.predict)
k <- 5
accuracy <- rep(NA, k)
for (i in 1:k) {
    kfolds.train <- train[-folds[[i]], ]  # train on everything outside fold i
    kfolds.test <- train[folds[[i]], ]    # evaluate on fold i
    tree <- rpart(classe ~ ., data = kfolds.train, method = "class")
    tree.predict <- predict(tree, newdata = kfolds.test, type = "class")
    result <- confusionMatrix(tree.predict, kfolds.test$classe) # caret expects (prediction, reference)
    accuracy[i] <- result$overall[1]
}
accuracy
## [1] 0.8707622 0.8650025 0.8679918 0.8749045 0.8827727
mean(accuracy)
## [1] 0.8722867
The decision tree performs quite well: the 5-fold average accuracy is 0.872, so the average out-of-sample error is 0.128.
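For the record, the quoted out-of-sample error is simply the complement of the cross-validated accuracy:

1 - mean(accuracy) # average out-of-sample error across the 5 folds, about 0.128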
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
k <- 5
accuracy <- rep(NA, k)
for (i in 1:k) {
    set.seed(1234)
    kfolds.train <- train[-folds[[i]], ]
    kfolds.test <- train[folds[[i]], ]
    rf <- randomForest(classe ~ ., data = kfolds.train, ntree = 200)
    rf.predict <- predict(rf, newdata = kfolds.test)
    result <- confusionMatrix(rf.predict, kfolds.test$classe) # (prediction, reference)
    accuracy[i] <- result$overall[1]
}
accuracy
## [1] 0.9984706 0.9989812 0.9989806 0.9992357 0.9992355
mean(accuracy)
## [1] 0.9989807
Random forest performs much better than the decision tree: the 5-fold average accuracy reaches 0.999, and the average out-of-sample error is only 0.001.
The 5-fold cross-validation shows that the random forest performs very well. We fit a random forest on the entire training set to obtain the final model, which will be used later to predict the test set.
set.seed(1234)
rf <- randomForest(classe ~ ., data = train, ntree = 200) # final model on the full training set
plot(rf, main = "Error rate versus number of trees of the random forest")
We can see that after fitting only about 30 trees, the random forest already reaches a very small error rate. The default number of trees (ntree) is 500, but a small value is sufficient here and reduces computing time.
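The same information can be read off the fitted object: rf$err.rate stores the cumulative out-of-bag error after each tree. A quick check (output not part of the original run):

rf$err.rate[c(30, 200), "OOB"] # OOB error after 30 trees vs. all 200 -- nearly identical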
We use caret to fit boosting and request k-fold cross-validation through its trainControl parameter. We split the training data into a training part and a validation part, run boosting with 3-fold cross-validation on the training part, and then evaluate performance on the validation set. (To speed up the run we use only 3 folds.)
If you run k-fold cross-validation by hand, it generates k models, and the average out-of-sample error rate tells you how well that kind of model performs; you then fit the model on the whole data set and use that fit to predict the test set. When you request k-fold cross-validation in caret through its parameters, however, it returns one final model: caret uses the resampling only to choose the best combination of tuning parameters, then refits a single model on all of the training data with those parameters (see the check after the model is fitted below).
set.seed(12345)
inTrain <- createDataPartition(train$user_name, p = 0.6, list = FALSE)
training <- train[inTrain, ]
validation <- train[-inTrain, ]
set.seed(12345)
fitControl <- trainControl(method = "repeatedcv",
                           number = 3,
                           repeats = 1)
boost <- train(classe ~ ., data = training, method = "gbm",
               trControl = fitControl,
               verbose = FALSE)
## Loading required package: gbm
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.1
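To see how caret arrives at the single model it returns, inspect the tuning results (a quick check; this output was not part of the original run):

boost$bestTune # the tuning-parameter combination with the best cross-validated accuracy
boost$results  # cross-validated accuracy for every combination that was tried
# boost$finalModel is a gbm refit on all of `training` with the bestTune parameters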
boostingPred <- predict(boost, newdata=validation)
confusionMatrix(boostingPred, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2228 0 0 0 0
## B 0 1517 1 0 0
## C 0 1 1340 5 0
## D 0 3 4 1315 4
## E 0 0 0 6 1422
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9955, 0.998)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.000 0.9974 0.9963 0.9917 0.9972
## Specificity 1.000 0.9998 0.9991 0.9983 0.9991
## Pos Pred Value 1.000 0.9993 0.9955 0.9917 0.9958
## Neg Pred Value 1.000 0.9994 0.9992 0.9983 0.9994
## Prevalence 0.284 0.1939 0.1714 0.1690 0.1817
## Detection Rate 0.284 0.1933 0.1708 0.1676 0.1812
## Detection Prevalence 0.284 0.1935 0.1716 0.1690 0.1820
## Balanced Accuracy 1.000 0.9986 0.9977 0.9950 0.9981
# library(gbm)
# set.seed(12345)
# boost <- gbm(classe ~ ., data = training, distribution = "multinomial", cv.folds = 2,
#              n.trees = 100, shrinkage = 0.2, interaction.depth = 3, verbose = FALSE)
# # note that boost.predict is an array (rows x classes x n.trees)
# boost.predict <- predict(boost, newdata = validation, n.trees = 100, type = "response")
# boost.predict[1:6, , ]
# boost.predIndex <- apply(boost.predict, MARGIN = 1, which.max) # index of the maximum probability in each row
# boost.predClass <- colnames(boost.predict)[boost.predIndex]   # convert the index into a class label A-E
# boost.predClass <- factor(boost.predClass, levels = levels(validation$classe)) # ensure matching levels
# confusionMatrix(validation$classe, boost.predClass)
Although the overall accuracy here is 0.9969, boosting can actually perform a little better than random forest if the model is re-fitted with other parameters. When fitting boosting you need to tune the parameters carefully, including n.trees, shrinkage, and interaction.depth; the running time increases considerably as n.trees and interaction.depth grow.
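A minimal sketch of how such tuning could be set up in caret (the grid values are illustrative, not the settings used above):

gbmGrid <- expand.grid(n.trees = c(100, 200),       # illustrative values only
                       interaction.depth = c(3, 5),
                       shrinkage = c(0.1, 0.2),
                       n.minobsinnode = 10)
boost.tuned <- train(classe ~ ., data = training, method = "gbm",
                     trControl = fitControl, tuneGrid = gbmGrid,
                     verbose = FALSE)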
# The test set is tiny, so its factor columns (e.g. cvtd_timestamp) lack levels seen
# in training; rebuild each factor with the training levels before predicting
for (col in intersect(names(test), names(train))) {
    if (is.factor(test[[col]])) test[[col]] <- factor(test[[col]], levels = levels(train[[col]]))
}
test.predicted <- predict(rf, newdata = test, type = "class")
test.predicted
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Considering both accuracy and running time, we choose the random forest as the final model. Note that because the test set is so small, its factor columns do not contain every level seen in training, so we must manually align the levels of those columns with the corresponding training columns before the test set can be predicted. Alternatively, you can combine the training and testing data sets at the very beginning and then split them again, which guarantees that their factor levels match.
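A minimal sketch of that alternative, assuming it is applied immediately after read.csv() and before any columns are dropped (train2 and test2 are illustrative names):

shared <- intersect(colnames(train), colnames(test)) # everything except classe / problem_id
combined <- rbind(train[, shared], test[, shared])   # rbind merges the factor levels
train2 <- cbind(combined[1:nrow(train), ], classe = train$classe)
test2 <- combined[(nrow(train) + 1):nrow(combined), ]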