test <- read.csv("pml-testing.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- read.csv("pml-training.csv", header = TRUE, na.strings = c("NA", "#DIV/0!"))
train <- train[ ,colSums(is.na(train)) <= nrow(train)*0.6]
test <- test[ ,colSums(is.na(test)) != nrow(test)]
If the count of NAs in a column equals the number of rows, the column must be entirely NA; such columns are dropped from the \(test\) data set, while the \(train\) data set keeps only the columns that are at most 60% missing.
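As a quick sanity check, the short sketch below (using only the objects loaded above) verifies that both filters did what we intended.
stopifnot(all(colMeans(is.na(train)) <= 0.6))      # train: every remaining column is at most 60% NA
stopifnot(all(colSums(is.na(test)) < nrow(test)))  # test: no remaining column is entirely NA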
library(caret)
## Warning: package 'caret' was built under R version 3.2.1
## Loading required package: lattice
## Loading required package: ggplot2
col.NZV <- nearZeroVar(train, saveMetrics = TRUE) # flag the columns that have near zero variance
drop.columns <- c("X", "problem_id", rownames(col.NZV)[col.NZV$nzv]) # the flagged columns are the row names of col.NZV
test <- test[, !colnames(test) %in% drop.columns]
classe <- train$classe
train <- train[, colnames(train) %in% names(test)] # keep only the predictors that also appear in test
train <- data.frame(train, classe)                 # re-attach the outcome
Drop the unnecessary columns: the observation id, the problem id, and the near-zero-variance variables. Since the \(test\) data set contains fewer variables than the \(train\) data set, we fit the model using only the variables that appear in both data sets.
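A small check (a sketch over the objects created above) shows which variables were flagged as near zero variance and confirms that the two data sets now share the same predictors.
head(rownames(col.NZV)[col.NZV$nzv])            # a few of the near-zero-variance columns
setdiff(names(train), c(names(test), "classe")) # should be empty: the predictors match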
set.seed(123)
inTrain <- createDataPartition(train$classe, p = 0.6, list = FALSE) # stratify the split on the outcome
training <- train[inTrain, ]
validation <- train[-inTrain, ]
Create new \(training\) and \(validation\) data sets from the original training data.
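As a sketch, we can verify that the 60/40 split preserves the class distribution of the outcome.
round(prop.table(table(training$classe)), 3)   # class proportions in the training partition
round(prop.table(table(validation$classe)), 3) # should be nearly identical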
library(rpart)
# tree <- train(classe ~ ., data = training, method = "rpart") # caret's train() errored here, so we call rpart directly
tree <- rpart(classe ~ ., data = training, method = "class")
rattle::fancyRpartPlot(tree) # requires the rattle package
tree.predict <- predict(tree, newdata = validation, type = "class")
confusionMatrix(validation$classe, tree.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2143 59 19 0 0
## B 64 1271 174 9 0
## C 9 96 1224 7 7
## D 1 70 139 868 206
## E 0 0 60 80 1340
##
## Overall Statistics
##
## Accuracy : 0.8725
## 95% CI : (0.865, 0.8798)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8388
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9666 0.8496 0.7574 0.9004 0.8628
## Specificity 0.9861 0.9611 0.9809 0.9396 0.9778
## Pos Pred Value 0.9649 0.8373 0.9114 0.6760 0.9054
## Neg Pred Value 0.9868 0.9644 0.9397 0.9854 0.9665
## Prevalence 0.2826 0.1907 0.2060 0.1229 0.1979
## Detection Rate 0.2731 0.1620 0.1560 0.1106 0.1708
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9764 0.9054 0.8692 0.9200 0.9203
The decision tree performs quite well: the accuracy on the validation set is \(0.8725\).
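To pull that number out programmatically, a small convenience sketch: the \(confusionMatrix\) object stores the accuracy in its \(overall\) component, and \(1 - accuracy\) gives an estimate of the out-of-sample error.
tree.cm <- confusionMatrix(validation$classe, tree.predict)
unname(tree.cm$overall["Accuracy"])     # ~0.8725
unname(1 - tree.cm$overall["Accuracy"]) # estimated out-of-sample error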
library(randomForest)
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
set.seed(1234)
rf <- randomForest(classe ~ ., data = training, ntree = 200) # 200 trees instead of the default 500
rf.predict <- predict(rf, newdata = validation)
confusionMatrix(validation$classe, rf.predict)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2221 0 0 0 0
## B 1 1516 1 0 0
## C 0 2 1338 3 0
## D 0 0 1 1282 1
## E 0 0 0 0 1480
##
## Overall Statistics
##
## Accuracy : 0.9989
## 95% CI : (0.9978, 0.9995)
## No Information Rate : 0.2832
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9995 0.9987 0.9985 0.9977 0.9993
## Specificity 1.0000 0.9997 0.9992 0.9997 1.0000
## Pos Pred Value 1.0000 0.9987 0.9963 0.9984 1.0000
## Neg Pred Value 0.9998 0.9997 0.9997 0.9995 0.9998
## Prevalence 0.2832 0.1935 0.1708 0.1638 0.1888
## Detection Rate 0.2831 0.1932 0.1705 0.1634 0.1886
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9998 0.9992 0.9989 0.9987 0.9997
plot(rf, main = "Error rate versus number of trees of Random Forest")
Random forest performs much better than the decision tree: the accuracy reaches \(0.9989\). From the plot we can see that after fitting only about 30 trees, the random forest already attains a very small error rate (for classification, the plot shows error rates rather than MSE). The default number of trees (\(ntree\)) is 500; however, a much smaller number is enough, which reduces the computing time.
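Optionally, we can inspect which predictors the forest relies on most; the sketch below uses \(importance\) from the randomForest package (MeanDecreaseGini is the default measure for classification).
imp <- importance(rf)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE])
# varImpPlot(rf) draws the same information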
library(gbm)
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
##
## The following object is masked from 'package:caret':
##
## cluster
##
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
set.seed(12345)
boost <- gbm(classe ~., data = training, distribution = "multinomial", n.trees = 100,
shrinkage = 0.2, interaction.depth = 3,verbose = FALSE)
# note that boost.predict is a 3-dimensional array (observations x classes x n.trees)
boost.predict <- predict(boost, newdata = validation, n.trees = 100, type = "response")
boost.predict[1:6, ,]
## A B C D E
## [1,] 0.9997076 0.0001108933 7.413240e-05 2.352405e-05 8.382577e-05
## [2,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
## [3,] 0.9997006 0.0001108925 8.114757e-05 2.352388e-05 8.382519e-05
## [4,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
## [5,] 0.9997023 0.0001108927 7.946069e-05 2.352392e-05 8.382533e-05
## [6,] 0.9996852 0.0001108908 9.654712e-05 2.352352e-05 8.382389e-05
boost.predIndex <- apply(boost.predict, MARGIN = 1, which.max) # index of the maximal probability in each row
boost.predClass <- colnames(boost.predict)[boost.predIndex]    # convert the index into a class label A-E
boost.predClass <- factor(boost.predClass, levels = levels(validation$classe)) # same levels as the reference
confusionMatrix(validation$classe, boost.predClass)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2218 3 0 0 0
## B 2 1509 6 1 0
## C 0 1 1335 7 0
## D 0 0 1 1283 0
## E 0 0 0 7 1473
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9948, 0.9976)
## No Information Rate : 0.2829
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9991 0.9974 0.9948 0.9884 1.0000
## Specificity 0.9995 0.9986 0.9988 0.9998 0.9989
## Pos Pred Value 0.9986 0.9941 0.9940 0.9992 0.9953
## Neg Pred Value 0.9996 0.9994 0.9989 0.9977 1.0000
## Prevalence 0.2829 0.1928 0.1710 0.1654 0.1877
## Detection Rate 0.2827 0.1923 0.1702 0.1635 0.1877
## Detection Prevalence 0.2831 0.1935 0.1712 0.1637 0.1886
## Balanced Accuracy 0.9993 0.9980 0.9968 0.9941 0.9995
Although the overall accuracy here is \(0.9964\), slightly below the random forest, boosting can perform a little better than the random forest if we re-fit the model with different parameters. When fitting boosting, you need to tune the parameters carefully, including \(n.trees\), \(shrinkage\), \(interaction.depth\), etc. The running time increases considerably if you increase \(n.trees\) or \(interaction.depth\).
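One way to pick \(n.trees\) is gbm's built-in cross-validation; the sketch below is illustrative only (the \(cv.folds\) value and parameter settings are assumptions, not tuned results) and is noticeably slower than the single fit above.
set.seed(12345)
boost.cv <- gbm(classe ~ ., data = training, distribution = "multinomial",
                n.trees = 150, shrinkage = 0.2, interaction.depth = 3,
                cv.folds = 3, verbose = FALSE)
best.iter <- gbm.perf(boost.cv, method = "cv") # cross-validated estimate of the best number of trees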
levels(test$cvtd_timestamp) <- levels(train$cvtd_timestamp) # align factor levels with the training data so predict() accepts test
levels(test$new_window) <- levels(train$new_window)
test.predicted <- predict(rf, newdata = test, type = "class")
test.predicted
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Considering both accuracy and running time, we choose \(RandomForest\) as the final model. Please note that since the test data set is very small, we need to manually assign the levels of certain factor columns in the \(train\) data set to the corresponding columns in the \(test\) data set so that the test data set can be predicted. Another approach is to combine the training and testing data sets at the very beginning and then split them again, which guarantees they share the same levels, as sketched below.
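A sketch of that combine-then-split alternative (the names \(common\), \(combined\), \(train2\), and \(test2\) are hypothetical; this would run right after reading the CSVs, before any other processing).
common <- intersect(names(train), names(test))     # columns shared by both sets
combined <- rbind(train[, common], test[, common]) # rbind unifies the factor levels
train2 <- combined[seq_len(nrow(train)), ]
test2  <- combined[nrow(train) + seq_len(nrow(test)), ]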