Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The outcome variable is classe, a factor with 5 levels. For this data set, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in 5 different fashions:
- exactly according to the specification (Class A)
- throwing the elbows to the front (Class B)
- lifting the dumbbell only halfway (Class C)
- lowering the dumbbell only halfway (Class D)
- throwing the hips to the front (Class E)
Two models will be tested using decision tree and random forest algorithms. The model with the higher accuracy will be chosen as the final model.
I first download the training and testing data sets from the given URLs and then clean the data for further analysis.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trnLnk <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
tstLnk <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training_data <- read.csv(url(trnLnk),na.strings=c("NA","#DIV/0!",""),header=T)
testing_data <- read.csv(url(tstLnk),na.strings=c("NA","#DIV/0!",""),header=T)
#Check dimensions of train dataset
dim(training_data)
## [1] 19622 160
dim(testing_data)
## [1] 20 160
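As a quick sanity check on the outcome described above, the class distribution can be inspected (a minimal sketch, output omitted):
#Check distribution of the outcome variable classe
table(training_data$classe)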
#Create list of unwanted fields (ID, user name, timestamps, window)
trnRemCols <- grepl("^X|timestamp|window|user_name",names(training_data))
tstRemCols <- grepl("^X|timestamp|window|user_name",names(testing_data))
#Remove unwanted fields
trnRmUnwtdCols <- training_data[,!trnRemCols]
tstRmUnwtdCols <- testing_data[,!tstRemCols]
#Create list of near zero variance fields
NearZeroVar <- nearZeroVar(trnRmUnwtdCols,saveMetrics=T)
#Remove near zero variance fields
trnRmZVCols <- trnRmUnwtdCols[,!NearZeroVar$nzv]
tstRmZVCols <- tstRmUnwtdCols[,!NearZeroVar$nzv]
#Remove fields with NAs
trnNArmCondn <- (colSums(is.na(trnRmZVCols))!=0)
tstNArmCondn <- (colSums(is.na(tstRmZVCols))!=0)
trnRmNACols <- trnRmZVCols[,!trnNArmCondn]
tstRmNACols <- tstRmZVCols[,!tstNArmCondn]
#New Training and Testing Datasets after clean-up
trnDataNew <- trnRmNACols
tstDataNew <- tstRmNACols
dim(trnDataNew); dim(tstDataNew)
## [1] 19622 53
## [1] 20 53
The raw training dataset has 19622 observations and 160 variables; after removing identifier, timestamp, and window fields, near-zero-variance fields, and fields containing NAs, 53 variables remain. The testing set contains 20 observations with the same predictors as the training set.
To estimate the out-of-sample error, I split the training set into a training subset (70%) for model building and a validation subset (30%) on which the out-of-sample error is computed.
set.seed(1234)
#Train-model Validation Partition
intrain <- createDataPartition(y=trnDataNew$classe,p=0.7,list=F)
modTRNSample <- trnDataNew[intrain,] #To be used for model training
modTSTSample <- trnDataNew[-intrain,] #To be used for testing accuracy of models
dim(modTRNSample); dim(modTSTSample)
## [1] 13737 53
## [1] 5885 53
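Since createDataPartition samples within each level of classe, the 70/30 split should preserve the class proportions; a quick check (a minimal sketch, output omitted):
#Verify class proportions are preserved by the stratified split
round(prop.table(table(modTRNSample$classe)),3)
round(prop.table(table(modTSTSample$classe)),3)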
library(caret)
control <- trainControl(method = "cv", number = 5)
fit_rpart <- train(classe ~ ., data = modTRNSample, method = "rpart", trControl = control)
## Loading required package: rpart
print(fit_rpart, digits = 4)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10988, 10991, 10990, 10989
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03550 0.5214 0.38010
## 0.06093 0.4175 0.21094
## 0.11738 0.3333 0.07467
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.0355.
library(rattle)
## R session is headless; GTK+ not initialized.
## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
fancyRpartPlot(fit_rpart$finalModel)
library(caret)
predict_rpart <- predict(fit_rpart, modTSTSample)
conf_rpart <- confusionMatrix(modTSTSample$classe, predict_rpart)
conf_rpart
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1530 35 105 0 4
## B 486 379 274 0 0
## C 493 31 502 0 0
## D 452 164 348 0 0
## E 168 145 302 0 467
##
## Overall Statistics
##
## Accuracy : 0.489
## 95% CI : (0.4762, 0.5019)
## No Information Rate : 0.5317
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.3311
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4890 0.5027 0.3279 NA 0.99151
## Specificity 0.9478 0.8519 0.8797 0.8362 0.88641
## Pos Pred Value 0.9140 0.3327 0.4893 NA 0.43161
## Neg Pred Value 0.6203 0.9210 0.7882 NA 0.99917
## Prevalence 0.5317 0.1281 0.2602 0.0000 0.08003
## Detection Rate 0.2600 0.0644 0.0853 0.0000 0.07935
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.18386
## Balanced Accuracy 0.7184 0.6773 0.6038 NA 0.93896
accuracy_rpart <- conf_rpart$overall[1]
accuracy_rpart
## Accuracy
## 0.4890399
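The estimated out-of-sample error follows directly from the stored accuracy (a minimal sketch, output omitted):
#Out-of-sample error estimate for the classification tree
oose_rpart <- 1 - accuracy_rpart
oose_rpart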
From the confusion matrix, the accuracy rate is about 0.49, so the estimated out-of-sample error rate is about 0.51. The classification tree does not predict the outcome classe very well. Next, I fit a decision tree directly with rpart and then a random forest.
library(rpart)
model1 <- rpart(classe~.,method="class",data=modTRNSample)
prediction1<-predict(model1,modTSTSample,type="class")
library(rpart.plot)
rpart.plot(model1,main="classification tree",extra=102,under=T,faclen=0)
confusionMatrix(prediction1,modTSTSample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1364 169 24 48 16
## B 60 581 46 79 74
## C 52 137 765 129 145
## D 183 194 125 650 159
## E 15 58 66 58 688
##
## Overall Statistics
##
## Accuracy : 0.6879
## 95% CI : (0.6758, 0.6997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6066
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8148 0.51010 0.7456 0.6743 0.6359
## Specificity 0.9390 0.94543 0.9047 0.8657 0.9590
## Pos Pred Value 0.8415 0.69167 0.6230 0.4958 0.7774
## Neg Pred Value 0.9273 0.88940 0.9440 0.9314 0.9212
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2318 0.09873 0.1300 0.1105 0.1169
## Detection Prevalence 0.2754 0.14274 0.2087 0.2228 0.1504
## Balanced Accuracy 0.8769 0.72776 0.8252 0.7700 0.7974
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
#Fit a random forest on the training subset
model2 <- randomForest(classe~.,data=modTRNSample)
prediction2<-predict(model2,modTSTSample,type="class")
confusionMatrix(prediction2,modTSTSample$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 7 0 0 0
## B 0 1131 6 0 0
## C 0 1 1020 5 0
## D 0 0 0 958 1
## E 0 0 0 1 1081
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9946, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9930 0.9942 0.9938 0.9991
## Specificity 0.9983 0.9987 0.9988 0.9998 0.9998
## Pos Pred Value 0.9958 0.9947 0.9942 0.9990 0.9991
## Neg Pred Value 1.0000 0.9983 0.9988 0.9988 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1922 0.1733 0.1628 0.1837
## Detection Prevalence 0.2856 0.1932 0.1743 0.1630 0.1839
## Balanced Accuracy 0.9992 0.9959 0.9965 0.9968 0.9994
Random forest, though a little more complex, was far more accurate: 0.9964 on the validation set, for an estimated out-of-sample error rate of about 0.4%. Hence, the random forest model was chosen as the final prediction algorithm.
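For a side-by-side view, the validation-set accuracies and error rates of all three models can be tabulated (a minimal sketch; the confusion matrices for model1 and model2 are recomputed here because only conf_rpart was stored above):
#Compare validation-set accuracy across the three models
acc_tree <- confusionMatrix(prediction1,modTSTSample$classe)$overall["Accuracy"]
acc_rf <- confusionMatrix(prediction2,modTSTSample$classe)$overall["Accuracy"]
data.frame(model=c("caret rpart","rpart","random forest"),
           accuracy=c(accuracy_rpart,acc_tree,acc_rf),
           oos_error=1-c(accuracy_rpart,acc_tree,acc_rf))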
#Predict on the cleaned testing set using the random forest model
predictfinal <- predict(model2, tstDataNew, type="class")
predictfinal
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
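If the predicted labels need to be saved, for example one file per test case for the course submission, a helper along these lines can be used (a hypothetical sketch: the problem_id_i.txt naming convention is an assumption, not part of the original analysis):
#Hypothetical helper: write one prediction per file (assumed submission format)
pml_write_files <- function(x){
  for(i in seq_along(x)){
    write.table(x[i],file=paste0("problem_id_",i,".txt"),
                quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
pml_write_files(as.character(predictfinal))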