The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har

Importing data from the given URLs

library('caret')
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
library('rpart')
## Warning: package 'rpart' was built under R version 3.3.3
library('randomForest')
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
trngurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

trng <- read.csv(trngurl, na.strings =c("NA","#DIV/0!", ""))
test <- read.csv(testurl, na.strings =c("NA","#DIV/0!", ""))

Data cleaning and partitioning

# removing columns that contain NAs, since the dataset is very high-dimensional
trngnonzero <- trng[,colSums(is.na(trng)) == 0]
testnonzero <- test[,colSums(is.na(test)) == 0]

# removing irrelevant columns such as timestamps, dates, serial numbers, etc.
trngrelcols <- trngnonzero[,-c(1:7)]
testrelcols <- testnonzero[,-c(1:7)]

# data partitioning - the test set created here comes from the training set and serves as a validation set
sample <- createDataPartition(y= trngrelcols$classe, p = 0.70, list = F)
trngset <- trngrelcols[sample,]
testset <- trngrelcols[-sample,]
dim(trngset)
## [1] 13737    53

Training and predictions

Recursive Partitioning

# recursive partition
model1 <- rpart(classe ~ ., data=trngset, method="class")
prediction1 <- predict(model1, testset, type = "class")
confusionMatrix(testset$classe, prediction1)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1508   48   43   22   53
##          B  237  591  104   75  132
##          C   25   55  818   67   61
##          D   88   29  141  638   68
##          E   37   58  136   58  793
## 
## Overall Statistics
##                                         
##                Accuracy : 0.7388        
##                  95% CI : (0.7274, 0.75)
##     No Information Rate : 0.322         
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.6683        
##  Mcnemar's Test P-Value : < 2.2e-16     
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7958   0.7567   0.6586   0.7419   0.7164
## Specificity            0.9584   0.8926   0.9552   0.9351   0.9395
## Pos Pred Value         0.9008   0.5189   0.7973   0.6618   0.7329
## Neg Pred Value         0.9081   0.9600   0.9127   0.9549   0.9346
## Prevalence             0.3220   0.1327   0.2110   0.1461   0.1881
## Detection Rate         0.2562   0.1004   0.1390   0.1084   0.1347
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.8771   0.8247   0.8069   0.8385   0.8279

Here we can see that our rpart model has achieved an accuracy of about 74%. Now let us try to improve on it using other models and techniques.
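
One common way to try to improve a single tree is to tune it with cross-validation. The sketch below is illustrative only and was not run as part of this analysis; it assumes caret's train() interface, and the fold count and tuneLength are arbitrary choices.

# Illustrative only (not run): tune the rpart complexity parameter with
# 5-fold cross-validation via caret's train(); fold count and tuneLength
# are arbitrary choices
ctrl <- trainControl(method = "cv", number = 5)
cvrpart <- train(classe ~ ., data = trngset, method = "rpart",
                 trControl = ctrl, tuneLength = 10)
print(cvrpart)   # reports the cp value selected by cross-validation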

Random Forest

# random forest
model2 <- randomForest(classe ~ ., data=trngset)
prediction2 <- predict(model2, testset, type = "class")
confusionMatrix(testset$classe, prediction2)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    3    0    0    0
##          B    3 1134    2    0    0
##          C    0    5 1021    0    0
##          D    0    0    8  956    0
##          E    0    0    1    2 1079
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9959          
##                  95% CI : (0.9939, 0.9974)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9948          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9930   0.9893   0.9979   1.0000
## Specificity            0.9993   0.9989   0.9990   0.9984   0.9994
## Pos Pred Value         0.9982   0.9956   0.9951   0.9917   0.9972
## Neg Pred Value         0.9993   0.9983   0.9977   0.9996   1.0000
## Prevalence             0.2845   0.1941   0.1754   0.1628   0.1833
## Detection Rate         0.2839   0.1927   0.1735   0.1624   0.1833
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9987   0.9960   0.9942   0.9981   0.9997

Here we can see that our random forest has done very well, with an accuracy of 99.59%. Since our dataset has a large number of columns, we can try reducing them by keeping only the relevant ones.
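
One way to judge which columns are relevant is to look at the random forest's own variable importance. The snippet below is a sketch (not part of the original run) using importance() and varImpPlot() from the randomForest package on model2.

# Sketch (not run above): rank predictors by the forest's mean decrease in Gini
# to get a sense of which columns contribute most
imp <- importance(model2)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)
varImpPlot(model2, n.var = 15)   # plot the 15 most important predictors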

Random Forest with reduced dimensionality

# reducing the dimensionality of the dataset by removing columns that contribute less
#saving our target variables for later use
ycol <- trngset$classe
ycoltest <- testset$classe
nzv <- nearZeroVar(trngset, saveMetrics = TRUE)
#print(paste('Range:',range(nzv$percentUnique)))
head(nzv)
##                  freqRatio percentUnique zeroVar   nzv
## roll_belt         1.045383     8.1677222   FALSE FALSE
## pitch_belt        1.059701    12.1642280   FALSE FALSE
## yaw_belt          1.095775    13.1542549   FALSE FALSE
## total_accel_belt  1.068142     0.1965495   FALSE FALSE
## gyros_belt_x      1.011399     0.9463493   FALSE FALSE
## gyros_belt_y      1.143104     0.4586154   FALSE FALSE
#sort by percentUnique
sort(nzv$percentUnique, decreasing = TRUE)
##  [1] 86.78022858 86.19058018 84.54538837 20.36106865 19.09441654
##  [6] 19.07985732 17.58753731 13.79486060 13.28528791 13.15425493
## [11] 12.84123171 12.16422800 11.60369804 10.53359540  9.56540729
## [16]  9.11407149  8.16772221  7.79646211  7.14857684  6.22406639
## [21]  5.97655966  5.72177331  5.60529956  5.55434229  5.25587828
## [26]  4.79726287  4.54247652  4.06202228  3.81451554  3.31222246
## [31]  3.14479144  2.99919924  2.89728471  2.64249836  2.22756060
## [36]  2.10380724  2.09652763  2.08196841  2.08196841  1.94365582
## [41]  1.69614909  1.65975104  1.39040547  1.18657640  1.14289874
## [46]  0.99002693  0.94634928  0.50229308  0.47317464  0.45861542
## [51]  0.31302322  0.19654946  0.03639805

Selecting columns with a percent of unique values greater than 5: since the number of columns is high, it is possible that not all of them are important and relevant. The "percent of unique values" is the number of unique values divided by the total number of samples (times 100); it approaches zero as the granularity of the data increases. A quick hand check of this definition is sketched below.
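
As that check (illustrative only; roll_belt is just an example column), the value can be recomputed by hand and compared against the roll_belt row of the nzv table above.

# recompute percentUnique for one column by hand: unique values / rows * 100;
# this should match (up to rounding) the roll_belt value reported by nearZeroVar
length(unique(trngset$roll_belt)) / nrow(trngset) * 100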

#selecting column having unique percent greater than 5
pcacols <- trngset[c(rownames(nzv[nzv$percentUnique > 5,]))]
pcacolstest <- testset[c(rownames(nzv[nzv$percentUnique > 5,]))]

#adding the target variable back to the dataset
pcacols$classe <- ycol
pcacolstest$classe <- ycoltest
dim(pcacols)
## [1] 13737    26
dim(pcacolstest)
## [1] 5885   26

Predicting after reducing the dimensionality of the dataset

pcamodel <- randomForest(classe ~ ., data=pcacols)
prediction3 <- predict(pcamodel, pcacolstest, type = "class")
confusionMatrix(pcacolstest$classe, prediction3)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    3    0    0    2
##          B    7 1130    2    0    0
##          C    1    7 1018    0    0
##          D    0    0    5  959    0
##          E    0    2    2    1 1077
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9946          
##                  95% CI : (0.9923, 0.9963)
##     No Information Rate : 0.285           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9931          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9952   0.9895   0.9912   0.9990   0.9981
## Specificity            0.9988   0.9981   0.9984   0.9990   0.9990
## Pos Pred Value         0.9970   0.9921   0.9922   0.9948   0.9954
## Neg Pred Value         0.9981   0.9975   0.9981   0.9998   0.9996
## Prevalence             0.2850   0.1941   0.1745   0.1631   0.1833
## Detection Rate         0.2836   0.1920   0.1730   0.1630   0.1830
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9970   0.9938   0.9948   0.9990   0.9986

We can see that the accuracy is essentially unchanged (99.46% versus 99.59%) even though we used far fewer columns (26 compared with the 53 used by the random forest model2). There is scope for reducing the columns further and checking how the accuracy varies; a sketch of that follow-up is given below.
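
The loop below sketches that follow-up: it refits the reduced-column random forest for a few percentUnique cut-offs and compares accuracy on the validation set. It is illustrative only and was not run here; the thresholds are arbitrary choices.

# Illustrative only (not run): refit the reduced-column random forest for
# several percentUnique cut-offs and compare accuracy on the validation set
for (thr in c(2, 5, 10, 15)) {
  cols <- rownames(nzv[nzv$percentUnique > thr, ])
  trainsub <- trngset[, cols]
  trainsub$classe <- ycol
  testsub <- testset[, cols]
  testsub$classe <- ycoltest
  fit <- randomForest(classe ~ ., data = trainsub)
  acc <- mean(predict(fit, testsub) == testsub$classe)
  cat("threshold", thr, "-", length(cols), "columns, accuracy", round(acc, 4), "\n")
}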

Final prediction on the test data downloaded from the URL provided. I have considered only the columns selected with the percent-unique threshold; this threshold can be tuned further.

testds <- testrelcols[c(rownames(nzv[nzv$percentUnique > 5,]))]
finalpred <- predict(pcamodel, testds, type = "class")
finalpred
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

These are the predictions of classe for the 20 test cases.
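
If the predictions need to be submitted, one option (not part of the original analysis; the file naming is my own choice) is to write each prediction to its own text file.

# optional sketch: write one text file per prediction for submission
for (i in seq_along(finalpred)) {
  write.table(finalpred[i], file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}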