The data for this project come from http://groupware.les.inf.puc-rio.br/har.
Importing data from the given URLs
library('caret')
## Warning: package 'caret' was built under R version 3.3.2
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.3.3
library('rpart')
## Warning: package 'rpart' was built under R version 3.3.3
library('randomForest')
## Warning: package 'randomForest' was built under R version 3.3.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
trngurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trng <- read.csv(trngurl, na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv(testurl, na.strings = c("NA", "#DIV/0!", ""))
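Reading straight from the URLs re-downloads the files on every run; a minimal local-caching sketch (the file names are illustrative, not part of the original analysis):
# optional: download once, then read from disk on later runs
if (!file.exists("pml-training.csv")) download.file(trngurl, "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testurl, "pml-testing.csv")
# read.csv() can then point at the local copies instead of the URLs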
Data cleaning and partition
# removing columns that contain NAs, since the dataset is very high-dimensional
# (note: the two files are filtered independently; here both retain the same predictor columns)
trngnonzero <- trng[, colSums(is.na(trng)) == 0]
testnonzero <- test[, colSums(is.na(test)) == 0]
# removing non-relevant columns such as timestamps, dates, serial numbers, etc.
trngrelcols <- trngnonzero[, -c(1:7)]
testrelcols <- testnonzero[, -c(1:7)]
# data partitioning: the "test" set created here is carved out of the training data and serves as a validation set
# (a set.seed() call before this would make the partition reproducible)
sample <- createDataPartition(y = trngrelcols$classe, p = 0.70, list = FALSE)
trngset <- trngrelcols[sample, ]
testset <- trngrelcols[-sample, ]
dim(trngset)
## [1] 13737 53
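Since createDataPartition() samples within each level of classe, the class proportions should be nearly identical in the two subsets; a quick check using the objects created above:
# compare class proportions across the partition (rows should match closely)
round(rbind(training = prop.table(table(trngset$classe)),
            validation = prop.table(table(testset$classe))), 3)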
Training and predictions
Recursive Partition
# recursive partition
model1 <- rpart(classe ~ ., data = trngset, method = "class")
prediction1 <- predict(model1, testset, type = "class")
# confusionMatrix() expects (data, reference); passing the truth first, as here,
# swaps the "Prediction"/"Reference" labels (overall accuracy is unaffected)
confusionMatrix(testset$classe, prediction1)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1508 48 43 22 53
## B 237 591 104 75 132
## C 25 55 818 67 61
## D 88 29 141 638 68
## E 37 58 136 58 793
##
## Overall Statistics
##
## Accuracy : 0.7388
## 95% CI : (0.7274, 0.75)
## No Information Rate : 0.322
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6683
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7958 0.7567 0.6586 0.7419 0.7164
## Specificity 0.9584 0.8926 0.9552 0.9351 0.9395
## Pos Pred Value 0.9008 0.5189 0.7973 0.6618 0.7329
## Neg Pred Value 0.9081 0.9600 0.9127 0.9549 0.9346
## Prevalence 0.3220 0.1327 0.2110 0.1461 0.1881
## Detection Rate 0.2562 0.1004 0.1390 0.1084 0.1347
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.8771 0.8247 0.8069 0.8385 0.8279
Here, we can see that our rpart model has achieved an accuracy of about 74%. Now let us try to improve on it using other models and techniques.
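One such technique is k-fold cross-validation, which caret's train() can wrap around rpart to tune the complexity parameter; a rough sketch (the fold count and cp grid are illustrative, not from the original analysis):
# 5-fold cross-validation over the rpart complexity parameter cp
ctrl <- trainControl(method = "cv", number = 5)
cvmodel <- train(classe ~ ., data = trngset, method = "rpart",
                 trControl = ctrl, tuneGrid = data.frame(cp = c(0.001, 0.01, 0.1)))
confusionMatrix(predict(cvmodel, testset), testset$classe)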
Random Forest
# random forest; classification mode is inferred from the factor response
# (randomForest() has no method argument, so method = "class" was silently ignored)
model2 <- randomForest(classe ~ ., data = trngset)
prediction2 <- predict(model2, testset, type = "class")
confusionMatrix(testset$classe, prediction2)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 3 0 0 0
## B 3 1134 2 0 0
## C 0 5 1021 0 0
## D 0 0 8 956 0
## E 0 0 1 2 1079
##
## Overall Statistics
##
## Accuracy : 0.9959
## 95% CI : (0.9939, 0.9974)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9948
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9930 0.9893 0.9979 1.0000
## Specificity 0.9993 0.9989 0.9990 0.9984 0.9994
## Pos Pred Value 0.9982 0.9956 0.9951 0.9917 0.9972
## Neg Pred Value 0.9993 0.9983 0.9977 0.9996 1.0000
## Prevalence 0.2845 0.1941 0.1754 0.1628 0.1833
## Detection Rate 0.2839 0.1927 0.1735 0.1624 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9987 0.9960 0.9942 0.9981 0.9997
Here, we can see that our random forest has done very well, with an accuracy of 99.59%. Since the dataset has a large number of columns, we can try reducing it to only the most relevant ones.
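One way to judge which columns matter is the forest's own variable-importance measure; a quick sketch using the model2 fit above (the top-20 cutoff is arbitrary):
# rank predictors by mean decrease in Gini impurity
imp <- importance(model2)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 20)
# varImpPlot(model2) shows the same ranking graphically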
Random Forest with reduced dimensionality
# reducing the dimensionality of the dataset by dropping the less informative columns
# saving our target variables for later use
ycol <- trngset$classe
ycoltest <- testset$classe
nzv <- nearZeroVar(trngset, saveMetrics = TRUE)
head(nzv)
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.045383 8.1677222 FALSE FALSE
## pitch_belt 1.059701 12.1642280 FALSE FALSE
## yaw_belt 1.095775 13.1542549 FALSE FALSE
## total_accel_belt 1.068142 0.1965495 FALSE FALSE
## gyros_belt_x 1.011399 0.9463493 FALSE FALSE
## gyros_belt_y 1.143104 0.4586154 FALSE FALSE
# sort by percentUnique, descending
sort(nzv$percentUnique, decreasing = TRUE)
## [1] 86.78022858 86.19058018 84.54538837 20.36106865 19.09441654
## [6] 19.07985732 17.58753731 13.79486060 13.28528791 13.15425493
## [11] 12.84123171 12.16422800 11.60369804 10.53359540 9.56540729
## [16] 9.11407149 8.16772221 7.79646211 7.14857684 6.22406639
## [21] 5.97655966 5.72177331 5.60529956 5.55434229 5.25587828
## [26] 4.79726287 4.54247652 4.06202228 3.81451554 3.31222246
## [31] 3.14479144 2.99919924 2.89728471 2.64249836 2.22756060
## [36] 2.10380724 2.09652763 2.08196841 2.08196841 1.94365582
## [41] 1.69614909 1.65975104 1.39040547 1.18657640 1.14289874
## [46] 0.99002693 0.94634928 0.50229308 0.47317464 0.45861542
## [51] 0.31302322 0.19654946 0.03639805
Selecting columns with percentUnique greater than 5: since the number of columns is high, it is likely that not all of them are important and relevant. The "percent of unique values" is the number of unique values divided by the total number of samples (times 100); it approaches zero as the granularity of the data increases.
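For a single column the metric is easy to reproduce by hand; a small illustration on roll_belt (any column would do):
# percentUnique for one column: unique values / total samples * 100
x <- trngset$roll_belt
length(unique(x)) / length(x) * 100  # should match nzv["roll_belt", "percentUnique"]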
# selecting columns with percentUnique greater than 5
# (despite the "pca" prefix, no PCA is applied; these are simply filtered columns)
pcacols <- trngset[c(rownames(nzv[nzv$percentUnique > 5, ]))]
pcacolstest <- testset[c(rownames(nzv[nzv$percentUnique > 5, ]))]
# adding the target variable back to the datasets
pcacols$classe <- ycol
pcacolstest$classe <- ycoltest
dim(pcacols)
## [1] 13737 26
dim(pcacolstest)
## [1] 5885 26
Predicting after reducing the dimension of the dataset
# random forest on the reduced column set (method = "class" dropped, as noted above)
pcamodel <- randomForest(classe ~ ., data = pcacols)
prediction3 <- predict(pcamodel, pcacolstest, type = "class")
confusionMatrix(pcacolstest$classe, prediction3)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1669 3 0 0 2
## B 7 1130 2 0 0
## C 1 7 1018 0 0
## D 0 0 5 959 0
## E 0 2 2 1 1077
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9952 0.9895 0.9912 0.9990 0.9981
## Specificity 0.9988 0.9981 0.9984 0.9990 0.9990
## Pos Pred Value 0.9970 0.9921 0.9922 0.9948 0.9954
## Neg Pred Value 0.9981 0.9975 0.9981 0.9998 0.9996
## Prevalence 0.2850 0.1941 0.1745 0.1631 0.1833
## Detection Rate 0.2836 0.1920 0.1730 0.1630 0.1830
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9970 0.9938 0.9948 0.9990 0.9986
We can see that the accuracy is essentially unchanged (99.46% here versus 99.59% for model2) even though we used far fewer columns (26 compared to the 53 used in the full random forest). There is scope for reducing the columns further and checking how the accuracy varies.
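A simple way to explore this is to sweep the percentUnique threshold and record the validation accuracy at each cutoff; a sketch built on the objects defined above (the threshold values are illustrative):
# refit at several cutoffs and compare column count against accuracy
for (th in c(1, 5, 10, 15)) {
  cols <- rownames(nzv[nzv$percentUnique > th, ])
  fit <- randomForest(x = trngset[, cols], y = ycol)
  acc <- mean(predict(fit, testset[, cols]) == ycoltest)
  cat(sprintf("threshold %2d: %2d columns, accuracy %.4f\n", th, length(cols), acc))
}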
Final prediction on the test data downloaded from the URL provided. Only the columns that passed the percentUnique threshold are used; this threshold can be experimented with further.
testds <- testrelcols[c(rownames(nzv[nzv$percentUnique > 5, ]))]
finalpred <- predict(pcamodel, testds, type = "class")
finalpred
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
These are the predicted classe values for the 20 test cases.
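If the 20 predictions need to be submitted or archived, writing one file per answer is a common pattern; a minimal sketch (the file-name pattern is illustrative, not from the original analysis):
# write each prediction to its own text file, e.g. problem_id_1.txt
for (i in seq_along(finalpred)) {
  write.table(as.character(finalpred[i]), file = paste0("problem_id_", i, ".txt"),
              quote = FALSE, row.names = FALSE, col.names = FALSE)
}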