The purpose of this project is to predict how well users perform a particular activity, using data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Activity quality is recorded in the ‘classe’ variable, which has 5 levels from A to E; in the source dataset, A denotes correct execution and B through E denote common mistakes.
The first step is to download the training and test data from the provided links. It is important to define the strings that should be converted to NA values, such as the “#DIV/0!” string that Excel writes for a division by zero.
url.training="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url.test="https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.training,destfile="training.csv", method = "curl")
download.file(url.test, destfile = "test.csv", method="curl")
training=read.csv("training.csv", na.strings = c("#DIV/0!", "", "NA"))
dim(training)
## [1] 19622 160
test=read.csv("test.csv", na.strings = c("#DIV/0!", "", "NA"))
dim(test)
## [1] 20 160
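To illustrate what the na.strings argument does, here is a minimal sketch on a made-up in-memory CSV. The column names and values are invented for the example and are not taken from the real files.

```r
# Made-up in-memory CSV containing the three missing-value markers
csv.text <- "id,roll,kurtosis
1,1.5,#DIV/0!
2,NA,0.3
3,2.5,
"
toy <- read.csv(text = csv.text, na.strings = c("#DIV/0!", "", "NA"))

# Number of NA values per column after import
na.counts <- colSums(is.na(toy))
print(na.counts)

# Column classes: kurtosis stays numeric because "#DIV/0!" was read as NA
print(sapply(toy, class))
```

Without "#DIV/0!" in na.strings, a column containing that marker would be read as text rather than as numbers.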
Once the data has been imported into R, the first column is removed from both datasets, since it only contains the row number.
training[,1]=NULL
test[,1]=NULL
To guard against overfitting when training the models, a simple hold-out form of cross validation is used. The training dataset is split into two parts. The first part (~60%, matching p=0.6 in the code below) is used to train the models, and the second part (~40%) is used to evaluate their performance. This second dataset is called ‘validation’.
library (caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.4
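# Stratified split on classe; the index is named in.train to avoid masking base::sample()
in.train=createDataPartition(y=training$classe, p=0.6, list=FALSE)
validation=training[-in.train,]
training=training[in.train,]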
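The idea behind a stratified hold-out split can be sketched in base R. This is only an illustration of sampling 60% of the rows within each class, not a reproduction of caret's internals; the class counts and the seed are made up.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Made-up outcome vector with unbalanced classes
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 30, 30, 20, 20)))

# Sample 60% of the row indices separately within each class
idx.by.class <- split(seq_along(classe), classe)
train.idx <- unlist(lapply(idx.by.class, function(i) {
  i[sample.int(length(i), size = round(0.6 * length(i)))]
}))
train.idx <- sort(unname(train.idx))
valid.idx <- setdiff(seq_along(classe), train.idx)

# Class proportions are the same in both parts
print(prop.table(table(classe[train.idx])))
print(prop.table(table(classe[valid.idx])))
```

Because the sampling happens per class, neither part ends up with a rare class over- or under-represented.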
The first step in predicting the quality of the movement is to identify the variables that are not relevant to the model.
The first cleanup step is to identify the variables whose variance is close to zero, using the following code. Once these variables have been identified, they are removed from all the datasets.
nearzero=nearZeroVar(training, saveMetrics =F)
training=training[,-nearzero]
validation=validation[,-nearzero]
test=test[,-nearzero]
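As a rough sketch of what a near-zero-variance check looks for, the base-R helper below flags a column when its most common value dominates the second most common one and the column has few distinct values. The helper name is.near.zero and the toy data are hypothetical; the two thresholds mirror the documented defaults of caret's nearZeroVar (freqCut = 95/5, uniqueCut = 10), but the code is not caret's implementation.

```r
# Hypothetical helper, not part of caret
is.near.zero <- function(x, freq.cut = 95 / 5, unique.cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)            # constant column
  freq.ratio <- counts[[1]] / counts[[2]]         # most common vs. second most common
  pct.unique <- 100 * length(counts) / length(x)  # share of distinct values
  freq.ratio > freq.cut && pct.unique < unique.cut
}

# Made-up data: one dominated column, one informative column, one constant column
toy <- data.frame(
  new_window = c(rep("no", 98), rep("yes", 2)),
  roll_belt  = seq(1, 100) / 10,
  constant   = rep(0, 100)
)
flags <- sapply(toy, is.near.zero)
print(flags)

# Guard against an empty index: toy[, -integer(0)] would drop every column
drop.cols <- which(flags)
if (length(drop.cols) > 0) toy <- toy[, -drop.cols, drop = FALSE]
print(names(toy))
```

The guard in the last step matters whenever no column is flagged, because negative indexing with an empty vector selects nothing.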
The second step is to remove the variables that have a large proportion of NA values. Variables with more than 30% NA values are treated as irrelevant to the model and removed from the datasets.
# Fraction of NA values in each column of the training set
na.frac=colMeans(is.na(training))
# Positions of the columns with more than 30% NA values
na.var.col=which(na.frac>0.3)
training=training[,-na.var.col]
validation=validation[,-na.var.col]
test=test[,-na.var.col]
# Drop the last remaining column of test (problem_id), which is not a predictor
test[,58]=NULL
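Dropping columns by position relies on all datasets sharing the same column order. A possible alternative, sketched here on made-up data frames, is to select the surviving columns by name; the object names train, new, and keep are illustrative only.

```r
# Made-up data: column b is mostly NA, and the new data carries an id instead of classe
train <- data.frame(a = c(1, 2, 3, 4),
                    b = c(NA, NA, 1, NA),
                    classe = factor(c("A", "B", "A", "B")))
new <- data.frame(a = c(5, 6), b = c(NA, 2), problem_id = 1:2)

# Keep the columns whose NA fraction in the training data is at most 30%
na.frac <- colMeans(is.na(train))
keep <- names(na.frac)[na.frac <= 0.3]

train <- train[, keep, drop = FALSE]
# The new data keeps only the retained columns it actually has
new <- new[, intersect(keep, names(new)), drop = FALSE]

print(names(train))
print(names(new))
```

With this approach the identifier column of the new data falls away on its own, since it is not among the retained training columns.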
Once the relevant variables have been selected, different machine learning algorithms (decision trees, random forests, and boosting) are fitted to the training dataset.
The first and simplest approach is the decision tree algorithm.
library(rpart)
tree.model=rpart(classe ~ ., data=training, method="class")
tree.pred=predict(tree.model,newdata = validation, type="class")
tree.matrix=confusionMatrix(tree.pred,validation$classe)
tree.matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2150 68 10 2 0
## B 78 1383 156 35 0
## C 4 54 1177 111 62
## D 0 13 13 941 85
## E 0 0 12 197 1295
##
## Overall Statistics
##
## Accuracy : 0.8853
## 95% CI : (0.878, 0.8923)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.8548
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9633 0.9111 0.8604 0.7317 0.8981
## Specificity 0.9857 0.9575 0.9643 0.9831 0.9674
## Pos Pred Value 0.9641 0.8372 0.8359 0.8945 0.8610
## Neg Pred Value 0.9854 0.9782 0.9703 0.9492 0.9768
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2740 0.1763 0.1500 0.1199 0.1651
## Detection Prevalence 0.2842 0.2106 0.1795 0.1341 0.1917
## Balanced Accuracy 0.9745 0.9343 0.9124 0.8574 0.9327
plot(tree.matrix$table, main="Decision Tree Confusion Matrix")
As we can see, the accuracy of this model is good (88.53%), but there is room for improvement. The expected out-of-sample error is 11.47%. For this reason, more complex models are fitted next.
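The headline figures can be recomputed directly from the decision tree confusion matrix printed above. The sketch below re-enters that table by hand (rows are predictions, columns are reference classes) and derives the overall accuracy and the per-class sensitivity; the object name tree.table is chosen for the example.

```r
# Decision tree confusion matrix copied from the output above
tree.table <- matrix(c(2150,   68,   10,   2,    0,
                         78, 1383,  156,  35,    0,
                          4,   54, 1177, 111,   62,
                          0,   13,   13, 941,   85,
                          0,    0,   12, 197, 1295),
                     nrow = 5, byrow = TRUE,
                     dimnames = list(Prediction = LETTERS[1:5],
                                     Reference  = LETTERS[1:5]))

# Accuracy: correctly classified cases (diagonal) over all cases
accuracy <- sum(diag(tree.table)) / sum(tree.table)

# Sensitivity per class: correct predictions over the true count of that class
sensitivity <- diag(tree.table) / colSums(tree.table)

print(round(accuracy, 4))
print(round(sensitivity, 4))
```

The out-of-sample error estimate is simply one minus this accuracy, and the low sensitivity for class D shows where the tree struggles most.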
The second model is a random forest using the 57 predictors.
random.model=train(classe~., data=training, method="rf", prox=T)
## Loading required package: randomForest
## randomForest 4.6-10
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
random.pred=predict(random.model, newdata=validation)
random.matrix=confusionMatrix(random.pred,validation$classe)
random.matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 1 0 0 0
## B 0 1517 1 0 0
## C 0 0 1365 5 0
## D 0 0 2 1281 1
## E 0 0 0 0 1441
##
## Overall Statistics
##
## Accuracy : 0.9987
## 95% CI : (0.9977, 0.9994)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9984
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9993 0.9978 0.9961 0.9993
## Specificity 0.9998 0.9998 0.9992 0.9995 1.0000
## Pos Pred Value 0.9996 0.9993 0.9964 0.9977 1.0000
## Neg Pred Value 1.0000 0.9998 0.9995 0.9992 0.9998
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1933 0.1740 0.1633 0.1837
## Detection Prevalence 0.2846 0.1935 0.1746 0.1637 0.1837
## Balanced Accuracy 0.9999 0.9996 0.9985 0.9978 0.9997
plot(random.matrix$table, main="Random Forest Confusion Matrix")
As the confusion matrix shows, the accuracy of this model is very high (99.87%). The expected out-of-sample error is 0.13%.
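The confidence interval reported next to the accuracy can be approximated with an exact binomial test on the counts from the random forest confusion matrix above (7836 correct out of 7846 validation cases). This is a sketch using base R's binom.test; it assumes, rather than shows, that the reported interval was computed in a comparable way.

```r
# Counts read off the random forest confusion matrix above
correct <- 2232 + 1517 + 1365 + 1281 + 1441
total   <- 7846

accuracy <- correct / total

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- as.numeric(binom.test(correct, total)$conf.int)

print(round(accuracy, 4))
print(round(ci, 4))
```

The narrow interval reflects the size of the validation set: with almost eight thousand cases, ten misclassifications pin the accuracy down quite tightly.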
Finally, a third model is built using GBM boosting.
boosting.model=train(classe~., data=training, method="gbm", verbose=F)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
## Warning: package 'plyr' was built under R version 3.2.5
boosting.pred=predict(boosting.model, newdata=validation)
boosting.matrix=confusionMatrix(boosting.pred,validation$classe)
boosting.matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 2232 1 0 0 0
## B 0 1510 1 0 0
## C 0 3 1358 5 0
## D 0 4 9 1278 3
## E 0 0 0 3 1439
##
## Overall Statistics
##
## Accuracy : 0.9963
## 95% CI : (0.9947, 0.9975)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9953
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9947 0.9927 0.9938 0.9979
## Specificity 0.9998 0.9998 0.9988 0.9976 0.9995
## Pos Pred Value 0.9996 0.9993 0.9941 0.9876 0.9979
## Neg Pred Value 1.0000 0.9987 0.9985 0.9988 0.9995
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1925 0.1731 0.1629 0.1834
## Detection Prevalence 0.2846 0.1926 0.1741 0.1649 0.1838
## Balanced Accuracy 0.9999 0.9973 0.9957 0.9957 0.9987
plot(boosting.matrix$table, main="Boosting Confusion Matrix")
The accuracy of this model is also very high (99.63%), but not as good as that of the random forest. The expected out-of-sample error is 0.37%.
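With all three confusion matrices available, the validation errors can be placed side by side. The sketch below uses the numbers of correctly classified cases read off the diagonals of the three tables above; the vector names are chosen for the example.

```r
# Validation set size and correctly classified cases per model (from the tables above)
total   <- 7846
correct <- c(tree = 6946, rf = 7836, gbm = 7817)

# Estimated out-of-sample error in percent
error.pct <- round(100 * (1 - correct / total), 2)
print(error.pct)

# Model with the lowest validation error
best <- names(which.min(error.pct))
print(best)
```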
Since the random forest model had the lowest validation error, it is the one used to predict the classe in the test dataset. Finally, the predicted results are saved in a .csv file.
final.pred=predict(random.model,newdata=test)
data.frame(final.pred)
## final.pred
## 1 B
## 2 A
## 3 B
## 4 A
## 5 A
## 6 E
## 7 D
## 8 B
## 9 A
## 10 A
## 11 B
## 12 C
## 13 B
## 14 A
## 15 E
## 16 E
## 17 A
## 18 B
## 19 B
## 20 B
write.csv(final.pred, file="predictions.csv", row.names=F)
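When a bare vector is passed to write.csv, it is written under a generic column name. A possible variation, sketched below, pairs each prediction with its case number under explicit column names and reads the file back as a check. The three values are the first three predictions shown above; the column names problem_id and classe and the temporary file are assumptions made for the example.

```r
# First three predictions from the output above, used as placeholder data
final.pred <- factor(c("B", "A", "B"), levels = LETTERS[1:5])

# Pair each prediction with its case number under explicit column names
out <- data.frame(problem_id = seq_along(final.pred), classe = final.pred)

# Write to a temporary file and read it back to confirm the contents
out.file <- tempfile(fileext = ".csv")
write.csv(out, file = out.file, row.names = FALSE)
check <- read.csv(out.file, stringsAsFactors = FALSE)
print(check)
```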