Alejandro Fraga, June 2016
One thing people regularly do these days is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
Load the data from the individuals who participated in the study and the libraries needed to perform the analysis:
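The loading chunk itself is not reproduced in this write-up, so a minimal sketch follows. The file names pml-training.csv and pml-testing.csv and the na.strings values are assumptions rather than part of the original code; the data frame names train_data and test match the ones used later.
# Minimal loading sketch (file names and na.strings values are assumed)
library(caret)
library(randomForest)
library(dplyr)
train_data <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
test <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))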
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Next, we need to do some data preparation:
# Columns full of NAs are not useful, so drop them
relevantFeatures <- names(test[,colSums(is.na(test)) == 0])[8:59]
# Keep only the relevant features that appear in the test cases
train_data <- train_data[,c(relevantFeatures,"classe")]
test <- test[,c(relevantFeatures,"problem_id")]
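As a quick sanity check, the cleaned frames should now hold the 52 sensor features plus the outcome or identifier column (a small sketch; the exact row counts depend on the downloaded files):
# Sanity check on the cleaned data (sketch)
dim(train_data) # expected: 52 features + classe = 53 columns
dim(test)       # expected: 20 rows, 52 features + problem_id = 53 columns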
I will hold out 25% of the data set for testing:
set.seed(246)
inTrain = createDataPartition(train_data$classe, p = 0.75, list = F)
training = train_data[inTrain,]
testing = train_data[-inTrain,]
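A quick check confirms the 75/25 split preserved the class proportions (a small sketch):
# Class proportions should be nearly identical in both partitions (sketch)
round(prop.table(table(training$classe)), 3)
round(prop.table(table(testing$classe)), 3)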
To simplify the analysis, let's remove from the training set those features that are highly correlated (>90%):
outcome = which(names(training) == "classe")
highCorrCols = findCorrelation(abs(cor(training[,-outcome])),0.90)
# highCorrFeatures holds the names of the highly correlated features
highCorrFeatures = names(training)[highCorrCols]
training = training[,-highCorrCols]
outcome = which(names(training) == "classe")
str(outcome)
## int 46
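The names of the removed features are stored in the highCorrFeatures variable created above and can be printed directly (a quick sketch):
# Show which highly correlated features were dropped (sketch)
highCorrFeatures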
From this analysis, I found that the highly correlated features are: accel_belt_z, roll_belt, accel_belt_y, accel_belt_x, gyros_arm_y, gyros_forearm_z, and gyros_dumbbell_x.
As we learned, the Random Forest method handles non-linear features well, as is the case in this study, and it also reduces overfitting. I will also train a k-nearest neighbors (KNN) model to identify which algorithm provides better accuracy.
First, I will use Random Forest to discover the most important features.
featuresRF = randomForest(training[,-outcome], training[,outcome], importance = T)
importanceRF = data.frame(featuresRF$importance)
impFeatures = order(-importanceRF$MeanDecreaseGini)
# Take a 5% sample of the training set for the feature plot
inImp = createDataPartition(training$classe, p = 0.05, list = F)
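The top-ranked features can be read off the importance table ordered above (a small sketch):
# Four most important features by mean decrease in Gini (sketch)
rownames(importanceRF)[impFeatures[1:4]]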
The feature plot for the 4 most important features (pitch_belt, yaw_belt, total_accel_belt, gyros_belt_x) is shown below:
featurePlot(training[inImp,impFeatures[1:4]],training$classe[inImp], plot = "pairs")
# Fit the Random Forest model
# ctrlRF sets up out-of-bag resampling for caret::train; here the forest is fit directly with randomForest()
ctrlRF = trainControl(method = "oob")
modelRF <- randomForest(classe ~ ., data=training)
RFPredTrain <- predict(modelRF, newdata=training, type="class")
RFAccuracyTrain <- confusionMatrix(RFPredTrain, training$classe)
RFAccuracyTrain
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
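The perfect accuracy above is measured on the same data the forest was fit on, so it is optimistic. The randomForest object also reports an out-of-bag (OOB) error estimate, which is a less optimistic check (a quick sketch):
# Printing the model object shows the OOB estimate of error rate (sketch)
print(modelRF)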
Next, I will train a model using the k-nearest neighbors (KNN) algorithm for comparison:
# Developing KNN model
ctrlKNN = trainControl(method = "adaptive_cv")
modelKNN = train(classe ~ ., training, method = "knn", trControl = ctrlKNN)
KNNPredTrain <- predict(modelKNN, newdata=training)
KNNAccuracyTrain <- confusionMatrix(KNNPredTrain, training$classe)
KNNAccuracyTrain
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4115 74 16 13 16
## B 16 2664 46 4 38
## C 22 43 2458 91 30
## D 27 32 27 2281 40
## E 5 35 20 23 2582
##
## Overall Statistics
##
## Accuracy : 0.958
## 95% CI : (0.9546, 0.9612)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9469
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9833 0.9354 0.9575 0.9457 0.9542
## Specificity 0.9887 0.9912 0.9847 0.9898 0.9931
## Pos Pred Value 0.9719 0.9624 0.9297 0.9477 0.9689
## Neg Pred Value 0.9933 0.9846 0.9910 0.9894 0.9897
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2796 0.1810 0.1670 0.1550 0.1754
## Detection Prevalence 0.2877 0.1881 0.1796 0.1635 0.1811
## Balanced Accuracy 0.9860 0.9633 0.9711 0.9677 0.9736
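The number of neighbors selected by the adaptive cross-validation can be inspected on the caret object (a quick sketch):
# Tuning parameter (k) chosen by train() for the KNN model (sketch)
modelKNN$bestTune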
As we can see, the Random Forest model provides better accuracy than the k-nearest neighbors model. Next, I provide the confusion matrix for the Random Forest model applied to the held-out testing set:
PredTest <- predict(modelRF, testing)
AccuracyTest <- confusionMatrix(PredTest, testing$classe)
AccuracyTest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 1 0 0 0
## B 0 946 2 0 0
## C 0 2 853 9 0
## D 0 0 0 795 1
## E 0 0 0 0 900
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.995, 0.9983)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9968 0.9977 0.9888 0.9989
## Specificity 0.9997 0.9995 0.9973 0.9998 1.0000
## Pos Pred Value 0.9993 0.9979 0.9873 0.9987 1.0000
## Neg Pred Value 1.0000 0.9992 0.9995 0.9978 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1929 0.1739 0.1621 0.1835
## Detection Prevalence 0.2847 0.1933 0.1762 0.1623 0.1835
## Balanced Accuracy 0.9999 0.9982 0.9975 0.9943 0.9994
Based on the two models used, I conclude that Random Forest provides the best outcome prediction, with approximately 0.997 accuracy on the held-out testing set.
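The expected out-of-sample error follows directly from the testing-set accuracy (a small sketch):
# Expected out-of-sample error from the held-out testing set (sketch)
1 - AccuracyTest$overall["Accuracy"] # roughly 0.003 given the accuracy above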
# Run the model against the 20 test cases
genPredictions <- predict(modelRF, newdata=test, type="class")
# The following function generates the prediction files requested in the assignment
predmodel_write_files = function(x){
  n = length(x)
  for(i in 1:n){
    fname = paste0("problem_id_",i,".txt")
    write.table(x[i],file=fname,quote=FALSE,row.names=FALSE,col.names=FALSE)
  }
}
predmodel_write_files(genPredictions)
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human '13). Stuttgart, Germany: ACM SIGCHI, 2013.
Read more: http://groupware.les.inf.puc-rio.br/har