This project analyzes data from wearable devices to recognize human activity and to predict the manner in which people performed an exercise. The document covers the following stages: getting and cleaning the data, exploratory analysis, and machine learning models. In getting and cleaning the data, I created three subsets: "train_train" and "train_test" from pml-training, and "test_test" from pml-testing. I cleaned those data sets by removing the columns containing NAs. I performed exploratory data analysis on the "train_train" subset and trained the machine learning model on it; the other subsets were used to test the model and to predict new outcomes. The decision tree model achieves an accuracy of about 49%.
First, I downloaded the data from the web:
library(caret)      # createDataPartition, train, trainControl, confusionMatrix
library(ggplot2)    # qplot
library(gridExtra)  # grid.arrange
library(rattle)     # fancyRpartPlot
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
traininghr <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
testinghr <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
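A quick structural check, which was not part of the original analysis, can confirm the data loaded as expected; the dimensions noted in the comments are what the published files are expected to contain.
dim(traininghr)            # expect 19622 rows x 160 columns for pml-training.csv
dim(testinghr)             # expect 20 rows x 160 columns for pml-testing.csv
table(traininghr$classe)   # distribution of the outcome across classes A-E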
Then, I partitioned "traininghr" into two subsets, train_train and train_test:
colnames_train <- colnames(traininghr)
colnames_test <- colnames(testinghr)
inTrain <- createDataPartition(y=traininghr$classe, p=0.7, list=FALSE)
train_train<- traininghr[inTrain, ]
train_test<- traininghr[-inTrain, ]
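A brief check, added here as a sketch, confirms that the 70/30 split behaved as intended and that the class proportions are preserved in both subsets.
# Verify the 70/30 split and the class proportions in each subset
dim(train_train); dim(train_test)
round(prop.table(table(train_train$classe)), 3)
round(prop.table(table(train_test$classe)), 3)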
To run the machine learning models, I had to eliminate the columns containing many NAs. First, I identified which columns have NAs:
# Count the non-NA values in each column
contarNAs <- function(x) {
  as.vector(apply(x, 2, function(x) length(which(!is.na(x)))))
}
columnaNA <- contarNAs(train_train)
# Flag every column that has at least one NA
drops <- c()
for (colNA in 1:length(columnaNA)) {
  if (columnaNA[colNA] < nrow(train_train)) {
    drops <- c(drops, colnames_train[colNA])
  }
}
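As an aside, the same drops vector can be built more compactly with colSums(); this is an equivalent sketch, not the code used above.
# Equivalent, more compact way to flag columns containing any NA
drops_alt <- colnames_train[colSums(is.na(train_train)) > 0]
identical(sort(drops), sort(drops_alt))   # should be TRUE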
Then, I removed those columns from the three subsets and kept only the sensor measurement columns, dropping the first seven identifier and timestamp columns:
train_train<- train_train[,!(names(train_train) %in% drops)]
train_train<- train_train[,8:length(colnames(train_train))]
train_test<- train_test[,!(names(train_test) %in% drops)]
train_test<- train_test[,8:length(colnames(train_test))]
test_test<-testinghr
test_test<- test_test[,!(names(test_test) %in% drops)]
test_test<- test_test[,8:length(colnames(test_test))]
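A quick consistency check, which I added here, verifies that the predictor columns now line up across the three subsets; note that test_test ends in problem_id rather than classe, so only the predictor columns (all but the last) are compared.
# The predictor columns should match across the three subsets
all(names(train_train)[-ncol(train_train)] == names(train_test)[-ncol(train_test)])
all(names(train_train)[-ncol(train_train)] == names(test_test)[-ncol(test_test)])
dim(train_train); dim(train_test); dim(test_test)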
After removing the NA variables, this dataset still has more than fifty possible predictors. I therefore decided to explore, through a heat map, whether there is any pattern or association among them.
# Heat map of the first 50 observations across the 52 predictor columns
ccc <- as.matrix(train_train[c(1:50), c(1:52)])
heatmap(ccc, cexCol=0.5, cexRow = 0.5)
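In addition to the visual heat map, pairwise correlations can flag strongly related predictors; the sketch below, which I added and which uses caret's findCorrelation() with an arbitrary cutoff, is one way to do that.
# Identify clusters of highly correlated sensor variables (0.9 cutoff is arbitrary)
corMat <- cor(train_train[, 1:52])
highCor <- findCorrelation(corMat, cutoff = 0.9)
names(train_train)[highCor]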
Checking the variables, there was no strong relationship, although the accelerometer variables clustered on the left and the magnetometer variables on the right. I then chose the total acceleration variables to observe which classe shows more acceleration.
bp1<-qplot(classe, total_accel_belt, data=train_train, geom="boxplot")
bp2<-qplot(classe, total_accel_arm, data=train_train, geom="boxplot")
bp3<-qplot(classe, total_accel_forearm, data=train_train, geom="boxplot")
bp4<-qplot(classe, total_accel_dumbbell, data=train_train, geom="boxplot")
grid.arrange(bp1, bp2, bp3, bp4, ncol=2)
Again, there is no strong distinction between classes, although classe A seems to show less acceleration.
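To back up that visual impression with numbers, group medians can be computed; this is a small sketch I added, and the exact values will depend on the random partition.
# Median total acceleration per classe, complementing the boxplots
aggregate(cbind(total_accel_belt, total_accel_arm,
                total_accel_forearm, total_accel_dumbbell) ~ classe,
          data = train_train, FUN = median)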
I first tried the "lm" method, but R indicated that it was not an appropriate choice for a categorical outcome. Therefore, I switched to a decision tree model through the "rpart" method.
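For reproducibility, it is usual to fix the random seed before training; the seed value below is arbitrary and was not specified in the original analysis.
set.seed(12345)   # arbitrary seed, not given in the original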
tree1<-train(classe~., data=train_train, method="rpart", trControl=trainControl(method = "cv", number = 10))
I used 10-fold cross-validation. The results can be seen with the following commands:
print(tree1)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12362, 12362, 12363, 12363, 12365, 12363, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03478792 0.5128489 0.36407569
## 0.06004815 0.4270878 0.22780076
## 0.11504425 0.3157327 0.04780251
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03478792.
print(tree1$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12592 8693 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 1119 6 A (0.99 0.0054 0 0 0) *
## 5) pitch_forearm>=-33.95 11473 8687 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 9688 6962 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122.5 6000 3558 A (0.41 0.18 0.18 0.16 0.063) *
## 21) roll_forearm>=122.5 3688 2467 C (0.077 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 1785 891 B (0.034 0.5 0.043 0.23 0.2) *
## 3) roll_belt>=130.5 1145 7 E (0.0061 0 0 0 0.99) *
library(grid)   # grid.newpage() comes from the grid package
grid.newpage()
fancyRpartPlot(tree1$finalModel)   # plot the final classification tree
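As an optional complement that I added, caret's varImp() ranks the predictors the tree relies on, which matches the splits printed above (roll_belt, pitch_forearm, magnet_dumbbell_y, roll_forearm).
# Rank predictors by importance in the fitted rpart model
varImp(tree1)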
Additionally, I computed a confusion matrix on the held-out train_test subset to assess the predictive power of the model:
prediccion1_1<-predict(tree1, train_test)
print(confusionMatrix(prediccion1_1, train_test$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1520 465 488 447 147
## B 21 392 31 163 137
## C 126 282 507 354 305
## D 0 0 0 0 0
## E 7 0 0 0 493
##
## Overall Statistics
##
## Accuracy : 0.4948
## 95% CI : (0.482, 0.5077)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3397
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9080 0.34416 0.49415 0.0000 0.45564
## Specificity 0.6326 0.92583 0.78041 1.0000 0.99854
## Pos Pred Value 0.4956 0.52688 0.32211 NaN 0.98600
## Neg Pred Value 0.9454 0.85470 0.87961 0.8362 0.89062
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2583 0.06661 0.08615 0.0000 0.08377
## Detection Prevalence 0.5212 0.12642 0.26746 0.0000 0.08496
## Balanced Accuracy 0.7703 0.63500 0.63728 0.5000 0.72709
The accuracy of the model is about 49% (95% CI: 48.2% to 50.8%), which is not strong, and the model would require further research.
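The expected out-of-sample error can be read directly off the held-out confusion matrix; this is a short sketch I added, based on the accuracy reported above.
# Expected out-of-sample error = 1 - held-out accuracy (roughly 0.51 here)
oos_error <- 1 - as.numeric(confusionMatrix(prediccion1_1, train_test$classe)$overall["Accuracy"])
oos_error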
Even so, I applied the algorithm to the test_test subset:
prediccion1_2<-predict(tree1, test_test)
test_test$prediccion1_2<-prediccion1_2
table(test_test$prediccion1_2)
##
## A B C D E
## 11 0 9 0 0
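As one possible direction for the further research mentioned above, a random forest usually performs considerably better than a single tree on this kind of sensor data; the sketch below shows how that could be set up with caret, under my own assumptions, and was not run as part of this analysis.
# Possible extension (not run here): random forest with 5-fold cross-validation
set.seed(12345)
rf1 <- train(classe ~ ., data = train_train, method = "rf",
             trControl = trainControl(method = "cv", number = 5))
print(rf1)
predict(rf1, test_test)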