This project analyzes data from wearable devices to recognize human activity and to predict the manner in which people performed an exercise. The document covers the following stages: getting and cleaning the data, exploratory analysis, and machine learning models. In getting and cleaning the data, I created three subsets: "train_train" and "train_test" from pml-training, and "test_test" from pml-testing. I cleaned those data sets by removing the columns containing NAs. I performed exploratory data analysis on the "train_train" subset and trained the machine learning model on it; the other subsets were used to test the model and to predict new outcomes. The decision tree model achieves an accuracy of about 49%.
First, I downloaded the data from the web:
library(caret)      # createDataPartition, train, trainControl, confusionMatrix
library(ggplot2)    # qplot
library(gridExtra)  # grid.arrange
library(rattle)     # fancyRpartPlot
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
traininghr <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
testinghr <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
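A quick structural check, which was not part of the original analysis, can confirm the data loaded as expected; the dimensions noted in the comments are what the published files are expected to contain.
dim(traininghr)            # expect 19622 rows x 160 columns for pml-training.csv
dim(testinghr)             # expect 20 rows x 160 columns for pml-testing.csv
table(traininghr$classe)   # distribution of the outcome across classes A-E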
Then, I partitioned "traininghr" into two subsets, train_train and train_test:
colnames_train <- colnames(traininghr)
colnames_test <- colnames(testinghr)
inTrain <- createDataPartition(y=traininghr$classe, p=0.7, list=FALSE)
train_train<- traininghr[inTrain, ]
train_test<- traininghr[-inTrain, ]
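A brief check, added here as a sketch, confirms that the 70/30 split behaved as intended and that the class proportions are preserved in both subsets.
# Verify the 70/30 split and the class proportions in each subset
dim(train_train); dim(train_test)
round(prop.table(table(train_train$classe)), 3)
round(prop.table(table(train_test$classe)), 3)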
To run the machine learning models, I had to eliminate the columns containing many NAs. First, I identified which columns have NAs:
# Count the non-NA values in each column
contarNAs <- function(x) {
  as.vector(apply(x, 2, function(x) length(which(!is.na(x)))))
}
columnaNA <- contarNAs(train_train)
# Flag every column that has at least one NA
drops <- c()
for (colNA in 1:length(columnaNA)) {
  if (columnaNA[colNA] < nrow(train_train)) {
    drops <- c(drops, colnames_train[colNA])
  }
}
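As an aside, the same drops vector can be built more compactly with colSums(); this is an equivalent sketch, not the code used above.
# Equivalent, more compact way to flag columns containing any NA
drops_alt <- colnames_train[colSums(is.na(train_train)) > 0]
identical(sort(drops), sort(drops_alt))   # should be TRUE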
Then, I removed those columns from the three subsets and kept only the sensor measurement columns, dropping the first seven identifier and timestamp columns:
train_train<- train_train[,!(names(train_train) %in% drops)]
train_train<- train_train[,8:length(colnames(train_train))]
train_test<- train_test[,!(names(train_test) %in% drops)]
train_test<- train_test[,8:length(colnames(train_test))]
test_test<-testinghr
test_test<- test_test[,!(names(test_test) %in% drops)]
test_test<- test_test[,8:length(colnames(test_test))]
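A quick consistency check, which I added here, verifies that the predictor columns now line up across the three subsets; note that test_test ends in problem_id rather than classe, so only the predictor columns (all but the last) are compared.
# The predictor columns should match across the three subsets
all(names(train_train)[-ncol(train_train)] == names(train_test)[-ncol(train_test)])
all(names(train_train)[-ncol(train_train)] == names(test_test)[-ncol(test_test)])
dim(train_train); dim(train_test); dim(test_test)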
After removing the NA variables, this dataset still has more than fifty possible predictors. I therefore decided to explore, through a heat map, whether there is any pattern or association among them.
# Heat map of the first 50 observations across the 52 predictor columns
ccc <- as.matrix(train_train[c(1:50), c(1:52)])
heatmap(ccc, cexCol=0.5, cexRow = 0.5)
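In addition to the visual heat map, pairwise correlations can flag strongly related predictors; the sketch below, which I added and which uses caret's findCorrelation() with an arbitrary cutoff, is one way to do that.
# Identify clusters of highly correlated sensor variables (0.9 cutoff is arbitrary)
corMat <- cor(train_train[, 1:52])
highCor <- findCorrelation(corMat, cutoff = 0.9)
names(train_train)[highCor]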
Checking the variables, there was no strong relationship, although the accelerometer variables clustered on the left and the magnetometer variables on the right. I then chose the total acceleration variables to observe which classe shows more acceleration.
bp1<-qplot(classe, total_accel_belt, data=train_train, geom="boxplot")
bp2<-qplot(classe, total_accel_arm, data=train_train, geom="boxplot")
bp3<-qplot(classe, total_accel_forearm, data=train_train, geom="boxplot")
bp4<-qplot(classe, total_accel_dumbbell, data=train_train, geom="boxplot")
grid.arrange(bp1, bp2, bp3, bp4, ncol=2)
Again, there is no strong distinction between classes, although classe A seems to show less acceleration.
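To back up that visual impression with numbers, group medians can be computed; this is a small sketch I added, and the exact values will depend on the random partition.
# Median total acceleration per classe, complementing the boxplots
aggregate(cbind(total_accel_belt, total_accel_arm,
                total_accel_forearm, total_accel_dumbbell) ~ classe,
          data = train_train, FUN = median)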
I first tried the "lm" method, but R indicated that it was not an appropriate choice for a categorical outcome. Therefore, I switched to a decision tree model through the "rpart" method.
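For reproducibility, it is usual to fix the random seed before training; the seed value below is arbitrary and was not specified in the original analysis.
set.seed(12345)   # arbitrary seed, not given in the original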
tree1<-train(classe~., data=train_train, method="rpart", trControl=trainControl(method = "cv", number = 10))
I used 10-fold cross-validation. The results can be seen with the following commands:
print(tree1)
## CART
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 12362, 12362, 12363, 12363, 12365, 12363, ...
## Resampling results across tuning parameters:
##
## cp Accuracy Kappa
## 0.03478792 0.5128489 0.36407569
## 0.06004815 0.4270878 0.22780076
## 0.11504425 0.3157327 0.04780251
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03478792.
print(tree1$finalModel)
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12592 8693 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.95 1119 6 A (0.99 0.0054 0 0 0) *
## 5) pitch_forearm>=-33.95 11473 8687 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 439.5 9688 6962 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 122.5 6000 3558 A (0.41 0.18 0.18 0.16 0.063) *
## 21) roll_forearm>=122.5 3688 2467 C (0.077 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=439.5 1785 891 B (0.034 0.5 0.043 0.23 0.2) *
## 3) roll_belt>=130.5 1145 7 E (0.0061 0 0 0 0.99) *
library(grid)   # grid.newpage() comes from the grid package
grid.newpage()
fancyRpartPlot(tree1$finalModel)   # plot the final classification tree
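As an optional complement that I added, caret's varImp() ranks the predictors the tree relies on, which matches the splits printed above (roll_belt, pitch_forearm, magnet_dumbbell_y, roll_forearm).
# Rank predictors by importance in the fitted rpart model
varImp(tree1)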
Additionally, I computed a confusion matrix on the held-out train_test subset to assess the predictive power of the model:
prediccion1_1<-predict(tree1, train_test)
print(confusionMatrix(prediccion1_1, train_test$classe), digits=4)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1520 465 488 447 147
## B 21 392 31 163 137
## C 126 282 507 354 305
## D 0 0 0 0 0
## E 7 0 0 0 493
##
## Overall Statistics
##
## Accuracy : 0.4948
## 95% CI : (0.482, 0.5077)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3397
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9080 0.34416 0.49415 0.0000 0.45564
## Specificity 0.6326 0.92583 0.78041 1.0000 0.99854
## Pos Pred Value 0.4956 0.52688 0.32211 NaN 0.98600
## Neg Pred Value 0.9454 0.85470 0.87961 0.8362 0.89062
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2583 0.06661 0.08615 0.0000 0.08377
## Detection Prevalence 0.5212 0.12642 0.26746 0.0000 0.08496
## Balanced Accuracy 0.7703 0.63500 0.63728 0.5000 0.72709
The accuracy of the model is about 49% (95% CI: 48.2% to 50.8%), which is not strong, and the model would require further research.
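The expected out-of-sample error can be read directly off the held-out confusion matrix; this is a short sketch I added, based on the accuracy reported above.
# Expected out-of-sample error = 1 - held-out accuracy (roughly 0.51 here)
oos_error <- 1 - as.numeric(confusionMatrix(prediccion1_1, train_test$classe)$overall["Accuracy"])
oos_error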
Even so, I applied the algorithm to the test_test subset:
prediccion1_2<-predict(tree1, test_test)
test_test$prediccion1_2<-prediccion1_2
table(test_test$prediccion1_2)
##
## A B C D E
## 11 0 9 0 0
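As one possible direction for the further research mentioned above, a random forest usually performs considerably better than a single tree on this kind of sensor data; the sketch below shows how that could be set up with caret, under my own assumptions, and was not run as part of this analysis.
# Possible extension (not run here): random forest with 5-fold cross-validation
set.seed(12345)
rf1 <- train(classe ~ ., data = train_train, method = "rf",
             trControl = trainControl(method = "cv", number = 5))
print(rf1)
predict(rf1, test_test)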