1 Synopsis

People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict the manner in which each lift was performed (the classe variable).

2 Summary

Because the outcome variable (classe) is qualitative, we use classification tree methods. After trying a simple classification tree, a tree with cross-validation, and a random forest, we see that the random forest performs very well, with an accuracy of about 99%.

3 Importing and Cleaning Data

## Loading all the required packages 
library(tidyverse)
library(caret)
library(randomForest)
library(rpart)
library(rattle)
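## Reading the files (stringsAsFactors = TRUE keeps text columns as factors,
## which is no longer the default since R 4.0)
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                     row.names = 1, stringsAsFactors = TRUE)
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                    row.names = 1, stringsAsFactors = TRUE)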

dim(training)
## [1] 19622   159
dim(testing)
## [1]  20 159

Let us clean the data a little. Many columns were read with the wrong class, and some columns contain only NA. We will remove every column with 95% or more NAs, and also drop columns that are not useful for prediction, such as user_name.

## Function to clean the data; we will reuse it for the testing set too
clean_data <- function(df) {
  ## Remove columns that are not useful for prediction
  df_clean <- select(df, -c(user_name, cvtd_timestamp, num_window,
                            raw_timestamp_part_1, raw_timestamp_part_2))

  ## Take every factor column except the last one (column 154, classe in the
  ## training data) and convert it to double; non-numeric entries become NA
  factor_column <- keep(df_clean[, -154], ~ is.factor(.x)) %>%
    map_df(~ as.numeric(as.character(.x)))

  ## Names of the converted columns with at least 95% of NAs
  factor_column_NA <- names(keep(factor_column, ~ mean(is.na(.x)) >= 0.95))

  ## Remove those columns, then any remaining column with at least 95% of NAs
  df_clean <- df_clean %>%
    select(-all_of(factor_column_NA)) %>%
    discard(~ mean(is.na(.x)) >= 0.95)
  df_clean
}
## Clean training data 
training_clean <- clean_data(training)
## Clean testing data 
testing_clean <- clean_data(testing)

After cleaning, only 53 variables remain to use for the prediction.
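To make the 95% rule applied in clean_data concrete, here is a minimal base-R sketch on a toy data frame. The values are made up and the column names are only illustrative; it is not the cleaning code used above, just the same threshold idea without the tidyverse helpers.

```r
# Toy data frame (made-up values, illustrative column names)
toy <- data.frame(
  roll_belt    = c(1.4, 1.5, 1.3, 1.6),
  max_roll_arm = c(NA, NA, NA, NA),     # 100% NA: should be dropped
  pitch_arm    = c(0.2, NA, 0.1, 0.3)   # 25% NA: should be kept
)

# Share of missing values in each column
na_share <- colMeans(is.na(toy))

# Keep only the columns below the 95% threshold
toy_clean <- toy[, na_share < 0.95, drop = FALSE]
names(toy_clean)
```

Only the column that is entirely NA is removed; a column with a few missing values stays in the data.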

4 Predictions

4.1 Simple tree prediction

Now that we have clean data, we fit a simple classification tree and check its accuracy.

## Set the seed for reproducibility
set.seed(190306)


## Create the data partition
partition <- createDataPartition(training$classe, p = 0.75, list = FALSE)
## Create the training set with the partition 
trainSet <- training_clean[partition,]
## Create the testing set with the partition 
testSet <- training_clean[-partition,]

## Model is learned using the train set 

tree <- rpart(classe ~., trainSet, method = "class")

## Prediction on test using tree
pred <- predict(tree, testSet, type = "class") 

## Confusion Matrix 
conf <- table(testSet$classe, pred)

## Accuracy of the model
accuracy <- sum(diag(conf)) / sum(conf)

With this simple model the accuracy is only 0.7506117 (about 75%), which is not enough. Let us add cross-validation and tune the complexity parameter of the tree to make the model more robust.
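The accuracy above is the share of observations on the diagonal of the confusion matrix. The following sketch shows the same computation on a handful of made-up labels (not taken from the data set), and also derives a per-class recall, which the report does not compute. Both factors are given the same levels so that the table is square and its diagonal lines up with the correct predictions.

```r
# Made-up true and predicted labels, for illustration only
lv    <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = lv)
pred  <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = lv)

# Rows are the true classes, columns the predicted classes
conf <- table(truth, pred)

# Overall accuracy: correct predictions (diagonal) over all predictions
accuracy <- sum(diag(conf)) / sum(conf)

# Per-class recall: correct predictions over the row total of each true class
recall <- diag(conf) / rowSums(conf)

accuracy  # 0.75 (6 of the 8 toy labels are predicted correctly)
recall
```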

4.2 Model with Cross-validation

## Tree with 4-fold cross-validation (xval = 4) and complexity parameter cp = 0.001
treeCV <- rpart(classe ~., trainSet, method = "class", control = rpart.control(xval = 4, cp = 0.001))

predCV <- predict(treeCV, testSet, type = "class") 

confCV <- table(testSet$classe, predCV)

accuracyCV <- sum(diag(confCV)) / sum(confCV) 

With 4-fold cross-validation and the complexity parameter set to 0.001, we get an accuracy of about 91%, which is good, with an out-of-sample error of about 9%. Note that rpart's default cp is 0.01, so the lower value mainly lets the tree grow deeper. Let us now try a random forest, which is more powerful, to see whether we can get better accuracy.
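As a side note on the tree above, the cross-validated errors that rpart computes are stored in the fitted object's cp table and can be used to prune the tree. The sketch below is not part of the original analysis: it uses the built-in iris data as a stand-in for trainSet, and picking the cp with the lowest cross-validated error is one common choice among several.

```r
library(rpart)

set.seed(190306)  # the cross-validated errors depend on the random folds

# Grow a deliberately large tree: low cp, 4-fold cross-validation
fit <- rpart(Species ~ ., data = iris, method = "class",
             control = rpart.control(xval = 4, cp = 0.001))

# One row per candidate subtree, including the cross-validated
# error (xerror) for the corresponding value of cp
cp_table <- fit$cptable

# Pick the cp with the lowest cross-validated error and prune back to it
best_cp <- cp_table[which.min(cp_table[, "xerror"]), "CP"]
pruned  <- prune(fit, cp = best_cp)

printcp(pruned)
```

The same steps would apply to treeCV by replacing iris and Species with trainSet and classe.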

4.3 Random Forest

treeRF <-  randomForest(classe ~., data = trainSet)

predRF <- predict(treeRF, testSet, type = "class") 

confRF <- table(testSet$classe, predRF)

accuracyRF <- sum(diag(confRF)) / sum(confRF) 

With the random forest we get an accuracy of about 99.653%, which is a very good result: the out-of-sample error is only about 0.347%.

## Predict the result with the testing set 
(pred_testing <- predict(treeRF, newdata = testing_clean))
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
## Write the results to a file (the argument is named file in readr >= 1.4.0; older versions used path)
write_csv(data.frame(problem_id = 1:20, prediction = pred_testing), file = "results.csv")

5 Appendix - figures

## 
## Call:
##  randomForest(formula = classe ~ ., data = trainSet) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.41%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4181    4    0    0    0 0.0009557945
## B    9 2834    5    0    0 0.0049157303
## C    0   11 2553    3    0 0.0054538372
## D    0    0   21 2390    1 0.0091210614
## E    0    0    2    5 2699 0.0025868441
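
The out-of-bag (OOB) error rate and the class.error column in this output can be recomputed from the confusion matrix itself. The sketch below copies the counts from the printout by hand, so it needs neither the fitted model nor the randomForest package; the object names are illustrative.

```r
# Confusion matrix copied from the printed random forest output
# (rows = true class, columns = OOB prediction)
oob_conf <- matrix(
  c(4181,    4,    0,    0,    0,
       9, 2834,    5,    0,    0,
       0,   11, 2553,    3,    0,
       0,    0,   21, 2390,    1,
       0,    0,    2,    5, 2699),
  nrow = 5, byrow = TRUE,
  dimnames = list(truth = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: misclassified observations over the row total
class_error <- 1 - diag(oob_conf) / rowSums(oob_conf)

# Overall OOB error: all off-diagonal counts over the total count
oob_error <- 1 - sum(diag(oob_conf)) / sum(oob_conf)

round(100 * oob_error, 2)  # 0.41, matching the OOB estimate in the printout
```

Here 61 of the 14718 observations in trainSet are misclassified out of bag, which gives the 0.41% reported above and is close to the 0.347% error measured on the hold-out set.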