Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. I will also use the prediction model to predict 20 different test cases.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Download the data. The pml-training.csv data is used for training, and the pml-testing.csv data is used for testing (predicting 20 cases with the best trained model).
getwd()
## [1] "C:/Users/pruebas/Documents/Nueva carpeta/machinelearning"
setwd("C:/Users/pruebas/Documents/Nueva carpeta/pml")
urltr <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlts <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
training <- read.csv(url(urltr), na.strings = c("NA", "#DIV/0!", ""), stringsAsFactors = FALSE)
testing <- read.csv(url(urlts), na.strings = c("NA", "#DIV/0!", ""))
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
Columns that contain NA values are dropped from both data sets. The first seven columns, which hold irrelevant bookkeeping fields such as user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp, new_window, and num_window, are then removed as well.
training<-training[,colSums(is.na(training)) == 0]
testing <-testing[,colSums(is.na(testing)) == 0]
training <-training[,-c(1:7)]
testing <-testing[,-c(1:7)]
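Because the NA filter is applied to each data set separately, the two sets could in principle end up with different columns. The following is a minimal sketch of a consistency check, using small made-up data frames (`train_toy` and `test_toy` are hypothetical and are not the project data):

```r
# Toy data frames standing in for the real training and testing sets (assumption).
train_toy <- data.frame(a = 1:3, b = c(NA, 2, 3), c = 4:6,
                        classe = c("A", "B", "A"))
test_toy <- data.frame(a = 1:2, b = c(1, 2), c = c(NA, 5),
                       problem_id = 1:2)

# Same filter as in the report: keep only columns without any NA.
train_keep <- train_toy[, colSums(is.na(train_toy)) == 0]
test_keep <- test_toy[, colSums(is.na(test_toy)) == 0]

# Predictors are all retained training columns except the outcome.
predictors <- setdiff(names(train_keep), "classe")

# Predictors the model would need but the filtered test set no longer has.
missing_in_test <- setdiff(predictors, names(test_keep))

# Predictors that are safe to use in both sets.
shared <- intersect(predictors, names(test_keep))

print(missing_in_test)
print(shared)
```

If `missing_in_test` is empty for the real data, the two filtered sets are compatible and no further alignment is needed.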
I plot a correlogram to see whether we can simplify the data set by identifying pairs of strongly correlated variables, which appear in dark red and dark blue:
See the PDF in the repo files.
pdf("multi-corr.pdf",50,50)
library(corrgram)
corrgram(training, order = TRUE, lower.panel=panel.pie)
dev.off()
## png
## 2
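As a numeric complement to the plot, strongly correlated pairs can also be listed directly. This is only a sketch on the built-in `mtcars` data, and the 0.8 cutoff is an arbitrary choice made for illustration, not a value used in this report:

```r
# Toy numeric data: every column of the built-in mtcars data set is numeric.
num <- mtcars

# Pairwise Pearson correlations between all columns.
cm <- cor(num)

# Blank out the diagonal and upper triangle so each pair is counted once.
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions of the remaining entries whose absolute correlation exceeds the cutoff.
cutoff <- 0.8  # illustrative threshold (assumption)
idx <- which(abs(cm) > cutoff, arr.ind = TRUE)

# One row per strongly correlated pair, strongest first.
strong_pairs <- data.frame(
  var1 = rownames(cm)[idx[, "row"]],
  var2 = colnames(cm)[idx[, "col"]],
  r = cm[idx]
)
strong_pairs <- strong_pairs[order(-abs(strong_pairs$r)), ]
print(strong_pairs)
```

The caret package also provides `findCorrelation()`, which suggests columns to drop given a correlation matrix and a cutoff.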
We don't observe a high number of correlated variables, so we will try two preprocessing options before training the model:
- Preprocessing with PCA before training the model (we use a 95% threshold, the default share of variance that the retained principal components must explain).
- No PCA preprocessing, so all the variables are used to train the model.
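To make the 95% threshold concrete, here is a rough sketch of how the number of retained components follows from the cumulative explained variance. It uses base R's `prcomp()` on the built-in `iris` measurements rather than caret's internal code, so it illustrates the idea and is not the exact procedure run by `train()`:

```r
# Toy numeric data: the four measurement columns of the built-in iris data set.
x <- iris[, 1:4]

# PCA on centered and scaled variables.
pca <- prcomp(x, center = TRUE, scale. = TRUE)

# Share of total variance explained by each component, and its running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components whose cumulative share reaches the threshold.
thresh <- 0.95
n_comp <- which(cum_var >= thresh)[1]

# Reduced feature space: scores on the retained components only.
reduced <- pca$x[, seq_len(n_comp), drop = FALSE]

print(round(cum_var, 4))
print(n_comp)
```

With a lower threshold fewer components are kept, which trades some information for a smaller feature space.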
Cross-validation with 5 folds is used for resampling: accuracy is computed on the held-out folds, and the optimal model is the one whose tuning parameter (mtry) gives the highest accuracy.
I use the random forest algorithm to train the model because it is typically one of the most accurate prediction algorithms. I will test a version with all variables as well as a version with the feature space reduced by PCA. In both cases, the model is trained with 5-fold cross-validation to estimate the out-of-sample error.
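The mechanics of 5-fold cross-validation can be sketched in base R as follows. To stay dependency-free, this example swaps in a simple nearest-centroid classifier (the hypothetical helper `nearest_centroid`) and the built-in `iris` data in place of the random forest and the project data:

```r
set.seed(1234)
k <- 5
x <- as.matrix(iris[, 1:4])
y <- iris$Species
n <- nrow(x)

# Assign every row to one of k folds at random, with near-equal fold sizes.
fold_id <- sample(rep(seq_len(k), length.out = n))

# Hypothetical stand-in classifier: predict the class with the closest mean vector.
nearest_centroid <- function(train_x, train_y, new_x) {
  # One row per class, one column per feature.
  centroids <- apply(train_x, 2, function(col) tapply(col, train_y, mean))
  # Squared Euclidean distance from each new row to each class centroid.
  dists <- apply(centroids, 1, function(ctr) rowSums(sweep(new_x, 2, ctr)^2))
  rownames(centroids)[max.col(-dists, ties.method = "first")]
}

# Fit on k - 1 folds, score on the held-out fold, and repeat for every fold.
fold_acc <- sapply(seq_len(k), function(i) {
  held_out <- fold_id == i
  pred <- nearest_centroid(x[!held_out, ], y[!held_out], x[held_out, ])
  mean(pred == y[held_out])
})

# Average held-out accuracy; one minus this estimates the out-of-sample error.
cv_accuracy <- mean(fold_acc)
print(fold_acc)
print(cv_accuracy)
```

Each observation is scored exactly once by a model that never saw it, which is what makes the averaged accuracy an estimate of out-of-sample performance.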
library(caret)
## Loading required package: lattice
##
## Attaching package: 'lattice'
## The following object is masked from 'package:corrgram':
##
## panel.fill
## Loading required package: ggplot2
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
set.seed(1234)
fit.nopca <- train(classe ~ ., data = training, ntree = 100, method = 'rf',
trControl = trainControl(method = "cv", number = 5,allowParallel = TRUE))
stopCluster(cluster)
registerDoSEQ()
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
fit.pca <- train(classe ~ ., data = training, ntree = 100, method = 'rf',
preProcess = "pca", trControl = trainControl(method = "cv", number = 5, allowParallel=TRUE))
stopCluster(cluster)
registerDoSEQ()
To select the final model used for prediction, I take the one with the smallest OOB (out-of-bag) estimate of error rate (0.49%). I fitted two random forest models, with and without principal component preprocessing, to see which one gives the better fit. The OOB estimate relies on the observations left out of the bootstrap sample used to grow each tree: because they were not used to fit that tree, they serve as unseen cases for estimating the model's performance. The confusion matrix shows the resulting prediction errors for each class.
fit.nopca$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.49%
## Confusion matrix:
## A B C D E class.error
## A 5572 7 1 0 0 0.001433692
## B 20 3770 6 1 0 0.007110877
## C 1 10 3401 10 0 0.006136762
## D 0 0 27 3186 3 0.009328358
## E 0 1 3 6 3597 0.002772387
fit.pca$finalModel
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 2.13%
## Confusion matrix:
## A B C D E class.error
## A 5531 15 23 8 3 0.008781362
## B 53 3691 39 3 11 0.027916776
## C 6 45 3333 29 9 0.026008182
## D 4 3 105 3097 7 0.037002488
## E 1 22 16 15 3553 0.014970890
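The reported OOB error rates can be recomputed from the confusion matrices printed above: the error is the share of off-diagonal counts. The sketch below retypes the two matrices by hand; `oob_error` and `class_error` are helper names introduced here for illustration:

```r
classes <- LETTERS[1:5]

# Confusion matrix of the model without PCA (rows: actual, columns: predicted).
cm_nopca <- matrix(c(5572,    7,    1,    0,    0,
                       20, 3770,    6,    1,    0,
                        1,   10, 3401,   10,    0,
                        0,    0,   27, 3186,    3,
                        0,    1,    3,    6, 3597),
                   nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

# Confusion matrix of the model with PCA preprocessing.
cm_pca <- matrix(c(5531,   15,   23,    8,    3,
                     53, 3691,   39,    3,   11,
                      6,   45, 3333,   29,    9,
                      4,    3,  105, 3097,    7,
                      1,   22,   16,   15, 3553),
                 nrow = 5, byrow = TRUE, dimnames = list(classes, classes))

# Overall error: share of observations that fall off the diagonal.
oob_error <- function(cm) 1 - sum(diag(cm)) / sum(cm)

# Per-class error: share of each actual class that was misclassified.
class_error <- function(cm) 1 - diag(cm) / rowSums(cm)

print(round(100 * oob_error(cm_nopca), 2))
print(round(100 * oob_error(cm_pca), 2))
print(class_error(cm_nopca))
```

On a fitted model the same counts should be available in the `confusion` component of the randomForest object, for example `fit.nopca$finalModel$confusion`.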
The model with all features performs better than the one with PCA preprocessing (OOB error of 0.49% versus 2.13%).
We predict the 20 cases in the testing data set with the best final model selected above (fit.nopca):
predictions = predict(fit.nopca, testing)
predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E