This is the course project of the Practical Machine Learning course. The report describes how the goals of the project are accomplished:
1. The data is cleaned to avoid using variables with NA values.
2. The 19622 experiments available for training are split 75/25: one part is used to build the model and the other to test the results and measure the accuracy.
3. A first model is created using a classification tree, but its accuracy is not good enough.
4. A final model is created using random forest, which reaches an accuracy of about 99%, the level of accuracy needed to predict the 20 test cases with high confidence. To reduce the training time, the model is trained using k-fold cross-validation with k = 5 and parallel processing.
5. As the accuracy of the final model is about 99%, it is used to predict the 20 test cases.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
First, the required libraries are loaded and the input data is read.
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.2.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(parallel)
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
library(knitr)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart.plot)
#library(randomForest)
#library(corrplot)
set.seed(12345)
pml_training = read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings = c("NA", "#DIV/0!", ""), header = TRUE)
pml_testing = read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",na.strings = c("NA", "#DIV/0!", ""), header = TRUE)
dim(pml_training)
## [1] 19622 160
dim(pml_testing)
## [1] 20 160
summary(pml_testing)
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1.00 adelmo :1 Min. :1.322e+09 Min. : 36553
## 1st Qu.: 5.75 carlitos:3 1st Qu.:1.323e+09 1st Qu.:268655
## Median :10.50 charles :1 Median :1.323e+09 Median :530706
## Mean :10.50 eurico :4 Mean :1.323e+09 Mean :512167
## 3rd Qu.:15.25 jeremy :8 3rd Qu.:1.323e+09 3rd Qu.:787738
## Max. :20.00 pedro :3 Max. :1.323e+09 Max. :920315
##
## cvtd_timestamp new_window num_window roll_belt
## 30/11/2011 17:11:4 no:20 Min. : 48.0 Min. : -5.9200
## 05/12/2011 11:24:3 1st Qu.:250.0 1st Qu.: 0.9075
## 30/11/2011 17:12:3 Median :384.5 Median : 1.1100
## 05/12/2011 14:23:2 Mean :379.6 Mean : 31.3055
## 28/11/2011 14:14:2 3rd Qu.:467.0 3rd Qu.: 32.5050
## 02/12/2011 13:33:1 Max. :859.0 Max. :129.0000
## (Other) :5
## pitch_belt yaw_belt total_accel_belt kurtosis_roll_belt
## Min. :-41.600 Min. :-93.70 Min. : 2.00 Mode:logical
## 1st Qu.: 3.013 1st Qu.:-88.62 1st Qu.: 3.00 NA's:20
## Median : 4.655 Median :-87.85 Median : 4.00
## Mean : 5.824 Mean :-59.30 Mean : 7.55
## 3rd Qu.: 6.135 3rd Qu.:-63.50 3rd Qu.: 8.00
## Max. : 27.800 Max. :162.00 Max. :21.00
##
## kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## amplitude_roll_belt amplitude_pitch_belt amplitude_yaw_belt
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## var_total_accel_belt avg_roll_belt stddev_roll_belt var_roll_belt
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## avg_pitch_belt stddev_pitch_belt var_pitch_belt avg_yaw_belt
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## stddev_yaw_belt var_yaw_belt gyros_belt_x gyros_belt_y
## Mode:logical Mode:logical Min. :-0.500 Min. :-0.050
## NA's:20 NA's:20 1st Qu.:-0.070 1st Qu.:-0.005
## Median : 0.020 Median : 0.000
## Mean :-0.045 Mean : 0.010
## 3rd Qu.: 0.070 3rd Qu.: 0.020
## Max. : 0.240 Max. : 0.110
##
## gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## Min. :-0.4800 Min. :-48.00 Min. :-16.00 Min. :-187.00
## 1st Qu.:-0.1375 1st Qu.:-19.00 1st Qu.: 2.00 1st Qu.: -24.00
## Median :-0.0250 Median :-13.00 Median : 4.50 Median : 27.00
## Mean :-0.1005 Mean :-13.50 Mean : 18.35 Mean : -17.60
## 3rd Qu.: 0.0000 3rd Qu.: -8.75 3rd Qu.: 25.50 3rd Qu.: 38.25
## Max. : 0.0500 Max. : 46.00 Max. : 72.00 Max. : 49.00
##
## magnet_belt_x magnet_belt_y magnet_belt_z roll_arm
## Min. :-13.00 Min. :566.0 Min. :-426.0 Min. :-137.00
## 1st Qu.: 5.50 1st Qu.:578.5 1st Qu.:-398.5 1st Qu.: 0.00
## Median : 33.50 Median :600.5 Median :-313.5 Median : 0.00
## Mean : 35.15 Mean :601.5 Mean :-346.9 Mean : 16.42
## 3rd Qu.: 46.25 3rd Qu.:631.2 3rd Qu.:-305.0 3rd Qu.: 71.53
## Max. :169.00 Max. :638.0 Max. :-291.0 Max. : 152.00
##
## pitch_arm yaw_arm total_accel_arm var_accel_arm
## Min. :-63.800 Min. :-167.00 Min. : 3.00 Mode:logical
## 1st Qu.: -9.188 1st Qu.: -60.15 1st Qu.:20.25 NA's:20
## Median : 0.000 Median : 0.00 Median :29.50
## Mean : -3.950 Mean : -2.80 Mean :26.40
## 3rd Qu.: 3.465 3rd Qu.: 25.50 3rd Qu.:33.25
## Max. : 55.000 Max. : 178.00 Max. :44.00
##
## avg_roll_arm stddev_roll_arm var_roll_arm avg_pitch_arm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## stddev_pitch_arm var_pitch_arm avg_yaw_arm stddev_yaw_arm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## var_yaw_arm gyros_arm_x gyros_arm_y gyros_arm_z
## Mode:logical Min. :-3.710 Min. :-2.0900 Min. :-0.6900
## NA's:20 1st Qu.:-0.645 1st Qu.:-0.6350 1st Qu.:-0.1800
## Median : 0.020 Median :-0.0400 Median :-0.0250
## Mean : 0.077 Mean :-0.1595 Mean : 0.1205
## 3rd Qu.: 1.248 3rd Qu.: 0.2175 3rd Qu.: 0.5650
## Max. : 3.660 Max. : 1.8500 Max. : 1.1300
##
## accel_arm_x accel_arm_y accel_arm_z magnet_arm_x
## Min. :-341.0 Min. :-65.00 Min. :-404.00 Min. :-428.00
## 1st Qu.:-277.0 1st Qu.: 52.25 1st Qu.:-128.50 1st Qu.:-373.75
## Median :-194.5 Median :112.00 Median : -83.50 Median :-265.00
## Mean :-134.6 Mean :103.10 Mean : -87.85 Mean : -38.95
## 3rd Qu.: 5.5 3rd Qu.:168.25 3rd Qu.: -27.25 3rd Qu.: 250.50
## Max. : 106.0 Max. :245.00 Max. : 93.00 Max. : 750.00
##
## magnet_arm_y magnet_arm_z kurtosis_roll_arm kurtosis_picth_arm
## Min. :-307.0 Min. :-499.0 Mode:logical Mode:logical
## 1st Qu.: 205.2 1st Qu.: 403.0 NA's:20 NA's:20
## Median : 291.0 Median : 476.5
## Mean : 239.4 Mean : 369.8
## 3rd Qu.: 358.8 3rd Qu.: 517.0
## Max. : 474.0 Max. : 633.0
##
## kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm skewness_yaw_arm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## max_roll_arm max_picth_arm max_yaw_arm min_roll_arm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## min_pitch_arm min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## amplitude_yaw_arm roll_dumbbell pitch_dumbbell yaw_dumbbell
## Mode:logical Min. :-111.118 Min. :-54.97 Min. :-103.3200
## NA's:20 1st Qu.: 7.494 1st Qu.:-51.89 1st Qu.: -75.2809
## Median : 50.403 Median :-40.81 Median : -8.2863
## Mean : 33.760 Mean :-19.47 Mean : -0.9385
## 3rd Qu.: 58.129 3rd Qu.: 16.12 3rd Qu.: 55.8335
## Max. : 123.984 Max. : 96.87 Max. : 132.2337
##
## kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## Mode:logical Mode:logical Min. : 1.0
## NA's:20 NA's:20 1st Qu.: 7.0
## Median :15.5
## Mean :17.2
## 3rd Qu.:29.0
## Max. :31.0
##
## var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## var_roll_dumbbell avg_pitch_dumbbell stddev_pitch_dumbbell
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## var_pitch_dumbbell avg_yaw_dumbbell stddev_yaw_dumbbell var_yaw_dumbbell
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z accel_dumbbell_x
## Min. :-1.0300 Min. :-1.1100 Min. :-1.180 Min. :-159.00
## 1st Qu.: 0.1600 1st Qu.:-0.2100 1st Qu.:-0.485 1st Qu.:-140.25
## Median : 0.3600 Median : 0.0150 Median :-0.280 Median : -19.00
## Mean : 0.2690 Mean : 0.0605 Mean :-0.266 Mean : -47.60
## 3rd Qu.: 0.4625 3rd Qu.: 0.1450 3rd Qu.:-0.165 3rd Qu.: 15.75
## Max. : 1.0600 Max. : 1.9100 Max. : 1.100 Max. : 185.00
##
## accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x magnet_dumbbell_y
## Min. :-30.00 Min. :-221.0 Min. :-576.0 Min. :-558.0
## 1st Qu.: 5.75 1st Qu.:-192.2 1st Qu.:-528.0 1st Qu.: 259.5
## Median : 71.50 Median : -3.0 Median :-508.5 Median : 316.0
## Mean : 70.55 Mean : -60.0 Mean :-304.2 Mean : 189.3
## 3rd Qu.:151.25 3rd Qu.: 76.5 3rd Qu.:-317.0 3rd Qu.: 348.2
## Max. :166.00 Max. : 100.0 Max. : 523.0 Max. : 403.0
##
## magnet_dumbbell_z roll_forearm pitch_forearm yaw_forearm
## Min. :-164.00 Min. :-176.00 Min. :-63.500 Min. :-168.000
## 1st Qu.: -33.00 1st Qu.: -40.25 1st Qu.:-11.457 1st Qu.: -93.375
## Median : 49.50 Median : 94.20 Median : 8.830 Median : -19.250
## Mean : 71.40 Mean : 38.66 Mean : 7.099 Mean : 2.195
## 3rd Qu.: 96.25 3rd Qu.: 143.25 3rd Qu.: 28.500 3rd Qu.: 104.500
## Max. : 368.00 Max. : 176.00 Max. : 59.300 Max. : 159.000
##
## kurtosis_roll_forearm kurtosis_picth_forearm kurtosis_yaw_forearm
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## skewness_roll_forearm skewness_pitch_forearm skewness_yaw_forearm
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## max_roll_forearm max_picth_forearm max_yaw_forearm min_roll_forearm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20
##
##
##
##
##
## amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## Mode:logical Mode:logical Min. :21.00
## NA's:20 NA's:20 1st Qu.:24.00
## Median :32.50
## Mean :32.05
## 3rd Qu.:36.75
## Max. :47.00
##
## var_accel_forearm avg_roll_forearm stddev_roll_forearm var_roll_forearm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## Mode:logical Mode:logical Mode:logical Mode:logical
## NA's:20 NA's:20 NA's:20 NA's:20
##
##
##
##
##
## stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## Mode:logical Mode:logical Min. :-1.0600 Min. :-5.9700
## NA's:20 NA's:20 1st Qu.:-0.5850 1st Qu.:-1.2875
## Median : 0.0200 Median : 0.0350
## Mean :-0.0200 Mean :-0.0415
## 3rd Qu.: 0.2925 3rd Qu.: 2.0475
## Max. : 1.3800 Max. : 4.2600
##
## gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## Min. :-1.2600 Min. :-212.0 Min. :-331.0 Min. :-282.0
## 1st Qu.:-0.0975 1st Qu.:-114.8 1st Qu.: 8.5 1st Qu.:-199.0
## Median : 0.2300 Median : 86.0 Median : 138.0 Median :-148.5
## Mean : 0.2610 Mean : 38.8 Mean : 125.3 Mean : -93.7
## 3rd Qu.: 0.7625 3rd Qu.: 166.2 3rd Qu.: 268.0 3rd Qu.: -31.0
## Max. : 1.8000 Max. : 232.0 Max. : 406.0 Max. : 179.0
##
## magnet_forearm_x magnet_forearm_y magnet_forearm_z problem_id
## Min. :-714.0 Min. :-787.0 Min. :-32.0 Min. : 1.00
## 1st Qu.:-427.2 1st Qu.:-328.8 1st Qu.:275.2 1st Qu.: 5.75
## Median :-189.5 Median : 487.0 Median :491.5 Median :10.50
## Mean :-159.2 Mean : 191.8 Mean :460.2 Mean :10.50
## 3rd Qu.: 41.5 3rd Qu.: 720.8 3rd Qu.:661.5 3rd Qu.:15.25
## Max. : 532.0 Max. : 800.0 Max. :884.0 Max. :20.00
##
Both datasets have 160 variables. Many of these variables consist mostly or entirely of NA values, and they are removed with the cleaning procedures below. Near zero variance (NZV) variables and the ID variables are removed as well.
Several variables (columns) contain NA values. These columns are removed by counting the NA values in each column with colSums(is.na(...)) and keeping only the columns where the count is zero.
training1<- pml_training[,colSums(is.na(pml_training)) == 0]
testing1<- pml_testing[,colSums(is.na(pml_testing)) == 0]
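The column filter can be illustrated on a small data frame. This is a minimal sketch: the values below are made up, and only the technique (counting NA values per column and keeping the complete columns) mirrors the code above.

```r
# Toy data frame (hypothetical values): one complete and two incomplete columns
toy <- data.frame(
  roll_belt     = c(1.4, 1.5, 1.6),
  max_roll_belt = c(NA, NA, 2.0),
  avg_yaw_belt  = c(NA, NA, NA)
)

# Number of missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns without any NA; drop = FALSE keeps a data frame
# even when a single column survives
toy_clean <- toy[, na_counts == 0, drop = FALSE]

print(na_counts)
print(names(toy_clean))
```

Because the filter is applied to each file separately, a quick check such as setdiff(names(training1), names(testing1)) would show any column that survives in one dataset but not in the other.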
The first seven columns are removed because they contain information about the people who did the test and timestamps, which are not related to the classification we are trying to predict.
training<- training1[,-c(1:7)]
testing<- testing1[,-c(1:7)]
dim(training)
## [1] 19622 53
# how many samples we have for each classe
table(training$classe)
##
## A B C D E
## 5580 3797 3422 3216 3607
There are 19622 experiments with 53 variables for training and validation of our models, and 20 rows for testing.
The training set is used for training and for validation, in 75/25 proportion.
inTrain = createDataPartition(training$classe, p = 0.75)[[1]]
training_part = training[ inTrain,]
valid_part = training[-inTrain,]
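For intuition, a stratified split like the one above can be approximated in base R by sampling 75% of the rows within each class. This is a simplified stand-in on made-up labels, not caret's actual implementation of createDataPartition.

```r
set.seed(12345)

# Hypothetical class labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(40, 24, 16)))

# Sample 75% of the row indices within each class
rows_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(rows_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.75 * length(idx)))]
}))

train_labels <- classe[in_train]
valid_labels <- classe[-in_train]

# Class proportions are preserved in both parts
print(prop.table(table(train_labels)))
print(prop.table(table(valid_labels)))
```

Sampling within each class keeps the class balance of the validation part close to that of the training part, which matters here because class A is noticeably larger than the others.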
A classification tree model is created using the training partition (training_part). The tree is plotted.
model_CT <- train(classe~., data=training_part, method="rpart")
fancyRpartPlot(model_CT$finalModel)
We predict values using the validation set and calculate the confusion matrix with the accuracy results.
predict_validation<- predict(model_CT, newdata = valid_part)
cm_ct<-confusionMatrix(predict_validation,valid_part$classe)
# overall accuracy of the classification tree on the validation set
cm_ct$overall['Accuracy']
The accuracy is low: 49%, with a 95% CI of (48%, 50%).
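Accuracy is the fraction of validation cases on the diagonal of the confusion table. The sketch below computes it, together with an exact binomial 95% interval from binom.test, on a made-up 3x3 table. The counts are hypothetical and are not the results of the classification tree.

```r
# Hypothetical confusion table: rows = predicted class, columns = true class
cm <- matrix(
  c(50,  5,  5,
    10, 30, 10,
     5,  5, 30),
  nrow = 3, byrow = TRUE,
  dimnames = list(Prediction = c("A", "B", "C"), Reference = c("A", "B", "C"))
)

correct  <- sum(diag(cm))   # correctly classified cases
total    <- sum(cm)         # all validation cases
accuracy <- correct / total

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

print(accuracy)
print(ci)
```

The interval narrows as the number of validation cases grows, which is why a few thousand validation rows give a CI only about one percentage point wide on each side.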
We create a new model using random forest. As the training would be very slow, I follow the instructions at the following link: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md. A cluster is created and the resampling method is changed to k-fold cross-validation with number=5.
# use 5-fold cross-validation to reduce the training time
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
trainControl_function <-trainControl(method = "cv",number = 5, allowParallel = TRUE)
model_rf <- train(classe~., data=training_part, method="rf",trControl = trainControl_function)
print(model_rf$finalModel)
The parallel processing is stopped.
stopCluster(cluster)
registerDoSEQ()
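Cluster setup and teardown can also be wrapped so that the workers are released even if training fails. The sketch below uses only the parallel package and a trivial workload in place of train(); the helper name with_cluster is hypothetical, and the worker count is capped at 2 only to keep the demo light.

```r
library(parallel)

# Run a function with a temporary cluster and always stop the cluster afterwards
with_cluster <- function(fun) {
  n_cores <- detectCores()
  if (is.na(n_cores)) n_cores <- 2              # detectCores() can return NA
  n_workers <- min(2, max(1, n_cores - 1))      # leave one core free, at least one worker
  cluster <- makeCluster(n_workers)
  on.exit(stopCluster(cluster), add = TRUE)     # runs even if fun() raises an error
  fun(cluster)
}

# Trivial stand-in workload: square the numbers 1..4 on the workers
squares <- with_cluster(function(cl) parLapply(cl, 1:4, function(x) x^2))
print(unlist(squares))
```

With doParallel, the same pattern would call registerDoParallel(cluster) inside the helper and registerDoSEQ() in a second on.exit() call.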
predict_validation_rf<- predict(model_rf, newdata = valid_part)
cm_rf<-confusionMatrix(predict_validation_rf,valid_part$classe)
cm_rf$overall['Accuracy']
## Accuracy
## 0.9946982
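The accuracy on the validation set is 99.5%, which is high enough to predict the 20 test cases. Assuming independent predictions, being 95% confident that all 20 predictions are correct requires an accuracy of about 99.7% (0.95^(1/20)); with an accuracy of 99.5% the probability of getting all 20 right is about 90%. The required accuracy is discussed in the following entry: https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-requiredModelAccuracy.md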
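The relationship between per-case accuracy and the chance of getting all 20 cases right can be checked directly. The sketch assumes that the 20 predictions are independent and that the validation accuracy is a good estimate of the true accuracy; both are assumptions.

```r
accuracy <- 0.9946982   # validation accuracy reported above
n_cases  <- 20

# Probability that all 20 independent predictions are correct
p_all_correct <- accuracy^n_cases

# Per-case accuracy needed for a 95% chance of 20 correct predictions
required_accuracy <- 0.95^(1 / n_cases)

print(round(p_all_correct, 3))
print(round(required_accuracy, 4))
```

Small differences in per-case accuracy compound quickly over 20 cases, which is why the classification tree is unusable for this task while the random forest is adequate.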
varImp(model_rf)
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 100.00
## pitch_forearm 61.39
## yaw_belt 56.38
## magnet_dumbbell_y 45.55
## pitch_belt 43.89
## roll_forearm 43.45
## magnet_dumbbell_z 42.75
## accel_dumbbell_y 22.32
## accel_forearm_x 17.75
## roll_dumbbell 16.94
## magnet_dumbbell_x 16.24
## magnet_belt_z 15.04
## accel_belt_z 14.81
## total_accel_dumbbell 13.89
## magnet_forearm_z 13.86
## accel_dumbbell_z 13.10
## magnet_belt_y 11.84
## yaw_arm 11.13
## gyros_belt_z 10.86
## magnet_belt_x 10.69
The random forest model is now used to predict the manner in which the participants performed the exercise in the 20 test cases. The final results are saved in a file.
predict_test<- predict(model_rf, testing)
predict_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
write.csv(predict_test, file = "predict_test.csv")