Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). The goal is to predict the manner in which they did the exercise.
Let’s first load the libraries:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.2
## corrplot 0.84 loaded
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.0.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Then we download the training and the test data:
trainData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(trainData)
## [1] 19622 160
dim(testData)
## [1] 20 160
The training data set contains 19622 observations and 160 variables, while the testing data set contains 20 observations and 160 variables. The “classe” variable in the training set is the outcome to predict. Columns whose variance is near zero do not have an impact on the outcome, so we remove them:
nonZeroValueColumns <- nearZeroVar(trainData)
trainSet <- trainData[,-nonZeroValueColumns]
testSet <- testData[,-nonZeroValueColumns]
dim(trainSet)
## [1] 19622 100
We now have only 100 columns. If we run head(testData), we can see that many columns contain missing values (NA). We will remove them as well:
trainSet <- trainSet[, colSums(is.na(trainSet)) == 0]
testSet <- testSet[, colSums(is.na(testSet)) == 0]
dim(trainSet)
## [1] 19622 59
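Because the NA filter was applied to the training and the testing set independently, it is worth checking that both sets keep the same predictor columns. A minimal sanity check (not part of the original report) could be:
# The only columns that should differ are the outcome ("classe", training set only)
# and the identifier ("problem_id", testing set only)
setdiff(names(trainSet), names(testSet))
setdiff(names(testSet), names(trainSet))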
We are now left with 59 columns:
colnames(trainSet)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "num_window"
## [7] "roll_belt" "pitch_belt" "yaw_belt"
## [10] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [13] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [16] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [19] "magnet_belt_z" "roll_arm" "pitch_arm"
## [22] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [25] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [28] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [31] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [34] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [37] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [40] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [43] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [46] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [49] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [52] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [55] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [58] "magnet_forearm_z" "classe"
By looking at the first values of the first five columns, we can see that they are only identifier columns. X, for instance, is just the row identifier:
head(trainSet$X)
## [1] 1 2 3 4 5 6
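For a broader view, we could also look at the first rows of all the leading columns (a quick check, not shown in the original output):
# Peek at the row id, user name, and timestamp columns
head(trainSet[, 1:5])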
As those columns are not supposed to have an impact on the outcome, we will remove them:
trainSet <- trainSet[,-(1:5)]
testSet <- testSet[,-(1:5)]
dim(trainSet)
## [1] 19622 54
We now have 54 variables, which is still a lot. We can use principal component analysis (PCA) to see whether the dimensionality can be reduced further:
res.pca <- PCA(trainSet[,-54], graph = FALSE)
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 8.356529571 15.76703693 15.76704
## Dim.2 8.157789204 15.39205510 31.15909
## Dim.3 4.680496137 8.83112479 39.99022
## Dim.4 4.129791792 7.79205998 47.78228
## Dim.5 3.653025252 6.89250048 54.67478
## Dim.6 3.032659710 5.72199945 60.39678
## Dim.7 2.251399768 4.24792409 64.64470
## Dim.8 2.074181477 3.91354996 68.55825
## Dim.9 1.725575131 3.25580213 71.81405
## Dim.10 1.508936773 2.84705051 74.66110
## Dim.11 1.396506582 2.63491808 77.29602
## Dim.12 1.151330391 2.17232149 79.46834
## Dim.13 1.044760623 1.97124646 81.43959
## Dim.14 0.944834998 1.78270754 83.22230
## Dim.15 0.885880543 1.67147272 84.89377
## Dim.16 0.805453931 1.51972440 86.41349
## Dim.17 0.727539733 1.37271648 87.78621
## Dim.18 0.677464870 1.27823560 89.06445
## Dim.19 0.600749371 1.13348938 90.19794
## Dim.20 0.528929457 0.99798011 91.19592
## Dim.21 0.481046223 0.90763438 92.10355
## Dim.22 0.417822308 0.78834398 92.89189
## Dim.23 0.389818097 0.73550584 93.62740
## Dim.24 0.382474044 0.72164914 94.34905
## Dim.25 0.334237040 0.63063593 94.97968
## Dim.26 0.305846885 0.57706959 95.55675
## Dim.27 0.290899728 0.54886741 96.10562
## Dim.28 0.255360613 0.48181248 96.58743
## Dim.29 0.233671949 0.44089047 97.02832
## Dim.30 0.203413037 0.38379818 97.41212
## Dim.31 0.179758581 0.33916713 97.75129
## Dim.32 0.170006297 0.32076660 98.07206
## Dim.33 0.131140769 0.24743541 98.31949
## Dim.34 0.121765495 0.22974622 98.54924
## Dim.35 0.112182353 0.21166482 98.76090
## Dim.36 0.091891312 0.17337983 98.93428
## Dim.37 0.079717397 0.15041018 99.08469
## Dim.38 0.063954533 0.12066893 99.20536
## Dim.39 0.056406751 0.10642783 99.31179
## Dim.40 0.055141837 0.10404120 99.41583
## Dim.41 0.040797623 0.07697665 99.49281
## Dim.42 0.037730125 0.07118892 99.56400
## Dim.43 0.035291970 0.06658862 99.63059
## Dim.44 0.033662153 0.06351350 99.69410
## Dim.45 0.031450615 0.05934078 99.75344
## Dim.46 0.028617470 0.05399523 99.80743
## Dim.47 0.026552215 0.05009852 99.85753
## Dim.48 0.021661889 0.04087149 99.89840
## Dim.49 0.020426583 0.03854072 99.93695
## Dim.50 0.013440216 0.02535890 99.96230
## Dim.51 0.011874676 0.02240505 99.98471
## Dim.52 0.005954973 0.01123580 99.99595
## Dim.53 0.002148928 0.00405458 100.00000
The variance explained by each eigenvalue of the covariance matrix is given in the second column. We can see that the first 12 components explain about 79% of the variance. We could use PCA, but since only a few variables are strongly correlated, we will instead build our model with a random forest: it automatically selects the most important variables and is robust to correlated covariates and outliers.
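The corrplot package loaded earlier can be used to check visually that only a few predictors are strongly correlated. A possible sketch (the plot is not rendered in the original report):
# Correlation matrix of the predictors (column 54 is the outcome "classe")
corMatrix <- cor(trainSet[, -54])
corrplot(corMatrix, method = "color", type = "lower", order = "hclust",
         tl.cex = 0.5, tl.col = "black")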
First, we will split the training data into a training set and a cross-validation set. The latter will be used to estimate the out-of-sample error.
set.seed(12345)
inTrain <- createDataPartition(trainSet$classe, p=0.70, list=F)
newTrainSet <- trainSet[inTrain, ]
validationSet <- trainSet[-inTrain, ]
Then, we fit the model:
controlRf <- trainControl(method="cv", 5)
modelRf <- train(classe ~ ., data=newTrainSet, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10987, 10990, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9937393 0.9920801
## 27 0.9963598 0.9953954
## 53 0.9943948 0.9929089
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
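Before evaluating the model on the validation set, we can also inspect which predictors the forest relies on most, which supports the choice of random forest made above (this check was not part of the original output):
# caret's varImp() extracts the random forest variable importance measures
importance <- varImp(modelRf)
plot(importance, top = 20)  # show the 20 most important predictors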
Then, we estimate the performance of the model on the validation data set.
predictRf <- predict(modelRf, validationSet)
confusionMatrix(table(validationSet$classe,predictRf))
## Confusion Matrix and Statistics
##
## predictRf
## A B C D E
## A 1674 0 0 0 0
## B 1 1138 0 0 0
## C 0 2 1024 0 0
## D 0 0 3 961 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.9978, 0.9996)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9987
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9971 1.0000 1.0000
## Specificity 1.0000 0.9998 0.9996 0.9994 1.0000
## Pos Pred Value 1.0000 0.9991 0.9981 0.9969 1.0000
## Neg Pred Value 0.9998 0.9996 0.9994 1.0000 1.0000
## Prevalence 0.2846 0.1937 0.1745 0.1633 0.1839
## Detection Rate 0.2845 0.1934 0.1740 0.1633 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9997 0.9990 0.9983 0.9997 1.0000
The accuracy on the validation set is 99.9%, so the estimated out-of-sample error is about 0.1%.
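The expected out-of-sample error can also be computed directly from the validation-set accuracy, for example:
# Estimated out-of-sample error = 1 - validation accuracy
accuracy <- confusionMatrix(table(validationSet$classe, predictRf))$overall["Accuracy"]
1 - accuracy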
Now, we apply the model to the original testing data set downloaded from the data source. We remove the problem_id column first.
result <- predict(modelRf, testSet[, -length(names(testSet))])
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
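For readability, the predictions can be paired with their problem identifiers (a convenience step, not in the original report; testSet still contains the problem_id column, which was only excluded when predicting):
# Pair each prediction with its problem_id from the testing set
data.frame(problem_id = testSet$problem_id, prediction = result)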