Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset). The goal is to predict the manner in which they did the exercise.
Let’s first load the libraries:
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(randomForest)
## Warning: package 'randomForest' was built under R version 4.0.2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(corrplot)
## Warning: package 'corrplot' was built under R version 4.0.2
## corrplot 0.84 loaded
library(FactoMineR)
## Warning: package 'FactoMineR' was built under R version 4.0.2
library(factoextra)
## Warning: package 'factoextra' was built under R version 4.0.2
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Then we download the training and the test data:
trainData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv")
testData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv")
dim(trainData)
## [1] 19622 160
dim(testData)
## [1] 20 160
The training data set contains 19622 observations and 160 variables, while the testing data set contains 20 observations and 160 variables. The “classe” variable in the training set is the outcome to predict. Columns whose variance is near zero do not have an impact on the outcome, so we remove them:
nonZeroValueColumns <- nearZeroVar(trainData)
trainSet <- trainData[,-nonZeroValueColumns]
testSet <- testData[,-nonZeroValueColumns]
dim(trainSet)
## [1] 19622 100
We now have only 100 columns. If we run head(testData), we can see that many columns contain missing values (NA). We will remove them as well:
trainSet <- trainSet[, colSums(is.na(trainSet)) == 0]
testSet <- testSet[, colSums(is.na(testSet)) == 0]
dim(trainSet)
## [1] 19622 59
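Because the NA filter was applied to the training and the testing set independently, it is worth checking that both sets keep the same predictor columns. A minimal sanity check (not part of the original report) could be:
# The only columns that should differ are the outcome ("classe", training set only)
# and the identifier ("problem_id", testing set only)
setdiff(names(trainSet), names(testSet))
setdiff(names(testSet), names(trainSet))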
We are now left with 59 columns:
colnames(trainSet)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "num_window"
## [7] "roll_belt" "pitch_belt" "yaw_belt"
## [10] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [13] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [16] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [19] "magnet_belt_z" "roll_arm" "pitch_arm"
## [22] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [25] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [28] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [31] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [34] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [37] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [40] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [43] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [46] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [49] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [52] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [55] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [58] "magnet_forearm_z" "classe"
By looking at the first values of the first five columns, we can see that they are only identifier columns. X, for instance, is just the row identifier:
head(trainSet$X)
## [1] 1 2 3 4 5 6
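For a broader view, we could also look at the first rows of all the leading columns (a quick check, not shown in the original output):
# Peek at the row id, user name, and timestamp columns
head(trainSet[, 1:5])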
As those columns are not supposed to have an impact on the outcome, we will remove them:
trainSet <- trainSet[,-(1:5)]
testSet <- testSet[,-(1:5)]
dim(trainSet)
## [1] 19622 54
We now have 54 variables, which is still a lot. We can use principal component analysis (PCA) to see whether the dimensionality can be reduced further:
res.pca <- PCA(trainSet[,-54], graph = FALSE)
eig.val <- get_eigenvalue(res.pca)
eig.val
## eigenvalue variance.percent cumulative.variance.percent
## Dim.1 8.356529571 15.76703693 15.76704
## Dim.2 8.157789204 15.39205510 31.15909
## Dim.3 4.680496137 8.83112479 39.99022
## Dim.4 4.129791792 7.79205998 47.78228
## Dim.5 3.653025252 6.89250048 54.67478
## Dim.6 3.032659710 5.72199945 60.39678
## Dim.7 2.251399768 4.24792409 64.64470
## Dim.8 2.074181477 3.91354996 68.55825
## Dim.9 1.725575131 3.25580213 71.81405
## Dim.10 1.508936773 2.84705051 74.66110
## Dim.11 1.396506582 2.63491808 77.29602
## Dim.12 1.151330391 2.17232149 79.46834
## Dim.13 1.044760623 1.97124646 81.43959
## Dim.14 0.944834998 1.78270754 83.22230
## Dim.15 0.885880543 1.67147272 84.89377
## Dim.16 0.805453931 1.51972440 86.41349
## Dim.17 0.727539733 1.37271648 87.78621
## Dim.18 0.677464870 1.27823560 89.06445
## Dim.19 0.600749371 1.13348938 90.19794
## Dim.20 0.528929457 0.99798011 91.19592
## Dim.21 0.481046223 0.90763438 92.10355
## Dim.22 0.417822308 0.78834398 92.89189
## Dim.23 0.389818097 0.73550584 93.62740
## Dim.24 0.382474044 0.72164914 94.34905
## Dim.25 0.334237040 0.63063593 94.97968
## Dim.26 0.305846885 0.57706959 95.55675
## Dim.27 0.290899728 0.54886741 96.10562
## Dim.28 0.255360613 0.48181248 96.58743
## Dim.29 0.233671949 0.44089047 97.02832
## Dim.30 0.203413037 0.38379818 97.41212
## Dim.31 0.179758581 0.33916713 97.75129
## Dim.32 0.170006297 0.32076660 98.07206
## Dim.33 0.131140769 0.24743541 98.31949
## Dim.34 0.121765495 0.22974622 98.54924
## Dim.35 0.112182353 0.21166482 98.76090
## Dim.36 0.091891312 0.17337983 98.93428
## Dim.37 0.079717397 0.15041018 99.08469
## Dim.38 0.063954533 0.12066893 99.20536
## Dim.39 0.056406751 0.10642783 99.31179
## Dim.40 0.055141837 0.10404120 99.41583
## Dim.41 0.040797623 0.07697665 99.49281
## Dim.42 0.037730125 0.07118892 99.56400
## Dim.43 0.035291970 0.06658862 99.63059
## Dim.44 0.033662153 0.06351350 99.69410
## Dim.45 0.031450615 0.05934078 99.75344
## Dim.46 0.028617470 0.05399523 99.80743
## Dim.47 0.026552215 0.05009852 99.85753
## Dim.48 0.021661889 0.04087149 99.89840
## Dim.49 0.020426583 0.03854072 99.93695
## Dim.50 0.013440216 0.02535890 99.96230
## Dim.51 0.011874676 0.02240505 99.98471
## Dim.52 0.005954973 0.01123580 99.99595
## Dim.53 0.002148928 0.00405458 100.00000
The variance explained by each eigenvalue of the covariance matrix is given in the second column. We can see that the first 12 components explain about 79% of the variance. We could use PCA, but since only a few variables are strongly correlated, we will instead build our model with a random forest: it automatically selects the most important variables and is robust to correlated covariates and outliers.
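The corrplot package loaded earlier can be used to check visually that only a few predictors are strongly correlated. A possible sketch (the plot is not rendered in the original report):
# Correlation matrix of the predictors (column 54 is the outcome "classe")
corMatrix <- cor(trainSet[, -54])
corrplot(corMatrix, method = "color", type = "lower", order = "hclust",
         tl.cex = 0.5, tl.col = "black")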
First, we will split the training data into a training set and a cross-validation set. The latter will be used to estimate the out-of-sample error.
set.seed(12345)
inTrain <- createDataPartition(trainSet$classe, p=0.70, list=F)
newTrainSet <- trainSet[inTrain, ]
validationSet <- trainSet[-inTrain, ]
Then, we fit the model:
controlRf <- trainControl(method="cv", 5)
modelRf <- train(classe ~ ., data=newTrainSet, method="rf", trControl=controlRf, ntree=250)
modelRf
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10987, 10990, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9937393 0.9920801
## 27 0.9963598 0.9953954
## 53 0.9943948 0.9929089
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
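Before evaluating the model on the validation set, we can also inspect which predictors the forest relies on most, which supports the choice of random forest made above (this check was not part of the original output):
# caret's varImp() extracts the random forest variable importance measures
importance <- varImp(modelRf)
plot(importance, top = 20)  # show the 20 most important predictors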
Then, we estimate the performance of the model on the validation data set.
predictRf <- predict(modelRf, validationSet)
confusionMatrix(table(validationSet$classe,predictRf))
## Confusion Matrix and Statistics
##
## predictRf
## A B C D E
## A 1674 0 0 0 0
## B 1 1138 0 0 0
## C 0 2 1024 0 0
## D 0 0 3 961 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.999
## 95% CI : (0.9978, 0.9996)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9987
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9982 0.9971 1.0000 1.0000
## Specificity 1.0000 0.9998 0.9996 0.9994 1.0000
## Pos Pred Value 1.0000 0.9991 0.9981 0.9969 1.0000
## Neg Pred Value 0.9998 0.9996 0.9994 1.0000 1.0000
## Prevalence 0.2846 0.1937 0.1745 0.1633 0.1839
## Detection Rate 0.2845 0.1934 0.1740 0.1633 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9997 0.9990 0.9983 0.9997 1.0000
The accuracy on the validation set is 99.9%, so the estimated out-of-sample error is about 0.1%.
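The expected out-of-sample error can also be computed directly from the validation-set accuracy, for example:
# Estimated out-of-sample error = 1 - validation accuracy
accuracy <- confusionMatrix(table(validationSet$classe, predictRf))$overall["Accuracy"]
1 - accuracy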
Now, we apply the model to the original testing data set downloaded from the data source. We remove the problem_id column first.
result <- predict(modelRf, testSet[, -length(names(testSet))])
result
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
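For readability, the predictions can be paired with their problem identifiers (a convenience step, not in the original report; testSet still contains the problem_id column, which was only excluded when predicting):
# Pair each prediction with its problem_id from the testing set
data.frame(problem_id = testSet$problem_id, prediction = result)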