Data Loading

Loading data

getwd()
## [1] "/Users/seanmurphy/Desktop/DS_coursera/Human-Activity-Recognition-Modelling"
if (!file.exists("data")){dir.create("data")}
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(trainUrl, destfile="./data/train.csv", method="curl")
download.file(testUrl, destfile="./data/test.csv", method="curl")
list.files("./data")
## [1] "test.csv"  "train.csv"
dateDownloaded <- date()
dateDownloaded
## [1] "Fri Nov 29 16:20:17 2024"

Reading data and loading libraries

library(ggplot2)
library(readr)
library(caret)
## Loading required package: lattice
library(Hmisc)
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
library(e1071)
## 
## Attaching package: 'e1071'
## The following object is masked from 'package:Hmisc':
## 
##     impute
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:caret':
## 
##     precision, recall
library(patchwork)
trainData <- read.csv("./data/train.csv")
testData <- read.csv("./data/test.csv")
head(trainData)
##   X user_name raw_timestamp_part_1 raw_timestamp_part_2   cvtd_timestamp
## 1 1  carlitos           1323084231               788290 05/12/2011 11:23
## 2 2  carlitos           1323084231               808298 05/12/2011 11:23
## 3 3  carlitos           1323084231               820366 05/12/2011 11:23
## 4 4  carlitos           1323084232               120339 05/12/2011 11:23
## 5 5  carlitos           1323084232               196328 05/12/2011 11:23
## 6 6  carlitos           1323084232               304277 05/12/2011 11:23
##   new_window num_window roll_belt pitch_belt yaw_belt total_accel_belt
## 1         no         11      1.41       8.07    -94.4                3
## 2         no         11      1.41       8.07    -94.4                3
## 3         no         11      1.42       8.07    -94.4                3
## 4         no         12      1.48       8.05    -94.4                3
## 5         no         12      1.48       8.07    -94.4                3
## 6         no         12      1.45       8.06    -94.4                3
##   kurtosis_roll_belt kurtosis_picth_belt kurtosis_yaw_belt skewness_roll_belt
## 1                                                                            
## 2                                                                            
## 3                                                                            
## 4                                                                            
## 5                                                                            
## 6                                                                            
##   skewness_roll_belt.1 skewness_yaw_belt max_roll_belt max_picth_belt
## 1                                                   NA             NA
## 2                                                   NA             NA
## 3                                                   NA             NA
## 4                                                   NA             NA
## 5                                                   NA             NA
## 6                                                   NA             NA
##   max_yaw_belt min_roll_belt min_pitch_belt min_yaw_belt amplitude_roll_belt
## 1                         NA             NA                               NA
## 2                         NA             NA                               NA
## 3                         NA             NA                               NA
## 4                         NA             NA                               NA
## 5                         NA             NA                               NA
## 6                         NA             NA                               NA
##   amplitude_pitch_belt amplitude_yaw_belt var_total_accel_belt avg_roll_belt
## 1                   NA                                      NA            NA
## 2                   NA                                      NA            NA
## 3                   NA                                      NA            NA
## 4                   NA                                      NA            NA
## 5                   NA                                      NA            NA
## 6                   NA                                      NA            NA
##   stddev_roll_belt var_roll_belt avg_pitch_belt stddev_pitch_belt
## 1               NA            NA             NA                NA
## 2               NA            NA             NA                NA
## 3               NA            NA             NA                NA
## 4               NA            NA             NA                NA
## 5               NA            NA             NA                NA
## 6               NA            NA             NA                NA
##   var_pitch_belt avg_yaw_belt stddev_yaw_belt var_yaw_belt gyros_belt_x
## 1             NA           NA              NA           NA         0.00
## 2             NA           NA              NA           NA         0.02
## 3             NA           NA              NA           NA         0.00
## 4             NA           NA              NA           NA         0.02
## 5             NA           NA              NA           NA         0.02
## 6             NA           NA              NA           NA         0.02
##   gyros_belt_y gyros_belt_z accel_belt_x accel_belt_y accel_belt_z
## 1         0.00        -0.02          -21            4           22
## 2         0.00        -0.02          -22            4           22
## 3         0.00        -0.02          -20            5           23
## 4         0.00        -0.03          -22            3           21
## 5         0.02        -0.02          -21            2           24
## 6         0.00        -0.02          -21            4           21
##   magnet_belt_x magnet_belt_y magnet_belt_z roll_arm pitch_arm yaw_arm
## 1            -3           599          -313     -128      22.5    -161
## 2            -7           608          -311     -128      22.5    -161
## 3            -2           600          -305     -128      22.5    -161
## 4            -6           604          -310     -128      22.1    -161
## 5            -6           600          -302     -128      22.1    -161
## 6             0           603          -312     -128      22.0    -161
##   total_accel_arm var_accel_arm avg_roll_arm stddev_roll_arm var_roll_arm
## 1              34            NA           NA              NA           NA
## 2              34            NA           NA              NA           NA
## 3              34            NA           NA              NA           NA
## 4              34            NA           NA              NA           NA
## 5              34            NA           NA              NA           NA
## 6              34            NA           NA              NA           NA
##   avg_pitch_arm stddev_pitch_arm var_pitch_arm avg_yaw_arm stddev_yaw_arm
## 1            NA               NA            NA          NA             NA
## 2            NA               NA            NA          NA             NA
## 3            NA               NA            NA          NA             NA
## 4            NA               NA            NA          NA             NA
## 5            NA               NA            NA          NA             NA
## 6            NA               NA            NA          NA             NA
##   var_yaw_arm gyros_arm_x gyros_arm_y gyros_arm_z accel_arm_x accel_arm_y
## 1          NA        0.00        0.00       -0.02        -288         109
## 2          NA        0.02       -0.02       -0.02        -290         110
## 3          NA        0.02       -0.02       -0.02        -289         110
## 4          NA        0.02       -0.03        0.02        -289         111
## 5          NA        0.00       -0.03        0.00        -289         111
## 6          NA        0.02       -0.03        0.00        -289         111
##   accel_arm_z magnet_arm_x magnet_arm_y magnet_arm_z kurtosis_roll_arm
## 1        -123         -368          337          516                  
## 2        -125         -369          337          513                  
## 3        -126         -368          344          513                  
## 4        -123         -372          344          512                  
## 5        -123         -374          337          506                  
## 6        -122         -369          342          513                  
##   kurtosis_picth_arm kurtosis_yaw_arm skewness_roll_arm skewness_pitch_arm
## 1                                                                         
## 2                                                                         
## 3                                                                         
## 4                                                                         
## 5                                                                         
## 6                                                                         
##   skewness_yaw_arm max_roll_arm max_picth_arm max_yaw_arm min_roll_arm
## 1                            NA            NA          NA           NA
## 2                            NA            NA          NA           NA
## 3                            NA            NA          NA           NA
## 4                            NA            NA          NA           NA
## 5                            NA            NA          NA           NA
## 6                            NA            NA          NA           NA
##   min_pitch_arm min_yaw_arm amplitude_roll_arm amplitude_pitch_arm
## 1            NA          NA                 NA                  NA
## 2            NA          NA                 NA                  NA
## 3            NA          NA                 NA                  NA
## 4            NA          NA                 NA                  NA
## 5            NA          NA                 NA                  NA
## 6            NA          NA                 NA                  NA
##   amplitude_yaw_arm roll_dumbbell pitch_dumbbell yaw_dumbbell
## 1                NA      13.05217      -70.49400    -84.87394
## 2                NA      13.13074      -70.63751    -84.71065
## 3                NA      12.85075      -70.27812    -85.14078
## 4                NA      13.43120      -70.39379    -84.87363
## 5                NA      13.37872      -70.42856    -84.85306
## 6                NA      13.38246      -70.81759    -84.46500
##   kurtosis_roll_dumbbell kurtosis_picth_dumbbell kurtosis_yaw_dumbbell
## 1                                                                     
## 2                                                                     
## 3                                                                     
## 4                                                                     
## 5                                                                     
## 6                                                                     
##   skewness_roll_dumbbell skewness_pitch_dumbbell skewness_yaw_dumbbell
## 1                                                                     
## 2                                                                     
## 3                                                                     
## 4                                                                     
## 5                                                                     
## 6                                                                     
##   max_roll_dumbbell max_picth_dumbbell max_yaw_dumbbell min_roll_dumbbell
## 1                NA                 NA                                 NA
## 2                NA                 NA                                 NA
## 3                NA                 NA                                 NA
## 4                NA                 NA                                 NA
## 5                NA                 NA                                 NA
## 6                NA                 NA                                 NA
##   min_pitch_dumbbell min_yaw_dumbbell amplitude_roll_dumbbell
## 1                 NA                                       NA
## 2                 NA                                       NA
## 3                 NA                                       NA
## 4                 NA                                       NA
## 5                 NA                                       NA
## 6                 NA                                       NA
##   amplitude_pitch_dumbbell amplitude_yaw_dumbbell total_accel_dumbbell
## 1                       NA                                          37
## 2                       NA                                          37
## 3                       NA                                          37
## 4                       NA                                          37
## 5                       NA                                          37
## 6                       NA                                          37
##   var_accel_dumbbell avg_roll_dumbbell stddev_roll_dumbbell var_roll_dumbbell
## 1                 NA                NA                   NA                NA
## 2                 NA                NA                   NA                NA
## 3                 NA                NA                   NA                NA
## 4                 NA                NA                   NA                NA
## 5                 NA                NA                   NA                NA
## 6                 NA                NA                   NA                NA
##   avg_pitch_dumbbell stddev_pitch_dumbbell var_pitch_dumbbell avg_yaw_dumbbell
## 1                 NA                    NA                 NA               NA
## 2                 NA                    NA                 NA               NA
## 3                 NA                    NA                 NA               NA
## 4                 NA                    NA                 NA               NA
## 5                 NA                    NA                 NA               NA
## 6                 NA                    NA                 NA               NA
##   stddev_yaw_dumbbell var_yaw_dumbbell gyros_dumbbell_x gyros_dumbbell_y
## 1                  NA               NA                0            -0.02
## 2                  NA               NA                0            -0.02
## 3                  NA               NA                0            -0.02
## 4                  NA               NA                0            -0.02
## 5                  NA               NA                0            -0.02
## 6                  NA               NA                0            -0.02
##   gyros_dumbbell_z accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z
## 1             0.00             -234               47             -271
## 2             0.00             -233               47             -269
## 3             0.00             -232               46             -270
## 4            -0.02             -232               48             -269
## 5             0.00             -233               48             -270
## 6             0.00             -234               48             -269
##   magnet_dumbbell_x magnet_dumbbell_y magnet_dumbbell_z roll_forearm
## 1              -559               293               -65         28.4
## 2              -555               296               -64         28.3
## 3              -561               298               -63         28.3
## 4              -552               303               -60         28.1
## 5              -554               292               -68         28.0
## 6              -558               294               -66         27.9
##   pitch_forearm yaw_forearm kurtosis_roll_forearm kurtosis_picth_forearm
## 1         -63.9        -153                                             
## 2         -63.9        -153                                             
## 3         -63.9        -152                                             
## 4         -63.9        -152                                             
## 5         -63.9        -152                                             
## 6         -63.9        -152                                             
##   kurtosis_yaw_forearm skewness_roll_forearm skewness_pitch_forearm
## 1                                                                  
## 2                                                                  
## 3                                                                  
## 4                                                                  
## 5                                                                  
## 6                                                                  
##   skewness_yaw_forearm max_roll_forearm max_picth_forearm max_yaw_forearm
## 1                                    NA                NA                
## 2                                    NA                NA                
## 3                                    NA                NA                
## 4                                    NA                NA                
## 5                                    NA                NA                
## 6                                    NA                NA                
##   min_roll_forearm min_pitch_forearm min_yaw_forearm amplitude_roll_forearm
## 1               NA                NA                                     NA
## 2               NA                NA                                     NA
## 3               NA                NA                                     NA
## 4               NA                NA                                     NA
## 5               NA                NA                                     NA
## 6               NA                NA                                     NA
##   amplitude_pitch_forearm amplitude_yaw_forearm total_accel_forearm
## 1                      NA                                        36
## 2                      NA                                        36
## 3                      NA                                        36
## 4                      NA                                        36
## 5                      NA                                        36
## 6                      NA                                        36
##   var_accel_forearm avg_roll_forearm stddev_roll_forearm var_roll_forearm
## 1                NA               NA                  NA               NA
## 2                NA               NA                  NA               NA
## 3                NA               NA                  NA               NA
## 4                NA               NA                  NA               NA
## 5                NA               NA                  NA               NA
## 6                NA               NA                  NA               NA
##   avg_pitch_forearm stddev_pitch_forearm var_pitch_forearm avg_yaw_forearm
## 1                NA                   NA                NA              NA
## 2                NA                   NA                NA              NA
## 3                NA                   NA                NA              NA
## 4                NA                   NA                NA              NA
## 5                NA                   NA                NA              NA
## 6                NA                   NA                NA              NA
##   stddev_yaw_forearm var_yaw_forearm gyros_forearm_x gyros_forearm_y
## 1                 NA              NA            0.03            0.00
## 2                 NA              NA            0.02            0.00
## 3                 NA              NA            0.03           -0.02
## 4                 NA              NA            0.02           -0.02
## 5                 NA              NA            0.02            0.00
## 6                 NA              NA            0.02           -0.02
##   gyros_forearm_z accel_forearm_x accel_forearm_y accel_forearm_z
## 1           -0.02             192             203            -215
## 2           -0.02             192             203            -216
## 3            0.00             196             204            -213
## 4            0.00             189             206            -214
## 5           -0.02             189             206            -214
## 6           -0.03             193             203            -215
##   magnet_forearm_x magnet_forearm_y magnet_forearm_z classe
## 1              -17              654              476      A
## 2              -18              661              473      A
## 3              -18              658              469      A
## 4              -16              658              469      A
## 5              -17              655              473      A
## 6               -9              660              478      A
levels(trainData$classe)
## NULL
trainData$classe <- as.factor(trainData$classe)
dim(trainData) #19622, 160
## [1] 19622   160
table(trainData$classe) #the data are imbalanced toward classe A; classe B is slightly larger than C, D and E
## 
##    A    B    C    D    E 
## 5580 3797 3422 3216 3607
#Fixing imbalance
set.seed(123)
#Imbalance ratio
imbalanceRatio <- table(trainData$classe)["A"] / mean(table(trainData$classe)[c("B", "C", "D", "E")])
imbalanceRatio
##        A 
## 1.589517
class_weights <- c(
  "A" = 1.589517,  # This class is 1.589517 times larger than the average
  "B" = 1,         # Set weight of 1 for the average classes
  "C" = 1,
  "D" = 1,
  "E" = 1
)
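
As a brief sketch (keeping the weighting scheme above), the same vector could also be built programmatically from the class counts rather than hard-coded, so that it stays correct if the training data changes:

# Illustrative alternative: derive the class weights from the observed class counts
classCounts <- table(trainData$classe)
class_weights_auto <- setNames(rep(1, length(classCounts)), names(classCounts))
class_weights_auto["A"] <- as.numeric(classCounts["A"] / mean(classCounts[c("B", "C", "D", "E")]))
class_weights_auto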

Data Partitioning

inBuild <- createDataPartition(y=trainData$classe,
                               p=0.7, list=FALSE)
head(inBuild)
##      Resample1
## [1,]         1
## [2,]         2
## [3,]         5
## [4,]         7
## [5,]         9
## [6,]        10
validation <- trainData[-inBuild, ]; buildData <- trainData[inBuild,]

inTrain <- createDataPartition(y=buildData$classe,
                               p=0.7, list=FALSE)
training <- buildData[inTrain, ]; testing <- buildData[-inTrain, ]

dim(training) # 9619, 160
## [1] 9619  160
dim(testing) # 4118, 160
## [1] 4118  160
dim(validation) # 5885, 160
## [1] 5885  160
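
Because createDataPartition() stratifies on the outcome, the class proportions should be almost identical across the three splits. A quick optional check (a sketch, not part of the original analysis):

# Compare class proportions across the training, testing and validation splits
round(rbind(training   = prop.table(table(training$classe)),
            testing    = prop.table(table(testing$classe)),
            validation = prop.table(table(validation$classe))), 3)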

Exploratory Data Analysis

BELT

p2 <- qplot(classe, roll_belt, data=training,
            geom=c("boxplot"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
p3 <- qplot(classe, pitch_belt, data=training,
            geom=c("boxplot"))
p4 <- qplot(classe, yaw_belt, data=training,
            geom=c("boxplot"))
p5 <- qplot(classe, total_accel_belt, data=training,
            geom=c("boxplot"))
(p2 | p3) / (p4 | p5) 

For pitch_belt, the medians for classe A through classe E all sit around zero. However, the medians for roll_belt, yaw_belt and total_accel_belt increase as movement form worsens from classe A to classe E. This suggests that higher roll_belt, yaw_belt and total_accel_belt values are associated with poorer movement form, with classe D a possible exception (or outlier).

ARM

a1 <- qplot(classe, roll_arm, data=training,
            geom=c("boxplot"))
a2 <- qplot(classe, pitch_arm, data=training,
            geom=c("boxplot"))
a3 <- qplot(classe, yaw_arm, data=training,
            geom=c("boxplot"))
a4 <- qplot(classe, total_accel_arm, data=training,
            geom=c("boxplot"))
(a1 | a2) / (a3 | a4)

Three of the four charts above highlight differences between classe A and the remaining classes. roll_arm, pitch_arm and total_accel_arm all show stark differences between classe A and classes B through E, indicating that certain variables will help predict exercise performance.

FOREARM

f1 <- qplot(classe, roll_forearm, data=training,
            geom=c("boxplot"))
f2 <- qplot(classe, pitch_forearm, data=training,
            geom=c("boxplot"))
f3 <- qplot(classe, yaw_forearm, data=training,
            geom=c("boxplot"))
f4 <- qplot(classe, total_accel_forearm, data=training,
            geom=c("boxplot"))
(f1 | f2) / (f3 | f4)

The pitch_forearm and total_accel_forearm variables highlight slight differences between classe A and the remaining classes. The classe A median is around zero for pitch_forearm and is the lowest for total_accel_forearm, while the medians for the remaining four classes are higher, making pitch_forearm and total_accel_forearm potential predictors of exercise performance.

DUMBBELL

d1 <- qplot(classe, roll_dumbbell, data=training,
            geom=c("boxplot"))
d2 <- qplot(classe, pitch_dumbbell, data=training,
            geom=c("boxplot"))
d3 <- qplot(classe, yaw_dumbbell, data=training,
            geom=c("boxplot"))
d4 <- qplot(classe, total_accel_dumbbell, data=training,
            geom=c("boxplot"))
(d1 | d2) / (d3 | d4)

There is no discernible trend or pattern in the dumbbell sensors themselves. Classe A shares a median with a different classe in nearly every plot: for pitch_dumbbell, classe A and classe C both have medians slightly below zero, and for yaw_dumbbell, classe A and classe E are roughly equal at slightly below zero.

EDA Conclusion

The EDA phase has highlighted that certain variables differ drastically across classes A through E. In addition, there is a very large number of potential predictors (159), and reducing them to those that are actually important will be essential for both model performance and computational cost. Pre-processing will therefore be an important step in the modelling phase.

Preprocessing

## NSV
nsv <- nearZeroVar(training)
nsvTraining <- training[, -nsv]


## Median Impute
preProc <- preProcess(nsvTraining, method="medianImpute")
miTraining <- predict(preProc, nsvTraining)

dim(miTraining) # 54 variables removed (160 -> 106)
## [1] 9619  106
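
Note that both nsv and preProc were fitted on the training split only. As a sketch (not run as part of this analysis), the same objects could be applied to the other splits so that all three sets keep exactly the same columns, which would avoid the missing-variable issue handled later for the validation set:

# Apply the training-set column selection and imputation model to another split
miTesting_alt    <- predict(preProc, testing[, -nsv])
miValidation_alt <- predict(preProc, validation[, -nsv])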

K-nearest neighbour model with principal component analysis

# Assign the appropriate weight to each row based on its class
weights_vector_train <- class_weights[as.character(miTraining$classe)]

# Check if the length of weights_vector matches the number of rows in miTraining
length(weights_vector_train) 
## [1] 9619
set.seed(123)
#Knn with principal component analysis 
knnFit <- train(classe~., 
                data=miTraining, 
                method="knn", 
                preProcess=c("pca"),
                trControl=trainControl(method="cv"),
                weights= weights_vector_train,
                tuneLength=10
                )
print(knnFit)
## k-Nearest Neighbors 
## 
## 9619 samples
##  105 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (127), centered
##  (127), scaled (127) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 8658, 8656, 8658, 8657, 8656, 8657, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9539467  0.9417290
##    7  0.9411592  0.9255400
##    9  0.9318026  0.9136805
##   11  0.9270195  0.9076367
##   13  0.9183932  0.8967191
##   15  0.9134030  0.8904208
##   17  0.9093469  0.8852670
##   19  0.9038389  0.8782781
##   21  0.8989523  0.8720775
##   23  0.8934417  0.8651185
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

The best model used k = 5, balancing bias and variance effectively. The model began over-smoothing as k increased, as shown by the steadily decreasing accuracy. The k = 5 model had an accuracy of 0.9540 and a Kappa of 0.9417, indicating that it is highly accurate well beyond chance agreement.
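
The over-smoothing trend described above can be visualised with caret's plot method for train objects (a one-line sketch):

plot(knnFit)  # cross-validated accuracy against the number of neighbours k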

Performance metrics

confusionMatrix(knnFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 27.9  0.8  0.1  0.0  0.0
##          B  0.3 17.6  0.4  0.0  0.1
##          C  0.2  0.8 16.6  0.6  0.1
##          D  0.0  0.1  0.3 15.6  0.4
##          E  0.0  0.0  0.0  0.2 17.8
##                             
##  Accuracy (average) : 0.9539
knnMatrix <- as.table(matrix(c(28.2,  0.3,  0.1,  0.0,  0.0,
 0.1, 18.5,  0.2,  0.0,  0.1,
 0.1,  0.4, 17.0,  0.3,  0.0,
 0.0,  0.0,  0.1, 16.0,  0.1,
 0.0,  0.0,  0.0,  0.1, 18.2), nrow=5, byrow=TRUE))
knnPrecision <- diag(knnMatrix) / rowSums(knnMatrix)
knnRecall <- diag(knnMatrix) / colSums(knnMatrix)
knnF1 <- 2 * (knnPrecision * knnRecall) / (knnPrecision + knnRecall)

knnResults <- data.frame(knnPrecision, knnRecall, knnF1)
print(knnResults)
##   knnPrecision knnRecall     knnF1
## A    0.9860140 0.9929577 0.9894737
## B    0.9788360 0.9635417 0.9711286
## C    0.9550562 0.9770115 0.9659091
## D    0.9876543 0.9756098 0.9815951
## E    0.9945355 0.9891304 0.9918256

The K-nearest neighbour model with principal component analysis pre-processing identifies most positive cases, and of those predicted positives, the majority are correct. The high F1 scores also demonstrate balanced performance between precision and recall, indicating that this is an accurate classifier of the different classes in the data.
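
As a sketch, the same per-class metrics could be computed directly from the cross-validated confusion matrix object rather than re-typing the cell counts, removing any risk of transcription mismatches:

# Per-class precision, recall and F1 straight from the cross-validated confusion matrix
cvTab <- confusionMatrix(knnFit)$table
knnMetricsAlt <- data.frame(precision = diag(cvTab) / rowSums(cvTab),
                            recall    = diag(cvTab) / colSums(cvTab))
knnMetricsAlt$f1 <- 2 * knnMetricsAlt$precision * knnMetricsAlt$recall /
                    (knnMetricsAlt$precision + knnMetricsAlt$recall)
knnMetricsAlt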

Classification Tree Model

classFit <- train(classe ~ ., 
                  method="rpart", 
                  data=miTraining,
                  preProc=c("center", "scale"),
                  trControl=trainControl(method="cv"),
                  weights=weights_vector_train
                  )
print(classFit)
## CART 
## 
## 9619 samples
##  105 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: centered (127), scaled (127) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 8657, 8658, 8658, 8658, 8658, 8656, ... 
## Resampling results across tuning parameters:
## 
##   cp         Accuracy   Kappa     
##   0.2437536  0.7134786  0.63563637
##   0.2568274  0.5325015  0.40208044
##   0.2703370  0.3420693  0.09905888
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.2437536.

A classification tree was a less accurate predictor of exercise performance than the k-Nearest Neighbour model. The best tree was 0.7135 (71.35%) accurate, at a complexity parameter of 0.2438, where the model best balanced accuracy against simplicity. Its Kappa of 0.6356 indicates substantial agreement between predicted and observed classes after accounting for agreement expected by chance.
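
Since interpretability is the main appeal of a tree model (see the conclusion), the fitted tree can be drawn directly from the underlying rpart object; a minimal sketch:

# Plot the final pruned tree and label its splits
plot(classFit$finalModel, uniform = TRUE, margin = 0.1)
text(classFit$finalModel, cex = 0.7)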

Performance metrics

confusionMatrix(classFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.4  0.0  0.0  0.0  0.0
##          B  0.0 19.3  0.0  0.0  0.0
##          C  0.0  0.0  5.2  4.9  0.0
##          D  0.0  0.0  0.0  0.0  0.0
##          E  0.0  0.0 12.2 11.5 18.4
##                             
##  Accuracy (average) : 0.7135
classMatrix <- as.table(matrix(c(28.4,  0.0,  0.0,  0.0,  0.0,
          0.0, 19.4,  0.0,  0.0 , 0.0,
          0.0 , 0.0, 10.5,  9.8 , 0.0,
          0.0,  0.0,  0.0,  0.0 , 0.0,
          0.0,  0.0,  7.0,  6.5,  18.4), nrow=5, byrow=TRUE))
classPrecision <- diag(classMatrix) / rowSums(classMatrix)
classRecall <- diag(classMatrix) / colSums(classMatrix)
classF1 <- 2 * (classPrecision * classRecall) / (classPrecision + classRecall)

classResults <- data.frame(classPrecision, classRecall, classF1)
print(classResults)
##   classPrecision classRecall   classF1
## A      1.0000000         1.0 1.0000000
## B      1.0000000         1.0 1.0000000
## C      0.5172414         0.6 0.5555556
## D            NaN         0.0       NaN
## E      0.5768025         1.0 0.7316103

The classification tree also shows worrying signs in its precision, recall and F1 scores. The model identifies classe A and classe B well. However, it never predicts classe D at all (hence the NaN precision and zero recall), and a large share of classe C and classe D observations are misclassified as classe E. As a result, classe E achieves perfect recall but poor precision, and classe C fares only slightly better.

Support Vector Machine Model

svmFit <- train(classe~., 
              data=miTraining, 
              method="svmRadial",
              preProcess=c("pca"),
              trControl=trainControl(method="cv"),
              weights=weights_vector_train,
              tuneLength=10
              )
print(svmFit)
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 9619 samples
##  105 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## Pre-processing: principal component signal extraction (127), centered
##  (127), scaled (127) 
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 8658, 8657, 8657, 8656, 8657, 8657, ... 
## Resampling results across tuning parameters:
## 
##   C       Accuracy   Kappa    
##     0.25  0.9004084  0.8740077
##     0.50  0.9228609  0.9023343
##     1.00  0.9436544  0.9286235
##     2.00  0.9627836  0.9528812
##     4.00  0.9786900  0.9730261
##     8.00  0.9844067  0.9802647
##    16.00  0.9865901  0.9830283
##    32.00  0.9869020  0.9834225
##    64.00  0.9870059  0.9835540
##   128.00  0.9871099  0.9836855
## 
## Tuning parameter 'sigma' was held constant at a value of 0.01515548
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.01515548 and C = 128.
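
The tuning profile can be inspected in the same way as for the kNN fit; the one-line sketch below plots cross-validated accuracy against the cost parameter:

plot(svmFit)  # cross-validated accuracy against cost C (sigma held constant)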

Performance metrics

confusionMatrix(svmFit)
## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.2  0.2  0.1  0.1  0.2
##          B  0.0 19.0  0.0  0.0  0.0
##          C  0.0  0.0 17.2  0.0  0.0
##          D  0.0  0.0  0.0 16.2  0.0
##          E  0.2  0.2  0.1  0.1 18.1
##                             
##  Accuracy (average) : 0.9871
svmMatrix <- as.table(matrix(c(28.3,  0.0,  0.0,  0.0,  0.0,
         0.0, 19.1,  0.0,  0.0,  0.0,
         0.0,  0.0, 17.3,  0.0,  0.0,
         0.0,  0.0,  0.0, 16.2,  0.0,
         0.1,  0.2, 0.1,  0.1, 18.3), nrow=5, byrow=TRUE))
svmPrecision <- diag(svmMatrix) / rowSums(svmMatrix)
svmRecall <- diag(svmMatrix) / colSums(svmMatrix)
svmF1 <- 2 * (svmPrecision * svmRecall) / (svmPrecision + svmRecall)

svmResults <- data.frame(svmPrecision, svmRecall, svmF1)
print(svmResults)
##   svmPrecision svmRecall     svmF1
## A    1.0000000 0.9964789 0.9982363
## B    1.0000000 0.9896373 0.9947917
## C    1.0000000 0.9942529 0.9971182
## D    1.0000000 0.9938650 0.9969231
## E    0.9734043 1.0000000 0.9865229

The SVM model appears to be the most accurate predictor of each individual class, with the highest cross-validated accuracy (0.9871 at C = 128), perfect precision for classes A through D, and very high precision for classe E. The F1 scores are also close to 1 for every classe.

The SVM model outperforms the kNN model on almost every metric; comparing F1 scores, the SVM more accurately identifies classes A, B, C and D, with classe E the only class on which the kNN performs (marginally) better. Both models can now be assessed on the validation dataset, and the best-performing model will then be evaluated against the test dataset to understand its performance on unseen data.
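
Before moving to the validation set, caret's resamples() can also compare the two fits on their cross-validation results; a brief sketch (illustrative only, since a strict comparison would require both models to be trained on the same fold indices via a shared trainControl):

# Summarise the cross-validated accuracy and kappa of both models side by side
cvCompare <- resamples(list(kNN = knnFit, SVM = svmFit))
summary(cvCompare)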

Model Validation

Preprocessing

dim(validation)
## [1] 5885  160
validation$classe <- as.factor(validation$classe)
nsvVal <- nearZeroVar(validation)
nsvValData <- validation[, -nsvVal]
preProcVal <- preProcess(nsvValData, method="medianImpute")
miValData <- predict(preProcVal, nsvValData)
dim(miValData) # 53 variables removed (160 -> 107)
## [1] 5885  107
apa <- na.omit(validation$amplitude_pitch_arm) # non-missing values of the dropped column
mapa <- median(apa)                            # median used for the impute
dim(miValData)
## [1] 5885  107
miValData$amplitude_pitch_arm <- rep(mapa, nrow(miValData)) # re-add the column both models expect

Validate on two best models

pred1 <- predict(knnFit, miValData) ; pred2 <- predict(svmFit, miValData)
## object 'amplitude_pitch_arm' not found: Corrected above
model1 <- confusionMatrix(pred1, miValData$classe)
model2 <- confusionMatrix(pred2, miValData$classe)

print("kNN Model")
## [1] "kNN Model"
model1$overall[c("Accuracy", "Kappa")]
##  Accuracy     Kappa 
## 0.9546304 0.9426166
print("SVM Model")
## [1] "SVM Model"
model2$overall[c("Accuracy", "Kappa")]
##  Accuracy     Kappa 
## 0.9757009 0.9692028
print("kNN vs SVM")
## [1] "kNN vs SVM"
difference_between_models <- model1$byClass - model2$byClass
difference_between_models[, c("Sensitivity", "Specificity", "Precision", "F1")]
##           Sensitivity  Specificity   Precision            F1
## Class: A -0.022700119  0.009498931  0.02152675  6.945822e-06
## Class: B -0.050043898 -0.008006743 -0.03476330 -4.262571e-02
## Class: C -0.032163743 -0.012348220 -0.05501026 -4.392818e-02
## Class: D  0.014522822 -0.010770169 -0.05450658 -1.900580e-02
## Class: E -0.009242144 -0.002706642 -0.01220127 -1.069698e-02

When the models were first run on the validation set, an error was raised about a missing variable (amplitude_pitch_arm). This variable had been removed during the validation pre-processing, but because both models were trained with it included, it was re-added using a median impute.

Running both models on the validation dataset confirms that the SVM outperforms the kNN model in terms of accuracy. The margin is modest (roughly 2.1 percentage points), but the SVM also does better on most of the per-class performance metrics (sensitivity, specificity, precision and F1 score), winning 16 of the 20 comparisons.

Model Testing

Preprocessing

dim(testing)
## [1] 4118  160
testing$classe <- as.factor(testing$classe)
nsvTest <- nearZeroVar(testing)
nsvTestData <- testing[, -nsvTest]
preProcTest <- preProcess(nsvTestData, method="medianImpute")
miTestData <- predict(preProcTest, nsvTestData)
dim(miTestData) # 34 variables removed this time
## [1] 4118  126
predTest <- predict(svmFit, miTestData)
testModel <- confusionMatrix(predTest, miTestData$classe)
testModel
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1171  244    7   10    9
##          B    0  553    5    0    2
##          C    0    0  705  283    0
##          D    0    0    0  381   13
##          E    0    0    1    1  733
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8604          
##                  95% CI : (0.8494, 0.8708)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8219          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.6939   0.9819  0.56444   0.9683
## Specificity            0.9084   0.9979   0.9168  0.99622   0.9994
## Pos Pred Value         0.8126   0.9875   0.7136  0.96701   0.9973
## Neg Pred Value         1.0000   0.9314   0.9958  0.92105   0.9929
## Prevalence             0.2844   0.1935   0.1744  0.16391   0.1838
## Detection Rate         0.2844   0.1343   0.1712  0.09252   0.1780
## Detection Prevalence   0.3499   0.1360   0.2399  0.09568   0.1785
## Balanced Accuracy      0.9542   0.8459   0.9493  0.78033   0.9839

However, when the SVM model is evaluated on further unseen data (the held-out testing split), its accuracy drops off markedly, to 86.04% (95% CI: 84.94% to 87.08%). The model is also a poor predictor of classe B and classe D, as shown by the low sensitivity scores for both classes (0.69 and 0.56 respectively).

predTestkNN <- predict(knnFit, miTestData)
testModelkNN <- confusionMatrix(predTestkNN, miTestData$classe)
testModelkNN
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1151   19   11    0    1
##          B   12  742   12    0    4
##          C    5   30  679   28    2
##          D    1    5   16  640   15
##          E    2    1    0    7  735
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9585          
##                  95% CI : (0.9519, 0.9644)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9475          
##                                           
##  Mcnemar's Test P-Value : 0.001901        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9829   0.9310   0.9457   0.9481   0.9709
## Specificity            0.9895   0.9916   0.9809   0.9893   0.9970
## Pos Pred Value         0.9738   0.9636   0.9126   0.9453   0.9866
## Neg Pred Value         0.9932   0.9836   0.9884   0.9898   0.9935
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2795   0.1802   0.1649   0.1554   0.1785
## Detection Prevalence   0.2870   0.1870   0.1807   0.1644   0.1809
## Balanced Accuracy      0.9862   0.9613   0.9633   0.9687   0.9840

As shown above, the kNN model proves the more reliable predictor overall when the two models are compared across the three data splits (training cross-validation, validation and testing). On the final testing split the kNN model achieved an accuracy of 95.85% (95% CI: 95.19% to 96.44%), and it is this performance on unseen data that is particularly impressive, where the SVM was far less consistent. The per-class sensitivity and specificity further highlight the kNN model's strength, with no sensitivity below 0.9310 and no specificity below 0.9809 for any individual class.

The models were built with multi-class classification in mind, which is why the kNN, classification tree and SVM models were chosen. Each was chosen for a different reason: the SVM works well with high-dimensional data, and with 159 candidate predictors and 19622 observations this seemed a good fit. In retrospect, the classification tree was a poor choice for this data, as the underlying structure is far from simple; the decision to use it was driven mainly by the interpretability of the tree itself. Finally, the kNN model was my first choice because of its simplicity and because of the exploratory data analysis: across the four sensor groups examined during the EDA, observations from the same classe grouped together quite clearly, which suggested a neighbour-based method would be a good fit.

In conclusion, the kNN model is a more accurate predictor of the classe variable than the SVM and classification tree models, with a test-set accuracy of 95.85% (95% confidence interval: 95.19% to 96.44%).