library(psych)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(MASS)

Instructions

Read in the dataset.

Run kNN, Tree, NB, LDA, LR, and SVM with RBF kernel (60%) and determine the AUC, accuracy, TPR, and FPR for each algorithm; create a table as shown below.

ALGO   AUC   ACC   TPR   FPR
LR
LDA
NB
SVM
kNN
TREE

Rubric: 15%

Summarize and provide an explanatory commentary on the observed performance of these classifiers.

Rubric: 15%

(What aspects of the data and/or of the algorithms explain these performance differences?)

THESE ARE WICKED-SCIENCE PROBLEMS AND THERE IS NO PRECISE ANSWER. Please base your observations on common sense and facts, with appropriate citations.

I expect you to budget your time roughly as follows:

- gather and prepare to run these: 15 minutes
- load the data: 5 minutes
- run each model and gather performance measures: 5 to 10 minutes each
- organize and print the table: 15 minutes
- summary and analysis: 15 minutes (3 sentences per algorithm)

EDA

First, we read the sample labelled data into a dataframe.

label_df <- read.csv("HW2.csv")
head(label_df)
##   X       Y       label
## 1 5       a        BLUE
## 2 5       b       BLACK
## 3 5       c        BLUE
## 4 5       d       BLACK
## 5 5       e       BLACK
## 6 5       f       BLACK

We don’t know what this data represents. In a real-world project, understanding the source of the data is paramount. However, for this exercise’s sake, we’ll overlook this.

Let’s do some quick exploratory data analysis (EDA).

str(label_df)
## 'data.frame':    36 obs. of  3 variables:
##  $ X    : int  5 5 5 5 5 5 19 19 19 19 ...
##  $ Y    : Factor w/ 6 levels "      a","      b",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ label: Factor w/ 2 levels "      BLACK",..: 2 1 2 1 1 1 2 2 2 2 ...

We have an integer variable X, though it’s really just 6 distinct integers that each appear 6 times. The letters a through f appear as a factor Y, each letter occurring once per value of X.

describe(label_df)
##        vars  n  mean    sd median trimmed   mad min max range  skew kurtosis
## X         1 36 38.00 20.88   43.0   38.80 23.72   5  63    58 -0.37    -1.40
## Y*        2 36  3.50  1.73    3.5    3.50  2.22   1   6     5  0.00    -1.36
## label*    3 36  1.39  0.49    1.0    1.37  0.00   1   2     1  0.44    -1.86
##          se
## X      3.48
## Y*     0.29
## label* 0.08
summary(label_df)
##        X            Y             label   
##  Min.   : 5         a:6         BLACK:22  
##  1st Qu.:19         b:6         BLUE :14  
##  Median :43         c:6                   
##  Mean   :38         d:6                   
##  3rd Qu.:55         e:6                   
##  Max.   :63         f:6

Our target variable for classification is slightly imbalanced, 22 BLACK to 14 BLUE, so a majority-class baseline would already score 22/36 ≈ 0.611 accuracy.

It’s unclear if the numerical value of the X variable is meaningful, or if we should treat it more like a factor.

Let’s do a quick visualization, ignoring the confusion of ggplot assigning its own palette to labels that are themselves color names.

library(ggplot2)
ggplot(data = label_df) +
  geom_point(mapping = aes(x = X, y = Y, color = label)) 

No readily discernible pattern, but one label does seem to dominate some of the X values. Again, it may be better to treat these as factors than as integers. Let’s convert them now.

label_df <- as.data.frame(lapply(label_df, as.factor))
str(label_df)
## 'data.frame':    36 obs. of  3 variables:
##  $ X    : Factor w/ 6 levels "5","19","35",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Y    : Factor w/ 6 levels "      a","      b",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ label: Factor w/ 2 levels "      BLACK",..: 2 1 2 1 1 1 2 2 2 2 ...

Model Training

We’ll now begin our exploration of the different classification algorithms. In addition to not understanding our dataset, we also have to deal with its small size (36 rows). We will use the caret package both because it provides a uniform interface to the various algorithms and for its built-in resampling and tuning features.

train_control <- trainControl(method="loocv", savePredictions = T, classProbs = F)
#random seed number
set.seed(14)
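One note before training: the factor levels carry leading whitespace (e.g. "      BLACK" in the str() output above), and caret’s classProbs = TRUE requires class levels that are valid R variable names, which is one likely reason probability-based metrics fail later. A minimal sketch (not run) of the fix, assuming the whitespace is indeed the culprit:

# Sketch (not run): trim the padded factor levels so classProbs = TRUE works;
# twoClassSummary can then report ROC/AUC, sensitivity (TPR), and
# specificity (1 - FPR) on the pooled LOOCV predictions.
label_df$label <- factor(trimws(label_df$label))
train_control <- trainControl(method = "LOOCV",
                              savePredictions = TRUE,
                              classProbs = TRUE,
                              summaryFunction = twoClassSummary)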

Now, we train the logistic regression model.

#https://daviddalpiaz.github.io/r4sl/the-caret-package.html
lm_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="glm"
  ,family = "binomial")
print(lm_mod)
## Generalized Linear Model 
## 
## 36 samples
##  2 predictor
##  2 classes: '      BLACK', '      BLUE' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.7777778  0
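Since we set savePredictions = TRUE, the pooled LOOCV predictions are stored on the model object, so a confusion matrix, TPR, and FPR can be recovered even without an evaluation package. A sketch of that calculation:

# Sketch: caret keeps the held-out LOOCV predictions in lm_mod$pred because
# savePredictions = TRUE; confusionMatrix() then yields sensitivity (TPR)
# and specificity, from which FPR = 1 - specificity.
cm <- confusionMatrix(lm_mod$pred$pred, lm_mod$pred$obs)
cm$table                       # raw confusion matrix
cm$byClass["Sensitivity"]      # TPR for the positive class
1 - cm$byClass["Specificity"]  # FPR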

Next, linear discriminant analysis (LDA).

lda_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="lda")
print(lda_mod)
## Linear Discriminant Analysis 
## 
## 36 samples
##  2 predictor
##  2 classes: '      BLACK', '      BLUE' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.8055556  0

Next, support vector machines (SVM). We have to load the kernlab library here.

#https://topepo.github.io/caret/train-models-by-tag.html#Support_Vector_Machines
library(kernlab)
## 
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
## 
##     alpha
## The following object is masked from 'package:psych':
## 
##     alpha
svm_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="svmLinear"
  ,classProb=T)
print(svm_mod)
## Support Vector Machines with Linear Kernel 
## 
## 36 samples
##  2 predictor
##  2 classes: '      BLACK', '      BLUE' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.8611111  0    
## 
## Tuning parameter 'C' was held constant at a value of 1
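Note that the assignment asks for an RBF kernel, so the linear-kernel fit above is a substitute. A sketch (not run) of the radial version, which caret exposes as method = "svmRadial":

# Sketch (not run): the same setup with the RBF kernel the assignment asks
# for; tuneLength = 5 has caret try five values of the cost C while the
# kernel width sigma is estimated from the data.
svm_rbf_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="svmRadial"
  ,tuneLength = 5)
print(svm_rbf_mod)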

It’s k-Nearest Neighbors (kNN) time.

knn_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="knn")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
print(knn_mod)
## k-Nearest Neighbors 
## 
## 36 samples
##  2 predictor
##  2 classes: '      BLACK', '      BLUE' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy  Kappa
##   5  0.75      0    
##   7  0.75      0    
##   9  0.75      0    
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
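Caret only tried k = 5, 7, and 9 by default. With just 36 rows, smaller neighbourhoods may be worth probing; a sketch (not run) of supplying our own grid:

# Sketch (not run): pass an explicit tuning grid instead of caret's default
# of k = 5, 7, 9; odd k values avoid ties in a two-class vote.
knn_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="knn"
  ,tuneGrid = data.frame(k = c(1, 3, 5, 7, 9)))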

Finally, we construct a model using the Decision Tree approach.

tree_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="rpart")
## Warning in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
## There were missing values in resampled performance measures.
print(tree_mod)
## CART 
## 
## 36 samples
##  2 predictor
##  2 classes: '      BLACK', '      BLUE' 
## 
## No pre-processing
## Resampling: Leave-One-Out Cross-Validation 
## Summary of sample sizes: 35, 35, 35, 35, 35, 35, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.6111111  0    
## 
## Tuning parameter 'cp' was held constant at a value of 0
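For what it’s worth, the fitted tree itself is easy to inspect. A sketch, assuming the rpart.plot package is installed:

# Sketch (assumes rpart.plot is installed): caret stores the final rpart
# fit in $finalModel, which rpart.plot() can draw directly.
library(rpart.plot)
rpart.plot(tree_mod$finalModel)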

I could not get naive Bayes to run without errors, even when trying to adjust the tuning grid.

#library(e1071)
#nb_mod <- train(
#  form = label ~ .
#  ,data=label_df
#  ,trControl = train_control
#  ,method="nb")
#print(nb_mod)
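For reference, a sketch (not run) of the grid I would try next: caret’s "nb" method wraps klaR::NaiveBayes, whose tuning grid has fL (Laplace smoothing), usekernel, and adjust, and some Laplace smoothing often works around errors caused by empty factor-level cells in tiny datasets.

# Sketch (not run): klaR's NaiveBayes tuning grid; fL = 1 applies Laplace
# smoothing, which often resolves zero-probability errors on small,
# all-factor data. usekernel and adjust only matter for numeric predictors.
nb_grid <- expand.grid(fL = 1, usekernel = FALSE, adjust = 1)
nb_mod <- train(
  form = label ~ .
  ,data=label_df
  ,trControl = train_control
  ,method="nb"
  ,tuneGrid = nb_grid)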

Model Evaluation

Note that the MLeval library should make it easy to compare the accuracy, AUC, and confusion-matrix values produced by each of these models. However, I cannot get it to run. It is likely related to the classProbs setting (it should be TRUE) when configuring trainControl before fitting the models.

#library(MLeval)
#model_metrics <- evalm(tree_mod, plots='r', rlinethick=0.8, fsize=8)
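For reference, a sketch (not run) of how the comparison would look if the models were refit with classProbs = TRUE and trimmed factor levels, as sketched earlier:

# Sketch (not run): evalm() accepts a list of caret train objects and
# reports ROC curves and AUC for each, labelled via gnames.
library(MLeval)
model_metrics <- evalm(list(lm_mod, lda_mod, svm_mod, knn_mod, tree_mod),
                       gnames = c("LR", "LDA", "SVM", "kNN", "Tree"),
                       plots = "r")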

Lacking a working package to compile the accuracy scores, draw the ROC curves, and produce the confusion matrices, and having spent many hours trying, I will just report accuracy scores.

Algorithm            Accuracy
Logistic regression  0.778
LDA                  0.806
SVM (linear kernel)  0.861
kNN (k = 9)          0.750
Decision tree        0.611
Naive Bayes          failed to fit

Judging classification models on accuracy alone would be a mistake, but it’s what we’ve got. AUC would be more useful, and it’d be great to have sensitivity and specificity as well.

Our outcomes track the attributes of our dataset: it is small, has only two predictors, and is somewhat imbalanced. SVM performed best by a good margin, as one might expect given its strong record on small training sets. Next was LDA, which assumes normally distributed predictors within each class; with purely categorical inputs that assumption is violated, so it did better than I expected. Logistic regression and kNN landed close together in the middle. The decision tree performed worst, at 0.611, which is exactly the accuracy of always predicting the majority class (22/36), suggesting it found no useful splits.