Investigate the Data

Once the data is loaded, we can look at its structure and get a sense of which method might classify it best:

summary(data)
##        x      y       label   
##  Min.   : 5   a:6   BLACK:22  
##  1st Qu.:19   b:6   BLUE :14  
##  Median :43   c:6             
##  Mean   :38   d:6             
##  3rd Qu.:55   e:6             
##  Max.   :63   f:6

The variable label is the class we wish to predict. With 22 BLACK and 14 BLUE observations, the classes are not perfectly balanced, but both are well represented. Variable y takes on six values (a through f), each appearing 6 times.
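
To make the class balance explicit, we can also compute the proportions directly (a quick check using dplyr, which is loaded with the tidyverse):

# Count each class and turn the counts into proportions
data %>% count(label) %>% mutate(prop = n / sum(n))

This confirms that BLACK makes up roughly 61% of the observations. Next, we will look at the distribution of variable x: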

data %>% ggplot(aes(x=x)) + geom_histogram(binwidth=2) +
  labs(title='Distribution of variable "x"',y="Count",x="Value") +
  theme_light()

The plot shows that x takes on only 6 discrete values rather than varying continuously. It might be a good idea to treat it as a categorical variable, but we will leave it as-is for now.
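
If we do later decide to treat x as categorical, the conversion is a one-line mutate; here is a minimal sketch (not applied here, and data_cat is just a hypothetical name so the original data frame is not overwritten):

# Hypothetical: recode x as a factor so models treat it as categorical
data_cat <- data %>% mutate(x = factor(x))

For now we keep x numeric. Before fitting anything, we also define a helper function, evalModel(), that computes AUC, accuracy, TPR, FPR, TNR, and FNR for a set of probability predictions, so every model below is evaluated in exactly the same way: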

evalModel <- function(predictions, data, name="None", model="Unspecified"){
  
  # Bind the predictions and the data
  x <- data %>% bind_cols(predictions)
  x$predicted <- factor(ifelse(x$.pred_BLACK >= .50, "BLACK","BLUE"),
                        levels=c("BLACK","BLUE"))
  
  # Compute AUC
  auc <- x %>% roc_auc(label, .pred_BLACK) %>%
    select(AUC = .estimate)
  
  # Compute Accuracy, TPR, FPR, TNR, and FNR
  mets <- metric_set(accuracy,sensitivity,specificity)
  vals <- x %>% mets(truth=label, estimate=predicted) %>%
    mutate(Dataset = name, Model = model) %>%
    select(Dataset, Model, .metric, .estimate) %>%
    pivot_wider(id_cols = c(Dataset,Model),
                names_from = .metric, 
                values_from = .estimate) %>%
    mutate(fpr = 1 - spec, fnr = 1 - sens, auc = auc$AUC) %>%
    select(Dataset, Model, AUC = auc, accuracy, TPR = sens, 
           FPR = fpr, TNR = spec, FNR = fnr)

  return(vals)
}

Split the Data

Next we will split the data set into a “training” and a “test” set. Because we only have 36 observations, a 2/3 training and 1/3 testing split leaves just 12 rows for testing, so we need to be careful how we split the set.

# Split our data into 2/3 training and 1/3 testing
split <- initial_split(data,prop=2/3)

# Prep training data
train <- training(split)
test <- testing(split)
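
Note that initial_split() samples rows at random, so the exact results below will vary from run to run unless a seed is set. With 22 BLACK and 14 BLUE observations, stratifying the split on label would also keep the class proportions similar in both halves. A minimal sketch of that alternative (not used for the results below; the seed value 1234 is arbitrary):

# Hypothetical alternative: seed the RNG and stratify the split on the class label
set.seed(1234)
split_strat <- initial_split(data, prop = 2/3, strata = label)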

Logistic Regression

First we will try to classify using logistic regression. We predict the training data first to evaluate whether the model can learn, then predict the test set to see whether it can generalize.

# Fit the model
logmodel <- logistic_reg() %>% set_engine("glm") %>% fit(label ~ ., data=train)

# Predict probabilities and class for the training data
logpred <- predict(logmodel, new_data = train, type="prob")
logpred$.pred_class <- as.factor(ifelse(logpred$.pred_BLACK >= 0.5,
                                        "BLACK","BLUE"))

# Evaluate the model
outcomes <- evalModel(predictions = logpred, data=train, name="Train",
                      model = "Logistic")

# Predict probabilities and class for the test data
logpred <- predict(logmodel, new_data = test, type="prob")
logpred$.pred_class <- as.factor(ifelse(logpred$.pred_BLACK >= 0.5,
                                        "BLACK","BLUE"))

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = logpred, data=test,
                                     name="Test", model = "Logistic"))

kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F)
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Train    Logistic     0.8888889  0.7916667  0.8666667  0.3333333  0.6666667  0.1333333
Test     Logistic     0.4000000  0.3333333  0.2857143  0.6000000  0.4000000  0.7142857

We see that the accuracy on the training data is about 79%, which isn’t bad, so the model can learn. Unfortunately, the test accuracy is much lower at 33%. This is worse than a random guess (50%), so the model does not generalize to unseen data.
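
To dig into why the logistic model fails to generalize, we could inspect the fitted coefficients; a quick sketch (tidy() comes from broom, which tidymodels attaches, and the output is omitted here because it depends on the random split):

# Show the estimated coefficients, standard errors, and p-values
tidy(logmodel) %>% arrange(p.value)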

Naive Bayes

Now we will turn our attention to a Naive Bayes model.

# Fit the model (naive_Bayes() with the "naivebayes" engine is provided by
# the discrim extension package)
nbmodel <- naive_Bayes(Laplace=1) %>% set_engine("naivebayes") %>%
  fit(label ~ ., data=train)

# Predict probabilities and class for the training data
nbpred <- predict(nbmodel, new_data = train, type="prob")
nbpred$.pred_class <- as.factor(ifelse(nbpred$.pred_BLACK >= 0.5,
                                       "BLACK","BLUE"))

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = nbpred, data=train,
                                     name="Train", model = "Naive Bayes"))

# Predict probabilities and class for the test data
nbpred <- predict(nbmodel, new_data = test, type="prob")
nbpred$.pred_class <- as.factor(ifelse(nbpred$.pred_BLACK >= 0.5,
                                       "BLACK","BLUE"))

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = nbpred, data=test,
                                     name="Test", model = "Naive Bayes"))

kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(3:4, bold = T, background = "#E3D971")
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Train    Logistic     0.8888889  0.7916667  0.8666667  0.3333333  0.6666667  0.1333333
Test     Logistic     0.4000000  0.3333333  0.2857143  0.6000000  0.4000000  0.7142857
Train    Naive Bayes  0.9481481  0.8750000  0.9333333  0.2222222  0.7777778  0.0666667
Test     Naive Bayes  0.3714286  0.6666667  0.8571429  0.6000000  0.4000000  0.1428571

The Naive Bayes model did a bit better on the train dataset, with an 87.5% accuracy compared to the Logistic model’s 79% accuracy. The Naive Bayes model can learn.

Looking at the test dataset, the accuracy of Naive Bayes drops to about 67%. That is twice the logistic model’s test accuracy, but it is still a large drop from the training data. I’d say this model can generalize to some degree, though it may be overfitting a bit.
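
To see what the Naive Bayes model actually learned, we can pull the underlying naivebayes fit out of the parsnip wrapper; a small sketch (the $fit element holds the engine-level object, and printing it shows the class priors and the per-class conditional distributions):

# Peek at the engine-level fit: priors and conditional distributions per feature
nbmodel$fit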

K-Nearest Neighbors (\(k=3\))

Next, we try the K-Nearest Neighbors algorithm, starting with \(k=3\).

knn3model <- nearest_neighbor(mode="classification", neighbors = 3) %>%
  set_engine("kknn") %>%
  fit(label ~ ., data=train)

# Predict probabilities and class for the training data
knn3pred <- predict(knn3model, new_data = train, type="prob")
knn3pred$.pred_class <- as.factor(
  ifelse(knn3pred$.pred_BLACK >= 0.5,"BLACK","BLUE")
  )

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = knn3pred, data=train,
                                     name="Train", model = "KNN-3"))

# Predict probabilities and class for the test data
knn3pred <- predict(knn3model, new_data = test, type="prob")
knn3pred$.pred_class <- as.factor(
  ifelse(knn3pred$.pred_BLACK >= 0.5,"BLACK","BLUE")
  )

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = knn3pred, data=test,
                                     name="Test", model = "KNN-3"))

kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(5:6, bold = T, background = "#E3D971")
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Train    Logistic     0.8888889  0.7916667  0.8666667  0.3333333  0.6666667  0.1333333
Test     Logistic     0.4000000  0.3333333  0.2857143  0.6000000  0.4000000  0.7142857
Train    Naive Bayes  0.9481481  0.8750000  0.9333333  0.2222222  0.7777778  0.0666667
Test     Naive Bayes  0.3714286  0.6666667  0.8571429  0.6000000  0.4000000  0.1428571
Train    KNN-3        1.0000000  1.0000000  1.0000000  0.0000000  1.0000000  0.0000000
Test     KNN-3        0.4428571  0.5000000  0.5714286  0.6000000  0.4000000  0.4285714

The K-Nearest Neighbors model (\(k=3\)) scored perfect accuracy on the training data, a sure sign that it is overfitting. With such a small \(k\) this is expected: when we predict on the training set, each training point is effectively among its own nearest neighbors. Still, the model can clearly learn.

Looking at the test data, the accuracy drops to 50%, the same as random guessing. Thus, this model cannot generalize well.
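
A single accuracy number hides where the errors fall, so it can help to print the test-set confusion matrix as well; a quick sketch (knn3pred still holds the test-set predictions at this point, and conf_mat() comes from yardstick):

# Cross-tabulate true labels against predicted classes on the test set
test %>% bind_cols(knn3pred) %>% conf_mat(truth = label, estimate = .pred_class)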

K-Nearest Neighbors (\(k=5\))

Now we try the KNN model but set \(k=5\).

knn5model <- nearest_neighbor(mode="classification", neighbors = 5) %>%
  set_engine("kknn") %>%
  fit(label ~ ., data=train)

# Predict probabilities and class for the training data
knn5pred <- predict(knn5model, new_data = train, type="prob")
knn5pred$.pred_class <- as.factor(
  ifelse(knn5pred$.pred_BLACK >= 0.5,"BLACK","BLUE")
  )

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = knn5pred, data=train,
                                     name="Train", model = "KNN-5"))

# Predict probabilities and class for the test data
knn5pred <- predict(knn5model, new_data = test, type="prob")
knn5pred$.pred_class <- as.factor(
  ifelse(knn5pred$.pred_BLACK >= 0.5,"BLACK","BLUE")
  )

# Evaluate the model
outcomes <- rbind(outcomes,evalModel(predictions = knn5pred, data=test,
                                     name="Test", model = "KNN-5"))

kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(7:8, bold = T, background = "#E3D971")
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Train    Logistic     0.8888889  0.7916667  0.8666667  0.3333333  0.6666667  0.1333333
Test     Logistic     0.4000000  0.3333333  0.2857143  0.6000000  0.4000000  0.7142857
Train    Naive Bayes  0.9481481  0.8750000  0.9333333  0.2222222  0.7777778  0.0666667
Test     Naive Bayes  0.3714286  0.6666667  0.8571429  0.6000000  0.4000000  0.1428571
Train    KNN-3        1.0000000  1.0000000  1.0000000  0.0000000  1.0000000  0.0000000
Test     KNN-3        0.4428571  0.5000000  0.5714286  0.6000000  0.4000000  0.4285714
Train    KNN-5        0.9555556  0.8333333  0.8666667  0.2222222  0.7777778  0.1333333
Test     KNN-5        0.4857143  0.5833333  0.7142857  0.6000000  0.4000000  0.2857143

The KNN model with \(k=5\) scores 83% accuracy on the training data, down from the perfect score at \(k=3\), which suggests we have reduced the overfitting. The model can still learn well.

On the test data, the accuracy improves to 58%, better than the 50% we saw with \(k=3\), but it is still a large drop from the training accuracy, so this model also does not generalize well.
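
Rather than trying values of \(k\) one at a time, we could also let cross-validation pick \(k\) for us; a minimal sketch using tune_grid() (tune and rsample are attached with tidymodels; the fold count and candidate values of \(k\) are arbitrary, and this was not run for the results in this report):

# Hypothetical: choose k by 5-fold cross-validation on the training set
knn_spec <- nearest_neighbor(mode = "classification", neighbors = tune()) %>%
  set_engine("kknn")
folds    <- vfold_cv(train, v = 5)
knn_grid <- tibble(neighbors = c(1, 3, 5, 7, 9))
knn_res  <- tune_grid(knn_spec, label ~ ., resamples = folds, grid = knn_grid)
show_best(knn_res, metric = "accuracy")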

Summary

Below are the performance metrics for the training data:

outcomes %>% filter(Dataset == "Train") %>% kable() %>%
  kable_styling(bootstrap_options = "striped", full_width=F)
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Train    Logistic     0.8888889  0.7916667  0.8666667  0.3333333  0.6666667  0.1333333
Train    Naive Bayes  0.9481481  0.8750000  0.9333333  0.2222222  0.7777778  0.0666667
Train    KNN-3        1.0000000  1.0000000  1.0000000  0.0000000  1.0000000  0.0000000
Train    KNN-5        0.9555556  0.8333333  0.8666667  0.2222222  0.7777778  0.1333333

…and here are the performance metrics for the test data:

outcomes %>% filter(Dataset == "Test") %>% kable() %>%
  kable_styling(bootstrap_options = "striped", full_width=F)
Dataset  Model        AUC        accuracy   TPR        FPR        TNR        FNR
Test     Logistic     0.4000000  0.3333333  0.2857143  0.6        0.4        0.7142857
Test     Naive Bayes  0.3714286  0.6666667  0.8571429  0.6        0.4        0.1428571
Test     KNN-3        0.4428571  0.5000000  0.5714286  0.6        0.4        0.4285714
Test     KNN-5        0.4857143  0.5833333  0.7142857  0.6        0.4        0.2857143

Given the 4 models we have tried, the best performer on the test set is the Naive Bayes model (about 67% accuracy), but even that model does not perform very well; its test AUC of 0.37 is below the 0.5 we would expect from random guessing.

The primary difficulty with the models is likely the small dataset. With only 12 test examples, a single misclassified point moves the accuracy by 1/12 ≈ 8.3 percentage points, so random chance in the split can swing the results considerably.
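
With data this small, a more stable way to compare models would be to use cross-validation over all 36 rows instead of a single 12-row test set. A minimal sketch for the Naive Bayes specification (fit_resamples() and vfold_cv() are attached with tidymodels; the fold count and seed are arbitrary, and this was not run for this report):

# Hypothetical: estimate performance with 6-fold cross-validation on the full data
set.seed(1234)
folds   <- vfold_cv(data, v = 6)
nb_spec <- naive_Bayes(Laplace = 1) %>% set_engine("naivebayes")
cv_res  <- fit_resamples(nb_spec, label ~ ., resamples = folds)
collect_metrics(cv_res)   # mean accuracy and ROC AUC across the folds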

Some other observations: