Once the data is loaded, we can look at its structure and get a sense of which classification method might suit it best:
summary(data)
##        x        y        label
##  Min.   : 5   a:6   BLACK:22
##  1st Qu.:19   b:6   BLUE :14
##  Median :43   c:6
##  Mean   :38   d:6
##  3rd Qu.:55   e:6
##  Max.   :63   f:6
The variable label is the class we wish to predict; its two levels, “BLACK” (22 observations) and “BLUE” (14), are reasonably balanced. Variable y is spread evenly, with 6 observations at each of its six values. Next, we will look at the distribution of variable x:
data %>% ggplot(aes(x=x)) +
  geom_histogram(binwidth=2) +
  labs(title='Distribution of variable "x"', y="Count", x="Value") +
  theme_light()
The plot shows that x takes on only 6 discrete values rather than varying continuously. It might be a good idea to treat it as a categorical variable, but we will leave it as-is for now.
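If we did decide to treat x as categorical, the conversion would be a single line. The snippet below is only a sketch of that option and is not applied in the rest of the analysis:

# Hypothetical alternative (not used below): treat x as a factor
data_factored <- data %>% mutate(x = factor(x))

With the data in hand, we define a helper function so that every model is evaluated in the same way: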
evalModel <- function(predictions, data, name="None", model="Unspecified"){
  # Bind the predictions and the data
  x <- data %>% bind_cols(predictions)
  x$predicted <- factor(ifelse(x$.pred_BLACK >= .50, "BLACK", "BLUE"),
                        levels=c("BLACK","BLUE"))
  # Compute AUC
  auc <- x %>% roc_auc(label, .pred_BLACK) %>%
    select(AUC = .estimate)
  # Compute Accuracy, TPR, FPR, TNR, and FNR
  mets <- metric_set(accuracy, sensitivity, specificity)
  vals <- x %>% mets(truth=label, estimate=predicted) %>%
    mutate(Dataset = name, Model = model) %>%
    select(Dataset, Model, .metric, .estimate) %>%
    pivot_wider(id_cols = c(Dataset, Model),
                names_from = .metric,
                values_from = .estimate) %>%
    mutate(fpr = 1 - spec, fnr = 1 - sens, auc = auc$AUC) %>%
    select(Dataset, Model, AUC = auc, accuracy, TPR = sens,
           FPR = fpr, TNR = spec, FNR = fnr)
  return(vals)
}
Next we will split the data set into a “training” and a “test” set. Because we only have 36 observations, we will need to be careful how we split the set.
# Split our data into 2/3 training and 1/3 testing
split <- initial_split(data,prop=2/3)
# Extract the training and test sets
train <- training(split)
test <- testing(split)
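With so few rows, an unlucky split could leave the classes badly imbalanced in one of the sets. If we wanted extra insurance, we could stratify the split on label and fix the random seed; the snippet below is only a sketch and is not used for the results that follow:

# Hypothetical alternative (not used below): a seeded, stratified split
set.seed(123)                                    # any fixed seed, for reproducibility
split_strat <- initial_split(data, prop = 2/3, strata = label)
train_strat <- training(split_strat)
test_strat  <- testing(split_strat)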
First we will try to classify using logistic regression. We predict on the training data to evaluate whether the model can learn, then on the test set to see whether it can generalize.
# Fit the model
logmodel <- logistic_reg() %>% set_engine("glm") %>% fit(label ~ ., data=train)
# Predict probabilities and class for the training data
logpred <- predict(logmodel, new_data = train, type="prob")
logpred$.pred_class <- as.factor(ifelse(logpred$.pred_BLACK >= 0.5,
                                        "BLACK", "BLUE"))
# Evaluate the model
outcomes <- evalModel(predictions = logpred, data=train, name="Train",
                      model = "Logistic")
# Predict probabilities and class for the test data
logpred <- predict(logmodel, new_data = test, type="prob")
logpred$.pred_class <- as.factor(ifelse(logpred$.pred_BLACK >= 0.5,
                                        "BLACK", "BLUE"))
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = logpred, data=test,
                                      name="Test", model = "Logistic"))
kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F)
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Train | Logistic | 0.8888889 | 0.7916667 | 0.8666667 | 0.3333333 | 0.6666667 | 0.1333333 |
| Test | Logistic | 0.4000000 | 0.3333333 | 0.2857143 | 0.6000000 | 0.4000000 | 0.7142857 |
We see that the accuracy on the training data is about 79%, which isn’t bad, so the model can learn. Unfortunately, the accuracy on the test data is much lower, at 33%. This is worse than a random guess, so the model does not generalize to unseen data.
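To see where those test errors fall, we could also tabulate a confusion matrix from the same predictions. This is a quick sketch using yardstick’s conf_mat(); it is not part of the tables reported here:

# Hypothetical extra check: confusion matrix for the logistic test predictions
test %>%
  bind_cols(logpred) %>%
  conf_mat(truth = label, estimate = .pred_class)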
Now we will turn our attention to a Naive Bayes model.
nbmodel <- naive_Bayes(Laplace=1) %>% set_engine("naivebayes") %>%
  fit(label ~ ., data=train)
# Predict probabilities and class for the training data
nbpred <- predict(nbmodel, new_data = train, type="prob")
nbpred$.pred_class <- as.factor(ifelse(nbpred$.pred_BLACK >= 0.5,
                                       "BLACK", "BLUE"))
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = nbpred, data=train,
                                      name="Train", model = "Naive Bayes"))
# Predict probabilities and class for the test data
nbpred <- predict(nbmodel, new_data = test, type="prob")
nbpred$.pred_class <- as.factor(ifelse(nbpred$.pred_BLACK >= 0.5,
                                       "BLACK", "BLUE"))
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = nbpred, data=test,
                                      name="Test", model = "Naive Bayes"))
kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(3:4, bold = T, background = "#E3D971")
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Train | Logistic | 0.8888889 | 0.7916667 | 0.8666667 | 0.3333333 | 0.6666667 | 0.1333333 |
| Test | Logistic | 0.4000000 | 0.3333333 | 0.2857143 | 0.6000000 | 0.4000000 | 0.7142857 |
| Train | Naive Bayes | 0.9481481 | 0.8750000 | 0.9333333 | 0.2222222 | 0.7777778 | 0.0666667 |
| Test | Naive Bayes | 0.3714286 | 0.6666667 | 0.8571429 | 0.6000000 | 0.4000000 | 0.1428571 |
The Naive Bayes model did a bit better on the training dataset, with 87.5% accuracy compared to the logistic model’s 79%. The Naive Bayes model can learn.
On the test dataset, the accuracy of Naive Bayes drops to 66.7%. That is twice what the logistic model managed, but it is still a large drop from the training accuracy, so this model can generalize to some degree but may be overfitting a bit.
Next, we try the K-Nearest Neighbors algorithm, starting with \(k=3\).
knn3model <- nearest_neighbor(mode="classification", neighbors = 3) %>%
  set_engine("kknn") %>%
  fit(label ~ ., data=train)
# Predict probabilities and class for the training data
knn3pred <- predict(knn3model, new_data = train, type="prob")
knn3pred$.pred_class <- as.factor(
  ifelse(knn3pred$.pred_BLACK >= 0.5, "BLACK", "BLUE")
)
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = knn3pred, data=train,
                                      name="Train", model = "KNN-3"))
# Predict probabilities and class for the test data
knn3pred <- predict(knn3model, new_data = test, type="prob")
knn3pred$.pred_class <- as.factor(
  ifelse(knn3pred$.pred_BLACK >= 0.5, "BLACK", "BLUE")
)
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = knn3pred, data=test,
                                      name="Test", model = "KNN-3"))
kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(5:6, bold = T, background = "#E3D971")
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Train | Logistic | 0.8888889 | 0.7916667 | 0.8666667 | 0.3333333 | 0.6666667 | 0.1333333 |
| Test | Logistic | 0.4000000 | 0.3333333 | 0.2857143 | 0.6000000 | 0.4000000 | 0.7142857 |
| Train | Naive Bayes | 0.9481481 | 0.8750000 | 0.9333333 | 0.2222222 | 0.7777778 | 0.0666667 |
| Test | Naive Bayes | 0.3714286 | 0.6666667 | 0.8571429 | 0.6000000 | 0.4000000 | 0.1428571 |
| Train | KNN-3 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0000000 | 1.0000000 | 0.0000000 |
| Test | KNN-3 | 0.4428571 | 0.5000000 | 0.5714286 | 0.6000000 | 0.4000000 | 0.4285714 |
The K-Nearest Neighbor model (\(k=3\)) performed with perfect accuracy on the training data, a sure sign that it is overfitting. However, the model can learn.
Looking at the test data, the accuracy drops to 50%, the same as random guessing. Thus, this model cannot generalize well.
Now we try the KNN model but set \(k=5\).
knn5model <- nearest_neighbor(mode="classification", neighbors = 5) %>%
  set_engine("kknn") %>%
  fit(label ~ ., data=train)
# Predict probabilities and class for the training data
knn5pred <- predict(knn5model, new_data = train, type="prob")
knn5pred$.pred_class <- as.factor(
  ifelse(knn5pred$.pred_BLACK >= 0.5, "BLACK", "BLUE")
)
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = knn5pred, data=train,
                                      name="Train", model = "KNN-5"))
# Predict probabilities and class for the test data
knn5pred <- predict(knn5model, new_data = test, type="prob")
knn5pred$.pred_class <- as.factor(
  ifelse(knn5pred$.pred_BLACK >= 0.5, "BLACK", "BLUE")
)
# Evaluate the model
outcomes <- rbind(outcomes, evalModel(predictions = knn5pred, data=test,
                                      name="Test", model = "KNN-5"))
kable(outcomes) %>% kable_styling(bootstrap_options = "striped", full_width=F) %>%
  row_spec(7:8, bold = T, background = "#E3D971")
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Train | Logistic | 0.8888889 | 0.7916667 | 0.8666667 | 0.3333333 | 0.6666667 | 0.1333333 |
| Test | Logistic | 0.4000000 | 0.3333333 | 0.2857143 | 0.6000000 | 0.4000000 | 0.7142857 |
| Train | Naive Bayes | 0.9481481 | 0.8750000 | 0.9333333 | 0.2222222 | 0.7777778 | 0.0666667 |
| Test | Naive Bayes | 0.3714286 | 0.6666667 | 0.8571429 | 0.6000000 | 0.4000000 | 0.1428571 |
| Train | KNN-3 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0000000 | 1.0000000 | 0.0000000 |
| Test | KNN-3 | 0.4428571 | 0.5000000 | 0.5714286 | 0.6000000 | 0.4000000 | 0.4285714 |
| Train | KNN-5 | 0.9555556 | 0.8333333 | 0.8666667 | 0.2222222 | 0.7777778 | 0.1333333 |
| Test | KNN-5 | 0.4857143 | 0.5833333 | 0.7142857 | 0.6000000 | 0.4000000 | 0.2857143 |
The KNN model with \(k=5\) reaches 83% accuracy on the training data. That is lower than the perfect score at \(k=3\), which suggests we have reduced the overfitting, and the model can still learn.
On the test data, accuracy rises to 58%. That is better than \(k=3\), but only slightly above random guessing, so the model still does not generalize well.
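Rather than trying values of \(k\) one at a time, we could also let cross-validation choose it. The sketch below assumes the rsample and tune packages (both part of tidymodels); it is an illustration only and is not part of the results reported here:

# Hypothetical alternative: pick k by 5-fold cross-validation on the training set
set.seed(123)                                    # hypothetical seed
folds <- vfold_cv(train, v = 5)
knn_spec <- nearest_neighbor(mode = "classification", neighbors = tune()) %>%
  set_engine("kknn")
knn_tuned <- tune_grid(knn_spec, label ~ ., resamples = folds,
                       grid = tibble(neighbors = c(3, 5, 7, 9)))
show_best(knn_tuned, metric = "roc_auc")         # best k by cross-validated AUC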
Below are the performance metrics for the training data:
outcomes %>% filter(Dataset == "Train") %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width=F)
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Train | Logistic | 0.8888889 | 0.7916667 | 0.8666667 | 0.3333333 | 0.6666667 | 0.1333333 |
| Train | Naive Bayes | 0.9481481 | 0.8750000 | 0.9333333 | 0.2222222 | 0.7777778 | 0.0666667 |
| Train | KNN-3 | 1.0000000 | 1.0000000 | 1.0000000 | 0.0000000 | 1.0000000 | 0.0000000 |
| Train | KNN-5 | 0.9555556 | 0.8333333 | 0.8666667 | 0.2222222 | 0.7777778 | 0.1333333 |
…and here are the performance metrics for the test data:
outcomes %>% filter(Dataset == "Test") %>% kable() %>%
kable_styling(bootstrap_options = "striped", full_width=F)
| Dataset | Model | AUC | accuracy | TPR | FPR | TNR | FNR |
|---|---|---|---|---|---|---|---|
| Test | Logistic | 0.4000000 | 0.3333333 | 0.2857143 | 0.6 | 0.4 | 0.7142857 |
| Test | Naive Bayes | 0.3714286 | 0.6666667 | 0.8571429 | 0.6 | 0.4 | 0.1428571 |
| Test | KNN-3 | 0.4428571 | 0.5000000 | 0.5714286 | 0.6 | 0.4 | 0.4285714 |
| Test | KNN-5 | 0.4857143 | 0.5833333 | 0.7142857 | 0.6 | 0.4 | 0.2857143 |
Of the four models we tried, the best performer appears to be the Naive Bayes model, but even it does not perform very well.
The primary difficulty for all of the models is likely the small dataset. With so few test examples (12), random chance can swing the accuracy measure considerably.
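If more data cannot be collected, one way to reduce that variance would be to evaluate the models with resampling instead of a single train/test split. The sketch below uses bootstrap resamples of the full data; it is an illustration only, not something done above:

# Hypothetical: estimate Naive Bayes performance across bootstrap resamples
set.seed(123)                                    # hypothetical seed
boots <- bootstraps(data, times = 25)
nb_spec <- naive_Bayes(Laplace = 1) %>% set_engine("naivebayes")
nb_resampled <- fit_resamples(nb_spec, label ~ ., resamples = boots)
collect_metrics(nb_resampled)                    # mean accuracy / ROC AUC across resamples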
Some other observations:

- Naive Bayes likely does better because it is an algorithm based on counting observations and using those counts to estimate probabilities.
- The KNN models performed poorly at \(k=3\) and \(k=5\), but may perform better at \(k=7\); in fact, I tried exactly that, and the test accuracy was about as good as the Naive Bayes model’s with no loss in training accuracy (a sketch of that fit follows this list).
- The logistic regression model may perform poorly because the data might not have a linear decision boundary.
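For completeness, this is what that \(k=7\) fit could look like, following the same pattern as the models above; it is only a sketch, and the resulting metrics would depend on the particular train/test split.

# Sketch of the k = 7 model mentioned above, using the same workflow as before
knn7model <- nearest_neighbor(mode="classification", neighbors = 7) %>%
  set_engine("kknn") %>%
  fit(label ~ ., data=train)
knn7pred <- predict(knn7model, new_data = test, type="prob")
knn7pred$.pred_class <- as.factor(
  ifelse(knn7pred$.pred_BLACK >= 0.5, "BLACK", "BLUE")
)
evalModel(predictions = knn7pred, data=test, name="Test", model = "KNN-7")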