First we load and review the data.
df <- read_csv("HW1-Q1-40.csv", skip = 1, n_max = 14,
               col_types = "fffff") %>%
  rename(age = `age-group`, credit = credit_rating, prospect = `class:prospect`)
summary(df)
## age networth status credit prospect
## youth :5 high :4 employed :7 fair :8 no :5
## middle:4 medium:6 unemployed:7 excellent:6 yes:9
## senior:5 low :4
The prior probabilities are the (unconditioned) class probabilities, which can be estimated from the dataset:
\[\Pr(\text{prospect})\]
(prior <- prop.table(table(prospect = df$prospect)))
## prospect
## no yes
## 0.3571429 0.6428571
The conditional probabilities are given by the probability distributions for each predictor, conditioned on class:
\[\Pr(X_i | \text{prospect})\]
where \(X_i\) refers to the predictor variables age, net worth, employment status, and credit rating.
# age conditionals and labels
(cond_age <- table(age = df$age, prospect = df$prospect) %>%
prop.table(margin = 2))
## prospect
## age no yes
## youth 0.6000000 0.2222222
## middle 0.0000000 0.4444444
## senior 0.4000000 0.3333333
names_age <- rownames(cond_age)
# net worth conditionals and labels
(cond_networth <- table(networth = df$networth, prospect = df$prospect) %>%
prop.table(margin = 2))
## prospect
## networth no yes
## high 0.4000000 0.2222222
## medium 0.4000000 0.4444444
## low 0.2000000 0.3333333
names_networth <- rownames(cond_networth)
# status conditionals and labels
(cond_status <- table(status = df$status, prospect = df$prospect) %>%
prop.table(margin = 2))
## prospect
## status no yes
## employed 0.8000000 0.3333333
## unemployed 0.2000000 0.6666667
names_status <- rownames(cond_status)
# credit rating conditionals and labels
(cond_credit <- table(credit = df$credit, prospect = df$prospect) %>%
prop.table(margin = 2))
## prospect
## credit no yes
## fair 0.4000000 0.6666667
## excellent 0.6000000 0.3333333
names_credit <- rownames(cond_credit)
The posterior probabilities are the class probabilities conditioned on the predictors, which, by Bayes’ theorem, we can express as follows:
\[ \begin{array}{rcl} \Pr(\text{prospect} | X) &=& \frac{\Pr(X | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)}\\ &=& \frac{\Pr(X_1, X_2, X_3, X_4 | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)} \end{array} \]
where \(X = (X_1, X_2, X_3, X_4)\). Then assuming that the predictors \(X_1, X_2, X_3\), and \(X_4\) are conditionally independent (the naive Bayes assumption), we can write:
\[\Pr(X_1, X_2, X_3, X_4 | \text{prospect}) = \Pr(X_1 | \text{prospect}) \Pr(X_2 | \text{prospect}) \Pr(X_3 | \text{prospect}) \Pr(X_4 | \text{prospect})\]
from which the posterior probabilities can be expressed as:
\[\Pr(\text{prospect} | X) = \frac{\Pr(X_1 | \text{prospect}) \Pr(X_2 | \text{prospect}) \Pr(X_3 | \text{prospect}) \Pr(X_4 | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)}\]
In this expression, the prior probabilities \(\Pr(\text{prospect})\) and the conditional probabilities \(\Pr(X_i | \text{prospect})\) can be estimated from the dataset, as computed above. The denominator can be computed explicitly as:
\[\Pr(X) = \Pr(X | \text{prospect = no}) \Pr(\text{prospect = no}) + \Pr(X | \text{prospect = yes}) \Pr(\text{prospect = yes})\]
Alternatively, rather than calculating the denominator, we need only compute the numerators (for \(\text{prospect = no}\) and \(\text{prospect = yes}\)), and then we can normalize the probabilities to arrive at the posteriors.
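As a concrete check before computing all the posteriors in a loop, consider the instance (age = youth, networth = high, status = employed, credit = fair). Using the prior and conditional tables above, the two numerators are:
\[ \begin{array}{rcl} \Pr(\text{no}) \prod_i \Pr(x_i | \text{no}) &=& 0.357 \times 0.600 \times 0.400 \times 0.800 \times 0.400 \approx 0.0274\\ \Pr(\text{yes}) \prod_i \Pr(x_i | \text{yes}) &=& 0.643 \times 0.222 \times 0.222 \times 0.333 \times 0.667 \approx 0.0071 \end{array} \]
Normalizing gives \(\Pr(\text{no} | X) \approx 0.0274 / (0.0274 + 0.0071) \approx 0.795\), which matches the first row of the table of posteriors below.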
# build dataframe of all posteriors
df_post <- tibble()
for (a in names_age) {
  for (b in names_networth) {
    for (c in names_status) {
      for (d in names_credit) {
        num_no <- prior['no'] * cond_age[a, 'no'] * cond_networth[b, 'no'] *
          cond_status[c, 'no'] * cond_credit[d, 'no']
        num_yes <- prior['yes'] * cond_age[a, 'yes'] * cond_networth[b, 'yes'] *
          cond_status[c, 'yes'] * cond_credit[d, 'yes']
        df_post <- rbind(df_post, c(a, b, c, d, num_no, num_yes), stringsAsFactors = FALSE)
      }
    }
  }
}
colnames(df_post) <- c("age", "networth", "status", "credit", "num_no", "num_yes")
# normalize probabilities
df_post <- df_post %>%
mutate(num_no = as.double(num_no), num_yes = as.double(num_yes)) %>%
mutate(post_no = num_no / (num_no + num_yes), post_yes = num_yes / (num_no + num_yes)) %>%
dplyr::select(-num_no, -num_yes)
# display
df_post %>% kable(caption = "Computed posterior probabilities")
| age | networth | status | credit | post_no | post_yes |
|---|---|---|---|---|---|
| youth | high | employed | fair | 0.7954173 | 0.2045827 |
| youth | high | employed | excellent | 0.9210360 | 0.0789640 |
| youth | high | unemployed | fair | 0.3270525 | 0.6729475 |
| youth | high | unemployed | excellent | 0.5931652 | 0.4068348 |
| youth | medium | employed | fair | 0.6603261 | 0.3396739 |
| youth | medium | employed | excellent | 0.8536300 | 0.1463700 |
| youth | medium | unemployed | fair | 0.1954948 | 0.8045052 |
| youth | medium | unemployed | excellent | 0.4216310 | 0.5783690 |
| youth | low | employed | fair | 0.5644599 | 0.4355401 |
| youth | low | employed | excellent | 0.7954173 | 0.2045827 |
| youth | low | unemployed | fair | 0.1394148 | 0.8605852 |
| youth | low | unemployed | excellent | 0.3270525 | 0.6729475 |
| middle | high | employed | fair | 0.0000000 | 1.0000000 |
| middle | high | employed | excellent | 0.0000000 | 1.0000000 |
| middle | high | unemployed | fair | 0.0000000 | 1.0000000 |
| middle | high | unemployed | excellent | 0.0000000 | 1.0000000 |
| middle | medium | employed | fair | 0.0000000 | 1.0000000 |
| middle | medium | employed | excellent | 0.0000000 | 1.0000000 |
| middle | medium | unemployed | fair | 0.0000000 | 1.0000000 |
| middle | medium | unemployed | excellent | 0.0000000 | 1.0000000 |
| middle | low | employed | fair | 0.0000000 | 1.0000000 |
| middle | low | employed | excellent | 0.0000000 | 1.0000000 |
| middle | low | unemployed | fair | 0.0000000 | 1.0000000 |
| middle | low | unemployed | excellent | 0.0000000 | 1.0000000 |
| senior | high | employed | fair | 0.6334311 | 0.3665689 |
| senior | high | employed | excellent | 0.8382924 | 0.1617076 |
| senior | high | unemployed | fair | 0.1776316 | 0.8223684 |
| senior | high | unemployed | excellent | 0.3932039 | 0.6067961 |
| senior | medium | employed | fair | 0.4635193 | 0.5364807 |
| senior | medium | employed | excellent | 0.7216036 | 0.2783964 |
| senior | medium | unemployed | fair | 0.0974729 | 0.9025271 |
| senior | medium | unemployed | excellent | 0.2447130 | 0.7552870 |
| senior | low | employed | fair | 0.3654822 | 0.6345178 |
| senior | low | employed | excellent | 0.6334311 | 0.3665689 |
| senior | low | unemployed | fair | 0.0671642 | 0.9328358 |
| senior | low | unemployed | excellent | 0.1776316 | 0.8223684 |
Finally, we check our work using the NaiveBayes function from the klaR package. The prior and conditional probabilities agree exactly with the results computed above. The posterior probabilities also agree, except for the combinations involving the middle age group, where our hand calculation gives a posterior of exactly 0 for prospect = no: klaR's predict method replaces zero conditional probabilities with a small threshold (0.001 by default), which produces the small but nonzero posteriors seen in the table below.
nb <- NaiveBayes(prospect ~ ., df)
nb$apriori
## grouping
## no yes
## 0.3571429 0.6428571
nb$tables
## $age
## var
## grouping youth middle senior
## no 0.6000000 0.0000000 0.4000000
## yes 0.2222222 0.4444444 0.3333333
##
## $networth
## var
## grouping high medium low
## no 0.4000000 0.4000000 0.2000000
## yes 0.2222222 0.4444444 0.3333333
##
## $status
## var
## grouping employed unemployed
## no 0.8000000 0.2000000
## yes 0.3333333 0.6666667
##
## $credit
## var
## grouping fair excellent
## no 0.4000000 0.6000000
## yes 0.6666667 0.3333333
data.frame(nb$x, predict(nb)$posterior) %>%
kable(caption = "Posterior probabilities from naive Bayes classifier")
| age | networth | status | credit | no | yes |
|---|---|---|---|---|---|
| youth | high | employed | fair | 0.7954173 | 0.2045827 |
| youth | high | employed | excellent | 0.9210360 | 0.0789640 |
| middle | high | employed | fair | 0.0032295 | 0.9967705 |
| senior | medium | employed | fair | 0.4635193 | 0.5364807 |
| senior | low | unemployed | fair | 0.0671642 | 0.9328358 |
| senior | low | unemployed | excellent | 0.1776316 | 0.8223684 |
| middle | low | unemployed | excellent | 0.0004048 | 0.9995952 |
| youth | medium | employed | fair | 0.6603261 | 0.3396739 |
| youth | low | unemployed | fair | 0.1394148 | 0.8605852 |
| senior | medium | unemployed | fair | 0.0974729 | 0.9025271 |
| youth | medium | unemployed | excellent | 0.4216310 | 0.5783690 |
| middle | medium | employed | excellent | 0.0048365 | 0.9951635 |
| middle | high | unemployed | fair | 0.0004048 | 0.9995952 |
| senior | medium | employed | excellent | 0.7216036 | 0.2783964 |
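The nonzero posteriors for the middle age group in the klaR output (e.g., 0.0032295 for the first such row) come from the threshold noted above: predict.NaiveBayes substitutes a small threshold, 0.001 by default, for any zero conditional probability. The following sketch reproduces the first such value from our own tables (the 0.001 default is taken from klaR's documentation, not from the assignment):
# substitute klaR's default threshold 0.001 for Pr(age = middle | prospect = no) = 0
num_no <- prior['no'] * 0.001 * cond_networth['high', 'no'] *
  cond_status['employed', 'no'] * cond_credit['fair', 'no']
num_yes <- prior['yes'] * cond_age['middle', 'yes'] * cond_networth['high', 'yes'] *
  cond_status['employed', 'yes'] * cond_credit['fair', 'yes']
unname(num_no / (num_no + num_yes))  # ~0.0032, matching the klaR table above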
First we load and summarize the data. We observe the following:

- Both datasets contain the same variables (predictors a and b and a target class); however, the target classes are labeled differently in the two datasets (1/2 in the first dataset and 0/1 in the second dataset).

raw1 <- read_table2("junk1.txt", col_types = "ddf")
summary(raw1)
## a b class
## Min. :-2.29854 Min. :-3.17174 1:50
## 1st Qu.:-0.85014 1st Qu.:-1.04712 2:50
## Median :-0.04754 Median :-0.07456
## Mean : 0.04758 Mean : 0.01324
## 3rd Qu.: 1.09109 3rd Qu.: 1.05342
## Max. : 3.00604 Max. : 3.10230
(apply(raw1[1:2], 2, sd))
## a b
## 1.267740 1.446067
cor(raw1[1:2])
## a b
## a 1.0000000 -0.1021507
## b -0.1021507 1.0000000
raw2 <- read_csv("junk2.csv", col_types = "ddf")
summary(raw2)
## a b class
## Min. :-4.16505 Min. :-3.90472 0:3750
## 1st Qu.:-1.01447 1st Qu.:-0.89754 1: 250
## Median : 0.08754 Median :-0.08358
## Mean :-0.05126 Mean : 0.05624
## 3rd Qu.: 0.89842 3rd Qu.: 1.00354
## Max. : 4.62647 Max. : 4.31052
(apply(raw2[1:2], 2, sd))
## a b
## 1.298076 1.314386
cor(raw2[1:2])
## a b
## a 1.00000000 0.07392388
## b 0.07392388 1.00000000
Next we visualize the distribution of the data using scatter plots and pairs plots.
# scatter plots
ggplot(raw1) + geom_point(aes(x = a, y = b, col = class)) +
labs(title = "Scatter plot of dataset #1")
ggplot(raw2) + geom_point(aes(x = a, y = b, col = class)) +
labs(title = "Scatter plot of dataset #2")
# pairs plots
ggpairs(raw1, aes(col = factor(class), alpha = 0.5), title = "Pairs plot of dataset #1")
ggpairs(raw2, aes(col = factor(class), alpha = 0.5), title = "Pairs plot of dataset #2")
At this point it would be good to ask questions and get some guidance. In particular:

- Do the predictors a and b and the target class correspond to the same variables in both datasets?

We assume here (based on the answers to the questions above) that the purpose of the exercise is to develop classification models and that we should keep the two datasets separate.
Based on the data exploration above, we might expect that classification will be more achievable / successful for dataset #2 than for dataset #1, since:

- the two classes in dataset #1 overlap throughout the a-b sample space, whereas
- the class “1” instances in dataset #2 are localized in a compact region of the a-b sample space.
We try various classification algorithms using the klaR package, including:

- linear discriminant analysis (lda)
- quadratic discriminant analysis (qda)
- classification trees (rpart)
- naive Bayes (naiveBayes)
- regularized discriminant analysis (rda)
- k-nearest neighbors (sknn)

Note that for all the classification methods above, we use the default options from klaR (including k = 3 for the kNN classifier). We can visualize the separation boundaries for each dataset and classifier using the partimat function, which plots correct and incorrect cases in black and red type, respectively, as well as summarizing the error rate.
Reviewing the partition plots for dataset #1, the linear discriminant analysis and naive Bayes classifiers are not a good fit, since they produce linear separation boundaries that cannot effectively separate classes that overlap throughout the sample space. The remaining algorithms (qda, rpart, rda, and sknn) appear to provide reasonable classifiers.
# methods to try
methods <- c("lda", "qda", "rpart", "naiveBayes", "rda", "sknn")
# junk1
for (j in methods) {
  partimat(class ~ a + b, data = raw1,
           method = j,
           main = paste0("2D partition of dataset #1: ", j, " classifier"))
}
For dataset #2, it appears from the partition plots below that none of the methods provide a satisfactory classifier. In particular, the kNN classifier achieves a 4% error rate, but there is evidence of overfitting (which might be mitigated by increasing k). None of the remaining algorithms provide any classification beyond grouping all instances into the dominant class (94% class “0”), which yields the baseline 6% error rate. The challenge with dataset #2 lies in its imbalanced class distribution: since the target class is split roughly 94/6 between 0 and 1, almost all of the algorithms default to the baseline classifier, whereas what we actually want is a classifier that produces a roughly circular separation boundary around the class “1” instances, which are localized in the a-b sample space. None of the classification methods considered here achieve this type of boundary (a possible mitigation is sketched after the partition plots below).
# junk2
for (j in methods) {
  partimat(class ~ a + b, data = raw2,
           method = j,
           main = paste0("2D partition of dataset #2: ", j, " classifier"))
}
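One possible follow-up for dataset #2 (a sketch under our own assumptions, not one of the methods above): because QDA fits a quadratic, ellipse-like boundary, refitting it with equal class priors removes the pull of the 94/6 imbalance toward the all-“0” rule and may recover a closed boundary around the class “1” region. MASS::qda accepts a user-specified prior:
# refit QDA on dataset #2 with equal class priors (sketch, not part of the assigned analysis)
qda_bal <- MASS::qda(class ~ a + b, data = raw2, prior = c(0.5, 0.5))
pred_bal <- predict(qda_bal, raw2)$class
# resubstitution confusion matrix (evaluated on the training data, so optimistic)
addmargins(table(Pred = pred_bal, Obs = raw2$class))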
Based on the analysis above, our recommendations are the following:
First we read the icu.csv file and prepare the dataset, including:

- creating a new indicator variable COMA, which equals 1 if LOC is 2 and 0 otherwise
- selecting the variables TYP, COMA, AGE, INF, and STA.

AGE is a numeric variable while the other variables are categorical; however, for modeling purposes, we will treat the categorical variables as numeric. Note from the statistical summary that the categorical variables have skewed / imbalanced distributions:

- TYP is skewed to 1’s (26/74 split of 0/1)
- COMA is predominantly 0’s (95/5 split of 0/1)
- INF is relatively more balanced (58/42 split of 0/1)
- STA is skewed to 0’s (80/20 split of 0/1).

df <- read_csv("icu.csv") %>%
mutate(COMA = ifelse(LOC == 2, 1, 0),
STA = as.character(STA)) %>%
select(TYP, COMA, AGE, INF, STA)
# summary stats
df_stats <- df %>% mutate(TYP = factor(TYP),
COMA = factor(COMA),
INF = factor(INF),
STA = factor(STA))
summary(df_stats)
## TYP COMA AGE INF STA
## 0: 53 0:190 Min. :16.00 0:116 0:160
## 1:147 1: 10 1st Qu.:46.75 1: 84 1: 40
## Median :63.00
## Mean :57.55
## 3rd Qu.:72.00
## Max. :92.00
Next let’s partition the data into a 70/30 split of training and test sets. Since the target class is imbalanced, we need to make sure that the training and test sets both have representative distributions of the target class. We can see below that the target class distribution is 80/20 in both the training and test sets.
# create training/test partitions
set.seed(3)
n <- nrow(df)
train_idx <- sample(n, size = 0.7 * n)
train_df <- df[train_idx, ]
test_df <- df[-train_idx, ]
# check class distributions
labelcol <- 5 # col of class labels
prop.table(table(STA_train = train_df[[labelcol]]))
## STA_train
## 0 1
## 0.8 0.2
prop.table(table(STA_test = test_df[[labelcol]]))
## STA_test
## 0 1
## 0.8 0.2
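For reference, here is a sketch (not used in the analysis below) of an explicitly stratified split, which preserves the 80/20 class proportions by construction rather than relying on the random draw:
# sketch: sample 70% of the row indices within each STA class, then combine
set.seed(3)
strat_idx <- unlist(lapply(split(seq_len(nrow(df)), df$STA),
                           function(idx) sample(idx, size = round(0.7 * length(idx)))))
train_strat <- df[strat_idx, ]
test_strat <- df[-strat_idx, ]
prop.table(table(STA_train = train_strat$STA))
prop.table(table(STA_test = test_strat$STA))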
Now we create k-nearest neighbor classifiers based on the model \(\text{STA} \sim \text{TYP} + \text{COMA} + \text{AGE} + \text{INF}\), for various values of the parameter \(k = 3, 5, 7, 15, 25, 50\). We do this by adapting the generic code provided. Then we evaluate the kNN classifiers using prediction accuracy on the test set.
# distance metric
euclideanDist <- function(a, b) {
  sqrt(sum((a - b)^2))
}
# knn classifier
knn_predict <- function(test_data, train_data, k_value, labelcol) {
  pred <- character()  # initialize predicted classes
  # loop1 - test data
  for (i in 1:nrow(test_data)) {
    eu_dist <- numeric()  # initialize distances and classes
    eu_char <- character()
    # loop2 - training data
    for (j in 1:nrow(train_data)) {
      # calc euclidean distance between test data & training data;
      # only use columns before labelcol in test data (since we add
      # prediction columns to test df)
      eu_dist <- c(eu_dist, euclideanDist(test_data[i, 1:(labelcol - 1)],
                                          train_data[j, -labelcol]))
      # record class variable of training data
      eu_char <- c(eu_char, as.character(train_data[j, labelcol]))
    }
    eu <- data.frame(eu_char, eu_dist)  # dataframe of class & distance
    eu <- eu[order(eu$eu_dist), ]       # sort by ascending distance
    eu <- eu[1:k_value, ]               # keep the k nearest neighbors
    tbl.sm.df <- table(eu$eu_char)      # tabulate classes of nearest neighbors
    cl_label <- names(tbl.sm.df)[which.max(tbl.sm.df)]  # choose most frequent class
    pred <- c(pred, cl_label)
  }
  return(pred)  # return predicted classes
}
# accuracy metric
accuracy <- function(test_data, labelcol, predcol) {
  correct <- sum(test_data[[labelcol]] == test_data[[predcol]])
  accu <- (correct / nrow(test_data)) * 100
  return(accu)
}
# k values for kNN
krange <- c(3, 5, 7, 15, 25, 50)
knum <- length(krange)
# accuracy metric & contingency table
acc1 <- numeric()
confmat1 <- list()
# loop over k values
for (j in 1:knum) {
  kval <- krange[j]
  kname <- paste0("k=", kval)
  # calc kNN predictions & add to test df
  predictions <- knn_predict(test_df, train_df, kval, labelcol)
  test_df[kname] <- predictions
  # compute accuracy and contingency table
  acc1[kname] <- accuracy(test_df, labelcol, labelcol + j)
  confmat1[[kname]] <- addmargins(table(Pred = factor(test_df[[labelcol + j]],
                                                      levels = c("0", "1")),
                                        Obs = test_df[[labelcol]]))
}
It’s important to realize that in the kNN algorithm, we used the Euclidean distance metric on the 4 predictors without adjustment, which gives equal weight to each predictor regardless of scale. However, the scales are vastly different for the 3 categorical predictors (which consist of 0’s and 1’s) and the AGE predictor (which ranges from 16 to 92). As a result, the distance metric, and therefore the kNN output, is driven primarily by the AGE variable.
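To illustrate with two hypothetical patients (not rows from the dataset), a 25-year age difference contributes far more to the Euclidean distance than flipping all three binary predictors:
# hypothetical patients differing in all three binary predictors and by 25 years of age
p1 <- c(TYP = 1, COMA = 0, AGE = 40, INF = 1)
p2 <- c(TYP = 0, COMA = 1, AGE = 65, INF = 0)
euclideanDist(p1, p2)  # sqrt(1 + 1 + 625 + 1) = 25.06, essentially just the age gap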
Let’s plot the accuracy as a function of \(k\), the number of nearest neighbors for each kNN model. We see that the accuracy initially increases from 68% for \(k=3\) to 77% for \(k=7\), but then plateaus at 80% for \(k \ge 15\). Recall that the class distribution of STA is imbalanced, with an 80/20 split of 0’s and 1’s. Consider the simple classifier that predicts all cases to be 0. This simple classifier when applied to the test set will achieve an accuracy of 80%, since it correctly classifies all the 0 cases which comprise 80% of the test data. Returning to our kNN classifier, as k increases from 3 to 5 to 7, etc., eventually all test cases will be predicted to be 0, since the k nearest neighbors will be dominated by the 0 cases in the training set for large enough k. This explains the behavior of the accuracy curve.
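As a quick check of the 80% figure, the accuracy of the all-“0” baseline on our test split can be computed directly:
# accuracy of predicting "0" for every test case (48 of the 60 test cases are "0")
mean(test_df$STA == "0") * 100  # 48 / 60 * 100 = 80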
# plot accuracy
df_plot <- data.frame(k = krange, Accuracy = acc1)
ggplot(df_plot) + geom_line(aes(x = k, y = Accuracy)) +
  labs(title = "Accuracy of kNN classifiers (without scaling predictors)",
       x = "k = number of nearest neighbors",
       y = "Accuracy")
We can verify from the contingency tables below that for \(k \ge 15\), the kNN model classifies all instances into the “0” class. This results in an 80% accuracy rate, consistent with our explanation above.
# contingency tables
confmat1
## $`k=3`
## Obs
## Pred 0 1 Sum
## 0 40 11 51
## 1 8 1 9
## Sum 48 12 60
##
## $`k=5`
## Obs
## Pred 0 1 Sum
## 0 43 10 53
## 1 5 2 7
## Sum 48 12 60
##
## $`k=7`
## Obs
## Pred 0 1 Sum
## 0 44 10 54
## 1 4 2 6
## Sum 48 12 60
##
## $`k=15`
## Obs
## Pred 0 1 Sum
## 0 48 12 60
## 1 0 0 0
## Sum 48 12 60
##
## $`k=25`
## Obs
## Pred 0 1 Sum
## 0 48 12 60
## 1 0 0 0
## Sum 48 12 60
##
## $`k=50`
## Obs
## Pred 0 1 Sum
## 0 48 12 60
## 1 0 0 0
## Sum 48 12 60
In order to normalize the impact of scale on the Euclidean distance metric, we do a second run in which we center and scale the predictor variables: we subtract each variable’s mean and divide by its standard deviation, so that each variable has a mean of 0 and a standard deviation of 1 and all are expressed on the same scale. While the accuracy curves with scaling (below) and without scaling (above) have similar shapes, the curve with scaling reaches the 80% level earlier, at \(k \ge 5\). However, this may occur by chance and may depend on this particular sample; reviewing the contingency tables, we see that the kNN classifier continues to make “1” predictions through \(k=15\), and then predicts “0” for every case thereafter.
train_df2 <- train_df
test_df2 <- test_df[1:labelcol]
# center & scale predictor variables
centerscale <- function(x) (x - mean(x)) / sd(x)
for (j in 1:4) {
  train_df2[[j]] <- centerscale(train_df2[[j]])
  test_df2[[j]] <- centerscale(test_df2[[j]])
}
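As an aside (a sketch under our own assumptions, not used in the results below), the code above standardizes the training and test sets each with their own means and standard deviations; a common alternative is to reuse the training-set statistics for the test set so that both sets share exactly the same scale:
# alternative (not used below): scale the test predictors with the training-set mean and sd
train_df2b <- train_df
test_df2b <- test_df[1:labelcol]
for (j in 1:4) {
  mu <- mean(train_df[[j]])
  sigma <- sd(train_df[[j]])
  train_df2b[[j]] <- (train_df2b[[j]] - mu) / sigma
  test_df2b[[j]] <- (test_df2b[[j]] - mu) / sigma
}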
# accuracy metric & contingency table
acc2 <- numeric()
confmat2 <- list()
# loop over k values
for (j in 1:knum) {
  kval <- krange[j]
  kname <- paste0("k=", kval)
  # calc kNN predictions & add to test df
  predictions <- knn_predict(test_df2, train_df2, kval, labelcol)
  test_df2[kname] <- predictions
  # compute accuracy and contingency table
  acc2[kname] <- accuracy(test_df2, labelcol, labelcol + j)
  confmat2[[kname]] <- addmargins(table(Pred = factor(test_df2[[labelcol + j]],
                                                      levels = c("0", "1")),
                                        Obs = test_df2[[labelcol]]))
}
# plot accuracy
df2_plot <- data.frame(k = krange, Accuracy = acc2)
ggplot(df2_plot) + geom_line(aes(x = k, y = Accuracy)) +
  labs(title = "Accuracy of kNN classifiers (after scaling predictors)",
       x = "k = number of nearest neighbors",
       y = "Accuracy")
# contingency tables
confmat2
## $`k=3`
## Obs
## Pred 0 1 Sum
## 0 42 8 50
## 1 6 4 10
## Sum 48 12 60
##
## $`k=5`
## Obs
## Pred 0 1 Sum
## 0 45 9 54
## 1 3 3 6
## Sum 48 12 60
##
## $`k=7`
## Obs
## Pred 0 1 Sum
## 0 45 9 54
## 1 3 3 6
## Sum 48 12 60
##
## $`k=15`
## Obs
## Pred 0 1 Sum
## 0 47 11 58
## 1 1 1 2
## Sum 48 12 60
##
## $`k=25`
## Obs
## Pred 0 1 Sum
## 0 48 12 60
## 1 0 0 0
## Sum 48 12 60
##
## $`k=50`
## Obs
## Pred 0 1 Sum
## 0 48 12 60
## 1 0 0 0
## Sum 48 12 60
Finally, to check our work, we use the sknn function from the klaR package. The accuracy measures agree with those we computed above.
# check using sknn function from klaR
train_df3 <- train_df
test_df3 <- test_df[1:labelcol]
train_df4 <- train_df2
test_df4 <- test_df2[1:labelcol]
acc3 <- numeric()
confmat3 <- list()
acc4 <- numeric()
confmat4 <- list()
# loop over k values
for (j in 1:knum) {
  kval <- krange[j]
  kname <- paste0("k=", kval)
  # calc kNN predictions & add to test df
  predictions3 <- sknn(factor(STA) ~ ., train_df3, kn = kval) %>%
    predict(newdata = test_df3)
  predictions4 <- sknn(factor(STA) ~ ., train_df4, kn = kval) %>%
    predict(newdata = test_df4)
  test_df3[kname] <- predictions3$class
  test_df4[kname] <- predictions4$class
  # compute accuracy and contingency table
  acc3[kname] <- accuracy(test_df3, labelcol, labelcol + j)
  acc4[kname] <- accuracy(test_df4, labelcol, labelcol + j)
  confmat3[[kname]] <- addmargins(table(Pred = factor(test_df3[[labelcol + j]],
                                                      levels = c("0", "1")),
                                        Obs = test_df3[[labelcol]]))
  confmat4[[kname]] <- addmargins(table(Pred = factor(test_df4[[labelcol + j]],
                                                      levels = c("0", "1")),
                                        Obs = test_df4[[labelcol]]))
}
# compare accuracy to klaR
cbind(acc1, acc3, acc2, acc4) %>%
kable(col.names = c("Non-scaled: Calc", "Non-scaled: klaR", "Scaled: Calc", "Scaled: klaR"),
caption = "Accuracy comparison vs. k: Calculated & from klaR")
| | Non-scaled: Calc | Non-scaled: klaR | Scaled: Calc | Scaled: klaR |
|---|---|---|---|---|
| k=3 | 68.33333 | 68.33333 | 76.66667 | 76.66667 |
| k=5 | 75.00000 | 75.00000 | 80.00000 | 80.00000 |
| k=7 | 76.66667 | 76.66667 | 80.00000 | 80.00000 |
| k=15 | 80.00000 | 80.00000 | 80.00000 | 80.00000 |
| k=25 | 80.00000 | 80.00000 | 80.00000 | 80.00000 |
| k=50 | 80.00000 | 80.00000 | 80.00000 | 80.00000 |
In summary, we have seen the following:
Imbalanced target classes: the target variable STA has an imbalanced class distribution (80/20 split of 0/1). As a result, for high enough \(k\), eventually the kNN model will classify all instances as “0”, and the accuracy will saturate at 80%. For the training/test split analyzed here, this behavior occurs at \(k \ge 15\) (if the variables are not standardized) or \(k \ge 25\) (if the variables are standardized). We expect that the more imbalanced the distribution, the earlier this behavior occurs (i.e., the smaller the value of \(k\) where the accuracy saturates at the dominant class proportion).
Scaling of different predictors: the AGE predictor, which is numeric (ranging from 16 to 92), has a dramatically different scale than the other predictors, which are categorical (0’s and 1’s). This affects the distance metric and therefore the predictions of the kNN classifier. The distance metric used in the kNN algorithm, as implemented in this exercise, is the Euclidean distance, which gives equal weight to each predictor dimension. Since the AGE variable has a much larger magnitude than the other variables, its variability tends to dominate the distance calculation. This can be mitigated by standardizing the variables (i.e., centering and scaling), in which case the distance metric is equally affected by the variability of each predictor. We saw from the contingency tables that this impacted the accuracy profile of the kNN classifier.
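As a closing aside (not part of the original exercise), accuracy alone can mask the saturation behavior described above, while class-specific metrics make it visible. A sketch using the contingency table stored earlier for the unscaled \(k = 3\) classifier:
# sensitivity and specificity from the k = 3 (unscaled) contingency table
cm <- confmat1[["k=3"]][1:2, 1:2]             # drop the Sum margins; rows = Pred, cols = Obs
sensitivity <- cm["1", "1"] / sum(cm[, "1"])  # 1 / 12, about 0.083
specificity <- cm["0", "0"] / sum(cm[, "0"])  # 40 / 48, about 0.833
c(sensitivity = sensitivity, specificity = specificity)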