Q1: Naive Bayes

First we load and review the data.

df <- read_csv("HW1-Q1-40.csv", skip = 1, n_max = 14, 
               col_types = "fffff") %>% 
    rename(age = "age-group", credit = "credit_rating", prospect = "class:prospect")
summary(df)
##      age      networth        status        credit  prospect
##  youth :5   high  :4   employed  :7   fair     :8   no :5   
##  middle:4   medium:6   unemployed:7   excellent:6   yes:9   
##  senior:5   low   :4

Prior probabilities

The prior probabilities are the (unconditioned) class probabilities, which can be estimated from the dataset:

\[\Pr(\text{prospect})\]

(prior <- prop.table(table(prospect = df$prospect)))
## prospect
##        no       yes 
## 0.3571429 0.6428571

Conditional probabilities

The conditional probabilities are given by the probability distributions for each predictor, conditioned on class:

\[\Pr(X_i | \text{prospect})\]

where \(X_i\) refers to the predictor variables age, net worth, employment status, and credit rating.

# age conditionals and labels
(cond_age <- table(age = df$age, prospect = df$prospect) %>% 
    prop.table(margin = 2)) 
##         prospect
## age             no       yes
##   youth  0.6000000 0.2222222
##   middle 0.0000000 0.4444444
##   senior 0.4000000 0.3333333
names_age <- rownames(cond_age)
# net worth conditionals and labels
(cond_networth <- table(networth = df$networth, prospect = df$prospect) %>% 
    prop.table(margin = 2))
##         prospect
## networth        no       yes
##   high   0.4000000 0.2222222
##   medium 0.4000000 0.4444444
##   low    0.2000000 0.3333333
names_networth <- rownames(cond_networth)
# status conditionals and labels
(cond_status <- table(status = df$status, prospect = df$prospect) %>% 
    prop.table(margin = 2))
##             prospect
## status              no       yes
##   employed   0.8000000 0.3333333
##   unemployed 0.2000000 0.6666667
names_status <- rownames(cond_status)
# credit rating conditionals and labels
(cond_credit <- table(credit = df$credit, prospect = df$prospect) %>% 
    prop.table(margin = 2)) 
##            prospect
## credit             no       yes
##   fair      0.4000000 0.6666667
##   excellent 0.6000000 0.3333333
names_credit <- rownames(cond_credit)
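
Note that one of the conditional probabilities above is exactly zero (\(\Pr(\text{middle} | \text{no}) = 0\)), which will force the corresponding posteriors to be exactly 0 or 1 regardless of the other predictors. A common remedy, not used in the calculations below, is Laplace smoothing, which adds a pseudo-count to every cell; a minimal sketch for the age table:

# optional sketch (not used below): Laplace smoothing adds 1 to every cell so that
# no conditional probability is exactly zero
cond_age_smooth <- (table(age = df$age, prospect = df$prospect) + 1) %>% 
    prop.table(margin = 2)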

Posterior probabilities

The posterior probabilities are the class probabilities conditioned on the predictors, which, by Bayes’ theorem, can be expressed as:

\[ \begin{array}{rcl} \Pr(\text{prospect} | X) &=& \frac{\Pr(X | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)}\\ &=& \frac{\Pr(X_1, X_2, X_3, X_4 | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)} \end{array} \]

where \(X = (X_1, X_2, X_3, X_4)\). Then, assuming that the predictors \(X_1, X_2, X_3\), and \(X_4\) are conditionally independent given the class (the naive Bayes assumption), we can write:

\[\Pr(X_1, X_2, X_3, X_4 | \text{prospect}) = \Pr(X_1 | \text{prospect}) \Pr(X_2 | \text{prospect}) \Pr(X_3 | \text{prospect}) \Pr(X_4 | \text{prospect})\]

from which the posterior probabilities can be expressed as:

\[\Pr(\text{prospect} | X) = \frac{\Pr(X_1 | \text{prospect}) \Pr(X_2 | \text{prospect}) \Pr(X_3 | \text{prospect}) \Pr(X_4 | \text{prospect}) \Pr(\text{prospect})}{\Pr(X)}\]

In this expression, the prior probabilities \(\Pr(\text{prospect})\) and the conditional probabilities \(\Pr(X_i | \text{prospect})\) are estimated from the dataset, as given above. The denominator can be computed explicitly as:

\[\Pr(X) = \Pr(X | \text{prospect = no}) \Pr(\text{prospect = no}) + \Pr(X | \text{prospect = yes}) \Pr(\text{prospect = yes})\]

Alternatively, rather than calculating the denominator explicitly, we need only compute the two numerators (for \(\text{prospect = no}\) and \(\text{prospect = yes}\)) and then normalize them so they sum to one, which yields the posteriors.
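
As a concrete check before tabulating all combinations, here is a minimal sketch of the calculation for a single instance (age = youth, net worth = high, status = employed, credit = fair), using the prior and conditional tables computed above; it should reproduce the first row of the table below (approximately 0.795 / 0.205).

# worked example (single instance): numerators for prospect = no / yes, then normalize
num_no  <- prior["no"]  * cond_age["youth", "no"]  * cond_networth["high", "no"] * 
    cond_status["employed", "no"]  * cond_credit["fair", "no"]
num_yes <- prior["yes"] * cond_age["youth", "yes"] * cond_networth["high", "yes"] * 
    cond_status["employed", "yes"] * cond_credit["fair", "yes"]
c(num_no, num_yes) / (num_no + num_yes)    # posterior probabilities (no, yes)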

# build dataframe of all posteriors
df_post <- tibble()
for (a in names_age) {
  for (b in names_networth) {
    for (c in names_status) { 
      for (d in names_credit) { 
        num_no <- prior['no'] * cond_age[a, 'no'] * cond_networth[b, 'no'] * 
          cond_status[c, 'no'] * cond_credit[d, 'no']
        num_yes <- prior['yes'] * cond_age[a, 'yes'] * cond_networth[b, 'yes'] * 
          cond_status[c, 'yes'] * cond_credit[d, 'yes']
        df_post <- rbind(df_post, c(a, b, c, d, num_no, num_yes), stringsAsFactors = FALSE)
      }
    }
  }
}
colnames(df_post) <- c("age", "networth", "status", "credit", "num_no", "num_yes")
# normalize probabilities
df_post <- df_post %>% 
  mutate(num_no = as.double(num_no), num_yes = as.double(num_yes)) %>% 
  mutate(post_no = num_no / (num_no + num_yes), post_yes = num_yes / (num_no + num_yes)) %>% 
  dplyr::select(-num_no, -num_yes)
# display
df_post %>% kable(caption = "Computed posterior probabilities")
Computed posterior probabilities
age networth status credit post_no post_yes
youth high employed fair 0.7954173 0.2045827
youth high employed excellent 0.9210360 0.0789640
youth high unemployed fair 0.3270525 0.6729475
youth high unemployed excellent 0.5931652 0.4068348
youth medium employed fair 0.6603261 0.3396739
youth medium employed excellent 0.8536300 0.1463700
youth medium unemployed fair 0.1954948 0.8045052
youth medium unemployed excellent 0.4216310 0.5783690
youth low employed fair 0.5644599 0.4355401
youth low employed excellent 0.7954173 0.2045827
youth low unemployed fair 0.1394148 0.8605852
youth low unemployed excellent 0.3270525 0.6729475
middle high employed fair 0.0000000 1.0000000
middle high employed excellent 0.0000000 1.0000000
middle high unemployed fair 0.0000000 1.0000000
middle high unemployed excellent 0.0000000 1.0000000
middle medium employed fair 0.0000000 1.0000000
middle medium employed excellent 0.0000000 1.0000000
middle medium unemployed fair 0.0000000 1.0000000
middle medium unemployed excellent 0.0000000 1.0000000
middle low employed fair 0.0000000 1.0000000
middle low employed excellent 0.0000000 1.0000000
middle low unemployed fair 0.0000000 1.0000000
middle low unemployed excellent 0.0000000 1.0000000
senior high employed fair 0.6334311 0.3665689
senior high employed excellent 0.8382924 0.1617076
senior high unemployed fair 0.1776316 0.8223684
senior high unemployed excellent 0.3932039 0.6067961
senior medium employed fair 0.4635193 0.5364807
senior medium employed excellent 0.7216036 0.2783964
senior medium unemployed fair 0.0974729 0.9025271
senior medium unemployed excellent 0.2447130 0.7552870
senior low employed fair 0.3654822 0.6345178
senior low employed excellent 0.6334311 0.3665689
senior low unemployed fair 0.0671642 0.9328358
senior low unemployed excellent 0.1776316 0.8223684

Check using klaR

Finally, we check our work using the NaiveBayes function from the klaR package. The prior, conditional, and posterior probabilities agree with the results computed above, with one caveat: for the \(\text{age = middle}\) cases, where \(\Pr(\text{middle} | \text{no}) = 0\), predict.NaiveBayes replaces the zero cell with a small value (its threshold argument, 0.001 by default), so its posteriors for those cases are close to, but not exactly, 0 and 1 (e.g., 0.0032 / 0.9968 rather than 0 / 1).

nb <- NaiveBayes(prospect ~ ., df)
nb$apriori
## grouping
##        no       yes 
## 0.3571429 0.6428571
nb$tables
## $age
##         var
## grouping     youth    middle    senior
##      no  0.6000000 0.0000000 0.4000000
##      yes 0.2222222 0.4444444 0.3333333
## 
## $networth
##         var
## grouping      high    medium       low
##      no  0.4000000 0.4000000 0.2000000
##      yes 0.2222222 0.4444444 0.3333333
## 
## $status
##         var
## grouping  employed unemployed
##      no  0.8000000  0.2000000
##      yes 0.3333333  0.6666667
## 
## $credit
##         var
## grouping      fair excellent
##      no  0.4000000 0.6000000
##      yes 0.6666667 0.3333333
data.frame(nb$x, predict(nb)$posterior) %>% 
  kable(caption = "Posterior probabilities from naive Bayes classifier")
Posterior probabilities from naive Bayes classifier
age networth status credit no yes
youth high employed fair 0.7954173 0.2045827
youth high employed excellent 0.9210360 0.0789640
middle high employed fair 0.0032295 0.9967705
senior medium employed fair 0.4635193 0.5364807
senior low unemployed fair 0.0671642 0.9328358
senior low unemployed excellent 0.1776316 0.8223684
middle low unemployed excellent 0.0004048 0.9995952
youth medium employed fair 0.6603261 0.3396739
youth low unemployed fair 0.1394148 0.8605852
senior medium unemployed fair 0.0974729 0.9025271
youth medium unemployed excellent 0.4216310 0.5783690
middle medium employed excellent 0.0048365 0.9951635
middle high unemployed fair 0.0004048 0.9995952
senior medium employed excellent 0.7216036 0.2783964

Q2: Exploratory data analysis

Data exploration

First we load and summarize the data. We observe the following:

  • Fortunately, there are no missing values and no extreme outliers in either dataset
  • Both datasets have the same predictor and target variables (predictors a and b and target class); however, the target classes are labeled differently in the two datasets (1/2 in the first dataset and 0/1 in the second)
  • The datasets have very different sizes (100 vs. 4000 observations)
  • Dataset #1 (“junk1”)
    • 100 observations
    • Target class distribution: balanced 50/50 split of 1/2
    • Predictors a & b
      • Relatively symmetric distributions
      • Mean & median close to 0
      • Standard deviations of 1.3 and 1.4
      • Almost uncorrelated (slight correlation of -0.10)
  • Dataset #2 (“junk2”)
    • 4000 observations
    • Target class distribution: imbalanced 94/6 split of 0/1
    • Predictors a & b
      • Relatively symmetric distributions
      • Mean & median close to 0
      • Standard deviations of approximately 1.3 for both
      • Almost uncorrelated (slight correlation of 0.07)
raw1 <- read_table2("junk1.txt", col_types = "ddf")
summary(raw1)
##        a                  b            class 
##  Min.   :-2.29854   Min.   :-3.17174   1:50  
##  1st Qu.:-0.85014   1st Qu.:-1.04712   2:50  
##  Median :-0.04754   Median :-0.07456         
##  Mean   : 0.04758   Mean   : 0.01324         
##  3rd Qu.: 1.09109   3rd Qu.: 1.05342         
##  Max.   : 3.00604   Max.   : 3.10230
(sapply(raw1[1:2], sd))    # standard deviations of predictors a & b
##        a        b 
## 1.267740 1.446067
cor(raw1[1:2])
##            a          b
## a  1.0000000 -0.1021507
## b -0.1021507  1.0000000
raw2 <- read_csv("junk2.csv", col_types = "ddf")
summary(raw2)
##        a                  b            class   
##  Min.   :-4.16505   Min.   :-3.90472   0:3750  
##  1st Qu.:-1.01447   1st Qu.:-0.89754   1: 250  
##  Median : 0.08754   Median :-0.08358           
##  Mean   :-0.05126   Mean   : 0.05624           
##  3rd Qu.: 0.89842   3rd Qu.: 1.00354           
##  Max.   : 4.62647   Max.   : 4.31052
(sapply(raw2[1:2], sd))    # standard deviations of predictors a & b
##        a        b 
## 1.298076 1.314386
cor(raw2[1:2])
##            a          b
## a 1.00000000 0.07392388
## b 0.07392388 1.00000000

Next we visualize the distribution of the data. We note the following:

  • Dataset #1
    • The conditional distributions of the predictor variables (conditioned on class) are not localized and they overlap substantially throughout the sample a-b space
    • The conditional distributions appear almost bimodal (i.e., having two modes)
    • Both classes have similar means and variances across both predictor dimensions.
  • Dataset #2
    • The conditional distributions of the predictor variables are well defined, particularly the class 1 distribution, which appears to be localized; there is much less class overlap in the sample a-b space than in dataset #1
    • The conditional distribution for class 0 appears bimodal, whereas the distribution for class 1 appears approximately normal
    • The target classes have distinguishable means and variances across both predictor dimensions.
# scatter plots
ggplot(raw1) + geom_point(aes(x = a, y = b, col = class)) + 
  labs(title = "Scatter plot of dataset #1")

ggplot(raw2) + geom_point(aes(x = a, y = b, col = class)) + 
  labs(title = "Scatter plot of dataset #2")

# pairs plots
ggpairs(raw1, aes(col = factor(class), alpha = 0.5), title = "Pairs plot of dataset #1")

ggpairs(raw2, aes(col = factor(class), alpha = 0.5), title = "Pairs plot of dataset #2")

Questions to ask

At this point it would be good to ask questions and get some guidance. In particular:

  • What is the objective of the exercise?
  • What is the relationship between the two datasets?
    • Do the predictors a and b and the target class correspond to the same variables in both datasets?
    • What is the relationship between classes 1 & 2 in dataset #1 and classes 0 & 1 in dataset #2?

Preliminary classification modeling

We assume here (based on the answers to the questions above) that the purpose of the exercise is to develop classification models and that we should keep the two datasets separate.

Based on the data exploration above, we might expect classification to be more successful for dataset #2 than for dataset #1, since:

  • In dataset #1, the target classes are not clearly separable and overlap substantially throughout the sample a-b space
  • In dataset #2, the target classes are more clearly separable and overlap only locally in the a-b space. However, we should keep in mind that dataset #2 is highly imbalanced, which will likely impair classifier performance.

We try various classification algorithms using the klaR package, including:

  • Linear discriminant analysis (lda)
  • Quadratic discriminant analysis (qda)
  • Recursive partitioning and regression tree (rpart)
  • Naive Bayes (naiveBayes)
  • Regularized discriminant analysis (rda)
  • Simple k-nearest neighbors (sknn)

Note that for all the classification methods above, we use the default options from klaR (including k = 3 for the kNN classifier). We can visualize the separation boundaries for each dataset and classifier using the partimat function, which plots correctly and incorrectly classified cases in black and red, respectively, and reports the apparent error rate.

Reviewing the partition plots for dataset #1, it appears that the following algorithms provide reasonable classifiers:

  • Quadratic discriminant analysis (0.23 error rate)
  • Recursive partitioning and regression tree (0.2 error rate)
  • Regularized discriminant analysis (0.23 error rate)
  • K-nearest neighbors (0.16 error rate), although there is evidence of over-fitting (which might be mitigated by increasing k).

However, the linear discriminant analysis and naive Bayes classifiers are not a good fit for this dataset, since they produce near-linear separation boundaries that are not effective at separating classes that overlap throughout the sample space.

# methods to try
methods <- c("lda", "qda", "rpart", "naiveBayes", "rda", "sknn")

# junk1
for (j in methods) {
  partimat(class ~ a + b, data = raw1, 
           method = j, 
           main = paste0("2D partition of dataset #1: ", j, " classifier"))
}

For dataset #2, it appears from the partition plots below that none of the methods provide satisfactory classifiers. The kNN classifier achieves a 4% error rate, but there is evidence of over-fitting (which might be mitigated by increasing k). None of the remaining algorithms provide any classification beyond grouping all instances into the dominant class (94% class “0”), which results in a baseline 6% error rate. The challenge with dataset #2 lies in the imbalanced class distribution: since the target class is split 94/6 into 0/1, almost all of the algorithms default to the baseline classifier, whereas we actually want a classifier that produces a closed (circular / oval) separation boundary around the class “1” instances, which are localized in the sample a-b space. None of the classification methods considered here achieve this type of boundary.

# junk2
for (j in methods) {
  partimat(class ~ a + b, data = raw2, 
           method = j, 
           main = paste0("2D partition of dataset #2: ", j, " classifier"))
}

Recommendations

Based on the analysis above, our recommendations are the following:

  • Clarify the objective of the exercise
  • Determine the relationship between the two datasets, and between the target labels in the two datasets
  • For dataset #1:
    • Gather more data if possible, since the dataset includes only 100 instances and the two classes have substantial overlap in the sample a-b space
      • For instance, is there another variable that might help separate the classes?
      • Or are there additional observations that might help distinguish the classes in the existing variable space?
    • Consider other non-linear classification methods
  • For dataset #2:
    • Consider other non-linear classification methods; as we saw above, the ideal separation boundary would be a circular / oval shape around the localized density of class 1 instances (a sketch of one such approach follows below).
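
One possible direction, sketched below purely as an illustration (it is not part of the analysis above): quadratic discriminant analysis allows the class priors to be overridden, which counteracts the 94/6 imbalance, and with a localized class 1 its quadratic boundary can form a closed, roughly elliptical region. This is a sketch under those assumptions, not a tuned model.

# sketch: QDA with equal class priors so the minority class is not swamped by the imbalance
library(MASS)
qda_fit <- qda(class ~ a + b, data = raw2, prior = c(0.5, 0.5))
mean(predict(qda_fit)$class == raw2$class)   # resubstitution accuracy (optimistic; a fair estimate needs a held-out set)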

Q3: K-nearest neighbors

Data preparation

First we read the icu.csv file and prepare the dataset, including:

  • Creating the variable COMA, which equals 1 if LOC is 2 and 0 otherwise
  • Selecting the variables TYP, COMA, AGE, INF, and STA.

AGE is a numeric variable while the other variables are categorical; however, for modeling purposes, we will treat the categorical variables as numeric. Note from the statistical summary that the categorical variables have skewed / imbalanced distributions:

  • TYP is skewed to 1’s (26/74 split of 0/1)
  • COMA is predominantly 0’s (95/5 split of 0/1)
  • INF is relatively more balanced (58/42 split of 0/1)
  • STA is skewed to 0’s (80/20 split of 0/1).
df <- read_csv("icu.csv") %>% 
    mutate(COMA = ifelse(LOC == 2, 1, 0),
           STA = as.character(STA)) %>%
    dplyr::select(TYP, COMA, AGE, INF, STA)   # dplyr::select avoids masking by MASS::select (loaded with klaR)

# summary stats
df_stats <- df %>% mutate(TYP = factor(TYP), 
                          COMA = factor(COMA), 
                          INF = factor(INF), 
                          STA = factor(STA))
summary(df_stats)
##  TYP     COMA         AGE        INF     STA    
##  0: 53   0:190   Min.   :16.00   0:116   0:160  
##  1:147   1: 10   1st Qu.:46.75   1: 84   1: 40  
##                  Median :63.00                  
##                  Mean   :57.55                  
##                  3rd Qu.:72.00                  
##                  Max.   :92.00

Next, let’s partition the data into a 70/30 split of training and test sets. Since the target class is imbalanced, we should check that the training and test sets both have representative class distributions. With the seed used here, the simple random split happens to give the same 80/20 class distribution in both the training and test sets, as shown below (a stratified alternative is sketched after the check).

# create training/test partitions
set.seed(3)
n <- nrow(df)
train_idx <- sample(n, size = 0.7 * n)
train_df <- df[train_idx, ]
test_df <- df[-train_idx, ]

# check class distributions
labelcol <- 5       # col of class labels 
prop.table(table(STA_train = train_df[[labelcol]]))
## STA_train
##   0   1 
## 0.8 0.2
prop.table(table(STA_test = test_df[[labelcol]]))
## STA_test
##   0   1 
## 0.8 0.2
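
If a given seed did not happen to yield representative splits, we could stratify the split explicitly, for example by sampling 70% of the row indices within each class (a sketch using base R; not used in the analysis below):

# sketch: stratified 70/30 split by sampling within each class of STA
set.seed(3)
train_idx_strat <- unlist(lapply(split(seq_len(n), df$STA), 
                                 function(idx) sample(idx, size = round(0.7 * length(idx)))))
prop.table(table(STA_strat = df$STA[train_idx_strat]))   # 80/20 by construction (112 of 160 zeros, 28 of 40 ones)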

Modeling

Now we create k-nearest neighbor classifiers based on the model \(\text{STA} \sim \text{TYP} + \text{COMA} + \text{AGE} + \text{INF}\), for various values of the parameter \(k = 3, 5, 7, 15, 25, 50\). We do this by adapting the generic code provided. Then we evaluate the kNN classifiers using prediction accuracy on the test set.

# distance metric
euclideanDist <- function(a, b) {
    sqrt(sum( (a - b)^2 ))
}

# knn classifier
knn_predict <- function(test_data, train_data, k_value, labelcol) {
    pred <- character()       # initialize predicted classes 
    
    # loop1 - test data
    for (i in 1:nrow(test_data)) {
        eu_dist = numeric()     # initialize distances and classes
        eu_char = character() 
        
        # loop2 - training data 
        for (j in 1:nrow(train_data)) {
            # calc euclidean distance between test data & training data;
            # only use columns before labelcol in test data (since we add 
            # prediction columns to test df)
            eu_dist <- c(eu_dist, euclideanDist(test_data[i, 1:(labelcol-1)], 
                                                train_data[j, -labelcol]))
            # record class variable of training data
            eu_char <- c(eu_char, as.character(train_data[j, labelcol]))
        }
    
        eu <- data.frame(eu_char, eu_dist)  # dataframe of class & distance
        eu <- eu[order(eu$eu_dist), ]       # sort by ascending distance
        eu <- eu[1:k_value, ]               # dataframe with K nearest neighbors
 
        tbl.sm.df <- table(eu$eu_char)      # tabulate classes of nearest neighbors
        cl_label <- names(tbl.sm.df)[which.max(tbl.sm.df)]    # choose most frequent class
        pred <- c(pred, cl_label)
    }
    return(pred)    # return predicted classes
}

# accuracy metric
accuracy <- function(test_data, labelcol, predcol) {
  correct <- sum(test_data[[labelcol]] == test_data[[predcol]])
  accu = (correct / nrow(test_data)) * 100  
  return(accu)
}

# k values for kNN
krange <- c(3, 5, 7, 15, 25, 50)
knum <- length(krange)

# accuracy metric & contingency table
acc1 <- numeric()
confmat1 <- list()

# loop over k values
for (j in 1:knum) {
    kval <- krange[j]
    kname <- paste0("k=", kval)
    
    # calc kNN predictions & add to test df
    predictions <- knn_predict(test_df, train_df, kval, labelcol) 
    test_df[kname] <- predictions     
    
    # compute accuracy and contingency table
    acc1[kname] <- accuracy(test_df, labelcol, labelcol + j)
    confmat1[[kname]] <- addmargins(table(Pred = factor(test_df[[labelcol + j]], 
                                                        levels = c("0", "1")), 
                                          Obs = test_df[[labelcol]]))
}

It’s important to realize that in the kNN algorithm we used the Euclidean distance metric on the four predictors without adjustment, which gives equal weight to each predictor regardless of scale. However, the scales are vastly different for the three categorical predictors (which consist of 0’s and 1’s) and the AGE predictor (which varies from 16 to 92). As a result, the distance metric, and therefore the kNN output, will be determined primarily by the AGE variable.
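
A quick numeric illustration with two hypothetical patients (values chosen only for illustration): a 10-year age difference contributes far more to the unscaled distance than flipping every one of the binary indicators.

# illustration (hypothetical values): AGE dominates the unscaled Euclidean distance
p1 <- c(TYP = 1, COMA = 0, AGE = 60, INF = 0)
p2 <- c(TYP = 0, COMA = 1, AGE = 70, INF = 1)
euclideanDist(p1, p2)          # sqrt(1 + 1 + 100 + 1) ~ 10.1, driven almost entirely by AGE
euclideanDist(p1[-3], p2[-3])  # the three binary indicators alone contribute only sqrt(3) ~ 1.7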

Results

Let’s plot the accuracy as a function of \(k\), the number of nearest neighbors for each kNN model. We see that the accuracy initially increases from 68% for \(k=3\) to 77% for \(k=7\), and then plateaus at 80% for \(k \ge 15\). Recall that the class distribution of STA is imbalanced, with an 80/20 split of 0’s and 1’s. Consider the simple classifier that predicts every case to be 0: applied to the test set, it achieves an accuracy of 80%, since it correctly classifies all the 0 cases, which comprise 80% of the test data. Returning to our kNN classifier, as k increases from 3 to 5 to 7 and beyond, eventually every test case is predicted to be 0, since for large enough k the k nearest neighbors are dominated by the 0 cases in the training set. This explains the shape of the accuracy curve.
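
We can confirm the 80% baseline directly (a one-line sanity check on the test set):

# baseline: accuracy of always predicting "0" on the test set
mean(test_df$STA == "0") * 100    # 80, the majority-class proportion in the test set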

# plot accuracy
df_plot <- data.frame(k = krange, Accuracy = acc1)
ggplot(df_plot) + geom_line(aes(x = k, y = Accuracy)) + 
  labs(title = "Accuracy of kNN classifiers (without scaling predictors)", 
       x = "k = number of nearest neighbors", 
       y = "Accuracy")

We can verify from the contingency tables below that for \(k \ge 15\), the kNN model classifies all instances into the “0” class. This results in an 80% accuracy rate, consistent with our explanation above.

# contingency tables
confmat1
## $`k=3`
##      Obs
## Pred   0  1 Sum
##   0   40 11  51
##   1    8  1   9
##   Sum 48 12  60
## 
## $`k=5`
##      Obs
## Pred   0  1 Sum
##   0   43 10  53
##   1    5  2   7
##   Sum 48 12  60
## 
## $`k=7`
##      Obs
## Pred   0  1 Sum
##   0   44 10  54
##   1    4  2   6
##   Sum 48 12  60
## 
## $`k=15`
##      Obs
## Pred   0  1 Sum
##   0   48 12  60
##   1    0  0   0
##   Sum 48 12  60
## 
## $`k=25`
##      Obs
## Pred   0  1 Sum
##   0   48 12  60
##   1    0  0   0
##   Sum 48 12  60
## 
## $`k=50`
##      Obs
## Pred   0  1 Sum
##   0   48 12  60
##   1    0  0   0
##   Sum 48 12  60

In order to normalize the impact of scale on the Euclidean distance metric, we do a second run in which we center and scale the predictor variables: we subtract each variable’s mean and divide by its standard deviation, so that the resulting variables have mean 0 and standard deviation 1 and are expressed on the same scale. While the accuracy curves with scaling (below) and without scaling (above) have similar shapes, the curve with scaling reaches the 80% level earlier, at \(k \ge 5\). However, this may occur by chance and be specific to this particular sample; reviewing the contingency tables, we see that the scaled kNN classifier continues to make some “1” predictions through \(k=15\), and makes all “0” predictions thereafter.

train_df2 <- train_df
test_df2 <- test_df[1:labelcol]
# center & scale predictor variables
centerscale <- function(x) (x - mean(x)) / sd(x)
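# note: each set is scaled with its own mean/sd here for simplicity; strictly, the
# test set should be standardized using the training-set means and SDs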
for (j in 1:4) {
    train_df2[[j]] <- centerscale(train_df2[[j]])
    test_df2[[j]] <- centerscale(test_df2[[j]])
}

# accuracy metric & contingency table
acc2 <- numeric()
confmat2 <- list()

# loop over k values
for (j in 1:knum) {
    kval <- krange[j]
    kname <- paste0("k=", kval)
    
    # calc kNN predictions & add to test df
    predictions <- knn_predict(test_df2, train_df2, kval, labelcol) 
    test_df2[kname] <- predictions     
    
    # compute accuracy and contingency table
    acc2[kname] <- accuracy(test_df2, labelcol, labelcol + j)
    confmat2[[kname]] <- addmargins(table(Pred = factor(test_df2[[labelcol + j]], 
                                                        levels = c("0", "1")), 
                                          Obs = test_df2[[labelcol]]))
}
# plot accuracy
df2_plot <- data.frame(k = krange, Accuracy = acc2)
ggplot(df2_plot) + geom_line(aes(x = k, y = Accuracy)) + 
  labs(title = "Accuracy of kNN classifiers (after scaling predictors)", 
       x = "k = number of nearest neighbors", 
       y = "Accuracy")

# contingency tables
confmat2
## $`k=3`
##      Obs
## Pred   0  1 Sum
##   0   42  8  50
##   1    6  4  10
##   Sum 48 12  60
## 
## $`k=5`
##      Obs
## Pred   0  1 Sum
##   0   45  9  54
##   1    3  3   6
##   Sum 48 12  60
## 
## $`k=7`
##      Obs
## Pred   0  1 Sum
##   0   45  9  54
##   1    3  3   6
##   Sum 48 12  60
## 
## $`k=15`
##      Obs
## Pred   0  1 Sum
##   0   47 11  58
##   1    1  1   2
##   Sum 48 12  60
## 
## $`k=25`
##      Obs
## Pred   0  1 Sum
##   0   48 12  60
##   1    0  0   0
##   Sum 48 12  60
## 
## $`k=50`
##      Obs
## Pred   0  1 Sum
##   0   48 12  60
##   1    0  0   0
##   Sum 48 12  60

Check using klaR

Finally, to check our work, we use the sknn function from the klaR package. The accuracy measures agree with those we computed above.

# check using sknn function from klaR
train_df3 <- train_df
test_df3 <- test_df[1:labelcol]
train_df4 <- train_df2
test_df4 <- test_df2[1:labelcol]
acc3 <- numeric()
confmat3 <- list()
acc4 <- numeric()
confmat4 <- list()

# loop over k values
for (j in 1:knum) {
    kval <- krange[j]
    kname <- paste0("k=", kval)
    
    # calc kNN predictions & add to test df
    predictions3 <- sknn(factor(STA) ~ ., train_df3, kn = kval) %>% 
      predict(newdata = test_df3)
    predictions4 <- sknn(factor(STA) ~ ., train_df4, kn = kval) %>% 
      predict(newdata = test_df4)
    test_df3[kname] <- predictions3$class     
    test_df4[kname] <- predictions4$class     
    
    # compute accuracy and contingency table
    acc3[kname] <- accuracy(test_df3, labelcol, labelcol + j)
    acc4[kname] <- accuracy(test_df4, labelcol, labelcol + j)
    confmat3[[kname]] <- addmargins(table(Pred = factor(test_df3[[labelcol + j]], 
                                                        levels = c("0", "1")), 
                                          Obs = test_df3[[labelcol]]))
    confmat4[[kname]] <- addmargins(table(Pred = factor(test_df4[[labelcol + j]], 
                                                        levels = c("0", "1")), 
                                          Obs = test_df4[[labelcol]]))
}
# compare accuracy to klaR
cbind(acc1, acc3, acc2, acc4) %>% 
  kable(col.names = c("Non-scaled: Calc", "Non-scaled: klaR", "Scaled: Calc", "Scaled: klaR"), 
        caption = "Accuracy comparison vs. k: Calculated & from klaR")
Accuracy comparison vs. k: Calculated & from klaR
k     Non-scaled: Calc   Non-scaled: klaR   Scaled: Calc   Scaled: klaR
k=3   68.33333           68.33333           76.66667       76.66667
k=5   75.00000           75.00000           80.00000       80.00000
k=7   76.66667           76.66667           80.00000       80.00000
k=15  80.00000           80.00000           80.00000       80.00000
k=25  80.00000           80.00000           80.00000       80.00000
k=50  80.00000           80.00000           80.00000       80.00000

Summary of findings

In summary, we have seen the following:

  • Imbalanced target classes: the target variable STA has an imbalanced class distribution (80/20 split of 0/1). As a result, for high enough \(k\), eventually the kNN model will classify all instances as “0”, and the accuracy will saturate at 80%. For the training/test split analyzed here, this behavior occurs at \(k \ge 15\) (if the variables are not standardized) or \(k \ge 25\) (if the variables are standardized). We expect that the more imbalanced the distribution, the earlier this behavior occurs (i.e., the smaller the value of \(k\) where the accuracy saturates at the dominant class proportion).

  • Scaling of different predictors: the AGE predictor, which is numeric (ranging from 16 to 92), has a dramatically different scale than the other predictors, which are categorical (0’s and 1’s). This affects the distance metric and therefore the predictions of the kNN classifier. The distance metric used in the kNN algorithm, as implemented in this exercise, is the Euclidean distance, which gives equal weight to each predictor dimension. Since the AGE variable has a much larger magnitude than the other variables, its variability tends to dominate the distance calculation. This can be mitigated by standardizing the variables (i.e., centering and scaling), in which case the distance metric is affected equally by the variability of each predictor. We saw from the contingency tables that this changed the accuracy profile of the kNN classifier.