Building a Classification ML Model using kNN

Introduction

Dataset

Loading the diabetes data

Now, let’s load some data built into the mclust package, convert it into a tibble, and explore it a little (a tibble is the tidyverse way of storing rectangular data). We have a tibble with 145 cases and 4 variables. The class factor shows that 76 of the cases were non-diabetic (Normal), 36 were chemically diabetic (Chemical), and 33 were overtly diabetic (Overt). The other three variables are continuous measures of the level of blood glucose and insulin after a glucose tolerance test (glucose and insulin, respectively), and the steady-state level of blood glucose (sspg).

library(mlr)
library(tidyverse)
data(diabetes, package = "mclust")
diabetesTib <- as_tibble(diabetes)
summary(diabetesTib)

##       class       glucose       insulin            sspg      
##  Chemical:36   Min.   : 70   Min.   :  45.0   Min.   : 10.0  
##  Normal  :76   1st Qu.: 90   1st Qu.: 352.0   1st Qu.:118.0  
##  Overt   :33   Median : 97   Median : 403.0   Median :156.0  
##                Mean   :122   Mean   : 540.8   Mean   :186.1  
##                3rd Qu.:112   3rd Qu.: 558.0   3rd Qu.:221.0  
##                Max.   :353   Max.   :1568.0   Max.   :748.0

diabetesTib

## # A tibble: 145 x 4
##    class  glucose insulin  sspg
##    <fct>    <dbl>   <dbl> <dbl>
##  1 Normal      80     356   124
##  2 Normal      97     289   117
##  3 Normal     105     319   143
##  4 Normal      90     356   199
##  5 Normal      90     323   240
##  6 Normal      86     381   157
##  7 Normal     100     350   221
##  8 Normal      85     301   186
##  9 Normal      97     379   142
## 10 Normal      97     296   131
## # … with 135 more rows

Plotting the diabetes data

To show how these variables are related, we plot them against each other in the figure.

Figure: Plotting the relationships between variables in diabetesTib. All three combinations of the continuous variables are shown, shaded by class.

ggplot(diabetesTib, aes(glucose, insulin, col = class)) +
  geom_point()  +
  theme_bw()

ggplot(diabetesTib, aes(sspg, insulin, col = class)) +
  geom_point() +
  theme_bw()

ggplot(diabetesTib, aes(sspg, glucose, col = class)) +
  geom_point() +
  theme_bw()

Looking at the data, we can see there are differences in the continuous variables among the three classes, so let’s build a kNN classifier that we can use to predict diabetes status from measurements of future patients.

Our dataset consists only of continuous predictor variables, but often we may be working with categorical predictor variables too. The kNN algorithm can’t handle categorical variables natively; they need to first be encoded somehow, or distance metrics other than Euclidean distance must be used.
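One common way to encode a categorical predictor is dummy (one-hot) coding. The following is a minimal sketch in base R, not part of the diabetes workflow: the `patients` data frame and its `sex` column are hypothetical and exist only to illustrate the idea.

```r
# Hypothetical data: one continuous and one categorical predictor
patients <- data.frame(
  glucose = c(80, 97, 105),
  sex     = factor(c("F", "M", "F"))
)

# model.matrix() expands the factor into 0/1 indicator columns;
# "- 1" drops the intercept so every factor level gets its own column
encoded <- model.matrix(~ . - 1, data = patients)
encoded
```

The resulting matrix is entirely numeric, so Euclidean distance can be computed on it, although the indicator columns should be scaled with the same care as any other predictor.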

It’s also very important for kNN (and many machine learning algorithms) to scale the predictor variables by dividing them by their standard deviation. This preserves the relationships between the variables, but ensures that variables measured on larger scales aren’t given more importance by the algorithm. In the current example, if we divided the glucose and insulin variables by 1,000,000, then predictions would rely mostly on the value of the sspg variable. We don’t need to scale the predictors ourselves because, by default, the kNN algorithm wrapped by the mlr package does this for us.
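To make the effect of scale concrete, here is a small base R sketch using the first three rows of diabetesTib as printed earlier. It only illustrates the arithmetic; it is not how mlr handles scaling internally.

```r
# First three rows of diabetesTib, copied from the printed tibble
measurements <- data.frame(
  glucose = c(80, 97, 105),
  insulin = c(356, 289, 319),
  sspg    = c(124, 117, 143)
)

# Divide each column by its own standard deviation
scaled <- as.data.frame(lapply(measurements, function(x) x / sd(x)))

# After scaling, every column has a standard deviation of 1
sapply(scaled, sd)

# Shrink glucose and insulin by a factor of 1,000,000, as in the text
shrunk <- measurements
shrunk$glucose <- shrunk$glucose / 1e6
shrunk$insulin <- shrunk$insulin / 1e6

# The distances between cases are now almost entirely determined by sspg
dist(shrunk)
dist(measurements["sspg"])
```

The last two distance matrices are nearly identical, which is why an unscaled variable on a much larger scale can dominate the neighbor search.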

Using mlr to train the kNN model

We understand the problem we’re trying to solve (classifying new patients into one of three classes), and now we need to train the kNN algorithm to build a model that will solve that problem. Building a machine learning model with the mlr package has three main stages:

Define the task. The task consists of the data and what we want to do with it. In this case, the data is diabetesTib, and we want to classify the data with the class variable as the target variable.
Define the learner. The learner is simply the name of the algorithm we plan to use, along with any additional arguments the algorithm accepts.
Train the model. This stage is what it sounds like: you pass the task to the learner, and the learner generates a model that you can use to make future predictions.

Let’s begin by defining our task. The components needed to define a task are

* The data containing the predictor variables (variables we hope contain the information needed to make predictions/solve our problem)
* The target variable we want to predict

Defining the task

For supervised learning, the target variable will be categorical if we have a classification problem, and continuous if we have a regression problem. For unsupervised learning, we omit the target variable from our task definition, as we don’t have access to labeled data.

Defining a task in mlr. A task definition consists of the data containing the predictor variables and, for classification and regression problems, a target variable we want to predict. For unsupervised learning, the target is omitted.

We want to build a classification model, so we use the makeClassifTask() function to define a classification task. When we build regression and clustering models in parts 3 and 5 of the book, we’ll use makeRegrTask() and makeClusterTask(), respectively. We supply the name of our tibble as the data argument and the name of the factor that contains the class labels as the target argument:

diabetesTask <- makeClassifTask(data = diabetesTib, target = "class")
diabetesTask

## Supervised task: diabetesTib
## Type: classif
## Target: class
## Observations: 145
## Features:
##    numerics     factors     ordered functionals 
##           3           0           0           0 
## Missings: FALSE
## Has weights: FALSE
## Has blocking: FALSE
## Has coordinates: FALSE
## Classes: 3
## Chemical   Normal    Overt 
##       36       76       33 
## Positive class: NA

Calling the task, as we did above, shows that it’s a classification task on the diabetesTib tibble, whose target is the class variable. We also get some information about the number of observations and the number of different types of variables. Some additional information includes whether we have missing data, the number of observations in each class, and which class is considered to be the “positive” class (only relevant for two-class tasks).

Defining the learner

Next, let’s define our learner. The components needed to define a learner are as follows:

The class of algorithm we are using (“classif.” for classification, “regr.” for regression, “cluster.” for clustering)
The algorithm we are using
Any additional options we may wish to use to control the algorithm

As you’ll see, the first and second components are combined together in a single character argument to define which algorithm will be used (for example, “classif.knn”).

Defining a learner in mlr. A learner definition consists of the class of algorithm you want to use, the name of the individual algorithm, and, optionally, any additional arguments to control the algorithm’s behavior.

We use the makeLearner() function to define a learner. The first argument to the makeLearner() function is the algorithm that we’re going to use to train our model. In this case, we want to use the kNN algorithm, so we supply “classif.knn” as the argument. See how this is the class (“classif.”) joined to the name (“knn”) of the algorithm?

The argument par.vals stands for parameter values, which allows us to specify the number of nearest neighbors, k, we want the algorithm to use. For now, we’ll just set this to 2, but we’ll discuss how to choose k soon:

knn <- makeLearner("classif.knn", par.vals = list("k" = 2))
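To see what k controls, the following base R sketch classifies one new case by majority vote among its k nearest training cases. It is an illustration of the idea only, not the implementation mlr wraps: the function name `knn_predict_one` is hypothetical, the first two training rows come from the tibble printed earlier, and the two "Overt" rows are made-up values.

```r
# Classify a single new case by majority vote among its k nearest neighbors
knn_predict_one <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from new_x to every training row
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Indices of the k closest training cases
  nearest <- order(distances)[seq_len(k)]
  # Count the class labels among those neighbors and return the most common
  votes <- table(train_y[nearest])
  names(votes)[which.max(votes)]
}

train_x <- matrix(c( 80,  356, 124,   # Normal (from the printed tibble)
                     97,  289, 117,   # Normal (from the printed tibble)
                    300, 1200,  50,   # Overt (made-up values)
                    320, 1300,  40),  # Overt (made-up values)
                  ncol = 3, byrow = TRUE)
train_y <- factor(c("Normal", "Normal", "Overt", "Overt"))

predicted <- knn_predict_one(train_x, train_y, new_x = c(90, 320, 120), k = 3)
predicted
```

With k = 3, the two Normal cases and one Overt case are the nearest neighbors, so the vote goes to Normal. Note that which.max() breaks ties by taking the first level, which is one reason the choice of k matters.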

How to list all of mlr’s algorithms

The mlr package has a large number of machine learning algorithms that we can give to the makeLearner() function, more than I can remember without checking! To list all the available learners, simply use

listLearners()$class

Or list them by function:

listLearners("classif")$class
listLearners("regr")$class
listLearners("cluster")$class

If you’re ever unsure which algorithms are available to you or which argument to pass to makeLearner() for a particular algorithm, use these functions to remind yourself.

Training the model

Now that we’ve defined our task and our learner, we can train our model. The components needed to train a model are the learner and task we defined earlier. The whole process of defining the task and learner and combining them to train the model is shown in the figure.

Training a model in mlr. Training a model simply consists of combining a learner with a task.

This is achieved with the train() function, which takes the learner as the first argument and the task as its second argument:

knnModel <- train(knn, diabetesTask)

We have our model, so let’s pass the data through it to see how it performs. The predict() function takes unlabeled data and passes it through the model to get the predicted classes. The first argument is the model, and the data being passed to it is given as the newdata argument:

knnPred <- predict(knnModel, newdata = diabetesTib)

## Warning in predict.WrappedModel(knnModel, newdata = diabetesTib): Provided data
## for prediction is not a pure data.frame but from class tbl_df, hence it will be
## converted.

We can pass these predictions as the first argument of the performance() function. This function compares the classes predicted by the model to the true classes, and returns performance metrics of how well the predicted and true values match each other. Use of the predict() and performance() functions is illustrated in the figure.

A summary of the predict() and performance() functions of mlr. predict() passes observations into a model and outputs the predicted values. performance() compares these predicted values to the cases’ true values and outputs one or more performance metrics summarizing the similarity between the two.

We specify which performance metrics we want the function to return by supplying them as a list to the measures argument. The two measures I’ve asked for are mmce, the mean misclassification error; and acc, or accuracy. MMCE is simply the proportion of cases classified as a class other than their true class. Accuracy is the complement of this: the proportion of cases that were correctly classified by the model. You can see that the two sum to 1.00:

performance(knnPred, measures = list(mmce, acc))

##       mmce        acc 
## 0.07586207 0.92413793
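Both measures are simple proportions, which the following base R sketch computes by hand. The `truth` and `predicted` vectors are hypothetical toy labels, not the model’s actual predictions.

```r
# Hypothetical true and predicted classes for five cases
truth <- factor(c("Normal", "Normal", "Chemical", "Overt", "Overt"))
predicted <- factor(c("Normal", "Chemical", "Chemical", "Overt", "Normal"),
                    levels = levels(truth))

# MMCE: proportion of cases whose predicted class differs from the true class
mmce_by_hand <- mean(predicted != truth)

# Accuracy: proportion of cases whose predicted class matches the true class
acc_by_hand <- mean(predicted == truth)

c(mmce = mmce_by_hand, acc = acc_by_hand)

# The mmce reported above corresponds to 11 misclassified cases out of 145
round(11 / 145, 8)
```

Because every case is either correctly or incorrectly classified, the two proportions always sum to 1.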

So our model is correctly classifying 92.4% of cases! Does this mean it will perform well on new, unseen patients? The truth is that we don’t know. Evaluating model performance by asking it to make predictions on data you used to train it in the first place tells you very little about how the model will perform when making predictions on completely unseen data. Therefore, you should never evaluate model performance this way.
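One simple alternative is to hold back some cases from training and evaluate only on those. The sketch below assumes the mlr and mclust packages are installed and uses the subset arguments of train() and predict(). The 80/20 split and the seed are arbitrary choices made for illustration, and the resulting numbers will differ from those above.

```r
library(mlr)

# Rebuild the task and learner so this sketch runs on its own
data(diabetes, package = "mclust")
diabetesTask <- makeClassifTask(data = as.data.frame(diabetes), target = "class")
knn <- makeLearner("classif.knn", par.vals = list("k" = 2))

# Randomly assign 80% of the cases to training (arbitrary split and seed)
set.seed(123)
n <- getTaskSize(diabetesTask)
trainIdx <- sample(n, size = round(0.8 * n))
testIdx <- setdiff(seq_len(n), trainIdx)

# Train on the training cases only
holdoutModel <- train(knn, diabetesTask, subset = trainIdx)

# Predict and evaluate on the cases the model has never seen
holdoutPred <- predict(holdoutModel, task = diabetesTask, subset = testIdx)
holdoutPerf <- performance(holdoutPred, measures = list(mmce, acc))
holdoutPerf
```

A single random split like this gives a noisy estimate on a dataset of 145 cases, so treat the result as a rough indication rather than a definitive measure of performance.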