Course: MBA563
# load needed packages
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages('e1071')
library(e1071)
# Confusion Matrix function
my_confusion_matrix <- function(cf_table) {
  # cf_table is a 2x2 table with predictions in rows and truth in columns,
  # where the second factor level ("high") is treated as the positive class
  true_positive <- cf_table[4]
  true_negative <- cf_table[1]
  false_positive <- cf_table[2]
  false_negative <- cf_table[3]
  accuracy <- (true_positive + true_negative) / (true_positive + true_negative + false_positive + false_negative)
  sensitivity_recall <- true_positive / (true_positive + false_negative)
  specificity_selectivity <- true_negative / (true_negative + false_positive)
  precision <- true_positive / (true_positive + false_positive)
  neg_pred_value <- true_negative / (true_negative + false_negative)
  print(cf_table)
  my_list <- list(sprintf("%1.0f = True Positive (TP), Hit", true_positive),
                  sprintf("%1.0f = True Negative (TN), Rejection", true_negative),
                  sprintf("%1.0f = False Positive (FP), Type 1 Error", false_positive),
                  sprintf("%1.0f = False Negative (FN), Type 2 Error", false_negative),
                  sprintf("%1.4f = Accuracy (TP+TN/(TP+TN+FP+FN))", accuracy),
                  sprintf("%1.4f = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))", sensitivity_recall),
                  sprintf("%1.4f = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))", specificity_selectivity),
                  sprintf("%1.4f = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))", precision),
                  sprintf("%1.4f = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN))", neg_pred_value)
  )
  return(my_list)
}
We will now use the KNN algorithm to address our business problem.
TECA is planning for the future and would like to set up its business so it is not so reliant on selling gas. One way to do that is to increase the sales of profitable products. Thus, TECA would like to predict when a transaction is going to be a sale of a high gross profit margin product. Profit margin is the ratio of gross profit to revenue, that is, revenue minus costs divided by revenue. It expresses what percentage of profit is earned for each dollar of revenue.
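To make that definition concrete, here is a tiny illustration with made-up numbers (not TECA data) of how gross profit margin is computed:
# Illustration only: hypothetical revenue and cost, not TECA data
revenue <- 2.50   # selling price of one item
cost    <- 1.00   # cost of that item
(revenue - cost) / revenue   # gross profit margin = 0.6, i.e., 60 cents of profit per revenue dollar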
The first thing we need to do is bring in the dataset we are going to
work with. This is TECA data that has been transformed to be used in our
KNN analysis. Let's load it and examine it a bit. We first notice that the dataset has 20,000 rows and 38 columns. This data is aggregated at the level of a purchase; thus, each example or row is the purchase of one item. The first column is the target feature, or dependent variable; this is the variable we are trying to predict. It is called high_gpm and is a factor that is either low or high. Next,
we see revenue. This variable is the amount of revenue TECA
makes for each purchase. Revenue is a continuous variable. Next, we have
four variables related to quarter. These are one-hot encoded or dummy
variables for the quarter of the year. For example, if the purchase
happened during the first quarter of the year, quarter.1
would have a 1 value and the other three quarters’ variables would have
a 0. To use the KNN algorithm, all of the features except the target feature must be represented as numbers. Thus, we created these four dummy variables from a single feature called quarter that had four different types of entries: 'quarter 1', 'quarter 2', 'quarter 3', and 'quarter 4'. Next, income, bachelors_degree, and population report these values for the location of the store where the purchase took place.
Next, we have another set of dummy variables–for each of the states TECA
operates in. Next, we have num_trans, which indicates how
many purchases were made by this same person as part of this
transaction. basket.no and basket.yes measure
whether the purchase was part of a multi-part transaction.
refill.no and refill.yes indicate whether or
not the purchase was a refill of fountain soda. The next 11 dummy variables, which all start with area, indicate the area of the store in which the purchase was made, as follows:
* area.alcohol: products in this area include, for example, wine and beer
* area.cooler: products in this area include, for example, energy drinks, canned soda, and juice
* area.dispensed: products in this area include, for example, cold and hot dispensed (fountain) drinks
* area.fresh: products in this area include, for example, pizza, hot sandwiches, salads, and roller grill items
* area.fuel: products in this area include, for example, gas
* area.grocery: products in this area include, for example, milk, eggs, and cheese
* area.lottery: products in this area include, for example, lottery tickets
* area.miscellaneous: products in this area include, for example, store services and coupons
* area.nongrocery: products in this area include, for example, clothing, magazines, medicine, and newspapers
* area.snacks: products in this area include, for example, candy, gum, chips, and salty snacks
* area.tobacco: products in this area include, for example, cigarettes and chewing tobacco
Next, items_sold indicates how many of these items were purchased. Finally, the two loyalty dummies indicate whether the purchase was made by a loyalty customer (a customer who scanned their loyalty card) or not. Note that the data was adjusted so that about
half of the purchases in this dataset were made by loyalty customers,
since TECA is particularly interested in these customers.
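As an aside, if you ever need to build dummy variables like these yourself, here is a minimal sketch on a toy data frame; it is not the code that prepared the TECA data, just one common way to do it in base R:
# Minimal sketch of one-hot (dummy) encoding on toy data -- not the actual TECA preparation code
toy <- data.frame(quarter = factor(c('quarter 1', 'quarter 2', 'quarter 3', 'quarter 4')))
# model.matrix() builds one 0/1 column per factor level; '- 1' drops the intercept so all levels appear
model.matrix(~ quarter - 1, data = toy)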
knn_input <- read_rds('knn_input.rds')
str(knn_input)
## 'data.frame': 20000 obs. of 38 variables:
## $ high_gpm : Factor w/ 2 levels "low","high": 2 1 1 2 2 2 1 1 1 1 ...
## $ revenue : num 0.99 27.25 2.49 2.26 3 ...
## $ quarter.1 : int 1 0 1 1 0 0 0 0 0 0 ...
## $ quarter.2 : int 0 1 0 0 1 1 1 0 0 0 ...
## $ quarter.3 : int 0 0 0 0 0 0 0 0 1 0 ...
## $ quarter.4 : int 0 0 0 0 0 0 0 1 0 1 ...
## $ income : int 65268 58854 61115 45874 43547 41827 45874 66250 70963 61115 ...
## $ bachelors_degree : int 118 68 264 2488 502 169 2488 22 1959 264 ...
## $ population : int 1003 734 2048 30327 10615 2379 30327 567 16052 2048 ...
## $ state_province.Alabama : int 0 0 0 0 1 0 0 0 1 0 ...
## $ state_province.Arkansas : int 0 0 0 1 0 0 1 0 0 0 ...
## $ state_province.Colorado : int 0 0 0 0 0 0 0 0 0 0 ...
## $ state_province.Iowa : int 1 0 0 0 0 0 0 1 0 0 ...
## $ state_province.Minnesota : int 0 0 0 0 0 0 0 0 0 0 ...
## $ state_province.Missouri : int 0 0 0 0 0 0 0 0 0 0 ...
## $ state_province.Nebraska : int 0 0 0 0 0 0 0 0 0 0 ...
## $ state_province.Oklahoma : int 0 0 1 0 0 1 0 0 0 1 ...
## $ state_province.South Dakota: int 0 1 0 0 0 0 0 0 0 0 ...
## $ state_province.Wyoming : int 0 0 0 0 0 0 0 0 0 0 ...
## $ num_trans : int 1 1 1 1 1 1 1 1 1 1 ...
## $ basket.no : int 1 1 1 1 1 1 1 1 1 1 ...
## $ basket.yes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ refill.no : int 1 1 1 1 1 1 1 1 1 1 ...
## $ refill.yes : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.alcohol : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.cooler : int 0 0 1 0 0 0 0 0 0 0 ...
## $ area.dispensed : int 0 0 0 0 0 1 0 0 0 0 ...
## $ area.fresh : int 1 0 0 1 1 0 0 0 0 0 ...
## $ area.fuel : int 0 1 0 0 0 0 1 1 0 1 ...
## $ area.grocery : int 0 0 0 0 0 0 0 0 1 0 ...
## $ area.lottery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.miscellaneous : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.nongrocery : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.snacks : int 0 0 0 0 0 0 0 0 0 0 ...
## $ area.tobacco : int 0 0 0 0 0 0 0 0 0 0 ...
## $ items_sold : num 1 1 1 1 3 1 1 1 1 1 ...
## $ loyalty2.not loyal : int 1 1 1 1 1 1 1 1 1 1 ...
## $ loyalty2.loyal : int 0 0 0 0 0 0 0 0 0 0 ...
## - attr(*, "dummies")=List of 6
## ..$ quarter : int [1:4] 3 4 5 6
## ..$ state_province: int [1:10] 10 11 12 13 14 15 16 17 18 19
## ..$ basket : int [1:2] 21 22
## ..$ refill : int [1:2] 23 24
## ..$ area : int [1:11] 25 26 27 28 29 30 31 32 33 34 ...
## ..$ loyalty2 : int [1:2] 37 38
slice_sample(knn_input, n=10)
## high_gpm revenue quarter.1 quarter.2 quarter.3 quarter.4 income
## 1402640 low 8.00 1 0 0 0 45278
## 81954 low 1.69 0 0 1 0 43438
## 381626 high 1.00 1 0 0 0 58594
## 393438 low 1.79 1 0 0 0 43547
## 116217 low 1.00 0 0 0 1 37917
## 450690 low 54.34 0 0 0 1 89196
## 964038 low 5.79 0 0 1 0 37130
## 857079 low 3.59 0 0 1 0 70963
## 353406 low 5.98 0 0 0 1 58386
## 296850 high 2.00 0 1 0 0 60350
## bachelors_degree population state_province.Alabama
## 1402640 162 1741 0
## 81954 51 279 0
## 381626 1167 8624 0
## 393438 502 10615 1
## 116217 10 546 0
## 450690 90 423 0
## 964038 1238 15153 0
## 857079 1959 16052 1
## 353406 7602 39333 0
## 296850 4076 55532 0
## state_province.Arkansas state_province.Colorado state_province.Iowa
## 1402640 0 0 0
## 81954 0 1 0
## 381626 0 0 0
## 393438 0 0 0
## 116217 1 0 0
## 450690 0 0 0
## 964038 0 1 0
## 857079 0 0 0
## 353406 0 1 0
## 296850 0 1 0
## state_province.Minnesota state_province.Missouri
## 1402640 0 0
## 81954 0 0
## 381626 0 1
## 393438 0 0
## 116217 0 0
## 450690 0 0
## 964038 0 0
## 857079 0 0
## 353406 0 0
## 296850 0 0
## state_province.Nebraska state_province.Oklahoma
## 1402640 0 1
## 81954 0 0
## 381626 0 0
## 393438 0 0
## 116217 0 0
## 450690 0 1
## 964038 0 0
## 857079 0 0
## 353406 0 0
## 296850 0 0
## state_province.South Dakota state_province.Wyoming num_trans basket.no
## 1402640 0 0 1 1
## 81954 0 0 1 1
## 381626 0 0 1 1
## 393438 0 0 1 1
## 116217 0 0 1 1
## 450690 0 0 1 1
## 964038 0 0 1 1
## 857079 0 0 1 1
## 353406 0 0 1 1
## 296850 0 0 1 1
## basket.yes refill.no refill.yes area.alcohol area.cooler area.dispensed
## 1402640 0 1 0 0 0 0
## 81954 0 1 0 0 0 0
## 381626 0 1 0 0 0 1
## 393438 0 1 0 0 1 0
## 116217 0 1 0 0 0 0
## 450690 0 1 0 0 0 0
## 964038 0 1 0 0 0 0
## 857079 0 1 0 0 0 0
## 353406 0 1 0 0 1 0
## 296850 0 1 0 0 0 0
## area.fresh area.fuel area.grocery area.lottery area.miscellaneous
## 1402640 0 1 0 0 0
## 81954 0 0 0 0 0
## 381626 0 0 0 0 0
## 393438 0 0 0 0 0
## 116217 0 0 0 1 0
## 450690 0 1 0 0 0
## 964038 0 0 0 0 0
## 857079 0 0 0 0 0
## 353406 0 0 0 0 0
## 296850 1 0 0 0 0
## area.nongrocery area.snacks area.tobacco items_sold loyalty2.not loyal
## 1402640 0 0 0 1 1
## 81954 0 1 0 1 0
## 381626 0 0 0 1 0
## 393438 0 0 0 1 1
## 116217 0 0 0 1 0
## 450690 0 0 0 1 0
## 964038 0 0 1 1 1
## 857079 0 0 1 1 1
## 353406 0 0 0 2 0
## 296850 0 0 0 2 0
## loyalty2.loyal
## 1402640 0
## 81954 1
## 381626 1
## 393438 0
## 116217 1
## 450690 1
## 964038 0
## 857079 0
## 353406 1
## 296850 1
Let's look at the target feature in more depth. For KNN analysis, the target feature is a categorical variable. In this implementation we can leave it as a factor. About 44% of these purchases are for high gross profit margin items.
freq <- table(knn_input$high_gpm)
freq[2]/(freq[1]+freq[2])
## high
## 0.43695
contrasts(knn_input$high_gpm)
## high
## low 0
## high 1
Before using the algorithm, we need to prepare the data. The first line below loads the caret package. The next line sets the seed for the randomization that the algorithm will use, so the results are reproducible. The caret::createDataPartition() function splits the data: it randomly selects row numbers for a proportion p, in this case 0.75, of the data, stratified on the target feature, and returns those row numbers in a matrix (since list is FALSE). Next, the training data and testing data are created using the row numbers in the partition matrix we just created. data_train keeps the rows whose numbers are in partition, while data_test keeps the rows whose numbers are not in partition, i.e., -partition.
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
set.seed(77)
partition <- caret::createDataPartition(y=knn_input$high_gpm, p=.75, list=FALSE)
data_train <- knn_input[partition, ]
data_test <- knn_input[-partition, ]
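As an optional sanity check (not part of the original steps), we could confirm that the stratified split kept the share of high gross profit margin purchases roughly the same in both sets:
# Optional check: class balance should be similar in the training and testing sets
prop.table(table(data_train$high_gpm))
prop.table(table(data_test$high_gpm))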
Next, we need to remove the dependent variable, high_gpm, to create training and testing data without the target variable (X_train and X_test), as well as training and testing vectors that contain only the dependent variable (y_train and y_test).
X_train <- data_train %>% select(-high_gpm)
X_test <- data_test %>% select(-high_gpm)
y_train <- data_train$high_gpm
y_test <- data_test$high_gpm
Next, let's standardize our data. Recall that KNN uses a distance function to determine the nearest neighbors, so we need all of the variables to be on similar scales. The scale() function below returns matrices whose variables have been standardized with z-score standardization: each feature's mean is subtracted from each value, and the result is divided by that feature's standard deviation. This transformation rescales each feature so that it has a mean of zero and a standard deviation of one; each value is then measured in how many standard deviations it falls above or below the feature's mean.
X_train <- scale(X_train)
X_test <- scale(X_test)
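For intuition, here is what that z-score calculation looks like written out by hand for a single column; it should match what scale() produced above for the revenue column:
# For intuition: a hand-rolled z-score for one column, equivalent to what scale() did above
rev_raw <- data_train$revenue
rev_z   <- (rev_raw - mean(rev_raw)) / sd(rev_raw)
head(rev_z)   # should match head(X_train[, 'revenue']) from the scaled training matrix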
Let's just double check that our training data is the correct percentage of the whole and that the number of rows in the training inputs matches the number of training labels.
nrow(X_train)/(nrow(X_test)+nrow(X_train))
## [1] 0.75005
dim(X_train)
## [1] 15001 37
length(y_train)
## [1] 15001
Finally, let's run the model. Remember that this algorithm works differently from some of the others you will see. It is what we call lazy because it does not build a model ahead of time; rather, it stores all of the training data and then uses it directly to classify the test data. Thus, the training step and the prediction step are combined into one.
To implement KNN we use one function that accepts four arguments:
* the training data with the label column/target variable removed,
* the testing data with the label column/target variable removed,
* the class labels (cl)/target variable for the training data (but not the testing data, since that is what is being predicted), and
* the k that we select.
How did I select this k? There is no one right way to do this, but I just took the square root of the total number of rows.
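For reference, that rule of thumb works out like this (the exact choice of k remains a judgment call):
# Rule-of-thumb k: square root of the number of rows in the full dataset
round(sqrt(nrow(knn_input)))   # sqrt(20000) is roughly 141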
knn1 is a vector (a factor) of predicted labels, one for each of the rows/examples in the test data.
library(class)
knn1 <- class::knn(train = X_train, test = X_test, cl = y_train, k = 141)
Finally, let's check how accurate our model is. We will use the function created at the beginning of the notebook. Overall, our model is quite accurate: it makes the correct prediction about 83% of the time! Recall that KNN used the training data to predict on the testing data, so there is some assurance that the model would predict well on new data.
Let's explore some details of the accuracy of this model. The table shows the following output:
* When a transaction is truthfully a high gross profit margin transaction and the model correctly classifies it as such (by saying "high"), that is a True Positive (TP), or Hit. This happens 1700 times.
* When a transaction is truthfully a low gross profit margin transaction and the model correctly classifies it as such (by saying "low"), that is a True Negative (TN), or Rejection. This happens 2448 times.
* On the other hand, the model makes two kinds of errors. When a transaction is not a high gross profit margin transaction but the model incorrectly says it is, that is a False Positive (FP), or Type 1 Error. It happens 367 times.
* Finally, when a transaction is a high gross profit margin transaction and the model says it is not, which here happens 484 times, that is a False Negative (FN), or Type 2 Error.
These numbers can then be combined to create different measures of accuracy, as follows:
* Overall accuracy ((TP+TN)/(TP+TN+FP+FN)) is 0.8298.
* Sensitivity, Recall, Hit Rate, True Positive Rate (how many positives did the model get right? TP/(TP+FN)) is 0.7784.
* Specificity, Selectivity, True Negative Rate (how many negatives did the model get right? TN/(TN+FP)) is 0.8696.
* Precision, Positive Predictive Value (how good are the model's positive predictions? TP/(TP+FP)) is 0.8224.
* Negative Predictive Value (how good are the model's negative predictions? TN/(TN+FN)) is 0.8349.
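To make the arithmetic concrete, these headline measures can be recomputed directly from the four counts above:
# Recomputing the headline metrics directly from the four counts reported above
tp <- 1700; tn <- 2448; fp <- 367; fn <- 484
(tp + tn) / (tp + tn + fp + fn)   # accuracy, about 0.8298
tp / (tp + fn)                    # sensitivity/recall, about 0.7784
tn / (tn + fp)                    # specificity, about 0.8696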
But what does this all tell us? Overall, our model does a great job of predicting when a purchase will be high profit margin versus low profit margin, getting it right about 83% of the time. The aspect of the model that is the least effective is its sensitivity, and the aspect that is the most effective is its specificity. This means the model is very good at classifying the low margin purchases correctly (specificity), but slightly worse at classifying the high margin purchases correctly (sensitivity). Thus, while the model is quite good overall, it particularly excels at identifying the low margin transactions.
table2 <- table(knn1, y_test) #prediction on left and truth on top
my_confusion_matrix(table2)
## y_test
## knn1 low high
## low 2448 484
## high 367 1700
## [[1]]
## [1] "1700 = True Positive (TP), Hit"
##
## [[2]]
## [1] "2448 = True Negative (TN), Rejection"
##
## [[3]]
## [1] "367 = False Positive (FP), Type 1 Error"
##
## [[4]]
## [1] "484 = False Negative (FN), Type 2 Error"
##
## [[5]]
## [1] "0.8298 = Accuracy (TP+TN/(TP+TN+FP+FN))"
##
## [[6]]
## [1] "0.7784 = Sensitivity, Recall, Hit Rate, True Positive Rate (How many positives did the model get right? TP/(TP+FN))"
##
## [[7]]
## [1] "0.8696 = Specificity, Selectivity, True Negative Rate (How many negatives did the model get right? TN/(TN+FP))"
##
## [[8]]
## [1] "0.8224 = Precision, Positive Predictive Value (How good are the model's positive predictions? TP/(TP+FP))"
##
## [[9]]
## [1] "0.8349 = Negative Predictive Value (How good are the model's negative predictions? TN/(TN+FN)"
The above confusion matrix was made with the function we defined at the top of this notebook. It is helpful because it includes some extra text that helps interpret the results. R, of course, has several packages that will build the confusion matrix for you. Here is one. Note that we did need to specify which of our levels is the "Positive Class", since caret::confusionMatrix() otherwise takes the first level of the variable, which in our case is low, the "negative" class, rather than the thing we are trying to predict. One thing provided here is a significance test of accuracy. As you can see from the very low P-Value [Acc > NIR], our model is significantly more accurate than the no-information rate of always guessing the more common class.
caret::confusionMatrix(knn1, y_test, positive='high')
## Confusion Matrix and Statistics
##
## Reference
## Prediction low high
## low 2448 484
## high 367 1700
##
## Accuracy : 0.8298
## 95% CI : (0.8191, 0.8401)
## No Information Rate : 0.5631
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6519
##
## Mcnemar's Test P-Value : 6.996e-05
##
## Sensitivity : 0.7784
## Specificity : 0.8696
## Pos Pred Value : 0.8224
## Neg Pred Value : 0.8349
## Prevalence : 0.4369
## Detection Rate : 0.3401
## Detection Prevalence : 0.4135
## Balanced Accuracy : 0.8240
##
## 'Positive' Class : high
##
Finally, we can put the prediction back into the test data and
compare when our model is and is not accurate. The first line of the
code below adds the predicted low or high gross profit margin back to
the test data. The second line creates a new variable called
correct that takes on the value TRUE when the
model was correct and FALSE when the model was incorrect.
Printing out a sample of this dataframe and scrolling through it gives
us the ability to investigate where the model went wrong.
It also gives us some insight into TECA's problem. Just doing a quick scan, it seems like promoting certain areas, such as fresh and dispensed, might help increase the sale of high profit margin products, while other areas, such as lottery, might not help. However, our efforts here were to build a prediction model. To examine the relevance of the individual factors we would need to employ another method, such as logistic regression or decision trees.
data_test$prediction <- knn1
data_test <- data_test %>% mutate(correct = high_gpm==prediction)
slice_sample(data_test, n=20)
## high_gpm revenue quarter.1 quarter.2 quarter.3 quarter.4 income
## 209931 high 1.79 0 0 1 0 34210
## 1403545 low 26.91 0 0 0 1 29091
## 1226812 low 10.00 1 0 0 0 42772
## 428400 low 5.00 1 0 0 0 66250
## 417907 low 21.00 1 0 0 0 76071
## 38464 high 0.99 0 1 0 0 52500
## 1612610 high 1.00 0 0 0 1 29091
## 1224527 low 45.25 0 0 0 1 70557
## 879815 high 1.98 0 0 1 0 66160
## 318760 low 7.03 0 1 0 0 37130
## 1242817 low 26.90 0 0 0 1 50819
## 382899 high 1.69 0 0 1 0 66250
## 118951 low 13.98 1 0 0 0 41827
## 855741 low 4.75 0 1 0 0 70963
## 1551687 high 2.69 0 0 1 0 66250
## 1201431 low 20.00 0 0 0 1 51461
## 471531 high 7.56 0 0 1 0 63673
## 15110 low 2.13 0 0 0 1 53026
## 1555083 high 1.30 1 0 0 0 132230
## 417764 low 44.64 0 0 1 0 73148
## bachelors_degree population state_province.Alabama
## 209931 1632 25999 1
## 1403545 3 181 0
## 1226812 160 1876 0
## 428400 22 567 0
## 417907 93 493 0
## 38464 95 1192 0
## 1612610 3 181 0
## 1224527 4048 51863 0
## 879815 10178 57364 0
## 318760 1238 15153 0
## 1242817 958 8778 0
## 382899 22 567 0
## 118951 169 2379 0
## 855741 1959 16052 1
## 1551687 22 567 0
## 1201431 2922 52172 0
## 471531 1172 9046 0
## 15110 19 557 0
## 1555083 974 3294 0
## 417764 139 852 0
## state_province.Arkansas state_province.Colorado state_province.Iowa
## 209931 0 0 0
## 1403545 0 0 0
## 1226812 0 0 0
## 428400 0 0 1
## 417907 0 0 0
## 38464 0 0 0
## 1612610 0 0 0
## 1224527 0 1 0
## 879815 0 0 0
## 318760 0 1 0
## 1242817 0 0 0
## 382899 0 0 1
## 118951 0 0 0
## 855741 0 0 0
## 1551687 0 0 1
## 1201431 0 1 0
## 471531 0 0 1
## 15110 0 0 1
## 1555083 1 0 0
## 417764 0 0 1
## state_province.Minnesota state_province.Missouri
## 209931 0 0
## 1403545 0 0
## 1226812 0 0
## 428400 0 0
## 417907 0 0
## 38464 0 0
## 1612610 0 0
## 1224527 0 0
## 879815 0 1
## 318760 0 0
## 1242817 1 0
## 382899 0 0
## 118951 0 0
## 855741 0 0
## 1551687 0 0
## 1201431 0 0
## 471531 0 0
## 15110 0 0
## 1555083 0 0
## 417764 0 0
## state_province.Nebraska state_province.Oklahoma
## 209931 0 0
## 1403545 0 1
## 1226812 0 1
## 428400 0 0
## 417907 1 0
## 38464 1 0
## 1612610 0 1
## 1224527 0 0
## 879815 0 0
## 318760 0 0
## 1242817 0 0
## 382899 0 0
## 118951 0 1
## 855741 0 0
## 1551687 0 0
## 1201431 0 0
## 471531 0 0
## 15110 0 0
## 1555083 0 0
## 417764 0 0
## state_province.South Dakota state_province.Wyoming num_trans basket.no
## 209931 0 0 1 1
## 1403545 0 0 1 1
## 1226812 0 0 1 1
## 428400 0 0 1 1
## 417907 0 0 1 1
## 38464 0 0 1 1
## 1612610 0 0 1 1
## 1224527 0 0 1 1
## 879815 0 0 1 1
## 318760 0 0 1 1
## 1242817 0 0 1 1
## 382899 0 0 1 1
## 118951 0 0 1 1
## 855741 0 0 1 1
## 1551687 0 0 1 1
## 1201431 0 0 1 1
## 471531 0 0 1 1
## 15110 0 0 1 1
## 1555083 0 0 1 1
## 417764 0 0 1 1
## basket.yes refill.no refill.yes area.alcohol area.cooler area.dispensed
## 209931 0 1 0 0 0 0
## 1403545 0 1 0 0 0 0
## 1226812 0 1 0 0 0 0
## 428400 0 1 0 0 0 0
## 417907 0 1 0 0 0 0
## 38464 0 1 0 0 1 0
## 1612610 0 1 0 0 0 1
## 1224527 0 1 0 0 0 0
## 879815 0 1 0 0 0 0
## 318760 0 1 0 0 0 0
## 1242817 0 1 0 0 0 0
## 382899 0 1 0 0 0 1
## 118951 0 1 0 0 0 0
## 855741 0 1 0 0 0 0
## 1551687 0 1 0 0 0 0
## 1201431 0 1 0 0 0 0
## 471531 0 1 0 0 1 0
## 15110 0 1 0 0 1 0
## 1555083 0 1 0 0 0 0
## 417764 0 1 0 0 0 0
## area.fresh area.fuel area.grocery area.lottery area.miscellaneous
## 209931 1 0 0 0 0
## 1403545 0 1 0 0 0
## 1226812 0 1 0 0 0
## 428400 0 1 0 0 0
## 417907 0 1 0 0 0
## 38464 0 0 0 0 0
## 1612610 0 0 0 0 0
## 1224527 0 1 0 0 0
## 879815 0 0 0 0 0
## 318760 0 0 0 0 0
## 1242817 0 1 0 0 0
## 382899 0 0 0 0 0
## 118951 0 0 0 0 0
## 855741 0 0 0 0 0
## 1551687 1 0 0 0 0
## 1201431 0 1 0 0 0
## 471531 0 0 0 0 0
## 15110 0 0 0 0 0
## 1555083 1 0 0 0 0
## 417764 0 1 0 0 0
## area.nongrocery area.snacks area.tobacco items_sold loyalty2.not loyal
## 209931 0 0 0 1 1
## 1403545 0 0 0 1 1
## 1226812 0 0 0 1 1
## 428400 0 0 0 1 1
## 417907 0 0 0 1 0
## 38464 0 0 0 1 0
## 1612610 0 0 0 1 1
## 1224527 0 0 0 1 1
## 879815 0 1 0 2 1
## 318760 0 0 1 1 0
## 1242817 0 0 0 1 1
## 382899 0 0 0 1 0
## 118951 0 0 1 2 1
## 855741 0 0 1 1 1
## 1551687 0 0 0 2 1
## 1201431 0 0 0 1 1
## 471531 0 0 0 4 1
## 15110 0 0 0 1 0
## 1555083 0 0 0 1 1
## 417764 0 0 0 1 0
## loyalty2.loyal prediction correct
## 209931 0 high TRUE
## 1403545 0 low TRUE
## 1226812 0 low TRUE
## 428400 0 low TRUE
## 417907 1 low TRUE
## 38464 1 low FALSE
## 1612610 0 high TRUE
## 1224527 0 low TRUE
## 879815 0 high TRUE
## 318760 1 low TRUE
## 1242817 0 low TRUE
## 382899 1 high TRUE
## 118951 0 low TRUE
## 855741 0 low TRUE
## 1551687 0 high TRUE
## 1201431 0 low TRUE
## 471531 0 low FALSE
## 15110 1 low TRUE
## 1555083 0 high TRUE
## 417764 1 low TRUE
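If you want to push that quick scan a little further, a grouped summary like the following sketch (not part of the original analysis) shows, for each store area, how often purchases are high margin and how often the model's prediction was correct:
# Optional exploration (a sketch, not part of the original workflow): summarize the
# test-set results by store area using the dummy columns already in the data
data_test %>%
  pivot_longer(starts_with('area.'), names_to = 'area', values_to = 'flag') %>%
  filter(flag == 1) %>%
  group_by(area) %>%
  summarize(purchases      = n(),
            share_high_gpm = mean(high_gpm == 'high'),
            share_correct  = mean(correct))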
Finally, note that we are not covering advanced topics such as
hyperparameter tuning. When you gain more experience, you might want to
examine methods for picking k, cross validation, etc. Our
goal is to provide you the framework to understand the algorithm and
start working with it.