Throughout your early career as a data scientist you’ve built complex visualizations, explored NBA talent, mined text on data science news, and gained a better understanding of how to create commercials with great success. But you’ve suddenly realized you need to enhance your ability to assess the models you are building, since the most important part of understanding any machine learning model (any model) is understanding its weaknesses or, better said, its vulnerabilities.

In doing so you’ve decided to practice on datasets that are of interest to you, but use an approach with which you are very familiar: kNN.

Part 1. Select, either as a lab group or individually, two datasets that you have not used before but that are of interest to you or your group. Define questions that can be answered using classification, specifically kNN, for each dataset. Build kNN models and then use the evaluation metrics we discussed in class (Accuracy, TPR, FPR, F1, Kappa, LogLoss and ROC/AUC) to assess the quality of the models. Make sure to calculate the base rate or prevalence to provide a reference for some of these measures.

Part 2. Take a closer look at where misclassification errors are occurring. Is there a pattern? If so, discuss this pattern and why you think this is the case.

Part 3. Based on your exploration in Part 2, change the threshold using the function provided. What differences do you see in the evaluation metrics? Speak specifically to the metrics you think are best suited to address the questions you are trying to answer.

Part 4. Summarize your findings to include recommendations on how you might change each of the two kNN models based on the results. These recommendations might include gathering more data, adjusting the threshold or maybe that it’s working fine at the current level and nothing should be done. Regardless of the outcome, what should we be aware of when these models are deployed?

Question of interest for the first data set

We are interested in creating a model to predict whether an Instagram account is a spam account or not, so that our own spam accounts can go undetected.

## ── Attaching packages ────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift

Looking at the variables for correlations

Splitting the data set into two for testing and training

## [1] 221 441 416  60 110 366
## [1] 0.8003472
## [1] 461
## [1] 115
## 'data.frame':    461 obs. of  12 variables:
##  $ profile.pic         : int  1 1 1 1 1 0 1 0 1 1 ...
##  $ nums.length.username: num  0 0.1 0.18 0.12 0 0 0 0.89 0 0 ...
##  $ fullname.words      : int  2 1 1 2 2 1 2 0 1 2 ...
##  $ nums.length.fullname: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ name..username      : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ description.length  : int  61 0 0 0 0 0 81 0 35 48 ...
##  $ external.URL        : int  0 0 0 0 0 0 0 0 1 0 ...
##  $ private             : int  1 1 0 0 0 0 0 0 1 0 ...
##  $ X.posts             : int  217 5 0 69 9 11 50 1 35 222 ...
##  $ X.followers         : int  1152 6 42 320377 218 42 691 50 1809 5282 ...
##  $ X.follows           : int  292 24 146 228 75 26 680 39 416 652 ...
##  $ fake                : int  0 1 1 0 0 1 0 1 0 0 ...

Running KNN with k set to three as a baseline

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  - attr(*, "prob")= num [1:115] 0.667 1 1 1 1 ...
## [1] 115
## insta_3NN
##  0  1 
## 56 59
## $levels
## [1] "0" "1"
## 
## $class
## [1] "factor"
## 
## $prob
##   [1] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##   [8] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [15] 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000
##  [22] 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000
##  [29] 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667
##  [36] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
##  [43] 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
##  [50] 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000
##  [57] 0.6666667 0.6666667 1.0000000 1.0000000 0.6666667 1.0000000 0.6666667
##  [64] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [71] 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.6666667 1.0000000
##  [78] 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000
##  [85] 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667 1.0000000
##  [92] 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [99] 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000
## [106] 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [113] 1.0000000 0.6666667 1.0000000

Looking at a table of output from KNN

##          
## insta_3NN  0  1
##         0 48  8
##         1  9 50
## [1] 48 50

Calculating a confusion matrix from the KNN with k equal to three

## [1] 0.862069
## [1] 0.8521739
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 48  8
##          1  9 50
##                                           
##                Accuracy : 0.8522          
##                  95% CI : (0.7739, 0.9115)
##     No Information Rate : 0.5043          
##     P-Value [Acc > NIR] : 5.11e-15        
##                                           
##                   Kappa : 0.7043          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8621          
##             Specificity : 0.8421          
##          Pos Pred Value : 0.8475          
##          Neg Pred Value : 0.8571          
##              Prevalence : 0.5043          
##          Detection Rate : 0.4348          
##    Detection Prevalence : 0.5130          
##       Balanced Accuracy : 0.8521          
##                                           
##        'Positive' Class : 1               
## 

With K = 3 we had an accuracy of .85 and a sensitivity of .86. These are okay results but hopefully increasing our k will improve these values.
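The figures quoted above can be reproduced by hand from the four cells of the k = 3 confusion matrix. The following is a minimal sketch in base R; the variable names are illustrative and are not taken from the original analysis code. It also computes the false positive rate, F1, and prevalence, which the assignment asks for.

```r
# Counts copied from the k = 3 confusion matrix above (positive class = 1).
tp <- 50  # predicted 1, actual 1
fn <- 8   # predicted 0, actual 1
fp <- 9   # predicted 1, actual 0
tn <- 48  # predicted 0, actual 0
n  <- tp + fn + fp + tn

accuracy    <- (tp + tn) / n
sensitivity <- tp / (tp + fn)   # true positive rate
fpr         <- fp / (fp + tn)   # false positive rate = 1 - specificity
precision   <- tp / (tp + fp)
f1          <- 2 * precision * sensitivity / (precision + sensitivity)
prevalence  <- (tp + fn) / n    # base rate of the positive class

# Cohen's kappa: agreement beyond what chance alone would produce.
p_chance    <- ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n^2
cohen_kappa <- (accuracy - p_chance) / (1 - p_chance)

round(c(accuracy = accuracy, sensitivity = sensitivity, fpr = fpr,
        f1 = f1, kappa = cohen_kappa, prevalence = prevalence), 4)
```

Because the prevalence is close to 0.5, a model that always predicts one class would score about 50% accuracy, which is the reference point for judging the 0.85 reported above.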

A choose-k function will help us select the optimal k

##  num [1:2, 1:11] 1 0.843 3 0.852 5 ...
## [1] "matrix" "array"
##           [,1]      [,2]      [,3]      [,4]      [,5]       [,6]       [,7]
## [1,] 1.0000000 3.0000000 5.0000000 7.0000000 9.0000000 11.0000000 13.0000000
## [2,] 0.8434783 0.8521739 0.8608696 0.8782609 0.8521739  0.8521739  0.8608696
##            [,8]       [,9]      [,10]      [,11]
## [1,] 15.0000000 17.0000000 19.0000000 21.0000000
## [2,]  0.8434783  0.8347826  0.8434783  0.8521739

Looking at a plot to choose k

## 7 seems to be our best option here
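The search over k can be sketched as follows. This is not the lab's original choose-k function, which is not shown in this report; `knn_accuracy` is a hypothetical helper, and two species from the built-in `iris` data stand in for the Instagram data. The sketch assumes the `class` package, which is normally bundled with R.

```r
library(class)  # knn(); a recommended package bundled with most R installs

# Hypothetical helper (not the lab's original): test-set accuracy for one k.
knn_accuracy <- function(k, train_x, train_y, test_x, test_y) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  mean(pred == test_y)
}

# Stand-in data: two iris species, because the Instagram file is not shown here.
set.seed(1)
two_class <- droplevels(iris[iris$Species != "setosa", ])
train_idx <- sample(nrow(two_class), size = 80)  # 80% of the 100 rows
train_x <- two_class[train_idx, 1:4]
test_x  <- two_class[-train_idx, 1:4]
train_y <- two_class$Species[train_idx]
test_y  <- two_class$Species[-train_idx]

# Odd k from 1 to 21 avoids tied votes in a two-class problem.
ks  <- seq(1, 21, by = 2)
acc <- sapply(ks, knn_accuracy, train_x = train_x, train_y = train_y,
              test_x = test_x, test_y = test_y)
k_results <- rbind(k = ks, accuracy = acc)  # 2 x 11, like the matrix above
ks[which.max(acc)]                          # smallest k with the top accuracy
```

Picking k on the same test set that is later used to report accuracy makes the reported accuracy slightly optimistic. A separate validation split or cross-validation avoids that.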

Running KNN again but this time with K equal to 7

##  Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  - attr(*, "prob")= num [1:115] 0.714 1 0.857 1 0.857 ...
## [1] 115
## insta_7NN
##  0  1 
## 57 58
## $levels
## [1] "0" "1"
## 
## $class
## [1] "factor"
## 
## $prob
##   [1] 0.7142857 1.0000000 0.8571429 1.0000000 0.8571429 1.0000000 1.0000000
##   [8] 0.8571429 1.0000000 1.0000000 1.0000000 1.0000000 0.8571429 1.0000000
##  [15] 0.5714286 1.0000000 1.0000000 0.5714286 0.8571429 1.0000000 0.7142857
##  [22] 0.8571429 1.0000000 1.0000000 0.5714286 0.5714286 1.0000000 1.0000000
##  [29] 0.5714286 1.0000000 0.7142857 0.5714286 1.0000000 0.8571429 0.5714286
##  [36] 0.5714286 1.0000000 1.0000000 1.0000000 1.0000000 0.8571429 1.0000000
##  [43] 0.7142857 1.0000000 0.5714286 1.0000000 1.0000000 1.0000000 0.7142857
##  [50] 1.0000000 1.0000000 1.0000000 1.0000000 0.5714286 1.0000000 0.7142857
##  [57] 0.8571429 0.8571429 1.0000000 1.0000000 0.5714286 1.0000000 0.8571429
##  [64] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.8571429 1.0000000
##  [71] 1.0000000 0.8571429 1.0000000 0.8571429 0.8571429 0.8571429 1.0000000
##  [78] 0.8571429 0.5714286 1.0000000 0.8571429 1.0000000 0.7142857 1.0000000
##  [85] 0.8571429 1.0000000 0.5714286 0.7142857 0.8571429 0.8571429 1.0000000
##  [92] 1.0000000 0.8571429 1.0000000 0.8571429 1.0000000 1.0000000 1.0000000
##  [99] 1.0000000 1.0000000 0.8571429 1.0000000 0.8571429 1.0000000 1.0000000
## [106] 0.5714286 1.0000000 0.8571429 0.7142857 1.0000000 1.0000000 1.0000000
## [113] 0.8571429 0.7142857 1.0000000

Confusion matrix for the KNN with K equal to 7

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 50  7
##          1  7 51
##                                           
##                Accuracy : 0.8783          
##                  95% CI : (0.8042, 0.9318)
##     No Information Rate : 0.5043          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7565          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8793          
##             Specificity : 0.8772          
##          Pos Pred Value : 0.8793          
##          Neg Pred Value : 0.8772          
##              Prevalence : 0.5043          
##          Detection Rate : 0.4435          
##    Detection Prevalence : 0.5043          
##       Balanced Accuracy : 0.8783          
##                                           
##        'Positive' Class : 1               
## 

With K = 7 we had an accuracy of 0.8783 and a sensitivity of .8793. Both of these values are decent results but it would be nice if we could bump these up a little more.

Splitting the original data set again to train a kNN model with caret

## [1] 462  12
## [1] 114  12
## k-Nearest Neighbors 
## 
## 462 samples
##  11 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 462, 462, 462, 462, 462, 462, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.8716966  0.7425661
##   7  0.8788322  0.7571091
##   9  0.8792054  0.7579037
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 9.
## [1] 0.8792054
## NULL

Running predictions and adjusting the threshold

##           
## insta_eval  0  1
##          0 50  5
##          1  7 52
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 50  5
##          1  7 52
##                                           
##                Accuracy : 0.8947          
##                  95% CI : (0.8233, 0.9444)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.7895          
##                                           
##  Mcnemar's Test P-Value : 0.7728          
##                                           
##             Sensitivity : 0.9123          
##             Specificity : 0.8772          
##          Pos Pred Value : 0.8814          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4561          
##    Detection Prevalence : 0.5175          
##       Balanced Accuracy : 0.8947          
##                                           
##        'Positive' Class : 1               
## 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 46  1
##          1 11 56
##                                           
##                Accuracy : 0.8947          
##                  95% CI : (0.8233, 0.9444)
##     No Information Rate : 0.5             
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7895          
##                                           
##  Mcnemar's Test P-Value : 0.009375        
##                                           
##             Sensitivity : 0.9825          
##             Specificity : 0.8070          
##          Pos Pred Value : 0.8358          
##          Neg Pred Value : 0.9787          
##               Precision : 0.8358          
##                  Recall : 0.9825          
##                      F1 : 0.9032          
##              Prevalence : 0.5000          
##          Detection Rate : 0.4912          
##    Detection Prevalence : 0.5877          
##       Balanced Accuracy : 0.8947          
##                                           
##        'Positive' Class : 1               
## 
## [1] 0.1052632

Our overall error rate was about 10.5% (1 - 0.8947), which isn’t too bad, but it would be nice to get that a little lower still.
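The threshold adjustment behind the second confusion matrix above can be sketched as follows. `classify_at_threshold` is a hypothetical stand-in, not the function provided in class, and the probabilities and labels are made up for illustration.

```r
# Hypothetical helper (not the function provided in class): label an
# observation as positive when its predicted probability of class "1"
# is greater than or equal to the threshold.
classify_at_threshold <- function(prob_positive, threshold = 0.5) {
  factor(ifelse(prob_positive >= threshold, "1", "0"), levels = c("0", "1"))
}

# Made-up probabilities and labels, for illustration only.
prob_positive <- c(0.10, 0.35, 0.45, 0.55, 0.80, 0.95)
actual        <- factor(c("0", "1", "0", "1", "1", "1"), levels = c("0", "1"))

table(Prediction = classify_at_threshold(prob_positive, 0.5), Actual = actual)
table(Prediction = classify_at_threshold(prob_positive, 0.3), Actual = actual)
```

Lowering the threshold turns a false negative into a true positive at the cost of an extra false positive. The Instagram results show the same trade: false negatives fall from 5 to 1 while false positives rise from 7 to 11, so sensitivity improves and specificity drops while accuracy stays at 0.8947.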

Using the ROCR package for evaluation

## 'data.frame':    114 obs. of  3 variables:
##  $ pred_class: Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ pred_prob : num  0.111 0 0.222 0 0 ...
##  $ target    : num  1 1 1 1 1 1 1 1 1 1 ...

## [[1]]
## [1] 0.9621422

We got an AUC Value of .96
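To make the AUC figure more concrete: AUC is the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The report uses the ROCR package for this. The base-R sketch below computes the same quantity from ranks; `auc_rank` is an illustrative name, not part of ROCR.

```r
# Rank-based AUC (equivalent to the Mann-Whitney U statistic).
# `labels` must be coded 0/1 and `scores` is the predicted probability of class 1.
auc_rank <- function(labels, scores) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  ranks <- rank(scores)  # ties receive average ranks
  (sum(ranks[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Made-up example: 3 of the 4 positive/negative pairs are ordered correctly.
auc_rank(labels = c(0, 0, 1, 1), scores = c(0.10, 0.40, 0.35, 0.80))  # 0.75
```

An AUC of 0.5 corresponds to random ordering and 1 to perfect ordering, so 0.96 means the model ranks fake accounts above real ones in the large majority of pairs.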

Running log loss and finding our F1 score

## 
## Attaching package: 'MLmetrics'
## The following objects are masked from 'package:caret':
## 
##     MAE, RMSE
## The following object is masked from 'package:base':
## 
##     Recall
## [1] -0.6480729

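We had a reported LogLoss of -0.6481. Log loss is non-negative by definition, so the negative sign most likely comes from how the value was computed. Taken by magnitude (0.648) this is only okay, since we want to be as close to 0 as possible.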
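For reference, binary log loss averages the negative log of the probability the model assigned to the true class. The sketch below is a plain base-R version written for illustration, not the MLmetrics implementation used above, and `log_loss` and its arguments are assumed names.

```r
# Binary log loss. `actual` must be coded 0/1 and `prob_positive` is the
# predicted probability of class 1. Probabilities are clipped away from
# 0 and 1 so that log() never returns -Inf.
log_loss <- function(actual, prob_positive, eps = 1e-15) {
  p <- pmin(pmax(prob_positive, eps), 1 - eps)
  -mean(actual * log(p) + (1 - actual) * log(1 - p))
}

log_loss(c(1, 0, 1), c(0.9, 0.1, 0.8))  # confident and correct: small loss
log_loss(c(1, 0), c(0.5, 0.5))          # always guessing 0.5: log(2)
log_loss(c(1, 0), c(0, 1))              # confident and wrong: very large
```

A model that always predicts 0.5 scores log(2), roughly 0.693, which is a useful reference point on a balanced data set such as this one. kNN probabilities are vote fractions, so a unanimous but wrong neighbourhood produces a probability of exactly 0 or 1 and a very large penalty.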

Overall, the goal of our model was to detect fake Instagram accounts so that our own spam accounts can go undetected. The model does a decent job of this: with k = 7 we had an accuracy and sensitivity of about .88. Ideally this would be higher, but it’s not bad either. Looking at the evaluation metrics from the caret model, we had an error rate of about 10.5%, which is decent as well. Like the other values, it would ideally be better, but it’s at least a good starting point. We also had an AUC value of .96, which is really good given that you want to be close to 1. For log loss we had a reported value of -0.65; taken by magnitude this is only okay, since you want to be as close to 0 as possible. Overall, this model is mostly accurate and is at the least a very good starting point towards creating a model to detect spam Instagram accounts.

Question of interest for the second data set

Can we create a model for accurately predicting whether a tumor is malignant or benign?

Reading in the Breast Cancer Data

## 'data.frame':    569 obs. of  32 variables:
##  $ id                     : int  842302 842517 84300903 84348301 84358402 843786 844359 84458202 844981 84501001 ...
##  $ diagnosis              : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ radius_mean            : num  18 20.6 19.7 11.4 20.3 ...
##  $ texture_mean           : num  10.4 17.8 21.2 20.4 14.3 ...
##  $ perimeter_mean         : num  122.8 132.9 130 77.6 135.1 ...
##  $ area_mean              : num  1001 1326 1203 386 1297 ...
##  $ smoothness_mean        : num  0.1184 0.0847 0.1096 0.1425 0.1003 ...
##  $ compactness_mean       : num  0.2776 0.0786 0.1599 0.2839 0.1328 ...
##  $ concavity_mean         : num  0.3001 0.0869 0.1974 0.2414 0.198 ...
##  $ concave.points_mean    : num  0.1471 0.0702 0.1279 0.1052 0.1043 ...
##  $ symmetry_mean          : num  0.242 0.181 0.207 0.26 0.181 ...
##  $ fractal_dimension_mean : num  0.0787 0.0567 0.06 0.0974 0.0588 ...
##  $ radius_se              : num  1.095 0.543 0.746 0.496 0.757 ...
##  $ texture_se             : num  0.905 0.734 0.787 1.156 0.781 ...
##  $ perimeter_se           : num  8.59 3.4 4.58 3.44 5.44 ...
##  $ area_se                : num  153.4 74.1 94 27.2 94.4 ...
##  $ smoothness_se          : num  0.0064 0.00522 0.00615 0.00911 0.01149 ...
##  $ compactness_se         : num  0.049 0.0131 0.0401 0.0746 0.0246 ...
##  $ concavity_se           : num  0.0537 0.0186 0.0383 0.0566 0.0569 ...
##  $ concave.points_se      : num  0.0159 0.0134 0.0206 0.0187 0.0188 ...
##  $ symmetry_se            : num  0.03 0.0139 0.0225 0.0596 0.0176 ...
##  $ fractal_dimension_se   : num  0.00619 0.00353 0.00457 0.00921 0.00511 ...
##  $ radius_worst           : num  25.4 25 23.6 14.9 22.5 ...
##  $ texture_worst          : num  17.3 23.4 25.5 26.5 16.7 ...
##  $ perimeter_worst        : num  184.6 158.8 152.5 98.9 152.2 ...
##  $ area_worst             : num  2019 1956 1709 568 1575 ...
##  $ smoothness_worst       : num  0.162 0.124 0.144 0.21 0.137 ...
##  $ compactness_worst      : num  0.666 0.187 0.424 0.866 0.205 ...
##  $ concavity_worst        : num  0.712 0.242 0.45 0.687 0.4 ...
##  $ concave.points_worst   : num  0.265 0.186 0.243 0.258 0.163 ...
##  $ symmetry_worst         : num  0.46 0.275 0.361 0.664 0.236 ...
##  $ fractal_dimension_worst: num  0.1189 0.089 0.0876 0.173 0.0768 ...

Creating a training and testing data set from our original

## [1] 221 441 416  60 110 366
## [1] 0.7996485
## [1] 455
## [1] 114
## 'data.frame':    455 obs. of  32 variables:
##  $ id                     : int  8812816 909411 905686 858477 864018 901088 854941 915460 8510824 875938 ...
##  $ diagnosis              : num  0 0 0 0 0 1 0 1 0 1 ...
##  $ radius_mean            : num  13.65 10.97 11.89 8.62 11.34 ...
##  $ texture_mean           : num  13.2 17.2 21.2 11.8 21.3 ...
##  $ perimeter_mean         : num  87.9 71.7 76.4 54.3 72.5 ...
##  $ area_mean              : num  569 372 434 224 396 ...
##  $ smoothness_mean        : num  0.0965 0.0891 0.0977 0.0975 0.0876 ...
##  $ compactness_mean       : num  0.0871 0.1113 0.0812 0.0527 0.0658 ...
##  $ concavity_mean         : num  0.0389 0.0946 0.0255 0.0206 0.0513 ...
##  $ concave.points_mean    : num  0.0256 0.0361 0.0218 0.0078 0.019 ...
##  $ symmetry_mean          : num  0.136 0.149 0.202 0.168 0.149 ...
##  $ fractal_dimension_mean : num  0.0634 0.0664 0.0629 0.0719 0.0653 ...
##  $ radius_se              : num  0.21 0.257 0.275 0.156 0.234 ...
##  $ texture_se             : num  0.434 1.376 1.203 0.58 0.986 ...
##  $ perimeter_se           : num  1.39 2.81 1.93 1.05 1.6 ...
##  $ area_se                : num  17.4 18.15 19.53 8.32 16.41 ...
##  $ smoothness_se          : num  0.00413 0.00856 0.00989 0.01011 0.00911 ...
##  $ compactness_se         : num  0.0169 0.0464 0.0305 0.0106 0.0156 ...
##  $ concavity_se           : num  0.0165 0.0643 0.0163 0.0198 0.0244 ...
##  $ concave.points_se      : num  0.00666 0.01768 0.00928 0.00574 0.00643 ...
##  $ symmetry_se            : num  0.0137 0.0152 0.0226 0.0209 0.0157 ...
##  $ fractal_dimension_se   : num  0.00274 0.00498 0.00227 0.00279 0.00248 ...
##  $ radius_worst           : num  15.34 12.36 13.05 9.51 13.01 ...
##  $ texture_worst          : num  16.4 26.9 27.2 15.4 29.1 ...
##  $ perimeter_worst        : num  99.7 90.1 85.1 59.9 84 ...
##  $ area_worst             : num  706 476 523 275 518 ...
##  $ smoothness_worst       : num  0.131 0.139 0.143 0.173 0.17 ...
##  $ compactness_worst      : num  0.247 0.408 0.219 0.124 0.22 ...
##  $ concavity_worst        : num  0.176 0.478 0.116 0.117 0.312 ...
##  $ concave.points_worst   : num  0.0806 0.1555 0.0826 0.0442 0.0828 ...
##  $ symmetry_worst         : num  0.238 0.254 0.307 0.322 0.283 ...
##  $ fractal_dimension_worst: num  0.0872 0.0953 0.0735 0.0903 0.0883 ...

Running KNN with K = 3 as a baseline

##  Factor w/ 2 levels "0","1": 2 2 2 2 2 1 2 2 2 1 ...
##  - attr(*, "prob")= num [1:114] 1 1 1 0.667 0.667 ...
## [1] 114
## bc_3NN
##  0  1 
## 78 36
## $levels
## [1] "0" "1"
## 
## $class
## [1] "factor"
## 
## $prob
##   [1] 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 1.0000000 0.6666667
##   [8] 1.0000000 0.6666667 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667
##  [15] 0.6666667 0.6666667 1.0000000 1.0000000 0.6666667 0.6666667 1.0000000
##  [22] 0.6666667 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667 0.6666667
##  [29] 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667 0.6666667 0.6666667
##  [36] 0.6666667 0.6666667 1.0000000 0.6666667 0.6666667 0.6666667 0.6666667
##  [43] 1.0000000 1.0000000 1.0000000 0.6666667 0.6666667 0.6666667 0.6666667
##  [50] 1.0000000 0.6666667 1.0000000 0.7500000 1.0000000 0.6666667 1.0000000
##  [57] 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 0.6666667
##  [64] 0.6666667 1.0000000 1.0000000 0.7500000 1.0000000 0.6666667 1.0000000
##  [71] 0.7500000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
##  [78] 1.0000000 1.0000000 0.6666667 1.0000000 1.0000000 0.6666667 1.0000000
##  [85] 1.0000000 1.0000000 0.6666667 0.6666667 0.7500000 1.0000000 0.6666667
##  [92] 1.0000000 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000
##  [99] 1.0000000 1.0000000 0.6666667 1.0000000 0.6666667 0.6666667 1.0000000
## [106] 0.6666667 0.6666667 1.0000000 1.0000000 1.0000000 1.0000000 1.0000000
## [113] 1.0000000 1.0000000

Looking at the KNN output table

##       
## bc_3NN  0  1
##      0 60 18
##      1  8 28
## [1] 60 28

Looking at the confusion matrix for the KNN with K = 3

## [1] 0.6086957
## [1] 0.7719298
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 60 18
##          1  8 28
##                                          
##                Accuracy : 0.7719         
##                  95% CI : (0.684, 0.8453)
##     No Information Rate : 0.5965         
##     P-Value [Acc > NIR] : 5.911e-05      
##                                          
##                   Kappa : 0.5089         
##                                          
##  Mcnemar's Test P-Value : 0.07756        
##                                          
##             Sensitivity : 0.6087         
##             Specificity : 0.8824         
##          Pos Pred Value : 0.7778         
##          Neg Pred Value : 0.7692         
##              Prevalence : 0.4035         
##          Detection Rate : 0.2456         
##    Detection Prevalence : 0.3158         
##       Balanced Accuracy : 0.7455         
##                                          
##        'Positive' Class : 1              
## 

We have an accuracy of .77, a sensitivity of .61, and a specificity of .88. The accuracy isn’t very high and the sensitivity is low, but the specificity is decently high.

Running the choose-k function to find the optimal k value

##  num [1:2, 1:11] 1 0.816 3 0.772 5 ...
## [1] "matrix" "array"
##           [,1]      [,2]     [,3]      [,4]      [,5]       [,6]       [,7]
## [1,] 1.0000000 3.0000000 5.000000 7.0000000 9.0000000 11.0000000 13.0000000
## [2,] 0.8157895 0.7719298 0.745614 0.7192982 0.6754386  0.6842105  0.6929825
##            [,8]       [,9]      [,10]      [,11]
## [1,] 15.0000000 17.0000000 19.0000000 21.0000000
## [2,]  0.6929825  0.6842105  0.6754386  0.6666667

Running a ggplot to choose the optimal k value

## K = 1 seems to be our best option

Running KNN with K = 1 this time

##  Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 1 ...
##  - attr(*, "prob")= num [1:114] 1 1 1 1 1 1 1 1 1 1 ...
## [1] 114
## bc_1NN
##  0  1 
## 79 35
## $levels
## [1] "0" "1"
## 
## $class
## [1] "factor"
## 
## $prob
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [38] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [75] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [112] 1 1 1

Looking at the confusion matrix for K = 1

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 63 16
##          1  5 30
##                                           
##                Accuracy : 0.8158          
##                  95% CI : (0.7323, 0.8822)
##     No Information Rate : 0.5965          
##     P-Value [Acc > NIR] : 4.534e-07       
##                                           
##                   Kappa : 0.6019          
##                                           
##  Mcnemar's Test P-Value : 0.0291          
##                                           
##             Sensitivity : 0.6522          
##             Specificity : 0.9265          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.7975          
##              Prevalence : 0.4035          
##          Detection Rate : 0.2632          
##    Detection Prevalence : 0.3070          
##       Balanced Accuracy : 0.7893          
##                                           
##        'Positive' Class : 1               
## 

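We have an accuracy of .82, a sensitivity of .65, and a specificity of .93. The accuracy is better than with k = 3 but still not great. Specificity has gone up to .93, which is pretty good, but sensitivity remains low: 16 of the 46 positive cases (class 1) in the test set are still predicted as class 0.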

Splitting our original data set again to train a kNN model with caret and run predictions

## [1] 456  32
## [1] 113  32
## k-Nearest Neighbors 
## 
## 456 samples
##  31 predictor
##   2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 456, 456, 456, 456, 456, 456, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa    
##   5  0.7158040  0.3518183
##   7  0.6964320  0.3043686
##   9  0.6863108  0.2773809
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
## [1] 0.715804
## NULL

Running predictions and adjusting the threshold

##        
## bc_eval  0  1
##       0 68 22
##       1  3 20
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 68 22
##          1  3 20
##                                          
##                Accuracy : 0.7788         
##                  95% CI : (0.691, 0.8514)
##     No Information Rate : 0.6283         
##     P-Value [Acc > NIR] : 0.0004449      
##                                          
##                   Kappa : 0.4781         
##                                          
##  Mcnemar's Test P-Value : 0.0003182      
##                                          
##             Sensitivity : 0.4762         
##             Specificity : 0.9577         
##          Pos Pred Value : 0.8696         
##          Neg Pred Value : 0.7556         
##              Prevalence : 0.3717         
##          Detection Rate : 0.1770         
##    Detection Prevalence : 0.2035         
##       Balanced Accuracy : 0.7170         
##                                          
##        'Positive' Class : 1              
## 
## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 61 15
##          1 10 27
##                                          
##                Accuracy : 0.7788         
##                  95% CI : (0.691, 0.8514)
##     No Information Rate : 0.6283         
##     P-Value [Acc > NIR] : 0.0004449      
##                                          
##                   Kappa : 0.5145         
##                                          
##  Mcnemar's Test P-Value : 0.4237108      
##                                          
##             Sensitivity : 0.6429         
##             Specificity : 0.8592         
##          Pos Pred Value : 0.7297         
##          Neg Pred Value : 0.8026         
##               Precision : 0.7297         
##                  Recall : 0.6429         
##                      F1 : 0.6835         
##              Prevalence : 0.3717         
##          Detection Rate : 0.2389         
##    Detection Prevalence : 0.3274         
##       Balanced Accuracy : 0.7510         
##                                          
##        'Positive' Class : 1              
## 
## [1] 0.2212389

From the above we can see that our true positive rate (sensitivity) is quite low at 64%. Our false positive rate (1 - specificity) is decent at about 14%; we want this to be low. Our overall error rate is 22%, which isn’t very good.

Finding our AUC values

## 'data.frame':    113 obs. of  3 variables:
##  $ pred_class: Factor w/ 2 levels "0","1": 2 2 2 2 2 1 1 2 2 2 ...
##  $ pred_prob : num  1 1 1 1 1 0.4 0.4 1 1 1 ...
##  $ target    : num  2 2 2 2 2 1 1 2 2 2 ...

## [[1]]
## [1] 0.8526157

We got an AUC value of .85 which is decent but we want to be as close to 1 as possible.

Finding LogLoss and F1 score

## [1] 8.520391

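We have a log loss of 8.52, which is awful considering we want to be as close to 0 as possible. Note that the target column in the structure output above is coded 1/2 rather than the 0/1 that the log loss formula assumes, so this value may be distorted by the encoding rather than reflecting the model alone.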
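The `target` column in the structure output above holds 1s and 2s, which is what R returns when a factor is passed straight to `as.numeric()`. Whether that is what happened in the original code is an assumption, since the code is not shown. The sketch below, using a made-up factor, illustrates the difference and one way to recover 0/1 labels before computing log loss.

```r
# A 0/1 outcome stored as a factor (made-up values for illustration).
target_factor <- factor(c("1", "1", "0", "0", "1"), levels = c("0", "1"))

as.numeric(target_factor)                # internal level codes: 2 2 1 1 2
as.numeric(as.character(target_factor))  # the labels themselves: 1 1 0 0 1
```

Log loss expects the second form. With labels coded 1/2 the term for the negative class is no longer weighted correctly, so the result is not comparable to a true log loss.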

Overall, this model for predicting whether a tumor is benign or malignant isn’t very good. With k = 1, our optimal k value, we only had an accuracy of .82. Specificity was decent at .93, but sensitivity was only .65. The evaluation metrics from the caret model weren’t very good either. We have an error rate of 22%, which is pretty high. We had a decent AUC of .85, but we would like this to be closer to 1. For LogLoss, we had a value of 8.5, which is terrible considering we want this to be as close to 0 as possible. This model as a whole wasn’t very good, but it may be a good starting point for trying to classify tumors.