Preliminary

We will use the DescTools, caret, and class packages. If you do not already have the class package installed, first install it using the install.packages() function.

install.packages("class")

Next, we load all of the necessary libraries for use in the session.

library(caret)
library(DescTools)
library(class)

In the lesson that follows, we will use the cancer.csv dataset, which contains diagnostic information for 569 patients. The variables are derived from images of the tumors, which are diagnosed (Diagnosis) as either malignant (M, cancerous) or benign (B, non-cancerous).

The variables describing the tumors are:

  • ID: unique identifier of the patient
  • Radius: mean of distances from center of tumor to points on the perimeter
  • Texture: standard deviation of gray-scale values
  • Perimeter: perimeter measure of tumor
  • Area: area measure of tumor
  • Smoothness: local variation in radius lengths
  • Compactness: perimeter^2 / area - 1.0
  • Concavity: severity of concave portions of the contour
  • ConcavePoints: number of concave portions of the contour
  • Symmetry: degree of symmetry of the tumor
  • FractalDim: coastline approximation - 1

The pathologist diagnosing the tumors would like to automate the process by creating the best possible predictive model for the Diagnosis variable. The pathologist prioritizes correct cancer diagnoses (Diagnosis = M and prediction = M), or true positives, and wants to minimize false positives (Diagnosis = B and prediction = M).
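
In confusion-matrix terms, sensitivity (the true positive rate) is TP / (TP + FN), and the false positive rate is FP / (FP + TN), which equals 1 - specificity. As a minimal sketch with made-up counts (the numbers below are hypothetical, not from our data):

# Hypothetical confusion matrix counts, for illustration only
TP <- 90   # malignant tumors predicted malignant (true positives)
FN <- 10   # malignant tumors predicted benign (false negatives)
FP <- 5    # benign tumors predicted malignant (false positives)
TN <- 95   # benign tumors predicted benign (true negatives)

TP / (TP + FN)   # sensitivity (recall) = 0.90
FP / (FP + TN)   # false positive rate = 0.05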

We use the read.csv() function to import the CSV file into R as a dataframe named cancer. We set stringsAsFactors = FALSE to keep any character columns as-is.

cancer <- read.csv(file = "cancer.csv",
                   stringsAsFactors = FALSE)

Data Exploration & Variable Preparation

First, we can obtain high-level information about the cancer dataframe to look at the variable types and to check for missing (NA) values.

Abstract(cancer)
## ------------------------------------------------------------------------------ 
## cancer
## 
## data frame:  569 obs. of  12 variables
##      569 complete cases (100.0%)
## 
##   Nr  ColName        Class      NAs  Levels
##   1   ID             integer    .          
##   2   Diagnosis      character  .          
##   3   Radius         numeric    .          
##   4   Texture        numeric    .          
##   5   Perimeter      numeric    .          
##   6   Area           numeric    .          
##   7   Smoothness     numeric    .          
##   8   Compactness    numeric    .          
##   9   Concavity      numeric    .          
##   10  ConcavePoints  numeric    .          
##   11  Symmetry       numeric    .          
##   12  FractalDim     numeric    .

Preparing Target (Y) Variables

Next, we convert the target class variable that we want to predict, Diagnosis, to a nominal factor variable. Note: internally, B = 1 and M = 2, based on the alphabetical ordering of the levels.

cancer$Diagnosis <- factor(cancer$Diagnosis)
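
We can confirm the level ordering (and thus the underlying integer coding) with the levels() function:

levels(cancer$Diagnosis) # "B" is level 1, "M" is level 2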

Next, we can obtain summary statistic information using the summary() function.

summary(cancer)
##        ID            Diagnosis     Radius          Texture     
##  Min.   :     8670   B:357     Min.   : 6.981   Min.   : 9.71  
##  1st Qu.:   869218   M:212     1st Qu.:11.700   1st Qu.:16.17  
##  Median :   906024             Median :13.370   Median :18.84  
##  Mean   : 30371831             Mean   :14.127   Mean   :19.29  
##  3rd Qu.:  8813129             3rd Qu.:15.780   3rd Qu.:21.80  
##  Max.   :911320502             Max.   :28.110   Max.   :39.28  
##    Perimeter           Area          Smoothness       Compactness     
##  Min.   : 43.79   Min.   : 143.5   Min.   :0.05263   Min.   :0.01938  
##  1st Qu.: 75.17   1st Qu.: 420.3   1st Qu.:0.08637   1st Qu.:0.06492  
##  Median : 86.24   Median : 551.1   Median :0.09587   Median :0.09263  
##  Mean   : 91.97   Mean   : 654.9   Mean   :0.09636   Mean   :0.10434  
##  3rd Qu.:104.10   3rd Qu.: 782.7   3rd Qu.:0.10530   3rd Qu.:0.13040  
##  Max.   :188.50   Max.   :2501.0   Max.   :0.16340   Max.   :0.34540  
##    Concavity       ConcavePoints        Symmetry        FractalDim     
##  Min.   :0.00000   Min.   :0.00000   Min.   :0.1060   Min.   :0.04996  
##  1st Qu.:0.02956   1st Qu.:0.02031   1st Qu.:0.1619   1st Qu.:0.05770  
##  Median :0.06154   Median :0.03350   Median :0.1792   Median :0.06154  
##  Mean   :0.08880   Mean   :0.04892   Mean   :0.1812   Mean   :0.06280  
##  3rd Qu.:0.13070   3rd Qu.:0.07400   3rd Qu.:0.1957   3rd Qu.:0.06612  
##  Max.   :0.42680   Max.   :0.20120   Max.   :0.3040   Max.   :0.09744

We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more benign tumors than malignant tumors in our sample.

plot(x = cancer$Diagnosis,
     main = "Diagnosis")
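
To quantify the imbalance shown in the barplot, we can tabulate the class proportions (from the counts in the summary above, roughly 63% benign and 37% malignant):

prop.table(table(cancer$Diagnosis)) # proportion of B vs. M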

Preparing Predictor (X) Variables

We can set up a vector of the input (X) variable names (vars) for convenience. We exclude the ID variable and our Y variable, Diagnosis, which are in the 1st and 2nd columns.

vars <- names(cancer)[-(1:2)]

Data Preprocessing & Transformation

For kNN, we know that we are okay keeping irrelevant X variables, but we need to (1) impute or remove any missing values, (2) remove redundant variables, (3) transform/rescale our numeric variables, and (4) binarize categorical variables. We walk through each of these steps below.

  1. Missing Values. First, we check for missing values. If missing values are present, we can either remove them row-wise (na.omit()) or perform imputation.

PlotMiss(x = cancer)

As shown, we do not have any missing values in the data.
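
As a quick base-R cross-check of the same thing, we can count the NA values per column:

colSums(is.na(cancer)) # NA count per column; all zeros here
anyNA(cancer)          # FALSE when no NAs are present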

  2. Redundant Variables. We need to identify highly correlated numeric input (X) variables and exclude them from our predictive model.

First, we obtain the correlation matrix for our numeric predictor variables.

cor_vars <- cor(x = cancer[ ,vars])

We can start by looking at the symbolic correlation matrix to manually identify correlated variables.

symnum(x = cor_vars,
       corr = TRUE)
##               R T P A Sm Cm Cn CP Sy F
## Radius        1                       
## Texture       . 1                     
## Perimeter     B . 1                   
## Area          B . B 1                 
## Smoothness            1               
## Compactness   .   . . ,  1            
## Concavity     , . , , .  +  1         
## ConcavePoints +   + + .  +  *  1      
## Symmetry              .  ,  .  .  1   
## FractalDim    .       .  .  .     .  1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1

We can use the findCorrelation() function from the caret package to identify redundant variables for us. Setting names = TRUE outputs the names of the variables that are determined to be redundant; for each highly correlated pair, the function flags the variable with the larger mean absolute correlation. We can then remove these variables from our vars vector and exclude them from our analysis. We set the correlation cutoff at 0.75.

high_corrs <- findCorrelation(x = cor_vars, 
                              cutoff = .75, 
                              names = TRUE)

By running a code line with the name of the output object (high_corrs), we can view the names of the redundant variables.

high_corrs
## [1] "ConcavePoints" "Concavity"     "Perimeter"     "Radius"

Now, we can remove them from our vars vector so that we exclude them from our list of input (X) variable names.

vars <- vars[!vars %in% high_corrs]
vars
## [1] "Texture"     "Area"        "Smoothness"  "Compactness" "Symmetry"   
## [6] "FractalDim"

We can view the correlation matrix after removing redundant variables.

symnum(x = cor(x = cancer[ ,vars]),
       corr = TRUE)
##             T A Sm C Sy F
## Texture     1            
## Area        . 1          
## Smoothness      1        
## Compactness   . ,  1     
## Symmetry        .  , 1   
## FractalDim      .  . .  1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
  3. Rescale Numeric Variables. kNN has been shown to perform well with min-max (range) normalization, which rescales the numeric variables to fall between 0 and 1. We can use the preProcess() and predict() functions and save the rescaled data as a new dataframe named cancer_mm.

cen_mm <- preProcess(x = cancer[ ,vars],
                     method = "range")
cancer_mm <- predict(object = cen_mm,
                     newdata = cancer)

Note: only the variables in vars are rescaled by the above; ID, Diagnosis, and the redundant variables we removed are carried through to cancer_mm unchanged.
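
For reference, min-max normalization rescales each value x of a variable to (x - min) / (max - min). A minimal sketch of the same transformation done by hand, which should reproduce the rescaled columns in cancer_mm:

# Min-max (range) normalization by hand
minmax <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}
head(minmax(cancer$Area)) # should match head(cancer_mm$Area)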

  4. Binarization. If categorical input (X) variables are used in the analysis, they must be converted to binary variables: use the class2ind() function from the caret package for categorical variables with 2 class levels, and the dummyVars() function from the caret package together with the predict() function for categorical variables with more than 2 class levels. A hypothetical example follows below.
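
Our dataset has no categorical predictors, so no binarization is needed here. As an illustrative sketch using a hypothetical 3-level factor (the Color variable is made up for this example):

# Hypothetical example: binarize a 3-level factor with dummyVars()
df_example <- data.frame(Color = factor(c("red", "green", "blue", "red")))
dv <- dummyVars(formula = ~ Color, data = df_example)
predict(object = dv, newdata = df_example) # one 0/1 column per level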

Training & Testing Sets

We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Diagnosis). Since the function takes a random sample, we initialize a random seed first for reproducibility.

set.seed(831) # initialize the random seed

# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = cancer_mm$Diagnosis, 
                           p = 0.80, 
                           list = FALSE)

Next, we subset the rows of the cancer_mm dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.

train <- cancer_mm[sub, ] 
test <- cancer_mm[-sub, ]
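
We can verify that the split preserved the distribution of Diagnosis by comparing the class proportions in the two sets:

prop.table(table(train$Diagnosis)) # class proportions in the training set
prop.table(table(test$Diagnosis))  # class proportions in the testing set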

Analysis

Naive Model (Best Guess k value)

First, we can identify a ‘best guess’ value of k (the square root of the number of training observations).

ceiling(sqrt(nrow(train)))
## [1] 22

To avoid ties, we should choose an odd number, in this case either 21 or 23, as the k value in our naive kNN model.

To build a basic (naive) kNN model, we use the knn() function from the class package. We use k = 21 to use the 21 nearest neighbors.

knn.pred <- knn(train = train[ ,vars],
                test = test[ ,vars], 
                cl = train$Diagnosis, 
                k = 21)

Model Performance & Fit

With kNN, no model is built in advance (it is a lazy learner), so we cannot assess performance on the training set or assess goodness of fit by comparing the model's performance on the training and testing sets. For this reason, we move on to the performance of the model on the testing set.

We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model. We set mode = "everything" to obtain all available performance measures. We can view the output by running a code line with the name of the confusionMatrix object (conf_basic).

conf_basic <- confusionMatrix(data = knn.pred, # predicted classes
                reference = test$Diagnosis, # true class labels 
                positive = "M", # the class level to display class-level performance for
                mode = "everything") # get all performance measures
conf_basic
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 69  5
##          M  2 37
##                                           
##                Accuracy : 0.9381          
##                  95% CI : (0.8765, 0.9747)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 1.718e-14       
##                                           
##                   Kappa : 0.8654          
##                                           
##  Mcnemar's Test P-Value : 0.4497          
##                                           
##             Sensitivity : 0.8810          
##             Specificity : 0.9718          
##          Pos Pred Value : 0.9487          
##          Neg Pred Value : 0.9324          
##               Precision : 0.9487          
##                  Recall : 0.8810          
##                      F1 : 0.9136          
##              Prevalence : 0.3717          
##          Detection Rate : 0.3274          
##    Detection Prevalence : 0.3451          
##       Balanced Accuracy : 0.9264          
##                                           
##        'Positive' Class : M               
## 

Hyperparameter Tuning Model

We use the train() function in the caret package to tune our hyperparameters using repeated k-fold cross validation. In the case of kNN, the hyperparameter that we want to tune is k, the number of nearest neighbors.

By default, the train() function will determine the ‘best’ model based on Accuracy for classification and RMSE for regression. For classification models, the Accuracy and Kappa are automatically computed and provided.

First, we set up a trainControl object (named ctrl) using the trainControl() function in the caret package. We specify that we want to perform 10-fold cross validation, repeated 3 times. We use this object as input to the trControl argument in the train() function below.

ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3)

Next, we initialize a random seed for our cross validation, since we will randomly resample the data.

set.seed(831)

Then, we use the train() function to train the kNN model using 10-fold cross validation (repeated 3 times). We set tuneLength = 10 to try the first 10 default values of k (odd numbers from 5 to 23) as our grid in a grid search.

knnFit1 <- train(x = train[ ,vars],
                 y = train$Diagnosis, 
                 method = "knn", 
                 trControl = ctrl, 
                 tuneLength = 10)
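
If we want explicit control over the candidate values instead of relying on tuneLength, we could pass a tuneGrid (for method = "knn", a data frame with a single column named k). A sketch, assuming we want odd values from 3 to 25:

# Alternative: search a custom grid of odd k values
set.seed(831)
knnFit_grid <- train(x = train[ ,vars],
                     y = train$Diagnosis,
                     method = "knn",
                     trControl = ctrl,
                     tuneGrid = expand.grid(k = seq(from = 3, to = 25, by = 2)))

Below, we continue with the tuneLength-based model stored in knnFit1.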

We can view the results of our cross validation across k values for Accuracy and Kappa. The output will also identify the optimal k.

knnFit1
## k-Nearest Neighbors 
## 
## 456 samples
##   6 predictor
##   2 classes: 'B', 'M' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times) 
## Summary of sample sizes: 410, 411, 411, 410, 410, 410, ... 
## Resampling results across tuning parameters:
## 
##   k   Accuracy   Kappa    
##    5  0.9348148  0.8585740
##    7  0.9296296  0.8470910
##    9  0.9266989  0.8397922
##   11  0.9267311  0.8392903
##   13  0.9282609  0.8428475
##   15  0.9282609  0.8425604
##   17  0.9304831  0.8468086
##   19  0.9290177  0.8433096
##   21  0.9253301  0.8347032
##   23  0.9238647  0.8312971
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.

We can plot the train object (knnFit1) using the plot() function to view a plot of the hyperparameter, k, on the x-axis and the Accuracy on the y-axis.

plot(knnFit1)

Model Performance & Fit

Since we performed repeated k-fold cross validation on our training set, we can view the confusion matrix showing the average predictive performance of the model across resamples.

confusionMatrix(knnFit1)
## Cross-Validated (10 fold, repeated 3 times) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    B    M
##          B 60.6  4.4
##          M  2.1 32.9
##                             
##  Accuracy (average) : 0.9349

Finally, we can use our best tuned model to predict the testing data (test). We use the predict() function with the best model found by train(), stored in the knnFit1 object, to predict the Diagnosis class for each observation in the test dataframe.

outpreds <- predict(object = knnFit1, 
                    newdata = test)

Again, we can use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model, comparing our predictions (outpreds) against the true Diagnosis values in test. We set mode = "everything" to obtain all available performance measures.

conf_tuned <- confusionMatrix(data = outpreds, 
                reference = test$Diagnosis, 
                positive = "M",
                mode = "everything")
conf_tuned
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  B  M
##          B 66  5
##          M  5 37
##                                           
##                Accuracy : 0.9115          
##                  95% CI : (0.8433, 0.9567)
##     No Information Rate : 0.6283          
##     P-Value [Acc > NIR] : 6.062e-12       
##                                           
##                   Kappa : 0.8105          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8810          
##             Specificity : 0.9296          
##          Pos Pred Value : 0.8810          
##          Neg Pred Value : 0.9296          
##               Precision : 0.8810          
##                  Recall : 0.8810          
##                      F1 : 0.8810          
##              Prevalence : 0.3717          
##          Detection Rate : 0.3274          
##    Detection Prevalence : 0.3717          
##       Balanced Accuracy : 0.9053          
##                                           
##        'Positive' Class : M               
## 

To describe the overall model performance, we can focus on the accuracy and kappa values, where we prefer values close to 1 for both measures.

conf_tuned$overall[c("Accuracy", "Kappa")]
##  Accuracy     Kappa 
## 0.9115044 0.8105298
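
Because we saved both confusionMatrix objects, we can also put the naive and tuned models side by side on these overall measures:

# Overall performance: naive (k = 21) vs. tuned (k = 5)
rbind(naive = conf_basic$overall[c("Accuracy", "Kappa")],
      tuned = conf_tuned$overall[c("Accuracy", "Kappa")])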

We can describe class-level performance for the different class levels. Above, we set positive = "M", so the class-level measures are for malignant tumors.

conf_tuned$byClass
##          Sensitivity          Specificity       Pos Pred Value 
##            0.8809524            0.9295775            0.8809524 
##       Neg Pred Value            Precision               Recall 
##            0.9295775            0.8809524            0.8809524 
##                   F1           Prevalence       Detection Rate 
##            0.8809524            0.3716814            0.3274336 
## Detection Prevalence    Balanced Accuracy 
##            0.3716814            0.9052649

Note: We could run it again setting positive = "B" to get class-level performance for the benign tumor class.

The pathologist wants the model to have high sensitivity, so that cancerous tumors are correctly diagnosed, and a low false positive rate (type I error rate), which is 1 - specificity, so that benign tumors are rarely flagged as malignant.

We can isolate the Sensitivity first, where we prefer values close to 1.

conf_tuned$byClass["Sensitivity"]
## Sensitivity 
##   0.8809524

Next, we can isolate the Specificity component of our confusionMatrix object and subtract it from 1 to obtain our false positive rate, where we prefer values close to 0. (The printed element keeps the name Specificity, but the value shown below is 1 - specificity, the false positive rate.)

1 - conf_tuned$byClass["Specificity"]
## Specificity 
##  0.07042254
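
Finally, we can compare the naive and tuned models on the pathologist's two criteria, sensitivity and the false positive rate:

# Sensitivity and false positive rate: naive vs. tuned
data.frame(Sensitivity = c(conf_basic$byClass["Sensitivity"],
                           conf_tuned$byClass["Sensitivity"]),
           FPR = c(1 - conf_basic$byClass["Specificity"],
                   1 - conf_tuned$byClass["Specificity"]),
           row.names = c("naive", "tuned"))

On this particular test split, both models achieve the same sensitivity (0.881), while the naive model has the lower false positive rate (0.028 vs. 0.070), so by the pathologist's criteria the naive model is preferable here.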