We will use the DescTools, caret and class packages. If you do not already have the class package installed, you will first install it using the install.packages() function.
install.packages("class")
Next, we load all of the necessary libraries for use in the session.
library(caret)
library(DescTools)
library(class)
In the lesson that follows we will use the cancer.csv file. The cancer.csv dataset contains diagnostic information for 569 patients. The variables are derived from images of the tumors, which are diagnosed (Diagnosis) as either malignant (M, cancerous) or benign (B, non-cancerous).
The variables describing the tumors are: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, ConcavePoints, Symmetry, and FractalDim.
The pathologist diagnosing the tumors would like to automate the process by creating the best possible predictive model for the Diagnosis variable. The pathologist prioritizes correct cancer diagnoses (Diagnosis = M and prediction = M), or true positives, and wants to minimize false positives (Diagnosis = B and prediction = M).
We use the read.csv() function to import the CSV file into R as a dataframe named cancer. We set stringsAsFactors = FALSE to keep any character columns as-is.
cancer <- read.csv(file = "cancer.csv",
                   stringsAsFactors = FALSE)
First, we can obtain high-level information about the cancer dataframe using the Abstract() function from the DescTools package to look at the variable types and to check for missing (NA) values.
Abstract(cancer)
## ------------------------------------------------------------------------------
## cancer
##
## data frame: 569 obs. of 12 variables
## 569 complete cases (100.0%)
##
## Nr ColName Class NAs Levels
## 1 ID integer .
## 2 Diagnosis character .
## 3 Radius numeric .
## 4 Texture numeric .
## 5 Perimeter numeric .
## 6 Area numeric .
## 7 Smoothness numeric .
## 8 Compactness numeric .
## 9 Concavity numeric .
## 10 ConcavePoints numeric .
## 11 Symmetry numeric .
## 12 FractalDim numeric .
Next, we can convert our target class variable that we want to predict, Diagnosis, to a nominal factor variable. Note: B becomes level 1 and M becomes level 2, based on alphabetical ordering of the levels.
cancer$Diagnosis <- factor(cancer$Diagnosis)
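To confirm the level ordering, we can inspect the factor levels (a quick check of our own, not part of the original output):
levels(cancer$Diagnosis) # "B" is level 1, "M" is level 2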
Next, we can obtain summary statistic information using the summary() function.
summary(cancer)
## ID Diagnosis Radius Texture
## Min. : 8670 B:357 Min. : 6.981 Min. : 9.71
## 1st Qu.: 869218 M:212 1st Qu.:11.700 1st Qu.:16.17
## Median : 906024 Median :13.370 Median :18.84
## Mean : 30371831 Mean :14.127 Mean :19.29
## 3rd Qu.: 8813129 3rd Qu.:15.780 3rd Qu.:21.80
## Max. :911320502 Max. :28.110 Max. :39.28
## Perimeter Area Smoothness Compactness
## Min. : 43.79 Min. : 143.5 Min. :0.05263 Min. :0.01938
## 1st Qu.: 75.17 1st Qu.: 420.3 1st Qu.:0.08637 1st Qu.:0.06492
## Median : 86.24 Median : 551.1 Median :0.09587 Median :0.09263
## Mean : 91.97 Mean : 654.9 Mean :0.09636 Mean :0.10434
## 3rd Qu.:104.10 3rd Qu.: 782.7 3rd Qu.:0.10530 3rd Qu.:0.13040
## Max. :188.50 Max. :2501.0 Max. :0.16340 Max. :0.34540
## Concavity ConcavePoints Symmetry FractalDim
## Min. :0.00000 Min. :0.00000 Min. :0.1060 Min. :0.04996
## 1st Qu.:0.02956 1st Qu.:0.02031 1st Qu.:0.1619 1st Qu.:0.05770
## Median :0.06154 Median :0.03350 Median :0.1792 Median :0.06154
## Mean :0.08880 Mean :0.04892 Mean :0.1812 Mean :0.06280
## 3rd Qu.:0.13070 3rd Qu.:0.07400 3rd Qu.:0.1957 3rd Qu.:0.06612
## Max. :0.42680 Max. :0.20120 Max. :0.3040 Max. :0.09744
We can plot the distribution of our output (Y) variable using a barplot, which is the default plot for the plot() function when plotting factor variables. As shown, there are more benign tumors than malignant tumors in our sample.
plot(x = cancer$Diagnosis,
     main = "Diagnosis")
We can set up a vector of the input (X) variable names (vars) for convenience. We exclude the ID variable and our Y variable, Diagnosis, which are in the 1st and 2nd columns.
vars <- names(cancer)[-(1:2)]
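As a quick check of our own, printing vars should show the ten tumor measurement names from the Abstract() output above:
vars # expected: Radius, Texture, Perimeter, Area, Smoothness, Compactness, Concavity, ConcavePoints, Symmetry, FractalDim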
For kNN, we know that we are okay keeping irrelevant X variables, but we need to impute or remove any missing values, remove redundant variables, transform/rescale our numeric variables, and binarize any categorical variables.
If we had missing values, we could remove the incomplete rows (e.g., using the na.omit() function) or perform imputation. We can visualize any missing values using the PlotMiss() function from the DescTools package.
PlotMiss(x = cancer)
As shown, we do not have any missing values in the data.
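We could also confirm this numerically with base R (a check of our own, not part of the original output):
colSums(x = is.na(cancer)) # all counts should be 0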
First, we obtain the correlation matrix for our numeric predictor variables.
cor_vars <- cor(x = cancer[ ,vars])
We can start by looking at the symbolic correlation matrix to manually identify correlated variables.
symnum(x = cor_vars,
       corr = TRUE)
## R T P A Sm Cm Cn CP Sy F
## Radius 1
## Texture . 1
## Perimeter B . 1
## Area B . B 1
## Smoothness 1
## Compactness . . . , 1
## Concavity , . , , . + 1
## ConcavePoints + + + . + * 1
## Symmetry . , . . 1
## FractalDim . . . . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
We can use the findCorrelation() function from the caret package to identify redundant variables for us. Setting names = TRUE will return the names of the variables that are determined to be redundant. We can then remove them from our vars vector and exclude them from our analysis. We set a cutoff point for our correlation at 0.75.
high_corrs <- findCorrelation(x = cor_vars,
                              cutoff = .75,
                              names = TRUE)
By running a code line with the name of the output object (high_corrs), we can view the names of the redundant variables.
high_corrs
## [1] "ConcavePoints" "Concavity" "Perimeter" "Radius"
Now, we can remove them from our vars vector so that we exclude them from our list of input (X) variable names.
vars <- vars[!vars %in% high_corrs]
vars
## [1] "Texture" "Area" "Smoothness" "Compactness" "Symmetry"
## [6] "FractalDim"
We can view the correlation matrix after removing redundant variables.
symnum(x = cor(x = cancer[ ,vars]),
       corr = TRUE)
## T A Sm C Sy F
## Texture 1
## Area . 1
## Smoothness 1
## Compactness . , 1
## Symmetry . , 1
## FractalDim . . . 1
## attr(,"legend")
## [1] 0 ' ' 0.3 '.' 0.6 ',' 0.8 '+' 0.9 '*' 0.95 'B' 1
Next, we rescale our numeric input (X) variables to the [0, 1] range (min-max normalization) using the preProcess() and predict() functions from the caret package, and save the rescaled data as a new dataframe named cancer_mm. First, we create the preprocessing object, cen_mm.
cen_mm <- preProcess(x = cancer[ ,vars],
                     method = "range")
cancer_mm <- predict(object = cen_mm,
                     newdata = cancer)
Note: the variables that we are not using as input to our model will not be rescaled using the above.
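As a sanity check of our own (not part of the original output), we can verify that the rescaled inputs now fall in the [0, 1] range:
summary(object = cancer_mm[ ,vars]) # Min. should be 0 and Max. should be 1 for each input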
If we had any categorical input (X) variables, we would binarize them using the class2ind() function from the caret package for categorical variables with 2 class levels, or the dummyVars() function from the caret package combined with the predict() function for categorical variables with more than 2 class levels.
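A minimal sketch, assuming two hypothetical factor variables (sex, with 2 levels, and grade, with 3 levels) that are not in the cancer data:
# 2 class levels: class2ind() returns one 0/1 indicator column per level
sex <- factor(x = c("F", "M", "M", "F"))
class2ind(x = sex)

# more than 2 class levels: dummyVars() + predict()
toy <- data.frame(grade = factor(x = c("low", "medium", "high", "low")))
dums <- dummyVars(formula = ~ grade, data = toy)
predict(object = dums, newdata = toy) # one dummy column per level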
We use the createDataPartition() function from the caret package to identify the row numbers that we will include in our training set. Then, all other rows will be put in our testing set. We split the data using an 80/20 split (80% in training and 20% in testing). By using createDataPartition() we preserve the distribution of our outcome (Y) variable (Diagnosis). Since the function takes a random sample, we initialize a random seed first for reproducibility.
set.seed(831) # initialize the random seed
# Generate the list of observations for the
# train dataframe
sub <- createDataPartition(y = cancer_mm$Diagnosis,
                           p = 0.80,
                           list = FALSE)
Next, we subset the rows of the cancer_mm dataframe to include the row numbers in the sub object to create the train dataframe. We use all observations not in the sub object to create the test dataframe.
train <- cancer_mm[sub, ]
test <- cancer_mm[-sub, ]
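Because createDataPartition() performs stratified sampling, the class distribution should be approximately preserved in both subsets; a quick check of our own:
prop.table(x = table(train$Diagnosis)) # class proportions in the training set
prop.table(x = table(test$Diagnosis)) # class proportions in the testing set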
First, we can identify a ‘best guess’ value of k (the square root of the number of training observations).
ceiling(sqrt(nrow(train)))
## [1] 22
To avoid ties, we should choose an odd number, in this case either 21 or 23, as the k value in our naive kNN model.
To build a basic (naive) kNN model, we use the knn() function from the class package. We use k = 21 to use 21 nearest neighbors.
knn.pred <- knn(train = train[ ,vars],
                test = test[ ,vars],
                cl = train$Diagnosis,
                k = 21)
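We can preview the first few predicted class labels (a check of our own):
head(x = knn.pred)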
With kNN, no explicit model is built, so we cannot assess performance on the training set or assess goodness of fit by comparing the model's performance on the training and testing sets. For this reason, we move on to the performance of the model on the testing set.
We can use the confusionMatrix() function from the caret package to obtain a confusion matrix and performance measures for our model. We can set mode = "everything" to obtain all available performance measures, and view the output by running a code line with the name of the confusionMatrix object.
conf_basic <- confusionMatrix(data = knn.pred, # vector of predictions
                              reference = test$Diagnosis, # true class labels
                              positive = "M", # class level to report class-level performance for
                              mode = "everything") # get all performance measures
conf_basic
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 69 5
## M 2 37
##
## Accuracy : 0.9381
## 95% CI : (0.8765, 0.9747)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 1.718e-14
##
## Kappa : 0.8654
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.8810
## Specificity : 0.9718
## Pos Pred Value : 0.9487
## Neg Pred Value : 0.9324
## Precision : 0.9487
## Recall : 0.8810
## F1 : 0.9136
## Prevalence : 0.3717
## Detection Rate : 0.3274
## Detection Prevalence : 0.3451
## Balanced Accuracy : 0.9264
##
## 'Positive' Class : M
##
We use the train() function in the caret package to tune our hyperparameters using repeated k-fold cross validation. In the case of kNN, the hyperparameter that we want to tune is k, the number of nearest neighbors.
By default, the train() function will determine the ‘best’ model based on Accuracy for classification and RMSE for regression. For classification models, the Accuracy and Kappa are automatically computed and provided.
First, we set up a trainControl object (named ctrl) using the trainControl() function in the caret package. We specify that we want to perform 10-fold cross validation, repeated 3 times. We use this object as input to the trControl argument in the train() function below.
ctrl <- trainControl(method = "repeatedcv",
                     number = 10,
                     repeats = 3)
Next, we initialize a random seed for our cross validation, since we will randomly resample the data.
set.seed(831)
Then, we use the train() function to train the kNN model using 10-fold cross validation (repeated 3 times). We set tuneLength = 10 to try the first 10 default values of k (odd numbers from 5 to 23) as our grid in a grid search.
knnFit1 <- train(x = train[ ,vars],
                 y = train$Diagnosis,
                 method = "knn",
                 trControl = ctrl,
                 tuneLength = 10)
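If we wanted explicit control over the candidate k values, we could pass our own grid through the tuneGrid argument instead of tuneLength; a sketch, assuming we want to search the odd values from 3 to 31 (knnFit_grid is a name of our choosing):
set.seed(831)
knnFit_grid <- train(x = train[ ,vars],
                     y = train$Diagnosis,
                     method = "knn",
                     trControl = ctrl,
                     tuneGrid = expand.grid(k = seq(from = 3, to = 31, by = 2)))
By default, train() selects the k with the highest Accuracy; supplying metric = "Kappa" would change the selection criterion. We continue with the tuneLength-based fit, knnFit1, below.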
We can view the results of our cross validation across k values for Accuracy and Kappa. The output will also identify the optimal k.
knnFit1
## k-Nearest Neighbors
##
## 456 samples
## 6 predictor
## 2 classes: 'B', 'M'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 3 times)
## Summary of sample sizes: 410, 411, 411, 410, 410, 410, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 5 0.9348148 0.8585740
## 7 0.9296296 0.8470910
## 9 0.9266989 0.8397922
## 11 0.9267311 0.8392903
## 13 0.9282609 0.8428475
## 15 0.9282609 0.8425604
## 17 0.9304831 0.8468086
## 19 0.9290177 0.8433096
## 21 0.9253301 0.8347032
## 23 0.9238647 0.8312971
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
We can plot the train object (knnFit1) using the plot() function to view a plot of the hyperparameter, k, on the x-axis and the Accuracy on the y-axis.
plot(knnFit1)
Since we performed repeated k-fold cross validation on our training set, we can view the confusion matrix showing the average predictive performance of the model across resamples.
confusionMatrix(knnFit1)
## Cross-Validated (10 fold, repeated 3 times) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction B M
## B 60.6 4.4
## M 2.1 32.9
##
## Accuracy (average) : 0.9349
Finally, we can use our best tuned model to predict the testing data (test). First, we use the predict() function to predict the value of the Diagnosis variable for the observations in the test dataframe, using the best model from train(), which is stored in the knnFit1 object.
outpreds <- predict(object = knnFit1,
                    newdata = test)
Again, we can use the confusionMatrix() function from the caret package to obtain a confusion matrix and obtain performance measures for our model. We can set mode = "everything" to obtain all available performance measures.
conf_tuned <- confusionMatrix(data = outpreds,
                              reference = test$Diagnosis,
                              positive = "M",
                              mode = "everything")
conf_tuned
## Confusion Matrix and Statistics
##
## Reference
## Prediction B M
## B 66 5
## M 5 37
##
## Accuracy : 0.9115
## 95% CI : (0.8433, 0.9567)
## No Information Rate : 0.6283
## P-Value [Acc > NIR] : 6.062e-12
##
## Kappa : 0.8105
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8810
## Specificity : 0.9296
## Pos Pred Value : 0.8810
## Neg Pred Value : 0.9296
## Precision : 0.8810
## Recall : 0.8810
## F1 : 0.8810
## Prevalence : 0.3717
## Detection Rate : 0.3274
## Detection Prevalence : 0.3717
## Balanced Accuracy : 0.9053
##
## 'Positive' Class : M
##
To describe the overall model performance, we can focus on the accuracy and kappa values, where we prefer values close to 1 for both measures.
conf_tuned$overall[c("Accuracy", "Kappa")]
## Accuracy Kappa
## 0.9115044 0.8105298
We can describe class-level performance for the different class levels. Above, we set positive = "M", so the class-level measures are for malignant tumors.
conf_tuned$byClass
## Sensitivity Specificity Pos Pred Value
## 0.8809524 0.9295775 0.8809524
## Neg Pred Value Precision Recall
## 0.9295775 0.8809524 0.8809524
## F1 Prevalence Detection Rate
## 0.8809524 0.3716814 0.3274336
## Detection Prevalence Balanced Accuracy
## 0.3716814 0.9052649
Note: We could run it again setting positive = "B" to get class-level performance for the benign tumor class; a sketch follows.
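The call uses the same inputs as above; only the positive argument changes (conf_tuned_B is a name of our choosing):
conf_tuned_B <- confusionMatrix(data = outpreds,
                                reference = test$Diagnosis,
                                positive = "B",
                                mode = "everything")
conf_tuned_B$byClass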
The pathologist wants the model to have high sensitivity, so that cancerous tumors are correctly diagnosed, and to avoid false positives by having a low false positive rate (Type I error rate), which is 1 - specificity.
We can isolate the Sensitivity first, where we prefer values close to 1.
conf_tuned$byClass["Sensitivity"]
## Sensitivity
## 0.8809524
Next, we can isolate the Specificity component of our confusionMatrix object and subtract it from 1 to obtain our false positive rate, where we prefer values close to 0.
1 - conf_tuned$byClass["Specificity"]
## Specificity
## 0.07042254
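Note that the "Specificity" label above is carried over from the named vector element; the value shown is the false positive rate. To avoid confusion, we could drop the name using base R's unname() function (a cosmetic step we add here):
unname(obj = 1 - conf_tuned$byClass["Specificity"])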