Our Plan:
* We want to create a KNN model that predicts if a person has chronic kidney disease or not
* We want to know what features best predict if a person will have chronic kidney disease or not
* We aim to optimize the classifier in order to identify if a patient has chronic kidney disease or not
* Knowing the most prominent risk factors would allow people with those risk factors to be tested for chronic kidney disease more often, so the disease is more likely to be caught early, which can be life-saving
* We believe blood pressure, potassium, diabetes mellitus and age will be integral in predicting if a patient has chronic kidney disease or not
* The metric we aim to optimize is sensitivity: in health-care data, the true positive rate should be maximized in order to increase the rate of true kidney disease diagnoses.
Taking a closer look at the target variable, we can see that our dataset is not balanced: there are 100 more cases of people with chronic kidney disease (250) than of healthy people (150).
Older people are more susceptible to disease in general, and the same holds true for chronic kidney disease. Looking at the age distribution of our data, the majority of the patients are in the middle-to-old age range.
People often have both diabetes and chronic kidney disease, so looking more closely at the diabetes variable will help us understand our data better.
##
##   0   1
## 150 250
##
##     1
## 0.625
##
##     0
## 0.375
The base split of the data shows that 62.5% of patients have chronic kidney disease (CKD) and 37.5% do not.
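For reference, a minimal sketch of how these counts and proportions might be produced, assuming the cleaned data frame is called kidney and the target column is classification (as in the model call later in the report):

# Class counts and proportions for the target variable
# (kidney is an assumed object name)
table(kidney$classification)              # raw counts: 0 = not CKD, 1 = CKD
prop.table(table(kidney$classification))  # share of each class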
We have 320 training data points and 80 testing data points.
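A sketch of how such an 80/20 split might be produced with caret::createDataPartition(); kidney and kidney_test are assumed names, while kidney_train appears in the random forest call later in the report:

library(caret)

set.seed(1)  # illustrative seed for a reproducible split
# Stratified split so both sets keep roughly the same 0/1 class ratio
train_idx    <- createDataPartition(as.factor(kidney$classification),
                                    p = 0.8, list = FALSE)
kidney_train <- kidney[train_idx, ]   # ~320 rows
kidney_test  <- kidney[-train_idx, ]  # ~80 rows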
##
## subsetted2_3NN  0  1
##              0 33  1
##              1  0 46
## [1] 33 46
## [1] 0.9875
## [1] 0.9787234
At k = 3, accuracy = 0.9875 and sensitivity = 0.979. In the context of this project, it is most important to optimize the sensitivity (TPR) of the classifier in order to maximize the number of true chronic kidney disease diagnoses.
## Confusion Matrix and Statistics
##
##           Actual
## Prediction  0  1
##          0 33  1
##          1  0 46
##
## Accuracy : 0.9875
## 95% CI : (0.9323, 0.9997)
## No Information Rate : 0.5875
## P-Value [Acc > NIR] : <2e-16
##
## Kappa : 0.9743
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9787
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9706
## Prevalence : 0.5875
## Detection Rate : 0.5750
## Detection Prevalence : 0.5750
## Balanced Accuracy : 0.9894
##
## 'Positive' Class : 1
##
Specificity = 1.000, possibly due to the limited size of the dataset.
At k values lower than k = 3, there is a risk of overfitting and merely memorizing the data, producing deceptively high accuracy and sensitivity metrics. At k values greater than k = 3, sensitivity increases only slightly. Given the computational cost of running KNN as a machine learning classifier, it is best to reduce the complexity of the model by keeping the number of neighbors at k = 3; there is no significant improvement in sensitivity at higher hyperparameter values. Thus, k = 3 is the optimal value for the classifier.
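A sketch of how candidate k values could be compared on sensitivity; the scaled feature matrices train_x/test_x and label vectors train_y/test_y are illustrative names, not objects from the original analysis:

library(class)   # knn()
library(caret)   # confusionMatrix()

# Compare accuracy and sensitivity across a few k values
for (k in c(1, 3, 5, 7, 9)) {
  pred <- knn(train = train_x, test = test_x, cl = train_y, k = k)
  cm   <- confusionMatrix(pred, factor(test_y), positive = "1")
  cat("k =", k,
      "accuracy =", round(cm$overall["Accuracy"], 4),
      "sensitivity =", round(cm$byClass["Sensitivity"], 4), "\n")
}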
As the k hyperparameter increases, the accuracy metric declines. This trend reinforces our decision to keep k = 3. Next, at the default 50% classification threshold, accuracy = 0.9 and sensitivity = 0.84. While these metrics are decent, lowering the classification threshold to approximately 30% decreased accuracy to 0.8875 and increased sensitivity to 0.9. As sensitivity is the metric we want to optimize, a lower threshold is favorable: it allows more observations to be classified as 1 (chronic kidney disease), so more truly positive cases are caught and the number of false negatives (CKD cases labeled as 0/not CKD) is reduced.
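A sketch of how a lower classification threshold could be applied, using caret::knn3() to obtain class probabilities; kidney_train comes from the random forest call shown later in the report, while kidney_test and the other object names are assumptions:

library(caret)

# 3-NN model that returns class probabilities rather than hard labels
knn_fit  <- knn3(as.factor(classification) ~ ., data = kidney_train, k = 3)
prob_ckd <- predict(knn_fit, newdata = kidney_test, type = "prob")[, "1"]

# Classify as CKD (1) whenever P(class = 1) exceeds 0.30 instead of 0.50
pred_30 <- factor(ifelse(prob_ckd > 0.30, 1, 0), levels = c(0, 1))
confusionMatrix(pred_30, as.factor(kidney_test$classification), positive = "1")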
## [[1]]
## [1] 0.9466667
The area beneath the ROC curve is 94.67%. This high AUC value means the classifier discriminates well between the two classes across thresholds, which supports the goal of maximizing sensitivity (true positive rate). According to the curve produced, the true positive rate reaches roughly 0.85 before the curve flattens, so sensitivity is effectively maximized around that value.
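A sketch of how the ROC curve and AUC might be computed with the ROCR package, whose performance() output has the list format shown above; prob_ckd and kidney_test are the assumed objects from the threshold sketch:

library(ROCR)

# ROC curve and AUC from the predicted P(class = 1)
pred_obj <- prediction(prob_ckd, kidney_test$classification)
perf_roc <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(perf_roc, main = "ROC curve for the 3-NN classifier")

performance(pred_obj, measure = "auc")@y.values  # area under the curve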
## [1] 4.900803
A high log loss indicates that when the model classifies an observation incorrectly, it often does so with high confidence. The log loss calculated is 4.9, which is relatively high. A slightly higher hyperparameter value might improve the log loss. To improve the model beyond tuning the hyperparameter, the Random Forest method was employed. Additionally, with a small dataset (203 rows), a few confidently wrong predictions can heavily skew the log loss.
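A sketch of the log loss calculation with MLmetrics, again reusing the assumed prob_ckd probabilities and test labels from the sketches above:

library(MLmetrics)

# Log loss heavily penalizes confident but wrong probability estimates
LogLoss(y_pred = prob_ckd,
        y_true = as.numeric(as.character(kidney_test$classification)))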
The Random Forest method uses bootstrapping (sampling with replacement) to build many classification trees and reduces over-fitting. Averaging over many trees reduces the variance of the model, as opposed to using a single tree, which may merely memorize the training data.
* "ntree" is the number of trees grown in the forest
* "mtry" is the number of variables randomly tried at each split in a tree
* Initial values: ntree = 1000, mtry = 4
* Error rates: 5.28% OOB overall, 7.35% for class 0, and 4.02% for class 1
##
## Call:
## randomForest(formula = as.factor(classification) ~ ., data = kidney_train, ntree = 1000, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 10, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 1000
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 5.28%
## Confusion matrix:
## 0 1 class.error
## 0 126 10 0.07352941
## 1 9 215 0.04017857
##
##   0   1
## 150 250
## [1] 0.9469231
* 150 observations classified as the negative class (not CKD) and 250 classified as the positive class (CKD)
* Accuracy = 0.9469

The accuracy of the initial random forest is 0.9469, which is relatively high.
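One way such an overall accuracy can be computed from the OOB confusion matrix is sketched below; kidney_RF is an assumed name for the initial forest, and the report's 0.9469 may have been computed slightly differently (for example from vote proportions):

# Accuracy from the OOB confusion matrix of the initial forest
cm_rf <- kidney_RF$confusion[, c("0", "1")]  # drop the class.error column
sum(diag(cm_rf)) / sum(cm_rf)                # proportion of correct OOB predictions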
## 0 1 MeanDecreaseAccuracy MeanDecreaseGini
## age 0.005591822 3.250976e-03 0.004137421 1.70282740
## bp 0.004726589 1.292696e-03 0.002598743 0.57093410
## su 0.043264080 4.870567e-04 0.016656058 1.61327361
## rbc 0.015383129 3.254801e-04 0.006006388 0.52878105
## pc 0.028376827 1.912111e-05 0.010706415 0.87958306
## pcc 0.006626822 -8.663545e-05 0.002440321 0.20537986
## ba 0.004678200 9.878902e-05 0.001828676 0.25887168
## bgr 0.162819987 1.014681e-02 0.067650398 6.97336165
## bu 0.165655583 1.101810e-02 0.069331175 6.59600118
## sod 0.081477548 1.184088e-02 0.038118792 3.98087739
## pot 0.052474240 9.382889e-03 0.025640086 2.75674215
## wc 0.069731726 1.250578e-02 0.034175240 3.43877696
## htn 0.211345131 2.453555e-02 0.094983285 8.07006473
## cad 0.002291292 -1.586216e-04 0.000768000 0.08679487
## appet 0.038740491 -8.126450e-04 0.014040164 1.10107257
## pe 0.033894053 -3.187137e-04 0.012590950 1.06303313
## ane 0.015653307 -6.141705e-04 0.005532842 0.52238923
Judging by the mean decrease in accuracy and in Gini, the top variables of importance in classifying whether a patient has chronic kidney disease are hypertension (htn), blood urea (bu), blood glucose random (bgr), and sodium (sod). The "Mean Decrease Accuracy" measure records how much classification accuracy decreases if the corresponding variable is excluded; for example, if age is not used the model is about 0.4% less accurate. The "Mean Decrease Gini" measures the total decrease in node impurity (Gini index) contributed by splits on that variable.
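A sketch of how the importance table above and a corresponding plot can be produced, again assuming the initial forest is stored as kidney_RF:

library(randomForest)

# Permutation (accuracy) and Gini importance for each predictor
importance(kidney_RF)

# Dot plot of both importance measures
varImpPlot(kidney_RF, main = "Variable importance for CKD classification")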
As illustrated by the error-versus-trees plot, the error decreases substantially as more trees are added, with sharper dips around certain points, including 650-700 trees. This makes sense, as averaging over a large number of trees protects against over-fitting and stabilizes the error.
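A minimal sketch of how this error-versus-trees curve can be drawn from the fitted forest's stored error rates (kidney_RF is an assumed object name):

# OOB error rate as a function of the number of trees grown
plot(kidney_RF$err.rate[, "OOB"], type = "l",
     xlab = "Number of trees", ylab = "OOB error rate",
     main = "OOB error vs. number of trees")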
In order to optimize the forest, ntree was lowered from 1000 to 300, mtry was raised from 4 to 5, and the sample size was increased from 100 to 150. The error output shows 5.00% OOB, 7.35% for class 0, and 3.57% for class 1. The class 1 error is therefore reduced from 4.02% (higher sensitivity), which is the intention of the optimization procedure.
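A sketch of the re-fit with the tuned settings; the ntree, mtry, and sampsize values come from the description above, while nodesize and importance are carried over from the first call shown earlier:

library(randomForest)

set.seed(1)  # illustrative seed
kidney_RF_2 <- randomForest(as.factor(classification) ~ .,
                            data = kidney_train,
                            ntree = 300,     # reduced from 1000
                            mtry = 5,        # raised from 4
                            sampsize = 150,  # increased from 100
                            nodesize = 10,
                            importance = TRUE)
kidney_RF_2$confusion  # compare class errors with the first model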
##     0   1 class.error
## 0 126  10  0.07352941
## 1   9 215  0.04017857
##
##     0   1 class.error
## 0 126  10  0.07352941
## 1   8 216  0.03571429
The class 1 error was reduced from 4.02% to 3.57%, indicating an improvement in sensitivity (TPR). This demonstrates that the second model, with a smaller ntree and larger mtry, is better at correctly identifying the CKD observations (class 1). In the future, a smaller ntree and larger mtry would be recommended to optimize the model's sensitivity.
##
##    0  1
##  0 0 14
##  1 0 26
## [1] 0.35
There is a 35% error rate on the test set, so the model could still use some improvement. Below we use variable importance, the mtry value, and the size of the forest to further optimize the classifier.
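A sketch of how the test-set predictions and the 35% error rate can be obtained; kidney_test is an assumed name for the held-out data:

# Predict on the held-out test set with the optimized forest
test_pred <- predict(kidney_RF_2, newdata = kidney_test)

# Confusion table of predictions vs. actual labels, and the error rate
table(Prediction = test_pred, Actual = kidney_test$classification)
mean(test_pred != kidney_test$classification)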
A visualization of variable importance (from the improved/optimized model) is shown. The most important variables in predicting class 1 versus class 0 in the optimized model are hypertension, blood urea, and blood glucose random. These are the same variables that ranked highest in the importance table before optimization, which reinforces that they are the strongest predictors of chronic kidney disease in this dataset.
## mtry = 2 OOB error = 18.06%
## Searching left ...
## mtry = 1 OOB error = 17.78%
## 0.01538462 0.05
## Searching right ...
## mtry = 4 OOB error = 17.5%
## 0.03076923 0.05
## mtry OOBError
## 1.OOB 1 0.1777778
## 2.OOB 2 0.1805556
## 4.OOB 4 0.1750000
Based on the OOB error table generated, the OOB error is minimized at mtry = 4 among the values tested, so it is recommended to use an mtry of about 4 or 5 in the next iteration of the model.
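For reference, a sketch of the kind of tuneRF() call that produces this style of search output; the data arguments are illustrative, and the higher OOB figures above suggest the original search may have been run on a different subset:

library(randomForest)

set.seed(1)
# Search for the mtry value that minimizes OOB error
tuneRF(x = kidney_train[, setdiff(names(kidney_train), "classification")],
       y = as.factor(kidney_train$classification),
       mtryStart = 2,   # starting value for the search
       ntreeTry = 300,  # trees grown per trial
       stepFactor = 2,  # multiply/divide mtry by this at each step
       improve = 0.05,  # minimum relative improvement to continue searching
       trace = TRUE, plot = TRUE)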
#### Size of the forest ####
# If you want to look at the size of the trees in the random forest,
# or how many nodes each tree has, you can use the treesize() function.
treesize(kidney_RF_2,      #<- the randomForest object to use
         terminal = FALSE) #<- TRUE counts only terminal nodes; FALSE counts all nodes
## [1] 19 27 21 21 25 29 21 21 15 27 15 27 21 19 21 29 33 29 23 21 31 27 23 25 19
## [26] 15 21 29 25 19 31 23 17 29 27 23 27 35 25 27 27 23 21 25 29 21 21 27 29 27
## [51] 23 25 27 23 19 39 21 35 23 31 29 35 27 19 15 27 23 27 31 31 19 27 31 37 33
## [76] 31 23 37 27 23 33 27 31 33 33 19 29 37 25 23 27 33 23 25 27 23 21 21 35 29
## [101] 17 27 23 23 23 23 31 21 27 25 21 37 27 21 35 31 25 19 31 27 19 31 29 37 23
## [126] 31 15 37 29 31 21 25 19 13 31 21 21 17 29 29 17 29 27 27 19 19 29 27 27 23
## [151] 27 27 27 33 23 19 25 17 23 27 23 29 19 25 25 33 27 23 27 27 9 33 27 33 23
## [176] 23 19 17 25 21 27 27 31 17 23 25 23 33 25 25 31 31 29 23 23 27 23 19 29 25
## [201] 31 25 29 25 25 35 23 25 21 25 21 35 21 27 23 25 23 21 21 19 19 31 29 17 27
## [226] 17 27 23 33 25 31 21 35 21 23 29 29 31 23 21 21 21 25 17 31 29 29 29 25 23
## [251] 23 21 17 19 21 27 25 27 19 17 25 21 35 19 27 19 25 27 27 23 27 27 13 33 27
## [276] 29 27 23 27 25 33 29 25 27 17 19 31 23 31 25 19 23 33 21 29 25 29 31 13 25
# You can use the treesize() function to create a histogram for a visual presentation.
hist(treesize(kidney_RF_2,
              terminal = TRUE),
     main = "Tree Size")
#dev.off()
According to the peak of the histogram created, the most common tree size is about 12-13 terminal nodes per tree.