Background

Dataset

  • The dataset can be found here.
  • The data was collected over a 2-month period in India and contains 25 features
  • The variable of interest is classification, which states whether a patient has chronic kidney disease or not
  • The 25 features are:
    1. age
    2. blood pressure
    3. specific gravity
    4. albumin levels
    5. sugar levels
    6. red blood cells
    7. pus cell count
    8. pus cell clumps
    9. bacteria
    10. blood glucose random
    11. blood urea
    12. serum creatinine
    13. sodium
    14. potassium
    15. hemoglobin
    16. packed cell volume
    17. white blood cell count
    18. red blood cell count
    19. hypertension
    20. diabetes mellitus
    21. coronary artery disease
    22. appetite
    23. pedal edema
    24. anemia
    25. classification

Motivation

  • Chronic kidney disease is a condition that means a patient is gradually losing kidney function
  • Kidneys filter waste and excess fluid from your blood, which are then excreted in your urine
  • When chronic kidney disease becomes more advanced, dangerous levels of electrolytes, waste and fluid build up in the body
  • Chronic kidney disease can lead to complete kidney failure, which is fatal without artificial filtering (dialysis) or a kidney transplant
  • Treatment for chronic kidney disease only focuses on slowing the progression of kidney damage; it is not a curable disease
  • Chronic kidney disease is dangerous even before it reaches a fatal stage, and its other complications include:
    • Heart and blood vessel disease
    • A sudden rise in potassium levels in your blood, which could cause heart failure
    • Weak bones
    • Decreased immune response
    • Damage to your central nervous system
    • Pregnancy complications for both the mother and the baby
  • Since the only treatment is slowing kidney damage, finding out whether one has chronic kidney disease early on is crucial

Question/Problem

  • Since it is so critical to detect chronic kidney disease early on, we need to gain an understanding of the risk factors that make someone more likely to develop chronic kidney disease. This makes early preventative treatment possible.

Methods

Our Plan:
* We want to create a KNN model that predicts if a person has chronic kidney disease or not
* We want to know what features best predict if a person will have chronic kidney disease or not
* We aim to optimize the classifier in order to identify if a patient has chronic kidney disease or not
* Knowing the most prominent risk factors will allow people with those risk factors to get tested for chronic kidney disease more often, so the disease is likely to be caught earlier, which can be life-saving
* We believe blood pressure, potassium, diabetes mellitus and age will be integral in predicting if a patient has chronic kidney disease or not
* The metric we want to optimize is sensitivity: in health-care data the true positive rate must be maximized so that as few true cases of kidney disease as possible are missed

Classifier Plan

  • We plan to use a KNN model, and later a Random Forest, to perform classification on this data set
  • First, we will standardize all of the variables
  • Second, we will drop all highly correlated variables
  • Finally, we will run a KNN model on the remaining features (a preprocessing sketch follows below)
  • To improve on KNN, we will then fit a Random Forest and tune its parameters to increase the model’s sensitivity
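As a rough illustration of the standardize-then-drop-correlated-features steps, here is a minimal sketch assuming the cleaned data lives in a data frame called kidney with the target column classification (the object names and the 0.9 cutoff are illustrative, not the exact project code):

library(caret)  # for findCorrelation()

# Standardize every predictor (center and scale), leaving the target untouched.
# (Assumes all predictors are numeric, with binary variables coded 0/1.)
predictors <- subset(kidney, select = -classification)
scaled     <- as.data.frame(scale(predictors))

# Drop one variable from each highly correlated pair (|r| > 0.9 used as an example cutoff).
to_drop <- findCorrelation(cor(scaled), cutoff = 0.9)
if (length(to_drop) > 0) scaled <- scaled[, -to_drop]

kidney_prepped <- cbind(scaled, classification = kidney$classification)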

EDA

Taking a closer look at the target variable, we can see that our dataset is not balanced: there are a little over 100 more cases of chronic kidney disease than healthy patients.


Older people are more susceptible to diseases in general, and the same holds true for chronic kidney disease. Looking at the age distribution of our data, the majority of the patients are in the middle-to-old age range.
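The class-balance and age plots referenced here could be produced roughly as follows (a sketch assuming the data frame kidney with columns classification and age; not the exact plotting code used for the figures):

library(ggplot2)

# Class balance: a little over 100 more CKD cases than healthy cases.
ggplot(kidney, aes(x = as.factor(classification))) +
  geom_bar() +
  labs(x = "classification (1 = ckd, 0 = notckd)", y = "count")

# Age distribution: most patients fall in the middle-to-old age range.
ggplot(kidney, aes(x = age)) +
  geom_histogram(binwidth = 5) +
  labs(x = "age", y = "count")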

People often have both diabetes and chronic kidney disease, so looking closer at the diabetes metric will help us understand our data better.

## 
##   0   1 
## 150 250
##     1 
## 0.625
##     0 
## 0.375

The base split of the data shows that 62.5% of patients have chronic kidney disease (ckd) and 37.5% do not (notckd).

We have 320 training points and 80 testing data points

Train the classifier using k = 3
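A minimal sketch of this step, assuming the standardized features and labels have already been split into the 320/80 train/test sets (the kidney_* object names are illustrative; subsetted2_3NN matches the prediction object in the output below):

library(class)

# 3-nearest-neighbour classification on the standardized features.
subsetted2_3NN <- knn(train = kidney_train_x,   # 320 training points
                      test  = kidney_test_x,    # 80 test points
                      cl    = kidney_train_y,   # training labels (0/1)
                      k     = 3,
                      prob  = TRUE)             # keep vote proportions for thresholding later

# Confusion matrix, accuracy, and sensitivity for the positive (ckd) class.
conf <- table(subsetted2_3NN, kidney_test_y)
sum(diag(conf)) / sum(conf)          # accuracy
conf["1", "1"] / sum(conf[, "1"])    # sensitivity = TP / (TP + FN)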

##               
## subsetted2_3NN  0  1
##              0 33  1
##              1  0 46
## [1] 33 46
## [1] 0.9875
## [1] 0.9787234

At k = 3, accuracy = 0.9875 and sensitivity = 0.979. In the context of this project, it is most important to optimize the sensitivity (TPR) so that as many true chronic kidney disease cases as possible are diagnosed.

Confusion Matrix Output and Observations

## Confusion Matrix and Statistics
## 
##           Actual
## Prediction  0  1
##          0 33  1
##          1  0 46
##                                           
##                Accuracy : 0.9875          
##                  95% CI : (0.9323, 0.9997)
##     No Information Rate : 0.5875          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9743          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9787          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9706          
##              Prevalence : 0.5875          
##          Detection Rate : 0.5750          
##    Detection Prevalence : 0.5750          
##       Balanced Accuracy : 0.9894          
##                                           
##        'Positive' Class : 1               
## 

Specificity = 1.000, possibly due to the limited size of the dataset.

Re-analyzing at a candidate “optimal” k-value of k = 5

At k values lower than k = 3, there is a risk of overfitting and merely memorizing the data, producing inflated accuracy and sensitivity metrics. At k values greater than k = 3, sensitivity increases only slightly. Given the computational cost of running KNN as an ML classifier, it is best to reduce the complexity of the model by keeping the number of neighbors at k = 3; there is no significant improvement in sensitivity at higher hyperparameter values. Thus, k = 3 is the optimal value for the classifier.

Model Evaluation

Adjustment of Thresholds

As the k hyperparameter increases, the accuracy metric declines. This trend reinforces our decision to keep k = 3. Next, at the 50% classification threshold, accuracy = 0.9 while sensitivity = 0.84. While these metrics are decent, it was found that lowering the classification threshold to approximately 30% decreased accuracy to 0.8875 and increased sensitivity to 0.9. As sensitivity is the metric we want to optimize, a lower threshold is favorable: more observations are classified as 1 (chronic kidney disease), so more of the truly positive cases are captured and the number of false negatives (CKD patients labeled as non-CKD) is reduced, at the cost of a few extra false positives.
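One way to implement this threshold adjustment, assuming the vote proportions were kept with prob = TRUE in the KNN call above (a sketch with illustrative object names):

# Convert the winning-class vote proportion into P(class = 1) for each test point.
votes  <- attr(subsetted2_3NN, "prob")
prob_1 <- ifelse(subsetted2_3NN == "1", votes, 1 - votes)

# Re-classify with a lower cutoff: more points get labelled 1 (ckd), raising sensitivity.
threshold <- 0.30
pred_30   <- ifelse(prob_1 >= threshold, "1", "0")

table(Prediction = pred_30, Actual = kidney_test_y)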

ROC Curves/AUC

## [[1]]
## [1] 0.9466667

The area beneath the ROC curve is 94.67%. This is a high AUC value, which means the classifier separates the two classes well across thresholds and can therefore be tuned toward high sensitivity (true positive rate), the desired metric. According to the curve produced, the TPR levels off at about 0.85, so that is roughly the sensitivity attainable before the false positive rate starts to climb.
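The list-style AUC output above matches what ROCR returns; a sketch of how the curve and AUC could be computed (object names are illustrative):

library(ROCR)

# Build a prediction object from P(class = 1) and the true test labels.
pred_obj <- prediction(prob_1, kidney_test_y)

# ROC curve: true positive rate against false positive rate.
roc_perf <- performance(pred_obj, measure = "tpr", x.measure = "fpr")
plot(roc_perf, colorize = TRUE)

# Area under the curve, returned as a list (hence the [[1]] in the output above).
performance(pred_obj, measure = "auc")@y.values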

Log Loss

## [1] 4.900803

Log Loss Interpretation

A high log loss indicates that when the model classifies an observation incorrectly, it does so with high confidence. The log loss calculated is 4.9, which is relatively high. At a slightly higher hyperparameter value the log loss may improve. To improve the model beyond altering the hyperparameter, the Random Forest method was employed. Additionally, with such a small data set (400 rows), a handful of confidently wrong predictions can heavily skew the log loss.
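For reference, the value above is the kind of output MLmetrics::LogLoss produces; a sketch of the call (object names illustrative):

library(MLmetrics)

# Log loss compares the predicted probability of ckd against the true 0/1 label;
# confident but wrong predictions (probabilities near 0 or 1) are penalized heavily.
LogLoss(y_pred = prob_1,
        y_true = as.numeric(as.character(kidney_test_y)))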

Random Forest Method

The Random Forest method uses bootstrapping (sampling with replacement) to grow many classification trees, which reduces over-fitting of the model. Aggregating the votes of many randomized trees smooths out the variance of any single tree, which on its own may merely memorize the training data.

- “ntree” is the number of trees used in the forest
- “mtry” is the number of variables considered at each split in the tree

Training and Testing Split

Build Random Forest

- ntree = 1000, mtry = 4 (the initial values used)
- 5.28% OOB error, with 7.35% class 0 error and 4.02% class 1 error

## 
## Call:
##  randomForest(formula = as.factor(classification) ~ ., data = kidney_train,      ntree = 1000, mtry = 4, replace = TRUE, sampsize = 100, nodesize = 10,      importance = TRUE, proximity = FALSE, norm.votes = TRUE,      do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 4
## 
##         OOB estimate of  error rate: 5.28%
## Confusion matrix:
##     0   1 class.error
## 0 126  10  0.07352941
## 1   9 215  0.04017857
## 
##   0   1 
## 150 250
## [1] 0.9469231

- Class counts: 150 observations belong to the negative class (notckd) and 250 to the positive class (ckd)
- Accuracy = 0.9469; the accuracy of the initial random forest is relatively high

Variable Importance (prior to optimization)

##                 0             1 MeanDecreaseAccuracy MeanDecreaseGini
## age   0.005591822  3.250976e-03          0.004137421       1.70282740
## bp    0.004726589  1.292696e-03          0.002598743       0.57093410
## su    0.043264080  4.870567e-04          0.016656058       1.61327361
## rbc   0.015383129  3.254801e-04          0.006006388       0.52878105
## pc    0.028376827  1.912111e-05          0.010706415       0.87958306
## pcc   0.006626822 -8.663545e-05          0.002440321       0.20537986
## ba    0.004678200  9.878902e-05          0.001828676       0.25887168
## bgr   0.162819987  1.014681e-02          0.067650398       6.97336165
## bu    0.165655583  1.101810e-02          0.069331175       6.59600118
## sod   0.081477548  1.184088e-02          0.038118792       3.98087739
## pot   0.052474240  9.382889e-03          0.025640086       2.75674215
## wc    0.069731726  1.250578e-02          0.034175240       3.43877696
## htn   0.211345131  2.453555e-02          0.094983285       8.07006473
## cad   0.002291292 -1.586216e-04          0.000768000       0.08679487
## appet 0.038740491 -8.126450e-04          0.014040164       1.10107257
## pe    0.033894053 -3.187137e-04          0.012590950       1.06303313
## ane   0.015653307 -6.141705e-04          0.005532842       0.52238923

By both importance measures, the top variables for classifying whether a patient has chronic kidney disease are hypertension (htn), blood glucose random (bgr), and blood urea (bu). The “Mean Decrease Accuracy” measure records how much classification accuracy drops when the corresponding variable is removed (permuted); for example, without age the model would be about 0.4% less accurate. The “Mean Decrease Gini” measure records how much that variable’s splits reduce node impurity (the Gini index) across the forest.
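The table above is the kind of matrix importance() returns; a sketch of extracting and plotting it (assuming the fitted forest from the call above is stored as kidney_RF, an illustrative name):

library(randomForest)

# Importance measures for the fitted forest, sorted by mean decrease in accuracy.
imp <- importance(kidney_RF, type = 1)   # type = 1: mean decrease in accuracy
imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]

# Dot chart of both importance measures for a quick visual ranking.
varImpPlot(kidney_RF, main = "Variable importance (initial model)")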

Visualizing the Addition of Trees to Random Forest


As illustrated, the error drops considerably as more trees are added, with a sharper dip around 650-700 trees. This makes sense, as averaging over a larger number of trees protects against over-fitting and drives the error down.
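The error trace plotted here can be pulled directly from the forest object's err.rate matrix (a sketch; kidney_RF is the assumed name of the fitted forest):

library(ggplot2)

# err.rate holds the cumulative OOB and per-class error after each additional tree.
err <- as.data.frame(kidney_RF$err.rate)
err$trees <- seq_len(nrow(err))

ggplot(err, aes(x = trees, y = OOB)) +
  geom_line() +
  labs(x = "number of trees", y = "OOB error rate",
       title = "Error as trees are added to the forest")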

Optimizing the Random Forest Model (Increasing Sensitivity for Positive Class)

In order to optimize the forest, ntree was lowered from 1000 to 300, mtry was raised from 4 to 5, and the sample size was increased from 100 to 150. The error output shows 5.00% OOB error, with 7.35% class 0 error and 3.57% class 1 error. The class 1 classification error is therefore reduced from 4.02% (higher sensitivity), which is the intention of the optimization procedure; a sketch of the tuned call follows.
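A sketch of the tuned fit with the parameter values described above (other arguments mirror the initial call shown earlier; the seed is illustrative):

library(randomForest)

set.seed(1)  # illustrative seed, for reproducibility only
kidney_RF_2 <- randomForest(as.factor(classification) ~ .,
                            data       = kidney_train,
                            ntree      = 300,   # lowered from 1000
                            mtry       = 5,     # raised from 4
                            sampsize   = 150,   # increased from 100
                            replace    = TRUE,
                            nodesize   = 10,
                            importance = TRUE)

kidney_RF_2$confusion   # class 1 error drops to about 3.57%, i.e. higher sensitivity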

Comparison of Models

##     0   1 class.error
## 0 126  10  0.07352941
## 1   9 215  0.04017857
##     0   1 class.error
## 0 126  10  0.07352941
## 1   8 216  0.03571429

The class 1 error was reduced from 4.02% (first matrix, initial model) to 3.57% (second matrix, optimized model), indicating improved sensitivity (TPR). This demonstrates that the second model, with a smaller ntree and larger mtry, is better at correctly identifying the CKD observations (class 1). In the future, a smaller ntree and larger mtry would be recommended to optimize the model’s sensitivity.

Test set Error Rates

##    
##      0  1
##   0  0 14
##   1  0 26
## [1] 0.35

There is a 35% error rate on the test set, so the model could still use some improvement. Below, we use variable importance, the mtry value, and the size of the forest to optimize the classifier further.
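The test-set table above can be reproduced along these lines (a sketch; kidney_test is the assumed name of the held-out set):

# Predict on the held-out test set and compute the misclassification rate.
test_pred <- predict(kidney_RF_2, newdata = kidney_test)
conf_test <- table(kidney_test$classification, test_pred)
conf_test
1 - sum(diag(conf_test)) / sum(conf_test)   # the 0.35 error rate reported above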

Further Improving the Model

Variable Importance in Optimized Model


A visualization of variable importance (from the improved/optimized model) is shown. The most important variables in predicting class 1 versus class 0 in the optimized model are hypertension, blood urea, and blood glucose random. These are the same variables that top the importance ranking before optimization, which strengthens our confidence in them as the key predictors.

Finding the Optimal “mtry” value (number of variables used for each split)

## mtry = 2  OOB error = 18.06% 
## Searching left ...
## mtry = 1     OOB error = 17.78% 
## 0.01538462 0.05 
## Searching right ...
## mtry = 4     OOB error = 17.5% 
## 0.03076923 0.05
##       mtry  OOBError
## 1.OOB    1 0.1777778
## 2.OOB    2 0.1805556
## 4.OOB    4 0.1750000

Based on the OOB error table generated, the OOB error is minimized at mtry = 4 among the values tried, so it is recommended to use 4 or 5 in the next iteration of the model.
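The search printed above is the style of output produced by tuneRF(); a sketch of the call (the predictor/label object names and the ntreeTry value are illustrative):

library(randomForest)

# Start at mtry = 2, halve/double at each step, and keep moving while the relative
# improvement in OOB error exceeds 5%.
tuned <- tuneRF(x = kidney_train_x,             # predictors only (illustrative name)
                y = as.factor(kidney_train_y),  # 0/1 labels (illustrative name)
                mtryStart  = 2,
                ntreeTry   = 300,
                stepFactor = 2,
                improve    = 0.05,
                trace      = TRUE,
                plot       = TRUE)
tuned   # the OOB-error table printed above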

Finding Optimal Size of Forest

#### Size of the forest ####

# If you want to look at the size of the trees in the random forest, 
# or how many nodes each tree has, you can use the treesize() function.
treesize(kidney_RF_2,    #<- the randomForest object to use
         terminal = FALSE)  #<- when TRUE, only the terminal nodes are counted, when FALSE, all nodes are counted
##   [1] 19 27 21 21 25 29 21 21 15 27 15 27 21 19 21 29 33 29 23 21 31 27 23 25 19
##  [26] 15 21 29 25 19 31 23 17 29 27 23 27 35 25 27 27 23 21 25 29 21 21 27 29 27
##  [51] 23 25 27 23 19 39 21 35 23 31 29 35 27 19 15 27 23 27 31 31 19 27 31 37 33
##  [76] 31 23 37 27 23 33 27 31 33 33 19 29 37 25 23 27 33 23 25 27 23 21 21 35 29
## [101] 17 27 23 23 23 23 31 21 27 25 21 37 27 21 35 31 25 19 31 27 19 31 29 37 23
## [126] 31 15 37 29 31 21 25 19 13 31 21 21 17 29 29 17 29 27 27 19 19 29 27 27 23
## [151] 27 27 27 33 23 19 25 17 23 27 23 29 19 25 25 33 27 23 27 27  9 33 27 33 23
## [176] 23 19 17 25 21 27 27 31 17 23 25 23 33 25 25 31 31 29 23 23 27 23 19 29 25
## [201] 31 25 29 25 25 35 23 25 21 25 21 35 21 27 23 25 23 21 21 19 19 31 29 17 27
## [226] 17 27 23 33 25 31 21 35 21 23 29 29 31 23 21 21 21 25 17 31 29 29 29 25 23
## [251] 23 21 17 19 21 27 25 27 19 17 25 21 35 19 27 19 25 27 27 23 27 27 13 33 27
## [276] 29 27 23 27 25 33 29 25 27 17 19 31 23 31 25 19 23 33 21 29 25 29 31 13 25
# You can use the treesize() function to create a histogram for a visual presentation.
hist(treesize(kidney_RF_2,
              terminal = TRUE), main="Tree Size")

#dev.off()

It seems that the optimal size will be about 12-13 terminal nodes per tree, according to the peak of the histogram created.

Conclusion

  • Drawing upon the conclusions made at each stage of the project, the most important takeaways are:
  1. k = 3 is the optimal hyperparameter to use for the KNN classifier
  2. The model evaluation of the KNN classifier motivated the use of Random Forest to produce a model further optimized for sensitivity
  3. ntree = 300, mtry = 5, and sampsize = 150 were the ideal parameters to maximize the true positive rate within the random forest
  4. To further optimize the random forest for sensitivity in the future, the best tree size is estimated to be about 12 terminal nodes and the best mtry value is about 5
  5. For the optimized Random Forest model, the most important variables in classifying whether a patient has chronic kidney disease are hypertension, blood urea, and blood glucose random. When the model is maximized for sensitivity, these are the variables doctors can use to most effectively predict whether a patient will develop chronic kidney disease

Limitations

  • Over half of the rows in the dataset had NAs, so we had to impute values using column averages and default healthy values for categorical variables, which is not necessarily accurate
  • Upon standardizing prior to using KNN, many of the variables standardized were binary whereas many others were numeric; the inconsistency of types made standardizing difficult
  • The dataset is small, so the specificity of 1.000 in the KNN model is most likely an artifact of its size
  • Also due to the small size of the data set (400 rows), the log loss is relatively high because a few observations can heavily skew it

Future Work

  • Run the model using a more robust dataset with fewer NAs, and hence more accurate data
  • Run the model using a dataset that is larger, which will produce more accurate logloss values
  • Find other ways to standardize the binary variables prior to using KNN
  • Optimize the random forest further using the recommendations concluded (for mtry and tree size)