[ source files available on GitHub ]

PRELIMINARIES

Libraries needed for data processing and plotting:

library("dplyr")
library("magrittr")
library("ggplot2")

library("caTools")
library("ROCR")

ABOUT THE DATA

The objective was to assess the quality of health care. To do this, we used a large health insurance claims database and randomly selected 131 diabetes patients.

An expert physician reviewed the claims, wrote descriptive notes, and rated the quality of care on a two-point scale: poor or good.

Variable Extraction

  • Dependent Variable
    • Quality of care, a categorical variable (poor or good).
  • Independent Variables (examples extracted from the physician's notes)
    • Ongoing use of narcotics.
    • Only on Avandia, not a good first-choice drug.
    • Had regular visits, mammogram, and immunizations.
    • Was given home testing supplies.

LOADING THE DATA

Read the dataset quality.csv into a data frame:

quality <- read.csv("data/quality.csv")

str(quality)
## 'data.frame':    131 obs. of  14 variables:
##  $ MemberID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ InpatientDays       : int  0 1 0 0 8 2 16 2 2 4 ...
##  $ ERVisits            : int  0 1 0 1 2 0 1 0 1 2 ...
##  $ OfficeVisits        : int  18 6 5 19 19 9 8 8 4 0 ...
##  $ Narcotics           : int  1 1 3 0 3 2 1 0 3 2 ...
##  $ DaysSinceLastERVisit: num  731 411 731 158 449 ...
##  $ Pain                : int  10 0 10 34 10 6 4 5 5 2 ...
##  $ TotalVisits         : int  18 8 5 20 29 11 25 10 7 6 ...
##  $ ProviderCount       : int  21 27 16 14 24 40 19 11 28 21 ...
##  $ MedicalClaims       : int  93 19 27 59 51 53 40 28 20 17 ...
##  $ ClaimLines          : int  222 115 148 242 204 156 261 87 98 66 ...
##  $ StartedOnCombination: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ AcuteDrugGapSmall   : int  0 1 5 0 0 4 0 0 0 0 ...
##  $ PoorCare            : int  0 0 0 0 0 1 0 0 1 0 ...

The variables in quality.csv are shown in the str() output above. Let’s plot two of the independent variables, colored by the quality of care:

# mycolors <- c("blue2", "red2")
mycolors <- c("forestgreen", "firebrick2")
ggplot(data = quality, aes(x = OfficeVisits, y = Narcotics)) + theme_bw() + 
    scale_colour_manual(values = mycolors) +
    xlab("Number of Office Visits") + 
    ylab("Number of Narcotics Prescribed") + 
    geom_point(aes(col = factor(PoorCare), alpha = 0.5), pch = 16, size = 4.0)

This plot shows two of our independent variables, the number of office visits and the number of narcotics prescribed.

Each point is an observation, i.e. a patient in our data set.
The red points are patients who received poor care, and the green points are patients who received good care. It looks like more office visits and more narcotics tend to correspond to poor care.

Model Healthcare Quality

Let’s see if we can build a logistic regression model to better predict poor care.

Baseline Model

tmp <- table(quality$PoorCare)
tmp
## 
##  0  1 
## 98 33

In a classification problem, a standard baseline method is to just predict the most frequent outcome for all observations. Since good care is more common than poor care, in this case, we would predict that all patients are receiving good care.

If we did this, we would get 98 out of the 131 observations correct, or have an accuracy of about 75%. This is the accuracy of the baseline model. This is what we’ll try to beat with our logistic regression model.
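
This baseline accuracy can be checked directly from the outcome table above; a quick sketch, reusing the tmp table created a moment ago:

# Baseline accuracy: predict the most frequent outcome (good care) for everyone
max(tmp) / sum(tmp)   # = 98 / 131, about 0.748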

Preparing data for modeling: Create training and testing sets

Randomly split the data by creating a logical split vector:

# for Reproducibility
set.seed(88)
split <- sample.split(quality$PoorCare, SplitRatio = 0.75)

The function sample.split() randomly splits the data, while making sure that the outcome variable is well-balanced in each piece. We saw earlier that about 75% of our patients are receiving good care. This function makes sure that in our training set, 75% of our patients are receiving good care and in our testing set 75% of our patients are receiving good care.
We want to do this so that our test set is representative of our training set.

qualityTrain <- subset(quality, split == TRUE)

qualityTest <- subset(quality, split == FALSE)
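
To confirm that the split kept the outcome balanced in both pieces, a quick sanity check (not part of the original output):

# Proportion of good care (0) vs. poor care (1) in each piece; both should be roughly 75% / 25%
prop.table(table(qualityTrain$PoorCare))
prop.table(table(qualityTest$PoorCare))

# Sizes of the two pieces
nrow(qualityTrain)
nrow(qualityTest)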

Logistic Regression Model

As independent variables we use OfficeVisits and Narcotics.

QualityLog <- glm(PoorCare ~ OfficeVisits + Narcotics, data = qualityTrain, family = binomial)

The option family = binomial tells glm() that we want a logistic regression.

Let’s look at the results:

summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 0.000000433 ***
## OfficeVisits  0.08212    0.03055   2.688     0.00718 ** 
## Narcotics     0.07630    0.03205   2.381     0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4

What we want to focus on is the coefficients table. This gives the estimate values for the coefficients, or the betas, for our logistic regression model.
We see here that the coefficients for OfficeVisits and Narcotics are both positive, which means that higher values of these two variables are indicative of poor care, as we suspected from looking at the data.

The last thing we want to look at in the output is the AIC value. This is a measure of the quality of the model and is like Adjusted R-squared in that it accounts for the number of variables used compared to the number of observations.

Unfortunately, it can only be compared between models on the same data set. But it provides a means for model selection.
The preferred model is the one with the minimum AIC.
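
For illustration, AIC values of candidate models fit on the same training data can be compared directly. The extra variable added below (ProviderCount) is an arbitrary choice used only to show the mechanics, not a modeling recommendation:

# Fit an alternative model with one additional predictor (illustrative choice)
QualityLog2 <- glm(PoorCare ~ OfficeVisits + Narcotics + ProviderCount,
                   data = qualityTrain, family = binomial)

# The model with the lower AIC is preferred among these candidates
AIC(QualityLog, QualityLog2)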

Make predictions on training set

predictTrain <- predict(QualityLog, type="response")

Analyze predictions.
Since we’re expecting probabilities, all of the numbers should be between zero and one.
And we see that the minimum value is about 0.07 and the maximum value is 0.98.

summary(predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460

Let’s see if we are predicting higher probabilities for the actual poor care cases, as we expect. To do this, use the tapply() function, giving as arguments predictTrain, then qualityTrain$PoorCare, and then mean (the function to be applied). This computes the average prediction for each of the true outcomes.

tapply(predictTrain, qualityTrain$PoorCare, mean)
##         0         1 
## 0.1894512 0.4392246

Probability to Prediction: Threshold Value

The outcome of a logistic regression model is a probability. Often, we want to make a binary prediction.

  • Did this patient receive poor care or good care?

We can do this using a threshold value t; a short code sketch follows the rule below.

  • If \(P(PoorCare = 1) \ge t\), predict poor quality.
  • If \(P(PoorCare = 1) < t\), predict good quality.
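
In R, this rule can be applied with ifelse(); a minimal sketch using a threshold of 0.5 (the threshold value and the object names thresh and predictedClass are illustrative):

# Convert predicted probabilities into binary class predictions
thresh <- 0.5
predictedClass <- ifelse(predictTrain >= thresh, 1, 0)
table(predictedClass)   # counts of predicted good (0) vs. poor (1) care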

Two Types of Errors

There are two types of errors that a model can make:

  • ones where we predict 1, or poor care, but the actual outcome is 0, and
  • ones where we predict 0, or good care, but the actual outcome is 1.

What value should we pick for t?

The threshold value is often selected based on which errors are “better”:

  • If t is large, predict poor care rarely.
    • More errors where we say good care, but it is actually poor care.
    • Detects the patients who are receiving the worst care.
  • If t is small, predict good care rarely.
    • More errors where we say poor care, but it is actually good care.
    • Detects all patients who might be receiving poor care.
  • With no preference between the errors, select \(t = 0.5\).
    • Predicts the more likely outcome.

Compare actual outcomes to predicted outcomes using a confusion matrix (classification matrix):

                     Predicted = 0 (good care)   Predicted = 1 (poor care)
  Actual = 0 (good)  True Negatives (TN)         False Positives (FP)
  Actual = 1 (poor)  False Negatives (FN)        True Positives (TP)

This compares the actual outcomes to the predicted outcomes. The rows are labeled with the actual outcome, and the columns are labeled with the predicted outcome. Each entry of the table gives the number of data observations that fall into that category.

These are the two combinations that correspond to correct predictions:

  • The true negatives, or TN, are the observations that are actually good care and for which we predict good care.
  • The true positives, or TP, are the observations that are actually poor care and for which we predict poor care.

And two combinations that are bad:

  • The false positives, or FP, are the observations for which we predict poor care, but which are actually good care.
  • The false negatives, or FN, are the observations for which we predict good care, but which are actually poor care.

A couple of important definitions of outcome measures that help us determine what types of errors we are making:

  • sensitivity = \(\frac{TP}{TP + FN}\)

  • specificity = \(\frac{TN}{TN + FP}\)

Threshold vs. Sensitivity (Specificity)

  • A model with a higher threshold will have a lower sensitivity and a higher specificity.
  • A model with a lower threshold will have a higher sensitivity and a lower specificity.

Let’s compute some confusion matrices using different threshold values.

Threshold = 0.5

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.5)
tmp 
##    
##     FALSE TRUE
##   0    70    4
##   1    15   10
  • Sensitivity = 10 / 25 = 0.4
    Specificity = 70 / 74 = 0.95

Threshold = 0.7

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.7)
tmp
##    
##     FALSE TRUE
##   0    73    1
##   1    17    8
  • Sensitivity = 8 / 25 = 0.32
    Specificity = 73 / 74 = 0.99

  • By increasing the threshold, our sensitivity went down and our specificity went up.

Threshold = 0.2

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.2)
tmp
##    
##     FALSE TRUE
##   0    54   20
##   1     9   16
  • Sensitivity = 16 / 25 = 0.64
    Specificity = 54 / 74 = 0.73
  • So with the lower threshold, our sensitivity went up, and our specificity went down.
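
All three of these calculations can also be done programmatically. A small helper sketch (the function name sensSpec is my own; it assumes both predicted classes occur at the chosen threshold, so the table has both a FALSE and a TRUE column):

# Sensitivity and specificity from a 2 x 2 table of actual outcome vs. (predicted > t)
sensSpec <- function(actual, predicted, t) {
  cm <- table(actual, predicted > t)
  c(sensitivity = cm["1", "TRUE"]  / sum(cm["1", ]),
    specificity = cm["0", "FALSE"] / sum(cm["0", ]))
}

sensSpec(qualityTrain$PoorCare, predictTrain, 0.5)
sensSpec(qualityTrain$PoorCare, predictTrain, 0.7)
sensSpec(qualityTrain$PoorCare, predictTrain, 0.2)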

But which threshold should we pick?

Receiver Operating Characteristic (ROC) Curve

Picking a good threshold value is often challenging. A Receiver Operating Characteristic curve, or ROC curve, can help you decide which value of the threshold is best.

The ROC curve for our problem is shown below.

  • The sensitivity, or true positive rate of the model, is shown on the y-axis.
  • The false positive rate, or 1 minus the specificity, is given on the x-axis.

The line shows how these two outcome measures vary with different threshold values.

The ROC curve captures all thresholds simultaneously.

  • The higher the threshold, or closer to (0, 0), the higher the specificity and the lower the sensitivity.
  • The lower the threshold, or closer to (1, 1), the higher the sensitivity and lower the specificity.

ROC for our model

# Prediction function
ROCRpred <- prediction(predictTrain, qualityTrain$PoorCare)

The prediction() function takes two arguments.

  • The first is the predictions we made with our model, which we called predictTrain.
  • The second argument is the true outcomes of our data points, which in our case, is qualityTrain$PoorCare.
# Performance function
ROCRperf <- performance(ROCRpred, "tpr", "fpr")

# Plot ROC curve
# plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3)

# Add colors
# plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3, colorize = TRUE)

# Add threshold labels 
plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3, colorize = TRUE, 
     print.cutoffs.at = seq(0, 1, by = 0.1), text.adj = c(-0.2, 1.7))

  • At the point (0, 0.4), you’re correctly labeling about 40% of the poor care cases with a very small false positive rate.
  • At the point (0.6, 0.9), you’re correctly labeling about 90% of the poor care cases, but have a false positive rate of 60%.
  • In the middle, around (0.3, 0.8), you’re correctly labeling about 80% of the poor care cases, with a 30% false positive rate.

So which threshold value should you pick?

You should select the best threshold for the trade-off you want to make considering:

  • cost of failing to detect positives
  • costs of raising false alarms

So,

  • If you’re more concerned with having a high specificity or low false positive rate,
    • pick the threshold that maximizes the true positive rate while keeping the false positive rate really low.
    • A threshold around (0.1, 0.5) on this ROC curve looks like a good choice in this case.
  • On the other hand, if you’re more concerned with having a high sensitivity or high true positive rate,
    • pick a threshold that minimizes the false positive rate but has a very high true positive rate.
    • A threshold around (0.3, 0.8) looks like a good choice in this case.

INTERPRETING THE MODEL

Let us examine how to interpret the model we developed.

Multicollinearity

One of the things we should look out for is multicollinearity.
Multicollinearity occurs when the independent variables are correlated with one another, and this can confuse the estimated coefficients (the betas) of the model.

We can test for this by checking the correlations between the independent variables. If they are excessively high, there might be multicollinearity, and you may have to revisit the model.
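
A quick way to run this check for the two predictors in our model (a sketch; more variables can be added to the column list):

# Pairwise correlation of the independent variables used in the model
cor(qualityTrain[, c("OfficeVisits", "Narcotics")])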

We should also consider whether the signs of the model coefficients make sense. Is each coefficient positive or negative?
If the sign agrees with intuition, multicollinearity is probably not a problem; if intuition suggests a different sign, that may be a symptom of multicollinearity.

Significance (Area Under the Curve, AUC)

The next important element is significance.
So how do we interpret the results, and how do we understand whether we have a good model or not?

For that purpose, let’s take a look at what is called the Area Under the Curve, or AUC for short. The Area Under the Curve gives an absolute measure of the quality of prediction: in this particular case, 77.5%.
Given that the perfect score is 100%, this is like a B.

So the area under the curve gives an absolute measure of quality that is less affected by the choice of benchmark: it shows how accurate the model is in a more absolute sense.
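
The 77.5% figure can be reproduced from the ROCR objects built earlier; a short sketch that mirrors the test-set AUC computation shown later:

# Area under the ROC curve on the training set
aucTrain <- as.numeric(performance(ROCRpred, "auc")@y.values)
aucTrain   # about 0.775, i.e. 77.5%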

Other Outcome Measures (confusion matrix)

Beyond sensitivity and specificity, the confusion matrix yields (with N the total number of observations):

  • overall accuracy = \(\frac{TN + TP}{N}\)
  • overall error rate = \(\frac{FP + FN}{N}\)
  • false negative error rate = \(\frac{FN}{TP + FN}\)
  • false positive error rate = \(\frac{FP}{TN + FP}\)

Making Predictions Out Of Sample

Just like in linear regression, we want to make predictions on a test set to compute out-of-sample metrics. We set aside 25% of our data, i.e. 32 cases, as a test set, against which we can compare the predicted outcomes of the model that was trained on the other 75% of the data.

# Out-of-sample predictions and AUC on the test set
predictTest <- predict(QualityLog, type = "response", newdata = qualityTest)
ROCRpredTest <- prediction(predictTest, qualityTest$PoorCare)
auc <- as.numeric(performance(ROCRpredTest, "auc")@y.values)

The AUC of this model on the test data is 79.9%.

If we use a threshold value of 0.3, we get the following confusion matrix:

tmp <- table(qualityTest$PoorCare, predictTest > 0.3)
tmp 
##    
##     FALSE TRUE
##   0    19    5
##   1     2    6
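
From this table, the out-of-sample accuracy, sensitivity, and specificity follow directly; a quick sketch reusing the tmp table just printed:

# Overall accuracy: (19 + 6) / 32
sum(diag(tmp)) / sum(tmp)

# Sensitivity: 6 / 8
tmp["1", "TRUE"] / sum(tmp["1", ])

# Specificity: 19 / 24
tmp["0", "FALSE"] / sum(tmp["0", ])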

Comparison with Baseline Model

The baseline model would predict good care all the time. In that case, it would be correct 24 times out of 32, versus 25 times for our logistic regression model.
More importantly, always predicting good care does not capture the dynamics of what is happening, whereas the logistic regression model is far better at capturing these effects.

CONCLUSIONS

The Competitive Edge of Models

  • While humans can accurately analyze small amounts of information, models can handle much larger amounts of data
  • Models do not replace expert judgment
    • Experts can improve and refine the model
  • Models can integrate assessments of many experts into one final unbiased and unemotional prediction