[ source files available on GitHub ]

PRELIMINARIES

Libraries needed for data processing and plotting:

library("dplyr")
library("magrittr")
library("ggplot2")

library("caTools")
library("ROCR")

ABOUT THE DATA

The objective was to assess the quality of health care. To do this, we used a large health insurance claims database and randomly selected 131 diabetes patients.

An expert physician reviewed the claims, wrote descriptive notes, and rated the quality of care on a two-point scale: poor or good.

Variable Extraction

  • Dependent Variable
    • Quality of care, a categorical variable (poor or good).
  • Independent Variables (examples extracted from the physician's notes)
    • Ongoing use of narcotics.
    • Only on Avandia, not a good first-choice drug.
    • Had regular visits, mammogram, and immunizations.
    • Was given home testing supplies.

LOADING THE DATA

Read the dataset quality.csv into a data frame:

quality <- read.csv("data/quality.csv")

str(quality)
## 'data.frame':    131 obs. of  14 variables:
##  $ MemberID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ InpatientDays       : int  0 1 0 0 8 2 16 2 2 4 ...
##  $ ERVisits            : int  0 1 0 1 2 0 1 0 1 2 ...
##  $ OfficeVisits        : int  18 6 5 19 19 9 8 8 4 0 ...
##  $ Narcotics           : int  1 1 3 0 3 2 1 0 3 2 ...
##  $ DaysSinceLastERVisit: num  731 411 731 158 449 ...
##  $ Pain                : int  10 0 10 34 10 6 4 5 5 2 ...
##  $ TotalVisits         : int  18 8 5 20 29 11 25 10 7 6 ...
##  $ ProviderCount       : int  21 27 16 14 24 40 19 11 28 21 ...
##  $ MedicalClaims       : int  93 19 27 59 51 53 40 28 20 17 ...
##  $ ClaimLines          : int  222 115 148 242 204 156 261 87 98 66 ...
##  $ StartedOnCombination: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ AcuteDrugGapSmall   : int  0 1 5 0 0 4 0 0 0 0 ...
##  $ PoorCare            : int  0 0 0 0 0 1 0 0 1 0 ...

The variables in quality.csv are shown in the str() output above. Let’s plot two of the independent variables, colored by the quality of care:

# mycolors <- c("blue2", "red2")
mycolors <- c("forestgreen", "firebrick2")
ggplot(data = quality, aes(x = OfficeVisits, y = Narcotics)) + theme_bw() + 
    scale_colour_manual(values = mycolors) +
    xlab("Number of Office Visits") + 
    ylab("Number of Narcotics Prescribed") + 
    geom_point(aes(col = factor(PoorCare), alpha = 0.5), pch = 16, size = 4.0)

This plot shows two of our independent variables, the number of office visits and the number of narcotics prescribed.

Each point is an observation, i.e. a patient in our data set.
The red points are patients who received poor care, and the green points are patients who received good care. It looks like more office visits and more narcotics tend to correspond to poor care.

Model Healthcare Quality

Let’s see if we can build a logistic regression model to better predict poor care.

Baseline Model

tmp <- table(quality$PoorCare)
tmp
## 
##  0  1 
## 98 33

In a classification problem, a standard baseline method is to just predict the most frequent outcome for all observations. Since good care is more common than poor care, in this case, we would predict that all patients are receiving good care.

If we did this, we would get 98 out of the 131 observations correct, or have an accuracy of about 75%. This is the accuracy of the baseline model. This is what we’ll try to beat with our logistic regression model.
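
This baseline accuracy can be checked directly from the outcome table above; a quick sketch, reusing the tmp table created a moment ago:

# Baseline accuracy: predict the most frequent outcome (good care) for everyone
max(tmp) / sum(tmp)   # = 98 / 131, about 0.748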

Preparing data for modeling: Create training and testing sets

Randomly split the data by creating a logical split vector:

# for Reproducibility
set.seed(88)
split <- sample.split(quality$PoorCare, SplitRatio = 0.75)

The function sample.split() randomly splits the data, while making sure that the outcome variable is well-balanced in each piece. We saw earlier that about 75% of our patients are receiving good care. This function makes sure that in our training set, 75% of our patients are receiving good care and in our testing set 75% of our patients are receiving good care.
We want to do this so that our test set is representative of our training set.

qualityTrain <- subset(quality, split == TRUE)

qualityTest <- subset(quality, split == FALSE)
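
To confirm that the split kept the outcome balanced in both pieces, a quick sanity check (not part of the original output):

# Proportion of good care (0) vs. poor care (1) in each piece; both should be roughly 75% / 25%
prop.table(table(qualityTrain$PoorCare))
prop.table(table(qualityTest$PoorCare))

# Sizes of the two pieces
nrow(qualityTrain)
nrow(qualityTest)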

Logistic Regression Model

As independent variables we use OfficeVisits and Narcotics.

QualityLog <- glm(PoorCare ~ OfficeVisits + Narcotics, data = qualityTrain, family = binomial)

The option family = binomial tells glm() that we want a logistic regression.

Let’s look at the results:

summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 0.000000433 ***
## OfficeVisits  0.08212    0.03055   2.688     0.00718 ** 
## Narcotics     0.07630    0.03205   2.381     0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4

What we want to focus on is the coefficients table. This gives the estimate values for the coefficients, or the betas, for our logistic regression model.
We see here that the coefficients for OfficeVisits and Narcotics are both positive, which means that higher values of these two variables are indicative of poor care, as we suspected from looking at the data.

The last thing we want to look at in the output is the AIC value. This is a measure of the quality of the model and is like Adjusted R-squared in that it accounts for the number of variables used compared to the number of observations.

Unfortunately, it can only be compared between models on the same data set. But it provides a means for model selection.
The preferred model is the one with the minimum AIC.
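
For illustration, AIC values of candidate models fit on the same training data can be compared directly. The extra variable added below (ProviderCount) is an arbitrary choice used only to show the mechanics, not a modeling recommendation:

# Fit an alternative model with one additional predictor (illustrative choice)
QualityLog2 <- glm(PoorCare ~ OfficeVisits + Narcotics + ProviderCount,
                   data = qualityTrain, family = binomial)

# The model with the lower AIC is preferred among these candidates
AIC(QualityLog, QualityLog2)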

Make predictions on training set

predictTrain <- predict(QualityLog, type="response")

Analyze predictions.
Since we’re expecting probabilities, all of the numbers should be between zero and one.
And we see that the minimum value is about 0.07 and the maximum value is 0.98.

summary(predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460

Let’s see if we are predicting higher probabilities for the actual poor care cases, as we expect. To do this, use the tapply() function, giving as arguments predictTrain, then qualityTrain$PoorCare, and then mean (the function to be applied). This computes the average prediction for each of the true outcomes.

tapply(predictTrain, qualityTrain$PoorCare, mean)
##         0         1 
## 0.1894512 0.4392246

Probability to Prediction: Threshold Value

The outcome of a logistic regression model is a probability. Often, we want to make a binary prediction.

  • Did this patient receive poor care or good care?

We can do this using a threshold value t; a short code sketch follows the rule below.

  • If \(P(PoorCare = 1) \ge t\), predict poor quality.
  • If \(P(PoorCare = 1) < t\), predict good quality.
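
In R, this rule can be applied with ifelse(); a minimal sketch using a threshold of 0.5 (the threshold value and the object names thresh and predictedClass are illustrative):

# Convert predicted probabilities into binary class predictions
thresh <- 0.5
predictedClass <- ifelse(predictTrain >= thresh, 1, 0)
table(predictedClass)   # counts of predicted good (0) vs. poor (1) care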

Two Types of Errors

There are two types of errors that a model can make:

  • ones where we predict 1, or poor care, but the actual outcome is 0, and
  • ones where we predict 0, or good care, but the actual outcome is 1.

What value should we pick for t?

The threshold value is often selected based on which errors are “better”:

  • If t is large, predict poor care rarely.
    • More errors where we say good care, but it is actually poor care.
    • Detects the patients who are receiving the worst care.
  • If t is small, predict good care rarely.
    • More errors where we say poor care, but it is actually good care.
    • Detects all patients who might be receiving poor care.
  • With no preference between the errors, select \(t = 0.5\).
    • Predicts the more likely outcome.

Compare actual outcomes to predicted outcomes using a confusion matrix (classification matrix):

                     Predicted = 0 (good care)   Predicted = 1 (poor care)
  Actual = 0 (good)  True Negatives (TN)         False Positives (FP)
  Actual = 1 (poor)  False Negatives (FN)        True Positives (TP)

This compares the actual outcomes to the predicted outcomes. The rows are labeled with the actual outcome, and the columns are labeled with the predicted outcome. Each entry of the table gives the number of data observations that fall into that category.

These are the two combinations that correspond to correct predictions:

  • The true negatives, or TN, are the observations that are actually good care and for which we predict good care.
  • The true positives, or TP, are the observations that are actually poor care and for which we predict poor care.

And two combinations that are bad:

  • The false positives, or FP, are the observations for which we predict poor care, but which are actually good care.
  • The false negatives, or FN, are the observations for which we predict good care, but which are actually poor care.

A couple of important definitions of outcome measures that help us determine what types of errors we are making:

  • sensitivity = \(\frac{TP}{TP + FN}\)

  • specificity = \(\frac{TN}{TN + FP}\)

Threshold vs. Sensitivity (Specificity)

  • A model with a higher threshold will have a lower sensitivity and a higher specificity.
  • A model with a lower threshold will have a higher sensitivity and a lower specificity.

Let’s compute some confusion matrices using different threshold values.

Threshold = 0.5

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.5)
tmp 
##    
##     FALSE TRUE
##   0    70    4
##   1    15   10
  • Sensitivity = 10 / 25 = 0.4
    Specificity = 70 / 74 = 0.95

Threshold = 0.7

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.7)
tmp
##    
##     FALSE TRUE
##   0    73    1
##   1    17    8
  • Sensitivity = 8 / 25 = 0.32
    Specificity = 73 / 74 = 0.99

  • By increasing the threshold, our sensitivity went down and our specificity went up.

Threshold = 0.2

tmp <- table(qualityTrain$PoorCare, predictTrain > 0.2)
tmp
##    
##     FALSE TRUE
##   0    54   20
##   1     9   16
  • Sensitivity = 16 / 25 = 0.64
    Specificity = 54 / 74 = 0.73
  • So with the lower threshold, our sensitivity went up, and our specificity went down.
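
All three of these calculations can also be done programmatically. A small helper sketch (the function name sensSpec is my own; it assumes both predicted classes occur at the chosen threshold, so the table has both a FALSE and a TRUE column):

# Sensitivity and specificity from a 2 x 2 table of actual outcome vs. (predicted > t)
sensSpec <- function(actual, predicted, t) {
  cm <- table(actual, predicted > t)
  c(sensitivity = cm["1", "TRUE"]  / sum(cm["1", ]),
    specificity = cm["0", "FALSE"] / sum(cm["0", ]))
}

sensSpec(qualityTrain$PoorCare, predictTrain, 0.5)
sensSpec(qualityTrain$PoorCare, predictTrain, 0.7)
sensSpec(qualityTrain$PoorCare, predictTrain, 0.2)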

But which threshold should we pick?

Receiver Operating Characteristic (ROC) Curve

Picking a good threshold value is often challenging. A Receiver Operating Characteristic curve, or ROC curve, can help you decide which value of the threshold is best.

The ROC curve for our problem is shown below.

  • The sensitivity, or true positive rate of the model, is shown on the y-axis.
  • The false positive rate, or 1 minus the specificity, is given on the x-axis.

The line shows how these two outcome measures vary with different threshold values.

The ROC curve captures all thresholds simultaneously.

  • The higher the threshold, or closer to (0, 0), the higher the specificity and the lower the sensitivity.
  • The lower the threshold, or closer to (1, 1), the higher the sensitivity and lower the specificity.

ROC for our model

# Prediction function
ROCRpred <- prediction(predictTrain, qualityTrain$PoorCare)

The prediction() function takes two arguments.

  • The first is the predictions we made with our model, which we called predictTrain.
  • The second argument is the true outcomes of our data points, which in our case, is qualityTrain$PoorCare.
# Performance function
ROCRperf <- performance(ROCRpred, "tpr", "fpr")

# Plot ROC curve
# plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3)

# Add colors
# plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3, colorize = TRUE)

# Add threshold labels 
plot(ROCRperf, main = "Receiver Operating Characteristic Curve", lwd = 3, colorize = TRUE, 
     print.cutoffs.at = seq(0, 1, by = 0.1), text.adj = c(-0.2, 1.7))

  • At the point (0, 0.4), you’re correctly labeling about 40% of the poor care cases with a very small false positive rate.
  • At the point (0.6, 0.9), you’re correctly labeling about 90% of the poor care cases, but have a false positive rate of 60%.
  • In the middle, around (0.3, 0.8), you’re correctly labeling about 80% of the poor care cases, with a 30% false positive rate.

So which threshold value should you pick?

You should select the best threshold for the trade-off you want to make considering:

  • cost of failing to detect positives
  • costs of raising false alarms

So,

  • If you’re more concerned with having a high specificity or low false positive rate,
    • pick the threshold that maximizes the true positive rate while keeping the false positive rate really low.
    • A threshold around (0.1, 0.5) on this ROC curve looks like a good choice in this case.
  • On the other hand, if you’re more concerned with having a high sensitivity or high true positive rate,
    • pick a threshold that minimizes the false positive rate but has a very high true positive rate.
    • A threshold around (0.3, 0.8) looks like a good choice in this case.

INTERPRETING THE MODEL

Let us examine how to interpret the model we developed.

Multicollinearity

One of the things we should look out for is multicollinearity.
Multicollinearity occurs when the independent variables are correlated with one another, and this can confuse the estimated coefficients (the betas) of the model.

We can test for this by checking the correlations between the independent variables. If they are excessively high, there might be multicollinearity, and you may have to revisit the model.
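
A quick way to run this check for the two predictors in our model (a sketch; more variables can be added to the column list):

# Pairwise correlation of the independent variables used in the model
cor(qualityTrain[, c("OfficeVisits", "Narcotics")])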

We should also consider whether the signs of the model coefficients make sense. Is each coefficient positive or negative?
If the sign agrees with intuition, multicollinearity is probably not a problem; if intuition suggests a different sign, that may be a symptom of multicollinearity.

Significance (Area Under the Curve, AUC)

The next important element is significance.
So how do we interpret the results, and how do we understand whether we have a good model or not?

For that purpose, let’s take a look at what is called the Area Under the Curve, or AUC for short. The Area Under the Curve gives an absolute measure of the quality of prediction: in this particular case, 77.5%.
Given that the perfect score is 100%, this is like a B.

So the area under the curve gives an absolute measure of quality that is less affected by the choice of benchmark: it shows how accurate the model is in a more absolute sense.
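
The 77.5% figure can be reproduced from the ROCR objects built earlier; a short sketch that mirrors the test-set AUC computation shown later:

# Area under the ROC curve on the training set
aucTrain <- as.numeric(performance(ROCRpred, "auc")@y.values)
aucTrain   # about 0.775, i.e. 77.5%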

Other Outcome Measures (confusion matrix)

Beyond sensitivity and specificity, the confusion matrix yields (with N the total number of observations):

  • overall accuracy = \(\frac{TN + TP}{N}\)
  • overall error rate = \(\frac{FP + FN}{N}\)
  • false negative error rate = \(\frac{FN}{TP + FN}\)
  • false positive error rate = \(\frac{FP}{TN + FP}\)

Making Predictions Out Of Sample

Just like in linear regression, we want to make predictions on a test set to compute out-of-sample metrics. We set aside 25% of our data, i.e. 32 cases, as a test set, against which we can compare the predicted outcomes of the model that was trained on the other 75% of the data.

# Out-of-sample predictions and AUC on the test set
predictTest <- predict(QualityLog, type = "response", newdata = qualityTest)
ROCRpredTest <- prediction(predictTest, qualityTest$PoorCare)
auc <- as.numeric(performance(ROCRpredTest, "auc")@y.values)

The AUC of this model on the test data is 79.9%.

If we use a threshold value of 0.3, we get the following confusion matrix:

tmp <- table(qualityTest$PoorCare, predictTest > 0.3)
tmp 
##    
##     FALSE TRUE
##   0    19    5
##   1     2    6
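
From this table, the out-of-sample accuracy, sensitivity, and specificity follow directly; a quick sketch reusing the tmp table just printed:

# Overall accuracy: (19 + 6) / 32
sum(diag(tmp)) / sum(tmp)

# Sensitivity: 6 / 8
tmp["1", "TRUE"] / sum(tmp["1", ])

# Specificity: 19 / 24
tmp["0", "FALSE"] / sum(tmp["0", ])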

Comparison with Baseline Model

The baseline model would predict good care all the time. In that case, it would be correct 24 times out of 32, versus 25 times for our logistic regression model.
More importantly, always predicting good care does not capture the dynamics of what is happening, whereas the logistic regression model is far better at capturing these effects.

CONCLUSIONS

The Competitive Edge of Models

  • While humans can accurately analyze small amounts of information, models can handle much larger amounts of data
  • Models do not replace expert judgment
    • Experts can improve and refine the model
  • Models can integrate assessments of many experts into one final unbiased and unemotional prediction