[ source files available on GitHub ]
Libraries needed for data processing and plotting:
library("dplyr")
library("magrittr")
library("ggplot2")
library("caTools")
library("ROCR")
The objective was to assess health care quality. We used a large health insurance claims database and randomly selected 131 diabetes patients.
An expert physician reviewed the claims and wrote descriptive notes and rated the quality of care on a two-point scale, poor or good.
Read the dataset quality.csv:
quality <- read.csv("data/quality.csv")
str(quality)
## 'data.frame': 131 obs. of 14 variables:
## $ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
## $ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
## $ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
## $ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
## $ DaysSinceLastERVisit: num 731 411 731 158 449 ...
## $ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
## $ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
## $ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
## $ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
## $ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
## $ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
## $ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...
The variables in the dataset quality.csv are those shown in the str() output above. The outcome variable PoorCare is equal to 1 if the patient had poor care, and equal to 0 if the patient had good care.
# mycolors <- c("blue2", "red2")
mycolors <- c("forestgreen", "firebrick2")
ggplot(data = quality, aes(x = OfficeVisits, y = Narcotics)) + theme_bw() +
scale_colour_manual(values = mycolors) +
xlab("Number of Office Visits") +
ylab("Number of Narcotics Prescribed") +
geom_point(aes(col = factor(PoorCare)), alpha = 0.5, pch = 16, size = 4.0)
This plot shows two of our independent variables, OfficeVisits (the number of office visits) and Narcotics (the number of narcotics prescribed). Each point is an observation, i.e. a patient in our data set.
The red points are patients who received poor care, and the green points are patients who received good care. It looks like more office visits and more narcotics tend to correspond to poor care.
Let’s see if we can build a logistic regression model to better predict poor care.
tmp <- table(quality$PoorCare)
tmp
##
## 0 1
## 98 33
In a classification problem, a standard baseline method is to just predict the most frequent outcome for all observations. Since good care is more common than poor care, in this case, we would predict that all patients are receiving good care.
If we did this, we would get 98 out of the 131 observations correct, or have an accuracy of about 75%. This is the accuracy of the baseline model. This is what we’ll try to beat with our logistic regression model.
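This baseline accuracy can be computed directly from the table tmp above (a small check added here for convenience; it is not part of the original output):
# Baseline model: always predict the most frequent outcome (good care)
max(tmp) / sum(tmp)  # 98 / 131, about 0.748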
Randomly split the data into a training set and a testing set, creating a dummy variable split:
# for Reproducibility
set.seed(88)
split <- sample.split(quality$PoorCare, SplitRatio = 0.75)
The function sample.split()
randomly splits the data while making sure that the outcome variable is well-balanced in each piece. We saw earlier that about 75% of our patients are receiving good care, so this function makes sure that about 75% of the patients in the training set and about 75% of the patients in the testing set are receiving good care.
We want to do this so that our test set is representative of our training set.
qualityTrain <- subset(quality, split == TRUE)
qualityTest <- subset(quality, split == FALSE)
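As a quick sanity check (not shown in the original output), we can confirm that the outcome variable is indeed similarly balanced in the two pieces:
# Proportion of good (0) vs. poor (1) care in each piece;
# both should be roughly 75% / 25%
prop.table(table(qualityTrain$PoorCare))
prop.table(table(qualityTest$PoorCare))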
As independent variables we use OfficeVisits and Narcotics.
QualityLog <- glm(PoorCare ~ OfficeVisits + Narcotics, data = qualityTrain, family = binomial)
The option family = binomial
tells glm()
that we want a logistic regression.
Let’s look at the results:
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 0.000000433 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
What we want to focus on is the coefficients table. This gives the estimate values for the coefficients, or the betas, for our logistic regression model.
We see here that the coefficients for OfficeVisits and Narcotics are both positive, which means that higher values of these two variables are indicative of poor care, as we suspected from looking at the data.
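Because the coefficients are on the log-odds scale, exponentiating them gives odds ratios, which can make the interpretation more concrete (this step is an addition and was not part of the original walkthrough):
# Odds ratios: e.g. each extra office visit multiplies the odds of poor care
# by roughly exp(0.082), i.e. about 1.09
exp(coef(QualityLog))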
The last thing we want to look at in the output is the AIC value. This is a measure of the quality of the model and is like Adjusted R-squared in that it accounts for the number of variables used compared to the number of observations.
Unfortunately, it can only be compared between models on the same data set. But it provides a means for model selection.
The preferred model is the one with the minimum AIC.
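As an illustration of such a comparison (the single-variable model below is hypothetical and not part of the original analysis), we could fit a simpler model on the same training data and compare AIC values:
# Hypothetical comparison: the two-variable model vs. OfficeVisits alone,
# both fitted to the same training data
QualityLog1 <- glm(PoorCare ~ OfficeVisits, data = qualityTrain, family = binomial)
AIC(QualityLog, QualityLog1)  # the model with the lower AIC is preferred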
predictTrain <- predict(QualityLog, type="response")
Analyze predictions.
Since we’re expecting probabilities, all of the numbers should be between zero and one.
And we see that the minimum value is about 0.07 and the maximum value is 0.98.
summary(predictTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460
Let’s see if we are predicting higher probabilities for the actual poor care cases, as we expect. To do this, use the tapply() function, giving as arguments predictTrain, qualityTrain$PoorCare, and mean (the function to be applied). This computes the average prediction for each of the true outcomes.
tapply(predictTrain, qualityTrain$PoorCare, mean)
## 0 1
## 0.1894512 0.4392246
The outcome of a logistic regression model is a probability. Often, we want to make a binary prediction.
We can do this using a threshold value t: if the predicted probability is at least t, we predict poor care; otherwise we predict good care.
There are two types of errors that a model can make: predicting poor care when the patient actually received good care (a false positive), and predicting good care when the patient actually received poor care (a false negative).
The threshold value is often selected based on which of these errors is “better”, that is, less costly for the application at hand.
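For instance (an illustrative snippet, not from the original code), with a threshold of 0.5 the probabilities are turned into binary predictions like this:
# Convert predicted probabilities into 0/1 predictions at threshold 0.5
thresh <- 0.5
head(as.integer(predictTrain >= thresh))  # 1 means predicted poor care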
Compare actual outcomes to predicted outcomes using a confusion matrix (classification matrix):
This compares the actual outcomes to the predicted outcomes. The rows are labeled with the actual outcome, and the columns are labeled with the predicted outcome. Each entry of the table gives the number of data observations that fall into that category.
Two combinations correspond to correct predictions: true negatives (we predict good care and the patient actually received good care) and true positives (we predict poor care and the patient actually received poor care).
The other two combinations are errors: false positives (we predict poor care but the patient actually received good care) and false negatives (we predict good care but the patient actually received poor care).
Two important outcome measures help us determine what types of errors we are making:
sensitivity = \(\frac{TP}{TP + FN}\) (the true positive rate)
specificity = \(\frac{TN}{TN + FP}\) (the true negative rate)
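For convenience, these two measures can be computed from a confusion matrix whose rows are the actual outcomes and whose columns are the predictions (the helper below is hypothetical and not part of the original code):
# Hypothetical helper: sensitivity and specificity from a confusion matrix
# with actual outcomes (0/1) in the rows and predictions (FALSE/TRUE) in the columns
sens_spec <- function(cm) {
  c(sensitivity = cm["1", "TRUE"] / sum(cm["1", ]),
    specificity = cm["0", "FALSE"] / sum(cm["0", ]))
}
For example, sens_spec(table(qualityTrain$PoorCare, predictTrain > 0.7)) should reproduce the values computed by hand below.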
Let’s compute some confusion matrices using different threshold values.
tmp <- table(qualityTrain$PoorCare, predictTrain > 0.5)
tmp
##
## FALSE TRUE
## 0 70 4
## 1 15 10
tmp <- table(qualityTrain$PoorCare, predictTrain > 0.7)
tmp
##
## FALSE TRUE
## 0 73 1
## 1 17 8
Sensitivity = 8 / 25 = 0.32
Specificity = 73 / 74 ≈ 0.99
By increasing the threshold, our sensitivity went down and our specificity went up.
tmp <- table(qualityTrain$PoorCare, predictTrain > 0.2)
tmp
##
## FALSE TRUE
## 0 54 20
## 1 9 16
With this lower threshold of 0.2, sensitivity rises to 16 / 25 = 0.64 while specificity falls to 54 / 74 ≈ 0.73.
But which threshold should we pick?
Picking a good threshold value is often challenging. A Receiver Operating Characteristic curve, or ROC curve, can help you decide which value of the threshold is best.
The ROC curve for our problem is shown below. It plots the true positive rate (sensitivity) against the false positive rate (1 - specificity), and the curve shows how these two outcome measures vary with different threshold values.
The ROC curve therefore captures all thresholds simultaneously.
# Prediction function
ROCRpred <- prediction(predictTrain, qualityTrain$PoorCare)
The prediction() function from the ROCR package takes two arguments: the predictions, predictTrain, and the true outcomes, qualityTrain$PoorCare.
# Performance function
ROCRperf <- performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
# plot(ROCRperf, main = "Receiver Operator Characteristic Curve", lwd = 3)
# Add colors
# plot(ROCRperf, main = "Receiver Operator Characteristic Curve", lwd = 3, colorize = TRUE)
# Add threshold labels
plot(ROCRperf, main = "Receiver Operator Characteristic Curve", lwd = 3, colorize = TRUE,
print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
You should select the threshold that gives the trade-off you want, weighing the cost of failing to detect a poor-care patient (a false negative) against the cost of flagging a well-cared-for patient for review (a false positive).
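One way to inspect candidate thresholds (a sketch relying on the slots of the ROCR performance object; the 0.7 target below is arbitrary) is to tabulate the cutoffs together with the rates they produce:
# Tabulate each cutoff with its false positive rate and true positive rate
cutoffs <- data.frame(cutoff = ROCRperf@alpha.values[[1]],
                      fpr    = ROCRperf@x.values[[1]],
                      tpr    = ROCRperf@y.values[[1]])
# e.g. the largest cutoffs that still reach a true positive rate of at least 0.7
head(subset(cutoffs, tpr >= 0.7))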
Let us examine how to interpret the model we developed.
One of the things we should watch out for is multicollinearity.
Multicollinearity occurs when the independent variables are correlated with one another, and this can distort the coefficients (the betas) in the model.
We can test for this by checking the correlations between the independent variables. If they are excessively high, multicollinearity may be present and you may have to revisit the model.
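A quick way to do this for the two variables in our model (a simple check, not shown in the original write-up):
# Correlation between the independent variables;
# a value close to +1 or -1 would signal multicollinearity
cor(qualityTrain$OfficeVisits, qualityTrain$Narcotics)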
We should also consider whether or not the signs of the model coefficients make sense. Is the coefficient beta positive or negative?
If it agrees with intuition, then multicollinearity has probably not been a problem, but if intuition suggests a different sign, this may be a symptom of multicollinearity.
The next important element is significance: the p-values in the coefficients table tell us whether each variable adds real explanatory power, and here both OfficeVisits and Narcotics are significant at the 5% level.
So how do we interpret the results, and how do we understand whether we have a good model or not?
For that purpose, let’s take a look at what is called Area Under the Curve, or AUC for short. The Area Under the Curve shows an absolute measure of quality of prediction. In this particular case, 77.5%.
Given that the perfect score is 100%, this is like a B.
So the area under the curve gives an absolute measure of quality that is less affected by the choice of benchmark; it describes how accurate the model is in a more absolute sense.
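For reference, the training-set AUC can be extracted from the ROCR prediction object created earlier (this computation is added here; it was not shown in the original code):
# Area under the ROC curve on the training set
as.numeric(performance(ROCRpred, "auc")@y.values)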
Just like in linear regression, we want to make predictions on a test set to compute out-of-sample metrics. We set aside 25% of our data, i.e. 32 cases, as a test set, against which we can compare the predicted outcomes of the model that was trained on the other 75% of the data.
predictTest <- predict(QualityLog, type = "response", newdata = qualityTest)
ROCRpredTest <- prediction(predictTest, qualityTest$PoorCare)
auc <- as.numeric(performance(ROCRpredTest, "auc")@y.values)
The AUC of this model on the test data is 79.9%.
If we use a threshold value of 0.3, we get the following confusion matrix:
tmp <- table(qualityTest$PoorCare, predictTest > 0.3)
tmp
##
## FALSE TRUE
## 0 19 5
## 1 2 6
The baseline model would predict good care all the time. On the test set it would be correct 24 times out of 32, versus 19 + 6 = 25 times for our logistic regression model at this threshold.
More importantly, always predicting good care never identifies a single poor-care patient, whereas the logistic regression model captures 6 of the 8 poor-care cases at the cost of a few false alarms.
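Both accuracies can be read off the confusion matrix above (a small computation added for clarity):
# Out-of-sample accuracy at threshold 0.3 vs. the always-good-care baseline
sum(diag(tmp)) / sum(tmp)                            # (19 + 6) / 32, about 0.78
sum(qualityTest$PoorCare == 0) / nrow(qualityTest)   # 24 / 32 = 0.75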
In practice, the probabilities returned by the logistic regression model can be used to prioritize patients for intervention.
Electronic medical records could be used in the future for this kind of analysis.