Modeling the Expert

Ask the Experts!

  • Critical decisions are often made by people with expert knowledge

  • Healthcare Quality Assessment
    • Good quality care educates patients and controls costs
    • Need to assess quality for proper medical interventions
    • No single set of guidelines for defining quality of healthcare
    • Health professionals are experts in quality of care assessment

Replicating Expert Assessment

  • Can we develop analytical tools that replicate expert assessment on a large scale?
  • Learn from expert human judgement
    • Develop a model, interpret results, and adjust the model
    • Make predictions/evaluations on a large scale
    • Let’s identify poor healthcare quality using analytics

Claims Data

  • Electronically available
  • Standardized
  • Not 100% accurate
  • Under-reporting is common
  • Claims for hospital visits can be vague

Creating the Dataset

Claims Samples

  • Large health insurance claims database
  • Randomly selected 131 diabetes patients
  • Ages range from 35 to 55
  • Costs ranged from $10,000 to $20,000

Expert Review

  • Expert physician reviewed claims and wrote descriptive notes.

Expert Assessment

  • Rated quality on a two-point scale (poor/good)

Variable Extraction

  • Dependent Variable
    • Quality of care
  • Independent Variables
    • Example notes from the expert review:
      • Ongoing use of narcotics
      • Only on Avandia, not a good first-choice drug
      • Had regular visits, mammogram, and immunizations
      • Was given home testing supplies
    • Variable categories extracted from the claims:
      • Diabetes treatment
      • Patient demographics
      • Healthcare utilization
      • Providers
      • Claims
      • Prescriptions

Predicting Quality of Care

  • The dependent variable is modeled as a binary variable
    • 1 if poor care, 0 if good care
  • This is a categorical variable
    • A small number of possible outcomes
  • Linear regression would predict a continuous outcome

Logistic Regression

  • Predicts the probability of poor care
    • Denote the dependent variable “PoorCare” by y
    • The probability of poor care is \[P(y = 1)\]
  • Then \[P(y = 0) = 1 - P(y = 1)\]
  • Independent variables \[x_1, x_2, ..., x_k\]
  • Uses the Logistic Response Function \[P(y = 1) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k)}}\]
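
As a rough illustration, the logistic response function above can be evaluated directly in R; the coefficient and input values below are made-up placeholders, not fitted estimates.

# Logistic response function: maps the linear predictor to a probability in (0, 1)
# beta and x below are illustrative placeholder values, not fitted coefficients
logisticResponse = function(beta, x) {
  1 / (1 + exp(-(beta[1] + sum(beta[-1] * x))))
}
logisticResponse(beta = c(-2.6, 0.08, 0.08), x = c(10, 5))  # about 0.20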

Understanding the Logistic Function

  • Positive values of the linear combination \[\beta_0 + \beta_1 x_1 + ... + \beta_k x_k\] are predictive of class 1
  • Negative values are predictive of class 0
  • The coefficients are selected to
    • Predict a high probability for the poor care cases
    • Predict a low probability for the good care cases
  • We can talk about Odds (like in gambling) \[Odds = \frac{P(y = 1)}{P(y = 0)}\]

  • Odds > 1 if y = 1 is more likely
  • Odds < 1 if y = 0 is more likely

The Logit

\[Odds = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}\] \[log(Odds) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k\]

  • This is called the “Logit” and looks like linear regression
  • The bigger the Logit, the bigger P(y = 1)
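
A small numeric check of the relationships above, using an assumed probability of 0.8 purely for illustration:

# Probability, odds, and the logit are three views of the same quantity
p = 0.8                # assumed P(y = 1)
odds = p / (1 - p)     # 4: y = 1 is four times as likely as y = 0
logit = log(odds)      # about 1.386; for a fitted model this equals beta_0 + beta_1*x_1 + ...
plogis(logit)          # recovers 0.8 via the logistic response function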

Model for Healthcare Quality

  • Plot of two independent variables
    • Number of Office Visits
    • Number of Narcotics Prescribed
  • Red points indicate poor care
  • Green points indicate good care

Threshold Value

  • The outcome of a logistic regression model is a probability

  • Often, we want to make a binary prediction
    • Did this patient receive poor care or good care?
  • We can do this by using a threshold value t

  • If P(PoorCare = 1) ≥ t, predict poor quality
  • If P(PoorCare = 1) < t, predict good quality

  • How do we select the value of t?

  • Often selected based on which errors are “better”

  • If t is large, predict poor care rarely (only when P(y = 1) is large)
    • More errors where we say good care, but it is actually poor care
    • Detects patients who are receiving the worst care
  • If t is small, predict good care rarely (only when P(y = 1) is small)
    • More errors where we say poor care, but it is actually good care
    • Detects all patients who might be receiving poor care
  • With no preference between the errors, select t = 0.5
    • Predicts the more likely outcome

Selecting a Threshold Value

  • Compare actual outcomes to predicted outcomes using a confusion matrix (classification matrix)
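
In R this is a single call to table(); the object names and the 0.5 threshold below mirror the walkthrough at the end of this document.

# Confusion matrix: actual outcomes in rows, thresholded predictions in columns
table(qualityTrain$PoorCare, predictTrain > 0.5)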

Receiver Operating Characteristic (ROC) Curve

  • True positive rate (sensitivity) on y-axis
    • Proportion of poor care caught
  • False positive rate (1 − specificity) on x-axis
    • Proportion of good care labeled as poor care

Selecting a Threshold using ROC

  • Captures all thresholds simultaneously

  • High Threshold
    • High specificity
    • Low sensitivity
  • Low Threshold
    • Low specificity
    • High sensitivity
  • Choose the threshold for the best trade-off between
    • The cost of failing to detect positives
    • The cost of raising false alarms

Interpreting the Model

  • Multicollinearity could be a problem
    • Do the coefficients make sense?
    • Check correlations
  • Measures of accuracy
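
A quick correlation check among candidate predictors can flag multicollinearity; the variable names below come from the quality dataset used in the R walkthrough.

# Pairwise correlations among candidate predictors; values near 1 or -1 suggest multicollinearity
cor(quality[, c("OfficeVisits", "Narcotics", "TotalVisits", "ProviderCount")])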

Compute Outcome Measures

N = number of observations

\[\text{Overall Accuracy} = \frac{TN + TP}{N}\]

\[\text{Sensitivity} = \frac{TP}{TP + FN}\]

\[\text{Specificity} = \frac{TN}{TN + FP}\]

\[\text{Overall Error Rate} = \frac{FP + FN}{N}\]

\[\text{False Negative Error Rate} = \frac{FN}{TP + FN}\]

\[\text{False Positive Error Rate} = \frac{FP}{TN + FP}\]
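
A short sketch computing these measures from confusion-matrix counts; the counts below are taken from the 0.5-threshold training confusion matrix shown later in this document.

# Outcome measures from confusion-matrix counts
TN = 70; FP = 4; FN = 15; TP = 10
N = TN + FP + FN + TP
(TN + TP) / N     # overall accuracy
TP / (TP + FN)    # sensitivity (true positive rate)
TN / (TN + FP)    # specificity (true negative rate)
(FP + FN) / N     # overall error rate
FN / (TP + FN)    # false negative error rate
FP / (TN + FP)    # false positive error rate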

Making Predictions

  • Just like in linear regression, we want to make predictions on a test set to compute out-of-sample metrics

predictTest = predict(QualityLog, type = "response", newdata = qualityTest)

  • This returns predicted probabilities for the test set

  • If we use a threshold value of 0.3, we can tabulate a confusion matrix on the test set (see the sketch below)
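
A sketch of how that confusion matrix would be tabulated; the resulting counts depend on the random train/test split.

# Confusion matrix on the test set with a threshold of 0.3
table(qualityTest$PoorCare, predictTest > 0.3)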

Area Under the ROC Curve (AUC)

  • Just take the area under the curve

  • Interpretation
    • Given one randomly chosen positive case and one randomly chosen negative case, the AUC is the proportion of the time the model ranks the positive case higher (i.e., guesses which is which correctly)
  • Less affected by sample balance than accuracy

  • What is a good AUC?
    • Maximum of 1 (perfect prediction)
  • What is a bad AUC?
    • 0.5 is just guessing (no better than random)

Conclusions

  • An expert-trained model can accurately identify diabetics receiving low-quality care
    • Out-of-sample accuracy of 78%
    • Identifies most patients receiving poor care
  • In practice, the probabilities returned by the logistic regression model can be used to prioritize patients for intervention

  • Electronic medical records could be used in the future

The Competitive Edge of Models

  • While humans can accurately analyze small amounts of information, models can scale that analysis to large amounts of data

  • Models do not replace expert judgement
    • Experts can improve and refine the model
  • Models can integrate assessments of many experts into one final unbiased and unemotional prediction.

Modeling the Expert in R

Read in dataset

# Read in the dataset
quality = read.csv("quality.csv")

Look at structure

# Output structure
str(quality)
## 'data.frame':    131 obs. of  14 variables:
##  $ MemberID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ InpatientDays       : int  0 1 0 0 8 2 16 2 2 4 ...
##  $ ERVisits            : int  0 1 0 1 2 0 1 0 1 2 ...
##  $ OfficeVisits        : int  18 6 5 19 19 9 8 8 4 0 ...
##  $ Narcotics           : int  1 1 3 0 3 2 1 0 3 2 ...
##  $ DaysSinceLastERVisit: num  731 411 731 158 449 ...
##  $ Pain                : int  10 0 10 34 10 6 4 5 5 2 ...
##  $ TotalVisits         : int  18 8 5 20 29 11 25 10 7 6 ...
##  $ ProviderCount       : int  21 27 16 14 24 40 19 11 28 21 ...
##  $ MedicalClaims       : int  93 19 27 59 51 53 40 28 20 17 ...
##  $ ClaimLines          : int  222 115 148 242 204 156 261 87 98 66 ...
##  $ StartedOnCombination: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ AcuteDrugGapSmall   : int  0 1 5 0 0 4 0 0 0 0 ...
##  $ PoorCare            : int  0 0 0 0 0 1 0 0 1 0 ...

Table outcome

# Tabulate the amount of poor care in the dataset
library(knitr)  # needed for kable()
z = table(quality$PoorCare)
kable(z)
| Var1 | Freq |
|:-----|-----:|
| 0    |   98 |
| 1    |   33 |
# Baseline accuracy
98/131
## [1] 0.7480916

Load caTools package

# Load the caTools package to split the data into training and testing sets
library(caTools)

# Randomly split data
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
split
##   [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE
##  [33] FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [65]  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE
##  [97]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE
## [129]  TRUE  TRUE FALSE

# Create training and testing sets
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)

Logistic Regression

# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value    Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 0.000000433 ***
## OfficeVisits  0.08212    0.03055   2.688     0.00718 ** 
## Narcotics     0.07630    0.03205   2.381     0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4
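
Because each coefficient is the change in the log-odds per one-unit increase in its variable, exponentiating the coefficients gives odds ratios, a common way to read the fit:

# Odds ratios: multiplicative change in the odds of poor care per one-unit increase
exp(coef(QualityLog))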

Make Predictions

# Make predictions on training set
predictTrain = predict(QualityLog, type="response")

Analyze the predictions

# Analyze predictions
summary(predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06623 0.11912 0.15967 0.25253 0.26765 0.98456
z = tapply(predictTrain, qualityTrain$PoorCare, mean)
kable(z)
|   |         x |
|:--|----------:|
| 0 | 0.1894512 |
| 1 | 0.4392246 |

Confusion matrix

# Confusion matrix for threshold of 0.5
z = table(qualityTrain$PoorCare, predictTrain > 0.5)
kable(z)
|   | FALSE | TRUE |
|:--|------:|-----:|
| 0 |    70 |    4 |
| 1 |    15 |   10 |

# Sensitivity = TP / (TP + FN)
10/25
## [1] 0.4
# Specificity = TN / (TN + FP)
70/74
## [1] 0.9459459

# Confusion matrix for threshold of 0.7
z = table(qualityTrain$PoorCare, predictTrain > 0.7)
kable(z)
|   | FALSE | TRUE |
|:--|------:|-----:|
| 0 |    73 |    1 |
| 1 |    17 |    8 |

# Sensitivity = TP / (TP + FN)
8/25
## [1] 0.32
# Specificity = TN / (TN + FP)
73/74
## [1] 0.9864865

# Confusion matrix for threshold of 0.2
z = table(qualityTrain$PoorCare, predictTrain > 0.2)
kable(z)
|   | FALSE | TRUE |
|:--|------:|-----:|
| 0 |    54 |   20 |
| 1 |     9 |   16 |
# Sensitivity = TP / (TP + FN)
16/25
## [1] 0.64
# Specificity = TN / (TN + FP)
54/74
## [1] 0.7297297

ROC Curve with ROCR

# Load ROCR package
library(ROCR)

# Prediction function
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)

# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")

# Plot ROC curve
plot(ROCRperf)


# Add colors
plot(ROCRperf, colorize=TRUE)

# Add threshold labels 
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
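
Finally, the training-set AUC can be extracted from the same ROCR prediction object; performance(..., "auc") stores the value in its y.values slot.

# Compute the area under the ROC curve (AUC) on the training set
ROCRauc = performance(ROCRpred, "auc")
as.numeric(ROCRauc@y.values)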