Critical decisions are often made by people with expert knowledge
We can talk about Odds (like in gambling) \[Odds = \frac{P(y = 1)}{P(y = 0)}\]
Odds > 1 if y = 1 is more likely; Odds < 1 if y = 0 is more likely
\[Odds = e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k}\] \[log(Odds) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_k x_k\]
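As a small sketch (not from the original notes), the log-odds can be converted back to a probability with the logistic function; the coefficient and predictor values below are made up:

# Sketch: from log-odds to probability (hypothetical numbers)
beta = c(-2.6, 0.08, 0.08)       # hypothetical intercept and slopes
x = c(1, 10, 2)                  # 1 for the intercept, then predictor values
log_odds = sum(beta * x)         # beta0 + beta1*x1 + beta2*x2
odds = exp(log_odds)             # Odds = e^(log-odds)
prob = odds / (1 + odds)         # P(y = 1) = Odds / (1 + Odds)
prob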
The outcome of a logistic regression model is a probability
We can convert this probability into a binary prediction by using a threshold value t
If P(PoorCare = 1) ≥ t, predict poor quality; if P(PoorCare = 1) < t, predict good quality
How do we select the value of t?
Often selected based on which errors are “better”
The ROC (Receiver Operating Characteristic) curve, which plots the true positive rate against the false positive rate, captures all thresholds simultaneously
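A minimal sketch of the thresholding rule (the probability vector and threshold below are hypothetical):

# Sketch: turn predicted probabilities into class predictions with a threshold t
p = c(0.10, 0.45, 0.80)                  # hypothetical predicted probabilities
t = 0.5                                  # hypothetical threshold
ifelse(p >= t, "PoorCare", "GoodCare")   # >= t predicts poor care, < t good care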
N = number of observations
\[\text{Overall Accuracy} = \frac{TN + TP}{N}\]
\[\text{Sensitivity} = \frac{TP}{TP + FN}\]
\[\text{Specificity} = \frac{TN}{TN + FP}\]
\[\text{Overall Error Rate} = \frac{FP + FN}{N}\]
\[\text{False Negative Error Rate} = \frac{FN}{TP + FN}\]
\[\text{False Positive Error Rate} = \frac{FP}{TN + FP}\]
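As a sketch, these metrics can be computed from a 2x2 confusion matrix laid out like the tables further below (actual outcome in rows, prediction in columns); the helper name is an assumption:

# Sketch: accuracy, sensitivity, and specificity from a confusion matrix cm
# (rows = actual 0/1, columns = predicted FALSE/TRUE)
confusionStats = function(cm) {
  TN = cm[1, 1]; FP = cm[1, 2]
  FN = cm[2, 1]; TP = cm[2, 2]
  c(accuracy    = (TN + TP) / sum(cm),
    sensitivity = TP / (TP + FN),
    specificity = TN / (TN + FP))
}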
predictTest = predict(QualityLog, type = "response", newdata = qualityTest)
This returns predicted probabilities of poor care for the test-set observations
If we use a threshold value of 0.3, we can build the corresponding confusion matrix on the test set
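The matrix itself is not reproduced in these notes; with the test set built in the code below, it could be generated as a sketch:

# Sketch: test-set confusion matrix at a threshold of 0.3
# (rows: actual PoorCare, columns: predictTest > 0.3)
table(qualityTest$PoorCare, predictTest > 0.3)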
To compare thresholds with a single number, just take the area under the ROC curve (AUC); it is computed with ROCR after the ROC plots at the end of the code section below
AUC is less affected by class imbalance than raw accuracy
In practice, the probabilities returned by the logistic regression model can be used to prioritize patients for intervention
Electronic medical records could be used in the future
While humans can accurately analyze small amounts of information, models scale to arbitrarily large amounts of data
Models can integrate assessments of many experts into one final unbiased and unemotional prediction.
# Read in the dataset
quality = read.csv("quality.csv")
# Output structure
str(quality)
## 'data.frame': 131 obs. of 14 variables:
## $ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
## $ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
## $ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
## $ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
## $ DaysSinceLastERVisit: num 731 411 731 158 449 ...
## $ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
## $ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
## $ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
## $ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
## $ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
## $ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
## $ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...
# Tabulate the amount of poor care in the dataset
z = table(quality$PoorCare)
kable(z)
Var1 | Freq |
---|---|
0 | 98 |
1 | 33 |
# Baseline accuracy
98/131
## [1] 0.7480916
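Equivalently, as a small sketch, the baseline can be computed from the table instead of hard-coded:

# Baseline model: always predict the most frequent outcome (good care)
max(table(quality$PoorCare)) / nrow(quality)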
# Load the caTools package to split into training and testing sets
library(caTools)
# Randomly split data
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
split
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
## [33] FALSE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [65] TRUE FALSE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE
## [97] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE FALSE TRUE
## [129] TRUE TRUE FALSE
# Create training and testing sets
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
# Logistic Regression Model
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 0.000000433 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
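Not in the original notes, but since the coefficients are on the log-odds scale, exponentiating them gives odds ratios for the fitted model:

# Odds ratios: multiplicative change in the odds of poor care
# for a one-unit increase in each variable
exp(coef(QualityLog))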
# Make predictions on training set
predictTrain = predict(QualityLog, type="response")
# Analyze predictions
summary(predictTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06623 0.11912 0.15967 0.25253 0.26765 0.98456
z = tapply(predictTrain, qualityTrain$PoorCare, mean)
kable(z)
PoorCare | Mean predicted probability |
---|---|
0 | 0.1894512 |
1 | 0.4392246 |
# Confusion matrix for threshold of 0.5
z = table(qualityTrain$PoorCare, predictTrain > 0.5)
kable(z)
PoorCare | FALSE | TRUE |
---|---|---|
0 | 70 | 4 |
1 | 15 | 10 |
# Sensitivity = TP / (TP + FN)
10/25
## [1] 0.4
# Specificity = TN / (TN + FP)
70/74
## [1] 0.9459459
# Confusion matrix for threshold of 0.7
z = table(qualityTrain$PoorCare, predictTrain > 0.7)
kable(z)
PoorCare | FALSE | TRUE |
---|---|---|
0 | 73 | 1 |
1 | 17 | 8 |
# Sensitivity = TP / (TP + FN)
8/25
## [1] 0.32
# Specificity = TN / (TN + FP)
73/74
## [1] 0.9864865
# Confusion matrix for threshold of 0.2
z = table(qualityTrain$PoorCare, predictTrain > 0.2)
kable(z)
PoorCare | FALSE | TRUE |
---|---|---|
0 | 54 | 20 |
1 | 9 | 16 |
# Sensitivity = TP / (TP + FN)
16/25
## [1] 0.64
# Specificity = TN / (TN + FP)
54/74
## [1] 0.7297297
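The three confusion matrices above repeat the same arithmetic; here is a small sketch that sweeps the thresholds in one loop (it assumes both FALSE and TRUE columns appear at each threshold):

# Sketch: sensitivity and specificity on the training set for several thresholds
for (t in c(0.2, 0.5, 0.7)) {
  cm = table(qualityTrain$PoorCare, predictTrain > t)
  sens = cm[2, 2] / sum(cm[2, ])   # TP / (TP + FN)
  spec = cm[1, 1] / sum(cm[1, ])   # TN / (TN + FP)
  cat("t =", t, "sensitivity =", round(sens, 2), "specificity =", round(spec, 2), "\n")
}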
# Load ROCR package
library(ROCR)
# Prediction function
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
# Performance function
ROCRperf = performance(ROCRpred, "tpr", "fpr")
# Plot ROC curve
plot(ROCRperf)
# Add colors
plot(ROCRperf, colorize=TRUE)
# Add threshold labels
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
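To summarize the curve with one number, as mentioned above, the AUC can be extracted from the same ROCR objects; this step is not shown in the original notes:

# Compute the area under the ROC curve (AUC)
as.numeric(performance(ROCRpred, "auc")@y.values)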