The quality dataset contains information about each patient, including whether he or she is receiving good health care or not, along with other necessary details.
quality = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 3\\quality.csv")
str(quality)
## 'data.frame': 131 obs. of 14 variables:
## $ MemberID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ InpatientDays : int 0 1 0 0 8 2 16 2 2 4 ...
## $ ERVisits : int 0 1 0 1 2 0 1 0 1 2 ...
## $ OfficeVisits : int 18 6 5 19 19 9 8 8 4 0 ...
## $ Narcotics : int 1 1 3 0 3 2 1 0 3 2 ...
## $ DaysSinceLastERVisit: num 731 411 731 158 449 ...
## $ Pain : int 10 0 10 34 10 6 4 5 5 2 ...
## $ TotalVisits : int 18 8 5 20 29 11 25 10 7 6 ...
## $ ProviderCount : int 21 27 16 14 24 40 19 11 28 21 ...
## $ MedicalClaims : int 93 19 27 59 51 53 40 28 20 17 ...
## $ ClaimLines : int 222 115 148 242 204 156 261 87 98 66 ...
## $ StartedOnCombination: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ AcuteDrugGapSmall : int 0 1 5 0 0 4 0 0 0 0 ...
## $ PoorCare : int 0 0 0 0 0 1 0 0 1 0 ...
summary(quality)
## MemberID InpatientDays ERVisits OfficeVisits
## Min. : 1.0 Min. : 0.000 Min. : 0.000 Min. : 0.00
## 1st Qu.: 33.5 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 7.00
## Median : 66.0 Median : 0.000 Median : 1.000 Median :12.00
## Mean : 66.0 Mean : 2.718 Mean : 1.496 Mean :13.23
## 3rd Qu.: 98.5 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.:18.50
## Max. :131.0 Max. :30.000 Max. :11.000 Max. :46.00
## Narcotics DaysSinceLastERVisit Pain TotalVisits
## Min. : 0.000 Min. : 6.0 Min. : 0.00 Min. : 0.00
## 1st Qu.: 0.000 1st Qu.:207.0 1st Qu.: 1.00 1st Qu.: 8.00
## Median : 1.000 Median :641.0 Median : 8.00 Median :15.00
## Mean : 4.573 Mean :480.6 Mean : 15.56 Mean :17.44
## 3rd Qu.: 3.000 3rd Qu.:731.0 3rd Qu.: 23.00 3rd Qu.:22.50
## Max. :59.000 Max. :731.0 Max. :104.00 Max. :69.00
## ProviderCount MedicalClaims ClaimLines StartedOnCombination
## Min. : 5.00 Min. : 11.00 Min. : 20.0 Mode :logical
## 1st Qu.:15.00 1st Qu.: 25.50 1st Qu.: 83.5 FALSE:125
## Median :20.00 Median : 37.00 Median :120.0 TRUE :6
## Mean :23.98 Mean : 43.24 Mean :142.9 NA's :0
## 3rd Qu.:30.00 3rd Qu.: 49.50 3rd Qu.:185.0
## Max. :82.00 Max. :194.00 Max. :577.0
## AcuteDrugGapSmall PoorCare
## Min. : 0.000 Min. :0.0000
## 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 1.000 Median :0.0000
## Mean : 2.695 Mean :0.2519
## 3rd Qu.: 3.000 3rd Qu.:0.5000
## Max. :71.000 Max. :1.0000
We have 131 observations, one for each patient in our data set, and 14 variables. MemberID is just an identifier, and PoorCare is the outcome (dependent variable), equal to 1 if the patient received poor care and 0 otherwise. The 12 variables from InpatientDays to AcuteDrugGapSmall are the independent variables. Of these, we’ll be using the number of office visits and the number of prescriptions for narcotics that the patient had.
table(quality$PoorCare)
##
## 0 1
## 98 33
We can see that 98 of the 131 patients in our data set received good care (labeled 0) and 33 received poor care (labeled 1). In a classification problem, the baseline model always predicts the most frequent outcome, so here it predicts good care for every patient and has an accuracy of 98/131, or about 75 percent.
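As a quick check of the baseline arithmetic, here is a minimal sketch that recomputes it from the two counts reported by table(). The vector name counts is only for illustration.

```r
# Outcome counts copied from the table() output above.
counts <- c(good = 98, poor = 33)

# The baseline model always predicts the most frequent outcome,
# so its accuracy is the share of that outcome in the data.
baseline_accuracy <- max(counts) / sum(counts)
print(round(baseline_accuracy, 4))
```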
library(caTools)
## Warning: package 'caTools' was built under R version 3.3.3
set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
split
## [1] TRUE TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE TRUE
## [12] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [23] TRUE TRUE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
## [34] TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE FALSE TRUE TRUE
## [45] FALSE TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE
## [56] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE
## [67] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
## [78] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE
## [89] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE
## [100] TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE FALSE
## [111] FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE TRUE TRUE FALSE
## [122] TRUE TRUE FALSE TRUE TRUE FALSE TRUE TRUE TRUE FALSE
qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
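sample.split keeps the proportion of the outcome roughly the same in the training and test sets. The following base R sketch illustrates that idea on a stand-in outcome vector with the same counts as PoorCare. The helper stratified_split is hypothetical and is not how caTools is implemented, so it will not reproduce the exact split shown above.

```r
set.seed(88)

# Stand-in outcome with the same class counts as PoorCare (98 zeros, 33 ones).
outcome <- c(rep(0, 98), rep(1, 33))

# Hypothetical helper: sample the requested ratio separately within each class.
stratified_split <- function(y, ratio) {
  in_train <- logical(length(y))
  for (level in unique(y)) {
    idx <- which(y == level)
    n_train <- round(ratio * length(idx))
    chosen <- idx[sample.int(length(idx), n_train)]
    in_train[chosen] <- TRUE
  }
  in_train
}

in_train <- stratified_split(outcome, 0.75)

# Class counts in each part, and the poor care rate in each part.
print(table(outcome, in_train))
print(tapply(outcome, in_train, mean))
```

Both parts should show a poor care rate close to the overall rate of about 25 percent, which is the point of splitting on the outcome variable.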
numquality<-quality[,c(-1, -12)]
summary(numquality)
## InpatientDays ERVisits OfficeVisits Narcotics
## Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 0.000 1st Qu.: 0.000 1st Qu.: 7.00 1st Qu.: 0.000
## Median : 0.000 Median : 1.000 Median :12.00 Median : 1.000
## Mean : 2.718 Mean : 1.496 Mean :13.23 Mean : 4.573
## 3rd Qu.: 3.000 3rd Qu.: 2.000 3rd Qu.:18.50 3rd Qu.: 3.000
## Max. :30.000 Max. :11.000 Max. :46.00 Max. :59.000
## DaysSinceLastERVisit Pain TotalVisits ProviderCount
## Min. : 6.0 Min. : 0.00 Min. : 0.00 Min. : 5.00
## 1st Qu.:207.0 1st Qu.: 1.00 1st Qu.: 8.00 1st Qu.:15.00
## Median :641.0 Median : 8.00 Median :15.00 Median :20.00
## Mean :480.6 Mean : 15.56 Mean :17.44 Mean :23.98
## 3rd Qu.:731.0 3rd Qu.: 23.00 3rd Qu.:22.50 3rd Qu.:30.00
## Max. :731.0 Max. :104.00 Max. :69.00 Max. :82.00
## MedicalClaims ClaimLines AcuteDrugGapSmall PoorCare
## Min. : 11.00 Min. : 20.0 Min. : 0.000 Min. :0.0000
## 1st Qu.: 25.50 1st Qu.: 83.5 1st Qu.: 0.000 1st Qu.:0.0000
## Median : 37.00 Median :120.0 Median : 1.000 Median :0.0000
## Mean : 43.24 Mean :142.9 Mean : 2.695 Mean :0.2519
## 3rd Qu.: 49.50 3rd Qu.:185.0 3rd Qu.: 3.000 3rd Qu.:0.5000
## Max. :194.00 Max. :577.0 Max. :71.000 Max. :1.0000
cor(numquality)
## InpatientDays ERVisits OfficeVisits Narcotics
## InpatientDays 1.000000000 0.440087299 0.1759011 -0.093768932
## ERVisits 0.440087299 1.000000000 0.3085257 -0.003731653
## OfficeVisits 0.175901119 0.308525685 1.0000000 0.275759302
## Narcotics -0.093768932 -0.003731653 0.2757593 1.000000000
## DaysSinceLastERVisit -0.290121046 -0.735246070 -0.1283879 0.065054809
## Pain 0.304058069 0.546779466 0.3529678 0.106860359
## TotalVisits 0.622035618 0.586438628 0.8653868 0.163992449
## ProviderCount 0.244023304 0.457429030 0.3654691 0.293478180
## MedicalClaims 0.286377975 0.355318952 0.4985134 0.220540818
## ClaimLines 0.386951074 0.542000500 0.4249532 0.185798702
## AcuteDrugGapSmall -0.001144346 -0.072749681 0.2007348 0.710888560
## PoorCare 0.080725715 0.135400778 0.3295118 0.447236064
## DaysSinceLastERVisit Pain TotalVisits
## InpatientDays -0.29012105 0.30405807 0.6220356
## ERVisits -0.73524607 0.54677947 0.5864386
## OfficeVisits -0.12838788 0.35296784 0.8653868
## Narcotics 0.06505481 0.10686036 0.1639924
## DaysSinceLastERVisit 1.00000000 -0.35878080 -0.3446395
## Pain -0.35878080 1.00000000 0.4829592
## TotalVisits -0.34463954 0.48295915 1.0000000
## ProviderCount -0.29770084 0.40509514 0.4515455
## MedicalClaims -0.19811441 0.29669718 0.5493080
## ClaimLines -0.41279666 0.46471274 0.5696186
## AcuteDrugGapSmall 0.13108501 -0.03149016 0.1348611
## PoorCare -0.10798298 0.09216828 0.3005403
## ProviderCount MedicalClaims ClaimLines
## InpatientDays 0.2440233 0.2863780 0.38695107
## ERVisits 0.4574290 0.3553190 0.54200050
## OfficeVisits 0.3654691 0.4985134 0.42495323
## Narcotics 0.2934782 0.2205408 0.18579870
## DaysSinceLastERVisit -0.2977008 -0.1981144 -0.41279666
## Pain 0.4050951 0.2966972 0.46471274
## TotalVisits 0.4515455 0.5493080 0.56961864
## ProviderCount 1.0000000 0.5170023 0.60535725
## MedicalClaims 0.5170023 1.0000000 0.81393452
## ClaimLines 0.6053573 0.8139345 1.00000000
## AcuteDrugGapSmall 0.1412836 0.0856369 -0.01322946
## PoorCare 0.2201661 0.1673987 0.12917477
## AcuteDrugGapSmall PoorCare
## InpatientDays -0.001144346 0.08072572
## ERVisits -0.072749681 0.13540078
## OfficeVisits 0.200734789 0.32951181
## Narcotics 0.710888560 0.44723606
## DaysSinceLastERVisit 0.131085008 -0.10798298
## Pain -0.031490160 0.09216828
## TotalVisits 0.134861075 0.30054033
## ProviderCount 0.141283618 0.22016613
## MedicalClaims 0.085636905 0.16739875
## ClaimLines -0.013229464 0.12917477
## AcuteDrugGapSmall 1.000000000 0.34143466
## PoorCare 0.341434658 1.00000000
Looking at the correlations, PoorCare is most strongly correlated with Narcotics (0.45) and is also fairly strongly correlated with OfficeVisits (0.33), so we will use these two variables for prediction. The correlation between PoorCare and AcuteDrugGapSmall (0.34) is also relatively high, but AcuteDrugGapSmall is strongly correlated with Narcotics (0.71), which could cause multicollinearity. We therefore drop AcuteDrugGapSmall and keep Narcotics, which has the stronger correlation with PoorCare.
So we will build a logistic regression model with PoorCare as the dependent variable and OfficeVisits and Narcotics as the independent variables. The argument family=binomial tells the glm function to build a logistic regression model.
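The ranking can be made explicit with a small sketch. The values below are copied (rounded) from the PoorCare column of the cor() output, and the vector name cor_with_poorcare is only for illustration.

```r
# Correlations with PoorCare, rounded from the cor(numquality) output above.
cor_with_poorcare <- c(
  InpatientDays = 0.0807, ERVisits = 0.1354, OfficeVisits = 0.3295,
  Narcotics = 0.4472, DaysSinceLastERVisit = -0.1080, Pain = 0.0922,
  TotalVisits = 0.3005, ProviderCount = 0.2202, MedicalClaims = 0.1674,
  ClaimLines = 0.1292, AcuteDrugGapSmall = 0.3414
)

# Rank candidate predictors by the absolute size of their correlation.
ranked <- sort(abs(cor_with_poorcare), decreasing = TRUE)
print(head(ranked, 4))
```

The fourth variable in this ranking, TotalVisits, is itself highly correlated with OfficeVisits (0.87 in the output above), so the same multicollinearity reasoning would argue against adding it alongside OfficeVisits.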
QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
##
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial,
## data = qualityTrain)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.06303 -0.63155 -0.50503 -0.09689 2.16686
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.64613 0.52357 -5.054 4.33e-07 ***
## OfficeVisits 0.08212 0.03055 2.688 0.00718 **
## Narcotics 0.07630 0.03205 2.381 0.01728 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 111.888 on 98 degrees of freedom
## Residual deviance: 89.127 on 96 degrees of freedom
## AIC: 95.127
##
## Number of Fisher Scoring iterations: 4
Both coefficients are positive, so higher values of OfficeVisits and Narcotics are associated with a higher predicted probability of poor care. We also see that both of these variables have at least one star, meaning that they’re significant in our model (at the 5 percent level).
The AIC value is a measure of the quality of the model. It is like Adjusted R-squared in that it accounts for the number of variables used compared to the number of observations. Unfortunately, it can only be compared between models fit on the same data set. The preferred model is the one with the minimum AIC.
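To see how the coefficients turn into a probability, here is a minimal sketch that applies the logistic function by hand, using the estimates copied from the summary above. The example patient (18 office visits, 1 narcotics prescription) and the helper name predict_poor_care are illustrative assumptions.

```r
# Coefficient estimates copied from summary(QualityLog).
b0 <- -2.64613          # intercept
b_office <- 0.08212     # OfficeVisits
b_narcotics <- 0.07630  # Narcotics

# Hypothetical helper: linear predictor passed through the logistic function.
predict_poor_care <- function(office_visits, narcotics) {
  plogis(b0 + b_office * office_visits + b_narcotics * narcotics)
}

# Illustrative patient with 18 office visits and 1 narcotics prescription.
p_example <- predict_poor_care(18, 1)

# A patient with no office visits and no narcotics gets the intercept-only probability.
p_minimum <- predict_poor_care(0, 0)

# exp(coefficient) is the multiplicative change in the odds per one-unit increase.
odds_ratios <- exp(c(OfficeVisits = b_office, Narcotics = b_narcotics))

print(c(example = p_example, minimum = p_minimum))
print(odds_ratios)
```

The intercept-only probability comes out at about 0.066, which agrees with the smallest predicted probability in the training set reported by summary(predictTrain).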
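For a logistic regression fit to individual 0/1 outcomes, the AIC equals the residual deviance plus twice the number of estimated coefficients. A short sketch, using the numbers from the summary above, shows how the reported value arises. The variable names are only for illustration.

```r
# Values copied from summary(QualityLog).
residual_deviance <- 89.127
n_coefficients <- 3  # intercept, OfficeVisits, Narcotics

# AIC penalizes the fit (deviance) by two units per estimated coefficient.
aic <- residual_deviance + 2 * n_coefficients
print(aic)
```

This reproduces the AIC of 95.127 printed in the model summary.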
predictTrain = predict(QualityLog, type="response")
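predictTrain holds the predicted probability of poor care for each patient in the training set. Because no new data is passed to predict, the predictions are made on the data used to fit the model.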
summary(predictTrain)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460
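The tapply function computes the mean predicted probability for each actual outcome of PoorCare. The mean is higher for the actual poor care cases (about 0.44) than for the good care cases (about 0.19), which is what we want.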
tapply(predictTrain, qualityTrain$PoorCare, mean)
## 0 1
## 0.1894512 0.4392246
table(qualityTrain$PoorCare, predictTrain > 0.5)
##
## FALSE TRUE
## 0 70 4
## 1 15 10
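The threshold value t is often selected based on which errors are better to make. Patients with a predicted probability above t are classified as poor care. If we pick a large threshold value, the model rarely predicts poor care, so it makes few false positive errors and specificity increases, but it misses more of the actual poor care cases. If we pick a low threshold value, the model predicts poor care more often, so it catches more of the actual poor care cases and sensitivity increases, but it makes more false positive errors.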
Where \[\text{Sensitivity} = \frac{TP}{TP+FN}\] and \[\text{Specificity} = \frac{TN}{TN+FP}\]
A model with a higher threshold will have a lower sensitivity and a higher specificity.
A model with a lower threshold will have a higher sensitivity and a lower specificity.
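The two definitions can be sketched as a small base R function. The helper name confusion_rates and the toy vectors actual and prob are hypothetical and are not taken from the quality data.

```r
# Hypothetical helper: sensitivity and specificity at a given threshold.
confusion_rates <- function(actual, prob, threshold) {
  predicted <- prob > threshold
  tp <- sum(actual == 1 & predicted)   # true positives
  fn <- sum(actual == 1 & !predicted)  # false negatives
  tn <- sum(actual == 0 & !predicted)  # true negatives
  fp <- sum(actual == 0 & predicted)   # false positives
  c(sensitivity = tp / (tp + fn), specificity = tn / (tn + fp))
}

# Toy data for illustration only.
actual <- c(0, 0, 0, 1, 1, 0, 1, 0)
prob <- c(0.10, 0.40, 0.60, 0.80, 0.30, 0.20, 0.55, 0.05)

rates_high <- confusion_rates(actual, prob, 0.5)
rates_low <- confusion_rates(actual, prob, 0.25)
print(rbind(threshold_0.50 = rates_high, threshold_0.25 = rates_low))
```

On this toy data, lowering the threshold raises sensitivity and lowers specificity, the same trade-off described above.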
10/25
## [1] 0.4
70/74
## [1] 0.9459459
table(qualityTrain$PoorCare, predictTrain > 0.7)
##
## FALSE TRUE
## 0 73 1
## 1 17 8
8/25
## [1] 0.32
73/74
## [1] 0.9864865
Here we increased the threshold value from 0.5 to 0.7, and we can see that our specificity increased (from about 0.95 to about 0.99) while our sensitivity decreased (from 0.40 to 0.32).
table(qualityTrain$PoorCare, predictTrain > 0.2)
##
## FALSE TRUE
## 0 54 20
## 1 9 16
16/25
## [1] 0.64
54/74
## [1] 0.7297297
Here we decreased the threshold value to 0.2, and we can see that our sensitivity increased (to 0.64) while our specificity decreased (to about 0.73).
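To compare the three thresholds side by side, the following sketch collects the counts from the three confusion matrices and adds overall accuracy. The data frame name results is only for illustration, and the training-set baseline of 74/99 is derived from the row totals of those tables.

```r
# Confusion-matrix counts copied from the three tables above (training set).
results <- data.frame(
  threshold = c(0.2, 0.5, 0.7),
  TN = c(54, 70, 73),
  FP = c(20, 4, 1),
  FN = c(9, 15, 17),
  TP = c(16, 10, 8)
)

results$sensitivity <- results$TP / (results$TP + results$FN)
results$specificity <- results$TN / (results$TN + results$FP)
results$accuracy <- (results$TP + results$TN) /
  (results$TN + results$FP + results$FN + results$TP)

# Baseline on the training set: always predict good care (74 of 99 patients).
baseline_train <- 74 / 99

print(round(results, 3))
print(round(baseline_train, 3))
```

With these counts, the thresholds of 0.5 and 0.7 beat the training-set baseline on accuracy, while the threshold of 0.2 falls below it in exchange for catching more of the poor care cases.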
library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)
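The prediction function takes the predicted probabilities and the true outcomes. The performance function then takes that prediction object and the two measures to plot: “tpr”, the true positive rate, goes on the y-axis, and “fpr”, the false positive rate, goes on the x-axis.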
ROCRperf = performance(ROCRpred, "tpr", "fpr")
plot(ROCRperf)
plot(ROCRperf, colorize=TRUE)
plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))
We can now see the ROC curve with the threshold values added. Using this curve, we can determine which threshold value we want to use depending on our preferences as a decision-maker.
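Each labeled point on the curve is a (false positive rate, true positive rate) pair at one cutoff. Here is a minimal base R sketch of that calculation on hypothetical toy data. It is not ROCR's implementation, and the names actual, prob, and roc_points are only for illustration.

```r
# Toy labels and predicted probabilities, for illustration only.
actual <- c(0, 0, 0, 1, 1, 0, 1, 0)
prob <- c(0.10, 0.40, 0.60, 0.80, 0.30, 0.20, 0.55, 0.05)

# Same cutoffs as the labels printed on the plot above.
cutoffs <- seq(0, 1, by = 0.1)

# For each cutoff: share of actual positives flagged (tpr)
# and share of actual negatives flagged (fpr).
roc_points <- data.frame(
  cutoff = cutoffs,
  tpr = sapply(cutoffs, function(t) mean(prob[actual == 1] > t)),
  fpr = sapply(cutoffs, function(t) mean(prob[actual == 0] > t))
)
print(roc_points)
```

A cutoff of 0 flags every patient, which gives the top-right corner of the curve, and a cutoff of 1 flags no one, which gives the bottom-left corner. The choice in between trades false positives against missed poor care cases.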