The Quality Dataset

The quality dataset records, for each patient, whether he or she received good or poor health care, along with other details about the patient's visits, prescriptions, and claims.

Now let's load the dataset into the R console using the read.csv function.

quality = read.csv("C:\\Users\\aman96\\Desktop\\the analytics edge\\unit 3\\quality.csv")

Exploratory Data Analysis

str(quality)
## 'data.frame':    131 obs. of  14 variables:
##  $ MemberID            : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ InpatientDays       : int  0 1 0 0 8 2 16 2 2 4 ...
##  $ ERVisits            : int  0 1 0 1 2 0 1 0 1 2 ...
##  $ OfficeVisits        : int  18 6 5 19 19 9 8 8 4 0 ...
##  $ Narcotics           : int  1 1 3 0 3 2 1 0 3 2 ...
##  $ DaysSinceLastERVisit: num  731 411 731 158 449 ...
##  $ Pain                : int  10 0 10 34 10 6 4 5 5 2 ...
##  $ TotalVisits         : int  18 8 5 20 29 11 25 10 7 6 ...
##  $ ProviderCount       : int  21 27 16 14 24 40 19 11 28 21 ...
##  $ MedicalClaims       : int  93 19 27 59 51 53 40 28 20 17 ...
##  $ ClaimLines          : int  222 115 148 242 204 156 261 87 98 66 ...
##  $ StartedOnCombination: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ AcuteDrugGapSmall   : int  0 1 5 0 0 4 0 0 0 0 ...
##  $ PoorCare            : int  0 0 0 0 0 1 0 0 1 0 ...
summary(quality)
##     MemberID     InpatientDays       ERVisits       OfficeVisits  
##  Min.   :  1.0   Min.   : 0.000   Min.   : 0.000   Min.   : 0.00  
##  1st Qu.: 33.5   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 7.00  
##  Median : 66.0   Median : 0.000   Median : 1.000   Median :12.00  
##  Mean   : 66.0   Mean   : 2.718   Mean   : 1.496   Mean   :13.23  
##  3rd Qu.: 98.5   3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.:18.50  
##  Max.   :131.0   Max.   :30.000   Max.   :11.000   Max.   :46.00  
##    Narcotics      DaysSinceLastERVisit      Pain         TotalVisits   
##  Min.   : 0.000   Min.   :  6.0        Min.   :  0.00   Min.   : 0.00  
##  1st Qu.: 0.000   1st Qu.:207.0        1st Qu.:  1.00   1st Qu.: 8.00  
##  Median : 1.000   Median :641.0        Median :  8.00   Median :15.00  
##  Mean   : 4.573   Mean   :480.6        Mean   : 15.56   Mean   :17.44  
##  3rd Qu.: 3.000   3rd Qu.:731.0        3rd Qu.: 23.00   3rd Qu.:22.50  
##  Max.   :59.000   Max.   :731.0        Max.   :104.00   Max.   :69.00  
##  ProviderCount   MedicalClaims      ClaimLines    StartedOnCombination
##  Min.   : 5.00   Min.   : 11.00   Min.   : 20.0   Mode :logical       
##  1st Qu.:15.00   1st Qu.: 25.50   1st Qu.: 83.5   FALSE:125           
##  Median :20.00   Median : 37.00   Median :120.0   TRUE :6             
##  Mean   :23.98   Mean   : 43.24   Mean   :142.9   NA's :0             
##  3rd Qu.:30.00   3rd Qu.: 49.50   3rd Qu.:185.0                       
##  Max.   :82.00   Max.   :194.00   Max.   :577.0                       
##  AcuteDrugGapSmall    PoorCare     
##  Min.   : 0.000    Min.   :0.0000  
##  1st Qu.: 0.000    1st Qu.:0.0000  
##  Median : 1.000    Median :0.0000  
##  Mean   : 2.695    Mean   :0.2519  
##  3rd Qu.: 3.000    3rd Qu.:0.5000  
##  Max.   :71.000    Max.   :1.0000

We have 131 observations, one for each of the patients in our data set, and 14 different variables. MemberID is just an identifier, and PoorCare is the outcome (dependent) variable. The 12 variables from InpatientDays to AcuteDrugGapSmall are the independent variables. Of these, we'll be using the number of office visits and the number of prescriptions for narcotics that the patient had.

We use the table function to count the outcomes of the variable PoorCare.

table(quality$PoorCare)
## 
##  0  1 
## 98 33

We can see that 98 out of the 131 patients in our data set received good care (labeled 0), and 33 patients received poor care (labeled 1). In a classification problem, the baseline model always predicts the most frequent outcome, so here the baseline model accuracy is 98/131, or around 75 percent.

Now we will randomly split our data into two data sets, a training set and a test set, using the caTools package.

Loading the caTools package
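The baseline calculation above can be reproduced in a couple of lines. This is a minimal sketch that hard-codes the counts reported by table(quality$PoorCare); the names counts and baseline_accuracy are illustrative only.

```r
# Counts copied from the table() output above: 98 good care (0), 33 poor care (1)
counts <- c(good = 98, poor = 33)

# The baseline model always predicts the most frequent outcome,
# so its accuracy is the share of that outcome in the data
baseline_accuracy <- max(counts) / sum(counts)
round(baseline_accuracy, 3)
## [1] 0.748
```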

library(caTools)
## Warning: package 'caTools' was built under R version 3.3.3

Randomly split data

set.seed(88)
split = sample.split(quality$PoorCare, SplitRatio = 0.75)
split
##   [1]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
##  [12] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [23]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE
##  [34]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [45] FALSE  TRUE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE
##  [56]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE
##  [67]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
##  [78]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
##  [89]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
## [100]  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE FALSE
## [111] FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
## [122]  TRUE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE

Creating training and testing sets

qualityTrain = subset(quality, split == TRUE)
qualityTest = subset(quality, split == FALSE)
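sample.split takes the outcome variable as its first argument so that both sets keep roughly the same ratio of outcomes. The sketch below imitates that idea in base R. It is not the caTools implementation: labels is a hypothetical stand-in for quality$PoorCare built from the counts reported earlier, and stratified_flags is a helper invented for this illustration.

```r
# Hypothetical stand-in for quality$PoorCare: 98 zeros and 33 ones
labels <- c(rep(0, 98), rep(1, 33))

# Illustrative helper (not the caTools implementation): within each
# outcome class, mark about `ratio` of the rows as training rows
stratified_flags <- function(y, ratio) {
  flags <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(ratio * length(idx))
    flags[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  flags
}

set.seed(88)
in_train <- stratified_flags(labels, ratio = 0.75)

# Outcome counts in the training rows and in the test rows
table(labels[in_train])
table(labels[!in_train])
```

With these counts the helper marks 74 good-care and 25 poor-care rows as training rows, which are the class sizes of qualityTrain. Which individual rows are chosen will differ from sample.split, even with the same seed.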

Finding the correlation between all numeric variables

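Making a new dataset without MemberID (column 1, an identifier) and StartedOnCombination (column 12, a logical variable), so that only numeric variables of interest remain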
numquality<-quality[,c(-1, -12)]
summary(numquality)
##  InpatientDays       ERVisits       OfficeVisits     Narcotics     
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 7.00   1st Qu.: 0.000  
##  Median : 0.000   Median : 1.000   Median :12.00   Median : 1.000  
##  Mean   : 2.718   Mean   : 1.496   Mean   :13.23   Mean   : 4.573  
##  3rd Qu.: 3.000   3rd Qu.: 2.000   3rd Qu.:18.50   3rd Qu.: 3.000  
##  Max.   :30.000   Max.   :11.000   Max.   :46.00   Max.   :59.000  
##  DaysSinceLastERVisit      Pain         TotalVisits    ProviderCount  
##  Min.   :  6.0        Min.   :  0.00   Min.   : 0.00   Min.   : 5.00  
##  1st Qu.:207.0        1st Qu.:  1.00   1st Qu.: 8.00   1st Qu.:15.00  
##  Median :641.0        Median :  8.00   Median :15.00   Median :20.00  
##  Mean   :480.6        Mean   : 15.56   Mean   :17.44   Mean   :23.98  
##  3rd Qu.:731.0        3rd Qu.: 23.00   3rd Qu.:22.50   3rd Qu.:30.00  
##  Max.   :731.0        Max.   :104.00   Max.   :69.00   Max.   :82.00  
##  MedicalClaims      ClaimLines    AcuteDrugGapSmall    PoorCare     
##  Min.   : 11.00   Min.   : 20.0   Min.   : 0.000    Min.   :0.0000  
##  1st Qu.: 25.50   1st Qu.: 83.5   1st Qu.: 0.000    1st Qu.:0.0000  
##  Median : 37.00   Median :120.0   Median : 1.000    Median :0.0000  
##  Mean   : 43.24   Mean   :142.9   Mean   : 2.695    Mean   :0.2519  
##  3rd Qu.: 49.50   3rd Qu.:185.0   3rd Qu.: 3.000    3rd Qu.:0.5000  
##  Max.   :194.00   Max.   :577.0   Max.   :71.000    Max.   :1.0000
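Dropping columns by position works here, but the positions have to be read off the str() output by hand. As a hedged alternative, the sketch below selects numeric columns by type and then removes the identifier by name. The data frame toy is a hypothetical three-row stand-in for quality, not the real data.

```r
# Hypothetical toy data frame with the same mix of column types as quality
toy <- data.frame(
  MemberID = 1:3,
  OfficeVisits = c(18L, 6L, 5L),
  DaysSinceLastERVisit = c(731, 411, 731),
  StartedOnCombination = c(FALSE, TRUE, FALSE)
)

# Keep columns that are numeric (integer or double); logical columns are dropped
is_num <- sapply(toy, is.numeric)
num_toy <- toy[, is_num, drop = FALSE]

# MemberID is numeric but only an identifier, so remove it by name
num_toy <- num_toy[, setdiff(names(num_toy), "MemberID"), drop = FALSE]
names(num_toy)
```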
Finding the correlation between them
cor(numquality)
##                      InpatientDays     ERVisits OfficeVisits    Narcotics
## InpatientDays          1.000000000  0.440087299    0.1759011 -0.093768932
## ERVisits               0.440087299  1.000000000    0.3085257 -0.003731653
## OfficeVisits           0.175901119  0.308525685    1.0000000  0.275759302
## Narcotics             -0.093768932 -0.003731653    0.2757593  1.000000000
## DaysSinceLastERVisit  -0.290121046 -0.735246070   -0.1283879  0.065054809
## Pain                   0.304058069  0.546779466    0.3529678  0.106860359
## TotalVisits            0.622035618  0.586438628    0.8653868  0.163992449
## ProviderCount          0.244023304  0.457429030    0.3654691  0.293478180
## MedicalClaims          0.286377975  0.355318952    0.4985134  0.220540818
## ClaimLines             0.386951074  0.542000500    0.4249532  0.185798702
## AcuteDrugGapSmall     -0.001144346 -0.072749681    0.2007348  0.710888560
## PoorCare               0.080725715  0.135400778    0.3295118  0.447236064
##                      DaysSinceLastERVisit        Pain TotalVisits
## InpatientDays                 -0.29012105  0.30405807   0.6220356
## ERVisits                      -0.73524607  0.54677947   0.5864386
## OfficeVisits                  -0.12838788  0.35296784   0.8653868
## Narcotics                      0.06505481  0.10686036   0.1639924
## DaysSinceLastERVisit           1.00000000 -0.35878080  -0.3446395
## Pain                          -0.35878080  1.00000000   0.4829592
## TotalVisits                   -0.34463954  0.48295915   1.0000000
## ProviderCount                 -0.29770084  0.40509514   0.4515455
## MedicalClaims                 -0.19811441  0.29669718   0.5493080
## ClaimLines                    -0.41279666  0.46471274   0.5696186
## AcuteDrugGapSmall              0.13108501 -0.03149016   0.1348611
## PoorCare                      -0.10798298  0.09216828   0.3005403
##                      ProviderCount MedicalClaims  ClaimLines
## InpatientDays            0.2440233     0.2863780  0.38695107
## ERVisits                 0.4574290     0.3553190  0.54200050
## OfficeVisits             0.3654691     0.4985134  0.42495323
## Narcotics                0.2934782     0.2205408  0.18579870
## DaysSinceLastERVisit    -0.2977008    -0.1981144 -0.41279666
## Pain                     0.4050951     0.2966972  0.46471274
## TotalVisits              0.4515455     0.5493080  0.56961864
## ProviderCount            1.0000000     0.5170023  0.60535725
## MedicalClaims            0.5170023     1.0000000  0.81393452
## ClaimLines               0.6053573     0.8139345  1.00000000
## AcuteDrugGapSmall        0.1412836     0.0856369 -0.01322946
## PoorCare                 0.2201661     0.1673987  0.12917477
##                      AcuteDrugGapSmall    PoorCare
## InpatientDays             -0.001144346  0.08072572
## ERVisits                  -0.072749681  0.13540078
## OfficeVisits               0.200734789  0.32951181
## Narcotics                  0.710888560  0.44723606
## DaysSinceLastERVisit       0.131085008 -0.10798298
## Pain                      -0.031490160  0.09216828
## TotalVisits                0.134861075  0.30054033
## ProviderCount              0.141283618  0.22016613
## MedicalClaims              0.085636905  0.16739875
## ClaimLines                -0.013229464  0.12917477
## AcuteDrugGapSmall          1.000000000  0.34143466
## PoorCare                   0.341434658  1.00000000

Looking at the correlation matrix, PoorCare is most strongly correlated with Narcotics (0.45), AcuteDrugGapSmall (0.34), and OfficeVisits (0.33). AcuteDrugGapSmall, however, is itself highly correlated with Narcotics (0.71), so including both could cause multicollinearity. We therefore drop AcuteDrugGapSmall, since Narcotics has the stronger relationship with PoorCare, and keep OfficeVisits and Narcotics for prediction.

So we will build a logistic regression model using PoorCare as the dependent variable and OfficeVisits and Narcotics as the independent variables. The argument family=binomial tells the glm function to build a logistic regression model.
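The multicollinearity check described above can be sketched in code. The matrix below is a hand-copied, rounded excerpt of the cor(numquality) output for four variables, and the cutoff of 0.7 is an assumption chosen for illustration, not a rule from the source.

```r
# Rounded values copied from the cor(numquality) output above
vars <- c("OfficeVisits", "Narcotics", "AcuteDrugGapSmall", "PoorCare")
cm <- matrix(c(
  1.0000, 0.2758, 0.2007, 0.3295,
  0.2758, 1.0000, 0.7109, 0.4472,
  0.2007, 0.7109, 1.0000, 0.3414,
  0.3295, 0.4472, 0.3414, 1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Hypothetical cutoff for "highly correlated"
cutoff <- 0.7

# Look only above the diagonal so each pair is reported once
high <- which(abs(cm) > cutoff & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high]
)
pairs
```

In this excerpt the only pair above the cutoff is Narcotics and AcuteDrugGapSmall, which is the pair the text singles out.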

Logistic Regression Model

QualityLog = glm(PoorCare ~ OfficeVisits + Narcotics, data=qualityTrain, family=binomial)
summary(QualityLog)
## 
## Call:
## glm(formula = PoorCare ~ OfficeVisits + Narcotics, family = binomial, 
##     data = qualityTrain)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.06303  -0.63155  -0.50503  -0.09689   2.16686  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.64613    0.52357  -5.054 4.33e-07 ***
## OfficeVisits  0.08212    0.03055   2.688  0.00718 ** 
## Narcotics     0.07630    0.03205   2.381  0.01728 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 111.888  on 98  degrees of freedom
## Residual deviance:  89.127  on 96  degrees of freedom
## AIC: 95.127
## 
## Number of Fisher Scoring iterations: 4

Both coefficients are positive, so more office visits and more narcotics prescriptions are associated with a higher predicted probability of poor care. Both variables also have at least one star, meaning that they're significant at the 5% level in our model.
The AIC value is a measure of the quality of the model and is like adjusted R-squared in that it accounts for the number of variables used compared to the number of observations. Unfortunately, it can only be compared between models fit on the same data set. The preferred model is the one with the minimum AIC.

Now let's make predictions on the training set.
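As a quick check on where the reported AIC comes from, the sketch below recomputes it from the summary output. It relies on a property of models with a 0/1 outcome, where the residual deviance equals minus twice the log-likelihood; the variable names are illustrative.

```r
# Values copied from the summary(QualityLog) output above
residual_deviance <- 89.127
n_coefficients <- 3  # intercept, OfficeVisits, Narcotics

# For a 0/1 outcome the residual deviance equals -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of estimated coefficients)
aic <- residual_deviance + 2 * n_coefficients
aic
## [1] 95.127
```

The penalty of 2 per coefficient is what lets AIC account for the number of variables used.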

predictTrain = predict(QualityLog, type="response")

predictTrain holds the predicted probability of poor care for each patient in the training set.
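To make the link between the fitted coefficients and a predicted probability concrete, here is a small worked sketch. The coefficients are copied from the summary(QualityLog) output; the patient (10 office visits, 2 narcotics prescriptions) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from the summary(QualityLog) output
b0 <- -2.64613          # intercept
b_office <- 0.08212     # OfficeVisits
b_narcotics <- 0.07630  # Narcotics

# Hypothetical patient (values chosen only for illustration)
office_visits <- 10
narcotics <- 2

# Linear predictor (log-odds), then the logistic transform to a probability
log_odds <- b0 + b_office * office_visits + b_narcotics * narcotics
prob_poor_care <- plogis(log_odds)  # same as 1 / (1 + exp(-log_odds))
round(prob_poor_care, 2)
## [1] 0.16
```

Read on the odds scale, each additional office visit multiplies the odds of poor care by exp(0.08212), about 1.09, holding Narcotics fixed.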

summary(predictTrain)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06623 0.11910 0.15970 0.25250 0.26760 0.98460

The tapply function computes the mean predicted probability for each actual outcome of PoorCare.

tapply(predictTrain, qualityTrain$PoorCare, mean)
##         0         1 
## 0.1894512 0.4392246

Confusion matrix for threshold of 0.5

table(qualityTrain$PoorCare, predictTrain > 0.5)
##    
##     FALSE TRUE
##   0    70    4
##   1    15   10

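The threshold value t is often selected based on which type of error is more acceptable. A patient is predicted to be a poor-care case (TRUE) when the predicted probability exceeds t. With a large threshold, the model rarely predicts poor care, so it makes few false positive errors and specificity increases. With a small threshold, the model predicts poor care more often, so it catches more of the actual poor-care cases and sensitivity increases.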

Where
\[ \text{Sensitivity} = \frac{TP}{TP + FN} \]
and
\[ \text{Specificity} = \frac{TN}{TN + FP} \]

A model with a higher threshold will have a lower sensitivity and a higher specificity.
A model with a lower threshold will have a higher sensitivity and a lower specificity.

Sensitivity and specificity

10/25
## [1] 0.4
70/74
## [1] 0.9459459

Confusion matrix for threshold of 0.7

table(qualityTrain$PoorCare, predictTrain > 0.7)
##    
##     FALSE TRUE
##   0    73    1
##   1    17    8

Sensitivity and specificity

8/25
## [1] 0.32
73/74
## [1] 0.9864865

Here we increased the threshold value, and we can clearly see that specificity increased (from about 0.95 to 0.99) while sensitivity fell (from 0.40 to 0.32).

Confusion matrix for threshold of 0.2

table(qualityTrain$PoorCare, predictTrain > 0.2)
##    
##     FALSE TRUE
##   0    54   20
##   1     9   16

Sensitivity and specificity

16/25
## [1] 0.64
54/74
## [1] 0.7297297

Here we decreased the threshold value, and we can clearly see that sensitivity increased (to 0.64) while specificity fell (to about 0.73).

Loading the ROCR package
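The three confusion matrices can be compared side by side. The sketch below hard-codes the counts from the tables above (rows are actual outcomes, columns are predictions) and recomputes both rates. The accuracy column is an addition for comparison and does not appear in the original analysis.

```r
# Confusion-matrix counts copied from the three tables above
results <- data.frame(
  threshold = c(0.2, 0.5, 0.7),
  TN = c(54, 70, 73),  # actual 0, predicted FALSE
  FP = c(20, 4, 1),    # actual 0, predicted TRUE
  FN = c(9, 15, 17),   # actual 1, predicted FALSE
  TP = c(16, 10, 8)    # actual 1, predicted TRUE
)

results$sensitivity <- results$TP / (results$TP + results$FN)
results$specificity <- results$TN / (results$TN + results$FP)
results$accuracy <- (results$TP + results$TN) /
  (results$TP + results$TN + results$FP + results$FN)

round(results[, c("threshold", "sensitivity", "specificity", "accuracy")], 3)
```

Reading down the rows, sensitivity falls and specificity rises as the threshold grows.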

library(ROCR)
## Warning: package 'ROCR' was built under R version 3.3.3
## Loading required package: gplots
## Warning: package 'gplots' was built under R version 3.3.3
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess

Using the prediction function to build a ROCR prediction object from the predicted probabilities and the actual outcomes of PoorCare

ROCRpred = prediction(predictTrain, qualityTrain$PoorCare)

Performance function

It takes the prediction object and the two measures to plot: "tpr", the true positive rate, goes on the y-axis, and "fpr", the false positive rate, goes on the x-axis.

ROCRperf = performance(ROCRpred, "tpr", "fpr")

Plotting the ROC curve

plot(ROCRperf)

Adding colors with the colorize argument

plot(ROCRperf, colorize=TRUE)

Adding threshold labels

plot(ROCRperf, colorize=TRUE, print.cutoffs.at=seq(0,1,by=0.1), text.adj=c(-0.2,1.7))

We can now see the ROC curve with threshold values added. Using this curve, we can choose the threshold that gives the trade-off between true positive rate and false positive rate that we prefer as a decision-maker.
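To show what the points on such a curve represent, the sketch below computes the true and false positive rates over a grid of thresholds in base R and then applies one possible decision rule. Everything here is hypothetical: actual and prob are toy vectors rather than the quality data, roc_point is a helper written for this illustration, and the 75% target is an assumed preference.

```r
# Hypothetical toy data: actual outcomes and predicted probabilities
actual <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1)
prob   <- c(0.05, 0.12, 0.15, 0.32, 0.45, 0.62, 0.25, 0.55, 0.72, 0.91)

# Illustrative helper: true and false positive rates at one threshold
roc_point <- function(threshold) {
  predicted <- prob > threshold
  c(threshold = threshold,
    tpr = sum(predicted & actual == 1) / sum(actual == 1),
    fpr = sum(predicted & actual == 0) / sum(actual == 0))
}

# One row per threshold, matching the labels printed on the plot
thresholds <- seq(0, 1, by = 0.1)
roc <- as.data.frame(t(sapply(thresholds, roc_point)))
roc

# Example decision rule (an assumption, not from the source): the highest
# threshold that still catches at least 75% of the poor-care cases
max(roc$threshold[roc$tpr >= 0.75])
```

A decision-maker who cares more about avoiding false alarms would instead put a cap on fpr and pick the lowest threshold that satisfies it.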

Conclusions