Theoretical Background
According to Massaro, Magaletti, Cosoli, Giardinelli, & Leogrande
(2022), the presence of diabetes is positively correlated with age, BMI,
number of pregnancies, plasma glucose concentration, and family history,
and negatively correlated with blood pressure. Based on the findings
of Alkhatib, Sindian, Funjan, & Alshdaifat (2020), skin thickness
may also be a new predictor of the progression of diabetes in female
patients.
Since we have theoretical backing for all the candidate explanatory
variables, we include them all in the model from the start.
fit <- glm(diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + age,
           family = binomial,
           data = mydata)
Checking the Assumptions
We also have to check the ASSUMPTIONS:
- sufficiently large sample (we have 532 units)
- independence of observations
- Y is dichotomous, the explanatory variables are numeric (or dummy
variables)
- no outliers
- no units with high impact
- no overly strong multicollinearity
- linearity between the natural logarithm of the odds(Y = 1) and the set
of explanatory variables (logit transformation); the model equation is
written out below
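For reference, the linearity assumption concerns the model in its logit
form, written out here for our set of predictors:
ln(odds(Y = 1)) = β0 + β1·pregnant + β2·glucose + β3·pressure + β4·triceps + β5·mass + β6·pedigree + β7·age
Each coefficient thus acts linearly on the log-odds rather than on the
probability itself, which is what we check in the LINEARITY subsection
below.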
OUTLIERS AND UNITS WITH HIGH IMPACT
mydata$StdResid <- rstandard(fit)
mydata$Cook <- cooks.distance(fit)
hist(mydata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")

hist(mydata$Cook,
     main = "Histogram of Cook's distance",
     ylab = "Frequency",
     xlab = "Cook's distance")

head(mydata[order(-mydata$Cook), c("ID", "Cook")])
## ID Cook
## 229 154 0.07242482
## 488 340 0.04391496
## 745 516 0.03128182
## 674 469 0.02588386
## 126 82 0.02526765
## 460 318 0.02271536
The histograms and the table above show that we do have outliers
(|standardized residual| ≥ 3) as well as units with high impact (here
taken as Cook's distance ≥ 0.03). We remove them below.
mydata <- mydata[abs(mydata$StdResid) < 3, ]
mydata <- mydata[mydata$Cook < 0.03, ]
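A quick check of how many units remain after this filtering (we started
with 532):
nrow(mydata)
## [1] 529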
MULTICOLLINEARITY
fit <- glm(diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + age,
           family = binomial,
           data = mydata)
suppressPackageStartupMessages(library(car))
## Warning: package 'car' was built under R version 4.2.2
vif(fit)
## pregnant glucose pressure triceps mass pedigree age
## 1.711166 1.068159 1.235087 1.606322 1.747471 1.040704 1.828524
mean(vif(fit))
## [1] 1.46249
All VIFs are below 5 and the average VIF is relatively close to 1, so we
do not have a problem with overly strong multicollinearity.
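As a sanity check, the VIF of a single predictor can also be computed by
hand as 1/(1 - R^2), where R^2 comes from regressing that predictor on
the remaining ones. A minimal sketch (the auxiliary model aux is ours;
for a logistic model this linear-regression version is only an
approximation of what vif() reports):
aux <- lm(age ~ pregnant + glucose + pressure + triceps + mass + pedigree,
          data = mydata)
1 / (1 - summary(aux)$r.squared)  # should be close to vif(fit)["age"] above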
LINEARITY
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(rstatix))
## Warning: package 'rstatix' was built under R version 4.2.2
suppressPackageStartupMessages(library(tidyr))    # needed for gather()
suppressPackageStartupMessages(library(ggplot2))  # needed for ggplot()
probabilities <- predict(fit, type = "response")
Mydata <- mydata[c(2:8)]
Mydata <- Mydata %>%
  mutate(logit = log(probabilities/(1 - probabilities))) %>%
  gather(key = "predictors", value = "predictor.value", -logit) %>%
  convert_as_factor(predictors)
ggplot(Mydata, aes(logit, predictor.value)) +
  geom_point(size = 0.5, alpha = 0.5) +
  theme_bw() +
  facet_wrap(~predictors, scales = "free_y")

While most predictors look relatively fine, age, mass, and pedigree
suggest a possible problem with linearity and might therefore need
transformations. Nonetheless, we continue with the untransformed
variables here.
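If we did want to address this, one option would be to refit the model
with transformed predictors; a minimal sketch (illustrative only, not
applied in the rest of this analysis):
fit_log <- glm(diabetes ~ pregnant + glucose + pressure + triceps + mass +
                 pedigree + log(age),
               family = binomial,
               data = mydata)
The logit plot for log(age) could then be re-checked in the same way as
above.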
Descriptive Statistics of the Variables Used in the Final Model
suppressPackageStartupMessages(library(pastecs))
round(stat.desc(mydata[ ,c(2:8)]), 2)
## pregnant glucose pressure triceps mass pedigree age
## nbr.val 529.00 529.00 529.00 529.00 529.00 529.00 529.00
## nbr.null 76.00 0.00 0.00 0.00 0.00 0.00 0.00
## nbr.na 0.00 0.00 0.00 0.00 0.00 0.00 0.00
## min 0.00 56.00 24.00 7.00 18.20 0.09 21.00
## max 17.00 199.00 110.00 99.00 67.10 2.42 81.00
## range 17.00 143.00 86.00 92.00 48.90 2.34 60.00
## sum 1854.00 63865.00 37805.00 15417.00 17373.80 262.92 16691.00
## median 2.00 115.00 72.00 29.00 32.80 0.42 28.00
## mean 3.50 120.73 71.47 29.14 32.84 0.50 31.55
## SE.mean 0.14 1.34 0.54 0.46 0.30 0.01 0.47
## CI.mean.0.95 0.28 2.63 1.05 0.90 0.59 0.03 0.92
## var 10.84 948.34 151.80 111.07 47.13 0.11 115.04
## std.dev 3.29 30.80 12.32 10.54 6.86 0.33 10.73
## coef.var 0.94 0.26 0.17 0.36 0.21 0.67 0.34
sum(mydata$diabetes == 'pos')
## [1] 177
sum(mydata$diabetes == 'neg')
## [1] 352
We will use 529 units. The minimum of pregnant is 0 and the maximum 17;
glucose ranges from 56mg/dl to 199mg/dl; pressure from 24mm Hg to
110mm Hg; triceps from 7mm to 99mm; mass (BMI) from 18.20kg/m2 to
67.10kg/m2; pedigree from 0.09 to 2.42; and age from 21 to 81 years.
The patients have been pregnant 3.5 times on average, have an average
plasma glucose concentration of 120.73mg/dl, diastolic blood pressure of
71.47mm Hg, triceps skin fold thickness of 29.14mm, and BMI of
32.84kg/m2; they are on average 31.55 years old, and their diabetes
pedigree function is on average 0.50. The medians are very similar to
the means, with the larger differences in pregnant, glucose, and age:
50% of patients have been pregnant 2 times or fewer, have a plasma
glucose concentration of 115.00mg/dl or less, and are 28 years old or
younger (the other 50% have higher values for these variables).
The 95% confidence intervals for the means are relatively narrow for all
variables, the widest being for plasma glucose concentration: between
118.10 and 123.36mg/dl (at the 95% confidence level).
The highest variability according to the coefficient of variation is in
the number of pregnancies.
Out of 529 units in the sample, 177 had a positive test for diabetes
and 352 a negative one.
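As a quick verification, the 95% confidence interval for the mean of
glucose reported by stat.desc can be reproduced from the mean and its
standard error (the object names m and se are ours):
m  <- mean(mydata$glucose)
se <- sd(mydata$glucose) / sqrt(nrow(mydata))
m + c(-1, 1) * qt(0.975, df = nrow(mydata) - 1) * se
# gives approximately 118.10 and 123.36 (120.73 +/- 2.63)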
The Model Description and Explanation
# nothing changed since the last estimation of the model, so we do not re-estimate it here
summary(fit)
##
## Call:
## glm(formula = diabetes ~ pregnant + glucose + pressure + triceps +
## mass + pedigree + age, family = binomial, data = mydata)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9918 -0.6414 -0.3319 0.5717 2.5843
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.503142 1.065403 -9.858 < 2e-16 ***
## pregnant 0.133233 0.045513 2.927 0.003418 **
## glucose 0.038457 0.004480 8.584 < 2e-16 ***
## pressure -0.008756 0.010618 -0.825 0.409554
## triceps 0.004772 0.015176 0.314 0.753193
## mass 0.093886 0.024337 3.858 0.000114 ***
## pedigree 1.772657 0.384204 4.614 3.95e-06 ***
## age 0.028298 0.014483 1.954 0.050712 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 674.35 on 528 degrees of freedom
## Residual deviance: 442.89 on 521 degrees of freedom
## AIC: 458.89
##
## Number of Fisher Scoring iterations: 5
exp(cbind(OR = fit$coefficients, confint.default(fit)))
## OR 2.5 % 97.5 %
## (Intercept) 2.745006e-05 3.401502e-06 2.215214e-04
## pregnant 1.142516e+00 1.045013e+00 1.249116e+00
## glucose 1.039206e+00 1.030120e+00 1.048372e+00
## pressure 9.912819e-01 9.708657e-01 1.012127e+00
## triceps 1.004783e+00 9.753368e-01 1.035119e+00
## mass 1.098434e+00 1.047270e+00 1.152098e+00
## pedigree 5.886476e+00 2.772172e+00 1.249944e+01
## age 1.028702e+00 9.999123e-01 1.058321e+00
We have not found a significant impact of age, diastolic blood pressure,
or triceps skin fold thickness on whether the diabetes test is positive
or negative (their p-values are above 5%, although for age the p-value
is only 5.07%).
Based on our sample data, if a patient goes through another pregnancy,
the odds that they get a positive diabetes test are 1.14 times the
initial odds, c.p. (p-value = 0.003).
Based on our sample data, if a patient's plasma glucose concentration
increases by 1mg/dl, the odds that they get a positive diabetes test are
1.04 times the initial odds, c.p. (p-value < 0.001).
Based on our sample data, if a patient's BMI increases by 1kg/m2, the
odds that they get a positive diabetes test are 1.10 times the initial
odds, c.p. (p-value < 0.001).
Based on our sample data, if a patient's diabetes pedigree function
increases by 1 (the function is unitless), the odds that they get a
positive diabetes test are 5.89 times the initial odds, c.p.
(p-value < 0.001).
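Each odds ratio above is simply the exponentiated model coefficient, and
a k-unit change multiplies the odds by exp(k·β); a small illustration:
exp(coef(fit)["pregnant"])     # ~1.14, the odds ratio for one additional pregnancy
exp(5 * coef(fit)["glucose"])  # the odds multiplier for a 5mg/dl increase in glucose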
We also see that the deviance of our model is lower than that of the
simplest model (the one with no explanatory variables), and the
difference (674.35 - 442.89 = 231.46) is substantial.
fit1 <- glm(diabetes ~ 1,
            family = binomial,
            data = mydata)
anova(fit1, fit, test = "Chi")
## Analysis of Deviance Table
##
## Model 1: diabetes ~ 1
## Model 2: diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree +
## age
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 528 674.35
## 2 521 442.89 7 231.46 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
H0: The simple model (Model 1) is better.
H1: The complex model (Model 2) is better.
As expected, we can reject the null hypothesis, meaning the complex
model is indeed better than the simplest one (p-value < 0.001).
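The same conclusion can be reached by hand: under H0, the drop in
deviance follows a chi-squared distribution with 7 degrees of freedom:
pchisq(674.35 - 442.89, df = 7, lower.tail = FALSE)
# essentially zero (< 2.2e-16), matching the table above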
We can also save in our table what the estimated probabilities of the
diabetes test being positive are.
mydata$EstProb <- fitted(fit)
head(mydata)
## ID pregnant glucose pressure triceps mass pedigree age diabetes StdResid Cook EstProb
## 1 1 6 148 72 35 33.6 0.627 50 pos 0.7934175 5.525063e-04 0.76947241
## 2 2 1 85 66 29 26.6 0.351 31 neg -0.2742754 1.655924e-05 0.02809089
## 4 3 1 89 66 23 28.1 0.167 21 neg -0.2388853 9.095734e-06 0.02009590
## 5 4 0 137 40 35 43.1 2.288 33 pos 0.3782762 3.677897e-04 0.97387404
## 7 5 3 78 50 32 31.0 0.248 26 pos 2.4854817 1.245152e-02 0.03546473
## 9 6 2 197 70 45 30.5 0.158 53 pos 0.6356972 1.152790e-03 0.82983977
Here we can see that the patient with ID 5 would be incorrectly
classified by our model: the estimated probability is only 0.035,
although the actual test was positive. Let us look at how many units
were incorrectly classified in total.
mydata$Classification <- ifelse(test = mydata$EstProb < 0.50,
                                yes = "neg",
                                no = "pos")
mydata$Classification <- factor(mydata$Classification,
                                levels = c("neg", "pos"),
                                labels = c("neg", "pos"))
ClassificationTable <- table(mydata$diabetes, mydata$Classification)
ClassificationTable
##
## neg pos
## neg 316 36
## pos 69 108
In this table, rows are the observed outcomes and columns the predicted
ones. We see that 105 patients (36 + 69) were wrongly classified by our
model, and 424 (316 + 108) correctly.
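From the same table we can also compute sensitivity and specificity (an
additional breakdown; the variable names are ours):
sensitivity <- ClassificationTable["pos", "pos"] / sum(ClassificationTable["pos", ])
specificity <- ClassificationTable["neg", "neg"] / sum(ClassificationTable["neg", ])
sensitivity  # share of actual positives classified as positive (108/177)
specificity  # share of actual negatives classified as negative (316/352)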
# pseudo R2 (share of correctly classified units) for our model
Pseudo_R2_fit <- (ClassificationTable[1, 1] + ClassificationTable[2, 2]) / nrow(mydata)
Pseudo_R2_fit
## [1] 0.8015123
# the probability of a random patient having a positive diabetes test (without knowing anything about them)
sum(mydata$diabetes == 'pos')/nrow(mydata)
## [1] 0.3345936
# pseudo R2 for the simplest model (it would classify everyone as 'neg', the more frequent outcome)
1 - sum(mydata$diabetes == 'pos')/nrow(mydata)
## [1] 0.6654064
We can see that with all the explanatory variables we substantially
improved the percentage of correctly predicted diabetes test results
(from 66.54% to 80.15%). Of course, it could potentially be improved
further with additional explanatory variables.