Tina Pivk, 18. 2. 2023

Pima Indians Diabetes Database

library(mlbench)

## Warning: package 'mlbench' was built under R version 4.2.2

data(PimaIndiansDiabetes2)
mydata <- PimaIndiansDiabetes2[, -5]

head(mydata)

##   pregnant glucose pressure triceps mass pedigree age diabetes
## 1        6     148       72      35 33.6    0.627  50      pos
## 2        1      85       66      29 26.6    0.351  31      neg
## 3        8     183       64      NA 23.3    0.672  32      pos
## 4        1      89       66      23 28.1    0.167  21      neg
## 5        0     137       40      35 43.1    2.288  33      pos
## 6        5     116       74      NA 25.6    0.201  30      neg

Description:

unit: one Indian patient (the time frame not provided)
sample size: 768 (529 are appropriate for the actual analysis)

Variables:

pregnant: Number of times pregnant
glucose: Plasma glucose concentration (glucose tolerance test) (mg/dl)
pressure: Diastolic blood pressure (mm Hg)
triceps: Triceps skin fold thickness (mm)
mass: Body mass index (kg/(m2))
pedigree: Diabetes pedigree function that indicates the likelihood of diabetes based on family history (could not find the units)
age: Age (years)
diabetes: Test for diabetes (positive or negative)

Data source:

PimaIndiansDiabetes from the R database, package mlbench, where the following source is stated:

Original owners: National Institute of Diabetes and Digestive and Kidney Diseases

Donor of database: Vincent Sigillito

These data have been taken from the UCI Repository Of Machine Learning Databases at

ftp://ftp.ics.uci.edu/pub/machine-learning-databases

http://www.ics.uci.edu/~mlearn/MLRepository.html

and were converted to R format by Friedrich Leisch.

Main Goal of the Analysis

The main goal of the analysis carried out in homework 4 is to see which of the variables (number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index, family history, and age) can help us predict whether the patient has diabetes or not, and how big of an influence they have. For this we will use binary logistic regression since our dependent variable (test for diabetes: positive/negative) is binary.

Data Manipulation

When showing the data above we have seen that diabetes is already a factor and that we have some NA, so let us remove them.

library(tidyr)
mydata <- drop_na(mydata)
colSums(is.na(mydata))

## pregnant  glucose pressure  triceps     mass pedigree      age diabetes 
##        0        0        0        0        0        0        0        0

mydata$ID <- seq.int(nrow(mydata))
mydata <- mydata[,c(9,1,2,3,4,5,6,7,8)]
sapply(mydata[, c(2,3,4,5,6,7,8)], FUN=max)

## pregnant  glucose pressure  triceps     mass pedigree      age 
##    17.00   199.00   110.00    99.00    67.10     2.42    81.00

sapply(mydata[,c(2,3,4,5,6,7,8)], FUN=min)

## pregnant  glucose pressure  triceps     mass pedigree      age 
##    0.000   56.000   24.000    7.000   18.200    0.085   21.000

There are none min/max that would very obviously be impossible, let us just check for pregnant and pressure variable what the distribution is.

hist(mydata$pregnant, 
     main = "Distribution of Number of pregnancies", 
     xlab = "Number of pregnancies",
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 20, by = 1))

hist(mydata$pressure, 
     main = "Distribution of Diastolic blood pressure", 
     xlab = "Diastolic blood pressure",
     ylab = "Frequency", 
     breaks = seq(from = 0, to = 120, by = 10))

For pregnant there is a gap but a small one, making the value 17 still not seem as a definite outlier. With diastolic blood pressure there are no gaps. We will consider all units as appropriate to use in the further analysis.

Binary Logistic Regression Model (Binomial Logit Model) Definition

Theoretical Background

According to Massaro, Magaletti, Cosoli, Giardinelli, & Leogrande (2022) the presence of diabetes is positively correlated with age, BMI, number of pregnencies, plasma glucose concentration, and family history, and negatively correlated with blood pressure. Based on the findings from Alkhatib, Sindian, Funjan, & Alshdaifat (2020) skin thickness may be a new predictor of the progression of diabetes in women patients as well.

Since we have a theoretical background for the variables used as explanatory ones in the model we will put them all in the model from the start.

fit <- glm(diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + age,  
            family = binomial, 
            data = mydata)

Checking the Assumptions

We also have to look at the ASSUMPTIONS:

sufficiently large sample (we have 532 units)
independence of observation
Y is dichotomous, the explanatory variables are numeric (or dummy variables)
no outliers
no units with high impact
not too strong multicolinierity
linearity between the natural logarithm of the odds(𝑌=1) and the set of explanatory variables (logit transformation)

OUTLIERS AND UNITS WITH HIGH IMPACT

mydata$StdResid <- rstandard(fit)
mydata$Cook <- cooks.distance(fit)

hist(mydata$StdResid,
     main = "Histogram of standardized residuals",
     ylab = "Frequency",
     xlab = "Standardized residuals")

hist(mydata$Cook,
     main = "Histogram of Cook's distance",
     ylab = "Frequency",
     xlab = "Cook's distance")

head(mydata[order(-mydata$Cook), c("ID", "Cook")])

##      ID       Cook
## 229 154 0.07242482
## 488 340 0.04391496
## 745 516 0.03128182
## 674 469 0.02588386
## 126  82 0.02526765
## 460 318 0.02271536

We see we do have outliers as well as units with high impact. Below we will remove them.

mydata <- mydata[abs(mydata$StdResid) < 3, ]
mydata <- mydata[mydata$Cook < 0.03, ]

MULTICOLINEARITY

fit <- glm(diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + age,  
            family = binomial, 
            data = mydata)

suppressPackageStartupMessages(library(car))

## Warning: package 'car' was built under R version 4.2.2

vif(fit)

## pregnant  glucose pressure  triceps     mass pedigree      age 
## 1.711166 1.068159 1.235087 1.606322 1.747471 1.040704 1.828524

mean(vif(fit))

## [1] 1.46249

All are below 5, the average VIF is relatively close to 1, so we do not have a problem with too strong multicolinearity.

LINEARITY

suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(rstatix))

## Warning: package 'rstatix' was built under R version 4.2.2

probabilities <- predict(fit, type = "response")

Mydata <- mydata[c(2:8)]
Mydata <- Mydata %>%
  mutate(logit = log(probabilities/(1-probabilities))) %>% 
  gather(key = "predictors", value = "predictor.value", -logit) %>% 
  convert_as_factor(predictors)

ggplot(Mydata, aes(logit, predictor.value))+
  geom_point(size = 0.5, alpha = 0.5) +
  theme_bw() + 
  facet_wrap(~predictors, scales = "free_y")

While most seem relatively ok, maybe age, mass, pedigree do seem like there might be a problem with linearity and could, therefore, need some transformations. Nonetheless, let us here continue as is.

Descriptive Statistics of the Variables Used in the Final Model

suppressPackageStartupMessages(library(pastecs))
round(stat.desc(mydata[ ,c(2:8)]), 2)

##              pregnant  glucose pressure  triceps     mass pedigree      age
## nbr.val        529.00   529.00   529.00   529.00   529.00   529.00   529.00
## nbr.null        76.00     0.00     0.00     0.00     0.00     0.00     0.00
## nbr.na           0.00     0.00     0.00     0.00     0.00     0.00     0.00
## min              0.00    56.00    24.00     7.00    18.20     0.09    21.00
## max             17.00   199.00   110.00    99.00    67.10     2.42    81.00
## range           17.00   143.00    86.00    92.00    48.90     2.34    60.00
## sum           1854.00 63865.00 37805.00 15417.00 17373.80   262.92 16691.00
## median           2.00   115.00    72.00    29.00    32.80     0.42    28.00
## mean             3.50   120.73    71.47    29.14    32.84     0.50    31.55
## SE.mean          0.14     1.34     0.54     0.46     0.30     0.01     0.47
## CI.mean.0.95     0.28     2.63     1.05     0.90     0.59     0.03     0.92
## var             10.84   948.34   151.80   111.07    47.13     0.11   115.04
## std.dev          3.29    30.80    12.32    10.54     6.86     0.33    10.73
## coef.var         0.94     0.26     0.17     0.36     0.21     0.67     0.34

sum(mydata[,9] == 'pos')

## [1] 177

sum(mydata[,9] == 'neg')

## [1] 352

We will use 529 units. The minimum of pregnant is 0 and the maximum 17, the minimum of glucose 56mg/dl and maximum 199mg/dl, the minimum of pressure 24mm Hg and the maximum 110mm Hg, the minimum of triceps 7mm and the maximum 99cm, the minimum of mass (BMI) is 18.20kg/m2 and the maximum 67.10kg/m2, the minimum of pedigree is 0.09 and the maximum 2.42, the minimum of age is 21 years and the maximum 81 years.

The patients have been on average pregnant 3.5 times, have on average plasma glucose concentration of 120.73mg/dl, diastolic blood pressure 71.47mm Hg, 29.14mm triceps skin fold thickness, BMI of 32.84kg/m2, are on average 31.55 years old, and their diabetes pedigree function is on average 0.5. The medians are relatively very similar to the means, higher differences being in the variables pregnant, glucose, and age. 50% of patients has been pregnant 2 times or less, has plasma glucose concentration 115.00mg/dl or less, and are 28 years old or younger (other 50% having higher vales than that for the mentioned variables).

The 95% confidence intervals for means are relatively small for all variables, the largest being for plasma glucose concentration; between 118.1 and 123.36mg/dl (p-value = 5%).

The highest variability according to coefficient of variance is for the number of pregnancies.

Out of 529 units in the sample, 177 had a positive test for diabetes and 352 a negative one.

The Model Description and Explanation

#from the last estimation of the model nothing changed so we will not re-estimate it here
summary(fit)

## 
## Call:
## glm(formula = diabetes ~ pregnant + glucose + pressure + triceps + 
##     mass + pedigree + age, family = binomial, data = mydata)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9918  -0.6414  -0.3319   0.5717   2.5843  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -10.503142   1.065403  -9.858  < 2e-16 ***
## pregnant      0.133233   0.045513   2.927 0.003418 ** 
## glucose       0.038457   0.004480   8.584  < 2e-16 ***
## pressure     -0.008756   0.010618  -0.825 0.409554    
## triceps       0.004772   0.015176   0.314 0.753193    
## mass          0.093886   0.024337   3.858 0.000114 ***
## pedigree      1.772657   0.384204   4.614 3.95e-06 ***
## age           0.028298   0.014483   1.954 0.050712 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 674.35  on 528  degrees of freedom
## Residual deviance: 442.89  on 521  degrees of freedom
## AIC: 458.89
## 
## Number of Fisher Scoring iterations: 5

exp(cbind(OR = fit$coefficients, confint.default(fit)))

##                       OR        2.5 %       97.5 %
## (Intercept) 2.745006e-05 3.401502e-06 2.215214e-04
## pregnant    1.142516e+00 1.045013e+00 1.249116e+00
## glucose     1.039206e+00 1.030120e+00 1.048372e+00
## pressure    9.912819e-01 9.708657e-01 1.012127e+00
## triceps     1.004783e+00 9.753368e-01 1.035119e+00
## mass        1.098434e+00 1.047270e+00 1.152098e+00
## pedigree    5.886476e+00 2.772172e+00 1.249944e+01
## age         1.028702e+00 9.999123e-01 1.058321e+00

We have not find a significant impact of age, diastolic blood pressure or triceps skin fold thickness on the test for diabetes being positive or negative (p-value is above 5%, although for age it is only 5.08%).

Based on our sample data if a patient goes through another pregnancy, the odds that they get a positive diabetes test are 1.14 times the initial odds, c.p., p-value = 0.004.

Based on our sample data if a patient’s plasma glucose concentration increases by 1mg/dl, the odds that they get a positive diabetes test are 1.04 times the initial odds, c.p., p-value < 0.001.

Based on our sample data if a patient’s BMI increases by 1kg/m2, the odds that they get a positive diabetes test are 1.10 times the initial odds, c.p., p-value < 0.001.

Based on our sample data if a patient’s pedigree function is increased by 1mg/dl, the odds that they get a positive diabetes test are 5.89 times the initial odds, c.p., p-value < 0.001.

We also see that the deviance is lower with our model than with the simplest model (the one with no explanatory variables). Furthermore, the difference is not that small.

fit1 <- glm(diabetes ~ 1,  
            family = binomial, 
            data = mydata)
anova(fit1, fit, test = "Chi")

## Analysis of Deviance Table
## 
## Model 1: diabetes ~ 1
## Model 2: diabetes ~ pregnant + glucose + pressure + triceps + mass + pedigree + 
##     age
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       528     674.35                          
## 2       521     442.89  7   231.46 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

H0: Simple model (Model 1) is better.

H1: Complex model (Model 2) is better.

As can be expected we can reject the null hypothesis, meaning the complex model is indeed better than the simplest (p-value < 0.001).

We can also save in our table what the estimated probabilities of the diabetes test being positive are.

mydata$EstProb <- fitted(fit)
head(mydata)

##   ID pregnant glucose pressure triceps mass pedigree age diabetes   StdResid         Cook    EstProb
## 1  1        6     148       72      35 33.6    0.627  50      pos  0.7934175 5.525063e-04 0.76947241
## 2  2        1      85       66      29 26.6    0.351  31      neg -0.2742754 1.655924e-05 0.02809089
## 4  3        1      89       66      23 28.1    0.167  21      neg -0.2388853 9.095734e-06 0.02009590
## 5  4        0     137       40      35 43.1    2.288  33      pos  0.3782762 3.677897e-04 0.97387404
## 7  5        3      78       50      32 31.0    0.248  26      pos  2.4854817 1.245152e-02 0.03546473
## 9  6        2     197       70      45 30.5    0.158  53      pos  0.6356972 1.152790e-03 0.82983977

Here, we can see that ID 5 would be incorrectly classified with our model. Let us look at how many units were incorrectly classified in total.

mydata$Classification <- ifelse(test = mydata$EstProb < 0.50, 
                                yes = "neg", 
                                no = "pos")

mydata$Classification <- factor(mydata$Classification,
                                 levels = c("neg", "pos"), 
                                 labels = c("neg", "pos"))

ClassificationTable <- table(mydata$diabetes, mydata$Classification) 

ClassificationTable

##      
##       neg pos
##   neg 316  36
##   pos  69 108

From this table we see that 105 patients were wrongly classified with our model, and 424 correctly.

# pseudo R2 for our model
Pseudo_R2_fit <- (ClassificationTable[1, 1] + ClassificationTable[2, 2] )/ nrow(mydata)
Pseudo_R2_fit

## [1] 0.8015123

# the probability of a random patient having a positive diabetes test (without knowing anything about them)
sum(mydata$diabetes == 'pos')/nrow(mydata)

## [1] 0.3345936

# pseudo R2 for the simplest model
1 - sum(mydata$diabetes == 'pos')/nrow(mydata)

## [1] 0.6654064

We can see that with all the explanatory variables we quite improved the percentage of correctly predicted diabetes test results (from 66.54% to 80.15%). Of course, it could be further improved by adding additional variables.

Conclusion

By using binary logistic regression we have found a good model in which result of the diabetes test is the dependent variable and number of pregnancies, plasma glucose concentration, diastolic blood pressure, triceps skin fold thickness, body mass index, family history, and age the explanatory variables. With the explanatory variables we can better predict the diabetes test results, meaning we can better predict which patients have diabetes, than by not using any explanatory variables. Furthermore, we have found a significant effect of number of pregnancies, plasma glucose concentration, BMI, and diabetes pedigree function (family history) on diabetes test results, but not for diastolic blood pressure, triceps skin fold thickness, and age (although for age the p-value was only 5.8%). We have to keep in mind this can only be generalized on the population of the specific patients.

References

Massaro, A., Magaletti, N., Cosoli, G., Giardinelli, V. & Leogrande, A. (2022, June 13th). The Prediction of Diabetes. Research Square. Retrieved from https://www.researchsquare.com/article/rs-1753046/v1
Alkhatib, A. J., Sindian, A. M., Funjan, K. I. & Alshdaifat, E. H. (2020, July). Skin Thickness can Predict the Progress of Diabetes Type 2: A New Medical Hypothesis. EC Diabetes andMetabolic Research 4(8).

HW4