ELMR 2.3

A) Summarize Data

d <- faraway::pima

describe(d)
d %>%
  gather() %>% 
  ggplot(aes(value)) +
    facet_wrap(~ key, scales = "free") +
    geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

From the above, we can see that there are occasional missing data - in many cases we observe “0” values in a context that makes no sense. In the case of “diastolic”, “Tricep” and “BMI” we can be sure that these 0’s represent missing values. I’m less certain in the case of “diabetesPedigreeFunction” and “Insulin”, however, for our purposes here, we’ll replace zeros in all of these variables with NAs

#replace 0 with NA in the identified cols
d[,2:6][d[,2:6]==0] <-NA

head(d,10)
#retain only the complete cases
d<-d[complete.cases(d),]

B) Build a Model

#build model
gm1 <- glm(test ~ ., family="binomial", data = d)
summary(gm1)
## 
## Call:
## glm(formula = test ~ ., family = "binomial", data = d)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7823  -0.6603  -0.3642   0.6409   2.5612  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
## pregnant     8.216e-02  5.543e-02   1.482  0.13825    
## glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
## diastolic   -1.420e-03  1.183e-02  -0.120  0.90446    
## triceps      1.122e-02  1.708e-02   0.657  0.51128    
## insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
## bmi          7.054e-02  2.734e-02   2.580  0.00989 ** 
## diabetes     1.141e+00  4.274e-01   2.669  0.00760 ** 
## age          3.395e-02  1.838e-02   1.847  0.06474 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.02  on 383  degrees of freedom
## AIC: 362.02
## 
## Number of Fisher Scoring iterations: 5
#assess fit

paste("model p-value:",with(gm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE)))
## [1] "model p-value: 2.76493429047124e-29"

We can see that given the p-value (~3e-29) the model is a good fit.

C)BMI

d1 <- with(d, data.frame(pregnant = mean(pregnant),
                          glucose = mean(glucose),
                          diastolic = mean(diastolic),
                          triceps = mean(triceps),
                          insulin = mean(insulin),
                          bmi = as.numeric(factor(c("28.4","37.1"))),  #1st and 3rd quartile values
                          diabetes = mean(diabetes),
                          age = mean(age)))

          
pred <- predict(gm1, newdata = d1, type = "response",se=T)

pred
## $fit
##          1          2 
## 0.03684332 0.03942984 
## 
## $se.fit
##          1          2 
## 0.03208725 0.03322540 
## 
## $residual.scale
## [1] 1

Based on the above, we can say the odds increase slightly from 3.7% to 3.95% assuming that the an individual is average in all other respects and only their BMI changes from 1st to 3rd quartile.

D) Diastolic Blood Preassure

ggplot(data = d[,c("test","diastolic")], aes(x=test,y=diastolic,group=test)) +
    geom_boxplot()

d[,c("test","diastolic")]

The data show that women who test positive for diabetes tend to have higher blood preassure (as per above boxplot) however the model says that the variable is not significant. The questions are distinct: the first question is asking whether women who are diabetic are likely to have high blood preassure according to the data, the second asks whether the model identified this as important. In can be emperically true, and still unimportant in the context of the model, as we see here.

E) Diagnostics

vif(gm1)
##  pregnant   glucose diastolic   triceps   insulin       bmi  diabetes 
##  1.892387  1.378937  1.191287  1.638865  1.384038  1.832416  1.031715 
##       age 
##  1.974053
plot(gm1, which = 4, id.n = 5)

gm1.data <- augment(gm1) %>% 
  mutate(index = 1:n()) 

gm1.data %>% top_n(3, .cooksd)
ggplot(gm1.data, aes(index, .std.resid)) + 
  geom_point(aes(color = test), alpha = .5) 

gm1.data %>% 
  filter(abs(.std.resid) > 3)

Here we look at a few things * We look at Cooks distance for influential values. * Next we look at the standardized residuals. Generally we’re looking for values over 3, of which we see none. * Finally we look at multi-colinearity using vif() and find no concerning values

From this we can conclude that the model suits the data.

F) Predict

d2 <- data.frame(pregnant =1,
                          glucose = 99,
                          diastolic = 64,
                          triceps = 22,
                          insulin = 76 ,
                          bmi = 27,  
                          diabetes =0.25, 
                          age = 25)

pred <- predict(gm1, newdata = d2, type = "response",se=T)

pred
## $fit
##          1 
## 0.04573331 
## 
## $se.fit
##          1 
## 0.01390283 
## 
## $residual.scale
## [1] 1

The model predicts a negative (no diabetes) for this case.