d <- faraway::pima
describe(d)
d %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
From the above, we can see that there are occasional missing data - in many cases we observe “0” values in a context that makes no sense. In the case of “diastolic”, “Tricep” and “BMI” we can be sure that these 0’s represent missing values. I’m less certain in the case of “diabetesPedigreeFunction” and “Insulin”, however, for our purposes here, we’ll replace zeros in all of these variables with NAs
#replace 0 with NA in the identified cols
d[,2:6][d[,2:6]==0] <-NA
head(d,10)
#retain only the complete cases
d<-d[complete.cases(d),]
#build model
gm1 <- glm(test ~ ., family="binomial", data = d)
summary(gm1)
##
## Call:
## glm(formula = test ~ ., family = "binomial", data = d)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.7823 -0.6603 -0.3642 0.6409 2.5612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.004e+01 1.218e+00 -8.246 < 2e-16 ***
## pregnant 8.216e-02 5.543e-02 1.482 0.13825
## glucose 3.827e-02 5.768e-03 6.635 3.24e-11 ***
## diastolic -1.420e-03 1.183e-02 -0.120 0.90446
## triceps 1.122e-02 1.708e-02 0.657 0.51128
## insulin -8.253e-04 1.306e-03 -0.632 0.52757
## bmi 7.054e-02 2.734e-02 2.580 0.00989 **
## diabetes 1.141e+00 4.274e-01 2.669 0.00760 **
## age 3.395e-02 1.838e-02 1.847 0.06474 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 498.10 on 391 degrees of freedom
## Residual deviance: 344.02 on 383 degrees of freedom
## AIC: 362.02
##
## Number of Fisher Scoring iterations: 5
#assess fit
paste("model p-value:",with(gm1, pchisq(null.deviance - deviance, df.null - df.residual, lower.tail = FALSE)))
## [1] "model p-value: 2.76493429047124e-29"
We can see that given the p-value (~3e-29) the model is a good fit.
d1 <- with(d, data.frame(pregnant = mean(pregnant),
glucose = mean(glucose),
diastolic = mean(diastolic),
triceps = mean(triceps),
insulin = mean(insulin),
bmi = as.numeric(factor(c("28.4","37.1"))), #1st and 3rd quartile values
diabetes = mean(diabetes),
age = mean(age)))
pred <- predict(gm1, newdata = d1, type = "response",se=T)
pred
## $fit
## 1 2
## 0.03684332 0.03942984
##
## $se.fit
## 1 2
## 0.03208725 0.03322540
##
## $residual.scale
## [1] 1
Based on the above, we can say the odds increase slightly from 3.7% to 3.95% assuming that the an individual is average in all other respects and only their BMI changes from 1st to 3rd quartile.
ggplot(data = d[,c("test","diastolic")], aes(x=test,y=diastolic,group=test)) +
geom_boxplot()
d[,c("test","diastolic")]
The data show that women who test positive for diabetes tend to have higher blood preassure (as per above boxplot) however the model says that the variable is not significant. The questions are distinct: the first question is asking whether women who are diabetic are likely to have high blood preassure according to the data, the second asks whether the model identified this as important. In can be emperically true, and still unimportant in the context of the model, as we see here.
vif(gm1)
## pregnant glucose diastolic triceps insulin bmi diabetes
## 1.892387 1.378937 1.191287 1.638865 1.384038 1.832416 1.031715
## age
## 1.974053
plot(gm1, which = 4, id.n = 5)
gm1.data <- augment(gm1) %>%
mutate(index = 1:n())
gm1.data %>% top_n(3, .cooksd)
ggplot(gm1.data, aes(index, .std.resid)) +
geom_point(aes(color = test), alpha = .5)
gm1.data %>%
filter(abs(.std.resid) > 3)
Here we look at a few things * We look at Cooks distance for influential values. * Next we look at the standardized residuals. Generally we’re looking for values over 3, of which we see none. * Finally we look at multi-colinearity using vif() and find no concerning values
From this we can conclude that the model suits the data.
d2 <- data.frame(pregnant =1,
glucose = 99,
diastolic = 64,
triceps = 22,
insulin = 76 ,
bmi = 27,
diabetes =0.25,
age = 25)
pred <- predict(gm1, newdata = d2, type = "response",se=T)
pred
## $fit
## 1
## 0.04573331
##
## $se.fit
## 1
## 0.01390283
##
## $residual.scale
## [1] 1
The model predicts a negative (no diabetes) for this case.