y <- c(3.36,2.88,3.63,3.41,3.78,4.02,4.00,4.23,3.73,3.85,3.97,4.51,4.54, 5.00,5.00,4.72,5.00) #' time to death
x <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,6) #' log of white blood cell
Plot \(y\) against \(x\). Is there a trend?
plot(y ~ x)
plot(1/y ~ x)
Yes, and it becomes linear when we plot \(1/y\) vs \(x\).
A possible model is \(E[Y_j] = 1/(\beta_1 + \beta_2 x_j )\). What is the link function in this case?
This is inverse link function.
Assuming an exponential error fit a model with your suggested link. Perform model diagnostics and suggest another model.
#' g1 is fitting Y following an exponential distribution
g1 <- glm(y ~ x, family = Gamma(link="inverse"))
summary(g1, dispersion=1)
##
## Call:
## glm(formula = y ~ x, family = Gamma(link = "inverse"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.2156853 0.0804951 2.679 0.00737 **
## x 0.0005382 0.0011642 0.462 0.64387
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 1)
##
## Null deviance: 0.38385 on 16 degrees of freedom
## Residual deviance: 0.15971 on 15 degrees of freedom
## AIC: 22.477
##
## Number of Fisher Scoring iterations: 4
anova(g1, dispersion=1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: Gamma, link: inverse
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 16 0.38385
## x 1 0.22413 15 0.15971 0.6359
This doesn’t look right since \(x\) is not significant, but from the exploratory plot, there is a nonlinear association between them. We then consider a Gamma GLM with estimated dispersion parameter.
summary(g1)
##
## Call:
## glm(formula = y ~ x, family = Gamma(link = "inverse"))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.2156853 0.0081804 26.366 5.57e-14 ***
## x 0.0005382 0.0001183 4.549 0.000384 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Gamma family taken to be 0.01032795)
##
## Null deviance: 0.38385 on 16 degrees of freedom
## Residual deviance: 0.15971 on 15 degrees of freedom
## AIC: 22.477
##
## Number of Fisher Scoring iterations: 4
anova(g1, test = "Chisq")
## Analysis of Deviance Table
##
## Model: Gamma, link: inverse
##
## Response: y
##
## Terms added sequentially (first to last)
##
##
## Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL 16 0.38385
## x 1 0.22413 15 0.15971 3.185e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A Gamma GLM with estimated dispersion parameter suggests that \(x\) is significant. Also, the estimated dispersion parameter is obviously not equal to 1.
The dataset is from the library. In the first step, we load the dataset and briefly explore the variables.
library(mlbench)
library(ggplot2)
library(reshape2)
data(Glass)
dim(Glass)
## [1] 214 10
summary(Glass)
## RI Na Mg Al
## Min. :1.511 Min. :10.73 Min. :0.000 Min. :0.290
## 1st Qu.:1.517 1st Qu.:12.91 1st Qu.:2.115 1st Qu.:1.190
## Median :1.518 Median :13.30 Median :3.480 Median :1.360
## Mean :1.518 Mean :13.41 Mean :2.685 Mean :1.445
## 3rd Qu.:1.519 3rd Qu.:13.82 3rd Qu.:3.600 3rd Qu.:1.630
## Max. :1.534 Max. :17.38 Max. :4.490 Max. :3.500
## Si K Ca Ba
## Min. :69.81 Min. :0.0000 Min. : 5.430 Min. :0.000
## 1st Qu.:72.28 1st Qu.:0.1225 1st Qu.: 8.240 1st Qu.:0.000
## Median :72.79 Median :0.5550 Median : 8.600 Median :0.000
## Mean :72.65 Mean :0.4971 Mean : 8.957 Mean :0.175
## 3rd Qu.:73.09 3rd Qu.:0.6100 3rd Qu.: 9.172 3rd Qu.:0.000
## Max. :75.41 Max. :6.2100 Max. :16.190 Max. :3.150
## Fe Type
## Min. :0.00000 1:70
## 1st Qu.:0.00000 2:76
## Median :0.00000 3:17
## Mean :0.05701 5:13
## 3rd Qu.:0.10000 6: 9
## Max. :0.51000 7:29
## help(Glass)
varnames <- colnames(Glass)
## Some correlation plots between covariates can also be drawn to explore but not required.
The dataset consists of \(n = 214\) observations and 10 variables among which denoting glass type is the response variable of interest. We want to investigate the association between glass type using the chemical composition. Columns 1-9 are 9 covariates proving information on the chemical composition of the glasses. There are seven categories of the glasses in the original dataset. The covariates are all continuous; hence factorization is not necessary.
melt_Glass <- melt(Glass[, 1:9])
## No id variables; using all as measure variables
qplot(value, data = melt_Glass, geom = "density", adjust = 2, fill = I(gray(0.5)),
xlab = "", ylab = "") + facet_wrap(~ variable, scales = "free", ncol = 3)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
Variables are highly skewed, has bimodal. When we consider improving the model, transformation such as log, Box-Cox of these variable can be considered.
Find a suitable GLM to classify if glass type is 1 or not.
Glass$Type <- ifelse(Glass$Type == 1, 1, 0)
table(Glass$Type)
##
## 0 1
## 144 70
After transforming, type one glass has 70 observations and the rest 144 observations do not belong to class 1. The most suitable distribution for would be a binomial distribution which is discrete.
Give an expression of the linear predictor and discuss how it associates with the conditional mean \(E(Y | X)\).
The linear predictor \(\eta = \mathbf{X} \beta = \sum_{j = 1}^{p}X_j \beta_j\) and the conditional mean \(\xi = E(Y | X) = p\) the sucess rate for binomial distribution. We know that they are connected through a nonlinear function called the link function denoted by \(g(\cdot)\). In the case of binomial distribution, three link functions are available, i.e.
Among the above three link functions, the logistic link function is also the canonical link function for logistic regression (Note: check why, prepare for derivation and explanation.)
We first propose to fit a binomial GLM using all 9 covariates with intercept.
form <- Type ~.
full <- glm(form, data = Glass, family = binomial(link = "logit"))
summary(full)
##
## Call:
## glm(formula = form, family = binomial(link = "logit"), data = Glass)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -575.3570 341.4893 -1.685 0.09202 .
## RI 67.1196 192.6359 0.348 0.72752
## Na 3.5201 2.0937 1.681 0.09271 .
## Mg 6.1380 2.2075 2.780 0.00543 **
## Al 1.4019 2.3365 0.600 0.54850
## Si 5.0003 2.0758 2.409 0.01600 *
## K 4.4712 2.4649 1.814 0.06969 .
## Ca 4.3862 2.1879 2.005 0.04499 *
## Ba 4.9974 2.5373 1.970 0.04889 *
## Fe -0.9779 2.0595 -0.475 0.63489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 270.54 on 213 degrees of freedom
## Residual deviance: 178.67 on 204 degrees of freedom
## AIC: 198.67
##
## Number of Fisher Scoring iterations: 7
We see that the three variables , , and are not significant. It might be reasonable to drop them. Although is sometimes used to determine glass type, the current dataset suggests that it may not help explaining the deviance of the model. We now test a null hypothesis \(H_0: \beta_1 = \beta_4 = \beta_9 = 0\) versus alternative that the coefficients are nonzero by a likelihood ratio test.
update1 <- update(full, .~. - Fe - Al - RI)
anova(update1, full, test = "LRT")
## Analysis of Deviance Table
##
## Model 1: Type ~ Na + Mg + Si + K + Ca + Ba
## Model 2: Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 207 179.78
## 2 204 178.67 3 1.1104 0.7746
Test result suggest that there isn’t a significant difference between residual deviance after dropping these three terms. The null hypothesis is accepted, i.e., \(\beta_1\beta_4 = \beta_9 = 0\) the coefficients for are zero. We also consider use stepwise AIC to perform a model selection which returns the same model.
library(stats)
stepAIC <- step(full, direction = "both")
## Start: AIC=198.67
## Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
##
## Df Deviance AIC
## - RI 1 178.79 196.79
## - Fe 1 178.90 196.90
## - Al 1 179.05 197.05
## <none> 178.67 198.67
## - Ba 1 181.87 199.87
## - Na 1 182.09 200.09
## - K 1 182.22 200.22
## - Ca 1 183.49 201.49
## - Si 1 186.16 204.16
## - Mg 1 188.86 206.86
##
## Step: AIC=196.79
## Type ~ Na + Mg + Al + Si + K + Ca + Ba + Fe
##
## Df Deviance AIC
## - Fe 1 179.05 195.05
## - Al 1 179.20 195.20
## <none> 178.79 196.79
## - Ba 1 182.24 198.24
## - Na 1 182.62 198.62
## - K 1 182.64 198.64
## + RI 1 178.67 198.67
## - Ca 1 185.20 201.20
## - Si 1 186.42 202.42
## - Mg 1 190.45 206.45
##
## Step: AIC=195.05
## Type ~ Na + Mg + Al + Si + K + Ca + Ba
##
## Df Deviance AIC
## - Al 1 179.78 193.78
## <none> 179.05 195.05
## - Ba 1 182.76 196.76
## + Fe 1 178.79 196.79
## + RI 1 178.90 196.90
## - K 1 183.53 197.53
## - Na 1 184.03 198.03
## - Ca 1 186.75 200.75
## - Si 1 188.14 202.14
## - Mg 1 192.34 206.34
##
## Step: AIC=193.78
## Type ~ Na + Mg + Si + K + Ca + Ba
##
## Df Deviance AIC
## <none> 179.78 193.78
## - Ba 1 182.76 194.76
## + Al 1 179.05 195.05
## + Fe 1 179.20 195.20
## + RI 1 179.57 195.57
## - K 1 183.95 195.95
## - Na 1 188.50 200.50
## - Si 1 200.71 212.71
## - Ca 1 203.39 215.39
## - Mg 1 229.08 241.08
summary(stepAIC)
##
## Call:
## glm(formula = Type ~ Na + Mg + Si + K + Ca + Ba, family = binomial(link = "logit"),
## data = Glass)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -369.0581 86.3466 -4.274 1.92e-05 ***
## Na 2.5755 0.8662 2.973 0.00295 **
## Mg 5.0890 0.9354 5.440 5.32e-08 ***
## Si 3.9260 0.9305 4.219 2.45e-05 ***
## K 3.5030 1.5684 2.233 0.02552 *
## Ca 3.4646 0.7678 4.513 6.40e-06 ***
## Ba 4.1433 1.8672 2.219 0.02649 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 270.54 on 213 degrees of freedom
## Residual deviance: 179.78 on 207 degrees of freedom
## AIC: 193.78
##
## Number of Fisher Scoring iterations: 7
We suggest a linear predictor \[\begin{equation}\label{eq: model} \eta = -369.0581 + 2.5755 \cdot \mbox{Na} + 5.0890 \cdot \mbox{Mg} + 3.9260 \cdot \mbox{Si} + 3.5030 \cdot \mbox{K} + 3.4646 \cdot \mbox{Ca} + 4.1433 \cdot \mbox{Ba}, \end{equation}\] and the link function we chose is the logistic link which also suggests the following equivalence (Note: check why this holds and you are expected to explain) \[ p = E[Y \mid X ] = \frac{\exp(\eta)}{1 + \exp(\eta)}. \] This indicates that whey the value of the \(j\)’th covariate, such as as \(X_1\), increases from 0 to 1, the probability of being classified as Type 1 glass increase \(\frac{e^{\beta_j}}{1 + e^{\beta_j}}\) (Note: also possible to explain from log-odds ratio). It maybe worth checking if Type 1 glass has overall high level of the chemical composition among all seven since all coefficients are positive and quite large.
Perform model diagnostics and identify problems with the above model.
We know that binomial GLM could potentially have dispersion issues due to the dispersion parameter being fixed at one. This could be checked by using a goodness-of-fit test based on residuals. Further, for discrete distributions in GLM, deviance is used for model diagnostics instead of pearson residuals. Hence, the goodness-of-fit test is perform as follows.
pr <- sum(residuals(update1, type = "deviance")^2)
dispersion <- pr/update1$df.residual
pchisq(pr, update1$df.residual, lower.tail = FALSE)
## [1] 0.9143631
The test result suggest that the model fits well and there is no evidence of dispersion.
Fit a quasi-likelihood model with the same mean structure. Briefly discuss difference.
quasi <- glm(form, data = Glass, family = quasibinomial(link = "logit"))
quasi <- update(quasi, .~. - Fe - Al - RI)
summary(quasi)
##
## Call:
## glm(formula = Type ~ Na + Mg + Si + K + Ca + Ba, family = quasibinomial(link = "logit"),
## data = Glass)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -369.0581 75.2137 -4.907 1.87e-06 ***
## Na 2.5755 0.7546 3.413 0.000772 ***
## Mg 5.0890 0.8148 6.245 2.36e-09 ***
## Si 3.9260 0.8106 4.844 2.50e-06 ***
## K 3.5030 1.3662 2.564 0.011053 *
## Ca 3.4646 0.6688 5.181 5.23e-07 ***
## Ba 4.1433 1.6264 2.547 0.011577 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasibinomial family taken to be 0.7587586)
##
## Null deviance: 270.54 on 213 degrees of freedom
## Residual deviance: 179.78 on 207 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 7
We compare the fitted quasi model with the one in \(\eqref{eq: model}\). The dispersion parameter changes from 1 to approximately 0.76 and there is no AIC value anymore Note: check why and explain. The residual deviance of the quasi model didn’t change. The dispersion parameter and the residual deviance agree with the goodness-of-test result. A binomial GLM is preferred. However, it is worth noting that the residual deviance is not ideal here compared to the null deviance. The model didn’t explain a large proportion of deviance in the data. Further improvement could include interaction terms, polynomial terms.
The Predict function can be used for prediction. Use the following codes and briefly explain the output.
set.seed(12345)
index <- sample(dim(Glass)[1], 10, replace = TRUE)
predictions <- predict(update1, newdata = Glass[index, ], type = "response")
We use the logistic regression model in \(\eqref{eq: model}\) for predicting the probability of observations being in class 1 based on their chemical compositions. The 10 observations used are
names(predictions)
## [1] "142" "51" "208" "152" "58" "93" "75" "96" "2" "86"
and their corresponding predicted probabilities of being in class 1 are
predictions
## 142 51 208 152 58 93
## 0.700077627 0.887669450 0.001277154 0.778492379 0.379186110 0.272200378
## 75 96 2 86
## 0.380535477 0.127788231 0.350208184 0.286440541