Leukaemia Data Analysis: The data below are the times to death, y, and the log of the white blood cell count, x, for 17 patients suffering from leukaemia.

y <- c(3.36,2.88,3.63,3.41,3.78,4.02,4.00,4.23,3.73,3.85,3.97,4.51,4.54, 5.00,5.00,4.72,5.00)  #' time to death
x <- c(65,156,100,134,16,108,121,4,39,143,56,26,22,1,1,5,6)  #' log of white blood cell

Plot \(y\) against \(x\). Is there a trend?

plot(y ~ x)

plot(1/y ~ x)

Yes, and it becomes linear when we plot \(1/y\) vs \(x\).

A possible model is \(E[Y_j] = 1/(\beta_1 + \beta_2 x_j )\). What is the link function in this case?

This is inverse link function.

Assuming an exponential error fit a model with your suggested link. Perform model diagnostics and suggest another model.

#' g1 is fitting Y following an exponential distribution
g1 <- glm(y ~ x, family = Gamma(link="inverse"))
summary(g1, dispersion=1)
## 
## Call:
## glm(formula = y ~ x, family = Gamma(link = "inverse"))
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) 0.2156853  0.0804951   2.679  0.00737 **
## x           0.0005382  0.0011642   0.462  0.64387   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 1)
## 
##     Null deviance: 0.38385  on 16  degrees of freedom
## Residual deviance: 0.15971  on 15  degrees of freedom
## AIC: 22.477
## 
## Number of Fisher Scoring iterations: 4
anova(g1, dispersion=1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: Gamma, link: inverse
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                    16    0.38385         
## x     1  0.22413        15    0.15971   0.6359

This doesn’t look right since \(x\) is not significant, but from the exploratory plot, there is a nonlinear association between them. We then consider a Gamma GLM with estimated dispersion parameter.

summary(g1)
## 
## Call:
## glm(formula = y ~ x, family = Gamma(link = "inverse"))
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2156853  0.0081804  26.366 5.57e-14 ***
## x           0.0005382  0.0001183   4.549 0.000384 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Gamma family taken to be 0.01032795)
## 
##     Null deviance: 0.38385  on 16  degrees of freedom
## Residual deviance: 0.15971  on 15  degrees of freedom
## AIC: 22.477
## 
## Number of Fisher Scoring iterations: 4
anova(g1, test = "Chisq")
## Analysis of Deviance Table
## 
## Model: Gamma, link: inverse
## 
## Response: y
## 
## Terms added sequentially (first to last)
## 
## 
##      Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
## NULL                    16    0.38385              
## x     1  0.22413        15    0.15971 3.185e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A Gamma GLM with estimated dispersion parameter suggests that \(x\) is significant. Also, the estimated dispersion parameter is obviously not equal to 1.

Glass Dataset Analysis

The dataset is from the library. In the first step, we load the dataset and briefly explore the variables.

library(mlbench)
library(ggplot2)
library(reshape2)
data(Glass)
dim(Glass)
## [1] 214  10
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
## help(Glass)

varnames <- colnames(Glass)

## Some correlation plots between covariates can also be drawn to explore but not required.

The dataset consists of \(n = 214\) observations and 10 variables among which denoting glass type is the response variable of interest. We want to investigate the association between glass type using the chemical composition. Columns 1-9 are 9 covariates proving information on the chemical composition of the glasses. There are seven categories of the glasses in the original dataset. The covariates are all continuous; hence factorization is not necessary.

melt_Glass <- melt(Glass[, 1:9])
## No id variables; using all as measure variables
qplot(value, data = melt_Glass, geom = "density", adjust = 2, fill = I(gray(0.5)),
      xlab = "", ylab = "") + facet_wrap(~ variable, scales = "free", ncol = 3)
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Variables are highly skewed, has bimodal. When we consider improving the model, transformation such as log, Box-Cox of these variable can be considered.

Find a suitable GLM to classify if glass type is 1 or not.

Glass$Type <- ifelse(Glass$Type == 1, 1, 0)
table(Glass$Type)
## 
##   0   1 
## 144  70

After transforming, type one glass has 70 observations and the rest 144 observations do not belong to class 1. The most suitable distribution for would be a binomial distribution which is discrete.

Give an expression of the linear predictor and discuss how it associates with the conditional mean \(E(Y | X)\).

The linear predictor \(\eta = \mathbf{X} \beta = \sum_{j = 1}^{p}X_j \beta_j\) and the conditional mean \(\xi = E(Y | X) = p\) the sucess rate for binomial distribution. We know that they are connected through a nonlinear function called the link function denoted by \(g(\cdot)\). In the case of binomial distribution, three link functions are available, i.e.

Among the above three link functions, the logistic link function is also the canonical link function for logistic regression (Note: check why, prepare for derivation and explanation.)

We first propose to fit a binomial GLM using all 9 covariates with intercept.

form <- Type ~. 
full <- glm(form, data = Glass, family = binomial(link = "logit"))
summary(full) 
## 
## Call:
## glm(formula = form, family = binomial(link = "logit"), data = Glass)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)   
## (Intercept) -575.3570   341.4893  -1.685  0.09202 . 
## RI            67.1196   192.6359   0.348  0.72752   
## Na             3.5201     2.0937   1.681  0.09271 . 
## Mg             6.1380     2.2075   2.780  0.00543 **
## Al             1.4019     2.3365   0.600  0.54850   
## Si             5.0003     2.0758   2.409  0.01600 * 
## K              4.4712     2.4649   1.814  0.06969 . 
## Ca             4.3862     2.1879   2.005  0.04499 * 
## Ba             4.9974     2.5373   1.970  0.04889 * 
## Fe            -0.9779     2.0595  -0.475  0.63489   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.54  on 213  degrees of freedom
## Residual deviance: 178.67  on 204  degrees of freedom
## AIC: 198.67
## 
## Number of Fisher Scoring iterations: 7

We see that the three variables , , and are not significant. It might be reasonable to drop them. Although is sometimes used to determine glass type, the current dataset suggests that it may not help explaining the deviance of the model. We now test a null hypothesis \(H_0: \beta_1 = \beta_4 = \beta_9 = 0\) versus alternative that the coefficients are nonzero by a likelihood ratio test.

update1 <- update(full, .~. - Fe - Al - RI)
anova(update1, full, test = "LRT")
## Analysis of Deviance Table
## 
## Model 1: Type ~ Na + Mg + Si + K + Ca + Ba
## Model 2: Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1       207     179.78                     
## 2       204     178.67  3   1.1104   0.7746

Test result suggest that there isn’t a significant difference between residual deviance after dropping these three terms. The null hypothesis is accepted, i.e., \(\beta_1\beta_4 = \beta_9 = 0\) the coefficients for are zero. We also consider use stepwise AIC to perform a model selection which returns the same model.

library(stats)
stepAIC <- step(full, direction = "both")
## Start:  AIC=198.67
## Type ~ RI + Na + Mg + Al + Si + K + Ca + Ba + Fe
## 
##        Df Deviance    AIC
## - RI    1   178.79 196.79
## - Fe    1   178.90 196.90
## - Al    1   179.05 197.05
## <none>      178.67 198.67
## - Ba    1   181.87 199.87
## - Na    1   182.09 200.09
## - K     1   182.22 200.22
## - Ca    1   183.49 201.49
## - Si    1   186.16 204.16
## - Mg    1   188.86 206.86
## 
## Step:  AIC=196.79
## Type ~ Na + Mg + Al + Si + K + Ca + Ba + Fe
## 
##        Df Deviance    AIC
## - Fe    1   179.05 195.05
## - Al    1   179.20 195.20
## <none>      178.79 196.79
## - Ba    1   182.24 198.24
## - Na    1   182.62 198.62
## - K     1   182.64 198.64
## + RI    1   178.67 198.67
## - Ca    1   185.20 201.20
## - Si    1   186.42 202.42
## - Mg    1   190.45 206.45
## 
## Step:  AIC=195.05
## Type ~ Na + Mg + Al + Si + K + Ca + Ba
## 
##        Df Deviance    AIC
## - Al    1   179.78 193.78
## <none>      179.05 195.05
## - Ba    1   182.76 196.76
## + Fe    1   178.79 196.79
## + RI    1   178.90 196.90
## - K     1   183.53 197.53
## - Na    1   184.03 198.03
## - Ca    1   186.75 200.75
## - Si    1   188.14 202.14
## - Mg    1   192.34 206.34
## 
## Step:  AIC=193.78
## Type ~ Na + Mg + Si + K + Ca + Ba
## 
##        Df Deviance    AIC
## <none>      179.78 193.78
## - Ba    1   182.76 194.76
## + Al    1   179.05 195.05
## + Fe    1   179.20 195.20
## + RI    1   179.57 195.57
## - K     1   183.95 195.95
## - Na    1   188.50 200.50
## - Si    1   200.71 212.71
## - Ca    1   203.39 215.39
## - Mg    1   229.08 241.08
summary(stepAIC)
## 
## Call:
## glm(formula = Type ~ Na + Mg + Si + K + Ca + Ba, family = binomial(link = "logit"), 
##     data = Glass)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -369.0581    86.3466  -4.274 1.92e-05 ***
## Na             2.5755     0.8662   2.973  0.00295 ** 
## Mg             5.0890     0.9354   5.440 5.32e-08 ***
## Si             3.9260     0.9305   4.219 2.45e-05 ***
## K              3.5030     1.5684   2.233  0.02552 *  
## Ca             3.4646     0.7678   4.513 6.40e-06 ***
## Ba             4.1433     1.8672   2.219  0.02649 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 270.54  on 213  degrees of freedom
## Residual deviance: 179.78  on 207  degrees of freedom
## AIC: 193.78
## 
## Number of Fisher Scoring iterations: 7

We suggest a linear predictor \[\begin{equation}\label{eq: model} \eta = -369.0581 + 2.5755 \cdot \mbox{Na} + 5.0890 \cdot \mbox{Mg} + 3.9260 \cdot \mbox{Si} + 3.5030 \cdot \mbox{K} + 3.4646 \cdot \mbox{Ca} + 4.1433 \cdot \mbox{Ba}, \end{equation}\] and the link function we chose is the logistic link which also suggests the following equivalence (Note: check why this holds and you are expected to explain) \[ p = E[Y \mid X ] = \frac{\exp(\eta)}{1 + \exp(\eta)}. \] This indicates that whey the value of the \(j\)’th covariate, such as as \(X_1\), increases from 0 to 1, the probability of being classified as Type 1 glass increase \(\frac{e^{\beta_j}}{1 + e^{\beta_j}}\) (Note: also possible to explain from log-odds ratio). It maybe worth checking if Type 1 glass has overall high level of the chemical composition among all seven since all coefficients are positive and quite large.

Perform model diagnostics and identify problems with the above model.

We know that binomial GLM could potentially have dispersion issues due to the dispersion parameter being fixed at one. This could be checked by using a goodness-of-fit test based on residuals. Further, for discrete distributions in GLM, deviance is used for model diagnostics instead of pearson residuals. Hence, the goodness-of-fit test is perform as follows.

pr <- sum(residuals(update1, type = "deviance")^2) 
dispersion <- pr/update1$df.residual
pchisq(pr, update1$df.residual, lower.tail = FALSE)
## [1] 0.9143631

The test result suggest that the model fits well and there is no evidence of dispersion.

Fit a quasi-likelihood model with the same mean structure. Briefly discuss difference.

quasi <- glm(form, data = Glass, family = quasibinomial(link = "logit"))
quasi <- update(quasi, .~. - Fe - Al - RI)
summary(quasi)
## 
## Call:
## glm(formula = Type ~ Na + Mg + Si + K + Ca + Ba, family = quasibinomial(link = "logit"), 
##     data = Glass)
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -369.0581    75.2137  -4.907 1.87e-06 ***
## Na             2.5755     0.7546   3.413 0.000772 ***
## Mg             5.0890     0.8148   6.245 2.36e-09 ***
## Si             3.9260     0.8106   4.844 2.50e-06 ***
## K              3.5030     1.3662   2.564 0.011053 *  
## Ca             3.4646     0.6688   5.181 5.23e-07 ***
## Ba             4.1433     1.6264   2.547 0.011577 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasibinomial family taken to be 0.7587586)
## 
##     Null deviance: 270.54  on 213  degrees of freedom
## Residual deviance: 179.78  on 207  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 7

We compare the fitted quasi model with the one in \(\eqref{eq: model}\). The dispersion parameter changes from 1 to approximately 0.76 and there is no AIC value anymore Note: check why and explain. The residual deviance of the quasi model didn’t change. The dispersion parameter and the residual deviance agree with the goodness-of-test result. A binomial GLM is preferred. However, it is worth noting that the residual deviance is not ideal here compared to the null deviance. The model didn’t explain a large proportion of deviance in the data. Further improvement could include interaction terms, polynomial terms.

The Predict function can be used for prediction. Use the following codes and briefly explain the output.

set.seed(12345)
index <- sample(dim(Glass)[1], 10, replace = TRUE) 
predictions <- predict(update1, newdata = Glass[index, ], type = "response")

We use the logistic regression model in \(\eqref{eq: model}\) for predicting the probability of observations being in class 1 based on their chemical compositions. The 10 observations used are

names(predictions)
##  [1] "142" "51"  "208" "152" "58"  "93"  "75"  "96"  "2"   "86"

and their corresponding predicted probabilities of being in class 1 are

predictions
##         142          51         208         152          58          93 
## 0.700077627 0.887669450 0.001277154 0.778492379 0.379186110 0.272200378 
##          75          96           2          86 
## 0.380535477 0.127788231 0.350208184 0.286440541