Assignment 4

This assignment is worth 5 marks.

This is at home group assignment and it is built on Assignment 2 and 3.

Instructions are in the next page

The use of LLMs is not permitted in this module.

You can do this assignment based on the code I have provided and your solutions for Assignment 2 and 3. Use your brain not chatgp

Group Number: 5

Name	Surname	Username	Student-Number
Conor	McCormack	mccorc16	23372769

I have shared my implementation for myglm, which includes:

The Binomial and Gaussian distributions
The computation of the Maximum Likelihood Estimate for the $\beta$s using Newton-Raphson (as done in Assignment 2)
For the Gaussian distribution (which has an additional parameter), I use a two-stage (alternating) optimisation (as done in Assignment 3) to estimate both the $\beta$s and the additional parameter $\sigma$ in a loop.
I have also implemented functions that compute all statistics reported in the output of glm (standard errors, deviance, AIC, etc.) distinguishing between distributions which don’t have (the Binomial) and have (Gaussian) an additional parameter.

Your goal is simply to extend my function myglm to include:

The Poisson distribution (with a log link function)
The Negative Binomial distribution (with a log link function, your implementation from Assignment 3)

You need also to improve my code in two ways:

In my code I used the R functiondbinom and dnorm to define the log-likelihood. We are not allowed to use functions from R, so replace them with an implementation from scratch of the log-likelihood using the functions $a,b,c,d$. Do the same for Poisson and Negative Binomial.
In my code, the function alteroptim, which implements the alternating optimisation loop, currently does not include a stopping criterion. In particular, the loop continues even when the changes in the estimated coefficients $\beta$s and the distribution additional parameter other_mle between successive iterations fall below a specified tolerance level. Similar to the implementation of the Newton–Raphson algorithm, the loop should terminate once these parameter updates are sufficiently small (below a tolerance). Introducing such a stopping criterion would prevent unnecessary iterations and reduce computational time. Implement this stopping criterion.

Finally, you need to test you code to check that the output is correct. I provided a test function you can use for Gaussian and Binomial by comparing the output of myglm to that of glm. You can do the same for Poisson. For the Negative Binomial, you can use this output as a test for your function:

set.seed(123)
X = cbind(c(1,1,1,1,1,1),c(1,2,3,4,5,6))
Y =rnbinom(6,size=36,mu=exp(X[,2])) # c(4,   9,  23,  43, 119, 430)
myglm(X,Y,"negbinomial","log")

$estimate
           [,1]
[1,] 0.04913257
[2,] 0.98065866

$stdError
[1] 0.33440178 0.06770406

$z
           [,1]
[1,]  0.1469268
[2,] 14.4844882

$pvalz
             [,1]
[1,] 0.8902977264
[2,] 0.0001320879

$other_parameter_mle
[1] 66.50724

$dispersion
[1] 1.11446

$deviance
[1] 4.492353

$deviance_null
[1] 7.357109

$AIC
[1] 46.93867

Important: Use R-studio debugger to find any errors/issues in your code: https://support.posit.co/hc/en-us/articles/205612627-Debugging-with-the-RStudio-IDE

Expected Output: Submit your myglm.R function (as a R function) and a Rmd notebook which shows, using the tests, that the code works for both Poisson and Negative Binomial.

#load myglm

source("myglm.R")

#test Poisson

We compare the output of myglm with the output of glm for a Poisson regression with a log link function.

X = cbind(c(1,1,1,1,1,1),c(1,2,3,4,5,6))
Y = matrix(c(1, 3, 5, 9, 20, 40))

cat("=== myglm output ===\n")

## === myglm output ===

output = myglm(X, Y, "poisson", "log")

## [1] "Warning: the deviance of the null model is computed using a model including  the intercept only.\n\n        This is different from what the standard GLM does."
## [1] "Warning: the deviance of the null model is computed using a model including  the intercept only.\n\n        This is different from what the standard GLM does."

print(output)

## $estimate
##            [,1]
## [1,] -0.5060043
## [2,]  0.6983309
## 
## $stdError
## [1] 0.50004916 0.09545421
## 
## $z
##           [,1]
## [1,] -1.011909
## [2,]  7.315873
## 
## $pvalz
##              [,1]
## [1,] 3.115815e-01
## [2,] 2.557954e-13
## 
## $other_parameter_mle
## [1] NaN
## 
## $dispersion
## [1] 1
## 
## $deviance
## [1] 0.2410295
## 
## $deviance_null
## [1] 77.04364
## 
## $AIC
## [1] 27.13996

cat("\n=== glm output ===\n")

## 
## === glm output ===

model = glm(Y ~ X[,2], family=poisson)
print(summary(model))

## 
## Call:
## glm(formula = Y ~ X[, 2], family = poisson)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.50600    0.50005  -1.012    0.312    
## X[, 2]       0.69833    0.09545   7.316 2.56e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 77.04364  on 5  degrees of freedom
## Residual deviance:  0.24103  on 4  degrees of freedom
## AIC: 27.14
## 
## Number of Fisher Scoring iterations: 4

The estimates, standard errors, z-values, p-values, deviance and AIC from myglm match those from glm.

#test Negative Binomial

We use the test data provided in the assignment to verify that the Negative Binomial implementation produces the correct output.

set.seed(123)
X = cbind(c(1,1,1,1,1,1),c(1,2,3,4,5,6))
Y =rnbinom(6,size=36,mu=exp(X[,2])) # c(4, 9, 23, 43, 119, 430)
Y = matrix(Y)
cat("Y values:", Y, "\n")

## Y values: 4 9 23 43 119 430

cat("\n=== myglm output ===\n")

## 
## === myglm output ===

output = myglm(X, Y, "negbinomial", "log")

## [1] "Warning: the deviance of the null model is computed using a model including  the intercept only.\n\n        This is different from what the standard GLM does."
## [1] "Warning: the deviance of the null model is computed using a model including  the intercept only.\n\n        This is different from what the standard GLM does."

print(output)

## $estimate
##            [,1]
## [1,] 0.04913259
## [2,] 0.98065866
## 
## $stdError
## [1] 0.33440177 0.06770406
## 
## $z
##            [,1]
## [1,]  0.1469268
## [2,] 14.4844883
## 
## $pvalz
##              [,1]
## [1,] 0.8902976781
## [2,] 0.0001320879
## 
## $other_parameter_mle
## [1] 66.50723
## 
## $dispersion
## [1] 1.11446
## 
## $deviance
## [1] 4.492353
## 
## $deviance_null
## [1] 7.357113
## 
## $AIC
## [1] 46.93867

The output matches the expected values:

$estimate: 0.04913257, 0.98065866
$stdError: 0.33440178, 0.06770406
$z: 0.1469268, 14.4844882
$pvalz: 0.8902977264, 0.0001320879
$other_parameter_mle: 66.50724
$dispersion: 1.11446
$deviance: 4.492353
$deviance_null: 7.357109
$AIC: 46.93867

Finally, in your Rmd notebook, complete the following table that summarises how to compute the statistics for GLM model based on a distribution with or without an additional parameter

Component / Statistic	GLM without additional parameter	GLM with additional parameter ($\psi$)
Example distribution	Poisson, Binomial	Gaussian, Negative Binomial
Estimation of $\beta$s	MLE via Newton-Raphson	Alternating Optimisation
Estimate of dispersion $\hat{\phi}$	NA	Pearson residuals: $\frac{1}{N - n_\beta}\sum_{i=1}^{N}\frac{(Y_i - \hat{\mu})^2}{Var(Y_i)}$
Standard errors for $\beta$s	From Fisher information: $\text{SE}(\hat{\beta}_j) = \sqrt{[-\mathcal{H}(\hat{\beta})^{-1}]_{jj}}$	Adjusted by $\hat{\phi}$: $\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\phi} \cdot [-\mathcal{H}(\hat{\beta})^{-1}]_{jj}}$
Test statistics	z-tests: $z = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}$	t-tests: $t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}$
P-value	$2(1 - \Phi(\left\|\frac{\hat{\beta}_j}{\sigma_j}\right\|))$	$2\left(1 - CDF_{\text{student}}\left(\left\|\frac{\hat{\beta}_j}{\sigma_j}\right\|,\, N - n_\beta\right)\right)$
Deviance	$2(\ln L_{\text{perf}} - \ln L_{\text{model}})$	$2(\ln L_{\text{perf}} - \ln L_{\text{model}})$
AIC	$-2 \ln L_{\text{model}} + 2(n_p + 1)$	$-2 \ln L_{\text{model}} + 2(n_p + 1)$

Completing this table will show that you understand how these statistics are computed. This table is also useful for the exam. See my slides 7-linear-regression.

Component / Statistic	GLM without additional parameter	GLM with additional parameter (\(\psi\))
Example distribution	Poisson, Binomial	Gaussian, Negative Binomial
Estimation of \(\beta\)s	MLE via Newton-Raphson	Alternating Optimisation
Estimate of dispersion \(\hat{\phi}\)	NA	Pearson residuals: \(\frac{1}{N - n_\beta}\sum_{i=1}^{N}\frac{(Y_i - \hat{\mu})^2}{Var(Y_i)}\)
Standard errors for \(\beta\)s	From Fisher information: \(\text{SE}(\hat{\beta}_j) = \sqrt{[-\mathcal{H}(\hat{\beta})^{-1}]_{jj}}\)	Adjusted by \(\hat{\phi}\): \(\text{SE}(\hat{\beta}_j) = \sqrt{\hat{\phi} \cdot [-\mathcal{H}(\hat{\beta})^{-1}]_{jj}}\)
Test statistics	z-tests: \(z = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}\)	t-tests: \(t = \frac{\hat{\beta}_j}{\text{SE}(\hat{\beta}_j)}\)
P-value	\(2(1 - \Phi(\left\|\frac{\hat{\beta}_j}{\sigma_j}\right\|))\)	\(2\left(1 - CDF_{\text{student}}\left(\left\|\frac{\hat{\beta}_j}{\sigma_j}\right\|,\, N - n_\beta\right)\right)\)
Deviance	\(2(\ln L_{\text{perf}} - \ln L_{\text{model}})\)	\(2(\ln L_{\text{perf}} - \ln L_{\text{model}})\)
AIC	\(-2 \ln L_{\text{model}} + 2(n_p + 1)\)	\(-2 \ln L_{\text{model}} + 2(n_p + 1)\)