## Question 2
library(faraway)
library(ResourceSelection)
# str(bliss)
# str(gala)
model1 = glm(Species~Area + Elevation + Nearest + Scruz + Adjacent, family = "poisson",data = gala)
summary(model1)##
## Call:
## glm(formula = Species ~ Area + Elevation + Nearest + Scruz +
## Adjacent, family = "poisson", data = gala)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -8.2752 -4.4966 -0.9443 1.9168 10.1849
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 3.155e+00 5.175e-02 60.963 < 2e-16 ***
## Area -5.799e-04 2.627e-05 -22.074 < 2e-16 ***
## Elevation 3.541e-03 8.741e-05 40.507 < 2e-16 ***
## Nearest 8.826e-03 1.821e-03 4.846 1.26e-06 ***
## Scruz -5.709e-03 6.256e-04 -9.126 < 2e-16 ***
## Adjacent -6.630e-04 2.933e-05 -22.608 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 3510.73 on 29 degrees of freedom
## Residual deviance: 716.85 on 24 degrees of freedom
## AIC: 889.68
##
## Number of Fisher Scoring iterations: 5
Adjusted dependent variable formula:
\(n=log(\mu)\)
\(n'=1/\mu\)
\(V(\mu)=\mu\)
\(w=\mu\)
\[ z_{o} = log(\mu) + (y-\mu)/\mu \]
Area, elevation and adjacent are all extremely close with nearest and scruz still being very close. Only intercept appears to be of the mark a bit but still relatively close overall.
### b
##n = sum(gala$Species)
y = gala$Species
mu = y
eta = log(mu)
z = eta + (y-mu)/(mu)
w = mu
lmod = lm(z~ Area + Elevation + Nearest + Scruz + Adjacent, weights = w,gala)
coef(lmod)## (Intercept) Area Elevation Nearest Scruz
## 3.5191545412 -0.0005298484 0.0031643557 0.0025188990 -0.0037899780
## Adjacent
## -0.0006623523
coef(model1)## (Intercept) Area Elevation Nearest Scruz
## 3.1548078779 -0.0005799429 0.0035405940 0.0088255719 -0.0057094223
## Adjacent
## -0.0006630311
# deviance(model1)
# deviance(lmod)This is getting a lot closer to the deviance of glm model at 828.0096 where the glm is at 716.8458. Still a great improvement from the null deviance of 3510.73.
eta = lmod$fitted.values
mu = exp(eta)
z = eta + (y-mu)/(mu)
w = mu
lmod = lm(z~ Area + Elevation + Nearest + Scruz + Adjacent, weights = w,gala)
d = 2*sum(y*log(y/mu)-(y-mu))
d## [1] 828.0096
deviance(model1)## [1] 716.8458
coef(lmod)## (Intercept) Area Elevation Nearest Scruz
## 3.2102594447 -0.0005651969 0.0034606226 0.0077171134 -0.0052400871
## Adjacent
## -0.0006604828
coef(model1)## (Intercept) Area Elevation Nearest Scruz
## 3.1548078779 -0.0005799429 0.0035405940 0.0088255719 -0.0057094223
## Adjacent
## -0.0006630311
The coefficents and deviance are nearly almost exactly the same now with the deviance showing the greatest difference being 3 higher than the glm model.
eta = lmod$fitted.values
mu = exp(eta)
z = eta + (y-mu)/(mu)
w = mu
lmod = lm(z~ Area + Elevation + Nearest + Scruz + Adjacent, weights = w,gala)
d = 2*sum(y*log(y/mu)-(y-mu))
d## [1] 719.4158
deviance(model1)## [1] 716.8458
coef(lmod)## (Intercept) Area Elevation Nearest Scruz
## 3.1562582546 -0.0005793855 0.0035379237 0.0087861184 -0.0056868875
## Adjacent
## -0.0006630167
coef(model1)## (Intercept) Area Elevation Nearest Scruz
## 3.1548078779 -0.0005799429 0.0035405940 0.0088255719 -0.0057094223
## Adjacent
## -0.0006630311
There is no difference after 5 iterations the deviance and coefficents are pretty much exactly the same.
for(i in 1:2){
eta = lmod$fitted.values
mu = exp(eta)
z = eta + (y-mu)/(mu)
w = mu
lmod = lm(z~ Area + Elevation + Nearest + Scruz + Adjacent, weights = w,gala)
d = 2*sum(y*log(y/mu)-(y-mu))
d
cat(i,coef(lmod),d,"\n")
}## 1 3.154809 -0.0005799422 0.003540591 0.008825509 -0.00570938 -0.0006630313 716.8488
## 2 3.154808 -0.0005799429 0.003540594 0.008825572 -0.005709422 -0.0006630311 716.8458
deviance(model1)## [1] 716.8458
coef(model1)## (Intercept) Area Elevation Nearest Scruz
## 3.1548078779 -0.0005799429 0.0035405940 0.0088255719 -0.0057094223
## Adjacent
## -0.0006630311
Standard errors when computed are all smaller than the ones produced from the imod model but comparing to the glm() model it is the same.
summary(lmod)##
## Call:
## lm(formula = z ~ Area + Elevation + Nearest + Scruz + Adjacent,
## data = gala, weights = w)
##
## Weighted Residuals:
## Min 1Q Median 3Q Max
## -7.4483 -3.6198 -0.9279 2.0367 13.0984
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.1548079 0.2915894 10.819 1.03e-10 ***
## Area -0.0005799 0.0001480 -3.918 0.000649 ***
## Elevation 0.0035406 0.0004925 7.189 1.98e-07 ***
## Nearest 0.0088256 0.0102621 0.860 0.398291
## Scruz -0.0057094 0.0035251 -1.620 0.118379
## Adjacent -0.0006630 0.0001652 -4.012 0.000511 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.635 on 24 degrees of freedom
## Multiple R-squared: 0.7611, Adjusted R-squared: 0.7114
## F-statistic: 15.3 on 5 and 24 DF, p-value: 8.607e-07
xm = model.matrix(lmod)
wm = diag(w)
sqrt(diag(solve(t(xm) %*% wm %*% xm)))## (Intercept) Area Elevation Nearest Scruz
## 5.174951e-02 2.627298e-05 8.740703e-05 1.821260e-03 6.256200e-04
## Adjacent
## 2.932754e-05
Shown in the contour plot we can see blotch appears to increase as site and variety increase as a general trend. This is shown in more detail in the individual line plots of blotch against variety and site respectively. Generally speaking we can see as variety and site increase the blotch increases as well and looks to be a constantly increasing relationship as it goes up with a few expceptions. Some of the lower numbered sites and varieties change is a lot less significant as the values increase.
Boxplots further show that generally the medians and variance increase as site and variety increase as well. Site 1,2,6 and variety 1,2,3,4,6 alll have very low medians and variance whereas site 7,8,9 and variety 7,8,9,10 all have greater medians and variance.
Null deviance: 40.8029 on 89 degrees of freedom
Residual deviance: 6.1264 on 72 degrees of freedom
Appears to be a good fit to the model as we have reduced our deviance reports the goodness of fit of our model of generalized linear model. Our model has lost a great portion of it’s deviance for 17 degrees of freedom.
Using the Hosmer/Lemeshow goodness of fit test we can see out model fits well (p-value = 0.9996) showing no significant difference between the model and the observed data.
Also using the chisq() to check goodness of fit (compares size of residual deviance to its degrees of freedom), we can see a very high p-value (1) indicating no evidence of a lack of fit. This does not mean a better model doesn’t exist or that this is the best model.
model3 = glm(formula = blotch~site+variety, family = "binomial",data = leafblotch)## Warning: non-integer #successes in a binomial glm!
summary(model3)##
## Call:
## glm(formula = blotch ~ site + variety, family = "binomial", data = leafblotch)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -0.64431 -0.13546 -0.02061 0.09628 0.81005
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.0546 4.7723 -1.688 0.0915 .
## site2 1.6391 4.8440 0.338 0.7351
## site3 3.3265 4.5282 0.735 0.4626
## site4 3.5822 4.5122 0.794 0.4273
## site5 3.5838 4.5121 0.794 0.4270
## site6 3.8932 4.4980 0.866 0.3867
## site7 4.7299 4.4798 1.056 0.2910
## site8 5.5226 4.4792 1.233 0.2176
## site9 6.7945 4.4996 1.510 0.1310
## variety2 0.1501 2.4288 0.062 0.9507
## variety3 0.6895 2.2566 0.306 0.7600
## variety4 1.0481 2.1796 0.481 0.6306
## variety5 1.6147 2.0998 0.769 0.4419
## variety6 2.3711 2.0440 1.160 0.2460
## variety7 2.5712 2.0354 1.263 0.2065
## variety8 3.3419 2.0189 1.655 0.0979 .
## variety9 3.4999 2.0182 1.734 0.0829 .
## variety10 4.2529 2.0279 2.097 0.0360 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 40.8029 on 89 degrees of freedom
## Residual deviance: 6.1264 on 72 degrees of freedom
## AIC: 70.44
##
## Number of Fisher Scoring iterations: 8
hoslem.test(leafblotch$blotch, fitted(model3))##
## Hosmer and Lemeshow goodness of fit (GOF) test
##
## data: leafblotch$blotch, fitted(model3)
## X-squared = 0.66998, df = 8, p-value = 0.9996
1-pchisq(deviance(model3), df.residual(model3))## [1] 1
Dispersion parameter = 0.08878094
Dispersion parameter for quasibinomial family taken to be 0.08878094. This is dispersion parameter is derived from the fitted model by taking the sum of the squared Pearson residuals and dividing it by the residual degrees of freedom.
model4 = glm(formula = blotch~site+variety, family = "quasibinomial",data = leafblotch)
# Manual vs Automatic
y = leafblotch$blotch
mu = fitted(model4)
muh = mu*(1-mu)
sum(((y - mu)^2)/muh) / (90-18)## [1] 0.08878094
sum( residuals(model4, "pearson")^2 ) / model4$df.residual## [1] 0.08878094
summary(model4)$dispersion## [1] 0.08878094
We see the variance of the residuals is increasing with error (lower values have lower deviance and higher values have higher deviance) indicating that our variance function is probably no reasonable.
plot(predict(model4),residuals(model4,type="deviance"),
xlab="Linear Predictor", ylab="Deviance Residuals")##plot(residuals(model4)~log(fitted(model4)), ylab = "Deviance Residuals")Is an imrovement from the last plot as less of a pattern (and only more deviance still for much higher values) but there is still a funnel pattern in the residuals indicating our variance functions is still likely unreasonable.
\(n=log(\mu/1-\mu)\)
\(n'=1/\mu(1-\mu)\)
\(V(\mu)=\mu(1-\mu)\)
\(w=\mu(1-\mu)\)
y = leafblotch$blotch
mu = fitted(model4)
w = 1/(mu*(1-mu))
imod = glm(blotch~site+variety, family = quasi(link = logit,variance = "mu(1-mu)"),data = leafblotch, weights = w)
plot(predict(imod),residuals(imod,type="deviance"),
xlab="Linear Predictor", ylab="Deviance Residuals")Both equally important at predicting with very low p-values of 2.2e-16 , I wouldn’t drop either in this case.
drop1(model4,test="F")## Single term deletions
##
## Model:
## blotch ~ site + variety
## Df Deviance F value Pr(>F)
## <none> 6.1264
## site 8 28.6200 33.044 < 2.2e-16 ***
## variety 9 22.2273 21.025 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Through the graphs in a) the combination of site and variety that show the greatest indication of an interation would be the higher values such as site 7,8,9 and variety 7,8,9,10.
An interaction is when the effect of one variable on an outcome depends on the state of a second variable. It can be seen in the line plots when one is high the other generally is higher aswell.In the case you have a high numbered site in combination with a high numbered variety there is a greater effect on blotch.