Data represents species richness of invertebrates from 45 sites on various beaches on the Dutch coast. Among other factors, the height above average sea level (HAASL) (an indicator of the relative time each site was underwater) was recorded.
Read the data into an R object called “rikz”
rikz <- read.table(file = "RIKZ.txt", header = TRUE, sep = "")
Attach the data
attach(rikz)
Calculate species richness by creating a new column in the dataset called “Richness” and summing the content on the species columns to get the total number of species
Note: understand the difference between species RICHNESS and species ABUNDANCE
rikz$Richness <- rowSums(rikz[ , 2:76]>0) #creates a new column and counts how many of the values in columns 2 to 76 for each row (location) are greater than 0. This gives the number of species for each location. i.e. the species richness
Reattach the data
attach(rikz)
## The following objects are masked from rikz (pos = 3):
##
## angle1, angle2, Beach, C1, chalk, CR1, CR10, CR11, CR12, CR13,
## CR14, CR15, CR16, CR17, CR18, CR19, CR2, CR20, CR21, CR22, CR23,
## CR24, CR25, CR26, CR27, CR28, CR3, CR4, CR5, CR6, CR7, CR8, CR9,
## exposure, grainsize, HAASL, humus, I1, I2, I3, I4, I5, M1, M10,
## M11, M12, M14, M15, M16, M17, M2, M3, M4, M5, M6, M7, M8, M9, N1,
## P1, P10, P11, P12, P13, P14, P15, P16, P17, P18, P19, P2, P20, P21,
## P22, P24, P25, P3, P4, P5, P6, P7, P8, P9, penetrability, salinity,
## Sample, sorting1, temperature, week
We will start by fitting a Gaussian model to shown its inadequacy because of increasing variance and predictions outside biological reality (i.e. negative species richness)
Create a plot to see how species richness varies with HAASL
plot(HAASL, Richness, xlab = "HAASL", ylab = "Richness",
pch = 16, cex = 1.3, col = "blue")
Fit a linear model that uses HAASL to explain Richness using
lm()
, and get the summary of model outputs
fit <- lm(Richness ~ HAASL)
summary(fit)
##
## Call:
## lm(formula = Richness ~ HAASL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.0675 -2.7607 -0.8029 1.3534 13.8723
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6857 0.6578 10.164 5.25e-13 ***
## HAASL -2.8669 0.6307 -4.545 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.16 on 43 degrees of freedom
## Multiple R-squared: 0.3245, Adjusted R-squared: 0.3088
## F-statistic: 20.66 on 1 and 43 DF, p-value: 4.418e-05
Interpretation: for every one unit increase in HAASL, the species richness decreases by 2.9 units. When HAASL is zero (i.e. height at sea level), the species richness is ~6.7 on average.
Add the fitted model line to the plot
plot(HAASL, Richness, xlab = "HAASL", ylab = "Richness",
pch = 16, cex = 1.3, col = "blue")
abline(fit, lty=2, col="red")
Repeat the fitting of a linear model but this time using the
glm()
function. Call this model object rikz.1, and get the
summary of the model outputs.
rikz.1 <- glm(Richness ~ HAASL, family = gaussian(link = "identity"))
summary(rikz.1)
##
## Call:
## glm(formula = Richness ~ HAASL, family = gaussian(link = "identity"))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.6857 0.6578 10.164 5.25e-13 ***
## HAASL -2.8669 0.6307 -4.545 4.42e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 17.30501)
##
## Null deviance: 1101.64 on 44 degrees of freedom
## Residual deviance: 744.12 on 43 degrees of freedom
## AIC: 259.95
##
## Number of Fisher Scoring iterations: 2
What is different in the summary of outputs? Coefficient estimates? What else?
Coefficient estimates are all the same, but there are no residuals in the GLM output. And therefore no residual standard error or R2 results. Instead we have results for deviance and AIC. This is because GLM is a framework that is based on likelihood (deviance and AIC)
Remember: lm()
function is designed specifically for
linear models, whereas glm()
function is designed for
generalized linear models. When the glm function is used for linear
regression, both functions are used to predict a continuous response
variable and have the same assumptions (linearity, normality,
homoscedasticity, and independence).
Can you calculate what is the deviance explained by the model?
Deviance = ((null deviance - residual deviance)/(null deviance)) * 100
round(((rikz.1$null.deviance - rikz.1$deviance) / rikz.1$null.deviance)*100, 2)
## [1] 32.45
Interpretation: the predictor HAASL explains 32.45% of the total variability in Richness, suggesting that there may be other factors influencing richness that are not included in the model.
Note the difference between null deviance, residual deviance, and deviance explained
Null deviance: This represents the deviance of a model with no predictors, only the intercept.
Residual deviance: This represents the deviance of the model with the predictors included. It measures how well your fitted model with predictors explains the variation in the response variable compared to the null model. I.e. you see how much the deviance decreases by when adding predictors
Deviance explained: This is the difference between the null deviance and the residual deviance. It quantifies how much of the variability in the response variable is explained by the predictors in your model.
Check if the model violates the assumptions made. i.e., plot the diagnostic plots to check assumptions of linear model
par(mfrow = c(1,2))
plot(rikz.1)
Residuals vs Fitted: This shows a plot of the fitted values against the residuals and aids in assessing the linearity assumption between the predictor and outcome variables. Ideally, the points should exhibit random scattering around the red horizontal line at y = 0. However, some evidence of curvature suggests potential violations of the linearity assumption.
QQ Residuals: This shows a plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals. In this case, the QQ-plot does not point towards a normal error distribution as the higher quantiles lie away from the line. This suggests that the assumption of normality is not met.
Scale-Location: This plot shows the relationship between the fitted values (model predictions) and the square root of the absolute standardized residuals. It is used to identify deviations from homogeneity of variance. If the variance is constant, then the red line through the middle should be horizontal and flat. As this is not the case for our model (we have a trumpet plot), it means the assumption of homogeneity is violated.
Residuals vs Leverage: This plot is used to identify extreme values that may influence the regression results. In this case, three extreme points (#9, #10, #22) are observed. As shown by Cooks Distance, points #9 and #22 exceeds 3 standard deviations, meaning they are considered as potentially influential points.
As we are violating the assumptions, we can say this model does not follow a linear model
Let’s plot the predicted results of Richness versus HAASL, including the +/- 95% confidence intervals
rikz.pred <- predict(rikz.1, se.fit = TRUE) #uses the predict() function on the rikz.1 model to predict the fitted values for the response variable
rikz$fit <- rikz.pred$fit #adds a new column 'fit' to the rikz.1 data frame containing the predicted values from the model
rikz$lower.ci <- rikz.pred$fit - (qt(p = 0.975, df = dim(rikz)[1]) * rikz.pred$se.fit) #adds another column which calculates the lower bound of the 95% confidence interval for the predicted values.
rikz$upper.ci <- rikz.pred$fit + (qt(p = 0.975, df = dim(rikz)[1]) * rikz.pred$se.fit) #adds another column which calculates the upper bound of the 95% confidence interval for the predicted values.
plot(HAASL, Richness, ylim = c(-5, 22), col="blue")
lines(HAASL, rikz.pred$fit, col = 'blue')
lines(HAASL[order(HAASL)], rikz$lower.ci[order(HAASL)], col = 'blue', lty = 2)
lines(HAASL[order(HAASL)], rikz$upper.ci[order(HAASL)], col = 'blue', lty = 2)
abline(h = 0, lty = 4, col = 'red')
Interpretation: The linear model predicts that the outcome variable (in this case species richness) can take values that are less than 0. This is problematic as count data is inherently non-negative. This again confirms that we cannot use a linear model to predict richness
Fit a Poisson model that uses HAASL to explain Richness using glm(), and get the summary of model outputs
rikz.poisson <- glm(Richness ~ HAASL, family = poisson(link="log"))
summary(rikz.poisson)
##
## Call:
## glm(formula = Richness ~ HAASL, family = poisson(link = "log"))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.79100 0.06329 28.297 < 2e-16 ***
## HAASL -0.55597 0.07163 -7.762 8.39e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 179.75 on 44 degrees of freedom
## Residual deviance: 113.18 on 43 degrees of freedom
## AIC: 259.18
##
## Number of Fisher Scoring iterations: 5
Interpretation
The equation is given by log(Richness) = 1.79 - 0.56 * HAASL
Taking the exponent of the equation gives us Richness = exp(1.79 - 0.56 * HAASL)
The intercept represents the species richness when HAASL is zero (i.e. height is at sea level), which is exp(1.79) = 5.99
This means for every one unit increase in HAASL, the expected species richness is MULTIPLIED by exp(-0.56) = 0.57
It also means that for every unit increase in HAASL, the expected species richness DECREASES by approximately exp(-0.56) - 1 = 43%.
Calculate the deviance explained by the Poisson model
Deviance = ((null deviance - residual deviance)/(null deviance)) * 100
round(((rikz.poisson$null.deviance - rikz.poisson$deviance) / rikz.poisson$null.deviance)*100, 2)
## [1] 37.03
Interpretation: the predictor HAASL explains 37% of the total variability in Richness, suggesting that there may be other factors influencing Richness that are not included in the model
Recreate the plots with the predicted results of Richness versus HAASL, including the +/- 95% confidence interval
# Calculate fitted values +/- 95% confidence interval
rikz.poisson.pred <- predict(rikz.poisson, type = 'response', se.fit = TRUE)
rikz$fit.poisson <- rikz.poisson.pred$fit
rikz$lower.ci.poisson <- rikz.poisson.pred$fit - (qt(p = 0.975, df = dim(rikz)[1]) * rikz.poisson.pred$se.fit)
rikz$upper.ci.poisson <- rikz.poisson.pred$fit + (qt(p = 0.975, df = dim(rikz)[1]) * rikz.poisson.pred$se.fit)
# Plot of fitted line +/- 95% confidence interval
plot(HAASL, Richness)
lines(HAASL[order(HAASL)], rikz$fit.poisson[order(HAASL)], col = 'blue')
lines(HAASL[order(HAASL)], rikz$lower.ci.poisson[order(HAASL)], col = 'blue', lty = 2)
lines(HAASL[order(HAASL)], rikz$upper.ci.poisson[order(HAASL)], col = 'blue', lty = 2)
abline(h = 0, lty = 4, col = 'red')
Interpretation: no matter what height above sea level, the richness will never go below zero. This makes sense in the context of our count data
Read the dataset “police.csv” into R
police <- read.table(file = "police.csv", header = TRUE, sep = ",")
Do some data exploration to get yourself familiarised with the dataset
str(police)
## 'data.frame': 225 obs. of 4 variables:
## $ precinct: int 1 2 3 4 5 6 7 8 9 10 ...
## $ eth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ stops : int 202 132 752 385 517 635 301 925 144 560 ...
## $ arrests : int 980 753 2188 471 1719 808 1046 5667 320 1467 ...
head(police)
## precinct eth stops arrests
## 1 1 1 202 980
## 2 2 1 132 753
## 3 3 1 752 2188
## 4 4 1 385 471
## 5 5 1 517 1719
## 6 6 1 635 808
tail(police)
## precinct eth stops arrests
## 220 70 3 14 154
## 221 71 3 922 1064
## 222 72 3 391 438
## 223 73 3 637 1427
## 224 74 3 896 1493
## 225 75 3 295 312
summary(police)
## precinct eth stops arrests
## Min. : 1 Min. :1 Min. : 7.0 Min. : 16
## 1st Qu.:19 1st Qu.:1 1st Qu.: 133.0 1st Qu.: 312
## Median :38 Median :2 Median : 385.0 Median : 571
## Mean :38 Mean :2 Mean : 584.1 Mean :1051
## 3rd Qu.:57 3rd Qu.:3 3rd Qu.: 824.0 3rd Qu.:1467
## Max. :75 Max. :3 Max. :2771.0 Max. :5667
Attach the Data
attach(police)
Follow the model fitting described in lecture 10 to:
i) fit a poisson model with a constant term and an offset for arrests
fit.1 <- glm (stops ~ 1, family=poisson(link="log"), offset=log(arrests))
summary(fit.1)
##
## Call:
## glm(formula = stops ~ 1, family = poisson(link = "log"), offset = log(arrests))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.587715 0.002758 -213.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 46120 on 224 degrees of freedom
## Residual deviance: 46120 on 224 degrees of freedom
## AIC: 47829
##
## Number of Fisher Scoring iterations: 5
The purpose of this model, called the null model, is to establish a baseline or reference point for comparison with more complex models. It assumes that the number of stops is only a function of the exposure (arrests) and the overall average (intercept). It does not account for any differences due to ethnic groups or precincts.
ii) build on the model to include the predictor “ethnicity” as factor
fit.2 <- glm (stops ~ factor(eth), family=poisson(link="log"), offset=log(arrests))
summary(fit.2)
##
## Call:
## glm(formula = stops ~ factor(eth), family = poisson(link = "log"),
## offset = log(arrests))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.588086 0.003784 -155.40 <2e-16 ***
## factor(eth)2 0.070208 0.006061 11.58 <2e-16 ***
## factor(eth)3 -0.161581 0.008558 -18.88 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 46120 on 224 degrees of freedom
## Residual deviance: 45437 on 222 degrees of freedom
## AIC: 47150
##
## Number of Fisher Scoring iterations: 5
Remember that R organises the factors in alphabetical order. In this case, black is the reference, eth2 is hispanic, and eth3 is white.
Interpretation:
The equation is log(stops) = -0.59 + 0.07 * (if hispanic) - 0.16 * (if white)
Intercept
Hispanics vs Blacks
Hispanics have approximately exp(0.07) = 1.1 times the expected number of stops as blacks
Hispanics have about exp(0.07) - 1 = 7% more stops compared to blacks, in proportion to arrests
Whites vs Blacks
Whites have approximately exp(-0.16) = 0.8521 times the expected number of stops as blacks
Whites have about exp(-0.16) - 1 = 15% fewer stops compared to blacks, in proportion to arrests
Deviance: The deviance has decreased from 46120 (null model) to 45437, indicating an improvement in model fit. The reduction in deviance (683) is much greater than the minimum expected decrease of 2, suggesting that ethnic group is a significant predictor.
iii) build on the model to include the predictor “precinct” as factor at each step get the summary of model outputs
fit.3 <- glm (stops ~ factor(eth) + factor(precinct) , family=poisson(link="log"), offset=log(arrests))
summary(fit.3)
##
## Call:
## glm(formula = stops ~ factor(eth) + factor(precinct), family = poisson(link = "log"),
## offset = log(arrests))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.378868 0.051019 -27.027 < 2e-16 ***
## factor(eth)2 0.010188 0.006802 1.498 0.134190
## factor(eth)3 -0.419001 0.009435 -44.409 < 2e-16 ***
## factor(precinct)2 -0.149050 0.074030 -2.013 0.044077 *
## factor(precinct)3 0.559955 0.056758 9.866 < 2e-16 ***
## factor(precinct)4 1.210636 0.057549 21.037 < 2e-16 ***
## factor(precinct)5 0.282865 0.056794 4.981 6.34e-07 ***
## factor(precinct)6 1.144204 0.058047 19.712 < 2e-16 ***
## factor(precinct)7 0.218173 0.064335 3.391 0.000696 ***
## factor(precinct)8 -0.390565 0.056868 -6.868 6.51e-12 ***
## factor(precinct)9 0.485112 0.078151 6.207 5.39e-10 ***
## factor(precinct)10 0.417068 0.058851 7.087 1.37e-12 ***
## factor(precinct)11 0.654417 0.061586 10.626 < 2e-16 ***
## factor(precinct)12 1.148698 0.061067 18.811 < 2e-16 ***
## factor(precinct)13 1.053509 0.055250 19.068 < 2e-16 ***
## factor(precinct)14 0.614115 0.058803 10.444 < 2e-16 ***
## factor(precinct)15 1.054904 0.054956 19.195 < 2e-16 ***
## factor(precinct)16 0.797715 0.060290 13.231 < 2e-16 ***
## factor(precinct)17 -0.049312 0.060478 -0.815 0.414864
## factor(precinct)18 0.131645 0.054981 2.394 0.016649 *
## factor(precinct)19 0.319164 0.057999 5.503 3.74e-08 ***
## factor(precinct)20 -0.040574 0.056928 -0.713 0.476016
## factor(precinct)21 0.410988 0.056805 7.235 4.65e-13 ***
## factor(precinct)22 1.211227 0.053399 22.683 < 2e-16 ***
## factor(precinct)23 0.599919 0.055064 10.895 < 2e-16 ***
## factor(precinct)24 1.319454 0.054600 24.166 < 2e-16 ***
## factor(precinct)25 0.913359 0.053596 17.041 < 2e-16 ***
## factor(precinct)26 -0.147614 0.057006 -2.589 0.009612 **
## factor(precinct)27 1.895485 0.055664 34.052 < 2e-16 ***
## factor(precinct)28 -0.764964 0.060915 -12.558 < 2e-16 ***
## factor(precinct)29 1.124710 0.054267 20.726 < 2e-16 ***
## factor(precinct)30 0.524652 0.055374 9.475 < 2e-16 ***
## factor(precinct)31 1.649498 0.056167 29.368 < 2e-16 ***
## factor(precinct)32 1.386276 0.059679 23.229 < 2e-16 ***
## factor(precinct)33 1.083192 0.054214 19.980 < 2e-16 ***
## factor(precinct)34 1.518972 0.054355 27.945 < 2e-16 ***
## factor(precinct)35 0.879407 0.063190 13.917 < 2e-16 ***
## factor(precinct)36 1.616306 0.058796 27.490 < 2e-16 ***
## factor(precinct)37 1.409430 0.060008 23.487 < 2e-16 ***
## factor(precinct)38 1.763813 0.056195 31.388 < 2e-16 ***
## factor(precinct)39 0.501081 0.054841 9.137 < 2e-16 ***
## factor(precinct)40 1.531984 0.061467 24.924 < 2e-16 ***
## factor(precinct)41 1.913898 0.055716 34.351 < 2e-16 ***
## factor(precinct)42 1.084217 0.054511 19.890 < 2e-16 ***
## factor(precinct)43 0.541787 0.056645 9.565 < 2e-16 ***
## factor(precinct)44 0.883038 0.056671 15.582 < 2e-16 ***
## factor(precinct)45 0.838439 0.054215 15.465 < 2e-16 ***
## factor(precinct)46 0.655855 0.053750 12.202 < 2e-16 ***
## factor(precinct)47 1.128144 0.059981 18.808 < 2e-16 ***
## factor(precinct)48 0.411880 0.056895 7.239 4.51e-13 ***
## factor(precinct)49 1.101219 0.059681 18.452 < 2e-16 ***
## factor(precinct)50 0.878797 0.054109 16.241 < 2e-16 ***
## factor(precinct)51 0.348980 0.058550 5.960 2.52e-09 ***
## factor(precinct)52 0.491898 0.055458 8.870 < 2e-16 ***
## factor(precinct)53 0.599687 0.058486 10.254 < 2e-16 ***
## factor(precinct)54 0.491840 0.060042 8.192 2.58e-16 ***
## factor(precinct)55 0.497511 0.059023 8.429 < 2e-16 ***
## factor(precinct)56 0.890877 0.065982 13.502 < 2e-16 ***
## factor(precinct)57 1.532782 0.061366 24.978 < 2e-16 ***
## factor(precinct)58 1.578846 0.054528 28.955 < 2e-16 ***
## factor(precinct)59 1.083726 0.058920 18.393 < 2e-16 ***
## factor(precinct)60 0.833828 0.054216 15.380 < 2e-16 ***
## factor(precinct)61 1.277336 0.056757 22.506 < 2e-16 ***
## factor(precinct)62 1.206378 0.056666 21.289 < 2e-16 ***
## factor(precinct)63 1.308782 0.058301 22.449 < 2e-16 ***
## factor(precinct)64 1.712497 0.057772 29.642 < 2e-16 ***
## factor(precinct)65 1.899204 0.055154 34.434 < 2e-16 ***
## factor(precinct)66 2.102128 0.054231 38.763 < 2e-16 ***
## factor(precinct)67 1.202841 0.054957 21.887 < 2e-16 ***
## factor(precinct)68 2.201257 0.059245 37.155 < 2e-16 ***
## factor(precinct)69 1.720700 0.058227 29.551 < 2e-16 ***
## factor(precinct)70 1.032004 0.054813 18.828 < 2e-16 ***
## factor(precinct)71 1.479415 0.054239 27.276 < 2e-16 ***
## factor(precinct)72 1.464985 0.053737 27.262 < 2e-16 ***
## factor(precinct)73 0.991018 0.053585 18.494 < 2e-16 ***
## factor(precinct)74 1.151460 0.058023 19.845 < 2e-16 ***
## factor(precinct)75 1.571225 0.075731 20.747 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 46120.3 on 224 degrees of freedom
## Residual deviance: 3427.1 on 148 degrees of freedom
## AIC: 5287.8
##
## Number of Fisher Scoring iterations: 4
Interpretation:
Remember: the reference categories are black and precinct 1
The equation is log(stops) = -1.38 + 0.01 * (if hispanic) - 0.42 * (if white) - 0.15 (if precinct 2) + 0.56 (if precinct 3)
Intercept
Hispanics vs Blacks
Hispanics have approximately exp(0.01) = 1 times the expected number of stops as blacks
Hispanics have about exp(0.01) - 1 = 1% more stops compared to blacks, in proportion to arrests
Whites vs Blacks
Whites have approximately exp(-0.42) = 0.66 times the expected number of stops as blacks
Whites have about exp(-0.42) - 1 = 34% fewer stops compared to blacks, in proportion to arrests
Precinct 2 vs Precinct 1
Compared to precinct 1, precinct 2 has approximately exp(-0.15) = 0.86 times the expected number of stops, given the same number of arrests
Precinct 2 is expected to have about exp(-0.15) - 1 = 14% fewer stops than precinct 1, in proportion to arrests
Precinct 3 vs Precinct 1
Compared to precinct 1, precinct 3 has approximately exp(0.56) = 1.75 times the expected number of stops, given the same number of arrests
Precinct 3 is expected to have about exp(0.56) - 1 = 75% more stops than precinct 1, in proportion to arrests
Deviance: The deviance has decreased from 45437 (ethnicity only) to 3427 (ethnicity + precinct), indicating an improvement in model fit. The reduction in deviance (42010) is much greater than the minimum expected decrease of 75, suggesting that both ethnicity and precinct are significant predictors.
What test could you use to find out if the model is correctly specified? Tips: we have discrete data
Formulate the null and alternative hypothesis
H0: the Poisson regression model is correctly specified and adequately fits the data
HA: the Poisson regression model is not correctly specified and does not fit the data well
Conducting the Test:
Step 1: determine residual deviance. In this case its 3427.1
Step 2: determine degrees of freedom. In this case its 148. This was calculated by subtracting the total number of observations (225) from the total number of parameters (77, as we have 1 intercept, 2 ethnicity, 74 precincts)
Step 3: conduct the chi-square goodness of fit test
pchisq(3427.1, 148, lower.tail = FALSE) #lower tail is FALSE as the p-value in the chi-squared test is given by the upper tail probability
## [1] 0
As the p-value of 0 is much lower than the significance level of 0.05, we can reject the null hypothesis and conclude that the Poisson model does not fit the data.
WARNING: The Poisson distribution assumes that the mean and variance are equal. For small counts, the distribution is highly skewed and can have high variance relative to the mean. This makes the chi-square approximation less accurate, leading to unreliable p-values.
Calculate the overdispersion in model fit. Refer to formula in slides from lecture 10 - see slide 19
Overdispersion = sum of the squared residuals / degrees of freedom
Step 1: calculate residuals
predictions <- predict(fit.3, type="response")
residuals <- (stops - predictions) / sqrt(predictions)
Step 2: calculate sum of the squared residuals
sum_squared_residuals <- sum(residuals^2)
Step 3: calculate degrees of freedom
n <- dim(police)[1] #finds total number of observations in dataset
k <- length(coef(fit.3)) #finds the number of estimated regression coefficients in the model, including the intercept
degree_of_freedom <- n - k
Step 4: calculate overdispersion
overdispersion <- sum_squared_residuals/degree_of_freedom
overdispersion
## [1] 21.88505
Step 5: calculate the p-value for the overdispersion
p_value <- pchisq(sum_squared_residuals, n - k, lower.tail = FALSE)
p_value
## [1] 0
Interpretation
The overdispersion ratio compares the observed variance to the expected variance under the Poisson model.
A ratio close to 1 suggests that the model’s assumption of equidispersion (variance equal to the mean) is reasonable.
A ratio significantly greater than 1 indicates overdispersion, meaning the observed variance is greater than the Poisson model expects.
The p-value tests the null hypothesis that the data follow a Poisson distribution (i.e., no overdispersion).
An overdispersion of 21.9 is much greater than 1, indicating substantial overdispersion. A p-value of 2.2e-16 is extremely small, leading us to reject the null hypothesis and confirm that overdispersion is significant. This suggests that the Poisson model may not be appropriate, and an alternative model should be considered.
Note: when you have overdispersion, the standard errors are underestimated which means the p-value is inflated. This results in a type 1 error (false positive)
To correct for overdispersion, you can multiply all the standard errors by the square root of the overdispersion. This can done in R using a quasi-Poisson distribution
Re-fit model iii) using a quasipoisson distribution. Indicate what changed in the model outputs.
Fit the model using a quasipoisson distribution
fit.4 <- glm (stops ~ factor(eth) + factor(precinct) , family=quasipoisson (link="log"), offset=log(arrests))
summary(fit.4)
##
## Call:
## glm(formula = stops ~ factor(eth) + factor(precinct), family = quasipoisson(link = "log"),
## offset = log(arrests))
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.37887 0.23867 -5.777 4.33e-08 ***
## factor(eth)2 0.01019 0.03182 0.320 0.749294
## factor(eth)3 -0.41900 0.04414 -9.493 < 2e-16 ***
## factor(precinct)2 -0.14905 0.34632 -0.430 0.667549
## factor(precinct)3 0.55995 0.26552 2.109 0.036640 *
## factor(precinct)4 1.21064 0.26922 4.497 1.38e-05 ***
## factor(precinct)5 0.28287 0.26569 1.065 0.288772
## factor(precinct)6 1.14420 0.27155 4.214 4.35e-05 ***
## factor(precinct)7 0.21817 0.30097 0.725 0.469656
## factor(precinct)8 -0.39056 0.26604 -1.468 0.144202
## factor(precinct)9 0.48511 0.36560 1.327 0.186588
## factor(precinct)10 0.41707 0.27531 1.515 0.131934
## factor(precinct)11 0.65442 0.28811 2.271 0.024563 *
## factor(precinct)12 1.14870 0.28568 4.021 9.21e-05 ***
## factor(precinct)13 1.05351 0.25847 4.076 7.45e-05 ***
## factor(precinct)14 0.61411 0.27509 2.232 0.027089 *
## factor(precinct)15 1.05490 0.25709 4.103 6.71e-05 ***
## factor(precinct)16 0.79771 0.28205 2.828 0.005328 **
## factor(precinct)17 -0.04931 0.28292 -0.174 0.861874
## factor(precinct)18 0.13165 0.25721 0.512 0.609540
## factor(precinct)19 0.31916 0.27133 1.176 0.241364
## factor(precinct)20 -0.04057 0.26632 -0.152 0.879117
## factor(precinct)21 0.41099 0.26574 1.547 0.124105
## factor(precinct)22 1.21123 0.24981 4.849 3.11e-06 ***
## factor(precinct)23 0.59992 0.25760 2.329 0.021217 *
## factor(precinct)24 1.31945 0.25543 5.166 7.62e-07 ***
## factor(precinct)25 0.91336 0.25073 3.643 0.000372 ***
## factor(precinct)26 -0.14761 0.26668 -0.554 0.580740
## factor(precinct)27 1.89549 0.26040 7.279 1.83e-11 ***
## factor(precinct)28 -0.76496 0.28497 -2.684 0.008096 **
## factor(precinct)29 1.12471 0.25387 4.430 1.82e-05 ***
## factor(precinct)30 0.52465 0.25905 2.025 0.044633 *
## factor(precinct)31 1.64950 0.26276 6.278 3.61e-09 ***
## factor(precinct)32 1.38628 0.27919 4.965 1.87e-06 ***
## factor(precinct)33 1.08319 0.25362 4.271 3.47e-05 ***
## factor(precinct)34 1.51897 0.25428 5.974 1.65e-08 ***
## factor(precinct)35 0.87941 0.29561 2.975 0.003424 **
## factor(precinct)36 1.61631 0.27506 5.876 2.67e-08 ***
## factor(precinct)37 1.40943 0.28073 5.021 1.46e-06 ***
## factor(precinct)38 1.76381 0.26289 6.709 3.88e-10 ***
## factor(precinct)39 0.50108 0.25655 1.953 0.052690 .
## factor(precinct)40 1.53198 0.28755 5.328 3.63e-07 ***
## factor(precinct)41 1.91390 0.26065 7.343 1.29e-11 ***
## factor(precinct)42 1.08422 0.25501 4.252 3.74e-05 ***
## factor(precinct)43 0.54179 0.26499 2.045 0.042674 *
## factor(precinct)44 0.88304 0.26512 3.331 0.001094 **
## factor(precinct)45 0.83844 0.25363 3.306 0.001188 **
## factor(precinct)46 0.65585 0.25145 2.608 0.010032 *
## factor(precinct)47 1.12814 0.28060 4.020 9.22e-05 ***
## factor(precinct)48 0.41188 0.26616 1.547 0.123887
## factor(precinct)49 1.10122 0.27920 3.944 0.000123 ***
## factor(precinct)50 0.87880 0.25313 3.472 0.000678 ***
## factor(precinct)51 0.34898 0.27390 1.274 0.204627
## factor(precinct)52 0.49190 0.25944 1.896 0.059911 .
## factor(precinct)53 0.59969 0.27361 2.192 0.029957 *
## factor(precinct)54 0.49184 0.28089 1.751 0.082013 .
## factor(precinct)55 0.49751 0.27612 1.802 0.073614 .
## factor(precinct)56 0.89088 0.30867 2.886 0.004484 **
## factor(precinct)57 1.53278 0.28708 5.339 3.44e-07 ***
## factor(precinct)58 1.57885 0.25509 6.189 5.64e-09 ***
## factor(precinct)59 1.08373 0.27564 3.932 0.000129 ***
## factor(precinct)60 0.83383 0.25363 3.288 0.001262 **
## factor(precinct)61 1.27734 0.26552 4.811 3.67e-06 ***
## factor(precinct)62 1.20638 0.26509 4.551 1.11e-05 ***
## factor(precinct)63 1.30878 0.27274 4.799 3.87e-06 ***
## factor(precinct)64 1.71250 0.27027 6.336 2.68e-09 ***
## factor(precinct)65 1.89920 0.25802 7.361 1.17e-11 ***
## factor(precinct)66 2.10213 0.25370 8.286 6.48e-14 ***
## factor(precinct)67 1.20284 0.25710 4.679 6.46e-06 ***
## factor(precinct)68 2.20126 0.27716 7.942 4.59e-13 ***
## factor(precinct)69 1.72070 0.27240 6.317 2.96e-09 ***
## factor(precinct)70 1.03200 0.25642 4.025 9.08e-05 ***
## factor(precinct)71 1.47941 0.25374 5.830 3.34e-08 ***
## factor(precinct)72 1.46498 0.25139 5.828 3.39e-08 ***
## factor(precinct)73 0.99102 0.25068 3.953 0.000119 ***
## factor(precinct)74 1.15146 0.27144 4.242 3.89e-05 ***
## factor(precinct)75 1.57122 0.35428 4.435 1.79e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for quasipoisson family taken to be 21.88506)
##
## Null deviance: 46120.3 on 224 degrees of freedom
## Residual deviance: 3427.1 on 148 degrees of freedom
## AIC: NA
##
## Number of Fisher Scoring iterations: 4
Interpretation: All coefficient estimates are the same but the standard errors are now corrected for the overdispersion. This adjustment has resulted in larger standard errors compared to the Poisson model. Due to the larger standard errors, the significance levels (p-values) for some predictors have changed, with fewer predictors being statistically significant.
Note: The deviance remains the same because it is a measure of the fit of the model to the data and is not directly affected by the method used to account for overdispersion.
The Akaike Information Criterion (AIC) is not provided in the quasipoisson model summary. This is because AIC is not appropriate for quasipoisson models, as it is based on the likelihood which is not well-defined for quasi-likelihood models.
Notice how the output states: Dispersion parameter for quasipoisson family taken to be 21.88506. This means we are not violating the assumption of variance anymore as our mean is equal to the variance