Exercise 1

Data represents species richness of invertebrates from 45 sites on various beaches on the Dutch coast. Among other factors, the height above average sea level (HAASL) (an indicator of the relative time each site was underwater) was recorded.

Read the data into an R object called “rikz”

rikz <- read.table(file = "RIKZ.txt", header = TRUE, sep = "")

Attach the data

attach(rikz)

Calculate species richness by creating a new column in the dataset called “Richness” and summing the content on the species columns to get the total number of species

Note: understand the difference between species RICHNESS and species ABUNDANCE

rikz$Richness <- rowSums(rikz[ , 2:76]>0) #creates a new column and counts how many of the values in columns 2 to 76 for each row (location) are greater than 0. This gives the number of species for each location. i.e. the species richness

Reattach the data

attach(rikz)

## The following objects are masked from rikz (pos = 3):
## 
##     angle1, angle2, Beach, C1, chalk, CR1, CR10, CR11, CR12, CR13,
##     CR14, CR15, CR16, CR17, CR18, CR19, CR2, CR20, CR21, CR22, CR23,
##     CR24, CR25, CR26, CR27, CR28, CR3, CR4, CR5, CR6, CR7, CR8, CR9,
##     exposure, grainsize, HAASL, humus, I1, I2, I3, I4, I5, M1, M10,
##     M11, M12, M14, M15, M16, M17, M2, M3, M4, M5, M6, M7, M8, M9, N1,
##     P1, P10, P11, P12, P13, P14, P15, P16, P17, P18, P19, P2, P20, P21,
##     P22, P24, P25, P3, P4, P5, P6, P7, P8, P9, penetrability, salinity,
##     Sample, sorting1, temperature, week

Linear Regression

We will start by fitting a Gaussian model to shown its inadequacy because of increasing variance and predictions outside biological reality (i.e. negative species richness)

Create a plot to see how species richness varies with HAASL

plot(HAASL, Richness, xlab = "HAASL", ylab = "Richness", 
pch = 16, cex = 1.3, col = "blue")

Fit a linear model that uses HAASL to explain Richness using lm(), and get the summary of model outputs

fit <- lm(Richness ~ HAASL)
summary(fit)

## 
## Call:
## lm(formula = Richness ~ HAASL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.0675 -2.7607 -0.8029  1.3534 13.8723 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6857     0.6578  10.164 5.25e-13 ***
## HAASL        -2.8669     0.6307  -4.545 4.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.16 on 43 degrees of freedom
## Multiple R-squared:  0.3245, Adjusted R-squared:  0.3088 
## F-statistic: 20.66 on 1 and 43 DF,  p-value: 4.418e-05

Interpretation: for every one unit increase in HAASL, the species richness decreases by 2.9 units. When HAASL is zero (i.e. height at sea level), the species richness is ~6.7 on average.

Add the fitted model line to the plot

plot(HAASL, Richness, xlab = "HAASL", ylab = "Richness", 
pch = 16, cex = 1.3, col = "blue")
  abline(fit, lty=2, col="red")

Repeat the fitting of a linear model but this time using the glm() function. Call this model object rikz.1, and get the summary of the model outputs.

rikz.1 <- glm(Richness ~ HAASL, family = gaussian(link = "identity"))
summary(rikz.1)

## 
## Call:
## glm(formula = Richness ~ HAASL, family = gaussian(link = "identity"))
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   6.6857     0.6578  10.164 5.25e-13 ***
## HAASL        -2.8669     0.6307  -4.545 4.42e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 17.30501)
## 
##     Null deviance: 1101.64  on 44  degrees of freedom
## Residual deviance:  744.12  on 43  degrees of freedom
## AIC: 259.95
## 
## Number of Fisher Scoring iterations: 2

What is different in the summary of outputs? Coefficient estimates? What else?

Coefficient estimates are all the same, but there are no residuals in the GLM output. And therefore no residual standard error or R2 results. Instead we have results for deviance and AIC. This is because GLM is a framework that is based on likelihood (deviance and AIC)

Remember: lm() function is designed specifically for linear models, whereas glm() function is designed for generalized linear models. When the glm function is used for linear regression, both functions are used to predict a continuous response variable and have the same assumptions (linearity, normality, homoscedasticity, and independence).

Can you calculate what is the deviance explained by the model?

Deviance = ((null deviance - residual deviance)/(null deviance)) * 100

round(((rikz.1$null.deviance - rikz.1$deviance) / rikz.1$null.deviance)*100, 2)

## [1] 32.45

Interpretation: the predictor HAASL explains 32.45% of the total variability in Richness, suggesting that there may be other factors influencing richness that are not included in the model.

Note the difference between null deviance, residual deviance, and deviance explained

Null deviance: This represents the deviance of a model with no predictors, only the intercept.
Residual deviance: This represents the deviance of the model with the predictors included. It measures how well your fitted model with predictors explains the variation in the response variable compared to the null model. I.e. you see how much the deviance decreases by when adding predictors
Deviance explained: This is the difference between the null deviance and the residual deviance. It quantifies how much of the variability in the response variable is explained by the predictors in your model.

Check if the model violates the assumptions made. i.e., plot the diagnostic plots to check assumptions of linear model

par(mfrow = c(1,2))
plot(rikz.1)

Residuals vs Fitted: This shows a plot of the fitted values against the residuals and aids in assessing the linearity assumption between the predictor and outcome variables. Ideally, the points should exhibit random scattering around the red horizontal line at y = 0. However, some evidence of curvature suggests potential violations of the linearity assumption.

QQ Residuals: This shows a plot of the theoretical quantiles according to the model, against the quantiles of the standardised residuals. In this case, the QQ-plot does not point towards a normal error distribution as the higher quantiles lie away from the line. This suggests that the assumption of normality is not met.

Scale-Location: This plot shows the relationship between the fitted values (model predictions) and the square root of the absolute standardized residuals. It is used to identify deviations from homogeneity of variance. If the variance is constant, then the red line through the middle should be horizontal and flat. As this is not the case for our model (we have a trumpet plot), it means the assumption of homogeneity is violated.

Residuals vs Leverage: This plot is used to identify extreme values that may influence the regression results. In this case, three extreme points (#9, #10, #22) are observed. As shown by Cooks Distance, points #9 and #22 exceeds 3 standard deviations, meaning they are considered as potentially influential points.

As we are violating the assumptions, we can say this model does not follow a linear model

Let’s plot the predicted results of Richness versus HAASL, including the +/- 95% confidence intervals

rikz.pred <- predict(rikz.1, se.fit = TRUE) #uses the predict() function on the rikz.1 model to predict the fitted values for the response variable 
rikz$fit <- rikz.pred$fit #adds a new column 'fit' to the rikz.1 data frame containing the predicted values from the model

rikz$lower.ci <- rikz.pred$fit - (qt(p = 0.975, df = dim(rikz)[1]) * rikz.pred$se.fit) #adds another column which calculates the lower bound of the 95% confidence interval for the predicted values.
rikz$upper.ci <- rikz.pred$fit + (qt(p = 0.975, df = dim(rikz)[1]) * rikz.pred$se.fit) #adds another column which calculates the upper bound of the 95% confidence interval for the predicted values.

plot(HAASL, Richness, ylim = c(-5, 22), col="blue")
lines(HAASL, rikz.pred$fit, col = 'blue')
lines(HAASL[order(HAASL)], rikz$lower.ci[order(HAASL)], col = 'blue', lty = 2)
lines(HAASL[order(HAASL)], rikz$upper.ci[order(HAASL)], col = 'blue', lty = 2)
abline(h = 0, lty = 4, col = 'red')

Interpretation: The linear model predicts that the outcome variable (in this case species richness) can take values that are less than 0. This is problematic as count data is inherently non-negative. This again confirms that we cannot use a linear model to predict richness

Poisson Regression

Fit a Poisson model that uses HAASL to explain Richness using glm(), and get the summary of model outputs

rikz.poisson <- glm(Richness ~ HAASL, family = poisson(link="log"))
summary(rikz.poisson)

## 
## Call:
## glm(formula = Richness ~ HAASL, family = poisson(link = "log"))
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.79100    0.06329  28.297  < 2e-16 ***
## HAASL       -0.55597    0.07163  -7.762 8.39e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 179.75  on 44  degrees of freedom
## Residual deviance: 113.18  on 43  degrees of freedom
## AIC: 259.18
## 
## Number of Fisher Scoring iterations: 5

Interpretation

The equation is given by log(Richness) = 1.79 - 0.56 * HAASL

This means for every one unit increase in HAASL, the log(Richness) DECREASES by 0.56 units.

Taking the exponent of the equation gives us Richness = exp(1.79 - 0.56 * HAASL)

The intercept represents the species richness when HAASL is zero (i.e. height is at sea level), which is exp(1.79) = 5.99
This means for every one unit increase in HAASL, the expected species richness is MULTIPLIED by exp(-0.56) = 0.57
It also means that for every unit increase in HAASL, the expected species richness DECREASES by approximately exp(-0.56) - 1 = 43%.

Calculate the deviance explained by the Poisson model

Deviance = ((null deviance - residual deviance)/(null deviance)) * 100

round(((rikz.poisson$null.deviance - rikz.poisson$deviance) / rikz.poisson$null.deviance)*100, 2)

## [1] 37.03

Interpretation: the predictor HAASL explains 37% of the total variability in Richness, suggesting that there may be other factors influencing Richness that are not included in the model

Recreate the plots with the predicted results of Richness versus HAASL, including the +/- 95% confidence interval

# Calculate fitted values +/- 95% confidence interval
rikz.poisson.pred <- predict(rikz.poisson, type = 'response', se.fit = TRUE)
rikz$fit.poisson <- rikz.poisson.pred$fit
rikz$lower.ci.poisson <- rikz.poisson.pred$fit - (qt(p = 0.975, df = dim(rikz)[1]) * rikz.poisson.pred$se.fit)
rikz$upper.ci.poisson <- rikz.poisson.pred$fit + (qt(p = 0.975, df = dim(rikz)[1]) * rikz.poisson.pred$se.fit)

# Plot of fitted line +/- 95% confidence interval
plot(HAASL, Richness)
lines(HAASL[order(HAASL)], rikz$fit.poisson[order(HAASL)], col = 'blue')
lines(HAASL[order(HAASL)], rikz$lower.ci.poisson[order(HAASL)], col = 'blue', lty = 2)
lines(HAASL[order(HAASL)], rikz$upper.ci.poisson[order(HAASL)], col = 'blue', lty = 2)
abline(h = 0, lty = 4, col = 'red')

Interpretation: no matter what height above sea level, the richness will never go below zero. This makes sense in the context of our count data

Exercise 2

Read the dataset “police.csv” into R

police <- read.table(file = "police.csv", header = TRUE, sep = ",")

Do some data exploration to get yourself familiarised with the dataset

str(police)

## 'data.frame':    225 obs. of  4 variables:
##  $ precinct: int  1 2 3 4 5 6 7 8 9 10 ...
##  $ eth     : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ stops   : int  202 132 752 385 517 635 301 925 144 560 ...
##  $ arrests : int  980 753 2188 471 1719 808 1046 5667 320 1467 ...

head(police)

##   precinct eth stops arrests
## 1        1   1   202     980
## 2        2   1   132     753
## 3        3   1   752    2188
## 4        4   1   385     471
## 5        5   1   517    1719
## 6        6   1   635     808

tail(police)

##     precinct eth stops arrests
## 220       70   3    14     154
## 221       71   3   922    1064
## 222       72   3   391     438
## 223       73   3   637    1427
## 224       74   3   896    1493
## 225       75   3   295     312

summary(police)

##     precinct       eth        stops           arrests    
##  Min.   : 1   Min.   :1   Min.   :   7.0   Min.   :  16  
##  1st Qu.:19   1st Qu.:1   1st Qu.: 133.0   1st Qu.: 312  
##  Median :38   Median :2   Median : 385.0   Median : 571  
##  Mean   :38   Mean   :2   Mean   : 584.1   Mean   :1051  
##  3rd Qu.:57   3rd Qu.:3   3rd Qu.: 824.0   3rd Qu.:1467  
##  Max.   :75   Max.   :3   Max.   :2771.0   Max.   :5667

Attach the Data

attach(police)

Follow the model fitting described in lecture 10 to:

i) fit a poisson model with a constant term and an offset for arrests

fit.1 <- glm (stops ~ 1, family=poisson(link="log"), offset=log(arrests))
summary(fit.1)

## 
## Call:
## glm(formula = stops ~ 1, family = poisson(link = "log"), offset = log(arrests))
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.587715   0.002758  -213.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 46120  on 224  degrees of freedom
## Residual deviance: 46120  on 224  degrees of freedom
## AIC: 47829
## 
## Number of Fisher Scoring iterations: 5

The purpose of this model, called the null model, is to establish a baseline or reference point for comparison with more complex models. It assumes that the number of stops is only a function of the exposure (arrests) and the overall average (intercept). It does not account for any differences due to ethnic groups or precincts.

ii) build on the model to include the predictor “ethnicity” as factor

fit.2 <- glm (stops ~ factor(eth), family=poisson(link="log"), offset=log(arrests))
summary(fit.2)

## 
## Call:
## glm(formula = stops ~ factor(eth), family = poisson(link = "log"), 
##     offset = log(arrests))
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.588086   0.003784 -155.40   <2e-16 ***
## factor(eth)2  0.070208   0.006061   11.58   <2e-16 ***
## factor(eth)3 -0.161581   0.008558  -18.88   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 46120  on 224  degrees of freedom
## Residual deviance: 45437  on 222  degrees of freedom
## AIC: 47150
## 
## Number of Fisher Scoring iterations: 5

Remember that R organises the factors in alphabetical order. In this case, black is the reference, eth2 is hispanic, and eth3 is white.

Interpretation:

The equation is log(stops) = -0.59 + 0.07 * (if hispanic) - 0.16 * (if white)

Intercept

On average, for black people, the expected number of stops is exp(-0.588) = 0.56 or 56%, in proportion to arrests

Hispanics vs Blacks

Hispanics have approximately exp(0.07) = 1.1 times the expected number of stops as blacks
Hispanics have about exp(0.07) - 1 = 7% more stops compared to blacks, in proportion to arrests

Whites vs Blacks

Whites have approximately exp(-0.16) = 0.8521 times the expected number of stops as blacks
Whites have about exp(-0.16) - 1 = 15% fewer stops compared to blacks, in proportion to arrests

Deviance: The deviance has decreased from 46120 (null model) to 45437, indicating an improvement in model fit. The reduction in deviance (683) is much greater than the minimum expected decrease of 2, suggesting that ethnic group is a significant predictor.

iii) build on the model to include the predictor “precinct” as factor at each step get the summary of model outputs

fit.3 <- glm (stops ~ factor(eth) + factor(precinct) , family=poisson(link="log"), offset=log(arrests))
summary(fit.3)

## 
## Call:
## glm(formula = stops ~ factor(eth) + factor(precinct), family = poisson(link = "log"), 
##     offset = log(arrests))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)        -1.378868   0.051019 -27.027  < 2e-16 ***
## factor(eth)2        0.010188   0.006802   1.498 0.134190    
## factor(eth)3       -0.419001   0.009435 -44.409  < 2e-16 ***
## factor(precinct)2  -0.149050   0.074030  -2.013 0.044077 *  
## factor(precinct)3   0.559955   0.056758   9.866  < 2e-16 ***
## factor(precinct)4   1.210636   0.057549  21.037  < 2e-16 ***
## factor(precinct)5   0.282865   0.056794   4.981 6.34e-07 ***
## factor(precinct)6   1.144204   0.058047  19.712  < 2e-16 ***
## factor(precinct)7   0.218173   0.064335   3.391 0.000696 ***
## factor(precinct)8  -0.390565   0.056868  -6.868 6.51e-12 ***
## factor(precinct)9   0.485112   0.078151   6.207 5.39e-10 ***
## factor(precinct)10  0.417068   0.058851   7.087 1.37e-12 ***
## factor(precinct)11  0.654417   0.061586  10.626  < 2e-16 ***
## factor(precinct)12  1.148698   0.061067  18.811  < 2e-16 ***
## factor(precinct)13  1.053509   0.055250  19.068  < 2e-16 ***
## factor(precinct)14  0.614115   0.058803  10.444  < 2e-16 ***
## factor(precinct)15  1.054904   0.054956  19.195  < 2e-16 ***
## factor(precinct)16  0.797715   0.060290  13.231  < 2e-16 ***
## factor(precinct)17 -0.049312   0.060478  -0.815 0.414864    
## factor(precinct)18  0.131645   0.054981   2.394 0.016649 *  
## factor(precinct)19  0.319164   0.057999   5.503 3.74e-08 ***
## factor(precinct)20 -0.040574   0.056928  -0.713 0.476016    
## factor(precinct)21  0.410988   0.056805   7.235 4.65e-13 ***
## factor(precinct)22  1.211227   0.053399  22.683  < 2e-16 ***
## factor(precinct)23  0.599919   0.055064  10.895  < 2e-16 ***
## factor(precinct)24  1.319454   0.054600  24.166  < 2e-16 ***
## factor(precinct)25  0.913359   0.053596  17.041  < 2e-16 ***
## factor(precinct)26 -0.147614   0.057006  -2.589 0.009612 ** 
## factor(precinct)27  1.895485   0.055664  34.052  < 2e-16 ***
## factor(precinct)28 -0.764964   0.060915 -12.558  < 2e-16 ***
## factor(precinct)29  1.124710   0.054267  20.726  < 2e-16 ***
## factor(precinct)30  0.524652   0.055374   9.475  < 2e-16 ***
## factor(precinct)31  1.649498   0.056167  29.368  < 2e-16 ***
## factor(precinct)32  1.386276   0.059679  23.229  < 2e-16 ***
## factor(precinct)33  1.083192   0.054214  19.980  < 2e-16 ***
## factor(precinct)34  1.518972   0.054355  27.945  < 2e-16 ***
## factor(precinct)35  0.879407   0.063190  13.917  < 2e-16 ***
## factor(precinct)36  1.616306   0.058796  27.490  < 2e-16 ***
## factor(precinct)37  1.409430   0.060008  23.487  < 2e-16 ***
## factor(precinct)38  1.763813   0.056195  31.388  < 2e-16 ***
## factor(precinct)39  0.501081   0.054841   9.137  < 2e-16 ***
## factor(precinct)40  1.531984   0.061467  24.924  < 2e-16 ***
## factor(precinct)41  1.913898   0.055716  34.351  < 2e-16 ***
## factor(precinct)42  1.084217   0.054511  19.890  < 2e-16 ***
## factor(precinct)43  0.541787   0.056645   9.565  < 2e-16 ***
## factor(precinct)44  0.883038   0.056671  15.582  < 2e-16 ***
## factor(precinct)45  0.838439   0.054215  15.465  < 2e-16 ***
## factor(precinct)46  0.655855   0.053750  12.202  < 2e-16 ***
## factor(precinct)47  1.128144   0.059981  18.808  < 2e-16 ***
## factor(precinct)48  0.411880   0.056895   7.239 4.51e-13 ***
## factor(precinct)49  1.101219   0.059681  18.452  < 2e-16 ***
## factor(precinct)50  0.878797   0.054109  16.241  < 2e-16 ***
## factor(precinct)51  0.348980   0.058550   5.960 2.52e-09 ***
## factor(precinct)52  0.491898   0.055458   8.870  < 2e-16 ***
## factor(precinct)53  0.599687   0.058486  10.254  < 2e-16 ***
## factor(precinct)54  0.491840   0.060042   8.192 2.58e-16 ***
## factor(precinct)55  0.497511   0.059023   8.429  < 2e-16 ***
## factor(precinct)56  0.890877   0.065982  13.502  < 2e-16 ***
## factor(precinct)57  1.532782   0.061366  24.978  < 2e-16 ***
## factor(precinct)58  1.578846   0.054528  28.955  < 2e-16 ***
## factor(precinct)59  1.083726   0.058920  18.393  < 2e-16 ***
## factor(precinct)60  0.833828   0.054216  15.380  < 2e-16 ***
## factor(precinct)61  1.277336   0.056757  22.506  < 2e-16 ***
## factor(precinct)62  1.206378   0.056666  21.289  < 2e-16 ***
## factor(precinct)63  1.308782   0.058301  22.449  < 2e-16 ***
## factor(precinct)64  1.712497   0.057772  29.642  < 2e-16 ***
## factor(precinct)65  1.899204   0.055154  34.434  < 2e-16 ***
## factor(precinct)66  2.102128   0.054231  38.763  < 2e-16 ***
## factor(precinct)67  1.202841   0.054957  21.887  < 2e-16 ***
## factor(precinct)68  2.201257   0.059245  37.155  < 2e-16 ***
## factor(precinct)69  1.720700   0.058227  29.551  < 2e-16 ***
## factor(precinct)70  1.032004   0.054813  18.828  < 2e-16 ***
## factor(precinct)71  1.479415   0.054239  27.276  < 2e-16 ***
## factor(precinct)72  1.464985   0.053737  27.262  < 2e-16 ***
## factor(precinct)73  0.991018   0.053585  18.494  < 2e-16 ***
## factor(precinct)74  1.151460   0.058023  19.845  < 2e-16 ***
## factor(precinct)75  1.571225   0.075731  20.747  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 46120.3  on 224  degrees of freedom
## Residual deviance:  3427.1  on 148  degrees of freedom
## AIC: 5287.8
## 
## Number of Fisher Scoring iterations: 4

Interpretation:

Remember: the reference categories are black and precinct 1

The equation is log(stops) = -1.38 + 0.01 * (if hispanic) - 0.42 * (if white) - 0.15 (if precinct 2) + 0.56 (if precinct 3)

Intercept

For black people in precinct 1, the expected number of stops is exp(-1.38) = 0.252 or 25%, in proportion to arrests

Hispanics vs Blacks

Hispanics have approximately exp(0.01) = 1 times the expected number of stops as blacks
Hispanics have about exp(0.01) - 1 = 1% more stops compared to blacks, in proportion to arrests

Whites vs Blacks

Whites have approximately exp(-0.42) = 0.66 times the expected number of stops as blacks
Whites have about exp(-0.42) - 1 = 34% fewer stops compared to blacks, in proportion to arrests

Precinct 2 vs Precinct 1

Compared to precinct 1, precinct 2 has approximately exp(-0.15) = 0.86 times the expected number of stops, given the same number of arrests
Precinct 2 is expected to have about exp(-0.15) - 1 = 14% fewer stops than precinct 1, in proportion to arrests

Precinct 3 vs Precinct 1

Compared to precinct 1, precinct 3 has approximately exp(0.56) = 1.75 times the expected number of stops, given the same number of arrests
Precinct 3 is expected to have about exp(0.56) - 1 = 75% more stops than precinct 1, in proportion to arrests

Deviance: The deviance has decreased from 45437 (ethnicity only) to 3427 (ethnicity + precinct), indicating an improvement in model fit. The reduction in deviance (42010) is much greater than the minimum expected decrease of 75, suggesting that both ethnicity and precinct are significant predictors.

What test could you use to find out if the model is correctly specified? Tips: we have discrete data

Formulate the null and alternative hypothesis

H0: the Poisson regression model is correctly specified and adequately fits the data

HA: the Poisson regression model is not correctly specified and does not fit the data well

Conducting the Test:

Step 1: determine residual deviance. In this case its 3427.1

Step 2: determine degrees of freedom. In this case its 148. This was calculated by subtracting the total number of observations (225) from the total number of parameters (77, as we have 1 intercept, 2 ethnicity, 74 precincts)

Step 3: conduct the chi-square goodness of fit test

pchisq(3427.1, 148, lower.tail = FALSE) #lower tail is FALSE as the p-value in the chi-squared test is given by the upper tail probability

## [1] 0

As the p-value of 0 is much lower than the significance level of 0.05, we can reject the null hypothesis and conclude that the Poisson model does not fit the data.

WARNING: The Poisson distribution assumes that the mean and variance are equal. For small counts, the distribution is highly skewed and can have high variance relative to the mean. This makes the chi-square approximation less accurate, leading to unreliable p-values.

Calculate the overdispersion in model fit. Refer to formula in slides from lecture 10 - see slide 19

Overdispersion = sum of the squared residuals / degrees of freedom

Step 1: calculate residuals

predictions <- predict(fit.3, type="response")
residuals <- (stops - predictions) / sqrt(predictions)

Step 2: calculate sum of the squared residuals

sum_squared_residuals <- sum(residuals^2)

Step 3: calculate degrees of freedom

n <- dim(police)[1] #finds total number of observations in dataset 
k <- length(coef(fit.3)) #finds the number of estimated regression coefficients in the model, including the intercept
degree_of_freedom <- n - k

Step 4: calculate overdispersion

overdispersion <- sum_squared_residuals/degree_of_freedom
overdispersion

## [1] 21.88505

Step 5: calculate the p-value for the overdispersion

p_value <- pchisq(sum_squared_residuals, n - k, lower.tail = FALSE)
p_value

## [1] 0

Interpretation

The overdispersion ratio compares the observed variance to the expected variance under the Poisson model.
A ratio close to 1 suggests that the model’s assumption of equidispersion (variance equal to the mean) is reasonable.
A ratio significantly greater than 1 indicates overdispersion, meaning the observed variance is greater than the Poisson model expects.
The p-value tests the null hypothesis that the data follow a Poisson distribution (i.e., no overdispersion).
An overdispersion of 21.9 is much greater than 1, indicating substantial overdispersion. A p-value of 2.2e-16 is extremely small, leading us to reject the null hypothesis and confirm that overdispersion is significant. This suggests that the Poisson model may not be appropriate, and an alternative model should be considered.

Note: when you have overdispersion, the standard errors are underestimated which means the p-value is inflated. This results in a type 1 error (false positive)

To correct for overdispersion, you can multiply all the standard errors by the square root of the overdispersion. This can done in R using a quasi-Poisson distribution

Re-fit model iii) using a quasipoisson distribution. Indicate what changed in the model outputs.

Fit the model using a quasipoisson distribution

fit.4 <- glm (stops ~ factor(eth) + factor(precinct) , family=quasipoisson (link="log"), offset=log(arrests))
summary(fit.4)

## 
## Call:
## glm(formula = stops ~ factor(eth) + factor(precinct), family = quasipoisson(link = "log"), 
##     offset = log(arrests))
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -1.37887    0.23867  -5.777 4.33e-08 ***
## factor(eth)2        0.01019    0.03182   0.320 0.749294    
## factor(eth)3       -0.41900    0.04414  -9.493  < 2e-16 ***
## factor(precinct)2  -0.14905    0.34632  -0.430 0.667549    
## factor(precinct)3   0.55995    0.26552   2.109 0.036640 *  
## factor(precinct)4   1.21064    0.26922   4.497 1.38e-05 ***
## factor(precinct)5   0.28287    0.26569   1.065 0.288772    
## factor(precinct)6   1.14420    0.27155   4.214 4.35e-05 ***
## factor(precinct)7   0.21817    0.30097   0.725 0.469656    
## factor(precinct)8  -0.39056    0.26604  -1.468 0.144202    
## factor(precinct)9   0.48511    0.36560   1.327 0.186588    
## factor(precinct)10  0.41707    0.27531   1.515 0.131934    
## factor(precinct)11  0.65442    0.28811   2.271 0.024563 *  
## factor(precinct)12  1.14870    0.28568   4.021 9.21e-05 ***
## factor(precinct)13  1.05351    0.25847   4.076 7.45e-05 ***
## factor(precinct)14  0.61411    0.27509   2.232 0.027089 *  
## factor(precinct)15  1.05490    0.25709   4.103 6.71e-05 ***
## factor(precinct)16  0.79771    0.28205   2.828 0.005328 ** 
## factor(precinct)17 -0.04931    0.28292  -0.174 0.861874    
## factor(precinct)18  0.13165    0.25721   0.512 0.609540    
## factor(precinct)19  0.31916    0.27133   1.176 0.241364    
## factor(precinct)20 -0.04057    0.26632  -0.152 0.879117    
## factor(precinct)21  0.41099    0.26574   1.547 0.124105    
## factor(precinct)22  1.21123    0.24981   4.849 3.11e-06 ***
## factor(precinct)23  0.59992    0.25760   2.329 0.021217 *  
## factor(precinct)24  1.31945    0.25543   5.166 7.62e-07 ***
## factor(precinct)25  0.91336    0.25073   3.643 0.000372 ***
## factor(precinct)26 -0.14761    0.26668  -0.554 0.580740    
## factor(precinct)27  1.89549    0.26040   7.279 1.83e-11 ***
## factor(precinct)28 -0.76496    0.28497  -2.684 0.008096 ** 
## factor(precinct)29  1.12471    0.25387   4.430 1.82e-05 ***
## factor(precinct)30  0.52465    0.25905   2.025 0.044633 *  
## factor(precinct)31  1.64950    0.26276   6.278 3.61e-09 ***
## factor(precinct)32  1.38628    0.27919   4.965 1.87e-06 ***
## factor(precinct)33  1.08319    0.25362   4.271 3.47e-05 ***
## factor(precinct)34  1.51897    0.25428   5.974 1.65e-08 ***
## factor(precinct)35  0.87941    0.29561   2.975 0.003424 ** 
## factor(precinct)36  1.61631    0.27506   5.876 2.67e-08 ***
## factor(precinct)37  1.40943    0.28073   5.021 1.46e-06 ***
## factor(precinct)38  1.76381    0.26289   6.709 3.88e-10 ***
## factor(precinct)39  0.50108    0.25655   1.953 0.052690 .  
## factor(precinct)40  1.53198    0.28755   5.328 3.63e-07 ***
## factor(precinct)41  1.91390    0.26065   7.343 1.29e-11 ***
## factor(precinct)42  1.08422    0.25501   4.252 3.74e-05 ***
## factor(precinct)43  0.54179    0.26499   2.045 0.042674 *  
## factor(precinct)44  0.88304    0.26512   3.331 0.001094 ** 
## factor(precinct)45  0.83844    0.25363   3.306 0.001188 ** 
## factor(precinct)46  0.65585    0.25145   2.608 0.010032 *  
## factor(precinct)47  1.12814    0.28060   4.020 9.22e-05 ***
## factor(precinct)48  0.41188    0.26616   1.547 0.123887    
## factor(precinct)49  1.10122    0.27920   3.944 0.000123 ***
## factor(precinct)50  0.87880    0.25313   3.472 0.000678 ***
## factor(precinct)51  0.34898    0.27390   1.274 0.204627    
## factor(precinct)52  0.49190    0.25944   1.896 0.059911 .  
## factor(precinct)53  0.59969    0.27361   2.192 0.029957 *  
## factor(precinct)54  0.49184    0.28089   1.751 0.082013 .  
## factor(precinct)55  0.49751    0.27612   1.802 0.073614 .  
## factor(precinct)56  0.89088    0.30867   2.886 0.004484 ** 
## factor(precinct)57  1.53278    0.28708   5.339 3.44e-07 ***
## factor(precinct)58  1.57885    0.25509   6.189 5.64e-09 ***
## factor(precinct)59  1.08373    0.27564   3.932 0.000129 ***
## factor(precinct)60  0.83383    0.25363   3.288 0.001262 ** 
## factor(precinct)61  1.27734    0.26552   4.811 3.67e-06 ***
## factor(precinct)62  1.20638    0.26509   4.551 1.11e-05 ***
## factor(precinct)63  1.30878    0.27274   4.799 3.87e-06 ***
## factor(precinct)64  1.71250    0.27027   6.336 2.68e-09 ***
## factor(precinct)65  1.89920    0.25802   7.361 1.17e-11 ***
## factor(precinct)66  2.10213    0.25370   8.286 6.48e-14 ***
## factor(precinct)67  1.20284    0.25710   4.679 6.46e-06 ***
## factor(precinct)68  2.20126    0.27716   7.942 4.59e-13 ***
## factor(precinct)69  1.72070    0.27240   6.317 2.96e-09 ***
## factor(precinct)70  1.03200    0.25642   4.025 9.08e-05 ***
## factor(precinct)71  1.47941    0.25374   5.830 3.34e-08 ***
## factor(precinct)72  1.46498    0.25139   5.828 3.39e-08 ***
## factor(precinct)73  0.99102    0.25068   3.953 0.000119 ***
## factor(precinct)74  1.15146    0.27144   4.242 3.89e-05 ***
## factor(precinct)75  1.57122    0.35428   4.435 1.79e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for quasipoisson family taken to be 21.88506)
## 
##     Null deviance: 46120.3  on 224  degrees of freedom
## Residual deviance:  3427.1  on 148  degrees of freedom
## AIC: NA
## 
## Number of Fisher Scoring iterations: 4

Interpretation: All coefficient estimates are the same but the standard errors are now corrected for the overdispersion. This adjustment has resulted in larger standard errors compared to the Poisson model. Due to the larger standard errors, the significance levels (p-values) for some predictors have changed, with fewer predictors being statistically significant.

Note: The deviance remains the same because it is a measure of the fit of the model to the data and is not directly affected by the method used to account for overdispersion.

The Akaike Information Criterion (AIC) is not provided in the quasipoisson model summary. This is because AIC is not appropriate for quasipoisson models, as it is based on the likelihood which is not well-defined for quasi-likelihood models.

Notice how the output states: Dispersion parameter for quasipoisson family taken to be 21.88506. This means we are not violating the assumption of variance anymore as our mean is equal to the variance

Week 11 Workshop: Poisson Regression

2024-05-15

Exercise 1

Linear Regression

Poisson Regression

Exercise 2