Exercises

Page 52

fitfs <- lm(sheight ~ fheight, father.son)
1. Test whether the slope coefficient for the father.son data is different from zero (father as predictor, son as outcome).

Hypothesis Testing

\(H_0:\beta_1=0\) \(;\) \(H_A:\beta_1 \neq 0\)


y= father.son$sheight; x=father.son$fheight; n= nrow(father.son)

Given

\(\beta_1 = cor(Y,X) \frac{Sd (Y)}{Sd (X)}\)


and

\(\beta_0= \bar{Y} - \beta_1 \bar{X}\)


beta1 <- cor(y,x)*sd(y)/sd(x)
beta0 <- mean(y) - beta1*mean(x)

Remember, the predicted outcome for observation \(i\) is \(\hat{Y_i}\) at predictor value \(X_i\), so the residual is

\(e_i = Y_i - \hat{\beta_0} - \hat{\beta_1} X_i\)


e <- y-beta0 -beta1*x

The estimate of the residual standard deviation (the residual variation) is

\(\hat{\sigma} = \sqrt{\frac{1}{n-2} \overset{n}{\underset{i=1}\sum} e_i^2}\)


sigma <- sqrt(sum(e^2)/(n-2))
ssx <- sum((x - mean(x))^2)

The standard error of the regression slope is

\(\hat{\sigma}_{\hat{\beta_1}} = \sqrt{Var(\hat{\beta_1})} = \frac{\hat{\sigma}}{\sqrt{\overset{n}{\underset{i=1}\sum}(X_i-\bar{X})^2}}\)


seb1 <- sigma / sqrt(ssx)

Knowing that

\(T= \frac{\hat{\beta_j}- \beta_j}{\hat{\sigma_\hat{\beta_j}}}\)


If the hypothesis is that the true value is 0 we have:

\(T= \frac{\beta_j}{\hat{\sigma_\hat{\beta_j}}}\)


tBeta1 <- beta1 / seb1
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTab <- data.frame(beta1, seb1, tBeta1, pBeta1)
colnames(coefTab) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTab) <- "x"
coefTab
  Estimate Std. Error  t value      P(>|t|)
x 0.514093 0.02704874 19.00618 1.121268e-69

Comparing with summary():

summary(lm(sheight~fheight, father.son))$coef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight      0.514093 0.02704874 19.00618 1.121268e-69
coefTab[1,4]
[1] 1.121268e-69
summary(lm(sheight~fheight, father.son))$coef[2,4]
[1] 1.121268e-69

The p-value for the slope is very small, so we reject the null hypothesis: the slope coefficient is different from 0, and the father’s height is a statistically significant predictor of the son’s height.

2. Refer to question 1. Form a confidence interval for the slope coefficient.

Remember that the degrees of freedom are \(n-2\).

The confidence interval is

\(\hat{\beta}_1 \pm t_{n-2,\,0.975} \times SE_{\hat{\beta}_1}\)


Put the coefficient table in the variable sumCoef:

sumCoef <- (summary(lm(sheight~fheight, father.son))$coef)
sumCoef
             Estimate Std. Error  t value     Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight      0.514093 0.02704874 19.00618 1.121268e-69

We can use two methods to compute the confidence interval:

# Method 1
# Estimate....plus minus...t-stat (0.05)... df= n-2........ SE Estimate
sumCoef[2,1] + c(-1,1) * qt(0.975, df= fitfs$df.residual) * sumCoef[2,2]
[1] 0.4610188 0.5671673
# Method 2
confint(fitfs)
                 2.5 %     97.5 %
(Intercept) 30.2912126 37.4819961
fheight      0.4610188  0.5671673
3. Refer to question 1. Form a confidence interval for the intercept (center the fathers’ heights first to get an intercept that is easier to interpret).

As we have seen in the previous exercises, the estimate for the intercept is \(33.8866044\).

summary(fitfs)$coef[1,1]
[1] 33.8866

This is the predicted son’s height when the father’s height is 0 inches, which does not make sense. So we refit the regression with the father’s height centered at its mean.

fitfsc <- lm(sheight ~ I(fheight - mean(fheight)), father.son)
fitfsc

Call:
lm(formula = sheight ~ I(fheight - mean(fheight)), data = father.son)

Coefficients:
               (Intercept)  I(fheight - mean(fheight))  
                   68.6841                      0.5141  

We get a new intercept, \(68.7\) inches: the expected son’s height at the average father’s height of 67.7 inches.

Now, the C.I for this intercept (the son’s height at the average father’s height) is:

confint(fitfsc)
                                2.5 %     97.5 %
(Intercept)                68.5384554 68.8296839
I(fheight - mean(fheight))  0.4610188  0.5671673

Note also that the slope coefficient is unchanged.

4. Refer to question 1. Form a mean value interval [it means the confidence interval, the interval for the regression line] for the expected son’s height at the average father’s height.

Calculating the expected values

predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'confidence')
       fit      lwr      upr
1 68.68407 68.53846 68.82968
5. Refer to question 1. Form a prediction interval [the interval for the predicted Y values] for the Son’s height at the average Father’s height.
\(\hat{Y}= \beta_0 + \beta_1 X\)
predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = "prediction")
       fit      lwr      upr
1 68.68407 63.90091 73.46723
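As a cross-check, here is a minimal sketch that computes both intervals from the standard formulas, reusing beta0, beta1, sigma, ssx, n and x from question 1; the two standard errors differ only by the extra 1 under the square root.

x0   <- mean(x)                                            # predict at the average father's height
yhat <- beta0 + beta1 * x0
tq   <- qt(0.975, df = n - 2)
seMean <- sigma * sqrt(1/n + (x0 - mean(x))^2 / ssx)       # for the regression line (mean value interval)
sePred <- sigma * sqrt(1 + 1/n + (x0 - mean(x))^2 / ssx)   # for a new observation (prediction interval)
yhat + c(-1, 1) * tq * seMean    # should match the 'confidence' interval above
yhat + c(-1, 1) * tq * sePred    # should match the 'prediction' interval above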
6. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Test whether or not the horsepower coefficient is statistically different from zero. Interpret your test.
fit <- lm(mpg ~ hp, mtcars)

Hypothesis Testing

\(H_0:\beta_1=0\) \(;\) \(H_A:\beta_1 \neq 0\)


summary(fit)$coef
               Estimate Std. Error   t value     Pr(>|t|)
(Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

The p-value for the slope is very small \((1.787835 \times 10^{-7})\), so we reject the null hypothesis: the slope coefficient is different from 0, and horsepower is a statistically significant predictor of miles per gallon.

7. Refer to question 6. Form a confidence interval for the slope coefficient.

Method 1

confint(fit)
                  2.5 %     97.5 %
(Intercept) 26.76194879 33.4357723
hp          -0.08889465 -0.0475619

Method 2

with(mtcars, cor(mpg, hp)*sd(mpg)/sd(hp))+ c(-1,1) * qt(0.975, df=nrow(mtcars)-2)*summary(fit)$coef[2,2]
[1] -0.08889465 -0.04756190
8. Refer to question 6. Form a confidence interval for the intercept (center the HP variable first).

Centered coefficients

fit <- lm(mpg ~ I(hp-mean(hp)), mtcars)

Method 1

confint(fit)
                       2.5 %     97.5 %
(Intercept)      18.69599452 21.4852555
I(hp - mean(hp)) -0.08889465 -0.0475619

Method 2

predict(fit, newdata = data.frame(hp = mean(mtcars$hp)), interval = 'confidence')
       fit      lwr      upr
1 20.09062 18.69599 21.48526
9. Refer to question 6. Form a mean value interval [it means the confidence interval, the interval for the regression line] for the expected MPG for the average HP.
fit <- lm(mpg~hp,mtcars)

predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'confidence')
       fit      lwr      upr
1 20.09062 18.69599 21.48526
10. Refer to question 6. Form a prediction interval for the expected MPG for the average HP.
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'prediction')
       fit      lwr      upr
1 20.09062 12.07908 28.10217
11. Refer to question 6. Create a plot that has the fitted regression line plus curves at the expected value and prediction intervals.
y <- mtcars$mpg; x <- mtcars$hp; n <- length(y)
fit <- lm(y~x)
p1 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'confidence'))

p2 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'prediction'))

p1 <- p1 %>% 
    mutate(interval = 'confidence') %>% 
    mutate(x = mtcars$hp)
p2 <- p2 %>% 
    mutate(interval = 'prediction') %>% 
     mutate(x = mtcars$hp)

dat <- rbind(p1,p2)
names(dat)[1] <- "y"
dat %>% 
    ggplot(aes(x = x, y = y)) +
    geom_ribbon(aes(ymin= lwr, ymax= upr, fill= interval), alpha= 0.4) +
    geom_line(colour="black", lwd=.5) +
    geom_point(data= data.frame(x=x,y=y), aes(x=x,y=y))

Page 58

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths [DriversKilled] with kms and PetrolPrice as predictors. Interpret your results.
Seatbelts <- as.data.frame(Seatbelts)

round(summary(lm(DriversKilled ~ kms + PetrolPrice, Seatbelts))$coef,4)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  215.7461    14.6656 14.7110   0.0000
kms           -0.0017     0.0006 -2.8469   0.0049
PetrolPrice -643.7895   148.2896 -4.3414   0.0000

The intercept is the expected number of deaths when kms and PetrolPrice are both 0, which is not useful, so it is better to center the variables. The kms estimate is the change in deaths per additional kilometre driven, which is too small to be meaningful, so it is better to rescale it. PetrolPrice is an index, so it is better to normalize it.

Seatbelts <- Seatbelts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
ppn          -7.838674  1.8055491 -4.341435  2.304713e-05
2. Predict the number of driver deaths at the average kms and petrol levels.
fit <- lm(DriversKilled ~ kms + PetrolPrice, Seatbelts)

predict(fit, newdata = data.frame(kms= mean(Seatbelts$kms), 
                                  PetrolPrice= mean(Seatbelts$PetrolPrice)))
       1 
122.8021 

This is the expected number of deaths at the mean of kms and PetrolPrice, i.e. the intercept of the model with centered predictors.

3. Take the residual for DriversKilled having regressed out kms and an intercept and the residual for PetrolPrice having regressed out kms and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.
dk <- Seatbelts$DriversKilled
kms <- Seatbelts$kms
pp <- Seatbelts$PetrolPrice
fitFull <- lm(dk~kms+pp)

Now we get the residuals.

Remember the formula of the estimated coefficient by the residuals:

\(\hat{\beta_1}= \frac{\overset{n}{\underset{i=1}\sum} e_{i,\,Y|X_2} \; e_{i,\,X_1|X_2}}{\overset{n}{\underset{i=1}\sum} e_{i,\,X_1|X_2}^2}\)


The residual for DriversKilled is:

edk <- resid(lm(dk~kms))

The residual for PetrolPrice is:

epp <- resid(lm(pp~kms))

So the estimate for pp is:

sum(edk*epp)/sum(epp^2)
[1] -643.7895

Remember the coefficients of fitFull:

round(summary(fitFull)$coef, 4)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  215.7461    14.6656 14.7110   0.0000
kms           -0.0017     0.0006 -2.8469   0.0049
pp          -643.7895   148.2896 -4.3414   0.0000

We can also fit the regression through the origin directly:

round(summary(lm(edk~epp- 1))$coef,4)
     Estimate Std. Error t value Pr(>|t|)
epp -643.7895   147.5111 -4.3643        0
4. Take the residual for DriversKilled having regressed out PetrolPrice and an intercept. Take the residual for kms having regressed out PetrolPrice and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.
edk2 <- resid(lm(dk ~ pp))
ekms <- resid(lm(kms ~ pp))
round(summary(lm(edk2 ~ ekms -1))$coef, 4)
     Estimate Std. Error t value Pr(>|t|)
ekms  -0.0017      6e-04 -2.8619   0.0047

Page 72

1. Do exercise 1 of the previous chapter if you have not already. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.
Seatbelts <- as.data.frame(Seatbelts)

# Remember that kms has been rescaled and centered (mkm = kms/1000, then mkm = mkm - mean(mkm)) and PetrolPrice has been normalized:
# ppn = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice)
round(summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef,4)
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.8021     1.6629 73.8503   0.0000
mkm          -1.7495     0.6145 -2.8469   0.0049
ppn          -7.8387     1.8055 -4.3414   0.0000

With the centered and rescaled predictors, the intercept is now the expected number of deaths at the average kms and petrol price. The mkm estimate is the expected change in monthly deaths per additional 1,000 km driven, and the ppn estimate is the expected change per one standard deviation increase in petrol price.

2. Repeat question 1 for the outcome being the log of the count of driver deaths. Interpret your coefficients.
# Method 1: create a new variable logkill = Log_e of DriversKilled
Seatbelts2 <- Seatbelts %>% 
    mutate(logkill = log(DriversKilled))
summary(lm(logkill ~ ppn + mkm, Seatbelts2))$coef
               Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  4.78966306 0.013426810 356.723817 2.737888e-269
ppn         -0.06412578 0.014579039  -4.398492  1.818005e-05
mkm         -0.01400794 0.004962149  -2.822959  5.267843e-03
# Method 2: Calculate the natural log inside lm()
summary(lm(log(DriversKilled) ~ ppn + mkm, Seatbelts))$coef
               Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  4.78966306 0.013426810 356.723817 2.737888e-269
ppn         -0.06412578 0.014579039  -4.398492  1.818005e-05
mkm         -0.01400794 0.004962149  -2.822959  5.267843e-03

In this case we have:

\(E[ln Y_i]= \beta_0 + \beta_1 X_1 + \beta_2 X_2\)


\(e^{E[\ln Y_i]} = e^{\beta_0} \, e^{\beta_1 X_1} \, e^{\beta_2 X_2}\)

so additive effects on the log scale become multiplicative effects on the geometric-mean scale.

Interpretation:

Our normalized petrol price variable (ppn) can have this interpretation:

  • We are estimating an expected 6% decrease in the geometric mean of drivers killed for every 1 standard deviation increase in normalized petrol price, holding the kilometers constant.
1-exp(-0.06412578)
[1] 0.06211298
  • We are estimating an expected 1% decrease in the geometric mean of drivers killed for every additional 1,000 km driven, holding petrol price constant.
1-exp(-0.01400794)
[1] 0.01391029
3. Refer to question 1. Add the dummy variable law and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes. Change the reference level from No to Yes.

Model 2

\(E[Y_i|X_{1i}, X_{2i}]= \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i}\)


The model includes \(X_1\) and \(X_2\) without an interaction term: for the two levels of the factor it fits two lines with the same slopes but different intercepts.

round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     124.2263     1.8012 68.9674   0.0000
mkm              -1.2233     0.6657 -1.8378   0.0677
ppn              -6.9199     1.8514 -3.7377   0.0002
I(factor(law))1 -11.8892     6.0258 -1.9731   0.0500
  • In this case, the intercept changes a little. It is the expected number of deaths at the average kms, the average petrol price, and at level 0 of law, i.e. before the law was enacted.

  • If we want the expected DriversKilled after the law took effect, at the average mkm and ppn, we add \(-11.8892\) to the intercept.

  • Note that R tells us which factor level it is treating as the reference: in this case it is 0. The value \(-11.8892\) is associated with factor level 1.

Change the reference level from No to Yes:

Seatbelts<-Seatbelts %>% 
    mutate(law = factor(law, levels = c(1,0)))

round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     112.3371     5.5547 20.2236   0.0000
mkm              -1.2233     0.6657 -1.8378   0.0677
ppn              -6.9199     1.8514 -3.7377   0.0002
I(factor(law))0  11.8892     6.0258  1.9731   0.0500
Seatbelts<-Seatbelts %>% 
    mutate(law = factor(law, levels = c(0,1)))
4. Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables.
# Each comparison contributes 1 when TRUE, so the sum counts how many thresholds
# ppn falls at or below: level 4 = lowest prices, level 1 = highest prices
Seatbelts <- Seatbelts %>% 
    mutate(ppf = as.factor((ppn <= -1.5)+ (ppn <= 0)+ (ppn <= 1.5)+ (ppn < Inf)))

round(summary(lm(DriversKilled ~ ppf + mkm + I(factor(law)), Seatbelts))$coef,4)
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     109.8405     9.5066 11.5541   0.0000
ppf2             10.8271     9.9462  1.0886   0.2778
ppf3             18.6904     9.9374  1.8808   0.0616
ppf4             25.0074    10.9163  2.2908   0.0231
mkm              -1.2991     0.7668 -1.6942   0.0919
I(factor(law))1 -15.3445     6.0345 -2.5428   0.0118
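An arguably more transparent way to build the same four bins is cut(); this is just a sketch (the variable name ppf2 is illustrative), and note that its levels run from lowest to highest price, the reverse of the sum-of-logicals coding above, where level 4 corresponds to the lowest prices.

# Cross-tabulate the cut()-based bins against ppf to see the correspondence
Seatbelts %>% 
    mutate(ppf2 = cut(ppn, breaks = c(-Inf, -1.5, 0, 1.5, Inf))) %>% 
    with(table(ppf, ppf2))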
5. Perform the plot requested at the end of the last chapter.
library(datasets); data(swiss)
head(swiss)
             Fertility Agriculture Examination Education Catholic
Courtelary        80.2        17.0          15        12     9.96
Delemont          83.1        45.1           6         9    84.84
Franches-Mnt      92.5        39.7           5         5    93.40
Moutier           85.8        36.5          12         7    33.77
Neuveville        76.9        43.5          17        15     5.16
Porrentruy        76.1        35.3           9         7    90.57
             Infant.Mortality
Courtelary               22.2
Delemont                 22.2
Franches-Mnt             20.2
Moutier                  20.3
Neuveville               20.6
Porrentruy               26.6
  • Two-level model without interaction
swiss <- swiss %>% 
  mutate(CatholicBin = 1 * (Catholic > 50))

g <- ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin))) + 
  geom_point(size = 5) +
  xlab("% in Agriculture") + 
  ylab("Fertility")

fit <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss)

g +
  geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size = 1, colour = 'black') +
  geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size = 1, colour = 'black') +
  annotate("text", x = 0.05, y = 72, label = "Catholic", hjust = 0, col = "black", size = 3, fontface = "italic") +
  annotate("text", x = 0.05, y = 60, label = "Non-Catholic", hjust = 0, col = "black", size = 3, fontface = "italic")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.

  • Two-level model with interaction
fitfull <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)

g +
  geom_abline(intercept = coef(fitfull)[1], slope = coef(fitfull)[2], size = 1, colour = 'black') +
  geom_abline(intercept = coef(fitfull)[1] + coef(fitfull)[3], 
              slope = coef(fitfull)[2] + coef(fitfull)[4], size = 1, colour = 'black') +
  annotate("text", x = 0.05, y = 68, label = "Catholic", hjust = 0, col = "black", size = 3, fontface = "italic") +
  annotate("text", x = 0.05, y = 60, label = "Non-Catholic", hjust = 0, col = "black", size = 3, fontface = "italic")

Page 80

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.
summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
ppn          -7.838674  1.8055491 -4.341435  2.304713e-05
2. Compare the kms coefficient with and without the inclusion of the PetrolPrice variable in the model.

First of all let’s see the correlation between mkm and ppn:

cor(Seatbelts$mkm, Seatbelts$ppn)
[1] 0.3839004

This moderate correlation indicates that the two variables carry related information; possibly the petrol price has an impact on the total kilometres driven.

fit0 <- lm(DriversKilled ~ mkm, Seatbelts)
fit1 <- lm(DriversKilled ~ mkm + ppn, Seatbelts)
summary(fit0)$coef
              Estimate Std. Error  t value      Pr(>|t|)
(Intercept) 122.802083  1.7391997 70.60839 2.665611e-138
mkm          -2.773787  0.5935049 -4.67357  5.596266e-06
summary(fit1)$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

The kms estimate is attenuated, moving from \(-2.773787\) to \(-1.749546\), which makes sense because both effects point in the same direction and the predictors are correlated. This change indicates the confounding effect of PetrolPrice on the regression of DriversKilled on kms.

anova(fit0,fit1)
Analysis of Variance Table

Model 1: DriversKilled ~ mkm
Model 2: DriversKilled ~ mkm + ppn
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    190 110345                                  
2    189 100339  1     10006 18.848 2.305e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is the same as the p-value for the ppn estimate in fit1.

3. Compare the PetrolPrice coefficient with and without the inclusion of the kms variable in the model.
fit2 <- lm(DriversKilled ~ ppn, Seatbelts)
fit3 <- lm(DriversKilled ~ ppn + mkm, Seatbelts)
summary(fit2)$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 122.802083   1.693656 72.507096 2.061333e-140
ppn          -9.812019   1.698084 -5.778288  3.044208e-08
summary(fit3)$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
ppn          -7.838674  1.8055491 -4.341435  2.304713e-05
mkm          -1.749546  0.6145401 -2.846919  4.902428e-03

Here too the estimate is attenuated, moving from \(-9.812019\) to \(-7.838674\), which makes sense because both effects point in the same direction and the predictors are correlated. This change indicates the confounding effect of kms on the regression of DriversKilled on PetrolPrice.

Page 91

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
data(Seatbelts)
Seatbelts <- as.data.frame(Seatbelts)

model <- lm(DriversKilled ~ kms + PetrolPrice + law, data = Seatbelts)
summary(model)$coef
                 Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)  2.014614e+02 1.625587e+01 12.393145 3.601219e-26
kms         -1.223318e-03 6.656567e-04 -1.837761 6.767594e-02
PetrolPrice -5.683347e+02 1.520552e+02 -3.737687 2.463128e-04
law         -1.188920e+01 6.025785e+00 -1.973055 4.995497e-02
round(summary(lm(DriversKilled ~ kms + PetrolPrice + law, Seatbelts))$coef,4)
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  201.4614    16.2559 12.3931   0.0000
kms           -0.0012     0.0007 -1.8378   0.0677
PetrolPrice -568.3347   152.0552 -3.7377   0.0002
law          -11.8892     6.0258 -1.9731   0.0500
2. Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R’s residual variance estimate.
Seatbelts <- Seatbelts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

summary(lm(DriversKilled ~ mkm + ppn + law, Seatbelts))$coef
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) 124.226311  1.8012324 68.967399 1.976641e-135
mkm          -1.223318  0.6656567 -1.837761  6.767594e-02
ppn          -6.919949  1.8513987 -3.737687  2.463128e-04
law         -11.889202  6.0257850 -1.973055  4.995497e-02
fit <- lm(DriversKilled ~ mkm + ppn + law, Seatbelts)
# Method 1
sum(resid(fit)^2) / (nrow(Seatbelts) - 4)    # We estimate 4 parameters (intercept + 3 slopes), so the residual degrees of freedom are n - 4
[1] 522.8903
# Method 2
summary(fit)$sigma^2
[1] 522.8903
3. Refer to question 1. Perform an analysis of diagnostic measures including dffits, dfbetas, influence and hat diagonals.
plot(fit)

  • The first plot shows residuals versus fitted values. Here you are looking for any kind of pattern, and in this case it does not look too bad. The plot labels some points with extreme residuals and fits a smoother through the residuals, because it is often hard to see a pattern from the points alone, while a smoother can reveal patterns you would otherwise miss.

  • The Q-Q plot is used to check the normality of the errors. You are looking for whether the points fall mostly on the reference line. Deviations such as the ones down in the left tail mean the residuals are larger there than the corresponding theoretical quantiles of the normal distribution.

  • Scale-location plot: this shows the square root of the absolute value of the standardized residuals against the fitted values, again with a smoother through it. You are looking for patterns; this one does not look bad.

  • Residuals vs Leverage plot: the final one plots the residuals versus the leverage.

Remember that leverage measures how much potential for influence a point has by virtue of its predictor values being far from the norm.

We might, for example, look at the couple of points with fairly high leverage relative to the rest and check whether there is anything out of the ordinary about them.

Let’s look at some influence measures as well, keeping an eye on observations such as the 12th and the 192nd. We can look, for example, at the dffits with the function:

plot(dffits(fit))

Evaluating the influence

  • dffits: Using ggplot, we can create a data frame with the dffits values, establish a threshold (here |0.35|), and plot the values, labelling with their id those above the threshold (specialPoint).
datdffits <- data.frame(dffits = dffits(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dffits) >= .35))

ggplot(datdffits, aes(y=dffits, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdffits, specialPoint==TRUE),
            aes(y=dffits,x=id,label=id), colour='black') +
    geom_hline(yintercept = c(-0.35,0.35), color= 'blue') +
    geom_hline(yintercept = 0)

  • dfbeta: remember that this measure gives the change in each individual coefficient when the corresponding observation is removed:
head(dfbeta(fit))
  (Intercept)        mkm         ppn         law
1  -0.1022614 0.11881682 -0.09349927 -0.26195987
2  -0.1331656 0.22275746 -0.15807657 -0.56786007
3  -0.1279422 0.11452337 -0.07920934 -0.23464337
4  -0.2032275 0.13006726 -0.06413772 -0.23584650
5  -0.0524778 0.02353784 -0.01104844 -0.02756976
6  -0.1188004 0.03961875 -0.01108847 -0.02386593
plot(dfbeta(fit)[ , 2])

dfbeta for kms:

datdfbeta <- data.frame(dfbetamkm = dfbeta(fit)[ , 2]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetamkm) >= .2))

ggplot(datdfbeta, aes(y=dfbetamkm, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta, specialPoint==TRUE),
            aes(y=dfbetamkm,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.2,0.2), color= 'blue') +
    ylab('dfbeta: mkm')

datdfbeta2 <- data.frame(dfbetappn = dfbeta(fit)[ , 3]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetappn) >= .4))

ggplot(datdfbeta2, aes(y=dfbetappn, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta2, specialPoint==TRUE),
            aes(y=dfbetappn,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.4,0.4), color= 'blue') +
    ylab('dfbeta: PetrolPrice')

  • cooks.distance: it returns one number per observation summarizing that observation’s overall influence on the fitted coefficients. Cook’s distance is bounded below by zero and is easiest to inspect as a plot. It can be thought of as a Mahalanobis-type distance: it accounts not only for the Euclidean distance between coefficient estimates but also for their scale and correlation.
datcooks <- data.frame(cooksdist = cooks.distance(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(cooksdist >= .03))

ggplot(datcooks, aes(y=cooksdist, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datcooks, specialPoint==TRUE),
            aes(y=cooksdist,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = 0.03, color= 'blue') +
    ylab('Cooks distance')

Evaluating the Leverage

summary(round(hatvalues(fit), 3))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.00600 0.01175 0.01700 0.02082 0.02600 0.06700 
dathatv <- data.frame(hatv = hatvalues(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(hatv >= .06))

ggplot(dathatv, aes(y=hatv, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(dathatv, specialPoint==TRUE),
            aes(y=hatv,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0.06, color= 'blue') +
    ylab('hatvalues')
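The question also mentions the influence function; base R bundles all of these diagnostics (dfbetas, dffits, covariance ratios, Cook's distance and hat values) in influence.measures, which additionally flags the observations it considers influential. A minimal sketch:

infl <- influence.measures(fit)
head(round(infl$infmat, 3))   # one row per observation, one column per diagnostic
summary(infl)                 # prints only the observations flagged as influential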

Page 100

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
Seatbelts <- as.data.frame(Seatbelts)

# Fit the linear model
model <- lm(DriversKilled ~ kms + PetrolPrice + law, data = Seatbelts)
summary(model)$coef
                 Estimate   Std. Error   t value     Pr(>|t|)
(Intercept)  2.014614e+02 1.625587e+01 12.393145 3.601219e-26
kms         -1.223318e-03 6.656567e-04 -1.837761 6.767594e-02
PetrolPrice -5.683347e+02 1.520552e+02 -3.737687 2.463128e-04
law         -1.188920e+01 6.025785e+00 -1.973055 4.995497e-02
2. Perform a model selection exercise to arrive at a final model.

In this case we are interested in knowing whether or not the law was effective.

fit0 <- lm(DriversKilled~law, Seatbelts)
fit1 <- lm(DriversKilled~law + mkm, Seatbelts)
fit2 <- lm(DriversKilled~law + ppn, Seatbelts)
fit3 <- lm(DriversKilled~law + mkm + ppn, Seatbelts)
anova(fit0,fit1,fit2,fit3)
Analysis of Variance Table

Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm
Model 3: DriversKilled ~ law + ppn
Model 4: DriversKilled ~ law + mkm + ppn
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
1    190 109754                                
2    189 105608  1    4145.3 7.9276 0.005388 **
3    189 100069  0    5538.9                   
4    188  98303  1    1766.0 3.3774 0.067676 . 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit0,fit3)
Analysis of Variance Table

Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm + ppn
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
1    190 109754                                  
2    188  98303  2     11450 10.949 3.177e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are particularly interested in the comparison of fit0 and fit3, that is, in whether both kms and petrol price are needed in addition to the law variable. fit0 is the base model with only the law variable; fit3 adds kilometres and petrol price, so the comparison is a 2 degree of freedom test because two parameters were added. The test is highly significant, suggesting that we should include both parameters.

rbind(summary(fit0)$coef[2, ],
      summary(fit1)$coef[2, ],
      summary(fit2)$coef[2, ],
      summary(fit3)$coef[2, ])
      Estimate Std. Error   t value     Pr(>|t|)
[1,] -25.60895   5.341655 -4.794198 3.288375e-06
[2,] -17.55372   6.028888 -2.911602 4.028394e-03
[3,] -16.32618   5.555579 -2.938700 3.706585e-03
[4,] -11.88920   6.025785 -1.973055 4.995497e-02

The coefficient for the “law” variable changes depending on which other variables are included in the model. This is due to confounding: the predictors are correlated with each other and with law, so including them changes each other’s estimated coefficients.

Specifically, the law coefficient goes from about 25 fewer deaths per month (law alone) to about 12 fewer deaths per month when both other variables are included. Adjusting only for the variability in kilometres brings the estimate down to about 17 fewer deaths; adjusting only for the variability in petrol price brings it to about 16 fewer deaths per month; adjusting for both brings it to about 12 fewer deaths per month. This suggests that the unadjusted estimate of the law’s effect was inflated by the influence of other factors, and that the actual effect is smaller.

While the decrease in coefficient magnitude might raise concerns about the law’s effectiveness, the author argues that including these other variables is important for a comprehensive analysis. Since the additional variables are highly significant and logically relevant, the author concludes that model 4, which includes all variables, is the most appropriate choice.

Page 104

1. True or false, generalized linear models transform the observed outcome. (Discuss.)

FALSE.

  • GLMs transform the expected value of the outcome, not the observed outcome itself.

  • The transformation is applied through a link function, which connects the expected value to the linear predictor.

  • Transforming the expected value allows GLMs to handle non-normal response variables and model a wider range of relationships.

  • Case 1.
\(log(Y)=\beta_0+\beta_1 X+\epsilon\)


Here we are not fitting a generalized linear model; we are simply fitting a linear model to a transformed outcome.

  • Case 2. (Log-linear model)
\(log(E[Y])=\beta_0+\beta_1X\) ,


where \(E[Y]\) is the expected value of Y.

  • Case 3. (Logistic regression)
\(log(\frac{E[Y]}{1-E[Y]})=\beta_0+\beta_1 X\) ,


where \(E[Y]/(1-E[Y])\) is the odds of Y.

Transforming the observed outcome directly changes the data itself, not the model. This can be done for various reasons, such as to improve normality or stabilize variance. It is not a defining characteristic of GLMs. Therefore, the statement “GLMs transform the observed outcome” is incorrect.
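To make the distinction concrete, here is a minimal sketch with simulated data (the variable names are illustrative, not from any exercise dataset), contrasting a linear model fitted to a transformed outcome with a GLM that transforms only the expected value through a link function.

set.seed(42)
x <- rnorm(100)
y <- rpois(100, lambda = exp(1 + 0.3 * x))     # simulated count outcome

# Case 1: transform the observed outcome, then fit an ordinary linear model
fitLm  <- lm(log(y + 1) ~ x)                   # models E[log(Y + 1)]

# Cases 2/3 style: the GLM leaves Y untouched and applies the log link to E[Y]
fitGlm <- glm(y ~ x, family = poisson(link = "log"))   # models log(E[Y])

coef(fitLm); coef(fitGlm)                      # similar numbers, but different quantities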

3. True or false, the generalized linear model assumes an exponential family for the outcome. (Discuss.)

FALSE.

The generalized linear model (GLM) does not strictly assume an exponential family for the outcome. The GLM framework allows for a broad range of probability distributions, including but not limited to the exponential family. The choice of distribution depends on the nature of the response variable.

In a linear model, the relationship between the response variable \(Y\) and predictors \(X_1\) and \(X_2\) is established through a direct equation:

\(Y= \beta_0 +\beta_1 X_1 +\beta_2 X_2 + \epsilon\)


In contrast, a generalized linear model (GLM) does not form such a direct equation. It begins by assuming a distribution for \(Y\), which can be from a variety of distributions, not just the exponential family. Linear models are a special case of GLMs, as they assume the normal distribution.

An exponential family encompasses various distributions, including normal, Bernoulli, binomial, Poisson, and gamma. These distributions depend on parameters, with one of the most important being the mean (\(\mu\)).

In GLMs, the connection between the response variable \(Y\) and the predictors \(X_1\) and \(X_2\) is established through the exponential family and its parameters. One of the key parameters is the mean (\(\mu\)), and the relationship is mediated by a link function (\(g\)). For instance, the link function might be expressed as:

\(g(\mu)= \beta_0 +\beta_1 X_1 +\beta_2 X_2\)


This indirect connection allows GLMs to handle a wide range of distributions and relationships between predictors and the response variable. For example, in Poisson regression, the natural log function is often used as the link function:

\(log(\mu)= \beta_0 +\beta_1 X_1 +\beta_2 X_2\).


  • Generalized linear models (GLMs) are an extension of linear models that can handle non-normal response variables.
  • They do this by assuming the response variable follows an exponential family distribution and then linking the mean of that distribution to the linear predictor (a linear combination of the predictor variables).
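As a small side illustration (not part of the exercise), R's family objects expose the link function \(g\) and its inverse directly, which makes the role of the link explicit.

fam <- poisson(link = "log")
fam$linkfun(10)              # g(mu) = log(10)
fam$linkinv(log(10))         # the inverse link maps back to the mean, 10
binomial()$linkfun(0.75)     # logit link: log(0.75 / 0.25)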
4. True or false, GLM estimates are obtained by maximizing the likelihood. (Discuss.)

TRUE.

In a Generalized Linear Model (GLM), the parameter estimates are obtained by maximizing the likelihood function. The likelihood represents the probability of observing the given data given the model parameters. Maximizing the likelihood essentially seeks the parameter values that make the observed data most probable under the assumed statistical model.

Poisson Regression Example:

The response variable is assumed to follow a Poisson distribution, which is suitable for count data. The model considers the log of the expected value of \(Y_i\), denoted as log(\(\mu_i\)), where \(\mu_i\) represents the mean for each individual subject.

The model equation is given by:

\(log(E[Y_i])=log(\mu_i)=\beta_0+\beta_1 X_i\)


This implies

\(\mu_i=e^{\beta_0+\beta_1 X_i}\).


Likelihood Function for Poisson Distribution:

Assuming \(Y_i\) follows an independent Poisson distribution with mean \(\mu_i\),

\(Y_i \overset{ind}{\sim} Poisson(\mu_i) \implies P(Y_i = y_i) = \frac{\mu_i^{y_i} e^{-\mu_i}}{y_i!}\)


the likelihood function takes the form:

\(L(\beta_0, \beta_1 \mid Y) = \overset{n}{\underset{i=1}\prod} \frac{\mu_i^{Y_i} e^{-\mu_i}}{Y_i!}\)


The likelihood depends on the parameters \(\beta_0\) and \(\beta_1\), given the observed data \(Y\).

Hence, the statement is true. Likelihood maximization is a core concept in the estimation of parameters in generalized linear models. It involves finding the parameter values that maximize the likelihood of observing the given data under the assumed statistical model.
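A minimal sketch of this idea with simulated data (the variable names are illustrative): maximize the Poisson log-likelihood numerically with optim and compare with glm, which maximizes the same likelihood via iteratively reweighted least squares.

set.seed(1)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

# Negative Poisson log-likelihood as a function of (beta0, beta1)
negll <- function(beta) {
    mu <- exp(beta[1] + beta[2] * x)
    -sum(dpois(y, lambda = mu, log = TRUE))
}

optim(c(0, 0), negll)$par             # direct likelihood maximization
coef(glm(y ~ x, family = poisson))    # essentially the same estimates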

5. True or false, some GLM distributions impose restrictions on the relationship between the mean and the variance. (Discuss.)

TRUE.

Example: Poisson Model

\(log(E[Y_i])=log(\mu_i)=\beta_0+\beta_1 X_i\)


Let’s assume that

\(Y_i\sim Poisson (\mu_i) \implies E[Y_i] = \mu_i\)


However, for the Poisson distribution

\(var(Y_i)= E[Y_i] = \mu_i\),


since the mean and the variance of the Poisson distribution are always equal.

Hence, for the Poisson, the Bernoulli, and many other distributions used in generalized linear models (GLMs), there is an implied relationship between the mean and the variance.
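A quick sketch of the restriction with simulated data: for a Poisson outcome the sample mean and variance should be close, and a quasi-Poisson fit estimates how far the data depart from that assumption instead of fixing the dispersion at 1.

set.seed(7)
y <- rpois(1000, lambda = 5)
c(mean(y), var(y))            # both close to 5: the Poisson forces Var(Y) = E[Y]
summary(glm(y ~ 1, family = quasipoisson))$dispersion   # estimated dispersion, near 1 here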

Page 115

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Create a new outcome variable for whether or not greater than 119 drivers were killed that month. Fit a logistic regression GLM with this variable as the outcome and kms, PetrolPrice and law as predictors. Interpret your parameters.
Seatbelts <- Seatbelts %>% 
    mutate(binkill = 1 * (DriversKilled > 119))

logRegSeatbelts <- glm(Seatbelts$binkill~Seatbelts$mkm + Seatbelts$ppn + Seatbelts$law, family = 'binomial')
summary(logRegSeatbelts)

Call:
glm(formula = Seatbelts$binkill ~ Seatbelts$mkm + Seatbelts$ppn + 
    Seatbelts$law, family = "binomial")

Coefficients:
               Estimate Std. Error z value Pr(>|z|)  
(Intercept)    0.024314   0.160775   0.151   0.8798  
Seatbelts$mkm -0.002938   0.059848  -0.049   0.9608  
Seatbelts$ppn -0.416408   0.169734  -2.453   0.0142 *
Seatbelts$law -0.615522   0.577808  -1.065   0.2868  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 266.09  on 191  degrees of freedom
Residual deviance: 253.62  on 188  degrees of freedom
AIC: 261.62

Number of Fisher Scoring iterations: 4
round(summary(logRegSeatbelts)$coef,3)
              Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.024      0.161   0.151    0.880
Seatbelts$mkm   -0.003      0.060  -0.049    0.961
Seatbelts$ppn   -0.416      0.170  -2.453    0.014
Seatbelts$law   -0.616      0.578  -1.065    0.287
round(exp(logRegSeatbelts$coeff),3)
  (Intercept) Seatbelts$mkm Seatbelts$ppn Seatbelts$law 
        1.025         0.997         0.659         0.540 
round(1-exp(logRegSeatbelts$coeff), 3)
  (Intercept) Seatbelts$mkm Seatbelts$ppn Seatbelts$law 
       -0.025         0.003         0.341         0.460 

If we exponentiate the coefficients we get \(1.025\) for the intercept, \(0.997\) for kms, \(0.659\) for petrol price and \(0.540\) for law. The most important effect is an estimated 46% decrease in the odds of more than 119 drivers being killed in a month after the law was enacted, compared with before, holding the other variables constant.

2. Fit a binomial model with DriversKilled as the outcome and drivers as the total count with kms , PetrolPrice and law as predictors, interpret your results.
# cbind creates a two-column matrix of successes and failures (not successes vs. total)

fitlgm <- glm(cbind(DriversKilled, drivers-DriversKilled)~mkm + ppn + law, family = 'binomial', data = Seatbelts)
summary(fitlgm)

Call:
glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~ 
    mkm + ppn + law, family = "binomial", data = Seatbelts)

Coefficients:
             Estimate Std. Error  z value Pr(>|z|)    
(Intercept) -2.536637   0.007399 -342.829   <2e-16 ***
mkm          0.003645   0.002733    1.334    0.182    
ppn         -0.007829   0.007479   -1.047    0.295    
law          0.030785   0.026527    1.161    0.246    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 234.93  on 191  degrees of freedom
Residual deviance: 229.93  on 188  degrees of freedom
AIC: 1496

Number of Fisher Scoring iterations: 3
round(summary(fitlgm)$coef,3)
            Estimate Std. Error  z value Pr(>|z|)
(Intercept)   -2.537      0.007 -342.829    0.000
mkm            0.004      0.003    1.334    0.182
ppn           -0.008      0.007   -1.047    0.295
law            0.031      0.027    1.161    0.246

The intercept (\(-2.537\)) is the log odds of a driver being killed, among those killed or seriously injured, when the predictors are at their reference values (average kms, average petrol price, law = 0). For every one-unit increase in mkm the log odds increase by 0.004, and for every one-unit increase in ppn the log odds decrease by 0.008, assuming the other predictors remain constant. When law changes from 0 to 1, the log odds increase by 0.031, assuming the other predictors remain constant.

In the output above, none of the predictors (mkm, ppn and law) have p-values less than 0.05, suggesting that, based on this model, none of them are statistically significant predictors of the outcome at the 5% significance level.

3. Refer to Question 1. Use the anova function to compare models with just law, with law and PetrolPrice, and with all three predictors.
fit1 <- glm(binkill~law, family = 'binomial', data = Seatbelts)
fit2 <- update(fit1, binkill~law + PetrolPrice)
fit3 <- update(fit2, binkill~law + PetrolPrice + kms)

anova(fit1,fit2,fit3)
Analysis of Deviance Table

Model 1: binkill ~ law
Model 2: binkill ~ law + PetrolPrice
Model 3: binkill ~ law + PetrolPrice + kms
  Resid. Df Resid. Dev Df Deviance
1       190     260.40            
2       189     253.62  1   6.7760
3       188     253.62  1   0.0024
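To attach p-values to those deviance changes, we can ask anova for a chi-squared test on the same models (a quick check):

anova(fit1, fit2, fit3, test = "Chisq")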

The residual deviance is analogous to the residual sum of squares we looked at for normal data. In this case, adding PetrolPrice looks significant while adding kms does not, so let’s look at the coefficient tables.

summary(fit1)$coef
               Estimate Std. Error    z value   Pr(>|z|)
(Intercept)  0.08288766  0.1539783  0.5383074 0.59036483
law         -1.12434153  0.4991985 -2.2522935 0.02430373
summary(fit2)$coef
               Estimate Std. Error   z value    Pr(>|z|)
(Intercept)   3.5875552  1.3851773  2.589961 0.009598679
law          -0.6260282  0.5367204 -1.166396 0.243454555
PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
summary(fit3)$coef
                 Estimate   Std. Error     z value   Pr(>|z|)
(Intercept)  3.612261e+00 1.473634e+00  2.45126028 0.01423570
law         -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
kms         -2.938343e-06 5.984816e-05 -0.04909663 0.96084229

Notice that the law variable suggests a rather large decrease in the log odds of more than 119 drivers being killed in a month after enactment of the law. However, that decrease attenuates quite a bit once petrol price is included, and notice that petrol price is quite significant.

It is interesting that the law variable goes from significant to not significant: its p-value increases by roughly a factor of 10. Including kilometres adds a variable that is not significant at all and does not change the law coefficient much.

Page 124

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM with UKDriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret your results.
fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts)
summary(fit)$coef
                Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
law         -0.114877106 0.025557951  -4.494770 6.964526e-06
round(summary(fit)$coef[ ,1],4)
(Intercept)         mkm         ppn         law 
     4.8198     -0.0100     -0.0554     -0.1149 

The coefficients

We loop over the coefficients, exponentiating each one; for the negative coefficients we report the decrease as \(1-e^{\hat{\beta}}\).

a <- data.frame(Intercept=0,mkm=0,ppn=0,law1=0)

for (i in 1:length(a)) {
a[i] <- if(summary(fit)$coef[ i,1] < 0) 
        {1-exp(summary(fit)$coef[ i,1])} else {exp(summary(fit)$coef[ i,1])}
}

print(a)
  Intercept        mkm        ppn      law1
1  123.9459 0.00993133 0.05385679 0.1085243
  • law: exponentiating, there is about an 11% decrease in the expected number of drivers killed after the law was enacted compared with before.

  • mkm: there is roughly a 1% decrease in the expected number of drivers killed for every additional 1,000 km driven.

  • The intercept: on the log scale, it is the expected number of drivers killed at the average petrol price and the average distance driven, with the law variable at 0 (before the law was enacted); exponentiated, this is about 124 deaths per month.

Interpretation for coefficients:

  • If you do not exponentiate, the coefficients are interpreted on the scale of the log of the expected outcome.

  • If you do exponentiate, they are interpreted on the relative scale for a one-unit increase in the variable: going from 0 to 1 for the binary law variable, an additional 1,000 km driven for the rescaled kilometres variable, and a one standard deviation increase for the normalized petrol price.

2. Refer to question 1. Fit a linear model with the log of drivers killed as the outcome. Interpret your results.
round(exp(coef(lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts))), 5)
(Intercept)         mkm         ppn         law 
  123.18577     0.99190     0.94787     0.87807 

In exercise 1 the Poisson model gives coefficients on the scale of the log of the expected count of driver deaths; here the linear model on the log count estimates the expected value of the log count, i.e. the log of the geometric mean.

summary(fit)$coef
                Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
law         -0.114877106 0.025557951  -4.494770 6.964526e-06
fit2 <-lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts)
summary(fit2)$coef
                Estimate  Std. Error    t value      Pr(>|t|)
(Intercept)  4.813693546 0.014291860 336.813651 2.082663e-263
mkm         -0.008133355 0.005281647  -1.539928  1.252593e-01
ppn         -0.053538784 0.014689904  -3.644597  3.464468e-04
law         -0.130030803 0.047811530  -2.719654  7.147529e-03

In this case the model estimates the expected value of the log of drivers killed, whereas the Poisson model estimates the log of the expected value; the two are not interchangeable.

1-exp(-0.130030803)
[1] 0.1219316

We see about a 12% decrease in the geometric mean number of driver deaths after the law was enacted, compared with before.

3. Refer to question 1. Fit your Poisson log-linear model with drivers as a log offset (to consider the proportion of drivers killed of those killed or seriously injured.)

Poisson Model:

fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts, offset = log(drivers))
fit0 <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts)
summary(fit)$coef
                Estimate  Std. Error     z value  Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm          0.003377675 0.002630717    1.283937 0.1991640
ppn         -0.007255064 0.007199577   -1.007707 0.3135952
law          0.028484328 0.025512651    1.116479 0.2642173
summary(fit0)$coef
                Estimate  Std. Error    z value     Pr(>|z|)
(Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
law         -0.114877106 0.025557951  -4.494770 6.964526e-06

Notice it is a very different model fit, since the interpretations differ. The first GLM models the log of the expected value of drivers killed divided by the total number killed or seriously injured:

\(\log\left(\frac{E[\text{DriversKilled}]}{\text{drivers}}\right) = \beta_0 + \beta_1 X_1 + \ldots\)


In contrast, fit0 is modelling the log of the expected value of \(Y\) by itself,

\(\log(E[Y]).\)


Now, let’s exponentiate the law coefficients from the two models:

exp(0.028484328)
[1] 1.028894
1-exp(-0.114877106)
[1] 0.1085243

In the first case (the rate model), the law is associated with roughly a 3% increase in the proportion of drivers killed among those killed or seriously injured; in the second, with roughly an 11% decrease in the number of drivers killed.

Applying a binomial model to the proportion

Another way we could have looked at this is with a binomial model (called fitbis below):

fitbis <- glm(cbind(DriversKilled, drivers - DriversKilled) ~ mkm + ppn +law, 
              family = binomial, 
              data = Seatbelts)
summary(fitbis)$coef
                Estimate  Std. Error     z value  Pr(>|z|)
(Intercept) -2.536637162 0.007399129 -342.829174 0.0000000
mkm          0.003644902 0.002732722    1.333799 0.1822698
ppn         -0.007828905 0.007478975   -1.046789 0.2951971
law          0.030785128 0.026526920    1.160524 0.2458355

Compare with the fit of the Poisson rate model:

summary(fit)$coef
                Estimate  Std. Error     z value  Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm          0.003377675 0.002630717    1.283937 0.1991640
ppn         -0.007255064 0.007199577   -1.007707 0.3135952
law          0.028484328 0.025512651    1.116479 0.2642173

This model also treats drivers killed as a proportion of those killed or seriously injured, so fitbis should give estimates pretty similar to the offset Poisson fit, and indeed they are very similar. They are slightly different models: one is a binomial model and the other a Poisson rate model.

The Poisson distribution approximates the binomial when the success probability is small, and here the proportion of drivers killed among those killed or seriously injured is small.

So the two models approximate each other well. The important distinction is that with the log offset we are modelling the proportion rather than the raw number of drivers killed. This is a fairly stark example, because some of the coefficients actually reverse their sign.
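A quick check that the proportion really is small in these data:

with(Seatbelts, mean(DriversKilled / drivers))   # average monthly proportion of those killed or seriously injured who were killed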

4. Refer to Question 1. Use the anova function to compare models with just law, law and PetrolPrice and all three predictors.
fit1 <- glm(DriversKilled ~ law, family = poisson, data = Seatbelts)
fit2 <- update(fit1, DriversKilled ~law + PetrolPrice )
fit3 <- update(fit2, DriversKilled ~law + PetrolPrice + kms)

anova(fit1, fit2, fit3)
Analysis of Deviance Table

Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + PetrolPrice
Model 3: DriversKilled ~ law + PetrolPrice + kms
  Resid. Df Resid. Dev Df Deviance
1       190     870.06            
2       189     792.88  1   77.178
3       188     778.32  1   14.561

Going from model 1 to model 2, the deviance drops by \(77.178\) on one degree of freedom. Remember that a chi-squared statistic with one degree of freedom is the square of a standard normal, so \(77.178\) is highly significant, as is \(14.561\).
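To make that concrete, the corresponding chi-squared p-values (equivalently available via anova with test = "Chisq") are:

pchisq(c(77.178, 14.561), df = 1, lower.tail = FALSE)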

Now let’s look at the summary coefficient tables for each:

summary(fit1)$coef
              Estimate  Std. Error   z value     Pr(>|z|)
(Intercept)  4.8352482 0.006856395 705.21727 0.000000e+00
law         -0.2274727 0.021923993 -10.37552 3.204779e-25
summary(fit2)$coef
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept)  5.3499077 0.05886323 90.887084 0.000000e+00
law         -0.1516301 0.02364153 -6.413720 1.420110e-10
PetrolPrice -5.0697421 0.57792672 -8.772292 1.750654e-18
summary(fit3)$coef
                 Estimate   Std. Error   z value     Pr(>|z|)
(Intercept)  5.440656e+00 6.360114e-02 85.543371 0.000000e+00
law         -1.148771e-01 2.555795e-02 -4.494770 6.964526e-06
PetrolPrice -4.546821e+00 5.948884e-01 -7.643150 2.119715e-14
kms         -9.980975e-06 2.614002e-06 -3.818274 1.343887e-04

So the law coefficient (\(-0.2274727\)) is highly significant, but it is attenuated to (\(-0.1516301\)) after the inclusion of PetrolPrice, and drops further to (\(-0.1148771\)) after the inclusion of both PetrolPrice and kms (kilometres). The significance also drops from about \(10^{-25}\) to \(10^{-10}\) and then \(10^{-6}\), but remains very strong.