Page 14

Exercise 1

Consider the dataset given by x=c(0.725,0.429,-0.372 ,0.863). What value of mu minimizes sum((x - mu)^2)?

\[\sum(X_i-\mu)^2\geqslant\sum(X_i-\bar{X})^2 \Rightarrow \text{Answer: } \bar{X}\]
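A quick numerical check in R (a small sketch; mean() returns the least-squares minimizer directly):

x <- c(0.725, 0.429, -0.372, 0.863)
mean(x)
## [1] 0.41125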

Exercise 2

Reconsider the previous question. Suppose that weights were given, w = c(2, 2, 1, 1) so that we wanted to minimize sum(w * (x - mu) ^ 2) for mu. What value would we obtain?

\[\hat{\mu}=\arg\min_\mu\sum_{i=1}^n w_i(X_i-\mu)^2=\frac{\sum_{i=1}^n w_iX_i}{\sum_{i=1}^n w_i}\]

x=c(0.725,0.429,-0.372 ,0.863)
w = c(2, 2, 1, 1)

sum(w*x/sum(w))
## [1] 0.4665
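As a cross-check (a sketch, not part of the original solution): an intercept-only weighted least-squares fit minimizes sum(w * (x - mu)^2), so its coefficient is the same weighted mean.

coef(lm(x ~ 1, weights = w))
## (Intercept) 
##      0.4665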

Exercise 3

Take the Galton dataset and obtain the regression through the origin slope estimate where the centered parental height is the outcome and the child’s height is the predictor.

Y=Galton$parent
X=Galton$child

YC = Y - mean(Y)
XC = X - mean((X))

lm(YC~XC-1)
## 
## Call:
## lm(formula = YC ~ XC - 1)
## 
## Coefficients:
##     XC  
## 0.3256
cor(YC,XC)*sd(YC)/sd(XC)
## [1] 0.3256475
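The same estimate follows from the regression-through-the-origin formula applied to the centered variables (a quick cross-check):

sum(XC * YC) / sum(XC^2)
## [1] 0.3256475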

Page 17

  1. Take the Galton dataset and find the mean, standard deviation and correlation between the parental and child heights.
  2. Center the parent and child variables and verify that the centered variable means are 0.
  3. Rescale the parent and child variables and verify that the scaled variable standard deviations are 1.
  4. Normalize the parental and child heights. Verify that the normalized variables have mean 0 and standard deviation 1 and take the correlation between them.
# 1.
mean(Galton$child);mean(Galton$parent)
## [1] 68.08847
## [1] 68.30819
sd(Galton$child);sd(Galton$parent)
## [1] 2.517941
## [1] 1.787333
cor(Galton$child,Galton$parent)
## [1] 0.4587624
# 2.
xc <- Galton$parent-mean(Galton$parent)
yc <- Galton$child-mean(Galton$child)

mean(xc); mean(yc)
## [1] 9.775954e-16
## [1] 4.817867e-16
all.equal(mean(xc),mean(yc))
## [1] TRUE
# 3.

xn <- xc/sd(xc)
yn <- yc/sd(yc)

sd(xn); sd(yn)
## [1] 1
## [1] 1
# 4.

mean(xn); mean(yn)
## [1] 5.501733e-16
## [1] 2.183943e-16
sd(xn); sd(yn)
## [1] 1
## [1] 1
cor(yn,xn)
## [1] 0.4587624

Page 23

Exercise 1

Install and load the package UsingR and load the father.son data with data(father.son). Get the linear regression fit where the son’s height is the outcome and the father’s height is the predictor. Give the intercept and the slope, plot the data and overlay the fitted regression line. \[\beta_1 = cor(Y,X)\times sd(Y)/sd(X)\] \[\beta_0=\bar{Y}-\beta_1\bar{X}\]
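The code below assumes the required packages have already been loaded; a minimal setup sketch would be:

# install.packages("UsingR")   # once, if the package is not installed
library(UsingR)    # provides the father.son and Galton datasets
library(ggplot2)   # used for the plots
data(father.son)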

x = father.son$fheight
y = father.son$sheight

fit = lm(sheight~fheight, data = father.son)
b1 = cor(y,x) * sd(y) / sd(x)
b0 = mean(y) - b1*mean(x)
rbind(coef(fit),c(b0,b1))
##      (Intercept)  fheight
## [1,]     33.8866 0.514093
## [2,]     33.8866 0.514093
ggplot(data = father.son, aes(x=fheight,y=sheight)) +
    geom_point() +
    geom_smooth(method = lm, se = FALSE)

Exercise 2

Remember the formula for the slope: \[\frac{\sum (X-\bar{X})(Y-\bar{Y})}{\sum(X-\bar{X})^2}\]

## Refer to problem 1. Center the father and son variables and refit the model omitting the
## intercept. Verify that the slope estimate is the same as the linear regression fit from problem 1.

xc <- x - mean(x)
yc <- y - mean(y)

sum(xc*yc)/sum(xc^2)
## [1] 0.514093
lm(yc~xc -1)
## 
## Call:
## lm(formula = yc ~ xc - 1)
## 
## Coefficients:
##     xc  
## 0.5141

Exercise 3

# Refer to problem 1. Normalize the father and son data and see that the fitted slope is the
# correlation.

xn <- xc/sd(x)
yn <- yc/sd(y)

lm(yn~xn)
## 
## Call:
## lm(formula = yn ~ xn)
## 
## Coefficients:
## (Intercept)           xn  
##   1.820e-15    5.013e-01
cor(yn,xn)
## [1] 0.5013383

Exercise 4

# Go back to the linear regression line from Problem 1. If a father’s height was 63 inches, what
# would you predict the son’s height to be?

fit = lm(sheight~fheight, data = father.son)

# Method 1
b0 =coef(fit)[1]; b1 = coef(fit)[2]
# Y = b0 + b1*X
b0 + b1*63
## (Intercept) 
##    66.27447
# Method 2
predict(fit, newdata = data.frame(fheight=63))
##        1 
## 66.27447

Exercise 5

# Consider a data set where the standard deviation of the outcome variable is double that of the
# predictor. Also, the variables have a correlation of 0.3. If you fit a linear regression model, what
# would be the estimate of the slope?

\[sd(Y)/sd(X)=2;\quad cor(Y,X)=0.3\]

\[\beta_1 = cor(Y,X)\times sd(Y)/sd(X)\]

\[\beta_1=0.3\times2=0.6\]
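A quick simulation illustrates the result (a sketch with simulated data; MASS::mvrnorm with empirical = TRUE makes the sample moments match the stated ones exactly):

library(MASS)
# x has sd 1, y has sd 2 (double), and the correlation is 0.3
Sigma <- matrix(c(1,           0.3 * 1 * 2,
                  0.3 * 1 * 2, 4), nrow = 2)
xy <- mvrnorm(n = 100, mu = c(0, 0), Sigma = Sigma, empirical = TRUE)
coef(lm(xy[, 2] ~ xy[, 1]))[2]   # slope = 0.3 * 2 = 0.6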

Exercise 6

# Consider the previous problem. The outcome variable (Y) has a mean of 1 and the predictor (X) has a
# mean of 0.5. What would be the intercept?

\[\beta_0=\bar{Y}-\beta_1\bar{X} =1-0.6\times0.5=0.7\]

Exercise 7

## True or false, if the predictor variable has mean 0, the estimated intercept from linear regression
## will be the mean of the outcome?

TRUE

\[\beta_0=\bar{Y}-\beta_1\bar{X} = \bar{Y}-\beta_1\times0=>\beta_0=\bar{Y}\]

Exercise 8

# Consider problem 5 again. What would be the estimated slope if the predictor and outcome
# were reversed?

\[\beta_1 = cor(Y,X)\times sd(X)/sd(Y)=0.3\times1/2=0.15\]

Page 27

Exercise 1

You have two noisy scales and a bunch of people that you’d like to weigh. You weigh each person on both scales. The correlation was 0.75. If you normalized each set of weights, what would you have to multiply the weight on one scale to get a good estimate of the weight on the other scale?

\[X=\text{scale A};\ Y=\text{scale B};\ cor(Y,X)= 0.75\] So, \[\beta_1=0.75\] and \[\beta_0=0\] because both variables have been normalized. \[Y=\beta_1X \Rightarrow Y=0.75X\]

Exercise 2

Consider the previous problem. Someone’s weight was 2 standard deviations above the mean of the group on the first scale. How many standard deviations above the mean would you estimate them to be on the second?

\[Y=\beta_1X \Rightarrow Y=0.75X \Rightarrow 0.75\cdot2=1.50\]

Exercise 3

You ask a collection of husbands and wives to guess how many jellybeans are in a jar. The correlation is 0.2. The standard deviation for the husbands is 10 beans while the standard deviation for wives is 8 beans. Assume that the data were centered so that 0 is the mean for each. The centered guess for a husband was 30 beans (above the mean). What would be your best estimate of the wife’s guess?

Knowing that: \[\beta_1= cor(Y, X)\frac{sd(Y)}{sd(X)}\]

\[X= beansHusbands;Y=beansWives\] \[Y=\beta_1X\]

cor_YX= 0.2
sdX <- 10 # Standard deviation husbands
sdY <- 8  # Standard deviation wives
# beta_1 <- cor_YX*sd(Y)/sd(X)
beta_1 <- cor_YX*sdY/sdX
beta_1
## [1] 0.16
Y <- beta_1*30
Y
## [1] 4.8

\[\Rightarrow Y=0.16X;\ X=30 \Rightarrow Y=0.16\cdot30=4.8\]

Page 33

Exercise 1

Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Give a p-value for the slope coefficient and perform the relevant hypothesis test.

p_value <- summary(lm(sheight~fheight, father.son))$coefficients[2,4]
p_value
## [1] 1.121268e-69

Hypothesis test \[H_0: \beta_1=0; H_A: \beta_1\neq0\]

The p-value under the null hypothesis is about \(1.12\times10^{-69}\), which is extremely small. So we reject the null hypothesis and conclude that there is a linear relation between fheight and sheight.

Exercise 2

Refer to question 1. Interpret both parameters. Recenter for the intercept if necessary.

fit <- lm(sheight~fheight, father.son)

summary(fit)$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69

We see that the estimate for fheight is 0.514093. This implies that for every 1 inch increase in fheight we expect a 0.514 inch increase in sheight. The intercept is 33.89, the estimated sheight for a father of 0 inches, which is not useful.

Lets recenter the father’s height:

fit2 <- lm(sheight ~ I(fheight - mean(fheight)), father.son)
summary(fit2)$coef
##                             Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)                68.684070 0.07421078 925.52689 0.000000e+00
## I(fheight - mean(fheight))  0.514093 0.02704874  19.00618 1.121268e-69

Note that the slope did not change: centering or shifting the regressor has no impact on the slope. The intercept is now 68.68, the expected sheight at the mean of fheight (67.69).

Exercise 3

Refer to question 1. Predict the son’s height if the father’s height is 80 inches. Would you recommend this prediction? Why or why not?

# Consider the FIT if exercise 1.

# Method 1: y = beta_0 + beta_1*x
beta_0 <- summary(fit)$coef[1,1]
beta_1 <- summary(fit)$coef[2,1]
beta_0+beta_1*80
## [1] 75.01405
# Method 2
predict(fit, newdata = data.frame(fheight = 80))
##        1 
## 75.01405

I would not recommend this prediction because 80 inches is an extreme value, beyond the range of the data (the maximum fheight is 75.43 inches). The relationship could curve away from the fitted line beyond the observed range, so the prediction would not be reliable.

summary(father.son)
##     fheight         sheight     
##  Min.   :59.01   Min.   :58.51  
##  1st Qu.:65.79   1st Qu.:66.93  
##  Median :67.77   Median :68.62  
##  Mean   :67.69   Mean   :68.68  
##  3rd Qu.:69.60   3rd Qu.:70.47  
##  Max.   :75.43   Max.   :78.36

Exercise 4

Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Interpret your coefficients, recenter for the intercept if necessary.

summary(lm(mpg~hp, mtcars))
## 
## Call:
## lm(formula = mpg ~ hp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 30.09886    1.63392  18.421  < 2e-16 ***
## hp          -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

We see a slope of -0.07, which means that for every additional unit of hp we predict a decrease of about 0.07 miles/gallon. This relation is highly significant, with a very small p-value. The intercept is 30.1 miles/gallon at 0 hp, but 0 hp is not a useful value at which to evaluate the intercept, so we recenter the regression line.

summary(lm(mpg ~ I(hp - mean(hp)), mtcars))
## 
## Call:
## lm(formula = mpg ~ I(hp - mean(hp)), data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.7121 -2.1122 -0.8854  1.5819  8.2360 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      20.09062    0.68288  29.420  < 2e-16 ***
## I(hp - mean(hp)) -0.06823    0.01012  -6.742 1.79e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.863 on 30 degrees of freedom
## Multiple R-squared:  0.6024, Adjusted R-squared:  0.5892 
## F-statistic: 45.46 on 1 and 30 DF,  p-value: 1.788e-07

Now the intercept is 20.09 miles/gallon, the expected mpg at the mean of hp.

Exercise 5

Refer to question 4. Overlay the fit onto a scatterplot.

ggplot(mtcars, aes(x=hp, y=mpg))+
    geom_point(cex=3, alpha= 0.5)+
    geom_smooth(method = 'lm', se=FALSE)

Exercise 6

Refer to question 4. Test the hypothesis of no linear relationship between horsepower and miles per gallon.

Hypothesis test: no linear relationship means that the slope should be 0, so the hypothesis test is: \[H_0: \beta_1=0; H_A: \beta_1\neq0\]

Looking at the coefficient table of the regression, we can see the slope estimate and its p-value:

summary(lm(mpg ~ hp, mtcars))$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
## hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

We observe that the p-value for the slope is about \(1.8\times10^{-7}\), so we reject the null hypothesis.

Exercise 7

Refer to question 4. Predict the miles per gallon for a horsepower of 111.

summary(mtcars$hp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    52.0    96.5   123.0   146.7   180.0   335.0
# Method 1
predict(lm(mpg ~ hp, mtcars), newdata = data.frame(hp = 111))
##        1 
## 22.52552

In this case we see that the predictor is well in the realm of data. And the predicted value of miles/gallon is 22.53.

This is the same as computing the prediction by hand:

# y = beta0 + beta1 * X
summary(lm(mpg ~ hp, mtcars))$coef[1,1]+summary(lm(mpg ~ hp, mtcars))$coef[2,1]*111
## [1] 22.52552

Page 48

Exercise 1

Fit a linear regression model to the father.son dataset with the father as the predictor and the son as the outcome. Plot the father’s height (horizontal axis) versus the residuals (vertical axis).

fitfs <- lm(sheight ~ fheight, father.son)
ggplot(father.son, aes(x= fheight, y= sheight)) +
    geom_point(colour= "red", size= 2, alpha=0.5) +
    geom_smooth(method = 'lm', se= FALSE, lwd= 1.5)

ggplot(data = father.son, aes(x= fheight, y = resid(fitfs))) +
    geom_point(colour= "red", size= 2, alpha=0.5) +
    geom_hline(yintercept = 0, lwd=2)+
    xlab("x= Father's height") +
    ylab("Residuals")

plot(father.son$fheight, resid(fitfs))
abline(h = 0, lwd=2, col='red')

Exercise 2

Refer to question 1. Directly estimate the residual variance and compare this estimate to the output of lm.

Residual Variance \[ \hat{\sigma}^2= \frac{1}{n-2} \sum_{i=1}^n e_i^2 \]

Residual variance

(1/(nrow(father.son)-2))*sum(resid(fitfs)^2)
## [1] 5.936804

Residual standard error (square root of variance)

sqrt((1/(nrow(father.son)-2))*sum(resid(fitfs)^2))
## [1] 2.436556

The summary of the linear model gives

summary(fitfs)$sigma^2
## [1] 5.936804
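Equivalently (a small sketch), the same quantity can be obtained from the deviance and the residual degrees of freedom:

deviance(fitfs) / df.residual(fitfs)
## [1] 5.936804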

Exercise 3

Refer to question 1. Give the R squared for this model.

cor(father.son$sheight,father.son$fheight)^2
## [1] 0.2513401
summary(fitfs)$r.squared
## [1] 0.2513401
# Adjusted r-squared
summary(fitfs)$adj.r.squared
## [1] 0.2506443

Exercise 4

Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Plot horsepower versus the residuals.

fitmtcars <- lm(mpg~hp, mtcars)
mtcars %>% mutate(residuals = resid(fitmtcars)) %>% 
ggplot(aes(x= hp, y= residuals)) +
    geom_point( colour= 'red', size=3, alpha= .5) +
    geom_hline(yintercept = 0) +
    xlab("X = HP") +
    ylab("Residuals")

That said, the linear model may not be the best fit for this relationship, as the curvature in the residual plot above and in the scatterplot below suggests.

qplot(mpg, hp, data = mtcars)

Exercise 5

Refer to question 4. Directly estimate the residual variance and compare this estimate to the output of lm.

sum(resid(lm(mpg~hp, mtcars))^2)/(nrow(mtcars)-2)
## [1] 14.92248
summary(lm(mpg~hp, mtcars))$sigma^2
## [1] 14.92248

Remember the residual variance \[ \hat{\sigma}^2= \frac{1}{n-2} \sum_{i=1}^n e_i^2 \]

Exercise 6

Refer to question 4. Give the R squared for this model.

Recall that: \[{R}^2=cor(Y,X)^2\]

cor(mtcars$mpg,mtcars$hp)^2
## [1] 0.6024373
summary(fitmtcars)$r.squared
## [1] 0.6024373

This means that the linear model explains 60.2% of the variation in mpg.

summary(fitmtcars)$adj.r.squared
## [1] 0.5891853

Page 56

Exercise 1

Test whether the slope coefficient for the father.son data is different from zero (father as predictor, son as outcome).

Hypothesis test \[H_0: \beta_1=0; H_A:\beta_1\neq 0\]

y= father.son$sheight; x=father.son$fheight; n= nrow(father.son)

Given \[\beta_1=cor(Y,X)\frac{Sd(Y)}{Sd(X)}\] and \[\beta_0=\bar{Y}-\beta_1\bar{X}\]

beta1 <- cor(y,x)*sd(y)/sd(x)
beta0 <- mean(y) - beta1*mean(x)

Remember, the predicted outcome for observation i is \(\hat{Y_i}\) at predictor value \(X_i\), and the residual is

\[e_i=Y_i-\hat{\beta_0}-\hat{\beta_1}X_i\]

e <- y-beta0 -beta1*x

The standard error of the residuals is

\[ \hat{\sigma} = \sqrt{ \frac{1}{n-2} \sum_{i=1}^n e_i^2} \]

sigma <- sqrt(sum(e^2)/(n-2))
ssx <- sum((x - mean(x))^2)

The standard error of the regression slope is \[\sigma_{\hat{\beta_1}}=\sqrt{Var(\hat{\beta_1})}=\frac{\hat{\sigma}}{\sqrt{\sum_{i=1}^n(X_i-\bar{X})^2}}\]

seb1 <- sigma / sqrt(ssx)

Knowing that [see 17 p.3]: \[T=\frac{\hat{\beta_j}-\beta_j}{\hat{\sigma_{\hat{\beta_j}}}}\]

If the hypothesis is that the true value is 0 we have: \[T= \frac{\hat{\beta_j}}{\hat{\sigma_{\hat{\beta_j}}}}\]

tBeta1 <- beta1 / seb1
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTab <- data.frame(beta1, seb1, tBeta1, pBeta1)
colnames(coefTab) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTab) <- "x"
coefTab
##   Estimate Std. Error  t value      P(>|t|)
## x 0.514093 0.02704874 19.00618 1.121268e-69

Comparing with summary():

summary(lm(sheight~fheight, father.son))$coef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69
coefTab[1,4]
## [1] 1.121268e-69
summary(lm(sheight~fheight, father.son))$coef[2,4]
## [1] 1.121268e-69

The p-value of the slope is very small, which suggests that we have to reject the null hypothesis. The slope coefficient is different from 0, so we can consider the father's height a statistically significant predictor of the son's height.
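The intercept row can be reproduced in the same way (a sketch using the quantities already computed above; the standard error of the intercept is \(\hat\sigma\sqrt{1/n+\bar{X}^2/\sum_{i=1}^n(X_i-\bar{X})^2}\)):

seb0 <- sigma * sqrt(1 / n + mean(x)^2 / ssx)
tBeta0 <- beta0 / seb0
pBeta0 <- 2 * pt(abs(tBeta0), df = n - 2, lower.tail = FALSE)
c(Estimate = beta0, `Std. Error` = seb0, `t value` = tBeta0, `P(>|t|)` = pBeta0)

These values match the (Intercept) row of the summary() output above.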

Exercise 2

Refer to question 1. Form a confidence interval for the slope coefficient.

Remember degree of freedom is n-2.

Confidence interval is \[\hat{\beta} \pm t_{\hat{\beta}}\times SE_{\hat{\beta}}\]

Put the coefficient in the variable sumCoef

sumCoef <- (summary(lm(sheight~fheight, father.son))$coef)
sumCoef
##              Estimate Std. Error  t value     Pr(>|t|)
## (Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
## fheight      0.514093 0.02704874 19.00618 1.121268e-69

We can use 2 methods to compute confidence interval

# Method 1
# Estimate....plus minus...t-stat (0.05)... df= n-2........ SE Estimate
sumCoef[2,1] + c(-1,1) * qt(0.975, df= fitfs$df.residual) * sumCoef[2,2]
## [1] 0.4610188 0.5671673
# Method 2
confint(fitfs)
##                  2.5 %     97.5 %
## (Intercept) 30.2912126 37.4819961
## fheight      0.4610188  0.5671673

Exercise 3

Refer to question 1. Form a confidence interval for the intercept (center the fathers’ heights first to get an intercept that is easier to interpret).

As we have seen in the previous exercises, the estimate for the intercept is 33.8866044.

summary(fitfs)$coef[1,1]
## [1] 33.8866

This is the son's height at a father's height of 0 inches, which does not make sense. So we refit the regression centering the father's height at its mean.

fitfsc <- lm(sheight ~ I(fheight - mean(fheight)), father.son)
fitfsc
## 
## Call:
## lm(formula = sheight ~ I(fheight - mean(fheight)), data = father.son)
## 
## Coefficients:
##                (Intercept)  I(fheight - mean(fheight))  
##                    68.6841                      0.5141

We will get a new intercept, 68.7 inches, the son’s height at average height of the father, 67.7 inches.

Now, the CI for this intercept (the son’s height at the average father’s height) is:

confint(fitfsc)
##                                 2.5 %     97.5 %
## (Intercept)                68.5384554 68.8296839
## I(fheight - mean(fheight))  0.4610188  0.5671673

And we can see also that there is no change in the slope coefficients.

Exercise 4

Refer to question 1. Form a mean value interval [it means the confidence interval, the interval for the regression line] for the expected son’s height at the average father’s height.

Calculating the expected values

predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'confidence')
##        fit      lwr      upr
## 1 68.68407 68.53846 68.82968

Exercise 5

Refer to question 1. Form a prediction interval [the interval for the predicted Y values] for the Son’s height at the average Father’s height.

\[\hat{Y} = \beta_0 + \beta_1\bar{X}\]

predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = "prediction")
##        fit      lwr      upr
## 1 68.68407 63.90091 73.46723
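The same interval can be computed by hand (a sketch; because the prediction is at the mean father's height, the \((x_0-\bar{x})^2\) term in the prediction standard error vanishes):

n <- nrow(father.son)
yhat <- predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)))
yhat + c(-1, 1) * qt(0.975, df = n - 2) * summary(fitfs)$sigma * sqrt(1 + 1/n)
## [1] 63.90091 73.46723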

Exercise 6

Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Test whether or not the horsepower power coefficient is statistically different from zero. Interpret your test.

fit <- lm(mpg ~ hp, mtcars)

Hypothesis tests

\[H_0:\hat\beta_1=0;H_A:\hat\beta_1\neq0\]

summary(fit)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
## hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

The p-value of the slope is very small (about \(1.79\times10^{-7}\)), which suggests that we have to reject the null hypothesis. The slope coefficient is different from 0, so we can consider horsepower a statistically significant predictor of miles per gallon.

Exercise 7

Refer to question 6. Form a confidence interval for the slope coefficient.

Method 1

confint(fit)
##                   2.5 %     97.5 %
## (Intercept) 26.76194879 33.4357723
## hp          -0.08889465 -0.0475619

Method 2

with(mtcars, cor(mpg, hp)*sd(mpg)/sd(hp))+ c(-1,1) * qt(0.975, df=nrow(mtcars)-2)*summary(fit)$coef[2,2]
## [1] -0.08889465 -0.04756190

Exercise 8

Refer to question 6. Form a confidence interval for the intercept (center the HP variable first).

Centered coefficients

fit <- lm(mpg ~ I(hp-mean(hp)), mtcars)

Method 1

confint(fit)
##                        2.5 %     97.5 %
## (Intercept)      18.69599452 21.4852555
## I(hp - mean(hp)) -0.08889465 -0.0475619

Method 2

predict(fit, newdata = data.frame(hp = mean(mtcars$hp)), interval = 'confidence')
##        fit      lwr      upr
## 1 20.09062 18.69599 21.48526

Exercise 9

Refer to question 6. Form a mean value interval [it means the confidence interval, the interval for the regression line] for the expected MPG for the average HP.

fit <- lm(mpg~hp,mtcars)
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'confidence')
##        fit      lwr      upr
## 1 20.09062 18.69599 21.48526

Exercise 10

Refer to question 6. Form a prediction interval for the expected MPG for the average HP.

predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'prediction')
##        fit      lwr      upr
## 1 20.09062 12.07908 28.10217

Exercise 11

Refer to question 6. Create a plot that has the fitted regression line plus curves at the expected value and prediction intervals.

y <- mtcars$mpg; x <- mtcars$hp; n <- length(y)
fit <- lm(y~x)
p1 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'confidence'))

p2 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'prediction'))
p1 <- p1 %>% 
    mutate(interval = 'confidence') %>% 
    mutate(x = mtcars$hp)
p2 <- p2 %>% 
    mutate(interval = 'prediction') %>% 
     mutate(x = mtcars$hp)

dat <- rbind(p1,p2)
names(dat)[1] <- "y"
dat %>% 
    ggplot(aes(x = x, y = y)) +
    geom_ribbon(aes(ymin= lwr, ymax= upr, fill= interval), alpha= 0.4) +
    geom_line(colour="black", lwd=.5) +
    geom_point(data= data.frame(x=x,y=y), aes(x=x,y=y))

Multivariable Regression Analysis: Page 62

Exercise 1

Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths [DriversKilled] with kms and PetrolPrice as predictors. Interpret your results.

Seatbelts <- as.data.frame(Seatbelts)

round(summary(lm(DriversKilled ~ kms + PetrolPrice, Seatbelts))$coef,4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  215.7461    14.6656 14.7110   0.0000
## kms           -0.0017     0.0006 -2.8469   0.0049
## PetrolPrice -643.7895   148.2896 -4.3414   0.0000

The intercept is the expected number of deaths at 0 kms and 0 PetrolPrice, which is not useful, so it is better to center the variables. The kms estimate is the change in deaths per additional km driven, a change so small that the coefficient is hard to read, so it is better to rescale it (e.g. per 1,000 km). PetrolPrice is an index, which is easiest to interpret after normalizing it.

Seatbelts <- Seatbelts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

Exercise 2

Predict the number of driver deaths at the average kms and petrol levels.

fit <- lm(DriversKilled ~ kms + PetrolPrice, Seatbelts)

predict(fit, newdata = data.frame(kms= mean(Seatbelts$kms), 
                                  PetrolPrice= mean(Seatbelts$PetrolPrice)))
##        1 
## 122.8021

This is the number of deaths at the mean of kms and PetrolPrice.
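Equivalently (a sketch), because mkm and ppn are the centered versions of the predictors, this is just the intercept of the centered model from question 1:

coef(lm(DriversKilled ~ mkm + ppn, Seatbelts))["(Intercept)"]
## (Intercept) 
##    122.8021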

Exercise 3

Take the residual for DriversKilled having regressed out kms and an intercept and the residual for PetrolPrice having regressed out kms and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.

dk <- Seatbelts$DriversKilled
kms <- Seatbelts$kms
pp <- Seatbelts$PetrolPrice
fitFull <- lm(dk~kms+pp)

Now we get the residuals.

Remember the formula for the estimated coefficient in terms of residuals: \[\hat\beta_1=\frac{\sum_{i=1}^n e_{i,Y|X_2}\, e_{i,X_1|X_2}}{\sum_{i=1}^n e_{i,X_1|X_2}^2}\]

The residual for DriversKilled is:

edk <- resid(lm(dk~kms))

The residual for PetrolPrice is:

epp <- resid(lm(pp~kms))

So the estimate for pp is:

sum(edk*epp)/sum(epp^2)
## [1] -643.7895

Remember the coefficients of fitFull:

round(summary(fitFull)$coef, 4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  215.7461    14.6656 14.7110   0.0000
## kms           -0.0017     0.0006 -2.8469   0.0049
## pp          -643.7895   148.2896 -4.3414   0.0000

This coincides with the estimate above.

We can also calculate it by fitting the regression through the origin directly:

round(summary(lm(edk~epp- 1))$coef,4)
##      Estimate Std. Error t value Pr(>|t|)
## epp -643.7895   147.5111 -4.3643        0

Exercise 4

Take the residual for DriversKilled having regressed out PetrolPrice and an intercept. Take the residual for kms having regressed out PetrolPrice and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.

edk2 <- resid(lm(dk ~ pp))
ekms <- resid(lm(kms ~ pp))
round(summary(lm(edk2 ~ ekms -1))$coef, 4)
##      Estimate Std. Error t value Pr(>|t|)
## ekms  -0.0017      6e-04 -2.8619   0.0047
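As in question 3, the same estimate can also be obtained directly from the ratio of residual sums (a sketch using the residuals just computed):

sum(edk2 * ekms) / sum(ekms^2)   # matches the kms coefficient (-0.0017) from the full fit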

Multivariable examples and tricks: Page 76

Exercise 1

Do exercise 1 of the previous chapter if you have not already. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

Seatbelts <- as.data.frame(Seatbelts)

# Remember that kms has been rescaled and centered (mkm = kms/1000, then centered) and
# PetrolPrice has been normalized: ppn = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice)
round(summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef,4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.8021     1.6629 73.8503   0.0000
## mkm          -1.7495     0.6145 -2.8469   0.0049
## ppn          -7.8387     1.8055 -4.3414   0.0000

Since mkm is centered (in thousands of km) and ppn is normalized, the intercept (122.8) is the expected number of monthly driver deaths at the average kms and average petrol price. The mkm coefficient (-1.75) is the expected change in deaths per additional 1,000 km driven, and the ppn coefficient (-7.84) is the expected change per one standard deviation increase in petrol price, each holding the other variable constant.

Exercise 2

Repeat question 1 for the outcome being the log of the count of driver deaths. Interpret your coefficients.

# Method 1: create a new variable logkill = Log_e of DriversKilled
Seatbelts2 <- Seatbelts %>% 
    mutate(logkill = log(DriversKilled))
summary(lm(logkill ~ ppn + mkm, Seatbelts2))$coef
##                Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  4.78966306 0.013426810 356.723817 2.737888e-269
## ppn         -0.06412578 0.014579039  -4.398492  1.818005e-05
## mkm         -0.01400794 0.004962149  -2.822959  5.267843e-03
# Method 2: Calculate log10 inside lm()
summary(lm(log(DriversKilled) ~ ppn + mkm, Seatbelts))$coef
##                Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  4.78966306 0.013426810 356.723817 2.737888e-269
## ppn         -0.06412578 0.014579039  -4.398492  1.818005e-05
## mkm         -0.01400794 0.004962149  -2.822959  5.267843e-03

In this case we have: \[E[\ln Y_i]= \beta_0+\beta_1X_1+\beta_2X_2\]

\[e^{E[\ln Y_i]}= e^{\beta_0}\,e^{\beta_1X_1}\,e^{\beta_2X_2}\]

Geometric mean formula

\[\left(\prod_{i=1}^n x_i\right)^{\frac{1}{n}}=\sqrt[n]{x_1x_2\cdots x_n}=e^{\frac{1}{n}\sum_{i=1}^n \ln x_i}\]

Interpretation

Our normalized petrol price variable (ppn) has this interpretation: 1. we estimate an expected 6% decrease in the geometric mean of drivers killed for every 1 standard deviation increase in petrol price, holding the kilometers driven constant.

1-exp(-0.06412578)
## [1] 0.06211298
  2. we estimate an expected 1% decrease in the geometric mean of drivers killed for every additional 1,000 km driven, holding petrol price constant.
1-exp(-0.01400794)
## [1] 0.01391029

Exercise 3

Refer to question 1. Add the dummy variable law and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes. Change the reference level from No to Yes.

Model 2

\[E[Y_i|X_{1i},X_{2i}]=\beta_0+X_{1i}\beta_1+X_{2i}\beta_2\]

The model includes X1 and X2 without an interaction term: it fits two parallel lines with the same slope but different intercepts.

round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     124.2263     1.8012 68.9674   0.0000
## mkm              -1.2233     0.6657 -1.8378   0.0677
## ppn              -6.9199     1.8514 -3.7377   0.0002
## I(factor(law))1 -11.8892     6.0258 -1.9731   0.0500

In this case, the intercept has changed a little. It is the expected number of deaths at the average kms, the average petrol price and at level 0 of law, i.e. before the law was enacted.

If we wanted the expected value of DriversKilled after the law took effect, at the average mkm and ppn, we would add -11.8892 to the intercept.

Important!!!!

R tells us which factor level it is treating as the reference: in this case it is 0. The value -11.8892 is associated with factor level 1.

Change the reference level from No to Yes:

Seatbelts<-Seatbelts %>% 
    mutate(law = factor(law, levels = c(1,0)))

round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     112.3371     5.5547 20.2236   0.0000
## mkm              -1.2233     0.6657 -1.8378   0.0677
## ppn              -6.9199     1.8514 -3.7377   0.0002
## I(factor(law))0  11.8892     6.0258  1.9731   0.0500
Seatbelts<-Seatbelts %>% 
    mutate(law = factor(law, levels = c(0,1)))

Exercise 4

Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables.

Seatbelts <- Seatbelts %>% 
    mutate(ppf = as.factor((ppn <= -1.5)+ (ppn <= 0)+ (ppn <= 1.5)+ (ppn < Inf)))

round(summary(lm(DriversKilled ~ ppf + mkm + I(factor(law)), Seatbelts))$coef,4)
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)     109.8405     9.5066 11.5541   0.0000
## ppf2             10.8271     9.9462  1.0886   0.2778
## ppf3             18.6904     9.9374  1.8808   0.0616
## ppf4             25.0074    10.9163  2.2908   0.0231
## mkm              -1.2991     0.7668 -1.6942   0.0919
## I(factor(law))1 -15.3445     6.0345 -2.5428   0.0118
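An equivalent and perhaps more transparent way to build the four-level factor is with cut() (a sketch; ppf2 is just an illustrative name, and the break points mirror the thresholds used above):

ppf2 <- cut(Seatbelts$ppn, breaks = c(-Inf, -1.5, 0, 1.5, Inf))
table(Seatbelts$ppf, ppf2)   # a one-to-one mapping; the labels simply run in the opposite order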

Exercise 5

Perform the plot requested at the end of the last chapter.

Two levels model without interaction

swiss <- swiss %>% 
    mutate(CatholicBin = 1 * (Catholic > 50))

g <- ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin)))+ 
    geom_point(size = 5) +
    xlab("% in Agriculture") + 
    ylab("Fertility")

fit <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss) # Note the + operator

g +
    geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size=1, colour='black') +
    geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size=1, colour='black') +
    annotation_custom(grob = grobTree(textGrob("Catholic", 
                                               x=0.05,  
                                               y=.62, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic")))) +
    annotation_custom(grob = grobTree(textGrob("Non catholic", 
                                               x=0.05,  
                                               y=.5, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic"))))

Two levels model with interaction

fitfull <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
g +
    geom_abline(intercept = coef(fitfull)[1], slope = coef(fitfull)[2], size=1, colour='black') +
    geom_abline(intercept = coef(fitfull)[1] + coef(fitfull)[3], 
                slope = coef(fitfull)[2] + coef(fitfull)[4], size=1, colour='black') +
    
    annotation_custom(grob = grobTree(textGrob("Catholic", 
                                               x=0.05,  
                                               y=.58, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic")))) +
    annotation_custom(grob = grobTree(textGrob("Non catholic", 
                                               x=0.05,  
                                               y=.46, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic"))))

Adjustement: Page 85

Exercise 1

Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

Exercise 2

Compare the kms coefficient with and without the inclusion of the PetrolPrice variable in the model.

First of all let’s see the correlation between mkm and ppn:

cor(Seatbelts$mkm, Seatbelts$ppn)
## [1] 0.3839004

This moderate correlation indicates that the two variables carry related information; possibly the petrol price has an impact on the total kms driven.

fit0 <- lm(DriversKilled ~ mkm, Seatbelts)
fit1 <- lm(DriversKilled ~ mkm + ppn, Seatbelts)
summary(fit0)$coef
##               Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 122.802083  1.7391997 70.60839 2.665611e-138
## mkm          -2.773787  0.5935049 -4.67357  5.596266e-06
summary(fit1)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

The estimate is negative in both cases, which makes sense because both effects are in the same direction and the predictors are correlated. But the estimate changes from -2.8 to -1.7, which indicates the confounding effect of the variable PetrolPrice on the regression between DriversKilled and kms.

anova(fit0,fit1)
## Analysis of Variance Table
## 
## Model 1: DriversKilled ~ mkm
## Model 2: DriversKilled ~ mkm + ppn
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 110345                                  
## 2    189 100339  1     10006 18.848 2.305e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value is the same as in fit1 (ppn estimate).

Exercise 3

Compare the PetrolPrice coefficient with and without the inclusion of the kms variable in the model.

fit2 <- lm(DriversKilled ~ ppn, Seatbelts)
fit3 <- lm(DriversKilled ~ ppn + mkm, Seatbelts)
summary(fit2)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083   1.693656 72.507096 2.061333e-140
## ppn          -9.812019   1.698084 -5.778288  3.044208e-08
summary(fit3)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03

In this case too, the estimate is negative, which makes sense because both effects are in the same direction and the predictors are correlated. But the estimate changes from -9.8 to -7.8, which indicates the confounding effect of the variable kms on the regression between DriversKilled and PetrolPrice.

Residuals, variation, diagnostics: Page 96

Exercise 2

Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R’s residual variance estimate.

Seatbelts <- as.data.frame(Seatbelts)

round(summary(lm(DriversKilled ~ kms + PetrolPrice + law, Seatbelts))$coef,4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  201.4614    16.2559 12.3931   0.0000
## kms           -0.0012     0.0007 -1.8378   0.0677
## PetrolPrice -568.3347   152.0552 -3.7377   0.0002
## law1         -11.8892     6.0258 -1.9731   0.0500
Seatbelts <- Seatbelts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

summary(lm(DriversKilled ~ mkm + ppn + law, Seatbelts))$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 124.226311  1.8012324 68.967399 1.976641e-135
## mkm          -1.223318  0.6656567 -1.837761  6.767594e-02
## ppn          -6.919949  1.8513987 -3.737687  2.463128e-04
## law1        -11.889202  6.0257850 -1.973055  4.995497e-02
fit <- lm(DriversKilled ~ mkm + ppn + law, Seatbelts)
# Method 1
# Each estimated parameter (intercept + 3 predictors = 4) costs one degree of freedom,
# so the residual degrees of freedom are n - 4.
sum(resid(fit)^2) / (nrow(Seatbelts) - 4)
## [1] 522.8903
# Method 2
summary(fit)$sigma^2
## [1] 522.8903

Exercise 3

Refer to question 1. Perform an analysis of diagnostic measures including, dffits, dfbetas, influence and hat diagonals

Quick solution:

plot(fit, which = c(1,1))

The first plot shows residuals versus fitted values. Here you are looking for any kind of pattern, and in this case it does not look too bad. R labels some points with extreme values and fits a smoother through the residuals: it is often difficult to see a pattern just by looking at the points themselves, but a smoother can reveal patterns you would otherwise miss.

plot(fit, which = c(1,2))

The Q-Q plot is used to assess the normality of the errors. You are looking for whether the points fall mostly on the reference line that is drawn. Typically you see some deviations in the tails; here, for example, the residuals in the left tail are larger than the theoretical quantiles of the normal distribution would suggest.

However, remember that there is variability in the estimated quantiles of the standardized residuals: even data simulated from a normal distribution will deviate somewhat from the line. You want to check for gross departures rather than evaluate minor ones. In this case it does not seem too bad.

plot(fit, which = 3)

Scale-location plot: the next plot shows the square root of the absolute value of the standardized residuals against the fitted values. A smoother is put through it and, again, you are looking for patterns; it does not look bad here.

plot(fit, which = 5)

Residuals vs Leverage plot: the final plot shows the standardized residuals versus the leverage.

Remember that leverage measures how much potential for influence a point has by virtue of lying outside the norm of the predictor values.

For this collection of points we might, for example, look at the couple of points with fairly high leverage relative to the rest and check whether there is anything out of the ordinary about them.

R automatically labels some potentially interesting points [see 12 and 192], in this case ones with high standardized residuals, but these do not have terribly high leverage.

Let's look at some influence measures as well, keeping an eye on the 12th and 192nd points. For example, we can look at the dffits values with the dffits() function.

plot(dffits(fit))

Evaluating the influence

dffits: using ggplot, we can create a dataframe with the dffits values, establish a threshold value (here |0.35|), and plot the values, labelling the ids of the points above the threshold (specialPoint).

datdffits <- data.frame(dffits = dffits(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dffits) >= .35))

ggplot(datdffits, aes(y=dffits, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdffits, specialPoint==TRUE),
            aes(y=dffits,x=id,label=id), colour='black') +
    geom_hline(yintercept = c(-0.35,0.35), color= 'blue') +
    geom_hline(yintercept = 0)

dfbeta: remember that this indicator gives the change in each individual coefficient when the corresponding observation is deleted:

head(dfbeta(fit))
##   (Intercept)        mkm         ppn        law1
## 1  -0.1022614 0.11881682 -0.09349927 -0.26195987
## 2  -0.1331656 0.22275746 -0.15807657 -0.56786007
## 3  -0.1279422 0.11452337 -0.07920934 -0.23464337
## 4  -0.2032275 0.13006726 -0.06413772 -0.23584650
## 5  -0.0524778 0.02353784 -0.01104844 -0.02756976
## 6  -0.1188004 0.03961875 -0.01108847 -0.02386593
plot(dfbeta(fit)[ , 2])

dfbeta for kms:

datdfbeta <- data.frame(dfbetamkm = dfbeta(fit)[ , 2]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetamkm) >= .2))

ggplot(datdfbeta, aes(y=dfbetamkm, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta, specialPoint==TRUE),
            aes(y=dfbetamkm,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.2,0.2), color= 'blue') +
    ylab('dfbeta: mkm')

dfbeta for PetrolPrice (ppn):

datdfbeta2 <- data.frame(dfbetappn = dfbeta(fit)[ , 3]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetappn) >= .4))

ggplot(datdfbeta2, aes(y=dfbetappn, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta2, specialPoint==TRUE),
            aes(y=dfbetappn,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.4,0.4), color= 'blue') +
    ylab('dfbeta: PetrolPrice')

cooks.distance: it returns a vector of numbers, one overall influence summary per point, so it is easiest to look at a plot. Cook's distance is bounded below by zero; it is a Mahalanobis-type distance, which accounts not only for the Euclidean distance but also for the scale (normalization) of the parameters and their correlation.

For two parameters: \[\sqrt{ (x_b-x_a,\ y_b-y_a)\begin{pmatrix} \sigma_x^2 & cov(X,Y) \\ cov(X,Y) & \sigma_y^2 \end{pmatrix}^{-1} \begin{pmatrix} x_b-x_a\\ y_b-y_a \end{pmatrix}}\]

So it is guaranteed to be nonnegative.

datcooks <- data.frame(cooksdist = cooks.distance(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(cooksdist >= .03))

ggplot(datcooks, aes(y=cooksdist, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datcooks, specialPoint==TRUE),
            aes(y=cooksdist,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = 0.03, color= 'blue') +
    ylab('Cooks distance')

Evaluating the Leverage

summary(round(hatvalues(fit), 3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00600 0.01175 0.01700 0.02082 0.02600 0.06700
dathatv <- data.frame(hatv = hatvalues(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(hatv >= .06))

ggplot(dathatv, aes(y=hatv, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(dathatv, specialPoint==TRUE),
            aes(y=hatv,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0.06, color= 'blue') +
    ylab('hatvalues')

Multiple variables and model selection: page 106

Exercise 2

Perform a model selection exercise to arrive at a final model.

In this case we are interested in knowing whether or not the law was effective.

fit0 <- lm(DriversKilled~law, Seatbelts)
fit1 <- lm(DriversKilled~law + mkm, Seatbelts)
fit2 <- lm(DriversKilled~law + ppn, Seatbelts)
fit3 <- lm(DriversKilled~law + mkm + ppn, Seatbelts)
anova(fit0,fit1,fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: DriversKilled ~ law
## Model 2: DriversKilled ~ law + mkm
## Model 3: DriversKilled ~ law + ppn
## Model 4: DriversKilled ~ law + mkm + ppn
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    190 109754                                
## 2    189 105608  1    4145.3 7.9276 0.005388 **
## 3    189 100069  0    5538.9                   
## 4    188  98303  1    1766.0 3.3774 0.067676 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit0,fit3)
## Analysis of Variance Table
## 
## Model 1: DriversKilled ~ law
## Model 2: DriversKilled ~ law + mkm + ppn
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 109754                                  
## 2    188  98303  2     11450 10.949 3.177e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are particularly interested in comparing fit0 and fit3, because we want to know whether both of those predictors are necessary in addition to the law variable. The base model has only the law variable; the larger model adds kilometers and petrol price, so it is a 2 degree-of-freedom test because it adds two parameters. The test is highly significant, suggesting that we should include both predictors.

rbind(summary(fit0)$coef[2, ],
      summary(fit1)$coef[2, ],
      summary(fit2)$coef[2, ],
      summary(fit3)$coef[2, ])
##       Estimate Std. Error   t value     Pr(>|t|)
## [1,] -25.60895   5.341655 -4.794198 3.288375e-06
## [2,] -17.55372   6.028888 -2.911602 4.028394e-03
## [3,] -16.32618   5.555579 -2.938700 3.706585e-03
## [4,] -11.88920   6.025785 -1.973055 4.995497e-02

Another thing we are interested in is how the law coefficient changes depending on which variables are included.

If we look at the coefficient tables, grabbing the law row (the second row of summary(fit)$coef) from each fitted model and binding them together, we can see them all at once; the rows correspond to fit0, fit1, fit2 and fit3 in order. We can then follow what happens to the law estimate: the unadjusted model estimates about 25 fewer deaths per month after the law went into effect. If we adjust for the variability in kilometers driven, that drops to about 17 fewer deaths; if we adjust for the variability in petrol price, to about 16 fewer; and if we adjust for both, to about 12 fewer deaths per month.

On the other hand, the t value drops from about -4.8, which is very large, to about -1.97, which is marginal, and the p-value goes from highly significant to just barely significant (0.0499) against a 5% type 1 error rate. Because the ANOVA supports including both additional coefficients, and because it seems reasonable when evaluating the law to account for the variability associated with these other highly significant predictors, I would likely go with the full model (fit3).

Generalized linear models: Page 110

Exercise 1

True or false, generalized linear models transform the observed outcome.

The answer is FALSE.

Case 1. \[\log(Y)=\beta_0+\beta_1X+\epsilon\]

Recall that if we have an observed outcome and we model the log of that outcome, log(Y) = beta_0 + beta_1 X + epsilon, we are not fitting a generalized linear model; we are simply fitting a linear model where we have transformed the outcome.

In contrast, we might have a model where the log of the expected value of Y is equal to beta_0 plus beta_1 X: Case 2. \[\log(E[Y])=\beta_0+\beta_1X\]

That is a very different model. In the first case we have transformed the data, so we actually have to transform the outcome before we fit the model, for example before putting it into R.

In the second case, taking Poisson regression as the example, we are saying that the mean assumed by the Poisson likelihood is parameterized in terms of a collection of parameters and the X values. This is the major difference between transforming the outcome and transforming a parameter, which is what occurs in Generalized Linear Models (GLM).

One of the two most common examples of transforming a parameter is transforming the mean with the log, as in the Poisson regression case: there we assume that the linear predictor is on the log scale of the mean, and the observed outcome is related to that mean via the Poisson distribution.

Case 3. \[log(\frac{E(Y)}{1-E(Y)})=\beta_0+\beta_1X\]

The other very common one is the log of the expected value of Y over 1 minus the expected value of Y, when Y is a binary (Bernoulli) random variable. Y can take the value 0 or 1, so its expected value is a number between 0 and 1, and the log of the odds (the expected value over 1 minus the expected value) is set equal to the linear predictor, beta_0 plus beta_1 X. Case 1 is simply transforming the outcome; all three are useful techniques, but only cases 2 and 3 are examples of generalized linear models.

Exercise 2

True or false, the interpretation of the coefficients in a GLM are on the scale of the link function.

The answer is TRUE. All of the coefficients are interpreted on the link function scale. If you want a natural-scale interpretation you have to invert the link function; when the link is the log, we exponentiate to get a relative (multiplicative) interpretation.

\[log(E[Y|X=x])=\beta_0+\beta_1X\]

Take Poisson regression with a single regressor: the model states that the log of the expected value of the outcome, given that X takes the value x, is beta_0 + beta_1 x. So when we get an estimate of beta_1 and see it in the coefficient table, it is interpreted on the scale of the log of the expected outcome, that is, on the link function scale.

If we were doing logistic regression it would be interpreted on the scale of the logit of the expected value of the response.

Note that if we take the log of the expected value of Y given that X takes the value x + 1, we get \[\log(E[Y|X=x+1])=\beta_0+\beta_1(x+1)\]

If we subtract the two equations, \[\log(E[Y|X=x+1])-\log(E[Y|X=x])=\beta_1,\] we see that the interpretation of beta_1 is the change in the link function of the expected value per unit change in the regressor. If there were other terms, for example a beta_2 x_2 term, then as long as x_2 is held constant it drops out of the difference, and beta_1 is interpreted as the change in the link function of the expected response per unit change in the regressor, holding all other regressors constant.

In this Poisson (log link) case, because the link is the natural log, we can invert it: exponentiating the difference above gives \[e^{\log(E[Y|X=x+1])-\log(E[Y|X=x])}=e^{\beta_1},\] so that

\[e^{\beta_1}=\frac{E[Y|X=x+1]}{E[Y|X=x]}\]

In this particular case, where the link function is the log, we can exponentiate the coefficients and interpret them back on the original scale as relative quantities. Note that e to the coefficient is not what is printed in the coefficient table; we have to take the exponent ourselves. e to the coefficient is then the relative increase or decrease in the expected value of the response per unit increase in the regressor, holding all other regression variables constant.

Exercise 3

True or false, the generalized linear model assumes an exponential family for the outcome.

In a linear model, if we have Y, X1 and X2, we need a model that connects Y with X1 and X2. The linear model does this very directly, by equating them in a straightforward equation; and because this relationship will not hold exactly, it adds an error term, assuming the error is normal or at least making first and second moment assumptions about it. \[Y=\beta_0+\beta_1X_1+\beta_2X_2+\epsilon\]

A generalized linear model does not form such an equation.

In contrast, in a generalized linear model we start by assuming a distribution for Y: we assume that Y comes from an exponential family distribution. \[Y \sim \text{EF}\] The exponential family includes the normal (so linear models are a special case of generalized linear models), the Bernoulli and binomial distributions, the Poisson, and the gamma.

It is a very rich family of distributions, so there is a large collection of models we can build. At this point we have only assumed a distribution for Y, and we still need a way to relate Y to X1 and X2. That comes from the fact that the exponential family is always parametrized by a small number of parameters: the normal distribution by a mean and a variance, the binomial by a number of trials and a success probability, the Poisson by a mean. All these distributions depend on a parameter.

One of the most important parameters they depend on is the mean, so let's call the expected value of Y our mu: \[Y \sim \text{EF};\quad E[Y]=\mu \;\Rightarrow\; \text{link function} \;\Rightarrow\; X_1, X_2, \ldots\] Mu is one of the important parameters the exponential family depends on. So the way we connect Y to the Xs is through this exponential family: its parameter mu is connected to X1 and X2 by the link function.

The link function is g: we assume that \[g(\mu)=\beta_0+\beta_1X_1+\beta_2X_2.\] That might seem like a very circuitous way to connect the Ys and the Xs when in the linear model we do it with a single equality, but it is a very powerful way of doing things.

For example, in Poisson regression we assume g is the natural log function: \[\log(\mu)=\beta_0+\beta_1X_1+\beta_2X_2\]

Exercise 4

True or false, GLM estimates are obtained by maximizing the likelihood. (Discuss.): TRUE.

I’d like to go through I’ve been using Poisson regression as my canonical example in for the questions in this chapter.

Assume that the log of the expected value of Y_i (the “i” reminds us this is an individual subject), which we will call log of mu_i so I don't have to keep writing so much, is beta_0 plus beta_1 times a single regressor X_i. \[log(E[Y_i])=log(\mu_i)=\beta_0+\beta_1X_i\]

By implication this says that mu_i is equal to e to the beta_0 plus beta_1 X_i. \[\mu_i= e^{\beta_0+\beta_1X_i}\]

So, what this means is we assume that every observation Y_i is independently Poisson with mean mu_i: \[Y_i\stackrel{ind}{\sim}Poisson(\mu_i)\]

This means each likelihood contribution looks like mu_i to the Y_i times e to the negative mu_i, all over Y_i factorial. \[\frac{\mu_i^{Y_i}e^{-\mu_i}}{Y_i!}\] Since many of you have taken the inference class, you know that any factor of the likelihood that does not depend on the parameter can be dropped. Here the parameter of interest is mu_i, so we can drop the Y_i! in the denominator.

The likelihood is then the product over all the subjects; it depends on beta_0 and beta_1 given Y, and it is this function that we maximize with respect to beta_0 and beta_1 to obtain our estimates. \[\mathcal{L}(\beta_0,\beta_1|Y)=\prod_i\mu_i^{Y_i}e^{-\mu_i}\]
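A minimal simulated sketch of this point (hypothetical data and names, not from the exercises): maximizing the Poisson log-likelihood directly with optim gives essentially the same estimates as glm with family = poisson.

set.seed(2)
x <- rnorm(200)
y <- rpois(200, lambda = exp(0.5 + 0.8 * x))

negloglik <- function(beta) {
    mu <- exp(beta[1] + beta[2] * x)
    -sum(y * log(mu) - mu)            # the log(Y_i!) term is dropped: it is free of beta
}

optim(c(0, 0), negloglik)$par         # numerical maximum likelihood estimates
coef(glm(y ~ x, family = poisson))    # agree up to optimization error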

Exercise 5

True or false, some GLM distributions impose restrictions on the relationship between the mean and the variance. TRUE. For example, the Poisson forces the variance to equal the mean, and the binomial forces the variance to be np(1 - p), whereas the normal has a free variance parameter.
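A small simulated illustration (hypothetical, not from the exercises) of that restriction for the Poisson, where the variance equals the mean:

set.seed(3)
y <- rpois(1e5, lambda = 4)
c(mean = mean(y), var = var(y))   # both close to 4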

Binary GLMs: page 121

Exercise 1

Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Create a new outcome variable for whether or not greater than 119 drivers were killed that month. Fit a logistic regression GLM with this variable as the outcome and kms, PetrolPrice and law as predictors. Interpret your parameters.
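The fits below use variables mkm and ppn that are not in the raw Seatbelts data. Based on how the coefficients are interpreted later (kilometres in thousand-kilometre units, petrol price in standard-deviation units, both centered, and law as a factor), a hedged sketch of the presumed preprocessing is:

library(dplyr)

data(Seatbelts)
Seatbelts <- as.data.frame(Seatbelts) %>%
    mutate(mkm = (kms - mean(kms)) / 1000,                            # centered kms, in thousand-km units
           ppn = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice), # petrol price in standard-deviation units
           law = factor(law))                                         # factor, so summaries report law1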

# binary outcome: 1 if more than 119 drivers were killed that month
Seatbelts <- Seatbelts %>% 
    mutate(binkill = 1 * (DriversKilled > 119))

logRegSeatbelts <- glm(Seatbelts$binkill ~ Seatbelts$mkm + Seatbelts$ppn + Seatbelts$law, family = 'binomial')
summary(logRegSeatbelts)
## 
## Call:
## glm(formula = Seatbelts$binkill ~ Seatbelts$mkm + Seatbelts$ppn + 
##     Seatbelts$law, family = "binomial")
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4893  -1.1234  -0.7461   1.0878   1.6708  
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)  
## (Intercept)     0.024314   0.160775   0.151   0.8798  
## Seatbelts$mkm  -0.002938   0.059848  -0.049   0.9608  
## Seatbelts$ppn  -0.416408   0.169734  -2.453   0.0142 *
## Seatbelts$law1 -0.615522   0.577808  -1.065   0.2868  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 266.09  on 191  degrees of freedom
## Residual deviance: 253.62  on 188  degrees of freedom
## AIC: 261.62
## 
## Number of Fisher Scoring iterations: 4
round(exp(logRegSeatbelts$coeff),3)
##    (Intercept)  Seatbelts$mkm  Seatbelts$ppn Seatbelts$law1 
##          1.025          0.997          0.659          0.540
round(1-exp(logRegSeatbelts$coeff), 3)
##    (Intercept)  Seatbelts$mkm  Seatbelts$ppn Seatbelts$law1 
##         -0.025          0.003          0.341          0.460

If we exponentiate the Seatbelts coefficients we get 1.025 for the intercept and, for the coefficients, 0.997 for kms, 0.659 for PetrolPrice and 0.540 for law. The most important effect is that there was about a 46% decrease in the odds of more than 119 drivers being killed in a month after the law was enacted, compared with before, holding the other variables constant.

Visualization experiment

library(ggeffects)
library(patchwork)

fitprova <- glm(binkill~mkm+ppn+law, family = 'binomial', data = Seatbelts)

wrap_plots(plot(ggpredict(fitprova, se=TRUE,interactive=TRUE,digits=3)), ncol = 2, nrow = 2, widths = c(1,2))

On the y-axis of each plot is the probability of being 1 (the level you coded as the non-reference level). The x-axis is the predictor, and you can see how the probability of being 1 changes as that predictor varies, holding the other variables constant.

Exercise 2

Fit a binomial model with DriversKilled as the outcome and drivers as the total count, with kms, PetrolPrice and law as predictors, and interpret your results.

See https://stackoverflow.com/questions/9111628/logistic-regression-cbind-command-in-glm: when fitting a binomial or quasibinomial glm, on the left-hand side of the formula you supply either a probability of success, a two-column matrix whose columns give the numbers of successes and failures, or a factor where the first level denotes failure and the other levels success. See the details in ?glm.
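A hedged illustration of two of these equivalent specifications (s, n and x are hypothetical placeholder names, not from the data):

# glm(cbind(s, n - s) ~ x, family = binomial)        # matrix of successes and failures
# glm(I(s / n) ~ x, weights = n, family = binomial)  # proportion of successes, with trial counts as weights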

# cbind creates a two-column matrix: SUCCESSES vs FAILURES (not successes vs total)

fitlgm <- glm(cbind(DriversKilled, drivers-DriversKilled)~mkm + ppn + law, family = 'binomial', data = Seatbelts)
summary(fitlgm)
## 
## Call:
## glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~ 
##     mkm + ppn + law, family = "binomial", data = Seatbelts)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4371  -0.7270  -0.0235   0.7111   3.0313  
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.536637   0.007399 -342.829   <2e-16 ***
## mkm          0.003645   0.002733    1.334    0.182    
## ppn         -0.007829   0.007479   -1.047    0.295    
## law1         0.030785   0.026527    1.161    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.93  on 191  degrees of freedom
## Residual deviance: 229.93  on 188  degrees of freedom
## AIC: 1496
## 
## Number of Fisher Scoring iterations: 3
Visualization experiment
wrap_plots(plot(ggpredict(fitlgm, se=TRUE,interactive=TRUE,digits=3)), ncol = 2, nrow = 2, widths = c(1,2))

Exercise 3

Refer to Question 1. Use the anova function to compare models with just law, with law and PetrolPrice, and with all three predictors.

This question asks us to look at an instance of model selection. We want to model whether or not the number of drivers killed exceeded 119: first fit the model with just law, then the model with law and PetrolPrice, then the model with law, PetrolPrice and kms, and perform a model search to see whether the successive models add anything.

This would be a useful exercise if we were particularly interested in the law variable and, for whatever reason, thought PetrolPrice was the most important variable to follow up the investigation of the law variable with, and after that kilometres.

fit1 <- glm(binkill~law, family = 'binomial', data = Seatbelts)
fit2 <- update(fit1, binkill~law + PetrolPrice)
fit3 <- update(fit2, binkill~law + PetrolPrice + kms)
anova(fit1,fit2,fit3)
## Analysis of Deviance Table
## 
## Model 1: binkill ~ law
## Model 2: binkill ~ law + PetrolPrice
## Model 3: binkill ~ law + PetrolPrice + kms
##   Resid. Df Resid. Dev Df Deviance
## 1       190     260.40            
## 2       189     253.62  1   6.7760
## 3       188     253.62  1   0.0024

The residual deviance is roughly analogous to the residual sums of squares we looked at when we had normal data.

The chi-squared cutoff for a 5% hypothesis test with 1 degree of freedom is about 3.84, so adding PetrolPrice (deviance drop of 6.78) looks significant while adding kms (0.0024) does not. Let's look at the coefficient tables.

summary(fit1)$coef
##                Estimate Std. Error    z value   Pr(>|z|)
## (Intercept)  0.08288766  0.1539783  0.5383074 0.59036483
## law1        -1.12434153  0.4991985 -2.2522935 0.02430373
summary(fit2)$coef
##                Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)   3.5875552  1.3851773  2.589961 0.009598679
## law1         -0.6260282  0.5367204 -1.166396 0.243454555
## PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
summary(fit3)$coef
##                  Estimate   Std. Error     z value   Pr(>|z|)
## (Intercept)  3.612261e+00 1.473634e+00  2.45126028 0.01423570
## law1        -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
## PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
## kms         -2.938343e-06 5.984816e-05 -0.04909663 0.96084229

Let's look at the law variable, which is what we are interested in. In the first model it suggests a rather large DECREASE in the LOG ODDS comparing after with before enactment of the law. However, that decrease in the odds of more than 119 drivers killed in a month attenuates quite a bit once petrol price is included. And notice that petrol price is quite significant.

It is interesting to note that the law variable goes from significant to not significant; its p-value increases by roughly a factor of 10.

When we include kilometres we add a variable that is not significant at all and does not change the law coefficient much.

So if I were doing model selection and was primarily interested in the law coefficient, I would likely go with the second model.
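As a hedged follow-up, the same nested comparison can report the chi-squared p-values directly via anova's test argument (fit1, fit2 and fit3 as defined above):

anova(fit1, fit2, fit3, test = "Chisq")   # adds Pr(>Chi) for each deviance drop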

Count data: page 130

Exercise 1

Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM with DriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret your results.

fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts)
round(summary(fit)$coef[ ,1],4)
## (Intercept)         mkm         ppn        law1 
##      4.8198     -0.0100     -0.0554     -0.1149

In general it is better to exponentiate the coefficients. Without exponentiating, the law coefficient, for example, is the change in the log of the expected number of drivers killed comparing after with before the law was enacted.

The coefficients

We loop over the coefficients, exponentiating each one and reporting 1 - exp(coefficient) for the negative ones, so the decreases read as positive proportions.

a <- data.frame(Intercept = 0, mkm = 0, ppn = 0, law1 = 0)

for (i in seq_along(a)) {
    est <- summary(fit)$coef[i, 1]
    # report decreases as 1 - exp(estimate), increases as exp(estimate)
    a[i] <- if (est < 0) 1 - exp(est) else exp(est)
}

print(a)
##   Intercept        mkm        ppn      law1
## 1  123.9459 0.00993133 0.05385679 0.1085243
  • law: If we exponentiate it, we are saying there is about an 11 percent decrease in the expected number of drivers killed comparing after with before enacting the law; enacting the law decreased the expected number of drivers killed by about 11 percent.

  • mkm: This is kilometres in thousand-kilometre units. The coefficient is also negative, and 1 minus exp of it is about 0.01. What this coefficient is saying is that there is roughly a 1% decrease in the expected number of drivers killed for every thousand additional kilometres driven.

  • The intercept: It is the log of the expected number of drivers killed at the average petrol price and the average number of kilometres driven (remember those two variables were centered) and with the law not yet enacted (law = 0).

Interpreting the coefficients is quite easy:

  • If you do not exponentiate them, they are all interpreted on the scale of the log of the expected outcome.

  • If you do exponentiate them, they are on the relative scale per one-unit increase: for the law variable that is going from 0 to 1; for kilometres, which I converted to thousand-kilometre units, it is an additional 1,000 kilometres driven; and for petrol price, which I converted to standard-deviation units, it is a one-standard-deviation increase in petrol price.
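A hedged sketch of putting that relative-scale interpretation into practice: exponentiate the estimates together with Wald confidence limits (confint.default gives Wald intervals; fit is the Poisson model from Exercise 1).

round(exp(cbind(Estimate = coef(fit), confint.default(fit))), 3)   # relative effects with 95% Wald intervals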

Exercise 2

Refer to question 1. Fit a linear model with the log of drivers killed as the outcome. Interpret your results.

round(exp(coef(lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts))), 5)
## (Intercept)         mkm         ppn        law1 
##   123.18577     0.99190     0.94787     0.87807
fit2 <-lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts)

In exercise 1 the coefficients are estimates on the scale of the log of the expected count of driver deaths, with the intercept corresponding to the other variables set to their zero level.

summary(fit)$coef
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
## mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
## ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
## law1        -0.114877106 0.025557951  -4.494770 6.964526e-06
summary(fit2)$coef
##                 Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  4.813693546 0.014291860 336.813651 2.082663e-263
## mkm         -0.008133355 0.005281647  -1.539928  1.252593e-01
## ppn         -0.053538784 0.014689904  -3.644597  3.464468e-04
## law1        -0.130030803 0.047811530  -2.719654  7.147529e-03

In this case the model estimates the expected value of the log of the drivers killed, whereas the other one modeled the log of the expected value; the two are not interchangeable.

1-exp(-0.130030803)
## [1] 0.1219316

We are seeing about a 12 percent decrease in driver deaths after the enactment of the law. Recall from the lecture that the consequence of taking the log of the outcome is that the interpretation is with respect to geometric means, so this is a 12% decrease in the geometric mean number of driver deaths comparing after enactment of the law with before.
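A quick numeric reminder of the geometric-mean point (using DriversKilled from the data frame): exponentiating the mean of the logs gives the geometric mean.

dk <- Seatbelts$DriversKilled
exp(mean(log(dk + 1)))   # geometric mean of DriversKilled + 1
mean(dk + 1)             # arithmetic mean, larger, for comparison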

Exercise 3

Refer to question 1. Fit your Poisson log-linear model with drivers as a log offset (to consider the proportion of drivers killed among those killed or seriously injured).

Poisson model

fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts, offset = log(drivers))
fit0 <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts)
summary(fit)$coef
##                 Estimate  Std. Error     z value  Pr(>|z|)
## (Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
## mkm          0.003377675 0.002630717    1.283937 0.1991640
## ppn         -0.007255064 0.007199577   -1.007707 0.3135952
## law1         0.028484328 0.025512651    1.116479 0.2642173
summary(fit0)$coef
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
## mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
## ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
## law1        -0.114877106 0.025557951  -4.494770 6.964526e-06

The first GLM models the log of the expected value of drivers killed divided by the total number of drivers killed or seriously injured; that is, it models that proportion as a function of beta_0 plus beta_1 X and so on. \[log\left(\frac{E[DK]}{TotD}\right)=\beta_0+\beta_1X+\dots\]

In contrast, fit0 just models the log of the expected value of Y by itself. \[log(E[Y])=\beta_0+\beta_1X+\dots\]

The results can be very different, as in this case:

exp(0.028484328)
## [1] 1.028894
1-exp(-0.114877106)
## [1] 0.1085243

In the first case, considering the rate, we have about a 3% increase in the rate of DriversKilled among the total killed or seriously injured, associated with the law's enactment; in the second, we have about an 11% decrease in the number of drivers killed.

Applying a binomial model to the proportion

Another way we could have looked at this model is with a binomial model:

fitbis <- glm(cbind(DriversKilled, drivers - DriversKilled) ~ mkm + ppn +law, 
              family = binomial, 
              data = Seatbelts)
summary(fitbis)$coef
##                 Estimate  Std. Error     z value  Pr(>|z|)
## (Intercept) -2.536637162 0.007399129 -342.829174 0.0000000
## mkm          0.003644902 0.002732722    1.333799 0.1822698
## ppn         -0.007828905 0.007478975   -1.046789 0.2951971
## law1         0.030785128 0.026526920    1.160524 0.2458355

Compare with the fit of the Poisson rate model:

summary(fit)$coef
##                 Estimate  Std. Error     z value  Pr(>|z|)
## (Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
## mkm          0.003377675 0.002630717    1.283937 0.1991640
## ppn         -0.007255064 0.007199577   -1.007707 0.3135952
## law1         0.028484328 0.025512651    1.116479 0.2642173

So, this model is also looking at drivers killed as a proportion of drivers killed or seriously injured. fitbis should therefore give estimates pretty similar to the offset Poisson fit, and we see they are very similar. It is a slightly different model, though, because one is a binomial model and the other is a relative (rate) Poisson model.

However, we know the Poisson approximates the binomial model when the success probability is small, and the proportion of drivers killed relative to drivers killed or seriously injured is indeed small.

So these two models approximate each other pretty well. The important distinction is that once you include the log offset you are modeling the proportion, not just the raw number of drivers killed, and this is a pretty stark example because some of the coefficients actually reverse their sign compared with the unadjusted Poisson model.
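As a hedged numeric check on that approximation argument: the observed fraction of drivers killed among drivers killed or seriously injured is small throughout the series, and the baseline proportions implied by the two fits nearly coincide.

summary(Seatbelts$DriversKilled / Seatbelts$drivers)  # the killed / (killed or seriously injured) fraction is small
plogis(coef(fitbis)[1])   # baseline proportion from the binomial (logit) fit
exp(coef(fit)[1])         # baseline rate from the offset Poisson fit; nearly identical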