PAGE 52

1. Test whether the slope coefficient for the father.son data is different from zero (father as predictor, son as outcome).

ANSWER:

We have our hypotheses: H0: β1 = 0 and H1: β1 ≠ 0. To test this, we look at the summary of the model with the father's height as the predictor and the son's height as the outcome.

library(UsingR)
## Loading required package: MASS
## Loading required package: HistData
## Loading required package: Hmisc
## 
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:base':
## 
##     format.pval, units
data("father.son")
fit<-lm(sheight~fheight, father.son)
summary(fit)
## 
## Call:
## lm(formula = sheight ~ fheight, data = father.son)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.8772 -1.5144 -0.0079  1.6285  8.9685 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 33.88660    1.83235   18.49   <2e-16 ***
## fheight      0.51409    0.02705   19.01   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.437 on 1076 degrees of freedom
## Multiple R-squared:  0.2513, Adjusted R-squared:  0.2506 
## F-statistic: 361.2 on 1 and 1076 DF,  p-value: < 2.2e-16

From the table above, we see that the p-value for the slope (the fheight row) is very small, well below α = 0.05. Hence, we reject the null hypothesis and conclude that the slope coefficient for the father.son data is nonzero, that is, β1 ≠ 0. Moreover, we can say that the father's height is a statistically significant predictor of the son's height.
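
As a sanity check, the t statistic and its two-sided p-value can be reproduced by hand from the coefficient table (a minimal sketch; the degrees of freedom are n - 2):

coefs <- summary(fit)$coef
t_stat <- coefs["fheight", "Estimate"] / coefs["fheight", "Std. Error"]  # 19.01, as in the table
2 * pt(abs(t_stat), df = fit$df.residual, lower.tail = FALSE)            # two-sided p-value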

2. Refer to question 1. Form a confidence interval for the slope coefficient.

ANSWER:

We will use the function confint() to determine the confidence interval for the slope coefficient of our model. We have,

confint(fit)
##                  2.5 %     97.5 %
## (Intercept) 30.2912126 37.4819961
## fheight      0.4610188  0.5671673

The confidence interval for the slope coefficient is (0.4610188, 0.5671673), which contains the point estimate β̂1 = 0.51409.
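
The same interval can be computed by hand from the estimate, its standard error, and the appropriate t quantile; a minimal sketch:

est <- summary(fit)$coef["fheight", "Estimate"]
se <- summary(fit)$coef["fheight", "Std. Error"]
est + c(-1, 1) * qt(.975, df = fit$df.residual) * se   # reproduces confint's fheight row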

3. Refer to question 1. Form a confidence interval for the intercept (center the fathers’ heights first to get an intercept that is easier to interpret).

ANSWER:

Recall from our answer to question 1 that the intercept is the estimated son's height for a father of height 0 inches, which is not meaningful. To address this, we center the fathers' heights and refit to get an interpretable intercept. We have

cfheight<-(father.son$fheight - mean(father.son$fheight))
fit2<-lm(sheight~cfheight, father.son)
confint(fit2)
##                  2.5 %     97.5 %
## (Intercept) 68.5384554 68.8296839
## cfheight     0.4610188  0.5671673

Hence, we have a confidence interval of (68.5384554, 68.8296839) for the intercept. Here, the intercept is the estimated son's height at the average father's height, which is a much more meaningful quantity.

4. Refer to question 1. Form a mean value interval for the expected son’s height at the average father’s height.

ANSWER:

predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'confidence')
##        fit      lwr      upr
## 1 68.68407 68.53846 68.82968
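
This interval is for the expected son's height at the average father's height. As a check, it can be computed by hand: at the mean of the predictor, the standard error of the fitted value reduces to s/sqrt(n). A minimal sketch:

n <- nrow(father.son)
s <- summary(fit)$sigma                                               # residual standard deviation
yhat <- predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)))
yhat + c(-1, 1) * qt(.975, df = n - 2) * s * sqrt(1 / n)              # reproduces the interval above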

5. Refer to question 1. Form a prediction interval for the son’s height at the average father’s height.

ANSWER:

predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'prediction')
##        fit      lwr      upr
## 1 68.68407 63.90091 73.46723
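
This interval is much wider because it covers an individual son's height, not just the mean. By hand, the prediction standard error adds the residual variation; a sketch:

n <- nrow(father.son)
s <- summary(fit)$sigma
yhat <- predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)))
yhat + c(-1, 1) * qt(.975, df = n - 2) * s * sqrt(1 + 1 / n)          # reproduces the interval above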

6. Load the mtcars dataset. Fit a linear regression with miles per gallon as the outcome and horsepower as the predictor. Test whether or not the horsepower coefficient is statistically different from zero. Interpret your test.

ANSWER:

fit<-lm(mpg~hp, mtcars)

Hypotheses: H0: β1 = 0 and H1: β1 ≠ 0.

summary(fit)$coef
##                Estimate Std. Error   t value     Pr(>|t|)
## (Intercept) 30.09886054  1.6339210 18.421246 6.642736e-18
## hp          -0.06822828  0.0101193 -6.742389 1.787835e-07

From the table above, we see that the p-value for the slope (the hp row) is very small, well below α = 0.05. Hence, we reject the null hypothesis and conclude that the horsepower coefficient is nonzero, that is, β1 ≠ 0. Moreover, we can say that horsepower is a statistically significant predictor of miles per gallon.

7. Refer to question 6. Form a confidence interval for the slope coefficient.

ANSWER:

confint(fit)
##                   2.5 %     97.5 %
## (Intercept) 26.76194879 33.4357723
## hp          -0.08889465 -0.0475619

Hence, the confidence interval for the slope coefficient is (-0.08889465, -0.0475619).

8. Refer to question 6. Form a confidence interval for the intercept (center the HP variable first).

ANSWER:

chp<-(mtcars$hp - mean(mtcars$hp))
fit2<-lm(mpg~chp, mtcars)
confint(fit2)
##                   2.5 %     97.5 %
## (Intercept) 18.69599452 21.4852555
## chp         -0.08889465 -0.0475619

Thus, we have a confidence interval for the intercept of (18.69599452, 21.4852555).

9. Refer to question 6. Form a mean value interval for the expected MPG for the average HP.

ANSWER:

predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'confidence')
##        fit      lwr      upr
## 1 20.09062 18.69599 21.48526

10. Refer to question 6. Form a prediction interval for the expected MPG for the average HP.

ANSWER:

predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'prediction')
##        fit      lwr      upr
## 1 20.09062 12.07908 28.10217

11. Refer to question 6. Create a plot that has the fitted regression line plus curves at the expected value and prediction intervals.

ANSWER:

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:Hmisc':
## 
##     src, summarize
## The following object is masked from 'package:MASS':
## 
##     select
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
y <- mtcars$mpg; x <- mtcars$hp; n <- length(y)
fit <- lm(y~x)
p1 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'confidence'))

p2 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'prediction'))
p1 <- p1 %>% 
    mutate(interval = 'confidence') %>% 
    mutate(x = mtcars$hp)
p2 <- p2 %>% 
    mutate(interval = 'prediction') %>% 
     mutate(x = mtcars$hp)

dat <- rbind(p1,p2)
names(dat)[1] <- "y"
dat %>% 
    ggplot(aes(x = x, y = y)) +
    geom_ribbon(aes(ymin= lwr, ymax= upr, fill= interval), alpha= 0.4) +
    geom_line(colour="red", lwd=.5) +
    geom_point(data= data.frame(x=x,y=y), aes(x=x,y=y))

PAGE 58

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

ANSWER:

library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  2.157461e+02 1.466559e+01 14.711047 3.772201e-33
## kms         -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
## PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
round(summary(fit)$coef, 4)
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  215.7461    14.6656 14.7110   0.0000
## kms           -0.0017     0.0006 -2.8469   0.0049
## PetrolPrice -643.7895   148.2896 -4.3414   0.0000

Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and a petrol price of 0, which is not meaningful. For a better interpretation, we center the kilometers and petrol price variables and rescale them to more meaningful units.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

Thus, the intercept 122.802083 is the estimated number of drivers killed per month at the average petrol price and the average distance driven. The petrol price coefficient of -7.838674 means that we expect about 8 fewer deaths per one standard deviation increase in petrol price. The coefficient -1.749546, approximately -2, means that we expect about two fewer deaths for every additional thousand kilometers driven.

2. Predict the number of driver deaths at the average kms and petrol levels.

ANSWER:

predict(fit, newdata = data.frame(kms= mean(stblts$kms), 
                                  PetrolPrice= mean(stblts$PetrolPrice)))
##        1 
## 122.8021

Thus, we predict approximately 123 deaths per month at the mean of kms and PetrolPrice.

3. Take the residual for DriversKilled having regressed out kms and an intercept and the residual for PetrolPrice having regressed out kms and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.

ANSWER:

dk<-stblts$DriversKilled
kms<-stblts$kms
pp<-stblts$PetrolPrice
fitf<-lm(dk~kms+pp)
edk<-resid(lm(dk~kms))
epp<-resid(lm(pp~kms))
summary(lm(edk~epp-1))
## 
## Call:
## lm(formula = edk ~ epp - 1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -51.06 -17.77  -4.15  15.67  59.33 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## epp   -643.8      147.5  -4.364 2.09e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.92 on 191 degrees of freedom
## Multiple R-squared:  0.09068,    Adjusted R-squared:  0.08592 
## F-statistic: 19.05 on 1 and 191 DF,  p-value: 2.086e-05
summary(lm(edk~epp-1))$coef
##      Estimate Std. Error   t value     Pr(>|t|)
## epp -643.7895   147.5111 -4.364345 2.085664e-05

These are the estimate, the standard error, the t value, and the p-value associated with the petrol price residual as the predictor and the drivers killed residual as the outcome.

Now, we look at the summary of our full model fitf, where drivers killed is the outcome and kilometers driven and petrol price are the predictors, plus an intercept.

summary(fitf)$coef
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  2.157461e+02 1.466559e+01 14.711047 3.772201e-33
## kms         -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
## pp          -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05

Observe that the epp row is identical to the pp row of the full model.

4. Take the residual for DriversKilled having regressed out PetrolPrice and an intercept. Take the residual for kms having regressed out PetrolPrice and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.

ANSWER:

dk2 <- resid(lm(dk ~ pp))
kms <- resid(lm(kms ~ pp))
round(summary(lm(dk2 ~ kms -1))$coef, 4)
##     Estimate Std. Error t value Pr(>|t|)
## kms  -0.0017      6e-04 -2.8619   0.0047

Hence, we obtain the same kms coefficient as in question 1.

PAGE 72

1. Do exercise 1 of the previous chapter if you have not already. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

ANSWER:

round(summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef,4)
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 122.8021     1.6629 73.8503   0.0000
## mkm          -1.7495     0.6145 -2.8469   0.0049
## ppn          -7.8387     1.8055 -4.3414   0.0000

The intercept 122.802083 is the estimated number of drivers killed per month at the average petrol price and the average distance driven. The petrol price coefficient of -7.838674 means that we expect about 8 fewer deaths per one standard deviation increase in petrol price. The coefficient -1.749546, approximately -2, means that we expect about two fewer deaths for every additional thousand kilometers driven.

2. Repeat question 1 for the outcome being the log of the count of driver deaths. Interpret your coefficients.

ANSWER:

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
fit<-lm(I(log(DriversKilled)) ~ mkm + ppn, stblts)
summary(fit)$coef
##                Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  4.78966306 0.013426810 356.723817 2.737888e-269
## mkm         -0.01400794 0.004962149  -2.822959  5.267843e-03
## ppn         -0.06412578 0.014579039  -4.398492  1.818005e-05

Here, all the estimates are interpreted on the log scale. We can exponentiate the coefficients to get relative interpretations, relating them back to increases or decreases in the geometric mean of the outcome.

1-exp(-0.06412578)
## [1] 0.06211298
1-exp(-0.01400794)
## [1] 0.01391029

Hence, we estimate a 6% decrease in the geometric mean of the number of drivers killed for every one standard deviation increase in normalized petrol price, holding kilometers driven constant. Likewise, for every additional thousand kilometers driven, we estimate about a 1% decrease in the geometric mean of the number of drivers killed, holding petrol price constant.

3. Refer to question 1. Add the dummy variable law and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes. Change the reference level from No to Yes.

ANSWER:

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
summary(fit)
## 
## Call:
## lm(formula = DriversKilled ~ mkm + ppn + law, data = stblts)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -50.69 -17.29  -4.05  14.33  60.71 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 124.2263     1.8012  68.967  < 2e-16 ***
## mkm          -1.2233     0.6657  -1.838 0.067676 .  
## ppn          -6.9199     1.8514  -3.738 0.000246 ***
## law         -11.8892     6.0258  -1.973 0.049955 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.87 on 188 degrees of freedom
## Multiple R-squared:  0.201,  Adjusted R-squared:  0.1882 
## F-statistic: 15.76 on 3 and 188 DF,  p-value: 3.478e-09

The law variable is a 0/1 variable, so it changes the interpretation of the intercept slightly: the intercept 124.2263 is the expected number of drivers killed at the average petrol price and the average number of kilometers driven before the law took effect, i.e., at law level 0.

If we wanted the corresponding intercept after the law took effect, at the average petrol price and the average number of kilometers driven, we would add the law coefficient of about -11.89.

Hence, about 12 fewer deaths per month occur after the law took effect. As the law variable goes from 0 to 1, we expect 12 fewer deaths per month associated with the law, holding petrol price and kilometers driven constant.

fit<-lm(DriversKilled ~ mkm + ppn +I(factor(law)), stblts)
summary(fit)$coef
##                   Estimate Std. Error   t value      Pr(>|t|)
## (Intercept)     124.226311  1.8012324 68.967399 1.976641e-135
## mkm              -1.223318  0.6656567 -1.837761  6.767594e-02
## ppn              -6.919949  1.8513987 -3.737687  2.463128e-04
## I(factor(law))1 -11.889202  6.0257850 -1.973055  4.995497e-02

Notice that we get the same numbers: 124.226311 for the intercept, -1.223318 for mkm, -6.919949 for ppn, and -11.889202 for the law term. The row label I(factor(law))1 tells us which level is being contrasted with the reference level: the -11.889202 is associated with factor level 1 (law in effect), with level 0 as the reference. The question also asks for a lawFactor variable; a sketch follows.
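
A minimal sketch (assuming the stblts data frame above) of creating lawFactor with levels No and Yes and then changing the reference level from No to Yes:

stblts <- stblts %>%
    mutate(lawFactor = factor(law, levels = c(0, 1), labels = c("No", "Yes")))
summary(lm(DriversKilled ~ mkm + ppn + lawFactor, stblts))$coef   # "No" is the reference level
stblts <- stblts %>%
    mutate(lawFactor = relevel(lawFactor, ref = "Yes"))
summary(lm(DriversKilled ~ mkm + ppn + lawFactor, stblts))$coef   # now "Yes" is the reference

With No as the reference, the lawFactorYes coefficient matches the -11.889202 above; after releveling, the sign of the coefficient flips and the intercept becomes the expected count after the law took effect.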

4. Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables.

ANSWER:

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           ppf=as.factor((ppn<=-1.5)+(ppn<=0)+(ppn<=1.5)+(ppn< Inf)),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
table(stblts$ppf)
## 
##  1  2  3  4 
##  6 96 71 19
fit<-lm(DriversKilled ~ mkm + ppf +law, stblts)
summary(fit)
## 
## Call:
## lm(formula = DriversKilled ~ mkm + ppf + law, data = stblts)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -53.384 -17.211  -3.421  14.849  65.613 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 109.8405     9.5066  11.554   <2e-16 ***
## mkm          -1.2991     0.7668  -1.694   0.0919 .  
## ppf2         10.8271     9.9462   1.089   0.2778    
## ppf3         18.6904     9.9374   1.881   0.0616 .  
## ppf4         25.0074    10.9163   2.291   0.0231 *  
## law         -15.3445     6.0345  -2.543   0.0118 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23.24 on 186 degrees of freedom
## Multiple R-squared:  0.1833, Adjusted R-squared:  0.1614 
## F-statistic:  8.35 on 5 and 186 DF,  p-value: 3.835e-07

We see from the table that ppf has 4 levels, with data in each level of the factor. From the summary of the fitted model, we see that R chooses level 1 as the reference level for ppf. The ppf2 coefficient is the comparison of level 2 with level 1, ppf3 compares level 3 with level 1, and ppf4 compares level 4 with level 1.

5. Perform the plot requested at the end of the last chapter.

ANSWER:

Two-level model without interaction:

library(ggplot2)
library(datasets)
library(grid)
library(ggplotify)
swiss <- swiss %>% 
    mutate(CatholicBin = 1 * (Catholic > 50))

g <- ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin)))+ 
    geom_point(size = 5) +
    xlab("% in Agriculture") + 
    ylab("Fertility")

fit <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss) # Note the + operator

g +
    geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size=1, colour='green') +
    geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size=1, colour='red') +
    annotation_custom(grob = grobTree(textGrob("Catholic", 
                                               x=0.05,  
                                               y=.62, 
                                               hjust=0,
                                               gp=gpar(col="blue", 
                                                       fontsize=10, 
                                                       fontface="italic")))) +
    annotation_custom(grob = grobTree(textGrob("Non catholic", 
                                               x=0.05,  
                                               y=.5, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic"))))
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Two-level model with interaction:

fitfull <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
g +
    geom_abline(intercept = coef(fitfull)[1], slope = coef(fitfull)[2], size=1, colour='green') +
    geom_abline(intercept = coef(fitfull)[1] + coef(fitfull)[3], 
                slope = coef(fitfull)[2] + coef(fitfull)[4], size=1, colour='red') +
    
    annotation_custom(grob = grobTree(textGrob("Catholic", 
                                               x=0.05,  
                                               y=.58, 
                                               hjust=0,
                                               gp=gpar(col="blue", 
                                                       fontsize=10, 
                                                       fontface="italic")))) +
    annotation_custom(grob = grobTree(textGrob("Non catholic", 
                                               x=0.05,  
                                               y=.46, 
                                               hjust=0,
                                               gp=gpar(col="black", 
                                                       fontsize=10, 
                                                       fontface="italic"))))

PAGE 80

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.

ANSWER:

library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  2.157461e+02 1.466559e+01 14.711047 3.772201e-33
## kms         -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
## PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05

Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and a petrol price of 0, which is not meaningful. For a better interpretation, we center the kilometers and petrol price variables and rescale them to more meaningful units.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

Thus, the intercept 122.802083 is the estimated number of drivers killed per month at the average petrol price and the average distance driven. The petrol price coefficient of -7.838674 means that we expect about 8 fewer deaths per one standard deviation increase in petrol price. The coefficient -1.749546, approximately -2, means that we expect about two fewer deaths for every additional thousand kilometers driven.

2. Compare the kms coefficient with and without the inclusion of the PetrolPrice variable in the model.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

First, let us examine the correlation between the kms and petrol price.

cor(stblts$ppn, stblts$mkm)
## [1] 0.3839004

The correlation of about 0.38 tells us that, at some level, these variables measure similar things; it stands to reason that the price of gas would have an impact on the total distance driven. Now, let us fit both models.

fit0<-lm(DriversKilled ~ mkm, stblts)
fit1<-lm(DriversKilled ~ mkm + ppn, stblts)
summary(fit0)$coef
##               Estimate Std. Error  t value      Pr(>|t|)
## (Intercept) 122.802083  1.7391997 70.60839 2.665611e-138
## mkm          -2.773787  0.5935049 -4.67357  5.596266e-06
summary(fit1)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05

What we see is that the estimate shrinks toward zero, which we would expect since both effects are in the same direction and the predictors are correlated. The mkm coefficient (kilometers in thousands, centered) goes from an estimated -2.773787 to -1.749546 deaths per additional thousand kilometers driven per month after the inclusion of petrol price. Thus we can say that petrol price has a confounding effect on the relationship between kilometers driven and driver deaths. The sketch below makes the shrinkage exact.
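
The shrinkage follows the omitted-variable identity: the unadjusted mkm coefficient equals the adjusted one plus the ppn coefficient times the slope from regressing ppn on mkm. A sketch:

gamma <- coef(lm(ppn ~ mkm, stblts))["mkm"]      # slope of ppn on mkm
coef(fit1)["mkm"] + coef(fit1)["ppn"] * gamma    # reproduces coef(fit0)["mkm"], i.e. -2.773787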

3. Compare the PetrolPrice coefficient with and without the inclusion of the kms variable in the model.

ANSWER:

fit2<-lm(DriversKilled ~ ppn, stblts)
fit3<-lm(DriversKilled ~ ppn+mkm, stblts)
summary(fit2)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083   1.693656 72.507096 2.061333e-140
## ppn          -9.812019   1.698084 -5.778288  3.044208e-08
summary(fit3)$coef
##               Estimate Std. Error   t value      Pr(>|t|)
## (Intercept) 122.802083  1.6628507 73.850336 2.395106e-141
## ppn          -7.838674  1.8055491 -4.341435  2.304713e-05
## mkm          -1.749546  0.6145401 -2.846919  4.902428e-03

Similarly, in this case the estimates are negative, which makes sense because both effects are in the same direction and the predictors are correlated. The estimate changes from -9.812019 to -7.838674, which indicates the confounding effect of the kms variable on the regression of DriversKilled on PetrolPrice.

PAGE 91

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.

library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice+law, stblts)

2. Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R’s residual variance estimate.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
sum(resid(fit)^2)/(nrow(stblts)-4)
## [1] 522.8903
summary(fit)$sigma^2
## [1] 522.8903

This shows how R calculates the residual variance and what it represents: it is essentially the average squared residual, the variability left unexplained after having accounted for kilometers, petrol price, and law. We subtract 4 (one per estimated coefficient) in the denominator to achieve a statistical property, namely unbiasedness.

3. Refer to question 1. Perform an analysis of diagnostic measures including, dffits, dfbetas, influence and hat diagonals.

fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
plot(fit)

The first plot is residuals versus fitted values, which labels some points that appear extreme and fits a smoother through the residuals. The second plot is the Q-Q plot, used to assess normality of the errors: we check whether the points fall mostly on the reference line provided. Remember that there is variability in the estimated quantiles of the standardized residuals; even data simulated from a normal distribution will deviate from this line somewhat, so we look for gross departures rather than minor ones. In this case it does not seem too bad. The next plot is the scale-location plot, the square root of the absolute standardized residuals against the fitted values, with a smoother through it; again we look for patterns, and it does not look bad. The last plot, residuals versus leverage, does not look bad either.

Evaluating the influence:

Using ggplot, we can create a dataframe with the dffits values, establish a threshold, in this case |0.35|, and plot the values, labeling the ids of the points above the threshold (specialPoint).

datdffits <- data.frame(dffits = dffits(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dffits) >= .35))

ggplot(datdffits, aes(y=dffits, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdffits, specialPoint==TRUE),
            aes(y=dffits,x=id,label=id), colour='blue') +
    geom_hline(yintercept = c(-0.35,0.35), color= 'green') +
    geom_hline(yintercept = 0)

head(dfbeta(fit))
##   (Intercept)        mkm         ppn         law
## 1  -0.1022614 0.11881682 -0.09349927 -0.26195987
## 2  -0.1331656 0.22275746 -0.15807657 -0.56786007
## 3  -0.1279422 0.11452337 -0.07920934 -0.23464337
## 4  -0.2032275 0.13006726 -0.06413772 -0.23584650
## 5  -0.0524778 0.02353784 -0.01104844 -0.02756976
## 6  -0.1188004 0.03961875 -0.01108847 -0.02386593
plot(dfbeta(fit)[ , 2])

dfbeta for kms:

datdfbeta <- data.frame(dfbetamkm = dfbeta(fit)[ , 2]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetamkm) >= .2))

ggplot(datdfbeta, aes(y=dfbetamkm, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta, specialPoint==TRUE),
            aes(y=dfbetamkm,x=id,label=id), colour='blue') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.2,0.2), color= 'green') +
    ylab('dfbeta: mkm')

datdfbeta2 <- data.frame(dfbetappn = dfbeta(fit)[ , 3]) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(abs(dfbetappn) >= .4))

ggplot(datdfbeta2, aes(y=dfbetappn, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datdfbeta2, specialPoint==TRUE),
            aes(y=dfbetappn,x=id,label=id), colour='blue') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = c(-0.4,0.4), color= 'green') +
    ylab('dfbeta: PetrolPrice')

datcooks <- data.frame(cooksdist = cooks.distance(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(cooksdist >= .03))

ggplot(datcooks, aes(y=cooksdist, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(datcooks, specialPoint==TRUE),
            aes(y=cooksdist,x=id,label=id), colour='black') +
    geom_hline(yintercept = 0)+
    geom_hline(yintercept = 0.03, color= 'blue') +
    ylab('Cooks distance')

Evaluating leverage:

summary(round(hatvalues(fit), 3))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00600 0.01175 0.01700 0.02082 0.02600 0.06700
dathatv <- data.frame(hatv = hatvalues(fit)) %>% 
    mutate(id=row_number()) %>% 
    mutate(specialPoint= as.factor(hatv >= .06))

ggplot(dathatv, aes(y=hatv, x= id, color= specialPoint))+
    geom_point(size=3) +
    geom_text(data=subset(dathatv, specialPoint==TRUE),
            aes(y=hatv,x=id,label=id), colour='blue') +
    geom_hline(yintercept = 0.06, color= 'green') +
    ylab('hatvalues')

PAGE 100

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.

library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
##                  Estimate   Std. Error   t value     Pr(>|t|)
## (Intercept)  2.157461e+02 1.466559e+01 14.711047 3.772201e-33
## kms         -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
## PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05

Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and a petrol price of 0, which is not meaningful. For a better interpretation, we center the kilometers and petrol price variables and rescale them to more meaningful units.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))

2. Perform a model selection exercise to arrive at a final model.

fit0 <- lm(DriversKilled~law, stblts)
fit1 <- lm(DriversKilled~law + mkm, stblts)
fit2 <- lm(DriversKilled~law + ppn, stblts)
fit3 <- lm(DriversKilled~law + mkm + ppn, stblts)
anova(fit0,fit1,fit2,fit3)
## Analysis of Variance Table
## 
## Model 1: DriversKilled ~ law
## Model 2: DriversKilled ~ law + mkm
## Model 3: DriversKilled ~ law + ppn
## Model 4: DriversKilled ~ law + mkm + ppn
##   Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
## 1    190 109754                                
## 2    189 105608  1    4145.3 7.9276 0.005388 **
## 3    189 100069  0    5538.9                   
## 4    188  98303  1    1766.0 3.3774 0.067676 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit0,fit3)
## Analysis of Variance Table
## 
## Model 1: DriversKilled ~ law
## Model 2: DriversKilled ~ law + mkm + ppn
##   Res.Df    RSS Df Sum of Sq      F    Pr(>F)    
## 1    190 109754                                  
## 2    188  98303  2     11450 10.949 3.177e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We are particularly interested in comparing fit0 and fit3, because we want to know whether both additional parameters are necessary on top of the law variable. Our base model has just the law variable; the full model adds kilometers and petrol price, so the comparison is a 2 degree of freedom test, because two parameters were added. The test is highly significant, suggesting that we should include both parameters.

rbind(summary(fit0)$coef[2, ],
      summary(fit1)$coef[2, ],
      summary(fit2)$coef[2, ],
      summary(fit3)$coef[2, ])
##       Estimate Std. Error   t value     Pr(>|t|)
## [1,] -25.60895   5.341655 -4.794198 3.288375e-06
## [2,] -17.55372   6.028888 -2.911602 4.028394e-03
## [3,] -16.32618   5.555579 -2.938700 3.706585e-03
## [4,] -11.88920   6.025785 -1.973055 4.995497e-02

As we can see in the coefficient table, grabbing the law row from each of the fitted models shows what happens to the law estimate. The law-only model estimates about 25 fewer deaths per month after the law went into effect. If we adjust for the variability in kilometers, that drops to about 17 fewer deaths per month. If we adjust just for the variability in petrol price, the estimate drops to about 16 fewer deaths per month, and if we adjust for both, it goes down to about 12 fewer deaths per month.

PAGE 104

1. True or false, generalized linear models transform the observed outcome. (Discuss.)

ANSWER:

Suppose we have an observed outcome and we take the log of it, that is,

log(Y) = β0 + β1X + ϵ

we are simply fitting a linear model with a transformed outcome. On the other hand, suppose we log the expected value of Y, that is,

log(E[Y]) = β0 + β1X

Here, we have parameterized the model in terms of a collection of parameters and our other data, the X values. This is the major difference between transforming the outcome and transforming a parameter, which is what occurs in generalized linear models (GLMs). Therefore, the answer is FALSE. A small sketch contrasting the two follows.
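
A minimal sketch of the distinction in R, using simulated data (the names x, y, fit_outcome, and fit_mean are illustrative, not from the text):

set.seed(42)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))
fit_outcome <- lm(log(y + 1) ~ x)         # transforms the observed outcome (the +1 guards against zero counts)
fit_mean <- glm(y ~ x, family = poisson)  # transforms a parameter: log(E[Y]) = beta0 + beta1*x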

3. True or false, the generalized linear model assumes an exponential family for the outcome. (Discuss.)

ANSWER:

TRUE.

In a linear model, if we have Y, X1 and X2, we need a model that connects Y with X1 and X2, and the linear model does that very directly, by equating them with a straightforward equation. Since this relationship won't hold exactly, it adds an error term, assumed to be (probably) normal:

Y = β0 + β1X1 + β2X2 + ϵ

In this formulation we started with an equation we wish were true and then added the statistical assumptions through the error component; this is not the form of a generalized linear model. In a generalized linear model, we instead start by assuming a distribution for Y, namely that it comes from an exponential family. The exponential family includes the normal, so linear models are a special case of generalized linear models; it also includes the Bernoulli and binomial distributions and the Poisson distribution. At this point we have only assumed a distribution for Y; we still need a way to relate Y to X1 and X2, which is done through the fact that exponential family distributions are always parametrized by a few parameters, in the same way that the normal distribution is parametrized by a mean and a variance. Therefore, the generalized linear model assumes an exponential family for the outcome.

4. True or false, GLM estimates are obtained by maximizing the likelihood. (Discuss.)

ANSWER:

TRUE.

Suppose we have a Poisson regression,

log(E[Yi]) = log(μi) = β0 + β1Xi

This tells us that

μi = e^(β0 + β1Xi)

We assume the Yi are independent Poisson(μi), so each observation contributes a likelihood term

μi^Yi e^(−μi) / Yi!

The parameters of interest are β0 and β1, so we can drop the Yi! in the denominator, which does not depend on them. The likelihood is then the product over all subjects,

L(β0, β1 | Y) = Π μi^Yi e^(−μi),

and the GLM estimates are the values of β0 and β1 that maximize it. A numerical sketch follows.
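
As an illustrative check (a sketch, assuming the stblts data frame from the Seatbelts exercises), we can maximize this Poisson log-likelihood numerically and compare with glm:

x <- stblts$mkm
y <- stblts$DriversKilled
negloglik <- function(beta) {
    mu <- exp(beta[1] + beta[2] * x)
    -sum(y * log(mu) - mu)                 # negative log-likelihood, dropping the log(y!) constant
}
optim(c(log(mean(y)), 0), negloglik)$par   # should be close to the glm estimates
coef(glm(y ~ x, family = poisson))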

5. True or false, some GLM distributions impose restrictions on the relationship between the mean and the variance. (Discuss.)

ANSWER:

TRUE.

Suppose we have a Poisson model log(E[Yi]) = log(μi) = β0 + β1Xi, assuming the Yi are independent Poisson(μi). The implication of this family assumption is that E[Yi] = μi by definition, and also that Var(Yi) = E[Yi] = μi, simply because the mean and the variance of a Poisson distribution are always equal. This is a checkable assumption in some cases, for example if we have many repeat measurements of Yi at the same value of μi. If we had a bunch of independent realizations of a Poisson random variable, we could calculate Ȳ, which gives an estimate of μ, and compare it with the sample variance of the Yi, which should also be close to μ under the Poisson assumption. If the two are close, the Poisson assumption is supported; if they are not, the implied mean-variance relationship is potentially very different from what the data actually say. The usual solution is to relax the assumption: either replace the Poisson with a slight generalization such as the negative binomial distribution, or take a less model-based, more asymptotic approach such as so-called quasi-Poisson models. At any rate, to answer the question: for the Poisson distribution, the Bernoulli distribution, and many other instances of generalized linear models, there is an implied relationship between the mean and the variance, and often that relationship is checkable. Thus, whenever we do generalized linear model analyses, we should either automatically account for the possibility that the assumption is false, using quasi-likelihood models, or actually verify that it holds. A quick simulation sketch of the mean-variance check follows.
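
set.seed(1)
y <- rpois(1000, lambda = 5)
c(mean = mean(y), var = var(y))   # both should be near 5 under the Poisson assumption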

PAGE 115

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Create a new outcome variable for whether or not greater than 119 drivers were killed that month. Fit a logistic regression GLM with this variable as the outcome and kms, PetrolPrice and law as predictors. Interpret your parameters.

ANSWER:

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           dkb = (DriversKilled > 119),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
table(stblts$dkb)
## 
## FALSE  TRUE 
##    98    94

We see here 98 FALSE and 94 TRUE. We would like this to be a numeric variable rather than a boolean, so we multiply by 1, which changes nothing substantively but coerces the values to 0 and 1.

library(dplyr)
stblts <- stblts %>% 
    mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
           dkb = 1 *(DriversKilled > 119),
           mkm = kms / 1000) %>% 
    mutate(mkm = mkm - mean(mkm))
table(stblts$dkb)
## 
##  0  1 
## 98 94
fit<-glm(dkb~ppn+mkm+law, family = binomial, stblts)
summary(fit)$coef
##                 Estimate Std. Error     z value   Pr(>|z|)
## (Intercept)  0.024313512 0.16077499  0.15122695 0.87979669
## ppn         -0.416407701 0.16973435 -2.45329074 0.01415559
## mkm         -0.002938343 0.05984816 -0.04909663 0.96084229
## law         -0.615522450 0.57780755 -1.06527242 0.28675267
round(summary(fit)$coef, 4)
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)   0.0243     0.1608  0.1512   0.8798
## ppn          -0.4164     0.1697 -2.4533   0.0142
## mkm          -0.0029     0.0598 -0.0491   0.9608
## law          -0.6155     0.5778 -1.0653   0.2868

The log odds of having greater than 119 drivers killed in a month is 0.6155 lower after the law took effect. The mkm estimate of -0.0029 means that the log odds of greater than 119 drivers killed that month is 0.0029 lower for every thousand kilometer increase in driving distance. For a better interpretation, we exponentiate:

exp( -0.615522450)
## [1] 0.5403585
1-exp( -0.615522450)
## [1] 0.4596415

This means that the odds ratio comparing after the law was enacted to before is about 0.54; that is, there was a 46% decrease in the odds of greater than 119 drivers being killed in a month after the law was enacted compared to before, holding the other variables constant.

exp(-0.002938343)
## [1] 0.997066
1-exp(-0.002938343)
## [1] 0.00293403

This means that there is approximately a 0.3% decrease in the odds of greater than 119 drivers being killed in a month for every additional thousand kilometers driven that month.

2. Fit a binomial model with DriversKilled as the outcome and drivers as the total count with kms , PetrolPrice and law as predictors, interpret your results.

ANSWER:

fit<-glm(cbind(DriversKilled, drivers-DriversKilled)~ppn+mkm+law, family = binomial, stblts)
summary(fit)
## 
## Call:
## glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~ 
##     ppn + mkm + law, family = binomial, data = stblts)
## 
## Coefficients:
##              Estimate Std. Error  z value Pr(>|z|)    
## (Intercept) -2.536637   0.007399 -342.829   <2e-16 ***
## ppn         -0.007829   0.007479   -1.047    0.295    
## mkm          0.003645   0.002733    1.334    0.182    
## law          0.030785   0.026527    1.161    0.246    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 234.93  on 191  degrees of freedom
## Residual deviance: 229.93  on 188  degrees of freedom
## AIC: 1496
## 
## Number of Fisher Scoring iterations: 3

Consider drivers killed as a proportion of those killed or seriously injured. Observe that the kilometers coefficient is now positive rather than negative, and so is the coefficient for the enactment of the seatbelt law: the law is associated with an estimated 3% increase in the odds of being killed, given that a person was killed or seriously injured. This is a different quantity than before: previously we modeled the number killed, whereas now we model the proportion killed among all those killed or seriously injured. The exponentiated estimates below make the odds-scale interpretation explicit.
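
exp(0.030785)   # about 1.031: roughly a 3% increase in the odds after the law
exp(0.003645)   # about 1.004 per additional thousand kilometers, essentially no change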

3. Refer to Question 1. Use the anova function to compare models with just law, law and PetrolPrice and all three predictors.

ANSWER:

fit1<-glm(dkb~law, family = binomial, stblts)
fit2<-update(fit1, dkb~law + PetrolPrice)
fit3<-update(fit2,dkb~law + PetrolPrice+kms)
anova(fit1,fit2,fit3)
## Analysis of Deviance Table
## 
## Model 1: dkb ~ law
## Model 2: dkb ~ law + PetrolPrice
## Model 3: dkb ~ law + PetrolPrice + kms
##   Resid. Df Resid. Dev Df Deviance
## 1       190     260.40            
## 2       189     253.62  1   6.7760
## 3       188     253.62  1   0.0024

We compare the changes in residual deviance to chi-squared quantiles. Adding PetrolPrice drops the deviance from about 260 to 253, a change of 6.8 on 1 degree of freedom; the chi-squared cutoff for a 5% test is about 4, so adding petrol price is significant. Adding kms changes the deviance by only 0.0024, which is not significant.
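
The exact 5% cutoff referred to above can be computed directly:

qchisq(.95, df = 1)   # about 3.84

Looking at the coefficient tables for each model, we have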

summary(fit1)$coef
##                Estimate Std. Error    z value   Pr(>|z|)
## (Intercept)  0.08288766  0.1539783  0.5383074 0.59036483
## law         -1.12434153  0.4991985 -2.2522935 0.02430373
summary(fit2)$coef
##                Estimate Std. Error   z value    Pr(>|z|)
## (Intercept)   3.5875552  1.3851773  2.589961 0.009598679
## law          -0.6260282  0.5367204 -1.166396 0.243454555
## PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
summary(fit3)$coef
##                  Estimate   Std. Error     z value   Pr(>|z|)
## (Intercept)  3.612261e+00 1.473634e+00  2.45126028 0.01423570
## law         -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
## PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
## kms         -2.938343e-06 5.984816e-05 -0.04909663 0.96084229

Observe that in the model with law alone, the law variable suggests a rather large decrease in the log odds of greater than 119 drivers killed in a month after the enactment of the law. That decrease is reduced quite a bit by the inclusion of petrol price, which is itself quite significant; the law coefficient goes from significant to not significant, its p-value increasing by roughly a factor of ten. When we include kilometers, we add a variable that is not significant at all and does not change the law coefficient much. Hence, if we were doing model selection and were primarily interested in the law coefficient, we would likely go with the second model.

PAGE 124

1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM with UKDriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret your results.

fit<-glm(DriversKilled~ppn+mkm+law, family = poisson, stblts)
summary(fit)$coef
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
## ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
## mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
## law         -0.114877106 0.025557951  -4.494770 6.964526e-06
round(summary(fit)$coef, 4)
##             Estimate Std. Error  z value Pr(>|z|)
## (Intercept)   4.8198     0.0071 676.2429    0e+00
## ppn          -0.0554     0.0072  -7.6431    0e+00
## mkm          -0.0100     0.0026  -3.8183    1e-04
## law          -0.1149     0.0256  -4.4948    0e+00
exp(-0.114877106 )
## [1] 0.8914757
1-exp(-0.114877106 )
## [1] 0.1085243

Therefore, there is about an 11% decrease in the expected number of drivers killed after the law was enacted compared to before; that is, enacting the law decreased the expected number of drivers killed by about 11%.

exp(-0.009980975 )
## [1] 0.9900687
1-exp(-0.009980975 )
## [1] 0.00993133

This means that there is about a 1% decrease in the expected number of drivers killed for every additional thousand kilometers driven that month. The intercept is the expected number of drivers killed on the log scale; exponentiating it, we have

exp(4.819845137)
## [1] 123.9459

We get approximately 124, the expected number of drivers killed per month at the average petrol price and the average kilometers driven.

2. Refer to question 1. Fit a linear model with the log of drivers killed as the outcome. Interpret your results.

ANSWER:

fit<-glm(DriversKilled~ppn+mkm+law, family = poisson, stblts)
fit2<-lm(I(log(DriversKilled))~ppn+mkm+law, stblts)
summary(fit)
## 
## Call:
## glm(formula = DriversKilled ~ ppn + mkm + law, family = poisson, 
##     data = stblts)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  4.819845   0.007127 676.243  < 2e-16 ***
## ppn         -0.055361   0.007243  -7.643 2.12e-14 ***
## mkm         -0.009981   0.002614  -3.818 0.000134 ***
## law         -0.114877   0.025558  -4.495 6.96e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 984.50  on 191  degrees of freedom
## Residual deviance: 778.32  on 188  degrees of freedom
## AIC: 2059.1
## 
## Number of Fisher Scoring iterations: 4
summary(fit)$coef
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
## ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
## mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
## law         -0.114877106 0.025557951  -4.494770 6.964526e-06
summary(fit2)$coef
##                 Estimate  Std. Error    t value      Pr(>|t|)
## (Intercept)  4.805409776 0.014411832 333.435035 1.381656e-262
## ppn         -0.053968063 0.014813218  -3.643237  3.481611e-04
## mkm         -0.008189793 0.005325983  -1.537705  1.258020e-01
## law         -0.131450856 0.048212881  -2.726468  7.007200e-03

In this case the fitted model estimates the expected value of the log of the drivers killed, whereas the Poisson GLM modeled the log of the expected value; the two are not interchangeable.

1-exp(-0.131450856)
## [1] 0.1231776

Thus we have a 12% decrease in driver deaths after the enactment of the law. Note that the consequence of logging the outcome is that the interpretation is with respect to geometric means: this is a 12% decrease in the geometric mean number of driver deaths, going from before to after the law's enactment. (A quick sketch of the geometric mean follows.)
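
For reference, the geometric mean is the exponentiated mean of the logs:

exp(mean(log(stblts$DriversKilled)))   # geometric mean of monthly driver deaths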

3. Refer to question 1. Fit your Poisson log-linear model with drivers as a log offset (to consider the proportion of drivers killed of those killed or seriously injured.)

ANSWER:

fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, stblts, offset = log(drivers))
fit0 <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, stblts)
summary(fit)$coef
##                 Estimate  Std. Error     z value  Pr(>|z|)
## (Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
## mkm          0.003377675 0.002630717    1.283937 0.1991640
## ppn         -0.007255064 0.007199577   -1.007707 0.3135952
## law          0.028484328 0.025512651    1.116479 0.2642173
summary(fit0)$coef
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept)  4.819845137 0.007127388 676.242884 0.000000e+00
## mkm         -0.009980975 0.002614002  -3.818274 1.343887e-04
## ppn         -0.055361338 0.007243262  -7.643150 2.119715e-14
## law         -0.114877106 0.025557951  -4.494770 6.964526e-06

With the log(drivers) offset, the model describes drivers killed as a proportion of those killed or seriously injured. As with the binomial model earlier, the law and kilometers coefficients flip sign and are no longer significant, in contrast to the model without the offset, fit0.

4. Refer to question 1. Use the anova function to compare models with just law, law and PetrolPrice, and all three predictors.

ANSWER:

fit1 <- glm(DriversKilled ~ law, family = poisson, stblts)
fit2 <- update(fit1,DriversKilled ~ law+PetrolPrice)
fit3<- update(fit2,DriversKilled ~ law+PetrolPrice+kms)
anova(fit1,fit2,fit3)
## Analysis of Deviance Table
## 
## Model 1: DriversKilled ~ law
## Model 2: DriversKilled ~ law + PetrolPrice
## Model 3: DriversKilled ~ law + PetrolPrice + kms
##   Resid. Df Resid. Dev Df Deviance
## 1       190     870.06            
## 2       189     792.88  1   77.178
## 3       188     778.32  1   14.561

Observe that all of the regression variables are apparently very necessary. Going from model 1 to model 2, the difference in residual deviances is about 77 on 1 degree of freedom, which we compare to a chi-squared distribution. A chi-squared with 1 degree of freedom is the square of a standard normal; two standard deviations is already far out in the normal tail, so its square, 4, is far out in the chi-squared tail, and 77 is much further out still, so this is a highly significant result. Similarly, the deviance change of about 14.6 for adding kms is much larger than 4. Now let's look at the summary coefficient tables for each model.

summary(fit1)$coef
##               Estimate  Std. Error   z value     Pr(>|z|)
## (Intercept)  4.8352482 0.006856395 705.21727 0.000000e+00
## law         -0.2274727 0.021923993 -10.37552 3.204779e-25
summary(fit2)$coef
##               Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  5.3499077 0.05886323 90.887084 0.000000e+00
## law         -0.1516301 0.02364153 -6.413720 1.420110e-10
## PetrolPrice -5.0697421 0.57792672 -8.772292 1.750654e-18
summary(fit3)$coef
##                  Estimate   Std. Error   z value     Pr(>|z|)
## (Intercept)  5.440656e+00 6.360114e-02 85.543371 0.000000e+00
## law         -1.148771e-01 2.555795e-02 -4.494770 6.964526e-06
## PetrolPrice -4.546821e+00 5.948884e-01 -7.643150 2.119715e-14
## kms         -9.980975e-06 2.614002e-06 -3.818274 1.343887e-04

The law coefficient -0.2274727 is interpreted on the log of the expected value scale. It is highly significant; however, it shrinks to -0.1516301 after the inclusion of petrol price, and after including both petrol price and kilometers it drops further to -0.1148771. The p-value rises from 3.204779e-25 to 1.420110e-10 to 6.964526e-06, which is still very significant. Nonetheless, the other variables also appear quite significant, so it does seem that we would want to include both of them in the final model.