1. Test whether the slope coefficient for the
father.son data is different from zero (father as predictor, son as
outcome).
ANSWER:
We have our hypotheses: \(H_0:\hat\beta_1=0\) and \(H_1:\hat\beta_1\neq0\). To test this, we
get the summary of our model where the father is the predictor and the
son is the outcome.
library(UsingR)
data("father.son")
fit<-lm(sheight~fheight, father.son)
summary(fit)
Call:
lm(formula = sheight ~ fheight, data = father.son)
Residuals:
Min 1Q Median 3Q Max
-8.8772 -1.5144 -0.0079 1.6285 8.9685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.88660 1.83235 18.49 <2e-16 ***
fheight 0.51409 0.02705 19.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
From the table above, we see that the p-value of the slope (fheight row) is very small. Using \(\alpha=0.05\), this p-value is not significant. Hence, we reject our null hypothesis. Therefore, the slope coefficient for the father.son data is nonzero. That is \(\hat\beta_1\neq0\). Moreover, we can say that the father’s height is a statistically significant predictor of the son’s height.
2. Refer to question 1. Form a confidence interval
for the slope coefficient.
ANSWER:
We will use the function confint() to determine the confidence interval
for the slope coefficient of our model. We have,
confint(fit)
2.5 % 97.5 %
(Intercept) 30.2912126 37.4819961
fheight 0.4610188 0.5671673
The confidence interval for the slope coefficient is 0.4610188,0.5671673 where \(\beta_1=0.51409\) lies.
3. Refer to question 1. Form a confidence interval
for the intercept (center the fathers’ heights first to get an intercept
that is easier to interpret).
ANSWER:
Recall in our answer in question number 1, the intercept is the
estimated son’s height for father of 0 inches, hence is not relevant for
this problem. To adress this, we need to center the father’s height and
get the intercept. We have
cfheight<-(father.son$fheight - mean(father.son$fheight))
fit2<-lm(sheight~cfheight, father.son)
confint(fit2)
2.5 % 97.5 %
(Intercept) 68.5384554 68.8296839
cfheight 0.4610188 0.5671673
Hence, we have a confidence interval of 68.5384554,68.8296839 for the intercept. Here, the intercept is the estimated son’s height for the average father’s height. With this, we have a much more meaningful prediction.
4. Refer to question 1. Form a mean value interval
for the expected son’s height at the average father’s height.
ANSWER:
predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'confidence')
fit lwr upr
1 68.68407 68.53846 68.82968
5. Refer to question 1. Form a prediction interval
for the son’s height at the average father’s height.
ANSWER:
predict(fit, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'prediction')
fit lwr upr
1 68.68407 63.90091 73.46723
6. Load the mtcars dataset. Fit a linear regression
with miles per gallon as the outcome and horsepower as the predictor.
Test whether or not the horsepower power coefficient is statistically
different from zero. Interpret your test.
ANSWER:
fit<-lm(mpg~hp, mtcars)
Hypotheses: \(H_0:\hat\beta_1=0\) and \(H_1:\hat\beta_1\neq0\).
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
From the table above, we see that the p-value of the slope (hp row) is very small. Using \(\alpha=0.05\), this p-value is not significant. Hence, we reject our null hypothesis. Therefore, the slope coefficient horse power is nonzero. That is \(\hat\beta_1\neq0\). Moreover, we can say that the horse power is a statistically significant predictor of miles per gallon.
7. Refer to question 6. Form a confidence interval
for the slope coefficient.
ANSWER:
confint(fit)
2.5 % 97.5 %
(Intercept) 26.76194879 33.4357723
hp -0.08889465 -0.0475619
Hence, we have a confidence interval for the slope coefficient -0.08889465, -0.0475619.
8. Refer to quesiton 6. Form a confidence interval
for the intercept (center the HP variable first).
ANSWER:
chp<-(mtcars$hp - mean(mtcars$hp))
fit2<-lm(mpg~chp, mtcars)
confint(fit2)
2.5 % 97.5 %
(Intercept) 18.69599452 21.4852555
chp -0.08889465 -0.0475619
Thus, we have a confidence interval for the intercept 18.69599452,21.4852555.
9. Refer to question 6. Form a mean value interval
for the expected MPG for the average HP.
ANSWER:
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'confidence')
fit lwr upr
1 20.09062 18.69599 21.48526
10. Refer to question 6. Form a prediction interval
for the expected MPG for the average HP.
ANSWER:
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'prediction')
fit lwr upr
1 20.09062 12.07908 28.10217
11. Refer to question 6. Create a plot that has the
fitted regression line plus curves at the expected value and prediction
intervals.
ANSWER:
library(dplyr)
library(ggplot2)
y <- mtcars$mpg; x <- mtcars$hp; n <- length(y)
fit <- lm(y~x)
p1 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'confidence'))
p2 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'prediction'))
p1 <- p1 %>%
mutate(interval = 'confidence') %>%
mutate(x = mtcars$hp)
p2 <- p2 %>%
mutate(interval = 'prediction') %>%
mutate(x = mtcars$hp)
dat <- rbind(p1,p2)
names(dat)[1] <- "y"
dat %>%
ggplot(aes(x = x, y = y)) +
geom_ribbon(aes(ymin= lwr, ymax= upr, fill= interval), alpha= 0.4) +
geom_line(colour="red", lwd=.5) +
geom_point(data= data.frame(x=x,y=y), aes(x=x,y=y))
1. Load the dataset Seatbelts as part of the
datasets package via data(Seatbelts). Use as.data.frame to convert the
object to a dataframe. Fit a linear model of driver deaths with kms and
PetrolPrice as predictors. Interpret your results.
ANSWER:
library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
round(summary(fit)$coef, 4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 215.7461 14.6656 14.7110 0.0000
kms -0.0017 0.0006 -2.8469 0.0049
PetrolPrice -643.7895 148.2896 -4.3414 0.0000
Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and for petrol price of 0 which is not meaningful. To better interpret this, we center the kilometer and petrol price and rescale them into more meaningful scale.
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
Thus, the intercept 122.802083 is the estimated number of drivers killed for the average petrol price and average distance driven. The petrol price coefficient of -7.838674 means that we expect 7 fewer deaths per one standard deviation change in petrol price. The coefficient -1.749546 which is approximately equal to -2 means that we expect two fewer deaths for every additional thousand kilometers driven.
2. Predict the number of driver deaths at the
average kms and petrol levels.
ANSWER:
predict(fit, newdata = data.frame(kms= mean(stblts$kms),
PetrolPrice= mean(stblts$PetrolPrice)))
1
122.8021
Thus, approximately 123 number of deaths at the mean of kms and PetrolPrice.
3. Take the residual for DriversKilled having
regressed out kms and an intercept and the residual for PetrolPrice
having regressed out kms and an intercept. Fit a regression through the
origin of the two residuals and show that it is the same as your
coefficient obtained in question 1.
ANSWER:
dk<-stblts$DriversKilled
kms<-stblts$kms
pp<-stblts$PetrolPrice
fitf<-lm(dk~kms+pp)
edk<-resid(lm(dk~kms))
epp<-resid(lm(pp~kms))
summary(lm(edk~epp-1))
Call:
lm(formula = edk ~ epp - 1)
Residuals:
Min 1Q Median 3Q Max
-51.06 -17.77 -4.15 15.67 59.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
epp -643.8 147.5 -4.364 2.09e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.92 on 191 degrees of freedom
Multiple R-squared: 0.09068, Adjusted R-squared: 0.08592
F-statistic: 19.05 on 1 and 191 DF, p-value: 2.086e-05
summary(lm(edk~epp-1))$coef
Estimate Std. Error t value Pr(>|t|)
epp -643.7895 147.5111 -4.364345 2.085664e-05
This is the estimate of the standard error, the T value and the p
value associated with the petrol price residual as a predictor and the
residual drivers killed as the outcome.
Now, we look at the summary of our full model fit fitf where we had
drivers killed as the outcome, kilometers driven as the predictor and
petrol price as a predictor plus an intercept.
summary(fitf)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
pp -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Observe that the epp row is identical to pp row.
4. Take the residual for DriversKilled having
regressed out PetrolPrice and an intercept. Take the residual for kms
having regressed out PetrolPrice and an intercept. Fit a regression
through the origin of the two residuals and show that it is the same as
your coefficient obtained in question 1.
ANSWER:
dk2 <- resid(lm(dk ~ pp))
kms <- resid(lm(kms ~ pp))
round(summary(lm(dk2 ~ kms -1))$coef, 4)
Estimate Std. Error t value Pr(>|t|)
kms -0.0017 6e-04 -2.8619 0.0047
Hence, we got the same coefficient as we have obtained in question number 1.
1. Do exercise 1 of the previous chapter if you have
not already. Load the dataset Seatbelts as part of the datasets package
via data(Seatbelts). Use as.data.frame to convert the object to a
dataframe. Fit a linear model of driver deaths with kms and PetrolPrice
as predictors. Interpret your results.
ANSWER:
round(summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.8021 1.6629 73.8503 0.0000
mkm -1.7495 0.6145 -2.8469 0.0049
ppn -7.8387 1.8055 -4.3414 0.0000
The intercept 122.802083 is the estimated number of drivers killed for the average petrol price and average distance driven. The petrol price coefficient of -7.838674 means that we expect 7 fewer deaths per one standard deviation change in petrol price. The coefficient -1.749546 which is approximately equal to -2 means that we expect two fewer deaths for every additional thousand kilometers driven.
2. Repeat question 1 for the outcome being the log
of the count of driver deaths. Interpret your coefficients.
ANSWER:
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
fit<-lm(I(log(DriversKilled)) ~ mkm + ppn, stblts)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78966306 0.013426810 356.723817 2.737888e-269
mkm -0.01400794 0.004962149 -2.822959 5.267843e-03
ppn -0.06412578 0.014579039 -4.398492 1.818005e-05
Here, all the estimates are interpreted on the log scale. We can exponentiate the log scale coefficients and get the relative interpretations and relating back to the increases or decreases in the geometric mean of the outcome.
1-exp(-0.06412578)
[1] 0.06211298
1-exp(-0.01400794)
[1] 0.01391029
Hence, we are estimating a 6% decrease in the geometric mean of the drivers killed for every one standard deviation increase in normalized petrol price holding the kilometers constant. On the other hand, for every additional thousand of driver, the thousands of miles thrived driven, we are estimating an expected 1% decrease in the normalized drivers killed holding petrol price constant.
3 Refer to question 1. Add the dummy variable law
and interpret the results. Repeat this question with a factor variable
that you create called lawFactor that takes the levels No and Yes.
Change the reference level from No to Yes
ANSWER:
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
summary(fit)
Call:
lm(formula = DriversKilled ~ mkm + ppn + law, data = stblts)
Residuals:
Min 1Q Median 3Q Max
-50.69 -17.29 -4.05 14.33 60.71
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.2263 1.8012 68.967 < 2e-16 ***
mkm -1.2233 0.6657 -1.838 0.067676 .
ppn -6.9199 1.8514 -3.738 0.000246 ***
law -11.8892 6.0258 -1.973 0.049955 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.87 on 188 degrees of freedom
Multiple R-squared: 0.201, Adjusted R-squared: 0.1882
F-statistic: 15.76 on 3 and 188 DF, p-value: 3.478e-09
The law variable is a 0 1 variable so this change the intercept a
little bit in the sense that our intercept 124.2263 is the expected
number of drivers killed for average petrol price, the average number of
kilometers driven and before the law was taken in effect so at the law
at level 0.
If we wanted to know what the intercept was after the law had taken into
effect at the average petrol price and at the average number of
kilometers driven we would have to add negative 11.
Hence, we say that about 12 fewer deaths occur when after the law had
taken effect. When the law varibale went from 0 to 1, we expect 12 fewer
deaths aassociated per month associated with the law holding petrol
price and kilometers driven constant.
fit<-lm(DriversKilled ~ mkm + ppn +I(factor(law)), stblts)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.226311 1.8012324 68.967399 1.976641e-135
mkm -1.223318 0.6656567 -1.837761 6.767594e-02
ppn -6.919949 1.8513987 -3.737687 2.463128e-04
I(factor(law))1 -11.889202 6.0257850 -1.973055 4.995497e-02
Notice that we get the same numbers -124.226311 for the intercept, -1.223318 for the 1st column, -6.919949 for the next and -11.889202 for the other tells us which factor that it is treating as the reference level or not. Hence, the -11.889202 is associated with the factor level 1.
4 Discretize the PetrolPrice variable into four
factor levels. Fit the linear model with this factor to see how R treats
multiple level factor variables.
ANSWER:
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
ppf=as.factor((ppn<=-1.5)+(ppn<=0)+(ppn<=1.5)+(ppn< Inf)),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
table(stblts$ppf)
1 2 3 4
6 96 71 19
fit<-lm(DriversKilled ~ mkm + ppf +law, stblts)
summary(fit)
Call:
lm(formula = DriversKilled ~ mkm + ppf + law, data = stblts)
Residuals:
Min 1Q Median 3Q Max
-53.384 -17.211 -3.421 14.849 65.613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.8405 9.5066 11.554 <2e-16 ***
mkm -1.2991 0.7668 -1.694 0.0919 .
ppf2 10.8271 9.9462 1.089 0.2778
ppf3 18.6904 9.9374 1.881 0.0616 .
ppf4 25.0074 10.9163 2.291 0.0231 *
law -15.3445 6.0345 -2.543 0.0118 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.24 on 186 degrees of freedom
Multiple R-squared: 0.1833, Adjusted R-squared: 0.1614
F-statistic: 8.35 on 5 and 186 DF, p-value: 3.835e-07
We see from the 1st tabele that PPF has 4 factors and we have the dta
in each one of the levels of the factor. Getting the summary of the fit
model, we see that it elects and chooses level one to be the reference
level of this factor PPF. Here the ppf2 is now the comparison with the
level 1, ppf3 is the comparison with level 3 with level 1 and ppf4 is
the comparison of level 4 with level 1.
5 Perform the plot requested at the end of the last
chapter.
ANSWER:
Two levels model without interaction:
library(ggplot2)
library(datasets)
library(grid)
library(ggplotify)
swiss <- swiss %>%
mutate(CatholicBin = 1 * (Catholic > 50))
g <- ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin)))+
geom_point(size = 5) +
xlab("% in Agriculture") +
ylab("Fertility")
fit <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss) # Note the + operator
g +
geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size=1, colour='green') +
geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size=1, colour='red') +
annotation_custom(grob = grobTree(textGrob("Catholic",
x=0.05,
y=.62,
hjust=0,
gp=gpar(col="blue",
fontsize=10,
fontface="italic")))) +
annotation_custom(grob = grobTree(textGrob("Non catholic",
x=0.05,
y=.5,
hjust=0,
gp=gpar(col="black",
fontsize=10,
fontface="italic"))))
Two levels model with interaction
fitfull <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
g +
geom_abline(intercept = coef(fitfull)[1], slope = coef(fitfull)[2], size=1, colour='green') +
geom_abline(intercept = coef(fitfull)[1] + coef(fitfull)[3],
slope = coef(fitfull)[2] + coef(fitfull)[4], size=1, colour='red') +
annotation_custom(grob = grobTree(textGrob("Catholic",
x=0.05,
y=.58,
hjust=0,
gp=gpar(col="blue",
fontsize=10,
fontface="italic")))) +
annotation_custom(grob = grobTree(textGrob("Non catholic",
x=0.05,
y=.46,
hjust=0,
gp=gpar(col="black",
fontsize=10,
fontface="italic"))))
1. Load the dataset Seatbelts as part of the
datasets package via data(Seatbelts). Use as.data.frame to convert the
object to a dataframe. Fit a linear model of driver deaths with kms and
PetrolPrice as predictors. Interpret your results.
ANSWER:
library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and for petrol price of 0 which is not meaningful. To better interpret this, we center the kilometer and petrol price and rescale them into more meaningful scale.
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn, stblts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
Thus, the intercept 122.802083 is the estimated number of drivers killed for the average petrol price and average distance driven. The petrol price coefficient of -7.838674 means that we expect 7 fewer deaths per one standard deviation change in petrol price. The coefficient -1.749546 which is approximately equal to -2 means that we expect two fewer deaths for every additional thousand kilometers driven.
2. Compare the kms coefficient with and without the
inclusion of the PetrolPrice variable in the model
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
First, let us examine the correlation between the kms and petrol price.
cor(stblts$ppn, stblts$mkm)
[1] 0.3839004
The correlation tells us that at some level these are measuring similar things and it would stand to reason that the price of gas would have an impact on the total miles being driven. Now, let us fit both models.
fit0<-lm(DriversKilled ~ mkm, stblts)
fit1<-lm(DriversKilled ~ mkm + ppn, stblts)
summary(fit0)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.7391997 70.60839 2.665611e-138
mkm -2.773787 0.5935049 -4.67357 5.596266e-06
summary(fit1)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
What we see is that the the estimate in this case goes down something we would likely expect since both effects are in the same direction and they’re correlated so the mkm which is the kilometers in thousand kilometer units centered it goes from an estimated -2.773787 additional deaths for every thousand extra kilometers driven per month to -1.749546 after the inclusion of the petrol price. Thus we can say that petrol price is having a confounding effect on the relationship between the kilometers driven and the number of drivers deaths.
3. Compare the PetrolPrice coefficient with and
without the inclusion fo the kms variable in the model.
ANSWER:
fit2<-lm(DriversKilled ~ ppn, stblts)
fit3<-lm(DriversKilled ~ ppn+mkm, stblts)
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.693656 72.507096 2.061333e-140
ppn -9.812019 1.698084 -5.778288 3.044208e-08
summary(fit3)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
Similarly in this case, the estimate is negative. Which makes sense because both effects are in the same direction and are correlated. But the estimate change from -9.812019 to 7.838674 which indicates the confounding effect of the variable kms on the regression between DriversKiller and PetrolPrice.
1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice+law, stblts)
2. Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R’s residual variance estimate.
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
sum(resid(fit)^2)/(nrow(stblts)-4)
[1] 522.8903
summary(fit)$sigma^2
[1] 522.8903
This goes to show how r is calculating the residual variance and what the residual variance represents it is merely the average squared residual which represents the variability left unexplained after having accounted for kilometers,petrol price and law again we subtract off the 4 to achieve some statistical properties namely the property of on biasness.
3. Refer to question 1. Perform an analysis of diagnostic measures including, dffits, dfbetas, influence and hat diagonals.
fit<-lm(DriversKilled ~ mkm + ppn +law, stblts)
plot(fit)
First plot shows residuals versus the fitted values which labels some
plots that appear on the extreme fits a smoother through the residuals.
The second plot is the Q-Q plot which is done to ascertain normality of
the errors. In this case we are looking for whether or not the points
fall mostly on a line and in this case it gives the reference line.
However when we look at these plots remember there’s variability in the
estimate of the quantiles from the standardized residuals. If we
simulate data even from a normal distribution it’s going to deviate from
this line somewhat so we want to see if there’s any grow gross
departures not so much evaluating minor departures. In this case this
doesn’t seem too bad. The next plot is the scale location which is the
square root of the absolute value of the standardized residuals by the
fitted values. This is the scale location plot a smoother is put through
it and again looking for patterns in this plot it doesn’t look so bad.
The last plot is the residuals vs leverage plot which doesn’t look so
bad as well.
Evaluating the influence:
Using ggplot, we can create a dataframe with the dffits values, classify
them, establish a threshold value, in this case |0.4|, and plot the
values indicating the id of the values above the threshold
(specialPoint).
datdffits <- data.frame(dffits = dffits(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dffits) >= .35))
ggplot(datdffits, aes(y=dffits, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdffits, specialPoint==TRUE),
aes(y=dffits,x=id,label=id), colour='blue') +
geom_hline(yintercept = c(-0.35,0.35), color= 'green') +
geom_hline(yintercept = 0)
head(dfbeta(fit))
(Intercept) mkm ppn law
1 -0.1022614 0.11881682 -0.09349927 -0.26195987
2 -0.1331656 0.22275746 -0.15807657 -0.56786007
3 -0.1279422 0.11452337 -0.07920934 -0.23464337
4 -0.2032275 0.13006726 -0.06413772 -0.23584650
5 -0.0524778 0.02353784 -0.01104844 -0.02756976
6 -0.1188004 0.03961875 -0.01108847 -0.02386593
plot(dfbeta(fit)[ , 2])
dfbeta for kms:
datdfbeta <- data.frame(dfbetamkm = dfbeta(fit)[ , 2]) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dfbetamkm) >= .2))
ggplot(datdfbeta, aes(y=dfbetamkm, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdfbeta, specialPoint==TRUE),
aes(y=dfbetamkm,x=id,label=id), colour='blue') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = c(-0.2,0.2), color= 'green') +
ylab('dfbeta: mkm')
datdfbeta2 <- data.frame(dfbetappn = dfbeta(fit)[ , 3]) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dfbetappn) >= .4))
ggplot(datdfbeta2, aes(y=dfbetappn, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdfbeta2, specialPoint==TRUE),
aes(y=dfbetappn,x=id,label=id), colour='blue') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = c(-0.4,0.4), color= 'green') +
ylab('dfbeta: PetrolPrice')
datcooks <- data.frame(cooksdist = cooks.distance(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(cooksdist >= .03))
ggplot(datcooks, aes(y=cooksdist, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datcooks, specialPoint==TRUE),
aes(y=cooksdist,x=id,label=id), colour='black') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = 0.03, color= 'blue') +
ylab('Cooks distance')
Evaluating leverage:
summary(round(hatvalues(fit), 3))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00600 0.01175 0.01700 0.02082 0.02600 0.06700
dathatv <- data.frame(hatv = hatvalues(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(hatv >= .06))
ggplot(dathatv, aes(y=hatv, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(dathatv, specialPoint==TRUE),
aes(y=hatv,x=id,label=id), colour='blue') +
geom_hline(yintercept = 0.06, color= 'green') +
ylab('hatvalues')
1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
library(datasets)
data("Seatbelts")
stblts <- as.data.frame(Seatbelts)
fit<- lm(DriversKilled ~ kms + PetrolPrice, stblts)
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Here the intercept is interpreted as the expected number of drivers killed for zero kilometers driven and for petrol price of 0 which is not meaningful. To better interpret this, we center the kilometer and petrol price and rescale them into more meaningful scale.
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
2. Perform a model selection exercise to arrive at a final model.
fit0 <- lm(DriversKilled~law, stblts)
fit1 <- lm(DriversKilled~law + mkm, stblts)
fit2 <- lm(DriversKilled~law + ppn, stblts)
fit3 <- lm(DriversKilled~law + mkm + ppn, stblts)
anova(fit0,fit1,fit2,fit3)
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm
Model 3: DriversKilled ~ law + ppn
Model 4: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 189 105608 1 4145.3 7.9276 0.005388 **
3 189 100069 0 5538.9
4 188 98303 1 1766.0 3.3774 0.067676 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit0,fit3)
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 188 98303 2 11450 10.949 3.177e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We are particularly interested with the fit0 and fit3 because we want to know whether or not both those parameters are necessary in addition to the law of variable. So here we have our base model which just has the law variable. The second model which includes kilometers in petrol price so it’s a 2 degree of freedom test because it added two parameters. Test is highly significant suggesting that we probably should include both those parameters.
rbind(summary(fit0)$coef[2, ],
summary(fit1)$coef[2, ],
summary(fit2)$coef[2, ],
summary(fit3)$coef[2, ])
Estimate Std. Error t value Pr(>|t|)
[1,] -25.60895 5.341655 -4.794198 3.288375e-06
[2,] -17.55372 6.028888 -2.911602 4.028394e-03
[3,] -16.32618 5.555579 -2.938700 3.706585e-03
[4,] -11.88920 6.025785 -1.973055 4.995497e-02
As we can see in the coefficient table, grabing the law row from each one of the fitted models we can see what happens to the law estimate. The lowest amount estimates that after the law went into effect we get an expected 25 fewer deaths per month. However if we adjust for the variability in just kilometers it drops down to 17 fewer deaths. If we just for adjust the variability for the variability in petrol price that estimate drops down to 16 deaths fewer per month and then if we adjust for both it goes down to 11 fewer deaths per month.
1. True or false, generalized linear models
transform the observed outcome. (Discuss.)
ANSWER:
Suppose we have an observed outcome and we do log of our outcome, that
is
\(log(Y)=\beta_0
+\beta_1+\epsilon\)
we are simply fitting a linear model where we’ve transformed the
outcome. On the other hand, suppose we log the expected value of y, that
is
\(log(E[Y])=\beta_0 +\beta_1X\)
Here,we parameterised in terms of a collection of parameters and our
other data, the X values.This a major difference between transforming
the outcome and transforming a parameter which is what occurs in
Generalized Linear Models (GLM). Therefore, the answer is FALSE.
2.True or false, the interpretation of the
coefficients in a GLM are on the scale of the link function.
(Discuss.)
ANSWER: TRUE. In fact, all of the coefficients are
interpreted on the link function scale.If we want some natural scale
interpretation we have to invert that link function. For instance we
have
\(log(E[Y|X=x])=\beta_0+\beta_1X\)
This model states that the log of the expected value of the outcome
given X takes the value x is \(\beta_0+\beta_1X\). This is for just a
single regressor. So, when we get an estimate for \(\beta_1\) and see it in our coefficient
table that is interpreted on the scale of the log of the expected value
of the outcome so it is interpreted on the link function scale. This is
an example of Poisson regression.
3 True or false, the generalized linear model
assumes an exponential family for the outcome.(Discuss.)
ANSWER:
TRUE.
In linear model, if we have \(Y\),
\(X_1\) and \(X_2\), we need a model that connects \(Y\) with the \(X_1\) and \(X_2\) to the linear model that does that in
a sense very directly by equating them with a straightforward
equation.Since this relationship won’t naturally hold, it adds an error
term assuming that the error term is probably normal. We have,
\(Y=\beta_0
+\beta_1X_1+\beta_2X_2\)
which is not in the form of a generalized linear model.In this model
formulation we started out with an equation that we wish was true and
then added the statistical assumptions with the error component. In
generalized linear model we’re going to start with assuming a
distribution for the \(Y\) and we’re
going to assume that it comes from an exponential family distribution.
An exponential family includes the normal, so linear models are a
special case of generalized linear models. It includes the Bernoulli and
binomial distributions the Poisson distribution. At this point, we only
assume the distribution for the \(Y\)
we need a way to relate the \(Y\) to
the \(X_1\) and \(X_2\) which is done by the fact that the
exponential family is always parametrized by a couple of parameters in
the same way that the normal distribution is parametrized by a mean and
the variance. Therefore, generalized linear model assumes an exponential
family for the outcome.
4. True or false, GLM estimates are obtained by
maximizing the likelihood. (Discuss.)
ANSWER:
TRUE.
Suppose we have a Poisson regression.
\(log(E[Y_i])=log(\mu_i)=\beta_0
+\beta_1\)
This tells us that,
\(e^{\mu_i}=e^{\beta_0
+\beta_1}\)
This means that we are going to assume that \(Y_i\) observation is independent to
Poisson(\(\mu_i\)). Thus, we have a
likelihood
\(\frac{\mu_i e^{\mu_i}}{Y_i!}\)
In this case the parameter of interest is \(\mu_i\), so, we can get rid of that part in
the bottom that doesn’t depend on the parameter anymore.Now the
likelihood will be the product over all the subjects, that is
\(\Pi\mu_ie^{\mu_i}=\zeta(\beta_0\beta_1|Y)\).
5 True or false, some GLM distributions impose
restrictions on the relationship between the mean and the variance.
(Discuss.)
ANSWER:
Suppose we have a Poisson distribution \(log(E[Y_i])=log(\mu_i)=\beta_0
+\beta_1X_i\) assuming that \(Y_i\) observation is independent to
Poisson(\(\mu_i\)). The implication of
this family assumption is that \(E[Y_i]=\mu_i\) by definition. However, for
the Poisson ditribution, \(Var(Y_i)=E[Y_i]-\mu_i\). This is simply
because the mean and the variance of the Poisson distribution is always
the same. This is a checkable assumption in some cases for example if we
have a lot of repeat measurements of \(Y_i\) at the same value of \(\mu_i\). Now, suppose we have \(log(E[Y_i])=\beta_0 +\beta_1\) and let’s
assume that we had a bunch of independent realizations of Poisson random
variables then we could calculate \(\bar{Y}\) and that will give us an estimate
of \(\mu\). Then, we could calculate
\(Var(Y_i)=\mu\) under the assumption
that the \(Y_i\)’s are Poisson. If
those things are close then the Poisson assumption is validated if
they’re not closed then the Poisson assumption is not validated and so
this mean variance relationship has potentially the problem that it’s
very different from what the data actually says. The solution to this
often is to relax the assumption and no longer assume that Y is poised
on but instead assume a slight generalization of the Poisson
distribution such as the negative binomial distribution or to focus on a
more asymptotic less modeling based approach such as so called Quisi
Poisson on models. But at any rate to answer the question for the
Poisson distribution, for the Bernoulli distribution and many other
instances of generalized linear models there is an implied relationship
between the mean and the variance and often that that relationship is
checkable. Thus, whenever we’re actually doing generalized linear model
analyses we should either automatically account for the potential that
that assumption is false with quasi models quasi likelihood models or
actually verify that it’s true.
1. Load the dataset Seatbelts as part of the
datasets package via data(Seatbelts). Use as.data.frame to convert the
object to a dataframe. Create a new outcome variable for whether or not
greater than 119 drivers were killed that month. Fit a logistic
regression GLM with this variable as the outcome and kms, PetrolPrice
and law as predictors. Interpret your parameters.
ANSWER:
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
dkb = (DriversKilled > 119),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
table(stblts$dkb)
FALSE TRUE
98 94
We se here a 98 false and 94 true. We would like it to be not necessarily a boolean variable but rather a numeric variable. Thus, we multiply 1 to our previous data which of does nothing but because of the arithmetic it will automatically convert it now it’s 0 & 1.
library(dplyr)
stblts <- stblts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
dkb = 1 *(DriversKilled > 119),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
table(stblts$dkb)
0 1
98 94
fit<-glm(dkb~ppn+mkm+law, family = binomial, stblts)
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.024313512 0.16077499 0.15122695 0.87979669
ppn -0.416407701 0.16973435 -2.45329074 0.01415559
mkm -0.002938343 0.05984816 -0.04909663 0.96084229
law -0.615522450 0.57780755 -1.06527242 0.28675267
round(summary(fit)$coef, 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0243 0.1608 0.1512 0.8798
ppn -0.4164 0.1697 -2.4533 0.0142
mkm -0.0029 0.0598 -0.0491 0.9608
law -0.6155 0.5778 -1.0653 0.2868
The logit scale odds of having greater than 118 drivers killed that month is -0.6155 lower after the law had taken effect. -0.0029 means that we estimate a logic scale change in the odds of greater than of greater than 119 drivers killed that month as being 0.0029 lower for every thousand kilometer increase in drivers in driving distance. For better interpretation, we have
exp( -0.615522450)
[1] 0.5403585
1-exp( -0.615522450)
[1] 0.4596415
This means that the route of the odds ratio comparing when the law was enacted to before the law was 54% and there was was a 46% decrease in the odds of drivers of greater than 119 drivers being killed in a month after the law was enacted compared to before the law wasn’t enacted holding the other variables constant.
exp(-0.002938343)
[1] 0.997066
1-exp(-0.002938343)
[1] 0.00293403
This means that there is approximately 0.3% decrease in the odds of greater than 119 drivers getting killed in a month for every additional thousand driver miles that month.
2. Fit a binomial model with DriversKilled as the
outcome and drivers as the total count with kms , PetrolPrice and law as
predictors, interpret your results.
ANSWER:
fit<-glm(cbind(DriversKilled, drivers-DriversKilled)~ppn+mkm+law, family = binomial, stblts)
summary(fit)
Call:
glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~
ppn + mkm + law, family = binomial, data = stblts)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4371 -0.7270 -0.0235 0.7111 3.0313
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.536637 0.007399 -342.829 <2e-16 ***
ppn -0.007829 0.007479 -1.047 0.295
mkm 0.003645 0.002733 1.334 0.182
law 0.030785 0.026527 1.161 0.246
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.93 on 191 degrees of freedom
Residual deviance: 229.93 on 188 degrees of freedom
AIC: 1496
Number of Fisher Scoring iterations: 3
Consider the drivers killed as a percentage of drivers series killed or seriously injured. Observe that the kilometers variable now is positive rather than negative and so is the variable associated with the enactment of the seatbelt law so the enactment of the seatbelt law is estimating about a 3% increase in the odds of being killed given that a person was killed seriously injured whereas before it was a negative quantity whether or not the driver was killed so again this is a different quantity than we’re looking at before we were just looking at the number killed whereas now we’re looking at the percentage of the number killed all over those that were killed or seriously injured.
3. Refer to Question 1. Use the anova function to
compare models with just law, law and PetrolPrice and all three
predictors.
ANSWER:
fit1<-glm(dkb~law, family = binomial, stblts)
fit2<-update(fit1, dkb~law + PetrolPrice)
fit3<-update(fit2,dkb~law + PetrolPrice+kms)
anova(fit1,fit2,fit3)
Analysis of Deviance Table
Model 1: dkb ~ law
Model 2: dkb ~ law + PetrolPrice
Model 3: dkb ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 260.40
2 189 253.62 1 6.7760
3 188 253.62 1 0.0024
We want to compare this usually to chi squared values this goes from 260 to 253 so it looks like chi square cut off is usually around for that would give us a 5% hypothesis test. Hence, it is significant for adding petrol price and not significant for adding kilometers. Looking at the coefficients tables we have
summary(fit1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08288766 0.1539783 0.5383074 0.59036483
law -1.12434153 0.4991985 -2.2522935 0.02430373
summary(fit2)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.5875552 1.3851773 2.589961 0.009598679
law -0.6260282 0.5367204 -1.166396 0.243454555
PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
summary(fit3)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.612261e+00 1.473634e+00 2.45126028 0.01423570
law -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
kms -2.938343e-06 5.984816e-05 -0.04909663 0.96084229
Observe in the law variabe, it suggest a rather large decrease in the log odds for every before and after enactment of the law. However that decrease in the odds of greater than 119 drivers killed that month that decrease and reduce quite a bit by the inclusion of petrol price. And notice that petrol price is quite significant. Note that law variable goes from significant to not significant. It gets increased by a factor of the p value gets increased by a factor of 10. When we include kilometers we get include a variable that now is not significant at all and doesn’t change the law variable that much.Hence if we were doing a model selection, we were primarily interested in the law coefficient and we would likely go with this second model.
1. Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM with UKDriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret your results.
fit<-glm(DriversKilled~ppn+mkm+law, family = poisson, stblts)
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
law -0.114877106 0.025557951 -4.494770 6.964526e-06
round(summary(fit)$coef, 4)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.8198 0.0071 676.2429 0e+00
ppn -0.0554 0.0072 -7.6431 0e+00
mkm -0.0100 0.0026 -3.8183 1e-04
law -0.1149 0.0256 -4.4948 0e+00
exp(-0.114877106 )
[1] 0.8914757
1-exp(-0.114877106 )
[1] 0.1085243
Therefore, there’s about an 11% decrease in the expected number of drivers killed for before and after enacting the law. Thus enacting the law decreased the expected number of drivers killed by 11%.
exp(-0.009980975 )
[1] 0.9900687
1-exp(-0.009980975 )
[1] 0.00993133
This means that there’s a about 1% decrease in the expected number of drivers killed for every thousand additional miles of driver distance that month. The intercept is in is the expected number of drivers killed on the log scale, exponentiating it we have
exp(4.819845137)
[1] 123.9459
We get 123 which is the expected number of drivers killed for the average number of petrol price and the average number of kilometers driven.
2. Refer to question 1. Fit a linear model with the
log of drivers killed as the outcome. Interpret your results.
ANSWER:
fit<-glm(DriversKilled~ppn+mkm+law, family = poisson, stblts)
fit2<-lm(I(log(DriversKilled))~ppn+mkm+law, stblts)
summary(fit)
Call:
glm(formula = DriversKilled ~ ppn + mkm + law, family = poisson,
data = stblts)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.7909 -1.6247 -0.3526 1.2900 4.8720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845 0.007127 676.243 < 2e-16 ***
ppn -0.055361 0.007243 -7.643 2.12e-14 ***
mkm -0.009981 0.002614 -3.818 0.000134 ***
law -0.114877 0.025558 -4.495 6.96e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 984.50 on 191 degrees of freedom
Residual deviance: 778.32 on 188 degrees of freedom
AIC: 2059.1
Number of Fisher Scoring iterations: 4
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
law -0.114877106 0.025557951 -4.494770 6.964526e-06
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.805409776 0.014411832 333.435035 1.381656e-262
ppn -0.053968063 0.014813218 -3.643237 3.481611e-04
mkm -0.008189793 0.005325983 -1.537705 1.258020e-01
law -0.131450856 0.048212881 -2.726468 7.007200e-03
In this case is the expected value of the log of the drivers killed which is our estimate of the expected value of the log of the drivers killed where the other one was the log of the expected value and it doesn’t work that you could interchange.
1-exp(-0.131450856)
[1] 0.1231776
Thus we have a 12% decrease in driver deaths after the enactment of the law. Note that the consequence of this log in the outcome is that the interpretation is respect to geometric means so this is a 12% decrease in the geometric mean number of driver deaths for going from enactment of the law to prior to the law having been enacted.
3. Refer to question 1. Fit your Poisson log-linear
model with drivers as a log offset (to consider the proportion of
drivers killed of those killed or seriously injured.)
ANSWER:
fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, stblts, offset = log(drivers))
fit0 <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, stblts)
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm 0.003377675 0.002630717 1.283937 0.1991640
ppn -0.007255064 0.007199577 -1.007707 0.3135952
law 0.028484328 0.025512651 1.116479 0.2642173
summary(fit0)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
law -0.114877106 0.025557951 -4.494770 6.964526e-06
4 Refer to Question 1. Use the anova function to
compare models with just law, law and PetrolPrice and all three
predictors
ANSWER:
fit1 <- glm(DriversKilled ~ law, family = poisson, stblts)
fit2 <- update(fit1,DriversKilled ~ law+PetrolPrice)
fit3<- update(fit2,DriversKilled ~ law+PetrolPrice+kms)
anova(fit1,fit2,fit3)
Analysis of Deviance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + PetrolPrice
Model 3: DriversKilled ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 870.06
2 189 792.88 1 77.178
3 188 778.32 1 14.561
Observe that all of the regression variables are apparently very necessary going from model 1 to model 2. We get a deviance of 77 a difference in the residual deviances of 77 that’s on 1 degree of freedom which is comparable to a chi-squared.Note that a chi-squared with 1 degree of freedom is like a standard normal squared.We know that two standard deviations for standard normal is kind of far out in its tail so for when we square that is kind of fire out in the tail of the chi squared so 77 is much further out in the tail than four so this would likely be a highly significant result. Similarly, with 14 much larger than 4, then let’s look at the summary coefficient tables for each.
summary(fit1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.8352482 0.006856395 705.21727 0.000000e+00
law -0.2274727 0.021923993 -10.37552 3.204779e-25
summary(fit2)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3499077 0.05886323 90.887084 0.000000e+00
law -0.1516301 0.02364153 -6.413720 1.420110e-10
PetrolPrice -5.0697421 0.57792672 -8.772292 1.750654e-18
summary(fit3)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.440656e+00 6.360114e-02 85.543371 0.000000e+00
law -1.148771e-01 2.555795e-02 -4.494770 6.964526e-06
PetrolPrice -4.546821e+00 5.948884e-01 -7.643150 2.119715e-14
kms -9.980975e-06 2.614002e-06 -3.818274 1.343887e-04
The law variable -0.2274727 is interpreted on the log of the expected value scale.It’s highly significant however it gets reduced and goes closer to -0.1516301 after the inclusion of petrol price and then after the inclusion of petrol price and kilometers it drops even further down 0.1148771.The significance drops from 3.204779e-25 to 1.420110e-10 to 6.964526e-06 which still very significant.Nonetheless,these other variables appear also quite significant so it does seem like we would want to include both of them in our final model.