fitfs <- lm(sheight ~ fheight, father.son)
Hypothesis Testing
y= father.son$sheight; x=father.son$fheight; n= nrow(father.son)
Given \(\hat{\beta}_1 = Cor(Y, X)\,\frac{SD(Y)}{SD(X)}\)
and \(\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}\)
beta1 <- cor(y,x)*sd(y)/sd(x)
beta0 <- mean(y) - beta1*mean(x)
Remember, the predicted outcome at predictor value \(X_i\) is \(\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i\), so the residuals are \(e_i = Y_i - \hat{Y}_i\):
e <- y-beta0 -beta1*x
The standard error of the residuals is
sigma <- sqrt(sum(e^2)/(n-2))
ssx <- sum((x - mean(x))^2)
The standard error of the regression slope is
seb1 <- sigma / sqrt(ssx)
Knowing that \(\frac{\hat{\beta}_1 - \beta_1}{\hat{\sigma}_{\hat{\beta}_1}}\) follows a t distribution with \(n - 2\) degrees of freedom, if the null hypothesis is that the true slope is 0 we have:
tBeta1 <- beta1 / seb1
pBeta1 <- 2 * pt(abs(tBeta1), df = n - 2, lower.tail = FALSE)
coefTab <- data.frame(beta1, seb1, tBeta1, pBeta1)
colnames(coefTab) <- c("Estimate", "Std. Error", "t value", "P(>|t|)")
rownames(coefTab) <- "x"
coefTab
Estimate Std. Error t value P(>|t|)
x 0.514093 0.02704874 19.00618 1.121268e-69
Comparing with summary():
summary(lm(sheight~fheight, father.son))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
coefTab[1,4]
[1] 1.121268e-69
summary(lm(sheight~fheight, father.son))$coef[2,4]
[1] 1.121268e-69
The p-value of the slope is very small, so we reject the null hypothesis: the slope coefficient is different from 0. We can therefore consider the father’s height a statistically significant predictor of the son’s height.
Remember the degrees of freedom are \(n - 2\).
The confidence interval is \(\hat{\beta}_1 \pm t_{0.975,\,n-2}\,\hat{\sigma}_{\hat{\beta}_1}\).
Put the coefficient in the variable sumCoef
sumCoef <- (summary(lm(sheight~fheight, father.son))$coef)
sumCoef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.886604 1.83235382 18.49348 1.604044e-66
fheight 0.514093 0.02704874 19.00618 1.121268e-69
We can use 2 methods to compute confidence interval
# Method 1
# Estimate  +/-  t quantile (0.975, df = n - 2)  *  SE of the estimate
sumCoef[2,1] + c(-1,1) * qt(0.975, df= fitfs$df.residual) * sumCoef[2,2]
[1] 0.4610188 0.5671673
# Method 2
confint(fitfs)
2.5 % 97.5 %
(Intercept) 30.2912126 37.4819961
fheight 0.4610188 0.5671673
As we have seen in the previous exercises, the estimate for the intercept is \(33.8866044\).
summary(fitfs)$coef[1,1]
[1] 33.8866
This is the predicted son’s height for a father’s height of 0 inches, which does not make sense. So we refit the regression with the father’s height centered at its mean.
fitfsc <- lm(sheight ~ I(fheight - mean(fheight)), father.son)
fitfsc
Call:
lm(formula = sheight ~ I(fheight - mean(fheight)), data = father.son)
Coefficients:
(Intercept) I(fheight - mean(fheight))
68.6841 0.5141
We get a new intercept, \(68.7\) inches: the predicted son’s height at the average father’s height of 67.7 inches.
Now, the C.I for this intercept (the son’s height at the average father’s height) is:
confint(fitfsc)
2.5 % 97.5 %
(Intercept) 68.5384554 68.8296839
I(fheight - mean(fheight)) 0.4610188 0.5671673
We can also see that the slope coefficient does not change.
Calculating the expected values
predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = 'confidence')
fit lwr upr
1 68.68407 68.53846 68.82968
predict(fitfs, newdata = data.frame(fheight = mean(father.son$fheight)), interval = "prediction")
fit lwr upr
1 68.68407 63.90091 73.46723
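As a check, here is a minimal sketch of the same confidence interval computed by hand: at \(x = \bar{x}\) the fitted value is \(\bar{y}\) and the standard error of the fitted value reduces to \(\hat{\sigma}\sqrt{1/n}\). The names yhat and se_ci are just illustrative.
# Hand computation of the confidence interval at the mean father's height (sketch)
yhat <- mean(father.son$sheight)                          # fitted value at mean(fheight) equals mean(sheight)
se_ci <- summary(fitfs)$sigma * sqrt(1 / nrow(father.son))
yhat + c(-1, 1) * qt(0.975, df = fitfs$df.residual) * se_ci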
fit <- lm(mpg ~ hp, mtcars)
Hypothesis Testing
summary(fit)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
The p-value of the slope is very small \((1.787835 \times 10^{-7})\), so we reject the null hypothesis: the slope coefficient is different from 0. We can therefore consider horsepower a statistically significant predictor of miles per gallon.
Method 1
confint(fit)
2.5 % 97.5 %
(Intercept) 26.76194879 33.4357723
hp -0.08889465 -0.0475619
Method 2
with(mtcars, cor(mpg, hp)*sd(mpg)/sd(hp))+ c(-1,1) * qt(0.975, df=nrow(mtcars)-2)*summary(fit)$coef[2,2]
[1] -0.08889465 -0.04756190
Centered coefficients
fit <- lm(mpg ~ I(hp-mean(hp)), mtcars)
Method 1
confint(fit)
2.5 % 97.5 %
(Intercept) 18.69599452 21.4852555
I(hp - mean(hp)) -0.08889465 -0.0475619
Method 2
predict(fit, newdata = data.frame(hp = mean(mtcars$hp)), interval = 'confidence')
fit lwr upr
1 20.09062 18.69599 21.48526
fit <- lm(mpg~hp,mtcars)
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'confidence')
fit lwr upr
1 20.09062 18.69599 21.48526
predict(fit, newdata = data.frame(hp=mean(mtcars$hp)), interval = 'prediction')
fit lwr upr
1 20.09062 12.07908 28.10217
y <- mtcars$mpg; x <- mtcars$hp; n <- length(y)
fit <- lm(y~x)
p1 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'confidence'))
p2 <- data.frame(predict(fit, newdata = data.frame(x=x), interval = 'prediction'))
p1 <- p1 %>%
mutate(interval = 'confidence') %>%
mutate(x = mtcars$hp)
p2 <- p2 %>%
mutate(interval = 'prediction') %>%
mutate(x = mtcars$hp)
dat <- rbind(p1,p2)
names(dat)[1] <- "y"
dat %>%
ggplot(aes(x = x, y = y)) +
geom_ribbon(aes(ymin= lwr, ymax= upr, fill= interval), alpha= 0.4) +
geom_line(colour="black", lwd=.5) +
geom_point(data= data.frame(x=x,y=y), aes(x=x,y=y))
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths (DriversKilled) with kms and PetrolPrice as predictors. Interpret your results.
Seatbelts <- as.data.frame(Seatbelts)
round(summary(lm(DriversKilled ~ kms + PetrolPrice, Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 215.7461 14.6656 14.7110 0.0000
kms -0.0017 0.0006 -2.8469 0.0049
PetrolPrice -643.7895 148.2896 -4.3414 0.0000
The intercept is the number of deaths for 0 kms and 0 PetrolPrice, which is not useful, so it is better to center the variables. The kms estimate is the effect on deaths of 1 additional km, which is tiny and hard to read; better to rescale it. PetrolPrice is an index, which is better to normalize.
Seatbelts <- Seatbelts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
fit <- lm(DriversKilled ~ kms + PetrolPrice, Seatbelts)
predict(fit, newdata = data.frame(kms= mean(Seatbelts$kms),
PetrolPrice= mean(Seatbelts$PetrolPrice)))
1
122.8021
This is the number of deaths at the mean of kms and
PetrolPrice.
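Note that this matches the intercept of the centered model fitted above; a quick check:
# The intercept of the centered fit equals the prediction at the means
coef(lm(DriversKilled ~ mkm + ppn, Seatbelts))[1]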
Take the residual for DriversKilled having regressed out kms and an intercept, and the residual for PetrolPrice having regressed out kms and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.
dk <- Seatbelts$DriversKilled
kms <- Seatbelts$kms
pp <- Seatbelts$PetrolPrice
fitFull <- lm(dk~kms+pp)
Now we get the residuals. Remember the formula for the estimated coefficient in terms of the residuals: \(\hat{\beta}_{pp} = \frac{\sum_i e_{i,\,dk \mid kms}\; e_{i,\,pp \mid kms}}{\sum_i e_{i,\,pp \mid kms}^2}\).
The residuals for DriversKilled are:
edk <- resid(lm(dk~kms))
The residuals for PetrolPrice are:
epp <- resid(lm(pp~kms))
So the estimate for pp is:
sum(edk*epp)/sum(epp^2)
[1] -643.7895
Remember the coefficients of fitFull:
round(summary(fitFull)$coef, 4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 215.7461 14.6656 14.7110 0.0000
kms -0.0017 0.0006 -2.8469 0.0049
pp -643.7895 148.2896 -4.3414 0.0000
We can also obtain the same estimate by fitting the regression through the origin directly:
round(summary(lm(edk~epp- 1))$coef,4)
Estimate Std. Error t value Pr(>|t|)
epp -643.7895 147.5111 -4.3643 0
Take the residual for DriversKilled having regressed out PetrolPrice and an intercept. Take the residual for kms having regressed out PetrolPrice and an intercept. Fit a regression through the origin of the two residuals and show that it is the same as your coefficient obtained in question 1.
edk2 <- resid(lm(dk ~ pp))
ekms <- resid(lm(kms ~ pp))
round(summary(lm(edk2 ~ ekms -1))$coef, 4)
Estimate Std. Error t value Pr(>|t|)
ekms -0.0017 6e-04 -2.8619 0.0047
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.
Seatbelts <- as.data.frame(Seatbelts)
# Remember that kms has been rescaled and centered (mkm = kms/1000 - mean) and PetrolPrice has been normalized:
# ppn = (PetrolPrice - mean(PetrolPrice)) / sd(PetrolPrice)
round(summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.8021 1.6629 73.8503 0.0000
mkm -1.7495 0.6145 -2.8469 0.0049
ppn -7.8387 1.8055 -4.3414 0.0000
The intercept is the number of deaths for 0 kms and 0 PetrolPrice, which is not useful, so it is better to center the variables. The kms estimate is the effect on deaths of 1 additional km, which is tiny and hard to read; better to rescale it.
# Method 1: create a new variable logkill = Log_e of DriversKilled
Seatbelts2 <- Seatbelts %>%
mutate(logkill = log(DriversKilled))
summary(lm(logkill ~ ppn + mkm, Seatbelts2))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78966306 0.013426810 356.723817 2.737888e-269
ppn -0.06412578 0.014579039 -4.398492 1.818005e-05
mkm -0.01400794 0.004962149 -2.822959 5.267843e-03
# Method 2: calculate the natural log inside lm()
summary(lm(log(DriversKilled) ~ ppn + mkm, Seatbelts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78966306 0.013426810 356.723817 2.737888e-269
ppn -0.06412578 0.014579039 -4.398492 1.818005e-05
mkm -0.01400794 0.004962149 -2.822959 5.267843e-03
In this case the model is \(E[\log(DriversKilled)] = \beta_0 + \beta_1\, ppn + \beta_2\, mkm\).
Interpretation:
Exponentiating the (negative) coefficients and subtracting from 1 gives the approximate percentage decrease in the geometric mean of driver deaths. For the normalized petrol price variable (ppn), a one standard deviation increase in petrol price is associated with about a 6.2% decrease; for mkm, each additional 1,000 km driven is associated with about a 1.4% decrease:
1-exp(-0.06412578)
[1] 0.06211298
1-exp(-0.01400794)
[1] 0.01391029
Now add the variable law to the model and interpret the results. Repeat this question with a factor variable that you create called lawFactor that takes the levels No and Yes. Change the reference level from No to Yes.
Model 2
The model includes \(X_1\) and \(X_2\) without an interaction term, so it fits two lines that share the same slope but have different intercepts.
round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.2263 1.8012 68.9674 0.0000
mkm -1.2233 0.6657 -1.8378 0.0677
ppn -6.9199 1.8514 -3.7377 0.0002
I(factor(law))1 -11.8892 6.0258 -1.9731 0.0500
In this case, the intercept has changed a little. It is the expected number of deaths at the average kms, the average petrol price, and at level 0 of law, i.e. before the law was enacted.
If we wanted to know the value of DriversKilled after the law took effect, at the average mkm and ppn, we would add the law coefficient, \(-11.8892\).
Note that R tells us which factor level it is treating as the reference: in this case it is 0. The value \(-11.8892\) is associated with factor level 1.
Change the reference level from No to Yes:
Seatbelts<-Seatbelts %>%
mutate(law = factor(law, levels = c(1,0)))
round(summary(lm(DriversKilled ~ mkm + ppn + I(factor(law)), Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 112.3371 5.5547 20.2236 0.0000
mkm -1.2233 0.6657 -1.8378 0.0677
ppn -6.9199 1.8514 -3.7377 0.0002
I(factor(law))0 11.8892 6.0258 1.9731 0.0500
Seatbelts<-Seatbelts %>%
mutate(law = factor(law, levels = c(0,1)))
Discretize the PetrolPrice variable into four factor levels. Fit the linear model with this factor to see how R treats multiple level factor variables.
Seatbelts <- Seatbelts %>%
  mutate(ppf = as.factor((ppn <= -1.5) + (ppn <= 0) + (ppn <= 1.5) + (ppn < Inf)))
round(summary(lm(DriversKilled ~ ppf + mkm + I(factor(law)), Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.8405 9.5066 11.5541 0.0000
ppf2 10.8271 9.9462 1.0886 0.2778
ppf3 18.6904 9.9374 1.8808 0.0616
ppf4 25.0074 10.9163 2.2908 0.0231
mkm -1.2991 0.7668 -1.6942 0.0919
I(factor(law))1 -15.3445 6.0345 -2.5428 0.0118
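For reference, an equivalent discretization can be written with cut(); this is just a sketch, and ppf2 is a hypothetical column name — the four groups match ppf above, only the level labels (and their order) differ.
# Same four groups via explicit breaks; ppf2 is an illustrative name
Seatbelts <- Seatbelts %>%
  mutate(ppf2 = cut(ppn, breaks = c(-Inf, -1.5, 0, 1.5, Inf)))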
library(datasets); data(swiss)
head(swiss)
Fertility Agriculture Examination Education Catholic
Courtelary 80.2 17.0 15 12 9.96
Delemont 83.1 45.1 6 9 84.84
Franches-Mnt 92.5 39.7 5 5 93.40
Moutier 85.8 36.5 12 7 33.77
Neuveville 76.9 43.5 17 15 5.16
Porrentruy 76.1 35.3 9 7 90.57
Infant.Mortality
Courtelary 22.2
Delemont 22.2
Franches-Mnt 20.2
Moutier 20.3
Neuveville 20.6
Porrentruy 26.6
swiss <- swiss %>%
mutate(CatholicBin = 1 * (Catholic > 50))
g <- ggplot(swiss, aes(x = Agriculture, y = Fertility, colour = factor(CatholicBin))) +
geom_point(size = 5) +
xlab("% in Agriculture") +
ylab("Fertility")
fit <- lm(Fertility ~ Agriculture + factor(CatholicBin), data = swiss)
g +
geom_abline(intercept = coef(fit)[1], slope = coef(fit)[2], size = 1, colour = 'black') +
geom_abline(intercept = coef(fit)[1] + coef(fit)[3], slope = coef(fit)[2], size = 1, colour = 'black') +
annotate("text", x = 0.05, y = 72, label = "Catholic", hjust = 0, col = "black", size = 3, fontface = "italic") +
annotate("text", x = 0.05, y = 60, label = "Non-Catholic", hjust = 0, col = "black", size = 3, fontface = "italic")
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
fitfull <- lm(Fertility ~ Agriculture * factor(CatholicBin), data = swiss)
g +
geom_abline(intercept = coef(fitfull)[1], slope = coef(fitfull)[2], size = 1, colour = 'black') +
geom_abline(intercept = coef(fitfull)[1] + coef(fitfull)[3],
slope = coef(fitfull)[2] + coef(fitfull)[4], size = 1, colour = 'black') +
annotate("text", x = 0.05, y = 68, label = "Catholic", hjust = 0, col = "black", size = 3, fontface = "italic") +
annotate("text", x = 0.05, y = 60, label = "Non-Catholic", hjust = 0, col = "black", size = 3, fontface = "italic")
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms and PetrolPrice as predictors. Interpret your results.
summary(lm(DriversKilled ~ mkm + ppn, Seatbelts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
Compare the kms coefficient with and without the inclusion of the PetrolPrice variable in the model.
First of all, let's look at the correlation between mkm and ppn:
cor(Seatbelts$mkm, Seatbelts$ppn)
[1] 0.3839004
This moderate correlation indicates that the two variables partly measure the same thing; possibly the petrol price has an impact on the total kms driven.
fit0 <- lm(DriversKilled ~ mkm, Seatbelts)
fit1 <- lm(DriversKilled ~ mkm + ppn, Seatbelts)
summary(fit0)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.7391997 70.60839 2.665611e-138
mkm -2.773787 0.5935049 -4.67357 5.596266e-06
summary(fit1)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
The estimate stays negative but is attenuated toward zero, which makes sense because both effects are in the same direction and the regressors are correlated. The estimate changes from \(-2.773787\) to \(-1.749546\), which indicates the confounding effect of PetrolPrice on the regression of DriversKilled on kms.
anova(fit0,fit1)
Analysis of Variance Table
Model 1: DriversKilled ~ mkm
Model 2: DriversKilled ~ mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 110345
2 189 100339 1 10006 18.848 2.305e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is the same as in fit1 (ppn estimate).
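This is no coincidence: for a single added regressor, the F statistic of the nested-model test equals the square of that regressor's t statistic, so the p-values agree. A quick check:
# The anova F statistic (18.848) is the square of ppn's t statistic in fit1
summary(fit1)$coef["ppn", "t value"]^2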
Compare the PetrolPrice coefficient with and without the inclusion of the kms variable in the model.
fit2 <- lm(DriversKilled ~ ppn, Seatbelts)
fit3 <- lm(DriversKilled ~ ppn + mkm, Seatbelts)
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.693656 72.507096 2.061333e-140
ppn -9.812019 1.698084 -5.778288 3.044208e-08
summary(fit3)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
In this case too, the estimate stays negative but is attenuated toward zero, which makes sense because both effects are in the same direction and the regressors are correlated. The estimate changes from \(-9.812019\) to \(-7.838674\), which indicates the confounding effect of kms on the regression of DriversKilled on PetrolPrice.
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
data(Seatbelts)
Seatbelts <- as.data.frame(Seatbelts)
model <- lm(DriversKilled ~ kms + PetrolPrice + law, data = Seatbelts)
summary(model)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.014614e+02 1.625587e+01 12.393145 3.601219e-26
kms -1.223318e-03 6.656567e-04 -1.837761 6.767594e-02
PetrolPrice -5.683347e+02 1.520552e+02 -3.737687 2.463128e-04
law -1.188920e+01 6.025785e+00 -1.973055 4.995497e-02
round(summary(lm(DriversKilled ~ kms + PetrolPrice + law, Seatbelts))$coef,4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 201.4614 16.2559 12.3931 0.0000
kms -0.0012 0.0007 -1.8378 0.0677
PetrolPrice -568.3347 152.0552 -3.7377 0.0002
law -11.8892 6.0258 -1.9731 0.0500
Seatbelts <- Seatbelts %>%
mutate(ppn = (PetrolPrice - mean(PetrolPrice))/sd(PetrolPrice),
mkm = kms / 1000) %>%
mutate(mkm = mkm - mean(mkm))
summary(lm(DriversKilled ~ mkm + ppn + law, Seatbelts))$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.226311 1.8012324 68.967399 1.976641e-135
mkm -1.223318 0.6656567 -1.837761 6.767594e-02
ppn -6.919949 1.8513987 -3.737687 2.463128e-04
law -11.889202 6.0257850 -1.973055 4.995497e-02
fit <- lm(DriversKilled ~ mkm + ppn + law, Seatbelts)
# Method 1
sum(resid(fit)^2) / (nrow(Seatbelts) - 4) # Each estimated coefficient (intercept + 3 predictors) uses up one degree of freedom, so the residual degrees of freedom are n - 4
[1] 522.8903
# Method 2
summary(fit)$sigma^2
[1] 522.8903
plot(fit)
The first plot shows the residuals versus the fitted values. Here you are looking for any kind of pattern, and in this case it does not look too bad. It labels some points that appear extreme and fits a smoother through the residuals, because it is often difficult to spot a pattern just by looking at the points; a smoother frequently reveals patterns you would not have seen otherwise.
The Q-Q plot is used to assess normality of the errors. You are checking whether the points fall mostly on the reference line that is drawn for you. Typically the deviations show up in the tails; here they suggest the residuals in the left tail are larger than the theoretical quantiles of the normal distribution.
The scale-location plot shows the square root of the absolute value of the standardized residuals against the fitted values, again with a smoother through it. You are once more looking for patterns, and this one does not look bad either.
Residuals vs Leverage plot: the final plot shows the residuals versus the leverage.
Remember that leverage measures how much potential for influence a point has by virtue of its predictor values being far from the norm.
For this collection of points we might, for example, look at the couple of points with fairly high leverage relative to the others and check whether anything is out of the ordinary about them.
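If you prefer to look at one diagnostic panel at a time, plot.lm accepts a which argument; a minimal sketch (panel numbers as documented in ?plot.lm):
plot(fit, which = 1)  # residuals vs fitted
plot(fit, which = 2)  # normal Q-Q
plot(fit, which = 3)  # scale-location
plot(fit, which = 5)  # residuals vs leverage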
Let's look at some influence measures as well. Keep an eye on the 12th and 192nd points. We can look, for example, at the dffits values with the dffits() function.
plot(dffits(fit))
Evaluating the influence
We take the dffits values, classify them against a threshold, in this case |0.35|, and plot the values, labelling the id of the points above the threshold (specialPoint).
datdffits <- data.frame(dffits = dffits(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dffits) >= .35))
ggplot(datdffits, aes(y=dffits, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdffits, specialPoint==TRUE),
aes(y=dffits,x=id,label=id), colour='black') +
geom_hline(yintercept = c(-0.35,0.35), color= 'blue') +
geom_hline(yintercept = 0)
head(dfbeta(fit))
(Intercept) mkm ppn law
1 -0.1022614 0.11881682 -0.09349927 -0.26195987
2 -0.1331656 0.22275746 -0.15807657 -0.56786007
3 -0.1279422 0.11452337 -0.07920934 -0.23464337
4 -0.2032275 0.13006726 -0.06413772 -0.23584650
5 -0.0524778 0.02353784 -0.01104844 -0.02756976
6 -0.1188004 0.03961875 -0.01108847 -0.02386593
plot(dfbeta(fit)[ , 2])
dfbeta for kms:
datdfbeta <- data.frame(dfbetamkm = dfbeta(fit)[ , 2]) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dfbetamkm) >= .2))
ggplot(datdfbeta, aes(y=dfbetamkm, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdfbeta, specialPoint==TRUE),
aes(y=dfbetamkm,x=id,label=id), colour='black') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = c(-0.2,0.2), color= 'blue') +
ylab('dfbeta: mkm')
datdfbeta2 <- data.frame(dfbetappn = dfbeta(fit)[ , 3]) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(abs(dfbetappn) >= .4))
ggplot(datdfbeta2, aes(y=dfbetappn, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datdfbeta2, specialPoint==TRUE),
aes(y=dfbetappn,x=id,label=id), colour='black') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = c(-0.4,0.4), color= 'blue') +
ylab('dfbeta: PetrolPrice')
Cook's distance is an overall summary of the influence of each point on the fitted values. It is easier to look at a plot; cooks.distance is bounded from below by zero. It is related to a Mahalanobis distance, which takes into account not just the Euclidean distance but also the scaling (normalization) of the parameters and the correlation between them.
datcooks <- data.frame(cooksdist = cooks.distance(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(cooksdist >= .03))
ggplot(datcooks, aes(y=cooksdist, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(datcooks, specialPoint==TRUE),
aes(y=cooksdist,x=id,label=id), colour='black') +
geom_hline(yintercept = 0)+
geom_hline(yintercept = 0.03, color= 'blue') +
ylab('Cooks distance')
Evaluating the Leverage
summary(round(hatvalues(fit), 3))
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00600 0.01175 0.01700 0.02082 0.02600 0.06700
dathatv <- data.frame(hatv = hatvalues(fit)) %>%
mutate(id=row_number()) %>%
mutate(specialPoint= as.factor(hatv >= .06))
ggplot(dathatv, aes(y=hatv, x= id, color= specialPoint))+
geom_point(size=3) +
geom_text(data=subset(dathatv, specialPoint==TRUE),
aes(y=hatv,x=id,label=id), colour='black') +
geom_hline(yintercept = 0.06, color= 'blue') +
ylab('hatvalues')
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
Seatbelts <- as.data.frame(Seatbelts)
# Fit the linear model
model <- lm(DriversKilled ~ kms + PetrolPrice + law, data = Seatbelts)
summary(model)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.014614e+02 1.625587e+01 12.393145 3.601219e-26
kms -1.223318e-03 6.656567e-04 -1.837761 6.767594e-02
PetrolPrice -5.683347e+02 1.520552e+02 -3.737687 2.463128e-04
law -1.188920e+01 6.025785e+00 -1.973055 4.995497e-02
In this case we are interested in knowing whether or not the law was effective.
fit0 <- lm(DriversKilled~law, Seatbelts)
fit1 <- lm(DriversKilled~law + mkm, Seatbelts)
fit2 <- lm(DriversKilled~law + ppn, Seatbelts)
fit3 <- lm(DriversKilled~law + mkm + ppn, Seatbelts)
anova(fit0,fit1,fit2,fit3)
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm
Model 3: DriversKilled ~ law + ppn
Model 4: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 189 105608 1 4145.3 7.9276 0.005388 **
3 189 100069 0 5538.9
4 188 98303 1 1766.0 3.3774 0.067676 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(fit0,fit3)
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 188 98303 2 11450 10.949 3.177e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We are particularly interested in fit0 and fit3, because we want to know whether both of those parameters are necessary in addition to the law variable. fit0 is the base model with just the law variable; fit3 also includes kilometers and petrol price, so comparing them is a 2 degree of freedom test because two parameters were added. The test is highly significant, suggesting that we probably should include both of those parameters.
rbind(summary(fit0)$coef[2, ],
summary(fit1)$coef[2, ],
summary(fit2)$coef[2, ],
summary(fit3)$coef[2, ])
Estimate Std. Error t value Pr(>|t|)
[1,] -25.60895 5.341655 -4.794198 3.288375e-06
[2,] -17.55372 6.028888 -2.911602 4.028394e-03
[3,] -16.32618 5.555579 -2.938700 3.706585e-03
[4,] -11.88920 6.025785 -1.973055 4.995497e-02
The coefficient for the law variable changes depending on which other variables are included in the model. This happens because the predictors are correlated with one another, so they act as confounders and shift each other's estimated coefficients.
Specifically, the unadjusted estimate is about 25 fewer deaths per month. If we adjust for the variability in kilometers alone, that drops to about 17 fewer deaths; if we adjust for petrol price alone, it drops to about 16 fewer deaths per month; and if we adjust for both, it goes down to about 11 fewer deaths per month. This suggests that the initial estimate of the law's effect was inflated by the influence of other factors, and that the actual effect is smaller.
While the decrease in coefficient magnitude might raise concerns about the law’s effectiveness, the author argues that including these other variables is important for a comprehensive analysis. Since the additional variables are highly significant and logically relevant, the author concludes that model 4, which includes all variables, is the most appropriate choice.
FALSE.
GLMs transform the expected value of the outcome, not the observed outcome itself.
The transformation is applied through a link function, which connects the expected value to the linear predictor.
Transforming the expected value allows GLMs to handle non-normal response variables and model a wider range of relationships.
In that case we are not fitting a generalized linear model; we are simply fitting a linear model in which we have transformed the outcome.
In a GLM, by contrast, the link acts on the expected value: in Poisson regression \(\log(E[Y]) = \beta_0 + \beta_1 X\), where \(E[Y]\) is the expected value of Y;
in logistic regression \(\log\!\left(\frac{E[Y]}{1-E[Y]}\right) = \beta_0 + \beta_1 X\), where \(E[Y]/(1-E[Y])\) is the odds of Y.
Transforming the observed outcome directly changes the data itself, not the model. This can be done for various reasons, such as to improve normality or stabilize variance. It is not a defining characteristic of GLMs. Therefore, the statement “GLMs transform the observed outcome” is incorrect.
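A minimal sketch of the distinction in R, using simulated data (sim_x, sim_y, fit_lm and fit_glm are made up for illustration):
# Simulated count data
set.seed(1)
sim_x <- rnorm(100)
sim_y <- rpois(100, lambda = exp(1 + 0.3 * sim_x))
# Transforming the observed outcome: a linear model for E[log(Y + 1)]
fit_lm <- lm(log(sim_y + 1) ~ sim_x)
# A GLM with a log link: transforms the expected value, log(E[Y]), leaving Y untouched
fit_glm <- glm(sim_y ~ sim_x, family = poisson(link = "log"))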
TRUE.
In fact, all of the coefficients are interpreted on the link function scale.
Example: in Poisson regression, \(\log(E[Y_i]) = \beta_0 + \beta_1 X_i\).
So, when we get an estimate for \(\beta_1\) and see it in our coefficient table, it is interpreted on the scale of the log of the expected value of the outcome, i.e. on the link function scale.
If we were doing logistic regression it would be interpreted on the
scale of the logit of the expected value of the response.
If we subtract the link function evaluated at \(X = x + 1\) and at \(X = x\), we get \(\log(E[Y \mid X = x + 1]) - \log(E[Y \mid X = x]) = \beta_1\).
So, \(\beta_1\) is the change in the link function of the expected value per unit change in the regressor.
In the logistic regression case, because the logit is the natural log of the odds, we can invert it by exponentiating. Thus, \(e^{\beta_1}\) is interpreted as the ratio of the odds per unit change in the regressor.
FALSE.
The generalized linear model (GLM) does not strictly assume an exponential family for the outcome. The GLM framework allows for a broad range of probability distributions, including but not limited to the exponential family. The choice of distribution depends on the nature of the response variable.
In a linear model, the relationship between the response variable \(Y\) and predictors \(X_1\) and \(X_2\) is established through a direct equation, e.g. \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \epsilon\).
In contrast, a generalized linear model (GLM) does not form such a direct equation. It begins by assuming a distribution for \(Y\), which can be from a variety of distributions, not just the exponential family. Linear models are a special case of GLMs, as they assume the normal distribution.
An exponential family encompasses various distributions, including normal, Bernoulli, binomial, Poisson, and gamma. These distributions depend on parameters, with one of the most important being the mean (\(\mu\)).
In GLMs, the connection between the response variable \(Y\) and the predictors \(X_1\) and \(X_2\) is established through the exponential family and its parameters. One of the key parameters is the mean (\(\mu\)), and the relationship is mediated by a link function (\(g\)). For instance, the link function might be expressed as \(g(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\).
This indirect connection allows GLMs to handle a wide range of distributions and relationships between predictors and the response variable. For example, in Poisson regression, the natural log is often used as the link function: \(\log(\mu) = \beta_0 + \beta_1 X_1 + \beta_2 X_2\).
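In R the link is carried by the family object passed to glm(); a quick sketch showing the link and its inverse (fam is just an illustrative name):
# The family object exposes the link function and its inverse
fam <- poisson(link = "log")
fam$linkfun(10)        # log(10), the link applied to a mean
fam$linkinv(log(10))   # back-transforms to the mean scale, i.e. 10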
TRUE.
In a Generalized Linear Model (GLM), the parameter estimates are obtained by maximizing the likelihood function. The likelihood represents the probability of observing the given data given the model parameters. Maximizing the likelihood essentially seeks the parameter values that make the observed data most probable under the assumed statistical model.
Poisson Regression Example:
The response variable is assumed to follow a Poisson distribution, which is suitable for count data. The model considers the log of the expected value of \(Y_i\), denoted as log(\(\mu_i\)), where \(\mu_i\) represents the mean for each individual subject.
The model equation is given by \(\log(\mu_i) = \beta_0 + \beta_1 X_i\).
This implies \(\mu_i = e^{\beta_0 + \beta_1 X_i}\).
Likelihood Function for Poisson Distribution:
Assuming the \(Y_i\) follow independent Poisson distributions with means \(\mu_i\), the likelihood function takes the form \(\mathcal{L}(\beta_0, \beta_1) = \prod_{i} \frac{\mu_i^{Y_i} e^{-\mu_i}}{Y_i!}\), with \(\mu_i = e^{\beta_0 + \beta_1 X_i}\).
The likelihood depends on the parameters \(\beta_0\) and \(\beta_1\), given the observed data \(Y\).
Hence, the statement is true. Likelihood maximization is a core concept in the estimation of parameters in generalized linear models. It involves finding the parameter values that maximize the likelihood of observing the given data under the assumed statistical model.
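As a quick sketch with simulated data (sim_x, sim_y and fitpois below are made up for illustration), the log-likelihood reported by glm() matches the Poisson log-likelihood computed by hand from the fitted means:
set.seed(2)
sim_x <- rnorm(200)
sim_y <- rpois(200, lambda = exp(0.5 + 0.2 * sim_x))
fitpois <- glm(sim_y ~ sim_x, family = poisson)
logLik(fitpois)                              # log-likelihood at the maximized estimates
mu <- fitted(fitpois)                        # fitted means mu_i
sum(dpois(sim_y, lambda = mu, log = TRUE))   # the same value computed by hand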
TRUE.
Example: Poisson Model
Let's assume that \(Y_i\) is Poisson with \(\log(E[Y_i]) = \beta_0 + \beta_1 X_i\), so \(E[Y_i] = \mu_i\).
However, for the Poisson distribution \(\mathrm{Var}(Y_i) = E[Y_i] = \mu_i\),
since the mean and the variance of a Poisson distribution are always the same.
Hence, for the Poisson and Bernoulli distributions, and many other instances of generalized linear models (GLMs), there is an implied relationship between the mean and the variance.
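A small simulated sketch of this mean-variance relationship for the Poisson (y5 and y50 are illustrative names):
# For Poisson data the sample variance tracks the sample mean
set.seed(3)
y5  <- rpois(1e4, lambda = 5)
y50 <- rpois(1e4, lambda = 50)
c(mean(y5),  var(y5))    # both close to 5
c(mean(y50), var(y50))   # both close to 50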
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Create a new outcome variable for whether or not greater than 119 drivers were killed that month. Fit a logistic regression GLM with this variable as the outcome and kms, PetrolPrice and law as predictors. Interpret your parameters.
Seatbelts <- Seatbelts %>%
mutate(binkill = 1 * (DriversKilled > 119))
logRegSeatbelts <- glm(Seatbelts$binkill~Seatbelts$mkm + Seatbelts$ppn + Seatbelts$law, family = 'binomial')
summary(logRegSeatbelts)
Call:
glm(formula = Seatbelts$binkill ~ Seatbelts$mkm + Seatbelts$ppn +
Seatbelts$law, family = "binomial")
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.024314 0.160775 0.151 0.8798
Seatbelts$mkm -0.002938 0.059848 -0.049 0.9608
Seatbelts$ppn -0.416408 0.169734 -2.453 0.0142 *
Seatbelts$law -0.615522 0.577808 -1.065 0.2868
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 266.09 on 191 degrees of freedom
Residual deviance: 253.62 on 188 degrees of freedom
AIC: 261.62
Number of Fisher Scoring iterations: 4
round(summary(logRegSeatbelts)$coef,3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.024 0.161 0.151 0.880
Seatbelts$mkm -0.003 0.060 -0.049 0.961
Seatbelts$ppn -0.416 0.170 -2.453 0.014
Seatbelts$law -0.616 0.578 -1.065 0.287
round(exp(logRegSeatbelts$coeff),3)
(Intercept) Seatbelts$mkm Seatbelts$ppn Seatbelts$law
1.025 0.997 0.659 0.540
round(1-exp(logRegSeatbelts$coeff), 3)
(Intercept) Seatbelts$mkm Seatbelts$ppn Seatbelts$law
-0.025 0.003 0.341 0.460
If we exponentiate the coefficients we get \(1.025\) for the intercept, \(0.997\) for kms, \(0.659\) for the petrol price and \(0.540\) for law. The most important effect is that there was a 46% decrease in the odds of more than 119 drivers being killed in a month after the law was enacted, compared with before, holding the other variables constant.
Fit a binomial GLM with DriversKilled as the outcome and drivers as the total count, with kms, PetrolPrice and law as predictors, and interpret your results.
# cbind creates a two-column matrix of (successes, failures) -- not successes vs total
fitlgm <- glm(cbind(DriversKilled, drivers-DriversKilled)~mkm + ppn + law, family = 'binomial', data = Seatbelts)
summary(fitlgm)
Call:
glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~
mkm + ppn + law, family = "binomial", data = Seatbelts)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.536637 0.007399 -342.829 <2e-16 ***
mkm 0.003645 0.002733 1.334 0.182
ppn -0.007829 0.007479 -1.047 0.295
law 0.030785 0.026527 1.161 0.246
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.93 on 191 degrees of freedom
Residual deviance: 229.93 on 188 degrees of freedom
AIC: 1496
Number of Fisher Scoring iterations: 3
round(summary(fitlgm)$coef,3)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.537 0.007 -342.829 0.000
mkm 0.004 0.003 1.334 0.182
ppn -0.008 0.007 -1.047 0.295
law 0.031 0.027 1.161 0.246
The intercept, \(-2.537\), is the log-odds that a driver casualty is a death (rather than a serious injury) when mkm and ppn are at their means and the law is not in effect. For every one-unit increase in mkm, the log-odds of the outcome increase by 0.004; for every one-unit increase in ppn, the log-odds decrease by 0.008, holding the other predictors constant. When law changes from 0 to 1, the log-odds increase by 0.031, holding the other predictors constant.
In the provided output, none of the predictors (mkm, ppn, and law) have p-values less than 0.05, suggesting that, based on this model, none of them are statistically significant predictors of the outcome at the 5% significance level.
Use the anova function to compare models with just law, with law and PetrolPrice, and with all three predictors.
fit1 <- glm(binkill ~ law, family = 'binomial', data = Seatbelts)
fit2 <- update(fit1, binkill~law + PetrolPrice)
fit3 <- update(fit2, binkill~law + PetrolPrice + kms)
anova(fit1,fit2,fit3)
Analysis of Deviance Table
Model 1: binkill ~ law
Model 2: binkill ~ law + PetrolPrice
Model 3: binkill ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 260.40
2 189 253.62 1 6.7760
3 188 253.62 1 0.0024
The residual deviance is roughly analogous to the residual sums of squares we looked at with normal data. In this case it looks like the change is going to be significant for adding PetrolPrice and not significant for adding kms.
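To attach p-values to these deviance changes we can ask anova for a chi-squared test; a quick sketch:
# Chi-squared p-values for each deviance difference
anova(fit1, fit2, fit3, test = "Chisq")
Now let's look at the coefficient tables.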
summary(fit1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08288766 0.1539783 0.5383074 0.59036483
law -1.12434153 0.4991985 -2.2522935 0.02430373
summary(fit2)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.5875552 1.3851773 2.589961 0.009598679
law -0.6260282 0.5367204 -1.166396 0.243454555
PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
summary(fit3)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.612261e+00 1.473634e+00 2.45126028 0.01423570
law -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
kms -2.938343e-06 5.984816e-05 -0.04909663 0.96084229
Notice that the law variable on its own suggests a rather large decrease in the log odds of more than 119 drivers being killed in a month, comparing after with before the enactment of the law. However, that decrease attenuates quite a bit once petrol price is included, and petrol price itself is quite significant.
It is also interesting that the law variable goes from significant to not significant: its p-value increases by roughly a factor of 10. Including kilometers adds a variable that is not significant at all and does not change the law coefficient much.
Consider the Seatbelts data as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a Poisson regression GLM with UKDriversKilled as the outcome and kms, PetrolPrice and law as predictors. Interpret your results.
fit <- glm(DriversKilled ~ mkm + ppn + law, family = poisson, data = Seatbelts)
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
law -0.114877106 0.025557951 -4.494770 6.964526e-06
round(summary(fit)$coef[ ,1],4)
(Intercept) mkm ppn law
4.8198 -0.0100 -0.0554 -0.1149
To interpret the coefficients, we apply a loop that exponentiates each one, reporting the negative coefficients as a proportional decrease (1 - exp(coef)).
a <- data.frame(Intercept=0,mkm=0,ppn=0,law1=0)
for (i in 1:length(a)) {
a[i] <- if(summary(fit)$coef[ i,1] < 0)
{1-exp(summary(fit)$coef[ i,1])} else {exp(summary(fit)$coef[ i,1])}
}
print(a)
Intercept mkm ppn law1
1 123.9459 0.00993133 0.05385679 0.1085243
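The same result can be obtained without a loop; a vectorized sketch using ifelse():
# Exponentiate, expressing negative coefficients as proportional decreases
ifelse(coef(fit) < 0, 1 - exp(coef(fit)), exp(coef(fit)))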
law: if we exponentiate it, there is about an 11% decrease in the expected number of drivers killed, comparing after with before the enactment of the law.
mkm: there is roughly a 1% decrease in the expected number of drivers killed for every additional thousand kilometers driven.
The intercept: it is the log of the expected number of drivers killed at the average petrol price and the average kilometers driven, with the law variable at 0 (before the law was enacted); exponentiated, that is about 124 deaths per month.
Interpretation of the coefficients:
If you do not exponentiate them, they are all interpreted on the scale of the log of the expected value of the outcome.
If you do exponentiate them, they are on the relative scale for a one unit increase in the variable: for the binary law variable that is going from 0 to 1, for the kilometers variable (converted to thousand-kilometer units) it is an additional 1,000 kilometers driven, and for the normalized petrol price it is a one standard deviation increase.
round(exp(coef(lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts))), 5)
(Intercept) mkm ppn law
123.18577 0.99190 0.94787 0.87807
In exercise 1 the coefficients are estimates on the scale of the log of the expected count of driver deaths, with the other variables set to zero (their means, since they were centered).
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
law -0.114877106 0.025557951 -4.494770 6.964526e-06
fit2 <-lm(I(log(DriversKilled + 1)) ~ mkm + ppn + law, data = Seatbelts)
summary(fit2)$coef
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.813693546 0.014291860 336.813651 2.082663e-263
mkm -0.008133355 0.005281647 -1.539928 1.252593e-01
ppn -0.053538784 0.014689904 -3.644597 3.464468e-04
law -0.130030803 0.047811530 -2.719654 7.147529e-03
In this case we are modeling the expected value of the log of the drivers killed, whereas the Poisson GLM models the log of the expected value; the two are not interchangeable.
1-exp(-0.130030803)
[1] 0.1219316
We are seeing about a 12% decrease in the geometric mean of the number of driver deaths, comparing after the enactment of the law with before.
Poisson Model:
fit <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts, offset = log(drivers))
fit0 <- glm(DriversKilled ~ mkm + ppn +law, family = poisson, data = Seatbelts)
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm 0.003377675 0.002630717 1.283937 0.1991640
ppn -0.007255064 0.007199577 -1.007707 0.3135952
law 0.028484328 0.025512651 1.116479 0.2642173
summary(fit0)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
law -0.114877106 0.025557951 -4.494770 6.964526e-06
Notice these are very different model fits, because they have different interpretations. The first GLM (with the offset) models the log of the expected value of DriversKilled divided by drivers, the total number killed or seriously injured: \(\log\!\left(\frac{E[DriversKilled_i]}{drivers_i}\right)\).
In contrast, fit0 just models the log of the expected value of DriversKilled by itself: \(\log(E[DriversKilled_i])\).
Now let's exponentiate the law coefficients from the two models:
exp(0.028484328)
[1] 1.028894
1-exp(-0.114877106)
[1] 0.1085243
In the first case, considering the rate, we have about a 2.9% increase in the rate of DriversKilled relative to the total killed or seriously injured associated with the law enactment; in the second, we have about an 11% decrease in the expected number of drivers killed.
Applying a binomial model to the proportion
Another way we could have looked at this model, call it fitbis, is with a binomial model:
fitbis <- glm(cbind(DriversKilled, drivers - DriversKilled) ~ mkm + ppn +law,
family = binomial,
data = Seatbelts)
summary(fitbis)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.536637162 0.007399129 -342.829174 0.0000000
mkm 0.003644902 0.002732722 1.333799 0.1822698
ppn -0.007828905 0.007478975 -1.046789 0.2951971
law 0.030785128 0.026526920 1.160524 0.2458355
compare with the fit on the poisson rate model:
summary(fit)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm 0.003377675 0.002630717 1.283937 0.1991640
ppn -0.007255064 0.007199577 -1.007707 0.3135952
law 0.028484328 0.025512651 1.116479 0.2642173
This model also looks at drivers killed as a proportion of drivers killed or seriously injured. So fitbis should give estimates pretty similar to the offset fit, and indeed they are very similar. It is a slightly different model, though, because one is a binomial model and the other is a relative-rate Poisson model.
However, the Poisson approximates the binomial when the proportion is small, and the proportion of drivers killed relative to drivers killed or seriously injured is small.
So these two models approximate each other pretty well. The important distinction is that when you include the log offset you are modeling the proportion, not just the raw number of drivers killed, and this is a fairly stark example because some of the coefficients actually reverse their sign.
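A quick check of that smallness assumption (a sketch; output not shown here):
# Proportion of drivers killed among drivers killed or seriously injured
summary(Seatbelts$DriversKilled / Seatbelts$drivers)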
Use the anova function to compare models with just law, with law and PetrolPrice, and with all three predictors.
fit1 <- glm(DriversKilled ~ law, family = poisson, data = Seatbelts)
fit2 <- update(fit1, DriversKilled ~law + PetrolPrice )
fit3 <- update(fit2, DriversKilled ~law + PetrolPrice + kms)
anova(fit1, fit2, fit3)
Analysis of Deviance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + PetrolPrice
Model 3: DriversKilled ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 870.06
2 189 792.88 1 77.178
3 188 778.32 1 14.561
Going from model 1 to model 2 we get a deviance change of \(77.178\) on one degree of freedom. Remember that a chi-squared with one degree of freedom is a standard normal squared, so \(77.178\) is highly significant, and the same goes for \(14.561\).
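A quick sketch of the corresponding chi-squared p-values:
# p-values for the deviance changes, each on 1 degree of freedom
pchisq(c(77.178, 14.561), df = 1, lower.tail = FALSE)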
Now, let's look at the summary coefficient tables for each:
summary(fit1)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.8352482 0.006856395 705.21727 0.000000e+00
law -0.2274727 0.021923993 -10.37552 3.204779e-25
summary(fit2)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3499077 0.05886323 90.887084 0.000000e+00
law -0.1516301 0.02364153 -6.413720 1.420110e-10
PetrolPrice -5.0697421 0.57792672 -8.772292 1.750654e-18
summary(fit3)$coef
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.440656e+00 6.360114e-02 85.543371 0.000000e+00
law -1.148771e-01 2.555795e-02 -4.494770 6.964526e-06
PetrolPrice -4.546821e+00 5.948884e-01 -7.643150 2.119715e-14
kms -9.980975e-06 2.614002e-06 -3.818274 1.343887e-04
So the law coefficient (\(-0.2274727\)) is highly significant, but it gets attenuated to \(-0.1516301\) after the inclusion of PetrolPrice, and then after the further inclusion of kms (kilometers) it drops to \(-0.1148771\). The significance also drops, from \(10^{-25}\) to \(10^{-10}\) and then to \(10^{-6}\), but it remains very significant.