Call:
lm(formula = sheight ~ fheight, data = father.son)
Residuals:
Min 1Q Median 3Q Max
-8.8772 -1.5144 -0.0079 1.6285 8.9685
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 33.88660 1.83235 18.49 <2e-16 ***
fheight 0.51409 0.02705 19.01 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 2.437 on 1076 degrees of freedom
Multiple R-squared: 0.2513, Adjusted R-squared: 0.2506
F-statistic: 361.2 on 1 and 1076 DF, p-value: < 2.2e-16
Y = β_0 + β_1x + ε
H_0: β_1 = 0
H_a: β_1 ≠ 0
2.5 % 97.5 %
(Intercept) 30.2912126 37.4819961
fheight 0.4610188 0.5671673
2.5 % 97.5 %
(Intercept) 68.5384554 68.8296839
I(fheight - mean(fheight)) 0.4610188 0.5671673
fit lwr upr
1 68.68407 68.53846 68.82968
fit lwr upr
1 68.68407 63.90091 73.46723
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.09886054 1.6339210 18.421246 6.642736e-18
hp -0.06822828 0.0101193 -6.742389 1.787835e-07
The extremely low p-value for the slope (1.7878353 × 10^{-7}) is strong evidence against the null hypothesis β_1 = 0. The slope differs significantly from zero, so horsepower is a statistically significant predictor of miles per gallon.
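The test can be reproduced by hand. This is a minimal sketch, assuming the table above comes from regressing `mpg` on `hp` in the built-in `mtcars` data (the reported coefficients are consistent with that fit); the object names are illustrative.

```r
# Assumption: the coefficient table above is lm(mpg ~ hp) on mtcars.
fit <- lm(mpg ~ hp, data = mtcars)
coefs <- summary(fit)$coefficients

# t statistic for H_0: beta_1 = 0 is estimate / standard error
t_slope <- coefs["hp", "Estimate"] / coefs["hp", "Std. Error"]

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_slope <- 2 * pt(abs(t_slope), df = df.residual(fit), lower.tail = FALSE)

print(c(t = t_slope, p = p_slope))
```

The hand-computed p-value should agree with the `Pr(>|t|)` column that `summary()` reports.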
2.5 % 97.5 %
(Intercept) 26.76194879 33.4357723
hp -0.08889465 -0.0475619
2.5 % 97.5 %
(Intercept) 18.69599452 21.4852555
chp -0.08889465 -0.0475619
fit lwr upr
1 20.09062 18.69599 21.48526
fit lwr upr
1 20.09062 12.07908 28.10217
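The difference between the last two intervals is easier to see side by side. The sketch below assumes the same `mtcars` fit of `mpg` on `hp` and predicts at the mean horsepower; `new_car` and the other object names are hypothetical.

```r
# Assumption: the intervals above come from lm(mpg ~ hp) on mtcars.
fit <- lm(mpg ~ hp, data = mtcars)
new_car <- data.frame(hp = mean(mtcars$hp))

# Interval for the expected mpg at the mean hp (uncertainty in the line only)
ci_mean <- predict(fit, newdata = new_car, interval = "confidence")

# Interval for a single new car at the mean hp (adds residual variation)
pi_new <- predict(fit, newdata = new_car, interval = "prediction")

# Centering hp makes the intercept the expected mpg at the mean hp,
# so its confidence interval matches ci_mean
fit_c <- lm(mpg ~ I(hp - mean(hp)), data = mtcars)
ci_intercept <- confint(fit_c)[1, ]

print(rbind(confidence = ci_mean[1, ], prediction = pi_new[1, ]))
```

The prediction interval is much wider because it covers an individual observation rather than the mean response.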
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Estimate Std. Error t value Pr(>|t|)
(Intercept) 215.7461 14.6656 14.7110 0.0000
kms -0.0017 0.0006 -2.8469 0.0049
PetrolPrice -643.7895 148.2896 -4.3414 0.0000
The interpretation of the intercept is problematic when considering it as the expected number of drivers killed for zero kilometers driven and a petrol price of 0, which lacks practical significance. To enhance interpretation, we center and re-scale both the kilometer and petrol price variables, placing them on a more meaningful scale.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
1
122.8021
Hence, the intercept (122.8021) is the expected number of drivers killed per month at the mean of kms and PetrolPrice.
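The centering and rescaling step can be sketched as follows. The definitions of `mkm` and `ppn` are not shown in the text, so the ones below are assumptions chosen to be consistent with the reported coefficients: kilometers centered and expressed in thousands, petrol price centered and divided by its standard deviation.

```r
stblts <- as.data.frame(Seatbelts)

# Assumed definitions (hypothetical, inferred from the reported output)
stblts$mkm <- (stblts$kms - mean(stblts$kms)) / 1000
stblts$ppn <- (stblts$PetrolPrice - mean(stblts$PetrolPrice)) /
  sd(stblts$PetrolPrice)

fit_raw <- lm(DriversKilled ~ kms + PetrolPrice, data = stblts)
fit_c   <- lm(DriversKilled ~ mkm + ppn, data = stblts)

# Prediction from the raw model at the mean of both predictors
at_mean <- predict(fit_raw,
                   newdata = data.frame(kms = mean(stblts$kms),
                                        PetrolPrice = mean(stblts$PetrolPrice)))

print(c(intercept_centered = coef(fit_c)[["(Intercept)"]],
        raw_model_at_mean  = unname(at_mean)))
```

Centering only moves the intercept; rescaling only multiplies the slopes, so t values and p-values are unchanged.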
Call:
lm(formula = edk ~ epp - 1)
Residuals:
Min 1Q Median 3Q Max
-51.06 -17.77 -4.15 15.67 59.33
Coefficients:
Estimate Std. Error t value Pr(>|t|)
epp -643.8 147.5 -4.364 2.09e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.92 on 191 degrees of freedom
Multiple R-squared: 0.09068, Adjusted R-squared: 0.08592
F-statistic: 19.05 on 1 and 191 DF, p-value: 2.086e-05
Estimate Std. Error t value Pr(>|t|)
epp -643.7895 147.5111 -4.364345 2.085664e-05
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
pp -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
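Hence, we can see that the epp and pp rows have the same estimate (-643.7895). The standard errors differ slightly because the residual-on-residual fit uses 191 rather than 189 residual degrees of freedom.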
Estimate Std. Error t value Pr(>|t|)
ekms -0.0017 6e-04 -2.8619 0.0047
Thus, the coefficient we obtain for kms is the same as in question 1.
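The residual-on-residual comparison can be sketched like this. The names `edk` and `epp` follow the output above; the rest is an illustrative reconstruction rather than the original code.

```r
stblts <- as.data.frame(Seatbelts)
dk  <- stblts$DriversKilled
kms <- stblts$kms
pp  <- stblts$PetrolPrice

# Remove the linear effect of kms from the outcome and from petrol price
edk <- resid(lm(dk ~ kms))
epp <- resid(lm(pp ~ kms))

# Regression through the origin of residuals on residuals
b_resid <- coef(lm(edk ~ epp - 1))[["epp"]]

# Petrol price coefficient from the full two-predictor model
b_full <- coef(lm(dk ~ kms + pp))[["pp"]]

print(c(residual_fit = b_resid, full_fit = b_full))
```

Swapping the roles of `kms` and `pp` gives the corresponding check for the kms coefficient.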
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.8021 1.6629 73.8503 0.0000
mkm -1.7495 0.6145 -2.8469 0.0049
ppn -7.8387 1.8055 -4.3414 0.0000
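The intercept, the expected number of deaths at 0 kilometers and a PetrolPrice of 0, isn’t practically meaningful, so it’s recommended to center the variables. The effect of a single kilometer on deaths is statistically significant but tiny (about -0.0017 deaths per kilometer), which makes it hard to interpret. Rescaling the variable, for example to thousands of kilometers, is more appropriate.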
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.78966306 0.013426810 356.723817 2.737888e-269
mkm -0.01400794 0.004962149 -2.822959 5.267843e-03
ppn -0.06412578 0.014579039 -4.398492 1.818005e-05
[1] 0.06211298
[1] 0.01391029
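The interpretation of our normalized petrol price variable (ppn) is as follows: we expect about a 6% decrease in the geometric mean of driver fatalities for each 1 standard deviation increase in petrol price, holding kilometers constant. Likewise, the mkm coefficient corresponds to about a 1.4% decrease for each additional thousand kilometers driven, holding petrol price constant.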
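As a small arithmetic check, the percentages follow from the reported log-scale coefficients via 1 - exp(coefficient). The values below are copied from the table above; `b` and `pct_decrease` are illustrative names.

```r
# Reported coefficients from the fit of log(DriversKilled) on mkm and ppn
b <- c(ppn = -0.06412578, mkm = -0.01400794)

# A coefficient b on the log scale multiplies the geometric mean by exp(b),
# so the relative decrease per unit increase is 1 - exp(b)
pct_decrease <- 1 - exp(b)

print(round(100 * pct_decrease, 1))
```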
Call:
lm(formula = DriversKilled ~ mkm + ppn + law, data = stblts)
Residuals:
Min 1Q Median 3Q Max
-50.69 -17.29 -4.05 14.33 60.71
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.2263 1.8012 68.967 < 2e-16 ***
mkm -1.2233 0.6657 -1.838 0.067676 .
ppn -6.9199 1.8514 -3.738 0.000246 ***
law -11.8892 6.0258 -1.973 0.049955 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 22.87 on 188 degrees of freedom
Multiple R-squared: 0.201, Adjusted R-squared: 0.1882
F-statistic: 15.76 on 3 and 188 DF, p-value: 3.478e-09
To find the post-law intercept at the average petrol price and average kilometers driven, we subtract about 11.9 from the pre-law intercept (124.23 - 11.89 = 112.34). This implies that, after the law took effect, we anticipate about 12 fewer deaths per month. In other words, when the law variable changes from 0 to 1, we expect a decrease of about 12 deaths per month, holding petrol price and kilometers driven constant.
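The pre-law and post-law intercepts can be read off the fitted coefficients. This sketch assumes `mkm` is kilometers centered and expressed in thousands and `ppn` is petrol price centered and divided by its standard deviation; those definitions are inferred from the output, not shown in the text.

```r
stblts <- as.data.frame(Seatbelts)

# Assumed definitions (hypothetical, inferred from the reported output)
stblts$mkm <- (stblts$kms - mean(stblts$kms)) / 1000
stblts$ppn <- (stblts$PetrolPrice - mean(stblts$PetrolPrice)) /
  sd(stblts$PetrolPrice)

fit <- lm(DriversKilled ~ mkm + ppn + law, data = stblts)
b <- coef(fit)

# Expected monthly deaths at average mkm and ppn, before and after the law
pre_law  <- b[["(Intercept)"]]
post_law <- b[["(Intercept)"]] + b[["law"]]

print(c(pre_law = pre_law, post_law = post_law, change = b[["law"]]))
```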
Estimate Std. Error t value Pr(>|t|)
(Intercept) 124.226311 1.8012324 68.967399 1.976641e-135
mkm -1.223318 0.6656567 -1.837761 6.767594e-02
ppn -6.919949 1.8513987 -3.737687 2.463128e-04
I(factor(law))1 -11.889202 6.0257850 -1.973055 4.995497e-02
It’s worth noting that the estimates are identical to those of the previous fit: 124.226311 for the intercept, -1.223318 for mkm, -6.919949 for ppn, and -11.889202 for law. Coding law as a factor only makes the reference level explicit: level 0 (before the law) is the reference, so the value -11.889202 is the effect of factor level 1 relative to level 0.
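A quick way to confirm that numeric and factor coding of a 0/1 variable give the same fit is sketched below. It uses the raw `kms` and `PetrolPrice` columns, since the law coefficient does not depend on how the other predictors are centered or scaled.

```r
stblts <- as.data.frame(Seatbelts)

# law entered as a 0/1 numeric variable
fit_num <- lm(DriversKilled ~ kms + PetrolPrice + law, data = stblts)

# law entered as a factor; R uses the first level ("0") as the reference
fit_fac <- lm(DriversKilled ~ kms + PetrolPrice + factor(law), data = stblts)

print(cbind(numeric = unname(coef(fit_num)), factor = unname(coef(fit_fac))))
print(levels(factor(stblts$law)))
```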
1 2 3 4
6 96 71 19
Call:
lm(formula = DriversKilled ~ mkm + ppf + law, data = stblts)
Residuals:
Min 1Q Median 3Q Max
-53.384 -17.211 -3.421 14.849 65.613
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 109.8405 9.5066 11.554 <2e-16 ***
mkm -1.2991 0.7668 -1.694 0.0919 .
ppf2 10.8271 9.9462 1.089 0.2778
ppf3 18.6904 9.9374 1.881 0.0616 .
ppf4 25.0074 10.9163 2.291 0.0231 *
law -15.3445 6.0345 -2.543 0.0118 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 23.24 on 186 degrees of freedom
Multiple R-squared: 0.1833, Adjusted R-squared: 0.1614
F-statistic: 8.35 on 5 and 186 DF, p-value: 3.835e-07
#### Two levels model with interaction
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
Correlation between the kms and petrol price
[1] 0.3839004
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.7391997 70.60839 2.665611e-138
mkm -2.773787 0.5935049 -4.67357 5.596266e-06
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
In this scenario, the estimate is negative, aligning with expectations as both effects are in the same direction and the predictors are correlated. However, the mkm estimate has shifted from -2.773787 (unadjusted) to -1.749546 (adjusted for petrol price), highlighting the confounding impact of the PetrolPrice variable on the regression of DriversKilled on kilometers driven.
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.693656 72.507096 2.061333e-140
ppn -9.812019 1.698084 -5.778288 3.044208e-08
Estimate Std. Error t value Pr(>|t|)
(Intercept) 122.802083 1.6628507 73.850336 2.395106e-141
ppn -7.838674 1.8055491 -4.341435 2.304713e-05
mkm -1.749546 0.6145401 -2.846919 4.902428e-03
In this instance, the estimate remains negative, which is logical given that both effects align in the same direction and the predictors are correlated. However, the ppn estimate has shifted from -9.812019 (unadjusted) to -7.838674 (adjusted for kilometers driven), signaling the confounding impact of kilometers driven on the regression of DriversKilled on PetrolPrice.
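The adjusted and unadjusted estimates can be compared directly. This sketch assumes `mkm` is kilometers centered and in thousands and `ppn` is petrol price centered and divided by its standard deviation (definitions inferred from the output).

```r
stblts <- as.data.frame(Seatbelts)

# Assumed definitions (hypothetical, inferred from the reported output)
stblts$mkm <- (stblts$kms - mean(stblts$kms)) / 1000
stblts$ppn <- (stblts$PetrolPrice - mean(stblts$PetrolPrice)) /
  sd(stblts$PetrolPrice)

# Correlation between the two predictors (unaffected by centering or scaling)
r <- cor(stblts$kms, stblts$PetrolPrice)

# Each predictor alone, then both together
b_mkm_alone <- coef(lm(DriversKilled ~ mkm, data = stblts))[["mkm"]]
b_ppn_alone <- coef(lm(DriversKilled ~ ppn, data = stblts))[["ppn"]]
b_both      <- coef(lm(DriversKilled ~ mkm + ppn, data = stblts))

print(rbind(unadjusted = c(mkm = b_mkm_alone, ppn = b_ppn_alone),
            adjusted   = b_both[c("mkm", "ppn")]))
```

Because the predictors are positively correlated and both slopes are negative, each unadjusted slope absorbs part of the other variable's effect and is larger in magnitude.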
Load the dataset Seatbelts as part of the datasets package via data(Seatbelts). Use as.data.frame to convert the object to a dataframe. Fit a linear model of driver deaths with kms, PetrolPrice and law as predictors.
Refer to question 1. Directly estimate the residual variation via the function resid. Compare with R’s residual variance estimate.
[1] 522.8903
[1] 522.8903
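The two matching values can be obtained along these lines. This is a sketch; `s2_manual` and `s2_r` are illustrative names.

```r
stblts <- as.data.frame(Seatbelts)
fit <- lm(DriversKilled ~ kms + PetrolPrice + law, data = stblts)

n <- nrow(stblts)
p <- length(coef(fit))  # number of estimated coefficients, intercept included

# Direct estimate: residual sum of squares over residual degrees of freedom
s2_manual <- sum(resid(fit)^2) / (n - p)

# R's estimate: squared residual standard error
s2_r <- summary(fit)$sigma^2

print(c(manual = s2_manual, from_summary = s2_r))
```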
Using ggplot, we can build a data frame of the dffits values, flag those whose absolute value exceeds a threshold (here 0.4), and plot them, labelling the ids of the flagged observations (specialPoint).
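The flagging step can be sketched in base R as follows, leaving the plotting aside. The column name `specialPoint` follows the text; the model formula and the other names are assumptions.

```r
stblts <- as.data.frame(Seatbelts)
fit <- lm(DriversKilled ~ kms + PetrolPrice + law, data = stblts)

# One dffits value per observation, with an id for labelling
infl <- data.frame(id = seq_len(nrow(stblts)), dffits = dffits(fit))

# Flag observations whose |dffits| exceeds the chosen threshold
threshold <- 0.4
infl$specialPoint <- abs(infl$dffits) > threshold
flagged <- infl$id[infl$specialPoint]

print(infl[infl$specialPoint, ])
```

The `infl` data frame can then be passed to a plotting function, using `specialPoint` to decide which ids to label.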
(Intercept) mkm ppn law
1 -0.1022614 0.11881682 -0.09349927 -0.26195987
2 -0.1331656 0.22275746 -0.15807657 -0.56786007
3 -0.1279422 0.11452337 -0.07920934 -0.23464337
4 -0.2032275 0.13006726 -0.06413772 -0.23584650
5 -0.0524778 0.02353784 -0.01104844 -0.02756976
6 -0.1188004 0.03961875 -0.01108847 -0.02386593
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00600 0.01175 0.01700 0.02082 0.02600 0.06700
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.157461e+02 1.466559e+01 14.711047 3.772201e-33
kms -1.749546e-03 6.145401e-04 -2.846919 4.902428e-03
PetrolPrice -6.437895e+02 1.482896e+02 -4.341435 2.304713e-05
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm
Model 3: DriversKilled ~ law + ppn
Model 4: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 189 105608 1 4145.3 7.9276 0.005388 **
3 189 100069 0 5538.9
4 188 98303 1 1766.0 3.3774 0.067676 .
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + mkm + ppn
Res.Df RSS Df Sum of Sq F Pr(>F)
1 190 109754
2 188 98303 2 11450 10.949 3.177e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
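The two-model comparison above is a nested F test, which can be checked by hand. The sketch uses the raw `kms` and `PetrolPrice` columns, since centering and rescaling do not change the residual sums of squares; object names are illustrative.

```r
stblts <- as.data.frame(Seatbelts)

fit1 <- lm(DriversKilled ~ law, data = stblts)
fit4 <- lm(DriversKilled ~ law + kms + PetrolPrice, data = stblts)

tab <- anova(fit1, fit4)

# F = (drop in RSS / number of added terms) / (RSS of larger model / its df)
rss1 <- deviance(fit1)
rss4 <- deviance(fit4)
added <- df.residual(fit1) - df.residual(fit4)
F_manual <- ((rss1 - rss4) / added) / (rss4 / df.residual(fit4))

print(c(F_from_anova = tab$F[2], F_manual = F_manual))
```

Comparing only nested models avoids the row with 0 degrees of freedom that appears when two non-nested models are listed next to each other.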
Estimate Std. Error t value Pr(>|t|)
[1,] -25.60895 5.341655 -4.794198 3.288375e-06
[2,] -17.55372 6.028888 -2.911602 4.028394e-03
[3,] -16.32618 5.555579 -2.938700 3.706585e-03
[4,] -11.88920 6.025785 -1.973055 4.995497e-02
If we transform the observed outcome, \(log(Y)=β_0+β_1X+ϵ\), all we’re doing is fitting a linear model to the transformed outcome (transform the data, then fit the model). In a GLM, \(log(E[Y])=β_0+β_1X\), we are instead transforming the parameter (the mean), not the data. Hence, it’s false.
All coefficients are interpreted on the link function scale. To get back to the natural scale, you have to invert the link function (in the case of Poisson regression, exponentiate): \(log(E[Y|X=x])=β_0+β_1X\). So an estimate of \(β_1\) in the coefficient table is interpreted on the scale of the log of the expected value of the outcome, that is, on the link function scale. Hence, true.
True. GLMs start by assuming a distribution for y (normal for the linear model, Bernoulli/binomial, Poisson, gamma). We then connect a parameter of that distribution to the Xs through a link function. We do this so that we can keep modelling y on its natural scale while respecting its constraints (Bernoulli outcomes are 0 or 1; Poisson outcomes are non-negative counts with a strictly positive mean). One of the key parameters is the mean \((μ)\), and the relationship is mediated by a link function \((g)\). For instance, the link function might be expressed as \(g(μ)=β_0+β_1X_1+β_2X_2\). This indirect connection allows GLMs to handle a wide range of distributions and relationships between predictors and the response variable. For example, in Poisson regression, the natural log is often used as the link function: \(log(μ)=β_0+β_1X_1+β_2X_2\)
True or false, GLM estimates are obtained by maximizing the likelihood. (Discuss.) True. Suppose we have a Poisson regression, \(log(E[Y_i])=log(μ_i)=β_0+β_1X_i\). This tells us that \(μ_i=e^{β_0+β_1X_i}\). Assuming the \(Y_i\) are independent Poisson with mean \(μ_i\), each observation has probability mass function \(μ_i^{y_i}e^{−μ_i}/y_i!\), so the likelihood takes the form \(L(β_0,β_1|Y)=∏_i μ_i^{y_i}e^{−μ_i}/y_i!\). The likelihood depends on the parameters \(β_0\) and \(β_1\) through \(μ_i\), given the observed data Y, and the GLM estimates are the values of \(β_0\) and \(β_1\) that maximize it.
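The claim can be checked numerically by maximizing a Poisson log-likelihood directly and comparing with `glm`. This is a sketch under stated assumptions: it uses driver deaths regressed on `law` from the Seatbelts data as an example, and `negll`, the starting values, and the choice of `optim` with BFGS are all illustrative (`glm` itself uses Fisher scoring).

```r
stblts <- as.data.frame(Seatbelts)
y <- stblts$DriversKilled
x <- stblts$law

# Negative Poisson log-likelihood with log link: mu_i = exp(b0 + b1 * x_i)
negll <- function(b) {
  mu <- exp(b[1] + b[2] * x)
  -sum(dpois(y, lambda = mu, log = TRUE))
}

# Generic numerical optimizer, started at the intercept-only solution
opt <- optim(c(log(mean(y)), 0), negll, method = "BFGS")

# Same model fitted by glm
glm_fit <- glm(y ~ x, family = poisson)

print(rbind(optim = opt$par, glm = unname(coef(glm_fit))))
```

The two sets of estimates should agree up to the optimizer's tolerance.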
True or false, some GLM distributions impose restrictions on the relationship between the mean and the variance. (Discuss.)
True. There is often an implied relationship between the mean and the variance. For the Poisson distribution, the mean and the variance are the same.
Poisson Model
\(log(E[Y_i])=log(μ_i)=β_0+β_1X_i\)
Assume that,
\(Y_i∼Poisson(μ_i)⟹E[Y_i]=μ_i\)
However, for the Poisson distribution
\(var(Y_i)=E[Y_i]=μ_i\)
Thus, for the Poisson, Bernoulli distribution and many other instances of generalized linear models (GLM) there is an implied relationship between the mean and the variance.
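A small simulation illustrates the implied mean-variance relationship. The sample size, seed, and parameter values are arbitrary choices for illustration.

```r
set.seed(42)
n <- 1e5

# Poisson: variance equals the mean
y_pois <- rpois(n, lambda = 4)

# Bernoulli: variance equals p * (1 - p), so it is determined by the mean p
p <- 0.3
y_bern <- rbinom(n, size = 1, prob = p)

print(c(pois_mean = mean(y_pois), pois_var = var(y_pois)))
print(c(bern_mean = mean(y_bern), bern_var = var(y_bern),
        p_times_1_minus_p = p * (1 - p)))
```

With real count data, a sample variance well above the sample mean suggests that the Poisson variance assumption is too restrictive.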
FALSE TRUE
98 94
0 1
98 94
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.024313512 0.16077499 0.15122695 0.87979669
ppn -0.416407701 0.16973435 -2.45329074 0.01415559
mkm -0.002938343 0.05984816 -0.04909663 0.96084229
law -0.615522450 0.57780755 -1.06527242 0.28675267
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.0243 0.1608 0.1512 0.8798
ppn -0.4164 0.1697 -2.4533 0.0142
mkm -0.0029 0.0598 -0.0491 0.9608
law -0.6155 0.5778 -1.0653 0.2868
After the law came into effect, the log-odds of having more than 119 drivers killed in a month decreased by 0.6155, holding the other variables constant. Additionally, the coefficient of -0.0029 on the logit scale implies that, for every additional thousand kilometers driven, we estimate a reduction of 0.0029 in the log-odds of having more than 119 drivers killed in that month.
[1] 0.5403585
[1] 0.4596415
The odds ratio comparing after the law was enacted to before is 0.54, indicating a 46% decrease in the odds of having more than 119 drivers killed in a month after the law, while holding other variables constant.
[1] 0.997066
[1] 0.00293403
This implies a roughly 0.3% reduction in the odds of more than 119 drivers being killed in a month for each additional thousand kilometers driven in that month.
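The odds ratios quoted above come from exponentiating the logit-scale coefficients. The sketch below starts from the reported estimates rather than refitting the model; `b` and the other names are illustrative.

```r
# Reported logit-scale coefficients from the logistic fit above
b <- c(law = -0.615522450, mkm = -0.002938343)

# Exponentiating a log-odds coefficient gives an odds ratio
odds_ratio <- exp(b)

# Relative decrease in the odds per unit increase in the predictor
pct_decrease <- 1 - odds_ratio

print(round(rbind(odds_ratio, pct_decrease), 4))
```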
Call:
glm(formula = cbind(DriversKilled, drivers - DriversKilled) ~
ppn + mkm + law, family = binomial, data = stblts)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.4371 -0.7270 -0.0235 0.7111 3.0313
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.536637 0.007399 -342.829 <2e-16 ***
ppn -0.007829 0.007479 -1.047 0.295
mkm 0.003645 0.002733 1.334 0.182
law 0.030785 0.026527 1.161 0.246
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 234.93 on 191 degrees of freedom
Residual deviance: 229.93 on 188 degrees of freedom
AIC: 1496
Number of Fisher Scoring iterations: 3
Analysis of Deviance Table
Model 1: dkb ~ law
Model 2: dkb ~ law + PetrolPrice
Model 3: dkb ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 260.40
2 189 253.62 1 6.7760
3 188 253.62 1 0.0024
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.08288766 0.1539783 0.5383074 0.59036483
law -1.12434153 0.4991985 -2.2522935 0.02430373
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.5875552 1.3851773 2.589961 0.009598679
law -0.6260282 0.5367204 -1.166396 0.243454555
PetrolPrice -34.3737200 13.4887754 -2.548320 0.010824304
Estimate Std. Error z value Pr(>|z|)
(Intercept) 3.612261e+00 1.473634e+00 2.45126028 0.01423570
law -6.155225e-01 5.778076e-01 -1.06527242 0.28675267
PetrolPrice -3.419952e+01 1.394026e+01 -2.45329074 0.01415559
kms -2.938343e-06 5.984816e-05 -0.04909663 0.96084229
The analysis shows that, with law as the only predictor, its log-odds coefficient is substantially negative (-1.12) and significant (p ≈ 0.024). However, the effect shrinks considerably (to -0.63) when petrol price is included, and petrol price itself is significant. Notably, law loses its significance: its p-value increases by a factor of 10, to about 0.24. Adding kilometers, which is not significant (p ≈ 0.96), has minimal effect on the law coefficient. For model selection, the second model, with petrol price but without kilometers, might be preferred. This highlights how variable selection influences the significance and interpretation of coefficients when modeling the odds of fatalities.
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
law -0.114877106 0.025557951 -4.494770 6.964526e-06
(Intercept) ppn mkm law
4.8198 -0.0554 -0.0100 -0.1149
[1] 0.8914757
[1] 0.1085243
The law decreased the expected number of drivers killed by about 11%, holding petrol price and kilometers driven constant.
[1] 0.9900687
[1] 0.00993133
Each additional thousand kilometers driven is associated with an approximate 1% reduction in the expected number of drivers killed. The intercept represents the expected number of drivers killed on the logarithmic scale, at the mean petrol price and mean kilometers driven before the law; when exponentiated, it gives the expected count itself.
[1] 123.9459
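These quantities can be reproduced by exponentiating the Poisson coefficients. The sketch assumes `mkm` is kilometers centered and in thousands and `ppn` is petrol price centered and divided by its standard deviation (definitions inferred from the output, not shown in the text).

```r
stblts <- as.data.frame(Seatbelts)

# Assumed definitions (hypothetical, inferred from the reported output)
stblts$mkm <- (stblts$kms - mean(stblts$kms)) / 1000
stblts$ppn <- (stblts$PetrolPrice - mean(stblts$PetrolPrice)) /
  sd(stblts$PetrolPrice)

fit <- glm(DriversKilled ~ ppn + mkm + law, family = poisson, data = stblts)
b <- coef(fit)

# Rate ratios: multiplicative change in the expected count per unit increase
rate_ratio <- exp(b[c("ppn", "mkm", "law")])

# Expected count at average ppn and mkm before the law
baseline <- exp(b[["(Intercept)"]])

print(round(c(rate_ratio, baseline = baseline), 4))
```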
Call:
glm(formula = DriversKilled ~ ppn + mkm + law, family = poisson,
data = stblts)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.7909 -1.6247 -0.3526 1.2900 4.8720
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845 0.007127 676.243 < 2e-16 ***
ppn -0.055361 0.007243 -7.643 2.12e-14 ***
mkm -0.009981 0.002614 -3.818 0.000134 ***
law -0.114877 0.025558 -4.495 6.96e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for poisson family taken to be 1)
Null deviance: 984.50 on 191 degrees of freedom
Residual deviance: 778.32 on 188 degrees of freedom
AIC: 2059.1
Number of Fisher Scoring iterations: 4
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
law -0.114877106 0.025557951 -4.494770 6.964526e-06
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.805409776 0.014411832 333.435035 1.381656e-262
ppn -0.053968063 0.014813218 -3.643237 3.481611e-04
mkm -0.008189793 0.005325983 -1.537705 1.258020e-01
law -0.131450856 0.048212881 -2.726468 7.007200e-03
[1] 0.1231776
Hence, this is a 12% decrease in the geometric mean number of driver deaths after enactment of the law, relative to before the law was enacted, holding petrol price and kilometers driven constant.
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.612798146 0.007122545 -366.834931 0.0000000
mkm 0.003377675 0.002630717 1.283937 0.1991640
ppn -0.007255064 0.007199577 -1.007707 0.3135952
law 0.028484328 0.025512651 1.116479 0.2642173
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.819845137 0.007127388 676.242884 0.000000e+00
mkm -0.009980975 0.002614002 -3.818274 1.343887e-04
ppn -0.055361338 0.007243262 -7.643150 2.119715e-14
law -0.114877106 0.025557951 -4.494770 6.964526e-06
Analysis of Deviance Table
Model 1: DriversKilled ~ law
Model 2: DriversKilled ~ law + PetrolPrice
Model 3: DriversKilled ~ law + PetrolPrice + kms
Resid. Df Resid. Dev Df Deviance
1 190 870.06
2 189 792.88 1 77.178
3 188 778.32 1 14.561
Estimate Std. Error z value Pr(>|z|)
(Intercept) 4.8352482 0.006856395 705.21727 0.000000e+00
law -0.2274727 0.021923993 -10.37552 3.204779e-25
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.3499077 0.05886323 90.887084 0.000000e+00
law -0.1516301 0.02364153 -6.413720 1.420110e-10
PetrolPrice -5.0697421 0.57792672 -8.772292 1.750654e-18
Estimate Std. Error z value Pr(>|z|)
(Intercept) 5.440656e+00 6.360114e-02 85.543371 0.000000e+00
law -1.148771e-01 2.555795e-02 -4.494770 6.964526e-06
PetrolPrice -4.546821e+00 5.948884e-01 -7.643150 2.119715e-14
kms -9.980975e-06 2.614002e-06 -3.818274 1.343887e-04