In this blog post, we compare the diagonal elements of the hat matrix used in multiple linear regression and in logistic regression. Recall that computing the standardized residuals of a multiple linear regression requires dividing by a function of the leverage numbers \(h_{ii}\), the diagonal elements of the hat matrix \(\mathbf{H}\). We will show theoretically and empirically that these leverage numbers behave differently in the two types of regression. We illustrate with a familiar numerical dataset from Chapter 8 of the Sheather text “A Modern Approach to Regression with R”.
Let us consider the MichelinFood dataset used in Chapter 8 of Sheather.
We load this dataset into memory and display its contents.
library(knitr); library(kableExtra)   # kable() and kable_styling()
library(dplyr); library(ggplot2)      # %>% pipe and the bar charts below
michelin = read.table("MichelinFood.txt", header = TRUE)
michelin %>% kable() %>% kable_styling(bootstrap_options = c("bordered", "striped"))
| Food | InMichelin | NotInMichelin | mi | proportion |
|---|---|---|---|---|
| 15 | 0 | 1 | 1 | 0.00 |
| 16 | 0 | 1 | 1 | 0.00 |
| 17 | 0 | 8 | 8 | 0.00 |
| 18 | 2 | 13 | 15 | 0.13 |
| 19 | 5 | 13 | 18 | 0.28 |
| 20 | 8 | 25 | 33 | 0.24 |
| 21 | 15 | 11 | 26 | 0.58 |
| 22 | 4 | 8 | 12 | 0.33 |
| 23 | 12 | 6 | 18 | 0.67 |
| 24 | 6 | 1 | 7 | 0.86 |
| 25 | 11 | 1 | 12 | 0.92 |
| 26 | 1 | 1 | 2 | 0.50 |
| 27 | 6 | 1 | 7 | 0.86 |
| 28 | 4 | 0 | 4 | 1.00 |
We will use the same dataset for both the linear and logistic regressions.
The hat matrix \(\mathbf{H}\), also known as the projection or influence matrix, enjoys several key properties in linear regression.
\[\mathbf{H} = \mathbf{ X ( X^TX)^{-1}X^T }\] as stated in Equation 6.5 of Sheather (p. 153), and
\[\mathbf{HY} = \mathbf{ \hat{Y}}\]
Moreover, the standardized residual \(r_i\) is given by: \[ r_i = \frac{\hat{e}_{i} }{s \sqrt{1-h_{ii}}}\] where \(\hat{e}_{i}\) is the \(i\)-th residual and \(s\) is the estimated residual standard deviation.
The key observation is that the hat matrix is independent of the response variable \(Y\). Therefore, the leverage numbers \(h_{ii}\) are unaffected by changes in \(Y\).
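Here is a minimal sketch, assuming michelin is loaded as above, that builds \(\mathbf{H}\) directly from the design matrix and reads off its diagonal; note that the response never enters the computation.
X = cbind(1, michelin$Food)            # design matrix with an intercept column
H = X %*% solve(t(X) %*% X) %*% t(X)   # H = X (X'X)^{-1} X'
round(diag(H), 4)                      # leverage numbers h_ii; Y appears nowhere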
The hat matrix is defined differently under logistic regression. Sheather does not treat the hat matrix for logistic regression in Chapter 8. In fact, I used German Rodriguez’s statistics course notes [https://data.princeton.edu/wws509/notes/] to find the formula. In Chapter 3, he notes that the weighted hat matrix was derived by Pregibon in 1981:
\[\mathbf{H} = \mathbf{W^{1/2}X(X^{T}WX)^{-1} X^{T}W^{1/2}} \] where \(\mathbf{W}\) is the diagonal matrix of iteration weights used in iteratively re-weighted least squares of the form:
\[w_{ii} = \frac{ \hat{\mu}_i(n_i - \hat{\mu}_i)}{n_i}\]
where \(n_i\) are the binomial weights of each cohort \(i\).
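As a sanity check, here is a minimal sketch that verifies Pregibon’s formula against the hat values returned by R’s influence(). The fit below is only illustrative (the same model is refit as logmod1 later), and weights(fit, type = "working") returns the final IRLS weights \(w_{ii}\).
fit = glm(cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, data = michelin)
Xg = model.matrix(fit)                        # design matrix
Wg = diag(weights(fit, type = "working"))     # diagonal matrix of IRLS weights w_ii
Hg = sqrt(Wg) %*% Xg %*% solve(t(Xg) %*% Wg %*% Xg) %*% t(Xg) %*% sqrt(Wg)
all.equal(as.numeric(diag(Hg)), as.numeric(influence(fit)$hat))  # should be TRUE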
The counterparts of the standardized residuals in logistic regression are the standardized deviance residuals.
The deviance residuals are defined as:
\[d_i = \sqrt{ 2\left[ y_i \log\!\left( \frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i) \log\!\left(\frac{n_i - y_i}{n_i-\hat{\mu}_i} \right) \right] } \cdot \operatorname{sign}(y_i - \hat{\mu}_i)\]
The standardized deviance residual is then:
\[sd_{i} = \frac{d_{i}}{\sqrt{1 - h_{ii} } }\] The key observation is that \(h_{ii}\) now depends on the values of \(Y\), because the weights in \(\mathbf{W}\) are functions of the fitted values \(\hat{\mu}_i\).
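Both quantities are available directly in R. Continuing with the illustrative fit from the sketch above, a quick check that rstandard() agrees with this formula (the binomial dispersion is fixed at 1, so no extra scaling is needed):
d = residuals(fit, type = "deviance")    # deviance residuals d_i
h = influence(fit)$hat                   # leverage numbers h_ii
all.equal(as.numeric(d / sqrt(1 - h)), as.numeric(rstandard(fit)))  # should be TRUE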
Using the Michelin dataset above, we will show that the leverage numbers \(h_{ii}\) do not change in linear regression when the response changes.
To do this, we will calculate the leverage numbers for two versions of a univariate linear model. The explanatory variable in both cases will be the Food predictor in michelin, but we will use two different response variables, \(Y\) and \(Z\), each representing the proportion; they differ only at the last observation (index 14).
Y = michelin$proportion
Z = Y
Z[14] = 0.2 # The index where the response will differ.
linear_dataset = data.frame(X = michelin$Food, Y=Y, Z=Z)
linear_dataset %>% kable() %>% kable_styling(bootstrap_options = c("striped", "bordered"))
| X | Y | Z |
|---|---|---|
| 15 | 0.00 | 0.00 |
| 16 | 0.00 | 0.00 |
| 17 | 0.00 | 0.00 |
| 18 | 0.13 | 0.13 |
| 19 | 0.28 | 0.28 |
| 20 | 0.24 | 0.24 |
| 21 | 0.58 | 0.58 |
| 22 | 0.33 | 0.33 |
| 23 | 0.67 | 0.67 |
| 24 | 0.86 | 0.86 |
| 25 | 0.92 | 0.92 |
| 26 | 0.50 | 0.50 |
| 27 | 0.86 | 0.86 |
| 28 | 1.00 | 0.20 |
linmod1 = lm( Y ~ X, data=linear_dataset)
linmod2 = lm( Z ~ X, data=linear_dataset)
hatlin1 = influence(linmod1)$hat
hatlin2 = influence(linmod2)$hat
dflin = data.frame(Model=c( rep("Old", length(hatlin1) ), rep("New", length(hatlin2) ) ),
HValues = c(hatlin1, hatlin2) ,
Index = c(1:length(hatlin1), 1:length(hatlin2 ) )
)
Now we examine the differences between the two fits. The table below lists the leverage numbers for both models, and the model summaries that follow show clearly that the two models’ regression coefficients are different.
dflin %>% kable()
| Model | HValues | Index |
|---|---|---|
| Old | 0.2571429 | 1 |
| Old | 0.2043956 | 2 |
| Old | 0.1604396 | 3 |
| Old | 0.1252747 | 4 |
| Old | 0.0989011 | 5 |
| Old | 0.0813187 | 6 |
| Old | 0.0725275 | 7 |
| Old | 0.0725275 | 8 |
| Old | 0.0813187 | 9 |
| Old | 0.0989011 | 10 |
| Old | 0.1252747 | 11 |
| Old | 0.1604396 | 12 |
| Old | 0.2043956 | 13 |
| Old | 0.2571429 | 14 |
| New | 0.2571429 | 1 |
| New | 0.2043956 | 2 |
| New | 0.1604396 | 3 |
| New | 0.1252747 | 4 |
| New | 0.0989011 | 5 |
| New | 0.0813187 | 6 |
| New | 0.0725275 | 7 |
| New | 0.0725275 | 8 |
| New | 0.0813187 | 9 |
| New | 0.0989011 | 10 |
| New | 0.1252747 | 11 |
| New | 0.1604396 | 12 |
| New | 0.2043956 | 13 |
| New | 0.2571429 | 14 |
summary(linmod1)
##
## Call:
## lm(formula = Y ~ X, data = linear_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.315297 -0.082220 0.004967 0.087533 0.204835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.26642 0.21590 -5.866 7.65e-05 ***
## X 0.08007 0.00987 8.112 3.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1489 on 12 degrees of freedom
## Multiple R-squared: 0.8458, Adjusted R-squared: 0.8329
## F-statistic: 65.81 on 1 and 12 DF, p-value: 3.26e-06
summary(linmod2)
##
## Call:
## lm(formula = Z ~ X, data = linear_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56971 -0.09315 -0.04681 0.17662 0.32191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.83213 0.35079 -2.372 0.03526 *
## X 0.05721 0.01604 3.567 0.00387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2419 on 12 degrees of freedom
## Multiple R-squared: 0.5147, Adjusted R-squared: 0.4743
## F-statistic: 12.73 on 1 and 12 DF, p-value: 0.003871
However, their leverage numbers are the same.
ggplot(data=dflin, aes(x=Index, y=HValues, fill=Model)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Leverage Hat Values for Linear Regression when Y-values are changed")
Let’s repeat the same numerical experiment using logistic regression. We will change the response variable at one subpopulation cohort (index = 3) and recalculate the logistic model and leverage numbers.
First we create the two sets of response variables.
responselog_orig = cbind(michelin$InMichelin, michelin$NotInMichelin)
responselog_new = responselog_orig
responselog_orig[3,]
## [1] 0 8
responselog_new[3,1] = 1
responselog_new[3,2] = 7
Second, we build the logistic regression models using the two different response variables and show that the fitted coefficients are different.
logmod1 = glm(responselog_orig ~ michelin$Food, family=binomial)
logmod2 = glm(responselog_new ~ michelin$Food, family=binomial)
summary(logmod1)
##
## Call:
## glm(formula = responselog_orig ~ michelin$Food, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4850 -0.7987 -0.1679 0.5913 1.5889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.84154 1.86236 -5.821 5.84e-09 ***
## michelin$Food 0.50124 0.08768 5.717 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.427 on 13 degrees of freedom
## Residual deviance: 11.368 on 12 degrees of freedom
## AIC: 41.491
##
## Number of Fisher Scoring iterations: 4
summary(logmod2)
##
## Call:
## glm(formula = responselog_new ~ michelin$Food, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.49670 -0.57159 -0.09218 0.58672 1.49398
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.16227 1.79287 -5.668 1.44e-08 ***
## michelin$Food 0.47067 0.08445 5.574 2.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.7656 on 13 degrees of freedom
## Residual deviance: 9.8196 on 12 degrees of freedom
## AIC: 41.812
##
## Number of Fisher Scoring iterations: 4
Third, we extract the leverage numbers calculated using the weighted hat matrix formula described above.
hvalues_1 = influence(logmod1)$hat
hvalues_2 = influence(logmod2)$hat
df = data.frame(Model=c( rep("Old", length(hvalues_1) ), rep("New", length(hvalues_2) ) ),
HValues = c(hvalues_1, hvalues_2) ,
Index = c(1:length(hvalues_1), 1:length(hvalues_2 ) )
)
df %>% kable()
| Model | HValues | Index |
|---|---|---|
| Old | 0.0108633 | 1 |
| Old | 0.0125533 | 2 |
| Old | 0.1078450 | 3 |
| Old | 0.1973865 | 4 |
| Old | 0.2070604 | 5 |
| Old | 0.3078445 | 6 |
| Old | 0.2161914 | 7 |
| Old | 0.1179974 | 8 |
| Old | 0.2429990 | 9 |
| Old | 0.1214966 | 10 |
| Old | 0.2346899 | 11 |
| Old | 0.0390368 | 12 |
| Old | 0.1240448 | 13 |
| Old | 0.0599911 | 14 |
| New | 0.0123574 | 1 |
| New | 0.0137785 | 2 |
| New | 0.1141996 | 3 |
| New | 0.2020195 | 4 |
| New | 0.2060435 | 5 |
| New | 0.3013811 | 6 |
| New | 0.2105695 | 7 |
| New | 0.1141933 | 8 |
| New | 0.2345765 | 9 |
| New | 0.1186012 | 10 |
| New | 0.2343842 | 11 |
| New | 0.0401551 | 12 |
| New | 0.1318210 | 13 |
| New | 0.0659196 | 14 |
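Before plotting, a quick numerical check using the hvalues_1 and hvalues_2 vectors computed above confirms that the two sets of leverage numbers are no longer equal:
max(abs(hvalues_1 - hvalues_2))  # strictly positive: the leverage numbers differ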
Plotting the leverage numbers side-by-side for both models, we see that they are different.
ggplot(data=df, aes(x=Index, y=HValues, fill=Model)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Leverage Hat Values for Logistic Regression when Y-values are changed")
We have shown a subtle but important difference between the linear and the logistic regression models: the leverage numbers of the hat matrix do not behave the same way in the two. Linear regression leverages depend only on the predictor variables, whereas logistic regression leverages depend on both the predictors and the response. This is apparent in the mathematical formulas and is confirmed by a simple numerical experiment.