In this blog post, we compare the diagonal elements of the hat matrix used in multiple linear regression and in logistic regression. Recall that computing the standardized residuals of a multiple linear regression requires dividing by a function of the leverage numbers \(h_{ii}\), the diagonal elements of the hat matrix \(\mathbf{H}\). We will show theoretically and empirically that these leverage numbers behave differently in the two types of regression. We illustrate with a familiar numerical dataset from Chapter 8 of the Sheather text “A Modern Approach to Regression with R”.
Let us consider the MichelinFood dataset used in Chapter 8 of Sheather.
We load this dataset into memory and display its contents.
library(knitr); library(kableExtra)   # kable() and kable_styling()
library(dplyr); library(ggplot2)      # %>% pipe and the bar charts below
michelin = read.table("MichelinFood.txt", header = TRUE)
michelin %>% kable() %>% kable_styling(bootstrap_options = c("bordered", "striped"))
| Food | InMichelin | NotInMichelin | mi | proportion |
|---|---|---|---|---|
| 15 | 0 | 1 | 1 | 0.00 |
| 16 | 0 | 1 | 1 | 0.00 |
| 17 | 0 | 8 | 8 | 0.00 |
| 18 | 2 | 13 | 15 | 0.13 |
| 19 | 5 | 13 | 18 | 0.28 |
| 20 | 8 | 25 | 33 | 0.24 |
| 21 | 15 | 11 | 26 | 0.58 |
| 22 | 4 | 8 | 12 | 0.33 |
| 23 | 12 | 6 | 18 | 0.67 |
| 24 | 6 | 1 | 7 | 0.86 |
| 25 | 11 | 1 | 12 | 0.92 |
| 26 | 1 | 1 | 2 | 0.50 |
| 27 | 6 | 1 | 7 | 0.86 |
| 28 | 4 | 0 | 4 | 1.00 |
We will use the same dataset for both the linear and logistic regressions.
The hat matrix \(\mathbf{H}\), also known as the projection or influence matrix, enjoys several key properties in linear regression.
\[\mathbf{H} = \mathbf{ X ( X^TX)^{-1}X^T }\] as stated in Equation 6.5 of Sheather (p. 153), and
\[\mathbf{HY} = \mathbf{ \hat{Y}}\]
Moreover, the standardized residual \(r_i\) is given by: \[ r_i = \frac{\hat{e}_{i} }{s \sqrt{1-h_{ii}}}\] where \(\hat{e}_{i}\) is the \(i\)-th residual and \(s\) is the estimated residual standard deviation.
The key observation is that the hat matrix is independent of the response variable \(Y\). Therefore, the leverage numbers \(h_{ii}\) are unaffected by changes in \(Y\).
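Here is a minimal sketch, assuming michelin is loaded as above, that builds \(\mathbf{H}\) directly from the design matrix and reads off its diagonal; note that the response never enters the computation.
X = cbind(1, michelin$Food)            # design matrix with an intercept column
H = X %*% solve(t(X) %*% X) %*% t(X)   # H = X (X'X)^{-1} X'
round(diag(H), 4)                      # leverage numbers h_ii; Y appears nowhere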
The hat matrix is defined differently under logistic regression. Sheather does not treat the hat matrix for logistic regression in Chapter 8. In fact, I used German Rodriguez’s statistics course notes [https://data.princeton.edu/wws509/notes/] to find the formula. In Chapter 3, he notes that the weighted hat matrix was derived by Pregibon in 1981:
\[\mathbf{H} = \mathbf{W^{1/2}X(X^{T}WX)^{-1} X^{T}W^{1/2}} \] where \(\mathbf{W}\) is the diagonal matrix of iteration weights used in iteratively re-weighted least squares of the form:
\[w_{ii} = \frac{ \hat{\mu}_i(n_i - \hat{\mu}_i)}{n_i}\]
where \(n_i\) are the binomial weights of each cohort \(i\).
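As a sanity check, here is a minimal sketch that verifies Pregibon’s formula against the hat values returned by R’s influence(). The fit below is only illustrative (the same model is refit as logmod1 later), and weights(fit, type = "working") returns the final IRLS weights \(w_{ii}\).
fit = glm(cbind(InMichelin, NotInMichelin) ~ Food, family = binomial, data = michelin)
Xg = model.matrix(fit)                        # design matrix
Wg = diag(weights(fit, type = "working"))     # diagonal matrix of IRLS weights w_ii
Hg = sqrt(Wg) %*% Xg %*% solve(t(Xg) %*% Wg %*% Xg) %*% t(Xg) %*% sqrt(Wg)
all.equal(as.numeric(diag(Hg)), as.numeric(influence(fit)$hat))  # should be TRUE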
The counterparts of the standardized residuals in logistic regression are the standardized deviance residuals.
The deviance residuals are defined as:
\[d_i = \sqrt{ 2\left[ y_i \log\!\left( \frac{y_i}{\hat{\mu}_i}\right) + (n_i - y_i) \log\!\left(\frac{n_i - y_i}{n_i-\hat{\mu}_i} \right) \right] } \cdot \operatorname{sign}(y_i - \hat{\mu}_i)\]
The standardized deviance residual is then:
\[sd_{i} = \frac{d_{i}}{\sqrt{1 - h_{ii} } }\] The key observation is that \(h_{ii}\) now depends on the values of \(Y\), because the weights in \(\mathbf{W}\) are functions of the fitted values \(\hat{\mu}_i\).
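Both quantities are available directly in R. Continuing with the illustrative fit from the sketch above, a quick check that rstandard() agrees with this formula (the binomial dispersion is fixed at 1, so no extra scaling is needed):
d = residuals(fit, type = "deviance")    # deviance residuals d_i
h = influence(fit)$hat                   # leverage numbers h_ii
all.equal(as.numeric(d / sqrt(1 - h)), as.numeric(rstandard(fit)))  # should be TRUE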
Using the Michelin dataset above, we will show that the leverage numbers \(h_{ii}\) do not change in linear regression when the response changes.
To do this, we will calculate the leverage numbers for two versions of a univariate linear model. The explanatory variable in both cases will be the Food predictor in michelin, but we will use two different response variables, \(Y\) and \(Z\), each representing the proportion; they differ only at the last observation (index 14).
Y = michelin$proportion
Z = Y
Z[14] = 0.2 # The index where the response will differ.
linear_dataset = data.frame(X = michelin$Food, Y=Y, Z=Z)
linear_dataset %>% kable() %>% kable_styling(bootstrap_options = c("striped", "bordered"))
| X | Y | Z |
|---|---|---|
| 15 | 0.00 | 0.00 |
| 16 | 0.00 | 0.00 |
| 17 | 0.00 | 0.00 |
| 18 | 0.13 | 0.13 |
| 19 | 0.28 | 0.28 |
| 20 | 0.24 | 0.24 |
| 21 | 0.58 | 0.58 |
| 22 | 0.33 | 0.33 |
| 23 | 0.67 | 0.67 |
| 24 | 0.86 | 0.86 |
| 25 | 0.92 | 0.92 |
| 26 | 0.50 | 0.50 |
| 27 | 0.86 | 0.86 |
| 28 | 1.00 | 0.20 |
linmod1 = lm( Y ~ X, data=linear_dataset)
linmod2 = lm( Z ~ X, data=linear_dataset)
hatlin1 = influence(linmod1)$hat
hatlin2 = influence(linmod2)$hat
dflin = data.frame(Model=c( rep("Old", length(hatlin1) ), rep("New", length(hatlin2) ) ),
HValues = c(hatlin1, hatlin2) ,
Index = c(1:length(hatlin1), 1:length(hatlin2 ) )
)
Now we examine the differences between the two fits. The table below lists the leverage numbers for both models, and the model summaries that follow show clearly that the two models’ regression coefficients are different.
dflin %>% kable()
| Model | HValues | Index |
|---|---|---|
| Old | 0.2571429 | 1 |
| Old | 0.2043956 | 2 |
| Old | 0.1604396 | 3 |
| Old | 0.1252747 | 4 |
| Old | 0.0989011 | 5 |
| Old | 0.0813187 | 6 |
| Old | 0.0725275 | 7 |
| Old | 0.0725275 | 8 |
| Old | 0.0813187 | 9 |
| Old | 0.0989011 | 10 |
| Old | 0.1252747 | 11 |
| Old | 0.1604396 | 12 |
| Old | 0.2043956 | 13 |
| Old | 0.2571429 | 14 |
| New | 0.2571429 | 1 |
| New | 0.2043956 | 2 |
| New | 0.1604396 | 3 |
| New | 0.1252747 | 4 |
| New | 0.0989011 | 5 |
| New | 0.0813187 | 6 |
| New | 0.0725275 | 7 |
| New | 0.0725275 | 8 |
| New | 0.0813187 | 9 |
| New | 0.0989011 | 10 |
| New | 0.1252747 | 11 |
| New | 0.1604396 | 12 |
| New | 0.2043956 | 13 |
| New | 0.2571429 | 14 |
summary(linmod1)
##
## Call:
## lm(formula = Y ~ X, data = linear_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.315297 -0.082220 0.004967 0.087533 0.204835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.26642 0.21590 -5.866 7.65e-05 ***
## X 0.08007 0.00987 8.112 3.26e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1489 on 12 degrees of freedom
## Multiple R-squared: 0.8458, Adjusted R-squared: 0.8329
## F-statistic: 65.81 on 1 and 12 DF, p-value: 3.26e-06
summary(linmod2)
##
## Call:
## lm(formula = Z ~ X, data = linear_dataset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.56971 -0.09315 -0.04681 0.17662 0.32191
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.83213 0.35079 -2.372 0.03526 *
## X 0.05721 0.01604 3.567 0.00387 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2419 on 12 degrees of freedom
## Multiple R-squared: 0.5147, Adjusted R-squared: 0.4743
## F-statistic: 12.73 on 1 and 12 DF, p-value: 0.003871
However, their leverage numbers are the same.
ggplot(data=dflin, aes(x=Index, y=HValues, fill=Model)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Leverage Hat Values for Linear Regression when Y-values are changed")
Let’s repeat the same numerical experiment using logistic regression. We will change the response variable at one subpopulation cohort (index = 3) and recalculate the logistic model and leverage numbers.
First we create the two sets of response variables.
responselog_orig = cbind(michelin$InMichelin, michelin$NotInMichelin)
responselog_new = responselog_orig
responselog_orig[3,]
## [1] 0 8
responselog_new[3,1] = 1
responselog_new[3,2] = 7
Second, we build the logistic regression models using the two different response variables and show that the fitted coefficients are different.
logmod1 = glm(responselog_orig ~ michelin$Food, family=binomial)
logmod2 = glm(responselog_new ~ michelin$Food, family=binomial)
summary(logmod1)
##
## Call:
## glm(formula = responselog_orig ~ michelin$Food, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4850 -0.7987 -0.1679 0.5913 1.5889
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.84154 1.86236 -5.821 5.84e-09 ***
## michelin$Food 0.50124 0.08768 5.717 1.08e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 61.427 on 13 degrees of freedom
## Residual deviance: 11.368 on 12 degrees of freedom
## AIC: 41.491
##
## Number of Fisher Scoring iterations: 4
summary(logmod2)
##
## Call:
## glm(formula = responselog_new ~ michelin$Food, family = binomial)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.49670 -0.57159 -0.09218 0.58672 1.49398
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -10.16227 1.79287 -5.668 1.44e-08 ***
## michelin$Food 0.47067 0.08445 5.574 2.49e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 55.7656 on 13 degrees of freedom
## Residual deviance: 9.8196 on 12 degrees of freedom
## AIC: 41.812
##
## Number of Fisher Scoring iterations: 4
Third, we extract the leverage numbers calculated using the weighted hat matrix formula described above.
hvalues_1 = influence(logmod1)$hat
hvalues_2 = influence(logmod2)$hat
df = data.frame(Model=c( rep("Old", length(hvalues_1) ), rep("New", length(hvalues_2) ) ),
HValues = c(hvalues_1, hvalues_2) ,
Index = c(1:length(hvalues_1), 1:length(hvalues_2 ) )
)
df %>% kable()
| Model | HValues | Index |
|---|---|---|
| Old | 0.0108633 | 1 |
| Old | 0.0125533 | 2 |
| Old | 0.1078450 | 3 |
| Old | 0.1973865 | 4 |
| Old | 0.2070604 | 5 |
| Old | 0.3078445 | 6 |
| Old | 0.2161914 | 7 |
| Old | 0.1179974 | 8 |
| Old | 0.2429990 | 9 |
| Old | 0.1214966 | 10 |
| Old | 0.2346899 | 11 |
| Old | 0.0390368 | 12 |
| Old | 0.1240448 | 13 |
| Old | 0.0599911 | 14 |
| New | 0.0123574 | 1 |
| New | 0.0137785 | 2 |
| New | 0.1141996 | 3 |
| New | 0.2020195 | 4 |
| New | 0.2060435 | 5 |
| New | 0.3013811 | 6 |
| New | 0.2105695 | 7 |
| New | 0.1141933 | 8 |
| New | 0.2345765 | 9 |
| New | 0.1186012 | 10 |
| New | 0.2343842 | 11 |
| New | 0.0401551 | 12 |
| New | 0.1318210 | 13 |
| New | 0.0659196 | 14 |
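Before plotting, a quick numerical check using the hvalues_1 and hvalues_2 vectors computed above confirms that the two sets of leverage numbers are no longer equal:
max(abs(hvalues_1 - hvalues_2))  # strictly positive: the leverage numbers differ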
Plotting the leverage numbers side-by-side for both models, we see that they are different.
ggplot(data=df, aes(x=Index, y=HValues, fill=Model)) +
geom_bar(stat="identity", position=position_dodge()) +
ggtitle("Leverage Hat Values for Logistic Regression when Y-values are changed")
We have shown a subtle but important difference between the linear and the logistic regression models: the leverage numbers of the hat matrix do not behave the same way in the two. Linear regression leverages depend only on the predictor variables, whereas logistic regression leverages depend on both the predictors and the response. This is apparent in the mathematical formulas and is confirmed by a simple numerical experiment.