Introduction to Regression Modelling Strategies for Public Health Researchers

Author
Affiliation

Bongani Ncube

University Of the Witwatersrand (School of Public Health)

Published

April 21, 2025

Keywords

Regression, Public Health, Analysis Of Variance, Predictions, Statistical Analysis

Correlation

Correlation is a statistical measure that describes the strength and direction of a relationship between two quantitative variables. There are three common correlation tests: Pearson's r, Spearman's rho, and Kendall's tau.

Use Pearson's r if both variables are quantitative (interval or ratio), normally distributed, and the relationship is linear with homoscedastic residuals.

Spearman's rho and Kendall's tau are rank-based measures, so they are valid for both quantitative and ordinal variables and do not carry the normality and homoscedasticity conditions. However, non-parametric tests have less statistical power than parametric tests, so only use these correlations if Pearson's r does not apply.

Pearson’s r

Pearson’s \(r\)

\[r = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}} = \frac{cov(X,Y)}{s_X s_Y}\]

estimates the population correlation \(\rho\). Pearson's \(r\) ranges from \(-1\) (perfect negative linear relationship) to \(+1\) (perfect positive linear relationship), and \(r = 0\) when there is no linear relationship. A correlation in the range:

  • \((.1, .3)\) is considered small,
  • \((.3, .5)\) medium,
  • and \((.5, 1.0)\) large.

Pearson’s \(r\) only applies if the variables are interval or ratio, normally distributed, linearly related, there are minimal outliers, and the residuals are homoscedastic.

Data Used for Illustration

Body weight (kg)   Plasma volume (litres)
58.0               2.75
70.0               2.86
74.0               3.37
63.5               2.76
62.0               2.62
70.5               3.49
71.0               3.05
66.0               3.12
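A minimal sketch (assuming the tibble package from the tidyverse is loaded) that recreates this data as the object `df` used in the code chunks below:

library(tibble)

# Illustration data: body weight (kg) and plasma volume (litres)
df <- tibble(
  `Body weight`   = c(58.0, 70.0, 74.0, 63.5, 62.0, 70.5, 71.0, 66.0),
  `Plasma volume` = c(2.75, 2.86, 3.37, 2.76, 2.62, 3.49, 3.05, 3.12)
)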

Types of correlations

Almost perfect correlations

p1 <- ggplot(data = df, aes(x = `Body weight`, y = `Plasma volume`)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and strong correlation")

p2 <- ggplot(data = df, aes(x = `Body weight`, y = 1/`Plasma volume`)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet strong correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Some weak correlations

  • here is a fake dataset producing weak correlations

df2 <- tibble(
  Height = c(115, 101, 99, 107, 118, 127, 120, 129),
  Weight = c(56, 50, 67, 64, 55, 70, 61, 59)
)

Visualizing weak correlations

p1 <- ggplot(data = df2, aes(x = Height, y = Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "positive and weak correlation")

p2 <- ggplot(data = df2, aes(x = Height, y = 1/Weight)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Pearson's Rho", subtitle = "Negative yet weak correlation")

gridExtra::grid.arrange(p1, p2, nrow = 1)

Manually computing Correlations

We are interested in finding \(\sum{(X_i - \bar{X})(Y_i - \bar{Y})}\), \(\sum{(X_i - \bar{X})^2}\) and \(\sum{(Y_i - \bar{Y})^2}\).

dummy_mse <- df %>% 
  mutate("$(X_i - \\bar{X})$" = (`Body weight`- mean(`Body weight`))) %>% 
  mutate("$(Y_i - \\bar{Y})$" = (`Plasma volume`- mean(`Plasma volume`)))%>% 
  mutate("$(X_i - \\bar{X})^2$" = (`Body weight`- mean(`Body weight`))^2) %>% 
  mutate("$(Y_i - \\bar{Y})^2$" = (`Plasma volume`- mean(`Plasma volume`))^2) %>%  
  mutate("$(Y_i - \\bar{Y})(X_i - \\bar{X})$" = (`Plasma volume`- mean(`Plasma volume`))*(`Body weight`- mean(`Body weight`)))

dummy_mse %>% 
  knitr::kable()
Body weight Plasma volume \((X_i - \bar{X})\) \((Y_i - \bar{Y})\) \((X_i - \bar{X})^2\) \((Y_i - \bar{Y})^2\) \((Y_i - \bar{Y})(X_i - \bar{X})\)
58.0 2.75 -8.875 -0.2525 78.765625 0.0637562 2.2409375
70.0 2.86 3.125 -0.1425 9.765625 0.0203063 -0.4453125
74.0 3.37 7.125 0.3675 50.765625 0.1350563 2.6184375
63.5 2.76 -3.375 -0.2425 11.390625 0.0588063 0.8184375
62.0 2.62 -4.875 -0.3825 23.765625 0.1463062 1.8646875
70.5 3.49 3.625 0.4875 13.140625 0.2376563 1.7671875
71.0 3.05 4.125 0.0475 17.015625 0.0022562 0.1959375
66.0 3.12 -0.875 0.1175 0.765625 0.0138063 -0.1028125

Summations and calculating correlation

Here \(X\) denotes Body weight and \(Y\) denotes Plasma volume.

(dummy_mse1<-dummy_mse %>% 
  summarise("$\\sum X_i$"=sum(.[1]),
            "$\\sum Y_i$"=sum(.[2]),
            "$\\sum(X_i - \\bar{X})$"=sum(.[3]),
            "$\\sum(Y_i - \\bar{Y})$"=sum(.[4]),
            "$\\sum(X_i - \\bar{X})^2$"=sum(.[5]),
            "$\\sum(Y_i - \\bar{Y})^2$"=sum(.[6]),
            "$\\sum(Y_i - \\bar{Y})(X_i - \\bar{X})$"=sum(.[7]))%>%  
   mutate(Pearson_R = (.[7]/sqrt(.[5]*.[6]))) %>%  ## seventh column / sqrt(5th times 6th column)
   relocate(Pearson_R)%>% 
   knitr::kable())
Pearson_R \(\sum X_i\) \(\sum Y_i\) \(\sum(X_i - \bar{X})\) \(\sum(Y_i - \bar{Y})\) \(\sum(X_i - \bar{X})^2\) \(\sum(Y_i - \bar{Y})^2\) \(\sum(Y_i - \bar{Y})(X_i - \bar{X})\)
0.7591266 535 24.02 0 0 205.375 0.67795 8.9575
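As a check on the manual calculation, the built-in cor() and cor.test() functions give the same value:

# Pearson's r computed directly
cor(df$`Body weight`, df$`Plasma volume`)          # 0.7591266

# cor.test() adds a t-test and a 95% confidence interval for the correlation
cor.test(df$`Body weight`, df$`Plasma volume`, method = "pearson")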

Spearman’s Rho

Spearman’s \(\rho\) is the Pearson’s r applied to the sample variable ranks. Let \((X_i, Y_i)\) be the ranks of the \(n\) sample pairs with mean ranks \(\bar{X} = \bar{Y} = (n+1)/2\). Spearman’s rho is

\[\hat{\rho} = \frac{\sum{(X_i - \bar{X})(Y_i - \bar{Y})}}{\sqrt{\sum{(X_i - \bar{X})^2 \sum{(Y_i - \bar{Y})^2}}}}\] You will also encounter another formula given by \[\hat{\rho} = 1 -\frac{6\sum{D^2}}{n(n^2-1)}\] where \(D=rank(X)-rank(Y)\)

Spearman’s rho is a non-parametric test, so there is no associated confidence interval.
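A short sketch (using the same `df` data) showing that Spearman's rho is just Pearson's r applied to the ranks, and how to obtain it directly:

# Pearson's r on the ranks equals Spearman's rho
cor(rank(df$`Body weight`), rank(df$`Plasma volume`))

# Built-in equivalents
cor(df$`Body weight`, df$`Plasma volume`, method = "spearman")
cor.test(df$`Body weight`, df$`Plasma volume`, method = "spearman")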

Practical Example 1

In the breast cancer dataset, the variable radius mean measures the average size of the tumour. We want to explore how radius mean is related to other tumour characteristics such as texture mean, perimeter mean, smoothness mean, and compactness mean.

(data<-haven::read_dta("Breastcancer.dta")) |> 
  head(6)
id diagnosis radius_mean texture_mean perimeter_mean area_mean smoothness_mean compactness_mean concavity_mean concavepoints_mean symmetry_mean fractal_dimension_mean radius_se texture_se perimeter_se area_se smoothness_se compactness_se concavity_se concavepoints_se symmetry_se fractal_dimension_se radius_worst texture_worst perimeter_worst area_worst smoothness_worst compactness_worst concavity_worst concavepoints_worst symmetry_worst fractal_dimension_worst radius_hat resid resid_label tag cooksd
842302 M 17.99 10.38 122.80 1001.0 0.11840 0.27760 0.3001 0.14710 0.2419 0.07871 1.0950 0.9053 8.589 153.40 0.006399 0.04904 0.05373 0.01587 0.03003 0.006193 25.38 17.33 184.60 2019.0 0.1622 0.6656 0.7119 0.2654 0.4601 0.11890 19.97863 -1.988633 -1.99 1 0.0046133
842517 M 20.57 17.77 132.90 1326.0 0.08474 0.07864 0.0869 0.07017 0.1812 0.05667 0.5435 0.7339 3.398 74.08 0.005225 0.01308 0.01860 0.01340 0.01389 0.003532 24.99 23.41 158.80 1956.0 0.1238 0.1866 0.2416 0.1860 0.2750 0.08902 13.25931 7.310688 7.31 1 0.0063065
84300903 M 19.69 21.25 130.00 1203.0 0.10960 0.15990 0.1974 0.12790 0.2069 0.05999 0.7456 0.7869 4.585 94.03 0.006150 0.04006 0.03832 0.02058 0.02250 0.004571 23.57 25.53 152.50 1709.0 0.1444 0.4245 0.4504 0.2430 0.3613 0.08758 16.00364 3.686358 3.69 1 0.0027413
84348301 M 11.42 20.38 77.58 386.1 0.14250 0.28390 0.2414 0.10520 0.2597 0.09744 0.4956 1.1560 3.445 27.23 0.009110 0.07458 0.05661 0.01867 0.05963 0.009208 14.91 26.50 98.87 567.7 0.2098 0.8663 0.6869 0.2575 0.6638 0.17300 20.19140 -8.771399 -8.77 0 0.0961080
84358402 M 20.29 14.34 135.10 1297.0 0.10030 0.13280 0.1980 0.10430 0.1809 0.05883 0.7572 0.7813 5.438 94.44 0.011490 0.02461 0.05688 0.01885 0.01756 0.005115 22.54 16.67 152.20 1575.0 0.1374 0.2050 0.4000 0.1625 0.2364 0.07678 15.08842 5.201585 5.20 0 0.0033317
843786 M 12.45 15.70 82.57 477.1 0.12780 0.17000 0.1578 0.08089 0.2087 0.07613 0.3345 0.8902 2.217 27.19 0.007510 0.03345 0.03672 0.01137 0.02165 0.005082 15.47 23.75 103.40 741.6 0.1791 0.5249 0.5355 0.1741 0.3985 0.12440 16.34474 -3.894743 -3.89 0 0.0037038

Radius mean vs smoothness mean

Note

The results show a weak correlation between radius mean and smoothness mean (\(r=0.1705812\)).

Radius mean vs Perimeter mean

Note

The results show a near-perfect correlation between radius mean and perimeter mean (\(r=0.9978553\)).
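The correlations quoted in the notes above can be reproduced with cor() (a sketch, assuming the breast cancer data read in earlier as `data`):

cor(data$radius_mean, data$smoothness_mean)   # weak, about 0.17
cor(data$radius_mean, data$perimeter_mean)    # near-perfect, about 0.998

# cor.test() adds a p-value and confidence interval, e.g.
cor.test(data$radius_mean, data$smoothness_mean)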

Why Linear Regression

Note
  • Correlation and simple linear regression (LR) are used to investigate the relationship between two quantitative variables.

  • Correlation quantifies the strength and direction of a relationship between two variables, but it does not allow one to predict one variable from the other, while LR can be used to determine how the value of one variable varies in response to the values of the other variable.

  • Correlation does not specify a dependent or independent variable (symmetric relationship) but is just concerned with assessing the global movement shared between two variables, while LR specifies one variable as the outcome and the other as the explanatory variable (asymmetric relationship).

The mathematical equation of a straight line

Simple linear regression involves a numeric dependent (or outcome) variable \(Y\) and one independent (or explanatory) variable \(X\) that is either numeric or categorical.

The general equation of a straight line is: \[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \tag{1}\]

Figure 1: Parameters of a straight line.

Preparing the data for these exercises

Dataset 1 : InClass

library(readxl)
BirthWeight2 <- haven::read_dta("abrevaya1.dta") |> 
  mutate(birwtkg =birwt/1000)

Basic concepts

First, we plot the data in a scatter plot of the previous Plasma volume vs Body weight dataset:

ggplot(df, aes(`Body weight`, `Plasma volume`)) +
  geom_point(size = 2, alpha = 0.4) +
  #theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
  labs(x = "x = Body weight", y = "y = Plasma volume")
Figure 2: Scatter plot of weight vs Plasma volume.

The objective is to find a mathematical approach that yields a line that, on average, passes as close as possible to the points on the graph. This line is commonly known as a regression line or a line of best fit. We write the equation of the best-fitting line as:

\[\widehat{y_i} = b_0 + b_1 \cdot x_i\]

where \(b_0\) and \(b_1\) are best choices or estimates of the parameters \(\beta_0\) and \(\beta_1\) in Equation 1, and \(\hat y\) is the value of Y that is expected by this model for a specified value of X.

 

We define the following three concepts:

  1. Observed value \(y_i\): the observed value of the dependent variable Y for a given \(x_i\) value.

  2. Expected (or fitted) value \(\widehat{y_i}\): the value of the dependent variable Y that is expected by the model for a given \(x_i\) value.

  3. Residual \(e_i\): the deviation (error) between the observed value \(y_i\) and the expected value \(\hat y\) for a given \(x_i\) value (\(e_i = y_i - \hat y_i\)).

Let's see, for example, the values for the observation with a body weight of 70 kg (the second row of our data set) (Figure 3).

Figure 3: Explanation of the basic concepts of the linear model.
  • The observed value \(y\) = 2.86 litres for \(x\) = 70 kg.

  • The expected value \(\widehat{y}\) is the value 3.14 litres on the regression line for \(x\) = 70. This value is computed from the fitted regression equation \(\widehat{y_i} = b_0 + b_1 \cdot x_i\).

  • The residual is computed by subtracting the expected (fitted) value \(\widehat{y}\) from the observed value \(y\). In this case, it is \(y - \widehat{y} = 2.86 - 3.14 = -0.28\) litres.

Comment

The residuals are exactly the vertical distance between the observed data point and the associated point on the regression line (expected value) (Figure 3). Positive residuals are associated with y values above the fitted line, while negative residuals correspond to values below the line.

Residuals can be thought of as “noise” - they are fluctuations caused by various factors, preventing the observed values from forming a perfectly straight line on the scatter plot. Residuals play a crucial role in linear regression as they provide valuable insights into the assessment of a linear regression model.

Ordinary least squares (OLS) estimation of \(b_0\) and \(b_1\) in matrix form

Let’s describe for the \(i^{th}\) observation the underlying association between \(y_i\) and \(x_i\) by:

\[ Y_i = \beta_0 + \beta_1 X_i + \epsilon_i \]

  • \(Y_i\): response (dependent) variable at i-th observation
  • \(\beta_0,\beta_1\): regression parameters for intercept and slope.
  • \(X_i\): known constant (independent or predictor variable) for i-th observation
  • \(\epsilon_i\): random error term

\[ E(\epsilon_i) = 0, \qquad var(\epsilon_i) = \sigma^2, \qquad cov(\epsilon_i,\epsilon_j) = 0 \text{ for all } i \neq j \tag{2}\]

where \(\epsilon_i\) is the random error term. Its sample counterpart is the residual \(e_i = y_i - \hat y_i\), the deviation between the observed value \(y_i\) and the fitted value \(\hat y_i\) of the linear model.

We make the following assumptions about \(\epsilon_i\):

  1. Linearity: The relationship between the independent variable(s) and the outcome is linear.
  2. Zero Mean: The expected value of the error term is zero.
  3. Homoscedasticity: The error term has a constant variance.
  4. Independence of Errors: There is no autocorrelation between the error terms.
  5. Normality: The error terms are normally distributed.

In matrix notation, given \(n\) observations of the explanatory variable x and the outcome variable y, the simple linear regression model becomes:

\[\begin{bmatrix} y_1\\ y_2 \\ y_3 \\ ... \\ y_n \end{bmatrix} = \begin{bmatrix} 1 & x_1\\ 1 & x_2\\ 1 & x_3\\ ... \\ 1 & x_n \end{bmatrix} \bullet \begin{bmatrix} b_0\\ b_1\end{bmatrix} + \begin{bmatrix} e_1\\ e_2 \\ e_3 \\ ... \\ e_n \end{bmatrix} \tag{3}\]

Then the outcome y can be compactly modeled by:

\[Y = X\hat{B} + e \tag{4}\]

where the matrix X is called the design matrix (the first column contains 1s, and the second column contains the actual observations of x).

The coefficients \(b_0\) and \(b_1\) are selected by minimizing the sum of squares of the residuals \(e\). It can be proven that the least squares estimates are obtained by the following matrix formula:

\[\hat{B} = (X^TX)^{-1}X^TY = \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} \tag{5}\]

That is, we’ve got one matrix equation which gives us both coefficient estimates.
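A minimal sketch of Equation 5 applied to the plasma-volume data (assuming `df` as defined earlier), which reproduces the coefficients that lm() reports below:

X <- cbind(1, df$`Body weight`)   # design matrix: a column of 1s and the x values
Y <- df$`Plasma volume`

# (X'X)^{-1} X'Y gives b0 and b1
B_hat <- solve(t(X) %*% X) %*% t(X) %*% Y
B_hat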

model_lm <- lm(`Plasma volume`~ `Body weight`, data = df)
model_lm

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Coefficients:
  (Intercept)  `Body weight`  
      0.08572        0.04362  
Summary information about the model
summary(model_lm)

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27880 -0.14178 -0.01928  0.13986  0.32939 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.08572    1.02400   0.084   0.9360  
`Body weight`  0.04362    0.01527   2.857   0.0289 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2188 on 6 degrees of freedom
Multiple R-squared:  0.5763,    Adjusted R-squared:  0.5057 
F-statistic:  8.16 on 1 and 6 DF,  p-value: 0.02893
use PlasmaB , clear

regress Plasma_volume Body_weight 
      Source |       SS           df       MS      Number of obs   =         8
-------------+----------------------------------   F(1, 6)         =      8.16
       Model |  .390684388         1  .390684388   Prob > F        =    0.0289
    Residual |  .287265612         6  .047877602   R-squared       =    0.5763
-------------+----------------------------------   Adj R-squared   =    0.5057
       Total |      .67795         7      .09685   Root MSE        =    .21881

------------------------------------------------------------------------------
Plasma_vol~e | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
 Body_weight |   .0436153   .0152684     2.86   0.029      .006255    .0809757
       _cons |   .0857243   1.023998     0.08   0.936    -2.419909    2.591357
------------------------------------------------------------------------------

The regression equation for our example becomes the one shown in the plot below.

The intercept \(b_0 = 0.0857\) is the mean Plasma volume at a body weight of 0 kg. Visually, it's the point where the line crosses the y-axis when x equals 0 (Figure 4).

# Create a linear regression line with annotations
ggplot(df, aes(x = `Body weight`, y = `Plasma volume`)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, color = "blue", fullrange = TRUE) +
  xlim(0, 100) +
  ylim(0, 4) +
  geom_hline(yintercept = 0, color = "black", linetype = "solid") +
  geom_vline(xintercept = 0, color = "black", linetype = "solid") +
  annotate("text", x = 67.5, y = 3.05, label = bquote(paste (widehat(PlasmaVolume), "= 0.0857 + 0.0436*weight")), color = "purple", angle = 32, size = 5) +
  #theme_prism(base_size = 14, base_line_size = 0.4, palette = "office") +
  labs(x = "x = Body weight", y = "y = Plasma volume")
Figure 4: Scatter plot with fitted line crossing the y-axis.
Comment

Note that while there is a mathematical interpretation for the intercept of the regression line, it doesn’t hold any physical meaning in this case, as the observation of a weight of 0 is not feasible.

Of greater importance is the slope of the fitted line, \(b_1 = 0.0436\), as this describes the association between the Plasma volume and weight variables.

The positive slope indicates a positive association between plasma volume and weight.

For every 1 kg increase in weight, there is on average an associated increase of 0.0436 litres in plasma volume.

The Standard error (SE) of the coefficients

The standard error is a way to measure the “uncertainty” in the estimate of the regression coefficients and it can be calculated by:

\[ \mathrm{SE}_{b_j} = \sqrt{\mathrm{Var}(b_j)} \tag{6}\]

where the \(\mathrm{Var}(b_j)\) represents the variance of the coefficients of the model.
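In R, the estimated variances of the coefficients sit on the diagonal of the coefficient covariance matrix; a short sketch for the model fitted above:

# Standard errors of b0 and b1: square roots of the diagonal of vcov()
sqrt(diag(vcov(model_lm)))
# These match the "Std. Error" column of summary(model_lm)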

Test statistic and confidence intervals

The hypothesis testing for the coefficients of the regression model are:

\[ \begin{aligned} H_0 &: \beta_j = 0\\ \text{vs } H_1&: \beta_j \neq 0. \end{aligned} \]

The null hypothesis, \(H_{0}\), states that the coefficients are equal to zero, and the alternative hypothesis, \(H_{1}\), states that the coefficients are not equal to zero.

The t-statistic for the coefficients are defined by the following equation:

\[ \ t_j = \frac{\ b_j}{\text{SE}_{b_j}} \tag{7}\]

In R:

# Compute t for the coefficients bo and b1
summary(model_lm)

Call:
lm(formula = `Plasma volume` ~ `Body weight`, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.27880 -0.14178 -0.01928  0.13986  0.32939 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)  
(Intercept)    0.08572    1.02400   0.084   0.9360  
`Body weight`  0.04362    0.01527   2.857   0.0289 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2188 on 6 degrees of freedom
Multiple R-squared:  0.5763,    Adjusted R-squared:  0.5057 
F-statistic:  8.16 on 1 and 6 DF,  p-value: 0.02893

We can use the p-value to guide our decision:

  • If p − value < 0.05, reject the null hypothesis, \(H_{0}\).
  • If p − value ≥ 0.05, do not reject the null hypothesis, \(H_{0}\).

In our example \(p = 0.0289 \le 0.05\), hence we reject \(H_{0}\).

The \(95\%\) CI of the coefficients for a significance level α, \(df=n-k\) degrees of freedom and for a two-tailed t-test is given by:

\[ 95\% \ \text{CI}_{b_j} = b_j \pm t^{*}_{df;a/2} \cdot \text{SE}_{b_j} \tag{8}\]
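A sketch of Equation 8 for model_lm, both with confint() and by hand for the slope (here \(df = n - k = 8 - 2 = 6\)):

confint(model_lm, level = 0.95)

# By hand for the slope: b1 +/- t* x SE(b1)
b1  <- coef(model_lm)[2]
se1 <- sqrt(diag(vcov(model_lm)))[2]
b1 + c(-1, 1) * qt(0.975, df = 6) * se1    # about 0.006 to 0.081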

Summary table

All the estimates and statistics of interest for the regression model are presented in the following summary table (Figure 5):

term        estimate  std_error  statistic  p_value  lower_ci  upper_ci
intercept       0.09       1.02       0.08    0.936     -2.42      2.59
Body weight     0.04       0.01       2.86    0.029      0.01      0.08
Figure 5: The linear regression table.

 

Verifying Model Assumptions

Note

In the breast cancer dataset, the variable radius mean represents the average size of the tumour. We aim to investigate the effect of compactness mean on radius mean using simple linear regression.

Specific assumptions have to be met for reliable hypothesis tests and confidence intervals in simple linear regression: linearity of the data, independence of the residuals, normality of the residuals, and equality of variance of the residuals.

We will describe some statistical tests and diagnostic plots in R for testing the assumptions underlying the linear regression model.

Warning

Variations or different versions of the statistical tests may have different results.
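The checks that follow refer to a fitted model named model_lm2 and a list of diagnostic plots named diagnostic_plots; one possible sketch of how these could be created (assuming the breast cancer data read earlier as `data` and the performance and see packages):

library(performance)   # check_autocorrelation(), check_normality(), check_model()
library(see)           # plotting methods for performance objects

# Simple linear regression of radius_mean on compactness_mean
model_lm2 <- lm(radius_mean ~ compactness_mean, data = data)

# One way to obtain the individual diagnostic panels indexed below
diagnostic_plots <- plot(check_model(model_lm2, panel = FALSE))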

Linearity of the data

Linear association between the independent variable x and the dependent variable (outcome) y.

Independence of the residuals

The residuals (or errors) resulting from the model should be independent of each other.

  Statistical test: Durbin-Watson test

We can perform a Durbin-Watson-Test to check for autocorrelated residuals (a p-value < 0.05 indicates autocorrelated residuals).

set.seed(126)
check_autocorrelation(model_lm2, nsim = 1000)
OK: Residuals appear to be independent and not autocorrelated (p = 0.064).

 

Normality of the residuals

The residuals should be normally distributed.

  Statistical test: Shapiro test

check_normality(model_lm2)
Warning: Non-normality of residuals detected (p < .001).

The function performs a shapiro.test and checks the standardized residuals for normal distribution. According to the documentation “this formal test almost always yields significant results for the distribution of residuals for large samples and visual inspection (e.g., Q-Q plots) are preferable.”

  Plot: Normal Q-Q plot

diagnostic_plots[5] 
$QQ

The normal Q-Q plot is used to examine whether the residuals are roughly normally distributed. Ideally, the standardized residual points should follow the green line. In our example, the Q-Q plot deviates greatly from that line, indicating non-normal residuals.

 

Equality of variance of the residuals (Homoscedasticity)

The spread or dispersion of the residuals should remain roughly the same for all values of the X variable.

  Statistical test: Breusch-Pagan test

The most common test for checking equality of variance of the residuals (homoscedasticity) is the Breusch-Pagan test (a p-value < 0.05 indicates the presence of heteroscedasticity).

The original version of Breusch-Pagan test:

lmtest::bptest(model_lm2, studentize = F) 

    Breusch-Pagan test

data:  model_lm2
BP = 78.115, df = 1, p-value < 0.00000000000000022

However, the studentized BP test (which is the default) is more robust than the original one:

lmtest::bptest(model_lm2) 

    studentized Breusch-Pagan test

data:  model_lm2
BP = 56.116, df = 1, p-value = 0.00000000000006833

  Plot: Square root of standardized residuals vs Fitted values

diagnostic_plots[3] 
$HOMOGENEITY

This plot checks the assumption of equal variance of the residuals (homoscedasticity). To verify the assumption, we plot the square root of standardized residuals against the fitted values from the regression. In this case, the residuals are re-scaled and all values are positive. A horizontal line with equally spread points would be the desired pattern. However, if the spread of residuals widens or narrows systematically as we move along the x-axis, it indicates a violation of homoscedasticity.

 

Influential observations in the data

The dataset should not include data points with exceptional influence on the model.

  Plot: Standardized residuals vs Leverage

diagnostic_plots[4] 
$OUTLIERS

This plot is used to detect influential observations: data points that might have a substantial impact on the regression model's results when included in or excluded from the analysis. If any point in this plot falls outside the dashed green lines, it is considered an influential observation. In our example, all the data points are inside the dashed green lines, suggesting that none of the observations is influential on the fitted model.

Comment

The standardized residual is the residual divided by its standard deviation. The standardization, which results in a mean of zero and a variance of one, enables the residuals to be compared on the "standard scale". A standardized residual beyond +/- 2 indicates an unusual observation, while a value beyond +/- 3 indicates a highly unusual observation.

Overall Model fit

Two related measures are typically used to evaluate the linear regression model: the residual standard error (RSE) and the coefficient of determination \(R^2\).

Residual standard error (RSE)

The residual standard error is the square root of the residual variance \(\sigma^2_{e}\). It informs us about the average error of the regression model in the units of the outcome variable; lower values are better because they indicate that the observations are closer to the fitted line. In our example, \(RSE = 0.2188\) litres on 6 degrees of freedom. In general:

\[\ RSE = \sqrt{\frac{SS_{residuals}}{n-k}} \tag{9}\]
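For model_lm this can be computed directly (a short sketch):

# RSE by hand: sqrt(SS_residuals / (n - k)), and via sigma()
sqrt(sum(residuals(model_lm)^2) / df.residual(model_lm))
sigma(model_lm)    # 0.2188 in this example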

Coefficient of determination \(R^2\)

The regression model allows us to divide the total variation in the outcome into two components: one explained by the model, and the other that remains unexplained by the model.

\[ \mathrm{SS}_{\mathrm{total}} = \mathrm{SS}_{\mathrm{model}} + \mathrm{SS}_{\mathrm{residual}} \]

Proof

\[ \begin{aligned} \sum_{i=1}^n (Y_i - \bar{Y})^2 &= \sum_{i=1}^n (Y_i - \hat{Y}_i + \hat{Y}_i - \bar{Y})^2 \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i - \bar{Y})^2 + 2\sum_{i=1}^n(Y_i - \hat{Y}_i)(\hat{Y}_i-\bar{Y}) \\ &= \sum_{i=1}^n(Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n(\hat{Y}_i -\bar{Y})^2 \\ SS_{total} &= SS_{residuals} + SS_{model} \\ \end{aligned} \]

where SSR (\(SS_{model}\)) is the regression sum of squares, which measures how the conditional mean varies about a central value; this is the variation explained by the model (what the model learns from the data).

The cross-product term in the decomposition is 0:

\[ \begin{aligned} \sum_{i=1}^n (Y_i - \hat{Y}_i)(\hat{Y}_i - \bar{Y}) &= \sum_{i=1}^{n}(Y_i - \bar{Y} -b_1 (X_i - \bar{X}))(\bar{Y} + b_1 (X_i - \bar{X})-\bar{Y}) \\ &= b_1 \sum_{i=1}^{n} (Y_i - \bar{Y})(X_i - \bar{X}) - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1 \frac{\sum_{i=1}^{n}(Y_i -\bar{Y})(X_i - \bar{X})}{\sum_{i=1}^{n}(X_i - \bar{X})^2} \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2\sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 - b_1^2 \sum_{i=1}^{n}(X_i - \bar{X})^2 \\ &= 0 \end{aligned} \]

\[ \begin{aligned} SSTO &= SSR + SSE \\ (n-1 d.f) &= (1 d.f.) + (n-2 d.f.) \end{aligned} \]

Source of Variation   Sum of Squares   df     Mean Square    F
Regression (model)    SSR              1      MSR = SSR/df   MSR/MSE
Error                 SSE              n-2    MSE = SSE/df
Total (Corrected)     SSTO             n-1

\[ E(MSE) = \sigma^2 \\ E(MSR) = \sigma^2 + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2 \]

Coefficient of Determination (\(R^2\)) and Adjusted \(R^2\)

\(R^2\) is the square of the correlation coefficient r.

  • It tells us the proportion of variability in the outcome variable that is explained by the model.
  • In our regression, \(R^2 = 0.5763\): about \(58\%\) of the variability in plasma volume is explained by body weight.
  • Adjusted \(R^2 = 0.5057\): it tells us how well the model explains the outcome after accounting for sample size and the number of predictors, giving a more honest estimate of model performance, especially with small datasets. After adjusting for model complexity and sample size, about \(51\%\) of the variability in plasma volume is still explained by body weight.

\[ R^2 = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residuals}}{SS_{total}} \]

where \(0 \le R^2 \le 1\)

Remember

Interpretation: The proportionate reduction of the total variation in Y after fitting a linear model in X.
It is not really correct to say that \(R^2\) is the “variation in Y explained by X”.

\(R^2\) is related to the correlation coefficient between Y and X:

\[ R^2 = (r)^2 \]

where \(r= corr(x,y)\) is an estimate of the Pearson correlation coefficient.

The coefficient of determination, \(R^2\), indicates the percentage of the total variation in the dependent variable Y that can be explained by the regression model (in this simple case the independent variable X). Hence, it is a measure of the ‘goodness of fit’ of the regression line to the data.

\[\ R^2 = \frac{\ explained \ \ variation}{total \ \ variation} = \frac{SS_{model}}{SS_{total}} = 1- \frac{SS_{residual}}{SS_{total}} \tag{10}\]

The R-squared measure ranges between 0 and 1. When \(R^2\) approaches 1, it indicates that a substantial proportion of the variation in the dependent variable Y has been explained by the regression model. Conversely, when it nears 0, it implies that the regression model has not effectively explained most of the variation in the outcome.

In our example, \(R^2 = 0.58\) indicates that about 58% of the variation in Plasma Volume can be explained by the variation of the body weight.

Comment

In simple linear regression \(\sqrt{0.5762732} = 0.7591266\) which equals to the Pearson’s correlation coefficient r.

(cor(df$`Body weight`,df$`Plasma volume`))^2
[1] 0.5762732

In matrix notation:

  • the sum of squares total is defined as:

\[ \mathrm{SS}_{total} = (\mathbf{Y} - \bar{\mathbf{Y}})^2 = (\mathbf{Y} - \bar{\mathbf{Y}})^\intercal(\mathbf{Y} - \bar{\mathbf{Y}}) \]

  • the residual sum of squares is:

\[ SS_{residuals} = \mathbf{e}^2 =\mathbf{e}^\intercal\mathbf{e} \]

Therefore,

\[ \mathrm{SS}_{\mathrm{model}} = \mathrm{SS}_{\mathrm{total}} - \mathrm{SS}_{\mathrm{residual}} \]
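A short sketch verifying the decomposition and \(R^2\) for model_lm:

y      <- df$`Plasma volume`
ss_tot <- sum((y - mean(y))^2)          # SS_total
ss_res <- sum(residuals(model_lm)^2)    # SS_residuals
ss_mod <- ss_tot - ss_res               # SS_model

c(SS_total = ss_tot, SS_model = ss_mod, SS_residual = ss_res)
ss_mod / ss_tot                         # R-squared, about 0.576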

F-Test

The F-test for the model is a test of the null hypothesis that none of the independent variables linearly predict the dependent variable, that is, the model parameters are jointly zero:

The F-statistic tests whether the regression model explains a significant portion of the total variability in the outcome variable:

\[H_0 : \beta_1 = \ldots = \beta_k = 0.\] The regression mean sum of squares \(MSR = \frac{(\hat{y} - \bar{y})'(\hat{y} - \bar{y})}{k-1}\) and the error mean sum of squares \(MSE = \frac{\hat{\epsilon}'\hat{\epsilon}}{n-k}\) are each (scaled) chi-square variables. Their ratio has an F distribution with \(k - 1\) numerator degrees of freedom and \(n - k\) denominator degrees of freedom. The F statistic can also be expressed in terms of the coefficient of determination \(R^2 = \frac{SS_{model}}{SS_{total}}\).

\[F(k - 1, n - k) = \frac{MS_{model}}{MS_{residuals}} = \frac{R^2}{1 - R^2} \frac{n-k}{k-1}\]

The expected value of MSE is \(\sigma^2\). If \(H_0\) is true, that is, there is no relationship between the predictors and the response, then the expected value of MSR is also \(\sigma^2\), so \(F \approx 1\). As \(R^2 \rightarrow 1\), \(F \rightarrow \infty\), and as \(R^2 \rightarrow 0\), \(F \rightarrow 0\). F increases with \(n\) and decreases with \(k\).

A large F-value indicates that the model explains significantly more variability than the mean alone. In our example, \(F = 8.16\) with a p-value of 0.029, so we reject the null hypothesis that the slope is zero.
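The F-statistic reported by summary(model_lm) can also be read off the ANOVA table of the fitted model, or computed from \(R^2\) (a sketch with \(n = 8\) observations and \(k = 2\) parameters):

anova(model_lm)   # Sum Sq, Mean Sq, F = 8.16 and p = 0.029 for `Body weight`

# From R-squared
r2 <- summary(model_lm)$r.squared
(r2 / (1 - r2)) * (8 - 2) / (2 - 1)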

Present the results

We can use the above information to write up a final report:

Final report

In summary, the regression coefficient of the weight is positive and significantly different from zero \((p = 0.029)\). There is on average an increase of 0.0436 (\(95\%\) CI: 0.0063 to 0.0810) litres in Plasma Volume for every 1 kg increase in weight. The model explains a substantial proportion of variance (\(58\%\) of the variation in Plasma Volume can be explained by the variation of body weight).

Multiple Linear Regression and General Linear Model

We would like to fit a model that includes all potentially important variables simultaneously. This would help us evaluate the relationship between a predictor variable and the outcome while controlling for the potential influence of other variables. This is the strategy used in multiple regression. While we remain cautious about making any causal interpretations using multiple regression, such models are a common first step in providing evidence of a causal connection.

\[y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \dots + \beta_px_p+\epsilon_i \tag{11}\]

Just as with the single predictor case, a multiple regression model may be missing important components or it might not precisely represent the relationship between the outcome and the available explanatory variables. While no model is perfect, we wish to explore the possibility that this model may fit the data reasonably well.

We estimate the parameters \(\beta_0\), \(\beta_1\), …, \(\beta_p\) in the same way as we did in the case of a single predictor. We select \(b_0\), \(b_1\), …, \(b_p\) that minimize the sum of the squared residuals:

\[ \text{SSE} = e_1^2 + e_2^2 + \dots + e_{n}^2 = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2 \]

heart<-readr::read_csv("heart.csv")
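The columns of heart.csv are not shown here, so as a concrete sketch we can instead use the birthweight data loaded earlier (abrevaya1.dta, object BirthWeight2); the predictors below are the ones used in the Stata model fitted later in this section:

# Multiple regression of birthweight (kg) on several maternal characteristics
mlr <- lm(birwtkg ~ mage + factor(smoke) + factor(male) + factor(married) +
            factor(black) + kess,
          data = BirthWeight2)
summary(mlr)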

General Linear model

Recap : Analysis Of Variance

Use the one-way ANOVA test to compare the mean response of a continuous dependent variable among the levels of a factor variable (if you only have two levels, use the independent-samples t-test). The observations must be independent, meaning they cannot influence each other (e.g., the same participant appearing in different groups, or participants interacting with each other to produce the observed outcome).

The ANOVA method decomposes the deviation of observation \(Y_{ij}\) around the overall mean \(\bar{Y}_{..}\) into two parts: the deviation of the observations around their treatment means, \(SSE\), and the deviation of the treatment means around the overall mean, \(SSR\). The ratio of the corresponding mean squares, \(F = \frac{SSR/(k-1)}{SSE/(N-k)}\), follows an F-distribution with \(k-1\) numerator dof and \(N-k\) denominator dof. The more observation variance captured by the treatments, the larger \(F\) is, and the less likely the null hypothesis, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), is true.

Compare the F-statistic to the F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom

The F test does not indicate which populations cause the rejection of \(H_0\). For this, use one of the post-hoc tests: Tukey, Fisher’s Least Significant Difference (LSD), Bonferroni, Scheffe, or Dunnett.

ANOVA returns reliable results if the following conditions hold:

    1. there should be no significant outliers within the factor levels;
    2. the response variable should be approximately normally distributed within each factor level; and
    3. the response variable variances within the factor levels should be equal.

Example dataset

Data set abrevaya1.dta contains a sample of data from the first birth of mothers in the study, with data from 3,978 mothers who have complete data for the variables of interest and who had singleton births. Birth outcomes of interest are birthweight and whether the baby was of low birthweight (birthweight < 2500 grams).

ANC

Note: “kess” is an example of a factor – a categorical variable that divides the data into three groups according to the quality of antenatal care (adequate, intermediate or inadequate)

Question of interest is: “What can be said about the variation in birthweight between the different levels of antenatal care?”

df<-haven::read_dta("abrevaya1.dta") |> 
   mutate(birwtkg = birwt/1000,
          kess_cat = as.factor(kess))
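A sketch of the one-way ANOVA and the Bonferroni-adjusted pairwise comparisons behind the conclusion below (using the df object just created):

# One-way ANOVA of birthweight (kg) by Kessner antenatal-care category
anova_fit <- aov(birwtkg ~ kess_cat, data = df)
summary(anova_fit)                       # overall F-test

# Bonferroni-adjusted pairwise comparisons of group means
pairwise.t.test(df$birwtkg, df$kess_cat, p.adjust.method = "bonferroni")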

Conclusion

The overall analysis of variance is significant \((F=24.04;\ P<0.0001)\), so we can reject \(H_0\) and conclude that the mean birthweight is not the same for all three groups of mothers defined by the level of antenatal care.

The Bonferroni-adjusted comparison of means shows that the mean birthweight is significantly lower for mothers who had either intermediate or inadequate antenatal care compared to mothers who had adequate antenatal care, while the difference in mean birthweight between mothers with intermediate antenatal care and mothers with inadequate antenatal care is not statistically significant.

Anova and Regression

There is a close connection between multiple regression models and analysis of variance models. We can measure the effect of intermediate antenatal care and inadequate antenatal care relative to adequate antenatal care (the first level and also the most common in our study); i.e. we set Kessner level 1 (adequate antenatal care) to be the reference or baseline level. We then define variables to measure the effect of Kess level 2 and Kess level 3 relative to the baseline, as in the sketch below.
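One way to define these indicator variables in R and check that the regression reproduces the one-way ANOVA (a sketch, with Kessner level 1 as the reference):

# Indicator (dummy) variables for Kessner levels 2 and 3
df <- df |>
  mutate(kess2 = as.numeric(kess == 2),   # 1 = intermediate antenatal care
         kess3 = as.numeric(kess == 3))   # 1 = inadequate antenatal care

lm_kess <- lm(birwtkg ~ kess2 + kess3, data = df)
anova(lm_kess)    # same F-ratio and P-value as the one-way ANOVA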

Note

If we compare the analysis of variance table for the regression model to that of the one-way ANOVA, we see that they are in fact identical. The model sum of squares and degrees of freedom are equal to the "Between groups" sum of squares and degrees of freedom in the ANOVA, and the residual sum of squares and residual degrees of freedom likewise match the within-groups sum of squares and degrees of freedom. This results in identical F-ratios (\(24.04\)) and identical P-values (\(<0.0001\)).

General linear model: allows us to include both variates (quantitative variables) and factors as explanatory variables, and thus lets us analyze a wide range of data from research studies, provided it is reasonable to assume that the quantitative outcome or response variable has an underlying normal distribution.

Example 2

  1. Does birthweight increase with increasing maternal age?
  2. Is there a difference in mean birthweight between babies whose mothers smoked & babies whose mothers did not smoke adjusting for maternal age?
  3. Does birth weight increase with increasing maternal age at same rate for babies where mother smoked and babies where mother did not smoke during pregnancy?
df |> 
  ggplot(aes(x = mage, y = birwtkg, color = factor(smoke))) +
  geom_point() +
  geom_smooth(method = "lm")

df |> 
  ggplot(aes(x = mage, y = birwtkg, color = factor(smoke))) +
  geom_point() +
  geom_smooth(method = "lm") +
  facet_wrap(~smoke)

Note

There is overwhelming evidence that birthweight is associated with maternal age: for each additional year of maternal age, birthweight increases by 0.012 kg (95% CI 0.0089 to 0.015; P < 0.0001).

Model equation

\[ \mbox{E}(Birwtkg)=\beta_0 + \beta_1*\text{mage_cent}+ \beta_2*\text{(smoke=Yes)} \]

Note

The post-model test ("testparm i.smoke") shows that adding i.smoke significantly improves the fit of the model \((F=138.15;\ P<0.0001)\). The estimated effect of smoking is -0.268 (se 0.023) after adjusting for the effect of maternal age. Compare this with the result of the t-test, which said that if the mother smoked, birth weight was lower by 0.290 (se 0.023): the effect of smoking is slightly reduced by adjusting for mother's age. In this case the standard error is unchanged (to 3 decimal places); often adjusting leads to a smaller s.e.

Tip

The post-estimation test after fitting the interaction term shows no evidence of an improvement to the model \((F=0.57;\ P=0.449)\). We would therefore not choose the model with an interaction (the separate-lines model); we would choose the parallel-lines model. This is an example of the principle of parsimony: we only choose a more complicated model over a simpler model if it fits significantly better.

Also Note

The slope does not differ between the levels of the Smoke variable, so there is no synergy or interaction between these variables: smoking status does not change the effect of maternal age on birthweight.

use abrevaya1

egen mage_bar = mean(mage)
gen mage_cent = mage -mage_bar


gen birwtkg = birwt/1000

reg birwtkg i.smoke mage i.male i.married i.black kess

cprplot mage , lowess

graph export cpr.png, replace
      Source |       SS           df       MS      Number of obs   =     3,978
-------------+----------------------------------   F(6, 3971)      =     55.80
       Model |  79.3706313         6  13.2284385   Prob > F        =    0.0000
    Residual |   941.46269     3,971  .237084535   R-squared       =    0.0778
-------------+----------------------------------   Adj R-squared   =    0.0764
       Total |  1020.83332     3,977  .256684265   Root MSE        =    .48691

------------------------------------------------------------------------------
     birwtkg | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
     1.smoke |  -.2404347   .0233559   -10.29   0.000    -.2862254    -.194644
        mage |    .006056   .0015381     3.94   0.000     .0030404    .0090716
             |
        male |
       male  |   .1027676   .0154436     6.65   0.000     .0724894    .1330459
             |
     married |
        Yes  |   .0695001   .0286437     2.43   0.015     .0133424    .1256577
             |
       black |
        Yes  |  -.2073296   .0324924    -6.38   0.000    -.2710329   -.1436262
        kess |  -.0356693   .0151464    -2.35   0.019    -.0653647   -.0059738
       _cons |   3.252061   .0533562    60.95   0.000     3.147453    3.356669
------------------------------------------------------------------------------



Collinearity

Multicollinearity refers to correlation among explanatory variables.

  • large changes in the estimated regression coefficient when a predictor variable is added or deleted, or when an observation is altered or deleted.
  • nonsignificant results in individual tests on regression coefficients for important predictor variables.
  • estimated regression coefficients with an algebraic sign that is the opposite of that expected from theoretical consideration or prior experience.
  • large coefficients of simple correlation between pairs of predictor variables in the correlation matrix.
  • wide confidence intervals for the regression coefficients representing important predictor variables.

Multicollinearity arises when some of the X variables are so highly correlated that the inverse \((X'X)^{-1}\) does not exist or is very computationally unstable.

Correlated Predictor Variables: if some X variables are “perfectly” correlated, the system is undetermined and there are an infinite number of models that fit the data. That is, if \(X'X\) is singular, then \((X'X)^{-1}\) doesn’t exist. Then,

  • parameters cannot be interpreted (\(\mathbf{b = (X'X)^{-1}X'y}\))
  • sampling variability is infinite (\(\mathbf{s^2(b) = MSE (X'X)^{-1}}\))
VIFs

Let \(R^2_k\) be the coefficient of multiple determination when \(X_k\) is regressed on the p - 2 other X variables in the model. Then,

\[ VIF_k = \frac{1}{1-R^2_k} \]

  • Large values indicate that near collinearity is inflating the variance of \(b_k\), since \(var(b_k) \propto \sigma^2 (VIF_k)\).
  • Typically, the rule of thumb is that \(VIF_k > 4\) means you should investigate why this is the case, and \(VIF_k > 10\) indicates a serious collinearity problem that could result in poor parameter estimates.
  • The mean of all VIFs provides an estimate of the ratio of the true multicollinearity to a model where the X variables are uncorrelated.
  • There is serious multicollinearity if \(\overline{VIF} \gg 1\).
Multicollinearity

The multicollinearity condition is violated when two or more of the predictors in a regression model are correlated. Multicollinearity can occur for structural reasons, as when one variable is a transformation of another variable, or for data reasons, as occurs in observational studies. Multicollinearity is a problem because it inflates the variances of the estimated coefficients, resulting in larger confidence intervals.

When predictor variables are correlated, the precision of the estimated regression coefficients decreases with each added correlated predictor variable. The usual interpretation of a slope coefficient as the change in the mean response for each additional unit increase in the predictor when all the other predictors are held constant breaks down because changing one predictor necessarily changes the others.

A residuals vs fits plot \((\epsilon \sim \hat{Y})\) should have correlation \(\rho \sim 0\). A correlation matrix is helpful for picking out the correlation strengths. A good rule of thumb is correlation coefficients should be less than 0.80. However, this test may not work when a variable is correlated with a function of other variables. A model with multicollinearity may have a significant F-test with insignificant individual slope estimator t-tests. Another way to detect multicollinearity is by calculating variance inflation factors. The predictor variance \(Var(\hat{\beta_k})\) increases by a factor

\[VIF_k = \frac{1}{1 - R_k^2}\]

where \(R_k^2\) is the \(R^2\) of a regression of the \(k^{th}\) predictor on the remaining predictors. A \(VIF_k\) of \(1\) indicates no inflation (no correlation). A \(VIF_k \geq 4\) warrants investigation. A \(VIF_k \geq 10\) requires correction.

If the multicollinearity occurs because you are using a polynomial regression model, center the predictor variables (subtract their means).

In summary (Multicollinearity)
  1. May be significant change in regression coefficients if we add or delete an explanatory variable.
  2. Estimated standard errors of fitted coefficients are inflated.
  3. Estimated coefficients may not be statistically significant, even though statistical relationship exists between outcome and explanatory variables.

We can check for multicollinearity after fitting the regression model by calculating the variance inflation factor (VIF), which gives for each coefficient an estimate of the factor by which its variance has been increased due to multicollinearity. Most analysts apply informal rules of thumb to the VIF; there is evidence of multicollinearity if:

  • Largest VIF is greater than 10
  • Mean of all VIF’s is considerably larger than 1
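VIFs do not need to be computed by hand; a sketch for a fitted multiple regression (for example the mlr object sketched earlier, or any lm fit with several predictors), assuming the car or performance package:

# Variance inflation factors (generalised VIFs are reported for factor terms)
car::vif(mlr)

# Alternative from the performance package
performance::check_collinearity(mlr)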

APPENDIX

\(Y_i\) is random since \(\epsilon_i\) is:

\[ \begin{aligned} E(Y_i) &= E(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= E(\beta_0) + E(\beta_1 X_i) + E(\epsilon) \\ &= \beta_0 + \beta_1 X_i \end{aligned} \]

\[ \begin{aligned} var(Y_i) &= var(\beta_0 + \beta_1 X_i + \epsilon_i) \\ &= var(\epsilon_i) \\ &= \sigma^2 \end{aligned} \]

\[ \beta_1 = \frac{Cov(Y_i, X_i)}{Var(X_i)} \] \[b_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n}(X_i - \bar{X})^2}=\frac{\sigma_{xy}}{\sigma_{xx}}\] \[b_0 = \frac{1}{n}(\sum_{i=1}^{n}Y_i - b_1\sum_{i=1}^{n}X_i) = \bar{Y} - b_1 \bar{X}\]

Properties of Least Squares Estimators

\[E(b_1) = \beta_1\] \[E(b_0) = E(\bar{Y}) - \bar{X}\beta_1\] \[E(\bar{Y}) = \beta_0 + \beta_1 \bar{X}\] \[E(b_0) = \beta_0\] \[var(b_1) = \frac{\sigma^2}{\sum_{i=1}^{n}(X_i - \bar{X})^2}\] \[var(b_0) = \sigma^2 (\frac{1}{n} + \frac{\bar{X}^2}{\sum (X_i - \bar{X})^2}) \]

Mean Square Error

\[ MSE = \frac{SSE}{n-2} = \frac{\sum_{i=1}^{n}e_i^2}{n-2} = \frac{\sum(Y_i - \hat{Y_i})^2}{n-2} \]

Interpretations

Model         Form                                             Interpretation of \(\beta_1\)             In words
Level-Level   \(y =\beta_0+\beta_1x+\epsilon\)                 \(\Delta y = \beta_1 \Delta x\)           A unit change in x results in a \(\beta_1\)-unit change in y
Log-Level     \(ln(y) = \beta_0 + \beta_1x + \epsilon\)        \(\% \Delta y=100 \beta_1 \Delta x\)      A unit change in x results in a \(100\beta_1\%\) change in y
Level-Log     \(y = \beta_0 + \beta_1 ln(x) + \epsilon\)       \(\Delta y = (\beta_1/100)\%\Delta x\)    A one-percent change in x results in a \(\beta_1/100\)-unit change in y
Log-Log       \(ln(y) = \beta_0 + \beta_1 ln(x) +\epsilon\)    \(\% \Delta y= \beta_1 \% \Delta x\)      A one-percent change in x results in a \(\beta_1\)-percent change in y

Model Diagnostics

Akaike Information Criterion (AIC)

\[ AIC_p = n \ln(\frac{SSE_p}{n}) + 2p \]

  • Increasing p (the number of parameters) decreases the first term and increases the second term.
  • We want a model with a small value of AIC. If the AIC increases when a parameter is added to the model, that parameter is not needed.
  • AIC represents a trade-off between precision of fit and the number of parameters used.

Bayes (or Schwarz) Information Criterion

\[ BIC = n \ln(\frac{SSE_p}{n})+ (\ln n)p \]

The coefficient in front of p tends to penalize more heavily models with a larger number of parameters (as compared to AIC).
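Both criteria are available directly for a fitted lm object; a short sketch (R's AIC() and BIC() are computed from the log-likelihood, but for models fitted to the same data they rank models in the same way as the SSE-based formulas above):

AIC(model_lm)   # Akaike information criterion
BIC(model_lm)   # Schwarz / Bayesian information criterion
# Lower values indicate a better trade-off between fit and number of parameters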

Stepwise Selection Procedures

The forward stepwise procedure:

  • finds a plausible subset sequentially.
  • at each step, a variable is added or deleted.
  • criterion for adding or deleting is based on SSE, \(R^2\), T, or F-statistic.

Note:

  • Instead of using exact F-values, computer packages usually specify the equivalent “significance” level. For example, SLE is the “significance” level to enter, and SLS is the “significance” level to stay. The SLE and SLS are guides rather than true tests of significance.
  • The choice of SLE and SLS represents a balancing of opposing tendencies. Use of large SLE values tends to result in too many predictor variables; models with small SLE tend to be under-specified resulting in \(\sigma^2\) being badly overestimated.
  • As for choice of SLE, can choose between 0.05 and 0.5.
  • If SLE > SLS then a cycling pattern may occur, although most computer packages can detect this and stop when it happens. A quick fix: SLS = SLE / 2.
  • If SLE < SLS then the procedure is conservative and may lead variables with low contribution to be retained.
  • Order of variable entry does not matter.

Automated Selection Procedures:

  • Forward selection: same idea as forward stepwise except it doesn't test whether variables should be dropped once entered (not as good as forward stepwise).
  • Backward Elimination: begin with all variables and identifies the one with the smallest F-value to be dropped.


Diagnostics

Influential observations/outliers

Hat matrix

\[ \mathbf{H = X(X'X)^{-1}X'} \]

where \(\mathbf{\hat{Y}= HY, e = (I-H)Y}\) and \(var(\mathbf{e}) = \sigma^2 (\mathbf{I-H})\)

  • \(\sigma^2(e_i) = \sigma^2 (1-h_{ii})\), where \(h_{ii}\) is the i-th element of the main diagonal of \(\mathbf{H}\) (must be between 0 and 1).
  • \(\sum_{i=1}^{n} h_{ii} = p\)
  • \(cov(e_i,e_j) = -h_{ij}\sigma^2\) where \(i \neq j\)
  • Estimate: \(s^2(e_i) = MSE (1-h_{ii})\)
  • Estimate: \(\hat{cov}(e_i,e_j) = -h_{ij}(MSE)\); if model assumption are correct, this covariance is very small for large data sets.
  • If \(\mathbf{x}_i = [1 X_{i,1} ... X_{i,p-1}]'\) (the vector of X-values for a given response), then \(h_{ii} = \mathbf{x_i'(X'X)^{-1}x_i}\) (depends on relative positions of the design points \(X_{i,1},...,X_{i,p-1}\))
Studentized Residuals

\[ r_i = \frac{e_i}{s(e_i)} \\ r_i \sim N(0,1) \]

where \(s(e_i) = \sqrt{MSE(1-h_{ii})}\). \(r_i\) is called the studentized residual or standardized residual.

  • You can also use the semi-studentized residual from before, \(e_i^*= e_i / \sqrt{MSE}\); this doesn't take into account the different variances of each \(e_i\).

We would want to see the model without a particular value. You delete the i-th case, fit the regression to the remaining n-1 cases, get estimated responses for the i-th case, \(\hat{Y}_{i(i)}\), and find the difference, called the deleted residual:

\[ d_i = Y_i - \hat{Y}_{i(i)} \\ = \frac{e_i}{1-h_{ii}} \]

We don't need to recompute the regression model for each case, since \(d_i\) can be obtained from \(e_i\) and \(h_{ii}\).

As \(h_{ii}\) increases, \(d_i\) increases.

\[ s^2(d_i)= \frac{MSE_{(i)}}{1-h_{ii}} \]

where \(MSE_{(i)}\) is the mean square error when the i-th case is omitted.

Let

\[ t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}(1-h_{ii})}} \]

be the studentized deleted residual, which follows a t-distribution with n-p-1 df.

\[ (n-p)MSE = (n-p-1)MSE_{(i)}+ \frac{e^2_{i}}{1-h_{ii}} \]

hence, we do not need to fit regressions for each case and

\[ t_i = e_i (\frac{n-p-1}{SSE(1-h_{ii})-e^2_i})^{1/2} \]

The outlying Y-observations are those cases whose studentized deleted residuals are large in absolute value. If there are many residuals to consider, a Bonferroni critical value can be used (\(t_{1-\alpha/2n;\,n-p-1}\)).

Outlying X Observations

Recall, \(0 \le h_{ii} \le 1\) and \(\sum_{i=1}^{n}h_{ii}=p\) (the total number of parameters)

A large \(h_{ii}\) indicates that the i-th case is distant from the center of all X observations (the leverage of the i-th case). That is, a large value suggests that the observation exercises substantial leverage in determining the fitted value \(\hat{Y}_i\)

We have \(\mathbf{\hat{Y}=HY}\), a linear combination of Y-values; \(h_{ii}\) is the weight of the observation \(Y_i\); so \(h_{ii}\) measures the role of the X values in determining how important \(Y_i\) is in affecting the \(\hat{Y}_i\).

Large \(h_{ii}\) implies \(var(e_i)\) is small, so larger \(h_{ii}\) implies that \(\hat{Y}_i\) is close to \(Y_i\)

  • small data sets: \(h_{ii} > .5\) suggests "large".
  • large data sets: \(h_{ii} > \frac{2p}{n}\) is "large".

Using the hat matrix to identify extrapolation:

  • Let \(\mathbf{x_{new}}\) be a vector containing the X values for which an inference about a mean response or a new observation is to be made.
  • Let \(\mathbf{X}\) be the data design matrix used to fit the data. Then, if \(h_{new,new} = \mathbf{x_{new}'(X'X)^{-1}x_{new}}\) is within the range of leverage values (\(h_{ii}\)) for cases in the data set, no extrapolation is involved; otherwise, extrapolation is indicated.

Identifying Influential Cases:

By influential we mean that exclusion of an observation causes major changes in the fitted regression (not all outliers are influential).

  • Influence on Single Fitted Values: DFFITS
  • Influence on All Fitted Values: Cook’s D
  • Influence on the Regression Coefficients: DFBETAS
DFFITS

Influence on Single Fitted Values: DFFITS

\[ (DFFITS)_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}h_{ii}}} = t_i (\frac{h_{ii}}{1-h_{ii}})^{1/2} \]

  • the standardized difference between the i-th fitted value computed with all observations and with the i-th case removed.

  • the studentized deleted residual multiplied by a factor that is a function of the i-th leverage value.

  • influence if:

    • small to medium data sets: \(|DFFITS|>1\)
    • large data sets: \(|DFFITS|> 2 \sqrt{p/n}\)
Cook’s D

Influence on All Fitted Values: Cook’s D

\[ D_i = \frac{\sum_{j=1}^{n}(\hat{Y}_j - \hat{Y}_{j(i)})^2}{p(MSE)} \\ = \frac{e^2_i}{p(MSE)}(\frac{h_{ii}}{(1-h_{ii})^2}) \]

gives the influence of i-th case on all fitted values.

If \(e_i\) increases or \(h_{ii}\) increases, then \(D_i\) increases.

\(D_i\) is compared to the percentiles of an \(F_{(p,n-p)}\) distribution. If the corresponding percentile is greater than \(.5\) (50\%), then the i-th case has major influence. In practice, if \(D_i > 4/n\), the i-th case is often flagged as influential.

DFBETAS

Influence on the Regression Coefficients: DFBETAS

\[ (DFBETAS)_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}c_{kk}}} \]
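None of these influence measures has to be computed by hand; base R provides them for any fitted lm object (a sketch using model_lm):

hatvalues(model_lm)        # leverage values h_ii
rstudent(model_lm)         # studentized deleted residuals t_i
dffits(model_lm)           # DFFITS
cooks.distance(model_lm)   # Cook's D
dfbetas(model_lm)          # DFBETAS

# influence.measures() collects all of these and flags potentially influential cases
influence.measures(model_lm)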