Linear Regression Diagnostics in Stata

Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it

Quantitative Methods Lab, Lesson 6.1
5 Nov. 2024

Regression Diagnostics

Assumptions check:

Linearity
Normality
Homoskedasticity
Independence

Additional check:

Influential data points
No multicollinearity

Note: after you correct any variables or change your model, re-check the assumptions.

There are strict formal tests to check assumptions, as well as informal graphical methods. We will focus on the graphical methods for approximate checks.

Let’s use the built-in dataset: life expectancy, 1998 (lifeexp.dta) in Stata.

We want to study the relationship between Y, country life expectancy (lexp), and two predictors: x1, GNP per capita (gnppc), and x2, safe water (safewater).

help dta_examples

sysdescribe lifeexp.dta

Contains data                                 Life expectancy, 1998
 Observations:            68                  26 Mar 2022 09:40
    Variables:             6                  
-------------------------------------------------------------------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------------------------------------------------------------------
region          byte    %16.0g     region     Region
country         str28   %28s                  Country
popgrowth       float   %9.0g                 Avg. annual % growth
lexp            byte    %9.0g                 Life expectancy at birth
gnppc           float   %9.0g                 GNP per capita
safewater       byte    %9.0g                 Safe water
-------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:

sysuse lifeexp.dta

(Life expectancy, 1998)

1. Linearity

The expected outcome is a linear function of the predictors.

check the raw scatter plot

scatter lexp gnppc

show country unit labels

scatter lexp gnppc, mlabel(country)

check also another variable

scatter lexp safewater, mlabel(country)

change size of labels

scatter lexp gnppc, mlabel(country) mlabsize(tiny)

Nonlinear pattern and log transformation

fit linear and nonlinear lines

scatter lexp gnppc, mlabel(country)  || lfit lexp gnppc, lwidth(thick) || lowess lexp gnppc, lwidth(thick)

Note: LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric regression technique that draws a smooth line through a scatterplot. It is useful for visualizing relationships that may not follow a linear pattern.
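To see what "locally weighted" means in practice, the sketch below overlays two LOWESS curves with different bandwidths on the same scatter plot. It assumes a fresh copy of lifeexp.dta (it reloads the data, so run it separately from the other steps). The bandwidth 0.8 is Stata's default; 0.3 is only an illustrative choice, not a recommendation.

```stata
* reload the example data so the sketch runs on its own
sysuse lifeexp, clear

* wide bandwidth (0.8, the default) vs. a narrower one (0.3, illustrative)
twoway (scatter lexp gnppc) (lowess lexp gnppc, bwidth(0.8)) (lowess lexp gnppc, bwidth(0.3)), legend(order(2 "bwidth(0.8)" 3 "bwidth(0.3)"))
```

A smaller bandwidth uses fewer neighbouring points for each local fit, so the curve follows the data more closely but is also more wiggly.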

transform data using natural log to make it more linear

gen log_gnppc = log(gnppc)

(5 missing values generated)
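The message about missing values is worth a quick check: the log is undefined for zero or negative values, and it is also missing wherever the original variable is missing. A minimal sketch, assuming a fresh copy of lifeexp.dta, that separates the two cases and lists the affected countries:

```stata
* reload the example data so the sketch runs on its own
sysuse lifeexp, clear

* how many observations have no GNP per capita at all?
count if missing(gnppc)

* how many have a value for which the log is undefined?
* (missing counts as larger than any number, so it is not included here)
count if gnppc <= 0

* ln() and log() are synonyms in Stata: both give the natural log
generate log_gnppc = ln(gnppc)

* which countries drop out of any model that uses log_gnppc?
list country gnppc if missing(log_gnppc)
```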

check scatter again

scatter lexp log_gnppc, mlabel(country)  || lfit lexp log_gnppc, lwidth(thick) || lowess lexp log_gnppc, lwidth(thick)

Check linearity assumption in multiple regression

first regress lexp on log_gnppc safewater

regress lexp log_gnppc safewater

      Source |       SS           df       MS      Number of obs   =        37
-------------+----------------------------------   F(2, 34)        =     49.61
       Model |   715.51562         2   357.75781   Prob > F        =    0.0000
    Residual |  245.187083        34   7.2113848   R-squared       =    0.7448
-------------+----------------------------------   Adj R-squared   =    0.7298
       Total |  960.702703        36  26.6861862   Root MSE        =    2.6854

------------------------------------------------------------------------------
        lexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   log_gnppc |   1.661387   .5218804     3.18   0.003     .6007983    2.721975
   safewater |   .1396136   .0394113     3.54   0.001     .0595201     .219707
       _cons |   47.43692   2.725392    17.41   0.000     41.89825    52.97558
------------------------------------------------------------------------------

use acprplot (augmented component-plus-residual plot, also known as an augmented partial residual plot) to identify nonlinearities in the data

acprplot log_gnppc, lowess

acprplot safewater, lowess

In the plot of log_gnppc the smoothed line is very close to the ordinary regression line, and the entire pattern seems uniform. The plot of safewater does seem a bit more problematic at the left end.

2. Influential data

Extreme values may disproportionately bias model fit, because regression is sensitive to observations that deviate from the overall pattern.

Outlier: an observation with an unusual Y value given its X values (a large residual). It does not follow the general linear trend.

High-leverage point: an observation with an extreme X value. It may still follow the general linear trend.

Influential point: roughly the combination of an outlier and high leverage. It does not follow the trend, and removing it noticeably changes the estimated coefficients (slope and intercept).


Influential point


Why?

Data entry error
Confounding
Nonlinear relationship

Solutions:

Remove them
Transform data
Nonlinear models
Present both results with and without them

Check the influential points

regress lexp log_gnppc safewater

acprplot log_gnppc, lowess mlabel(country)

acprplot safewater, lowess mlabel(country)


use lvr2plot (leverage versus residual-squared plot, or L-R plot) to check for potential influential points

lvr2plot, mlabel(country)

The red horizontal and vertical lines mark the average leverage and the average normalized squared residual, respectively. They serve as guidelines for identifying influence rather than strict cutoffs.
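The quantities behind this plot can also be listed as numbers. The sketch below, assuming a fresh copy of lifeexp.dta, stores studentized residuals (how unusual Y is), leverage (how unusual X is), and Cook's distance (overall influence) after the regression. The variable names rstu, lev, and cook are made up for this example, and the 4/N cutoff for Cook's distance is just one common rule of thumb, not a strict threshold.

```stata
* reload the example data and refit the model so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
regress lexp log_gnppc safewater

* studentized residuals: outlyingness in Y
predict rstu, rstudent
* leverage (hat values): extremeness in X
predict lev, leverage
* Cook's distance: combines both into one influence measure
predict cook, cooksd

* list the countries above the 4/N rule of thumb, most influential first
gsort -cook
list country rstu lev cook if cook > 4/e(N) & !missing(cook)
```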

Remove influential points

drop if country == "Paraguay" | country == "Haiti"
tab country

(2 observations deleted)

                     Country |      Freq.     Percent        Cum.
-----------------------------+-----------------------------------
                     Albania |          1        1.52        1.52
                   Argentina |          1        1.52        3.03
                     Armenia |          1        1.52        4.55
                     Austria |          1        1.52        6.06
                  Azerbaijan |          1        1.52        7.58
                     Belarus |          1        1.52        9.09
                     Belgium |          1        1.52       10.61
                     Bolivia |          1        1.52       12.12
      Bosnia and Herzegovina |          1        1.52       13.64
                      Brazil |          1        1.52       15.15
                    Bulgaria |          1        1.52       16.67
                      Canada |          1        1.52       18.18
                       Chile |          1        1.52       19.70
                    Colombia |          1        1.52       21.21
                     Croatia |          1        1.52       22.73
                        Cuba |          1        1.52       24.24
              Czech Republic |          1        1.52       25.76
                     Denmark |          1        1.52       27.27
          Dominican Republic |          1        1.52       28.79
                     Ecuador |          1        1.52       30.30
                 El Salvador |          1        1.52       31.82
                     Estonia |          1        1.52       33.33
                     Finland |          1        1.52       34.85
                      France |          1        1.52       36.36
                     Georgia |          1        1.52       37.88
                     Germany |          1        1.52       39.39
                      Greece |          1        1.52       40.91
                   Guatemala |          1        1.52       42.42
                    Honduras |          1        1.52       43.94
                     Hungary |          1        1.52       45.45
                     Ireland |          1        1.52       46.97
                       Italy |          1        1.52       48.48
                     Jamaica |          1        1.52       50.00
                  Kazakhstan |          1        1.52       51.52
             Kyrgyz Republic |          1        1.52       53.03
                      Latvia |          1        1.52       54.55
                   Lithuania |          1        1.52       56.06
               Macedonia FYR |          1        1.52       57.58
                      Mexico |          1        1.52       59.09
                     Moldova |          1        1.52       60.61
                 Netherlands |          1        1.52       62.12
                   Nicaragua |          1        1.52       63.64
                      Norway |          1        1.52       65.15
                      Panama |          1        1.52       66.67
                        Peru |          1        1.52       68.18
                      Poland |          1        1.52       69.70
                    Portugal |          1        1.52       71.21
                 Puerto Rico |          1        1.52       72.73
                     Romania |          1        1.52       74.24
          Russian Federation |          1        1.52       75.76
             Slovak Republic |          1        1.52       77.27
                    Slovenia |          1        1.52       78.79
                       Spain |          1        1.52       80.30
                      Sweden |          1        1.52       81.82
                 Switzerland |          1        1.52       83.33
                  Tajikistan |          1        1.52       84.85
         Trinidad and Tobago |          1        1.52       86.36
                      Turkey |          1        1.52       87.88
                Turkmenistan |          1        1.52       89.39
                     Ukraine |          1        1.52       90.91
              United Kingdom |          1        1.52       92.42
               United States |          1        1.52       93.94
                     Uruguay |          1        1.52       95.45
                  Uzbekistan |          1        1.52       96.97
                   Venezuela |          1        1.52       98.48
Yugoslavia, FR (Serb./Mont.) |          1        1.52      100.00
-----------------------------+-----------------------------------
                       Total |         66      100.00

refit the model on the reduced sample, then plot again

regress lexp log_gnppc safewater

acprplot log_gnppc, lowess mlabel(country)

acprplot safewater, lowess mlabel(country)
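Dropping rows is permanent for the data in memory. If you prefer to present the results both with and without the influential points, one option is to keep all rows and exclude them with an if condition. The sketch below assumes a fresh copy of lifeexp.dta (it reloads the data, so run it separately from the steps in these notes); the stored names full and trimmed are arbitrary.

```stata
* reload the example data so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)

* model on all available observations
regress lexp log_gnppc safewater
estimates store full

* same model, excluding the two countries without deleting them
regress lexp log_gnppc safewater if !inlist(country, "Paraguay", "Haiti")
estimates store trimmed

* coefficients and standard errors side by side
estimates table full trimmed, b(%9.3f) se(%9.3f) stats(N r2)
```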

3. Normality

The residuals follow a normal distribution. Normality supports valid t-tests, p-values, and confidence intervals, especially in small samples, but it is not required to obtain unbiased estimates of the coefficients.

Note: we do not check the distribution of the outcome, but rather the distribution of the model residual.

regress again Y on transformed x1 log_gnppc and x2 safewater after dropping influential points

regress lexp log_gnppc safewater

      Source |       SS           df       MS      Number of obs   =        35
-------------+----------------------------------   F(2, 32)        =     46.87
       Model |  482.146494         2  241.073247   Prob > F        =    0.0000
    Residual |  164.596363        32  5.14363635   R-squared       =    0.7455
-------------+----------------------------------   Adj R-squared   =    0.7296
       Total |  646.742857        34  19.0218487   Root MSE        =     2.268

------------------------------------------------------------------------------
        lexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   log_gnppc |   1.684797   .4807282     3.50   0.001     .7055856    2.664008
   safewater |   .1141241   .0427183     2.67   0.012     .0271098    .2011383
       _cons |   49.30775   2.408392    20.47   0.000     44.40201    54.21348
------------------------------------------------------------------------------

use predict to generate residuals

predict residual, resid

(31 missing values generated)

use kdensity to produce a kernel density plot with the normal option requesting that a normal density be overlaid

kdensity residual, normal

plot residuals using a probability-probability plot (P-P plot) with pnorm, which is sensitive to non-normality in the middle of the distribution

pnorm residual

plot residuals using a quantile-quantile plot (Q-Q plot) with qnorm, which is sensitive to non-normality near the tails (recommended)

qnorm residual
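If you want a formal test next to the plots, Stata offers swilk (Shapiro-Wilk) and sktest (skewness and kurtosis). This is only a sketch, assuming a fresh copy of lifeexp.dta with the same two countries removed; res_check is a made-up variable name. With a sample this small the tests have little power, so read them together with the plots rather than instead of them.

```stata
* rebuild the reduced sample and refit the model so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")
regress lexp log_gnppc safewater

* residuals stored in double precision
predict double res_check, residuals

* H0 for both tests: the residuals come from a normal distribution
swilk res_check
sktest res_check
```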

Normality over sample sizes

Note: a larger sample size does not guarantee the normality of residuals! If the errors have a non-normal distribution in the population, the residuals will still reflect this, even with a larger sample.

Do not confuse this with the Central Limit Theorem (CLT), which applies to the distribution of sample means, not the distribution of the residuals in regression.

4. Homoskedasticity

The residuals, conditional on X, have constant variance, so the residuals plotted against the fitted values should show no pattern. In other words, the spread of the errors stays roughly the same across the whole range of fitted values.

Again, this ensures accurate standard errors.

plot the residuals vs. fitted/predicted values

rvfplot, yline(0)

General solution: robust standard errors (more on this later)
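As a preview, robust standard errors are requested with the vce(robust) option of regress. The sketch below, assuming a fresh copy of lifeexp.dta with the same two countries removed, also runs estat hettest (a Breusch-Pagan type test) after the ordinary fit; the stored names ols_se and robust_se are arbitrary.

```stata
* rebuild the reduced sample so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")

* ordinary standard errors, followed by a formal heteroskedasticity test
regress lexp log_gnppc safewater
estat hettest
estimates store ols_se

* same coefficients, heteroskedasticity-robust standard errors
regress lexp log_gnppc safewater, vce(robust)
estimates store robust_se

* only the standard errors should differ between the two columns
estimates table ols_se robust_se, b(%9.3f) se(%9.3f)
```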

5. Independence

After taking into account the predictors in our model, there is no remaining association between observations. Judging this requires substantive knowledge of your research design and data.

Potential dependence includes:

cluster effect
spatial association
serial correlation (autocorrelation)

General solution:

control for grouping variable
cluster standard errors by grouping variable
multilevel models (more on this later)
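The first two solutions can be sketched with the example data, using region as the grouping variable. This assumes a fresh copy of lifeexp.dta with the same two countries removed. It only illustrates the syntax: with just three regions, cluster-robust standard errors are not reliable, and Stata may be unable to report the overall F test.

```stata
* rebuild the reduced sample so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")

* option 1: control for the grouping variable with region indicators
regress lexp log_gnppc safewater i.region

* option 2: keep the original model, cluster the standard errors by region
regress lexp log_gnppc safewater, vce(cluster region)
```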

Plot dependence pattern

You can plot residuals against any variables to check if there is a pattern:

residuals vs. grouping variable
residuals vs. spatial variable
residuals vs. time variable

regress lexp log_gnppc safewater

      Source |       SS           df       MS      Number of obs   =        35
-------------+----------------------------------   F(2, 32)        =     46.87
       Model |  482.146494         2  241.073247   Prob > F        =    0.0000
    Residual |  164.596363        32  5.14363635   R-squared       =    0.7455
-------------+----------------------------------   Adj R-squared   =    0.7296
       Total |  646.742857        34  19.0218487   Root MSE        =     2.268

------------------------------------------------------------------------------
        lexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   log_gnppc |   1.684797   .4807282     3.50   0.001     .7055856    2.664008
   safewater |   .1141241   .0427183     2.67   0.012     .0271098    .2011383
       _cons |   49.30775   2.408392    20.47   0.000     44.40201    54.21348
------------------------------------------------------------------------------

predict r, res

in our case, let’s check if our observations (countries here) are dependent on the regions they belong to

tab region

          Region |      Freq.     Percent        Cum.
-----------------+-----------------------------------
Europe & C. Asia |         44       66.67       66.67
   North America |         13       19.70       86.36
   South America |          9       13.64      100.00
-----------------+-----------------------------------
           Total |         66      100.00

This is not an ideal example, though: we would prefer a grouping variable with more categories and more observations per category.

scatter r region  || lowess r region , yline(0)

Note: variable region is treated as numeric in the plot.
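Because region is really a category, a box plot or a small summary table per region can be easier to read than a scatter plot on a numeric axis. A minimal sketch, assuming a fresh copy of lifeexp.dta with the same two countries removed; res_by_region is a made-up variable name.

```stata
* rebuild the reduced sample and refit the model so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")
regress lexp log_gnppc safewater
predict double res_by_region, residuals

* one box of residuals per region, with a reference line at zero
graph box res_by_region, over(region) yline(0)

* the same comparison as numbers: count, mean, and spread per region
tabstat res_by_region, by(region) statistics(n mean sd)
```

If the residuals were independent of region, the boxes would all be centred near zero with a similar spread.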

6. No multicollinearity

Predictor variables included do not heavily predict each other.

When predictors are perfectly correlated, the regression estimates cannot be uniquely computed (the predictors are redundant). When predictors are highly but not perfectly correlated, the estimates can be computed, but their standard errors are inflated (more uncertainty in the estimates).

We can use the Variance Inflation Factor (VIF) to measure the amount of multicollinearity in a set of multiple regression predictors.

use vif after the regression to check for multicollinearity

regress lexp log_gnppc safewater
vif

      Source |       SS           df       MS      Number of obs   =        35
-------------+----------------------------------   F(2, 32)        =     46.87
       Model |  482.146494         2  241.073247   Prob > F        =    0.0000
    Residual |  164.596363        32  5.14363635   R-squared       =    0.7455
-------------+----------------------------------   Adj R-squared   =    0.7296
       Total |  646.742857        34  19.0218487   Root MSE        =     2.268

------------------------------------------------------------------------------
        lexp | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   log_gnppc |   1.684797   .4807282     3.50   0.001     .7055856    2.664008
   safewater |   .1141241   .0427183     2.67   0.012     .0271098    .2011383
       _cons |   49.30775   2.408392    20.47   0.000     44.40201    54.21348
------------------------------------------------------------------------------

    Variable |       VIF       1/VIF  
-------------+----------------------
   log_gnppc |      2.73    0.366215
   safewater |      2.73    0.366215
-------------+----------------------
    Mean VIF |      2.73

Perfectly uncorrelated predictors have VIFs of 1, and perfectly correlated predictors have VIFs of infinity.

Calculate VIF

VIF can be calculated by the formula below:

\[ \text{VIF} = \frac{1}{1 - R^2} = \frac{1}{\text{tolerance}} \]

The \(R^2\) (unadjusted) here comes from an auxiliary model in which one predictor is regressed on all other predictors from the original model.

In our case, to calculate the VIF for the predictor log_gnppc, we regress it on safewater:

regress log_gnppc safewater

      Source |       SS           df       MS      Number of obs   =        35
-------------+----------------------------------   F(1, 33)        =     57.11
       Model |  38.5191686         1  38.5191686   Prob > F        =    0.0000
    Residual |   22.257229        33  .674461485   R-squared       =    0.6338
-------------+----------------------------------   Adj R-squared   =    0.6227
       Total |  60.7763976        34  1.78754111   Root MSE        =    .82126

------------------------------------------------------------------------------
   log_gnppc | Coefficient  Std. err.      t    P>|t|     [95% conf. interval]
-------------+----------------------------------------------------------------
   safewater |   .0707432   .0093611     7.56   0.000      .051698    .0897885
       _cons |   2.628328   .7424533     3.54   0.001     1.117795     4.13886
------------------------------------------------------------------------------

So we can calculate by hand, and get the same result:

\(\frac{1}{1 - 0.6338} = 2.73\)
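The same calculation can be done inside Stata instead of by hand, using the stored result e(r2) from the auxiliary regression. A minimal sketch, assuming a fresh copy of lifeexp.dta with the same two countries removed; the if e(sample) condition keeps the auxiliary regression on exactly the observations used by the original model.

```stata
* rebuild the reduced sample so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")

* original model: defines the estimation sample
regress lexp log_gnppc safewater

* auxiliary model: one predictor regressed on the other predictor(s)
regress log_gnppc safewater if e(sample)

* VIF = 1 / (1 - R-squared of the auxiliary model)
display "VIF for log_gnppc = " %5.2f 1/(1 - e(r2))
```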

Rule of thumb: variables with VIF values above 10 should be investigated further (higher is worse). 1/VIF is the tolerance, which can also be used to check the degree of collinearity (lower is worse).

What if VIFs are high?

The simplest solutions are:

drop the predictor that displays a high correlation with other predictors in the model
combine problematic variables into composite variables (e.g., indices and latent factors)

Note: you might see very high VIFs in a model with polynomials/interaction terms, which is normal and sometimes inevitable.

fit the model without polynomial/interaction terms, and check the VIFs
centering variables might help sometimes
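To see how a polynomial term affects the VIFs and what centering does, the sketch below adds a squared term for safewater, first raw and then mean-centered. It assumes a fresh copy of lifeexp.dta with the same two countries removed. The squared term is added only to illustrate the VIF pattern; it is not part of the model in these notes, and the names safewater_sq, safewater_c, and safewater_c_sq are made up for this example.

```stata
* rebuild the reduced sample so the sketch runs on its own
sysuse lifeexp, clear
generate log_gnppc = ln(gnppc)
drop if inlist(country, "Paraguay", "Haiti")

* raw polynomial: safewater and its square are strongly related
generate safewater_sq = safewater^2
regress lexp log_gnppc safewater safewater_sq
vif

* center safewater at its mean in the estimation sample, then square it
summarize safewater if e(sample), meanonly
generate safewater_c = safewater - r(mean)
generate safewater_c_sq = safewater_c^2
regress lexp log_gnppc safewater_c safewater_c_sq
vif
```

Comparing the two vif tables shows whether centering reduces the VIFs of the linear and squared terms in this particular sample.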

Summary

For the postestimation commands available after regress, see the Stata documentation entry: regress postestimation diagnostic plots.

One more thingy…


These assumptions are based on the mathematical foundations of linear regression, but in the real world, they are rarely perfectly met, especially in the social sciences (e.g., perfectly normal distributions/linear relationships/…).

Again, “all models are wrong, but some are useful”.

In-class assignment

Run a multiple linear regression based on your proposed research, and check:

whether the assumptions are satisfied
to what extent they are satisfied
what you are going to do about any violations
