Giovanni Minchio giovanni.minchio@unitn.it
Yuxin Zhang yuxin.zhang@unitn.it
Quantitative Methods Lab, Lesson 6.1
5 Nov. 2024
Assumptions check: linearity, normality of residuals, homoscedasticity, independence of residuals.
Additional check: influential points, multicollinearity.
Note: after you make any variable corrections or change your model, you should re-check the assumptions.
There are strict formal tests for checking assumptions, as well as informal graphical methods. We will focus on the graphical methods for approximate checks.
Let’s use the built-in Stata dataset on life expectancy, 1998 (lifeexp.dta).
We want to study the relationship between Y: country life expectancy (lexp), x1: GNP per capita (gnppc), and x2: safe water (safewater).
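A minimal sketch of loading and inspecting the data (sysuse loads datasets shipped with Stata; describe produces the summary below):
sysuse lifeexp, clear   // load the built-in 1998 life expectancy dataset
describe                // list variables, types, and labels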
Contains data Life expectancy, 1998
Observations: 68 26 Mar 2022 09:40
Variables: 6
-------------------------------------------------------------------------------------------------------------------------------------------
Variable Storage Display Value
name type format label Variable label
-------------------------------------------------------------------------------------------------------------------------------------------
region byte %16.0g region Region
country str28 %28s Country
popgrowth float %9.0g Avg. annual % growth
lexp byte %9.0g Life expectancy at birth
gnppc float %9.0g GNP per capita
safewater byte %9.0g Safe water
-------------------------------------------------------------------------------------------------------------------------------------------
Sorted by:
(Life expectancy, 1998)
Linearity: the expected outcome is a linear function of the predictors.
- you can change the size of the marker labels (e.g., with the mlabsize() option)
scatter lexp gnppc, mlabel(country) || lfit lexp gnppc, lwidth(thick) || lowess lexp gnppc, lwidth(thick)
P.S.: LOWESS (Locally Weighted Scatterplot Smoothing) is a non-parametric regression technique that creates a smooth line through a scatterplot of data points. It is useful for visualizing relationships that may not follow a linear pattern.
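The note about missing values below comes from creating the log-transformed predictor; a sketch of that step, assuming the natural log is used (five countries have missing gnppc):
generate log_gnppc = log(gnppc)   // log of GNP per capita; missing where gnppc is missing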
(5 missing values generated)
scatter lexp log_gnppc, mlabel(country) || lfit lexp log_gnppc, lwidth(thick) || lowess lexp log_gnppc, lwidth(thick)
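The regression output below comes from fitting life expectancy on both predictors:
regress lexp log_gnppc safewater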
Source | SS df MS Number of obs = 37
-------------+---------------------------------- F(2, 34) = 49.61
Model | 715.51562 2 357.75781 Prob > F = 0.0000
Residual | 245.187083 34 7.2113848 R-squared = 0.7448
-------------+---------------------------------- Adj R-squared = 0.7298
Total | 960.702703 36 26.6861862 Root MSE = 2.6854
------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
log_gnppc | 1.661387 .5218804 3.18 0.003 .6007983 2.721975
safewater | .1396136 .0394113 3.54 0.001 .0595201 .219707
_cons | 47.43692 2.725392 17.41 0.000 41.89825 52.97558
------------------------------------------------------------------------------
acprplot produces an augmented component-plus-residual plot (augmented partial residual plot). It can be used to identify nonlinearities in the data. In the plot of log_gnppc, the smoothed line is very close to the ordinary regression line, and the entire pattern seems uniform. The plot of safewater does seem a bit more problematic at the left end.
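A sketch of the commands behind these plots; acprplot is a regress postestimation command, run once per predictor, with the lowess option adding the smoothed line:
acprplot log_gnppc, lowess   // augmented component-plus-residual plot for log_gnppc
acprplot safewater, lowess   // the same check for safewater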
Influential points: extreme values may disproportionately distort the model fit, because regression is sensitive to observations that deviate from the overall pattern.
Why? OLS minimizes the sum of squared residuals, so a single extreme observation can pull the fitted line strongly toward itself.
Solutions: inspect potentially influential observations, check them for data errors, and consider re-estimating the model without them.
Use lvr2plot (leverage versus residual-squared plot, or L-R plot) to check for potentially influential points. The red horizontal and vertical lines represent the average squared residuals and the average leverage, serving as guidelines for identifying influence rather than strict cutoffs.
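A minimal sketch: draw the L-R plot with country labels, then drop the flagged observations. The country names here are placeholders, not the actual flagged cases:
lvr2plot, mlabel(country)                                  // label points to identify them
drop if country == "CountryA" | country == "CountryB"      // hypothetical names; substitute the points flagged in your plot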
(2 observations deleted)
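tabulate confirms that 66 countries remain after the drop:
tabulate country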
Country | Freq. Percent Cum.
-----------------------------+-----------------------------------
Albania | 1 1.52 1.52
Argentina | 1 1.52 3.03
Armenia | 1 1.52 4.55
Austria | 1 1.52 6.06
Azerbaijan | 1 1.52 7.58
Belarus | 1 1.52 9.09
Belgium | 1 1.52 10.61
Bolivia | 1 1.52 12.12
Bosnia and Herzegovina | 1 1.52 13.64
Brazil | 1 1.52 15.15
Bulgaria | 1 1.52 16.67
Canada | 1 1.52 18.18
Chile | 1 1.52 19.70
Colombia | 1 1.52 21.21
Croatia | 1 1.52 22.73
Cuba | 1 1.52 24.24
Czech Republic | 1 1.52 25.76
Denmark | 1 1.52 27.27
Dominican Republic | 1 1.52 28.79
Ecuador | 1 1.52 30.30
El Salvador | 1 1.52 31.82
Estonia | 1 1.52 33.33
Finland | 1 1.52 34.85
France | 1 1.52 36.36
Georgia | 1 1.52 37.88
Germany | 1 1.52 39.39
Greece | 1 1.52 40.91
Guatemala | 1 1.52 42.42
Honduras | 1 1.52 43.94
Hungary | 1 1.52 45.45
Ireland | 1 1.52 46.97
Italy | 1 1.52 48.48
Jamaica | 1 1.52 50.00
Kazakhstan | 1 1.52 51.52
Kyrgyz Republic | 1 1.52 53.03
Latvia | 1 1.52 54.55
Lithuania | 1 1.52 56.06
Macedonia FYR | 1 1.52 57.58
Mexico | 1 1.52 59.09
Moldova | 1 1.52 60.61
Netherlands | 1 1.52 62.12
Nicaragua | 1 1.52 63.64
Norway | 1 1.52 65.15
Panama | 1 1.52 66.67
Peru | 1 1.52 68.18
Poland | 1 1.52 69.70
Portugal | 1 1.52 71.21
Puerto Rico | 1 1.52 72.73
Romania | 1 1.52 74.24
Russian Federation | 1 1.52 75.76
Slovak Republic | 1 1.52 77.27
Slovenia | 1 1.52 78.79
Spain | 1 1.52 80.30
Sweden | 1 1.52 81.82
Switzerland | 1 1.52 83.33
Tajikistan | 1 1.52 84.85
Trinidad and Tobago | 1 1.52 86.36
Turkey | 1 1.52 87.88
Turkmenistan | 1 1.52 89.39
Ukraine | 1 1.52 90.91
United Kingdom | 1 1.52 92.42
United States | 1 1.52 93.94
Uruguay | 1 1.52 95.45
Uzbekistan | 1 1.52 96.97
Venezuela | 1 1.52 98.48
Yugoslavia, FR (Serb./Mont.) | 1 1.52 100.00
-----------------------------+-----------------------------------
Total | 66 100.00
Normality: the residuals follow a normal distribution. This assures accurate standard errors, but it is not required to obtain unbiased estimates of the coefficients.
Note: we do not check the distribution of the outcome, but rather the distribution of the model residuals.
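A sketch of the command producing the output below, re-fitting the same specification on the reduced sample:
regress lexp log_gnppc safewater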
Model with x1 log_gnppc and x2 safewater after dropping the influential points:
Source | SS df MS Number of obs = 35
-------------+---------------------------------- F(2, 32) = 46.87
Model | 482.146494 2 241.073247 Prob > F = 0.0000
Residual | 164.596363 32 5.14363635 R-squared = 0.7455
-------------+---------------------------------- Adj R-squared = 0.7296
Total | 646.742857 34 19.0218487 Root MSE = 2.268
------------------------------------------------------------------------------
lexp | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
log_gnppc | 1.684797 .4807282 3.50 0.001 .7055856 2.664008
safewater | .1141241 .0427183 2.67 0.012 .0271098 .2011383
_cons | 49.30775 2.408392 20.47 0.000 44.40201 54.21348
------------------------------------------------------------------------------
Use predict to generate the residuals:
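A minimal sketch (the residual variable name r is our choice, not from the original):
predict r, resid   // residuals; missing for the 31 observations not used in the estimation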
(31 missing values generated)
Use kdensity to produce a kernel density plot, with the normal option requesting that a normal density be overlaid.
pnorm: sensitive to non-normality in the middle range of the data.
qnorm: sensitive to non-normality near the ends (recommended).
Note: a larger sample size does not guarantee the normality of residuals! If the errors have a non-normal distribution in the population, the residuals will still reflect this, even with a larger sample.
Do not confuse this with the Central Limit Theorem (CLT), which applies to the distribution of sample means, not the distribution of the residuals in regression.
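A sketch of the three checks, assuming the residuals were stored in r as above:
kdensity r, normal   // kernel density of residuals with a normal density overlaid
pnorm r              // standardized normal probability plot (middle-sensitive)
qnorm r              // quantiles of r against quantiles of the normal (tail-sensitive)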
Homoscedasticity: the residuals, conditional on X, have constant variance, so there should be no pattern when the residuals are plotted against the fitted values. In other words, the spread of the errors does not change systematically across observations. Like normality, this assumption assures accurate standard errors.
General solution: robust standard errors (more on this later)
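A sketch of both steps: rvfplot draws residuals against fitted values after regress, and vce(robust) requests robust standard errors:
rvfplot, yline(0)                               // look for fanning or other patterns around zero
regress lexp log_gnppc safewater, vce(robust)   // same model with robust standard errors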
Independence: after taking into account the predictors in our model, there is no remaining association between observations. Checking this requires expert knowledge of your research design and data.
Potential dependence includes: repeated measurements of the same units, clustering (e.g., individuals within regions), serial correlation over time, and spatial proximity.
General solution: model the dependence explicitly (e.g., cluster-robust standard errors or multilevel models).
You can plot the residuals against any variable to check whether there is a pattern:
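For instance, a sketch plotting the stored residuals r against region (which, as noted below, is treated as numeric in the plot):
scatter r region, mlabel(country)   // residuals by region, labeled by country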
Region | Freq. Percent Cum.
-----------------+-----------------------------------
Europe & C. Asia | 44 66.67 66.67
North America | 13 19.70 86.36
South America | 9 13.64 100.00
-----------------+-----------------------------------
Total | 66 100.00
This is not an ideal example, though: we would prefer a grouping variable with more categories and larger group sizes.
Note: the variable region is treated as numeric in the plot.
No multicollinearity: the predictor variables included do not heavily predict each other.
When there is perfect correlation among the predictors, the estimates for a regression model cannot be uniquely computed (redundancy in predictors); strong but imperfect correlation inflates standard errors (uncertainty in the estimates).
We can use the Variance Inflation Factor (VIF) to measure the amount of multicollinearity in a set of multiple regression predictors.
Run vif immediately after the regression above to check for multicollinearity:
Variable | VIF 1/VIF
-------------+----------------------
log_gnppc | 2.73 0.366215
safewater | 2.73 0.366215
-------------+----------------------
Mean VIF | 2.73
Perfectly uncorrelated predictors have VIFs of 1, and perfectly correlated predictors have VIFs of infinity.
VIF can be calculated with the formula below:
\[ \text{VIF} = \frac{1}{1 - R^2} = \frac{1}{\text{tolerance}} \]
The \(R^2\) (unadjusted) here is retrieved from an auxiliary model in which we regress one predictor on all the other predictors from the original model.
For example, the auxiliary regression for log_gnppc:
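The command that produced the output below, regressing log_gnppc on the other predictor:
regress log_gnppc safewater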
Source | SS df MS Number of obs = 35
-------------+---------------------------------- F(1, 33) = 57.11
Model | 38.5191686 1 38.5191686 Prob > F = 0.0000
Residual | 22.257229 33 .674461485 R-squared = 0.6338
-------------+---------------------------------- Adj R-squared = 0.6227
Total | 60.7763976 34 1.78754111 Root MSE = .82126
------------------------------------------------------------------------------
log_gnppc | Coefficient Std. err. t P>|t| [95% conf. interval]
-------------+----------------------------------------------------------------
safewater | .0707432 .0093611 7.56 0.000 .051698 .0897885
_cons | 2.628328 .7424533 3.54 0.001 1.117795 4.13886
------------------------------------------------------------------------------
So we can calculate it by hand and get the same result:
\(\frac{1}{1 - 0.6338} = 2.73\)
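A quick sketch to verify this in Stata, using the R-squared stored by the auxiliary regression:
display 1/(1 - e(r2))   // run immediately after the auxiliary regression; prints roughly 2.73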
Rule of thumb: variables with VIF values > 10 should be further investigated (higher is worse). 1/VIF is the tolerance, which can also be used to check the degree of collinearity (lower is worse).
Simplest solutions would be: drop one of the highly collinear predictors, or combine them into a single index.
Note: you might see very high VIFs in a model with polynomials/interaction terms, which is normal and sometimes inevitable.
See the postestimation commands available after regress: regress postestimation diagnostic plots.
These assumptions are based on the mathematical foundations of linear regression, but in the real world they are rarely perfectly met, especially in the social sciences (e.g., perfectly normal distributions, perfectly linear relationships, …).
Again, “all models are wrong, but some are useful”.
Run a multiple linear regression based on your proposed research, and verify whether the assumptions above hold.