OLS Assumptions & Diagnostics

CD2010 Introduction to Econometrics

Author
Affiliation

Sergio Castellanos-Gamboa, PhD

Tecnológico de Monterrey

Published

August 28, 2025

1 Introduction

This workshop tests the OLS assumptions on a one‑year cross‑sectional dataset (Carseats, ISLR). We estimate two specifications and then apply diagnostics in sections organized by assumption:

  1. a Lin–Lin model with variables in levels; and

  2. a Log–Log model where the dependent variable and monetary regressors are in logs.

Each section explains the assumption, why it matters (unbiasedness/consistency or efficiency), then run the statistic tests and interprets them. Finally the section presents possible solutions.

Mathematically, the fitted model is y_i \;=\; \beta_0 \;+\; \beta_1 x_{i1} \;+\; \cdots \;+\; \beta_k x_{ik} \;+\; u_i, with \hat\beta=(X'X)^{-1}X'y and residuals \hat u=y-X\hat\beta, or said differently \hat u=y-\hat y. When the classical assumptions hold, OLS is unbiased and (with homoscedastic uncorrelated errors) BLUE; with normal errors, small‑sample t and F tests are exact.

2 Quick reference: OLS assumptions (cross‑section)

Assumption Meaning (plain) Consequence if violated Practical remedy
Linearity in parameters Model is linear in \beta. Logs, polynomials, and interactions are allowed (still linear in \beta). Wrong functional form → biased/inefficient estimates. Add missing terms (polynomials, interactions); re‑specify.
No perfect multicollinearity No regressor is an exact linear combination of others. Unstable/undefined estimates; huge standard errors. Drop/redefine variables; avoid redundant transforms.
homoscedasticity \mathrm{Var}(u|X)=\sigma^2 constant across observations. Classical SEs invalid; OLS not efficient. Use robust SEs (HC1).
Normality (for exact small‑sample inference) u \sim \mathcal{N}(0,\sigma^2). Small‑sample inference unreliable; asymptotics OK in large n. Use robust/asymptotic inference; check tails/outliers. Increase sample whenever possible.
No autocorrelation (cross‑section) Errors uncorrelated across units. SEs biased; inference can mislead (time/panel/spatial data). Cluster/serial/spatial robust SE; model dependence.
Exogeneity (Zero conditional mean) E[u|X]=0. Regressors uncorrelated with omitted factors in u. Bias & inconsistency (Omitted variable, simultaneity, measurement error). Add controls, use Instrumental Variables, panel Fixed Effects, or design; argue identification with theory and logic.

3 Data, variables, and descriptive statistics

Carseats contains information for a company that sales car seats across 400 shops (number of observations) in the US and abroad. The variables in the data set are the following:

  • Sales: Car seats unit sales (in thousands) at each location.
  • CompPrice: Price charged by competitor near each location.
  • Income: Community income level (in thousands of dollars).
  • Advertising: Local advertising budget for company at each location (in thousands of dollars).
  • Population: Population size in region (in thousands).
  • Price: Price company charges for car seats at each site.
  • ShelveLoc: Factor with levels Bad/Medium/Good (quality of shelf location).
  • Age: Average age of the local population.
  • Education: Education level at each location.
  • Urban: Whether the store is in an urban or rural location.
  • US: Whether the store is in the US or not.
# Import libraries for data management
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Import libraries for model estimations
import statsmodels.api as sm
import statsmodels.formula.api as smf

# Import function for regression outputs (nice tables)
from statsmodels.iolib.summary2 import summary_col

# Import functions for statistic tests
######################################

# Functional form RESET
from statsmodels.stats.diagnostic import linear_reset
# Heteroscedasticity White &  Breusch Pagan
from statsmodels.stats.diagnostic import het_breuschpagan, het_white
# Normality Jarque Bera & Omnibus
from statsmodels.stats.stattools import jarque_bera, omni_normtest
# Q-Q plots for normality
from statsmodels.graphics.gofplots import qqplot

# Others 
# Autocorrelation Durbin Watson & Breusch Godfrey
from statsmodels.stats.stattools import durbin_watson
from statsmodels.stats.diagnostic import acorr_breusch_godfrey

# How many decimals in your outputs? Here 2
pd.options.display.float_format = "{:.2f}".format

# Load Carseats
carseats = sm.datasets.get_rdataset("Carseats", package="ISLR", cache=True).data.copy()

# Categoricals as strings
for c in ["ShelveLoc","Urban","US"]:
    carseats[c] = carseats[c].astype(str)

# Some variables are strings (text) so we convert them to binaries to get their 
# descriptive statistics (mean represents frequency in %)
carseats["Urban01"] = (carseats["Urban"]=="Yes").astype(int)
carseats["US01"]    = (carseats["US"]=="Yes").astype(int)

# Log transforms (monetary) + optional log DV
# eps = epsilon, very small number to avoid Log(0).
# eps could also be something very small as 1e-6 (0.000006)
eps = 1 # Here I chose 1 to avoid negative numbers
carseats["log_Sales"]       = np.log(carseats["Sales"] + eps)
carseats["log_Price"]       = np.log(carseats["Price"] + eps)
carseats["log_CompPrice"]   = np.log(carseats["CompPrice"] + eps)
carseats["log_Income"]      = np.log(carseats["Income"] + eps)
carseats["log_Advertising"] = np.log(carseats["Advertising"] + eps)
carseats["log_Population"] = np.log(carseats["Population"] + eps)

# Descriptive statistics (levels/logs + binary shares)
# Create object with variables for descriptive statistics
desc_vars = [
    "Sales","Price","CompPrice","Income","Advertising","Age","Education", "Population",
    "log_Sales","log_Price","log_CompPrice","log_Income","log_Advertising", "log_Population",
    "Urban01","US01"
]
# Create object with descriptive statistics
desc = carseats[desc_vars].describe().T
# Report contents of object (in this case descrptive statistics)
desc
count mean std min 25% 50% 75% max
Sales 400.00 7.50 2.82 0.00 5.39 7.49 9.32 16.27
Price 400.00 115.80 23.68 24.00 100.00 117.00 131.00 191.00
CompPrice 400.00 124.97 15.33 77.00 115.00 125.00 135.00 175.00
Income 400.00 68.66 27.99 21.00 42.75 69.00 91.00 120.00
Advertising 400.00 6.63 6.65 0.00 0.00 5.00 12.00 29.00
Age 400.00 53.32 16.20 25.00 39.75 54.50 66.00 80.00
Education 400.00 13.90 2.62 10.00 12.00 14.00 16.00 18.00
Population 400.00 264.84 147.38 10.00 139.00 272.00 398.50 509.00
log_Sales 400.00 2.07 0.39 0.00 1.85 2.14 2.33 2.85
log_Price 400.00 4.74 0.22 3.22 4.62 4.77 4.88 5.26
log_CompPrice 400.00 4.83 0.12 4.36 4.75 4.84 4.91 5.17
log_Income 400.00 4.15 0.46 3.09 3.78 4.25 4.52 4.80
log_Advertising 400.00 1.46 1.19 0.00 0.00 1.79 2.56 3.40
log_Population 400.00 5.33 0.84 2.40 4.94 5.61 5.99 6.23
Urban01 400.00 0.70 0.46 0.00 0.00 1.00 1.00 1.00
US01 400.00 0.65 0.48 0.00 0.00 1.00 1.00 1.00

4 Model definitions

Two baseline models (in python/R notation), re‑used across all diagnostics.

  • Lin–Lin (levels)
    \text{Sales} \sim \text{Price}\times C(\text{ShelveLoc}) + \text{CompPrice} + \text{Income} + \text{Advertising} + \text{Age} + \text{Education}.

  • Log–Log (logs for monetary; DV in logs)
    \log(\text{Sales}) \sim \log(\text{Price})\times C(\text{ShelveLoc}) + \log(\text{CompPrice}) + \log(\text{Income}) + \log(\text{Advertising}) + \text{Age} + \text{Education}.

# Formulas
formula_lin = (
    "Sales ~ Price * C(ShelveLoc) + CompPrice + Income + Advertising + "
    "Age + Education + Population"
)
formula_log = (
    "log_Sales ~ log_Price * C(ShelveLoc) + log_CompPrice + log_Income + "
    "log_Advertising + Age + Education + log_Population"
)

# Fit
m_lin = smf.ols(formula=formula_lin, data=carseats).fit()
m_log = smf.ols(formula=formula_log, data=carseats).fit()

# --- Robust / Clustered standard errors (FOR LATER) --------------------
# HC1: Huber–White "sandwich" covariance with a small-sample correction
# Cluster: Cluster-robust covariance by 'US' group. Allows arbitrary correlation
#          and heteroscedasticity within each cluster (US=Yes/No here); assumes
#          independence across clusters. Prefer this when clustering is relevant.
groups = pd.Categorical(carseats["US"]).codes

m_lin_hc1 = m_lin.get_robustcov_results(cov_type="HC1")
m_lin_clu = m_lin.get_robustcov_results(cov_type="cluster", groups=groups)

m_log_hc1 = m_log.get_robustcov_results(cov_type="HC1")
m_log_clu = m_log.get_robustcov_results(cov_type="cluster", groups=groups)


# Print the output of the two models for interpretation
print(m_lin.summary())
print(m_log.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Sales   R-squared:                       0.873
Model:                            OLS   Adj. R-squared:                  0.869
Method:                 Least Squares   F-statistic:                     242.1
Date:                Thu, 28 Aug 2025   Prob (F-statistic):          3.95e-166
Time:                        21:20:50   Log-Likelihood:                -569.92
No. Observations:                 400   AIC:                             1164.
Df Residuals:                     388   BIC:                             1212.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
================================================================================================
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept                        6.0591      0.787      7.701      0.000       4.512       7.606
C(ShelveLoc)[T.Good]             4.1641      0.750      5.552      0.000       2.689       5.639
C(ShelveLoc)[T.Medium]           1.5504      0.633      2.447      0.015       0.305       2.796
Price                           -0.0986      0.005    -21.035      0.000      -0.108      -0.089
Price:C(ShelveLoc)[T.Good]       0.0058      0.006      0.916      0.360      -0.007       0.018
Price:C(ShelveLoc)[T.Medium]     0.0035      0.005      0.653      0.514      -0.007       0.014
CompPrice                        0.0930      0.004     22.217      0.000       0.085       0.101
Income                           0.0157      0.002      8.469      0.000       0.012       0.019
Advertising                      0.1144      0.008     14.242      0.000       0.099       0.130
Age                             -0.0464      0.003    -14.419      0.000      -0.053      -0.040
Education                       -0.0208      0.020     -1.057      0.291      -0.060       0.018
Population                       0.0002      0.000      0.659      0.510      -0.000       0.001
==============================================================================
Omnibus:                        0.888   Durbin-Watson:                   1.993
Prob(Omnibus):                  0.641   Jarque-Bera (JB):                0.828
Skew:                           0.111   Prob(JB):                        0.661
Kurtosis:                       3.003   Cond. No.                     7.40e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.4e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:              log_Sales   R-squared:                       0.759
Model:                            OLS   Adj. R-squared:                  0.752
Method:                 Least Squares   F-statistic:                     110.8
Date:                Thu, 28 Aug 2025   Prob (F-statistic):          2.23e-112
Time:                        21:20:50   Log-Likelihood:                 92.388
No. Observations:                 400   AIC:                            -160.8
Df Residuals:                     388   BIC:                            -112.9
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
====================================================================================================
                                       coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------------------
Intercept                            2.2525      0.595      3.786      0.000       1.083       3.422
C(ShelveLoc)[T.Good]                -2.5232      0.637     -3.963      0.000      -3.775      -1.272
C(ShelveLoc)[T.Medium]              -1.9109      0.536     -3.568      0.000      -2.964      -0.858
log_Price                           -1.7326      0.101    -17.125      0.000      -1.931      -1.534
log_Price:C(ShelveLoc)[T.Good]       0.6607      0.134      4.924      0.000       0.397       0.924
log_Price:C(ShelveLoc)[T.Medium]     0.4674      0.113      4.130      0.000       0.245       0.690
log_CompPrice                        1.5226      0.099     15.406      0.000       1.328       1.717
log_Income                           0.1340      0.022      6.201      0.000       0.092       0.176
log_Advertising                      0.0789      0.008      9.374      0.000       0.062       0.095
Age                                 -0.0062      0.001    -10.125      0.000      -0.007      -0.005
Education                           -0.0057      0.004     -1.515      0.131      -0.013       0.002
log_Population                       0.0224      0.012      1.855      0.064      -0.001       0.046
==============================================================================
Omnibus:                      189.406   Durbin-Watson:                   2.026
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1514.889
Skew:                          -1.837   Prob(JB):                         0.00
Kurtosis:                      11.797   Cond. No.                     5.36e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 5.36e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

5 Interpreting coefficients by model form

Model form Regression spec Units of βj Interpretation
Lin–Lin y = \alpha + \beta x + u units of y per unit of x \Delta y \approx \beta units, \uparrow 1 unit of X \rightarrow 1 unit of y.
Log–Log (elasticity) \ln y = \alpha + \beta \ln x + u elasticity (percent change in y per percent change in x) For a 1% increase in x: \%\Delta y \approx \beta\%.
Lin–Log (semi-elasticity of y w.r.t. log x) y = \alpha + \beta \ln x + u units of y per 1 unit of \ln x For a 1% increase in x: \Delta y \approx 0.01\,\beta units.
Log–Lin (semi-elasticity of log y w.r.t. x) \ln y = \alpha + \beta x + u proportional change in y per unit of x Percent change in y for +1 unit in x: approx 100\beta\%.

Notes:

  • Semi-elasticity vs elasticity:
    • Lin–Log: β is a semi-elasticity of y with respect to x: % change in x ⇒ level change in y.
    • Log–Lin: β is a semi-elasticity of y with respect to x: level change in x ⇒ % change in y.
    • Log–Log: β is an elasticity: % change in x ⇒ % change in y.
  • “1%” vs “1 pp” (percentage point) increases:
    • A 1% increase is relative: multiply the level by 1.01 (e.g., 50 → 50.5).
    • A 1 percentage point (pp) increase is absolute on a rate in percent: add 1 point (e.g., 10% → 11%).
      • Example: going from 10% to 11% is +1 pp but a +10% relative increase in the rate. Do not confuse % with pp in interpretations.

6 Linearity in parameters

The model must be linear in coefficients β. You can use transformations (logs), polynomials (e.g., x^2), and interactions (x_1 x_2) and still estimate with OLS because the model remains linear in β. If important curvature or interactions are missing, coefficients can be biased and predictions poor.

You can use Ramsey RESET specification test to check whether powers of the fitted values add explanatory power. In that case, the functional form may be misspecified (e.g., missing polynomials or interactions).

First, we check for the model in levels (lin-lin):

# Ramsey RESET, powers of 2 and 3, F-test
reset_res2_m_lin = linear_reset(m_lin, power=2, use_f=True)
reset_res3_m_lin = linear_reset(m_lin, power=3, use_f=True)

reset_table_m_lin = pd.DataFrame({
    "Power": [2, 3],
    "F stat": [reset_res2_m_lin.fvalue, reset_res3_m_lin.fvalue],
    "p-value": [reset_res2_m_lin.pvalue, reset_res3_m_lin.pvalue],
})
reset_table_m_lin
Power F stat p-value
0 2 0.17 0.68
1 3 0.11 0.90

The table suggests that the model is well specified, and therfore, wouldn’t need more polynomial terms.

Now, we look at the model in logarithms (elasticities):

# Ramsey RESET, powers of 2 and 3, F-test
reset_res2_m_log = linear_reset(m_log, power=2, use_f=True)
reset_res3_m_log = linear_reset(m_log, power=3, use_f=True)

reset_table_m_log = pd.DataFrame({
    "Power": [2, 3],
    "F stat": [reset_res2_m_log.fvalue, reset_res3_m_log.fvalue],
    "p-value": [reset_res2_m_log.pvalue, reset_res3_m_log.pvalue],
})
reset_table_m_log
Power F stat p-value
0 2 117.55 0.00
1 3 61.24 0.00

In this case, the tests suggests that the model might be missing some cuadratic, or cubic terms. However, it woun’t tell us which terms are missing, and hence, it is more of a trial and error check. For cuadratic terms, theory could suggest some important curvatures, however, for cubic terms, this might be more challenging.

7 No perfect multicollinearity

Exact linear dependence among regressors makes (X'X)^{-1} undefined; near‑dependence inflates standard errors and destabilizes estimates. If that is the case, you should drop or redefine your variables. We check for this issue by analyzing the correlation among continuos independent variables. Notice that we are interested in evaluating the relationship between the dependent and the independent variables, so if there is a high correlation between y and one of the x variables, that is ok.

Let’s look at the correlation table for the linear model:

# Correlation among variables (with dependent but issue among independents)
corr_vars_lin = [ "Sales", "Price", "CompPrice", "Income", "Advertising", "Age", "Education", "Population"]
corr_lin = carseats[corr_vars_lin].corr()
corr_lin
Sales Price CompPrice Income Advertising Age Education Population
Sales 1.00 -0.44 0.06 0.15 0.27 -0.23 -0.05 0.05
Price -0.44 1.00 0.58 -0.06 0.04 -0.10 0.01 -0.01
CompPrice 0.06 0.58 1.00 -0.08 -0.02 -0.10 0.03 -0.09
Income 0.15 -0.06 -0.08 1.00 0.06 -0.00 -0.06 -0.01
Advertising 0.27 0.04 -0.02 0.06 1.00 -0.00 -0.03 0.27
Age -0.23 -0.10 -0.10 -0.00 -0.00 1.00 0.01 -0.04
Education -0.05 0.01 0.03 -0.06 -0.03 0.01 1.00 -0.11
Population 0.05 -0.01 -0.09 -0.01 0.27 -0.04 -0.11 1.00

And now, for the logarithmic model:

# Correlation among variables (with dependent but issue among independents)
corr_vars_log = [ "log_Sales", "log_Price", "log_CompPrice", "log_Income", "log_Advertising", "Age", "Education", "log_Population"]
corr_log = carseats[corr_vars_log].corr()
corr_log
log_Sales log_Price log_CompPrice log_Income log_Advertising Age Education log_Population
log_Sales 1.00 -0.41 0.06 0.16 0.23 -0.22 -0.06 0.07
log_Price -0.41 1.00 0.59 -0.06 0.07 -0.09 -0.02 0.05
log_CompPrice 0.06 0.59 1.00 -0.06 -0.01 -0.10 0.01 -0.05
log_Income 0.16 -0.06 -0.06 1.00 0.06 -0.00 -0.04 -0.03
log_Advertising 0.23 0.07 -0.01 0.06 1.00 0.00 -0.06 0.21
Age -0.22 -0.09 -0.10 -0.00 0.00 1.00 0.01 -0.05
Education -0.06 -0.02 0.01 -0.04 -0.06 0.01 1.00 -0.13
log_Population 0.07 0.05 -0.05 -0.03 0.21 -0.05 -0.13 1.00

8 Normality

In the classical linear model, we assume the regression errors are normally distributed. This is not required for OLS coefficients to be unbiased, nor for consistency. We care about normality because, in small samples, it makes the usual t and F tests exact. When residuals are clearly non-normal (strong skew, heavy tails, outliers), those small-sample formulas can misbehave. So, lack of normality mainly threatens small-sample inference, not the point estimates themselves. In moderate/large samples, central-limit (theorem not discussed here) arguments and robust standard errors usually carry the day.

8.1 Practical consequences if residuals aren’t normal

  • t/F tests can be unreliable in small samples (p-values too optimistic or too conservative).
  • Confidence intervals may be off.
  • Outliers (heavy tails) can dominate fit and significance.
  • In many cross-sections with decent n, switching to heteroskedasticity-robust (HC) or clustered standard errors (by relevant groups) and dealing with outliers is often enough.

8.2 How to check normality

  1. Start with plots:
    • Histogram of residuals: look for strong skew or very long tails.
    • Q–Q plot: plot residual quantiles against a normal. Systematic S- or bow-shapes indicate skew/tails; pointy ends suggest heavy tails. Also, quick visual cue; don’t over-interpret minor wiggles. Look for big, systematic departures.
# Histogram plot with function plt.hist(resid) and overlaying a normal distribution.

# Flag to control whether we draw the fitted normal curve on top of the histogram.
overlay_normal = True  # Set to False to show only the histogram

# Loop over the two fitted models, giving each a label for the plot title.
for label, model in [("Lin–Lin", m_lin), ("Log–Log", m_log)]:
    # Grab the residuals from the current model (these are what we want to inspect).
    resid = model.resid

    # Start a new figure for this model; figsize controls width x height in inches.
    plt.figure(figsize=(6, 4))

    # Plot a histogram of residuals. Using density=True scales the bars so that the
    # total area equals 1, which puts them on the same scale as a probability density.
    plt.hist(resid, bins=20, density=True)

    if overlay_normal:
        # Estimate the parameters of a normal distribution from the residuals:
        # mu = mean; sigma = standard deviation. ddof=0 uses the population SD formula.
        mu = np.mean(resid)
        sigma = np.std(resid, ddof=0)

        # Create an evenly spaced grid over a range that covers most residual mass:
        # mean ± 4 standard deviations usually captures the tails we care about.
        x = np.linspace(mu - 4 * sigma, mu + 4 * sigma, 300)

        # Compute the Normal(μ, σ²) probability density function at each grid point.
        # This is the closed-form formula for the bell curve.
        y = (1 / (sigma * np.sqrt(2 * np.pi))) * np.exp(-0.5 * ((x - mu) / sigma) ** 2)

        # Draw the fitted normal curve on top of the histogram.
        plt.plot(x, y, linewidth=2, label="Normal PDF")
        plt.legend()  # Show the legend so the "Normal PDF" label appears.

    # Add axis labels and a descriptive title with the model label.
    plt.xlabel("Residual")
    plt.ylabel("Density")
    plt.title(f"Histogram of residuals — {label}")

    # Tighten the layout so labels/titles fit nicely inside the figure canvas.
    plt.tight_layout()

    # Render the plot for this model before moving to the next loop iteration.
    plt.show()

# Q-Q plot with function qqplot(resid)

for label, model in [("Lin–Lin", m_lin), ("Log–Log", m_log)]:
    resid = model.resid
    
    plt.figure(figsize=(6,4))
    qqplot(resid, line="s", fit=True)
    plt.title(f"Q–Q plot of residuals — {label}")
    plt.tight_layout()
    plt.show()
<Figure size 576x384 with 0 Axes>

<Figure size 576x384 with 0 Axes>

  1. Run statistic tests
    • Jarque–Bera (JB): combines skewness and kurtosis into one statistic.
      Intuition: a normal has skewness (=0) and kurtosis (=3); departures raise JB.
      Decision: small p-value → evidence against normality.
    • Omnibus test (D’Agostino–Pearson): another combined test using skew/kurt; same interpretation: small p → non-normal.
# Jarque–Bera normality test — Linear Model
print("Jarque–Bera normality test — Linear Model")
jb_stat_lin, jb_p_lin, jb_skew_lin, jb_kurt_lin = jarque_bera(m_lin.resid)
jb_results_lin = pd.Series({
    "Jarque–Bera stat": round(jb_stat_lin, 4),
    "Jarque–Bera p-value": round(jb_p_lin, 4),
    "Skewness": round(jb_skew_lin, 4),
    "Kurtosis": round(jb_kurt_lin, 4),
    "Decision (α=0.05)": "Reject H0 (non-normality)" if jb_p_lin < 0.05 else "Fail to reject H0 (normality)"
})
jb_results_lin
Jarque–Bera normality test — Linear Model
Jarque–Bera stat                                0.83
Jarque–Bera p-value                             0.66
Skewness                                        0.11
Kurtosis                                        3.00
Decision (α=0.05)      Fail to reject H0 (normality)
dtype: object
# Jarque–Bera normality test — Logarithmic Model
print("Jarque–Bera normality test — Logarithmic Model")
jb_stat_log, jb_p_log, jb_skew_log, jb_kurt_log = jarque_bera(m_log.resid)
jb_results_log = pd.Series({
    "Jarque–Bera stat": round(jb_stat_log, 4),
    "Jarque–Bera p-value": round(jb_p_log, 4),
    "Skewness": round(jb_skew_log, 4),
    "Kurtosis": round(jb_kurt_log, 4),
    "Decision (α=0.05)": "Reject H0 (non-normality)" if jb_p_log < 0.05 else "Fail to reject H0 (normality)"
})
jb_results_log
Jarque–Bera normality test — Logarithmic Model
Jarque–Bera stat                         1514.89
Jarque–Bera p-value                         0.00
Skewness                                   -1.84
Kurtosis                                   11.80
Decision (α=0.05)      Reject H0 (non-normality)
dtype: object
# Omnibus (D’Agostino–Pearson) normality test — Linear Model
print("Omnibus normality test — Linear Model")
om_stat_lin, om_p_lin = omni_normtest(m_lin.resid)[:2]
omni_results_lin = pd.Series({
    "Omnibus stat": round(om_stat_lin, 4),
    "Omnibus p-value": round(om_p_lin, 4),
    "Decision (α=0.05)": "Reject H0 (non-normality)" if om_p_lin < 0.05 else "Fail to reject H0 (normality)"
})
omni_results_lin
Omnibus normality test — Linear Model
Omnibus stat                                  0.89
Omnibus p-value                               0.64
Decision (α=0.05)    Fail to reject H0 (normality)
dtype: object
# Omnibus (D’Agostino–Pearson) normality test — Logarithmic Model
print("Omnibus normality test — Logarithmic Model")
om_stat_log, om_p_log = omni_normtest(m_log.resid)[:2]
omni_results_log = pd.Series({
    "Omnibus stat": round(om_stat_log, 4),
    "Omnibus p-value": round(om_p_log, 4),
    "Decision (α=0.05)": "Reject H0 (non-normality)" if om_p_log < 0.05 else "Fail to reject H0 (normality)"
})
omni_results_log
Omnibus normality test — Logarithmic Model
Omnibus stat                            189.41
Omnibus p-value                           0.00
Decision (α=0.05)    Reject H0 (non-normality)
dtype: object
  1. Interpret
    • If p ≥ 0.05 and the plots look okay → you’re fine for small-sample t/F (keep an eye on outliers).
    • If p < 0.05 or plots show strong departures → rely on robust/clustered SE, and consider a specification tweak (e.g., logging a heavily right-skewed DV).

8.3 What to do if normality looks bad

  • Use robust inference: report HC1 (or clustered) standard errors. This addresses inference even if residuals aren’t normal. Even when residuals look normal, robust SEs are standard practice in cross-sectional work.
  • Check for outliers/influence: a few extreme observations can break normality. Consider reporting results with and without them (but give a substantive reason).
  • Transformations when sensible: if the DV is strictly positive and skewed, a log specification can tame tails and make residuals more Gaussian-looking.
  • Model misspecification: sometimes non-normality is a symptom of a missing term (e.g., a square or interaction). A quick RESET check can help decide if you need curvature.

9 Homoscedasticity (constant error variance)

With heteroscedasticity, OLS remains unbiased but is not efficient, and classical standard errors are invalid, so the statistic tests might be wrong. Remember that we use the standard errors for many tests, including the individual significance t-tests \Rightarrow \tau = \frac{\hat\beta-0}{se(\hat\beta)}. In that case we use heteroscedasticity‑robust standard errors (HC1).

Tests and visuals.

  • Residuals vs. fitted plot.

First, start by plotting the residuals of the model for a visual inspection of the variance of the residuals.

# Residuals vs. fitted
for label, model in [("Linear model", m_lin), ("Logarithmic model", m_log)]:
    fitted = model.fittedvalues
    resid  = model.resid
    plt.figure(figsize=(6,4))
    plt.scatter(fitted, resid, s=12)
    plt.axhline(0, linestyle="--")
    plt.xlabel(f"{label} fitted values")
    plt.ylabel("Residuals")
    plt.title(f"Residuals vs. fitted — {label}")
    plt.tight_layout()
    plt.show()

9.1 White and Breusch–Pagan (BP) tests

9.1.1 Goal

Both tests ask whether the error variance depends on the regressors (i.e., heteroscedasticity).

  • Null (both): homoscedasticity
    \mathrm{Var}(u_i \mid X_i) = \sigma^2
  • Alternative: variance changes with X
    \mathrm{Var}(u_i \mid X_i) = \sigma^2 \cdot h(Z_i), \quad \text{with } h(\cdot) \text{ not constant}

9.1.1.1 What to do if you reject?

  • Keep your coefficients from OLS (they remain unbiased under exogeneity), but report robust/clustered standard errors for inference. We got these when we ran the models above.
  • Consider transformations (e.g., logs), respecify functional form, or model variance if theory suggests it.

9.1.2 White test

The idea is a general check of the variance of the residuals on the regressors, their squares and cross-products (doesn’t require specifying a form).

  1. Run your main OLS and get residuals \hat u_i.
  2. Regress \hat u_i^{\,2} on Z_i, Z_i^2, and pairwise products Z_{ij}Z_{ik}: \hat u_i^{\,2} = \alpha_0 + \sum_j \alpha_j Z_{ij} + \sum_j \alpha_{jj} Z_{ij}^2 + \sum_{j<k} \alpha_{jk} Z_{ij}Z_{ik} + v_i.
  3. Compute \text{LM} = n \cdot R^2 from this auxiliary regression.
  4. Decision: compare LM to a \chi^2_m (or use the p-value), where m is the number of auxiliary regressors (excluding the constant).

Interpretation:
- Large p-valuefail to reject homoscedasticity.
- Small p-valuereject → use robust (HC) or clustered SE.

# White test
print("White test of homoscedasticity — Linear Model")
w_lm_lin, w_lm_p_lin = het_white(m_lin.resid, m_lin.model.exog)[:2]
white_results_lin = pd.Series({
    "White LM stat": w_lm_lin,
    "White LM p-value": w_lm_p_lin,
    "Decision (α=0.05)": "Reject H0 (heteroscedasticity)" if w_lm_p_lin < 0.05 else "Fail to reject H0 (homoscedasticity)"
})
white_results_lin
White test of homoscedasticity — Linear Model
White LM stat                                       62.14
White LM p-value                                     0.58
Decision (α=0.05)    Fail to reject H0 (homoscedasticity)
dtype: object
# White test
print("White test of homoscedasticity — Logarithmic Model")
w_lm_log, w_lm_p_log = het_white(m_log.resid, m_log.model.exog)[:2]
white_results_log = pd.Series({
    "White LM stat": w_lm_log,
    "White LM p-value": w_lm_p_log,
    "Decision (α=0.05)": "Reject H0 (heteroscedasticity)" if w_lm_p_log < 0.05 else "Fail to reject H0 (homoscedasticity)"
})
white_results_log
White test of homoscedasticity — Logarithmic Model
White LM stat                                225.20
White LM p-value                               0.00
Decision (α=0.05)    Reject H0 (heteroscedasticity)
dtype: object

9.1.3 Breusch–Pagan (BP)

This tests checks whether the squared residuals change with the regressors.

  1. Run your main OLS and get residuals \hat u_i.
  2. Regress \hat u_i^{\,2} on the original regressors Z_i (often the same X_i): \hat u_i^{\,2} = \alpha_0 + \alpha_1 Z_{i1} + \cdots + \alpha_q Z_{iq} + v_i.
  3. Compute \text{LM} = n \cdot R^2 from this auxiliary regression.
  4. Decision: compare LM to a \chi^2_q (or use the p-value).

Interpretation:
- Large p-valuefail to reject homoscedasticity.
- Small p-valuereject homoscedasticity → use robust (HC) or clustered standard errors.

Note: the classic BP assumes a linear form for how variance depends on Z and (in its original version) normality; in practice, the commonly used implementation is robust enough for a quick diagnostic.

# Breusch–Pagan LM statistic and its p-value.
print("Breusch–Pagan test of homoscedasticity — Linear Model")
bp_lm_lin, bp_lm_p_lin = het_breuschpagan(m_lin.resid, m_lin.model.exog)[:2]
bp_results_lin = pd.Series({
    "BP LM stat": bp_lm_lin,
    "BP LM p-value": bp_lm_p_lin,
    "Decision (α=0.05)": "Reject H0 (heteroscedasticity)" if bp_lm_p_lin < 0.05 else "Fail to reject H0 (homoscedasticity)"
})
bp_results_lin
Breusch–Pagan test of homoscedasticity — Linear Model
BP LM stat                                           9.85
BP LM p-value                                        0.54
Decision (α=0.05)    Fail to reject H0 (homoscedasticity)
dtype: object
# Breusch–Pagan logarithmic
print("Breusch–Pagan test of homoscedasticity — Logarithmic Model")
bp_lm_log, bp_lm_p_log = het_breuschpagan(m_log.resid, m_log.model.exog)[:2]
bp_results_log = pd.Series({
    "BP LM stat": bp_lm_log,
    "BP LM p-value": bp_lm_p_log,
    "Decision (α=0.05)": "Reject H0 (heteroscedasticity)" if bp_lm_p_log < 0.05 else "Fail to reject H0 (homoscedasticity)"
})
bp_results_log
Breusch–Pagan test of homoscedasticity — Logarithmic Model
BP LM stat                                    42.13
BP LM p-value                                  0.00
Decision (α=0.05)    Reject H0 (heteroscedasticity)
dtype: object

10 Summary regression tables

10.1 Linear model

The following table presents the results for the linear model in each column. The first column reports the OLS results with no modifications. The second column contains the Heteroscedasticity-Robust standard errors. Finally, column 3 shows the results inspecting for clustered varianve. In this case, we are grouping observations by location (US vs. non-US), although we did not include this variable in the model (perhaps we should have, however it was not significant).

tbl_lin = summary_col(
    [m_lin, m_lin_hc1, m_lin_clu],
    stars=True,
    model_names=["OLS","OLS (robust)"," OLS (Cluster: US)"],
    info_dict={
        "N":     lambda x: f"{int(x.nobs)}",
        "AIC":   lambda x: f"{getattr(x,'aic', float('nan')):.1f}" if hasattr(x,'aic') else "",
        "BIC":   lambda x: f"{getattr(x,'bic', float('nan')):.1f}" if hasattr(x,'bic') else "",
        "RSS":   lambda x: f"{getattr(x,'ssr', float('nan')):.2f}" if hasattr(x,'ssr') else "",
    }
)
print(tbl_lin)

=======================================================================
                                OLS     OLS (robust)  OLS (Cluster: US)
-----------------------------------------------------------------------
Intercept                    6.0591***  6.0591***    6.0591*           
                             (0.7868)   (0.8078)     (0.8301)          
C(ShelveLoc)[T.Good]         4.1641***  4.1641***    4.1641            
                             (0.7500)   (0.7195)     (1.1336)          
C(ShelveLoc)[T.Medium]       1.5504**   1.5504**     1.5504            
                             (0.6335)   (0.6457)     (0.2903)          
Price                        -0.0986*** -0.0986***   -0.0986**         
                             (0.0047)   (0.0046)     (0.0034)          
Price:C(ShelveLoc)[T.Good]   0.0058     0.0058       0.0058            
                             (0.0063)   (0.0061)     (0.0123)          
Price:C(ShelveLoc)[T.Medium] 0.0035     0.0035       0.0035            
                             (0.0054)   (0.0054)     (0.0030)          
CompPrice                    0.0930***  0.0930***    0.0930**          
                             (0.0042)   (0.0039)     (0.0021)          
Income                       0.0157***  0.0157***    0.0157*           
                             (0.0019)   (0.0019)     (0.0025)          
Advertising                  0.1144***  0.1144***    0.1144**          
                             (0.0080)   (0.0078)     (0.0022)          
Age                          -0.0464*** -0.0464***   -0.0464           
                             (0.0032)   (0.0033)     (0.0085)          
Education                    -0.0208    -0.0208      -0.0208*          
                             (0.0197)   (0.0204)     (0.0027)          
Population                   0.0002     0.0002       0.0002            
                             (0.0004)   (0.0004)     (0.0003)          
R-squared                    0.8728     0.8728       0.8728            
R-squared Adj.               0.8692     0.8692       0.8692            
AIC                          1163.8     1163.8       1163.8            
BIC                          1211.7     1211.7       1211.7            
N                            400        400          400               
RSS                          404.72     404.72       404.72            
=======================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

10.2 Logarithmic Model

The following table does a similar exercise as the obe above, but for the logarithmic model.

tbl_log = summary_col(
    [m_log, m_log_hc1,m_log_clu],
    stars=True,
    model_names=["OLS","OLS (robust)","OLS (Cluster: US)"],
    info_dict={
        "N":     lambda x: f"{int(x.nobs)}",
        "AIC":   lambda x: f"{getattr(x,'aic', float('nan')):.1f}" if hasattr(x,'aic') else "",
        "BIC":   lambda x: f"{getattr(x,'bic', float('nan')):.1f}" if hasattr(x,'bic') else "",
        "RSS":   lambda x: f"{getattr(x,'ssr', float('nan')):.2f}" if hasattr(x,'ssr') else "",
    }
)
print(tbl_log)

==========================================================================
                                    OLS     OLS (robust) OLS (Cluster: US)
--------------------------------------------------------------------------
Intercept                        2.2525***  2.2525***    2.2525           
                                 (0.5950)   (0.8337)     (1.2796)         
C(ShelveLoc)[T.Good]             -2.5232*** -2.5232***   -2.5232          
                                 (0.6366)   (0.8337)     (0.7892)         
C(ShelveLoc)[T.Medium]           -1.9109*** -1.9109**    -1.9109          
                                 (0.5356)   (0.9434)     (0.5474)         
log_Price                        -1.7326*** -1.7326***   -1.7326**        
                                 (0.1012)   (0.1785)     (0.0672)         
log_Price:C(ShelveLoc)[T.Good]   0.6607***  0.6607***    0.6607           
                                 (0.1342)   (0.1791)     (0.1755)         
log_Price:C(ShelveLoc)[T.Medium] 0.4674***  0.4674**     0.4674           
                                 (0.1132)   (0.2016)     (0.1222)         
log_CompPrice                    1.5226***  1.5226***    1.5226*          
                                 (0.0988)   (0.1254)     (0.2231)         
log_Income                       0.1340***  0.1340***    0.1340**         
                                 (0.0216)   (0.0219)     (0.0054)         
log_Advertising                  0.0789***  0.0789***    0.0789*          
                                 (0.0084)   (0.0090)     (0.0065)         
Age                              -0.0062*** -0.0062***   -0.0062**        
                                 (0.0006)   (0.0007)     (0.0003)         
Education                        -0.0057    -0.0057      -0.0057          
                                 (0.0038)   (0.0038)     (0.0049)         
log_Population                   0.0224*    0.0224*      0.0224**         
                                 (0.0121)   (0.0128)     (0.0010)         
R-squared                        0.7585     0.7585       0.7585           
R-squared Adj.                   0.7517     0.7517       0.7517           
AIC                              -160.8     -160.8       -160.8           
BIC                              -112.9     -112.9       -112.9           
N                                400        400          400              
RSS                              14.76      14.76        14.76            
==========================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

11 Extensions - Auto/serial/spatial correlation (preview)

OLS standard errors assume the regression errors are not correlated with one another. In time-series this is “serial” correlation; in cross-sections it can show up as spatial or group-level correlation (nearby stores, same region, same brand policy). If errors move together, the coefficients from OLS are still the same, but the usual standard errors are too optimistic, so t and F tests can mislead.

Two quick tools cover most cases we care about:

  • Durbin–Watson (DW) is a descriptive index for first-order serial correlation in residuals. It runs from 0 to 4 and sits near 2 when there is no first-order autocorrelation.

  • Breusch–Godfrey (BG) is a formal LM test for serial correlation of a chosen order. You pick a lag (or several), run the test, and read the p-value: a small p-value says “there is serial correlation of at least that order.”
# --- Serial dependence tests: DW and BG ---------

def dependence_table(model, label="Model", lags=(1, 2), alpha=0.05):
    """
    Build a one-table summary:
      - Durbin–Watson (descriptive)
      - Breusch–Godfrey LM tests at the requested lags
    Decision column uses α for BG tests; DW has no p-value and is left descriptive.
    """
    rows = []
    # DW first (descriptive)
    dw = durbin_watson(model.resid)
    rows.append({
        "Test": "Durbin–Watson (≈ 2 = no AR(1))",
        "Statistic": round(dw, 3),
        "p-value": "",
        f"Decision (α={alpha:.2f})": ""
    })
    # BG at requested lags
    for L in lags:
        lm_stat, lm_p, *_ = acorr_breusch_godfrey(model, nlags=L)
        rows.append({
            "Test": f"Breusch–Godfrey LM (lag {L})",
            "Statistic": round(lm_stat, 4),
            "p-value": round(lm_p, 4),
            f"Decision (α={alpha:.2f})": "Reject H0 (serial correlation)" if lm_p < alpha else "Fail to reject H0"
        })
    tbl = pd.DataFrame(rows)
    print(f"Serial dependence tests — {label}")
    print(tbl.to_string(index=False))

# Run for both fitted models
dependence_table(m_lin, label="Lin–Lin (levels)", lags=(1, 2), alpha=0.05)
print()  # spacer
dependence_table(m_log, label="Log–Log (logs)",   lags=(1, 2), alpha=0.05)
Serial dependence tests — Lin–Lin (levels)
                          Test  Statistic p-value Decision (α=0.05)
Durbin–Watson (≈ 2 = no AR(1))       1.99                          
    Breusch–Godfrey LM (lag 1)       0.00    0.95 Fail to reject H0
    Breusch–Godfrey LM (lag 2)       0.35    0.84 Fail to reject H0

Serial dependence tests — Log–Log (logs)
                          Test  Statistic p-value Decision (α=0.05)
Durbin–Watson (≈ 2 = no AR(1))       2.03                          
    Breusch–Godfrey LM (lag 1)       0.13    0.72 Fail to reject H0
    Breusch–Godfrey LM (lag 2)       0.28    0.87 Fail to reject H0

In a purely cross-sectional setting we don’t expect serial correlation by time, but we can still see grouped dependence (e.g., stores within the same country). When dependence is present or even plausible, inference should switch to cluster-robust standard errors at the relevant group level.