Classical Linear Regression Model

CD2010 Introduction to Econometrics

Author: Sergio Castellanos-Gamboa, PhD

Affiliation: Tecnológico de Monterrey

Published: August 18, 2025

1 Introduction

Trade policy is back on front pages. In 2025 the United States announced a baseline tariff on most imports under IEEPA and later modified reciprocal tariff rates by executive order on July 31, 2025, with further adjustments announced in August. These moves revived questions about how broad tariffs ripple through prices, sourcing, and trade volumes, and how exemptions or phased schedules may change effective rates country by country.

The near-term political news also underscores why a country-level cross-section is useful: several partners negotiated temporary reprieves or country-specific terms (e.g., Mexico’s 90-day pause to work on a broader deal), implying heterogeneous exposure to tariffs across countries at a point in time. That heterogeneity motivates regressions that control for development and connectivity when we ask whether higher tariffs are associated with lower import intensity (imports as % of GDP).

This workshop estimates the association between tariffs and trade openness using a one-year cross-section of countries. Theory predicts that a higher import tariff raises domestic relative prices and tends to reduce import volumes; in general equilibrium, tariffs can also affect the terms of trade and reallocate production and consumption across sectors (classics include Gorman, 1959; Leith, 1971). Modern empirical work further separates tariff from non-tariff changes in trade agreements and finds measurable impacts on trade margins (Cheong, Kwak, & Tang, 2018). Recent analysis also clarifies how a permanent import tariff can affect trade balances under minimal structure, motivating a look at net trade as well as imports and exports (Costinot & Werning, 2025). Concretely, we begin with imports, \text{Imports}_i \;=\; \alpha \;+\; \beta_{\text{tariff}}\cdot \text{Tariff}_i \;+\; \gamma' Z_i \;+\; u_i, where Z_i collects income, size, and connectivity controls, and we expect \beta_{\text{tariff}}<0 if higher tariffs compress import intensity. We then replicate the exercise for exports (tariffs may also depress exports via input costs and retaliation) and for net trade, where \text{net trade}_i=\text{exports}_i-\text{imports}_i, to summarize overall external balance in the same cross-section.

Academic work helps us choose covariates. A large literature shows that digital connectivity lowers search and coordination costs, expanding both the extensive margin (who trades) and the intensive margin (how much). Influential evidence includes Freund & Weinhold (2004) and Osnago & Tan (2016), who document positive links between Internet adoption and trade flows. These insights justify our working set of regressors: tariffs (policy wedge), income (GDP per capita, PPP), market size/geography (population and area or density), and connectivity (Internet %). Finally, we extend the model in two useful directions:

  1. interactions with dummy variables for World Bank income group, to test whether certain countries are more affected by tariffs than others, and

  2. a polynomial term in the tariff rate to allow for curvature (e.g., marginal effects that get stronger as tariffs rise).

All analyses are cross‑sectional (one year only).

2 OLS regression

We begin by defining a linear relationship between an outcome y (e.g., imports as % of GDP) and predictors x_1,\dots,x_k (e.g., tariffs, income, population, connectivity):

y_i \;=\; \beta_0 \;+\; \beta_1 x_{i1} \;+\; \cdots \;+\; \beta_k x_{ik} \;+\; u_i ,

where u_i collects all unobserved influences on y_i. Ordinary Least Squares (OLS) chooses coefficients to minimize the sum of squared residuals:

S(\beta) \;=\; \sum_{i=1}^n \big(y_i - x_i' \beta\big)^2 \;=\; (y - X\beta)'(y - X\beta).

Differentiating and setting to zero yields the closed-form estimator:

\hat{\beta} \;=\; (X'X)^{-1} X'y \quad \text{(provided \(X'X\) is invertible)}.

The fitted values and residuals are \hat{y} \;=\; X\hat{\beta}, \qquad \hat{u} \;=\; y - \hat{y}.

An unbiased estimator of the error variance under the classical assumptions is \hat{\sigma}^2 \;=\; \frac{\hat{u}'\hat{u}}{\,n - k - 1\,}.

The classical standard error of \hat{\beta}_j is \operatorname{se}(\hat{\beta}_j) \;=\; \sqrt{ \, \hat{\sigma}^2 \cdot \big[(X'X)^{-1}\big]_{jj} \, } .

With normally distributed errors, the usual t-tests and confidence intervals follow: \frac{\hat{\beta}_j - \beta_j}{\operatorname{se}(\hat{\beta}_j)} \;\sim\; t_{\,n-k-1}.
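The formulas above can be traced step by step in NumPy. The following is a minimal sketch on synthetic data (the data-generating process, the seed, and all variable names are assumptions made for illustration, not part of the workshop dataset); it computes \hat{\beta}, \hat{\sigma}^2, the classical standard errors, and the t-statistics exactly as defined in this section.

```python
import numpy as np

# Synthetic data (hypothetical): n observations, k = 2 regressors plus a constant.
rng = np.random.default_rng(0)
n, k = 200, 2
x = rng.normal(size=(n, k))
u = rng.normal(scale=0.5, size=n)
y = 1.0 + 2.0 * x[:, 0] - 1.0 * x[:, 1] + u

# Design matrix with a leading column of ones for the intercept.
X = np.column_stack([np.ones(n), x])

# beta_hat = (X'X)^{-1} X'y, computed with a linear solve instead of an explicit inverse.
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Residuals and the error-variance estimate with n - k - 1 degrees of freedom.
u_hat = y - X @ beta_hat
sigma2_hat = (u_hat @ u_hat) / (n - k - 1)

# Classical standard errors and t-statistics for H0: beta_j = 0.
se = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(XtX)))
t_stats = beta_hat / se

print(np.round(beta_hat, 3), np.round(se, 3), np.round(t_stats, 2))
```

In practice statsmodels performs these steps for us; the sketch is only meant to show that nothing more than the matrix expressions above is involved.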

2.0.1 Simple vs. multiple regression

  • In a simple regression y_i=\beta_0+\beta_1 x_{i1}+u_i, \hat{\beta}_1 measures the average change in y associated with a one-unit change in x_{1}. This is easy to read but fragile: any factor correlated with both y and x_1 gets absorbed into u_i, potentially biasing \hat{\beta}_1.

  • In a multiple regression y_i=\beta_0+\beta_1 x_{i1}+\cdots+\beta_k x_{ik}+u_i, \hat{\beta}_j measures the partial association between x_j and y ceteris paribus (holding the other regressors fixed). This is why, when we study tariffs and trade, we include income, size/geography, and connectivity, to reduce omitted-variable bias and read the tariff coefficient as a conditional association rather than a proxy for broader development or access differences.
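A small simulation can make the omitted-variable problem concrete. The sketch below uses an invented data-generating process (the coefficients 2.0 and 3.0, the 0.8 loading, and the names x1 and z are all assumptions): z affects y and is correlated with x1, so leaving z out pushes the simple-regression slope away from the true value of 2.0, while including it brings the slope back.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000

# Hypothetical data-generating process: z affects y and is correlated with x1.
z = rng.normal(size=n)
x1 = 0.8 * z + rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * z + rng.normal(size=n)

def ols(y, *cols):
    """Return OLS coefficients for y on a constant plus the given columns."""
    X = np.column_stack([np.ones(len(y)), *cols])
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_simple = ols(y, x1)        # omits z: the slope absorbs part of z's effect
b_multiple = ols(y, x1, z)   # controls for z: the slope is close to the true 2.0

print("simple slope:  ", round(b_simple[1], 2))
print("multiple slope:", round(b_multiple[1], 2))
```

In this setup the simple slope converges to 2 + 3 \cdot \operatorname{Cov}(x_1, z)/\operatorname{Var}(x_1) = 2 + 3 \cdot 0.8/1.64 \approx 3.46, which is the textbook omitted-variable bias formula.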

2.0.2 The OLS assumptions (cross-section)

These standard conditions underpin unbiasedness, consistency, and (with a few extras) efficiency, plus valid small-sample inference. Under exogeneity, OLS is unbiased: \mathbb{E}[\hat\beta \mid X]=\beta. With i.i.d. sampling, OLS is also consistent (\hat\beta converges to \beta as the sample grows). If we also assume homoskedasticity and no autocorrelation, OLS is the best linear unbiased estimator (BLUE): among linear unbiased estimators it achieves the smallest sampling variance, \operatorname{Var}(\hat\beta \mid X)=\sigma^2 (X'X)^{-1}. Finally, adding normal errors justifies exact small-sample t tests and confidence intervals. We will test these conditions in the next workshop; here they provide the scaffold for interpreting coefficients.

Assumption | Mathematical statement | Meaning (plain language)
Linearity in parameters | y_i=\beta_0+\sum_{j=1}^k \beta_j x_{ij}+u_i | The model is linear in the coefficients \beta. You can include transformations like \log x, x^2, or interactions; OLS remains linear in \beta.
Exogeneity (zero conditional mean) | \mathbb{E}[u_i \mid X] = 0 | Regressors are uncorrelated with unobservables that affect y. Delivers unbiased and consistent \hat{\beta}. Violations cause omitted-variable or simultaneity bias.
Homoskedasticity | \operatorname{Var}(u_i \mid X)=\sigma^2 | Error variance is constant across observations; with no autocorrelation this makes OLS BLUE. If false, classical standard errors are wrong, so you have to use robust standard errors.
No autocorrelation (cross-section) | \operatorname{Cov}(u_i,u_j \mid X)=0,\; i\neq j | Country errors are uncorrelated. Spatial clustering (neighbors share shocks) can violate this; cluster-robust standard errors help.
No perfect multicollinearity | \operatorname{rank}(X)=k{+}1 | No regressor is an exact linear combination of others.
Normality (for exact small-sample inference) | u_i \sim \mathcal{N}(0,\sigma^2) | Not required for unbiasedness; gives exact t/F in small samples. Otherwise rely on large-sample or robust inference.
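The rank condition is the easiest of these assumptions to check mechanically. As a hedged illustration with made-up numbers (the distributions and variable values below are assumptions, only the variable names echo this workshop), note that population density is population divided by area, so its log is an exact linear combination of the other two logs. Adding all three to the same design matrix leaves X with fewer independent columns than regressors.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50

# Hypothetical country data: log population and log surface area drawn at random.
ln_population = rng.normal(16.0, 1.8, size=n)
ln_surface_area = rng.normal(12.0, 2.0, size=n)
# Density = population / area, so its log is an exact linear combination.
ln_pop_density = ln_population - ln_surface_area

const = np.ones(n)
X_ok = np.column_stack([const, ln_population, ln_surface_area])
X_bad = np.column_stack([const, ln_population, ln_surface_area, ln_pop_density])

rank_ok = np.linalg.matrix_rank(X_ok)     # full column rank: 3 of 3 columns
rank_bad = np.linalg.matrix_rank(X_bad)   # rank-deficient: 3 of 4 columns

print(rank_ok, X_ok.shape[1])
print(rank_bad, X_bad.shape[1])
```

With real data the relation may hold only approximately (for example, if density is measured over land area rather than total surface area), in which case X'X is invertible but badly conditioned and the affected standard errors become very large.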

3 Load data and prepare variables

We will:

  1. read the Excel file, and

  2. create logs for skewed scale variables.

3.1 Python libraries used in this workshop

We lean on a small, standard toolkit that’s available in RStudio’s Python engine and in Google Colab without extra installs:

pandas. This is our workhorse for data wrangling. A DataFrame (think: a labeled table) lets us clean columns, merge country attributes, compute logs, and build the final modeling dataset. You’ll see patterns like pd.read_excel(url_or_path) to load data, df.rename(...) and df.assign(...) to create clean variables (e.g., ln_population = log(population + 1)), and df.dropna() to keep complete cases. Two core objects:

  • DataFrame: 2-D table with labeled rows/columns (e.g., one row per country).

  • Series: 1-D labeled array (e.g., a single variable/column).

NumPy. Provides fast, vectorized math. We mostly use it through pandas, but call numpy directly for safe transforms, e.g. np.log(x + 1) to avoid log(0), or to create polynomial terms like x**2. NumPy arrays (ndarray) are the low-level structures that make column-wise operations fast and memory-efficient.

statsmodels. Our statistics and econometrics engine. We’ll use the formula API (statsmodels.formula.api as smf) to specify models in a clear syntax:

  • OLS: smf.ols("imports_pct_gdp ~ tariff_weighted + ln_gdppc_ppp_const + ...", data=df).fit()

  • Categoricals: Just include string columns (e.g., continent, wb_income_group) and statsmodels will dummy-encode them automatically; or use C(var) to force categorical treatment if needed.

  • Interactions: var1 * var2 expands to main effects + interaction; var1 : var2 is interaction only.

  • Polynomials: Wrap with I() to treat math literally, e.g., I(tariff_weighted**2).

We’ll also use statsmodels.iolib.summary2.summary_col to print a compact comparison table of models (simple, multiple, interactions, polynomial) with sample size and R^2.

Matplotlib. Simple plots for quick data exploration, such as two-way scatter plots and a correlation heatmap. It’s part of the standard stack and doesn’t require any configuration for our use (e.g., plt.scatter(x, y)).

In short: pandas to shape the data, NumPy for numerical transforms (logs, squares), statsmodels to estimate and summarize regressions, and matplotlib to visualize relationships before we model. This keeps the workflow transparent and reproducible across RStudio and Colab.
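To see how the formula syntax described above expands into coefficients, here is a minimal sketch on a toy DataFrame. The column names x, g, and y, the two category labels, and the simulated numbers are all hypothetical; only the formula operators (`*`, `:`, `I()`) are the ones discussed in this section.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 120

# Hypothetical toy data: one numeric regressor and one two-level string category.
toy = pd.DataFrame({
    "x": rng.uniform(0, 10, size=n),
    "g": rng.choice(["a", "b"], size=n),
})
toy["y"] = 5.0 - 1.0 * toy["x"] + 0.05 * toy["x"] ** 2 + rng.normal(size=n)

# `x * g` expands to x + g + x:g; I(x**2) keeps ** as arithmetic, not formula syntax.
fit = smf.ols("y ~ x * g + I(x**2)", data=toy).fit()

# One coefficient per term: intercept, dummy for g, x, interaction, squared term.
print(fit.params.index.tolist())
```

The string column g is dummy-encoded automatically, with its first level ("a") serving as the omitted baseline, which is the same mechanism used later for wb_income_group.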

3.2 Importing data from the internet

For this workshop we will download the data directly from GitHub. The dataset consists of the following variables:

Variable (column name) | What it is & role in the regression
imports_pct_gdp | Imports of goods & services (% of GDP). Primary dependent variable (DV) to measure trade openness on the import side. We expect a negative association with tariffs when conditioning on controls.
exports_pct_gdp | Exports of goods & services (% of GDP). Alternative DV for robustness. Tariffs can affect exports via higher input costs, retaliation, or supply-chain re-routing. Sign is ambiguous a priori.
trade_pct_gdp | Total trade (% of GDP) = Exports + Imports. Optional DV for a one-number openness measure; can be used as a robustness outcome.
tariff_weighted | Applied, weighted-mean tariff (%). Main policy regressor (IV). Captures the effective tariff burden based on import shares. In polynomial specs, also include I(tariff_weighted**2) to allow curvature.
gdppc_ppp_const / ln_gdppc_ppp_const | GDP per capita, PPP (constant intl $). Development/market depth control. Use the log transform (ln_gdppc_ppp_const) to reduce skew and interpret coefficients as semi-elasticities.
population / ln_population | Total population. Market size control. Larger economies often have lower trade/GDP ratios due to internal absorption. Prefer the log version.
surface_area / ln_surface_area | Country surface area (km²). Geography/scale control. Can capture remoteness and internal transport costs. Prefer the log version.
pop_density / ln_pop_density | Population density (people/km²). Optional structure control. Do not include together with both population and surface_area (risk of multicollinearity).
internet_users_pct | Individuals using the Internet (% of population). Connectivity/control (IV). Proxy for information/search and coordination channels.
continent | Region label (categorical). Use as dummies to absorb broad regional level differences; also useful in interactions (tariff_weighted*continent) to test heterogeneous tariff effects by region.
wb_income_group | World Bank income category (categorical). Use as dummies for baseline differences in development; also for interactions (tariff_weighted*wb_income_group) to test income-level heterogeneity.

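For logged variables on non-negative scales, a common safeguard is \ln(x+c) with a small constant c>0 to avoid \ln(0). Textbook examples often use c=1; the code in this workshop adds a tiny offset (c=10^{-7}), which leaves the logs of positive values essentially unchanged. Keep percentage variables (e.g., tariffs, Internet %) in levels unless you have a specific elasticity interpretation in mind.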
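How much the offset matters depends on the scale of x. The sketch below compares \ln(x+1) with \ln(x+10^{-7}) for two values that appear in this dataset's descriptive statistics (the minimum population density, 2.20, and the minimum population, 64,749); the dictionary keys are labels chosen here for illustration.

```python
import math

# Two values from the dataset's descriptive statistics (labels are illustrative).
values = {"pop_density_min": 2.20, "population_min": 64749.0}

logs = {}
for name, x in values.items():
    plus_one = math.log(x + 1)       # ln(x + 1)
    tiny = math.log(x + 1e-7)        # ln(x + 1e-7), practically ln(x)
    logs[name] = (plus_one, tiny)
    print(f"{name:<16} ln(x+1)={plus_one:.4f}  ln(x+1e-7)={tiny:.4f}")
```

For a large value such as population the two transforms are nearly identical, but for a small value such as a density of 2.20 they differ noticeably, so the choice of offset should be reported alongside the results.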

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
from statsmodels.iolib.summary2 import summary_col

# Set or reset display formats
# pd.options.display.float_format = None
pd.options.display.float_format = "{:.2f}".format

# >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
# EDIT THIS: set your path accordingly
url = "https://github.com/chechurris/CD2010/raw/refs/heads/main/wbdata_trade.xlsx"
# <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

# Read the first sheet (or specify sheet_name="...")
data = pd.read_excel(url)

# Create logs for skewed scale variables (a tiny offset guards against log(0))
for col in ["gdppc_ppp_const", "population", "surface_area", "pop_density"]:
    if col in data.columns:
        data[f"ln_{col}"] = np.log(data[col]+1e-7)

data.head(8)
time time_code country iso3c gdp_const gdp_nominal gdppc_const gdppc_nominal gdppc_ppp_const gdppc_ppp_current ... internet_users_pct pop_density population surface_area continent wb_income_group ln_gdppc_ppp_const ln_population ln_surface_area ln_pop_density
0 2022 YR2022 Albania ALB 14385329993.53 19017244116.72 5178.88 6846.43 17111.95 19444.71 ... 82.60 101.38 2777689.00 28750.00 Europe Upper middle income 9.75 14.84 10.27 4.62
1 2022 YR2022 Algeria DZA 206670488126.83 225638456572.14 4544.47 4961.55 14782.20 15836.09 ... 74.80 19.09 45477389.00 2381741.00 Africa Upper middle income 9.60 17.63 14.68 2.95
2 2022 YR2022 Angola AGO 84883445838.16 104399746853.40 2382.02 2929.69 7397.49 7924.89 ... 42.10 28.58 35635029.00 1246700.00 Africa Lower middle income 8.91 17.39 14.04 3.35
3 2022 YR2022 Argentina ARG 598603016935.29 632790070063.12 13182.79 13935.68 27627.96 29597.69 ... 88.40 16.59 45407904.00 2780400.00 South America Upper middle income 10.23 17.63 14.84 2.81
4 2022 YR2022 Australia AUS 1587133480804.53 1690858246994.43 61009.81 64997.01 59883.65 65871.77 ... 97.00 3.38 26014399.00 7741220.00 Oceania High income 11.00 17.07 15.86 1.22
5 2022 YR2022 Austria AUT 427236226728.30 471773629830.38 47250.97 52176.66 65661.45 70734.94 ... 93.60 109.57 9041851.00 83879.00 Europe High income 11.09 16.02 11.34 4.70
6 2022 YR2022 Azerbaijan AZE 56789627316.90 78807470588.24 5599.59 7770.59 21051.24 22552.09 ... 88.00 122.71 10141756.00 86600.00 Asia Upper middle income 9.95 16.13 11.37 4.81
7 2022 YR2022 Bahamas, The BHS 12795362392.88 13896800000.00 32186.51 34957.16 34342.79 36791.25 ... 94.70 39.71 397538.00 13880.00 North America High income 10.44 12.89 9.54 3.68

8 rows × 25 columns

3.2.1 Choice of outcome and predictors

For the baseline specification we take Imports (% of GDP) as the dependent variable. As predictors we use: tariff (weighted mean), log GDP per capita (PPP, constant), log population, log surface area, and internet users (%). This set captures policy exposure, development, market size, geographic scale, and digital adoption; five distinct levers that theory suggests should matter for trade intensity.

4 Descriptive statistics and visual exploration

Before running regressions, it is essential to understand the data’s level and spread. We examine descriptive statistics, two‑way scatter plots of the dependent variable against each regressor, and a correlation heatmap. This helps detect outliers, nonlinear patterns, and near‑redundant variables.

# Summary stats for variables used
desc = data.drop(columns=["country","iso3c","continent","wb_income_group"], errors="ignore").describe().T
desc[["count","mean","std","min","25%","50%","75%","max"]]
count mean std min 25% 50% 75% max
gdp_const 128.00 664424051724.42 2465768654505.64 482629585.76 14593597529.00 63070517151.70 366728761759.98 21443388432051.03
gdp_nominal 128.00 742741833719.01 2861572424714.76 518180029.41 17727732488.17 72558819155.97 407091463782.61 26006893000000.00
gdppc_const 128.00 18045.42 22306.89 253.69 2343.29 7008.16 26611.60 109642.67
gdppc_nominal 128.00 20788.73 25908.96 250.63 2984.89 8368.96 30951.56 121613.94
gdppc_ppp_const 128.00 30039.53 27469.16 829.39 7367.57 21008.89 47322.66 133571.96
gdppc_ppp_current 128.00 32566.11 30169.37 888.52 7892.84 22506.71 51117.25 143094.95
trade_pct_gdp 128.00 98.51 57.60 26.89 60.93 85.59 123.36 384.88
exports_pct_gdp 128.00 47.86 31.80 4.97 26.26 40.84 62.06 194.49
imports_pct_gdp 128.00 50.64 28.00 15.29 29.04 43.66 63.26 190.39
tariff_simple 128.00 6.72 5.63 0.00 1.95 4.56 11.48 26.31
tariff_weighted 128.00 5.28 5.33 0.00 1.33 3.21 7.99 29.52
internet_users_pct 125.00 72.52 24.71 11.00 57.70 82.10 91.50 100.00
pop_density 125.00 239.29 755.96 2.20 27.91 82.79 150.42 7851.01
population 128.00 50870541.05 180798717.14 64749.00 2818151.50 10435607.00 33230723.50 1425423212.00
surface_area 125.00 813220.12 2108325.92 300.00 38390.00 147570.00 587295.00 15634410.00
ln_gdppc_ppp_const 128.00 9.78 1.17 6.72 8.90 9.95 10.76 11.80
ln_population 128.00 16.06 1.84 11.08 14.85 16.16 17.32 21.08
ln_surface_area 125.00 11.83 2.17 5.70 10.56 11.90 13.28 16.56
ln_pop_density 125.00 4.34 1.43 0.79 3.33 4.42 5.01 8.97
# Pairwise scatter plots: Imports %GDP vs each IV
ivs = []
for v in ["tariff_weighted","ln_gdppc_ppp_const","ln_population","ln_surface_area","internet_users_pct"]:
    if v in data.columns:
        ivs.append(v)

fig, axs = plt.subplots(2, 3, figsize=(12, 8))
axs = axs.ravel()
for i, v in enumerate(ivs):
    axs[i].scatter(data[v], data["imports_pct_gdp"], s=14)
    axs[i].set_xlabel(v)
    axs[i].set_ylabel("Imports (%GDP)")
    axs[i].set_title(f"Imports (%GDP) vs {v}")
# Hide any spare subplot
for j in range(len(ivs), len(axs)):
    axs[j].axis("off")
plt.tight_layout()
plt.show()

# Cleaner correlation heatmap using seaborn
import seaborn as sns

# Pick DV + ~5 IVs (edit if you chose different names)
corr_vars = [
    "imports_pct_gdp",
    "tariff_weighted",
    "ln_gdppc_ppp_const",
    "ln_population",
    "ln_surface_area",
    "internet_users_pct",
]

corr_p = data[corr_vars].corr(method="pearson")

plt.figure(figsize=(7,5))
sns.heatmap(corr_p, annot=True, vmin=-1, vmax=1, cmap="vlag", fmt=".2f",
            linewidths=.5, linecolor="white")
plt.title("Pearson Correlation — Trade & Covariates")
plt.tight_layout()
plt.show()

The scatter plots show how imports as a share of GDP co-move with each predictor in isolation, while the heatmap provides a compact view of linear associations among all variables. Patterns here should inform your expectations about coefficient signs and magnitudes in the regression, but remember that bivariate associations can be misleading when variables are correlated with one another.

5 Simple vs. multiple regression

We now estimate two models. First, a simple regression of imports on tariffs only. Second, a multiple regression that includes the additional controls. Comparing the tariff coefficient across these two models illustrates how multiple regression corrects for confounding influences; large changes indicate that some of the simple relationship was due to other factors (like size or development).

# Simple OLS: only tariff as predictor
m_simple = smf.ols(
    formula="imports_pct_gdp ~ tariff_weighted",
    data=data
).fit()

print(m_simple.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        imports_pct_gdp   R-squared:                       0.147
Model:                            OLS   Adj. R-squared:                  0.140
Method:                 Least Squares   F-statistic:                     21.73
Date:                Mon, 18 Aug 2025   Prob (F-statistic):           7.87e-06
Time:                        13:21:56   Log-Likelihood:                -597.45
No. Observations:                 128   AIC:                             1199.
Df Residuals:                     126   BIC:                             1205.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          61.2804      3.236     18.938      0.000      54.877      67.684
tariff_weighted    -2.0136      0.432     -4.662      0.000      -2.868      -1.159
==============================================================================
Omnibus:                       46.735   Durbin-Watson:                   1.474
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              129.110
Skew:                           1.395   Prob(JB):                     9.21e-29
Kurtosis:                       7.053   Cond. No.                         10.7
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

5.1 Reading your first regression output (Imports % of GDP ~ Tariff %)

Below is a short guide to the main statistics you see in the statsmodels summary, and how to interpret the tariff coefficient for this simple OLS:

5.1.1 R^2 and Adjusted R^2

  • R^2 is the fraction of the variance of the dependent variable explained by the model:
    R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}, where RSS is the residual sum of squares and TSS is the total sum of squares. Values lie between 0 and 1, and a higher value means a better in-sample fit. However, an R^2 that is suspiciously high can also signal problems with the specification or the OLS assumptions.

  • Adjusted R^2 penalizes adding regressors that don’t help (useful for multivariate regressions):
    \text{Adj }R^2 = 1 - \frac{\text{RSS}/(n-k-1)}{\text{TSS}/(n-1)}. If you add irrelevant variables, Adj R^2 can go down, acting as a simple complexity check. In the first regression there’s only one regressor, so R^2 and Adj R^2 are usually very close.
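Because \text{RSS}/\text{TSS} = 1 - R^2, the adjusted R^2 can be recomputed from the numbers printed in the summary. A minimal sketch, using the simple model's reported R^2 = 0.147 with n = 128 and k = 1 (the function name is a hypothetical helper, not a statsmodels call):

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Simple model as reported: R^2 = 0.147, n = 128 observations, k = 1 regressor.
adj_simple = adjusted_r2(0.147, n=128, k=1)
print(round(adj_simple, 3))
```

The result agrees with the reported Adj. R-squared of 0.140 up to rounding of the inputs.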

5.1.2 t-tests (individual significance)

Each coefficient \hat\beta_j is tested against 0 with a t-statistic: t_j = \frac{\hat\beta_j-0}{\operatorname{se}(\hat\beta_j)}

  • A large |t| (e.g., |t|>1.96) together with a small p-value (e.g., p<0.05) indicates that the variable is individually significant.
  • For the tariff slope, the t-test asks: is the relationship between tariffs and imports statistically different from zero?
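The t-statistic and the 95% confidence interval in the summary can be reproduced by hand from the reported coefficient and standard error. In the sketch below the critical value 1.979 is an assumed approximation to the 97.5th percentile of the t distribution with 126 degrees of freedom.

```python
# Reported estimates for tariff_weighted in the simple model.
coef, se = -2.0136, 0.432

# t-statistic for H0: beta = 0.
t_stat = (coef - 0.0) / se

# Approximate critical value for 126 degrees of freedom (assumed constant;
# scipy.stats.t.ppf(0.975, 126) would give the exact figure).
t_crit = 1.979

ci_low = coef - t_crit * se
ci_high = coef + t_crit * se

print(round(t_stat, 2), round(ci_low, 2), round(ci_high, 2))
```

Small differences from the printed table come from the standard error being rounded to three decimals.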

5.1.3 F-test (joint significance)

The summary shows an overall F-statistic with Prob (F-statistic). In the simple model it essentially tests whether the slope is zero (same idea as the t-test), but in multiple regression it tests whether all slopes are zero jointly (i.e., the model has explanatory power beyond a constant).
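With a single regressor, the overall F-statistic equals the square of the slope's t-statistic. A quick check with the values reported in the simple-model summary:

```python
t_tariff = -4.662      # reported t-statistic on tariff_weighted
f_reported = 21.73     # reported overall F-statistic

# In a one-regressor model, F = t^2.
f_from_t = t_tariff ** 2
print(round(f_from_t, 2), f_reported)
```

In models with several regressors this identity no longer holds, because the F-test restricts all slopes at once.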

5.1.4 AIC and BIC (fit vs. parsimony)

  • AIC and BIC reward goodness of fit but penalize model size. Lower is better.

  • BIC penalizes complexity more than AIC, so it favors simpler models when evidence is similar. Use these to compare specifications, whether nested (e.g., linear vs. polynomial) or non-nested, as long as they are estimated on the same observations and share the same dependent variable.
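Both criteria are simple functions of the log-likelihood. The sketch below assumes the convention that the parameter count p includes the intercept, which is consistent with the figures in the simple-model summary (log-likelihood -597.45, n = 128); the helper name is hypothetical.

```python
import math

def aic_bic(loglik: float, n_obs: int, n_params: int) -> tuple[float, float]:
    """AIC = 2p - 2 lnL and BIC = p ln(n) - 2 lnL, with p counting the intercept."""
    aic = 2 * n_params - 2 * loglik
    bic = n_params * math.log(n_obs) - 2 * loglik
    return aic, bic

# Simple model as reported: log-likelihood -597.45, n = 128, p = 2 (intercept + slope).
aic_simple, bic_simple = aic_bic(-597.45, n_obs=128, n_params=2)
print(round(aic_simple, 1), round(bic_simple, 1))
```

Since \ln(n) > 2 once n exceeds about 7, the BIC penalty per parameter is the heavier one in any sample of realistic size.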

5.2 Interpreting the tariff coefficient (units and meaning)

  • Dependent variable: imports_pct_gdp is measured in percentage points of GDP (e.g., 50 means imports are 50% of GDP).

  • Regressor: tariff_weighted is measured in percentage points (e.g., 5 means a 5% tariff).

So the slope \hat\beta_{\text{tariff}} is interpreted as:

A 1-percentage-point increase in the tariff rate is associated with a \hat\beta_{\text{tariff}} percentage-point change in imports as % of GDP, ceteris paribus.

Examples to anchor the units:

  • If \hat\beta_{\text{tariff}} = -2.01 and it’s statistically significant, then raising tariffs from 5.28% (the sample mean) to 6.28% is associated with imports/GDP falling by 2.01 percentage points (e.g., from 50.64% to 48.63% of GDP), on average.

  • If \hat\beta_{\text{tariff}} is close to zero and not significant, your data don’t show a clear linear association in this simple bivariate frame.

Intercept: The constant \hat\beta_0 is the model’s predicted imports_pct_gdp when tariff_weighted = 0. In cross-country data, the intercept is usually just a baseline level, in this case 61.28%; the slope carries the substantive economic interpretation.
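The interpretation of the intercept and slope can be checked by evaluating the fitted line at a few tariff levels. This sketch plugs in the reported simple-model estimates; the function name is a hypothetical helper.

```python
intercept, slope = 61.2804, -2.0136   # reported simple-model estimates

def predicted_imports(tariff: float) -> float:
    """Fitted imports (% of GDP) at a given weighted tariff (%)."""
    return intercept + slope * tariff

at_zero = predicted_imports(0.0)            # equals the intercept
at_mean = predicted_imports(5.28)           # tariff at its sample mean
at_mean_plus_one = predicted_imports(6.28)  # one percentage point higher

print(round(at_zero, 2), round(at_mean, 2), round(at_mean_plus_one, 2))
```

The prediction at the mean tariff lands on the sample mean of imports (about 50.6% of GDP), as expected, because an OLS line with an intercept passes through the point of means.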

# Multiple OLS with 5-ish IVs 
m_multiple = smf.ols(
    formula=(
        "imports_pct_gdp ~ "
        "tariff_weighted + ln_gdppc_ppp_const + ln_population + ln_surface_area + internet_users_pct "
        
    ),
    data=data
).fit()

print(m_multiple.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        imports_pct_gdp   R-squared:                       0.544
Model:                            OLS   Adj. R-squared:                  0.525
Method:                 Least Squares   F-statistic:                     27.71
Date:                Mon, 18 Aug 2025   Prob (F-statistic):           2.26e-18
Time:                        13:21:56   Log-Likelihood:                -518.16
No. Observations:                 122   AIC:                             1048.
Df Residuals:                     116   BIC:                             1065.
Df Model:                           5                                         
Covariance Type:            nonrobust                                         
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            173.4888     31.223      5.556      0.000     111.648     235.330
tariff_weighted       -2.0940      0.439     -4.770      0.000      -2.963      -1.224
ln_gdppc_ppp_const    -2.2177      3.768     -0.589      0.557      -9.680       5.245
ln_population         -1.6214      1.405     -1.154      0.251      -4.404       1.161
ln_surface_area       -6.2471      1.132     -5.518      0.000      -8.490      -4.005
internet_users_pct     0.1232      0.172      0.716      0.475      -0.217       0.464
==============================================================================
Omnibus:                        6.263   Durbin-Watson:                   1.849
Prob(Omnibus):                  0.044   Jarque-Bera (JB):                5.885
Skew:                           0.526   Prob(JB):                       0.0527
Kurtosis:                       3.225   Cond. No.                     1.58e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.58e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

In the multiple model, the tariff coefficient measures the partial association between tariff rates and imports holding development, population, surface area, and internet adoption constant. The estimate is negative and statistically significant (t = -4.77), which suggests that higher tariffs are associated with lower import intensity in the cross‑section. Magnitude matters: with a coefficient of about -2.1, a one‑percentage‑point increase in the tariff rate is linked to 2.1 percentage points lower imports/GDP, ceteris paribus. Notice that the tariff coefficient barely moves between the simple model (-2.01) and the multiple model (-2.09), even though the estimation sample shrinks from 128 to 122 countries because rows with missing values in the controls are dropped. How would you interpret the remaining coefficients?
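As a starting point for the logged controls: when the regressor is in logs and the outcome is in levels, the coefficient times \ln(f) gives the change in the outcome when the regressor is scaled by a factor f. A minimal sketch using the reported coefficient on ln_surface_area:

```python
import math

b_ln_area = -6.2471   # reported coefficient on ln_surface_area

# Change in imports (% of GDP) when surface area is scaled by a factor f,
# holding the other regressors fixed: b * ln(f).
effect_1pct = b_ln_area * math.log(1.01)    # area 1% larger
effect_double = b_ln_area * math.log(2.0)   # area twice as large

print(round(effect_1pct, 3), round(effect_double, 2))
```

So a country with 1% more surface area is associated with roughly 0.06 percentage points lower imports/GDP, and one twice as large with roughly 4.3 points lower, conditional on the other controls.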

6 Estimation with dummy variables and interactions

Countries differ systematically by continent and income group. To capture these differences, we can include dummy variables (one‑hot indicators) for continent and income group, omitting one category from each set to serve as the baseline. Adding interactions between tariffs and selected dummies then tests whether the tariff–imports relationship differs by region or income. The specification estimated here uses the income-group dummies and their interactions with tariffs; continent dummies can be added in the same way.

The dummy coefficients reflect level differences relative to the omitted baseline category (here “High income,” the group that does not appear among the estimated dummies). The interaction coefficients show how the tariff slope in each category differs from the baseline slope, which is the coefficient on tariff_weighted itself. For example, a positive and significant tariff_weighted:wb_income_group[T.Upper middle income] term implies that the tariff–imports association is less negative in upper-middle-income countries than in high-income countries, after controlling for other factors.

# Drop unknown income group
data = data[data["wb_income_group"] != "Unknown"]
# Interactions: `*` = main effects + interaction with tariff
m_interact = smf.ols(
    formula=(
        "imports_pct_gdp ~ "
        "ln_population + ln_surface_area + internet_users_pct "
        "+ tariff_weighted * wb_income_group"
    ),
    data=data
).fit()

print(m_interact.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        imports_pct_gdp   R-squared:                       0.599
Model:                            OLS   Adj. R-squared:                  0.561
Method:                 Least Squares   F-statistic:                     15.42
Date:                Mon, 18 Aug 2025   Prob (F-statistic):           1.65e-16
Time:                        13:21:56   Log-Likelihood:                -479.11
No. Observations:                 114   AIC:                             980.2
Df Residuals:                     103   BIC:                             1010.
Df Model:                          10                                         
Covariance Type:            nonrobust                                         
==========================================================================================================================
                                                             coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------------------------------------------
Intercept                                                160.9381     23.442      6.865      0.000     114.447     207.429
wb_income_group[T.Low income]                             -5.3191     23.276     -0.229      0.820     -51.482      40.844
wb_income_group[T.Lower middle income]                    10.0412      8.381      1.198      0.234      -6.581      26.663
wb_income_group[T.Upper middle income]                    -5.1788      5.816     -0.890      0.375     -16.714       6.357
ln_population                                             -2.4272      1.400     -1.734      0.086      -5.204       0.349
ln_surface_area                                           -6.3641      1.147     -5.550      0.000      -8.638      -4.090
internet_users_pct                                         0.1998      0.165      1.211      0.229      -0.127       0.527
tariff_weighted                                           -4.2064      1.012     -4.156      0.000      -6.214      -2.199
tariff_weighted:wb_income_group[T.Low income]              3.2281      2.019      1.599      0.113      -0.776       7.233
tariff_weighted:wb_income_group[T.Lower middle income]     2.3106      1.251      1.847      0.068      -0.170       4.792
tariff_weighted:wb_income_group[T.Upper middle income]     2.5466      1.238      2.058      0.042       0.092       5.001
==============================================================================
Omnibus:                        1.540   Durbin-Watson:                   2.114
Prob(Omnibus):                  0.463   Jarque-Bera (JB):                1.484
Skew:                           0.273   Prob(JB):                        0.476
Kurtosis:                       2.883   Cond. No.                     1.37e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.37e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
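
To read the interaction coefficients as tariff slopes, add each interaction term to the baseline tariff_weighted coefficient. The short sketch below does that arithmetic with the point estimates copied from the summary above. It assumes the omitted reference category is High income (the only World Bank group not listed among the dummies), and the variable names are illustrative rather than part of the workshop code.

```python
# Point estimates copied from the interaction-model summary above.
# Assumption: the omitted (reference) income group is "High income".
base_slope = -4.2064  # coefficient on tariff_weighted (reference group)

# Coefficients on tariff_weighted:wb_income_group[T.<group>]
interaction_shifts = {
    "Low income": 3.2281,
    "Lower middle income": 2.3106,
    "Upper middle income": 2.5466,
}

# Group-specific slope = baseline slope + that group's interaction shift
group_slopes = {"High income (assumed reference)": base_slope}
for group, shift in interaction_shifts.items():
    group_slopes[group] = base_slope + shift

for group, slope in group_slopes.items():
    print(f"{group:<32} {slope:8.4f}")
```

Under that assumption the tariff slope is steepest for the reference group and flatter in the other three groups. The standard errors of these sums are not shown in the summary; computing them requires the coefficient covariance matrix.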

7 Polynomial regression: allowing curvature in tariffs

A polynomial regression is still linear in the parameters; you simply add higher‑order terms of a predictor to allow curvature. For tariffs, including a squared term lets the marginal effect of tariffs depend on the tariff level itself. Formally, \text{Imports} = \alpha + \beta_1 \cdot \text{Tariff} + \beta_2 \cdot \text{Tariff}^2 + \gamma^\top Z + u. The marginal effect of tariffs is then $ \partial\,\text{Imports} / \partial\,\text{Tariff} = \beta_1 + 2\beta_2\,\text{Tariff} $. If $ \beta_2 \neq 0 $, curvature matters. No special estimator is required; you just add the squared term.

What does the polynomial contribute? If the squared term is significant and the adjusted R^2 improves, the data favor curvature—perhaps small tariff changes matter little at low levels but bite more as tariffs climb. If the squared term is small and insignificant and fit does not improve, keep the linear model for parsimony. The goal is interpretability with enough flexibility to capture important non‑linear patterns.

# Polynomial in tariff: add tariff^2 (use I() to interpret literally)
m_poly = smf.ols(
    formula=(
        "imports_pct_gdp ~ "
        "tariff_weighted + I(tariff_weighted**2) "
        "+ ln_gdppc_ppp_const + ln_population + ln_surface_area + internet_users_pct"
    ),
    data=data
).fit()

print(m_poly.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:        imports_pct_gdp   R-squared:                       0.570
Model:                            OLS   Adj. R-squared:                  0.546
Method:                 Least Squares   F-statistic:                     23.65
Date:                Mon, 18 Aug 2025   Prob (F-statistic):           1.23e-17
Time:                        13:21:56   Log-Likelihood:                -483.14
No. Observations:                 114   AIC:                             980.3
Df Residuals:                     107   BIC:                             999.4
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===========================================================================================
                              coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------------
Intercept                 208.3254     33.660      6.189      0.000     141.599     275.052
tariff_weighted            -4.6382      1.370     -3.386      0.001      -7.354      -1.922
I(tariff_weighted ** 2)     0.1360      0.072      1.893      0.061      -0.006       0.278
ln_gdppc_ppp_const         -4.7052      3.946     -1.192      0.236     -12.527       3.117
ln_population              -1.6004      1.444     -1.108      0.270      -4.463       1.263
ln_surface_area            -6.5670      1.165     -5.636      0.000      -8.877      -4.257
internet_users_pct          0.1229      0.176      0.700      0.486      -0.225       0.471
==============================================================================
Omnibus:                        4.495   Durbin-Watson:                   2.023
Prob(Omnibus):                  0.106   Jarque-Bera (JB):                4.073
Skew:                           0.457   Prob(JB):                        0.130
Kurtosis:                       3.147   Cond. No.                     2.09e+03
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.09e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
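
As a rough illustration of the marginal-effect formula, the sketch below plugs the two tariff coefficients from the summary above into \beta_1 + 2\beta_2 \cdot \text{Tariff} and computes the tariff level at which the fitted marginal effect reaches zero. The coefficients are copied by hand from the printed output; the function name and the evaluation points are illustrative choices, not part of the workshop code.

```python
# Point estimates copied from the polynomial-model summary above
beta_1 = -4.6382  # coefficient on tariff_weighted
beta_2 = 0.1360   # coefficient on I(tariff_weighted ** 2)


def marginal_effect(tariff):
    """Fitted d(imports_pct_gdp)/d(tariff_weighted) at a given tariff level."""
    return beta_1 + 2 * beta_2 * tariff


# Tariff level where the fitted marginal effect equals zero
turning_point = -beta_1 / (2 * beta_2)

# Illustrative tariff levels, in the units of tariff_weighted
for tariff in (0, 5, 10, 15):
    print(f"tariff = {tariff:>2}: marginal effect = {marginal_effect(tariff):.4f}")
print(f"turning point: {turning_point:.2f}")
```

With these point estimates the marginal effect is negative at low tariffs and shrinks in magnitude as tariffs rise, reaching zero at a tariff level of roughly 17 in the units of tariff_weighted. Because the squared term is significant only at the 10% level (p = 0.061), treat that turning point as imprecise.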

8 Presenting your results

Raw regression output is rarely easy to read or to compare across models. The code below uses summary_col from statsmodels to build compact, side-by-side regression tables.

# Nice compact comparison table (adjust which models you want to include)
from statsmodels.iolib.summary2 import summary_col  # harmless if already imported earlier
models = [m_simple, m_multiple]
names  = ["Simple OLS", "Multi OLS"]

table_sm = summary_col(
    models, stars=True, model_names=names,
    info_dict={
        "N":      lambda x: f"{int(x.nobs)}",
        "AIC":     lambda x: f"{x.aic:.1f}",
        "BIC":     lambda x: f"{x.bic:.1f}",
        "RSS":     lambda x: f"{x.ssr:.2f}"   
    }
)
print(table_sm)

=========================================
                   Simple OLS  Multi OLS 
-----------------------------------------
Intercept          61.2804*** 173.4888***
                   (3.2358)   (31.2231)  
tariff_weighted    -2.0136*** -2.0940*** 
                   (0.4320)   (0.4390)   
ln_gdppc_ppp_const            -2.2177    
                              (3.7677)   
ln_population                 -1.6214    
                              (1.4049)   
ln_surface_area               -6.2471*** 
                              (1.1322)   
internet_users_pct            0.1232     
                              (0.1719)   
R-squared          0.1471     0.5443     
R-squared Adj.     0.1403     0.5246     
AIC                1198.9     1048.3     
BIC                1204.6     1065.1     
N                  128        122        
RSS                84909.90   34908.82   
=========================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01

# Same comparison table for the interaction and polynomial models
models = [m_interact, m_poly]
names  = ["Interactions (tariff×groups)", "Polynomial (tariff + tariff^2)"]

table_sm = summary_col(
    models, stars=True, model_names=names,
    info_dict={
        "N":      lambda x: f"{int(x.nobs)}",
        "AIC":     lambda x: f"{x.aic:.1f}",
        "BIC":     lambda x: f"{x.bic:.1f}",
        "RSS":     lambda x: f"{x.ssr:.2f}"   
    }
)
print(table_sm)

==================================================================================================================
                                                       Interactions (tariff×groups) Polynomial (tariff + tariff^2)
------------------------------------------------------------------------------------------------------------------
Intercept                                              160.9381***                  208.3254***                   
                                                       (23.4417)                    (33.6599)                     
wb_income_group[T.Low income]                          -5.3191                                                    
                                                       (23.2761)                                                  
wb_income_group[T.Lower middle income]                 10.0412                                                    
                                                       (8.3811)                                                   
wb_income_group[T.Upper middle income]                 -5.1788                                                    
                                                       (5.8164)                                                   
ln_population                                          -2.4272*                     -1.6004                       
                                                       (1.3999)                     (1.4442)                      
ln_surface_area                                        -6.3641***                   -6.5670***                    
                                                       (1.1467)                     (1.1653)                      
internet_users_pct                                     0.1998                       0.1229                        
                                                       (0.1649)                     (0.1757)                      
tariff_weighted                                        -4.2064***                   -4.6382***                    
                                                       (1.0121)                     (1.3700)                      
tariff_weighted:wb_income_group[T.Low income]          3.2281                                                     
                                                       (2.0191)                                                   
tariff_weighted:wb_income_group[T.Lower middle income] 2.3106*                                                    
                                                       (1.2510)                                                   
tariff_weighted:wb_income_group[T.Upper middle income] 2.5466**                                                   
                                                       (1.2376)                                                   
I(tariff_weighted ** 2)                                                             0.1360*                       
                                                                                    (0.0719)                      
ln_gdppc_ppp_const                                                                  -4.7052                       
                                                                                    (3.9457)                      
R-squared                                              0.5995                       0.5701                        
R-squared Adj.                                         0.5606                       0.5460                        
AIC                                                    980.2                        980.3                         
BIC                                                    1010.3                       999.4                         
N                                                      114                          114                           
RSS                                                    29842.44                     32031.57                      
==================================================================================================================
Standard errors in parentheses.
* p<.1, ** p<.05, ***p<.01
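
To see where the AIC and BIC rows come from, the sketch below recomputes them from the RSS, N, and coefficient count of each model in the table above, using the standard Gaussian log-likelihood for OLS. The helper name gaussian_ic is hypothetical, and the inputs are typed in by hand from the table rather than pulled from the fitted model objects.

```python
import math


def gaussian_ic(rss, n, k):
    """Return (log-likelihood, AIC, BIC) for an OLS fit with Gaussian errors.

    rss: residual sum of squares, n: observations,
    k: number of estimated coefficients including the intercept.
    """
    llf = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    aic = -2 * llf + 2 * k
    bic = -2 * llf + k * math.log(n)
    return llf, aic, bic


# (RSS, N, number of coefficients) read off the comparison table above
specs = {
    "Interactions": (29842.44, 114, 11),
    "Polynomial": (32031.57, 114, 7),
}

results = {name: gaussian_ic(*spec) for name, spec in specs.items()}
for name, (llf, aic, bic) in results.items():
    print(f"{name:<13} llf={llf:8.2f}  AIC={aic:7.1f}  BIC={bic:7.1f}")
```

Lower values are better for both criteria. Both models use the same 114 observations, so the criteria are comparable: AIC is essentially tied, while BIC, which penalizes the four extra coefficients of the interaction model more heavily, favors the polynomial specification.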

9 Closing remarks

You now have a complete cross‑section pipeline on tariffs and trade: descriptive analysis, simple vs. multiple OLS, dummy variables and interactions for income, and a polynomial extension for curvature. In your write‑up, emphasize economic mechanisms and magnitudes, not just statistical significance. In the next workshop, we will rigorously check model assumptions and discuss robustness (e.g., heteroskedasticity‑robust standard errors and alternative outcomes such as Trade % GDP and Exports % GDP).

10 References

Cheong, J., Kwak, D. W., & Tang, K. K. (2018). The trade effects of tariffs and non-tariff changes of preferential trade agreements. Economic Modelling, 70, 370–382. https://doi.org/10.1016/j.econmod.2017.08.011

Costinot, A., & Werning, I. (2025). How tariffs affect trade deficits (NBER Working Paper No. 33709). National Bureau of Economic Research. https://www.nber.org/papers/w33709

Freund, C. L., & Weinhold, D. (2004). The effect of the Internet on international trade. Journal of International Economics, 62(1), 171–189. https://doi.org/10.1016/S0022-1996(03)00059-X

Gorman, W. M. (1959). The effect of tariffs on the level and terms of trade. Journal of Political Economy, 67(3), 246–265. https://doi.org/10.1086/258174

Leith, J. C. (1971). The effects of tariffs on production, consumption, and trade: A revised analysis. American Economic Review, 61(1), 74–81. https://www.jstor.org/stable/1910542

Osnago, A., & Tan, S. W. (2016). Disaggregating the impact of the Internet on international trade (Policy Research Working Paper No. 7785). World Bank. https://openknowledge.worldbank.org/bitstreams/0208884c-459b-56b2-af92-a4f029fe0a17/download

Politico. (2025, July 31). Trump issues order imposing new global tariff rates effective Aug. 7. https://www.politico.com/news/2025/07/31/trump-executive-order-higher-tariff-rates-00487913

Reuters. (2025, July 31). Trump gives Mexico 90-day tariff reprieve as deadline for higher duties looms. https://www.reuters.com/world/americas/trump-gives-mexico-90-day-tariff-reprieve-deadline-higher-duties-looms-2025-07-31/

The White House. (2025, April 2). Fact sheet: President Donald J. Trump declares national emergency to increase our competitive edge, protect our sovereignty, and strengthen our national and economic security. https://www.whitehouse.gov/fact-sheets/2025/04/fact-sheet-president-donald-j-trump-declares-national-emergency-to-increase-our-competitive-edge-protect-our-sovereignty-and-strengthen-our-national-and-economic-security/

The White House. (2025, July 31). Further modifying the reciprocal tariff rates [Executive order]. https://www.whitehouse.gov/presidential-actions/2025/07/further-modifying-the-reciprocal-tariff-rates/