Mexico exports characteristics
• Mexico’s exports grew strongly between 2015 and 2025, almost
doubling in total value. This shows how important international trade is
for the Mexican economy.
• Around 90% of exports come from manufactured goods, especially
automotive, electronics, and machinery. Mexico is highly specialized in
industrial production.
• More than 80% of exports go to the United States. This means Mexico
depends heavily on the U.S. market and regional supply chains.
• Even with challenges such as COVID-19 and the renegotiation of
NAFTA into USMCA, exports recovered quickly and continued growing.
Use of Panel Data for Short‑Term Export Prediction
• We can use past export trends from different states or sectors to
estimate what could happen in the next 1–2 years.
• It is possible to include variables like exchange rate, U.S.
economic growth, or tariff changes to simulate different scenarios.
• Predictions can be made at the state or sector level, not only for
the whole country. This gives more specific and useful results.
Role of Business Intelligence and Predictive Analytics for
Identifying Vulnerability During USMCA Negotiations
• Business Intelligence tools can measure how much each state or
sector depends on exports to the U.S. This helps identify which ones are
more exposed to tariff changes.
• Predictive models can estimate how sensitive exports are to new
trade rules. With this, we can classify states or industries by level of
risk.
• Dashboards and visual tools allow decision makers to monitor export
performance before and after policy changes.
• Machine learning techniques can group states or sectors with
similar export structures. This helps identify patterns of vulnerability
during trade negotiations.
Exports Data Analysis
Data Cleaning and Transforming
library(readxl)
library(tidyverse)
library(plm)
library(sf)
library(rnaturalearth)
library(dplyr)
library(ggplot2)
library(reshape2)
library(readr)
library(lmtest)
library(sandwich)
library(car)
library(dplyr)
df_expo <- read_excel("C:/Users/Osval/Downloads/inegi_exports_dataset.xlsx", sheet = "exports")
df_ts <- read_excel("C:/Users/Osval/Downloads/inegi_exports_dataset.xlsx", sheet = "ts_exports")
df_data <- read_excel("C:/Users/Osval/Downloads/inegi_exports_dataset.xlsx", sheet = "data")
df_fdi <- read_excel("C:/Users/Osval/Downloads/inegi_exports_dataset.xlsx", sheet = "fdi")
• The original datasets were cleaned and transformed. Export
variables were converted from wide format to long format so each row
represents one state in one specific year.
df_expolong <- df_expo %>%
pivot_longer(
cols = starts_with("real_exports_"),
names_to = "year",
names_prefix = "real_exports_",
values_to = "real_exports"
)
df_fdilong <- df_fdi %>%
pivot_longer(
cols = starts_with("fdi_"),
names_to = "year",
names_prefix = "fdi_",
values_to = "fdi"
)
• We verified that there were no duplicated observations by state and
year before merging the datasets.
df_expolong %>% count(state, year, region) %>% filter(n > 1)
## # A tibble: 0 × 4
## # ℹ 4 variables: state <chr>, year <chr>, region <chr>, n <int>
df_data %>% count(state, year) %>% filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: state <chr>, year <dbl>, n <int>
df_fdilong %>% count(state, year, region) %>% filter(n > 1)
## # A tibble: 0 × 4
## # ℹ 4 variables: state <chr>, year <chr>, region <chr>, n <int>
• Different datasets (exports, FDI, macroeconomic variables, and
state characteristics) were merged using state and year as key
variables.
df_expolong$year <- as.numeric(df_expolong$year)
df_fdilong$year <- as.numeric(df_fdilong$year)
df_data$year <- as.numeric(df_data$year)
panel_full <- df_expolong %>%
left_join(df_fdilong, by = c("state", "year", "region")) %>%
left_join(df_data, by = c("state", "year"))
• Missing values were replaced using the average value by state in
order to maintain a balanced panel structure.
panel_full <- panel_full %>%
group_by(state) %>%
mutate(
across(
where(is.numeric),
~ ifelse(is.na(.), mean(., na.rm = TRUE), .)
)
) %>%
ungroup()
• A panel dataset was created using state and year as indexes,
allowing analysis across time and across regions.
panel_data <- pdata.frame(panel_full,
index = c("state", "year"))
Exploratory Data Analysis – EDA
Descriptive Statistics
regional_avg <- panel_full %>%
group_by(region) %>%
summarise(
avg_exports = mean(real_exports, na.rm = TRUE),
avg_fdi = mean(fdi, na.rm = TRUE),
avg_exchange_rate = mean(exchange_rate, na.rm = TRUE),
avg_college = mean(college_education, na.rm = TRUE),
.groups = "drop"
)
ggplot(regional_avg, aes(x = region, y = avg_exports)) +
geom_col() +
labs(title = "Average Exports by Region",
x = "Region",
y = "Average Real Exports") +
theme_minimal()

ggplot(regional_avg, aes(x = region, y = avg_fdi)) +
geom_col() +
labs(title = "Average FDI by Region",
x = "Region",
y = "Average FDI") +
theme_minimal()

ggplot(regional_avg, aes(x = region, y = avg_college)) +
geom_col() +
labs(title = "Average College Education by Region",
x = "Region",
y = "Average College Education") +
theme_minimal()

ggplot(panel_full, aes(x = real_exports)) +
geom_histogram(bins = 30) +
labs(title = "Distribution of Real Exports",
x = "Real Exports",
y = "Frequency") +
theme_minimal()

ggplot(panel_full, aes(x = fdi)) +
geom_histogram(bins = 30) +
labs(title = "Distribution of FDI",
x = "FDI",
y = "Frequency") +
theme_minimal()

ggplot(panel_full, aes(x = college_education)) +
geom_histogram(bins = 30) +
labs(title = "Distribution of College Education",
x = "College Education",
y = "Frequency") +
theme_minimal()

Statistics of Dispersion
ggplot(panel_full, aes(x = region, y = real_exports)) +
geom_boxplot() +
labs(title = "Exports Distribution Across Regions",
x = "Region",
y = "Real Exports") +
theme_minimal()

ggplot(panel_full, aes(x = region, y = fdi)) +
geom_boxplot() +
labs(title = "FDI Distribution Across Regions",
x = "Region",
y = "FDI") +
theme_minimal()

ggplot(panel_full, aes(x = region, y = lq_primary)) +
geom_boxplot() +
labs(title = "College Education Across Regions",
x = "Region",
y = "College Education") +
theme_minimal()

ggplot(panel_full, aes(x = region, y = lq_secondary)) +
geom_boxplot() +
labs(title = "College Education Across Regions",
x = "Region",
y = "College Education") +
theme_minimal()

Visualization across Time
exports_time <- panel_full %>%
group_by(year) %>%
summarise(total_exports = sum(real_exports, na.rm = TRUE))
ggplot(exports_time, aes(x = year, y = total_exports)) +
geom_line(size = 1) +
labs(title = "Total Exports Inflows in Mexico",
x = "Year",
y = "Total Real Exports") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
• Total exports show a clear upward trend over the period. There is a
temporary decline around the COVID-19 period, but exports recover
quickly and continue growing.
exports_region <- panel_full %>%
group_by(region, year) %>%
summarise(total_exports = sum(real_exports, na.rm = TRUE),
.groups = "drop")
ggplot(exports_region, aes(x = year, y = total_exports, color = region)) +
geom_line(linewidth = 1) +
labs(title = "Total Exports Inflows by Region",
x = "Year",
y = "Total Real Exports",
color = "Region") +
theme_minimal()
• Northern regions show higher export levels compared to southern
regions.
• The gap between regions remains persistent over time, which
suggests structural differences in industrial development.
macro_time <- panel_full %>%
group_by(year) %>%
summarise(
total_exports = sum(real_exports, na.rm = TRUE),
avg_exchange_rate = mean(exchange_rate, na.rm = TRUE),
.groups = "drop"
)
macro_time <- macro_time %>%
mutate(
exports_index = total_exports / max(total_exports),
exchange_index = avg_exchange_rate / max(avg_exchange_rate)
)
ggplot(macro_time, aes(x = year)) +
geom_line(aes(y = exports_index, color = "Exports"), linewidth = 1.2) +
geom_line(aes(y = exchange_index, color = "Exchange Rate"), linewidth = 1.2) +
labs(title = "Exports Inflows vs Exchange Rate (Indexed)",
x = "Year",
y = "Indexed Values",
color = "") +
theme_minimal()
• When comparing indexed exports and exchange rate, both variables show
similar movements in some years.
• This suggests that exchange rate fluctuations may influence export
performance, although the relationship is not perfectly linear.
###Regional Differences
panel_full <- panel_full %>%
mutate(zone = ifelse(border_distance < 400, "North", "South"))
panel_full <- panel_full %>%
mutate(zone = ifelse(region %in% c("Noroeste", "Noreste"),
"North",
"South"))
north_south_exports <- panel_full %>%
group_by(zone) %>%
summarise(avg_exports = mean(real_exports, na.rm = TRUE),
.groups = "drop")
ggplot(north_south_exports, aes(x = zone, y = avg_exports)) +
geom_bar(stat = "identity") +
labs(title = "Average Exports Inflows: North vs South",
x = "Region",
y = "Average Real Exports") +
theme_minimal()

• States classified as “North” have significantly higher average
exports than those in the “South.”
• This confirms the importance of geographic proximity to the U.S.
border and industrial concentratio
• Northern states show much higher export volumes. Southern states
participate less in international trade.
• Geographic proximity to the U.S. border is a improtant factor.
Northern states benefit from lower transportation costs and better
access to cross-border infrastructure.
• Industrial structure is different. Northern states have higher
concentration in manufacturing and export-oriented industries, while
southern states more on primary activities or services.
• Foreign Direct Investment is more concentrated in the North. Many
multinational firms are located in northern states due to their
connection with U.S. markets.
• Infrastructure and connectivity are generally stronger in northern
states, including highways, industrial parks, and customs facilities,
which facilitate international trade.
mexico_map <- ne_states(country = "mexico", returnclass = "sf")
setdiff(mexico_map$name, panel_data$state)
## [1] "Nuevo León" "Yucatán" "Michoacán" NA
## [5] "México" "Querétaro" "San Luis Potosí" "Distrito Federal"
mexico_map <- ne_states(country = "mexico", returnclass = "sf")
setdiff(panel_data$state, mexico_map$name)
## [1] "Ciudad de Mexico" "Mexico" "Michoacan" "Nuevo Leon"
## [5] "Queretaro" "San Luis Potosi" "Yucatan"
panel_full$state <- dplyr::recode(panel_full$state,
"Ciudad de Mexico" = "Distrito Federal",
"Michoacan" = "Michoacán",
"Queretaro" = "Querétaro",
"Mexico" = "México",
"San Luis Potosi" = "San Luis Potosí",
"Yucatan" = "Yucatán",
"Nuevo Leon" = "Nuevo León"
)
• Border states such as Chihuahua and other northern states show
higher export levels.
avg_state_exports <- panel_full %>%
group_by(state) %>%
summarise(avg_exports = mean(real_exports, na.rm = TRUE),
.groups = "drop")
map_data <- mexico_map %>%
left_join(avg_state_exports, by = c("name" = "state"))
ggplot(map_data) +
geom_sf(aes(fill = avg_exports)) +
scale_fill_viridis_c(option = "mako", na.value = "grey90") +
labs(title = "Average Exports Inflows by Mexican State",
fill = "Avg Exports") +
theme_minimal()

Correlation and Relationships
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
midpoint = 0,
limits = c(-1, 1)
)
## <ScaleContinuous>
## Range:
## Limits: -1 -- 1
corr_data <- panel_data %>%
select(real_exports, exchange_rate, pop_density, fdi, border_distance, lq_primary,lq_secondary,lq_tertiary, average_daily_salary, crime_rate) %>%
drop_na()
corr_matrix <- cor(corr_data)
corr_long <- as.data.frame(as.table(corr_matrix))
ggplot(corr_long, aes(Var1, Var2, fill = Freq)) +
geom_tile() +
scale_fill_gradient2(
low = "blue",
mid = "white",
high = "red",
midpoint = 0,
limits = c(-1, 1)
) +
geom_text(aes(label = round(Freq, 2)), size = 4) +
labs(title = "Correlation Matrix",
x = "",
y = "",
fill = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

• Exports show positive correlation with FDI and employment
concentration in the secondary sector.
• There is a negative relationship between border distance and
exports, meaning that states closer to the U.S. tend to export more.
• Some socioeconomic variables such as salary also show positive
association with exports, suggesting that more industrialized states
have higher income levels.
ggplot(panel_full, aes(x = border_distance, y = real_exports)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(title = "Exports Inflows vs Border Disance",
x = "Border Disance",
y = "Exports") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

• The regression line shows a negative slope. States farther from the
border tend to have lower export levels.
ggplot(panel_full, aes(x = lq_secondary, y = real_exports)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(title = "Exports Inflows vs Employment concentration in the secondary sector",
x = "Employment concentration in the secondary sector",
y = "Real Exports") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
• There is a positive relationship. States with stronger industrial
specialization export more.
ggplot(panel_full, aes(x = average_daily_salary, y = real_exports)) +
geom_point(alpha = 0.6) +
geom_smooth(method = "lm", se = TRUE, color = "black") +
labs(title = "Exports Inflows vs Average Daily Salary",
x = "Average Daily Salary",
y = "Real Exports") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

• The relationship appears positive, suggesting that higher
productivity and industrialization are linked to higher wages and
exports.
##Hypotheses Statements
• H1: States with higher employment concentration in the secondary
(industrial specialization) and third (services) sector have
significantly higher export levels.
• H2: Border distance has a negative and significant effect on
exports. States closer to the U.S. border export more than states
located farther away.
• H3: Higher levels of Foreign Direct Investment (FDI) positively
affect export performance at the state level.
• H4: States with higher average wages, as a proxy for productivity
and industrial development, show higher export performance.
Main Findings
• Total exports show fluctuations over time, with increases and
decreases from one year to another.
• Northern states consistently show higher export levels than
southern states.
• The scatter plot of exports versus border distance shows a negative
relationship. States closer to the U.S. border tend to export more.
• There is a positive relationship between exports and employment
concentration in the secondary sector. States with stronger
manufacturing labor tend to achieve higher export levels.
• Exports are positively correlated with FDI and average daily
salary. This suggests that more industrialized states perform better in
international trade.
#Model ## Differences
• Type of data used o Cross-sectional: Analyzes many individuals
(people, firms, countries, etc.) at a single point in time. o Time
series: Analyzes one individual or variable over a period of time. o
Panel data: Combines both; analyzes multiple individuals across multiple
time periods.
• Dimension of analysis o Cross-sectional: Has only an individual
dimension (N). o Time series: Has only a time dimension (T). o Panel
data: Has both an individual dimension (N) and a time dimension (T)
simultaneously.
• Type of variability captured o Cross-sectional: Compares
differences between individuals. o Time series: Examines changes over
time. o Panel data: Allows analysis of both differences between
individuals and changes over time simultaneously.
• Common statistical issues o Cross-sectional: May suffer from
heteroskedasticity. o Time series: May suffer from autocorrelation and
non-stationarity problems. o Panel data: May involve fixed or random
effects, as well as autocorrelation and heteroskedasticity.
• Analytical capacity and depth o Cross-sectional: Simpler, but does
not capture time dynamics. o Time series: Allows analysis of trends and
cycles over time. o Panel data: More comprehensive because it better
controls for individual heterogeneity and improves estimator
precision.
model_formula <- real_exports ~ lq_secondary +
average_daily_salary + pop_density + real_public_investment_pc
pooling <- plm(model_formula, data = panel_data, model = "pooling")
summary(pooling)
## Pooling Model
##
## Call:
## plm(formula = model_formula, data = panel_data, model = "pooling")
##
## Balanced Panel: n = 32, T = 9, N = 288
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -4.80e+08 -1.11e+08 4.19e+06 0.00e+00 1.09e+08 4.42e+08
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## (Intercept) -1004258854 83563782 -12.0179 < 2.2e-16 ***
## lq_secondary 455153589 28339329 16.0608 < 2.2e-16 ***
## average_daily_salary 2449813 259179 9.4522 < 2.2e-16 ***
## pop_density -49737 11526 -4.3153 2.205e-05 ***
## real_public_investment_pc -37243 25461 -1.4627 0.1447
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2.2278e+19
## Residual Sum of Squares: 8.3902e+18
## R-Squared: 0.62339
## Adj. R-Squared: 0.61807
## F-statistic: 117.11 on 4 and 283 DF, p-value: < 2.22e-16
fe_ind <- plm(model_formula, data = panel_data, model = "within", effect = "individual")
summary(fe_ind)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = model_formula, data = panel_data, effect = "individual",
## model = "within")
##
## Balanced Panel: n = 32, T = 9, N = 288
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.20e+08 -1.12e+07 -7.81e+04 0.00e+00 8.88e+06 1.51e+08
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## lq_secondary 25034558.1 29583697.1 0.8462 0.3982
## average_daily_salary -17708.3 116039.2 -0.1526 0.8788
## pop_density -21049.3 271104.2 -0.0776 0.9382
## real_public_investment_pc 7340.4 6973.5 1.0526 0.2935
##
## Total Sum of Squares: 3.0639e+17
## Residual Sum of Squares: 3.0385e+17
## R-Squared: 0.0082919
## Adj. R-Squared: -0.12945
## F-statistic: 0.526758 on 4 and 252 DF, p-value: 0.71616
fe_time <- plm(model_formula, data = panel_data, model = "within", effect = "time")
summary(fe_time)
## Oneway (time) effect Within Model
##
## Call:
## plm(formula = model_formula, data = panel_data, effect = "time",
## model = "within")
##
## Balanced Panel: n = 32, T = 9, N = 288
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -3.85e+08 -9.29e+07 1.67e+06 0.00e+00 1.12e+08 4.15e+08
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## lq_secondary 435521531 27192797 16.0161 < 2.2e-16 ***
## average_daily_salary 3233824 280159 11.5428 < 2.2e-16 ***
## pop_density -63991 11248 -5.6889 3.268e-08 ***
## real_public_investment_pc -76084 25687 -2.9620 0.003324 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2.2211e+19
## Residual Sum of Squares: 7.3903e+18
## R-Squared: 0.66727
## Adj. R-Squared: 0.65275
## F-statistic: 137.876 on 4 and 275 DF, p-value: < 2.22e-16
fe_tw <- plm(model_formula, data = panel_data, model = "within", effect = "twoways")
summary(fe_tw)
## Twoways effects Within Model
##
## Call:
## plm(formula = model_formula, data = panel_data, effect = "twoways",
## model = "within")
##
## Balanced Panel: n = 32, T = 9, N = 288
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -1.04e+08 -1.43e+07 -1.69e+06 0.00e+00 1.59e+07 1.29e+08
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## lq_secondary 13539301.8 26702971.7 0.5070 0.61259
## average_daily_salary -28770.2 229440.3 -0.1254 0.90032
## pop_density -286037.4 252623.3 -1.1323 0.25863
## real_public_investment_pc 13038.5 6598.8 1.9759 0.04929 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 2.3933e+17
## Residual Sum of Squares: 2.3375e+17
## R-Squared: 0.023305
## Adj. R-Squared: -0.14882
## F-statistic: 1.4555 on 4 and 244 DF, p-value: 0.21642
re <- plm(model_formula, data = panel_data, model = "random")
summary(re)
## Oneway (individual) effect Random Effect Model
## (Swamy-Arora's transformation)
##
## Call:
## plm(formula = model_formula, data = panel_data, model = "random")
##
## Balanced Panel: n = 32, T = 9, N = 288
##
## Effects:
## var std.dev share
## idiosyncratic 1.206e+15 3.472e+07 0.044
## individual 2.602e+16 1.613e+08 0.956
## theta: 0.9284
##
## Residuals:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -80456012 -18157289 -7849908 0 10480115 203393118
##
## Coefficients:
## Estimate Std. Error z-value Pr(>|z|)
## (Intercept) 139354203.5 56498460.8 2.4665 0.013644 *
## lq_secondary 86780057.3 30093475.3 2.8837 0.003931 **
## average_daily_salary 31246.0 116089.7 0.2692 0.787811
## pop_density -28431.3 29383.7 -0.9676 0.333251
## real_public_investment_pc 7097.1 7531.2 0.9424 0.346009
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 4.1893e+17
## Residual Sum of Squares: 4.0313e+17
## R-Squared: 0.037702
## Adj. R-Squared: 0.024101
## Chisq: 11.0878 on 4 DF, p-value: 0.025595
##
## F test for individual effects
##
## data: model_formula
## F = 216.34, df1 = 31, df2 = 252, p-value < 2.2e-16
## alternative hypothesis: significant effects
##
## F test for time effects
##
## data: model_formula
## F = 4.6511, df1 = 8, df2 = 275, p-value = 2.384e-05
## alternative hypothesis: significant effects
##
## F test for twoways effects
##
## data: model_formula
## F = 218.31, df1 = 39, df2 = 244, p-value < 2.2e-16
## alternative hypothesis: significant effects
##
## Hausman Test
##
## data: model_formula
## chisq = 729.9, df = 4, p-value < 2.2e-16
## alternative hypothesis: one model is inconsistent
vif(lm(model_formula, data = panel_data))
## lq_secondary average_daily_salary pop_density
## 1.112735 1.398501 1.460673
## real_public_investment_pc
## 1.054705
##
## studentized Breusch-Pagan test
##
## data: fe_time
## BP = 108.78, df = 4, p-value < 2.2e-16
##
## Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data: model_formula
## chisq = 187.28, df = 9, p-value < 2.2e-16
## alternative hypothesis: serial correlation in idiosyncratic errors
library(plm)
models <- list(
Pooling = pooling,
FE_Individual = fe_ind,
FE_TwoWays = fe_tw,
RE = re,
FE_Time = fe_time
)
comparison <- data.frame(
Model = names(models),
R2 = sapply(models, function(m) summary(m)$r.squared["rsq"]),
Adj_R2 = sapply(models, function(m) summary(m)$r.squared["adjrsq"])
)
comparison
## Model R2 Adj_R2
## Pooling.rsq Pooling 0.623389824 0.61806671
## FE_Individual.rsq FE_Individual 0.008291903 -0.12944533
## FE_TwoWays.rsq FE_TwoWays 0.023304595 -0.14881796
## RE.rsq RE 0.037702199 0.02410082
## FE_Time.rsq FE_Time 0.667272482 0.65275346