Questions

a) Briefly describe what are 3 main differences between Global Regression Analysis vs. Local Regression Analysis?

The first major difference lies in how model parameters are estimated. In global regression (OLS), a single set of coefficients is produced and applied uniformly across the entire study area, assuming that the relationship between variables is identical at every point in space. In contrast, local regression (GWR) generates a distinct coefficient for each location, meaning that every point essentially has its own regression model calibrated using the observations nearest to it.

The second difference concerns the assumption of spatial stationarity. OLS assumes that the underlying processes do not change across space, which is frequently an unrealistic simplification. GWR explicitly abandons this assumption and models non-stationarity, allowing relationships between variables to strengthen, weaken, or even reverse direction depending on the geographic zone being analyzed.

The third difference is the nature of the outputs. A global model produces a single R², one set of residuals, and one coefficient vector. A local model produces spatial surfaces: maps of local coefficients, point-by-point local R², and georeferenced residuals, which allow analysts to visualize and interpret how the model behaves across every part of the territory under study.

b) How is the optimal bandwidth determined in GWR, and what criteria (e.g., AIC, cross-validation) guide this choice?

The bandwidth is the parameter that controls how many neighboring observations influence the estimation of each local regression. A bandwidth that is too narrow leads to overfitting because the model reacts to local noise; one that is too wide collapses the model toward OLS, losing the spatial variation that justifies using GWR in the first place.

There are two types of bandwidth. A fixed bandwidth applies a constant radius in kilometers or distance units across all locations. An adaptive bandwidth adjusts the window size according to data density, becoming narrower where observations are abundant and wider where data are sparse; the adaptive type is generally preferred when the distribution of points across the study area is irregular.

Three criteria are primarily used to select the optimal bandwidth. AICc, the corrected version of the Akaike Information Criterion, is the most widely used in GWR practice because it penalizes model complexity and is robust with small samples; the bandwidth that minimizes AICc across multiple candidates evaluated iteratively is selected. Leave-one-out cross-validation predicts each observation using a model built without it, minimizing cumulative prediction error across all points. BIC applies a harsher complexity penalty than AIC, tending to favor larger bandwidths that produce smoother coefficient surfaces.

c) How do local parameter estimates vary across space, and what insights can these variations provide for decision-making in business intelligence?

In GWR, each location receives its own coefficient vector estimated by weighting nearby observations through a kernel function such as Gaussian or bisquare. This produces a continuous surface of parameter estimates that can be mapped and interpreted geographically. Two particularly informative patterns emerge from this variation. First, sign reversal occurs when a variable that is positively associated with an outcome in one region becomes negatively associated in another, for example when income drives higher spending in urban cores but has little or inverse effect in rural peripheries due to differing consumption structures. Second, magnitude gradients reveal that an effect may be strong in certain clusters and taper off progressively toward other zones, indicating where a particular driver is more or less relevant.

For business intelligence, these variations are highly actionable. Local R² maps reveal which areas the model explains well and which remain poorly understood, signaling where additional variables or data collection efforts are needed. Coefficient maps allow analysts to identify geographic segments where specific factors, such as price sensitivity, demographic composition, or accessibility, behave differently, enabling segment-specific strategies rather than one-size-fits-all decisions. In retail analytics, for instance, a GWR model might reveal that foot traffic is driven by public transit access in central districts but by parking availability in suburban ones, a distinction that a global model would completely obscure.

d) How can GWR results guide prescriptive analytics by identifying location-specific strategies (e.g., targeted investment, resource allocation)?

GWR transitions naturally from descriptive and diagnostic analytics into prescriptive analytics because its outputs are inherently spatial and actionable. Once local coefficients are mapped, decision-makers can identify precisely where each lever of intervention will have the greatest impact, rather than applying uniform policies across heterogeneous territories.

In targeted investment, GWR can reveal which geographic zones show the strongest positive relationship between, for example, infrastructure spending and economic output, allowing capital to be directed where the return is highest rather than distributed evenly. In resource allocation, local coefficient maps can identify areas where a particular service or product driver, such as promotional sensitivity or service demand elasticity, is significantly elevated, justifying concentrated resource deployment in those zones. In retail network expansion, if GWR reveals that the relationship between population density and store revenue is strong in the north of a city but weak in the south due to competing formats already present, expansion strategy can be calibrated accordingly.

More broadly, GWR supports prescriptive analytics by replacing the implicit assumption that one strategy fits all locations with an evidence-based geographic segmentation grounded in how relationships actually behave across space. This makes it particularly powerful in business intelligence contexts such as franchise territory design, logistics optimization, regional marketing mix modeling, and public-private investment prioritization, where spatial heterogeneity is not noise to be controlled but signal to be exploited.

1) Hypotheses

Based on the EDA and ESDA results from Activities 1 and 2, the dependent variable selected for the model is real exports at the Mexican state level.

H1. Border economic activity and export performance H1: States with higher border economic activity are expected to have higher levels of real exports. the idea that proximity to the northern border provides structural advantages for exports, such as easier access to the United States market. In the ESDA results, border economic activity showed the clearest geographic pattern, with positive values in the north and negative values in the south.
H2. Foreign Direct Investment and export performance H2: States with higher levels of foreign direct investment are expected to have higher levels of real exports. Foreign direct investment can increase export capacity because it is usually associated with multinational firms, capital-intensive production, technology transfer, and participation in global value chains. The previous ESDA suggested that FDI is spatially concentrated and has a marginal positive spatial autocorrelation, meaning that foreign investment tends to cluster moderately in certain regions.
H3. Industrial specialization and export performance H3: States with greater specialization in the secondary sector are expected to have higher levels of real exports. The secondary sector includes manufacturing and industrial production, which are directly related to export activity. The ESDA showed that LQ secondary has a clear regional pattern, especially with stronger industrial specialization in northern states.
H4. Labor conditions and export performance H4: States with higher average daily salaries are expected to have higher levels of real exports, although this relationship may vary across space. Higher wages may reflect more productive labor markets, stronger formal employment, greater human capital, and more complex economic structures. The ESDA also suggested that salaries are more geographically uniform than exports or FDI, so their effect may be weaker in a global OLS model and more relevant locally in a GWR model.

library(sf)
library(tidyverse)
library(car)
library(lmtest)
library(GWmodel)
library(tmap)
library(dplyr)

#setwd("~/Desktop/local_predictive_analytics")
panel_data <- read.csv("/Users/hanniahdz58/Desktop/local_predictive_analytics/panel_dataset.csv")

cs_data_2018 <- panel_data %>%
  filter(year == 2018) %>%
  dplyr::select(-year) %>%
  mutate(log_exports = log1p(exports_real_mxn))

# Fix state names FIRST
cs_data_2018 <- cs_data_2018 %>%
  mutate(
    state = trimws(state),
    state = case_when(
      state == "Ciudad de Mexico"                ~ "Distrito Federal",
      state == "Coahuila de Zaragoza"            ~ "Coahuila",
      state == "Veracruz de Ignacio de la Llave" ~ "Veracruz",
      TRUE ~ state
    )
  )

# THEN assign border_activity (after names are fixed)
border_states <- c("Baja California", "Sonora", "Chihuahua",
                   "Coahuila", "Nuevo León", "Tamaulipas")
cs_data_2018$border_activity <- ifelse(cs_data_2018$state %in% border_states, 1, 0)
cs_data_2018$region_a <- as.factor(cs_data_2018$region_a)

summary(cs_data_2018)

##     state              state_id       new_fdi        reinv_profits    
##  Length:32          Min.   : 888   Min.   : -72.45   Min.   : -37.14  
##  Class :character   1st Qu.:1047   1st Qu.:  65.67   1st Qu.:  65.09  
##  Mode  :character   Median :1081   Median : 179.11   Median : 305.36  
##                     Mean   :1219   Mean   : 353.87   Mean   : 416.03  
##                     3rd Qu.:1118   3rd Qu.: 475.89   3rd Qu.: 568.26  
##                     Max.   :2357   Max.   :2254.03   Max.   :2523.46  
##   intercom_acc        total_fdi         crime_rate     unemployment    
##  Min.   :-209.000   Min.   :  70.22   Min.   : 2.52   Min.   :0.01000  
##  1st Qu.:   6.305   1st Qu.: 236.04   1st Qu.:10.74   1st Qu.:0.02750  
##  Median :  89.040   Median : 509.06   Median :23.01   Median :0.04000  
##  Mean   : 295.755   Mean   :1065.66   Mean   :29.68   Mean   :0.03469  
##  3rd Qu.: 501.858   3rd Qu.:1310.07   3rd Qu.:41.79   3rd Qu.:0.04000  
##  Max.   :1612.130   Max.   :5836.90   Max.   :94.53   Max.   :0.06000  
##    employment     business_activity   real_wage     real_ave_month_income
##  Min.   :0.9300   Min.   :-2.540    Min.   :251.7   Min.   :3281         
##  1st Qu.:0.9600   1st Qu.:-2.160    1st Qu.:289.5   1st Qu.:4357         
##  Median :0.9700   Median :-2.015    Median :308.1   Median :5002         
##  Mean   :0.9675   Mean   :-1.775    Mean   :314.7   Mean   :5078         
##  3rd Qu.:0.9800   3rd Qu.:-1.805    3rd Qu.:336.7   3rd Qu.:5815         
##  Max.   :0.9900   Max.   : 2.430    Max.   :431.3   Max.   :7155         
##   pop_density      good_governance  ratio_public_investment   lq_primary   
##  Min.   :  10.33   Min.   : 0.000   Min.   :0.000000        Min.   :0.010  
##  1st Qu.:  42.22   1st Qu.: 0.405   1st Qu.:0.000000        1st Qu.:0.465  
##  Median :  65.06   Median : 1.285   Median :0.000000        Median :0.935  
##  Mean   : 307.47   Mean   : 3.035   Mean   :0.003438        Mean   :1.059  
##  3rd Qu.: 156.88   3rd Qu.: 2.770   3rd Qu.:0.010000        3rd Qu.:1.300  
##  Max.   :6154.80   Max.   :21.630   Max.   :0.010000        Max.   :3.900  
##   lq_secondary     lq_tertiary     exchange_rate    patents_rate  
##  Min.   :0.3900   Min.   :0.8100   Min.   :20.11   Min.   :0.200  
##  1st Qu.:0.6350   1st Qu.:0.9300   1st Qu.:20.11   1st Qu.:1.265  
##  Median :0.9650   Median :1.0100   Median :20.11   Median :1.815  
##  Mean   :0.9856   Mean   :0.9997   Mean   :20.11   Mean   :2.608  
##  3rd Qu.:1.2525   3rd Qu.:1.0725   3rd Qu.:20.11   3rd Qu.:3.312  
##  Max.   :1.8000   Max.   :1.1800   Max.   :20.11   Max.   :9.140  
##       inpc     border_distance   college_education new_fdi_real_mxn
##  Min.   :103   Min.   :   8.83   Min.   :0.3050    Min.   :-1414   
##  1st Qu.:103   1st Qu.: 613.26   1st Qu.:0.3760    1st Qu.: 1282   
##  Median :103   Median : 751.64   Median :0.4135    Median : 3497   
##  Mean   :103   Mean   : 704.92   Mean   :0.4155    Mean   : 6908   
##  3rd Qu.:103   3rd Qu.: 875.76   3rd Qu.:0.4552    3rd Qu.: 9290   
##  Max.   :103   Max.   :1252.66   Max.   :0.5290    Max.   :44002   
##  log_new_fdi_real_mxn      region_a    region_b     trump_election
##  Min.   :-3.151       Bajio    :5   Min.   :1.000   Min.   :1     
##  1st Qu.: 3.107       Centro   :6   1st Qu.:2.000   1st Qu.:1     
##  Median : 3.543       Norte    :6   Median :3.000   Median :1     
##  Mean   : 3.191       Occidente:7   Mean   :3.188   Mean   :1     
##  3rd Qu.: 3.968       Sur      :8   3rd Qu.:4.250   3rd Qu.:1     
##  Max.   : 4.643                     Max.   :5.000   Max.   :1     
##  exports_real_mxn   log_exports     border_activity 
##  Min.   :   1.54   Min.   :0.9322   Min.   :0.0000  
##  1st Qu.:  26.53   1st Qu.:3.3099   1st Qu.:0.0000  
##  Median : 110.44   Median :4.7058   Median :0.0000  
##  Mean   : 244.44   Mean   :4.5756   Mean   :0.1562  
##  3rd Qu.: 390.04   3rd Qu.:5.9688   3rd Qu.:0.0000  
##  Max.   :1031.93   Max.   :6.9402   Max.   :1.0000

2) Cross-Sectional Dataset

The cross-sectional dataset is constructed by filtering the panel to the year 2018, yielding 32 observations — one per Mexican state. The dependent variable is the natural log of real exports in MXN (log_exports), and the four independent variables correspond directly to the hypotheses: a binary indicator for border states (border_activity), the log of real FDI inflows (log_new_fdi_real_mxn), the location quotient of the secondary sector (lq_secondary), and the average real daily wage (real_wage). The log transformation on exports and FDI reduces right-skew and stabilizes variance, which is consistent with the distributions observed in Activity 1.

3) OLS Model, Multicollinearity & Heteroscedasticity

ols <- lm(log_exports ~ border_activity + log_new_fdi_real_mxn + lq_secondary + real_wage,
          data = cs_data_2018)
summary(ols)

## 
## Call:
## lm(formula = log_exports ~ border_activity + log_new_fdi_real_mxn + 
##     lq_secondary + real_wage, data = cs_data_2018)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84188 -0.50897  0.01987  0.67557  1.46068 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)   
## (Intercept)          -3.02779    1.42645  -2.123  0.04310 * 
## border_activity       0.81357    0.58960   1.380  0.17895   
## log_new_fdi_real_mxn  0.20851    0.11420   1.826  0.07894 . 
## lq_secondary          1.98874    0.56715   3.507  0.00161 **
## real_wage             0.01541    0.00460   3.351  0.00239 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9887 on 27 degrees of freedom
## Multiple R-squared:  0.6752, Adjusted R-squared:  0.6271 
## F-statistic: 14.03 on 4 and 27 DF,  p-value: 2.579e-06

vif(ols)

##      border_activity log_new_fdi_real_mxn         lq_secondary 
##             1.500150             1.152529             1.490592 
##            real_wage 
##             1.127773

bptest(ols)

## 
##  studentized Breusch-Pagan test
## 
## data:  ols
## BP = 6.0574, df = 4, p-value = 0.1949

OLS Results. The global model explains approximately 67.5% of the variance in log exports across Mexican states (R² = 0.675, Adjusted R² = 0.627), and is highly significant overall (F = 14.03, p < 0.001). Among the four predictors, industrial specialization (lq_secondary, β = 1.989, p = 0.0016) and average real wages (real_wage, β = 0.015, p = 0.0024) are the strongest and most statistically significant drivers of export performance, providing support for H3 and H4. FDI (log_new_fdi_real_mxn, β = 0.209, p = 0.079) shows a positive marginal effect but falls just short of conventional significance thresholds, offering partial support for H2. Border activity (border_activity, β = 0.814, p = 0.179) has the expected positive sign but is not statistically significant in the global model, suggesting that its effect may be geographically concentrated and better captured through a local specification such as GWR — which will be explored in the next section.

Multicollinearity. All VIF values are well below the threshold of 5: border activity (1.50), FDI (1.15), secondary LQ (1.49), and real wage (1.13). There is no evidence of problematic multicollinearity, and all predictors can be retained in the model.

Heteroscedasticity. The Breusch-Pagan test yields BP = 6.057 (df = 4, p = 0.195), failing to reject the null hypothesis of homoscedastic residuals. The OLS standard errors are therefore reliable, and no correction for heteroscedasticity is required at this stage.

4) GWR Model & Adaptive Kernel Justification

Mexico’s 32 states vary dramatically in geographic size — Chihuahua covers over 247,000 km², while Tlaxcala spans less than 4,000 km² and CDMX barely 1,500 km². A fixed-distance kernel would apply the same spatial radius to every state: in the north, this radius might capture only 2–3 neighbors, leading to unreliable local estimates with high variance; in the densely packed central region, the same radius would pull in 15+ observations, producing overly smoothed estimates. An adaptive kernel solves this by defining bandwidth as a fixed number of nearest neighbors rather than a fixed distance. This ensures that every local regression is calibrated using the same amount of information regardless of state size or geographic density, making estimates comparably reliable across the entire territory. For a dataset of only 32 observations with such extreme size heterogeneity, the adaptive kernel is not just preferable — it is structurally necessary.

mx_state_map <- st_read("~/Downloads/mx_maps/mx_states/mexlatlong.shp")

## Reading layer `mexlatlong' from data source 
##   `/Users/hanniahdz58/Downloads/mx_maps/mx_states/mexlatlong.shp' 
##   using driver `ESRI Shapefile'
## Simple feature collection with 32 features and 19 fields
## Geometry type: MULTIPOLYGON
## Dimension:     XY
## Bounding box:  xmin: -118.4042 ymin: 14.55055 xmax: -86.73862 ymax: 32.71846
## Geodetic CRS:  WGS 84

# left_join — all 32 states match now, no NAs to drop
sp_cs_data <- mx_state_map %>%
  left_join(cs_data_2018, by = c("ADMIN_NAME" = "state"))

mx_state_geodata <- as(sp_cs_data, "Spatial")

bw_gwr <- bw.gwr(
  log_exports ~ border_activity + log_new_fdi_real_mxn + lq_secondary + real_wage,
  data     = mx_state_geodata,
  approach = "AICc",
  kernel   = "bisquare",
  adaptive = TRUE
)

## Adaptive bandwidth (number of nearest neighbours): 27 AICc value: 108.8179 
## Adaptive bandwidth (number of nearest neighbours): 25 AICc value: 113.7819 
## Adaptive bandwidth (number of nearest neighbours): 30 AICc value: 105.0167 
## Adaptive bandwidth (number of nearest neighbours): 30 AICc value: 105.0167

gwr_model <- gwr.basic(
  log_exports ~ border_activity + log_new_fdi_real_mxn + lq_secondary + real_wage,
  data     = mx_state_geodata,
  bw       = bw_gwr,
  kernel   = "bisquare",
  adaptive = TRUE
)
gwr_model

##    ***********************************************************************
##    *                       Package   GWmodel                             *
##    ***********************************************************************
##    Program starts at: 2026-05-22 17:22:31.654881 
##    Call:
##    gwr.basic(formula = log_exports ~ border_activity + log_new_fdi_real_mxn + 
##     lq_secondary + real_wage, data = mx_state_geodata, bw = bw_gwr, 
##     kernel = "bisquare", adaptive = TRUE)
## 
##    Dependent (y) variable:  log_exports
##    Independent variables:  border_activity log_new_fdi_real_mxn lq_secondary real_wage
##    Number of data points: 32
##    ***********************************************************************
##    *                    Results of Global Regression                     *
##    ***********************************************************************
## 
##    Call:
##     lm(formula = formula, data = data)
## 
##    Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.84188 -0.50897  0.01987  0.67557  1.46068 
## 
##    Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)   
##    (Intercept)          -3.02779    1.42645  -2.123  0.04310 * 
##    border_activity       0.81357    0.58960   1.380  0.17895   
##    log_new_fdi_real_mxn  0.20851    0.11420   1.826  0.07894 . 
##    lq_secondary          1.98874    0.56715   3.507  0.00161 **
##    real_wage             0.01541    0.00460   3.351  0.00239 **
## 
##    ---Significance stars
##    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 
##    Residual standard error: 0.9887 on 27 degrees of freedom
##    Multiple R-squared: 0.6752
##    Adjusted R-squared: 0.6271 
##    F-statistic: 14.03 on 4 and 27 DF,  p-value: 2.579e-06 
##    ***Extra Diagnostic information
##    Residual sum of squares: 26.39514
##    Sigma(hat): 0.9379968
##    AIC:  96.65028
##    AICc:  100.0103
##    BIC:  94.23911
##    ***********************************************************************
##    *          Results of Geographically Weighted Regression              *
##    ***********************************************************************
## 
##    *********************Model calibration information*********************
##    Kernel function: bisquare 
##    Adaptive bandwidth: 30 (number of nearest neighbours)
##    Regression points: the same locations as observations are used.
##    Distance metric: Euclidean distance metric is used.
## 
##    ****************Summary of GWR coefficient estimates:******************
##                              Min.   1st Qu.    Median   3rd Qu.    Max.
##    Intercept            -3.412393 -2.561977 -2.195664 -1.742157 -1.5005
##    border_activity       0.513836  0.560184  0.678961  0.790071  0.8502
##    log_new_fdi_real_mxn  0.179559  0.245372  0.305963  0.354798  0.4252
##    lq_secondary          1.649531  1.674777  1.798967  2.109769  2.6038
##    real_wage             0.010250  0.010876  0.011166  0.012814  0.0177
##    ************************Diagnostic information*************************
##    Number of data points: 32 
##    Effective number of parameters (2trace(S) - trace(S'S)): 8.769145 
##    Effective degrees of freedom (n-2trace(S) + trace(S'S)): 23.23086 
##    AICc (GWR book, Fotheringham, et al. 2002, p. 61, eq 2.33): 105.0167 
##    AIC (GWR book, Fotheringham, et al. 2002,GWR p. 96, eq. 4.22): 88.42222 
##    BIC (GWR book, Fotheringham, et al. 2002,GWR p. 61, eq. 2.34): 74.84476 
##    Residual sum of squares: 23.51342 
##    R-square value:  0.7106775 
##    Adjusted R-square value:  0.5965519 
## 
##    ***********************************************************************
##    Program stops at: 2026-05-22 17:22:31.662009

gwr_sf_results <- st_as_sf(gwr_model$SDF)

Bandwidth selection. The AICc-based search selected an optimal adaptive bandwidth of 30 nearest neighbors (AICc = 105.02), evaluated against 27 neighbors (AICc = 108.82) and 25 neighbors (AICc = 113.78). The monotonic improvement toward larger bandwidths reflects the small sample size (n = 32): with only 32 states, narrower bandwidths produce unstable local estimates, and the model benefits from including most of the country in each local regression. In practical terms, a bandwidth of 30 means that each state’s local model draws on nearly the entire national sample, which is consistent with the relatively low spatial heterogeneity suggested by the OLS residuals.

GWR vs. OLS performance. The GWR model improves upon the global OLS in terms of fit: R² increases from 0.675 to 0.711, and the residual sum of squares decreases from 26.40 to 23.51. The GWR AIC (88.42) is notably lower than the OLS AIC (96.65), confirming that allowing coefficients to vary locally provides a better characterization of the data. The local coefficient ranges reveal meaningful spatial variation: lq_secondary ranges from 1.65 to 2.60 across states, and log_new_fdi_real_mxn ranges from 0.18 to 0.43, suggesting that these relationships are stronger in some regions than others. The border_activity coefficient ranges from 0.51 to 0.85, with its effect being most pronounced in states closest to the northern border — consistent with H1.

5) Geographic Visualization & Statistical Validation

# 5a) Actual
ggplot(st_as_sf(mx_state_geodata)) +
  geom_sf(aes(fill = log_exports)) +
  scale_fill_distiller(palette = "Blues", direction = 1, name = "Actual") +
  labs(title = "Actual log(Exports)") +
  theme_void()

# 5a) Predicted
ggplot(gwr_sf_results) +
  geom_sf(aes(fill = yhat)) +
  scale_fill_distiller(palette = "Blues", direction = 1, name = "Predicted") +
  labs(title = "GWR Predicted log(Exports)") +
  theme_void()

5a) Actual vs. Predicted. The side-by-side choropleth maps confirm that the GWR model successfully reproduces the main geographic pattern of Mexican export activity. In both maps, the highest export values are concentrated in the northern border states — Chihuahua, Coahuila, Nuevo León, Tamaulipas — and in industrialized central states such as Estado de México and Jalisco. The predicted map captures this north-south gradient with reasonable fidelity, reflecting the strong structural role of border proximity and industrial specialization. The primary discrepancies appear in southern and southeastern states, where the predicted values tend to deviate more from actuals, anticipating the residual patterns examined next.

# 5b) Local residuals
ggplot(gwr_sf_results) +
  geom_sf(aes(fill = residual)) +
  scale_fill_gradient2(low = "red", mid = "white", high = "blue", midpoint = 0,
                       name = "Residual") +
  labs(title = "GWR Local Residuals",
       subtitle = "Red = under-predicted | Blue = over-predicted") +
  theme_void()

5b) Local Residuals. The residual map reveals that the GWR model’s prediction errors are not randomly distributed, which indicates that some structural variation remains unaccounted for. States shown in red are under-predicted, meaning their actual export volumes exceed what the four-variable model would anticipate given their border status, FDI, industrial specialization, and wages — suggesting the presence of additional export drivers not captured in the current specification, such as port access, maquiladora density, or sector-specific trade agreements. States in blue are over-predicted, indicating that the model assigns them higher expected exports than are actually observed, possibly reflecting structural barriers or lower integration into global value chains that the model does not directly measure. The geographic clustering of residuals in certain regions also suggests that spatial autocorrelation in the errors has not been fully eliminated, which could be explored in future iterations using spatial error models.

# 5c) t-values — border_activity
gwr_sf_results$sig_level <- cut(
  abs(gwr_sf_results$border_activity_TV),
  breaks = c(0, 1.96, 2.58, Inf),
  labels = c("Not significant", "95% (t > 1.96)", "99% (t > 2.58)")
)

ggplot(gwr_sf_results) +
  geom_sf(aes(fill = sig_level)) +
  scale_fill_manual(values = c("Not significant" = "grey85",
                               "95% (t > 1.96)"  = "steelblue",
                               "99% (t > 2.58)"  = "darkblue"),
                    name = "Significance") +
  labs(title = "Local Significance: Border Activity") +
  theme_void()

5c) Local Significance of Border Activity. This map illustrates one of the most important advantages of GWR over OLS: the ability to detect where a predictor actually matters. In the global OLS, border_activity was statistically insignificant (p = 0.179), which could mistakenly lead to the conclusion that border proximity does not influence exports. The GWR t-value map reveals that this is a case of spatial averaging masking local signal: the effect of border activity reaches statistical significance (t > 1.96 or t > 2.58) precisely in the northern states where the border is geographically relevant, while remaining insignificant in central and southern states where it carries no structural meaning. This confirms H1 in a spatially nuanced way — border proximity matters, but only where it is actually a geographic reality — and underscores the value of local regression for policy analysis. A uniform national policy to leverage border proximity would be well-targeted in the north but irrelevant in the south.

# 5d) Local R²
ggplot(gwr_sf_results) +
  geom_sf(aes(fill = Local_R2)) +
  scale_fill_distiller(palette = "Blues", direction = 1, name = "R²") +
  labs(title = "GWR Local R² across Mexico's States") +
  theme_void()

5d) Local R². The local R² map shows that the model’s explanatory power is not uniform across Mexican territory. Darker states indicate areas where the four-variable specification explains a high proportion of the variance in exports locally, typically corresponding to states where the hypothesized drivers — border activity, FDI, industrial specialization, and wages — are the dominant forces shaping export performance. Lighter states reveal geographic zones where the model is less informative, suggesting that other structural factors are at play. This diagnostic is directly actionable: states with low local R² are candidates for model enrichment through additional variables (e.g., logistics infrastructure, energy costs, human capital quality), and should be interpreted with greater caution when using GWR predictions as the basis for regional policy or investment decisions. The pattern also reinforces the earlier residual analysis — zones of poor model fit and large residuals tend to overlap, pointing to coherent subregions where the current specification is insufficient.

Actividad 1.3 Modelos de Regresión Espacial y Machine Learning

Equipo 1 - Modulo 1

2026-05-21