1 Introduction

This report presents a spatial analysis using data from the Saber 11 exam (2024_2) and the National Geostatistical Framework (MGN) provided by DANE. The aim is to explore the relationship between educational outcomes and various sociodemographic factors at the municipal level in Colombia. Specifically, the analysis examines how elements such as ethnic composition, access to technological resources, parental education, and students’ work responsibilities are associated with overall scores in the Saber 11 exam, from a territorial perspective.

This initial exercise adopts an exploratory approach by applying spatial econometric models (SAR) which make it possible to capture geographic dependence between municipalities and detect patterns that might remain hidden in traditional econometric models.

Limitation: The “Saber 11” data are available at the individual level, while the study is conducted at the municipal level. To bridge this, individual-level data were aggregated using means. While necessary, this aggregation process can mask significant internal variation within municipalities.

# Esta sección se ejecuta pero no se muestra en el documento final

#Cargar paquetes
library(sf)        # Manejo de datos espaciales
library(dplyr)     # Manipulación de datos
library(haven)     # Lectura de archivos .dta (Stata)
library(ggplot2)   # Visualización de datos
library(patchwork)
library(spdep)
library(spatialreg)
library(stringr)

# 1) Cargar shapefile 
setwd("C:/Users/ferna/Desktop/Spacial Analysis 2025_r/Ejercicio")
municipios <- st_read("MGN_ADM_MPIO_GRAFICO.shp")

# 2) Cargar base de educacion 
library(readr)
Saber_11 <- read.csv("Examen_Saber_11_20242.txt", sep=";")

# 3) Manipulacion de variables, para agregacion municipal 

## Arregla variables 
Saber_11 <- Saber_11 %>%
  mutate(educacion_madre_num = case_when(
    fami_educacionmadre == "Ninguno" ~ 0,
    fami_educacionmadre == "Primaria incompleta" ~ 1,
    fami_educacionmadre == "Primaria completa" ~ 2,
    fami_educacionmadre == "Secundaria (Bachillerato) incompleta" ~ 3,
    fami_educacionmadre == "Secundaria (Bachillerato) completa" ~ 4,
    fami_educacionmadre == "Técnica o tecnológica incompleta" ~ 5,
    fami_educacionmadre == "Técnica o tecnológica completa" ~ 6,
    fami_educacionmadre == "Educación profesional incompleta" ~ 7,
    fami_educacionmadre == "Educación profesional completa" ~ 8,
    fami_educacionmadre == "Postgrado" ~ 9,
    TRUE ~ NA_real_
  ))

Saber_11 <- Saber_11 %>%
  mutate(educacion_padre_num = case_when(
    fami_educacionpadre == "Ninguno" ~ 0,
    fami_educacionpadre == "Primaria incompleta" ~ 1,
    fami_educacionpadre == "Primaria completa" ~ 2,
    fami_educacionpadre == "Secundaria (Bachillerato) incompleta" ~ 3,
    fami_educacionpadre == "Secundaria (Bachillerato) completa" ~ 4,
    fami_educacionpadre == "Técnica o tecnológica incompleta" ~ 5,
    fami_educacionpadre == "Técnica o tecnológica completa" ~ 6,
    fami_educacionpadre == "Educación profesional incompleta" ~ 7,
    fami_educacionpadre == "Educación profesional completa" ~ 8,
    fami_educacionpadre == "Postgrado" ~ 9,
    TRUE ~ NA_real_
  ))

Saber_11 <- Saber_11 %>%
  mutate(
    estrato_0 = if_else(fami_estratovivienda == "Sin Estrato", 1, 0, missing = NA_real_),
    estrato_1 = if_else(fami_estratovivienda == "Estrato 1", 1, 0, missing = NA_real_),
    estrato_2 = if_else(fami_estratovivienda == "Estrato 2", 1, 0, missing = NA_real_),
    estrato_3 = if_else(fami_estratovivienda == "Estrato 3", 1, 0, missing = NA_real_),
    estrato_4 = if_else(fami_estratovivienda == "Estrato 4", 1, 0, missing = NA_real_),
    estrato_5 = if_else(fami_estratovivienda == "Estrato 5", 1, 0, missing = NA_real_),
    estrato_6 = if_else(fami_estratovivienda == "Estrato 6", 1, 0, missing = NA_real_)
  )

Saber_11 <- Saber_11 %>%
  mutate(
    horas_trabajo_num = case_when(
      estu_horassemanatrabaja == "0" ~ 0,
      estu_horassemanatrabaja == "Menos de 10 horas" ~ 5,
      estu_horassemanatrabaja == "Entre 11 y 20 horas" ~ 15.5,
      estu_horassemanatrabaja == "Entre 21 y 30 horas" ~ 25.5,
      estu_horassemanatrabaja == "Más de 30 horas" ~ 40,
      estu_horassemanatrabaja == "" ~ NA_real_,
      TRUE ~ NA_real_
    )
  )

Saber_11 <- Saber_11 %>%
  mutate(
    num_libros = case_when(
      fami_numlibros == "0 A 10 LIBROS" ~ 5,
      fami_numlibros == "11 A 25 LIBROS" ~ 18,
      fami_numlibros == "26 A 100 LIBROS" ~ 63,
      fami_numlibros == "MÁS DE 100 LIBROS" ~ 120,
      fami_numlibros == "" ~ NA_real_,
      TRUE ~ NA_real_
    )
  )

Saber_11 <- Saber_11 %>%
  mutate(
    genero_femenino = case_when(
      estu_genero == "F" ~ 1,
      estu_genero == "M" ~ 0,
      TRUE ~ NA_real_
    ),
    
    tiene_etnia = case_when(
      estu_tieneetnia == "S" ~ 1,
      estu_tieneetnia == "N" ~ 0,
      estu_tieneetnia == ""  ~ 0,
      estu_tieneetnia == " " ~ 0,
      TRUE ~ NA_real_
    ),
    
    acceso_internet = case_when(
      fami_tieneinternet == "Si" ~ 1,
      fami_tieneinternet == "No" ~ 0,
      TRUE ~ NA_real_
    ),
    
    acceso_computador = case_when(
      fami_tienecomputador == "Si" ~ 1,
      fami_tienecomputador == "No" ~ 0,
      TRUE ~ NA_real_
    )
  )

## 4) Crear base de datos agregada a Nivel municipal 
Saber_municipio <- Saber_11 %>%
  group_by(cole_cod_mcpio_ubicacion) %>%
  summarise(
    total_estudiantes      = n(),
    
    punt_global            = mean(punt_global, na.rm = TRUE),
    punt_lectura           = mean(punt_lectura_critica, na.rm = TRUE),
    punt_matematicas       = mean(punt_matematicas, na.rm = TRUE),
    
    educacion_madre        = mean(educacion_madre_num, na.rm = TRUE),
    educacion_padre        = mean(educacion_padre_num, na.rm = TRUE),
    
    estrato_0              = mean(estrato_0, na.rm = TRUE),
    estrato_1              = mean(estrato_1, na.rm = TRUE),
    estrato_2              = mean(estrato_2, na.rm = TRUE),
    estrato_3              = mean(estrato_3, na.rm = TRUE),
    estrato_4              = mean(estrato_4, na.rm = TRUE),
    estrato_5              = mean(estrato_5, na.rm = TRUE),
    estrato_6              = mean(estrato_6, na.rm = TRUE),
    
    internet               = mean(acceso_internet, na.rm = TRUE),
    computador             = mean(acceso_computador, na.rm = TRUE),
    
    libros                 = mean(num_libros, na.rm = TRUE),
    horas_trabajo          = mean(horas_trabajo_num, na.rm = TRUE),
    
    genero_femenino        = mean(genero_femenino, na.rm = TRUE),
    etnia                  = mean(tiene_etnia, na.rm = TRUE)
  )

# 5) Revisar variable a unir 
class(Saber_municipio$cole_cod_mcpio_ubicacion) ## interger
class(municipios$mpio_cdpmp) #character
Saber_municipio$cole_cod_mcpio_ubicacion <- str_pad(as.character(Saber_municipio$cole_cod_mcpio_ubicacion), 5, pad = "0")
class(Saber_municipio$cole_cod_mcpio_ubicacion) #character


# 6) Unión entre shapefile y datos educativos
municipios_datos <- left_join(municipios, Saber_municipio, by = c("mpio_cdpmp" = "cole_cod_mcpio_ubicacion"))

# 7) revisar valores faltantes de pruebas 
sum(is.na(municipios_datos$punt_global))
municipios_datos <- municipios_datos %>% ## Elimino los datos vacios en punt_global_prom
  filter(!is.na(punt_global))

2 Spatial visualization

The following map shows the spatial distribution of the average overall score for the Saber 11 tests at the municipal level.

On this map, we begin to see a pattern where the highest scores are clustered in the Andes and lower scores toward the peripheries.

ggplot(municipios_datos) +
  geom_sf(aes(fill = punt_global), color = NA) +
  scale_fill_viridis_c(option = "D", na.value = "gray80") +
  labs(
    title = "Average Saber 11 Score by Municipality",
    fill = "Average Score"
  ) +
  theme_minimal()

2.1 Weight matrices and Moran’s Index

Two spatial weight matrices were constructed to assess the geographic dependence of average test scores:

A Contiguity matrix, which identifies neighboring municipalities that share a common border or vertex.
An inverse-distance matrix, which assigns greater weight to nearby municipalities and progressively attenuates the influence of more distant ones.

Using each matrix, the global Moran’s Index statistic was computed to measure the degree of spatial autocorrelation in Saber 11 scores.

#Verificar que todos los polígonos sean válidos
empty_geom <- st_is_empty(municipios_datos) # Revisa cuáles geometrías están vacías
table(empty_geom)  # Ver cuántas están vacías (TRUE = vacía) ## todas estan ok

# 2) Verificar que todos tengan vecinos (eliminar vacios, para que queden mejor los graficos)
neighbors_temp <- poly2nb(municipios_datos, queen = TRUE)
num_vecinos <- card(neighbors_temp)
sin_vecinos <- which(num_vecinos == 0)  # Identificar índices sin vecinos

# Filtrar del objeto espacial
municipios_datos <- municipios_datos[-sin_vecinos, ]

## 3) Matriz de contigüidad (vecinos por borde)
neighbors <- poly2nb(municipios_datos, queen = TRUE) # Crear vecinos espaciales
# 4) Normalized (row-standardized)
W <- nb2listw(neighbors, style = "W", zero.policy = TRUE)

### 2) Distance-based Weights – (Distancia inversa) = matriz de pesos espaciales basada en la distancia entre centroides
#el peso entre dos unidades espaciales disminuye cuando aumenta la distancia entre ellas

# 1. Get centroids / Obtener centroides
centroides <- st_centroid(municipios_datos)
coords <- st_coordinates(centroides)
# 2. Create list of neighbors within a radius of 100 km
dlist <- dnearneigh(coords, 0, 100000)  # 100 km = 100000 m
# 3. Calculate distances
distancias <- nbdists(dlist, coords)
# 4. Invert distances (as in "idistance")
inv_distancias <- lapply(distancias, function(x) 1 / x)
# 5. Create row-normalized list of weights (as in normalize(row))
Wdis <- nb2listw(dlist, glist = inv_distancias, style = "W", zero.policy = TRUE)

Moran I test (contiguity matrix):

moran.test(municipios_datos$punt_global, listw = W, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  municipios_datos$punt_global  
## weights: W    
## 
## Moran I statistic standard deviate = 34.703, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.6234486104     -0.0009000900      0.0003236756

Moran I test (inverse matrix):

moran.test(municipios_datos$punt_global, listw = Wdis, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  municipios_datos$punt_global  
## weights: Wdis    
## 
## Moran I statistic standard deviate = 84.764, p-value < 2.2e-16
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      1.447788e-01     -9.000900e-04      2.953710e-06

The Moran index applied to the”average overall score for Saber 11 suggests the presence of positive and significant spatial autocorrelation.

This indicates that municipalities with high scores tend to be surrounded by others with similar scores, and the same is true for low scores.

In other words, municipal educational performance is not distributed randomly across the territory, but rather forms geographical clusters of performance.

The results were visualized through spatial scatterplots (Moran plots)

par(mfrow = c(1, 2))  # define layout en una fila, dos columnas

moran.plot(as.vector(municipios_datos$punt_global),
           listw = W,
           zero.policy = TRUE,
           xlab = "Average score (municipality)",
           ylab = "Spatial Lag (contiguity)",
           main = "Moran Dispersion - Contiguity")

moran.plot(as.vector(municipios_datos$punt_global),
           listw = Wdis,
           zero.policy = TRUE,
           xlab = "Average score (municipality)",
           ylab = "Spatial Lag (inverse distance)",
           main = "Moran Dispersion - Inverse Distance")

The neighborhood matrix shows a clearer visual fit in Moran’s scatter plot: the scatter plot aligns better with the expected slope, suggesting that contiguous municipalities share more homogeneous performance levels.

2.2 Getis-Ord Gi*

To complement the global autocorrelation analysis, the Getis-Ord Gi* local statistic was applied to the average Saber 11 scores. This indicator allows for the detection of municipalities that form clusters of high or low intensity, i.e., areas where scores are significantly higher or lower than in the surrounding area.

This indicator not only tells us if there is spatial clustering, but also where that clustering occurs and how strong it is.

They are identified as follows:

Hotspots: municipalities with high values surrounded by other high values.
Coldspots: municipalities with low values surrounded by other low values.

gi_star <- localG(municipios_datos$punt_global, listw = W)
municipios_datos$gi_star <- as.numeric(localG(municipios_datos$punt_global, listw = W))

ggplot(municipios_datos) +
  geom_sf(aes(fill = gi_star), color = NA) +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", midpoint = 0) +
  labs(title = "Getis-Ord Gi* - Global Score",
       fill = "Gi* Z-score") +
  theme_minimal()

Gi* values reveal the presence of local clusters of educational performance. Municipalities marked as hotspots have significantly higher scores than their neighbors, while coldspots concentrate low results within equally disadvantaged environments. This territorial structure suggests that performance responds not only to internal factors, but also to the immediate context.

2.3 LISA

While Getis-Ord Gi* indicates where there is a significant concentration of high or low values, LISA analysis goes one step further: it allows us to understand how each municipality relates to its surroundings, classifying them into types of spatial association:

High–High
Low-Low
High–Low (outliers)
Low-High(outliers)

lisa <- localmoran(municipios_datos$punt_global, W)

## Unir resultados: 
municipios_datos$lisa_Ii <- lisa[, 1]        # índice local
municipios_datos$lisa_pvalor <- lisa[, 5]    # p-valor
municipios_datos$lisa_z <- lisa[, 4]

municipios_datos$cluster_LISA <- case_when(
  municipios_datos$punt_global >= mean(municipios_datos$punt_global, na.rm = TRUE) &
    lag.listw(W, municipios_datos$punt_global) >= mean(municipios_datos$punt_global, na.rm = TRUE) &
    municipios_datos$lisa_pvalor < 0.05 ~ "High-High",
  
  municipios_datos$punt_global < mean(municipios_datos$punt_global, na.rm = TRUE) &
    lag.listw(W, municipios_datos$punt_global) < mean(municipios_datos$punt_global, na.rm = TRUE) &
    municipios_datos$lisa_pvalor < 0.05 ~ "Low-Low",
  
  municipios_datos$punt_global >= mean(municipios_datos$punt_global, na.rm = TRUE) &
    lag.listw(W, municipios_datos$punt_global) < mean(municipios_datos$punt_global, na.rm = TRUE) &
    municipios_datos$lisa_pvalor < 0.05 ~ "High-Low",
  
  municipios_datos$punt_global < mean(municipios_datos$punt_global, na.rm = TRUE) &
    lag.listw(W, municipios_datos$punt_global) >= mean(municipios_datos$punt_global, na.rm = TRUE) &
    municipios_datos$lisa_pvalor < 0.05 ~ "Low-High",
  
  TRUE ~ "Not significant"
)

ggplot(municipios_datos) +
  geom_sf(aes(fill = cluster_LISA), color = NA) +
  scale_fill_manual(values = c(
    "High-High" = "#d7191c", 
    "Low-Low" = "#2c7bb6", 
    "High-Low" = "#fdae61", 
    "Low-High" = "#abd9e9", 
    "Not significant" = "gray80")) +
  labs(title = "Cluster LISA - Global Score",
       fill = "Type of association") +
  theme_minimal()

🟥 Central zone (High–High): includes regions with a greater institutional presence, consolidated educational infrastructure, and better socioeconomic conditions. The municipalities in this zone not only have good scores, but are surrounded by others that also have good scores—reinforcing performance through the environment.

🟦 National periphery (Low–Low): this often corresponds to rural, border areas or areas with a larger ethnic population, where lower scores are concentrated and the environment does not help to compensate for them. This reflects structural inequalities in access, quality, and the educational environment.

3 Spatial econometric models

The exploratory analysis above shows that the Saber 11 score exhibits significant territorial dependence, both in its overall distribution and in the local clusters detected by Gi* and LISA. In this context, applying spatial econometric models is not only a technical option but an analytical necessity to capture the educational dynamics that are structured geographically.

Spatial models allow us to:

Incorporate the influence of the territorial environment on municipal performance.
Avoid estimation errors that arise from ignoring autocorrelation.
Detect variables whose effect is amplified or attenuated depending on the spatial context.

# Variable dependiente: puntaje global
y <- municipios_datos$punt_global

# Variables independientes
X <- municipios_datos %>%
  st_drop_geometry() %>%
  select(
    etnia,
    horas_trabajo,
    educacion_madre,
    educacion_padre,
    libros,
    internet,
    computador,
  )

3.1 Classic Model (OLS)

As a starting point, a classic linear regression model (OLS) is estimated that relates the average Saber 11 score to social, educational, and material variables at the municipal level. The model shows statistically significant associations: for example, the score decreases with ethnic affiliation and student workload, and increases with parental education, access to books, and technology.

However, although the model explains part of the variability, it ignores the territorial dependence between municipalities, as previous tests have not confirmed, which may distort the coefficients and underestimate the effect of the environment.

Therefore, spatial autocorrelation tests are continued.

modelo_ols <- lm(y ~ ., data = X) 
summary(modelo_ols)

## 
## Call:
## lm(formula = y ~ ., data = X)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -61.189 -10.867   0.037  10.684  57.041 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     224.4443     4.0235  55.784  < 2e-16 ***
## etnia           -26.6215     3.1904  -8.344  < 2e-16 ***
## horas_trabajo    -1.9915     0.2878  -6.920 7.62e-12 ***
## educacion_madre   4.2821     1.9483   2.198   0.0282 *  
## educacion_padre  -9.0059     1.7609  -5.115 3.71e-07 ***
## libros            0.6309     0.1182   5.338 1.14e-07 ***
## internet         26.3317     4.1708   6.313 3.95e-10 ***
## computador       47.1141     5.2607   8.956  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.81 on 1104 degrees of freedom
## Multiple R-squared:  0.4211, Adjusted R-squared:  0.4174 
## F-statistic: 114.7 on 7 and 1104 DF,  p-value: < 2.2e-16

3.1.1 Lagrange Multiplier test

As a diagnostic tool based on the classic OLS model, the Lagrange Multiplier (LM) tests assess whether spatial autocorrelation is present and if so, which type of spatial structure dominates.

All LM tests yield extremely low p-values (<< 0.05), rejecting the null hypothesis of spatial independence. This confirms the need to apply spatial econometric models.

The adjusted versions of the tests help determine which model best captures spatial dependence:

Adjusted LM (error) -> adjRSerr(SEM) = 117.36
Adjusted LM (lag) → adjRSlag(SAR) = 42.114

These results suggest that the Spatial Error Model (SEM) offers a more appropriate specification than the SAR model, as it better accounts for the underlying spatial structure observed in the residuals.

lm_tests <- lm.LMtests(modelo_ols, listw = W, test = "all")
lm_tests

## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = y ~ ., data = X)
## test weights: listw
## 
## RSerr = 819.4, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = y ~ ., data = X)
## test weights: listw
## 
## RSlag = 741.45, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = y ~ ., data = X)
## test weights: listw
## 
## adjRSerr = 119.43, df = 1, p-value < 2.2e-16
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = y ~ ., data = X)
## test weights: listw
## 
## adjRSlag = 41.482, df = 1, p-value = 1.189e-10
## 
## 
##  Rao's score (a.k.a Lagrange multiplier) diagnostics for spatial
##  dependence
## 
## data:  
## model: lm(formula = y ~ ., data = X)
## test weights: listw
## 
## SARMA = 860.88, df = 2, p-value < 2.2e-16

3.2 Spatial Error Model (SEM)

Given that the Lagrange Multiplier tests indicated stronger spatial dependence in the residuals of the OLS model, we proceed with a Spatial Error Model (SEM). This specification captures unobserved territorial influences that affect educational performance, such as institutional structure, cultural factors, or regional disparities not explicitly included as covariates.

By modeling spatial autocorrelation through the error term, SEM acknowledges that the residuals from the classic regression are correlated across neighboring municipalities. This leads to more accurate coefficient estimates and a better representation of the relationship between predictors and Saber 11 scores within Colombia’s spatial framework.

modelo_sem <- errorsarlm(y ~ ., data = X, listw = W)
summary(modelo_sem)

## 
## Call:errorsarlm(formula = y ~ ., data = X, listw = W)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -43.33502  -6.88388  -0.14566   6.54413  38.03390 
## 
## Type: error 
## Coefficients: (asymptotic standard errors) 
##                   Estimate Std. Error z value  Pr(>|z|)
## (Intercept)     208.972626   3.487375 59.9226 < 2.2e-16
## etnia           -27.933263   3.107199 -8.9899 < 2.2e-16
## horas_trabajo    -1.529948   0.204212 -7.4920 6.795e-14
## educacion_madre   6.165239   1.282813  4.8060 1.540e-06
## educacion_padre   1.566828   1.192216  1.3142 0.1887741
## libros            0.285423   0.085688  3.3310 0.0008655
## internet         13.592857   3.239296  4.1962 2.714e-05
## computador       11.235135   4.128620  2.7213 0.0065029
## 
## Lambda: 0.81692, LR test value: 719.18, p-value: < 2.22e-16
## Asymptotic standard error: 0.020461
##     z-value: 39.925, p-value: < 2.22e-16
## Wald statistic: 1594, p-value: < 2.22e-16
## 
## Log likelihood: -4352.273 for error model
## ML residual variance (sigma squared): 124.18, (sigma: 11.144)
## Number of observations: 1112 
## Number of parameters estimated: 10 
## AIC: 8724.5, (AIC for lm: 9441.7)

The Spatial Error Model (SEM) confirms the relevance of territorial structure in explaining differences in Saber 11 scores across municipalities. Several variables remain statistically significant, reinforcing their influence even after accounting for unobserved spatial effects.

The spatial error coefficient (λ) is highly significant, indicating that residuals are correlated among neighboring municipalities. This suggests that factors not explicitly included in the model, such as institutional quality, regional educational policies or latent socioeconomic conditions , contribute to performance patterns in geographically linked areas.

Compared to the classic OLS model, SEM offers:

More robust coefficient estimates with spatial bias reduced.
Improved model fit and lower AIC.
A more faithful representation of territorial dynamics influencing educational outcomes.

Spatial Econometrics Exercise

Luisa Arenas - Sebastian Rangel

2025-07-22