Actividad 1.1. Análisis Exploratorio Espacial de Datos

1 Key Concepts

1.1 What is ESDA?

Exploratory Spatial Data Analysis (ESDA) is the process of analyzing data that has a geographic component to discover spatial patterns, clusters, and anomalies. It uses maps, spatial statistics, and visualizations to understand how a variable behaves across space.

	EDA	ESDA
Assumption	Observations are independent	Location matters
Focus	Statistical patterns	Spatial patterns
Output	Charts and summary stats	Maps and spatial statistics
Question	What is the distribution?	Where are the patterns?

1.2 What is Spatial Autocorrelation?

Spatial autocorrelation measures whether a variable is correlated with itself across space. It is basically do nearby locations tend to have similar values (positive autocorrelation) or different values (negative autocorrelation)?

Why does it matter in Business Analytics? Many business variables like sales, demand, and prices tend to cluster geographically. Ignoring spatial dependence can lead to biased models and wrong conclusions.

1.3 Global vs. Local Spatial Autocorrelation

	Global	Local
Scope	Entire study area	Each individual location
Question	Does clustering exist?	Where does clustering exist?
Output	Single statistic (Moran’s I)	Map of hot spots, cold spots, outliers
Use case	Quick overall diagnosis	Actionable zone-level insights

1.4 How ESDA Improves the Analytics Process

Descriptive: ESDA adds a spatial layer to exploration, allowing analysts to map distributions and spot clusters or outliers that standard statistics would miss.

Predictive: ESDA identifies spatial dependencies early, informing whether a standard or spatial regression model is more appropriate — leading to more accurate predictions.

Prescriptive: ESDA grounds recommendations in spatial patterns. If demand clusters in specific zones, resources like stores, logistics, and campaigns can be allocated more efficiently.

library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(sf)
library(sp)
library(spdep)
library(RColorBrewer)
library(viridis)
library(tmap)
library(tigris)

df <- read_excel('/Users/mariadelbv/Downloads/inegi_mx_state_tourism (1).xlsx', sheet = "panel_data")

## New names:
## • `region` -> `region...26`
## • `region` -> `region...27`

glossary <- read_excel('/Users/mariadelbv/Downloads/inegi_mx_state_tourism (1).xlsx', sheet = "glossary")

names(df)[names(df) == "region...26"] <- "region"
names(df)[names(df) == "region...27"] <- "region_num"

df$restaurants <- as.numeric(df$restaurants)

last_year <- max(df$year, na.rm = TRUE)

df_last <- df %>%
  filter(year == last_year)

cat("Selected year:", last_year, "\n")

## Selected year: 2022

cat("Number of observations:", nrow(df_last), "\n")

## Number of observations: 32

cat("States:", length(unique(df_last$state)), "\n")

## States: 32

glossary %>%
  filter(!is.na(Description)) %>%
  kable(col.names = c("Variable", "Description")) %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)

Variable	Description
tourism_activity	Travel and tourism’s direct contribution to GDP. Base Year is 2018 = 100.
crime_rate	Crime rate per 100,000 state’s population.
unemployment	Percentaje of unemployed population.
employment	Percentaje of employed population.
business_activity	Index of economic activity weighted by the state’s distance to nearest U.S. port of entry.
real_wage	Real wage using 2018 INPC = 100. MXN Pesos.
pop_density	Population per state’s land area km2.
good_governance	Ratio between state’s public investment and its public debt.
ratio_public_investment	Ratio between state’s public investment and its gross domestic product (GDP).
exchange_rate	MXN pesos per 1 USD
inpc	National Price Consumer Index. 2018 = 100
college_education	Percentage of population with at least high school education.

vars_of_interest <- c("tourism_activity",
                      "llegada_turistas_nacionales", "llegada_turistas_extranjeros",
                      "crime_rate", "pop_density")

df_selected <- df_last %>%
  select(state, state_id, year, region, all_of(vars_of_interest))

str(df_selected)

## tibble [32 × 9] (S3: tbl_df/tbl/data.frame)
##  $ state                       : chr [1:32] "Aguascalientes" "Baja California" "Baja California Sur" "Campeche" ...
##  $ state_id                    : num [1:32] 1057 2304 2327 1086 1182 ...
##  $ year                        : num [1:32] 2022 2022 2022 2022 2022 ...
##  $ region                      : chr [1:32] "Bajio" "Norte" "Occidente" "Sur" ...
##  $ tourism_activity            : num [1:32] 15049 69966 28951 12774 36945 ...
##  $ llegada_turistas_nacionales : num [1:32] 844346 3109495 1399476 1096087 3325811 ...
##  $ llegada_turistas_extranjeros: num [1:32] 91075 1780664 2526340 230739 532475 ...
##  $ crime_rate                  : num [1:32] 5.63 75.87 9.68 9.88 8.73 ...
##  $ pop_density                 : num [1:32] 263.6 54.2 11.4 18.2 76.9 ...

2 Descriptive Statistics by Region

descriptive_stats <- df_selected %>%
  group_by(region) %>%
  summarise(across(all_of(vars_of_interest),
                   list(
                     Mean   = ~round(mean(.x, na.rm = TRUE), 2),
                     Median = ~round(median(.x, na.rm = TRUE), 2),
                     Min    = ~round(min(.x, na.rm = TRUE), 2),
                     Max    = ~round(max(.x, na.rm = TRUE), 2)
                   ),
                   .names = "{.col}__{.fn}"))

summary(df_selected[, vars_of_interest])

##  tourism_activity llegada_turistas_nacionales llegada_turistas_extranjeros
##  Min.   :  8618   Min.   : 282384             Min.   :   21207            
##  1st Qu.: 22066   1st Qu.:1644743             1st Qu.:  162870            
##  Median : 34023   Median :2407776             Median :  237114            
##  Mean   : 60344   Mean   :3062893             Mean   :  871694            
##  3rd Qu.: 62411   3rd Qu.:4158283             3rd Qu.:  526229            
##  Max.   :365959   Max.   :9296032             Max.   :12589248            
##    crime_rate       pop_density     
##  Min.   :  1.984   Min.   :  11.36  
##  1st Qu.:  9.585   1st Qu.:  42.82  
##  Median : 17.346   Median :  68.54  
##  Mean   : 28.636   Mean   : 315.08  
##  3rd Qu.: 40.571   3rd Qu.: 163.03  
##  Max.   :111.064   Max.   :6211.45

3 Dispersion Statistics by Region

Standard deviation for each selected variable, grouped by region.

df_selected %>%
  group_by(region) %>%
  summarise(across(all_of(vars_of_interest), ~round(sd(.x, na.rm = TRUE), 2)))

## # A tibble: 5 × 6
##   region    tourism_activity llegada_turistas_nacionales llegada_turistas_extr…¹
##   <chr>                <dbl>                       <dbl>                   <dbl>
## 1 Bajio               54577.                    1398693.                  65694.
## 2 Centro             139415.                    2328171.                 863000.
## 3 Norte               29622.                    1257864.                 602360.
## 4 Occidente           45035.                    2737399.                 934755.
## 5 Sur                 46485.                    2630232.                4341189.
## # ℹ abbreviated name: ¹llegada_turistas_extranjeros
## # ℹ 2 more variables: crime_rate <dbl>, pop_density <dbl>

4 Histograms

options(scipen = 999)
par(mfrow = c(2, 3))

hist(df_selected$tourism_activity,main = "Tourism Activity",col = "lightblue", prob = TRUE, xlab = "GDP Contribution (2018=100)", xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))

hist(df_selected$llegada_turistas_nacionales,  main = "National Tourist Arrivals",col = "lightblue", prob = TRUE, xlab = "Number of Arrivals",           xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))

hist(df_selected$llegada_turistas_extranjeros, main = "International Tourist Arrivals", col = "lightblue", prob = TRUE, xlab = "Number of Arrivals",           xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))

hist(df_selected$crime_rate,main = "Crime Rate",col = "lightblue", prob = TRUE, xlab = "Rate per 100,000 pop.",xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))

hist(df_selected$pop_density,main = "Population Density",col = "lightblue", prob = TRUE, xlab = "pop./km²",xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))

par(mfrow = c(1, 1))

5 Histogramas with log transformaton

par(mfrow = c(2, 3))

hist(log(df_selected$tourism_activity),            main = "Tourism Activity (Log)",               col = "lightblue", prob = TRUE, xlab = "log(GDP Contribution)")
hist(log(df_selected$llegada_turistas_nacionales),  main = "National Tourist Arrivals (Log)",      col = "lightblue", prob = TRUE, xlab = "log(Arrivals)")
hist(log(df_selected$llegada_turistas_extranjeros), main = "International Tourist Arrivals (Log)", col = "lightblue", prob = TRUE, xlab = "log(Arrivals)")
hist(log(df_selected$crime_rate),                  main = "Crime Rate (Log)",                     col = "lightblue", prob = TRUE, xlab = "log(Rate per 100,000 pop.)")
hist(log(df_selected$pop_density),                 main = "Population Density (Log)",             col = "lightblue", prob = TRUE, xlab = "log(pop./km²)")

par(mfrow = c(1, 1))

5.1 Histogram Interpretation

Which variables exhibit the presence of outliers?

Which variables display a right / left skew?

Which variables require data transformation?

6 Boxplots by Region

par(mfrow = c(2, 3), mar = c(7, 6, 3, 1), mgp = c(4, 1, 0))

boxplot(df_selected$tourism_activity            ~ df_selected$region, main = "Tourism Activity",               col = brewer.pal(5, "Set2"), xlab = "", ylab = "GDP Contribution",  las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)

boxplot(df_selected$llegada_turistas_nacionales  ~ df_selected$region, main = "National Tourist Arrivals",     col = brewer.pal(5, "Set2"), xlab = "", ylab = "Arrivals",          las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)

boxplot(df_selected$llegada_turistas_extranjeros ~ df_selected$region, main = "International Tourist Arrivals", col = brewer.pal(5, "Set2"), xlab = "", ylab = "Arrivals",          las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)

boxplot(df_selected$crime_rate                  ~ df_selected$region, main = "Crime Rate",                    col = brewer.pal(5, "Set2"), xlab = "", ylab = "per 100,000 pop.", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)

boxplot(df_selected$pop_density                 ~ df_selected$region, main = "Population Density",            col = brewer.pal(5, "Set2"), xlab = "", ylab = "pop./km²",         las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)

par(mfrow = c(1, 1))

7 Choropleth Maps

mx_state_map <- st_read('/Users/mariadelbv/Desktop/TEC/8 SEMESTRE/Planeacion Estrategica basada en Analitica Prescriptiva/mx_maps/mx_states/mexlatlong.shp', quiet = TRUE)

names(mx_state_map)

##  [1] "OBJECTID"   "FIPS_ADMIN" "GMI_ADMIN"  "ADMIN_NAME" "FIPS_CNTRY"
##  [6] "GMI_CNTRY"  "CNTRY_NAME" "POP_ADMIN"  "TYPE_ENG"   "TYPE_LOC"  
## [11] "SQKM"       "SQMI"       "COLOR_MAP"  "Shape_Leng" "Shape_Area"
## [16] "OBJECTID_1" "OBJECTID_2" "longitude"  "latitude"   "geometry"

mx_map <- mx_state_map %>%
  inner_join(df_selected, by = c("OBJECTID" = "state_id"))

cat("Matched states:", nrow(mx_map), "\n")

## Matched states: 32

cat("Unmatched states:", nrow(mx_state_map) - nrow(mx_map), "\n")

## Unmatched states: 0

7.1 Choropleth Maps

tmap_options(component.autoscale = FALSE)

tourism  <- tm_shape(mx_map) + tm_polygons(col = "tourism_activity",palette = "brewer.blues",    style = "quantile", n = 5, title = "Index",    colorNA = "grey80") + tm_layout(main.title = "Tourism Activity",legend.outside = TRUE, legend.outside.position = "bottom")

##

## ── tmap v3 code detected ───────────────────────────────────────────────────────

## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'palette' (rename to 'values'),
##   'colorNA' (rename to 'value.na') to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`

nacional <- tm_shape(mx_map) + tm_polygons(col = "llegada_turistas_nacionales",  palette = "brewer.bu_gn",    style = "quantile", n = 5, title = "Arrivals", colorNA = "grey80") + tm_layout(main.title = "National Tourist Arrivals",legend.outside = TRUE, legend.outside.position = "bottom")

## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`

extran   <- tm_shape(mx_map) + tm_polygons(col = "llegada_turistas_extranjeros", palette = "brewer.yl_or_rd", style = "quantile", n = 5, title = "Arrivals", colorNA = "grey80") + tm_layout(main.title = "International Tourist Arrivals", legend.outside = TRUE, legend.outside.position = "bottom")

## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`

crime    <- tm_shape(mx_map) + tm_polygons(col = "crime_rate",palette = "brewer.reds",     style = "quantile", n = 5, title = "per 100k", colorNA = "grey80") + tm_layout(main.title = "Crime Rate",legend.outside = TRUE, legend.outside.position = "bottom")

## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`

pop      <- tm_shape(mx_map) + tm_polygons(col = "pop_density", palette = "brewer.oranges",  style = "quantile", n = 5, title = "pop/km²",  colorNA = "grey80") + tm_layout(main.title = "Population Density",legend.outside = TRUE, legend.outside.position = "bottom")

## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`

tmap_arrange(tourism, nacional, extran, crime, pop, ncol = 2)

7.2 Choropleth Map Interpretation

Potential presence of clustering:

Potential presence of outliers:

Regional disparities:

Do the selected variables exhibit a spatial pattern or a random distribution across Mexico’s states?

8 Contiguity-Based Spatial Weight Matrix

queen_nb <- poly2nb(mx_map, queen = TRUE)
queen_w  <- nb2listw(queen_nb, style = "W", zero.policy = TRUE)
summary(queen_nb)

## Neighbour list object:
## Number of regions: 32 
## Number of nonzero links: 138 
## Percentage nonzero weights: 13.47656 
## Average number of links: 4.3125 
## Link number distribution:
## 
## 1 2 3 4 5 6 7 8 9 
## 1 6 6 6 5 2 3 2 1 
## 1 least connected region:
## 31 with 1 link
## 1 most connected region:
## 8 with 9 links

9 Spatial Weight Matrix Interpretation

What does “neighbors” mean in this context?

mx_sp  <- as(mx_map, "Spatial")
coords <- coordinates(mx_sp)

plot(mx_sp, border = "blue", axes = FALSE, las = 1,
     main = "Mexico's States – Queen Contiguity Spatial Weight Matrix")
plot(mx_sp, col = "grey", border = grey(0.9), axes = TRUE, add = TRUE)
plot(queen_w, coords = coords, pch = 19, cex = 0.1, col = "red", add = TRUE)

10 Global Moran

moran.test(mx_map$tourism_activity,            queen_w, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  mx_map$tourism_activity  
## weights: queen_w    
## 
## Moran I statistic standard deviate = 0.38237, p-value = 0.3511
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       0.006512234      -0.032258065       0.010281098

moran.test(mx_map$llegada_turistas_nacionales,  queen_w, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  mx_map$llegada_turistas_nacionales  
## weights: queen_w    
## 
## Moran I statistic standard deviate = -1.1838, p-value = 0.8818
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##       -0.17362616       -0.03225806        0.01426119

moran.test(mx_map$llegada_turistas_extranjeros, queen_w, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  mx_map$llegada_turistas_extranjeros  
## weights: queen_w    
## 
## Moran I statistic standard deviate = 0.30324, p-value = 0.3809
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      -0.016313329      -0.032258065       0.002764844

moran.test(mx_map$crime_rate,                   queen_w, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  mx_map$crime_rate  
## weights: queen_w    
## 
## Moran I statistic standard deviate = 0.44052, p-value = 0.3298
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.02055402       -0.03225806        0.01437244

moran.test(mx_map$pop_density,                  queen_w, zero.policy = TRUE)

## 
##  Moran I test under randomisation
## 
## data:  mx_map$pop_density  
## weights: queen_w    
## 
## Moran I statistic standard deviate = 4.3633, p-value = 0.000006405
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0989846831     -0.0322580645      0.0009047252

11 Global Moran’s I Interpretation

Is there positive or negative spatial autocorrelation across the dataset?

Do the spatial autocorrelation results support the spatial patterns displayed in the choropleth maps?

12 Hypotheses

H1:

H2:

H3:

H4:

H5:

13 ESDA Main Insights

14 R Script Interpretation

Descriptive & Dispersion Statistics:

Histograms:

Boxplots:

Choropleth Maps:

Spatial Weight Matrix:

Global Moran’s I: