Exploratory Spatial Data Analysis (ESDA) is the process of analyzing data that has a geographic component to discover spatial patterns, clusters, and anomalies. It uses maps, spatial statistics, and visualizations to understand how a variable behaves across space.
| EDA | ESDA | |
|---|---|---|
| Assumption | Observations are independent | Location matters |
| Focus | Statistical patterns | Spatial patterns |
| Output | Charts and summary stats | Maps and spatial statistics |
| Question | What is the distribution? | Where are the patterns? |
Spatial autocorrelation measures whether a variable is correlated with itself across space. It is basically do nearby locations tend to have similar values (positive autocorrelation) or different values (negative autocorrelation)?
Why does it matter in Business Analytics? Many business variables like sales, demand, and prices tend to cluster geographically. Ignoring spatial dependence can lead to biased models and wrong conclusions.
| Global | Local | |
|---|---|---|
| Scope | Entire study area | Each individual location |
| Question | Does clustering exist? | Where does clustering exist? |
| Output | Single statistic (Moran’s I) | Map of hot spots, cold spots, outliers |
| Use case | Quick overall diagnosis | Actionable zone-level insights |
Descriptive: ESDA adds a spatial layer to exploration, allowing analysts to map distributions and spot clusters or outliers that standard statistics would miss.
Predictive: ESDA identifies spatial dependencies early, informing whether a standard or spatial regression model is more appropriate — leading to more accurate predictions.
Prescriptive: ESDA grounds recommendations in spatial patterns. If demand clusters in specific zones, resources like stores, logistics, and campaigns can be allocated more efficiently.
library(readxl)
library(dplyr)
library(tidyr)
library(knitr)
library(kableExtra)
library(ggplot2)
library(sf)
library(sp)
library(spdep)
library(RColorBrewer)
library(viridis)
library(tmap)
library(tigris)df <- read_excel('/Users/mariadelbv/Downloads/inegi_mx_state_tourism (1).xlsx', sheet = "panel_data")## New names:
## • `region` -> `region...26`
## • `region` -> `region...27`
glossary <- read_excel('/Users/mariadelbv/Downloads/inegi_mx_state_tourism (1).xlsx', sheet = "glossary")
names(df)[names(df) == "region...26"] <- "region"
names(df)[names(df) == "region...27"] <- "region_num"
df$restaurants <- as.numeric(df$restaurants)last_year <- max(df$year, na.rm = TRUE)
df_last <- df %>%
filter(year == last_year)
cat("Selected year:", last_year, "\n")## Selected year: 2022
## Number of observations: 32
## States: 32
glossary %>%
filter(!is.na(Description)) %>%
kable(col.names = c("Variable", "Description")) %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE)| Variable | Description |
|---|---|
| tourism_activity | Travel and tourism’s direct contribution to GDP. Base Year is 2018 = 100. |
| crime_rate | Crime rate per 100,000 state’s population. |
| unemployment | Percentaje of unemployed population. |
| employment | Percentaje of employed population. |
| business_activity | Index of economic activity weighted by the state’s distance to nearest U.S. port of entry. |
| real_wage | Real wage using 2018 INPC = 100. MXN Pesos. |
| pop_density | Population per state’s land area km2. |
| good_governance | Ratio between state’s public investment and its public debt. |
| ratio_public_investment | Ratio between state’s public investment and its gross domestic product (GDP). |
| exchange_rate | MXN pesos per 1 USD |
| inpc | National Price Consumer Index. 2018 = 100 |
| college_education | Percentage of population with at least high school education. |
vars_of_interest <- c("tourism_activity",
"llegada_turistas_nacionales", "llegada_turistas_extranjeros",
"crime_rate", "pop_density")
df_selected <- df_last %>%
select(state, state_id, year, region, all_of(vars_of_interest))
str(df_selected)## tibble [32 × 9] (S3: tbl_df/tbl/data.frame)
## $ state : chr [1:32] "Aguascalientes" "Baja California" "Baja California Sur" "Campeche" ...
## $ state_id : num [1:32] 1057 2304 2327 1086 1182 ...
## $ year : num [1:32] 2022 2022 2022 2022 2022 ...
## $ region : chr [1:32] "Bajio" "Norte" "Occidente" "Sur" ...
## $ tourism_activity : num [1:32] 15049 69966 28951 12774 36945 ...
## $ llegada_turistas_nacionales : num [1:32] 844346 3109495 1399476 1096087 3325811 ...
## $ llegada_turistas_extranjeros: num [1:32] 91075 1780664 2526340 230739 532475 ...
## $ crime_rate : num [1:32] 5.63 75.87 9.68 9.88 8.73 ...
## $ pop_density : num [1:32] 263.6 54.2 11.4 18.2 76.9 ...
descriptive_stats <- df_selected %>%
group_by(region) %>%
summarise(across(all_of(vars_of_interest),
list(
Mean = ~round(mean(.x, na.rm = TRUE), 2),
Median = ~round(median(.x, na.rm = TRUE), 2),
Min = ~round(min(.x, na.rm = TRUE), 2),
Max = ~round(max(.x, na.rm = TRUE), 2)
),
.names = "{.col}__{.fn}"))## tourism_activity llegada_turistas_nacionales llegada_turistas_extranjeros
## Min. : 8618 Min. : 282384 Min. : 21207
## 1st Qu.: 22066 1st Qu.:1644743 1st Qu.: 162870
## Median : 34023 Median :2407776 Median : 237114
## Mean : 60344 Mean :3062893 Mean : 871694
## 3rd Qu.: 62411 3rd Qu.:4158283 3rd Qu.: 526229
## Max. :365959 Max. :9296032 Max. :12589248
## crime_rate pop_density
## Min. : 1.984 Min. : 11.36
## 1st Qu.: 9.585 1st Qu.: 42.82
## Median : 17.346 Median : 68.54
## Mean : 28.636 Mean : 315.08
## 3rd Qu.: 40.571 3rd Qu.: 163.03
## Max. :111.064 Max. :6211.45
Standard deviation for each selected variable, grouped by region.
df_selected %>%
group_by(region) %>%
summarise(across(all_of(vars_of_interest), ~round(sd(.x, na.rm = TRUE), 2))) ## # A tibble: 5 × 6
## region tourism_activity llegada_turistas_nacionales llegada_turistas_extr…¹
## <chr> <dbl> <dbl> <dbl>
## 1 Bajio 54577. 1398693. 65694.
## 2 Centro 139415. 2328171. 863000.
## 3 Norte 29622. 1257864. 602360.
## 4 Occidente 45035. 2737399. 934755.
## 5 Sur 46485. 2630232. 4341189.
## # ℹ abbreviated name: ¹llegada_turistas_extranjeros
## # ℹ 2 more variables: crime_rate <dbl>, pop_density <dbl>
options(scipen = 999)
par(mfrow = c(2, 3))
hist(df_selected$tourism_activity,main = "Tourism Activity",col = "lightblue", prob = TRUE, xlab = "GDP Contribution (2018=100)", xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))
hist(df_selected$llegada_turistas_nacionales, main = "National Tourist Arrivals",col = "lightblue", prob = TRUE, xlab = "Number of Arrivals", xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))
hist(df_selected$llegada_turistas_extranjeros, main = "International Tourist Arrivals", col = "lightblue", prob = TRUE, xlab = "Number of Arrivals", xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))
hist(df_selected$crime_rate,main = "Crime Rate",col = "lightblue", prob = TRUE, xlab = "Rate per 100,000 pop.",xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))
hist(df_selected$pop_density,main = "Population Density",col = "lightblue", prob = TRUE, xlab = "pop./km²",xaxt = "n")
axis(1, at = axTicks(1), labels = format(axTicks(1), big.mark = ","))
par(mfrow = c(1, 1))par(mfrow = c(2, 3))
hist(log(df_selected$tourism_activity), main = "Tourism Activity (Log)", col = "lightblue", prob = TRUE, xlab = "log(GDP Contribution)")
hist(log(df_selected$llegada_turistas_nacionales), main = "National Tourist Arrivals (Log)", col = "lightblue", prob = TRUE, xlab = "log(Arrivals)")
hist(log(df_selected$llegada_turistas_extranjeros), main = "International Tourist Arrivals (Log)", col = "lightblue", prob = TRUE, xlab = "log(Arrivals)")
hist(log(df_selected$crime_rate), main = "Crime Rate (Log)", col = "lightblue", prob = TRUE, xlab = "log(Rate per 100,000 pop.)")
hist(log(df_selected$pop_density), main = "Population Density (Log)", col = "lightblue", prob = TRUE, xlab = "log(pop./km²)")
par(mfrow = c(1, 1))Which variables exhibit the presence of outliers?
Which variables display a right / left skew?
Which variables require data transformation?
par(mfrow = c(2, 3), mar = c(7, 6, 3, 1), mgp = c(4, 1, 0))
boxplot(df_selected$tourism_activity ~ df_selected$region, main = "Tourism Activity", col = brewer.pal(5, "Set2"), xlab = "", ylab = "GDP Contribution", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)
boxplot(df_selected$llegada_turistas_nacionales ~ df_selected$region, main = "National Tourist Arrivals", col = brewer.pal(5, "Set2"), xlab = "", ylab = "Arrivals", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)
boxplot(df_selected$llegada_turistas_extranjeros ~ df_selected$region, main = "International Tourist Arrivals", col = brewer.pal(5, "Set2"), xlab = "", ylab = "Arrivals", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)
boxplot(df_selected$crime_rate ~ df_selected$region, main = "Crime Rate", col = brewer.pal(5, "Set2"), xlab = "", ylab = "per 100,000 pop.", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)
boxplot(df_selected$pop_density ~ df_selected$region, main = "Population Density", col = brewer.pal(5, "Set2"), xlab = "", ylab = "pop./km²", las = 2, cex.axis = 0.8, yaxt = "n")
axis(2, at = axTicks(2), labels = format(axTicks(2), big.mark = ","), las = 1, cex.axis = 0.7)
par(mfrow = c(1, 1))mx_state_map <- st_read('/Users/mariadelbv/Desktop/TEC/8 SEMESTRE/Planeacion Estrategica basada en Analitica Prescriptiva/mx_maps/mx_states/mexlatlong.shp', quiet = TRUE)
names(mx_state_map)## [1] "OBJECTID" "FIPS_ADMIN" "GMI_ADMIN" "ADMIN_NAME" "FIPS_CNTRY"
## [6] "GMI_CNTRY" "CNTRY_NAME" "POP_ADMIN" "TYPE_ENG" "TYPE_LOC"
## [11] "SQKM" "SQMI" "COLOR_MAP" "Shape_Leng" "Shape_Area"
## [16] "OBJECTID_1" "OBJECTID_2" "longitude" "latitude" "geometry"
mx_map <- mx_state_map %>%
inner_join(df_selected, by = c("OBJECTID" = "state_id"))
cat("Matched states:", nrow(mx_map), "\n")## Matched states: 32
## Unmatched states: 0
tmap_options(component.autoscale = FALSE)
tourism <- tm_shape(mx_map) + tm_polygons(col = "tourism_activity",palette = "brewer.blues", style = "quantile", n = 5, title = "Index", colorNA = "grey80") + tm_layout(main.title = "Tourism Activity",legend.outside = TRUE, legend.outside.position = "bottom")##
## ── tmap v3 code detected ───────────────────────────────────────────────────────
## [v3->v4] `tm_polygons()`: instead of `style = "quantile"`, use fill.scale =
## `tm_scale_intervals()`.
## ℹ Migrate the argument(s) 'style', 'n', 'palette' (rename to 'values'),
## 'colorNA' (rename to 'value.na') to 'tm_scale_intervals(<HERE>)'
## For small multiples, specify a 'tm_scale_' for each multiple, and put them in a
## list: 'fill'.scale = list(<scale1>, <scale2>, ...)'
## [v3->v4] `tm_polygons()`: use 'fill' for the fill color of polygons/symbols
## (instead of 'col'), and 'col' for the outlines (instead of 'border.col').
## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
nacional <- tm_shape(mx_map) + tm_polygons(col = "llegada_turistas_nacionales", palette = "brewer.bu_gn", style = "quantile", n = 5, title = "Arrivals", colorNA = "grey80") + tm_layout(main.title = "National Tourist Arrivals",legend.outside = TRUE, legend.outside.position = "bottom")## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
extran <- tm_shape(mx_map) + tm_polygons(col = "llegada_turistas_extranjeros", palette = "brewer.yl_or_rd", style = "quantile", n = 5, title = "Arrivals", colorNA = "grey80") + tm_layout(main.title = "International Tourist Arrivals", legend.outside = TRUE, legend.outside.position = "bottom")## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
crime <- tm_shape(mx_map) + tm_polygons(col = "crime_rate",palette = "brewer.reds", style = "quantile", n = 5, title = "per 100k", colorNA = "grey80") + tm_layout(main.title = "Crime Rate",legend.outside = TRUE, legend.outside.position = "bottom")## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
pop <- tm_shape(mx_map) + tm_polygons(col = "pop_density", palette = "brewer.oranges", style = "quantile", n = 5, title = "pop/km²", colorNA = "grey80") + tm_layout(main.title = "Population Density",legend.outside = TRUE, legend.outside.position = "bottom")## [v3->v4] `tm_polygons()`: migrate the argument(s) related to the legend of the
## visual variable `fill` namely 'title' to 'fill.legend = tm_legend(<HERE>)'
## [v3->v4] `tm_layout()`: use `tm_title()` instead of `tm_layout(main.title = )`
Potential presence of clustering:
Potential presence of outliers:
Regional disparities:
Do the selected variables exhibit a spatial pattern or a random distribution across Mexico’s states?
queen_nb <- poly2nb(mx_map, queen = TRUE)
queen_w <- nb2listw(queen_nb, style = "W", zero.policy = TRUE)
summary(queen_nb)## Neighbour list object:
## Number of regions: 32
## Number of nonzero links: 138
## Percentage nonzero weights: 13.47656
## Average number of links: 4.3125
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9
## 1 6 6 6 5 2 3 2 1
## 1 least connected region:
## 31 with 1 link
## 1 most connected region:
## 8 with 9 links
What does “neighbors” mean in this context?
mx_sp <- as(mx_map, "Spatial")
coords <- coordinates(mx_sp)
plot(mx_sp, border = "blue", axes = FALSE, las = 1,
main = "Mexico's States – Queen Contiguity Spatial Weight Matrix")
plot(mx_sp, col = "grey", border = grey(0.9), axes = TRUE, add = TRUE)
plot(queen_w, coords = coords, pch = 19, cex = 0.1, col = "red", add = TRUE)##
## Moran I test under randomisation
##
## data: mx_map$tourism_activity
## weights: queen_w
##
## Moran I statistic standard deviate = 0.38237, p-value = 0.3511
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.006512234 -0.032258065 0.010281098
##
## Moran I test under randomisation
##
## data: mx_map$llegada_turistas_nacionales
## weights: queen_w
##
## Moran I statistic standard deviate = -1.1838, p-value = 0.8818
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.17362616 -0.03225806 0.01426119
##
## Moran I test under randomisation
##
## data: mx_map$llegada_turistas_extranjeros
## weights: queen_w
##
## Moran I statistic standard deviate = 0.30324, p-value = 0.3809
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## -0.016313329 -0.032258065 0.002764844
##
## Moran I test under randomisation
##
## data: mx_map$crime_rate
## weights: queen_w
##
## Moran I statistic standard deviate = 0.44052, p-value = 0.3298
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.02055402 -0.03225806 0.01437244
##
## Moran I test under randomisation
##
## data: mx_map$pop_density
## weights: queen_w
##
## Moran I statistic standard deviate = 4.3633, p-value = 0.000006405
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0989846831 -0.0322580645 0.0009047252
Is there positive or negative spatial autocorrelation across the dataset?
Do the spatial autocorrelation results support the spatial patterns displayed in the choropleth maps?
H1:
H2:
H3:
H4:
H5:
Descriptive & Dispersion Statistics:
Histograms:
Boxplots:
Choropleth Maps:
Spatial Weight Matrix:
Global Moran’s I: