Conceptual Questions
What is ESDA? Compare EDA vs ESDA
ESDA is an extension of EDA that incorporates the spatial dimension
into the analysis. While EDA focuses on distributions, variability, and
relationships between variables, ESDA also evaluates how location
influences those patterns. The key difference is that EDA assumes
observations are independent, whereas ESDA recognizes that nearby
locations can influence each other. In this case, tourism in one state
may be related to neighboring states.
What is spatial autocorrelation? Why is it relevant in business analytics?
Spatial autocorrelation measures whether nearby locations have
similar or different values. Positive autocorrelation means similar
values cluster together, while negative autocorrelation means nearby
areas are very different. It is relevant because many business variables
(like tourism or income) are geographically clustered. Ignoring this can
lead to incorrect conclusions and poor decisions.
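The definition above can be made concrete with a small hand calculation. The sketch below is only an illustration: the six-location chain, the values, and the helper name `morans_i` are hypothetical and are not part of the tourism dataset. It computes Moran's I directly from its textbook formula.

```r
# Moran's I computed by hand for a toy chain of six locations (hypothetical data).
morans_i <- function(x, W) {
  z  <- x - mean(x)                      # deviations from the mean
  n  <- length(x)
  s0 <- sum(W)                           # sum of all weights
  (n / s0) * sum(W * outer(z, z)) / sum(z^2)
}

# Binary adjacency: location i neighbors i - 1 and i + 1
n <- 6
W <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  W[i, i + 1] <- 1
  W[i + 1, i] <- 1
}

clustered   <- c(1, 2, 3, 4, 5, 6)       # similar values sit next to each other
alternating <- c(1, 6, 1, 6, 1, 6)       # neighbors are as different as possible

i_clustered   <- morans_i(clustered, W)
i_alternating <- morans_i(alternating, W)
print(c(clustered = i_clustered, alternating = i_alternating))
```

The clustered series yields a positive statistic (0.6) and the alternating series a negative one (-1), which matches the verbal definition of positive and negative autocorrelation.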
Main differences between global and local spatial autocorrelation
- Scope: Global measures give one value for the entire dataset, while
local measures analyze each location individually.
- Interpretation: Global shows whether a general pattern exists; local
shows where clusters or outliers are.
- Usefulness: Global confirms spatial dependence, while local helps
identify specific areas for action.
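To illustrate the contrast between one global value and one value per location, here is a minimal base R sketch. The data and the chain of neighbors are hypothetical. In practice `spdep::localmoran()` would normally be used, and this hand version is only meant to show the idea.

```r
# Toy example (hypothetical values): local Moran's I_i next to the global statistic.
x <- c(10, 12, 11, 30, 9, 10)
n <- length(x)

# Chain adjacency, then row-standardize so each row sums to 1
A <- matrix(0, n, n)
for (i in seq_len(n - 1)) {
  A[i, i + 1] <- 1
  A[i + 1, i] <- 1
}
W <- A / rowSums(A)

z   <- x - mean(x)
lag <- as.vector(W %*% z)                # average deviation among neighbors
m2  <- sum(z^2) / n

local_i  <- z * lag / m2                 # one value per location
global_i <- sum(z * lag) / sum(z^2)      # single summary (row-standardized W)

print(round(local_i, 3))
print(round(global_i, 3))
```

With row-standardized weights the global statistic equals the mean of the local values, so the local version can be read as a decomposition of the global one. Location 4 (a high value surrounded by low values) gets a negative local statistic, which flags it as a spatial outlier.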
How can ESDA improve descriptive, predictive, and prescriptive analytics?
ESDA improves descriptive analytics by adding geographic context to
the data. It improves predictive analytics by showing whether
neighboring areas should be included in models. It improves prescriptive
analytics by helping decision-makers target specific regions more
effectively.
Descriptive Analysis
Descriptive Statistics
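# Mean, median, min, and max for each variable (NAs ignored)
desc_stats <- df_latest %>%
  summarise(across(c(tourism_activity, crime_rate, business_activity,
                     college_education, pop_density, real_wage),
                   list(Mean = ~ mean(.x, na.rm = TRUE),
                        Median = ~ median(.x, na.rm = TRUE),
                        Min = ~ min(.x, na.rm = TRUE),
                        Max = ~ max(.x, na.rm = TRUE))))
round(as.data.frame(t(desc_stats)), 2)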
## V1
## tourism_activity_Mean 60344.48
## tourism_activity_Median 34023.03
## tourism_activity_Min 8618.44
## tourism_activity_Max 365959.24
## crime_rate_Mean 28.64
## crime_rate_Median 17.35
## crime_rate_Min 1.98
## crime_rate_Max 111.06
## business_activity_Mean 0.02
## business_activity_Median 0.01
## business_activity_Min 0.00
## business_activity_Max 0.41
## college_education_Mean 0.28
## college_education_Median 0.28
## college_education_Min 0.19
## college_education_Max 0.44
## pop_density_Mean 315.08
## pop_density_Median 68.54
## pop_density_Min 11.36
## pop_density_Max 6211.45
## real_wage_Mean 348.91
## real_wage_Median 341.59
## real_wage_Min 282.55
## real_wage_Max 481.69
The descriptive statistics reveal strong regional disparities across
Mexican states. Tourism activity is highly unequal,
with a mean of 60,344.48 and a much lower median of 34,023.03,
indicating that a few states concentrate most of the activity.
Population density shows a similar pattern, with a large gap between the
mean (315.08) and median (68.54), suggesting strong right skewness.
Crime rate and business activity also vary considerably across states,
while college education and real wage appear more evenly distributed.
Overall, these results suggest that tourism performance is not uniform
and may be influenced by economic, social, and demographic differences
across regions.
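The mean-versus-median comparison used above can be complemented with a numeric skewness measure. The following is a minimal sketch with hypothetical vectors (not the state data) and a hand-written `skewness` helper based on the moment definition.

```r
# Moment-based sample skewness; positive values indicate a long right tail.
skewness <- function(x) {
  x <- x[!is.na(x)]
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

# Hypothetical values: most units are small, one is very large
right_skewed <- c(10, 12, 15, 18, 20, 25, 400)
symmetric    <- c(10, 12, 14, 16, 18, 20, 22)

print(c(mean = mean(right_skewed), median = median(right_skewed)))
print(c(right_skewed = skewness(right_skewed), symmetric = skewness(symmetric)))
```

A mean well above the median together with a clearly positive skewness points to a few large values dominating the distribution, which is the pattern described for tourism activity and population density.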
Dispersion Statistics
# Standard deviation of each variable (NAs ignored)
disp_stats <- df_latest %>%
  summarise(across(c(tourism_activity, crime_rate, business_activity,
                     college_education, pop_density, real_wage),
                   ~ sd(.x, na.rm = TRUE)))
round(as.data.frame(t(disp_stats)), 2)
## V1
## tourism_activity 71485.51
## crime_rate 27.55
## business_activity 0.07
## college_education 0.05
## pop_density 1087.58
## real_wage 45.05
The dispersion statistics show that tourism activity (SD 71,485.51) and
population density (SD 1,087.58) have the largest standard deviations,
in both cases larger than the corresponding mean, indicating strong
inequality across Mexican states. Because the variables are measured in
different units, raw standard deviations are not directly comparable.
Relative to their means, population density (SD about 3.5 times the mean
of 315.08) and business activity (SD 0.07 against a mean of 0.02) are
the most dispersed, while crime rate (SD 27.55, mean 28.64) shows
moderate dispersion. College education (SD 0.05) and real wage (SD
45.05) vary little relative to their means, suggesting more stable
patterns across regions. Overall, these results confirm that certain key
variables are unevenly distributed and may contribute to regional
disparities in tourism performance.
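Since the variables are on very different scales, a unit-free measure such as the coefficient of variation (SD divided by the mean) can make the comparison of dispersion more direct. The sketch below uses a small hypothetical data frame as a stand-in for `df_latest`. The values are invented for illustration.

```r
# Coefficient of variation (sd / mean) makes dispersion comparable across units.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Hypothetical data frame standing in for df_latest
toy <- data.frame(
  tourism_activity  = c(9000, 20000, 34000, 60000, 360000),
  college_education = c(0.19, 0.25, 0.28, 0.31, 0.44)
)

cv_table <- sapply(toy, cv)
print(round(cv_table, 2))
```

Applying the same helper to the six variables of the real dataset would show which ones are most dispersed relative to their own level, independent of units.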
Histograms
par(mfrow = c(2,3))
hist(df_latest$tourism_activity,
main = "Tourism Activity",
col = "lightblue")
hist(df_latest$crime_rate,
main = "Crime Rate",
col = "lightgreen")
hist(df_latest$college_education,
main = "College Education",
col = "lightyellow")
hist(df_latest$real_wage,
main = "Real Wage",
col = "lightcoral")

The histograms reveal important differences in the distribution of
variables across Mexican states. Tourism activity is highly
right-skewed, indicating that a small number of states concentrate very
high levels of tourism, while most states show relatively low activity.
Crime rate also presents variability and slight skewness, suggesting
uneven security conditions across regions. In contrast, college
education and real wage exhibit more symmetric distributions, reflecting
more stable and evenly distributed socioeconomic conditions.
Spatial Analysis
Boxplots by Region
names(df_latest)[names(df_latest) == "region...26"] <- "region"
boxplot(split(df_latest$crime_rate, df_latest$region),
main = "Crime Rate by Region",
col = "lightgreen", las = 2)

boxplot(split(df_latest$college_education, df_latest$region),
main = "College Education by Region",
col = "lightyellow", las = 2)

boxplot(split(df_latest$real_wage, df_latest$region),
main = "Real Wage by Region",
col = "lightcoral", las = 2)

The boxplots reveal clear regional disparities across crime rates,
education, and real wages in Mexico. The Norte region stands out with
higher median wages and crime rates, reflecting stronger economic
activity but also greater security challenges. In contrast, the Sur
region shows lower wages and education levels, indicating weaker
socioeconomic conditions. Bajío and Occidente display higher
variability across variables, suggesting heterogeneity within those
regions. Meanwhile, Centro remains relatively stable with moderate
levels across all indicators. Overall, these patterns highlight that
economic development, human capital, and security conditions are
unevenly distributed across regions and are likely key factors
influencing regional differences in tourism activity.
Choropleth Maps
archivo <- "/Users/fernandaperez/Downloads/mx_states/mexlatlong.shp"
mx_states <- st_read(archivo, quiet = TRUE)
clean_names <- function(x) {
x %>%
iconv(to = "ASCII//TRANSLIT") %>%
toupper() %>%
stringr::str_trim()
}
mx_states <- mx_states %>%
mutate(state_join = clean_names(ADMIN_NAME))
df_latest <- df_latest %>%
mutate(state_join = clean_names(state))
mx_map <- mx_states %>%
left_join(df_latest, by = "state_join")
sum(is.na(mx_map$tourism_activity))
## [1] 1
map_vars <- c("tourism_activity", "crime_rate", "business_activity",
"college_education", "pop_density", "real_wage")
for (v in map_vars) {
print(
ggplot(mx_map) +
geom_sf(aes(fill = .data[[v]]), color = "white", linewidth = 0.2) +
scale_fill_viridis_c(option = "plasma", na.value = "grey90") +
theme_minimal() +
labs(title = paste("Choropleth Map:", v), fill = v)
)
}






The choropleth maps suggest that several variables exhibit clear
spatial patterns across Mexican states. Tourism activity appears to be
concentrated in a limited number of regions, indicating the presence of
geographic clustering rather than a random distribution. Similarly,
business activity and population density show strong regional
concentration, particularly in more urbanized and economically developed
areas. In contrast, variables such as real wage and college education
display a more moderate spatial variation, although some regional
disparities are still visible. Overall, the visual patterns indicate
that neighboring states often share similar characteristics, which
suggests the potential presence of spatial dependence. This supports the
idea that geographic location plays an important role in explaining
tourism performance.
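The join used for these maps left one polygon without tourism data (the NA count of 1). A quick way to find which key failed to match is to compare the two sets of join keys. The sketch below uses hypothetical key vectors. The actual unmatched state in the source data is not known here, and the names shown are only an example of how a mismatch might look.

```r
# Hypothetical, already-normalized join keys (stand-ins for
# mx_states$state_join and df_latest$state_join)
shape_keys <- c("AGUASCALIENTES", "DISTRITO FEDERAL", "NUEVO LEON")
data_keys  <- c("AGUASCALIENTES", "CIUDAD DE MEXICO", "NUEVO LEON")

unmatched_in_shape <- setdiff(shape_keys, data_keys)  # polygons with no data row
unmatched_in_data  <- setdiff(data_keys, shape_keys)  # data rows with no polygon

print(unmatched_in_shape)
print(unmatched_in_data)
```

If both sets are non-empty, the two leftover names usually refer to the same place under different spellings and can be recoded before joining. If only the shapefile side has a leftover, the polygon simply has no data and stays grey on the map.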
Spatial Weights Matrix
# Build neighbors using Queen contiguity
nb_queen <- poly2nb(mx_map, queen = TRUE)
# Row-standardized weights
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)
# Summary of neighbors
summary(nb_queen)
## Neighbour list object:
## Number of regions: 32
## Number of nonzero links: 138
## Percentage nonzero weights: 13.47656
## Average number of links: 4.3125
## Link number distribution:
##
## 1 2 3 4 5 6 7 8 9
## 1 6 6 6 5 2 3 2 1
## 1 least connected region:
## 31 with 1 link
## 1 most connected region:
## 8 with 9 links
# Display adjacency matrix
W_mat <- nb2mat(nb_queen, style = "W", zero.policy = TRUE)
W_mat[1:10, 1:10] # first 10 x 10 block
## 1 2 3 4 5 6 7 8 9
## 1 0.00 0.2500000 0.0000000 0.0000000 0.25 0.250 0.0000000 0.0000000 0.0000000
## 2 0.20 0.0000000 0.2000000 0.0000000 0.00 0.200 0.2000000 0.2000000 0.0000000
## 3 0.00 0.2500000 0.0000000 0.2500000 0.00 0.000 0.2500000 0.2500000 0.0000000
## 4 0.00 0.0000000 0.3333333 0.0000000 0.00 0.000 0.0000000 0.3333333 0.0000000
## 5 0.25 0.0000000 0.0000000 0.0000000 0.00 0.250 0.0000000 0.0000000 0.0000000
## 6 0.20 0.2000000 0.0000000 0.0000000 0.20 0.000 0.2000000 0.0000000 0.0000000
## 7 0.00 0.1250000 0.1250000 0.0000000 0.00 0.125 0.0000000 0.1250000 0.1250000
## 8 0.00 0.1111111 0.1111111 0.1111111 0.00 0.000 0.1111111 0.0000000 0.1111111
## 9 0.00 0.0000000 0.0000000 0.0000000 0.00 0.000 0.1428571 0.1428571 0.0000000
## 10 0.00 0.0000000 0.0000000 0.0000000 0.00 0.000 0.5000000 0.0000000 0.5000000
## 10
## 1 0.0000000
## 2 0.0000000
## 3 0.0000000
## 4 0.0000000
## 5 0.0000000
## 6 0.0000000
## 7 0.1250000
## 8 0.0000000
## 9 0.1428571
## 10 0.0000000
The contiguity-based spatial weight matrix defines neighboring states
based on shared borders or vertices using the Queen criterion. This
means that each state is connected to all adjacent states, capturing
potential geographic interactions. Row-standardizing the matrix makes
the weights in each row sum to one, so the combined influence of each
state's neighbors is an average, which allows consistent comparison
across states with different numbers of neighbors. In this context,
neighbors represent
geographically close states that may share economic conditions,
infrastructure, or tourism dynamics. This spatial structure is essential
for measuring spatial autocorrelation, as it formally defines how
observations are related to one another in space.
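Row-standardization can be reproduced by hand on a small example. The sketch below uses a hypothetical four-state adjacency matrix (not the Mexican states) and plain matrix operations rather than `spdep`, to show what `style = "W"` does conceptually and how the resulting weights turn a variable into a neighbor average.

```r
# Hypothetical 4-state adjacency (1 = shares a border or vertex)
A <- matrix(c(0, 1, 1, 0,
              1, 0, 1, 1,
              1, 1, 0, 0,
              0, 1, 0, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("s", 1:4), paste0("s", 1:4)))

W <- A / rowSums(A)                  # row-standardize: each row sums to 1
print(round(W, 3))

x <- c(s1 = 10, s2 = 20, s3 = 30, s4 = 40)
spatial_lag <- as.vector(W %*% x)    # neighbor average for each state
print(spatial_lag)
```

A state with two neighbors gives each a weight of 0.5 and a state with three gives each 1/3, mirroring the 0.25, 0.2, and 0.125 entries visible in the printed block of `W_mat`.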
Moran’s I
moran_results <- purrr::map_dfr(map_vars, function(v) {
x <- mx_map[[v]]
test <- moran.test(x, lw_queen, zero.policy = TRUE, na.action = na.exclude)
tibble(
variable = v,
moran_I = unname(test$estimate["Moran I statistic"]),
expectation = unname(test$estimate["Expectation"]),
variance = unname(test$estimate["Variance"]),
p_value = test$p.value
)
})
knitr::kable(moran_results, digits = 6)
| variable | moran_I | expectation | variance | p_value |
|:------------------|----------:|----------:|---------:|---------:|
| tourism_activity | -0.187264 | -0.033333 | 0.014683 | 0.898019 |
| crime_rate | 0.033518 | -0.033333 | 0.014732 | 0.290891 |
| business_activity | -0.050593 | -0.033333 | 0.000489 | 0.782404 |
| college_education | 0.268059 | -0.033333 | 0.015787 | 0.008226 |
| pop_density | 0.362128 | -0.033333 | 0.010219 | 0.000046 |
| real_wage | 0.150722 | -0.033333 | 0.015803 | 0.071578 |
# Moran scatterplot for tourism_activity without NA values
mx_map_tourism <- mx_map %>%
filter(!is.na(tourism_activity))
nb_tourism <- poly2nb(mx_map_tourism, queen = TRUE)
lw_tourism <- nb2listw(nb_tourism, style = "W", zero.policy = TRUE)
moran.plot(mx_map_tourism$tourism_activity, lw_tourism, zero.policy = TRUE,
main = "Moran Scatterplot: tourism_activity")

The Global Moran's I results are mixed rather than uniformly positive.
Population density (I = 0.362, p < 0.001) and college education
(I = 0.268, p = 0.008) show statistically significant positive spatial
autocorrelation, meaning that states with similar values tend to be
neighbors. Real wage is positive but only marginally significant
(I = 0.151, p = 0.072), and crime rate is close to the value expected
under spatial randomness (I = 0.034, p = 0.291). Tourism activity
(I = -0.187, p = 0.898) and business activity (I = -0.051, p = 0.782)
have negative estimates and provide no evidence of positive spatial
clustering. The expectation of -0.033333 equals -1/(n - 1) with n = 31,
which is consistent with the one polygon lacking data being excluded
from the test. Because the slope of the fitted line in a Moran
scatterplot corresponds to Moran's I, the scatterplot for tourism
activity should be read as showing a weak, statistically insignificant
relationship between a state's tourism and that of its neighbors, with
wide dispersion around the line. Overall, spatial dependence is
supported for population density and college education but not for
tourism activity itself, so the concentration visible in the tourism
choropleth map reflects a few isolated high-value states rather than
contiguous clusters.
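To show where the expectation and the p-value of a Moran test come from, the sketch below runs a permutation version of the test in base R. Everything in it is hypothetical: 31 units arranged on a ring, simulated values with no spatial structure, and a hand-written `morans_i` helper. `spdep::moran.mc()` offers a ready-made permutation test, and this version only illustrates the logic.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

# Hypothetical setup: 31 units on a ring, row-standardized weights
n <- 31
A <- matrix(0, n, n)
for (i in seq_len(n)) {
  j <- if (i == n) 1 else i + 1
  A[i, j] <- 1
  A[j, i] <- 1
}
W <- A / rowSums(A)

x <- rnorm(n)                          # values with no spatial structure
observed <- morans_i(x, W)

# Reference distribution: shuffle the values across locations
n_sim <- 999
simulated <- replicate(n_sim, morans_i(sample(x), W))

# One-sided pseudo p-value for positive autocorrelation
p_value <- (sum(simulated >= observed) + 1) / (n_sim + 1)

expected_i <- -1 / (n - 1)             # expectation under no autocorrelation
print(c(observed = observed, expected = expected_i, p_value = p_value))
```

With 31 units the expectation is -1/30, the same -0.033333 that appears in the results table. Under a one-sided test for positive autocorrelation, an observed value below the expectation produces a large p-value, which is the situation seen for tourism activity.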
Hypotheses
- H1: States with higher levels of business activity tend to exhibit
higher tourism activity.
- H2: States surrounded by neighbors with high tourism activity are
more likely to show high tourism activity themselves, suggesting spatial
spillover effects.
- H3: States with higher crime rates tend to present lower tourism
activity, especially when neighboring states also have unfavorable
security conditions.
- H4: Population density is positively associated with tourism
activity because urban areas concentrate services, transport
infrastructure, and attractions.
- H5: Spatial clusters of high tourism activity are likely to be
concentrated in states that also show stronger economic and social
conditions, such as higher real wages and educational attainment.
The proposed hypotheses are based on both statistical patterns and
spatial relationships observed in the data. These hypotheses incorporate
a spatial perspective by considering not only the characteristics of
each state, but also the influence of neighboring states. In particular,
the inclusion of spatial spillover effects suggests that tourism
performance may depend on regional dynamics rather than isolated
factors. For example, states surrounded by high-performing tourism
regions may benefit from shared infrastructure, connectivity, and
regional attractiveness. These hypotheses provide a foundation for
future predictive and spatial econometric modeling, where both local and
neighboring effects can be formally tested.
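As a first step toward testing such neighbor effects, one option is a regression that includes the spatial lag of an explanatory variable (often called an SLX specification). The sketch below is an assumption-laden illustration: the data are simulated, the ring of 30 units is hypothetical, and the coefficients 2 and 1.5 are chosen arbitrarily. A model with the spatial lag of tourism itself, as H2 implies, would need a dedicated estimator such as those in the `spatialreg` package rather than plain `lm()`.

```r
set.seed(1)  # arbitrary seed for reproducibility

# Hypothetical ring of 30 units with row-standardized weights
n <- 30
A <- matrix(0, n, n)
for (i in seq_len(n)) {
  j <- if (i == n) 1 else i + 1
  A[i, j] <- 1
  A[j, i] <- 1
}
W <- A / rowSums(A)

# Simulated data (not the tourism data): own effect 2, neighbor effect 1.5
business     <- rnorm(n)
lag_business <- as.vector(W %*% business)   # average business activity of neighbors
tourism      <- 1 + 2 * business + 1.5 * lag_business + rnorm(n, sd = 0.1)

fit <- lm(tourism ~ business + lag_business)
print(round(coef(fit), 2))
```

The coefficient on `lag_business` captures how a unit's outcome moves with conditions in neighboring units. With the real data, `lag.listw()` from `spdep` applied to `lw_queen` could build the lagged variables in the same spirit.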