1 Conceptual Questions

1.0.1 What is ESDA? Compare EDA vs ESDA

ESDA is an extension of EDA that incorporates the spatial dimension into the analysis. While EDA focuses on distributions, variability, and relationships between variables, ESDA also evaluates how location influences those patterns. The key difference is that EDA assumes observations are independent, whereas ESDA recognizes that nearby locations can influence each other. In this case, tourism in one state may be related to neighboring states.

1.0.2 What is spatial autocorrelation? Why is it relevant in business analytics?

Spatial autocorrelation measures whether values at nearby locations are more similar or more dissimilar than would be expected under a random spatial arrangement. Positive autocorrelation means similar values cluster together, while negative autocorrelation means neighboring locations tend to have dissimilar values. It is relevant because many business variables (like tourism or income) are geographically clustered. Ignoring this can lead to incorrect conclusions and poor decisions.
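To make the definition concrete, the following base-R sketch computes global Moran's I by hand for two toy series on four locations arranged in a line. The helper `moran_i()` and the toy values are illustrative assumptions and are not part of the tourism dataset; the analysis later in this report uses `spdep::moran.test()`.

```r
# Toy example: four locations on a line (1-2-3-4), hypothetical values.
adj <- matrix(0, 4, 4)
adj[cbind(1:3, 2:4)] <- 1          # link each location to the next one
adj <- adj + t(adj)                # make the adjacency symmetric
W <- adj / rowSums(adj)            # row-standardize: each row sums to 1

# Hypothetical helper: Moran's I = (n / S0) * (z' W z) / (z' z)
moran_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(z * (W %*% z)) / sum(z^2)
}

i_smooth <- moran_i(c(1, 2, 3, 4), W)   # similar values sit next to each other
i_alt    <- moran_i(c(1, 4, 1, 4), W)   # high and low values alternate
c(smooth = i_smooth, alternating = i_alt)
```

The smoothly increasing series gives I = 0.4 (positive autocorrelation), while the alternating series gives I = -1 (negative autocorrelation).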

1.0.3 Main differences between global and local spatial autocorrelation

  • Scope: Global measures give one value for the entire dataset, while local measures analyze each location individually.
  • Interpretation: Global measures show whether a general pattern exists; local measures show where clusters or outliers are.
  • Usefulness: Global measures confirm spatial dependence, while local measures help identify specific areas for action.
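The relationship between the two kinds of measure can be sketched in base R. The example below uses one common form of the local Moran statistic and hypothetical values on five locations in a line; with row-standardized weights, the global statistic is simply the average of the local ones. None of the numbers come from the tourism data.

```r
# Toy setting: five locations on a line with hypothetical values;
# location 5 is a low value next to a high neighbour.
x <- c(10, 11, 12, 11, 2)
adj <- matrix(0, 5, 5)
adj[cbind(1:4, 2:5)] <- 1
adj <- adj + t(adj)
W <- adj / rowSums(adj)                 # row-standardized weights

z     <- x - mean(x)
lag_z <- as.numeric(W %*% z)            # average deviation among neighbours
m2    <- mean(z^2)

local_i  <- z * lag_z / m2              # one statistic per location
global_i <- sum(z * lag_z) / sum(z^2)   # single summary (n / S0 = 1 here)

round(local_i, 3)
all.equal(mean(local_i), global_i)      # global value = mean of the local values
```

A negative local value, as for location 5 here, flags a spatial outlier: a low value surrounded by high neighbours, which a single global number would hide.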

1.0.4 How can ESDA improve descriptive, predictive, and prescriptive analytics?

ESDA improves descriptive analytics by adding geographic context to the data. It improves predictive analytics by showing whether neighboring areas should be included in models. It improves prescriptive analytics by helping decision-makers target specific regions more effectively.

2 Data Preparation

2.0.1 Load Libraries

library(readxl)
library(dplyr)
library(ggplot2)
library(sf)
library(spdep)
library(tidyr)

2.0.2 Variable Selection

df <- read_excel("./inegi_mx_state_tourism.xlsx", sheet = "panel_data")

df_latest <- df %>%
  filter(year == max(year, na.rm = TRUE))

colnames(df_latest)
##  [1] "state"                        "year"                        
##  [3] "state_id"                     "land_area"                   
##  [5] "tourism_activity"             "cuartos_ocupados_nacionales" 
##  [7] "cuartos_ocupados_extranjeros" "llegada_turistas_nacionales" 
##  [9] "llegada_turistas_extranjeros" "restaurants"                 
## [11] "coffee_shops"                 "night_clubs"                 
## [13] "crime_rate"                   "college_education"           
## [15] "unemployment"                 "employment"                  
## [17] "business_activity"            "real_wage"                   
## [19] "population"                   "pop_density"                 
## [21] "good_governance"              "ratio_public_investment"     
## [23] "exchange_rate"                "inpc"                        
## [25] "border_distance"              "region...26"                 
## [27] "region...27"

The dataset contains 27 columns for 32 Mexican states from 2006 to 2022. The outcome of interest is “tourism_activity”. Five variables that may help explain each state's tourism performance are:

  • “crime_rate”: Security conditions may deter or attract tourists.
  • “business_activity”: Economic dynamism could correlate with tourism infrastructure.
  • “college_education”: Human capital may improve service quality in tourism.
  • “pop_density”: Population concentration relates to urban tourism hubs.
  • “real_wage”: Wage levels reflect economic conditions and purchasing power.

3 Descriptive Analysis

3.0.1 Descriptive Statistics

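# na.rm is set inside each lambda; passing it through across()'s `...` is deprecated in dplyr >= 1.1.0
desc_stats <- df_latest %>%
  summarise(across(c(tourism_activity, crime_rate, business_activity,
                     college_education, pop_density, real_wage),
                   list(Mean   = ~ mean(.x, na.rm = TRUE),
                        Median = ~ median(.x, na.rm = TRUE),
                        Min    = ~ min(.x, na.rm = TRUE),
                        Max    = ~ max(.x, na.rm = TRUE))))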

round(as.data.frame(t(desc_stats)), 2)
##                                 V1
## tourism_activity_Mean     60344.48
## tourism_activity_Median   34023.03
## tourism_activity_Min       8618.44
## tourism_activity_Max     365959.24
## crime_rate_Mean              28.64
## crime_rate_Median            17.35
## crime_rate_Min                1.98
## crime_rate_Max              111.06
## business_activity_Mean        0.02
## business_activity_Median      0.01
## business_activity_Min         0.00
## business_activity_Max         0.41
## college_education_Mean        0.28
## college_education_Median      0.28
## college_education_Min         0.19
## college_education_Max         0.44
## pop_density_Mean            315.08
## pop_density_Median           68.54
## pop_density_Min              11.36
## pop_density_Max            6211.45
## real_wage_Mean              348.91
## real_wage_Median            341.59
## real_wage_Min               282.55
## real_wage_Max               481.69

The descriptive statistics reveal strong regional disparities across Mexican states. Tourism activity is highly unequal, with a mean of 60,344.48 and a much lower median of 34,023.03, indicating that a few states concentrate most of the activity. Population density shows a similar pattern, with a large gap between the mean (315.08) and median (68.54), suggesting strong right skewness. Crime rate and business activity also vary considerably across states, while college education and real wage appear more evenly distributed. Overall, these results suggest that tourism performance is not uniform and may be influenced by economic, social, and demographic differences across regions.

3.0.2 Dispersion Statistics

disp_stats <- df_latest %>%
  summarise(across(c(tourism_activity, crime_rate, business_activity,
                     college_education, pop_density, real_wage),
                   ~ sd(.x, na.rm = TRUE)))  # na.rm set inside the lambda

round(as.data.frame(t(disp_stats)), 2)
##                         V1
## tourism_activity  71485.51
## crime_rate           27.55
## business_activity     0.07
## college_education     0.05
## pop_density        1087.58
## real_wage            45.05

The dispersion statistics show large differences in variability, but the raw standard deviations are expressed in different units and are not directly comparable across variables. Relative to their means, population density (SD 1,087.58 vs. mean 315.08) and business activity (SD 0.07 vs. mean 0.02) are the most dispersed. Tourism activity (SD 71,485.51 vs. mean 60,344.48) and crime rate (SD 27.55 vs. mean 28.64) follow, with standard deviations roughly as large as their means. In contrast, college education (SD 0.05 vs. mean 0.28) and real wage (SD 45.05 vs. mean 348.91) vary little relative to their means, suggesting more stable patterns across regions. Overall, these results confirm that certain key variables are unevenly distributed and may contribute to regional disparities in tourism performance.
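Because the six variables are measured in different units, a unit-free measure makes their dispersion comparable. The sketch below computes the coefficient of variation (SD divided by mean) from the rounded means and standard deviations printed above; since the inputs are rounded to two decimals, the result for business_activity in particular is only approximate.

```r
# Means and standard deviations copied from the rounded output above.
means <- c(tourism_activity = 60344.48, crime_rate = 28.64,
           business_activity = 0.02, college_education = 0.28,
           pop_density = 315.08, real_wage = 348.91)
sds   <- c(tourism_activity = 71485.51, crime_rate = 27.55,
           business_activity = 0.07, college_education = 0.05,
           pop_density = 1087.58, real_wage = 45.05)

cv <- sds / means                 # unit-free measure of relative dispersion
round(sort(cv, decreasing = TRUE), 2)
```

On the full data, the same quantity could be obtained by dividing `disp_stats` by the corresponding column means.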

3.0.3 Histograms

par(mfrow = c(2, 2))  # 2 x 2 grid for the four histograms

hist(df_latest$tourism_activity,
     main = "Tourism Activity",
     col = "lightblue")

hist(df_latest$crime_rate,
     main = "Crime Rate",
     col = "lightgreen")

hist(df_latest$college_education,
     main = "College Education",
     col = "lightyellow")

hist(df_latest$real_wage,
     main = "Real Wage",
     col = "lightcoral")

The histograms reveal important differences in the distribution of variables across Mexican states. Tourism activity is highly right-skewed, indicating that a small number of states concentrate very high levels of tourism, while most states show relatively low activity. Crime rate is also right-skewed: its mean (28.64) is well above its median (17.35), suggesting that a few states have much higher crime rates than the rest. In contrast, college education and real wage exhibit more symmetric distributions, reflecting more stable and evenly distributed socioeconomic conditions.
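The visual impression of skewness can be cross-checked numerically. As a rough sketch, the code below applies Pearson's second skewness coefficient, 3 × (mean − median) / SD, to the rounded summary statistics reported earlier for the four plotted variables. This is a quick approximation from summary values, not the moment-based skewness of the raw data.

```r
# Summary statistics copied from the rounded output in the previous sections.
stats <- data.frame(
  variable = c("tourism_activity", "crime_rate", "college_education", "real_wage"),
  mean     = c(60344.48, 28.64, 0.28, 348.91),
  median   = c(34023.03, 17.35, 0.28, 341.59),
  sd       = c(71485.51, 27.55, 0.05, 45.05)
)

# Pearson's second skewness coefficient: 3 * (mean - median) / sd
stats$pearson_skew <- with(stats, 3 * (mean - median) / sd)
stats[order(-stats$pearson_skew), c("variable", "pearson_skew")]
```

Values above 1 for tourism activity and crime rate point to pronounced right skew, while values near 0 for college education and below 0.5 for real wage are consistent with roughly symmetric histograms.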

4 Spatial Analysis

4.0.1 Boxplots by Region

names(df_latest)[names(df_latest) == "region...26"] <- "region"
boxplot(split(df_latest$crime_rate, df_latest$region),
        main = "Crime Rate by Region",
        col = "lightgreen", las = 2)

boxplot(split(df_latest$college_education, df_latest$region),
        main = "College Education by Region",
        col = "lightyellow", las = 2)

boxplot(split(df_latest$real_wage, df_latest$region),
        main = "Real Wage by Region",
        col = "lightcoral", las = 2)

The boxplots reveal clear regional disparities across crime rates, education, and real wages in Mexico. The Norte region stands out with higher median wages and crime rates, reflecting stronger economic activity but also greater security challenges. In contrast, the Sur region shows lower wages and education levels, indicating weaker socioeconomic conditions. Bajío and Occidente display higher variability across variables, suggesting heterogeneity within those regions. Meanwhile, Centro remains relatively stable with moderate levels across all indicators. Overall, these patterns highlight that economic development, human capital, and security conditions are unevenly distributed across regions and are likely key factors influencing regional differences in tourism activity.
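The quantities that a boxplot displays can also be tabulated, which makes statements about "higher medians" or "higher variability" easier to verify. The sketch below uses hypothetical toy wages for two regions (not values from the INEGI dataset) and base R's `tapply()`; with the real data the same calls would take `df_latest$real_wage` and `df_latest$region`.

```r
# Hypothetical toy values, not taken from the INEGI dataset.
toy <- data.frame(
  region    = c("Norte", "Norte", "Norte", "Sur", "Sur", "Sur"),
  real_wage = c(400, 420, 450, 290, 300, 320)
)

# Median (centre line of each box) and IQR (height of each box) per region
med <- tapply(toy$real_wage, toy$region, median)
iqr <- tapply(toy$real_wage, toy$region, IQR)
cbind(median = med, IQR = iqr)
```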

4.0.2 Choropleth Maps

archivo <- "/Users/fernandaperez/Downloads/mx_states/mexlatlong.shp"
mx_states <- st_read(archivo, quiet = TRUE)

clean_names <- function(x) {
  x %>%
    iconv(to = "ASCII//TRANSLIT") %>%
    toupper() %>%
    stringr::str_trim()
}

mx_states <- mx_states %>%
  mutate(state_join = clean_names(ADMIN_NAME))

df_latest <- df_latest %>%
  mutate(state_join = clean_names(state))

mx_map <- mx_states %>%
  left_join(df_latest, by = "state_join")

# Count polygons left without data after the join (1 means one state name failed to match)
sum(is.na(mx_map$tourism_activity))
## [1] 1

map_vars <- c("tourism_activity", "crime_rate", "business_activity",
              "college_education", "pop_density", "real_wage")

for (v in map_vars) {
  print(
    ggplot(mx_map) +
      geom_sf(aes(fill = .data[[v]]), color = "white", linewidth = 0.2) +
      scale_fill_viridis_c(option = "plasma", na.value = "grey90") +
      theme_minimal() +
      labs(title = paste("Choropleth Map:", v), fill = v)
  )
}

The choropleth maps suggest that several variables exhibit clear spatial patterns across Mexican states. Tourism activity appears to be concentrated in a limited number of regions, indicating the presence of geographic clustering rather than a random distribution. Similarly, business activity and population density show strong regional concentration, particularly in more urbanized and economically developed areas. In contrast, variables such as real wage and college education display a more moderate spatial variation, although some regional disparities are still visible. Overall, the visual patterns indicate that neighboring states often share similar characteristics, which suggests the potential presence of spatial dependence. This supports the idea that geographic location plays an important role in explaining tourism performance.
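The join check `sum(is.na(mx_map$tourism_activity))` returns 1, so one polygon has no matching data row and is drawn in grey on every map. A simple way to find the culprit is to compare the two sets of join keys. The sketch below uses invented name vectors standing in for `mx_states$state_join` and `df_latest$state_join`; the specific mismatch shown is hypothetical, not a finding about this shapefile.

```r
# Hypothetical name vectors standing in for mx_states$state_join and
# df_latest$state_join; the mismatch shown here is invented for illustration.
shape_names <- c("JALISCO", "NUEVO LEON", "DISTRITO FEDERAL")
data_names  <- c("JALISCO", "NUEVO LEON", "CIUDAD DE MEXICO")

only_in_shape <- setdiff(shape_names, data_names)  # polygons with no data row
only_in_data  <- setdiff(data_names, shape_names)  # data rows with no polygon
list(only_in_shape = only_in_shape, only_in_data = only_in_data)
```

Once the unmatched names are known, they can be recoded in one of the two tables before the join so that all 32 states enter the spatial statistics.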

4.0.3 Spatial Weights Matrix

# Build neighbors using Queen contiguity
nb_queen <- poly2nb(mx_map, queen = TRUE)

# Row-standardized weights
lw_queen <- nb2listw(nb_queen, style = "W", zero.policy = TRUE)

# Summary of neighbors
summary(nb_queen)
## Neighbour list object:
## Number of regions: 32 
## Number of nonzero links: 138 
## Percentage nonzero weights: 13.47656 
## Average number of links: 4.3125 
## Link number distribution:
## 
## 1 2 3 4 5 6 7 8 9 
## 1 6 6 6 5 2 3 2 1 
## 1 least connected region:
## 31 with 1 link
## 1 most connected region:
## 8 with 9 links

# Display the row-standardized weights matrix
W_mat <- nb2mat(nb_queen, style = "W", zero.policy = TRUE)
W_mat[1:10, 1:10]  # first 10 x 10 block
##       1         2         3         4    5     6         7         8         9
## 1  0.00 0.2500000 0.0000000 0.0000000 0.25 0.250 0.0000000 0.0000000 0.0000000
## 2  0.20 0.0000000 0.2000000 0.0000000 0.00 0.200 0.2000000 0.2000000 0.0000000
## 3  0.00 0.2500000 0.0000000 0.2500000 0.00 0.000 0.2500000 0.2500000 0.0000000
## 4  0.00 0.0000000 0.3333333 0.0000000 0.00 0.000 0.0000000 0.3333333 0.0000000
## 5  0.25 0.0000000 0.0000000 0.0000000 0.00 0.250 0.0000000 0.0000000 0.0000000
## 6  0.20 0.2000000 0.0000000 0.0000000 0.20 0.000 0.2000000 0.0000000 0.0000000
## 7  0.00 0.1250000 0.1250000 0.0000000 0.00 0.125 0.0000000 0.1250000 0.1250000
## 8  0.00 0.1111111 0.1111111 0.1111111 0.00 0.000 0.1111111 0.0000000 0.1111111
## 9  0.00 0.0000000 0.0000000 0.0000000 0.00 0.000 0.1428571 0.1428571 0.0000000
## 10 0.00 0.0000000 0.0000000 0.0000000 0.00 0.000 0.5000000 0.0000000 0.5000000
##           10
## 1  0.0000000
## 2  0.0000000
## 3  0.0000000
## 4  0.0000000
## 5  0.0000000
## 6  0.0000000
## 7  0.1250000
## 8  0.0000000
## 9  0.1428571
## 10 0.0000000

The contiguity-based spatial weight matrix defines neighboring states based on shared borders or vertices using the Queen criterion. This means that each state is connected to all adjacent states, capturing potential geographic interactions. By row-standardizing the matrix, the weights of each state's neighbors sum to one, allowing for consistent comparison across observations. In this context, neighbors represent geographically close states that may share economic conditions, infrastructure, or tourism dynamics. This spatial structure is essential for measuring spatial autocorrelation, as it formally defines how observations are related to one another in space.
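Row-standardization has two practical consequences that are visible in the printed block: the weights are no longer symmetric (for example, the entry in row 1, column 2 is 0.25 while row 2, column 1 is 0.20), and multiplying the matrix by a variable yields the average value among each state's neighbors. The base-R sketch below illustrates both points on a hypothetical four-state layout with made-up values.

```r
# Hypothetical 4-state layout: states 1 and 4 each border states 2 and 3,
# and states 2 and 3 also border each other.
adj <- matrix(c(0, 1, 1, 0,
                1, 0, 1, 1,
                1, 1, 0, 1,
                0, 1, 1, 0), nrow = 4, byrow = TRUE)
W <- adj / rowSums(adj)          # style "W": divide each row by its neighbour count

rowSums(W)                       # every row sums to 1
c(W[1, 2], W[2, 1])              # 1/2 vs 1/3: row-standardized weights are not symmetric

x <- c(10, 20, 30, 40)           # hypothetical values
lag_x <- as.numeric(W %*% x)     # spatial lag = average of each state's neighbours
lag_x
```

For state 1, whose neighbors hold 20 and 30, the spatial lag is 25; this neighbor average is what Moran's I compares against each state's own value.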

4.0.4 Moran’s I

moran_results <- purrr::map_dfr(map_vars, function(v) {
  x <- mx_map[[v]]
  test <- moran.test(x, lw_queen, zero.policy = TRUE, na.action = na.exclude)

  tibble(
    variable = v,
    moran_I = unname(test$estimate["Moran I statistic"]),
    expectation = unname(test$estimate["Expectation"]),
    variance = unname(test$estimate["Variance"]),
    p_value = test$p.value
  )
})
knitr::kable(moran_results, digits = 6)

variable             moran_I   expectation  variance   p_value
tourism_activity    -0.187264  -0.033333    0.014683   0.898019
crime_rate           0.033518  -0.033333    0.014732   0.290891
business_activity   -0.050593  -0.033333    0.000489   0.782404
college_education    0.268059  -0.033333    0.015787   0.008226
pop_density          0.362128  -0.033333    0.010219   0.000046
real_wage            0.150722  -0.033333    0.015803   0.071578

# Moran scatterplot for tourism_activity without NA values
mx_map_tourism <- mx_map %>%
  filter(!is.na(tourism_activity))

nb_tourism <- poly2nb(mx_map_tourism, queen = TRUE)
lw_tourism <- nb2listw(nb_tourism, style = "W", zero.policy = TRUE)

moran.plot(mx_map_tourism$tourism_activity, lw_tourism, zero.policy = TRUE,
           main = "Moran Scatterplot: tourism_activity")

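The Global Moran's I results are mixed rather than uniformly positive. The expectation of -0.033333 equals -1/(n - 1) with n = 31, so the tests are computed on 31 states because one polygon has no matched data. Only population density (I = 0.362, p < 0.001) and college education (I = 0.268, p = 0.008) show statistically significant positive spatial autocorrelation at the 5% level, meaning that states with similar values of these variables tend to be neighbors. Real wage (I = 0.151, p = 0.072) is positive but significant only at the 10% level. Crime rate (I = 0.034, p = 0.291) and business activity (I = -0.051, p = 0.782) are statistically indistinguishable from a spatially random arrangement. Tourism activity has a negative statistic (I = -0.187, p = 0.898), which lies below its expectation. Because moran.test() by default tests the one-sided alternative of positive autocorrelation, this p-value only says that there is no evidence of positive clustering; the implied two-sided p-value is about 0.20, so the tendency for high-tourism states to border low-tourism states is not significant either. With row-standardized weights, the slope of the fitted line in the Moran scatterplot equals Moran's I, so the plot for tourism activity should show a slightly downward-sloping line with widely scattered points rather than an upward trend. Overall, spatial dependence is supported for population density and college education, and weakly for real wage, but the global statistic does not confirm the tourism clustering that the choropleth maps appeared to suggest. The leading tourism states may therefore be better viewed as isolated high values than as part of a contiguous cluster, which points to state-specific, non-spatial drivers.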
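The p-values in the table can be reproduced from the reported statistic, expectation, and variance, which also makes it easy to see what a two-sided test would conclude. The sketch below copies the rounded table values into a data frame (the object name `moran_tab` is introduced here for illustration) and applies the normal approximation that `moran.test()` uses; small differences from the reported p-values are due to rounding.

```r
# Values copied from the Moran's I table above (rounded to 6 decimals).
moran_tab <- data.frame(
  variable    = c("tourism_activity", "crime_rate", "business_activity",
                  "college_education", "pop_density", "real_wage"),
  moran_I     = c(-0.187264, 0.033518, -0.050593, 0.268059, 0.362128, 0.150722),
  expectation = -0.033333,
  variance    = c(0.014683, 0.014732, 0.000489, 0.015787, 0.010219, 0.015803),
  p_reported  = c(0.898019, 0.290891, 0.782404, 0.008226, 0.000046, 0.071578)
)

# Standard deviate and p-values under the normal approximation
moran_tab$z           <- with(moran_tab, (moran_I - expectation) / sqrt(variance))
moran_tab$p_greater   <- pnorm(moran_tab$z, lower.tail = FALSE)        # one-sided: I > E[I]
moran_tab$p_two_sided <- 2 * pnorm(abs(moran_tab$z), lower.tail = FALSE)

moran_tab[, c("variable", "z", "p_greater", "p_two_sided")]
```

The one-sided column matches the reported p-values, confirming that the table tests only for positive autocorrelation; a two-sided test could be requested directly with `alternative = "two.sided"` in `moran.test()`.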

5 Hypotheses

  • H1: States with higher levels of business activity tend to exhibit higher tourism activity.
  • H2: States surrounded by neighbors with high tourism activity are more likely to show high tourism activity themselves, suggesting spatial spillover effects.
  • H3: States with higher crime rates tend to present lower tourism activity, especially when neighboring states also have unfavorable security conditions.
  • H4: Population density is positively associated with tourism activity because urban areas concentrate services, transport infrastructure, and attractions.
  • H5: Spatial clusters of high tourism activity are likely to be concentrated in states that also show stronger economic and social conditions, such as higher real wages and educational attainment.

The proposed hypotheses draw on the statistical patterns and spatial relationships explored in the data. They incorporate a spatial perspective by considering not only the characteristics of each state, but also the influence of neighboring states. In particular, the inclusion of spatial spillover effects suggests that tourism performance may depend on regional dynamics rather than isolated factors. For example, states surrounded by high-performing tourism regions may benefit from shared infrastructure, connectivity, and regional attractiveness. H2 in particular remains to be tested: the global Moran's I for tourism activity (I = -0.187, p = 0.898) does not by itself indicate positive clustering, so local indicators or spatial regression would be needed to detect spillovers. These hypotheses provide a foundation for future predictive and spatial econometric modeling, where both local and neighboring effects can be formally tested.
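One simple way to test for neighboring effects such as those in H2 and H3 is to add the spatial lag of an explanatory variable to an ordinary regression (often called an SLX specification). The sketch below runs this on simulated data with a hypothetical ring-shaped neighbor structure; the variables, coefficients, and layout are all assumptions made for illustration, and a full analysis would use the Queen weights built earlier together with a dedicated spatial econometrics package.

```r
set.seed(42)
n <- 32                                   # same count as Mexican states, but simulated

# Hypothetical ring layout: unit i neighbours i-1 and i+1
adj <- matrix(0, n, n)
idx <- 1:n
adj[cbind(idx, idx %% n + 1)] <- 1
adj <- adj + t(adj)
W <- adj / rowSums(adj)                   # row-standardized weights

x  <- rnorm(n)                            # own-state driver (simulated)
wx <- as.numeric(W %*% x)                 # neighbours' average of the driver
y  <- 1 + 2 * x + 1.5 * wx + rnorm(n, sd = 0.1)   # outcome with a true spillover of 1.5

fit <- lm(y ~ x + wx)                     # own effect plus neighbour effect
round(coef(fit), 2)
```

A coefficient on `wx` that is clearly different from zero indicates that neighbors' conditions help explain a state's outcome beyond its own characteristics.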

6 Conclusions

  • Tourism activity is not evenly distributed across Mexican states; a small number of states concentrate much higher values.
  • Population density and business activity also show large disparities, which may be related to differences in urbanization and infrastructure.
  • Several variables exhibit skewness and outliers, especially tourism activity, business activity, and population density.
  • The choropleth maps help reveal whether high or low values are geographically concentrated rather than randomly distributed.
  • Spatial weight matrices formalize the concept of neighboring states and allow the analysis of spatial dependence.
  • Moran's I indicates whether similar values tend to cluster geographically across Mexico; here it is significantly positive only for population density and college education.
  • If tourism activity presents positive spatial autocorrelation, traditional non-spatial analysis may omit important regional spillovers.
  • ESDA improves business intelligence by linking descriptive patterns to geographic context.

The ESDA shows that tourism activity in Mexico is characterized by strong regional disparities: a few states concentrate much higher levels of tourism activity, while most remain comparatively underdeveloped in this sector. The evidence for spatial dependence, however, differs by variable. Population density and college education exhibit significant positive spatial autocorrelation and form geographic clusters, and real wage does so only weakly, whereas tourism activity, business activity, and crime rate show no significant global spatial autocorrelation. From a business intelligence perspective, these findings suggest that tourism strategies should account for the regional clustering of demographic and human-capital conditions, while treating the leading tourism states as individual hubs until local indicators or spatial regression models provide evidence of spillovers. Incorporating spatial analysis into decision-making can still lead to more effective and targeted policies for tourism development.