import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
#!pip install geopandas
import geopandas as gpdGeographic Analysis for Business Intelligence
The aim of this document is to integrate geographic information systems (GIS) and spatial analysis within the field of Business Intelligence (BI). We will learn how geographic data can enhance decision-making by revealing spatial patterns, regional disparities, and location-based opportunities that are not visible in traditional datasets. Using official data on financial inclusion, we move from understanding the basic structure of spatial data to visualizing results through interactive choropleth maps.
1 Intro
Business Intelligence is often defined as the set of tools and systems used to collect and analyze data to make more informed business decisions. Most BI systems focus on two dimensions: time and categories. For example, dashboards show KPIs, sales over time, customers by segment, or production by line of business. Yet every transaction, customer, or branch also has a location. Recognizing and analyzing that location introduces a third dimension: space.
Geographic analysis expands the scope of BI by answering not only what happened and when, but also where it happened and, consequently, why there. When we map data, patterns of concentration, inequality, or opportunity that are often hidden in tables suddenly become visible. For example:
a bank might visualize the distribution of ATMs, branches, and correspondents across Jalisco. Patterns of concentration around Guadalajara would emerge, signaling both market saturation and potential underserved zones. Without mapping these data, such insights would remain buried in rows of a spreadsheet.
a company analyzing its branch performance may discover that two offices with equal revenues serve very different markets: one in a dense urban corridor surrounded by competing banks, the other in a rural municipality with few alternatives. Again, such insights rapidly emerge when the data are visualized on a map.
Spatial analysis is not limited to visualization; it expands the set of decision-making tools. Within a BI framework, geographic data become another analytical layer that interacts with time, product, and customer dimensions. Adding this layer allows organizations to:
Identify regional markets and customer clusters.
Detect potential areas for expansion/reduction.
Optimize logistics, distribution, and routing.
Evaluate performance adjusted for local conditions (income, population, infrastructure).
Monitor risks tied to geography—natural hazards, social vulnerability, or crime incidence.
In public administration, the same logic supports territorial equity: visualizing where programs, services, or infrastructure are concentrated helps direct resources to the areas that need them most.
Once the value of location was recognized, a natural question arises: how do we technically represent space in a BI environment? This is where Geographic Information Systems (GIS) come in. In the next section we will examine how GIS works, how coordinates and projections define locations, and how spatial data (points, lines, polygons) interact with the tabular information typical of BI. Combining these two worls, BI and GIS, gives rise to what is known as GeoBI, a modern approach that empowers organizations to think not only in classification and trends, but also in territory and context.
2 Geographic Information Systems (GIS)
A GIS is a framework that combines data, methods, and technology to capture, store, analyze, and visualize spatially referenced information. A GIS can be understood as framework designed to work with data that are spatially referenced, meaning they can be positioned on the Earth’s surface using coordinates or geographic identifiers (like electoral districts or postal codes) .
2.1 Core components
Spatial data describe the geometry of geographic entities. They can take three fundamental forms:
Points: Represent discrete locations such as bank branches, ATMs, schools, or customer addresses. Each point is defined by a pair of coordinates (latitude and longitude).
Lines: Represent linear features such as roads, pipelines, or delivery routes. They model movement, connection, or distance between points.
Polygons (areas): Represent bounded regions like municipalities, states, or sales territories.
They are used to analyze aggregated data (e.g., total sales per region).
These geometries are stored in spatial data formats such as Shapefile (.shp), GeoJSON, or KML, which can be read by GIS software (QGIS) or spatial analysis libraries in Python and R.
Attribute data. For each spatial feature, attribute data store descriptive information in tabular form. For instance, a polygon representing the municipality of Guadalajara might have attributes for population, GDP, number of financial institutions, and average income. These attributes are the metrics or indicators that we analyze, while the spatial features provide the dimension of geography.
Software. A GIS environment includes tools for data entry (digitization, GPS capture, importing shapefiles), data management (databases such as PostGIS), spatial analysis (e.g., calculating distances, densities, or clusters), and visualization (map design and dashboards).
Although some Business Intelligence platforms such as Power BI or Tableau already embed GIS capabilities, allowing maps to be created directly from datasets that include coordinates or region names, for analytical depth, open-source libraries like GeoPandas and Folium in Python provide the flexibility to build customized spatial analyses, as we do in this document.
2.2 Representing spatial data on a computer
Geographic objects can be represented on a computer either with vector or raster data models. Both describe geographic phenomena, but they differ in how they conceptualize the Earth’s surface. Vector data represent geographic features as discrete objects, are constructed from points, which are connected to each other forming lines and polygons, with precisely defined boundaries. Raster data represent geographic features as a continuous surface divided into a grid of small cells (pixels), where each cell stores a value. Both models are complementary rather than competing: vectors are ideal for discrete features (such as locations of stores), while rasters are better for continuous variables (such as population density, light intensity, elevation or temperature).
The vector model treats geographic entities as individual features defined by coordinates.
| Geometry | Description | Examples |
|---|---|---|
| Point | Single coordinate (x, y) representing a precise location. | Branch locations, ATMs, client addresses. |
| Line | Ordered sequence of points forming a path. | Delivery routes, road networks, supply chains. |
| Polygon | Closed shape bounded by lines; represents an area. | Sales territories, municipalities, market zones. |
Each feature can store multiple attributes in an accompanying table. For example, each polygon (municipality) can contain attributes such as total population, income, or financial inclusion indicators. Wector data are the foundation for choropleth maps, point distribution maps, and network analyses (e.g., route optimization). They allow precise measurement, querying (“select all branches within 5 km of Guadalajara”), and relational operations (joining branches to their respective regions).
Vector data is most commonly stored in a “shapefile”. A shapefile is composed of 3 required files with the same prefix but different extensions:
spatial-data.shp: main file that stores records of each shape geometries
spatial-data.shx: index of how the geometries in the main file relate to one-another
spatial-data.dbf: attributes of each record
The raster model divides space into a regular grid of cells, each containing a numeric value that represents a property of that location. Each cell (pixel) has a uniform size, commonly between 10 and 1,000 meters on the ground, and collectively the grid represents a continuous surface. Raster data are commonly used for environmental, demographic, or imagery-based information, where phenomena vary gradually across space rather than in discrete units.
| Example | Description | Use |
|---|---|---|
| Satellite imagery | Each cell stores reflectance or color value. | Visual context or land-use classification. |
| Population density surface | Each cell stores estimated inhabitants per km². | Market potential or customer base estimation. |
| Heat maps | Cells represent intensity of events (transactions, visits). | Identify commercial hotspots or risk zones. |
Both models often coexist. The vector layer provides structure (regions, branch locations, administrative units). The raster layer provides background or continuous context (population density, socioeconomic heat, satellite basemap). Together, they enable multi-layered insights.
2.3 Georeferencing
The actual process of assigning real-world coordinates or geographic identifiers to data so they can be accurately located on a map is called georeferencing. For example: a table of bank branches can be georeferenced by attaching each address to a latitude–longitude pair. A dataset of municipalities can be linked to the official geographic boundaries published by INEGI. Without georeferencing, data remain abstract, they are simply numbers without location. With it, every observation acquires spatial meaning, allowing it to be compared, aggregated, or visualized geographically.
To locate objects on Earth, GIS uses coordinate reference systems (CRS). The most common is WGS84 (EPSG:4326), which expresses positions using degrees of latitude (north–south) and longitude (east–west). Other projections, such as UTM (Universal Transverse Mercator), express locations in meters, which are useful for local or regional analysis. When integrating multiple data sources, the analyst must ensure that all layers share the same CRS. Otherwise, maps will misalign or display distorted geometries.
3 Exploratory Spatial Analysis
Exploratory Spatial Analysis (ESA) is the process of examining spatial data to discover patterns, relationships, and anomalies before applying formal models or predictions. Just as Exploratory Data Analysis (EDA) helps analysts understand numerical datasets through visualization and summary statistics, ESA reveals the geographic structure of information (distribution, cluster, or dispersion of phenomena across territory).
ESA allows organizations to answer questions such as:
Where are sales, clients, or resources concentrated?
Which territories show performance above or below average?
Are neighboring regions similar or different in behavior?
What spatial factors might explain market opportunities or operational challenges?
ESA thus forms the bridge between data visualization and spatial reasoning, preparing analysts for more advanced spatial modeling.
Because vector data treat geographic entities as discrete objects, ESA with vectors focuses on relationships between those objects. The analysis might involve:
Points: distribution of facilities, clients, or transactions.
Polygons: comparison of indicators among administrative units (states, municipalities, territories).
Lines: network accessibility and flow analysis.
Typical exploratory operations include counting points within polygons (e.g., number of ATMs per municipality), calculating nearest-neighbor distances to assess clustering, measuring spatial autocorrelation between polygon attributes (e.g., similarity of inclusion levels among adjacent municipalities), and mapping ratios or densities as choropleth visualizations.
3.1 Types of maps
ESA begins with visual exploration. Four main types of vector-based maps are commonly used:
| Map Type | Description | BI Example |
|---|---|---|
| Point Map | Displays discrete events or objects as dots on the map. Useful for visualizing distribution and density. | Locations of bank branches, customer visits, or retail outlets. |
| Choropleth Map | Shades polygons based on a quantitative variable. Reveals concentration and regional differences. | Municipal inclusion index, sales per territory, average income. |
| Proportional Symbol Map | Uses symbol size to represent magnitude at each location. | Transaction volume by branch. |
| Flow Map (Network Map) | Uses lines whose thickness indicates volume or frequency of movement. | Supply-chain routes or inter-branch money transfers. |
3.2 Spatial Descriptive Measures
Simple descriptive measures can formalize observations:
Mean Center and Standard Distance – Identify the geographic “center of gravity” of a dataset and how dispersed the features are around it. – Example: locating the average position of all ATMs in Jalisco.
Spatial Density (Kernel or Count) – Measure how many features fall within a certain area. – Example: number of branches per 10 000 adults or per km².
Spatial Autocorrelation (conceptual overview) – Indicates whether similar values occur near each other. – A high-inclusion municipality surrounded by others with high inclusion suggests positive autocorrelation (spatial clustering). In practice, indices like Moran’s I or Getis-Ord Gi* quantify this; conceptually, they capture spatial dependence.
4 Integrating Tabular and Geographic Data
Up to this point, we have learned why the spatial dimension is essential to Business Intelligence. We are now ready to build our own geographic dataset (merging traditional indicators with spatial information) so we can analyze and visualize patterns across territory. We focus on the practical workflow that transforms separate data tables into a unified GeoDataFrame. We will work in Google Colab using open-source Python libraries such as pandas, geopandas, and folium.
This example uses indicators of financial access in the state of Jalisco as a case of study, but the same method applies to any business dataset with regional identifiers. Data come from the CNBV: Financial infrastructure.
Objectives:
Load and explore a geographic layer (municipal polygons).
Load and inspect tabular data.
Merge the two datasets through a common geographic key.
Compute new spatial indicators.
Create an interactive map that visualizes business or policy variables geographically.
4.1 Load data
Load the Geographic Layer
Jalisco = gpd.read_file("14_Jalisco") # points to the shapefile "directory"
Jalisco.crs
Jalisco = Jalisco.to_crs("EPSG:4326")
# type(Jalisco)EPSG:6362 is INEGI’s projected coordinate system is EPSG:6362, we reproject to the global WGS 84 system (EPSG:4326), which is compatible with most web-based visualization tools.
Jalisco.columns
print(f"Number of municipalities in Jalisco: {len(Jalisco)}")Number of municipalities in Jalisco: 125
Plotting the layer:
fig, ax = plt.subplots(figsize=(7,7))
# Plot municipalities
Jalisco.plot(ax=ax, color="lightgrey", edgecolor="slategray")
ax.set_title("Municipalities in the State of Jalisco", fontsize=12)
ax.axis("off")
# plt.savefig("Jalisco_municipalities.png", bbox_inches="tight", dpi=150)
plt.show()Zapopan = Jalisco[Jalisco["NOMGEO"] == "Zapopan"].copy()
ax = Jalisco.plot(figsize=(7,7), color="lightgrey", edgecolor="slategray")
Zapopan.plot(ax=ax, color="steelblue", edgecolor="white")
ax.set_title("Location of Zapopan, Jalisco", fontsize=14)
ax.axis("off")
# plt.savefig("highlight_Zapopan.png", bbox_inches="tight", dpi=150)
plt.show()Zapopan.plot(color="steelblue", edgecolor="slategray", figsize=(7, 7))
plt.title('Zapopan')
ax.axis("off")
plt.show()Load tabular data
The Excel file “Info_Inclusion_Financiera.xlsx” contains financial inclusion indicators by municipality. Rows are municipalities and columns are variables (population, ATMs, branches, corresponsales, etc.).
df_inclusion = pd.read_excel("Info_Inclusion_Financiera.xlsx")
df_inclusion.head()| Clave Municipio | Clave \nEstado | Región | Estado | Municipio | Población total | Población adulta | Población adulta Mujeres | Población adulta Hombres | Tipo de población | Índice de rezago social | Corresponsales\nNúmero | Sucursales\nNúmero | Cajeros automáticos\nNúmero | TPV\nNúmero | Establecimientos con TPV\nNúmero | Cuentas que efectúan transacciones por celular\nNúmero | Transacciones en cajeros automáticos\nNúmero | Transacciones en TPV\nNúmero | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14001 | 14 | Occidente y Bajío | Jalisco | Acatic | 25136 | 18364 | 9405 | 8959 | Semi-urbano | Muy bajo | 7 | 5 | 6 | 64 | 58 | 9541 | 25330 | 11739 |
| 1 | 14002 | 14 | Occidente y Bajío | Jalisco | Acatlán de Juárez | 28926 | 20700 | 10058 | 10642 | Semi-urbano | Muy bajo | 17 | 8 | 14 | 109 | 100 | 16867 | 46300 | 15901 |
| 2 | 14003 | 14 | Occidente y Bajío | Jalisco | Ahualulco de Mercado | 24676 | 18633 | 9418 | 9215 | Semi-urbano | Muy bajo | 16 | 6 | 14 | 142 | 117 | 14832 | 53854 | 13451 |
| 3 | 14004 | 14 | Occidente y Bajío | Jalisco | Amacueca | 6165 | 4525 | 2379 | 2146 | En transición | Muy bajo | 2 | 3 | 2 | 7 | 7 | 1718 | 2793 | 181 |
| 4 | 14005 | 14 | Occidente y Bajío | Jalisco | Amatitán | 17851 | 12854 | 6468 | 6386 | Semi-urbano | Muy bajo | 5 | 4 | 10 | 93 | 87 | 9967 | 24151 | 20291 |
df_inclusion.dtypesClave Municipio int64
Clave \nEstado int64
Región object
Estado object
Municipio object
Población total int64
Población adulta int64
Población adulta Mujeres int64
Población adulta Hombres int64
Tipo de población object
Índice de rezago social object
Corresponsales\nNúmero int64
Sucursales\nNúmero int64
Cajeros automáticos\nNúmero int64
TPV\nNúmero int64
Establecimientos con TPV\nNúmero int64
Cuentas que efectúan transacciones por celular\nNúmero int64
Transacciones en cajeros automáticos\nNúmero int64
Transacciones en TPV\nNúmero int64
dtype: object
The ‘’ come from line breaks in Excel headers. Clean as follows:
df_inclusion.columns = (
df_inclusion.columns
.str.replace("\n", " ", regex=False) # remove line breaks
.str.strip()
)df_inclusion.dtypesClave Municipio int64
Clave Estado int64
Región object
Estado object
Municipio object
Población total int64
Población adulta int64
Población adulta Mujeres int64
Población adulta Hombres int64
Tipo de población object
Índice de rezago social object
Corresponsales Número int64
Sucursales Número int64
Cajeros automáticos Número int64
TPV Número int64
Establecimientos con TPV Número int64
Cuentas que efectúan transacciones por celular Número int64
Transacciones en cajeros automáticos Número int64
Transacciones en TPV Número int64
dtype: object
4.2 Merging with Geographic Data
At this stage, we have two complementary datasets:
Tabular data (df_inclusion): a table of financial inclusion indicators from the Excel file. Where each row is one municipality, while columns contain indicators or characteristics (population, ATMs, branches, etc.)
Geographic data (gdf_jalisco): polygons representing the boundaries of Jalisco’s municipalities. Wher each row is one municipality, and the columns include a geographic identifier and a geometry column describing the polygon.
Our objective is to merge these two so that each polygon gains all the numerical indicators from the table. Once merged, the resulting GeoDataFrame becomes the spatial foundation for further analysis and visualization. Possible only if both datasets contain a common field that uniquely identifies each municipality.
print(df_inclusion.columns)
print(Jalisco.columns)Index(['Clave Municipio', 'Clave Estado', 'Región', 'Estado', 'Municipio',
'Población total', 'Población adulta', 'Población adulta Mujeres',
'Población adulta Hombres', 'Tipo de población',
'Índice de rezago social', 'Corresponsales Número', 'Sucursales Número',
'Cajeros automáticos Número', 'TPV Número',
'Establecimientos con TPV Número',
'Cuentas que efectúan transacciones por celular Número',
'Transacciones en cajeros automáticos Número',
'Transacciones en TPV Número'],
dtype='object')
Index(['CVEGEO', 'CVE_ENT', 'CVE_MUN', 'NOMGEO', 'geometry'], dtype='object')
Compare a few entries:
df_inclusion["Clave Municipio"].sort_values().unique()[:5]
Jalisco["CVEGEO"].sort_values().unique()[:5]array(['14001', '14002', '14003', '14004', '14005'], dtype=object)
Note: If you are joining by names and you see slight spelling differences, accents, or capitalization discrepancies, you’ll need to clean them before merging.
Also, for a merge to work the join keys must be of the same type:
Jalisco[["CVEGEO"]].dtypes
df_inclusion[["Clave Municipio"]].dtypesClave Municipio int64
dtype: object
Here, one side of the join is an object, while the other is integer. Fix this making both strings:
Jalisco["CVEGEO"] = Jalisco["CVEGEO"].astype(str)
df_inclusion["Clave Municipio"] = df_inclusion["Clave Municipio"].astype(str)Municipal codes usually are of 5 digits (e.g. 14 + 001 : 14001), the first two indicate the State (Jalisco is 14) and the remaining three are to identify the municipality. To make sure the code always has 5 digits you can zero-pad them:
Jalisco["CVEGEO"] = Jalisco["CVEGEO"].astype(str).str.zfill(5)
df_inclusion["Clave Municipio"] = df_inclusion["Clave Municipio"].astype(str).str.zfill(5)Perform the merge:
geoData = pd.merge(
Jalisco,
df_inclusion,
left_on='CVEGEO',
right_on='Clave Municipio',
how='left',
)geoData must contain now both geometry columns and financial indicators columns.
geoData.info()<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 125 entries, 0 to 124
Data columns (total 24 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 CVEGEO 125 non-null object
1 CVE_ENT 125 non-null object
2 CVE_MUN 125 non-null object
3 NOMGEO 125 non-null object
4 geometry 125 non-null geometry
5 Clave Municipio 125 non-null object
6 Clave Estado 125 non-null int64
7 Región 125 non-null object
8 Estado 125 non-null object
9 Municipio 125 non-null object
10 Población total 125 non-null int64
11 Población adulta 125 non-null int64
12 Población adulta Mujeres 125 non-null int64
13 Población adulta Hombres 125 non-null int64
14 Tipo de población 125 non-null object
15 Índice de rezago social 125 non-null object
16 Corresponsales Número 125 non-null int64
17 Sucursales Número 125 non-null int64
18 Cajeros automáticos Número 125 non-null int64
19 TPV Número 125 non-null int64
20 Establecimientos con TPV Número 125 non-null int64
21 Cuentas que efectúan transacciones por celular Número 125 non-null int64
22 Transacciones en cajeros automáticos Número 125 non-null int64
23 Transacciones en TPV Número 125 non-null int64
dtypes: geometry(1), int64(13), object(10)
memory usage: 23.6+ KB
Verify whether any polygons failed to find a matching record:
missing = geoData[geoData['Sucursales Número'].isna()]
print(len(missing))0
geoData[["Población adulta", "Cajeros automáticos Número"]].describe()| Población adulta | Cajeros automáticos Número | |
|---|---|---|
| count | 1.250000e+02 | 125.000000 |
| mean | 5.411965e+04 | 38.472000 |
| std | 1.700648e+05 | 156.607872 |
| min | 1.278000e+03 | 0.000000 |
| 25% | 5.748000e+03 | 3.000000 |
| 50% | 1.357100e+04 | 6.000000 |
| 75% | 2.801900e+04 | 15.000000 |
| max | 1.265031e+06 | 1338.000000 |
Confirm that every municipality now contains both its geometry and attributes:
geoData[["NOMGEO", "Cajeros automáticos Número", "Sucursales Número", "geometry"]].head()| NOMGEO | Cajeros automáticos Número | Sucursales Número | geometry | |
|---|---|---|---|---|
| 0 | Zapotitlán de Vadillo | 3 | 3 | POLYGON ((-103.66059 19.5976, -103.66081 19.59... |
| 1 | Bolaños | 4 | 3 | POLYGON ((-103.8884 22.02637, -103.88838 22.02... |
| 2 | Pihuamo | 4 | 4 | POLYGON ((-103.27457 19.37393, -103.27432 19.3... |
| 3 | Tonalá | 137 | 51 | POLYGON ((-103.2657 20.70554, -103.26526 20.70... |
| 4 | San Martín de Bolaños | 1 | 1 | POLYGON ((-103.89007 21.78033, -103.88952 21.7... |
Visual Verification:
fig, ax = plt.subplots(figsize=(7,7))
geoData.plot(
column="Cajeros automáticos Número",
cmap="Blues",
legend=True,
edgecolor="black",
linewidth=0.5,
ax=ax
)
# Add title and subtitle
ax.set_title(
"Distribution of ATMs in Jalisco, by municipality",
fontdict={"fontsize": 12},
pad=15
)
ax.text(
0.5, -0.1,
"Source: CNBV, Indicadores de Inclusión Financiera",
transform=ax.transAxes,
fontsize=9,
ha="center",
color="dimgray"
)
ax.set_axis_off()
plt.show()4.3 Calculating indicators of interest
# 1. ATMs per 10,000
geoData["atm_per_10k"] = (
geoData["Cajeros automáticos Número"] / geoData["Población total"] * 10000
)
# 2. Branches per 10,000
geoData["branches_per_10k"] = (
geoData["Sucursales Número"] / geoData["Población total"] * 10000
)Why did we normalize? To make municipalities comparable despite differences in population.
Generate descriptive stats:
geoData[[
"atm_per_10k",
"branches_per_10k",
]].describe()| atm_per_10k | branches_per_10k | |
|---|---|---|
| count | 125.000000 | 125.000000 |
| mean | 4.115009 | 3.495350 |
| std | 2.137648 | 2.879419 |
| min | 0.000000 | 0.000000 |
| 25% | 2.803162 | 1.962323 |
| 50% | 3.904217 | 2.964720 |
| 75% | 4.950648 | 4.195217 |
| max | 15.959145 | 20.629190 |
4.4 Geographic Visualization
A choropleth map colors each municipality according to the value of an indicator—such as “ATMs per 10,000 adults.” For Business Intelligence, this is equivalent to a KPI dashboard adding a dimension of space. We would like to see, for instance:
Clusters of high and low performance
Gaps in service coverage
Regional inequalities or growth corridors
# !pip install folium
import folium
from folium.features import GeoJsonTooltip# Create a base map centered roughly on Jalisco
m = folium.Map(location=[20.67, -103.35],
zoom_start=7,
tiles="cartodb positron",
width=700, # or "70%"
height=700 # or "70%"
)
folium.Choropleth(
geo_data=geoData,
data=geoData,
columns=["CVEGEO", "atm_per_10k"], # municipality code + indicator
key_on="feature.properties.CVEGEO",
fill_color="YlGnBu",
fill_opacity=0.8,
line_opacity=0.5,
legend_name="ATMs per 10 000 people",
).add_to(m)<folium.features.Choropleth at 0x14ecb8e60>
mAdd interactive tooltips:
tooltip = folium.features.GeoJsonTooltip(
fields=["NOMGEO", "atm_per_10k"],
aliases=["Municipality:", "ATMS per 10 000 people:"],
localize=True
)
folium.GeoJson(
geoData,
tooltip=tooltip,
name="ATM rate",
style_function=lambda feature: {
"fillColor": "transparent", # keep choropleth colors underneath
"color": "slategray", # edge color (neutral)
"weight": 0.5, # thinner border
"fillOpacity": 0.0, # no extra fill from this layer
}
).add_to(m)
mWhat do these spatial patterns mean for decision-makers?
4.5 Spatial Descriptive Measures
Moran’s I (moran1950?) is a spatial correlation coefficient. It’s analogous to the Pearson correlation, but instead of comparing two variables, it compares a variable with its spatial lag. That is, the average value of that variable in neighboring areas. If high values are surrounded by high values then there is a positive spatial autocorrelation (clustering). But if high values are surrounded by low values the autocorrelation is negative (dispersion). If there is no clean pattern, there is a random spatial arrangement. Formally,
\[ I = \frac{n}{W}\frac{ \sum_i \sum_j w_{ij}(Y_i - \bar Y)(Y_j - \bar Y)} { \sum_i (Y_i - \bar Y)^2}, \]
where \(n\) is the number of spatial units (e.g., municipalities), \(Y_i\) is the observed value of the variable of interest in unit \(i\), \(\bar{Y}\) is the mean of all values, \(\omega_{ij}\) are weights that represent spatial proximity between units \(i\) and \(j\) (e.g., 1 if they are neighbors, 0 otherwise, or a distance-based weight). \(W = \sum_i \sum_j w_{ij}\) is the sum of all weights.
While the global Moran’s I gives one single value summarizing the entire map, Local Moran’s I decomposes that global statistic into a separate measure for each spatial unit.
\[ I_i = \frac{Y_i - \bar{Y}}{S^2} \sum_j w_{ij}(Y_i=Y_j) \] where: \[ S^2 = \frac{1}{n-1}\sum_i(y_i-\bar{y})^2 \] \(I_i\) is the local contribution of unit \(i\) to the overall spatial autocorrelation. So \(I_i\) is positive if \(i\) and its neighbors deviate from the mean in the same direction (both high or both low), and negative if \(i\) deviates in the opposite direction (high surrounded by low, or vice versa).
| Indicator | Description | Range | Meaning |
|---|---|---|---|
| Global Moran’s I | Overall spatial autocorrelation (global pattern) | -1 to +1 | + → clustering, 0 → random, - → dispersion |
| Local Moran’s I (LISA) | Local clusters of similar/dissimilar values | varies | Identifies “hot spots” and “cold spots” |
# !pip install libpysal esda splot
import libpysal
from esda import Moran, Moran_Local
from libpysal.weights import KNNTo estimate the measures, we need to define neighbors. A simple and intuitive choice is Queen contiguity, where polygons sharing any border or vertex are neighbors. But some units may be disconnected in the contiguity graph. We need to define ‘neighbors’, e.g. based on distance or borders shared. K-nearest neighbors weights (KNN) are the k geometrically closest units (based on centroids), which guarantees everyone has k neighbors (no islands) thus is more flexible for irregular or fragmented geometries.
# Weights matrix based on shared boundaries
w = libpysal.weights.Queen.from_dataframe(geoData)
w.transform = 'r' Global Moran’s I
geoData = geoData.set_index("CVEGEO")w_knn = KNN.from_dataframe(geoData, k=5, use_index=True)
w_knn.transform = "r" # row-standardize weightsprint("Number of observations:", w_knn.n)
print("K neighbors per unit:", w_knn.k)
print("Islands:", w_knn.islands)Number of observations: 125
K neighbors per unit: 5
Islands: []
y = geoData["atm_per_10k"]
moran_knn = Moran(y, w_knn)
print("Moran's I (KNN):", round(moran_knn.I, 3))
print("p-value:", round(moran_knn.p_sim, 4))Moran's I (KNN): 0.109
p-value: 0.017
If \(I > 0\), municipalities with similar values (high or low) tend to be near each other. Clustering. If \(I ≈ 0\), spatial randomness. If \(I < 0\), high values are surrounded by low ones. Dispersion.
Local Moran’s I
| Code | Type | Meaning |
|---|---|---|
| 1 | High–High | Hot spot (high values surrounded by high) |
| 2 | Low–Low | Cold spot (low surrounded by low) |
| 3 | High–Low | Spatial outlier (high surrounded by low) |
| 4 | Low–High | Spatial outlier (low surrounded by high) |
lisa_knn = Moran_Local(y, w_knn)
geoData["LISA_I_knn"] = lisa_knn.Is
geoData["LISA_p_knn"] = lisa_knn.p_sim
geoData["LISA_q_knn"] = lisa_knn.q # quadrant: 1=HH, 2=LL, 3=HL, 4=LH(geoData["LISA_p_knn"] < 0.05).sum()19
def lisa_label(row):
if row["LISA_p_knn"] >= 0.05:
return "Not singnificant"
if row["LISA_q_knn"] == 1:
return "Hi-Hi"
if row["LISA_q_knn"] == 2:
return "Lo-Lo"
if row["LISA_q_knn"] == 3:
return "Hi-Lo"
if row["LISA_q_knn"] == 4:
return "Lo-Hi"
geoData["LISA_label_knn"] = geoData.apply(lisa_label, axis=1)
fig, ax1 = plt.subplots(figsize=(7,7))
# Plot the LISA clusters
geoData.plot(
column="LISA_label_knn",
cmap="Set1",
legend=True,
categorical=True,
edgecolor="slategray",
linewidth=0.25, # use one linewidth value
ax=ax1
)
# Add title
ax1.set_title(
"Local Moran's I – Clusters of Spatial Autocorrelation",
fontdict={"fontsize": 12},
pad=15
)
ax1.set_axis_off()
plt.show()5 Resources
https://pythongis.org/index.html
https://www.kdnuggets.com/2023/08/5-python-packages-geospatial-data-analysis.html
https://www.tomasbeuzen.com/python-for-geospatial-analysis/README.html
https://spatial-analytics.readthedocs.io/en/develop/index.html
https://volaya.github.io/libro-sig/