Effectiveness of Large-Language Models in Recognizing Spatially Intensive Statistical Data

Michael T. Gastner

Singapore Institute of Technology

Co-Authors

Atima Tharatipyakul

Singapore Institute of Technology

Haw Yuh Loh

Simon T. Perrault

Singapore University of Technology and Design


Yong Wang

Nanyang Technological University, Singapore

Objectives

  1. Define and illustrate intensiveness in the context of spatial data.
  2. Elucidate cartographic relevance of intensiveness.
  3. Evaluate the effectiveness of large-language models in recognizing intensiveness.

Overall goal: Rescue the world from invalid choropleth maps!

Introduction: Intensiveness in the Physical Sciences

Intensive quantity:
Physical quantity whose magnitude is independent of the extent of the system.

Examples

  • Temperature (e.g., in Kelvin)
  • Particle density (e.g., in molecules per cubic meter)
  • Pressure (e.g., in Pascal)

Intensiveness in Geospatial Data

Definition (Pebesma & Bivand, 2023)

Intensive Variables:
Variables that do not have values proportional to support [e.g., length or area]: if the area is split, values may vary but on average remain the same.

Example: Population Density

“If an area is split into smaller areas, population density is not split similarly: the sum of population densities for the smaller areas is a meaningless measure, as opposed to the average of the population densities which will be similar to the density of the total area.”
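A minimal numerical sketch of this statement, using made-up population and area figures for two hypothetical subareas of one region (none of these numbers come from the source):

```python
# Hypothetical subareas of one region: (population, area in square kilometers).
subareas = [(1200, 10.0), (800, 10.0)]

total_population = sum(pop for pop, _ in subareas)
total_area = sum(area for _, area in subareas)

# Population is additive: the parts sum to the whole.
# Density is intensive: it has to be recomputed for the whole region.
region_density = total_population / total_area
sub_densities = [pop / area for pop, area in subareas]

sum_of_densities = sum(sub_densities)
mean_of_densities = sum(sub_densities) / len(sub_densities)

print(f"Region density:    {region_density:.1f} per sq. km")
print(f"Sum of densities:  {sum_of_densities:.1f} (meaningless)")
print(f"Mean of densities: {mean_of_densities:.1f}")
```

The two subareas here have equal areas, so the plain mean matches the regional density exactly. With unequal areas, only the area-weighted mean matches exactly, and the plain mean is merely similar.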

Example: Corporate Tax Rate

The corporate tax rate in the U.S. by state is intensive because every subdivision of a state (e.g., county) applies the same rate.

Note: Unlike tax rate, tax revenue is not intensive because the average revenue of a subdivision must be smaller than the revenue of the entire state.

Example: Annual Percentage Change in CO2 Emissions

\[\left(\frac{\textrm{Emissions in current year} } {\textrm{Emissions in previous year}} - 1 \right) \times 100\%\]

This indicator is intensive because the average percentage change over the subdivisions is approximately equal to the percentage change in the entire region.

Note: Neither numerator nor denominator is intensive.
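A short derivation may clarify why the agreement is only approximate. The symbols are introduced here for illustration: \(E_i^{\textrm{prev}}\) and \(E_i^{\textrm{cur}}\) denote the emissions of subdivision \(i\) in the previous and current year, and \(r_i\) is its change expressed as a fraction.

\[r_{\textrm{region}} = \frac{\sum_i E_i^{\textrm{cur}}}{\sum_i E_i^{\textrm{prev}}} - 1 = \sum_i w_i \, r_i, \qquad w_i = \frac{E_i^{\textrm{prev}}}{\sum_j E_j^{\textrm{prev}}}, \qquad r_i = \frac{E_i^{\textrm{cur}}}{E_i^{\textrm{prev}}} - 1\]

The regional change is therefore the average of the subdivision changes weighted by previous-year emissions. The unweighted average is close to it when the subdivisions have similar emissions or similar rates of change.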

Normalization

Dividing an additive quantity, A, by another, B, is referred to as a “normalization” or “standardization” of A.

Examples

  • \(\textrm{Population density} = \frac{\textrm{Population}}{\textrm{Area}}\)

  • \(\textrm{Measles incidence} = \frac{\textrm{Number of people with measles}}{\textrm{Population in 100,000}}\)

  • \(\textrm{Inflation rate} = \left(\frac{\textrm{Price index in current year}}{\textrm{Price index in previous year}} - 1\right) \times 100\%\)

However, an intensive quantity does not always have to be a ratio (e.g., median income).

Cartographic Relevance of Intensiveness

[Figure: recommended visual variables for data associated with geographic enumeration units. Labels: statistical variable (nominal, ordered, ordinal, interval, ratio; qualitative, quantitative; intensive, additive); recommended visual variable (hue; lightness for magnitude, hue for +/-; size for magnitude, hue for +/-).]

Choropleth Maps Require Intensive Quantities

Map Types for Additive Quantities

Automatic Advice for Map Design

A “grammar checker” for maps could help to avoid many common mistakes. GeoLinter by Fei et al. (2024) is a promising step in this direction. It performs automatic checks for choropleth maps, for example:

  • Number of data classes: between 3 and 7?
  • Map projection: consistent with Šavrič et al.’s (2016) projection wizard?

However, GeoLinter does not automate intensiveness checks.
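A rule such as the class-count check listed above could be sketched as follows. This is a hypothetical illustration, not GeoLinter's actual implementation; the function name `check_class_count` and its parameters are assumptions made for this example.

```python
def check_class_count(class_breaks, min_classes=3, max_classes=7):
    """Return a warning string if the number of data classes is outside
    the recommended range, or None if the check passes.

    class_breaks lists the class boundaries, so n breaks define n - 1 classes.
    """
    n_classes = len(class_breaks) - 1
    if min_classes <= n_classes <= max_classes:
        return None
    return (f"Map uses {n_classes} data classes; "
            f"between {min_classes} and {max_classes} is recommended.")


# Four classes pass the check; a single class triggers a warning.
print(check_class_count([0, 10, 20, 30, 40]))
print(check_class_count([0, 50]))
```

An intensiveness check cannot be written as a simple numerical rule of this kind, because it depends on what the mapped variable means.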

GeoLinter User Interface

Mathematical Approaches

Scheider & Huisjes (2019) investigated a machine-learning model for recognizing intensiveness in spatial data. They used a support vector machine with a radial basis function kernel and various statistical predictors, for example:

  • Intercept and slope of linear regression
  • Measures of spatial autocorrelation (e.g., Moran's I and Getis-Ord G)

Tested on 519 data sets from the Dutch Central Bureau of Statistics, the model achieved an accuracy of 95%.
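To illustrate one of the predictors listed above, the following is a minimal sketch of global Moran's I in plain Python. It assumes a binary adjacency matrix as spatial weights. It is not the code used by Scheider & Huisjes, and the function name `morans_i` is chosen for this example.

```python
def morans_i(values, weights):
    """Global Moran's I for a list of values and a square weight matrix.

    weights[i][j] is the spatial weight between units i and j
    (here assumed to be 1 for neighbors and 0 otherwise).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    total_weight = sum(sum(row) for row in weights)
    # Weighted sum of cross-products of deviations from the mean.
    cross = sum(weights[i][j] * dev[i] * dev[j]
                for i in range(n) for j in range(n))
    sum_sq = sum(d * d for d in dev)
    return (n / total_weight) * (cross / sum_sq)


# Four units in a row; each unit neighbors the units next to it.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
print(morans_i([1, 2, 3, 4], chain))    # smooth trend: positive
print(morans_i([1, -1, 1, -1], chain))  # alternating values: negative
```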

Large-Language Models (LLMs)

Statistical measures only provide circumstantial evidence for intensiveness. In principle, the verbal description of the data should leave no need for guessing.

We investigated whether LLMs can recognize intensiveness and explain their decisions to the user.

Case Study: Selected LLMs

We tested three LLMs available through the Ollama application programming interface:

  • Gemma (version 7B) from Google
  • Llama 3 (8B) from Meta
  • Mistral (7B), an EU-based open-source model
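A minimal sketch of how such models can be queried through Ollama's REST interface, using only the Python standard library. The endpoint URL is Ollama's default for a local installation, and the model tags in `MODELS` are assumptions about how the three models are named locally; the exact setup used in the study is not stated in the source.

```python
import json
import urllib.request

# Default endpoint of a locally running Ollama server (assumption).
OLLAMA_URL = "http://localhost:11434/api/generate"

# Assumed local tags for the three tested models.
MODELS = ["gemma:7b", "llama3:8b", "mistral:7b"]


def build_payload(model, prompt):
    """Build the JSON body for a single, non-streamed completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def query_ollama(model, prompt, url=OLLAMA_URL):
    """Send the prompt to the server and return the generated text."""
    data = json.dumps(build_payload(model, prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))["response"]
```

Calling `query_ollama("llama3:8b", prompt)` requires a running Ollama server with the model already downloaded.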

Case Study: Data Sets

1,326 indicators from the World Bank Data Catalog.

For ground truth data, we manually reviewed each indicator and classified it as intensive (1,006 indicators) or non-intensive (320).

155 additional indicators were excluded as unclassifiable (e.g., financial data in local currency units).

Example Data Set

GDP per unit of energy use (PPP $ per kg of oil equivalent)

     Country Name          Indicator Value
1    Afghanistan
2    Albania               13.925878
3    Algeria               11.148744
4    American Samoa
5    Andorra
6    Angola                14.790826
7    Antigua and Barbuda
8    Argentina             9.679852
9    Armenia               9.917423
10..216
217  Zimbabwe
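Many countries have no value for this indicator, so any sample of the data has to skip missing entries. The following sketch selects the first five rows with a value from the excerpt above. The CSV layout and the function name `first_non_missing` are assumptions for illustration, not the format of the World Bank download.

```python
import csv
import io

# Excerpt of the example data set; empty fields mark missing values.
raw = """Country Name,Indicator Value
Afghanistan,
Albania,13.925878
Algeria,11.148744
American Samoa,
Andorra,
Angola,14.790826
Antigua and Barbuda,
Argentina,9.679852
Armenia,9.917423
"""


def first_non_missing(rows, n=5):
    """Return the first n (country, value) pairs that have a value."""
    kept = []
    for row in rows:
        value = row["Indicator Value"].strip()
        if value:
            kept.append((row["Country Name"], float(value)))
        if len(kept) == n:
            break
    return kept


sample = first_non_missing(csv.DictReader(io.StringIO(raw)))
for country, value in sample:
    print(f"{country}: {value}")
```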

Example Indicator Description

The World Bank provides a “Long Description” in addition to the indicator name, for example:

GDP per unit of energy use (PPP $ per kg of oil equivalent):

“GDP per unit of energy use is the PPP GDP per kilogram of oil equivalent of energy use. PPP GDP is gross domestic product converted to current international dollars using purchasing power parity rates based on the 2017 ICP [International Comparison Program] round. An international dollar has the same purchasing power over GDP as a U.S. dollar has in the United States.”

Start of Prompt for LLM

“Act as an expert in geospatial data science. Analyze the input that provides information about a quantity. Your task is to identify the unit of measurement of the quantity, identify whether the quantity is intensive (TRUE) or is not intensive (FALSE), and explain your reason to a beginner.”

Tested Intensiveness Definitions and Examples

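Definitions by IUPAC (2006; see “Introduction: Intensiveness in the Physical Sciences”), Pebesma & Bivand (2023; see “Intensiveness in Geospatial Data”), and the following (labeled “Chen et al.” in the results):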

An intensive quantity is independent of the size of the analysis unit. For intensive data, the mean does not change with different analysis units and the variance declines when the analysis unit gets coarser. Examples include temperature and elevation data.

Data Input for LLM

We tested five different types of input for each indicator:

Input  Description
I0     Indicator title only
I1     I0 and the hint: “If the unit of measurement of the quantity includes a word such as ‘per’, ‘%’, or ‘percent’, the quantity is often intensive.”
I2     I1 and the World Bank’s indicator description
I3     I2 and the first five rows of non-missing data
I4     I3 and the slope, alongside the confidence interval, of a regression line against the logarithm of the area
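The regression statistic in I4 might be computed along the following lines. This is a sketch under stated assumptions: the indicator value is regressed on the natural logarithm of the area by ordinary least squares, and the confidence interval uses a normal approximation instead of the t distribution. The source does not specify these details, and the function name `slope_with_ci` is hypothetical.

```python
import math
from statistics import NormalDist


def slope_with_ci(areas, values, level=0.95):
    """Least-squares slope of values against log(area), with an
    approximate confidence interval (normal approximation).

    Requires at least three observations.
    """
    x = [math.log(a) for a in areas]
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(values) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, values))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    # Residual sum of squares and standard error of the slope.
    rss = sum((yi - (intercept + slope * xi)) ** 2
              for xi, yi in zip(x, values))
    std_err = math.sqrt(rss / (n - 2) / sxx)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope, (slope - z * std_err, slope + z * std_err)


# Hypothetical areas (sq. km) and a value that does not depend on area.
areas = [1_000, 10_000, 100_000, 1_000_000]
slope, (low, high) = slope_with_ci(areas, [5.0, 5.0, 5.0, 5.0])
print(slope, low, high)
```

A confidence interval that includes zero is consistent with a quantity that does not depend on area, which is circumstantial evidence for intensiveness.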

Results

Best and worst combinations of LLM, intensiveness definition, and data input, ranked by F1 score:

Rank   Model    Definition   Input  F1     Precision  Recall  Accuracy
1      Llama 3  IUPAC        I4     0.961  1.000      0.925   0.944
2      Llama 3  IUPAC        I2     0.953  1.000      0.909   0.932
3      Llama 3  IUPAC        I3     0.950  0.995      0.909   0.929
4..42
43     Mistral  IUPAC        I0     0.087  1.000      0.045   0.286
44     Llama 3  Chen et al.  I0     0.020  1.000      0.010   0.259
45     Llama 3  IUPAC        I0     0.010  1.000      0.005   0.256
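For reference, the four metrics can be computed from the counts of true and false positives and negatives using their standard definitions, with “intensive” as the positive class. The counts in the usage example are made up for illustration and are not results from the study.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": accuracy,
    }


# Hypothetical counts: 90 intensive indicators recognized, 10 missed,
# and all 50 non-intensive indicators classified correctly.
metrics = classification_metrics(tp=90, fp=0, fn=10, tn=50)
print(metrics)

# Consistency check against the top-ranked row of the table.
print(round(f1_score(1.000, 0.925), 3))
```

A precision of 1.000 combined with a very low recall, as in the bottom rows, means that the model almost never answered “intensive” but was correct whenever it did.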

Example: True Negative Based on Wrong Reasoning

Agricultural land (sq. km)

“The unit of measurement includes ‘per’ (i.e., square kilometers), which is a common indicator of an extensive quantity. Additionally, the slope of the regression equation is not zero, indicating that the quantity is dependent on the area. This suggests that agricultural land is an extensive quantity, as its magnitude changes with the extent of the system.”

Example: True Positive Based on Wrong Reasoning

Fertilizer consumption (% of fertilizer production)

“The unit of measurement includes the word ‘per’, which is a common indicator of an intensive quantity. Additionally, the definition of the quantity mentions ‘per unit of arable land’, further supporting that it is an intensive quantity.”

Example: False Negative Based on Wrong Reasoning

Forest area (% of land area)

“The unit of measurement includes a word such as ‘per’, which indicates that the quantity is extensive rather than intensive. Additionally, the fact that the quantity is measured in terms of percentage of land area suggests that it depends on the extent of the system (i.e., the size of the country). This is consistent with the definition of an extensive quantity.”

Summary: Intensiveness Detection

  • In tandem with traditional machine learning, LLMs might be used for “grammar checking” of thematic maps.
  • Best combinations of LLMs and prompts achieved high F1 scores.
  • Including more details about the input data in the prompt pays off in terms of model performance.
  • Explanations provided by LLMs sometimes reveal incorrect reasoning. Better prompt engineering is required.