---
title: "Module I: The Language of Spatial Data"
subtitle: "A Foundation for Spatial Statistics and Data Science"
author: "Abdisalam Hassan Muse, PhD"
date: "`r Sys.Date()`"
output:
  html_document:
    toc: true
    toc_float: true
    theme: united
    highlight: tango
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warning = FALSE)
# Load necessary packages for the entire document
library(sf)
library(ggplot2)
library(dplyr)
```

## 1. Why Spatial Data? The Importance of 'Where'

As data scientists and statisticians, we are experts at extracting insights from data. We analyze trends, build predictive models, and uncover hidden patterns. Traditionally, we focus on the **'what'** (the variables) and the **'when'** (the time). Spatial statistics adds a critical third dimension: the **'where'**.

Whether you are tracking a disease outbreak, planning aid distribution, analyzing crime patterns, or modeling climate change, the location of your data points is not just another variable; it's a fundamental part of the story. Observations that are close together tend to be more alike than observations that are far apart. This tendency, known as **spatial autocorrelation**, is the cornerstone of spatial analysis.

This module will provide you with the foundational vocabulary and concepts to understand, classify, and work with spatial data. We will learn the different "shapes" spatial data can take and, most importantly, how to choose the right type for the right question.

## 2. The Two Foundational Models: Vector vs. Raster

Before diving into statistical types, we must understand the two fundamental ways computers store geographic information. Nearly all of the spatial data you will encounter is stored in one of these two models.

### 2.1 Vector Data: Drawing with Coordinates

Think of vector data as **drawing with coordinates**. It represents the world using discrete, precise geometric objects. Each object has associated attributes (e.g., the name of a hospital, the type of a road). There are three types of vector geometries:

*   **Points:** A single coordinate pair (x, y) representing a specific location. For geographic coordinates, x is the longitude and y is the latitude.
    *   **When to use:** For features that have a location but no meaningful area or length at your scale of analysis.
    *   **Somalia/Somaliland Example:** The location of a specific hospital in Mogadishu, a water well in a rural village, or an IDP (Internally Displaced Persons) camp. Each is a distinct, single point.

*   **Lines (or Polylines):** A sequence of connected coordinate pairs.
    *   **When to use:** For features that have length but not area, like networks or boundaries.
    *   **Somalia/Somaliland Example:** The path of the Jubba River, the road network connecting Hargeisa and Berbera, or the administrative border between Puntland and Galmudug.

*   **Polygons:** A series of connected coordinate pairs that form a closed shape, representing an area.
    *   **When to use:** For features that have a defined boundary and area.
    *   **Somalia/Somaliland Example:** The boundary of the Banaadir region, the area of a national park, or the footprint of a specific neighborhood in Burao.

**Vector data is ideal when you are interested in the features themselves and their relationships.**
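As a minimal sketch of the three geometry types, the chunk below builds one point, one line, and one polygon with `sf` and bundles them with an attribute column. The coordinates are rough, illustrative values rather than surveyed locations, and the names `demo_features` and `feature` are made up for this example.

```{r vector_geometries_sketch}
library(sf)

# Rough, illustrative coordinates in (longitude, latitude) order
hospital <- st_point(c(45.34, 2.05))              # a point in Mogadishu
road     <- st_linestring(rbind(c(44.06, 9.56),   # near Hargeisa
                                c(45.01, 10.44))) # near Berbera

# A polygon ring must close: the first and last vertices are identical
district <- st_polygon(list(rbind(
  c(45.30, 2.00), c(45.40, 2.00), c(45.40, 2.10), c(45.30, 2.10), c(45.30, 2.00)
)))

# Bundle geometries and attributes into one sf object (WGS 84, EPSG:4326)
demo_features <- st_sf(
  feature  = c("hospital", "road", "district"),
  geometry = st_sfc(hospital, road, district, crs = 4326)
)

st_geometry_type(demo_features) # POINT, LINESTRING, POLYGON
st_dimension(demo_features)     # 0, 1, 2: location only, length, area
```

The same real-world feature can move between these types as the scale of analysis changes: a city is a point on a national map and a polygon on a neighborhood map.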

### 2.2 Raster Data: A Grid of Values

Think of raster data as a **digital photograph or a grid**. It represents the world as a continuous surface, divided into a matrix of equally sized cells (or pixels). Each cell stores one value per layer (or band), describing the characteristic measured at that location.

*   **When to use:** For phenomena that vary continuously across a landscape, with no clear boundaries.
*   **Somalia/Somaliland Examples:**
    *   **Elevation Data (DEM):** A grid where each cell's value is its elevation above sea level. This is crucial for understanding topography and flood risk.
    *   **Climate Data:** A grid where each cell's value could be the average monthly rainfall, temperature, or a vegetation index (like NDVI) to monitor drought conditions.
    *   **Satellite Imagery:** A picture from space where each pixel stores brightness values in one or more spectral bands, used to identify land cover (urban, farm, desert).

**Raster data is ideal when you are studying how a measurement varies over a surface.**
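To make the "matrix of values" idea concrete, here is a small base R sketch of a raster as a plain matrix plus the georeferencing needed to place it on the map. The elevations, the corner coordinates, the cell size, and the helper `value_at()` are all invented for illustration; a real workflow would read a GeoTIFF with a raster package rather than build the grid by hand.

```{r raster_grid_sketch}
# Fictional elevations (metres); rows run north to south, columns west to east
elev <- matrix(
  c(1200, 1350, 900, 400,
    1100, 1250, 800, 300,
     950, 1000, 600, 150),
  nrow = 3, byrow = TRUE
)

# Georeferencing: top-left corner and cell size, in degrees (assumed values)
x_min    <- 44.0
y_max    <- 11.0
cell_deg <- 0.5

# Look up the cell that contains a given (longitude, latitude)
value_at <- function(lon, lat) {
  col <- floor((lon - x_min) / cell_deg) + 1
  row <- floor((y_max - lat) / cell_deg) + 1
  elev[row, col]
}

value_at(44.7, 10.8) # falls in row 1, column 2 of the grid
```

Every location inside a cell returns the same value, which is why the cell size (the resolution) limits how much detail a raster can show.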

| Feature | Vector Data | Raster Data |
| :--- | :--- | :--- |
| **Representation** | Discrete points, lines, polygons | Continuous grid of cells (pixels) |
| **Structure** | Coordinates and attributes | Matrix of values |
| **Best For** | Administrative boundaries, roads, locations | Elevation, temperature, satellite imagery |
| **Example** | A shapefile of Somalia's regions | A GeoTIFF file of rainfall in the Horn of Africa |
| **Key Question** | *What* is *where*? (e.g., a hospital) | *How much* of something is *everywhere*? (e.g., elevation) |

---

## 3. The Three Classical Types of Spatial Data

Building on the vector/raster models, spatial statisticians classify data into three main types based on the nature of the spatial process being observed. Understanding this classification is key to choosing the correct statistical methods later on.

### 3.1 Areal (or Lattice) Data

**What it is:** Areal data consists of values that are **aggregated over polygons**. The domain is a fixed set of areas, and we have a summary statistic (like a count, average, or rate) for each one. This is perhaps the most common type of spatial data in public health and social sciences.

**The Key Question it Answers:** "How does a value or count vary from area to area, and are neighboring areas more similar than distant ones?"

**When to Use It and Why:**

*   **Privacy:** When individual-level data (like patient addresses) is too sensitive to share, aggregating to a district or regional level protects confidentiality.
*   **Administrative Data:** Data is often collected and reported by administrative units (e.g., census bureaus, ministries of health).
*   **Stable Rates:** Calculating rates (e.g., disease incidence) requires a denominator (population), which is usually available at an areal level.
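The role of the denominator can be shown with a few lines of base R. The sketch below uses three hypothetical districts with fictional case counts and populations; the column names are assumptions chosen for this example.

```{r rate_vs_count_sketch}
# Fictional counts and populations for three hypothetical districts
districts <- data.frame(
  district   = c("A", "B", "C"),
  cases      = c(150, 320, 210),
  population = c(50000, 400000, 150000)
)

# Incidence per 100,000 people: cases divided by the population at risk
districts$rate_per_100k <- districts$cases / districts$population * 1e5

# Order by rate to see how the ranking differs from the ranking by count
districts[order(-districts$rate_per_100k), ]
```

District B reports the most cases but has the lowest rate, so a map of raw counts and a map of rates can point to different areas.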

**Somalia/Somaliland Example: Mapping Malnutrition Cases by Region**

Imagine we have the number of severe acute malnutrition (SAM) cases reported by health facilities in each region of Somaliland. We want to see if there are regional hotspots.

```{r areal_example}
# Load Somalia administrative boundaries (we'll use this as our base)
# Ensure "som_admbnda_adm1_ocha_20250108.shp" is in your working directory
som_adm1 <- st_read("som_admbnda_adm1_ocha_20250108.shp")

# Filter for Somaliland regions
somaliland_regions <- som_adm1 %>%
  filter(ADM1_EN %in% c("Awdal", "Woqooyi Galbeed", "Togdheer", "Sanaag", "Sool"))

# Create some fictional malnutrition data to join
malnutrition_data <- data.frame(
  ADM1_EN = c("Awdal", "Woqooyi Galbeed", "Togdheer", "Sanaag", "Sool"),
  SAM_cases = c(150, 320, 210, 180, 250)
)

# Join the data to the spatial object
somaliland_map_data <- left_join(somaliland_regions, malnutrition_data, by = "ADM1_EN")

# Plot the areal data
ggplot(data = somaliland_map_data) +
  geom_sf(aes(fill = SAM_cases)) +
  geom_sf_text(aes(label = ADM1_EN), size = 3) +
  scale_fill_viridis_c(name = "SAM Cases") +
  ggtitle("Areal Data: Malnutrition Cases by Region in Somaliland") +
  theme_void()
```
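The question of whether neighboring areas are more similar than distant ones is commonly summarized with Moran's I, a statistic this module has not introduced yet. The sketch below computes it by hand on a toy 4 x 4 lattice with rook (shared-edge) neighbors and binary weights. The grid, the two sets of values, and the helper `morans_i()` are illustrative assumptions and are not part of the Somaliland example.

```{r morans_i_sketch}
# Toy lattice: 16 square cells indexed by column and row
cells <- expand.grid(col = 1:4, row = 1:4)

# Rook neighbors share an edge, i.e. their grid indices differ by one step
W <- (as.matrix(dist(cells, method = "manhattan")) == 1) * 1

# Moran's I with binary weights:
# (n / sum(W)) * sum_ij W_ij * z_i * z_j / sum_i z_i^2, where z = x - mean(x)
morans_i <- function(x, W) {
  z <- x - mean(x)
  (length(x) / sum(W)) * sum(W * outer(z, z)) / sum(z^2)
}

gradient     <- cells$col + cells$row        # smooth trend: neighbors alike
checkerboard <- (cells$col + cells$row) %% 2 # alternating: neighbors differ

morans_i(gradient, W)     # positive: similar values sit next to each other
morans_i(checkerboard, W) # negative: dissimilar values sit next to each other
```

Values near zero would indicate no spatial structure. With only five regions, as in the map above, such a statistic would be very unstable, which is why the toy lattice is used here instead.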

### 3.2 Geostatistical Data

**What it is:** Geostatistical data consists of measurements of a **continuous phenomenon taken at specific point locations**. We assume the underlying phenomenon (e.g., temperature, pollution) exists everywhere, but we can only afford to sample it at a finite number of sites.

**The Key Question it Answers:** "Given these sample points, can we predict the value of the phenomenon at unsampled locations?" This task is called spatial interpolation; kriging is the classic geostatistical method for it.

**When to Use It and Why:**

*   **Environmental Monitoring:** It's impossible to measure air quality or soil contamination everywhere. We use monitoring stations (points) to sample the continuous field.
*   **Resource Exploration:** Geologists drill at specific locations (points) to infer the extent of an underground oil reserve (a continuous field).
*   **Continuous Processes:** When the quantity of interest is defined at every location in the study area, even though it is measured at only a few.

**Somalia/Somaliland Example: Measuring Water Salinity at Wells**

Imagine we are monitoring the salinity of groundwater at several wells along the Shabelle River to check for saltwater intrusion. We want to create a continuous map of salinity from these point measurements.

```{r geostatistical_example}
# Create a fictional dataset of well locations and salinity measurements
wells_data <- data.frame(
  well_id = c("W1", "W2", "W3", "W4", "W5"),
  longitude = c(45.0, 45.2, 45.5, 45.8, 46.1),
  latitude = c(2.2, 2.5, 2.9, 3.2, 3.5),
  salinity_ppm = c(800, 950, 1100, 1500, 2200) # Parts Per Million
)

# Convert the data frame to a spatial sf object
wells_sf <- st_as_sf(wells_data, coords = c("longitude", "latitude"), crs = 4326)

# Plot the points on the map of Somalia
ggplot() +
  geom_sf(data = som_adm1, fill = "khaki", color = "grey") + # Base map
  geom_sf(data = wells_sf, aes(color = salinity_ppm), size = 4) +
  scale_color_viridis_c(name = "Salinity (PPM)") +
  ggtitle("Geostatistical Data: Water Salinity at Wells") +
  theme_bw()
```
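Predicting at an unsampled location can be sketched with inverse distance weighting (IDW), a simple deterministic interpolator. This is not kriging, which additionally models the spatial covariance; it is used here only because it fits in a few lines. The target location, the power of 2, and the names `wells_demo`, `target`, and `salinity_hat` are assumptions for this example.

```{r idw_sketch}
library(sf)

# Same fictional wells as above, rebuilt so this chunk runs on its own
wells_demo <- st_as_sf(
  data.frame(
    longitude    = c(45.0, 45.2, 45.5, 45.8, 46.1),
    latitude     = c(2.2, 2.5, 2.9, 3.2, 3.5),
    salinity_ppm = c(800, 950, 1100, 1500, 2200)
  ),
  coords = c("longitude", "latitude"), crs = 4326
)

# A hypothetical unsampled location between wells W2 and W3
target <- st_sfc(st_point(c(45.35, 2.70)), crs = 4326)

# Distances from every well to the target, in metres
d <- as.numeric(st_distance(wells_demo, target))

# Inverse distance weights: nearer wells count more (power = 2 is an assumption)
w <- 1 / d^2
salinity_hat <- sum(w * wells_demo$salinity_ppm) / sum(w)
salinity_hat
```

Because the prediction is a weighted average, it always lies between the smallest and largest observed values. Repeating the calculation over a grid of target locations would produce the continuous salinity surface described above.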

### 3.3 Point Pattern Data

**What it is:** In a point pattern, the **locations of the events are the data themselves**. We are not interested in a value *at* the point, but rather in the spatial arrangement of the points. The underlying process is what generates the locations.

**The Key Question it Answers:** "Are the events clustered, randomly distributed, or regularly spaced? What process might be generating this pattern?"

**When to Use It and Why:**

*   **Epidemiology:** To see if disease cases are clustered around a source of infection (e.g., a contaminated water pump).
*   **Criminology:** To identify crime hotspots.
*   **Ecology:** To study how plants or animal territories are arranged in space.

**Somalia/Somaliland Example: Mapping Security Incidents in Mogadishu**

An NGO is analyzing the locations of reported security incidents in the Banaadir region (which contains Mogadishu, and is spelled "Banadir" in the boundary file) to understand whether they are concentrated in specific areas.

```{r point_pattern_example}
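# Get the polygon for the Banadir region (spelled "Banadir" in the ADM1_EN field)
banadir_region <- som_adm1 %>% filter(ADM1_EN == "Banadir")
stopifnot(nrow(banadir_region) == 1) # fail early if the name did not match

# Generate 100 points uniformly at random *within* the Banadir polygon to simulate incidents.
# Uniform sampling produces no genuine clustering; in a real analysis these
# would be the recorded coordinates of incidents.
set.seed(123) # for reproducibility
incident_points <- st_sample(banadir_region, size = 100)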

# Plot the point pattern
ggplot() +
  geom_sf(data = banadir_region, fill = "grey80") +
  geom_sf(data = incident_points, color = "red", alpha = 0.6, size = 2) +
  ggtitle("Point Pattern: Simulated Security Incidents in Banadir") +
  theme_bw()
```
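One simple way to ask whether points are clustered, random, or regular is the nearest-neighbor ratio of Clark and Evans: the observed mean nearest-neighbor distance divided by the value expected under complete spatial randomness. The sketch below is a rough base R version on the unit square, with no edge correction. Both simulated patterns, the hotspot centers, and the helper `nn_ratio()` are assumptions made for illustration.

```{r nearest_neighbor_sketch}
set.seed(42) # for reproducibility
n <- 100

# Pattern 1: points placed uniformly at random in the unit square
random_pts <- cbind(runif(n), runif(n))

# Pattern 2: points scattered tightly around four hypothetical hotspots
centers       <- rbind(c(0.25, 0.25), c(0.25, 0.75), c(0.75, 0.25), c(0.75, 0.75))
clustered_pts <- centers[rep(1:4, each = n / 4), ] +
  matrix(rnorm(2 * n, sd = 0.02), ncol = 2)

# Observed mean nearest-neighbor distance relative to the value expected
# under complete spatial randomness, 1 / (2 * sqrt(intensity))
nn_ratio <- function(pts, area = 1) {
  d <- as.matrix(dist(pts))
  diag(d) <- Inf # ignore each point's distance to itself
  observed <- mean(apply(d, 1, min))
  expected <- 1 / (2 * sqrt(nrow(pts) / area))
  observed / expected
}

nn_ratio(random_pts)    # close to 1 for a random pattern
nn_ratio(clustered_pts) # well below 1 for a clustered pattern
```

Ratios above 1 would suggest a regular, evenly spaced pattern. Applying the same idea to real incident data would also require the area of the study region and some care about points near its boundary.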

## 4. Summary and A Glimpse Ahead

Understanding these data types is the first and most important step in any spatial analysis. Choosing the wrong classification can lead to using inappropriate methods and drawing incorrect conclusions.

| Data Type | What it Represents | Key Question | Somalia/Somaliland Example | Common Goal |
| :--- | :--- | :--- | :--- | :--- |
| **Areal** | Aggregated values in polygons | How do values vary by area? | Cholera cases per district | Disease mapping, hotspot detection |
| **Geostatistical** | Samples of a continuous field | Can we predict values everywhere? | Rainfall at weather stations | Interpolation (e.g., kriging), surface creation |
| **Point Pattern** | Locations of events | Is there clustering? | Locations of IDP camps | Cluster analysis, density estimation |

In the upcoming modules, we will learn how to use R packages like `sf` and `terra` to manipulate these data types and then apply powerful statistical techniques to answer these key questions. We will also touch upon more advanced concepts like **spatio-temporal data**, which tracks these patterns over time—for example, mapping the spread of a measles outbreak across districts over several months.