Spatial vector analysis

Introduction

This tutorial introduces to spatial analysis with vector data. Let’s start with types of spatial data.

There are mainly two types of spatial data: vectors and rasters.

Spatial vector Data:

  • Spatial vector data represents geographic features using points, lines, and polygons. It stores location information as coordinates and topological relationships among these features.
  • Each feature in a spatial vector dataset is associated with attributes that provide additional information about the feature. Spatial vector data is typically used to represent discrete and distinct objects on the Earth’s surface, such as buildings, roads, rivers, and administrative boundaries.
  • Spatial vector data often includes topological relationships, which define the connectivity and adjacency between features.
  • Topology ensures data integrity, consistency, and enables spatial analysis operations like overlay and network analysis.

Spatial raster data

  • Spatial raster data represents geographic features as a grid of cells or pixels, where each cell represents a value or attribute for a particular location.
  • Raster data is commonly used to represent continuous phenomena that can be measured or observed across a continuous surface, such as elevation, temperature, precipitation, or satellite imagery.
  • Raster data is organized into a grid of regularly spaced cells or pixels. Each cell corresponds to a specific location on the Earth’s surface and contains a value representing an attribute or measurement.

This tutorial introduces you to handling and analyzing spatial vector data using the sf package.

From the creators of sf package,

Simple features or simple feature access refers to a formal standard (ISO 19125-1:2004) that describes how objects in the real world can be represented in computers, with emphasis on the spatial geometry of these objects. It also describes how such objects can be stored in and retrieved from databases, and which geometrical operations should be defined for them.

Libraries

library(sf)
library(tidyverse)

Projections

We cannot perform spatial operations on spatial dataframes with different Coordinate Reference Systems (CRS), which is made up of Coordinate System and Datum.

Coordinate System: The coordinate system defines how locations are represented in terms of coordinates. It specifies the units and scales used to measure distances, angles, and areas. Common coordinate systems include:

  • Geographic Coordinate System (GCS): It uses latitude and longitude coordinates to represent locations on a curved surface, such as the Earth. GCS is based on a spherical or ellipsoidal model of the Earth, where latitude represents north-south position and longitude represents east-west position.

  • Projected Coordinate System (PCS): It uses Cartesian coordinates (x, y) to represent locations on a 2D flat plane. PCS is derived from a GCS by projecting the curved Earth surface onto a flat surface using various mathematical algorithms. Different projections can preserve different properties like area, shape, or distance.

Datum: The datum defines the reference frame or starting point for measuring positions on the Earth’s surface. It specifies the origin, orientation, and shape of the Earth. Datum information is essential for accurate positioning and spatial data integration. Common datums include WGS84 (World Geodetic System 1984) and NAD83 (North American Datum 1983).

Projections are character strings in proj4 format. For example for UTM zone 33N (EPSG:32633) the string would be:

+proj=utm +zone=33 +ellps=WGS84 +datum=WGS84 +units=m +no_defs

We can use st_transform to reproject sf objects.

Let’s explore the projections of a dataset of Homicides in Philly.

philly_sf <- st_read("./philly/PhillyHomicides/PhillyHomicides.shp")
Reading layer `PhillyHomicides' from data source 
  `C:\Users\kshit\OneDrive\Desktop\PLAN372\materials\philly\PhillyHomicides\PhillyHomicides.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 3883 features and 8 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -75.26809 ymin: 39.87503 xmax: -74.95874 ymax: 40.13086
Geodetic CRS:  WGS 84

Check CRS

st_crs(philly_sf)
Coordinate Reference System:
  User input: WGS 84 
  wkt:
GEOGCRS["WGS 84",
    DATUM["World Geodetic System 1984",
        ELLIPSOID["WGS 84",6378137,298.257223563,
            LENGTHUNIT["metre",1]]],
    PRIMEM["Greenwich",0,
        ANGLEUNIT["degree",0.0174532925199433]],
    CS[ellipsoidal,2],
        AXIS["latitude",north,
            ORDER[1],
            ANGLEUNIT["degree",0.0174532925199433]],
        AXIS["longitude",east,
            ORDER[2],
            ANGLEUNIT["degree",0.0174532925199433]],
    ID["EPSG",4326]]

EPSG assigns numeric codes to CRS. provides a standardized numbering system and database for coordinate reference systems and transformations, facilitating the accurate definition, communication, and integration of spatial data across different geospatial applications and software.

st_crs(philly_sf)$epsg
[1] 4326
st_crs(philly_sf)$proj4string
[1] "+proj=longlat +datum=WGS84 +no_defs"

The proj4string: “+proj=longlat +datum=WGS84 +no_defs” indicates that the coordinate reference system (CRS) of our data is in longitude and latitude using the WGS84 datum. The “proj=longlat” signifies a geographic coordinate system in which locations are represented by pairs of latitude and longitude values. The “datum=WGS84” specifies that the data is based on the World Geodetic System 1984, a widely used global datum for GPS and mapping applications. The “+no_defs” part means that there are no additional definitions or custom parameters defined.

Reprojecting data involves converting the coordinates of our spatial data from one CRS to another. It is done to ensure that data from different sources or with different CRSs can be accurately combined, analyzed, and visualized together in a common coordinate system.

Let’s reproject our data from its current CRS (WGS84) to a Web Mercator (EPSG 3857).

Web Mercator is widely used in web mapping applications, including popular mapping services like Google Maps and OpenStreetMap. Here’s why you might consider transforming data to EPSG 3857:

  • Compatibility: EPSG 3857 is supported by many web mapping platforms and libraries. If you’re working with web-based mapping applications or need to integrate our data with other online mapping services, using EPSG 3857 ensures compatibility and seamless overlay with other data sources.

  • Performance: EPSG 3857, based on the Mercator projection, uses a simple cylindrical projection that preserves angles and is optimized for fast rendering of maps on computer screens. This makes it suitable for interactive web mapping applications that require quick rendering and zooming capabilities.

  • Visual consistency: Many web mapping platforms, such as Google Maps, use EPSG 3857 as their default CRS. Transforming our data to EPSG 3857 ensures visual consistency when overlaying our data with these widely used basemaps.

Note that while EPSG 3857 is suitable for web mapping and visualization purposes, it introduces distortion and is not appropriate for precise distance or area calculations. For accurate measurements or specialized analysis, it is recommended to use an appropriate projected CRS that suits our specific requirements.

# Reproject to a new CRS (example: EPSG 3857)
philly_sf_reprojected <- st_transform(philly_sf, crs = 3857)

# Check the new CRS
st_crs(philly_sf_reprojected)
Coordinate Reference System:
  User input: EPSG:3857 
  wkt:
PROJCRS["WGS 84 / Pseudo-Mercator",
    BASEGEOGCRS["WGS 84",
        ENSEMBLE["World Geodetic System 1984 ensemble",
            MEMBER["World Geodetic System 1984 (Transit)"],
            MEMBER["World Geodetic System 1984 (G730)"],
            MEMBER["World Geodetic System 1984 (G873)"],
            MEMBER["World Geodetic System 1984 (G1150)"],
            MEMBER["World Geodetic System 1984 (G1674)"],
            MEMBER["World Geodetic System 1984 (G1762)"],
            MEMBER["World Geodetic System 1984 (G2139)"],
            ELLIPSOID["WGS 84",6378137,298.257223563,
                LENGTHUNIT["metre",1]],
            ENSEMBLEACCURACY[2.0]],
        PRIMEM["Greenwich",0,
            ANGLEUNIT["degree",0.0174532925199433]],
        ID["EPSG",4326]],
    CONVERSION["Popular Visualisation Pseudo-Mercator",
        METHOD["Popular Visualisation Pseudo Mercator",
            ID["EPSG",1024]],
        PARAMETER["Latitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8801]],
        PARAMETER["Longitude of natural origin",0,
            ANGLEUNIT["degree",0.0174532925199433],
            ID["EPSG",8802]],
        PARAMETER["False easting",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8806]],
        PARAMETER["False northing",0,
            LENGTHUNIT["metre",1],
            ID["EPSG",8807]]],
    CS[Cartesian,2],
        AXIS["easting (X)",east,
            ORDER[1],
            LENGTHUNIT["metre",1]],
        AXIS["northing (Y)",north,
            ORDER[2],
            LENGTHUNIT["metre",1]],
    USAGE[
        SCOPE["Web mapping and visualisation."],
        AREA["World between 85.06°S and 85.06°N."],
        BBOX[-85.06,-180,85.06,180]],
    ID["EPSG",3857]]
st_crs(philly_sf_reprojected)$epsg
[1] 3857
st_crs(philly_sf_reprojected)$proj4string
[1] "+proj=merc +a=6378137 +b=6378137 +lat_ts=0 +lon_0=0 +x_0=0 +y_0=0 +k=1 +units=m +nadgrids=@null +wktext +no_defs"

Let’s break down the components of this Proj4 string:

  • +proj=merc: This indicates that the projection type is Mercator. The Mercator projection is a cylindrical map projection that preserves straight lines, making it suitable for navigation and world maps.

  • +a=6378137 +b=6378137: These parameters specify the semi-major and semi-minor axes of the ellipsoid used in the projection. In this case, both axes are set to 6378137 meters, indicating a sphere or ellipsoid with equal axes lengths.

  • +lat_ts=0: The lat_ts parameter represents the latitude of true scale. In this case, it is set to 0, indicating that the Mercator projection’s true scale is at the equator.

  • +lon_0=0: The lon_0 parameter specifies the central meridian, which is set to 0, indicating the projection’s reference longitude.

  • +x_0=0 +y_0=0: These parameters define the false easting and false northing, which are set to 0, indicating that the projection’s origin is at the intersection of the central meridian and equator.

  • +k=1: The k parameter represents the scale factor at the central meridian. A scale factor of 1 means that the scale is not distorted.

  • +units=m: The units parameter specifies the unit of measurement, which is set to meters in this case.

  • +nadgrids=@null: This parameter specifies that no datum transformation grids should be applied.

  • +wktext +no_defs: These parameters indicate that the string contains well-known text (WKT) representation and that no defaults should be used for undefined parameters.

Spatial operations

Import data

We are importing a data of regional boundaries in Alaska from the State of Alaska’s Salmon and People project.

Exercise

The data source (linked right before) is a good example of publishing scientific spatial data.

  • Explore the metadata (information about the data) to learn more the source.
  • You may notice DOI in the link and the dataset. What is a DOI? How can it be beneficial?
  • What are some elements in the data that makes you trust or not trust the data?

Download the sasap_regions and philly compressed folders from Files on Canvas. We will use these datasets in class today.

ak <- st_read("./sasap_regions/sasap_regions.shp")
Reading layer `sasap_regions' from data source 
  `C:\Users\kshit\OneDrive\Desktop\PLAN372\materials\sasap_regions\sasap_regions.shp' 
  using driver `ESRI Shapefile'
Simple feature collection with 13 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2175343 ymin: 405519.1 xmax: 1579226 ymax: 2383770
Projected CRS: NAD83 / Alaska Albers
plot(ak)

Properties of spatial vector data

head(ak)
Simple feature collection with 6 features and 2 fields
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2175343 ymin: 405519.1 xmax: 773854 ymax: 2383770
Projected CRS: NAD83 / Alaska Albers
  region_id           region                       geometry
1         1 Aleutian Islands MULTIPOLYGON (((-1156666 42...
2         2           Arctic MULTIPOLYGON (((571289.9 21...
3         3      Bristol Bay MULTIPOLYGON (((-339688.6 9...
4         4          Chignik MULTIPOLYGON (((-114381.9 6...
5         5     Copper River MULTIPOLYGON (((561012 1148...
6         6           Kodiak MULTIPOLYGON (((115112.5 98...
st_crs(ak)$input
[1] "NAD83 / Alaska Albers"

The CRS used in the data is NAD83 / Alaska Albers, which is appropriate for Alaska.

Tidyverse operations with sf dataframes

We can work with sf objects like we would with Tidyverse objects (tibbles). All the rules of select, filter, joins, mutate, etc. apply the same way as Tidyverse.

head(ak %>% select(region))
Simple feature collection with 6 features and 1 field
Geometry type: MULTIPOLYGON
Dimension:     XY
Bounding box:  xmin: -2175343 ymin: 405519.1 xmax: 773854 ymax: 2383770
Projected CRS: NAD83 / Alaska Albers
            region                       geometry
1 Aleutian Islands MULTIPOLYGON (((-1156666 42...
2           Arctic MULTIPOLYGON (((571289.9 21...
3      Bristol Bay MULTIPOLYGON (((-339688.6 9...
4          Chignik MULTIPOLYGON (((-114381.9 6...
5     Copper River MULTIPOLYGON (((561012 1148...
6           Kodiak MULTIPOLYGON (((115112.5 98...

The geometry column will remain for sf objects, even when we don’t explicitly select it.

Reading in tabular data

These data show languages spoken in the household for people over the age of 5 in Alaska, in addition to the total population, by community, from the same project.

tabl <- read_csv("./sasap_regions/household_language.csv")
Rows: 3142 Columns: 50
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (3): SASAP.Region, city, data_file_year
dbl (47): Year, lat, lng, total, speak_only_english, german, yiddish, greek,...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

The table has longitude and latitude information, but is not a spatial dataframe. Yet. We can convert it as such using st_as_sf. You might have noticed that st_ at the beginning of sf functions. It stands for spatio temporal, helping us with datasets with spatial and time components. We only deal with spatial information in this tutorial.

#remove rows with missing lat and long values
tabl <- tabl[complete.cases(tabl$lng, tabl$lat), ]


pop_4326 <- st_as_sf(tabl, 
                  coords = c('lng', 'lat'),
                  crs = 4326,
                  remove = F)

head(pop_4326)                     
Simple feature collection with 6 features and 50 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -165.7731 ymin: 54.13556 xmax: -141.1881 ymax: 62.72306
Geodetic CRS:  WGS 84
# A tibble: 6 × 51
   Year SASAP.Region   city    lat   lng total speak_only_english german yiddish
  <dbl> <chr>          <chr> <dbl> <dbl> <dbl>              <dbl>  <dbl>   <dbl>
1  1990 Kodiak         Akhi…  56.9 -154.    77                 63      0       0
2  1990 Kuskokwim      Akia…  60.9 -161.   481                 29      0       0
3  1990 Kuskokwim      Akiak  60.9 -161.   285                 48      0       0
4  1990 Alaska Penins… Akut…  54.1 -166.   589                250      8       0
5  1990 Yukon          Alak…  62.7 -165.   544                125      0       0
6  1990 Yukon          Alcan  62.7 -141.    27                 15      2       0
# ℹ 42 more variables: greek <dbl>, indic <dbl>, italian <dbl>,
#   french_or_french_creole <dbl>, portuguese_or_portuguese_creole <dbl>,
#   spanish_or_spanish_creole <dbl>, polish <dbl>, russian <dbl>,
#   south_slavic <dbl>, arabic <dbl>, tagalog <dbl>, chinese <dbl>,
#   hungarian <dbl>, japanese <dbl>, korean <dbl>, vietnamese <dbl>,
#   other_and_unspecified_languages <dbl>, french_incl_patois_cajun <dbl>,
#   french_creole <dbl>, other_west_germanic_languages <dbl>, …

We use st_join for spatial joins. It’s a spatial left join.

Before we perform spatial joins, we need to make sure the CRS systems is same.

pop_3338 <- st_transform(pop_4326, crs = 3338)

We use st_within to assign each city to a region based on the latitude and longitude.

pop_joined <- st_join(pop_3338, ak, join = st_within)

head(pop_joined)
Simple feature collection with 6 features and 52 fields
Geometry type: POINT
Dimension:     XY
Bounding box:  xmin: -766425.7 ymin: 526057.8 xmax: 649412.1 ymax: 1479652
Projected CRS: NAD83 / Alaska Albers
# A tibble: 6 × 53
   Year SASAP.Region   city    lat   lng total speak_only_english german yiddish
  <dbl> <chr>          <chr> <dbl> <dbl> <dbl>              <dbl>  <dbl>   <dbl>
1  1990 Kodiak         Akhi…  56.9 -154.    77                 63      0       0
2  1990 Kuskokwim      Akia…  60.9 -161.   481                 29      0       0
3  1990 Kuskokwim      Akiak  60.9 -161.   285                 48      0       0
4  1990 Alaska Penins… Akut…  54.1 -166.   589                250      8       0
5  1990 Yukon          Alak…  62.7 -165.   544                125      0       0
6  1990 Yukon          Alcan  62.7 -141.    27                 15      2       0
# ℹ 44 more variables: greek <dbl>, indic <dbl>, italian <dbl>,
#   french_or_french_creole <dbl>, portuguese_or_portuguese_creole <dbl>,
#   spanish_or_spanish_creole <dbl>, polish <dbl>, russian <dbl>,
#   south_slavic <dbl>, arabic <dbl>, tagalog <dbl>, chinese <dbl>,
#   hungarian <dbl>, japanese <dbl>, korean <dbl>, vietnamese <dbl>,
#   other_and_unspecified_languages <dbl>, french_incl_patois_cajun <dbl>,
#   french_creole <dbl>, other_west_germanic_languages <dbl>, …

Groups and summaries

Estimating total population by region

pop_region <- pop_joined %>% 
  as.data.frame() %>% 
  group_by(region) %>% 
  summarise(total_pop = sum(total))

head(pop_region)
# A tibble: 6 × 2
  region           total_pop
  <chr>                <dbl>
1 Aleutian Islands     74611
2 Arctic               63741
3 Bristol Bay          61500
4 Chignik               3517
5 Cook Inlet         3377935
6 Copper River         22614

Population by region wasn’t available to the original shapefile. We can add that back using this newly created dataframe.

pop_region_3338 <- left_join(ak, pop_region, by = c("region" = "region"))

Creating a simple plot using base R

#plot to check
plot(pop_region_3338["total_pop"])

Saving into shapefiles

Shapefiles are still pretty common formats to store spatial vector data. While there are other alternatives like geojson and gpkg formats, we will use shapefiles because of their popularity.

write_sf(pop_region_3338, "sasap_regions/ak_regions_population.shp", delete_layer = TRUE)

Check the sasap_regions folder to see if new shapefiles have been created.

Visualizing with ggplot

ggplot(pop_region_3338) +
  geom_sf(aes(fill = total_pop)) +
  theme_bw() +
  labs(fill = "Total Population") +
  scale_fill_continuous(low = "khaki", high =  "firebrick")

Exercise
  • Categorize the continuous population data into bins, and remake the plot.
  • Why are the gridlines on the plot curved?

References