Load Libraries

In this class, we will study visualization of geospatial data. Below are the packages that need to be installed and loaded.


library(sf)           
library(ggplot2)      
library(dplyr)        
library(tidyr)        
library(scales)       
library(RColorBrewer) 
library(units)
library(cowplot)

Introduction


Geospatial data constitute another important type of real data. It is defined as data that are related to a location on earth (and in many situations with respect to time as well).

Question: Can you give an example of geospatial data?

Types of Geospatial Data


  • Vector: point, lines, polygons referenced by geospatial information
  • Raster: geospatial information by pixel or equal-spaced grids
  • Geographic Database: a structured multi-layered data with vectors and rasters (Google Map)
  • Multi-temporal: Attach geospatial data with time components

Watch a video: https://www.youtube.com/watch?v=NV2fqs_me60

Vector vs Raster data


We will briefly cover the two main types of geospatial data - vector and raster.



In this module, we will start with vector data.

Object types in Vector Data


Vector data consists of features that represent geographic phenomena.

  • Points can be used to represent small objects such as wells, weather stations, or field plot locations.

  • Lines represent one-dimensional linear features such as roads and streams.

  • Polygons represent two-dimensional natural features such as lakes, vegetation patches, and burn scars as well as administrative units such as counties, districts, and counties.

Scale matters


The type of geometry used to represent a particular feature may depend on the scale of the measurement.

Points can be used to represent city locations at a continental scale, whereas a polygon would be used to map the boundary of an individual city at a city scale.

Importing Geospatial Data


This chapter uses several datasets that characterize the historical distribution of tornadoes in the state of Oklahoma.

  • Data in ESRI shapefile format, an old but still popular file formats for vector geospatial data.

  • This is a multiple file format, where separate files contain the feature geometries, attribute table, spatial indices, and coordinate reference system. All files have the same filename with different extensions.

  • To read in a shapefile, it is only necessary to specify the filename with a “.shp” extension. However, all the files, including the “.shp” file as well as the “.dbf”, “.shx”, and “.prj” files, need to be present in the directory from which the data are read.

Read .shp file with st_read


The shapefiles are imported to sf objects using the st_read() function. The quiet = FALSE argument suppresses output to the console when importing spatial datasets.

path <- "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter5"
okcounty <- st_read(paste0(path,"/ok_counties.shp"), quiet = TRUE)
tpoint <- st_read(paste0(path,"/ok_tornado_point.shp"), quiet = TRUE)
tpath <- st_read(paste0(path,"/ok_tornado_path.shp"), quiet = TRUE)


  • The ok_counties.shp dataset contains county boundaries for the state of Oklahoma (polygon).

  • The ok_tornado_point.shp dataset contains the initial locations of tornado touchdown (point).

  • The ok_tornado_path.shp dataset contains the path of each tornado (line).

Glimpse of okcounty Data


The package sf read geospatial as a special data frame with the class of sf. It contains a geometry feature (polygon, line or point) and other characteristics for each geometric unit.


class(okcounty)
## [1] "sf"         "data.frame"
glimpse(okcounty)
## Rows: 77
## Columns: 8
## $ STATEFP  <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID    <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME     <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD     <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., POLYGON ((-103.0025 3…

meaning of each feature


  • STATEFP: State FIPS (Federal Information Processing Standards) code. It’s a numeric code that uniquely identifies states in the United States. For Oklahoma, the code is ‘40’.

  • COUNTYFP: County FIPS code. It’s a numeric code that uniquely identifies counties within a state.

  • COUNTYNS: National statistics code for the county, another unique identifier for counties within national databases.

  • AFFGEOID: An aggregate feature identifier that combines several geographic identifiers, typically used for referencing geographic data.

  • GEOID: A concatenation of the state and county FIPS codes, providing a unique identifier for each county within the entire United States.

  • NAME: The name of the county.

  • LSAD: Legal/Statistical Area Description code, which describes the type of legal entity or statistical area a geographic area represents. The code ‘6’ usually stands for a county or equivalent entity.

  • geometry: Contains geographic coordinates defining the boundaries of the counties. This column typically includes lists of coordinates (longitude and latitude) that outline the shape of each county on a map.

Creating maps from okcounty


To generate a map of counties using ggplot() with a sf object, the geom_sf() function is used. This map shows the borders of all counties in Oklahoma.

ggplot(data = okcounty) +
  geom_sf(fill = NA)


Exercise: Try what will happen without fill=NA in the geom_sf function.

Class Exercise


Try to fill different counties with different colors for okcounty data.

tpoint data


names(tpoint)
##  [1] "om"       "yr"       "mo"       "dy"       "date"     "time"    
##  [7] "tz"       "st"       "stf"      "stn"      "mag"      "inj"     
## [13] "fat"      "loss"     "closs"    "slat"     "slon"     "elat"    
## [19] "elon"     "len"      "wid"      "fc"       "geometry"


Exercise: Find the meaning of each variable.

Filter tornado data


The yr column indicates the year of each tornado and can be used to filter the data to a smaller year range from 2016-2021. This column is retained in the output along with om, a unique ID code, and date, the date of each tornado.

tpoint_16_21 <- tpoint %>%
  filter(yr >= 2016 & yr <= 2021) %>%
  select(om, yr, date)
tpath_16_21 <- tpath %>%
  filter(yr >= 2016 & yr <= 2021) %>%
  select(om, yr, date)

Plot tornoda touchdown places into the county map


Now let’s plot the tornoda touchdown point tpoint into the county map. They will be overlapped by the geospatial information (longitude and latitude).

ggplot() +
  geom_sf(data = okcounty, fill = NA) +
  geom_sf(data = tpoint_16_21) +
  theme_bw()

Class exercise


Plot tornado from different years in different colors in the county map.

Plot path of tornadoes


Now let’s work on tpath and plot the paths of each tornado in red lines.

ggplot() +
  geom_sf(data = okcounty, fill = NA) +
  geom_sf(data = tpath_16_21, 
          color = "red", 
          size = 1) +
  theme_bw()

Plot tornado map by year using facet_wrap()


ggplot() +
  geom_sf(data = okcounty, fill = NA, color = "gray") +
  geom_sf(data = tpoint_16_21, size = 0.75) +
  facet_wrap(vars(yr), ncol = 2) +
  coord_sf(datum = NA) +
  theme_void()

st_drop_geometry() function


When joining data or summarizing data for geospatial data, the geometry information will always follow. For example, if we hope to summarize the number of tornadoes in each year, we can try the following code:

tpoint_16_21 %>% 
  count(yr)
## Simple feature collection with 6 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension:     XY
## Bounding box:  xmin: -102.838 ymin: 33.7535 xmax: -94.4381 ymax: 36.975
## Geodetic CRS:  NAD83
##     yr   n                       geometry
## 1 2016  57 MULTIPOINT ((-95.0718 34.38...
## 2 2017  85 MULTIPOINT ((-96.701 34.267...
## 3 2018  43 MULTIPOINT ((-96.6806 34.22...
## 4 2019 148 MULTIPOINT ((-94.6576 33.84...
## 5 2020  38 MULTIPOINT ((-94.642 33.753...
## 6 2021  63 MULTIPOINT ((-102.838 36.55...

To drop the geometry information when it is no longer necessary, we may use the st_drop_geometry() function:

tpoint_16_21 %>% 
  st_drop_geometry() %>% 
  count(yr)
##     yr   n
## 1 2016  57
## 2 2017  85
## 3 2018  43
## 4 2019 148
## 5 2020  38
## 6 2021  63

Joining data


Now let’s try to summarize the number tornadoes by county. This information is not directly available. We need to join the tpoint data with okcounty data using the st_join() function.

countypnt <- st_join(tpoint_16_21, okcounty)

This is different from the join functions we learned so far in that st_join() links rows from the two tables based on the spatial locations instead of their attributes.

Group by GEOID as county identifier


countypnt <- st_drop_geometry(countypnt)
countysum <- countypnt %>%
  group_by(GEOID) %>%
  summarize(tcnt = n())
glimpse(countysum)
## Rows: 75
## Columns: 2
## $ GEOID <chr> "40001", "40005", "40007", "40009", "40011", "40013", "40015", "…
## $ tcnt  <int> 6, 3, 4, 8, 1, 4, 10, 5, 7, 5, 3, 12, 10, 5, 5, 1, 7, 9, 7, 8, 2…


Next, okcounty is joined to countysum so that each polygon is associated with the appropriate tornado summary. At this stage we can use the left_join function.

Computer tornado density by county


  • join countsum back with okcounty to add tcnt (count of tornadoes) to each county.
  • A few counties that had no tornadoes during 2016-2021 are missing from countysum, and therefore have NA values in the joined table. (check how to use replace by yourself)
  • The st_area() function computes the area of each county in square meter. The density of tornadoes per county is then calculated and converted to tornadoes per 1000 \(\mathrm{km}^2\).
  • The st_area() function returns a column with explicit measurement units, but these are removed using the drop_units() function for simplicity.
countymap <- okcounty %>%
  left_join(countysum, by = "GEOID") %>%
  replace(is.na(.), 0) %>%
  mutate(area = st_area(okcounty),
         tdens = 10^6 * 10^3 * tcnt / area) %>%
  drop_units()

Result


glimpse(countymap)
## Rows: 77
## Columns: 11
## $ STATEFP  <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID    <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME     <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD     <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ tcnt     <dbl> 1, 12, 1, 10, 6, 2, 6, 0, 4, 9, 3, 10, 12, 1, 2, 8, 13, 4, 5,…
## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., POLYGON ((-103.0025 3…
## $ area     <dbl> 1890663261, 4766283042, 2427121029, 1657249513, 1503893122, 3…
## $ tdens    <dbl> 0.5289149, 2.5176851, 0.4120108, 6.0340944, 3.9896452, 0.6197…

Lab Exercise


Plot the okcounty map with each county color-filled by the density of tornadoes between 2000 and 2021.

Save geospatial data


The st_write() function can be used to save sf objects to a variety of file formats. For example, the following code saves countymap as a shp file.

st_write(countymap, 
         dsn = "oktornadosum.shp", 
         append = FALSE)
## Writing layer `oktornadosum' to data source 
##   `oktornadosum.shp' using driver `ESRI Shapefile'
## Writing 77 features with 10 fields and geometry type Polygon.


Another commonly-used open geospatial data format is GeoJSON. It is based on Javascript Object Notation (JSON), a human-readable text format that stores data in ASCII files. The layer_options argument must be set to “RFC7946 = YES” to save the data in the newest GeoJSON specification.

st_write(countymap, "oktornado.geojson",
         layer_options = "RFC7946 = YES",
         delete_dsn = TRUE)
## Deleting source `oktornado.geojson' using driver `GeoJSON'
## Writing layer `oktornado' to data source `oktornado.geojson' using driver `GeoJSON'
## options:        RFC7946 = YES 
## Writing 77 features with 10 fields and geometry type Polygon.

Choropleth Maps


A typical way to display geospatial data is as a choropleth map, where summary statistics for each unit region are displayed as different colors. In the tornado example, we can do

ggplot(data = countymap) +
  geom_sf(aes(fill = tdens)) +
  theme_void()

Map a polygon to its centroid


In general, choropleth maps are most effective when mapping rates, such as the density of tornadoes per area or the prevalence of a disease per number of people. In some cases, it is necessary to map count data such as the number of tornadoes or disease cases in each county. People tend to naturally associate sizes with quantitie when viewing maps, and displaying the counts as graduated symbols is often an effective approach.

To map symbols, the county polygons must first be converted to points. The st_centroid() generates a point feature located at the centroid of each county. The st_geometry_type() function returns the feature geometry type. Setting by_geometry = FALSE returns one geometry type for the entire dataset instead of for every feature.


st_geometry_type(okcounty, by_geometry=FALSE)
## [1] POLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
okcntrd = st_centroid(countymap)
st_geometry_type(okcntrd, by_geometry = FALSE)
## [1] POINT
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

Glimpse of county centroid data


glimpse(okcntrd)
## Rows: 77
## Columns: 11
## $ STATEFP  <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID    <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME     <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD     <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ tcnt     <dbl> 1, 12, 1, 10, 6, 2, 6, 0, 4, 9, 3, 10, 12, 1, 2, 8, 13, 4, 5,…
## $ geometry <POINT [°]> POINT (-95.25035 34.87653), POINT (-102.5173 36.74848),…
## $ area     <dbl> 1890663261, 4766283042, 2427121029, 1657249513, 1503893122, 3…
## $ tdens    <dbl> 0.5289149, 2.5176851, 0.4120108, 6.0340944, 3.9896452, 0.6197…


We see that the geometry of each country is converted to a point (centroid of each county). Now we are ready to do a map with the symbol size proportional to the number of tornadoes, and placed at the centroid of each county. This is called a graduated symbol map.

Graduated symbol map


ggplot() +
  geom_sf(data = okcntrd, aes(size = tcnt)) +
  geom_sf(data = okcounty, fill = NA) +
  theme_void()

Modifying the Appearance of the Map


Additional ggplot2 functions can be added to improve the appearance of the map. The scale_fill_distiller() function allows the specification of a different color ramp. This example uses the “YlOrRd” palette from the RColorBrewer package. The pretty_breaks() function from the scales package is used to automatically select a set of breaks for the legend. The legend is moved to the bottom of the map to better accommodate the longer legend title.


ggplot(data = countymap) +
  geom_sf(aes(fill = tdens)) +
  scale_fill_distiller(name = expression("Tornadoes/1000 km"^2), 
                       palette = "YlOrRd", 
                       breaks = pretty_breaks(),
                       direction = 1) +
  theme_void() +
  theme(legend.position = "bottom")

Palettes


The RColorBrewer package provides a selection of palettes designed for choropleth mapping (Harrower and Brewer 2003). The display_brewer_all() function generates a chart with examples of all the available palettes.


display.brewer.all()

  • The top palette group is suitable to map ordered data or numeric data.

  • The middle palette group is suitable to map categorical data without an order.

  • The bottom palette group is suitbale to indicate values that are above or below the mean or to highlight values that are higher or lower than zero.

Preview paletts for a given finite number of colors


display.brewer.pal(5, "YlGnBu")

Lab Homework (Required)


  • Choose a proper palette map to visualize the average injuries per year by tornadoes in each county between 2000 and 2021.

Convert a continuous scale to a discrete scale


Rather than using continuous scales for color and size, it is often recommended to aggregate the data into a small number of classes (typically 3-6). Using discrete classes makes it easier to associate each color or symbol in the map with a specific range of values.


numclas <- 4
qbrks <- seq(0, 1, length.out = numclas + 1)
countymap <- countymap %>%
  mutate(tdens_c1 = cut(tdens,
                        breaks = quantile(tdens, breaks = qbrks),
                        include.lowest = T))


The code above break the data into four categories by quartiles of tornado density, and save that as a new variable tdens_c1.

Plot tdens_c1


ggplot(data = countymap) +
  geom_sf(aes(fill = tdens_c1)) +
  scale_fill_brewer(name = expression("Tornadoes/1000 km"^2),   
                    palette = "YlOrRd") +
  theme_void() +
  theme(legend.position = "bottom") 

Lab Exercise


Reproduce the map in the previous page with breaking density with 6 groups - 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10 and above 10.

Another example


Similar to choropleth maps, graduated symbol maps are often easier to interpret if they include a limited number of symbol sizes.


maxcnt <- max(okcntrd$tcnt)
brkpts <- c(0, 2, 5, 10, maxcnt)
okcntrd <- okcntrd %>%
  mutate(tcnt_c1 = cut(tcnt,
                        breaks = brkpts,
                        include.lowest = T))

ggplot(data = okcntrd) +
  geom_sf(aes(size = tcnt_c1)) +
  scale_size_discrete(name="Tornadoes") +
  geom_sf(data = okcounty, fill = NA) +
  theme_void() +
  theme(legend.position = "bottom")

Export image output


We can save the image in different formats as needed. Below are examples:

ggsave("tornado.png", 
       width = 6,  # in inch by default
       height = 4, # in inch by default
       dpi = 300)
ggsave("tornado2.png", 
       width = 15, 
       height = 10, 
       units = "cm", 
       dpi = 100)
  • unit specifies the units for width and height

  • dpi specifies the resolution of the output image, representing the number of pixels per inch

More output formats


ggsave("tornado.jpeg", 
       width = 6, 
       height = 4, 
       dpi = 300, 
       quality = 90)
ggsave("tornado.tiff", 
       width = 6, 
       height = 4, 
       dpi = 300, 
       compression = "lzw")
ggsave("tornado.pdf", 
       width = 6, 
       height = 4)

Note that tiff file embeds geospatial information in the image and that’s why it’s much larger than other formats. We will talk about this format later.

Save images by R object


By default, `ggsave saves the current active graph. We can also save multiple graphs as R objects and save them by using the plot argument in ggsave.

choropleth <- ggplot(data = countymap) +
  geom_sf(aes(fill = tdens_c1)) +
  scale_fill_brewer(name="Density",   
                    palette = "YlOrRd") +
  theme_void()

gradsymbol <- ggplot(data = okcntrd) +
  geom_sf(aes(size = tcnt_c1)) +
  scale_size_discrete(name="Count") +
  geom_sf(data = okcounty, fill = NA) +
  theme_void()

ggsave("choropleth.tiff", 
       plot = choropleth,
       width = 6, 
       height = 4, 
       dpi = 300, 
       compression = "lzw")

ggsave("gradsymbol.tiff",
       plot = gradsymbol,
       width = 6, 
       height = 4, 
       dpi = 300, 
       compression = "lzw")

Save multiple images in a grid


Saved graphs and maps can also be combined into a composite figure using the cowplot package (Wilke 2020). The plot_grid() function provides a variety of options for arranging figures in a regular grid. Below is an example.


plot_grid(choropleth, gradsymbol, 
          labels = c("A) Choropleth Map", 
                     "B) Graduated Symbol Map"),
          label_size = 12,
          ncol = 1, 
          scale = 0.8,
          hjust = 0,
          vjust = 1.5, 
          label_x = 0, 
          align = "v")

Lab Homework (Required)


Generate a map of tornado paths where the paths from each year (2016-2021) are displayed as a different color, Create a composite figure containing the map of tornado paths and the map of tornado points using plot_grid().