In this class, we will study visualization of geospatial data. Below are the packages that need to be installed and loaded.
library(sf)
library(ggplot2)
library(dplyr)
library(tidyr)
library(scales)
library(RColorBrewer)
library(units)
library(cowplot)
We will use the online textbook https://bookdown.org/mcwimberly/gdswr-book/ for this part.
You need to download all data needed from the link:
Geospatial data constitute another important type of real data. It is defined as data that are related to a location on earth (and in many situations with respect to time as well).
Question: Can you give an example of geospatial data?
Watch a video: https://www.youtube.com/watch?v=NV2fqs_me60
We will briefly cover the two main types of geospatial data - vector and raster.
In this module, we will start with vector data.
Vector data consists of features that represent geographic phenomena.
Points can be used to represent small objects such as wells, weather stations, or field plot locations.
Lines represent one-dimensional linear features such as roads and streams.
Polygons represent two-dimensional natural features such as lakes, vegetation patches, and burn scars as well as administrative units such as counties, districts, and counties.
The type of geometry used to represent a particular feature may depend on the scale of the measurement.
Points can be used to represent city locations at a continental scale, whereas a polygon would be used to map the boundary of an individual city at a city scale.
This chapter uses several datasets that characterize the historical distribution of tornadoes in the state of Oklahoma.
Data in ESRI shapefile format, an old but still popular file formats for vector geospatial data.
This is a multiple file format, where separate files contain the feature geometries, attribute table, spatial indices, and coordinate reference system. All files have the same filename with different extensions.
To read in a shapefile, it is only necessary to specify the filename with a “.shp” extension. However, all the files, including the “.shp” file as well as the “.dbf”, “.shx”, and “.prj” files, need to be present in the directory from which the data are read.
.shp
file with st_read
The shapefiles are imported to sf objects using the
st_read()
function. The quiet = FALSE
argument
suppresses output to the console when importing spatial datasets.
path <- "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter5"
okcounty <- st_read(paste0(path,"/ok_counties.shp"), quiet = TRUE)
tpoint <- st_read(paste0(path,"/ok_tornado_point.shp"), quiet = TRUE)
tpath <- st_read(paste0(path,"/ok_tornado_path.shp"), quiet = TRUE)
The ok_counties.shp
dataset contains county
boundaries for the state of Oklahoma
(polygon).
The ok_tornado_point.shp
dataset contains the
initial locations of tornado touchdown
(point).
The ok_tornado_path.shp
dataset contains the path of
each tornado (line).
okcounty
DataThe package sf
read geospatial as a special data frame
with the class of sf
. It contains a geometry feature
(polygon, line or point) and other characteristics for each geometric
unit.
class(okcounty)
## [1] "sf" "data.frame"
glimpse(okcounty)
## Rows: 77
## Columns: 8
## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., POLYGON ((-103.0025 3…
STATEFP
: State FIPS (Federal Information Processing
Standards) code. It’s a numeric code that uniquely identifies states in
the United States. For Oklahoma, the code is ‘40’.
COUNTYFP
: County FIPS code. It’s a numeric code that
uniquely identifies counties within a state.
COUNTYNS
: National statistics code for the county,
another unique identifier for counties within national
databases.
AFFGEOID
: An aggregate feature identifier that
combines several geographic identifiers, typically used for referencing
geographic data.
GEOID
: A concatenation of the state and county FIPS
codes, providing a unique identifier for each county within the entire
United States.
NAME
: The name of the county.
LSAD
: Legal/Statistical Area Description code, which
describes the type of legal entity or statistical area a geographic area
represents. The code ‘6’ usually stands for a county or equivalent
entity.
geometry
: Contains geographic coordinates defining
the boundaries of the counties. This column typically includes lists of
coordinates (longitude and latitude) that outline the shape of each
county on a map.
okcounty
To generate a map of counties using ggplot()
with a sf
object, the geom_sf()
function is used. This map shows the
borders of all counties in Oklahoma.
ggplot(data = okcounty) +
geom_sf(fill = NA)
Exercise: Try what will happen without
fill=NA
in the geom_sf
function.
Try to fill different counties with different colors for
okcounty
data.
tpoint
datanames(tpoint)
## [1] "om" "yr" "mo" "dy" "date" "time"
## [7] "tz" "st" "stf" "stn" "mag" "inj"
## [13] "fat" "loss" "closs" "slat" "slon" "elat"
## [19] "elon" "len" "wid" "fc" "geometry"
Exercise: Find the meaning of each variable.
The yr
column indicates the year of each tornado and can
be used to filter the data to a smaller year range from 2016-2021. This
column is retained in the output along with om
, a unique ID
code, and date
, the date of each tornado.
tpoint_16_21 <- tpoint %>%
filter(yr >= 2016 & yr <= 2021) %>%
select(om, yr, date)
tpath_16_21 <- tpath %>%
filter(yr >= 2016 & yr <= 2021) %>%
select(om, yr, date)
Now let’s plot the tornoda touchdown point tpoint
into
the county map. They will be overlapped by the geospatial information
(longitude and latitude).
ggplot() +
geom_sf(data = okcounty, fill = NA) +
geom_sf(data = tpoint_16_21) +
theme_bw()
Plot tornado from different years in different colors in the county map.
Now let’s work on tpath
and plot the paths of each
tornado in red lines.
ggplot() +
geom_sf(data = okcounty, fill = NA) +
geom_sf(data = tpath_16_21,
color = "red",
size = 1) +
theme_bw()
facet_wrap()
ggplot() +
geom_sf(data = okcounty, fill = NA, color = "gray") +
geom_sf(data = tpoint_16_21, size = 0.75) +
facet_wrap(vars(yr), ncol = 2) +
coord_sf(datum = NA) +
theme_void()
st_drop_geometry()
functionWhen joining data or summarizing data for geospatial data, the geometry information will always follow. For example, if we hope to summarize the number of tornadoes in each year, we can try the following code:
tpoint_16_21 %>%
count(yr)
## Simple feature collection with 6 features and 2 fields
## Geometry type: MULTIPOINT
## Dimension: XY
## Bounding box: xmin: -102.838 ymin: 33.7535 xmax: -94.4381 ymax: 36.975
## Geodetic CRS: NAD83
## yr n geometry
## 1 2016 57 MULTIPOINT ((-95.0718 34.38...
## 2 2017 85 MULTIPOINT ((-96.701 34.267...
## 3 2018 43 MULTIPOINT ((-96.6806 34.22...
## 4 2019 148 MULTIPOINT ((-94.6576 33.84...
## 5 2020 38 MULTIPOINT ((-94.642 33.753...
## 6 2021 63 MULTIPOINT ((-102.838 36.55...
To drop the geometry information when it is no longer necessary, we
may use the st_drop_geometry()
function:
tpoint_16_21 %>%
st_drop_geometry() %>%
count(yr)
## yr n
## 1 2016 57
## 2 2017 85
## 3 2018 43
## 4 2019 148
## 5 2020 38
## 6 2021 63
Now let’s try to summarize the number tornadoes by county. This
information is not directly available. We need to join the
tpoint
data with okcounty
data using the
st_join()
function.
countypnt <- st_join(tpoint_16_21, okcounty)
This is different from the join
functions we learned so
far in that st_join()
links rows from the two tables based
on the spatial locations instead of their attributes.
GEOID
as county identifiercountypnt <- st_drop_geometry(countypnt)
countysum <- countypnt %>%
group_by(GEOID) %>%
summarize(tcnt = n())
glimpse(countysum)
## Rows: 75
## Columns: 2
## $ GEOID <chr> "40001", "40005", "40007", "40009", "40011", "40013", "40015", "…
## $ tcnt <int> 6, 3, 4, 8, 1, 4, 10, 5, 7, 5, 3, 12, 10, 5, 5, 1, 7, 9, 7, 8, 2…
Next, okcounty
is joined to countysum
so
that each polygon is associated with the appropriate tornado summary. At
this stage we can use the left_join
function.
countsum
back with okcounty
to add
tcnt
(count of tornadoes) to each county.countysum
, and therefore have NA values in the joined
table. (check how to use replace
by yourself)st_area()
function computes the area of each county
in square meter. The density of tornadoes per county is then calculated
and converted to tornadoes per 1000 \(\mathrm{km}^2\).st_area()
function returns a column with explicit
measurement units, but these are removed using the
drop_units()
function for simplicity.countymap <- okcounty %>%
left_join(countysum, by = "GEOID") %>%
replace(is.na(.), 0) %>%
mutate(area = st_area(okcounty),
tdens = 10^6 * 10^3 * tcnt / area) %>%
drop_units()
glimpse(countymap)
## Rows: 77
## Columns: 11
## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ tcnt <dbl> 1, 12, 1, 10, 6, 2, 6, 0, 4, 9, 3, 10, 12, 1, 2, 8, 13, 4, 5,…
## $ geometry <POLYGON [°]> POLYGON ((-95.50766 35.0292..., POLYGON ((-103.0025 3…
## $ area <dbl> 1890663261, 4766283042, 2427121029, 1657249513, 1503893122, 3…
## $ tdens <dbl> 0.5289149, 2.5176851, 0.4120108, 6.0340944, 3.9896452, 0.6197…
Plot the okcounty
map with each county color-filled by
the density of tornadoes between 2000 and 2021.
The st_write()
function can be used to save sf objects
to a variety of file formats. For example, the following code saves
countymap
as a shp
file.
st_write(countymap,
dsn = "oktornadosum.shp",
append = FALSE)
## Writing layer `oktornadosum' to data source
## `oktornadosum.shp' using driver `ESRI Shapefile'
## Writing 77 features with 10 fields and geometry type Polygon.
Another commonly-used open geospatial data format is GeoJSON. It is based on Javascript Object Notation (JSON), a human-readable text format that stores data in ASCII files. The layer_options argument must be set to “RFC7946 = YES” to save the data in the newest GeoJSON specification.
st_write(countymap, "oktornado.geojson",
layer_options = "RFC7946 = YES",
delete_dsn = TRUE)
## Deleting source `oktornado.geojson' using driver `GeoJSON'
## Writing layer `oktornado' to data source `oktornado.geojson' using driver `GeoJSON'
## options: RFC7946 = YES
## Writing 77 features with 10 fields and geometry type Polygon.
A typical way to display geospatial data is as a choropleth map, where summary statistics for each unit region are displayed as different colors. In the tornado example, we can do
ggplot(data = countymap) +
geom_sf(aes(fill = tdens)) +
theme_void()
In general, choropleth maps are most effective when mapping rates, such as the density of tornadoes per area or the prevalence of a disease per number of people. In some cases, it is necessary to map count data such as the number of tornadoes or disease cases in each county. People tend to naturally associate sizes with quantitie when viewing maps, and displaying the counts as graduated symbols is often an effective approach.
To map symbols, the county polygons must first be converted to points. The st_centroid() generates a point feature located at the centroid of each county. The st_geometry_type() function returns the feature geometry type. Setting by_geometry = FALSE returns one geometry type for the entire dataset instead of for every feature.
st_geometry_type(okcounty, by_geometry=FALSE)
## [1] POLYGON
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
okcntrd = st_centroid(countymap)
st_geometry_type(okcntrd, by_geometry = FALSE)
## [1] POINT
## 18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE
glimpse(okcntrd)
## Rows: 77
## Columns: 11
## $ STATEFP <chr> "40", "40", "40", "40", "40", "40", "40", "40", "40", "40", "…
## $ COUNTYFP <chr> "077", "025", "011", "107", "105", "153", "001", "053", "059"…
## $ COUNTYNS <chr> "01101826", "01101800", "01101793", "01101841", "01101840", "…
## $ AFFGEOID <chr> "0500000US40077", "0500000US40025", "0500000US40011", "050000…
## $ GEOID <chr> "40077", "40025", "40011", "40107", "40105", "40153", "40001"…
## $ NAME <chr> "Latimer", "Cimarron", "Blaine", "Okfuskee", "Nowata", "Woodw…
## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "06", "06", "06", "…
## $ tcnt <dbl> 1, 12, 1, 10, 6, 2, 6, 0, 4, 9, 3, 10, 12, 1, 2, 8, 13, 4, 5,…
## $ geometry <POINT [°]> POINT (-95.25035 34.87653), POINT (-102.5173 36.74848),…
## $ area <dbl> 1890663261, 4766283042, 2427121029, 1657249513, 1503893122, 3…
## $ tdens <dbl> 0.5289149, 2.5176851, 0.4120108, 6.0340944, 3.9896452, 0.6197…
We see that the geometry of each country is converted to a point (centroid of each county). Now we are ready to do a map with the symbol size proportional to the number of tornadoes, and placed at the centroid of each county. This is called a graduated symbol map.
ggplot() +
geom_sf(data = okcntrd, aes(size = tcnt)) +
geom_sf(data = okcounty, fill = NA) +
theme_void()
Additional ggplot2
functions can be added to improve the
appearance of the map. The scale_fill_distiller()
function
allows the specification of a different color ramp. This example uses
the “YlOrRd” palette from the RColorBrewer
package. The
pretty_breaks()
function from the scales
package is used to automatically select a set of breaks for the legend.
The legend is moved to the bottom of the map to better accommodate the
longer legend title.
ggplot(data = countymap) +
geom_sf(aes(fill = tdens)) +
scale_fill_distiller(name = expression("Tornadoes/1000 km"^2),
palette = "YlOrRd",
breaks = pretty_breaks(),
direction = 1) +
theme_void() +
theme(legend.position = "bottom")
The RColorBrewer
package provides a selection of
palettes designed for choropleth mapping (Harrower and Brewer 2003). The
display_brewer_all()
function generates a chart with
examples of all the available palettes.
display.brewer.all()
The top palette group is suitable to map ordered data or numeric data.
The middle palette group is suitable to map categorical data without an order.
The bottom palette group is suitbale to indicate values that are above or below the mean or to highlight values that are higher or lower than zero.
display.brewer.pal(5, "YlGnBu")
Choose a proper palette map to visualize the average injuries per year by tornadoes in each county between 2000 and 2021.
Rather than using continuous scales for color and size, it is often recommended to aggregate the data into a small number of classes (typically 3-6). Using discrete classes makes it easier to associate each color or symbol in the map with a specific range of values.
numclas <- 4
qbrks <- seq(0, 1, length.out = numclas + 1)
countymap <- countymap %>%
mutate(tdens_c1 = cut(tdens,
breaks = quantile(tdens, breaks = qbrks),
include.lowest = T))
The code above break the data into four categories by quartiles of
tornado density, and save that as a new variable
tdens_c1
.
tdens_c1
ggplot(data = countymap) +
geom_sf(aes(fill = tdens_c1)) +
scale_fill_brewer(name = expression("Tornadoes/1000 km"^2),
palette = "YlOrRd") +
theme_void() +
theme(legend.position = "bottom")
Reproduce the map in the previous page with breaking density with 6 groups - 0 to 2, 2 to 4, 4 to 6, 6 to 8, 8 to 10 and above 10.
Similar to choropleth maps, graduated symbol maps are often easier to interpret if they include a limited number of symbol sizes.
maxcnt <- max(okcntrd$tcnt)
brkpts <- c(0, 2, 5, 10, maxcnt)
okcntrd <- okcntrd %>%
mutate(tcnt_c1 = cut(tcnt,
breaks = brkpts,
include.lowest = T))
ggplot(data = okcntrd) +
geom_sf(aes(size = tcnt_c1)) +
scale_size_discrete(name="Tornadoes") +
geom_sf(data = okcounty, fill = NA) +
theme_void() +
theme(legend.position = "bottom")
We can save the image in different formats as needed. Below are examples:
ggsave("tornado.png",
width = 6, # in inch by default
height = 4, # in inch by default
dpi = 300)
ggsave("tornado2.png",
width = 15,
height = 10,
units = "cm",
dpi = 100)
unit
specifies the units for width and
height
dpi
specifies the resolution of the output image,
representing the number of pixels per inch
ggsave("tornado.jpeg",
width = 6,
height = 4,
dpi = 300,
quality = 90)
ggsave("tornado.tiff",
width = 6,
height = 4,
dpi = 300,
compression = "lzw")
ggsave("tornado.pdf",
width = 6,
height = 4)
Note that tiff
file embeds geospatial information in the
image and that’s why it’s much larger than other formats. We will talk
about this format later.
By default, `ggsave
saves the current active graph. We
can also save multiple graphs as R objects and save them by using the
plot
argument in ggsave
.
choropleth <- ggplot(data = countymap) +
geom_sf(aes(fill = tdens_c1)) +
scale_fill_brewer(name="Density",
palette = "YlOrRd") +
theme_void()
gradsymbol <- ggplot(data = okcntrd) +
geom_sf(aes(size = tcnt_c1)) +
scale_size_discrete(name="Count") +
geom_sf(data = okcounty, fill = NA) +
theme_void()
ggsave("choropleth.tiff",
plot = choropleth,
width = 6,
height = 4,
dpi = 300,
compression = "lzw")
ggsave("gradsymbol.tiff",
plot = gradsymbol,
width = 6,
height = 4,
dpi = 300,
compression = "lzw")
Saved graphs and maps can also be combined into a composite figure
using the cowplot
package (Wilke 2020). The
plot_grid()
function provides a variety of options for
arranging figures in a regular grid. Below is an example.
plot_grid(choropleth, gradsymbol,
labels = c("A) Choropleth Map",
"B) Graduated Symbol Map"),
label_size = 12,
ncol = 1,
scale = 0.8,
hjust = 0,
vjust = 1.5,
label_x = 0,
align = "v")
Generate a map of tornado paths where the paths from each year
(2016-2021) are displayed as a different color, Create a composite
figure containing the map of tornado paths and the map of tornado points
using plot_grid()
.