This module will study chapter 9 of the textbook (linked provided below). The contents in the slides are also from the textbook.
https://bookdown.org/mcwimberly/gdswr-book/combining-vector-data-with-continuous-raster-data.html
We will learn how to combine data from multiple sources, which can involve both vector data and raster data.
For example, geospatial climate and weather data are typically provided as rasters, while geospatial data on the demographics of human populations are typically referenced to polygons.
To understand how different populations subgroups may be exposed to different climate and meteorological hazards, it is necessary to integrate data with fundamentally different geospatial structures.
In this module, we will learn how to combine continuous raster data with vector data.
This chapter will present several analyses using the PRISM dataset:
http://www.prism.oregonstate.edu
This is a widely-used interpolated climate dataset for the United States that is based on data from meteorological stations combined with other ancillary datasets such as elevation.
We will use the following libraries.
library(tidyverse)
library(sf)
library(terra)
library(prism)
library(tigris)
In previous chapters, various functions have been used to import data into R and convert them into appropriate objects for tabular data and geospatial data in vector and raster formats.
This chapter will use a different approach for bringing data into R.
County boundaries and gridded climate and meteorological data will be
imported directly from online archives using two R packages specifically
designed to facilitate data access: tigris
(Walker 2022)
and prism
(Edmund and Bell 2020).
tigris
We will use the tigris
package to download a county
shapefile with the counties
function.
county <- counties(cb = TRUE,
resolution = '20m')
class(county)
## [1] "sf" "data.frame"
glimpse(county)
## Rows: 3,222
## Columns: 13
## $ STATEFP <chr> "17", "27", "37", "47", "06", "17", "19", "22", "18", "33",…
## $ COUNTYFP <chr> "127", "017", "181", "079", "021", "093", "095", "003", "05…
## $ COUNTYNS <chr> "01784730", "00659454", "01008591", "01639755", "00277275",…
## $ AFFGEOID <chr> "0500000US17127", "0500000US27017", "0500000US37181", "0500…
## $ GEOID <chr> "17127", "27017", "37181", "47079", "06021", "17093", "1909…
## $ NAME <chr> "Massac", "Carlton", "Vance", "Henry", "Glenn", "Kendall", …
## $ NAMELSAD <chr> "Massac County", "Carlton County", "Vance County", "Henry C…
## $ STUSPS <chr> "IL", "MN", "NC", "TN", "CA", "IL", "IA", "LA", "IN", "NH",…
## $ STATE_NAME <chr> "Illinois", "Minnesota", "North Carolina", "Tennessee", "Ca…
## $ LSAD <chr> "06", "06", "06", "06", "06", "06", "06", "15", "06", "06",…
## $ ALAND <dbl> 614218330, 2230473967, 653701481, 1455320362, 3403160299, 8…
## $ AWATER <dbl> 12784614, 36173451, 42190675, 81582236, 33693344, 5136525, …
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.92876 3..., MULTIPOLYGON (…
The STATEFP and GEOID fields are initially read in as characters, but they are convered to numbers to make it easier to join with the zonal summary table later on. Then, as in Chapter 8, the dataset is filtered down to only the 48 conterminous United States.
county <- county %>%
mutate(state = as.numeric(STATEFP),
fips = as.numeric(GEOID)) %>%
filter(state != 2, state != 15, state < 60)
STATEFP
).GEOID
for orange county, new york state.prism
package to download climatology dataThe prism
package automates data downloading and
importing of PRISM climate data.
Setting options(prism.path = ".")
downloads and stores
the data in the current working directly.
To download the data to a different location in the file system, a directory path can be specified here instead.
options(prism.path = "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter9")
get_prism_normals(type = 'ppt',
resolution = '4km',
annual = T,
keepZip = TRUE)
Here the get_prism_normals()
function downloads the
30-year average annual precipitation in millimeter (mm).
The type = 'ppt'
argument specifies that
precipitation data will be downloaded.
The resolution
argument can be used to select either
4km
or 800m
grids.
The annual = TRUE
argument selects annual instead of
monthly summaries
The keepZip = TRUE
argument saves the downloaded ZIP
archives.
prism
functionsget_prism_monthlys()
function can similarly be used
to download monthly meteorological summaries for a specific range of
years and months.get_prism_annual()
and get_prism_dailys()
functions download annually and daily summaries, respectively.get_prism_monthlys(type = 'ppt',
years = 2018,
mon=1:12,
keepZip = TRUE)
prism_archive_ls()
The prism package also has functions for managing these downloaded data.
The prism_archive_ls()
function produces a vector
containing the names of all downloaded PRISM datasets.
prism_files <- prism_archive_ls()
prism_files
## [1] "PRISM_ppt_30yr_normal_4kmM4_annual_bil"
## [2] "PRISM_ppt_stable_4kmM3_201801_bil"
## [3] "PRISM_ppt_stable_4kmM3_201802_bil"
## [4] "PRISM_ppt_stable_4kmM3_201803_bil"
## [5] "PRISM_ppt_stable_4kmM3_201804_bil"
## [6] "PRISM_ppt_stable_4kmM3_201805_bil"
## [7] "PRISM_ppt_stable_4kmM3_201806_bil"
## [8] "PRISM_ppt_stable_4kmM3_201807_bil"
## [9] "PRISM_ppt_stable_4kmM3_201808_bil"
## [10] "PRISM_ppt_stable_4kmM3_201809_bil"
## [11] "PRISM_ppt_stable_4kmM3_201810_bil"
## [12] "PRISM_ppt_stable_4kmM3_201811_bil"
## [13] "PRISM_ppt_stable_4kmM3_201812_bil"
## [14] "PRISM_tmean_provisional_4kmM3_2023_bil"
## [15] "PRISM_tmean_provisional_4kmM3_202402_bil"
## [16] "PRISM_tmean_stable_4kmM3_2000_bil"
## [17] "PRISM_tmean_stable_4kmM3_2001_bil"
## [18] "PRISM_tmean_stable_4kmM3_2002_bil"
## [19] "PRISM_tmean_stable_4kmM3_2003_bil"
## [20] "PRISM_tmean_stable_4kmM3_2004_bil"
## [21] "PRISM_tmean_stable_4kmM3_2005_bil"
## [22] "PRISM_tmean_stable_4kmM3_2006_bil"
## [23] "PRISM_tmean_stable_4kmM3_2007_bil"
## [24] "PRISM_tmean_stable_4kmM3_2008_bil"
## [25] "PRISM_tmean_stable_4kmM3_2009_bil"
## [26] "PRISM_tmean_stable_4kmM3_2010_bil"
## [27] "PRISM_tmean_stable_4kmM3_2011_bil"
## [28] "PRISM_tmean_stable_4kmM3_2012_bil"
## [29] "PRISM_tmean_stable_4kmM3_2013_bil"
## [30] "PRISM_tmean_stable_4kmM3_2014_bil"
## [31] "PRISM_tmean_stable_4kmM3_2015_bil"
## [32] "PRISM_tmean_stable_4kmM3_2016_bil"
## [33] "PRISM_tmean_stable_4kmM3_2017_bil"
## [34] "PRISM_tmean_stable_4kmM3_2018_bil"
## [35] "PRISM_tmean_stable_4kmM3_2019_bil"
## [36] "PRISM_tmean_stable_4kmM3_2020_bil"
## [37] "PRISM_tmean_stable_4kmM3_2021_bil"
## [38] "PRISM_tmean_stable_4kmM3_2022_bil"
## [39] "PRISM_tmean_stable_4kmM3_202302_bil"
Now let’s read the downloaded files into raster files.
my_wd <- "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter9"
prism_paths <- file.path(my_wd,
prism_files,
paste0(prism_files, ".bil"))
prism_p30 <- rast(prism_paths[1])
prism_prec_2018 <- rast(prism_paths[2:13])
The first command saves all file paths into the vector
prims_paths
.
The first path which is the 30-year average precipitation, is
read into prism_p30
.
The second to thirteenth paths correspond to monthly precipitation in 2018.
Use what you learned from last class, visualize
prism_p30
raster file onto a map.
Zonal statistics are a type of polygon-on-raster overlay in which the values in the raster dataset are summarized within each polygon.
Next, we will do an example of summarizing mean annual precipitation for every county in the United States.
To calculate zonal statistics, it is necessary to convert the vector polygon dataset of counties to a raster dataset in which each raster cell is coded with the 5-digit FIPS code of the corresponding county.
cnty_ras <- rasterize(vect(county),
prism_p30,
field = "fips")
The rasterize()
function is used to convert vector to
raster data.
The first argument is the county
dataset to
summarize wrapped in the vect()
function to convert it to
the terra
vector data format.
The second argument is a raster dataset to provide parameters for the conversion.
The field = "fips"
argument indicates which column
in the county
vector dataset will be used to assign values
to the raster cells in the output.
zonal()
functionZonal statistics are generated wtih the zonal()
function.
cnty_p30 <- zonal(prism_p30,
cnty_ras,
fun = "mean",
na.rm = TRUE)
The first argument specifies the raster layer to summarize
The second argument specifies the zones to use for summarization.
The third argument specifies the summary function
The fourth argument removes NA
data before
summarizing
class(cnty_p30)
## [1] "data.frame"
summary(cnty_p30)
## fips PRISM_ppt_30yr_normal_4kmM4_annual_bil
## Min. : 1001 Min. : 79.69
## 1st Qu.:19040 1st Qu.: 762.15
## Median :29204 Median :1091.01
## Mean :30618 Mean :1022.63
## 3rd Qu.:45086 3rd Qu.:1273.89
## Max. :56045 Max. :3044.25
cnty_p30 <- rename(cnty_p30,
precip = 2)
The output cnty_p30
is a data frame with one row for
each county.
The precipitation summary column (the 2nd column) initially has
the long file name of the original image dataset, but is renamed as
precip
for simplicity.
When processing data, it is always advisable to run basic checks to ensure that the results make sense.
In this case, it is expected that the number of counties (rows) in
the cnty_p30
datasets containing zonal summaries should
match the number of counties in the original county dataset.
dim(cnty_p30)
## [1] 3102 2
dim(county)
## [1] 3109 15
setdiff(county$fips,
cnty_p30$fips)
## [1] 51600 51830 51580 51685 51610 51570 51678
However, this is not the case. There are fewer counties in the summary data table than there are counties in the polygon file.
To identify the locations of missing data, we can visualize data by
left_join
and ggplot
.
cnty_join1 <- left_join(county,
cnty_p30,
by = "fips")
ggplot(data = cnty_join1) +
geom_sf(aes(fill = precip), size = 0.1) +
scale_fill_continuous(name = "Precip (mm)") +
theme_bw() +
theme(legend.position = "bottom")
However, there are no obviously visible counties with missing data.
To make the map more readable, the
scale_fill_distiller()
function can be used to specify a
yellow-to-green-to-blue color palette, change the direction so that
areas with more precipitation are blue, and use a logarithmic scale
instead of a linear scale for precipitation.
ggplot(data = cnty_join1) +
geom_sf(aes(fill = precip), size = 0.1) +
scale_fill_distiller(name = "Precip (mm)",
palette = "YlGnBu",
direction = 1,
trans = "log",
breaks = c(100, 300, 1000, 3000)) +
theme_bw() +
theme(legend.position = "bottom")
Still, we can’t find obvious missing data from the whole US map. A more direct way is to find which state those missing values are in.
setdiff(county$fips, cnty_p30$fips)
## [1] 51600 51830 51580 51685 51610 51570 51678
From the result of the first exercise we see that all missing values occur in Virginia (state FIPS code of 51). Let’s just make a plot of Virginia.
va_join1 <- filter(cnty_join1,
STATEFP == "51")
ggplot(data = va_join1) +
geom_sf(aes(fill = precip), size = 0.1) +
scale_fill_distiller(name = "Precip (mm)",
palette = "YlGnBu",
direction = 1,
trans = "log",
breaks = c(1000, 1150, 1300)) +
theme_bw() +
theme(legend.position = "bottom")
It seems that there are some small grey areas (indicating missing values). Let’s zoom in a particular area to see what happens.
ggplot(data = va_join1) +
geom_sf(aes(fill = precip), size = 0.25) +
coord_sf(xlim = c(-77.6, -77.0),
ylim = c(38.6, 39.1)) +
scale_fill_distiller(name = "Precip (mm)",
palette = "YlGnBu",
direction = 1,
trans = "log",
breaks = c(1000, 1150, 1300)) +
theme_bw()
Virginia is unique among states in that it has many small, independent cities that are governed separately from the surrounding counties and are therefore assigned their own county FIPS codes.
What happens here is that when a city is too small, it may have no precipitation data assigned to it within a sampling grid of 4km in size.
The simplest way to deal with this problem is to use a finer sampling grid to generate the zonal statistics.
The disagg()
function is used to reduce the cell size of
the PRISM data by a factor of four, changing the cell size to
approximately 1 km.
Bilinear resampling is used to conduct linear interpolation between the center points of the 4 km grid cells to estimate values at the 1 km resolution.
prism_p30_1km <- disagg(prism_p30,
fact = 4,
method = "bilinear")
cnty_ras_1km <- rasterize(vect(county),
prism_p30_1km,
field = "fips")
cnty_p30_1km <- zonal(prism_p30_1km,
cnty_ras_1km,
fun = "mean",
na.rm=T)
cnty_p30_1km <- rename(cnty_p30_1km,
precip = 2)
This repeats the data preparation steps with the new finer grid.
Now the zonal summary table contains the same number of records as the county file.
dim(cnty_p30_1km)
## [1] 3109 2
dim(county)
## [1] 3109 15
setdiff(county$fips,
cnty_p30_1km$fips)
## numeric(0)
cnty_join2 <- left_join(county,
cnty_p30_1km,
by = "fips")
va_join2 <- filter(cnty_join2, STATEFP == "51")
ggplot(data = va_join2) +
geom_sf(aes(fill = precip), size = 0.25) +
scale_fill_distiller(name = "Precip (mm)",
palette = "YlGnBu",
direction = 1,
trans = "log",
breaks = c(1000, 1150, 1300)) +
coord_sf(xlim = c(-77.6, -77.0),
ylim = c(38.6, 39.1)) +
theme_bw()
Starting with the sf object containing the county-level precipitation summaries,
In the first part of the second project, we will exercise data import, data cleaning and data visualization using geospatial data.
Download the mean annual temperature data of each year from 2000
to 2023 using get_prism_annual()
function and specifying
type = 'tmean'
.
Properly clean and transform the data to obtain a data frame that
contains the mean annual temperature data (2000-2023) of each county in
New York State. There should be a column called year
for
year and a column called mean_temp
for mean annual
temperature.
Use wrap_facet
to plot the temperature map of New
York state counties between 2018 and 2023. Use proper settings to make
your visualization informative.
Find a way to analyze the temperature trend of orange county, NY between 2000 and 2023. You need to create at least one graph to visualize your result. Summarize your findings in words.
(Bonus) Analyze the trend of mean February temperature data between 2000 and 2024 specifically for Middletown, NY, with proper data visualization.