Introduction


This module will study chapter 9 of the textbook (linked provided below). The contents in the slides are also from the textbook.

https://bookdown.org/mcwimberly/gdswr-book/combining-vector-data-with-continuous-raster-data.html

We will learn how to combine data from multiple sources, which can involve both vector data and raster data.

For example, geospatial climate and weather data are typically provided as rasters, while geospatial data on the demographics of human populations are typically referenced to polygons.

To understand how different populations subgroups may be exposed to different climate and meteorological hazards, it is necessary to integrate data with fundamentally different geospatial structures.

In this module, we will learn how to combine continuous raster data with vector data.

Load Libraries


This chapter will present several analyses using the PRISM dataset:

http://www.prism.oregonstate.edu

This is a widely-used interpolated climate dataset for the United States that is based on data from meteorological stations combined with other ancillary datasets such as elevation.

We will use the following libraries.

library(tidyverse)
library(sf)
library(terra)
library(prism)
library(tigris)

Accessing Data with R Packages


In previous chapters, various functions have been used to import data into R and convert them into appropriate objects for tabular data and geospatial data in vector and raster formats.

This chapter will use a different approach for bringing data into R. County boundaries and gridded climate and meteorological data will be imported directly from online archives using two R packages specifically designed to facilitate data access: tigris (Walker 2022) and prism (Edmund and Bell 2020).

Downloading US county Data using tigris


We will use the tigris package to download a county shapefile with the counties function.

county <- counties(cb = TRUE, 
                   resolution = '20m')
class(county)
## [1] "sf"         "data.frame"
glimpse(county)
## Rows: 3,222
## Columns: 13
## $ STATEFP    <chr> "17", "27", "37", "47", "06", "17", "19", "22", "18", "33",…
## $ COUNTYFP   <chr> "127", "017", "181", "079", "021", "093", "095", "003", "05…
## $ COUNTYNS   <chr> "01784730", "00659454", "01008591", "01639755", "00277275",…
## $ AFFGEOID   <chr> "0500000US17127", "0500000US27017", "0500000US37181", "0500…
## $ GEOID      <chr> "17127", "27017", "37181", "47079", "06021", "17093", "1909…
## $ NAME       <chr> "Massac", "Carlton", "Vance", "Henry", "Glenn", "Kendall", …
## $ NAMELSAD   <chr> "Massac County", "Carlton County", "Vance County", "Henry C…
## $ STUSPS     <chr> "IL", "MN", "NC", "TN", "CA", "IL", "IA", "LA", "IN", "NH",…
## $ STATE_NAME <chr> "Illinois", "Minnesota", "North Carolina", "Tennessee", "Ca…
## $ LSAD       <chr> "06", "06", "06", "06", "06", "06", "06", "15", "06", "06",…
## $ ALAND      <dbl> 614218330, 2230473967, 653701481, 1455320362, 3403160299, 8…
## $ AWATER     <dbl> 12784614, 36173451, 42190675, 81582236, 33693344, 5136525, …
## $ geometry   <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.92876 3..., MULTIPOLYGON (…

Some data preparation


The STATEFP and GEOID fields are initially read in as characters, but they are convered to numbers to make it easier to join with the zonal summary table later on. Then, as in Chapter 8, the dataset is filtered down to only the 48 conterminous United States.

county <- county %>%
  mutate(state = as.numeric(STATEFP),
         fips = as.numeric(GEOID)) %>%
  filter(state != 2, state != 15, state < 60) 

Lab Exercise


  1. Print all state names corresponding to state codes (STATEFP).
  2. Find the GEOID for orange county, new york state.

Use prism package to download climatology data


The prism package automates data downloading and importing of PRISM climate data.

Setting options(prism.path = ".") downloads and stores the data in the current working directly.

To download the data to a different location in the file system, a directory path can be specified here instead.


options(prism.path = "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter9")
get_prism_normals(type = 'ppt', 
                  resolution = '4km', 
                  annual = T,
                  keepZip = TRUE)

Here the get_prism_normals() function downloads the 30-year average annual precipitation in millimeter (mm).

  • The type = 'ppt' argument specifies that precipitation data will be downloaded.

  • The resolution argument can be used to select either 4km or 800m grids.

  • The annual = TRUE argument selects annual instead of monthly summaries

  • The keepZip = TRUE argument saves the downloaded ZIP archives.

Other prism functions


  • The get_prism_monthlys() function can similarly be used to download monthly meteorological summaries for a specific range of years and months.
  • get_prism_annual() and get_prism_dailys() functions download annually and daily summaries, respectively.


get_prism_monthlys(type = 'ppt', 
                   years = 2018,  
                   mon=1:12, 
                   keepZip = TRUE)

Managing downloaded data with prism_archive_ls()


The prism package also has functions for managing these downloaded data.

The prism_archive_ls() function produces a vector containing the names of all downloaded PRISM datasets.

prism_files <- prism_archive_ls()
prism_files
##  [1] "PRISM_ppt_30yr_normal_4kmM4_annual_bil"  
##  [2] "PRISM_ppt_stable_4kmM3_201801_bil"       
##  [3] "PRISM_ppt_stable_4kmM3_201802_bil"       
##  [4] "PRISM_ppt_stable_4kmM3_201803_bil"       
##  [5] "PRISM_ppt_stable_4kmM3_201804_bil"       
##  [6] "PRISM_ppt_stable_4kmM3_201805_bil"       
##  [7] "PRISM_ppt_stable_4kmM3_201806_bil"       
##  [8] "PRISM_ppt_stable_4kmM3_201807_bil"       
##  [9] "PRISM_ppt_stable_4kmM3_201808_bil"       
## [10] "PRISM_ppt_stable_4kmM3_201809_bil"       
## [11] "PRISM_ppt_stable_4kmM3_201810_bil"       
## [12] "PRISM_ppt_stable_4kmM3_201811_bil"       
## [13] "PRISM_ppt_stable_4kmM3_201812_bil"       
## [14] "PRISM_tmean_provisional_4kmM3_2023_bil"  
## [15] "PRISM_tmean_provisional_4kmM3_202402_bil"
## [16] "PRISM_tmean_stable_4kmM3_2000_bil"       
## [17] "PRISM_tmean_stable_4kmM3_2001_bil"       
## [18] "PRISM_tmean_stable_4kmM3_2002_bil"       
## [19] "PRISM_tmean_stable_4kmM3_2003_bil"       
## [20] "PRISM_tmean_stable_4kmM3_2004_bil"       
## [21] "PRISM_tmean_stable_4kmM3_2005_bil"       
## [22] "PRISM_tmean_stable_4kmM3_2006_bil"       
## [23] "PRISM_tmean_stable_4kmM3_2007_bil"       
## [24] "PRISM_tmean_stable_4kmM3_2008_bil"       
## [25] "PRISM_tmean_stable_4kmM3_2009_bil"       
## [26] "PRISM_tmean_stable_4kmM3_2010_bil"       
## [27] "PRISM_tmean_stable_4kmM3_2011_bil"       
## [28] "PRISM_tmean_stable_4kmM3_2012_bil"       
## [29] "PRISM_tmean_stable_4kmM3_2013_bil"       
## [30] "PRISM_tmean_stable_4kmM3_2014_bil"       
## [31] "PRISM_tmean_stable_4kmM3_2015_bil"       
## [32] "PRISM_tmean_stable_4kmM3_2016_bil"       
## [33] "PRISM_tmean_stable_4kmM3_2017_bil"       
## [34] "PRISM_tmean_stable_4kmM3_2018_bil"       
## [35] "PRISM_tmean_stable_4kmM3_2019_bil"       
## [36] "PRISM_tmean_stable_4kmM3_2020_bil"       
## [37] "PRISM_tmean_stable_4kmM3_2021_bil"       
## [38] "PRISM_tmean_stable_4kmM3_2022_bil"       
## [39] "PRISM_tmean_stable_4kmM3_202302_bil"

Read the file as raster files


Now let’s read the downloaded files into raster files.

my_wd <- "~/Documents/Fei Tian/Course_DAS522_Exploratory_Data_Analysis_and_Visualization_Spring2024/Datasets/Geospatial/gdswr_data/Chapter9"

prism_paths <- file.path(my_wd, 
                         prism_files, 
                         paste0(prism_files, ".bil"))
prism_p30 <- rast(prism_paths[1])
prism_prec_2018 <- rast(prism_paths[2:13])

The first command saves all file paths into the vector prims_paths.

  • The first path which is the 30-year average precipitation, is read into prism_p30.

  • The second to thirteenth paths correspond to monthly precipitation in 2018.

Lab Exercise


Use what you learned from last class, visualize prism_p30 raster file onto a map.

Zonal Statistics


Zonal statistics are a type of polygon-on-raster overlay in which the values in the raster dataset are summarized within each polygon.

Next, we will do an example of summarizing mean annual precipitation for every county in the United States.

To calculate zonal statistics, it is necessary to convert the vector polygon dataset of counties to a raster dataset in which each raster cell is coded with the 5-digit FIPS code of the corresponding county.


cnty_ras <- rasterize(vect(county), 
                      prism_p30, 
                      field = "fips")

The rasterize() function is used to convert vector to raster data.

  • The first argument is the county dataset to summarize wrapped in the vect() function to convert it to the terra vector data format.

  • The second argument is a raster dataset to provide parameters for the conversion.

  • The field = "fips" argument indicates which column in the county vector dataset will be used to assign values to the raster cells in the output.

The zonal() function


Zonal statistics are generated wtih the zonal() function.

cnty_p30 <- zonal(prism_p30, 
                  cnty_ras, 
                  fun = "mean", 
                  na.rm = TRUE)
  • The first argument specifies the raster layer to summarize

  • The second argument specifies the zones to use for summarization.

  • The third argument specifies the summary function

  • The fourth argument removes NA data before summarizing

Summary and rename the combined zonal data


class(cnty_p30)
## [1] "data.frame"
summary(cnty_p30)
##       fips       PRISM_ppt_30yr_normal_4kmM4_annual_bil
##  Min.   : 1001   Min.   :  79.69                       
##  1st Qu.:19040   1st Qu.: 762.15                       
##  Median :29204   Median :1091.01                       
##  Mean   :30618   Mean   :1022.63                       
##  3rd Qu.:45086   3rd Qu.:1273.89                       
##  Max.   :56045   Max.   :3044.25
cnty_p30 <- rename(cnty_p30, 
                   precip = 2)
  • The output cnty_p30 is a data frame with one row for each county.

  • The precipitation summary column (the 2nd column) initially has the long file name of the original image dataset, but is renamed as precip for simplicity.

Data completion check


When processing data, it is always advisable to run basic checks to ensure that the results make sense.

In this case, it is expected that the number of counties (rows) in the cnty_p30 datasets containing zonal summaries should match the number of counties in the original county dataset.

dim(cnty_p30)
## [1] 3102    2
dim(county)
## [1] 3109   15
setdiff(county$fips, 
        cnty_p30$fips)
## [1] 51600 51830 51580 51685 51610 51570 51678

However, this is not the case. There are fewer counties in the summary data table than there are counties in the polygon file.

Visualizing data


To identify the locations of missing data, we can visualize data by left_join and ggplot.

cnty_join1 <- left_join(county, 
                        cnty_p30, 
                        by = "fips")

ggplot(data = cnty_join1) +
  geom_sf(aes(fill = precip), size = 0.1) +
  scale_fill_continuous(name = "Precip (mm)") +
  theme_bw() +
  theme(legend.position = "bottom")

However, there are no obviously visible counties with missing data.

Polish the graph appearance


To make the map more readable, the scale_fill_distiller() function can be used to specify a yellow-to-green-to-blue color palette, change the direction so that areas with more precipitation are blue, and use a logarithmic scale instead of a linear scale for precipitation.

ggplot(data = cnty_join1) +
  geom_sf(aes(fill = precip), size = 0.1) +
  scale_fill_distiller(name = "Precip (mm)", 
                       palette = "YlGnBu", 
                       direction = 1,
                       trans = "log", 
                       breaks = c(100, 300, 1000, 3000)) +
  theme_bw() +
  theme(legend.position = "bottom")

Identify the locations of missing values


Still, we can’t find obvious missing data from the whole US map. A more direct way is to find which state those missing values are in.

setdiff(county$fips, cnty_p30$fips)
## [1] 51600 51830 51580 51685 51610 51570 51678

From the result of the first exercise we see that all missing values occur in Virginia (state FIPS code of 51). Let’s just make a plot of Virginia.

va_join1 <- filter(cnty_join1, 
                   STATEFP == "51")

ggplot(data = va_join1) +
  geom_sf(aes(fill = precip), size = 0.1) +
  scale_fill_distiller(name = "Precip (mm)", 
                       palette = "YlGnBu", 
                       direction = 1,
                       trans = "log", 
                       breaks = c(1000, 1150, 1300)) +
  theme_bw() +
  theme(legend.position = "bottom")

Identify the locations of missing values


It seems that there are some small grey areas (indicating missing values). Let’s zoom in a particular area to see what happens.

ggplot(data = va_join1) +
  geom_sf(aes(fill = precip), size = 0.25) +
  coord_sf(xlim = c(-77.6, -77.0), 
           ylim = c(38.6, 39.1)) +
  scale_fill_distiller(name = "Precip (mm)", 
                       palette = "YlGnBu", 
                       direction = 1,
                       trans = "log", 
                       breaks = c(1000, 1150, 1300)) +
  theme_bw()

What’s special about Virginia?


Virginia is unique among states in that it has many small, independent cities that are governed separately from the surrounding counties and are therefore assigned their own county FIPS codes.

What happens here is that when a city is too small, it may have no precipitation data assigned to it within a sampling grid of 4km in size.

The simplest way to deal with this problem is to use a finer sampling grid to generate the zonal statistics.

Create finer grid for rasterization


The disagg() function is used to reduce the cell size of the PRISM data by a factor of four, changing the cell size to approximately 1 km.

Bilinear resampling is used to conduct linear interpolation between the center points of the 4 km grid cells to estimate values at the 1 km resolution.

prism_p30_1km <- disagg(prism_p30, 
                        fact = 4, 
                        method = "bilinear")
cnty_ras_1km <- rasterize(vect(county), 
                          prism_p30_1km, 
                          field = "fips")
cnty_p30_1km <- zonal(prism_p30_1km, 
                      cnty_ras_1km, 
                      fun = "mean", 
                      na.rm=T)
cnty_p30_1km <- rename(cnty_p30_1km, 
                       precip = 2)

This repeats the data preparation steps with the new finer grid.

Recheck data completion


Now the zonal summary table contains the same number of records as the county file.

dim(cnty_p30_1km)
## [1] 3109    2
dim(county)
## [1] 3109   15
setdiff(county$fips, 
        cnty_p30_1km$fips)
## numeric(0)

Redo the Virginia map


cnty_join2 <- left_join(county, 
                        cnty_p30_1km, 
                        by = "fips")
va_join2 <- filter(cnty_join2, STATEFP == "51")

ggplot(data = va_join2) +
  geom_sf(aes(fill = precip), size = 0.25) +
  scale_fill_distiller(name = "Precip (mm)", 
                       palette = "YlGnBu", 
                       direction = 1,
                       trans = "log", 
                       breaks = c(1000, 1150, 1300)) +
  coord_sf(xlim = c(-77.6, -77.0), 
           ylim = c(38.6, 39.1)) +
  theme_bw()

Lab Homework (Required)


Starting with the sf object containing the county-level precipitation summaries,

  1. convert the precipitation units from millimeters to inches
  2. classify precipitation into a factor with five categories: 0-10, 10-35, 35-50, 50-70, and > 70 inches
  3. Generate a new map of precipitation using these categories with proper aesthetic details.

Project #2 (Part One)


In the first part of the second project, we will exercise data import, data cleaning and data visualization using geospatial data.

  1. Download the mean annual temperature data of each year from 2000 to 2023 using get_prism_annual() function and specifying type = 'tmean'.

  2. Properly clean and transform the data to obtain a data frame that contains the mean annual temperature data (2000-2023) of each county in New York State. There should be a column called year for year and a column called mean_temp for mean annual temperature.

  3. Use wrap_facet to plot the temperature map of New York state counties between 2018 and 2023. Use proper settings to make your visualization informative.

  4. Find a way to analyze the temperature trend of orange county, NY between 2000 and 2023. You need to create at least one graph to visualize your result. Summarize your findings in words.

  5. (Bonus) Analyze the trend of mean February temperature data between 2000 and 2024 specifically for Middletown, NY, with proper data visualization.