R (Lovelace, Nowosad, & Muenchow, 2019). This document was prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any questions, contact the instructor, Yongsung Lee, Ph.D., via yongsung.lee(at)gatech.edu.
if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("stringr")) install.packages("stringr", dependencies = TRUE)
if (!require("sf")) install.packages("sf", dependencies = TRUE)
if (!require("raster")) install.packages("raster", dependencies = TRUE)
if (!require("spData")) install.packages("spData", dependencies = TRUE)
if (!require("devtools")) install.packages("devtools") # for this, you need Rtools installed on your machine
if (!require("spDataLarge")) devtools::install_github("Nowosad/spDataLarge")
if (!require("sp")) install.packages("sp", dependencies = TRUE)
library(tidyverse)
library(stringr) # for working with strings (pattern matching)
library(sf) # classes and functions for vector data
library(raster) # classes and functions for raster data
library(spData) # load geographic data
library(spDataLarge) # load larger geographic data
library(sp)
What we do:
sf and raster objects by attributes
Why we do:
We’ll learn three types of spatial data operations: by attributes, by spatial relationships, and by geometric manipulation (i.e., geoprocessing).
Since input spatial data are rarely in the formats appropriate for our analysis, we need to extract parts of them, examine spatial relationships among them, or change their geometries.
sf provides methods that allow sf objects to behave like regular data frames.
methods(class = "sf")
If you’re interested in the methods available for other classes, try the next two lines:
methods(class = "data.frame")
methods(class = "tbl")
sf objects store spatial and non-spatial data in the same way, as columns in a data.frame.
dim(world)
nrow(world)
ncol(world)
If you want to take out only the data frame part from an sf object…
world_df = st_set_geometry(world, NULL)
# compare the types of the two objects
class(world)
class(world_df)
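To see the same idea on a tiny object, here is a minimal sketch that builds a two-row sf object by hand and checks its class before and after dropping the geometry. The names `pts` and `pts_df` and the toy coordinates are hypothetical and not part of the course data.

```r
library(sf)

# hypothetical toy data: two points with one attribute column
pts <- st_sf(
  name = c("a", "b"),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))
)

pts_df <- st_set_geometry(pts, NULL) # drop the geometry list-column

inherits(pts, "sf")    # is the original a spatial object?
inherits(pts_df, "sf") # is the result still a spatial object?
names(pts_df)          # which columns survive
```

Only the attribute column remains in `pts_df`, which is why it behaves like an ordinary data frame afterwards.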
world[1:6, ] # subset rows by position
world[, 1:3] # subset columns by position
world[, c("name_long", "lifeExp")] # subset columns by name
Select rows (i.e., observations, cases, data points, or features)
sel_area = world$area_km2 < 10000
typeof(sel_area) # check its data type
class(sel_area) # check its class
summary(sel_area)
small_countries = world[sel_area, ]
typeof(small_countries)
class(small_countries)
summary(small_countries)
# instead of creating an intermediary object,
test1 <- world[world$area_km2 < 10000, ] # with data frame syntax (df name & comma)
test2 <- world %>% # with tidyverse syntax (no df name & no comma)
filter(area_km2 < 10000)
# compare the two objects
# https://stackoverflow.com/questions/19119320/how-to-check-if-two-data-frames-are-equal
identical(test1, test2) # returns only TRUE/FALSE
all.equal(test1, test2) # returns some clues on how two data differ
test3 <- subset(world, area_km2 < 10000) # from base R
identical(test1, test3)
world[3:5, ]
slice(world, 3:5)
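Row subsetting with a logical vector can also be checked on a small, non-spatial data frame. The sketch below uses a hypothetical data frame `d` with made-up values; it is only meant to show that the bracket syntax and `subset()` select the same rows.

```r
# hypothetical toy data frame with made-up areas
d <- data.frame(
  name = c("A", "B", "C", "D"),
  area_km2 = c(500, 25000, 9000, 120000)
)

sel <- d$area_km2 < 10000                # logical vector, one value per row
small_base <- d[sel, ]                   # bracket syntax: keep rows where sel is TRUE
small_sub <- subset(d, area_km2 < 10000) # subset(): same condition, no df name needed

small_base$name
small_sub$name
```

The comma in `d[sel, ]` matters: without it, `sel` would be interpreted as a column selector instead of a row selector.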
Select columns (i.e., variables, or attributes)
world1 <- dplyr::select(world, name_long, pop)
world2 <- world %>%
dplyr::select(name_long, pop)
identical(world1, world2)
all.equal(world1, world2)
world2 <- world %>%
dplyr::select(name_long:pop)
world3 <- world %>%
dplyr::select(-subregion, -area_km2)
world4 <- world %>%
dplyr::select(name_long, population = pop) # select & rename at the same time
names(world4)
# rename a variable in the R base way
world5 = world[, c("name_long", "pop")] # first subset
names(world5)[names(world5) == "pop"] = "population" # next rename a variable
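Renaming through `names()` depends on assignment, not comparison, and the two are easy to confuse. The following sketch, on a hypothetical data frame `d`, shows that a comparison leaves the names untouched while an assignment changes them.

```r
# hypothetical toy data frame
d <- data.frame(name_long = c("A", "B"), pop = c(10, 20))

is_pop <- names(d) == "pop" # comparison: a logical vector, nothing is renamed
before <- names(d)

names(d)[is_pop] <- "population" # assignment: the matching name is replaced
after <- names(d)

before
after
```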
Subsetting is tricky at times. For a data frame:
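- With a row and column index (e.g., d[, "pop"]), single brackets [] return the lowest possible dimension by default, so a single column comes back as a vector.
- Double brackets [[]] extract a single column itself, as a vector.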
d = data.frame(pop = 1:10, area = 1:10)
d[, "pop"]
d[, "pop", drop = FALSE] # to change the default behavior
d["pop"] # think of d as a list, then [] returns a sublist
d[["pop"]] # think of d as a list, then [[]] returns actual elements
dplyr::select(d, pop) # tidyverse verb: always returns df
pull(d, pop)
# geometry list-column is sticky
world[, "pop"]
world$pop
pull(world, pop)
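The sticky geometry can be verified on a small hand-built object. This is a sketch under the assumption that a two-point sf object named `pts` (hypothetical, not from spData) behaves like `world` in this respect.

```r
library(sf)

# hypothetical toy sf object with two attribute columns
pts <- st_sf(
  pop = c(10, 20),
  area = c(1, 2),
  geometry = st_sfc(st_point(c(0, 0)), st_point(c(1, 1)))
)

sub <- pts[, "pop"] # bracket subsetting on sf keeps the geometry column
vec <- pts$pop      # $ extracts the plain attribute vector

names(sub)
inherits(sub, "sf")
is.numeric(vec)
```

Even though only `pop` was requested, `sub` still carries the geometry column, whereas `vec` is an ordinary numeric vector.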
Examples doing subsetting in two ways
world7 <- world %>%
filter(continent == "Asia") %>%
dplyr::select(name_long, continent) %>%
slice(6:10)
# without pipes, it gets confusing: from inside towards outside, not from top to bottom
world8 <- slice(
dplyr::select(
filter(world, continent == "Asia"),
name_long, continent),
6:10)
Aggregation operations summarize datasets by a grouping variable (i.e., group_by() in non-spatial data manipulation).
# base R function
world_agg1 = aggregate(pop ~ continent, FUN = sum, data = world, na.rm = TRUE)
class(world_agg1)
aggregate() is a generic function, which means that it behaves differently depending on its inputs. sf provides a method, sf:::aggregate.sf(), that is dispatched when the first argument is an sf object and a by argument is provided, rather than using ~ to refer to the grouping variable:
# sf function on an sf object
world_agg2 = aggregate(world["pop"], by = list(world$continent), # target data is sf; grouping variable is a vector
FUN = sum, na.rm = TRUE)
class(world_agg2)
# base R method on a vector (returns a data.frame, not an sf object)
world_agg2 = aggregate(world$pop, by = list(world$continent), # target data is a vector; grouping variable is a vector
FUN = sum, na.rm = TRUE)
class(world_agg2)
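The two base R calling styles of `aggregate()` can be compared on a small non-spatial example. The data frame `d` below is hypothetical; the sketch only illustrates the formula interface versus the `by` interface and how each handles a missing value.

```r
# hypothetical toy data with one missing population value
d <- data.frame(
  continent = c("Asia", "Asia", "Europe"),
  pop = c(10, NA, 5)
)

# formula interface: rows with NA are dropped by default (na.action = na.omit)
agg_formula <- aggregate(pop ~ continent, data = d, FUN = sum)

# by interface: the grouping variable is passed as a list; na.rm goes to sum()
agg_by <- aggregate(d$pop, by = list(continent = d$continent),
                    FUN = sum, na.rm = TRUE)

agg_formula
agg_by
```

Note that the `by` interface names the result column `x` when it is given a bare vector, while the formula interface keeps the name `pop`.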
Finally, with dplyr functions (in tidyverse ways)!
world_agg3 = world %>%
group_by(continent) %>%
summarize(pop = sum(pop, na.rm = TRUE))
class(world_agg3)
world %>%
group_by(continent) %>%
summarize(
pop = sum(pop, na.rm = TRUE),
n = n()
)
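For comparison, here is the same grouping logic on a small tibble without geometry. The tibble `d` and its values are hypothetical; the sketch just shows what `group_by()` followed by `summarize()` returns.

```r
library(dplyr)

# hypothetical toy data with one missing population value
d <- tibble(
  continent = c("Asia", "Asia", "Europe"),
  pop = c(10, NA, 5)
)

out <- d %>%
  group_by(continent) %>%
  summarize(
    pop = sum(pop, na.rm = TRUE), # total population, ignoring NA
    n = n()                       # number of rows in each group
  )

out
```

`n()` counts rows, so the row with the missing population still contributes to `n` even though it adds nothing to the sum.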
A little longer example.
world %>%
dplyr::select(pop, continent) %>%
group_by(continent) %>%
summarize(
pop = sum(pop, na.rm = TRUE),
n_countries = n()
) %>%
top_n(n = 3, wt = pop) %>%
st_drop_geometry() # delete the list-column and change the class from sf to df
Join an sf object with a data.frame.
str(coffee_data)
# when the joining variable is not specified, left_join() uses the variables with the same names in the two inputs
world_coffee = left_join(world, coffee_data)
class(world_coffee)
# output has the same class as the first argument
test = left_join(coffee_data, world)
class(test)
map_chr(test, typeof)
map(test, class) # the geom list-column is there, but R treats it as a normal column.
test_sf <- st_as_sf(test) # change the class from df to sf, as long as df has simple feature column.
class(test_sf)
names(world_coffee) # notice the sticky geom list-column
plot(world_coffee["coffee_production_2017"])
coffee_renamed = rename(coffee_data, nm = name_long)
coffee_renamed <- coffee_data %>%
mutate(nm = name_long) %>%
dplyr::select(-name_long) # longer, but more intuitive
# when there are no common variables, left_join() cannot guess the key and stops with an error
world_coffee2 = left_join(world, coffee_renamed)
# to avoid confusion, always put common variables in ""
world_coffee2 = left_join(world, coffee_renamed, by = c(name_long = "nm")) # the second variable names must be in ""
world_coffee2 = left_join(world, coffee_renamed, by = c("name_long" = "nm")) # this also works
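A join on differently named keys can be tried out on two small tables first. The tibbles `left_tbl` and `right_tbl` and their values are hypothetical; the sketch assumes only that dplyr is installed.

```r
library(dplyr)

# hypothetical toy tables whose key columns have different names
left_tbl <- tibble(name_long = c("Brazil", "Kenya", "Peru"))
right_tbl <- tibble(nm = c("Brazil", "Kenya"), coffee = c(100, 20))

# the left-hand name refers to the first table, the right-hand name to the second
joined <- left_join(left_tbl, right_tbl, by = c("name_long" = "nm"))

joined
```

A left join keeps every row of the first table, so the unmatched country stays in the result with a missing `coffee` value, and the key column keeps the name it has in the first table.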
world_coffee_inner = inner_join(world, coffee_data)
nrow(world_coffee_inner)
Use setdiff to examine which countries have different names. Note that setdiff is order-dependent (check each pair of two vectors).
setdiff(coffee_data$name_long, world$name_long)
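The order dependence of `setdiff()` is easy to see with two short vectors. The vectors `a` and `b` below are made-up stand-ins for the two name columns.

```r
# hypothetical stand-ins for the two name columns
a <- c("Brazil", "Congo, Dem. Rep. of", "Others")
b <- c("Brazil", "Democratic Republic of the Congo")

in_a_only <- setdiff(a, b) # elements of a that are not in b
in_b_only <- setdiff(b, a) # elements of b that are not in a

in_a_only
in_b_only
```

Checking both directions shows which names fail to match on each side.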
Deal with character strings: a brief introduction to regular expressions (regex) with base R functions.
- We did not cover Chapter 11 Strings in R4DS, which uses the stringr package.
- Also, another great resource is made by Dr. Roger Peng explaining base R functions for regex.
- Cheat sheet for basic regular expressions in R
# returns a matched value(s)
str_subset(world$name_long, "Dem*.+Congo")
?str_subset
grep("Dem*.+Congo", world$name_long, value = TRUE)
grep("Dem*.+Congo", world$name_long) # by default, value = FALSE, so it returns the matching indices
grepl("Dem*.+Congo", world$name_long) # returns a logical vector
coffee_data$name_long[grepl("Congo,", coffee_data$name_long)] =
str_subset(world$name_long, "Dem*.+Congo")
world_coffee_match = inner_join(world, coffee_data)
nrow(world_coffee_match)
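To see what the pattern does, it can be run against a short vector of names. The vector `x` is hypothetical. In the pattern, `Dem*` means "De" followed by zero or more "m", and `.+` means one or more of any character before "Congo".

```r
# hypothetical vector of country names
x <- c("Democratic Republic of the Congo", "Republic of the Congo", "Denmark")
pattern <- "Dem*.+Congo"

idx <- grep(pattern, x)                # positions of matching elements
val <- grep(pattern, x, value = TRUE)  # the matching elements themselves
lgl <- grepl(pattern, x)               # one TRUE/FALSE per element

idx
val
lgl
```

Only the first name contains both "De" and a later "Congo", so it is the only match; this is what lets the pattern tell the two Congos apart.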
Create new variables, unite/separate variables, and rename variables.
# base R function
world_new = world # do not overwrite our original data
world_new$pop_dens = world_new$pop / world_new$area_km2
# dplyr functions
world %>%
mutate(pop_dens = pop / area_km2)
world %>%
transmute(pop_dens = pop / area_km2) # keeps only the new variable (and the sticky geometry column); all other variables are dropped
# tidyr functions - unite, separate
world_unite = world %>%
unite("con_reg", continent:region_un, sep = ":", remove = TRUE) # remove = TRUE: the original columns are removed
world_separate = world_unite %>%
separate(con_reg, c("continent", "region_un"), sep = ":")
# rename variables
world %>%
rename(name = name_long) # change a column name
new_names = c("i", "n", "c", "r", "s", "t", "a", "p", "l", "gP", "geom")
world %>%
set_names(new_names) # change all column names
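The unite and separate steps are inverses of each other, which can be confirmed on a small data frame. The data frame `d` is hypothetical; the sketch assumes tidyr is installed.

```r
library(tidyr)

# hypothetical toy data frame
d <- data.frame(
  continent = c("Asia", "Europe"),
  region_un = c("Asia", "Europe"),
  pop = c(1, 2)
)

# paste the two columns into one, separated by ":"; the originals are removed
u <- unite(d, "con_reg", continent:region_un, sep = ":", remove = TRUE)

# split the combined column back into its two parts
s <- separate(u, con_reg, c("continent", "region_un"), sep = ":")

names(u)
u$con_reg
names(s)
```

The separator must not occur inside the values themselves, otherwise `separate()` would split in the wrong place.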
Change an sf object to data.frame.
world_data = world %>% st_drop_geometry()
class(world_data)
Pause for in-class exercises here.
Skip
For these exercises we will use the us_states and us_states_df datasets from the spData package:
library(spData)
data(us_states)
data(us_states_df)
us_states is a spatial object (of class sf), containing geometry and a few attributes (including name, region, area, and population) of states within the contiguous United States. us_states_df is a data frame (of class data.frame) containing the name and additional variables (including median income and poverty level, for the years 2010 and 2015) of US states, including Alaska, Hawaii and Puerto Rico. The data comes from the United States Census Bureau, and is documented in ?us_states and ?us_states_df.
Create a new object called us_states_name that contains only the NAME column from the us_states object. What is the class of the new object and what makes it geographic?
Select columns from the us_states object which contain population data. Obtain the same result using a different command (bonus: try to find three ways of obtaining the same result). Hint: try to use helper functions, such as contains or starts_with from dplyr (see ?contains).
Find all states with the following characteristics (bonus: find and plot them):
units::set_units() or as.numeric()). A good reference
What was the total population in 2015 in the us_states dataset? What was the minimum and maximum total population in 2015?
How many states are there in each region?
What was the minimum and maximum total population in 2015 in each region? What was the total population in 2015 in each region?
Add variables from us_states_df to us_states, and create a new object called us_states_stats. What function did you use and why? Which variable is the key in both datasets? What is the class of the new object?
us_states_df has two more rows than us_states. How can you find them? (hint: try to use the dplyr::anti_join() function)
What was the population density in 2015 in each state? What was the population density in 2010 in each state?
How much has population density changed between 2010 and 2015 in each state? Calculate the change in percentages and map them.
Change the columns’ names in us_states to lowercase. (Hint: helper functions - tolower() and colnames() may help.)
Using us_states and us_states_df create a new object called us_states_sel. The new object should have only two variables - median_income_15 and geometry. Change the name of the median_income_15 column to Income.
Calculate the change in median income between 2010 and 2015 for each state. Bonus: What was the minimum, average and maximum median income in 2015 for each region? What is the region with the largest increase of the median income?