Attribute data operations

Disclaimer: The contents of this document come from Chapter 3. Attribute data operations of Geocomputation with `R`(Lovelace, Nowosad, & Muenchow, 2019). This document is prepared for CP6521 Advanced GIS, a graduate-level city planning elective course at Georgia Tech in Spring 2019. For any question, contact the instructor, Yongsung Lee, Ph.D. via yongsung.lee(at)gatech.edu.

This document is also published on RPubs.

if (!require("tidyverse")) install.packages("tidyverse", dependencies = TRUE)
if (!require("stringr")) install.packages("stringr", dependencies = TRUE)
if (!require("sf")) install.packages("sf", dependencies = TRUE)
if (!require("raster")) install.packages("raster", dependencies = TRUE)
if (!require("spData")) install.packages("spData", dependencies = TRUE)
if (!require("devtools")) install.packages("devtools") # for this, you need Rtools installed on your machine 
if (!require("spDataLarge")) devtools::install_github("Nowosad/spDataLarge")
if (!require("sp")) install.packages("sp", dependencies = TRUE)

library(tidyverse)
library(stringr)     # for working with strings (pattern matching)
library(sf)          # classes and functions for vector data
library(raster)      # classes and functions for raster data
library(spData)      # load geographic data
library(spDataLarge) # load larger geographic data
library(sp)

Introduction

What we do:

Learn how to subset sf and raster objects by attributes

Why we do:

We’ll learn three types of spatial data operations, by attributes, by spatial relationships, and by geometric manipulation (i.e., geoprocessing).
Since input spatial data are rarely in the formats appropriate for our analysis, we need to extract parts of them, examine spatial relationships among them, or change their geometries.

3.1 Introduction

3.2 Vector attribute manipulation

sf provides methods that allow sf objects to behave like regular data frames.

methods(class = "sf")

If you’re interested in available functions on other classes, try the next two lines

methods(class = "data.frame")
methods(class = "tbl")

sf objects store spatial and non-spatial data in the same way, as columns in a data.frame.

dim(world)
nrow(world)
ncol(world)

If you want to take out only the data frame part from an sf object…

world_df = st_set_geometry(world, NULL)
# compare the types of the two objects 
class(world)
class(world_df)

3.2.1 Vector attribute subsetting

world[1:6, ]       # subset rows by position 
world[, 1:3]       # subset columns by position 
world[, c("name_long", "lifeExp")]         # subset columns by name

Select rows (i.e., observations, cases, data points, or features)

sel_area = world$area_km2 < 10000
typeof(sel_area)  # check its data type 
class(sel_area)   # check its data type 
summary(sel_area) 

small_countries = world[sel_area, ]
typeof(small_countries)
class(small_countries)
summary(small_countries)

# instead of creating an intermediary object, 
test1 <- world[world$area_km2 < 10000, ]  # with data frame syntax (df name & comma)
test2 <- world %>%                        # with tidyverse syntax (no df name & no comma)
  filter(area_km2 < 10000)

# compare two data 
# https://stackoverflow.com/questions/19119320/how-to-check-if-two-data-frames-are-equal
identical(test1, test2) # returns only TRUE/FALSE
all.equal(test1, test2) # returns some clues on how two data differ 

test3 <- subset(world, area_km2 < 10000)    # from base R  
identical(test1, test3)

world[3:5, ]
slice(world, 3:5)

Select columns (i.e., variables, or attributes)

world1 <- dplyr::select(world, name_long, pop)
world2 <- world %>% 
  dplyr::select(name_long, pop)
identical(world1, world2)
all.equal(world1, world2)

world2 <- world %>% 
  dplyr::select(name_long:pop)

world3 <- world %>%
  dplyr::select(-subregion, -area_km2)

world4 <- world %>%
  dplyr::select(name+long, population = pop) # select & rename at the same time 
names(world4)

# rename a variable in the R base way 
world5 = world[, c("name_long", "pop")]               # first subset 
names(world5)[names(world5) == "pop"] == "population" # next rename a variable

Subsetting is tricky at times. For a data frame:
- Single brackets [] return a lowest possible dimension by default.
- Double brackets [[]] return an atomic vector

d = data.frame(pop = 1:10, area = 1:10) 

d[, "pop"]
d[, "pop", drop = FALSE] # to change the default behavior 

d["pop"]          # think of d as a list, then [] returns a sublist 
d[["pop"]]        # think of d as a list, then [[]] returns actual elements  

dplyr::select(d, pop)  # tidyverse verb: always returns df 
pull(d, pop)

# geometry list-column is sticky 
world[, "pop"]

world$pop
pull(world, pop)

Examples doing subsetting in two ways

world7 <- world %>%
  filter(continent == "Asia") %>%
  dplyr::select(name_long, continent) %>%
  slice(6:10)

# without pipes, it gets confusing: from inside towards outside, not from top to bottom  
world8 <- slice(
  dplyr::select(
    filter(world, continent == "Asia"), 
    name_long, continent), 
  6:10)

3.2.2 Vector attribute aggregation

Aggregation operations summarize datasets by a grouping variable (i.e., group_by() in non-spatial data manipulation).

# base R function 
world_agg1 = aggregate(pop ~ continent, FUN = sum, data = world, na.rm = TRUE)
class(world_agg1)

aggregate() is a generic function which means that it behaves differently depending on its inputs. sf provides a function that can be called directly with sf:::aggregate() that is activated when a by argument is provided, rather than using the ~ to refer to the grouping variable:

# sf fuction on an sf object  
world_agg2 = aggregate(world["pop"], by = list(world$continent), # target data is sf; grouping variable is a vector
                       FUN = sum, na.rm = TRUE)
class(world_agg2)

# sf fuction on a vector
world_agg2 = aggregate(world$pop, by = list(world$continent),    # target data is a vector; grouping variable is a vector
                       FUN = sum, na.rm = TRUE)
class(world_agg2)

Finally, with dplyr functions (in tidyverse ways)!

world_agg3 = world %>%
  group_by(continent) %>%
  summarize(pop = sum(pop, na.rm = TRUE))
class(world_agg3)

world %>% 
  group_by(continent) %>%
  summarize(
    pop = sum(pop, na.rm = TRUE), 
    n = n()
  )

A little longer example.

world %>% 
  dplyr::select(pop, continent) %>% 
  group_by(continent) %>% 
  summarize(
    pop = sum(pop, na.rm = TRUE), 
    n_countries = n()
  ) %>% 
  top_n(n = 3, wt = pop) %>%
  st_drop_geometry()                # delete the list-column and change the class from sf to df

3.2.3 Vector attribute joining

Join an sf object with a data.frame.

str(coffee_data)

# when joining variable is not specified, R uses variables with the same names from two input 
world_coffee = left_join(world, coffee_data) 
class(world_coffee)

# output has the same class as the first argument 
test = left_join(coffee_data, world) 
class(test)

ma_chr(test, typeof)    
ma(test, class)          # the geom list-column is there, but R treats it as a normal column. 
test_sf <- st_as_sf(test) # change the class from df to sf, as long as df has simple feature column. 
class(test_sf)

names(world_coffee) # notice the sticky geom list-column 
plot(world_coffee["coffee_production_2017"])

coffee_renamed = rename(coffee_data, nm = name_long)
coffee_renamed <- coffee_data %>%
  mutate(nm = name_long) %>%
  select(-name_long)              # longer, but more intuitive 

# when there are no common variables 
world_coffee2 = left_join(world, coffee_renamed)

# to avoid confusion, always put common variables in ""
world_coffee2 = left_join(world, coffee_renamed, by = c(name_long = "nm"))   # the second variable names must be in  ""
world_coffee2 = left_join(world, coffee_renamed, by = c("name_long" = "nm")) # this also works

world_coffee_inner = inner_join(world, coffee_data)
nrow(world_coffee_inner)

Use setdiff to examine which countries have different names. Note that setdiff is order-dependent (check each pair of two vectors).

setdiff(coffee_data$name_long, world$name_long)

Deal with character: a brief introduction to regular expressions, or regex with base R functions.
- We did not cover Chapter 11 Strings in R4DS, which uses the stringr package.
- Also, another great resource is made by Dr. Roger Peng explaining base R functions for regex.
- Cheat sheet for basic regular expressions in R

# returns a matched value(s)  
str_subset(world$name_long, "Dem*.+Congo")
?str_subset
grep("Dem*.+Congo", world$name_long, value = TRUE)
grep("Dem*.+Congo", world$name_long)  # by defaul value = FALSE 
grepl("Dem*.+Congo", world$name_long) # returns a logical vector 

coffee_data$name_long[grepl("Congo,", coffee_data$name_long)] =  
  str_subset(world$name_long, "Dem*.+Congo")
world_coffee_match = inner_join(world, coffee_data)
nrow(world_coffee_match)

3.2.4 Creating attributes and removing spatial information

Create new variables, unite/separate variables, and rename variables.

# base R function 
world_new = world # do not overwrite our original data
world_new$pop_dens = world_new$pop / world_new$area_km2

# dplyr functions 
world %>% 
  mutate(pop_dens = pop / area_km2)
world %>% 
  transmute(pop_dens = pop / area_km2) # only with the new variable, but the other variables are dropped 

# dplyr functions - unite, separate 
world_unite = world %>%
  unite("con_reg", continent:region_un, sep = ":", remove = TRUE) # remove = TRUE: the original columns are removed
world_separate = world_unite %>% 
  separate(con_reg, c("continent", "region_un"), sep = ":")

# rename variables 
world %>% 
  rename(name = name_long) # change a column name 

new_names = c("i", "n", "c", "r", "s", "t", "a", "p", "l", "gP", "geom")
world %>% 
  set_names(new_names)    # change all column names

Change an sf object to data.frame.

world_data = world %>% st_drop_geometry()
class(world_data)

Pause for in-class exercises here.

3.3 Manipulating raster objects

Skip

3.4 Exercises

For these exercises we will use the us_states and us_states_df datasets from the spData package:

library(spData)
data(us_states)
data(us_states_df)

us_states is a spatial object (of class sf), containing geometry and a few attributes (including name, region, area, and population) of states within the contiguous United States. us_states_df is a data frame (of class data.frame) containing the name and additional variables (including median income and poverty level, for the years 2010 and 2015) of US states, including Alaska, Hawaii and Puerto Rico. The data comes from the United States Census Bureau, and is documented in ?us_states and ?us_states_df.

Create a new object called us_states_name that contains only the NAME column from the us_states object. What is the class of the new object and what makes it geographic?
Select columns from the us_states object which contain population data. Obtain the same result using a different command (bonus: try to find three ways of obtaining the same result). Hint: try to use helper functions, such as contains or starts_with from dplyr (see ?contains).
Find all states with the following characteristics (bonus find and plot them):
- Belong to the Midwest region.
- Belong to the West region, have an area below 250,000 km2 and in 2015 a population greater than 5,000,000 residents (hint: you may need to use the function units::set_units() or as.numeric()). A good reference
- Belong to the South region, had an area larger than 150,000 km2 or a total population in 2015 larger than 7,000,000 residents.
What was the total population in 2015 in the us_states dataset? What was the minimum and maximum total population in 2015?
How many states are there in each region?
What was the minimum and maximum total population in 2015 in each region? What was the total population in 2015 in each region?
Add variables from us_states_df to us_states, and create a new object called us_states_stats. What function did you use and why? Which variable is the key in both datasets? What is the class of the new object?
us_states_df has two more rows than us_states. How can you find them? (hint: try to use the dplyr::anti_join() function)
What was the population density in 2015 in each state? What was the population density in 2010 in each state?
How much has population density changed between 2010 and 2015 in each state? Calculate the change in percentages and ma them.
Change the columns’ names in us_states to lowercase. (Hint: helper functions - tolower() and colnames() may help.)
Using us_states and us_states_df create a new object called us_states_sel. The new object should have only two variables - median_income_15 and geometry. Change the name of the median_income_15 column to Income.
Calculate the change in median income between 2010 and 2015 for each state. Bonus: What was the minimum, average and maximum median income in 2015 for each region? What is the region with the largest increase of the median income?