This notebook introduces the basic skills of spatial analysis with R. Spatial analysis is vital to transportation studies. For instance, when conducting a transport planning analysis, we need to collect data for the study area and examine its geographic characteristics and socioeconomic factors before studying traffic zones, trip distribution, and so forth. These tasks require spatial analysis skills before the transport planning process can begin. There are more powerful, more polished tools for spatial analysis, such as ArcGIS, CUBE, and many other transport planning packages; however, such software is not easily accessible because of its high cost. Hence, we can use R to do some simple spatial analysis instead. In this notebook, we will acquire the skills listed below.

  1. Make customized maps.
  2. Join spatial and attribute data.
  3. Conduct spatial operations.
  4. Analyze spatial data.
  5. Export our own spatial data.

In addition, the notebook focuses on spatial issues in transportation. If you major in Transportation Engineering or Transportation Management, you may find the content especially relevant. The notebook provides many examples based on transportation data in Taiwan. For any suggestions or questions, please contact here.

Basic Knowledge on GIS

Geographic Information Systems

Geographic Information Systems (GIS) store, analyze, and visualize data for geographic positions on Earth’s surface. The four major features of GIS are listed below.

  • Create geographic data.
  • Manage spatial data in a database.
  • Analyze and find patterns.
  • Visualize data on maps.

In this notebook, we will focus on spatial analysis and on making visualized maps.

The detailed concepts of GIS can be found in the textbook “Introduction to Geographic Information Systems” (Kang Tsung Chang).

Well-Known Text (WKT)

Well-known text (WKT) is a text markup language for representing vector geometry objects. The common types of geometric objects are written as shown in the table below.

Type WKT
Point POINT (3 3)
LineString LINESTRING (1 4, 3 3, 5 5)
Polygon POLYGON ((1 4, 2 2, 4 1, 5 5, 1 4))

The WKT types illustrated above are single geometries. Sometimes, however, we want to group several geometries together so that they represent the same attribute, which is difficult to express with a single geometry. For instance, NCTU has five campuses. If we want to combine all the campus points into one geometry that represents “NCTU”, a single POINT cannot express it. Thus, we need multi geometries.

Multi geometries represent more than one geometry of the same dimension in a single object, including MultiPoint, MultiLineString, and MultiPolygon, while a GeometryCollection holds geometries of different dimensions. Two short examples are sketched below.
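For instance, hypothetical campus coordinates could be grouped into one multi-point object, and geometries of different dimensions into a geometry collection (the coordinates below are made up purely for this sketch):

MULTIPOINT ((120.99 24.79), (121.00 24.78), (120.98 24.80))
GEOMETRYCOLLECTION (POINT (3 3), LINESTRING (1 4, 3 3, 5 5))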

Shapefile

The shapefile format is a widely used geospatial vector data format developed and regulated by Esri, an international supplier of GIS software. A shapefile can represent the vector features points, lines, and polygons. In fact, a shapefile is not a single file but a collection of files: .shp, .shx, and .dbf are mandatory, and .prj, which records the coordinate reference system, is strongly recommended. Each file is briefly introduced below.

File Features
.shp shape format; the feature geometry
.dbf attribute format; attributes for each shape, stored as two-dimensional table
.prj projection description, using a well-known text representation of coordinate reference systems
.shx shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly

Coordinate Reference System (CRS)

Geometric data is not geospatial unless it is accompanied by coordinate reference system (CRS) information, which allows a GIS to display and operate on the data accurately. A CRS has two major components: a datum and a projection. The datum is a model of the shape of the earth; it uses angular units (degrees) and defines the origin (0, 0), so that a coordinate pair can represent a specific spot on the earth. The projection is a mathematical transformation of the angular measurements on a round earth to a flat surface; the units associated with a given projection are usually linear (feet, meters, etc.).

There are two common types of coordinate systems:

Type Features Example
Geographic Coordinate System A global or spherical coordinate system such as latitude–longitude. EPSG:4326 (World Geodetic System 1984)
Projected Coordinate System Based on a map projection such as transverse Mercator, Albers equal area, or Robinson, which projects the spherical surface onto a two-dimensional coordinate plane. EPSG:3826 (TWD97 / TM2 zone 121, Taiwan Datum 1997)

To understand the difference between the CRS types clearly, let’s take NCTU for example. Its coordinates can be recorded as (120.999645, 24.789071) in EPSG:4326, where the x coordinate is the longitude and the y coordinate is the latitude. The same spot can be recorded as (249964.105, 2742415.017) in EPSG:3826. In this projected coordinate system, 121°E is the central meridian and is assigned a false easting of 250,000 meters, so the easting origin lies 250 kilometers west of it, while the equator is the origin of the northing. The x coordinate therefore means the spot is located 249,964.105 meters east of that origin, and the y coordinate means the spot is located 2,742,415.017 meters north of the equator.
Something interesting: the center of NCTU (Guangfu Campus) is very close to the central meridian (249,964.105 m ≈ 250 km). In fact, 121°E passes right through the campus!
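As a quick sketch of this correspondence (it relies on the sf package introduced later in this notebook), the same point can be converted between the two systems with st_transform(); the printed coordinates should be approximately the TWD97 values above.

library(sf)
nctu=st_sfc(st_point(c(120.999645, 24.789071)), crs=4326)  # NCTU in EPSG:4326 (lon, lat)
st_transform(nctu, crs=3826)                               # reproject to EPSG:3826 (TWD97 / TM2 zone 121)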

Why Use R?

R is a programming language mainly for statistical computing and graphics. With a wide range of packages, R supports advanced geospatial statistics, modeling and visualization. In addition, integrated development environments such as RStudio have made it more user-friendly, allowing us to easily analyze spatial data and make maps.

To use R for analysis, we first need to download the software and a development environment such as RStudio.

Required Packages

To conduct geocomputation in R, the "sf" package is required.

To analyze data more efficiently, the "dplyr" package is suggested.

To display maps, the "ggplot2" package is strongly recommended. (There are other packages for making maps as well; nonetheless, ggplot2 is powerful and easy to learn.)

install.packages("sf")
install.packages("dplyr")
install.packages("ggplot2")
library(sf)
library(dplyr)
library(ggplot2)

Data Visualization with Maps

When it comes to big data, a massive amount of information may come to mind. Processing big data and producing a neat summary is a vital issue nowadays, and the same holds for geospatial big data. Data visualization with maps is one such method: it summarizes a huge dataset together with its geometric features.

A basic feature of GIS is making maps. But a map that shows only geometric information can look tedious. To make a map more interesting and informative, we should combine the geometric features with their attributes. With the data joined, we can make heat maps, hypsometric maps, and so forth to visualize the attributes.

In this section, we will go through the basic commands for making maps, learn how to label layers, and finally see briefly how to make graduated, categorized, and rule-based symbology maps.

Making Maps

For beginners, making a map gives a quick view and a better understanding of how R can be used as a GIS. It can be done simply with the function ggplot() in the ggplot2 package.

The gg in ggplot2 stands for Grammar of Graphics, a concept that describes plots using a “grammar”. ggplot() constructs the initial plot object and is followed by + to add components to the plot. To make a map, we follow the same grammar.

The following two code chunks show how to draw a map of the United States with the ggplot grammar. ggplot() initializes the drawing, while geom_sf() tells ggplot to draw the map. The data can be passed either to ggplot() or to geom_sf().

Note that "us_states" is an example dataset that ships with the spData package. Please install and load "spData" before using the data.

install.packages("spData")
library(spData)
ggplot(data=us_states)+
  geom_sf()
ggplot()+
  geom_sf(data=us_states)

The two code chunks above plot the same map as shown. From this simple exercise, we can see that making a map in R can be easier and more flexible than in desktop GIS (e.g., QGIS, ArcGIS), since the map is produced with standard, readable code.

Attributes of Shapefile

Notably, the class of "us_states" is "sf" (simple features) as well as "data.frame". A simple features object stores not only the geometric information but also the attributes of each feature.

class(us_states)
## [1] "sf"         "data.frame"

Now we know how to make a map in R. But before making a thematic map, it is better to take a closer look at the data first. Take "us_states" for instance.

us_states
## Simple feature collection with 49 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##    GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5     12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
## 7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
## 8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
## 9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
## 10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
##                          geometry
## 1  MULTIPOLYGON (((-88.20006 3...
## 2  MULTIPOLYGON (((-114.7196 3...
## 3  MULTIPOLYGON (((-109.0501 4...
## 4  MULTIPOLYGON (((-73.48731 4...
## 5  MULTIPOLYGON (((-81.81169 2...
## 6  MULTIPOLYGON (((-85.60516 3...
## 7  MULTIPOLYGON (((-116.916 45...
## 8  MULTIPOLYGON (((-87.52404 4...
## 9  MULTIPOLYGON (((-102.0517 4...
## 10 MULTIPOLYGON (((-92.01783 2...

There are 49 features and 6 fields (attributes) in the "us_states" data, that is, 49 rows (the 48 contiguous states plus the District of Columbia) and 6 columns in a table. However, the table above actually shows 7 columns. This is because the last column stores the geometry of the shapefile. The "geometry" column is recorded in Well-Known Text (WKT) form.

Basic R commands can be applied to "sf" data. For instance, to see all the state names in "us_states", append "$" to the data name followed by the column name.

## Output the name of us_states by "$"
us_states$NAME
##  [1] "Alabama"              "Arizona"              "Colorado"            
##  [4] "Connecticut"          "Florida"              "Georgia"             
##  [7] "Idaho"                "Indiana"              "Kansas"              
## [10] "Louisiana"            "Massachusetts"        "Minnesota"           
## [13] "Missouri"             "Montana"              "Nevada"              
## [16] "New Jersey"           "New York"             "North Dakota"        
## [19] "Oklahoma"             "Pennsylvania"         "South Carolina"      
## [22] "South Dakota"         "Texas"                "Vermont"             
## [25] "West Virginia"        "Arkansas"             "California"          
## [28] "Delaware"             "District of Columbia" "Illinois"            
## [31] "Iowa"                 "Kentucky"             "Maine"               
## [34] "Maryland"             "Michigan"             "Mississippi"         
## [37] "Nebraska"             "New Hampshire"        "New Mexico"          
## [40] "North Carolina"       "Ohio"                 "Oregon"              
## [43] "Rhode Island"         "Tennessee"            "Utah"                
## [46] "Virginia"             "Washington"           "Wisconsin"           
## [49] "Wyoming"

Or, we can use matrix-style indexing to extract the desired columns and rows.

## Output the name and region of us_states by "[,]"
us_states[,c("NAME","REGION")]  # Put the name of column directly
## Output the name and region of us_states by "[,]"
us_states[,c(2,3)]  # NAME is stored in the second column
## Simple feature collection with 49 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##           NAME   REGION                       geometry
## 1      Alabama    South MULTIPOLYGON (((-88.20006 3...
## 2      Arizona     West MULTIPOLYGON (((-114.7196 3...
## 3     Colorado     West MULTIPOLYGON (((-109.0501 4...
## 4  Connecticut Norteast MULTIPOLYGON (((-73.48731 4...
## 5      Florida    South MULTIPOLYGON (((-81.81169 2...
## 6      Georgia    South MULTIPOLYGON (((-85.60516 3...
## 7        Idaho     West MULTIPOLYGON (((-116.916 45...
## 8      Indiana  Midwest MULTIPOLYGON (((-87.52404 4...
## 9       Kansas  Midwest MULTIPOLYGON (((-102.0517 4...
## 10   Louisiana    South MULTIPOLYGON (((-92.01783 2...
## Output the attributes of first row
us_states[1,]
## Simple feature collection with 1 feature and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.47323 ymin: 30.24971 xmax: -84.89184 ymax: 35.00803
## geographic CRS: NAD83
##   GEOID    NAME REGION            AREA total_pop_10 total_pop_15
## 1    01 Alabama  South 133709.3 [km^2]      4712651      4830620
##                         geometry
## 1 MULTIPOLYGON (((-88.20006 3...

In addition, since an sf object is also a "data.frame", we can easily manipulate the data with the "dplyr" package. This is discussed further in the following sections.

Labels on Map

Sometimes we need to label the map with the name or an attribute of specific places. To do so, geom_sf_text() is appended after geom_sf() to indicate which column should be used for the labels. The code below draws the map of US states with their names labeled.

ggplot(us_states)+
  geom_sf()+
  geom_sf_text(aes(label=NAME))
## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
## give correct results for longitude/latitude data

In geom_sf_text(), aes() defines the aesthetic mappings, which here determine the coordinates of each label. The attribute we want to label is passed to label=.

However, something awkward occurs: the labels overlap wherever they are dense. To solve this problem, install the package "ggsflabel" from GitHub. (Note that "ggsflabel" has not been released on CRAN, the global network of servers that stores R packages and their documentation, so we have to install it from GitHub!)
* If you want to learn more about why we install from GitHub and how "devtools" works, please click here.

install.packages("devtools")
devtools::install_github("yutannihilation/ggsflabel")
library(ggsflabel)

With this package, the overlapping text is repelled, making the map clear and readable.

ggplot(us_states)+
  geom_sf()+
  geom_sf_text_repel(aes(label=NAME), nudge_y=0.5, size=3, color="red")
## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
## give correct results for longitude/latitude data

In geom_sf_text() and geom_sf_text_repel(), nudge_x and nudge_y shift the labels horizontally and vertically, respectively, to make the map more readable.

Also, if we want to change the color or size of the text, we simply adjust the corresponding arguments in geom_sf_text() or geom_sf_text_repel(): color= and size=. Common colors can be given directly by name, for instance color="red". Other colors can be specified by their hex color codes, such as color="#2894FF".
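For instance, a quick sketch with a hex color code and a smaller text size (the styling here is purely illustrative):

ggplot(us_states)+
  geom_sf()+
  geom_sf_text_repel(aes(label=NAME), size=2.5, color="#2894FF")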

Graduated Symbol Map

A graduated symbol map changes the color or size of symbols according to the value of the attribute they represent. The attribute should be numeric and continuous. Graduated symbol maps are widely applicable; for instance, denser populations might be represented by larger dots or darker polygons.

Let’s take the “us_states” data for example again. There are two columns representing population (total_pop_10, total_pop_15). What if we want to make a graduated symbol map based on the population of each state?

First, let’s learn how to alter the color and size of the map. To do so, we just need to set the parameters in geom_sf(). The parameters of geom_sf() are summarized in the table below.

Parameter Features
size set the size of points and the width of lines and polygon borders
color set the color of points, lines, and polygon borders
fill set the fill color of polygons

The following code fills the “us_states” polygons blue and draws thicker red borders.

ggplot(us_states)+
  geom_sf(fill="blue", size=3, color="red")

A graduated symbol map does not assign one fixed color to every feature. Thus, the parameter in geom_sf() should not be a fixed value or color. Instead, the parameters (size, color, fill) should be mapped to the value of an attribute, that is, Parameter=Attribute. The following code and map show how to make a graduated symbol map based on the population of each US state.

ggplot(us_states)+
  geom_sf(aes(fill=total_pop_10))

Simply adding aes(fill=total_pop_10) inside geom_sf() produces a graduated symbol map based on the population values. Here, aes() defines the aesthetic mapping, and fill specifies which attribute the fill color should be based on.

But the map looks a bit odd. Conventionally, areas with a high population should be filled with darker colors; the default result above is the opposite. Thus, we need other functions to redefine the color scale.

scale_fill_continuous() is appended after geom_sf(). It defines the colors used to fill polygons based on continuous data. scale_color_continuous() does the same thing but colors points or lines. In these functions, two major parameters should be set: low=, the color given to the lowest value in the data, and high=, the color given to the highest value.

In the “us_states” example, since the data is continuous and the object is a polygon, we apply scale_fill_continuous(). The code and figure are shown below.

ggplot(us_states)+
  geom_sf(aes(fill=total_pop_15))+
  scale_fill_continuous(low="#D2E9FF", high="#003D79")

Categorized Symbol Map

A categorized map is similar to a graduated map, but it is used for discrete data. It is widely used for geometric data that has been classified in advance, for instance, the region of a state, the type of a school (elementary, junior high, senior high, university, …), the classification of a highway (expressway, freeway, …), and so forth.

Take the “us_states” data for example. The column “REGION” records the region of each state, and we can fill states in the same region with the same color. As in the graduated map, the parameters (size, color, fill) should be mapped to the classification attribute, that is, Parameter=Attribute. The code and result are as follows.

ggplot(us_states)+
  geom_sf(aes(fill=REGION))

It seems as simple as the graduated map. But what if we want to adjust the colors manually?

scale_fill_manual() or scale_color_manual() is appended after geom_sf(). They adjust the colors of polygons and of points or lines, respectively. In these functions, the parameter values= should be set; it defines the color for each class. The pseudocode is as follows: values=c("Classification_1"="Color_1", "Classification_2"="Color_2",...)

ggplot(us_states)+
  geom_sf(aes(fill=REGION))+
  scale_fill_manual(values=c("Norteast"="#FFC1E0", "Midwest"="#97CBFF", "South"="#A6FFA6", "West"="#FFE66F"))

Rule-based Symbol Map

Sometimes the category is determined by values from multiple columns. In this case, a plain categorized symbol map is hard to apply directly; we should first mutate a new column that determines the classification of each row.

Take the “us_states” data for example. We want to classify the data into four categories: “high population-large area”, “high population-small area”, “low population-large area”, and “low population-small area”. The data is split simply by the mean, that is, a state whose population is lower than the average is defined as “low population”.

In this case, we need mutate() and case_when() functions in dplyr package to help us sort out the data.

# Create a new data frame named us_states_type
us_states_type=mutate(us_states,TYPE=case_when(
  total_pop_15>mean(us_states$total_pop_15) & AREA>mean(us_states$AREA) ~ "high population-large area",
  total_pop_15>mean(us_states$total_pop_15) & AREA<mean(us_states$AREA) ~ "high population-small area",
  total_pop_15<mean(us_states$total_pop_15) & AREA>mean(us_states$AREA) ~ "low population-large area",
  total_pop_15<mean(us_states$total_pop_15) & AREA<mean(us_states$AREA) ~ "low population-small area"
))

# Output the newly data
us_states_type
## Simple feature collection with 49 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##    GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5     12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
## 7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
## 8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
## 9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
## 10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
##                          geometry                       TYPE
## 1  MULTIPOLYGON (((-88.20006 3...  low population-small area
## 2  MULTIPOLYGON (((-114.7196 3... high population-large area
## 3  MULTIPOLYGON (((-109.0501 4...  low population-large area
## 4  MULTIPOLYGON (((-73.48731 4...  low population-small area
## 5  MULTIPOLYGON (((-81.81169 2... high population-small area
## 6  MULTIPOLYGON (((-85.60516 3... high population-small area
## 7  MULTIPOLYGON (((-116.916 45...  low population-large area
## 8  MULTIPOLYGON (((-87.52404 4... high population-small area
## 9  MULTIPOLYGON (((-102.0517 4...  low population-large area
## 10 MULTIPOLYGON (((-92.01783 2...  low population-small area

*To learn more about the basic operations of dplyr, please click here.

With the new data derived, we can then apply the categorized symbol map. The code and result are as follows.

ggplot(us_states_type)+
  geom_sf(aes(fill=TYPE))

Basic Operations/Elements of Map

Map Overlay

Making an overlay map with ggplot2 is simple: layers can be overlaid by repeating geom_sf(data=). Let’s take New Zealand and its 101 highest points for example. The layers used in this example are provided in the spData package: nz is a polygon layer, and nz_height is a point layer. The code and result are as follows.

ggplot()+
  geom_sf(data=nz)+
  geom_sf(data=nz_height, color="red")

North Arrow and Scale

A north arrow is a graphical element indicating the direction of north on the map. The scale of a map is the ratio of a distance on the map to the corresponding distance on the ground. Both items are required when making maps. With ggplot2, we can simply use the functions annotation_north_arrow() and annotation_scale() from the ggspatial package to add them. Again, we use the map of New Zealand (nz) to display the result.

# Please install the package ggspatial first!
install.packages("ggspatial")
library(ggspatial)
ggplot()+
  geom_sf(data=nz)+
  annotation_scale(location="br")+
  annotation_north_arrow(location="tl")

The parameter location= in both functions specifies where the element should be placed: "br" means "bottom right", while "tl" means "top left". If the default placement is not ideal, we can shift the element slightly with the parameters pad_x and pad_y, as sketched below. For other parameters, please consult the ggspatial documentation.
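A short sketch of nudging both elements inward (the offsets are illustrative values):

ggplot()+
  geom_sf(data=nz)+
  annotation_scale(location="br", pad_x=unit(0.5, "cm"), pad_y=unit(0.5, "cm"))+
  annotation_north_arrow(location="tl", pad_x=unit(0.3, "cm"), pad_y=unit(0.3, "cm"))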

Linetype/Shape

We introduced the parameters color, fill, and size in the previous section. But what if we want to change the type of a line (e.g., dashed, dotted) or the shape of a point (e.g., diamond, triangle)? We add further parameters in geom_sf(). Take the map of New Zealand again.

ggplot()+
  geom_sf(data=nz, color="blue", linetype="dashed")+       # make the border of nz dashed
  geom_sf(data=nz_height, color="red", size=2, shape=4)    # set the shape of nz_height to "X"

Line types and point shapes commonly used in R are illustrated in the figure below.

Theme

The maps above show the background (longitude and latitude grid) and the axes. Sometimes we prefer not to show these geographic details, so we need to blank out those elements. We might also change the position or size of the legend, and so forth, to make the map more elegant. In these situations, we use the function theme() to customize the style of the map. Let’s refine our map of New Zealand: suppose we want to remove the background and the coordinates, and move the legend to the bottom right of the map.

ggplot()+
  geom_sf(data=nz, aes(fill=Island))+
  theme(axis.text=element_blank(),             # remove the value on axis
        axis.ticks=element_blank(),            # remove the ticks on axis
        axis.title=element_blank(),            # remove the title on axis (X,Y)
        legend.text=element_text(size=12),     # set up the size of text in legend to 12
        legend.title=element_text(size=15),    # set up the size of legend title to 15
        legend.position=c(0.87, 0.15),         # set up the position of legend to the bottom-right ((0,0) at the bottom-left)
        panel.background=element_blank())      # remove the background

The example above shows one common theme when making maps. In fact, theme() provides many parameters for users to set. To make an ideal map, it is suggested to read the theme documentation and take advantage of it.

Other Mapping Tools

ggplot2 is a powerful package for making maps (it is also helpful for plotting bar charts, line charts, histograms, and so forth), though its major drawback is that raster objects are not natively supported. Hence, many users prefer a specialized mapping package such as “tmap” or “leaflet”. In this notebook, we briefly introduce how they work with simple examples.

tmap

Like ggplot2, tmap is based on the idea of a “grammar of graphics”. This involves a separation between the input data and the aesthetics (how data are visualised): each input dataset can be mapped to visual variables. The basic building block is tm_shape(), which defines the input data (raster and vector objects), followed by one or more layer elements such as tm_fill() and tm_dots(). The table below compares the commands used in the two packages.

Package tmap ggplot2
Basic Block tm_shape()+ ggplot()+
Polygon tm_polygons(col=X) geom_sf(fill=X)
Fill tm_fill(col=X) geom_sf(color=NA, fill=X)
Border tm_borders(col=X, lwd=Y, lty=Z) geom_sf(color=X, size=Y, linetype=Z)
Point tm_dots(col=X, size=Y, shape=Z) geom_sf(color=X, size=Y, shape=Z)
Line tm_lines(col=X, lwd=Y, lty=Z) geom_sf(color=X, size=Y, linetype=Z)
Raster tm_raster() Does not support raster data
Aesthetic tm_dots(col="ATTRIBUTE") geom_sf(aes(color=ATTRIBUTE))

To plot a map with tmap, please install the tmap package first. Then let’s display the map of New Zealand.

# Please install the package tmap first!
install.packages("tmap")
library(tmap)
tm_shape(nz)+
  tm_polygons(col="Island")+
  tm_shape(nz)+
  tm_borders(col="grey", lty="dashed")+
  tm_shape(nz_height)+
  tm_dots(shape=4, col="red", size=1)

tmap also supports interactive maps, which are rendered with leaflet (discussed later). The procedure is very easy: just change the mode from “plot” to “view” with tmap_mode("view"). If the mode is not set, the default is tmap_mode("plot").

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(nz)+
  tm_polygons(col="Island")+
  tm_shape(nz)+
  tm_borders(col="grey", lty="dashed")

leaflet

Leaflet is the most mature and widely used interactive mapping package in R. It provides a relatively low-level interface to the Leaflet JavaScript library and many of its arguments can be understood by reading the documentation of the original JavaScript library.

Leaflet maps are created with leaflet(), the result of which is a leaflet map object which can be piped to other leaflet functions. This allows multiple map layers and control settings to be added interactively.

# Please install the package leaflet first!
install.packages("leaflet")
library(leaflet)

The basic commands of leaflet include addTiles(), addPolygons(), and addCircles(). addTiles() adds the basemap; if no argument is given, OpenStreetMap is used by default, or we can use addProviderTiles() to use basemaps provided by other services. addPolygons() draws polygons, setting their border and fill colors. addCircles() is used for point data: it draws a circle of a given radius around each point. Note that the unit of radius is meters.

Now, let’s display the interactive map of New Zealand by using leaflet.

leaflet(data=st_transform(nz, crs=4326))%>% 
  addTiles()%>%
  addPolygons(color="green", fillColor="gray")%>%
  addCircles(data=st_transform(nz_height, crs=4326), color="red", radius=1000)

* Please ignore the function st_transform() above for now; it transforms the coordinate reference system and is discussed in the following chapter.
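The example above uses the default OpenStreetMap tiles. As a sketch, switching the basemap with addProviderTiles() might look like this (CartoDB.Positron is just one of the many providers bundled with leaflet):

leaflet(data=st_transform(nz, crs=4326))%>% 
  addProviderTiles(providers$CartoDB.Positron)%>%
  addPolygons(color="green", fillColor="gray")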

Several methods for plotting thematic maps have been introduced. We spend the most time on ggplot2, since it is the most common and universal tool for plotting all kinds of graphs in R. The other mapping tools are also helpful, particularly when it comes to raster data. If you are interested in those packages and want to learn more advanced skills, please read chapter 8 of the book “Geocomputation with R”. Note, however, that ggplot2 is covered in more detail in this notebook, while “Geocomputation with R” treats it only briefly.

PRACTICE

In this chapter, we introduced simple examples based on the us_states and nz data provided by the “spData” package. Now it’s time to make our own maps of Taipei.

But how do we get the shapefiles for a map of Taipei? There is no built-in dataset like “us_states” for Taipei, so we need to download the files from government open data platforms in advance. The list below shows commonly used websites, which are rich in shapefiles regarding transportation, economy, and administration in Taiwan.

For convenience, please download the data here directly. This zip file contains all the data needed in the notebook. Please unzip the file first, and make sure you place the extracted files in the same directory as your R script file.

Here, we use the Taipei village map (polygon), cycle path map (line), MRT routes (line) and stations (point), and YouBike stations (point) to practice plotting maps.

First, let’s make a map of the cycling system and village population. Please follow the instructions below:

  1. Plot the village map, and fill each village by its population (PP). (Hint: scale_fill_continuous())
  2. Add the cycling paths. (Hint: use alpha to adjust the transparency of the color; optional.)
  3. Add the YouBike station points, with the point size determined by the bike capacity (BIKESCAPAC). (Hint: scale_size_continuous())
  4. Finally, use theme() to polish your map.
# Please put the files you download in the same directory of your RStudio environment!!
# read the shapefiles
taipei_village_map=read_sf("./data/taipei_map/taipei_village_map.shp")
taipei_cycle_path=read_sf("./data/taipei_cycle_path/taipei_cycle_path.shp")
taipei_youbike=read_sf("./data/taipei_youbike/taipei_youbike.shp")
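One possible way to assemble the layers, sketched with the column names given above (PP, BIKESCAPAC) and purely illustrative styling choices (your final styling may differ from the figure below):

ggplot()+
  geom_sf(data=taipei_village_map, aes(fill=PP))+
  scale_fill_continuous(low="#D2E9FF", high="#003D79")+
  geom_sf(data=taipei_cycle_path, color="orange", alpha=0.6)+
  geom_sf(data=taipei_youbike, aes(size=BIKESCAPAC), color="red")+
  scale_size_continuous(range=c(0.5, 3))+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())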

The result is illustrated below.

Looking at the map above, some interesting questions may come up: Is the design capacity of YouBike stations associated with the population around each station? Does the cycling path network match the supply of the bike-sharing system? For now, we can only observe the phenomenon from the visualized map; of course, a concrete and practical conclusion should be drawn only after conducting spatial analysis.

Now let’s make a more complicated map: a Taipei district map with MRT routes and stations. In addition, the transfer stations should be labeled with text and drawn with a different point shape than the non-transfer stations. Thus, the first task is to separate the transfer and non-transfer stations. It is recommended to use the dplyr package to tidy the data.

After cleaning the data, please follow the instructions below to plot the map:

  1. Plot the village map, and fill the color by district (TOWNCODE). (Hint: scale_fill_manual() or scale_fill_brewer())
  2. Add the MRT routes, and paint each with its official color (e.g., the 板南線 (Bannan Line) should be painted blue). (Hint: scale_color_manual())
  3. Add the non-transfer MRT stations.
  4. Add the transfer MRT stations. Please note that you should use a different point shape from the former.
  5. Finally, use theme() to polish your map.
# Please put the files you download in the same directory of your RStudio environment!!
# read the shapefiles
taipei_village_map=read_sf("./data/taipei_map/taipei_village_map.shp")
taipei_mrt_route=read_sf("./data/taipei_mrt/taipei_mrt_route.shp")
taipei_mrt_station=read_sf("./data/taipei_mrt/taipei_mrt_station.shp")

# tidy data
transfer_station=filter(taipei_mrt_station, ...)
non_transfer_station=filter(taipei_mrt_station, ...)

The result is illustrated below.

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

In this example, you may find that making maps becomes harder when many conditions must be considered. Thus, beyond the source code provided in this notebook, we need to learn more about dplyr and other packages that can clean data efficiently. Making maps elegant is another issue: this notebook covers the essentials but not the advanced techniques, so we should practice more and look up more functions from other resources (and other people’s work) to sharpen our map visualization skills.

Geographic Data

We introduced the concept of WKT in a previous section, which raises a few questions: Can R understand the WKT format? And how does R produce WKT from base R objects and the sf package? In this section, we first introduce the “sf” classes and learn the fundamentals of how geographic data is built. Then we discuss CRS transformation and how to export shapefiles. Spatial data combines geometric information with attributes, so we also study how to join the two together. Please note that this section makes heavy use of the sf package (for the first time), so make sure you have it installed (install.packages("sf")).

Simple Feature Geometries (sfg)

Usually, we download geographic data in shapefile format from a website, which lets us plot maps or do geometric operations without defining the WKT ourselves. However, what if we need to define the geometry in WKT format on our own? Simple feature geometry (sfg) objects help us create these geometries.

The sfg class covers (multi-)point, (multi-)linestring, (multi-)polygon, and geometry collection. A set of functions that start with the st_ prefix and end with the name of the geometry type create sfg objects. For instance, st_point() creates a POINT in WKT form. sfg objects can be created from three base R types:

R Data Type Code sfg Object
numeric vector c( ) st_point
matrix rbind(c( ), c( )) st_linestring, st_multipoint
list list(rbind(c( ), c( ))) st_polygon, st_multilinestring, st_multipolygon

Based on the table, we then use the functions of sfg objects to generate geometric information in WKT format.

Point

point_ex=st_point(c(2,3))
class(point_ex)  #identify the data type of sfg object
## [1] "XY"    "POINT" "sfg"
point_ex  #show the WKT format of geometric data
## POINT (2 3)
ggplot(point_ex)+
  geom_sf()

Linestring

linestring_ex=st_linestring(rbind(c(2,3), c(4,4), c(3,5), c(1,4)))
linestring_ex
## LINESTRING (2 3, 4 4, 3 5, 1 4)
ggplot(linestring_ex)+
  geom_sf()

Use rbind() to create the matrix by combining the coordinate vectors (c( )). Note that the order in which the points are connected follows the order of the vectors in the matrix.

Polygon

polygon_ex=st_polygon(list(rbind(c(2,3), c(4,4), c(3,5), c(1,4), c(2,3))))
polygon_ex
## POLYGON ((2 3, 4 4, 3 5, 1 4, 2 3))
ggplot(polygon_ex)+
  geom_sf()

Note that the polygon should be closed, that is, the first vector in the list is the same as the last one.

Multipoint

point_ex=st_multipoint(rbind(c(2,3), c(4,4), c(3,5), c(1,4)))
point_ex
## MULTIPOINT ((2 3), (4 4), (3 5), (1 4))
ggplot(point_ex)+
  geom_sf()

The input type of st_multipoint() is the same as that of st_linestring(): both take a matrix.

Multilinestring

linestring_ex=st_multilinestring(list(rbind(c(2,3), c(4,4), c(3,5)), rbind(c(2,5), c(1,2))))
linestring_ex
## MULTILINESTRING ((2 3, 4 4, 3 5), (2 5, 1 2))
ggplot(linestring_ex)+
  geom_sf()

Multipolygon

polygon_ex=st_multipolygon(list(list(rbind(c(2,3), c(4,4), c(3,5), c(1,4), c(2,3))),
                      list(rbind(c(1,5), c(2,5), c(3,6), c(1,5)))))
polygon_ex
## MULTIPOLYGON (((2 3, 4 4, 3 5, 1 4, 2 3)), ((1 5, 2 5, 3 6, 1 5)))
ggplot(polygon_ex)+
  geom_sf()

A polygon is created from a list, and thus a multipolygon is created from a list of lists.
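For completeness, a geometry collection (geometries of different dimensions in one object) can be created in the same way with st_geometrycollection(), which takes a list of sfg objects:

geomcollection_ex=st_geometrycollection(list(st_point(c(2,3)),
                                             st_linestring(rbind(c(1,4), c(3,3), c(5,5)))))
geomcollection_ex  # printed as GEOMETRYCOLLECTION (POINT (2 3), LINESTRING (1 4, 3 3, 5 5))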

Simple Feature Columns (sfc)

An sfc object is a list of sfg objects that additionally carries the coordinate reference system (CRS), the geometry type, and the bounding box of the whole geometry set (the minimum and maximum x and y values). An sfc object is what forms the geometry column in sf data frames.

point_1=st_point(c(2,3))
point_2=st_point(c(4,2))
point_3=st_point(c(1,1))
point_sfc=st_sfc(point_1, point_2, point_3, crs=4326)  #set the CRS to be 4326 (WGS 84)
point_sfc
## Geometry set for 3 features 
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1 ymin: 1 xmax: 4 ymax: 3
## geographic CRS: WGS 84
## POINT (2 3)
## POINT (4 2)
## POINT (1 1)
ggplot(point_sfc)+
  geom_sf()

In this example, the geometry set consists of three points. The bounding box of the set is listed in bbox (xmin: 1 ymin: 1 xmax: 4 ymax: 3), and the CRS is set to EPSG:4326 (i.e., the latitude–longitude system). Also note the axes of the graph: they are labeled in latitude/longitude format rather than with the plain integer values shown for the sfg objects.

If we want to check the CRS and border of the geometry, we can use the functions as follows.

# Check CRS
st_crs(point_sfc)
## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]
# Check the border
st_bbox(point_sfc)
## xmin ymin xmax ymax 
##    1    1    4    3

Simple Feature (sf)

An sf object contains not only the geometry information but also sets of features with attributes. The basic building block of its geometry data is the sfg, recorded in WKT form. Remember the us_states data in the spData package? Let’s take a closer look at it. We know that the class of us_states is not only sf but also data frame. A tuple, namely a row, is one simple feature. A simple feature contains feature attributes and a simple feature geometry (sfg) object, which defines the location of that tuple. The list-column holding the simple feature geometry (sfg) of each tuple is the simple feature column (sfc).

Again, in the figure above, we can see that the geometry type is “multipolygon”, the bounding box of the us_states data is listed in bbox, and the CRS used in this data is NAD83 (EPSG:4269).

How do we create a simple feature on our own? It is simple: create the attributes in a data frame, and then use the st_sf() function to combine the attributes with the simple feature column (sfc). Two parameters should be given to st_sf(): one is the data frame, which provides the attributes of the features; the other is geometry, which defines the geometry the object should be based on. If we did not define the CRS in st_sfc(), it must be set in st_sf().

The example below shows the code for making simple features for NCTU and NTHU.

uni_geom=st_sfc(
  st_point(c(120.999645, 24.789071)),  # define the sfg of first school (NCTU)
  st_point(c(120.996665, 24.796442))   # define the sfg of second school (NTHU)
  ,crs=4326)  # set the CRS

uni=data.frame(
  name=c("NCTU","NTHU"),               # name of schools
  type=c("university","university"),   # type of schools
  phone=c("03-5714769","03-5712121"),  # phone of schools
  geometry=uni_geom                    # put the sfc into the data frame
)

class(uni)  # the class of "uni" remains a plain data frame
## [1] "data.frame"
uni=st_sf(uni, geometry=uni$geometry)  # define the sfc as geometry
# uni=st_sf(uni, geometry=uni$geometry, crs=4326) 

class(uni)  # the class of "uni" changes to data frame as well as sf
## [1] "sf"         "data.frame"

To check whether the data we produced is of class “sf”, we can plot the map directly to confirm it. Let’s recall the tmap functions to plot an interactive map.

tmap_mode("view")
## tmap mode set to interactive viewing
tm_shape(uni)+
  tm_dots(size=0.2)

Convert Text File to Shapefile

In fact, most spatial data is stored in shapefile format, which makes it simple to read the files, plot maps, and conduct spatial operations in R. However, some spatial data is saved as CSV or XML files. These are not standard geographic data formats, and hence we should convert them in advance. Most of the geometry they record is WKT stored as plain text (e.g., the literal string “POINT (3 2)”). The function st_sfc() we learned in the previous section cannot parse such text; it only accepts the vector, matrix, and list types written in base R code. Here, we introduce a similar function called st_as_sfc(), which can parse plain-text WKT.

st_as_sfc("POINT(3 2)")
## Geometry set for 1 feature 
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 3 ymin: 2 xmax: 3 ymax: 2
## CRS:            NA
## POINT (3 2)

Also, the function st_as_text() retrieves the WKT text from an sfg object.

point_ex=st_point(c(3,2))
st_as_text(point_ex)
## [1] "POINT (3 2)"

Now that we know how st_as_sfc() works, let’s practice with two example datasets downloaded from government open data. The data needed has been downloaded previously; make sure the files are placed in the same directory as your R script. Click here to re-download the file if it is lost.

Let’s use the CSV file named “hsinchu_bus_route” in “csv_files” to produce the bus route map of Hsinchu. There are four columns in this data, including a “Geometry” column recorded in WKT format. Use the function read.csv() to import the CSV data first.

# read CSV files
BusShape=read.csv("./data/csv_files/hsinchu_bus_route.csv")

In the CSV data, the spatial feature is stored in the last column, “Geometry”, recorded as a string in WKT format (LINESTRING (...)). Then, apply the function st_as_sfc() to convert the column “Geometry” to a simple feature column (sfc). Finally, use st_sf() to give the data frame the sf class and define its CRS (EPSG:4326).

# convert the column "geometry" to sfc
BusShape$Geometry=st_as_sfc(BusShape$Geometry)

# give the data frame with sf class
BusShape=st_sf(BusShape, crs=4326)

We can use the function class() and plot the map to confirm whether we have successfully converted the CSV file into spatial data (an sf object).

class(BusShape)
## [1] "sf"         "data.frame"
tm_shape(BusShape)+
  tm_lines()

In the example above, the geometry is given directly in WKT format, so it is easy to transform into spatial data. But what if the (point) data provides longitude and latitude separately in two columns? Here is an example: we will plot a map of scenic spots in Hsinchu. Please import the CSV file named “hsinchu_scenicSpot” in “csv_files” first. There are four columns in the data: ID, Name, PositionLat, and PositionLon.

# read CSV files
ScenicSpot=read.csv("./data/csv_files/hsinchu_scenicSpot.csv")
head(ScenicSpot)
##                     ID               Name PositionLat PositionLon
## 1 C1_376580000A_000101 新竹公園(中山公園)    24.80125    120.9771
## 2 C1_376580000A_000029             進士第    24.80947    120.9653
## 3 C1_376580000A_000037         楊氏節孝坊    24.80311    120.9646
## 4 C1_376580000A_000045         香山火車站    24.76312    120.9139
## 5 C1_376580000A_000085       于飛島(鳥島)    24.77494    120.9718
## 6 C1_376580000A_000079   十草原(櫻花草原)    24.76732    120.9379

We need to add a new column (here named “geometry”) that pastes the longitude (X) and latitude (Y) together into WKT of the form POINT(X Y). Then use st_as_sfc() to convert the “geometry” column to an sfc. Last, apply st_sf() to give the data frame the sf class and define its CRS.

# paste the latitude and longitude together, and create WKT text
ScenicSpot=mutate(ScenicSpot, geometry=paste("POINT(", PositionLon, " ", PositionLat, ")"))
head(ScenicSpot)
##                     ID               Name PositionLat PositionLon
## 1 C1_376580000A_000101 新竹公園(中山公園)    24.80125    120.9771
## 2 C1_376580000A_000029             進士第    24.80947    120.9653
## 3 C1_376580000A_000037         楊氏節孝坊    24.80311    120.9646
## 4 C1_376580000A_000045         香山火車站    24.76312    120.9139
## 5 C1_376580000A_000085       于飛島(鳥島)    24.77494    120.9718
## 6 C1_376580000A_000079   十草原(櫻花草原)    24.76732    120.9379
##                          geometry
## 1 POINT( 120.977132   24.801246 )
## 2 POINT( 120.965333   24.809472 )
## 3 POINT( 120.964563   24.803107 )
## 4 POINT( 120.913918   24.763124 )
## 5 POINT( 120.971808   24.774943 )
## 6  POINT( 120.937865   24.76732 )
# convert the column "geometry" to sfc
ScenicSpot$geometry=st_as_sfc(ScenicSpot$geometry)

# give the data frame with sf class
ScenicSpot=st_sf(ScenicSpot, crs=4326)

Use class() and plot the map to confirm whether we have successfully converted the CSV file into spatial data.

class(ScenicSpot)
## [1] "sf"         "data.frame"
tm_shape(ScenicSpot)+
  tm_dots(col="red")
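As an aside, sf also offers st_as_sf(), which builds the point geometry directly from coordinate columns without pasting WKT by hand; a minimal sketch of that alternative, starting again from the raw CSV, would be:

ScenicSpot_alt=read.csv("./data/csv_files/hsinchu_scenicSpot.csv")
ScenicSpot_alt=st_as_sf(ScenicSpot_alt, coords=c("PositionLon", "PositionLat"), crs=4326)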

Reproject Geographic Data

After collecting shapefile data from different sources, we may find that their CRSs are not consistent, which may cause errors when plotting maps or conducting spatial analysis. Hence, we must convert them to the same CRS. The inconsistency often results from the different origins or purposes of the maps. In Taiwan, there are two major CRSs: EPSG:4326, which records latitude and longitude, and EPSG:3826, which is used specifically for the island. Since shapefiles are managed and released by different sectors (private and government), the CRS has not been unified. The function st_transform() solves this problem by reprojecting the data.

We converted “ScenicSpot” from a CSV file to spatial data in EPSG:4326 in the previous section. In the example below, we further transform its CRS to EPSG:3826.

# check the original CRS
st_crs(ScenicSpot)$epsg
## [1] 4326
# transform CRS
ScenicSpot=st_transform(ScenicSpot, crs=3826)

# check the new CRS
st_crs(ScenicSpot)$epsg
## [1] 3826

Let’s compare the original data frame with the new one in the figure below. We can see that the geometry column has been converted to the new CRS.

Geographic Data Export

We learned how to import a shapefile in the previous practice (with read_sf()). But how do we export the spatial data we make in R? The function st_write() (or its wrapper write_sf()) exports sf data. Two essential parameters should be given to st_write(): obj is the object we have produced in R, and dsn is the path where we want to save the shapefile (the path should include the file name and the *.shp extension). If the data contains Chinese characters, we should additionally add the parameter layer_options="ENCODING=UTF-8" to ensure that the characters are not garbled.

Let’s use “ScenicSpot” again in the following example and export the shapefile to the same directory as your R script. You can use an absolute path, or simply a relative path, with the same result.

# relative path
write_sf(ScenicSpot, "./ScenicSpot.shp", layer_options="ENCODING=UTF-8")
## Warning in abbreviate_shapefile_names(obj): Field names abbreviated for ESRI
## Shapefile driver

Did the export succeed? If so, please check the directory you exported to. You should find five files, with the extensions *.shp, *.shx, *.dbf, *.prj, and *.cpg. The first four were introduced in the previous chapter. The last one, however, is not strictly necessary; it specifies the code page identifying the character encoding to be used.

Geographic Data Format

Up to now, we have learned how to import and export shapefiles. But can R only cope with the shapefile format? Apparently not! There are in fact several formats we have not yet discussed. Here, we introduce some important geographic data formats commonly used in spatial analysis.

Shapefile is the most popular vector data format. However, it has some limitations listed below:

  1. It is a multi-file format, which consists of at least three files.
  2. It only supports 255 columns and the file size is limited to 2 GB.
  3. It is unable to distinguish between a polygon and a multipolygon.

In spite of these limitations, shapefiles remain popular, and much of the data provided by governments uses the format. Several alternative formats are worth knowing:

  • GeoPackage: a format for exchanging geospatial information. It defines rules for storing geospatial data in a compact SQLite container, making it a lightweight spatial database that can hold both vector and raster data.
  • KML (Keyhole Markup Language): an XML notation for expressing geographic annotation and visualization in two-dimensional maps and three-dimensional Earth browsers. It was developed for use with Google Earth.
  • GeoJSON: an open standard format for representing simple geographic features along with their non-spatial attributes, based on JSON.
  • GPX (GPS Exchange Format): an XML schema designed as a common GPS data format for software applications. It can describe points, tracks, and routes.

If we want to export geographic data in one of those formats, we simply change the file extension in the path. Take the “ScenicSpot” data again, and generate the spatial data in GeoPackage format.

# relative path
write_sf(ScenicSpot, "./ScenicSpot.gpkg")

# import the data and check if it is successfully generated
read_sf("./ScenicSpot.gpkg")
## Simple feature collection with 94 features and 4 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 239280.7 ymin: 2737355 xmax: 251527.1 ymax: 2749333
## projected CRS:  TWD97 / TM2 zone 121
## # A tibble: 94 x 5
##    ID            Name              PositionLat PositionLon                  geom
##    <chr>         <chr>                   <dbl>       <dbl>           <POINT [m]>
##  1 C1_376580000~ 新竹公園(中山公園)~        24.8        121.      (247688 2743764)
##  2 C1_376580000~ 進士第                   24.8        121.    (246495.3 2744675)
##  3 C1_376580000~ 楊氏節孝坊               24.8        121.    (246417.3 2743970)
##  4 C1_376580000~ 香山火車站               24.8        121.    (241294.3 2739544)
##  5 C1_376580000~ 于飛島(鳥島)             24.8        121.    (247149.1 2740851)
##  6 C1_376580000~ 十草原(櫻花草原)         24.8        121.    (243716.3 2740007)
##  7 C1_376580000~ 新竹都城隍廟             24.8        121.    (246528.1 2744113)
##  8 C1_376580000~ 鄭用錫墓                 24.8        121.    (247062.5 2742288)
##  9 C1_376580000~ 消防博物館               24.8        121.    (246781.4 2744342)
## 10 C1_376580000~ 新竹神社殘跡(大陸人民處理中心)~        24.8        121.    (245226.6 2742840)
## # ... with 84 more rows
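The same pattern applies to the other formats; for instance, a sketch of exporting the data as GeoJSON (the file name is arbitrary, and since GeoJSON conventionally stores WGS 84 coordinates, the sketch reprojects first):

# GeoJSON conventionally uses WGS 84 (EPSG:4326), so reproject before writing
write_sf(st_transform(ScenicSpot, crs=4326), "./ScenicSpot.geojson")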

Join Attribute and Spatial Data

In reality, the attribute data we want to analyze and the spatial data are usually not in the same file. The spatial features are typically stored in shapefile format, while the attribute data is provided as a CSV or spreadsheet file. Take transportation data for instance: to plot a map showing the number of passengers getting on and off at each MRT station, we need spatial features recording the location of each station and attribute data recording the ridership counts. In this example, the geometric data (shapefile) can be downloaded from the GIS-T website, while the operating data (spreadsheet) can be obtained from Taipei Rapid Transit Corporation. Apparently, neither the managing agency nor the format of the two datasets is consistent. Hence, combining data from different sources is a vital and common task before spatial analysis.

A primary key is required among the data to be joined. A primary key is a minimal set of attributes (columns) that uniquely identifies a tuple (row) in a table. In other words, the key is a common column that links the spatial features to the attribute data. In the MRT example mentioned above, the primary key between the spatial and attribute data may be the MRT station name or the MRT station ID. The join procedure is illustrated in the figure below.

Now, let’s join data provided in the spData package. “world” is a simple feature object that stores the shape of each country, while “coffee_data” is a data frame recording each nation’s coffee production per year. We want to join the two together and produce a map of world coffee production by country. Use the function left_join() to accomplish this. In left_join(), the first dataset ends up on the left-hand side of the new data frame and the second on the right-hand side. The parameter by= should be given to specify the column to join on; otherwise the function joins on all columns with common names by default. In this example, the primary key is the country name, and the column is named “name_long” in both datasets. The code and result are shown below.

# join the spatial and attribute features
world_coffee=left_join(world, coffee_data, by="name_long")

# print the result
world_coffee[,c("name_long","continent","coffee_production_2016","coffee_production_2017")]
## Simple feature collection with 177 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## geographic CRS: WGS 84
## # A tibble: 177 x 5
##    name_long continent coffee_producti~ coffee_producti~
##    <chr>     <chr>                <int>            <int>
##  1 Fiji      Oceania                 NA               NA
##  2 Tanzania  Africa                  81               66
##  3 Western ~ Africa                  NA               NA
##  4 Canada    North Am~               NA               NA
##  5 United S~ North Am~               NA               NA
##  6 Kazakhst~ Asia                    NA               NA
##  7 Uzbekist~ Asia                    NA               NA
##  8 Papua Ne~ Oceania                114               74
##  9 Indonesia Asia                   742              360
## 10 Argentina South Am~               NA               NA
## # ... with 167 more rows, and 1 more variable: geom <MULTIPOLYGON [arc_degree]>
# plot the map (use 2017 data)
ggplot(world)+
  geom_sf(fill=NA)+
  geom_sf(data=filter(world_coffee, !is.na(coffee_production_2017)), aes(fill=coffee_production_2017))+
  scale_fill_continuous(low="#F2E6E6", high="#984B4B")+
  theme(panel.background=element_blank())

In the sections above, we repeatedly showed how to connect spatial features and attribute features. In practice, however, we sometimes want to drop the geometry from the data. The function st_drop_geometry() can do so. Take the “ScenicSpot” data again. It currently contains spatial features, and we can apply this function to eliminate the geometry.

# check the class of ScenicSpot
class(ScenicSpot)
## [1] "sf"         "data.frame"
ScenicSpot=st_drop_geometry(ScenicSpot)
head(ScenicSpot)
##                     ID               Name PositionLat PositionLon
## 1 C1_376580000A_000101 新竹公園(中山公園)    24.80125    120.9771
## 2 C1_376580000A_000029             進士第    24.80947    120.9653
## 3 C1_376580000A_000037         楊氏節孝坊    24.80311    120.9646
## 4 C1_376580000A_000045         香山火車站    24.76312    120.9139
## 5 C1_376580000A_000085       于飛島(鳥島)    24.77494    120.9718
## 6 C1_376580000A_000079   十草原(櫻花草原)    24.76732    120.9379
# check the class of ScenicSpot again
class(ScenicSpot)
## [1] "data.frame"

PRACTICE

In this chapter, we have learned the concepts of spatial features and attribute features, and shown some examples of how to join the data. Now, it’s time for you to practice the three skills on your own.

  • Set up simple features.
  • Join the spatial and attribute data.
  • Plot the map. (Review previous chapter)

The data needed has been downloaded previously. Please place the file in the same directory as the R script file. Click here to re-download the file if it is lost.

The three datasets needed in this example are stored in the folder “TRA”. “TRA_line” records the name of each railway line and its geometry (WKT format), whereas “TRA_station” records the name of each train station and its longitude and latitude. The ridership data of November 2020 is stored in “TRA_ridership”. We want to plot a map of the routes and stations of the Taiwan Railway Administration (TRA). The data provided is not in shapefile format, but in CSV files. You should convert it to simple features in advance based on the skills you have learned. Additionally, please join the station data with the ridership of each station. Last, plot the map, with the size of each station point based on its ridership; that is, stations with higher ridership should be drawn as larger points.

First of all, since the ridership data is given by date for each station, we need to sum up the ridership by station before joining the data. We would like to use package dplyr to tidy it.

# read the ridership data
ridership=read.csv("./data/TRA/TRA_ridership.csv")

# the ridership is given by date for each station
head(ridership)
##   trnOpDate staCode gateInComingCnt gateOutGoingCnt
## 1  20201101     900            8063            7635
## 2  20201101     910            1054            1053
## 3  20201101     920            1732            1775
## 4  20201101     930            4225            4300
## 5  20201101     940            1938            1901
## 6  20201101     950            1434            1331
# group by and summarize the data
ridership=group_by(ridership, staCode)%>%
  summarise(riderships=sum(gateInComingCnt+gateOutGoingCnt))

# print the final result
ridership
## # A tibble: 239 x 2
##    staCode riderships
##  *   <int>      <int>
##  1     900     438926
##  2     910      80125
##  3     920     131957
##  4     930     360184
##  5     940     157416
##  6     950     115831
##  7     960     620925
##  8     970     582197
##  9     980     704605
## 10     990     960433
## # ... with 229 more rows

The rest of the work should be done on your own. Here are some tips:

  1. Please note that the data “TRA_line” and “TRA_station” contain Chinese characters, so make sure to add fileEncoding="UTF-8" in read.csv().
  2. Please use EPSG:4326 as the CRS, which is in longitude-latitude form.
  3. Please note that the TRA station codes in the two datasets are not of the same type. One is numeric, and the other is character. It is recommended to convert all station codes to numeric type. (If the types of the station codes are not unified, an error will occur in left_join().)
  4. As the column names to be joined are not the same in the two datasets, you should define the parameter by= in the form by=c("A"="B"), where “A” is the column name of the data placed in the first parameter of left_join(), and “B” is the column name of the data placed in the second parameter. A hedged sketch of one possible solution follows these tips.
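If you get stuck, below is one possible sketch only. The file names and the column names of “TRA_station” and “TRA_line” used here (stationCode, lon, lat, geometry) are assumptions for illustration, not taken from the actual files, so replace them with the real names in your data.

# read the station and line data (they contain Chinese characters)
station=read.csv("./data/TRA/TRA_station.csv", fileEncoding="UTF-8")
line=read.csv("./data/TRA/TRA_line.csv", fileEncoding="UTF-8")

# convert to simple features: stations from lon/lat columns, lines from the WKT column
station=st_as_sf(station, coords=c("lon", "lat"), crs=4326)
line=st_as_sf(line, wkt="geometry", crs=4326)

# unify the type of the station code and join the ridership data
station$stationCode=as.numeric(station$stationCode)
station=left_join(station, ridership, by=c("stationCode"="staCode"))

# plot the railway lines and the stations, sized by ridership
ggplot()+
  geom_sf(data=line)+
  geom_sf(data=station, aes(size=riderships), color="red")+
  theme(panel.background=element_blank())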

The result is shown below.

## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
## give correct results for longitude/latitude data
## Warning: Removed 3 rows containing missing values (geom_sf).

Spatial Operations

We have learned how geographic datasets are structured in R, and how to join the spatial and attribute features. In this chapter, we will further discuss spatial operations. Imagine that we want to know the number of schools in each village: we should intersect the school point layer and the village polygon layer to identify which polygon each point falls in. “Intersection” is one type of spatial operation. Or imagine we want to know whether there are schools within 500 meters of MRT stations: we need to draw a buffer around the stations in advance. “Buffer” is also a type of spatial operation. From the examples above, we can see that spatial operations are vital in spatial analysis.

We will introduce some common spatial operation functions in package sf. Please note that we will also use some functions in package dplyr. If you are not acquainted with the package, please read the notebook attached here, which provides a quick introduction to the major functions in dplyr. More detailed information on dplyr is provided in Chapter 5 of R for Data Science.

Spatial Join

As we learned in the previous chapter, joining spatial and attribute data requires a primary key, namely a common column variable. But what if we want to join two spatial datasets? A similar concept applies, but the join relies on shared areas of geographic space (spatial overlay). We can apply the function st_join(x, y, left). The target object, which will be on the left part of the new data frame, is placed in the first parameter (x), while the joined object is placed in the second parameter (y). The parameter left= determines whether to perform a left join; it is left=T by default, returning all records of the x object together with the y fields. If left=F, an inner join is performed instead.

Take nz and nz_height in spData for instance. The former is the map of New Zealand, while the latter contains the 101 highest points in the country. In the example below, we want to know which province each point is located in.

st_join(nz_height, nz["Name"])
## Simple feature collection with 101 features and 3 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1204143 ymin: 5048309 xmax: 1822492 ymax: 5650492
## projected CRS:  NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##    t50_fid elevation       Name                geometry
## 1  2353944      2723  Southland POINT (1204143 5049971)
## 2  2354404      2820      Otago POINT (1234725 5048309)
## 3  2354405      2830      Otago POINT (1235915 5048745)
## 4  2369113      3033 West Coast POINT (1259702 5076570)
## 5  2362630      2749 Canterbury POINT (1378170 5158491)
## 6  2362814      2822 Canterbury POINT (1389460 5168749)
## 7  2362817      2778 Canterbury POINT (1390166 5169466)
## 8  2363991      3004 Canterbury POINT (1372357 5172729)
## 9  2363993      3114 Canterbury POINT (1372062 5173236)
## 10 2363994      2882 Canterbury POINT (1372810 5173419)

Please note that both arguments should retain the class sf. Thus, the second parameter in st_join() should not be nz$Name, which is a plain character vector without geometry. Use nz["Name"] instead, to retain the spatial data.

You may be confused about the order of the data placed in st_join(). Let’s do a simple comparison. In st_join(nz_height, nz), nz_height is the target data, and hence the new data frame is based on the highest points, and the geometry keeps the form of nz_height (point). In st_join(nz, nz_height), nz is the target data, and hence the new data frame is based on the provinces, and the geometry keeps the form of nz (polygon). Trying the code yourself may give you a better understanding.
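A quick, hedged sketch of the two orderings (only one attribute column of the joined object is kept, so the output is easier to read):

# point target: one row per highest point, with the province name attached
st_join(nz_height, nz["Name"])

# polygon target: one row per province (repeated when a province contains
# several points), with the elevation attached
st_join(nz, nz_height["elevation"])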

Note that st_join(..., join=st_intersects) performs the intersection operation by default. If we want to perform other operations, we need to revise the parameter join= in st_join(..., join=). The functions that can be used in join= are listed in the documentation of geos_binary_pred. Their features are briefly introduced in the table below (not all of them are appropriate for st_join); a short example follows the table.

Function Features
st_intersects(x,y) identifies if x and y share any space
st_contains(x,y) identifies if y lies completely within x (e.g., a point y within a polygon x)
st_disjoint(x,y) identifies if the geometries of x do not share any space with y
st_crosses(x,y) identifies if x and y cross, i.e., they share some but not all interior points
st_covers(x,y) identifies if no point of y lies outside x
st_covered_by(x,y) identifies if no point of x lies outside y (i.e., x is completely covered by y)
st_within(x,y) identifies if x lies completely within y (e.g., a point x within a polygon y)
st_touches(x,y) identifies if the geometries of x and y share a common boundary point but their interiors do not intersect
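For example, with the point layer nz_height as the target and the polygon layer nz as the joined object, the predicate can be stated explicitly. This is a hedged sketch; for point-in-polygon data, st_within should return the same matches as the default st_intersects:

# explicitly require that each point lies within a province polygon
st_join(nz_height, nz["Name"], join=st_within)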

Attribute Aggregation

Aggregation operations summarize datasets (including attribute and spatial features) by a “grouping variable”. We can use group_by() %>% summarise() in dplyr to conduct the aggregation. Let’s take the data “world” for example. We want to group the data by continent, and summarize the total population of each continent. The code and result are shown below.

# map before aggregation
ggplot(world)+
  geom_sf()+
  theme(panel.background=element_blank())

# aggregation operations

# group_by(Data, Grouping_variable)%>%
#   summarise(New_column=Operation(Operated_variable))
continent_pop=group_by(world, continent)%>%
  summarise(pop=sum(pop, na.rm=T))

# glance at new data
continent_pop
## Simple feature collection with 8 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## geographic CRS: WGS 84
## # A tibble: 8 x 3
##   continent              pop                                                geom
##   <chr>                <dbl>                         <MULTIPOLYGON [arc_degree]>
## 1 Africa              1.15e9 (((32.83012 -26.74219, 32.58026 -27.47016, 32.4621~
## 2 Antarctica          0.     (((-48.66062 -78.04702, -48.1514 -78.04707, -46.66~
## 3 Asia                4.31e9 (((120.295 -10.25865, 118.9678 -9.557969, 119.9003~
## 4 Europe              6.69e8 (((-51.6578 4.156232, -52.24934 3.241094, -52.5564~
## 5 North America       5.65e8 (((-61.68 10.76, -61.105 10.89, -60.895 10.855, -6~
## 6 Oceania             3.78e7 (((169.6678 -43.55533, 170.5249 -43.03169, 171.125~
## 7 Seven seas (open~   0.     (((68.935 -48.625, 69.58 -48.94, 70.525 -49.065, 7~
## 8 South America       4.12e8 (((-66.95992 -54.89681, -67.29103 -55.30124, -68.1~
# map after aggregation
ggplot(continent_pop)+
  geom_sf()+
  theme(panel.background=element_blank())

After the aggregation, we can see that the population (attribute) has been summed up by continent. The geometry is also aggregated, which eliminates the borders between countries within each continent. The geometry of the new map represents the border of each continent.

Here, let’s do more practice with the functions of dplyr. Suppose we want to retrieve the total population, the average life expectancy, and the number of countries by subregion, and then arrange the data by average life expectancy in descending order. The code is shown below.

subregion_info=group_by(world, subregion)%>%
  summarise(
    pop=sum(pop, na.rm=T),              # use function sum() to calculate total population
    lifeExp=mean(lifeExp, na.rm=T),     # use function mean() to calculate average life expectancy
    num_cou=n()                         # use function n() to calculate total number in each group
  )%>%
  arrange(desc(lifeExp))                # use arrange to order data, and desc() means descending order

# glance at new data
subregion_info
## Simple feature collection with 22 features and 4 fields
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## geographic CRS: WGS 84
## # A tibble: 22 x 5
##    subregion          pop lifeExp num_cou                                   geom
##    <chr>            <dbl>   <dbl>   <int>                <GEOMETRY [arc_degree]>
##  1 Australia and~  2.80e7    81.9       2 MULTIPOLYGON (((126.1487 -32.21597, 1~
##  2 Western Europe  1.26e8    81.8       7 MULTIPOLYGON (((-51.6578 4.156232, -5~
##  3 Northern Amer~  3.54e8    80.4       3 MULTIPOLYGON (((-155.4021 20.07975, -~
##  4 Northern Euro~  9.66e7    79.5      10 MULTIPOLYGON (((-6.197885 53.86757, -~
##  5 Southern Euro~  1.53e8    78.4      12 MULTIPOLYGON (((26.29 35.29999, 26.16~
##  6 Eastern Asia    1.57e9    76.3       6 MULTIPOLYGON (((109.4752 18.1977, 108~
##  7 Central Ameri~  1.70e8    74.7       8 POLYGON ((-77.35336 8.670505, -77.474~
##  8 Western Asia    2.52e8    74.5      18 MULTIPOLYGON (((53.10857 16.65105, 52~
##  9 Eastern Europe  2.93e8    74.5      10 MULTIPOLYGON (((28.55808 43.70746, 28~
## 10 Caribbean       4.06e7    73.8       7 MULTIPOLYGON (((-61.68 10.76, -61.105~
## # ... with 12 more rows

Function sum() obtains the total value; mean() obtains the average value; n() obtains the number of rows in the group. The parameter na.rm=T means that the calculation skips NA (not available) values; if we do not add this parameter, the result would be NA as well. Function desc() arranges the data in descending order; otherwise the order is ascending by default. As the result printed above shows, the three subregions with the highest average life expectancy are Australia and New Zealand, Western Europe, and Northern America respectively.

In addition to dplyr, we can use the function aggregate() provided by package sf to obtain the same result. The first parameter of the function is the data we want to summarize. Note that the data should retain the sf class. Thus, do not use the form world$pop, whose type is “numeric”. Instead, write it as world["pop"], whose class is “sf” and “data.frame”. The parameter by= should be filled in with the columns we want to group by; note that it should be given as a list. The parameter FUN is the summary function, such as sum or mean. Last, if the data contains NA, we should add the parameter na.rm=T in order to skip NA. Again, take the data “world” for example, group the data by continent, and summarize the total population of each continent. The code and result are as follows.

continent_pop_ag=aggregate(world["pop"], by=list(world$continent), FUN=sum, na.rm=T)
continent_pop_ag
## Simple feature collection with 8 features and 2 fields
## Attribute-geometry relationship: 0 constant, 1 aggregate, 1 identity
## geometry type:  GEOMETRY
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## geographic CRS: WGS 84
##                   Group.1        pop                       geometry
## 1                  Africa 1154946633 MULTIPOLYGON (((32.83012 -2...
## 2              Antarctica          0 MULTIPOLYGON (((-163.7129 -...
## 3                    Asia 4311408059 MULTIPOLYGON (((120.295 -10...
## 4                  Europe  669036256 MULTIPOLYGON (((-51.6578 4....
## 5           North America  565028684 MULTIPOLYGON (((-61.68 10.7...
## 6                 Oceania   37757833 MULTIPOLYGON (((169.6678 -4...
## 7 Seven seas (open ocean)          0 POLYGON ((68.935 -48.625, 6...
## 8           South America  412060811 MULTIPOLYGON (((-66.95992 -...

You may find that package dplyr provides more flexible functions for attribute calculations in spatial analysis. Hence, it is very helpful to use the packages “sf” and “dplyr” together.

Spatial Aggregation

Similar to the attribute aggregation introduced in the previous section, spatial aggregation can be done with the function aggregate(). In attribute aggregation, the first parameter should be of class sf, while the second parameter by= is a list of non-spatial data. In spatial aggregation, there is a small difference: the data passed to by= should also contain geometry. Briefly, in aggregate(x, by=y), the geometry of the source (y) defines how the values in the target object (x) are grouped.

Take data nz and nz_height for example. We want to know the average height in each province of New Zealand. The code is shown below.

aggregate(x=nz_height, by=nz, FUN=mean)
## Simple feature collection with 16 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 1090144 ymin: 4748537 xmax: 2089533 ymax: 6191874
## projected CRS:  NZGD2000 / New Zealand Transverse Mercator 2000
## First 10 features:
##    t50_fid elevation                       geometry
## 1       NA        NA MULTIPOLYGON (((1745493 600...
## 2       NA        NA MULTIPOLYGON (((1803822 590...
## 3  2408405  2734.333 MULTIPOLYGON (((1860345 585...
## 4       NA        NA MULTIPOLYGON (((2049387 583...
## 5       NA        NA MULTIPOLYGON (((2024489 567...
## 6       NA        NA MULTIPOLYGON (((2024489 567...
## 7       NA        NA MULTIPOLYGON (((1740438 571...
## 8  2408395  2777.000 MULTIPOLYGON (((1866732 566...
## 9       NA        NA MULTIPOLYGON (((1881590 548...
## 10 2368390  2889.455 MULTIPOLYGON (((1557042 531...

As the result shows, all attribute features of nz_height (the first parameter) are averaged within each province, and the geometry is based on the spatial data in the second parameter by=. You may find that the result is not perfect, since it does not contain the attributes of the data in by=. Also, it does not make sense to average every attribute feature. For instance, the variable “t50_fid” is the identity number of the highest points, so it is meaningless to take its mean.
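One possible refinement (a sketch) is to keep only the elevation column of nz_height before aggregating, so that meaningless averages such as the mean of t50_fid are not computed:

# aggregate only the elevation column over the province polygons
aggregate(x=nz_height["elevation"], by=nz, FUN=mean)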

In addition to function aggregate(), the same result can also be generated by function group_by() %>% summarise() in dplyr. The code and result are shown below.

ave_height=st_join(nz, nz_height)%>%
  group_by(Name)%>%
  summarise(elevation=mean(elevation, na.rm=T))

# plot the map of average height of each province
ggplot(ave_height)+
  geom_sf(aes(fill=elevation))+
  scale_fill_continuous(low="#CEFFCE", high="#006000")+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())

We need to join the two spatial datasets (nz is the target data, so it is placed in the first parameter), then group by the variable “Name” and summarize the mean elevation in each group. Though the code is longer than aggregate(), dplyr gives readable and clear code for spatial aggregation.

In this example, we can again see that using the “dplyr” and “sf” packages together may be a better way to do spatial aggregation.

Interpolation

Interpolation is the process of calculating estimates from a source set of polygons for an overlapping but incongruent set of target polygons. Imagine that we want to derive the population of a specific area defined by ourselves. Since the boundary of the study area we define does not align with that of the population data (most population data is based on villages or Statistical Areas), we should use interpolation to produce the estimates in this situation. The concept is simply illustrated in the figure below.

In the figure above, the study area overlaps three districts whose populations have been surveyed: it covers 40% of Area 1, 25% of Area 2, and 40% of Area 3. Population is assumed to be spatially extensive, and the sum is preserved. Hence, we can derive the total population of the study area from the weight of each source area: 1000*40% + 800*25% + 900*40% = 960.

Looking at the study area itself, 40% of its area lies in Area 1, 33.3% in Area 2, and 26.7% in Area 3. Density is spatially intensive instead, and the mean is preserved. Hence, we can derive the mean density of the study area from these area shares: 200/3*40% + 40*33.3% + 90*26.7% = 64.
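A quick check of the arithmetic above in R (the weights are written as exact fractions; 33.3% and 26.7% in the text are the rounded values of 1/3 and 4/15):

# spatially extensive attribute (population): weighted sum
sum(c(1000, 800, 900)*c(0.40, 0.25, 0.40))
## [1] 960

# spatially intensive attribute (density): area-weighted mean
sum(c(200/3, 40, 90)*c(0.40, 1/3, 4/15))
## [1] 64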

Package sf provides the function st_interpolate_aw() to conduct the interpolation. Three parameters are required in st_interpolate_aw(x, to, extensive). The first parameter (x) is the object of simple features whose attributes we want to aggregate. The second parameter (to) is the object of simple features with the target geometries. The parameter extensive= describes the nature of the data: if it is TRUE, the attribute variables are assumed to be spatially extensive (like population) and the sum is preserved; otherwise, they are assumed to be spatially intensive (like population density) and the mean is preserved.

Here, we use the data incongruent and aggregating_zones provided by package spData. The former is a region in the UK divided into 9 districts, while the latter covers the same region but is divided into 2 larger zones (as the figure below shows). We want to interpolate the values in incongruent onto the geometry of aggregating_zones, and derive the sum of the values in the two zones. The code and result are shown below.

st_interpolate_aw(incongruent["value"], aggregating_zones, extensive=T)
## Warning in st_interpolate_aw.sf(incongruent["value"], aggregating_zones, :
## st_interpolate_aw assumes attributes are constant or uniform over areas of x
## Simple feature collection with 2 features and 1 field
## Attribute-geometry relationship: 0 constant, 1 aggregate, 0 identity
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 417686.2 ymin: 443703.6 xmax: 422959.3 ymax: 447036.8
## projected CRS:  OSGB 1936 / British National Grid
##      value                       geometry
## 1 19.61613 MULTIPOLYGON (((418731.9 44...
## 2 25.66872 MULTIPOLYGON (((419196.4 44...

Unions

Aggregation can dissolve the boundaries of touching polygons in the same group. So what happens behind the scenes in aggregate() and group_by() %>% summarise()? They combine the geometries and eliminate the boundaries within each group by using the function st_union(). In the example below, we filter the western US states and then union them, combining all the geometries into one.

# filter the west of US states
us_west=filter(us_states, REGION=="West")
head(us_west)
## Simple feature collection with 6 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.4096 ymin: 31.33224 xmax: -102.0422 ymax: 49.00091
## geographic CRS: NAD83
##   GEOID       NAME REGION            AREA total_pop_10 total_pop_15
## 1    04    Arizona   West 295281.3 [km^2]      6246816      6641928
## 2    08   Colorado   West 269573.1 [km^2]      4887061      5278906
## 3    16      Idaho   West 216512.7 [km^2]      1526797      1616547
## 4    30    Montana   West 380829.2 [km^2]       973739      1014699
## 5    32     Nevada   West 286363.7 [km^2]      2633331      2798636
## 6    06 California   West 409747.1 [km^2]     36637290     38421464
##                         geometry
## 1 MULTIPOLYGON (((-114.7196 3...
## 2 MULTIPOLYGON (((-109.0501 4...
## 3 MULTIPOLYGON (((-116.916 45...
## 4 MULTIPOLYGON (((-116.0492 4...
## 5 MULTIPOLYGON (((-119.9992 4...
## 6 MULTIPOLYGON (((-118.6034 3...
# union us_west
us_west=st_union(us_west)
## although coordinates are longitude/latitude, st_union assumes that they are planar
head(us_west)
## Geometry set for 1 feature 
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 31.33224 xmax: -102.0422 ymax: 49.00236
## geographic CRS: NAD83
## MULTIPOLYGON (((-118.6055 33.031, -118.37 32.83...

Centroid

Centroid operations identify the center of a geographic object. Using centroids, we can approximately estimate the distance between polygons. Or, if you want to convert a complex polygon into a point geometry, the centroid of the polygon may be a good choice. The most common type of centroid is the geographic centroid, which represents the center of mass of a spatial object. The function st_centroid() can generate the centroid. We use the data us_states as an example to derive the centroid of each state.

# derive the centroid
us_states_centroid=st_centroid(us_states)
## Warning in st_centroid.sf(us_states): st_centroid assumes attributes are
## constant over geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
# plot the map
ggplot(us_states)+
  geom_sf()+
  geom_sf(data=us_states_centroid, color="red")+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())

But there is a drawback of st_centroid(): if the polygon consists of several separated areas, the centroid might not lie on the polygon. For instance, the centroid of a group of islands may not be situated on land, but in the ocean. The function st_point_on_surface() can solve this problem: it ensures that the point falls on its parent object. Please note, however, that the point st_point_on_surface() generates is not guaranteed to be located in the largest sub-area. Here is an example of the centroid and point on surface of Penghu Island, which is located about 50 kilometers west of Taiwan Island. The data needed has been downloaded previously. Please place the file in the same directory as the R script file. Click here to re-download the file if it is lost.

# read the shapefile of Taiwan and filter Penghu Island
taiwan_county=read_sf("./data/taiwan_county/taiwan_county.shp")
penghu=filter(taiwan_county, COUNTYCODE==10016)

# derive the centroid
penghu_centroid=st_centroid(penghu)
## Warning in st_centroid.sf(penghu): st_centroid assumes attributes are constant
## over geometries of x
## Warning in st_centroid.sfc(st_geometry(x), of_largest_polygon =
## of_largest_polygon): st_centroid does not give correct centroids for longitude/
## latitude data
# derive the point on surface
penghu_point=st_point_on_surface(penghu)
## Warning in st_point_on_surface.sf(penghu): st_point_on_surface assumes
## attributes are constant over geometries of x
## Warning in st_point_on_surface.sfc(st_geometry(x)): st_point_on_surface may not
## give correct results for longitude/latitude data
# plot the map
ggplot(penghu)+
  geom_sf()+
  geom_sf(data=penghu_point, color="red", size=2)+
  geom_sf(data=penghu_centroid, color="blue", size=2)+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())

In the Penghu example, we can see that the point on surface (red dot) and the centroid (blue dot) are not the same. The centroid is located in the inland sea of Penghu, while the point on surface is located on one of the islands of Penghu. Please note that the island the point on surface falls on is not the largest one. For more information on its algorithm, please refer to the discussion attached.

Buffer

Imagine that we want to calculate the number of schools within a 500-meter radius of MRT stations: we first need to draw the buffer of each MRT station before counting the points. A buffer is a zone (polygon) around a map feature within a given distance, and it is useful for proximity analysis. Buffers can be created with the function st_buffer(), whatever the geometry type is. Two parameters should be given to st_buffer(): place the spatial data first, and then set the radius. Please note that the unit of the distance depends on the coordinate reference system. In general, the unit is meters (m) in a projected coordinate system, while st_crs()$units returns NULL for a geographic coordinate system (e.g., EPSG:4326), whose unit is decimal degrees. Hence, if spatial data with CRS EPSG:4326 is obtained, we need to use st_transform() to convert the CRS to a projected coordinate system appropriate for the study area.

The example below illustrates the buffer of cycle hire points across London. The data cycle_hire is provided by package spData. Suppose we define the region within 300 meters of a hire point as its service area.

# test of the CRS and buffer of cycle_hire
st_crs(cycle_hire)$epsg             # original CRS: EPSG:4326 (geographic)
## [1] 4326
st_buffer(cycle_hire, 300)
## Warning in st_buffer.sfc(st_geometry(x), dist, nQuadSegs, endCapStyle =
## endCapStyle, : st_buffer does not correctly buffer longitude/latitude data
## dist is assumed to be in decimal degrees (arc_degrees).
## Simple feature collection with 742 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: -300.2368 ymin: -248.5452 xmax: 299.9977 ymax: 351.5421
## geographic CRS: WGS 84
## First 10 features:
##    id               name             area nbikes nempty
## 1   1       River Street      Clerkenwell      4     14
## 2   2 Phillimore Gardens       Kensington      2     34
## 3   3 Christopher Street Liverpool Street      0     32
## 4   4  St. Chad's Street     King's Cross      4     19
## 5   5     Sedding Street    Sloane Square     15     12
## 6   6 Broadcasting House       Marylebone      0     18
## 7   7   Charlbert Street  St. John's Wood     15      0
## 8   8         Lodge Road  St. John's Wood      5     13
## 9   9     New Globe Walk         Bankside      3     16
## 10 10        Park Street         Bankside      1     17
##                          geometry
## 1  POLYGON ((299.89 51.52916, ...
## 2  POLYGON ((299.8024 51.49961...
## 3  POLYGON ((299.9154 51.52128...
## 4  POLYGON ((299.879 51.53006,...
## 5  POLYGON ((299.8431 51.49313...
## 6  POLYGON ((299.8558 51.51812...
## 7  POLYGON ((299.8319 51.5343,...
## 8  POLYGON ((299.8299 51.52834...
## 9  POLYGON ((299.9036 51.50739...
## 10 POLYGON ((299.9072 51.50597...

The warning shows that st_buffer() does not correctly buffer longitude-latitude spatial data, since the distance is assumed to be in decimal degrees (arc_degrees). Though the geometric operation is still carried out, the result is not meaningful. (You can plot the map to check what the geometry looks like.) Hence, we should not create buffers from spatial data in EPSG:4326, but transform the CRS in advance. However, which CRS should be applied? We can look up an appropriate projected CRS for England (EPSG:27700, British National Grid) in the EPSG database.

# transform CRS to EPSG:27700 (projected)
cycle_hire_new=st_transform(cycle_hire, crs=27700)

# create the buffer (radius: 300 m)
st_buffer(cycle_hire_new, 300)
## Simple feature collection with 742 features and 5 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 522202 ymin: 174108 xmax: 539033.2 ymax: 184721
## projected CRS:  OSGB 1936 / British National Grid
## First 10 features:
##    id               name             area nbikes nempty
## 1   1       River Street      Clerkenwell      4     14
## 2   2 Phillimore Gardens       Kensington      2     34
## 3   3 Christopher Street Liverpool Street      0     32
## 4   4  St. Chad's Street     King's Cross      4     19
## 5   5     Sedding Street    Sloane Square     15     12
## 6   6 Broadcasting House       Marylebone      0     18
## 7   7   Charlbert Street  St. John's Wood     15      0
## 8   8         Lodge Road  St. John's Wood      5     13
## 9   9     New Globe Walk         Bankside      3     16
## 10 10        Park Street         Bankside      1     17
##                          geometry
## 1  POLYGON ((531503.5 182832.1...
## 2  POLYGON ((525508.1 179391.9...
## 3  POLYGON ((533285.8 182001.6...
## 4  POLYGON ((530737.8 182912, ...
## 5  POLYGON ((528351 178742, 52...
## 6  POLYGON ((529158.4 181542.9...
## 7  POLYGON ((527459 183300.8, ...
## 8  POLYGON ((527332.7 182634.6...
## 9  POLYGON ((532505 180434.6, ...
## 10 POLYGON ((532764.9 180284.3...

The figure above shows the location of the cycle hire points (black dots), and the buffer of each point is painted red.
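The code for that figure is not shown above; one possible sketch, reusing the objects defined earlier, is as follows:

# buffer of 300 m around each hire point (projected CRS)
cycle_hire_buffer=st_buffer(cycle_hire_new, 300)

# plot the buffers (red) and the hire points (black dots)
ggplot()+
  geom_sf(data=cycle_hire_buffer, fill="red", color=NA, alpha=0.5)+
  geom_sf(data=cycle_hire_new, size=0.3)+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())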

Boundary

The function st_boundary() returns the boundary of a geometry; for a polygon, this is the line that outlines it. Boundary and union are often applied together to emphasize the border of the study area. The example below shows us_states and makes the border of the whole country thicker.

# union us_states and then derive the boundary
us_boundary=st_boundary(st_union(us_states))
## although coordinates are longitude/latitude, st_union assumes that they are planar
# plot the map
ggplot(us_states)+
  geom_sf()+
  geom_sf(data=us_boundary, size=1)+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())

Clipping

Clipping is a critical spatial operation in GIS. It extracts the geometries (points, lines, polygons) that lie within the area outlined by the clip polygon. For instance, suppose we have the polygon of Taipei City and the points of all schools in Taiwan, and we want to extract the schools located in Taipei City. We need to use the polygon to clip the point data, removing all the points outside the polygon. Though there is in fact no “clip” function provided by package sf, the function st_intersection() helps us do so. Please note, however, that there is a difference between clipping and intersection in the strict GIS definition. Clipping and intersection both extract the geometries within a specific polygon; nonetheless, clipping does not retain the attributes of the clip polygon, while intersection creates output features that possess the attributes of all input features. The concept is illustrated below.

This slight difference is a fundamental concept of GIS, but it is not really vital for spatial analysis in R (after all, there is no clip function provided in R). In st_intersection(x, y), the two datasets can be placed in either parameter, but note that the attributes of x will be on the left-hand side of the new data frame, while those of y will be on the right-hand side. Here we use the example of nz (map of New Zealand) and nz_height (101 highest points in New Zealand) again. Suppose we want to extract the points located on the North Island. The code and result are shown below.

# filter North Island in data nz
nz_north=filter(nz, Island=="North")

# st_intersection(point, polygon)
st_intersection(nz_height, nz_north)
## Warning: attribute variables are assumed to be spatially constant throughout all
## geometries
## Simple feature collection with 5 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1820643 ymin: 5647971 xmax: 1822492 ymax: 5650492
## projected CRS:  NZGD2000 / New Zealand Transverse Mercator 2000
##     t50_fid elevation              Name Island Land_area Population
## 99  2408397      2751           Waikato  North  23900.04     460100
## 100 2408406      2720           Waikato  North  23900.04     460100
## 101 2408411      2732           Waikato  North  23900.04     460100
## 97  2408394      2797 Manawatu-Wanganui  North  22220.61     234500
## 98  2408395      2757 Manawatu-Wanganui  North  22220.61     234500
##     Median_income Sex_ratio                geometry
## 99          27900 0.9520500 POINT (1820660 5649488)
## 100         27900 0.9520500 POINT (1822263 5650429)
## 101         27900 0.9520500 POINT (1822492 5650492)
## 97          25000 0.9387734 POINT (1821014 5647971)
## 98          25000 0.9387734 POINT (1820643 5648331)

Remember the spatial join operation? Actually, we can use function st_join() to do the same thing. The code is shown below.

# join two data first
nz_height_join=st_join(nz_height, nz)

# filter the highest point located in North Island
filter(nz_height_join, Island=="North")
## Simple feature collection with 5 features and 8 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1820643 ymin: 5647971 xmax: 1822492 ymax: 5650492
## projected CRS:  NZGD2000 / New Zealand Transverse Mercator 2000
##   t50_fid elevation              Name Island Land_area Population Median_income
## 1 2408394      2797 Manawatu-Wanganui  North  22220.61     234500         25000
## 2 2408395      2757 Manawatu-Wanganui  North  22220.61     234500         25000
## 3 2408397      2751           Waikato  North  23900.04     460100         27900
## 4 2408406      2720           Waikato  North  23900.04     460100         27900
## 5 2408411      2732           Waikato  North  23900.04     460100         27900
##   Sex_ratio                geometry
## 1 0.9387734 POINT (1821014 5647971)
## 2 0.9387734 POINT (1820643 5648331)
## 3 0.9520500 POINT (1820660 5649488)
## 4 0.9520500 POINT (1822263 5650429)
## 5 0.9520500 POINT (1822492 5650492)

In this example, we find that st_join() and st_intersection() perform a similar operation. This is because st_join() uses the intersection predicate by default, that is, st_join(..., join=st_intersects). Other operations are listed in the table of geos_binary_pred.

In addition to st_intersection(), there are some other functions for subsetting and clipping spatial data. To illustrate the concept and operation of these functions, we use a simple example: two overlapping circles (X and Y), with centers one unit apart and a radius of one. The code and results are shown below.

X=st_buffer(st_point(c(1,1)), dist=1)   # circle X: center (1,1), radius 1
Y=st_buffer(st_point(c(2,1)), dist=1)   # circle Y: center (2,1), radius 1

st_intersection(X, Y)      # area shared by X and Y
st_difference(X, Y)        # part of X that is not in Y
st_difference(Y, X)        # part of Y that is not in X
st_union(X, Y)             # X and Y combined
st_sym_difference(X, Y)    # parts of X and Y that do not overlap
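To see what one of these operations returns, a possible way to plot it is sketched below (only the intersection is shaded here; X and Y are wrapped with st_sf() because they are raw sfg geometries):

# draw the two circles and shade their intersection
ggplot()+
  geom_sf(data=st_sf(geometry=st_sfc(X)), fill=NA)+
  geom_sf(data=st_sf(geometry=st_sfc(Y)), fill=NA)+
  geom_sf(data=st_sf(geometry=st_sfc(st_intersection(X, Y))), fill="grey")+
  theme(panel.background=element_blank())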

Simplification

When plotting a map, it is better to remove detailed and trivial information that may make the map look complicated. This process is called cartographic generalization. It seeks to abstract spatial information from a high level of detail to information that can be rendered on a map at a lower level of detail. One of the main types of generalization is simplification. It simplifies the geometry and reduces the amount of memory required. Many algorithms have been developed for simplification, such as the Ramer–Douglas–Peucker algorithm and the Visvalingam–Whyatt algorithm. Let’s take a closer look at the Ramer–Douglas–Peucker algorithm, illustrated in the figure below.

The algorithm first finds the point that is farthest from the line segment whose end points are the first and last points. If the distance between that point and the line segment is shorter than the minimum threshold ε, then all the intermediate points can be discarded. If the distance is greater than ε, the point must be kept. The algorithm then recursively calls itself on the part from the first point to the farthest point, and on the part from the farthest point to the last point, with the farthest point marked as kept.
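To make the idea concrete, here is a minimal sketch of the Ramer–Douglas–Peucker recursion on a plain matrix of coordinates (one point per row). It is for illustration only and is not how sf implements it; use st_simplify() below for real work.

rdp=function(pts, eps){
  p1=pts[1,]; p2=pts[nrow(pts),]
  # perpendicular distance from every point to the line through the end points
  d=abs((p2[1]-p1[1])*(p1[2]-pts[,2])-(p1[1]-pts[,1])*(p2[2]-p1[2]))/
    sqrt((p2[1]-p1[1])^2+(p2[2]-p1[2])^2)
  i=which.max(d)
  if(d[i]>eps){
    # keep the farthest point and simplify the two halves recursively
    rbind(rdp(pts[1:i,,drop=FALSE], eps),
          rdp(pts[i:nrow(pts),,drop=FALSE], eps)[-1,,drop=FALSE])
  }else{
    # every intermediate point is within eps of the segment: keep only the ends
    pts[c(1,nrow(pts)),]
  }
}

# usage example: a noisy polyline simplified with eps = 0.5
line_xy=cbind(0:8, c(0, 0.1, -0.1, 3, 3.2, 3.1, 6, 6.1, 6))
rdp(line_xy, eps=0.5)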

The function st_simplify() provided in the sf package can do so. The simplification operation can be applied to all geometry types (point, line, polygon). In st_simplify(x, dTolerance), the spatial data is placed in the first parameter (x), and the parameter dTolerance represents the minimum threshold (ε). Please note that the spatial data should be in a projected coordinate system, or it may cause errors. We can use st_crs()$units to confirm whether the unit is meters; if it returns NULL, the data is in a geographic coordinate system, so please look up an appropriate projected CRS in the EPSG database. Here, let’s simplify the map of US states as shown in the code below.

# check the unit (if NULL, convert CRS)
st_crs(us_states)$units
## NULL
# transform CRS to 2163
us_states_2163=st_transform(us_states, crs=2163)

# simplification (ε = 1000000 meters = 1000 kilometers)
st_simplify(us_states_2163, dTolerance=1000000)
## Simple feature collection with 49 features and 6 fields (with 49 geometries empty)
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: NA ymin: NA xmax: NA ymax: NA
## projected CRS:  US National Atlas Equal Area
## First 10 features:
##    GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5     12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
## 7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
## 8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
## 9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
## 10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
##         geometry
## 1  POLYGON EMPTY
## 2  POLYGON EMPTY
## 3  POLYGON EMPTY
## 4  POLYGON EMPTY
## 5  POLYGON EMPTY
## 6  POLYGON EMPTY
## 7  POLYGON EMPTY
## 8  POLYGON EMPTY
## 9  POLYGON EMPTY
## 10 POLYGON EMPTY
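With dTolerance=1000000 (1,000 km), the tolerance is larger than the states themselves, so every geometry collapses to POLYGON EMPTY, as the output above shows. A smaller tolerance keeps a simplified but non-empty map; a possible sketch:

# simplification with a smaller tolerance (ε = 100000 meters = 100 kilometers)
us_states_simplified=st_simplify(us_states_2163, dTolerance=100000)

# plot the simplified map
ggplot(us_states_simplified)+
  geom_sf()+
  theme(axis.text=element_blank(),
        axis.ticks=element_blank(),
        axis.title=element_blank(),
        panel.background=element_blank())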

Affine Transformations

Geometric measurement

Nearest Analysis

Convex Hull

Voronoi Polygon