Spatial Analysis with R

Basic Knowledge on GIS

Geographic Information Systems

Geographic Information Systems (GIS) store, analyze, and visualize data for geographic positions on Earth’s surface. The four major features of GIS are listed below.

Create geographic data.
Manage spatial data in a database.
Analyze and find patterns.
Map Visualization.

In this notebook, we would focus on the issue about spaitial analysis and making visualized maps.

The detailed concept of GIS can be derived on the textbook “Introduction to Geographic Information Systems (Kang Tsung Chang)”.

Well-Known Text (WKT)

Well-known text (WKT) is a text markup language for representing vector geometry objects. The common type of geometric objects are marked as shown in the table below.

Type	Graph	WKT
Point		POINT (3 3)
LineString		LINESTRING (1 4, 3 3, 5 5)
Polygon		POLYGON (1 4, 2 2, 4 1, 5 5, 1 4)

The WKT types illustrated above are single geometry. However, sometimes we want to mark all the geometry together to signify the same attribute, single geometry may seem to be difficult to express. For instance, there are five campuses in NCTU. If we want to combine all the points into one single geometry, in order to represent “NCTU”, it is hard to express by POINT text. Thus, we need multi geometries to mark it.

Multi geometries are available to represent more than one geometry of the same dimension in a single object, including MultiPoint, MultiLineString, MultiPolygon. Geometry Collection is the geometries of different dimensions.

Shapefile

The shapefile format is a universal geospatial vector data format. It is developed and regulated by Esri, which is an international supplier of GIS software. The shapefile format can illustrate the vector features: points, lines, and polygons. In fact, shapefile is not a single file, but a collection containing four mandatory files. Each file is briefly introduced as the following.

File	Features
.shp	shape format; the feature geometry
.dbf	attribute format; attributes for each shape, stored as two-dimensional table
.prj	projection description, using a well-known text representation of coordinate reference systems
.shx	shape index format; a positional index of the feature geometry to allow seeking forwards and backwards quickly

Coordinate Reference System (CRS)

Geometric data is not geospatial unless it is accompanied by coordinate reference system (CRS) information, which allows GIS to display and operate the data accurately. It includes two major components, datum and projection. Datum is a model of the shape of the earth. It has angular units (degrees) and defines the starting point (0,0), and hence the coordinate can represent a specific spot on the earth. Projection is a mathematical transformation of the angular measurements on a round earth to a flat surface. The units associated with a given projection are usually linear (feet, meters, etc.).

There are two common types of coordinate systems:

Type	Features	Example
Geographic Coordinate Systems	A global or spherical coordinate system such as latitude–longitude.	EPSG:4326 (World Geodetic System 1984)
Projected Coordinate System	Based on a map projection such as transverse Mercator, Albers equal area, or Robinson, which project the spherical surface onto a two-dimensional coordinate plane.	EPSG:3826 (TWD97 / TM2 zone 121, Taiwan Datum 1997)

To understand the difference of CRS type clearly, let’s take NCTU for example. The coordinate of it can be recorded as (120.999645, 24.789071) in EPSG:4326. x coordinate represents longitude, while y coordinate is latitude. Also, it can be recorded as (249964.105, 2742415.017) in EPSG:3826. In this projected coordinate system, we define 121°E as standard line, and 250 kilometers west of it is defined as false northing, while the equator is defined as false easting. x coordinate represents that the spot is located at 249964.105 meters east of the false northing. y coordinate illustrates that the spot is located at 2742415.017 meters north of the equator.
Something interesting… The center of NCTU (Guangfu Campus) is very close to standard line (249964.105 m ~ 250 km). In fact 121°E traverses the campus!

Why Use R?

R is a programming language mainly for statistical computing and graphics. With a wide range of packages, R supports advanced geospatial statistics, modeling and visualization. In addition, integrated development environments such as RStudio have made it more user-friendly, allowing us to easily analyze spatial data and make maps.

Using R to do analysis, we need first to download the software and its development environments (RStudio).

Required Packages

To conduct the geocomputation in R, it is required to download “sf” package.

To analyze the data more efficiently, it is suggested to take advantage of “dplyr” package.

To display the map, “ggplot2” package is strongly recommended. (There are other packages to make the map as well; nonetheless, ggplot2 is more powerful and easier to learn.)

install.packages("sf")
install.packages("dplyr")
install.packages("ggplot2")
library(sf)
library(dplyr)
library(ggplot2)

Data Visualization with Maps

When it comes to Big Data, a massive amounts of information may cross our mind. How to process on the big data and to produce a neat summary is a vital issue nowadays, and so is the “Geospatial Big Data”. Data visualization with maps is one of the methods, which summarizes the huge data with geometry features.

The basic features of GIS is to make the maps!! But a map with only the geometric information may sound tedious. To display a map much more interestingly and interactively, we should combine the geometric features with its attributes. With the data joined, we can make the heat map, hypsometric map and so forth to make the attributes visualized.

In this section, we would go through the basic command on making maps, and then have a better knowing on the skills to label on the layers. Finally, briefly learn how to make the graduated, categorized, rule-based symbology map.

Making Maps

For the beginners, making map would give us a quick view and better understanding on how R can be used as GIS. It can be done simply by the function ggplot() in ggplot2 package.

The gg in ggplot2 means Grammar of Graphics, a graphic concept which describes plots by using a “grammar”. ggplot() is used to construct the initial plot object, and is followed by + to add component to the plot. To make the map, we should also follow this grammar.

The following two code represent how to draw United States Map by the ggplot grammar. ggplot() means the initialization of drawing, while geom_sf() informs the function ggplot() to make the map. The data can be either put in the code ggplot() or in geom_sf().

Note that “us_states” is an example data originally attached in spData package. Please install the packages “spData” and access the library before using the data.

install.packages("spData")
library(spData)

ggplot(data=us_states)+
  geom_sf()

ggplot()+
  geom_sf(data=us_states)

Two codes above can plot the same map as shown. And from this simple work, we may find out that making a map may be much easier and flexible than the desktop GIS (e.g., QGIS, ArcGIS), since it can produce map by standard and readable language.

Attributes of Shapefile

Notably, the class of “us_states” is “sf” (simple features) as well as “data.frame”. The simple features do not only store the geometric information, but the attributes of each features.

class(us_states)

## [1] "sf"         "data.frame"

Now, we know how to make a map in R. But if we want to make visualized map, it is better for us to take a closer look on the shapefile first. Take “us_states” for instance.

us_states

## Simple feature collection with 49 features and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##    GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5     12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
## 7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
## 8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
## 9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
## 10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
##                          geometry
## 1  MULTIPOLYGON (((-88.20006 3...
## 2  MULTIPOLYGON (((-114.7196 3...
## 3  MULTIPOLYGON (((-109.0501 4...
## 4  MULTIPOLYGON (((-73.48731 4...
## 5  MULTIPOLYGON (((-81.81169 2...
## 6  MULTIPOLYGON (((-85.60516 3...
## 7  MULTIPOLYGON (((-116.916 45...
## 8  MULTIPOLYGON (((-87.52404 4...
## 9  MULTIPOLYGON (((-102.0517 4...
## 10 MULTIPOLYGON (((-92.01783 2...

There are 49 features and 6 fields (attributes) in “us_states” data, that is, 49 rows (the 50 states except for Hawaii) and 6 columns in a table. However, we may find that there are actually 7 columns existing in the table above. It is because that the last column of this table stores the geometry of this shapefile. The “geometry” column is recorded in the form of “Well-Known Text (WKT)”.

It is allowed to apply basic command of R on the “sf” data. For instance, if we want to observe all the names in “us_states”, plus “$” behind the name of data, and then type on the column name.

## Output the name of us_states by "$"
us_states$NAME

##  [1] "Alabama"              "Arizona"              "Colorado"            
##  [4] "Connecticut"          "Florida"              "Georgia"             
##  [7] "Idaho"                "Indiana"              "Kansas"              
## [10] "Louisiana"            "Massachusetts"        "Minnesota"           
## [13] "Missouri"             "Montana"              "Nevada"              
## [16] "New Jersey"           "New York"             "North Dakota"        
## [19] "Oklahoma"             "Pennsylvania"         "South Carolina"      
## [22] "South Dakota"         "Texas"                "Vermont"             
## [25] "West Virginia"        "Arkansas"             "California"          
## [28] "Delaware"             "District of Columbia" "Illinois"            
## [31] "Iowa"                 "Kentucky"             "Maine"               
## [34] "Maryland"             "Michigan"             "Mississippi"         
## [37] "Nebraska"             "New Hampshire"        "New Mexico"          
## [40] "North Carolina"       "Ohio"                 "Oregon"              
## [43] "Rhode Island"         "Tennessee"            "Utah"                
## [46] "Virginia"             "Washington"           "Wisconsin"           
## [49] "Wyoming"

Or, we can use the concept of matrix to derive the desired columns and rows.

## Output the name and region of us_states by "[,]"
us_states[,c("NAME","REGION")]  # Put the name of column directly

## Output the name and region of us_states by "[,]"
us_states[,c(2,3)]  # NAME is stored in the second column

## Simple feature collection with 49 features and 2 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##           NAME   REGION                       geometry
## 1      Alabama    South MULTIPOLYGON (((-88.20006 3...
## 2      Arizona     West MULTIPOLYGON (((-114.7196 3...
## 3     Colorado     West MULTIPOLYGON (((-109.0501 4...
## 4  Connecticut Norteast MULTIPOLYGON (((-73.48731 4...
## 5      Florida    South MULTIPOLYGON (((-81.81169 2...
## 6      Georgia    South MULTIPOLYGON (((-85.60516 3...
## 7        Idaho     West MULTIPOLYGON (((-116.916 45...
## 8      Indiana  Midwest MULTIPOLYGON (((-87.52404 4...
## 9       Kansas  Midwest MULTIPOLYGON (((-102.0517 4...
## 10   Louisiana    South MULTIPOLYGON (((-92.01783 2...

## Output the attributes of first row
us_states[1,]

## Simple feature collection with 1 feature and 6 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -88.47323 ymin: 30.24971 xmax: -84.89184 ymax: 35.00803
## geographic CRS: NAD83
##   GEOID    NAME REGION            AREA total_pop_10 total_pop_15
## 1    01 Alabama  South 133709.3 [km^2]      4712651      4830620
##                         geometry
## 1 MULTIPOLYGON (((-88.20006 3...

In addition, with the characteristic of “data.frame”, we can easily manipulate data by the package “dplyr”. It would be further discussed in the following section.

Labels on Map

Sometimes we need to label on the map to inform the name or attribute of specific places. To do so, the command geom_sf_text() should be appended after the geom_sf(), to inform which column is going to be labeled. The code illustrated below is the map of US states with thier names labelled on.

ggplot(us_states)+
  geom_sf()+
  geom_sf_text(aes(label=NAME))

## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
## give correct results for longitude/latitude data

In function geom_sf_text(), aes() represents aesthetic mappings, which defines the coordinates of each label. The attributes we want to label is followed by label=.

However, something awkward occurs. The text is overlapping with each other when the labels are dense. To solve this problem, install the package “ggsflabel” from GitHub. (Note that “ggsflabel” has not been released on CRAN, which is a global network that store versions of documentation for R. We should download it from GitHub!!)
* If you want to learn more on the reason why install from GitHub and how “devtools” works, please click here.

install.packages("devtools")
devtools::install_github("yutannihilation/ggsflabel")
library(ggsflabel)

With this package, the overlapping text would be eliminated, and make the map clear and readable.

ggplot(us_states)+
  geom_sf()+
  geom_sf_text_repel(aes(label=NAME), nudge_y=0.5, size=3, color="red")

## Warning in st_point_on_surface.sfc(data$geometry): st_point_on_surface may not
## give correct results for longitude/latitude data

In function geom_sf_text() and geom_sf_text_repel(), nudge_x and nudge_y shift the label horizontally and vertically respectively in order to make the maps more visualizable.

Also, if we want to change the color or size of the text, just revise the variable in geom_sf_text() or geom_sf_text_repel(): color=, size=. Common colors can be directly given as the name of color, for instance, color="red". Other colors can be used by typing its Color Code, such as color="#2894FF".

Graduated Symbol Map

Graduated symbol map is a map that change in color or size according to the value of the attribute they represent. The data type of the particular value should be numeric and continuous. Graduated symbol map can be applied widely, for instance, denser populations might be represented by larger dots or darker polygon.

Let’s take “us_states” data for example again. There are two columns representing population (total_pop_10, total_pop_15) in the data. What if we want to make a graduated symbol map with population of each states?

First, let us learn how to alter the color and size of the map. To do so, we just need to set up the parameters in geom_sf(). The parameter and feature of geom_sf() are summarized in the table below.

Parameter	Features
size	identify the size of point, line, polyon
color	identify the color of point and line
fill	identify the color of polygon

The following code fills “us_states” polygon blue, and make the red and thicker border.

ggplot(us_states)+
  geom_sf(fill="blue", size=3, color="red")

Graduated symbol map does not assign a specific color to each geometry features. Thus, the partameter in geom_sf() should not be a fixed value or color. Instead, the parameters(size, color, fill) should be based on the value of attribute, that is Parameter=Attribute. The following code and map are the example of how to make a graduated symbol map based on the population of each US states.

ggplot(us_states)+
  geom_sf(aes(fill=total_pop_10))

Simply add the code aes(fill=total_pop_10) inside geom_sf(), it produces the graduated symbol map based on the value of population. Here, aes() defines the coordinate of the filled polygon. fill defines which attribute should be based on.

But the map is somehow weird. In common, area with high population should be filled with dark color as expected. The result above are the total opposite. Thus, we need other functions to re-define the color.

scale_fill_continuous() is appended after geom_sf(). It is used to define the color of continuous data for filling the polygon. scale_color_continuous() do the same thing, but coloring the point or line. In these functions, two major parameters should be set. One is low=, representing the color given to the lowest value in the data; the other is high=, representing the color given for highest value.

In “us_states” example, since the data is continuous and the objective is polygon, we should apply the function scale_fill_continuous(). The code and figure are shown as follows.

ggplot(us_states)+
  geom_sf(aes(fill=total_pop_15))+
  scale_fill_continuous(low="#D2E9FF", high="#003D79")

Categorized Symbol Map

Categorized map is similar to graduated map, but it is used on the discrete data. It is widely used to the geometric data which has been classified in advance. For instance, the region of states, the type of schools (elementary, junior high, senior high, university,…), the classification of highways (expressway, freeway,…), and so forth.

Take “us_states” data for example. A column named “REGION” represents the region of each states. We can fill the same region with consistent color. Similar to graduated map, the parameters(size, color, fill) should be based on the classification of attribute, that is Parameter=Attribute. The code and result are as the followings.

ggplot(us_states)+
  geom_sf(aes(fill=REGION))

It seems to be as simple as the graduated map. But what if we want to adjust the color manually?

scale_fill_manual() or scale_color_manual() is appended after geom_sf(). They are used to adjust the color of polygon and “point or line” respectively. In these functions, parameter values= should be set up. It is used to define the color of each classifications. The pseudocode is as follows. values=c("Classification_1"="Color_1", "Classification_2"="Color_2",...)

ggplot(us_states)+
  geom_sf(aes(fill=REGION))+
  scale_fill_manual(values=c("Norteast"="#FFC1E0", "Midwest"="#97CBFF", "South"="#A6FFA6", "West"="#FFE66F"))

Rule-based Symbol Map

Sometimes the category is determined by values from multiple columns. Under this condition, we may find it hard to use categorized symbol map. We should mutate a new column to determine the new classification of each tuple.

Take “us_states” data for example. We want to classify all the data into four categories, that is “high population-large area”, “high population-small area”, “low population-large area” and “low population-small area”. And the data is seperated simply by the average, namely, the one whose population is lower than the average is defined as “low population”.

In this case, we need mutate() and case_when() functions in dplyr package to help us sort out the data.

# Create a new data frame named us_states_type
us_states_type=mutate(us_states,TYPE=case_when(
  total_pop_15>mean(us_states$total_pop_15) & AREA>mean(us_states$AREA) ~ "high population-large area",
  total_pop_15>mean(us_states$total_pop_15) & AREA<mean(us_states$AREA) ~ "high population-small area",
  total_pop_15<mean(us_states$total_pop_15) & AREA>mean(us_states$AREA) ~ "low population-large area",
  total_pop_15<mean(us_states$total_pop_15) & AREA<mean(us_states$AREA) ~ "low population-small area"
))

# Output the newly data
us_states_type

## Simple feature collection with 49 features and 7 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -124.7042 ymin: 24.55868 xmax: -66.9824 ymax: 49.38436
## geographic CRS: NAD83
## First 10 features:
##    GEOID        NAME   REGION             AREA total_pop_10 total_pop_15
## 1     01     Alabama    South 133709.27 [km^2]      4712651      4830620
## 2     04     Arizona     West 295281.25 [km^2]      6246816      6641928
## 3     08    Colorado     West 269573.06 [km^2]      4887061      5278906
## 4     09 Connecticut Norteast  12976.59 [km^2]      3545837      3593222
## 5     12     Florida    South 151052.01 [km^2]     18511620     19645772
## 6     13     Georgia    South 152725.21 [km^2]      9468815     10006693
## 7     16       Idaho     West 216512.66 [km^2]      1526797      1616547
## 8     18     Indiana  Midwest  93648.40 [km^2]      6417398      6568645
## 9     20      Kansas  Midwest 213037.08 [km^2]      2809329      2892987
## 10    22   Louisiana    South 122345.76 [km^2]      4429940      4625253
##                          geometry                       TYPE
## 1  MULTIPOLYGON (((-88.20006 3...  low population-small area
## 2  MULTIPOLYGON (((-114.7196 3... high population-large area
## 3  MULTIPOLYGON (((-109.0501 4...  low population-large area
## 4  MULTIPOLYGON (((-73.48731 4...  low population-small area
## 5  MULTIPOLYGON (((-81.81169 2... high population-small area
## 6  MULTIPOLYGON (((-85.60516 3... high population-small area
## 7  MULTIPOLYGON (((-116.916 45...  low population-large area
## 8  MULTIPOLYGON (((-87.52404 4... high population-small area
## 9  MULTIPOLYGON (((-102.0517 4...  low population-large area
## 10 MULTIPOLYGON (((-92.01783 2...  low population-small area

*To understand more about basic operation of dplyr, please click here.

With the newly data derived, we can then apply the categorized symbol map. The code and result are as follows.

ggplot(us_states_type)+
  geom_sf(aes(fill=TYPE))

Basic Operations/Elements of Map

Map Overlay

It is simple by using ggplot2 to make an overlaying map. Shapefiles can be overlay by repeating geom_sf(data=). Let’s take New Zealand and its top 101 heighest points for example. Layers used in this example are provided in the package spData. nz is the polygon shapefile, and nz_height is the point. The code and result are as follows.

ggplot()+
  geom_sf(data=nz)+
  geom_sf(data=nz_height, color="red")

North Arrow and Scale

North arrow is a graphical representation indicating the direction of north in the map. The scale of a map is the ratio of a distance on the map to the corresponding distance on the ground. These two items are required when making the maps. In ggplot2, we can simply use the function annotation_north_arrow and annotation_scale in package ggspatial to add them. Again, we use map of New Zealand (nz) to display the result.

# Please install the package ggspatial first!
install.packages("ggspatial")
library(ggspatial)

ggplot()+
  geom_sf(data=nz)+
  annotation_scale(location="br")+
  annotation_north_arrow(location="tl")

Parameter location= in two functions represents that the place where signs are expected to be put. “br” means “bottom right, while”tl" means “top left”. If we do not prefer the location of default value, we can slghtly alter the location by using parameter pad_x and pad_y. Usage of other parameters please look up ggspatial documentation.

Linetype/Shape

We have introduced the parameter color, fill and size in the past section. But what if we want to change the type of line (e.g., dashed, dotted) or change the shape of point (e.g., diamond, triangle)? We need to add parameters in geom_sf(). Take map of New Zealand for instance again.

ggplot()+
  geom_sf(data=nz, color="blue", linetype="dashed")+       # make the border of nz dashed
  geom_sf(data=nz_height, color="red", size=2, shape=4)    # set the shape of nz_height to "X"

Linetype and points shapes commonly used in R are illustrated in the figure below.

Theme

Maps above show the background (longitude and latitude) and the axes. But sometimes we do not prefer to show the details about the geographic information, then we need to set those elements to be blanked. Or we might change the position of legend, size of legend, and so forth, in order to make the maps more elegant. In these situation, we need function theme to customize the style of map. Let’s optimize our map of New Zealand. Suppose we don’t want the background as well as the coordinate. Also, we need to change the position of legend on the bottom-right of the map.

ggplot()+
  geom_sf(data=nz, aes(fill=Island))+
  theme(axis.text=element_blank(),             # remove the value on axis
        axis.ticks=element_blank(),            # remove the ticks on axis
        axis.title=element_blank(),            # remove the title on axis (X,Y)
        legend.text=element_text(size=12),     # set up the size of text in legend to 12
        legend.title=element_text(size=15),    # set up the size of legend title to 15
        legend.position=c(0.87, 0.15),         # set up the position of legend to the bottom-right ((0,0) at the bottom-left)
        panel.background=element_blank())      # remove the background

Example above shows one of the common theme in making map. In fact, function theme provides lots of parameters for the users to set. To make an ideal map, it is suggested to read theme documentation and take advantage of it.

Other Mapping Tools

ggplot is a powerful package for making maps (in fact it is also helpful on plotting bar chart, line chart, histogram, and so forth) though, the major drawback is that raster objects are not natively supported by it. Hence, many users are inclined to use a specialized package for making maps, such as “tmap” and “leaflet”. In this notebook, we would briefly introduce how they work with a simple example.

tmap

Like ggplot2, tmap is based on the idea of a “grammar of graphics”. This involves a separation between the input data and the aesthetics (how data are visualised): each input dataset can be mapped by the visual variables. The basic building block is tm_shape(), which defines input data, raster and vector objects, followed by one or more layer elements such as tm_fill() and tm_dots(). Here, we can compare the command used in two different packages.

Package	tmap	ggplot2
Basic Block	`tm_shape()+`	`ggplot()+`
Polygon	`tm_polygons(col=X)`	`geom_sf(fill=X)`
Fill	`tm_fill(col=X)`	`geom_sf(color=NA, fill=X)`
Border	`tm_borders(col=X, lwd=Y, lty=Z)`	`geom_sf(color=X, size=Y, linetype=Z)`
Point	`tm_dots(col=X, size=Y shape=Z)`	`geom_sf(color=X, size=Y, shape=Z)`
Line	`tm_lines(col=X, lwd=Y, lty=Z)`	`geom_sf(color=X, size=Y, linetype=Z)`
Raster	`tm_raster()`	Does not support raster data
Aesthetic	`tm_dots(col="ATTRIBUTE")`	`geom_sf(aes(color=ATTRIBUTE))`

Plotting the map by using tmap, please install tmap package first. Then, let’s display the map of New Zealand.

# Please install the package tmap first!
install.packages("tmap")
library(tmap)

tm_shape(nz)+
  tm_polygons(col="Island")+
  tm_shape(nz)+
  tm_borders(col="grey", lty="dashed")+
  tm_shape(nz_height)+
  tm_dots(shape=4, col="red", size=1)

Also, tmap supports to plot the interactive map, which is provided by leaflet (discussed later). The plotting procedure is very easy, just change the mode from “plot” to “view” by function tmap_mode("view"). If we do not define the mode used, the default value would be tmap_mode("plot").

tmap_mode("view")

## tmap mode set to interactive viewing

tm_shape(nz)+
  tm_polygons(col="Island")+
  tm_shape(nz)+
  tm_borders(col="grey", lty="dashed")

leaflet

Leaflet is the most mature and widely used interactive mapping package in R. It provides a relatively low-level interface to the Leaflet JavaScript library and many of its arguments can be understood by reading the documentation of the original JavaScript library.

Leaflet maps are created with leaflet(), the result of which is a leaflet map object which can be piped to other leaflet functions. This allows multiple map layers and control settings to be added interactively.

# Please install the package leaflet first!
install.packages("leaflet")
library(leaflet)

The basic command of leaflet includes addTiles, addPolygons, addCircles. addTiles means the map used. If no arguments is given, OpenStreetMap is set by default. Or we can use addProviderTiles to connect other maps provided by different websites. addPolygons is used to paint the border color of polygon, and also fill in the color. addCircles is used for the point data. It draws a buffer with a given radius. Note that the unit of radius is “meters”.

Now, let’s display the interactive map of New Zealand by using leaflet.

leaflet(data=st_transform(nz, crs=4326))%>% 
  addTiles()%>%
  addPolygons(color="green", fillColor="gray")%>%
  addCircles(data=st_transform(nz_height, crs=4326), color="red", radius=1000)

* Please ignore the function st_transform() above, which is applied to transform the coordinate reference system, we would discuss it in the following chapter.

Several method to plot the visualized map have been introduced. Here, we spend a lot on understanding how to use package ggplot2 to plot the map, for it is the most common and universal tool to plot all kinds of graph in R. Other mapping tools are also helpful, particularly when it comes to raster shapefile. If the readers are interested in those packages, and want to learn more advanced skills please read chapter 8 of book “Geocomputation with R”. But please note that, ggplot2 is introduced more detailed in this notebook, while book “Geocomputation with R” may provide sketchy information on the topic.

PRACTICE
In this chapter, we introduce the simple example on the us_states and nz data which “spdata” package provides. Then, it’s time for us to make our own maps of Taipei.

But, how to connect to shapefile of Taipei map? There is no package like “us_states” for Taipei, and thus we need to download the files on government open data in advance. The list below shows commonly used websites, which are in rich of the shapefiles regarding transportation, economy, and administration.

For convenience, please download the data here directly. This zip file contains all data needed in the notebook. Please unzip the file first, and make sure that you place the file in the same directory as your R script file.

Here, we use Taipei village map (polygon), cycle path map (line), MRT routes (line) and stations (point) map, and YouBike map (point), to practice plotting maps.

First, let’s make a map regarding cycling system and village population. Please follow the instructions below:

Plot the village map, and fill in the color by population (PP) of each village. (Hint: scale_fill_continuous())
Add the cycling path. (Hint: Use alpha to adjust the transparency of the color, not necessary.)
Add the point of YouBike stations, and size of point is determined by the bikes capacity (BIKESCAPAC). (Hint: scale_size_continuous())
Finally, use function theme to optimize your map.

# Please put the files you download in the same directory of your RStudio environment!!
# read the shapefiles
taipei_village_map=read_sf("./data/taipei_map/taipei_village_map.shp")
taipei_cycle_path=read_sf("./data/taipei_cycle_path/taipei_cycle_path.shp")
taipei_youbike=read_sf("./data/taipei_youbike/taipei_youbike.shp")

The result is illustrated below.

With the maps above, we may come out some interesting questions: Is the design capacity of YouBike stations associated with the population of its location? Or, does the cycling path meet the supply of bike sharing system? Now, we can briefly observe the “phenomenon” from the visualized map, but of course, a concrete and pratical conclusion should be drawn after conducting spatial analysis.

Now let’s make a more complicated map, a Taipei district map with MRT routes and stations. In addition, the transfer stations should be labeled with text and use a different point shape compared to non-transfer station. Thus, the first task we need to do is to classify the transfer and non-transfer stations. It is recommended to use package dplyr to tidy the data.

After cleaning the data. Please follow the instructions below to plot the map:

Plot the village map, and fill in the color by district (TOWNCODE). (Hint: scale_fill_manual() or scale_fill_brewer())
Add the MRT route, and paint them with its specific color (i.e., 板南線 should be painted blue). (Hint: scale_color_manual())
Add the non-transfer MRT stations.
Add the transfer MRT stations. Pleaes note that you should set up a different shape compared to the former.
Finally, use function theme to optimize your map.

# Please put the files you download in the same directory of your RStudio environment!!
# read the shapefiles
taipei_village_map=read_sf("./data/taipei_map/taipei_village_map.shp")
taipei_mrt_route=read_sf("./data/taipei_mrt/taipei_mrt_route.shp")
taipei_mrt_station=read_sf("./data/taipei_mrt/taipei_mrt_station.shp")

# tidy data
transfer_station=filter(taipei_mrt_station, ...)
non_transfer_station=filter(taipei_mrt_station, ...)

The result is illustrated below.

## Warning: ggrepel: 3 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

In this example, you may find that making maps might be a bit difficult when lots of conditions are going to be considered. Thus, besides the source code provided in this notebook, we need to learn more on dplyr or other packages that could efficiently clean the data. Moreover, making maps elegant is another issue. Only with the study on this notebook is enough, but not advanced, we should practice more and search for more functions from websites (others’ work), to optimize the skills of map visualization.

Geographic Data

We have introduced the concept of WKT in the previous section, and then we may come up with a few questions: Can R understand WKT format? And how R produce the WKT format from its base code and sf package? In this section, we would first introduce “sf” class and learn the fundamental skill about how geographic data is built. Then, we would further discuss about the transformation of CRS and the method to export the shapefile. Spatial data combines the geometric information and its attributes, and thus, the study on how to join both of them together would be followed by. Please note that, we would use sf package in this section (also the first time to use), and make sure you have this package downloaded (install.packages("sf")).

Simple Feature Geometries (sfg)

Usually, we have the geographic data in shapefile format downloaded from the website. This allows us to plot the map or do geometric operations without defining the WKT. However, what if we are asked to define the geometric based on WKT format on our own? Simple feature geometries (sfg) helps us to create these objects.

The sfg class contains (multi-)point, (multi-)linestring, (multi-)polygon and geometry collection. A set of functions which start with st_ prefix and end with the name of geometry type create the sfg objects. For instance, st_point() creates the POINT of WKT format. sfg objects can be created from three base R types:

R Data Type	Code	sfg Object
numeric vector	`c( )`	`st_point`
matrix	`rbind(c( ), c( ))`	`st_linestring` `st_multipoint`
list	`list(rbind(c( ), c( )))`	`st_polygon` `st_multilinestring` `st_multipolygon`

Based on the table, we then use the functions of sfg objects to generate geometric information in WKT format.

Point

point_ex=st_point(c(2,3))
class(point_ex)  #identify the data type of sfg object

## [1] "XY"    "POINT" "sfg"

point_ex  #show the WKT format of geometric data

## POINT (2 3)

ggplot(point_ex)+
  geom_sf()

Linestring

linestring_ex=st_linestring(rbind(c(2,3), c(4,4), c(3,5), c(1,4)))
linestring_ex

## LINESTRING (2 3, 4 4, 3 5, 1 4)

ggplot(linestring_ex)+
  geom_sf()

Use rbind to create the matrix, combining all the vector cells (c( )). Note that the connection of line depends on the sequence of the vector cells in the matrix.

Polygon

polygon_ex=st_polygon(list(rbind(c(2,3), c(4,4), c(3,5), c(1,4), c(2,3))))
polygon_ex

## POLYGON ((2 3, 4 4, 3 5, 1 4, 2 3))

ggplot(polygon_ex)+
  geom_sf()

Note that the polygon should be closed, that is, the first vector in the list is the same as the last one.

Multipoint

point_ex=st_multipoint(rbind(c(2,3), c(4,4), c(3,5), c(1,4)))
point_ex

## MULTIPOINT ((2 3), (4 4), (3 5), (1 4))

ggplot(point_ex)+
  geom_sf()

The R data type in st_multipoint is the same as st_linestring, which both of them are matrices.

Multilinestring

linestring_ex=st_multilinestring(list(rbind(c(2,3), c(4,4), c(3,5)), rbind(c(2,5), c(1,2))))
linestring_ex

## MULTILINESTRING ((2 3, 4 4, 3 5), (2 5, 1 2))

ggplot(linestring_ex)+
  geom_sf()

Multipolygon

polygon_ex=st_multipolygon(list(list(rbind(c(2,3), c(4,4), c(3,5), c(1,4), c(2,3))),
                      list(rbind(c(1,5), c(2,5), c(3,6), c(1,5)))))
polygon_ex

## MULTIPOLYGON (((2 3, 4 4, 3 5, 1 4, 2 3)), ((1 5, 2 5, 3 6, 1 5)))

ggplot(polygon_ex)+
  geom_sf()

A polygon is created by a list, and thus, multipolygon is generated by a list of lists.

Simple Feature Columns (sfc)

The sfc object is a list of sfg objects, which additionally contains information about the coordinate system (CRS), geometry type and the border of the whole geometry set (listing the minimum and maximum of x-axis and y-axis). sfc object represents the geometry column in sf data frames.

point_1=st_point(c(2,3))
point_2=st_point(c(4,2))
point_3=st_point(c(1,1))
point_sfc=st_sfc(point_1, point_2, point_3, crs=4326)  #set the CRS to be 4326 (WGS 84)
point_sfc

## Geometry set for 3 features 
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 1 ymin: 1 xmax: 4 ymax: 3
## geographic CRS: WGS 84

## POINT (2 3)

## POINT (4 2)

## POINT (1 1)

ggplot(point_sfc)+
  geom_sf()

In this example, we can find that the geometry set consists of three points. The border of the geometry set is listed in the bbox (xmin: 1 ymin: 1 xmax: 4 ymax: 3). CRS is set to be EPSG:4326 (i.e.,latitude–longitude system). And please note the axis of the graph. It is labeled as the latitude and longitude format, instead of the integer value labeled in the sfg objects.

If we want to check the CRS and border of the geometry, we can use the functions as follows.

# Check CRS
st_crs(point_sfc)

## Coordinate Reference System:
##   User input: EPSG:4326 
##   wkt:
## GEOGCRS["WGS 84",
##     DATUM["World Geodetic System 1984",
##         ELLIPSOID["WGS 84",6378137,298.257223563,
##             LENGTHUNIT["metre",1]]],
##     PRIMEM["Greenwich",0,
##         ANGLEUNIT["degree",0.0174532925199433]],
##     CS[ellipsoidal,2],
##         AXIS["geodetic latitude (Lat)",north,
##             ORDER[1],
##             ANGLEUNIT["degree",0.0174532925199433]],
##         AXIS["geodetic longitude (Lon)",east,
##             ORDER[2],
##             ANGLEUNIT["degree",0.0174532925199433]],
##     USAGE[
##         SCOPE["unknown"],
##         AREA["World"],
##         BBOX[-90,-180,90,180]],
##     ID["EPSG",4326]]

# Check the border
st_bbox(point_sfc)

## xmin ymin xmax ymax 
##    1    1    4    3

Simple Feature (sf)

The sf object does not only contain the geometry information, but sets of features with attributes. The basic component of its geometry data is sfg, which is recorded in WKT format. Remember the us_states data in package spData? Let’s take a closer look on it. We know that the class of us_states data is not only sf but data frame. A tuple, namely a row, is a simple feature. A simple feature must contain feature attributes and a simple feature geometries (sfg) object, which defines the location of the tuple. The list-column with the simple feature geometries (sfg) for each tuple is then called simple feature columns (sfc).

Again, in the figure above, we can find that the geometry type is “multipolygon”. The border of the us_states data is listed in bbox. CRS used in this data is NAD83 (EPSG:4269).

How do we create the simple feature on our own? It is simple, just create the attribute in a data frame, and then use st_sf() function to combine the attributes and the simple feature columns (sfc). There are two parameters should be given in st_sf(). One is data frame, which provides attribute of the features. The other one is geometry, which defines the geometry data that should be based on. If we do not define the CRS in function st_sfc(), then it is required to set CRS in function st_sf().

Example below shows the code to make the simple features of NCTU and NTHU.

uni_geom=st_sfc(
  st_point(c(120.999645, 24.789071)),  # define the sfg of first school (NCTU)
  st_point(c(120.996665, 24.796442))   # define the sfg of second school (NTHU)
  ,crs=4326)  # set the CRS

uni=data.frame(
  name=c("NCTU","NTHU"),               # name of schools
  type=c("university","university"),   # type of schools
  phone=c("03-5714769","03-5712121"),  # phone of schools
  geometry=uni_geom                    # put the sfc into the data frame
)

class(uni)  # the class of "uni" is remain a data frame

## [1] "data.frame"

uni=st_sf(uni, geometry=uni$geometry)  # define the sfc as geometry
# uni=st_sf(uni, geometry=uni$geometry, crs=4326) 

class(uni)  # the class of "uni" changes to data frame as well as sf

## [1] "sf"         "data.frame"

To check whether the class of data we produce is “sf”, we can directly plot the map to confirm it. Let’s recall the tmap function to plot the interactive map.

tmap_mode("view")

## tmap mode set to interactive viewing

tm_shape(uni)+
  tm_dots(size=0.2)

Convert Text File to Shapefile

In fact, most of the spatial data is stored as shapefile format, which is simple for us to read the files, plot the map and conduct spatial operations in R. However, some spatial data is saved as CSV or XML files. These are not standard geographic data format, and hence, we should convert them in advance. Most of the geometric data they record is under WKT format with pure text (e.g., directly record “POINT (3 2)”). But the function st_sfc() we have learned in the previous section cannot parse the text, whereas it can only identify the vector, matrix, list data type written in R base code. Here, we would introduce a similar function called st_as_sfc, which can help us to read the pure text of WKT.

st_as_sfc("POINT(3 2)")

## Geometry set for 1 feature 
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 3 ymin: 2 xmax: 3 ymax: 2
## CRS:            NA

## POINT (3 2)

Also, a function named st_as_text() can retrieve the WKT pure text from the R base data type.

point_ex=st_point(c(3,2))
st_as_text(point_ex)

## [1] "POINT (3 2)"

Knowing the skills of st_as_sfc, let’s take two example data downloaded from government open data, to practice this skill. Data needed has been downloaded previously. You placed the file in the same directory as the R script file. Click here to re-download the file if it is lost.

Let’s use the CSV data named “hsinchu_bus_route” in “csv_files” to produce the bus route map of Hsinchu. There are four columns in this data, including “geometry” columns, which records in WKT format. Use function read_csv() to import the CSV data first. And apply the function st_as_sfc() to convert the column “geometry” to simple feature columns (sfc). Finally, use function st_sf() to give the data frame with sf class and define its CRS (EPSG::4326).

# read CSV files
BusShape=read.csv("./data/csv_files/hsinchu_bus_route.csv")

# convert the column "geometry" to sfc
BusShape$Geometry=st_as_sfc(BusShape$Geometry)

# give the data frame with sf class
BusShape=st_sf(BusShape, crs=4326)

Then, we can use function class() and plot the map to confirm whether we successfully convert CSV file to shapefile.

class(BusShape)

## [1] "sf"         "data.frame"

tm_shape(BusShape)+
  tm_lines()

In the following example, we would plot the map regarding scenic spots in Hsinchu. Please import the CSV data named “hsinchu_scenicSpot” in “csv_files” first. There are four columns in the data: ID, name, latitude and longitude. We need to add a new column (here we name it “geometry”) to paste longitude (X) and latitude (Y) together, and then create the WKT format as POINT(X Y). Use st_as_sfc to convert the column “geometry” to sfc. Last, apply st_sf() to give the data frame with sf class and define its CRS.

# read CSV files
ScenicSpot=read.csv("./data/csv_files/hsinchu_scenicSpot.csv")

# paste the latitude and longitude together, and create WKT text
ScenicSpot=mutate(ScenicSpot, geometry=paste("POINT(", PositionLon, " ", PositionLat, ")"))

# convert the column "geometry" to sfc
ScenicSpot$geometry=st_as_sfc(ScenicSpot$geometry)

# give the data frame with sf class
ScenicSpot=st_sf(ScenicSpot, crs=4326)

Use function class() and plot the map to confirm whether we successfully convert CSV file to shapefile.

class(ScenicSpot)

## [1] "sf"         "data.frame"

tm_shape(ScenicSpot)+
  tm_dots(col="red")

Reproject Geographic Data

After collecting the shapefile data, we may find that their CRS are not consistent, which may cause error in plotting maps as well as in conducting spatial analysis. Hence, it is required to covert them to a same CRS. The inconsistency of CRS often results from the different origin or purpose of the map. In Taiwan, there are two major type of CRS. One is EPSG:4326, which is recorded in latitude and longitude. The other one is EPSG:3826, which is specifically used for the island. However, since shapefiles are managed and released by different sectors (private sector and gonvernment), CRS has not been unified. Function st_transform() can solve this problem by resetting CRS.

We have convert “ScenicSpot” from CSV file to shapefile in EPSG:4326 in the previous section. In the example below, we would further transform CRS to EPSG:3826.

# check the original CRS
st_crs(ScenicSpot)$epsg

## [1] 4326

# transform CRS
ScenicSpot=st_transform(ScenicSpot, crs=3826)

# check the new CRS
st_crs(ScenicSpot)$epsg

## [1] 3826

Let’s compare the data frame of the original one and the new one in the figure below. We can find that the geometry column has been converted based on the CRS.

Geographic Data Export

We know how to import the shapefile in the previous practice (by function read_sf()). But how to export the shapefile we make in R? By using function st_write, we can export sf data. Two essential parameters should be filled in write_sf(): obj means the object we have produced in R; dsn means the directory where we want to save shapefile (the directory should include the name of it and file extension *.shp). Sometimes we have Chinese characters in our shapefile, we should additionally add the parameter layer_options="ENCODING=UTF-8" to ensure that characters would not be garbled.

Let’s again use “ScenicSpot” in the following example. Export the shapefile to the same directory of your R script. You can definitely use absolute path, or simply use relative path to get the same result.

# relative path
write_sf(ScenicSpot, "./ScenicSpot.shp", layer_options="ENCODING=UTF-8")

## Warning in abbreviate_shapefile_names(obj): Field names abbreviated for ESRI
## Shapefile driver

Successfully export the shapefile now? If yes, please check the directory you export to. You may find five files. The file extension includes *.shp, *.shx, *.dbf, *.prj, and *.cpg. The first four files have been introduced in the previous chapter. The last one; however, is in fact not necessary. It is used to specify the code page for identifying the character encoding to be used.

Geographic Data Format

Up to now, we learn how to import and export the shapefile. But is it true that R can only cope with the shapefile format? Apparently not! There are in fact several formats we have not discussed in the past section. Here, we would like to introduce some important geographic data formats that are commonly used in spatial analysis.

Shapefile is the most popular vector data format. However, it has some limitations listed below:

It is a multi-file format, which consists of at least three files.
It only supports 255 columns and the file size is limited to 2 GB.
It is unable to distinguish between a polygon and a multipolygon.

In spite of the limitations, we are still inclined to use shapefile, and lots of data provided by government use it. Aside from shapefile, Geopackage is a format for exchanging geospatial information. It defines the rules on how to store geospatial information in a tiny SQLite container. Hence, GeoPackage is a lightweight spatial database container, which allows the storage of vector and raster data. KML (Keyhole Markup Language) is an XML notation for expressing geographic annotation and visualization within two-dimensional maps and three-dimensional Earth browsers. It was developed for use with Google Earth. GeoJSON is an open standard format designed for representing simple geographical features, along with their non-spatial attributes. It is based on the JSON format. GPX (GPS Exchange Format) is an XML schema designed as a common GPS data format for software applications. It can be used to describe points, tracks, and routes.

If we want to export the geographic data in those formats, just simply change the file extension in the directory. Take “ScenicSpot” we produce again, and generate the spatial data in GeoPackage format.

# relative path
write_sf(ScenicSpot, "./ScenicSpot.gpkg")

# import the data and check if it is successfully generated
read_sf("./ScenicSpot.gpkg")

## Simple feature collection with 94 features and 4 fields
## geometry type:  POINT
## dimension:      XY
## bbox:           xmin: 239280.7 ymin: 2737355 xmax: 251527.1 ymax: 2749333
## projected CRS:  TWD97 / TM2 zone 121
## # A tibble: 94 x 5
##    ID            Name              PositionLat PositionLon                  geom
##    <chr>         <chr>                   <dbl>       <dbl>           <POINT [m]>
##  1 C1_376580000~ 新竹公園(中山公園)~        24.8        121.      (247688 2743764)
##  2 C1_376580000~ 進士第                   24.8        121.    (246495.3 2744675)
##  3 C1_376580000~ 楊氏節孝坊               24.8        121.    (246417.3 2743970)
##  4 C1_376580000~ 香山火車站               24.8        121.    (241294.3 2739544)
##  5 C1_376580000~ 于飛島(鳥島)             24.8        121.    (247149.1 2740851)
##  6 C1_376580000~ 十草原(櫻花草原)         24.8        121.    (243716.3 2740007)
##  7 C1_376580000~ 新竹都城隍廟             24.8        121.    (246528.1 2744113)
##  8 C1_376580000~ 鄭用錫墓                 24.8        121.    (247062.5 2742288)
##  9 C1_376580000~ 消防博物館               24.8        121.    (246781.4 2744342)
## 10 C1_376580000~ 新竹神社殘跡(大陸人民處理中心)~        24.8        121.    (245226.6 2742840)
## # ... with 84 more rows

Join Attribute and Spatial Data

In reality, most of the attribute features we want to analyze and spatial feature are not in the same file. Usually, the spatial features are stored as shapefile format, while the attribute features are provided in a CSV or spreadsheet file. Take transportation data for instance. If we want to plot the map showing the number of passengers getting on and off at each MRT stations, we need spatial features recording the location of each stations, and attribute features recording the ridership counts. In this example, the geometry data (shapefile) can be downloaded from GIS-T website, while the operating data (spreadsheet) can be derived from Taipei Rapid Transit Corporation. Apparently, the data management agency and the format of two data are not consistent. Hence, combining data from different sources would be a vital and common task before the spatial analysis.

A primary key is required among the data to be joined. Primary key is a specific choice of a minimal set of attributes (columns) that uniquely specify a tuple (row) in a table. In other words, primary key is a common column that can help us link the spatial features to attribute features. In the MRT example we have mentioned above, the primary key between spatial and attribute data may be “MRT station name” or “MRT station ID”. The procedure of join is illustrated in the figure below.

Now, let’s join the data provided in package spData. “world” is simple feature data that stores the geometry of the shape of each country. “coffee_data” is a data frame which records the production of coffee in every nations each year. We want to join two data together, and perform the map about the world coffee production by country. Use function left_join to accomplish our goal. In left_join, the first data placed would be on the left-hand side of the new data frame; while the latter one would be on the right-hand side. Parameter by="" should be given if the column is specifically to be joined, or the function would join the data based on all the columns that have common name by default. The code and result is shown below.

# join the spatial and attribute features
world_coffee=left_join(world, coffee_data, by="name_long")

# print the result
world_coffee[,c("name_long","continent","coffee_production_2016","coffee_production_2017")]

## Simple feature collection with 177 features and 4 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: -180 ymin: -90 xmax: 180 ymax: 83.64513
## geographic CRS: WGS 84
## # A tibble: 177 x 5
##    name_long continent coffee_producti~ coffee_producti~
##    <chr>     <chr>                <int>            <int>
##  1 Fiji      Oceania                 NA               NA
##  2 Tanzania  Africa                  81               66
##  3 Western ~ Africa                  NA               NA
##  4 Canada    North Am~               NA               NA
##  5 United S~ North Am~               NA               NA
##  6 Kazakhst~ Asia                    NA               NA
##  7 Uzbekist~ Asia                    NA               NA
##  8 Papua Ne~ Oceania                114               74
##  9 Indonesia Asia                   742              360
## 10 Argentina South Am~               NA               NA
## # ... with 167 more rows, and 1 more variable: geom <MULTIPOLYGON [arc_degree]>

# plot the map (use 2017 data)
ggplot(world)+
  geom_sf(fill=NA)+
  geom_sf(data=filter(world_coffee, !is.na(coffee_production_2017)), aes(fill=coffee_production_2017))+
  scale_fill_continuous(low="#F2E6E6", high="#984B4B")+
  theme(panel.background=element_blank())