Introduction to R

Introduction

Note

In Week 1, you learned the big GIS building blocks: layers, the difference between raster and vector data, and why coordinate reference systems (CRS) and projections matter for making maps and doing analysis.

In Week 2, you added remote sensing ideas—especially that most satellite products we use in GIS arrive as rasters and come with important “resolution” choices and tradeoffs.

This tutorial is where we start doing the “cookie-dough” thing: chaining small operations into a full workflow (read data → inspect → transform → analyze → export).

In this tutorial, have a gentle introduction to coding in R and help you get familiar with how the system works.

It is important to remember that writing code in R is very tricky - every capital letter and punctuation mark needs to be in the right place, or the code won’t work.

Luckily, the easiest way to learn to code in R is to look at examples and then to switch out parts of those examples to see what happens. You can use that strategy to learn how the R language works for the tasks that you are trying to accomplish. That’s the strategy that I’m going to be using to teach you in this class.

The interesting thing about R is that pretty much everything you need can be written in R scripts. The R script is a series of commands for R that it follows to conduct analysis, build maps, and so on.

Scripts are really useful because they are also a record of every step of your analysis. That means that if you wind up making a mistake that you don’t notice until much later, all you need to do is find the source of the problem in your script, fix it, and rerun your code from there. That’s a lot better than having to redo all your work from scratch!

I’m using a special format of R scripts called a Quarto Document, which allows you to combine the script with HTML text. When I generate the HTML, it runs all the code blocks in the document in order and puts the results below. That means you can check what you’re getting when you run the code against what I got when I ran it.

Quarto Documents are handy for making class materials, but they can also be used to make general websites or even presentations. We’ll see how to use those in this class.

The Objectives

  • Become familiar with the basics of the R language
  • Become familiar with editing and running R code in RStudio
  • Learn how to install and use basic geospatial package in R
  • Learn how to plot raster objects
  • Learn how to conduct map algebra

The Data

Raster reminder (Week 1 + Week 2)

A GeoTIFF like GlobalPM252020.tif is a raster: a grid of pixels where each cell has a value and a known spatial location. In Week 1, we contrasted rasters with vectors (points/lines/polygons). In Week 2, you saw that most remotely sensed datasets (except things like LiDAR point clouds) come to us in raster form.

Why you should care right away: every raster comes with metadata like CRS, extent, and resolution. Those are the first things you should check before you trust a map or combine layers.

Some of the data for this tutorial will be downloaded directly using R, but there are two datasets that we will reading in from a file:

  • “GlobalPM252020.tif” is a Geotiff file (an image file with spatial information attached) showing estimated average global atmospheric concentrations of PM 2.5, sourced from the Socioeconomic Data and Applications Center (SEDAC)
  • gpw2015.tif, also a Geotiff file, which contains local population estimates for 2015 from the Gridded Population of the World dataset, also sourced from SEDAC

These Geotiffs are avaitutorialle for download in this zip file.

Tutorial report deliverable 1

In 3–5 sentences, answer:

  1. What does each pixel value represent in the PM2.5 raster vs. the population raster?
  2. What might go wrong if you combine rasters that don’t share the same CRS, extent, and resolution?

First Line of Code

Let’s start with your first line of code. You can copy and paste the following into your script editor in RStudio to run it:

2 + 2 #Any part of a line in R that starts with a # will be ignored with the code is run.
[1] 4
#We use lines (or parts of lines) that start with # to make notes in the code.

Once you have the code pasted in your script editor, you highlight the code with the cursor and then click “Run”, which is in the upper right-hand corner of this script editor window, to run it. Try running the code above now.

Congratulations! That’s your first code. This is pretty basic. Among many other things, R works as a simple calculator. It also obeys standard order of operations rules. Run the lines below for examples:

2 + (3 * 2)
[1] 8
(2 + 3) * 2
[1] 10
4 / 2
[1] 2
(4 + 2) / 2
[1] 3
4 + (2 / 2)
[1] 5

Okay, so you get the idea!

Object-Oriented Programming in R

R is an object-oriented programming language. What that means is that R works by creating snippets of information in the computer memory that we call “objects” and then doing things to those objects.

I often think about the data that I am working on almost like a physical structure that I’m using R to build and manipulate. Another helpful metaphor might be to think of objects as little digital creatures that you use R to communicate with.

However it feels easiest to think about it, let’s introduce your first object by running the following line:

x <- 5

The <- sign in that line is the assignment function. In R, functions are pieces of code that do things to objects. In those terms, all the mathematical symbols we used above would count as functions.

The assignment function takes whatever is on the right-hand side of the function and creates an object according to the instructions on the left-hand side that contains the thing on the right-hand side.

In this case, we created an object we called x and gave it the content of the number 5. We can then type in the name of the object to ask R to print back its contents:

x
[1] 5

Importantly, R is case sensitive. That means it cares whether or not letters are capitalized. Try running this:

X
Error: object 'X' not found

R should have given you this message: “Error: object ‘X’ not found”. When R says an object is not found, it’s telling you that something that you think is read into the environment (that is, the collection of objects that R has in memory at the moment) isn’t actually there.

If you look in the Environment tab in RStudio, which is probably to the right of your script editing window, you can see a list of all the objects that you have created in the environment. You should see that there is an object called “x” there. Notice that “x” isn’t capitalized. That’s why R can’t find it. R considers x and X to be completely different.

GIS connection

In R, a layer is an object: a container that holds geometry + attributes (vector) or a pixel grid + metadata (raster). When we do GIS in R, we mostly do two things:

  • Create objects (read data, transform it, make new layers)
  • Pass objects through functions (buffer, clip, raster algebra, summaries, exports)

You can pack pretty much anything that R can hold in memory into an object. We’ll be putting whole layers of spatial data into objects soon! For now, let me just show you a bit of how this works:

x <- c(1, 5, 10)

In this line, we used the c, or concatenate function, which takes a list of items separated by commas and puts them into a single object called a vector (not to be confused with the idea of a vector layer in GIS), which is just a line of data:

x
[1]  1  5 10

The parentheses and commas are really, really important in R. Commas separate different inputs into a function. We call these separate inputs arguments. The parentheses tell R where the arguments for each function start and end.

See what happens if we mess this up:

x <- c(1 5, 10)
x <- c(1, 5) 10
Error in parse(text = input): <text>:1:10: unexpected numeric constant
1: x <- c(1 5
             ^

Both of those situations cause R to say Error: unexpected . . .. When you see an error that starts that way, it’s R’s way of telling you that there’s something wrong with the way you typed your code and that things are out of order.

RStudio actually helps you keep track of coding problems. In the first of the two lines above, for example, you should see a red underline under 5. That’s RStudio telling you that something fishy is going on with your code there (in this case, you’re missing the comma after 1).

Okay, let’s get our x back in working order:

x <- c(1, 2, 3, 4, 5, 10)

Basic Vector Operations

Once you’ve created a vector, you can use functions to perform operations on all the elements of the vector at once:

x <- c(1, 5, 10)
x + 2
[1]  3  7 12
x * 2
[1]  2 10 20
x ^ 2
[1]   1  25 100
x / 2
[1] 0.5 2.5 5.0

You can also use statistical functions on the whole vector. Here are some common handy examples:

min(x)
[1] 1
mean(x)
[1] 5.333333
median(x)
[1] 5
max(x)
[1] 10
sd(x) # Standard Deviation
[1] 4.50925
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   5.000   5.333   7.500  10.000 

The Packages

Packages = toolboxes

In Week 1 we talked about GIS as software + data + hardware. packages are basically add-on toolboxes that turn base R into a GIS:

  • terra ≈ raster toolbox (read/write rasters, map algebra, raster geoprocessing)
  • sf ≈ vector toolbox (points/lines/polygons; joins; overlay; buffers)
  • rnaturalearth ≈ “data source” (a convenient way to pull reference layers)
  • tidyverse ≈ data wrangling + pipelines (the “chain several steps together” habit you need for real GIS workflows).

To turn R into a powerful GIS engine, we need to add some packages Packages in R add additional functionality to your code, similar to plugins in QGIS. The code below will help you install and load the packages we’ll be using today:

require(rnaturalearth) # Allows you to import vector layers from Natural Earth
Loading required package: rnaturalearth
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'rnaturalearth'
require(tidyverse) # Useful data cleaning, management, and plotting functions
Loading required package: tidyverse
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.1     ✔ stringr   1.6.0
✔ ggplot2   4.0.0     ✔ tibble    3.3.0
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.2.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(sf) # Functions for vector data reading, writing, and geoprocessing
Loading required package: sf
Warning: package 'sf' was built under R version 4.5.2
Linking to GEOS 3.13.1, GDAL 3.11.4, PROJ 9.7.0; sf_use_s2() is TRUE
require(ggmap) # Functions for making cool maps
Loading required package: ggmap
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'ggmap'
require(terra) # Functions for raster input, output, and geoprocessing
Loading required package: terra
Warning: package 'terra' was built under R version 4.5.2
terra 1.8.80

Attaching package: 'terra'

The following object is masked from 'package:tidyr':

    extract

Setting the Working Directory

R reads in data using information on the paths to files on your computer. It’s helpful to set the working directory, which gives R a default folder for reading and writing data for the session.

You can set the working directory to the folder where your script and data are located using the below code. But how do you find the PATH_TO_FILE? What I recommend for beginners is to save your R script and all the data you are using into a single folder somewhere on your computer. Then, if you are in RStudio, you can click Session -> Set Working Directory -> To Source File Location. RStudio will automatically generate a line of code like the one below. Then you can copy and paste that code into your script:

setwd("/cloud/project/Intro_Geospatial_in_R_Data")
Error in setwd("/cloud/project/Intro_Geospatial_in_R_Data"): cannot change working directory

Reading and Plotting Raster Data

Once you have set your working directory, you can read in the PM2.5 raster layer:

pm25 <- rast("GlobalPM252020.tif")
plot(pm25)

What to check every time you load a raster

Before doing anything “analysis-y,” inspect:

  • CRS: what coordinate system/projection is this raster in?
  • Resolution: how big is a pixel on the ground (and does it vary with latitude)?
  • Extent: what geographic area is covered?

This mirrors Week 1’s CRS/projection discussion: distance and area calculations usually require projected data, and misaligned layers can produce nonsense.

Tutorial report deliverable 2 (raster metadata)

Run these and report the results in your tutorial write-up (include units where possible):

crs(pm25)
[1] "GEOGCRS[\"WGS 84 (CRS84)\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    USAGE[\n        SCOPE[\"unknown\"],\n        AREA[\"World\"],\n        BBOX[-90,-180,90,180]],\n    ID[\"OGC\",\"CRS84\"]]"
res(pm25)
[1] 0.1 0.1
ext(pm25)
SpatExtent : -179.999996947, 179.999996947, -55.000000762, 67.999996947 (xmin, xmax, ymin, ymax)
global(pm25, fun = c("min","mean","max"), na.rm = TRUE)
               min     mean   max
GlobalPM252020 0.8 18.00101 254.6

Then interpret: What CRS are the data in? What are the implications of the CRS for how you should work with and interpret the data?

Performing Map Algebra

Map algebra is very straightforward in the terra package, which is now the main workhorse package for working with raster data in R. terra allows you to treat the entire raster as if it were just another value, so you can perform operations on each pixel in a raster by typing some simple code.

For example, we can use a conditional statement to create a new raster showing only the parts of the world that fail to meet the US EPA’s maximum safe standard for PM 2.5 annual average levels (12 micrograms per cubic meter):

abvepa <- pm25 >= 12
plot(abvepa)

Map algebra as geoprocessing

This is the raster version of what you saw in the “cookie dough” slides: geoprocessing tools are basically algorithms that manipulate spatial data, and raster tools work by calculating new pixel values from old pixel values.

A threshold like pm25 >= 12 creates a logical raster (TRUE/FALSE). Under the hood that becomes a 1/0 raster (where 1 - true and 0 = false) — handy for further calculations.

Tutorial report deliverable 3 (interpret the threshold)

  1. What does 12 represent in the code above, and what are the units?
  2. If you changed the threshold to 5 or 35, how might the real-world policy or health implications change?

We can also rescale the PM2.5 raster in terms of the percentage of the EPA’s maximum:

epapct <- (pm25 / 12) * 100
plot(epapct)

Working with Population Data

Now we’re going to use our population raster to estimate exposure to unsafe levels of PM 2.5. To start, we need to read in the gridded population layer:

pop <- rast("gpw2015.tif")
plot(pop)

You’ll notice that it’s difficult to see much here – that’s because global population is very unevenly distributed, so some pixels in the raster have huge numbers of people, while many have very low values. One way to address this would be to covert these into orders of magnitude by computing a logarithm.

We can chain together series of functions where the output of one function is used as the input of another. This is accomplished with the pipeline operator: |>. If you place the pipeline after a function, it will take that function’s output and feed it in as the first argument in the following function.

Using pipelines is the preferred way to write R code with sequences of operations because it makes the step clearer, almost like making your own flowchart in the code.

pop |> #We start with the object that has the data we will be using 
  log(base=10) |> #Then we do one operation
  plot() #And then another

|---------|---------|---------|---------|
=========================================
                                          

Ensuring Matching CRS and Resolution

Why CRS + resolution matching matters

To combine rasters cell-by-cell, you need them on the same grid. That includes:

  • same CRS (so “where” each cell is located is comparable),
  • same resolution (same pixel size),
  • same extent (covering the same area), and
  • aligned origins (cells line up).

If these don’t match, R will throw an error—or worse, silently resample and you’ll get results you can’t interpret.

This is the raster analogue of vector overlay tools like intersect or clip from the “cookie dough” lecture: the computer can only overlay raster layers if the pixels match up.

Because we’ll be making comparisons across rasters, we need to make sure that the two raster datasets are in the same coordinate reference system (CRS).

The pm25 raster is in latitude and longitude with a WGS 1984 datum, and we’ll set that first. We can look up WGS 1984 at (spatialreference.org)[spatialreference.org], and we can see that it has an EPSG code of 4326.

We can use the crs() function to set or print out CRS information for a terra raster object, like this:

crs(pm25) <- "epsg:4326"
crs(pm25)
[1] "GEOGCRS[\"WGS 84\",\n    ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n        MEMBER[\"World Geodetic System 1984 (Transit)\"],\n        MEMBER[\"World Geodetic System 1984 (G730)\"],\n        MEMBER[\"World Geodetic System 1984 (G873)\"],\n        MEMBER[\"World Geodetic System 1984 (G1150)\"],\n        MEMBER[\"World Geodetic System 1984 (G1674)\"],\n        MEMBER[\"World Geodetic System 1984 (G1762)\"],\n        MEMBER[\"World Geodetic System 1984 (G2139)\"],\n        MEMBER[\"World Geodetic System 1984 (G2296)\"],\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]],\n        ENSEMBLEACCURACY[2.0]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    USAGE[\n        SCOPE[\"Horizontal component of 3D system.\"],\n        AREA[\"World.\"],\n        BBOX[-90,-180,90,180]],\n    ID[\"EPSG\",4326]]"

Now we need to check that the pop raster is in the same CRS:

crs(pop)
[1] "GEOGCRS[\"WGS 84\",\n    ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n        MEMBER[\"World Geodetic System 1984 (Transit)\"],\n        MEMBER[\"World Geodetic System 1984 (G730)\"],\n        MEMBER[\"World Geodetic System 1984 (G873)\"],\n        MEMBER[\"World Geodetic System 1984 (G1150)\"],\n        MEMBER[\"World Geodetic System 1984 (G1674)\"],\n        MEMBER[\"World Geodetic System 1984 (G1762)\"],\n        MEMBER[\"World Geodetic System 1984 (G2139)\"],\n        MEMBER[\"World Geodetic System 1984 (G2296)\"],\n        ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n            LENGTHUNIT[\"metre\",1]],\n        ENSEMBLEACCURACY[2.0]],\n    PRIMEM[\"Greenwich\",0,\n        ANGLEUNIT[\"degree\",0.0174532925199433]],\n    CS[ellipsoidal,2],\n        AXIS[\"geodetic latitude (Lat)\",north,\n            ORDER[1],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n        AXIS[\"geodetic longitude (Lon)\",east,\n            ORDER[2],\n            ANGLEUNIT[\"degree\",0.0174532925199433]],\n    USAGE[\n        SCOPE[\"Horizontal component of 3D system.\"],\n        AREA[\"World.\"],\n        BBOX[-90,-180,90,180]],\n    ID[\"EPSG\",4326]]"

Great! They are both in WGS 1984, so we can safely continue with our analysis. Resample the raster to ensure matching resolution:

Analyzing Population Exposure

Remember that we had an object above called abvepa, which covered all the areas where average annual PM 2.5 levels were above US EPA standards? Well, that is a binary (1/0) raster, there the 1 values are the places that above the EPA limits.

We can therefore use some map algebra to make a new raster that shows population in places above the EPA’s safe limits of PM 2.5, again converting population to powers of ten:

popabvepa <- pop * abvepa
Error: [*] extents do not match
plot(popabvepa)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'popabvepa' not found

Now you should have come across an error. What this error is telling you is that, despite that the rasters are in the same projection, they don’t cover exactly the same area (i.e., their extents don’t match), which they have to if we are to conduct map algebra with both of them.

We can have a look at the extent of each raster using the ext() function:

ext(abvepa)
SpatExtent : -179.999996947, 179.999996947, -55.000000762, 67.999996947 (xmin, xmax, ymin, ymax)
ext(pop)
SpatExtent : -180, 180, -90, 89.9999999999999 (xmin, xmax, ymin, ymax)

When you compare the two rasters, is clear that abvepa covers less of the planet’s surface than the pop raster does.

That means we’ll need to crop() (that’s what terra calls clipping) the pop raster to match the abvepa raster before continuing.

pop_crop <- crop(pop, abvepa) #You start with the raster you are cropping, followed
#by the raster you are using to crop it.

Now we should be able to do our map algebra:

popabvepa <- pop_crop * abvepa
Error: [*] number of rows and/or columns do not match
plot(popabvepa)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'popabvepa' not found

Uh-oh! It looks like we’ve got another problem. The issue here is that to perform map algebra, rasters must have the same resolution, and overlapping pixels. That means that we need to convert one of the rasters to match the other.

To change the resolution of a raster, we use the resample() function, which applies your choice of several algorithms for estimating the value of one raster in the pixels of another. This is basically how we get rasters to talk to each other.

pop_resample <- resample(pop_crop, abvepa)

Now we can do our calculation:

pop_abvepa <- log(pop_resample, base=10) * abvepa
plot(pop_abvepa)

Now, we did this all step-by-step just so you could see how it works, but we could also have done all that as a pipeline, like this:

pop_resample <- pop |>
  crop(abvepa) |>
  resample(abvepa) |>
  log(base=10)

pop_abvepa <- pop_resample * abvepa

plot(pop_abvepa)

Tutorial report deliverable 4 (fixing mismatches)

In your report:

  1. Describe why these mismatches in resolution and extent required us to make some corrections.
  2. Explain what project() and resample() do conceptually.

Congratulations! You have completed this tutorial!