In Week 1, you learned the big GIS building blocks: layers, the difference between raster and vector data, and why coordinate reference systems (CRS) and projections matter for making maps and doing analysis.
In Week 2, you added remote sensing ideas—especially that most satellite products we use in GIS arrive as rasters and come with important “resolution” choices and tradeoffs.
This tutorial is where we start doing the “cookie-dough” thing: chaining small operations into a full workflow (read data → inspect → transform → analyze → export).
This tutorial gives you a gentle introduction to coding in R and helps you get familiar with how the system works.
It is important to remember that writing code in R is exacting: every capital letter and punctuation mark needs to be in the right place, or the code won’t work.
Luckily, the easiest way to learn to code in R is to look at examples and then to switch out parts of those examples to see what happens. You can use that strategy to learn how the R language works for the tasks that you are trying to accomplish. That’s the strategy that I’m going to be using to teach you in this class.
The interesting thing about R is that pretty much everything you need can be written in R scripts. An R script is a series of commands that R follows to conduct analysis, build maps, and so on.
Scripts are really useful because they are also a record of every step of your analysis. That means that if you wind up making a mistake that you don’t notice until much later, all you need to do is find the source of the problem in your script, fix it, and rerun your code from there. That’s a lot better than having to redo all your work from scratch!
I’m using a special format of R script called a Quarto Document, which allows you to combine code with formatted text. When I generate the HTML, Quarto runs all the code blocks in the document in order and puts the results below each one. That means you can check what you get when you run the code against what I got when I ran it.
Quarto Documents are handy for making class materials, but they can also be used to make general websites or even presentations. We’ll see how to use those in this class.
A GeoTIFF like GlobalPM252020.tif is a raster: a grid of pixels where each cell has a value and a known spatial location. In Week 1, we contrasted rasters with vectors (points/lines/polygons). In Week 2, you saw that most remotely sensed datasets (except things like LiDAR point clouds) come to us in raster form.
Why you should care right away: every raster comes with metadata like CRS, extent, and resolution. Those are the first things you should check before you trust a map or combine layers.
Some of the data for this tutorial will be downloaded directly using R, but there are two datasets that we will be reading in from files:
These GeoTIFFs are available for download in this zip file.
In 3–5 sentences, answer:
Let’s start with your first line of code. You can copy and paste the following into your script editor in RStudio to run it:
2 + 2 #Any part of a line in R that starts with a # will be ignored when the code is run.
[1] 4
#We use lines (or parts of lines) that start with # to make notes in the code.
Once you have the code pasted in your script editor, highlight it with the cursor and then click “Run”, which is in the upper right-hand corner of the script editor window. Try running the code above now.
Congratulations! That’s your first code. This is pretty basic. Among many other things, R works as a simple calculator. It also obeys standard order of operations rules. Run the lines below for examples:
2 + (3 * 2)
[1] 8
(2 + 3) * 2
[1] 10
4 / 2
[1] 2
(4 + 2) / 2
[1] 3
4 + (2 / 2)
[1] 5
Okay, so you get the idea!
R is an object-oriented programming language. What that means is that R works by creating snippets of information in the computer memory that we call “objects” and then doing things to those objects.
I often think about the data that I am working on almost like a physical structure that I’m using R to build and manipulate. Another helpful metaphor might be to think of objects as little digital creatures that you use R to communicate with.
However it feels easiest to think about it, let’s introduce your first object by running the following line:
x <- 5
The <- sign in that line is the assignment function. In R, functions are pieces of code that do things to objects. In those terms, all the mathematical symbols we used above would count as functions.
The assignment function takes whatever is on its right-hand side and stores it in an object named on its left-hand side.
In this case, we created an object we called x and gave it the content of the number 5. We can then type in the name of the object to ask R to print back its contents:
x
[1] 5
Importantly, R is case sensitive. That means it cares whether or not letters are capitalized. Try running this:
X
Error: object 'X' not found
R should have given you this message: “Error: object ‘X’ not found”. When R says an object is not found, it’s telling you that something that you think is read into the environment (that is, the collection of objects that R has in memory at the moment) isn’t actually there.
If you look in the Environment tab in RStudio, which is probably to the right of your script editing window, you can see a list of all the objects that you have created in the environment. You should see that there is an object called “x” there. Notice that “x” isn’t capitalized. That’s why R can’t find it. R considers x and X to be completely different.
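If you’d rather check from code than from the Environment tab, base R can do the same job. Here’s a small sketch (my addition; ls() and exists() are standard base R functions):
ls() #Lists the names of all objects currently in the environment
exists("X") #Returns FALSE, because we never created an object called capital X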
In R, a layer is an object: a container that holds geometry + attributes (vector) or a pixel grid + metadata (raster). When we do GIS in R, we mostly do two things: create objects that hold our layers, and apply functions to those objects.
You can pack pretty much anything that R can hold in memory into an object. We’ll be putting whole layers of spatial data into objects soon! For now, let me just show you a bit of how this works:
x <- c(1, 5, 10)
In this line, we used the c, or concatenate, function, which takes a list of items separated by commas and puts them into a single object called a vector (not to be confused with the idea of a vector layer in GIS), which is just a line of data:
x
[1]  1  5 10
The parentheses and commas are really, really important in R. Commas separate different inputs into a function. We call these separate inputs arguments. The parentheses tell R where the arguments for each function start and end.
See what happens if we mess this up:
x <- c(1 5, 10)
x <- c(1, 5) 10
Error in parse(text = input): <text>:1:10: unexpected numeric constant
1: x <- c(1 5
             ^
Both of those situations cause R to say Error: unexpected... When you see an error that starts that way, it’s R’s way of telling you that there’s something wrong with the way you typed your code and that things are out of order.
RStudio actually helps you keep track of coding problems. In the first of the two lines above, for example, you should see a red underline under 5. That’s RStudio telling you that something fishy is going on with your code there (in this case, you’re missing the comma after 1).
Okay, let’s get our x back in working order:
x <- c(1, 2, 3, 4, 5, 10)
Once you’ve created a vector, you can use functions to perform operations on all the elements of the vector at once:
x <- c(1, 5, 10)
x + 2
[1]  3  7 12
x * 2
[1]  2 10 20
x ^ 2
[1]   1  25 100
x / 2
[1] 0.5 2.5 5.0
You can also use statistical functions on the whole vector. Here are some common handy examples:
min(x)
[1] 1
mean(x)
[1] 5.333333
median(x)
[1] 5
max(x)
[1] 10
sd(x) # Standard Deviation
[1] 4.50925
summary(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   3.000   5.000   5.333   7.500  10.000 
In Week 1 we talked about GIS as software + data + hardware. Packages are basically add-on toolboxes that turn base R into a GIS:
- terra ≈ raster toolbox (read/write rasters, map algebra, raster geoprocessing)
- sf ≈ vector toolbox (points/lines/polygons; joins; overlay; buffers)
- rnaturalearth ≈ “data source” (a convenient way to pull reference layers)
- tidyverse ≈ data wrangling + pipelines (the “chain several steps together” habit you need for real GIS workflows)
To turn R into a powerful GIS engine, we need to add these packages. Packages in R add additional functionality to your code, similar to plugins in QGIS. The code below will help you install and load the packages we’ll be using today:
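If any of these packages are not yet installed (the warnings below show that rnaturalearth and ggmap were missing in the session used to render this page), install them first. This is a minimal sketch using base R’s install.packages(); you only need to run it once per machine:
install.packages(c("rnaturalearth", "tidyverse", "sf", "ggmap", "terra")) #Downloads and installs the packages from CRAN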
require(rnaturalearth) # Allows you to import vector layers from Natural Earth
Loading required package: rnaturalearth
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'rnaturalearth'
require(tidyverse) # Useful data cleaning, management, and plotting functions
Loading required package: tidyverse
Warning: package 'tidyverse' was built under R version 4.5.2
Warning: package 'ggplot2' was built under R version 4.5.2
Warning: package 'tidyr' was built under R version 4.5.2
Warning: package 'readr' was built under R version 4.5.2
Warning: package 'purrr' was built under R version 4.5.2
Warning: package 'forcats' was built under R version 4.5.2
Warning: package 'lubridate' was built under R version 4.5.2
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.1 ✔ stringr 1.6.0
✔ ggplot2 4.0.0 ✔ tibble 3.3.0
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.2.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
require(sf) # Functions for vector data reading, writing, and geoprocessing
Loading required package: sf
Warning: package 'sf' was built under R version 4.5.2
Linking to GEOS 3.13.1, GDAL 3.11.4, PROJ 9.7.0; sf_use_s2() is TRUE
require(ggmap) # Functions for making cool maps
Loading required package: ggmap
Warning in library(package, lib.loc = lib.loc, character.only = TRUE,
logical.return = TRUE, : there is no package called 'ggmap'
require(terra) # Functions for raster input, output, and geoprocessing
Loading required package: terra
Warning: package 'terra' was built under R version 4.5.2
terra 1.8.80
Attaching package: 'terra'
The following object is masked from 'package:tidyr':
extract
R reads in data using information on the paths to files on your computer. It’s helpful to set the working directory, which gives R a default folder for reading and writing data for the session.
You can set the working directory to the folder where your script and data are located using the code below. But how do you find the PATH_TO_FILE? What I recommend for beginners is to save your R script and all the data you are using into a single folder somewhere on your computer. Then, in RStudio, you can click Session -> Set Working Directory -> To Source File Location. RStudio will automatically generate a line of code like the one below, which you can copy and paste into your script:
setwd("/cloud/project/Intro_Geospatial_in_R_Data")Error in setwd("/cloud/project/Intro_Geospatial_in_R_Data"): cannot change working directory
Once you have set your working directory, you can read in the PM2.5 raster layer:
pm25 <- rast("GlobalPM252020.tif")
plot(pm25)
Before doing anything “analysis-y,” inspect the basics: the CRS, the resolution, the extent, and the range of cell values.
This mirrors Week 1’s CRS/projection discussion: distance and area calculations usually require projected data, and misaligned layers can produce nonsense.
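For example, if you ever needed areas or distances from this layer, you would first project it to a projected, equal-area CRS. A minimal sketch using terra’s project() function (the Mollweide proj-string here is my choice for illustration, not part of this tutorial’s workflow):
pm25_moll <- project(pm25, "+proj=moll") #Reprojects the raster to the equal-area Mollweide projection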
Run these and report the results in your tutorial write-up (include units where possible):
crs(pm25)
[1] "GEOGCRS[\"WGS 84 (CRS84)\",\n DATUM[\"World Geodetic System 1984\",\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"unknown\"],\n AREA[\"World\"],\n BBOX[-90,-180,90,180]],\n ID[\"OGC\",\"CRS84\"]]"
res(pm25)
[1] 0.1 0.1
ext(pm25)
SpatExtent : -179.999996947, 179.999996947, -55.000000762, 67.999996947 (xmin, xmax, ymin, ymax)
global(pm25, fun = c("min","mean","max"), na.rm = TRUE)
               min     mean   max
GlobalPM252020 0.8 18.00101 254.6
Then interpret: What CRS are the data in? What are the implications of the CRS for how you should work with and interpret the data?
Map algebra is very straightforward in the terra package, which is now the main workhorse package for working with raster data in R. terra allows you to treat the entire raster as if it were just another value, so you can perform operations on each pixel in a raster by typing some simple code.
For example, we can use a conditional statement to create a new raster showing only the parts of the world that fail to meet the US EPA’s maximum safe standard for PM 2.5 annual average levels (12 micrograms per cubic meter):
abvepa <- pm25 >= 12
plot(abvepa)
This is the raster version of what you saw in the “cookie dough” slides: geoprocessing tools are basically algorithms that manipulate spatial data, and raster tools work by calculating new pixel values from old pixel values.
A threshold like pm25 >= 12 creates a logical raster (TRUE/FALSE). Under the hood that becomes a 1/0 raster (where 1 = TRUE and 0 = FALSE), which is handy for further calculations.
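Because the values are all 1s and 0s, the mean of the raster is the fraction of pixels above the threshold. A small sketch (my own addition, reusing terra’s global() function from above):
global(abvepa, fun = "mean", na.rm = TRUE) #The mean of a 1/0 raster is the share of (non-NA) pixels above the EPA limit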
What does the 12 represent in the code above, and what are the units?
We can also rescale the PM2.5 raster in terms of the percentage of the EPA’s maximum:
epapct <- (pm25 / 12) * 100
plot(epapct)
Now we’re going to use our population raster to estimate exposure to unsafe levels of PM 2.5. To start, we need to read in the gridded population layer:
pop <- rast("gpw2015.tif")
plot(pop)
You’ll notice that it’s difficult to see much here – that’s because global population is very unevenly distributed, so some pixels in the raster have huge numbers of people, while many have very low values. One way to address this is to convert these values into orders of magnitude by computing a logarithm.
We can chain together series of functions where the output of one function is used as the input of another. This is accomplished with the pipeline operator: |>. If you place the pipeline after a function, it will take that function’s output and feed it in as the first argument in the following function.
Using pipelines is the preferred way to write R code with sequences of operations because it makes the steps clearer, almost like making your own flowchart in the code.
pop |> #We start with the object that has the data we will be using
log(base=10) |> #Then we do one operation
plot() #And then another
To combine rasters cell-by-cell, you need them on the same grid. That includes the same CRS, the same extent, and the same resolution, so that the pixels line up exactly.
If these don’t match, R will throw an error—or worse, silently resample and you’ll get results you can’t interpret.
This is the raster analogue of vector overlay tools like intersect or clip from the “cookie dough” lecture: the computer can only overlay raster layers if the pixels match up.
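You can also ask terra to run these checks for you. A quick sketch using terra’s compareGeom() function (my addition; with stopOnError = FALSE it returns FALSE instead of throwing an error when the grids differ):
compareGeom(pop, abvepa, stopOnError = FALSE) #TRUE only if the CRS, extent, and rows/columns all match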
Because we’ll be making comparisons across rasters, we need to make sure that the two raster datasets are in the same coordinate reference system (CRS).
The pm25 raster is in latitude and longitude with a WGS 1984 datum, and we’ll set that first. We can look up WGS 1984 at spatialreference.org, and we can see that it has an EPSG code of 4326.
We can use the crs() function to set or print out CRS information for a terra raster object, like this:
crs(pm25) <- "epsg:4326"
crs(pm25)
[1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n MEMBER[\"World Geodetic System 1984 (G2296)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
Now we need to check that the pop raster is in the same CRS:
crs(pop)
[1] "GEOGCRS[\"WGS 84\",\n ENSEMBLE[\"World Geodetic System 1984 ensemble\",\n MEMBER[\"World Geodetic System 1984 (Transit)\"],\n MEMBER[\"World Geodetic System 1984 (G730)\"],\n MEMBER[\"World Geodetic System 1984 (G873)\"],\n MEMBER[\"World Geodetic System 1984 (G1150)\"],\n MEMBER[\"World Geodetic System 1984 (G1674)\"],\n MEMBER[\"World Geodetic System 1984 (G1762)\"],\n MEMBER[\"World Geodetic System 1984 (G2139)\"],\n MEMBER[\"World Geodetic System 1984 (G2296)\"],\n ELLIPSOID[\"WGS 84\",6378137,298.257223563,\n LENGTHUNIT[\"metre\",1]],\n ENSEMBLEACCURACY[2.0]],\n PRIMEM[\"Greenwich\",0,\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n CS[ellipsoidal,2],\n AXIS[\"geodetic latitude (Lat)\",north,\n ORDER[1],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n AXIS[\"geodetic longitude (Lon)\",east,\n ORDER[2],\n ANGLEUNIT[\"degree\",0.0174532925199433]],\n USAGE[\n SCOPE[\"Horizontal component of 3D system.\"],\n AREA[\"World.\"],\n BBOX[-90,-180,90,180]],\n ID[\"EPSG\",4326]]"
Great! They are both in WGS 1984, so we can safely continue with our analysis.
Remember that we had an object above called abvepa, which covered all the areas where average annual PM 2.5 levels were above US EPA standards? Well, that is a binary (1/0) raster, where the 1 values are the places that are above the EPA limits.
We can therefore use some map algebra to make a new raster that shows population in places above the EPA’s safe limits of PM 2.5, again converting population to powers of ten:
popabvepa <- pop * abvepa
Error: [*] extents do not match
plot(popabvepa)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'popabvepa' not found
Now you should have come across an error. What this error is telling you is that, even though the rasters are in the same CRS, they don’t cover exactly the same area (i.e., their extents don’t match), which they must if we are to conduct map algebra with both of them.
We can have a look at the extent of each raster using the ext() function:
ext(abvepa)
SpatExtent : -179.999996947, 179.999996947, -55.000000762, 67.999996947 (xmin, xmax, ymin, ymax)
ext(pop)
SpatExtent : -180, 180, -90, 89.9999999999999 (xmin, xmax, ymin, ymax)
When you compare the two extents, it is clear that abvepa covers less of the planet’s surface than the pop raster does.
That means we’ll need to crop() (that’s what terra calls clipping) the pop raster to match the abvepa raster before continuing.
pop_crop <- crop(pop, abvepa) #You start with the raster you are cropping, followed
#by the raster you are using to crop it.
Now we should be able to do our map algebra:
popabvepa <- pop_crop * abvepa
Error: [*] number of rows and/or columns do not match
plot(popabvepa)
Error in h(simpleError(msg, call)): error in evaluating the argument 'x' in selecting a method for function 'plot': object 'popabvepa' not found
Uh-oh! It looks like we’ve got another problem. The issue here is that to perform map algebra, rasters must have the same resolution, with pixels that line up exactly. That means that we need to convert one of the rasters to match the other.
To change the resolution of a raster, we use the resample() function, which applies your choice of several algorithms for estimating the value of one raster in the pixels of another. This is basically how we get rasters to talk to each other.
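By default, resample() uses bilinear interpolation, which suits continuous data; for categorical or 1/0 rasters you would want nearest-neighbor instead. A quick sketch of the method argument (my addition; both methods are built into terra, and the next step below just uses the default):
pop_bilinear <- resample(pop_crop, abvepa, method = "bilinear") #Averages nearby cells; good for continuous data like population
abvepa_near <- resample(abvepa, pop_crop, method = "near") #Keeps the nearest cell's value; right for categorical or binary rasters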
pop_resample <- resample(pop_crop, abvepa)
Now we can do our calculation:
pop_abvepa <- log(pop_resample, base=10) * abvepa
plot(pop_abvepa)
Now, we did this all step-by-step just so you could see how it works, but we could also have done all that as a pipeline, like this:
pop_resample <- pop |>
crop(abvepa) |>
resample(abvepa) |>
log(base=10)
pop_abvepa <- pop_resample * abvepa
plot(pop_abvepa)
In your report, explain what project() and resample() do conceptually.