Welcome to your first lesson in mapping! Mapping forms the foundation of most geographic analysis, and maps help us to better understand spatial patterns and distributions. But it isn’t enough to just make a map - we need to make maps that are accurate, clear, and effective geographical representations of the phenomena we are interested in. In this tutorial, we will start learning some of the tools that will help us do just that in R. We will start mapping using the urbnmapr package, which will help us to make maps at the state and county level.
All maps are made up of points, lines, and polygons. A point is a single dot on a map that corresponds to one lat/long coordinate. A line consists of a series of dots connected together, while a polygon is a shape that is created by the intersection of multiple lines. We will be working on learning to symbolize our maps using points and polygons in this lesson.
#Load the urbnmapr package and the sf (spatial features) library
#The sf package will help us to work with the data from the urbnmapr package
#We'll talk about sf in more detail later, but sf (simple features) objects are essentially the points, lines, and polygons that we will use to make maps - this is a special object type that is similar to the shapefiles that are used in GIS software
#Each sf object contains a special "geometry" column that defines the spatial properties of the object (don't worry, you don't need to understand it any deeper than that)
#Later on, I will show you how to transform lat/long coordinates into an sf object
library(sf)
library(urbnmapr)
library(dplyr)
library(ggplot2)
library(ggthemes)
#grab county shape files for our map
counties <- get_urbn_map("counties", sf = TRUE)
#filter the data to get just NYS
counties <- counties %>%
filter(state_abbv == "NY")
old-style crs object detected; please recreate object with a recent sf::st_crs()
head(counties)
Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 1797804 ymin: -99527.13 xmax: 2146630 ymax: 325411.8
Projected CRS: NAD27 / US National Atlas Equal Area
county_fips state_abbv state_fips county_name fips_class
1 36021 NY 36 Columbia County H1
2 36087 NY 36 Rockland County H1
3 36101 NY 36 Steuben County H1
4 36123 NY 36 Yates County H1
5 36033 NY 36 Franklin County H1
6 36093 NY 36 Schenectady County H1
state_name geometry
1 New York MULTIPOLYGON (((2106726 192...
2 New York MULTIPOLYGON (((2114963 -92...
3 New York MULTIPOLYGON (((1800842 -40...
4 New York MULTIPOLYGON (((1821444 -41...
5 New York MULTIPOLYGON (((1947143 308...
6 New York MULTIPOLYGON (((2056558 791...
Let’s plot the state-level data that we have!
ggplot() +
geom_sf(data = counties,
mapping = aes())+
theme_minimal()
Hmmmm…NYS shouldn’t be crooked! What happened?
All maps have a Coordinate Reference System (CRS) - this consists of the coordinate system, datum, and map projection specified by the map creator. The default CRS works well for the entire US, but it doesn’t work as nicely for NYS. We need to define a more regionally-specific CRS. Let’s start by checking the current CRS of the map.
#This command comes from the sf package
st_crs(counties)
Coordinate Reference System:
User input: EPSG:2163
wkt:
PROJCRS["NAD27 / US National Atlas Equal Area",
BASEGEOGCRS["NAD27",
DATUM["North American Datum 1927",
ELLIPSOID["Clarke 1866",6378206.4,294.978698213898,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4267]],
CONVERSION["US National Atlas Equal Area",
METHOD["Lambert Azimuthal Equal Area (Spherical)",
ID["EPSG",1027]],
PARAMETER["Latitude of natural origin",45,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",-100,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["False easting",0,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["easting (X)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["northing (Y)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Statistical analysis."],
AREA["United States (USA) - onshore and offshore."],
BBOX[15.56,167.65,74.71,-65.69]],
ID["EPSG",9311]]
Based on this, we can see that the CRS is EPSG:2163. The datum is North American Datum 1927, while the map uses the US National Atlas Equal Area projection. All we really need to know here is what the current CRS is, so that we can determine whether to alter it. For our NYS map, we will need to change the CRS.
How do we know what CRS to choose? There are likely a number of CRS’ that would work well for this map. To pick one, we can look at a list of state projections here: https://github.com/veltman/d3-stateplane. I picked one of the NY specific options.
#Transform the CRS
counties <- counties %>%
st_transform("EPSG:32116")
#Let's see if this looks any better
ggplot() +
geom_sf(data = counties,
mapping = aes())+
theme_minimal()
#Much better!
Our map is really coming along! Next, we are going to want to load in some data to visualize on the map. We’ll start with the Covid-19 data that you used for your homework assignment. This first type of map is called a choropleth map - essentially, this is a map that uses the polygons on a map (in our case, counties) to visualize a variable with color.
#Set wd
setwd("~/Binghamton/geog380")
#load in the data
covid <- read.csv("covid_data_ny.csv")
#We'll need to merge this data with our counties dataset
#We can use the left_join function to do this
#First though, we need to inspect our ID columns
unique(covid$County)
[1] "Kings" "Queens" "New York" "Suffolk"
[5] "Nassau" "Bronx" "Westchester" "Erie"
[9] "Richmond" "Monroe" "Onondaga" "Orange"
[13] "Rockland" "Dutchess" "Albany" "Oneida"
[17] "Saratoga" "Niagara" "Broome" "Schenectady"
[21] "Ulster" "Rensselaer" "Oswego" "Putnam"
[25] "Chautauqua" "Chemung" "St. Lawrence" "Tompkins"
[29] "Jefferson" "Ontario" "Steuben" "Sullivan"
[33] "Clinton" "Wayne" "Cayuga" "Cattaraugus"
[37] "Warren" "Herkimer" "Madison" "Genesee"
[41] "Fulton" "Washington" "Montgomery" "Livingston"
[45] "Tioga" "Columbia" "Cortland" "Otsego"
[49] "Franklin" "Chenango" "Greene" "Allegany"
[53] "Orleans" "Wyoming" "Delaware" "Essex"
[57] "Seneca" "Lewis" "Schoharie" "Schuyler"
[61] "Yates" "Hamilton"
unique(counties$county_name)
[1] "Columbia County" "Rockland County" "Steuben County"
[4] "Yates County" "Franklin County" "Schenectady County"
[7] "Fulton County" "Lewis County" "Allegany County"
[10] "Montgomery County" "Tompkins County" "Orleans County"
[13] "Suffolk County" "Jefferson County" "Clinton County"
[16] "Wyoming County" "Schuyler County" "Niagara County"
[19] "Hamilton County" "Genesee County" "Cattaraugus County"
[22] "Monroe County" "New York County" "Oswego County"
[25] "Tioga County" "Madison County" "Queens County"
[28] "Bronx County" "Livingston County" "Orange County"
[31] "Albany County" "Ontario County" "Essex County"
[34] "Greene County" "Broome County" "Herkimer County"
[37] "Otsego County" "Washington County" "Chenango County"
[40] "Dutchess County" "Chemung County" "Cortland County"
[43] "Saratoga County" "Delaware County" "Nassau County"
[46] "Erie County" "Cayuga County" "Rensselaer County"
[49] "Westchester County" "Putnam County" "Ulster County"
[52] "Kings County" "Wayne County" "St. Lawrence County"
[55] "Seneca County" "Schoharie County" "Oneida County"
[58] "Richmond County" "Warren County" "Onondaga County"
[61] "Chautauqua County" "Sullivan County"
#There's an easier way to do this!
#We can use a logical operation to check whether the county names match
#The names need to be in the same order, or we'll get only False
counties <- counties %>% arrange(county_name)
covid <- covid %>% arrange(County)
counties$county_name == covid$County
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE
We have a problem! The counties data contains the word “County” in each county name, while the Covid data does not. You will run into problems like this pretty often, which is why it is so important to inspect your data before you analyze it. We can fix this with the separate() function from the tidyr package like this:
library(tidyr)
#Think about the county names - why did I use " County" as a separator instead of just a space?
counties1 <- counties %>%
separate(county_name, c("county_name", "county"), sep = " County") %>%
select(-county)
#Did it work?
head(counties1)
Simple feature collection with 6 features and 6 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 40348.68 ymin: 92189.85 xmax: 482314.7 ymax: 381253.8
Projected CRS: NAD83 / New York Central
county_fips state_abbv state_fips county_name fips_class state_name
1 36001 NY 36 Albany H1 New York
2 36003 NY 36 Allegany H1 New York
3 36005 NY 36 Bronx H6 New York
4 36007 NY 36 Broome H1 New York
5 36009 NY 36 Cattaraugus H1 New York
6 36011 NY 36 Cayuga H1 New York
geometry
1 MULTIPOLYGON (((435438.3 27...
2 MULTIPOLYGON (((103693.1 28...
3 MULTIPOLYGON (((467992.4 10...
4 MULTIPOLYGON (((282328.4 26...
5 MULTIPOLYGON (((42112.87 28...
6 MULTIPOLYGON (((232317.8 31...
#Let's check if they match now!
counties1$county_name == covid$County
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[14] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[27] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[40] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[53] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
Now we can proceed to join the data.
covid_county <- counties1 %>%
left_join(
covid,
#This command allows us to join the data using two variables with different names
by = c("county_name" = "County")
)
head(covid_county)
Simple feature collection with 6 features and 12 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 40348.68 ymin: 92189.85 xmax: 482314.7 ymax: 381253.8
Projected CRS: NAD83 / New York Central
county_fips state_abbv state_fips county_name fips_class state_name
1 36001 NY 36 Albany H1 New York
2 36003 NY 36 Allegany H1 New York
3 36005 NY 36 Bronx H6 New York
4 36007 NY 36 Broome H1 New York
5 36009 NY 36 Cattaraugus H1 New York
6 36011 NY 36 Cayuga H1 New York
new_positive positive_cumulative new_tests tests_cumulative geography
1 47 70822 461 1353242 COUNTY
2 7 9849 166 241680 COUNTY
3 347 454174 4510 8453317 COUNTY
4 42 52622 464 1135294 COUNTY
5 23 17488 146 296531 COUNTY
6 12 18378 172 378493 COUNTY
region geometry
1 capital MULTIPOLYGON (((435438.3 27...
2 western MULTIPOLYGON (((103693.1 28...
3 new york MULTIPOLYGON (((467992.4 10...
4 southern tier MULTIPOLYGON (((282328.4 26...
5 western MULTIPOLYGON (((42112.87 28...
6 central MULTIPOLYGON (((232317.8 31...
Now, we’ll create our first map!
#Using Ggplot2 and the urbnmapr package to produce the map:
ggplot() +
geom_sf(data = covid_county,
mapping = aes(fill = new_positive))+
theme_minimal()
Hmmmmmmm….that’s not very nice looking, is it? Any suggestions?
First, let’s talk about normalizing our data. It looks like NYC has a huge number of cases, but we also know that NYC is bigger than any other region in the state. What might this look like from a more relative perspective? Often, it is much more meaningful to normalize our data so that we can better compare across counties. For example, here I am going to map positive as a percentage of total tests. Other common ways to normalize the data include calculating the number of cases per capita (using population data) or the percent growth in positive cases (if we had data over time).
#calculate the new variable that we will map
covid_county <- covid_county %>%
mutate(positive_percent = new_positive/new_tests)
#inspect the data
head(covid_county)
Simple feature collection with 6 features and 13 fields
Geometry type: MULTIPOLYGON
Dimension: XY
Bounding box: xmin: 40348.68 ymin: 92189.85 xmax: 482314.7 ymax: 381253.8
Projected CRS: NAD83 / New York Central
county_fips state_abbv state_fips county_name fips_class state_name
1 36001 NY 36 Albany H1 New York
2 36003 NY 36 Allegany H1 New York
3 36005 NY 36 Bronx H6 New York
4 36007 NY 36 Broome H1 New York
5 36009 NY 36 Cattaraugus H1 New York
6 36011 NY 36 Cayuga H1 New York
new_positive positive_cumulative new_tests tests_cumulative geography
1 47 70822 461 1353242 COUNTY
2 7 9849 166 241680 COUNTY
3 347 454174 4510 8453317 COUNTY
4 42 52622 464 1135294 COUNTY
5 23 17488 146 296531 COUNTY
6 12 18378 172 378493 COUNTY
region geometry positive_percent
1 capital MULTIPOLYGON (((435438.3 27... 0.10195228
2 western MULTIPOLYGON (((103693.1 28... 0.04216867
3 new york MULTIPOLYGON (((467992.4 10... 0.07694013
4 southern tier MULTIPOLYGON (((282328.4 26... 0.09051724
5 western MULTIPOLYGON (((42112.87 28... 0.15753425
6 central MULTIPOLYGON (((232317.8 31... 0.06976744
Everything looks pretty good! Now we’ll map this new variable.
ggplot() +
geom_sf(data = covid_county,
mapping = aes(fill = positive_percent))+
theme_minimal()
This looks much nicer, but we’re not done yet! I don’t like that the color gradient starts with dark blue - low values are colored in dark colors, while high values are lighter in this map. That’s not very intuitive. Let’s play around with the color gradient and some color options for our map.
ggplot() +
geom_sf(data = covid_county,
mapping = aes(fill = positive_percent))+
theme_minimal()+
scale_fill_gradient(
# Make legend title more readable
name = "Percent \nPositive Cases",
#set the color gradient
low = "lightblue",
high = "navyblue") +
labs(title = "Positive Covid Cases in NYS")+
theme(legend.title.align = 0.5,
plot.title = element_text(hjust = 0.5))
NA
Unfortunately, we cannot center the legend box without creating a new function to do so or hacking the source code - ggplot2 currently only supports a left-aligned legend box. Next, we’ll play around with some map colors and styles.
#play around with the color and theme
ggplot() +
geom_sf(data = covid_county,
mapping = aes(fill = positive_percent))+
theme_minimal()+
#Maybe I want to make my legend discrete - scale_fill_steps allows me to do that!
scale_fill_steps(
# Make legend title more readable
name = "Percent \nPositive Cases",
#set the color gradient
low = "blue",
high = "red",
#Convert from decimal to percent format
labels = scales::percent_format(),
#tells R how many breaks to put in the data
n.breaks = 4,
#Tells R to include the min and max values in the legend
show.limits = T) +
labs(title = "Positive Covid Cases in NYS",
subtitle = "As a Percent of Daily Tests",
caption = "Source: health.data.ny.gov")+
theme(legend.title.align = 0.5,
plot.title = element_text(hjust = 0.5, size = 18),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 0.5),
#Remove axis labels and ticks
line = element_blank(),
axis.text.x=element_blank(),
axis.ticks.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank())
Ok, so now we’ve learned more about creating choropleth maps. What if I have lat/long coordinates? How can I map those? We’ll create a point map (in the tutorial it is called a Bubble map) to visualize these coordinates. Let’s map cities in NYS.
#load in some city data - this contains data for all cities in the US, so we'll have to filter for NY
#This data is from: https://simplemaps.com/data/us-cities
cities <- read.csv("uscities.csv")
nycities <- cities %>%
filter(state_name == "New York",
population > 100000)
nycities1 <- nycities %>%
#I need to transform my lat/long coordinates into an sf object
#4326 is the crs used in gps - it is defined in lat/long coords
#we need to first convert to an sf object using this crs
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
#now we can correctly convert to the same crs as our counties file, which is defined in meters
st_transform(crs = st_crs(counties))
#load in ggsflabel library
#make sure you install the devtools package first!
#This package will help us to make the labels look nice
devtools::install_github("yutannihilation/ggsflabel")
Skipping install of 'ggsflabel' from a github remote, the SHA1 (a489481b) has not changed since last install.
Use `force = TRUE` to force installation
library(ggsflabel)
Attaching package: ‘ggsflabel’
The following objects are masked from ‘package:ggplot2’:
geom_sf_label, geom_sf_text, StatSfCoordinates
#Let's plot our cities!
ggplot() +
geom_sf(data = counties)+
geom_sf(data = nycities1, color="blue")+
#Here, I am just widening the plot area - I want to label my cities, but the labels
#take up too much space
#How did I find these units? You can print out your sf object, and it will tell you the minimum and maximum x and y values for the object
#The units are in meters
coord_sf(ylim = c(20000, 550000))+
#add the labels! The geom_sf_label_repel argument lets me repel overlapping labels
#That way, we can read all of the text
#force tells R the force of the repulsion - I set it at 20, which is a bit low
#By default, it only displays the first 10 overlapping labels - I increased this to 20
#I also made the labels size 2, which is slightly smaller than the default
geom_sf_label_repel(data = nycities1, aes(label = city), force = 20,
max.overlaps = 20, size = 2)+
theme_minimal()+
labs(title = "Major Cities in New York State")+
theme(plot.title = element_text(hjust = 0.5, size = 16))
Ok, so there’s a lot going into this map. Here’s a breakdown of the steps you should use when you have lat/long coords:
Using these steps, you should be able to convert any lat/long coords into an sf object! Don’t worry if you don’t understand everything that is going on underneath these functions - you really just need to be able to follow the instructions here.
Ok, we will work on one final skill using point data today: sizing the points. Suppose I wanted to compare the cities by their population per sqkm (density) - I could change the sizes of my points to accomplish that!
#Re-run some code from earlier - I don't want to remove small cities this time
nycities <- cities %>%
filter(state_name == "New York") %>%
filter(population > 20000)
nycities1 <- nycities %>%
st_as_sf(coords = c("lng", "lat"), crs = 4326) %>%
st_transform(crs = st_crs(counties))
options(scipen = 999)
ggplot() +
geom_sf(data = counties)+
geom_sf(data = nycities1, aes(size = density), color="blue", alpha = 0.5)+
theme_minimal()+
labs(title = "Major Cities in New York State")+
theme(plot.title = element_text(hjust = 0.5, size = 16))
Hmmmm this is not terribly helpful…I could also think about changing the colors of the points!
ggplot() +
geom_sf(data = counties, fill = "white")+
geom_sf(data = nycities1, aes(color = density), size = 3)+
theme_minimal()+
labs(title = "Major Cities in New York State",
color = "Density \n (People per sqkm)")+
theme(plot.title = element_text(hjust = 0.5, size = 16),
legend.title.align = 0.5,)+
scale_color_gradient(
#set the color gradient
low = "lightblue",
high = "navyblue")
Do we love this map? No (it’s really ugly! This is not a good way to visualize density!), but it illustrates an important skill.
Resources
New York State (2022). New York State Statewide Covid-19 Testing. [Data Set]. Retreived from: https://health.data.ny.gov/Health/New-York-State-Statewide-COVID-19-Testing/xdss-u53e.
Simple Maps (2022). United States Cities Database. [Data Set]. Retrieved from: https://simplemaps.com/data/us-cities.
Urban Institute (2022). Introduction to Mapping. Retrieved from: https://urbaninstitute.github.io/r-at-urban/mapping.html.