If you’re like me, the idea of creating maps in RStudio may have been a bit intimidating.
A lot of the nice maps I see are made with Tableau or ArcGIS rather than RStudio, so I assumed that making a decent map in R must be either very hard or very annoying.
However, when I sat down to learn to make a map of the COVID-19 outbreak in RStudio, I was pleasantly surprised to find that it wasn’t so hard after all (although I must admit there were some annoying moments!).
Below, I’ll explain step-by-step how you can create a map in RStudio using longitude and latitude data from a .csv file.
All of the data is publicly available (see the Data Sources section), and the supplemental data can be found on my GitHub rpubs_data repo.
This section will explain how to make the following map:
To create a map of all of the confirmed COVID-19 cases in the U.S., I first loaded the following packages in RStudio:
library(readr)
library(mapproj)
library(tidyverse)
library(ggmap) #to use this package, need to register with Google and use Maps Static API, Geocoding API & maybe Geolocation API
Next, I loaded CDC data from the Johns Hopkins GitHub repository (the data is also available directly on the CDC’s website, but JHU has it in a nice layout already) and did some cleaning.
#Loading COVID-19 data from JHU
daily<-read.csv("C:/Users/Morganak/Documents/R/Projects/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-16-2020.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)
#Renaming columns to more manageable names
colnames(daily)<-c('State','Country','Date','Lat', 'Long', 'Confirmed', 'Deaths', 'Recover', 'Active', 'FIPS','Incident_Rate', 'Tested', 'Hospitalized', 'Mortality_Rate', 'UID', 'ISO3', 'Testing_Rate', 'Hospitalization_Rate')
us_only<-daily %>% dplyr::filter(Country=="US" & Confirmed!=0)#Filtering CDC data to only US confirmed cases
head(us_only)
## State Country Date Lat Long Confirmed
## 1 Alabama US 2020-04-16 23:30:51 32.3182 -86.9023 4345
## 2 Alaska US 2020-04-16 23:30:51 61.3707 -152.4044 300
## 3 Arizona US 2020-04-16 23:30:51 33.7298 -111.4312 4237
## 4 Arkansas US 2020-04-16 23:30:51 34.9697 -92.3731 1620
## 5 California US 2020-04-16 23:30:51 36.1162 -119.6816 27677
## 6 Colorado US 2020-04-16 23:30:51 39.0598 -105.3111 8286
## Deaths Recover Active FIPS Incident_Rate Tested Hospitalized
## 1 133 NA 4212 1 92.66572 36391 553
## 2 9 110 291 2 50.18829 8735 35
## 3 150 460 4087 4 58.21081 47398 578
## 4 37 548 1583 5 62.57177 22675 85
## 5 956 NA 26721 6 70.58907 246400 5031
## 6 355 NA 7931 8 146.21975 40533 1636
## Mortality_Rate UID ISO3 Testing_Rate Hospitalization_Rate
## 1 3.060990 84000001 USA 776.1100 12.727273
## 2 3.000000 84000002 USA 1461.3157 11.666667
## 3 3.540241 84000004 USA 651.1862 13.641728
## 4 2.283951 84000005 USA 875.8116 5.246914
## 5 3.457745 84000006 USA 628.4332 18.177548
## 6 4.284335 84000008 USA 715.2698 19.744147
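The renaming and filtering steps can be tried on a small stand-in data frame before pointing them at the full JHU file. This is a minimal sketch in base R rather than dplyr. The Alabama and Alaska counts come from the output above, while the "Example Province" and "Example Territory" rows are made up for illustration, and the raw column names are assumed to match the JHU headers.

```r
# Stand-in for the raw daily report; column names mimic the JHU headers (assumed)
daily_demo <- data.frame(
  Province_State = c("Alabama", "Alaska", "Example Province", "Example Territory"),
  Country_Region = c("US", "US", "Canada", "US"),
  Confirmed      = c(4345, 300, 10, 0),  # last two counts are made up
  stringsAsFactors = FALSE
)

# Rename columns to shorter names, as in the article
colnames(daily_demo) <- c("State", "Country", "Confirmed")

# Keep only US rows with at least one confirmed case
us_demo <- daily_demo[daily_demo$Country == "US" & daily_demo$Confirmed != 0, ]
print(us_demo)
```

Only the Alabama and Alaska rows survive: the Canadian row fails the country test and the zero-case row fails the count test.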
After importing the data on the number of confirmed cases and filtering it down to only those in the United States, I found that other U.S. locations were also included in my filter results.
For example, the cruise ship “Diamond Princess” was listed under US, because it was a US cruise ship. However, I only wanted to get the total number of confirmed cases in each US state, so I needed to exclude these other locations.
To do this, I imported a spreadsheet that I found online listing U.S. state names (51 entries, Alaska included, as the output below shows, so it is not limited to the 48 contiguous states) and loaded it into a vector, which I named “s”.
This spreadsheet is available for public use on my rpubs_data GitHub repo.
states<-read.csv2("C:/Users/Morganak/Documents/R/Projects/COVID-19/states.csv")#importing a file of U.S. state names to eliminate other U.S. places
s<-states[,"State"] #vector of state names
head(s)
## [1] Alabama Alaska Arizona Arkansas California Colorado
## 51 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming
Then I filtered my dataframe so that it would only show locations listed in my vector of state names.
us_only<-filter(us_only, State %in% s) #filtering out locations not in the state name vector
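If the states.csv file is not at hand, R's built-in `state.name` vector offers a similar lookup. The sketch below is an alternative to my file, not a copy of it: `state.name` holds the 50 state names, and adding "District of Columbia" to reach 51 entries is an assumption about what the file contains. The "Diamond Princess" count is made up, and the object names are hypothetical.

```r
# Built-in vector of the 50 U.S. state names (ships with base R)
s_demo <- c(state.name, "District of Columbia")  # DC added by assumption

locations <- data.frame(
  State     = c("Alabama", "Alaska", "Diamond Princess"),
  Confirmed = c(4345, 300, 1),  # last count is made up
  stringsAsFactors = FALSE
)

# Keep only rows whose State appears in the lookup vector
states_only <- locations[locations$State %in% s_demo, ]
print(states_only$State)

# For a contiguous-only lookup, drop Alaska and Hawaii
conus <- setdiff(s_demo, c("Alaska", "Hawaii"))
length(conus)
```

The `%in%` test is what does the work here: any location whose name is missing from the lookup vector, such as a cruise ship, is dropped.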
Finally, I reduced the dataframe to only the three columns that I would need for my map. Building a map is essentially the same as building any other scatterplot. You just need observations, x values, and y values.
In this situation, the longitude coordinates would act as my x-axis values, the latitude coords would be my y-axis values, and the total confirmed cases in each state would be my points.
us_only<-us_only%>% dplyr::select('Confirmed','Lat', 'Long')#selecting only needed cols
us_only$Lat<-as.numeric(us_only$Lat)
us_only$Long<-as.numeric(us_only$Long)
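Coordinates read from a file sometimes arrive as text, which is why the conversion to numeric matters. Here is a small sketch with values taken from the output above, stored as strings on purpose; the `coords` name is hypothetical.

```r
coords <- data.frame(
  Confirmed = c(4345, 300),
  Lat       = c("32.3182", "61.3707"),    # stored as text on purpose
  Long      = c("-86.9023", "-152.4044"),
  stringsAsFactors = FALSE
)

# Convert text to numbers; values that cannot be parsed become NA (with a warning)
coords$Lat  <- as.numeric(coords$Lat)
coords$Long <- as.numeric(coords$Long)

# Confirm the conversion worked before plotting
stopifnot(is.numeric(coords$Lat), is.numeric(coords$Long), !anyNA(coords))
str(coords)
```

If a column was read in as a factor instead of text, convert it with `as.numeric(as.character(x))`, because `as.numeric()` on a factor returns the internal level codes rather than the values.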
To build the background of the map, I first needed a blank map of the United States. I chose the Stamen map source and the watercolor map type.
Then I added the US data with geom_point(), a ggplot2 function that can be layered on top of a ggmap plot.
#Getting a map of the United States.
#Need to use Google API key here.
map<-get_map(location = 'US', zoom=3, source="stamen", maptype = "watercolor", crop=FALSE)
#Adding points for confirmed cases
us_point<-ggmap(map) + geom_point(data=us_only, aes(x=Long, y=Lat), size=2, alpha=.5)
us_point
I chose the watercolor map type, but there are lots of other map types. For example, here is the same map using Google’s map data from the get_googlemap() function and maptype=“terrain”:
map<-get_googlemap('US', zoom=3, maptype = "terrain", crop=FALSE)
map2<-ggmap(map) + geom_point(data=us_only, aes(x=Long, y=Lat), size=2, alpha=.5)
map2
Both of these maps show that the coronavirus has become extremely widespread, but neither conveys the scale of the outbreak within particular areas.
For that, I wanted a map that would convey the distribution of confirmed COVID cases a little better.
Enter…the bubblemap!
To make these bubble maps, you’ll need the following packages:
library(maps)#for bubblemap
library(viridis) #for colors
To start off the map, I chose the world map option and then filtered down to just the United States.
map<-map_data("world") %>% filter(region=="USA") #map data
Then I created the background using geom_polygon. For size and aesthetics, I chose to just show the contiguous United States:
p<-ggplot(us_only,aes(x=Long, y=Lat, size=Confirmed)) +
geom_polygon(data = map, aes(x=long, y=lat, group=group), fill="gray", colour = "darkgray", size=0.5) + #setting line color, fill color, line size
ylim(23,50) + xlim(-125,-60) #resizing to just CONUS
p
Then I added the points using geom_point.
You may notice that the first map looks a little distorted. That’s because the original plot uses Cartesian coordinates, which treat a degree of longitude and a degree of latitude as the same length. Adding coord_map() corrects the appearance.
p<-p+ geom_point(alpha=0.3, color="blue") + #everything inside aes appears in legend
coord_map() + #plotting with correct mercator projection (prior plot was cartesian coordinates)
ggtitle("Confirmed COVID-19 Cases in the U.S.")
p
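The distortion has a simple cause: away from the equator, a degree of longitude covers less ground than a degree of latitude, so plotting both on equal Cartesian scales stretches the map sideways. Here is a rough sketch of the arithmetic, assuming a spherical Earth; the two function names are hypothetical helpers, not part of ggplot2.

```r
# Ratio of the ground length of 1 degree of longitude to 1 degree of latitude
lon_to_lat_ratio <- function(lat_deg) cos(lat_deg * pi / 180)

# Mercator y-coordinate (in radians, unit sphere) for a given latitude
mercator_y <- function(lat_deg) {
  lat <- lat_deg * pi / 180
  log(tan(pi / 4 + lat / 2))
}

# Ratios at the equator and across the latitude range used for the map
lon_to_lat_ratio(c(0, 25, 40, 49))

# Projected y-values grow faster than latitude itself as you move north
mercator_y(c(25, 40, 49))
```

At 40 degrees north the ratio is roughly 0.77, so an unprojected plot draws the country noticeably wider than it should be. The default projection in coord_map() is Mercator, which compensates by stretching the vertical axis as latitude increases.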
To get rid of the axes, use theme_void():
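As a minimal sketch of that step, the snippet below uses two coordinates from the output above in place of the full data set; the object names are hypothetical. In the workflow of this article, the equivalent would be adding theme_void() to the existing plot `p`.

```r
library(ggplot2)

# Two state coordinates taken from the output above (Alabama, California)
pts <- data.frame(Long = c(-86.9023, -119.6816), Lat = c(32.3182, 36.1162))

# Plot with default axes, then the same plot with axes and background removed
p_axes <- ggplot(pts, aes(x = Long, y = Lat)) + geom_point()
p_void <- p_axes + theme_void()
p_void
```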
I was pretty happy with how that map turned out, but wouldn’t it look even better with more color?
To do that, I took the same bubblemap as before, but added in a color scale using scale_color_viridis_c(). Now the size of the outbreak could be conveyed by different point colors, as well as point sizes.
p2<-ggplot() +
geom_polygon(data = map, aes(x=long, y=lat, group=group), fill="gray", colour = "darkgray", size=0.5, alpha=0.3) + #setting line color, fill color, line size
ylim(23,50) + xlim(-125,-60) + #resizing to just CONUS
geom_point(data=us_only, aes(x=Long, y=Lat, color=Confirmed, size=Confirmed), alpha=0.5) + #everything inside aes appears in legend
coord_map() + #plotting with correct mercator projection (prior plot was cartesian coordinates)
scale_color_viridis_c()+
ggtitle("Confirmed COVID-19 Cases in the U.S.") +
guides( colour = guide_legend()) +
theme_void() #gets rid of axes
p2
And there we have it! Several different ways to create maps of COVID-19 data using RStudio.
For other ways to use RStudio to study COVID-19 data, check out my other articles/whitepapers:
All data used for this project is free and publicly available. Here are my sources:
COVID-19 Data: The locations and counts of confirmed COVID-19 cases in the United States were obtained from the Novel Coronavirus (COVID-19) Cases public GitHub repository provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The repo is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL). JHU obtains the U.S. data from the CDC’s webpage on the COVID-19 pandemic.
Files (states.csv and states_oconus.csv) - These are just two files used to create a vector of U.S. state names. They can be found here on my rpubs_data GitHub repo.