If you’re like me, the idea of creating maps in R Studio may have been a bit intimidating.

A lot of the nice maps I see are made with Tableau or ArcGIS rather than R Studio, so I assumed that making a decent map in R must be either very hard or very annoying.

However, when I sat down to learn to make a map of the COVID-19 outbreak in R Studio, I was pleasantly surprised to find that it wasn’t so hard after all (although I must admit there were some annoying moments!).

Below, I’ll explain step-by-step how you can create a map in R Studio using longitude and latitude data from a .csv file.

All of the data is publicly available (see the Data Sources section), and the supplemental data can be found on my GitHub rpubs_data repo.

Watercolor Map

This section will explain how to make the following map:

To create a map of all of the confirmed COVID-19 cases in the U.S., I first loaded the following packages into R Studio:

library(readr)
library(mapproj)
library(tidyverse)
library(ggmap) #to use this package, need to register with Google and use Maps Static API, Geocoding API & maybe Geolocation API
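The comment on ggmap mentions registering with Google but does not show the step. Here is a minimal sketch of one way to do it, assuming the key is kept in an environment variable. The variable name GOOGLE_MAPS_API_KEY is a placeholder chosen for this example, not something ggmap requires.

```r
library(ggmap)

# Assumption: the key lives in an environment variable with this
# (hypothetical) name, so it never appears in the script itself.
api_key <- Sys.getenv("GOOGLE_MAPS_API_KEY")

if (nzchar(api_key)) {
  register_google(key = api_key)  # stores the key for the current R session
} else {
  message("No API key found; Google-backed ggmap calls will fail until one is registered.")
}
```

Keeping the key out of the source file makes it safer to publish the script on GitHub or RPubs.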

Next, I loaded CDC data from the Johns Hopkins GitHub repository (the data is also available directly on the CDC’s website, but JHU already has it in a nice layout) and did some cleaning.

#Loading COVID-19 data from JHU
daily<-read.csv("C:/Users/Morganak/Documents/R/Projects/COVID-19/csse_covid_19_data/csse_covid_19_daily_reports/04-16-2020.csv", header=TRUE, sep=",", stringsAsFactors=FALSE)

#Renaming columns to more manageable names
colnames(daily)<-c('State','Country','Date','Lat', 'Long', 'Confirmed', 'Deaths', 'Recover', 'Active', 'FIPS','Incident_Rate', 'Tested', 'Hospitalized', 'Mortality_Rate', 'UID', 'ISO3', 'Testing_Rate', 'Hospitalization_Rate')

us_only<-daily %>% dplyr::filter(Country=="US" & Confirmed!=0)#Filtering CDC data to only US confirmed cases
head(us_only)
##        State Country                Date     Lat      Long Confirmed
## 1    Alabama      US 2020-04-16 23:30:51 32.3182  -86.9023      4345
## 2     Alaska      US 2020-04-16 23:30:51 61.3707 -152.4044       300
## 3    Arizona      US 2020-04-16 23:30:51 33.7298 -111.4312      4237
## 4   Arkansas      US 2020-04-16 23:30:51 34.9697  -92.3731      1620
## 5 California      US 2020-04-16 23:30:51 36.1162 -119.6816     27677
## 6   Colorado      US 2020-04-16 23:30:51 39.0598 -105.3111      8286
##   Deaths Recover Active FIPS Incident_Rate Tested Hospitalized
## 1    133      NA   4212    1      92.66572  36391          553
## 2      9     110    291    2      50.18829   8735           35
## 3    150     460   4087    4      58.21081  47398          578
## 4     37     548   1583    5      62.57177  22675           85
## 5    956      NA  26721    6      70.58907 246400         5031
## 6    355      NA   7931    8     146.21975  40533         1636
##   Mortality_Rate      UID ISO3 Testing_Rate Hospitalization_Rate
## 1       3.060990 84000001  USA     776.1100            12.727273
## 2       3.000000 84000002  USA    1461.3157            11.666667
## 3       3.540241 84000004  USA     651.1862            13.641728
## 4       2.283951 84000005  USA     875.8116             5.246914
## 5       3.457745 84000006  USA     628.4332            18.177548
## 6       4.284335 84000008  USA     715.2698            19.744147
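Assigning colnames() by position works as long as the file keeps the same column order. As a hedged alternative, the sketch below renames by name with dplyr::rename() on a small toy data frame. The original column names (Province_State, Country_Region, Long_) are assumptions about the JHU file, and the Diamond Princess count is made up for illustration.

```r
library(dplyr)

# Toy stand-in for the daily report. The "old" column names are assumed,
# and the third row's case count is invented for this sketch.
daily_toy <- data.frame(
  Province_State = c("Alabama", "Alaska", "Diamond Princess"),
  Country_Region = c("US", "US", "US"),
  Lat = c(32.3182, 61.3707, NA),
  Long_ = c(-86.9023, -152.4044, NA),
  Confirmed = c(4345, 300, 1),
  stringsAsFactors = FALSE
)

# rename() uses new = old pairs, so each label is tied to a specific column
# rather than to a position in the file.
daily_toy <- daily_toy %>%
  rename(State = Province_State, Country = Country_Region, Long = Long_)

# Same filter as in the walkthrough: US rows with at least one confirmed case
us_toy <- daily_toy %>% filter(Country == "US", Confirmed != 0)
head(us_toy)
```

If the upstream file ever adds or reorders columns, rename() fails loudly on a missing name instead of silently mislabeling a column.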

After importing the data on the number of confirmed cases and filtering it down to only those in the United States, I found that other U.S. locations were also included in my filter results.

For example, the cruise ship “Diamond Princess” was listed under US, because it was a US cruise ship. However, I only wanted to get the total number of confirmed cases in each US state, so I needed to exclude these other locations.

To do this, I imported a spreadsheet of U.S. state names that I found online (51 distinct names, Alaska included, as the output below shows) and loaded its State column into a vector, which I named “s”.

This spreadsheet is available for public use on my rpubs_data GitHub repo.

states<-read.csv2("C:/Users/Morganak/Documents/R/Projects/COVID-19/states.csv")#importing a file of U.S. state names to eliminate other U.S. places

s<-states[,"State"] #vector of state names
head(s)
## [1] Alabama    Alaska     Arizona    Arkansas   California Colorado  
## 51 Levels: Alabama Alaska Arizona Arkansas California ... Wyoming

Then I filtered my dataframe so that it would only show locations listed in my vector of state names.

us_only<-filter(us_only, State %in% s) #filtering out locations not in the state name vector
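To see what a filter like this keeps and drops, here is a small self-contained sketch. The toy data frame, the short state_names vector, and the Diamond Princess count are all placeholders for illustration.

```r
library(dplyr)

# Toy data: two states plus one non-state location (count is made up)
toy <- data.frame(
  State = c("Alabama", "Alaska", "Diamond Princess"),
  Confirmed = c(4345, 300, 1),
  stringsAsFactors = FALSE
)

# Reference vector of state names, kept as plain character strings
state_names <- c("Alabama", "Alaska", "Arizona")

kept <- toy %>% filter(State %in% state_names)   # rows whose State is in the vector
dropped <- setdiff(toy$State, state_names)       # locations the filter removes

kept$State
dropped
```

Checking the dropped names is a quick way to confirm that only non-state locations were removed, and that no state was lost to a spelling mismatch.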

Finally, I reduced the dataframe to only the three columns that I would need for my map. Building a map is essentially the same as building any other scatterplot. You just need observations, x values, and y values.

In this situation, the longitude coordinates would act as my x-axis values, the latitude coords would be my y-axis values, and the total confirmed cases in each state would be my points.

us_only<-us_only%>% dplyr::select('Confirmed','Lat', 'Long')#selecting only needed cols
us_only$Lat<-as.numeric(us_only$Lat)
us_only$Long<-as.numeric(us_only$Long)
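To make the scatterplot analogy concrete, here is a minimal sketch that plots three of the rows from the head(us_only) output above with no map background at all. The object names pts and scatter are placeholders.

```r
library(ggplot2)

# Three rows copied from the head(us_only) output shown earlier
pts <- data.frame(
  Confirmed = c(4345, 300, 4237),
  Lat = c(32.3182, 61.3707, 33.7298),
  Long = c(-86.9023, -152.4044, -111.4312)
)

# Longitude on x, latitude on y: a "map" without the background
scatter <- ggplot(pts, aes(x = Long, y = Lat)) +
  geom_point(size = 2, alpha = 0.5)
scatter
```

Everything that follows is a matter of drawing a background underneath these points.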

To build the background of the map, I first needed a blank map of the United States. I chose the Stamen map source and the watercolor maptype.

Then I added the US data using the geom_point() function from ggplot2, which ggmap builds on.

#Getting a map of the United States. 
#Need to use Google API key here. 

map<-get_map(location = 'US', zoom=3, source="stamen", maptype = "watercolor", crop=FALSE)
#Adding points for confirmed cases
us_point<-ggmap(map) + geom_point(data=us_only, aes(x=Long, y=Lat), size=2, alpha=.5)
us_point #displaying the map

I chose the watercolor map type, but there are lots of other map types. For example, here is the same map using Google’s map data from the get_googlemap() function and maptype=“terrain”:

map<-get_googlemap('US', zoom=3, maptype = "terrain", crop=FALSE)

map2<-ggmap(map) + geom_point(data=us_only, aes(x=Long, y=Lat), size=2, alpha=.5)
map2

Both of these maps show that the coronavirus has become extremely widespread, but neither conveys the depth of the spread within particular areas.

For that, I wanted a map that would convey the distribution of confirmed COVID cases a little better.

Enter…the bubblemap!

Bubble Maps

To make these bubble maps, you’ll need the following packages:

library(maps)#for bubblemap
library(viridis) #for colors

To start off the map, I chose the world map option and then filtered down to just the United States.

map<-map_data("world") %>% filter(region=="USA") #map data
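It can help to look at what map_data() returns before plotting it. The sketch below inspects the columns, assuming the maps package is installed. The name usa is a placeholder.

```r
library(ggplot2)  # map_data() is a ggplot2 function that reads from the maps package
library(maps)

usa <- subset(map_data("world"), region == "USA")

names(usa)                  # long, lat, group, order, region, subregion
length(unique(usa$group))   # number of separate polygons (mainland, islands, ...)
```

Each value of group identifies one closed outline, which is why the geom_polygon() calls in this section map group=group: without it, the outlines would be joined into a single tangled shape.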

Then I created the background using geom_polygon. For size and aesthetics, I chose to just show the contiguous United States:

p<-ggplot(us_only,aes(x=Long, y=Lat, size=Confirmed)) +
  geom_polygon(data = map, aes(x=long, y=lat, group=group), fill="gray", colour = "darkgray", size=0.5) + #setting line color, fill color, line size
  ylim(23,50) + xlim(-125,-60)  #resizing to just CONUS
p

Then I added the points using geom_point.

You may notice that the first map looks a little distorted. That’s because the original plot uses plain Cartesian coordinates. Adding the “coord_map()” function applies a map projection, which corrects the appearance.

p<-p+ geom_point(alpha=0.3, color="blue") + #everything inside aes appears in legend
  coord_map() + #plotting with correct mercator projection (prior plot was cartesian coordinates)
  ggtitle("Confirmed COVID-19 Cases in the U.S.") 
p
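To compare the coordinate systems side by side, here is a hedged sketch that builds the same background three ways. The object names are placeholders, and coord_quickmap() is an extra option that the walkthrough above does not use.

```r
library(ggplot2)
library(maps)
library(mapproj)  # coord_map() relies on mapproj for the projection

usa <- subset(map_data("world"), region == "USA")

base <- ggplot(usa, aes(x = long, y = lat, group = group)) +
  geom_polygon(fill = "gray", colour = "darkgray") +
  xlim(-125, -60) + ylim(23, 50)   # same CONUS window as above

flat <- base                        # default Cartesian coordinates (looks stretched)
projected <- base + coord_map()     # Mercator projection by default
quick <- base + coord_quickmap()    # approximation that only fixes the aspect ratio
```

Printing flat and projected one after the other shows the distortion described above. coord_quickmap() is usually faster and tends to look similar at this scale.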

To get rid of the axes, use theme_void():
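A minimal sketch of that step, using a small stand-in plot (p_demo and its two points are placeholders, not the bubble map itself):

```r
library(ggplot2)

# Stand-in plot; in the walkthrough this would be the bubble map `p`
p_demo <- ggplot(data.frame(Long = c(-86.9023, -111.4312),
                            Lat = c(32.3182, 33.7298)),
                 aes(x = Long, y = Lat)) +
  geom_point() +
  ggtitle("Demo")

p_clean <- p_demo + theme_void()  # drops axes, gridlines, and panel background
p_clean
```

The title and any legends are kept, so only the plot furniture around the map disappears.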

I was pretty happy with how that map turned out, but wouldn’t it look even better with more color?

To do that, I took the same bubblemap as before, but added a color scale using scale_color_viridis_c(). Now the size of the outbreak could be conveyed by point color as well as point size.

p2<-ggplot() +
  geom_polygon(data = map, aes(x=long, y=lat, group=group), fill="gray", colour = "darkgray", size=0.5, alpha=0.3) + #setting line color, fill color, line size
  ylim(23,50) + xlim(-125,-60) + #resizing to just CONUS
  geom_point(data=us_only, aes(x=Long, y=Lat, color=Confirmed, size=Confirmed), alpha=0.5) + #everything inside aes appears in legend
  coord_map() + #plotting with correct mercator projection (prior plot was cartesian coordinates)
  scale_color_viridis_c()+
  ggtitle("Confirmed COVID-19 Cases in the U.S.") +
  guides( colour = guide_legend()) +
  theme_void()  #gets rid of axes
p2
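The guides(colour = guide_legend()) line is easy to overlook. As far as I understand it, it swaps the continuous color bar for a legend with discrete keys, which ggplot2 can then combine with the size legend when both scales share the same title and breaks. The sketch below builds both versions from three rows of the earlier output so they can be compared. All object names are placeholders.

```r
library(ggplot2)

# Three rows copied from the head(us_only) output shown earlier
pts <- data.frame(
  Long = c(-86.9023, -111.4312, -119.6816),
  Lat = c(32.3182, 33.7298, 36.1162),
  Confirmed = c(4345, 4237, 27677)
)

base <- ggplot(pts, aes(x = Long, y = Lat, color = Confirmed, size = Confirmed)) +
  geom_point(alpha = 0.5) +
  scale_color_viridis_c()

with_colourbar <- base                                  # default guide for color: a color bar
with_legend <- base + guides(colour = guide_legend())   # discrete keys for color instead
```

Printing with_colourbar and with_legend in turn shows how the legend layout changes.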

And there we have it! Several different ways to create maps of COVID-19 data using R Studio.

For other ways to use R Studio to study COVID-19 data, check out my other articles/whitepapers:

Data Sources

All data used for this project is free and publicly available. Here are my sources:

  • COVID-19 Data: The locations and counts of confirmed COVID-19 cases in the United States were obtained from the Novel Coronavirus (COVID-19) Cases public GitHub repository provided by the Johns Hopkins University Center for Systems Science and Engineering (JHU CSSE). The repo is also supported by the ESRI Living Atlas Team and the Johns Hopkins University Applied Physics Lab (JHU APL). JHU obtains the U.S. data from the CDC’s webpage on the COVID-19 pandemic.

  • Files (states.csv and states_oconus.csv) - These are just two files used to create a vector of U.S. state names. They can be found here on my rpubs_data GitHub repo.