It’s free.
R is free. Esri ArcGIS costs money. But what about other free GIS, like Q GIS and Grass GIS?
It’s command-line-based rather than a GUI.
Esri software, QGIS, and GrassGIS are mainly graphical-user-interface (GUI) based, i.e. point and click. R is command-line based, like SAS. (From what I know, the others do have a command-line option. I never learned the command-line in Esri ArcMap because it requires Python knowledge. I haven’t tried the others.)
The advantage to GUI is it’s a little easier to learn. However, GUI pointing and clicking is inefficient and not replicable. What did I click on again? If you change your input data, you have to repeat all your steps manually. With R, the code is all there, so you know exactly what you did, so to replicate what you did, you can simply run the code again like you would in SAS, rather than having to remember what you clicked on.
Now’s the time.
R is continually improving, and it’s becoming easier to use. The package used here (sf) was released within the last year and plays nicely with the popular dplyr package for data wrangling. As Drake and Future have said, What a Time to be Alive.
What can’t R do?
Sometimes point-and-click can be better. For example, I had to draw bike routes in GIS for a research project. ArcGIS made this easy (albeit monotonous) with the trace segment feature tool. This would have been a headache to code in a command line. This feature appears to be on the horizon in R, but from my experience it’s not quite ready for prime time.
mapviewggmap package, which is used in this demo to geocode, follows the famous ggplot2 design philosophy and is thus a great candidate for static maps. Unfortunately, it doesn’t seem to play nicely with sf objects (explained below) on my last check, and I’m stubbornly reluctant to convert sf objects back to sp. Nonetheless, R is commonly used for static maps, and abundant resources are available to learn how. My favorite resource so far has been: https://eriqande.github.io/rep-res-eeb-2017/map-making-in-R.htmlRGoogleMaps package.)sf - the key package for spatial data management. The great thing about this package is you can treat spatial data as you would any other dataframe. It just has an extra column with a vector representing the geometry. This makes it straightforward to incorporate spatial data into a normal data-wrangling workflow, where you might add new variables, filter on a variable, select a column, merge with additional data, or any other common task. In brief, it made spatial data tidy. In comparison, shapefiles are confusing with their need for 5+ files and weird file extensions.
tidyverse - “The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.” I recommend this free book, R 4 Data Science, for an intro to it.
mapview - a way to make interactive maps in R. Pretty amazing.
ggmap - use Google Maps to geocode.
readxl - to import the Excel data
lwgeom - required for some spatial operations
Note, in order to install a package, you need write access to the folder where the packages go. This might be an issue if you’re working on an Emory computer. You can specify where the package goes like this:
install.packages("sf", lib = "H:/R")
install.packages("tidyverse", lib = "H:/R")
install.packages("mapview", lib = "H:/R")
#note, development version of ggmap may be needed. see below.
install.packages("ggmap", lib = "H:/R")
install.packages("readxl", lib = "H:/R")
install.packages("lwgeom" , lib = "H:/R")
install.packages("devtools", lib = "H:/R") #to install dev version of ggmap
install_github("dkahle/ggmap", lib = "H:/R") #load the dev version of ggmap
To load the packages from that location, first tell your computer where they are.
.libPaths("H:/R") #tell computer where to look for packages.
#load the packages.
library(sf)
library(tidyverse)
library(mapview)
library(devtools)
library(ggmap)
library(readxl)
library(lwgeom)
If you’re on your laptop, the default location will probably be fine, so there’s no need to change the library or tell the computer where the packages are.
install.packages("sf")
install.packages("tidyverse")
install.packages("mapview")
#skip this and load dev version below if ggmap::register_google function isn't working
install.packages("ggmap")
install.packages("readxl")
install.packages("lwgeom")
install.packages("devtools") #so that the dev version of ggmap can be installed
install_github("dkahle/ggmap") #load the dev version of ggmap
library(sf)
library(tidyverse)
library(mapview)
library(devtools)
library(ggmap)
library(readxl)
library(lwgeom)
Here is a dataset with information on 440 dialysis facilities and the 9 transplant centers in Network 6. The data can be downloaded from this Dropbox link (tip: right click and open in new tab if the link isn’t working).
Save the data somewhere, and tell R where you put it using setwd(). The file.path() function is neat because it makes file pathways operating-system independent.
setwd(file.path("C:", "Users", "maybe your name", "a subfolder", "maybe another subfolder", "the folder with the excel data"))
#make sure the working directory is what you said.
getwd()
Now load the data. It can be helpful to specify the variable types first, so R doesn’t have to guess. Here are two vectors with the information for the column type information (numeric, text, etc.), one for each sheet we’re going to load. The order in the vector corresponds to the variable’s order in the Excel sheet from left to right. (Note, I’m calling zip code text.)
#column types for 'Selected Facilities for QIP'
col_types_dialfacil = c("numeric", "numeric", "text", "numeric", "numeric", "text", "text", "text", "text", "text", "text", "text", "text", "numeric", "numeric")
#column types for Network 6 transplant centers
col_types_transpcenter = c("text", "text", "text", "text", "text", "text", "text")
#load facilities Excel sheet
facilities <- read_excel("N of 440 Final Dialysis Facility Population_For Maps.xlsx",
sheet = "Selected Facilities for QIP", range = "A1:O441",
col_names = TRUE, #first row is the column name
col_types = col_types_dialfacil) #the vector specifying column types
#load dialysis center Excel sheet
centers <- read_excel("N of 440 Final Dialysis Facility Population_For Maps.xlsx",
sheet = "Network 6 Transplant Centers", range = "A1:G10",
col_names = TRUE, #first row is the column name
col_types = col_types_transpcenter) #the vector specifying column types
Take a look at the data to check.
#what are the column names?
names(facilities)
## [1] "Treatment" "CCN" "Facility_Name"
## [4] "Patient_Census2" "Percent_Waitlisting" "OWNEDBY_DESC"
## [7] "PHYS_FIRST_LINE" "PHYS_SECOND_LINE" "PHYS_CITY"
## [10] "PHYS_STATE" "PHYS_ZIP_CODE" "PHYS_ZIP_EXT"
## [13] "PHYS_COUNTY_NAME" "NEAR_DIST_MILES" "Miles_NW6"
names(centers)
## [1] "Center_Name" "Center_Abbr" "Center_Full" "Center_Street"
## [5] "Center_City" "Center_State" "Center_Zip"
#how many rows and columns?
dim(facilities)
## [1] 440 15
dim(centers)
## [1] 9 7
#Look at the dataset.
View(facilities)
View(centers)
We need two things in each dataset in order to geocode:
* A key or unique ID variable
* An address in the form Street Number Street Address, City, State, Zip
Geocoding with Google Maps is a little slow, so it’s a good idea to save the geocoded results to the computer rather than geocoding every time you run the R code. For that reason, it’s nice to have a key to link the geocoded file back with the original non-geocoded data.
The facilities data doesn’t have an address variable in the above form, and the transplant centers data doesn’t have a numeric key variable. So we’ll make them.
We can make this column as a combination of other columns in the data, specifically: PHYS_FIRST_LINE PHYS_CITY , PHYS_STATE, and PHYS_ZIP_CODE.
In the dplyr package, the verb mutate creates new variables. Use that in combination with the paste0 function to combine the relevant columns. paste0 is like concatenate in Excel.
Note the use of the %>%, the pipe operator. Rather than including the data frame in every function, the data frame is passed through after each piped operation, indicated by the %>%. It makes the code more verb-oriented rather than noun-oriented, making it easier to read and more like a SAS data step. See the R 4 Data Science book for more details on piping.
facilities_1 = facilities %>%
#skip the PHYS_SECOND_LINE column
mutate(facility_address = paste0(PHYS_FIRST_LINE, ", ", PHYS_CITY, ", " , PHYS_STATE, ", ", PHYS_ZIP_CODE))
#check
View(facilities_1)
4024 Burnett Womack Building, CB #7211, Chapel Hill, NC, 27599.160 Dental Cir, Chapel Hill, NC 27514centers_1 = centers %>%
mutate(center_id = row_number()) %>%
#can make a new variable that represents the fixed addresses.
#I'm using case_when syntax, which is similar to an if-then statement in SAS.
mutate(center_address = case_when(
#if center_id is 7, fix the address.
center_id == 7 ~ "160 Dental Cir, Chapel Hill, NC 27514",
#if else, set to the original values.
TRUE ~ Center_Full
))
#check
#View(centers_1)
How many facilities are in Georgia?
Can answer a couple of ways:
1. Filter by state, and count the number of columns.
facilities_GA = facilities_1 %>%
filter(PHYS_STATE == "GA") #note the == rather than =
dim(facilities_GA) #152 rows, so 152 facilities in GA
## [1] 152 16
count_facilities_by_state = facilities_1 %>%
#now, every row is a state.
group_by(PHYS_STATE) %>%
#perform some operations for those rows.
summarise(
state_count = n()
)
print(count_facilities_by_state) #152 in Georgia
## # A tibble: 3 x 2
## PHYS_STATE state_count
## <chr> <int>
## 1 GA 152
## 2 NC 181
## 3 SC 107
ggmap package.In each of the two datasets – the one for dialysis facilities and the one for transplant centers – we have a column for the address. We will send that column to Google so it can tell us the latitude and longitude for the places. With the lat and lon, we can then convert it into an sf object.
To use this, you have to have a key for Google Maps API. It’s free and quick to get one here: https://developers.google.com/maps/documentation/javascript/get-api-key.
Once you have a key, you can register it in the console. And check how many times you can use it per day. The limit for free accounts is 2500 daily geocodes.
register_google(key = "AIzaSyBwpCFVf9sa7E-julIPmYuW18b675rSafE")
#Note, if this didn't work, try installing the develoment version of ggmap. See section 2.1.
geocodeQueryCheck() #limit is 2500 per day
## 2500 geocoding queries remaining.
Note, I’m using the dplyr verb select to only select the ID field and the address field. This makes it easier to link back into the main data once the geocoding is done. The key column name is CCN, and the address column name is facility_address. I’m specifying the package name before the function (dplyr::select) because sometimes the same function is used for multiple packages, so it can be helpful to specify for common words.
by_facility_geocode = facilities_1 %>%
dplyr::select(CCN, facility_address) %>%
#the function to perform the geocode
mutate_geocode(facility_address, force = TRUE) %>%
#drop the address field to avoid duplication when it's merged back with the original data
dplyr::mutate(facility_address = NULL)
#Check to make sure they were all geocoded.
View(by_facility_geocode)
The key column is center_id and the address column is center_address.
by_center_geocode = centers_1 %>%
dplyr::select(center_id, center_address) %>%
mutate_geocode(center_address, force = TRUE) %>%
dplyr::mutate(center_address = NULL)
#Check to make sure they were all geocoded.
View(by_center_geocode)
….The geocoding took a while. Let’s save to the hard drive so we don’t have to do that every time. They will save to the working directory specified above.
save(by_facility_geocode, file = "by_facility_geocode.RData")
save(by_center_geocode, file = "by_center_geocode.RData")
Check to make sure they load.
load(file = "by_facility_geocode.RData")
load(file = "by_center_geocode.RData")
The mutate_geocode function created a column for latitude (lat) and a column for longitude (lon). Let’s convert it to an sf object, which will take the geographical information stored in lat and lon and make a new column called geometry.
Convert to an sf object using the st_as_sf function, which takes three arguments: the data frame (which is piped through from the lines above in this case), the lon and lat values as a vector created by c, and a coordinate reference system.
I’m setting the coordinate reference system to 4326, which is a commonly used projection by web maps like Google Maps. Another option may have been one that’s more accurate in the Georgia area, such as 6446. This online book describes the ins and outs of coordinate reference systems: https://geocompr.robinlovelace.net/spatial-class.html#crs-intro. You can search for EPSG codes here: http://spatialreference.org/ref/.
The main thing to keep track of with coordinate reference systems is the unit for distance. The unit for EPSG 4326 is meters.
The below code also joins the aspatial information with the newly-created sf object using the left_join function. left_join takes three arguments: the data frame on the left side (piped through here), the data frame merging in on the right side (facilities_1 here), and the key variable.
Dialysis facilities
by_facility_sf = by_facility_geocode %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
left_join(facilities_1, by = "CCN")
Transplant centers
by_center_sf = by_center_geocode %>%
st_as_sf(coords = c("lon", "lat"), crs = 4326) %>%
left_join(centers_1, by = "center_id")
mapview.Mapview as an initial check.The stated purpose of the mapview package is to make interactive maps to support the GIS workflow, not necessarilly for publication-ready maps – although with lots of tinkering, a screenshot of the interactive maps could pass for publication ready, depending on cartography standards. Let’s take a look at our two spatial datasets using mapview.
The mapview function has several options to customize, but it will also accept one argument, the dataset, which is great for quick checks.
Dialysis facilities
mapview(by_facility_sf)
Transplant centers
mapview(by_center_sf)
A visual inspection shows that the geocoding seems to have worked. All of the dialysis facilities are in the Southeast, and the location of each transplant center seems correct.
Note the different background maps you can toggle through. You can also click on a given facility or center to query all of its data.
mapview options.The mapview package has extensive documentation highlighting its many options:
https://cran.r-project.org/web/packages/mapview/mapview.pdf
https://r-spatial.github.io/mapview/articles/index.html
Just add a plus sign between the two mapview calls. For this one, let’s also specify:
* the background map so it goes straight to OpenStreetMap
* the colors of each layer. Use color for the line color and col.regions for the color of the fill.
mapview(by_facility_sf, map.type = "OpenStreetMap",
color = "orangered", col.regions = "orangered") +
mapview(by_center_sf, map.type = "OpenStreetMap",
color = "blue", col.regions = "blue")
Look at the dialysis facilities by the size of the patient population - Patient_Census2. Use the cex argument to allow the size of the dots to vary by the values in Patient_Census2.
mapview(by_facility_sf, map.type = "OpenStreetMap", cex = "Patient_Census2", legend = FALSE)
Where are the intervention facilities? Color code the Treatment column.
mapview(by_facility_sf,
map.type = "OpenStreetMap",
zcol = "Treatment",
legend =TRUE ,#add a legend
layer.name = "Intervention facilities (1); No (0)" #label the legend
)
This is a little tricky. We can use the st_distance function in the sf package to find the distance between each facility and every other facility. We can then take the minimum of those measurements for each facility to determine the nearest distance to another facility.
The st_distance produces a matrix with all of the pairwise distances if the by_element argument is FALSE.
To keep track of which facility is which, let’s first create a new variable enumerating the facilities, like we have with the centers.
by_facility_sf1 = by_facility_sf %>%
#sort the data by CCN ascending.
arrange(CCN) %>%
mutate(facility_id = row_number())
fac_distance_matrix = st_distance(by_facility_sf1, by_element = FALSE)
That creates a 440x440 matrix. We need to find the minimum value for each facility.
#Convert it to a data frame (a tibble).
fac_distance_df = as_tibble(fac_distance_matrix)
#convert to numeric. Units are meters.
#the 1 means apply the function row wise. you can say 2 for column wise. I don't think it would make a difference, here. See this blog post for info about the apply family of functions https://www.r-bloggers.com/apply-lapply-rapply-sapply-functions-in-r/
fac_distance_df_numeric_m = as_tibble(apply(fac_distance_df, 1, as.numeric))
#There are zeros along the middle diaganol. We need to set them all to NA. Otherwise, the minimum for every facility will be zero.
fac_distance_df_numeric_m[fac_distance_df_numeric_m==0] <- NA
#Find the minimum of each column using sapply. using sapply because I want a vector in return.
min_fac_distance = sapply(fac_distance_df_numeric_m, min, na.rm = T)
#Now convert it to a tibble (tidy table) and do some more tidying up.
min_fac_dist_tibble = min_fac_distance %>%
as_tibble() %>%
mutate(facility_id = row_number()) %>%
rename(min_distance_m = value) %>%
#create new variables for min distance in kilometers and miles
mutate(min_distance_km = min_distance_m / 1000,
min_distance_mi = min_distance_m / 1609.34
)
Link that back with the rest of the data using left_join.
by_facility_sf2 = by_facility_sf1 %>%
left_join(min_fac_dist_tibble, by = "facility_id")
Well that was a lot. Let’s use mapview in combination with the measure distance tool in Google maps (in a web browser, not in R) to make sure the results are correct.
#Look at the new columns we made
by_facility_sf2 %>%
dplyr::select(facility_id, min_distance_m, min_distance_mi, min_distance_km) %>%
View()
#and check on the map.
mapview(by_facility_sf2)
I’m looking at the facility in the northwest corner of GA, CCN = 112574. By clicking on it, I can see that the min_distance_mi is 17.06. The closest one looks to be the one directly south of it, CCN = 112610. Using the addresses of each and the measure distance tool in Google Maps, we can confirm that this distance is correct. Phew.
This should be easier, because we can perform a unary union (st_union) (see sf documentation) on the transplant centers object so that it’s all one object. That way, the st_distance function won’t return a 440x9 matrix but instead a 440x1 matrix, which will be a little easier to work with. (Alternatively, we could have done this above, too, by making a copy of the by_facility_sf object and performing a unary union on it. Now we know both ways.)
trans_centers_one = by_center_sf %>% st_union(by_feature=FALSE)
Check using mapview.
mapview(trans_centers_one)
#calculate distance between each facility and the single feature
dist_fac_center_440x1 = st_distance(by_facility_sf2, trans_centers_one)
#Is this the same as the lowest value in each row of the 440x9 matrix?
dist_fac_center_440x9 = st_distance(by_facility_sf2, by_center_sf)
#View(dist_fac_center_440x1)
#View(dist_fac_center_440x9) #yes.
Like before, the st_distance function returns a matrix. We need to turn it back into a data frame (as_tibble) so we can link this information with the rest of the data.
min_cent_dist_tibble = dist_fac_center_440x1 %>%
as_tibble() %>%
mutate(facility_id = row_number()) %>%
#the variable is V1 this time rather than 'value'.
#take away the units and then delete the V1 variable.
mutate(min_distance_center_m = as.numeric(V1),
v1 = NULL) %>%
#min distance in kilometers and miles
mutate(min_distance_center_km = min_distance_center_m / 1000,
min_distance_center_mi = min_distance_center_m / 1609.34
)
Link that back with the rest of the data using left_join.
by_facility_sf3 = by_facility_sf2 %>%
left_join(min_cent_dist_tibble, by = "facility_id")
Again, check using mapview and Google Maps (outside of R).
mapview(by_facility_sf3) + mapview(by_center_sf, col.regions="green")
Examine how far away the dialysis facility in the NW corner of Georgia is from Piedmont Hospital using Google Maps. It should be 92.8 miles per our calculation. Crow-fly distance in Google Maps is about 92.5. Woo.
This could be used to determine how many people a transplant center could serve, using the patient census at each dialysis facility.
My strategy would be to start with a buffer (st_buffer) around each dialysis facility. You can read about the many ways to manipulate simple feature geometries here: https://cran.r-project.org/web/packages/sf/vignettes/sf3.html
https://geocompr.robinlovelace.net/
https://eriqande.github.io/rep-res-eeb-2017/
on basic data management and analysis in the tidyverse: http://r4ds.had.co.nz/