This tutorial draws extensively on several other sources, including spatial.org and the excellent online tutorials by Nick Eubank (here) and Justin de Benedictis-Kessner (here).

You’ll need to go to this GitHub link to download the data used in the tutorial.

How R Sees Spatial Data

In this first section, we will understand how R thinks about spatial data and how to import and export spatial datasets.

First we will load the libraries we will need during this workshop. If you have not installed the packages yet, refer to the pre-workshop instructions.

library(sp)
library(rgdal)
library(raster)
library(rgeos)
library(plyr)
library(maptools)
library(RColorBrewer)
library(ggmap)
library(gridExtra)
library(classInt)
library(geosphere)
library(spdep)
library(gmapsdistance)

Make Spatial Data Points

We start with the most basic geometric object, points. Like regular data points in plots, each spatial point needs a pair of x-y coordinates. The only difference is that this time, we pass them to the SpatialPoints() function and thus turn these points into a SpatialPoints object.

plus.coordinates <- rbind(c(1.5, 1.25), c(1.75, 1.25), 
                          c(2.25, 1.25), c(2.5, 1.25), #horizontal
                          c(2, 1.75), c(2, 1.5), 
                          c(2, 1), c(2, 0.75), #vertical
                          c(2, 1.25)) #center
first.spatial.points <- SpatialPoints(plus.coordinates) #points converted into spatial object
plot(first.spatial.points)                         

To grasp how R sees these points, we can use the summary() command. The output shows us the class of the object, the minimum and maximum coordinates (which would be longitude and latitude values in real geospatial data), whether the object is projected (see below for more info), the coordinate reference system (CRS) it uses if one is defined, and finally the total number of points in the data. If you want to access all the coordinate values, you can use the coordinates() command.

summary(first.spatial.points)
## Object of class SpatialPoints
## Coordinates:
##            min  max
## coords.x1 1.50 2.50
## coords.x2 0.75 1.75
## Is projected: NA 
## proj4string : [NA]
## Number of points: 9
summary(first.spatial.points)$class
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
summary(first.spatial.points)$bbox
##            min  max
## coords.x1 1.50 2.50
## coords.x2 0.75 1.75
summary(first.spatial.points)$is.projected
## [1] NA
summary(first.spatial.points)$proj4string
## [1] NA
summary(first.spatial.points)$npoints
## [1] 9
coordinates(first.spatial.points)
##       coords.x1 coords.x2
##  [1,]      1.50      1.25
##  [2,]      1.75      1.25
##  [3,]      2.25      1.25
##  [4,]      2.50      1.25
##  [5,]      2.00      1.75
##  [6,]      2.00      1.50
##  [7,]      2.00      1.00
##  [8,]      2.00      0.75
##  [9,]      2.00      1.25

Add Coordinate Reference Systems

The plus sign we created above is, for now, just a spatial object. It becomes geospatial, that is, it understands how the coordinates of its points relate to places in the real world, through a Coordinate Reference System (CRS), which combines information on the geographic coordinate system and the projection. The CRS of an object can be found through proj4string() or the summary of the object, as we observed above. The NA value means that our SpatialPoints does not have a CRS defined yet. You can also use the is.projected() command, which tells you whether a CRS is defined or not.

is.projected(first.spatial.points) #see if a projection is defined
## [1] NA
#FALSE if it has a geographic coordinate system but lacks a projection, 
#NA if it lacks both. 

So we need to define a CRS so that our spatial object becomes a geospatial object. The CRS can be defined by simply passing the reference code for the projection. We'll use the most common CRS, EPSG:4326, although there are many other possible reference systems: CRS("+init=EPSG:4326"). The full list of parameters we would normally need to define is much longer, but fortunately +init=EPSG:4326 is enough for R: it fills in all the missing parameters.

CRS("+init=EPSG:4326") #is called with only an EPSG code, R will try to complete all the CRS parameters
## CRS arguments:
##  +init=EPSG:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84
## +towgs84=0,0,0
crs.geo <- CRS("+init=EPSG:4326")
proj4string(first.spatial.points) <- crs.geo  #assign a projection to our data
is.projected(first.spatial.points) #check it is defined now
## [1] FALSE

This SpatialPoints object doesn’t contain any variables at the moment. We can move from a SpatialPoints to a SpatialPointsDataFrame by adding a data.frame of variables to the points. To practice, we’ll add some randomly generated data. Note the SpatialPoints will merge with the data.frame based on the order of observations, and therefore the number of our spatial points should be equal to the number of rows in the new data.

vars <- data.frame(vars1 = rnorm(9), vars2 = c(101:109))
vars
first.spdf <- SpatialPointsDataFrame(first.spatial.points, vars)
summary(first.spdf)
## Object of class SpatialPointsDataFrame
## Coordinates:
##            min  max
## coords.x1 1.50 2.50
## coords.x2 0.75 1.75
## Is projected: FALSE 
## proj4string :
## [+init=EPSG:4326 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84
## +towgs84=0,0,0]
## Number of points: 9
## Data attributes:
##      vars1             vars2    
##  Min.   :-2.1758   Min.   :101  
##  1st Qu.:-1.1690   1st Qu.:103  
##  Median :-0.1975   Median :105  
##  Mean   :-0.3422   Mean   :105  
##  3rd Qu.: 0.4849   3rd Qu.:107  
##  Max.   : 0.7958   Max.   :109

Say you don't have a SpatialPoints object but only a data.frame with two columns indicating longitude and latitude values. A SpatialPointsDataFrame object can also be created directly from this data frame in one step: you specify which columns contain the coordinates by using the coordinates() command. We'll use this method soon to create the boston.graf SpatialPointsDataFrame.
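As a quick sketch with some made-up coordinates (the points here are hypothetical, just to show the mechanics):

fake.df <- data.frame(lon = c(-71.06, -71.09), lat = c(42.36, 42.35), id = 1:2)
coordinates(fake.df) <- c("lon", "lat") #promotes the data.frame to a SpatialPointsDataFrame
proj4string(fake.df) <- CRS("+init=EPSG:4326")
class(fake.df)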

SpatialPolygons

Just as a SpatialPoints object consists of points, a SpatialPolygons object consists of polygons. Building one takes three steps. We first create Polygon objects, two-dimensional geometric shapes, from points. Then we combine those into Polygons objects (here, we need to give each polygon a unique ID). Finally, we combine the Polygons into a SpatialPolygons object.

#create polygon objects from points.
triangle1 <- Polygon(rbind(c(1, 1), c(2, 1), c(1.5, 0))) 
triangle2 <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))

#create polygons objects from polygon
t1 <- Polygons(list(triangle1), "triangle1")
t2 <- Polygons(list(triangle2), "triangle2")

#create spatial polygons object from polygons, equal to shapefiles.
first.spatial.polygons <- SpatialPolygons(list(t1, t2))
plot(first.spatial.polygons)

As with SpatialPoints, we can add variables to SpatialPolygons. Note that the rownames of the data.frame that includes our variables should match the unique IDs of the polygon objects. Ideally the number of our spatial polygons should be equal to the number of rows in the new data so that we don’t have any missing values in our SpatialPolygonsDataFrame.

vars <- data.frame(attr1 = 1:2, attr2 = 6:5, 
                   row.names = c("triangle2", "triangle1"))
first.spoldf <- SpatialPolygonsDataFrame(first.spatial.polygons, vars)
as.data.frame(first.spoldf) #the rows were re-ordered to match the ID
spplot(first.spoldf)

Remember that, in order for the SpatialPoints object to understand how the coordinates of its points relate to places in the real world, we assigned it a Coordinate Reference System (CRS). We'll do the same for this SpatialPolygons object, using exactly the same method.

crs.geo <- CRS("+init=EPSG:4326") #geographical, datum WGS84
proj4string(first.spoldf) <- crs.geo

Importing and Exporting Spatial Data

Of course, creating SpatialPoints or SpatialPolygons by hand is not common. Instead, we use existing spatial data by reading it from external sources, the most common of which is the shapefile.

In order to read and write shapefiles, which will usually come in the form of an ESRI shapefile, we use the readOGR() and writeOGR() commands. readOGR() needs at least the following two arguments: dsn =, the path to the folder that contains the files, and layer =, the name of the shapefile (without the .shp extension).

boston <- readOGR(dsn = "BostonData", 
                  layer = "city_council_districts") 
## OGR data source with driver: ESRI Shapefile 
## Source: "BostonData", layer: "city_council_districts"
## with 9 features
## It has 5 fields
## Integer64 fields read as strings:  OBJECTID
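writeOGR() works the same way in reverse; a minimal sketch (the output folder name "BostonData_out" is made up for illustration):

#write the object back out as a new shapefile; "BostonData_out" is a hypothetical folder
writeOGR(boston, dsn = "BostonData_out",
         layer = "city_council_districts_copy",
         driver = "ESRI Shapefile")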

To understand the shapefile, it's good to check what it includes using the str() command. It shows that boston is actually a combination of several slots: the attribute table (data), the polygons themselves, a bounding box (bbox), and a proj4string. We can access each slot either with the corresponding command or with the @ operator.

#str(boston)
bbox(boston) #or boston@bbox
##         min       max
## x  739594.1  795021.6
## y 2908187.3 2970072.8
coordinates(boston)[1:5, ] #label points (roughly the centroids) of the first five polygons
##       [,1]    [,2]
## 0 787005.1 2961184
## 1 777534.0 2948923
## 2 776281.9 2933061
## 3 769011.6 2930656
## 4 757770.8 2922005
as.data.frame(boston)[1:5, ] #or boston@data
proj4string(boston) #or boston@proj4string
## [1] "+proj=lcc +lat_1=41.71666666666667 +lat_2=42.68333333333333 +lat_0=41 +lon_0=-71.5 +x_0=199999.9999999999 +y_0=750000 +datum=NAD83 +units=us-ft +no_defs +ellps=GRS80 +towgs84=0,0,0"

Combining Spatial Data with Other Data Sources

Merging

Spatial + Non-Spatial

Now we will learn how to merge a Spatial object (SpatialPolygonsDataFrame, or SpatialPointsDataFrame) with other tabular data, using a common unique identifier. First, load the elections tabular data.

boston.elections <- read.csv("BostonData/dist_results.csv")

Now we have i) a SpatialPolygons object named boston, and ii) a data frame called boston.elections, where DISTRICT is the column that contains the unique identifier in boston, and District is the column that contains the unique identifier in boston.elections.

boston <- merge(boston, boston.elections, 
                by.x = "DISTRICT", by.y = "District")
boston@data[1:5, ] ##check if merged

Never merge the @data slot of your Spatial object directly with other data, as below, since merge() can re-order or drop rows and leave the attributes misaligned with the geometries:

boston@data <- merge(boston@data, boston.elections, 
                     by.x = "DISTRICT", by.y = "District")

Spatial + Spatial

Before merging two Spatial datasets, the first thing you need to do is check whether they have the same CRS, because both of your files should agree on which exact point on Earth a given coordinate value refers to. If you try to merge two spatial objects with different CRSs, your results won't make much sense.
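A quick way to check, sketched here with the hypothetical objects MyCityA and MyCityB that we will also use below:

proj4string(MyCityA) == proj4string(MyCityB) #TRUE only if the two CRS strings match exactly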

If they don’t have the same CRS, you’ll need to reproject one of your objects. (Reprojecting changes not just the proj4string associated with an object, but also all the actual x and y coordinates.)

You can reproject any vector data very simply using the spTransform() command, which needs two inputs: i) the object you want to reproject, and ii) the CRS you want for your reprojection. I'll show it here with some fake data for now, since we'll try it with real data later when we create maps.

common.crs <- CRS("+init=EPSG:4326")
MyCity.reprojected <- spTransform(MyCity, common.crs)

Since our ultimate goal is to make sure both of your spatial objects have the same CRS, you can also retrieve the CRS from one spatial object with the proj4string() command, store it, and reproject your second spatial object with this CRS you stored. So if you have two objects named MyCityA and MyCityB, the code will look as follows:

common.crs <- CRS(proj4string(MyCityB))
MyCityA.reprojected <- spTransform(MyCityA, common.crs)

Spatial Merges

So far we have merged on a common unique identifier, as with regular datasets. We can also merge on geolocation. Now we're going to work with Boston municipal service requests for graffiti removal.

We already know the geolocation of each graffiti report, but we don't know in which district of Boston it is located. So we'll overlay the graffiti points on our Boston map, a SpatialPolygons object, using the over() command, which matches each graffiti report with a district. After the match, we'll combine the two datasets using the spCbind() command.

Let’s start by loading our graffiti data.frame and turning it into a SpatialPointsDataFrame so that R understands that it is a geospatial file where each graffiti shows an exact location (longitude and latitude):

boston.requests <- read.csv("BostonData/graffiti.csv")
boston.graf <- SpatialPointsDataFrame(coords = cbind(boston.requests$LONGITUDE,
                                                     boston.requests$LATITUDE),
                                      data = boston.requests,
                                      proj4string = CRS("+proj=longlat"))

Next we need to make sure the Boston map has the same projection as our spatial point data frame:

boston <- spTransform(boston, CRS("+proj=longlat"))

Now, for every graffiti report, we want to get information about its district, using the over() command. For each observation in Data 1, over() basically asks R to return the matching observation in Data 2, that is, the one whose geometry it intersects.

boston.districts <- over(boston.graf, boston) 
boston.districts[1:5, ] ##check if overlaid

Note that over() only returns the relevant row of Data 2 for each point. If boston did not have any data, we would just get the index of the intersecting polygon for each graffiti report. If we only want the polygon index, we can use the geometry() command to strip the attribute data from boston:

boston.districts2 <- over(boston.graf, geometry(boston)) 
boston.districts2[1:5] ##check if overlaid
## 1 2 3 4 5 
## 3 3 6 9 1

Finally we use the spCbind() command to merge the two spatial objects.

boston2 <- spCbind(boston.graf, boston.districts)
names(boston2)
##  [1] "CASE_ENQUIRY_ID"                "OPEN_DT"                       
##  [3] "CLOSED_DT"                      "TIME_DIFF"                     
##  [5] "CASE_STATUS"                    "CLOSURE_REASON"                
##  [7] "CASE_TITLE"                     "SUBJECT"                       
##  [9] "REASON"                         "TYPE"                          
## [11] "QUEUE"                          "Department"                    
## [13] "Location"                       "neighborhood"                  
## [15] "neighborhood_services_district" "LOCATION_STREET_NAME"          
## [17] "LOCATION_ZIPCODE"               "LATITUDE"                      
## [19] "LONGITUDE"                      "Source"                        
## [21] "completion.time"                "DISTRICT"                      
## [23] "OBJECTID"                       "Councillor"                    
## [25] "SHAPE_area"                     "SHAPE_len"                     
## [27] "Winning_Perc"                   "Runner_Perc"                   
## [29] "MOV"

Making Maps

Making Maps with plot()

Now that we merged our data with other data sources, non-spatial or spatial, we are ready to create maps. For most plots we can simply use the plot() command. Let’s first create a very simple plot where we overlay the Boston districts and graffiti locations:

plot(boston)
plot(boston.graf, col="blue", cex=0.5, add=T)

Another useful plotting method is coloring the polygons based on the values of one of the attributes. For instance, since we have already merged the Boston districts with the election results, we can color the districts based on the margin of victory in the city council elections. These maps, in which areas are shaded in proportion to the value of the statistical variable being shown, are called choropleth maps.

One important step in creating the plot is getting the color codes we’ll use. For this you’ll i) choose n - the number of cuts you want to use, ii) obtain the quantile values that indicate the cutoff points, iii) pick a palette you like from amongst the options available in the brewer.pal() function. Now you are ready to obtain your color codes using the findColours() function. Note that different palettes have different limits on the maximum and minimum number of cuts allowed.

#determine the color codes
var <- as.numeric(boston$MOV)
nclr <- 5
plotclr <- brewer.pal(nclr,"Blues")
class <- classIntervals(var, nclr, style="quantile")
colcode <- findColours(class, plotclr)

Now you can just pass these color codes to plot():

#plot:
plot(boston, col = colcode, border = NA) #removing the borders looks prettier
legend(x = "bottomright",
       fill = attr(colcode,"palette"), #set the legend colors
       bty = "n", #remove the box around the legend
       legend = names(attr(colcode, "table")), #show the quantile values
       title = "MOV Quantile",
       cex = 0.6) #adjust legend size 

If you don't like this color scheme, you can choose another palette from the RColorBrewer package. You can see a list of all the color palettes that come with RColorBrewer with the following command:

display.brewer.all()

Now let's make the same plot as above with a different color palette.

#obtain the new color codes: 
plotclr <- brewer.pal(nclr,"BuPu")
colcode <- findColours(class, plotclr)

#plot:
plot(boston, col = colcode, border = NA) #removing the borders looks prettier
legend(x = "bottomright",
       fill = attr(colcode,"palette"), #set the legend colors
       bty = "n", #remove the box around the legend
       legend = names(attr(colcode, "table")), #show the quantile values
       title = "MOV Quantile",
       cex = 0.6)

As in the normal plots, we can also add labels to our map using the text() command. But unlike in a normal plot, here we need to indicate the coordinates where we want R to put our labels.

#store the central point of each polygon, where our labels will appear
centers <- coordinates(boston)

#plot:
plot(boston, col = colcode, border = NA) #removing the borders looks prettier
legend(x = "bottomright",
       fill = attr(colcode,"palette"), #set the legend colors
       bty = "n", #remove the box around the legend
       legend = names(attr(colcode, "table")), #show the quantile values
       title = "MOV Quantile",
       cex = 0.6)
text(centers, labels = boston$DISTRICT)

Making Maps with spplot()

spplot() is an extension of the plot() function specifically for making maps of spatial objects. It's more convenient for filling in polygon colors based on attributes of the Spatial*DataFrame object: you just pass the object and the names of the columns you want to plot to the function. If you don't pass specific column names, a separate figure will be created for each column.

spplot(boston,
       "MOV",
       main = "Boston Districts", 
       sub = "Margin of Victory", 
       col = "transparent") #remove the borders for a prettier asthetic. 

Making Maps with ggmap

The ggmap package is very convenient for making maps, since we can combine ggmap() with all the ggplot2 functionality. The first command we will use is get_map(), a smart wrapper that queries the map server of your choice (Google Maps, OpenStreetMap, or Stamen Maps) and returns a map at a specified location. We will use it to create a basemap, which we will see once we pass the data returned by get_map() to ggmap().

There's no need to pass a specific latitude and longitude to get_map(); it also accepts a text search as its location.
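For instance, something like the following should pull roughly the same area from a plain-text query (a small sketch; it needs an internet connection, since the text is geocoded behind the scenes):

g_text <- get_map(location = "Boston, MA", zoom = 12, source = "google", maptype = "roadmap")
ggmap(g_text)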

There are several basemap types we can pull from Google or Stamen Maps. Since this will be our base map, you may prefer less noisy and less colorful options such as "toner-lite". Note that you need an internet connection, as we are pulling the maps from online sources. Also note that the Stamen Maps server is temperamental and sometimes doesn't respond.

Let's first see what the different basemap types look like:

boston.coords <- c(lon = -71.0884123, lat = 42.3192791)
g_toner <- get_map(location = boston.coords,
                   zoom = 12, #how much we want to zoom 
                   source = "stamen", #source site
                   maptype = "toner-lite") #map type
g_sat <- get_map(location = boston.coords, zoom = 12, source = "google", maptype = "satellite")
g_road <- get_map(location = boston.coords, zoom = 12, source = "google", maptype = "roadmap")
g_water <- get_map(location = boston.coords, zoom = 12,  source = "stamen",maptype = "watercolor")

#pass the data to `ggmap`
base_t <- ggmap(g_toner, #get_map object
                extent = "device", #removes the axes
                legend = "none",  #removes the legend
                maprange = T) #ensures that limits are determined based on the ggmap object, not the baselayer
base_s <- ggmap(g_sat, extent = "device", legend = "none", maprange = T) 
base_r <- ggmap(g_road, extent = "device", legend = "none", maprange = T)
base_w <- ggmap(g_water, extent = "device", legend = "none", maprange = T)

#use grid.arrange to show the plots in 2x2  
grid.arrange(base_s, base_r, base_t, base_w, nrow = 2, ncol = 2)

We’ll continue with the toner-lite basemap, base_t, and use it as a background for our plots. First we will add the Boston district borders to our basemap. Since boston is a SpatialPolygonsDataFrame, we will rely on the geom_polygon() function to add it to the basemap.

base_t + 
  geom_polygon(aes(x = long, y = lat, group = group), 
               data = boston, 
               color = "blue",
               alpha = 0.6, #alpha sets the transparency of the polygon fill color
               lwd = 1) #lwd sets the width of the polygon borders

To add the graffiti removal requests themselves to our basemap, on the other hand, we will rely on the geom_point() function because, as you may remember, boston.requests is a data.frame.

base_t + 
  geom_point(aes(x = LONGITUDE, y = LATITUDE, group = CASE_STATUS), 
             data = boston.requests, 
             color = "blue",
             cex = 0.8) #cex sets the size of the data points

Next, we will color the districts based on margin of victory, the MOV attribute that we've already added to the boston SpatialPolygonsDataFrame. Unfortunately, geom_polygon() doesn't know how to deal with the extra data columns unless its input is a data.frame. We'll need to fortify our SpatialPolygonsDataFrame to convert it to a data.frame so that the extra data columns work. The following steps successively i) store the rownames as an extra column, ii) break the polygons into individual vertices with the fortify() function, and iii) merge all the extra data columns back in for each vertex of each polygon.

boston@data$id <- rownames(boston@data) #create a unique ID for the later join
boston2 <- fortify(boston, region = "id") #this only has the coordinates
boston.gg <- join(boston2, boston@data, by = "id", type = "left") #plyr::join preserves the vertex order

Now we are ready to create our choropleth map. We will use the geom_polygon() command as in the previous, simpler map, but this time with the data.frame that we named boston.gg. Note that the only difference from regular ggplot2 code is that the ggmap object base_t takes the place of the ggplot() call.

base_t + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = MOV),
               data = boston.gg,
               alpha = 0.6,
               color = "white") + 
  labs(fill = "Margin of Victory")

Next, we can overlay the individual graffiti removal requests as simple data points on top of this choropleth map, using the geom_point() command.

base_t + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = MOV),
               data = boston.gg,
               alpha = 0.6,
               color = "white") + 
  geom_point(aes(x = LONGITUDE, y = LATITUDE), data = boston.requests, color = "purple", size = 0.5)+ 
  labs(fill = "Margin of Victory")

Finally, instead of individual points, we can show the density of graffiti removal requests using the usual ggplot() functionality: we add it on top of the choropleth map with the stat_density2d() command and the graffiti data.

base_t + 
  geom_polygon(aes(x = long, y = lat, group = group, fill = MOV),
               data = boston.gg,
               alpha = 0.6,
               color = "white") + 
  stat_density2d(aes(x = LONGITUDE, y = LATITUDE, alpha = ..level..), 
                 data = boston.requests, geom = "polygon") +    
  labs(fill = "Margin of Victory", alpha = "Density of Service Requests")

Location, Area, Distance Calculation Tools

Geocoding with Google Maps

This part will introduce you to tools for geocoding, converting addresses or names of locations into latitudes and longitudes, using the Google Maps API. The API accepts any query you could type into Google Maps and returns Google's best guess for the latitude and longitude associated with your query. The tool for doing this from within R is the geocode() function.

Use the option output = "more" in geocode() so that you also get information on how certain Google is about its guess.

pub <- "Muddy Charles, Cambridge, MA"
location <- geocode(pub, source = "google", output = "more")
location
##   lon lat 
## 1  NA NA

One important but not obvious piece of information in the output is the loctype column, which indicates how precise the location information is. In order of precision, the four possible values for this field are i) rooftop (we know the precise street address), ii) range_interpolated (we approximately know the street address), iii) geometric_center (the center of a polyline, such as a street, or a polygon, such as a district), and iv) approximate (an approximation for units such as localities and administrative areas).
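If the geocode() call above succeeded (it needs working access to the Google API), you can inspect this field directly:

location$loctype #e.g. "rooftop" for a precise street address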

Note that we can also go the other way around: give R the longitude-latitude values you are interested in and ask for the address with the revgeocode() command.

location_n <- as.numeric(location) #since revgeocode() asks for numeric long-lat values
revgeocode(location_n[1:2], output = "more") # returns address and other info (county, country, etc)

Distance

Now we will try to find the commuting distance between the office and the Muddy Charles pub using the mapdist() command.

office <- "30 Wadsworth St, Cambridge, MA"
mapdist(pub, office, mode = "bicycling", output = "simple") #with 'simple', the output looks clean
## Source : https://maps.googleapis.com/maps/api/distancematrix/json?origins=Muddy%20Charles%2C%20Cambridge%2C%20MA&destinations=30%20Wadsworth%20St%2C%20Cambridge%2C%20MA&mode=bicycling&language=en-EN

How about the Euclidean distance? Consider the following arbitrary data points, also shown on the map.

#create a df where each row is an arbitrary data point of long-lat.
fake.coords <- rbind(c(-144,70.3),c(-144.4,70),c(-144.4,70.3),c(-144.25,70)) 
#find the basemap for the arbitrary locations
ggmap(get_map(location = c(-144.2, 70.15), zoom=8)) + 
geom_point(data = data.frame(fake.coords), 
           aes(x = X1,y = X2), 
           col ="blue",
           pch = 16,
           cex = 3) 

The spDists() function returns the matrix of distances between each pair of points in the map above, with zeros on the diagonal.

spDists(fake.coords, longlat = T) #if FALSE, Euclidean, if TRUE ellipsoid distance
##          [,1]      [,2]     [,3]      [,4]
## [1,]  0.00000 36.744156 15.05469 34.785132
## [2,] 36.74416  0.000000 33.46909  5.727933
## [3,] 15.05469 33.469086  0.00000 33.948741
## [4,] 34.78513  5.727933 33.94874  0.000000

What about the distance between a point and polygons? We can find it using the dist2Line() command, which returns the nearest coordinate on the line/polygon and the ID of the line/polygon that is closest.

We can have multiple points and polygons, as long as they are all in longitude-latitude coordinates (not projected). The spheroid-earth assumption (distfun = distHaversine) is less computationally intensive, but we can also use distfun = distVincentyEllipsoid to get a more exact measure of distance. The default unit is meters.

home_to_boston <- dist2Line(location[1:2], boston, distfun = distHaversine) #location[1:2] extracts long/lat 
home_to_boston #gives the distance, long/lat coordinates, and the id of the closest polygon
##      distance       lon      lat ID
## [1,] 646.8733 -71.08464 42.35406  7

We can use this output in a map to show the coordinate closest to our location of interest, and the district it belongs to:

  base_t + 
  #highlight the closest district 
  geom_polygon(aes(x = long, y = lat, group = group),
               data = boston[boston$OBJECTID  == 7,],
               alpha = 0.6, color = "purple", fill = "purple") +
  #the coordinates of our point of interest
  geom_point(aes(x = lon, y = lat), 
             data = location, 
             color = "blue") + 
  #the closest coordinate
  geom_point(aes(x = X1, y = X2),
             data = data.frame(cbind(home_to_boston[,2], home_to_boston[,3])),
             color = "blue") +
  #line showing the distance between the two locations
  geom_line(data = data.frame(lon = c(location$lon, home_to_boston[1, "lon"]),
                              lat = c(location$lat, home_to_boston[1, "lat"])),
            aes(x = lon, y = lat), 
            col = "red")

Spatial Data Analysis

In this section we'll focus on how to detect and deal with spatial dependence in regression models.

Assessing Spatial Clustering

To learn how to detect spatial autocorrelation, we will use one informal and one formal method. For this part we will use a new SpatialPolygonsDataFrame, named boston_n, which includes not only coordinates and unique IDs for the polygon units but also extra columns with some demographic variables from 2000. We start by loading the data and taking a look at its contents as well as a simple plot.

boston_n <- readOGR(dsn = "BostonData", 
                        layer = "Boston_N")
## OGR data source with driver: ESRI Shapefile 
## Source: "BostonData", layer: "Boston_N"
## with 26 features
## It has 8 fields
head(boston_n@data)
plot(boston_n)

The question we are interested in is whether the share of the Black and Hispanic population in a neighborhood, a variable named blck_l_, is associated with educational attainment, as measured by the percentage of people with at least a bachelor's degree, mnbchlr. First we will check what our dataset looks like and change the column names to clearer ones (shapefiles commonly truncate column names). We can manipulate the data in SpatialPolygonsDataFrame objects exactly as we do in typical data.frame objects.

names(boston_n)
## [1] "Name"    "OBJECTI" "Acres"   "SqMiles" "ShpSTAr" "ShpSTLn" "mnbchlr"
## [8] "blck_l_"
names(boston_n)[7:8] <- c("minimumbachelor", "black_lat")

An informal approach to see whether spatial autocorrelation is a concern is to map the residuals. The idea is that if neighboring residuals are similar in size, our model has some spatial autocorrelation. To find the residuals, we will run a regression model where we regress educational attainment on the share of the Hispanic and Black population and store the residuals. Then we will create a choropleth map, in which neighborhoods with similar-sized residuals appear in similar shades.

#save residuals 
my.model <- lm(minimumbachelor ~ black_lat, data = boston_n)
boston_n$resid_ols <- my.model$residuals

#get the color codes:
var <- as.numeric(boston_n$resid_ols)
nclr <- 5
plotclr <- brewer.pal(nclr,"Blues")
class <- classIntervals(var, nclr, style = "quantile")
colcode <- findColours(class, plotclr)

plot(boston_n, col = colcode, border = "grey")
legend(x = "bottomleft",
       fill = attr(colcode,"palette"), 
       bty = "n", 
       legend = names(attr(colcode, "table")), 
       title = "OLS Residuals Quantile",
       cex = 0.6)

If there were no spatial autocorrelation, the colors in the map would look arbitrarily distributed. Instead, the map shows similar residual values for neighborhoods spatially close to each other, suggesting that there might be some spatial factors we haven't specified in our model.

We can also run more formal tests of spatial autocorrelation, but for that we first need the spatial weights matrix. To calculate it, we will first identify the neighborhoods that share boundaries using the poly2nb() command, and then convert the result into a weights list with the nb2listw() command. This weights object will also enable us to model the spatial autocorrelation among neighborhoods in our spatial regressions.

#make continuity NB (neighbors share edges)
continuity.nb <- poly2nb(boston_n, 
                         queen = TRUE) #if FALSE (rook), units must share a border segment, not just a corner point

summary(continuity.nb) #shows how many links each polygon has; the average is about 3.5
## Neighbour list object:
## Number of regions: 26 
## Number of nonzero links: 92 
## Percentage nonzero weights: 13.60947 
## Average number of links: 3.538462 
## 2 regions with no links:
## 11 25
## Link number distribution:
## 
## 0 1 2 3 4 5 6 8 9 
## 2 3 4 4 5 4 2 1 1 
## 3 least connected regions:
## 12 17 22 with 1 link
## 1 most connected region:
## 15 with 9 links
#plot neighbors
plot(continuity.nb, coordinates(boston_n))

#convert to a weight-matrix 
continuity.listw <- nb2listw(continuity.nb,  
                             style = "W", #for standardized rows
                             zero.policy = TRUE) #if TRUE, assigns zero to the lags of zones without neighbours

summary(continuity.listw, zero.policy = TRUE) #shows more info with row-standardized weights
## Characteristics of weights list object:
## Neighbour list object:
## Number of regions: 26 
## Number of nonzero links: 92 
## Percentage nonzero weights: 13.60947 
## Average number of links: 3.538462 
## 2 regions with no links:
## 11 25
## Link number distribution:
## 
## 0 1 2 3 4 5 6 8 9 
## 2 3 4 4 5 4 2 1 1 
## 3 least connected regions:
## 12 17 22 with 1 link
## 1 most connected region:
## 15 with 9 links
## 
## Weights style: W 
## Weights constants summary:
##    n  nn S0       S1       S2
## W 24 576 24 15.42398 104.2362
head(continuity.listw$weights) #shows the actual weights for each polygon
## [[1]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[2]]
## [1] 0.2 0.2 0.2 0.2 0.2
## 
## [[3]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[4]]
## [1] 0.5 0.5
## 
## [[5]]
## [1] 0.25 0.25 0.25 0.25
## 
## [[6]]
## [1] 0.5 0.5

One formal test of spatial dependence is Moran's I, which is easily run by passing the weights created above, as well as the variable we are interested in, to the moran.test() function. We can also visualize the spatial dependence with the moran.plot() command.

###### Global Moran's I Test
moran.test(boston_n$minimumbachelor, 
           listw = continuity.listw,
           zero.policy = TRUE, #if TRUE, assigns zero to the lagged value of zones without neighbours
           adjust.n = TRUE)#if TRUE, the number of observations is adjusted for no-neighbour observations
## 
##  Moran I test under randomisation
## 
## data:  boston_n$minimumbachelor  
## weights: continuity.listw  
## 
## Moran I statistic standard deviate = 2.5398, p-value = 0.005545
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##        0.34111187       -0.04347826        0.02292917
moran.plot(boston_n$minimumbachelor, 
           continuity.listw,
           zero.policy = TRUE)

The statistic ranges from -1 to 1; the value of 0.34 here (p ≈ 0.006) indicates significant positive spatial autocorrelation.

We can also calculate local Moran's I statistics for each polygon rather than just a global value. For local Moran's I statistics, we use the localmoran() command.

locmoran.boston_n <- localmoran(boston_n$minimumbachelor, 
                          continuity.listw, 
                          zero.policy = TRUE)
head(locmoran.boston_n)
##              Ii  E.Ii    Var.Ii        Z.Ii Pr(z > 0)
## 0  0.2408182217 -0.04 0.2073821  0.61665129 0.2687324
## 1 -0.1154786299 -0.04 0.1585651 -0.18954846 0.5751685
## 2  0.0054051954 -0.04 0.2073821  0.09970568 0.4602890
## 3 -0.2063822346 -0.04 0.4514674 -0.24762459 0.5977876
## 4  0.0144825088 -0.04 0.2073821  0.11963864 0.4523847
## 5  0.0004766308 -0.04 0.4514674  0.06024086 0.4759819
boston_n <- spCbind(boston_n, as.data.frame(locmoran.boston_n))

If we want a more formal way of mapping spatial dependence, instead of mapping the residuals as above, we can also map the Z scores in the local Moran’s I test output.

#get the color codes:
var <- as.numeric(boston_n$Z.Ii)
nclr <- 5
plotclr <- brewer.pal(nclr,"Blues")
class <- classIntervals(var, nclr, style="quantile")
colcode <- findColours(class, plotclr)

plot(boston_n, col = colcode, border = "grey")
legend(x = "bottomleft",
       fill = attr(colcode,"palette"), 
       bty = "n", 
       legend = names(attr(colcode, "table")), 
       title = "Z Scores Quantile",
       cex = 0.6)

Now we can explicitly see that there is spatial dependence among neighborhoods.

Spatial Regression

Now that we have detected spatial dependence, we can try to take it into account in our regression models. Spatial dependence in a regression setting can be modeled similarly to autocorrelation in time series. The presence of spatial autocorrelation implies a nonzero correlation with the error term, similar to the presence of an endogenous variable, so OLS estimates from the non-spatial model will be biased and inconsistent (Anselin and Bera 1998).

The spatial autoregressive (SAR) model can be estimated with the function lagsarlm(). Here we assume normality of the error term and use maximum likelihood. The model gives us the value of the rho parameter and its p-value, and allows us to compare it with standard OLS.

#SAR Model
summary(my.model.sar <- lagsarlm(minimumbachelor ~ black_lat, 
                                 data = boston_n, 
                                 continuity.listw, 
                                 zero.policy = TRUE)) 
## 
## Call:lagsarlm(formula = minimumbachelor ~ black_lat, data = boston_n, 
##     listw = continuity.listw, zero.policy = TRUE)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.8753  -5.8921  -1.9245   5.8259  27.5723 
## 
## Type: lag 
## Regions with no neighbours included:
##  11 25 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 33.59490    7.54598  4.4520 8.506e-06
## black_lat   -0.41208    0.10759 -3.8301 0.0001281
## 
## Rho: 0.47336, LR test value: 8.8257, p-value: 0.0029702
## Asymptotic standard error: 0.13514
##     z-value: 3.5027, p-value: 0.00046063
## Wald statistic: 12.269, p-value: 0.00046063
## 
## Log likelihood: -100.1283 for lag model
## ML residual variance (sigma squared): 121.54, (sigma: 11.025)
## Number of observations: 26 
## Number of parameters estimated: 4 
## AIC: 208.26, (AIC for lm: 215.08)
## LM test for residual autocorrelation
## test value: 0.0048895, p-value: 0.94425

We can map the new residuals to see if the new model helped with autocorrelation.

boston_n <- spCbind(boston_n, my.model.sar$residuals)
head(boston_n@data)
#get the color codes: 
var <- as.numeric(boston_n$my.model.sar.residuals)
nclr <- 5
plotclr <- brewer.pal(nclr,"Blues")
class <- classIntervals(var, nclr, style="quantile")
colcode <- findColours(class, plotclr)

plot(boston_n, col = colcode, border = "grey")
legend(x = "bottomleft",
       fill = attr(colcode,"palette"), 
       bty = "n", 
       legend = names(attr(colcode, "table")), 
       title = "OLS Residuals Quantile",
       cex = 0.6)

It looks a bit better.
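For a more formal check than eyeballing the map, we can also run Moran's I on the new residuals with the tools from above (a quick sketch; the lagsarlm summary already reports an LM test for residual autocorrelation):

moran.test(my.model.sar$residuals, 
           listw = continuity.listw, 
           zero.policy = TRUE) #a small statistic and large p-value suggest the lag term absorbed the spatial dependence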

Alternatively, we can model spatial autocorrelation by specifying an autoregressive process in the error term, relying on the errorsarlm() function:

#error auto-Regresive
summary(my.model.ear <- errorsarlm(minimumbachelor ~ black_lat, 
                                   data = boston_n, 
                                   continuity.listw, 
                                   zero.policy = TRUE))
## 
## Call:errorsarlm(formula = minimumbachelor ~ black_lat, data = boston_n, 
##     listw = continuity.listw, zero.policy = TRUE)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -20.50404  -6.95151   0.27443  10.75028  27.16799 
## 
## Type: error 
## Regions with no neighbours included:
##  11 25 
## Coefficients: (asymptotic standard errors) 
##             Estimate Std. Error z value  Pr(>|z|)
## (Intercept) 51.75193    6.03055  8.5816 < 2.2e-16
## black_lat   -0.58032    0.12462 -4.6567 3.214e-06
## 
## Lambda: 0.56342, LR test value: 3.5274, p-value: 0.060362
## Asymptotic standard error: 0.17381
##     z-value: 3.2416, p-value: 0.0011886
## Wald statistic: 10.508, p-value: 0.0011886
## 
## Log likelihood: -102.7775 for error model
## ML residual variance (sigma squared): 144.44, (sigma: 12.018)
## Number of observations: 26 
## Number of parameters estimated: 4 
## AIC: 213.55, (AIC for lm: 215.08)