Introduction

New York is currently experiencing the largest measles outbreak since 1994. As of 22nd April 2019, there have been 626 cases of measles reported in New York City and Rockland County. Measles is a significant public health concern for several reasons:

Outline

In this document, we will visualize the school districts of New York State using a choropleth map, symbolizing the polygons by the percentage of measles vaccination exemptions in the school district. We will see if there are clusters or pockets of unvaccinated groups, and briefly consider what this means for public health.

The map we will be creating at the end.

The map we will be creating at the end.

The purpose of this document:

I am a Master of Public Health student, passionate about using data to spur evidence-based interventions to improve health and well-being for all populations. I’m constantly learning new techniques of analyzing data, and this project is a way for me to document these learnings. Additionally, this document is meticulously annotated because I would like for it to serve as an instructional guide to other public health practitioners who are getting started with using R for spatial analysis. In this document, I have laid out all of the data sources, code and reference materials in a manner that makes following along very easy. The code is provided in a step-by-step manner, with errors and comments included to encourage reproduction. Feel free to skip over code chunks and concepts that you might already be familiar with.

Concepts:

Some of the important concepts that we will touch upon are:

  • Using regular expressions to clean lat-long data.
  • Using spatial operations to join attributes from tables that do not have a common key. In this document, we will join measles exemptions data from schools (points) to the outlines of school districts (polygons).
  • Using projections and spatial points/polygons dataframes.
  • Creating a choropleth map and assessing spatial autocorrelation.

Load dependencies

These are some of the libraries that we will be using. If you do not have these installed, you will have to run the install.packages() function first before calling the libraries.

library(sf)
library(tidyverse)
library(rgdal)
library(tmap)
library(sp)
library(spdep)
library(spatialEco)
library(knitr)

Importing the Schools with Immunization Data

Let’s read in the data from the New York State Department of Health immunization survey.

imm <- read.csv("Immunization_survey.csv")

Let’s take a look at the data.

(head(imm))
##   ï..School.ID District.Name Report.Period           Type
## 1 270701659807                   2013-2014 Private School
## 2 321000145201                   2013-2014 Private School
## 3  20801659058                   2012-2013 Private School
## 4  20801659059                   2012-2013 Private School
## 5  20801659060                   2012-2013 Private School
## 6 141101609343                   2013-2014 Private School
##                      School.Name Percent.Medical.Exemptions
## 1                TURNPIKE SCHOOL                          0
## 2                    AQUINAS H S                          0
## 3            SUNNY MEADOW SCHOOL                          0
## 4                  RUSTIC HOLLOW                          0
## 5               CARSON RD SCHOOL                          0
## 6 BLOSSOM GARDENS FRIENDS SCHOOL                          0
##   Percent.Religious.Exemptions Percent.Immunized.Polio
## 1                           92                       4
## 2                            0                     100
## 3                          100                       0
## 4                          100                       0
## 5                          100                       0
## 6                            0                     100
##   Percent.Immunized.Measles Percent.Immunized.Mumps
## 1                         0                       4
## 2                       100                     100
## 3                         0                       0
## 4                         0                       0
## 5                         0                       0
## 6                       100                     100
##   Percent.Immunized.Rubella Percent.Immunized.Diphtheria
## 1                         4                            4
## 2                       100                          100
## 3                         0                            0
## 4                         0                            0
## 5                         0                            0
## 6                       100                          100
##   Percent.Immunized.HepatitisB Percent.Immunized.Varicella
## 1                            4                           4
## 2                          100                         100
## 3                            0                           0
## 4                            0                           0
## 5                            0                           0
## 6                          100                         100
##   Percent.Completely.Immunized               Street    City     County
## 1                            0          20 PARK ST.   FONDA MONTGOMERY
## 2                          100        685 E 182D ST   BRONX      BRONX
## 3                            0           7 COURT ST BELMONT   ALLEGANY
## 4                            0          7 COURT ST. BELMONT   ALLEGANY
## 5                            0           7 COURT ST BELMONT   ALLEGANY
## 6                          100 13961 SISSON HIGHWAY COLLINS       ERIE
##   State Zip.Code
## 1    NY    12068
## 2    NY    10457
## 3    NY    14813
## 4    NY    14813
## 5    NY    14813
## 6    NY    14034
##                                                                           Location
## 1           20 PARK ST.\nFONDA, NY 12068\n(42.953072192000036, -74.37310776999993)
## 2         685 E 182D ST\nBRONX, NY 10457\n(40.849331720000066, -73.89809272799994)
## 3           7 COURT ST\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 4          7 COURT ST.\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 5           7 COURT ST\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 6 13961 SISSON HIGHWAY\nCOLLINS, NY 14034\n(42.49592655600003, -78.84794022099999)

Each row of the data represents a unique school that was captured by the immunization survey. Most column names are self explanatory. For the purpose of this analysis project, our columns of interest are:

Pruning

Filtering for report period 2017-2018, for public schools, and removing schools that do not have an associated district name.

rec_imm <- imm %>% 
filter(Report.Period == "2017-2018" & Type == "Public School")  %>% 
  filter(!(is.na(District.Name)))

Let’s create a column called meas_no that holds values for the percent of students that did not receive measles vaccinations.

rec_imm <- rec_imm %>% 
mutate( meas_no = (100 - (Percent.Immunized.Measles)))

Unlocking the Location Data

Now that I have data from only the public schools from 2017-2018, I want to plot the schools on a map. Where are the coordinates stored?

head(rec_imm$Location)
## [1] LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)  
## [2] PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)           
## [3] 550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)   
## [4] S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## [5] MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)             
## [6] BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)      
## 13545 Levels: *\nWYOMING, NY 14591\n(42.82638000000003, -78.08938999999998) ...

The locations are clearly available, but they appear with addresses. We need to clean this information so that we have two clean columns, one for the latitude and one for the longitude.

Cleaning Location Data with Regular Expressions

While it may be tempting to want to try and clean the location data manually, it simply is not feasible for the number of rows in this dataset. We have to write a regular expression to cut out the addresses and get two separate columns for the latitude and the longitude, and then convert it into a spatial dataset. Here are some links that I used to familiarize myself with the regular expressions that I needed to write for this analysis.

Before we begin attacking the ‘Location’ column, let us first change its class to character.

rec_imm$Location = as.character(rec_imm$Location)

Let me try to remove all text upto and including the (

rec_imm$geo <- gsub(".*\\(","", rec_imm$Location) 

What is the gsub function doing? It has 3 arguments, first, the old term that you are looking to replace, second, the new term with which you are replacing the old term, and third, the name of the column that the regular expression will be evaluated over. Refer to this link for more information.

In this case, .* represents all characters. The double \s before the opening parenthesis tell R to interpret the ( literally. So, this function tells R to find everything upto and including the first (, and replace it with “”. This is basically another way of saying delete everything upto and including the first bracket. We use double escapes for ( as it is a special character.

Let’s take a look to see how the chunk code played out.

head(rec_imm[, c("Location", "geo")])
##                                                                          Location
## 1   LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2            PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3    550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5              MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6       BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
##                                       geo
## 1  41.59995607400003, -73.70731006599993)
## 2  42.64399618700003, -78.53924020499994)
## 3  42.81664690500003, -78.80204467399994)
## 4 42.768125068000074, -78.83079897699997)
## 5  43.32135221300007, -75.18573825299995)
## 6  43.05915900000008, -77.52740685299995)

Nice! In the column geo, everything before and including the first bracket has been removed. Now, let me remove everything after the last bracket.

rec_imm$geog <- gsub("\\).*", "", rec_imm$geo)

How has this one turned out?

head(rec_imm[, c("Location", "geo", "geog")])
##                                                                          Location
## 1   LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2            PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3    550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5              MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6       BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
##                                       geo
## 1  41.59995607400003, -73.70731006599993)
## 2  42.64399618700003, -78.53924020499994)
## 3  42.81664690500003, -78.80204467399994)
## 4 42.768125068000074, -78.83079897699997)
## 5  43.32135221300007, -75.18573825299995)
## 6  43.05915900000008, -77.52740685299995)
##                                     geog
## 1  41.59995607400003, -73.70731006599993
## 2  42.64399618700003, -78.53924020499994
## 3  42.81664690500003, -78.80204467399994
## 4 42.768125068000074, -78.83079897699997
## 5  43.32135221300007, -75.18573825299995
## 6  43.05915900000008, -77.52740685299995

Awesome, the newly added geog column has latitudes and longitudes with the addresses and brackets removed. But we are not done yet. We will now use the comma to split the geog column into two separate ‘lat’ and ‘long’ columns.

rec_imm <- separate(rec_imm, geog, 
                    into =c("lat", "long"), sep = ",")

Has it worked?

head(rec_imm[, c("Location", "geo", "lat", "long")])
##                                                                          Location
## 1   LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2            PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3    550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5              MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6       BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
##                                       geo                lat
## 1  41.59995607400003, -73.70731006599993)  41.59995607400003
## 2  42.64399618700003, -78.53924020499994)  42.64399618700003
## 3  42.81664690500003, -78.80204467399994)  42.81664690500003
## 4 42.768125068000074, -78.83079897699997) 42.768125068000074
## 5  43.32135221300007, -75.18573825299995)  43.32135221300007
## 6  43.05915900000008, -77.52740685299995)  43.05915900000008
##                  long
## 1  -73.70731006599993
## 2  -78.53924020499994
## 3  -78.80204467399994
## 4  -78.83079897699997
## 5  -75.18573825299995
## 6  -77.52740685299995

Studying the different columns in this table is a good way to understand how we went, step by step, from a messy Location column to clean lat long columns. However, the class for lat and long is still character. Let’s change it back to numeric.

which(colnames(rec_imm) == "lat") #Identifying column numbers
## [1] 24
# If lat is 24, long is 25.
rec_imm[ , c(24, 25)] <- lapply(rec_imm[, c(24,25)], as.numeric)

Perfect. You now have clean lat and long columns.

Converting to a Spatial dataframe

The columns that we obtained as a result of regular expression cleaning are just that, columns. The rec_imm dataset is still just a dataframe. In order to spatialize the rec_imm dataframe into a newer spatial dataframe called spat_imm, we will have to tell R that the lat long columns hold coordinates.

We will first pull the positional information into an object called xy.

xy <- rec_imm[, c(25, 24)] 
# The order for coordinates is always longitude first, latitude second.
# This is why 25 appears before 24.

Now let’s create a spatial dataframe from rec_imm feeding in xy as the positional coordinates.

spat_imm <- SpatialPointsDataFrame(coords = xy, data = rec_imm) 

Did it work?

str(spat_imm)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
##   ..@ data       :'data.frame':  2832 obs. of  25 variables:
##   .. ..$ ï..School.ID                : num [1:2832] 1.32e+11 1.42e+11 1.42e+11 1.42e+11 4.12e+11 ...
##   .. ..$ District.Name               : Factor w/ 719 levels "","ADDISON CENTRAL SCHOOL",..: 24 288 334 215 530 501 540 540 212 359 ...
##   .. ..$ Report.Period               : Factor w/ 6 levels "2012-2013","2013-2014",..: 6 6 6 6 6 6 6 6 6 6 ...
##   .. ..$ Type                        : Factor w/ 4 levels "BOCES School",..: 3 3 3 3 3 3 3 3 3 3 ...
##   .. ..$ School.Name                 : Factor w/ 6303 levels " BAIS YAAKOV AYELES HASHACHAR",..: 381 2293 2697 1779 4365 338 4699 4705 1072 5897 ...
##   .. ..$ Percent.Medical.Exemptions  : num [1:2832] 0 0.25 0 0.07 1.6 0 0.48 0 0 0.84 ...
##   .. ..$ Percent.Religious.Exemptions: num [1:2832] 1.12 2.7 0 0.41 0 1.12 0.12 0 0.49 1.11 ...
##   .. ..$ Percent.Immunized.Polio     : num [1:2832] 98.9 97.3 100 99.6 100 98.6 99 99.8 98.5 98.6 ...
##   .. ..$ Percent.Immunized.Measles   : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
##   .. ..$ Percent.Immunized.Mumps     : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
##   .. ..$ Percent.Immunized.Rubella   : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
##   .. ..$ Percent.Immunized.Diphtheria: num [1:2832] 98.9 97.3 100 99.6 100 98.9 99.2 99.8 98.5 98.3 ...
##   .. ..$ Percent.Immunized.HepatitisB: num [1:2832] 98.9 97.3 97.5 99.5 100 98.9 99.5 99.9 99.5 98.3 ...
##   .. ..$ Percent.Immunized.Varicella : num [1:2832] 98.9 97.3 97.7 99.6 100 98.5 99 100 99.5 98.1 ...
##   .. ..$ Percent.Completely.Immunized: num [1:2832] 98.9 96.6 95.7 99.3 97.9 96.6 96.9 99.1 98.1 97.8 ...
##   .. ..$ Street                      : Factor w/ 7766 levels "","*","1-3 MONROE ST",..: 7202 7388 5177 7551 7226 6796 2272 1684 6842 7244 ...
##   .. ..$ City                        : Factor w/ 1064 levels "ACCORD","ADAMS",..: 767 441 501 406 787 743 803 803 349 540 ...
##   .. ..$ County                      : Factor w/ 62 levels "ALBANY","ALLEGANY",..: 14 15 15 15 33 28 28 28 30 30 ...
##   .. ..$ State                       : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..$ Zip.Code                    : int [1:2832] 12570 14080 14218 14075 13438 14534 14605 14607 11520 11561 ...
##   .. ..$ Location                    : chr [1:2832] "LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)" "PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)" "550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)" "S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)" ...
##   .. ..$ meas_no                     : num [1:2832] 1.1 2.7 1 0.5 0 ...
##   .. ..$ geo                         : chr [1:2832] "41.59995607400003, -73.70731006599993)" "42.64399618700003, -78.53924020499994)" "42.81664690500003, -78.80204467399994)" "42.768125068000074, -78.83079897699997)" ...
##   .. ..$ lat                         : num [1:2832] 41.6 42.6 42.8 42.8 43.3 ...
##   .. ..$ long                        : num [1:2832] -73.7 -78.5 -78.8 -78.8 -75.2 ...
##   ..@ coords.nrs : num(0) 
##   ..@ coords     : num [1:2832, 1:2] -73.7 -78.5 -78.8 -78.8 -75.2 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : NULL
##   .. .. ..$ : chr [1:2] "long" "lat"
##   ..@ bbox       : num [1:2, 1:2] -79.7 40.6 -72 45
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:2] "long" "lat"
##   .. .. ..$ : chr [1:2] "min" "max"
##   ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
##   .. .. ..@ projargs: chr NA

Yes, it did. We now have a spatial points data frame!

Mapping the Schools

We are now ready to map the schools.

plot(spat_imm)

This map definitely needs improving. A look at the tmap documentation might help.

?tmap
tm_shape(spat_imm) + tm_dots( col = "blue") + 
tm_layout(title = "Public Schools in NYS\n 2017-2018")

Obtaining the Geography (School Districts)

Let’s read in the school districts shape files. These are provided by the New York State Education Department. You can access them from here.

distsf <- st_read("./School_Districts/SchoolDistricts_2016_v2.shp")
## Reading layer `SchoolDistricts_2016_v2' from data source `D:\Lakshman\A_Cornell\Side_project\Autocorrelation_Measles_Vaccination\School_Districts\SchoolDistricts_2016_v2.shp' using driver `ESRI Shapefile'
## Simple feature collection with 935 features and 40 fields
## geometry type:  POLYGON
## dimension:      XY
## bbox:           xmin: 105571 ymin: 4481030 xmax: 764138.9 ymax: 4985467
## epsg (SRID):    26918
## proj4string:    +proj=utm +zone=18 +datum=NAD83 +units=m +no_defs

What is the class of distsf?

class(distsf)
## [1] "sf"         "data.frame"

This is an sf spatial object (simple features).

Let’s view this spatial data frame:

head(distsf)

There are a large number of columns. The ‘geometry’ column tells R how to plot these school district polygons on a map. Remember, we will want to eventually join the immunization data from the schools(spat_imm) to the polygons from distsf. Doing this would be easy if there was a key column that these two datasets had in common. In fact, when you look at the column names from distsf, you will notice a ‘SCHOOL_ID’ column that might be a candidate key column to join by, but unfortunately, the corresponding school ID column from the spat_imm datasets represents a different identification system used by the New York State Department of Health for the immunization survey. Similarly, the SCHOOLDIST column from the distsf object cannot be used to link to the District.Name from the spat_imm dataset because of differences in the way the districts are spelled. The only way that we can associate the polygons and the immunization data is by mapping them and performing a spatial overlay.

Let’s plot distsf to see what it looks like. Unlike the schools, we do not need to specify coordinates for this data as it is already a spatial dataset (simple features).

districts <- tm_shape(distsf) +tm_borders() + 
tm_layout(title = "School Districts of \n New York State")

districts

Now, let us overlay the dots representing schools over the polygons representing school districts. It is important to note here that you have to map out the school districts first by listing tm_shape(distsf) before you plot the schools using tm_shape(spat_imm). If you try to do it the other way round, you will not get an overlay map.

tm_shape(distsf) + tm_borders() + 
tm_shape(spat_imm) + tm_dots( col = "blue" ) +
tm_layout(title = "Public Schools and School \nDistricts of New York State")

You can see obvious clusters of schools in regions of high population density. Just by looking at this map, you can begin tying public schools to their respective school districts. However, as there are a lot of schools in the data, let us try and make R do this for us.

Getting the Projections Right

Up until now, we did not use projections. We developed the map of schools and districts wihout having to worry about this topic, because it was a purely visual map. If we want to use that map to link schools to school districts, we need to ensure that both layers of the map are in the same projection so that R can correctly measure distances while performing a spatial point over polygon operation.

Points

Let us start with our school points (spat_imm). What projection are the points in? We can use the proj4string function to assess this.

proj4string(spat_imm)
## [1] NA

Apparently, this dataset did not come with a preloaded projection. Since this is lat-long data, we will assign it to WGS 84. There are two steps to this—

1. Writing a Projection

We will first write a WGS84 projection for spat_imm using the CRS function.

proj4string(spat_imm) <- CRS("+proj=longlat +datum=WGS84")

2. Assigning the Projection

A written projection has to be assigned using the spTransform function for the projection to ‘stick’.

points_proj <- spTransform(spat_imm, CRS("+proj=longlat +datum=WGS84")) 

Polygons

Now, onto our school district polygons (distsf). Remember, the polygons differ from the points because they were made available as spatial data, and we do not have to write a projection. We just have to reassign the WGS 84 projection to it. What projection are the polygons in?

proj4string(distsf)

An error appears because distsf is a simple features spatial object. We need to convert it into a spatial polygons data frame to run the proj4string function.

distsf <- as(distsf, 'Spatial')
proj4string(distsf)
## [1] "+proj=utm +zone=18 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"

The distsf is in UTM Zone 18N. If you want to learn more about this projection, I would recommend watching this and reading this. We will, however, be reprojecting this data to WGS 84, the same as the points.

We use the spTransform function to assign the WGS84 projection to the polygons.

polygons_proj <- spTransform(distsf, CRS("+proj=longlat +datum=WGS84")) 

A quick way to check if the polygons and points are in the same coordinate system is to call the summary for both datasets and see if the bounding box for the polygons outlines the limits of the points.

bbox(points_proj)
##            min       max
## long -79.71208 -71.95790
## lat   40.58346  44.98913
bbox(polygons_proj)
##         min       max
## x -79.76225 -71.85614
## y  40.47738  45.01582

They approximately match, indicating that we have projected both datasets correctly.

Points in Polygons

To quickly summarize what we have so far: We have two datasets, one is a New York State Department of Health spatial points data frame with vaccination exemption data from public schools in New York from 2017-2018, another is a New York State Department of Education spatial polygons data frame with the outlines of school districts. The bad news: we do not know which points go with any given polygon. The good news: we have them both in the WGS84 projection, and we can now perform a point in polygon operation to figure this out.

We use the over function.

pointsinpoly <- over(points_proj, polygons_proj)

What is this thing?

head(pointsinpoly)
## Warning in format.POSIXlt(as.POSIXlt(x), ...): unable to identify current timezone 'H':
## please set environment variable 'TZ'
##   OBJECTID_1 OBJECTID FID_School      AREA PERIMETER SCHOOL_ SCHOOL_ID
## 1        257      258        256 303717256 149833.67     257       257
## 2        552      553        552 208182210 102960.20     553       553
## 3        604      605        604  16697833  23099.36     605       605
## 4        593      594        593  62976192  64367.01     594       594
## 5        821      825        824 240674540 130495.93     827       827
## 6        725      729        727  81962406  57021.98     730       730
##   SCHDIST04_ SCHDIST041 SCH_LATEST SCH_LATE_1 SCHOOLDIST POLYTYPE
## 1  303717256  149833.67        682        708  ARLINGTON DTM ONLY
## 2  208182210  102960.20        386        433    HOLLAND DTM ONLY
## 3   16697833   23099.36        334        372  LACKAWANA DTM ONLY
## 4   62976192   64367.01        345        388   FRONTIER DTM ONLY
## 5  240674540  130495.93        112        782     REMSEN DTM ONLY
## 6   81962406   57021.98        209        258  PITTSFORD DTM ONLY
##   PRIMARYPOL USERNAME    UPDATE_ EDITED RPOLY_ LPOLY_ EACODE  District2
## 1          Y  unknown 1900-01-01   NONE    682    682 134601  ARLINGTON
## 2          Y  unknown 1900-01-01   NONE    386    386 145001    HOLLAND
## 3          Y  unknown 1900-01-01   NONE    334    334 140900 LACKAWANNA
## 4          Y  unknown 1900-01-01   NONE    345    345 144804   FRONTIER
## 5          Y  unknown 1900-01-01   NONE    112    112 305201     REMSEN
## 6          Y  unknown 1900-01-01   NONE    209    209 264601  PITTSFORD
##        Inst_ID    SchDist FID_BOCES_ SEDdir_BOC
## 1 800000053261  ARLINGTON          7       1390
## 2 800000052163    HOLLAND          9       1492
## 3 800000052156 LACKAWANNA          8       1491
## 4 800000052176   FRONTIER          8       1491
## 5 800000041401     REMSEN         21       4190
## 6 800000050008  PITTSFORD         17       2691
##                            SEDdir_B_1 Shape_Leng OID_  ID     INSTITID
## 1                      BOCES DUTCHESS  149835.20   83  84 800000053261
## 2 BOCES ERIE 2-CHAUTAUQUA-CATTARAUGUS  102960.53  118 119 800000052163
## 3                        BOCES ERIE 1   23099.47   96  97 800000052156
## 4                        BOCES ERIE 1   64367.22  685 686 800000052176
## 5       BOCES ONEIDA-HERKIMER-MADISON  130496.91  149 150 800000041401
## 6                      BOCES MONROE 1   57022.60  413 414 800000050008
##     SED_CODE_1         POPULAR_NA INSTPECD INSTSUBYPE            INSSUBDE
## 1 131601060000      ARLINGTON CSD       16          7 INDEPENDENT CENTRAL
## 2 141701040000        HOLLAND CSD       16          4             CENTRAL
## 3 141800010000 LACKAWANNA CITY SD       16          1                CITY
## 4 141604060000       FRONTIER CSD       16          7 INDEPENDENT CENTRAL
## 5 411701040000         REMSEN CSD       16          4             CENTRAL
## 6 261401060000      PITTSFORD CSD       16          7 INDEPENDENT CENTRAL
##       MUNICODE SDLCODE ORPTSCOD_1 Shape_Le_1 Shape_Area
## 1 130668600100  131601     134601  149835.20  303723281
## 2 140739600100  141701     145001  102960.53  208183562
## 3 140525000000  141800     140900   23099.47   16697987
## 4 140636100400  141604     144804   64367.22   62976616
## 5 300770500100  411701     305201  130496.91  240677964
## 6 260666900100  261401     264601   57022.60   81964086

The pointsinpoly object is a regular dataframe (it is not spatial). It carries a critical bit of information that makes life easy for us.

Take a close look at the first column OBJECTID_1, and the column of automatically generated index numbers that immediately precedes it. OBJECTID_1 is a number that uniquely identifies each polygon in the school districts (polygons_proj) dataset. The column of index numbers that precedes it is uniquely identifying each school from the schools (points_proj) dataset. Let us take a narrower look at it.

pointsinpoly <- over(points_proj, polygons_proj[,c("OBJECTID_1", "SCHOOLDIST")])

head(pointsinpoly)
##   OBJECTID_1 SCHOOLDIST
## 1        257  ARLINGTON
## 2        552    HOLLAND
## 3        604  LACKAWANA
## 4        593   FRONTIER
## 5        821     REMSEN
## 6        725  PITTSFORD
#the automatic index on the left is
#the inaccessible column that shows the unique number assigned to each school. 
#For example, the Island Park school district (OBJECTID_1 = 3) has schools 645 and 827
#associated with it.

This is perfect. We have a dataframe that ties each school with the school district that it is located in. However, we cannot make a map from this because of two reasons: first, we have lost the outlines of the polygons, and second, we do not have the vaccination exemption data from the schools. We are going to proceed with 3 steps:

  1. We will first pull the OBJECTID_1 and SCHOOLDIST columns from pointsinpoly and add it to points_proj. We now have schools, measles immunizations, and the associated school districts with their unique id numbers in one column.
  2. We will then restrict our analysis to only schools that have non-zero vaccination rates by removing the schools that have one hundred percent vaccination rates (or 0 in the meas_no column). This is because we will be visualizing the vaccine exemption rates using an average score per district. We do not want the 100 percent vaccine compliant schools to skew the results; even a single school with a high percentage of unvaccinated children could be problematic for the entire district, provided that school has a large number of students.
  3. Next, we will add these point attributes to the polygons using the unique id numbers as a common field.

Step One: Adding Overlay Attributes to Points

points_proj$polygonid <- pointsinpoly$OBJECTID_1
points_proj$districtname <- pointsinpoly$SCHOOLDIST

Step Two: Removing Non-Zero Exemption Rates

Note: You cannot use dplyr from the tidyverse for this command. The tidyverse and spatial objects do not mesh together well. Use the subset command from base R.

nzero_exem <- subset(points_proj, meas_no > 0, 
                    select = c(School.Name, 
                              meas_no, 
                              Percent.Immunized.Measles,
                              polygonid, 
                              districtname))

Step Three: Joining Points to Polygons

Preparation

Now, let us join the nzero_exem attributes to the polygons_proj data usingpolygonid from nzero_exem and OBJECTID_1 from polygons_proj as common fields. We will do a left join with polygons_proj as the left table and nzero_exem as the right table. Explore joins here.

Nothing needs to be changed in the left table. In the right table (nzero_exem), there are some modifications that need to be done before merging.

Let us first rename ‘polygonid’ in nzero_exem as OBJECTID_1 to create a common field.

names(nzero_exem) #which column is polygonid?
## [1] "School.Name"               "meas_no"                  
## [3] "Percent.Immunized.Measles" "polygonid"                
## [5] "districtname"
names(nzero_exem)[4] <- "OBJECTID_1" # renaming that column

There is an NA in the district name in row 2313. We should remove it to avoid trouble with grouping.

nzero_exem <- nzero_exem[-(which(is.na(nzero_exem$districtname))), ]

The nzero_exem object is a spatial points data frame. We should convert it to a dataframe prior to grouping.

nzero_exem <- as.data.frame(nzero_exem)

Let’s now group nzero_exem by districtname before joining it to polygons_proj using OBJECTID_1, so that the number of polygons are the same in the right and left tables. It is a good idea to explictly mention here the package from which the group_by and summarize function originate. We are calculating the average exemption rate by district and rounding the value upto 2 decimal digits.

nzero_exem <- nzero_exem %>% 
  dplyr::group_by(districtname, OBJECTID_1) %>% 
  dplyr::summarize( meas_no = round(mean(meas_no),2))
#This yields the mean exemption rate for the districts.

Performing the Join

We will use the merge function from sp to join the nzero_exem data to polygons_proj. The documentation for this function states that we can use it to merge a spatial object with a data frame. The polygons_proj object is the spatial object. An important point: we should not eliminate polygons for which there is no match, because the occurrence of non-matches just means that these school districts do not have a public school with a non-zero measles exemption rate.

join <- merge(polygons_proj, nzero_exem, 
              by = intersect(names(polygons_proj), 
                             names(nzero_exem)), all.x = TRUE)

Phew. We are done. Let’s finally create the map.

Let’s Visualize

tmap_mode("plot")
## tmap mode set to plotting
 tm_shape(join) + 
  tm_fill("meas_no", style = "jenks", n = 8, title = "Percent Schoolchildren Unvaccinated for Measles")+
  tm_layout(title = "Measles Exemptions in \nNew York State by \nSchool District", 
          legend.title.size = 1,
          legend.text.size = 0.6,
          legend.position = c("left", "bottom"),
          legend.bg.color = "white",legend.bg.alpha = 1) +
   tm_view(view.legend.position = c("left", "bottom"))

This is an interesting map that represents school districts with a higher rate of measles vaccine exemptions with a darker red or brown color. The grey areas represent school districts that did not have public schools, or districts where all schools had 100 percent vaccination rates. We can already see clusters where the exemptions rates seem to be high. Let’s perform a statistical test to prove this.

Spatial autocorrelation

Is there spatial autocorrelation?

We will remove NAs before performing the Moran’s I test.

join.omit <- sp.na.omit(join, col.name = "meas_no")

Defining the neighbors:

nb <- poly2nb(join.omit, queen=TRUE) 
#Queen contiguity, where every neighbor that shares an edge or point is 
#considered a neighbor

Weighting the neighbors:

lw <- nb2listw(nb, style="W", zero.policy=TRUE) #weighting the neighbors equally

Performing the Moran’s test:

moran.test(join.omit$meas_no,lw, zero.policy = TRUE)
## 
##  Moran I test under randomisation
## 
## data:  join.omit$meas_no  
## weights: lw  n reduced by no-neighbour observations
##   
## 
## Moran I statistic standard deviate = 3.3873, p-value = 0.0003529
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic       Expectation          Variance 
##      0.0765238010     -0.0015797788      0.0005316635

There is clustering as evidenced by the 0.00035 p-value. This means that school districts with high measles exemptions rates are more likely to be surrounded by school districts where the exemption rates are high. The current outbreak has been reported to have affected New York City and Rockland County, mainly. Let’s take a closer look at the Rockland Country region.

knitr::include_graphics("Rockland_County.jpg")

The high percentage of unvaccinated schoolchildren in Rockland County is probably associated with the outbreak at this region.

Let us quickly examine the distributions of the percent of exemptions.

fivenum(nzero_exem$meas_no) #The spread of percent exemptions
## [1]  0.10  0.65  1.00  1.53 23.40
plot(density(nzero_exem$meas_no))

Conclusion

What does this mean for public health?

  1. Certain communities are more likely to opt for exemptions than others, and they cluster together. If we could obtain other variables for these communities, such as ethnic compositions, education levels, median income and so on, we could perform a geographically weighted regression to understand what influences vaccine avoidance.

  2. We have not achieved the herd immunity threshold uniformly. For measles, the herd immunity threshold is around 94.5 percent. The distribution reveals that most school districts have exemption percentages between 0 and 5, but there are school districts with very high exemption rates, for example, upto 23.40 percent. As long as these exemptions are allowed to reach such high values, we will have recurring measles outbreaks.

  3. This map is a useful tool for public health planners in New York to assess where the most exemptions are occurring. A quick Google search as of the date of my publishing this document reveals this to be the first map tracking measles exemptions in New York School Districts.

Thanks for following along patiently! Feel free to reach me at lb685@cornell.edu if you would like to learn more about how this code works.