New York is currently experiencing the largest measles outbreak since 1994. As of 22nd April 2019, there have been 626 cases of measles reported in New York City and Rockland County. Measles is a significant public health concern for several reasons:
In this document, we will visualize the school districts of New York State using a choropleth map, symbolizing the polygons by the percentage of measles vaccination exemptions in the school district. We will see if there are clusters or pockets of unvaccinated groups, and briefly consider what this means for public health.
The map we will be creating at the end.
I am a Master of Public Health student, passionate about using data to spur evidence-based interventions to improve health and well-being for all populations. I’m constantly learning new techniques of analyzing data, and this project is a way for me to document these learnings. Additionally, this document is meticulously annotated because I would like for it to serve as an instructional guide to other public health practitioners who are getting started with using R for spatial analysis. In this document, I have laid out all of the data sources, code and reference materials in a manner that makes following along very easy. The code is provided in a step-by-step manner, with errors and comments included to encourage reproduction. Feel free to skip over code chunks and concepts that you might already be familiar with.
Some of the important concepts that we will touch upon are:
These are some of the libraries that we will be using. If you do not have these installed, you will have to run the install.packages() function first before calling the libraries.
library(sf)
library(tidyverse)
library(rgdal)
library(tmap)
library(sp)
library(spdep)
library(spatialEco)
library(knitr)
Let’s read in the data from the New York State Department of Health immunization survey.
imm <- read.csv("Immunization_survey.csv")
Let’s take a look at the data.
(head(imm))
## ï..School.ID District.Name Report.Period Type
## 1 270701659807 2013-2014 Private School
## 2 321000145201 2013-2014 Private School
## 3 20801659058 2012-2013 Private School
## 4 20801659059 2012-2013 Private School
## 5 20801659060 2012-2013 Private School
## 6 141101609343 2013-2014 Private School
## School.Name Percent.Medical.Exemptions
## 1 TURNPIKE SCHOOL 0
## 2 AQUINAS H S 0
## 3 SUNNY MEADOW SCHOOL 0
## 4 RUSTIC HOLLOW 0
## 5 CARSON RD SCHOOL 0
## 6 BLOSSOM GARDENS FRIENDS SCHOOL 0
## Percent.Religious.Exemptions Percent.Immunized.Polio
## 1 92 4
## 2 0 100
## 3 100 0
## 4 100 0
## 5 100 0
## 6 0 100
## Percent.Immunized.Measles Percent.Immunized.Mumps
## 1 0 4
## 2 100 100
## 3 0 0
## 4 0 0
## 5 0 0
## 6 100 100
## Percent.Immunized.Rubella Percent.Immunized.Diphtheria
## 1 4 4
## 2 100 100
## 3 0 0
## 4 0 0
## 5 0 0
## 6 100 100
## Percent.Immunized.HepatitisB Percent.Immunized.Varicella
## 1 4 4
## 2 100 100
## 3 0 0
## 4 0 0
## 5 0 0
## 6 100 100
## Percent.Completely.Immunized Street City County
## 1 0 20 PARK ST. FONDA MONTGOMERY
## 2 100 685 E 182D ST BRONX BRONX
## 3 0 7 COURT ST BELMONT ALLEGANY
## 4 0 7 COURT ST. BELMONT ALLEGANY
## 5 0 7 COURT ST BELMONT ALLEGANY
## 6 100 13961 SISSON HIGHWAY COLLINS ERIE
## State Zip.Code
## 1 NY 12068
## 2 NY 10457
## 3 NY 14813
## 4 NY 14813
## 5 NY 14813
## 6 NY 14034
## Location
## 1 20 PARK ST.\nFONDA, NY 12068\n(42.953072192000036, -74.37310776999993)
## 2 685 E 182D ST\nBRONX, NY 10457\n(40.849331720000066, -73.89809272799994)
## 3 7 COURT ST\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 4 7 COURT ST.\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 5 7 COURT ST\nBELMONT, NY 14813\n(42.22441664100006, -78.03413660399997)
## 6 13961 SISSON HIGHWAY\nCOLLINS, NY 14034\n(42.49592655600003, -78.84794022099999)
Each row of the data represents a unique school that was captured by the immunization survey. Most column names are self explanatory. For the purpose of this analysis project, our columns of interest are:
Filtering for report period 2017-2018, for public schools, and removing schools that do not have an associated district name.
rec_imm <- imm %>%
filter(Report.Period == "2017-2018" & Type == "Public School") %>%
filter(!(is.na(District.Name)))
Let’s create a column called meas_no that holds values for the percent of students that did not receive measles vaccinations.
rec_imm <- rec_imm %>%
mutate( meas_no = (100 - (Percent.Immunized.Measles)))
Now that I have data from only the public schools from 2017-2018, I want to plot the schools on a map. Where are the coordinates stored?
head(rec_imm$Location)
## [1] LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## [2] PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## [3] 550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## [4] S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## [5] MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## [6] BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
## 13545 Levels: *\nWYOMING, NY 14591\n(42.82638000000003, -78.08938999999998) ...
The locations are clearly available, but they appear with addresses. We need to clean this information so that we have two clean columns, one for the latitude and one for the longitude.
While it may be tempting to want to try and clean the location data manually, it simply is not feasible for the number of rows in this dataset. We have to write a regular expression to cut out the addresses and get two separate columns for the latitude and the longitude, and then convert it into a spatial dataset. Here are some links that I used to familiarize myself with the regular expressions that I needed to write for this analysis.
Before we begin attacking the ‘Location’ column, let us first change its class to character.
rec_imm$Location = as.character(rec_imm$Location)
Let me try to remove all text upto and including the (
rec_imm$geo <- gsub(".*\\(","", rec_imm$Location)
What is the gsub function doing? It has 3 arguments, first, the old term that you are looking to replace, second, the new term with which you are replacing the old term, and third, the name of the column that the regular expression will be evaluated over. Refer to this link for more information.
In this case, .* represents all characters. The double \s before the opening parenthesis tell R to interpret the ( literally. So, this function tells R to find everything upto and including the first (, and replace it with “”. This is basically another way of saying delete everything upto and including the first bracket. We use double escapes for ( as it is a special character.
Let’s take a look to see how the chunk code played out.
head(rec_imm[, c("Location", "geo")])
## Location
## 1 LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2 PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3 550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5 MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6 BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
## geo
## 1 41.59995607400003, -73.70731006599993)
## 2 42.64399618700003, -78.53924020499994)
## 3 42.81664690500003, -78.80204467399994)
## 4 42.768125068000074, -78.83079897699997)
## 5 43.32135221300007, -75.18573825299995)
## 6 43.05915900000008, -77.52740685299995)
Nice! In the column geo, everything before and including the first bracket has been removed. Now, let me remove everything after the last bracket.
rec_imm$geog <- gsub("\\).*", "", rec_imm$geo)
How has this one turned out?
head(rec_imm[, c("Location", "geo", "geog")])
## Location
## 1 LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2 PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3 550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5 MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6 BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
## geo
## 1 41.59995607400003, -73.70731006599993)
## 2 42.64399618700003, -78.53924020499994)
## 3 42.81664690500003, -78.80204467399994)
## 4 42.768125068000074, -78.83079897699997)
## 5 43.32135221300007, -75.18573825299995)
## 6 43.05915900000008, -77.52740685299995)
## geog
## 1 41.59995607400003, -73.70731006599993
## 2 42.64399618700003, -78.53924020499994
## 3 42.81664690500003, -78.80204467399994
## 4 42.768125068000074, -78.83079897699997
## 5 43.32135221300007, -75.18573825299995
## 6 43.05915900000008, -77.52740685299995
Awesome, the newly added geog column has latitudes and longitudes with the addresses and brackets removed. But we are not done yet. We will now use the comma to split the geog column into two separate ‘lat’ and ‘long’ columns.
rec_imm <- separate(rec_imm, geog,
into =c("lat", "long"), sep = ",")
Has it worked?
head(rec_imm[, c("Location", "geo", "lat", "long")])
## Location
## 1 LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)
## 2 PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)
## 3 550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)
## 4 S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)
## 5 MAIN ST\nREMSEN, NY 13438\n(43.32135221300007, -75.18573825299995)
## 6 BARKER ROAD\nPITTSFORD, NY 14534\n(43.05915900000008, -77.52740685299995)
## geo lat
## 1 41.59995607400003, -73.70731006599993) 41.59995607400003
## 2 42.64399618700003, -78.53924020499994) 42.64399618700003
## 3 42.81664690500003, -78.80204467399994) 42.81664690500003
## 4 42.768125068000074, -78.83079897699997) 42.768125068000074
## 5 43.32135221300007, -75.18573825299995) 43.32135221300007
## 6 43.05915900000008, -77.52740685299995) 43.05915900000008
## long
## 1 -73.70731006599993
## 2 -78.53924020499994
## 3 -78.80204467399994
## 4 -78.83079897699997
## 5 -75.18573825299995
## 6 -77.52740685299995
Studying the different columns in this table is a good way to understand how we went, step by step, from a messy Location column to clean lat long columns. However, the class for lat and long is still character. Let’s change it back to numeric.
which(colnames(rec_imm) == "lat") #Identifying column numbers
## [1] 24
# If lat is 24, long is 25.
rec_imm[ , c(24, 25)] <- lapply(rec_imm[, c(24,25)], as.numeric)
Perfect. You now have clean lat and long columns.
The columns that we obtained as a result of regular expression cleaning are just that, columns. The rec_imm dataset is still just a dataframe. In order to spatialize the rec_imm dataframe into a newer spatial dataframe called spat_imm, we will have to tell R that the lat long columns hold coordinates.
We will first pull the positional information into an object called xy.
xy <- rec_imm[, c(25, 24)]
# The order for coordinates is always longitude first, latitude second.
# This is why 25 appears before 24.
Now let’s create a spatial dataframe from rec_imm feeding in xy as the positional coordinates.
spat_imm <- SpatialPointsDataFrame(coords = xy, data = rec_imm)
Did it work?
str(spat_imm)
## Formal class 'SpatialPointsDataFrame' [package "sp"] with 5 slots
## ..@ data :'data.frame': 2832 obs. of 25 variables:
## .. ..$ ï..School.ID : num [1:2832] 1.32e+11 1.42e+11 1.42e+11 1.42e+11 4.12e+11 ...
## .. ..$ District.Name : Factor w/ 719 levels "","ADDISON CENTRAL SCHOOL",..: 24 288 334 215 530 501 540 540 212 359 ...
## .. ..$ Report.Period : Factor w/ 6 levels "2012-2013","2013-2014",..: 6 6 6 6 6 6 6 6 6 6 ...
## .. ..$ Type : Factor w/ 4 levels "BOCES School",..: 3 3 3 3 3 3 3 3 3 3 ...
## .. ..$ School.Name : Factor w/ 6303 levels " BAIS YAAKOV AYELES HASHACHAR",..: 381 2293 2697 1779 4365 338 4699 4705 1072 5897 ...
## .. ..$ Percent.Medical.Exemptions : num [1:2832] 0 0.25 0 0.07 1.6 0 0.48 0 0 0.84 ...
## .. ..$ Percent.Religious.Exemptions: num [1:2832] 1.12 2.7 0 0.41 0 1.12 0.12 0 0.49 1.11 ...
## .. ..$ Percent.Immunized.Polio : num [1:2832] 98.9 97.3 100 99.6 100 98.6 99 99.8 98.5 98.6 ...
## .. ..$ Percent.Immunized.Measles : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
## .. ..$ Percent.Immunized.Mumps : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
## .. ..$ Percent.Immunized.Rubella : num [1:2832] 98.9 97.3 99 99.5 100 98.9 99.3 99.9 99.5 98.3 ...
## .. ..$ Percent.Immunized.Diphtheria: num [1:2832] 98.9 97.3 100 99.6 100 98.9 99.2 99.8 98.5 98.3 ...
## .. ..$ Percent.Immunized.HepatitisB: num [1:2832] 98.9 97.3 97.5 99.5 100 98.9 99.5 99.9 99.5 98.3 ...
## .. ..$ Percent.Immunized.Varicella : num [1:2832] 98.9 97.3 97.7 99.6 100 98.5 99 100 99.5 98.1 ...
## .. ..$ Percent.Completely.Immunized: num [1:2832] 98.9 96.6 95.7 99.3 97.9 96.6 96.9 99.1 98.1 97.8 ...
## .. ..$ Street : Factor w/ 7766 levels "","*","1-3 MONROE ST",..: 7202 7388 5177 7551 7226 6796 2272 1684 6842 7244 ...
## .. ..$ City : Factor w/ 1064 levels "ACCORD","ADAMS",..: 767 441 501 406 787 743 803 803 349 540 ...
## .. ..$ County : Factor w/ 62 levels "ALBANY","ALLEGANY",..: 14 15 15 15 33 28 28 28 30 30 ...
## .. ..$ State : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
## .. ..$ Zip.Code : int [1:2832] 12570 14080 14218 14075 13438 14534 14605 14607 11520 11561 ...
## .. ..$ Location : chr [1:2832] "LIME RIDGE ROAD\nPOUGHQUAG, NY 12570\n(41.59995607400003, -73.70731006599993)" "PEARL ST\nHOLLAND, NY 14080\n(42.64399618700003, -78.53924020499994)" "550 MARTIN RD\nLACKAWANNA, NY 14218\n(42.81664690500003, -78.80204467399994)" "S 4432 BAY VIEW RD\nHAMBURG, NY 14075\n(42.768125068000074, -78.83079897699997)" ...
## .. ..$ meas_no : num [1:2832] 1.1 2.7 1 0.5 0 ...
## .. ..$ geo : chr [1:2832] "41.59995607400003, -73.70731006599993)" "42.64399618700003, -78.53924020499994)" "42.81664690500003, -78.80204467399994)" "42.768125068000074, -78.83079897699997)" ...
## .. ..$ lat : num [1:2832] 41.6 42.6 42.8 42.8 43.3 ...
## .. ..$ long : num [1:2832] -73.7 -78.5 -78.8 -78.8 -75.2 ...
## ..@ coords.nrs : num(0)
## ..@ coords : num [1:2832, 1:2] -73.7 -78.5 -78.8 -78.8 -75.2 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : NULL
## .. .. ..$ : chr [1:2] "long" "lat"
## ..@ bbox : num [1:2, 1:2] -79.7 40.6 -72 45
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:2] "long" "lat"
## .. .. ..$ : chr [1:2] "min" "max"
## ..@ proj4string:Formal class 'CRS' [package "sp"] with 1 slot
## .. .. ..@ projargs: chr NA
Yes, it did. We now have a spatial points data frame!
We are now ready to map the schools.
plot(spat_imm)
This map definitely needs improving. A look at the tmap documentation might help.
?tmap
tm_shape(spat_imm) + tm_dots( col = "blue") +
tm_layout(title = "Public Schools in NYS\n 2017-2018")
Let’s read in the school districts shape files. These are provided by the New York State Education Department. You can access them from here.
distsf <- st_read("./School_Districts/SchoolDistricts_2016_v2.shp")
## Reading layer `SchoolDistricts_2016_v2' from data source `D:\Lakshman\A_Cornell\Side_project\Autocorrelation_Measles_Vaccination\School_Districts\SchoolDistricts_2016_v2.shp' using driver `ESRI Shapefile'
## Simple feature collection with 935 features and 40 fields
## geometry type: POLYGON
## dimension: XY
## bbox: xmin: 105571 ymin: 4481030 xmax: 764138.9 ymax: 4985467
## epsg (SRID): 26918
## proj4string: +proj=utm +zone=18 +datum=NAD83 +units=m +no_defs
What is the class of distsf?
class(distsf)
## [1] "sf" "data.frame"
This is an sf spatial object (simple features).
Let’s view this spatial data frame:
head(distsf)
There are a large number of columns. The ‘geometry’ column tells R how to plot these school district polygons on a map. Remember, we will want to eventually join the immunization data from the schools(spat_imm) to the polygons from distsf. Doing this would be easy if there was a key column that these two datasets had in common. In fact, when you look at the column names from distsf, you will notice a ‘SCHOOL_ID’ column that might be a candidate key column to join by, but unfortunately, the corresponding school ID column from the spat_imm datasets represents a different identification system used by the New York State Department of Health for the immunization survey. Similarly, the SCHOOLDIST column from the distsf object cannot be used to link to the District.Name from the spat_imm dataset because of differences in the way the districts are spelled. The only way that we can associate the polygons and the immunization data is by mapping them and performing a spatial overlay.
Let’s plot distsf to see what it looks like. Unlike the schools, we do not need to specify coordinates for this data as it is already a spatial dataset (simple features).
districts <- tm_shape(distsf) +tm_borders() +
tm_layout(title = "School Districts of \n New York State")
districts
Now, let us overlay the dots representing schools over the polygons representing school districts. It is important to note here that you have to map out the school districts first by listing tm_shape(distsf) before you plot the schools using tm_shape(spat_imm). If you try to do it the other way round, you will not get an overlay map.
tm_shape(distsf) + tm_borders() +
tm_shape(spat_imm) + tm_dots( col = "blue" ) +
tm_layout(title = "Public Schools and School \nDistricts of New York State")
You can see obvious clusters of schools in regions of high population density. Just by looking at this map, you can begin tying public schools to their respective school districts. However, as there are a lot of schools in the data, let us try and make R do this for us.
Up until now, we did not use projections. We developed the map of schools and districts wihout having to worry about this topic, because it was a purely visual map. If we want to use that map to link schools to school districts, we need to ensure that both layers of the map are in the same projection so that R can correctly measure distances while performing a spatial point over polygon operation.
Let us start with our school points (spat_imm). What projection are the points in? We can use the proj4string function to assess this.
proj4string(spat_imm)
## [1] NA
Apparently, this dataset did not come with a preloaded projection. Since this is lat-long data, we will assign it to WGS 84. There are two steps to this—
We will first write a WGS84 projection for spat_imm using the CRS function.
proj4string(spat_imm) <- CRS("+proj=longlat +datum=WGS84")
A written projection has to be assigned using the spTransform function for the projection to ‘stick’.
points_proj <- spTransform(spat_imm, CRS("+proj=longlat +datum=WGS84"))
Now, onto our school district polygons (distsf). Remember, the polygons differ from the points because they were made available as spatial data, and we do not have to write a projection. We just have to reassign the WGS 84 projection to it. What projection are the polygons in?
proj4string(distsf)
An error appears because distsf is a simple features spatial object. We need to convert it into a spatial polygons data frame to run the proj4string function.
distsf <- as(distsf, 'Spatial')
proj4string(distsf)
## [1] "+proj=utm +zone=18 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
The distsf is in UTM Zone 18N. If you want to learn more about this projection, I would recommend watching this and reading this. We will, however, be reprojecting this data to WGS 84, the same as the points.
We use the spTransform function to assign the WGS84 projection to the polygons.
polygons_proj <- spTransform(distsf, CRS("+proj=longlat +datum=WGS84"))
A quick way to check if the polygons and points are in the same coordinate system is to call the summary for both datasets and see if the bounding box for the polygons outlines the limits of the points.
bbox(points_proj)
## min max
## long -79.71208 -71.95790
## lat 40.58346 44.98913
bbox(polygons_proj)
## min max
## x -79.76225 -71.85614
## y 40.47738 45.01582
They approximately match, indicating that we have projected both datasets correctly.
To quickly summarize what we have so far: We have two datasets, one is a New York State Department of Health spatial points data frame with vaccination exemption data from public schools in New York from 2017-2018, another is a New York State Department of Education spatial polygons data frame with the outlines of school districts. The bad news: we do not know which points go with any given polygon. The good news: we have them both in the WGS84 projection, and we can now perform a point in polygon operation to figure this out.
We use the over function.
pointsinpoly <- over(points_proj, polygons_proj)
What is this thing?
head(pointsinpoly)
## Warning in format.POSIXlt(as.POSIXlt(x), ...): unable to identify current timezone 'H':
## please set environment variable 'TZ'
## OBJECTID_1 OBJECTID FID_School AREA PERIMETER SCHOOL_ SCHOOL_ID
## 1 257 258 256 303717256 149833.67 257 257
## 2 552 553 552 208182210 102960.20 553 553
## 3 604 605 604 16697833 23099.36 605 605
## 4 593 594 593 62976192 64367.01 594 594
## 5 821 825 824 240674540 130495.93 827 827
## 6 725 729 727 81962406 57021.98 730 730
## SCHDIST04_ SCHDIST041 SCH_LATEST SCH_LATE_1 SCHOOLDIST POLYTYPE
## 1 303717256 149833.67 682 708 ARLINGTON DTM ONLY
## 2 208182210 102960.20 386 433 HOLLAND DTM ONLY
## 3 16697833 23099.36 334 372 LACKAWANA DTM ONLY
## 4 62976192 64367.01 345 388 FRONTIER DTM ONLY
## 5 240674540 130495.93 112 782 REMSEN DTM ONLY
## 6 81962406 57021.98 209 258 PITTSFORD DTM ONLY
## PRIMARYPOL USERNAME UPDATE_ EDITED RPOLY_ LPOLY_ EACODE District2
## 1 Y unknown 1900-01-01 NONE 682 682 134601 ARLINGTON
## 2 Y unknown 1900-01-01 NONE 386 386 145001 HOLLAND
## 3 Y unknown 1900-01-01 NONE 334 334 140900 LACKAWANNA
## 4 Y unknown 1900-01-01 NONE 345 345 144804 FRONTIER
## 5 Y unknown 1900-01-01 NONE 112 112 305201 REMSEN
## 6 Y unknown 1900-01-01 NONE 209 209 264601 PITTSFORD
## Inst_ID SchDist FID_BOCES_ SEDdir_BOC
## 1 800000053261 ARLINGTON 7 1390
## 2 800000052163 HOLLAND 9 1492
## 3 800000052156 LACKAWANNA 8 1491
## 4 800000052176 FRONTIER 8 1491
## 5 800000041401 REMSEN 21 4190
## 6 800000050008 PITTSFORD 17 2691
## SEDdir_B_1 Shape_Leng OID_ ID INSTITID
## 1 BOCES DUTCHESS 149835.20 83 84 800000053261
## 2 BOCES ERIE 2-CHAUTAUQUA-CATTARAUGUS 102960.53 118 119 800000052163
## 3 BOCES ERIE 1 23099.47 96 97 800000052156
## 4 BOCES ERIE 1 64367.22 685 686 800000052176
## 5 BOCES ONEIDA-HERKIMER-MADISON 130496.91 149 150 800000041401
## 6 BOCES MONROE 1 57022.60 413 414 800000050008
## SED_CODE_1 POPULAR_NA INSTPECD INSTSUBYPE INSSUBDE
## 1 131601060000 ARLINGTON CSD 16 7 INDEPENDENT CENTRAL
## 2 141701040000 HOLLAND CSD 16 4 CENTRAL
## 3 141800010000 LACKAWANNA CITY SD 16 1 CITY
## 4 141604060000 FRONTIER CSD 16 7 INDEPENDENT CENTRAL
## 5 411701040000 REMSEN CSD 16 4 CENTRAL
## 6 261401060000 PITTSFORD CSD 16 7 INDEPENDENT CENTRAL
## MUNICODE SDLCODE ORPTSCOD_1 Shape_Le_1 Shape_Area
## 1 130668600100 131601 134601 149835.20 303723281
## 2 140739600100 141701 145001 102960.53 208183562
## 3 140525000000 141800 140900 23099.47 16697987
## 4 140636100400 141604 144804 64367.22 62976616
## 5 300770500100 411701 305201 130496.91 240677964
## 6 260666900100 261401 264601 57022.60 81964086
The pointsinpoly object is a regular dataframe (it is not spatial). It carries a critical bit of information that makes life easy for us.
Take a close look at the first column OBJECTID_1, and the column of automatically generated index numbers that immediately precedes it. OBJECTID_1 is a number that uniquely identifies each polygon in the school districts (polygons_proj) dataset. The column of index numbers that precedes it is uniquely identifying each school from the schools (points_proj) dataset. Let us take a narrower look at it.
pointsinpoly <- over(points_proj, polygons_proj[,c("OBJECTID_1", "SCHOOLDIST")])
head(pointsinpoly)
## OBJECTID_1 SCHOOLDIST
## 1 257 ARLINGTON
## 2 552 HOLLAND
## 3 604 LACKAWANA
## 4 593 FRONTIER
## 5 821 REMSEN
## 6 725 PITTSFORD
#the automatic index on the left is
#the inaccessible column that shows the unique number assigned to each school.
#For example, the Island Park school district (OBJECTID_1 = 3) has schools 645 and 827
#associated with it.
This is perfect. We have a dataframe that ties each school with the school district that it is located in. However, we cannot make a map from this because of two reasons: first, we have lost the outlines of the polygons, and second, we do not have the vaccination exemption data from the schools. We are going to proceed with 3 steps:
points_proj$polygonid <- pointsinpoly$OBJECTID_1
points_proj$districtname <- pointsinpoly$SCHOOLDIST
Note: You cannot use dplyr from the tidyverse for this command. The tidyverse and spatial objects do not mesh together well. Use the subset command from base R.
nzero_exem <- subset(points_proj, meas_no > 0,
select = c(School.Name,
meas_no,
Percent.Immunized.Measles,
polygonid,
districtname))
Now, let us join the nzero_exem attributes to the polygons_proj data usingpolygonid from nzero_exem and OBJECTID_1 from polygons_proj as common fields. We will do a left join with polygons_proj as the left table and nzero_exem as the right table. Explore joins here.
Nothing needs to be changed in the left table. In the right table (nzero_exem), there are some modifications that need to be done before merging.
Let us first rename ‘polygonid’ in nzero_exem as OBJECTID_1 to create a common field.
names(nzero_exem) #which column is polygonid?
## [1] "School.Name" "meas_no"
## [3] "Percent.Immunized.Measles" "polygonid"
## [5] "districtname"
names(nzero_exem)[4] <- "OBJECTID_1" # renaming that column
There is an NA in the district name in row 2313. We should remove it to avoid trouble with grouping.
nzero_exem <- nzero_exem[-(which(is.na(nzero_exem$districtname))), ]
The nzero_exem object is a spatial points data frame. We should convert it to a dataframe prior to grouping.
nzero_exem <- as.data.frame(nzero_exem)
Let’s now group nzero_exem by districtname before joining it to polygons_proj using OBJECTID_1, so that the number of polygons are the same in the right and left tables. It is a good idea to explictly mention here the package from which the group_by and summarize function originate. We are calculating the average exemption rate by district and rounding the value upto 2 decimal digits.
nzero_exem <- nzero_exem %>%
dplyr::group_by(districtname, OBJECTID_1) %>%
dplyr::summarize( meas_no = round(mean(meas_no),2))
#This yields the mean exemption rate for the districts.
We will use the merge function from sp to join the nzero_exem data to polygons_proj. The documentation for this function states that we can use it to merge a spatial object with a data frame. The polygons_proj object is the spatial object. An important point: we should not eliminate polygons for which there is no match, because the occurrence of non-matches just means that these school districts do not have a public school with a non-zero measles exemption rate.
join <- merge(polygons_proj, nzero_exem,
by = intersect(names(polygons_proj),
names(nzero_exem)), all.x = TRUE)
Phew. We are done. Let’s finally create the map.
tmap_mode("plot")
## tmap mode set to plotting
tm_shape(join) +
tm_fill("meas_no", style = "jenks", n = 8, title = "Percent Schoolchildren Unvaccinated for Measles")+
tm_layout(title = "Measles Exemptions in \nNew York State by \nSchool District",
legend.title.size = 1,
legend.text.size = 0.6,
legend.position = c("left", "bottom"),
legend.bg.color = "white",legend.bg.alpha = 1) +
tm_view(view.legend.position = c("left", "bottom"))
This is an interesting map that represents school districts with a higher rate of measles vaccine exemptions with a darker red or brown color. The grey areas represent school districts that did not have public schools, or districts where all schools had 100 percent vaccination rates. We can already see clusters where the exemptions rates seem to be high. Let’s perform a statistical test to prove this.
Is there spatial autocorrelation?
We will remove NAs before performing the Moran’s I test.
join.omit <- sp.na.omit(join, col.name = "meas_no")
Defining the neighbors:
nb <- poly2nb(join.omit, queen=TRUE)
#Queen contiguity, where every neighbor that shares an edge or point is
#considered a neighbor
Weighting the neighbors:
lw <- nb2listw(nb, style="W", zero.policy=TRUE) #weighting the neighbors equally
Performing the Moran’s test:
moran.test(join.omit$meas_no,lw, zero.policy = TRUE)
##
## Moran I test under randomisation
##
## data: join.omit$meas_no
## weights: lw n reduced by no-neighbour observations
##
##
## Moran I statistic standard deviate = 3.3873, p-value = 0.0003529
## alternative hypothesis: greater
## sample estimates:
## Moran I statistic Expectation Variance
## 0.0765238010 -0.0015797788 0.0005316635
There is clustering as evidenced by the 0.00035 p-value. This means that school districts with high measles exemptions rates are more likely to be surrounded by school districts where the exemption rates are high. The current outbreak has been reported to have affected New York City and Rockland County, mainly. Let’s take a closer look at the Rockland Country region.
knitr::include_graphics("Rockland_County.jpg")
The high percentage of unvaccinated schoolchildren in Rockland County is probably associated with the outbreak at this region.
Let us quickly examine the distributions of the percent of exemptions.
fivenum(nzero_exem$meas_no) #The spread of percent exemptions
## [1] 0.10 0.65 1.00 1.53 23.40
plot(density(nzero_exem$meas_no))
What does this mean for public health?
Certain communities are more likely to opt for exemptions than others, and they cluster together. If we could obtain other variables for these communities, such as ethnic compositions, education levels, median income and so on, we could perform a geographically weighted regression to understand what influences vaccine avoidance.
We have not achieved the herd immunity threshold uniformly. For measles, the herd immunity threshold is around 94.5 percent. The distribution reveals that most school districts have exemption percentages between 0 and 5, but there are school districts with very high exemption rates, for example, upto 23.40 percent. As long as these exemptions are allowed to reach such high values, we will have recurring measles outbreaks.
This map is a useful tool for public health planners in New York to assess where the most exemptions are occurring. A quick Google search as of the date of my publishing this document reveals this to be the first map tracking measles exemptions in New York School Districts.
Thanks for following along patiently! Feel free to reach me at lb685@cornell.edu if you would like to learn more about how this code works.