The Department of Conservation and Recreation manages the Massachusetts state park system — and that means managing production companies who want to film movies, television shows and commercials at locations from the summit of Mount Greylock to the Charles River Esplanade. Over the past decade, Massachusetts state parks have served as settings in movies such as “Patriots Day” and “The Finest Hours,” and TV shows including “Wheel of Fortune” and “This Old House.”
But what are the most popular film-shoot locations in the state park system? Last year I requested data on film permits issued by the DCR over a 10-year period.
My goal with this exercise is to identify the most popular film-shoot locations in the data, then present the results in a bar chart and an interactive map.
I'll start with the packages I need right away, and add more later for the chart and map.
library(readxl)
library(dplyr)
library(tidyr)
library(stringr)
In response to my public records request, the state provided six Excel spreadsheets for permits issued between 2007 and September 2017.
df_1 <- read_excel("data/Active List 1.xls")
df_2 <- read_excel("data/Active List 2.xls")
df_3 <- read_excel("data/Active List 3.xls")
df_4 <- read_excel("data/Archive List 1.xls")
df_5 <- read_excel("data/Archive List 2.xls")
df_6 <- read_excel("data/Archive List 3.xls")
The number of columns — and some of the column names — are inconsistent across the six spreadsheets. For example, the number of variables ranges from 39 to 47, and the permit sites are referred to as ‘Location’ in some sheets, and ‘Combo253’ in others. We’ll make the key column names consistent, and drop the columns we don’t need.
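Before renaming anything, it's worth confirming those differences. A quick sketch that prints each sheet's column count and any location-related column names:
sheets <- list(df_1, df_2, df_3, df_4, df_5, df_6)
sapply(sheets, ncol)   # column counts; should range from 39 to 47
lapply(sheets, function(x) grep("Location|Combo", names(x), value = TRUE))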
The most important variable is the DCR location, but we'll also keep the permit year, who pulled the permit, the project name and details about the crew. Those are helpful for reference and for weeding out duplicates.
Some of the newer spreadsheets also include a second DCR location column with more precise detail. That’s helpful, too — but it’s missing from the older sheets, so we’ll drop it. Finally, the spreadsheets imported as df_4-df_6 don’t have a Permit Year column — so I’ll include the column for Permit Number and extract the year from that later.
I’m going to create new dataframes titled df_1a-df_6a, to help stay organized.
df_1a <- df_1 %>%
select(Permit_year = `Permit Year`,
Issued_to = `ISSUED TO`,
DCR_location = Combo253,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
df_2a <- df_2 %>%
select(Permit_year = `Permit_Year`,
Issued_to = `ISSUED TO`,
DCR_location = Combo253,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
df_3a <- df_3 %>%
select(Permit_year = `Permit_Year`,
Issued_to = `ISSUED TO`,
DCR_location = Combo253,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
df_4a <- df_4 %>%
select(Permit_year = `PERMIT NUMBER`,
Issued_to = `ISSUED TO`,
DCR_location = Location,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
df_5a <- df_5 %>%
select(Permit_year = `PERMIT NUMBER`,
Issued_to = `ISSUED TO`,
DCR_location = Location,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
df_6a <- df_6 %>%
select(Permit_year = `PERMIT NUMBER`,
Issued_to = `ISSUED TO`,
DCR_location = Location,
Project = `EVENT NAME`,
Project_detail = SpecialEventInformation)
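As an aside, the six near-identical select() calls could be collapsed into a helper function that takes the two column names that vary from sheet to sheet. A sketch, assuming a recent dplyr for all_of():
# Standardize one sheet; year_col and loc_col are the per-sheet names
standardize <- function(df, year_col, loc_col) {
  df %>%
    select(Permit_year = all_of(year_col),
           Issued_to = `ISSUED TO`,
           DCR_location = all_of(loc_col),
           Project = `EVENT NAME`,
           Project_detail = SpecialEventInformation)
}
# e.g.:
df_1a <- standardize(df_1, "Permit Year", "Combo253")
df_4a <- standardize(df_4, "PERMIT NUMBER", "Location")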
Now, fix the permit year data in df_4a-df_6a.
The format for permit numbers begins with the year issued — for example, 2007-0281. So, deleting everything from the hyphen to the end of the string gives us the year.
Side note: The permit year isn’t a factor in the two visualizations I’m creating here. But, it’s a good variable to have in case I decide to look at permits issued by year, or permits at specific locations by year.
# Delete everything from the hyphen to the end, keeping just the year
df_4a$Permit_year <- as.numeric(gsub("-.*$", "", df_4a$Permit_year))
df_5a$Permit_year <- as.numeric(gsub("-.*$", "", df_5a$Permit_year))
df_6a$Permit_year <- as.numeric(gsub("-.*$", "", df_6a$Permit_year))
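Since stringr is already loaded, the same extraction could use str_remove() instead of gsub(), which reads a little more clearly. A sketch for one of the three dataframes, assuming stringr 1.3.0 or later:
df_4a$Permit_year <- as.numeric(str_remove(df_4a$Permit_year, "-.*$"))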
We’ve created dataframes df_1a-df_6a, with common column names. Time to join them.
dcr_permits <- bind_rows(df_1a, df_2a, df_3a, df_4a, df_5a, df_6a)
On an earlier pass through the raw spreadsheets I noticed that some entries were duplicated. I’m not sure how many — but if there’s even one, there may be others. Let’s check.
dcr_permits <- distinct(dcr_permits)
Looks like there were quite a few: the first iteration of dcr_permits had 1,412 observations, but running distinct() brought it down to 835.
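For future runs, it's worth capturing that count programmatically before overwriting the dataframe. A small sketch of how the dedupe step could be written instead:
n_before <- nrow(dcr_permits)          # 1412 on this data
dcr_permits <- distinct(dcr_permits)
n_before - nrow(dcr_permits)           # 577 duplicate rows removed
With the duplicates gone, tally the permits by location and sort the totals in descending order.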
location_totals <- dcr_permits %>%
group_by(DCR_location) %>%
summarize(Location_total=n()) %>%
arrange(desc(Location_total))
Identify locations with 10 or more permits:
top_locations <- location_totals %>%
filter(Location_total>=10)
| DCR_location | Location_total |
|---|---|
| Walden Pond State Reservation | 54 |
| Esplanade, Storrow Embankment Boston | 39 |
| Memorial Drive | 34 |
| Charles River Reservation | 32 |
| Castle Island | 31 |
| Blue Hills Reservation | 29 |
| Esplanade, Cambridge - Memorial Drive | 24 |
| Paul Revere Park, Charlestown | 21 |
| Borderland State Park | 18 |
| Georges Island | 17 |
| Pilgrim Memorial State Park | 17 |
| Carson Beach | 16 |
| Middlesex Fells Reservations | 13 |
| Revere Beach | 12 |
| Cambridge Parkway | 10 |
| Harold Parker State Forest | 10 |
| Wompatuck State Park | 10 |
Now, chart the locations with 10 or more permits.
Shout-out to Nicholas Field — I referred to the code for his chart on multilingual circulation in Toronto public library branches because I forgot how to flip the axes and present the totals in descending order.
library(ggplot2)
library(forcats)
library(scales)
library(extrafont)   # for the BentonSans families, which must be installed locally
ggplot(top_locations, aes(x=fct_reorder(DCR_location, Location_total), y=Location_total)) +
geom_col(fill="#31a354", width = .6) +
geom_text(aes(label=Location_total), hjust=-.5, family="BentonSans Light", size=3) +
coord_flip() +
theme_classic() +
scale_y_continuous(breaks = pretty_breaks(n = 6)) +
labs(title="Film permits at Massachusetts DCR properties",
subtitle="Locations with 10 or more permits between 2007 and 2017",
y="", x="",
caption="Source: Massachusetts Department of Conservation and Recreation") +
theme(axis.line.y = element_line(color="grey80"),
axis.line.x = element_line(color="grey80"),
axis.ticks = element_blank(),
axis.text.y = element_text(family="BentonSans Regular"),
axis.text.x = element_text(family="BentonSans Light"),
plot.title = element_text(size=18, family="BentonSans Bold"),
plot.subtitle = element_text(size=12, family="BentonSans Book"),
plot.caption = element_text(size=8, family="BentonSans Light", hjust=1))
I’ll start by loading libraries and options, and a base Massachusetts map. I’ll use leaflet for a final interactive version — but I found creating a static map was a good way to do a quick check on the geocoding accuracy. And if I decide to create a static version later, I’ll have a head start.
library(tigris)
library(sf)
library(ggmap)
library(leaflet)
options(tigris_class="sf")   # have tigris return shapes as sf objects
# Device settings specific to my local setup; adjust or omit as needed
options(device = "X11")
X11.options(type = "cairo")
ma <- states(cb=T) %>%
filter(NAME=="Massachusetts")
A challenge is that the raw data only gives a basic location description — usually a state park name, but no city, town, street address or zip code.
I was hoping the state’s GIS department would have a data layer for DCR properties. But, no such luck.
I’m going to take a leap of faith and add an address column that tacks the state name onto the location description, and see if that works.
location_geo <- location_totals %>%
  mutate(address = paste0(DCR_location, ", ", "MA"))
# Note: with newer versions of ggmap, Google geocoding requires an API
# key registered via register_google() before this will run
geo <- mutate_geocode(location_geo, address)
It looks like the geocoding went OK. Let’s look at a datatable to make sure it didn’t fail to return lat / lon pairings for any locations. If any are missing, it’s possible to update manually, as documented below.
library(DT)
datatable(geo)
This step has been the big source of inconsistency: the very first pass returned lat / lon pairings for every location, but subsequent passes have failed to geocode some locations, in some cases dropping dozens of them. As a safeguard, I exported a .csv of a complete geo dataframe; it's available in the repo's output_data folder as geo.csv.
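Scanning the datatable by eye works, but a programmatic check is quicker. A minimal sketch that returns any rows the geocoder missed:
geo %>%
  filter(is.na(lon) | is.na(lat))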
Let’s take a look at a static map to make sure all the points are within Massachusetts.
ggplot(ma) +
geom_sf() +
geom_point(data = geo, aes(x=lon, y=lat, size=Location_total), color="#31a354") +
theme_void() +
theme(panel.grid.major = element_line(colour = 'transparent'))
In some runs of the code, I’ve had two locations geocode incorrectly, with both falling to the southwest of the state.
But which ones are they? The state’s southwest border is at lon -73.496867. So, let’s look for any locations that geocoded at a longitude west of -73.496867. (This can be adapted in future runs if any points fall to the north, south or east of the state’s borders.)
mapcheck <- geo %>%
filter(lon < -73.496867)
datatable(mapcheck)
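One way to generalize that check to all four directions at once is to compare against the bounding box of the ma shape loaded earlier. The box is a rectangle rather than the true border, so it can miss points that are outside the state but inside the box; still, anything it flags is worth a look. A sketch:
bb <- st_bbox(ma)   # named vector: xmin, ymin, xmax, ymax
mapcheck_all <- geo %>%
  filter(lon < bb["xmin"] | lon > bb["xmax"] |
         lat < bb["ymin"] | lat > bb["ymax"])
datatable(mapcheck_all)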
In some runs, the geocoding step returns a row whose DCR_location is listed as NA and falls far outside the state. We can drop that row from the geo dataframe.
In other runs, a DCR_location listed as Riverdale Park geocodes to Maryland. The correct Riverdale Park address is in Dedham, MA 02026; find that lat / lon in Google Maps and update it in the geo dataframe.
geo <- geo %>%
  filter(!is.na(DCR_location), DCR_location != "NA")
# Riverdale Park lon and lat. Row numbers can shift between geocoding
# runs, so match on the location name rather than hard-coded indices:
geo$lon[geo$DCR_location == "Riverdale Park"] <- -71.174923
geo$lat[geo$DCR_location == "Riverdale Park"] <- 42.271868
Geocoding also failed once on a location listed as Nonantum Road Branch. It worked on a subsequent try, but the lat / lon looks wrong: it shows up in Marblehead when it should be in Newton. The permit information mentions the crew parking at Community Rowing, which is adjacent to some DCR property. From Google Maps, the correct lat / lon is 42.358630, -71.165553. Change that manually:
# Again, match on the location name rather than row numbers:
geo$lon[geo$DCR_location == "Nonantum Road Branch"] <- -71.165553
geo$lat[geo$DCR_location == "Nonantum Road Branch"] <- 42.358630
A general note: The Nonantum Road error makes me wary of other location errors. Overall, geocoding with the raw DCR_location names seems to have worked OK; I can see that places like Mount Greylock and Bash Bish Falls are about where they should be. But for a truly final version, I'll need a better way to double-check the location data.
Yet another geocoding pass returned an OVER_QUERY_LIMIT warning, with NA values for the lat / lon of Mary O'Malley Waterfront Park in Chelsea, MA. If that happens again, here's the manual fix. (This exact behavior may not reproduce; an over-query error could just as easily introduce NA coordinates for a different property. Either way, the remedy is the same: based on the locations listed in the warnings, look up the lat / lon and add it manually.)
# Match on the name rather than row numbers (assumes the listed name
# contains "Malley"; adjust the pattern if yours differs):
omalley <- grepl("Malley", geo$DCR_location)
geo$lon[omalley] <- -71.052152
geo$lat[omalley] <- 42.388772
As a final bit of preparation: when I ran str(geo) in the console to make sure lat / lon were showing as numeric, I noticed that Location_total is stored as an integer. I'm not sure it matters, but I'll change it to numeric.
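For reference, the console check described above; str() lists the class of each column:
str(geo)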
geo$Location_total <- as.numeric(geo$Location_total)
Start by building the popup.
# Build an HTML label: location name in bold, then the permit count
popup_dcr <- paste0("<b>", geo$DCR_location,
                    "</b><br />Number of permits: ", "<b>", geo$Location_total, "</b>")
Make the final map. I’m switching up the colors to make the smallest circles easier to see.
film_map <- leaflet(geo) %>%
addProviderTiles(providers$Esri.WorldGrayCanvas) %>%
setView(-71.931180, 42.385453, zoom = 8) %>%
addCircleMarkers(~lon, ~lat, popup = ~popup_dcr, weight = 3, radius = ~Location_total/2,
color="#c51b7d", stroke = FALSE, fillOpacity = 0.7)
film_map
I spent a lot of time tweaking the radius parameters. Pegging them strictly to Location_total created huge circles that made the map unwieldy at certain zoom levels.
Setting the radius to the square root of Location_total knocked down the size of the larger circles too much, I thought. I settled on Location_total/2, but I'm still not sure it's the best choice.
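If I revisit it, one option is a scaled square root, which keeps the smallest circles visible without letting the largest dominate. A sketch; the multiplier of 2 is an arbitrary starting point, not a tested value:
film_map_sqrt <- leaflet(geo) %>%
  addProviderTiles(providers$Esri.WorldGrayCanvas) %>%
  setView(-71.931180, 42.385453, zoom = 8) %>%
  addCircleMarkers(~lon, ~lat, popup = ~popup_dcr,
                   radius = ~sqrt(Location_total) * 2,   # compress the range
                   color = "#c51b7d", stroke = FALSE, fillOpacity = 0.7)
film_map_sqrt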