Choropleth map: NSW COVID-19 data

Rohin Chhabra

2020-08-17

The ongoing pandemic has highlighted the importance of effectively utilizing data to understand how and where the virus is spreading through communities around the world. This vignette will be exploring the NSW COVID-19 dataset to see if we can identify any trends in NSW and Sydney. We will cover several R packages which can be used together to intuitively interpret any type of Spatial data.

1. Loading required packages and getting the data ready

library(leaflet)
library(leaflet.extras)
library(data.table)
library(dplyr)
library(rgdal)

There are several packages which will be installed as pre-requisites when installing above libraries.

Let’s load in the data

We’ll be using the fread function from the data.table library. It allows for faster file reading and also provides a syntax closer to SQL when analyzing the data.

# The `data.table` library is an extension of the in-built `data.frame` function. 
# Using a web link to import data, always ensures we have most upto date data
covid <- fread('https://data.nsw.gov.au/data/dataset/aefcde60-3b0c-4bc0-9af1-6fe652944ec2/resource/21304414-1ff1-4243-a5d2-f52778048b29/download/confirmed_cases_table1_location.csv')
covid = as.data.table(covid)
head(covid)
#>    notification_date postcode lhd_2010_code        lhd_2010_name lga_code19
#> 1:        2020-01-25     2134          X700               Sydney      11300
#> 2:        2020-01-25     2121          X760      Northern Sydney      16260
#> 3:        2020-01-25     2071          X760      Northern Sydney      14500
#> 4:        2020-01-27     2033          X720 South Eastern Sydney      16550
#> 5:        2020-03-01     2163          X710 South Western Sydney      12850
#> 6:        2020-03-01     2077          X760      Northern Sydney      14000
#>         lga_name19
#> 1:     Burwood (A)
#> 2:  Parramatta (C)
#> 3: Ku-ring-gai (A)
#> 4:    Randwick (C)
#> 5:   Fairfield (C)
#> 6:     Hornsby (A)

We are interested in the number of patients from each postcode. It would be useful to see where these cases lie on a map to understand if there any hotspots within the City.

We need to convert our postcode values into longitude and latitude coordinates, this will allow us to use the leaflet package for creating the Choropleth map. We’ll be using the freely available Australian postcode file which contains longitude and latitude cooridnates for all postcodes in Australia.

postcodes <- fread("australian_postcodes.csv")
postcodes = as.data.table(postcodes)
head(postcodes)
#>       id postcode                       locality state     long       lat dc
#> 1:   230      200                            ANU   ACT   0.0000   0.00000   
#> 2: 21820      200 Australian National University   ACT 149.1189 -35.27770   
#> 3:   232      800                         DARWIN    NT 130.8367 -12.45868   
#> 4:   233      801                         DARWIN    NT 130.8367 -12.45868   
#> 5:   234      804                          PARAP    NT 130.8733 -12.42802   
#> 6:   235      810                          ALAWA    NT 130.8662 -12.38181   
#>    type                  status   sa3        sa3name sa4 sa4name region
#> 1:                                 NA                 NA             R1
#> 2:            Added 19-Jan-2020    NA                 NA             R1
#> 3:           Updated 6-Feb-2020 70101    Darwin City 701  Darwin     R1
#> 4:      Updated 25-Mar-2020 SA3 70101    Darwin City 701  Darwin     R1
#> 5:      Updated 25-Mar-2020 SA3 70102 Darwin Suburbs 701  Darwin     R1
#> 6:           Updated 6-Feb-2020 70102 Darwin Suburbs 701  Darwin     R1

# Reading in a file consisting only of Sydney postcodes for further analysis
sydney_postcodes <- fread("sydney_postcodes.csv")
sydney_postcodes = as.data.table(sydney_postcodes)

Filter out on non-NSW postcodes to match our dataset and remove any duplicate values from the postcodes to create a unique list.

nsw_codes <- postcodes[state == 'NSW']

# Multiple suburbs share a postcode, but the COVID dataset only has the postcode 
# values listed for each case. 
# Remove duplicate values to create a unique df of postcodes
nsw_codes <- distinct(nsw_codes, postcode, .keep_all = TRUE)
nsw_codes = as.data.table(nsw_codes)

We’ll be using the merge function to copy across longitude and latitude values to the COVID-19 dataset.

# Both datasets contain the postcodes column, use this as PK value
covid <- merge(covid, nsw_codes[, c("postcode", "long", "lat")], by="postcode")
# clean up null values and incorrect postcode coordinates
covid <- filter(covid, postcode !='1871' & long != 0.0000)

2. Plotting the data

We’ll be using the leaflet package to plot our data. We can use leaflet to add different types of layers to our map to make it intuitive and interactive.

map <- leaflet(covid) %>%
  # addTiles() function adds the default map option from the OpenStreetMap API
  addTiles() %>% 
  # the Cluster marker will automatically sum up the occurrences of cases in each postcode
  addMarkers( clusterOptions = markerClusterOptions()) %>% 
  # CartoDB provides an uncluttered background free of labels
  # there are several other Providers that can be used (see references)
  addProviderTiles(providers$CartoDB.Positron) 
#> Assuming "long" and "lat" are longitude and latitude, respectively
# removing this map as the file is too large during publishing
# map

We enhance this map by using a shapefile which has information on NSW suburb boundaries to clearly outline the map. This can be obtained from the data.gov.au database for free.

Once un-zipped, the folder should contain 7 files, ensure these are copied to your working directory before the next step (see github repo for structure)

3. Using rgdal

Once the files are in your working directory, we’ll use the rgdal package to read the NSW_LOCALITY_POLYGON_shp.shp file through the readOGR function.

# this stores the input data into a suitable Spatial vector object. 
nsw_shape <- readOGR(dsn="NSW_LOCALITY_POLYGON_shp.shp")
#> Warning in OGRSpatialRef(dsn, layer, morphFromESRI = morphFromESRI, dumpSRS =
#> dumpSRS, : Discarded datum Geocentric_Datum_of_Australia_2020 in CRS definition:
#> +proj=longlat +ellps=GRS80 +towgs84=0,0,0,0,0,0,0 +no_defs
#> OGR data source with driver: ESRI Shapefile 
#> Source: "/Users/rohinchhabra/Desktop/Assignment1/NSW_LOCALITY_POLYGON_shp.shp", layer: "NSW_LOCALITY_POLYGON_shp"
#> with 4591 features
#> It has 12 fields

4. Using leaflet for customization

We can now use the addPolygons layer in leaflet to outline the localities in NSW. For simplicity, lets focus on suburbs only in Sydney:

# `sydney_postcodes` variable  allows us to focus our analysis to only Sydney suburbs
covid_sydney <- covid %>% filter(postcode %in% sydney_postcodes$postcode)

sydney_map <- leaflet(covid_sydney) %>%
  addTiles() %>% 
  addMarkers( clusterOptions = markerClusterOptions()) %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  addPolygons(data = nsw_shape, color = "#444444", weight = 1, smoothFactor = 0.5,
              opacity = 0.5, fillOpacity = 0.1) %>%
  # coordinates provided are to center the map onto a certain spot, zoom happens at that point
  # coordinates for Sydney city
  setView(lng = 151.2016, 
            lat = -33.86052, 
           zoom = 11) 
#> Assuming "long" and "lat" are longitude and latitude, respectively

#sydney_map
knitr::include_graphics("image1.png")

We’ll use functions like colorBin(), highlight, addLegend() and label to enhance the look of our map.

The highlight function in conjuction with the sprintf function maps the labels. Usually labels in R are implemented through a Shiny App, however we’ll be using the lapply method to convert our HTML markdown into a variable readable by R.

# creating a df for the frequency of cases in each postcode
# this will help us define the colour intensity range
total_cases_by_postcode <- as.data.frame(table(covid_sydney$postcode))
colnames(total_cases_by_postcode) <- c("postcode", "Freq")

# add labels for when a segment is highlighted
labels <- sprintf(
  "<strong>%s</strong><br/>%g cases",
  total_cases_by_postcode$postcode, total_cases_by_postcode$Freq
  # finally we'll use `lapply` to ensure the HTML markup is interpreted as such by R
) %>% lapply(htmltools::HTML)

# define intensity intervals for bins
bins <- c(0, 20, 40, 50, 60, 70, 80, 100, Inf)
# this colour pallette will highlight the highest concentrations in Red 
# intuitive for higher number of cases
pal <- colorBin("YlOrRd", domain = total_cases_by_postcode$Freq, bins = bins)

sydney_map <- leaflet(covid_sydney) %>%
  addTiles() %>% 
  addProviderTiles(providers$CartoDB.Positron) %>%
  # `addMarkers( clusterOptions = markerClusterOptions())` removed as it 
  #   doesn't fit the Choropleth map
  # fillColor function used with the Frequency of cases from the total_cases_by_postcode df
  addPolygons(data = nsw_shape, fillColor = ~pal(total_cases_by_postcode$Freq),
              weight = 1,
              opacity = 1,
              color = "black",
              #dashed outline for a cleaner look
              dashArray = "4",
              fillOpacity = 0.7,
              highlight = highlightOptions(
                #thickness of the outline when highlighted
                weight = 4,
                #color of outline when highlighted
                color = "gray",
                #fills the outline when highlighted
                dashArray = "",
                fillOpacity = 0.7,
                #weather the shape should be brought to the front on hover
                bringToFront = TRUE),
              #defining labels from above
               label = labels,
              #we are able to customise how the labels should look when highlighted
              labelOptions = labelOptions(
                style = list("font-weight" = "normal", padding = "2.5px 7.5px"),
                textsize = "16px",
                direction = "auto")) %>%
  setView(lng = 151.2016, 
            lat = -33.86052, 
           zoom = 11) %>%   
  #add the legend using the total_cases_by_postcode df
  addLegend(pal = pal, values = ~total_cases_by_postcode$Freq, opacity = 0.7, title = "Number of COVID cases",
                  position = "bottomright")

#sydney_map
knitr::include_graphics("image2.png")

This is the final version of our Choropleth map. We can see there are more cases in places like Bondi, The Hills Shire and Inner West. There are inaccuracies within the postcode coordinates data as we can see several postcodes that are tagged to incorrect segments on the map. Even with these discrepancies, we are still able get a general understanding of where the cases have spread in Sydney.

Let’s plot the active number of cases to understand more about COVID-19 hotspots in Sydney.

5. Plotting active cases

For this example, any cases that’s reported 30 days from today’s system date will be considered as an active case.

We’ll need to calculate the difference in days between the notifcation_date and the system_date:

# adding a new column for 'system_date'
covid_active <- mutate(covid_sydney, system_date = Sys.Date())
covid_active <- as.data.frame(covid_active)
# adding a new column for `difference` measured in days
covid_active <- covid_active %>% mutate(difference = covid_active$system_date - as.Date(covid_active$notification_date))

Filter out cases where the difference is less than 30 days and map this new dataframe:

covid_active <- filter(covid_active, difference < 30)
#lets add a colour range to see if we can see any trends
total_active_cases <- as.data.frame(table(covid_active$postcode))
colnames(total_active_cases) <- c("postcode", "Freq")

Circle markers in leaflet

We can use the addCircles() function to better understand the hotspots in the City.

bins <- c(0, 5, 7, 10, 13, 15, 17, 20, 30, Inf)
pal <- colorBin("YlOrRd", domain = total_active_cases$Freq, bins = bins)

active_map_circles <- leaflet(covid_active) %>% addTiles() %>%
  addCircles(lng = ~long, lat = ~lat, color= ~pal(total_active_cases$Freq), weight = 3,
             # radius will allow us to visualise the spread of the virus
             radius = ~total_active_cases$Freq * 100, popup = ~total_active_cases$postcode) %>%
  addProviderTiles(providers$CartoDB.Positron) %>%
  setView(lng = 151.2016, 
          lat = -33.86052, 
          zoom = 11) %>%
  addLegend("bottomright", pal = pal, values = ~total_active_cases$Freq,
            title = "Active COVID-19 Cases")

active_map_circles

With this map, we can see hotspots in the South of Sydney, Inner West and in the North West of Sydney.

In conclusion, there are several ways to manipulate the same dataset to derive a range of insights. Packages like leaflet combined with a few others can exponentially reduce the ‘time to market’ for Data Scientists.

These are just some of the options available in the packages discussed above. See below for a full range of references.

6. References

7. Github repo

https://github.com/rohinc/nsw_covid_19