Derek Funk, M.S.
Nima Zahadat, Ph.D.
The George Washington University

1 Abstract

State cancer registries are required by law to report cancer incidence and mortality annually to federal cancer organizations for review. Local cancer centers face obstacles in retrieving and presenting the data relevant to their patient population. The traditional solution has been to manually compile static reports, which are often out of date or cumbersome to create. This project focuses on Shiny, an R package for building web applications, and how the George Washington Cancer Center used it to create a cancer data visualizer for its catchment area. Shiny allows report development to be automated, with a strong emphasis on reproducibility.

Keywords: cancer, risk & protective factors, George Washington Cancer Center, R, Shiny, health informatics, research reproducibility

2 Introduction

Health organizations that provide information to the public often face the issue of communicating data effectively. There are five questions the health organization must attempt to address in full.

One: Is the information being presented accurate?
Human data entry or inadequate data extraction tools can lead to untrustworthy data.

Two: Is there a quick turnaround between when the communication was requested and when it was actually delivered?
Often, by the time a communication is ingested by the public, there is already a need for an updated version.

Three: Is the information presented in a static report, or is there a self-exploratory tool for the end-user?
Self-service tools allow organizations to communicate multiple insights at the same time and also engage the public.

Four: Is the information transparently reproducible?
Reproducibility gives the viewer trust in what they are consuming and the researcher the ability to vet the process. Transparency means it is clear where and how the data was organized, and that it will not take weeks to recreate the project.

Five: What types of privacy concerns exist with the reported data?
Due to patient privacy, many health data sources must be suppressed at the individual level.

Businesses have had tremendous success addressing problems #1-3 by establishing centralized data warehouses with devoted database professionals who create automated data ingestion pipelines. On top of this, business intelligence developers can use tools such as Tableau and Power BI to create dashboards that the rest of the organization can use without having to constantly create ad hoc reports.

Many health organizations do not have the luxury of a large team of data professionals who can specialize in each part of the data process. In addition, problem #4 carries more weight in health research than in business.

This paper provides a case study of how R Shiny was used at the George Washington Cancer Center (GWCC) to create a self-service tool that allows for exploration of cancer rates and related risk factors in the DC-Maryland-Virginia metropolitan area. Other health organizations that do not have an existing data team, but that do have some knowledge in R, may benefit from the use of Shiny in their data communication objectives.

3 Literature Review

Many health organizations are investing in analytics to report their cancer data in ways that are more compelling and useful. A high-level example is the Centers for Disease Control and Prevention (CDC), which annually aggregates cancer incidence and mortality rates from all U.S. states. In addition to making this information publicly available, they have a cancer data visualizer that helps make the ingestion of this data more palatable [1].

A local example is the University of Miami Sylvester Comprehensive Cancer Center, which created a tool called SCAN360 that visualizes various cancer rates and related factors across Florida’s counties and neighborhoods [2]. The cancer center’s outreach team used the tool to discover a high rate of cervical cancer in one of their neighborhoods, which allowed them to actively flag and monitor the area.

In both of these examples, one of the main challenges behind attaining the most complete information has to do with patient privacy. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) includes national rules regarding how patient information can be shared, including that of cancer data [3]. For example, the CDC shows cancer rates for various combinations of cancer sites, gender, race, and state. However, at a certain granularity the counts become so low that the risk of patient identification begins to manifest. In compliance with HIPAA, the CDC must suppress the public information if the counts become lower than 16 cases. Similarly, the DC Cancer Registry must suppress their data when the counts are lower than 10 cases.
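The suppression rule described above can be sketched in R as follows. This is a minimal illustration rather than the CDC's or the DC Cancer Registry's actual procedure: the function name `suppress_small_counts`, the column names, and the example counts are all hypothetical, and only the thresholds of 16 and 10 come from the text.

```r
# Hypothetical helper: blank out rates whose underlying case count falls
# below a registry's suppression threshold.
suppress_small_counts <- function(data, threshold) {
  too_small <- data$cases < threshold
  data$rate[too_small]  <- NA   # hide the rate
  data$cases[too_small] <- NA   # hide the count itself as well
  data$suppressed <- too_small  # keep a flag so the gap can be explained
  data
}

# Made-up counts and rates, for illustration only.
example <- data.frame(
  region = c("Region A", "Region B", "Region C"),
  cases  = c(120, 12, 9),
  rate   = c(45.2, 8.1, 6.3),
  stringsAsFactors = FALSE
)

cdc_view <- suppress_small_counts(example, threshold = 16)  # fewer than 16 cases
dc_view  <- suppress_small_counts(example, threshold = 10)  # fewer than 10 cases
```

With these made-up counts, the stricter threshold hides two of the three regions while the lower one hides only the smallest, which shows how the same underlying data can look different depending on the source it is requested from.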

Another challenge with compiling cancer data has to do with the time and effort involved. Since 1993, cancer has been categorized as a “reportable disease” [4]. In order to receive 5-year cancer control funding grants, state cancer registries must compile and send their cancer data every year for review by federal cancer organizations such as the CDC and National Cancer Institute. This intensive review process results in a multi-year lag between the time a cancer case occurs and the time it is reported. Consumers of this kind of data must understand that their outreach actions are always based on information that is several years old.

In addition to keeping up to date with local cancer rates, cancer outreach teams are heavily invested in keeping tabs on a host of socioeconomic factors. These variables are called “risk & protective factors” and span many categories, including demographics, income and employment, environmental factors, and risk behaviors. Every catchment area has its own distinct profile of risk & protective factors. It is the goal of every outreach team to associate the high prevalence of specific cancer sites with certain risk & protective factors, and to see how these relationships have changed over time across their sub-regions.

Many analytic tools are used in the public health sector. Due to the prevalence of the R programming language in healthcare, the R package Shiny has been used for various analytic capabilities, including the SCAN360 project. GWCC made the active decision to use Shiny in creating their cancer data visualizer to make cancer data available to local professionals and the public.

4 Research Methodology

The GWCC serves patients from the following regions:

District of Columbia and its eight wards
Charles County, Maryland
Montgomery County, Maryland
Prince George’s County, Maryland
Arlington County, Virginia
Fairfax County, Virginia
Loudoun County, Virginia
Prince William County, Virginia
City of Alexandria, Virginia
City of Fairfax, Virginia
City of Falls Church, Virginia
City of Manassas, Virginia
City of Manassas Park, Virginia

The last five regions are independent cities, which are not in the territory of any county and thus must be included separately.

All these regions together constitute what is known as the GWCC Catchment Area seen below.

This project ventured through each of the following high-level stages:

Data sources identification
Data extraction
Data pre-processing
Application design
Application development
Prototype demos & user feedback
Iterative enhancements
Go-live publishing

5 Data

At a high level, all data presented in the visualizer come from one of two groups:

Cancer Rates
Risk & Protective Factors

Cancer data for all counties, independent cities, and DC are taken from the CDC. The CDC offers 5-year average annual age-adjusted incidence and mortality rates for 27 cancer sites over the time periods 2011-2015 and 2012-2016 for download.
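For readers unfamiliar with the term, an age-adjusted rate is commonly computed by direct standardization: age-specific rates are weighted by the share of each age group in a standard population and then summed. The sketch below assumes that approach. The function name, the three age groups, the counts, the person-years, and the weights are invented for illustration; real calculations use many more age groups and a published standard population.

```r
# Hypothetical helper: directly standardized rate per 100,000.
# cases and person_years are totals over the whole 5-year window, so the
# result is an average annual rate for that window.
age_adjusted_rate <- function(cases, person_years, std_weights, per = 100000) {
  stopifnot(length(cases) == length(person_years),
            length(cases) == length(std_weights),
            abs(sum(std_weights) - 1) < 1e-8)
  age_specific <- cases / person_years   # rate within each age group
  sum(age_specific * std_weights) * per  # weighted sum, scaled per 100,000
}

# Invented inputs for three age groups (young, middle, old).
cases_5yr        <- c(10, 50, 200)
person_years_5yr <- c(100000, 100000, 50000)
std_weights      <- c(0.5, 0.3, 0.2)     # must sum to 1

rate <- age_adjusted_rate(cases_5yr, person_years_5yr, std_weights)
```

Because every region is weighted by the same standard population, rates computed this way can be compared across regions with different age structures.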

Cancer data for DC wards was specially requested from the DC Cancer Registry [5]. This subset contains 5-year average annual age-adjusted incidence and mortality rates for just the time period 2012-2016 and for only a few cancer sites that have enough reportable cases.

Risk & protective factors include many variables that come from categories such as socio-demographics, economic resources, environmental factors, housing and transportation, and health and risk behaviors. For this project, they can also be categorized based on source:

American Community Survey (ACS)
Robert Wood Johnson Foundation (RWJF)
Environmental Protection Agency (EPA)

ACS data for all counties and independent cities are retrieved by using the US Census Bureau application programming interface (API) [6]. All of these variables are 5-year estimates for 2013, 2014, 2015, 2016, and 2017.
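A request for ACS 5-year estimates can be sketched as below. Several things here are assumptions rather than details taken from the project: the endpoint pattern `https://api.census.gov/data/<year>/acs/acs5`, the helper name `build_acs_url`, and the variable `B01003_001E` (total population), which is used only as an example and is not necessarily one of the visualizer's variables. The actual download is left as a comment because it needs network access and the jsonlite package.

```r
# Hypothetical helper: build a Census API query for county-level ACS
# 5-year estimates in one state.
build_acs_url <- function(year, variables, state_fips, county_fips = "*") {
  sprintf(
    "https://api.census.gov/data/%d/acs/acs5?get=%s&for=county:%s&in=state:%s",
    as.integer(year),
    paste(c("NAME", variables), collapse = ","),
    paste(county_fips, collapse = ","),
    state_fips
  )
}

# One URL per year for the three Maryland counties in the catchment area
# (FIPS 017 = Charles, 031 = Montgomery, 033 = Prince George's; state 24).
urls <- vapply(
  2013:2017,
  function(y) build_acs_url(y, "B01003_001E", state_fips = "24",
                            county_fips = c("017", "031", "033")),
  character(1)
)

# Fetching is not run here. Assuming jsonlite is installed, something like
#   raw <- jsonlite::fromJSON(urls[5])
# would be expected to return a character matrix whose first row holds the
# column names.
```

Building the URLs in one place keeps the years and regions explicit, which makes the retrieval step easier to rerun when a new ACS release becomes available.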

ACS data for DC and its wards are available on the web from the DC Office of Planning for download [7].

RWJF data includes health and risk behaviors for certain counties at limited years. These are available on the web for download [8].

EPA data includes air quality index estimates for all counties and DC for the years 2013, 2014, 2015, 2016, and 2017. These are available on the web for download [9].

Consult the appendix for a detailed listing of each data source and variable, including documentation on the US Census Bureau API.

6 Data Analysis

Two simple yet important questions present themselves in the context of data pre-processing:

How should the data be retrieved?
How should the data be formatted?

In this project, R was used to handle both of these questions. However, in more complicated architectures the answers to these questions may warrant the use of dozens of technologies. In general, the tools to use are ones that are efficient and reliable, but they may differ by organization based on existing skillsets.

The answer to question #1 in this project involved a combination of downloading and reading flat files and accessing an API. Of all the data sources, the US Census Bureau was the only one that provided information via an API. One might assume that the general answer to question #1 is to always prefer an API (when available) over flat files, since it gives the user more control. However, flat files have one advantage over APIs: once downloaded, they do not change. APIs, on the other hand, are subject to redefinition. In fact, a handful of variables in the US Census Bureau API have variable keys that changed at a certain point in time, presumably due to API redefinition. This was caught only after thorough testing and required a code change to account for the different variable keys.
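One way to absorb such key changes is a small lookup table that records which key is valid for which years, so the retrieval code never hard-codes a key. The sketch below is illustrative only: the variable name `median_income`, the keys `OLD_KEY_001E` and `NEW_KEY_001E`, and the change between 2014 and 2015 are placeholders, not actual Census variable keys or dates.

```r
# Placeholder key history; real entries would come from the API documentation.
key_history <- data.frame(
  variable   = c("median_income", "median_income"),
  first_year = c(2013, 2015),
  last_year  = c(2014, 2017),
  api_key    = c("OLD_KEY_001E", "NEW_KEY_001E"),
  stringsAsFactors = FALSE
)

# Return the API key that is valid for a variable in a given year, and fail
# loudly when the table has no (or more than one) matching entry.
lookup_api_key <- function(variable, year, history = key_history) {
  hit <- history$variable == variable &
    history$first_year <= year & year <= history$last_year
  if (sum(hit) != 1) {
    stop("No unique API key for ", variable, " in ", year)
  }
  history$api_key[hit]
}

keys_by_year <- vapply(2013:2017,
                       function(y) lookup_api_key("median_income", y),
                       character(1))
```

Failing with an error for an uncovered year is a deliberate choice in this sketch: a silent fallback to an outdated key is exactly the kind of problem that is hard to catch without thorough testing.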

The answer to question #2 in this project was to process the raw data into tidy data, where information is stored in a data frame that has one variable per column and one observation per row. The main point of Hadley Wickham’s paper “Tidy Data” is that while many raw data sources are well-presented and easy for humans to consume, this does not make the data easy for computers to process [10]. Many of the raw data sources in this project had inconsistently named files, inconsistently named variables, or shifting cell locations, making it difficult to reuse processing functions. As a result, the R data pipeline is highly customized to each source, but it produces consistently formatted tidy files that the Shiny app can read.
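As a small illustration of what tidying means in practice, the sketch below reshapes a human-friendly wide table (one column per year) into tidy form (one row per region and year). The region names and values are made up, and base R is used to keep the example self-contained; in a tidyverse workflow, `tidyr::pivot_longer()` would be a natural equivalent.

```r
# Made-up wide table: easy to read, but the year is hidden in column names.
wide <- data.frame(
  region = c("Region A", "Region B"),
  `2016` = c(10.1, 12.3),
  `2017` = c(10.4, 12.0),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Stack the year columns so that each row is one observation.
year_cols <- setdiff(names(wide), "region")
tidy <- do.call(rbind, lapply(year_cols, function(y) {
  data.frame(region = wide$region,
             year   = as.integer(y),
             value  = wide[[y]],
             stringsAsFactors = FALSE)
}))
```

In the tidy version, adding another year means adding rows rather than columns, so downstream code that filters on `year` keeps working unchanged.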

Below is a high-level diagram of how the raw data sources flow to the visualizer. Click on and drag the nodes to read the node labels more easily.

The main advantage of having these tidy data files is that the Shiny application can read them efficiently. Only minor data processing is done within the application, which keeps load times short. In addition, the structure of these tidy files allows new data to be included in the future with minimal code updates within the application.
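The division of labor described above, with heavy processing in the pipeline and only light filtering in the app, can be sketched with a minimal Shiny app. This is not the visualizer's code: the in-memory `master` table stands in for a tidy master file, and its column names, the values, and the helper `filter_master` are hypothetical.

```r
library(shiny)

# Stand-in for a tidy master file. A real app would read its CSV once at
# startup (for example with read.csv()) rather than build it inline.
master <- data.frame(
  region   = c("Region A", "Region A", "Region B", "Region B"),
  variable = c("var_1", "var_2", "var_1", "var_2"),
  value    = c(1.5, 2.5, 3.5, 4.5),
  stringsAsFactors = FALSE
)

# The only processing done at request time: a cheap row filter.
filter_master <- function(data, variable) {
  data[data$variable == variable, c("region", "value")]
}

ui <- fluidPage(
  selectInput("variable", "Variable", choices = unique(master$variable)),
  tableOutput("values")
)

server <- function(input, output, session) {
  output$values <- renderTable(filter_master(master, input$variable))
}

app <- shinyApp(ui, server)
# runApp(app)  # uncomment to launch the app in a browser
```

Because the table is already tidy, supporting a new variable requires no change to `ui` or `server`: the dropdown choices are derived from the data itself.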

7 Key Findings

This section outlines some of the key observations made using the Shiny app. The full visualizer can be explored at https://gwcancerdatavisualizer.shinyapps.io/cancer_data_visualizer/.

This first widget below shows one of the visualizer’s cancer maps, which shows both cancer rates and risk & protective factors for the time period 2012-2016. Use the filters at the top to change the map view to any specific region, variable, and variable category. Any region on the map can be hovered over or clicked on for more information. The List View displays the same information as the Map View in a grid. The Show DC Wards toggle can be turned on to view values for the wards of DC.

Geographically, it is evident that cancer incidence and mortality are highest in the southeast of the GWCC Catchment Area. This includes the regions of Charles County, Prince George’s County, and DC (primarily wards 7 and 8). These regions have a high proportion of young, African-American families who are impacted by fewer education and employment opportunities, lower household incomes, and lower rates of private health insurance. As a result, GWCC is particularly concerned with addressing the needs of its cancer patients from these areas, especially wards 7 and 8.

This next widget focuses on cancer incidence and mortality rates, but allows for a few more data views. These include another time period of 2011-2015, a breakdown by race, and the ability to compare rates across regions or cancer sites. Use the filters and then click Update to see the plot change values. The Select Chart feature can be used to switch the graph between incidence and mortality rates.

In general, the most problematic cancer sites both in terms of incidence and mortality include female breast, prostate, lung, and colorectal cancer. African Americans suffer the most in comparison to other races across most cancer sites and regions of the GWCC Catchment Area.

This third widget shows risk & protective factors for all the other available years. This tool also allows the option to compare a variable between any two regions. Like the previous widget, click on one of the options in the Select Chart box to view a different set of plots.

A similar observation can be made here as in the cancer map, which is that the southeast region of the GWCC Catchment Area warrants the most concern in terms of access to healthcare, income, and employment. The tool exhibits these patterns consistently throughout the years.

8 Recommendations

The existing cancer data visualizer written in Shiny represents an initial project stage at the GWCC. There are several directions that the GWCC is interested in taking in the future.

First, although the cancer data visualizer is able to show recent cancer rates and risk & protective factors, it can be difficult to see exactly how some of these variables are changing over time in relation to each other. The widget below represents a portion of the visualizer’s Data Explorer tab. This is the visualizer’s newest prototype feature, and it is intended to answer some of these questions regarding variable relationships over time. Further work will be done on the visualizer, including enhancing this feature and introducing custom calculations throughout the app.

Second, building and maintaining a Shiny app requires some R experience. Although a Shiny app is very powerful in an organization that has an experienced Shiny developer, it can be difficult to share the development work with many people. The next phase of this project may include a transition to Tableau, which has many built-in features that are easier to learn for new contributors.

Lastly, although the GWCC is mainly focused on its catchment area, there is interest in expanding this type of visualizer to other regions in the country. Many of the sources used in this project already include data for regions outside of the GWCC Catchment Area, and much of the work here would constitute generalizing the application.

9 Summary

The main focus of this project was to present a technical solution to a data challenge within a cancer health organization. Cancer centers like the GWCC face issues such as data accuracy, reporting timeliness, reporting methodology, reproducibility, and patient privacy. Addressing these concerns requires technologies that provide reliable and consistent answers. Ultimately, the GWCC chose R Shiny as the tool to visualize cancer rates and risk & protective factors across the GWCC Catchment Area.

In general, health organizations may find value in using Shiny if they have enough experience with R development. Otherwise, alternative self-service tools such as Tableau can ease the transition to report development without a large investment in learning a new programming language. Either approach will enable organizations to report their data more accurately, quickly, and transparently.

10 Biography

Derek Funk is a graduate student in the Data Science Program at The George Washington University. He is interested in data visualization, interactive data science, and software development. He has worked as an actuary, business intelligence analyst, and now as a software consultant. In his free time, he enjoys creating personal apps, playing soccer, and catching up on Netflix.

Dr. Nima Zahadat is a professor of data science, information systems security, and digital forensics. His research focuses on the Internet of Things, data mining, information visualization, mobile security, security policy management, and memory forensics. He has been teaching since 2001 and has developed and taught over 100 topics. Dr. Zahadat has also been a consultant with federal government agencies, the US Air Force, Navy, Marines, and the Coast Guard. He enjoys teaching, biking, reading, and writing.

11 References

[1] “USCS Data Visualizations”, Gis.cdc.gov, 2020. [Online]. Available: https://gis.cdc.gov/Cancer/USCS/DataViz.html. [Accessed: 04- May- 2020].

[2] “Scan 360 | Cancer Data”, Scan360.com, 2020. [Online]. Available: https://www.scan360.com/cancer-data. [Accessed: 04- May- 2020].

[3] “HIPAA for Professionals”, HHS.gov, 2020. [Online]. Available: https://www.hhs.gov/hipaa/for-professionals/index.html. [Accessed: 04- May- 2020].

[4] “U.S. Cancer Statistics Data Visualizations Tool Technical Notes | CDC”, Cdc.gov, 2020. [Online]. Available: https://www.cdc.gov/cancer/uscs/technical_notes/index.htm. [Accessed: 04- May- 2020].

[5] “Cancer Registry | doh”, Dchealth.dc.gov, 2020. [Online]. Available: https://dchealth.dc.gov/service/cancer-registry-0. [Accessed: 04- May- 2020].

[6] US Census Bureau, “Developers”, The United States Census Bureau, 2020. [Online]. Available: https://www.census.gov/developers/. [Accessed: 04- May- 2020].

[7] “American Community Survey (ACS) Estimates | op”, Planning.dc.gov, 2020. [Online]. Available: https://planning.dc.gov/page/american-community-survey-acs-estimates. [Accessed: 04- May- 2020].

[8] “How Healthy is your County? | County Health Rankings”, County Health Rankings & Roadmaps, 2020. [Online]. Available: https://www.countyhealthrankings.org/. [Accessed: 04- May- 2020].

[9] “Download Files | AirData | US EPA”, Aqs.epa.gov, 2020. [Online]. Available: https://aqs.epa.gov/aqsweb/airdata/download_files.html. [Accessed: 04- May- 2020].

[10] H. Wickham, “Tidy Data”, Journal of Statistical Software, vol. 59, no. 10, 2014. Available: 10.18637/jss.v059.i10.

[11] “DataTables Options”, Rstudio.github.io, 2020. [Online]. Available: https://rstudio.github.io/DT/options.html. [Accessed: 04- May- 2020].

[12] “GeoJSON and KML data for the United States”, Eric Celeste, 2020. [Online]. Available: https://eric.clst.org/tech/usgeojson/. [Accessed: 04- May- 2020].

[13] “Shiny”, Shiny.rstudio.com, 2020. [Online]. Available: https://shiny.rstudio.com/. [Accessed: 04- May- 2020].

[14] “Tidyverse”, Tidyverse.org, 2020. [Online]. Available: https://www.tidyverse.org/. [Accessed: 04- May- 2020].

[15] H. Wickham and G. Grolemund, R for Data Science. O’Reilly, 2017.

12 Appendix

12.1 Data

12.1.1 List of Data Sources

12.1.2 List of Variables

12.1.3 List of Cancer Sites

12.2 How to Reproduce this Entire Project

Visit https://github.com/Derek-Funk/GW-Cancer-Data-Visualizer for all code and supporting files required to reproduce this project.

12.2.1 How to Reproduce the Data Pre-Processing

There are two data pipelines:

Process data from the CDC and DC Cancer Registry into one master cancer data file
Process data from the ACS, RWJF, and EPA into one master risk & protective factors data file
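The final step of the second pipeline, stacking per-source tidy tables into one master file, might look like the sketch below. It is not taken from the project's scripts: the column names, the single-row example tables, and the output file name are hypothetical, and the file is written to a temporary directory so the example has no side effects.

```r
# One tiny made-up tidy table per source, all with the same columns.
acs  <- data.frame(region = "Region A", year = 2017, variable = "var_acs",
                   value = 1.0, stringsAsFactors = FALSE)
rwjf <- data.frame(region = "Region A", year = 2017, variable = "var_rwjf",
                   value = 2.0, stringsAsFactors = FALSE)
epa  <- data.frame(region = "Region A", year = 2017, variable = "var_epa",
                   value = 3.0, stringsAsFactors = FALSE)

sources <- list(ACS = acs, RWJF = rwjf, EPA = epa)

# Tag each table with its source, then stack them into one master table.
master <- do.call(rbind, lapply(names(sources), function(src) {
  df <- sources[[src]]
  df$source <- src
  df[c("source", "region", "year", "variable", "value")]
}))

# Write the combined table as a flat file that an app could read.
out_path <- file.path(tempdir(), "master_example.csv")
write.csv(master, out_path, row.names = FALSE)
```

Keeping a `source` column in the master file makes it possible to trace any value shown in the app back to the dataset it came from.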

These pipelines can be reproduced as follows:

Download the Data folder.
Open the preProcessing_createCancerMasterDataFile_v5.R file.

Replace all file paths with relevant local paths.
Run the entire script. This script takes about 16 minutes.

Open the preProcessing_createNonCancerMasterDataFile_v11.R file.

Replace all file paths with relevant local paths.
Run the entire script. This script takes about 25 minutes.

The final master files are titled masterDataFile_cancer_countyWard.csv and masterDataFile_nonCancer_countyWard.csv.

NOTE: If you wish to retrieve the raw data files yourself, consult the List of Data Sources appendix subsection.

12.2.2 How to Reproduce the Shiny Application

Download the Shiny App folder.
In RStudio, open either the global.R or app.R file. Click ‘Run App’ in the upper-right of the Source pane.

NOTE: These steps are to run the Shiny app on a local machine. If you wish to reproduce this app on https://www.shinyapps.io/, visit https://docs.rstudio.com/shinyapps.io/index.html on how to deploy the app to the cloud.

12.2.3 How to Reproduce this Interactive Paper

In the Paper folder, download the R Markdown file.
In RStudio, knit this document (File -> Knit Document). This code will generate the final HTML paper.

R Shiny as an Effective Tool for Reproducible Dissemination of Public Health Data