Cancer Data Analysis

Vaidyanathan Subramanian, Rahul Singh

2022-06-04

Cancer Disease in the United States

Cancer is “a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body” (National Cancer Institute 2021). There are many different types of cancer, and those different types are usually named for the part of the body where the cancer starts. There is a one in three chance that you will have to deal with cancer somewhere in your body in your lifetime (American Cancer Society 2021).

Purpose

Current and historical industrial activity virtually always releases toxic chemicals into the environment, some of which are known or suspected carcinogens (National Institutes of Health 2018). While plant operators can implement technical and operational methods to reduce these emissions, and governments can use legislation to encourage plant operators to do so, no industrial process can be completely clean, and industrially produced toxins remain an unavoidable part of life in a modern society.

Potential links between industrial operations and cancer hot spots can be investigated using publicly available geospatial data. While the actual causes of individual tumors are complex and can rarely be established with certainty, aggregated data analysis can provide important insights and help direct further (and sometimes scarce) investigative resources.

We would like to analyze cancer registry data to see whether there is a link between geographic location, age, behavioral risk factors, and industrial toxins in the environment.

Hypothesis

Exposure to certain chemicals in the environment, at home, and at work may contribute to an individual’s risk of developing cancer. We use spatially detailed cancer registry incidence data to identify cancer hot spots in the state of Illinois. The variables to consider are:

Data

Reference: http://www.idph.state.il.us/iscrstats/statebyrace/Show-Statebyrace-Table.aspx

The Illinois State Cancer Registry provides both incidence and mortality statistics, although mortality data is published only at the state level. To protect privacy and promote accurate reporting and analysis, data is always aggregated (combined) into geographic areas.

Reference: https://data.census.gov/cedsci/profile?g=0400000US17

The Census Bureau’s American Community Survey (ACS) is a continuous survey that provides information about people in the United States on an annual basis, in addition to the baseline information collected in the decennial census. Researchers who need information about the general population frequently turn to the ACS.

Analytic Scope/Methods

Exploratory Data Analysis

##        Year ZIP.Code         Cancer.Group Male.Count Female.Count Total.Count
## 1 2014-2018    60002 All Cancers Combined        379          384         763
## 2 2014-2018    60004 All Cancers Combined        695          888        1583
## 3 2014-2018    60005 All Cancers Combined        413          528         941
## 4 2014-2018    60006 All Cancers Combined          1            4           5
## 5 2014-2018    60007 All Cancers Combined        519          641        1160
## 6 2014-2018    60008 All Cancers Combined        297          322         619
##    X
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
##  [1] "All Cancers Combined"  "All Other Cancers"     "Breast-in situ"       
##  [4] "Breast-invasive"       "Cervix"                "Colorectal"           
##  [7] "Leukemias & Lymphomas" "Lung & Bronchus"       "Nervous System"       
## [10] "Oral Cavity"           "Prostate"              "Urinary System"
## [1] "2014-2018" "2009-2013" "2004-2008" "1999-2003" "1994-1998"
##   ZIP.Code Total.Count
## 1    60002         763
## 2    60004        1583
## 3    60005         941
## 4    60006           5
## 5    60007        1160
## 6    60008         619
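
Listings like those above are the kind of output that head() and unique() produce. As a minimal sketch, assuming a data frame with the column names shown in the listing (the object names `cancer` and `latest` and most of the values are made up for illustration), the exploration and filtering steps might look like this:

```r
# Made-up rows shaped like the registry extract shown above (illustrative only)
cancer <- data.frame(
  Year = c("2014-2018", "2014-2018", "2009-2013", "2014-2018"),
  ZIP.Code = c("60002", "60002", "60002", "60004"),
  Cancer.Group = c("All Cancers Combined", "Colorectal",
                   "All Cancers Combined", "All Cancers Combined"),
  Total.Count = c(763, 61, 700, 1583),
  stringsAsFactors = FALSE
)

# Which cancer groups and reporting periods are present?
unique(cancer$Cancer.Group)
unique(cancer$Year)

# Keep one row per ZIP Code: all cancers, most recent five-year period
latest <- subset(cancer,
                 Cancer.Group == "All Cancers Combined" & Year == "2014-2018",
                 select = c(ZIP.Code, Total.Count))
head(latest)
```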

The American Community Survey data supplied above includes both spatial polygons and demographic information for each ZIP Code area in the cancer data.

plot(acs_data["Median.Age"], main="Median Age by Zip Code", border=NA, pal=colorRampPalette(c("brown", "lightblue", "red")))

Join the Cancer and ACS Data

Most cancers occur where there are more people, as shown by the map of cancer counts above. Counts therefore mostly reflect where people live and are not very informative on their own. Rates are usually expressed as percentages (per 100 residents). For comparatively rare conditions like cancer, however, disease rates are reported in annual cases per 100,000 residents (per 100k), because the figures are easier to read as whole numbers (e.g. 200 per 100k is easier to read than 0.2 percent).

Because the ACS data contains population values, we can determine the rate by dividing the case count by population, multiplying by 100,000 (to get the rate per 100k), and dividing by five (since the counts are totals over a five-year period).
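
The join and the rate calculation can be sketched in base R as follows. The ZIP Code labels, counts, and populations are made up, and the object names `counts`, `acs`, and `zips` are placeholders rather than the names used in the analysis:

```r
# Made-up five-year counts and populations for three ZIP Codes (not real data)
counts <- data.frame(ZIP.Code = c("A", "B", "C"),
                     Total.Count = c(750, 1500, 5))
acs <- data.frame(ZIP.Code = c("A", "B", "C"),
                  Total.Population = c(25000, 50000, 400))

# Join on ZIP Code, then convert five-year counts to annual cases per 100k
zips <- merge(counts, acs, by = "ZIP.Code")
zips$Crude.Rate <- zips$Total.Count / zips$Total.Population * 100000 / 5
zips
```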

Because cancer is uncommon and many ZIP Codes have small populations, one or two cancer cases in a ZIP Code can cause the cancer rate to skyrocket, while a lack of cases can make an area appear to be a cancer-free haven. This is the small numbers problem, also known as variance instability.

An X/Y scatter plot comparing population and crude rates reveals this variance instability in sparsely populated areas. The most extreme rates (high and low) occur in low-population ZIP Codes, at the bottom of the chart.
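
A small simulation can illustrate the effect. This is a sketch under stated assumptions, not part of the analysis: every simulated ZIP Code is given the same true annual risk (500 per 100k, an arbitrary choice), case counts are drawn from a Poisson distribution, and the two population sizes are invented.

```r
set.seed(1)

true_rate  <- 500 / 100000                         # same annual risk everywhere (assumed)
size       <- rep(c("small", "large"), each = 200) # two hypothetical groups of ZIP Codes
population <- ifelse(size == "small", 500, 50000)

# Five-year case counts, then annual crude rates per 100k
cases      <- rpois(length(population), lambda = population * true_rate * 5)
crude_rate <- cases / population * 100000 / 5

# Spread of the crude rates within each group
spread <- tapply(crude_rate, size, sd)
spread
```

Although the underlying risk is identical, the crude rates in the small ZIP Codes scatter far more widely than those in the large ones.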

We don’t want to dismiss areas with high crude rates as outliers, or use a log transform to indiscriminately squeeze all high-rate areas toward normality, because they may be areas of concern.

Crude rates also reflect how old an area’s residents are, as the linear model in the next section shows.

In epidemiology, a common technique for addressing this issue is to adjust rates based on the proportions of residents in different age groups, so that comparisons across areas reflect factors other than whether some areas have disproportionately high numbers of older (or younger) residents.

Linear Model

Cancer risk rises as you become older, as it does with most diseases (National Institutes of Health 2014).

On an X/Y scatter plot comparing the percent of people aged 65 and up to the (estimated) crude rate, we can discern a broad upward pattern from left to right.

A basic linear model of cancer rates as a function of age can be created using the lm() function (weighted by population). To draw a regression line on the plot, the model output is passed to the abline() function.
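
A minimal sketch of that model, using a made-up data frame (`zips` and all of its values are illustrative; only the column names echo the ones used in this report):

```r
# Made-up ZIP Code summary values (illustrative only)
zips <- data.frame(
  Percent.65.Plus  = c(8, 10, 12, 15, 18, 22, 25),
  Crude.Rate       = c(420, 450, 500, 540, 600, 650, 720),
  Total.Population = c(60000, 45000, 30000, 20000, 8000, 3000, 1200)
)

# Weighted least squares: populous ZIP Codes count for more in the fit
model <- lm(Crude.Rate ~ Percent.65.Plus, data = zips, weights = Total.Population)
coef(model)

if (!interactive()) pdf(NULL)  # avoid writing a plot file when run as a script
plot(zips$Percent.65.Plus, zips$Crude.Rate,
     xlab = "Percent 65+", ylab = "Crude rate per 100k")
abline(model, col = "red")     # regression line from the fitted model
```

Weighting by population keeps a handful of sparsely populated ZIP Codes with unstable rates from dominating the fitted line.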

As a result, places with high cancer crude rates are more likely to have a greater population of older individuals, obscuring other concerns that may be boosting cancer risk in different areas.

A precise breakdown of the ages of those diagnosed with cancer, which is not publicly available in Illinois, is required for rigorous calculation of age-adjusted rates. The lm() function’s linear model result, on the other hand, can be used to construct an estimated age adjustment.
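
One simple way to build such an estimate is to subtract the part of each rate that the model attributes to an area being older or younger than the population-weighted average. The sketch below illustrates that idea on made-up data. It is an assumption about how a regression-based adjustment could work, not necessarily the calculation behind the figures in this report.

```r
# Made-up ZIP Code summary values (illustrative only)
zips <- data.frame(
  Percent.65.Plus  = c(8, 10, 12, 15, 18, 22, 25),
  Crude.Rate       = c(420, 450, 500, 540, 600, 650, 720),
  Total.Population = c(60000, 45000, 30000, 20000, 8000, 3000, 1200)
)

model <- lm(Crude.Rate ~ Percent.65.Plus, data = zips, weights = Total.Population)
slope <- coef(model)[["Percent.65.Plus"]]

# Reference age structure: population-weighted mean share of residents 65+
reference_age <- weighted.mean(zips$Percent.65.Plus, zips$Total.Population)

# Remove the modeled age effect relative to the reference age structure
zips$Age.Adjusted.Rate <- zips$Crude.Rate -
  slope * (zips$Percent.65.Plus - reference_age)

weighted.mean(zips$Age.Adjusted.Rate, zips$Total.Population)
```

With this particular construction the adjusted rates no longer trend with Percent.65.Plus, and their weighted mean equals that of the crude rates.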

## [1] 485.3436

The weighted mean of this adjusted rate for both sexes across the state is consistent with the published rates of 503.6 for men and 444.5 for women.

We often wish to know what the usual rates are for health issues so that we can pay extra attention to groups of people or locations where the rates are higher than normal. A simple mean of the ZIP Code rates, however, may not be valid because the population is not evenly distributed across ZIP Codes. One approach for dealing with this is the weighted.mean() function, which increases or decreases each value’s contribution to the mean based on a weighting value. Here we can use the Total.Population variable as the weight.
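
The difference between the two means is easy to see on a toy example (the rates and populations below are made up):

```r
rates      <- c(600, 600, 250)      # made-up rates per 100k
population <- c(25000, 50000, 400)  # made-up populations

simple   <- mean(rates)                       # every ZIP Code counts equally
weighted <- weighted.mean(rates, population)  # ZIP Codes count by number of residents

c(simple = simple, weighted = weighted)
```

The tiny third ZIP Code pulls the simple mean down sharply but barely moves the weighted mean.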

## [1] 562.283
## 
##  Pearson's Chi-squared test
## 
## data:  Zip_Codes$Crude.Rate and Zip_Codes$Total.Population
## X-squared = 1668429, df = 1655213, p-value = 2.167e-13
## [1] 558.3118
## 
##  Pearson's Chi-squared test
## 
## data:  Zip_Codes$Estimated.Rate and Zip_Codes$Total.Population
## X-squared = 1722875, df = 1714104, p-value = 1.126e-06
## [1] 485.3436
## 
##  Pearson's Chi-squared test
## 
## data:  Zip_Codes$Age.Adjusted.Rate and Zip_Codes$Total.Population
## X-squared = 1703402, df = 1702155, p-value = 0.2495

Correlations

##              Estimated.Rate                  Median.Age 
##                       1.000                       0.487 
##             Percent.65.Plus    Old.age.Dependency.Ratio 
##                       0.451                       0.414 
##                  Crude.Rate           Age.Adjusted.Rate 
##                       0.412                       0.407 
##        Age.Dependency.Ratio          Percent.Homeowners 
##                       0.324                       0.301 
## Percent.Drove.Alone.to.Work            Percent.Disabled 
##                       0.288                       0.278

The strongest positive correlations, such as those with Median.Age and Percent.65.Plus, are with variables associated with higher percentages of elderly residents.

Conversely, the strongest negative correlations are with variables associated with higher percentages of younger residents, such as school enrollment, renters, and women of reproductive age.

Rates are positively correlated with ZIP Code area size and negatively correlated with overall population and population density. This could indicate a trend toward higher rates in low-population rural areas, but given that rural areas are also older, it could just be another age-related association.
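
A ranked list like the one above can be produced with cor() and sort(). The sketch below uses a made-up data frame; `Percent.Renters` is a hypothetical column name chosen for illustration.

```r
# Made-up variables for seven ZIP Codes (illustrative only)
zips <- data.frame(
  Estimated.Rate  = c(420, 450, 500, 540, 600, 650, 720),
  Median.Age      = c(31, 34, 36, 40, 43, 47, 52),
  Percent.Renters = c(62, 55, 48, 40, 33, 25, 18)
)

# Correlate every numeric column with the rate, then rank the results
correlations <- cor(zips)[, "Estimated.Rate"]
round(sort(correlations, decreasing = TRUE), 3)
```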

Hot Spots

Areas with similar disease rates tend to cluster together in space. These clusters could be caused by naturally occurring or man-made carcinogens in the environment, or by shared social norms around risky behaviors (such as smoking or diet). The randomness of disease occurrence can make such clustering difficult to detect.

The Getis-Ord Gi* statistic identifies high-valued areas that are surrounded by other high-valued areas. Unlike visual inspection or approaches like kernel density analysis, this method produces p-values by comparing all areas statistically to determine how likely it is that clusters of high values in specific locations occurred by chance. This systematic application of statistics allows for a higher level of confidence (but not absolute certainty) that the patterns are more than simply a visual illusion and indicate clustering worth exploring.
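
To show what the statistic computes, here is a base R sketch of the standard Gi* z-score on a toy grid. The grid, its values, and the choice of binary weights over each cell plus its eight surrounding cells are all assumptions made for illustration. An analysis of real ZIP Code polygons would normally build neighbors from the geometry with a spatial package instead.

```r
# Toy 6 x 6 grid of rates with a high-valued block in one corner (made-up data)
rates <- matrix(100, nrow = 6, ncol = 6)
rates[1:3, 1:3] <- 200

# Gi* z-score with binary weights: each cell's neighborhood is the cell itself
# plus the surrounding cells that exist on the grid
gi_star <- function(x) {
  n     <- length(x)
  x_bar <- mean(x)
  s     <- sqrt(sum(x^2) / n - x_bar^2)
  z     <- matrix(NA_real_, nrow(x), ncol(x))
  for (i in seq_len(nrow(x))) {
    for (j in seq_len(ncol(x))) {
      rows  <- max(1, i - 1):min(nrow(x), i + 1)
      cols  <- max(1, j - 1):min(ncol(x), j + 1)
      w_sum <- length(rows) * length(cols)  # binary weights: sum(w) equals sum(w^2)
      local_sum <- sum(x[rows, cols])
      z[i, j] <- (local_sum - x_bar * w_sum) /
        (s * sqrt((n * w_sum - w_sum^2) / (n - 1)))
    }
  }
  z
}

z_scores <- gi_star(rates)
p_values <- 2 * pnorm(-abs(z_scores))  # two-sided p-values from the normal approximation
round(z_scores, 2)
```

Cells inside the high-valued block get large positive z-scores, while cells far from it get negative ones.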

It’s important to remember that being in a hot spot doesn’t mean that every area within it has a high rate. Rather, the overall level in a hot spot is high, even though some places inside the hot spot may have low rates. This box plot shows the distribution of rates among the five hot/cold spot categories: although the values in the 95 percent hot spots (category 5) are generally greater than the values in the 95 percent cold spots (category 1), rates vary greatly within each category.

## 
##  Pearson's Chi-squared test
## 
## data:  getis.ord$Age.Adjusted.Rate and getis.ord$Hot.Spots
## X-squared = 5464, df = 5460, p-value = 0.4822

Summary

Future Research

References