Vaidyanathan Subramanian, Rahul Singh
2022-06-04
Cancer is “a disease in which some of the body’s cells grow uncontrollably and spread to other parts of the body” (National Cancer Institute 2021). There are many different types of cancer, and those different types are usually named for the part of the body where the cancer starts. There is a one in three chance that you will have to deal with cancer somewhere in your body in your lifetime (American Cancer Society 2021).
Cancer is the second leading cause of death in the United States, exceeded only by heart disease.
One of every four deaths in the United States is due to cancer.
In the United States in 2018, 1,708,921 new cancer cases were reported and 599,265 people died of cancer.
For every 100,000 people, 436 new cancer cases were reported and 149 people died of cancer.
2018 is the latest year for which incidence data are available.
Current and historical industrial activity almost always releases toxic chemicals into the environment, some of which are known or suspected carcinogens (National Institutes of Health 2018). While plant operators can implement technical and operational methods to reduce these emissions, and governments can use legislation to encourage plant operators to do so, no industrial process can be completely clean, and industrially produced toxins remain an unavoidable part of life in a modern society.
Potential linkages between industrial operations and cancer hot spots can be investigated using publicly available geospatial data. While the actual causes of individual tumors are complex and can rarely be known with certainty, aggregated data analysis can provide important insights and help direct further (and often scarce) investigatory resources where they are most needed.
We’d like to analyze cancer registry data to see if there’s a link between geographic location, age, behavioral risk factors, and industrial toxins in the environment.
Exposure to certain chemicals in the environment, at home, and at work may contribute to an individual’s risk of developing cancer. We use spatially detailed cancer registry incidence data to identify cancer hot spots in the state of Illinois. The variables to consider are:
Reference: http://www.idph.state.il.us/iscrstats/statebyrace/Show-Statebyrace-Table.aspx
Both incidence and mortality statistics are available from the Illinois State Cancer Registry, although mortality data are published only on a state-wide basis. To protect privacy and promote accurate reporting and analysis, data are always aggregated (combined) into geographic areas.
Reference: https://data.census.gov/cedsci/profile?g=0400000US17
The American Community Survey (ACS) of the Census Bureau is a continuous survey that provides information about people in the United States on an annual basis, in addition to the baseline information collected in the decennial census. Researchers who need information about the general population frequently turn to the ACS.
##        Year ZIP.Code         Cancer.Group Male.Count Female.Count Total.Count  X
## 1 2014-2018    60002 All Cancers Combined        379          384         763 NA
## 2 2014-2018    60004 All Cancers Combined        695          888        1583 NA
## 3 2014-2018    60005 All Cancers Combined        413          528         941 NA
## 4 2014-2018    60006 All Cancers Combined          1            4           5 NA
## 5 2014-2018    60007 All Cancers Combined        519          641        1160 NA
## 6 2014-2018    60008 All Cancers Combined        297          322         619 NA
## [1] "All Cancers Combined" "All Other Cancers" "Breast-in situ"
## [4] "Breast-invasive" "Cervix" "Colorectal"
## [7] "Leukemias & Lymphomas" "Lung & Bronchus" "Nervous System"
## [10] "Oral Cavity" "Prostate" "Urinary System"
## [1] "2014-2018" "2009-2013" "2004-2008" "1999-2003" "1994-1998"
##   ZIP.Code Total.Count
## 1    60002         763
## 2    60004        1583
## 3    60005         941
## 4    60006           5
## 5    60007        1160
## 6    60008         619
The American Community Survey data supplied above includes both spatial polygons and demographic information for each ZIP Code area in the cancer data.
plot(acs_data["Median.Age"], main="Median Age by Zip Code", border=NA, pal=colorRampPalette(c("brown", "lightblue", "red")))

Most cancers occur where there are more people, as a map of cancer counts shows. Raw counts mostly tell us where more people live, so they are not very informative on their own. In general, rates are expressed as percentages (per 100 residents). For comparatively rare health conditions like cancer, disease rates are instead reported in annual cases per 100,000 residents (per 100k), because the figures are easier to read as whole numbers (e.g. 200 per 100k is easier to read than 0.2 percent).
Because the ACS data contain population values, we can estimate the annual crude rate by dividing the case count by the population, multiplying by 100,000 (to get the rate per 100k), and dividing by five (since the counts are totals over a five-year period).
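The calculation can be sketched as follows. This is a minimal illustration rather than the exact code behind this analysis: the case counts come from the registry table shown earlier, but the population figures are invented for the example, and the assumption that the result is stored in a column called Estimated.Rate is ours.

```r
# Illustrative data: counts are from the registry table above,
# populations are hypothetical (not real ACS values).
zip_codes <- data.frame(
  ZIP.Code         = c("60002", "60004", "60005"),
  Total.Count      = c(763, 1583, 941),       # cases over 2014-2018
  Total.Population = c(24000, 51000, 30000)   # hypothetical populations
)

years <- 5  # the registry counts are totals over a five-year period

# Annual crude rate per 100,000 residents
zip_codes$Estimated.Rate <- zip_codes$Total.Count /
  zip_codes$Total.Population * 100000 / years

print(zip_codes)
```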
Because cancer is uncommon and many ZIP Codes have small populations, one or two cancer cases in a ZIP Code can cause the cancer rate to skyrocket, while an absence of cases can make an area appear to be a cancer-free haven. This is known as the small numbers problem, or variance instability.
An X/Y scatter plot comparing population and crude rates reveals this distortion of variance in sparsely inhabited areas. The most extreme rates (both high and low) are concentrated at the low-population end of the chart.
We don’t want to dismiss areas with high crude rates as outliers, or use a log transform to indiscriminately squeeze all high-rate areas toward normality, because they may be areas of concern.
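The effect can be reproduced with simulated data. The sketch below assumes every hypothetical ZIP Code has the same underlying annual rate (500 per 100k, an arbitrary value chosen for illustration) and draws five-year case counts from a Poisson distribution. Any spread in the resulting rates is therefore pure chance, and it is far larger where populations are small.

```r
set.seed(42)  # make the simulation reproducible

true_rate <- 500 / 100000  # assumed annual rate, identical everywhere
years <- 5

# 1,000 hypothetical ZIP Codes with populations from 100 to 100,000
population <- round(10^runif(1000, min = 2, max = 5))

# Simulated five-year case counts under the same underlying risk
cases <- rpois(length(population), lambda = population * true_rate * years)

# Annual crude rate per 100,000 residents
crude_rate <- cases / population * 100000 / years

# Compare the spread of rates in small and large ZIP Codes
sd_small <- sd(crude_rate[population < 1000])
sd_large <- sd(crude_rate[population >= 10000])
print(c(small = sd_small, large = sd_large))

# Scatter plot: the extreme rates sit at the low-population end
plot(population, crude_rate, log = "x",
     xlab = "Population (log scale)", ylab = "Crude rate per 100k")
```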
Cancer risk rises with age, as it does for most diseases (National Institutes of Health 2014). On an X/Y scatter plot comparing the percentage of residents aged 65 and over with the (estimated) crude rate, we can see a general upward pattern from left to right, indicating that this is the case.
The lm() function can be used to fit a basic linear model of cancer rates as a function of age (weighted by population). The model output is then passed to the abline() function to draw a regression line on the plot.
As a result, areas with high crude cancer rates are more likely to be areas with a larger share of older residents, which obscures other factors that may be raising cancer risk in different areas.
In epidemiology, a common technique for addressing this issue is to adjust rates based on the proportions of residents in different age groups, so that rates can be compared across areas in a way that accounts for factors other than whether some areas have disproportionately high numbers of older (or younger) residents.
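A population-weighted model of rate against age, with its regression line, might look like the sketch below. The data are simulated for illustration; the column names Percent.65.Plus, Estimated.Rate, and Total.Population follow those used in this analysis, but the values and the simulated relationship are assumptions.

```r
set.seed(1)

# Hypothetical ZIP Code data (simulated, not registry or ACS values)
n <- 200
zip_codes <- data.frame(
  Percent.65.Plus  = runif(n, 5, 30),
  Total.Population = round(10^runif(n, 3, 5))
)
# Simulated rate that rises with the share of older residents, plus noise
zip_codes$Estimated.Rate <- 300 + 15 * zip_codes$Percent.65.Plus +
  rnorm(n, sd = 40)

# Linear model of rate as a function of age, weighted by population
model <- lm(Estimated.Rate ~ Percent.65.Plus, data = zip_codes,
            weights = Total.Population)
print(coef(model))

# Scatter plot with the fitted regression line
plot(zip_codes$Percent.65.Plus, zip_codes$Estimated.Rate,
     xlab = "Percent aged 65+", ylab = "Estimated crude rate per 100k")
abline(model, col = "red", lwd = 2)
```

Weighting by population keeps the many small, noisy ZIP Codes from dominating the fit.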
Rigorous calculation of age-adjusted rates requires a precise breakdown of the ages of those diagnosed with cancer, which is not publicly available in Illinois. However, the linear model fitted with lm() can be used to construct an estimated age adjustment.
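One way such an estimate could be built is sketched below. It is an assumption about the general approach, not necessarily the adjustment behind the figures reported here: each area's rate is shifted by the model slope times the difference between its share of residents aged 65 and over and a reference share. The reference used in this sketch (the population-weighted mean share) is our own choice, and a different reference would move all adjusted rates up or down together. The data values are invented.

```r
# Hypothetical ZIP Code data (invented values for illustration)
zip_codes <- data.frame(
  Percent.65.Plus  = c(10, 15, 20, 25),
  Estimated.Rate   = c(420, 500, 560, 650),
  Total.Population = c(40000, 25000, 20000, 5000)
)

# Population-weighted linear model of rate as a function of age
model <- lm(Estimated.Rate ~ Percent.65.Plus, data = zip_codes,
            weights = Total.Population)
slope <- coef(model)[["Percent.65.Plus"]]

# Reference age structure (assumption: the weighted mean share of 65+)
reference_age <- weighted.mean(zip_codes$Percent.65.Plus,
                               zip_codes$Total.Population)

# Remove the portion of each rate that the model attributes to age
zip_codes$Age.Adjusted.Rate <- zip_codes$Estimated.Rate -
  slope * (zip_codes$Percent.65.Plus - reference_age)

print(zip_codes)
```

With this choice of reference, older areas are adjusted downward and younger areas upward, while the population-weighted mean rate is left unchanged.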
## [1] 485.3436
The weighted mean of this adjusted rate for both sexes across the state is consistent with the published rates of 503.6 for men and 444.5 for women.
We often wish to know what the usual rates are for health issues so that we can pay extra attention to groups of people or locations where the rates are higher than normal. A simple mean of the ZIP Code rates, however, may not be valid because the population is not evenly distributed across ZIP Codes. The weighted.mean() function, which increases or decreases each value’s contribution to the mean based on a weighting value, is one way of dealing with this. Here we can use the Total.Population variable as the weight.
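The difference between the two means is easy to see on a toy example. The values below are invented: one very small ZIP Code with an extreme rate sits alongside two large ones with ordinary rates.

```r
# Hypothetical rates and populations (invented for illustration)
zip_codes <- data.frame(
  Estimated.Rate   = c(900, 500, 450),
  Total.Population = c(500, 40000, 60000)
)

# Simple mean: every ZIP Code counts equally, so the tiny area
# with the extreme rate pulls the average upward
simple_mean <- mean(zip_codes$Estimated.Rate)

# Weighted mean: each rate contributes in proportion to its population
weighted <- weighted.mean(zip_codes$Estimated.Rate,
                          zip_codes$Total.Population)

print(c(simple = simple_mean, weighted = weighted))
```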
## [1] 562.283
##
## Pearson's Chi-squared test
##
## data: Zip_Codes$Crude.Rate and Zip_Codes$Total.Population
## X-squared = 1668429, df = 1655213, p-value = 2.167e-13
## [1] 558.3118
##
## Pearson's Chi-squared test
##
## data: Zip_Codes$Estimated.Rate and Zip_Codes$Total.Population
## X-squared = 1722875, df = 1714104, p-value = 1.126e-06
## [1] 485.3436
##
## Pearson's Chi-squared test
##
## data: Zip_Codes$Age.Adjusted.Rate and Zip_Codes$Total.Population
## X-squared = 1703402, df = 1702155, p-value = 0.2495
##              Estimated.Rate                  Median.Age
##                       1.000                       0.487
##             Percent.65.Plus    Old.age.Dependency.Ratio
##                       0.451                       0.414
##                  Crude.Rate           Age.Adjusted.Rate
##                       0.412                       0.407
##        Age.Dependency.Ratio          Percent.Homeowners
##                       0.324                       0.301
## Percent.Drove.Alone.to.Work            Percent.Disabled
##                       0.288                       0.278
The strongest positive correlations are with variables that reflect higher percentages of older residents, such as Median.Age and Percent.65.Plus.
Conversely, the strongest negative correlations are with variables that reflect higher percentages of younger residents, such as school enrollment, renters, and women of reproductive age.
Rates are positively correlated with ZIP Code area size and negatively correlated with total population and population density. This could indicate a trend toward higher rates in low-population rural areas, but since rural areas also tend to have older populations, it could simply be another age-related association.
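A ranked list of correlations like the one discussed above can be produced by correlating every numeric variable with the rate and sorting the result. The sketch below uses simulated data; Percent.Renters is a hypothetical column name introduced for illustration, and the relationships between the variables are assumptions.

```r
set.seed(7)

# Simulated demographic data (not real ACS values)
n <- 300
acs <- data.frame(Percent.65.Plus = runif(n, 5, 30))
acs$Median.Age      <- 25 + 0.8 * acs$Percent.65.Plus + rnorm(n, sd = 3)
acs$Percent.Renters <- 60 - 1.2 * acs$Percent.65.Plus + rnorm(n, sd = 8)
acs$Estimated.Rate  <- 300 + 15 * acs$Percent.65.Plus + rnorm(n, sd = 60)

# Correlate every numeric column with the rate
numeric_cols <- sapply(acs, is.numeric)
correlations <- cor(acs[, numeric_cols], acs$Estimated.Rate,
                    use = "pairwise.complete.obs")

# Convert the one-column matrix to a named vector and rank it
correlations <- sort(correlations[, 1], decreasing = TRUE)

print(round(head(correlations, 10), 3))  # strongest positive
print(round(tail(correlations, 10), 3))  # strongest negative
```

The rate correlates perfectly with itself, which is why Estimated.Rate appears at the top of such a list with a value of 1.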
Areas with similar disease rates tend to cluster together in space. These clusters could be caused by naturally occurring or man-made carcinogens in the environment, or by shared social norms around risky behaviors (such as smoking or diet). The randomness of disease occurrence can make such clustering difficult to detect.
The Getis-Ord Gi* statistic identifies high-valued areas that are surrounded by other high-valued areas. Unlike visual inspection or approaches like kernel density analysis, this method produces p-values by comparing each area and its neighbors against all areas, to assess how likely it is that a cluster of high values in a specific location occurred by chance. This systematic application of statistics gives greater confidence (but not absolute certainty) that the patterns are more than a visual illusion and indicate clustering worth exploring.
It’s important to remember that being inside a hot spot does not mean every area there has a high rate. Rather, the overall level in a hot spot is high, even though some areas inside it may have low rates. A box plot of the distribution of rates across the five hot/cold spot categories shows that, although values in the 95 percent hot spots (category 5) are generally higher than values in the 95 percent cold spots (category 1), the rates within each category vary widely.
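To make the statistic concrete, the sketch below computes Gi* z-scores in base R for a toy 5 x 5 grid with a block of high rates in one corner. The grid, the rates, and the choice of queen-contiguity binary weights are all assumptions made for illustration. In practice a package such as spdep offers this calculation through its localG() function, though the tooling used for this analysis is not stated here.

```r
# Toy data: 5 x 5 grid of areas, high rates in one 2 x 2 corner
n_side <- 5
coords <- expand.grid(col = 1:n_side, row = 1:n_side)
rates <- rep(450, nrow(coords))
rates[coords$col <= 2 & coords$row <= 2] <- 700

# Binary weights: cells at most one step apart in any direction
# (queen contiguity), including the cell itself as Gi* requires
d <- as.matrix(dist(coords, method = "maximum"))
w <- (d <= 1) * 1

# Getis-Ord Gi* z-score for every area
gi_star <- function(x, w) {
  n <- length(x)
  x_bar <- mean(x)
  s <- sqrt(sum(x^2) / n - x_bar^2)
  w_sum <- rowSums(w)
  w_sq_sum <- rowSums(w^2)
  numerator <- as.vector(w %*% x) - x_bar * w_sum
  denominator <- s * sqrt((n * w_sq_sum - w_sum^2) / (n - 1))
  numerator / denominator
}

z <- gi_star(rates, w)
p <- 2 * pnorm(-abs(z))   # two-sided p-values from the z-scores
hot <- z > qnorm(0.975)   # hot spots at the 95 percent level

# Show the z-scores laid out on the grid
print(round(matrix(z, nrow = n_side, byrow = TRUE), 2))
```

Large positive z-scores mark areas whose neighborhood total is higher than expected under spatial randomness, and large negative z-scores mark cold spots.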
##
## Pearson's Chi-squared test
##
## data: getis.ord$Age.Adjusted.Rate and getis.ord$Hot.Spots
## X-squared = 5464, df = 5460, p-value = 0.4822
We analyzed Illinois State Cancer Registry data by combining it with geospatial data from the American Community Survey to identify cancer hot spots in the state of Illinois.
The strongest positive correlations with cancer rates are with variables that reflect higher percentages of older residents, such as Median.Age and Percent.65.Plus.
Conversely, the strongest negative correlations are with variables that reflect higher percentages of younger residents, such as school enrollment, renters, and women of reproductive age.
Rates are positively correlated with ZIP Code area size and negatively correlated with total population and population density. This could indicate a trend toward higher rates in low-population rural areas, but since rural areas also tend to have older populations, it could simply be another age-related association.