1 Introduction

In late August 2018, the Chicago Tribune published an article regarding an EPA report which found that Sterigenics facilities in Willowbrook, IL, have been releasing ethylene oxide into the air. Ethylene oxide (EtO) is used at the Sterigenics facilities to sterilize medical equipment. It is a known carcinogen and has been linked to many different kinds of health issues. EtO is even used in creating thermobaric weapons because it is so flammable and explosive.

The two Sterigenics facilities in Willowbrook, IL, are located in a small industrial/office area which is surrounded on all sides by residences. Within 1 mile of the Sterigenics facilities, aside from the many residences, there is Gower Middle School (less than a half mile away), Hinsdale South High School and Gower West Elementary School (3/4 of a mile away), many parks, and popular shopping places like Target and the Willowbrook Town Center. Every day EtO is breathed in by residents, the many employees who work in the businesses immediately surrounding the facilities, the teachers and students at the nearby schools, and more. Long term, daily exposure over the years to this harmful compound can be harmful to health.

In this analysis, I will attempt to determine whether or not the residents of the areas within a few miles of the facilities face a higher rate of cancer diagnoses than the rest of the state. It is important to note here that this analysis concerns ONLY cancer diagnoses and does not inlcude data on the many other problems linked to long term EtO exposure. The data also include people who lived in the respective ZIP codes at the time of diagnosis, so there may be people who work in the affected area but live farther away who may not be included in the counts. The cancer data comes from the Illinois Department of Public Health and contains data of cancer diagnoses for the state from 1986-2015. For most of this analysis I use only the data from 2001-2015 because it is my understanding that the ZIP code boundaries were re-drawn in 2000 and I would not want to misrepresent my findings by using incorrect data. The following table shows the ZIP Codes and towns/villages being considered.

ZIP_Codes Towns
60527 Willowbrook/Burr Ridge
60439 Lemont/Woodridge/Willow Springs
60561 Darien
60521 Hinsdale
60558 Western Springs
60514 Clarendon Hills
60559 Westmont
60525 La Grange/Hodgkins/Countryside
60480 Willow Springs
60515 Downers Grove
60516 Downers Grove/Woodridge
60517 Woodridge/Bolingbrook

2 Analysis

I will note here that all of the code I used for this analysis is available both in the appendix of this report and in a Github repository. I performed this analysis in both R and Python. All relevant files and outputs will be in the Github repository. I want to be as transparent as possible about my methods, so for this reason I am providing all of this information. Follow this link to my repository: https://github.com/athanasios8193/illinois_cancer.

2.1 Acquiring the Data

The cancer data was taken from the Illinois Department of Public Health website (http://www.idph.state.il.us/cancer/statistics.htm). For this analysis, I downloaded the ZIP Code areas file (ZPCD8615.EXE) and the associated README (READMEv25.pdf). Each line in the file contained a single cancer diagnosis. Each diagnosis was represented as a 37 character string with each character having some kind of significance. The ‘zpcd8615.dat’ file has been included in my Github repository in the Data folder so you don’t need to go through the process of downloading it yourself.

X1 X2
26609381 4440.7681000 -87.9886000
2361938310439.4771000 -88.3713000
22619441 4339.6049000 -87.6964000
2461944210439.6049000 -87.6964000
22629593 3337.7241000 -88.9291000
14617612 2440.5190000 -88.9819000

2.2 Cleaning and Reformating the Data

The output above shows the original format of the dataset. This is clearly not easily understood even with the instructions on how to interpret the information from the guide. My next step was to extract each portion of each line to its own column so that the data could be read more easily. I decided only to extract the values for sex, years, ZIP code, age, and type of cancer. There is also information on what stage the cancer was at during diagnosis, but I neglected to include this because I didn’t deem it relevant. If anyone decides to try and replicate what I did, feel free to include that information. Whether in R or Python, the step of splitting into columns is fairly simple. The result of this process is shown in the next table.

sex years zip age type
2 6 60938 4 4
2 3 61938 4 10
2 2 61944 3 4
2 4 61944 4 10
2 2 62959 3 3
1 4 61761 4 2

This output is much nicer to look at by far, but it can still be improved upon. It is a lot of work to constantly have to refer to the guide provided by the IDPH, so I took it a step further and made reference dictionaries to replace the number codes with text to make the tables very clear. The output of this step is shown below.

sex years zip age type
female 2011-2015 60938 65+ breast invasive-female
female 1996-2000 61938 65+ all other cancers
female 1991-1995 61944 45-64 breast invasive-female
female 2001-2005 61944 65+ all other cancers
female 1991-1995 62959 45-64 lung and bronchus
male 2001-2005 61761 65+ colorectal

This output is much more user-friendly than any of the others. I have included this table (18401445 rows, 5 columns) as a .csv called ‘il_cancer.csv.’ It is located in my Github repository in the Data folder. Feel free to explore it on your own and draw your own inferences.

2.3 Exploring the Data

2.3.1 State of Illinois Cancer Diagnoses

Now that the data are in a clean, tidy format, it is time to explore the data. I’ll start off by showing the trends for the entire state of Illinois since 1986. This first graph shows the number of cancer cases per 100,000 people per five year period in the state as reported by the IDPH since 1986. It is clear that the overall trend is up, despite there being some decline from ’96-’00 to ’01-’05, for example.

NOTE: You can hover over the bars of each plot to see the underlying numbers.

Next, I show the same graph but this time I break up each 5-year period by age. Within each age group, the number of cases goes up during each 5-year period. There is a stark difference in volume of cancer diagnoses between people above and below the age of 44. The 45-64 year age group has the highest rate of increase of cancer diagnoses over the 5-year periods. The 65 and older group is far and away the largest in terms of overall diagnoses, but the number has been fairly flat over the last 15 years in this time frame.

NOTE: On the graphs with a legend, you can turn groups on or off by clicking on the corresponding number in the legend.

2.3.2 Cancer Diagnoses in the Areas Surrounding Sterigenics

As alluded to before, the rest of this analysis will only concern the years 2001-2015 since the ZIP code boundaries were redrawn around the beginning of that time frame. The analysis will also revolve around the areas within a few miles of the Sterigenics plants in Willowbrook to see if there is any evidence of an increased cancer rate in the area relative to the rest of the state.

From 2001-2005 to 2006-2010 the total number of cancer diagnoses went up around 600, with an increase of 70 cases from ’06-’10 to ’11-’15.

When breaking up the previous graph by age, it becomes apparent that the two oldest age groups see an increase in cancer diagnoses. In the other two age groups, you see an increase in total number of diagnoses form ’01-’05 to ’06-’10 but then a decline from ’06-’10 to ’11-’15.

Further exploration into the breakdown of cases between men and women or possibly to determine what kinds of cancer were most prevalent could be done, but they are beyond the scope of this high level analysis. Feel free to explore the provided data on your own.

2.3.3 Cancer Statistics Analysis

Now that overall trends have been observed, the next step is to get actual hard numbers. A good way to do this is to get normalized statistics such as ‘1 Cancer Diagnosis for Every X People’ and ‘Number of Cancer Diagnoses per 100000 People.’ In order to accomplish this, I went to the U.S. Census Bureau website and got a dataset of the population of the state of Illinois per ZIP code. I used the data from the 2010 Census because it is the last year of collected data. There were estimates for more recent years as well, but I chose not to use them. Steps on downloading this dataset are in the ‘cancer.R’ file in the Github repository. I will also include the dataset in the Data folder of the Github repository so you don’t need to repeat the steps if you want to do your own analysis.

According the the U.S. Census Bureau, in 2010 the population of the state of Illinois was 12,830,632 people.

Diagnoses_Since_2001 Cases_per_Year One_in_Every_X_per_Year per_100000_per_Year
1004742 66982.8 191.5511 522.0538

The results of most consequence from this table are the ‘One in Every X per Year’ and ‘Cases per 100000 per Year,’ as they are normalized numbers so comparisons can be made to other areas. These numbers say that in the state of Illinois, from 2001-2015, 1 in every 191 people was diagnosed with cancer and 522 out of every 100000 people was diagnosed with cancer. These two numbers report the exact same thing, so use whichever result makes the scale of cancer diagnoses easiest to understand.

Next, I aggregated the number of diagnoses by zip code and combined those results with the U.S. Census Data by ZIP code. I did this for every ZIP code included in the IDPH data set and calculated diagnoses per year, ‘One in Every X’ per year, and ‘X per 100000 People per Year,’ as well as the ratio of the per 100000 people per year for each ZIP code compared to the state as a whole. The first few results are shown below.

NOTE: There were two incomplete rows so I removed them from the set.

zip population freq per_year one_in_every_X_per_year per_100000_per_year zip_vs_state
60002 24299 1978 131.8667 184.2695 542.6835 1.0395165
60004 50582 4530 302.0000 167.4901 597.0503 1.1436567
60005 29308 2790 186.0000 157.5699 634.6390 1.2156583
60007 33820 3155 210.3333 160.7924 621.9200 1.1912948
60008 22717 1723 114.8667 197.7684 505.6419 0.9685628
60010 44095 4109 273.9333 160.9698 621.2345 1.1899817

Once I had this entire output, the results of which are saved to ‘il_cancer_statistics.csv’ (located in the Github repository’s Data folder), I extracted the ZIP codes of interest. These results are shown in the table below.

zip population freq per_year one_in_every_X_per_year per_100000_per_year zip_vs_state
60525 31168 3389 225.93333 137.9522 724.8888 1.3885327
60561 23115 2341 156.06667 148.1098 675.1749 1.2933052
60515 27503 2727 181.80000 151.2816 661.0188 1.2661891
60527 27486 2720 181.33333 151.5772 659.7298 1.2637200
60480 5246 494 32.93333 159.2915 627.7799 1.2025196
60558 12960 1214 80.93333 160.1318 624.4856 1.1962093
60439 22919 2049 136.60000 167.7818 596.0120 1.1416679
60516 29084 2544 169.60000 171.4858 583.1385 1.1170085
60521 17597 1529 101.93333 172.6324 579.2654 1.1095895
60514 9708 817 54.46667 178.2375 561.0493 1.0746964
60559 24852 1917 127.80000 194.4601 514.2443 0.9850409
60517 32038 2110 140.66667 227.7583 439.0619 0.8410282

This table shows that 10/12 of the ZIP codes of interest are above the state of Illinois baseline level. La Grange, Darien, Downers Grove, and Willowbrook are highest with 39%, 29%, 27%, and 26% more cancer diagnoses per 100000 people. Only Westmont has a lower rate than the rest of the surrounding ZIP codes.

The following table shows the proportion of ZIP codes where the Cases per 100000 People was greater than the state of Illinois overall value. I then increment the percentage higher than Illinois by 5% until I reached 20% greater. The numbers in the affected area start to get smaller due to the small sample size.

Entire_State Area_of_Interest
Greater_than_Illinois 0.7677279 0.8333333
5%_Greater_than_Illinois 0.7054993 0.8333333
10%_Greater_than_Illinois 0.6403763 0.7500000
15%_Greater_than_Illinois 0.5643994 0.5000000
20%_Greater_than_Illinois 0.4833575 0.4166667

2.3.4 Map of Sterigenics Affected Areas

It is perhaps easiest to visualize the data by plotting on a map of the area. The map was created using Leaflet and ZIP Code boundary data provided by the U.S. Census Bureau.

The circles on the map are centered on the Sterigenics facilities. Each circle represents a half mile radius away from the facilities. You can zoom in and move around on the map as you desire. Each blue marker represents the geometric center of each ZIP Code of interest. Hovering over a blue marker will tell you which ZIP code it represents, and clicking on a marker will give you the per 100,000 people per year number and its comparison to the state of Illinois as a whole.

The shaded in areas with blue outlines are the ZIP code regions as provided by the U.S. Census Bureau. The darker the ZIP Code area is shaded in, the higher the cancer incidence per 100,000 people per year.

2.3.5 Graphical Comparison of Local to State

This graph is a visual representation of the per 100000 cancer rates in the Sterigenics area vs the state of Illinois as a whole (thick red line). It is clear from this that many of the local ZIP codes have a much higher rate than the state-wide baseline.

3 Deeper Statistical Analysis

3.1 Determining Normality

Now that the data has been explored, trends have been noticed, and it has become apparent that the cancer rates in the area are higher than the state as a whole, I would like to do a deeper dive into the numbers and see if the difference is statistically significant. For the sake of this next analysis, I will be using the local data you’ve already seen and compare it to the rest of the ZIP codes in the state to see if our area has higher cancer rates than the rest of the state on average.

In order to compare the area of interest with the rest of the state, I will be using a t-test, first developed by William Sealy Gosset under the pseudoym “Student.” A t-test, in simplest terms, compares the average of two samples to determine if they are equal. In order for a t-test to be valid, the data has to be normally distributed. Below I show the plots for the distributions of the per 100,000 cancer rates in the area of interest and the rest of the state.

The local data sample size is just 12, so it would be hard to determine whether or not it is normal or not.

The rest of the state has an interesting distribution. The left portion of the plot appears to be normally distributed, but the distribution has a very long right tail starting at around 1200. I am guessing that this skew is caused by ZIP codes that have very small populations because their diagnoses per 100,000 rates would be higher than a ZIP code with a higher population. This is something I will come back to later on in the analysis.

Another way to determine normality is by using QQ-plot or quantile-quantile plots. I will not go into the theory behind these plots too much, but the basic idea is that if the QQ-plot lies on a straight line, your data are approximately normally distributed. I will show QQ-plots for both data sets. Keep in mind that the local dataset has a low number of entries so it may not necessarily fit on the line.

This local QQ-plot is a lot better than I expected it to be. Most of the points lie near or on the line with the exception of the two lowest values. I will accept this as normally distributed and valid for a t-test.

Most of these data points lie on the line. The ones that don’t are the outliers that were observed in the histogram above. I believe, despite these outliers, that I can consider this data normally distributed and valid for a t-test.

3.2 T-testing

Now that it has been determined that both the local and rest of state data are normally distributed, I will perform a t-test on the two samples. The \(\alpha\) value I am going to use is the standard 0.05. The p-value returned from the t-test will be compared to this value to determine whether the test is statistically significant. In other words, if the p-value is greater than 0.05, then you can not conclude that the average cancer rate in the area of interest is greater than the rest of the state. If the p-value is less than 0.05, then you can conclude that the average rate in the area of interest is, in fact, greater than the rest of the state.

## 
##  Welch Two Sample t-test
## 
## data:  cancer_local$per_100000_per_year and cancer_rest$per_100000_per_year
## t = -0.92, df = 12.254, p-value = 0.8123
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -61.64846       Inf
## sample estimates:
## mean of x mean of y 
##  603.8208  624.8330

Based on this output, we can NOT say that the local area has a higher cancer rate than the rest of the state. The p-value returned by the test is 0.8123 which is much higher than the \(\alpha\) value of 0.5. Now is an important time to look back at the distribution of cancer rates for the rest of the state. There was a noticeable right tail skewing the distribution which also messed with the QQ-plot. I hypothesized that the large skew was caused by ZIP codes with small populations. I will subset the rest of the state data to remove all ZIP codes with populations less than 5,000 people and then do another t-test. First, let’s see what happens to the distribution when we subset.

This distribution looks much nicer than the one which included all of the smaller sized ZIP codes. Apart from making the data more normally distributed, I should note that in our area of interest, the smallest population size is 5,246 people. Now the rest of the state data are much more similar to the local data.

This QQ-plot shows a much more normally distributed sample than the sample that included all other counties in Illinois.

Following is the results of a t-test with the local data and the newly subset data.

## 
##  Welch Two Sample t-test
## 
## data:  cancer_local$per_100000_per_year and cancer_rest_big$per_100000_per_year
## t = 2.6068, df = 12.886, p-value = 0.01093
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  19.30541      Inf
## sample estimates:
## mean of x mean of y 
##  603.8208  543.5274

The p-value returned in this test is 0.01. This is lower than the standard 0.05 value, so based on this result, we can conclude that the ZIP code areas around Sterigenics in Willowbrook has higher cancer rates per year than the rest of the state while the population is greater than 5,000 people. This outcome was somewhat expected from the visualizations above, but now there is stronger evidence for a link between the Sterigenics plant and higher cancer incidence in the surrounding areas.

3.3 Chi-Squared Test

Another, and perhaps more appropriate, test to determine if the cancer rates in the affected areas are higher than the rest of the state is a \(\chi^2\), or chi-squared test. In this case, I made a contingency table of the average number of cancer diagnoses per year and average number of people not diagnosed with cancer in both the areas around Sterigenics and the rest of the ZIP codes in the state with populations larger than 5,000 people. This table is shown below, with row and column sums added in.

Cancer No Cancer Sum
Sterigenics Area 1590 57705 59295
Rest of State 15988 700233 716221
Sum 17578 757938 775516

The next step in a \(\chi^2\) test is to find the expected value of each cell. To find this, you multiply the sum of the row by the sum of the column of the cell of interest and divide that product by the total number of observations in the table. You do this for each of the cells in the table (excluding the sums). Once you have calculated these, you can find the \(\chi^2\) values for each cell by using the formula \(\chi^2 = \frac{{(X-E(X))}^2}{E(X)}\) where X is the observed value and E(X) is the expected value. Sum up the \(\chi^2\) values to get the total \(\chi^2\) value. This sum, combined with the degrees of freedom, which is equal to the (number of rows - 1) x (number of columns - 1), or 1 in this case, are looked up in a \(\chi^2\) distribution table to find the p-value. Following is the results of a \(\chi^2\) test of the contingency table shown above.

## 
##  Pearson's Chi-squared test
## 
## data:  contingency_table
## X-squared = 49.889, df = 1, p-value = 1.627e-12

The overall \(\chi^2\) value is 49.888624, which is very high which leads to a p-value of 1.627253910^{-12}, which is very low. The p-value being this low means that some value in the table has a value that has a statistically significant difference from the expected value. First, let us look at what the expected values were.

Cancer No Cancer Sum
Sterigenics Area 1344 57951 59295
Rest of State 16234 699987 716221
Sum 17578 757938 775516

We can see very obviously that the number of cancer diagnoses in the Sterigenics area is 256 larger than what was expected. The rest of the values have a difference of around 200 of their expected values, but since the number of cancer diagnoses in the Sterigenics is smaller than the others, it has a larger significance. Below is a table of the Pearson residuals. These can be interpreted like standard deviation, where values between -2 and 2 are within 95% of the mean and are considered statistically insignificant, and numbers outside of that range are thought to be statisically significant when \(\alpha\) is equal to 0.05, which is the case here.

Cancer No Cancer
Sterigenics Area 6.710430 -1.0219232
Rest of State -1.930794 0.2940382

The 6.71 value for cancer diagnoses in the Sterigenics area is incredibly significant. The results of this \(\chi^2\) test, along with the results of the t-test, suggest a link between proximity to the Sterigenics facilities in Willowbrook and cancer rates.

4 Conclusions

The findings above prove that the area around the Sterigenics plants in Willowbrook have a higher cancer rate than the rest of the state of Illinois. Due to the high amounts of EtO emitted from the plants, it is likely that these higher rates could be linked to the presence of Sterigenics in the community. There are many limitations to the analysis I performed. I only looked at cancer data and didn’t look at other possible ailments caused by EtO. If any data exist on other diseases or illnesses caused by EtO, I could do further analysis. Also, there are many people who work in the areas surrounding Sterigenics that do not necessarily live in the area, so if they did get cancer as a result of breathing in EtO, they would be reported in their own ZIP codes and would have been excluded. I could have tried to see if the rates of breast cancer were higher in this area compared to the rest of the state as well, as there is a link between breast cancer specifically and EtO.

With all this considered, further research and testing by people more qualified than me needs to be done to determine a link between the high cancer incidence and the presence of Sterigenics.

5 Appendix

5.1 References

Illinois Cancer Data

-Illinois Department of Public Health, Illinois State Cancer Registry, public dataset, 1986-2015, data as of November 2017

Illinois Population Data

-U.S. Census Bureau; Census 2010, Summary File 1, 5-Digit ZIP Code Tabulation within Illinois; generated by Athanasios Stamatoukos; using American FactFinder; http://factfinder.census.gov; (5 October 2018)

Motivation

I got the idea to do this analysis after seeing posts by “Stop Sterigenics” group members Richard Morton and Katherine M Howard

5.2 Code

5.2.1 Loading and Cleaning Data

library(dplyr)
library(ggplot2)
library(plotly)
library(knitr)

library(leaflet)
library(RColorBrewer)

library(tigris)
library(rgeos)
library(sp)
zipcodes <- c(60527, 60439, 60561, 60521, 60558, 60514, 60559, 60525, 60480, 60515, 60516, 60517)
towns <- c('Willowbrook/Burr Ridge', 'Lemont/Woodridge/Willow Springs',
           'Darien', 'Hinsdale', 'Western Springs', 'Clarendon Hills', 'Westmont',
           'La Grange/Hodgkins/Countryside', 'Willow Springs', 'Downers Grove',
           'Downers Grove/Woodridge', 'Woodridge/Bolingbrook')
ziptable <- cbind(ZIP_Codes=zipcodes, Towns=towns)
kable(ziptable, align='c')
data <- read.delim('./Data/zpcd8615.dat', sep='\t', header=FALSE, stringsAsFactors = FALSE)
test <- strsplit(data$V1, "   ")
mat <- t(sapply(test, '['))
data <- data.frame(mat, stringsAsFactors = FALSE)
rm(test)
rm(mat)
kable(head(data), align='c')
data$sex <- substr(data[,1], 1,1)
data$years <- substr(data[,1], 2,2)
data$zip <- substr(data[,1], 3,7)
data$age <- substr(data[,1], 11,11)
data$type <- substr(data[,1],9,10)
data$lat <- substr(data[,1], 12,24)
data$long <- data$X2

data <- data[,-c(1,2)]

data$sex <- as.numeric(data$sex)
data$years <-as.numeric(data$years)
data$zip <- as.numeric(data$zip)
data$age <- as.numeric(data$age)
data$type <- as.numeric(data$type)
data$lat <- as.numeric(data$lat)
data$long <- as.numeric(data$long)
kable(head(data[,-c(6,7)]), align='c')
sexcode <- c('male', 'female')
diagnosisyear <- c('1986-1990', '1991-1995', '1996-2000', '2001-2005', '2006-2010', '2011-2015')
agegroup <- c('0-14', '15-44', '45-64', '65+')
cancertype <- c('oral cavity and pharynx', 'colorectal', 'lung and bronchus',
                'breast invasive-female', 'cervix', 'prostate', 'urinary system',
                'central nervous system', 'lukemias and lymphomas', 'all other cancers',
                'breast in-situ-female')
data[,1] <- sexcode[data[,1]]
data[,2] <- diagnosisyear[data[,2]]
data[,4] <- agegroup[data[,4]]
data[,5] <- cancertype[data[,5]]

rm(sexcode)
rm(diagnosisyear)
rm(agegroup)
rm(cancertype)

kable(head(data[,-c(6,7)]), align='c')

5.2.2 Exploring Data

plot1 <- data %>% group_by(years) %>% count()
plot1 <- data.frame(plot1)

il80 <- 11426518
il90 <- 11430602
il00 <- 12419293
il10 <- 12830632

plot1[1,2] <- plot1[1,2]*100000/il80
plot1[2:3,2] <- plot1[2:3,2]*100000/il90
plot1[4:5,2] <- plot1[4:5,2]*100000/il00
plot1[6,2] <- plot1[6,2]*100000/il10

g <- ggplot(plot1, aes(x=years, y=n)) + geom_bar(stat='identity') + xlab('Years') + ylab('Number of Cases') +
        ggtitle('Number of Cancer Cases per 100000 per Five Year Period <br> in Illinois Since 1986')
ggplotly(g)
h <-ggplot(data, aes(x=years, fill=age)) + geom_bar(position='dodge') +
        xlab('Years') + ylab('Number of Cases') +
        ggtitle('Number of Cancer Cases in Illinois since 1986 \n by Age at Diagnosis') +
        scale_fill_discrete(name='Age')
ggplotly(h)
since2000 <- c('2001-2005', '2006-2010', '2011-2015')
zipcodes <- c(60527, 60439, 60561, 60521, 60558, 60514, 60559, 60525, 60480, 60515, 60516, 60517)
datasub <- subset(data, years %in% since2000)
datalocal <- subset(data, zip %in% zipcodes)
datalocalsub <- subset(datasub, zip %in% zipcodes)
i <- ggplot(datalocalsub, aes(x=years)) + geom_bar() +
        ggtitle('Number of Cancer Cases in Sterigenics Area Since 2001') +
        xlab('Years') + ylab('Number of Cases')
ggplotly(i)
j <- ggplot(datalocalsub, aes(x=years, fill=age)) + geom_bar(position='dodge') +
        ggtitle('Number of Cancer Cases in Sterigenics Area Since 2001 \n by Age at Diagnosis') +
        xlab('Years') + ylab('Number of Cases') +
        scale_fill_discrete(name='Age')
ggplotly(j)

5.2.3 Comparing Cancer Rates to State of Illinois

pop_illinois <- 12830632
num_cases <- nrow(datasub)
num_cases_per_year <- num_cases/15
one_in_every_il <- pop_illinois/num_cases_per_year
per_100000_il_per_year <- num_cases_per_year*100000/pop_illinois
illinois_overall <- data.frame('Diagnoses_Since_2001' = num_cases,
                               'Cases_per_Year' = num_cases_per_year,
                               'One_in_Every_X_per_Year' = one_in_every_il,
                               'per_100000_per_Year' = per_100000_il_per_year)
kable(x=illinois_overall, format='markdown', align='c')
ilpop <- read.csv('./Data/il_2010_populations.csv', header=TRUE, skip=1)
ilpop <- ilpop[,-c(1,3)]
colnames(ilpop) <- c('zip', 'population')
ilcancer <- datasub %>% group_by(zip, lat, long) %>% count()
ilcancer <- as.data.frame(ilcancer)
colnames(ilcancer) <- c('zip', 'lat', 'long', 'freq')
cancer <- left_join(ilpop, ilcancer)
cancer <- subset(cancer, complete.cases(cancer))
cancer$per_year <- cancer$freq/15
cancer$one_in_every_X_per_year <- cancer$population/cancer$per_year
cancer$per_100000_per_year <- cancer$per_year*100000/cancer$population
cancer$zip_vs_state <- cancer$per_100000_per_year/per_100000_il_per_year
kable(head(cancer[,-c(3,4)]), align='c')
cancer_local <- subset(cancer, zip %in% zipcodes)
kable(arrange(cancer_local[,-c(3,4)], desc(zip_vs_state)), align='c')
cancer_greater <- c(nrow(subset(cancer, zip_vs_state > 1))/nrow(cancer),
                    nrow(subset(cancer, zip_vs_state > 1.05))/nrow(cancer),
                    nrow(subset(cancer, zip_vs_state > 1.10))/nrow(cancer),
                    nrow(subset(cancer, zip_vs_state > 1.15))/nrow(cancer),
                    nrow(subset(cancer, zip_vs_state > 1.20))/nrow(cancer))
cancer_local_greater <- c(nrow(subset(cancer_local, zip_vs_state >1))/nrow(cancer_local),
                          nrow(subset(cancer_local, zip_vs_state >1.05))/nrow(cancer_local),
                          nrow(subset(cancer_local, zip_vs_state >1.10))/nrow(cancer_local),
                          nrow(subset(cancer_local, zip_vs_state >1.15))/nrow(cancer_local),
                          nrow(subset(cancer_local, zip_vs_state >1.20))/nrow(cancer_local))
cancer_greater_both <- data.frame('Entire_State'=cancer_greater,
                                  'Area_of_Interest'=cancer_local_greater,
                                  row.names = c('Greater_than_Illinois',
                                                '5%_Greater_than_Illinois',
                                                '10%_Greater_than_Illinois',
                                                '15%_Greater_than_Illinois',
                                                '20%_Greater_than_Illinois'))
kable(cancer_greater_both, align='c')
datashp <- zctas(cb = TRUE, starts_with = zipcodes)
data_map <- merge(datashp, cancer_local, by.x = 'GEOID10', by.y ='zip')

n <- leaflet(data_map) %>% addTiles()
n <- n %>% addMarkers(lat = ~lat, lng = ~long, 
                      popup=~paste('Per 100,000 per Year: ', as.character(round(per_100000_per_year, 2)), '<br>',
                                   'ZIP Code vs State: ', as.character(round(zip_vs_state, 2))),
                      label=~paste('Geographic Center of ', GEOID10, ' ZIP Code'))
n <- n %>% addPolygons(data=data_map, weight=3, opacity = 1, fillOpacity = 0.5,
                       fillColor = ~colorQuantile('Greys', per_100000_per_year)(per_100000_per_year))
n <- n %>% addCircles(lat = 41.747375, lng = -87.939954,
                      label = 'Sterigenics Plant Radius ',
                      color=rev(brewer.pal(8, 'Spectral')),
                      radius = (8:1)*804.672, opacity = 1, fillOpacity = 0.15)

n
k <- ggplot(cancer_local, aes(x=reorder(as.factor(zip), -per_100000_per_year), y=per_100000_per_year)) + 
        geom_bar(stat='identity') +
        geom_hline(yintercept = per_100000_il_per_year, lwd=2, color='red') + 
        xlab('Zip Code') + ylab('Cancer Diagnoses per 100000 People per Year') +
        scale_y_continuous(breaks = sort(c(seq(0, 750, length.out=6),
                                           round(per_100000_il_per_year,2)))) +
        ggtitle('Cancer Diagnoses per 100000 People per Year Compared to \n State of Illinois Overall')
k

5.2.4 Statistical Analysis

cancer_rest <- subset(cancer, !(zip %in% zipcodes))
ggplot(cancer_local, aes(x=per_100000_per_year)) + geom_histogram(bins=10) + ggtitle('Local Cancer Rate Distribution') +
        xlab('Diagnoses per 100,000 People per Year')
ggplot(cancer_rest, aes(x=per_100000_per_year)) + geom_histogram(bins=45) +
        ggtitle('Rest of State Cancer Rate Distribution') +
        xlab('Diagnoses per 100,000 People per Year')
ggplot(cancer_local, aes(sample=per_100000_per_year)) + stat_qq_line(lwd=3, color='red') +
        stat_qq() + ggtitle('QQ-Plot for Local Cancer Rates')
ggplot(cancer_rest, aes(sample=per_100000_per_year)) + stat_qq_line(lwd=3, color='red') +
        stat_qq() + ggtitle('QQ-Plot for Rest of State Cancer Rates')
test1 <- t.test(cancer_local$per_100000_per_year, cancer_rest$per_100000_per_year, alternative = 'greater')
test1
cancer_rest_big <- subset(cancer_rest, population > 5000)
ggplot(cancer_rest_big, aes(x=per_100000_per_year)) + geom_histogram(bins=30) +
        ggtitle('Rest of State (Population > 5000) Cancer Rate Distribution') +
        xlab('Diagnoses per 100,000 People per Year')
ggplot(cancer_rest_big, aes(sample=per_100000_per_year)) + stat_qq_line(lwd=3, color='red') +
        stat_qq() + ggtitle('QQ-Plot for Rest of State (Population > 5000) Cancer Rates')
test2 <- t.test(cancer_local$per_100000_per_year, cancer_rest_big$per_100000_per_year, alternative = 'greater')
test2
local_cancer <- sum(cancer_local$freq)/15
local_no_cancer <- sum(cancer_local$population)/15-local_cancer
rest_cancer <- sum(cancer_rest_big$freq)/15
rest_no_cancer <- sum(cancer_rest_big$population)/15 - rest_cancer

contingency_table <- matrix(c(local_cancer, local_no_cancer, rest_cancer, rest_no_cancer), nrow=2)
contingency_table <- round(contingency_table, 0)

rownames(contingency_table) <- c('Sterigenics Area', 'Rest of State')
colnames(contingency_table) <- c('Cancer', 'No Cancer')

kable(addmargins(round(contingency_table, 0)), align='c')
chitest <- chisq.test(contingency_table, correct = FALSE)
chitest
kable(addmargins(round(chitest$expected, 0)), align = 'c')
kable(chitest$residuals, align = 'c')