0— title: “Project 2” author: “Christina Takla” date: “2023-04-19” output: html_document —

Introduction

The dataset I chose explores the association of arsenic levels in water systems to the number of people within that community water system who contracted either bladder, colorectal or kidney cancer. This data was collected by the US EPA office of research and development.

The original dataset has 23 variables. Three of those variables were categorical; the fips code, State, and County. The rest of the variables are quantitative (or should be- the count for bladder cancer and colorectal cancer were in character form as numbers). The rest of the variables were the population served by each community water system (CWS), the number of CWS’s in the county, how many people in each county who have either of the three cancers mentioned above, an unweighted value representing the arsenic concentration, a weighted value representing the arsenic concentration that is adjusted to the number of CWS in the county. There are also demographic variables to show what percent of the population affected by the cancers are male, black, have lived in the county for over 5 years, ever had an endoscopy, or ever smoked. There are variables for the EPA’s Environmental Quality Index (EQI) that is usually measured for the air, water, land, buildings, and sociodemographic factors. The higher values suggest worse environmental quality, and lower values suggest better environmental quality. There is also an overall EQI variable in the dataset. Lastly there is an unweighted and weighted arsenic category, with the higher number indicating a higher/more dangerous concentration.

Having this large amount of information, I decided to only work with a small amount of variables but also mutated the dataframe to include population data for each county by finding the government population data set the EPA used to make their calculations. The population data was taken by the US Geological Survey service. From the original dataset, I used State, County, Population_Served, Bladder_Count, Colorectal_Count, Kidney_Count, weighted, weighted_cat, and EQI_WATER.

I chose this dataset, because I’ve always had an interest in how environmental factors alter DNA to cause cancers. The first step is to analyze the data we have of possible carcinogens, so I wanted to see if there was a correlation between the arsenic concentrations in the water and the proportion of people in each county contracting bladder, colorectal, or kidney cancer. The EPA paper written using this dataset, did find positive associations between higher levels of arsenic and the cancers.

I used the calculations and supplemental data in the paper, “Aggregated Cumulative County Arsenic in Drinking Water and Associations with Bladder, Colorectal, and Kidney Cancers, Accounting for Population Served” published by the EPA in J Expo Sci Environ Epidemiol, to make sense of the acronyms, data, and look further into the specifics of the data collection process.

citation: Krajewski AK, Jimenez MP, Rappazzo KM, Lobdell DT, Jagai JS. Aggregated cumulative county arsenic in drinking water and associations with bladder, colorectal, and kidney cancers, accounting for population served. J Expo Sci Environ Epidemiol. 2021 Nov;31(6):979-989. doi: 10.1038/s41370-021-00314-8. Epub 2021 Mar 10. PMID: 33692484; PMCID: PMC8862296.

Introducing necessary files and keys for the project

setwd("~/Data110/4.12.23/cb_2021_us_county_5m")
arsenicdatafile <- "Arsenic_Cancer_Final_2020Sept2.csv"
MDfipscode <- "24"
usshapefile <- "cb_2021_us_county_5m.shp"
popdatafile <- "usco2010.csv"

Loading libraries that may be used for the exploratory data analysis

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0     ✔ purrr   1.0.1
## ✔ tibble  3.1.8     ✔ dplyr   1.1.0
## ✔ tidyr   1.3.0     ✔ stringr 1.5.0
## ✔ readr   2.1.3     ✔ forcats 1.0.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

library(tmap)

## Warning: package 'tmap' was built under R version 4.2.3

library(tmaptools)

## Warning: package 'tmaptools' was built under R version 4.2.3

library(leaflet)

## Warning: package 'leaflet' was built under R version 4.2.3

library(sf)

## Warning: package 'sf' was built under R version 4.2.3

## Linking to GEOS 3.9.3, GDAL 3.5.2, PROJ 8.2.1; sf_use_s2() is TRUE

library(leaflet.extras)

## Warning: package 'leaflet.extras' was built under R version 4.2.3

library(dplyr)
library(rio)

## Warning: package 'rio' was built under R version 4.2.3

library(sp)

## Warning: package 'sp' was built under R version 4.2.3

library(plotly)

## Warning: package 'plotly' was built under R version 4.2.3

## 
## Attaching package: 'plotly'
## 
## The following object is masked from 'package:rio':
## 
##     export
## 
## The following object is masked from 'package:ggplot2':
## 
##     last_plot
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following object is masked from 'package:graphics':
## 
##     layout

Importing and cleaning the data for use

setwd("~/Data110/4.12.23/cb_2021_us_county_5m")
arsenicdata <- import(arsenicdatafile)
uspopdata <- import(popdatafile)

I am removing some columns that are not offering useful information about the water quality or the people affected by the arsenic concentration in the water.

arsenicdata <- arsenicdata[,c("State", "County", "Population_Served", "Bladder_Count", "Colorectal_Count", "Kidney_Count", "Total_CWS_County", "unweighted", "weighted", "Percent_Male", "Percent_Ever_Had_Endoscopy", "Percent_Ever_Smoked", "EQI_WATER", "unweighted_cat", "weighted_cat")]

str(arsenicdata)

## 'data.frame':    943 obs. of  15 variables:
##  $ State                     : chr  "California" "California" "California" "California" ...
##  $ County                    : chr  "Alameda" "Alpine" "Amador" "Butte" ...
##  $ Population_Served         : int  328325 50 5700 100086 1400 356 26 112 250 90 ...
##  $ Bladder_Count             : chr  "238" "0" "14" "67" ...
##  $ Colorectal_Count          : chr  "506" "0" "20" "105" ...
##  $ Kidney_Count              : int  178 0 10 36 12 4 156 5 30 124 ...
##  $ Total_CWS_County          : int  14 5 25 53 19 10 42 15 19 102 ...
##  $ unweighted                : num  13.8 11 12 19.3 13.4 ...
##  $ weighted                  : num  14.7 27.1 11.2 11.4 13 ...
##  $ Percent_Male              : num  49.1 52.6 55.1 49 49.6 50.8 48.8 55.2 49.9 50.1 ...
##  $ Percent_Ever_Had_Endoscopy: num  51.2 52.7 58.3 52.2 51.8 44.1 63.3 43.7 62 49.9 ...
##  $ Percent_Ever_Smoked       : num  31.4 53.7 53.4 44.5 46.4 38.1 38.3 44.2 38.7 30.2 ...
##  $ EQI_WATER                 : num  1.273 0.237 1.228 0.861 0.941 ...
##  $ unweighted_cat            : int  3 2 2 3 3 4 4 2 3 4 ...
##  $ weighted_cat              : int  4 4 3 3 4 4 2 3 3 4 ...

I am changing the count of people who have contracted bladder or colorectal cancer from character form to numeric.

arsenicdata$Bladder_Count <- as.numeric(arsenicdata$Bladder_Count)

## Warning: NAs introduced by coercion

arsenicdata$Colorectal_Count <- as.numeric(arsenicdata$Colorectal_Count)

## Warning: NAs introduced by coercion

str(arsenicdata)

## 'data.frame':    943 obs. of  15 variables:
##  $ State                     : chr  "California" "California" "California" "California" ...
##  $ County                    : chr  "Alameda" "Alpine" "Amador" "Butte" ...
##  $ Population_Served         : int  328325 50 5700 100086 1400 356 26 112 250 90 ...
##  $ Bladder_Count             : num  238 0 14 67 17 3 224 6 52 122 ...
##  $ Colorectal_Count          : num  506 0 20 105 23 7 413 11 77 262 ...
##  $ Kidney_Count              : int  178 0 10 36 12 4 156 5 30 124 ...
##  $ Total_CWS_County          : int  14 5 25 53 19 10 42 15 19 102 ...
##  $ unweighted                : num  13.8 11 12 19.3 13.4 ...
##  $ weighted                  : num  14.7 27.1 11.2 11.4 13 ...
##  $ Percent_Male              : num  49.1 52.6 55.1 49 49.6 50.8 48.8 55.2 49.9 50.1 ...
##  $ Percent_Ever_Had_Endoscopy: num  51.2 52.7 58.3 52.2 51.8 44.1 63.3 43.7 62 49.9 ...
##  $ Percent_Ever_Smoked       : num  31.4 53.7 53.4 44.5 46.4 38.1 38.3 44.2 38.7 30.2 ...
##  $ EQI_WATER                 : num  1.273 0.237 1.228 0.861 0.941 ...
##  $ unweighted_cat            : int  3 2 2 3 3 4 4 2 3 4 ...
##  $ weighted_cat              : int  4 4 3 3 4 4 2 3 3 4 ...

I would like there to be a variable showing the total population in each county. I used the same raw data set from the US Geological Survey service that was used in the paper by the EPA to make the arsenic data set. In order to transfer the total population from the uspop dataframe to the dataframe for my plot, they need to have the same number of observations. I thought the best way to eliminate unnecessary observations would be to match the counties.

uspopdata$COUNTY<-gsub(" County","",as.character(uspopdata$COUNTY))
uspop <- uspopdata[uspopdata$COUNTY %in% arsenicdata$County ,] %>%
  filter(STATE == "MD" | STATE == "MA" | STATE == "ME" | STATE == "MO")

For my first plot, I did not want it to be an overwhelming amount of information, so I chose to display variables for the states that start with “M” in the arsenic dataset.

plot1data <- arsenicdata %>%
  select(State, County, Population_Served, Bladder_Count, Colorectal_Count, Kidney_Count, weighted) %>%
  filter(State == "Maryland" | State == "Massachusetts" | State == "Maine" | State == "Missouri")

There were some counties available in the arsenic dataset that were not in the us population dataset, so I removed the observations that were not available in both.

plot1data_ <- plot1data[plot1data$County %in% uspop$COUNTY ,]

I wanted to make a single column of the total number of people who contracted any of the three cancers, and finally add the total population of each county. With these values I can get the proportion of the population that is diagnosed with any of the three cancers.

plot1data_$`Cancer count` <- plot1data_$Bladder_Count + plot1data_$Colorectal_Count + plot1data_$Kidney_Count

plot1datafin <- plot1data_ %>%
  mutate(`Total_Pop (in thousands)` = uspop$`TP-TotPop`) %>%
  select(State, County, `Total_Pop (in thousands)`, `Cancer count`, weighted)

  plot1datafin$`Cancer Rate` <- (plot1datafin$`Cancer count`/(plot1datafin$`Total_Pop (in thousands)`*1000))*100

Statistical Analysis & Visual

Based off the linear model and scatter plot with a line of best fit, there does not seem to be a strong linear relationship between the rate of cancer in each county’s population and the concentration of arsenic within their community water systems. The p-value is also much larger that 0.05 indicating that there is not enough evidence to reject the null hypothesis as the data is not significant. The R^2 value is closer to 0 indicating not much of a correlation relationship between the two variables.

lin_model <- lm(plot1datafin$`Cancer Rate` ~ plot1datafin$weighted)
summary(lin_model)

## 
## Call:
## lm(formula = plot1datafin$`Cancer Rate` ~ plot1datafin$weighted)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.085112 -0.013714  0.000981  0.015957  0.068098 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           0.0837310  0.0028865  29.007   <2e-16 ***
## plot1datafin$weighted 0.0001969  0.0002776   0.709    0.479    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02844 on 159 degrees of freedom
## Multiple R-squared:  0.003154,   Adjusted R-squared:  -0.003116 
## F-statistic: 0.5031 on 1 and 159 DF,  p-value: 0.4792

plot1 <- ggplot(plot1datafin, aes(weighted, `Cancer Rate`)) +
  geom_smooth(method = "lm")+
  geom_point(aes( color = State)) +
  theme_minimal() +
  labs( title = "Arsenic Concentration in Water and Rate of Cancer Diagnosis",
        x = "Arsenic Concentration",
        y = "Cancer Rate")
ggplotly(plot1)

## `geom_smooth()` using formula = 'y ~ x'

Cleaning the data for my second plot

Instead of the weighted value, I will be making a boxplot with the weighted category (1-4, 4 being the worse arsenic concentration).

plot2data <- arsenicdata %>%
  select(State, County, Population_Served, Bladder_Count, Colorectal_Count, Kidney_Count, weighted_cat) %>%
  filter(State == "Maryland" | State == "Massachusetts" | State == "Maine" | State == "Missouri")

plot2data <- plot2data[plot2data$County %in% uspop$COUNTY ,]

plot2data$`Cancer count` <- plot2data$Bladder_Count + plot2data$Colorectal_Count + plot2data$Kidney_Count

plot2datafin <- plot2data %>%
  mutate(`Total_Pop (in thousands)` = uspop$`TP-TotPop`) %>%
  select(State, County, `Total_Pop (in thousands)`, `Cancer count`, weighted_cat) %>%
  group_by(weighted_cat)

  plot2datafin$`Cancer Rate` <- (plot1datafin$`Cancer count`/(plot2datafin$`Total_Pop (in thousands)`*1000))*100

plot2 <- ggplot(plot2datafin, aes(x = weighted_cat, y = `Cancer Rate`)) +
  geom_boxplot(aes(group = weighted_cat, fill = weighted_cat))+
  labs(title = "Arsenic Category and Proportion of Cancer Diagnoses in the Population", ) +
  xlab("Arsenic Category") +
  guides(fill=guide_legend(title="Arsenic Category"))+
  theme_minimal()
plot2 + theme(plot.title=element_text(size=10))

GIS

library(raster)

## Warning: package 'raster' was built under R version 4.2.3

## 
## Attaching package: 'raster'

## The following object is masked from 'package:plotly':
## 
##     select

## The following object is masked from 'package:dplyr':
## 
##     select

library(rgdal)

## Warning: package 'rgdal' was built under R version 4.2.3

## Please note that rgdal will be retired during 2023,
## plan transition to sf/stars/terra functions using GDAL and PROJ
## at your earliest convenience.
## See https://r-spatial.org/r/2022/04/12/evolution.html and https://github.com/r-spatial/evolution
## rgdal: version: 1.6-5, (SVN revision 1199)
## Geospatial Data Abstraction Library extensions to R successfully loaded
## Loaded GDAL runtime: GDAL 3.5.2, released 2022/09/02
## Path to GDAL shared files: C:/Users/takla/AppData/Local/R/win-library/4.2/rgdal/gdal
## GDAL binary built with GEOS: TRUE 
## Loaded PROJ runtime: Rel. 8.2.1, January 1st, 2022, [PJ_VERSION: 821]
## Path to PROJ shared files: C:/Users/takla/AppData/Local/R/win-library/4.2/rgdal/proj
## PROJ CDN enabled: FALSE
## Linking to sp version:1.6-0
## To mute warnings of possible GDAL/OSR exportToProj4() degradation,
## use options("rgdal_show_exportToProj4_warnings"="none") before loading sp or rgdal.

Clean the data and include new useful columns

MDdata <- arsenicdata[,c("State", "County", "Bladder_Count", "Colorectal_Count", "Kidney_Count", "EQI_WATER")] %>%
  filter(State == "Maryland")

MDdata$`Cancer count` <- MDdata$Bladder_Count + MDdata$Colorectal_Count + MDdata$Kidney_Count

Get the geographic data

usgeo <- shapefile("cb_2021_us_county_5m.shp")

tmap_options(check.and.fix = TRUE)
qtm(usgeo)

## Warning: The shape usgeo is invalid. See sf::st_is_valid

MDgeo <- usgeo[usgeo$STATEFP==MDfipscode,]

qtm(MDgeo)

MDgeo$NAME <- as.character(MDgeo$NAME)

MDgeo <- MDgeo[order(MDgeo$NAME),]
MDdata <- MDdata[order(MDdata$County),]

Make a static map

library(sf)
MDmap <- merge(MDgeo, MDdata, by.x = "NAME", by.y = "County")

qtm(MDmap, "EQI_WATER")

## Variable(s) "EQI_WATER" contains positive and negative values, so midpoint is set to 0. Set midpoint = NA to show the full spectrum of the color palette.

Change color palettes

tm_shape(MDmap) +
 tm_fill("EQI_WATER", title="EQI_WATER", palette = "Reds") +
 tm_borders(alpha=.5) +
 tm_text("NAME", size=0.8)

Make an interactive map

MDPalette <- colorNumeric(palette = "Reds", domain=MDmap$EQI_WATER)

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:purrr':
## 
##     discard

## The following object is masked from 'package:readr':
## 
##     col_factor

MDpopup <- paste0("County: ", MDmap$NAME,
"EQI Water ", MDmap$EQI_WATER, "Cancer Count ", MDmap$`Cancer count`)

library(rgeos)

## Warning: package 'rgeos' was built under R version 4.2.3

## rgeos version: 0.6-2, (SVN revision 693)
##  GEOS runtime version: 3.9.3-CAPI-1.14.3 
##  Please note that rgeos will be retired during 2023,
## plan transition to sf functions using GEOS at your earliest convenience.
##  GEOS using OverlayNG
##  Linking to sp version: 1.6-0 
##  Polygon checking: TRUE

## 
## Attaching package: 'rgeos'

## The following object is masked from 'package:dplyr':
## 
##     symdiff

MDmap_projected <- sp::spTransform(MDmap, "+proj=longlat +datum=WGS84")
leaflet(MDmap_projected) %>%
 addProviderTiles("CartoDB.Positron") %>%
 addPolygons(stroke=FALSE,
 smoothFactor = 0.2,
 fillOpacity = .8,
 popup= MDpopup,
 color= ~MDPalette(MDmap$EQI_WATER))

Final Question

I am surprised by the results of my visualizations. I expected there to be a positive relationship between the weighted value and the cancer rate in each county. In the EPA paper, their calculations were done in a way I did not fully understand, because I don’t think it accounted for the population properly, at least form what I saw. It is more likely that my calculations were done in a simplified way that misinterpreted the data presented. I was also surprised that the boxplot showed category 3 to have the highest average cancer rate. I expected category 4 to have the highest average.

I wished I included the cancer rate in the interactive map instead of the count. But, I find it hard to compare the values for each county while using the interactive map. I prefer a table for that many comparisons.