Arsenic and Fluoride Levels in Maine Drinking Water

Using two data sets provided by the State of Maine Health and Environmental Testing Laboratory (HETL) and the Maine Tracking Network, we looked at the correlation between fluoride and arsenic levels in private wells. The data sets contain fluoride and arsenic measurements from samples taken from private wells between 1999 and 2013. Each data set has 6 columns.

The code below downloads and imports the data sets from Professor James Suleiman’s MBA 676 course website:

# Download files from internet and store in temporary files
set.seed(9242017)
flouride_file <- tempfile('flouride', fileext = '.csv')
arsenic_file <- tempfile('arsenic', fileext = '.csv')

download.file('http://jamessuleiman.com/mba676/assets/units/unit4/flouride.csv',
              destfile = flouride_file,
              method = 'auto')

download.file('http://jamessuleiman.com/mba676/assets/units/unit4/arsenic.csv',
              destfile = arsenic_file,
              method = 'auto')

# Read temporary files and store as dataframes
flo_tab <- read.csv(flouride_file)
ars_tab <- read.csv(arsenic_file)


We will join these two data frames into one by location. Before doing so, we will convert the fluoride and arsenic measurements to the same units. This is not strictly necessary, but it keeps things straight and avoids confusion. The arsenic values are in micrograms per liter, while the fluoride values are in milligrams per liter. There are one thousand micrograms in one milligram, so we will multiply the values in the median column of the fluoride data set by one thousand to express them in micrograms per liter.

Also, before joining the data sets, we will rename the columns we are keeping so that it is clear which contaminant each one refers to. In the arsenic and fluoride tables respectively, “n_wells_tested” becomes “n.arsenic” and “n.flouride”, “percent_wells_above_guideline” becomes “percent.above.arsenic” and “percent.above.flouride”, and “median” becomes “median.arsenic” and “median.flouride”.

library(dplyr)
flo <- flo_tab %>%
     select(location, 
            n_wells_tested, 
            percent_wells_above_guideline,
            median) %>%
     mutate(median = median * 1000) %>%
     rename(n.flouride = n_wells_tested,
            percent.above.flouride = percent_wells_above_guideline,
            median.flouride = median)

ars <- ars_tab %>%
     select(location, 
            n_wells_tested, 
            percent_wells_above_guideline,
            median) %>%
     rename(n.arsenic = n_wells_tested,
            percent.above.arsenic = percent_wells_above_guideline,
            median.arsenic = median)
     
wells <- full_join(flo, ars, by = 'location')

rm(ars,flo)


For areas where fewer than 20 wells were sampled, only the maximum observed contaminant concentration is reported. We are interested in whether fluoride and arsenic concentrations are correlated, and for this we will use the reported median concentrations. We therefore restrict the analysis to areas where at least 20 wells were sampled for both fluoride and arsenic.

wells <- wells %>%
     filter(n.flouride >= 20, n.arsenic >= 20)
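
To make the effect of the join and the filter concrete, here is a minimal sketch on made-up data. The toy tables, town names, and well counts are hypothetical and are not taken from the Maine data sets; only the column names follow the ones used above.

```r
library(dplyr)

# Hypothetical toy tables; town names and counts are made up for illustration
flo_toy <- data.frame(location = c("Town A", "Town B", "Town C"),
                      n.flouride = c(25, 40, 12))
ars_toy <- data.frame(location = c("Town A", "Town B", "Town D"),
                      n.arsenic = c(30, 15, 50))

# full_join keeps every location from both tables and fills the gaps with NA
toy <- full_join(flo_toy, ars_toy, by = "location")

# filter() drops rows where a condition is FALSE or NA, so only locations
# with at least 20 wells in both tables remain
toy_kept <- toy %>%
     filter(n.flouride >= 20, n.arsenic >= 20)

print(toy)
print(toy_kept)
```

In this toy case only Town A is kept: Town B has too few arsenic tests, Town C has too few fluoride tests and no arsenic record, and Town D has no fluoride record. The same logic means locations that appear in only one of the real data sets are dropped by the filter.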

Before testing for correlation between the median arsenic and fluoride levels, let’s take a look at the distribution of these variables.

hist(wells$median.flouride,
     main = 'Observed Distribution of Median Fluoride Level',
     xlab = 'Concentration ug/L')

hist(wells$median.arsenic,
     main = 'Observed Distribution of Median Arsenic Level',
     xlab = 'Concentration ug/L')


It is clear that the distributions of the median contaminant levels are not close to normal (they are strongly right-skewed). We will therefore use a non-parametric test of correlation. A commonly used non-parametric test is Spearman’s rank correlation test, which compares the ranks of the observations rather than their raw values. For example, if one location has a higher median fluoride level than another, it has a higher “rank” for fluoride level. The test looks at whether observations that rank higher in one variable (e.g. median fluoride level) also tend to rank higher in the other variable (e.g. median arsenic level). Spearman’s rank correlation coefficient takes values between negative one and positive one. Negative values indicate that the variables are negatively correlated, while positive values indicate a positive correlation.

The “Hmisc” package in R can calculate Spearman’s rank coefficient and the p-value for that statistic. The p-value is the probability of observing a correlation at least this strong if the two variables were in fact unrelated. We will use the standard alpha level of 5%, so we will not consider the result significant unless the p-value is less than 0.05.
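
The idea of ranking can be illustrated with a small sketch in base R. The five values below are made up for illustration and are not taken from the data sets; the point is only that Spearman’s coefficient is the ordinary (Pearson) correlation computed on the ranks.

```r
# Hypothetical median concentrations (ug/L) for five locations; values are made up
fluoride_toy <- c(100, 250, 180, 900, 300)
arsenic_toy  <- c(0.5, 1.2, 2.0, 8.0, 1.0)

# rank() replaces each value with its position in the sorted order
fluoride_rank <- rank(fluoride_toy)  # 1 3 2 5 4
arsenic_rank  <- rank(arsenic_toy)   # 1 3 4 5 2

# Spearman's coefficient is the Pearson correlation of the ranks,
# so these two calls give the same value (0.6 for this toy data)
rho_from_ranks <- cor(fluoride_rank, arsenic_rank)
rho_builtin    <- cor(fluoride_toy, arsenic_toy, method = "spearman")

print(c(rho_from_ranks, rho_builtin))
```

Because only the ordering matters, the extreme value of 900 has no more influence on the coefficient than any other top-ranked value would, which is why the rank-based measure suits skewed data.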

library(Hmisc)
rcorr(as.matrix(wells[,c('median.flouride', 'median.arsenic')]), 
      type="spearman")
##                 median.flouride median.arsenic
## median.flouride            1.00           0.11
## median.arsenic             0.11           1.00
## 
## n= 341 
## 
## 
## P
##                 median.flouride median.arsenic
## median.flouride                 0.0527        
## median.arsenic  0.0527


The output of the ‘rcorr’ function shows a Spearman’s rank correlation coefficient of 0.11 between median fluoride and median arsenic levels across the 341 locations, which indicates a weak positive correlation. However, the p-value is 0.0527: if there were no true association, a correlation at least this strong would be expected about 5.27% of the time. Since this exceeds our chosen cut-off of 0.05, we cannot reject the hypothesis of no correlation. We therefore conclude that these data do not provide statistically significant evidence of a correlation between arsenic and fluoride levels in private drinking wells.
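
The same decision rule can be sketched with base R’s ‘cor.test’ function, which offers a way to cross-check the ‘rcorr’ result without the “Hmisc” package. The example below runs on simulated right-skewed data, not on the Maine data sets, and the seed and sample size are arbitrary assumptions. On the real data one would pass wells$median.flouride and wells$median.arsenic instead of x and y.

```r
set.seed(1)  # arbitrary seed so the simulated example is reproducible

# Simulated right-skewed values standing in for two contaminant medians;
# these numbers are generated, not taken from the Maine data sets
x <- rlnorm(50)
y <- rlnorm(50)

# exact = FALSE uses the asymptotic t approximation instead of an exact p-value
res <- cor.test(x, y, method = "spearman", exact = FALSE)

alpha <- 0.05                      # significance level chosen in the text
rho <- unname(res$estimate)        # Spearman's rank correlation coefficient
significant <- res$p.value < alpha # TRUE only if the p-value is below alpha

print(rho)
print(res$p.value)
print(significant)
```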