Looking For a Correlation

Attempt #1

For this assignment I wanted to find out whether two of the tested variables were correlated. Initially, I looked for a relationship between fluoride and arsenic levels. I thought that wells with a higher concentration of fluoride might have a lower concentration of arsenic. This was based on the assumption that better developed and maintained wells would likely see more fluoride added to the water supply and better protections against arsenic being introduced.

After merging the two data frames by location and filtering out any locations that had fewer than 20 wells tested for either fluoride or arsenic, I was ready to make some graphs. I ended up plotting three graphs on my first attempt.
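The merge-and-filter step described above might look roughly like the sketch below. The data frames, column names (`location`, `n_wells_tested`, `median`) and values are hypothetical stand-ins, not the data used for this assignment.

```r
# Hypothetical per-location summaries; names and values are illustrative only
arsenic <- data.frame(
  location       = c("A", "B", "C", "D"),
  n_wells_tested = c(25, 10, 300, 40),
  median         = c(1.0, 0.5, 4.2, 2.0),
  stringsAsFactors = FALSE
)
fluoride <- data.frame(
  location       = c("A", "B", "C", "E"),
  n_wells_tested = c(30, 50, 250, 60),
  median         = c(0.2, 0.1, 0.4, 0.3),
  stringsAsFactors = FALSE
)

# Inner join on location; the suffixes tell the two sources apart
wells <- merge(arsenic, fluoride, by = "location",
               suffixes = c("_arsenic", "_fluoride"))

# Keep only locations with at least 20 wells tested for BOTH contaminants
wells <- wells[wells$n_wells_tested_arsenic >= 20 &
               wells$n_wells_tested_fluoride >= 20, ]
```

With these toy values, locations D and E drop out in the join (each appears in only one data frame) and B drops out in the filter (only 10 arsenic tests).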

These graphs consisted of:

The median levels of arsenic at each location plotted against the median levels of fluoride
The 95th percentile levels of arsenic at each location plotted against the 95th percentile levels of fluoride
The maximum levels of arsenic at each location plotted against the maximum levels of fluoride

Here are the resulting scatterplots:

Unfortunately, as can be seen here, there isn’t a clear correlation between any of these variable pairs. All three show a slight positive correlation but, aside from the “Median” scatterplot, the lines of best fit are practically horizontal, suggesting that any correlation is virtually nonexistent.
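A near-horizontal line is a useful visual cue, but the slope depends on the units of each axis, so it may be worth checking the correlation coefficient as well. The sketch below uses made-up values (the variable names are hypothetical) to show that rescaling one axis changes the slope while the Pearson correlation stays the same.

```r
# Hypothetical per-location medians (illustrative values, not the real data)
median_fluoride <- c(1, 2, 3, 4, 5)
median_arsenic  <- c(2.0, 2.1, 1.9, 2.2, 2.0)

# Slope of the least-squares line: depends on the units of both axes
fit   <- lm(median_arsenic ~ median_fluoride)
slope <- unname(coef(fit)[2])

# Pearson correlation: unit-free, always between -1 and 1
r <- cor(median_arsenic, median_fluoride)

# Express arsenic in units 1000 times smaller and refit
arsenic_rescaled <- median_arsenic * 1000
slope_rescaled   <- unname(coef(lm(arsenic_rescaled ~ median_fluoride))[2])
r_rescaled       <- cor(arsenic_rescaled, median_fluoride)

# The slope grows by a factor of 1000; the correlation does not move
print(c(slope = slope, slope_rescaled = slope_rescaled))
print(c(r = r, r_rescaled = r_rescaled))
```

Reporting `cor()` alongside each scatterplot would put a number on "virtually nonexistent" that does not change with the axis scaling.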

Attempt #2

The next thing I tried was to plot the median levels of recorded arsenic and fluoride at each location against the number of wells tested. This was, in a way, just me approaching the assumption from attempt #1 from a different angle. The thought process here was that more wells means a larger town or city which, in theory, should mean higher quality utilities. Here are the resulting scatterplots:

Much like with attempt #1, I feel nothing significant is present in these plots. The line of best fit for the fluoride chart can hardly be any closer to horizontal. As for arsenic, though it does seem to show a positive correlation, I’m not convinced that it is valid. To test this comparison further, I would want more locations with a high well count to see whether the trend holds up.

Attempt #3

My third attempt to find a correlation built on my second attempt’s use of the number of wells tested. This time I wanted to see if the number of wells tested would have a noticeable effect on the percent of wells that tested above the recommended guidelines for both fluoride and arsenic. These plots both yielded similar information, but also raised some more questions. Here they are:

These graphs both present a similar pattern. For fluoride, when the number of wells tested is fewer than 200, there are clear striations that tend toward 0. Past this point the pattern largely starts to break down, possibly due to the small sample size. The data points that fall between 200 and 300 wells tested do seem to somewhat follow the same pattern, but beyond that the points become too sparse to support much of any inference.

The graph that displays the same data but with arsenic instead of fluoride tells a very similar story. Between 20 and 100 wells tested there are very noticeable striations of data that tend toward 0. Between 100 and 200 there is a slight indication of that downward trend. This trend breaks down further between 200 and 300, so much so that it is practically nonexistent. Interestingly, after we cross the line of 300 wells tested, only a single location has fewer than 26% of the tested wells exceeding the recommended guidelines.
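One possible reading of the striations, which I have not verified against the data, is that they come from the arithmetic of the percentage itself. If a location has exactly k wells over the guideline out of n tested, its percentage is 100 * k / n, so every location with the same small k falls on the same curve, and each curve falls toward 0 as n grows. The sketch below draws those curves; the function name `pct_over` and the plotting ranges are my own choices for illustration.

```r
# Assumption: percent over guideline = 100 * (wells over guideline) / (wells tested)
pct_over <- function(k, n) 100 * k / n

n_wells <- 20:300

# Curve for locations with exactly one well over the guideline
plot(n_wells, pct_over(1, n_wells), type = "l", ylim = c(0, 25),
     xlab = "Wells tested", ylab = "Percent of wells over guideline")

# Curves for exactly 2, 3, 4 and 5 wells over the guideline
for (k in 2:5) {
  lines(n_wells, pct_over(k, n_wells), lty = k)
}
```

If the bands in the real scatterplots line up with these curves, the striations would reflect small counts of exceeding wells and not a relationship between town size and water quality.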

# This is the code used to generate the table below
knitr::kable(
  A.wells_over_300,
  "html",
  col.names = c("Location", "Wells tested", "Percent of wells over guideline"),
  align = "lcc",
  caption = "Percent of wells that tested over the guideline in locations where more than 300 wells were tested"
)

Percent of wells that tested over the guideline in locations where more than 300 wells were tested
Location	Wells tested	Percent of wells over guideline
Standish	632	26.9
Gorham	467	50.1
Augusta	454	26.4
Ellsworth	428	29.7
Winthrop	424	44.8
Belgrade	401	31.2
Readfield	344	39.8
Buxton	334	43.4
Harpswell	300	5.3

Upon discovering this interesting quirk in the data, I realized a larger sample size would be needed to better understand what is going on here. I can speculate all day as to why 8 of the 9 locations where over 300 wells were tested for arsenic came back with over 26% of the wells above the recommended guidelines. Perhaps these locations are seated on deposits of rock that contain elevated levels of arsenic? I feel the only way to answer this would be to have more data.
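The "8 of the 9" figure can be checked directly from the table. The sketch below re-enters the table values by hand; the data frame and column names are made up for this check and are not the `A.wells_over_300` object used to build the table.

```r
# Values copied from the table above; object and column names are illustrative
wells_over_300 <- data.frame(
  location = c("Standish", "Gorham", "Augusta", "Ellsworth", "Winthrop",
               "Belgrade", "Readfield", "Buxton", "Harpswell"),
  wells_tested = c(632, 467, 454, 428, 424, 401, 344, 334, 300),
  pct_over = c(26.9, 50.1, 26.4, 29.7, 44.8, 31.2, 39.8, 43.4, 5.3),
  stringsAsFactors = FALSE
)

# How many locations have more than 26% of wells over the guideline?
n_above <- sum(wells_over_300$pct_over > 26)

# Which location is the exception?
exception <- wells_over_300$location[wells_over_300$pct_over <= 26]

print(n_above)    # 8
print(exception)  # "Harpswell"
```

Harpswell, the one exception, sits at exactly 300 wells tested, so the filter behind the table presumably kept locations with 300 or more wells. That is an inference from the table, not something confirmed by the code shown.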

A01: Trent Dexter

Trent Dexter

3/13/2021
