My R Markdown leads to a scatter plot that positions Maine locations according to two variables: the number of wells that tested above the arsenic (AR) guideline (x-axis) and the number of wells that tested above the fluoride (FL) guideline (y-axis). To get these results, I had to wrangle the data using the R commands detailed throughout this R Markdown. I feel the resulting graph clearly shows which locations have the most unsafe wells. I believe it could be useful to many professionals, including real estate agents and members of the agricultural community.
Maine’s Maximum Exposure Guideline for fluoride is 2 milligrams per liter (mg/L); for arsenic it is 10 micrograms per liter (ug/L). Both csv files include the following fields:
location - the name of the town, township, or regional area in Maine
n_wells_tested - the number of wells tested
percent_wells_above_guideline - the percentage of wells that tested above the maximum exposure guideline
median - the median reading, in mg/L for fluoride or ug/L for arsenic
percentile_95 - the 95th percentile reading, in mg/L or ug/L
maximum - the maximum reading, in mg/L or ug/L
The first challenge was to configure my RStudio Cloud environment with the proper packages and data sets. I installed the relevant packages (dplyr, tidyr and ggplot2) with the install.packages command and loaded them with the library command. I also read in my data sets and assigned each a new name using the following code:
ME_AR <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
ME_FL <- read.csv("flouride.csv", header = TRUE, stringsAsFactors = FALSE)
install.packages("dplyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
install.packages("tidyr")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
install.packages("ggplot2")
## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(tidyr)
library(ggplot2)
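Because install.packages runs again every time the document is knit, one option is to check first which packages are actually missing. The following is a minimal sketch rather than part of my original workflow: missing_packages is a hypothetical helper name, and the install line is left commented out.

```r
# Hypothetical helper: return the entries of `pkgs` that are not installed.
missing_packages <- function(pkgs) {
  # requireNamespace() returns TRUE if the package can be loaded, FALSE otherwise
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

needed <- c("dplyr", "tidyr", "ggplot2")
to_install <- missing_packages(needed)

# Uncomment to install only what is missing:
# if (length(to_install) > 0) install.packages(to_install)
```

The library calls would still follow as before; this only avoids reinstalling packages that are already present.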
I tried several different commands to learn about these data sets. I experimented with many filter, gather and mutate commands and decided to keep only locations with wells that exceeded the state’s guidelines for arsenic or fluoride. I felt it would be more compelling to show the total number of wells above the guideline rather than a percentage of n_wells_tested, so I estimated that count as n_wells_tested multiplied by percent_wells_above_guideline and divided by 100:
number_high_AR_wells <- mutate(ME_AR, (n_wells_tested * percent_wells_above_guideline) / 100)
names(number_high_AR_wells) <- c("location", "n_wells_tested", "percent_wells_above_guideline", "median", "percentile", "maximum", "total_wells_above_ARguideline" )
number_high_FL_wells <- mutate(ME_FL, (n_wells_tested * percent_wells_above_guideline) / 100)
names(number_high_FL_wells) <- c("location", "n_wells_tested", "percent_wells_above_guideline", "median", "percentile", "maximum", "total_wells_above_FLguideline" )
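To make the arithmetic concrete, here is a small sketch on made-up numbers (toy is illustrative data, not the Maine files, and est_wells_above_guideline is a name chosen for this example). Naming the new column inside mutate is an alternative to renaming every column afterwards. Because the count is rebuilt from a percentage, it is an estimate and need not be a whole number.

```r
suppressPackageStartupMessages(library(dplyr))

# Illustrative data only; these are not values from arsenic.csv or flouride.csv
toy <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  n_wells_tested = c(50, 200, 8),
  percent_wells_above_guideline = c(10, 2.5, NA),
  stringsAsFactors = FALSE
)

# Estimated count = wells tested * percent above guideline / 100
toy_counts <- mutate(
  toy,
  est_wells_above_guideline = n_wells_tested * percent_wells_above_guideline / 100
)
```

A location with a missing percentage keeps NA in the new column, which matters for the filtering done later.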
I realized that I had forgotten to filter number_high_FL_wells, so I corrected this oversight later, before plotting my results. First, here is a summary of the unfiltered data set:
summary(number_high_FL_wells)
## location n_wells_tested percent_wells_above_guideline
## Length:917 Min. : 0.00 Min. : 0.000
## Class :character 1st Qu.: 0.00 1st Qu.: 0.000
## Mode :character Median : 6.00 Median : 0.600
## Mean : 38.17 Mean : 2.448
## 3rd Qu.: 49.00 3rd Qu.: 3.125
## Max. :503.00 Max. :30.000
## NA's :557
## median percentile maximum
## Min. :0.1000 Min. :0.1000 Min. : 0.0500
## 1st Qu.:0.1000 1st Qu.:0.5195 1st Qu.: 0.4225
## Median :0.1000 Median :0.9855 Median : 1.3000
## Mean :0.1762 Mean :1.1471 Mean : 1.8987
## 3rd Qu.:0.2000 3rd Qu.:1.5995 3rd Qu.: 2.9000
## Max. :1.2900 Max. :4.4400 Max. :14.0000
## NA's :557 NA's :557 NA's :363
## total_wells_above_FLguideline
## Min. : 0.0000
## 1st Qu.: 0.0000
## Median : 0.9755
## Mean : 2.3711
## 3rd Qu.: 2.9933
## Max. :46.7790
## NA's :557
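The NA's rows in the summary above mark locations where a value is missing. A quick way to count them is sketched below on made-up data (toy_fl and its values are illustrative, not taken from flouride.csv).

```r
# Illustrative data only
toy_fl <- data.frame(
  location = c("Town A", "Town B", "Town C", "Town D"),
  n_wells_tested = c(40, 3, 120, 0),
  percent_wells_above_guideline = c(2.5, NA, 0, NA),
  stringsAsFactors = FALSE
)

# Number of rows with a missing percentage
n_missing <- sum(is.na(toy_fl$percent_wells_above_guideline))

# Number of rows with no missing value in any column
n_complete <- sum(complete.cases(toy_fl))

# Share of rows with a missing percentage
share_missing <- mean(is.na(toy_fl$percent_wells_above_guideline))
```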
I quickly realized that I needed to refine my data set to get a concise graph, so I joined my tables with this code:
ME_AR_FL <- number_high_AR_wells %>%
  filter(total_wells_above_ARguideline > 0) %>%
  select(location, total_wells_above_ARguideline) %>%
  inner_join(number_high_FL_wells)
## Joining, by = "location"
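An inner join keeps only the locations that appear in both tables. Here is a minimal sketch on made-up data, assuming the two tables share only the location column (toy_ar and toy_fl are illustrative). Passing by = "location" states the key explicitly instead of relying on the automatic match reported in the message above.

```r
suppressPackageStartupMessages(library(dplyr))

# Illustrative data only
toy_ar <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  total_wells_above_ARguideline = c(5, 0, 12),
  stringsAsFactors = FALSE
)
toy_fl <- data.frame(
  location = c("Town A", "Town C", "Town D"),
  total_wells_above_FLguideline = c(1, 3, 2),
  stringsAsFactors = FALSE
)

# Town B is dropped by the filter; Town D has no match in toy_ar
toy_joined <- toy_ar %>%
  filter(total_wells_above_ARguideline > 0) %>%
  inner_join(toy_fl, by = "location")
```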
I changed the variable names…
names(ME_AR_FL) <- c("location", "total_wells_above_guideline", "n_wells_tested", "percent_wells_above_guideline", "median", "percentile", "maximum", "total_wells_above_FLguideline" )
…and refined the variables further to get to my X,Y values. I also filtered the data set to keep only rows with total_wells_above_FLguideline > 0, to compensate for my earlier oversight.
ME_AR_FL2 <- select(ME_AR_FL, "location", "total_wells_above_guideline", "total_wells_above_FLguideline")
ME_AR_FL3 <- filter(ME_AR_FL2, total_wells_above_FLguideline > 0)
This resulted in the following table:
summary(ME_AR_FL3)
## location total_wells_above_guideline
## Length:158 Min. : 0.952
## Class :character 1st Qu.: 3.275
## Mode :character Median : 8.998
## Mean : 23.197
## 3rd Qu.: 21.011
## Max. :189.952
## total_wells_above_FLguideline
## Min. : 0.954
## 1st Qu.: 1.048
## Median : 2.982
## Mean : 4.820
## 3rd Qu.: 5.045
## Max. :46.779
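The summary shows no NA's because filter drops rows where the condition evaluates to NA as well as rows where it is FALSE. A small sketch on made-up data (toy_counts is illustrative):

```r
suppressPackageStartupMessages(library(dplyr))

# Illustrative data only: one positive count, one zero, one missing
toy_counts <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  total_wells_above_FLguideline = c(2.98, 0, NA),
  stringsAsFactors = FALSE
)

# Keeps Town A only: 0 > 0 is FALSE and NA > 0 is NA, so both rows are removed
toy_filtered <- filter(toy_counts, total_wells_above_FLguideline > 0)
```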
Visualizing this data took several experiments. Ideally, it would be shown on a map. However, I feel this graph, coupled with the data table, provides a quick reference guide for my audience. Here you can see the Maine locations with the highest numbers of wells above the fluoride and arsenic guidelines. All of these locations had wells that tested above state guidelines, and those with the largest counts are the ones where I would most avoid using well water.
ggplot(ME_AR_FL3, aes(label = location, x = total_wells_above_guideline, y = total_wells_above_FLguideline)) + geom_label()
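The plot could be made easier to read with explicit axis labels. The following is a sketch on made-up data, not my actual figure; toy_plot_data, the label text and the title are illustrative choices.

```r
library(ggplot2)

# Illustrative data only, using the same column names as the joined table
toy_plot_data <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  total_wells_above_guideline = c(5, 21, 190),
  total_wells_above_FLguideline = c(1, 5, 47),
  stringsAsFactors = FALSE
)

# Same geom_label() plot, with labs() supplying readable axis text and a title
p <- ggplot(
  toy_plot_data,
  aes(x = total_wells_above_guideline,
      y = total_wells_above_FLguideline,
      label = location)
) +
  geom_label() +
  labs(
    x = "Estimated wells above arsenic guideline",
    y = "Estimated wells above fluoride guideline",
    title = "Maine locations with wells above both guidelines"
  )
```

Printing p (or calling print(p)) draws the plot.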