MBA676 Assignment #1

Introduction

My R Markdown produces a Cartesian graph that places Maine locations according to two variables: the number of wells that tested above the arsenic (AR) guideline (X-axis) and the number of wells that tested above the fluoride (FL) guideline (Y-axis). To get these results, I wrangled the data using the R commands detailed throughout this R Markdown. I feel the resulting graph clearly shows locations with unsafe wells. I believe this graph could be useful to many professionals, including real estate agents and members of the agricultural community.

Assignment Assumptions

Maine’s Maximum Exposure Guideline for fluoride is 2 milligrams per liter (mg/L); for arsenic it is 10 micrograms per liter (ug/L). Both csv files include the following fields:

location - the name of the town, township, or regional area in Maine
n_wells_tested - the number of wells tested.
percent_wells_above_guideline - percentage of wells that tested above the maximum exposure guidelines
median - the median reading, in mg/L for fluoride or ug/L for arsenic
percentile_95 - the 95th percentile readings in mg/L or ug/L
maximum - the maximum readings in mg/L or ug/L
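Because the two guidelines use different units, it helps to keep each threshold next to its contaminant. The following is a minimal sketch in base R; the helper name above_guideline and the 8.5 ug/L reading are hypothetical, while the 14 mg/L fluoride reading is the maximum reported in the fluoride summary later in this report.

```r
# Maximum Exposure Guidelines quoted above, each in its own unit
guideline <- c(fluoride = 2, arsenic = 10)   # mg/L for fluoride, ug/L for arsenic

# Hypothetical helper: TRUE when a reading exceeds the guideline for that contaminant
above_guideline <- function(reading, contaminant) {
  reading > guideline[[contaminant]]
}

above_guideline(14, "fluoride")   # a 14 mg/L reading exceeds 2 mg/L
above_guideline(8.5, "arsenic")   # an 8.5 ug/L reading does not exceed 10 ug/L
```

Readings must be in the same unit as the guideline they are compared with; the sketch does no unit conversion.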

Here is my R Code and resulting graph.

The first challenge was to configure my RStudio Cloud environment with the proper packages, libraries, and data sets. I installed the relevant packages (dplyr, tidyr and ggplot2) with the install.packages command and loaded them with the library command. I also read in my data sets and assigned each one a new name, using the following code:

ME_AR <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
ME_FL <- read.csv("flouride.csv", header = TRUE, stringsAsFactors = FALSE)
install.packages("dplyr")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)

install.packages("tidyr")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)

install.packages("ggplot2")

## Installing package into '/home/rstudio-user/R/x86_64-pc-linux-gnu-library/3.6'
## (as 'lib' is unspecified)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)
library(ggplot2)
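Calling install.packages on every knit reinstalls packages that are already present. One possible alternative, sketched here under the assumption that installing only what is missing is acceptable, checks first. The helper names missing_packages and install_if_missing are hypothetical and not part of any package.

```r
# Hypothetical helper: return the packages from `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Hypothetical helper: install only the packages that are missing
install_if_missing <- function(pkgs) {
  to_install <- missing_packages(pkgs)
  if (length(to_install) > 0) {
    install.packages(to_install)
  }
  invisible(to_install)
}
```

With these defined, a single call such as install_if_missing(c("dplyr", "tidyr", "ggplot2")) could replace the three install.packages lines, followed by the usual library calls.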

Let the Data Wrangling begin!

I tried several different commands to learn about these data sets. I experimented with many filter, gather and mutate commands and decided to keep only the locations that had wells exceeding the state’s guidelines for arsenic and fluoride. I felt it would be more compelling to show the total number of wells above the guideline rather than a percentage of n_wells_tested. So, I manipulated my data sets with the following commands:

# Estimated number of wells above the arsenic guideline for each location
number_high_AR_wells <- ME_AR %>%
  mutate(total_wells_above_ARguideline = n_wells_tested * percent_wells_above_guideline / 100) %>%
  rename(percentile = percentile_95)

# Same calculation for fluoride
number_high_FL_wells <- ME_FL %>%
  mutate(total_wells_above_FLguideline = n_wells_tested * percent_wells_above_guideline / 100) %>%
  rename(percentile = percentile_95)
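To make the arithmetic concrete, here is a minimal sketch in base R (so it runs without extra packages). The three rows are made up for illustration and are not taken from arsenic.csv or flouride.csv.

```r
# Made-up rows for illustration only
wells <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  n_wells_tested = c(200, 50, 0),
  percent_wells_above_guideline = c(12.5, 4, NA)
)

# Estimated count = wells tested * percent above guideline / 100
wells$total_wells_above_guideline <-
  wells$n_wells_tested * wells$percent_wells_above_guideline / 100

# Town A: 200 * 12.5 / 100 = 25; Town B: 50 * 4 / 100 = 2; Town C: NA
wells$total_wells_above_guideline
```

A missing percentage produces a missing count, even when no wells were tested. The counts are estimates derived from a percentage, so they can come out fractional.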

I realized that I had not yet filtered number_high_FL_wells, so I corrected this oversight before plotting my results. The summary below shows why the filter is needed: many rows are zero or NA.

summary(number_high_FL_wells)

##    location         n_wells_tested   percent_wells_above_guideline
##  Length:917         Min.   :  0.00   Min.   : 0.000               
##  Class :character   1st Qu.:  0.00   1st Qu.: 0.000               
##  Mode  :character   Median :  6.00   Median : 0.600               
##                     Mean   : 38.17   Mean   : 2.448               
##                     3rd Qu.: 49.00   3rd Qu.: 3.125               
##                     Max.   :503.00   Max.   :30.000               
##                                      NA's   :557                  
##      median         percentile        maximum       
##  Min.   :0.1000   Min.   :0.1000   Min.   : 0.0500  
##  1st Qu.:0.1000   1st Qu.:0.5195   1st Qu.: 0.4225  
##  Median :0.1000   Median :0.9855   Median : 1.3000  
##  Mean   :0.1762   Mean   :1.1471   Mean   : 1.8987  
##  3rd Qu.:0.2000   3rd Qu.:1.5995   3rd Qu.: 2.9000  
##  Max.   :1.2900   Max.   :4.4400   Max.   :14.0000  
##  NA's   :557      NA's   :557      NA's   :363      
##  total_wells_above_FLguideline
##  Min.   : 0.0000              
##  1st Qu.: 0.0000              
##  Median : 0.9755              
##  Mean   : 2.3711              
##  3rd Qu.: 2.9933              
##  Max.   :46.7790              
##  NA's   :557

Generate a concise table for plotting data:

I quickly realized that I needed to refine my data set to get a concise graph. So, I joined my tables with this code:

ME_AR_FL <- number_high_AR_wells %>%
  filter(total_wells_above_ARguideline > 0) %>%
  select(location, total_wells_above_ARguideline) %>%
  inner_join(number_high_FL_wells)

## Joining, by = "location"
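An inner join keeps only the locations that appear in both tables. The sketch below shows the same matching rule with base R's merge, which behaves like an inner join when all = FALSE (its default). The town names and counts are made up for illustration.

```r
# Made-up tables for illustration only
arsenic <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  total_wells_above_ARguideline = c(25, 2, 7)
)
fluoride <- data.frame(
  location = c("Town B", "Town C", "Town D"),
  total_wells_above_FLguideline = c(1, 3, 4)
)

# Keep only locations found in both tables, matched on the shared column
both <- merge(arsenic, fluoride, by = "location")
both
```

Town A and Town D each appear in only one table, so they drop out of the result.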

I changed the variable names…

names(ME_AR_FL) <- c("location", "total_wells_above_guideline", "n_wells_tested", "percent_wells_above_guideline", "median", "percentile", "maximum", "total_wells_above_FLguideline" )

…and narrowed the variables down to my X and Y values. I also filtered the data set to keep only rows with total_wells_above_FLguideline > 0, to compensate for the earlier oversight.

ME_AR_FL2 <- select(ME_AR_FL, location, total_wells_above_guideline, total_wells_above_FLguideline)
ME_AR_FL3 <- filter(ME_AR_FL2, total_wells_above_FLguideline > 0)
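Filtering on > 0 removes more than the zero rows: it also removes rows where the count is missing. Here is a minimal sketch in base R with made-up values; subset is used so the example runs without dplyr, and dplyr's filter treats NA the same way.

```r
# Made-up values for illustration only
counts <- data.frame(
  location = c("Town A", "Town B", "Town C", "Town D"),
  total_wells_above_FLguideline = c(3.2, 0, NA, 1.1)
)

# NA > 0 evaluates to NA, not TRUE or FALSE
counts$total_wells_above_FLguideline > 0

# Only rows where the condition is TRUE are kept, so the zero row
# and the NA row are both dropped
kept <- subset(counts, total_wells_above_FLguideline > 0)
kept$location
```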

This resulted in the following table:

summary(ME_AR_FL3)

##    location         total_wells_above_guideline
##  Length:158         Min.   :  0.952            
##  Class :character   1st Qu.:  3.275            
##  Mode  :character   Median :  8.998            
##                     Mean   : 23.197            
##                     3rd Qu.: 21.011            
##                     Max.   :189.952            
##  total_wells_above_FLguideline
##  Min.   : 0.954               
##  1st Qu.: 1.048               
##  Median : 2.982               
##  Mean   : 4.820               
##  3rd Qu.: 5.045               
##  Max.   :46.779
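A summary shows the spread of the counts but not which locations sit at the top. One way to turn the table into a quick reference, sketched here in base R with made-up rows, is to sort it by the arsenic count and show the first few locations.

```r
# Made-up rows for illustration only
combined <- data.frame(
  location = c("Town A", "Town B", "Town C", "Town D"),
  total_wells_above_guideline = c(8.9, 189.9, 3.2, 21.0),
  total_wells_above_FLguideline = c(2.9, 5.0, 46.7, 1.0)
)

# Order rows from the highest arsenic count to the lowest
ranked <- combined[order(-combined$total_wells_above_guideline), ]

# Show the top three locations
head(ranked, 3)
```

Sorting on total_wells_above_FLguideline instead would rank the same locations by fluoride.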

Plotting with ggplot!

Visualizing this data took several experiments. Ideally, this would look best on a map. However, I feel this graph, coupled with the data table, provides a quick reference guide for my audience. Here you can see the Maine locations that have high levels of fluoride and arsenic. All of these locations had wells that tested above the state guidelines, so these are the locations where I would avoid using well water.

# Plot the filtered table (ME_AR_FL3) so rows with zero or missing fluoride counts are excluded
ggplot(ME_AR_FL3, aes(label = location, x = total_wells_above_guideline, y = total_wells_above_FLguideline)) + geom_label()
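By default the axes show the raw column names, which do not say which contaminant is on which axis. The sketch below adds readable labels with ggplot2's labs function. The data rows, label wording, and title are made up for illustration.

```r
library(ggplot2)

# Made-up rows for illustration only
plot_data <- data.frame(
  location = c("Town A", "Town B", "Town C"),
  total_wells_above_guideline = c(8.9, 189.9, 3.2),
  total_wells_above_FLguideline = c(2.9, 5.0, 46.7)
)

# Same label plot as in this report, with explicit axis labels and a title
p <- ggplot(plot_data,
            aes(x = total_wells_above_guideline,
                y = total_wells_above_FLguideline,
                label = location)) +
  geom_label() +
  labs(
    x = "Wells above arsenic guideline (estimated count)",
    y = "Wells above fluoride guideline (estimated count)",
    title = "Maine locations with wells above both guidelines"
  )

p
```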