Refer to the detailed instructions for this assignment in Brightspace.
Don’t alter the three code chunks in this section. First we read in the two data sets and deleting missing values.
#install.packages('readr')
library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()
Next we display the first few rows of fluoride.
head(fluoride)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_gui… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbrook 31 16.1 1.29 2.44 3.3
Then we display the first few rows of arsenic.
head(arsenic)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_g… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manchester 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Columbia 42 50 9.8 65.9 200
## 4 Monmouth 277 49.5 10 110 368
## 5 Eliot 73 49.3 9.7 41.4 45
## 6 Columbia F… 25 48 8.1 53.8 71
In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.
chemicals <- arsenic %>% inner_join(fluoride, by= "location")
The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.
head(chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manches… 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Columbia 42 50 9.8 65.9 200
## 4 Monmouth 277 49.5 10 110 368
## 5 Eliot 73 49.3 9.7 41.4 45
## 6 Columbi… 25 48 8.1 53.8 71
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
# x=arsenic, y=fluoride
In the code chunk below create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.
median_arsenic <- chemicals %>% filter(chemicals$median.x >=10 & chemicals$median.y <2) %>% arrange(desc(median.x))
median_fluoride <- chemicals %>% filter(chemicals$median.x <10 & chemicals$median.y >=2) %>% arrange(desc(median.y))
median_chemicals <- chemicals %>% filter(chemicals$median.x >=10 & chemicals$median.y >=2)
dim(median_arsenic)[1]
## [1] 3
dim(median_fluoride)[1]
## [1] 0
dim(median_chemicals)[1]
## [1] 0
percentile_95_arsenic <- chemicals %>% filter(chemicals$percentile_95.x >=10 & chemicals$percentile_95.y <2) %>% arrange(desc(percentile_95.x))
percentile_95_fluoride <- chemicals %>% filter(chemicals$percentile_95.x <10 & chemicals$percentile_95.y >=2) %>% arrange(desc(percentile_95.x))
percentile_95_chemicals <- chemicals %>% filter(chemicals$percentile_95.x >=10 & chemicals$percentile_95.y >=2)
dim(percentile_95_arsenic)[1]
## [1] 190
dim(percentile_95_fluoride)[1]
## [1] 23
dim(percentile_95_chemicals)[1]
## [1] 27
Above, we observe the number of towns where at least 50% of the wells were out of compliance regarding the amount of arsenic and fluoride found in their wells. By this measure, 3 towns failed outright regarding arsenic. One the other hand, there were no towns where the majority of tested wells were outside of acceptable parameters for fluoride.
I have also pulled out the 95th percentile measure, which shows towns where at least 5% of the wells were out of compliance. This indicates towns that seem to have problem areas but not necessarily an overwhelming amount of noncompliance. 190 towns had wells that tested high for only arsenic, 23 towns had wells that tested high on only fluoride, and 27 towns had wells out of compliance with both (though, not necessarily high levels of both fluoride and arsenic in the same wells).
I think it’s clear that arsenic was a larger (and potentially much more dangerous) concern in Maine wells than fluoride was.
head(median_arsenic)
## # A tibble: 3 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manches… 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Monmouth 277 49.5 10 110 368
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
head(median_fluoride)
## # A tibble: 0 x 11
## # … with 11 variables: location <chr>, n_wells_tested.x <dbl>,
## # percent_wells_above_guideline.x <dbl>, median.x <dbl>,
## # percentile_95.x <dbl>, maximum.x <dbl>, n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
head(median_chemicals)
## # A tibble: 0 x 11
## # … with 11 variables: location <chr>, n_wells_tested.x <dbl>,
## # percent_wells_above_guideline.x <dbl>, median.x <dbl>,
## # percentile_95.x <dbl>, maximum.x <dbl>, n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_arsenic)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Danforth 35 40 5 372. 3100
## 2 Standish 632 26.9 2 154 550
## 3 Gorham 467 50.1 10.5 130 460
## 4 Wales 42 33.3 2 129 180
## 5 Camden 50 34 5 125 404
## 6 Northpo… 157 26.1 2.1 114. 1700
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_fluoride)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cornish 23 4.3 0.785 9.86 68
## 2 York 119 2.5 0.73 9.72 35
## 3 Southwe… 34 5.9 0.5 8.97 23
## 4 Prospect 50 4 1 8 16
## 5 Lebanon 48 4.2 0.5 7.96 31
## 6 Greenwo… 23 4.3 0.25 7.12 15
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
head(percentile_95_chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Blue Hi… 241 42.7 7 229 930
## 2 Orland 123 40.7 5.4 44.6 160
## 3 Surry 181 40.3 6 146. 470
## 4 Mariavi… 30 40 7.2 55.5 57
## 5 Otis 53 39.6 4.8 121. 200
## 6 Sedgwick 142 37.3 4.2 237. 840
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
ggplot(data=chemicals, aes(x=percentile_95.x, y=percentile_95.y)) +
geom_point(size=1, alpha= 0.3) +
geom_hline(yintercept=2, col="blue") +
geom_vline(xintercept=10, col = "red") +
ggtitle("Minimum Levels of Flouride vs Arsentic in the Top 5% of Wells") +
xlab("Arsenic in Top 5% of Wells") +
ylab("Fluoride in Top 5% of Wells")
chemicals<- chemicals %>% mutate(noncompliance_95 = if_else(percentile_95.x < 10 & percentile_95.y < 2, " in compliance/ none", if_else(percentile_95.y >= 2 & percentile_95.x < 10, " only fluoride", if_else(percentile_95.x >= 10 & percentile_95.y < 2, " only arsenic", if_else (percentile_95.x >= 10 & percentile_95.y >= 2, "both", "missing")))))
counts_95 <- chemicals %>% group_by (noncompliance_95) %>%
summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data= counts_95, aes(x= noncompliance_95, y= count)) +
geom_bar(stat= "identity") +
ggtitle("Number of Towns 5% or More Out of Compliance") +
xlab("Chemicals") +
ylab("Number of Towns")
ggplot(data=chemicals, aes(x=median.x, y=median.y)) +
geom_point(size=1, alpha= 0.3) +
geom_hline(yintercept=2, col="blue") +
geom_vline(xintercept=10, col = "red") +
ggtitle("Median Levels of Flouride vs Arsentic in Maine Towns") +
xlab(" Median Arsenic") +
ylab("Median Fluoride")
chemicals<- chemicals %>% mutate(noncompliance_median = if_else(median.x < 10 & median.y < 2, " in compliance/ none", if_else(median.y >= 2 & median.x < 10, " only fluoride", if_else(median.x >= 10 & median.y < 2, " only arsenic", if_else (median.x >= 10 & median.y >= 2, "both", "missing")))))
counts_median <- chemicals %>% group_by (noncompliance_median) %>%
summarize(count=n())
## `summarise()` ungrouping output (override with `.groups` argument)
ggplot(data= counts_median, aes(x= noncompliance_median, y= count)) +
geom_bar(stat= "identity") +
ggtitle("Number of Towns 50% or More Out of Compliance") +
xlab("Chemicals") +
ylab("Number of Towns")