Refer to the detailed instructions for this assignment in Brightspace.
Don’t alter the three code chunks in this section. First we read in the two data sets and delete rows with missing values.
library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()
Next we display the first few rows of fluoride.
head(fluoride)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_gui… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbrook 31 16.1 1.29 2.44 3.3
Then we display the first few rows of arsenic.
head(arsenic)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_g… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manchester 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Columbia 42 50 9.8 65.9 200
## 4 Monmouth 277 49.5 10 110 368
## 5 Eliot 73 49.3 9.7 41.4 45
## 6 Columbia F… 25 48 8.1 53.8 71
In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.
chemicals <- fluoride %>% inner_join(arsenic, by = "location")
The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.
head(chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbro… 31 16.1 1.29 2.44 3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
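The `.x` and `.y` suffixes in the output above are dplyr's defaults for columns that share a name in both tables; here `.x` refers to fluoride and `.y` to arsenic. As a minimal sketch, the `suffix` argument of `inner_join()` can give those columns self-describing names instead. The tibbles `fluoride_toy` and `arsenic_toy` and their values are made up for illustration and are not taken from the real data sets.

```r
library(dplyr)

# Made-up toy rows (not real data), only to show how `suffix` renames columns.
fluoride_toy <- tibble(location = c("Town A", "Town B"), maximum = c(3.6, 7.0))
arsenic_toy  <- tibble(location = c("Town A", "Town C"), maximum = c(200, 45))

# Same inner join as above, but with descriptive suffixes instead of .x / .y.
chemicals_toy <- fluoride_toy %>%
  inner_join(arsenic_toy, by = "location", suffix = c("_fluoride", "_arsenic"))

# Only "Town A" appears in both tables, so the inner join keeps one row.
names(chemicals_toy)
```

With this naming, later filters would read `maximum_arsenic > 10` rather than `maximum.y > 10`, which is easier to follow without checking the join order.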
In the code chunk below, create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.
high_arsenic <- chemicals %>% filter(maximum.y > 10) %>% arrange(desc(maximum.y))
summary(high_arsenic)
## location n_wells_tested.x percent_wells_above_guideline.x
## Length:291 Min. : 21.0 Min. : 0.000
## Class :character 1st Qu.: 45.5 1st Qu.: 0.000
## Mode :character Median : 74.0 Median : 0.800
## Mean :105.2 Mean : 2.396
## 3rd Qu.:139.0 3rd Qu.: 2.950
## Max. :503.0 Max. :30.000
## median.x percentile_95.x maximum.x n_wells_tested.y
## Min. :0.100 Min. :0.100 Min. : 0.100 Min. : 20.00
## 1st Qu.:0.100 1st Qu.:0.587 1st Qu.: 1.200 1st Qu.: 39.00
## Median :0.100 Median :1.034 Median : 2.200 Median : 62.00
## Mean :0.175 Mean :1.160 Mean : 2.613 Mean : 94.36
## 3rd Qu.:0.200 3rd Qu.:1.562 3rd Qu.: 3.600 3rd Qu.:118.50
## Max. :1.290 Max. :4.180 Max. :14.000 Max. :632.00
## percent_wells_above_guideline.y median.y percentile_95.y
## Min. : 0.80 Min. : 0.250 Min. : 1.000
## 1st Qu.: 5.40 1st Qu.: 0.500 1st Qu.: 9.832
## Median :10.60 Median : 1.000 Median : 16.900
## Mean :14.50 Mean : 1.814 Mean : 28.629
## 3rd Qu.:19.55 3rd Qu.: 2.000 3rd Qu.: 31.500
## Max. :58.90 Max. :14.000 Max. :372.500
## maximum.y
## Min. : 11.0
## 1st Qu.: 26.5
## Median : 46.0
## Mean : 110.4
## 3rd Qu.: 110.0
## Max. :3100.0
highest_arsenic <- chemicals %>% filter(maximum.y > 110.4) %>% arrange(desc(maximum.y))
For the interesting subset, I first filtered the joined data to keep only the locations where the maximum arsenic level (maximum.y) was greater than 10 ug/L, and stored them in a tibble called high_arsenic. This kept 291 observations out of the original 360.
To narrow down the number of observations, I ran summary() on the high_arsenic tibble to find the mean of the maximum arsenic values above 10 ug/L, which was 110.4 ug/L. I then filtered for all observations with a maximum arsenic level greater than this mean (110.4 ug/L) and stored the results in a second tibble called highest_arsenic. This second tibble contains 65 observations, which I treated as my interesting subset.
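The two-step filter described above could also compute the cutoff directly instead of copying the rounded 110.4 from the summary output. The following is only a sketch of that idea on a made-up tibble (`chemicals_toy` and its values are hypothetical); note that using the unrounded mean on the real data could in principle give a slightly different row count than the rounded value.

```r
library(dplyr)

# Made-up toy values (not real data) standing in for the joined tibble.
chemicals_toy <- tibble(
  location  = c("A", "B", "C", "D"),
  maximum.y = c(5, 20, 40, 300)   # maximum arsenic, ug/L
)

# Step 1: keep locations whose maximum arsenic exceeds 10 ug/L.
high_toy <- chemicals_toy %>% filter(maximum.y > 10)

# Step 2: use the mean of those maxima as the second cutoff.
threshold <- mean(high_toy$maximum.y)   # (20 + 40 + 300) / 3 = 120

# Step 3: keep only locations above the mean, largest first.
highest_toy <- high_toy %>%
  filter(maximum.y > threshold) %>%
  arrange(desc(maximum.y))
```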
I was curious whether these highest arsenic levels were correlated with fluoride levels, so I created a plot comparing the two variables. I labeled the x-axis ‘Max Arsenic ug/L’ and the y-axis ‘Max Fluoride mg/L’. I also set the code chunk option echo to FALSE so that the plotting code is not displayed in the knitted document. From the plot, there does not appear to be any correlation between high arsenic levels and fluoride levels.
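A visual impression of "no correlation" can be backed up with a correlation coefficient. The sketch below uses base R's `cor()` on made-up values (`highest_toy` is hypothetical, not the real subset). Spearman's rank correlation is shown alongside Pearson's because the arsenic maxima include extreme values (up to 3100 in the summary above), which can dominate a Pearson estimate; this is a suggested check, not something the original analysis ran.

```r
library(dplyr)

# Made-up toy values (not real data): maximum.x = fluoride, maximum.y = arsenic.
highest_toy <- tibble(
  maximum.x = c(1, 2, 3, 4),
  maximum.y = c(10, 20, 30, 1000)
)

# Pearson measures linear association and is sensitive to the outlier.
pearson_r <- cor(highest_toy$maximum.x, highest_toy$maximum.y, method = "pearson")

# Spearman works on ranks, so a single extreme value has less influence.
spearman_r <- cor(highest_toy$maximum.x, highest_toy$maximum.y, method = "spearman")

c(pearson = pearson_r, spearman = spearman_r)
```

On the real subset, the same two calls would take `highest_arsenic$maximum.x` and `highest_arsenic$maximum.y`; values near zero would support the reading of the plot.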
Display the first few rows of your interesting subset in the code chunk below.
head(highest_arsenic)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Danforth 35 0 0.2 0.96 1.9
## 2 Northpo… 87 1.1 0.275 1.25 4.1
## 3 Blue Hi… 209 9.6 0.43 2.86 4.5
## 4 Sedgwick 143 11.2 0.425 2.87 4.2
## 5 Buxton 383 1 0.1 0.899 3.2
## 6 Standish 290 1.7 0.1 1.19 4.7
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
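The plotting code is hidden in this report (echo = FALSE), so the following is only a sketch of what a simple scatter plot with the axis labels described earlier might look like. It assumes a point geometry and uses a made-up tibble (`highest_toy`, hypothetical values) in place of highest_arsenic; the actual chunk may differ.

```r
library(dplyr)
library(ggplot2)

# Made-up toy values (not real data): maximum.x = fluoride, maximum.y = arsenic.
highest_toy <- tibble(
  maximum.x = c(1.5, 3.0, 2.2, 4.0),
  maximum.y = c(120, 250, 400, 900)
)

# Scatter plot of maximum arsenic (x) against maximum fluoride (y),
# using the axis labels quoted in the write-up.
p <- ggplot(highest_toy, aes(x = maximum.y, y = maximum.x)) +
  geom_point() +
  labs(x = "Max Arsenic ug/L", y = "Max Fluoride mg/L")

p
```

Because a few locations have very large arsenic maxima, adding `scale_x_log10()` is one option for spreading out the points, at the cost of a less familiar axis.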
Once you are done, knit, publish, and then submit your link to your published RPubs document in Brightspace.