Refer to the detailed instructions for this assignment in Brightspace.
Don’t alter the three code chunks in this section. First we read in the two data sets and delete missing values.
library(tidyverse)
fluoride <- read_csv("http://jamessuleiman.com/teaching/datasets/fluoride.csv")
fluoride <- fluoride %>% drop_na()
arsenic <- read_csv("http://jamessuleiman.com/teaching/datasets/arsenic.csv")
arsenic <- arsenic %>% drop_na()
Next we display the first few rows of fluoride.
head(fluoride)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_gui… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbrook 31 16.1 1.29 2.44 3.3
got <- count(fluoride)
Then we display the first few rows of arsenic.
head(arsenic)
## # A tibble: 6 x 6
## location n_wells_tested percent_wells_above_g… median percentile_95 maximum
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Manchester 275 58.9 14 93 200
## 2 Gorham 467 50.1 10.5 130 460
## 3 Columbia 42 50 9.8 65.9 200
## 4 Monmouth 277 49.5 10 110 368
## 5 Eliot 73 49.3 9.7 41.4 45
## 6 Columbia F… 25 48 8.1 53.8 71
got <- count(arsenic)
In the code chunk below, create a new tibble called chemicals that joins fluoride and arsenic. You probably want to do an inner join but the join type is up to you.
chemicals <- fluoride %>%
  inner_join(arsenic, by = "location")
head(chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbro… 31 16.1 1.29 2.44 3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
The next code chunk displays the head of your newly created chemicals tibble. Take a look to verify that your join looks ok.
head(chemicals)
## # A tibble: 6 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbro… 31 16.1 1.29 2.44 3.3
## # … with 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
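Because both source tibbles share the same column names, the join labels them `.x` (fluoride) and `.y` (arsenic), which is easy to misread later. The sketch below shows one possible way to get self-describing names with the `suffix` argument of `inner_join()`. The tibbles `fluoride_toy` and `arsenic_toy` are hypothetical, deliberately incomplete stand-ins holding a few median values copied from the output in this document; they are not the full data sets.

```r
library(dplyr)

# Toy rows copied from output shown in this document (hypothetical stand-ins,
# deliberately incomplete so the effect of the inner join is visible).
fluoride_toy <- tibble(
  location = c("Manchester", "Gorham", "Otis"),
  median   = c(0.3, 0.1, 1.13)
)
arsenic_toy <- tibble(
  location = c("Manchester", "Gorham", "Columbia"),
  median   = c(14, 10.5, 9.8)
)

# inner_join() keeps only locations present in both tibbles;
# suffix replaces the default ".x" / ".y" on clashing column names.
chemicals_toy <- fluoride_toy %>%
  inner_join(arsenic_toy, by = "location",
             suffix = c("_fluoride", "_arsenic"))

chemicals_toy
```

With this choice the columns are `location`, `median_fluoride`, and `median_arsenic`, and only the two locations found in both toy tibbles remain.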
In the code chunk below, create an interesting subset of the data. You’ll likely find an interesting subset by filtering for locations that have high or low levels of arsenic, fluoride, or both.
chemicals
## # A tibble: 341 x 11
## location n_wells_tested.x percent_wells_a… median.x percentile_95.x maximum.x
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Otis 60 30 1.13 3.2 3.6
## 2 Dedham 102 22.5 0.94 3.27 7
## 3 Denmark 46 19.6 0.45 3.15 3.9
## 4 Surry 175 18.3 0.8 3.52 6.9
## 5 Prospect 57 17.5 0.785 2.5 2.7
## 6 Eastbro… 31 16.1 1.29 2.44 3.3
## 7 Mercer 32 15.6 0.6 4.18 6.1
## 8 Fryeburg 52 15.4 0.76 3.12 4.1
## 9 Brownfi… 33 15.2 0.265 2.44 4.2
## 10 Stockto… 56 14.3 0.6 2.84 3.3
## # … with 331 more rows, and 5 more variables: n_wells_tested.y <dbl>,
## # percent_wells_above_guideline.y <dbl>, median.y <dbl>,
## # percentile_95.y <dbl>, maximum.y <dbl>
chemicals %>% select(location, median.x, median.y) %>% top_n(10)
## Selecting by median.y
## # A tibble: 10 x 3
## location median.x median.y
## <chr> <dbl> <dbl>
## 1 Mariaville 0.4 7.2
## 2 Manchester 0.3 14
## 3 Monmouth 0.3 10
## 4 Winthrop 0.31 8.2
## 5 Readfield 0.3 7.2
## 6 Columbia 0.31 9.8
## 7 Columbia Falls 0.21 8.1
## 8 Eliot 0.2 9.7
## 9 Gorham 0.1 10.5
## 10 Hallowell 0.1 8.6
Edit this part to discuss how you selected your interesting subset
I wanted to look at the ten locations with the highest arsenic and fluoride levels. I wrote code that keeps location, median.x (fluoride), and median.y (arsenic) from the chemicals tibble I had created and passes them to top_n(10). Because top_n() ranks by the last column when no other column is given, the result is the ten locations with the highest median arsenic. In these ten locations, median fluoride stays between 0.1 and 0.4 while median arsenic ranges from 7.2 to 14. I had difficulty arranging or filtering the result so that it displays in descending order.
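One possible way to get the descending order mentioned above is `arrange()` combined with `desc()`. The sketch below is only an illustration: `top_ten` is a hypothetical tibble holding the ten rows from the `top_n(10)` output, entered by hand so the example runs on its own.

```r
library(dplyr)

# The ten rows from the top_n(10) output, entered by hand.
top_ten <- tribble(
  ~location,        ~median.x, ~median.y,
  "Mariaville",     0.4,       7.2,
  "Manchester",     0.3,       14,
  "Monmouth",       0.3,       10,
  "Winthrop",       0.31,      8.2,
  "Readfield",      0.3,       7.2,
  "Columbia",       0.31,      9.8,
  "Columbia Falls", 0.21,      8.1,
  "Eliot",          0.2,       9.7,
  "Gorham",         0.1,       10.5,
  "Hallowell",      0.1,       8.6
)

# arrange() with desc() sorts from highest to lowest median arsenic.
top_ten_sorted <- top_ten %>% arrange(desc(median.y))
top_ten_sorted

# On the full tibble, slice_max() selects and sorts in one step
# (dplyr 1.0.0 or later; ties can return more than n rows):
# chemicals %>%
#   select(location, median.x, median.y) %>%
#   slice_max(median.y, n = 10)
```

Naming the ranking column explicitly in `slice_max()` also avoids relying on whichever column happens to be last.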
Display the first few rows of your interesting subset in the code chunk below.
chemicals %>% select(location, median.x, median.y) %>% top_n(3)
## Selecting by median.y
## # A tibble: 3 x 3
## location median.x median.y
## <chr> <dbl> <dbl>
## 1 Manchester 0.3 14
## 2 Monmouth 0.3 10
## 3 Gorham 0.1 10.5
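The assignment also suggests filtering for locations that are high in both chemicals. A minimal sketch of that idea with `filter()` follows. The tibble `subset_toy` holds five rows copied from the output earlier in this document, and the cutoffs `0.3` and `9` are hypothetical values picked only for illustration; they are not guideline levels.

```r
library(dplyr)

# Five rows copied from the top_n(10) output (hand-entered toy data).
subset_toy <- tribble(
  ~location,    ~median.x, ~median.y,
  "Manchester", 0.3,       14,
  "Monmouth",   0.3,       10,
  "Columbia",   0.31,      9.8,
  "Eliot",      0.2,       9.7,
  "Gorham",     0.1,       10.5
)

# Hypothetical cutoffs, chosen only to demonstrate the technique.
fluoride_cutoff <- 0.3
arsenic_cutoff  <- 9

# Conditions separated by commas in filter() must all hold for a row to be kept.
both_high <- subset_toy %>%
  filter(median.x >= fluoride_cutoff, median.y >= arsenic_cutoff)

both_high
```

Eliot and Gorham drop out here because their median fluoride falls below the illustrative cutoff, even though their median arsenic is high.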
In the code chunk below, create a ggplot visualization of your subset that is fairly simple for a viewer to comprehend.
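As one possible starting point for this chunk, the sketch below draws a horizontal bar chart of median arsenic for the ten locations in the subset. It assumes a tibble named `top_ten` (a hypothetical name) that holds the ten rows from the `top_n(10)` output, entered by hand here so the example is self-contained. The axis labels omit units because the source data shown here does not state them.

```r
library(dplyr)
library(ggplot2)

# The ten rows from the top_n(10) output, entered by hand.
top_ten <- tribble(
  ~location,        ~median.x, ~median.y,
  "Mariaville",     0.4,       7.2,
  "Manchester",     0.3,       14,
  "Monmouth",       0.3,       10,
  "Winthrop",       0.31,      8.2,
  "Readfield",      0.3,       7.2,
  "Columbia",       0.31,      9.8,
  "Columbia Falls", 0.21,      8.1,
  "Eliot",          0.2,       9.7,
  "Gorham",         0.1,       10.5,
  "Hallowell",      0.1,       8.6
)

# reorder() sorts the bars by median arsenic; coord_flip() puts the
# location names on the vertical axis so they stay readable.
arsenic_plot <- ggplot(top_ten,
                       aes(x = reorder(location, median.y), y = median.y)) +
  geom_col() +
  coord_flip() +
  labs(x = "Location",
       y = "Median arsenic",
       title = "Ten locations with the highest median arsenic")

arsenic_plot
```

A single sorted bar chart keeps the comparison simple; median fluoride could be shown the same way by swapping `median.y` for `median.x`.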
Once you are done, knit, publish, and then submit your link to your published RPubs document in Brightspace.