The Maine Tracking Network offers a plethora of public data on Maine Environmental Public Health. The data sets provided for MBA 676 Assignment 1 were extracted from this information available on this portal.
I started the assignment by downloading the data sets from our class website. I the data into R markdown after I installed and loaded the packages necessary to generate a report.
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(pander)
Read in arsenic and fluoride as dafaframes and rename the variables.
arsenic <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
fluoride <- read.csv("fluoride.csv", header = TRUE, stringsAsFactors = FALSE)
I renamed the variables and used knitr formatting to display the tables for aesthetic purposes.
kable(head(arsenic %>% rename(Location=location, N_WellsTested_As= n_wells_tested, As_PctAboveGuideline = percent_wells_above_guideline, MedianLevelAs=median, As95Percentile=percentile_95,MaxAs=maximum)) %>% filter( N_WellsTested_As>=20))
Location | N_WellsTested_As | As_PctAboveGuideline | MedianLevelAs | As95Percentile | MaxAs |
---|---|---|---|---|---|
Manchester | 275 | 58.9 | 14.0 | 93.00 | 200 |
Gorham | 467 | 50.1 | 10.5 | 130.00 | 460 |
Columbia | 42 | 50.0 | 9.8 | 65.90 | 200 |
Monmouth | 277 | 49.5 | 10.0 | 110.00 | 368 |
Eliot | 73 | 49.3 | 9.7 | 41.35 | 45 |
Columbia Falls | 25 | 48.0 | 8.1 | 53.75 | 71 |
kable(head(fluoride %>% rename(Location=location, N_WellsTestedF= n_wells_tested, F_PctAboveGuideline = percent_wells_above_guideline, MedianLevelF=median, F95Percentile=percentile_95,MaxF=maximum)))
Location | N_WellsTestedF | F_PctAboveGuideline | MedianLevelF | F95Percentile | MaxF |
---|---|---|---|---|---|
Otis | 60 | 30.0 | 1.130 | 3.200 | 3.6 |
Dedham | 102 | 22.5 | 0.940 | 3.270 | 7.0 |
Denmark | 46 | 19.6 | 0.450 | 3.150 | 3.9 |
Surry | 175 | 18.3 | 0.800 | 3.525 | 6.9 |
Prospect | 57 | 17.5 | 0.785 | 2.500 | 2.7 |
Eastbrook | 31 | 16.1 | 1.290 | 2.445 | 3.3 |
kable(str(arsenic))
## 'data.frame': 917 obs. of 6 variables:
## $ location : chr "Manchester" "Gorham" "Columbia" "Monmouth" ...
## $ n_wells_tested : int 275 467 42 277 73 25 424 65 334 241 ...
## $ percent_wells_above_guideline: num 58.9 50.1 50 49.5 49.3 48 44.8 44.6 43.4 42.7 ...
## $ median : num 14 10.5 9.8 10 9.7 8.1 8.2 8.6 6 7 ...
## $ percentile_95 : num 93 130 65.9 110 41.4 ...
## $ maximum : num 200 460 200 368 45 71 240 431 670 930 ...
Although this is redundant, I created new dataframes with the renamed Variables.
arsenic1 <- arsenic %>% rename(Location=location, N_WellsTested_As= n_wells_tested, As_PctExceeds = percent_wells_above_guideline, MedianLevelAs=median, As95Percentile=percentile_95,MaxAs=maximum) %>% filter(N_WellsTested_As>=20)
fluoride1 <- fluoride %>% rename(Location=location, N_WellsTestedF= n_wells_tested, F_PctExceeds= percent_wells_above_guideline, MedianLevelF=median, F95Percentile=percentile_95,MaxF=maximum) %>% filter(N_WellsTestedF>=20)
Arsenic data renamed variables
kable(head(arsenic1))
Location | N_WellsTested_As | As_PctExceeds | MedianLevelAs | As95Percentile | MaxAs |
---|---|---|---|---|---|
Manchester | 275 | 58.9 | 14.0 | 93.00 | 200 |
Gorham | 467 | 50.1 | 10.5 | 130.00 | 460 |
Columbia | 42 | 50.0 | 9.8 | 65.90 | 200 |
Monmouth | 277 | 49.5 | 10.0 | 110.00 | 368 |
Eliot | 73 | 49.3 | 9.7 | 41.35 | 45 |
Columbia Falls | 25 | 48.0 | 8.1 | 53.75 | 71 |
fluoride data with renamed variables
kable(head(fluoride1))
Location | N_WellsTestedF | F_PctExceeds | MedianLevelF | F95Percentile | MaxF |
---|---|---|---|---|---|
Otis | 60 | 30.0 | 1.130 | 3.200 | 3.6 |
Dedham | 102 | 22.5 | 0.940 | 3.270 | 7.0 |
Denmark | 46 | 19.6 | 0.450 | 3.150 | 3.9 |
Surry | 175 | 18.3 | 0.800 | 3.525 | 6.9 |
Prospect | 57 | 17.5 | 0.785 | 2.500 | 2.7 |
Eastbrook | 31 | 16.1 | 1.290 | 2.445 | 3.3 |
For space sake, I only ran the head() function. Although only 6 rows are available for view, with the filter function, I was able to eliminate the NA values by limiting water tests to 20 and above for the towns. I learned this from a classmate’s method when viewing on RPubs.
Inner Join to combine the two dataframes into a new dataframe called AsF. The entire dataset includes well water testing information on arsenic and fluoride levels for over 300 towns in Maine. To verify that the tables joined correctly, I inspected the first few rows and ran a summary on the variables.
AsF <- inner_join(arsenic1,fluoride1)
## Joining, by = "Location"
These tables offer a quick glance of the AsF table join.
kable(summary(AsF))
Location | N_WellsTested_As | As_PctExceeds | MedianLevelAs | As95Percentile | MaxAs | N_WellsTestedF | F_PctExceeds | MedianLevelF | F95Percentile | MaxF | |
---|---|---|---|---|---|---|---|---|---|---|---|
Length:341 | Min. : 20.00 | Min. : 0.00 | Min. : 0.25 | Min. : 0.50 | Min. : 1.00 | Min. : 21.00 | Min. : 0.000 | Min. :0.1000 | Min. :0.100 | Min. : 0.100 | |
Class :character | 1st Qu.: 36.00 | 1st Qu.: 3.20 | 1st Qu.: 0.50 | 1st Qu.: 6.22 | 1st Qu.: 18.00 | 1st Qu.: 43.00 | 1st Qu.: 0.000 | 1st Qu.:0.1000 | 1st Qu.:0.522 | 1st Qu.: 1.200 | |
Mode :character | Median : 57.00 | Median : 8.30 | Median : 1.00 | Median : 13.55 | Median : 39.00 | Median : 69.00 | Median : 0.700 | Median :0.1000 | Median :1.003 | Median : 2.200 | |
NA | Mean : 86.69 | Mean :12.37 | Mean : 1.62 | Mean : 24.87 | Mean : 95.03 | Mean : 97.21 | Mean : 2.465 | Mean :0.1765 | Mean :1.153 | Mean : 2.546 | |
NA | 3rd Qu.:109.00 | 3rd Qu.:18.30 | 3rd Qu.: 1.90 | 3rd Qu.: 28.20 | 3rd Qu.: 92.00 | 3rd Qu.:126.00 | 3rd Qu.: 3.100 | 3rd Qu.:0.2000 | 3rd Qu.:1.601 | 3rd Qu.: 3.400 | |
NA | Max. :632.00 | Max. :58.90 | Max. :14.00 | Max. :372.50 | Max. :3100.00 | Max. :503.00 | Max. :30.000 | Max. :1.2900 | Max. :4.180 | Max. :14.000 |
kable(head(AsF))
Location | N_WellsTested_As | As_PctExceeds | MedianLevelAs | As95Percentile | MaxAs | N_WellsTestedF | F_PctExceeds | MedianLevelF | F95Percentile | MaxF |
---|---|---|---|---|---|---|---|---|---|---|
Manchester | 275 | 58.9 | 14.0 | 93.00 | 200 | 276 | 3.3 | 0.30 | 1.700 | 3.60 |
Gorham | 467 | 50.1 | 10.5 | 130.00 | 460 | 452 | 0.0 | 0.10 | 0.682 | 2.00 |
Columbia | 42 | 50.0 | 9.8 | 65.90 | 200 | 54 | 1.9 | 0.31 | 1.329 | 4.30 |
Monmouth | 277 | 49.5 | 10.0 | 110.00 | 368 | 288 | 3.1 | 0.30 | 1.676 | 3.40 |
Eliot | 73 | 49.3 | 9.7 | 41.35 | 45 | 84 | 0.0 | 0.20 | 0.658 | 1.54 |
Columbia Falls | 25 | 48.0 | 8.1 | 53.75 | 71 | 38 | 0.0 | 0.21 | 0.641 | 0.90 |
str(AsF)
## 'data.frame': 341 obs. of 11 variables:
## $ Location : chr "Manchester" "Gorham" "Columbia" "Monmouth" ...
## $ N_WellsTested_As: int 275 467 42 277 73 25 424 65 334 241 ...
## $ As_PctExceeds : num 58.9 50.1 50 49.5 49.3 48 44.8 44.6 43.4 42.7 ...
## $ MedianLevelAs : num 14 10.5 9.8 10 9.7 8.1 8.2 8.6 6 7 ...
## $ As95Percentile : num 93 130 65.9 110 41.4 ...
## $ MaxAs : num 200 460 200 368 45 71 240 431 670 930 ...
## $ N_WellsTestedF : int 276 452 54 288 84 38 453 59 383 209 ...
## $ F_PctExceeds : num 3.3 0 1.9 3.1 0 0 3.1 0 1 9.6 ...
## $ MedianLevelF : num 0.3 0.1 0.31 0.3 0.2 0.21 0.31 0.1 0.1 0.43 ...
## $ F95Percentile : num 1.7 0.682 1.329 1.676 0.658 ...
## $ MaxF : num 3.6 2 4.3 3.4 1.54 0.9 3.7 1.63 3.2 4.5 ...
I wanted to compare the percentage of wells tested that have excess levels of arsenic and fluoride in the water samples. I did this by using the subset function.
Subsets1<- subset(AsF, select = c("As_PctExceeds", "F_PctExceeds"))
summary(Subsets1)
## As_PctExceeds F_PctExceeds
## Min. : 0.00 Min. : 0.000
## 1st Qu.: 3.20 1st Qu.: 0.000
## Median : 8.30 Median : 0.700
## Mean :12.37 Mean : 2.465
## 3rd Qu.:18.30 3rd Qu.: 3.100
## Max. :58.90 Max. :30.000
I also wanted to look for any correlations in the data findings.
cor(Subsets1)
## As_PctExceeds F_PctExceeds
## As_PctExceeds 1.0000000 0.1215215
## F_PctExceeds 0.1215215 1.0000000
I obviously coded it wrong or there is no correlation.
I also plotted the subset.
plot(Subsets1)
To calculate the number of wells in ME that have been tested for arsenic…
kable(sum(AsF$N_WellsTested_As))
29560 |
To calculate the number of wells in ME that have been tested for fluoride…
kable(sum(AsF$N_WellsTestedF))
33148 |
Between 1999-2013, there were more well water samples tested for fluoride.
I also had a little fun looking for correlations with the cor().Note the correlation between As testing and excess As findings is much stronger than that of the fluoride relationship. This is no doubt due to the fact that there is very little naturally occurring fluoride in groundwater.
pander(cor(AsF$N_WellsTested_As, AsF$As_PctExceeds))
0.3575
pander(cor(AsF$N_WellsTestedF, AsF$F_PctExceeds))
0.02335
Population Table
I wanted to bring in another data set that I could join with the water sample data. I found town population data and converted it into to csv from this webpage
Pop <- read.csv("NewPopTowns.csv", header = TRUE, stringsAsFactors = FALSE)
Population Table Variable Names
kable(names(Pop))
City |
Population |
Rename Pop so there’s a common variable name for a new join
PopTable <- Pop %>% rename(Location = City)
kable(str(PopTable))
## 'data.frame': 452 obs. of 2 variables:
## $ Location : chr "Portland" "Lewiston" "Bangor" "South Portland" ...
## $ Population: int 66881 36202 32391 25556 22871 21282 20893 20495 19691 19078 ...
Inner Join
PopAsF <- inner_join(AsF, PopTable) %>% select(Location,Population,N_WellsTested_As,As_PctExceeds,N_WellsTestedF, F_PctExceeds, MedianLevelAs,MedianLevelF)
## Joining, by = "Location"
The joined table showing the population of the towns in Maine and the arsenic and fluoride information from their well water.
kable(head(PopAsF))
Location | Population | N_WellsTested_As | As_PctExceeds | N_WellsTestedF | F_PctExceeds | MedianLevelAs | MedianLevelF |
---|---|---|---|---|---|---|---|
Manchester | 2528 | 275 | 58.9 | 276 | 3.3 | 14.0 | 0.30 |
Gorham | 17181 | 467 | 50.1 | 452 | 0.0 | 10.5 | 0.10 |
Columbia | 466 | 42 | 50.0 | 54 | 1.9 | 9.8 | 0.31 |
Monmouth | 4025 | 277 | 49.5 | 288 | 3.1 | 10.0 | 0.30 |
Eliot | 6335 | 73 | 49.3 | 84 | 0.0 | 9.7 | 0.20 |
Winthrop | 5921 | 424 | 44.8 | 453 | 3.1 | 8.2 | 0.31 |
I previously ran into trouble because some variables were classified as “chr” when they should have been “num” or “int” so I ran the str() function to verify they were classed correctly.
kable(str(PopAsF))
## 'data.frame': 299 obs. of 8 variables:
## $ Location : chr "Manchester" "Gorham" "Columbia" "Monmouth" ...
## $ Population : int 2528 17181 466 4025 6335 5921 2315 8149 2664 4454 ...
## $ N_WellsTested_As: int 275 467 42 277 73 424 65 334 241 133 ...
## $ As_PctExceeds : num 58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
## $ N_WellsTestedF : int 276 452 54 288 84 453 59 383 209 143 ...
## $ F_PctExceeds : num 3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...
## $ MedianLevelAs : num 14 10.5 9.8 10 9.7 8.2 8.6 6 7 5.25 ...
## $ MedianLevelF : num 0.3 0.1 0.31 0.3 0.2 0.31 0.1 0.1 0.43 0.2 ...
The percentage of wells that contain water with arsenic exceeding guidelines and their median arsenic levels.
This table also makes note if the median level of arsenic in each town exceeds the guidelines. To do this, I used the mutate function. I limited the table to 20 towns to save space.
The majority of ME towns did not have median levels of As that were in excess of the guidelines from the sampling between 1999-2013.
kable(PopAsF %>% mutate(AsAboveG.L.= MedianLevelAs >= 10) %>% select(Location, Population, As_PctExceeds, MedianLevelAs,AsAboveG.L.) %>% arrange(desc(MedianLevelAs)) %>% top_n(20,MedianLevelAs),digits=1)
Location | Population | As_PctExceeds | MedianLevelAs | AsAboveG.L. |
---|---|---|---|---|
Manchester | 2528 | 58.9 | 14.0 | TRUE |
Gorham | 17181 | 50.1 | 10.5 | TRUE |
Monmouth | 4025 | 49.5 | 10.0 | TRUE |
Columbia | 466 | 50.0 | 9.8 | FALSE |
Eliot | 6335 | 49.3 | 9.7 | FALSE |
Hallowell | 2315 | 44.6 | 8.6 | FALSE |
Winthrop | 5921 | 44.8 | 8.2 | FALSE |
Mariaville | 524 | 40.0 | 7.2 | FALSE |
Readfield | 2520 | 39.8 | 7.2 | FALSE |
Blue Hill | 2664 | 42.7 | 7.0 | FALSE |
Millinocket | 4359 | 31.0 | 6.8 | FALSE |
Buxton | 8149 | 43.4 | 6.0 | FALSE |
Surry | 1459 | 40.3 | 6.0 | FALSE |
Rome | 991 | 26.6 | 5.5 | FALSE |
Orland | 2205 | 40.7 | 5.4 | FALSE |
Hollis | 4454 | 41.4 | 5.2 | FALSE |
Belgrade | 3115 | 31.2 | 5.2 | FALSE |
Scarborough | 19691 | 35.2 | 5.2 | FALSE |
Vassalboro | 4247 | 26.3 | 5.2 | FALSE |
Oakland | 6138 | 33.0 | 5.0 | FALSE |
Therapeutic fluoride
**According to the American Academy of Pediatric Dentists, “since fluoride from water supplies is now one of several sources of fluoride, the Department of Health and Human Services recently has proposed to not have a fluoride range, but rather to limit the recommendation to the lower limit of 0.7 ppm F.”
Using these parameters above, Maine well water does not offer dental cavity protection for the most part; this is illustrated in the table below. Again, I’ve limited the data to 20 rows to save space. The vast majority of samples show that excess flouoride is not a problem, in fact, there in insufficient fluoride to offer dental cavity prophylaxis. In the column SubTherapeuticF, FALSE indicates that the median level in meets the AAPD recommendations.
I have also run the str() function to verify the data is classified correctly.
TherapeuticF <- PopAsF %>% mutate(SubTherapeuticF = MedianLevelF<=0.7) %>% select(Location, Population, N_WellsTestedF, SubTherapeuticF, MedianLevelF) %>% arrange(desc(MedianLevelF)) %>% top_n(20, MedianLevelF)
str(TherapeuticF)
## 'data.frame': 20 obs. of 5 variables:
## $ Location : chr "Eastbrook" "Otis" "Marshfield" "Dedham" ...
## $ Population : int 421 668 508 1683 1459 3395 991 1608 7857 4359 ...
## $ N_WellsTestedF : int 31 60 31 102 175 52 82 56 503 29 ...
## $ SubTherapeuticF: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ MedianLevelF : num 1.29 1.13 1 0.94 0.8 0.76 0.6 0.6 0.5 0.495 ...
kable(TherapeuticF)
Location | Population | N_WellsTestedF | SubTherapeuticF | MedianLevelF |
---|---|---|---|---|
Eastbrook | 421 | 31 | FALSE | 1.290 |
Otis | 668 | 60 | FALSE | 1.130 |
Marshfield | 508 | 31 | FALSE | 1.000 |
Dedham | 1683 | 102 | FALSE | 0.940 |
Surry | 1459 | 175 | FALSE | 0.800 |
Fryeburg | 3395 | 52 | FALSE | 0.760 |
Rome | 991 | 82 | TRUE | 0.600 |
Stockton Springs | 1608 | 56 | TRUE | 0.600 |
Ellsworth | 7857 | 503 | TRUE | 0.500 |
Millinocket | 4359 | 29 | TRUE | 0.495 |
Stoneham | 236 | 25 | TRUE | 0.490 |
Smithfield | 999 | 79 | TRUE | 0.450 |
Denmark | 1143 | 46 | TRUE | 0.450 |
Casco | 3880 | 119 | TRUE | 0.445 |
Chesterville | 1330 | 81 | TRUE | 0.440 |
Otisfield | 1762 | 62 | TRUE | 0.440 |
Lovell | 1133 | 43 | TRUE | 0.435 |
Blue Hill | 2664 | 209 | TRUE | 0.430 |
Sedgwick | 1182 | 143 | TRUE | 0.425 |
Porter | 1490 | 41 | TRUE | 0.425 |
Rather than sort through the whole table to determine if any towns’ wells have median fluoride levels in excess of 2mg/ml, the maximum exposure guideline, R can do a quick calculation. The ‘NA’ below means no values were returned.
I’ve loaded the library(pander) function for formatting. Like kable(), it offers a cleaner presentation. For some reason, kable() wouldn’t work for this code.
ExcessF <- sum(PopAsF$MedianLevelF>=2)
pander(sum(ExcessF>=2))
0
I thought I would look to see if there is a correlation between the towns that have the most water testing per capita and findings of excessive levels of either compound. To do this, I calculated water tests per capita, using the higher of the two N_WellsTested. I did this in two steps, first calculating the higher test figure to use as a denominator for ‘tests per capita’.
MaxWells <- PopAsF %>% mutate(MaxWell = pmax(N_WellsTested_As,N_WellsTestedF)) %>% select(Location, Population, MaxWell, As_PctExceeds, F_PctExceeds)
kable(head(MaxWells))
Location | Population | MaxWell | As_PctExceeds | F_PctExceeds |
---|---|---|---|---|
Manchester | 2528 | 276 | 58.9 | 3.3 |
Gorham | 17181 | 467 | 50.1 | 0.0 |
Columbia | 466 | 54 | 50.0 | 1.9 |
Monmouth | 4025 | 288 | 49.5 | 3.1 |
Eliot | 6335 | 84 | 49.3 | 0.0 |
Winthrop | 5921 | 453 | 44.8 | 3.1 |
And just to make sure the new MaxWells variable is classified as an ‘int’…
kable(str(MaxWells))
## 'data.frame': 299 obs. of 5 variables:
## $ Location : chr "Manchester" "Gorham" "Columbia" "Monmouth" ...
## $ Population : int 2528 17181 466 4025 6335 5921 2315 8149 2664 4454 ...
## $ MaxWell : int 276 467 54 288 84 453 65 383 241 143 ...
## $ As_PctExceeds: num 58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
## $ F_PctExceeds : num 3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...
Next, I create a new dataframe with the MaxWells variable to get the tests per capita. Again, I use the mutate function. I know I could have done this all in one code instead of two, but I kept getting error codes. I do it successfully later.
MaxWells1 <- MaxWells %>% mutate(TestsPerPop = Population/MaxWell) %>% select(Location, TestsPerPop, As_PctExceeds, F_PctExceeds) %>% arrange(desc(As_PctExceeds))
kable(head(MaxWells1, digits = 2))
Location | TestsPerPop | As_PctExceeds | F_PctExceeds |
---|---|---|---|
Manchester | 9.15942 | 58.9 | 3.3 |
Gorham | 36.79015 | 50.1 | 0.0 |
Columbia | 8.62963 | 50.0 | 1.9 |
Monmouth | 13.97569 | 49.5 | 3.1 |
Eliot | 75.41667 | 49.3 | 0.0 |
Winthrop | 13.07064 | 44.8 | 3.1 |
str(MaxWells1)
## 'data.frame': 299 obs. of 4 variables:
## $ Location : chr "Manchester" "Gorham" "Columbia" "Monmouth" ...
## $ TestsPerPop : num 9.16 36.79 8.63 13.98 75.42 ...
## $ As_PctExceeds: num 58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
## $ F_PctExceeds : num 3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...
Plot looking for a correlation between tests per capita and the percentage of wells that have excessive arsenic
library(ggvis)
MaxWells1 %>% ggvis(~TestsPerPop, ~As_PctExceeds ) %>% layer_points(size := 30, shape := "circle", stroke := "blue", fill := NA)
Is there a correlation between tests performed per capita and elevated arsenic levels?
MaxWells1.lm = lm(As_PctExceeds ~TestsPerPop, data = MaxWells1)
pander(summary(MaxWells1.lm)$r.squared)
8.727e-07
Out of desperation, I reversed the X, Y variables, for naught.
MaxWells1.lm = lm(TestsPerPop ~ As_PctExceeds, data = MaxWells1)
pander(summary(MaxWells1.lm)$r.squared)
8.727e-07
pander(cor(MaxWells1$TestsPerPop, MaxWells1$As_PctExceeds))
-0.0009342
Initially, I thought the problem here was that that I couldnt’t exclude the NA values or that I need to set values for the x-axis. But then I was able to filter the NA values out. The problem could also be that the TestsPerPop is a calculated field, and I’ve yet to learn to deal with them. I’m also not sure if I was able to get the regression summary to work.
One final set I decided to generate, after the deadline, was inspired by a classmate’s work. She created a clean water table. I decided to see where the best well water is; my definition of best is low arsenic and a therapeutic level of fluoride.
#Create a variable for Maine's Population
MEPop <- 1329000
kable(MEPop)
1329000 |
#filter safe arsenic water
Good_WaterAs <- PopAsF %>% filter(MedianLevelAs<10) %>% select(Location, Population, MedianLevelAs)
Good_WaterF <- PopAsF %>% filter(MedianLevelF<=1.2) %>% select(Location, Population, MedianLevelF)
#get rid of too high
#get rid of too low fluoride
BetterF <- Good_WaterF %>% filter(MedianLevelF>=.7) %>% select (Location, Population, MedianLevelF)
#This is theraputic
#Join tables to get optimal levels of both
BestH2O <- inner_join(BetterF,Good_WaterAs) %>% mutate(PctOfState=Population/MEPop*100) %>% select(Location, Population, PctOfState, MedianLevelF, MedianLevelAs)
## Joining, by = c("Location", "Population")
Based on the data given to us, these 5 towns offer a good balanced well water.
#The code above is how to change font color in R Markdown.
kable(BestH2O)
Location | Population | PctOfState | MedianLevelF | MedianLevelAs |
---|---|---|---|---|
Surry | 1459 | 0.1097818 | 0.80 | 6.0 |
Otis | 668 | 0.0502634 | 1.13 | 4.8 |
Dedham | 1683 | 0.1266366 | 0.94 | 1.0 |
Marshfield | 508 | 0.0382242 | 1.00 | 1.0 |
Fryeburg | 3395 | 0.2554552 | 0.76 | 0.5 |
I decided to try another route and explore if there were more dental practices in the counties where there is a higher population of patients who don’t get adequate fluoridationin their water. I am ignoring whether patients use fluoride supplements and any dental health trends that might be present. My route is circuitous because this is a learning experience.
What percent of the homes rely on private wells for their water by county?
To help explore if the dentists where the fluoride is not, I use household size from the latest Census data.. Wells use by percentage is given for households. The Census gives mean household size by county.
WellsInCounties <- read.csv("HomesWithWellsbyPercent.csv", header=TRUE, stringsAsFactors=FALSE)
head(WellsInCounties)
## County PctOfHomeWells Population CapitaHousehold
## 1 Androscoggin 0.44 107233 2.35
## 2 Aroostook 0.59 68628 2.24
## 3 Cumberland 0.36 289977 2.35
## 4 Franklin 0.61 29991 2.46
## 5 Hancock 0.76 54659 2.18
## 6 Kennebec 0.56 119980 2.30
str(WellsInCounties)
## 'data.frame': 17 obs. of 4 variables:
## $ County : chr "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
## $ PctOfHomeWells : num 0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
## $ Population : int 107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
## $ CapitaHousehold: num 2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...
fluoride Levels by County. The State of Maine Health and Environmental Testing Laboratory provided these data. 1999-2013,
This information is much the same as that reviewed earlier, however now it is by county, not town. This is because I could only find data on practicing Dentists by county.
Countyfluoride <- read.csv("fluorideByCounty.csv", header = TRUE, stringsAsFactors = FALSE)
kable(str(Countyfluoride))
## 'data.frame': 37 obs. of 6 variables:
## $ County : chr "Hancock" "Oxford" "Franklin" "York" ...
## $ N_WellsTestedF: int 3744 1611 1087 2642 1291 5890 1843 1569 4297 1745 ...
## $ F_PctExceeds : num 0.07 0.04 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.02 ...
## $ MedianLevelF : num 0.28 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 ...
## $ F95Percentile : num 0.02 0.02 0.02 0.02 0.01 0.02 0.01 0.01 0.01 0.01 ...
## $ MaxF : num 9.1 9.9 5.6 9.6 14 6.8 5.3 5.4 9.1 4.1 ...
kable(summary(Countyfluoride))
County | N_WellsTestedF | F_PctExceeds | MedianLevelF | F95Percentile | MaxF | |
---|---|---|---|---|---|---|
Length:37 | Min. : 581 | Min. :0.0000 | Min. :0.1000 | Min. :0.0000 | Min. : 1.600 | |
Class :character | 1st Qu.:1206 | 1st Qu.:0.0100 | 1st Qu.:0.1000 | 1st Qu.:0.0100 | 1st Qu.: 4.000 | |
Mode :character | Median :1678 | Median :0.0200 | Median :0.1000 | Median :0.0100 | Median : 5.500 | |
NA | Mean :2190 | Mean :0.0225 | Mean :0.1175 | Mean :0.0125 | Mean : 6.594 | |
NA | 3rd Qu.:2832 | 3rd Qu.:0.0300 | 3rd Qu.:0.1000 | 3rd Qu.:0.0200 | 3rd Qu.: 9.225 | |
NA | Max. :5890 | Max. :0.0700 | Max. :0.2800 | Max. :0.0200 | Max. :14.000 | |
NA | NA’s :21 | NA’s :21 | NA’s :21 | NA’s :21 | NA’s :21 |
The total number of well water tests done to evaluate fluoride levels between 1999-2013 was…
library(pander)
pander(sum(Countyfluoride$N_WellsTestedF, na.rm=TRUE))
35044
I used the pander() to get nicer formatting for the output.
Practicing Dentists Actively Practicing Dentists to 10,000 Population in Maine by County, November, 2011
CountyDentists <- read.csv("PracticingDentists.csv", header = TRUE, stringsAsFactors = FALSE)
names(CountyDentists)
## [1] "County" "DentistPer10K"
str(CountyDentists)
## 'data.frame': 16 obs. of 2 variables:
## $ County : chr "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
## $ DentistPer10K: num 4.18 3.34 7.35 3.9 4.59 6.47 6.29 3.48 2.25 5.33 ...
I do an inner join of the table that shows what percent of homes have wells (by county) and the fluoride data by county.
fluorideCombo <- inner_join(WellsInCounties, Countyfluoride)
## Joining, by = "County"
str(fluorideCombo)
## 'data.frame': 16 obs. of 9 variables:
## $ County : chr "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
## $ PctOfHomeWells : num 0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
## $ Population : int 107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
## $ CapitaHousehold: num 2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...
## $ N_WellsTestedF : int 1569 1048 4297 1087 3744 5890 940 2106 1611 3404 ...
## $ F_PctExceeds : num 0.02 0.01 0.02 0.03 0.07 0.03 0.01 0.01 0.04 0.01 ...
## $ MedianLevelF : num 0.1 0.1 0.1 0.1 0.28 0.2 0.1 0.1 0.1 0.1 ...
## $ F95Percentile : num 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.01 0.02 0.01 ...
## $ MaxF : num 5.4 10 9.1 5.6 9.1 6.8 3.2 4.8 9.9 3.7 ...
Join Dentists by county data to the new county fluoride dataset.
DentistAndfluoride <- inner_join(fluorideCombo, CountyDentists)
## Joining, by = "County"
head(DentistAndfluoride)
## County PctOfHomeWells Population CapitaHousehold N_WellsTestedF
## 1 Androscoggin 0.44 107233 2.35 1569
## 2 Aroostook 0.59 68628 2.24 1048
## 3 Cumberland 0.36 289977 2.35 4297
## 4 Franklin 0.61 29991 2.46 1087
## 5 Hancock 0.76 54659 2.18 3744
## 6 Kennebec 0.56 119980 2.30 5890
## F_PctExceeds MedianLevelF F95Percentile MaxF DentistPer10K
## 1 0.02 0.10 0.01 5.4 4.18
## 2 0.01 0.10 0.01 10.0 3.34
## 3 0.02 0.10 0.01 9.1 7.35
## 4 0.03 0.10 0.02 5.6 3.90
## 5 0.07 0.28 0.02 9.1 4.59
## 6 0.03 0.20 0.02 6.8 6.47
str(DentistAndfluoride)
## 'data.frame': 16 obs. of 10 variables:
## $ County : chr "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
## $ PctOfHomeWells : num 0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
## $ Population : int 107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
## $ CapitaHousehold: num 2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...
## $ N_WellsTestedF : int 1569 1048 4297 1087 3744 5890 940 2106 1611 3404 ...
## $ F_PctExceeds : num 0.02 0.01 0.02 0.03 0.07 0.03 0.01 0.01 0.04 0.01 ...
## $ MedianLevelF : num 0.1 0.1 0.1 0.1 0.28 0.2 0.1 0.1 0.1 0.1 ...
## $ F95Percentile : num 0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.01 0.02 0.01 ...
## $ MaxF : num 5.4 10 9.1 5.6 9.1 6.8 3.2 4.8 9.9 3.7 ...
## $ DentistPer10K : num 4.18 3.34 7.35 3.9 4.59 6.47 6.29 3.48 2.25 5.33 ...
kable(DentistAndfluoride)
County | PctOfHomeWells | Population | CapitaHousehold | N_WellsTestedF | F_PctExceeds | MedianLevelF | F95Percentile | MaxF | DentistPer10K |
---|---|---|---|---|---|---|---|---|---|
Androscoggin | 0.44 | 107233 | 2.35 | 1569 | 0.02 | 0.10 | 0.01 | 5.4 | 4.18 |
Aroostook | 0.59 | 68628 | 2.24 | 1048 | 0.01 | 0.10 | 0.01 | 10.0 | 3.34 |
Cumberland | 0.36 | 289977 | 2.35 | 4297 | 0.02 | 0.10 | 0.01 | 9.1 | 7.35 |
Franklin | 0.61 | 29991 | 2.46 | 1087 | 0.03 | 0.10 | 0.02 | 5.6 | 3.90 |
Hancock | 0.76 | 54659 | 2.18 | 3744 | 0.07 | 0.28 | 0.02 | 9.1 | 4.59 |
Kennebec | 0.56 | 119980 | 2.30 | 5890 | 0.03 | 0.20 | 0.02 | 6.8 | 6.47 |
Knox | 0.56 | 39855 | 2.26 | 940 | 0.01 | 0.10 | 0.01 | 3.2 | 6.29 |
Lincoln | 0.82 | 33969 | 2.26 | 2106 | 0.01 | 0.10 | 0.01 | 4.8 | 3.48 |
Oxford | 0.65 | 57202 | 2.53 | 1611 | 0.04 | 0.10 | 0.02 | 9.9 | 2.25 |
Penobscot | 0.54 | 152692 | 2.36 | 3404 | 0.01 | 0.10 | 0.01 | 3.7 | 5.33 |
Piscataquis | 0.73 | 16931 | 2.22 | 581 | 0.00 | 0.10 | 0.00 | 1.6 | 3.42 |
Sagadahoc | 0.58 | 35149 | 2.31 | 1246 | 0.01 | 0.10 | 0.01 | 3.3 | 6.80 |
Somerset | 0.72 | 51113 | 2.37 | 1291 | 0.03 | 0.10 | 0.01 | 14.0 | 2.49 |
Waldo | 0.78 | 39155 | 2.31 | 1745 | 0.02 | 0.10 | 0.01 | 4.1 | 2.58 |
Washington | 0.75 | 31625 | 2.20 | 1843 | 0.02 | 0.10 | 0.01 | 5.3 | 3.96 |
York | 0.50 | 201169 | 2.40 | 2642 | 0.03 | 0.10 | 0.02 | 9.6 | 3.96 |
To calculate the number of people with subTherapeutic fluoride:
kable(DentistAndfluoride %>% mutate(WellDrinkers=Population/CapitaHousehold*PctOfHomeWells) %>% mutate(SubTherapeutic=MedianLevelF<=0.7) %>% select(County, WellDrinkers,SubTherapeutic))
County | WellDrinkers | SubTherapeutic |
---|---|---|
Androscoggin | 20077.668 | TRUE |
Aroostook | 18076.125 | TRUE |
Cumberland | 44422.009 | TRUE |
Franklin | 7436.793 | TRUE |
Hancock | 19055.431 | TRUE |
Kennebec | 29212.522 | TRUE |
Knox | 9875.575 | TRUE |
Lincoln | 12325.035 | TRUE |
Oxford | 14696.166 | TRUE |
Penobscot | 34938.000 | TRUE |
Piscataquis | 5567.401 | TRUE |
Sagadahoc | 8825.290 | TRUE |
Somerset | 15528.000 | TRUE |
Waldo | 13221.169 | TRUE |
Washington | 10781.250 | TRUE |
York | 41910.208 | TRUE |
Dentists per 10,000 population and number of people who may not get enough fluoride. Is there a correlation? Do the Dentists go to where there might be needy teeth?
Chart <- DentistAndfluoride %>% mutate(WellDrinkers=Population/CapitaHousehold*PctOfHomeWells) %>% mutate(Wellper10K=WellDrinkers/10000) %>% mutate(Ratio=Wellper10K/DentistPer10K) %>% select(County, WellDrinkers, Wellper10K,DentistPer10K,Ratio)
kable(Chart, digits = 2)
County | WellDrinkers | Wellper10K | DentistPer10K | Ratio |
---|---|---|---|---|
Androscoggin | 20077.67 | 2.01 | 4.18 | 0.48 |
Aroostook | 18076.12 | 1.81 | 3.34 | 0.54 |
Cumberland | 44422.01 | 4.44 | 7.35 | 0.60 |
Franklin | 7436.79 | 0.74 | 3.90 | 0.19 |
Hancock | 19055.43 | 1.91 | 4.59 | 0.42 |
Kennebec | 29212.52 | 2.92 | 6.47 | 0.45 |
Knox | 9875.58 | 0.99 | 6.29 | 0.16 |
Lincoln | 12325.04 | 1.23 | 3.48 | 0.35 |
Oxford | 14696.17 | 1.47 | 2.25 | 0.65 |
Penobscot | 34938.00 | 3.49 | 5.33 | 0.66 |
Piscataquis | 5567.40 | 0.56 | 3.42 | 0.16 |
Sagadahoc | 8825.29 | 0.88 | 6.80 | 0.13 |
Somerset | 15528.00 | 1.55 | 2.49 | 0.62 |
Waldo | 13221.17 | 1.32 | 2.58 | 0.51 |
Washington | 10781.25 | 1.08 | 3.96 | 0.27 |
York | 41910.21 | 4.19 | 3.96 | 1.06 |
Plot There does not appear to be a pattern between potential fluoride deprivation and Dental practices. The visualization showns the data points for the number of wellwater drinkers who are likely not getting adequate fluoride from it and the number of County dentists, More non-flouridated drinkers does not bring more dentists. This linear model illustrates a weak correlation.
library(ggvis)
Chart %>% ggvis(~WellDrinkers,~DentistPer10K) %>% layer_points(size := 50, shape := "circle", stroke := "blue") %>% layer_model_predictions(model = "lm", se = TRUE)
## Guessing formula = DentistPer10K ~ WellDrinkers
Linear regression and a low R-Sq. will confirm this. Below, I try three different codes to calculate the R.Sq.
Chart.lm = lm(WellDrinkers ~ DentistPer10K, data=Chart)
pander(summary(Chart.lm)$r.squared)
0.1561
pander(cor(Chart$WellDrinkers, Chart$DentistPer10K))
0.3951
Patients <- Chart$WellDrinkers
Dentist <- Chart$DentistPer10K
pander(cor(Patients, Dentist))
0.3951
Two for Three.
Certainly, there is more that could be brought in for more sophisticated analysis. I just have to learn the code.
Lastly, I want to save the joined table I made.
write.csv(DentistAndfluoride,"DentistAndfluorideAndPop.csv",row.names = TRUE)