Assignment 1, MBA 676

The Beginning:

The Maine Tracking Network offers a plethora of public data on Maine Environmental Public Health. The data sets provided for MBA 676 Assignment 1 were extracted from this information available on this portal.

I started the assignment by downloading the data sets from our class website. I the data into R markdown after I installed and loaded the packages necessary to generate a report.

library(knitr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(pander)

Read in arsenic and fluoride as dafaframes and rename the variables.

arsenic <- read.csv("arsenic.csv", header = TRUE, stringsAsFactors = FALSE)
fluoride <- read.csv("fluoride.csv", header = TRUE, stringsAsFactors = FALSE)

I renamed the variables and used knitr formatting to display the tables for aesthetic purposes.

kable(head(arsenic %>% rename(Location=location, N_WellsTested_As= n_wells_tested, As_PctAboveGuideline = percent_wells_above_guideline, MedianLevelAs=median, As95Percentile=percentile_95,MaxAs=maximum)) %>% filter( N_WellsTested_As>=20))

Location	N_WellsTested_As	As_PctAboveGuideline	MedianLevelAs	As95Percentile	MaxAs
Manchester	275	58.9	14.0	93.00	200
Gorham	467	50.1	10.5	130.00	460
Columbia	42	50.0	9.8	65.90	200
Monmouth	277	49.5	10.0	110.00	368
Eliot	73	49.3	9.7	41.35	45
Columbia Falls	25	48.0	8.1	53.75	71

kable(head(fluoride %>% rename(Location=location, N_WellsTestedF= n_wells_tested, F_PctAboveGuideline = percent_wells_above_guideline, MedianLevelF=median, F95Percentile=percentile_95,MaxF=maximum)))

Location	N_WellsTestedF	F_PctAboveGuideline	MedianLevelF	F95Percentile	MaxF
Otis	60	30.0	1.130	3.200	3.6
Dedham	102	22.5	0.940	3.270	7.0
Denmark	46	19.6	0.450	3.150	3.9
Surry	175	18.3	0.800	3.525	6.9
Prospect	57	17.5	0.785	2.500	2.7
Eastbrook	31	16.1	1.290	2.445	3.3

kable(str(arsenic))

## 'data.frame':    917 obs. of  6 variables:
##  $ location                     : chr  "Manchester" "Gorham" "Columbia" "Monmouth" ...
##  $ n_wells_tested               : int  275 467 42 277 73 25 424 65 334 241 ...
##  $ percent_wells_above_guideline: num  58.9 50.1 50 49.5 49.3 48 44.8 44.6 43.4 42.7 ...
##  $ median                       : num  14 10.5 9.8 10 9.7 8.1 8.2 8.6 6 7 ...
##  $ percentile_95                : num  93 130 65.9 110 41.4 ...
##  $ maximum                      : num  200 460 200 368 45 71 240 431 670 930 ...

Although this is redundant, I created new dataframes with the renamed Variables.

arsenic1 <- arsenic %>% rename(Location=location, N_WellsTested_As= n_wells_tested, As_PctExceeds = percent_wells_above_guideline, MedianLevelAs=median, As95Percentile=percentile_95,MaxAs=maximum) %>% filter(N_WellsTested_As>=20) 
fluoride1 <- fluoride %>% rename(Location=location, N_WellsTestedF= n_wells_tested, F_PctExceeds= percent_wells_above_guideline, MedianLevelF=median, F95Percentile=percentile_95,MaxF=maximum) %>% filter(N_WellsTestedF>=20)

Arsenic data renamed variables

kable(head(arsenic1))

Location	N_WellsTested_As	As_PctExceeds	MedianLevelAs	As95Percentile	MaxAs
Manchester	275	58.9	14.0	93.00	200
Gorham	467	50.1	10.5	130.00	460
Columbia	42	50.0	9.8	65.90	200
Monmouth	277	49.5	10.0	110.00	368
Eliot	73	49.3	9.7	41.35	45
Columbia Falls	25	48.0	8.1	53.75	71

fluoride data with renamed variables

kable(head(fluoride1))

Location	N_WellsTestedF	F_PctExceeds	MedianLevelF	F95Percentile	MaxF
Otis	60	30.0	1.130	3.200	3.6
Dedham	102	22.5	0.940	3.270	7.0
Denmark	46	19.6	0.450	3.150	3.9
Surry	175	18.3	0.800	3.525	6.9
Prospect	57	17.5	0.785	2.500	2.7
Eastbrook	31	16.1	1.290	2.445	3.3

For space sake, I only ran the head() function. Although only 6 rows are available for view, with the filter function, I was able to eliminate the NA values by limiting water tests to 20 and above for the towns. I learned this from a classmate’s method when viewing on RPubs.

Inner Join to combine the two dataframes into a new dataframe called AsF. The entire dataset includes well water testing information on arsenic and fluoride levels for over 300 towns in Maine. To verify that the tables joined correctly, I inspected the first few rows and ran a summary on the variables.

AsF <- inner_join(arsenic1,fluoride1)

## Joining, by = "Location"

These tables offer a quick glance of the AsF table join.

kable(summary(AsF))

Location	N_WellsTested_As	As_PctExceeds	MedianLevelAs	As95Percentile	MaxAs	N_WellsTestedF	F_PctExceeds	MedianLevelF	F95Percentile	MaxF
Length:341	Min. : 20.00	Min. : 0.00	Min. : 0.25	Min. : 0.50	Min. : 1.00	Min. : 21.00	Min. : 0.000	Min. :0.1000	Min. :0.100	Min. : 0.100
Class :character	1st Qu.: 36.00	1st Qu.: 3.20	1st Qu.: 0.50	1st Qu.: 6.22	1st Qu.: 18.00	1st Qu.: 43.00	1st Qu.: 0.000	1st Qu.:0.1000	1st Qu.:0.522	1st Qu.: 1.200
Mode :character	Median : 57.00	Median : 8.30	Median : 1.00	Median : 13.55	Median : 39.00	Median : 69.00	Median : 0.700	Median :0.1000	Median :1.003	Median : 2.200
NA	Mean : 86.69	Mean :12.37	Mean : 1.62	Mean : 24.87	Mean : 95.03	Mean : 97.21	Mean : 2.465	Mean :0.1765	Mean :1.153	Mean : 2.546
NA	3rd Qu.:109.00	3rd Qu.:18.30	3rd Qu.: 1.90	3rd Qu.: 28.20	3rd Qu.: 92.00	3rd Qu.:126.00	3rd Qu.: 3.100	3rd Qu.:0.2000	3rd Qu.:1.601	3rd Qu.: 3.400
NA	Max. :632.00	Max. :58.90	Max. :14.00	Max. :372.50	Max. :3100.00	Max. :503.00	Max. :30.000	Max. :1.2900	Max. :4.180	Max. :14.000

kable(head(AsF))

Location	N_WellsTested_As	As_PctExceeds	MedianLevelAs	As95Percentile	MaxAs	N_WellsTestedF	F_PctExceeds	MedianLevelF	F95Percentile	MaxF
Manchester	275	58.9	14.0	93.00	200	276	3.3	0.30	1.700	3.60
Gorham	467	50.1	10.5	130.00	460	452	0.0	0.10	0.682	2.00
Columbia	42	50.0	9.8	65.90	200	54	1.9	0.31	1.329	4.30
Monmouth	277	49.5	10.0	110.00	368	288	3.1	0.30	1.676	3.40
Eliot	73	49.3	9.7	41.35	45	84	0.0	0.20	0.658	1.54
Columbia Falls	25	48.0	8.1	53.75	71	38	0.0	0.21	0.641	0.90

str(AsF)

## 'data.frame':    341 obs. of  11 variables:
##  $ Location        : chr  "Manchester" "Gorham" "Columbia" "Monmouth" ...
##  $ N_WellsTested_As: int  275 467 42 277 73 25 424 65 334 241 ...
##  $ As_PctExceeds   : num  58.9 50.1 50 49.5 49.3 48 44.8 44.6 43.4 42.7 ...
##  $ MedianLevelAs   : num  14 10.5 9.8 10 9.7 8.1 8.2 8.6 6 7 ...
##  $ As95Percentile  : num  93 130 65.9 110 41.4 ...
##  $ MaxAs           : num  200 460 200 368 45 71 240 431 670 930 ...
##  $ N_WellsTestedF  : int  276 452 54 288 84 38 453 59 383 209 ...
##  $ F_PctExceeds    : num  3.3 0 1.9 3.1 0 0 3.1 0 1 9.6 ...
##  $ MedianLevelF    : num  0.3 0.1 0.31 0.3 0.2 0.21 0.31 0.1 0.1 0.43 ...
##  $ F95Percentile   : num  1.7 0.682 1.329 1.676 0.658 ...
##  $ MaxF            : num  3.6 2 4.3 3.4 1.54 0.9 3.7 1.63 3.2 4.5 ...

I wanted to compare the percentage of wells tested that have excess levels of arsenic and fluoride in the water samples. I did this by using the subset function.

Subsets1<- subset(AsF, select = c("As_PctExceeds", "F_PctExceeds"))
summary(Subsets1)

##  As_PctExceeds    F_PctExceeds   
##  Min.   : 0.00   Min.   : 0.000  
##  1st Qu.: 3.20   1st Qu.: 0.000  
##  Median : 8.30   Median : 0.700  
##  Mean   :12.37   Mean   : 2.465  
##  3rd Qu.:18.30   3rd Qu.: 3.100  
##  Max.   :58.90   Max.   :30.000

I also wanted to look for any correlations in the data findings.

cor(Subsets1)

##               As_PctExceeds F_PctExceeds
## As_PctExceeds     1.0000000    0.1215215
## F_PctExceeds      0.1215215    1.0000000

I obviously coded it wrong or there is no correlation.

I also plotted the subset.

plot(Subsets1)

To calculate the number of wells in ME that have been tested for arsenic…

kable(sum(AsF$N_WellsTested_As))

29560

To calculate the number of wells in ME that have been tested for fluoride…

kable(sum(AsF$N_WellsTestedF))

33148

Between 1999-2013, there were more well water samples tested for fluoride.

I also had a little fun looking for correlations with the cor().Note the correlation between As testing and excess As findings is much stronger than that of the fluoride relationship. This is no doubt due to the fact that there is very little naturally occurring fluoride in groundwater.

pander(cor(AsF$N_WellsTested_As, AsF$As_PctExceeds))

0.3575

pander(cor(AsF$N_WellsTestedF, AsF$F_PctExceeds))

0.02335

Population Table

I wanted to bring in another data set that I could join with the water sample data. I found town population data and converted it into to csv from this webpage

Pop <- read.csv("NewPopTowns.csv", header = TRUE, stringsAsFactors = FALSE)

Population Table Variable Names

kable(names(Pop))

City

Population

Rename Pop so there’s a common variable name for a new join

PopTable <- Pop %>% rename(Location = City)
kable(str(PopTable))

## 'data.frame':    452 obs. of  2 variables:
##  $ Location  : chr  "Portland" "Lewiston" "Bangor" "South Portland" ...
##  $ Population: int  66881 36202 32391 25556 22871 21282 20893 20495 19691 19078 ...

Inner Join

PopAsF <- inner_join(AsF, PopTable) %>%  select(Location,Population,N_WellsTested_As,As_PctExceeds,N_WellsTestedF, F_PctExceeds,  MedianLevelAs,MedianLevelF)

## Joining, by = "Location"

The joined table showing the population of the towns in Maine and the arsenic and fluoride information from their well water.

kable(head(PopAsF))

Location	Population	N_WellsTested_As	As_PctExceeds	N_WellsTestedF	F_PctExceeds	MedianLevelAs	MedianLevelF
Manchester	2528	275	58.9	276	3.3	14.0	0.30
Gorham	17181	467	50.1	452	0.0	10.5	0.10
Columbia	466	42	50.0	54	1.9	9.8	0.31
Monmouth	4025	277	49.5	288	3.1	10.0	0.30
Eliot	6335	73	49.3	84	0.0	9.7	0.20
Winthrop	5921	424	44.8	453	3.1	8.2	0.31

I previously ran into trouble because some variables were classified as “chr” when they should have been “num” or “int” so I ran the str() function to verify they were classed correctly.

kable(str(PopAsF))

## 'data.frame':    299 obs. of  8 variables:
##  $ Location        : chr  "Manchester" "Gorham" "Columbia" "Monmouth" ...
##  $ Population      : int  2528 17181 466 4025 6335 5921 2315 8149 2664 4454 ...
##  $ N_WellsTested_As: int  275 467 42 277 73 424 65 334 241 133 ...
##  $ As_PctExceeds   : num  58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
##  $ N_WellsTestedF  : int  276 452 54 288 84 453 59 383 209 143 ...
##  $ F_PctExceeds    : num  3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...
##  $ MedianLevelAs   : num  14 10.5 9.8 10 9.7 8.2 8.6 6 7 5.25 ...
##  $ MedianLevelF    : num  0.3 0.1 0.31 0.3 0.2 0.31 0.1 0.1 0.43 0.2 ...

The percentage of wells that contain water with arsenic exceeding guidelines and their median arsenic levels.

This table also makes note if the median level of arsenic in each town exceeds the guidelines. To do this, I used the mutate function. I limited the table to 20 towns to save space.

The majority of ME towns did not have median levels of As that were in excess of the guidelines from the sampling between 1999-2013.

kable(PopAsF %>% mutate(AsAboveG.L.= MedianLevelAs >= 10) %>% select(Location, Population, As_PctExceeds, MedianLevelAs,AsAboveG.L.) %>% arrange(desc(MedianLevelAs)) %>% top_n(20,MedianLevelAs),digits=1)

Location	Population	As_PctExceeds	MedianLevelAs	AsAboveG.L.
Manchester	2528	58.9	14.0	TRUE
Gorham	17181	50.1	10.5	TRUE
Monmouth	4025	49.5	10.0	TRUE
Columbia	466	50.0	9.8	FALSE
Eliot	6335	49.3	9.7	FALSE
Hallowell	2315	44.6	8.6	FALSE
Winthrop	5921	44.8	8.2	FALSE
Mariaville	524	40.0	7.2	FALSE
Readfield	2520	39.8	7.2	FALSE
Blue Hill	2664	42.7	7.0	FALSE
Millinocket	4359	31.0	6.8	FALSE
Buxton	8149	43.4	6.0	FALSE
Surry	1459	40.3	6.0	FALSE
Rome	991	26.6	5.5	FALSE
Orland	2205	40.7	5.4	FALSE
Hollis	4454	41.4	5.2	FALSE
Belgrade	3115	31.2	5.2	FALSE
Scarborough	19691	35.2	5.2	FALSE
Vassalboro	4247	26.3	5.2	FALSE
Oakland	6138	33.0	5.0	FALSE

Therapeutic fluoride

**According to the American Academy of Pediatric Dentists, “since fluoride from water supplies is now one of several sources of fluoride, the Department of Health and Human Services recently has proposed to not have a fluoride range, but rather to limit the recommendation to the lower limit of 0.7 ppm F.”

Using these parameters above, Maine well water does not offer dental cavity protection for the most part; this is illustrated in the table below. Again, I’ve limited the data to 20 rows to save space. The vast majority of samples show that excess flouoride is not a problem, in fact, there in insufficient fluoride to offer dental cavity prophylaxis. In the column SubTherapeuticF, FALSE indicates that the median level in meets the AAPD recommendations.

I have also run the str() function to verify the data is classified correctly.

TherapeuticF <- PopAsF %>% mutate(SubTherapeuticF = MedianLevelF<=0.7) %>% select(Location, Population, N_WellsTestedF, SubTherapeuticF, MedianLevelF) %>% arrange(desc(MedianLevelF)) %>% top_n(20, MedianLevelF) 
str(TherapeuticF)

## 'data.frame':    20 obs. of  5 variables:
##  $ Location       : chr  "Eastbrook" "Otis" "Marshfield" "Dedham" ...
##  $ Population     : int  421 668 508 1683 1459 3395 991 1608 7857 4359 ...
##  $ N_WellsTestedF : int  31 60 31 102 175 52 82 56 503 29 ...
##  $ SubTherapeuticF: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ MedianLevelF   : num  1.29 1.13 1 0.94 0.8 0.76 0.6 0.6 0.5 0.495 ...

kable(TherapeuticF)

Location	Population	N_WellsTestedF	SubTherapeuticF	MedianLevelF
Eastbrook	421	31	FALSE	1.290
Otis	668	60	FALSE	1.130
Marshfield	508	31	FALSE	1.000
Dedham	1683	102	FALSE	0.940
Surry	1459	175	FALSE	0.800
Fryeburg	3395	52	FALSE	0.760
Rome	991	82	TRUE	0.600
Stockton Springs	1608	56	TRUE	0.600
Ellsworth	7857	503	TRUE	0.500
Millinocket	4359	29	TRUE	0.495
Stoneham	236	25	TRUE	0.490
Smithfield	999	79	TRUE	0.450
Denmark	1143	46	TRUE	0.450
Casco	3880	119	TRUE	0.445
Chesterville	1330	81	TRUE	0.440
Otisfield	1762	62	TRUE	0.440
Lovell	1133	43	TRUE	0.435
Blue Hill	2664	209	TRUE	0.430
Sedgwick	1182	143	TRUE	0.425
Porter	1490	41	TRUE	0.425

Rather than sort through the whole table to determine if any towns’ wells have median fluoride levels in excess of 2mg/ml, the maximum exposure guideline, R can do a quick calculation. The ‘NA’ below means no values were returned.

I’ve loaded the library(pander) function for formatting. Like kable(), it offers a cleaner presentation. For some reason, kable() wouldn’t work for this code.

ExcessF <- sum(PopAsF$MedianLevelF>=2)
pander(sum(ExcessF>=2))

I thought I would look to see if there is a correlation between the towns that have the most water testing per capita and findings of excessive levels of either compound. To do this, I calculated water tests per capita, using the higher of the two N_WellsTested. I did this in two steps, first calculating the higher test figure to use as a denominator for ‘tests per capita’.

MaxWells <- PopAsF %>% mutate(MaxWell = pmax(N_WellsTested_As,N_WellsTestedF)) %>% select(Location, Population, MaxWell, As_PctExceeds, F_PctExceeds)
kable(head(MaxWells))

Location	Population	MaxWell	As_PctExceeds	F_PctExceeds
Manchester	2528	276	58.9	3.3
Gorham	17181	467	50.1	0.0
Columbia	466	54	50.0	1.9
Monmouth	4025	288	49.5	3.1
Eliot	6335	84	49.3	0.0
Winthrop	5921	453	44.8	3.1

And just to make sure the new MaxWells variable is classified as an ‘int’…

kable(str(MaxWells))

## 'data.frame':    299 obs. of  5 variables:
##  $ Location     : chr  "Manchester" "Gorham" "Columbia" "Monmouth" ...
##  $ Population   : int  2528 17181 466 4025 6335 5921 2315 8149 2664 4454 ...
##  $ MaxWell      : int  276 467 54 288 84 453 65 383 241 143 ...
##  $ As_PctExceeds: num  58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
##  $ F_PctExceeds : num  3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...

Next, I create a new dataframe with the MaxWells variable to get the tests per capita. Again, I use the mutate function. I know I could have done this all in one code instead of two, but I kept getting error codes. I do it successfully later.

MaxWells1 <- MaxWells %>% mutate(TestsPerPop = Population/MaxWell) %>% select(Location, TestsPerPop,  As_PctExceeds, F_PctExceeds) %>%  arrange(desc(As_PctExceeds))
kable(head(MaxWells1, digits = 2))

Location	TestsPerPop	As_PctExceeds	F_PctExceeds
Manchester	9.15942	58.9	3.3
Gorham	36.79015	50.1	0.0
Columbia	8.62963	50.0	1.9
Monmouth	13.97569	49.5	3.1
Eliot	75.41667	49.3	0.0
Winthrop	13.07064	44.8	3.1

str(MaxWells1)

## 'data.frame':    299 obs. of  4 variables:
##  $ Location     : chr  "Manchester" "Gorham" "Columbia" "Monmouth" ...
##  $ TestsPerPop  : num  9.16 36.79 8.63 13.98 75.42 ...
##  $ As_PctExceeds: num  58.9 50.1 50 49.5 49.3 44.8 44.6 43.4 42.7 41.4 ...
##  $ F_PctExceeds : num  3.3 0 1.9 3.1 0 3.1 0 1 9.6 3.5 ...

Plot looking for a correlation between tests per capita and the percentage of wells that have excessive arsenic

library(ggvis) 
MaxWells1 %>% ggvis(~TestsPerPop, ~As_PctExceeds ) %>% layer_points(size := 30, shape := "circle", stroke := "blue", fill := NA)

Is there a correlation between tests performed per capita and elevated arsenic levels?

MaxWells1.lm = lm(As_PctExceeds ~TestsPerPop, data = MaxWells1)
pander(summary(MaxWells1.lm)$r.squared)

8.727e-07

Out of desperation, I reversed the X, Y variables, for naught.

MaxWells1.lm = lm(TestsPerPop ~ As_PctExceeds, data = MaxWells1)
pander(summary(MaxWells1.lm)$r.squared)

8.727e-07

pander(cor(MaxWells1$TestsPerPop, MaxWells1$As_PctExceeds))

-0.0009342

Initially, I thought the problem here was that that I couldnt’t exclude the NA values or that I need to set values for the x-axis. But then I was able to filter the NA values out. The problem could also be that the TestsPerPop is a calculated field, and I’ve yet to learn to deal with them. I’m also not sure if I was able to get the regression summary to work.

One final set I decided to generate, after the deadline, was inspired by a classmate’s work. She created a clean water table. I decided to see where the best well water is; my definition of best is low arsenic and a therapeutic level of fluoride.

#Create a variable for Maine's Population
MEPop <- 1329000
kable(MEPop)

1329000

#filter safe arsenic water
Good_WaterAs <- PopAsF %>% filter(MedianLevelAs<10) %>% select(Location, Population, MedianLevelAs)
Good_WaterF <- PopAsF %>% filter(MedianLevelF<=1.2) %>% select(Location, Population, MedianLevelF)
#get rid of too high
#get rid of too low fluoride
BetterF <- Good_WaterF %>% filter(MedianLevelF>=.7) %>% select (Location, Population, MedianLevelF)
#This is theraputic
#Join tables to get optimal levels of both
BestH2O <- inner_join(BetterF,Good_WaterAs) %>% mutate(PctOfState=Population/MEPop*100) %>% select(Location, Population, PctOfState, MedianLevelF, MedianLevelAs)

## Joining, by = c("Location", "Population")

Based on the data given to us, these 5 towns offer a good balanced well water.

#The code above is how to change font color in R Markdown. 
kable(BestH2O)

Location	Population	PctOfState	MedianLevelF	MedianLevelAs
Surry	1459	0.1097818	0.80	6.0
Otis	668	0.0502634	1.13	4.8
Dedham	1683	0.1266366	0.94	1.0
Marshfield	508	0.0382242	1.00	1.0
Fryeburg	3395	0.2554552	0.76	0.5

Onward

I decided to try another route and explore if there were more dental practices in the counties where there is a higher population of patients who don’t get adequate fluoridationin their water. I am ignoring whether patients use fluoride supplements and any dental health trends that might be present. My route is circuitous because this is a learning experience.

What percent of the homes rely on private wells for their water by county?

To help explore if the dentists where the fluoride is not, I use household size from the latest Census data.. Wells use by percentage is given for households. The Census gives mean household size by county.

WellsInCounties <- read.csv("HomesWithWellsbyPercent.csv", header=TRUE, stringsAsFactors=FALSE)
head(WellsInCounties)

##         County PctOfHomeWells Population CapitaHousehold
## 1 Androscoggin           0.44     107233            2.35
## 2    Aroostook           0.59      68628            2.24
## 3   Cumberland           0.36     289977            2.35
## 4     Franklin           0.61      29991            2.46
## 5      Hancock           0.76      54659            2.18
## 6     Kennebec           0.56     119980            2.30

str(WellsInCounties)

## 'data.frame':    17 obs. of  4 variables:
##  $ County         : chr  "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
##  $ PctOfHomeWells : num  0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
##  $ Population     : int  107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
##  $ CapitaHousehold: num  2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...

fluoride Levels by County. The State of Maine Health and Environmental Testing Laboratory provided these data. 1999-2013,

This information is much the same as that reviewed earlier, however now it is by county, not town. This is because I could only find data on practicing Dentists by county.

Countyfluoride <- read.csv("fluorideByCounty.csv", header = TRUE, stringsAsFactors = FALSE) 
kable(str(Countyfluoride))

## 'data.frame':    37 obs. of  6 variables:
##  $ County        : chr  "Hancock" "Oxford" "Franklin" "York" ...
##  $ N_WellsTestedF: int  3744 1611 1087 2642 1291 5890 1843 1569 4297 1745 ...
##  $ F_PctExceeds  : num  0.07 0.04 0.03 0.03 0.03 0.03 0.02 0.02 0.02 0.02 ...
##  $ MedianLevelF  : num  0.28 0.1 0.1 0.1 0.1 0.2 0.1 0.1 0.1 0.1 ...
##  $ F95Percentile : num  0.02 0.02 0.02 0.02 0.01 0.02 0.01 0.01 0.01 0.01 ...
##  $ MaxF          : num  9.1 9.9 5.6 9.6 14 6.8 5.3 5.4 9.1 4.1 ...

kable(summary(Countyfluoride))

County	N_WellsTestedF	F_PctExceeds	MedianLevelF	F95Percentile	MaxF
Length:37	Min. : 581	Min. :0.0000	Min. :0.1000	Min. :0.0000	Min. : 1.600
Class :character	1st Qu.:1206	1st Qu.:0.0100	1st Qu.:0.1000	1st Qu.:0.0100	1st Qu.: 4.000
Mode :character	Median :1678	Median :0.0200	Median :0.1000	Median :0.0100	Median : 5.500
NA	Mean :2190	Mean :0.0225	Mean :0.1175	Mean :0.0125	Mean : 6.594
NA	3rd Qu.:2832	3rd Qu.:0.0300	3rd Qu.:0.1000	3rd Qu.:0.0200	3rd Qu.: 9.225
NA	Max. :5890	Max. :0.0700	Max. :0.2800	Max. :0.0200	Max. :14.000
NA	NA’s :21	NA’s :21	NA’s :21	NA’s :21	NA’s :21

The total number of well water tests done to evaluate fluoride levels between 1999-2013 was…

library(pander)
pander(sum(Countyfluoride$N_WellsTestedF, na.rm=TRUE))

35044

I used the pander() to get nicer formatting for the output.

Practicing Dentists Actively Practicing Dentists to 10,000 Population in Maine by County, November, 2011

CountyDentists <- read.csv("PracticingDentists.csv", header = TRUE, stringsAsFactors = FALSE)
names(CountyDentists)

## [1] "County"        "DentistPer10K"

str(CountyDentists)

## 'data.frame':    16 obs. of  2 variables:
##  $ County       : chr  "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
##  $ DentistPer10K: num  4.18 3.34 7.35 3.9 4.59 6.47 6.29 3.48 2.25 5.33 ...

I do an inner join of the table that shows what percent of homes have wells (by county) and the fluoride data by county.

fluorideCombo <- inner_join(WellsInCounties, Countyfluoride)

## Joining, by = "County"

str(fluorideCombo)

## 'data.frame':    16 obs. of  9 variables:
##  $ County         : chr  "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
##  $ PctOfHomeWells : num  0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
##  $ Population     : int  107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
##  $ CapitaHousehold: num  2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...
##  $ N_WellsTestedF : int  1569 1048 4297 1087 3744 5890 940 2106 1611 3404 ...
##  $ F_PctExceeds   : num  0.02 0.01 0.02 0.03 0.07 0.03 0.01 0.01 0.04 0.01 ...
##  $ MedianLevelF   : num  0.1 0.1 0.1 0.1 0.28 0.2 0.1 0.1 0.1 0.1 ...
##  $ F95Percentile  : num  0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.01 0.02 0.01 ...
##  $ MaxF           : num  5.4 10 9.1 5.6 9.1 6.8 3.2 4.8 9.9 3.7 ...

Join Dentists by county data to the new county fluoride dataset.

DentistAndfluoride <- inner_join(fluorideCombo, CountyDentists)

## Joining, by = "County"

head(DentistAndfluoride)

##         County PctOfHomeWells Population CapitaHousehold N_WellsTestedF
## 1 Androscoggin           0.44     107233            2.35           1569
## 2    Aroostook           0.59      68628            2.24           1048
## 3   Cumberland           0.36     289977            2.35           4297
## 4     Franklin           0.61      29991            2.46           1087
## 5      Hancock           0.76      54659            2.18           3744
## 6     Kennebec           0.56     119980            2.30           5890
##   F_PctExceeds MedianLevelF F95Percentile MaxF DentistPer10K
## 1         0.02         0.10          0.01  5.4          4.18
## 2         0.01         0.10          0.01 10.0          3.34
## 3         0.02         0.10          0.01  9.1          7.35
## 4         0.03         0.10          0.02  5.6          3.90
## 5         0.07         0.28          0.02  9.1          4.59
## 6         0.03         0.20          0.02  6.8          6.47

str(DentistAndfluoride)

## 'data.frame':    16 obs. of  10 variables:
##  $ County         : chr  "Androscoggin" "Aroostook" "Cumberland" "Franklin" ...
##  $ PctOfHomeWells : num  0.44 0.59 0.36 0.61 0.76 0.56 0.56 0.82 0.65 0.54 ...
##  $ Population     : int  107233 68628 289977 29991 54659 119980 39855 33969 57202 152692 ...
##  $ CapitaHousehold: num  2.35 2.24 2.35 2.46 2.18 2.3 2.26 2.26 2.53 2.36 ...
##  $ N_WellsTestedF : int  1569 1048 4297 1087 3744 5890 940 2106 1611 3404 ...
##  $ F_PctExceeds   : num  0.02 0.01 0.02 0.03 0.07 0.03 0.01 0.01 0.04 0.01 ...
##  $ MedianLevelF   : num  0.1 0.1 0.1 0.1 0.28 0.2 0.1 0.1 0.1 0.1 ...
##  $ F95Percentile  : num  0.01 0.01 0.01 0.02 0.02 0.02 0.01 0.01 0.02 0.01 ...
##  $ MaxF           : num  5.4 10 9.1 5.6 9.1 6.8 3.2 4.8 9.9 3.7 ...
##  $ DentistPer10K  : num  4.18 3.34 7.35 3.9 4.59 6.47 6.29 3.48 2.25 5.33 ...

kable(DentistAndfluoride)

County	PctOfHomeWells	Population	CapitaHousehold	N_WellsTestedF	F_PctExceeds	MedianLevelF	F95Percentile	MaxF	DentistPer10K
Androscoggin	0.44	107233	2.35	1569	0.02	0.10	0.01	5.4	4.18
Aroostook	0.59	68628	2.24	1048	0.01	0.10	0.01	10.0	3.34
Cumberland	0.36	289977	2.35	4297	0.02	0.10	0.01	9.1	7.35
Franklin	0.61	29991	2.46	1087	0.03	0.10	0.02	5.6	3.90
Hancock	0.76	54659	2.18	3744	0.07	0.28	0.02	9.1	4.59
Kennebec	0.56	119980	2.30	5890	0.03	0.20	0.02	6.8	6.47
Knox	0.56	39855	2.26	940	0.01	0.10	0.01	3.2	6.29
Lincoln	0.82	33969	2.26	2106	0.01	0.10	0.01	4.8	3.48
Oxford	0.65	57202	2.53	1611	0.04	0.10	0.02	9.9	2.25
Penobscot	0.54	152692	2.36	3404	0.01	0.10	0.01	3.7	5.33
Piscataquis	0.73	16931	2.22	581	0.00	0.10	0.00	1.6	3.42
Sagadahoc	0.58	35149	2.31	1246	0.01	0.10	0.01	3.3	6.80
Somerset	0.72	51113	2.37	1291	0.03	0.10	0.01	14.0	2.49
Waldo	0.78	39155	2.31	1745	0.02	0.10	0.01	4.1	2.58
Washington	0.75	31625	2.20	1843	0.02	0.10	0.01	5.3	3.96
York	0.50	201169	2.40	2642	0.03	0.10	0.02	9.6	3.96

To calculate the number of people with subTherapeutic fluoride:

kable(DentistAndfluoride %>% mutate(WellDrinkers=Population/CapitaHousehold*PctOfHomeWells) %>% mutate(SubTherapeutic=MedianLevelF<=0.7) %>% select(County, WellDrinkers,SubTherapeutic))

County	WellDrinkers	SubTherapeutic
Androscoggin	20077.668	TRUE
Aroostook	18076.125	TRUE
Cumberland	44422.009	TRUE
Franklin	7436.793	TRUE
Hancock	19055.431	TRUE
Kennebec	29212.522	TRUE
Knox	9875.575	TRUE
Lincoln	12325.035	TRUE
Oxford	14696.166	TRUE
Penobscot	34938.000	TRUE
Piscataquis	5567.401	TRUE
Sagadahoc	8825.290	TRUE
Somerset	15528.000	TRUE
Waldo	13221.169	TRUE
Washington	10781.250	TRUE
York	41910.208	TRUE

Dentists per 10,000 population and number of people who may not get enough fluoride. Is there a correlation? Do the Dentists go to where there might be needy teeth?

Chart <- DentistAndfluoride %>% mutate(WellDrinkers=Population/CapitaHousehold*PctOfHomeWells) %>% mutate(Wellper10K=WellDrinkers/10000) %>% mutate(Ratio=Wellper10K/DentistPer10K) %>%  select(County, WellDrinkers, Wellper10K,DentistPer10K,Ratio)
kable(Chart, digits = 2)

County	WellDrinkers	Wellper10K	DentistPer10K	Ratio
Androscoggin	20077.67	2.01	4.18	0.48
Aroostook	18076.12	1.81	3.34	0.54
Cumberland	44422.01	4.44	7.35	0.60
Franklin	7436.79	0.74	3.90	0.19
Hancock	19055.43	1.91	4.59	0.42
Kennebec	29212.52	2.92	6.47	0.45
Knox	9875.58	0.99	6.29	0.16
Lincoln	12325.04	1.23	3.48	0.35
Oxford	14696.17	1.47	2.25	0.65
Penobscot	34938.00	3.49	5.33	0.66
Piscataquis	5567.40	0.56	3.42	0.16
Sagadahoc	8825.29	0.88	6.80	0.13
Somerset	15528.00	1.55	2.49	0.62
Waldo	13221.17	1.32	2.58	0.51
Washington	10781.25	1.08	3.96	0.27
York	41910.21	4.19	3.96	1.06

Plot There does not appear to be a pattern between potential fluoride deprivation and Dental practices. The visualization showns the data points for the number of wellwater drinkers who are likely not getting adequate fluoride from it and the number of County dentists, More non-flouridated drinkers does not bring more dentists. This linear model illustrates a weak correlation.

library(ggvis)
Chart %>% ggvis(~WellDrinkers,~DentistPer10K) %>% layer_points(size := 50, shape := "circle", stroke := "blue") %>% layer_model_predictions(model = "lm", se = TRUE)

## Guessing formula = DentistPer10K ~ WellDrinkers

Linear regression and a low R-Sq. will confirm this. Below, I try three different codes to calculate the R.Sq.

Chart.lm = lm(WellDrinkers ~ DentistPer10K, data=Chart)
pander(summary(Chart.lm)$r.squared)

0.1561

pander(cor(Chart$WellDrinkers, Chart$DentistPer10K))

0.3951

Patients <- Chart$WellDrinkers
Dentist <- Chart$DentistPer10K
pander(cor(Patients, Dentist))

0.3951

Two for Three.

Certainly, there is more that could be brought in for more sophisticated analysis. I just have to learn the code.

Lastly, I want to save the joined table I made.

write.csv(DentistAndfluoride,"DentistAndfluorideAndPop.csv",row.names = TRUE)

Assignment 1, MBA 676

Chris Iyer

September 19, 2016

The Beginning:

Onward