In this report, I join two datasets from the Maine Tracking Network that report arsenic and fluoride levels in wells tested throughout the state.
I focus on two questions: (1) once the two datasets are joined at the location level, is there any correlation between arsenic and fluoride levels; and (2) since the join is a purely programmatic step in R, are there characteristics of either dataset that should lower our confidence in how the data was gathered, and in whether joining them is appropriate even though it is technically possible at the location level?
After joining the two datasets at the location level and pulling in both sets of variables, I first plotted the sample sizes (number of wells tested) by location to check that they were roughly the same for arsenic and fluoride. They were: the two counts have a correlation of 0.97.
library(dplyr)
# Read the two Maine Tracking Network extracts (one row per location)
arsenic <- read.csv("C:/Users/Peter Beretich/Documents/thomas/USMAnalyticscourse/unit04/arsenic.csv", stringsAsFactors = FALSE)
flouride <- read.csv("C:/Users/Peter Beretich/Documents/thomas/USMAnalyticscourse/unit04/flouride.csv", stringsAsFactors = FALSE)
# Prefix the columns so the two sets of measurements stay distinct after the join
names(arsenic) <- c("location", "AS_n_wells", "AS_perc_above", "AS_median", "AS_95", "AS_max")
names(flouride) <- c("location", "FL_n_wells", "FL_perc_above", "FL_median", "FL_95", "FL_max")
# Keep only locations that appear in both datasets, matching explicitly on location
arsenic_flouride <- flouride %>% inner_join(arsenic, by = "location")
library(ggvis)
# Wells tested per location: fluoride (x) against arsenic (y)
arsenic_flouride %>% ggvis(~FL_n_wells, ~AS_n_wells) %>%
  layer_points()
with(arsenic_flouride, cor(FL_n_wells, AS_n_wells))
## [1] 0.9733364
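Because an inner join keeps only the locations present in both files, it is worth checking what it drops before trusting the joined table. The following is a minimal base-R sketch of one way to do that. The tables `arsenic_toy` and `flouride_toy`, their location names, and their counts are made up for illustration and are not taken from the Maine data.

```r
# Toy tables standing in for the real arsenic and flouride data frames.
# All names and values here are hypothetical.
arsenic_toy <- data.frame(
  location   = c("Town A", "Town B", "Town C"),
  AS_n_wells = c(120, 15, 48),
  stringsAsFactors = FALSE
)
flouride_toy <- data.frame(
  location   = c("Town A", "Town B", "Town D"),
  FL_n_wells = c(118, 14, 30),
  stringsAsFactors = FALSE
)

# Locations present in only one table are silently dropped by an inner join.
only_in_arsenic  <- setdiff(arsenic_toy$location, flouride_toy$location)
only_in_flouride <- setdiff(flouride_toy$location, arsenic_toy$location)

# Base-R equivalent of an inner join on location.
joined_toy <- merge(flouride_toy, arsenic_toy, by = "location")

# The join key should be unique in each table; duplicates would multiply rows.
has_duplicate_keys <- anyDuplicated(arsenic_toy$location) > 0 ||
  anyDuplicated(flouride_toy$location) > 0
```

Run against the real `arsenic` and `flouride` data frames, non-empty `only_in_*` vectors or duplicated keys would be a first sign that the two files do not describe the same set of locations.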
With the arsenic and fluoride sample sizes by location so similar, I next look for a relationship between arsenic and fluoride levels by location. From the plot, the median values do not appear highly correlated, and the correlation of 0.196 is indeed weak.
# Median level per location: fluoride (x) against arsenic (y)
arsenic_flouride %>% ggvis(~FL_median, ~AS_median) %>%
  layer_points()
with(arsenic_flouride, cor(FL_median, AS_median, use = "complete.obs", method = "pearson"))
## [1] 0.1956527
Let’s take a look at a density plot of the number of wells tested per location in each dataset.
# Kernel density of wells tested per location (arsenic)
d <- density(arsenic$AS_n_wells) # returns the density data
plot(d) # plots the results
# Kernel density of wells tested per location (fluoride)
d <- density(flouride$FL_n_wells) # returns the density data
plot(d) # plots the results
The sample sizes by location are strongly right-skewed: a large number of locations have very few wells tested. This gave me pause: is the correlation between the median values overly affected by the many locations with small sample sizes? To check, I compute the correlation again using only locations with more than 100 arsenic wells tested.
# Keep locations with more than 100 arsenic wells tested
highsample <- arsenic_flouride %>% filter(AS_n_wells > 100)
highsample %>% ggvis(~FL_median, ~AS_median) %>%
  layer_points()
with(highsample, cor(FL_median, AS_median, use = "complete.obs", method = "pearson"))
## [1] 0.3234243
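The cutoff of 100 wells is only one possible choice, so it can help to see how the correlation moves as the cutoff changes. The following is a rough base-R sketch of that idea. The helper `cor_by_threshold()` and the `toy` data frame are hypothetical and exist only to show the mechanics; the values are not from the Maine data.

```r
# Hypothetical helper: Pearson correlation between columns x and y, computed
# only on rows whose sample-size column n_col exceeds each cutoff in min_n.
cor_by_threshold <- function(df, x, y, n_col, min_n) {
  rows <- lapply(min_n, function(k) {
    sub <- df[which(df[[n_col]] > k), , drop = FALSE]
    ok  <- complete.cases(sub[, c(x, y)])
    data.frame(
      min_n       = k,
      n_locations = sum(ok),
      # With fewer than 3 complete pairs a correlation is not meaningful.
      r = if (sum(ok) >= 3) cor(sub[[x]], sub[[y]], use = "complete.obs") else NA_real_
    )
  })
  do.call(rbind, rows)
}

# Made-up values for illustration only.
toy <- data.frame(
  AS_n_wells = c(5, 12, 40, 90, 150, 220, 400, 800),
  FL_median  = c(0.1, 0.3, 0.2, 0.4, 0.3, 0.5, 0.6, 0.7),
  AS_median  = c(2.0, 0.5, 1.5, 1.0, 1.2, 1.8, 2.4, 3.0)
)

result <- cor_by_threshold(toy, "FL_median", "AS_median", "AS_n_wells",
                           min_n = c(0, 50, 100))
print(result)
```

Applied to `arsenic_flouride`, the row for a cutoff of 100 should correspond to the `highsample` correlation above. The `n_locations` column shows how many locations each estimate rests on, which matters because estimates based on few locations are unstable.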
Although the correlation is still rather weak, at 0.32 it is considerably higher than the previous value of 0.196. Locations with larger sample sizes show a stronger correlation between the arsenic and fluoride median levels. Next, for each dataset, I plot the median level against the number of wells tested. I would hope to find no correlation in either dataset, and certainly no great difference between the two correlations. To me, any correlation between the median values and the sample sizes themselves would call into question the integrity of the data, and whether joining these two datasets at the location level is appropriate for further analysis.
# Fluoride: wells tested (x) against median level (y)
arsenic_flouride %>% ggvis(~FL_n_wells, ~FL_median) %>%
  layer_points()
with(arsenic_flouride, cor(FL_n_wells, FL_median, use = "complete.obs", method = "pearson"))
## [1] 0.03967881
# Arsenic: wells tested (x) against median level (y)
arsenic_flouride %>% ggvis(~AS_n_wells, ~AS_median) %>%
  layer_points()
with(arsenic_flouride, cor(AS_n_wells, AS_median, use = "complete.obs", method = "pearson"))
## [1] 0.3743485
The plots are not particularly revealing, but the two correlations are. In the fluoride dataset, a correlation of 0.040 means there is virtually no linear relationship between the number of wells tested at each location and the median fluoride value for that location. The same cannot be said for arsenic: a correlation of 0.374 suggests that locations with more wells tested tend to report higher median arsenic levels. This lowers my confidence in interpreting any relationship between the two datasets’ values and calls into question the data integrity of the arsenic dataset.
As an analyst by trade, I find it natural to link databases to create a richer dataset from which further insights can be gleaned. However, examining how each dataset behaves on its own lets me proceed more cautiously, rather than extrapolating insights simply because a join was possible. Instead of drilling down into the joined data to see how arsenic and fluoride interact at the location level, I preferred in this exercise to find out whether either dataset has questionable qualities or design flaws. Given what I found in the arsenic dataset, I am less confident both in analyzing the two datasets together and in inferring anything from the arsenic dataset by itself.
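Since the well counts are heavily right-skewed, a Pearson correlation between sample size and median level can be driven by a handful of large locations. As a cross-check (not part of the analysis above), one could compare it with a rank-based correlation, a correlation on log-scaled counts, and a confidence interval from `cor.test()`. The sketch below shows these calls on a made-up `toy` data frame; the column names and values are hypothetical.

```r
# Made-up values for illustration only; not taken from the Maine data.
toy <- data.frame(
  n_wells = c(3, 5, 8, 12, 20, 35, 60, 150, 400, 900),
  median  = c(1.0, 0.8, 1.5, 1.1, 2.0, 1.7, 2.5, 2.2, 3.0, 3.4)
)

# Pearson measures linear association and is sensitive to extreme counts.
pearson <- cor(toy$n_wells, toy$median, method = "pearson")

# Spearman works on ranks, so a few very large locations carry less weight.
spearman <- cor(toy$n_wells, toy$median, method = "spearman")

# Log-scaling the counts is another way to tame the skew.
pearson_log <- cor(log10(toy$n_wells), toy$median)

# cor.test() adds a confidence interval and p-value to the Pearson estimate.
ct <- cor.test(toy$n_wells, toy$median)
print(ct$conf.int)
```

If the three estimates disagree sharply on the real arsenic data, the 0.374 figure may reflect a few heavily sampled locations more than a general pattern.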