Required Packages

library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.3     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(BSDA)
## Warning: package 'BSDA' was built under R version 4.0.3
## Loading required package: lattice
## 
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
## 
##     Orange
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following objects are masked from 'package:BSDA':
## 
##     Gasoline, Wheat
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.

Introduction

The majority of the water on Earth is found in oceans and is unusable by humans. Thus, humans must look to other sources for this life-sustaining liquid. Most of the world’s freshwater is found in glaciers, which are equally difficult to harness. As a result, humans must turn to freshwater found in underground reserves called aquifers. These features of the subsurface Earth are important not only to the environment and the field of geology, but also to the worldwide economy, as they can provide water for agriculture in areas where surface water is scarce. They make up 22% of the Earth’s freshwater and are replenished when water enters at recharge zones. Aquifers are classified by the type of rock they are composed of, such as porous rock, gravel, or other broken rock. Since aquifers are so far underground, their reserve of water is usually not depleted by evaporation and is, for the most part, free from pollution. However, the over-extraction of water for both residential and agricultural use can cause issues such as saltwater intrusion, where ocean water infiltrates the rock and compromises the freshwater in the aquifer. It is also common for gas from underground tanks or sewage from septic tanks to contaminate nearby groundwater. Therefore, it is imperative that scientists monitor the health of aquifers, especially as humans use more and more groundwater.

Not only does waste created by humans contaminate groundwater, but water-soluble rocks can also dissolve in groundwater, introducing foreign minerals. A calcium-rich limestone, for example, may cause a higher calcium content in water. This explains why this dataset includes such a large variety of substances found in groundwater: there are many types of rocks that make up aquifers.

My interest in groundwater began while I was taking Environmental Science at Montgomery College in 2019. I had never realized that there were massive reserves of water underground that so many people rely on. Then I realized that members of my family, who live in rural New York, rely on wells for all their water. Their water is contaminated with sulfur, causing it to smell like eggs, yet their next-door neighbors have uncontaminated artesian well water flowing through every faucet. I am currently taking a Geology class and am hoping to continue at a four-year university so I can study groundwater in the future.


The Data

11,032 wells were sampled, though some were sampled multiple times. This data comes from untreated wells in the United States.

gw <- read.csv("C:/users/maddie/desktop/maddie's trashcan/biostat/gw.csv", header= TRUE)

Collection of Data

This dataset is a collection of 38,105 observations of 24 variables relating to groundwater. The data come from the U.S. Geological Survey National Water Information System database and are entirely observational. Observations are from the years 1988 to 2017.

There is a strict set of procedures that the United States Geological Survey follows when collecting and testing water samples. These procedures are outlined in the National Field Manual for the Collection of Water-Quality Data and are updated when changes to protocol are made.

Definition of Variables

Quantitative Variables

  1. ph: The pH scale describes the acidity or alkalinity of a liquid. This measurement is on a scale of 0 to 14, with 0 being the most acidic, 14 the most basic, and 7 neutral

  2. o2: Level of dissolved oxygen in water, in milligrams per liter (mg/L)

  3. temp: Water temperature, in degrees Celsius (°C)

  4. depth: Depth of the well, in meters (m)

  5. tds: Total dissolved solids, in milligrams per liter (mg/L)

  6. concentrations of:

    • manganese (mn), in micrograms per liter (μg/L)
    • iron (fe), in micrograms per liter (μg/L)
    • aluminium (al), in micrograms per liter (μg/L)
    • sulfate (so4), in milligrams per liter (mg/L)
    • silica (sio2), in milligrams per liter (mg/L)
    • fluoride (f), in milligrams per liter (mg/L)
    • chloride (cl), in milligrams per liter (mg/L)
    • sodium (na), in milligrams per liter (mg/L)
    • potassium (k), in milligrams per liter (mg/L)
    • magnesium (mg), in milligrams per liter (mg/L)
    • calcium (ca), in milligrams per liter (mg/L)

Categorical Variables

  1. lith: Describes the lithology (characteristics of the rocks) of the aquifer
table(gw$lith)
## 
##                                             carbonate       crystalline 
##               753                 5              3320              2208 
##           glacial         sandstone semi-consolidated             shale 
##              6605              3892              4538               433 
##      ss-carbonate    unconsolidated         volcanics 
##               384             14540              1425
  2. state: Includes observations from the 48 contiguous states and D.C.
table(gw$state)
## 
##   AL   AR   AZ   CA   CO   CT   DC   DE   FL   GA   IA   ID   IL   IN   KS   KY 
##  204  343 1724 5118 1238  112   31  335  727  550  748 2631  303  373  632   42 
##   LA   MA   MD   ME   MI   MN   MO   MS   MT   NC   ND   NE   NH   NJ   NM   NV 
##  820  523 1219   69  363  748  554  291  861  644  505 1249  197 1415  404 1075 
##   NY   OH   OK   OR   PA   RI   SC   SD   TN   TX   UT   VA   VT   WA   WI   WV 
## 1158  442  533  235 1310   24  288  406  762 1269 1366  850   31 1977  283  463 
##   WY 
##  658
  3. aquifer: Name of the aquifer that was sampled

  4. w_type: Type of well that was sampled

table(gw$w_type)
## 
##    Commercial      Domestic    Industrial    Irrigation    Monitoring 
##           262         11032           295          2181         13958 
##         Other Public supply         Stock       Unknown 
##          1227          5407           978          2763
  5. year: Year when the data was collected
table(gw$year)
## 
## 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 
## 1812 2230 1655 1757 1567 2293 1968 1877 1261 1490 1546 1191 1391  773  971  575 
## 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 
##  637  917 1124 1154 1190  914 1314 1207 1242 1499 1784  405  359

I will focus on these statistics:

mean pH, total dissolved solids, manganese, chloride, and sulfates

All of these variables are quantitative and I will use them to create plots, confidence intervals and regression models in order to answer my overarching question.

Cleaning the Dataset

In order to ensure that the dataset is prepared for statistical analysis I:

  • Replaced all the < 10 values for the variable FE with 5. Since 5 is less than 10 and halfway between 0 and 10, this is a fair replacement.

  • Replaced all the < 4 values for the variable MN with 2. Since 2 is less than 4 and halfway between 0 and 4, this is a fair replacement.

  • Replaced all the < 0.1 values for variable F with 0.05. Since 0.05 is less than 0.1 and halfway between 0 and 0.1, this is a fair replacement.

  • Removed the month and day from the date variable to allow for easier creation of categorical tables.

  • Replaced the “other aquifer” and “other aquifers” fields with “other” for consistency and ease.

  • Converted variables classified as “characters” to “numbers” to allow for calculations.

  • Removed the 2 values from the year 2017, since they contained outliers that were causing difficulties with my data analysis.
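
The substitutions in the list above can be expressed in code. The following is only a sketch of one way to do it: the helper name `half_limit` and the example values are hypothetical, and it assumes censored entries are stored as text such as "< 10". It is not necessarily how the replacements in this report were carried out.

```r
# Hypothetical helper: replace censored entries such as "< 10" with half
# of the reported detection limit, then return a numeric vector.
half_limit <- function(x) {
  x <- trimws(as.character(x))
  censored <- grepl("^<", x)
  # Strip the "<" sign so only the limit remains, e.g. "< 10" -> "10"
  limit <- as.numeric(sub("^<\\s*", "", x[censored]))
  out <- rep(NA_real_, length(x))
  out[censored] <- limit / 2
  # Anything else that is not a number (including "") becomes NA
  out[!censored] <- suppressWarnings(as.numeric(x[!censored]))
  out
}

# Made-up values for illustration only
fe_raw <- c("< 10", "25", "130", "", "< 10")
fe_clean <- half_limit(fe_raw)
fe_clean
```

Because the helper reads the limit from each entry, the same function would cover the "< 4" and "< 0.1" cases without hard-coding 5, 2, or 0.05.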

gw$mn <- as.numeric(gw$mn)
## Warning: NAs introduced by coercion
gw$tds <- as.numeric(gw$tds)
## Warning: NAs introduced by coercion
gw$so4 <- as.numeric(gw$so4)
## Warning: NAs introduced by coercion
gw$cl <- as.numeric(gw$cl)
## Warning: NAs introduced by coercion
gw$ca <- as.numeric(gw$ca)
## Warning: NAs introduced by coercion
gw$mg <- as.numeric(gw$mg)
## Warning: NAs introduced by coercion
gw$k <- as.numeric(gw$k)
## Warning: NAs introduced by coercion
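
Each "NAs introduced by coercion" warning means that at least one entry in the column could not be read as a number. A small sketch of how one might count those entries before converting is shown below; the data frame `demo` and its values are made up and stand in for `gw`.

```r
# Made-up data frame standing in for gw; columns are stored as character
demo <- data.frame(
  mn  = c("12", "< 4", "30"),
  tds = c("310", "", "295"),
  stringsAsFactors = FALSE
)

cols <- c("mn", "tds")

# Count entries that contain text but fail to parse as numbers
na_introduced <- sapply(demo[cols], function(x) {
  parsed <- suppressWarnings(as.numeric(x))
  sum(is.na(parsed) & !is.na(x) & x != "")
})

# Convert all listed columns in one step
demo[cols] <- lapply(demo[cols], function(x) suppressWarnings(as.numeric(x)))
na_introduced
```

A nonzero count for a column suggests that it still holds censored entries or other text that would be silently turned into NA.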

I will focus on answering this question:

How do various elements affect the pH of groundwater?


Summary and Analysis of the Data

I would like to study the parameter mean pH as it relates to the concentrations of manganese, chloride, total dissolved solids, and sulfates.

I will look into how these specific variables have changed over time as well, in order to identify trends that could be meaningful.

pH

Plot and Summary Statistics:

hist(gw$ph, main = "pH of Groundwater", xlab = "pH", xlim = c(0, 14),
     breaks = 14,
     col = c("red", "red", "red", "orange", "orange", "orange", "yellow",
             "yellow", "green", "green", "green", "cyan", "cyan", "blue",
             "blue"))

mean(gw$ph, na.rm = TRUE)
## [1] 7.155658
sd(gw$ph, na.rm = TRUE)
## [1] 0.8856095

The pH appears to be mostly normally distributed. The mean pH of the groundwater in this dataset is about 7.16, which is slightly above 7, the pH of pure water.

What is the true mean pH, at a 95% confidence level?

Since the groundwater dataset contains so many observations, the confidence interval for the data became very narrow. As a result, I took a random sample of 100 in order to calculate a confidence interval.

set.seed(4)
samp1 <- sample(gw$ph, 100)
hist(samp1, main="pH sample", col= "honeydew3")

The histogram of samp1 shows that this sample is mostly normally distributed. In addition, the sample is large, was randomly selected, and the observations are independent of each other. Therefore, the conditions to use the t-distribution have been satisfied.

mean(samp1, na.rm= TRUE)
## [1] 7.197826
sd(samp1, na.rm= TRUE)
## [1] 0.7676106
tsum.test(7.197826, 0.7676106, 100)
## Warning in tsum.test(7.197826, 0.7676106, 100): argument 'var.equal' ignored for
## one-sample test.
## 
##  One-sample t-Test
## 
## data:  Summarized x
## t = 93.769, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
##  7.045515 7.350137
## sample estimates:
## mean of x 
##  7.197826

We are 95% confident that the true mean pH of groundwater is between 7.046 and 7.350.
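
As a cross-check on the summary-statistics approach, the interval can also be computed straight from the sampled values with base R's `t.test()`, which takes the mean, standard deviation, and sample size from the data. The sketch below uses simulated values (`ph_demo` is not the real `gw$ph`) and drops missing values before sampling so that the sample really contains 100 measurements.

```r
set.seed(4)
# Simulated pH values with some missing entries; not the real gw$ph
ph_demo <- c(rnorm(500, mean = 7.2, sd = 0.9), rep(NA, 50))

# Drop missing values first so all 100 sampled entries are usable
samp_demo <- sample(ph_demo[!is.na(ph_demo)], 100)

# t.test() computes the mean, sd, and n from the data itself
ci <- t.test(samp_demo, conf.level = 0.95)$conf.int
ci
```

With the real data, the equivalent call would be `t.test(samp1)`, which ignores any NA values in the sample and uses the number of non-missing values as n.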


Manganese

Manganese (mn) is a metal that, in small concentrations, is necessary for the function of humans and other organisms. Too much, however, has been linked to adverse health effects. According to the EPA, there is data showing that inhaling too much manganese can cause neurological issues in humans. Though there have not been long-term studies about the effects of this element on humans when ingested through water, a short-term study showed that ingesting water with manganese caused mental issues. Despite this, there is not enough information to generalize this claim. Also, water containing manganese tends to create black stains on objects that it comes into contact with, which is unpleasant to the eye. Large concentrations of manganese are associated with acidic water.

Plot and Summary Statistics:

hist(gw$mn, main="Concentration of Manganese", col = "honeydew3")

mean(gw$mn, na.rm = TRUE)
## [1] 75.58737
sd(gw$mn, na.rm = TRUE)
## [1] 160.9267

The distribution of manganese content is heavily right skewed.

gw_avgmn <- gw %>%
  group_by(year) %>%
  mutate(avg_mn = mean(mn, na.rm = TRUE))
gw_avgmn
## # A tibble: 38,103 x 25
## # Groups:   year [29]
##    ï..usgs_id aquifer state  year  time lith  w_type depth    o2    ph  temp
##         <dbl> <chr>   <chr> <int> <int> <chr> <chr>  <chr> <dbl> <dbl> <dbl>
##  1    3.54e14 Ada-Va~ OK     1992  1900 sand~ Publi~ "82"    6.2   7.6  17.2
##  2    3.54e14 Ada-Va~ OK     1992  1705 sand~ Monit~ "70"   NA     7.1  19.9
##  3    3.54e14 Ada-Va~ OK     1992  1900 sand~ Monit~ "47"   NA     8.1  NA  
##  4    3.62e14 Ada-Va~ OK     1997  1955 sand~ Domes~ "46"   NA     7.3  NA  
##  5    3.62e14 Ada-Va~ OK     1997  1755 sand~ Domes~ ""     NA     6.4  NA  
##  6    3.62e14 Ada-Va~ OK     1997  1855 sand~ Domes~ "24"   NA     7.1  NA  
##  7    3.62e14 Ada-Va~ OK     1997  1645 sand~ Domes~ "61"   NA     8    NA  
##  8    3.62e14 Ada-Va~ OK     1997  1840 sand~ Domes~ ""     NA     7.5  13.3
##  9    3.62e14 Ada-Va~ OK     1997  1720 sand~ Domes~ "38"   NA     8.2  16.2
## 10    3.62e14 Ada-Va~ OK     1997  1640 sand~ Domes~ "34"   NA     8.3  NA  
## # ... with 38,093 more rows, and 14 more variables: ca <dbl>, mg <dbl>,
## #   k <dbl>, na <chr>, alk <chr>, cl <dbl>, f <dbl>, sio2 <dbl>, so4 <dbl>,
## #   al <chr>, fe <chr>, mn <dbl>, tds <dbl>, avg_mn <dbl>
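
Note that `mutate()` keeps all 38,103 rows, so each yearly average is repeated once for every observation in that year. A model fitted to this tibble treats those repeated values as separate data points. The sketch below shows one possible alternative that keeps a single row per year before fitting; it uses simulated data and base R's `aggregate()` so that it runs without extra packages (`summarise()` from dplyr would do the same job). The names `demo`, `yearly`, and `fit_yearly` are made up for illustration.

```r
set.seed(1)
# Simulated stand-in for gw: 50 observations per year, some missing
demo <- data.frame(
  year = rep(1988:2016, each = 50),
  mn   = rexp(29 * 50, rate = 1 / 75)
)
demo$mn[seq(1, nrow(demo), by = 15)] <- NA

# One row per year; the formula interface drops rows with NA by default
yearly <- aggregate(mn ~ year, data = demo, FUN = mean)
names(yearly)[2] <- "avg_mn"

# Regression on 29 yearly means instead of every repeated row
fit_yearly <- lm(log(avg_mn) ~ year, data = yearly)
nrow(yearly)
df.residual(fit_yearly)
```

Fitted this way, the residual degrees of freedom reflect the number of years rather than the number of individual samples.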

Looking at the average concentrations of manganese for each year that this data was collected, it seems that there has been an exponential decrease in the concentrations of manganese in groundwater. This led me to create a log model to see if my observations from the initial plot were more meaningful than they seemed at first glance.

p1 <- gw_avgmn %>%
  ggplot(aes(year, avg_mn)) +
  geom_line(col="firebrick3", size = 1) + theme(panel.background = element_rect(fill= "honeydew3"))+
  geom_smooth()
p1
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

log.model <- lm(log(avg_mn) ~ year, gw_avgmn)
# Back-transform the fitted values from the log scale to the original units
log.model.df <- data.frame(x = gw_avgmn$year,
                           y = exp(fitted(log.model)))
# A smooth with formula y ~ exp(x) is not usable here: exp(year) overflows
# to Inf, so the back-transformed fit is drawn directly as the model curve.
p2 <- ggplot(log.model.df, aes(x = x, y = y)) +
  geom_point() +
  geom_line(aes(color = "Log Model"), size = 1, linetype = 2) +
  guides(color = guide_legend("Model Type"))
p2

log.model <-lm(log(avg_mn) ~ year, gw_avgmn)
summary(log.model)
## 
## Call:
## lm(formula = log(avg_mn) ~ year, data = gw_avgmn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.34138 -0.08913  0.01034  0.07695  0.26342 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.738e+01  1.537e-01   308.3   <2e-16 ***
## year        -2.154e-02  7.684e-05  -280.3   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1273 on 38101 degrees of freedom
## Multiple R-squared:  0.6734, Adjusted R-squared:  0.6734 
## F-statistic: 7.857e+04 on 1 and 38101 DF,  p-value: < 2.2e-16

This equation represents the output of this model:

log(avg_mn)-hat = −0.02154(year) + 47.38

Both the intercept and year have p-values that are practically 0, which shows that they are significant. Based on the Adjusted R-squared value, 67.34% of the variance in log(avg_mn) can be explained by this model.
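
Because the response is on the log scale, the slope is easier to read after back-transforming it. As a short worked example using only the slope reported in the model summary, a slope of b on the log scale corresponds to multiplying the average concentration by exp(b) for each additional year:

```r
# Slope taken from the summary(log.model) output
slope <- -0.02154

# On the log scale, a slope b multiplies avg_mn by exp(b) each year
yearly_factor <- exp(slope)
pct_change <- (yearly_factor - 1) * 100
round(pct_change, 2)  # -2.13
```

In other words, under this model the average manganese concentration falls by roughly 2.1% per year.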


Sulfate

Sulfates, which may enter groundwater sources through mine water and waste from factories, cause an unpleasant taste in groundwater. Digestive issues have also been reported by some whose water contains large amounts of these compounds.

Plot and Summary Statistics:

hist(gw$so4, main="Concentration of Sulfates", col = "honeydew3")

mean(gw$so4, na.rm = TRUE)
## [1] 81.49143
sd(gw$so4, na.rm = TRUE)
## [1] 146.1925

Chloride

Plot and Summary Statistics:

hist(gw$cl, main="Concentration of Chloride", col = "honeydew3")

mean(gw$cl, na.rm= TRUE)
## [1] 53.05933
sd(gw$cl, na.rm= TRUE)
## [1] 111.9861

According to the National Ground Water Association:

“A salty taste… may be detectable when the chloride exceeds 100 ppm.”

For how much of this data might you detect a salty taste in the water due to the concentration of chloride?

The concentrations of chloride are not normally distributed, so the normal distribution calculation below should be read as a rough approximation rather than an exact count.

1-pnorm(100, 53.059, 111.986 )
## [1] 0.3375465

Based on normal distribution calculations, an estimated 33.75% of the observations in this sample have chloride levels that exceed 100 ppm. People who drink from these groundwater sources may detect a salty taste.
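
For skewed data, the share of values above a threshold can also be counted directly and compared with the normal-curve estimate. The sketch below uses simulated right-skewed values (`cl_demo` is made up and is not the real `gw$cl`), so the numbers it prints say nothing about the actual dataset; it only illustrates that the two approaches can disagree.

```r
set.seed(2)
# Simulated right-skewed concentrations; not the real gw$cl values
cl_demo <- rlnorm(10000, meanlog = 3, sdlog = 1.2)

# Share of values above 100, counted directly from the data
empirical <- mean(cl_demo > 100, na.rm = TRUE)

# Share above 100 implied by a normal curve with the same mean and sd
normal_approx <- 1 - pnorm(100, mean(cl_demo), sd(cl_demo))

c(empirical = empirical, normal_approx = normal_approx)
```

With the real data, the direct count would be `mean(gw$cl > 100, na.rm = TRUE)`.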


Total Dissolved Solids

According to the World Health Organization, total dissolved solids are salts and some organic matter that are present in water. This includes calcium, magnesium, potassium and sulfates, along with other elements and compounds. Therefore, many of the elements that were studied in this groundwater dataset make up the concentration of total dissolved solids. Like other contents of water, total dissolved solids can impact how water tastes.

Plot and Summary Statistics:

hist(gw$tds, main= "Total Dissolved Solids", col = "honeydew3")

mean(gw$tds, na.rm = TRUE)
## [1] 326.2441
sd(gw$tds, na.rm = TRUE)
## [1] 218.6673

The total dissolved solids content is slightly right skewed.

p3 <- gw %>%
  group_by(year) %>%
  mutate(avg_tds = mean(tds, na.rm = TRUE)) %>%
  ggplot(aes(year, avg_tds)) +
  geom_line(col="firebrick3", size = 1) + theme(panel.background = element_rect(fill= "honeydew3"))
p3

Based on the time series, there is a slight upward trend in the amount of total dissolved solids from the groundwater sample. However, is this trend meaningful enough for me to perform a successful linear regression?

First, I created a data frame that contains the variable avg_tds which represents the mean total dissolved solids, grouped by year.

gw_avgtds <- gw %>%
  group_by(year) %>%
  mutate(avg_tds = mean(tds, na.rm = TRUE))

Then, I created the linear regression model and plotted it.

m1 <- lm(gw_avgtds$avg_tds ~ gw$year, data = gw_avgtds)

Are the conditions for linear regression satisfied?

plot(m1)

hist(m1$residuals)

  1. Linearity: This model is somewhat linear, though this condition is loosely met.

  2. Normal Residuals: Based on the histogram, the residuals are mostly normal.

  3. Constant Variance: Condition is met.

plot(gw_avgtds$avg_tds ~ gw$year)
abline(m1, col="darkturquoise")

summary(m1)
## 
## Call:
## lm(formula = gw_avgtds$avg_tds ~ gw$year, data = gw_avgtds)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -64.679 -17.569   3.965  14.507  66.341 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5.709e+03  3.531e+01  -161.7   <2e-16 ***
## gw$year      3.016e+00  1.766e-02   170.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 29.24 on 38101 degrees of freedom
## Multiple R-squared:  0.4337, Adjusted R-squared:  0.4336 
## F-statistic: 2.917e+04 on 1 and 38101 DF,  p-value: < 2.2e-16

This equation represents the output of this model:

avg_tds-hat = 3.016(year) − 5709

Both the intercept and year have p-values that are practically 0, which shows that they are significant. Based on the Adjusted R-squared value, 43.36% of the variance can be explained by this model. The Adjusted R-squared is rather low, meaning that a linear model may not be suited for this data. This is corroborated by the plot of the linear regression, which shows that the points are scattered around the regression line instead of lying close to it.

Since the conditions for a linear regression were hardly met and the plot of avg_tds was not fully linear to begin with, I was not surprised that this model is not appropriate.

According to the National Ground Water Association:

“Water containing more than 1,000 ppm of dissolved solids is unsuitable for many purposes.”

How much of the water sample exceeds the suitable concentration of total dissolved solids?

Since the data is in mg/L I had to convert this to ppm in order to do calculations. However, this did not pose an issue because 1 ppm = 1 mg/L.

1-pnorm(1000, 326.2435, 218.6589)
## [1] 0.00103045

A very small share of the sample (about 0.1%) has a TDS concentration that is greater than 1000 ppm, meaning that most of the samples have a suitable concentration of TDS. This estimate is based on normal distribution calculations, so it is only approximate for right-skewed data like these.

Above, I explained that many of the variables in this dataset make up the concentration of total dissolved solids. Upon learning this I wondered:

Would variables such as mg, cl, k, and so4 be predictors of tds?

To answer this, I created a multiple regression model that relates all of these variables.

fit5 <- lm(tds ~ mg + cl + k + so4, data= gw)
summary(fit5)
## 
## Call:
## lm(formula = tds ~ mg + cl + k + so4, data = gw)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1250.51   -60.26    -8.96    45.43   746.52 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 121.47746    1.06833  113.71   <2e-16 ***
## mg            4.55815    0.05077   89.78   <2e-16 ***
## cl            1.69713    0.01292  131.36   <2e-16 ***
## k             5.19786    0.15859   32.78   <2e-16 ***
## so4           1.37674    0.01030  133.68   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 92.92 on 19715 degrees of freedom
##   (18383 observations deleted due to missingness)
## Multiple R-squared:  0.8176, Adjusted R-squared:  0.8175 
## F-statistic: 2.209e+04 on 4 and 19715 DF,  p-value: < 2.2e-16

This equation represents the output of this model:

tds-hat = 4.558(mg) + 1.697(cl) + 5.198(k) + 1.377(so4) + 121.477

The p-values are very small, showing that all of the predictors are significant. The Adjusted R-Squared is rather large, showing that 81.75% of the variance can be explained by this model. There appears to be an association between tds and many of the variables from the dataset.
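
To illustrate how the fitted equation is used, the sketch below plugs a hypothetical water sample into the coefficients from the model summary. The concentrations in `sample_mgL` are made up for illustration and do not come from the dataset.

```r
# Coefficients copied from the summary(fit5) output
coefs <- c(intercept = 121.47746, mg = 4.55815, cl = 1.69713,
           k = 5.19786, so4 = 1.37674)

# Hypothetical sample, concentrations in mg/L (made up for illustration)
sample_mgL <- c(mg = 10, cl = 20, k = 2, so4 = 30)

# Intercept plus each coefficient times its concentration
pred_tds <- coefs[["intercept"]] + sum(coefs[names(sample_mgL)] * sample_mgL)
round(pred_tds, 1)  # 252.7
```

With the fitted model object available, `predict(fit5, newdata = data.frame(mg = 10, cl = 20, k = 2, so4 = 30))` should give the same value up to rounding.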


Multiple Regression

Now that I have explored the variables on their own, how do they all work together?

Is there a relationship between the concentrations of manganese, chloride, total dissolved solids, and sulfates in groundwater and the water’s pH? To explore this, I created a multiple regression model using these four predictors.

fit3 <- lm(ph ~ mn + cl + tds + so4, data= gw)
summary(fit3)
## 
## Call:
## lm(formula = ph ~ mn + cl + tds + so4, data = gw)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5637 -0.4372  0.0271  0.4906  3.8562 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.683e+00  1.240e-02  538.78   <2e-16 ***
## mn          -8.722e-04  4.028e-05  -21.66   <2e-16 ***
## cl          -4.741e-03  1.535e-04  -30.89   <2e-16 ***
## tds          2.502e-03  5.408e-05   46.27   <2e-16 ***
## so4         -2.718e-03  1.310e-04  -20.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8094 on 18579 degrees of freedom
##   (19519 observations deleted due to missingness)
## Multiple R-squared:  0.1351, Adjusted R-squared:  0.1349 
## F-statistic: 725.5 on 4 and 18579 DF,  p-value: < 2.2e-16

This equation represents the output of this model:

ph-hat = -0.0009(mn) - 0.0047(cl) + 0.0025(tds) - 0.0027(so4) + 6.683

Every variable in this model is significant based on the low p-values that are practically 0. However, the Adjusted R-Squared is low, showing that only 13.49% of the variance can be explained by this model. Since the R-Squared value is this low, this model may not be appropriate.
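
A very small p-value alongside a low R-squared is not a contradiction. With many observations, even a weak relationship can be detected with high confidence while still explaining little of the variation. The simulation below is a sketch with made-up data (it does not use `gw`), in which `x` has only a small true effect on `y`.

```r
set.seed(3)
n <- 20000
x <- rnorm(n)
# Weak true relationship: x explains only a tiny part of y
y <- 0.05 * x + rnorm(n)

s <- summary(lm(y ~ x))
p_value <- s$coefficients["x", "Pr(>|t|)"]
r_squared <- s$r.squared
c(p_value = p_value, r_squared = r_squared)
```

The p-value answers whether an effect is distinguishable from zero, while R-squared describes how much of the response the model accounts for, so the two can point in different directions on a large dataset.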

plot(fit3)

Since the first model was not appropriate, I tried a Generalized Additive Model. In this model, the predictors are smoothed through s().

fit4 <- gam(ph ~ s(mn) + s(cl) + s(tds) + s(so4), data = gw)
summary(fit4)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## ph ~ s(mn) + s(cl) + s(tds) + s(so4)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 7.161386   0.005282    1356   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##          edf Ref.df      F p-value    
## s(mn)  8.031  8.709  92.73  <2e-16 ***
## s(cl)  8.883  8.995 116.22  <2e-16 ***
## s(tds) 8.780  8.983 715.47  <2e-16 ***
## s(so4) 8.538  8.923  19.26  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.315   Deviance explained = 31.7%
## GCV = 0.51944  Scale est. = 0.51845   n = 18584

The summary of this model shows that every smooth term is still significant. The Adjusted R-Squared increased from 13.49% to 31.5%, so 31.5% of the variance is explained by this model. Though the value increased, it is still low, meaning that this model may not be appropriate.

Conclusion

Based on the statistical analysis I performed, there appears to be a relationship between the pH of groundwater and its contents. This did not come as a surprise, since the pH of pure water (H2O) is 7, meaning that the pH is changed through the addition of other substances.

Important Findings

Confidence interval for mean pH:

  • We are 95% confident that the true mean pH of groundwater is between 7.046 and 7.350.

  • The pH of groundwater is close to neutral though it may be slightly basic.

Results from the log model for concentrations of avg_mn:

  • The p-value from this model was <2e-16, which is very small and practically 0. This shows that there is some significance.

  • The Adjusted R-Squared revealed that 67.34% of the variance can be explained by the model.

Results from the linear regression for concentrations of avg_tds:

  • A linear regression model was not suitable based on the model I created.

  • The p-value from this model was <2e-16, which is very small and practically 0. This shows that there is some significance.

  • The Adjusted R-Squared revealed that 43.36% of the variance can be explained by this model.

Results from the multiple regression between tds and various substances found in groundwater:

  • The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.

  • The Adjusted R-Squared revealed that 81.75% of the variance can be explained by this model.

Results from the multiple regression between pH and various substances found in groundwater:

  • The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.

  • The Adjusted R-Squared revealed that 13.49% of the variance can be explained by this model.

Results from the generalized additive model:

  • The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.

  • The Adjusted R-Squared revealed that 31.5% of the variance is explained by this model.

Implications of the Results

I do not think my results are that surprising. I assumed that the contents of water would impact its pH from the beginning but I wanted to see if statistical calculations would back this up. I do not think the analysis I did was the best way to show the association between pH and water contents but I feel that if I played with the models more, I would get a better R-Squared.

The pH of most groundwater is very close to neutral, though the existence of other elements, minerals, and compounds in water may impact this measurement.

Opinions on the Statistical Analysis

I have to admit, I am very impressed with the work that I have done. I never thought that I would learn to and understand how to code. Also, I had so much fun researching this topic that this project hardly felt like work.

I feel that my analysis went well, though I wish I had more ideas about what to study in this dataset. There is lots of great data to be explored, and if I had more time I would really like to explore more of my variables. My dataset has a lot of quantitative variables, which made it very easy to create confidence intervals and regression models. My only complaint about the data is that there are lots of NAs in many of the variables. I think it would have been very interesting to have more data to work with for the concentrations of various elements.

Also, I really struggled to figure out what direction to take my project. That being said, I think I had a good mix of forms of analysis and used concepts from different units of the course. I wish I had worked with more of the categorical variables, but there were so many quantitative values to play with I never got to the categorical ones.

One question I have about my statistical analysis is, why was every p-value in my various regression models almost zero? It seems odd that there was no variation in my p-values. Was everything really significant or is there some sort of issue with my analysis? This is my biggest concern with my work.

There is much more to explore and lots of questions that could be created from this data. I chose to focus on pH since there were values for most of the observations; however, there are a plethora of other variables that are ready to explore.

Overall, I think this project shows that I have learned a lot since the beginning of class. I used techniques I did not understand in MATH 117 and used R for everything. I was finally able to understand and apply the code from our labs, and I am proud that I used the program effectively. Though the code may be difficult and errors will come up, you can create a lot of cool things in R. Even if I do not use it after this class, I know learning to code will be valuable for my future in STEM. And who knows? Maybe I will make more pretty histograms and plots because I really enjoyed creating them.

References

Made possible from lectures by:

Dr. Dana Felice

Dr. Dennis Coskren

and of course,

Professor Saidi for helping with the code.