library(tidyverse)
## -- Attaching packages ----------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts -------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(BSDA)
## Warning: package 'BSDA' was built under R version 4.0.3
## Loading required package: lattice
##
## Attaching package: 'BSDA'
## The following object is masked from 'package:datasets':
##
## Orange
library(mgcv)
## Loading required package: nlme
##
## Attaching package: 'nlme'
## The following objects are masked from 'package:BSDA':
##
## Gasoline, Wheat
## The following object is masked from 'package:dplyr':
##
## collapse
## This is mgcv 1.8-31. For overview type 'help("mgcv-package")'.
The majority of the water on Earth is found in oceans and is unusable by humans. Thus, humans must look to other sources for this life-sustaining liquid. Most of the world’s freshwater is found in glaciers, which are equally difficult to harness. As a result, humans must turn to freshwater found in underground reserves called aquifers. These features of the subsurface Earth are important not only to the environment and the field of geology, but also to the worldwide economy, as they can provide water for agriculture in areas where surface water is scarce. They make up 22% of the Earth’s freshwater and are replenished when water enters at recharge zones. Aquifers are classified by the type of rock that they are composed of, such as porous rock, gravel, or other broken rock. Since aquifers are so far underground, their reserve of water is usually not depleted by evaporation and, for the most part, is free from pollution. However, the overextraction of water for both residential and agricultural use can cause issues such as saltwater intrusion, where ocean water infiltrates the rock and compromises the freshwater in the aquifer. It is also common for gas from underground tanks or sewage from septic tanks to contaminate nearby groundwater. Therefore, it is imperative that scientists monitor the health of aquifers, especially as humans use more and more groundwater.
Not only does waste created by humans contaminate groundwater, but rocks that are water soluble can also dissolve in groundwater, introducing foreign minerals. A calcium-rich limestone, for example, may cause a higher calcium content in water. This explains why this dataset has a large variety of substances that can be found in groundwater, since there are many types of rocks that make up aquifers.
My interest in groundwater began while I was taking Environmental Science at Montgomery College in 2019. I never realized that there were massive reserves of water underground that so many people rely on. Then I realized that members of my family, who live in rural New York, rely on wells for all their water. Their water is contaminated with sulfur, causing it to smell like eggs, yet their next-door neighbors have uncontaminated artesian well water flowing through every faucet. I am currently taking a Geology class and am hoping to continue at a four-year university so I can study groundwater in the future.
11,032 wells were sampled, though some were sampled multiple times. This data comes from untreated wells in the United States.
gw <- read.csv("C:/users/maddie/desktop/maddie's trashcan/biostat/gw.csv", header= TRUE)
This dataset is a collection of 38,105 observations of 24 variables relating to groundwater. The data comes from the U.S. Geological Survey National Water Information System database and is entirely observational. Observations are from the years 1988 to 2017.
There is a strict set of procedures that the United States Geological Survey follows when collecting and testing water samples. These procedures are outlined in the National Field Manual for the Collection of Water-Quality Data and are updated when changes to protocol are made.
o2: Levels of Dissolved Oxygen in water, in milligrams per liter (mg/L)
temp: Water Temperature, in Celsius (°C)
depth: Depth of the Well, in Meters (m)
tds: Total Dissolved Solids, in milligrams per liter (mg/L)
concentrations of:
table(gw$lith)
##
## carbonate crystalline
## 753 5 3320 2208
## glacial sandstone semi-consolidated shale
## 6605 3892 4538 433
## ss-carbonate unconsolidated volcanics
## 384 14540 1425
table(gw$state)
##
## AL AR AZ CA CO CT DC DE FL GA IA ID IL IN KS KY
## 204 343 1724 5118 1238 112 31 335 727 550 748 2631 303 373 632 42
## LA MA MD ME MI MN MO MS MT NC ND NE NH NJ NM NV
## 820 523 1219 69 363 748 554 291 861 644 505 1249 197 1415 404 1075
## NY OH OK OR PA RI SC SD TN TX UT VA VT WA WI WV
## 1158 442 533 235 1310 24 288 406 762 1269 1366 850 31 1977 283 463
## WY
## 658
aquifer: Name of the aquifer that was sampled
w_type: Type of well that was sampled
table(gw$w_type)
##
## Commercial Domestic Industrial Irrigation Monitoring
## 262 11032 295 2181 13958
## Other Public supply Stock Unknown
## 1227 5407 978 2763
table(gw$year)
##
## 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003
## 1812 2230 1655 1757 1567 2293 1968 1877 1261 1490 1546 1191 1391 773 971 575
## 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 637 917 1124 1154 1190 914 1314 1207 1242 1499 1784 405 359
mean pH, total dissolved solids, manganese, chloride, and sulfates
All of these variables are quantitative and I will use them to create plots, confidence intervals and regression models in order to answer my overarching question.
Replaced all the < 10 values for the variable FE with 5. Since 5 is less than 10 and halfway between 0 and 10, this is a fair replacement.
Replaced all the < 4 values for the variable MN with 2. Since 2 is less than 4 and halfway between 0 and 4, this is a fair replacement.
Replaced all the < 0.1 values for variable F with 0.05. Since 0.05 is less than 0.1 and halfway between 0 and 0.1, this is a fair replacement.
Removed the month and day from the date variable to allow for easier creation of categorical tables.
Replaced the “other aquifer” and “other aquifers” fields with “other” for consistency and ease.
Converted variables classified as “characters” to “numbers” to allow for calculations.
Removed the 2 values from the year 2017, since they contained outliers that were causing difficulties with my data analysis.
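The replacement of below-detection-limit values described above can be sketched in R. This is only an illustration of the idea, not the exact steps used to prepare gw.csv: the vector fe_raw and the helper replace_censored() are hypothetical names, and the values are made up.

```r
# Hypothetical raw column: values below the detection limit are stored as text
fe_raw <- c("12", "< 10", "35", "<10", NA)

# Replace any entry that starts with "<" by half of the detection limit,
# then convert the whole column to numeric
replace_censored <- function(x, limit) {
  is_censored <- !is.na(x) & grepl("^\\s*<", x)
  x[is_censored] <- limit / 2
  as.numeric(x)
}

fe_clean <- replace_censored(fe_raw, limit = 10)
fe_clean
```

Handling the "<" entries before calling as.numeric() avoids turning them into NA by coercion.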
gw$mn <- as.numeric(gw$mn)
## Warning: NAs introduced by coercion
gw$tds <- as.numeric(gw$tds)
## Warning: NAs introduced by coercion
gw$so4 <- as.numeric(gw$so4)
## Warning: NAs introduced by coercion
gw$cl <- as.numeric(gw$cl)
## Warning: NAs introduced by coercion
gw$ca <- as.numeric(gw$ca)
## Warning: NAs introduced by coercion
gw$mg <- as.numeric(gw$mg)
## Warning: NAs introduced by coercion
gw$k <- as.numeric(gw$k)
## Warning: NAs introduced by coercion
I will focus on answering this question:
How do various elements affect the pH of groundwater?
I would like to study the parameter mean pH as it relates to the concentrations of manganese, chloride, total dissolved solids, and sulfates.
I will look into how these specific variables have changed over time as well, in order to identify trends that could be meaningful.
Plot and Summary Statistics:
hist(gw$ph, main="pH of Groundwater", xlab="pH", xlim = c(0,14),
breaks= 14, col = c("red", "red", "red", "orange", "orange", "orange","yellow", "yellow", "green", "green", "green", "cyan", "cyan", "blue", "blue"))
mean(gw$ph, na.rm = TRUE)
## [1] 7.155658
sd(gw$ph, na.rm = TRUE)
## [1] 0.8856095
The pH appears to be mostly normally distributed. The mean pH of the groundwater in this dataset is 7.155658, which is slightly above the pH of pure water, which is 7.
What is the true mean pH, at a 95% confidence level?
Since the groundwater dataset contains so many observations, the confidence interval for the data became very narrow. As a result, I took a random sample of 100 in order to calculate a confidence interval.
set.seed(4)
samp1 <- sample(gw$ph, 100)
hist(samp1, main="pH sample", col= "honeydew3")
The histogram of samp1 shows that this sample is mostly normally distributed. Also, the sample is large, was randomly selected, and the observations are independent of each other. Therefore, the conditions to use the t-distribution have been satisfied.
mean(samp1, na.rm= TRUE)
## [1] 7.197826
sd(samp1, na.rm= TRUE)
## [1] 0.7676106
tsum.test(7.197826, 0.7676106, 100)
## Warning in tsum.test(7.197826, 0.7676106, 100): argument 'var.equal' ignored for
## one-sample test.
##
## One-sample t-Test
##
## data: Summarized x
## t = 93.769, df = 99, p-value < 2.2e-16
## alternative hypothesis: true mean is not equal to 0
## 95 percent confidence interval:
## 7.045515 7.350137
## sample estimates:
## mean of x
## 7.197826
We are 95% confident that the true mean pH of groundwater is between 7.046 and 7.350.
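As a check on where this interval comes from, the same bounds can be computed by hand with the one-sample t formula, mean ± t* × s / sqrt(n). The sketch below uses only the sample mean, standard deviation, and sample size reported above; the variable names are my own.

```r
xbar <- 7.197826   # sample mean of samp1
s    <- 0.7676106  # sample standard deviation of samp1
n    <- 100        # sample size

se     <- s / sqrt(n)            # standard error of the mean
t_crit <- qt(0.975, df = n - 1)  # critical value for 95% confidence
ci     <- xbar + c(-1, 1) * t_crit * se
round(ci, 3)
```

The result should agree with the interval reported by tsum.test().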
Manganese (mn) is a metal that, in small concentrations, is necessary for the function of humans and other organisms. Too much, however, has caused adverse health effects to some. According to the EPA, there is data showing that inhaling too much manganese can cause neurological issues in humans. Though there have not been long-term studies about the effects of this element on humans when ingested through water, a short-term study showed that ingesting water with manganese caused mental issues. Despite this there is not enough information to generalize this claim. Also, water containing manganese tends to create black stains on objects that it comes into contact with which is unpleasant to the eye. Large concentrations of manganese are associated with acidic water.
Plot and Summary Statistics:
hist(gw$mn, main="Concentration of Manganese", col = "honeydew3")
mean(gw$mn, na.rm = TRUE)
## [1] 75.58737
sd(gw$mn, na.rm = TRUE)
## [1] 160.9267
The distribution of manganese content is heavily right skewed.
gw_avgmn <- gw %>%
group_by(year) %>%
mutate(avg_mn = mean(mn, na.rm = TRUE))
gw_avgmn
## # A tibble: 38,103 x 25
## # Groups: year [29]
## ï..usgs_id aquifer state year time lith w_type depth o2 ph temp
## <dbl> <chr> <chr> <int> <int> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 3.54e14 Ada-Va~ OK 1992 1900 sand~ Publi~ "82" 6.2 7.6 17.2
## 2 3.54e14 Ada-Va~ OK 1992 1705 sand~ Monit~ "70" NA 7.1 19.9
## 3 3.54e14 Ada-Va~ OK 1992 1900 sand~ Monit~ "47" NA 8.1 NA
## 4 3.62e14 Ada-Va~ OK 1997 1955 sand~ Domes~ "46" NA 7.3 NA
## 5 3.62e14 Ada-Va~ OK 1997 1755 sand~ Domes~ "" NA 6.4 NA
## 6 3.62e14 Ada-Va~ OK 1997 1855 sand~ Domes~ "24" NA 7.1 NA
## 7 3.62e14 Ada-Va~ OK 1997 1645 sand~ Domes~ "61" NA 8 NA
## 8 3.62e14 Ada-Va~ OK 1997 1840 sand~ Domes~ "" NA 7.5 13.3
## 9 3.62e14 Ada-Va~ OK 1997 1720 sand~ Domes~ "38" NA 8.2 16.2
## 10 3.62e14 Ada-Va~ OK 1997 1640 sand~ Domes~ "34" NA 8.3 NA
## # ... with 38,093 more rows, and 14 more variables: ca <dbl>, mg <dbl>,
## # k <dbl>, na <chr>, alk <chr>, cl <dbl>, f <dbl>, sio2 <dbl>, so4 <dbl>,
## # al <chr>, fe <chr>, mn <dbl>, tds <dbl>, avg_mn <dbl>
Looking at the average concentrations of manganese for each year that this data was collected, it seems that there has been an exponential decrease in the concentrations of manganese in groundwater. This led me to create a log model to see if my observations from the initial plot were more meaningful than they seemed at first glance.
p1 <- gw_avgmn %>%
ggplot(aes(year, avg_mn)) +
geom_line(col="firebrick3", size = 1) + theme(panel.background = element_rect(fill= "honeydew3"))+
geom_smooth()
p1
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
log.model <-lm(log(avg_mn) ~ year, gw_avgmn)
log.model.df <- data.frame(x = gw_avgmn$year,
y = exp(fitted(log.model)))
# The exponential smoother (formula = y ~ exp(x)) is omitted: exp(year) overflows
# to Inf for years near 2000, so stat_smooth() could not be computed.
p2 <- ggplot(log.model.df, aes(x=x, y=y)) +
geom_point() +
geom_line(aes(color = "Log Model"), size = 1, linetype = 2) +
guides(color = guide_legend("Model Type"))
p2
log.model <-lm(log(avg_mn) ~ year, gw_avgmn)
summary(log.model)
##
## Call:
## lm(formula = log(avg_mn) ~ year, data = gw_avgmn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34138 -0.08913 0.01034 0.07695 0.26342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.738e+01 1.537e-01 308.3 <2e-16 ***
## year -2.154e-02 7.684e-05 -280.3 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1273 on 38101 degrees of freedom
## Multiple R-squared: 0.6734, Adjusted R-squared: 0.6734
## F-statistic: 7.857e+04 on 1 and 38101 DF, p-value: < 2.2e-16
This equation represents the output of this model:
log(avg_mn)-hat = −0.02154(year) + 47.38
Both the intercept and the coefficient for year have p-values that are practically 0, which shows that they are significant. Based on the Adjusted R-squared value, 67.34% of the variance in log(avg_mn) can be explained by this model.
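Because the model was fit to log(avg_mn), the slope is on the log scale. Exponentiating it gives the multiplicative change in average manganese per year. The short sketch below uses only the slope reported in summary(log.model); the variable names are my own.

```r
slope <- -0.02154  # coefficient for year from summary(log.model)

yearly_factor <- exp(slope)                # multiplicative change per year
pct_change    <- (yearly_factor - 1) * 100 # same change as a percentage
round(pct_change, 2)
```

In other words, the fitted model corresponds to a decrease of roughly 2% per year in average manganese concentration.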
Sulfates, which may enter groundwater sources through mine water and waste from factories, cause an unpleasant taste in groundwater. Digestive issues have also been reported by some whose water contains large amounts of these compounds.
Plot and Summary Statistics:
hist(gw$so4, main="Concentration of Sulfates", col = "honeydew3")
mean(gw$so4, na.rm = TRUE)
## [1] 81.49143
sd(gw$so4, na.rm = TRUE)
## [1] 146.1925
Plot and Summary Statistics:
hist(gw$cl, main="Concentration of Chloride", col = "honeydew3")
mean(gw$cl, na.rm= TRUE)
## [1] 53.05933
sd(gw$cl, na.rm= TRUE)
## [1] 111.9861
“A salty taste… may be detectable when the chloride exceeds 100 ppm.”
For how much of this data might you detect a salty taste in the water due to the concentration of chloride?
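Though the concentrations of chloride are not normally distributed, I used normal distribution calculations as a rough approximation. A large sample size makes the sample mean approximately normal, but it does not make the individual concentrations normal, so this estimate should be treated with caution.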
1-pnorm(100, 53.059, 111.986 )
## [1] 0.3375465
Based on normal distribution calculations, an estimated 33.75% of the samples have chloride levels that exceed 100 ppm. People who drink from these groundwater sources may detect a salty taste.
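Since the chloride histogram is right skewed, a normal calculation may differ from the proportion counted directly from the data. The sketch below illustrates this with synthetic right-skewed values, not the gw data; the lognormal parameters are arbitrary assumptions chosen only for illustration.

```r
# Synthetic right-skewed concentrations (evenly spaced lognormal quantiles)
x <- qlnorm(ppoints(10000), meanlog = 3, sdlog = 1.2)

# Proportion above 100 counted directly from the values
empirical <- mean(x > 100)

# Proportion above 100 from a normal curve with the same mean and sd
normal_approx <- 1 - pnorm(100, mean(x), sd(x))

c(empirical = empirical, normal_approx = normal_approx)
```

For skewed data like this, the normal calculation overstates the share above the cutoff. With the real data, the direct count would be mean(gw$cl > 100, na.rm = TRUE).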
According to the World Health Organization, total dissolved solids are salts and some organic matter that are present in water. This includes calcium, magnesium, potassium and sulfates, along with other elements and compounds. Therefore, many of the elements that were studied in this groundwater dataset make up the concentration of total dissolved solids. Like other contents of water, total dissolved solids can impact how water tastes.
Plot and Summary Statistics:
hist(gw$tds, main= "Total Dissolved Solids", col = "honeydew3")
mean(gw$tds, na.rm = TRUE)
## [1] 326.2441
sd(gw$tds, na.rm = TRUE)
## [1] 218.6673
The total dissolved solids content is slightly right skewed.
p3 <- gw %>%
group_by(year) %>%
mutate(avg_tds = mean(tds, na.rm = TRUE)) %>%
ggplot(aes(year, avg_tds)) +
geom_line(col="firebrick3", size = 1) + theme(panel.background = element_rect(fill= "honeydew3"))
p3
Based on the time series, there is a slight upward trend in the amount of total dissolved solids from the groundwater sample. However, is this trend meaningful enough for me to perform a successful linear regression?
First, I created a data frame that contains the variable avg_tds which represents the mean total dissolved solids, grouped by year.
gw_avgtds <- gw %>%
group_by(year) %>%
mutate(avg_tds = mean(tds, na.rm = TRUE))
Then, I created the linear regression model and plotted it.
m1 <- lm(gw_avgtds$avg_tds ~ gw$year, data = gw_avgtds)
Are the conditions for linear regression satisfied?
plot(m1)
hist(m1$residuals)
Linearity: This model is somewhat linear, though this condition is loosely met.
Normal Residuals: Based on the histogram, the residuals are mostly normal.
Constant Variance: Condition is met.
plot(gw_avgtds$avg_tds ~ gw$year)
abline(m1, col="darkturquoise")
summary(m1)
##
## Call:
## lm(formula = gw_avgtds$avg_tds ~ gw$year, data = gw_avgtds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.679 -17.569 3.965 14.507 66.341
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.709e+03 3.531e+01 -161.7 <2e-16 ***
## gw$year 3.016e+00 1.766e-02 170.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 29.24 on 38101 degrees of freedom
## Multiple R-squared: 0.4337, Adjusted R-squared: 0.4336
## F-statistic: 2.917e+04 on 1 and 38101 DF, p-value: < 2.2e-16
This equation represents the output of this model:
avg_tds-hat = 3.016(year) − 5709
Both the intercept and the coefficient for year have p-values that are practically 0, which shows that they are significant. Based on the Adjusted R-squared value, 43.36% of the variance can be explained by this model. The Adjusted R-squared is rather low, meaning that a linear model may not be suited for this data. This is corroborated by the plot of the linear regression, which shows that the points are scattered around the regression line instead of being close to it.
Since the conditions for a linear regression were hardly met and the plot of avg_tds was not fully linear to begin with, I was not surprised that this model is not appropriate.
“Water containing more than 1,000 ppm of dissolved solids is unsuitable for many purposes.”
How much of the water sample exceeds the suitable concentration of total dissolved solids?
Since the data is in mg/L I had to convert this to ppm in order to do calculations. However, this did not pose an issue because 1 ppm = 1 mg/L.
1-pnorm(1000, 326.2435, 218.6589)
## [1] 0.00103045
A very small amount of the sample has a TDS concentration that is greater than 1000 ppm, meaning that most of the samples have a suitable concentration of TDS. This is based on normal distribution calculations, which I used as a rough approximation even though the TDS values are right skewed.
Above, I explained that many of the variables in this dataset make up the concentration of total dissolved solids. Upon learning this I wondered:
Would variables such as mg, cl, k, and so4 be predictors of tds?
To answer this, I created a multiple regression model that relates all of these variables.
fit5 <- lm(tds ~ mg + cl + k + so4, data= gw)
summary(fit5)
##
## Call:
## lm(formula = tds ~ mg + cl + k + so4, data = gw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1250.51 -60.26 -8.96 45.43 746.52
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 121.47746 1.06833 113.71 <2e-16 ***
## mg 4.55815 0.05077 89.78 <2e-16 ***
## cl 1.69713 0.01292 131.36 <2e-16 ***
## k 5.19786 0.15859 32.78 <2e-16 ***
## so4 1.37674 0.01030 133.68 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 92.92 on 19715 degrees of freedom
## (18383 observations deleted due to missingness)
## Multiple R-squared: 0.8176, Adjusted R-squared: 0.8175
## F-statistic: 2.209e+04 on 4 and 19715 DF, p-value: < 2.2e-16
This equation represents the output of this model:
tds-hat = 4.558(mg) + 1.697(cl) + 5.198(k) + 1.377(so4) + 121.477
The p-values are very small, showing that there is significance of all the variables. The Adjusted R-Squared is rather large, showing that 81.75% of the variance can be explained by this model. There appears to be an association between tds and many of the variables from the dataset.
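To show how the fitted equation is used, the sketch below plugs one set of concentrations into the coefficients from summary(fit5). The sample values are hypothetical and chosen only for illustration; they are not taken from the dataset.

```r
# Coefficients copied from summary(fit5)
coefs <- c(intercept = 121.47746, mg = 4.55815, cl = 1.69713,
           k = 5.19786, so4 = 1.37674)

# Hypothetical water sample, concentrations in mg/L (illustrative values only)
sample_water <- c(mg = 10, cl = 20, k = 2, so4 = 30)

# Predicted TDS: intercept plus each coefficient times its concentration
tds_hat <- coefs[["intercept"]] + sum(coefs[names(sample_water)] * sample_water)
tds_hat
```

With the fitted model object available, predict(fit5, newdata = ...) would give the same kind of prediction without copying coefficients by hand.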
Is there a relationship between the concentrations of manganese, chloride, total dissolved solids, and sulfates in groundwater and the water’s pH? To explore this, I created a multiple regression model using these four predictors.
fit3 <- lm(ph ~ mn + cl + tds + so4, data= gw)
summary(fit3)
##
## Call:
## lm(formula = ph ~ mn + cl + tds + so4, data = gw)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.5637 -0.4372 0.0271 0.4906 3.8562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.683e+00 1.240e-02 538.78 <2e-16 ***
## mn -8.722e-04 4.028e-05 -21.66 <2e-16 ***
## cl -4.741e-03 1.535e-04 -30.89 <2e-16 ***
## tds 2.502e-03 5.408e-05 46.27 <2e-16 ***
## so4 -2.718e-03 1.310e-04 -20.74 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8094 on 18579 degrees of freedom
## (19519 observations deleted due to missingness)
## Multiple R-squared: 0.1351, Adjusted R-squared: 0.1349
## F-statistic: 725.5 on 4 and 18579 DF, p-value: < 2.2e-16
This equation represents the output of this model:
ph-hat = -0.0009(mn) - 0.0047(cl) + 0.0025(tds) - 0.0027(so4) + 6.683
Every variable in this model is significant based on the low p-values that are practically 0. However, the Adjusted R-Squared is low, showing that only 13.49% of the variance can be explained by this model. Since the R-Squared value is this low, this model may not be appropriate.
plot(fit3)
Since the first model was not appropriate, I tried a Generalized Additive Model. In this model, the predictors are smoothed through s().
fit4 <- gam(ph ~ s(mn) + s(cl) + s(tds) + s(so4), data = gw)
summary(fit4)
##
## Family: gaussian
## Link function: identity
##
## Formula:
## ph ~ s(mn) + s(cl) + s(tds) + s(so4)
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.161386 0.005282 1356 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(mn) 8.031 8.709 92.73 <2e-16 ***
## s(cl) 8.883 8.995 116.22 <2e-16 ***
## s(tds) 8.780 8.983 715.47 <2e-16 ***
## s(so4) 8.538 8.923 19.26 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.315 Deviance explained = 31.7%
## GCV = 0.51944 Scale est. = 0.51845 n = 18584
The summary of this model shows that the p-values are still significant. The Adjusted R-Squared increased slightly, showing that 31.5% of the variance is explained by this model. Though the value increased, it is still low, meaning that this model may not be appropriate.
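To illustrate why a smoothed model can explain more variance than a straight-line model, the sketch below compares lm() and gam() on synthetic data with a curved relationship. The data are simulated and unrelated to the groundwater dataset; only the idea of smoothing a predictor with s() carries over.

```r
library(mgcv)

set.seed(42)
# Synthetic data (not the gw dataset): y depends on x through a curve
x <- runif(500, 0, 2 * pi)
y <- sin(x) + rnorm(500, sd = 0.3)
toy <- data.frame(x = x, y = y)

fit_lm  <- lm(y ~ x, data = toy)      # straight-line relationship
fit_gam <- gam(y ~ s(x), data = toy)  # smooth curve estimated from the data

r2_lm  <- summary(fit_lm)$adj.r.squared
r2_gam <- summary(fit_gam)$r.sq       # adjusted R-squared reported by mgcv
c(lm = r2_lm, gam = r2_gam)
```

When the true relationship is curved, the smooth term captures structure that a single slope cannot, which is consistent with the increase in Adjusted R-squared seen between fit3 and fit4.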
Based on the statistical analysis I performed, there appears to be a relationship between the pH of groundwater and its contents. This did not come as a surprise, since the pH of pure water (H2O) is 7, meaning that the pH is changed through the addition of other substances.
Confidence interval for mean pH:
We are 95% confident that the true mean pH of groundwater is between 7.046 and 7.350.
The pH of groundwater is close to neutral though it may be slightly basic.
Results from the log model for concentrations of avg_mn:
The p-value from this model was <2e-16, which is very small and practically 0. This shows that there is some significance.
The Adjusted R-Squared revealed that 67.34% of the variance can be explained by the model.
Results from the linear regression for concentrations of avg_tds:
A linear regression model was not suitable based on the model I created.
The p-value from this model was <2e-16, which is very small and practically 0. This shows that there is some significance.
The Adjusted R-Squared revealed that 43.36% of the variance can be explained by this model.
Results from the multiple regression between tds and various substances found in groundwater:
The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.
The Adjusted R-Squared revealed that 81.75% of the variance can be explained by this model.
Results from the multiple regression between pH and various substances found in groundwater:
The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.
The Adjusted R-Squared revealed that 13.49% of the variance can be explained by this model.
Results from the generalized additive model:
The p-value for every predictor in this model was <2e-16, which is very small and practically 0. This shows that there is some significance.
The Adjusted R-Squared revealed that 31.5% of the variance is explained by this model.
I do not think my results are that surprising. I assumed that the contents of water would impact its pH from the beginning but I wanted to see if statistical calculations would back this up. I do not think the analysis I did was the best way to show the association between pH and water contents but I feel that if I played with the models more, I would get a better R-Squared.
The pH of most groundwater is very close to neutral, though the existence of other elements, minerals, and compounds in water may impact this measurement.
I have to admit, I am very impressed with the work that I have done. I never thought that I would learn to and understand how to code. Also, I had so much fun researching this topic that this project hardly felt like work.
I feel that my analysis went well, though I wish I had more ideas about what to study in this dataset. There is lots of great data to be explored and if I had more time I would really like to explore more of my variables. My dataset has a lot of quantitative variables, which made it very easy to create confidence intervals and regression models. My only complaint about the data is that there are lots of NAs in many of the variables. I think it would have been very interesting to have more data to work with for the concentrations of various elements.
Also, I really struggled to figure out what direction to take my project. That being said, I think I had a good mix of forms of analysis and used concepts from different units of the course. I wish I had worked with more of the categorical variables, but there were so many quantitative values to play with I never got to the categorical ones.
One question I have about my statistical analysis is, why was every p-value in my various regression models almost zero? It seems odd that there was no variation in my p-values. Was everything really significant or is there some sort of issue with my analysis? This is my biggest concern with my work.
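One possible contributor, offered as a sketch rather than a diagnosis: the yearly-average models report 38,101 residual degrees of freedom, which suggests that each yearly mean appears once per observation (mutate() keeps every row) instead of once per year. Repeating a value adds no new information, but it shrinks the standard errors and therefore the p-values. The example below uses made-up yearly averages; the data frames yearly and repeated are hypothetical.

```r
set.seed(1)
years <- 1988:2016

# Hypothetical yearly averages: a weak downward trend plus noise
yearly <- data.frame(
  year = years,
  avg  = 100 - 0.5 * (years - 1988) + rnorm(length(years), sd = 10)
)

# Same averages, each repeated 1000 times (like mutate() after group_by())
repeated <- yearly[rep(seq_len(nrow(yearly)), each = 1000), ]

fit_yearly   <- lm(avg ~ year, data = yearly)
fit_repeated <- lm(avg ~ year, data = repeated)

# The slope is the same, but the standard error is far smaller when repeated
se_yearly   <- summary(fit_yearly)$coefficients["year", "Std. Error"]
se_repeated <- summary(fit_repeated)$coefficients["year", "Std. Error"]
c(se_yearly = se_yearly, se_repeated = se_repeated)
```

In dplyr, using summarise() in place of mutate() after group_by(year) gives one row per year, which is the version a yearly trend model would normally be fit to. The very large sample size also makes small effects significant in the multiple regression models.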
There is much more to explore, and many more questions could be asked of this data. I chose to focus on pH since there were values for most of the observations; however, there is a plethora of other variables ready to explore.
Overall, I think this project shows that I have learned a lot since the beginning of class. I used techniques I did not understand in MATH 117 and used R for everything. I was finally able to understand and apply the code from our labs and I am proud that I effectively used the program. Though the code may be difficult and errors will come up, you can create a lot of cool things in R. Even if I do not use it after this class, I know learning to code will be valuable for my future in STEM. And who knows? Maybe I will make more pretty histograms and plots because I really enjoyed creating them.
Made possible from lectures by:
Dr. Dana Felice
Dr. Dennis Coskren
and of course,
Professor Saidi for helping with the code.