Groundwater.csv is a huge dataset, spanning more than 38,000 observations, each of which records 24 different variables. The dataset tracks different aquifers over time, measuring the concentration of different elements, as well as the depth and location. The source is USGS, or US Geological Survey, at https://waterdata.usgs.gov/nwis/gw. Over the course of this project, I plan on exploring how climate change has affected aquifers, and how different states are disproportionately affected.
First, I will load the dataset, and check to see if it has all 50 states in it.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Warning: One or more parsing issues, call `problems()` on your data frame for details,
e.g.:
dat <- vroom(...)
problems(dat)
Rows: 38103 Columns: 24
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (7): aquifer, state, lith, w_type, alk, so4, al
dbl (9): usgs_id, year, time, o2, ph, temp, k, f, sio2
num (8): depth, ca, mg, na, cl, fe, mn, tds
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 49 × 1
state
<chr>
1 AL
2 AR
3 AZ
4 CA
5 CO
6 CT
7 DC
8 DE
9 FL
10 GA
# ℹ 39 more rows
The groundstates dataset only has 49 states in it - it is missing Alaska. This is fine, though, since alaska is outside of the continental US, and it’s exclusion won’t be too noticeable in the final visulization.
Next, I’m going to clean up the groundwater dataset so that it only has what I need, and so that it’s easier to set up some quick scatterplots to see what I’m dealing with.
groundclean <- groundwater |>#this gets rid of every column except for year, state, and one third thing (temp)select(year, temp, state) |>#this gets rid of all rows with nana.omit() |>#this sorts it by year primarily, and by state secondarilyarrange(year, state)#this is a quick & dirty plot to find where the interesting data isggplot(groundclean, aes(year, temp)) +geom_point() +geom_smooth()
`geom_smooth()` using method = 'gam' and formula = 'y ~ s(x, bs = "cs")'
So I looked at multiple different scatterplots, and the one that I thought told the most interesting story was the one that tracked temperate. From here, I’m going to put together the linear regression
lmAqTemp =lm(temp~year, data = groundclean)summary(lmAqTemp)
Call:
lm(formula = temp ~ year, data = groundclean)
Residuals:
Min 1Q Median 3Q Max
-15.615 -3.980 -1.021 3.353 81.856
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.643e+02 7.124e+00 -23.06 <2e-16 ***
year 9.028e-02 3.562e-03 25.35 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.71 on 35465 degrees of freedom
Multiple R-squared: 0.01779, Adjusted R-squared: 0.01777
F-statistic: 642.5 on 1 and 35465 DF, p-value: < 2.2e-16
As seen above, the p-value for this linear model is 2.2e-16, or, 0.00000000000000022. This means that the association is quite significant. The equation for the model is y = 0.09028x - 164.3, where x is the year and y is the approximated tempurature. The positive slope and very small p-value show a slow but steady increase in tempurature.
Finally, I am going to clean the data to prepare it for the final visulization.
groundsummary <- groundclean |>filter(state %in%c("AL", "AR", "KS", "KY", "NM", "OR", "TX", "UT")) |>#this takes each state's average tempurature for that yearsummarize(.by =c("state", "year"), temp =mean(temp)) |>#This puts the states back next to each otherarrange(state)ggplot(groundsummary) +geom_point(aes(year, temp, color = state)) +geom_line(aes(year, temp, color = state)) +geom_abline(slope = .09028,intercept =-164.3) +scale_color_brewer(palette ="Set3") +theme_grey() +labs(x ="Year",y ="Tempurature (Degrees Celcius)",caption ="US Geological Survey",title ="Average Aquifer Temperatures over Time",color ="Relevant States")
My graph shows the rise in aquifer temperatures over the years. This involved a very specific portion of a very large dataset, so it needed much of the dataset to be cleaned or filtered out. I knew from the start that I only wanted to be tracking one variable beyond state and time, so I was able to narrow the dataset down by a lot from the beginning. I took some time to look through different scatterplots to see what was most interesting before settling on temperature. After that, I filtered out all the rows with missing values. From there, I started making the final visulization. I realized that it was WAY too busy, so I spent some time cutting out the less interesting states before arriving at the final list. I also included the regression line from before. I had wanted to do a hex heatmap showing the difference in tempuratures from like 1990-2010, but there was too much missing data for that. The line chart that I ended up with was able to span the missing values, so that worked out.