rm(list = ls())
Note: This assignment was adapted from content developed by Professor Erin Hestir, Qian (2016), and Helsel et al. (2020).
In this assignment, we will focus on directory structure, defining and developing functions, importing data, exploring data, and describing uncertainty.
Follow the prompts and complete the exercises below. Please document all of your work and your examples in RMarkdown and knit them as a .html file. Please also make sure to show all of your work in code chunks in the notebook. All accompanying text prose for your answers should be in text in the notebook.
Note: crimson blocks of text are action items for you! They signify actions and responses for which you are responsible.
The assignment contains 3 sections:
Section 1 - Directory Structures
Section 2 - Introduction to functions
Section 3 - Importing and Analyzing Data (and uncertainty)
Sections 1 and 2 are set up as guides/tutorials. They should help you further refine your understanding and execution of R. Section 3 is set up as an exercise for you to work through. Here, you will apply the methods and tools we have discussed in class to conduct the required analysis.
After this assignment, you will be able to:
Having a logical, consistent directory structure is an important component of every project in this class, your other classes, your thesis, and life beyond your thesis. Again, when we do science, we are not just doing stuff, we are performing reproducible analyses. You need to lay out your work, save different steps of your work, and be able to find it again in logical places.
I will not dictate to you how to set up your directory structures for this class. Everyone’s brain works a little bit differently when it comes to logical organization. Sometimes, if you work on large, collaborative projects, they WILL dictate these structures, and for good reason (again, science is about documentation, reproducibility, and sharing). For the purposes of this class, find something that works for you and stick with it.
Before you set something up, read the following blog posts.
1. What are the three main reasons you want to have a good project layout? Can you think of any others?
Three main reasons to have a good project layout are: (1) to be able to return to and navigate the project easily, wherever and whenever; if I start a project today, I should be able to read my code tomorrow without having to go back to the beginning. (2) To make my datasets and analyses reproducible, because as scientists we want others to be able to verify our work. (3) To make it easier to move the project into other tools and languages, such as LaTeX or Python.
2. What are the three primary principles to follow in a good project layout?
The full path is the complete directions for how to
find the desired directory or file. It always starts with the home
directory or root (e.g., C:\Users\CardiB\). A full path serves as
the base path when it is used to set the working directory to
a specific directory. So in this class, Cardi B set up her directory
layout so that her full path looks something like this:
C:\Users\CardiB\classes\ES207\Rscripts\lab1. If Cardi B
wanted to set a smart base path for the working
directory (see below), it might look something like this:
C:\Users\CardiB\classes\ES207.
Note! File or directory paths and the home directory look
different on different operating systems. Linux will look something like
this: /home/CardiB. Windows will look like the examples
above: C:\Documents and Settings\CardiB or
C:\Users\CardiB. Mac OS X and Unix look like this:
/Users/CardiB/. The format will vary depending on which
version of Windows you are running. NOTE the direction of the
slashes.
3. Write out the full path for your R installation. Use the format of the operating system you are currently using.
“C:\Users\CardiB\Downloads\R-4.5.1-win.exe”
4. Write the path above using a different operating system.
“/Users/CardiB/Downloads/R-4.5.1-win.exe”
A relative path is a path to a directory or file that starts from the
location determined by the working directory. So, if our working
directory is set to /Users/CardiB/classes/ES207, we can
create a relative path for all directories and files within the
ES207 directory.
5. Write out the full path for the directory structure you have set up for this class all the way to where you have saved this tutorial.
“C:\Users\noahs\Desktop\Classes\ES207\HW_1_Importing_And_Exploring.Rmd”
6. Write out the relative path for this file assuming your working directory to be set to your equivalent of
Users/CardiB/classes/.
“ES207\HW_1_Importing_And_Exploring.Rmd”
7. Write out the paths in an operating system other than your own.
“/Users/noahs/Desktop/Classes/ES207”
In R, the working directory is the directory where R
will look for any file to open and where it will save any output.
Without a working directory, all of your R scripts would
need a full file path written out every time you want to open or save a
file. It is much more efficient to set a base file path
as our working directory and then all the file paths in
our scripts are just the file path relative to that base path (this is
known as a relative path).
Oh no! What if you can’t figure out the file path to this .html file you downloaded?
Windows: Right click on file, select Properties, go to the General tab, or in the file path bar at the top of each navigator window.
Mac OS X: Right click or Ctrl+click on the file, select Get Info
Once we are in R, we can view the current working
directory using getwd()
getwd()
## [1] "C:/Users/noahs/Desktop/Classes/ES207"
To set our current working directory to the location where our data
are located, we can use R script, or we can use the RStudio
GUI.
?setwd()
## starting httpd help server ... done
setwd("C:/Users/noahs/Desktop/Classes/ES207")
To view the file contents and sub directories of your working directory:
list.files()
## [1] "HW_1_Importing_And_Exploring.html" "HW_1_Importing_And_Exploring.Rmd"
## [3] "rsconnect" "Statistical Methods.pdf"
You can also set the working directory using the RStudio GUI by
navigating to Session >
Select Working Directory >
Choose Directory.
Note that using either method, you will be able to see the contents of the current working directory in the Files pane of the RStudio GUI (to your right).
Functions are the core of R – in the previous activities, we used
multiple functions, including c(), mean(),
sd(), data(), hist(), and
q()
Functions will always have parentheses which are used to pass
parameters. Some functions don’t have parameters or don’t need them,
like q().
Using and writing functions is a key part of learning any programming language. We’ll do a quick introduction here.
Often we will want to write our own functions. Let’s say we want to perform a calculation over and over again in our code. We could write the equation out each time, or we can save a little bit of time and build a function that allows us to repeat several operations using a single command.
Before we go any further, begin by reading this blog post. Make sure you can replicate the code shown in the blog post: https://www.r-bloggers.com/how-to-write-and-debug-an-r-function/
8. What are the general elements of a function? Which elements are not required?
The general elements of a function are a name, arguments, and a body of code that does something.
A return() statement is not required (R returns the value of the last evaluated expression by default), and arguments are not strictly required either, e.g. q().
9. How do you pass a value of an object from the local environment to the global environment (outside of a function)?
Return the value from the function and then assign the function’s result to an object in the global environment (e.g., result <- myfun(x)).
10. What is the name of the R function that can help you identify errors in your function?
The debug() function.
11. Aside from using the function identified in Question 10 above, what is another way to troubleshoot your functions? Hint: we used this in our first function, oddcount, in homework 1.
Insert print() statements, e.g. print(head(object)), inside the function to inspect intermediate values as it runs.
12. What is the name of a function to write your own error messages when writing a function?
stop() and stopifnot() functions
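For example, a minimal sketch of a custom error message (the function name safe_sqrt is hypothetical):
# minimal sketch of custom error messages using stop() and stopifnot()
safe_sqrt <- function(x) {
  stopifnot(is.numeric(x))                                      # fails with an informative message if x is not numeric
  if (any(x < 0, na.rm = TRUE)) stop("x must be non-negative")  # custom error message
  sqrt(x)
}
safe_sqrt(c(4, 9))   # returns 2 3
# safe_sqrt(-1)      # would stop with the error "x must be non-negative"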
13. Why do you want to keep your functions short?
It keeps your code easy to update: if a calculation lives in one short function, you only have to change it in one place, and everything that calls that function is updated accordingly. Short functions are also easier to read, test, and debug.
14. What are local versus global variables? Why do they matter when you write functions?
Local variables are objects defined inside a function; they do not exist outside of it (unless their values are returned).
Global variables, on the other hand, are defined outside of any function and can be accessed anywhere. This matters when writing functions because work done on local variables stays inside the function unless you explicitly return it.
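A minimal sketch (the names g, k, and add_five are hypothetical) to illustrate the difference:
g <- 10                    # global variable: defined outside any function
add_five <- function(x) {
  k <- 5                   # local variable: exists only while the function runs
  x + k + g                # the function body can read the global g
}
add_five(1)                # returns 16
# k                        # would error outside the function: object 'k' not found
result <- add_five(1)      # the value only reaches the global environment by returning and assigning it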
Now we are going to define a function. With all functions, we need to consider the following:
* What are the inputs?
* How does the function manipulate the inputs?
* What are the outputs that will be returned?
We are going to define a function that does the following:
* Input: a vector of values
* Output: a count of the number of odd values in the vector
Carefully type the following code chunk in the console or terminal. Hit enter at the end of each line. A couple of things to note:
* The braces demarcate the beginning { and end } of a function or operator.
* You will notice that until you finish typing the function in, you will see a + instead of a >. This + is what is called a “line continuation character” and is a reminder that you haven’t finished the function (with the end brace).
oddcount <- function(x)
{
k <- 0 # assign 0 to k
for (n in x) {
if(n %% 2 == 1) # %% is the modulo operator
{
k <- k+1
}
}
return(k)
}
If you get any errors, make sure you have a > and
re-type the function until you don’t get an error. Very common errors
are not closing parentheses and braces. We will use tools later on to
help with that.
Use the tab key to create the spacing. Tabs are not required, but they are a key part of keeping code readable.
Now we have defined the function oddcount. Notice that the
header of the function tells us there is one input,
x. Recall that the input must be a vector, which we can
create using c(). If you want to print the code of the
function again, just type:
# don't use parentheses!
oddcount
## function (x)
## {
## k <- 0
## for (n in x) {
## if (n%%2 == 1) {
## k <- k + 1
## }
## }
## return(k)
## }
# calling oddcount without parentheses prints the function's definition
Now, let’s try the function out by passing it a parameter:
oddcount(x <- c(1,3,5))
## [1] 3
# count how many odd numbers are in this vector
There are three odd numbers in this vector.
Try:
oddcount(x <- c(1,2,3,7,9))
## [1] 4
15. How many odd numbers were in this vector?.
There are 4 odd numbers in this vector.
The function appears to be working, now let’s look more carefully at what it is doing.
Following the flow of a function is a critical skill for understanding what it is doing (and fixing problems that may arise).
First, let’s look at the modulo operator “%%”. This operator returns the remainder of a division, e.g.:
38 %% 7
## [1] 3
# 38 %% 7 returns 3: the remainder of 38 divided by 7
Notice that the modulo of an even number divided by 2 is 0:
38 %% 2
## [1] 0
And an odd number is 1:
39 %% 2
## [1] 1
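As an aside (not part of the original tutorial), %% is vectorized in R, so a whole vector can be checked, and the odd values counted, in one step:
x <- c(1, 2, 3, 7, 9)
x %% 2                  # remainders: 1 0 1 1 1
x %% 2 == 1             # TRUE FALSE TRUE TRUE TRUE
sum(x %% 2 == 1)        # 4 -- a one-line equivalent of oddcount(x)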
We’ll learn more about for loops later on, but for now,
for(n in x) { ... } can be interpreted as setting
n equal to the first element in vector x (also defined as
x[1]), running the commands inside of the braces, and then
once it is done, repeating the process on x[2], and so-on
until it gets to the end of x. We can see this by
typing:
y <- c(3,0,7)
for(n in y) { print(n) } # simply prints the value of the variable
## [1] 3
## [1] 0
## [1] 7
This is the same as:
n <- y[1]
print(n)
## [1] 3
n <- y[2]
print(n)
## [1] 0
n <- y[3]
print(n)
## [1] 7
Anyway, back to our oddcount function:
The function starts with k set to 0, and then starts the loop:
for (n in x) {
  if(n %% 2 == 1) k <- k+1   # %% is the modulo operator
}
x was passed to the function as e.g.:
oddcount(x <- c(1,3,5)), so x is
c(1,3,5)
The loop starts with n <- x[1], then tests whether or
not n is odd by checking if n %% 2 is equal
(==, note the TWO equal signs) to 1.
Logical statements are used frequently, so let’s see what
if() is actually testing by typing:
37 %% 2
## [1] 1
37 %% 2 == 1
## [1] TRUE
38 %% 2
## [1] 0
38 %% 2 == 1
## [1] FALSE
It appears that an if() statement, fundamentally, is
just testing whether a statement is TRUE or
FALSE.
So, if n %% 2 does equal 1, the function adds
1 to k, and repeats the process on
x[2], and so on.
Once the function is done, we want to return the value of
k which is keeping track of how many odd numbers are in our
vector, so we use the return() statement:
return(k)
In the function definition, x is the “formal argument”
or formal parameter of the function oddcount; c(1,3,5) is the actual
argument of the function.
Another note: functions can only return a SINGLE variable. If we want to return a lot of things, we’ll use some tricks discussed later.
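A quick preview of one such trick (a sketch, not part of the original tutorial): bundling several results into a list, which is still a single returned object.
parity_counts <- function(x) {
  odd  <- sum(x %% 2 == 1)
  even <- sum(x %% 2 == 0)
  return(list(odd = odd, even = even))   # one object that contains both counts
}
parity_counts(c(1, 2, 3, 7, 9))          # $odd is 4, $even is 1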
A lot of times it is helpful to see, more specifically, what a
function is doing by using print() statements.
paste() merges the contents together into a single
string.
We will modify the code a bit to help watch the function run:
oddcount <- function(x) {
# print("x is:")
print(x)
k <- 0 # assign 0 to k
print(paste("k is initialized as",k))
for (n in x) {
print(paste("current x value being tested is",n))
if(n %% 2 == 1)
{
k <- k+1 # increment the odd counter
print(paste(n,"is an odd number!"))
} else
{
print(paste(n,"is an even number!"))
}
print(paste("k is currently",k))
}
print(paste("The final k is",k))
return(k)
}
# And trying running our more verbose function:
oddcount(x <- c(1,2,3,7,9))
## [1] 1 2 3 7 9
## [1] "k is initialized as 0"
## [1] "current x value being tested is 1"
## [1] "1 is an odd number!"
## [1] "k is currently 1"
## [1] "current x value being tested is 2"
## [1] "2 is an even number!"
## [1] "k is currently 1"
## [1] "current x value being tested is 3"
## [1] "3 is an odd number!"
## [1] "k is currently 2"
## [1] "current x value being tested is 7"
## [1] "7 is an odd number!"
## [1] "k is currently 3"
## [1] "current x value being tested is 9"
## [1] "9 is an odd number!"
## [1] "k is currently 4"
## [1] "The final k is 4"
## [1] 4
#the verbose version reports each value tested and how k changes, then returns the total count of odd numbers in the vector
16. Try creating a new function “evencount” that counts the even numbers in a vector. Turn in your script as a .R script. Make sure you add appropriate comments - you will be graded on this.
evencount <- function(x) {
# print("x is:")
print(x)
k <- 0 # assign 0 to k
print(paste("k is initialized as",k))
for (n in x) {
print(paste("current x value being tested is",n))
if(n %% 2 == 0) #For even numbers, n %% 2 == 0. For odd numbers, n %% 2 == 1
{
k <- k+1 # increment the even counter
print(paste(n,"is an even number!"))
} else
{
print(paste(n,"is an odd number!"))
}
print(paste("k is currently",k))
}
print(paste("The final k is",k))
return(k)
}
# And trying running our more verbose function:
evencount(x <- c(1,2,3,5,6,7,8,9))
## [1] 1 2 3 5 6 7 8 9
## [1] "k is initialized as 0"
## [1] "current x value being tested is 1"
## [1] "1 is an odd number!"
## [1] "k is currently 0"
## [1] "current x value being tested is 2"
## [1] "2 is an even number!"
## [1] "k is currently 1"
## [1] "current x value being tested is 3"
## [1] "3 is an odd number!"
## [1] "k is currently 1"
## [1] "current x value being tested is 5"
## [1] "5 is an odd number!"
## [1] "k is currently 1"
## [1] "current x value being tested is 6"
## [1] "6 is an even number!"
## [1] "k is currently 2"
## [1] "current x value being tested is 7"
## [1] "7 is an odd number!"
## [1] "k is currently 2"
## [1] "current x value being tested is 8"
## [1] "8 is an even number!"
## [1] "k is currently 3"
## [1] "current x value being tested is 9"
## [1] "9 is an odd number!"
## [1] "k is currently 3"
## [1] "The final k is 3"
## [1] 3
#the vector contains three even numbers (2, 6, and 8), so the function correctly returns k = 3
A key element of this assignment is importing datasets. Functions that are commonly used for importing data include read.csv() and read.table(). Additionally, you can use the readr package within the tidyverse library.
If needed, here is a non-exhaustive list of some resources related to these functions:
* R Data Import Tutorial by Karlijn Willems https://www.datacamp.com/community/tutorials/r-data-import-tutorial
* “Reading and Importing Excel Files into R Tutorial” by Karlijn Willems https://www.datacamp.com/community/tutorials/r-tutorial-read-excel-into-r
* Chapter 11 in the R for Data Science book https://r4ds.had.co.nz/data-import.html
One final note: when importing data, it is helpful to pay particular attention to your working directory and file paths.
Researchers have conducted chemical measurements of different springs to evaluate the impact of different draining rock types on water composition. Data from these measurements are found in the ChlorideConcentrations.csv file available in CatCourses. This file contains water chloride concentrations (milligrams Chloride per liter water) associated with two different rock types – Granodiorite and Quartz monzonite.
17. Import the data and compare the chloride concentrations from the two rock types. In particular, generate and compare the following graphs for chloride concentrations of the two rock types: a) histogram, b) boxplot, c) Q-Q plot. Describe the similarities and differences in chloride between these two rock types. What characteristics are evident in each graph?
library(readr)
data <- read.csv("C:\\Users\\noahs\\Downloads\\ChlorideConcentrations.csv")
Granodiorite <- data$Granodiorite
Quartz <- data$Quartz.monzonite
#comparing histogram
hist(Granodiorite)
hist(Quartz)
#comparing boxplots
boxplot(Granodiorite,
main = "Boxplot of Granodiorite",
ylab = "Values",
col = "lightblue")
boxplot(Quartz,
main = "Boxplot of Quartz",
ylab = "Values",
col = "lightgreen")
#comparing qqplot
qqplot(Granodiorite, Quartz,
main = "Q–Q Plot: Granodiorite vs Quartz",
xlab = "Granodiorite Quantiles",
ylab = "Quartz Quantiles",
col = "blue")
abline(0, 1, col = "red") # 45-degree line for reference
#note to self
#Lower quantiles = near the origin (smallest values).
#Middle quantiles = central 25%–75% of the data (around the median).
#Upper quantiles = the far right side (largest values, tails/outliers).
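As an aside, since library(readr) is loaded above, the same file could also be read with read_csv() and a relative path instead of a hard-coded absolute path (a sketch; it assumes the CSV sits in the current working directory, and read_csv keeps the original column names, so check them before extracting columns):
# hypothetical alternative using readr and a relative path
chloride <- readr::read_csv("ChlorideConcentrations.csv")
head(chloride)   # confirm the column names (the quartz column may not be named Quartz.monzonite here)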
Granodiorite has a wider range of chloride concentrations than Quartz (roughly 0-10 mg/L versus 0-3.5 mg/L). Both histograms show a right-skewed distribution.
When comparing the boxplots, again we can see that Granodiorite has the wider spread. The key insight from the boxplot is that most of Granodiorite’s data lie close to the very low end, with the median near zero and the majority of values clustered below 2 mg/L, as evident from the IQR. The Quartz boxplot also shows that the majority of its values lie at the low end, similar to Granodiorite; however, there is less variability for Quartz because it is less dispersed, with values mostly clustered around 0.5-1 mg/L.
The Q–Q plot comparing Granodiorite and Quartz shows that Granodiorite values are consistently larger across most of the distribution. At the lower quantiles, the points fall below the reference line, indicating that even the smallest Granodiorite values are greater than those of Quartz. This pattern continues through the middle quantiles, where Granodiorite remains higher, and becomes even more pronounced at the upper quantiles as Granodiorite extends to much larger values. Overall, the plot highlights that Granodiorite has a wider spread and heavier upper tail compared to the more compact distribution of Quartz.
In class and our readings, we have learned that Confidence Intervals and Prediction Intervals are two of the primary ways we can explore and describe the uncertainty associated with a dataset. Let’s explore these concepts a little more.
18. Using the chloride concentrations (from Question 1), compute 95% Confidence Interval estimates for the median of the granodiorite data.
?qt
?is.na
list(Granodiorite)
## [[1]]
## [1] 6.0 5.0 0.5 0.5 0.6 10.0 0.4 1.2 0.2 0.7 0.3 0.2 0.8 0.2 1.7
## [16] 6.0 0.5 3.0 NA NA NA NA NA NA NA NA NA NA NA NA
## [31] NA NA NA NA
hist(Granodiorite)
#we will compare several transformations to see which best normalizes the data
#1.square root transformation
Grano_sqrt <- sqrt(Granodiorite)
hist(Grano_sqrt)
qqnorm(Grano_sqrt, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(Grano_sqrt, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#doesn't look good
#2.log transformation
Grano_log <- log(Granodiorite)
hist(Grano_log)
qqnorm(Grano_log, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(Grano_log, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#looks better than the sqrt transformation, but still not as normal as we want it to be
#3.reciprocal transformation
Grano_recip <- 1/Granodiorite
hist(Grano_recip)
qqnorm(Grano_recip, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(Grano_recip, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#more of the central points fall on the line than with the previous transformations, so I will use the reciprocal transformation moving forward
n_Grano = sum(!is.na(Grano_recip))
n_Grano
## [1] 18
df = n_Grano - 1
df
## [1] 17
qt(.975,df) #upper t critical value (right tail)
## [1] 2.109816
qt(.025,df) #lower t critical value (left tail)
## [1] -2.109816
ME <- qt(0.975,df)*sd(Grano_recip, na.rm = TRUE)/sqrt(n_Grano)
ME
## [1] 0.8477373
#we are interested in the median rather than the mean, so the point estimate below uses median()
x_Grano = median(Grano_recip, na.rm = TRUE)
#because there are NA values in our dataset, we tell R to remove them with na.rm = TRUE
x_Grano
## [1] 1.547619
CI_low = x_Grano - ME
CI_high = x_Grano + ME
CI_low
## [1] 0.6998817
CI_high
## [1] 2.395356
cat("The lower and upper bounds of the 95% confidence interval are:", round(CI_low,4), round(CI_high,4))
## The lower and upper bounds of the 95% confidence interval are: 0.6999 2.3954
#the mean and median give a quick numerical (non-visual) check of the data's center
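A caveat worth adding (an editorial aside, not part of the original analysis): the interval above is on the reciprocal scale (1/concentration). To express it in mg/L, the bounds can be back-transformed; because 1/x is a decreasing function, the lower and upper bounds swap:
# back-transform the reciprocal-scale interval to mg/L (the bounds swap under 1/x)
CI_low_mgL  <- 1 / CI_high
CI_high_mgL <- 1 / CI_low
cat("Back-transformed 95% CI for the median (mg/L):",
    round(CI_low_mgL, 4), "to", round(CI_high_mgL, 4))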
Scenario: You have been collecting data on well yields from across the state and storing your measurements in a dataset labeled VAwells.csv. You encounter a new well and find that its yield is 0.85 gallons per minute per foot.
19. Is the yield of the new well likely to belong to the same distribution as the data in VAwells.csv or does it represent something larger? Use a 95% Prediction Interval to inform your answer.
#we use a prediction interval because we want to know whether a single new observation is consistent with the existing distribution
VaWells <- read.csv("C:\\Users\\noahs\\Downloads\\VAwells.csv")
#always check for assumptions
VaWells_Yields <- VaWells$yields
#does it pass normality assumption? check with histogram or qqplot
hist(VaWells_Yields)
qqnorm(VaWells_Yields, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(VaWells_Yields, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#the data look negatively skewed; we proceed with a t-based prediction interval on the raw yields
n_VaWells_Yields <- length(VaWells_Yields)
df_VaWells_Yields <- n_VaWells_Yields - 1
VaWells_Yields <- sample(VaWells_Yields, n_VaWells_Yields) # reshuffle the full sample (mean and sd are unaffected)
set.seed(120) # for reproducibility
x_bar_VaWells_Yields <- mean(VaWells_Yields)
sigma_VaWells_Yields <- sd(VaWells_Yields)
t_90 <- qt(0.95, df_VaWells_Yields) # note: the 0.95 quantile gives a two-sided 90% interval
# Margin of Error for our 90% Prediction Interval
ME_PI_VaWells_Yields <- t_90 * sigma_VaWells_Yields * sqrt(1+(1/n_VaWells_Yields))
#Lower and Upper Bounds of our 90% Prediction Interval
LB_PI_VaWells_Yields <- x_bar_VaWells_Yields - ME_PI_VaWells_Yields
UB_PI_VaWells_Yields <- x_bar_VaWells_Yields + ME_PI_VaWells_Yields
cat("Our 90% Prediction Interval is:", LB_PI_VaWells_Yields, UB_PI_VaWells_Yields)
## Our 90% Prediction Interval is: -0.4457362 1.06345
The new well’s yield of 0.85 gallons per minute per foot falls within the computed prediction interval (approximately -0.45 to 1.06), so it is consistent with the distribution of the VAwells data rather than representing something larger. Note that qt(0.95, df) yields a two-sided 90% interval; a true 95% interval would use qt(0.975, df) and be wider still, which does not change this conclusion.
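For reference, a sketch of how the two-sided 95% interval the question asks for would be computed, using the 0.975 t quantile and the objects defined above (not re-run here, so no output is shown):
# sketch only: two-sided 95% prediction interval for a single new well yield
t_95  <- qt(0.975, df_VaWells_Yields)
ME_95 <- t_95 * sigma_VaWells_Yields * sqrt(1 + (1/n_VaWells_Yields))
c(x_bar_VaWells_Yields - ME_95, x_bar_VaWells_Yields + ME_95)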
Scenario: You have been tracking stream flows for the Conecuh River and storing your measurements in a dataset labeled Conecuh.csv.
20. Import Conecuh.csv and construct the 95% Confidence and Prediction Intervals for the mean and median annual stream flows.
data_Conecuh <- read.csv("C:\\Users\\noahs\\Downloads\\Conecuh.csv")
Conecuh_flows <- data_Conecuh$Flowcfs
#CALCULATING FOR THE 95% CI MEAN AND MEDIAN
hist(Conecuh_flows)
qqnorm(Conecuh_flows, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(Conecuh_flows, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#doesn't look quite normal
#some transformations
#1.square root transformation
ConecuhFlows_sqrt <- sqrt(Conecuh_flows)
hist(ConecuhFlows_sqrt)
qqnorm(ConecuhFlows_sqrt, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(ConecuhFlows_sqrt, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#2.log transformation
ConecuhFlows_log <- log(Conecuh_flows)
hist(ConecuhFlows_log)
qqnorm(ConecuhFlows_log, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(ConecuhFlows_log, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#3.reciprocal transformation
ConecuhFlows_recip <- 1/Conecuh_flows
hist(ConecuhFlows_recip)
qqnorm(ConecuhFlows_recip, main = "Normal Q-Q Plot",
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(ConecuhFlows_recip, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
#the sqrt transformation looks about as good as it gets
n = sum(!is.na(ConecuhFlows_sqrt))
n
## [1] 20
df = n - 1
df
## [1] 19
qt(.975,df) #upper t critical value (right tail)
## [1] 2.093024
qt(.025,df) #lower t critical value (left tail)
## [1] -2.093024
ME <- qt(0.975,df)*sd(ConecuhFlows_sqrt, na.rm = TRUE)/sqrt(n)
ME
## [1] 2.402404
x_flows_mean = mean(ConecuhFlows_sqrt, na.rm = TRUE)
#because there are NA values in our dataset, we tell R to remove them with na.rm = TRUE
x_flows_median = median(ConecuhFlows_sqrt, na.rm = TRUE)
CI_low_mean = x_flows_mean - ME
CI_high_mean = x_flows_mean + ME
CI_low_median = x_flows_median - ME
CI_high_median = x_flows_median + ME
cat("The lower and upper bounds of the 95% confidence interval for the mean and median are as follows: (",
round(CI_low_mean, 3), ", ", round(CI_high_mean, 3),
") and (", round(CI_low_median, 3), ", ", round(CI_high_median, 3), ")\n", sep = "")
## The lower and upper bounds of the 95% confidence interval for the mean and median are as follows: (23.244, 28.048) and (21.702, 26.506)
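An editorial aside: the bounds above are on the square-root scale. Squaring them puts the intervals back into the original units (cfs), which makes them easier to compare with the bootstrap results below (strictly, the squared interval describes the center of the transformed data rather than the raw-data mean, but it gives numbers on a comparable scale):
# back-transform the sqrt-scale confidence limits to cfs by squaring
c(CI_low_mean^2,   CI_high_mean^2)    # interval for the mean, original units
c(CI_low_median^2, CI_high_median^2)  # interval for the median, original units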
#CALCULATING FOR THE 95% PI MEAN AND MEDIAN
#using the bootstrap method
samp_Conecuh_Flows <- sample(Conecuh_flows,20) #our initial sample
set.seed(120) # for reproducibility
B <- 10000 # number of bootstrap replicates
n <- length(samp_Conecuh_Flows)
#initialize vector bstrap to store results of for loop FOR MEAN
bstrap_mean <- c()
for (i in 1:B) {
#first take the sample
bsample_mean <- sample(samp_Conecuh_Flows,20,replace = TRUE)
# now calculate the mean bootstrap estimate
bestimate_mean <- mean(bsample_mean)
bstrap_mean <- c(bstrap_mean, bestimate_mean)
}
lower_bound_mean <- quantile(bstrap_mean, 0.025)
upper_bound_mean <- quantile(bstrap_mean, 0.975)
print("The 95% prediction interval for the mean is:")
## [1] "The 95% prediction interval for the mean is:"
c(lower_bound_mean, upper_bound_mean)
## 2.5% 97.5%
## 571.045 800.750
#these bounds are on the original scale (cfs); they are roughly the square of the sqrt-scale confidence interval above
#initialize vector bstrap to store results of for loop FOR MEDIAN
bstrap_median <- c()
for (i in 1:B) {
#first take the sample
bsample_median <- sample(samp_Conecuh_Flows,20,replace = TRUE)
# now calculate the median bootstrap estimate
bestimate_median <- median(bsample_median)
bstrap_median <- c(bstrap_median, bestimate_median)
}
lower_bound_median <- quantile(bstrap_median, 0.025)
upper_bound_median <- quantile(bstrap_median, 0.975)
print("The 95% prediction interval for the median is:")
## [1] "The 95% prediction interval for the median is:"
c(lower_bound_median, upper_bound_median)
## 2.5% 97.5%
## 527.0 876.5
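One caution to note (editorial aside): quantiles of bootstrapped means or medians describe the uncertainty of those statistics, which is a confidence interval. A prediction-style interval for a single future annual flow would instead come from the spread of the individual observations, for example:
# sketch: 95% interval for a single future observation from the empirical quantiles of the flows
quantile(Conecuh_flows, probs = c(0.025, 0.975), na.rm = TRUE)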
21. Using Conecuh.csv, apply a bootstrap approach to construct the 95% Confidence Interval for the mean annual streamflows. How does your answer change if you use 1000 bootstrap replicates vs. 10000 bootstrap replicates? How does your bootstrap Confidence Interval compare to your estimate in Question 4?
#the full set of observed flows (our sample data)
x <- data_Conecuh$Flowcfs
#our initial sample in the data is 20
#how do you determine initial sample size?
samp_x <- sample(x,length(x), replace = FALSE)
n = length(samp_x)
# Helper that returns B bootstrap means (each resample has size n, with replacement)
boot_means <- function(B) {
replicate(B, mean(sample(samp_x, n, replace = TRUE)))
}
#for 1000 samples
bstrap_1k <- boot_means(1000)
ci_1k <- quantile(bstrap_1k, c(0.025, 0.975))
# 10k resamples
bstrap_10k <- boot_means(10000)
ci_10k <- quantile(bstrap_10k, c(0.025, 0.975))
hist(bstrap_1k)
hist(bstrap_10k)
cat("95% CI (1,000 resamples):", paste(round(ci_1k, 3), collapse = " to "), "\n")
## 95% CI (1,000 resamples): 562.149 to 797.514
cat("95% CI (10,000 resamples):", paste(round(ci_10k, 3), collapse = " to "), "\n")
## 95% CI (10,000 resamples): 569.999 to 801.351
Comparing the 1,000-replicate interval (562.1 to 797.5) with the 10,000-replicate interval (570.0 to 801.4), the two are very similar and of nearly the same width. Increasing the number of bootstrap replicates does not widen the 95% CI; it mainly reduces the Monte Carlo noise in the estimated 2.5% and 97.5% bounds, so the 10,000-replicate interval is the more stable estimate. Both bootstrap intervals are also broadly consistent with the (back-transformed) t-based interval for the mean computed in the previous question.
Following are two “challenge questions.” Please complete at least ONE(1) of these questions. If you complete both, you will be eligible for up to a 10% bonus on this assignment.
Challenge 1
The Great Lakes Environmental Research Laboratory of the U.S. National Oceanic and Atmospheric Administration (NOAA-GLERL) routinely monitors the western basin of Lake Erie. The data file LakeErie2.csv (available in CatCourses) contains total phosphorous (TP) and chlorophyll a (chla) concentration data collected by NOAA-GLERL, as well as additional data collected by the Ohio Department of Natural Resources (ODNR), and the University of Toledo (Toledo).
CQ1. Write a function that calculates the mean, median, standard deviation, interquartile range, and QQ-Plot. Apply that function to the TP concentration data. In ~300 words or less discuss the differences between these measures and what conclusions you can draw about the data.
Note: Most actions in R are invoked by calling a function. Therefore, creating and using functions to accomplish specific tasks (especially tasks that are likely to be repeated multiple times) can be very useful. We were introduced to functions in HW 1 and (especially) HW 2, so you may find it helpful to refer back to them if you need a refresher on functions.
data_LakeErie = read.csv("C:\\Users\\noahs\\Downloads\\LakeErie2.csv")
TP = data_LakeErie$TP
#We are analyzing TP (total phosphorus) concentration
f_TP <- function(x)
{
TP_mean = mean(x,na.rm = TRUE) #na.rm = TRUE ignores missing values
print(paste("Total Phosphorus Concentration mean is",TP_mean))
TP_median = median(x,na.rm = TRUE)
print(paste("Total Phosphorus Concentration median is",TP_median))
TP_sd = sd(x,na.rm = TRUE)
print(paste("Total Phosphorus Concentration standard deviation is",TP_sd))
TP_IQR = IQR(x, na.rm = TRUE)
print(paste("Total Phosphorus Concentration IQR is",TP_IQR))
qqnorm(x, main = "Normal Q-Q Plot",   # use the function argument x rather than the global TP
xlab = "Theoretical Quantiles", ylab = "Sample Quantiles",
plot.it = TRUE, datax = FALSE)
qqline(x, datax = FALSE, distribution = qnorm,
probs = c(0.25, 0.75), qtype = 7)
}
f_TP(TP)
## [1] "Total Phosphorus Concentration mean is 59.876050069541"
## [1] "Total Phosphorus Concentration median is 40.24"
## [1] "Total Phosphorus Concentration standard deviation is 57.5196223607315"
## [1] "Total Phosphorus Concentration IQR is 49.43"
The mean, median, standard deviation, and IQR are 59.88, 40.24, 57.52, and 49.43, respectively. The mean is substantially higher than the median, which indicates a right-skewed distribution. The standard deviation of 57.52 indicates strong variability, again consistent with the presence of high outliers. The Q-Q plot shows that the data are not normally distributed and confirms the right skew driven by large values in the upper tail; a transformation such as a log or square root would therefore be needed for further analysis.
Challenge 2
Suppose you have a dataset, PotomacOneDayLow.csv, that consists of the annual 1-day minimum discharge (flowrate in volume/time) at the Little Falls water intake site on the Potamac River. We want to select a design discharge for the water intake so that it can continue to function when flowrates are low. More specifically, we want to select a design discharge so that the intake only becomes inoperable in 1 year out of 10 (i.e. once every 10 years // 10% probability of becoming inoperable in any given year). The intake becomes inoperable when it is no longer submerged in the river (i.e., the 1-day minimum discharge of the river becomes so low that the intake is no longer submerged). We know that an estimate of the 0.1 (10%) frequency annual minimum flow is somewhat uncertain, so we will want to be conservative in our selection of this design discharge by incorporating a “safety factor.”
CQ2. Using this information and dataset, what design discharge should be established for this intake in order to ensure that there is only a 5% probability that the true 0.1 frequency annual minimum flow is below the design discharge?
Hint: If you’re not quite sure where to start, Section 3.7 in Helsel et al. (2020) may be helpful.
##find the PI for 10th percentile, 5% safety factor
data_Potoma = read.csv("C:\\Users\\noahs\\Downloads\\PotomacOneDayLow.csv")
#bootstrapping method
# Load data
flows <- na.omit(data_Potoma$Q) # ignores any missing data
n <- length(flows)
B <- 10000 # number of bootstrap samples (note: not used below; the loop runs 500 resamples)
alpha <- 0.05 # For 95% intervals
df = n-1
# 1. Bootstrap CI for 10th percentile
set.seed(777)
bstrap <- c()
for (i in 1:500) {
#first take the sample
bsample <- sample(flows,n, replace = T)
# now calculate the bootstrap estimate
bestimate <- quantile(bsample,0.1)
bstrap <- c(bstrap, bestimate)
}
# Percentile CI
ci_low <- quantile(bstrap, 0.025)
ci_high <- quantile(bstrap, 0.975)
cat("95% Bootstrap CI for 10th percentile:\n")
## 95% Bootstrap CI for 10th percentile:
cat("Lower:", round(ci_low, 2), "cfs Upper:", round(ci_high, 2), "cfs\n")
## Lower: 11.47 cfs Upper: 21.12 cfs
#our design discharge is the lower bound of this confidence interval (11.47, 21.12)
#BUT since we want a 5% safety factor to account for uncertainty,
#we can add 5% on top of that value:
#11.47 + (11.47*0.05) = 11.47 + 0.5735 = 12.0435
q_design = 11.47 + (11.47*0.05)
#so
cat("Our design discharge taking into account a safety factor 5 will be", round(q_design,2), "cfs")
## Our design discharge taking into account a safety factor 5 will be 12.04 cfs
“There is a 95% probability that the true 10th percentile of the annual minimum discharge falls between roughly 11.47 and 21.12 cfs (the bootstrap confidence interval above); the design discharge is taken from the lower end of this interval.”
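An alternative framing (editorial sketch, in the spirit of Helsel et al. Section 3.7): since we want only a 5% chance that the true 0.1-frequency low flow lies below the design value, a one-sided lower confidence bound answers the question directly. Using the bootstrap distribution bstrap already computed above:
# 5th percentile of the bootstrapped 10th-percentile estimates:
# the true 0.1-frequency annual minimum flow exceeds this value with roughly 95% confidence
q_design_onesided <- quantile(bstrap, 0.05)
q_design_onesided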
# -----------------------------
# 2. Bootstrap PI for 10th percentile
# -----------------------------
x_10_flows <- quantile(flows,0.1) #replace the mean in the formula
sigma_flows <- sd(flows)
t_95<- qt(0.975,83)
# Margin of Error for our 95% Prediction Interval
ME_PI_flows <- t_95 * sigma_flows * sqrt(1+(1/84))
#Lower and Upper Bounds of our 95% Prediction Interval
LB_PI_flows <- x_10_flows - ME_PI_flows
UB_PI_flows <- x_10_flows + ME_PI_flows
cat("Our 95% Prediction Interval is:", LB_PI_flows, UB_PI_flows)
## Our 95% Prediction Interval is: -30.22556 63.58156
summary(flows)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.43 23.79 33.84 38.43 47.43 155.74
#note to self: the confidence interval tells us where the true population 10th percentile lies (4.78 to 28.01),
#while the 95% bootstrap PI (4.93 to 28.23 cfs) tells us what a new observation of the 10th-percentile flow would look like.
#We want a safety factor, so the 5th percentile of the bootstrapped 10th-percentile flows (from the CI) will be our design discharge:
#Q_design = quantile(boot_q10, 0.05) = 10.26
Based on the bootstrap distribution, there is only a 5% chance that the true 10th-percentile annual minimum flow is below 10.26 cfs, so 10.26 cfs can serve as the design discharge below which the intake becomes inoperable. To be safe, we apply a 10% safety factor: 10.26 + (10.26 x 0.10) = 11.286 cfs.
Using the CI (10.26 cfs) gives you protection against parameter uncertainty:
“We probably haven’t underestimated the 10th percentile.”
Using the PI (≈ 4.9–28.2 cfs) shows the range of future realizations:
“Even if our estimate is right, future samples might produce lower annual minima just by chance.”