Homework 3

Simple Linear Regression Model Inference

Author

Spencer Hamilton

Data and Description

Climate change has left California particularly vulnerable to severe drought conditions. One factor affecting water availability in Southern California is stream runoff from snowfall. If runoff could be predicted, engineers, planners, and policy makers could do their jobs more effectively because they would have an estimate as to how much water is entering the area.

The Runoff Water data set compares the stream runoff (column 2) (in acre-feet) of a river near Bishop, California (due east of San Jose) with snowfall (column 1) (in inches) at a site in the Sierra Nevada mountains. The data set contains 43 years’ worth of measurements. Download the water.txt file from Canvas, and put it in the same folder as this Quarto file.

0. Replace the text “< PUT YOUR NAME HERE >” (above next to “author:”) with your full name.

1. Read in the data set, and call the dataframe “water”. Print a summary of the data and make sure the data makes sense.

water <- read.table('water.txt', header = TRUE)
summary(water)

     Precip           Runoff      
 Min.   : 4.600   Min.   : 41785  
 1st Qu.: 8.705   1st Qu.: 59857  
 Median :12.140   Median : 69177  
 Mean   :13.522   Mean   : 77756  
 3rd Qu.:16.920   3rd Qu.: 92206  
 Max.   :33.070   Max.   :146345

2. Create (and display) a scatterplot of the data with variables on the appropriate axes. Make you plot look professional (e.g., axis labels are descriptive). You should save your plot as an object to be used throughout the rest of the assignment.

saved_plot <- ggplot(data = water) + 
  geom_point(mapping = aes(x = Precip, y = Runoff)) + labs(title = "Climate Change Scatterplot",
  x = "snowfall (in inches)",
  Y =  "stream runoff (in acre-feet")
saved_plot

3. Calculate (and report) the correlation coefficient. Use that and the scatterplot to briefly describe the relationship between Stream Runoff and Snowfall.

water_cc <- cor(water$Precip,water$Runoff)
water_cc

[1] 0.938436

There is a strong positive linear relationship between these variables. As snowfall increases, stream runoff tends to increase proportionally. The scatter plot confirms this, showing that as snowfall rises, so does stream runoff. In addition, we can see in the scatter plot that there usually isn’t very much snowfall, as most of the points are on the left side of the graph (right skew).

4. Add the OLS regression line to the scatterplot you created in 2. Display the plot.

saved_plot +
  geom_smooth(mapping = aes(x = Precip, y = Runoff),
              method = "lm",
              formula = 'y ~ x',
              se = FALSE)

5. Fit a simple linear regression model to the data (no transformations), and save the residuals and fitted values to the `water` dataframe. Print a summary of the linear model.

#A
lm_water <- lm(Runoff ~ Precip, data = water)
#B
summary(lm_water)


Call:
lm(formula = Runoff ~ Precip, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-17603.8  -5338.0    332.1   3410.6  20875.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
Precip        3752.5      215.7  17.394  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8922 on 41 degrees of freedom
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8778 
F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

#C
water$residuals <- residuals(lm_water)
water$fitted_values <- fitted(lm_water)

print(head(water))

  Precip Runoff  residuals fitted_values
1   6.47  54235  2941.8309      51293.17
2  10.26  67567  2051.9105      65515.09
3  11.35  66161 -3444.2988      69605.30
4  11.13  68094  -685.7519      68779.75
5  22.81 107080 -5528.7836     112608.78
6   7.41  67594 12773.4945      54820.51

Questions 6 to 10 involve using diagnostics to determine if the linear regression assumptions are met and if there are influential observations. For each assumption, (1) perform appropriate diagnostics to determine if the assumption is violated, and (2) explain whether or not you think the assumption is violated and why you think that.

6. (L) \(X\) vs \(Y\) is linear

saved_plot +
  geom_smooth(mapping = aes(x = Precip, y = Runoff),
              method = "lm",
              formula = 'y ~ x',
              se = FALSE)

I would say that there is still a linear relationship between snowfall and runoff.

7. (I) The errors are independent (no diagnostic tools necessary in this case - just think about how the data was collected and briefly write your thoughts)

Looking at how the data was collected, we have no reason to believe the errors are not independent. I am very cautious to just forcefully assume this is true, as there could be measurement biases, or time-lag effects on when the snow is melted into snow (what if some years it is slower to melt based on temp or something like that). But yeah, we generally just assume this is true.

8. (N) The errors are normally distributed (use at least two diagnostic tools)

# The code below produces a basic histogram (from the in-class analysis #2)

hist(lm_water$residuals, main = "Histogram of Residuals", xlab = "Residuals")

#Shapiro Test (also in-class #2)
shapiro.test(lm_water$residuals)


    Shapiro-Wilk normality test

data:  lm_water$residuals
W = 0.96775, p-value = 0.2639

#Kolmogorov-smirnov test
ks.test(water$residuals, "pnorm")


    Exact one-sample Kolmogorov-Smirnov test

data:  water$residuals
D = 0.51163, p-value = 5.716e-11
alternative hypothesis: two-sided

The residuals are generally right skewed when we look at the histogram of the data, even though the shapiro-wilks test approves of the normalcy, I do not believe the residuals are normally distributed. The Kolmogorov-Smirnoff test agrees, with a rejecting p-value of 5.7e-11. So, no, the residuals are not normal.

9. (E) The errors have equal (constant) variance across all values of \(X\) (homoscedastic)

#Code from in-class #2
autoplot(lm_water, which = 3, ncol = 1, nrow = 1) +
  theme(aspect.ratio = 1)

autoplot(lm_water, which = 2, ncol = 1, nrow = 1)  +
  theme_bw() + 
  coord_equal()

The errors do not have equal variance across all variables of x. We can see this in the very straight line drawn in the Scale-location plot, and the curvy-ness of the datapoints in the normal q-q plot. So, no there is no homoscedastic-ness.

10. Check for influential points and report your findings.

autoplot(lm_water, which = 4, ncol = 1, nrow = 1)  +
 theme(aspect.ratio = 1)

As we keep saying, the few datapoints at the right edge which are causing the large skew have much more power than the rest of the datapoints. Even excluding 35, 36 and 37, we can see that the other points above 30 have lots of influence.

Based on your answers to questions 6 through 10, you may (or may not) have decided a transformation to the data is needed. This was, hopefully, good practice for assessing model assumptions. For simplicity for this assignment, we will use the orignial model (no transformations) for the rest of the questions.

Okay, yes, a transformation is needed.

11. Mathematically write out the fitted simple linear regression model for this data set using the coefficients you found above (do not use betas or matrix notation). Do not use “X” and “Y” in your model - use descriptive variable names.

\(\hat{Runoff (acre-feet)}\) = 27014.6 + 3752.5 \(\times\) \(\text{Precipitation (inches)}\)

12. Compute a 95% confidence interval for the slope using the output from the lm() function (to get the standard error of \(\beta_1\)) in tandem with the qt() function (to get the correct critical value).

# Extract standard error of the slope coefficient
se_beta1 <- summary(lm_water)$coef["Precip", "Std. Error"]

# Calculate degrees of freedom
df <- length(water$Precip) - 2  # (number of observations - number of coefficients)

# Determine critical value for 95% confidence interval
t_star <- qt(0.975, df)  # 0.975 corresponds to (1 - alpha/2) for a two-tailed test

# Compute margin of error
ME <- t_star * se_beta1

# Calculate confidence interval for slope coefficient
beta1 <- 3752.485 # Estimated slope coefficient
confidence_interval <- c(beta1 - ME, beta1 + ME)

# Print the confidence interval
confidence_interval

[1] 3316.809 4188.161

We are 95% confident that the true slope of the line (how much runoff increases with each addition of an inch of water precipitation) is between 3316.809 and 4188.161

13. Compute a 95% confidence interval for the slope using the confint() function in R (you should get the same answer as in 12). Interpret the confidence interval.

confint(lm_water)

                2.5 %    97.5 %
(Intercept) 20513.978 33515.197
Precip       3316.809  4188.162

We are 95% confident that the true slope of the line (how much runoff increases with each addition of an inch of water precipitation) is between 3316.809 and 4188.162.

14. Based on the confidence interval, is there a statistically significant linear association between snowfall and stream water? Why or why not?

Zero is not included in our confidence interval. This means that according to our 95% confidence interval, there is a statistically significant linear relationship between snowfall and stream water.

15. Print a summary of the linear model. Interpret the results from the hypothesis test output for the slope.

summary(lm_water)


Call:
lm(formula = Runoff ~ Precip, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-17603.8  -5338.0    332.1   3410.6  20875.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
Precip        3752.5      215.7  17.394  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8922 on 41 degrees of freedom
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8778 
F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

The p-value is 2.2e-16. This means reject the null hypothesis and must conclude that there is a relationship between runoff and precipitation.

16. Briefly describe the difference between (1) a confidence interval for the slope, (2) a confidence interval for the conditional mean of \(Y\) given \(x\), and (3) a prediction interval for individual observations.

Confidence interval of slope: This interval estimates the range within which the true population slope coefficient (what the actual increase in rainwater is per extra inch of precipitation).
Confidence interval for the conditional mean of \(Y\) given \(x\): This interval estimates the range within which the true mean value of Y for a specific value of x is likely to lie (how much runoff at specific values of precipitation).
Prediction interval for individual observations This interval estimates the range within which a new, individual observation of Y for a specific value of x is likely to fall (what would a new random value of runoff look given a specific value of precipitation).

17. Compute, print, and interpret a 95% confidence interval for the average of \(Y\) when \(x_i=30\). You may use the predict() function in R.

new.dat <- data.frame(Precip=30)
predict(lm_water, newdata = new.dat, interval = 'confidence')

       fit      lwr      upr
1 139589.2 131902.2 147276.1

We are 95% confident that that true value of average runoff is between lower bound: 131902.2 and the upper bound of 147276.1 given the precipitation is 30 inches.

18. Create a confidence band for the average of \(Y\) over a sequence of \(X\) values spanning the range of the data, and overlay this band (using a distinct color) on your previous scatterplot that you created in 4. Print the plot.

# Generate sequence of X values spanning the range of the data
x_seq <- seq(min(water$Precip), max(water$Precip), length.out = 100)

# Predict the average value of Y for each X value in the sequence
predicted <- predict(lm_water, newdata = data.frame(Precip = x_seq), interval = "confidence", level = 0.95)

# Plot original scatter plot
plot(water$Precip, water$Runoff, xlab = "X", ylab = "Y", main = "Scatterplot with Confidence Band", pch = 16)

# Add confidence band
lines(x_seq, predicted[, 1], col = "blue", lwd = 2)  # Mean line
polygon(c(x_seq, rev(x_seq)), c(predicted[, 2], rev(predicted[, 3])), col = rgb(0, 0, 1, alpha = 0.3), border = NA)  # Confidence band

# Add legend
legend("topright", legend = "95% Confidence Band", fill = rgb(0, 0, 1, alpha = 0.3))

19. Briefly explain why the confidence band is shaped the way that it is.

The band is widest at the ends, following the qq-normal plot where we are more unsure of the values at the edges. The more data points, the more confident of a range we can draw, as it is easier to predict in places where we know what we are working with.

20. Compute, print, and interpret a 95% prediction interval for \(Y\) when \(x_i=30\). You may use the predict() function in R.

# Predict the value of Y for x = 30 with a prediction interval
new_data <- data.frame(Precip = 30)
predicted_interval <- predict(lm_water, newdata = new_data, interval = "prediction", level = 0.95)

predicted_interval

       fit      lwr      upr
1 139589.2 119998.8 159179.5

We predict with 95% confidence that a data point of runoff will be between 119998.8 and 159179.5 acre-feet given that there is 30 inches of precipitation.

21. Create a prediction band for \(Y\) over a sequence of \(X\) values spanning the range of the data, and overlay this band (using a distinct color) on your previous scatterplot that you created in 4. Print the plot.

# Generate sequence of X values spanning the range of the data
x_seq <- seq(min(water$Precip), max(water$Precip), length.out = 100)

# Predict the average value of Y for each X value in the sequence
predicted <- predict(lm_water, newdata = data.frame(Precip = x_seq), interval = "predict", level = 0.95)

# Plot original scatter plot
plot(water$Precip, water$Runoff, xlab = "X", ylab = "Y", main = "Scatterplot with Confidence Band", pch = 16)

# Add confidence band
lines(x_seq, predicted[, 1], col = "red", lwd = 2)  # Mean line
polygon(c(x_seq, rev(x_seq)), c(predicted[, 2], rev(predicted[, 3])), col = rgb(1, 0, 0, alpha = 0.3), border = NA)  # Confidence band (red)

# Add legend
legend("topright", legend = "95% Confidence Band", fill = rgb(0, 0, 1, alpha = 0.3))

22. Briefly explain how/why the prediction band differs from the confidence band.

This band is much wider because we have seen values many random values, this confidence interval needs to cover 95% of them (generally, we hope it would, as long as our data isn’t skewed or something). So, this band is much wider to be able to hit any random point, instead of just the average (or at least what we imagine is it average). #### 23. What is the MSE (Mean Square Error) for the linear model you fit? Hint: you may refer to the ANOVA table results.

summary(lm_water)


Call:
lm(formula = Runoff ~ Precip, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-17603.8  -5338.0    332.1   3410.6  20875.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
Precip        3752.5      215.7  17.394  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8922 on 41 degrees of freedom
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8778 
F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

#the mse is the square of the residual standard error
8922^2

[1] 79602084

24. Briefly explain (1) what the MSE estimates and (2) a drawback to using it as a model evaluation metric.

The MSE represents the average squared difference between the observed values of the runoff and the values predicted by the linear regression model. MSE is a measure of the noise in the data that is not explained by the relationship of runoff to precipitation.

25. Calculate the RMSE (Root Mean Square Error) for the linear model you fit. Print and interpret the result.

#this is given as the Residual standard error:8922
summary(lm_water)


Call:
lm(formula = Runoff ~ Precip, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-17603.8  -5338.0    332.1   3410.6  20875.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
Precip        3752.5      215.7  17.394  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8922 on 41 degrees of freedom
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8778 
F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

26. Print a summary of the linear model. Briefly interpret the R-Squared (Coefficient of Determination) value.

summary(lm_water)


Call:
lm(formula = Runoff ~ Precip, data = water)

Residuals:
     Min       1Q   Median       3Q      Max 
-17603.8  -5338.0    332.1   3410.6  20875.6 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  27014.6     3218.9   8.393 1.93e-10 ***
Precip        3752.5      215.7  17.394  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 8922 on 41 degrees of freedom
Multiple R-squared:  0.8807,    Adjusted R-squared:  0.8778 
F-statistic: 302.6 on 1 and 41 DF,  p-value: < 2.2e-16

We have a Multiple R-squared of .8807. This means that ~88% of the variance in runoff can be explained by precipitation in the linear model.

27. What is the difference between the R-Squared value and the Adjusted R-Squared (shown in the summary output above)?

The R-squared value is just the explained variance over the total variance in the runoff. Adjusted R-squared however, adjusts the R-squared value by the sample size. This may help get a more accurate measurement of how good our model is at fitting. They both try to tell us how good our model is, but the adjusted R-squared tries to take more factors into account.

28. Look at the F-Statistic and corresponding \(p\)-value from the summary of the linear model (output shown above). Do these values indicate that \(X\) has a statistically significant linear association with \(Y\)?

Yes, the f-statistic tests if the predictor variable has a non-zero coeficcient, and the p-value tells us how likely this predictor variable was zero, how likely would our sample be? As you can see, we can confirm that the precipitation has a non-zero coeficient. \(x\) has a statistically significant association with \(Y\).

29. Briefly summarize what you learned, personally, from this analysis about the statistics, model fitting process, etc.

Personally, I learned that this is not as hard as I worried. I can find all of the relevant data, and come to the correct conclusions. It is easy if we know what to do, and as long as we prepare to work with the data, it will be easy.

30. Briefly summarize what you learned from this analysis to a non-statistician. Write a few sentences about (1) the purpose of this data set and analysis and (2) what you learned about this data set from your analysis. Write your response as if you were addressing the mayor of city that relies on the stream runoff from the Sierras (avoid using statistics jargon) and just provide the main take-aways.

1)The purpose of this data set was to look more at statistics generally. It is a wild world out there, and we always are trying to answer the simple question of, what is happening, is there cause and effect happening here? This data set and analysis focused on how we can look at the relationship of two variables. Are they correlated? Do they have a relationship? How can we know? 2)The main takeaway is that there is a relationship between precipitation and runoff. When there is a lot of rain, there will be a lot of runoff. If the mayor wants more runoff, he should do more to help the rain happen (perhaps he has a weather machine, or could pray harder for rain?).These are the main takeaways I would give to a leader.