Note: This lab was adapted from content developed by Qian (2016) and Helsel et al. (2020).
In this assignment, we will focus on simple linear regression (SLR) and multiple linear regression (MLR).
Follow the prompts and complete the exercises below. Please document all of your work and your examples in RMarkdown and knit them as a .html file. Please also make sure to show all of your work in code chunks in the notebook. All accompanying text prose for your answers should be in text in the notebook.
As with previous assignments, >Crimson blocks of text are action items for you!
After this assignment, you will be able to:
Similar to past assignments, a key element of this assignment is importing datasets. Functions that are commonly used for importing data include read.csv() and read.table(). Additionally, you can use the readr package within the tidyverse library.
If needed, here is a non-exhaustive list of some resources related to these functions:
R Data Import Tutorial by Karlijn Willems https://www.datacamp.com/community/tutorials/r-data-import-tutorial
“Reading and Importing Excel Files into R Tutorial” by Karlijn Willems https://www.datacamp.com/community/tutorials/r-tutorial-read-excel-into-r
Chapter 11 in the R for Data Science book https://r4ds.had.co.nz/data-import.html
One final note, when importing data, it is helpful to pay particular attention to your working directory and file paths.
Scenario 1: Use data from stream gauges (see q90_gagesinfo.RData in CatCourses) to explore the relation between drainage area and the Q90 streamflow value. Q90 is the daily streamflow value expected to be exceeded about 90 percent of the time and often used as a measure of low streamflow frequency in practice. Given this information, complete the following tasks:
1. Determine if there is a linear relation between drainage area and Q90 is present. Does a logarithmic transformation of the two variables improve the relation?
load("C:/Users/noahs/Downloads/q90_gagesinfo.RData")
area <- q90_gagesinfo$area_mi2
cfs <- q90_gagesinfo$Q90_cfs
#check distribution of both variables
hist(area)
hist(cfs)
plot(area, cfs)
cat("It appears to have an exponential relationship. We will try a logarithmic transformation")
## It appears to have an exponential relationship. We will try a logarithmic transformation
cfs_log <- log(cfs)
area_log <- log(area)
plot(area_log, cfs_log)
cat("Linearity improved after doing a log transformation for both variables")
## Linearity improved after doing a log transformation for both variables
2. Using the results from Question 1, create a regression equation that can be used to estimate Q90 based on drainage area. Is drainage area a significant predictor of the Q90 streamflow? Provide evidence. Does the sign of the coefficient on drainage area match your intuition? Explain.?
stream_regression <- lm(cfs_log ~ area_log, data = q90_gagesinfo)
summary(stream_regression)
##
## Call:
## lm(formula = cfs_log ~ area_log, data = q90_gagesinfo)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.63567 -0.32925 -0.07835 0.24060 0.88128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.27734 0.29042 -7.841 9.46e-09 ***
## area_log 1.17536 0.07337 16.019 3.02e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4473 on 30 degrees of freedom
## Multiple R-squared: 0.8953, Adjusted R-squared: 0.8918
## F-statistic: 256.6 on 1 and 30 DF, p-value: 3.018e-16
#If p < 0.05 → drainage area is a statistically significant predictor of Q90 streamflow.
#If p ≥ 0.05 → drainage area is not statistically significant.
cat("The regression equation is: Q90cfs_log = −2.27734 + 1.17536×area_mi2_log.\n")
## The regression equation is: Q90cfs_log = −2.27734 + 1.17536×area_mi2_log.
cat("Based on the p-value of 3.02E-16 which is sufficiently smaller than our alpha value of 0.05, we can determine that drainage area is a significant predictor of Q90 streamflow.\n")
## Based on the p-value of 3.02E-16 which is sufficiently smaller than our alpha value of 0.05, we can determine that drainage area is a significant predictor of Q90 streamflow.
cat("The sign of the log drainage-area coefficient is positive, which matches hydrologic expectations because larger watersheds typically generate larger baseflow discharges.")
## The sign of the log drainage-area coefficient is positive, which matches hydrologic expectations because larger watersheds typically generate larger baseflow discharges.
3. Based on the plot of residuals from the regression analysis, does the assumption of homogeneous variance appear to be met? Why or why not? This will help us determine whether it is appropriate to use the regression equation to make predictions.
par(mfrow=c(2,2))
plot(stream_regression)
cat("Because the residuals vs fitted plot does not appear to be randomly scattered around 0, it fails homoscedasticity; therefore, the assumption of homogenous variance does not pass.\n")
## Because the residuals vs fitted plot does not appear to be randomly scattered around 0, it fails homoscedasticity; therefore, the assumption of homogenous variance does not pass.
Note: If you need a refresher on transformations for regressions, Section 9.6 in Helsel et al. (2020) textbook and Section 5.4 in the Qian (2016) textbook should be helpful.
Scenario 2: McDonald and Schwing (1973) collected data for mortality rates and various environmental factors from 60 U.S. metropolitan areas. The data is stored as pollution.csv in CatCourses. Definitions of the variables can be found in pollution.txt in CatCourses. For this exercise, we will model mortality rate given concentrations of various air pollutants (nitrogen oxides, sulfur dioxide, and hydrocarbons). Given this information, complete the following tasks:
4. Create a scatter plot of mortality rate versus level of nitrogen oxides. Does it appear that a linear model will fit these data well?
pollution <- read.csv("C:/Users/noahs/Downloads/pollution.csv")
attach(pollution)
plot(mort, nox)
cat("It is unclear that there is a linear relationship.")
## It is unclear that there is a linear relationship.
5. Conduct a regression analysis for mortality rate and level of nitrogen oxides. Interpret the regression coefficient for level of nitrogen oxides (including whether it is statistically significant) and the overall fit of the model. Include an evaluation of the residual plots from the regression as part of your analysis.
pollution_regress <- lm(mort ~ nox, data = pollution)
summary(pollution_regress)
##
## Call:
## lm(formula = mort ~ nox, data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -148.654 -43.710 1.751 41.663 172.211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 942.7115 9.0034 104.706 <2e-16 ***
## nox -0.1039 0.1758 -0.591 0.557
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 62.55 on 58 degrees of freedom
## Multiple R-squared: 0.005987, Adjusted R-squared: -0.01115
## F-statistic: 0.3494 on 1 and 58 DF, p-value: 0.5568
cat("The regression equation is: Mortality rate = 942.71 -0.1039(nox)\n")
## The regression equation is: Mortality rate = 942.71 -0.1039(nox)
cat("The results show that this is not statistically significant as the p-value is 0.557 which is greater than our alpha value of 0.05.\n")
## The results show that this is not statistically significant as the p-value is 0.557 which is greater than our alpha value of 0.05.
6. Perform a logarithmic transformation on the data you analyzed in Part A. Conduct a regression analysis with the transformed data. Interpret the regression coefficient for nitrogen oxides. Include an evaluation of the residual plots from the regression as part of your analysis.
hist(mort)
hist(nox)
log_mort <- log(mort)
log_nox <- log(nox)
hist(log_mort)
hist(log_nox)
#improved after log transformation
log_pollution_regress <- lm(log_mort ~ log_nox, data = pollution)
summary(log_pollution_regress)
##
## Call:
## lm(formula = log_mort ~ log_nox, data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.18930 -0.02957 0.01132 0.03897 0.16275
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.807175 0.018349 370.975 <2e-16 ***
## log_nox 0.015893 0.007048 2.255 0.0279 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06412 on 58 degrees of freedom
## Multiple R-squared: 0.08061, Adjusted R-squared: 0.06476
## F-statistic: 5.085 on 1 and 58 DF, p-value: 0.02792
cat("The regression equation is: Mortality rate = 6.81 + 0.016(nox)\n")
## The regression equation is: Mortality rate = 6.81 + 0.016(nox)
cat("The results show that this is statistically significant as the p-value is 0.0279 which is less than our alpha value of 0.05.\n")
## The results show that this is statistically significant as the p-value is 0.0279 which is less than our alpha value of 0.05.
7. Now, let’s look at all pollutants simultaneously. Fit a model predicting mortality rate using levels of nitrogen oxides, sulfur dioxide, and hydrocarbons as inputs. Use logarithmic transformations if/when appropriate. Plot the fitted regression model and interpret the coefficients.
#we will use log_nox and log_mort
hist(so2)
hist(hc)
log_so2 <- log(so2)
log_hc <- log(hc)
hist(log_so2)
hist(log_hc)
all_pollution_regress <- lm(log_mort ~ log_nox + log_so2 + log_hc , data = pollution)
summary(all_pollution_regress)
##
## Call:
## lm(formula = log_mort ~ log_nox + log_so2 + log_hc, data = pollution)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.10874 -0.03574 -0.00218 0.03709 0.20085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.826749 0.022701 300.726 < 2e-16 ***
## log_nox 0.059837 0.023021 2.599 0.01192 *
## log_so2 0.014309 0.007584 1.887 0.06436 .
## log_hc -0.060812 0.020553 -2.959 0.00452 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05753 on 56 degrees of freedom
## Multiple R-squared: 0.2852, Adjusted R-squared: 0.2469
## F-statistic: 7.449 on 3 and 56 DF, p-value: 0.0002777
par(mfrow=c(2,2))
plot(all_pollution_regress)
cat("A 1% increase in nitrogen oxides (NOₓ) is associated with a 0.06% increase in mortality, holding SO₂ and hydrocarbons constant.\n")
## A 1% increase in nitrogen oxides (NOₓ) is associated with a 0.06% increase in mortality, holding SO₂ and hydrocarbons constant.
cat("A 1% increase in sulfur dioxide (SO₂) is associated with a 0.01% increase in mortality, holding NOₓ and hydrocarbons constant\n")
## A 1% increase in sulfur dioxide (SO₂) is associated with a 0.01% increase in mortality, holding NOₓ and hydrocarbons constant
cat("A 1% increase in hydrocarbons is associated with a 0.06% decrease in mortality, holding NOₓ and SO₂ constant\n")
## A 1% increase in hydrocarbons is associated with a 0.06% decrease in mortality, holding NOₓ and SO₂ constant
Following are two Bonus Questions associated with Scenario 2. If completed, you will be eligible for up to a 5% bonus on this assignment.
BQ1. Investigate potential interaction effects among the three predictors. You can do so in a variety of ways (e.g., correlation tests, conditional plots, etc.). If you suspect interaction effects are important, refit the model with these interactions and interpret the fitted model coefficients.
Hint: conditional plots can be generated using either the coplot() function or the xyplot() function.
BQ2. Of the different models you analyzed for Scenario 2, which seems to be most suitable for the data/analysis? Why?