Instructions

  1. Due Date: November 1. Canvas submission will be locked after the due date.
  2. Total Points: 175 points (+5 extra credit).
  3. Writing: Clearly communicate your conclusions when required. Submitting only \(\texttt{R}\) output without explanation will not receive full credit.
  4. \(\texttt{R}\) Code and Calculations: To receive credit, include all relevant code, calculations, and results within \(\texttt{R}\) code chunks in your submission.
  5. Submission Format: Submit your work as a PDF or HTML file, ensuring it contains all \(\texttt{R}\) code chunks, outputs, and any necessary plots.
  6. Academic Integrity: Follow the university’s academic integrity policy. Ensure all work is your own, with proper citations where applicable. While the use of generative AI is allowed for research, directly copying code from AI tools will result in a score of \(0\).




Question 1 (10 points)


What is your major? Based on your field of study, provide two examples of data where you could apply regression models—one for a linear regression model and one for a logistic regression model. For each example, describe the response variable and predictor(s), and explain how building a regression model would be useful for analyzing the data in that context. If you are not familiar with specific examples yet, consider the types of data you might work with in your future career and how regression models could help analyze those data. (10 points)

Answer: Biology is my current major. Linear Regression Model Example: Fertilizer Effect on Plant Growth and Response. Variable: Height of the plants. Predictor(s): Fertilizer application rate and maybe other variables like as sunshine exposure. In this case, using a linear regression model would assist quantify the association between fertilizer quantity and plant height. By examining the coefficients, researchers were able to calculate how much an increase in fertilizer correlated with an increase in height. This knowledge might help agricultural operations by finding the best fertilizer amounts for maximum development.

Logistic Regression Model Examples: Predicting Species Presence/Absence Response Variable: The presence or absence of a certain species in a given location. Predictor(s): Temperature, humidity, and food availability. Usefulness: A logistic regression model may be used to assess the likelihood of a species’ presence depending on a variety of environmental factors. This model can calculate the likelihood of presence in relation to the predictors, assisting ecologists in understanding habitat requirements and the potential consequences of environmental changes. This knowledge is critical for biodiversity conservation and management.




Question 2 (40 points)


Look at the following data1 derived from the Statistical Abstract of the United States, which details accidental oil spills at sea and the corresponding amount of oil lost over the years. The data is named oil_spill_data and includes the information on year (name Year) and average oil spilled per incident (named Avg_Oil_Spill) for each year. Look at the complete oil_spill_data below:

library("knitr")
oil_spill_data = data.frame(
  Year = c(1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985),
  Avg_Oil_Spill = c(2.35, 1.40, 4.18, 7.04, 4.35, 7.44, 11.13, 4.24, 1.37, 0.19, 22.81, 1.61, 1.88)
)

kable(oil_spill_data)
Year Avg_Oil_Spill
1973 2.35
1974 1.40
1975 4.18
1976 7.04
1977 4.35
1978 7.44
1979 11.13
1980 4.24
1981 1.37
1982 0.19
1983 22.81
1984 1.61
1985 1.88


B. Fit a linear regression model with Year as the predictor (\(x\)) and Avg_Oil_Spill as the response variable (\(y\)). Comment if the regression model is significant at level \(\alpha = 0.05\). (4 + 3 = 7 points)

Answer:

oilspill_linear = lm(Year ~ Avg_Oil_Spill, data = oil_spill_data)


C. Plot the residuals against the fitted values from the linear regression model. What are your conclusions from this plot? (3 + 3 = 6 points)

Answer: There isn’t a pattern in the data that we are considering so far. Therefore, we don’t have enough score here to consider a nonlinear model.

plot(oilspill_linear, which = 1, pch = 16, col  = "darkblue") 


D. Create a QQ-plot of the residuals from the model in part B. Based on the QQ-plot, what is your assessment of the normality assumption? (3 + 3 = 6 points)

Answer: Despite some deviation toward the beginning and end, most points lie close to the reference line. I can’t infer non-normality from the graph at this stage.

plot(oilspill_linear, which = 2, col ="blue")


E. Perform a Shapiro-Wilk test at a significance level of \(\alpha = 0.05\) to test the normality of the residuals. What is your conclusion based on the test results? (6 points)

Answer: The 𝑝value is 0.8086, way higher than typical significance values (such as 𝛼=0.05). Therefore, we don’t have enough evidence to reject the \(H_0\) of the Shapiro-Wilk test which assumes normality of the residuals.

shapiro.test(residuals(oilspill_linear)) 
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(oilspill_linear)
## W = 0.96364, p-value = 0.8086


F. Use Cook’s distance to identify influential points in the regression model. Plot these values and discuss how they relate to your observations from the scatterplot in part A. (4 + 5 = 9 points)

Answer:

cooks.distance(oilspill_linear)
##           1           2           3           4           5           6 
## 0.120857593 0.094313073 0.044207031 0.030565428 0.010445247 0.004617194 
##           7           8           9          10          11          12 
## 0.002038057 0.003604641 0.025492870 0.070550727 2.245330698 0.122532432 
##          13 
## 0.162434870




Question 3 (30 points)

Match the statements below with the corresponding terms from the list:

Statements Options
A. A measure that detects multicollinearity by quantifying how much the variance of a regression coefficient increases due to multicollinearity among the parameters. a. Confidence interval
B. The proportion of variance in the dependent variable that is explained by multiple predictors in a regression model. b. \(F\) statistic for model significance
C. The estimated value of the dependent variable when all independent variables are set to zero. c. Adjusted \(R^2\)
D. The sum of the squared differences between the actual and predicted values in a regression model. d. Residual standard error
E. A variable that may have only two discrete possible values. e. Correlation coefficient
F. A regression approach where variables are sequentially removed based on statistical criteria. f. Variance inflation factor (VIF)
G. The assumption that the error variance is constant across all levels of the independent variables. g. Categorical (binary) data
H. The ratio of the estimated regression coefficient to its standard error, used to test hypothesis significance. h. Cook’s distance
I. The discrepancy between an observed value and its predicted value in a linear regression model. i. Sum of squares of errors (SSE)
J. A measure that evaluates the strength and direction of the linear relationship between two variables. j. Intercept coefficient
K. A confidence band likely to contain the true population parameter with a specified probability. k. Backward elimination / stepwise regression
L. A statistic that measures how much an individual observation influences the overall regression model. l. \(R^2\)
M. A value in the regression output used to assess and compare models with different numbers of predictors. m. Homoscedasticity
N. A measure that evaluates the strength and direction of the monotonic relationship between two variables. n. Spearman’s rank correlation coefficient
O. A parameter in \(\texttt{R}\) output that gives the estimated square root of the residual variance. o. Prediction error / residual
p. t-statistic

Answer: A->f, B->l, C->j, D->i, E->g, F->k, G->m, H->p, I->o, J->e, K->a, L->h, M->c, N->n, O->d. 




Question 4 (65 points)


Thanks to the EPA, we will use the following dataset2 to analyze the Carbon monoxide emissions. The following code downloads the data in a dataframe named \(\texttt{co_data}\) and shows you only a snapshot of the full data. Here, Hour denotes the hour of the day, CO is the average CO concentration, Traffic is the average traffic density, Wind is the average wind-speed.

co_data = read.table(url("http://www.statsci.org/data/general/cofreewy.txt"), header = T)
kable(head(co_data))
Hour CO Traffic Wind
1 2.4 50 -0.2
2 1.7 26 0.0
3 1.4 16 0.0
4 1.2 10 0.0
5 1.2 12 0.1
6 2.0 41 -0.1

We are interested in building a linear regression model to predict the Carbon monoxide level using other variables as predictors. You will need to identify the relevant variable from the metadata.


A. Create three scatterplots, each showing the relationship between your response variable and one of the three predictors. Ensure that each scatterplot includes appropriate x and y labels as well as a clear title to receive full credit. (3 \(\times\) 3 = 9 points)

Answer:

Hour  = co_data$Hour
CO = co_data$CO
plot(Hour, CO, col = "red", pch = 16,
     xlab = "Hour", ylab = "CO") 

Traffic  = co_data$Traffic
CO = co_data$CO
plot(Hour, CO, col = "red", pch = 16,
     xlab = "Traffic", ylab = "CO") 

Wind  = co_data$Wind
CO = co_data$CO
plot(Wind, CO, col = "red", pch = 16,
     xlab = "Wind", ylab = "CO") 


B. Analyze the relationships shown in the scatterplots. Explain why it may be beneficial to introduce a nonlinear term for the hour of the day and wind speed in the model. (6 points)

Answer: The scatterplot most likely depicts a pattern in which CO levels change throughout the day, maybe increasing during peak traffic hours and dropping at night. This shows a nonlinear connection, since CO levels may not rise or fall in a straight line over time; instead, they may peak at certain times and fall at others.

# Code


C. Build the following polynomial regression model and print the summary output of the model. (5 points):

\[\text{Carbon monoxide} = \beta_0 + \beta_1 * \text{Traffic} + \beta_2 * \text{Hour} + \beta_3 * \text{Hour}^2 + \beta_4 * \text{Wind} + \beta_5 * \text{Wind}^2 + \epsilon.\]

Answer:

co_data$Hour2 <- co_data$Hour^2
co_data$Wind2 <- co_data$Wind^2
model <- lm(CO ~ Traffic + Hour + Hour2 + Wind + Wind2, data = co_data)
summary(model)
## 
## Call:
## lm(formula = CO ~ Traffic + Hour + Hour2 + Wind + Wind2, data = co_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.65986 -0.24589  0.01537  0.26923  0.66329 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.161504   0.379573   3.060  0.00674 ** 
## Traffic      0.017287   0.001555  11.117 1.71e-09 ***
## Hour         0.050510   0.107253   0.471  0.64334    
## Hour2       -0.002589   0.003768  -0.687  0.50074    
## Wind         0.677038   0.206508   3.279  0.00417 ** 
## Wind2       -0.091082   0.033285  -2.736  0.01356 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4304 on 18 degrees of freedom
## Multiple R-squared:  0.9678, Adjusted R-squared:  0.9588 
## F-statistic: 108.1 on 5 and 18 DF,  p-value: 8.848e-13


D. From the summary output above, explain the fitted intercept coefficient (\(\hat{\beta_0}\)). (5 points)

Answer: In the provided summary output, the fitted intercept coefficient (\(\hat{\beta_0}\)) is estimated to be 1.161504. This value represents the expected average level of Carbon monoxide (CO) concentration when all predictor variables (Traffic, Hour, Hour², Wind, and Wind²) are set to zero.


E. From the summary output in part C, explain the fitted slope coefficient corresponding to average weekday traffic density. (5 points)

Answer: In the given summary output, the fitted slope coefficient for average weekday traffic density is 0.017287. This coefficient shows the projected change in the average carbon monoxide (CO) concentration for each one-unit increase in average traffic density, while keeping all other factors (Hour, Hour², Wind, and Wind² constant).


F. Print the coefficient matrix between hour of the day, average weekday traffic density, and wind-speed. Comment if there may be a problem of multicollinearity following a cutoff value of \(\pm 0.6\). (2 + 3 = 5 points)

Answer:

cor(co_data[c("Hour", "Traffic", "Wind")])
##              Hour   Traffic      Wind
## Hour    1.0000000 0.4285228 0.4226381
## Traffic 0.4285228 1.0000000 0.6134436
## Wind    0.4226381 0.6134436 1.0000000


G. Perform the hypothesis test \(H_0\): Coefficient for Traffic \(= 0\); \(H_A\): Coefficient for Traffic \(\neq 0\) (at level \(\alpha = 0.05\))? (5 points)

Answer: \(H_0\):βTraffic​=0 \(H_A\) : βTraffic​ not equal 0


H. How can you draw the same conlucion from part G from the confidence interval of the coefficients? (5 points)

Answer: Using the confidence interval approach confirms that average weekday traffic density is a significant predictor of Carbon monoxide levels, as the entire interval is positive and does not contain zero, leading to the same conclusion as the hypothesis test.

# Code


I. Using the \(p\)-value criteria that every single predictor variable has to be significant at level \(\alpha = 0.05\), perform a backward selection. What are the predictors in your final chosen model? (15 points)

Answer:

full_model <- lm(CO ~ Traffic + Hour + Wind, data = co_data)
summary(full_model)
## 
## Call:
## lm(formula = CO ~ Traffic + Hour + Wind, data = co_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.75030 -0.33275 -0.09021  0.22653  1.25112 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.318967   0.242522   5.439 2.53e-05 ***
## Traffic      0.018402   0.001413  13.026 3.15e-11 ***
## Hour        -0.005689   0.017066  -0.333  0.74233    
## Wind         0.179189   0.059517   3.011  0.00691 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5096 on 20 degrees of freedom
## Multiple R-squared:  0.9498, Adjusted R-squared:  0.9423 
## F-statistic: 126.1 on 3 and 20 DF,  p-value: 3.682e-13
final_model <- step(full_model, direction = "backward", criteria = "AIC")
## Start:  AIC=-28.73
## CO ~ Traffic + Hour + Wind
## 
##           Df Sum of Sq    RSS     AIC
## - Hour     1     0.029  5.224 -30.597
## <none>                  5.195 -28.730
## - Wind     1     2.354  7.549 -21.759
## - Traffic  1    44.070 49.265  23.260
## 
## Step:  AIC=-30.6
## CO ~ Traffic + Wind
## 
##           Df Sum of Sq    RSS     AIC
## <none>                  5.224 -30.597
## - Wind     1     2.357  7.581 -23.659
## - Traffic  1    46.117 51.341  22.250
summary(final_model)
## 
## Call:
## lm(formula = CO ~ Traffic + Wind, data = co_data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.72858 -0.31710 -0.09629  0.22409  1.26554 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.274461   0.198137   6.432 2.25e-06 ***
## Traffic     0.018290   0.001343  13.616 6.85e-12 ***
## Wind        0.174747   0.056765   3.078   0.0057 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4987 on 21 degrees of freedom
## Multiple R-squared:  0.9495, Adjusted R-squared:  0.9447 
## F-statistic: 197.5 on 2 and 21 DF,  p-value: 2.419e-14


J. Compare all the models you built while backward selection in question 4 using the adjusted \(R^2\) criterion. Identify which model performs the best based on this criterion (3 + 2 = 5 points)

Answer: The polynomial regression model performed the best.




Question 5: (35 points)

3

Here is a data on \(303\) patients with chest pain. First, read the following metadata for a few relevant variables:

Variable Description
Age Age
Sex Gender (1 = male, 0 = female)
RestBP Resting blood pressure
Chol Cholesterol
MaxHR Maximum heart rate achieved
Oldpeak ST depression induced by exercise relative to rest
Ca Number of major vessels (0-3) colored by fluoroscopy
ExAng Exercise induced angina (1 = yes, 0 = no)
AHD Diagnosis of heart disease (1 = presence of heart disease, 0 = absence)


My following code downloads the data 4 in a dataframe called \(\texttt{heart_data}\), creates the response variable, and shows a snapshot of the data:

heart_data = read.csv(url("https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/refs/heads/master/Notebooks/Data/Heart.csv"))
heart_data$AHD = as.numeric(heart_data$AHD == "Yes")
kable(head(heart_data))
X Age Sex ChestPain RestBP Chol Fbs RestECG MaxHR ExAng Oldpeak Slope Ca Thal AHD
1 63 1 typical 145 233 1 2 150 0 2.3 3 0 fixed 0
2 67 1 asymptomatic 160 286 0 2 108 1 1.5 2 3 normal 1
3 67 1 asymptomatic 120 229 0 2 129 1 2.6 2 2 reversable 1
4 37 1 nonanginal 130 250 0 0 187 0 3.5 3 0 normal 0
5 41 0 nontypical 130 204 0 2 172 0 1.4 1 0 normal 0
6 56 1 nontypical 120 236 0 0 178 0 0.8 1 0 normal 0


A. We want to build a logistic regression model to predict the diagnosis of heart disease (variable AHD) using the following predictors: age, sex, resting blood pressure, cholesterol, maximum heart rate. Build the logistic regression model in \(\texttt{R}\) and show the summary output of the model. (10 points)

Answer:

AHD= heart_data$AHD
age = heart_data$Age
sex = heart_data$Sex
RestBP= heart_data$RestBP
Chol=heart_data$Chol
MaxHR=heart_data$MaxHR
heart_model= glm( AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
summary(heart_model)
## 
## Call:
## glm(formula = AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.193371   1.824108   0.106  0.91558    
## age          0.013836   0.017589   0.787  0.43152    
## sex          1.816238   0.339230   5.354 8.60e-08 ***
## RestBP       0.021168   0.008363   2.531  0.01137 *  
## Chol         0.007115   0.002757   2.581  0.00985 ** 
## MaxHR       -0.046365   0.007660  -6.052 1.43e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 318.80  on 297  degrees of freedom
## AIC: 330.8
## 
## Number of Fisher Scoring iterations: 4


B. Based on the model from part A, interpret the estimated slope coefficient for the variable Age. (5 points)

Answer: 0.013836

# Code


C. Apply a backward elimination procedure using the \(\texttt{step}\) function in \(\texttt{R}\). Show the summary of your final model after elimination. (10 points)

Answer:

heart_model= glm( AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
final_model <- step(heart_model, direction = "backward", criteria = "AIC")
## Start:  AIC=330.8
## AHD ~ age + sex + RestBP + Chol + MaxHR
## 
##          Df Deviance    AIC
## - age     1   319.42 329.42
## <none>        318.80 330.80
## - RestBP  1   325.51 335.51
## - Chol    1   325.55 335.55
## - sex     1   352.75 362.75
## - MaxHR   1   365.07 375.07
## 
## Step:  AIC=329.42
## AHD ~ sex + RestBP + Chol + MaxHR
## 
##          Df Deviance    AIC
## <none>        319.42 329.42
## - Chol    1   326.95 334.95
## - RestBP  1   327.77 335.77
## - sex     1   352.84 360.84
## - MaxHR   1   378.73 386.73
summary(final_model)
## 
## Call:
## glm(formula = AHD ~ sex + RestBP + Chol + MaxHR, family = "binomial")
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.977175   1.531644   0.638  0.52348    
## sex          1.792557   0.337157   5.317 1.06e-07 ***
## RestBP       0.022865   0.008121   2.816  0.00487 ** 
## Chol         0.007452   0.002727   2.733  0.00628 ** 
## MaxHR       -0.048452   0.007244  -6.689 2.25e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 417.98  on 302  degrees of freedom
## Residual deviance: 319.42  on 298  degrees of freedom
## AIC: 329.42
## 
## Number of Fisher Scoring iterations: 4


D. Use the code from this article. Plot the ROC (Receiver Operating Characteristic) curve for the final logistic regression model obtained in part C using \(\texttt{R}\). Make sure to print the AUC (area under the curve) value too. Careful: you need to install a new package. (10 points)

Answer:

install.packages("pROC")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(pROC)
## Type 'citation("pROC")' for a citation.
## 
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
## 
##     cov, smooth, var
data(mtcars)
glm(formula = mpg_binary ~ disp + hp, family = "binomial", data = mtcars)
## Error in eval(predvars, data, env): object 'mpg_binary' not found
predicted_probs= predict(model, type = "response")
roc_obj= roc(mtcars$mpg_cat, predicted_probs)
## Error in roc.default(mtcars$mpg_cat, predicted_probs): No valid data provided.
plot(roc_obj, main = "ROC Curve", print.auc = TRUE, auc.polygon = TRUE, grid = TRUE)
## Error in eval(expr, envir, enclos): object 'roc_obj' not found

  1. Courtesy: Hamilton, L. C. (1992). Regression with Graphics. Duxbury Press, Belmont, California, page 49.↩︎

  2. Courtesy: Smyth, Gordon K (2011). Australasian Data and Story Library (OzDASL). https://gksmyth.github.io/ozdasl↩︎

  3. Courtesy: An Introduction to Statistical Learning, James et al.↩︎

  4. Thanks to the ISLR-python Github repository by Jordi Warmenhoven↩︎