Answer: Biology is my current major. Linear Regression Model Example: Fertilizer Effect on Plant Growth and Response. Variable: Height of the plants. Predictor(s): Fertilizer application rate and maybe other variables like as sunshine exposure. In this case, using a linear regression model would assist quantify the association between fertilizer quantity and plant height. By examining the coefficients, researchers were able to calculate how much an increase in fertilizer correlated with an increase in height. This knowledge might help agricultural operations by finding the best fertilizer amounts for maximum development.
Logistic Regression Model Examples: Predicting Species Presence/Absence Response Variable: The presence or absence of a certain species in a given location. Predictor(s): Temperature, humidity, and food availability. Usefulness: A logistic regression model may be used to assess the likelihood of a species’ presence depending on a variety of environmental factors. This model can calculate the likelihood of presence in relation to the predictors, assisting ecologists in understanding habitat requirements and the potential consequences of environmental changes. This knowledge is critical for biodiversity conservation and management.
oil_spill_data and includes
the information on year (name Year) and average oil spilled
per incident (named Avg_Oil_Spill) for each year.
Look at the complete oil_spill_data below:library("knitr")
oil_spill_data = data.frame(
Year = c(1973, 1974, 1975, 1976, 1977, 1978, 1979, 1980, 1981, 1982, 1983, 1984, 1985),
Avg_Oil_Spill = c(2.35, 1.40, 4.18, 7.04, 4.35, 7.44, 11.13, 4.24, 1.37, 0.19, 22.81, 1.61, 1.88)
)
kable(oil_spill_data)
| Year | Avg_Oil_Spill |
|---|---|
| 1973 | 2.35 |
| 1974 | 1.40 |
| 1975 | 4.18 |
| 1976 | 7.04 |
| 1977 | 4.35 |
| 1978 | 7.44 |
| 1979 | 11.13 |
| 1980 | 4.24 |
| 1981 | 1.37 |
| 1982 | 0.19 |
| 1983 | 22.81 |
| 1984 | 1.61 |
| 1985 | 1.88 |
Year versus
Avg_Oil_Spill. What observations or trends can you infer
from the plot? (3 + 3 = 6 points)Answer:
Year = oil_spill_data$Year
Avg_Oil_Spill = oil_spill_data$Avg_Oil_Spill
plot(Year, Avg_Oil_Spill, col = "red", pch = 16,
xlab = "Year", ylab = "Avg_Oil_Spill")
Year as the
predictor (\(x\)) and
Avg_Oil_Spill as the response variable (\(y\)). Comment if the regression model is
significant at level \(\alpha = 0.05\).
(4 + 3 = 7 points)Answer:
oilspill_linear = lm(Year ~ Avg_Oil_Spill, data = oil_spill_data)
Answer: There isn’t a pattern in the data that we are considering so far. Therefore, we don’t have enough score here to consider a nonlinear model.
plot(oilspill_linear, which = 1, pch = 16, col = "darkblue")
Answer: Despite some deviation toward the beginning and end, most points lie close to the reference line. I can’t infer non-normality from the graph at this stage.
plot(oilspill_linear, which = 2, col ="blue")
Answer: The 𝑝value is 0.8086, way higher than typical significance values (such as 𝛼=0.05). Therefore, we don’t have enough evidence to reject the \(H_0\) of the Shapiro-Wilk test which assumes normality of the residuals.
shapiro.test(residuals(oilspill_linear))
##
## Shapiro-Wilk normality test
##
## data: residuals(oilspill_linear)
## W = 0.96364, p-value = 0.8086
Answer:
cooks.distance(oilspill_linear)
## 1 2 3 4 5 6
## 0.120857593 0.094313073 0.044207031 0.030565428 0.010445247 0.004617194
## 7 8 9 10 11 12
## 0.002038057 0.003604641 0.025492870 0.070550727 2.245330698 0.122532432
## 13
## 0.162434870
| Statements | Options |
|---|---|
| A. A measure that detects multicollinearity by quantifying how much the variance of a regression coefficient increases due to multicollinearity among the parameters. | a. Confidence interval |
| B. The proportion of variance in the dependent variable that is explained by multiple predictors in a regression model. | b. \(F\) statistic for model significance |
| C. The estimated value of the dependent variable when all independent variables are set to zero. | c. Adjusted \(R^2\) |
| D. The sum of the squared differences between the actual and predicted values in a regression model. | d. Residual standard error |
| E. A variable that may have only two discrete possible values. | e. Correlation coefficient |
| F. A regression approach where variables are sequentially removed based on statistical criteria. | f. Variance inflation factor (VIF) |
| G. The assumption that the error variance is constant across all levels of the independent variables. | g. Categorical (binary) data |
| H. The ratio of the estimated regression coefficient to its standard error, used to test hypothesis significance. | h. Cook’s distance |
| I. The discrepancy between an observed value and its predicted value in a linear regression model. | i. Sum of squares of errors (SSE) |
| J. A measure that evaluates the strength and direction of the linear relationship between two variables. | j. Intercept coefficient |
| K. A confidence band likely to contain the true population parameter with a specified probability. | k. Backward elimination / stepwise regression |
| L. A statistic that measures how much an individual observation influences the overall regression model. | l. \(R^2\) |
| M. A value in the regression output used to assess and compare models with different numbers of predictors. | m. Homoscedasticity |
| N. A measure that evaluates the strength and direction of the monotonic relationship between two variables. | n. Spearman’s rank correlation coefficient |
| O. A parameter in \(\texttt{R}\) output that gives the estimated square root of the residual variance. | o. Prediction error / residual |
| p. t-statistic |
Answer: A->f, B->l, C->j, D->i, E->g, F->k, G->m, H->p, I->o, J->e, K->a, L->h, M->c, N->n, O->d.
Hour denotes the hour of
the day, CO is the average CO concentration,
Traffic is the average traffic density, Wind
is the average wind-speed. co_data = read.table(url("http://www.statsci.org/data/general/cofreewy.txt"), header = T)
kable(head(co_data))
| Hour | CO | Traffic | Wind |
|---|---|---|---|
| 1 | 2.4 | 50 | -0.2 |
| 2 | 1.7 | 26 | 0.0 |
| 3 | 1.4 | 16 | 0.0 |
| 4 | 1.2 | 10 | 0.0 |
| 5 | 1.2 | 12 | 0.1 |
| 6 | 2.0 | 41 | -0.1 |
Answer:
Hour = co_data$Hour
CO = co_data$CO
plot(Hour, CO, col = "red", pch = 16,
xlab = "Hour", ylab = "CO")
Traffic = co_data$Traffic
CO = co_data$CO
plot(Hour, CO, col = "red", pch = 16,
xlab = "Traffic", ylab = "CO")
Wind = co_data$Wind
CO = co_data$CO
plot(Wind, CO, col = "red", pch = 16,
xlab = "Wind", ylab = "CO")
Answer: The scatterplot most likely depicts a pattern in which CO levels change throughout the day, maybe increasing during peak traffic hours and dropping at night. This shows a nonlinear connection, since CO levels may not rise or fall in a straight line over time; instead, they may peak at certain times and fall at others.
# Code
\[\text{Carbon monoxide} = \beta_0 + \beta_1 * \text{Traffic} + \beta_2 * \text{Hour} + \beta_3 * \text{Hour}^2 + \beta_4 * \text{Wind} + \beta_5 * \text{Wind}^2 + \epsilon.\]
Answer:
co_data$Hour2 <- co_data$Hour^2
co_data$Wind2 <- co_data$Wind^2
model <- lm(CO ~ Traffic + Hour + Hour2 + Wind + Wind2, data = co_data)
summary(model)
##
## Call:
## lm(formula = CO ~ Traffic + Hour + Hour2 + Wind + Wind2, data = co_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.65986 -0.24589 0.01537 0.26923 0.66329
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.161504 0.379573 3.060 0.00674 **
## Traffic 0.017287 0.001555 11.117 1.71e-09 ***
## Hour 0.050510 0.107253 0.471 0.64334
## Hour2 -0.002589 0.003768 -0.687 0.50074
## Wind 0.677038 0.206508 3.279 0.00417 **
## Wind2 -0.091082 0.033285 -2.736 0.01356 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4304 on 18 degrees of freedom
## Multiple R-squared: 0.9678, Adjusted R-squared: 0.9588
## F-statistic: 108.1 on 5 and 18 DF, p-value: 8.848e-13
Answer: In the provided summary output, the fitted intercept coefficient (\(\hat{\beta_0}\)) is estimated to be 1.161504. This value represents the expected average level of Carbon monoxide (CO) concentration when all predictor variables (Traffic, Hour, Hour², Wind, and Wind²) are set to zero.
Answer: In the given summary output, the fitted slope coefficient for average weekday traffic density is 0.017287. This coefficient shows the projected change in the average carbon monoxide (CO) concentration for each one-unit increase in average traffic density, while keeping all other factors (Hour, Hour², Wind, and Wind² constant).
Answer:
cor(co_data[c("Hour", "Traffic", "Wind")])
## Hour Traffic Wind
## Hour 1.0000000 0.4285228 0.4226381
## Traffic 0.4285228 1.0000000 0.6134436
## Wind 0.4226381 0.6134436 1.0000000
Answer: \(H_0\):βTraffic=0 \(H_A\) : βTraffic not equal 0
Answer: Using the confidence interval approach confirms that average weekday traffic density is a significant predictor of Carbon monoxide levels, as the entire interval is positive and does not contain zero, leading to the same conclusion as the hypothesis test.
# Code
Answer:
full_model <- lm(CO ~ Traffic + Hour + Wind, data = co_data)
summary(full_model)
##
## Call:
## lm(formula = CO ~ Traffic + Hour + Wind, data = co_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.75030 -0.33275 -0.09021 0.22653 1.25112
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.318967 0.242522 5.439 2.53e-05 ***
## Traffic 0.018402 0.001413 13.026 3.15e-11 ***
## Hour -0.005689 0.017066 -0.333 0.74233
## Wind 0.179189 0.059517 3.011 0.00691 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5096 on 20 degrees of freedom
## Multiple R-squared: 0.9498, Adjusted R-squared: 0.9423
## F-statistic: 126.1 on 3 and 20 DF, p-value: 3.682e-13
final_model <- step(full_model, direction = "backward", criteria = "AIC")
## Start: AIC=-28.73
## CO ~ Traffic + Hour + Wind
##
## Df Sum of Sq RSS AIC
## - Hour 1 0.029 5.224 -30.597
## <none> 5.195 -28.730
## - Wind 1 2.354 7.549 -21.759
## - Traffic 1 44.070 49.265 23.260
##
## Step: AIC=-30.6
## CO ~ Traffic + Wind
##
## Df Sum of Sq RSS AIC
## <none> 5.224 -30.597
## - Wind 1 2.357 7.581 -23.659
## - Traffic 1 46.117 51.341 22.250
summary(final_model)
##
## Call:
## lm(formula = CO ~ Traffic + Wind, data = co_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.72858 -0.31710 -0.09629 0.22409 1.26554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.274461 0.198137 6.432 2.25e-06 ***
## Traffic 0.018290 0.001343 13.616 6.85e-12 ***
## Wind 0.174747 0.056765 3.078 0.0057 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4987 on 21 degrees of freedom
## Multiple R-squared: 0.9495, Adjusted R-squared: 0.9447
## F-statistic: 197.5 on 2 and 21 DF, p-value: 2.419e-14
Answer: The polynomial regression model performed the best.
| Variable | Description |
|---|---|
Age |
Age |
Sex |
Gender (1 = male, 0 = female) |
RestBP |
Resting blood pressure |
Chol |
Cholesterol |
MaxHR |
Maximum heart rate achieved |
Oldpeak |
ST depression induced by exercise relative to rest |
Ca |
Number of major vessels (0-3) colored by fluoroscopy |
ExAng |
Exercise induced angina (1 = yes, 0 = no) |
AHD |
Diagnosis of heart disease (1 = presence of heart disease, 0 = absence) |
heart_data = read.csv(url("https://raw.githubusercontent.com/JWarmenhoven/ISLR-python/refs/heads/master/Notebooks/Data/Heart.csv"))
heart_data$AHD = as.numeric(heart_data$AHD == "Yes")
kable(head(heart_data))
| X | Age | Sex | ChestPain | RestBP | Chol | Fbs | RestECG | MaxHR | ExAng | Oldpeak | Slope | Ca | Thal | AHD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 63 | 1 | typical | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | fixed | 0 |
| 2 | 67 | 1 | asymptomatic | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | normal | 1 |
| 3 | 67 | 1 | asymptomatic | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | reversable | 1 |
| 4 | 37 | 1 | nonanginal | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | normal | 0 |
| 5 | 41 | 0 | nontypical | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | normal | 0 |
| 6 | 56 | 1 | nontypical | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | normal | 0 |
AHD) using the following predictors: age, sex, resting blood pressure, cholesterol,
maximum heart rate. Build the logistic regression model in \(\texttt{R}\) and show the summary output of
the model. (10 points)Answer:
AHD= heart_data$AHD
age = heart_data$Age
sex = heart_data$Sex
RestBP= heart_data$RestBP
Chol=heart_data$Chol
MaxHR=heart_data$MaxHR
heart_model= glm( AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
summary(heart_model)
##
## Call:
## glm(formula = AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.193371 1.824108 0.106 0.91558
## age 0.013836 0.017589 0.787 0.43152
## sex 1.816238 0.339230 5.354 8.60e-08 ***
## RestBP 0.021168 0.008363 2.531 0.01137 *
## Chol 0.007115 0.002757 2.581 0.00985 **
## MaxHR -0.046365 0.007660 -6.052 1.43e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 318.80 on 297 degrees of freedom
## AIC: 330.8
##
## Number of Fisher Scoring iterations: 4
Answer: 0.013836
# Code
Answer:
heart_model= glm( AHD ~ age + sex + RestBP + Chol + MaxHR, family = "binomial")
final_model <- step(heart_model, direction = "backward", criteria = "AIC")
## Start: AIC=330.8
## AHD ~ age + sex + RestBP + Chol + MaxHR
##
## Df Deviance AIC
## - age 1 319.42 329.42
## <none> 318.80 330.80
## - RestBP 1 325.51 335.51
## - Chol 1 325.55 335.55
## - sex 1 352.75 362.75
## - MaxHR 1 365.07 375.07
##
## Step: AIC=329.42
## AHD ~ sex + RestBP + Chol + MaxHR
##
## Df Deviance AIC
## <none> 319.42 329.42
## - Chol 1 326.95 334.95
## - RestBP 1 327.77 335.77
## - sex 1 352.84 360.84
## - MaxHR 1 378.73 386.73
summary(final_model)
##
## Call:
## glm(formula = AHD ~ sex + RestBP + Chol + MaxHR, family = "binomial")
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.977175 1.531644 0.638 0.52348
## sex 1.792557 0.337157 5.317 1.06e-07 ***
## RestBP 0.022865 0.008121 2.816 0.00487 **
## Chol 0.007452 0.002727 2.733 0.00628 **
## MaxHR -0.048452 0.007244 -6.689 2.25e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 417.98 on 302 degrees of freedom
## Residual deviance: 319.42 on 298 degrees of freedom
## AIC: 329.42
##
## Number of Fisher Scoring iterations: 4
Answer:
install.packages("pROC")
## Error in contrib.url(repos, "source"): trying to use CRAN without setting a mirror
library(pROC)
## Type 'citation("pROC")' for a citation.
##
## Attaching package: 'pROC'
## The following objects are masked from 'package:stats':
##
## cov, smooth, var
data(mtcars)
glm(formula = mpg_binary ~ disp + hp, family = "binomial", data = mtcars)
## Error in eval(predvars, data, env): object 'mpg_binary' not found
predicted_probs= predict(model, type = "response")
roc_obj= roc(mtcars$mpg_cat, predicted_probs)
## Error in roc.default(mtcars$mpg_cat, predicted_probs): No valid data provided.
plot(roc_obj, main = "ROC Curve", print.auc = TRUE, auc.polygon = TRUE, grid = TRUE)
## Error in eval(expr, envir, enclos): object 'roc_obj' not found
Courtesy: Hamilton, L. C. (1992). Regression with Graphics. Duxbury Press, Belmont, California, page 49.↩︎
Courtesy: Smyth, Gordon K (2011). Australasian Data and Story Library (OzDASL). https://gksmyth.github.io/ozdasl↩︎
Courtesy: An Introduction to Statistical Learning, James et al.↩︎
Thanks to the ISLR-python Github repository by Jordi Warmenhoven↩︎