5. We now examine the differences between LDA and QDA.
A: We would expect QDA to perform better on the training set because of its additional flexibility, but LDA to perform better on the test set. QDA's advantage on the training data comes from fitting apparent nonlinearity in the training sample, i.e. overfitting; since the true boundary is linear, that apparent nonlinearity is unlikely to carry over to the test set.
A: On the training data we would expect QDA to outperform LDA because of its higher flexibility. On the test data the answer depends on the form of the nonlinearity in the Bayes decision boundary: if it is roughly quadratic, QDA should perform significantly better, but some nonlinear boundaries are poorly approximated by a quadratic, in which case QDA's extra flexibility buys little over LDA. So the test-set comparison depends on how well QDA can model the nonlinearity.
A: Generally speaking, we would expect the test prediction accuracy of the more flexible model (QDA) to improve relative to the less flexible model (LDA) as the sample size n grows: with more data, the variance cost of QDA's extra parameters shrinks, so it is less likely to be fitting spurious nonlinear patterns in the training data.
A: False. Especially with a small sample size, the extra variance of the more flexible method will lead to overfitting and hence a higher test error than LDA. More generally, if we already know the Bayes decision boundary is linear, QDA offers no advantage at any sample size; if the claim were true, we would simply always choose the most flexible method.
A: When evaluating classification methods, the primary concern is how well they generalize to new, unseen data. Here we compare two methods: logistic regression and 1-nearest neighbors (K = 1). Logistic regression has a training error of 20% and a test error of 30%, whereas 1-NN has an average error rate of 18% across the training and test sets. The training error for KNN is the error obtained when the training data are reused as the test set, and when K = 1 each training observation's nearest neighbor is itself, so the training error is 0%. This holds for 1-NN on any dataset, provided no two observations share identical predictors but different responses.
Given that the average error for K = 1 is 18% and its training error is 0%, its test error must be 2 × 18% − 0% = 36%, which is worse than the 30% test error of logistic regression. This suggests that 1-NN is overfitting: although it classifies the training data perfectly, it generalizes poorly. Logistic regression, despite its higher training error, has the lower test error and therefore the better generalization, which is the ultimate goal in classification. Logistic regression is the preferred method here, as it provides more reliable predictions for new observations.
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?
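A quick worked sketch of the odds arithmetic for these two questions, using the identity odds = p / (1 − p), so p = odds / (1 + odds):
# Fraction defaulting when the odds are 0.37: p = 0.37 / (1 + 0.37)
0.37 / (1 + 0.37)   # ≈ 0.27, i.e. about 27% of such people will default
# Odds for a 16% chance of default: odds = 0.16 / (1 - 0.16)
0.16 / (1 - 0.16)   # ≈ 0.19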
library(ISLR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(Auto)
mpg_horsepower <- lm(mpg ~ horsepower, data = Auto)
summary(mpg_horsepower)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The p-value for the horsepower coefficient is very small, so there is strong evidence that horsepower is associated with mpg. Therefore, there is a relationship between the predictor and the response.
Here R-squared = 0.6059, meaning about 60.6% of the variation in mpg is explained by horsepower. The adjusted R-squared of 0.6049 is almost identical, so the fit is not inflated by the number of predictors. Conclusion: the relationship is moderately strong, but other factors also affect mpg.
The coefficient for horsepower is -0.157845. Since it is negative, mpg decreases as horsepower increases: for every 1-unit increase in horsepower, mpg decreases by about 0.158 on average. Conclusion: the relationship is negative, meaning more powerful cars tend to have lower fuel efficiency.
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
Thus, the predicted mpg for a car with 98 horsepower is about 24.47, with an associated 95% confidence interval of (23.97, 24.96).
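As a sanity check (a small sketch, not part of the original output), the fitted value can be reproduced directly from the coefficients reported above:
# Predicted mpg at horsepower = 98, computed by hand from the fitted line
coef(mpg_horsepower)[1] + coef(mpg_horsepower)[2] * 98   # 39.936 - 0.158 * 98 ≈ 24.47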
lm_model <- lm(mpg ~ horsepower, data = Auto)
# Display the regression summary
summary(lm_model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Confidence interval (CI): 23.97 to 24.96 is the range for the mean mpg of cars with 98 horsepower. Prediction interval (PI): 14.81 to 34.12 is the range for an individual car's mpg at 98 horsepower. The PI is much wider than the CI because it must also account for the irreducible error of a single observation, not just the uncertainty in the estimated mean.
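A small sketch (reusing the fitted model above) that makes the difference in width explicit; the approximate widths in the comments come from the intervals already printed:
ci_98 <- predict(lm_model, data.frame(horsepower = 98), interval = "confidence")
pi_98 <- predict(lm_model, data.frame(horsepower = 98), interval = "prediction")
ci_98[, "upr"] - ci_98[, "lwr"]   # ≈ 0.99 mpg wide (uncertainty in the mean only)
pi_98[, "upr"] - pi_98[, "lwr"]   # ≈ 19.3 mpg wide (also includes irreducible error)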
plot(Auto$horsepower, Auto$mpg,
xlab = "Horsepower",
ylab = "Miles Per Gallon (MPG)",
main = "MPG vs Horsepower",
pch = 16, col = "blue")
abline(lm_model, col = "red", lwd = 2)
The scatterplot shows a clear negative relationship between horsepower and mpg, and the red regression line confirms the downward trend (higher horsepower is associated with lower mpg). The point cloud also bends rather than following a straight line, which foreshadows the non-linearity picked up by the diagnostics below.
par(mfrow = c(2, 2))
plot(lm_model)
Residuals vs Fitted: checks for non-linearity and homoscedasticity. Here the red line is clearly curved and the residuals follow a pattern, so the linear model appears misspecified.
Normal Q-Q: checks whether the residuals are approximately normally distributed. The points stay close to the line, with only modest deviation in the tails, so normality looks reasonable.
Scale-Location: checks for homoscedasticity. A random, even spread is good; here the spread increases somewhat with the fitted values, which hints at variance issues.
Residuals vs Leverage: identifies influential points. A point with high leverage and a Cook's distance above 0.5 may be unduly affecting the regression.
In summary: the clear curve in the Residuals vs Fitted plot suggests a linear model may not be the best fit; the increasing spread in the Scale-Location plot suggests heteroscedasticity (unequal variance); the Normal Q-Q plot shows only small deviations, so the residuals appear roughly normal; and the Residuals vs Leverage plot shows some higher-leverage points that may strongly influence the regression.
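Given the curvature flagged above, a minimal follow-up sketch (not required by the exercise; quad_model is just an illustrative name) is to add a quadratic horsepower term and re-check the fit and diagnostics:
# Quadratic fit: mpg modelled as a degree-2 polynomial in horsepower
quad_model <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(quad_model)$r.squared   # expect a noticeably higher R-squared than 0.606
par(mfrow = c(2, 2))
plot(quad_model)                # the curvature in Residuals vs Fitted should shrink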
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
head(Carseats)
lm_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(sales_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price = -0.054: for fixed values of Urban and US, a 1-unit increase in Price is associated with a decrease in Sales of about 0.054 units (roughly 54 car seats, since Sales is measured in thousands).
Urban = -0.022: for fixed values of Price and US, being in an urban area is associated with a change in Sales of about -0.022 units (roughly 22 car seats). However, the p-value for this variable's t-test is so high that there is no evidence of a relationship between car seat Sales at a store and whether the store is urban or rural.
US = 1.200: for fixed values of Price and Urban, being in the US is associated with an increase in Sales of about 1.2 units (roughly 1,200 car seats).
Sales = 13.043469−0.054459⋅Price−0.021916⋅Urban+1.200573⋅US
where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
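As a usage sketch of this equation (the Price of 120 and the urban, US store are arbitrary illustrative values, not from the exercise):
# Predicted Sales for a hypothetical urban US store charging $120:
# 13.04 - 0.0545 * 120 - 0.022 + 1.20 ≈ 7.7 (thousand car seats)
predict(sales_lm, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))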
This question is about the individual coefficient t-tests. Based on the output and the comments in part (b), we can reject the null hypothesis H0: βj = 0 for the Price and US predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for Urban is zero.
sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(sales_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
lm_reduced <- lm(Sales ~ Price, data = Carseats)
summary(lm_reduced)
##
## Call:
## lm(formula = Sales ~ Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5224 -1.8442 -0.1459 1.6503 7.5108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.641915 0.632812 21.558 <2e-16 ***
## Price -0.053073 0.005354 -9.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared: 0.198, Adjusted R-squared: 0.196
## F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
summary(lm_model)$r.squared
## [1] 0.2392754
summary(lm_reduced)$r.squared
## [1] 0.1979812
The Price + US model fits essentially as well as the full model: its R-squared is unchanged at about 0.239 and its adjusted R-squared improves slightly (0.2354 vs 0.2335), so dropping Urban costs nothing. Dropping US as well (the Price-only model) lowers R-squared to about 0.198, so US does contribute. Overall, the reduced Price + US model gives a marginally better fit with fewer predictors, though all of these models explain only about a quarter of the variation in Sales.
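To make the adjusted R-squared comparison explicit (the approximate values in the comments are taken from the summaries above):
summary(lm_model)$adj.r.squared    # full model (Price + Urban + US): ≈ 0.2335
summary(sales_lm_2)$adj.r.squared  # Price + US: ≈ 0.2354
summary(lm_reduced)$adj.r.squared  # Price only: ≈ 0.196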
confint(lm_reduced)
## 2.5 % 97.5 %
## (Intercept) 12.3978438 14.88598655
## Price -0.0635995 -0.04254653
If an interval does not contain 0, the corresponding coefficient is significantly different from zero; here neither interval contains 0, so Price (and the intercept) is significant.
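The same check can be applied to the Price + US model fitted above; given its t-statistics, both coefficient intervals should exclude zero (output not reproduced here):
confint(sales_lm_2)   # 95% intervals for (Intercept), Price and USYes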
par(mfrow = c(2, 2))
plot(lm_reduced)
Residuals vs Fitted: the residuals scatter fairly randomly around zero, so the linearity assumption looks reasonable.
Normal Q-Q: the points lie close to the line, so the residuals appear roughly normal; the deviations at the ends suggest a few mild outliers.
Scale-Location: the spread is roughly constant, so there is no strong evidence of non-constant variance.
Residuals vs Leverage: identifies high-leverage points; a Cook's distance above 0.5 would mark a highly influential observation, and no points cross that threshold here.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The true regression model has the form y = β0 + β1·x1 + β2·x2 + ε, with coefficients β0 = 2, β1 = 2, and β2 = 0.3.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2,
main = "Scatterplot of x1 vs x2",
xlab = "x1", ylab = "x2",
col = "blue", pch = 16)
cor(x1, x2) ≈ 0.84, which is close to 1, so the collinearity is high. Such a strong correlation means that x1 and x2 provide largely redundant information when both are used as predictors in a regression.
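A common numeric check for this (a sketch; the car package is an assumption here, not part of the original analysis) is the variance inflation factor:
# With cor(x1, x2) ≈ 0.835, VIF = 1 / (1 - 0.835^2) ≈ 3.3 for each predictor
library(car)
vif(lm(y ~ x1 + x2))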
lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
β̂0 = 2.1305 (true β0 = 2), β̂1 = 1.4396 (true β1 = 2), β̂2 = 1.0097 (true β2 = 0.3).
Using the standard α = 0.05 threshold, we can reject the null hypothesis H0: β1 = 0, but we cannot reject H0: β2 = 0.
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
In this case, the null hypothesis for the coefficient of x1 can be rejected, as the p-value is very small.
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Here too the null hypothesis can be rejected: the coefficient of x2 in this simple regression is highly significant, as its p-value is very small.
These results are not contradictory; they arise because x2 offers little new information when added to a model that already contains x1. The fact that x2 is significant on its own but not in the presence of x1 follows from the high correlation between x1 and x2: when both are included, much of the information provided by one is effectively redundant.
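One way to see the redundancy directly (a small sketch): regress x2 on x1 and look at how much of x2's variation x1 already explains.
# R-squared of x2 on x1 equals cor(x1, x2)^2 ≈ 0.835^2 ≈ 0.70,
# i.e. roughly 70% of the variation in x2 is already captured by x1
summary(lm(x2 ~ x1))$r.squared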
Now, we add one additional observation with measurement errors.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_model_new <- lm(y ~ x1 + x2)
lm_x1_new <- lm(y ~ x1)
lm_x2_new <- lm(y ~ x2)
summary(lm_model_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm_x1_new)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm_x2_new)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Key observations from the multiple regression model:
x1 is no longer statistically significant (p-value = 0.365, so we fail to reject H0: β1 = 0), while x2 is now significant (p-value = 0.006) with a coefficient far larger than its true value (β̂2 = 2.51 instead of 0.3). The estimate for x1 has dropped from 1.44 to 0.54 while the estimate for x2 has risen, so the new observation has shifted how the fit divides the signal between the two correlated predictors.
Possible explanation: the new observation is a high-leverage point for this fit (its x2 value is far above what its x1 value would suggest), and because x1 and x2 are so highly correlated, the fit now attributes most of the shared signal to x2, making x1 appear insignificant.
Key observations from the simple regression model (y ~ x1):
x1 remains significant (p-value = 4.29e-05), confirming that it explains some of the variance in y. The coefficient β̂1 = 1.77 is closer to its true value of 2 than the multiple-regression estimate was. The residual standard error has increased slightly (to 1.111), because the new point fits this model poorly. Possible explanation: the new observation adds noise but does not badly distort this fit; it is only when x2 is also included that the collinearity masks x1's effect and makes it appear insignificant.
Key observations from the simple regression model (y ~ x2):
x2 is highly significant (p-value = 1.25e-06), but β̂2 = 3.12 is even further from the true value of 0.3 than before, so the new observation has pulled this coefficient up. The residual standard error (1.074) is essentially unchanged, because the new point lies close to this model's fitted line while exerting relatively high leverage on it.
Possible explanation: since x2 is highly correlated with x1, its marginal effect already appears much stronger than its true coefficient, and the new high-leverage observation exaggerates this further, so β̂2 is overestimated.
Conclusion:
Collinearity issues: x1 is significant alone but insignificant in the multiple regression because of its high correlation with x2; the inflated coefficient of x2 is another symptom of the same collinearity.
Impact of the mismeasured observation: it distorted the coefficient estimates and shifted apparent significance from x1 to x2 in the multiple regression; the new point acts as a high-leverage observation in every fit involving x2, since its x2 value is large both relative to its x1 value and relative to the rest of the x2 values.
Outlier detection: in the y ~ x1 regression the new point is a clear outlier, producing the largest residual (about 3.57). Leverage and Cook's distance can confirm whether it unduly influences each regression, as sketched below.
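A hedged sketch of that check on the new (101st) observation, using base R's leverage, Cook's distance and standardized-residual helpers:
hatvalues(lm_model_new)[101]       # leverage of the new point in y ~ x1 + x2
cooks.distance(lm_model_new)[101]  # its influence on the multiple regression
rstandard(lm_x1_new)[101]          # its standardized residual in y ~ x1 (expected to be large)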