5. We now examine the differences between LDA and QDA.
A: We would expect QDA to perform better on the training set because of its additional flexibility, but LDA to perform better on the test set. QDA's advantage on the training data comes from fitting apparent nonlinearity in the training sample, i.e. overfitting; since the true boundary is linear, that apparent nonlinearity is unlikely to carry over to the test set.
A: On the training data we would expect QDA to outperform LDA because of its higher flexibility. On the test data the answer depends on the form of the nonlinearity in the Bayes decision boundary: if it is roughly quadratic, QDA should perform significantly better, but some nonlinear boundaries are poorly approximated by a quadratic, in which case QDA's extra flexibility buys little over LDA. So the test-set comparison depends on how well QDA can model the nonlinearity.
A: Generally speaking, we would expect the test prediction accuracy of the more flexible model (QDA) to improve relative to the less flexible model (LDA) as the sample size n grows: with more data, the variance cost of QDA's extra parameters shrinks, so it is less likely to be fitting spurious nonlinear patterns in the training data.
A: False. Especially with a small sample size, the extra variance of the more flexible method will lead to overfitting and hence a higher test error than LDA. More generally, if we already know the Bayes decision boundary is linear, QDA offers no advantage at any sample size; if the claim were true, we would simply always choose the most flexible method.
A: When evaluating classification methods, the primary concern is how well they generalize to new, unseen data. Here we compare two methods: logistic regression and 1-nearest neighbors (K = 1). Logistic regression has a training error of 20% and a test error of 30%, whereas 1-NN has an average error rate of 18% across the training and test sets. The training error for KNN is the error obtained when the training data are reused as the test set, and when K = 1 each training observation's nearest neighbor is itself, so the training error is 0%. This holds for 1-NN on any dataset, provided no two observations share identical predictors but different responses.
Given that the average error for K = 1 is 18% and its training error is 0%, its test error must be 2 × 18% − 0% = 36%, which is worse than the 30% test error of logistic regression. This suggests that 1-NN is overfitting: although it classifies the training data perfectly, it generalizes poorly. Logistic regression, despite its higher training error, has the lower test error and therefore the better generalization, which is the ultimate goal in classification. Logistic regression is the preferred method here, as it provides more reliable predictions for new observations.
On average, what fraction of people with an odds of 0.37 of defaulting on their credit card payment will in fact default?
Suppose that an individual has a 16 % chance of defaulting on her credit card payment. What are the odds that she will default?
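A quick worked sketch of the odds arithmetic for these two questions, using the identity odds = p / (1 − p), so p = odds / (1 + odds):
# Fraction defaulting when the odds are 0.37: p = 0.37 / (1 + 0.37)
0.37 / (1 + 0.37)   # ≈ 0.27, i.e. about 27% of such people will default
# Odds for a 16% chance of default: odds = 0.16 / (1 - 0.16)
0.16 / (1 - 0.16)   # ≈ 0.19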
library(ISLR)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
data(Auto)
mpg_horsepower <- lm(mpg ~ horsepower, data = Auto)
summary(mpg_horsepower)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
The p-value for the horsepower coefficient is very small, so there is strong evidence that horsepower is associated with mpg. Therefore, there is a relationship between the predictor and the response.
Here R-squared = 0.6059, meaning about 60.6% of the variation in mpg is explained by horsepower. The adjusted R-squared of 0.6049 is almost identical, so the fit is not inflated by the number of predictors. Conclusion: the relationship is moderately strong, but other factors also affect mpg.
The coefficient for horsepower is -0.157845. Since it is negative, mpg decreases as horsepower increases: for every 1-unit increase in horsepower, mpg decreases by about 0.158 on average. Conclusion: the relationship is negative, meaning more powerful cars tend to have lower fuel efficiency.
predict(mpg_horsepower, data.frame(horsepower = 98), interval = "confidence", level = 0.95)
## fit lwr upr
## 1 24.46708 23.97308 24.96108
Thus, the predicted mpg for a car with 98 horsepower is about 24.47, with an associated 95% confidence interval of (23.97, 24.96).
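As a sanity check (a small sketch, not part of the original output), the fitted value can be reproduced directly from the coefficients reported above:
# Predicted mpg at horsepower = 98, computed by hand from the fitted line
coef(mpg_horsepower)[1] + coef(mpg_horsepower)[2] * 98   # 39.936 - 0.158 * 98 ≈ 24.47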
lm_model <- lm(mpg ~ horsepower, data = Auto)
# Display the regression summary
summary(lm_model)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "confidence")
## fit lwr upr
## 1 24.46708 23.97308 24.96108
predict(lm_model, newdata = data.frame(horsepower = 98), interval = "prediction")
## fit lwr upr
## 1 24.46708 14.8094 34.12476
Confidence interval (CI): 23.97 to 24.96 is the range for the mean mpg of cars with 98 horsepower. Prediction interval (PI): 14.81 to 34.12 is the range for an individual car's mpg at 98 horsepower. The PI is much wider than the CI because it must also account for the irreducible error of a single observation, not just the uncertainty in the estimated mean.
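A small sketch (reusing the fitted model above) that makes the difference in width explicit; the approximate widths in the comments come from the intervals already printed:
ci_98 <- predict(lm_model, data.frame(horsepower = 98), interval = "confidence")
pi_98 <- predict(lm_model, data.frame(horsepower = 98), interval = "prediction")
ci_98[, "upr"] - ci_98[, "lwr"]   # ≈ 0.99 mpg wide (uncertainty in the mean only)
pi_98[, "upr"] - pi_98[, "lwr"]   # ≈ 19.3 mpg wide (also includes irreducible error)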
plot(Auto$horsepower, Auto$mpg,
xlab = "Horsepower",
ylab = "Miles Per Gallon (MPG)",
main = "MPG vs Horsepower",
pch = 16, col = "blue")
abline(lm_model, col = "red", lwd = 2)
The scatterplot shows a clear negative relationship between horsepower and mpg, and the red regression line confirms the downward trend (higher horsepower is associated with lower mpg). The point cloud also bends rather than following a straight line, which foreshadows the non-linearity picked up by the diagnostics below.
par(mfrow = c(2, 2))
plot(lm_model)
Residuals vs Fitted: checks for non-linearity and homoscedasticity. Here the red line is clearly curved and the residuals follow a pattern, so the linear model appears misspecified.
Normal Q-Q: checks whether the residuals are approximately normally distributed. The points stay close to the line, with only modest deviation in the tails, so normality looks reasonable.
Scale-Location: checks for homoscedasticity. A random, even spread is good; here the spread increases somewhat with the fitted values, which hints at variance issues.
Residuals vs Leverage: identifies influential points. A point with high leverage and a Cook's distance above 0.5 may be unduly affecting the regression.
In summary: the clear curve in the Residuals vs Fitted plot suggests a linear model may not be the best fit; the increasing spread in the Scale-Location plot suggests heteroscedasticity (unequal variance); the Normal Q-Q plot shows only small deviations, so the residuals appear roughly normal; and the Residuals vs Leverage plot shows some higher-leverage points that may strongly influence the regression.
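Given the curvature flagged above, a minimal follow-up sketch (not required by the exercise; quad_model is just an illustrative name) is to add a quadratic horsepower term and re-check the fit and diagnostics:
# Quadratic fit: mpg modelled as a degree-2 polynomial in horsepower
quad_model <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)
summary(quad_model)$r.squared   # expect a noticeably higher R-squared than 0.606
par(mfrow = c(2, 2))
plot(quad_model)                # the curvature in Residuals vs Fitted should shrink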
library(ISLR2)
##
## Attaching package: 'ISLR2'
## The following object is masked _by_ '.GlobalEnv':
##
## Auto
## The following objects are masked from 'package:ISLR':
##
## Auto, Credit
head(Carseats)
lm_model <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(lm_model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
sales_lm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(sales_lm)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Price = -0.054: for fixed values of Urban and US, a 1-unit increase in Price is associated with a decrease in Sales of about 0.054 units (roughly 54 car seats, since Sales is measured in thousands).
Urban = -0.022: for fixed values of Price and US, being in an urban area is associated with a change in Sales of about -0.022 units (roughly 22 car seats). However, the p-value for this variable's t-test is so high that there is no evidence of a relationship between car seat Sales at a store and whether the store is urban or rural.
US = 1.200: for fixed values of Price and Urban, being in the US is associated with an increase in Sales of about 1.2 units (roughly 1,200 car seats).
Sales = 13.043469−0.054459⋅Price−0.021916⋅Urban+1.200573⋅US
where Urban = 1 if the store is in an urban location (0 otherwise) and US = 1 if the store is in the US (0 otherwise).
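As a usage sketch of this equation (the Price of 120 and the urban, US store are arbitrary illustrative values, not from the exercise):
# Predicted Sales for a hypothetical urban US store charging $120:
# 13.04 - 0.0545 * 120 - 0.022 + 1.20 ≈ 7.7 (thousand car seats)
predict(sales_lm, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))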
This question is about the individual coefficient t-tests. Based on the output and the comments in part (b), we can reject the null hypothesis H0: βj = 0 for the Price and US predictors, but there is insufficient evidence to reject the null hypothesis that the coefficient for Urban is zero.
sales_lm_2 <- lm(Sales ~ Price + US, data = Carseats)
summary(sales_lm_2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
lm_reduced <- lm(Sales ~ Price, data = Carseats)
summary(lm_reduced)
##
## Call:
## lm(formula = Sales ~ Price, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5224 -1.8442 -0.1459 1.6503 7.5108
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.641915 0.632812 21.558 <2e-16 ***
## Price -0.053073 0.005354 -9.912 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared: 0.198, Adjusted R-squared: 0.196
## F-statistic: 98.25 on 1 and 398 DF, p-value: < 2.2e-16
summary(lm_model)$r.squared
## [1] 0.2392754
summary(lm_reduced)$r.squared
## [1] 0.1979812
The Price + US model fits essentially as well as the full model: its R-squared is unchanged at about 0.239 and its adjusted R-squared improves slightly (0.2354 vs 0.2335), so dropping Urban costs nothing. Dropping US as well (the Price-only model) lowers R-squared to about 0.198, so US does contribute. Overall, the reduced Price + US model gives a marginally better fit with fewer predictors, though all of these models explain only about a quarter of the variation in Sales.
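To make the adjusted R-squared comparison explicit (the approximate values in the comments are taken from the summaries above):
summary(lm_model)$adj.r.squared    # full model (Price + Urban + US): ≈ 0.2335
summary(sales_lm_2)$adj.r.squared  # Price + US: ≈ 0.2354
summary(lm_reduced)$adj.r.squared  # Price only: ≈ 0.196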
confint(lm_reduced)
## 2.5 % 97.5 %
## (Intercept) 12.3978438 14.88598655
## Price -0.0635995 -0.04254653
If an interval does not contain 0, the corresponding coefficient is significantly different from zero; here neither interval contains 0, so Price (and the intercept) is significant.
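The same check can be applied to the Price + US model fitted above; given its t-statistics, both coefficient intervals should exclude zero (output not reproduced here):
confint(sales_lm_2)   # 95% intervals for (Intercept), Price and USYes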
par(mfrow = c(2, 2))
plot(lm_reduced)
Residuals vs Fitted: the residuals scatter fairly randomly around zero, so the linearity assumption looks reasonable.
Normal Q-Q: the points lie close to the line, so the residuals appear roughly normal; the deviations at the ends suggest a few mild outliers.
Scale-Location: the spread is roughly constant, so there is no strong evidence of non-constant variance.
Residuals vs Leverage: identifies high-leverage points; a Cook's distance above 0.5 would mark a highly influential observation, and no points cross that threshold here.
set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100) / 10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)
The true regression model has the form y = β0 + β1·x1 + β2·x2 + ε, with coefficients β0 = 2, β1 = 2, and β2 = 0.3.
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2,
main = "Scatterplot of x1 vs x2",
xlab = "x1", ylab = "x2",
col = "blue", pch = 16)
cor(x1, x2) ≈ 0.84, which is close to 1, so the collinearity is high. Such a strong correlation means that x1 and x2 provide largely redundant information when both are used as predictors in a regression.
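A common numeric check for this (a sketch; the car package is an assumption here, not part of the original analysis) is the variance inflation factor:
# With cor(x1, x2) ≈ 0.835, VIF = 1 / (1 - 0.835^2) ≈ 3.3 for each predictor
library(car)
vif(lm(y ~ x1 + x2))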
lm_model <- lm(y ~ x1 + x2)
summary(lm_model)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
β̂0 = 2.1305 (true β0 = 2), β̂1 = 1.4396 (true β1 = 2), β̂2 = 1.0097 (true β2 = 0.3).
Using the standard α = 0.05 threshold, we can reject the null hypothesis H0: β1 = 0, but we cannot reject H0: β2 = 0.
lm_x1 <- lm(y ~ x1)
summary(lm_x1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
In this case, the null hypothesis for the coefficient of x1 can be rejected, as the p-value is very small.
lm_x2 <- lm(y ~ x2)
summary(lm_x2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
Here too the null hypothesis can be rejected: the coefficient of x2 in this simple regression is highly significant, as its p-value is very small.
These results are not contradictory; they arise because x2 offers little new information when added to a model that already contains x1. The fact that x2 is significant on its own but not in the presence of x1 follows from the high correlation between x1 and x2: when both are included, much of the information provided by one is effectively redundant.
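One way to see the redundancy directly (a small sketch): regress x2 on x1 and look at how much of x2's variation x1 already explains.
# R-squared of x2 on x1 equals cor(x1, x2)^2 ≈ 0.835^2 ≈ 0.70,
# i.e. roughly 70% of the variation in x2 is already captured by x1
summary(lm(x2 ~ x1))$r.squared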
Now, we add one additional observation with measurement errors.
x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)
lm_model_new <- lm(y ~ x1 + x2)
lm_x1_new <- lm(y ~ x1)
lm_x2_new <- lm(y ~ x2)
summary(lm_model_new)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
summary(lm_x1_new)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
summary(lm_x2_new)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
Key observations from the multiple regression model:
x1 is no longer statistically significant (p-value = 0.365, so we fail to reject H0: β1 = 0), while x2 is now significant (p-value = 0.006) with a coefficient far larger than its true value (β̂2 = 2.51 instead of 0.3). The estimate for x1 has dropped from 1.44 to 0.54 while the estimate for x2 has risen, so the new observation has shifted how the fit divides the signal between the two correlated predictors.
Possible explanation: the new observation is a high-leverage point for this fit (its x2 value is far above what its x1 value would suggest), and because x1 and x2 are so highly correlated, the fit now attributes most of the shared signal to x2, making x1 appear insignificant.
Key observations from the simple regression model (y ~ x1):
x1 remains significant (p-value = 4.29e-05), confirming that it explains some of the variance in y. The coefficient β̂1 = 1.77 is closer to its true value of 2 than the multiple-regression estimate was. The residual standard error has increased slightly (to 1.111), because the new point fits this model poorly. Possible explanation: the new observation adds noise but does not badly distort this fit; it is only when x2 is also included that the collinearity masks x1's effect and makes it appear insignificant.
Key observations from the simple regression model (y ~ x2):
x2 is highly significant (p-value = 1.25e-06), but β̂2 = 3.12 is even further from the true value of 0.3 than before, so the new observation has pulled this coefficient up. The residual standard error (1.074) is essentially unchanged, because the new point lies close to this model's fitted line while exerting relatively high leverage on it.
Possible explanation: since x2 is highly correlated with x1, its marginal effect already appears much stronger than its true coefficient, and the new high-leverage observation exaggerates this further, so β̂2 is overestimated.
Conclusion:
Collinearity issues: x1 is significant alone but insignificant in the multiple regression because of its high correlation with x2; the inflated coefficient of x2 is another symptom of the same collinearity.
Impact of the mismeasured observation: it distorted the coefficient estimates and shifted apparent significance from x1 to x2 in the multiple regression; the new point acts as a high-leverage observation in every fit involving x2, since its x2 value is large both relative to its x1 value and relative to the rest of the x2 values.
Outlier detection: in the y ~ x1 regression the new point is a clear outlier, producing the largest residual (about 3.57). Leverage and Cook's distance can confirm whether it unduly influences each regression, as sketched below.
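A hedged sketch of that check on the new (101st) observation, using base R's leverage, Cook's distance and standardized-residual helpers:
hatvalues(lm_model_new)[101]       # leverage of the new point in y ~ x1 + x2
cooks.distance(lm_model_new)[101]  # its influence on the multiple regression
rstandard(lm_x1_new)[101]          # its standardized residual in y ~ x1 (expected to be large)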