Problem # 2: Carefully explain the differences between the KNN classifier and KNN regression methods.

The K-Nearest Neighbors (KNN) algorithm is a simple yet powerful method used for both classification and regression tasks. It’s a non-parametric approach, meaning it doesn’t make strong assumptions about the underlying data, and it relies on the idea that similar data points tend to be close to each other.

  1. What is KNN?

• KNN Classifier: Imagine you’re trying to categorize something, like deciding whether an animal in a photo is a cat or a dog. KNN looks at the “k” closest examples (neighbors) it has seen before and assigns the new example to the most common category among those neighbors. It’s like asking your closest friends for their opinion and going with the majority vote.

• KNN Regression: Now, think of predicting something numerical, like the price of a house. Instead of categories, KNN looks at the “k” nearest houses it knows about and predicts the price by averaging their values. It’s like estimating the cost of your dream home by checking what similar houses nearby sold for.

  2. What Does It Output?

• KNN Classifier: It gives you a clear-cut answer, like “cat” or “dog.”

• KNN Regression: It gives you a number, like $350,000 for a house or 72°F for the temperature.

  3. How Does It Make Predictions?

• KNN Classifier: It uses a majority vote. If most of the “k” neighbors are cats, the new example is labeled as a cat.

• KNN Regression: It calculates the average (or a weighted average) of the neighbors’ values. For example, if the three nearest houses are priced at $300k, $320k, and $340k, it would predict $320k, their average (see the short R sketch below).
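
To make the two prediction rules concrete, here is a minimal base-R sketch on simulated toy data (the data and the choice k = 3 are purely illustrative, not part of the assignment):

# Toy data: 20 training points with 2 features, a class label, and a numeric value
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)
train_class <- sample(c("cat", "dog"), 20, replace = TRUE)
train_value <- rnorm(20, mean = 320, sd = 20)   # e.g., house prices in $k
new_x <- c(0, 0)                                # the point we want to predict
k <- 3

# Euclidean distance from the new point to every training point
d <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
nn <- order(d)[1:k]                             # indices of the k nearest neighbors

# KNN classifier: majority vote among the k neighbors
pred_class <- names(which.max(table(train_class[nn])))

# KNN regression: average of the k neighbors' numeric values
pred_value <- mean(train_value[nn])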

  4. How Does It Measure “Closeness”?

Both versions use distance measures, like Euclidean distance (think of it as the straight-line distance between two points on a map).

• KNN Classifier: The distance helps determine which neighbors are closest, so it can pick the most common class.

• KNN Regression: The distance determines which neighbors enter the average; distance-weighted variants additionally give closer neighbors more influence on the predicted value.

  5. How Does It Handle Outliers?

• KNN Classifier: It’s pretty robust. If most neighbors agree on a category, a single oddball neighbor won’t sway the decision much.

• KNN Regression: It’s more sensitive. Outliers can skew the average, leading to less accurate predictions. For example, if one house is priced way higher than the others, it could throw off the predicted price.

  6. How Do You Measure Its Performance?

• KNN Classifier: You’d use metrics like accuracy (how often it’s right), precision (how many of its positive predictions are correct), recall (how many actual positives it catches), and F1-score (a balance of precision and recall).

• KNN Regression: You’d use metrics like Mean Squared Error (MSE), the average of the squared prediction errors, or R-squared, the proportion of variance in the response that the model explains (both are illustrated in the short sketch below).
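
As a small illustration (hypothetical vectors, not tied to any dataset in this assignment), both kinds of metrics are one-liners in base R:

# Classification: accuracy is the share of correctly predicted labels
true_class <- c("cat", "dog", "dog", "cat", "dog")
pred_class <- c("cat", "dog", "cat", "cat", "dog")
mean(pred_class == true_class)                       # accuracy

# Regression: MSE is the average squared error; R-squared compares it to the variance of y
true_value <- c(300, 320, 340, 310)
pred_value <- c(310, 318, 330, 305)
mean((true_value - pred_value)^2)                    # MSE
1 - sum((true_value - pred_value)^2) /
    sum((true_value - mean(true_value))^2)           # R-squared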

  7. Where Is It Used?

• KNN Classifier: It’s great for tasks like recognizing images (e.g., is this a cat or a dog?), filtering spam emails, or helping doctors diagnose diseases based on symptoms.

• KNN Regression: It’s handy for predicting house prices, forecasting temperatures, or estimating stock prices based on historical trends.

Problem # 9

  (a) Produce a scatterplot matrix which includes all of the variables in the data set.
# Load necessary libraries
library(GGally)
## Warning: package 'GGally' was built under R version 4.4.2
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.4.2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
# Load dataset
auto <- read.csv("Auto.csv", stringsAsFactors = TRUE)

# Convert horsepower to numeric (it might have missing values as '?')
auto$horsepower <- as.numeric(as.character(auto$horsepower))
## Warning: NAs introduced by coercion
# Remove 'name' column as it is qualitative
auto <- auto[ , !(names(auto) %in% c("name"))]

# Generate a scatterplot matrix
ggpairs(auto)
## (repeated ggpairs warnings condensed: 5 rows with missing horsepower values were
## removed from the scatterplot, correlation, and density panels)

  (b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
# Compute correlation matrix
cor_matrix <- cor(auto, use = "complete.obs")
print(cor_matrix)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
  (c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:
# Fit multiple linear regression model
model <- lm(mpg ~ ., data = auto)

# Display summary of the model
summary(model)
## 
## Call:
## lm(formula = mpg ~ ., data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  i. Is there a relationship between the predictors and the response?

Yes, there is a strong relationship between the predictors and the response variable (mpg). The F-statistic is 252.4 with a p-value < 2.2e-16, which is extremely small. This indicates that at least one of the predictors has a significant effect on mpg. The Multiple R-squared value is 0.8215, meaning that approximately 82.15% of the variation in mpg is explained by the predictor variables. The Adjusted R-squared is 0.8182, which suggests that even after adjusting for the number of predictors, the model still explains a high proportion of variance.

  ii. Which predictors appear to have a statistically significant relationship to the response?

A predictor is statistically significant if its p-value < 0.05.

-> The significant predictors are:

Displacement (p = 0.00844): positively related to mpg.
Weight (p < 2e-16): negatively related to mpg; highly significant.
Year (p < 2e-16): positively related to mpg; highly significant.
Origin (p = 4.67e-07): positively related to mpg.

-> On the other hand, the following predictors do not show strong statistical significance:

Cylinders (p = 0.12780): not statistically significant.
Horsepower (p = 0.21963): not statistically significant.
Acceleration (p = 0.41548): not statistically significant.

Thus, Weight, Year, Displacement, and Origin have a strong statistical relationship with mpg, while Cylinders, Horsepower, and Acceleration do not.

  iii. What does the coefficient for the year variable suggest?

The coefficient for year is 0.750773, meaning that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg. This suggests that newer cars are more fuel-efficient, likely reflecting steady improvements in engine technology and fuel economy over time (a quick check with predict() is sketched below).
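
A quick sanity check of this interpretation, assuming the auto data frame and the fitted model object from above (the name avg_car is mine, for illustration): predict mpg for an otherwise "average" car in two different model years.

# Build a 1-row data frame of predictor means, duplicate it, and vary only year
avg_car <- as.data.frame(lapply(auto[, names(auto) != "mpg"], mean, na.rm = TRUE))
newdata <- rbind(avg_car, avg_car)
newdata$year <- c(70, 80)
predict(model, newdata)   # the two predictions should differ by roughly 10 * 0.75 ≈ 7.5 mpg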

  (d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
# Generate diagnostic plots
par(mfrow = c(2,2))  # Set 2x2 layout for plots
plot(model)

  1. Residuals vs Fitted Plot

This plot checks for non-linearity and homoscedasticity (constant variance of residuals). Ideally, residuals should be randomly scattered around zero. In this plot, a slight curve is visible, suggesting a possible non-linearity in the model. Some large residuals (outliers) are present, as seen by points far from the center.

🔹 Issue: Possible non-linearity in the data, indicating that a transformation might improve the model.

  2. Normal Q-Q Plot

This plot checks whether residuals follow a normal distribution. If points follow the diagonal line, the normality assumption holds. In this plot, deviation at the ends (tails) suggests the presence of outliers.

🔹 Issue: The model might not fully satisfy the normality assumption due to outliers.

  3. Scale-Location Plot

This checks for homoscedasticity (constant variance of residuals). Ideally, residuals should be spread evenly across fitted values. In this plot, the residuals appear somewhat evenly spread, though there are a few large residuals at the right.

🔹 Issue: Potential heteroscedasticity (unequal variance), meaning a transformation such as log(mpg) might be needed.

  4. Residuals vs Leverage Plot

This plot helps identify high-leverage points, which can have a strong influence on the model. Points beyond Cook’s distance lines are highly influential. Observations 327, 394, and 14 have high leverage and may significantly affect the regression.

🔹 Issue: A few high-leverage points exist, particularly point 14, which may distort the model’s fit.
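
These visual impressions can be checked numerically. A short sketch using common rules of thumb (leverage above twice the average, studentized residuals beyond ±3), applied to the model object fitted above:

h <- hatvalues(model)                  # leverage of each observation
which(h > 2 * mean(h))                 # unusually high-leverage observations
which(abs(rstudent(model)) > 3)        # candidate outliers by studentized residual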

  (e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
# Fit model with interaction terms
model_interact <- lm(mpg ~ cylinders * displacement + weight * acceleration, data = auto)

# Summary of the model with interactions
summary(model_interact)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + weight * acceleration, 
##     data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.7750  -2.4764  -0.1929   1.9220  17.4032 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            45.9244021  7.8048993   5.884 8.63e-09 ***
## cylinders              -1.8504497  0.6119157  -3.024 0.002659 ** 
## displacement           -0.0804117  0.0184063  -4.369 1.60e-05 ***
## weight                 -0.0045984  0.0022941  -2.004 0.045711 *  
## acceleration            0.4822565  0.3722482   1.296 0.195906    
## cylinders:displacement  0.0100242  0.0025982   3.858 0.000134 ***
## weight:acceleration    -0.0000634  0.0001306  -0.485 0.627758    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.163 on 390 degrees of freedom
## Multiple R-squared:  0.7213, Adjusted R-squared:  0.717 
## F-statistic: 168.2 on 6 and 390 DF,  p-value: < 2.2e-16

This model includes interaction terms between:

  1. Cylinders and Displacement (cylinders:displacement)
  2. Weight and Acceleration (weight:acceleration)

-> Significant interaction term:

cylinders:displacement → p-value = 0.000134 (significant at the 0.1% level). This suggests that the effect of displacement on mpg depends on the number of cylinders.

-> Non-significant interaction term:

weight:acceleration → p-value = 0.627758 (not significant). This suggests that weight and acceleration do not have a meaningful combined effect on mpg beyond their individual effects.
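
To test whether the two interaction terms jointly add anything beyond the main effects, one option is a nested-model F-test with anova(); a sketch (model_main is my own name for the main-effects-only fit):

# Same four predictors without interactions, then compare to model_interact
model_main <- lm(mpg ~ cylinders + displacement + weight + acceleration, data = auto)
anova(model_main, model_interact)   # F-test for the two added interaction terms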

  (f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
# Apply transformations and fit models
model_log <- lm(mpg ~ log(weight), data = auto)
model_sqrt <- lm(mpg ~ sqrt(weight), data = auto)
model_square <- lm(mpg ~ I(weight^2), data = auto)

# Compare summaries
summary(model_log)
## 
## Call:
## lm(formula = mpg ~ log(weight), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.4658  -2.6579  -0.2947   1.9395  15.9787 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 210.5391     5.9837   35.19   <2e-16 ***
## log(weight) -23.5050     0.7516  -31.27   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.203 on 395 degrees of freedom
## Multiple R-squared:  0.7123, Adjusted R-squared:  0.7116 
## F-statistic: 978.1 on 1 and 395 DF,  p-value: < 2.2e-16
summary(model_sqrt)
## 
## Call:
## lm(formula = mpg ~ sqrt(weight), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.2769  -2.8948  -0.3705   2.0839  16.1925 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  69.84709    1.52239   45.88   <2e-16 ***
## sqrt(weight) -0.85860    0.02793  -30.74   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.255 on 395 degrees of freedom
## Multiple R-squared:  0.7052, Adjusted R-squared:  0.7044 
## F-statistic: 944.8 on 1 and 395 DF,  p-value: < 2.2e-16
summary(model_square)
## 
## Call:
## lm(formula = mpg ~ I(weight^2), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3230  -3.1839  -0.4874   2.4089  17.2085 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.453e+01  4.692e-01   73.61   <2e-16 ***
## I(weight^2) -1.155e-06  4.270e-08  -27.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.639 on 395 degrees of freedom
## Multiple R-squared:  0.6494, Adjusted R-squared:  0.6485 
## F-statistic: 731.7 on 1 and 395 DF,  p-value: < 2.2e-16

Findings:

  1. Log transformation: mpg ~ log(weight). Coefficient = -23.5050, so mpg drops sharply as log(weight) increases. Multiple R² = 0.7123, Adjusted R² = 0.7116, RSE = 4.203.

-> Good model fit (R² = 0.7123) and strong statistical significance (p < 2e-16). The log transformation captures diminishing returns: a given increase in weight reduces mpg more for lighter cars than for heavier ones.

  2. Square-root transformation: mpg ~ sqrt(weight). Coefficient = -0.85860, so mpg decreases as sqrt(weight) increases. Multiple R² = 0.7052, Adjusted R² = 0.7044, RSE = 4.255.

-> Slightly lower fit than the log model (R² = 0.7052). The square root transformation is useful but less effective than log(weight).

  3. Squared transformation: mpg ~ I(weight^2). Coefficient = -1.155e-06, so mpg decreases as weight² increases. Multiple R² = 0.6494, Adjusted R² = 0.6485, RSE = 4.639.

-> Worst fit among the three models (R² = 0.6494). The squared transformation exaggerates the effect of large weights, making the model less predictive.
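
For a compact side-by-side comparison of the three transformed fits, a sketch reusing the model objects defined above:

sapply(list(log = model_log, sqrt = model_sqrt, square = model_square),
       function(m) c(adj_r2 = summary(m)$adj.r.squared,   # higher is better
                     rse    = summary(m)$sigma))           # lower is better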

Problem # 10

This question should be answered using the Carseats data set.

  (a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
# Load necessary libraries
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.4.2
# Load dataset
carseats <- read.csv("Carseats.csv", stringsAsFactors = TRUE)

# Convert categorical variables to factors
carseats$Urban <- as.factor(carseats$Urban)
carseats$US <- as.factor(carseats$US)

# Fit multiple regression model
model_a <- lm(Sales ~ Price + Urban + US, data = carseats)

# Display summary
summary(model_a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
  (b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

Regression Model: Sales = 13.04 - 0.0545(Price) - 0.0219(UrbanYes) + 1.2057(USYes) + ε

Each coefficient represents the expected change in Sales given a one-unit change in the predictor while holding other variables constant.

Interpretation of Each Coefficient:

-> Intercept (β₀ = 13.04)

When Price = 0, Urban = No, and US = No, the expected Sales would be about 13.04 (Sales is recorded in thousands of units, so roughly 13,040 car seats). Since a Price of 0 is unrealistic, this is just a theoretical baseline.

-> Price (β₁ = -0.0545, p < 2e-16 → Significant)

For each $1 increase in Price, Sales are expected to decrease by about 0.0545 (in thousands of units, roughly 54 car seats), holding Urban and US fixed. Since the p-value is extremely small, Price has a strong negative effect on Sales.

-> UrbanYes (β₂ = -0.0219, p = 0.936 → Not Significant)

Whether the store is in an urban area (Yes) does not significantly impact Sales. The high p-value (0.936) means we cannot conclude that being in an urban area affects Sales.

-> USYes (β₃ = 1.2057, p = 4.86e-06 → Significant)

Stores in the US tend to sell about 1.21 more (in thousands of units, roughly 1,206 car seats) than non-US stores at the same Price and Urban status. Since the p-value is very small, this effect is statistically significant.

  (c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Sales = β₀ + β₁(Price) + β₂(UrbanYes) + β₃(USYes) + ε

where: UrbanYes = 1 if the store is urban, else 0. USYes = 1 if the store is in the US, else 0.
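
To confirm how R encodes these qualitative variables, the dummy coding can be inspected directly (a quick check on the carseats data loaded above; output omitted):

contrasts(carseats$Urban)   # column "Yes": Urban = Yes -> 1, No -> 0
contrasts(carseats$US)      # column "Yes": US = Yes -> 1, No -> 0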

  (d) For which of the predictors can you reject the null hypothesis H0: βj = 0?
# Significant predictors
summary(model_a)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

-> Reject H0 (Significant Predictors):

Price (p < 2e-16) → strong negative impact on Sales.
US (p = 4.86e-06) → positive impact on Sales.

-> Fail to Reject H0 (Not Significant Predictors):

Urban (p = 0.936) → No significant impact on Sales

  (e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
model_e <- lm(Sales ~ Price, data = carseats)
summary(model_e)
## 
## Call:
## lm(formula = Sales ~ Price, data = carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5224 -1.8442 -0.1459  1.6503  7.5108 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.641915   0.632812  21.558   <2e-16 ***
## Price       -0.053073   0.005354  -9.912   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.532 on 398 degrees of freedom
## Multiple R-squared:  0.198,  Adjusted R-squared:  0.196 
## F-statistic: 98.25 on 1 and 398 DF,  p-value: < 2.2e-16

Reduced Model Equation: Sales = 13.64 − 0.0531 × Price + ε

-> Price remains significant (p < 2e-16), confirming its strong negative relationship with Sales.
-> Intercept (13.64): the expected Sales when Price = 0.

-> R-squared decreased: Multiple R² = 0.198 (previously 0.239 in the full model). Adjusted R² = 0.196 (previously 0.233). This suggests that the US variable contributed to explaining some variance in Sales.

-> F-statistic = 98.25, with a very low p-value (< 2.2e-16), confirming that Price is strongly associated with Sales.

  (f) How well do the models in (a) and (e) fit the data?
# Compare models
summary(model_a)$adj.r.squared
## [1] 0.2335123
summary(model_e)$adj.r.squared
## [1] 0.195966

-> Full Model (model_a) fits the data better:

The full model explains about 23.9% of the variance in Sales (Multiple R² = 0.2393; Adjusted R² = 0.2335 after penalizing for the number of predictors). The inclusion of US improved the model, but Urban did not contribute much (as seen in part (d)).

-> Reduced Model (model_e) explains less variance

The reduced model explains about 19.8% of the variance in Sales (Multiple R² = 0.198; Adjusted R² = 0.196), using Price alone. This confirms that removing US led to some loss in explanatory power.
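
As an aside, since US was also significant in part (d), a reduced model keeping both Price and US would be a natural alternative to the one fit in (e). A sketch (the name model_price_us is mine; output not shown here):

model_price_us <- lm(Sales ~ Price + US, data = carseats)
summary(model_price_us)$adj.r.squared   # expected to be close to the full model's 0.2335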

  (g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(model_e, level = 0.95)
##                  2.5 %      97.5 %
## (Intercept) 12.3978438 14.88598655
## Price       -0.0635995 -0.04254653

->Intercept (12.40 to 14.89):

We are 95% confident that the true intercept lies between 12.40 and 14.89. This means that when Price = 0, the expected Sales would be within this range (though Price = 0 is unrealistic).

->Price (-0.0636 to -0.0425):

We are 95% confident that the effect of Price on Sales is between -0.0636 and -0.0425. Since the interval is entirely negative, this confirms that Price has a significant negative impact on Sales. A $1 increase in Price is expected to reduce Sales by approximately 0.0425 to 0.0636 units.

  (h) Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow=c(2,2))
plot(model_e)

1. Residuals vs Fitted Plot:

This plot checks for non-linearity and heteroscedasticity. Residuals should be randomly scattered around 0. The plot looks fairly random, suggesting the model is reasonable. However, a few extreme residuals (e.g., point 377) might indicate outliers.

  2. Normal Q-Q Plot:

This plot checks normality of residuals. Points should align along the diagonal line. The tails deviate, suggesting some non-normality. This might indicate outliers affecting the distribution.

  3. Scale-Location Plot:

This checks for homoscedasticity (constant variance of residuals). The residuals seem evenly spread, meaning variance is mostly constant. A few points (e.g., 377) show higher residual variance, possibly reinforcing the presence of outliers.

  4. Residuals vs Leverage Plot:

This plot identifies high-leverage points (points with extreme influence on the model). Points with high leverage appear on the right. Some points (e.g., 311, 377, 175) are close to Cook’s distance lines, suggesting influential observations.
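
The same rules of thumb used earlier can quantify these impressions for model_e (a sketch; thresholds are conventional, not definitive):

h <- hatvalues(model_e)
which(h > 2 * mean(h))                                       # unusually high leverage
which(abs(rstudent(model_e)) > 3)                            # candidate outliers
head(sort(cooks.distance(model_e), decreasing = TRUE), 3)    # most influential points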

Problem # 12

This problem involves simple linear regression without an intercept.

  (a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

From (3.38), the regression of Y onto X without an intercept gives β̂ = ∑ xi yi / ∑ xi^2, while the regression of X onto Y gives β̂′ = ∑ xi yi / ∑ yi^2. For the two coefficient estimates to be equal, the following must hold:

∑ xi yi / ∑ xi^2 = ∑ xi yi / ∑ yi^2, which rearranges to ∑ xi^2 = ∑ yi^2.

This means that the sum of squares of X must be equal to the sum of squares of Y.

-> This occurs when X and Y have equal sums of squares (equivalently, equal variance when both are centered at zero). -> For example, if all points lie on the line Y = X or Y = -X through the origin, the two sums of squares match, making the two regression coefficients equal.

  (b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
# Set seed for reproducibility
set.seed(123)

# Generate X and Y with different variances
n <- 100
X <- rnorm(n, mean = 10, sd = 5)
Y <- 2 * X + rnorm(n, mean = 0, sd = 10)  # Different variance

# Regression of Y on X (without intercept)
beta_y_on_x <- sum(X * Y) / sum(X^2)

# Regression of X on Y (without intercept)
beta_x_on_y <- sum(X * Y) / sum(Y^2)

# Print coefficients
cat("β (Y ~ X):", beta_y_on_x, "\n")
## β (Y ~ X): 1.896779
cat("β (X ~ Y):", beta_x_on_y, "\n")
## β (X ~ Y): 0.4402068
  (c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
# Set seed for reproducibility
set.seed(123)

# Generate X
n <- 100
X <- rnorm(n, mean = 0, sd = 5)

# Set Y equal to X (ensuring equal variance)
Y <- X  

# Regression of Y on X (without intercept)
beta_y_on_x <- sum(X * Y) / sum(X^2)

# Regression of X on Y (without intercept)
beta_x_on_y <- sum(X * Y) / sum(Y^2)

# Print coefficients
cat("β (Y ~ X):", beta_y_on_x, "\n")
## β (Y ~ X): 1
cat("β (X ~ Y):", beta_x_on_y, "\n")
## β (X ~ Y): 1
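
Note that Y = X is not the only possibility: any Y with the same sum of squares as X works. For instance (a sketch with an arbitrary seed), a random permutation of X also gives identical coefficients:

set.seed(321)                 # arbitrary seed for reproducibility
X <- rnorm(100)
Y <- sample(X)                # permuting X leaves sum(Y^2) equal to sum(X^2)
sum(X * Y) / sum(X^2)         # beta for Y ~ X (no intercept)
sum(X * Y) / sum(Y^2)         # beta for X ~ Y: identical value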