Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.
KNN classification predicts categorical labels: it finds the K nearest neighbors of a test point and assigns the majority class (if 3 of 5 neighbors are class A and 2 are class B, it predicts class A). KNN regression predicts continuous values: it finds the K nearest neighbors and averages their numeric responses (to predict a house price, it finds the K most similar houses and takes the mean of their prices). Both methods rely on the same proximity principle, but classification outputs discrete labels and is evaluated with accuracy-based metrics such as the error rate, while regression outputs numerical values and is evaluated with error-based metrics such as MSE.
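
To make the distinction concrete, here is a minimal base-R sketch of both methods on toy data; the training matrix, labels, and test point below are hypothetical, invented only for illustration.

# Toy data: two numeric features, a class label, and a numeric response
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)               # 20 training points
train_class <- factor(rep(c("A", "B"), each = 10))   # labels for classification
train_y <- rowSums(train_x) + rnorm(20, sd = 0.2)    # numeric response for regression
test_x <- c(0.1, -0.2)                               # a single test point
K <- 5

# Indices of the K nearest training points (Euclidean distance)
dists <- sqrt(colSums((t(train_x) - test_x)^2))
nn <- order(dists)[1:K]

# KNN classification: majority vote among the K neighbors
names(which.max(table(train_class[nn])))

# KNN regression: average of the K neighbors' responses
mean(train_y[nn])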

Problem 9
This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

auto <- read.csv("C:/Users/doris/OneDrive/Documents/STA 6543/Auto.csv", na.strings = "?")
dim(auto)
## [1] 397   9
# Remove missing values
auto <- na.omit(auto)
dim(auto)
## [1] 392   9
str(auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : int  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : int  3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
# Scatter plot matrix
plot(auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

# Computing the correlations excluding the name variable
cor(auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

# Linear regression model
d1 <- lm(mpg ~ .-name, data = auto)
summary(d1)
## 
## Call:
## lm(formula = mpg ~ . - name, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

For instance:
i. Is there a relationship between the predictors and the response?

Yes. The F-statistic is 252.4 on 7 and 384 degrees of freedom with a p-value < 2.2e-16, so we reject the hypothesis that all slope coefficients are zero: at least one predictor is related to mpg. Looking at the individual coefficients at the 0.05 significance level:

Significant predictors:

Weight (p < 2e-16): Strongest predictor; each additional pound decreases mpg by 0.006474, holding the other predictors fixed.

Year (p < 2e-16): Each additional model year increases mpg by about 0.751, reflecting improving fuel efficiency over time.

Origin (p = 4.67e-07): Cars from different regions of origin have significantly different mpg (roughly, foreign vs. domestic).

Displacement (p = 0.00844): Larger engine displacement is associated with slightly higher mpg (0.0199 per unit) once the other predictors are accounted for.

Non-significant predictors:

Cylinders (p = 0.128): Not significant once the other variables are included.

Horsepower (p = 0.220): Perhaps surprisingly not significant, likely because of its correlation with displacement and weight.

Acceleration (p = 0.415): Not a significant predictor of mpg.

ii. Which predictors appear to have a statistically significant relationship to the response?
Based on the regression output, the predictors with statistically significant relationships to the response (mpg) are those with p-values less than 0.05:

Statistically significant predictors: displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07).
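
The same list can also be pulled programmatically from the fitted model; a minimal sketch, assuming the model d1 from part (c):

# Extract the coefficient table and keep predictors with p-value < 0.05
coefs <- summary(d1)$coefficients
sig <- coefs[coefs[, "Pr(>|t|)"] < 0.05, , drop = FALSE]
rownames(sig)[rownames(sig) != "(Intercept)"]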

iii. What does the coefficient for the year variable suggest?
The coefficient for the year variable is 0.750773, and it is highly statistically significant (p-value < 2e-16). This means that, on average, each additional model year is associated with an increase of approximately 0.75 miles per gallon (mpg), holding all other variables constant.
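A small sketch can illustrate the "holding all other variables constant" part: predicting mpg for two otherwise identical cars built one model year apart should give a difference equal to the year coefficient, about 0.75 mpg. The predictor values below are hypothetical, chosen only for illustration, and the sketch assumes the fitted model d1.

# Two hypothetical cars differing only in model year
new_cars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                       weight = 2500, acceleration = 15,
                       year = c(76, 77), origin = 1)
preds <- predict(d1, newdata = new_cars)
diff(preds)   # should equal the year coefficient (~0.75)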

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

# Using the plot function
par(mfrow = c(2, 2))
plot(d1)

Residuals vs Fitted: Residuals are fairly randomly scattered around zero, with no strong curved pattern, suggesting the linearity assumption is reasonably met.

Normal Q-Q: Most points follow the reference line quite well, with slight deviations in the tails, mostly the upper tail, so the normality assumption is reasonably satisfied.

Scale-Location: Mild heteroscedasticity; there is a slight upward trend in the red smoothing line, so residual variance grows somewhat with the fitted values.

Residuals vs Leverage: No points exceed the Cook's distance threshold of 0.5, indicating no overly influential observations that would compromise the model; a numeric check of outliers, leverage, and influence is sketched below.
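
As a numeric complement to the plots, a minimal sketch (assuming the fitted model d1) that flags large studentized residuals, high-leverage points, and the most influential observations:

# Observations with studentized residuals beyond +/- 3 (possible outliers)
which(abs(rstudent(d1)) > 3)

# Observations with leverage more than twice the average hat value
which(hatvalues(d1) > 2 * mean(hatvalues(d1)))

# Observations with the largest Cook's distances
head(sort(cooks.distance(d1), decreasing = TRUE), 5)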

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# Method 1: Using * symbol (includes main effects AND interactions)
model1 <- lm(mpg ~ cylinders + displacement + horsepower + weight * year + acceleration + origin, data = auto)
summary(model1)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight * 
##     year + acceleration + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.9995 -1.8495 -0.1559  1.6061 11.7042 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -1.186e+02  1.338e+01  -8.864  < 2e-16 ***
## cylinders    -1.218e-01  3.032e-01  -0.402   0.6881    
## displacement  1.293e-02  7.019e-03   1.842   0.0663 .  
## horsepower   -2.877e-02  1.286e-02  -2.236   0.0259 *  
## weight        3.044e-02  4.652e-03   6.543 1.94e-10 ***
## year          2.084e+00  1.732e-01  12.033  < 2e-16 ***
## acceleration  1.447e-01  9.196e-02   1.574   0.1164    
## origin        1.174e+00  2.597e-01   4.519 8.30e-06 ***
## weight:year  -4.879e-04  6.097e-05  -8.002 1.47e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.084 on 383 degrees of freedom
## Multiple R-squared:  0.847,  Adjusted R-squared:  0.8439 
## F-statistic: 265.1 on 8 and 383 DF,  p-value: < 2.2e-16
# Method 2: Using : symbol (interaction term only)
model2 <- lm(mpg ~ weight:year, data = auto)
summary(model2)
## 
## Call:
## lm(formula = mpg ~ weight:year, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.3849  -3.3041  -0.5901   2.6158  17.5737 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.571e+01  9.581e-01   47.71   <2e-16 ***
## weight:year -9.882e-05  4.105e-06  -24.07   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.957 on 390 degrees of freedom
## Multiple R-squared:  0.5977, Adjusted R-squared:  0.5967 
## F-statistic: 579.4 on 1 and 390 DF,  p-value: < 2.2e-16

Yes, an interaction appears to be statistically significant.

Model 1 (weight * year, which includes both main effects and the interaction): the weight:year interaction is highly significant (p = 1.47e-14), and the adjusted R² rises from 0.818 in the additive model to 0.844.

Model 2 (weight:year only): the interaction term is also highly significant (p < 2e-16), but because the main effects are omitted this model explains far less variance (R² = 0.598).
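
The significance of the interaction can also be checked with a partial F-test comparing the additive model from part (c) with the interaction model; a minimal sketch, assuming d1 and model1 from above:

# Partial F-test: does adding the weight:year interaction improve the additive fit?
anova(d1, model1)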

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

# Log transformation of horsepower (common for power-related variables)
log_hp_model <- lm(mpg ~ cylinders + displacement + log(horsepower) + weight + acceleration + year + origin, data = auto)

# Square root of horsepower
sqrt_hp_model <- lm(mpg ~ cylinders + displacement + sqrt(horsepower) + weight + acceleration + year + origin, data = auto)

# Square of horsepower
sq_hp_model <- lm(mpg ~ cylinders + displacement + I(horsepower^2) + weight + acceleration + year + origin, data = auto)
summary(log_hp_model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + log(horsepower) + 
##     weight + acceleration + year + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3115 -2.0041 -0.1726  1.8393 12.6579 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     27.254005   8.589614   3.173  0.00163 ** 
## cylinders       -0.486206   0.306692  -1.585  0.11372    
## displacement     0.019456   0.006876   2.830  0.00491 ** 
## log(horsepower) -9.506436   1.539619  -6.175 1.69e-09 ***
## weight          -0.004266   0.000694  -6.148 1.97e-09 ***
## acceleration    -0.292088   0.103804  -2.814  0.00515 ** 
## year             0.705329   0.048456  14.556  < 2e-16 ***
## origin           1.482435   0.259347   5.716 2.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.18 on 384 degrees of freedom
## Multiple R-squared:  0.837,  Adjusted R-squared:  0.834 
## F-statistic: 281.6 on 7 and 384 DF,  p-value: < 2.2e-16
summary(sqrt_hp_model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + sqrt(horsepower) + 
##     weight + acceleration + year + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5240 -1.9910 -0.1687  1.8181 12.9211 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -6.0373910  5.5460041  -1.089 0.277012    
## cylinders        -0.5222540  0.3166839  -1.649 0.099938 .  
## displacement      0.0220542  0.0071987   3.064 0.002341 ** 
## sqrt(horsepower) -1.1434906  0.3113771  -3.672 0.000274 ***
## weight           -0.0054593  0.0006842  -7.979 1.72e-14 ***
## acceleration     -0.1021239  0.1038565  -0.983 0.326070    
## year              0.7240379  0.0501791  14.429  < 2e-16 ***
## origin            1.5173206  0.2703470   5.612 3.83e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.277 on 384 degrees of freedom
## Multiple R-squared:  0.8269, Adjusted R-squared:  0.8237 
## F-statistic:   262 on 7 and 384 DF,  p-value: < 2.2e-16
summary(sq_hp_model)
## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + I(horsepower^2) + 
##     weight + acceleration + year + origin, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.4398 -2.1891  0.0808  1.8067 13.0014 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     -2.193e+01  4.303e+00  -5.096 5.44e-07 ***
## cylinders       -3.038e-01  3.280e-01  -0.926  0.35489    
## displacement     1.094e-02  7.851e-03   1.394  0.16413    
## I(horsepower^2)  8.282e-05  4.261e-05   1.944  0.05268 .  
## weight          -7.110e-03  5.957e-04 -11.936  < 2e-16 ***
## acceleration     2.321e-01  8.691e-02   2.671  0.00788 ** 
## year             7.839e-01  5.053e-02  15.515  < 2e-16 ***
## origin           1.194e+00  2.807e-01   4.256 2.62e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.318 on 384 degrees of freedom
## Multiple R-squared:  0.8225, Adjusted R-squared:  0.8193 
## F-statistic: 254.2 on 7 and 384 DF,  p-value: < 2.2e-16

Model Performance Comparison

Log transformation performs best: highest R² (0.837) and adjusted R² (0.834), lowest residual standard error (3.18), and the most significant coefficients overall.

Square root transformation is intermediate: moderate R² (0.827) and adjusted R² (0.824), with a higher residual standard error (3.277).

Squared transformation performs worst: lowest R² (0.823) and adjusted R² (0.819), and the highest residual standard error (3.318).
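
One convenient way to compare the candidate models side by side is to tabulate adjusted R², residual standard error, and AIC; a minimal sketch, assuming the models fitted above:

# Collect the original fit and the three transformed-horsepower models
models <- list(original = d1, log = log_hp_model,
               sqrt = sqrt_hp_model, squared = sq_hp_model)

# Tabulate adjusted R-squared, residual standard error, and AIC for each
data.frame(adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
           sigma  = sapply(models, function(m) summary(m)$sigma),
           AIC    = sapply(models, AIC))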

Problem 10

This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(ISLR)
attach(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Multiple linear regression
d2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(d2)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!
Model Interpretation

The fitted model is: Sales = 13.043 - 0.054×Price - 0.022×UrbanYes + 1.201×USYes

Coefficient Interpretations:

Intercept (13.043): The predicted sales (in thousands of units) when Price = 0, the store is not in an urban location (Urban = No), and the store is not in the US (US = No). While this baseline scenario is not practically meaningful, it serves as the reference point for the model.

Price (-0.054): For each $1 increase in price, sales decrease by approximately 0.054 thousand units (about 54 car seats), holding Urban and US constant. This coefficient is highly significant (p < 2e-16), indicating a strong negative relationship between price and sales.

UrbanYes (-0.022): Stores in urban locations have sales that are about 0.022 thousand units lower on average than non-urban stores, holding Price and US constant. This coefficient is not statistically significant (p = 0.936), so there is insufficient evidence of a difference in sales between urban and non-urban locations.

USYes (1.201): Stores located in the US have sales that are approximately 1.201 thousand units higher on average than stores outside the US, holding Price and Urban constant. This coefficient is highly significant (p = 4.86e-06), indicating that US location is a strong predictor of higher sales.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
Predicted Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes,

where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.
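
The dummy coding that lm() uses for the qualitative variables can be inspected directly from the design matrix; a minimal sketch, assuming the model d2 from part (a):

# First rows of the design matrix: Urban and US become 0/1 indicator columns
head(model.matrix(d2))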

(d) For which of the predictors can you reject the null hypothesis H₀: βⱼ = 0?
Looking at the regression output with a significance level of α = 0.05, I can determine which null hypotheses H₀: βⱼ = 0 can be rejected by comparing each p-value to 0.05:

Hypothesis test results (α = 0.05):

Price (β₁): p-value < 2e-16. Since the p-value < 0.05, reject H₀.

UrbanYes (β₂): p-value = 0.936. Since the p-value > 0.05, fail to reject H₀.

USYes (β₃): p-value = 4.86e-06. Since the p-value < 0.05, reject H₀.

Conclusion: At the 0.05 significance level, we reject H₀: βⱼ = 0 for Price and USYes.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

# Fit the smaller model with only significant predictors
smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?
Both models fit the data only modestly: the R² is 0.2393 in each case, so Price and US (with or without Urban) explain about 23.9% of the variance in Sales, and the residual standard error is roughly 2.47 in both fits.

The smaller model has a slightly higher adjusted R² (0.2354 vs 0.2335) and a marginally lower residual standard error (2.469 vs 2.472), confirming that dropping the non-significant Urban variable costs essentially nothing.
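
The equivalence of the two fits can be tested formally with a partial F-test; a minimal sketch, assuming d2 from part (a) and smaller_model from part (e):

# Partial F-test: does Urban add anything beyond Price and US?
anova(smaller_model, d2)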

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

# Get 95% confidence intervals for the smaller model coefficients
confint(smaller_model)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

# diagnostic plots
par(mfrow = c(2, 2))
plot(smaller_model)

summary(influence.measures(smaller_model))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
# Identify outliers using studentized residuals
rstudent_values <- rstudent(smaller_model)
outliers <- which(abs(rstudent_values) > 3)
outliers
## named integer(0)
#Cook's Distance (influential points)
cooksd <- cooks.distance(smaller_model)
influential <- which(cooksd > 4 / nrow(Carseats))
influential
##  26  29  31  50  51  58  69 107 144 166 175 210 259 273 299 311 317 368 377 
##  26  29  31  50  51  58  69 107 144 166 175 210 259 273 299 311 317 368 377
# Identify high leverage points
hat_values <- hatvalues(smaller_model)
high_leverage <- which(hat_values > (2 * mean(hat_values)))
high_leverage
##  43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387 
##  43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387

Conclusion
No formal outliers according to the studentized residuals (none exceed 3 in absolute value). Several high-leverage and mildly influential observations are present, but their Cook's distances are all small (well below 0.5), so none appear to unduly affect the fit.

Problem 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), the slope from regressing Y onto X without an intercept is β̂ = Σ(xᵢyᵢ)/Σ(xᵢ²), while the slope from regressing X onto Y is Σ(xᵢyᵢ)/Σ(yᵢ²). The numerators are identical, so the two coefficient estimates are equal if and only if the denominators agree, that is, if and only if Σ(xᵢ²) = Σ(yᵢ²).

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

# Set seed for reproducibility
set.seed(123)

# Generate X ~ N(0, 1)
X <- rnorm(100)

# Generate Y as a linear function of X plus noise
Y <- 2 * X + rnorm(100)

# Regress Y onto X without intercept
model_Y_on_X <- lm(Y ~ X + 0)
beta_Y_on_X <- coef(model_Y_on_X)

# Regress X onto Y without intercept
model_X_on_Y <- lm(X ~ Y + 0)
beta_X_on_Y <- coef(model_X_on_Y)

# Print both slope estimates
cat("Beta (Y ~ X, no intercept):", beta_Y_on_X, "\n")
## Beta (Y ~ X, no intercept): 1.936372
cat("Beta (X ~ Y, no intercept):", beta_X_on_Y, "\n")
## Beta (X ~ Y, no intercept): 0.3975655
# Show they are different
cat("Are coefficients equal?", beta_Y_on_X == beta_X_on_Y, "\n")
## Are coefficients equal? FALSE
# Show why they're different
cat("Sum of X^2:", sum(X^2), "\n")
## Sum of X^2: 83.30737
cat("Sum of Y^2:", sum(Y^2), "\n")
## Sum of Y^2: 405.7547

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

# Set seed for reproducibility
set.seed(123)

# Generate X ~ N(0, 1)
X <- rnorm(100)

# Generate Y so that sum(X^2) = sum(Y^2)
# Method: start from a noisy linear function of X, then rescale Y so the sums of squares match
Y_temp <- 2 * X + rnorm(100, sd = 0.5)

# Scale Y so that sum(Y^2) = sum(X^2)
Y <- Y_temp * sqrt(sum(X^2) / sum(Y_temp^2))

# Verify that sum of squares are equal
cat("Sum of X^2:", sum(X^2), "\n")
## Sum of X^2: 83.30737
cat("Sum of Y^2:", sum(Y^2), "\n")
## Sum of Y^2: 83.30737
cat("Are sums of squares equal?", abs(sum(X^2) - sum(Y^2)) < 1e-10, "\n\n")
## Are sums of squares equal? TRUE
# Regress Y onto X without intercept
model_Y_on_X <- lm(Y ~ X + 0)
beta_Y_on_X <- coef(model_Y_on_X)

# Regress X onto Y without intercept
model_X_on_Y <- lm(X ~ Y + 0)
beta_X_on_Y <- coef(model_X_on_Y)

# Print both slope estimates
cat("Beta (Y ~ X, no intercept):", beta_Y_on_X, "\n")
## Beta (Y ~ X, no intercept): 0.9656775
cat("Beta (X ~ Y, no intercept):", beta_X_on_Y, "\n")
## Beta (X ~ Y, no intercept): 0.9656775
# Check if they are equal (within numerical precision)
cat("Are coefficients equal?", abs(beta_Y_on_X - beta_X_on_Y) < 1e-10, "\n")
## Are coefficients equal? TRUE
cat("Difference:", abs(beta_Y_on_X - beta_X_on_Y), "\n")
## Difference: 3.330669e-16
# Manual verification using formulas
manual_beta_YX <- sum(X * Y) / sum(X^2)
manual_beta_XY <- sum(X * Y) / sum(Y^2)

cat("\nManual calculation verification:\n")
## 
## Manual calculation verification:
cat("Manual Beta (Y ~ X):", manual_beta_YX, "\n")
## Manual Beta (Y ~ X): 0.9656775
cat("Manual Beta (X ~ Y):", manual_beta_XY, "\n")
## Manual Beta (X ~ Y): 0.9656775