Carefully explain the differences between the KNN classifier
and KNN regression methods.
KNN classification predicts categorical labels by finding the K nearest neighbors and taking a majority vote: if 3 of the 5 nearest neighbors are class A and 2 are class B, it predicts class A. KNN regression predicts continuous values by averaging the numeric responses of the K nearest neighbors: to predict a house price, it finds the K most similar houses and takes the mean of their prices. Both methods rely on the same proximity principle, but classification outputs discrete labels and is evaluated with accuracy-based metrics, while regression outputs numerical values and is evaluated with error-based metrics such as MSE.
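For illustration only, a minimal R sketch contrasting the two methods (the toy data, the class and FNN packages, and k = 5 are assumptions, not part of the question):
set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)  # 50 training observations, 2 features
test_x  <- matrix(rnorm(10),  ncol = 2)  # 5 test observations
train_class <- factor(ifelse(train_x[, 1] + train_x[, 2] > 0, "A", "B"))  # categorical response
train_y     <- train_x[, 1] + rnorm(50)                                   # numeric response
# Classification: majority vote among the 5 nearest neighbors -> a class label
class::knn(train = train_x, test = test_x, cl = train_class, k = 5)
# Regression: mean of the 5 nearest neighbors' responses -> a numeric prediction
FNN::knn.reg(train = train_x, test = test_x, y = train_y, k = 5)$pred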
## Problem 9
This question involves the use of multiple linear regression on the Auto
data set.
(a) Produce a scatterplot matrix which includes all of the variables in the data set.
auto <- read.csv("C:/Users/doris/OneDrive/Documents/STA 6543/Auto.csv", na.strings = "?")
dim(auto)
## [1] 397 9
# Remove missing values
auto <- na.omit(auto)
dim(auto)
## [1] 392 9
str(auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : int 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : int 3504 3693 3436 3433 3449 4341 4354 4312 4425 3850 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
# Scatter plot matrix
plot(auto)
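Optionally (a sketch, beyond what the question asks), the qualitative name column can be excluded so that every panel in the matrix is numeric:
pairs(auto[, 1:8])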
(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
# Computing the correlations excluding the name variable
cor(auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
# Linear regression model
d1 <- lm(mpg ~ .-name, data = auto)
summary(d1)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
For instance:
i. Is there a relationship between the predictors and the response?
Yes. The F-statistic is 252.4 with a p-value < 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero: at least one predictor is related to mpg. The model also explains about 82% of the variance in mpg (R² = 0.8215).
Examining the individual predictors at the 0.05 significance level:
Significant predictors:
Weight (p < 2e-16): strongest predictor; each additional pound is associated with a 0.006474 decrease in mpg.
Year (p < 2e-16): each additional model year is associated with a 0.751 increase in mpg, reflecting improving fuel efficiency over time.
Origin (p = 4.67e-07): cars from different origins have significantly different mpg (likely foreign vs. domestic).
Displacement (p = 0.00844): larger engine displacement is associated with a slight increase in mpg (0.0199 per unit), holding the other variables constant.
Non-significant predictors:
Cylinders (p = 0.128): not significant once the other variables are included.
Horsepower (p = 0.220): surprisingly not significant, possibly because it is highly correlated with the other predictors (see the optional VIF check below).
Acceleration (p = 0.415): not a significant predictor of mpg.
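One optional way to check the collinearity suspicion above (a sketch; assumes the car package is installed):
library(car)  # assumed to be available; provides vif()
vif(d1)       # variance inflation factors for the model fit in (c)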
ii. Which predictors appear to have a statistically
significant relationship to the response?
Based on the regression output, the predictors with statistically
significant relationships to the response (mpg) are those with p-values
less than 0.05:
Statistically significant predictors: displacement (p = 0.00844), weight (p < 2e-16), year (p < 2e-16), and origin (p = 4.67e-07).
iii. What does the coefficient for the year variable
suggest?
The coefficient for the year variable is 0.750773, and it is highly
statistically significant (p-value < 2e-16). This means that, on
average, each additional model year is associated with an increase of
approximately 0.75 miles per gallon (mpg), holding all other variables
constant.
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
# Using the plot function
par(mfrow = c(2, 2))
plot(d1)
Residuals vs Fitted: the residuals are fairly randomly scattered around zero with no strong curved pattern, suggesting the linearity assumption is reasonably met.
Q-Q Residuals: most points follow the straight line closely, with slight deviations in the tails (mainly the upper tail), so the normality assumption is reasonably satisfied.
Scale-Location: mild heteroscedasticity; there is a slight upward trend in the red line, so the constant-variance assumption may be mildly violated.
Residuals vs Leverage: no points exceed the Cook's distance threshold of 0.5, indicating no overly influential observations that would compromise the model.
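As an optional numeric cross-check of these visual impressions (a sketch using the fit from (c)):
max(abs(rstandard(d1)))                       # largest standardized residual
which(cooks.distance(d1) > 0.5)               # observations above the 0.5 Cook's distance cutoff
head(sort(hatvalues(d1), decreasing = TRUE))  # observations with the highest leverage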
(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
# Method 1: Using * symbol (includes main effects AND interactions)
model1 <- lm(mpg ~ cylinders + displacement + horsepower + weight * year + acceleration + origin, data = auto)
summary(model1)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight *
## year + acceleration + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.9995 -1.8495 -0.1559 1.6061 11.7042
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.186e+02 1.338e+01 -8.864 < 2e-16 ***
## cylinders -1.218e-01 3.032e-01 -0.402 0.6881
## displacement 1.293e-02 7.019e-03 1.842 0.0663 .
## horsepower -2.877e-02 1.286e-02 -2.236 0.0259 *
## weight 3.044e-02 4.652e-03 6.543 1.94e-10 ***
## year 2.084e+00 1.732e-01 12.033 < 2e-16 ***
## acceleration 1.447e-01 9.196e-02 1.574 0.1164
## origin 1.174e+00 2.597e-01 4.519 8.30e-06 ***
## weight:year -4.879e-04 6.097e-05 -8.002 1.47e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.084 on 383 degrees of freedom
## Multiple R-squared: 0.847, Adjusted R-squared: 0.8439
## F-statistic: 265.1 on 8 and 383 DF, p-value: < 2.2e-16
# Method 2: Using : symbol (interaction term only)
model2 <- lm(mpg ~ weight:year, data = auto)
summary(model2)
##
## Call:
## lm(formula = mpg ~ weight:year, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -12.3849 -3.3041 -0.5901 2.6158 17.5737
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.571e+01 9.581e-01 47.71 <2e-16 ***
## weight:year -9.882e-05 4.105e-06 -24.07 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.957 on 390 degrees of freedom
## Multiple R-squared: 0.5977, Adjusted R-squared: 0.5967
## F-statistic: 579.4 on 1 and 390 DF, p-value: < 2.2e-16
Yes, the interaction is statistically significant.
Model 1 (weight * year):
weight:year interaction is highly significant (p = 1.47e-14, ***).
Model 2 (weight:year only):
Also highly significant (p < 2e-16).
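For clarity (an optional sketch, not part of the assignment output): weight * year is shorthand for weight + year + weight:year, so refitting with the expanded formula should reproduce model1 exactly.
model1_expanded <- lm(mpg ~ cylinders + displacement + horsepower +
                        weight + year + weight:year + acceleration + origin,
                      data = auto)
all.equal(coef(model1), coef(model1_expanded)[names(coef(model1))])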
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
# Log transformation of horsepower (common for power-related variables)
log_hp_model <- lm(mpg ~ cylinders + displacement + log(horsepower) + weight + acceleration + year + origin, data = auto)
# Square root of horsepower
sqrt_hp_model <- lm(mpg ~ cylinders + displacement + sqrt(horsepower) + weight + acceleration + year + origin, data = auto)
# Square of horsepower
sq_hp_model <- lm(mpg ~ cylinders + displacement + I(horsepower^2) + weight + acceleration + year + origin, data = auto)
summary(log_hp_model)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + log(horsepower) +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3115 -2.0041 -0.1726 1.8393 12.6579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.254005 8.589614 3.173 0.00163 **
## cylinders -0.486206 0.306692 -1.585 0.11372
## displacement 0.019456 0.006876 2.830 0.00491 **
## log(horsepower) -9.506436 1.539619 -6.175 1.69e-09 ***
## weight -0.004266 0.000694 -6.148 1.97e-09 ***
## acceleration -0.292088 0.103804 -2.814 0.00515 **
## year 0.705329 0.048456 14.556 < 2e-16 ***
## origin 1.482435 0.259347 5.716 2.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.18 on 384 degrees of freedom
## Multiple R-squared: 0.837, Adjusted R-squared: 0.834
## F-statistic: 281.6 on 7 and 384 DF, p-value: < 2.2e-16
summary(sqrt_hp_model)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + sqrt(horsepower) +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5240 -1.9910 -0.1687 1.8181 12.9211
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -6.0373910 5.5460041 -1.089 0.277012
## cylinders -0.5222540 0.3166839 -1.649 0.099938 .
## displacement 0.0220542 0.0071987 3.064 0.002341 **
## sqrt(horsepower) -1.1434906 0.3113771 -3.672 0.000274 ***
## weight -0.0054593 0.0006842 -7.979 1.72e-14 ***
## acceleration -0.1021239 0.1038565 -0.983 0.326070
## year 0.7240379 0.0501791 14.429 < 2e-16 ***
## origin 1.5173206 0.2703470 5.612 3.83e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.277 on 384 degrees of freedom
## Multiple R-squared: 0.8269, Adjusted R-squared: 0.8237
## F-statistic: 262 on 7 and 384 DF, p-value: < 2.2e-16
summary(sq_hp_model)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + I(horsepower^2) +
## weight + acceleration + year + origin, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.4398 -2.1891 0.0808 1.8067 13.0014
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.193e+01 4.303e+00 -5.096 5.44e-07 ***
## cylinders -3.038e-01 3.280e-01 -0.926 0.35489
## displacement 1.094e-02 7.851e-03 1.394 0.16413
## I(horsepower^2) 8.282e-05 4.261e-05 1.944 0.05268 .
## weight -7.110e-03 5.957e-04 -11.936 < 2e-16 ***
## acceleration 2.321e-01 8.691e-02 2.671 0.00788 **
## year 7.839e-01 5.053e-02 15.515 < 2e-16 ***
## origin 1.194e+00 2.807e-01 4.256 2.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.318 on 384 degrees of freedom
## Multiple R-squared: 0.8225, Adjusted R-squared: 0.8193
## F-statistic: 254.2 on 7 and 384 DF, p-value: < 2.2e-16
Model performance comparison:
Log transformation performs best: highest R² (0.837) and adjusted R² (0.834), lowest residual standard error (3.18), and the most significant coefficients overall.
Square root transformation is intermediate: moderate R² (0.827) and adjusted R² (0.824), with a higher residual standard error (3.277).
Squared transformation performs worst: lowest R² (0.823) and adjusted R² (0.819), and the highest residual standard error (3.318).
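One optional way to compare the three transformed fits on a single scale (a sketch; all three models have the same number of parameters, so the AIC ordering should mirror the R² ordering above):
AIC(log_hp_model, sqrt_hp_model, sq_hp_model)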
## Problem 10
This question should be answered using the Carseats data set.
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
library(ISLR)
attach(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# Multiple linear regression
d2 <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(d2)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
Model interpretation: the fitted model is
Sales = 13.043 − 0.054 × Price − 0.022 × UrbanYes + 1.201 × USYes
Coefficient Interpretations: Intercept (13.043): This represents the predicted sales when Price = 0, the store is not in an urban location (Urban = No), and the store is not in the US (US = No). While this baseline scenario may not be practically meaningful, it serves as the reference point for the model.
Price (-0.054): For each $1 increase in price, sales decrease by approximately 0.054 units; since Sales is recorded in thousands of units, this is about 54 car seats, holding Urban and US location constant. This coefficient is highly significant (p < 2e-16), indicating a strong negative relationship between price and sales.
UrbanYes (-0.022): Stores in urban locations have sales that are 0.022 units lower on average compared to non-urban stores, holding price and US location constant. However, this coefficient is not statistically significant (p = 0.936), suggesting there’s insufficient evidence of a meaningful difference in sales between urban and non-urban locations.
USYes (1.201): Stores located in the US have sales that are approximately 1.201 units higher on average compared to stores outside the US, holding price and urban location constant. This coefficient is highly significant (p = 4.86e-06), indicating that US location is a strong predictor of higher sales.
(c) Write out the model in equation form, being careful to
handle the qualitative variables properly.
Predicted Sales = 13.0435 − 0.0545 × Price − 0.0219 × UrbanYes + 1.2006 × USYes,
where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.
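A quick optional check of how R encodes these qualitative variables (a sketch using the model object d2 from (a)):
contrasts(Carseats$Urban)  # treatment coding: No is the baseline, Yes = 1
contrasts(Carseats$US)
head(model.matrix(d2))     # design matrix containing the UrbanYes and USYes dummy columns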
(d) For which of the predictors can you reject the null
hypothesis H0 : βj =0?
Looking at the regression output with a significance level of α = 0.05,
I can determine which null hypotheses H₀: βⱼ = 0 can be rejected by
comparing each p-value to 0.05:
Hypothesis test results (α = 0.05):
Price coefficient (β₁): p-value < 2e-16; since p < 0.05, reject H₀.
UrbanYes coefficient (β₂): p-value = 0.936; since p > 0.05, fail to reject H₀.
USYes coefficient (β₃): p-value = 4.86e-06; since p < 0.05, reject H₀.
Conclusion: at the 0.05 significance level, we reject the null hypothesis H₀: βⱼ = 0 for Price and USYes.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
# Fit the smaller model with only significant predictors
smaller_model <- lm(Sales ~ Price + US, data = Carseats)
summary(smaller_model)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the
data?
Both models fit the data only modestly: each explains about 23.9% of the variance in Sales (R² = 0.2393), with a residual standard error of roughly 2.47, so most of the variation in Sales is left unexplained.
The smaller model has a slightly higher adjusted R-squared (0.2354 vs 0.2335) and a marginally lower residual standard error (2.469 vs 2.472), confirming that dropping the non-significant Urban variable does not hurt the fit.
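An optional formal comparison (a sketch): since the model from (e) is nested within the model from (a), an F-test for dropping Urban can be obtained with anova().
anova(smaller_model, d2)  # tests whether Urban adds explanatory power beyond Price and US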
(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
# Get 95% confidence intervals for the smaller model coefficients
confint(smaller_model)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage observations in the model from (e)?
# diagnostic plots
par(mfrow = c(2, 2))
plot(smaller_model)
summary(influence.measures(smaller_model))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
# Identify outliers using studentized residuals
rstudent_values <- rstudent(smaller_model)
outliers <- which(abs(rstudent_values) > 3)
outliers
## named integer(0)
#Cook's Distance (influential points)
cooksd <- cooks.distance(smaller_model)
influential <- which(cooksd > 4 / nrow(Carseats))
influential
## 26 29 31 50 51 58 69 107 144 166 175 210 259 273 299 311 317 368 377
## 26 29 31 50 51 58 69 107 144 166 175 210 259 273 299 311 317 368 377
# Identify high leverage points
hat_values <- hatvalues(smaller_model)
high_leverage <- which(hat_values > (2 * mean(hat_values)))
high_leverage
## 43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387
## 43 126 156 157 160 166 172 175 192 204 209 270 273 314 316 357 366 368 384 387
Conclusion
No observations qualify as formal outliers by the studentized-residual criterion (none exceed |3|), but several observations have high leverage and several are flagged as potentially influential.
## Problem 12
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
From (3.38), regressing Y onto X without an intercept gives β̂ = Σ(xᵢyᵢ) / Σ(xᵢ²), while regressing X onto Y gives β̂′ = Σ(xᵢyᵢ) / Σ(yᵢ²). The two estimates share the same numerator, so they are equal if and only if the denominators match, i.e. the sum of squares of X equals the sum of squares of Y: Σ(xᵢ²) = Σ(yᵢ²).
(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
# Set seed for reproducibility
set.seed(123)
# Generate X ~ N(0, 1)
X <- rnorm(100)
# Generate Y as a linear function of X plus noise
Y <- 2 * X + rnorm(100)
# Regress Y onto X without intercept
model_Y_on_X <- lm(Y ~ X + 0)
beta_Y_on_X <- coef(model_Y_on_X)
# Regress X onto Y without intercept
model_X_on_Y <- lm(X ~ Y + 0)
beta_X_on_Y <- coef(model_X_on_Y)
# Print both slope estimates
cat("Beta (Y ~ X, no intercept):", beta_Y_on_X, "\n")
## Beta (Y ~ X, no intercept): 1.936372
cat("Beta (X ~ Y, no intercept):", beta_X_on_Y, "\n")
## Beta (X ~ Y, no intercept): 0.3975655
# Show they are different
cat("Are coefficients equal?", beta_Y_on_X == beta_X_on_Y, "\n")
## Are coefficients equal? FALSE
# Show why they're different
cat("Sum of X^2:", sum(X^2), "\n")
## Sum of X^2: 83.30737
cat("Sum of Y^2:", sum(Y^2), "\n")
## Sum of Y^2: 405.7547
(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
# Set seed for reproducibility
set.seed(123)
# Generate X ~ N(0, 1)
X <- rnorm(100)
# Generate Y to ensure sum(X^2) = sum(Y^2)
# Method: Scale X and add orthogonal noise
Y_temp <- 2 * X + rnorm(100, sd = 0.5)
# Scale Y so that sum(Y^2) = sum(X^2)
Y <- Y_temp * sqrt(sum(X^2) / sum(Y_temp^2))
# Verify that sum of squares are equal
cat("Sum of X^2:", sum(X^2), "\n")
## Sum of X^2: 83.30737
cat("Sum of Y^2:", sum(Y^2), "\n")
## Sum of Y^2: 83.30737
cat("Are sums of squares equal?", abs(sum(X^2) - sum(Y^2)) < 1e-10, "\n\n")
## Are sums of squares equal? TRUE
# Regress Y onto X without intercept
model_Y_on_X <- lm(Y ~ X + 0)
beta_Y_on_X <- coef(model_Y_on_X)
# Regress X onto Y without intercept
model_X_on_Y <- lm(X ~ Y + 0)
beta_X_on_Y <- coef(model_X_on_Y)
# Print both slope estimates
cat("Beta (Y ~ X, no intercept):", beta_Y_on_X, "\n")
## Beta (Y ~ X, no intercept): 0.9656775
cat("Beta (X ~ Y, no intercept):", beta_X_on_Y, "\n")
## Beta (X ~ Y, no intercept): 0.9656775
# Check if they are equal (within numerical precision)
cat("Are coefficients equal?", abs(beta_Y_on_X - beta_X_on_Y) < 1e-10, "\n")
## Are coefficients equal? TRUE
cat("Difference:", abs(beta_Y_on_X - beta_X_on_Y), "\n")
## Difference: 3.330669e-16
# Manual verification using formulas
manual_beta_YX <- sum(X * Y) / sum(X^2)
manual_beta_XY <- sum(X * Y) / sum(Y^2)
cat("\nManual calculation verification:\n")
##
## Manual calculation verification:
cat("Manual Beta (Y ~ X):", manual_beta_YX, "\n")
## Manual Beta (Y ~ X): 0.9656775
cat("Manual Beta (X ~ Y):", manual_beta_XY, "\n")
## Manual Beta (X ~ Y): 0.9656775