Carefully explain the differences between the KNN classifier and KNN regression methods.
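The main difference is the type of response each method handles. The KNN classifier solves classification problems, where the response is qualitative: it predicts the most common class among the K training observations nearest to the test point. KNN regression solves regression problems, where the response is quantitative: it predicts the average response of those K nearest neighbors.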
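The contrast can be sketched in a few lines of base R. This is only an illustration on made-up data with a single numeric predictor; the helper names `knn_neighbors`, `knn_classify`, and `knn_regress` are hypothetical and are not taken from any package.

```r
# Minimal base-R sketch of KNN for one numeric predictor (toy data, hypothetical helpers)
knn_neighbors <- function(x_train, x0, k) {
  # Indices of the k training points closest to x0
  order(abs(x_train - x0))[seq_len(k)]
}

knn_classify <- function(x_train, y_train, x0, k) {
  # Qualitative response: return the most common class among the neighbors
  votes <- table(y_train[knn_neighbors(x_train, x0, k)])
  names(votes)[which.max(votes)]
}

knn_regress <- function(x_train, y_train, x0, k) {
  # Quantitative response: return the mean response of the neighbors
  mean(y_train[knn_neighbors(x_train, x0, k)])
}

x_train <- c(1, 2, 3, 10, 11, 12)
class_labels <- c("low", "low", "low", "high", "high", "high")
numeric_response <- c(1.0, 1.5, 2.0, 9.0, 9.5, 10.0)

pred_class <- knn_classify(x_train, class_labels, x0 = 2.4, k = 3)
pred_value <- knn_regress(x_train, numeric_response, x0 = 2.4, k = 3)
```

Both functions use the same neighbor search; only the final step (majority vote versus average) differs.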
Auto <- read.table("C:\\Users\\Winni\\Downloads\\Auto.data", header = TRUE, na.strings = "?",
stringsAsFactors = TRUE)
Auto <- na.omit(Auto)
pairs(~mpg+cylinders+displacement+horsepower+weight+acceleration +year +origin + name,data=Auto)
library(dplyr) # provides select()
Auto_new <- Auto |> select(mpg:origin)
cor(Auto_new)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
auto.model <- lm(mpg~., data=Auto_new)
summary(auto.model)
##
## Call:
## lm(formula = mpg ~ ., data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Comment on the output. For instance: i. Is there a relationship between the predictors and the response?
The F-statistic is 252.4 on 7 and 384 degrees of freedom, far above 1, and its p-value is below 2.2e-16. We therefore reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and the response.
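As a rough check, the overall F-statistic can be recomputed from the R-squared and degrees of freedom printed by summary(auto.model). The inputs below are copied from that output; because R-squared is rounded to four digits, the result only approximately matches the reported 252.4.

```r
# Values copied from the summary(auto.model) output
r_squared <- 0.8215
p <- 7           # number of predictors
df_resid <- 384  # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)

# Upper-tail probability under the F distribution with (p, n - p - 1) df
p_value <- pf(f_stat, df1 = p, df2 = df_resid, lower.tail = FALSE)
```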
#Answer The displacement, weight, year, and origin variables have a statistically significant relationship with mpg: each has a p-value below 0.05 (|t| of at least 2.6). Cylinders, horsepower, and acceleration do not, even though cylinders and horsepower have |t| greater than 1.
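To see where the t-values and p-values come from, the sketch below recomputes them for three predictors from the estimates and standard errors printed in summary(auto.model). It is a hand check under the usual t-distribution assumption, not a replacement for the summary output.

```r
# Estimates and standard errors copied from summary(auto.model)
est <- c(cylinders = -0.493376, displacement = 0.019896, weight = -0.006474)
se  <- c(cylinders = 0.323282, displacement = 0.007515, weight = 0.000652)
df_resid <- 384

t_value <- est / se
# Two-sided p-value from the t distribution with 384 degrees of freedom
p_value <- 2 * pt(abs(t_value), df = df_resid, lower.tail = FALSE)

round(cbind(t_value, p_value), 4)
```

Cylinders has |t| above 1 but a p-value of about 0.128, which is why the p-value, rather than |t| > 1, is the criterion to use.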
#Answer The coefficient for year suggests that, holding all other predictors fixed, mpg increases by about 0.75 on average for each additional model year.
par(mfrow=c(2,2))
plot(auto.model)
(i) Looking at the residual plot, I see no indication of unusually large
outliers. (ii) From the leverage plot, we see that observation 14
has high leverage.
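The visual checks can be complemented by numeric ones. The sketch below uses the built-in mtcars data as a stand-in (an assumption, since the Auto file is read from a local path), and the cutoffs are common rules of thumb rather than fixed standards.

```r
# Stand-in model on the built-in mtcars data (an assumption; not the Auto fit)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

n <- nobs(fit)
p <- length(coef(fit)) - 1   # number of predictors

stud_res <- rstudent(fit)    # studentized residuals
lev <- hatvalues(fit)        # leverage statistics
avg_lev <- (p + 1) / n       # average leverage over all observations

# Assumed cutoffs: |studentized residual| > 3, leverage > 3 times the average
possible_outliers <- names(stud_res)[abs(stud_res) > 3]
high_leverage <- names(lev)[lev > 3 * avg_lev]
```

Replacing `fit` with auto.model would apply the same checks to the model discussed here.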
auto.modelINT1 <- lm(mpg~cylinders*displacement + cylinders*weight + cylinders*horsepower + cylinders*origin + cylinders*year + cylinders*acceleration, data=Auto)
summary(auto.modelINT1)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement + cylinders * weight +
## cylinders * horsepower + cylinders * origin + cylinders *
## year + cylinders * acceleration, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7843 -1.6237 -0.0424 1.3271 12.3258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.902e+01 1.485e+01 -1.954 0.051483 .
## cylinders 3.090e+00 2.702e+00 1.143 0.253571
## displacement 4.810e-03 2.408e-02 0.200 0.841758
## weight -1.006e-02 2.457e-03 -4.093 5.20e-05 ***
## horsepower -1.777e-01 5.472e-02 -3.248 0.001267 **
## origin -2.217e+00 1.381e+00 -1.606 0.109192
## year 1.336e+00 1.654e-01 8.078 8.93e-15 ***
## acceleration -1.020e-02 2.989e-01 -0.034 0.972804
## cylinders:displacement 3.982e-04 3.632e-03 0.110 0.912762
## cylinders:weight 8.602e-04 3.624e-04 2.374 0.018113 *
## cylinders:horsepower 1.918e-02 8.013e-03 2.393 0.017185 *
## cylinders:origin 7.242e-01 3.182e-01 2.276 0.023421 *
## cylinders:year -1.164e-01 3.143e-02 -3.703 0.000245 ***
## cylinders:acceleration 1.166e-03 5.431e-02 0.021 0.982878
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.865 on 378 degrees of freedom
## Multiple R-squared: 0.8697, Adjusted R-squared: 0.8652
## F-statistic: 194.1 on 13 and 378 DF, p-value: < 2.2e-16
In this model, we see that the cylinders:weight, cylinders:horsepower, cylinders:origin, and cylinders:year interaction terms are statistically significant.
#2
auto.modelINT2 <- lm(mpg~displacement*horsepower + displacement*weight + displacement*origin + displacement*year + displacement*acceleration+ cylinders, data=Auto)
summary(auto.modelINT2)
##
## Call:
## lm(formula = mpg ~ displacement * horsepower + displacement *
## weight + displacement * origin + displacement * year + displacement *
## acceleration + cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.5046 -1.5468 0.0123 1.3195 13.5004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.979e+01 8.992e+00 -3.313 0.001013 **
## displacement 7.178e-02 4.427e-02 1.622 0.105727
## horsepower -9.919e-02 3.175e-02 -3.124 0.001923 **
## weight -7.945e-03 1.367e-03 -5.811 1.32e-08 ***
## origin -8.544e-01 8.949e-01 -0.955 0.340354
## year 1.104e+00 9.824e-02 11.236 < 2e-16 ***
## acceleration 8.621e-02 1.772e-01 0.487 0.626826
## cylinders 6.256e-01 2.978e-01 2.101 0.036307 *
## displacement:horsepower 1.666e-04 1.026e-04 1.625 0.105086
## displacement:weight 1.550e-05 4.052e-06 3.826 0.000152 ***
## displacement:origin 1.233e-02 7.734e-03 1.594 0.111767
## displacement:year -2.039e-03 5.157e-04 -3.954 9.19e-05 ***
## displacement:acceleration -5.375e-04 8.597e-04 -0.625 0.532218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.832 on 379 degrees of freedom
## Multiple R-squared: 0.8724, Adjusted R-squared: 0.8683
## F-statistic: 215.9 on 12 and 379 DF, p-value: < 2.2e-16
Here, we see that the displacement:weight and displacement:year interaction terms are statistically significant.
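In R's formula syntax, `a * b` is shorthand for `a + b + a:b`, and a main effect repeated across several `*` terms is included only once. The sketch below illustrates this on the built-in mtcars data, used here as a stand-in (an assumption) for Auto.

```r
# Stand-in example on mtcars (an assumption; the same expansion applies to the Auto formulas)
fit_star <- lm(mpg ~ wt * hp, data = mtcars)
fit_long <- lm(mpg ~ wt + hp + wt:hp, data = mtcars)

# Both formulas produce the same coefficients
names(coef(fit_star))

# A repeated main effect (cyl) appears only once in the expanded terms
expanded <- attr(terms(mpg ~ cyl * disp + cyl * wt), "term.labels")
```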
#3
auto.modelINT3 <- lm(mpg~horsepower*weight+horsepower*acceleration+horsepower*year+ horsepower*origin + cylinders+displacement,data=Auto)
summary(auto.modelINT3)
##
## Call:
## lm(formula = mpg ~ horsepower * weight + horsepower * acceleration +
## horsepower * year + horsepower * origin + cylinders + displacement,
## data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1377 -1.3451 -0.0611 1.2719 11.1489
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.626e+01 1.084e+01 -5.189 3.46e-07 ***
## horsepower 3.797e-01 1.059e-01 3.586 0.00038 ***
## weight -7.885e-03 1.095e-03 -7.199 3.27e-12 ***
## acceleration 2.124e-01 1.632e-01 1.302 0.19386
## year 1.390e+00 1.346e-01 10.333 < 2e-16 ***
## origin 1.297e+00 1.086e+00 1.194 0.23309
## cylinders 3.549e-01 2.891e-01 1.228 0.22030
## displacement -9.768e-03 7.639e-03 -1.279 0.20178
## horsepower:weight 3.682e-05 6.959e-06 5.291 2.05e-07 ***
## horsepower:acceleration -4.386e-03 1.759e-03 -2.493 0.01308 *
## horsepower:year -6.620e-03 1.345e-03 -4.921 1.29e-06 ***
## horsepower:origin -6.978e-03 1.284e-02 -0.543 0.58723
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.803 on 380 degrees of freedom
## Multiple R-squared: 0.8747, Adjusted R-squared: 0.871
## F-statistic: 241.1 on 11 and 380 DF, p-value: < 2.2e-16
Here we see that the horsepower:weight, horsepower:acceleration, and horsepower:year interaction terms are statistically significant.
#4
auto.modelINT4 <- lm(mpg~weight*acceleration + weight*origin + weight*year + horsepower + displacement + cylinders, data=Auto)
summary(auto.modelINT4)
##
## Call:
## lm(formula = mpg ~ weight * acceleration + weight * origin +
## weight * year + horsepower + displacement + cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7396 -1.6905 -0.0713 1.3018 11.3500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.175e+02 1.306e+01 -8.999 < 2e-16 ***
## weight 3.098e-02 4.526e-03 6.845 3.07e-11 ***
## acceleration 1.213e+00 2.414e-01 5.025 7.75e-07 ***
## origin 2.699e+00 1.240e+00 2.176 0.03016 *
## year 1.817e+00 1.759e-01 10.327 < 2e-16 ***
## horsepower -4.138e-02 1.303e-02 -3.176 0.00161 **
## displacement -1.120e-03 7.355e-03 -0.152 0.87908
## cylinders -3.498e-02 2.965e-01 -0.118 0.90614
## weight:acceleration -4.126e-04 8.430e-05 -4.895 1.46e-06 ***
## weight:origin -7.752e-04 5.302e-04 -1.462 0.14457
## weight:year -3.803e-04 6.274e-05 -6.062 3.23e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.978 on 381 degrees of freedom
## Multiple R-squared: 0.8582, Adjusted R-squared: 0.8544
## F-statistic: 230.5 on 10 and 381 DF, p-value: < 2.2e-16
Here the weight:acceleration and weight:year interaction terms are significant.
#5
auto.modelINT5 <- lm(mpg~cylinders+acceleration*year + acceleration*origin + displacement + horsepower + weight, data=Auto)
summary(auto.modelINT5)
##
## Call:
## lm(formula = mpg ~ cylinders + acceleration * year + acceleration *
## origin + displacement + horsepower + weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2160 -1.9139 -0.1561 1.6798 12.2113
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 88.5083128 20.0395827 4.417 1.31e-05 ***
## cylinders -0.3553012 0.3029518 -1.173 0.2416
## acceleration -6.6459681 1.2408097 -5.356 1.47e-07 ***
## year -0.4274157 0.2643531 -1.617 0.1067
## origin -7.9692322 1.5977193 -4.988 9.28e-07 ***
## displacement 0.0013427 0.0072991 0.184 0.8541
## horsepower -0.0318848 0.0128631 -2.479 0.0136 *
## weight -0.0049283 0.0006313 -7.806 5.71e-14 ***
## acceleration:year 0.0750484 0.0164081 4.574 6.48e-06 ***
## acceleration:origin 0.5650941 0.0965091 5.855 1.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.071 on 382 degrees of freedom
## Multiple R-squared: 0.8488, Adjusted R-squared: 0.8452
## F-statistic: 238.2 on 9 and 382 DF, p-value: < 2.2e-16
Here we see that the acceleration:year and acceleration:origin interaction terms are statistically significant.
#6
auto.modelINT6 <- lm(mpg~cylinders + displacement+horsepower + weight + acceleration + year*origin, data=Auto)
summary(auto.modelINT6)
##
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight +
## acceleration + year * origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6072 -2.0439 -0.0596 1.7121 12.3368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.492e+00 9.044e+00 0.939 0.348353
## cylinders -5.042e-01 3.192e-01 -1.579 0.115082
## displacement 1.567e-02 7.530e-03 2.081 0.038060 *
## horsepower -1.399e-02 1.364e-02 -1.025 0.305786
## weight -6.352e-03 6.449e-04 -9.851 < 2e-16 ***
## acceleration 9.185e-02 9.766e-02 0.941 0.347546
## year 4.189e-01 1.125e-01 3.723 0.000226 ***
## origin -1.405e+01 4.699e+00 -2.989 0.002978 **
## year:origin 1.989e-01 6.030e-02 3.298 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8228
## F-statistic: 227.9 on 8 and 383 DF, p-value: < 2.2e-16
Here we see that the year:origin interaction term is statistically significant.
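Another way to judge an interaction is to compare the models with and without it using anova(). The sketch below does this on the built-in mtcars data as a stand-in (an assumption); auto.model and auto.modelINT6 could be substituted for the two fits.

```r
# Stand-in models on mtcars (an assumption; substitute auto.model and auto.modelINT6)
base_fit <- lm(mpg ~ wt + hp, data = mtcars)
int_fit  <- lm(mpg ~ wt * hp, data = mtcars)

# F-test for the nested comparison
cmp <- anova(base_fit, int_fit)
f_p_value <- cmp[["Pr(>F)"]][2]

# With a single added term, the F-test p-value equals the t-test p-value
t_p_value <- coef(summary(int_fit))["wt:hp", "Pr(>|t|)"]
```

When several interaction terms are added at once, the F-test assesses them jointly, which the individual t-tests do not.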
#Answer #1 Log transformation
auto.modellog <- lm(mpg~log(cylinders)+log(displacement)+log(horsepower)+log(weight)+log(acceleration)+log(year)+log(origin), data=Auto_new)
summary((auto.modellog))
##
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(horsepower) +
## log(weight) + log(acceleration) + log(year) + log(origin),
## data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5987 -1.8172 -0.0181 1.5906 12.8132
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -66.5643 17.5053 -3.803 0.000167 ***
## log(cylinders) 1.4818 1.6589 0.893 0.372273
## log(displacement) -1.0551 1.5385 -0.686 0.493230
## log(horsepower) -6.9657 1.5569 -4.474 1.01e-05 ***
## log(weight) -12.5728 2.2251 -5.650 3.12e-08 ***
## log(acceleration) -4.9831 1.6078 -3.099 0.002082 **
## log(year) 54.9857 3.5555 15.465 < 2e-16 ***
## log(origin) 1.5822 0.5083 3.113 0.001991 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.069 on 384 degrees of freedom
## Multiple R-squared: 0.8482, Adjusted R-squared: 0.8454
## F-statistic: 306.5 on 7 and 384 DF, p-value: < 2.2e-16
#2 square root transformation
auto.modelroot <- lm(mpg~sqrt(cylinders)+sqrt(displacement)+sqrt(horsepower)+sqrt(weight)+sqrt(acceleration)+sqrt(year)+sqrt(origin), data=Auto_new)
summary((auto.modelroot))
##
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) +
## sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin),
## data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5250 -1.9822 -0.1111 1.7347 13.0681
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -49.79814 9.17832 -5.426 1.02e-07 ***
## sqrt(cylinders) -0.23699 1.53753 -0.154 0.8776
## sqrt(displacement) 0.22580 0.22940 0.984 0.3256
## sqrt(horsepower) -0.77976 0.30788 -2.533 0.0117 *
## sqrt(weight) -0.62172 0.07898 -7.872 3.59e-14 ***
## sqrt(acceleration) -0.82529 0.83443 -0.989 0.3233
## sqrt(year) 12.79030 0.85891 14.891 < 2e-16 ***
## sqrt(origin) 3.26036 0.76767 4.247 2.72e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8308
## F-statistic: 275.3 on 7 and 384 DF, p-value: < 2.2e-16
#3 square transformation
auto.modelsquare <- lm(mpg~ I(cylinders^2)+ I(displacement^2)+I(horsepower^2) +I(weight^2)+ I(acceleration^2)+I(year^2)+ I(origin^2), data=Auto_new)
summary((auto.modelsquare))
##
## Call:
## lm(formula = mpg ~ I(cylinders^2) + I(displacement^2) + I(horsepower^2) +
## I(weight^2) + I(acceleration^2) + I(year^2) + I(origin^2),
## data = Auto_new)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6786 -2.3227 -0.0582 1.9073 12.9807
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.208e+00 2.356e+00 0.513 0.608382
## I(cylinders^2) -8.829e-02 2.521e-02 -3.502 0.000515 ***
## I(displacement^2) 5.680e-05 1.382e-05 4.109 4.87e-05 ***
## I(horsepower^2) -3.621e-05 4.975e-05 -0.728 0.467201
## I(weight^2) -9.351e-07 8.978e-08 -10.416 < 2e-16 ***
## I(acceleration^2) 6.278e-03 2.690e-03 2.334 0.020130 *
## I(year^2) 4.999e-03 3.530e-04 14.160 < 2e-16 ***
## I(origin^2) 4.129e-01 6.914e-02 5.971 5.37e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.539 on 384 degrees of freedom
## Multiple R-squared: 0.7981, Adjusted R-squared: 0.7944
## F-statistic: 216.8 on 7 and 384 DF, p-value: < 2.2e-16
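To compare the transformations side by side, the residual standard errors and adjusted R-squared values printed above can be collected in a small table. The numbers are copied from the summary() outputs of the untransformed, log, square root, and square models; since they are training-set measures, the ranking is only a rough guide.

```r
# Values copied from the four summary() outputs above
fit_summary <- data.frame(
  model  = c("linear", "log", "sqrt", "square"),
  rse    = c(3.328, 3.069, 3.210, 3.539),
  adj_r2 = c(0.8182, 0.8454, 0.8308, 0.7944)
)

# Order from highest to lowest adjusted R-squared
fit_summary[order(-fit_summary$adj_r2), ]
best_model <- fit_summary$model[which.max(fit_summary$adj_r2)]
```

On these figures the log transformation fits best and the square transformation fits worse than the untransformed model.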
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
sales.model <- lm(Sales~Price+ Urban + US, data=Carseats)
summary(sales.model)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
coef(sales.model)
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
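(i) 13.0435 is the estimated average sales (in thousands of units) for a non-US, non-urban store when the price is 0. (ii) -0.05446 implies that, with the other predictors held fixed, sales decrease on average by about 54 units for each 1 dollar increase in price. (iii) -0.0219 implies that urban stores sell on average about 22 fewer units than non-urban stores, but this difference is not statistically significant (p = 0.936). (iv) 1.2006 implies that, on average, US stores sell about 1,200 more units than non-US stores.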
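Write out the model in equation form, being careful to handle the qualitative variables properly. #Answer Sales = 13.0435 - 0.0545 * Price - 0.0219 * UrbanYes + 1.2006 * USYes, where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.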
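The fitted equation can be evaluated directly to see how the dummy variables enter. The sketch below uses the coefficients printed by coef(sales.model); `predict_sales` is a hypothetical helper written for illustration, and the price of 100 is an arbitrary example value.

```r
# Coefficients copied from coef(sales.model); predict_sales is a hypothetical helper
b <- c(intercept = 13.04346894, price = -0.05445885,
       urban_yes = -0.02191615, us_yes = 1.20057270)

predict_sales <- function(price, urban, us) {
  # urban and us are logical; TRUE maps to the dummy value 1
  unname(b["intercept"] + b["price"] * price +
           b["urban_yes"] * as.numeric(urban) + b["us_yes"] * as.numeric(us))
}

# Predicted sales (in thousands of units) for two non-urban stores at the same price
us_nonurban    <- predict_sales(price = 100, urban = FALSE, us = TRUE)
nonus_nonurban <- predict_sales(price = 100, urban = FALSE, us = FALSE)

us_effect <- us_nonurban - nonus_nonurban  # equals the USYes coefficient
```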
For which of the predictors can you reject the null hypothesis H0 : βj = 0? #Answer The Price and US predictors, whose p-values are both below 0.001. Urban is not significant (p = 0.936).
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
sales.reducedmodel <- lm(Sales~Price+US, data=Carseats)
summary(sales.reducedmodel)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
confint(sales.reducedmodel, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
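The intervals returned by confint() can be reproduced by hand as estimate ± t-quantile × standard error. The inputs below are the rounded values printed by summary(sales.reducedmodel), so the results agree with confint() only to about four decimal places.

```r
# Estimates and standard errors copied from summary(sales.reducedmodel)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523, USYes = 0.25846)
df_resid <- 397

# 97.5% quantile of the t distribution gives a two-sided 95% interval
t_crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
```

Neither interval contains 0, which is consistent with both predictors being significant at the 5% level.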
plot(predict(sales.reducedmodel), rstudent(sales.reducedmodel))
plot(hatvalues(sales.reducedmodel))
which.max(hatvalues(sales.reducedmodel))
## 43
## 43
1. From the rstudent plot, there is no indication of outliers, as no observation has a studentized residual greater than 3 in absolute value. 2. The leverage plot indicates some high-leverage observations; using the which.max function, we see that observation 43 has the largest leverage statistic.
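For reference, the leverage statistics returned by hatvalues() are the diagonal of the hat matrix. The sketch below computes them by hand on the built-in mtcars data, used as a stand-in (an assumption) for Carseats. For the reduced sales model, the average leverage would be (p + 1)/n = 3/400 = 0.0075.

```r
# Stand-in model on mtcars (an assumption; not the Carseats fit)
fit <- lm(mpg ~ wt + am, data = mtcars)

X <- model.matrix(fit)                  # design matrix including the intercept column
H <- X %*% solve(t(X) %*% X) %*% t(X)   # hat matrix
lev_manual <- diag(H)                   # leverage of each observation

# Leverages sum to the number of coefficients, so their mean is (p + 1) / n
avg_lev <- ncol(X) / nrow(X)
most_extreme <- which.max(lev_manual)
```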
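Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? #Answer For Y onto X the estimate is $\sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2$, and for X onto Y it is $\sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} y_i^2$. The numerators are identical, so the two coefficient estimates are equal when $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$ (or, trivially, when $\sum_{i=1}^{n} x_i y_i = 0$).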
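The condition can be checked numerically. The sketch below is an illustration on simulated data: `slope_no_intercept` is a hypothetical helper implementing the no-intercept formula, and the seed and data are arbitrary choices.

```r
# slope_no_intercept is a hypothetical helper: sum(x * y) / sum(x^2)
slope_no_intercept <- function(x, y) sum(x * y) / sum(x^2)

set.seed(1)
x <- rnorm(100)

# Case 1: y is a permutation of x, so sum(y^2) equals sum(x^2)
y_same <- sample(x)
b_yx_same <- slope_no_intercept(x, y_same)  # Y onto X
b_xy_same <- slope_no_intercept(y_same, x)  # X onto Y

# Case 2: y = 2x, so sum(y^2) = 4 * sum(x^2) and the estimates differ
y_diff <- 2 * x
b_yx_diff <- slope_no_intercept(x, y_diff)  # Y onto X
b_xy_diff <- slope_no_intercept(y_diff, x)  # X onto Y
```

In the first case the two estimates coincide; in the second they are 2 and 0.5.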
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(2)
x <- 67:166
y <- 2 * x
y.model <- lm(y ~ x + 0)
x.model <- lm(x ~ y + 0)
summary(y.model)
## Warning in summary.lm(y.model): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.024e-14 -1.243e-14 -2.320e-15 5.060e-15 9.963e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.000e+00 8.445e-17 2.368e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014e-13 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.609e+32 on 1 and 99 DF, p-value: < 2.2e-16
summary(x.model)
## Warning in summary.lm(x.model): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.512e-14 -6.220e-15 -1.160e-15 2.530e-15 4.982e-13
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.000e-01 2.111e-17 2.368e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.068e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 5.609e+32 on 1 and 99 DF, p-value: < 2.2e-16
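Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x <- 21:120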
y <- 120:21
x.model <- lm(x~y+0)
summary(x.model)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.54 -22.15 20.24 62.64 105.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.71285 0.07049 10.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.7 on 99 degrees of freedom
## Multiple R-squared: 0.5081, Adjusted R-squared: 0.5032
## F-statistic: 102.3 on 1 and 99 DF, p-value: < 2.2e-16
y.model <- lm(y~x+0)
summary(y.model)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -64.54 -22.15 20.24 62.64 105.03
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.71285 0.07049 10.11 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.7 on 99 degrees of freedom
## Multiple R-squared: 0.5081, Adjusted R-squared: 0.5032
## F-statistic: 102.3 on 1 and 99 DF, p-value: < 2.2e-16