Exercises

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The main difference is the goal. The KNN classifier is used when the response variable is categorical, while KNN regression is used when the response variable is numerical. The KNN classifier predicts, for a point \(x_0\), the class \(j\) with the highest estimated conditional probability. KNN regression identifies the \(K\) training observations closest to the prediction point \(x_0\), then estimates \(f(x_0)\) by taking the mean of the \(y\) values of those training observations.
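
Concretely, both methods average over the same neighborhood \(\mathcal{N}_0\), the set of the \(K\) training points closest to \(x_0\); only the quantity being averaged differs. Written out in the book's notation:

\[Pr(Y = j \mid X = x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} I(y_i = j) \quad \text{(classification)}\]

\[\hat{f}(x_0) = \frac{1}{K}\sum_{i \in \mathcal{N}_0} y_i \quad \text{(regression)}\]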

Applied

9. This question involves the use of multiple linear regression on the Auto data set.

library(MASS)
library(ISLR)
library(tinytex)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Auto1 <- Auto[,1:8]
cor(Auto1)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
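
As an aside, name can also be excluded by name rather than by position, which is a bit more robust if the column order ever changes; a minimal equivalent:

cor(subset(Auto, select = -name))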

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

# recode origin (1 = American, 2 = European, 3 = Japanese) as a factor
Auto1$origin <- factor(Auto1$origin, labels = c("American", "European", "Japanese"))
m1 <- lm(mpg ~ ., data = Auto1)
summary(m1)
## 
## Call:
## lm(formula = mpg ~ ., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Based on the F-test's p-value of < 2.2e-16, we can reject the null hypothesis \(H_0:β_1=β_2=⋯=β_p=0\) and conclude that there is a relationship between the response variable mpg and the predictors.

ii. Which predictors appear to have a statistically significant relationship to the response?

Based on the output, at the 0.05 level all of the predictors except cylinders, horsepower, and acceleration have a statistically significant relationship with the response.

iii. What does the coefficient for the year variable suggest?

coef(m1)['year']
##      year 
## 0.7770269

The coefficient of 0.777 suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.78 mpg.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(m1)

The Residuals vs. Fitted plot suggests that the relationship between the response and the predictors is not linear. There are also some observations with high leverage and large residuals that could unduly influence the model's fit.
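
To put rough numbers on that impression, the suspect points can be flagged directly. A quick sketch using common rules of thumb (these cutoffs are conventions, not the book's prescription): studentized residuals beyond ±3 for outliers, and leverage above twice the average \((p+1)/n\):

which(abs(rstudent(m1)) > 3)                    # unusually large studentized residuals
which(hatvalues(m1) > 2 * mean(hatvalues(m1)))  # unusually high leverage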

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

m2 <- lm(formula = mpg ~ . * ., data = Auto1)  # all main effects plus every two-way interaction (same as mpg ~ .^2)
summary(m2)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6008 -1.2863  0.0813  1.2082 12.0382 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.401e+01  5.147e+01   0.855 0.393048    
## cylinders                    3.302e+00  8.187e+00   0.403 0.686976    
## displacement                -3.529e-01  1.974e-01  -1.788 0.074638 .  
## horsepower                   5.312e-01  3.390e-01   1.567 0.117970    
## weight                      -3.259e-03  1.820e-02  -0.179 0.857980    
## acceleration                -6.048e+00  2.147e+00  -2.818 0.005109 ** 
## year                         4.833e-01  5.923e-01   0.816 0.415119    
## originEuropean              -3.517e+01  1.260e+01  -2.790 0.005547 ** 
## originJapanese              -3.765e+01  1.426e+01  -2.640 0.008661 ** 
## cylinders:displacement      -6.316e-03  7.106e-03  -0.889 0.374707    
## cylinders:horsepower         1.452e-02  2.457e-02   0.591 0.555109    
## cylinders:weight             5.703e-04  9.044e-04   0.631 0.528709    
## cylinders:acceleration       3.658e-01  1.671e-01   2.189 0.029261 *  
## cylinders:year              -1.447e-01  9.652e-02  -1.499 0.134846    
## cylinders:originEuropean    -7.210e-01  1.088e+00  -0.662 0.508100    
## cylinders:originJapanese     1.226e+00  1.007e+00   1.217 0.224379    
## displacement:horsepower     -5.407e-05  2.861e-04  -0.189 0.850212    
## displacement:weight          2.659e-05  1.455e-05   1.828 0.068435 .  
## displacement:acceleration   -2.547e-03  3.356e-03  -0.759 0.448415    
## displacement:year            4.547e-03  2.446e-03   1.859 0.063842 .  
## displacement:originEuropean -3.364e-02  4.220e-02  -0.797 0.425902    
## displacement:originJapanese  5.375e-02  4.145e-02   1.297 0.195527    
## horsepower:weight           -3.407e-05  2.955e-05  -1.153 0.249743    
## horsepower:acceleration     -3.445e-03  3.937e-03  -0.875 0.382122    
## horsepower:year             -6.427e-03  3.891e-03  -1.652 0.099487 .  
## horsepower:originEuropean   -4.869e-03  5.061e-02  -0.096 0.923408    
## horsepower:originJapanese    2.289e-02  6.252e-02   0.366 0.714533    
## weight:acceleration         -6.851e-05  2.385e-04  -0.287 0.774061    
## weight:year                 -8.065e-05  2.184e-04  -0.369 0.712223    
## weight:originEuropean        2.277e-03  2.685e-03   0.848 0.397037    
## weight:originJapanese       -4.498e-03  3.481e-03  -1.292 0.197101    
## acceleration:year            6.141e-02  2.547e-02   2.412 0.016390 *  
## acceleration:originEuropean  9.234e-01  2.641e-01   3.496 0.000531 ***
## acceleration:originJapanese  7.159e-01  3.258e-01   2.198 0.028614 *  
## year:originEuropean          2.932e-01  1.444e-01   2.031 0.043005 *  
## year:originJapanese          3.139e-01  1.483e-01   2.116 0.035034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8866 
## F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

Using a significance level of 0.05, these are the statistically significant interaction effects (they can also be extracted programmatically, as sketched after the list):

  • cylinders:acceleration
  • acceleration:year
  • acceleration:originEuropean
  • acceleration:originJapanese
  • year:originEuropean
  • year:originJapanese
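
The same list can be pulled out programmatically from the coefficient table; a small sketch:

coefs <- summary(m2)$coefficients
# keep interaction terms (names containing ":") whose p-value is below 0.05
coefs[grepl(":", rownames(coefs)) & coefs[, 4] < 0.05, ]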

(f) Try a few different transformations of the variables, such as \(\log(X)\), \(\sqrt{X}\), \(X^2\). Comment on your findings.

m3 <- lm(mpg ~ .+ I(cylinders^2) + I(displacement^2) + I(horsepower^2) + I(weight^2) + I(acceleration^2) + I(year^2), data = Auto1)
summary(m3)
## 
## Call:
## lm(formula = mpg ~ . + I(cylinders^2) + I(displacement^2) + I(horsepower^2) + 
##     I(weight^2) + I(acceleration^2) + I(year^2), data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6457 -1.5810  0.0953  1.3132 12.2519 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.069e+02  6.928e+01   5.873 9.39e-09 ***
## cylinders          9.603e-01  1.413e+00   0.679  0.49728    
## displacement      -2.559e-02  2.250e-02  -1.137  0.25615    
## horsepower        -1.545e-01  4.153e-02  -3.719  0.00023 ***
## weight            -1.322e-02  2.681e-03  -4.929 1.24e-06 ***
## acceleration      -1.677e+00  5.552e-01  -3.021  0.00269 ** 
## year              -9.562e+00  1.840e+00  -5.196 3.34e-07 ***
## originEuropean     1.105e+00  5.348e-01   2.067  0.03944 *  
## originJapanese     1.258e+00  5.126e-01   2.454  0.01456 *  
## I(cylinders^2)    -4.655e-02  1.142e-01  -0.407  0.68392    
## I(displacement^2)  3.714e-05  3.882e-05   0.957  0.33933    
## I(horsepower^2)    3.448e-04  1.414e-04   2.438  0.01522 *  
## I(weight^2)        1.523e-06  3.643e-07   4.179 3.64e-05 ***
## I(acceleration^2)  4.519e-02  1.640e-02   2.756  0.00614 ** 
## I(year^2)          6.801e-02  1.209e-02   5.626 3.59e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.756 on 377 degrees of freedom
## Multiple R-squared:  0.8798, Adjusted R-squared:  0.8753 
## F-statistic:   197 on 14 and 377 DF,  p-value: < 2.2e-16
m4 <- lm(mpg ~ .+ I(displacement^2) + I(horsepower^2) + I(weight^2) + I(acceleration^2) + I(year^2), data = Auto1)
summary(m4)
## 
## Call:
## lm(formula = mpg ~ . + I(displacement^2) + I(horsepower^2) + 
##     I(weight^2) + I(acceleration^2) + I(year^2), data = Auto1)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.780 -1.595  0.086  1.284 12.256 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.071e+02  6.920e+01   5.884 8.84e-09 ***
## cylinders          3.993e-01  3.189e-01   1.252  0.21123    
## displacement      -2.201e-02  2.069e-02  -1.064  0.28816    
## horsepower        -1.592e-01  3.980e-02  -4.001 7.60e-05 ***
## weight            -1.311e-02  2.665e-03  -4.918 1.30e-06 ***
## acceleration      -1.644e+00  5.484e-01  -2.997  0.00291 ** 
## year              -9.541e+00  1.837e+00  -5.192 3.40e-07 ***
## originEuropean     1.130e+00  5.306e-01   2.131  0.03377 *  
## originJapanese     1.282e+00  5.087e-01   2.521  0.01212 *  
## I(displacement^2)  3.081e-05  3.553e-05   0.867  0.38650    
## I(horsepower^2)    3.612e-04  1.355e-04   2.666  0.00801 ** 
## I(weight^2)        1.503e-06  3.606e-07   4.167 3.83e-05 ***
## I(acceleration^2)  4.418e-02  1.619e-02   2.729  0.00666 ** 
## I(year^2)          6.786e-02  1.207e-02   5.623 3.66e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.753 on 378 degrees of freedom
## Multiple R-squared:  0.8797, Adjusted R-squared:  0.8756 
## F-statistic: 212.7 on 13 and 378 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m4)

The \(R^2\) has improved on the original model, from 0.8242 to 0.8797. The residuals-vs-fitted plot also looks more linear than the original model's.
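
Since m4 nests the original model m1 (it only adds the squared terms), a partial F-test is one way to confirm that the quadratic terms add real explanatory power:

anova(m1, m4)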

m5 <- lm(log(mpg) ~ ., data = Auto1)
summary(m5)
## 
## Call:
## lm(formula = log(mpg) ~ ., data = Auto1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40380 -0.06679  0.00493  0.06913  0.33036 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.712e+00  1.673e-01  10.230  < 2e-16 ***
## cylinders      -2.781e-02  1.149e-02  -2.420  0.01598 *  
## displacement    7.874e-04  2.738e-04   2.876  0.00425 ** 
## horsepower     -1.520e-03  4.904e-04  -3.100  0.00208 ** 
## weight         -2.639e-04  2.344e-05 -11.260  < 2e-16 ***
## acceleration   -1.403e-03  3.513e-03  -0.399  0.68996    
## year            3.055e-02  1.852e-03  16.491  < 2e-16 ***
## originEuropean  8.531e-02  2.026e-02   4.210 3.18e-05 ***
## originJapanese  8.145e-02  1.977e-02   4.119 4.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared:  0.8815, Adjusted R-squared:  0.879 
## F-statistic: 356.1 on 8 and 383 DF,  p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(m5)

Log-transforming the response mpg yields a better-performing model, with \(R^2\) equal to 0.8815. The Q-Q plot also shows some improvement in normality. A still better model can be built by combining the most useful terms from the previous models:

m6 <- lm(log(mpg) ~ . + I(horsepower^2) + I(year^2) + I(weight^2) + acceleration:year + acceleration:origin, data = Auto1)
summary(m6)
## 
## Call:
## lm(formula = log(mpg) ~ . + I(horsepower^2) + I(year^2) + I(weight^2) + 
##     acceleration:year + acceleration:origin, data = Auto1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.38388 -0.06247  0.00788  0.06144  0.38240 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  1.351e+01  2.699e+00   5.005 8.58e-07 ***
## cylinders                   -1.111e-02  1.105e-02  -1.005 0.315667    
## displacement                -8.330e-05  2.817e-04  -0.296 0.767641    
## horsepower                  -3.460e-03  1.554e-03  -2.226 0.026611 *  
## weight                      -4.607e-04  8.299e-05  -5.551 5.36e-08 ***
## acceleration                -1.697e-01  4.650e-02  -3.649 0.000301 ***
## year                        -2.314e-01  7.236e-02  -3.198 0.001499 ** 
## originEuropean              -3.146e-01  9.686e-02  -3.248 0.001266 ** 
## originJapanese              -1.989e-01  1.225e-01  -1.624 0.105299    
## I(horsepower^2)              3.890e-06  5.096e-06   0.763 0.445730    
## I(year^2)                    1.513e-03  4.905e-04   3.085 0.002183 ** 
## I(weight^2)                  4.186e-08  1.094e-08   3.828 0.000152 ***
## acceleration:year            2.024e-03  6.139e-04   3.297 0.001072 ** 
## acceleration:originEuropean  2.197e-02  5.643e-03   3.894 0.000116 ***
## acceleration:originJapanese  1.463e-02  7.556e-03   1.937 0.053500 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1061 on 377 degrees of freedom
## Multiple R-squared:  0.9061, Adjusted R-squared:  0.9026 
## F-statistic: 259.9 on 14 and 377 DF,  p-value: < 2.2e-16
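
For a side-by-side look at the candidate models, the adjusted \(R^2\) values can be collected in one call (with the caveat that m5 and m6 are fit on the log(mpg) scale, so their \(R^2\) values are not directly comparable to those of m1 and m4):

sapply(list(m1 = m1, m4 = m4, m5 = m5, m6 = m6),
       function(m) summary(m)$adj.r.squared)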

10. This question should be answered using the Carseats data set.

str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
car <- Carseats

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

mr1 <- lm(Sales ~ Price + Urban + US, data = car)
summary(mr1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

Price = -0.054459. For every one-unit increase in Price, with all other predictors held constant, Sales decrease by about 0.054 units, i.e. roughly 54 fewer carseats sold (Sales is measured in thousands of units).

Urban = -0.021916. A store located in an urban area sells about 0.022 units (roughly 22 carseats) less. However, the p-value is far greater than the 0.05 significance level, so there is no evidence of a relationship between Sales and Urban.

US = 1.200573. A store located in the US sells about 1.20 units (roughly 1,200 carseats) more.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

\(Sales=13.043469−0.054459*Price−0.021916*Urban+1.200573*US\)

  • Urban = 1 for a store in an urban location, else 0
  • US = 1 for a store in the US, else 0
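
To see the dummy coding in action, the fitted equation can be evaluated for each of the four store types at a hypothetical price (the $120 below is purely for illustration):

predict(mr1, newdata = data.frame(Price = 120,
                                  Urban = c("Yes", "No", "Yes", "No"),
                                  US    = c("Yes", "Yes", "No", "No")))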

(d) For which of the predictors can you reject the null hypothesis \(H_0 : β_j = 0\)?

We can reject the null hypothesis for Price and US, whose p-values are well below 0.05.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

mr2 <- lm(Sales ~ Price + US, data = car)
summary(mr2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

summary(mr1)$r.squared
## [1] 0.2392754

For model in (a):
\(R^2 = 0.2392754\) and \(Adj. R^2 = 0.2335\)

summary(mr2)$r.squared
## [1] 0.2392629

For model in (e):
\(R^2 = 0.2392629\) and \(Adj. R^2 = 0.2354\)

Although the model in (e) has a slightly smaller \(R^2\) than the model in (a), its adjusted \(R^2\) is higher, and it drops the uninformative Urban variable, so the model in (e) is the better choice.
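
An information-criterion comparison points the same way (lower AIC is better); a quick check:

AIC(mr1, mr2)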

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(mr2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

par(mfrow=c(2,2))
plot(mr2)

dev.new(width = 1000, height = 1000, unit = "px")  # open a larger device for the Cook's distance plot
plot(mr2, which = 4, id.n = 5)                     # Cook's distance, labeling the 5 most extreme points
(inf.id <- which(cooks.distance(mr2) > 0.02))      # observations with Cook's distance above 0.02
##  26  50 317 368 
##  26  50 317 368

Yes. A few observations have standardized residuals outside [−2, 2], which flags them as potential outliers, and the Residuals vs Leverage plot shows a handful of points with noticeably higher leverage than the rest; the Cook's distance plot singles out observations 26, 50, 317, and 368 as the most influential.
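
As a numeric cross-check, the usual rules of thumb (|standardized residual| > 2 for outliers, leverage above \(2(p+1)/n\); conventions rather than the book's prescription) can be applied directly:

which(abs(rstandard(mr2)) > 2)       # potential outliers
which(hatvalues(mr2) > 2 * 3 / 400)  # high leverage: cutoff 2(p+1)/n with p = 2, n = 400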

12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \(\hat{β}\) for the linear regression of \(Y\) onto \(X\) without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of \(X\) onto \(Y\) the same as the coefficient estimate for the regression of \(Y\) onto \(X\)?

From (3.38), regressing \(Y\) onto \(X\) gives \(\hat{β} = ∑_{i=1}^{n}x_iy_i / ∑_{i=1}^{n}x^2_i\), while regressing \(X\) onto \(Y\) gives \(\hat{β}' = ∑_{i=1}^{n}x_iy_i / ∑_{i=1}^{n}y^2_i\). The numerators are identical, so the two estimates coincide exactly when \(∑_{i=1}^{n}x^2_i = ∑_{i=1}^{n}y^2_i\).

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is different from the coefficient estimate for the regression of \(Y\) onto \(X\).

set.seed(2)
x <- rnorm(100)
y <- 3*x + rnorm(100, sd = 2)
data <- data.frame(x, y)

lm_yx <- lm(y ~ x + 0)
lm_xy <- lm(x ~ y + 0)
summary(lm_yx)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2442 -1.5840  0.3314  1.4960  4.1668 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   2.9045     0.1705   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.968 on 99 degrees of freedom
## Multiple R-squared:  0.7457, Adjusted R-squared:  0.7432 
## F-statistic: 290.3 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm_xy)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.52583 -0.46562 -0.02371  0.42784  1.06727 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.25674    0.01507   17.04   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5852 on 99 degrees of freedom
## Multiple R-squared:  0.7457, Adjusted R-squared:  0.7432 
## F-statistic: 290.3 on 1 and 99 DF,  p-value: < 2.2e-16
coef(lm_yx)
##        x 
## 2.904515
coef(lm_xy)
##        y 
## 0.256745

As expected, the two slope estimates differ: \(β_x ≠ β_y\) (2.9045 vs. 0.2567).

plot(x, y)
abline(lm_yx ,lwd= 3, col = "red")
abline(lm_xy, lwd = 3, col = 'green')

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of \(X\) onto \(Y\) is the same as the coefficient estimate for the regression of \(Y\) onto \(X\).

set.seed(3)
x <- rnorm(100)
y <- x
data1 <- data.frame(x, y)

lm_yx1 <- lm(y ~ x + 0)
lm_xy1 <- lm(x ~ y + 0)
coef(lm_yx1)
## x 
## 1
coef(lm_xy1)
## y 
## 1

\(β_x = β_y\)

plot(x, y)
abline(lm_yx1 ,lwd= 3, col = "red")
abline(lm_xy1, lwd = 3, col = 'green')
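
A less trivial construction for (c) is to take \(y\) as a random permutation of \(x\) (the particular seed and permutation below are arbitrary): the values are reordered, so \(x ≠ y\) elementwise, yet \(∑x^2_i = ∑y^2_i\) still holds and both regressions return the identical slope \(∑x_iy_i/∑x^2_i\):

set.seed(4)
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so sum(x^2) == sum(y^2)
coef(lm(y ~ x + 0))
coef(lm(x ~ y + 0))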