library(ISLR2)

Problem 2

Carefully explain the differences between the KNN classifier and KNN regression methods.
The difference between the KNN classifier and KNN regression is the type of response they predict: the classifier is used when the response variable is qualitative (categorical), while KNN regression is used when the response is quantitative (numeric). Both are non-parametric and more flexible than parametric models, and both start the same way: identify the K training observations nearest to the point of interest. They differ only in how those neighbors are combined into a prediction. The KNN classifier assigns the point to the class that is most common among the K neighbors (a majority vote, i.e., an estimate of the conditional class probabilities), whereas KNN regression predicts the average of the neighbors' response values.
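
As a minimal base-R sketch (toy data and a hypothetical k, not from the book's labs), both methods find the same K nearest neighbors and differ only in how the neighbors' responses are combined:

set.seed(1)
train_x <- rnorm(50)                              # one numeric predictor
class_y <- factor(ifelse(train_x > 0, "A", "B"))  # categorical response
num_y   <- 3 * train_x + rnorm(50, sd = 0.2)      # numeric response
x0 <- 0.25                                        # point we want a prediction for
k  <- 5

# shared step: find the k training observations closest to x0
nbrs <- order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the neighbors' classes
names(which.max(table(class_y[nbrs])))

# KNN regression: average of the neighbors' numeric responses
mean(num_y[nbrs])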

Problem 9

This question involves the use of multiple linear regression on the Auto data set.

attach(Auto)

(a) Produce a scatterplot matrix which includes all of the variables in the data set.
When we call the pairs() function on the full data set, we receive a "non-numeric argument to 'pairs'" error. So let's first check which variables are non-numeric.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

The variable name is non-numeric and is causing the error. Since it is the 9th column, let's run pairs() on columns 1 through 8.

pairs(Auto[,1:8])
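
Alternatively, we could select the numeric columns programmatically instead of hard-coding the index (a small sketch):

# keep only the numeric columns, whatever their positions
pairs(Auto[, sapply(Auto, is.numeric)])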

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

cor(Auto[ ,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

i. Is there a relationship between the predictors and the response?

ii. Which predictors appear to have a statistically significant relationship to the response?

iii. What does the coefficient for the year variable suggest?

# recode origin from numeric codes (1, 2, 3) to a labelled factor
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))

fit_Auto <- lm(mpg ~ . - name, data = Auto)
summary(fit_Auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0095 -2.0785 -0.0982  1.9856 13.3608 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.795e+01  4.677e+00  -3.839 0.000145 ***
## cylinders      -4.897e-01  3.212e-01  -1.524 0.128215    
## displacement    2.398e-02  7.653e-03   3.133 0.001863 ** 
## horsepower     -1.818e-02  1.371e-02  -1.326 0.185488    
## weight         -6.710e-03  6.551e-04 -10.243  < 2e-16 ***
## acceleration    7.910e-02  9.822e-02   0.805 0.421101    
## year            7.770e-01  5.178e-02  15.005  < 2e-16 ***
## originEuropean  2.630e+00  5.664e-01   4.643 4.72e-06 ***
## originJapanese  2.853e+00  5.527e-01   5.162 3.93e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared:  0.8242, Adjusted R-squared:  0.8205 
## F-statistic: 224.5 on 8 and 383 DF,  p-value: < 2.2e-16

i. There is a relationship between the predictors and the response. The F-statistic is 224.5 with a p-value below 2.2e-16, so we reject the null hypothesis that all of the regression coefficients are zero.

ii. The predictors with the strongest statistically significant relationship to the response are weight, year, originEuropean, and originJapanese, all with p-values below 0.001. Displacement is also statistically significant, with a p-value below 0.01.

iii. The coefficient for year is approximately 0.777. This suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.777 in the response variable, mpg.
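
To illustrate, we can predict mpg for two hypothetical cars that are identical except for model year (all of the predictor values below are made up for illustration):

# two otherwise-identical hypothetical cars, one model year apart
new_cars <- data.frame(
  cylinders = 4, displacement = 150, horsepower = 100, weight = 2800,
  acceleration = 15, year = c(75, 76),
  origin = factor("American", levels = c("American", "European", "Japanese"))
)
diff(predict(fit_Auto, new_cars))  # difference equals the year coefficient, ~0.777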

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(fit_Auto)

The plots do suggest unusually large outliers, and the leverage plot does identify observations with unusually high leverage. Scanning the plots, observation 323 (at least) stands out as an outlier, and observation 14 is highlighted in the Residuals vs Leverage plot.
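
To check the outliers numerically, a common rule of thumb flags studentized residuals larger than about 3 in absolute value:

# observations whose studentized residuals exceed 3 in absolute value
which(abs(rstudent(fit_Auto)) > 3)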

plot(hatvalues(fit_Auto))

This hat-value plot confirms one observation with far higher leverage than the rest.

which.max(hatvalues(fit_Auto))
## 14 
## 14

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

Let's look at all of the pairwise interaction terms:

summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6008 -1.2863  0.0813  1.2082 12.0382 
## 
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                  4.401e+01  5.147e+01   0.855 0.393048    
## cylinders                    3.302e+00  8.187e+00   0.403 0.686976    
## displacement                -3.529e-01  1.974e-01  -1.788 0.074638 .  
## horsepower                   5.312e-01  3.390e-01   1.567 0.117970    
## weight                      -3.259e-03  1.820e-02  -0.179 0.857980    
## acceleration                -6.048e+00  2.147e+00  -2.818 0.005109 ** 
## year                         4.833e-01  5.923e-01   0.816 0.415119    
## originEuropean              -3.517e+01  1.260e+01  -2.790 0.005547 ** 
## originJapanese              -3.765e+01  1.426e+01  -2.640 0.008661 ** 
## cylinders:displacement      -6.316e-03  7.106e-03  -0.889 0.374707    
## cylinders:horsepower         1.452e-02  2.457e-02   0.591 0.555109    
## cylinders:weight             5.703e-04  9.044e-04   0.631 0.528709    
## cylinders:acceleration       3.658e-01  1.671e-01   2.189 0.029261 *  
## cylinders:year              -1.447e-01  9.652e-02  -1.499 0.134846    
## cylinders:originEuropean    -7.210e-01  1.088e+00  -0.662 0.508100    
## cylinders:originJapanese     1.226e+00  1.007e+00   1.217 0.224379    
## displacement:horsepower     -5.407e-05  2.861e-04  -0.189 0.850212    
## displacement:weight          2.659e-05  1.455e-05   1.828 0.068435 .  
## displacement:acceleration   -2.547e-03  3.356e-03  -0.759 0.448415    
## displacement:year            4.547e-03  2.446e-03   1.859 0.063842 .  
## displacement:originEuropean -3.364e-02  4.220e-02  -0.797 0.425902    
## displacement:originJapanese  5.375e-02  4.145e-02   1.297 0.195527    
## horsepower:weight           -3.407e-05  2.955e-05  -1.153 0.249743    
## horsepower:acceleration     -3.445e-03  3.937e-03  -0.875 0.382122    
## horsepower:year             -6.427e-03  3.891e-03  -1.652 0.099487 .  
## horsepower:originEuropean   -4.869e-03  5.061e-02  -0.096 0.923408    
## horsepower:originJapanese    2.289e-02  6.252e-02   0.366 0.714533    
## weight:acceleration         -6.851e-05  2.385e-04  -0.287 0.774061    
## weight:year                 -8.065e-05  2.184e-04  -0.369 0.712223    
## weight:originEuropean        2.277e-03  2.685e-03   0.848 0.397037    
## weight:originJapanese       -4.498e-03  3.481e-03  -1.292 0.197101    
## acceleration:year            6.141e-02  2.547e-02   2.412 0.016390 *  
## acceleration:originEuropean  9.234e-01  2.641e-01   3.496 0.000531 ***
## acceleration:originJapanese  7.159e-01  3.258e-01   2.198 0.028614 *  
## year:originEuropean          2.932e-01  1.444e-01   2.031 0.043005 *  
## year:originJapanese          3.139e-01  1.483e-01   2.116 0.035034 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared:  0.8967, Adjusted R-squared:  0.8866 
## F-statistic: 88.34 on 35 and 356 DF,  p-value: < 2.2e-16

The interaction term acceleration:originEuropean has a p-value below 0.001. Additionally, cylinders:acceleration, acceleration:year, acceleration:originJapanese, year:originEuropean, and year:originJapanese have p-values below 0.05.
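
As a more targeted follow-up, a single interaction can be fit with the * syntax, which expands to the main effects plus the : interaction term (horsepower and weight are chosen here purely as an illustration):

# horsepower * weight expands to horsepower + weight + horsepower:weight
summary(lm(mpg ~ horsepower * weight, data = Auto))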

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

fit_Auto2 <- lm(mpg ~ weight + log(weight) + horsepower + sqrt(horsepower) + acceleration + I(acceleration^2))
summary(fit_Auto2)
## 
## Call:
## lm(formula = mpg ~ weight + log(weight) + horsepower + sqrt(horsepower) + 
##     acceleration + I(acceleration^2))
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -11.973  -2.243  -0.209   2.026  14.961 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       210.984487  48.071782   4.389 1.47e-05 ***
## weight              0.001594   0.002436   0.654 0.513346    
## log(weight)       -16.616856   7.565506  -2.196 0.028659 *  
## horsepower          0.229140   0.084785   2.703 0.007185 ** 
## sqrt(horsepower)   -7.209663   1.922123  -3.751 0.000203 ***
## acceleration       -1.143819   0.724376  -1.579 0.115148    
## I(acceleration^2)   0.025562   0.021865   1.169 0.243089    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.915 on 385 degrees of freedom
## Multiple R-squared:  0.7522, Adjusted R-squared:  0.7483 
## F-statistic: 194.8 on 6 and 385 DF,  p-value: < 2.2e-16

Rather than transforming every variable, I applied a transformation to each of three different variables, following what I saw other analysts do. The square root of horsepower and the log of weight are both significant. However, this model fits the data worse overall than the full model from part (c) (R-squared of 0.75 versus 0.82). Let's separate out the transformations:

fit_Auto_sqr <- lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)

summary(fit_Auto_sqr)
## 
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) + 
##     I(horsepower^2) + I(year^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.4816 -1.5384  0.0735  1.3671 12.0213 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        4.185e+02  6.966e+01   6.008 4.40e-09 ***
## cylinders          5.073e-01  3.191e-01   1.590 0.112692    
## displacement      -3.328e-02  2.045e-02  -1.627 0.104480    
## horsepower        -1.781e-01  3.953e-02  -4.506 8.81e-06 ***
## weight            -1.114e-02  2.587e-03  -4.306 2.12e-05 ***
## acceleration      -1.700e-01  9.652e-02  -1.762 0.078960 .  
## year              -1.019e+01  1.837e+00  -5.546 5.49e-08 ***
## originEuropean     1.323e+00  5.304e-01   2.494 0.013068 *  
## originJapanese     1.258e+00  5.129e-01   2.452 0.014637 *  
## I(weight^2)        1.182e-06  3.438e-07   3.439 0.000649 ***
## I(displacement^2)  5.839e-05  3.435e-05   1.700 0.089967 .  
## I(horsepower^2)    4.388e-04  1.336e-04   3.284 0.001118 ** 
## I(year^2)          7.210e-02  1.207e-02   5.974 5.35e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared:  0.8773, Adjusted R-squared:  0.8735 
## F-statistic: 225.9 on 12 and 379 DF,  p-value: < 2.2e-16

After reviewing work on RPubs where another analyst (lmorgan95) determined that the best variables to square were weight, displacement, horsepower, and year, we see a substantial increase in R-squared (0.877 versus 0.824) and strong significance for every transformed term except I(displacement^2). Now let's take the log of the response variable:

fit_Auto_log <- lm(log(mpg) ~ . - name, data = Auto)

summary(fit_Auto_log)
## 
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.40380 -0.06679  0.00493  0.06913  0.33036 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     1.712e+00  1.673e-01  10.230  < 2e-16 ***
## cylinders      -2.781e-02  1.149e-02  -2.420  0.01598 *  
## displacement    7.874e-04  2.738e-04   2.876  0.00425 ** 
## horsepower     -1.520e-03  4.904e-04  -3.100  0.00208 ** 
## weight         -2.639e-04  2.344e-05 -11.260  < 2e-16 ***
## acceleration   -1.403e-03  3.513e-03  -0.399  0.68996    
## year            3.055e-02  1.852e-03  16.491  < 2e-16 ***
## originEuropean  8.531e-02  2.026e-02   4.210 3.18e-05 ***
## originJapanese  8.145e-02  1.977e-02   4.119 4.66e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared:  0.8815, Adjusted R-squared:  0.879 
## F-statistic: 356.1 on 8 and 383 DF,  p-value: < 2.2e-16

Taking the log of the response variable improved the R-squared even further (0.8815), though note that the response is now on the log scale.
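
To put the three fits side by side (keeping in mind that the log-response model is on a different scale, so its R-squared is not strictly comparable to the others):

# adjusted R-squared for the original, squared-term, and log-response models
c(original = summary(fit_Auto)$adj.r.squared,
  squares  = summary(fit_Auto_sqr)$adj.r.squared,
  log_mpg  = summary(fit_Auto_log)$adj.r.squared)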

detach(Auto)

Problem 10

This question should be answered using the Carseats data set.

library(ISLR2)
attach(Carseats)

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

fit<-lm(Sales~Price+Urban+US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!
From the table above, Price and US are significant predictors of Sales (recall that Sales is measured in thousands of units). Holding the other predictors fixed, each $1 increase in Price is associated with a decrease of about 0.054 thousand units (roughly 54 car seats) sold. Stores in the US sell about 1.2 thousand (roughly 1,200) more car seats than stores outside the US, on average. The coefficient on UrbanYes is not statistically significant (p = 0.936), so there is no evidence that urban location affects Sales.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459Price - 0.021916Urban_{Yes} + 1.200573US_{Yes}\)

(d) For which of the predictors can you reject the null hypothesis \(H_0 : \beta_j = 0?\)
Price and US

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

fit<-lm(Sales~Price+US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?
Not very well: each model explains only about 24% of the variance in Sales (R-squared ≈ 0.239 for both).

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?
R has built-in functions that identify influential points using several statistics with one simple command. Researchers have suggested various cutoffs for how much influence an observation can have before it is flagged as a potential outlier or high-leverage point. For example, the average leverage is \(\frac{(p+1)}{n}\), which for this model is \(\frac{(2+1)}{400} = 0.0075\).
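
We can compute that threshold and flag observations well above it (a common rule of thumb uses two to three times the average leverage):

p <- length(coef(fit)) - 1    # 2 predictors in the model from (e)
n <- nrow(Carseats)           # 400 observations
avg_leverage <- (p + 1) / n   # 0.0075

# flag observations with leverage more than three times the average
which(hatvalues(fit) > 3 * avg_leverage)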

par(mfrow=c(2,2))
plot(fit)

summary(influence.measures(fit))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

R flags several observations that exceed conventional cutoffs for at least one influence measure. A common practice is to report both a regression fit to all of the data and one with the flagged observations removed, and compare the two.

outlying.obs <- c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.small <- Carseats[-outlying.obs, ]
fit2<-lm(Sales~Price+US,data=Carseats.small)
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.small)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

With these potential outliers and influential observations removed, very little changes from the linear model fit to the full data set. The confidence intervals from the full-data model contain the coefficient estimates from the model with the flagged observations removed, so it seems safe to keep all of the data points in the model.
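
We can check that directly by stacking the two sets of coefficient estimates:

# coefficients with and without the flagged observations
rbind(full_data = coef(fit), outliers_removed = coef(fit2))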

detach(Carseats)

Problem 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
The two estimates share the same numerator, \(\sum_{i=1}^{n} x_i y_i\), so they are equal exactly when the denominators of the coefficient estimate equations are equal, that is, when the sums of squares of x and y are the same: \[\sum_{i=1}^{n} x^2_i=\sum_{i=1}^{n} y^2_i\]

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
First I'll set the seed to 1 to make the results reproducible.

set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100, sd = .5)
data <- data.frame(x, y)

Let's make sure the sums of squares aren't equal:

sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 345.9723
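
We can also compute the closed-form slope estimates from (3.38) directly; they should match the no-intercept fits below:

sum(x * y) / sum(x^2)   # slope for the regression of y onto x
sum(x * y) / sum(y^2)   # slope for the regression of x onto y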

Now let's verify by fitting both regressions and comparing the coefficients.

lm_y <- lm(y ~ x + 0)
lm_x <- lm(x ~ y + 0)
summary(lm_y)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.95768 -0.32358 -0.08853  0.25279  1.15545 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  1.99694    0.05324   37.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4793 on 99 degrees of freedom
## Multiple R-squared:  0.9343, Adjusted R-squared:  0.9336 
## F-statistic:  1407 on 1 and 99 DF,  p-value: < 2.2e-16
summary(lm_x)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.50931 -0.10863  0.05499  0.14436  0.44044 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.46785    0.01247   37.51   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.232 on 99 degrees of freedom
## Multiple R-squared:  0.9343, Adjusted R-squared:  0.9336 
## F-statistic:  1407 on 1 and 99 DF,  p-value: < 2.2e-16

The coefficients are different (1.997 versus 0.468), as expected since the sums of squares differ.

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
Let's take the values 1 to 100 for x and the same values in reverse order, 100 down to 1, for y. That way the sums of squares come out the same.

x<-1:100
y<-100:1
sum(x^2)
## [1] 338350
sum(y^2)
## [1] 338350

Now let's fit the models and confirm that the coefficients are the same.

lm_y2 <- lm(y ~ x + 0)
lm_x2 <- lm(x ~ y + 0)
summary(lm_y2)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

That was the regression of y onto x; now let's compare it with the regression of x onto y:

summary(lm_x2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08