library(ISLR2)
Carefully explain the differences between the KNN classifier
and KNN regression methods.
The difference between the KNN classifier and KNN regression is that the
classifier is used when the response variable is categorical, while the
regression method is used when it is numeric. Both are non-parametric and
therefore more flexible than parametric models, and both predict the response
for a new observation from its K nearest neighbors in the predictor space.
They differ in how the neighbors are combined and in the type of output: the
classifier assigns the most common class among the K neighbors, whereas KNN
regression predicts the average of the neighbors' responses.
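As a minimal sketch of the distinction (assuming the class and FNN packages are available, and using made-up toy data), both calls below use the same neighbors but return different output types:
library(class)  # knn() for classification
library(FNN)    # knn.reg() for regression
set.seed(1)
train_x <- matrix(rnorm(100), ncol = 2)                # 50 toy training points
test_x  <- matrix(rnorm(10), ncol = 2)                 # 5 toy test points
class_y <- factor(ifelse(train_x[, 1] > 0, "A", "B"))  # categorical response
num_y   <- train_x[, 1] + rnorm(50, sd = 0.1)          # numeric response
knn(train_x, test_x, cl = class_y, k = 5)              # returns predicted classes
knn.reg(train_x, test_x, y = num_y, k = 5)$pred        # returns predicted numeric values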
This question involves the use of multiple linear regression on the Auto data set.
attach(Auto)
(a) Produce a scatterplot matrix which includes all of the
variables in the data set.
When using the pairs() function on the full data set, we receive a
"non-numeric argument to 'pairs'" error, so let's first see where the
non-numeric variables are.
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
The variable name is non-numeric and is causing our error.
Since it's in the 9th column, let's run pairs() on variables 1 through 8.
pairs(Auto[,1:8])
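An alternative that does not depend on name sitting in the 9th position is to select the numeric columns programmatically (a small sketch using base R):
num_cols <- sapply(Auto, is.numeric)  # TRUE for every numeric column
pairs(Auto[, num_cols])               # same scatterplot matrix, name excluded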
(b) Compute the matrix of correlations between the variables
using the function cor(). You will need to exclude the name
variable, which is qualitative.
cor(Auto[ ,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
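Since we will regress mpg on the other variables next, it can help to rank its correlations by absolute strength (a quick sketch built from the matrix above):
r <- cor(Auto[, -9])
sort(abs(r["mpg", -1]), decreasing = TRUE)  # strongest linear associations with mpg first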
(c) Use the lm() function to perform a multiple linear
regression with mpg as the response and all other variables
except name as the predictors. Use the
summary() function to print the results. Comment on the
output. For instance:
i. Is there a relationship between the predictors and the response?
ii. Which predictors appear to have a statistically significant relationship to the response?
iii. What does the coefficient for the year variable suggest?
# origin is coded 1/2/3, so recode it as a factor with descriptive labels so lm() treats it as qualitative
Auto$origin <- factor(Auto$origin, labels = c("American", "European", "Japanese"))
fit_Auto <- lm(mpg ~ . - name, data = Auto)
summary(fit_Auto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## originEuropean 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## originJapanese 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
i. There is a relationship between the predictors and the response, with some predictors being stronger than others. The F-statistic is 224.5 with a p-value below 2.2e-16, so we can reject the null hypothesis that all of the coefficients are zero (the exact p-value can be recovered from the summary object, as sketched after this list).
ii. The predictors with the strongest statistically
significant relationship to the response are weight,
year, originEuropean, and
originJapanese; these all have p-values less than .001.
Additionally, displacement has a p-value less than .01,
which still indicates strong statistical significance.
iii. The coefficient for year is approximately
.777. This means that, holding the other predictors fixed, increasing the year by 1
is associated with an increase of about .777 in the response variable, mpg; in
other words, newer cars tend to be more fuel efficient.
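As referenced in item i, the overall F-test p-value can be recovered from the summary object rather than read off the printout (a small sketch):
fstat <- summary(fit_Auto)$fstatistic            # F value, numerator df, denominator df
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)  # exact p-value of the overall F-test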
(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(fit_Auto)
The plots do suggest there are unusually large outliers, and the leverage plot does identify observations with high leverage. From a visual scan of the plots, observation 323 (among others) stands out as an outlier in the residual plots, and observation 14 is highlighted in the Residuals vs Leverage plot.
plot(hatvalues(fit_Auto))
To confirm, this plot of the hat values shows one observation with markedly higher leverage than the rest.
which.max(hatvalues(fit_Auto))
## 14
## 14
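To back up the visual scan of the residual plots, a common rule of thumb flags observations whose studentized residuals exceed 3 in absolute value (a quick sketch):
which(abs(rstudent(fit_Auto)) > 3)  # candidate outliers by studentized residual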
(e) Use the * and : symbols to fit
linear regression models with interaction effects. Do any interactions
appear to be statistically significant?
Let's look at all the pairwise interaction terms:
summary(lm(formula = mpg ~ . * ., data = Auto[, -9]))
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## originEuropean -3.517e+01 1.260e+01 -2.790 0.005547 **
## originJapanese -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:originEuropean -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:originJapanese 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:originEuropean -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:originJapanese 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:originEuropean -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:originJapanese 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:originEuropean 2.277e-03 2.685e-03 0.848 0.397037
## weight:originJapanese -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:originEuropean 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:originJapanese 7.159e-01 3.258e-01 2.198 0.028614 *
## year:originEuropean 2.932e-01 1.444e-01 2.031 0.043005 *
## year:originJapanese 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
The interaction term acceleration:originEuropean has a
p-value less than .001. Additionally, acceleration:year,
acceleration:originJapanese, year:originEuropean, and
year:originJapanese have p-values less than .05.
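Rather than keeping all of the interaction terms, one could refit a model containing only the promising interactions above, for example (a sketch; this particular model choice is illustrative, not from the text):
fit_inter <- lm(mpg ~ . - name + acceleration:year + acceleration:origin + year:origin, data = Auto)
summary(fit_inter)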
(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.
fit_Auto2 <- lm(mpg ~ weight + log(weight) + horsepower + sqrt(horsepower) + acceleration + I(acceleration^2))
summary(fit_Auto2)
##
## Call:
## lm(formula = mpg ~ weight + log(weight) + horsepower + sqrt(horsepower) +
## acceleration + I(acceleration^2))
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.973 -2.243 -0.209 2.026 14.961
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 210.984487 48.071782 4.389 1.47e-05 ***
## weight 0.001594 0.002436 0.654 0.513346
## log(weight) -16.616856 7.565506 -2.196 0.028659 *
## horsepower 0.229140 0.084785 2.703 0.007185 **
## sqrt(horsepower) -7.209663 1.922123 -3.751 0.000203 ***
## acceleration -1.143819 0.724376 -1.579 0.115148
## I(acceleration^2) 0.025562 0.021865 1.169 0.243089
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.915 on 385 degrees of freedom
## Multiple R-squared: 0.7522, Adjusted R-squared: 0.7483
## F-statistic: 194.8 on 6 and 385 DF, p-value: < 2.2e-16
Rather than running transformations on all the variables, I transformed three variables that I saw other analysts transform in similar write-ups. The square-root transformation of horsepower and the log of weight are both significant. However, this reduced model explains less of the variance than the full fit from part (c) (R-squared of 0.75 versus 0.82). Let's separate out the transformations, adding squared terms to the full model instead:
fit_Auto_sqr = lm(mpg ~ . - name + I(weight^2) + I(displacement^2) + I(horsepower^2) + I(year^2), data = Auto)
summary(fit_Auto_sqr)
##
## Call:
## lm(formula = mpg ~ . - name + I(weight^2) + I(displacement^2) +
## I(horsepower^2) + I(year^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.4816 -1.5384 0.0735 1.3671 12.0213
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.185e+02 6.966e+01 6.008 4.40e-09 ***
## cylinders 5.073e-01 3.191e-01 1.590 0.112692
## displacement -3.328e-02 2.045e-02 -1.627 0.104480
## horsepower -1.781e-01 3.953e-02 -4.506 8.81e-06 ***
## weight -1.114e-02 2.587e-03 -4.306 2.12e-05 ***
## acceleration -1.700e-01 9.652e-02 -1.762 0.078960 .
## year -1.019e+01 1.837e+00 -5.546 5.49e-08 ***
## originEuropean 1.323e+00 5.304e-01 2.494 0.013068 *
## originJapanese 1.258e+00 5.129e-01 2.452 0.014637 *
## I(weight^2) 1.182e-06 3.438e-07 3.439 0.000649 ***
## I(displacement^2) 5.839e-05 3.435e-05 1.700 0.089967 .
## I(horsepower^2) 4.388e-04 1.336e-04 3.284 0.001118 **
## I(year^2) 7.210e-02 1.207e-02 5.974 5.35e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 379 degrees of freedom
## Multiple R-squared: 0.8773, Adjusted R-squared: 0.8735
## F-statistic: 225.9 on 12 and 379 DF, p-value: < 2.2e-16
After reviewing work on RPubs where another analyst (lmorgan95)
determined that the best variables to square were weight,
displacement, horsepower, and
year, we see a substantial increase in R-squared (0.88 versus 0.82
for the untransformed fit) and strong significance for every squared term
other than displacement. Let's now try a log transformation, this time
applied to the response variable:
fit_Auto_log = lm(log(mpg) ~ . - name, data = Auto)
summary(fit_Auto_log)
##
## Call:
## lm(formula = log(mpg) ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40380 -0.06679 0.00493 0.06913 0.33036
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.712e+00 1.673e-01 10.230 < 2e-16 ***
## cylinders -2.781e-02 1.149e-02 -2.420 0.01598 *
## displacement 7.874e-04 2.738e-04 2.876 0.00425 **
## horsepower -1.520e-03 4.904e-04 -3.100 0.00208 **
## weight -2.639e-04 2.344e-05 -11.260 < 2e-16 ***
## acceleration -1.403e-03 3.513e-03 -0.399 0.68996
## year 3.055e-02 1.852e-03 16.491 < 2e-16 ***
## originEuropean 8.531e-02 2.026e-02 4.210 3.18e-05 ***
## originJapanese 8.145e-02 1.977e-02 4.119 4.66e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1183 on 383 degrees of freedom
## Multiple R-squared: 0.8815, Adjusted R-squared: 0.879
## F-statistic: 356.1 on 8 and 383 DF, p-value: < 2.2e-16
Taking the log of the response variable improves the R-squared slightly further, to about 0.88, and its residuals look reasonably symmetric around zero.
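To compare the untransformed and squared-term fits on a common footing, we can also look at AIC (a quick sketch; the log-mpg model is left out because its response is on a different scale, so its likelihood is not directly comparable):
AIC(fit_Auto, fit_Auto_sqr)  # lower AIC indicates a better trade-off of fit and complexity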
detach(Auto)
This question should be answered using the Carseats data set.
library(ISLR2)
attach(Carseats)
(a) Fit a multiple regression model to predict
Sales using Price, Urban, and
US.
fit <- lm(Sales ~ Price + Urban + US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b) Provide an interpretation of each coefficient in the
model. Be careful—some of the variables in the model are
qualitative!
From the table above, Price and US are
significant predictors of Sales. Note that Sales is recorded in thousands of
units sold: for every $1 increase in Price, expected sales fall by about 0.054
thousand units (roughly 54 car seats), holding the other predictors fixed, and
stores inside the US sell about 1.2 thousand (roughly 1,200) more units than
stores outside the US. The coefficient for UrbanYes is small and not
statistically significant, so whether a store is in an urban area has no
detectable effect on Sales.
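To make the US effect concrete, here is an illustrative prediction for an urban store inside and outside the US at a price of $120 (an arbitrary value chosen only for illustration); the difference between the two predictions is simply the USYes coefficient:
predict(fit, newdata = data.frame(Price = 120, Urban = "Yes", US = c("Yes", "No")))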
(c) Write out the model in equation form, being careful to
handle the qualitative variables properly.
\(Sales = 13.043469 - 0.054459Price -
0.021916Urban_{Yes} + 1.200573US_{Yes}\)
(d) For which of the predictors can you reject the null
hypothesis \(H_0 : \beta_j =
0?\)
We can reject the null hypothesis \(H_0 : \beta_j = 0\) for Price and US; we cannot reject it for Urban.
(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
fit <- lm(Sales ~ Price + US)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + US)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f) How well do the models in (a) and (e) fit the
data?
Not very well, as each model explains between 23% and 24% of the
variance in Sales.
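One way to put that in context (a small sketch): compare the residual standard error with the average level of Sales, which gives the typical prediction error as a fraction of mean sales.
sigma(fit) / mean(Sales)  # residual standard error relative to mean Sales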
(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h) Is there evidence of outliers or high leverage
observations in the model from (e)?
R has built-in functions that identify influential points using
various statistics with one simple command. Researchers have suggested
several cutoff levels for how much influence an observation may have before
it is flagged as an outlier or a high-leverage point.
For example, the average leverage is \(\frac{(p+1)}{n}\), which for us is \(\frac{(2+1)}{400} = 0.0075\); observations with leverage well above this value merit a closer look (see the check below).
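As a direct check on that cutoff (a sketch; two to three times the average leverage is a common rule of thumb), we can list the observations whose hat values sit well above it:
avg_lev <- (2 + 1) / nrow(Carseats)   # average leverage (p + 1)/n = 0.0075
which(hatvalues(fit) > 3 * avg_lev)   # observations with unusually high leverage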
par(mfrow=c(2,2))
plot(fit)
summary(influence.measures(fit))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
R flags a number of observations that exceed the rule-of-thumb cutoffs for one or more influence measures. A common practice is to report both a regression fit to all of the data and a regression with the flagged observations removed, and then compare the two.
outlying.obs <- c(26, 29, 43, 50, 51, 58, 69, 126, 160, 166, 172, 175, 210, 270, 298, 314, 353, 357, 368, 377, 384, 387, 396)
Carseats.small <- Carseats[-outlying.obs, ]
fit2 <- lm(Sales ~ Price + US, data = Carseats.small)
summary(fit2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.small)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.263 -1.605 -0.039 1.590 5.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.925232 0.665259 19.429 < 2e-16 ***
## Price -0.053973 0.005511 -9.794 < 2e-16 ***
## USYes 1.255018 0.248856 5.043 7.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared: 0.2387, Adjusted R-squared: 0.2347
## F-statistic: 58.64 on 2 and 374 DF, p-value: < 2.2e-16
With these potential outliers and influential observations removed, very little changes relative to the linear model fit to the full data set. The confidence intervals for the coefficients from the full-data model contain the coefficient estimates from the model with the outliers removed, so it is safe to include all of the data points in our model.
detach(Carseats)
This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear
regression of Y onto X without an intercept is given by (3.38). Under
what circumstance is the coefficient estimate for the regression of X
onto Y the same as the coefficient estimate for the regression of Y onto
X?
From (3.38), the coefficient for regressing \(Y\) onto \(X\) without an
intercept is \(\hat{\beta} = \sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x^2_i\),
while regressing \(X\) onto \(Y\) gives
\(\sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} y^2_i\). The numerators are
identical, so the two estimates are equal exactly when the denominators are
equal, that is, when the sums of squares of \(x\) and \(y\) are the same:
\[\sum_{i=1}^{n} x^2_i=\sum_{i=1}^{n} y^2_i\]
(b) Generate an example in R with n = 100 observations in
which the coefficient estimate for the regression of X onto Y is
different from the coefficient estimate for the regression of Y onto
X.
First I’ll set my seed to 1 to make validation easier.
set.seed(1)
x <- rnorm(100)
y <- 2*x + rnorm(100, sd = .5)
data <- data.frame(x, y)
Let's make sure the sums of squares are not the same:
sum(x^2)
## [1] 81.05509
sum(y^2)
## [1] 345.9723
Now let's verify by running both regressions and comparing the coefficients:
lm_y <- lm(y ~ x + 0)
lm_x <- lm(x ~ y + 0)
summary(lm_y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.95768 -0.32358 -0.08853 0.25279 1.15545
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.99694 0.05324 37.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4793 on 99 degrees of freedom
## Multiple R-squared: 0.9343, Adjusted R-squared: 0.9336
## F-statistic: 1407 on 1 and 99 DF, p-value: < 2.2e-16
summary(lm_x)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.50931 -0.10863 0.05499 0.14436 0.44044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.46785 0.01247 37.51 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.232 on 99 degrees of freedom
## Multiple R-squared: 0.9343, Adjusted R-squared: 0.9336
## F-statistic: 1407 on 1 and 99 DF, p-value: < 2.2e-16
The coefficients are different (1.99694 versus 0.46785), as expected since the sums of squares differ.
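As a quick sanity check on why they differ: both no-intercept slopes share the numerator \(\sum x_i y_i\), so their product equals the common R-squared of 0.9343 reported above rather than 1.
coef(lm_y) * coef(lm_x)  # equals the shared R-squared, not 1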
(c) Generate an example in R with n = 100 observations in
which the coefficient estimate for the regression of X onto Y is the
same as the coefficient estimate for the regression of Y onto
X.
Let's take the values 1 to 100 for x, and the same values in reverse order,
100 down to 1, for y. That way the sums of squares will be equal.
x <- 1:100
y <- 100:1
sum(x^2)
## [1] 338350
sum(y^2)
## [1] 338350
Now let's fit the models and confirm that the coefficients are the same.
lm_y2 <- lm(y ~ x + 0)
lm_x2 <- lm(x ~ y + 0)
summary(lm_y2)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
That was the regression of y onto x; now let's compare it with the regression of x onto y:
summary(lm_x2)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08