A KNN classifier assigns a new observation to a class by majority vote among its k nearest training points, while KNN regression predicts a quantitative response for a new observation by averaging the responses of its k nearest training points. The methodology for choosing the neighbors is identical in both cases; the difference is what is done with them: a vote over class labels for classification versus an average of numeric responses for regression.
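A minimal sketch of the distinction on toy data (this assumes the class and FNN packages, which are not otherwise used in this write-up):
set.seed(1)
train.X <- matrix(rnorm(40), ncol = 2)          # 20 training points, 2 features
labels  <- factor(rep(c("A", "B"), each = 10))  # qualitative response for classification
y       <- rowSums(train.X) + rnorm(20)         # quantitative response for regression
new.X   <- matrix(rnorm(4), ncol = 2)           # 2 new points to predict
class::knn(train.X, new.X, labels, k = 3)    # classification: majority vote of the 3 nearest labels
FNN::knn.reg(train.X, new.X, y, k = 3)$pred  # regression: mean response of the 3 nearest points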
library(ISLR)  # the Auto data frame ships with the ISLR package
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : num 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : num 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : num 1 1 1 1 1 1 1 1 1 1 ...
## $ name : Factor w/ 304 levels "amc ambassador brougham",..: 49 36 231 14 161 141 54 223 241 2 ...
pairs(Auto)
### Compute the matrix of correlations between the variables using the cor() function. The name variable must be excluded, since it is qualitative and cor() only accepts numeric input.
cor(Auto[, -9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
mpg.lm.fit = lm(mpg ~ . - name, data = Auto)
summary(mpg.lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The model’s F-statistic has a p-value below .001, indicating that the model is significant overall: there is a relationship between at least one of the predictors and the response.
The statistically significant predictors, assuming a .05 cutoff, are displacement (p = .008), weight (p < .001), year (p < .001), and origin (p < .001). Each of these has a significant relationship with mpg after controlling for the other predictors.
All else held equal, a one-unit increase in year (e.g., a car made in 1976 rather than 1975) is associated with an increase in average mpg of 0.750773.
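A quick way to sanity-check this interpretation with predict() (the twins data frame below is introduced just for illustration):
twins <- Auto[c(1, 1), ]          # two copies of the first car record
twins$year <- c(75, 76)           # identical except for model year
diff(predict(mpg.lm.fit, twins))  # difference equals the year coefficient, ~0.7508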
par(mfrow = c(2,2))
plot(mpg.lm.fit)
cooksd = order(cooks.distance(mpg.lm.fit), decreasing = T)  # row indices, largest Cook's distance first
head(cooksd)
## [1] 14 389 325 382 324 243
The Residuals vs Fitted and Scale-Location plots can be used to assess the linearity and constant-variance assumptions of the regression. The Scale-Location plot uses standardized residuals, which are easier to read because the effect of scale is removed, reducing the visual influence of outliers. The residuals are not evenly spread: they fan out toward the right, implying unequal variance (heteroscedasticity), and instead of scattering evenly around the line they show a small dip, suggesting a mildly nonlinear pattern. Both issues look somewhat milder on the standardized-residual plot.
The Q-Q plot, which checks for normality by plotting standardized residuals against theoretical normal quantiles, shows a decent trend until the upper tail, where a few outliers cause the distribution to skew.
The Residuals vs Leverage plot highlights potentially influential points, labeling those at the extremes. We can list them explicitly by ordering the Cook's distances, which shows the most influential observations are 14, 389, 325, 382, and 324.
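As a check beyond eyeballing the plot, a common (though not universal) heuristic flags points whose Cook's distance exceeds 4/n:
which(cooks.distance(mpg.lm.fit) > 4 / nrow(Auto))  # observations over the 4/n cutoff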
# All main effects plus every pairwise interaction, excluding name (equivalently: mpg ~ (. - name)^2)
mpg.lm.fit2 <- lm(mpg ~ . * . - name * . + . - name, data = Auto)
summary(mpg.lm.fit2)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
The interactions that appear to be significant at a .05 cutoff are displacement:year (p = .014), acceleration:year (p = .030), and acceleration:origin (p = .004).
Looking back at the main-effects model, cylinders, horsepower, and acceleration were not significant, so including some transformations of these variables might help.
mpg.lm.fit3 = lm(mpg ~ . - name + log(acceleration) + log(cylinders) + log(horsepower), data = Auto)
summary(mpg.lm.fit3)
##
## Call:
## lm(formula = mpg ~ . - name + log(acceleration) + log(cylinders) +
## log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.1801 -1.6733 -0.0964 1.5354 12.0891
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.019e+02 1.484e+01 6.868 2.66e-11 ***
## cylinders 8.702e-01 1.205e+00 0.722 0.470607
## displacement -6.439e-03 7.352e-03 -0.876 0.381642
## horsepower 1.545e-01 2.632e-02 5.870 9.47e-09 ***
## weight -3.226e-03 6.732e-04 -4.792 2.37e-06 ***
## acceleration 3.557e-01 4.791e-01 0.742 0.458285
## year 7.438e-01 4.532e-02 16.412 < 2e-16 ***
## origin 8.724e-01 2.543e-01 3.431 0.000667 ***
## log(acceleration) -1.091e+01 7.667e+00 -1.423 0.155637
## log(cylinders) -4.894e+00 6.499e+00 -0.753 0.451914
## log(horsepower) -2.487e+01 2.912e+00 -8.542 3.19e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.956 on 381 degrees of freedom
## Multiple R-squared: 0.8602, Adjusted R-squared: 0.8565
## F-statistic: 234.4 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit3)
Horsepower appears to have benefited from the log transformation, becoming strongly significant. The model also returns a slightly better Residuals vs Fitted plot, with the points more evenly spread and showing less heteroscedasticity and nonlinearity than the first model. The Q-Q plot still has issues with a few upper-quantile points, but the Cook's distances also appear to have a smaller effect on the fit.
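As a rough numeric cross-check of the two fits, their AIC values can be compared (assuming AIC is an acceptable yardstick here; both models share the same response and data):
AIC(mpg.lm.fit, mpg.lm.fit3)  # lower AIC indicates a better fit/complexity trade-off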
mpg.lm.fit4 = lm(mpg ~ . - name + sqrt(acceleration) + sqrt(cylinders) + sqrt(horsepower), data = Auto)
summary(mpg.lm.fit4)
##
## Call:
## lm(formula = mpg ~ . - name + sqrt(acceleration) + sqrt(cylinders) +
## sqrt(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3747 -1.6637 -0.0922 1.5159 12.0592
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.833e+01 1.858e+01 4.217 3.09e-05 ***
## cylinders 2.870e+00 2.343e+00 1.225 0.221392
## displacement -7.473e-03 7.361e-03 -1.015 0.310619
## horsepower 3.750e-01 5.102e-02 7.350 1.22e-12 ***
## weight -3.158e-03 6.743e-04 -4.684 3.91e-06 ***
## acceleration 1.230e+00 9.832e-01 1.251 0.211720
## year 7.413e-01 4.526e-02 16.380 < 2e-16 ***
## origin 8.752e-01 2.536e-01 3.451 0.000621 ***
## sqrt(acceleration) -1.250e+01 7.941e+00 -1.574 0.116281
## sqrt(cylinders) -1.314e+01 1.104e+01 -1.191 0.234369
## sqrt(horsepower) -9.565e+00 1.131e+00 -8.453 6.10e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.952 on 381 degrees of freedom
## Multiple R-squared: 0.8606, Adjusted R-squared: 0.857
## F-statistic: 235.2 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit4)
The square root of horsepower is also significant, although the diagnostic plots do not appear to have changed much from the log-transformation model.
# Squared terms added via I(); I(acceleration^2) would be an equivalent spelling
mpg.lm.fit5 = lm(mpg ~ . - name + I(acceleration * acceleration) + I(cylinders * cylinders) + I(horsepower * horsepower), data = Auto)
summary(mpg.lm.fit5)
##
## Call:
## lm(formula = mpg ~ . - name + I(acceleration * acceleration) +
## I(cylinders * cylinders) + I(horsepower * horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.8488 -1.6976 -0.0639 1.4410 12.1216
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.2501162 6.4010975 2.695 0.00735 **
## cylinders -2.5830432 1.2080209 -2.138 0.03313 *
## displacement -0.0090379 0.0074258 -1.217 0.22432
## horsepower -0.2975187 0.0344484 -8.637 < 2e-16 ***
## weight -0.0030964 0.0006800 -4.554 7.11e-06 ***
## acceleration -1.4761632 0.5457876 -2.705 0.00714 **
## year 0.7355311 0.0453526 16.218 < 2e-16 ***
## origin 0.9131603 0.2531125 3.608 0.00035 ***
## I(acceleration * acceleration) 0.0351887 0.0159215 2.210 0.02769 *
## I(cylinders * cylinders) 0.2534219 0.1003346 2.526 0.01195 *
## I(horsepower * horsepower) 0.0008809 0.0001108 7.951 2.13e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.957 on 381 degrees of freedom
## Multiple R-squared: 0.8601, Adjusted R-squared: 0.8564
## F-statistic: 234.2 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(mpg.lm.fit5)
Lastly, with the squared terms, most of the model turns out to be significant at a .05 cutoff; the only variable not significant is displacement. The diagnostic plots also look better, with the residuals spread more evenly and showing a more linear pattern.
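For a side-by-side look at all four fits, the adjusted R-squared values can be pulled out directly (a quick sketch using the models fit above):
sapply(list(base = mpg.lm.fit, log = mpg.lm.fit3, sqrt = mpg.lm.fit4, squared = mpg.lm.fit5),
       function(m) summary(m)$adj.r.squared)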
data(Carseats)  # the Carseats data frame also ships with the ISLR package
attach(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
sales.lm.fita = lm(Sales ~ Price + US + Urban, Carseats)
summary(sales.lm.fita)
##
## Call:
## lm(formula = Sales ~ Price + US + Urban, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The model is significant overall, with a p-value below .001. Price and USYes (an indicator that the store is located in the US) are both significant, with p-values below .001. The Price coefficient means that, holding the other variables fixed, each one-dollar increase in price is associated with a decrease in sales of 0.054459 thousand units (about 54 car seats). If the store is located in the United States, expected sales increase by 1.200573 thousand units. UrbanYes has a p-value of .936, well above the .05 cutoff, so its relationship with the response Sales is not significant; its estimated coefficient of -0.021916 thousand units should not be over-read.
In equation form, the full model is:
Sales = 13.043469 - 0.054459 * Price + 1.200573 * I(US = Yes) - 0.021916 * I(Urban = Yes)
If the store isn't in the US, the US indicator is zero and the model simplifies to:
Sales = 13.043469 - 0.054459 * Price - 0.021916 * I(Urban = Yes)
If the store isn't in an urban area, the Urban indicator is zero and the model simplifies to:
Sales = 13.043469 - 0.054459 * Price + 1.200573 * I(US = Yes)
Lastly, if the store is neither in the US nor in an urban area, the formula reduces to the intercept and the Price term:
Sales = 13.043469 - 0.054459 * Price
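A small sketch of the four cases using predict() (the Price value of 100 is an arbitrary illustration, and grid is a name introduced here):
grid <- expand.grid(Price = 100, US = c("No", "Yes"), Urban = c("No", "Yes"))
cbind(grid, Sales.hat = predict(sales.lm.fita, grid))  # predicted Sales, in thousands of units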
In this model, we can reject the null hypothesis H0: βj = 0 for both Price and US, since their coefficients are significant.
sales.lm.fite = lm(Sales ~ Price + US, Carseats)
summary(sales.lm.fite)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models a and e are significant overall and share the exact same R-squared of .2393, but the adjusted R-squared shows model e slightly ahead (.2354 vs .2335). This is because adjusted R-squared penalizes the number of terms in the model, and model e has fewer terms. The improvement over model a is very small, as expected: no new variables were added, only an unhelpful one was removed.
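Since model e is nested inside model a (it drops only Urban), an F-test makes the same comparison formally; for a single dropped term its p-value matches the t-test above:
anova(sales.lm.fite, sales.lm.fita)  # p-value ~ .936, the same as UrbanYes's t-test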
confint(sales.lm.fite)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow = c(2,2))
plot(sales.lm.fite)
The diagnostic plots return very clean results. The residuals in the Residuals vs Fitted plot are randomly scattered, making it difficult to spot outliers visually. We can instead look at Cook's distance to find the influential points.
cooksdistance = cooks.distance(sales.lm.fite)
cooksdordered = order(cooksdistance, decreasing = T)
head(cooksdordered)
## [1] 26 368 50 317 166 377
cooksdistance[c(26,368,50,317,166,377)]
## 26 368 50 317 166 377
## 0.02610946 0.02428736 0.02283546 0.02047046 0.01975504 0.01828219
The preceding points are the most influential by Cook's distance; the row indices are listed on top, with each distance beneath.
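To put these values on a scale, the same 4/n heuristic used earlier can be applied (again a rule of thumb, not a hard cutoff):
sum(cooksdistance > 4 / nrow(Carseats))  # how many points exceed the rule-of-thumb threshold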
For the regression of Y onto X without an intercept, the coefficient estimate is β̂ = (Σ xᵢyᵢ) / (Σ xᵢ²). For the regression of X onto Y, the estimate is β̂′ = (Σ xᵢyᵢ) / (Σ yᵢ²).
The two estimates share the same numerator, so setting them equal shows they coincide exactly when the denominators match.
That is, Σ xᵢ² = Σ yᵢ² is the condition under which the regressions of X onto Y and Y onto X have the same coefficient estimate.
x = rnorm(100)          # note: no seed is set, so exact values vary from run to run
y = 2 * x + rnorm(100)  # y is approximately 2x plus standard normal noise
sum(x^2)
## [1] 73.56352
sum(y^2)
## [1] 424.2937
lmYX = lm(y~x+0)
lmXY = lm(x~y+0)
summary(lmYX)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.36794 -0.68238 -0.07716 0.69265 2.40594
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0786 0.1209 17.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.037 on 99 degrees of freedom
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7465
## F-statistic: 295.5 on 1 and 99 DF, p-value: < 2.2e-16
summary(lmXY)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.9457 -0.3165 -0.0449 0.2829 1.0358
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.36038 0.02096 17.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4318 on 99 degrees of freedom
## Multiple R-squared: 0.7491, Adjusted R-squared: 0.7465
## F-statistic: 295.5 on 1 and 99 DF, p-value: < 2.2e-16
The sums of x^2 and y^2 are not equal, in line with the reasoning in 12a, and accordingly the two coefficient estimates differ (2.0786 for y onto x vs 0.36038 for x onto y).
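As a quick check, the closed-form expressions from 12a reproduce the fitted coefficients directly:
sum(x * y) / sum(x^2)  # matches coef(lmYX), ~2.0786
sum(x * y) / sum(y^2)  # matches coef(lmXY), ~0.36038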
x = rnorm(100)
y = abs(x)  # same magnitudes as x, so sum(y^2) is exactly sum(x^2)
sum(x^2)
## [1] 91.0051
sum(y^2)
## [1] 91.0051
lmYX = lm(y~x+0)
lmXY = lm(x~y+0)
summary(lmYX)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.01504 0.24488 0.52209 1.02569 2.56584
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.09838 0.10002 0.984 0.328
##
## Residual standard error: 0.9541 on 99 degrees of freedom
## Multiple R-squared: 0.009679, Adjusted R-squared: -0.0003241
## F-statistic: 0.9676 on 1 and 99 DF, p-value: 0.3277
summary(lmXY)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.56584 -0.49015 0.03824 0.53775 2.34748
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.09838 0.10002 0.984 0.328
##
## Residual standard error: 0.9541 on 99 degrees of freedom
## Multiple R-squared: 0.009679, Adjusted R-squared: -0.0003241
## F-statistic: 0.9676 on 1 and 99 DF, p-value: 0.3277
Using the absolute value of x for y changes the signs of the data but not the squared values, so the sums of x^2 and y^2 come out identical. By the result in 12a, the coefficient estimates for the regressions of X onto Y and Y onto X are therefore the same, and indeed both come out to 0.09838.
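Another construction with the same property is a random permutation of x, since reordering the values leaves the sum of squares unchanged (a minimal sketch; y2 is a name introduced here):
y2 <- sample(x)  # same values in a different order, so sum(y2^2) equals sum(x^2)
c(coef(lm(y2 ~ x + 0)), coef(lm(x ~ y2 + 0)))  # the two estimates agree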