As the name suggests, the KNN classifier is typically used to solve classification problems (qualitative response), while KNN regression is used to solve regression problems (quantitative response).
Both methods use information about the neighborhood of data points to draw conclusions. However, each method has unique goals.
With the KNN classifier, we want to determine which class a new data point belongs to, so we assign it the most frequent class among the points in its neighborhood.
With KNN regression, on the other hand, the goal is to predict a quantitative value, so we assign the new point the average of the responses of the points in its neighborhood.
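Here is a minimal sketch of the two ideas (the toy data and k = 3 below are made up purely for illustration; base R only, no KNN package assumed):
# Toy training data: one predictor, a class label, and a numeric response
train.x <- c(1, 2, 3, 10, 11, 12)
train.class <- c("A", "A", "A", "B", "B", "B")
train.y <- c(1.1, 1.9, 3.2, 9.8, 11.2, 12.1)
knn.toy <- function(x0, k = 3) {
  nn <- order(abs(train.x - x0))[1:k]  # indices of the k nearest neighbours
  list(class = names(which.max(table(train.class[nn]))),  # majority vote: KNN classifier
       value = mean(train.y[nn]))                         # neighbourhood mean: KNN regression
}
knn.toy(2.5)  # classified as "A"; predicted value is the average of its 3 nearest responses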
Load the ISLR library to use the data
library(ISLR)
Load the Auto dataset from the ISLR library. No need to use the CSV file
data(Auto)
Get the summary of Auto from the ISLR library itself; no need to read the CSV file.
summary(Auto)
## mpg cylinders displacement horsepower weight
## Min. : 9.00 Min. :3.000 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.00 1st Qu.:4.000 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2225
## Median :22.75 Median :4.000 Median :151.0 Median : 93.5 Median :2804
## Mean :23.45 Mean :5.472 Mean :194.4 Mean :104.5 Mean :2978
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:275.8 3rd Qu.:126.0 3rd Qu.:3615
## Max. :46.60 Max. :8.000 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration year origin name
## Min. : 8.00 Min. :70.00 Min. :1.000 amc matador : 5
## 1st Qu.:13.78 1st Qu.:73.00 1st Qu.:1.000 ford pinto : 5
## Median :15.50 Median :76.00 Median :1.000 toyota corolla : 5
## Mean :15.54 Mean :75.98 Mean :1.577 amc gremlin : 4
## 3rd Qu.:17.02 3rd Qu.:79.00 3rd Qu.:2.000 amc hornet : 4
## Max. :24.80 Max. :82.00 Max. :3.000 chevrolet chevette: 4
## (Other) :365
Quick read
Observations: 392
Columns: 9
pairs(Auto)
Use the cor() function after removing the name variable (the 9th column) from the Auto dataset.
cor(Auto[,-9])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
Use the lm() function to fit the multiple linear regression model after removing the name variable. The fitted model is stored in Auto.lm.fit.
Auto.lm.fit <- lm(mpg~.-name,data=Auto)
Print the summary
summary(Auto.lm.fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Based on the F-statistic of 252.4 and its p-value (assuming a 0.05 cutoff) from the multiple linear regression fit above, we can conclude that there is a relationship between the predictors and the response (mpg).
The p-values of displacement, weight, year and origin are below 0.05 (assuming a 95% confidence level cutoff), and R flags the same predictors with asterisks, so we conclude that displacement, weight, year and origin have a statistically significant relationship with the response variable mpg.
The coefficient of the year variable (0.750773) suggests the following:
Holding all other predictors constant, a one-unit increase in year (i.e. each additional model year) increases an auto's mpg by about 0.75.
In other words, cars become more fuel efficient each year.
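To illustrate, here is a minimal sketch that simply re-predicts an existing observation with its year advanced by one; the difference equals the year coefficient:
nd <- Auto[c(1, 1), ]                      # duplicate the first observation
nd$year[2] <- nd$year[2] + 1               # advance its model year by one
diff(predict(Auto.lm.fit, newdata = nd))   # roughly +0.75 mpg, the year coefficient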
par(mfrow = c(2,2))
plot(Auto.lm.fit)
The Residuals vs Fitted plot suggests the fit is slightly curved (mild non-linearity) rather than exactly linear; the standardized residuals vs fitted values plot, however, looks considerably more linear.
The Residuals vs Leverage plot (with Cook's distance contours) shows one high-leverage point, observation 14.
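As a quick numeric check (a minimal sketch), we can locate the largest leverage value directly; it should correspond to the observation flagged in the Residuals vs Leverage plot:
which.max(hatvalues(Auto.lm.fit))   # index of the highest-leverage observation
mean(hatvalues(Auto.lm.fit))        # average leverage, (p + 1)/n, the usual benchmark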
Observations 323, 326, 327 and 394 are the most clearly visible outliers in these plots, though the charts suggest a few more.
Next, we fit a model that includes all pairwise interaction terms among the predictors (excluding name):
Auto.lm.fit.IT <- lm(mpg~.*.-name*.+.-name,data=Auto)
summary(Auto.lm.fit.IT)
##
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
From the model with all interaction terms above, the following interaction terms are significant at the 0.05 p-value cutoff (95% confidence level); a refit using only these terms is sketched after the lists below:
acceleration:origin
acceleration:year
displacement:year
If we relax the p-value cutoff to 0.1, the following interaction terms are also significant:
cylinders:acceleration
cylinders:year
displacement:weight
horsepower:acceleration
year:origin
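As noted above, a natural follow-up (a sketch, not part of the original fits) is to refit with only the interactions significant at the 0.05 level and compare the adjusted R-squared with the main-effects model:
Auto.lm.fit.IT2 <- lm(mpg ~ . - name + acceleration:origin + acceleration:year + displacement:year, data = Auto)
summary(Auto.lm.fit.IT2)$adj.r.squared   # compare with 0.8182 from the main-effects fit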
The scatter plots show an inverse relationship between horsepower and mpg, yet the multiple linear regression fit above did not flag horsepower as significant. Let's dig a little deeper into this using transformations.
par(mfrow = c(2,2))
plot(Auto$horsepower,Auto$mpg)
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)
The scatter plots show that mpg decreases as horsepower increases, and the log transformation of horsepower gives a more linear relationship than the other transformations.
Let's fit models with horsepower alone and with log-transformed horsepower alone, check whether each is significant, and look at the diagnostic plots.
Auto.lm.fit.hp <- lm(mpg~horsepower,data=Auto)
summary(Auto.lm.fit.hp)
##
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.5710 -3.2592 -0.3435 2.7630 16.9240
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 39.935861 0.717499 55.66 <2e-16 ***
## horsepower -0.157845 0.006446 -24.49 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared: 0.6059, Adjusted R-squared: 0.6049
## F-statistic: 599.7 on 1 and 390 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(Auto.lm.fit.hp)
With horsepower as the only predictor, horsepower is significant; however, the fit appears less linear.
Let's do the same with log(horsepower).
Auto.lm.fit.loghp <- lm(mpg~log(horsepower),data=Auto)
summary(Auto.lm.fit.loghp)
##
## Call:
## lm(formula = mpg ~ log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.2299 -2.7818 -0.2322 2.6661 15.4695
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 108.6997 3.0496 35.64 <2e-16 ***
## log(horsepower) -18.5822 0.6629 -28.03 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.501 on 390 degrees of freedom
## Multiple R-squared: 0.6683, Adjusted R-squared: 0.6675
## F-statistic: 785.9 on 1 and 390 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(Auto.lm.fit.loghp)
The log(horsepower) transformation appears to give a more linear relationship.
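As an additional check (a minimal sketch using the two single-predictor fits above), an information-criterion comparison points the same way:
AIC(Auto.lm.fit.hp, Auto.lm.fit.loghp)   # the log(horsepower) model should show the lower AIC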
Load the Carseats dataset from the ISLR library
data(Carseats)
?Carseats
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
Observations: 400
Columns: 11
Clearly qualitative variables: ShelveLoc, Urban and US
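For reference, lm() encodes these qualitative variables as dummy variables automatically; a minimal sketch of how to inspect that coding:
contrasts(Carseats$ShelveLoc)   # dummy columns Good and Medium, with Bad as the baseline
contrasts(Carseats$US)          # a single dummy for Yes (reported as USYes in model output)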
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
Carseats.lm.fit <- lm(Sales~Price+Urban+US,data=Carseats)
summary(Carseats.lm.fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Three predictors are used in this model, giving three coefficients in addition to the intercept.
Price is a quantitative variable, whereas Urban and US are qualitative variables.
The overall F-statistic and p-value indicate that the model is significant.
Price and US are significant, so we can reject the null hypothesis for them.
Price: holding all other predictors fixed, each one-unit increase in Price decreases the average Sales of car seats at that location by about 0.0545 units (sales are in thousands).
US: holding all other predictors fixed, a US store sells on average about 1.2 units (in thousands) more car seats than a non-US store.
Urban: we fail to reject the null hypothesis for Urban, so there is no evidence that Urban affects Sales; it can be dropped from the final model.
In general, with Urban coded as 1 for Yes and 0 for No, and US coded the same way:
Sales = 13.043469 + (-0.054459) * Price + (-0.021916) * Urban + (1.200573) * US
This can be explained as follows:
The predictors Price and US have significant p-values, so we can reject the null hypothesis for Price and US.
The p-value of the predictor Urban is not significant.
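To see the fitted equation in action, here is a minimal sketch with made-up predictor values (Price = 120, an urban US store):
predict(Carseats.lm.fit, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))
# equivalently: 13.043469 - 0.054459 * 120 - 0.021916 * 1 + 1.200573 * 1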
From the previous fit, we know the Urban predictor is not significant and shows no evidence of association with the outcome, so let's fit a smaller model with only the Price and US predictors.
Carseats.lm.fit.PU <- lm(Sales~Price+US,data=Carseats)
summary(Carseats.lm.fit.PU)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
R-squared for model(a) is 0.2393, adjusted R-squared is 0.2335 and the RSE is 2.472.
R-squared for model(e) is 0.2393, adjusted R-squared is 0.2354 and the RSE is 2.469.
The adjusted R-squared for model(e) is slightly better and the RSE (residual standard error) is also slightly better; in addition, both the overall fit and all individual predictors are significant in model(e).
Even though both models are significant, the model explains only about 23.93% of the variability overall, meaning neither model fits the data particularly well.
However, model(e) is marginally better than model(a).
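Because model(e) is nested within model(a), the two can also be compared with a formal F-test (a minimal sketch using the fits above):
anova(Carseats.lm.fit.PU, Carseats.lm.fit)   # tests whether dropping Urban loses any explanatory power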
confint(Carseats.lm.fit.PU)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
To check this, let's plot the residual diagnostics for model(e).
par(mfrow = c(2,2))
plot(Carseats.lm.fit.PU)
Outliers:
The Residuals vs Fitted plot shows a roughly linear pattern overall, but a few observations sit far from the rest, indicating potential outliers; the standardized-residual plots likewise show a few observations beyond ±2.
However, to decide whether any real outliers exist, let's plot the studentized residuals and color the observations whose studentized residuals exceed 3 in absolute value.
plot(predict(Carseats.lm.fit.PU), rstudent(Carseats.lm.fit.PU),
     col = ifelse(abs(rstudent(Carseats.lm.fit.PU)) > 3, "red", "black"))
The plot highlights no observations whose studentized residuals exceed 3 in absolute value. Hence, we conclude there are no outliers that would unduly influence the model, and we can safely keep all observations.
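We can confirm this numerically (a minimal sketch):
max(abs(rstudent(Carseats.lm.fit.PU)))   # should be below 3 under this rule of thumb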
Leverage Points:
The Residuals vs Leverage plot suggests a few high-leverage points. To check this, let's color the observations whose leverage statistic exceeds (p+1)/n, where p = 2 and n = 400.
plot(hatvalues(Carseats.lm.fit.PU),col=ifelse(hatvalues(Carseats.lm.fit.PU)>(2+1)/dim(Carseats)[1],"red","black"))
The plot shows that a few observations have a considerably high leverage statistic, confirming that model(e) contains high-leverage points.
Let's print the potentially influential observations.
summary(influence.measures(Carseats.lm.fit.PU))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
This flags a few potentially influential observations. Let's refit the model after removing them.
Carseats.Outliers<-c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.subset<-Carseats[-Carseats.Outliers,]
Carseats.lm.fit.PU.subset<-lm(Sales~Price+US,data=Carseats.subset)
summary(Carseats.lm.fit.PU.subset)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.subset)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.263 -1.605 -0.039 1.590 5.428
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.925232 0.665259 19.429 < 2e-16 ***
## Price -0.053973 0.005511 -9.794 < 2e-16 ***
## USYes 1.255018 0.248856 5.043 7.15e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared: 0.2387, Adjusted R-squared: 0.2347
## F-statistic: 58.64 on 2 and 374 DF, p-value: < 2.2e-16
The new model (model(h)) has an R-squared of 0.2387, which is in fact not an improvement over model(e)'s R-squared of 0.2393.
Based on this analysis, we can safely keep all the observations in the dataset.
The coefficient estimate for the regression of Y onto X is
\(\hat{\beta}\) = \(\displaystyle \frac {\Sigma_{i=1}^{n} x_{i} y_{i}} {\Sigma_{i'=1}^{n} x_{i'}^{2}}\)
The coefficient estimate for the regression of X onto Y is
\(\hat{\beta}\) = \(\displaystyle \frac {\Sigma_{i=1}^{n} x_{i} y_{i}} {\Sigma_{i'=1}^{n} y_{i'}^{2}}\)
The coefficient estimates from the two regressions will be the same only when the denominators are equal, i.e. when:
\(\Sigma_{i'=1}^{n} x_{i'}^{2}\) = \(\Sigma_{i'=1}^{n} y_{i'}^{2}\)
To construct an example where the estimates differ, we need to ensure \(\Sigma_{i'=1}^{n} x_{i'}^{2}\) \(\neq\) \(\Sigma_{i'=1}^{n} y_{i'}^{2}\).
Let's pick data of that sort: set a random seed, let X be the numbers 1 to 100, and let Y = 2X + rnorm(100).
set.seed(25)
x <- 1:100
y <- 2 * x + rnorm(100)
head(data.frame(x,y,x^2,y^2))
## x y x.2 y.2
## 1 1 1.788166 1 3.197539
## 2 2 2.958409 4 8.752183
## 3 3 4.846692 9 23.490428
## 4 4 8.321531 16 69.247886
## 5 5 8.499870 25 72.247792
## 6 6 11.554467 36 133.505702
Since y is generated by a different function of x (one that does not preserve the squared values), sum(x^2) and sum(y^2) should be different. Let's print them and confirm.
sum(x^2)
## [1] 338350
sum(y^2)
## [1] 1350394
Let’s fit two linear regressions for Y onto X and X onto Y.
Y.lm.fit <- lm(y ~ x + 0)
summary(Y.lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.23461 -0.94314 -0.05444 0.45612 2.42291
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.997702 0.001743 1146 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.014 on 99 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.314e+06 on 1 and 99 DF, p-value: < 2.2e-16
X.lm.fit <- lm(x ~ y + 0)
summary(X.lm.fit)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.21095 -0.22517 0.03132 0.47784 1.12212
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5005374 0.0004367 1146 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5075 on 99 degrees of freedom
## Multiple R-squared: 0.9999, Adjusted R-squared: 0.9999
## F-statistic: 1.314e+06 on 1 and 99 DF, p-value: < 2.2e-16
From the two model fits above, notice that the coefficient estimates (1.997702 and 0.5005374) are different.
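The same estimates follow directly from the closed-form expressions above (a minimal sketch):
sum(x * y) / sum(x^2)   # slope of y ~ x + 0; should match 1.997702
sum(x * y) / sum(y^2)   # slope of x ~ y + 0; should match 0.5005374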
To make the two estimates equal, we need to ensure \(\Sigma_{i'=1}^{n} x_{i'}^{2}\) = \(\Sigma_{i'=1}^{n} y_{i'}^{2}\).
This can be achieved with choices such as y = x, y = abs(x) or y = -x, since each makes the squared values of x and y identical.
set.seed(14)
x=rnorm(100)
y=abs(x)
head(data.frame(x,y,x^2,y^2))
## x y x.2 y.2
## 1 -0.66184983 0.66184983 0.438045195 0.438045195
## 2 1.71895416 1.71895416 2.954803394 2.954803394
## 3 2.12166699 2.12166699 4.501470822 4.501470822
## 4 1.49715368 1.49715368 2.241469154 2.241469154
## 5 -0.03614058 0.03614058 0.001306141 0.001306141
## 6 1.23194518 1.23194518 1.517688918 1.517688918
Since x^2 and y^2 are identical for every observation, sum(x^2) and sum(y^2) are obviously equal.
Let’s fit two linear regressions for Y onto X and X onto Y.
Y.lm.fit <- lm(y ~ x + 0)
summary(Y.lm.fit)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## 0.00667 0.28849 0.63300 1.02049 2.42147
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.13313 0.09961 1.337 0.184
##
## Residual standard error: 0.898 on 99 degrees of freedom
## Multiple R-squared: 0.01772, Adjusted R-squared: 0.007802
## F-statistic: 1.786 on 1 and 99 DF, p-value: 0.1844
X.lm.fit <- lm(x ~ y + 0)
summary(X.lm.fit)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42147 -0.70064 -0.01548 0.54449 1.83921
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.13313 0.09961 1.337 0.184
##
## Residual standard error: 0.898 on 99 degrees of freedom
## Multiple R-squared: 0.01772, Adjusted R-squared: 0.007802
## F-statistic: 1.786 on 1 and 99 DF, p-value: 0.1844
As shown above, both models have the same coefficient estimate (0.13313).