A) The KNN classifier and KNN regression methods are closely related. The KNN classifier is used when the response is qualitative (e.g., binary: yes or no; classes: high, medium, low), while KNN regression is used when the response is quantitative (e.g., income, salary, mpg).
Another difference is in how each method predicts the response. For a test observation x0, the KNN classifier identifies the K points in the training data closest to x0 (represented as N0), estimates the conditional probability of class j as the fraction of points in the neighborhood whose response equals j, and then classifies x0 to the class with the largest estimated probability (the Bayes rule applied to these estimates). This is represented as: \[ \Pr(Y=j \mid X=x_0)= \frac{1}{K}\sum_{i\in N_0}I(y_i=j) \] KNN regression selects the K neighboring points (represented as N0) the same way, but the estimate for the test observation is the average of the responses of the K points in the neighborhood: \[ \hat{f}(x_0)=\frac{1}{K}\sum_{i\in N_0}y_i \]
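To make the two prediction rules concrete, here is a minimal R sketch (not part of the original answer; knn_predict is a hypothetical helper, and X is assumed to be a numeric feature matrix):
knn_predict <- function(x0, X, y, K, type = c("classification", "regression")) {
  type <- match.arg(type)
  # Euclidean distances from x0 to every training point, then the K nearest (N0)
  d <- sqrt(rowSums(sweep(X, 2, x0)^2))
  N0 <- order(d)[1:K]
  if (type == "regression") {
    mean(y[N0])                     # average response over the neighborhood
  } else {
    names(which.max(table(y[N0])))  # class with the largest estimated probability
  }
}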
A)
my_Auto <- Auto
pairs(my_Auto)
A)
# get the column index of "name", since we don't want it in the correlation matrix
colnum <- -which(colnames(my_Auto) == "name")
# Correlation excluding "name"
cor(my_Auto[ ,colnum])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
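Since mpg is the response of interest, the predictors can also be ranked by the absolute value of their correlation with mpg (a small sketch, not part of the original output; mpg's correlation with itself is 1):
# rank variables by absolute correlation with mpg
sort(abs(cor(my_Auto[, colnum])[, "mpg"]), decreasing = TRUE)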
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
A)
# fit multiple linear model excluding "name"
# Convert "origin" as factor with labels as in the data description
my_Auto$origin <- as.factor(my_Auto$origin)
# labels are "American","European","Japanese"
lm_fit_x_name <- lm(mpg ~ .-name, data = my_Auto)
summary(lm_fit_x_name)
##
## Call:
## lm(formula = mpg ~ . - name, data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.0095 -2.0785 -0.0982 1.9856 13.3608
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.795e+01 4.677e+00 -3.839 0.000145 ***
## cylinders -4.897e-01 3.212e-01 -1.524 0.128215
## displacement 2.398e-02 7.653e-03 3.133 0.001863 **
## horsepower -1.818e-02 1.371e-02 -1.326 0.185488
## weight -6.710e-03 6.551e-04 -10.243 < 2e-16 ***
## acceleration 7.910e-02 9.822e-02 0.805 0.421101
## year 7.770e-01 5.178e-02 15.005 < 2e-16 ***
## origin2 2.630e+00 5.664e-01 4.643 4.72e-06 ***
## origin3 2.853e+00 5.527e-01 5.162 3.93e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.307 on 383 degrees of freedom
## Multiple R-squared: 0.8242, Adjusted R-squared: 0.8205
## F-statistic: 224.5 on 8 and 383 DF, p-value: < 2.2e-16
A) Before we comment on the output, we need to set up our hypotheses:
H0: There is no relationship between the predictors and the response (mpg), i.e., all β's are zero.
Ha: At least one predictor has a relationship with the response (mpg), i.e., at least one β is non-zero.
The F-test answers this. Assuming a significance level (alpha) of 0.05, the p-value for the F-test is less than alpha, so we reject H0, indicating that there is a relationship between the predictors and the response.
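As a check, the F-statistic and its p-value can be pulled directly from the fitted model (a small sketch, not part of the original):
# recompute the overall F-test p-value from the summary object
fstat <- summary(lm_fit_x_name)$fstatistic
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)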
A) To answer this, we look at the t-test results in the output. Assuming a significance level (alpha) of 0.05, we set up hypotheses for each predictor, for example:
H0: Predictor "displacement" has no relationship with the response (mpg), i.e., βdisplacement = 0.
Ha: Predictor "displacement" has a relationship with the response (mpg), i.e., βdisplacement ≠ 0.
The output shows that the t-test p-value for displacement is less than alpha (0.05), indicating that displacement has a significant relationship with mpg; we reject H0, so the coefficient for displacement is not zero. In addition to displacement, the following are also significant: weight, year, and origin.
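The significant terms can also be listed programmatically from the coefficient table (sketch, not part of the original):
# list terms with t-test p-values below alpha = 0.05
ct <- summary(lm_fit_x_name)$coefficients
rownames(ct)[ct[, "Pr(>|t|)"] < 0.05]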
A) From the output above, βyear = 0.78, which can be interpreted as follows: for every one-unit (one-year) increase in year, average mpg increases by about 0.78, holding all other predictors constant. In other words, this is the year-over-year change in average mpg with everything else held fixed.
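This interpretation can be verified with predict() on two hypothetical observations that differ only in year (sketch, not part of the original):
# duplicate one observation, increase year by 1, and compare predictions
new_obs <- my_Auto[c(1, 1), ]
new_obs$year[2] <- new_obs$year[1] + 1
diff(predict(lm_fit_x_name, newdata = new_obs))  # ~0.777, the year coefficient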
par(mfrow = c(2,2))
plot(lm_fit_x_name)
# Influential points identification
inf_pts_all <- which(cooks.distance(lm_fit_x_name) > 0.015)
inf_pts_all
## 14 45 112 156 167 245 276 278 310 323 326 327 328 330 335 387 394
## 14 44 111 154 165 243 274 276 308 321 324 325 326 328 332 382 389
par(mfrow=c(1,1)) # reset back
1) Zooming in on the 'Residuals vs Fitted' plot, we can see non-linearity, as the residuals show a pattern.
2) There is also heteroscedasticity: the variance of the residuals increases with the fitted values.
3) The normal Q-Q plot shows a right-tailed deviation with some outliers, rather than a perfect (conditionally) normal distribution.
4) The 'Residuals vs Leverage' plot shows that observation 14 has high leverage.
5) Above is the list of influential points (using a Cook's distance cutoff of 0.015), which need to be examined.
# We exclude the predictor name (column 9) before fitting interaction effects
lm_fit_itrt <- lm(mpg ~ . * ., data = my_Auto[, -9])
# alternate syntax for adding specific interaction terms
lm_fit_itrt2 <- lm(mpg ~ . + cylinders:acceleration + acceleration:year, data = my_Auto[, -9])
summary(lm_fit_itrt)
##
## Call:
## lm(formula = mpg ~ . * ., data = my_Auto[, -9])
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6008 -1.2863 0.0813 1.2082 12.0382
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.401e+01 5.147e+01 0.855 0.393048
## cylinders 3.302e+00 8.187e+00 0.403 0.686976
## displacement -3.529e-01 1.974e-01 -1.788 0.074638 .
## horsepower 5.312e-01 3.390e-01 1.567 0.117970
## weight -3.259e-03 1.820e-02 -0.179 0.857980
## acceleration -6.048e+00 2.147e+00 -2.818 0.005109 **
## year 4.833e-01 5.923e-01 0.816 0.415119
## origin2 -3.517e+01 1.260e+01 -2.790 0.005547 **
## origin3 -3.765e+01 1.426e+01 -2.640 0.008661 **
## cylinders:displacement -6.316e-03 7.106e-03 -0.889 0.374707
## cylinders:horsepower 1.452e-02 2.457e-02 0.591 0.555109
## cylinders:weight 5.703e-04 9.044e-04 0.631 0.528709
## cylinders:acceleration 3.658e-01 1.671e-01 2.189 0.029261 *
## cylinders:year -1.447e-01 9.652e-02 -1.499 0.134846
## cylinders:origin2 -7.210e-01 1.088e+00 -0.662 0.508100
## cylinders:origin3 1.226e+00 1.007e+00 1.217 0.224379
## displacement:horsepower -5.407e-05 2.861e-04 -0.189 0.850212
## displacement:weight 2.659e-05 1.455e-05 1.828 0.068435 .
## displacement:acceleration -2.547e-03 3.356e-03 -0.759 0.448415
## displacement:year 4.547e-03 2.446e-03 1.859 0.063842 .
## displacement:origin2 -3.364e-02 4.220e-02 -0.797 0.425902
## displacement:origin3 5.375e-02 4.145e-02 1.297 0.195527
## horsepower:weight -3.407e-05 2.955e-05 -1.153 0.249743
## horsepower:acceleration -3.445e-03 3.937e-03 -0.875 0.382122
## horsepower:year -6.427e-03 3.891e-03 -1.652 0.099487 .
## horsepower:origin2 -4.869e-03 5.061e-02 -0.096 0.923408
## horsepower:origin3 2.289e-02 6.252e-02 0.366 0.714533
## weight:acceleration -6.851e-05 2.385e-04 -0.287 0.774061
## weight:year -8.065e-05 2.184e-04 -0.369 0.712223
## weight:origin2 2.277e-03 2.685e-03 0.848 0.397037
## weight:origin3 -4.498e-03 3.481e-03 -1.292 0.197101
## acceleration:year 6.141e-02 2.547e-02 2.412 0.016390 *
## acceleration:origin2 9.234e-01 2.641e-01 3.496 0.000531 ***
## acceleration:origin3 7.159e-01 3.258e-01 2.198 0.028614 *
## year:origin2 2.932e-01 1.444e-01 2.031 0.043005 *
## year:origin3 3.139e-01 1.483e-01 2.116 0.035034 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.628 on 356 degrees of freedom
## Multiple R-squared: 0.8967, Adjusted R-squared: 0.8866
## F-statistic: 88.34 on 35 and 356 DF, p-value: < 2.2e-16
# summary(lm_fit_itrt2)
A) Shown above are two different syntaxes for specifying interactions within lm(). The output shows that adding interaction effects among the predictors increased the model-fit metric (adjusted R2) from 0.8205 to 0.8866.
Using alpha = 0.05 and the t-statistic p-values, the following interactions appear significant: cylinders:acceleration, acceleration:year, acceleration:origin2, acceleration:origin3, year:origin2, and year:origin3.
Note: origin2 is "European", origin3 is "Japanese".
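These terms can also be extracted programmatically (sketch, not part of the original):
# pull the significant interaction terms from the coefficient table
ct2 <- summary(lm_fit_itrt)$coefficients
grep(":", rownames(ct2)[ct2[, "Pr(>|t|)"] < 0.05], value = TRUE)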
A) Square transformation
# Square transformations of acceleration, displacement, and horsepower
lm_square_fit <- lm(mpg ~ . - name + I(acceleration^2) + I(displacement^2) + I(horsepower^2), data = my_Auto)
summary(lm_square_fit)
##
## Call:
## lm(formula = mpg ~ . - name + I(acceleration^2) + I(displacement^2) +
##     I(horsepower^2), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.6033 -1.5451 -0.0509 1.5556 11.8887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.754e+00 6.328e+00 1.225 0.2212
## cylinders 7.415e-01 3.116e-01 2.380 0.0178 *
## displacement -7.040e-02 1.672e-02 -4.211 3.18e-05 ***
## horsepower -2.222e-01 3.947e-02 -5.630 3.51e-08 ***
## weight -2.907e-03 6.949e-04 -4.184 3.57e-05 ***
## acceleration -1.365e+00 5.540e-01 -2.463 0.0142 *
## year 7.485e-01 4.594e-02 16.293 < 2e-16 ***
## origin2 5.266e-01 5.538e-01 0.951 0.3423
## origin3 1.143e+00 5.393e-01 2.119 0.0347 *
## I(acceleration^2) 3.342e-02 1.612e-02 2.073 0.0388 *
## I(displacement^2) 1.168e-04 2.872e-05 4.066 5.83e-05 ***
## I(horsepower^2) 5.803e-04 1.371e-04 4.232 2.91e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.922 on 380 degrees of freedom
## Multiple R-squared: 0.8637, Adjusted R-squared: 0.8598
## F-statistic: 219 on 11 and 380 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_square_fit)
A) Square root transformation
lm_sqr_fit <- lm(mpg ~ . - name + I(sqrt(acceleration)) + I(displacement^2) + I(sqrt(horsepower)), data = my_Auto)
summary(lm_sqr_fit)
##
## Call:
## lm(formula = mpg ~ . - name + I(sqrt(acceleration)) + I(displacement^2) +
## I(sqrt(horsepower)), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.402 -1.549 0.013 1.499 11.857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.840e+01 1.692e+01 2.861 0.004463 **
## cylinders 5.175e-01 3.139e-01 1.648 0.100079
## displacement -6.184e-02 1.730e-02 -3.574 0.000397 ***
## horsepower 2.346e-01 6.608e-02 3.550 0.000433 ***
## weight -2.993e-03 6.877e-04 -4.352 1.74e-05 ***
## acceleration 1.172e+00 9.877e-01 1.186 0.236314
## year 7.504e-01 4.578e-02 16.391 < 2e-16 ***
## origin2 5.371e-01 5.496e-01 0.977 0.329039
## origin3 1.129e+00 5.370e-01 2.102 0.036214 *
## I(sqrt(acceleration)) -1.155e+01 7.986e+00 -1.447 0.148771
## I(displacement^2) 1.035e-04 2.973e-05 3.480 0.000559 ***
## I(sqrt(horsepower)) -6.615e+00 1.433e+00 -4.615 5.37e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.915 on 380 degrees of freedom
## Multiple R-squared: 0.8645, Adjusted R-squared: 0.8605
## F-statistic: 220.3 on 11 and 380 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_sqr_fit)
A) Log transformation
lm_log_fit <- lm(log(mpg) ~ . - name + I(log(acceleration)), data = my_Auto)
summary(lm_log_fit)
##
## Call:
## lm(formula = log(mpg) ~ . - name + I(log(acceleration)), data = my_Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.40961 -0.06432 0.00662 0.07307 0.36336
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.816e+00 5.512e-01 6.923 1.88e-11 ***
## cylinders -2.079e-02 1.141e-02 -1.822 0.069192 .
## displacement 3.351e-04 2.914e-04 1.150 0.250915
## horsepower -2.078e-03 5.008e-04 -4.148 4.13e-05 ***
## weight -2.229e-04 2.516e-05 -8.860 < 2e-16 ***
## acceleration 6.767e-02 1.761e-02 3.843 0.000143 ***
## year 3.033e-02 1.818e-03 16.685 < 2e-16 ***
## origin2 6.436e-02 2.055e-02 3.131 0.001876 **
## origin3 7.499e-02 1.946e-02 3.853 0.000137 ***
## I(log(acceleration)) -1.163e+00 2.907e-01 -4.000 7.62e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.116 on 382 degrees of freedom
## Multiple R-squared: 0.8862, Adjusted R-squared: 0.8836
## F-statistic: 330.7 on 9 and 382 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(lm_log_fit)
Comments: The square, square-root, and log transformations all improved the R-squared metric, and the residuals-vs-fitted diagnostic plots show much less pattern (closer to linearity).
As shown in the last set of plots, for the log transformation the linearity (residuals vs. fitted plot) and conditional normality (normal Q-Q plot) assumptions are close to being satisfied.
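For a side-by-side view, the adjusted R2 of each fit can be collected in one call (sketch, not part of the original; note the log model's R2 is on the log(mpg) scale and is not strictly comparable to the others):
# compare adjusted R-squared across the fitted models
sapply(list(base = lm_fit_x_name, square = lm_square_fit,
            sqrt = lm_sqr_fit, log = lm_log_fit),
       function(m) summary(m)$adj.r.squared)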
Fit a multiple regression model on the Carseats data set to predict Sales using Price, Urban, and US.
A)
my_carseats <- Carseats
lm_cseat_fit <- lm(Sales~Price+Urban+US,data = my_carseats)
summary(lm_cseat_fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = my_carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
A) Before interpreting the coefficients above, we need to understand the coding that R used for the categorical variables:
contrasts(my_carseats$US)
## Yes
## No 0
## Yes 1
contrasts(my_carseats$Urban)
## Yes
## No 0
## Yes 1
coef(lm_cseat_fit)
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
Coefficient interpretation:
βPrice = -0.054459: for every one-unit increase in price, average sales decrease by about 0.054 units, holding the other predictors constant.
βUrbanYes = -0.021916: this predictor is not significant, but the interpretation is that, on average, urban locations (coded 'Yes') have about 0.022 units lower sales than non-urban locations, holding the other predictors constant.
βUSYes = 1.2006: on average, US locations (coded 'Yes') have about 1.2006 units higher sales than non-US locations (coded 'No'), holding the other predictors constant.
A) Note: non-significant terms are also included.
Sales(hat) ≈ 13.0434 - (0.054 * Price) - (0.0219 * Urban) + (1.2006 * US)
If Urban = 'No' (coded as 0) and US = 'No' (coded as 0):
Sales(hat) ≈ 13.0434 - (0.054 * Price)
If Urban = 'Yes' (coded as 1) and US = 'No' (coded as 0):
Sales(hat) ≈ (13.0434 - 0.0219) - (0.054 * Price)
If Urban = 'No' (coded as 0) and US = 'Yes' (coded as 1):
Sales(hat) ≈ (13.0434 + 1.2006) - (0.054 * Price)
If Urban = 'Yes' (coded as 1) and US = 'Yes' (coded as 1):
Sales(hat) ≈ (13.0434 - 0.0219 + 1.2006) - (0.054 * Price)
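The four intercepts can be checked by predicting at Price = 0 for each Urban/US combination (sketch, not part of the original):
# intercepts for the four Urban/US combinations via predict()
grid <- expand.grid(Price = 0, Urban = c("No", "Yes"), US = c("No", "Yes"))
cbind(grid[, -1], intercept = predict(lm_cseat_fit, newdata = grid))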
A) Assuming a significance level of 0.05: the t-test p-values for Price and US are below the significance level, so for these predictors we reject H0 (βPrice = 0 and βUS = 0, respectively).
Based on the model fitted in (a), the significant predictors are Price and US; Urban is not.
A) Removed Urban from the model due to its non-significance.
lm_sel_fit <- lm(Sales ~ Price+US,data = my_carseats )
summary(lm_sel_fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = my_carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
A) Looking at the model-fit measures:
Although model (e) has the same R2 of 23.93% as model (a), its adjusted R2 (0.2354 vs. 0.2335) and RSE (2.469 vs. 2.472) are slightly better, so model (e) fits marginally better than model (a).
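Because model (e) is nested inside model (a), the two can also be compared with a formal F-test (sketch, not part of the original):
# F-test comparing the reduced model (e) against the full model (a)
anova(lm_sel_fit, lm_cseat_fit)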
confint(lm_sel_fit,level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
A) To check for outliers, I used studentized residuals above 2.8. The diagnostic plots also point to observation 377 as an outlier, and this is confirmed below:
par(mfrow = c(2,2))
plot(lm_sel_fit)
par(mfrow = c(1,1))
df_plt <- data.frame(fit = predict(lm_sel_fit), stud = rstudent(lm_sel_fit))
df_plt <- cbind(obs=rownames(df_plt),df_plt)
which(df_plt$stud>2.8)
## [1] 377
# hchart() and hcaes() come from the highcharter package (assumed loaded)
d_hc <- hchart(
  df_plt,
  "scatter",
  hcaes(x = fit, y = stud)
)
d_hc %>%
hc_add_theme(hc_theme_economist()) %>%
hc_xAxis(title = list(text = "Fitted Values")) %>%
hc_yAxis(title = list(text = "Studentized Residuals")) %>%
hc_title(text = "Identifying Outliers with Studentized residuals above +/-2.8")
The outlier identified by the studentized residuals is observation 377.
For high-leverage points, I used the hatvalues() function; observation 43 has the highest leverage:
which.max(hatvalues(lm_sel_fit))
## 43
## 43
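A hat value is usually considered large when it well exceeds the average leverage (p + 1)/n (sketch, not part of the original):
# compare the largest hat value with the average leverage (p + 1)/n
max(hatvalues(lm_sel_fit))                    # leverage of observation 43
length(coef(lm_sel_fit)) / nrow(my_carseats)  # (p + 1)/n = 3/400 = 0.0075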
A) Referring to the simple linear regression slope from ISLR Chapter 3 (equation 3.4):
\[
\hat{\beta}_1 = \frac{\sum_{i = 1}^{n}(x_i-\overline{x})(y_i-\overline{y})}{\sum_{i = 1}^{n}(x_i-\overline{x})^2}
\]
Regressing Y on X without an intercept gives
\[ \hat{y}_i=\hat{\beta}_1x_i \]
Assuming x and y are centered, so that
\[ \overline{x}=0;\quad\overline{y}=0 \]
the slope reduces to
\[ \hat{\beta}_1 = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{i = 1}^{n}x_i^2} \quad \text{(equation a)} \]
Regressing X on Y gives
\[ \hat{x}_i=\hat{\beta}_2y_i, \qquad \hat{\beta}_2 = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{i = 1}^{n}y_i^2} \quad \text{(equation b)} \]
For (a) to equal (b), we need:
\[ \sum_{i = 1}^{n}y_i^2=\sum_{i = 1}^{n}x_i^2 \]
A) I simulated 100 observations from a normal distribution. The sum of x^2 (83.30737) and the sum of y^2 (1384.24) are not equal, so based on the conclusion in 12(a) we expect the two coefficient estimates to differ.
# simulated data
set.seed(123)
x <- rnorm(100)
y <- 4*x+rnorm(100)
# end of simulated data: no intercept, error term with mean zero
df_not_equal <- data.frame(x = x * x, y = y * y)  # columns hold x_i^2 and y_i^2
df_x2_y2 <- df_not_equal %>% summarise( x = sum(x),
y = sum(y),
.groups = "drop")
df_x2_y2
## x y
## 1 83.30737 1384.24
lm_y_x <- lm(y~x+0) #Regressing Y onto X
summary(lm_y_x)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.0010 -0.7901 -0.1800 0.4693 3.1762
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 3.9364 0.1064 36.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9713 on 99 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9319
## F-statistic: 1368 on 1 and 99 DF, p-value: < 2.2e-16
lm_x_y <- lm(x~y+0) #Regressing X onto Y
summary(lm_x_y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.82117 -0.12311 0.04998 0.18452 0.52946
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.236902 0.006404 36.99 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2383 on 99 degrees of freedom
## Multiple R-squared: 0.9325, Adjusted R-squared: 0.9319
## F-statistic: 1368 on 1 and 99 DF, p-value: < 2.2e-16
\[ {\sum_{i = 1}^{n}(x_i)^2}=83.30737 \]
\[ {\sum_{i = 1}^{n}(y_i)^2}=1384.24 \]
As per the results above, coefficient beta 1 = 3.9364 and coefficient beta 2 = 0.2369:
\[ \hat{\beta}_1 = 3.9364 \] \[ \hat{\beta}_2 = 0.2369 \]
# simulated data
set.seed(123)
x <- rnorm(100)
y <- rev(x)  # reverse x: the sums of squares match, but the x-y pairing differs
# end of simulated data without intercept
df_equal <- data.frame(x=(x*x),y= y*y)
df_x2_y2 <- df_equal %>% summarise( x = sum(x),
y = sum(y),
.groups = "drop")
df_x2_y2
## x y
## 1 83.30737 83.30737
lm_y_x <- lm(y~x+0) #Regressing Y onto X
summary(lm_y_x)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -0.5523 0.0173 0.7728 2.4388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.17425 0.09897 1.761 0.0814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9033 on 99 degrees of freedom
## Multiple R-squared: 0.03036, Adjusted R-squared: 0.02057
## F-statistic: 3.1 on 1 and 99 DF, p-value: 0.08137
lm_x_y <- lm(x~y+0) #Regressing X onto Y
summary(lm_x_y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.1108 -0.5523 0.0173 0.7728 2.4388
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.17425 0.09897 1.761 0.0814 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9033 on 99 degrees of freedom
## Multiple R-squared: 0.03036, Adjusted R-squared: 0.02057
## F-statistic: 3.1 on 1 and 99 DF, p-value: 0.08137
\[ {\sum_{i = 1}^{n}(x_i)^2}=83.30737 \]
\[ {\sum_{i = 1}^{n}(y_i)^2}=83.30737 \]
As per the results above, coefficient beta 1 = 0.17425 and coefficient beta 2 = 0.17425:
\[ \hat{\beta}_1 = 0.17425 \] \[ \hat{\beta}_2 = 0.17425 \]
After regressing Y onto X and vice versa, the two coefficient estimates are identical (0.17425), confirming the result from 12(a).