2. Carefully explain the differences between the KNN classifier and KNN regression methods.

As the names suggest, the KNN classifier is typically used for classification problems (qualitative response), while KNN regression is used for regression problems (quantitative response).
Both methods use the neighborhood of a data point to form a prediction, but they combine the neighbors differently.
With the KNN classifier, we want to know which class a new data point belongs to, so we assign it the class that occurs most frequently among its K nearest neighbors (a majority vote).
With KNN regression, the target is a numeric value, so we predict the new point's response as the average of the responses of its K nearest neighbors.
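For concreteness, here is a minimal base-R sketch (not from the original answer; the data, query point, and K are all made up for illustration) showing the two prediction rules side by side, a majority vote for classification and an average for regression:

# Hypothetical training data: 20 points in two dimensions
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)
class_y <- sample(c("A", "B"), 20, replace = TRUE)  # qualitative response
num_y   <- rnorm(20)                                # quantitative response
query   <- c(0, 0)                                  # new point to predict for
k       <- 5

# Indices of the k nearest training points by Euclidean distance
d  <- sqrt(rowSums((train_x - matrix(query, 20, 2, byrow = TRUE))^2))
nn <- order(d)[1:k]

# KNN classifier: assign the most frequent class among the neighbours
names(which.max(table(class_y[nn])))

# KNN regression: predict the average response of the neighbours
mean(num_y[nn])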

9. This question involves the use of multiple linear regression on the Auto data set.

Load the ISLR library to use the data

library(ISLR)

Load the Auto dataset from the ISLR library. No need to use the CSV file

data(Auto)

Get a summary of Auto directly from the ISLR package; no need to read the CSV file.

summary(Auto)
##       mpg          cylinders      displacement     horsepower        weight    
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0   Max.   :5140  
##                                                                                
##   acceleration        year           origin                      name    
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   amc matador       :  5  
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   ford pinto        :  5  
##  Median :15.50   Median :76.00   Median :1.000   toyota corolla    :  5  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577   amc gremlin       :  4  
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000   amc hornet        :  4  
##  Max.   :24.80   Max.   :82.00   Max.   :3.000   chevrolet chevette:  4  
##                                                  (Other)           :365

Quick read
Observations: 392
Columns: 9

9(a) Produce a scatterplot matrix which includes all of the variables in the data set.
pairs(Auto)

9(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

Use the cor() function, removing the name variable (the 9th column) from the Auto dataset.

cor(Auto[,-9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
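As a small follow-up (not required by the question), ranking the correlations of the other variables with mpg makes the strongest linear associations easy to read off:

# Correlations of every predictor with mpg, strongest positive first
sort(cor(Auto[, -9])[, "mpg"], decreasing = TRUE)
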
9(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

Use the lm() function to fit the multiple linear regression model with all predictors except name. The fitted model is stored in Auto.lm.fit.

Auto.lm.fit <- lm(mpg~.-name,data=Auto)

Print the summary

summary(Auto.lm.fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
9(c)(i) Is there a relationship between the predictors and the response?

Yes. The F-statistic of 252.4 with a p-value below 2.2e-16 (well under the 0.05 cutoff) lets us reject the null hypothesis that all coefficients are zero, so there is a relationship between the predictors and the response (mpg).

9(c)(ii) Which predictors appear to have a statistically significant relationship to the response?

The p-values of displacement, weight, year, and origin are below 0.05 (using a 95% confidence level as the cutoff), and R flags the same predictors with significance stars, so displacement, weight, year, and origin appear to have a statistically significant relationship to the response mpg.

9(c)(iii) What does the coefficient for the year variable suggest?

The coefficient of the year variable (0.750773) suggests the following:
Holding all other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 in mpg.
In other words, cars have become more fuel efficient at a rate of roughly 0.75 mpg per year.
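As a hedged numerical illustration (the predictor values below are arbitrary and chosen only for this example), predicting for two hypothetical cars that differ only in year should show a gap equal to the year coefficient:

# Two made-up cars, identical except for a one-year difference in model year
car_a <- data.frame(cylinders = 4, displacement = 100, horsepower = 90,
                    weight = 2500, acceleration = 15, year = 71, origin = 1)
car_b <- transform(car_a, year = 72)
diff(predict(Auto.lm.fit, newdata = rbind(car_a, car_b)))  # ~0.75 mpg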

9(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2,2))
plot(Auto.lm.fit)

The Residuals vs Fitted plot shows a slight curve (mild non-linearity), so the relationship is not perfectly linear; the Scale-Location plot of the standardized residuals looks somewhat better behaved.
The Residuals vs Leverage plot (with its Cook's distance contours) shows one point with unusually high leverage, observation 14.
Observations 323, 326, 327, and 394 are the most clearly visible outliers in these plots, though the charts suggest a few more.
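These visual impressions can be double-checked numerically (a small sketch, not part of the original answer):

# Observation with the largest leverage (hat value) in the fit
which.max(hatvalues(Auto.lm.fit))
# The largest studentized residuals, i.e. the most outlier-like observations
head(sort(abs(rstudent(Auto.lm.fit)), decreasing = TRUE))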

9(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Auto.lm.fit.IT <- lm(mpg~.*.-name*.+.-name,data=Auto)
summary(Auto.lm.fit.IT)
## 
## Call:
## lm(formula = mpg ~ . * . - name * . + . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

From the model above with all pairwise interaction terms, the following interactions are statistically significant at the 0.05 cutoff (95% confidence level):
acceleration:origin
acceleration:year
displacement:year
If we relax the cutoff to 0.10, the following interaction terms also appear significant (see the refit sketch after this list):
cylinders:acceleration
cylinders:year
displacement:weight
horsepower:acceleration
year:origin
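As an optional follow-up sketch (not required by the question), we could keep only the interactions significant at the 0.05 level and compare the result against the main-effects model with a nested F-test:

# Main effects plus only the three interactions flagged at the 0.05 level
Auto.lm.fit.sig <- lm(mpg ~ . - name + displacement:year + acceleration:year +
                        acceleration:origin, data = Auto)
anova(Auto.lm.fit, Auto.lm.fit.sig)  # F-test: do the interactions add explanatory power?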

9(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

The scatterplot matrix shows an inverse relationship between horsepower and mpg, yet the multiple linear regression fit above found horsepower not significant. Let's dig a little deeper using transformations.

par(mfrow = c(2,2))
plot(Auto$horsepower,Auto$mpg)
plot(log(Auto$horsepower), Auto$mpg)
plot(sqrt(Auto$horsepower), Auto$mpg)
plot((Auto$horsepower)^2, Auto$mpg)

The scatterplots show that mpg decreases as horsepower increases, and the log transformation of horsepower gives the most linear-looking relationship of the four.

Let's fit models using horsepower and log-transformed horsepower alone, check whether each is significant, and look at the diagnostic plots.

Auto.lm.fit.hp <- lm(mpg~horsepower,data=Auto)
summary(Auto.lm.fit.hp)
## 
## Call:
## lm(formula = mpg ~ horsepower, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.5710  -3.2592  -0.3435   2.7630  16.9240 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 39.935861   0.717499   55.66   <2e-16 ***
## horsepower  -0.157845   0.006446  -24.49   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.906 on 390 degrees of freedom
## Multiple R-squared:  0.6059, Adjusted R-squared:  0.6049 
## F-statistic: 599.7 on 1 and 390 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(Auto.lm.fit.hp)

With horsepower alone, the predictor is highly significant, but the diagnostic plots suggest the relationship is not very linear.

Let's do the same with log(horsepower).

Auto.lm.fit.loghp <- lm(mpg~log(horsepower),data=Auto)
summary(Auto.lm.fit.loghp)
## 
## Call:
## lm(formula = mpg ~ log(horsepower), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.2299  -2.7818  -0.2322   2.6661  15.4695 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     108.6997     3.0496   35.64   <2e-16 ***
## log(horsepower) -18.5822     0.6629  -28.03   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.501 on 390 degrees of freedom
## Multiple R-squared:  0.6683, Adjusted R-squared:  0.6675 
## F-statistic: 785.9 on 1 and 390 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(Auto.lm.fit.loghp)

The log(horsepower) model fits better (R-Squared of 0.6683 versus 0.6059) and its relationship with mpg appears more linear.
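To round out the comparison (a minimal sketch, not part of the original write-up), the other two suggested transformations can be fitted the same way and the four R-Squared values compared directly:

# Fit sqrt and squared transformations of horsepower, then compare R^2 values
Auto.lm.fit.sqrthp <- lm(mpg ~ sqrt(horsepower), data = Auto)
Auto.lm.fit.hp2    <- lm(mpg ~ I(horsepower^2), data = Auto)
c(linear  = summary(Auto.lm.fit.hp)$r.squared,
  log     = summary(Auto.lm.fit.loghp)$r.squared,
  sqrt    = summary(Auto.lm.fit.sqrthp)$r.squared,
  squared = summary(Auto.lm.fit.hp2)$r.squared)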

10. This question should be answered using the Carseats data set.

Load the Carseats dataset from the ISLR library.

data(Carseats)
?Carseats
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Observations: 400
Columns: 11
Clearly qualitative variables: ShelveLoc, Urban, and US

summary(Carseats)
##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
## 
10(a) Fit a multiple regression model to predict Sales using Price,Urban, and US.
Carseats.lm.fit <- lm(Sales~Price+Urban+US,data=Carseats)
summary(Carseats.lm.fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
10(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

The model uses three predictors, so there are three coefficients besides the intercept.
Price is a quantitative variable, whereas Urban and US are qualitative (each encoded by R as a Yes/No dummy variable).
The overall F-statistic and its p-value indicate that the model as a whole is significant.
Price and US are individually significant, so we can reject the null hypothesis for those two coefficients.

Price: holding the other predictors fixed, each one-unit increase in Price is associated with a decrease of about 0.0545 units (roughly 54 seats, since Sales is measured in thousands) in average Sales at a location.

US: holding the other predictors fixed, a store in the US sells on average about 1.2 thousand more car seats than a store outside the US.

Urban: the p-value (0.936) is not significant, so we fail to reject the null hypothesis for Urban; there is no evidence that an urban location affects Sales, and Urban would likely be dropped from the final model.
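To see exactly how R encodes the two qualitative predictors (and why the coefficients are labelled UrbanYes and USYes), we can inspect the treatment contrasts:

# Default treatment contrasts: the baseline level ("No") is absorbed into the
# intercept and a single 0/1 dummy column is created for "Yes"
contrasts(Carseats$Urban)
contrasts(Carseats$US)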

10(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

In general (with Urban and US each coded as 1 = Yes, 0 = No):
Sales = 13.043469 + (-0.054459) * Price + (-0.021916) * Urban + (1.200573) * US

This works out to the following four cases (checked numerically in the sketch after the list):

  • Urban = 1 (Yes), US = 1 (Yes):
    • Sales = 13.043469 + (-0.054459) * Price + (-0.021916) + (1.200573)
    • Sales = -0.054459 * Price + 14.222126
  • Urban = 0 (No), US = 1 (Yes):
    • Sales = 13.043469 + (-0.054459) * Price + (0) + (1.200573)
    • Sales = -0.054459 * Price + 14.244042
  • Urban = 1 (Yes), US = 0 (No):
    • Sales = 13.043469 + (-0.054459) * Price + (-0.021916) + (0)
    • Sales = -0.054459 * Price + 13.021553
  • Urban = 0 (No), US = 0 (No):
    • Sales = 13.043469 + (-0.054459) * Price + (0) + (0)
    • Sales = -0.054459 * Price + 13.043469
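As a quick sanity check (Price = 120 is an arbitrary example value chosen only for illustration), predict() reproduces the four intercept-shifted lines above:

# Predicted Sales at a hypothetical Price of 120 for all four Urban/US combinations
grid <- expand.grid(Price = 120, Urban = c("Yes", "No"), US = c("Yes", "No"))
cbind(grid, Predicted.Sales = predict(Carseats.lm.fit, newdata = grid))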
10(d) For which of the predictors can you reject the null hypothesis H0: βj = 0?

The p-values for Price and US are significant, so we can reject the null hypothesis H0: βj = 0 for those two predictors.
The p-value for Urban (0.936) is not significant, so we cannot reject the null hypothesis for it.

10(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

From the previous fit, the Urban predictor is not significant and shows no evidence of association with the outcome, so let's fit a smaller model using only the Price and US predictors.

Carseats.lm.fit.PU <- lm(Sales~Price+US,data=Carseats)
summary(Carseats.lm.fit.PU)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
10(f) How well do the models in (a) and (e) fit the data?

R-Squared for model (a) is 0.2393, adjusted R-Squared is 0.2335, and RSE is 2.472.
R-Squared for model (e) is 0.2393, adjusted R-Squared is 0.2354, and RSE is 2.469.
Model (e) has a slightly better adjusted R-Squared and a slightly lower RSE (residual standard error), and every predictor retained in it is significant.
Although both models are significant overall, they explain only about 23.9% of the variability in Sales, so neither fits the data especially well.
Model (e) is marginally (almost negligibly) better than model (a).
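Since model (e) is nested within model (a), a quick F-test (a small additional check, not part of the original answer) confirms that dropping Urban does not significantly worsen the fit:

# Compare the reduced model (Price + US) against the full model (Price + Urban + US)
anova(Carseats.lm.fit.PU, Carseats.lm.fit)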

10(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(Carseats.lm.fit.PU)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
10(h) Is there evidence of outliers or high leverage observations in the model from (e)?

To check this, let's look at the diagnostic plots for model (e).

par(mfrow = c(2,2))
plot(Carseats.lm.fit.PU)

Outliers:
The Residuals vs Fitted plot shows a roughly linear pattern, but a few observations lie far from the rest, indicating potential outliers; a few standardized residuals also fall outside ±2, which points in the same direction.

However, to decide whether any observations are true outliers, let's plot the studentized residuals, coloring any observation whose studentized residual exceeds 3 in absolute value.

plot(predict(Carseats.lm.fit.PU), rstudent(Carseats.lm.fit.PU),
     col = ifelse(abs(rstudent(Carseats.lm.fit.PU)) > 3, "red", "black"))

The above plot doesn’t highlight any observations whose studentized residuals are greater than 3 in absolute value. Hence, we conclude there are no outliers that could potentially influence the model. We can safely use all observations in the model.

Leverage Points:
The Residuals vs Leverage plot suggests a few high-leverage points. To check this, let's color the observations whose leverage statistic exceeds the average leverage (p + 1)/n, where p = 2 and n = 400.

plot(hatvalues(Carseats.lm.fit.PU),col=ifelse(hatvalues(Carseats.lm.fit.PU)>(2+1)/dim(Carseats)[1],"red","black"))

The above plot shows that a few observations have a considerably high leverage statistic, so we can confirm that there are high-leverage points in model (e).

Let's print the potentially influential points.

summary(influence.measures(Carseats.lm.fit.PU))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

This flags a few potentially influential observations. Let's refit the model after removing them.

Carseats.Outliers<-c(26,29,43,50,51,58,69,126,160,166,172,175,210,270,298,314,353,357,368,377,384,387,396)
Carseats.subset<-Carseats[-Carseats.Outliers,]
Carseats.lm.fit.PU.subset<-lm(Sales~Price+US,data=Carseats.subset)
summary(Carseats.lm.fit.PU.subset)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.subset)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

The new model (model (h)) has an R-Squared of 0.2387, which is in fact not an improvement over model (e), whose R-Squared is 0.2393.

Based on the above analysis, we can safely use all the observations in the dataset.

12. This problem involves simple linear regression without an intercept.

12(a) Recall that the coefficient estimate \(\hat{\beta}\) for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of Y onto X is
\(\hat{\beta} = \displaystyle \frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sum_{i'=1}^{n} x_{i'}^{2}}\)

The coefficient estimate for the regression of X onto Y is
\(\hat{\beta} = \displaystyle \frac{\sum_{i=1}^{n} x_{i} y_{i}}{\sum_{i'=1}^{n} y_{i'}^{2}}\)

The two coefficient estimates are the same only when the denominators are equal, i.e. when
\(\sum_{i'=1}^{n} x_{i'}^{2} = \sum_{i'=1}^{n} y_{i'}^{2}\)

12(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

To do this, we need to ensure \(\sum_{i'=1}^{n} x_{i'}^{2} \neq \sum_{i'=1}^{n} y_{i'}^{2}\)

Let's construct such an example.

Set a random seed, let x be the integers 1 to 100, and let y = 2x + rnorm(100).

set.seed(25)
x <- 1:100
y <- 2 * x + rnorm(100)
head(data.frame(x,y,x^2,y^2))
##   x         y x.2        y.2
## 1 1  1.788166   1   3.197539
## 2 2  2.958409   4   8.752183
## 3 3  4.846692   9  23.490428
## 4 4  8.321531  16  69.247886
## 5 5  8.499870  25  72.247792
## 6 6 11.554467  36 133.505702

Since x and y are generated differently, sum(x^2) and sum(y^2) should be different; let's print them to confirm.

sum(x^2)
## [1] 338350
sum(y^2)
## [1] 1350394

Let’s fit two linear regressions for Y onto X and X onto Y.

Y.lm.fit <- lm(y ~ x + 0)
summary(Y.lm.fit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.23461 -0.94314 -0.05444  0.45612  2.42291 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x 1.997702   0.001743    1146   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014 on 99 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.314e+06 on 1 and 99 DF,  p-value: < 2.2e-16
X.lm.fit <- lm(x ~ y + 0)
summary(X.lm.fit)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.21095 -0.22517  0.03132  0.47784  1.12212 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## y 0.5005374  0.0004367    1146   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5075 on 99 degrees of freedom
## Multiple R-squared:  0.9999, Adjusted R-squared:  0.9999 
## F-statistic: 1.314e+06 on 1 and 99 DF,  p-value: < 2.2e-16

From the two model fits above, notice that the coefficient estimates (1.997702 and 0.5005374) are different.

12(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

To do this, we need to ensure \(\sum_{i'=1}^{n} x_{i'}^{2} = \sum_{i'=1}^{n} y_{i'}^{2}\)

This can be achieved by choosing y so that the squared values match, for example y = x, y = abs(x), or y = -x.

set.seed(14)
x=rnorm(100)
y=abs(x)

head(data.frame(x,y,x^2,y^2))
##             x          y         x.2         y.2
## 1 -0.66184983 0.66184983 0.438045195 0.438045195
## 2  1.71895416 1.71895416 2.954803394 2.954803394
## 3  2.12166699 2.12166699 4.501470822 4.501470822
## 4  1.49715368 1.49715368 2.241469154 2.241469154
## 5 -0.03614058 0.03614058 0.001306141 0.001306141
## 6  1.23194518 1.23194518 1.517688918 1.517688918

Since x^2 and y^2 are identical for every observation, their sums sum(x^2) and sum(y^2) are necessarily equal.
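A one-line check (a small sketch, not in the original) confirms this:

# The squared values are identical observation by observation, so the sums agree
all.equal(sum(x^2), sum(y^2))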

Let’s fit two linear regressions for Y onto X and X onto Y.

Y.lm.fit <- lm(y ~ x + 0)
summary(Y.lm.fit)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## 0.00667 0.28849 0.63300 1.02049 2.42147 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.13313    0.09961   1.337    0.184
## 
## Residual standard error: 0.898 on 99 degrees of freedom
## Multiple R-squared:  0.01772,    Adjusted R-squared:  0.007802 
## F-statistic: 1.786 on 1 and 99 DF,  p-value: 0.1844
X.lm.fit <- lm(x ~ y + 0)
summary(X.lm.fit)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.42147 -0.70064 -0.01548  0.54449  1.83921 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.13313    0.09961   1.337    0.184
## 
## Residual standard error: 0.898 on 99 degrees of freedom
## Multiple R-squared:  0.01772,    Adjusted R-squared:  0.007802 
## F-statistic: 1.786 on 1 and 99 DF,  p-value: 0.1844

As shown above, both models have the same coefficient estimate (0.13313).