Chapter 3: Linear Regression

Question 2

Differences Between KNN Classification and KNN Regression

The KNN classifier and KNN regression differ in the type of response they predict. KNN classification is used when the response variable is categorical (qualitative), while KNN regression is used when the response variable is numerical (quantitative). For KNN classification, we predict the class of a new observation as the majority class among its K nearest neighbors. For KNN regression, we predict the response as the average of the response values of the K nearest neighbors. KNN classification therefore predicts class labels such as 0 or 1, or Yes or No, whereas KNN regression predicts continuous values.

For example:

  • Classification: predicting Yes/No or 0/1
  • Regression: predicting house price or temperature
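The two prediction rules can be sketched side by side. This is a minimal illustration in base R on hypothetical one-dimensional toy data; the function name `knn_predict` and all values are assumptions made for the example, not part of the exercise.

```r
# Minimal KNN sketch on hypothetical toy data (base R only).
knn_predict <- function(x_train, y_train, x_new, k,
                        type = c("classification", "regression")) {
  type <- match.arg(type)
  # Indices of the k training points closest to x_new
  nearest <- order(abs(x_train - x_new))[seq_len(k)]
  neighbours <- y_train[nearest]
  if (type == "classification") {
    # Majority vote among the k neighbours (ties go to the first level)
    names(which.max(table(neighbours)))
  } else {
    # Average response of the k neighbours
    mean(neighbours)
  }
}

x_train <- c(1, 2, 3, 6, 7, 8)
labels  <- c("No", "No", "No", "Yes", "Yes", "Yes")  # qualitative response
prices  <- c(10, 12, 14, 30, 32, 34)                 # quantitative response

knn_class <- knn_predict(x_train, labels, x_new = 6.5, k = 3, type = "classification")
knn_reg   <- knn_predict(x_train, prices, x_new = 6.5, k = 3, type = "regression")
print(knn_class)
print(knn_reg)
```

The same three neighbours are used in both calls; only the way their responses are combined differs (a vote versus an average).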


Question 9

Load and Clean the Auto Dataset

Auto <- read.csv("Auto.csv")

Auto$horsepower <- as.numeric(as.character(Auto$horsepower))
## Warning: NAs introduced by coercion
Clean_Auto <- na.omit(Auto)

(a) Scatterplot Matrix

# Column 9 (name) is excluded from the scatterplot matrix
pairs(Clean_Auto[, -9])

This scatterplot matrix allows us to visualize pairwise relationships between variables.


(b) Correlation Matrix

cor(Clean_Auto[, -9])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Interpretation

We observe that:

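  • mpg is negatively correlated with:

    • cylinders (-0.78)
    • displacement (-0.81)
    • horsepower (-0.78)
    • weight (-0.83)
  • mpg is positively correlated with:

    • year (0.58)
    • origin (0.57)
    • acceleration (0.42)

This suggests that heavier cars with larger engines tend to have lower fuel efficiency, while newer cars tend to have higher fuel efficiency.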


(c) Multiple Linear Regression

We fit a regression model using mpg as the response variable.

model_fit <- lm(mpg ~ . - name, data = Clean_Auto)

summary(model_fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Clean_Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

(i) Relationship Between Predictors and Response

Yes.

The F-statistic is 252.4 on 7 and 384 degrees of freedom, with:

\[ p < 2.2 \times 10^{-16} \]

This is strong evidence against the null hypothesis that all coefficients are zero, so at least one predictor is related to mpg.
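As a rough check, the overall F-statistic can be recovered from R² and the degrees of freedom. The sketch below retypes the rounded values from the printed summary; the variable names are illustrative assumptions.

```r
# Values read from summary(model_fit); R-squared is rounded to four decimals there
r_squared <- 0.8215
p         <- 7      # number of predictors
df_resid  <- 384    # residual degrees of freedom, n - p - 1

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r_squared / p) / ((1 - r_squared) / df_resid)
print(round(f_stat, 1))
```

The result is about 252.5, which agrees with the reported 252.4 up to the rounding of R².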


(ii) Significant Predictors

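Based on p-values less than 0.05, the statistically significant predictors are:

  • displacement
  • weight
  • year
  • origin

Cylinders, horsepower, and acceleration are not significant once the other predictors are in the model.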
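The 0.05 cutoff can be applied mechanically. The sketch below is only an illustration: it retypes the p-values printed in the summary into a hypothetical vector `p_values`, using 2e-16 as a stand-in upper bound for the entries reported as `< 2e-16`, and keeps the terms below the cutoff. In a live session the same column could instead be taken from `coef(summary(model_fit))[, "Pr(>|t|)"]`.

```r
# p-values copied from the printed summary; "< 2e-16" is entered as 2e-16
p_values <- c(cylinders = 0.12780, displacement = 0.00844, horsepower = 0.21963,
              weight = 2e-16, acceleration = 0.41548, year = 2e-16,
              origin = 4.67e-07)

alpha <- 0.05
significant <- names(p_values)[p_values < alpha]
print(significant)
```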

(iii) Interpretation of Year Coefficient

The coefficient for year is 0.75 and statistically significant.

This suggests that, holding all other variables constant, each additional model year is associated with an increase of about 0.75 mpg, so newer cars tend to be more fuel efficient.


(d) Diagnostic Plots

par(mfrow = c(2,2))
plot(model_fit)

Interpretation

Residuals vs Fitted

  • Indicates possible non-linearity.

Normal Q-Q

  • Residuals mostly follow the line.

Scale-Location

  • Suggests non-constant variance.

Residuals vs Leverage

  • Some observations may have high leverage.

Overall, the model performs reasonably well, although some non-linear effects may exist.


(e) Interaction Effects

fit1 <- lm(mpg ~ horsepower * weight, data = Clean_Auto)

summary(fit1)
## 
## Call:
## lm(formula = mpg ~ horsepower * weight, data = Clean_Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7725  -2.2074  -0.2708   1.9973  14.7314 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        6.356e+01  2.343e+00  27.127  < 2e-16 ***
## horsepower        -2.508e-01  2.728e-02  -9.195  < 2e-16 ***
## weight            -1.077e-02  7.738e-04 -13.921  < 2e-16 ***
## horsepower:weight  5.355e-05  6.649e-06   8.054 9.93e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.93 on 388 degrees of freedom
## Multiple R-squared:  0.7484, Adjusted R-squared:  0.7465 
## F-statistic: 384.8 on 3 and 388 DF,  p-value: < 2.2e-16

Model performance:

  • Residual Standard Error = 3.93
  • Adjusted R² = 0.7465

The horsepower:weight term is highly significant (p = 9.93e-15), which is strong evidence of an interaction between horsepower and weight.
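One way to read the interaction term is that the effect of horsepower on mpg changes with weight. The sketch below uses the rounded coefficients from the fit1 summary; the helper name `hp_slope` and the three example weights are illustrative assumptions, not part of the original analysis.

```r
# Rounded estimates from summary(fit1)
b_hp  <- -2.508e-01   # horsepower
b_int <-  5.355e-05   # horsepower:weight

# Change in predicted mpg per extra unit of horsepower at a given weight
hp_slope <- function(weight) b_hp + b_int * weight

weights <- c(2000, 3000, 4000)
slopes  <- hp_slope(weights)
print(round(slopes, 4))
```

Under these estimates the horsepower slope is negative at all three weights but moves toward zero as weight increases, which is what the positive interaction coefficient implies.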


(f) Polynomial Regression

fit2 <- lm(mpg ~ weight + I(weight^2), Clean_Auto)

summary(fit2)
## 
## Call:
## lm(formula = mpg ~ weight + I(weight^2), data = Clean_Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.226e+01  2.993e+00  20.800  < 2e-16 ***
## weight      -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
## I(weight^2)  1.697e-06  3.059e-07   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

Results:

  • Residual Standard Error = 4.176
  • Adjusted R² = 0.7137

The quadratic term I(weight^2) is highly significant (p = 5.43e-08), which suggests a nonlinear relationship between weight and mpg.
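The curvature can be made concrete by looking at the slope of the fitted quadratic at different weights. This is a small sketch from the rounded coefficients in the fit2 summary; the helper name `mpg_slope` and the two example weights are assumptions chosen for illustration.

```r
# Rounded estimates from summary(fit2)
b1 <- -1.850e-02   # weight
b2 <-  1.697e-06   # I(weight^2)

# Derivative of the fitted curve b0 + b1 * w + b2 * w^2 with respect to w
mpg_slope <- function(weight) b1 + 2 * b2 * weight

slope_2000 <- mpg_slope(2000)
slope_4000 <- mpg_slope(4000)
print(c(slope_2000, slope_4000))
```

With these estimates, an extra unit of weight costs about 0.0117 mpg at a weight of 2000 but only about 0.0049 mpg at 4000, so the fitted curve flattens as weight grows.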


Question 10

Load Dataset

Carseats <- read.csv("Carseats.csv")

str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : int  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : int  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: int  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : int  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : int  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : chr  "Bad" "Good" "Medium" "Medium" ...
##  $ Age        : int  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : int  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ US         : chr  "Yes" "Yes" "Yes" "Yes" ...
head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

(a) Multiple Regression

sales_model <- lm(Sales ~ Price + Urban + US, data = Carseats)

summary(sales_model)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Results:

  • Residual Standard Error = 2.472
  • Adjusted R² = 0.2335

(b) Interpretation of Coefficients

Regression equation:

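\[ \text{Sales} = 13.043 - 0.05446 \times \text{Price} - 0.02192 \times \text{UrbanYes} + 1.20057 \times \text{USYes} \]

Here UrbanYes and USYes equal 1 for a "Yes" store and 0 otherwise, and Sales is measured in thousands of units.

Interpretation:

  • Price: holding Urban and US fixed, every $1 increase in price is associated with a decrease in sales of approximately 0.0545 thousand units (about 54 units).
  • UrbanYes: stores in urban areas sell about 0.022 thousand fewer units than non-urban stores, but this difference is not statistically significant (p = 0.936).
  • USYes: stores in the US sell approximately 1.2 thousand more units than stores outside the US, holding the other predictors fixed.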
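To see how the fitted equation is used, the sketch below evaluates it by hand for a hypothetical store. The function name `predict_sales` and the example price of 100 are assumptions for illustration; the coefficients are the estimates printed in the sales_model summary.

```r
# Estimates copied from summary(sales_model)
predict_sales <- function(price, urban, us) {
  # urban and us are logical: TRUE corresponds to the "Yes" level
  13.043469 - 0.054459 * price -
    0.021916 * as.numeric(urban) + 1.200573 * as.numeric(us)
}

pred_us     <- predict_sales(price = 100, urban = TRUE, us = TRUE)
pred_non_us <- predict_sales(price = 100, urban = TRUE, us = FALSE)
print(c(pred_us, pred_non_us))
```

The two predictions differ by exactly the USYes coefficient, which is what "holding the other predictors fixed" means in practice.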

(d) Hypothesis Testing

Using significance level:

\[ \alpha = 0.05 \]

Reject the null hypothesis for:

  • Price
  • USYes

We fail to reject the null hypothesis for UrbanYes (p = 0.936).

(e) Reduced Model

lm.fit2 <- lm(Sales ~ Price + US, data = Carseats)

summary(lm.fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) Model Comparison

anova(sales_model, lm.fit2)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    396 2420.8                           
## 2    397 2420.9 -1  -0.03979 0.0065 0.9357

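The p-value for the F-test is:

\[ 0.9357 \]

We fail to reject the null hypothesis that the Urban coefficient is zero, so the reduced model fits as well as the full model. Both models explain only about 24% of the variance in Sales (R² = 0.2393), and the reduced model has a slightly higher adjusted R² (0.2354 versus 0.2335).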
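The F-statistic in the ANOVA table can be reproduced by hand from the residual sums of squares. This is a rough sketch using the rounded values printed in the table, so small rounding differences are expected; the variable names are illustrative.

```r
# Values read from the anova() table (RSS is rounded to one decimal there)
rss_full   <- 2420.8    # Sales ~ Price + Urban + US
df_full    <- 396       # residual degrees of freedom of the full model
ss_dropped <- 0.03979   # increase in RSS when Urban is dropped
df_dropped <- 1         # number of coefficients removed

f_stat  <- (ss_dropped / df_dropped) / (rss_full / df_full)
p_value <- pf(f_stat, df_dropped, df_full, lower.tail = FALSE)
print(round(c(F = f_stat, p = p_value), 4))
```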


(g) 95% Confidence Intervals

confint(lm.fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Interpretation:

Price coefficient confidence interval:

\[ (-0.0648,\ -0.0442) \]

We are 95% confident that the true Price coefficient lies within this interval. Because the interval excludes zero, this agrees with the significant t-test for Price.
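The interval for Price can be rebuilt from the estimate and standard error in the reduced-model summary. A minimal sketch, using the rounded printed values and illustrative variable names:

```r
estimate  <- -0.05448   # Price coefficient from summary(lm.fit2)
std_error <-  0.00523   # its standard error
df_resid  <- 397        # residual degrees of freedom

t_crit <- qt(0.975, df_resid)   # two-sided 95% critical value
ci <- estimate + c(-1, 1) * t_crit * std_error
print(round(ci, 4))
```

Up to rounding of the inputs, this matches the Price row returned by confint().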


(h) Outliers and Leverage

plot(predict(lm.fit2), rstudent(lm.fit2))

Interpretation:

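The plot does not indicate a substantial number of outliers: the largest residuals in the summary above are about ±7, which is under 3 residual standard errors (2.469). Leverage is not visible in this plot and has to be assessed separately from the leverage statistics.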
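A leverage check could look roughly like the sketch below. To keep it runnable on its own it uses hypothetical simulated data in place of Carseats, and the cutoffs (twice the average leverage, studentized residuals beyond 3 in absolute value) are common rules of thumb assumed here, not values from the exercise. In the session above, `fit` would be replaced by `lm.fit2`.

```r
# Hypothetical simulated data standing in for Carseats
set.seed(1)
n <- 200
sim <- data.frame(Price = rnorm(n, mean = 115, sd = 25),
                  US = factor(sample(c("No", "Yes"), n, replace = TRUE)))
sim$Sales <- 13 - 0.054 * sim$Price + 1.2 * (sim$US == "Yes") + rnorm(n, sd = 2.5)

fit <- lm(Sales ~ Price + US, data = sim)

lev      <- hatvalues(fit)              # leverage statistic of each observation
avg_lev  <- length(coef(fit)) / n       # average leverage, (p + 1) / n
high_lev <- which(lev > 2 * avg_lev)    # assumed rule-of-thumb cutoff
outliers <- which(abs(rstudent(fit)) > 3)

print(length(high_lev))
print(length(outliers))
```

Calling `plot(fit, which = 5)` draws the residuals versus leverage plot for the same model.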


Question 12

(a) Regression Slopes Equality

The coefficient for the regression of y onto x without an intercept equals the coefficient for the regression of x onto y when:

\[ \sum x_i^2 = \sum y_i^2 \]

That is, when both variables have the same sum of squares (equal Euclidean norm).
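The condition follows from the closed-form slopes of the two no-intercept regressions. The subscripts on the estimates below are notation introduced here for clarity:

```latex
\[ \hat{\beta}_{y \sim x} = \frac{\sum x_i y_i}{\sum x_i^2}, \qquad
   \hat{\beta}_{x \sim y} = \frac{\sum x_i y_i}{\sum y_i^2} \]

\[ \hat{\beta}_{y \sim x} = \hat{\beta}_{x \sim y}
   \iff \sum x_i^2 = \sum y_i^2
   \quad \text{(assuming } \sum x_i y_i \neq 0 \text{)} \]
```

Both slopes share the numerator, so they can only differ through their denominators. If the numerator is zero, both slopes are zero and the two estimates are trivially equal.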


(b) Example Where Coefficients Differ

set.seed(1)

x <- rnorm(100)

y <- rnorm(100)

coef(lm(y ~ x + 0))
##            x 
## -0.006123917
coef(lm(x ~ y + 0))
##            y 
## -0.005455947

The coefficients are different because:

\[ \sum x_i^2 \neq \sum y_i^2 \]


(c) Example Where Coefficients Are Equal

x <- 1:100

# y holds the same values as x in reverse order
y <- 100:1

eg3 <- lm(y ~ x + 0)

eg4 <- lm(x ~ y + 0)

coef(eg3)
##         x 
## 0.5074627
coef(eg4)
##         y 
## 0.5074627

Both coefficients are equal (0.5074627).

This occurs because y contains the same values as x in reverse order, so:

\[ \sum x_i^2 = \sum y_i^2 \]
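The equality can be checked directly from the sums involved, without calling lm(). A short sketch with illustrative variable names:

```r
x <- 1:100
y <- 100:1

sum_x2 <- sum(x^2)    # sum of squares of x
sum_y2 <- sum(y^2)    # same values in reverse order, so the same sum
sum_xy <- sum(x * y)  # shared numerator of both slopes

slope_y_on_x <- sum_xy / sum_x2
slope_x_on_y <- sum_xy / sum_y2
print(c(sum_x2, sum_y2, sum_xy))
print(c(slope_y_on_x, slope_x_on_y))
```

Both sums of squares are 338350 and the cross-product is 171700, which gives the common slope 0.5074627 reported above.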