---
title: "Assignment #2: Classification"
output: html_notebook
---

**Question #2**. Carefully explain the differences between the KNN classifier and KNN
regression methods.

K-Nearest Neighbors (KNN) is a machine learning algorithm that can be used for both classification and regression tasks. The two methods differ in the output they produce: the KNN classifier predicts a class label, while KNN regression predicts a numerical value.

Specifically, KNN classification predicts the class of a new instance by finding the K nearest training instances and assigning the most frequent class among them as the prediction.

In contrast, KNN regression estimates the numerical value of a new instance by finding the K nearest training instances and averaging their response values (or applying another aggregation function) to produce the estimate.
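
A minimal sketch of the contrast (assuming the `class` and `FNN` packages are installed; the toy data below are made up for illustration):
```{r}
library(class)  # knn(): KNN classification
library(FNN)    # knn.reg(): KNN regression

set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)   # 20 training points
test_x  <- matrix(rnorm(10), ncol = 2)   # 5 new points

# Classification: majority vote among the K = 3 nearest neighbors
cls <- factor(sample(c("A", "B"), 20, replace = TRUE))
knn(train_x, test_x, cl = cls, k = 3)

# Regression: mean response of the K = 3 nearest neighbors
num <- rnorm(20)
knn.reg(train_x, test_x, y = num, k = 3)$pred
```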

---

**Question #9**. This question involves the use of multiple linear regression on the
Auto data set.

**(a)** Produce a scatterplot matrix which includes all of the variables
in the data set.
```{r}
library(ISLR2)
pairs(Auto)  # scatterplot matrix of all variables in the data set
```

**(b)** Compute the matrix of correlations between the variables using
the function cor(). You will need to exclude the name variable,
which is qualitative.
```{r}
Auto1 <- Auto
Auto1$name <- NULL  # drop the qualitative 'name' variable
cor(Auto1)
```

**(c)** Use the lm() function to perform a multiple linear regression
with mpg as the response and all other variables except name as
the predictors. Use the summary() function to print the results.
```{r}
AutoLR <- lm(mpg ~ . - name, data = Auto)  # mpg on all predictors except name
summary(AutoLR)
```

*Comment on the output. For instance:*

i. Is there a relationship between the predictors and the response?

- The F-statistic of 252.4 has a p-value far below 0.05 (< 2.2e-16), so we can reject the null hypothesis that all coefficients are zero. This means there is a relationship between at least one predictor variable and the response variable. (The p-value can also be extracted directly from the fit; see the sketch after this list.)

ii. Which predictors appear to have a statistically significant
relationship to the response?

- Based on their p-values (< 0.05): displacement, weight, year, and origin have a statistically significant relationship with mpg.

iii. What does the coefficient for the year variable suggest?

- The year coefficient is positive and statistically significant. This means that, holding all other variables constant, mpg increases by about 0.75 per model year on average.
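
The overall F-test p-value is not stored directly in the summary object, but it can be recomputed from the F statistic (a sketch):
```{r}
fstat <- summary(AutoLR)$fstatistic           # c(value, numdf, dendf)
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)
```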

**(d)** Use the plot() function to produce diagnostic plots of the linear
regression fit. 
```{r}
par(mfrow=c(2,2))
plot(AutoLR)
```
*Comment on any problems you see with the fit.*

Do the residual plots suggest any unusually large outliers?

- Based on the plot of studentized residuals below, there are several outliers, since some observations have values greater than 3.

```{r}
plot(predict(AutoLR), rstudent(AutoLR))  # studentized residuals vs fitted values
abline(h = c(-3, 3), lty = 2)            # conventional |rstudent| > 3 cutoff
```
Does the leverage plot identify any observations with unusually high leverage?

- Yes. In the Residuals vs Leverage plot, observation 14 stands out with unusually high leverage, although its residual is not large in magnitude. The curvature in the Residuals vs Fitted plot also suggests that a purely linear model is not the best fit.



**(e)** Use the * and : symbols to fit linear regression models with
interaction effects. 
```{r}
Autolm <- lm(mpg ~ cylinders * displacement + displacement * weight, data = Auto)
summary(Autolm)
```
Do any interactions appear to be statistically
significant?

- The interaction between displacement and weight is statistically significant (p ≈ 2.6e-05). In contrast, the interaction between cylinders and displacement is not (p ≈ 0.38).
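
For reference, `a * b` is shorthand for `a + b + a:b`, so the same model can be spelled out with `:` alone (a sketch; `Autolm2` is a name introduced here for illustration):
```{r}
Autolm2 <- lm(mpg ~ cylinders + displacement + weight +
                cylinders:displacement + displacement:weight, data = Auto)
all.equal(fitted(Autolm), fitted(Autolm2))  # same model, so TRUE
```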



**(f)** Try a few different transformations of the variables, such as
log(X), √X, X^2. Comment on your findings.
```{r}
AutoLR4 <- lm(mpg ~ weight + I(weight^2), data = Auto)  # quadratic in weight
summary(AutoLR4)
```
```{r}
plot(AutoLR4)
```
- The Normal Q-Q plot shows a non-normal distribution of the error terms in the tails.
- In the ‘Residuals vs Leverage’ plot, no points fall outside the bounds of the Cook’s distance contours. Therefore, there are no influential points that would significantly impact the slope coefficients if removed.
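
For comparison, a sketch of two other transformations the question suggests, log(weight) and √weight (`AutoLR5`/`AutoLR6` are names introduced here):
```{r}
AutoLR5 <- lm(mpg ~ log(weight), data = Auto)
AutoLR6 <- lm(mpg ~ sqrt(weight), data = Auto)
# Compare explanatory power across the three transformations
c(quadratic = summary(AutoLR4)$r.squared,
  log       = summary(AutoLR5)$r.squared,
  sqrt      = summary(AutoLR6)$r.squared)
```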


---


**Question #10**. This question should be answered using the Carseats data set.
```{r}
data("Carseats")
summary(Carseats)
```

**(a)** Fit a multiple regression model to predict Sales using Price,
Urban, and US.
```{r}
Carseatslm <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(Carseatslm)
```

**(b)** Provide an interpretation of each coefficient in the model. Be
careful—some of the variables in the model are qualitative!

- Price: there is strong evidence of a relationship between price and sales, with the negative coefficient (−0.054) indicating that, holding the other predictors fixed, each $1 increase in price is associated with a decrease of about 54 units in sales (Sales is measured in thousands of units).

- UrbanYes: there is not enough evidence (p = 0.936) to suggest a link between the location of the store and the number of sales.

- USYes: there is a positive relationship between whether a store is located in the US and its sales, with approximately 1,201 more units sold, on average, for US-based stores.


**(c)** Write out the model in equation form, being careful to handle
the qualitative variables properly.

- Sales = 13.04 − 0.05 × Price − 0.02 × UrbanYes + 1.20 × USYes, where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
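
As a sanity check, a sketch computing a fitted value for a hypothetical urban US store with Price = 100 (the values here are assumptions for illustration):
```{r}
newstore <- data.frame(Price = 100, Urban = "Yes", US = "Yes")
predict(Carseatslm, newstore)  # = 13.04 - 0.054*100 - 0.022 + 1.20, about 8.78
```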


**(d)** For which of the predictors can you reject the null hypothesis
$H_0 : \beta_j = 0$?

- The null hypothesis can be rejected for Price and USYes, whose p-values are far below 0.05; it cannot be rejected for UrbanYes.


**(e)** On the basis of your response to the previous question, fit a
smaller model that only uses the predictors for which there is
evidence of association with the outcome.
```{r}
Carseatslm2 <- lm(Sales ~ Price + US, data = Carseats)
summary(Carseatslm2)
```

**(f)** How well do the models in (a) and (e) fit the data?

- Based on their RSE and R² values, both models fit the data similarly, explaining only about 24% of the variance in Sales. Model (e) fits marginally better (slightly lower RSE and higher adjusted R²), since dropping the insignificant Urban predictor costs essentially no explanatory power; a formal comparison is sketched below.
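
Since model (e) is nested in model (a), an F-test of the two fits makes the "no real difference" claim precise (a sketch):
```{r}
anova(Carseatslm2, Carseatslm)  # F-test comparing the nested models
```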

**(g)** Using the model from (e), obtain 95 % confidence intervals for
the coefficient(s).
```{r}
confint(Carseatslm2)
```

**(h)** Is there evidence of outliers or high leverage observations in the
model from (e)?
```{r}
plot(predict(Carseatslm2), rstudent(Carseatslm2))
```
- The studentized residuals all fall within the range −3 to 3, so there is no evidence of outliers in the data.

```{r}
par(mfrow = c(2, 2))
plot(Carseatslm2)
```
- In the leverage plot, a few observations clearly exceed the average leverage (p + 1)/n = 3/400 ≈ 0.0075, suggesting that the corresponding points have unusually high leverage.
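
A sketch identifying those points numerically via the hat values (3× the average leverage is a common rule of thumb, assumed here):
```{r}
hv <- hatvalues(Carseatslm2)
which(hv > 3 * mean(hv))  # mean(hv) equals (p + 1)/n exactly
```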


---

**Question #12**. This problem involves simple linear regression without an intercept.

**(a)** Recall that the coefficient estimate $\hat{\beta}$ for the linear regression of
Y onto X without an intercept is given by (3.38). Under what
circumstance is the coefficient estimate for the regression of X
onto Y the same as the coefficient estimate for the regression of
Y onto X?

- The two coefficient estimates are the same when $\sum_j x_j^2 = \sum_j y_j^2$: from (3.38), the regression of Y onto X gives $\hat{\beta} = \sum_j x_j y_j / \sum_j x_j^2$, while the regression of X onto Y gives $\sum_j x_j y_j / \sum_j y_j^2$, and the numerators are identical.

**(b)** Generate an example in R with n = 100 observations in which
the coefficient estimate for the regression of X onto Y is different
from the coefficient estimate for the regression of Y onto X.
```{r}
set.seed(1)                # for reproducibility
x <- rnorm(100)
y <- rbinom(100, 2, 0.3)   # unrelated to x, so sum(x^2) != sum(y^2)
n100 <- lm(y ~ x + 0)      # regression of Y onto X, no intercept
summary(n100)
```
```{r}
n100.2 <- lm(x ~ y + 0)    # regression of X onto Y, no intercept
summary(n100.2)
```
- Based on the results, the coefficient estimates differ between the two regressions, as expected since $\sum_j x_j^2 \neq \sum_j y_j^2$.


**(c)** Generate an example in R with n = 100 observations in which
the coefficient estimate for the regression of X onto Y is the
same as the coefficient estimate for the regression of Y onto X.
```{r}
x <- 1:100
y <- 100:1               # a permutation of x, so sum(x^2) == sum(y^2)
n200 <- lm(y ~ x + 0)    # regression of Y onto X, no intercept
summary(n200)
```

```{r}
n200.2 <- lm(x ~ y + 0)  # regression of X onto Y, no intercept
summary(n200.2)
```
- Unlike in (b), the two coefficient estimates here are identical, because y is a permutation of x and therefore $\sum_j x_j^2 = \sum_j y_j^2$.
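
A quick numeric check of the condition from (a):
```{r}
c(sum(x^2), sum(y^2))  # equal sums of squares imply equal slope estimates
```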
