library(ISLR)
library(MASS)
  1. The difference between the KNN classifier and KNN regression is that the classifier is used when the response variable is categorical (qualitative), while KNN regression is used when the response is numerical (quantitative). The KNN classifier predicts a class label for Y (for example, 0 or 1) by taking the most common class among the K nearest neighbors, whereas KNN regression predicts a quantitative value of Y by averaging the responses of the K nearest neighbors, so its predictions can be continuous.
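As a quick illustration (not part of the exercise output), the sketch below contrasts the two methods on simulated data. It assumes the class and FNN packages are installed; x_train, x_test, y_class, and y_num are made-up names for this example.

library(class)   # knn() for classification
library(FNN)     # knn.reg() for regression (assumed installed)

set.seed(1)
x_train <- matrix(rnorm(100), ncol = 1)
x_test  <- matrix(rnorm(10),  ncol = 1)

# Qualitative response: the classifier returns a class label ("No"/"Yes").
y_class <- factor(ifelse(x_train[, 1] > 0, "Yes", "No"))
knn(train = x_train, test = x_test, cl = y_class, k = 5)

# Quantitative response: KNN regression returns the average of the K
# neighbors' responses, which can take any continuous value.
y_num <- 2 * x_train[, 1] + rnorm(100)
knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred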

#9.

# a.
plot(Auto)

# b.
auto_b = Auto
auto_b$name = NULL
cor(auto_b)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
#c.
auto_c = lm(mpg~.-name, data = Auto)
summary(auto_c)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. There is a relationship between the predictors and the response: the p-value of the F-statistic (< 2.2e-16) is far below the 0.05 threshold, so we reject the null hypothesis that all of the coefficients are zero.
  2. Displacement, weight, year, and origin each have a statistically significant relationship with mpg, since their p-values are below the 0.05 threshold; cylinders, horsepower, and acceleration do not.
  3. The coefficient for the year variable suggests that, holding the other predictors fixed, mpg increases by about 0.75 per model year (a quick check of this interpretation follows below).
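A minimal check of the year interpretation using the fit above: predict mpg at the predictor means, then again with year increased by one; the difference should equal the year coefficient (~0.75). The name base_car is just an illustrative placeholder.

base_car <- data.frame(cylinders = mean(Auto$cylinders),
                       displacement = mean(Auto$displacement),
                       horsepower = mean(Auto$horsepower),
                       weight = mean(Auto$weight),
                       acceleration = mean(Auto$acceleration),
                       year = mean(Auto$year),
                       origin = mean(Auto$origin))
# Difference in predictions when only year changes by one unit
predict(auto_c, transform(base_car, year = year + 1)) - predict(auto_c, base_car)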
#d
par(mfrow = c(2,2))
plot(auto_c)

The diagnostic plots provide valuable insight into the regression model. The ‘Residuals vs Fitted’ plot reveals a U-shaped pattern, suggesting non-linearity in the data. The Normal Q-Q plot shows deviations in the right tail, indicating the residuals may depart from normality. On the Scale-Location plot (the square root of the standardized residuals), most observations fall between 0 and 2; this plot speaks to the constant-variance assumption rather than normality, and it suggests there are few extreme residuals. Finally, the ‘Residuals vs Leverage’ plot shows no points beyond the Cook’s Distance contours, indicating the absence of influential observations that would unduly affect the coefficient estimates. Together, these diagnostics give a comprehensive picture of the model’s performance and potential areas for improvement.
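A numeric follow-up to the ‘Residuals vs Leverage’ reading: the sketch below lists observations whose Cook’s distance exceeds the common 4/n rule of thumb (this threshold is a convention, not from the text).

cd <- cooks.distance(auto_c)
which(cd > 4 / nrow(Auto))   # flagged by the 4/n rule of thumb
max(cd)                      # compare with the 0.5 and 1 contours drawn by plot()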

#e.
auto_i1 = lm(mpg~.:., auto_b)
summary(auto_i1)
## 
## Call:
## lm(formula = mpg ~ .:., data = auto_b)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
auto_i2 = lm(mpg~.*., auto_b)
summary(auto_i2)
## 
## Call:
## lm(formula = mpg ~ . * ., data = auto_b)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

In my first interaction model, displacement:year, acceleration:year, and acceleration:origin are statistically significant at the 0.05 level. The second interaction model gives identical results: in R, the formulas mpg ~ .:. and mpg ~ .*. expand to the same set of main effects and two-way interactions, so displacement:year, acceleration:year, and acceleration:origin are again the significant interaction terms, just as in the first model (see the check below).
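A quick sanity check that the two formulas fit the same model: the expanded terms and the coefficient estimates are identical.

all.equal(coef(auto_i1), coef(auto_i2))                                      # TRUE
identical(attr(terms(auto_i1), "term.labels"), attr(terms(auto_i2), "term.labels"))  # TRUE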

#f.
auto_f = lm(mpg~weight+I((weight)^2), Auto)
summary(auto_f)
## 
## Call:
## lm(formula = mpg ~ weight + I((weight)^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.226e+01  2.993e+00  20.800  < 2e-16 ***
## weight        -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
## I((weight)^2)  1.697e-06  3.059e-07   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16
plot(auto_f)

The Normal Q-Q plot indicates a departure from normality in the distribution of the error terms. Additionally, in the ‘Residuals vs Leverage’ plot, no points fall beyond the Cook’s Distance contours, which implies there are no influential observations adversely affecting the coefficient estimates.
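As a short optional comparison (not required by the exercise), the nested-model F test below checks whether the squared weight term adds explanatory power over a purely linear fit; it is equivalent to the t test on the I((weight)^2) coefficient reported above (p = 5.43e-08).

auto_lin <- lm(mpg ~ weight, data = Auto)  # linear-only baseline
anova(auto_lin, auto_f)                    # F test for the quadratic term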

#10.

#a.
sales<-lm(Sales~Price+Urban+US,data=Carseats)
summary(sales)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
#b.
str(sales)
## List of 13
##  $ coefficients : Named num [1:4] 13.0435 -0.0545 -0.0219 1.2006
##   ..- attr(*, "names")= chr [1:4] "(Intercept)" "Price" "UrbanYes" "USYes"
##  $ residuals    : Named num [1:400] 1.813 1.518 0.195 -1.54 -1.901 ...
##   ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
##  $ effects      : Named num [1:400] -149.926 -25.1 -0.311 -11.459 -2.071 ...
##   ..- attr(*, "names")= chr [1:400] "(Intercept)" "Price" "UrbanYes" "USYes" ...
##  $ rank         : int 4
##  $ fitted.values: Named num [1:400] 7.69 9.7 9.87 8.94 6.05 ...
##   ..- attr(*, "names")= chr [1:400] "1" "2" "3" "4" ...
##  $ assign       : int [1:4] 0 1 2 3
##  $ qr           :List of 5
##   ..$ qr   : num [1:400, 1:4] -20 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 0.05 ...
##   .. ..- attr(*, "dimnames")=List of 2
##   .. .. ..$ : chr [1:400] "1" "2" "3" "4" ...
##   .. .. ..$ : chr [1:4] "(Intercept)" "Price" "UrbanYes" "USYes"
##   .. ..- attr(*, "assign")= int [1:4] 0 1 2 3
##   .. ..- attr(*, "contrasts")=List of 2
##   .. .. ..$ Urban: chr "contr.treatment"
##   .. .. ..$ US   : chr "contr.treatment"
##   ..$ qraux: num [1:4] 1.05 1.07 1.03 1.03
##   ..$ pivot: int [1:4] 1 2 3 4
##   ..$ tol  : num 1e-07
##   ..$ rank : int 4
##   ..- attr(*, "class")= chr "qr"
##  $ df.residual  : int 396
##  $ contrasts    :List of 2
##   ..$ Urban: chr "contr.treatment"
##   ..$ US   : chr "contr.treatment"
##  $ xlevels      :List of 2
##   ..$ Urban: chr [1:2] "No" "Yes"
##   ..$ US   : chr [1:2] "No" "Yes"
##  $ call         : language lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##  $ terms        :Classes 'terms', 'formula'  language Sales ~ Price + Urban + US
##   .. ..- attr(*, "variables")= language list(Sales, Price, Urban, US)
##   .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. ..$ : chr [1:4] "Sales" "Price" "Urban" "US"
##   .. .. .. ..$ : chr [1:3] "Price" "Urban" "US"
##   .. ..- attr(*, "term.labels")= chr [1:3] "Price" "Urban" "US"
##   .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. ..- attr(*, "intercept")= int 1
##   .. ..- attr(*, "response")= int 1
##   .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. ..- attr(*, "predvars")= language list(Sales, Price, Urban, US)
##   .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "factor" "factor"
##   .. .. ..- attr(*, "names")= chr [1:4] "Sales" "Price" "Urban" "US"
##  $ model        :'data.frame':   400 obs. of  4 variables:
##   ..$ Sales: num [1:400] 9.5 11.22 10.06 7.4 4.15 ...
##   ..$ Price: num [1:400] 120 83 80 97 128 72 108 120 124 124 ...
##   ..$ Urban: Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##   ..$ US   : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
##   ..- attr(*, "terms")=Classes 'terms', 'formula'  language Sales ~ Price + Urban + US
##   .. .. ..- attr(*, "variables")= language list(Sales, Price, Urban, US)
##   .. .. ..- attr(*, "factors")= int [1:4, 1:3] 0 1 0 0 0 0 1 0 0 0 ...
##   .. .. .. ..- attr(*, "dimnames")=List of 2
##   .. .. .. .. ..$ : chr [1:4] "Sales" "Price" "Urban" "US"
##   .. .. .. .. ..$ : chr [1:3] "Price" "Urban" "US"
##   .. .. ..- attr(*, "term.labels")= chr [1:3] "Price" "Urban" "US"
##   .. .. ..- attr(*, "order")= int [1:3] 1 1 1
##   .. .. ..- attr(*, "intercept")= int 1
##   .. .. ..- attr(*, "response")= int 1
##   .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
##   .. .. ..- attr(*, "predvars")= language list(Sales, Price, Urban, US)
##   .. .. ..- attr(*, "dataClasses")= Named chr [1:4] "numeric" "numeric" "factor" "factor"
##   .. .. .. ..- attr(*, "names")= chr [1:4] "Sales" "Price" "Urban" "US"
##  - attr(*, "class")= chr "lm"

In this regression analysis, the variable “Price” is a continuous predictor; its coefficient of -0.054459 means that a one-dollar increase in price is associated with a decrease in Sales of about 0.054 units, holding the other variables constant. Since Sales is recorded in thousands of units, this is roughly 54 fewer car seats sold. The qualitative variable “Urban” is coded as a dummy variable (UrbanYes); its coefficient of -0.021916 would correspond to about 22 fewer seats sold at urban stores, but its p-value (0.936) gives no significant evidence of a relationship between sales and whether the store is in an urban area. Lastly, the dummy variable for “US” (USYes) indicates that stores in the U.S. sell about 1.2 thousand (roughly 1,200) more seats than non-U.S. stores, assuming the other variables remain constant. These interpretations take into account whether each predictor is continuous or categorical.

#c. Sales = 13.043469 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes, where UrbanYes = 1 if the store is in an urban area (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
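A small check of how R encodes the qualitative predictors in the fit above: Urban and US each become a single 0/1 dummy variable (UrbanYes, USYes), with “No” as the baseline level.

contrasts(Carseats$Urban)   # treatment coding: No = 0, Yes = 1
contrasts(Carseats$US)
head(model.matrix(sales))   # design matrix with the dummy columns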

#d. The null hypothesis can be rejected for Price and USYes, because both have p-values well below the 0.05 threshold; it cannot be rejected for UrbanYes (p = 0.936).

#e.
sales_small<-lm(Sales~Price+US,data=Carseats)
summary(sales_small)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
#f.
anova(sales, sales_small)
## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    396 2420.8                           
## 2    397 2420.9 -1  -0.03979 0.0065 0.9357

Excluding the Urban variable yields a marginal improvement in the adjusted R-squared (0.2354 vs 0.2335) and a slight reduction in the residual standard error (2.469 vs 2.472). However, the ANOVA comparison of the two models gives F = 0.0065 with p = 0.9357, so we fail to reject the null hypothesis that the Urban coefficient is zero; including or excluding the Urban variable does not significantly change the overall model fit.

#g.
confint(sales_small)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Using confint(), the 95% confidence interval for the “Price” coefficient is (-0.0648, -0.0442). Under repeated sampling, 95% of intervals constructed this way would contain the true parameter, so we can be reasonably confident that the true effect of price on sales lies within this range (a by-hand check appears below).
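A by-hand check of the Price interval reported by confint(): the estimate plus or minus the t quantile times the standard error from summary().

est <- coef(summary(sales_small))["Price", ]
est["Estimate"] + c(-1, 1) * qt(0.975, df = sales_small$df.residual) * est["Std. Error"]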

#h.
plot(predict(sales_small),rstudent(sales_small))

mod_lev=hat(model.matrix(sales_small))
plot(mod_lev)
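A numeric follow-up to the leverage plot: the average leverage is (p + 1)/n, and points well above it deserve a closer look (the 3x-average cutoff used here is a rule of thumb, not from the text).

p <- length(coef(sales_small)) - 1
n <- nrow(Carseats)
mean(mod_lev)                      # equals (p + 1)/n = 3/400
which(mod_lev > 3 * (p + 1) / n)   # observations with unusually high leverage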

#12.

#a. From the formula for the coefficient estimate without an intercept, the regression of Y onto X gives sum(x_i * y_i) / sum(x_i^2), while the regression of X onto Y gives sum(x_i * y_i) / sum(y_i^2); the two estimates are therefore equal exactly when sum(x_i^2) equals sum(y_i^2).

#b.
x=rnorm(100)
y=rbinom(100,2,0.3)
mod_1 = lm(y~x+0)
summary(mod_1)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.16041 -0.01393  0.48947  0.99128  2.12060 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x -0.07123    0.08407  -0.847    0.399
## 
## Residual standard error: 0.8258 on 99 degrees of freedom
## Multiple R-squared:  0.007198,   Adjusted R-squared:  -0.00283 
## F-statistic: 0.7178 on 1 and 99 DF,  p-value: 0.3989
mod_2 =lm(x~y+0)
summary(mod_2)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.2522 -0.7099 -0.1659  0.3919  2.1965 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  -0.1011     0.1193  -0.847    0.399
## 
## Residual standard error: 0.9837 on 99 degrees of freedom
## Multiple R-squared:  0.007198,   Adjusted R-squared:  -0.00283 
## F-statistic: 0.7178 on 1 and 99 DF,  p-value: 0.3989

The coefficient estimates for these two regressions differ (-0.07123 vs -0.1011), because sum(x^2) and sum(y^2) are not equal for this simulated data, so the condition from part (a) does not hold (a quick check is sketched below).
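A quick check of the part (a) condition for this data (since no seed was set, the exact numbers vary from run to run, but the two sums will almost surely differ):

c(sum(x^2), sum(y^2))    # generally unequal for this x and y
sum(x * y) / sum(x^2)    # matches coef(mod_1)
sum(x * y) / sum(y^2)    # matches coef(mod_2)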

#c.
x=1:100
y=100:1
mod_3 = lm(y~x+0)
summary(mod_3)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
mod_4 =lm(x~y+0)
summary(mod_4)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

Now the two regressions produce the same coefficient estimate (0.5075): y = 100:1 is simply a permutation of x = 1:100, so sum(x^2) equals sum(y^2) and the condition from part (a) holds (verified below).
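A one-line verification of the part (a) condition for this deterministic data:

sum(x^2) == sum(y^2)     # TRUE, since y is a permutation of x
sum(x * y) / sum(x^2)    # 0.5075, matching both mod_3 and mod_4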