knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'ggplot2' was built under R version 4.3.2
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.1     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR)
## Warning: package 'ISLR' was built under R version 4.3.2
library(MASS)
## Warning: package 'MASS' was built under R version 4.3.1
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(modelr)
library(car)
## Warning: package 'car' was built under R version 4.3.2
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.3.1
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

(2) Carefully explain the differences between the KNN classifier and KNN regression methods. KNN regression predicts a numeric response by averaging the response values of the K nearest neighbours, whereas the KNN classifier predicts a qualitative response by assigning the class held by the majority of those neighbours. A minimal sketch of both methods follows below.
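The sketch below contrasts the two methods on simulated data; it assumes the class and FNN packages are installed (neither is loaded above), and the variable names are purely illustrative.

# KNN classification vs. KNN regression on simulated data (illustrative sketch)
library(class)   # class::knn() for classification
library(FNN)     # FNN::knn.reg() for regression

set.seed(1)
train_x     <- matrix(rnorm(200), ncol = 2)               # 100 training points, 2 predictors
train_class <- factor(ifelse(train_x[, 1] > 0, "A", "B")) # qualitative response
train_y     <- 3 * train_x[, 1] + rnorm(100)              # quantitative response
test_x      <- matrix(rnorm(20), ncol = 2)                # 10 new points

# Classifier: majority vote among the 5 nearest neighbours
knn(train = train_x, test = test_x, cl = train_class, k = 5)

# Regression: average of the response over the 5 nearest neighbours
knn.reg(train = train_x, test = test_x, y = train_y, k = 5)$pred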
(9)(a) Produce a scatterplot matrix which includes all of the variables in the data set.
getwd()
## [1] "C:/Users/monee/OneDrive/Pictures/Documents"
setwd("C:/Users/monee/OneDrive/Pictures/Documents")
auto_data <- read.csv("Auto.csv", header = TRUE)
# horsepower is read in as character because of "?" placeholders; coercing to numeric turns those into NAs
auto_data$horsepower <- as.numeric(auto_data$horsepower)
# drop the rows with missing horsepower
auto <- na.omit(auto_data)
# scatterplot matrix of the numeric variables (the character `name` column is excluded)
plot(dplyr::select(auto, -name))
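An optional ggplot2-based scatterplot matrix, assuming the GGally package is installed (it is not loaded above):

# ggplot2 scatterplot matrix of the numeric variables
library(GGally)
auto %>%
  dplyr::select(-name) %>%
  ggpairs()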

(b) Compute the matrix of correlations between the variables using the DataFrame.corr() method.
auto %>% 
  dplyr::select(-name) %>% 
  cor()
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
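If a visual summary of these correlations is preferred, the matrix can be passed to corrplot (an extra package, not loaded above); a minimal sketch:

# Optional visual summary of the correlation matrix, assuming corrplot is installed
library(corrplot)
auto %>%
  dplyr::select(-name) %>%
  cor() %>%
  corrplot(method = "color", type = "upper")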

(c) Use the sm.OLS() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summarize() function to print the results. Comment on the output. For instance: (i) Is there a relationship between the predictors and the response? Use the anova_lm() function from statsmodels to answer this question.

lm_mpg_allothers <- lm(mpg ~ .-name, data = auto)
summary(lm_mpg_allothers)
## 
## Call:
## lm(formula = mpg ~ . - name, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

There is at least one relationship between the predictors and the response: the overall F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all slope coefficients are zero.
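In R the same overall F-test can be obtained by comparing the fitted model against an intercept-only model; this reproduces the F-statistic reported by summary() above.

# Overall F-test: intercept-only model vs. the full model
null_model <- lm(mpg ~ 1, data = auto)
anova(null_model, lm_mpg_allothers)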

(ii) Which predictors appear to have a statistically significant relationship to the response? Origin, year, weight, and displacement.

(iii) What does the coefficient for the year variable suggest? The coefficient for year suggests that, holding all the other predictors constant, cars become more fuel efficient by about 0.75 miles per gallon per model year, on average.
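A quick look at the precision of that estimate, using the same fitted model:

# 95% confidence interval for the year coefficient
confint(lm_mpg_allothers, "year")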

(d) Produce some diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(lm_mpg_allothers)
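The same questions can be checked numerically, reusing the fitted model (common rules of thumb: a studentized residual larger than 3 in absolute value suggests an outlier, and leverage above two or three times the average hat value (p + 1)/n suggests a high-leverage point):

# Largest studentized residual in absolute value
max(abs(rstudent(lm_mpg_allothers)))

# Observations whose leverage exceeds twice the average hat value
hv <- hatvalues(lm_mpg_allothers)
which(hv > 2 * mean(hv))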

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm(mpg ~ year*weight*horsepower, data = auto) %>% 
  summary()
## 
## Call:
## lm(formula = mpg ~ year * weight * horsepower, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.8796 -1.6800 -0.0639  1.2556 11.5534 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -1.577e+02  3.694e+01  -4.270 2.47e-05 ***
## year                    2.886e+00  4.906e-01   5.884 8.75e-09 ***
## weight                  1.889e-02  1.286e-02   1.469  0.14273    
## horsepower              1.696e+00  4.208e-01   4.029 6.74e-05 ***
## year:weight            -3.943e-04  1.712e-04  -2.303  0.02180 *  
## year:horsepower        -2.540e-02  5.652e-03  -4.494 9.27e-06 ***
## weight:horsepower      -3.218e-04  1.112e-04  -2.893  0.00403 ** 
## year:weight:horsepower  4.972e-06  1.500e-06   3.314  0.00101 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.826 on 384 degrees of freedom
## Multiple R-squared:  0.8712, Adjusted R-squared:  0.8689 
## F-statistic: 371.2 on 7 and 384 DF,  p-value: < 2.2e-16

In this fit, all of the interaction terms (year:weight, year:horsepower, weight:horsepower, and the three-way interaction) appear statistically significant at the 5% level.
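A nested-model F-test makes the same point: comparing a main-effects-only model with the model that adds the interactions shows whether the interaction terms improve the fit as a group.

# Do the interaction terms, taken together, improve on the main effects alone?
main_only <- lm(mpg ~ year + weight + horsepower, data = auto)
with_int  <- lm(mpg ~ year * weight * horsepower, data = auto)
anova(main_only, with_int)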

(f) Try a few different transformations of the variables, such as log(X), √X, and X². Comment on your findings.
lm(mpg ~ horsepower + I(horsepower^2), data = auto) %>% plot()

lm(mpg ~ poly(horsepower, 3), data = auto) %>% summary()
## 
## Call:
## lm(formula = mpg ~ poly(horsepower, 3), data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.7039  -2.4491  -0.1519   2.2035  15.8159 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            23.446      0.221 106.105   <2e-16 ***
## poly(horsepower, 3)1 -120.138      4.375 -27.460   <2e-16 ***
## poly(horsepower, 3)2   44.090      4.375  10.078   <2e-16 ***
## poly(horsepower, 3)3   -3.949      4.375  -0.903    0.367    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.375 on 388 degrees of freedom
## Multiple R-squared:  0.6882, Adjusted R-squared:  0.6858 
## F-statistic: 285.5 on 3 and 388 DF,  p-value: < 2.2e-16

In the earlier diagnostic plots we saw that the relationship between horsepower and mpg was probably non-linear (the residuals showed a clear pattern). Adding a quadratic term for horsepower captures much of that non-linearity. However, adding a cubic term does not increase the explanatory power of the model: its coefficient is not significant.
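For completeness, a sketch of the other transformations mentioned in the question (log(X) and √X), compared informally through adjusted R-squared and AIC:

# Compare a few single-predictor transformations of horsepower
fit_log  <- lm(mpg ~ log(horsepower), data = auto)
fit_sqrt <- lm(mpg ~ sqrt(horsepower), data = auto)
fit_quad <- lm(mpg ~ horsepower + I(horsepower^2), data = auto)
sapply(list(log = fit_log, sqrt = fit_sqrt, quadratic = fit_quad),
       function(m) summary(m)$adj.r.squared)
AIC(fit_log, fit_sqrt, fit_quad)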

(10) This question should be answered using the Carseats data set.
carseat_data<-read.csv("Carseats.csv",header=TRUE)
carseat<-na.omit(carseat_data)
carseat$ShelveLoc<-as.factor(carseat$ShelveLoc)
carseat$Urban<-as.factor(carseat$Urban)
carseat$US<-as.factor(carseat$US)
str(carseat)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : int  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : int  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: int  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : int  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : int  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : int  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : int  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.
lm_carseats <- lm(Sales ~ Price + Urban + US, data = carseat)
summary(lm_carseats)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

The intercept estimate tells us that when all the other variables are zero (Price equal to zero, a store in a rural location outside the US) we can expect sales of about 13,043 units (Sales is recorded in thousands of units), assuming that the effects are linear and additive.
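To make the interpretation concrete, predicted sales at a fixed price can be computed for the four Urban/US combinations; a small sketch reusing the fitted model (the Price value of 120 is just an illustrative choice):

# Predicted sales (in thousands of units) at Price = 120, by Urban/US status
new_stores <- expand.grid(Price = 120,
                          Urban = levels(carseat$Urban),
                          US    = levels(carseat$US))
cbind(new_stores, predicted_sales = predict(lm_carseats, newdata = new_stores))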

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative! The Price coefficient indicates that a one-unit increase in price is associated with a decrease in sales of about 54 units, holding the other predictors fixed. The UrbanYes coefficient is not statistically significant, so we cannot conclude that sales differ between urban and rural locations (holding the other variables constant). Finally, the USYes coefficient is significant, and its value implies that a US store sells about 1,200 more units, on average, than a non-US store (a sketch of how R encodes these dummy variables appears after this list).

(c) Write out the model in equation form, being careful to handle the qualitative variables properly. Salesᵢ = β0 + β1·Priceᵢ + β2·I(Urbanᵢ = Yes) + β3·I(USᵢ = Yes) + εᵢ, where I(·) equals 1 when its condition holds and 0 otherwise, so β2 is added only for urban stores and β3 only for US stores.

(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0? Price and US (the intercept is also significantly different from zero); for Urban we cannot reject the null, since its p-value is about 0.94.
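As referenced in (b), a quick way to see how R encodes the two qualitative predictors; "No" is the baseline level in each case, so UrbanYes and USYes measure shifts relative to a rural, non-US store.

# Dummy coding used by lm() for the factor predictors
contrasts(carseat$Urban)
contrasts(carseat$US)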

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm_carseats_simp <- lm(Sales ~ Price + US, data = carseat)
summary(lm_carseats_simp)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = carseat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data? Both models fit the data about equally well (R-squared is roughly 0.24 in each case), but the reduced model has a slightly higher adjusted R-squared because we removed a variable with no explanatory power. Adjusted R-squared penalizes each additional variable, so including predictors that are barely related to the response does not push it upwards, as happens with the plain R-squared.
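A formal comparison of the two fits, reusing the model objects above:

# Partial F-test for dropping Urban, plus AIC for both models
anova(lm_carseats_simp, lm_carseats)
AIC(lm_carseats, lm_carseats_simp)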

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(lm_carseats_simp)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high-leverage observations in the model from (e)?

par(mfrow = c(2,2))
plot(lm_carseats_simp)

In the diagnostic plots we see little evidence of outlier observations. There seems to be an observation with relatively high leverage, but it’s not problematic since it has a very low residual.
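The car package loaded at the top can condense the same information into a single plot of studentized residuals against hat values:

# Leverage vs. studentized residuals, with circle areas proportional to Cook's distance
car::influencePlot(lm_carseats_simp)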

(12) This problem involves simple linear regression without an intercept.
(a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38), βˆ = (Σ xᵢyᵢ)/(Σ xᵢ²). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? The numerator Σ xᵢyᵢ in (3.38) is the same whether the regression is y ~ x or x ~ y, but the denominator differs: it is Σ xᵢ² for y ~ x and Σ yᵢ² for x ~ y. For the two coefficient estimates to be equal we therefore need Σ xᵢ² = Σ yᵢ². A quick numeric check of (3.38) is sketched below.
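A small numeric check of (3.38) on simulated data (the variable names below are only for illustration):

# The no-intercept slope of y on x equals sum(x*y)/sum(x^2)
set.seed(1)
x <- rnorm(50)
y <- 2 * x + rnorm(50)
c(formula_338 = sum(x * y) / sum(x^2),
  lm_fit      = unname(coef(lm(y ~ x + 0))))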

(b) Generate an example in Python with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X. (The example below is worked in R rather than Python.)

set.seed(1989)
x_1 <- rnorm(100)
y_1 <- rnorm(100) + 2*x_1

lm(x_1 ~ y_1 + 0) %>% summary()
## 
## Call:
## lm(formula = x_1 ~ y_1 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.94313 -0.25160  0.00501  0.26080  1.28222 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## y_1  0.41411    0.01899   21.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4325 on 99 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.826 
## F-statistic: 475.7 on 1 and 99 DF,  p-value: < 2.2e-16
lm(y_1 ~ x_1 + 0) %>% summary()
## 
## Call:
## lm(formula = y_1 ~ x_1 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.69131 -0.63116 -0.05623  0.52089  2.72321 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)    
## x_1  1.99887    0.09164   21.81   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9503 on 99 degrees of freedom
## Multiple R-squared:  0.8277, Adjusted R-squared:  0.826 
## F-statistic: 475.7 on 1 and 99 DF,  p-value: < 2.2e-16

(c) Generate an example in Python with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

# y_2 is a random permutation of x_2, so sum(x_2^2) equals sum(y_2^2)
x_2 <- rnorm(100)
y_2 <- sample(x_2, 100, replace = FALSE)

lm(x_2 ~ y_2 + 0) %>% summary()
## 
## Call:
## lm(formula = x_2 ~ y_2 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.15853 -0.68461 -0.07436  0.63679  2.64401 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)  
## y_2  0.17061    0.09903   1.723    0.088 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.965 on 99 degrees of freedom
## Multiple R-squared:  0.02911,    Adjusted R-squared:  0.0193 
## F-statistic: 2.968 on 1 and 99 DF,  p-value: 0.08804
lm(y_2 ~ x_2 + 0) %>% summary()
## 
## Call:
## lm(formula = y_2 ~ x_2 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -3.12868 -0.65179 -0.07524  0.54794  2.83044 
## 
## Coefficients:
##     Estimate Std. Error t value Pr(>|t|)  
## x_2  0.17061    0.09903   1.723    0.088 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.965 on 99 degrees of freedom
## Multiple R-squared:  0.02911,    Adjusted R-squared:  0.0193 
## F-statistic: 2.968 on 1 and 99 DF,  p-value: 0.08804
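A quick confirmation that the permutation in (c) leaves the denominator of (3.38) unchanged, which is why the two no-intercept slopes coincide:

# The sums of squares are identical, so the two regressions share numerator and denominator
all.equal(sum(x_2^2), sum(y_2^2))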