knitr::opts_chunk$set(warning = FALSE, message = FALSE)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ ggplot2 3.4.4 ✔ tibble 3.2.1
## ✔ lubridate 1.9.2 ✔ tidyr 1.3.0
## ✔ purrr 1.0.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ISLR)
library(MASS)
##
## Attaching package: 'MASS'
##
## The following object is masked from 'package:dplyr':
##
## select
library(modelr)
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
(2)Carefully explain the differences between the KNN classifier and KNN regression methods. Both methods find the K training observations closest to a prediction point x0, but they differ in how they form the prediction: KNN regression predicts a quantitative response by averaging the response values of the K nearest neighbours, whereas the KNN classifier predicts a qualitative response by assigning x0 to the class that is most common among its K nearest neighbours (a majority vote). A small illustrative sketch follows.
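To make the distinction concrete, here is a minimal base-R sketch (not part of the original assignment; the toy data and variable names are made up for illustration) that predicts a single test point both ways from the same set of nearest neighbours.
set.seed(1)
train_x <- runif(20)                                    # single predictor
train_y_num <- 3 * train_x + rnorm(20, sd = 0.2)        # quantitative response (for regression)
train_y_cls <- ifelse(train_x > 0.5, "high", "low")     # qualitative response (for classification)
x0 <- 0.62                                              # point at which to predict
k <- 5
nearest <- order(abs(train_x - x0))[1:k]                # indices of the K nearest neighbours
mean(train_y_num[nearest])                              # KNN regression: average the neighbours' values
names(which.max(table(train_y_cls[nearest])))           # KNN classifier: majority vote among neighbours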
(9) (a)Produce a scatterplot matrix which includes all of the variables in the data set.
getwd()
## [1] "C:/Users/monee/OneDrive/Pictures/Documents"
setwd("C:/Users/monee/OneDrive/Pictures/Documents")
auto_data<-read.csv("Auto.csv",header=TRUE)
auto_data$horsepower <-as.numeric(auto_data$horsepower)
auto<-na.omit(auto_data)
plot(auto)
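Since plot() on the full data frame also tries to include the qualitative name column, dropping it first may give a cleaner scatterplot matrix; a sketch, equivalent to the plot above for the numeric variables:
auto %>%
  dplyr::select(-name) %>%
  pairs()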
(b)Compute the matrix of correlations between the variables, excluding the qualitative name variable.
auto %>%
  dplyr::select(-name) %>%
  cor()
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
(c)Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: (i) Is there a relationship between the predictors and the response? Use the overall F-statistic reported in the summary to answer this question.
lm_mpg_allothers <- lm(mpg ~ .-name, data = auto)
summary(lm_mpg_allothers)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic is 252.4 with a p-value below 2.2e-16, so we reject the null hypothesis that all slope coefficients are zero: there is a relationship between at least one of the predictors and the response.
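The overall F-test can also be pulled out of the fitted model directly; a quick sketch:
fstat <- summary(lm_mpg_allothers)$fstatistic
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)   # p-value of the overall F-test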
(ii)Which predictors appear to have a statistically significant relationship to the response? Based on the p-values: origin, year, weight, and displacement.
(iii)What does the coefficient for the year variable suggest? The coefficient for year suggests that, holding all the other predictors constant, cars improve their fuel efficiency by about 0.75 miles per gallon per model year, on average.
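One way to see the year effect in action is to predict mpg for two otherwise identical cars whose model years differ by one; the hypothetical predictor values below are chosen only for illustration, and the difference in predictions equals the year coefficient.
newcars <- data.frame(cylinders = 4, displacement = 120, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
diff(predict(lm_mpg_allothers, newdata = newcars))   # about 0.75 mpg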
(d)Produce some diagnostic plots of the linear regression fit as described in the lab. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow = c(2, 2))
plot(lm_mpg_allothers)
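The plots can be supplemented numerically; a small sketch that flags the largest studentized residuals and the highest-leverage observation:
head(sort(abs(rstudent(lm_mpg_allothers)), decreasing = TRUE))   # largest studentized residuals
which.max(hatvalues(lm_mpg_allothers))                           # observation with the highest leverage
max(hatvalues(lm_mpg_allothers))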
(e)Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm(mpg ~ year*weight*horsepower, data = auto) %>%
  summary()
##
## Call:
## lm(formula = mpg ~ year * weight * horsepower, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.8796 -1.6800 -0.0639 1.2556 11.5534
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.577e+02 3.694e+01 -4.270 2.47e-05 ***
## year 2.886e+00 4.906e-01 5.884 8.75e-09 ***
## weight 1.889e-02 1.286e-02 1.469 0.14273
## horsepower 1.696e+00 4.208e-01 4.029 6.74e-05 ***
## year:weight -3.943e-04 1.712e-04 -2.303 0.02180 *
## year:horsepower -2.540e-02 5.652e-03 -4.494 9.27e-06 ***
## weight:horsepower -3.218e-04 1.112e-04 -2.893 0.00403 **
## year:weight:horsepower 4.972e-06 1.500e-06 3.314 0.00101 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.826 on 384 degrees of freedom
## Multiple R-squared: 0.8712, Adjusted R-squared: 0.8689
## F-statistic: 371.2 on 7 and 384 DF, p-value: < 2.2e-16
All of the interaction terms in this model are statistically significant at the 5% level: year:weight, year:horsepower, weight:horsepower, and the three-way interaction year:weight:horsepower.
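The significance of an individual interaction can also be checked with a partial F-test comparing nested models; a sketch using just year and weight:
lm_add <- lm(mpg ~ year + weight, data = auto)
lm_int <- lm(mpg ~ year * weight, data = auto)
anova(lm_add, lm_int)   # partial F-test for the year:weight interaction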
(f)Try a few different transformations of the variables, such as log(X), sqrt(X), and X^2. Comment on your findings.
lm(mpg ~ horsepower + I(horsepower^2), data = auto) %>% plot()
lm(mpg ~ poly(horsepower, 3), data = auto) %>% summary()
##
## Call:
## lm(formula = mpg ~ poly(horsepower, 3), data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.7039 -2.4491 -0.1519 2.2035 15.8159
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.446 0.221 106.105 <2e-16 ***
## poly(horsepower, 3)1 -120.138 4.375 -27.460 <2e-16 ***
## poly(horsepower, 3)2 44.090 4.375 10.078 <2e-16 ***
## poly(horsepower, 3)3 -3.949 4.375 -0.903 0.367
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.375 on 388 degrees of freedom
## Multiple R-squared: 0.6882, Adjusted R-squared: 0.6858
## F-statistic: 285.5 on 3 and 388 DF, p-value: < 2.2e-16
In the earlier diagnostic plots we saw that the relationship between horsepower and mpg is probably non-linear (the residuals showed a pattern across the fitted values). Adding a quadratic term for horsepower fixes this and captures the non-linear relationship present in the data. However, adding a cubic term does not increase the explanatory power of the model (its p-value is 0.367).
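The same conclusion can be checked formally with an F-test comparing the quadratic and cubic fits; a sketch:
lm_hp2 <- lm(mpg ~ poly(horsepower, 2), data = auto)
lm_hp3 <- lm(mpg ~ poly(horsepower, 3), data = auto)
anova(lm_hp2, lm_hp3)   # tests whether the cubic term adds anything beyond the quadratic fit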
(10)This question should be answered using the Carseats data set. (a)Fit a multiple regression model to predict Sales using Price, Urban, and US.
carseat_data<-read.csv("Carseats.csv",header=TRUE)
carseat<-na.omit(carseat_data)
carseat$ShelveLoc<-as.factor(carseat$ShelveLoc)
carseat$Urban<-as.factor(carseat$Urban)
carseat$US<-as.factor(carseat$US)
str(carseat)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : int 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : int 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: int 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : int 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : int 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : int 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : int 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
lm_carseats <- lm(Sales ~ Price + Urban + US, data = carseat)
summary(lm_carseats)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = carseat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
(b)Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative! The intercept estimate tells us that when all the other variables are zero (Price equal to zero, a store in a rural location outside the US) we can expect sales of about 13,043 units, assuming the effects are linear and additive (Sales is recorded in thousands of units). The Price coefficient indicates that a one-unit increase in price brings sales down by about 54 units, holding the other variables constant. The UrbanYes coefficient is not statistically significant, so we cannot conclude that the sales level differs between urban and rural stores (holding the other variables constant). Finally, the USYes coefficient is significant, and its value implies that a store in the US sells about 1,200 more units, on average, than a non-US store.
(c)Write out the model in equation form, being careful to handle the qualitative variables properly. Sales_i = β0 + β1 * Price_i + β2 * 1(Urban_i = Yes) + β3 * 1(US_i = Yes) + ε_i, where 1(·) is an indicator variable equal to 1 when the condition holds and 0 otherwise.
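The indicator coding R uses for the qualitative variables can be inspected directly; the columns of the design matrix correspond to the terms in the equation above (a quick sketch):
contrasts(carseat$US)            # dummy coding for the US factor (Yes = 1, No = 0)
head(model.matrix(lm_carseats))  # design matrix with the indicator columns UrbanYes and USYes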
(d)For which of the predictors can you reject the null hypothesis H0 : βj = 0? For Price and US, whose p-values are far below 0.05; we cannot reject it for Urban. (The intercept is also highly significant, but it is not a predictor.)
(e)On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
lm_carseats_simp <- lm(Sales ~ Price + US, data = carseat)
summary(lm_carseats_simp)
##
## Call:
## lm(formula = Sales ~ Price + US, data = carseat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
(f)How well do the models in (a) and (e) fit the data? Both models fit the data similarly and explain only about 24% of the variance in Sales (R-squared ≈ 0.239 in each case), but the smaller model has a slightly better adjusted R-squared because we removed a variable with no explanatory power. Adjusted R-squared penalizes every variable added to the model, so including predictors that are essentially unrelated to the response does not push it upwards, as happens with the plain R-squared.
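Because the smaller model is nested in the larger one, an anova() comparison confirms that dropping Urban costs essentially nothing; a sketch:
anova(lm_carseats_simp, lm_carseats)   # nested-model comparison: Price + US vs Price + Urban + US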
(g)Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
confint(lm_carseats_simp)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
(h)Is there evidence of outliers or high leverage observations in the model from (e)?
par(mfrow = c(2,2))
plot(lm_carseats_simp)
In the diagnostic plots we see little evidence of outlying observations. There seems to be one observation with relatively high leverage, but it is not problematic since its residual is very small.
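A numeric check of the same diagnostics, assuming the usual rules of thumb (studentized residuals beyond roughly +/-3, leverage well above the average (p + 1)/n):
range(rstudent(lm_carseats_simp))                   # values beyond about +/-3 would flag outliers
p <- 2; n <- nrow(carseat)
sum(hatvalues(lm_carseats_simp) > 3 * (p + 1) / n)  # count of clearly high-leverage observations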
(a)Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38), β̂ = (Σ x_i y_i) / (Σ x_i^2). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X? The numerator in (3.38), Σ x_i y_i, is the same whether the regression is y ~ x or x ~ y. However, the denominator is usually different: Σ x_i^2 for the regression of Y onto X, and Σ y_i^2 for the regression of X onto Y. For the two coefficients to be the same, we need Σ x_i^2 = Σ y_i^2.
(b)Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(1989)
x_1 <- rnorm(100)
y_1 <- rnorm(100) + 2*x_1
lm(x_1 ~ y_1 + 0) %>% summary()
##
## Call:
## lm(formula = x_1 ~ y_1 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.94313 -0.25160 0.00501 0.26080 1.28222
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y_1 0.41411 0.01899 21.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4325 on 99 degrees of freedom
## Multiple R-squared: 0.8277, Adjusted R-squared: 0.826
## F-statistic: 475.7 on 1 and 99 DF, p-value: < 2.2e-16
lm(y_1 ~ x_1 + 0) %>% summary()
##
## Call:
## lm(formula = y_1 ~ x_1 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.69131 -0.63116 -0.05623 0.52089 2.72321
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x_1 1.99887 0.09164 21.81 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9503 on 99 degrees of freedom
## Multiple R-squared: 0.8277, Adjusted R-squared: 0.826
## F-statistic: 475.7 on 1 and 99 DF, p-value: < 2.2e-16
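Consistent with the condition derived above, the two estimates differ here because the sums of squares of x_1 and y_1 are not equal; a quick check:
sum(x_1^2)   # denominator for the regression of y_1 onto x_1
sum(y_1^2)   # denominator for the regression of x_1 onto y_1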
(c)Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
x_2 <- rnorm(100)
y_2 <- sample(x_2, 100, replace = FALSE)
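Because y_2 is just a permutation of x_2, their sums of squares are identical, so the condition for equal coefficients holds; a quick check before fitting:
all.equal(sum(x_2^2), sum(y_2^2))   # identical denominators, hence identical slope estimates below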
lm(x_2 ~ y_2 + 0) %>% summary()
##
## Call:
## lm(formula = x_2 ~ y_2 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.15853 -0.68461 -0.07436 0.63679 2.64401
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y_2 0.17061 0.09903 1.723 0.088 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.965 on 99 degrees of freedom
## Multiple R-squared: 0.02911, Adjusted R-squared: 0.0193
## F-statistic: 2.968 on 1 and 99 DF, p-value: 0.08804
lm(y_2 ~ x_2 + 0) %>% summary()
##
## Call:
## lm(formula = y_2 ~ x_2 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.12868 -0.65179 -0.07524 0.54794 2.83044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x_2 0.17061 0.09903 1.723 0.088 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.965 on 99 degrees of freedom
## Multiple R-squared: 0.02911, Adjusted R-squared: 0.0193
## F-statistic: 2.968 on 1 and 99 DF, p-value: 0.08804