Linear Regression

library(tidyverse)
library(ISLR2)
library(gtsummary)
library(gt)

Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

Answer: Both methods start the same way: for a test point x0, find the K training observations closest to it. They differ in the type of response and in how the neighbors are combined. The KNN classifier is used when the response is qualitative. It estimates the probability of each class as the fraction of the K neighbors that belong to it, and assigns x0 to the most common class. KNN regression is used when the response is quantitative. It predicts the value at x0 as the average of the K neighbors' responses.
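The contrast can be illustrated with a minimal base R sketch. The training data and the helper names (nearest, knn_classify, knn_regress) are made up for illustration and are not part of the exercise; the only difference between the two functions is the last step, a vote versus an average.

```r
# Toy one-dimensional training data (hypothetical values).
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("A", "A", "A", "B", "B", "B")  # qualitative response
train_y     <- c(1.0, 1.5, 2.0, 5.0, 5.5, 6.0)  # quantitative response

# Indices of the k training points closest to x0.
nearest <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: majority vote among the k neighbours.
knn_classify <- function(x0, k = 3) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k neighbours.
knn_regress <- function(x0, k = 3) {
  mean(train_y[nearest(x0, k)])
}

knn_classify(2.5)  # "A"
knn_regress(2.5)   # 1.5
```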

Question 9

This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

plot(Auto)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto1 = Auto |> select(mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin)

cor(auto1)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance: i. Is there a relationship between the predictors and the response? ii. Which predictors appear to have a statistically significant relationship to the response? iii. What does the coefficient for the year variable suggest?

model = lm(mpg ~ ., data = auto1)

tbl_regression(model, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	-17	-26, -8.1	<0.001
cylinders	-0.49	-1.1, 0.14	0.13
displacement	0.02	0.01, 0.03	0.008
horsepower	-0.02	-0.04, 0.01	0.2
weight	-0.01	-0.01, -0.01	<0.001
acceleration	0.08	-0.11, 0.27	0.4
year	0.75	0.65, 0.85	<0.001
origin	1.4	0.88, 2.0	<0.001
Abbreviation: CI = Confidence Interval

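Answer: Yes, there is a relationship between the predictors and the response: several coefficients have p-values below 0.001 (the overall F-test reported by summary(model) is the formal check). The predictors with a statistically significant relationship to mpg are displacement, weight, year, and origin; cylinders, horsepower, and acceleration are not significant at the 5% level. The year coefficient of 0.75 suggests that, holding the other predictors fixed, each additional model year is associated with an increase of about 0.75 mpg, so newer cars tend to be more fuel efficient.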
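The regression table does not show the overall F-test, so a short sketch of how it could be pulled from summary() may help. The object names auto_num, fit, s, f and p_overall are illustrative; the model is the same mpg-on-everything-except-name fit used above.

```r
library(ISLR2)

# Same model as above: drop the qualitative name column, regress mpg on the rest.
auto_num <- Auto[, names(Auto) != "name"]
fit <- lm(mpg ~ ., data = auto_num)

s <- summary(fit)
f <- s$fstatistic  # named vector: value, numdf, dendf

# p-value of the overall F-test (H0: all slope coefficients are zero).
p_overall <- pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)

s$r.squared        # share of the variance in mpg explained by the predictors
p_overall
coef(fit)["year"]  # change in mpg per additional model year, others held fixed
```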

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2, 2))
plot(model)

Answer: The residuals vs fitted plot shows a curved pattern rather than a flat band, with the residuals rising sharply on the right side of the plot, which suggests the linear fit misses some non-linearity. The scale-location plot shows a similar trend, so the spread of the residuals does not look constant either. The residuals vs leverage plot labels observations 327, 394, and 141 as points that may be unduly influencing the model.

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

model1 = lm(mpg ~ displacement * weight + year:origin, data = auto1)

tbl_regression(model1, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	50	45, 55	<0.001
displacement	-0.06	-0.09, -0.04	<0.001
weight	-0.01	-0.01, -0.01	<0.001
displacement * weight	0.00	0.00, 0.00	<0.001
year * origin	0.01	0.00, 0.02	0.009
Abbreviation: CI = Confidence Interval

Answer: Both interaction terms, displacement:weight and year:origin, are statistically significant (p < 0.001 and p = 0.009). The displacement and weight main effects are significant as well.
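As a quick check of what the two operators do, the formula can be expanded without fitting anything. This is a small sketch using base R's terms(); no data is needed, and the variable name f_int is illustrative.

```r
# a * b expands to a + b + a:b, while a:b adds only the interaction term.
f_int <- mpg ~ displacement * weight + year:origin

term_labels <- attr(terms(f_int), "term.labels")
term_labels
# "displacement" "weight" "displacement:weight" "year:origin"
```

The expansion shows that year and origin enter this model only through their product, with no main effects of their own.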

(f) Try a few different transformations of the variables, such as log(X), √X, X². Comment on your findings.

model2 = lm(mpg ~  sqrt(displacement) + log(weight) + year + I(origin^2), data = auto1)

tbl_regression(model2, intercept = TRUE)

Characteristic	Beta	95% CI	p-value
(Intercept)	123	101, 146	<0.001
sqrt(displacement)	0.07	-0.18, 0.33	0.6
log(weight)	-20	-23, -17	<0.001
year	0.78	0.69, 0.88	<0.001
I(origin^2)	0.18	0.05, 0.30	0.006
Abbreviation: CI = Confidence Interval

Answer: After the transformations, log(weight), year, and I(origin^2) are statistically significant, but sqrt(displacement) is not (p = 0.6). The R² of 84% indicates a good fit: about 84% of the variation in mpg is explained by displacement, weight, year, and origin in this model.

Question 10

This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

m = lm(Sales ~ Price + Urban + US, data = Carseats)

table = tbl_regression(m, intercept = TRUE)

as_gt(table) |> gt::tab_header("Linear Regression Model: Sales")

Linear Regression Model: Sales
Characteristic	Beta	95% CI	p-value
(Intercept)	13	12, 14	<0.001
Price	-0.05	-0.06, -0.04	<0.001
Urban
No	—	—
Yes	-0.02	-0.56, 0.51	>0.9
US
No	—	—
Yes	1.2	0.69, 1.7	<0.001
Abbreviation: CI = Confidence Interval

(b) Provide an interpretation of each coefficient in the model. Be careful—some of the variables in the model are qualitative!

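Answer: Price: holding Urban and US fixed, each one-unit increase in price is associated with a decrease of about 0.05 in Sales. Sales is recorded in thousands of units, so this is roughly 50 fewer units sold. Urban (Yes vs No): the coefficient of -0.02 is not statistically significant (p > 0.9), so there is no evidence that sales differ between urban and rural stores. US (Yes vs No): stores located in the US sell about 1.2 thousand more units on average than stores outside the US at the same price. Note that the R² of this model is only about 24%.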

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

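Answer: \(Sales = \beta_0 + \beta_1 \cdot Price + \beta_2 \cdot UrbanYes + \beta_3 \cdot USYes + \varepsilon\), where UrbanYes is 1 if the store is in an urban location and 0 otherwise, and USYes is 1 if the store is in the US and 0 otherwise.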
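To see how R turns the two qualitative variables into 0/1 indicators, one can inspect the design matrix. The sketch below uses a hypothetical four-row data frame (toy) that only mimics the Carseats factor columns; the values are invented.

```r
# Hypothetical data frame with the same column types as Carseats.
toy <- data.frame(
  Price = c(100, 120, 90, 110),
  Urban = factor(c("Yes", "No", "Yes", "No")),
  US    = factor(c("Yes", "Yes", "No", "No"))
)

# Design matrix that lm() would build for Sales ~ Price + Urban + US.
X <- model.matrix(~ Price + Urban + US, data = toy)
X
# Columns: (Intercept), Price, UrbanYes, USYes.
# UrbanYes and USYes are 1 for "Yes" and 0 for "No", the baseline level.
```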

(d) For which of the predictors can you reject the null hypothesis H0 : βj =0?

Answer: We reject the null hypothesis of no relationship with Sales for Price and USYes (both p < 0.001). We cannot reject it for UrbanYes (p > 0.9).

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

m1 = lm(Sales ~ Price + US, data = Carseats)

table1 = tbl_regression(m1, intercept = TRUE)

as_gt(table1) |> gt::tab_header("Linear Regression Model: Significant Predictors")

Linear Regression Model: Significant Predictors
Characteristic	Beta	95% CI	p-value
(Intercept)	13	12, 14	<0.001
Price	-0.05	-0.06, -0.04	<0.001
US
No	—	—
Yes	1.2	0.69, 1.7	<0.001
Abbreviation: CI = Confidence Interval

(f) How well do the models in (a) and (e) fit the data?

Answer: Neither model fits the data very well: both have an R² of about 24%. Although Price and USYes are statistically significant predictors, roughly 76% of the variance in Sales is left unexplained by either model.
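The fit statistics behind this comparison are not shown in the regression tables. A sketch of how they could be extracted side by side follows; fit_stats is a hypothetical helper, and full and reduced are the models from (a) and (e) refit under new names.

```r
library(ISLR2)

full    <- lm(Sales ~ Price + Urban + US, data = Carseats)  # model from (a)
reduced <- lm(Sales ~ Price + US, data = Carseats)          # model from (e)

# Collect R-squared, adjusted R-squared and residual standard error.
fit_stats <- function(fit) {
  s <- summary(fit)
  c(r2 = s$r.squared, adj_r2 = s$adj.r.squared, rse = s$sigma)
}

stats <- rbind(full = fit_stats(full), reduced = fit_stats(reduced))
round(stats, 4)

# Partial F-test: does adding Urban improve on the reduced model?
anova(reduced, full)
```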

(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(m1)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

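Answer: With 95% confidence, the Price coefficient lies between -0.065 and -0.044 and the USYes coefficient lies between 0.69 and 1.71. Neither interval contains zero, which is consistent with both predictors being significant. Because Sales is in thousands of units, US stores sell between roughly 690 and 1,710 more units than non-US stores at the same price. The R² of 24% is the same for both models, so dropping the insignificant Urban variable simplifies the model without any loss of fit.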

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

Carseats |>
  ggplot(aes(Price)) +
  geom_histogram(binwidth = 10,
                 fill = "darkgreen",
                 color = "white") +
  labs(title = "Price Outlier Graph")

Carseats |>
  ggplot(aes(Sales)) +
  geom_histogram(binwidth = 1,
                 fill = "darkgreen",
                 color = "white") +
  labs(title = "Sales Outlier Graph")

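Answer: The Sales histogram spans roughly 0 to 20 with no clear evidence of outliers. The Price histogram shows a few stores with prices in the 170 to 180 range, well away from the bulk of the data; such extreme predictor values are candidates for high leverage. Histograms look at one variable at a time, so residual and leverage statistics from the fitted model give a more direct check.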
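A model-based check could look like the sketch below, which computes studentized residuals and leverage for the model from (e). The cutoffs (3 for studentized residuals, three times the average leverage) are common rules of thumb and are assumptions here, not thresholds set by the exercise.

```r
library(ISLR2)

fit <- lm(Sales ~ Price + US, data = Carseats)

n <- nrow(Carseats)
p <- length(coef(fit)) - 1  # number of predictors

stud <- rstudent(fit)       # studentized residuals
lev  <- hatvalues(fit)      # leverage statistics

# Rules of thumb (assumptions, not hard limits):
# |studentized residual| > 3 suggests an outlier;
# leverage well above the average (p + 1) / n suggests high leverage.
outliers      <- which(abs(stud) > 3)
high_leverage <- which(lev > 3 * (p + 1) / n)

outliers
high_leverage
```

Calling plot(fit) on the same model shows the same quantities graphically in the residuals vs leverage panel.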

Question 12

This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

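Answer: Without an intercept, the coefficient for Y onto X is the sum of the products x·y divided by the sum of the squared x values, and the coefficient for X onto Y is the same numerator divided by the sum of the squared y values. The two estimates are therefore equal exactly when the sum of squares of X equals the sum of squares of Y (or, trivially, when the sum of the products is zero, so both estimates are zero). It is not enough that X and Y can each be used to predict the other.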
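Written out with (3.38) applied in each direction, the condition follows in one step. The subscripts on the two estimates are notation introduced here for clarity.

\[
\hat{\beta}_{Y \sim X} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2},
\qquad
\hat{\beta}_{X \sim Y} = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} y_i^2}
\]

\[
\hat{\beta}_{Y \sim X} = \hat{\beta}_{X \sim Y}
\iff
\Big(\sum_{i=1}^{n} x_i y_i\Big)\Big(\sum_{i=1}^{n} y_i^2 - \sum_{i=1}^{n} x_i^2\Big) = 0
\iff
\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2
\ \text{ or } \
\sum_{i=1}^{n} x_i y_i = 0
\]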

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

market = Smarket |> head(100)

m3 = lm(Lag1 ~ Volume, data = market)
tbl_regression(m3, intercept = FALSE)

Characteristic	Beta	95% CI	p-value
Volume	1.3	-0.40, 3.0	0.13
Abbreviation: CI = Confidence Interval

m4 = lm(Volume ~ Lag1, data = market)
tbl_regression(m4, intercept = FALSE)

Characteristic	Beta	95% CI	p-value
Lag1	0.02	-0.01, 0.04	0.13
Abbreviation: CI = Confidence Interval

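Answer: Using the Lag1 and Volume variables, the two regressions give clearly different coefficients: 1.3 for Volume when Lag1 is the response, and 0.02 for Lag1 when Volume is the response. Note that lm() includes an intercept in both fits; intercept = FALSE in tbl_regression() only hides it from the table. Adding + 0 to the formula would fit the no-intercept model the question asks about.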
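Since the question asks to generate an example, a simulated version fitted without an intercept may be a closer match. The sketch below is an alternative to the Smarket fits above, not a replacement for them; the seed, slope and noise level are arbitrary choices.

```r
set.seed(1)
x <- rnorm(100)
y <- 2 * x + rnorm(100, sd = 0.5)

# "+ 0" drops the intercept from the fit itself.
coef_y_on_x <- coef(lm(y ~ x + 0))
coef_x_on_y <- coef(lm(x ~ y + 0))

coef_y_on_x
coef_x_on_y

# Both estimates match the closed form from (3.38).
all.equal(unname(coef_y_on_x), sum(x * y) / sum(x^2))
all.equal(unname(coef_x_on_y), sum(x * y) / sum(y^2))
```

The estimates differ because y has a much larger sum of squares than x.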

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

market = Smarket |> head(100)

m5 = lm(Lag1 ~ Lag2, data = market)
tbl_regression(m5, intercept = FALSE)

Characteristic	Beta	95% CI	p-value
Lag2	-0.01	-0.21, 0.19	>0.9
Abbreviation: CI = Confidence Interval

m6 = lm(Lag2 ~ Lag1, data = market)
tbl_regression(m6, intercept = FALSE)

Characteristic	Beta	95% CI	p-value
Lag1	-0.01	-0.21, 0.19	>0.9
Abbreviation: CI = Confidence Interval

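Answer: Using the first 100 observations from the Smarket data set, Lag1 and Lag2 give an example in which the coefficient estimate is the same, to the precision shown, whichever variable is the response: both are -0.01 with a 95% CI of -0.21 to 0.19. This is expected because Lag2 is Lag1 shifted by one day, so the two variables have almost the same spread over these rows. Both fits also share the same R² of 0.000156, since R² in simple regression is the squared correlation and does not depend on which variable is the response. As in (b), these fits include an intercept.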
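An exact match in the no-intercept setting can be produced by simulation: if y is a reordering of x, the two sums of squares are identical by construction. This is a minimal sketch with an arbitrary seed, offered as one way to build such an example.

```r
set.seed(2)
x <- rnorm(100)
y <- sample(x)  # same values in a different order, so sum(y^2) equals sum(x^2)

# Fit both directions without an intercept.
coef_y_on_x <- coef(lm(y ~ x + 0))
coef_x_on_y <- coef(lm(x ~ y + 0))

coef_y_on_x
coef_x_on_y

all.equal(unname(coef_y_on_x), unname(coef_x_on_y))
```

Setting y <- -x or y <- x would also work, but the permutation shows that the two variables do not need to be perfectly correlated.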