library(tidyverse)
library(ISLR2)
library(gtsummary)
library(gt)
Carefully explain the differences between the KNN classifier and KNN regression methods. ***
Answer: KNN regression predicts a quantitative response: for a target point, it finds the K nearest training observations and averages their response values. KNN classification predicts a qualitative response: it finds the same K nearest neighbors, estimates the probability of each class as the fraction of neighbors in that class, and assigns the point to the most common class. The neighbor search is identical in both methods; what differs is the type of response and how the neighbors' responses are combined (average versus majority vote).
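To make the contrast concrete, here is a minimal base R sketch of both methods on a one-dimensional toy data set. The data and the helper names (`nearest_k`, `knn_regress`, `knn_classify`) are invented for illustration and are not part of the exercise.

```r
# Toy training data (made up for illustration)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_y_num <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)   # quantitative response
train_y_cls <- c("A", "A", "A", "B", "B", "B")   # qualitative response

# Indices of the k training points closest to x0
nearest_k <- function(x0, x, k) order(abs(x - x0))[seq_len(k)]

# KNN regression: average the neighbours' responses
knn_regress <- function(x0, x, y, k) mean(y[nearest_k(x0, x, k)])

# KNN classification: majority vote among the neighbours' classes
knn_classify <- function(x0, x, y, k) {
  votes <- table(y[nearest_k(x0, x, k)])
  names(votes)[which.max(votes)]
}

reg_pred <- knn_regress(2.4, train_x, train_y_num, k = 3)
cls_pred <- knn_classify(2.4, train_x, train_y_cls, k = 3)
print(reg_pred)
print(cls_pred)
```

Both functions share the same neighbor search; only the final step (mean versus vote) changes.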
This question involves the use of multiple linear regression on the Auto data set. ***
plot(Auto)
auto1 = Auto |> select(mpg, cylinders, displacement, horsepower, weight, acceleration, year, origin)
cor(auto1)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
model = lm(mpg ~ ., data = auto1)
tbl_regression(model, intercept = TRUE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | -17 | -26, -8.1 | <0.001 |
| cylinders | -0.49 | -1.1, 0.14 | 0.13 |
| displacement | 0.02 | 0.01, 0.03 | 0.008 |
| horsepower | -0.02 | -0.04, 0.01 | 0.2 |
| weight | -0.01 | -0.01, -0.01 | <0.001 |
| acceleration | 0.08 | -0.11, 0.27 | 0.4 |
| year | 0.75 | 0.65, 0.85 | <0.001 |
| origin | 1.4 | 0.88, 2.0 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
Answer: Yes, there is a relationship between the predictors and the response. The predictors with a statistically significant relationship to mpg are displacement, weight, year, and origin. The year coefficient suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.75 mpg.
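A regression coefficient is read as the change in the predicted response for a one-unit increase in that predictor, with the other predictors held fixed. The sketch below checks this on simulated data; the data frame `toy` and its coefficients are invented for illustration and are not the Auto data.

```r
set.seed(1)
n <- 100
# Simulated stand-in data (not the Auto data set)
toy <- data.frame(
  weight = runif(n, 1500, 5000),
  year   = sample(70:82, n, replace = TRUE)
)
toy$mpg <- 40 - 0.006 * toy$weight + 0.75 * (toy$year - 70) + rnorm(n)

fit <- lm(mpg ~ weight + year, data = toy)

# Two cars that differ only in model year
base  <- data.frame(weight = 3000, year = 76)
newer <- transform(base, year = year + 1)

# The gap between the two predictions is exactly the year coefficient
diff_pred <- unname(predict(fit, newer) - predict(fit, base))
print(diff_pred)
print(coef(fit)["year"])
```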
par(mfrow = c(2, 2))
plot(model)
Answer: The residuals vs fitted plot shows a curved pattern
rather than a flat band, with the residuals rising sharply on the
right side of the plot, which suggests the linear fit misses some
non-linearity. The scale-location plot shows this non-linear pattern
further. The residuals vs leverage plot labels observations 327, 394,
and 141 as unusual points that may affect the model.
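The diagnostic plots can be backed up with numbers. The sketch below uses simulated data with one planted high-leverage point and flags observations by two common rules of thumb: leverage above 2p/n and absolute studentized residual above 3. Both cutoffs are conventions assumed here, not thresholds taken from the exercise.

```r
set.seed(42)
# Simulated data: 20 ordinary points plus one far out in x (planted on purpose)
x <- c(1:20, 60)
y <- 2 * x + rnorm(length(x))
fit <- lm(y ~ x)

h <- hatvalues(fit)        # leverage of each observation
r <- rstudent(fit)         # studentized residuals
p <- length(coef(fit))     # number of estimated coefficients
n <- length(x)

# Rule-of-thumb cutoffs (assumptions, not hard rules)
high_leverage <- unname(which(h > 2 * p / n))
large_resid   <- unname(which(abs(r) > 3))

print(high_leverage)
print(large_resid)
```

The same calls (`hatvalues()`, `rstudent()`) work on any `lm` object, so they could be applied to `model` to check which row labels stand out.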
model1 = lm(mpg ~ displacement * weight + year:origin, data = auto1)
tbl_regression(model1, intercept = TRUE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 50 | 45, 55 | <0.001 |
| displacement | -0.06 | -0.09, -0.04 | <0.001 |
| weight | -0.01 | -0.01, -0.01 | <0.001 |
| displacement * weight | 0.00 | 0.00, 0.00 | <0.001 |
| year * origin | 0.01 | 0.00, 0.02 | 0.009 |
| Abbreviation: CI = Confidence Interval | |||
Answer: The interaction terms displacement:weight and year:origin are both statistically significant (p < 0.001 and p = 0.009). In fact, every term in this model is significant at the 5% level.
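The two formula operators behave differently: `a * b` expands to both main effects plus their interaction, while `a:b` adds only the interaction term. This small check, which needs no data, lists the terms each formula produces.

```r
# `*` expands to main effects plus the interaction
full_terms <- attr(terms(mpg ~ displacement * weight), "term.labels")

# `:` contributes the interaction term only, with no main effects
only_product <- attr(terms(mpg ~ year:origin), "term.labels")

print(full_terms)
print(only_product)
```

So a formula like the one used for `model1` includes main effects for displacement and weight but not for year or origin.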
model2 = lm(mpg ~ sqrt(displacement) + log(weight) + year + I(origin^2), data = auto1)
tbl_regression(model2, intercept = TRUE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 123 | 101, 146 | <0.001 |
| sqrt(displacement) | 0.07 | -0.18, 0.33 | 0.6 |
| log(weight) | -20 | -23, -17 | <0.001 |
| year | 0.78 | 0.69, 0.88 | <0.001 |
| I(origin^2) | 0.18 | 0.05, 0.30 | 0.006 |
| Abbreviation: CI = Confidence Interval | |||
Answer: After transforming the variables, log(weight), year, and I(origin^2) remain statistically significant, but sqrt(displacement) does not (p = 0.6). The R squared of 84% indicates a good fit: about 84% of the variance in mpg is explained by displacement, weight, year, and origin in this regression model.
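R squared and the coefficient p-values can be pulled directly from `summary()` when comparing a transformed fit with an untransformed one. The sketch below uses simulated data in which the response truly depends on log(weight); the data are invented for illustration.

```r
set.seed(7)
# Simulated data where mpg is linear in log(weight), not in weight
toy <- data.frame(weight = seq(1500, 5000, length.out = 100))
toy$mpg <- 200 - 20 * log(toy$weight) + rnorm(100, sd = 0.5)

fit_lin <- lm(mpg ~ weight, data = toy)
fit_log <- lm(mpg ~ log(weight), data = toy)

# Share of variance explained by each fit
r2_lin <- summary(fit_lin)$r.squared
r2_log <- summary(fit_log)$r.squared

# p-values for each coefficient of the transformed fit
p_values <- summary(fit_log)$coefficients[, "Pr(>|t|)"]

print(c(linear = r2_lin, log = r2_log))
print(p_values)
```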
This question should be answered using the Carseats data set. ***
m = lm(Sales ~ Price + Urban + US, data = Carseats)
table = tbl_regression(m, intercept = TRUE)
as_gt(table) |> gt::tab_header("Linear Regression Model: Sales")
Linear Regression Model: Sales

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 13 | 12, 14 | <0.001 |
| Price | -0.05 | -0.06, -0.04 | <0.001 |
| Urban | |||
| No | — | — | |
| Yes | -0.02 | -0.56, 0.51 | >0.9 |
| US | |||
| No | — | — | |
| Yes | 1.2 | 0.69, 1.7 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
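Answer: Among the statistically significant variables, each one-unit increase in Price is associated with a decrease of about 0.05 in Sales, holding the other predictors fixed (Sales is recorded in thousands of units, so roughly 50 car seats). Stores in the US sell about 1.2 thousand more units than stores outside the US. Urban is not statistically significant (p > 0.9). Note that the R squared of this model is only about 24%.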
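Answer: \(Sales = \beta_0 + \beta_1(Price) + \beta_2(Urban_{Yes}) + \beta_3(US_{Yes}) + \varepsilon\), where \(Urban_{Yes}\) and \(US_{Yes}\) are indicator variables equal to 1 when Urban = Yes and US = Yes, respectively, and 0 otherwise.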
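R builds indicator (dummy) columns for factor predictors automatically, using the first level as the baseline. The sketch below shows the design matrix for a four-row toy data frame; the values are made up and only the column names mirror the Carseats predictors.

```r
# Tiny made-up data frame with two Yes/No factors
toy <- data.frame(
  Price = c(100, 120, 90, 110),
  Urban = factor(c("Yes", "No", "Yes", "No")),
  US    = factor(c("No", "Yes", "Yes", "No"))
)

# Design matrix that lm() would build for this formula
X <- model.matrix(~ Price + Urban + US, data = toy)
print(X)
```

The columns `UrbanYes` and `USYes` are 1 for "Yes" and 0 for the baseline level "No", which is why the fitted coefficients carry those names.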
Answer: For the predictors Price and USYes we reject the null hypothesis \(H_0: \beta_j = 0\), which states that there is no relationship between the predictor and Sales. For UrbanYes (p > 0.9) we fail to reject it.
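Each p-value in a regression table comes from a t-test of the null hypothesis that the coefficient is zero: the estimate divided by its standard error, compared with a t distribution on the residual degrees of freedom. The sketch below reproduces the p-values by hand on simulated data (invented for illustration).

```r
set.seed(3)
# Simulated data with a true slope of 0.5
toy <- data.frame(x = rnorm(50))
toy$y <- 1 + 0.5 * toy$x + rnorm(50)

fit <- lm(y ~ x, data = toy)
est <- coef(summary(fit))   # estimates, standard errors, t values, p-values

# t statistic and two-sided p-value computed by hand
t_stat <- est[, "Estimate"] / est[, "Std. Error"]
p_val  <- 2 * pt(-abs(t_stat), df = df.residual(fit))

print(cbind(by_hand = p_val, from_summary = est[, "Pr(>|t|)"]))
```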
m1 = lm(Sales ~ Price + US, data = Carseats)
table1 = tbl_regression(m1, intercept = TRUE)
as_gt(table1) |> gt::tab_header("Linear Regression Model: Significant Predictors")
Linear Regression Model: Significant Predictors

| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| (Intercept) | 13 | 12, 14 | <0.001 |
| Price | -0.05 | -0.06, -0.04 | <0.001 |
| US | |||
| No | — | — | |
| Yes | 1.2 | 0.69, 1.7 | <0.001 |
| Abbreviation: CI = Confidence Interval | |||
Answer: Neither model fits the data very well, as shown by the R^2 of 24%. Although Price and USYes are statistically significant predictors, 76% of the variance in Sales is still not explained by the model.
confint(m1)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
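Answer: The 95% confidence interval for the Price coefficient is (-0.065, -0.044) and for USYes it is (0.69, 1.71); neither interval contains zero. Because Sales is recorded in thousands, US stores sell roughly 690 to 1,710 more units than non-US stores, and each one-unit increase in Price lowers Sales by roughly 44 to 65 units. The R squared of 24% is the same for both models, so removing the statistically insignificant Urban variable did not improve the fit.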
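The intervals reported by `confint()` for a linear model are the estimate plus or minus a t quantile times the standard error. The sketch below rebuilds them by hand on simulated data; the data frame `toy` is invented and only loosely imitates a price and sales relationship.

```r
set.seed(5)
# Simulated data (not Carseats)
toy <- data.frame(price = runif(60, 50, 150))
toy$sales <- 13 - 0.05 * toy$price + rnorm(60)

fit <- lm(sales ~ price, data = toy)
est <- coef(summary(fit))

# 97.5% t quantile on the residual degrees of freedom
crit <- qt(0.975, df = df.residual(fit))

# 95% interval: estimate +/- critical value * standard error
manual_ci <- cbind(
  est[, "Estimate"] - crit * est[, "Std. Error"],
  est[, "Estimate"] + crit * est[, "Std. Error"]
)

print(manual_ci)
print(confint(fit))
```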
Carseats |>
ggplot(aes(Price)) +
geom_histogram(binwidth =10,
fill="darkgreen",
color="white") +
labs(title= "Price Outlier Graph")
Carseats |>
ggplot(aes(Sales)) +
geom_histogram(binwidth =1,
fill="darkgreen",
color="white") +
labs(title= "Sales Outlier Graph")
Answer: Sales ranges from roughly 0 to 20 with no real evidence of outliers. Price shows a few possible outliers, with a handful of price points in the 170 to 180 range.
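A histogram check can be complemented with a numeric rule. One common convention, assumed here rather than taken from the exercise, is the boxplot rule: flag values more than 1.5 times the interquartile range beyond the hinges. Base R exposes this through `boxplot.stats()`; the vector below is made up for illustration.

```r
# Made-up values with one obviously extreme point
x <- c(10, 12, 11, 13, 12, 11, 50)

# Values outside the boxplot whiskers (1.5 * IQR rule by default)
flagged <- boxplot.stats(x)$out
print(flagged)
```

The same call could be applied to a single column such as `Carseats$Price` to list the candidate outliers seen in the histogram.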
This problem involves simple linear regression without an intercept.
Answer: For regression without an intercept, the coefficient estimate for Y onto X is \(\hat{\beta} = \sum x_i y_i / \sum x_i^2\), and for X onto Y it is \(\sum x_i y_i / \sum y_i^2\). The two estimates are therefore the same exactly when \(\sum x_i^2 = \sum y_i^2\), and they differ otherwise.
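For regression through the origin, the slope of y on x is sum(x * y) / sum(x^2), so swapping the roles of x and y gives the same slope whenever the two sums of squares match. The sketch below checks this on simulated data, making y a permutation of x so that the sums of squares agree; in an R formula, `+ 0` is what removes the intercept. The data are invented for illustration.

```r
set.seed(10)
x <- rnorm(100)
y <- sample(x)   # a permutation of x, so sum(y^2) equals sum(x^2)

# `+ 0` drops the intercept from the fit
b_y_on_x <- unname(coef(lm(y ~ x + 0)))
b_x_on_y <- unname(coef(lm(x ~ y + 0)))

# Closed-form slope for regression through the origin
closed_form <- sum(x * y) / sum(x^2)

print(c(b_y_on_x, b_x_on_y, closed_form))
```

Without `+ 0` (or `- 1`) in the formula, `lm()` estimates an intercept even if a reporting function is told not to display it.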
market = Smarket |> head(100)
m3 = lm(Lag1 ~ Volume, data = market)
tbl_regression(m3, intercept = FALSE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Volume | 1.3 | -0.40, 3.0 | 0.13 |
| Abbreviation: CI = Confidence Interval | |||
m4 = lm(Volume ~ Lag1, data = market)
tbl_regression(m4, intercept = FALSE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Lag1 | 0.02 | -0.01, 0.04 | 0.13 |
| Abbreviation: CI = Confidence Interval | |||
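Answer: Using the Volume and Lag1 variables, we can see that the coefficient changes when the roles of the two variables are swapped: the coefficient for Volume (predicting Lag1) is 1.3, while the coefficient for Lag1 (predicting Volume) is 0.02. Note that lm() still fits an intercept in both models; intercept = FALSE in tbl_regression() only leaves the intercept row out of the table.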
market = Smarket |> head(100)
m5 = lm(Lag1 ~ Lag2, data = market)
tbl_regression(m5, intercept = FALSE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Lag2 | -0.01 | -0.21, 0.19 | >0.9 |
| Abbreviation: CI = Confidence Interval | |||
m6 = lm(Lag2 ~ Lag1, data = market)
tbl_regression(m6, intercept = FALSE)
| Characteristic | Beta | 95% CI | p-value |
|---|---|---|---|
| Lag1 | -0.01 | -0.21, 0.19 | >0.9 |
| Abbreviation: CI = Confidence Interval | |||
Answer: Using the first 100 observations from the Smarket data set, Lag1 and Lag2 give an example in which the coefficient estimate is the same (-0.01) whichever variable is the response. Lag2 is Lag1 shifted by one day, so the two variables share almost all of their values and their sums of squares are nearly equal. The 0.000156 R squared and the .12 show that the two models are almost identical.