Q2

Carefully explain the differences between the KNN classifier and KNN regression methods.

#The KNN classifier is used when the response variable is categorical (qualitative). For a test observation it finds the K nearest training points and predicts the most common class among them (a majority vote); it can also report the estimated class probabilities.

#KNN regression is used when the response variable is quantitative. For a test observation it predicts Y as the average of the responses of the K nearest training points. Both methods use the same neighborhoods; the difference is that one takes a majority vote over classes while the other averages a numeric response.
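
#A minimal sketch of both methods in R, assuming the class and FNN packages are installed; the data below is simulated purely for illustration:

set.seed(1)
x_train <- matrix(rnorm(200), ncol = 2)   # 100 training points, 2 predictors
x_test  <- matrix(rnorm(20),  ncol = 2)   # 10 test points
y_class <- factor(ifelse(x_train[, 1] + rnorm(100) > 0, "Yes", "No"))
y_num   <- x_train[, 1] + rnorm(100)

# KNN classification: majority vote among the K = 5 nearest neighbors
class::knn(train = x_train, test = x_test, cl = y_class, k = 5)

# KNN regression: average response of the K = 5 nearest neighbors
FNN::knn.reg(train = x_train, test = x_test, y = y_num, k = 5)$pred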

Q9

This question involves the use of multiple linear regression on the Auto data set.

library(tidyverse)
## Warning: package 'tidyverse' was built under R version 4.1.3
## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --
## v ggplot2 3.4.0      v purrr   1.0.1 
## v tibble  3.1.8      v dplyr   1.0.10
## v tidyr   1.2.1      v stringr 1.5.0 
## v readr   2.1.3      v forcats 0.5.2
## Warning: package 'ggplot2' was built under R version 4.1.3
## Warning: package 'tibble' was built under R version 4.1.3
## Warning: package 'tidyr' was built under R version 4.1.3
## Warning: package 'readr' was built under R version 4.1.3
## Warning: package 'purrr' was built under R version 4.1.3
## Warning: package 'dplyr' was built under R version 4.1.3
## Warning: package 'stringr' was built under R version 4.1.3
## Warning: package 'forcats' was built under R version 4.1.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
library(readr)
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.3
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:ISLR2':
## 
##     Boston
## 
## The following object is masked from 'package:dplyr':
## 
##     select

##(a) Produce a scatterplot matrix which includes all of the variables in the data set.

auto <- read_csv("~/UTSA Classes/UTSA_Data_Mining/Auto.csv")
## Rows: 397 Columns: 9
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (2): horsepower, name
## dbl (7): mpg, cylinders, displacement, weight, acceleration, year, origin
## 
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(auto)
## spc_tbl_ [397 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg         : num [1:397] 18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num [1:397] 8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num [1:397] 307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : chr [1:397] "130" "165" "150" "150" ...
##  $ weight      : num [1:397] 3504 3693 3436 3433 3449 ...
##  $ acceleration: num [1:397] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num [1:397] 70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num [1:397] 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr [1:397] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cylinders = col_double(),
##   ..   displacement = col_double(),
##   ..   horsepower = col_character(),
##   ..   weight = col_double(),
##   ..   acceleration = col_double(),
##   ..   year = col_double(),
##   ..   origin = col_double(),
##   ..   name = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

##(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

auto$horsepower = as.numeric(auto$horsepower)
## Warning: NAs introduced by coercion
str(auto)
## spc_tbl_ [397 x 9] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ mpg         : num [1:397] 18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : num [1:397] 8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num [1:397] 307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num [1:397] 130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num [1:397] 3504 3693 3436 3433 3449 ...
##  $ acceleration: num [1:397] 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : num [1:397] 70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : num [1:397] 1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr [1:397] "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   mpg = col_double(),
##   ..   cylinders = col_double(),
##   ..   displacement = col_double(),
##   ..   horsepower = col_character(),
##   ..   weight = col_double(),
##   ..   acceleration = col_double(),
##   ..   year = col_double(),
##   ..   origin = col_double(),
##   ..   name = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
plot(auto[,1:8])
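
#The scatterplot matrix above answers part (a). Part (b) also asks for the correlation matrix itself; a sketch of that computation, dropping the qualitative name column and the rows where horsepower was coerced to NA:

cor(auto[, 1:8], use = "complete.obs")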

##(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

autoN <- auto[,1:8]
mult_lm <- lm(mpg ~ . , data = autoN)

#i. Is there a relationship between the predictors and the response?

summary(mult_lm)
## 
## Call:
## lm(formula = mpg ~ ., data = autoN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
# There is a relationship between the predictors and the response since the overall F-statistic is large and its p-value is very small (< 2.2e-16), so we reject the null hypothesis and conclude that at least one of the predictor slopes is different from 0.

#ii. Which predictors appear to have a statistically significant relationship to the response?

#Predictors with a p-value less than 0.05 are considered significant, so those variables are
#displacement, weight, year, and origin. (Note: the intercept is also significant.)

#iii. What does the coefficient for the year variable suggest?

# The coefficient for year is about 0.75, which means that mpg is predicted to increase by roughly 0.75 for every one-year increase in model year, holding all other variables constant. In other words, newer cars tend to be more fuel efficient.

##(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(mult_lm)

#The first plot we look at is residuals vs. fitted values, and it shows a clear curved (roughly quadratic) trend, which violates the assumption of random residuals and suggests the relationship is not purely linear.

#The normal Q-Q plot is roughly linear in the middle but deviates in the tails, so the residuals are only approximately normal.

#The residuals-vs-leverage plot (with Cook's distance contours) shows a few observations with unusually high leverage, and the residual plots flag a few large outliers as well.
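
#A hedged sketch of how these visual impressions could be checked numerically: studentized residuals beyond about |3| as outlier candidates, and leverage above roughly 2(p+1)/n as high-leverage candidates.

which(abs(rstudent(mult_lm)) > 3)                                       # possible outliers
which(hatvalues(mult_lm) > 2 * length(coef(mult_lm)) / nobs(mult_lm))   # high-leverage points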

##(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

# Fit all main effects plus every pairwise interaction; the second call (using *)
# overwrites the first and is the model summarized below.
mult_lm_int <- lm(mpg ~ .:. , data = autoN)
mult_lm_int <- lm(mpg ~ .*. , data = autoN)

summary(mult_lm_int)
## 
## Call:
## lm(formula = mpg ~ . * ., data = autoN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
##   (5 observations deleted due to missingness)
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
#At the 0.05 level, the three significant interactions are displacement:year, acceleration:year, and acceleration:origin.
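
#As an optional follow-up (not required by the question), one could refit keeping just those three interactions on top of the main effects and test whether they improve on the additive model:

mult_lm_int2 <- lm(mpg ~ . + displacement:year + acceleration:year + acceleration:origin, data = autoN)
anova(mult_lm, mult_lm_int2)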

##(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

mult_lm_trans <- lm(mpg ~ displacement + I(displacement^(1/2)), data = autoN)
summary(mult_lm_trans)
## 
## Call:
## lm(formula = mpg ~ displacement + I(displacement^(1/2)), data = autoN)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -16.362  -2.354  -0.261   2.097  20.285 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           64.46433    4.05000  15.917  < 2e-16 ***
## displacement           0.09076    0.02087   4.349 1.74e-05 ***
## I(displacement^(1/2)) -4.35730    0.59883  -7.276 1.87e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.376 on 394 degrees of freedom
## Multiple R-squared:  0.6889, Adjusted R-squared:  0.6874 
## F-statistic: 436.3 on 2 and 394 DF,  p-value: < 2.2e-16
# I fit a model with mpg as the dependent variable and with displacement and the square root of displacement as the independent variables. Both predictors and the overall model are significant, but the R^2 is 0.689, meaning only 68.9% of the variation in mpg is explained by these predictors, which is noticeably lower than the full multiple regression above.
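
#The question also suggests log(X) and X^2; a sketch of those transformations applied to horsepower (output omitted), which could be compared with summary() and the diagnostic plots:

mult_lm_log <- lm(mpg ~ log(horsepower), data = autoN)
mult_lm_sq  <- lm(mpg ~ horsepower + I(horsepower^2), data = autoN)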

Q10

This question should be answered using the Carseats data set.

head(Carseats)
##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes
car <- Carseats

##(a) Fit a multiple regression model to predict Sales using Price,Urban, and US.

lm1 <- lm(Sales ~ Price + Urban + US, data=car)

##(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

lm1
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
## 
## Coefficients:
## (Intercept)        Price     UrbanYes        USYes  
##    13.04347     -0.05446     -0.02192      1.20057
#The fitted model is Sales = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.20(USYes).
#The intercept (13.04) is the predicted sales (in thousands of units) for a non-urban, non-US store with a price of 0, so it is mostly a baseline rather than a realistic scenario.
#For Price: for every one-unit increase in price, predicted sales decrease by about 0.054 (thousand units), holding the other variables constant.
#For UrbanYes: a store in an urban location is predicted to sell about 0.022 (thousand units) less than a non-urban store, holding the other variables constant (a tiny effect that, as shown below, is not significant).
#For USYes: a store in the US is predicted to sell about 1.20 (thousand units) more than a non-US store, holding the other variables constant.

##(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

#Sales = 13.04 - 0.054(Price) - 0.022(UrbanYes) + 1.20(USYes), where UrbanYes = 1 if the store is in an urban location and 0 otherwise, and USYes = 1 if the store is in the US and 0 otherwise.
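
#As a quick sanity check of the equation (the inputs here are hypothetical, chosen only for illustration): an urban US store charging 120 is predicted at 13.04 - 0.054(120) - 0.022 + 1.20, roughly 7.7 thousand units.

predict(lm1, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))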

##(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

summary(lm1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
#The null hypothesis for each coefficient is that the slope = 0 and the alternative hypothesis is that the slope != 0.

#Based on the p-values, if a p-value is less than 0.05 we reject the null. Price and USYes have very small p-values, so we reject the null for those two predictors and conclude their slopes are not 0; for UrbanYes (p = 0.936) we fail to reject the null.

##(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

lm2 <- lm(Sales ~ Price + US, data=car)

##(f) How well do the models in (a) and (e) fit the data?

summary(lm1)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
summary(lm2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = car)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16
# Both models are significant, but to compare them we look at R^2, which measures the proportion of variation explained.
# For the model with Price, UrbanYes, and USYes the R^2 is 23.9%, and for the model with only Price and USYes it is also 23.9%, so dropping Urban loses essentially nothing (the adjusted R^2 is actually slightly higher for the smaller model, 0.2354 vs 0.2335). Both models, however, explain only about a quarter of the variation in Sales, so neither fits the data especially well.
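
#A formal way to compare the two nested models is an F-test for whether Urban adds anything beyond Price and US; a sketch:

anova(lm2, lm1)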

##(g) Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(lm2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
#95% confidence interval for Price is (-0.065, -0.0442)
#95% confidence interval for USYes is (0.692, 1.708)

##(h) Is there evidence of outliers or high leverage observations in the model from (e)?

plot(lm2)

#Based on the diagnostic plots above, there appear to be a few observations with relatively large residuals (possible outliers), and the residuals-vs-leverage plot identifies some observations with unusually high leverage.
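
#A hedged numeric check of both: studentized residuals beyond about |3| would flag outliers, and leverage above roughly 2(p+1)/n = 2(3)/400 flags high-leverage observations.

range(rstudent(lm2))
which(hatvalues(lm2) > 2 * length(coef(lm2)) / nobs(lm2))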

Q12

This problem involves simple linear regression without an intercept.

##(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

#From (3.38), regressing Y onto X without an intercept gives β̂ = (Σ xi*yi) / (Σ xi^2), while regressing X onto Y gives (Σ xi*yi) / (Σ yi^2). The numerators are the same, so the two estimates are equal exactly when Σ xi^2 = Σ yi^2.

##(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

# Simulate x and y so that sum(x^2) != sum(y^2); the two no-intercept fits below then give different slope estimates.
x=rnorm(100)
y=rbinom(100,2,0.3)
eg<-lm(y~x+0)
summary(eg)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.20978  0.07236  0.86278  1.02152  2.16120 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.10995    0.09337   1.178    0.242
## 
## Residual standard error: 0.8758 on 99 degrees of freedom
## Multiple R-squared:  0.01381,    Adjusted R-squared:  0.003852 
## F-statistic: 1.387 on 1 and 99 DF,  p-value: 0.2418
eg1<-lm(x~y+0)
summary(eg1)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.7869 -0.8114 -0.1615  0.5770  2.0223 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y   0.1256     0.1067   1.178    0.242
## 
## Residual standard error: 0.9362 on 99 degrees of freedom
## Multiple R-squared:  0.01381,    Adjusted R-squared:  0.003852 
## F-statistic: 1.387 on 1 and 99 DF,  p-value: 0.2418
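
#Note that the two slope estimates differ (0.110 for y ~ x vs. 0.126 for x ~ y), as expected since sum(x^2) and sum(y^2) are not equal for this simulated data, even though the t-statistics and R^2 of the two fits are identical.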

##(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

# Here y is just a permutation (the reverse) of x, so sum(x^2) = sum(y^2) and the two no-intercept slope estimates must agree.
x=1:100
y=100:1
eg3<-lm(y~x+0)
summary(eg3)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
eg4<-lm(x~y+0)
summary(eg4)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
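
#As expected, both regressions return the same estimate (0.5075): because y is x in reverse order, sum(x^2) = sum(y^2), so the coefficient from regressing Y onto X equals the coefficient from regressing X onto Y.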