STA 6543 Chapter 3 Assignment

Problem 2
Carefully explain the differences between the KNN classifier and KNN regression methods.

The difference between the KNN classifier and KNN regression methods is that the classifier is used in situations where the response variable is categorical (qualitative), while the regressor is used in numerical situations (quantitative). The KNN classifier shows Y as 0 or 1, while the KNN regression method predicts the quantitative value of Y, and can also be continuous.

Problem 9
This question involves the use of multiple linear regression on the Auto data set.

Produce a scatterplot matrix which includes all of the variables in the data set.
Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, cor() which is qualitative.
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

Comment on the output. For instance:

Is there a relationship between the predictors and the response?
Which predictors appear to have a statistically significant relationship to the response?
What does the coefficient for the year variable suggest?

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
Try a few different transformations of the variables, such as log(X), ???X, X2. Comment on your findings.

#a)
library(ISLR)

## Warning: package 'ISLR' was built under R version 4.0.5

library(MASS)
plot(Auto)

#b)
Auto1<-Auto
Auto1$name=NULL
cor(Auto1)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

#c)
y1<-lm(mpg~ .-name,data=Auto)
summary(y1)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

# i) There is a relationship between at least one predictor and the response variable shown by a p-value of less than 0.05 for the f statistic meaning we can reject the null hypothesis.
# ii) Displacement, weight, year and origin have a statistically significant relationship with mpg shown be their p-values.
# iii) The coefficient of year variable is significant and positive which suggests that if all other variables are constant than on average mpg increases by 0.75 every year.

#d)
par(mfrow=c(2,2))
plot(y1)

plot(predict(y1),rstudent(y1))
plot(hatvalues(y1))
which.max(hatvalues(y1))

## 14 
## 14

#The ‘Residuals vs Fitted’ plot has a U shape, which shows non-linearity.
#The right tail of the data on the QQ Plot has several points that are not along the normal distribution line and curve upward instead. These points might indicate normality isn't reasonable to assume. 
#Looking at the sqrt(Standardized Residuals) Plot we also see that most observations fall inside the range of 0 and 2. These observations indicate normality might be reasonable to assume.
#Looking at the ‘Residuals vs Leverage’ plot, I see that Cook’s Distance doesn’t have any points within its bounds meaning that there are no influential points causing the slope coefficient to be impacted. 

#e)
y2<-lm(mpg~.:.,Auto1)
summary(y2)

## 
## Call:
## lm(formula = mpg ~ .:., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

y3<-lm(mpg~.*.,Auto1)
summary(y3)

## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16

#Statistically Significant Coefficients with : and *
#Displacement - Acceleration - origin - Displacement: year - Acceleration: year - Acceleration: origin 

#f)
y4<-lm(mpg~weight+I((weight)^2),Auto)
summary(y4)

## 
## Call:
## lm(formula = mpg ~ weight + I((weight)^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.6246  -2.7134  -0.3485   1.8267  16.0866 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    6.226e+01  2.993e+00  20.800  < 2e-16 ***
## weight        -1.850e-02  1.972e-03  -9.379  < 2e-16 ***
## I((weight)^2)  1.697e-06  3.059e-07   5.545 5.43e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.176 on 389 degrees of freedom
## Multiple R-squared:  0.7151, Adjusted R-squared:  0.7137 
## F-statistic: 488.3 on 2 and 389 DF,  p-value: < 2.2e-16

plot(y4)

#The plot shows a non-normal distribution of error terms from the Normal Q-Q plot. 
#Looking at the ‘Residuals vs Leverage’ plot, I see that Cook’s Distance doesn’t have any points within its bounds meaning that there are no influential points causing the slope coefficient to be impacted.

Problem 10
This question should be answered using the Carseats data set.

Fit a multiple regression model to predict Sales using Price, Urban, and US.
Provide an interpretation of each coefficient in the model. Be careful-some of the variables in the model are qualitative!
Write out the model in equation form, being careful to handle the qualitative variables properly.
For which of the predictors can you reject the null hypothesis H0 : ??j = 0?
On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.
How well do the models in (a) and (e) fit the data?
Using the model from (e), obtain 95% confidence intervals for the coefficient(s).
Is there evidence of outliers or high leverage observations in the model from (e)?

#a
sales<-lm(Sales~Price+Urban+US,data=Carseats)
summary(sales)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

#b)
#Price is a continuous variable. and can be interpreted as the effect of one unit increase in price is a change in sales of -0.54459 units or 54 sales with the other variable’s constant. 
#Urban is a categorical variable.  The effect of a store being in an urban area is the change in sales of -.021916 units or 22 sales with the other variable’s constant. However, after looking at the p-value, we can say there is no evidence of a relationship between sales and the store being in an urban area. 
#US is a categorical variable.  The effect of a store being in the US is the change in sales of 1.200573 units or 1200 sales with the other variable’s constant.

#c)
#Sales=13.043469 - 0.054459*(Price) - (0.021916)*(1 ,if Urban is Yes 0 if not yes) + 1.200573 (1,if US is Yes 0 if not yes)

#d)
#The null hypothesis can be rejected for Price and UrbanYes as both have p-values less than 0.05. 

#e)
sales_var<-lm(Sales~Price+US,data=Carseats)
summary(sales_var)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

#f)
anova(sales,sales_var)

## Analysis of Variance Table
## 
## Model 1: Sales ~ Price + Urban + US
## Model 2: Sales ~ Price + US
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1    396 2420.8                           
## 2    397 2420.9 -1  -0.03979 0.0065 0.9357

#There is slight increase in adjusted r square and decrease in residual standard error after removing the Urban variable. The Avona test shows that the difference is not statistically significant for both models and therefore we do not reject the null hypothesis. 

#g)
confint(sales_var)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

#Using the confidence Interval test we can see that there is a 95% chance that the true parameter for Price falls within the in interval: (-0.0648, -0.0442), and a 5% probability that it doesn’t fall within that interval.

#h)
plot(predict(sales_var),rstudent(sales_var))

lev<-hat(model.matrix(sales_var))
plot(lev)

4/nrow(Carseats)

## [1] 0.01

plot(Carseats$Sales,Carseats$Price)
points(Carseats[lev>0.01,]$Sales,Carseats[lev>0.01,]$Price,col='red')

#The rstudentized residuals vs predicted value plot detects presence of outliers. As rstudentized residuals are within the limit of -3 to 3 we can say that there is no outlier present in the data. An outlier is defined as the observation affecting the value of Y to be significantly different from the expected trend.

Problem 12
This problem involves simple linear regression without an intercept.

Recall that the coefficient estimate ^ ?? for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

#a) 
#From the equation, the parameter estimate will be equal if the summation of xi^2 equals the summation of yi ^2.

#b)
x=rnorm(100)
y=rbinom(100,2,0.3)
eg<-lm(y~x+0)
summary(eg)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.1606  0.0283  0.8442  1.0337  2.1046 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## x  0.09137    0.09625   0.949    0.345
## 
## Residual standard error: 0.9278 on 99 degrees of freedom
## Multiple R-squared:  0.00902,    Adjusted R-squared:  -0.0009897 
## F-statistic: 0.9011 on 1 and 99 DF,  p-value: 0.3448

eg1<-lm(x~y+0)
summary(eg1)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.7685 -0.6248 -0.1147  0.6004  2.4687 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)
## y  0.09873    0.10400   0.949    0.345
## 
## Residual standard error: 0.9645 on 99 degrees of freedom
## Multiple R-squared:  0.00902,    Adjusted R-squared:  -0.0009897 
## F-statistic: 0.9011 on 1 and 99 DF,  p-value: 0.3448

We can see that the observations are different for both cases by looking at the coefficients.

#c)
x=1:100
y=100:1
eg3<-lm(y~x+0)
summary(eg3)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

eg4<-lm(x~y+0)
summary(eg4)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

#Now we can see that the observations are the same for the coefficients.

STA 6543 Chapter 3 Assignment

Krischelle Joyner

6/19/2021