2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier is used when the response is qualitative, while KNN regression is used when the response is quantitative. Both methods work in a similar way: given a prediction point x0, they identify the K training observations nearest to x0. The classifier then assigns x0 to the class that is most common among those neighbors, whereas KNN regression estimates f(x0) as the average of the neighbors' response values.
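As a quick illustration of that mechanical difference, here is a minimal sketch on simulated data; it assumes the class and FNN packages are installed (they are not used elsewhere in this assignment):

library(class)   # knn() performs KNN classification
library(FNN)     # knn.reg() performs KNN regression
set.seed(1)
train.X<-matrix(rnorm(100), ncol = 2)   # 50 training points
test.X<-matrix(rnorm(20), ncol = 2)     # 10 test points
# classification: the K nearest neighbors vote on a qualitative label
train.cl<-factor(ifelse(train.X[,1] > 0, "A", "B"))
knn(train.X, test.X, cl = train.cl, k = 5)
# regression: the K nearest neighbors' quantitative responses are averaged
train.y<-train.X[,1] + rnorm(50)
knn.reg(train.X, test.X, y = train.y, k = 5)$pred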

9. This question involves the use of multiple linear regression on the Auto data set.

(a) Produce a scatterplot matrix which includes all of the variables in the data set.

AT<-read.csv("C:/Users/Joshua Escareno/Desktop/Auto.csv", header = TRUE)
pairs(AT)

(b) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
AT<-read.csv("C:/Users/Joshua Escareno/Desktop/Auto.csv", header = TRUE)
# horsepower is read in as text; coerce it to numeric (non-numeric entries become NA)
AT$horsepower<-as.numeric(as.character(AT$horsepower))
## Warning: NAs introduced by coercion
# drop the rows where coercion produced NAs before computing correlations
AT2<-na.omit(AT)
cor(AT2[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

(c) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

Auto <-read.csv("C:/Users/Joshua Escareno/Desktop/Auto.csv", header = TRUE)
Auto$horsepower<-as.numeric(as.character(Auto$horsepower))
## Warning: NAs introduced by coercion
Auto2<-na.omit(Auto)
summary(lm(formula = mpg~.-name, data = Auto2))
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

The p-value of the F-statistic reported by the linear model is < 2.2e-16, which is extremely small, so yes, there is a significant relationship between the predictors and the response (mpg).

ii. Which predictors appear to have a statistically significant relationship to the response?

Judging by the p-values, displacement, weight, year, and origin have statistically significant relationships with mpg (all below 0.05), while cylinders, horsepower, and acceleration do not. Significance is determined by the p-values, not by the sign of the estimates.

iii. What does the coefficient for the year variable suggest?

The coefficient of year (0.750773) is the amount by which mpg changes for each one-year increase in model year, holding all other predictors constant: cars gain roughly 0.75 mpg per year.

(d) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow = c(2,2))
AT<-lm(formula = mpg~.-name, data = Auto2)
plot(AT)

The residual plots suggest a few unusually large outliers, and the leverage plot flags at least one observation with unusually high leverage.
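To go beyond eyeballing the plots, base R can put numbers on these claims; the cutoffs below (|studentized residual| > 3, leverage above twice the average (p+1)/n) are common rules of thumb, not part of the original output:

# flag unusually large outliers
which(abs(rstudent(AT)) > 3)
# flag observations with unusually high leverage
p<-length(coef(AT)) - 1
which(hatvalues(AT) > 2*(p + 1)/nrow(Auto2))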

(e) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

AT2<-lm(mpg~year*cylinders + year*weight, data = Auto2)
summary(AT2)
## 
## Call:
## lm(formula = mpg ~ year * cylinders + year * weight, data = Auto2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -8.418 -1.921 -0.071  1.676 12.955 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    -1.091e+02  1.298e+01  -8.408 8.17e-16 ***
## year            2.025e+00  1.722e-01  11.758  < 2e-16 ***
## cylinders      -5.986e+00  4.398e+00  -1.361    0.174    
## weight          3.833e-02  9.083e-03   4.220 3.05e-05 ***
## year:cylinders  7.709e-02  5.791e-02   1.331    0.184    
## year:weight    -5.970e-04  1.202e-04  -4.966 1.03e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.192 on 386 degrees of freedom
## Multiple R-squared:  0.8348, Adjusted R-squared:  0.8327 
## F-statistic: 390.2 on 5 and 386 DF,  p-value: < 2.2e-16

The year:cylinders interaction is not statistically significant (p = 0.184), and neither is the cylinders main effect, while the year:weight interaction is highly significant (p = 1.03e-06).
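One way to test the interactions jointly, rather than reading individual p-values, is a nested-model comparison with anova(); this is a supplementary check, not part of the original output:

# main-effects-only model nested inside the interaction model AT2
base<-lm(mpg~year + cylinders + weight, data = Auto2)
anova(base, AT2)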

(f) Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

Auto <-read.csv("C:/Users/Joshua Escareno/Desktop/Auto.csv", header = TRUE)
Auto$horsepower<-as.numeric(as.character(Auto$horsepower))
## Warning: NAs introduced by coercion
Auto2<-na.omit(Auto)
par(mfrow = c(2,2))
plot(Auto2$mpg,Auto2$weight)
plot(log(Auto2$mpg),Auto2$weight)
plot(sqrt(Auto2$mpg),Auto2$weight)
plot((Auto2$mpg)^2,Auto2$weight)

It is clear from the first plot (the untransformed one) that the heavier the car (weight), the lower its fuel economy (mpg).

Applying the log, square-root, and X^2 transformations to mpg changes the scale of the x-axis, but the negative association between the two variables remains; the axis simply responds to the chosen transformation.
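The plots show the shape of the relationship; fitting models with a transformed variable makes the comparison concrete. A minimal sketch (transforming the predictor weight this time, since mpg is the response in the regressions above):

# compare R^2 across transformations of weight
summary(lm(mpg~weight, data = Auto2))$r.squared
summary(lm(mpg~log(weight), data = Auto2))$r.squared
summary(lm(mpg~weight + I(weight^2), data = Auto2))$r.squared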

10. This question should be answered using the Carseats data set.

(a) Fit a multiple regression model to predict Sales using Price, Urban, and US.

Below is the model predicting Sales from Price, Urban, and US:

library(ISLR)
## 
## Attaching package: 'ISLR'
## The following object is masked _by_ '.GlobalEnv':
## 
##     Auto
prediction<-lm(formula = Sales~Price+Urban+US , data = Carseats)
summary(prediction)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

(b) Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

Price is quantitative: its coefficient means that, holding Urban and US fixed, a one-dollar increase in price is associated with a decrease of about 0.0545 in Sales, i.e. about 54 fewer car seats sold, since Sales is recorded in thousands of units. UrbanYes and USYes are dummy variables for the qualitative predictors: the UrbanYes coefficient (-0.0219, not significant) says urban stores sell essentially the same as non-urban stores, while the USYes coefficient (1.2006) says stores in the US sell about 1,200 more car seats than stores outside the US, all else equal.

(c) Write out the model in equation form, being careful to handle the qualitative variables properly.

Our equation: Sales = 13.043469 - 0.054459*Price - 0.021916*UrbanYes + 1.200573*USYes + error, where UrbanYes = 1 if the store is in an urban area (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
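The equation can be sanity-checked with predict(); the store below (Price = 120, urban, in the US) is a made-up example:

predict(prediction, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))
# by hand: 13.043469 - 0.054459*120 - 0.021916 + 1.200573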

(d) For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Looking at the p-values in the summary, the null hypothesis can be rejected for Price and US; it cannot be rejected for Urban.

(e) On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

library(ISLR)
prediction1<-lm(formula = Sales~Price+US , data = Carseats)
summary(prediction1)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

(f) How well do the models in (a) and (e) fit the data?

Both have a multiple R-squared of about 0.239, so they fit the data about equally well; the smaller model in (e) has a slightly higher adjusted R-squared (0.2354 vs. 0.2335), so dropping Urban costs nothing.
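Since the model in (e) is nested inside the model in (a), anova() gives a formal comparison; as a supplementary check:

# F-test for dropping Urban from the model in (a)
anova(prediction1, prediction)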

(g) Using the model from (e), obtain 95% confidence intervals for the coefficient(s).

confint(prediction1, level = .95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

(h) Is there evidence of outliers or high leverage observations in the model from (e)?

The residual plots suggest a few potential outliers, and the leverage plot flags some observations with above-average leverage, but nothing extreme enough to call the broadly linear fit into question.
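A sketch of how this can be checked numerically, using the same rules of thumb as in 9(d):

par(mfrow = c(2,2))
plot(prediction1)
# outliers: studentized residuals beyond +/-3
which(abs(rstudent(prediction1)) > 3)
# high leverage: hat values above twice the average (p+1)/n
which(hatvalues(prediction1) > 2*length(coef(prediction1))/nrow(Carseats))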

12. This problem involves simple linear regression without an intercept.

(a) Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient given by (3.38) is β̂ = Σᵢ xᵢyᵢ / Σᵢ xᵢ². The coefficient for the regression of X onto Y has the same numerator but Σᵢ yᵢ² in the denominator. Since the numerators always match, the two estimates are equal exactly when the denominators are equal, i.e. when Σᵢ xᵢ² = Σᵢ yᵢ².

(b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

x<-rnorm(100, mean = 0, sd = 1)   # x and y are drawn independently,
y<-rnorm(100, mean = 0, sd = 1)   # so sum(x^2) != sum(y^2) in general
Relation<-lm(x~y)                 # regression of x onto y
Relation
## 
## Call:
## lm(formula = x ~ y)
## 
## Coefficients:
## (Intercept)            y  
##     0.03997     -0.11883
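The question asks to compare both directions, so the companion fit of y onto x belongs alongside the output above; a minimal sketch (run in the same session so x and y are unchanged):

Relation2<-lm(y~x)
coef(Relation)["y"]    # slope from regressing x onto y
coef(Relation2)["x"]   # slope from regressing y onto x; different in general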

(c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x<-rnorm(100, mean = 0, sd = 1)
y<-sample(x)        # y is a permutation of x, so sum(x^2) == sum(y^2)
coef(lm(y~x + 0))   # slope of y onto x without an intercept, as in (3.38)
coef(lm(x~y + 0))   # slope of x onto y; equal to the slope above

Because y is just a reshuffling of x, the two denominators in (3.38) agree, so the two slope estimates coincide.