Question 2

Carefully explain the differences between KNN classifier and KNN regression methods

  • KNN classifier: used to predict a qualitative (categorical) response. The prediction is the class with the highest estimated probability among the K nearest observations, i.e. the most common class among those neighbours.

  • KNN regression: used to predict a quantitative (numeric) response. The prediction is the average of the response values of the K nearest observations.
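The contrast can be sketched in a few lines of base R. This is only an illustration on made-up one-dimensional data; the helper names (nearest_k, knn_classify, knn_regress) and the toy values are assumptions, not part of the exercise.

```r
# Toy training data (made up for illustration).
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")  # qualitative response
train_value <- c(1.0, 2.1, 2.9, 6.2, 6.8, 8.1)  # quantitative response

# Indices of the k training points closest to x0.
nearest_k <- function(x0, x, k) order(abs(x - x0))[seq_len(k)]

# KNN classifier: majority vote, i.e. the class with the highest
# estimated probability among the k neighbours.
knn_classify <- function(x0, x, labels, k) {
  votes <- table(labels[nearest_k(x0, x, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k neighbours.
knn_regress <- function(x0, x, values, k) {
  mean(values[nearest_k(x0, x, k)])
}

pred_class <- knn_classify(2.5, train_x, train_class, k = 3)
pred_value <- knn_regress(2.5, train_x, train_value, k = 3)
```

Both functions use the same neighbours; only the way the neighbours' responses are combined (vote versus average) differs.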

Question 9

This question involves the use of multiple linear regression on the Auto data set.

Auto = read.table("Data/Auto.data", header = TRUE, na.strings = "?")
Auto=na.omit(Auto)
str(Auto)
## 'data.frame':    392 obs. of  9 variables:
##  $ mpg         : num  18 15 18 16 17 15 14 14 14 15 ...
##  $ cylinders   : int  8 8 8 8 8 8 8 8 8 8 ...
##  $ displacement: num  307 350 318 304 302 429 454 440 455 390 ...
##  $ horsepower  : num  130 165 150 150 140 198 220 215 225 190 ...
##  $ weight      : num  3504 3693 3436 3433 3449 ...
##  $ acceleration: num  12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
##  $ year        : int  70 70 70 70 70 70 70 70 70 70 ...
##  $ origin      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ name        : chr  "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
##  - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
##   ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...

Question 9a

Produce a scatter plot matrix which includes all of the variables in the data set

pairs(Auto[1:8], main= "Scatterplots for Auto data set")

Question 9b

Compute the matrix of correlations between the variables using the cor() function. You will need to exclude the name variable, which is qualitative.

cor(Auto[,1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Question 9c

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.

lm.auto = lm(mpg~.-name,data=Auto)
summary(lm.auto)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

9c (i)

Is there a relationship between the predictors and the response?

  • Ho: All slope coefficients are zero, i.e. there is no relationship between the response and any of the predictors.

  • Ha: At least one slope coefficient is non-zero, i.e. at least one predictor is related to the response.

Inference: In the summary of the Auto linear model, the p-value of the F-statistic (< 2.2e-16) is far below the significance level (0.05), so there is strong evidence to reject the null hypothesis and conclude that there is a relationship between the response and the predictors.
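As a check on the summary output, the overall F-statistic can be recovered from the multiple R-squared, the number of observations and the number of predictors. The sketch below plugs in the rounded values printed above, so it only reproduces the reported 252.4 up to rounding.

```r
r_squared <- 0.8215  # Multiple R-squared from summary(lm.auto)
n <- 392             # observations after na.omit()
p <- 7               # predictors (all variables except mpg and name)

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat <- (r_squared / p) / ((1 - r_squared) / (n - p - 1))

# Upper-tail probability of the F distribution on (p, n - p - 1) df.
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```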

9c (ii)

Which predictors appear to have a statistically significant relationship to the response?

Inference: In the summary of the Auto linear model, the individual p-values of displacement (0.00844), weight (< 2e-16), year (< 2e-16) and origin (4.67e-07) are below the significance level (0.05), so these four predictors appear to be statistically significant in this model. Cylinders (0.12780), horsepower (0.21963) and acceleration (0.41548) are not.

9c (iii)

What does the coefficient for the year variable suggest?

The coefficient of year (0.75) suggests that, holding all other predictors constant, each one-year increase in model year is associated with an average increase of about 0.75 miles per gallon. In other words, newer cars tend to be more fuel efficient.

Question 9d

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

par(mfrow=c(2,2))
plot(lm.auto)

Inference:

  • In the Residuals vs Fitted plot, no clear pattern is visible, which is consistent with homoscedasticity (constant error variance).

  • In the Q-Q plot, the residuals follow the normal reference line in the middle of the distribution but depart from it in the tails, indicating a right-skewed residual distribution.

  • In the Residuals vs Leverage plot, a few observations have large standardized residuals, which suggests a few outliers; the largest residual (13.06) is almost four times the residual standard error (3.328). No point lies outside the dashed Cook's distance lines, so no observation combines high leverage with a large residual strongly enough to be considered influential.
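The visual reading can be backed by numbers. The following is a minimal sketch on simulated data, not on the Auto fit: one point is deliberately placed far out in x (high leverage) and another is shifted vertically (an outlier). The cut-offs used, |studentized residual| > 3 and leverage above three times the average (p + 1)/n, are common rules of thumb rather than fixed standards.

```r
set.seed(1)
n <- 50
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)
x[1] <- 8; y[1] <- 1 + 2 * 8  # far out in x but on the true line: high leverage
y[2] <- y[2] + 10             # large vertical shift: an outlier
fit <- lm(y ~ x)

stud_res <- rstudent(fit)       # studentized residuals
lev      <- hatvalues(fit)      # leverage statistics
p        <- length(coef(fit)) - 1

big_resid <- which(abs(stud_res) > 3)        # candidate outliers
high_lev  <- which(lev > 3 * (p + 1) / n)    # candidate high-leverage points
```

The same three calls (rstudent, hatvalues, which) can be applied to lm.auto to list the observation numbers behind the points seen in the plots.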

Question 9e

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

lm.auto.1=lm(mpg~year+cylinders+acceleration+year:cylinders+year:acceleration, data= Auto)
summary(lm.auto.1)
## 
## Call:
## lm(formula = mpg ~ year + cylinders + acceleration + year:cylinders + 
##     year:acceleration, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.7237  -2.6223  -0.0754   2.1138  14.9288 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)       34.52399   40.53245   0.852  0.39487   
## year               0.07651    0.52786   0.145  0.88484   
## cylinders          1.14108    3.34159   0.341  0.73293   
## acceleration      -4.70792    1.81020  -2.601  0.00966 **
## year:cylinders    -0.05662    0.04399  -1.287  0.19878   
## year:acceleration  0.06205    0.02366   2.623  0.00907 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.105 on 386 degrees of freedom
## Multiple R-squared:  0.727,  Adjusted R-squared:  0.7234 
## F-statistic: 205.6 on 5 and 386 DF,  p-value: < 2.2e-16

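Inference: In this model the year:acceleration interaction is statistically significant (p = 0.00907) with a positive coefficient (0.062), so the effect of acceleration on miles per gallon increases with model year. The year:cylinders interaction is not significant (p = 0.19878).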

lm.auto.2 = lm(mpg~ cylinders*displacement, data = Auto)
summary(lm.auto.2)
## 
## Call:
## lm(formula = mpg ~ cylinders * displacement, data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.0432  -2.4308  -0.2263   2.2048  20.9051 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            48.22040    2.34712  20.545  < 2e-16 ***
## cylinders              -2.41838    0.53456  -4.524 8.08e-06 ***
## displacement           -0.13436    0.01615  -8.321 1.50e-15 ***
## cylinders:displacement  0.01182    0.00207   5.711 2.24e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.454 on 388 degrees of freedom
## Multiple R-squared:  0.6769, Adjusted R-squared:  0.6744 
## F-statistic:   271 on 3 and 388 DF,  p-value: < 2.2e-16

Inference: In this model the cylinders:displacement interaction is highly significant (p = 2.24e-08), so the effect of displacement on miles per gallon depends on the number of cylinders.
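The two models above use different formula operators: a:b adds only the interaction term, while a*b is shorthand for a + b + a:b. A small sketch on simulated data (the data frame d and its columns are made up for illustration) confirms that the two spellings give the same fit.

```r
set.seed(2)
d <- data.frame(a = rnorm(100), b = rnorm(100))
d$y <- 1 + 2 * d$a - d$b + 0.5 * d$a * d$b + rnorm(100)

fit_star  <- lm(y ~ a * b, data = d)        # main effects plus interaction
fit_colon <- lm(y ~ a + b + a:b, data = d)  # the same model written out

same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))

# p-value of the interaction term, read from the coefficient table.
interaction_p <- summary(fit_star)$coefficients["a:b", "Pr(>|t|)"]
```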

Question 9f

Try a few different transformations of the variables, such as log(X), sqrt(X) and X^2. Comment on your findings.

lm.auto.3= lm(mpg~ .-name+log(horsepower),data= Auto)
summary(lm.auto.3)
## 
## Call:
## lm(formula = mpg ~ . - name + log(horsepower), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.5777 -1.6623 -0.1213  1.4913 12.0230 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      8.674e+01  1.106e+01   7.839 4.54e-14 ***
## cylinders       -5.530e-02  2.907e-01  -0.190 0.849230    
## displacement    -4.607e-03  7.108e-03  -0.648 0.517291    
## horsepower       1.764e-01  2.269e-02   7.775 7.05e-14 ***
## weight          -3.366e-03  6.561e-04  -5.130 4.62e-07 ***
## acceleration    -3.277e-01  9.670e-02  -3.388 0.000776 ***
## year             7.421e-01  4.534e-02  16.368  < 2e-16 ***
## origin           8.976e-01  2.528e-01   3.551 0.000432 ***
## log(horsepower) -2.685e+01  2.652e+00 -10.127  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.959 on 383 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8562 
## F-statistic: 292.1 on 8 and 383 DF,  p-value: < 2.2e-16

Inference: In the base model, horsepower is not statistically significant (p = 0.21963) and the multiple R-squared is 0.8215. After adding log(horsepower), both horsepower and log(horsepower) are highly significant, the residual standard error falls from 3.328 to 2.959, and the multiple R-squared rises from 0.8215 to 0.8592 (about 82% to 86%). This suggests that the relationship between miles per gallon and horsepower is non-linear.
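Because the model with the extra log term contains the base model as a special case, the improvement can also be tested formally with anova(). The sketch below uses simulated data in which the response truly depends on the log of the predictor; the names hp and mpg_sim and the generating coefficients are assumptions chosen only for illustration.

```r
set.seed(3)
hp      <- runif(200, 50, 230)
mpg_sim <- 100 - 16 * log(hp) + rnorm(200, sd = 2)

fit_linear <- lm(mpg_sim ~ hp)            # linear term only
fit_log    <- lm(mpg_sim ~ hp + log(hp))  # linear term plus log term

# F-test of the nested models: does the log term reduce the RSS significantly?
comparison <- anova(fit_linear, fit_log)

r2 <- c(linear   = summary(fit_linear)$r.squared,
        with_log = summary(fit_log)$r.squared)
```

For the Auto data the analogous call would be anova(lm.auto, lm.auto.3).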

Question 10

This question should be answered using the Carseats data set.

Question 10a

Fit the multiple regression model to predict Sales using Price, Urban and US.

library(ISLR)  # Carseats ships with the ISLR package (assumed to be installed)
lm.carseat = lm(Sales~Price+Urban+US, data= Carseats)
summary(lm.carseat)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Question 10b

Provide an interpretation of each coefficient in the model.

Price: The coefficient is negative (-0.054), so sales and price are negatively related. Holding the other predictors fixed, each one-unit increase in price is associated with an average decrease in sales of about 0.054 thousand units (roughly 54 car seats), since Sales is recorded in thousands of units.

Urban: The coefficient of UrbanYes (-0.022) means that, holding the other predictors fixed, stores in urban locations sell on average about 22 fewer car seats than stores in rural locations. This difference is not statistically significant (p = 0.936), so there is no evidence that sales depend on whether the store is urban or rural.

US: The coefficient of USYes is positive (1.20), so, holding the other predictors fixed, stores located in the US sell on average about 1.20 thousand (roughly 1,200) more car seats than stores outside the US.

Question 10c

write out the model in equation form

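\[\widehat{\text{Sales}} = 13.04 - 0.054 \times \text{Price} - 0.022 \times \text{Urban} + 1.20 \times \text{US}\]

where Urban = 1 for a store in an urban location (0 otherwise) and US = 1 for a store in the US (0 otherwise).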
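To see how the dummy variables enter the fitted equation, the coefficients from the summary output can be applied by hand. The price of 100 below is a hypothetical value chosen for illustration, and predict_sales is a helper written for this sketch; the indicators take the value 1 for UrbanYes and USYes and 0 otherwise.

```r
# Coefficients copied from summary(lm.carseat).
coefs <- c(intercept = 13.043469, price = -0.054459,
           urban_yes = -0.021916, us_yes = 1.200573)

# urban and us are 0/1 indicators.
predict_sales <- function(price, urban, us) {
  unname(coefs["intercept"] + coefs["price"] * price +
           coefs["urban_yes"] * urban + coefs["us_yes"] * us)
}

sales_urban_us    <- predict_sales(price = 100, urban = 1, us = 1)
sales_rural_nonus <- predict_sales(price = 100, urban = 0, us = 0)
```

With these inputs the urban US store is predicted to sell about 8.78 thousand units and the rural non-US store about 7.60 thousand; the same numbers can be obtained from predict(lm.carseat, newdata = ...).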

Question 10d

For which of the predictors can you reject the null hypothesis?

Price:

  • Ho : The coefficient of Price is zero (car seat price is not associated with car seat sales)
  • Ha : The coefficient of Price is non-zero (car seat price is associated with car seat sales)

Inference: The p-value of Price is < 2e-16, which is less than the significance level 0.05, so there is strong evidence to reject the null hypothesis and conclude that car seat price has a significant association with car seat sales.

US:

  • Ho : The coefficient of USYes is zero (a US location is not associated with car seat sales)
  • Ha : The coefficient of USYes is non-zero (a US location is associated with car seat sales)

Inference: The p-value of USYes is 4.86e-06, which is less than the significance level 0.05, so there is strong evidence to reject the null hypothesis and conclude that a US store location has a significant association with car seat sales.

Urban:

Inference: The p-value of UrbanYes is 0.936, which is greater than the significance level 0.05, so the null hypothesis cannot be rejected for this predictor.

Question 10e

On the basis of your response to the previous question, fit a small model that only uses the predictors for which there is evidence of association with the outcome

Based on the above model, Price and US are the only predictors with a significant association with sales, so we refit the model with those two predictors.

lm.carseat.1 = update(lm.carseat,Sales~Price+US)
summary(lm.carseat.1)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Question 10f

How well do the models in (a) and (e) fit the data?

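Inference: There is little difference in fit between models (a) and (e). Both have a multiple R-squared of 0.2393, so each explains only about 24% of the variance in Sales. Model (e) is marginally better on the adjusted R-squared (0.2354 versus 0.2335) and the residual standard error (2.469 versus 2.472), because it drops the uninformative Urban predictor.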

Question 10g

Using the model from (e), obtain 95% confidence intervals for the coefficients.

confint(lm.carseat.1)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
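The intervals printed by confint() are the estimate plus or minus a t quantile times the standard error. As a sketch, the bounds for Price and USYes can be rebuilt from the rounded estimates and standard errors in summary(lm.carseat.1), using the 397 residual degrees of freedom; small differences in the last digits come from that rounding.

```r
# Estimates and standard errors copied from summary(lm.carseat.1).
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

# 97.5% quantile of the t distribution on 397 residual degrees of freedom.
t_crit <- qt(0.975, df = 397)

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
```

Neither interval contains zero, which agrees with the small p-values of both predictors.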

Question 10h

Is there evidence of outliers or high leverage observations in the model from (e)?

#hatvalues(lm.carseat.1)[order(hatvalues(lm.carseat.1), decreasing = T)]
which.max(hatvalues(lm.carseat.1))
## 43 
## 43
par(mfrow=c(2,2))
plot(lm.carseat.1)

Question 12

This problem involves simple linear regression without an intercept

Question 12(a)

Under what circumstances is the coefficient estimate for the regression of X onto Y without an intercept the same as the coefficient estimate for the regression of Y onto X?

Coefficient estimate for Y onto X without intercept:

\[\hat\beta = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i'=1}^n x_{i'}^2} \]

Coefficient estimate for X onto Y without intercept:

\[\hat\beta_1 = \frac{\sum_{i=1}^n x_iy_i}{\sum_{i'=1}^n y_{i'}^2} \]

Inference:

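The two expressions share the same numerator, so \(\hat\beta\) and \(\hat\beta_1\) are equal exactly when the denominators are equal, that is, when \(\sum_{i=1}^n x_i^2 = \sum_{i=1}^n y_i^2\) (or trivially when \(\sum_{i=1}^n x_iy_i = 0\), in which case both estimates are zero). One simple way to satisfy this is to let y be a permutation of x, but x and y do not have to contain the same values.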
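The closed-form expressions can be checked directly. The sketch below, on simulated data with an arbitrary seed, first confirms that the formula reproduces the lm() estimate without an intercept, and then rescales y so that its sum of squares matches that of x; y_scaled is an illustrative construction and is not a permutation of x.

```r
set.seed(4)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates.
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# The formula agrees with lm() fitted without an intercept.
lm_y_on_x <- unname(coef(lm(y ~ x + 0)))

# Rescale y so that sum(y_scaled^2) equals sum(x^2).
y_scaled <- y * sqrt(sum(x^2) / sum(y^2))
b_scaled_y_on_x <- sum(x * y_scaled) / sum(x^2)
b_scaled_x_on_y <- sum(x * y_scaled) / sum(y_scaled^2)
```

After rescaling, the two estimates coincide even though the values in y_scaled differ from those in x.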

Question 12(b)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(100)
x = rnorm(100)
y = 2*x+rnorm(100)
lm.fit.YonX= lm(y~x+0)
summary(lm.fit.YonX)
## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.04051 -0.42120 -0.06707  0.49725  1.95009 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  1.89466    0.07769   24.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.789 on 99 degrees of freedom
## Multiple R-squared:  0.8573, Adjusted R-squared:  0.8559 
## F-statistic: 594.8 on 1 and 99 DF,  p-value: < 2.2e-16
lm.fit.XonY= lm(x~y+0)
summary(lm.fit.XonY)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17839 -0.22598  0.01977  0.21129  1.10008 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.45249    0.01855   24.39   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3856 on 99 degrees of freedom
## Multiple R-squared:  0.8573, Adjusted R-squared:  0.8559 
## F-statistic: 594.8 on 1 and 99 DF,  p-value: < 2.2e-16

Question 12(c)

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(100)
x1 = rnorm(100)
y1 = sample(x1)
lm.fit.Y1onX1= lm(y1~x1+0)
summary(lm.fit.Y1onX1)
## 
## Call:
## lm(formula = y1 ~ x1 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.29087 -0.59662 -0.04643  0.63796  2.57044 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## x1  -0.0263     0.1005  -0.262    0.794
## 
## Residual standard error: 1.02 on 99 degrees of freedom
## Multiple R-squared:  0.0006919,  Adjusted R-squared:  -0.009402 
## F-statistic: 0.06855 on 1 and 99 DF,  p-value: 0.794
lm.fit.X1onY1= lm(x1~y1+0)
summary(lm.fit.X1onY1)
## 
## Call:
## lm(formula = x1 ~ y1 + 0)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.29992 -0.59756 -0.07285  0.64672  2.57507 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)
## y1  -0.0263     0.1005  -0.262    0.794
## 
## Residual standard error: 1.02 on 99 degrees of freedom
## Multiple R-squared:  0.0006919,  Adjusted R-squared:  -0.009402 
## F-statistic: 0.06855 on 1 and 99 DF,  p-value: 0.794