Carefully explain the differences between KNN classifier and KNN regression methods
| KNN Classifier | KNN Regression |
|---|---|
| 1) It is mainly used to predict the qualitative response | It is mainly used to predict the quantitative response |
| 2) The prediction is the class with the highest estimated probability, i.e. the most common class among the K nearest observations | The prediction is the average response of the K nearest observations |
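The contrast in the table can be sketched in a few lines of base R. The training data and the helper names (`nearest_k`, `knn_classify`, `knn_regress`) are made up for illustration and are not taken from any package; the only difference between the two methods is the last step, a majority vote versus a mean.

```r
# Toy 1-D training data (hypothetical values, for illustration only)
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("a", "a", "a", "b", "b", "b")   # qualitative response
train_value <- c(1.1, 1.9, 3.2, 6.1, 6.8, 8.3)   # quantitative response

# Indices of the k training points closest to the query point x0
nearest_k <- function(x0, k) order(abs(train_x - x0))[1:k]

# KNN classifier: the most frequent class among the k neighbours
knn_classify <- function(x0, k) {
  votes <- table(train_class[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: the mean response of the k neighbours
knn_regress <- function(x0, k) mean(train_value[nearest_k(x0, k)])

pred_class <- knn_classify(2.5, k = 3)
pred_value <- knn_regress(2.5, k = 3)
```

For the query point 2.5 with K = 3, the three nearest training points are 1, 2 and 3, so the classifier returns their common class and the regression returns the mean of their three responses.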
This question involves the use of multiple linear regression on the Auto data set.
Auto = read.table("Data/Auto.data",header=T,na.strings = "?")
Auto=na.omit(Auto)
str(Auto)
## 'data.frame': 392 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## - attr(*, "na.action")= 'omit' Named int [1:5] 33 127 331 337 355
## ..- attr(*, "names")= chr [1:5] "33" "127" "331" "337" ...
Produce a scatter plot matrix which includes all of the variables in the data set
pairs(Auto[1:8], main= "Scatterplots for Auto data set")
Compute the matrix of correlations between the variables using the cor() function. You will need to exclude the name variable, which is qualitative.
cor(Auto[,1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
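Selecting columns by position (1:8) works here because name is the last column. A slightly more defensive sketch picks the numeric columns programmatically. The built-in mtcars data set, with an added label column, stands in for Auto so that the snippet runs on its own; that substitution is an assumption made for illustration.

```r
# Stand-in for Auto: a few mtcars columns plus a qualitative column
cars <- mtcars[, c("mpg", "cyl", "disp", "hp", "wt")]
cars$name <- rownames(mtcars)

# Keep only the numeric columns, whatever their position
num_cols <- vapply(cars, is.numeric, logical(1))
cor_mat  <- cor(cars[, num_cols])
round(cor_mat, 2)
```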
Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output.
lm.auto = lm(mpg~.-name,data=Auto)
summary(lm.auto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Is there a relationship between the predictors and the response?
H0: None of the predictors is related to the response (all slope coefficients are zero).
Ha: At least one predictor is related to the response (at least one slope coefficient is non-zero).
Inference: Based on the summary of the Auto linear model, the p-value of the F-statistic (< 2.2e-16) is far below the significance level (0.05), so there is strong evidence to reject the null hypothesis and conclude that at least one predictor is related to the response.
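As a rough check, the overall F-statistic can be recovered from the reported R-squared, the number of predictors (7) and the residual degrees of freedom (384). The sketch below uses only figures printed in the summary above; because R-squared is rounded to four decimals, the result agrees with the reported 252.4 only approximately.

```r
r2 <- 0.8215   # Multiple R-squared from the summary
p  <- 7        # number of predictors
n  <- 392      # observations after na.omit()

# F = (R^2 / p) / ((1 - R^2) / (n - p - 1))
f_stat  <- (r2 / p) / ((1 - r2) / (n - p - 1))

# Upper-tail probability under the F(p, n - p - 1) distribution
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)
```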
Which predictors appear to have a statistically significant relationship to the response
Inference: Based on the summary of the Auto linear model, the individual p-values for displacement (0.00844), weight (< 2e-16), year (< 2e-16) and origin (4.67e-07) are below the significance level (0.05), so these four predictors appear to be statistically significant in this model. Cylinders, horsepower and acceleration are not significant at that level.
What does the coefficient for the year variable suggest?
The year coefficient (0.75) suggests that, holding all other predictors constant, each one-unit increase in model year is associated with an average increase of about 0.75 miles per gallon. In other words, newer cars tend to be more fuel efficient.
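The t value and p-value reported for year follow directly from its estimate and standard error. A small sketch using only the numbers printed in the coefficient table (the variable names are illustrative):

```r
est <- 0.750773   # Estimate for year
se  <- 0.050973   # Std. Error for year
df  <- 384        # residual degrees of freedom

t_value <- est / se                                     # test of H0: coefficient = 0
p_value <- 2 * pt(abs(t_value), df, lower.tail = FALSE) # two-sided p-value

# Implied average change in mpg over ten model years, other predictors fixed
ten_year_effect <- 10 * est
```

The ratio reproduces the reported t value of 14.729 up to rounding, and ten model years correspond to roughly 7.5 mpg under this model.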
Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?
par(mfrow=c(2,2))
plot(lm.auto)
Inference:
In the Residuals vs Fitted plot, no clear pattern is apparent, which suggests that the constant-variance (homoscedasticity) assumption is reasonable.
Based on the Q-Q plot, the residuals follow the normal reference line in the middle of the distribution but depart from it in the tails, which indicates right skew.
In the Residuals vs Leverage plot, a few observations have large standardized residuals, which suggests that the data set contains a few outliers. No point lies outside the dashed Cook's distance contours, so no single observation appears to be unduly influential on the fit.
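Visual readings of diagnostic plots can be backed up numerically. The sketch below flags candidate outliers and high-leverage points with two common rules of thumb, which are assumptions rather than fixed standards: an absolute studentized residual above 3, and a leverage above twice the average leverage (p + 1)/n. The built-in mtcars data stands in for Auto so that the snippet runs on its own.

```r
# Stand-in model; with the Auto data this would be lm.auto
fit <- lm(mpg ~ wt + hp, data = mtcars)

n <- nrow(mtcars)
p <- length(coef(fit)) - 1   # number of predictors

# Candidate outliers: large studentized residuals
outliers <- which(abs(rstudent(fit)) > 3)

# Candidate high-leverage points: leverage well above the average (p + 1) / n
avg_leverage  <- (p + 1) / n
high_leverage <- which(hatvalues(fit) > 2 * avg_leverage)
```

The same two lines applied to `lm.auto` would list the row names of any flagged observations, which can then be compared with the points labelled in the plots.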
Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?
lm.auto.1=lm(mpg~year+cylinders+acceleration+year:cylinders+year:acceleration, data= Auto)
summary(lm.auto.1)
##
## Call:
## lm(formula = mpg ~ year + cylinders + acceleration + year:cylinders +
## year:acceleration, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.7237 -2.6223 -0.0754 2.1138 14.9288
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.52399 40.53245 0.852 0.39487
## year 0.07651 0.52786 0.145 0.88484
## cylinders 1.14108 3.34159 0.341 0.73293
## acceleration -4.70792 1.81020 -2.601 0.00966 **
## year:cylinders -0.05662 0.04399 -1.287 0.19878
## year:acceleration 0.06205 0.02366 2.623 0.00907 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.105 on 386 degrees of freedom
## Multiple R-squared: 0.727, Adjusted R-squared: 0.7234
## F-statistic: 205.6 on 5 and 386 DF, p-value: < 2.2e-16
Inference: In this model the year:acceleration interaction is statistically significant (p = 0.00907) with a positive coefficient, whereas the year:cylinders interaction is not (p = 0.19878).
lm.auto.2 = lm(mpg~ cylinders*displacement, data = Auto)
summary(lm.auto.2)
##
## Call:
## lm(formula = mpg ~ cylinders * displacement, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0432 -2.4308 -0.2263 2.2048 20.9051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.22040 2.34712 20.545 < 2e-16 ***
## cylinders -2.41838 0.53456 -4.524 8.08e-06 ***
## displacement -0.13436 0.01615 -8.321 1.50e-15 ***
## cylinders:displacement 0.01182 0.00207 5.711 2.24e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.454 on 388 degrees of freedom
## Multiple R-squared: 0.6769, Adjusted R-squared: 0.6744
## F-statistic: 271 on 3 and 388 DF, p-value: < 2.2e-16
Inference: In this model the cylinders:displacement interaction is statistically significant (p = 2.24e-08), so the effect of displacement on miles per gallon depends on the number of cylinders.
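The two models above use the two formula operators differently: `a:b` adds only the interaction term, while `a*b` is shorthand for `a + b + a:b`. A quick sketch on the built-in mtcars data (a stand-in chosen so the snippet is self-contained) shows that both spellings give the same fit:

```r
# Shorthand: main effects plus interaction
fit_star  <- lm(mpg ~ cyl * disp, data = mtcars)

# Explicit: the same three terms written out
fit_colon <- lm(mpg ~ cyl + disp + cyl:disp, data = mtcars)

same_fit <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))
```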
Try a few different transformations of the variables, such as log(X), sqrt(X) and X^2. Comment on the findings.
lm.auto.3= lm(mpg~ .-name+log(horsepower),data= Auto)
summary(lm.auto.3)
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.5777 -1.6623 -0.1213 1.4913 12.0230
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.674e+01 1.106e+01 7.839 4.54e-14 ***
## cylinders -5.530e-02 2.907e-01 -0.190 0.849230
## displacement -4.607e-03 7.108e-03 -0.648 0.517291
## horsepower 1.764e-01 2.269e-02 7.775 7.05e-14 ***
## weight -3.366e-03 6.561e-04 -5.130 4.62e-07 ***
## acceleration -3.277e-01 9.670e-02 -3.388 0.000776 ***
## year 7.421e-01 4.534e-02 16.368 < 2e-16 ***
## origin 8.976e-01 2.528e-01 3.551 0.000432 ***
## log(horsepower) -2.685e+01 2.652e+00 -10.127 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.959 on 383 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8562
## F-statistic: 292.1 on 8 and 383 DF, p-value: < 2.2e-16
Inference: In the base model, horsepower is not statistically significant (p = 0.22) and the model explains about 82% of the variance in mpg (R-squared = 0.8215). After adding log(horsepower), both horsepower and log(horsepower) are highly significant and R-squared rises from about 82% to about 86% (0.8592), which suggests that the relationship between mpg and horsepower is non-linear.
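The other transformations named in the question can be compared in the same way. One detail worth noting is that a squared term has to be wrapped in `I()`, because `^` has a special meaning inside a formula. The sketch below uses the built-in mtcars data as a stand-in for Auto, so its numbers say nothing about the Auto results.

```r
base_fit <- lm(mpg ~ wt + hp, data = mtcars)

# Add one transformed copy of hp at a time
fits <- list(
  base   = base_fit,
  log    = update(base_fit, . ~ . + log(hp)),
  sqrt   = update(base_fit, . ~ . + sqrt(hp)),
  square = update(base_fit, . ~ . + I(hp^2))   # I() protects ^ in a formula
)

# Adjusted R-squared penalises the extra term, so it is a fairer comparison
adj_r2 <- sapply(fits, function(f) summary(f)$adj.r.squared)
```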
This question should be answered using the Carseats data set.
Fit the multiple regression model to predict Sales using Price, Urban and US.
library(ISLR) # provides the Carseats data set
lm.carseat = lm(Sales~Price+Urban+US, data= Carseats)
summary(lm.carseat)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
Provide an interpretation of each coefficient in the model.
Price: The coefficient is negative (-0.054), so holding the other predictors fixed, each one-unit increase in price is associated with a decrease in sales of about 0.054 units (about 54 car seats, since Sales is recorded in thousands of units).
Urban: The coefficient (-0.022) implies that, holding the other predictors fixed, sales in urban stores are about 0.022 units (about 22 car seats) lower than in rural stores. This effect is not statistically significant (p = 0.936).
US: The coefficient is positive (1.20), so holding the other predictors fixed, stores located in the US sell about 1.2 units (about 1,200 car seats) more than stores outside the US.
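Write out the model in equation form.
\[Sales = 13.04 - 0.054 \times Price - 0.022 \times Urban_{Yes} + 1.20 \times US_{Yes}\]
Here \(Urban_{Yes}\) is 1 for a store in an urban location and 0 otherwise, and \(US_{Yes}\) is 1 for a store in the US and 0 otherwise.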
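To make the equation concrete, the fitted coefficients from the summary can be plugged in for a single store. The helper `predict_sales` and the example store (price 100, urban, in the US) are hypothetical and only illustrate how the two indicator variables enter the prediction.

```r
# Coefficients copied from the summary of lm.carseat
predict_sales <- function(price, urban, us) {
  # urban and us are 0/1 indicators (1 = "Yes")
  13.043469 - 0.054459 * price - 0.021916 * urban + 1.200573 * us
}

# An urban US store charging a price of 100
pred <- predict_sales(price = 100, urban = 1, us = 1)
```

The result is about 8.78, i.e. roughly 8,780 car seats, since Sales is measured in thousands of units.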
For which of the predictors can you reject the null hypothesis?
Price:
Inference: The p-value for Price is < 2e-16, which is below the significance level of 0.05, so there is strong evidence to reject the null hypothesis and conclude that the car seat price has a significant effect on car seat sales.
US:
Inference: The p-value for US is 4.86e-06, which is below the significance level of 0.05, so there is strong evidence to reject the null hypothesis and conclude that whether a store is located in the US has a significant effect on car seat sales.
Urban:
Inference: The p-value for Urban is 0.936, which is well above 0.05, so the null hypothesis cannot be rejected for this predictor.
On the basis of your response to the previous question, fit a small model that only uses the predictors for which there is evidence of association with the outcome
Based on the above model, only Price and US have a significant effect on sales, so we refit the model with those two predictors.
lm.carseat.1 = update(lm.carseat,Sales~Price+US)
summary(lm.carseat.1)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
How well do the models in (a) and (e) fit the data?
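Inference: There is not much difference in fit between the models from (a) and (e). Both explain about 24% of the variance in Sales (R-squared = 0.2393); the smaller model has a slightly higher adjusted R-squared (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472).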
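The small gap in adjusted R-squared comes entirely from the penalty for the extra predictor, since both models report the same R-squared. A sketch of that calculation, using only figures from the two summaries (n = 400 stores; the helper name `adj_r2` is illustrative):

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_a <- adj_r2(0.2393, n = 400, p = 3)   # model (a): Price, Urban, US
adj_e <- adj_r2(0.2393, n = 400, p = 2)   # model (e): Price, US
```

Both values match the summaries up to the rounding of R-squared, and the model without Urban comes out marginally higher.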
Using the model from (e), obtain 95% confidence intervals for the coefficients.
confint(lm.carseat.1)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
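The intervals returned by confint() are the usual estimate plus or minus a t quantile times the standard error. A sketch reproducing them from the coefficient table of the smaller model, with 397 residual degrees of freedom (small differences are expected because the printed estimates are rounded):

```r
# Estimates and standard errors copied from summary(lm.carseat.1)
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)

t_crit <- qt(0.975, df = 397)   # two-sided 95% critical value

ci <- cbind(lower = est - t_crit * se,
            upper = est + t_crit * se)
```

Neither interval contains zero, which is consistent with both predictors being significant at the 5% level.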
Is there evidence of outliers or high leverage observations in the model from (e)
#hatvalues(lm.carseat.1)[order(hatvalues(lm.carseat.1), decreasing = T)]
which.max(hatvalues(lm.carseat.1))
## 43
## 43
par(mfrow=c(2,2))
plot(lm.carseat.1)
This problem involves simple linear regression without an intercept
Under what circumstances is the coefficient estimate for the regression of X onto Y without an intercept the same as the coefficient estimate for the regression of Y onto X?
\[\hat\beta = \sum_{i=1}^n x_i y_i \Big/ \sum_{i'=1}^n x_{i'}^2 \]
\[\hat\beta_1 = \sum_{i=1}^n x_i y_i \Big/ \sum_{i'=1}^n y_{i'}^2 \]
Here \(\hat\beta\) is the estimate for the regression of Y onto X and \(\hat\beta_1\) is the estimate for the regression of X onto Y. The two share the same numerator, so they are equal when the sums of squares of x and y are equal (or, trivially, when \(\sum x_i y_i = 0\)). One simple way to satisfy this is for y to contain the same values as x, possibly in a different order.
Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.
set.seed(100)
x = rnorm(100)
y = 2*x+rnorm(100)
lm.fit.YonX= lm(y~x+0)
summary(lm.fit.YonX)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.04051 -0.42120 -0.06707 0.49725 1.95009
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 1.89466 0.07769 24.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.789 on 99 degrees of freedom
## Multiple R-squared: 0.8573, Adjusted R-squared: 0.8559
## F-statistic: 594.8 on 1 and 99 DF, p-value: < 2.2e-16
lm.fit.XonY= lm(x~y+0)
summary(lm.fit.XonY)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17839 -0.22598 0.01977 0.21129 1.10008
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.45249 0.01855 24.39 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3856 on 99 degrees of freedom
## Multiple R-squared: 0.8573, Adjusted R-squared: 0.8559
## F-statistic: 594.8 on 1 and 99 DF, p-value: < 2.2e-16
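The two slopes can be tied back to the closed-form expressions for regression through the origin. Because the estimates share the numerator \(\sum x_i y_i\), their product equals \((\sum x_i y_i)^2 / (\sum x_i^2 \sum y_i^2)\), which is the R-squared that R reports for a no-intercept fit. With the printed values, 1.89466 × 0.45249 ≈ 0.8573, matching both summaries. The sketch below checks these identities on data generated as above; the object names are illustrative.

```r
set.seed(100)
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form slopes for regression through the origin
b_yx <- sum(x * y) / sum(x^2)   # Y onto X
b_xy <- sum(x * y) / sum(y^2)   # X onto Y

fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)

# R-squared as reported by summary() for the no-intercept model
r2 <- summary(fit_yx)$r.squared
```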
Generate an example in R with n=100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.
set.seed(100)
x1 = rnorm(100)
y1 = sample(x1)
lm.fit.Y1onX1= lm(y1~x1+0)
summary(lm.fit.Y1onX1)
##
## Call:
## lm(formula = y1 ~ x1 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.29087 -0.59662 -0.04643 0.63796 2.57044
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x1 -0.0263 0.1005 -0.262 0.794
##
## Residual standard error: 1.02 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06855 on 1 and 99 DF, p-value: 0.794
lm.fit.X1onY1= lm(x1~y1+0)
summary(lm.fit.X1onY1)
##
## Call:
## lm(formula = x1 ~ y1 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.29992 -0.59756 -0.07285 0.64672 2.57507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y1 -0.0263 0.1005 -0.262 0.794
##
## Residual standard error: 1.02 on 99 degrees of freedom
## Multiple R-squared: 0.0006919, Adjusted R-squared: -0.009402
## F-statistic: 0.06855 on 1 and 99 DF, p-value: 0.794
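The two estimates agree because `sample()` only reorders the values, so the sums of squares of x1 and y1 are identical. A short sketch confirming this, together with a second construction (y equal to minus x) that also satisfies the condition; the names `y2`, `b_yx` and `b_xy` are illustrative.

```r
set.seed(100)
x1 <- rnorm(100)
y1 <- sample(x1)   # same values as x1, shuffled

# A permutation leaves the sum of squares unchanged
same_ss <- isTRUE(all.equal(sum(x1^2), sum(y1^2)))

b_yx <- sum(x1 * y1) / sum(x1^2)   # Y onto X, no intercept
b_xy <- sum(x1 * y1) / sum(y1^2)   # X onto Y, no intercept

# Another case with equal sums of squares: y = -x gives a slope of -1 both ways
y2 <- -x1
b_neg <- sum(x1 * y2) / sum(x1^2)
```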