The KNN classifier is typically used to solve classification problems, i.e., when attempting to predict qualitative responses. It works by identifying the K training observations nearest to a test point x0 (the neighborhood N0) and estimating the conditional probability P(Y = j | X = x0) for class j as the fraction of points in the neighborhood whose response equals j; the test point is then assigned to the class with the highest estimated probability.
The KNN regression method is a non-parametric method used to solve regression problems, i.e., when attempting to predict quantitative responses. It likewise identifies the K training observations closest to x0 (denoted N0) and then estimates f(x0) as the average of the training responses in that neighborhood.
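As a rough illustration of the difference, here is a minimal sketch. It assumes the class and FNN packages are installed and uses made-up training data, so it is only meant to show the two calls side by side.
# hypothetical toy data: two predictors, a class label, and a numeric response
library(class)  # knn() for classification
library(FNN)    # knn.reg() for regression
set.seed(1)
train_x <- matrix(rnorm(40), ncol = 2)
test_x  <- matrix(rnorm(10), ncol = 2)
train_class <- factor(sample(c("A", "B"), 20, replace = TRUE))
train_y     <- rnorm(20)
# KNN classification: majority vote among the K = 3 nearest neighbors
knn(train = train_x, test = test_x, cl = train_class, k = 3)
# KNN regression: average response of the K = 3 nearest neighbors
knn.reg(train = train_x, test = test_x, y = train_y, k = 3)$pred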
knitr::include_graphics("C:/Users/selen/Desktop/R Folder/KNN.png")
setwd("C:/Users/selen/Desktop/R Folder")
auto<-read.csv("Auto.csv", na.strings = "?")
auto<-na.omit(auto)
pairs(auto)
names(auto)
## [1] "mpg" "cylinders" "displacement" "horsepower"
## [5] "weight" "acceleration" "year" "origin"
## [9] "name"
cor(auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
fit <- lm(mpg ~ . - name, data= auto)
summary(fit)
##
## Call:
## lm(formula = mpg ~ . - name, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
To answer this question, I must test the hypothesis H0: B1 = B2 = ... = Bp = 0 against Ha: at least one Bj is not equal to 0. In this case, I reject the null hypothesis in favor of the alternative: at least one coefficient is nonzero, which means there is a relationship between the predictors and the response. The F-statistic of 252.4 is much greater than 1, and its p-value of less than 2.2e-16 is far below the alpha of 0.05; both provide strong evidence of some relationship between the predictors and the response.
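As a quick check, the overall F-test can be pulled directly from the fitted model above; a short sketch:
# the F-statistic is returned as (value, numerator df, denominator df)
fstat <- summary(fit)$fstatistic
fstat
# recompute the p-value for H0: all slope coefficients are zero
pf(fstat[1], fstat[2], fstat[3], lower.tail = FALSE)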
The predictors that appear to have a statistically significant relationship to the response are displacement, weight, year, and origin (the intercept, which R includes by default, is also significant).
The coefficient of the year variable suggests that the average effect of an increase of one model year is an increase of 0.750773 in mpg, holding all other predictors fixed. In other words, cars become more fuel efficient by approximately 0.75 mpg per model year.
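To see this effect numerically, here is a sketch that compares predictions for two hypothetical cars differing only in model year; the other predictor values below are made up purely for illustration.
# two identical hypothetical cars, one model year apart
newcars <- data.frame(cylinders = 4, displacement = 200, horsepower = 100,
                      weight = 3000, acceleration = 15, year = c(76, 77), origin = 1)
# the difference in predicted mpg equals the year coefficient (~0.75)
diff(predict(fit, newdata = newcars))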
par(mfrow=c(2,2))
plot(fit)
The Residuals vs. Fitted plot shows a slight pattern in the residuals, which indicates some non-linearity in this dataset. As described in the book, the presence of a pattern may indicate a problem with some aspect of the linear model, so this will need further investigation. The Normal Q-Q plot also indicates that the data contain more extreme values than would be expected under a normal distribution. The Residuals vs. Leverage plot shows a few outliers: observations 327 and 394 have high residuals and low leverage, and another observation near the middle of the chart pulls up the Residuals vs. Leverage line and sits somewhat apart from the bulk of the data. Lastly, observation 14 is another unusual point, with an unusually low standardized residual and unusually high leverage.
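The outliers and high-leverage points flagged in the plots can also be confirmed numerically; a rough sketch using studentized residuals and leverage values from the fitted model:
# observations with unusually large studentized residuals (possible outliers)
which(abs(rstudent(fit)) > 3)
# observations whose leverage is well above the average leverage (p + 1)/n
lev <- hatvalues(fit)
which(lev > 3 * mean(lev))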
fit2<- lm(mpg ~ . -name + cylinders:weight, data=auto)
summary(fit2)
##
## Call:
## lm(formula = mpg ~ . - name + cylinders:weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.9484 -1.7133 -0.1809 1.4530 12.4137
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.3143478 5.0076737 1.461 0.14494
## cylinders -5.0347425 0.5795767 -8.687 < 2e-16 ***
## displacement 0.0156444 0.0068409 2.287 0.02275 *
## horsepower -0.0314213 0.0126216 -2.489 0.01322 *
## weight -0.0150329 0.0011125 -13.513 < 2e-16 ***
## acceleration 0.1006438 0.0897944 1.121 0.26306
## year 0.7813453 0.0464139 16.834 < 2e-16 ***
## origin 0.8030154 0.2617333 3.068 0.00231 **
## cylinders:weight 0.0015058 0.0001657 9.088 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared: 0.8531, Adjusted R-squared: 0.8501
## F-statistic: 278.1 on 8 and 383 DF, p-value: < 2.2e-16
fit5<- lm(mpg ~ . -name + displacement:weight, data=auto)
summary(fit5)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
I tested 10 interaction effects between the statistically significant variables. All of the interactions appear statistically significant; however, I chose the two interactions that showed the strongest relationships between the variables and the response. The models with the highest R^2 (the proportion of variation explained) and F-statistic (evidence of a relationship between the predictors and the response) were the ones including the displacement:weight interaction and the cylinders:weight interaction.
An interesting pattern across most of the models with interactions was that while other variables "turned on" to statistical significance once an interaction was included, acceleration remained insignificant throughout.
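The screening of interaction terms could be automated along these lines. This is a hypothetical sketch; the candidate predictors below are an approximation of the pairs I tried, not the exact list.
# candidate predictors to cross with one another
vars <- c("cylinders", "displacement", "weight", "year", "origin")
pairs_to_try <- combn(vars, 2, simplify = FALSE)
# fit the additive model plus one interaction at a time and record adjusted R^2
adj_r2 <- sapply(pairs_to_try, function(p) {
  f <- as.formula(paste("mpg ~ . - name +", p[1], ":", p[2]))
  summary(lm(f, data = auto))$adj.r.squared
})
names(adj_r2) <- sapply(pairs_to_try, paste, collapse = ":")
sort(adj_r2, decreasing = TRUE)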
par(mfrow=c(2,2))
plot((auto$weight), auto$mpg)
plot(log(auto$weight), auto$mpg)
plot(sqrt(auto$weight), auto$mpg)
plot((auto$weight)^2, auto$mpg)
I transformed the weight and displacement variables from the auto dataset to look for trends under each transformation. The two variables showed similar results, so I only include the weight variable's plots in this report, since they depict the trends most clearly. The top-left graph shows the original relationship between weight and mpg. The bottom two panels, sqrt(weight) and weight^2, show essentially the same curved trend as the original. The top-right graph, the log transformation of weight, shows a noticeably more linear pattern.
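To go beyond visual inspection, a quick sketch comparing the R^2 of simple regressions of mpg on each transformation of weight:
# R^2 of mpg regressed on each transformation of weight
sapply(list(raw    = lm(mpg ~ weight, data = auto),
            log    = lm(mpg ~ log(weight), data = auto),
            sqrt   = lm(mpg ~ sqrt(weight), data = auto),
            square = lm(mpg ~ I(weight^2), data = auto)),
       function(m) summary(m)$r.squared)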
library("ISLR")
data(Carseats)
str(Carseats)
## 'data.frame': 400 obs. of 11 variables:
## $ Sales : num 9.5 11.22 10.06 7.4 4.15 ...
## $ CompPrice : num 138 111 113 117 141 124 115 136 132 132 ...
## $ Income : num 73 48 35 100 64 113 105 81 110 113 ...
## $ Advertising: num 11 16 10 4 3 13 0 15 0 0 ...
## $ Population : num 276 260 269 466 340 501 45 425 108 131 ...
## $ Price : num 120 83 80 97 128 72 108 120 124 124 ...
## $ ShelveLoc : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
## $ Age : num 42 65 59 55 38 78 71 67 76 76 ...
## $ Education : num 17 10 12 14 13 16 15 10 10 17 ...
## $ Urban : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
## $ US : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...
# (the model object is named 'predict'; a more distinct name would avoid confusion with the predict() function)
predict<-lm(Sales ~ Price + Urban + US, data=Carseats)
summary(predict)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
The Price coefficient may be interpreted as follows: on average, a price increase of $1 is associated with a decrease of about 54.5 units in sales (0.054459 thousand units), holding all other predictors fixed.
The UrbanYes coefficient may be interpreted by saying that, on average, unit sales in urban locations are about 21.9 units (0.021916 thousand) lower than in rural locations, holding all other predictors fixed; this effect is not statistically significant (p = 0.936).
The USYes coefficient may be interpreted by saying that, on average, unit sales in US stores are about 1,201 units (1.200573 thousand) higher than in non-US stores, holding all other predictors fixed.
The fitted model in equation form (a worked check follows the definitions below) is:
Sales = 13.043469 - 0.054459 x_i1 - 0.021916 x_i2 + 1.200573 x_i3
Given that:
x_i1 = Price for the ith store
x_i2 = 1 if the ith store is in an urban location; 0 otherwise
x_i3 = 1 if the ith store is in the United States; 0 otherwise
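As a worked check of this equation, here is a sketch that plugs a hypothetical urban US store with a price of $120 into the fitted coefficients and compares the result against predict():
# manual prediction from the fitted coefficients for Price = 120, Urban = Yes, US = Yes
b <- coef(predict)
b["(Intercept)"] + b["Price"] * 120 + b["UrbanYes"] * 1 + b["USYes"] * 1
# the same value via predict(); note the fitted model object itself is named 'predict'
predict(predict, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))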
We can reject the null hypothesis H0: Bj = 0 for the Price and US variables, as indicated by the statistical significance (very small p-values) of these two coefficients; we cannot reject it for Urban.
predict2<- lm(Sales ~ Price + US, data=Carseats)
summary(predict2)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models fit the data fairly poorly: the Multiple R^2 is the same, 0.2393, meaning that only 23.93% of the variability in Sales is explained by each model. One key difference is that the smaller model has a slightly higher adjusted R^2 (0.2354 vs. 0.2335) and a larger F-statistic, since it drops the non-significant Urban predictor while explaining essentially the same amount of variation.
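The two nested models can also be compared formally; a sketch using an F-test for dropping Urban, alongside the adjusted R^2 values:
# F-test for whether dropping Urban significantly worsens the fit
anova(predict2, predict)
# adjusted R^2 of the reduced vs. full model
c(reduced = summary(predict2)$adj.r.squared, full = summary(predict)$adj.r.squared)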
confint(predict2, level = 0.95)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
par(mfrow=c(2,2))
plot(predict2)
According to the Residuals vs. Leverage plot, there are a few outliers in this dataset: observations 26, 50, and 368 stand out. Additionally, there are quite a few observations with high leverage (greater than about 0.01, well above the average leverage of (p + 1)/n = 3/400 = 0.0075).
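These points can be flagged numerically as well; a sketch using the rule of thumb that leverage well above the average deserves attention:
# average leverage for this model is (2 + 1)/400 = 0.0075
lev2 <- hatvalues(predict2)
# observations with leverage more than twice the average
which(lev2 > 2 * mean(lev2))
# studentized residuals beyond +/- 2 flag the potential outliers
which(abs(rstudent(predict2)) > 2)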
The coefficient from regressing Y onto X without an intercept is sum(x_i * y_i) / sum(x_i^2), represented by the function:
knitr::include_graphics("C:/Users/selen/Desktop/R Folder/YonX.png")
Y on X
The coefficient from regressing X onto Y without an intercept is sum(x_i * y_i) / sum(y_i^2), represented by the function:
knitr::include_graphics("C:/Users/selen/Desktop/R Folder/XonY.png")
X on Y
The coefficient estimate for the regression of X onto Y will be the same as the coefficient estimate for the regression of Y onto X if the denominators are equal, i.e., if sum(x_i^2) = sum(y_i^2):
knitr::include_graphics("C:/Users/selen/Desktop/R Folder/SameCoefficients.png")
Same coefficients
This is evident because the two formulas share the same numerator, sum(x_i * y_i), and differ only in their denominators (sum(x_i^2) versus sum(y_i^2)); the two coefficient estimates are therefore equal exactly when sum(x_i^2) = sum(y_i^2).
set.seed(1)
x<- 1:100
sum(x^2)
## [1] 338350
y<-2 * x + rnorm(100, sd=0.1)
sum(y^2)
## [1] 1353606
fit.x<- lm(y ~ x + 1)
fit.y<- lm(x ~ y + 0)
summary(fit.x)
##
## Call:
## lm(formula = y ~ x + 1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.234005 -0.060584 0.001551 0.058514 0.229747
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0131666 0.0181897 0.724 0.471
## x 1.9999549 0.0003127 6395.532 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09027 on 98 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.09e+07 on 1 and 98 DF, p-value: < 2.2e-16
summary(fit.y)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
x2<-1:100
sum(x2^2)
## [1] 338350
y2<-1:100
sum(y2^2)
## [1] 338350
fit.y2<- lm(y2 ~ x2 + 0)
summary(fit.y2)
## Warning in summary.lm(fit.y2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = y2 ~ x2 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.082e-13 -2.094e-15 2.900e-17 2.218e-15 1.294e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x2 1.000e+00 5.379e-17 1.859e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.457e+32 on 1 and 99 DF, p-value: < 2.2e-16
fit.x2 <- lm(x2 ~ y2 + 0)
summary(fit.x2)
## Warning in summary.lm(fit.x2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = x2 ~ y2 + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.082e-13 -2.094e-15 2.900e-17 2.218e-15 1.294e-14
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y2 1.000e+00 5.379e-17 1.859e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 3.457e+32 on 1 and 99 DF, p-value: < 2.2e-16