The KNN classifier is a method for predicting qualitative responses. It estimates the conditional distribution of Y given X and then assigns an observation to the class with the highest estimated probability. Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, denoted N0, and then estimates the conditional probability for class j as the fraction of points in N0 whose response equals j.
KNN regression is a non-parametric method for predicting quantitative responses. Given a value for K and a prediction point x0, it first identifies the K training observations that are closest to x0, denoted N0, and then estimates f(x0) as the average of the training responses in N0.
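As a rough illustration (not part of the original solutions), a minimal base-R sketch of both ideas could look like the following; the helper name knn_predict and the toy data are made up for this example.
# Minimal KNN sketch: find the K nearest training points to x0, then either take the
# majority class (classification) or the mean response (regression).
knn_predict <- function(x_train, y_train, x0, K = 3, type = c("class", "reg")) {
  type <- match.arg(type)
  d <- sqrt(colSums((t(as.matrix(x_train)) - x0)^2))  # Euclidean distances to x0
  N0 <- order(d)[1:K]                                 # indices of the K nearest neighbours
  if (type == "class") {
    names(which.max(table(y_train[N0])))              # class with the highest estimated probability
  } else {
    mean(y_train[N0])                                 # average of the training responses in N0
  }
}
set.seed(1)
x <- rnorm(50); y <- 2 * x + rnorm(50)
knn_predict(x, y, x0 = 0.5, K = 5, type = "reg")      # KNN regression estimate at x0 = 0.5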
Auto data set. A scatterplot matrix including all of the variables in the data set:
library(ISLR)   # provides the Auto and Carseats data sets (harmless if already attached)
pairs(Auto)
names(Auto)
## [1] "mpg" "cylinders" "displacement" "horsepower" "weight"
## [6] "acceleration" "year" "origin" "name"
cor(Auto[1:8])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
lmAuto <- lm(mpg~.-name, data=Auto)
summary(lmAuto)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
Null hypothesis (H0): the model is not useful, i.e., β1 = β2 = β3 = β4 = β5 = β6 = β7 = 0.
Alternative hypothesis (H1): the model is useful, i.e., at least one βj is not 0 (there is a relationship between the predictors and the response).
Since the F-statistic p-value (< 2.2e-16) in the model summary above is far smaller than the significance level (0.05), there is evidence to reject the null hypothesis; that is, there is a relationship between the predictors and the response (at least one βj is not 0).
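As a quick cross-check (not part of the original output), the overall F-test p-value can be recovered from the fitted model object:
fstat <- summary(lmAuto)$fstatistic                          # value, numdf, dendf
pf(fstat["value"], fstat["numdf"], fstat["dendf"], lower.tail = FALSE)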
From the model summary above, we can also check each predictor's t-statistic p-value. The p-values for displacement, weight, year and origin are much smaller than the significance level (0.05), so for these predictors we can reject the null hypothesis of no relationship with the response. This indicates that displacement, weight, year and origin have a statistically significant relationship to the response.
On average, mpg is predicted to increase by 0.750773 when year increases by one unit, holding all other predictors fixed; that is, a car's mileage increases by about 0.75 mpg per model year.
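A small sketch of how this interpretation can be checked numerically; the newcars values below are arbitrary illustrative inputs, not taken from the data.
# Two otherwise-identical hypothetical cars, one model year apart.
newcars <- data.frame(cylinders = 4, displacement = 100, horsepower = 90,
                      weight = 2500, acceleration = 15, year = c(76, 77), origin = 1)
diff(predict(lmAuto, newcars))   # equals the year coefficient, about 0.75 mpg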
par(mfrow=c(2,2))
plot(lmAuto)
From the Normal Q-Q plot, we see a wide right tail that curves upward, and in the scale-location plot (sqrt of |standardized residuals|) a few points lie above 2.0, indicating a few outliers; both suggest a violation of the normality assumption. The residuals-vs-fitted plot shows a slight pattern, which indicates mild non-linearity in the data. There is also one high-leverage point (observation 14).
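The flagged high-leverage observation can be located directly from the fitted model (a quick check, not part of the original output):
which.max(hatvalues(lmAuto))   # index of the observation with the largest leverage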
It is easy to include interaction terms in a linear model using the lm() function. For example, if a and b are predictors and y is the response, the syntax a:b tells R to include an interaction term between a and b, while a*b simultaneously includes a, b, and the interaction a:b as predictors; it is shorthand for a + b + a:b.
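An illustrative sketch of this shorthand using two of the Auto predictors (fit1 and fit2 are throwaway names for this example, not part of the original analysis):
# These two calls fit the same model: horsepower*weight expands to the main
# effects plus the interaction term.
fit1 <- lm(mpg ~ horsepower * weight, data = Auto)
fit2 <- lm(mpg ~ horsepower + weight + horsepower:weight, data = Auto)
all.equal(coef(fit1), coef(fit2))   # TRUE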
From the scatterplot matrix in section 9(a), we can see that displacement and weight are highly correlated (and each also has a significant individual relationship with mpg). So we will try an interaction term between displacement and weight along with all the individual predictors (that is, add this interaction term to the linear regression from section 9(c)), as below.
lmautointeraction1 = lm(mpg ~ . - name + displacement:weight,data=Auto)
summary(lmautointeraction1)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.9027 -1.8092 -0.0946 1.5549 12.1687
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5.389e+00 4.301e+00 -1.253 0.2109
## cylinders 1.175e-01 2.943e-01 0.399 0.6899
## displacement -6.837e-02 1.104e-02 -6.193 1.52e-09 ***
## horsepower -3.280e-02 1.238e-02 -2.649 0.0084 **
## weight -1.064e-02 7.136e-04 -14.915 < 2e-16 ***
## acceleration 6.724e-02 8.805e-02 0.764 0.4455
## year 7.852e-01 4.553e-02 17.246 < 2e-16 ***
## origin 5.610e-01 2.622e-01 2.139 0.0331 *
## displacement:weight 2.269e-05 2.257e-06 10.054 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared: 0.8588, Adjusted R-squared: 0.8558
## F-statistic: 291.1 on 8 and 383 DF, p-value: < 2.2e-16
From the above model, the interaction term's p-value is much smaller than the significance level (0.05), indicating that the displacement:weight interaction has a significant relationship with mpg. We can also see that R2 has increased relative to the model in section 9(c), which further suggests the interaction term improves the model.
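As a further check (not in the original write-up), a partial F-test comparing the two nested models tells the same story:
anova(lmAuto, lmautointeraction1)   # F-test for adding displacement:weight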
Let’s also try an interaction between the other significant predictors, year and origin.
lmautointeraction2 = lm(mpg ~ . - name + year:origin,data=Auto)
summary(lmautointeraction2)
##
## Call:
## lm(formula = mpg ~ . - name + year:origin, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.6072 -2.0439 -0.0596 1.7121 12.3368
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.492e+00 9.044e+00 0.939 0.348353
## cylinders -5.042e-01 3.192e-01 -1.579 0.115082
## displacement 1.567e-02 7.530e-03 2.081 0.038060 *
## horsepower -1.399e-02 1.364e-02 -1.025 0.305786
## weight -6.352e-03 6.449e-04 -9.851 < 2e-16 ***
## acceleration 9.185e-02 9.766e-02 0.941 0.347546
## year 4.189e-01 1.125e-01 3.723 0.000226 ***
## origin -1.405e+01 4.699e+00 -2.989 0.002978 **
## year:origin 1.989e-01 6.030e-02 3.298 0.001064 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared: 0.8264, Adjusted R-squared: 0.8228
## F-statistic: 227.9 on 8 and 383 DF, p-value: < 2.2e-16
From the above model, the interaction term's p-value is again much smaller than the significance level (0.05), indicating that the year:origin interaction has a significant relationship with mpg. However, R2 increases only slightly compared to the model in section 9(c), so this interaction contributes less to the model than the displacement:weight interaction.
From the linear regression model in section 9(c), cylinders, horsepower and acceleration do not have a significant relationship with mpg, judging by their t-statistic p-values. So let's try some transformations of those variables and refit the linear regression with them.
lmTransformed = lm(mpg ~ . - name + log(horsepower) + sqrt(cylinders) + I(acceleration^2),data=Auto)
summary(lmTransformed)
##
## Call:
## lm(formula = mpg ~ . - name + log(horsepower) + sqrt(cylinders) +
## I(acceleration^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3131 -1.6627 -0.0949 1.5201 12.1563
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.710e+01 1.481e+01 6.556 1.80e-10 ***
## cylinders 2.149e+00 2.365e+00 0.909 0.363972
## displacement -6.094e-03 7.265e-03 -0.839 0.402078
## horsepower 1.500e-01 2.621e-02 5.725 2.10e-08 ***
## weight -3.255e-03 6.665e-04 -4.884 1.54e-06 ***
## acceleration -1.219e+00 5.523e-01 -2.207 0.027929 *
## year 7.430e-01 4.523e-02 16.427 < 2e-16 ***
## origin 8.697e-01 2.538e-01 3.427 0.000676 ***
## log(horsepower) -2.443e+01 2.910e+00 -8.394 9.34e-16 ***
## sqrt(cylinders) -1.029e+01 1.113e+01 -0.925 0.355777
## I(acceleration^2) 2.717e-02 1.619e-02 1.678 0.094077 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.951 on 381 degrees of freedom
## Multiple R-squared: 0.8607, Adjusted R-squared: 0.857
## F-statistic: 235.4 on 10 and 381 DF, p-value: < 2.2e-16
Even after applying a few transformations to the variables that were not significant in the section 9(c) model, only horsepower (via its log transformation) now shows a significant relationship with mpg. Let's check this with a plot.
plot(log(Auto$horsepower), Auto$mpg)
The plot above indicates that the log transformation of horsepower has an approximately linear, negative relationship with mpg (which is also reflected in the negative coefficient of -2.443e+01).
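A quick supplementary check (not part of the original output): a simple regression of mpg on log(horsepower) alone also shows the strong negative fit.
summary(lm(mpg ~ log(horsepower), data = Auto))$coefficients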
Carseats data set. A multiple regression model to predict Sales using Price, Urban, and US:
lmSales1 <- lm(Sales ~ Price + Urban + US, data=Carseats)
summary(lmSales1)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
lmSales1$coefficients
## (Intercept) Price UrbanYes USYes
## 13.04346894 -0.05445885 -0.02191615 1.20057270
Since the t-statistic p-values of Price and US (Yes = 1, No = 0) in the model from section (a) are much smaller than the significance level, these predictors have a significant linear relationship with Sales.
Let's check the coefficients; they can be interpreted as below (Sales is recorded in thousands of units, and the dummy coding is confirmed in the sketch after this list):
Price: on average, a 1 dollar increase in Price decreases Sales by about 54.46 units, holding all other predictors fixed.
Urban (Yes = 1, No = 0): on average, unit Sales at an urban store are about 21.92 lower than at a rural store, holding all other predictors fixed.
US (Yes = 1, No = 0): on average, unit Sales at a US store are about 1,200.57 higher than at a non-US store, holding all other predictors fixed.
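The dummy coding R uses for the two qualitative predictors (Yes mapped to 1) can be confirmed directly; this check is not part of the original output.
contrasts(Carseats$Urban)
contrasts(Carseats$US)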
In general, the model can be written as:
\[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} + 1.200573 \times \text{US} \]
where Urban = 1 for an urban store and 0 for a rural store, and US = 1 for a US store and 0 for a non-US store. Substituting the dummy values gives the following sub-models:
Urban = 1: \[ \widehat{\text{Sales}} = (13.043469 - 0.021916) - 0.054459 \times \text{Price} + 1.200573 \times \text{US} \]
Urban = 0: \[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} + 1.200573 \times \text{US} \]
US = 1: \[ \widehat{\text{Sales}} = (13.043469 + 1.200573) - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} \]
US = 0: \[ \widehat{\text{Sales}} = 13.043469 - 0.054459 \times \text{Price} - 0.021916 \times \text{Urban} \]
Since the t-statistic p-values of Price and US (Yes = 1) in the section (a) summary are much smaller than the significance level (0.05), these predictors have a significant linear relationship with Sales; that is, we have evidence to reject the null hypothesis for them.
For Urban, the t-statistic p-value in the section (a) summary is much larger than the significance level (0.05), so we do not have evidence to reject the null hypothesis; there is no significant relationship between Sales and Urban. We can therefore remove Urban from the model in section (a).
lmUpdatedModel <- lm(Sales ~ Price + US, data = Carseats)
summary(lmUpdatedModel)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have an R2 of 23.93%, the proportion of the variance in Sales that they explain. Comparing the adjusted R2 of models (a) and (e), however, there is a slight increase from 23.35% to 23.54% when Urban is dropped, which indicates that the extra predictor Urban does not improve the fit. So adding variables or predictors does not always increase a model's goodness of fit; in this case, the model defined in (e) is better than the model defined in (a).
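A quick way (not shown in the original output) to pull out the adjusted R2 values quoted above:
summary(lmSales1)$adj.r.squared        # model (a), with Urban
summary(lmUpdatedModel)$adj.r.squared  # model (e), without Urban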
The 95% confidence intervals for Price and US (Yes = 1) coefficients are below:
confint(lmUpdatedModel)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
For instance, the 95% confidence interval for Price is between -0.06475984 and -0.04419543.
par(mfrow = c(2, 2))
plot(lmUpdatedModel)
From the Residuals vs Leverage plot, there is some evidence of a few outliers (standardized residuals greater than 2 or less than -2), and also of a few high-leverage points whose leverage is well above the average leverage (p + 1)/n = (2 + 1)/400 = 0.0075.
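These can also be counted directly from the fitted model (a supplementary check, not part of the original output):
sum(abs(rstudent(lmUpdatedModel)) > 2)                      # potential outliers
sum(hatvalues(lmUpdatedModel) > (2 + 1) / nrow(Carseats))   # leverage above (p + 1)/n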
sprintf("The number of observations in Carseats = %d", nrow(Carseats))
## [1] "The number of observations in Carseats = 400"
Using the same equation from (3.38),
The Coefficient estimate for the regression of Y onto X is
\[ \hat{\beta}_a = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}x_j^2} \]
The Coefficient estimate for the regression of X onto Y is
\[ \hat{\beta}_b = \frac{\sum_{i = 1}^{n}y_ix_i}{\sum_{j = 1}^{n}y_j^2} = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}y_j^2} \]
Setting the two coefficient estimates equal to each other, the condition resolves to:
\[ \hat{\beta}_a = \hat{\beta}_b \iff \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}x_j^2} = \frac{\sum_{i = 1}^{n}x_iy_i}{\sum_{j = 1}^{n}y_j^2} \iff \sum_{j = 1}^{n}x_j^2 = \sum_{j = 1}^{n}y_j^2 \]
So the coefficient estimate for the regression of Y onto X will be the same as the coefficient estimate for the regression of X onto Y when the following condition is met:
\[ \sum_{j = 1}^{n}x_j^2 = \sum_{j = 1}^{n}y_j^2 \]
Randomly generate 100 observations for X and construct Y from X with an error term.
#Random 100 observations for X
set.seed(1)
X <- rnorm(100)
Y <- 3 * X + rnorm(100, sd = 3)
sprintf("Sum of X^2^ is = %f", sum(X^2))
## [1] "Sum of X^2^ is = 81.055093"
sprintf("Sum of Y^2^ is = %f", sum(Y^2))
## [1] "Sum of Y^2^ is = 1539.368887"
\[ \sum_{j = 1}^{n}x_j^2 = 81.055093 \]
\[ \sum_{j = 1}^{n}y_j^2 = 1539.368887 \]
It is evident from the above that \[ \hat{\beta}_a \neq \hat{\beta}_b \]
Since the two sums of squares differ, the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X. We can also verify this with the linear regression summaries of Y onto X and X onto Y below.
lmY <- lm(Y ~ X + 0)
summary(lmY)
##
## Call:
## lm(formula = Y ~ X + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.7461 -1.9415 -0.5312 1.5167 6.9327
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 2.9816 0.3194 9.334 3.1e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.876 on 99 degrees of freedom
## Multiple R-squared: 0.4681, Adjusted R-squared: 0.4627
## F-statistic: 87.13 on 1 and 99 DF, p-value: 3.1e-15
lmX <- lm(X ~ Y + 0)
summary(lmX)
##
## Call:
## lm(formula = X ~ Y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.35410 -0.37468 0.09974 0.48799 1.55406
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 0.15700 0.01682 9.334 3.1e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6599 on 99 degrees of freedom
## Multiple R-squared: 0.4681, Adjusted R-squared: 0.4627
## F-statistic: 87.13 on 1 and 99 DF, p-value: 3.1e-15
From both regression summaries, we see that the estimated coefficients are different (the coefficient estimate on X is 2.9816 and the coefficient estimate on Y is 0.15700).
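As an extra check (not in the original write-up), these two no-intercept slopes satisfy
\[ \hat{\beta}_a \hat{\beta}_b = \frac{(\sum_{i = 1}^{n}x_iy_i)^2}{\sum_{j = 1}^{n}x_j^2 \sum_{j = 1}^{n}y_j^2} \]
which, for these through-the-origin fits, equals the R2 of 0.4681 reported in both summaries:
coef(lmY) * coef(lmX)   # approximately 0.468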
Randomly generate 100 observations for X and set Y equal to the same observations.
#Random 100 observations for X
set.seed(2)
X <- rnorm(100)
Y <- X
sprintf("Sum of X^2^ is = %f", sum(X^2))
## [1] "Sum of X^2^ is = 133.352149"
sprintf("Sum of Y^2^ is = %f", sum(Y^2))
## [1] "Sum of Y^2^ is = 133.352149"
\[ \sum_{j = 1}^{n}x_j^2 = 133.352149 \]
\[ \sum_{j = 1}^{n}y_j^2 = 133.352149 \]
It is evident from the above that \[ \hat{\beta}_a = \hat{\beta}_b \]
Since the two sums of squares are equal, the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X. We can also verify this with the linear regression summaries of Y onto X and X onto Y below.
lmYSame <- lm(Y ~ X + 0)
summary(lmYSame)
## Warning in summary.lm(lmYSame): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = Y ~ X + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## X 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
lmXSame <- lm(X ~ Y + 0)
summary(lmXSame)
## Warning in summary.lm(lmXSame): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = X ~ Y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.024e-16 -1.308e-17 7.990e-18 4.566e-17 2.532e-15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## Y 1.000e+00 2.287e-17 4.373e+16 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.641e-16 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.913e+33 on 1 and 99 DF, p-value: < 2.2e-16
From both regression summaries, we see that the estimated coefficients are the same (the coefficient estimate on X is 1.000e+00 and the coefficient estimate on Y is 1.000e+00).
Comment on the output. For instance: