Question 2: Difference between the KNN classifier and KNN regression methods

The KNN classifier is used for classification problems, i.e. when predicting a qualitative response. It works by identifying the K training observations nearest to a test point x0 (the neighborhood N0) and estimating the conditional probability P(Y = j | X = x0) for class j as the fraction of points in N0 whose response equals j; x0 is then assigned to the class with the highest estimated probability.

KNN regression is the analogous non-parametric method for regression problems, i.e. when predicting a quantitative response. It likewise identifies the K training observations closest to x0 (denoted N0), but estimates f(x0) as the average of the training responses in the neighborhood.

knitr::include_graphics("C:/Users/selen/Desktop/R Folder/KNN.png")
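A minimal base-R sketch contrasting the two estimates at a single query point x0 (the toy data and the choice K = 5 are my own assumptions, not from the figure):

set.seed(42)
x <- runif(50)                                 # hypothetical training predictor
y.num <- 3 * x + rnorm(50, sd = 0.2)           # quantitative response (regression)
y.cls <- factor(ifelse(x > 0.5, "A", "B"))     # qualitative response (classification)

x0 <- 0.4
K <- 5
nbrs <- order(abs(x - x0))[1:K]                # indices of the K nearest neighbors, N0

# KNN regression: estimate f(x0) as the average response over N0
f.hat <- mean(y.num[nbrs])

# KNN classification: estimate P(Y = j | X = x0) as the class fractions in N0,
# then assign x0 to the majority class
p.hat <- table(y.cls[nbrs]) / K
pred.class <- names(which.max(p.hat))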

Question 9: Auto data set

setwd("C:/Users/selen/Desktop/R Folder")
auto<-read.csv("Auto.csv", na.strings = "?")
auto<-na.omit(auto)

Part A: Scatterplot Matrix

pairs(auto)

names(auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"

Part B: Matrix Correlations

cor(auto[1:8])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Part C: Multiple Linear Regression

fit <- lm(mpg ~ . - name, data= auto)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ . - name, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Section i: Is there a relationship between the predictors and the response?

To answer this question, I test the hypothesis H0: B1 = B2 = ... = Bp = 0 against Ha: at least one Bj ≠ 0. The F-statistic of 252.4 is far greater than 1, and its p-value (< 2.2e-16) is well below the alpha of 0.05, so I reject the null hypothesis in favor of the alternative. Since at least one coefficient is nonzero, both the F-statistic and its p-value provide evidence that there is some relationship between the predictors and the response.

Section ii: Which predictors appear to have a statistically significant relationship to the response?

The predictors that appear to have a statistically significant relationship to the response (at the 0.05 level) are displacement, weight, year, and origin. (The intercept, included in the regression by default, is also significant, but it is not a predictor.)

Section iii: What does the coefficient for the year variable suggest?

The coefficient on the year variable suggests that the average effect of a one-year increase in model year is an increase of 0.750773 in mpg, holding all other predictors fixed. In other words, newer cars are more fuel efficient: each additional model year is associated with roughly 0.75 more mpg.
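As a quick illustration (a sketch; the held-fixed values below are arbitrary choices, not from the text), the predicted mpg for two otherwise identical hypothetical cars one model year apart differs by exactly the year coefficient:

newcars <- data.frame(cylinders = 4, displacement = 151, horsepower = 93,
                      weight = 2800, acceleration = 15.5, year = c(76, 77),
                      origin = 1)
diff(predict(fit, newcars))   # equals the year coefficient, ~0.7508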

Part D: Diagnostic Plots

par(mfrow = c(2, 2))
plot(fit)

The Residuals vs. Fitted plot shows a slight pattern in the residuals, which indicates some non-linearity in this dataset; as the book notes, the presence of a pattern may indicate a problem with some aspect of the linear model, so this warrants further investigation. The Normal Q-Q plot indicates that the data contain more extreme values than would be expected under a normal distribution. The Residuals vs. Leverage plot reveals a few outliers: observations 327 and 394 have high residuals but low leverage, and another observation sits somewhat apart from the bulk of the data, pulling the fitted line in the plot upward. Lastly, observation 14 is another unusual point, with a low standardized residual and unusually high leverage.
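These visual impressions can be checked numerically (a sketch using base R's influence measures):

# Studentized residuals beyond |3| suggest outliers; leverages well above the
# average (p + 1)/n = 8/392 flag high-leverage points such as observation 14
which(abs(rstudent(fit)) > 3)
which(hatvalues(fit) > 3 * mean(hatvalues(fit)))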

Part E: Linear Regression with Interactions

fit2<- lm(mpg ~ . -name + cylinders:weight, data=auto)
summary(fit2)
## 
## Call:
## lm(formula = mpg ~ . - name + cylinders:weight, data = auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.9484  -1.7133  -0.1809   1.4530  12.4137 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       7.3143478  5.0076737   1.461  0.14494    
## cylinders        -5.0347425  0.5795767  -8.687  < 2e-16 ***
## displacement      0.0156444  0.0068409   2.287  0.02275 *  
## horsepower       -0.0314213  0.0126216  -2.489  0.01322 *  
## weight           -0.0150329  0.0011125 -13.513  < 2e-16 ***
## acceleration      0.1006438  0.0897944   1.121  0.26306    
## year              0.7813453  0.0464139  16.834  < 2e-16 ***
## origin            0.8030154  0.2617333   3.068  0.00231 ** 
## cylinders:weight  0.0015058  0.0001657   9.088  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.022 on 383 degrees of freedom
## Multiple R-squared:  0.8531, Adjusted R-squared:  0.8501 
## F-statistic: 278.1 on 8 and 383 DF,  p-value: < 2.2e-16
fit5<- lm(mpg ~ . -name + displacement:weight, data=auto)
summary(fit5)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16

I tested ten interaction effects among the statistically significant variables. All of the interactions appeared statistically significant, but I report the two models whose interaction terms showed the strongest relationships, judged by the highest R^2 (the share of variation explained by the model) and the highest F-statistic (evidence of a relationship between the predictors and the response): the models including the displacement:weight and cylinders:weight interactions.

An interesting pattern across most of the interaction models: while other variables "turned on" and became statistically significant once an interaction was included, acceleration remained insignificant throughout. (A sketch of how such a screen could be automated appears below.)
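A hedged sketch of one way the screening loop could be written (the candidate list is my assumption; adding horsepower to the four significant predictors yields the choose(5, 2) = 10 pairs mentioned above):

cand <- c("displacement", "horsepower", "weight", "year", "origin")
pairs.ix <- combn(cand, 2)                       # 10 candidate pairs
for (i in seq_len(ncol(pairs.ix))) {
  term <- paste(pairs.ix[, i], collapse = ":")                 # e.g. "displacement:weight"
  fit.i <- update(fit, as.formula(paste(". ~ . +", term)))     # refit with the interaction added
  cat(term, "adj. R^2 =", round(summary(fit.i)$adj.r.squared, 4), "\n")
}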

Part F: Transformations

par(mfrow=c(2,2))
plot((auto$weight), auto$mpg)
plot(log(auto$weight), auto$mpg)
plot(sqrt(auto$weight), auto$mpg)
plot((auto$weight)^2, auto$mpg)

I transformed the weight and displacement variables from the Auto dataset to compare the trends produced by each transformation. Both variables behaved similarly, so I report only the weight results here, which show the clearer trends. The top-left panel plots the original data. The bottom two panels, sqrt(weight) and weight^2, show essentially the same pattern as the original. The top-right panel is different: the log transformation of weight produces a noticeably more linear relationship with mpg.
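As a quick follow-up sketch (my own check, not part of the original output), comparing the R^2 of mpg regressed on weight with and without the log transform quantifies the improvement:

summary(lm(mpg ~ weight, data = auto))$r.squared        # untransformed weight
summary(lm(mpg ~ log(weight), data = auto))$r.squared   # log-transformed weight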

Question 10: Carseats Data Set

library("ISLR")
data(Carseats)
str(Carseats)
## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Part A: Fitting Regression Model to Predict Sales

# naming the model "predict" would shadow the base predict() generic
fit.carseats <- lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit.carseats)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Part B: Interpretation of Coefficients

The Price coefficient may be interpreted as follows: a $1 increase in price is associated with an average decrease of about 0.0545 thousand units (roughly 54 car seats) in sales, holding all other predictors fixed. (Sales is recorded in thousands of units.)

The UrbanYes coefficient says that, on average, unit sales in urban locations are about 0.0219 thousand units (roughly 22 car seats) lower than in rural locations, holding all other predictors fixed; note that this effect is not statistically significant.

The USYes coefficient says that, on average, unit sales in US stores are about 1.2 thousand units (roughly 1,201 car seats) higher than in non-US stores, holding all other predictors fixed.
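Since Sales is in thousands, a small sketch converting the coefficients to individual car seats makes the magnitudes concrete:

# Multiplying by 1,000 expresses each coefficient in individual car seats
# (e.g. Price: about -54.5 seats per extra dollar)
coef(fit.carseats)[c("Price", "UrbanYes", "USYes")] * 1000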

Part C: Equation Form of Model

Sales = 13.043469 - 0.054459*x1 - 0.021916*x2 + 1.200573*x3

Given that:

x1 = Price

x2 = 1 if the ith store is in an urban location; 0 otherwise

x3 = 1 if the ith store is in the US; 0 otherwise

Part D: Predictors that Reject Null Hypothesis

We can reject the null hypothesis H0: Bj = 0 for both the Price and US variables, as indicated by their very small p-values (< 2e-16 and 4.86e-06, respectively). For Urban, the p-value of 0.936 provides no evidence against the null.

Part E: Smaller Model with Significant Predictors

predict2<- lm(Sales ~ Price + US, data=Carseats)
summary(predict2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Part F: How well do the models in (a) and (e) fit the data?

Both models fit the data fairly poorly: each has a multiple R^2 of 0.2393, meaning only 23.93% of the variability in Sales is explained. The smaller model is marginally better, however: dropping the insignificant Urban variable raises the adjusted R^2 slightly (0.2354 vs. 0.2335), lowers the residual standard error (2.469 vs. 2.472), and yields a larger F-statistic (62.43 vs. 41.52).
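A formal check (a sketch using a nested-model F-test) confirms that dropping Urban does not hurt the fit:

# The model from (e) is nested in the model from (a), so anova() tests
# whether Urban adds explanatory power; a large p-value says it does not
anova(predict2, fit.carseats)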

Part G: 95% Confidence Intervals

confint(predict2, level = 0.95)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Part H: Outliers

par(mfrow = c(2, 2))
plot(predict2)

According to the Residuals vs. Leverage plot, there are a few outliers in this dataset; observations 26, 50, and 368 stand out. Additionally, quite a few observations have high leverage (greater than 0.01, well above the average leverage of (p + 1)/n = 3/400 = 0.0075).
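A numeric sketch of the same check using base R's hat values:

# Average leverage is (p + 1)/n = 3/400 = 0.0075; observations far above it
# (here, more than three times the average) are flagged as high-leverage
h <- hatvalues(predict2)
which(h > 3 * mean(h))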

Question 12: Simple Linear Regression without an Intercept

Part A: Recall that the coefficient estimate B for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of Y onto X without an intercept, given by (3.38), is B_hat = sum(x_i * y_i) / sum(x_i^2):

knitr::include_graphics("C:/Users/selen/Desktop/R Folder/YonX.png")
Y on X

The coefficient estimate for the regression of X onto Y without an intercept is B_hat' = sum(x_i * y_i) / sum(y_i^2):

knitr::include_graphics("C:/Users/selen/Desktop/R Folder/XonY.png")
X on Y

The coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X exactly when the denominators agree, i.e. when sum(x_i^2) = sum(y_i^2):

knitr::include_graphics("C:/Users/selen/Desktop/R Folder/SameCoefficients.png")
Same coefficients

This follows because the two estimates share the same numerator, sum(x_i * y_i), and differ only in their denominators.
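A quick numeric sketch of this condition (my own illustration): permuting a vector preserves its sum of squares, so the two no-intercept slope estimates coincide.

b <- function(u, v) sum(u * v) / sum(u^2)   # no-intercept slope from regressing v onto u
set.seed(2)
x <- rnorm(100)
y <- sample(x)           # same values reordered, so sum(y^2) == sum(x^2)
c(b(x, y), b(y, x))      # identical estimates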

Part B: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(1)
x<- 1:100
sum(x^2)
## [1] 338350
y<-2 * x + rnorm(100, sd=0.1)
sum(y^2)
## [1] 1353606
fit.x <- lm(y ~ x + 1)   # fit with an intercept; the intercept is ~0 here, so the slope matches the no-intercept estimate
fit.y <- lm(x ~ y + 0)   # no-intercept fit, as in (3.38)
summary(fit.x)
## 
## Call:
## lm(formula = y ~ x + 1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.234005 -0.060584  0.001551  0.058514  0.229747 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 0.0131666  0.0181897    0.724    0.471    
## x           1.9999549  0.0003127 6395.532   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09027 on 98 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.09e+07 on 1 and 98 DF,  p-value: < 2.2e-16
summary(fit.y)
## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115418 -0.029231 -0.002186  0.031322  0.111795 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y 5.00e-01   3.87e-05   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16
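
As expected, the two slope estimates differ: about 2.00 for the regression of y onto x versus 0.50 for x onto y. Consistent with Part A, the denominators disagree: sum(y^2) = 1353606 is roughly four times sum(x^2) = 338350.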

Part C: Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x2<-1:100
sum(x2^2)
## [1] 338350
y2<-1:100
sum(y2^2)
## [1] 338350
fit.y2<- lm(y2 ~ x2 + 0)
summary(fit.y2)
## Warning in summary.lm(fit.y2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = y2 ~ x2 + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.082e-13 -2.094e-15  2.900e-17  2.218e-15  1.294e-14 
## 
## Coefficients:
##     Estimate Std. Error   t value Pr(>|t|)    
## x2 1.000e+00  5.379e-17 1.859e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.457e+32 on 1 and 99 DF,  p-value: < 2.2e-16
fit.x2 <- lm(x2 ~ y2 + 0)
summary(fit.x2)
## Warning in summary.lm(fit.x2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = x2 ~ y2 + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -3.082e-13 -2.094e-15  2.900e-17  2.218e-15  1.294e-14 
## 
## Coefficients:
##     Estimate Std. Error   t value Pr(>|t|)    
## y2 1.000e+00  5.379e-17 1.859e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.129e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 3.457e+32 on 1 and 99 DF,  p-value: < 2.2e-16
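
Here x2 and y2 are identical, so sum(x2^2) = sum(y2^2) = 338350, and both regressions return the same coefficient estimate of 1, which is exactly the condition derived in Part A.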