R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.
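
For reference, the hidden chunk looks like this in the .Rmd source (this mirrors the default template's pressure chunk):

```{r pressure, echo=FALSE}
plot(pressure)
```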

2. Carefully explain the differences between the KNN classifier and KNN regression methods.

The KNN classifier is used when the response variable is categorical (discrete, e.g. Yes or No): the predicted class at a query point is the majority vote among its K nearest training observations. KNN regression is used when the response is continuous/quantitative: the prediction is the average of the responses of the K nearest neighbors.
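
A minimal sketch of the two methods on simulated data, assuming the class and FNN packages are installed (class::knn() and FNN::knn.reg() are the usual entry points):

library(class)  # knn() for classification
library(FNN)    # knn.reg() for regression
set.seed(1)
x.train = matrix(rnorm(100), ncol = 2)                                # 50 training points
y.class = factor(ifelse(x.train[,1] + x.train[,2] > 0, "Yes", "No"))  # categorical response
y.num   = x.train[,1] + rnorm(50)                                     # quantitative response
x.test  = matrix(rnorm(10), ncol = 2)                                 # 5 query points
knn(train = x.train, test = x.test, cl = y.class, k = 3)        # majority vote -> class labels
knn.reg(train = x.train, test = x.test, y = y.num, k = 3)$pred  # neighbor average -> numbers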

9. This question involves the use of multiple linear regression on the Auto data set.

A) Produce a scatterplot matrix which includes all of the variables in the data set.
# Read in the data
auto = read.csv("https://www.statlearning.com/s/Auto.csv", stringsAsFactors = TRUE)
# Produce a scatterplot matrix of all the variables
plot(auto)

B) Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.
library(data.table)
autoDT = as.data.table(auto)
autoDT = autoDT[, -"name"]  # drop the qualitative name column
# horsepower came in as a factor (its missing values are coded as text in the
# csv), so the lines below attempt to coerce it to numeric; note that
# as.numeric() on a factor returns the underlying level codes.
autoDT[is.na(autoDT$horsepower)] <- 0
autoDT$horsepower = ifelse(autoDT$horsepower == "NA", NA, autoDT$horsepower)
autoDT$horsepower = as.numeric(autoDT$horsepower)
cor(autoDT)
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7762599   -0.8044430  0.4228227 -0.8317389
## cylinders    -0.7762599  1.0000000    0.9509199 -0.5466585  0.8970169
## displacement -0.8044430  0.9509199    1.0000000 -0.4820705  0.9331044
## horsepower    0.4228227 -0.5466585   -0.4820705  1.0000000 -0.4821507
## weight       -0.8317389  0.8970169    0.9331044 -0.4821507  1.0000000
## acceleration  0.4222974 -0.5040606   -0.5441618  0.2662877 -0.4195023
## year          0.5814695 -0.3467172   -0.3698041  0.1274167 -0.3079004
## origin        0.5636979 -0.5649716   -0.6106643  0.2973734 -0.5812652
##              acceleration       year     origin
## mpg             0.4222974  0.5814695  0.5636979
## cylinders      -0.5040606 -0.3467172 -0.5649716
## displacement   -0.5441618 -0.3698041 -0.6106643
## horsepower      0.2662877  0.1274167  0.2973734
## weight         -0.4195023 -0.3079004 -0.5812652
## acceleration    1.0000000  0.2829009  0.2100836
## year            0.2829009  1.0000000  0.1843141
## origin          0.2100836  0.1843141  1.0000000

I attempted to replace the missing values with 0 and then coerce the column with as.numeric()/as.double(); however, NAs introduced by coercion still appear. I talked to Dr. Campbell about this, and this is as far as we got.
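
A minimal sketch of one common fix, assuming the missing horsepower entries in Auto.csv are coded as "?" (the convention the ISLR text uses for this file): declaring the missing-value marker up front keeps horsepower numeric from the start.

auto2 = read.csv("https://www.statlearning.com/s/Auto.csv",
                 na.strings = "?", stringsAsFactors = TRUE)
auto2 = na.omit(auto2)              # drop the rows with missing horsepower
cor(subset(auto2, select = -name))  # horsepower is now numeric, so cor() runs directly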

C) Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results. Comment on the output. For instance:

lm.fit = lm(mpg ~., data = autoDT) # name is already removed
summary(lm.fit)
## 
## Call:
## lm(formula = mpg ~ ., data = autoDT)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -9.629 -2.034 -0.046  1.801 13.010 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -2.128e+01  4.259e+00  -4.998 8.78e-07 ***
## cylinders    -2.927e-01  3.382e-01  -0.865   0.3874    
## displacement  1.603e-02  7.284e-03   2.201   0.0283 *  
## horsepower    7.942e-03  6.809e-03   1.166   0.2442    
## weight       -6.870e-03  5.799e-04 -11.846  < 2e-16 ***
## acceleration  1.539e-01  7.750e-02   1.986   0.0477 *  
## year          7.734e-01  4.939e-02  15.661  < 2e-16 ***
## origin        1.346e+00  2.691e-01   5.004 8.52e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared:  0.822,  Adjusted R-squared:  0.8188 
## F-statistic: 256.7 on 7 and 389 DF,  p-value: < 2.2e-16

i. Is there a relationship between the predictors and the response?

Yes, there appears to be a relationship between the predictors and the response. The F-statistic of 256.7 has a p-value below 2.2e-16, so we reject the null hypothesis that all of the coefficients are zero, and several individual predictors have significant p-values. In addition, about 82% of the variation in mpg (adjusted R^2 = 0.8188) is explained by the linear relationship with the regressors.
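
As a quick programmatic check (a small sketch; lm.fit is the fit from part C, and summary.lm stores the overall F-test in its fstatistic component):

summary(lm.fit)$fstatistic  # F = 256.7 on 7 and 389 degrees of freedom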

ii. Which predictors appear to have a statistically significant relationship to the response?

The predictors with a statistically significant relationship to mpg are displacement, weight, year, and origin; acceleration is also marginally significant at the 5% level (p = 0.0477).

iii. What does the coefficient for the year variable suggest?

The coefficient for year suggests that, holding the other predictors constant, each additional model year is associated with an increase of about 0.7734 in mpg.

D) Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit.

par(mfrow = c(2,2))
plot(lm.fit)

High leverage points: The residuals vs. leverage plot is used to flag observations with high leverage; points beyond the Cook's distance contours are influential. No observation appears to have unusually large leverage here.

Outliers: Using the rule of thumb that standardized residuals outside the range [-3, 3] indicate outliers, the scale-location and residual plots show no clear outliers.

Normality: Based on the Q-Q plot, the residuals deviate from normality in the tails, especially for observations 323, 326, and 327.

Non-linearity: The residuals vs. fitted plot shows curvature, suggesting the relationship is not purely linear.
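
These visual impressions can also be checked numerically; a small sketch using base R's standard diagnostic helpers (the 2 * mean(hv) leverage cutoff is a common rule of thumb, not part of the assignment):

rs = rstandard(lm.fit)             # standardized residuals
which(abs(rs) > 3)                 # candidate outliers under the [-3, 3] rule
hv = hatvalues(lm.fit)
which(hv > 2 * mean(hv))           # observations with relatively high leverage
which(cooks.distance(lm.fit) > 1)  # influential points by Cook's distance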

E) Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

autoDT  # inspect the cleaned data before fitting the interaction models
##      mpg cylinders displacement horsepower weight acceleration year origin
##   1:  18         8          307         17   3504         12.0   70      1
##   2:  15         8          350         35   3693         11.5   70      1
##   3:  18         8          318         29   3436         11.0   70      1
##   4:  16         8          304         29   3433         12.0   70      1
##   5:  17         8          302         24   3449         10.5   70      1
##  ---                                                                      
## 393:  27         4          140         82   2790         15.6   82      1
## 394:  44         4           97         53   2130         24.6   82      2
## 395:  32         4          135         80   2295         11.6   82      1
## 396:  28         4          120         75   2625         18.6   82      1
## 397:  31         4          119         78   2720         19.4   82      1
# Note: because data = autoDT is named, the second positional argument below is
# matched to lm()'s subset argument rather than entering the formula; see the
# "Call:" lines and warnings in the output that follows.
lm.fit2 = lm(mpg~., horsepower*displacement, data = autoDT)
lm.fit3 = lm(mpg~., horsepower:displacement, data = autoDT)
## Warning in horsepower:displacement: numerical expression has 397 elements: only
## the first used

## Warning in horsepower:displacement: numerical expression has 397 elements: only
## the first used
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
## 
## Call:
## lm(formula = mpg ~ ., data = autoDT, subset = horsepower * displacement)
## 
## Residuals:
##         98        200        390         85        140      140.1        100 
## -1.824e-31  2.988e-32 -1.278e-32  2.849e-32  7.850e-17 -7.850e-17  1.310e-31 
##        238 
## -6.734e-33 
## 
## Coefficients: (1 not defined because of singularities)
##                Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)  -7.399e+01  2.548e-14 -2.904e+15  < 2e-16 ***
## cylinders    -7.337e-01  6.280e-16 -1.168e+15 5.45e-16 ***
## displacement  4.479e-02  5.554e-17  8.064e+14 7.89e-16 ***
## horsepower    1.110e-01  4.768e-17  2.327e+15 2.74e-16 ***
## weight       -5.372e-03  2.327e-18 -2.308e+15 2.76e-16 ***
## acceleration  1.852e+00  1.005e-15  1.842e+15 3.46e-16 ***
## year          9.858e-01  9.556e-17  1.032e+16  < 2e-16 ***
## origin               NA         NA         NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.11e-16 on 1 degrees of freedom
##   (389 observations deleted due to missingness)
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.874e+33 on 6 and 1 DF,  p-value: < 2.2e-16
summary(lm.fit3)
## 
## Call:
## lm(formula = mpg ~ ., data = autoDT, subset = horsepower:displacement)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2775 -1.7419  0.0712  1.6601 13.2517 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -3.4476391  4.9794417  -0.692   0.4893    
## cylinders    -0.2248552  0.3350671  -0.671   0.5027    
## displacement  0.0043062  0.0077799   0.554   0.5804    
## horsepower    0.0098612  0.0064640   1.526   0.1282    
## weight       -0.0058010  0.0005948  -9.753   <2e-16 ***
## acceleration  0.0203068  0.0780000   0.260   0.7948    
## year          0.5517624  0.0618582   8.920   <2e-16 ***
## origin        0.6729376  0.2957428   2.275   0.0236 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.776 on 283 degrees of freedom
## Multiple R-squared:  0.8207, Adjusted R-squared:  0.8163 
## F-statistic: 185.1 on 7 and 283 DF,  p-value: < 2.2e-16
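
As the comments above note, in both calls the second positional argument was matched to lm()'s subset argument, so neither printed model actually contains an interaction term (hence the warnings and the degenerate lm.fit2 fit). A corrected sketch puts the interaction in the formula itself:

lm.fit2b = lm(mpg ~ . + horsepower * displacement, data = autoDT)  # main effects plus interaction
lm.fit3b = lm(mpg ~ . + horsepower : displacement, data = autoDT)  # adds only the interaction term
summary(lm.fit3b)  # the horsepower:displacement row carries the interaction's t-test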

Significant interaction terms include displacement:horsepower and horsepower:origin; note, however, that this should be judged from the corrected interaction fits sketched above, since the fits printed here never included the interaction terms.

F) Try a few different transformations of the variables, such as log(X), sqrt(X), and X^2. Comment on your findings.

summary(lm(mpg~. +log(horsepower) +sqrt(displacement) + sqrt(acceleration), data = autoDT))
## 
## Call:
## lm(formula = mpg ~ . + log(horsepower) + sqrt(displacement) + 
##     sqrt(acceleration), data = autoDT)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3481  -1.6910   0.0891   1.6103  12.0139 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         2.820e+01  1.567e+01   1.799   0.0728 .  
## cylinders           3.627e-01  3.678e-01   0.986   0.3246    
## displacement        1.239e-01  1.954e-02   6.338 6.52e-10 ***
## horsepower          1.031e-03  1.868e-02   0.055   0.9560    
## weight             -5.445e-03  5.719e-04  -9.521  < 2e-16 ***
## acceleration        1.958e+00  9.785e-01   2.001   0.0461 *  
## year                8.216e-01  4.677e-02  17.568  < 2e-16 ***
## origin              3.173e-01  2.783e-01   1.140   0.2549    
## log(horsepower)     1.242e-01  4.657e-01   0.267   0.7898    
## sqrt(displacement) -3.885e+00  6.071e-01  -6.400 4.50e-10 ***
## sqrt(acceleration) -1.430e+01  7.909e+00  -1.808   0.0714 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.072 on 386 degrees of freedom
## Multiple R-squared:  0.8498, Adjusted R-squared:  0.8459 
## F-statistic: 218.4 on 10 and 386 DF,  p-value: < 2.2e-16
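
The transformations improve the fit: R^2 rises from 0.822 in part C to 0.8498, with sqrt(displacement) strongly significant while the raw horsepower and log(horsepower) terms are not. Since the part C model is nested inside this one, the improvement can be tested formally; a short sketch (lm.fit is the part C fit):

anova(lm.fit, lm(mpg ~ . + log(horsepower) + sqrt(displacement) +
                   sqrt(acceleration), data = autoDT))  # F-test for the added terms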

10. This question should be answered using the Carseats data set.

A) Fit a multiple regression model to predict Sales using Price, Urban, and US.

library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.3
temp = Carseats[,c("Sales", "Price", "Urban", "US")]
lm.fit = lm(Sales ~., data = temp)
summary(lm.fit)
## 
## Call:
## lm(formula = Sales ~ ., data = temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

B) Provide an interpretation of each coefficient in the model. Be careful: some of the variables are qualitative.

summary(temp)
##      Sales            Price       Urban       US     
##  Min.   : 0.000   Min.   : 24.0   No :118   No :142  
##  1st Qu.: 5.390   1st Qu.:100.0   Yes:282   Yes:258  
##  Median : 7.490   Median :117.0                      
##  Mean   : 7.496   Mean   :115.8                      
##  3rd Qu.: 9.320   3rd Qu.:131.0                      
##  Max.   :16.270   Max.   :191.0
?Carseats  # look up the variable definitions: Sales is unit sales in thousands
## starting httpd help server ... done

Price: For each $1 increase in Price, holding the other variables constant, Sales decreases by about 0.0545 thousand units, i.e. roughly 54 carseats.

Urban: The UrbanYes coefficient is small and not statistically significant, so sales do not appear to differ between urban and rural locations.

US: Holding Price constant, a store in the US sells about 1.2 thousand (roughly 1,200) more carseats than a store outside the US.

C) Write the model out in equation form:

\[ \widehat{Sales} = 13.0435 - 0.0545 \, Price - 0.0219 \, Urban_{Yes} + 1.2006 \, US_{Yes} \] where \(Urban_{Yes}\) and \(US_{Yes}\) are indicator variables equal to 1 for urban locations and US stores, respectively.
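
As a quick sanity check of the equation (a sketch; new.obs is a made-up observation, and lm.fit is the fit from part A):

new.obs = data.frame(Price = 100, Urban = "Yes", US = "Yes")
predict(lm.fit, new.obs)  # 13.0435 - 0.0545*100 - 0.0219 + 1.2006, about 8.78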

D) For which of the predictors can you reject the null hypothesis \(H_0: \beta_j = 0\)?

We can reject the null hypothesis for Price and US, whose p-values are far below 0.05. We cannot reject it for Urban, since its p-value (0.936) is not statistically significant.

E) On the basis of the previous answer, fit a smaller model that uses only the predictors with evidence of association with the response.

lm.fit2 = lm(Sales ~. -Urban, data = temp)
summary(lm.fit2)
## 
## Call:
## lm(formula = Sales ~ . - Urban, data = temp)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

F) How well do the models in (A) and (E) fit the data?

Both models have an R^2 of about 0.24, so they explain only about 24% of the variance in Sales; neither fits particularly well, although the smaller model achieves essentially the same fit with one fewer predictor (its adjusted R^2 is slightly higher).
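
A one-line comparison of the two fits (a sketch; lm.fit and lm.fit2 are the models from parts A and E):

c(model_A = summary(lm.fit)$adj.r.squared, model_E = summary(lm.fit2)$adj.r.squared)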

G) Using the model from (E), obtain 95% confidence intervals for the coefficients.

confint(lm.fit2)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

H) Is there evidence of outliers or high leverage observations in the model from (E)?

par(mfrow = c(2, 2))  # show all four diagnostic plots together
plot(lm.fit2)

Based on these plots, the standardized residuals stay within [-3, 3] and no point falls beyond the Cook's distance contours, so there is no strong evidence of outliers or unduly influential high-leverage observations.

12. This problem involves a simple linear regression without an intercept

a) Recall that the coefficient estimate βˆ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

The coefficient estimate for the regression of Y onto X is \[ \widehat{\beta} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2} \] while the coefficient estimate for the regression of X onto Y is \[ \widehat{\beta}' = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2} \] The numerators are identical, so the two coefficient estimates are the same if and only if \[ \sum_{i} x_i^2 = \sum_{i} y_i^2 \]

b) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X

# normally distributed predictor and a noisy linear response
set.seed(1)  # for reproducibility
x = rnorm(100, mean = 10, sd = 10)
y = 2*x + rnorm(100, mean = 5, sd = 20)

lm.fit  = lm(y ~ x + 0)  # regression of Y onto X without an intercept
lm.fit1 = lm(x ~ y + 0)  # regression of X onto Y without an intercept
# The slope estimates differ because sum(x^2) != sum(y^2); the R^2 values,
# however, are the same, since R^2 is symmetric in x and y (we are just
# flipping the axes).
coef(lm.fit); coef(lm.fit1)
plot(x, y)

c) Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

set.seed(100)
x = rnorm(100, mean = 10, sd = 10)

set.seed(100)  # same seed, so y is an identical draw and sum(x^2) == sum(y^2)
y = rnorm(100, mean = 10, sd = 10)

lm.fitYX = lm(y ~ x + 0)  # regression of Y onto X without an intercept
lm.fitXY = lm(x ~ y + 0)  # regression of X onto Y without an intercept
coef(lm.fitYX); coef(lm.fitXY)  # both slope estimates equal 1, as expected
plot(x, y)

Problems covered: 2, 9, 10, 12