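A KNN classifier is used when the response variable is categorical (discrete, e.g. Yes or No), while KNN regression is used when the response is continuous/quantitative. The classifier predicts the most common class among the k nearest neighbors; the regressor predicts the average of their responses.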
# Read in the data
auto = read.csv("https://www.statlearning.com/s/Auto.csv", stringsAsFactors=TRUE)
# Produce a scatterplot matrix of all variables
plot(auto)
library(data.table)
autoDT = as.data.table(auto)
autoDT = autoDT[,-"name"]
autoDT[is.na(autoDT$horsepower)] <- 0
autoDT$horsepower=ifelse(autoDT$horsepower == "NA", NA, autoDT$horsepower)
autoDT$horsepower = as.numeric(autoDT$horsepower)
cor(autoDT)
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7762599 -0.8044430 0.4228227 -0.8317389
## cylinders -0.7762599 1.0000000 0.9509199 -0.5466585 0.8970169
## displacement -0.8044430 0.9509199 1.0000000 -0.4820705 0.9331044
## horsepower 0.4228227 -0.5466585 -0.4820705 1.0000000 -0.4821507
## weight -0.8317389 0.8970169 0.9331044 -0.4821507 1.0000000
## acceleration 0.4222974 -0.5040606 -0.5441618 0.2662877 -0.4195023
## year 0.5814695 -0.3467172 -0.3698041 0.1274167 -0.3079004
## origin 0.5636979 -0.5649716 -0.6106643 0.2973734 -0.5812652
## acceleration year origin
## mpg 0.4222974 0.5814695 0.5636979
## cylinders -0.5040606 -0.3467172 -0.5649716
## displacement -0.5441618 -0.3698041 -0.6106643
## horsepower 0.2662877 0.1274167 0.2973734
## weight -0.4195023 -0.3079004 -0.5812652
## acceleration 1.0000000 0.2829009 0.2100836
## year 0.2829009 1.0000000 0.1843141
## origin 0.2100836 0.1843141 1.0000000
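I attempted to replace the missing values with 0 and then convert horsepower with as.numeric (or as.double); however, the conversion still does not give clean numeric values. I talked to Dr. Campbell about this and this is as far as we got. The likely cause is that Auto.csv marks missing horsepower values with "?", so with stringsAsFactors=TRUE the column is read as a factor, and converting a factor returns its level codes rather than the recorded values (the first car prints with horsepower 17 instead of 130). The horsepower correlations and coefficients in this report are therefore based on those codes and should be read with caution.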
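A minimal sketch of one way around the conversion problem, assuming the missing horsepower entries are stored as "?" in the CSV. The inline `csv_text` sample and the names `raw`, `codes`, and `fixed` are made up for illustration and are not the real Auto data; `na.strings` is a standard `read.csv` argument.

```r
# Tiny made-up sample that mimics the "?" missing-value marker (not the real data)
csv_text <- "mpg,horsepower,name
18,130,car a
25,?,car b
15,165,car c"

# Default read: the "?" forces horsepower to be read as text (a factor here)
raw <- read.csv(text = csv_text, stringsAsFactors = TRUE)
codes <- as.numeric(raw$horsepower)   # factor level codes, not horsepower values

# Declaring "?" as NA lets read.csv parse the column as numeric
fixed <- read.csv(text = csv_text, stringsAsFactors = TRUE, na.strings = "?")
fixed <- na.omit(fixed)               # drop rows with missing horsepower

print(codes)
print(fixed$horsepower)
```

Applied to the real file, the same `na.strings = "?"` argument would go into the original `read.csv` call; the outputs shown in this report were not produced that way.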
lm.fit = lm(mpg ~., data = autoDT) # name is already removed
summary(lm.fit)
##
## Call:
## lm(formula = mpg ~ ., data = autoDT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.629 -2.034 -0.046 1.801 13.010
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.128e+01 4.259e+00 -4.998 8.78e-07 ***
## cylinders -2.927e-01 3.382e-01 -0.865 0.3874
## displacement 1.603e-02 7.284e-03 2.201 0.0283 *
## horsepower 7.942e-03 6.809e-03 1.166 0.2442
## weight -6.870e-03 5.799e-04 -11.846 < 2e-16 ***
## acceleration 1.539e-01 7.750e-02 1.986 0.0477 *
## year 7.734e-01 4.939e-02 15.661 < 2e-16 ***
## origin 1.346e+00 2.691e-01 5.004 8.52e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.331 on 389 degrees of freedom
## Multiple R-squared: 0.822, Adjusted R-squared: 0.8188
## F-statistic: 256.7 on 7 and 389 DF, p-value: < 2.2e-16
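Yes, there appears to be a relationship between the predictors and the response. The F-statistic is 256.7 on 7 and 389 degrees of freedom with a p-value below 2.2e-16, so we reject the hypothesis that all coefficients are zero. In addition, about 82.2% of the variation in mpg is explained by the linear model (multiple R-squared 0.822, adjusted R-squared 0.8188).
The predictors with a statistically significant relationship to mpg at the 0.05 level are displacement, weight, acceleration, year, and origin; cylinders and horsepower are not significant.
The coefficient for year suggests that, holding the other predictors fixed, each one-year increase in model year is associated with an increase of about 0.773 mpg (7.734e-01).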
par(mfrow = c(2,2))
plot(lm.fit)
High Leverage Points: The residuals vs. leverage plot is used to identify high-leverage observations. Points far to the right have high leverage, and points beyond the Cook's distance contours are influential. There appear to be no observations with unusually large leverage.
Outliers: An observation is flagged as an outlier only if its standardized residual is outside the range [-3, 3]. The largest residual in the summary above is 13.010 against a residual standard error of 3.331, which is roughly 3.9 standard errors, so at least one observation falls outside this range and should be checked.
Normality: Based on the Q-Q plot, the residuals do not closely follow a normal distribution, especially at observations 323, 327, and 326.
Non-Linearity: The residuals vs. fitted plot suggests that the relationship is not linear.
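The same checks can also be done numerically instead of by eye. The sketch below uses the built-in mtcars data as a stand-in (an assumption, so it runs without downloading Auto.csv); the cutoff 2 * (p + 1) / n for leverage is a common rule of thumb, not something taken from this report.

```r
# Stand-in data: built-in mtcars, not the Auto data used above
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)

std_res  <- rstandard(fit)        # standardized residuals
leverage <- hatvalues(fit)        # diagonal of the hat matrix
cooks    <- cooks.distance(fit)   # influence measure drawn as contours by plot(fit)

# Flag outliers with |standardized residual| > 3
outliers <- names(std_res)[abs(std_res) > 3]

# Rule-of-thumb leverage cutoff (an assumption): 2 * (p + 1) / n
p <- length(coef(fit)) - 1
n <- nrow(mtcars)
high_leverage <- names(leverage)[leverage > 2 * (p + 1) / n]

print(outliers)
print(high_leverage)
```

Replacing `fit` with the model fitted to the Auto data would list the flagged rows by name rather than relying on the plot labels.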
autoDT
## mpg cylinders displacement horsepower weight acceleration year origin
## 1: 18 8 307 17 3504 12.0 70 1
## 2: 15 8 350 35 3693 11.5 70 1
## 3: 18 8 318 29 3436 11.0 70 1
## 4: 16 8 304 29 3433 12.0 70 1
## 5: 17 8 302 24 3449 10.5 70 1
## ---
## 393: 27 4 140 82 2790 15.6 82 1
## 394: 44 4 97 53 2130 24.6 82 2
## 395: 32 4 135 80 2295 11.6 82 1
## 396: 28 4 120 75 2625 18.6 82 1
## 397: 31 4 119 78 2720 19.4 82 1
lm.fit2 = lm(mpg~., horsepower*displacement, data = autoDT)
lm.fit3 = lm(mpg~.,horsepower:displacement, data = autoDT)
## Warning in horsepower:displacement: numerical expression has 397 elements: only
## the first used
## Warning in horsepower:displacement: numerical expression has 397 elements: only
## the first used
summary(lm.fit2)
## Warning in summary.lm(lm.fit2): essentially perfect fit: summary may be
## unreliable
##
## Call:
## lm(formula = mpg ~ ., data = autoDT, subset = horsepower * displacement)
##
## Residuals:
## 98 200 390 85 140 140.1 100
## -1.824e-31 2.988e-32 -1.278e-32 2.849e-32 7.850e-17 -7.850e-17 1.310e-31
## 238
## -6.734e-33
##
## Coefficients: (1 not defined because of singularities)
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.399e+01 2.548e-14 -2.904e+15 < 2e-16 ***
## cylinders -7.337e-01 6.280e-16 -1.168e+15 5.45e-16 ***
## displacement 4.479e-02 5.554e-17 8.064e+14 7.89e-16 ***
## horsepower 1.110e-01 4.768e-17 2.327e+15 2.74e-16 ***
## weight -5.372e-03 2.327e-18 -2.308e+15 2.76e-16 ***
## acceleration 1.852e+00 1.005e-15 1.842e+15 3.46e-16 ***
## year 9.858e-01 9.556e-17 1.032e+16 < 2e-16 ***
## origin NA NA NA NA
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11e-16 on 1 degrees of freedom
## (389 observations deleted due to missingness)
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 4.874e+33 on 6 and 1 DF, p-value: < 2.2e-16
summary(lm.fit3)
##
## Call:
## lm(formula = mpg ~ ., data = autoDT, subset = horsepower:displacement)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2775 -1.7419 0.0712 1.6601 13.2517
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.4476391 4.9794417 -0.692 0.4893
## cylinders -0.2248552 0.3350671 -0.671 0.5027
## displacement 0.0043062 0.0077799 0.554 0.5804
## horsepower 0.0098612 0.0064640 1.526 0.1282
## weight -0.0058010 0.0005948 -9.753 <2e-16 ***
## acceleration 0.0203068 0.0780000 0.260 0.7948
## year 0.5517624 0.0618582 8.920 <2e-16 ***
## origin 0.6729376 0.2957428 2.275 0.0236 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.776 on 283 degrees of freedom
## Multiple R-squared: 0.8207, Adjusted R-squared: 0.8163
## F-statistic: 185.1 on 7 and 283 DF, p-value: < 2.2e-16
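Significant interaction terms include displacement:horsepower and horsepower:origin. Note that lm.fit2 and lm.fit3 above do not estimate these interactions: the expression after the formula was matched to the subset argument of lm() (see subset = ... in the Call lines and the warnings), so neither summary contains an interaction coefficient. To fit an interaction, it has to be written inside the formula, for example mpg ~ . + horsepower:displacement.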
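A short sketch of how an interaction is specified inside the formula. It uses the built-in mtcars data as a stand-in (an assumption), with hp and disp playing the role of horsepower and displacement; the names `fit_colon`, `fit_star`, and `inter` are illustrative only.

```r
# Stand-in data: built-in mtcars, not the Auto data used above
# a:b adds only the interaction term to the terms already in the formula
fit_colon <- lm(mpg ~ . + hp:disp, data = mtcars)

# a*b expands to a + b + a:b; since "." already includes hp and disp,
# both formulas describe the same model
fit_star <- lm(mpg ~ . + hp * disp, data = mtcars)

# The interaction appears as its own coefficient row (its name contains ":")
inter <- grep(":", names(coef(fit_colon)), value = TRUE)
print(coef(summary(fit_colon))[inter, ])
```

The p-value in that row is what decides whether the interaction is significant.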
summary(lm(mpg~. +log(horsepower) +sqrt(displacement) + sqrt(acceleration), data = autoDT))
##
## Call:
## lm(formula = mpg ~ . + log(horsepower) + sqrt(displacement) +
## sqrt(acceleration), data = autoDT)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3481 -1.6910 0.0891 1.6103 12.0139
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.820e+01 1.567e+01 1.799 0.0728 .
## cylinders 3.627e-01 3.678e-01 0.986 0.3246
## displacement 1.239e-01 1.954e-02 6.338 6.52e-10 ***
## horsepower 1.031e-03 1.868e-02 0.055 0.9560
## weight -5.445e-03 5.719e-04 -9.521 < 2e-16 ***
## acceleration 1.958e+00 9.785e-01 2.001 0.0461 *
## year 8.216e-01 4.677e-02 17.568 < 2e-16 ***
## origin 3.173e-01 2.783e-01 1.140 0.2549
## log(horsepower) 1.242e-01 4.657e-01 0.267 0.7898
## sqrt(displacement) -3.885e+00 6.071e-01 -6.400 4.50e-10 ***
## sqrt(acceleration) -1.430e+01 7.909e+00 -1.808 0.0714 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.072 on 386 degrees of freedom
## Multiple R-squared: 0.8498, Adjusted R-squared: 0.8459
## F-statistic: 218.4 on 10 and 386 DF, p-value: < 2.2e-16
library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.1.3
temp = Carseats[,c("Sales", "Price", "Urban", "US")]
lm.fit = lm(Sales ~., data = temp)
summary(lm.fit)
##
## Call:
## lm(formula = Sales ~ ., data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
summary(temp)
## Sales Price Urban US
## Min. : 0.000 Min. : 24.0 No :118 No :142
## 1st Qu.: 5.390 1st Qu.:100.0 Yes:282 Yes:258
## Median : 7.490 Median :117.0
## Mean : 7.496 Mean :115.8
## 3rd Qu.: 9.320 3rd Qu.:131.0
## Max. :16.270 Max. :191.0
?Carseats
## starting httpd help server ... done
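As Price increases by 1 unit with all other variables held constant, Sales decreases by about 0.0545, which is roughly 54 car seats because Sales is recorded in thousands of units.
Sales are not significantly affected by whether the store is in an urban or a rural area (UrbanYes has a p-value of 0.936).
On average, a store in the US sells about 1,200 more car seats than a store outside the US, with the other variables held constant.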
Sales = 13.043469 - 0.054459(Price) - 0.021916(UrbanYes) + 1.200573(USYes), where UrbanYes and USYes equal 1 for Yes and 0 for No.
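As a worked example of the equation above, the sketch below plugs in one hypothetical store (Price = 100, urban, in the US; these inputs are made up for illustration). The coefficients are copied from the summary output.

```r
# Coefficients copied from the summary output above
b0      <- 13.043469
b_price <- -0.054459
b_urban <- -0.021916
b_us    <- 1.200573

# Hypothetical store used only for illustration
price     <- 100
urban_yes <- 1
us_yes    <- 1

predicted_sales <- b0 + b_price * price + b_urban * urban_yes + b_us * us_yes
print(predicted_sales)   # in thousands of units
```

The result is about 8.78, i.e. roughly 8,780 car seats.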
Urban is dropped from the model because its p-value (0.936) is not statistically significant; the null hypothesis can be rejected only for Price and US.
lm.fit2 = lm(Sales ~. -Urban, data = temp)
summary(lm.fit2)
##
## Call:
## lm(formula = Sales ~ . - Urban, data = temp)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
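Both models have an R^2 of about 0.24, so each explains only about 24% of the variance in Sales, and neither fits the data well. The smaller model has a slightly higher adjusted R^2 (0.2354 vs. 0.2335) and a slightly lower residual standard error (2.469 vs. 2.472).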
confint(lm.fit2)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
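The intervals above can be reproduced by hand as estimate ± t × standard error. The sketch below copies the rounded estimates and standard errors from the summary of lm.fit2, so the results agree with confint() only up to rounding; the names `est`, `se`, and `ci` are illustrative.

```r
# Estimates and standard errors copied from the summary of lm.fit2 above
est <- c(Price = -0.05448, USYes = 1.19964)
se  <- c(Price = 0.00523,  USYes = 0.25846)
df  <- 397                        # residual degrees of freedom from the summary

t_crit <- qt(0.975, df)           # two-sided 95% critical value
ci <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
print(round(ci, 4))
```

Neither interval contains zero, which matches the significant p-values for Price and USYes.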
plot(lm.fit2)
Based on such plots, there is no evidence
The coefficient estimate for the regression of Y onto X (without an intercept) is \[ \widehat{\beta} = \frac{\sum_{i} x_i y_i}{\sum_{i} x_i^2} \] whereas the coefficient estimate for the regression of X onto Y is \[ \widehat{\beta'} = \frac{\sum_{i} x_i y_i}{\sum_{i} y_i^2} \] Provided \(\sum_{i} x_i y_i \neq 0\), the two coefficients are the same if and only if \[ \sum_{i} x_i^2 = \sum_{i} y_i^2 \]
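A quick numerical check of the two formulas, sketched on simulated data (the seed and the data-generating line are arbitrary choices for illustration). The `+ 0` in the formula removes the intercept so that lm() matches the closed form.

```r
set.seed(1)                        # arbitrary seed for reproducibility
x <- rnorm(100)
y <- 2 * x + rnorm(100)

# Closed-form no-intercept estimates
beta_yx <- sum(x * y) / sum(x^2)   # Y onto X
beta_xy <- sum(x * y) / sum(y^2)   # X onto Y

# Same estimates from lm() without an intercept
fit_yx <- lm(y ~ x + 0)
fit_xy <- lm(x ~ y + 0)

print(c(beta_yx, unname(coef(fit_yx))))
print(c(beta_xy, unname(coef(fit_xy))))
```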
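# Normally distributed predictor and noise
x = rnorm(100, mean = 10, sd = 10)
y = 2*x + rnorm(100, mean = 5, sd = 20)
lm.fit = lm(y~x )
lm.fit1 = lm(x~y)
# The coefficient estimates differ between the two fits, but R^2 is the same:
# with a single predictor, R^2 is the squared correlation between x and y,
# which does not depend on which variable is the response
plot(x,y)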
# (c) Generate an example in R with n = 100 observations in which the
# coefficient estimate for the regression of X onto Y is the same as the
# coefficient estimate for the regression of Y onto X
set.seed(100)
x = rnorm(100, mean = 10, sd = 10)
set.seed(100) # reusing the seed makes y identical to x, so both slopes equal 1
y = rnorm(100, mean = 10, sd = 10)
lm.fit = lm(x~y)
lm.fit1 = lm(y~x) # stored under a new name so the first fit is not overwritten
plot(x,y)
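Resetting the seed makes y an exact copy of x, which satisfies the condition trivially. A less degenerate sketch, under the assumption that any y with the same sum of squares as x will do, is to shuffle x; `sample()` and the names `b_yx` and `b_xy` are choices made here for illustration.

```r
set.seed(100)
x <- rnorm(100, mean = 10, sd = 10)
y <- sample(x)                  # same values in a different order, so the sums of squares match

b_yx <- coef(lm(y ~ x + 0))     # Y onto X, no intercept
b_xy <- coef(lm(x ~ y + 0))     # X onto Y, no intercept
print(unname(c(b_yx, b_xy)))
```

The two printed coefficients agree even though x and y are no longer identical.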
2, 9, 10, 12