library(ISLR2)
## Warning: package 'ISLR2' was built under R version 4.3.3
attach(Carseats)
head(Carseats)
## Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1 9.50 138 73 11 276 120 Bad 42 17
## 2 11.22 111 48 16 260 83 Good 65 10
## 3 10.06 113 35 10 269 80 Medium 59 12
## 4 7.40 117 100 4 466 97 Medium 55 14
## 5 4.15 141 64 3 340 128 Bad 38 13
## 6 10.81 124 113 13 501 72 Bad 78 16
## Urban US
## 1 Yes Yes
## 2 Yes Yes
## 3 Yes Yes
## 4 Yes Yes
## 5 Yes No
## 6 No Yes
summary(Carseats)
## Sales CompPrice Income Advertising
## Min. : 0.000 Min. : 77 Min. : 21.00 Min. : 0.000
## 1st Qu.: 5.390 1st Qu.:115 1st Qu.: 42.75 1st Qu.: 0.000
## Median : 7.490 Median :125 Median : 69.00 Median : 5.000
## Mean : 7.496 Mean :125 Mean : 68.66 Mean : 6.635
## 3rd Qu.: 9.320 3rd Qu.:135 3rd Qu.: 91.00 3rd Qu.:12.000
## Max. :16.270 Max. :175 Max. :120.00 Max. :29.000
## Population Price ShelveLoc Age Education
## Min. : 10.0 Min. : 24.0 Bad : 96 Min. :25.00 Min. :10.0
## 1st Qu.:139.0 1st Qu.:100.0 Good : 85 1st Qu.:39.75 1st Qu.:12.0
## Median :272.0 Median :117.0 Medium:219 Median :54.50 Median :14.0
## Mean :264.8 Mean :115.8 Mean :53.32 Mean :13.9
## 3rd Qu.:398.5 3rd Qu.:131.0 3rd Qu.:66.00 3rd Qu.:16.0
## Max. :509.0 Max. :191.0 Max. :80.00 Max. :18.0
## Urban US
## No :118 No :142
## Yes:282 Yes:258
##
##
##
##
# Question 2
# Carefully explain the differences between the KNN classifier and KNN
# regression methods.
# The two methods differ in the type of response they predict and in how
# they combine the responses of the nearest neighbors. The KNN classifier is
# used when the response variable is categorical (e.g., Yes/No), while KNN
# regression is used when the response variable is quantitative.
# Both start the same way: for a given observation, they find the K training
# observations closest to it. The KNN classifier then assigns the class
# label that is most common among those K neighbors, whereas KNN regression
# predicts the average of the response values of those K neighbors.
# In a practical setting, the KNN classifier suits a classification task
# such as determining whether a customer will purchase a book (Yes/No),
# while KNN regression suits predicting the price of a book.
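The contrast can be sketched in a few lines of base R. This is a minimal illustration, not a tuned implementation: `knn_predict` is a hypothetical helper written for this example, and the toy data are made up. The neighbor search is identical in both cases; only the last step (majority vote versus mean) changes.

```r
# Hypothetical helper: predict for one query point from its k nearest neighbors.
# type = "class" -> majority vote; type = "reg" -> mean of the neighbors' responses.
knn_predict <- function(train_x, train_y, query, k = 3, type = c("class", "reg")) {
  type <- match.arg(type)
  # Euclidean distance from the query to every training row
  d <- sqrt(rowSums(sweep(as.matrix(train_x), 2, query)^2))
  nn <- order(d)[seq_len(k)]          # indices of the k closest training rows
  if (type == "class") {
    votes <- table(train_y[nn])
    names(votes)[which.max(votes)]    # most common class among the neighbors
  } else {
    mean(train_y[nn])                 # average response among the neighbors
  }
}

# Toy data (made up for illustration): one predictor, two kinds of response
x     <- matrix(c(1, 2, 3, 10, 11, 12), ncol = 1)
buys  <- c("No", "No", "No", "Yes", "Yes", "Yes")  # categorical response
price <- c(5, 6, 7, 20, 22, 24)                    # quantitative response

pred_class <- knn_predict(x, buys,  query = 2.5, k = 3, type = "class")
pred_price <- knn_predict(x, price, query = 2.5, k = 3, type = "reg")
```

With this toy data the three nearest neighbors of 2.5 are the first three rows, so the classifier returns their shared label and the regression returns the mean of their prices.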
# Question 9 This question involves the use of multiple linear regression on the
# Auto data set.
# (a) Produce a scatterplot matrix which includes all of
# the variables in the data set.
pairs(Auto)
# Question 9 part b
# Compute the matrix of correlations between the variables using
# the function cor(). You will need to exclude the name variable, which is
# qualitative.
cor(Auto[, names(Auto) != "name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
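When a data frame has several qualitative columns, naming each one to exclude gets tedious. A possible alternative, sketched here on the built-in `iris` data as a stand-in (an assumption made so the example runs without ISLR2), is to keep only the numeric columns before calling `cor()`.

```r
# iris is a built-in stand-in: four numeric columns plus the factor Species
numeric_cols <- sapply(iris, is.numeric)   # TRUE for each numeric column
cor_mat <- cor(iris[, numeric_cols])       # correlations among numeric columns only
round(cor_mat, 2)
```

The same pattern should work for `Auto`, where `name` is the only non-numeric column.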
# Question 9 part c
# Use the lm() function to perform a multiple linear regression
# with mpg as the response and all other variables except name as
# the predictors. Use the summary() function to print the results.
# Comment on the output.
lmmodel_q9 = lm(mpg ~. -name, data = Auto)
summary(lmmodel_q9)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
# Question 9 part c
# Is there a relationship between the predictors and the response?
# Yes. The p-value associated with the F-statistic (< 2.2e-16) is far below
# 0.05, so we reject the null hypothesis that all coefficients are zero: at
# least one predictor has a relationship with the response. The individual
# t-tests show that some predictors are significant while others are not.
# The R-squared of 0.8215 (adjusted: 0.8182) indicates that about 82% of the
# variance in the response variable mpg is explained by the predictors.
# Which predictors appear to have a statistically significant
# relationship to the response?
# The predictors that have a statistically significant relationship to the
# response are displacement, weight, year, and origin.
# What does the coefficient for the year variable suggest?
# Holding the other predictors fixed, mpg increases by about 0.75 for each
# additional model year, so on average cars become more fuel efficient by
# roughly 0.75 mpg per year.
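The significance screen used in this answer can also be done programmatically by reading the p-value column of the coefficient table. The sketch below uses the built-in `mtcars` data as a stand-in for `Auto`, and the choice of predictors is an assumption made only for illustration.

```r
# mtcars is a built-in stand-in for Auto; predictors chosen only for illustration
fit_demo <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|)
coef_table <- coef(summary(fit_demo))
p_values <- coef_table[, "Pr(>|t|)"]

# Names of the terms whose p-value is below 0.05, intercept excluded
significant <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
significant
```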
# Question 9 part d
# Use the plot() function to produce diagnostic plots of the linear
# regression fit. Comment on any problems you see with the fit.
# Do the residual plots suggest any unusually large outliers? Does
# the leverage plot identify any observations with unusually high
# leverage?
par(mfrow = c(2, 2))
plot(lmmodel_q9)
# In the residuals vs. fitted plot there is a slight pattern (a check-mark
# shaped curve), so the residuals are not scattered randomly around zero.
# A curve like this points to non-linearity that the model does not capture,
# rather than to non-constant variance, which would show up as a funnel shape.
# In the QQ plot the majority of the points lie on the line, which indicates
# approximate normality; however, a few points at the tail end deviate from
# this pattern.
# In the leverage plot, point 14 is quite far away from the other points,
# so it is a possible high-leverage observation; whether it is also an
# outlier depends on the size of its standardized residual.
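Reading leverage and outliers off the plots can be backed up with numbers. The sketch below again uses `mtcars` as a stand-in. The cutoff of twice the average leverage is a common rule of thumb and is an assumption here, as is the threshold of 3 for studentized residuals.

```r
# mtcars is a built-in stand-in for Auto
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
n <- nrow(mtcars)
p <- length(coef(fit_demo)) - 1        # number of predictors

lev     <- hatvalues(fit_demo)         # leverage statistic for each observation
avg_lev <- (p + 1) / n                 # average leverage across all observations
rstud   <- rstudent(fit_demo)          # studentized residuals

# Rule-of-thumb flags (assumed cutoffs, not hard rules)
high_leverage     <- names(lev)[lev > 2 * avg_lev]
possible_outliers <- names(rstud)[abs(rstud) > 3]
```

An observation can appear in one list without appearing in the other: leverage measures how unusual the predictor values are, while the studentized residual measures how far the response is from the fit.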
# Question 9 part e
# Use the * and : symbols to fit linear regression models with
# interaction effects. Do any interactions appear to be statistically
# significant?
lmmodel_q9e = lm(mpg ~.-name + displacement:weight + horsepower*cylinders,
data = Auto)
summary(lmmodel_q9e)
##
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower *
## cylinders, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5149 -1.5579 -0.0921 1.4015 12.0051
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.109e+00 5.069e+00 1.402 0.161597
## cylinders -2.656e+00 6.921e-01 -3.838 0.000145 ***
## displacement -3.630e-02 1.301e-02 -2.790 0.005529 **
## horsepower -2.170e-01 4.354e-02 -4.985 9.41e-07 ***
## weight -6.819e-03 1.113e-03 -6.125 2.26e-09 ***
## acceleration -8.749e-02 9.290e-02 -0.942 0.346923
## year 7.600e-01 4.484e-02 16.949 < 2e-16 ***
## origin 6.728e-01 2.574e-01 2.614 0.009314 **
## displacement:weight 1.092e-05 3.464e-06 3.153 0.001744 **
## cylinders:horsepower 2.590e-02 5.880e-03 4.405 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.895 on 382 degrees of freedom
## Multiple R-squared: 0.8656, Adjusted R-squared: 0.8624
## F-statistic: 273.3 on 9 and 382 DF, p-value: < 2.2e-16
# Based on the results, both the displacement:weight interaction
# (p = 0.0017) and the cylinders:horsepower interaction (p = 1.37e-05) are
# statistically significant, since each p-value is less than 0.05.
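The difference between the two operators is easy to check directly: `a * b` expands to `a + b + a:b`, while `a:b` adds only the interaction term. A small sketch on the built-in `mtcars` data (a stand-in chosen so the example runs without ISLR2):

```r
# a * b is shorthand for a + b + a:b
fit_star  <- lm(mpg ~ hp * wt, data = mtcars)
fit_colon <- lm(mpg ~ hp + wt + hp:wt, data = mtcars)
same_fit  <- isTRUE(all.equal(coef(fit_star), coef(fit_colon)))

# a:b on its own adds the interaction without the main effects
fit_only <- lm(mpg ~ hp:wt, data = mtcars)

names(coef(fit_star))
names(coef(fit_only))
```

In the model above, `horsepower*cylinders` therefore adds nothing beyond the interaction, because both main effects are already included through `.`.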
# Question 9 part f
# Try a few different transformations of the variables, such as
# log(X), √X, X^2. Comment on your findings.
lmmodel_q9f = lm(mpg ~ log(weight)*log(displacement)*log(horsepower),
data = Auto[,1:8])
summary(lmmodel_q9f)
##
## Call:
## lm(formula = mpg ~ log(weight) * log(displacement) * log(horsepower),
## data = Auto[, 1:8])
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.3760 -2.0492 -0.3067 1.8102 16.2392
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -2513.045 1007.463 -2.494
## log(weight) 359.082 127.785 2.810
## log(displacement) 445.825 195.671 2.278
## log(horsepower) 554.944 225.352 2.463
## log(weight):log(displacement) -62.672 24.329 -2.576
## log(weight):log(horsepower) -78.126 28.579 -2.734
## log(displacement):log(horsepower) -93.910 42.413 -2.214
## log(weight):log(displacement):log(horsepower) 13.168 5.278 2.495
## Pr(>|t|)
## (Intercept) 0.01304 *
## log(weight) 0.00521 **
## log(displacement) 0.02325 *
## log(horsepower) 0.01423 *
## log(weight):log(displacement) 0.01037 *
## log(weight):log(horsepower) 0.00655 **
## log(displacement):log(horsepower) 0.02740 *
## log(weight):log(displacement):log(horsepower) 0.01302 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.85 on 384 degrees of freedom
## Multiple R-squared: 0.761, Adjusted R-squared: 0.7566
## F-statistic: 174.7 on 7 and 384 DF, p-value: < 2.2e-16
# After log-transforming weight, displacement, and horsepower, every term in
# the model is significant with a p-value less than 0.05, including all of
# the two-way interactions and the three-way interaction.
# However, this model uses only these three predictors, and its adjusted
# R-squared (0.7566) is lower than that of the full model in part (c)
# (0.8182), so significant terms alone do not make it the better fit.
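To compare several transformations side by side, one option is to fit each formula in a loop and collect the adjusted R-squared. The sketch below uses `mtcars` with `hp` as the single predictor, which is an assumption made to keep the example self-contained.

```r
# Candidate transformations of one predictor (mtcars as a built-in stand-in)
forms <- list(
  linear = mpg ~ hp,
  log    = mpg ~ log(hp),
  sqrt   = mpg ~ sqrt(hp),
  square = mpg ~ hp + I(hp^2)   # I() makes ^ mean arithmetic squaring in a formula
)

# Fit each model and keep its adjusted R-squared
adj_r2 <- sapply(forms, function(f) summary(lm(f, data = mtcars))$adj.r.squared)
round(adj_r2, 3)
```

Note the `I(hp^2)` wrapper: inside a formula a bare `hp^2` is interpreted as a formula operator, not as squaring.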
# Question 10 Part a
# Fit a multiple regression model to predict
# Sales using Price, Urban, and US.
fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9206 -1.6220 -0.0564 1.5786 7.0581
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.043469 0.651012 20.036 < 2e-16 ***
## Price -0.054459 0.005242 -10.389 < 2e-16 ***
## UrbanYes -0.021916 0.271650 -0.081 0.936
## USYes 1.200573 0.259042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
# Question 10 Part (b)
# Provide an interpretation of each coefficient in the model.
# Be careful—some of the variables in the model are qualitative!
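# The p-value of the F-statistic at the very bottom (< 2.2e-16) indicates
# that at least one coefficient is not equal to zero, which means that at
# least one variable is associated with Sales. 'Price' and 'US == Yes' are
# significant.
# Sales is recorded in thousands of units, so the coefficients are in
# thousands of car seats, not dollars.
# The intercept, 13.043469, is the predicted sales (about 13,043 units) for
# a non-urban store outside the US at a price of 0; it is only a baseline.
# The coefficient for 'Price' is -0.054459, which means that for every dollar
# increase in the price of the car seat, sales decrease by about 54 units
# on average, holding Urban and US fixed.
# The coefficient for 'US == Yes' is 1.200573, which means that on average
# US stores sell about 1,200 more car seats than stores outside the US,
# holding Price and Urban fixed.
# The coefficient for 'UrbanYes' is -0.021916 (about 22 fewer units for
# urban stores), but its p-value of 0.936 is not significant, so there is no
# evidence that an urban location affects sales.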
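# Part (c) Write out the model in equation form, being careful to handle the
# qualitative variables properly.
# Sales = 13.04 - 0.054 * Price - 0.022 * UrbanYes + 1.20 * USYes
# where UrbanYes = 1 if the store is in an urban location (0 otherwise) and
# USYes = 1 if the store is in the US (0 otherwise).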
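The equation can be turned into a small function to see how the dummy variables switch on and off. This is only a sketch: `predicted_sales` is a hypothetical helper, and the coefficients are copied by hand from the `summary(fit)` output above rather than read from the fitted object.

```r
# Coefficients copied from the summary output for Sales ~ Price + Urban + US
b <- c(intercept = 13.043469, price = -0.054459,
       urban_yes = -0.021916, us_yes = 1.200573)

# Hypothetical helper: predicted Sales (in thousands of units) for one store.
# The qualitative predictors enter as 0/1 indicators.
predicted_sales <- function(price, urban = c("No", "Yes"), us = c("No", "Yes")) {
  urban <- match.arg(urban)
  us    <- match.arg(us)
  unname(b["intercept"] + b["price"] * price +
           b["urban_yes"] * (urban == "Yes") + b["us_yes"] * (us == "Yes"))
}

base_store <- predicted_sales(100, urban = "No", us = "No")
us_store   <- predicted_sales(100, urban = "No", us = "Yes")
(us_store - base_store) * 1000   # US effect expressed in units
```

The difference between the two stores is exactly the `USYes` coefficient, which is what "holding the other predictors fixed" means in practice.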
# Part (d) For which of the predictors can you reject the null hypothesis?
# We can reject the null hypothesis that the coefficient is zero for 'Price'
# and 'US == Yes' (both p-values are far below 0.05), but not for 'UrbanYes'
# (p = 0.936).
# Part (e) On the basis of your response to the previous question, fit a
# smaller model that only uses the predictors for which there is
# evidence of association with the outcome.
fit = lm(Sales ~ Price + US, data = Carseats)
summary(fit)
##
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.9269 -1.6286 -0.0574 1.5766 7.0515
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.03079 0.63098 20.652 < 2e-16 ***
## Price -0.05448 0.00523 -10.416 < 2e-16 ***
## USYes 1.19964 0.25846 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
# Part (f) How well do the models in (a) and (e) fit the data?
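# Neither model fits the data well: the adjusted R-squared is 0.2335 for
# part (a) and 0.2354 for part (e), and the R-squared is 0.2393 in both, so
# each model explains only about 24% of the variance in Sales. Dropping Urban
# leaves the fit essentially unchanged (residual standard error 2.472 vs.
# 2.469).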
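Besides comparing adjusted R-squared, two nested models can be compared with a partial F-test via `anova()`. The sketch below uses `mtcars` as a stand-in, with `drat` playing the role of the dropped predictor (an assumption for illustration only).

```r
# Nested models on a built-in stand-in data set
fit_big   <- lm(mpg ~ wt + hp + drat, data = mtcars)
fit_small <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: does the larger model fit significantly better?
cmp    <- anova(fit_small, fit_big)
p_drop <- cmp[["Pr(>F)"]][2]

c(adj_r2_big   = summary(fit_big)$adj.r.squared,
  adj_r2_small = summary(fit_small)$adj.r.squared,
  p_drop       = p_drop)
```

When a single term is dropped, this p-value equals the t-test p-value of that term in the larger model, so a large value supports using the smaller model.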
# Part(g) Using the model from (e), obtain 95 % confidence intervals for
# the coefficient(s).
confint(fit)
## 2.5 % 97.5 %
## (Intercept) 11.79032020 14.27126531
## Price -0.06475984 -0.04419543
## USYes 0.69151957 1.70776632
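The intervals from `confint()` follow the usual form: estimate plus or minus a t quantile times the standard error. A short check on the built-in `mtcars` data (a stand-in, so the numbers differ from the Carseats output above):

```r
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)

est    <- coef(fit_demo)
se     <- coef(summary(fit_demo))[, "Std. Error"]
t_crit <- qt(0.975, df = df.residual(fit_demo))   # two-sided 95% critical value

# Interval by hand: estimate +/- t * standard error
manual_ci  <- cbind(lower = est - t_crit * se, upper = est + t_crit * se)
builtin_ci <- confint(fit_demo, level = 0.95)
```

An interval that excludes zero corresponds to a coefficient that is significant at the 5% level, which is the case for all three rows in the Carseats output.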
# Part (h) Is there evidence of outliers or high leverage observations in the
# model from (e)?
par(mfrow = c(2,2))
plot(fit)
summary(influence.measures(fit))
## Potentially influential observations of
## lm(formula = Sales ~ Price + US, data = Carseats) :
##
## dfb.1_ dfb.Pric dfb.USYs dffit cov.r cook.d hat
## 26 0.24 -0.18 -0.17 0.28_* 0.97_* 0.03 0.01
## 29 -0.10 0.10 -0.10 -0.18 0.97_* 0.01 0.01
## 43 -0.11 0.10 0.03 -0.11 1.05_* 0.00 0.04_*
## 50 -0.10 0.17 -0.17 0.26_* 0.98 0.02 0.01
## 51 -0.05 0.05 -0.11 -0.18 0.95_* 0.01 0.00
## 58 -0.05 -0.02 0.16 -0.20 0.97_* 0.01 0.01
## 69 -0.09 0.10 0.09 0.19 0.96_* 0.01 0.01
## 126 -0.07 0.06 0.03 -0.07 1.03_* 0.00 0.03_*
## 160 0.00 0.00 0.00 0.01 1.02_* 0.00 0.02
## 166 0.21 -0.23 -0.04 -0.24 1.02 0.02 0.03_*
## 172 0.06 -0.07 0.02 0.08 1.03_* 0.00 0.02
## 175 0.14 -0.19 0.09 -0.21 1.03_* 0.02 0.03_*
## 210 -0.14 0.15 -0.10 -0.22 0.97_* 0.02 0.01
## 270 -0.03 0.05 -0.03 0.06 1.03_* 0.00 0.02
## 298 -0.06 0.06 -0.09 -0.15 0.97_* 0.01 0.00
## 314 -0.05 0.04 0.02 -0.05 1.03_* 0.00 0.02_*
## 353 -0.02 0.03 0.09 0.15 0.97_* 0.01 0.00
## 357 0.02 -0.02 0.02 -0.03 1.03_* 0.00 0.02
## 368 0.26 -0.23 -0.11 0.27_* 1.01 0.02 0.02_*
## 377 0.14 -0.15 0.12 0.24 0.95_* 0.02 0.01
## 384 0.00 0.00 0.00 0.00 1.02_* 0.00 0.02
## 387 -0.03 0.04 -0.03 0.05 1.02_* 0.00 0.02
## 396 -0.05 0.05 0.08 0.14 0.98_* 0.01 0.00
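# Yes, there is some evidence of both. The diagnostic plots label a few
# observations as potential outliers, and influence.measures() flags 23
# observations on at least one measure of influence.
# Observations 43, 126, 166, 175, 314, and 368 are flagged for high leverage
# (the hat column), and observations 26, 50, and 368 are flagged on dffit.
# However, every Cook's distance in the table is at most 0.03, so no single
# observation appears to dominate the fit.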
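The flags shown in the table can also be pulled out of the `influence.measures()` result directly instead of being read by eye. A minimal sketch on `mtcars` (a stand-in; the flagged rows will differ from the Carseats table):

```r
fit_demo <- lm(mpg ~ wt + hp, data = mtcars)
im <- influence.measures(fit_demo)

# is.inf is a logical matrix: one row per observation, one column per measure
flagged_any <- which(apply(im$is.inf, 1, any))   # flagged on at least one measure
flagged_hat <- which(im$is.inf[, "hat"])         # flagged for leverage only

# Cook's distance is also available on its own
cooks <- cooks.distance(fit_demo)
```

Separating the leverage flags from the rest helps answer the two halves of the question (outliers versus high leverage) individually.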
# Question 12 This problem involves simple linear regression without
# an intercept.
# Part (a) Recall that the coefficient estimate for the linear regression of
# Y onto X without an intercept is given by (3.38). Under what
# circumstance is the coefficient estimate for the regression of X
# onto Y the same as the coefficient estimate for the regression of
# Y onto X?
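# The estimate for Y onto X is ∑_j x_j y_j / ∑_j x_j^2, and the estimate for
# X onto Y is ∑_j x_j y_j / ∑_j y_j^2. They share the same numerator, so the
# coefficients are the same if ∑_j x_j^2 = ∑_j y_j^2 (or, trivially, if
# ∑_j x_j y_j = 0). Essentially, X and Y must have the same sum of squares;
# this coincides with equal variances when X and Y also have the same mean.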
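The closed-form estimates for regression through the origin can be checked against `lm()`. The data below are simulated purely for illustration (the seed and the slope of 3 are arbitrary assumptions).

```r
set.seed(42)
x <- rnorm(100)
y <- 3 * x + rnorm(100)

# Closed-form estimates for regression without an intercept
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

# The same estimates from lm()
lm_y_on_x <- unname(coef(lm(y ~ x + 0)))
lm_x_on_y <- unname(coef(lm(x ~ y + 0)))

# Ratio of the two estimates equals the ratio of the sums of squares
beta_y_on_x / beta_x_on_y
sum(y^2) / sum(x^2)
```

Because the two estimates share a numerator, their ratio depends only on the two sums of squares, which is why equal sums of squares give equal coefficients.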
# Question 12 part (b). Generate an example in R with n = 100 observations
# in which the coefficient estimate for the regression of X onto Y is different
# from the coefficient estimate for the regression of Y onto X.
set.seed(1)
x = 1:100
sum(x^2)
## [1] 338350
y = 2 * x + rnorm(100, sd = 0.1)
sum(y^2)
## [1] 1353606
fit_Y = lm(y ~ x + 0)
fit_X = lm(x ~ y + 0)
summary(fit_Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.223590 -0.062560 0.004426 0.058507 0.230926
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 2.0001514 0.0001548 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
summary(fit_X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.115418 -0.029231 -0.002186 0.031322 0.111795
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 5.00e-01 3.87e-05 12920 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared: 1, Adjusted R-squared: 1
## F-statistic: 1.669e+08 on 1 and 99 DF, p-value: < 2.2e-16
# Question 12 part (c). Generate an example in R with n = 100 observations
# in which the coefficient estimate for the regression of X onto Y is the
# same as the coefficient estimate for the regression of Y onto X.
x = 1:100
sum(x^2)
## [1] 338350
y = 100:1
sum(y^2)
## [1] 338350
fit_Y = lm(y ~ x + 0)
fit_X = lm(x ~ y + 0)
summary(fit_Y)
##
## Call:
## lm(formula = y ~ x + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## x 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
summary(fit_X)
##
## Call:
## lm(formula = x ~ y + 0)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.75 -12.44 24.87 62.18 99.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## y 0.5075 0.0866 5.86 6.09e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared: 0.2575, Adjusted R-squared: 0.25
## F-statistic: 34.34 on 1 and 99 DF, p-value: 6.094e-08
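Reversing the sequence is one way to get equal sums of squares; any reordering of the same values works. As a further sketch (with simulated data and an arbitrary seed, both assumptions), `y` can be a random permutation of `x`:

```r
set.seed(1)
x <- rnorm(100)
y <- sample(x)   # same values in a different order, so sum(y^2) equals sum(x^2)

b_yx <- unname(coef(lm(y ~ x + 0)))   # Y onto X, no intercept
b_xy <- unname(coef(lm(x ~ y + 0)))   # X onto Y, no intercept

c(sum_x2 = sum(x^2), sum_y2 = sum(y^2), b_yx = b_yx, b_xy = b_xy)
```

The two coefficients agree even though `x` and `y` are not otherwise related, which shows that the condition concerns only the sums of squares.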