Assignment2

library(ISLR2)

## Warning: package 'ISLR2' was built under R version 4.3.3

attach(Carseats)
head(Carseats)

##   Sales CompPrice Income Advertising Population Price ShelveLoc Age Education
## 1  9.50       138     73          11        276   120       Bad  42        17
## 2 11.22       111     48          16        260    83      Good  65        10
## 3 10.06       113     35          10        269    80    Medium  59        12
## 4  7.40       117    100           4        466    97    Medium  55        14
## 5  4.15       141     64           3        340   128       Bad  38        13
## 6 10.81       124    113          13        501    72       Bad  78        16
##   Urban  US
## 1   Yes Yes
## 2   Yes Yes
## 3   Yes Yes
## 4   Yes Yes
## 5   Yes  No
## 6    No Yes

summary(Carseats)

##      Sales          CompPrice       Income        Advertising    
##  Min.   : 0.000   Min.   : 77   Min.   : 21.00   Min.   : 0.000  
##  1st Qu.: 5.390   1st Qu.:115   1st Qu.: 42.75   1st Qu.: 0.000  
##  Median : 7.490   Median :125   Median : 69.00   Median : 5.000  
##  Mean   : 7.496   Mean   :125   Mean   : 68.66   Mean   : 6.635  
##  3rd Qu.: 9.320   3rd Qu.:135   3rd Qu.: 91.00   3rd Qu.:12.000  
##  Max.   :16.270   Max.   :175   Max.   :120.00   Max.   :29.000  
##    Population        Price        ShelveLoc        Age          Education   
##  Min.   : 10.0   Min.   : 24.0   Bad   : 96   Min.   :25.00   Min.   :10.0  
##  1st Qu.:139.0   1st Qu.:100.0   Good  : 85   1st Qu.:39.75   1st Qu.:12.0  
##  Median :272.0   Median :117.0   Medium:219   Median :54.50   Median :14.0  
##  Mean   :264.8   Mean   :115.8                Mean   :53.32   Mean   :13.9  
##  3rd Qu.:398.5   3rd Qu.:131.0                3rd Qu.:66.00   3rd Qu.:16.0  
##  Max.   :509.0   Max.   :191.0                Max.   :80.00   Max.   :18.0  
##  Urban       US     
##  No :118   No :142  
##  Yes:282   Yes:258  
##                     
##                     
##                     
##

# Question 2 
# Carefully explain the differences between the KNN classifier and KNN
# regression methods.

# The KNN classifier and KNN regression differ in the type of response they
# predict and in how they combine the K nearest neighbors. The KNN classifier
# is used when the response variable is categorical (e.g., Yes/No), while KNN
# regression is used when the response is quantitative (continuous).
# For a test observation x0, both methods first find the K training
# observations closest to x0. The KNN classifier then estimates the
# probability of each class as the fraction of those K neighbors that belong
# to it and assigns x0 to the most common class. KNN regression instead
# predicts Y as the average of the K neighbors' response values.
# In a practical setting, the KNN classifier suits a classification task such
# as determining whether a customer will purchase a book (Yes/No), while KNN
# regression suits predicting the price of a book.
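
To make the contrast concrete, here is a minimal base R sketch of both methods on a made-up data set. The training values, the helper names (`nearest_k`, `knn_regress`, `knn_classify`) and the choice K = 3 are all hypothetical and are not part of the assignment.

```r
# Hypothetical toy data: one predictor with a numeric and a categorical response
x_train   <- c(1, 2, 3, 6, 7, 8)
y_numeric <- c(1.0, 1.5, 2.0, 6.0, 6.5, 7.0)
y_class   <- c("No", "No", "No", "Yes", "Yes", "Yes")

# Indices of the k training points closest to x0
nearest_k <- function(x0, k) order(abs(x_train - x0))[1:k]

# KNN regression: average the neighbors' numeric responses
knn_regress <- function(x0, k = 3) mean(y_numeric[nearest_k(x0, k)])

# KNN classification: majority vote among the neighbors' class labels
knn_classify <- function(x0, k = 3) {
  votes <- table(y_class[nearest_k(x0, k)])
  names(votes)[which.max(votes)]
}

knn_regress(2.2)   # 1.5, the mean of the responses at x = 2, 3 and 1
knn_classify(2.2)  # "No", the class of all three neighbors
```

The neighbor search is identical in both functions; only the final step (average versus vote) changes.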

# Question 9 This question involves the use of multiple linear regression on the
# Auto data set. (a) Produce a scatterplot matrix which includes all of
# the variables in the data set.

pairs(Auto)

# Question 9 part b 
# Compute the matrix of correlations between the variables using
# the function cor(). You will need to exclude the name variable, which is
# qualitative.

cor(Auto[, names(Auto) != "name"])

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

# Question 9 part c
# Use the lm() function to perform a multiple linear regression
# with mpg as the response and all other variables except name as
# the predictors. Use the summary() function to print the results.
# Comment on the output.

lmmodel_q9 = lm(mpg ~ . - name, data = Auto)
summary(lmmodel_q9)

## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

# Question 9 part c
# Is there a relationship between the predictors and the response?
# Yes. The p-value associated with the F-statistic (< 2.2e-16) is far below
# 0.05, so we reject the null hypothesis that all coefficients are zero: at
# least one predictor has a relationship with the response. Some predictors
# are individually significant while others are not.
# The R-squared of 0.8215 indicates that about 82% of the variance in the
# response variable mpg is explained by the predictors in the model. The
# adjusted R-squared (0.8182), which penalizes the number of predictors,
# is nearly the same.
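
As a quick check on how the two R-squared figures relate, the sketch below recomputes the adjusted value from the rounded numbers printed in the summary. The variable names are illustrative; n = 392 is inferred from the 384 residual degrees of freedom plus 8 estimated coefficients.

```r
# Values copied from the summary() output above
r2 <- 0.8215   # Multiple R-squared
n  <- 392      # observations: 384 residual df + 8 estimated coefficients
p  <- 7        # predictors, excluding the intercept

# Adjusted R-squared penalizes R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)   # 0.8182, matching the reported adjusted R-squared
```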

# Which predictors appear to have a statistically significant
# relationship to the response?
# The predictors that have a statistically significant relationship to the 
# response (p < 0.05) are displacement, weight, year, and origin. Cylinders,
# horsepower, and acceleration are not significant in this model.

# What does the coefficient for the year variable suggest?
# Holding the other predictors fixed, each additional model year is
# associated with an increase of about 0.75 mpg, so newer cars tend to be
# more fuel efficient.

# Question 9 part d
# Use the plot() function to produce diagnostic plots of the linear
# regression fit. Comment on any problems you see with the fit.
# Do the residual plots suggest any unusually large outliers? Does
# the leverage plot identify any observations with unusually high
# leverage?

par(mfrow = c(2, 2))
plot(lmmodel_q9)

# Question 9 part d
# In the residuals vs. fitted plot there is a slight curved pattern (check
# mark shape). A pattern in this plot points to non-linearity that the model
# does not capture; non-constant variance (heteroscedasticity) would instead
# show up as a funnel shape in the spread of the residuals.
# In the Q-Q plot the majority of the points lie on the line, which indicates
# approximate normality; however, a few points at the tail end deviate from
# this pattern.
# In the residuals vs. leverage plot, point 14 is quite far away from the
# other points, so it could be a high-leverage observation and possibly an
# outlier.
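
The visual impression from the plots can be backed up numerically. The following is a rough sketch that refits the part (c) model and flags observations by leverage and studentized residual. The cutoffs (three times the average leverage, and an absolute studentized residual above 3) are common rules of thumb assumed here, not requirements of the assignment, and `fit_auto` is an illustrative name.

```r
library(ISLR2)  # provides the Auto data set

# Same model as part (c)
fit_auto <- lm(mpg ~ . - name, data = Auto)

# Leverage (hat values) and studentized residuals for every observation
lev   <- hatvalues(fit_auto)
rstud <- rstudent(fit_auto)

# Assumed rule-of-thumb cutoffs:
# leverage above 3 * (number of coefficients) / n, |studentized residual| above 3
n_coef     <- length(coef(fit_auto))
lev_cutoff <- 3 * n_coef / nrow(Auto)

high_leverage <- which(lev > lev_cutoff)
large_resid   <- which(abs(rstud) > 3)

high_leverage  # row names of potential high-leverage observations
large_resid    # row names of potential outliers
```

The names printed by `which()` are the row names of `Auto`, which are the same labels the diagnostic plots use.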

# Question 9 part e
# Use the * and : symbols to fit linear regression models with
# interaction effects. Do any interactions appear to be statistically
# significant?

lmmodel_q9e = lm(mpg ~ . - name + displacement:weight + horsepower * cylinders,
                 data = Auto)
summary(lmmodel_q9e)

## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight + horsepower * 
##     cylinders, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5149 -1.5579 -0.0921  1.4015 12.0051 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           7.109e+00  5.069e+00   1.402 0.161597    
## cylinders            -2.656e+00  6.921e-01  -3.838 0.000145 ***
## displacement         -3.630e-02  1.301e-02  -2.790 0.005529 ** 
## horsepower           -2.170e-01  4.354e-02  -4.985 9.41e-07 ***
## weight               -6.819e-03  1.113e-03  -6.125 2.26e-09 ***
## acceleration         -8.749e-02  9.290e-02  -0.942 0.346923    
## year                  7.600e-01  4.484e-02  16.949  < 2e-16 ***
## origin                6.728e-01  2.574e-01   2.614 0.009314 ** 
## displacement:weight   1.092e-05  3.464e-06   3.153 0.001744 ** 
## cylinders:horsepower  2.590e-02  5.880e-03   4.405 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.895 on 382 degrees of freedom
## Multiple R-squared:  0.8656, Adjusted R-squared:  0.8624 
## F-statistic: 273.3 on 9 and 382 DF,  p-value: < 2.2e-16

# Based on the results, the displacement:weight (p = 0.0017) and
# cylinders:horsepower (p = 1.37e-05) interactions are statistically
# significant because their p-values are less than 0.05.

# Question 9 part f
# Try a few different transformations of the variables, such as
# log(X), √X, X^2. Comment on your findings.

lmmodel_q9f = lm(mpg ~ log(weight)*log(displacement)*log(horsepower),
                 data = Auto[,1:8])
summary(lmmodel_q9f)

## 
## Call:
## lm(formula = mpg ~ log(weight) * log(displacement) * log(horsepower), 
##     data = Auto[, 1:8])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.3760  -2.0492  -0.3067   1.8102  16.2392 
## 
## Coefficients:
##                                                Estimate Std. Error t value
## (Intercept)                                   -2513.045   1007.463  -2.494
## log(weight)                                     359.082    127.785   2.810
## log(displacement)                               445.825    195.671   2.278
## log(horsepower)                                 554.944    225.352   2.463
## log(weight):log(displacement)                   -62.672     24.329  -2.576
## log(weight):log(horsepower)                     -78.126     28.579  -2.734
## log(displacement):log(horsepower)               -93.910     42.413  -2.214
## log(weight):log(displacement):log(horsepower)    13.168      5.278   2.495
##                                               Pr(>|t|)   
## (Intercept)                                    0.01304 * 
## log(weight)                                    0.00521 **
## log(displacement)                              0.02325 * 
## log(horsepower)                                0.01423 * 
## log(weight):log(displacement)                  0.01037 * 
## log(weight):log(horsepower)                    0.00655 **
## log(displacement):log(horsepower)              0.02740 * 
## log(weight):log(displacement):log(horsepower)  0.01302 * 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.85 on 384 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7566 
## F-statistic: 174.7 on 7 and 384 DF,  p-value: < 2.2e-16

# Question 9 part f
# After a log transformation of weight, displacement and horsepower, all of
# the main effects and all of their interactions, including the three-way
# interaction, are significant with p-values less than 0.05. However, this
# model uses only three of the predictors, and its adjusted R-squared
# (0.7566) is lower than that of the full model in part c (0.8182), so
# having every term significant does not by itself mean a better fit.
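
The question also mentions square-root and squared terms. A minimal sketch of how those could be tried is shown below, using horsepower as the single predictor. That choice and the model names are illustrative assumptions, and no particular result is claimed.

```r
library(ISLR2)  # provides the Auto data set

# Three single-predictor fits using horsepower, one per transformation
fit_log  <- lm(mpg ~ log(horsepower), data = Auto)
fit_sqrt <- lm(mpg ~ sqrt(horsepower), data = Auto)
fit_sq   <- lm(mpg ~ horsepower + I(horsepower^2), data = Auto)

# Adjusted R-squared gives a quick side-by-side comparison of the fits
sapply(
  list(log = fit_log, sqrt = fit_sqrt, squared = fit_sq),
  function(m) summary(m)$adj.r.squared
)
```

`I()` is needed around `horsepower^2` because `^` has a special meaning inside a model formula.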

# Question 10 Part a
# Fit a multiple regression model to predict 
# Sales using Price, Urban, and US.

fit = lm(Sales ~ Price + Urban + US, data = Carseats)
summary(fit)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

# Question 10 Part (b) 
# Provide an interpretation of each coefficient in the model.
# Be careful—some of the variables in the model are qualitative!

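# The p-value of the F-statistic at the very bottom (< 2.2e-16) indicates
# that at least one coefficient is not equal to zero. This means that at
# least one variable is associated with Sales.

# 'Price' and 'US == Yes' are significant.

# Sales is measured in thousands of units. The coefficient for 'Price' is
# -0.054459, which means that for every dollar increase in the price of my
# carseat, my store's sales decrease by about 54 units on average, holding
# Urban and US fixed.

# The coefficient for 'US == Yes' is 1.200573, which means that on average
# US stores sell about 1,200 more carseats than stores outside the US with
# the same price and urban status.

# The coefficient for 'UrbanYes' is -0.021916, i.e. urban stores sell about
# 22 fewer carseats than non-urban stores, but it is not significant
# (p = 0.936), so there is no evidence that an urban location affects sales.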
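
Because `Sales` in `Carseats` is recorded in thousands of units (per the ISLR2 data documentation), the coefficients translate into unit counts as sketched below. The variable names are illustrative.

```r
# Coefficients copied from the summary() output above
coef_price <- -0.054459
coef_us    <-  1.200573

# Sales is in thousands of units, so multiply by 1000 to get units
units_per_dollar <- coef_price * 1000
units_us_vs_non  <- coef_us * 1000

units_per_dollar  # about -54 units per $1 increase in Price
units_us_vs_non   # about 1,201 more units for US stores
```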

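# Part (c) Write out the model in equation form, being careful to handle the
# qualitative variables properly.

# Sales = 13.04 - 0.054 * Price - 0.022 * Urban + 1.20 * US,
# where Urban = 1 if the store is in an urban location (0 otherwise) and
# US = 1 if the store is in the US (0 otherwise).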

#Intercept   13.043469   0.651012  20.036  < 2e-16 ***
#Price       -0.054459   0.005242 -10.389  < 2e-16 ***
#UrbanYes    -0.021916   0.271650  -0.081    0.936    
#USYes        1.200573   0.259042   4.635 4.86e-06 ***
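
The equation can also be written as a small function, which makes the handling of the two Yes/No variables explicit. This is only a sketch: `predict_sales` is a hypothetical helper, and the example store (price 100, urban, in the US) is made up.

```r
# Fitted equation from part (a), with the qualitative predictors coded as
# 0/1 indicators (1 when the answer is "Yes")
predict_sales <- function(price, urban, us) {
  13.043469 -
    0.054459 * price -
    0.021916 * (urban == "Yes") +
    1.200573 * (us == "Yes")
}

# Hypothetical store: price 100, urban location, inside the US
predict_sales(price = 100, urban = "Yes", us = "Yes")  # about 8.78 (thousand units)
```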

# Part (d) For which of the predictors can you reject the null hypothesis
# H0: beta_j = 0?

# 'Price' and 'US = Yes' are significant (p < 0.05), so we can reject the
# null hypothesis for them. We cannot reject it for 'Urban' (p = 0.936).

# Part (e) On the basis of your response to the previous question, fit a
# smaller model that only uses the predictors for which there is
# evidence of association with the outcome.

fit = lm(Sales ~ Price  + US, data = Carseats)
summary(fit)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

# Part (f) How well do the models in (a) and (e) fit the data?

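# Neither model fits the data well: the adjusted R-squared is 0.2335 for
# part (a) and 0.2354 for part (e), and the R-squared is 0.2393 for both, so
# each model explains only about 24% of the variance in Sales. The model in
# (e) is marginally preferable because it reaches the same fit with one
# fewer predictor (residual standard error 2.469 vs. 2.472).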

# Part(g) Using the model from (e), obtain 95 % confidence intervals for
# the coefficient(s).

confint(fit)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632
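
For reference, these intervals follow the usual form estimate ± t-quantile × standard error. The sketch below reproduces the `Price` interval from the rounded values printed in the part (e) summary, so it agrees with `confint()` only up to rounding.

```r
# Values copied from the part (e) summary() output
est <- -0.05448   # Price coefficient estimate
se  <-  0.00523   # its standard error
df  <- 397        # residual degrees of freedom

# 95% interval: estimate plus/minus the 97.5% t quantile times the SE
t_crit <- qt(0.975, df)
ci <- est + c(-1, 1) * t_crit * se
round(ci, 4)   # -0.0648 -0.0442
```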

# Part (h) Is there evidence of outliers or high leverage observations in the
# model from (e)?

par(mfrow = c(2,2))
plot(fit)

summary(influence.measures(fit))

## Potentially influential observations of
##   lm(formula = Sales ~ Price + US, data = Carseats) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00

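# Part (h)
# Outliers: the residuals range from -6.93 to 7.05 with a residual standard
# error of 2.469, so the largest residuals are a little under 3 standard
# errors from zero. There is no strong evidence of extreme outliers.
# High leverage: influence.measures() stars the hat value of six
# observations (43, 126, 166, 175, 314 and 368), which marks them as
# potential high-leverage points. Their Cook's distances are small (at most
# 0.03 in the table), so none of them appears to have a large influence on
# the fitted model.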
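
The flagged observations can be cross-checked directly. The sketch below refits the part (e) model and applies a leverage cutoff of 3k/n, where k is the number of coefficients. This is assumed to be the rule behind the starred `hat` column and is worth confirming against the `influence.measures` help page. `fit_small` is an illustrative name.

```r
library(ISLR2)  # provides the Carseats data set

# Same model as part (e)
fit_small <- lm(Sales ~ Price + US, data = Carseats)

k <- length(coef(fit_small))  # 3 estimated coefficients
n <- nrow(Carseats)           # 400 stores

# Assumed leverage cutoff: 3 * k / n
hat_cutoff <- 3 * k / n
which(hatvalues(fit_small) > hat_cutoff)

# Range of studentized residuals and the largest Cook's distance
range(rstudent(fit_small))
max(cooks.distance(fit_small))
```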

# Question 12 This problem involves simple linear regression without 
# an intercept.

# Part (a). Recall that the coefficient estimate for the linear regression of
# Y onto X without an intercept is given by (3.38). Under what
# circumstance is the coefficient estimate for the regression of X
# onto Y the same as the coefficient estimate for the regression of
# Y onto X?

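# Both estimates share the numerator sum_j x_j * y_j; the regression of Y
# onto X divides it by sum_j x_j^2 and the regression of X onto Y divides it
# by sum_j y_j^2. The coefficients are therefore the same if
# sum_j x_j^2 = sum_j y_j^2 (or, trivially, if sum_j x_j * y_j = 0).
# These sums are not centered, so this matches "X and Y have the same
# variance" only when X and Y also have the same mean.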
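
Written out, the two no-intercept estimates being compared are as follows. The hat-beta subscripts are notation introduced here for clarity and are not taken from the textbook.

```latex
\hat{\beta}_{Y \text{ onto } X} = \frac{\sum_{j} x_j y_j}{\sum_{j} x_j^2},
\qquad
\hat{\beta}_{X \text{ onto } Y} = \frac{\sum_{j} x_j y_j}{\sum_{j} y_j^2}
```

Both fractions share the same numerator, so they are equal when the denominators match or when the numerator is zero.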

# Question 12 part (b). Generate an example in R with n = 100 observations
# in which the coefficient estimate for the regression of X onto Y is different
# from the coefficient estimate for the regression of Y onto X.


set.seed(1)
x = 1:100
sum(x^2)

## [1] 338350

# Question 12 part (b)

y = 2 * x + rnorm(100, sd = 0.1)
sum(y^2)

## [1] 1353606

# Question 12 part (b)

fit_Y = lm(y ~ x + 0)
fit_X = lm(x ~ y + 0)

summary(fit_Y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.223590 -0.062560  0.004426  0.058507  0.230926 
## 
## Coefficients:
##    Estimate Std. Error t value Pr(>|t|)    
## x 2.0001514  0.0001548   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.09005 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16

# Question 12 part (b)

summary(fit_X)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.115418 -0.029231 -0.002186  0.031322  0.111795 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y 5.00e-01   3.87e-05   12920   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04502 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 1.669e+08 on 1 and 99 DF,  p-value: < 2.2e-16

# Question 12 part (c). Generate an example in R with n = 100 observations
# in which the coefficient estimate for the regression of X onto Y is the
# same as the coefficient estimate for the regression of Y onto X.

x = 1:100
sum(x^2)

## [1] 338350

y = 100:1
sum(y^2)

## [1] 338350

# Question 12 part (c).

fit_Y = lm(y ~ x + 0)
fit_X = lm(x ~ y + 0)
summary(fit_Y)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08

# Question 12 part (c).

summary(fit_X)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -49.75 -12.44  24.87  62.18  99.49 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y   0.5075     0.0866    5.86 6.09e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 50.37 on 99 degrees of freedom
## Multiple R-squared:  0.2575, Adjusted R-squared:   0.25 
## F-statistic: 34.34 on 1 and 99 DF,  p-value: 6.094e-08
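
The same result can be checked without `lm()` by applying the no-intercept slope formula directly. This short sketch reuses the part (c) data; the variable names are illustrative.

```r
x <- 1:100
y <- 100:1

# No-intercept least squares slopes: shared numerator, different denominators
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

round(c(beta_y_on_x, beta_x_on_y), 4)  # 0.5075 0.5075
all.equal(beta_y_on_x, beta_x_on_y)    # TRUE
```

Any reordering of `x` would work equally well for `y`, since a permutation leaves the sum of squares unchanged.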

Daniel O.

2025-02-24