Question 2

Carefully explain the differences between the KNN classifier and KNN regression methods.

Answer

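The main difference is the type of response each method handles. The KNN classifier solves classification problems, where the response is qualitative: for a test point x0 it finds the K nearest training observations, estimates the probability of each class as the fraction of those neighbours belonging to it, and assigns x0 to the most probable class. KNN regression solves regression problems, where the response is quantitative: it finds the same K nearest neighbours but predicts the response at x0 as the average of their responses.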
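
The contrast can be sketched in a few lines of base R. This is only an illustration on made-up one-dimensional data; the helper names (nearest, knn_classify, knn_regress) are hypothetical and not from any package.

```r
# Toy training data, invented for illustration only.
train_x     <- c(1, 2, 3, 6, 7, 8)
train_class <- c("low", "low", "low", "high", "high", "high")  # qualitative response
train_value <- c(1.1, 1.9, 3.2, 6.1, 6.8, 8.3)                  # quantitative response

# Indices of the k training points closest to x0.
nearest <- function(x0, k) order(abs(train_x - x0))[seq_len(k)]

# KNN classifier: majority vote among the k nearest neighbours.
knn_classify <- function(x0, k = 3) {
  votes <- table(train_class[nearest(x0, k)])
  names(votes)[which.max(votes)]
}

# KNN regression: average response of the k nearest neighbours.
knn_regress <- function(x0, k = 3) {
  mean(train_value[nearest(x0, k)])
}

knn_classify(2.5)  # returns a class label
knn_regress(2.5)   # returns a number
```

Both functions share the same neighbour search; only the final step (vote versus average) differs.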

Question 9

Auto <- read.table("C:\\Users\\Winni\\Downloads\\Auto.data", header = TRUE, na.strings = "?",
                   stringsAsFactors = TRUE)
Auto <- na.omit(Auto)  # drop the rows with missing horsepower

Produce a scatterplot matrix which includes all of the variables in the data set.

pairs(~mpg+cylinders+displacement+horsepower+weight+acceleration +year +origin + name,data=Auto)

Compute the matrix of correlations between the variables using the function cor(). You will need to exclude the name variable, which is qualitative.

library(dplyr)  # provides select()
Auto_new <- Auto |> select(mpg:origin)  # every column except name
cor(Auto_new)

##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

Use the lm() function to perform a multiple linear regression with mpg as the response and all other variables except name as the predictors. Use the summary() function to print the results.

auto.model <- lm(mpg~., data=Auto_new)
summary(auto.model)

## 
## Call:
## lm(formula = mpg ~ ., data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

Comment on the output. For instance: i. Is there a relationship between the predictors and the response?

Answer

The F-statistic is 252.4 on 7 and 384 degrees of freedom, with a p-value below 2.2e-16. We therefore reject the null hypothesis that all coefficients are zero: there is a relationship between the predictors and the response.
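
As a rough check, the p-value can be recomputed from the numbers printed in the summary. The values below are copied from the output above; nothing is refitted.

```r
# Values copied from the summary() output of auto.model.
f_stat <- 252.4
df_num <- 7     # number of predictors
df_den <- 384   # residual degrees of freedom: 392 - 7 - 1

# Upper-tail probability of the F distribution under H0 (all slopes zero).
p_value <- pf(f_stat, df_num, df_den, lower.tail = FALSE)

# 5% critical value for comparison.
f_crit <- qf(0.95, df_num, df_den)

p_value < 2.2e-16
f_stat > f_crit
```

The relevant comparison is against the F critical value (or the p-value), not against 1.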

Which predictors appear to have a statistically significant relationship to the response?

Answer

The displacement, weight, year, and origin variables have a statistically significant relationship with mpg: each has a p-value below 0.05 (equivalently, an absolute t-value above roughly 2). Cylinders, horsepower, and acceleration are not significant at that level.
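
A short sketch of where the cutoff comes from, using the t-values copied from the coefficient table above. With 384 residual degrees of freedom the two-sided 5% critical value is close to 1.97, so an absolute t-value only slightly above 1 (cylinders, horsepower) is not enough.

```r
df_resid <- 384
t_crit <- qt(0.975, df_resid)  # two-sided 5% cutoff

# t-values copied from the summary() output of auto.model.
t_values <- c(cylinders = -1.526, displacement = 2.647, horsepower = -1.230,
              weight = -9.929, acceleration = 0.815, year = 14.729, origin = 5.127)

# Two-sided p-values; these should roughly match the Pr(>|t|) column.
p_values <- 2 * pt(-abs(t_values), df_resid)

significant <- names(t_values)[abs(t_values) > t_crit]
significant
```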

What does the coefficient for the year variable suggest?

Answer

The coefficient for year suggests that, holding all other predictors fixed, each additional model year is associated with an average increase of about 0.75 miles per gallon.

Use the plot() function to produce diagnostic plots of the linear regression fit. Comment on any problems you see with the fit. Do the residual plots suggest any unusually large outliers? Does the leverage plot identify any observations with unusually high leverage?

Answer

par(mfrow=c(2,2))
plot(auto.model)

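(i) Looking at the residual plot, I see no indication of unusually large outliers, although the largest residual (13.06, roughly 3.9 times the residual standard error of 3.328) deserves a closer look. (ii) From the leverage plot, observation 14 has high leverage.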

Use the * and : symbols to fit linear regression models with interaction effects. Do any interactions appear to be statistically significant?

auto.modelINT1 <- lm(mpg~cylinders*displacement + cylinders*weight + cylinders*horsepower + cylinders*origin + cylinders*year + cylinders*acceleration, data=Auto)
summary(auto.modelINT1)

## 
## Call:
## lm(formula = mpg ~ cylinders * displacement + cylinders * weight + 
##     cylinders * horsepower + cylinders * origin + cylinders * 
##     year + cylinders * acceleration, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7843 -1.6237 -0.0424  1.3271 12.3258 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            -2.902e+01  1.485e+01  -1.954 0.051483 .  
## cylinders               3.090e+00  2.702e+00   1.143 0.253571    
## displacement            4.810e-03  2.408e-02   0.200 0.841758    
## weight                 -1.006e-02  2.457e-03  -4.093 5.20e-05 ***
## horsepower             -1.777e-01  5.472e-02  -3.248 0.001267 ** 
## origin                 -2.217e+00  1.381e+00  -1.606 0.109192    
## year                    1.336e+00  1.654e-01   8.078 8.93e-15 ***
## acceleration           -1.020e-02  2.989e-01  -0.034 0.972804    
## cylinders:displacement  3.982e-04  3.632e-03   0.110 0.912762    
## cylinders:weight        8.602e-04  3.624e-04   2.374 0.018113 *  
## cylinders:horsepower    1.918e-02  8.013e-03   2.393 0.017185 *  
## cylinders:origin        7.242e-01  3.182e-01   2.276 0.023421 *  
## cylinders:year         -1.164e-01  3.143e-02  -3.703 0.000245 ***
## cylinders:acceleration  1.166e-03  5.431e-02   0.021 0.982878    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.865 on 378 degrees of freedom
## Multiple R-squared:  0.8697, Adjusted R-squared:  0.8652 
## F-statistic: 194.1 on 13 and 378 DF,  p-value: < 2.2e-16

In this model, the cylinders:weight, cylinders:horsepower, cylinders:origin, and cylinders:year interaction terms are statistically significant.

auto.modelINT2 <- lm(mpg~displacement*horsepower + displacement*weight + displacement*origin + displacement*year + displacement*acceleration+ cylinders, data=Auto)
summary(auto.modelINT2)

## 
## Call:
## lm(formula = mpg ~ displacement * horsepower + displacement * 
##     weight + displacement * origin + displacement * year + displacement * 
##     acceleration + cylinders, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.5046 -1.5468  0.0123  1.3195 13.5004 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)               -2.979e+01  8.992e+00  -3.313 0.001013 ** 
## displacement               7.178e-02  4.427e-02   1.622 0.105727    
## horsepower                -9.919e-02  3.175e-02  -3.124 0.001923 ** 
## weight                    -7.945e-03  1.367e-03  -5.811 1.32e-08 ***
## origin                    -8.544e-01  8.949e-01  -0.955 0.340354    
## year                       1.104e+00  9.824e-02  11.236  < 2e-16 ***
## acceleration               8.621e-02  1.772e-01   0.487 0.626826    
## cylinders                  6.256e-01  2.978e-01   2.101 0.036307 *  
## displacement:horsepower    1.666e-04  1.026e-04   1.625 0.105086    
## displacement:weight        1.550e-05  4.052e-06   3.826 0.000152 ***
## displacement:origin        1.233e-02  7.734e-03   1.594 0.111767    
## displacement:year         -2.039e-03  5.157e-04  -3.954 9.19e-05 ***
## displacement:acceleration -5.375e-04  8.597e-04  -0.625 0.532218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.832 on 379 degrees of freedom
## Multiple R-squared:  0.8724, Adjusted R-squared:  0.8683 
## F-statistic: 215.9 on 12 and 379 DF,  p-value: < 2.2e-16

Here, we see that the displacement:weight and displacement:year interaction terms are statistically significant.

auto.modelINT3 <- lm(mpg~horsepower*weight+horsepower*acceleration+horsepower*year+ horsepower*origin + cylinders+displacement,data=Auto)
summary(auto.modelINT3)

## 
## Call:
## lm(formula = mpg ~ horsepower * weight + horsepower * acceleration + 
##     horsepower * year + horsepower * origin + cylinders + displacement, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1377 -1.3451 -0.0611  1.2719 11.1489 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -5.626e+01  1.084e+01  -5.189 3.46e-07 ***
## horsepower               3.797e-01  1.059e-01   3.586  0.00038 ***
## weight                  -7.885e-03  1.095e-03  -7.199 3.27e-12 ***
## acceleration             2.124e-01  1.632e-01   1.302  0.19386    
## year                     1.390e+00  1.346e-01  10.333  < 2e-16 ***
## origin                   1.297e+00  1.086e+00   1.194  0.23309    
## cylinders                3.549e-01  2.891e-01   1.228  0.22030    
## displacement            -9.768e-03  7.639e-03  -1.279  0.20178    
## horsepower:weight        3.682e-05  6.959e-06   5.291 2.05e-07 ***
## horsepower:acceleration -4.386e-03  1.759e-03  -2.493  0.01308 *  
## horsepower:year         -6.620e-03  1.345e-03  -4.921 1.29e-06 ***
## horsepower:origin       -6.978e-03  1.284e-02  -0.543  0.58723    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.803 on 380 degrees of freedom
## Multiple R-squared:  0.8747, Adjusted R-squared:  0.871 
## F-statistic: 241.1 on 11 and 380 DF,  p-value: < 2.2e-16

Here we see that the horsepower:weight, horsepower:acceleration, and horsepower:year interaction terms are statistically significant.

auto.modelINT4 <- lm(mpg~weight*acceleration + weight*origin + weight*year + horsepower + displacement + cylinders, data=Auto)
summary(auto.modelINT4)

## 
## Call:
## lm(formula = mpg ~ weight * acceleration + weight * origin + 
##     weight * year + horsepower + displacement + cylinders, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7396 -1.6905 -0.0713  1.3018 11.3500 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -1.175e+02  1.306e+01  -8.999  < 2e-16 ***
## weight               3.098e-02  4.526e-03   6.845 3.07e-11 ***
## acceleration         1.213e+00  2.414e-01   5.025 7.75e-07 ***
## origin               2.699e+00  1.240e+00   2.176  0.03016 *  
## year                 1.817e+00  1.759e-01  10.327  < 2e-16 ***
## horsepower          -4.138e-02  1.303e-02  -3.176  0.00161 ** 
## displacement        -1.120e-03  7.355e-03  -0.152  0.87908    
## cylinders           -3.498e-02  2.965e-01  -0.118  0.90614    
## weight:acceleration -4.126e-04  8.430e-05  -4.895 1.46e-06 ***
## weight:origin       -7.752e-04  5.302e-04  -1.462  0.14457    
## weight:year         -3.803e-04  6.274e-05  -6.062 3.23e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.978 on 381 degrees of freedom
## Multiple R-squared:  0.8582, Adjusted R-squared:  0.8544 
## F-statistic: 230.5 on 10 and 381 DF,  p-value: < 2.2e-16

Here the weight:acceleration and weight:year interaction terms are significant.

auto.modelINT5 <- lm(mpg~cylinders+acceleration*year + acceleration*origin + displacement + horsepower + weight, data=Auto)
summary(auto.modelINT5)

## 
## Call:
## lm(formula = mpg ~ cylinders + acceleration * year + acceleration * 
##     origin + displacement + horsepower + weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.2160 -1.9139 -0.1561  1.6798 12.2113 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         88.5083128 20.0395827   4.417 1.31e-05 ***
## cylinders           -0.3553012  0.3029518  -1.173   0.2416    
## acceleration        -6.6459681  1.2408097  -5.356 1.47e-07 ***
## year                -0.4274157  0.2643531  -1.617   0.1067    
## origin              -7.9692322  1.5977193  -4.988 9.28e-07 ***
## displacement         0.0013427  0.0072991   0.184   0.8541    
## horsepower          -0.0318848  0.0128631  -2.479   0.0136 *  
## weight              -0.0049283  0.0006313  -7.806 5.71e-14 ***
## acceleration:year    0.0750484  0.0164081   4.574 6.48e-06 ***
## acceleration:origin  0.5650941  0.0965091   5.855 1.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.071 on 382 degrees of freedom
## Multiple R-squared:  0.8488, Adjusted R-squared:  0.8452 
## F-statistic: 238.2 on 9 and 382 DF,  p-value: < 2.2e-16

Here we see that the acceleration:year and acceleration:origin interaction terms are statistically significant.

auto.modelINT6 <- lm(mpg~cylinders + displacement+horsepower + weight + acceleration + year*origin, data=Auto)
summary(auto.modelINT6)

## 
## Call:
## lm(formula = mpg ~ cylinders + displacement + horsepower + weight + 
##     acceleration + year * origin, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6072 -2.0439 -0.0596  1.7121 12.3368 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   8.492e+00  9.044e+00   0.939 0.348353    
## cylinders    -5.042e-01  3.192e-01  -1.579 0.115082    
## displacement  1.567e-02  7.530e-03   2.081 0.038060 *  
## horsepower   -1.399e-02  1.364e-02  -1.025 0.305786    
## weight       -6.352e-03  6.449e-04  -9.851  < 2e-16 ***
## acceleration  9.185e-02  9.766e-02   0.941 0.347546    
## year          4.189e-01  1.125e-01   3.723 0.000226 ***
## origin       -1.405e+01  4.699e+00  -2.989 0.002978 ** 
## year:origin   1.989e-01  6.030e-02   3.298 0.001064 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.286 on 383 degrees of freedom
## Multiple R-squared:  0.8264, Adjusted R-squared:  0.8228 
## F-statistic: 227.9 on 8 and 383 DF,  p-value: < 2.2e-16

Here we see that the year:origin interaction term is statistically significant.
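
All six models above use the * operator. As a brief sketch of how it relates to the : operator the question mentions, terms() can expand a formula without any data: a * b is shorthand for a + b + a:b, while a:b on its own adds only the interaction.

```r
# terms() expands a formula symbolically; no data set is needed.
star  <- attr(terms(mpg ~ year * origin), "term.labels")
long  <- attr(terms(mpg ~ year + origin + year:origin), "term.labels")
colon <- attr(terms(mpg ~ year:origin), "term.labels")

star   # main effects plus the interaction
long   # the same terms written out explicitly
colon  # the interaction term only
```

Using : alone drops the main effects, which is rarely what is wanted when interpreting an interaction.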

Try a few different transformations of the variables, such as log(X), √X, X^2. Comment on your findings.

Answer

#1 Log transformation

auto.modellog <- lm(mpg~log(cylinders)+log(displacement)+log(horsepower)+log(weight)+log(acceleration)+log(year)+log(origin), data=Auto_new)
summary((auto.modellog))

## 
## Call:
## lm(formula = mpg ~ log(cylinders) + log(displacement) + log(horsepower) + 
##     log(weight) + log(acceleration) + log(year) + log(origin), 
##     data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5987 -1.8172 -0.0181  1.5906 12.8132 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -66.5643    17.5053  -3.803 0.000167 ***
## log(cylinders)      1.4818     1.6589   0.893 0.372273    
## log(displacement)  -1.0551     1.5385  -0.686 0.493230    
## log(horsepower)    -6.9657     1.5569  -4.474 1.01e-05 ***
## log(weight)       -12.5728     2.2251  -5.650 3.12e-08 ***
## log(acceleration)  -4.9831     1.6078  -3.099 0.002082 ** 
## log(year)          54.9857     3.5555  15.465  < 2e-16 ***
## log(origin)         1.5822     0.5083   3.113 0.001991 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.069 on 384 degrees of freedom
## Multiple R-squared:  0.8482, Adjusted R-squared:  0.8454 
## F-statistic: 306.5 on 7 and 384 DF,  p-value: < 2.2e-16

After log-transforming the predictors, five of them are statistically significant, one more than the four in the untransformed model.
The R-squared value increased from 0.8215 to 0.8482 after the transformation.
The residual standard error fell from 3.328 to 3.069, suggesting a better fit.

#2 square root transformation

auto.modelroot <- lm(mpg~sqrt(cylinders)+sqrt(displacement)+sqrt(horsepower)+sqrt(weight)+sqrt(acceleration)+sqrt(year)+sqrt(origin), data=Auto_new)
summary((auto.modelroot))

## 
## Call:
## lm(formula = mpg ~ sqrt(cylinders) + sqrt(displacement) + sqrt(horsepower) + 
##     sqrt(weight) + sqrt(acceleration) + sqrt(year) + sqrt(origin), 
##     data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5250 -1.9822 -0.1111  1.7347 13.0681 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        -49.79814    9.17832  -5.426 1.02e-07 ***
## sqrt(cylinders)     -0.23699    1.53753  -0.154   0.8776    
## sqrt(displacement)   0.22580    0.22940   0.984   0.3256    
## sqrt(horsepower)    -0.77976    0.30788  -2.533   0.0117 *  
## sqrt(weight)        -0.62172    0.07898  -7.872 3.59e-14 ***
## sqrt(acceleration)  -0.82529    0.83443  -0.989   0.3233    
## sqrt(year)          12.79030    0.85891  14.891  < 2e-16 ***
## sqrt(origin)         3.26036    0.76767   4.247 2.72e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.21 on 384 degrees of freedom
## Multiple R-squared:  0.8338, Adjusted R-squared:  0.8308 
## F-statistic: 275.3 on 7 and 384 DF,  p-value: < 2.2e-16

The R-squared value increased slightly, from 0.8215 to 0.8338, after the transformation.
The RSE decreased slightly, from 3.328 to 3.21, which is not a substantial change. These findings suggest that a square root transformation has little effect on the model fit.

#3 X^2 transformation

auto.modelsquare <- lm(mpg~ I(cylinders^2)+ I(displacement^2)+I(horsepower^2) +I(weight^2)+ I(acceleration^2)+I(year^2)+ I(origin^2), data=Auto_new)
summary((auto.modelsquare))

## 
## Call:
## lm(formula = mpg ~ I(cylinders^2) + I(displacement^2) + I(horsepower^2) + 
##     I(weight^2) + I(acceleration^2) + I(year^2) + I(origin^2), 
##     data = Auto_new)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.6786 -2.3227 -0.0582  1.9073 12.9807 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.208e+00  2.356e+00   0.513 0.608382    
## I(cylinders^2)    -8.829e-02  2.521e-02  -3.502 0.000515 ***
## I(displacement^2)  5.680e-05  1.382e-05   4.109 4.87e-05 ***
## I(horsepower^2)   -3.621e-05  4.975e-05  -0.728 0.467201    
## I(weight^2)       -9.351e-07  8.978e-08 -10.416  < 2e-16 ***
## I(acceleration^2)  6.278e-03  2.690e-03   2.334 0.020130 *  
## I(year^2)          4.999e-03  3.530e-04  14.160  < 2e-16 ***
## I(origin^2)        4.129e-01  6.914e-02   5.971 5.37e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.539 on 384 degrees of freedom
## Multiple R-squared:  0.7981, Adjusted R-squared:  0.7944 
## F-statistic: 216.8 on 7 and 384 DF,  p-value: < 2.2e-16

After squaring the predictors, the number of statistically significant predictors increased from four to six.
The R-squared value decreased from 0.8215 to 0.7981 after the transformation.
The residual standard error increased from 3.328 to 3.539. These findings suggest that a quadratic transformation does not improve the model fit; it makes it worse.

Question 10

This question should be answered using the Carseats data set.

library(ISLR2)  # assumption: Carseats is loaded from ISLR2 (the ISLR package also ships it)
str(Carseats)

## 'data.frame':    400 obs. of  11 variables:
##  $ Sales      : num  9.5 11.22 10.06 7.4 4.15 ...
##  $ CompPrice  : num  138 111 113 117 141 124 115 136 132 132 ...
##  $ Income     : num  73 48 35 100 64 113 105 81 110 113 ...
##  $ Advertising: num  11 16 10 4 3 13 0 15 0 0 ...
##  $ Population : num  276 260 269 466 340 501 45 425 108 131 ...
##  $ Price      : num  120 83 80 97 128 72 108 120 124 124 ...
##  $ ShelveLoc  : Factor w/ 3 levels "Bad","Good","Medium": 1 2 3 3 1 1 3 2 3 3 ...
##  $ Age        : num  42 65 59 55 38 78 71 67 76 76 ...
##  $ Education  : num  17 10 12 14 13 16 15 10 10 17 ...
##  $ Urban      : Factor w/ 2 levels "No","Yes": 2 2 2 2 2 1 2 2 1 1 ...
##  $ US         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 1 2 1 2 ...

Fit a multiple regression model to predict Sales using Price, Urban, and US.

Answer

sales.model <- lm(Sales~Price+ Urban + US, data=Carseats)
summary(sales.model)

## 
## Call:
## lm(formula = Sales ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

Provide an interpretation of each coefficient in the model. Be careful: some of the variables in the model are qualitative!

Answer

coef(sales.model)

## (Intercept)       Price    UrbanYes       USYes 
## 13.04346894 -0.05445885 -0.02191615  1.20057270

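Sales is recorded in thousands of units, so each coefficient is in thousands of units. (i) The intercept, 13.0435, is the predicted sales (about 13,043 units) for a non-urban, non-US store when the price is 0. (ii) The Price coefficient, -0.0545, implies that, holding the other predictors fixed, sales fall by about 54.5 units on average for each 1 dollar increase in price. (iii) The UrbanYes coefficient, -0.0219, implies that urban stores sell about 22 fewer units on average than non-urban stores with the same price and US status, but this difference is not statistically significant (p = 0.936). (iv) The USYes coefficient, 1.2006, implies that US stores sell about 1,201 more units on average than non-US stores, holding the other predictors fixed.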
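
The unit conversion behind these interpretations can be sketched as follows. It assumes, per the Carseats documentation, that Sales is measured in thousands of units and Price in dollars; the coefficients are copied from the coef() output above.

```r
# Coefficients copied from coef(sales.model).
coefs <- c(Intercept = 13.04346894, Price = -0.05445885,
           UrbanYes = -0.02191615, USYes = 1.20057270)

# Sales is in thousands of units, so multiply by 1000 to get units.
in_units <- round(coefs * 1000, 1)
in_units
```

Read this way, the Price entry is the change in units sold per 1 dollar increase, and the UrbanYes and USYes entries are differences in units sold between the two groups.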

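Write out the model in equation form, being careful to handle the qualitative variables properly.

Answer

Sales = 13.0435 - 0.0545 × Price - 0.0219 × UrbanYes + 1.2006 × USYes, where UrbanYes = 1 for an urban store and 0 otherwise, and USYes = 1 for a US store and 0 otherwise.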
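
The fitted equation can be written as a small function to make the dummy coding explicit. The function name predict_sales is hypothetical, and the coefficients are the rounded values from the equation, so results will differ slightly from predict(sales.model, ...).

```r
# Evaluate the fitted equation by hand; urban and us take the values "Yes" or "No".
predict_sales <- function(price, urban, us) {
  13.0435 - 0.0545 * price -
    0.0219 * (urban == "Yes") +   # dummy: 1 for urban stores, 0 otherwise
    1.2006 * (us == "Yes")        # dummy: 1 for US stores, 0 otherwise
}

base   <- predict_sales(100, "No", "No")    # non-urban, non-US store at price 100
us_gap <- predict_sales(100, "Yes", "Yes") - predict_sales(100, "Yes", "No")
base
us_gap  # equals the USYes coefficient regardless of price
```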

For which of the predictors can you reject the null hypothesis H0 : βj = 0?

Answer

Price and US: both have p-values far below 0.05, whereas Urban (p = 0.936) does not.

On the basis of your response to the previous question, fit a smaller model that only uses the predictors for which there is evidence of association with the outcome.

sales.reducedmodel <- lm(Sales~Price+US, data=Carseats)
summary(sales.reducedmodel)

## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

How well do the models in (a) and (e) fit the data?

Answer

R-squared: For model (a), R-squared is 0.2393, so about 24% of the total variation in Sales is explained by the predictors.

For model (e), R-squared is also 0.2393, so dropping Urban costs essentially nothing in explained variation; the adjusted R-squared rises slightly, from 0.2335 to 0.2354.

RSE: The RSE is 2.472 for model (a) and 2.469 for model (e). Overall, neither model fits the data well, since roughly 76% of the variation in Sales remains unexplained.
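
The adjusted R-squared values in the two summaries can be approximately reproduced from the reported R-squared, n = 400, and the number of predictors. This is a sketch using the rounded R-squared of 0.2393 for both models, so the last digit may differ from the printed output; adj_r2 is a hypothetical helper name.

```r
# Adjusted R-squared penalises each extra predictor.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_a <- adj_r2(0.2393, n = 400, p = 3)  # model (a): Price, Urban, US
adj_e <- adj_r2(0.2393, n = 400, p = 2)  # model (e): Price, US
c(model_a = adj_a, model_e = adj_e)
```

With the same R-squared and one fewer predictor, model (e) gets the higher adjusted value, which is why it is the preferable of the two.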

Using the model from (e), obtain 95 % confidence intervals for the coefficient(s).

confint(sales.reducedmodel, level = 0.95)

##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

Is there evidence of outliers or high leverage observations in the model from (e)?

plot(predict(sales.reducedmodel), rstudent(sales.reducedmodel))

plot(hatvalues(sales.reducedmodel))

which.max(hatvalues(sales.reducedmodel))

## 43 
## 43

1. From the studentized residual plot, there is no indication of outliers, as no observation has a studentized residual greater than 3 in absolute value. 2. The leverage plot does indicate high-leverage observations; which.max() shows that observation 43 has the largest leverage statistic.
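
Reading these plots by eye can be supplemented with explicit cutoffs. The sketch below uses simulated data (not Carseats) with one deliberately extreme predictor value. The average leverage is always (p + 1)/n; the factor of 3 used as a cutoff is a common rule of thumb and an assumption here, not a fixed standard.

```r
set.seed(1)
x <- c(rnorm(99), 10)          # observation 100 is placed far from the rest
y <- 1 + 2 * x + rnorm(100)
fit <- lm(y ~ x)

n <- length(y)
p <- 1                          # number of predictors
avg_leverage <- (p + 1) / n     # hat values always average to this

# Flag observations whose leverage is well above average.
high_leverage <- which(hatvalues(fit) > 3 * avg_leverage)

# Flag observations with large studentized residuals.
outliers <- which(abs(rstudent(fit)) > 3)

high_leverage
```

For the reduced Carseats model the same check would use p = 2 and n = 400, giving an average leverage of 3/400 = 0.0075.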

Question 12

This problem involves simple linear regression without an intercept.

Recall that the coefficient estimate β̂ for the linear regression of Y onto X without an intercept is given by (3.38). Under what circumstance is the coefficient estimate for the regression of X onto Y the same as the coefficient estimate for the regression of Y onto X?

Answer

Without an intercept, the estimate for the regression of Y onto X is $\sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} x_i^2$, and the estimate for the regression of X onto Y is $\sum_{i=1}^{n} x_i y_i / \sum_{i=1}^{n} y_i^2$. The numerators are identical, so the two estimates are equal when $\sum_{i=1}^{n} x_i^2 = \sum_{i=1}^{n} y_i^2$.

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is different from the coefficient estimate for the regression of Y onto X.

set.seed(2)
x <- 67:166
y <- 2 * x 
y.model <- lm(y ~ x + 0)
x.model <- lm(x ~ y + 0)
summary(y.model)

## Warning in summary.lm(y.model): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -5.024e-14 -1.243e-14 -2.320e-15  5.060e-15  9.963e-13 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## x 2.000e+00  8.445e-17 2.368e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.014e-13 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.609e+32 on 1 and 99 DF,  p-value: < 2.2e-16

summary(x.model)

## Warning in summary.lm(x.model): essentially perfect fit: summary may be
## unreliable

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -2.512e-14 -6.220e-15 -1.160e-15  2.530e-15  4.982e-13 
## 
## Coefficients:
##    Estimate Std. Error   t value Pr(>|t|)    
## y 5.000e-01  2.111e-17 2.368e+16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.068e-14 on 99 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 5.609e+32 on 1 and 99 DF,  p-value: < 2.2e-16

Generate an example in R with n = 100 observations in which the coefficient estimate for the regression of X onto Y is the same as the coefficient estimate for the regression of Y onto X.

x <-21:120
y <- 120:21

x.model <- lm(x~y+0)
summary(x.model)

## 
## Call:
## lm(formula = x ~ y + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.54 -22.15  20.24  62.64 105.03 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## y  0.71285    0.07049   10.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.7 on 99 degrees of freedom
## Multiple R-squared:  0.5081, Adjusted R-squared:  0.5032 
## F-statistic: 102.3 on 1 and 99 DF,  p-value: < 2.2e-16

y.model <- lm(y~x+0)
summary(y.model)

## 
## Call:
## lm(formula = y ~ x + 0)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -64.54 -22.15  20.24  62.64 105.03 
## 
## Coefficients:
##   Estimate Std. Error t value Pr(>|t|)    
## x  0.71285    0.07049   10.11   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 53.7 on 99 degrees of freedom
## Multiple R-squared:  0.5081, Adjusted R-squared:  0.5032 
## F-statistic: 102.3 on 1 and 99 DF,  p-value: < 2.2e-16
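
The two summaries agree because y is just x in reverse order, so both vectors have the same sum of squares. A short check using the closed-form no-intercept estimate, sum(x * y) / sum(x^2):

```r
x <- 21:120
y <- 120:21

# Same values in a different order, so the sums of squares match.
equal_ss <- sum(x^2) == sum(y^2)

# Closed-form estimates for regression through the origin.
beta_y_on_x <- sum(x * y) / sum(x^2)
beta_x_on_y <- sum(x * y) / sum(y^2)

c(y_on_x = beta_y_on_x, x_on_y = beta_x_on_y)
```

Both values should match the 0.71285 reported by lm() above; any permutation of x would work equally well as y.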

Data Mining: Homework 2

Obehi Ikpea

2023-02-16
