2)

3)

  1. The correct answer is iii. For fixed values of IQ and GPA, college graduates earn an average of 35 − 10·GPA more than high school graduates, so the difference shrinks as GPA grows and turns negative once GPA exceeds 3.5; for a high enough GPA, high school graduates earn more on average.

  2. We estimate that a college graduate with IQ = 110 and GPA = 4.0 earns an average of 137.1, i.e. $137,100, since salary is recorded in thousands of dollars (see the arithmetic check after this list).

  3. This is false: the magnitude of a coefficient says nothing about its statistical significance. Significance depends on the estimate relative to its standard error (the t-statistic and p-value), so even a very small interaction coefficient can provide strong evidence of an interaction effect.
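As a quick arithmetic check of answer 2 (using the fitted model from the exercise statement, with salary in thousands of dollars: ŷ = 50 + 20·GPA + 0.07·IQ + 35·Level + 0.01·GPA·IQ − 10·GPA·Level):

# Predicted salary (in $1000s) for a college graduate (Level = 1)
# with IQ = 110 and GPA = 4.0
50 + 20 * 4.0 + 0.07 * 110 + 35 * 1 + 0.01 * 4.0 * 110 - 10 * 4.0 * 1
## [1] 137.1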

4)

  1. Even if the true relationship is linear, the cubic model is more flexible and can fit the noise in the training sample more closely. We therefore expect the training RSS for the cubic regression to be lower than (or at worst equal to) that of the linear regression.

  2. We expect a higher test RSS for the cubic regression than for the linear regression: the extra cubic terms chase noise in the training subset, which may by chance look slightly cubic, and that pattern does not carry over to the test population.

  3. The cubic model's extra flexibility lets it fit the training data at least as well regardless of how far from linear the true relationship is, so we expect its training RSS to be lower than that of the linear regression.

  4. For the test RSS there is not enough information to tell: if the true relationship is close to linear, the linear model should do better out of sample; if it is strongly non-linear, the cubic model should. (A short simulation sketch for answers 1 and 2 follows this list.)
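A tiny simulation (my own sketch, not part of the exercise) illustrates answers 1 and 2: when the true relationship is linear, the cubic fit's training RSS is never higher than the linear fit's, while its test RSS tends to be.

set.seed(42)
x.sim <- rnorm(200)
y.sim <- 1 + 2 * x.sim + rnorm(200)      # the true relationship is linear
train <- 1:100; test <- 101:200
lin <- lm(y.sim ~ x.sim, subset = train)
cub <- lm(y.sim ~ poly(x.sim, 3), subset = train)
# Training RSS: the cubic fit is never worse on the data it was fit to
c(linear = sum(resid(lin)^2), cubic = sum(resid(cub)^2))
# Test RSS: the cubic's extra flexibility typically hurts out of sample
c(linear = sum((y.sim[test] - predict(lin, data.frame(x.sim = x.sim[test])))^2),
  cubic  = sum((y.sim[test] - predict(cub, data.frame(x.sim = x.sim[test])))^2))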

9)

a)

library("ISLR")
## Warning: package 'ISLR' was built under R version 4.2.3
pairs(Auto)

b)

cor(Auto[, names(Auto) != "name"])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000

c)

model = lm(mpg ~ . - name, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16
  1. Yes: the F-statistic of 252.4 with a p-value below 2.2e-16 indicates a relationship between the predictors and the response, although some individual predictors are not statistically significant. The R-squared value indicates that about 82% of the variance in mpg is accounted for by the predictors in this model.

  2. Displacement, weight, year, and origin have a statistically significant relationship with the response (p < 0.05).

  3. Holding every other predictor constant, mpg increases by about 0.75 with each additional model year.

d)

par(mfrow = c(2,2))
plot(model)

The first plot (residuals vs. fitted) indicates a non-linear association between the response and the predictors. The second plot (normal Q-Q) shows residuals that are roughly normal in the centre but right-skewed in the upper tail. The third plot (scale-location) contradicts the assumption of constant error variance. Finally, in the fourth plot (residuals vs. leverage) no observation has extreme leverage, but observation 14 stands out as a potential high-leverage point.
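A quick numerical check of the leverage point flagged in the plots (a sketch using base R):

which.max(hatvalues(model))                      # observation with the largest leverage
sort(hatvalues(model), decreasing = TRUE)[1:3]   # the few largest hat values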

e)

model = lm(mpg ~ . - name + displacement:weight, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:weight, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9027 -1.8092 -0.0946  1.5549 12.1687 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         -5.389e+00  4.301e+00  -1.253   0.2109    
## cylinders            1.175e-01  2.943e-01   0.399   0.6899    
## displacement        -6.837e-02  1.104e-02  -6.193 1.52e-09 ***
## horsepower          -3.280e-02  1.238e-02  -2.649   0.0084 ** 
## weight              -1.064e-02  7.136e-04 -14.915  < 2e-16 ***
## acceleration         6.724e-02  8.805e-02   0.764   0.4455    
## year                 7.852e-01  4.553e-02  17.246  < 2e-16 ***
## origin               5.610e-01  2.622e-01   2.139   0.0331 *  
## displacement:weight  2.269e-05  2.257e-06  10.054  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.964 on 383 degrees of freedom
## Multiple R-squared:  0.8588, Adjusted R-squared:  0.8558 
## F-statistic: 291.1 on 8 and 383 DF,  p-value: < 2.2e-16
model = lm(mpg ~ . - name + displacement:cylinders + displacement:weight + acceleration:horsepower, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight + 
##     acceleration:horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.3344 -1.6333  0.0188  1.4740 11.9723 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             -1.725e+01  5.328e+00  -3.237  0.00131 ** 
## cylinders                6.354e-01  6.106e-01   1.041  0.29870    
## displacement            -6.805e-02  1.337e-02  -5.088 5.68e-07 ***
## horsepower               6.026e-02  2.601e-02   2.317  0.02105 *  
## weight                  -8.864e-03  1.097e-03  -8.084 8.43e-15 ***
## acceleration             6.257e-01  1.592e-01   3.931  0.00010 ***
## year                     7.845e-01  4.470e-02  17.549  < 2e-16 ***
## origin                   4.668e-01  2.595e-01   1.799  0.07284 .  
## cylinders:displacement  -1.337e-03  2.726e-03  -0.490  0.62415    
## displacement:weight      2.071e-05  3.638e-06   5.694 2.49e-08 ***
## horsepower:acceleration -7.467e-03  1.784e-03  -4.185 3.55e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.905 on 381 degrees of freedom
## Multiple R-squared:  0.865,  Adjusted R-squared:  0.8615 
## F-statistic: 244.2 on 10 and 381 DF,  p-value: < 2.2e-16
model = lm(mpg ~ . - name + displacement:cylinders + displacement:weight + year:origin + acceleration:horsepower, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name + displacement:cylinders + displacement:weight + 
##     year:origin + acceleration:horsepower, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.6504 -1.6476  0.0381  1.4254 12.7893 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              5.287e+00  9.074e+00   0.583 0.560429    
## cylinders                4.249e-01  6.079e-01   0.699 0.485011    
## displacement            -7.322e-02  1.334e-02  -5.490 7.38e-08 ***
## horsepower               5.252e-02  2.586e-02   2.031 0.042913 *  
## weight                  -8.689e-03  1.086e-03  -7.998 1.54e-14 ***
## acceleration             5.796e-01  1.582e-01   3.665 0.000283 ***
## year                     5.116e-01  9.976e-02   5.129 4.66e-07 ***
## origin                  -1.220e+01  4.161e+00  -2.933 0.003560 ** 
## cylinders:displacement  -4.368e-04  2.712e-03  -0.161 0.872156    
## displacement:weight      1.992e-05  3.608e-06   5.522 6.21e-08 ***
## year:origin              1.630e-01  5.341e-02   3.051 0.002440 ** 
## horsepower:acceleration -6.735e-03  1.781e-03  -3.781 0.000181 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.874 on 380 degrees of freedom
## Multiple R-squared:  0.8683, Adjusted R-squared:  0.8644 
## F-statistic: 227.7 on 11 and 380 DF,  p-value: < 2.2e-16
model = lm(mpg ~ . - name - cylinders - acceleration + year:origin + displacement:weight +
                  acceleration:horsepower + acceleration:weight, data = Auto)
summary(model)
## 
## Call:
## lm(formula = mpg ~ . - name - cylinders - acceleration + year:origin + 
##     displacement:weight + acceleration:horsepower + acceleration:weight, 
##     data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5074 -1.6324  0.0599  1.4577 12.7376 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              1.868e+01  7.796e+00   2.396 0.017051 *  
## displacement            -7.794e-02  9.026e-03  -8.636  < 2e-16 ***
## horsepower               8.719e-02  3.167e-02   2.753 0.006183 ** 
## weight                  -1.350e-02  1.287e-03 -10.490  < 2e-16 ***
## year                     4.911e-01  9.825e-02   4.998 8.83e-07 ***
## origin                  -1.262e+01  4.109e+00  -3.071 0.002288 ** 
## year:origin              1.686e-01  5.277e-02   3.195 0.001516 ** 
## displacement:weight      2.253e-05  2.184e-06  10.312  < 2e-16 ***
## horsepower:acceleration -9.164e-03  2.222e-03  -4.125 4.56e-05 ***
## weight:acceleration      2.784e-04  7.087e-05   3.929 0.000101 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.861 on 382 degrees of freedom
## Multiple R-squared:  0.8687, Adjusted R-squared:  0.8656 
## F-statistic: 280.8 on 9 and 382 DF,  p-value: < 2.2e-16

Among the four models, only the last has every remaining term statistically significant. Additional trials suggest it is likely the best of the combinations of predictors and interactions tried here: its R-squared of about 0.87 exceeds the other fits.
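To compare the four candidates side by side, the fits can be collected in a list and their adjusted R-squared values tabulated (a sketch; the list 'fits' is my own helper):

fits <- list(
  m1 = lm(mpg ~ . - name + displacement:weight, data = Auto),
  m2 = lm(mpg ~ . - name + displacement:cylinders + displacement:weight +
            acceleration:horsepower, data = Auto),
  m3 = lm(mpg ~ . - name + displacement:cylinders + displacement:weight +
            year:origin + acceleration:horsepower, data = Auto),
  m4 = lm(mpg ~ . - name - cylinders - acceleration + year:origin +
            displacement:weight + acceleration:horsepower + acceleration:weight,
          data = Auto)
)
sapply(fits, function(m) summary(m)$adj.r.squared)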

f)

Auto$log_disp = log(Auto$displacement)
Auto$sqrt_weight = sqrt(Auto$weight)
Auto$weight_squared = Auto$weight^2

summary(Auto[, c("log_disp", "sqrt_weight", "weight_squared")])
##     log_disp      sqrt_weight    weight_squared    
##  Min.   :4.220   Min.   :40.16   Min.   : 2601769  
##  1st Qu.:4.654   1st Qu.:47.17   1st Qu.: 4951739  
##  Median :5.017   Median :52.95   Median : 7859624  
##  Mean   :5.128   Mean   :54.03   Mean   : 9585652  
##  3rd Qu.:5.618   3rd Qu.:60.12   3rd Qu.:13066427  
##  Max.   :6.120   Max.   :71.69   Max.   :26419600

The logarithmic transformation compresses the large displacement values, the square root damps the influence of extreme weights, and squaring magnifies the differences among the heavier cars.
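As a follow-up (my own sketch, not part of the exercise), the transformed variables can be used directly as predictors and the fit compared against the models above:

fit.trans <- lm(mpg ~ log_disp + sqrt_weight + weight_squared + horsepower +
                  year + origin, data = Auto)
summary(fit.trans)$adj.r.squared   # compare with the untransformed fits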

10)

a)

library(ISLR)
attach(Carseats)

fit<-lm(Sales ~ Price + Urban + US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + Urban + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9206 -1.6220 -0.0564  1.5786  7.0581 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.043469   0.651012  20.036  < 2e-16 ***
## Price       -0.054459   0.005242 -10.389  < 2e-16 ***
## UrbanYes    -0.021916   0.271650  -0.081    0.936    
## USYes        1.200573   0.259042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

b)

According to the table, both Price and US status significantly predict sales, while Urban status does not (p = 0.936). Holding the other predictors fixed, each $1 increase in price is associated with a decrease of about 54 units sold (Sales is recorded in thousands of units), and stores in the US sell about 1,201 more units than stores outside the US.

c)

Sales = 13.043469 − 0.054459 × Price − 0.021916 × UrbanYes + 1.200573 × USYes, where UrbanYes and USYes are indicator variables equal to 1 for urban stores and US stores, respectively, and 0 otherwise.
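For illustration (hypothetical store values), the fitted equation predicts sales for an urban US store charging $120 as follows:

# 13.043469 - 0.054459 * 120 - 0.021916 + 1.200573 = 7.687 (thousands of units)
predict(fit, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))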

d)

Price and US: their p-values are far below 0.05, so we can reject H0: βj = 0 for these two predictors.

e)

fit<-lm(Sales ~ Price + US)
summary(fit)
## 
## Call:
## lm(formula = Sales ~ Price + US)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.9269 -1.6286 -0.0574  1.5766  7.0515 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13.03079    0.63098  20.652  < 2e-16 ***
## Price       -0.05448    0.00523 -10.416  < 2e-16 ***
## USYes        1.19964    0.25846   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

f)

Both models fit the data poorly: each explains only about 23% of the variance in Sales (R-squared ≈ 0.239), although the model in (e) has a slightly higher adjusted R-squared because it drops the uninformative Urban predictor.

g)

confint(fit)
##                   2.5 %      97.5 %
## (Intercept) 11.79032020 14.27126531
## Price       -0.06475984 -0.04419543
## USYes        0.69151957  1.70776632

h)

The average leverage is (p + 1)/n = (2 + 1)/400 = 0.0075; observations whose leverage is well above this value are potential high-leverage points.
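A quick check for observations well above the average leverage (a sketch; the 3x cutoff is just a common rule of thumb):

avg.leverage <- (2 + 1) / 400                 # (p + 1)/n for the Price + US model
which(hatvalues(fit) > 3 * avg.leverage)      # candidate high-leverage points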

par(mfrow=c(2,2))
plot(fit)

summary(influence.measures(fit))
## Potentially influential observations of
##   lm(formula = Sales ~ Price + US) :
## 
##     dfb.1_ dfb.Pric dfb.USYs dffit   cov.r   cook.d hat    
## 26   0.24  -0.18    -0.17     0.28_*  0.97_*  0.03   0.01  
## 29  -0.10   0.10    -0.10    -0.18    0.97_*  0.01   0.01  
## 43  -0.11   0.10     0.03    -0.11    1.05_*  0.00   0.04_*
## 50  -0.10   0.17    -0.17     0.26_*  0.98    0.02   0.01  
## 51  -0.05   0.05    -0.11    -0.18    0.95_*  0.01   0.00  
## 58  -0.05  -0.02     0.16    -0.20    0.97_*  0.01   0.01  
## 69  -0.09   0.10     0.09     0.19    0.96_*  0.01   0.01  
## 126 -0.07   0.06     0.03    -0.07    1.03_*  0.00   0.03_*
## 160  0.00   0.00     0.00     0.01    1.02_*  0.00   0.02  
## 166  0.21  -0.23    -0.04    -0.24    1.02    0.02   0.03_*
## 172  0.06  -0.07     0.02     0.08    1.03_*  0.00   0.02  
## 175  0.14  -0.19     0.09    -0.21    1.03_*  0.02   0.03_*
## 210 -0.14   0.15    -0.10    -0.22    0.97_*  0.02   0.01  
## 270 -0.03   0.05    -0.03     0.06    1.03_*  0.00   0.02  
## 298 -0.06   0.06    -0.09    -0.15    0.97_*  0.01   0.00  
## 314 -0.05   0.04     0.02    -0.05    1.03_*  0.00   0.02_*
## 353 -0.02   0.03     0.09     0.15    0.97_*  0.01   0.00  
## 357  0.02  -0.02     0.02    -0.03    1.03_*  0.00   0.02  
## 368  0.26  -0.23    -0.11     0.27_*  1.01    0.02   0.02_*
## 377  0.14  -0.15     0.12     0.24    0.95_*  0.02   0.01  
## 384  0.00   0.00     0.00     0.00    1.02_*  0.00   0.02  
## 387 -0.03   0.04    -0.03     0.05    1.02_*  0.00   0.02  
## 396 -0.05   0.05     0.08     0.14    0.98_*  0.01   0.00
outlying.obs <- c(26, 29, 43, 50, 51, 58, 69, 126, 160, 166, 172, 175, 210, 270, 298, 314, 353, 357, 368, 377, 384, 387, 396)
Carseats.small <- Carseats[-outlying.obs, ]
fit2<-lm(Sales ~ Price + US, data=Carseats.small)
summary(fit2)
## 
## Call:
## lm(formula = Sales ~ Price + US, data = Carseats.small)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -5.263 -1.605 -0.039  1.590  5.428 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.925232   0.665259  19.429  < 2e-16 ***
## Price       -0.053973   0.005511  -9.794  < 2e-16 ***
## USYes        1.255018   0.248856   5.043 7.15e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.29 on 374 degrees of freedom
## Multiple R-squared:  0.2387, Adjusted R-squared:  0.2347 
## F-statistic: 58.64 on 2 and 374 DF,  p-value: < 2.2e-16

After excluding the potentially influential observations, the refitted model changes very little: the confidence intervals for the coefficients from the full-data fit contain the coefficient estimates from the reduced fit. It therefore seems safe to keep all observations in the model.

13)

a)

set.seed(1)
x = rnorm(100,0,1)
x
##   [1] -0.626453811  0.183643324 -0.835628612  1.595280802  0.329507772
##   [6] -0.820468384  0.487429052  0.738324705  0.575781352 -0.305388387
##  [11]  1.511781168  0.389843236 -0.621240581 -2.214699887  1.124930918
##  [16] -0.044933609 -0.016190263  0.943836211  0.821221195  0.593901321
##  [21]  0.918977372  0.782136301  0.074564983 -1.989351696  0.619825748
##  [26] -0.056128740 -0.155795507 -1.470752384 -0.478150055  0.417941560
##  [31]  1.358679552 -0.102787727  0.387671612 -0.053805041 -1.377059557
##  [36] -0.414994563 -0.394289954 -0.059313397  1.100025372  0.763175748
##  [41] -0.164523596 -0.253361680  0.696963375  0.556663199 -0.688755695
##  [46] -0.707495157  0.364581962  0.768532925 -0.112346212  0.881107726
##  [51]  0.398105880 -0.612026393  0.341119691 -1.129363096  1.433023702
##  [56]  1.980399899 -0.367221476 -1.044134626  0.569719627 -0.135054604
##  [61]  2.401617761 -0.039240003  0.689739362  0.028002159 -0.743273209
##  [66]  0.188792300 -1.804958629  1.465554862  0.153253338  2.172611670
##  [71]  0.475509529 -0.709946431  0.610726353 -0.934097632 -1.253633400
##  [76]  0.291446236 -0.443291873  0.001105352  0.074341324 -0.589520946
##  [81] -0.568668733 -0.135178615  1.178086997 -1.523566800  0.593946188
##  [86]  0.332950371  1.063099837 -0.304183924  0.370018810  0.267098791
##  [91] -0.542520031  1.207867806  1.160402616  0.700213650  1.586833455
##  [96]  0.558486426 -1.276592208 -0.573265414 -1.224612615 -0.473400636

b)

eps = rnorm(100,0,sqrt(0.25))
eps
##   [1] -0.310183339  0.021057937 -0.455460824  0.079014386 -0.327292322
##   [6]  0.883643635  0.358353738  0.455087115  0.192092679  0.841088040
##  [11] -0.317868227 -0.230822365  0.716141119 -0.325348177 -0.103690372
##  [16] -0.196403965 -0.159996434 -0.139556651  0.247094166 -0.088665241
##  [21] -0.252978731  0.671519413 -0.107289704 -0.089778265 -0.050095371
##  [26]  0.356333154 -0.036782202 -0.018817086 -0.340830239 -0.162135136
##  [31]  0.030080220 -0.294447243  0.265748096 -0.759197041  0.153278930
##  [36] -0.768224912 -0.150488063 -0.264139952 -0.326047390 -0.028448389
##  [41] -0.957179713  0.588291656 -0.832486218 -0.231765201 -0.557960053
##  [46] -0.375409501  1.043583273  0.008697810 -0.643150265 -0.820302767
##  [51]  0.225093551 -0.009279916 -0.159034187 -0.464681074 -0.743730155
##  [56] -0.537596148  0.500014402 -0.310633347 -0.692213424  0.934645311
##  [61]  0.212550189 -0.119323550  0.529241524  0.443211326 -0.309621524
##  [66]  1.103051232 -0.127513515 -0.712247325 -0.072199801  0.103769170
##  [71]  1.153989200  0.052901184  0.228499403 -0.038576468 -0.167000421
##  [76] -0.017363014  0.393819803  1.037622504  0.513696219  0.603954199
##  [81] -0.615661711  0.491947785  0.109962402 -0.733625015  0.260511371
##  [86] -0.079377302  0.732293656 -0.383041000 -0.215105877 -0.463054749
##  [91] -0.088551981  0.201005890 -0.365874087  0.415186584 -0.604041393
##  [96] -0.523992206  0.720578853 -0.507923733  0.205987356 -0.190538026

c)

y <- -1 + 0.5*x + eps
y
##   [1] -1.62341024 -0.88712040 -1.87327513 -0.12334521 -1.16253844 -0.52659056
##   [7] -0.39793174 -0.17575053 -0.52001665 -0.31160615 -0.56197764 -1.03590075
##  [13] -0.59447917 -2.43269812 -0.54122491 -1.21887077 -1.16809157 -0.66763855
##  [19] -0.34229524 -0.79171458 -0.79349005  0.06258756 -1.07000721 -2.08445411
##  [25] -0.74018250 -0.67173122 -1.11467996 -1.75419328 -1.57990527 -0.95316436
##  [31] -0.29058000 -1.34584111 -0.54041610 -1.78609956 -1.53525085 -1.97572219
##  [37] -1.34763304 -1.29379665 -0.77603470 -0.64686051 -2.03944151 -0.53838918
##  [43] -1.48400453 -0.95343360 -1.90233790 -1.72915708  0.22587425 -0.60703573
##  [49] -1.69932337 -1.37974890 -0.57585351 -1.31529311 -0.98847434 -2.02936262
##  [55] -1.02721830 -0.54739620 -0.68359634 -1.83270066 -1.40735361 -0.13288199
##  [61]  0.41335907 -1.13894355 -0.12588879 -0.54278759 -1.68125813  0.19744738
##  [67] -2.02999283 -0.97946989 -0.99557313  0.19007500  0.39174396 -1.30207203
##  [73] -0.46613742 -1.50562528 -1.79381712 -0.87163990 -0.82782613  0.03817518
##  [79] -0.44913312 -0.69080627 -1.89999608 -0.57564152 -0.30099410 -2.49540841
##  [85] -0.44251553 -0.91290212  0.26384357 -1.53513296 -1.03009647 -1.32950535
##  [91] -1.35981200 -0.19506021 -0.78567278 -0.23470659 -0.81062467 -1.24474899
##  [97] -0.91771725 -1.79455644 -1.40631895 -1.42723834

The vector y has length 100. The values of β0 and β1 are −1 and 0.5, respectively.

d)

plot(x,y)

The plot shows a positive linear relationship between "x" and "y", with scatter around the line introduced by the noise term "eps".

e)

fit = lm(y~x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

The estimates β^0 = −1.019 and β^1 = 0.499 are very close to the true values β0 = −1 and β1 = 0.5. The model has a large F-statistic with a near-zero p-value, so we reject the null hypothesis H0: β1 = 0.

f)

plot(x,y)
abline(coef = c(-1,0.5),col = "blue")
abline(fit,col="red")
legend("topleft",c("ls","regression"),col=c("red","blue"),lty = c(1,2))

g)

lm.fit2 <- lm(y ~ x + I(x^2))
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x + I(x^2))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.98252 -0.31270 -0.06441  0.29014  1.13500 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.97164    0.05883 -16.517  < 2e-16 ***
## x            0.50858    0.05399   9.420  2.4e-15 ***
## I(x^2)      -0.05946    0.04238  -1.403    0.164    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.479 on 97 degrees of freedom
## Multiple R-squared:  0.4779, Adjusted R-squared:  0.4672 
## F-statistic:  44.4 on 2 and 97 DF,  p-value: 2.038e-14

The coefficient on "x^2" is not significant (p = 0.164 > 0.05), so there is not enough evidence that the quadratic term improves the fit, despite the slightly higher R2 and marginally lower RSE compared with the linear model.
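An equivalent formal check is an F-test comparing the two nested models; for a single added term it reproduces the t-test p-value above:

anova(fit, lm.fit2)   # tests whether I(x^2) significantly improves the fit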

h)

set.seed(1)
eps <- rnorm(100, sd = 0.125)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit3 <- lm(y ~ x)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29052 -0.07545  0.00067  0.07288  0.28664 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.98639    0.01129  -87.34   <2e-16 ***
## x            0.49988    0.01184   42.22   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1128 on 98 degrees of freedom
## Multiple R-squared:  0.9479, Adjusted R-squared:  0.9474 
## F-statistic:  1782 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm.fit3, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit3 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))

By reducing the variance of the normal distribution that generates the error term ε, we reduce the noise. The coefficient estimates remain very close to the true values, but with the points now tightly clustered around the line, R2 is much higher and the RSE much lower. The fitted and population lines overlap almost exactly, reflecting the minimal noise in the model.

i)

set.seed(1)
eps <- rnorm(100, sd = 0.5)
x <- rnorm(100)
y <- -1 + 0.5 * x + eps
plot(x, y)
lm.fit4 <- lm(y ~ x)
summary(lm.fit4)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.16208 -0.30181  0.00268  0.29152  1.14658 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.94557    0.04517  -20.93   <2e-16 ***
## x            0.49953    0.04736   10.55   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4514 on 98 degrees of freedom
## Multiple R-squared:  0.5317, Adjusted R-squared:  0.5269 
## F-statistic: 111.2 on 1 and 98 DF,  p-value: < 2.2e-16
abline(lm.fit4, col = "red")
abline(-1, 0.5, col = "blue")
legend("topleft", c("lm.fit4 Least square", "Regression"), col = c("red", "blue"), lty = c(1, 1))

Increasing the variance of the normal distribution shaping the error term ε amplifies the noise. The coefficient estimates still closely mirror the true values, but because the points are far more dispersed about the line than in (h), R2 drops noticeably and the RSE rises sharply. The two lines show some separation yet remain close, underscoring the stabilizing effect of a substantial sample size.

j)

confint_lm.fit <- confint(fit)
confint_lm.fit3 <- confint(lm.fit3)
confint_lm.fit4 <- confint(lm.fit4)
print(confint_lm.fit)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
print(confint_lm.fit3)
##                 2.5 %     97.5 %
## (Intercept) -1.008805 -0.9639819
## x            0.476387  0.5233799
print(confint_lm.fit4)
##                  2.5 %     97.5 %
## (Intercept) -1.0352203 -0.8559276
## x            0.4055479  0.5935197

All of the intervals for the slope are centered at approximately 0.5 (and those for the intercept at −1). As the noise increases, the confidence intervals widen; with less noise, the estimates are more precise.
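To make the widening concrete, the width of each slope interval can be tabulated (a quick sketch):

sapply(list(fit = fit, lm.fit3 = lm.fit3, lm.fit4 = lm.fit4),
       function(m) diff(confint(m)["x", ]))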

14)

a)

set.seed(1)
x1 <- runif(100)
x2 <- 0.5 * x1 + rnorm(100)/10
y <- 2 + 2 * x1 + 0.3 * x2 + rnorm(100)

The last line defines the linear model that generates "y" from "x1" and "x2": Y = 2 + 2X1 + 0.3X2 + ε, where ε is a random variable with distribution N(0, 1). The regression coefficients are β0 = 2, β1 = 2, and β2 = 0.3 for the intercept, X1, and X2, respectively.

b)

cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)

With a correlation of about 0.835, the two variables are highly correlated.

c)

lm.fit <- lm(y ~ x1 + x2)
summary(lm.fit)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

The estimates β^0, β^1, and β^2 are 2.13, 1.44, and 1.01, respectively; only β^0 is close to its true value (β0 = 2). Since the p-value for β^1 is just below 0.05, we can (barely) reject H0: β1 = 0, but we cannot reject H0: β2 = 0, whose p-value is well above 0.05.

d)

lm.fit2 <- lm(y ~ x1)
summary(lm.fit2)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

The coefficient for "x1" in this model differs markedly from its value in the model with both "x1" and "x2" as predictors, and is now close to the true value of 2. Here "x1" is highly significant, as evidenced by its very low p-value, so we reject the null hypothesis H0: β1 = 0.

e)

lm.fit3 <- lm(y ~ x2)
summary(lm.fit3)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

The "x2" coefficient in this model is likewise quite different from its value in the model containing both "x1" and "x2". In this case "x2" is highly significant, as shown by its very low p-value, so once again we reject the null hypothesis H0: β2 = 0.

f)

No, the findings are not contradictory. Since the predictors "x1" and "x2" are highly correlated, collinearity is present, making it difficult to separate the individual association of each predictor with the response. Collinearity inflates the standard errors of the coefficient estimates: in the model with both predictors, the standard errors are 0.721 for "x1" and 1.134 for "x2", whereas in the single-predictor models they are only 0.396 and 0.633, respectively. As a result, in the presence of collinearity we may fail to reject H0 for a predictor that is genuinely related to the response; the significance of "x2" is masked in the joint model.
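Variance inflation factors make the collinearity explicit (a sketch, assuming the car package is installed):

library(car)
vif(lm(y ~ x1 + x2))   # VIFs well above 1 flag the x1/x2 collinearity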

g)

x1 <- c(x1, 0.1)
x2 <- c(x2, 0.8)
y <- c(y, 6)

lm.fit4 <- lm(y ~ x1 + x2)
lm.fit5 <- lm(y ~ x1)
lm.fit6 <- lm(y ~ x2)
summary(lm.fit4)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
summary(lm.fit5)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
summary(lm.fit6)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06

Adding the new observation changes the conclusions of part (c): in the first model, x1 is no longer statistically significant, while x2 now is. In the single-predictor models, x1 (second model) and x2 (third model) both remain statistically significant.

par(mfrow=c(2,2))
plot(lm.fit4)

plot(lm.fit5)

plot(lm.fit6)

In the first model (x1 & x2) and the third model (x2 only), the added data point acts as a high-leverage point. In the second model (x1 only), it does not exhibit high leverage.

plot(predict(lm.fit4), rstudent(lm.fit4))

plot(predict(lm.fit5), rstudent(lm.fit5))

plot(predict(lm.fit6), rstudent(lm.fit6))

In the first model (x1 & x2) and the third model (x2 only), the added point does not qualify as an outlier. In the second model (x1 only), however, it is an outlier: its studentized residual exceeds 3 in absolute value.
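These visual impressions can be confirmed numerically with base R's influence measures (a sketch):

# Leverage and studentized residual of the added observation (row 101)
sapply(list(lm.fit4, lm.fit5, lm.fit6),
       function(m) c(leverage = unname(hatvalues(m)[101]),
                     rstudent = unname(rstudent(m)[101])))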

4)

a)

set.seed(123)  
num_variables <- 25
num_samples <- 25

df.train <- as.data.frame(matrix(rnorm(num_variables * num_samples), ncol = num_variables))

response_variable_index <- sample(1:num_variables, 1)
df.train$y <- df.train[, response_variable_index]
df.train <- df.train[, c("y", setdiff(names(df.train), "y"))]

b)

df.test <- as.data.frame(matrix(rnorm(num_variables * num_samples), ncol = num_variables))
response_variable_index_test <- sample(1:num_variables, 1)
df.test$y <- df.test[, response_variable_index_test]
df.test <- df.test[, c("y", setdiff(names(df.test), "y"))]

c)

num_predictors <- seq(1, num_variables)
MSE.train <- numeric(length(num_predictors))
MSE.test <- numeric(length(num_predictors))

for (i in num_predictors) {
  predictors <- names(df.train)[2:(i + 1)]  # Select the first i predictors
  
  model <- lm(y ~ ., data = df.train[, c("y", predictors)])
  
  MSE.train[i] <- mean((predict(model, df.train) - df.train$y)^2)
  
  MSE.test[i] <- mean((predict(model, df.test) - df.test$y)^2)
}
## Warning in predict.lm(model, df.train): prediction from a rank-deficient fit may
## be misleading
## Warning in predict.lm(model, df.test): prediction from a rank-deficient fit may
## be misleading
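(The warnings come from the largest model: with i = 25 predictors plus an intercept there are 26 coefficients but only 25 training observations, so the fit is rank-deficient.)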

d)

plot(num_predictors, MSE.train, type = "l", col = "blue", xlab = "Number of Predictors", ylab = "Mean Squared Error", ylim = c(0, max(c(MSE.train, MSE.test))))
lines(num_predictors, MSE.test, type = "l", col = "red")
legend("topright", legend = c("Training Error", "Test Error"), col = c("blue", "red"), lty = 1)

e)

As more predictors are added, the training error decreases monotonically (each larger model fits the training data at least as well as the one nested inside it), while the test error increases: the extra predictors fit noise in the training sample that does not generalize to the independent test set, the classic overfitting pattern.
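A quick numerical check of where each curve bottoms out (a sketch):

which.min(MSE.train)   # training MSE keeps shrinking, so this lands at (or near) 25
which.min(MSE.test)    # the test MSE is smallest with fewer predictors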