library(MASS)
library(ISLR)
library(car)

Conceptual Exercises

3)

a) iii

Plugging Gender = 0 (male) into the fitted model gives:
salary = 50 + 20 * gpa + .07 * iq + .01 * (gpa * iq)
and plugging in Gender = 1 (female) gives:
salary = 85 + 10 * gpa + .07 * iq + .01 * (gpa * iq)

So females have an intercept of 85 and males have an intercept of 50, but the GPA coefficient is 20 for males and only 10 for females. Whether males or females earn more on average therefore depends on GPA: for a fixed value of IQ and GPA, males earn more on average than females provided that the GPA is high enough.
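
To pin down "high enough": for a fixed IQ, the male-minus-female salary difference is \((50 + 20 \cdot gpa) - (85 + 10 \cdot gpa) = 10 \cdot gpa - 35\), which is positive exactly when the GPA exceeds 3.5. The check below agrees: the two salaries are equal at a GPA of 3.5, and males earn more above that.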

# checking answer
gpa = seq(1,4, by = .5)
iq = 90
female = 1

salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
f <- salary 

female = 0
salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
m <- salary

cbind(gpa, f, m)
##      gpa      f      m
## [1,] 1.0 102.20  77.20
## [2,] 1.5 107.65  87.65
## [3,] 2.0 113.10  98.10
## [4,] 2.5 118.55 108.55
## [5,] 3.0 124.00 119.00
## [6,] 3.5 129.45 129.45
## [7,] 4.0 134.90 139.90

b)

# predict salary of female with IQ of 110 and GPA of 4.0

gpa = 4
iq = 110
female = 1

salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
salary
## [1] 137.1

Prediction: $137,100

c) False

It could be the case that the p-value for the interaction coefficient is below .05, which would be evidence that the coefficient is non-zero even though it is small. In other words, a small coefficient does not rule out a real interaction effect; what matters is the size of the coefficient relative to its standard error. We cannot say either way until we see the p-value or confidence interval associated with the interaction term.
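
As a quick illustration with made-up numbers (these are not from the exercise): a coefficient of .01 paired with an even smaller standard error still produces a large t-statistic and a small p-value.

# hypothetical numbers, for illustration only: a small coefficient with an
# even smaller standard error is still statistically significant
est <- 0.01        # hypothetical interaction estimate
se  <- 0.003       # hypothetical standard error
t.stat <- est / se
2 * pt(-abs(t.stat), df = 100)   # two-sided p-value, assuming 100 residual df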

4)

a)

We would expect the training RSS for the cubic regression to be lower than the training RSS for the linear regression. Even though the true relationship is linear, the cubic regression would "overfit" the training data, resulting in a lower training RSS.

b)

The test RSS would likely be lower for the linear model since the true relationship is linear, while the cubic model has higher variance because it overfits the training data. So the residuals on the test data will tend to be smaller for the linear model than for the cubic model.

c)

The training RSS will still be lower for the cubic model since adding more flexibility will lower the training RSS.

d)

In this case, we might expect the cubic model to perform better on the test data than it did in part b), since the true relationship is now non-linear. However, there is not enough information to tell whether the test RSS will be lower for the linear or the cubic model. For example, if the true relationship is only mildly non-linear (say, slightly quadratic), the cubic model could still overfit the data and end up with a higher test RSS than the linear model; if the departure from linearity is strong, the cubic model will likely have the lower test RSS. So again, there is not enough information to tell.
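
To make this concrete, here is a small sketch (my own simulation, not part of the exercise) comparing test RSS for linear and cubic fits when the truth is mildly quadratic; which model wins depends on the seed, the sample size, and how strong the curvature and noise are.

# sketch: test RSS of a linear vs. a cubic fit when the truth is mildly quadratic
set.seed(42)                           # arbitrary seed
x.tr <- rnorm(50); x.te <- rnorm(50)   # training and test inputs
f <- function(x) 1 + x + .2 * x^2      # hypothetical slightly non-linear truth
y.tr <- f(x.tr) + rnorm(50)
y.te <- f(x.te) + rnorm(50)
fit1 <- lm(y.tr ~ x.tr)                # linear fit
fit3 <- lm(y.tr ~ poly(x.tr, 3))       # cubic fit
sum((y.te - predict(fit1, data.frame(x.tr = x.te)))^2)   # linear test RSS
sum((y.te - predict(fit3, data.frame(x.tr = x.te)))^2)   # cubic test RSS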

Applied Exercises

9)

# a) scatterplot matrix of all variables in the dataset
pairs(Auto)

# b) matrix of correlations between all variables

cor(Auto[,colnames(Auto) != "name"])
##                     mpg  cylinders displacement horsepower     weight
## mpg           1.0000000 -0.7776175   -0.8051269 -0.7784268 -0.8322442
## cylinders    -0.7776175  1.0000000    0.9508233  0.8429834  0.8975273
## displacement -0.8051269  0.9508233    1.0000000  0.8972570  0.9329944
## horsepower   -0.7784268  0.8429834    0.8972570  1.0000000  0.8645377
## weight       -0.8322442  0.8975273    0.9329944  0.8645377  1.0000000
## acceleration  0.4233285 -0.5046834   -0.5438005 -0.6891955 -0.4168392
## year          0.5805410 -0.3456474   -0.3698552 -0.4163615 -0.3091199
## origin        0.5652088 -0.5689316   -0.6145351 -0.4551715 -0.5850054
##              acceleration       year     origin
## mpg             0.4233285  0.5805410  0.5652088
## cylinders      -0.5046834 -0.3456474 -0.5689316
## displacement   -0.5438005 -0.3698552 -0.6145351
## horsepower     -0.6891955 -0.4163615 -0.4551715
## weight         -0.4168392 -0.3091199 -0.5850054
## acceleration    1.0000000  0.2903161  0.2127458
## year            0.2903161  1.0000000  0.1815277
## origin          0.2127458  0.1815277  1.0000000
# c) MLR with mpg as response and all else as predictors

lm <- lm(mpg ~. - name, Auto)
summary(lm)
## 
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.5903 -2.1565 -0.1169  1.8690 13.0604 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -17.218435   4.644294  -3.707  0.00024 ***
## cylinders     -0.493376   0.323282  -1.526  0.12780    
## displacement   0.019896   0.007515   2.647  0.00844 ** 
## horsepower    -0.016951   0.013787  -1.230  0.21963    
## weight        -0.006474   0.000652  -9.929  < 2e-16 ***
## acceleration   0.080576   0.098845   0.815  0.41548    
## year           0.750773   0.050973  14.729  < 2e-16 ***
## origin         1.426141   0.278136   5.127 4.67e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared:  0.8215, Adjusted R-squared:  0.8182 
## F-statistic: 252.4 on 7 and 384 DF,  p-value: < 2.2e-16

The F-statistic is large (252.4) and its p-value is essentially zero, so there does appear to be a relationship between the predictors and the response. The following predictors have statistically significant relationships with the response: displacement, weight, year, and origin. The year coefficient suggests that a one-year increase in model year is associated with an increase of about 0.75 mpg, holding all other predictors constant.
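
The overall p-value can also be computed directly from the stored F-statistic; a quick sketch using the fitted model above:

# recover the overall F-test p-value from the summary object
f.stat <- summary(lm)$fstatistic
pf(f.stat[1], f.stat[2], f.stat[3], lower.tail = FALSE)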

# d) diagnostic plots

par(mfrow = c(2,2))
plot(lm)

The residual plots suggest some outliers at higher fitted values of mpg; the model underestimates some of the high-mpg cars. Points such as 327 and 394 have large standardized residuals but low leverage. There also looks to be heteroskedasticity: as the fitted value of mpg increases, so does the variation in the residuals. Additionally, the leverage plot identifies observation 14 as a high-leverage point.
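
The leverage claim can be checked numerically with the hat values; a sketch using the model above, which should flag the same observation the plot labels:

# hat values measure leverage; compare against the average leverage (p + 1)/n
lev <- hatvalues(lm)
which.max(lev)                                  # highest-leverage observation
length(coef(lm)) / nrow(Auto)                   # average leverage, (p + 1)/n
sum(lev > 2 * length(coef(lm)) / nrow(Auto))    # points above twice the average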

# e) fit models with interaction effects
Auto <- Auto[,colnames(Auto) != "name"]

#all possible interactions
lm.inter <- lm(mpg ~ . * . , data = Auto)
summary(lm.inter)
## 
## Call:
## lm(formula = mpg ~ . * ., data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6303 -1.4481  0.0596  1.2739 11.1386 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                3.548e+01  5.314e+01   0.668  0.50475   
## cylinders                  6.989e+00  8.248e+00   0.847  0.39738   
## displacement              -4.785e-01  1.894e-01  -2.527  0.01192 * 
## horsepower                 5.034e-01  3.470e-01   1.451  0.14769   
## weight                     4.133e-03  1.759e-02   0.235  0.81442   
## acceleration              -5.859e+00  2.174e+00  -2.696  0.00735 **
## year                       6.974e-01  6.097e-01   1.144  0.25340   
## origin                    -2.090e+01  7.097e+00  -2.944  0.00345 **
## cylinders:displacement    -3.383e-03  6.455e-03  -0.524  0.60051   
## cylinders:horsepower       1.161e-02  2.420e-02   0.480  0.63157   
## cylinders:weight           3.575e-04  8.955e-04   0.399  0.69000   
## cylinders:acceleration     2.779e-01  1.664e-01   1.670  0.09584 . 
## cylinders:year            -1.741e-01  9.714e-02  -1.793  0.07389 . 
## cylinders:origin           4.022e-01  4.926e-01   0.816  0.41482   
## displacement:horsepower   -8.491e-05  2.885e-04  -0.294  0.76867   
## displacement:weight        2.472e-05  1.470e-05   1.682  0.09342 . 
## displacement:acceleration -3.479e-03  3.342e-03  -1.041  0.29853   
## displacement:year          5.934e-03  2.391e-03   2.482  0.01352 * 
## displacement:origin        2.398e-02  1.947e-02   1.232  0.21875   
## horsepower:weight         -1.968e-05  2.924e-05  -0.673  0.50124   
## horsepower:acceleration   -7.213e-03  3.719e-03  -1.939  0.05325 . 
## horsepower:year           -5.838e-03  3.938e-03  -1.482  0.13916   
## horsepower:origin          2.233e-03  2.930e-02   0.076  0.93931   
## weight:acceleration        2.346e-04  2.289e-04   1.025  0.30596   
## weight:year               -2.245e-04  2.127e-04  -1.056  0.29182   
## weight:origin             -5.789e-04  1.591e-03  -0.364  0.71623   
## acceleration:year          5.562e-02  2.558e-02   2.174  0.03033 * 
## acceleration:origin        4.583e-01  1.567e-01   2.926  0.00365 **
## year:origin                1.393e-01  7.399e-02   1.882  0.06062 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared:  0.8893, Adjusted R-squared:  0.8808 
## F-statistic: 104.2 on 28 and 363 DF,  p-value: < 2.2e-16
#only significant interactions
lm.inter1 <- lm(mpg ~ . + acceleration*origin + acceleration*year + displacement * year, data = Auto)
summary(lm.inter1)
## 
## Call:
## lm(formula = mpg ~ . + acceleration * origin + acceleration * 
##     year + displacement * year, data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.7659 -1.8996  0.0241  1.4837 12.2739 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         11.8757343 27.4715000   0.432   0.6658    
## cylinders           -0.1269380  0.3026441  -0.419   0.6751    
## displacement         0.1708125  0.0430357   3.969 8.62e-05 ***
## horsepower          -0.0370510  0.0126847  -2.921   0.0037 ** 
## weight              -0.0048741  0.0006195  -7.868 3.76e-14 ***
## acceleration        -3.6459893  1.4303639  -2.549   0.0112 *  
## year                 0.5476463  0.3561807   1.538   0.1250    
## origin              -7.0940248  1.5825945  -4.483 9.77e-06 ***
## acceleration:origin  0.5078440  0.0957537   5.304 1.93e-07 ***
## acceleration:year    0.0374758  0.0186441   2.010   0.0451 *  
## displacement:year   -0.0022809  0.0005711  -3.994 7.81e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.012 on 381 degrees of freedom
## Multiple R-squared:  0.8548, Adjusted R-squared:  0.851 
## F-statistic: 224.4 on 10 and 381 DF,  p-value: < 2.2e-16

Three interactions appear to be significant: acceleration and origin, acceleration and year, and displacement and year.
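
Whether these three interactions jointly improve on the main-effects model from part c) can be checked with a nested-model F-test; a quick sketch comparing the two fits:

# nested F-test: main effects only (lm) vs. main effects plus the three interactions
anova(lm, lm.inter1)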

# f) transform the variables

# I used the scatterplot matrix to choose which variables to transform

lm.trans <- lm(mpg ~ . + I(displacement^2) + I(horsepower^2) + I(weight^2), data = Auto)
summary(lm.trans)
## 
## Call:
## lm(formula = mpg ~ . + I(displacement^2) + I(horsepower^2) + 
##     I(weight^2), data = Auto)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.2232 -1.5534 -0.0931  1.4304 11.9162 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        3.509e+00  4.789e+00   0.733 0.464247    
## cylinders          4.113e-01  3.275e-01   1.256 0.209886    
## displacement      -3.513e-02  2.005e-02  -1.752 0.080556 .  
## horsepower        -1.915e-01  4.096e-02  -4.675 4.09e-06 ***
## weight            -1.067e-02  2.590e-03  -4.122 4.62e-05 ***
## acceleration      -1.735e-01  1.004e-01  -1.728 0.084870 .  
## year               7.692e-01  4.512e-02  17.048  < 2e-16 ***
## origin             5.788e-01  2.668e-01   2.170 0.030643 *  
## I(displacement^2)  6.324e-05  3.463e-05   1.826 0.068661 .  
## I(horsepower^2)    5.268e-04  1.384e-04   3.807 0.000164 ***
## I(weight^2)        1.047e-06  3.488e-07   3.002 0.002862 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.902 on 381 degrees of freedom
## Multiple R-squared:  0.8653, Adjusted R-squared:  0.8618 
## F-statistic: 244.8 on 10 and 381 DF,  p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm.trans)

It looks like displacement and horsepower both have significant quadratic terms. However, this transformation did not resolve the heteroskedasticity or the high-leverage observations. To address this, we will try transforming the response variable instead.
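
One way to choose a response transformation more systematically is the Box-Cox procedure from the MASS package (loaded at the top); a sketch:

# profile likelihood for a power transformation of the response;
# a peak near lambda = 0 would support the log transformation tried next
boxcox(lm.trans)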

lm.trans1 <- lm(log(mpg) ~ . + I(displacement^2) + I(horsepower^2) + I(weight^2), data = Auto)
summary(lm.trans1)
## 
## Call:
## lm(formula = log(mpg) ~ . + I(displacement^2) + I(horsepower^2) + 
##     I(weight^2), data = Auto)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41267 -0.06549  0.00241  0.06082  0.39522 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.132e+00  1.849e-01  11.533  < 2e-16 ***
## cylinders          4.602e-03  1.264e-02   0.364 0.716017    
## displacement      -2.046e-03  7.741e-04  -2.643 0.008548 ** 
## horsepower        -5.578e-03  1.581e-03  -3.528 0.000469 ***
## weight            -2.051e-04  9.996e-05  -2.052 0.040878 *  
## acceleration      -7.854e-03  3.877e-03  -2.026 0.043502 *  
## year               2.972e-02  1.742e-03  17.067  < 2e-16 ***
## origin             1.531e-02  1.030e-02   1.487 0.137849    
## I(displacement^2)  3.971e-06  1.337e-06   2.970 0.003165 ** 
## I(horsepower^2)    1.096e-05  5.342e-06   2.051 0.040916 *  
## I(weight^2)        4.430e-09  1.346e-08   0.329 0.742307    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.112 on 381 degrees of freedom
## Multiple R-squared:  0.8943, Adjusted R-squared:  0.8915 
## F-statistic: 322.3 on 10 and 381 DF,  p-value: < 2.2e-16
plot(lm.trans1)

This transformation looks like it resolved the heteroskedasticity issue, and resulted in a relatively high adjusted R^2 and low residual standard error. There is still some pattern left in the standardized residual plot, but I am not entirely sure how to get rid of it completely.

10)

# a) fit the given MLR model

lm.sales <- lm(Sales*1000 ~ Price + Urban + US, data = Carseats)
lm.sales
## 
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
## 
## Coefficients:
## (Intercept)        Price     UrbanYes        USYes  
##    13043.47       -54.46       -21.92      1200.57

b)

A $1 increase in the carseat price is associated with a decrease of about 54.46 units sold, holding all other predictors fixed. A store in an urban location is predicted to sell 21.92 fewer units than a store in a rural location, holding all other predictors fixed. A store located in the US is predicted to sell 1200.57 more units than a non-US store, holding all other predictors fixed.

c)

Sales = 13043.47 - 54.46 * Price - 21.92 * Urban + 1200.57 * US,
where Sales is measured in units sold, Urban = 1 if the store is in an urban location (0 otherwise), and US = 1 if the store is in the US (0 otherwise).
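
For example (the $120 price is an arbitrary illustration, not a value from the exercise), the predicted sales for an urban US store can be read off the equation or obtained with predict():

# illustrative prediction: urban US store charging a hypothetical price of $120
predict(lm.sales, newdata = data.frame(Price = 120, Urban = "Yes", US = "Yes"))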

# d)
summary(lm.sales)
## 
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6920.6 -1622.0   -56.4  1578.6  7058.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13043.469    651.012  20.036  < 2e-16 ***
## Price         -54.459      5.242 -10.389  < 2e-16 ***
## UrbanYes      -21.916    271.650  -0.081    0.936    
## USYes        1200.573    259.042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16

We can reject the null hypothesis for Price and US but not for Urban.

# e) drop Urban

lm.sales1 <- lm(Sales* 1000 ~ Price + US, data = Carseats)
lm.sales1
## 
## Call:
## lm(formula = Sales * 1000 ~ Price + US, data = Carseats)
## 
## Coefficients:
## (Intercept)        Price        USYes  
##    13030.79       -54.48      1199.64
# f) compare the models
summary(lm.sales)
## 
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6920.6 -1622.0   -56.4  1578.6  7058.1 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13043.469    651.012  20.036  < 2e-16 ***
## Price         -54.459      5.242 -10.389  < 2e-16 ***
## UrbanYes      -21.916    271.650  -0.081    0.936    
## USYes        1200.573    259.042   4.635 4.86e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2335 
## F-statistic: 41.52 on 3 and 396 DF,  p-value: < 2.2e-16
summary(lm.sales1)
## 
## Call:
## lm(formula = Sales * 1000 ~ Price + US, data = Carseats)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6926.9 -1628.6   -57.4  1576.6  7051.5 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 13030.79     630.98  20.652  < 2e-16 ***
## Price         -54.48       5.23 -10.416  < 2e-16 ***
## USYes        1199.64     258.46   4.641 4.71e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2469 on 397 degrees of freedom
## Multiple R-squared:  0.2393, Adjusted R-squared:  0.2354 
## F-statistic: 62.43 on 2 and 397 DF,  p-value: < 2.2e-16

Both models have a fairly small adjusted R^2 of approximately .23 and a large residual standard error of about 2470 units, so neither fits the data especially well. That being said, the model from e) is probably the better of the two: it has a slightly higher adjusted R^2 (.2354 vs. .2335) despite using one fewer predictor.
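
Because the model from e) is nested in the model from a), dropping Urban can also be assessed with an F-test, which gives essentially the same conclusion as the t-test on UrbanYes; a quick sketch:

# nested comparison: dropping Urban barely changes the residual sum of squares
anova(lm.sales1, lm.sales)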

# g) 95 percent CI
confint(lm.sales1)[-1,]
##           2.5 %     97.5 %
## Price -64.75984  -44.19543
## USYes 691.51957 1707.76632
# h) outliers or high leverage points
par(mfrow = c(2,2))
plot(lm.sales1)

#average leverage, (p + 1)/n:
(2+1)/nrow(Carseats)
## [1] 0.0075

There are some points with standardized residuals approaching 3, suggesting possible outliers. Additionally, many points have leverage far greater than .0075 (the average leverage, (p + 1)/n), indicating that there are high-leverage observations in the dataset.
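
These visual impressions can be checked numerically; a sketch using studentized residuals and hat values from lm.sales1:

# observations with large studentized residuals (potential outliers)
sum(abs(rstudent(lm.sales1)) > 2)
# largest leverage value, and how many points exceed twice the average of .0075
max(hatvalues(lm.sales1))
sum(hatvalues(lm.sales1) > 2 * (2 + 1) / nrow(Carseats))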

13)

set.seed(1)

# a) 100 observations from N(0,1)

x <- rnorm(100)

# b) 100 observations from N(0,.25)

eps <- rnorm(100, 0, sqrt(.25))

# c) 

y <- -1 + .5 * x + eps

length(y)
## [1] 100

The length of Y is 100. In this model, \(\beta_0\) = -1 and \(\beta_1\) = .5.

#d) scatterplot between x and y 

plot(x, y)

The relationship between x and y looks to be linear with a slope of approximately .5, with some random variation about the line.

# e) fit a linear model to predict y using x

lm <- lm(y~x)
summary(lm)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.93842 -0.30688 -0.06975  0.26970  1.17309 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.01885    0.04849 -21.010  < 2e-16 ***
## x            0.49947    0.05386   9.273 4.58e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared:  0.4674, Adjusted R-squared:  0.4619 
## F-statistic: 85.99 on 1 and 98 DF,  p-value: 4.583e-15

Both estimated coefficients are within a single standard error of the true values of -1 and .5, respectively.

# f) add regression line to scatter plot
plot(x,y)
abline(lm, col = "blue")
legend(-2, .25, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"),  cex = .8)

# g) compare to polynomial regression 

lm2 <- lm(y ~ poly(x,2))
anova(lm, lm2)
## Analysis of Variance Table
## 
## Model 1: y ~ x
## Model 2: y ~ poly(x, 2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     98 22.709                           
## 2     97 22.257  1   0.45163 1.9682 0.1638

I ran an F-test (via anova()) to check whether the quadratic term significantly improves on the first-order model. The p-value is greater than .05, so we fail to reject the null hypothesis: the quadratic term does not significantly improve the fit.

# h) repeat a-f using simulation with less noise
set.seed(1)

# a)

x1 <- rnorm(100)

# b)

eps1 <- rnorm(100, 0, sqrt(.0001))

# c) 

y1 <- -1 + .5 * x1 + eps1

length(y1)
## [1] 100
# values of B_0 and B_1 and the length of the vector y remain the same as before

# d)

plot(x1,y1)

The plot appears almost perfectly linear, with barely any variation around the line y = -1 + .5x.

# e) 
lm3 <- lm(y1~x1)
summary(lm3)
## 
## Call:
## lm(formula = y1 ~ x1)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.018768 -0.006138 -0.001395  0.005394  0.023462 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.0003769  0.0009699 -1031.5   <2e-16 ***
## x1           0.4999894  0.0010773   464.1   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009628 on 98 degrees of freedom
## Multiple R-squared:  0.9995, Adjusted R-squared:  0.9995 
## F-statistic: 2.154e+05 on 1 and 98 DF,  p-value: < 2.2e-16

The regression coefficient estimates are now much closer to the true values relative to the simulation with more noise.

# f) 
plot(x1,y1)
abline(lm3, col = "blue")
legend(-2, .25, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"),  cex = .8)

# g)
lm4 <- lm(y1 ~ poly(x1,2))
anova(lm3, lm4)
## Analysis of Variance Table
## 
## Model 1: y1 ~ x1
## Model 2: y1 ~ poly(x1, 2)
##   Res.Df       RSS Df  Sum of Sq      F Pr(>F)
## 1     98 0.0090836                            
## 2     97 0.0089029  1 0.00018065 1.9682 0.1638

Again, the addition of a quadratic term does not significantly improve the linear model.

# i) repeat a-f using simulation with more noise
set.seed(1)

# a)

x2 <- rnorm(100)

# b)

eps2 <- rnorm(100, 0, 10)

# c) 

y2 <- -1 + .5 * x2 + eps2

length(y2)
## [1] 100
# values of B_0 and B_1 and the length of the vector y remain the same as before

# d)

plot(x2,y2)

The plot looks like pure random scatter, with no discernible relationship between the two variables.

# e) 
lm5 <- lm(y2~x2)
summary(lm5)
## 
## Call:
## lm(formula = y2 ~ x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -18.768  -6.138  -1.395   5.394  23.462 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  -1.3769     0.9699  -1.420    0.159
## x2            0.4894     1.0773   0.454    0.651
## 
## Residual standard error: 9.628 on 98 degrees of freedom
## Multiple R-squared:  0.002102,   Adjusted R-squared:  -0.008081 
## F-statistic: 0.2064 on 1 and 98 DF,  p-value: 0.6506

The coefficient estimates happen to be fairly close to the true values of -1 and .5, but their standard errors are so large that neither is significantly different from zero. With this much noise in the generating process, the estimates are very imprecise.

# f) 
plot(x2,y2)
abline(lm5, col = "blue")
legend(-2, 20, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"),  cex = .8)

# g)
lm6 <- lm(y2 ~ poly(x2,2))
anova(lm5, lm6)
## Analysis of Variance Table
## 
## Model 1: y2 ~ x2
## Model 2: y2 ~ poly(x2, 2)
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     98 9083.6                           
## 2     97 8902.9  1    180.65 1.9682 0.1638

The addition of a quadratic term does not significantly improve the linear model.

# j)

# CI for original dataset
confint(lm)
##                  2.5 %     97.5 %
## (Intercept) -1.1150804 -0.9226122
## x            0.3925794  0.6063602
# CI for low noise dataset
confint(lm3)
##                  2.5 %     97.5 %
## (Intercept) -1.0023016 -0.9984522
## x1           0.4978516  0.5021272
# CI for high noise dataset
confint(lm5)
##                 2.5 %    97.5 %
## (Intercept) -3.301607 0.5477552
## x2          -1.648412 2.6272040

The noisy dataset had the widest confidence intervals, the low-noise dataset had the narrowest confidence intervals, and the original dataset was somewhere in the middle. All confidence intervals contained the true parameter values.
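
The interval widths can be compared directly; a quick sketch:

# width of each slope coefficient's 95% confidence interval
diff(confint(lm)[2, ])    # original noise level
diff(confint(lm3)[2, ])   # low noise
diff(confint(lm5)[2, ])   # high noise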

14)

set.seed(1)
x1 = runif(100)
x2 = .5*x1 + rnorm(100)/10
y = 2 + 2* x1 + .3 * x2 + rnorm(100)

The linear model has the following form:
\(Y = 2 + 2X_1 + .3X_2 + \epsilon\)
So the regression coefficients are 2 for the intercept, 2 for the slope of \(X_1\), and .3 for the slope of \(X_2\).

# b) correlation between X1 and X2
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)

There is a correlation of .8351212 between the two predictors.

# c)
lm14 <- lm(y ~ x1 + x2)
summary(lm14)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8311 -0.7273 -0.0537  0.6338  2.3359 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1305     0.2319   9.188 7.61e-15 ***
## x1            1.4396     0.7212   1.996   0.0487 *  
## x2            1.0097     1.1337   0.891   0.3754    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared:  0.2088, Adjusted R-squared:  0.1925 
## F-statistic:  12.8 on 2 and 97 DF,  p-value: 1.164e-05

The estimates for the intercept, slope of x1, and slope of x2 are the following: 2.1305, 1.4396, and 1.0097. So the intercept is close to its true value, but the estimated coefficients for x1 and x2 are relatively far from the true parameters. We would reject the null hypothesis that \(\beta_1 = 0\) and conclude that it is nonzero. However for \(\beta_2 = 0\), we would fail to reject.
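
The variance inflation factors from the car package (loaded at the top) quantify how much this collinearity inflates the coefficient variances; a sketch, which should give values noticeably above 1 given the correlation of about .84 between the predictors:

# VIF for each predictor; with two predictors, both equal 1 / (1 - r^2)
vif(lm14)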

# d)
lm14.1 <- lm(y ~ x1)
summary(lm14.1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.89495 -0.66874 -0.07785  0.59221  2.45560 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.1124     0.2307   9.155 8.27e-15 ***
## x1            1.9759     0.3963   4.986 2.66e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared:  0.2024, Adjusted R-squared:  0.1942 
## F-statistic: 24.86 on 1 and 98 DF,  p-value: 2.661e-06

In this case, the coefficient estimates are closer to their true values, and we would reject the null hypothesis and conclude that the coefficient for \(X_1\) is significantly different from zero.

# e)
lm14.2 <- lm(y ~ x2)
summary(lm14.2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.62687 -0.75156 -0.03598  0.72383  2.44890 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3899     0.1949   12.26  < 2e-16 ***
## x2            2.8996     0.6330    4.58 1.37e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared:  0.1763, Adjusted R-squared:  0.1679 
## F-statistic: 20.98 on 1 and 98 DF,  p-value: 1.366e-05

In this case, the estimated coefficient for \(X_2\) is now even further from its true value of .3. We would still reject the null hypothesis and conclude this coefficient is non-zero.

f)

These results do not contradict each other. \(X_2\) was generated using \(X_1\), so the two predictors are highly collinear. In the multiple regression in c), this collinearity makes it difficult to separate the individual effects of \(X_1\) and \(X_2\) and greatly inflates the standard errors of their coefficients, which is why the estimates are imprecise and \(X_2\) appears insignificant. The estimate for \(\beta_1\) in part d) was close to its true value because the simple regression of \(Y\) on \(X_1\) alone does not suffer from this problem. The inflated \(X_2\) coefficient in e) arises because, with \(X_1\) left out, \(X_2\) acts largely as a proxy for \(X_1\) and its coefficient absorbs part of \(X_1\)'s effect.

# g) 
x1 = c(x1, .1)
x2 = c(x2, .8)
y = c(y, 6)
par(mfrow = c(2,2))
lm14 <- lm(y ~ x1 + x2)
summary(lm14)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.73348 -0.69318 -0.05263  0.66385  2.30619 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2267     0.2314   9.624 7.91e-16 ***
## x1            0.5394     0.5922   0.911  0.36458    
## x2            2.5146     0.8977   2.801  0.00614 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared:  0.2188, Adjusted R-squared:  0.2029 
## F-statistic: 13.72 on 2 and 98 DF,  p-value: 5.564e-06
par(mfrow = c(2,2))
plot(lm14)

The new observation looks to be both an outlier and a high-leverage point, based on its position in the top-right corner of the residuals-vs-leverage plot. Adding this point increased the estimate of the intercept, decreased the slope estimate for \(X_1\), and increased the slope estimate for \(X_2\); it also flipped which predictor appears significant.

lm14.1 <- lm(y ~ x1)
summary(lm14.1)
## 
## Call:
## lm(formula = y ~ x1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.8897 -0.6556 -0.0909  0.5682  3.5665 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.2569     0.2390   9.445 1.78e-15 ***
## x1            1.7657     0.4124   4.282 4.29e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared:  0.1562, Adjusted R-squared:  0.1477 
## F-statistic: 18.33 on 1 and 99 DF,  p-value: 4.295e-05
par(mfrow = c(2,2))
plot(lm14.1)

The new point is clearly an outlier here (its residual of about 3.57 is more than three residual standard errors), though its leverage is not especially high since x1 = .1 lies within the range of the other x1 values. The addition of the new observation resulted in a higher estimate for the intercept and a lower estimate for the slope.

lm14.2 <- lm(y ~ x2)
summary(lm14.2)
## 
## Call:
## lm(formula = y ~ x2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.64729 -0.71021 -0.06899  0.72699  2.38074 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.3451     0.1912  12.264  < 2e-16 ***
## x2            3.1190     0.6040   5.164 1.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared:  0.2122, Adjusted R-squared:  0.2042 
## F-statistic: 26.66 on 1 and 99 DF,  p-value: 1.253e-06
par(mfrow = c(2,2))
plot(lm14.2)

In this case, the new observation is a high leverage point, but not an outlier since the standardized residual is less than 2. The addition of the new observation resulted in a decrease in the estimate for the intercept but increase in the slope estimate.

4)

set.seed(8675309)

#create training data
df.train <- NULL
for(i in 1:25){
        df.train <- cbind(df.train, rnorm(25))
}
df.train <- as.data.frame(df.train)

#change one variable to y
sample <- sample(1:25, 1)
colnames(df.train)[sample] <- "y"
df.train <- df.train[, c(sample, (1:25)[-sample])]
for(i in 2:25){
        colnames(df.train)[i] <- paste("x", (i-1), sep= "")
}

head(df.train)
##            y         x1         x2          x3          x4           x5
## 1  0.6832342 -0.9965824  0.1348980  1.46878477  0.74453768  0.399879213
## 2  0.6902011  0.7218241  0.1175821 -0.93946081 -1.34662801 -0.004830717
## 3  0.5334919 -0.6172088 -0.8255891 -0.85086347  0.33014251  1.204846218
## 4 -0.1861048  2.0293916 -2.1352363  1.66093372 -0.01272533  0.595980740
## 5  0.3829458  1.0654161  0.2142085 -1.35752606 -0.46367596 -0.495136183
## 6  0.3761842  0.9872197 -0.9470681  0.08600488  0.20494209  0.657976320
##           x6         x7          x8          x9        x10        x11
## 1 -0.3191162  0.4096266 -0.08836476 -0.01375654 -0.3327243 -0.4851972
## 2  0.9146375 -0.5455656  0.08114036 -2.72281902 -0.1736846  0.5176822
## 3  1.4227406  0.4298306  1.38814525  0.41191338 -0.1874381 -0.3547614
## 4 -1.0991012 -0.9108425 -0.02341062  1.23162468  1.1563222 -0.1924374
## 5 -0.6047125 -1.0194112  0.98759269  1.59030643 -0.9381941  1.9105449
## 6 -1.2275653  1.9146288 -0.14469635 -0.51023230  0.5240523  1.8119469
##           x12        x13        x14        x15        x16        x17        x18
## 1  0.26497555 -2.4051153  1.0259678  0.1643308  0.1492779 -0.9964029  1.0243020
## 2  1.03616158  0.7133681 -0.3924602 -0.8235687 -0.1879838 -0.3161343 -0.5163557
## 3 -0.52738827 -0.3324065  1.2267929 -0.8157660  0.8315598  0.8109417 -1.3154965
## 4  0.02581511  0.9175493  0.5496249  0.2396696 -1.0644169 -0.4154368 -0.3939363
## 5  0.22154581  0.7396449  0.4016090 -3.2374386  1.3997263  0.2577046  0.1096451
## 6  1.00161177  0.4143795 -0.4686555  0.4954646  0.4119513 -0.4418233 -0.8552211
##          x19        x20        x21        x22        x23        x24
## 1  0.2892246 -0.3912259  0.7204351  1.5399328  1.7661774 -0.9528813
## 2 -0.0510803  2.2127224  2.2385646  0.4266710 -2.2309977  0.7786959
## 3  3.4127944 -1.5625910  0.2308601 -1.0262685  0.3106110 -0.1535704
## 4  0.1252550 -0.7702976  0.2661300 -0.6903070 -1.2352997 -0.2035335
## 5 -0.3436253 -0.5547058  0.3057487 -0.9539138 -0.2390462  0.1131789
## 6 -0.7758054  1.4105020 -2.7342631 -1.7556159  0.9363910 -0.5664170
#create testing data
df.test <- NULL
for(i in 1:25){
        df.test <- cbind(df.test, rnorm(25))
}
df.test <- as.data.frame(df.test)

#change one variable to y
sample <- sample(1:25, 1)
colnames(df.test)[sample] <- "y"
df.test <- df.test[, c(sample, (1:25)[-sample])]
for(i in 2:25){
        colnames(df.test)[i] <- paste("x", (i-1), sep= "")
}
head(df.test)
##            y         x1         x2          x3          x4          x5
## 1  0.4158043 -1.1250766 -0.5196149 -1.09864136 -0.02410888  1.18134481
## 2 -0.6970499  0.7090751 -2.2255202 -0.50459779 -0.49378629 -0.02132088
## 3  2.0642083 -0.9915848 -0.1433703 -0.43961467  1.60450170  0.36315224
## 4  0.3897854  0.5363899 -0.9774714  0.16267950 -0.22497240 -1.59881043
## 5  0.5219511 -1.7949124 -0.5512871 -1.38977658 -0.97209899  1.17390164
## 6  1.5108279 -0.2227637  1.2820031 -0.06122145 -1.25670096  0.36913315
##            x6         x7         x8           x9         x10        x11
## 1  0.44148486 -0.9319189 -1.0534274 -0.655567996  0.38911231  0.5585176
## 2  0.19009395 -0.4034207 -0.7440766 -0.692025610 -1.45191844  0.8926430
## 3  0.34811270 -0.5812124  1.2361566 -1.418728317  1.69421915  1.2009851
## 4  0.09678569  2.1347891  1.4137998  0.006771314 -0.23393196  0.8426864
## 5 -1.03069405  0.3695359 -0.3247897  1.775508007  0.04801099  1.2780781
## 6  0.98563352 -0.1816729  0.2511341 -2.070801353  0.12236614 -0.8324172
##          x12         x13         x14        x15        x16        x17
## 1 -1.1318773 -2.49928123 -0.31827564 -1.2631925 -0.6050051  1.1799015
## 2 -0.9765474  0.02553818 -0.98585201 -1.1700743  2.0258127 -0.9506309
## 3 -0.4443967  0.01264405 -0.06336071  0.7013156  1.2641116  0.1975220
## 4  0.6063389 -1.23822614  0.24240613  1.6318222 -0.5386447  0.5255518
## 5 -1.2944295 -0.32226495  0.36891237 -0.2343429  0.2844003 -0.4800159
## 6  0.7768829 -0.87910969  2.23755904  0.6487112  0.5253629  0.7514034
##          x18        x19        x20        x21         x22        x23
## 1  0.6325128 -0.8978570  1.0887539  0.6481990 -1.16044672  0.6224226
## 2  0.6125887 -1.9269922 -1.2502255 -0.1901002 -0.03489018 -1.7872473
## 3 -0.9325299 -0.2030309  2.1158338 -0.5559143  2.37474595  0.9057682
## 4  0.1415578  0.1293742 -0.1612847 -0.6333391  1.23450040 -0.4913474
## 5  0.8749486  1.1216944  1.3456696  2.4586224  1.67528224  2.5594178
## 6  1.7572323  1.1620434 -0.8794192  0.1829208  0.26314079 -0.1249909
##           x24
## 1 -1.32802361
## 2 -0.23227467
## 3 -0.07170024
## 4 -1.44581787
## 5 -0.38713272
## 6  1.60126771
# fit models with an increasing number of predictors, recording training and test MSE
MSE.train <- c()
MSE.test <- c()
for(i in 2:25){
        fit <- lm(y ~ . , data = df.train[, 1:i])
        MSE.train <- c(MSE.train, mean(fit$residuals^2))
        
        pred <- predict.lm(fit, df.test)
        MSE.test <- c(MSE.test, mean(
                (pred - df.test$y)^2
        ))

}
MSE.train
##  [1] 0.47727541 0.47234668 0.46873958 0.42767072 0.41126150 0.38775351
##  [7] 0.38727546 0.38509644 0.36567243 0.33879570 0.33103017 0.28378094
## [13] 0.26272630 0.24194714 0.21127788 0.15999128 0.15998526 0.15672988
## [19] 0.15237197 0.14660067 0.14659715 0.11529060 0.03174804 0.00000000
MSE.test
##  [1]     1.024503     1.096721     1.051816     1.247501     1.347985
##  [6]     1.325017     1.327829     1.358484     1.358356     1.956702
## [11]     1.974135     2.032266     2.940323     2.839861     2.240212
## [16]     5.020365     5.058449     4.473183     4.052842     3.206986
## [21]     3.252859     5.327046    26.040003 11403.136648
plot(1:24, MSE.test, type = "l", main = NULL, xlab = "# of predictors", ylab = "MSE", ylim = c(0, 5), col = "blue", lwd = 2)
lines(1:24, MSE.train, type = "l", lwd = 2)
legend(2.5, 5, col = c("blue", "black"), legend = c("MSE.test", "MSE.train"), lty = 1, lwd = 2)

As more predictors are added to the model, the training error decreases monotonically, reaching exactly zero with 24 predictors (at that point the model has 25 parameters for 25 observations, so it interpolates the training data). The test error fluctuates, but the overall trend is that adding more predictors increases it, and it explodes as the model approaches saturation.
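
The best model by test error can be read off directly; a quick check on the plot:

# the test MSE is minimized by the smallest model, which makes sense because
# y was generated independently of every predictor
which.min(MSE.test)   # number of predictors at the minimum
min(MSE.test)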