library(MASS)
library(ISLR)
library(car)
Plugging in the gender indicator (female = 1, male = 0), the model for males becomes:
salary = 50 + 20 * gpa + .07 * iq + .01 * (gpa * iq)
and the model for females is the following:
salary = 85 + 10 * gpa + .07 * iq + .01 * (gpa * iq)
So females have an intercept of 85 while males have an intercept of 50, but the GPA coefficient is 20 for males and only 10 for females. The difference in average earnings between males and females therefore depends on GPA: for a fixed value of IQ and GPA, males earn more on average than females provided that GPA is high enough (here, GPA > 3.5, as the table below confirms).
# checking answer
gpa = seq(1,4, by = .5)
iq = 90
female = 1
salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
f <- salary
female = 0
salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
m <- salary
cbind(gpa, f, m)
## gpa f m
## [1,] 1.0 102.20 77.20
## [2,] 1.5 107.65 87.65
## [3,] 2.0 113.10 98.10
## [4,] 2.5 118.55 108.55
## [5,] 3.0 124.00 119.00
## [6,] 3.5 129.45 129.45
## [7,] 4.0 134.90 139.90
# predict salary of female with IQ of 110 and GPA of 4.0
gpa = 4
iq = 110
female = 1
salary = 50 + 20 * gpa + .07 * iq + 35 * female + .01 * (gpa * iq) + -10 * (gpa * female)
salary
## [1] 137.1
Prediction: $137,100
It could be the case that the p-value for the interaction coefficient is < .05, indicating that the coefficient is statistically significantly different from zero. So there may still be an interaction effect, just one with a relatively small coefficient. We cannot say either way until we see the p-value or confidence interval associated with the interaction term.
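A small simulated sketch (not part of the problem; the names a, b, and z are purely illustrative) showing that an interaction coefficient can be small yet clearly significant when it is estimated precisely:
# simulate a large sample where the true interaction coefficient is only .05
set.seed(2)
a <- rnorm(1000); b <- rnorm(1000)
z <- 1 + a + b + .05 * a * b + rnorm(1000, sd = .1)
# the interaction row of the coefficient table: tiny estimate, tiny p-value
summary(lm(z ~ a * b))$coefficients["a:b", ]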
We would expect the training RSS for the cubic regression to be lower than that of the linear regression. Even though the true relationship is linear, the extra flexibility of the cubic fit lets it chase the noise in the training data ("overfit"), which can only lower the training RSS.
The test RSS would likely be lower for the linear model: because the true relationship is linear, the cubic model's extra variance from overfitting means its test residuals will tend to be larger than the linear model's.
The training RSS will still be lower for the cubic model since adding more flexibility will lower the training RSS.
In this case, we might expect the cubic model to perform better on the test data than it did in part b) since the true relationship is now non-linear. However, there is not enough information to tell what the test RSS will be for the linear model compared to the cubic model. For example, the true relationship could be quadratic, and the cubic model could still overfit the data resulting in a greater test RSS compared to the linear model. If the true relationship is of an order greater than 2, the test RSS will be lower for the cubic model relative to the linear model. So again, there is not enough information to tell.
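A quick simulated illustration of the first two parts (not part of the exercise; x.sim, y.sim, x.new, and y.new are illustrative names): when the truth is linear, the cubic fit always has the lower training RSS but typically the higher test RSS.
# simulate a linear truth, plus an independent test set
set.seed(123)
x.sim <- rnorm(100); y.sim <- 1 + 2 * x.sim + rnorm(100)
x.new <- rnorm(100); y.new <- 1 + 2 * x.new + rnorm(100)
fit.lin <- lm(y.sim ~ x.sim)
fit.cub <- lm(y.sim ~ poly(x.sim, 3))
# training RSS: the cubic fit can never do worse than the nested linear fit
c(train.lin = sum(resid(fit.lin)^2), train.cub = sum(resid(fit.cub)^2))
# test RSS: the cubic fit usually pays for its extra variance
c(test.lin = sum((y.new - predict(fit.lin, data.frame(x.sim = x.new)))^2),
  test.cub = sum((y.new - predict(fit.cub, data.frame(x.sim = x.new)))^2))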
# a) scatterplot matrix of all variables in the dataset
pairs(Auto)
# b) matrix of correlations between all variables
cor(Auto[,colnames(Auto) != "name"])
## mpg cylinders displacement horsepower weight
## mpg 1.0000000 -0.7776175 -0.8051269 -0.7784268 -0.8322442
## cylinders -0.7776175 1.0000000 0.9508233 0.8429834 0.8975273
## displacement -0.8051269 0.9508233 1.0000000 0.8972570 0.9329944
## horsepower -0.7784268 0.8429834 0.8972570 1.0000000 0.8645377
## weight -0.8322442 0.8975273 0.9329944 0.8645377 1.0000000
## acceleration 0.4233285 -0.5046834 -0.5438005 -0.6891955 -0.4168392
## year 0.5805410 -0.3456474 -0.3698552 -0.4163615 -0.3091199
## origin 0.5652088 -0.5689316 -0.6145351 -0.4551715 -0.5850054
## acceleration year origin
## mpg 0.4233285 0.5805410 0.5652088
## cylinders -0.5046834 -0.3456474 -0.5689316
## displacement -0.5438005 -0.3698552 -0.6145351
## horsepower -0.6891955 -0.4163615 -0.4551715
## weight -0.4168392 -0.3091199 -0.5850054
## acceleration 1.0000000 0.2903161 0.2127458
## year 0.2903161 1.0000000 0.1815277
## origin 0.2127458 0.1815277 1.0000000
# c) MLR with mpg as response and all else as predictors
lm <- lm(mpg ~. - name, Auto)
summary(lm)
##
## Call:
## lm(formula = mpg ~ . - name, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.5903 -2.1565 -0.1169 1.8690 13.0604
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -17.218435 4.644294 -3.707 0.00024 ***
## cylinders -0.493376 0.323282 -1.526 0.12780
## displacement 0.019896 0.007515 2.647 0.00844 **
## horsepower -0.016951 0.013787 -1.230 0.21963
## weight -0.006474 0.000652 -9.929 < 2e-16 ***
## acceleration 0.080576 0.098845 0.815 0.41548
## year 0.750773 0.050973 14.729 < 2e-16 ***
## origin 1.426141 0.278136 5.127 4.67e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.328 on 384 degrees of freedom
## Multiple R-squared: 0.8215, Adjusted R-squared: 0.8182
## F-statistic: 252.4 on 7 and 384 DF, p-value: < 2.2e-16
The F-statistic is large (252.4) with a p-value of essentially zero, so there does appear to be a relationship between the predictors and the response. The following predictors have statistically significant relationships with the response: displacement, weight, year, and origin. The year coefficient suggests that a one-year increase in model year is associated with an increase of about 0.75 mpg, holding all the other predictors constant.
# d) diagnostic plots
par(mfrow = c(2,2))
plot(lm)
The residual plots suggest some outliers at higher predicted values of mpg; the model appears to underestimate some of the points with high mpg. Points like 327 and 394 have large standardized residuals but low leverage. There also looks to be heteroskedasticity: as the predicted value of mpg increases, so does the variation in the residuals. Additionally, the leverage plot identifies point 14 as a high-leverage point.
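As an optional numeric complement to these plots (a sketch; output not shown), the standardized residuals and leverage values can be pulled out directly:
# five largest |standardized residuals| -- candidate outliers
sort(abs(rstandard(lm)), decreasing = TRUE)[1:5]
# five largest leverage values -- candidate high-leverage points
sort(hatvalues(lm), decreasing = TRUE)[1:5]
# average leverage (p + 1)/n for comparison
(7 + 1) / nrow(Auto)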
# e) fit models with interaction effects
Auto <- Auto[,colnames(Auto) != "name"]
#all possible interactions
lm.inter <- lm(mpg ~ . * . , data = Auto)
summary(lm.inter)
##
## Call:
## lm(formula = mpg ~ . * ., data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6303 -1.4481 0.0596 1.2739 11.1386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.548e+01 5.314e+01 0.668 0.50475
## cylinders 6.989e+00 8.248e+00 0.847 0.39738
## displacement -4.785e-01 1.894e-01 -2.527 0.01192 *
## horsepower 5.034e-01 3.470e-01 1.451 0.14769
## weight 4.133e-03 1.759e-02 0.235 0.81442
## acceleration -5.859e+00 2.174e+00 -2.696 0.00735 **
## year 6.974e-01 6.097e-01 1.144 0.25340
## origin -2.090e+01 7.097e+00 -2.944 0.00345 **
## cylinders:displacement -3.383e-03 6.455e-03 -0.524 0.60051
## cylinders:horsepower 1.161e-02 2.420e-02 0.480 0.63157
## cylinders:weight 3.575e-04 8.955e-04 0.399 0.69000
## cylinders:acceleration 2.779e-01 1.664e-01 1.670 0.09584 .
## cylinders:year -1.741e-01 9.714e-02 -1.793 0.07389 .
## cylinders:origin 4.022e-01 4.926e-01 0.816 0.41482
## displacement:horsepower -8.491e-05 2.885e-04 -0.294 0.76867
## displacement:weight 2.472e-05 1.470e-05 1.682 0.09342 .
## displacement:acceleration -3.479e-03 3.342e-03 -1.041 0.29853
## displacement:year 5.934e-03 2.391e-03 2.482 0.01352 *
## displacement:origin 2.398e-02 1.947e-02 1.232 0.21875
## horsepower:weight -1.968e-05 2.924e-05 -0.673 0.50124
## horsepower:acceleration -7.213e-03 3.719e-03 -1.939 0.05325 .
## horsepower:year -5.838e-03 3.938e-03 -1.482 0.13916
## horsepower:origin 2.233e-03 2.930e-02 0.076 0.93931
## weight:acceleration 2.346e-04 2.289e-04 1.025 0.30596
## weight:year -2.245e-04 2.127e-04 -1.056 0.29182
## weight:origin -5.789e-04 1.591e-03 -0.364 0.71623
## acceleration:year 5.562e-02 2.558e-02 2.174 0.03033 *
## acceleration:origin 4.583e-01 1.567e-01 2.926 0.00365 **
## year:origin 1.393e-01 7.399e-02 1.882 0.06062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.695 on 363 degrees of freedom
## Multiple R-squared: 0.8893, Adjusted R-squared: 0.8808
## F-statistic: 104.2 on 28 and 363 DF, p-value: < 2.2e-16
#only significant interactions
lm.inter1 <- lm(mpg ~ . + acceleration*origin + acceleration*year + displacement * year, data = Auto)
summary(lm.inter1)
##
## Call:
## lm(formula = mpg ~ . + acceleration * origin + acceleration *
## year + displacement * year, data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.7659 -1.8996 0.0241 1.4837 12.2739
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.8757343 27.4715000 0.432 0.6658
## cylinders -0.1269380 0.3026441 -0.419 0.6751
## displacement 0.1708125 0.0430357 3.969 8.62e-05 ***
## horsepower -0.0370510 0.0126847 -2.921 0.0037 **
## weight -0.0048741 0.0006195 -7.868 3.76e-14 ***
## acceleration -3.6459893 1.4303639 -2.549 0.0112 *
## year 0.5476463 0.3561807 1.538 0.1250
## origin -7.0940248 1.5825945 -4.483 9.77e-06 ***
## acceleration:origin 0.5078440 0.0957537 5.304 1.93e-07 ***
## acceleration:year 0.0374758 0.0186441 2.010 0.0451 *
## displacement:year -0.0022809 0.0005711 -3.994 7.81e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.012 on 381 degrees of freedom
## Multiple R-squared: 0.8548, Adjusted R-squared: 0.851
## F-statistic: 224.4 on 10 and 381 DF, p-value: < 2.2e-16
Three interactions appear to be significant: acceleration and origin, acceleration and year, and displacement and year.
# f) transform the variables
# I used the scatterplot matrix to choose which variables to transform
lm.trans <- lm(mpg ~ . + I(displacement^2) + I(horsepower^2) + I(weight^2), data = Auto)
summary(lm.trans)
##
## Call:
## lm(formula = mpg ~ . + I(displacement^2) + I(horsepower^2) +
## I(weight^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.2232 -1.5534 -0.0931 1.4304 11.9162
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.509e+00 4.789e+00 0.733 0.464247
## cylinders 4.113e-01 3.275e-01 1.256 0.209886
## displacement -3.513e-02 2.005e-02 -1.752 0.080556 .
## horsepower -1.915e-01 4.096e-02 -4.675 4.09e-06 ***
## weight -1.067e-02 2.590e-03 -4.122 4.62e-05 ***
## acceleration -1.735e-01 1.004e-01 -1.728 0.084870 .
## year 7.692e-01 4.512e-02 17.048 < 2e-16 ***
## origin 5.788e-01 2.668e-01 2.170 0.030643 *
## I(displacement^2) 6.324e-05 3.463e-05 1.826 0.068661 .
## I(horsepower^2) 5.268e-04 1.384e-04 3.807 0.000164 ***
## I(weight^2) 1.047e-06 3.488e-07 3.002 0.002862 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.902 on 381 degrees of freedom
## Multiple R-squared: 0.8653, Adjusted R-squared: 0.8618
## F-statistic: 244.8 on 10 and 381 DF, p-value: < 2.2e-16
par(mfrow = c(2,2))
plot(lm.trans)
It looks like displacement and horsepower both have significant quadratic terms. However, this transformation did not resolve the heteroskedasticity or the high-leverage observations. To address this, we will try transforming the response variable instead.
lm.trans1 <- lm(log(mpg) ~ . + I(displacement^2) + I(horsepower^2) + I(weight^2), data = Auto)
summary(lm.trans1)
##
## Call:
## lm(formula = log(mpg) ~ . + I(displacement^2) + I(horsepower^2) +
## I(weight^2), data = Auto)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41267 -0.06549 0.00241 0.06082 0.39522
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.132e+00 1.849e-01 11.533 < 2e-16 ***
## cylinders 4.602e-03 1.264e-02 0.364 0.716017
## displacement -2.046e-03 7.741e-04 -2.643 0.008548 **
## horsepower -5.578e-03 1.581e-03 -3.528 0.000469 ***
## weight -2.051e-04 9.996e-05 -2.052 0.040878 *
## acceleration -7.854e-03 3.877e-03 -2.026 0.043502 *
## year 2.972e-02 1.742e-03 17.067 < 2e-16 ***
## origin 1.531e-02 1.030e-02 1.487 0.137849
## I(displacement^2) 3.971e-06 1.337e-06 2.970 0.003165 **
## I(horsepower^2) 1.096e-05 5.342e-06 2.051 0.040916 *
## I(weight^2) 4.430e-09 1.346e-08 0.329 0.742307
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.112 on 381 degrees of freedom
## Multiple R-squared: 0.8943, Adjusted R-squared: 0.8915
## F-statistic: 322.3 on 10 and 381 DF, p-value: < 2.2e-16
plot(lm.trans1)
This transformation looks like it resolved the heteroskedasticity issue, and resulted in a relatively high adjusted R^2 and low residual standard error. There is still some pattern left in the standardized residual plot, but I am not entirely sure how to get rid of it completely.
### 10)
# a) fit the given MLR model
lm.sales <- lm(Sales*1000 ~ Price + Urban + US, data = Carseats)
lm.sales
##
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
##
## Coefficients:
## (Intercept) Price UrbanYes USYes
## 13043.47 -54.46 -21.92 1200.57
A $1 increase in carseat price is associated with a decrease of about 54.46 units sold, holding the other predictors fixed. A store in an urban location is predicted to sell 21.92 fewer units than a rural store, holding the other predictors fixed, and a store located in the US is predicted to sell 1200.57 more units than a non-US store, holding the other predictors fixed.
Sales = 13043.47 - 54.46 * Price - 21.92 * UrbanYes + 1200.57 * USYes,
where UrbanYes = 1 if the store is in an urban location (0 otherwise) and USYes = 1 if the store is in the US (0 otherwise).
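As a worked check of this equation (approximate, since it uses the rounded coefficients above): a hypothetical urban US store charging a price of $120 (an arbitrary value chosen for illustration) would be predicted to sell roughly 7,687 units.
13043.47 - 54.46 * 120 - 21.92 * 1 + 1200.57 * 1   # about 7686.9 units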
# d)
summary(lm.sales)
##
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6920.6 -1622.0 -56.4 1578.6 7058.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13043.469 651.012 20.036 < 2e-16 ***
## Price -54.459 5.242 -10.389 < 2e-16 ***
## UrbanYes -21.916 271.650 -0.081 0.936
## USYes 1200.573 259.042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
We can reject the null hypothesis for Price and US but not for Urban.
# e) drop Urban
lm.sales1 <- lm(Sales* 1000 ~ Price + US, data = Carseats)
lm.sales1
##
## Call:
## lm(formula = Sales * 1000 ~ Price + US, data = Carseats)
##
## Coefficients:
## (Intercept) Price USYes
## 13030.79 -54.48 1199.64
# f) compare the models
summary(lm.sales)
##
## Call:
## lm(formula = Sales * 1000 ~ Price + Urban + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6920.6 -1622.0 -56.4 1578.6 7058.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13043.469 651.012 20.036 < 2e-16 ***
## Price -54.459 5.242 -10.389 < 2e-16 ***
## UrbanYes -21.916 271.650 -0.081 0.936
## USYes 1200.573 259.042 4.635 4.86e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2472 on 396 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2335
## F-statistic: 41.52 on 3 and 396 DF, p-value: < 2.2e-16
summary(lm.sales1)
##
## Call:
## lm(formula = Sales * 1000 ~ Price + US, data = Carseats)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6926.9 -1628.6 -57.4 1576.6 7051.5
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13030.79 630.98 20.652 < 2e-16 ***
## Price -54.48 5.23 -10.416 < 2e-16 ***
## USYes 1199.64 258.46 4.641 4.71e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2469 on 397 degrees of freedom
## Multiple R-squared: 0.2393, Adjusted R-squared: 0.2354
## F-statistic: 62.43 on 2 and 397 DF, p-value: < 2.2e-16
Both models have a fairly small adjusted R^2 of approximately .23 and a large residual standard error of about 2,470 units. That said, the model from e) is probably the better of the two: it has a slightly higher adjusted R^2 and drops the clearly insignificant Urban term without losing any fit (the multiple R^2 is unchanged at .2393).
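For completeness (not part of the original output), the two nested models could also be compared with a partial F-test:
anova(lm.sales1, lm.sales)   # tests whether Urban adds anything beyond Price and US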
# g) 95 percent CI
confint(lm.sales1)[-1,]
## 2.5 % 97.5 %
## Price -64.75984 -44.19543
## USYes 691.51957 1707.76632
# h) outliers or high leverage points
par(mfrow = c(2,2))
plot(lm.sales1)
#high leverage point formula:
(2+1)/nrow(Carseats)
## [1] 0.0075
There are a few points with standardized residuals approaching 3, suggesting possible outliers. Additionally, many points have leverage far greater than .0075 (the average leverage statistic, (p + 1)/n), indicating that the dataset contains high-leverage observations.
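A numeric version of this check (a sketch; output not shown):
sum(abs(rstandard(lm.sales1)) > 2)                     # observations with |standardized residual| > 2
sum(hatvalues(lm.sales1) > (2 + 1) / nrow(Carseats))   # observations with above-average leverage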
set.seed(1)
# a) 100 observations from N(0,1)
x <- rnorm(100)
# b) 100 observations from N(0,.25)
eps <- rnorm(100, 0, sqrt(.25))
# c)
y <- -1 + .5 * x + eps
length(y)
## [1] 100
The length of Y is 100. In this model, \(\beta_0\) = -1 and \(\beta_1\) = .5.
#d) scatterplot between x and y
plot(x, y)
The relationship between x and y looks to be linear with a slope of approximately .5, with some random variation about the line.
# e) fit a linear model to predict y using x
lm <- lm(y~x)
summary(lm)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.93842 -0.30688 -0.06975 0.26970 1.17309
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.01885 0.04849 -21.010 < 2e-16 ***
## x 0.49947 0.05386 9.273 4.58e-15 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4814 on 98 degrees of freedom
## Multiple R-squared: 0.4674, Adjusted R-squared: 0.4619
## F-statistic: 85.99 on 1 and 98 DF, p-value: 4.583e-15
Both estimated coefficients are within a single standard error of the true values of -1 and .5, respectively.
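A quick check of that claim (a sketch; output not shown), expressing each estimate's distance from its true value in standard-error units:
(coef(lm) - c(-1, .5)) / summary(lm)$coefficients[, "Std. Error"]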
# f) add regression line to scatter plot
plot(x,y)
abline(lm, col = "blue")
legend(-2, .25, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"), cex = .8)
# g) compare to polynomial regression
lm2 <- lm(y ~ poly(x,2))
anova(lm, lm2)
## Analysis of Variance Table
##
## Model 1: y ~ x
## Model 2: y ~ poly(x, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 22.709
## 2 97 22.257 1 0.45163 1.9682 0.1638
I ran an ANOVA (partial F-test) to check whether the quadratic model significantly improves on the first-order model. The p-value is > .05, so we fail to reject the null hypothesis; there is no evidence that the quadratic term improves the linear fit.
# h) repeat a-f using simulation with less noise
set.seed(1)
# a)
x1 <- rnorm(100)
# b)
eps1 <- rnorm(100, 0, sqrt(.0001))
# c)
y1 <- -1 + .5 * x1 + eps1
length(y1)
## [1] 100
# values of B_0 and B_1 and length of vector y remains the same as before
# d)
plot(x1,y1)
The plot appears to be completely linear, with barely any variation from the line y = -1 + .5X.
# e)
lm3 <- lm(y1~x1)
summary(lm3)
##
## Call:
## lm(formula = y1 ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.018768 -0.006138 -0.001395 0.005394 0.023462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.0003769 0.0009699 -1031.5 <2e-16 ***
## x1 0.4999894 0.0010773 464.1 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009628 on 98 degrees of freedom
## Multiple R-squared: 0.9995, Adjusted R-squared: 0.9995
## F-statistic: 2.154e+05 on 1 and 98 DF, p-value: < 2.2e-16
The regression coefficient estimates are now much closer to the true values relative to the simulation with more noise.
# f)
plot(x1,y1)
abline(lm3, col = "blue")
legend(-2, .25, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"), cex = .8)
# g)
lm4 <- lm(y1 ~ poly(x1,2))
anova(lm3, lm4)
## Analysis of Variance Table
##
## Model 1: y1 ~ x1
## Model 2: y1 ~ poly(x1, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 0.0090836
## 2 97 0.0089029 1 0.00018065 1.9682 0.1638
Again, the addition of a quadratic term does not significantly improve the linear model.
# i) repeat a-f using simulation with more noise
set.seed(1)
# a)
x2 <- rnorm(100)
# b)
eps2 <- rnorm(100, 0, 10)
# c)
y2 <- -1 + .5 * x2 + eps2
length(y2)
## [1] 100
# values of B_0 and B_1 and length of vector y remains the same as before
# d)
plot(x2,y2)
The plot looks like random scatter, with no discernible relationship between the variables.
# e)
lm5 <- lm(y2~x2)
summary(lm5)
##
## Call:
## lm(formula = y2 ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.768 -6.138 -1.395 5.394 23.462
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3769 0.9699 -1.420 0.159
## x2 0.4894 1.0773 0.454 0.651
##
## Residual standard error: 9.628 on 98 degrees of freedom
## Multiple R-squared: 0.002102, Adjusted R-squared: -0.008081
## F-statistic: 0.2064 on 1 and 98 DF, p-value: 0.6506
Because of the large amount of noise in the generating process, the standard errors are large and neither coefficient estimate is significantly different from zero, so the fit tells us little about the true coefficient values even though the point estimates are not far from them.
# f)
plot(x2,y2)
abline(lm5, col = "blue")
legend(-2, 20, legend = c("Simulated Data", "Regression Line"), pch = c(1,NA), lty = c(NA,1) , col = c("black", "blue"), cex = .8)
# g)
lm6 <- lm(y2 ~ poly(x2,2))
anova(lm5, lm6)
## Analysis of Variance Table
##
## Model 1: y2 ~ x2
## Model 2: y2 ~ poly(x2, 2)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 98 9083.6
## 2 97 8902.9 1 180.65 1.9682 0.1638
The addition of a quadratic term does not significantly improve the linear model.
# j)
# CI for original dataset
confint(lm)
## 2.5 % 97.5 %
## (Intercept) -1.1150804 -0.9226122
## x 0.3925794 0.6063602
# CI for low noise dataset
confint(lm3)
## 2.5 % 97.5 %
## (Intercept) -1.0023016 -0.9984522
## x1 0.4978516 0.5021272
# CI for high noise dataset
confint(lm5)
## 2.5 % 97.5 %
## (Intercept) -3.301607 0.5477552
## x2 -1.648412 2.6272040
The noisy dataset had the widest confidence intervals, the low-noise dataset had the narrowest confidence intervals, and the original dataset was somewhere in the middle. All confidence intervals contained the true parameter values.
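A sketch that makes this comparison explicit by computing the width of each interval (output not shown):
apply(confint(lm), 1, diff)    # original noise level
apply(confint(lm3), 1, diff)   # low noise
apply(confint(lm5), 1, diff)   # high noise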
### 14)
set.seed(1)
x1 = runif(100)
x2 = .5*x1 + rnorm(100)/10
y = 2 + 2* x1 + .3 * x2 + rnorm(100)
The linear model is of the following form:
\(Y = 2 + 2X_1 + .3X_2 + \epsilon\)
So the regression coefficients are 2 for the intercept, 2 for the slope of \(X_1\), and .3 for the slope of \(X_2\).
# b) correlation between X1 and X2
cor(x1, x2)
## [1] 0.8351212
plot(x1, x2)
There is a correlation of .8351212 between the two predictors.
# c)
lm14 <- lm(y ~ x1 + x2)
summary(lm14)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8311 -0.7273 -0.0537 0.6338 2.3359
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1305 0.2319 9.188 7.61e-15 ***
## x1 1.4396 0.7212 1.996 0.0487 *
## x2 1.0097 1.1337 0.891 0.3754
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.056 on 97 degrees of freedom
## Multiple R-squared: 0.2088, Adjusted R-squared: 0.1925
## F-statistic: 12.8 on 2 and 97 DF, p-value: 1.164e-05
The estimates for the intercept, the slope of x1, and the slope of x2 are 2.1305, 1.4396, and 1.0097, respectively. The intercept is close to its true value, but the estimated coefficients for x1 and x2 are relatively far from the true parameters. We would reject the null hypothesis that \(\beta_1 = 0\) (though only barely, with p = .0487) and conclude that it is nonzero; for \(\beta_2 = 0\), however, we would fail to reject.
# d)
lm14.1 <- lm(y ~ x1)
summary(lm14.1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.89495 -0.66874 -0.07785 0.59221 2.45560
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.1124 0.2307 9.155 8.27e-15 ***
## x1 1.9759 0.3963 4.986 2.66e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.055 on 98 degrees of freedom
## Multiple R-squared: 0.2024, Adjusted R-squared: 0.1942
## F-statistic: 24.86 on 1 and 98 DF, p-value: 2.661e-06
In this case, the coefficient estimates are closer to their true values. We would reject the null and conclude that the coefficient for \(X_1\) is significantly different from zero.
# e)
lm14.2 <- lm(y ~ x2)
summary(lm14.2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.62687 -0.75156 -0.03598 0.72383 2.44890
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3899 0.1949 12.26 < 2e-16 ***
## x2 2.8996 0.6330 4.58 1.37e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.072 on 98 degrees of freedom
## Multiple R-squared: 0.1763, Adjusted R-squared: 0.1679
## F-statistic: 20.98 on 1 and 98 DF, p-value: 1.366e-05
In this case, the estimated coefficient for \(X_2\) is now even further from its true value of .3. We would still reject the null hypothesis and conclude this coefficient is non-zero.
These results do not contradict each other. \(X_2\) was generated from \(X_1\), so the two predictors are highly collinear. Collinearity does not bias the least-squares estimates, but it greatly inflates their standard errors and makes it hard to separate the individual effects of \(X_1\) and \(X_2\), which is why the multiple regression in c) gives imprecise estimates and fails to find \(X_2\) significant. In the simple regressions in d) and e), each predictor's coefficient also absorbs the contribution of the omitted, correlated predictor: the \(X_1\) estimate stays close to its true value because the contribution of \(X_2\) is small (.3 * .5 = .15), while the \(X_2\) estimate is pulled far above .3 because \(X_2\), having been generated from \(X_1\), stands in for the much larger effect of \(X_1\).
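Since the car package is already loaded, the collinearity could also be quantified with variance inflation factors (a sketch; output not shown). For a two-predictor model the VIF is 1 / (1 - cor(x1, x2)^2) for both predictors:
vif(lm14)                  # VIFs for x1 and x2 in the model from c)
1 / (1 - cor(x1, x2)^2)    # the same quantity computed directly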
# g)
x1 = c(x1, .1)
x2 = c(x2, .8)
y = c(y, 6)
par(mfrow = c(2,2))
lm14 <- lm(y ~ x1 + x2)
summary(lm14)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.73348 -0.69318 -0.05263 0.66385 2.30619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2267 0.2314 9.624 7.91e-16 ***
## x1 0.5394 0.5922 0.911 0.36458
## x2 2.5146 0.8977 2.801 0.00614 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.075 on 98 degrees of freedom
## Multiple R-squared: 0.2188, Adjusted R-squared: 0.2029
## F-statistic: 13.72 on 2 and 98 DF, p-value: 5.564e-06
par(mfrow = c(2,2))
plot(lm14)
The new observation looks to be both an outlier and a high-leverage point, based on its position in the top-right corner of the residuals-vs-leverage plot. Adding this point increased the estimate for the intercept, decreased the slope estimate for \(X_1\), and increased the slope estimate for \(X_2\).
lm14.1 <- lm(y ~ x1)
summary(lm14.1)
##
## Call:
## lm(formula = y ~ x1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8897 -0.6556 -0.0909 0.5682 3.5665
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.2569 0.2390 9.445 1.78e-15 ***
## x1 1.7657 0.4124 4.282 4.29e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 99 degrees of freedom
## Multiple R-squared: 0.1562, Adjusted R-squared: 0.1477
## F-statistic: 18.33 on 1 and 99 DF, p-value: 4.295e-05
par(mfrow = c(2,2))
plot(lm14.1)
Again, the new point is high leverage and an outlier. Adding the new observation resulted in a higher estimate for the intercept and a lower estimate for the slope.
lm14.2 <- lm(y ~ x2)
summary(lm14.2)
##
## Call:
## lm(formula = y ~ x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.64729 -0.71021 -0.06899 0.72699 2.38074
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.3451 0.1912 12.264 < 2e-16 ***
## x2 3.1190 0.6040 5.164 1.25e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.074 on 99 degrees of freedom
## Multiple R-squared: 0.2122, Adjusted R-squared: 0.2042
## F-statistic: 26.66 on 1 and 99 DF, p-value: 1.253e-06
par(mfrow = c(2,2))
plot(lm14.2)
In this case, the new observation is a high-leverage point but not an outlier, since its standardized residual is less than 2. Adding the new observation decreased the estimate for the intercept but increased the slope estimate.
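A numeric check of the added observation (row 101) in each of the three refit models (a sketch; output not shown):
hatvalues(lm14)[101];   rstandard(lm14)[101]     # y ~ x1 + x2: leverage, standardized residual
hatvalues(lm14.1)[101]; rstandard(lm14.1)[101]   # y ~ x1
hatvalues(lm14.2)[101]; rstandard(lm14.2)[101]   # y ~ x2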
set.seed(8675309)
#create training data
df.train <- NULL
for(i in 1:25){
  df.train <- cbind(df.train, rnorm(25))
}
df.train <- as.data.frame(df.train)
#change one variable to y
sample <- sample(1:25, 1)
colnames(df.train)[sample] <- "y"
df.train <- df.train[, c(sample, (1:25)[-sample])]
for(i in 2:25){
  colnames(df.train)[i] <- paste("x", (i-1), sep = "")
}
head(df.train)
## y x1 x2 x3 x4 x5
## 1 0.6832342 -0.9965824 0.1348980 1.46878477 0.74453768 0.399879213
## 2 0.6902011 0.7218241 0.1175821 -0.93946081 -1.34662801 -0.004830717
## 3 0.5334919 -0.6172088 -0.8255891 -0.85086347 0.33014251 1.204846218
## 4 -0.1861048 2.0293916 -2.1352363 1.66093372 -0.01272533 0.595980740
## 5 0.3829458 1.0654161 0.2142085 -1.35752606 -0.46367596 -0.495136183
## 6 0.3761842 0.9872197 -0.9470681 0.08600488 0.20494209 0.657976320
## x6 x7 x8 x9 x10 x11
## 1 -0.3191162 0.4096266 -0.08836476 -0.01375654 -0.3327243 -0.4851972
## 2 0.9146375 -0.5455656 0.08114036 -2.72281902 -0.1736846 0.5176822
## 3 1.4227406 0.4298306 1.38814525 0.41191338 -0.1874381 -0.3547614
## 4 -1.0991012 -0.9108425 -0.02341062 1.23162468 1.1563222 -0.1924374
## 5 -0.6047125 -1.0194112 0.98759269 1.59030643 -0.9381941 1.9105449
## 6 -1.2275653 1.9146288 -0.14469635 -0.51023230 0.5240523 1.8119469
## x12 x13 x14 x15 x16 x17 x18
## 1 0.26497555 -2.4051153 1.0259678 0.1643308 0.1492779 -0.9964029 1.0243020
## 2 1.03616158 0.7133681 -0.3924602 -0.8235687 -0.1879838 -0.3161343 -0.5163557
## 3 -0.52738827 -0.3324065 1.2267929 -0.8157660 0.8315598 0.8109417 -1.3154965
## 4 0.02581511 0.9175493 0.5496249 0.2396696 -1.0644169 -0.4154368 -0.3939363
## 5 0.22154581 0.7396449 0.4016090 -3.2374386 1.3997263 0.2577046 0.1096451
## 6 1.00161177 0.4143795 -0.4686555 0.4954646 0.4119513 -0.4418233 -0.8552211
## x19 x20 x21 x22 x23 x24
## 1 0.2892246 -0.3912259 0.7204351 1.5399328 1.7661774 -0.9528813
## 2 -0.0510803 2.2127224 2.2385646 0.4266710 -2.2309977 0.7786959
## 3 3.4127944 -1.5625910 0.2308601 -1.0262685 0.3106110 -0.1535704
## 4 0.1252550 -0.7702976 0.2661300 -0.6903070 -1.2352997 -0.2035335
## 5 -0.3436253 -0.5547058 0.3057487 -0.9539138 -0.2390462 0.1131789
## 6 -0.7758054 1.4105020 -2.7342631 -1.7556159 0.9363910 -0.5664170
#create testing data
df.test <- NULL
for(i in 1:25){
  df.test <- cbind(df.test, rnorm(25))
}
df.test <- as.data.frame(df.test)
#change one variable to y
sample <- sample(1:25, 1)
colnames(df.test)[sample] <- "y"
df.test <- df.test[, c(sample, (1:25)[-sample])]
for(i in 2:25){
  colnames(df.test)[i] <- paste("x", (i-1), sep = "")
}
head(df.test)
## y x1 x2 x3 x4 x5
## 1 0.4158043 -1.1250766 -0.5196149 -1.09864136 -0.02410888 1.18134481
## 2 -0.6970499 0.7090751 -2.2255202 -0.50459779 -0.49378629 -0.02132088
## 3 2.0642083 -0.9915848 -0.1433703 -0.43961467 1.60450170 0.36315224
## 4 0.3897854 0.5363899 -0.9774714 0.16267950 -0.22497240 -1.59881043
## 5 0.5219511 -1.7949124 -0.5512871 -1.38977658 -0.97209899 1.17390164
## 6 1.5108279 -0.2227637 1.2820031 -0.06122145 -1.25670096 0.36913315
## x6 x7 x8 x9 x10 x11
## 1 0.44148486 -0.9319189 -1.0534274 -0.655567996 0.38911231 0.5585176
## 2 0.19009395 -0.4034207 -0.7440766 -0.692025610 -1.45191844 0.8926430
## 3 0.34811270 -0.5812124 1.2361566 -1.418728317 1.69421915 1.2009851
## 4 0.09678569 2.1347891 1.4137998 0.006771314 -0.23393196 0.8426864
## 5 -1.03069405 0.3695359 -0.3247897 1.775508007 0.04801099 1.2780781
## 6 0.98563352 -0.1816729 0.2511341 -2.070801353 0.12236614 -0.8324172
## x12 x13 x14 x15 x16 x17
## 1 -1.1318773 -2.49928123 -0.31827564 -1.2631925 -0.6050051 1.1799015
## 2 -0.9765474 0.02553818 -0.98585201 -1.1700743 2.0258127 -0.9506309
## 3 -0.4443967 0.01264405 -0.06336071 0.7013156 1.2641116 0.1975220
## 4 0.6063389 -1.23822614 0.24240613 1.6318222 -0.5386447 0.5255518
## 5 -1.2944295 -0.32226495 0.36891237 -0.2343429 0.2844003 -0.4800159
## 6 0.7768829 -0.87910969 2.23755904 0.6487112 0.5253629 0.7514034
## x18 x19 x20 x21 x22 x23
## 1 0.6325128 -0.8978570 1.0887539 0.6481990 -1.16044672 0.6224226
## 2 0.6125887 -1.9269922 -1.2502255 -0.1901002 -0.03489018 -1.7872473
## 3 -0.9325299 -0.2030309 2.1158338 -0.5559143 2.37474595 0.9057682
## 4 0.1415578 0.1293742 -0.1612847 -0.6333391 1.23450040 -0.4913474
## 5 0.8749486 1.1216944 1.3456696 2.4586224 1.67528224 2.5594178
## 6 1.7572323 1.1620434 -0.8794192 0.1829208 0.26314079 -0.1249909
## x24
## 1 -1.32802361
## 2 -0.23227467
## 3 -0.07170024
## 4 -1.44581787
## 5 -0.38713272
## 6 1.60126771
# regression algorithm: fit models on the first 1, 2, ..., 24 predictors and
# record the training and test MSE for each fit
MSE.train <- c()
MSE.test <- c()
for(i in 2:25){
  fit <- lm(y ~ ., data = df.train[, 1:i])
  MSE.train <- c(MSE.train, mean(fit$residuals^2))
  pred <- predict.lm(fit, df.test)
  MSE.test <- c(MSE.test, mean((pred - df.test$y)^2))
}
MSE.train
## [1] 0.47727541 0.47234668 0.46873958 0.42767072 0.41126150 0.38775351
## [7] 0.38727546 0.38509644 0.36567243 0.33879570 0.33103017 0.28378094
## [13] 0.26272630 0.24194714 0.21127788 0.15999128 0.15998526 0.15672988
## [19] 0.15237197 0.14660067 0.14659715 0.11529060 0.03174804 0.00000000
MSE.test
## [1] 1.024503 1.096721 1.051816 1.247501 1.347985
## [6] 1.325017 1.327829 1.358484 1.358356 1.956702
## [11] 1.974135 2.032266 2.940323 2.839861 2.240212
## [16] 5.020365 5.058449 4.473183 4.052842 3.206986
## [21] 3.252859 5.327046 26.040003 11403.136648
plot(1:24, MSE.test, type = "l", main = NULL, xlab = "# of predictors", ylab = "MSE", ylim = c(0, 5), col = "blue", lwd = 2)
lines(1:24, MSE.train, type = "l", lwd = 2)
legend(2.5, 5, col = c("blue", "black"), legend = c("MSE.test", "MSE.train"), lty = 1, lwd = 2)
As more predictors are added to the model, the training error decreases monotonically, reaching exactly zero once the 24 predictors plus the intercept match the 25 training observations. The test error fluctuates, but the overall trend is clear: adding more of these (irrelevant) predictors increases the test error.
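A final sketch to pin down which model size minimizes the test MSE (from the MSE.test values printed above, it is the single-predictor model, consistent with y having been generated independently of every x):
which.min(MSE.test)   # number of predictors with the lowest test MSE
min(MSE.test)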