# Load the data interactively and extract the response and the four predictors
data <- read.csv(file.choose())
y <- data$y
x1 <- data$x_1
x2 <- data$x_2
x3 <- data$x_3
x4 <- data$x_4

a) When you regress y on all four predictors, what do you notice about the p-value for the F-statistic and the t-tests for the individual regression coefficients?

model <- lm(y~x1+x2+x3+x4)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

The p-value for the F-statistic (4.756e-07) is very small; however, the p-values for the individual regression coefficients are all large (p > 0.05). An overall fit that is highly significant while no single coefficient is significant is a classic symptom of multicollinearity among the predictors.
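As a quick informal check of that suspicion (a sketch using the x1-x4 vectors defined above; not part of the original output), the pairwise correlations among the predictors can be inspected:

# Sketch: pairwise correlations among the predictors; large |r| values hint at collinearity
round(cor(cbind(x1, x2, x3, x4)), 3)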

b) What are the VIFs for the predictors in this model?

library(car)
vif(model)
##        x1        x2        x3        x4 
##  38.49621 254.42317  46.86839 282.51286

The VIF values for all four predictor variables are high (well above the common cutoff of 10), confirming severe multicollinearity.
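For reference, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on the remaining predictors. A minimal sketch for x1, which should agree with vif(model)["x1"]:

# VIF for x1 by hand: regress x1 on the other predictors and take 1 / (1 - R^2)
r2_x1 <- summary(lm(x1 ~ x2 + x3 + x4))$r.squared
1 / (1 - r2_x1)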

c) Which first-order model do you think describes the response with interpretable regression parameters the “best”, and why?

First, I remove the predictor variable with the highest VIF (x4, VIF ≈ 282.5).

model1 <- lm(y~x1+x2+x3)
summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2543 -1.4726  0.1755  1.5409  3.9711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
## x1           1.69589    0.20458   8.290 1.66e-05 ***
## x2           0.65691    0.04423  14.851 1.23e-07 ***
## x3           0.25002    0.18471   1.354    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.3 on 3 and 9 DF,  p-value: 3.367e-08
vif(model1)
##       x1       x2       x3 
## 3.251068 1.063575 3.142125

After removing x4, x1 and x2 became significant, and the VIF values for all three remaining predictors dropped drastically (all below 5).
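As an additional check (a sketch, not shown in the original output), a partial F-test can be used to verify that leaving x4 out does not significantly worsen the fit:

# Partial F-test: does adding x4 back to the reduced model significantly improve the fit?
anova(model1, model)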

Since x3 is not significant (p = 0.209), I remove x3 from the model in the next step.

model2 <- lm(y~x1+x2)
summary(model2)
## 
## Call:
## lm(formula = y ~ x1 + x2)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.893 -1.574 -1.302  1.363  4.048 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## x1           1.46831    0.12130   12.11 2.69e-07 ***
## x2           0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared:  0.9787, Adjusted R-squared:  0.9744 
## F-statistic: 229.5 on 2 and 10 DF,  p-value: 4.407e-09
vif(model2)
##       x1       x2 
## 1.055129 1.055129

In this model both predictor variables are significant and the VIFs are close to 1, suggesting that x1 and x2 are essentially uncorrelated. With interpretable, significant coefficients, VIFs near 1, and an adjusted R-squared (0.9744) almost as high as that of the larger models, y ~ x1 + x2 is the first-order model that best describes the response with interpretable regression parameters.
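As a final sanity check (a sketch, not part of the original output), the three candidate models can also be compared on AIC, where a smaller value is preferred:

# Compare the full and reduced models on AIC (lower is better)
AIC(model, model1, model2)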