data <- read.csv(file.choose())
y <- data$y
x1 <- data$x_1
x2 <- data$x_2
x3 <- data$x_3
x4 <- data$x_4
model <- lm(y~x1+x2+x3+x4)
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.1750 -1.6709  0.2508  1.3783  3.9254
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  62.4054    70.0710   0.891   0.3991
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009
## x3            0.1019     0.7547   0.135   0.8959
## x4           -0.1441     0.7091  -0.203   0.8441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared: 0.9824, Adjusted R-squared: 0.9736
## F-statistic: 111.5 on 4 and 8 DF, p-value: 4.756e-07
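The p-value for the F-statistic (4.756e-07) is very low, yet the p-values for all of the individual regression coefficients are high (p > 0.05). A model that is jointly significant while none of its predictors is individually significant is a typical symptom of multicollinearity, so the next step is to check the variance inflation factors (VIFs).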
library(car)
vif(model)
##        x1        x2        x3        x4
##  38.49621 254.42317  46.86839 282.51286
The VIF values for all four predictor variables are high (> 10), which indicates severe multicollinearity.
First, I remove the predictor variable with the highest VIF (x4) and refit the model.
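For reference, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The following is a minimal sketch of that calculation in base R. It runs on simulated stand-in data, not on the data analysed here; the names `sim`, `z1`, `z2`, `z3` and `manual_vif` are hypothetical and chosen only for illustration.

```r
# Simulated stand-in data (hypothetical; not the data set analysed here).
set.seed(1)
n  <- 50
z1 <- rnorm(n)
z2 <- z1 + rnorm(n, sd = 0.1)   # nearly a copy of z1, so strongly collinear
z3 <- rnorm(n)                  # generated independently of z1 and z2
sim <- data.frame(z1, z2, z3)

# VIF_j = 1 / (1 - R_j^2), where R_j^2 is the R-squared from regressing
# predictor j on all of the remaining predictors.
manual_vif <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

vifs <- manual_vif(sim)
print(round(vifs, 2))
```

In this sketch the two nearly identical predictors should show VIFs far above 10, while the independent one should stay close to 1. For an ordinary linear model with numeric predictors, this is the same quantity that `car::vif()` reports.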
model1 <- lm(y~x1+x2+x3)
summary(model1)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.2543 -1.4726  0.1755  1.5409  3.9711
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
## x1           1.69589    0.20458   8.290 1.66e-05 ***
## x2           0.65691    0.04423  14.851 1.23e-07 ***
## x3           0.25002    0.18471   1.354    0.209
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared: 0.9823, Adjusted R-squared: 0.9764
## F-statistic: 166.3 on 3 and 9 DF, p-value: 3.367e-08
vif(model1)
##       x1       x2       x3
## 3.251068 1.063575 3.142125
After removing x4, x1 and x2 became significant, and the VIF values of all three remaining predictor variables dropped drastically (all are now below 5).
Because x3 is not significant (p = 0.209), the next step is to remove x3 from the model.
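The link between the smaller VIFs and the newly significant coefficients can be checked with a little arithmetic. The standard error of a coefficient is proportional to the residual standard error times the square root of its VIF, and the remaining factors (sample size and the spread of the predictor itself) do not change when another predictor is dropped. The sketch below uses only numbers copied from the output above; the variable names are mine.

```r
# Values copied from the summary() and vif() output above.
se_full  <- c(x1 = 0.7448,   x2 = 0.7238)     # std. errors with x4 in the model
se_red   <- c(x1 = 0.20458,  x2 = 0.04423)    # std. errors after dropping x4
vif_full <- c(x1 = 38.49621, x2 = 254.42317)  # vif(model)
vif_red  <- c(x1 = 3.251068, x2 = 1.063575)   # vif(model1)
s_full   <- 2.446                             # residual standard error, 8 df
s_red    <- 2.312                             # residual standard error, 9 df

# Observed shrinkage of the standard errors ...
observed  <- se_full / se_red
# ... versus what the change in VIF and residual standard error predicts.
predicted <- (s_full / s_red) * sqrt(vif_full / vif_red)

print(round(rbind(observed, predicted), 2))
```

The observed and predicted ratios agree up to rounding (roughly 3.6 for x1 and 16.4 for x2), so the gain in precision for x1 and x2 is almost entirely explained by the drop in their VIFs.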
model2 <- lm(y~x1+x2)
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -2.893 -1.574 -1.302  1.363  4.048
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 52.57735    2.28617   23.00 5.46e-10 ***
## x1           1.46831    0.12130   12.11 2.69e-07 ***
## x2           0.66225    0.04585   14.44 5.03e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.406 on 10 degrees of freedom
## Multiple R-squared: 0.9787, Adjusted R-squared: 0.9744
## F-statistic: 229.5 on 2 and 10 DF, p-value: 4.407e-09
vif(model2)
##       x1       x2
## 1.055129 1.055129
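In this model, both predictor variables are significant and their VIF values are close to 1, suggesting that x1 and x2 are at most weakly correlated with each other. The adjusted R-squared (0.9744) is essentially the same as in the full four-predictor model (0.9736), so dropping x3 and x4 cost almost nothing in explanatory power.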
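With only two predictors, the VIF depends solely on their pairwise correlation r, because VIF = 1 / (1 - r²). The reported value can therefore be converted back into the size of that correlation. A short sketch using the number from `vif(model2)` (variable names are mine):

```r
vif_two   <- 1.055129           # vif(model2), identical for x1 and x2
r_squared <- 1 - 1 / vif_two    # squared correlation between x1 and x2
abs_r     <- sqrt(r_squared)    # the sign cannot be recovered from the VIF

print(round(c(r_squared = r_squared, abs_r = abs_r), 3))
```

This gives an absolute correlation of about 0.23 between x1 and x2, which is weak but not exactly zero.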