data <- read.csv("C:\\Users\\sjtha\\OneDrive\\Documents\\Academia\\Spring 2022\\SDA Dr.Matis\\Flipped Assignments\\FA11\\data-table-B21(2).csv")
head(data)
##   ï..i     y x_1 x_2 x_3 x_4
## 1    1  78.5   7  26   6  60
## 2    2  74.3   1  29  15  52
## 3    3 104.3  11  56   8  20
## 4    4  87.6  11  31   8  47
## 5    5  95.9   7  52   6  33
## 6    6 109.2  11  55   9  22
colnames(data) <- c("obs","y","x1","x2","x3","x4")
head(data)
##   obs     y x1 x2 x3 x4
## 1   1  78.5  7 26  6 60
## 2   2  74.3  1 29 15 52
## 3   3 104.3 11 56  8 20
## 4   4  87.6 11 31  8 47
## 5   5  95.9  7 52  6 33
## 6   6 109.2 11 55  9 22

Q1) When you regress y on all four predictors, what do you notice about the p-value for the F-statistic and the t-tests for the individual regression coefficients?

model <- lm(y~x1+x2+x3+x4,data = data)
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07

The p-value for the overall F-test is 4.756e-07, and the p-values for the individual t-tests are as follows:

x1: 0.0708, x2: 0.5009, x3: 0.8959, x4: 0.8441

Something looks off in this regression: the overall F-test p-value is very small, so the model as a whole is highly significant, yet none of the individual t-tests are significant at the 0.05 level.

This is a classic sign of multicollinearity in the model, i.e. the predictor variables are correlated with one another.
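
As a quick extra check (my addition, not part of the original output), we can look at the pairwise correlations among the predictors; strongly correlated columns would support the multicollinearity diagnosis.

# Pairwise correlations among the four predictors
round(cor(data[, c("x1", "x2", "x3", "x4")]), 2)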

Q2) What are the VIFs for the predictors in this model?

library(car)  # vif() is provided by the car package
vif(model)
##        x1        x2        x3        x4 
##  38.49621 254.42317  46.86839 282.51286

All four VIFs are very large (far above the common cutoff of 5), which confirms the suspicion raised by the F-test and t-test p-values: the predictors are strongly correlated with one another.
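
For reference (my addition), each VIF is just \(1/(1 - R_j^2)\), where \(R_j^2\) is the R-squared from regressing predictor \(j\) on the remaining predictors. A minimal sketch reproducing the VIF for x4 by hand:

# VIF for x4 computed by hand: regress x4 on the other predictors,
# then apply VIF = 1 / (1 - R^2); this should match vif(model)["x4"]
r2_x4 <- summary(lm(x4 ~ x1 + x2 + x3, data = data))$r.squared
1 / (1 - r2_x4)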

Q3) Which first order model do you think fits best, why?

Let's remove the predictor with the highest VIF.

That is, let's drop x4.

model2 <- lm(y~x1+x2+x3,data = data)
summary(model2)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2543 -1.4726  0.1755  1.5409  3.9711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
## x1           1.69589    0.20458   8.290 1.66e-05 ***
## x2           0.65691    0.04423  14.851 1.23e-07 ***
## x3           0.25002    0.18471   1.354    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.3 on 3 and 9 DF,  p-value: 3.367e-08

After removing x4, the model is still significant according to the F-test p-value, and the individual t-tests for x1 (p = 1.66e-05) and x2 (p = 1.23e-07) are significant. However, x3 is not significant, with a p-value of 0.209.

Let's check the VIF values now; a common rule of thumb is that VIFs below 5 indicate an acceptable level of multicollinearity.

vif(model2)
##       x1       x2       x3 
## 3.251068 1.063575 3.142125

The VIFs above show that a slight amount of multicollinearity remains (they are greater than 1), but they are well within the acceptable range below 5. Hence we can use the model with predictors x1, x2, and x3 after dropping x4.
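
As an additional check (my addition, not part of the original assignment), a partial F-test comparing the reduced model to the full model tells us whether dropping x4 significantly worsens the fit; a large p-value here means x4 can be dropped safely.

# Partial F-test: does adding x4 back to model2 improve the fit significantly?
anova(model2, model)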

Let's see what would have happened if we had chosen to keep x4 in the model and dropped all of the other predictors.

model3 <- lm(y~x4,data = data)
summary(model3)
## 
## Call:
## lm(formula = y ~ x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -12.589  -8.228   1.495   4.726  17.524 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 117.5679     5.2622  22.342 1.62e-10 ***
## x4           -0.7382     0.1546  -4.775 0.000576 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.964 on 11 degrees of freedom
## Multiple R-squared:  0.6745, Adjusted R-squared:  0.645 
## F-statistic:  22.8 on 1 and 11 DF,  p-value: 0.0005762

This model is still significant, but the R-squared and adjusted R-squared drop sharply (from about 0.98 to 0.67), so it is better to stay with the best first-order model so far, namely:

\(Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3\)
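
As a small usage sketch (my addition), we can see how this model predicts y for, say, the first observation in the data and compare it with the observed value.

# Predicted vs observed y for the first observation, using model2
predict(model2, newdata = data[1, ])
data$y[1]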

Let's also see what happens if we drop x2 instead of x4.

model4 <- lm(y~x1+x3+x4,data = data)
summary(model4)
## 
## Call:
## lm(formula = y ~ x1 + x3 + x4, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9323 -1.8090  0.4806  1.1398  3.7771 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 111.68441    4.56248  24.479 1.52e-09 ***
## x1            1.05185    0.22368   4.702  0.00112 ** 
## x3           -0.41004    0.19923  -2.058  0.06969 .  
## x4           -0.64280    0.04454 -14.431 1.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.377 on 9 degrees of freedom
## Multiple R-squared:  0.9813, Adjusted R-squared:  0.975 
## F-statistic: 157.3 on 3 and 9 DF,  p-value: 4.312e-08
vif(model4)
##       x1       x3       x4 
## 3.678168 3.459601 1.181000

As in model2 (where we dropped x4), dropping x2 here still leaves one insignificant term; the two models' p-values are roughly comparable, and the VIFs are again within the acceptable range below 5.

However, the p-value for x3 is now much closer to significance (0.070 here versus 0.209 in model2), so this model is arguably preferable to the one obtained by dropping x4.
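
To back this up with more than p-values (my addition, a minimal sketch), we can compare the two candidate three-predictor models by AIC and adjusted R-squared; lower AIC and higher adjusted R-squared favor a model.

# Compare the two candidate three-predictor models
AIC(model2, model4)
c(model2 = summary(model2)$adj.r.squared,
  model4 = summary(model4)$adj.r.squared)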

Hence, our final best model is:

\(Y = \beta_0 + \beta_1 x_1 + \beta_3 x_3 + \beta_4 x_4\)
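
If we want the fitted version of this equation with numeric coefficients (my addition, not shown in the original output), we can pull them directly from model4.

# Estimated coefficients of the final model (intercept, x1, x3, x4)
round(coef(model4), 4)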