--> Here we import the data into R and rename the columns
dat<-read.csv("C:\\Users\\18067\\Documents\\Fareeha Imam\\TTU R11767331\\Spring 2023\\SDA\\Assignment 11\\data-table-B21(2).csv")
head(dat)
##   i     y x_1 x_2 x_3 x_4
## 1 1  78.5   7  26   6  60
## 2 2  74.3   1  29  15  52
## 3 3 104.3  11  56   8  20
## 4 4  87.6  11  31   8  47
## 5 5  95.9   7  52   6  33
## 6 6 109.2  11  55   9  22
dat<-dat[,-1]
colnames(dat)<-c("y","x1","x2","x3","x4")
print(dat,row.names=FALSE)
##      y x1 x2 x3 x4
##   78.5  7 26  6 60
##   74.3  1 29 15 52
##  104.3 11 56  8 20
##   87.6 11 31  8 47
##   95.9  7 52  6 33
##  109.2 11 55  9 22
##  102.7  3 71 17  6
##   72.5  1 31 22 44
##   93.1  2 54 18 22
##  115.9 21 47  4 26
##   83.8  1 40 23 34
##  113.3 11 66  9 12
##  109.4 10 68  8 12
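
--> As an aside, a minimal sketch of a more portable import (the relative filename is an assumption that the CSV sits in the working directory):
dat <- read.csv("data-table-B21(2).csv")  # assumes the file is in the current working directory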

1 Part A:

When you regress y on all four predictors, what do you notice about the p-value for the F-statistic and the t-tests for the individual regression coefficients?

--> To regress y on all four predictors, we can use the lm() function as follows:
model<-lm(y~x1+x2+x3+x4, data=dat)
--> We check the significance of the coefficients using anova() and summary():
anova(model)
## Analysis of Variance Table
## 
## Response: y
##           Df  Sum Sq Mean Sq  F value    Pr(>F)    
## x1         1 1450.08 1450.08 242.3679 2.888e-07 ***
## x2         1 1207.78 1207.78 201.8705 5.863e-07 ***
## x3         1    9.79    9.79   1.6370    0.2366    
## x4         1    0.25    0.25   0.0413    0.8441    
## Residuals  8   47.86    5.98                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3 + x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1750 -1.6709  0.2508  1.3783  3.9254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)  
## (Intercept)  62.4054    70.0710   0.891   0.3991  
## x1            1.5511     0.7448   2.083   0.0708 .
## x2            0.5102     0.7238   0.705   0.5009  
## x3            0.1019     0.7547   0.135   0.8959  
## x4           -0.1441     0.7091  -0.203   0.8441  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.446 on 8 degrees of freedom
## Multiple R-squared:  0.9824, Adjusted R-squared:  0.9736 
## F-statistic: 111.5 on 4 and 8 DF,  p-value: 4.756e-07
--> The summary output shows that the overall model is highly significant: the F-statistic is 111.5 on 4 and 8 degrees of freedom, with a p-value of 4.756e-07. Yet none of the individual t-tests is significant at the 0.05 level; the smallest coefficient p-value is 0.0708, for x1. The sequential anova() table does flag x1 and x2, but those tests depend on the order in which the terms enter the model, so they are not the marginal t-tests reported by summary(). A highly significant overall F-test paired with no significant individual coefficients is a classic symptom of multicollinearity among the predictors.
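
--> As a side check (a minimal sketch, not part of the required output), the overall F-test p-value and the individual t-test p-values can be extracted from the summary object to make the contrast explicit:
fs <- summary(model)$fstatistic
pf(fs["value"], fs["numdf"], fs["dendf"], lower.tail = FALSE)  # overall F-test p-value (4.756e-07)
summary(model)$coefficients[, "Pr(>|t|)"]  # individual t-test p-values (all above 0.05)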

2 Part B:

What are the VIFs for the predictors in this model?

library(car)
vif(model)
##        x1        x2        x3        x4 
##  38.49621 254.42317  46.86839 282.51286
--> All four VIFs are well above the usual cutoff of 10, indicating serious multicollinearity. The values for x2 (254.4) and x4 (282.5) are especially large, suggesting that these two predictors are nearly linear combinations of the others.
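
--> As a cross-check (a sketch using the same dat and model objects), each VIF can be reproduced by hand as 1/(1 - R^2) from the regression of that predictor on the remaining ones, and the pairwise correlations show where the collinearity comes from:
r2_x4 <- summary(lm(x4 ~ x1 + x2 + x3, data = dat))$r.squared  # R^2 of x4 on the other predictors
1 / (1 - r2_x4)  # manual VIF; should match vif(model)["x4"]
cor(dat[, -1])   # pairwise correlations among x1-x4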

3 Part C:

Which first-order model do you think describes the response with interpretable regression parameters the “best”, and why?

--> We now fit several candidate models using different combinations of the predictor variables.

Model 1

model1<-lm(y~x1+x3 , data=dat)
summary(model1)
## 
## Call:
## lm(formula = y ~ x1 + x3, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.142  -7.779   2.558   7.226  15.008 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)   
## (Intercept)  72.3490    17.0528   4.243  0.00171 **
## x1            2.3125     0.9598   2.409  0.03672 * 
## x3            0.4945     0.8814   0.561  0.58717   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 11.08 on 10 degrees of freedom
## Multiple R-squared:  0.5482, Adjusted R-squared:  0.4578 
## F-statistic: 6.066 on 2 and 10 DF,  p-value: 0.01883
--> The intercept (p-value = 0.00171) and x1 (p-value = 0.03672) are significant, but x3 is not (p-value = 0.58717). The multiple R-squared is 0.5482, the adjusted R-squared is 0.4578, and the F-statistic is 6.066 with a p-value of 0.01883.


Model 2

model2<-lm(y~x2+x3, data=dat)
summary(model2)
## 
## Call:
## lm(formula = y ~ x2 + x3, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.1535 -4.1565 -0.3155  2.0330 13.4864 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  72.0747     7.3834   9.762 1.98e-06 ***
## x2            0.7313     0.1207   6.057 0.000123 ***
## x3           -1.0084     0.2934  -3.437 0.006358 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6.445 on 10 degrees of freedom
## Multiple R-squared:  0.847,  Adjusted R-squared:  0.8164 
## F-statistic: 27.69 on 2 and 10 DF,  p-value: 8.377e-05
--> The intercept (p-value = 1.98e-06), x2 (p-value = 0.000123), and x3 (p-value = 0.006358) are all significant. The multiple R-squared is 0.847, the adjusted R-squared is 0.8164, and the F-statistic is 27.69 with a p-value of 8.377e-05.


Model 3

model3<-lm(y~x1+x2+x3, data=dat)
summary(model3)
## 
## Call:
## lm(formula = y ~ x1 + x2 + x3, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2543 -1.4726  0.1755  1.5409  3.9711 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 48.19363    3.91330  12.315 6.17e-07 ***
## x1           1.69589    0.20458   8.290 1.66e-05 ***
## x2           0.65691    0.04423  14.851 1.23e-07 ***
## x3           0.25002    0.18471   1.354    0.209    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.312 on 9 degrees of freedom
## Multiple R-squared:  0.9823, Adjusted R-squared:  0.9764 
## F-statistic: 166.3 on 3 and 9 DF,  p-value: 3.367e-08
--> The intercept (p-value = 6.17e-07), x1 (p-value = 1.66e-05), and x2 (p-value = 1.23e-07) are significant, but x3 is not (p-value = 0.209). The multiple R-squared is 0.9823, the adjusted R-squared is 0.9764, and the F-statistic is 166.3 with a p-value of 3.367e-08.


Model 4

model4<-lm(y~x1+x3+x4, data=dat)
summary(model4)
## 
## Call:
## lm(formula = y ~ x1 + x3 + x4, data = dat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.9323 -1.8090  0.4806  1.1398  3.7771 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 111.68441    4.56248  24.479 1.52e-09 ***
## x1            1.05185    0.22368   4.702  0.00112 ** 
## x3           -0.41004    0.19923  -2.058  0.06969 .  
## x4           -0.64280    0.04454 -14.431 1.58e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.377 on 9 degrees of freedom
## Multiple R-squared:  0.9813, Adjusted R-squared:  0.975 
## F-statistic: 157.3 on 3 and 9 DF,  p-value: 4.312e-08
--> The intercept (p-value = 1.52e-09), x1 (p-value = 0.00112), and x4 (p-value = 1.58e-07) are significant, but x3 is not at the 0.05 level (p-value = 0.06969). The multiple R-squared is 0.9813, the adjusted R-squared is 0.975, and the F-statistic is 157.3 with a p-value of 4.312e-08.

Conclusion

Comparing the results of the four models, Model 3 has the highest multiple R-squared (0.9823) and adjusted R-squared (0.9764) and the lowest residual standard error (2.312). The residual standard error measures the average amount by which the observed values deviate from the fitted values, so a smaller RSE means a better fit. On these criteria, Model 3 is the best first-order model for describing the response with interpretable regression parameters.
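
--> For completeness, a minimal sketch (the forms and fits objects are illustrative names, not part of the assignment) that collects the adjusted R-squared and residual standard error of all four candidate models in one table:
forms <- list(model1 = y ~ x1 + x3, model2 = y ~ x2 + x3,
              model3 = y ~ x1 + x2 + x3, model4 = y ~ x1 + x3 + x4)
fits <- lapply(forms, lm, data = dat)  # fit each candidate model
data.frame(adj.r2 = sapply(fits, function(f) summary(f)$adj.r.squared),
           rse = sapply(fits, function(f) summary(f)$sigma))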