Answer the following questions using the PISA dataset and the PISA Codebook.

For each question, provide your code and the answer.


Q1:

Estimate a regression model where Reading performance on the PISA (PISARead) is regressed on gross national income (GNI) and gross domestic product (GDP). Is there multicollinearity between the independent variables?

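PISA <- read.csv("/Users/YanfeiQin/Desktop/Fall 2021/897-002 Applied Linear Modeling/Lab 7/PISA.csv", header=TRUE, sep=",")
# Keep only the rows that are complete on the three model variables
PISA2 <- na.omit(PISA[,c("PISARead","GNI","GDP")])
# Store the model as `fit` rather than reusing the name of the lm() function
fit <- lm(PISARead ~ GNI + GDP, data = PISA2)
library(car)
## Loading required package: carData
vif(fit)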
##      GNI      GDP 
## 4159.901 4159.901


As shown above, the Variance Inflation Factor (VIF) for both GNI and GDP is 4159.901. This is far above the conventional cutoff of VIF >= 10, which indicates that the model suffers from multicollinearity. Tolerance leads to the same conclusion. The tolerance for GNI and GDP equals 1 / 4159.901 = 0.00024, and a tolerance less than or equal to 0.10 suggests that the independent variables in the model may suffer from multicollinearity. Thus, both the VIF and the tolerance show that the model does suffer from multicollinearity.
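To make the link between tolerance and VIF concrete, the sketch below computes both by hand from an auxiliary regression of one predictor on the other. It uses simulated data, not the PISA file: the data frame `sim`, the seed, and the noise level are assumptions chosen only to mimic two nearly identical predictors. For a model with two numeric predictors, this hand calculation should agree with what `vif()` from the car package reports.

```r
# Hypothetical stand-ins for GNI and GDP (simulated, NOT the PISA data)
set.seed(1)
n   <- 60
gni <- rnorm(n, mean = 30000, sd = 10000)
gdp <- gni + rnorm(n, mean = 0, sd = 200)   # GDP is almost a copy of GNI
sim <- data.frame(GNI = gni, GDP = gdp)

# Auxiliary regression: regress one predictor on the other predictor(s)
aux    <- lm(GNI ~ GDP, data = sim)
r2_aux <- summary(aux)$r.squared

# Tolerance is the share of GNI's variance NOT explained by GDP;
# VIF is its reciprocal
tolerance_gni <- 1 - r2_aux
vif_gni       <- 1 / tolerance_gni

print(c(tolerance = tolerance_gni, VIF = vif_gni))
```

With only two predictors, the auxiliary R-squared is the same whichever predictor is treated as the outcome, which is why GNI and GDP share a single VIF in the output above.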

cor(PISA2[,c("GNI","GDP")], use="complete.obs")
##           GNI       GDP
## GNI 1.0000000 0.9998798
## GDP 0.9998798 1.0000000


As shown above in the Correlation Matrix, the correlation between our two IVs is approximately 0.9999, which again confirms our conclusion: the IVs do suffer from multicollinearity problems.
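The correlation and the VIF are two views of the same quantity here. When a model has exactly two predictors, VIF = 1 / (1 - r^2), where r is the correlation between them. The short check below applies that identity to the printed correlation. The variable names are illustrative only, and because the printed correlation is rounded to seven decimals, the result should land close to the reported VIF of 4159.901 rather than match it exactly.

```r
# Correlation between GNI and GDP as printed in the matrix above (rounded)
r <- 0.9998798

# Two-predictor identity: tolerance = 1 - r^2, VIF = 1 / tolerance
tolerance_from_r <- 1 - r^2
vif_from_r       <- 1 / tolerance_from_r

print(c(tolerance = tolerance_from_r, VIF = vif_from_r))
```

This identity holds only for the two-predictor case. With three or more predictors, the VIF depends on the R-squared from regressing each predictor on all the others, so pairwise correlations alone can understate the problem.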