In this project, we analyze the relationship between foreign direct investment inflows and the inflation rate, gold reserves, and trade openness of 13 countries in 2020, and we examine how strong that relationship is where it exists.
library(readxl)
data <- read_excel("~/Desktop/ENS /Macroeconomics/econometrics_Independent_Work/data.xlsx")
## New names:
## • `` -> `...1`
head(data)
## # A tibble: 6 × 13
## ...1 GDP fdi inflation hci trade unemployment freedom reserves tax
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Austr… 1.33e12 1.41 0.847 0.778 44.0 6.46 78 4.25e10 47.4
## 2 Argen… 3.90e11 1.21 42 0.612 30.1 11.5 50 3.94e10 106.
## 3 Azerb… 4.27e10 1.19 2.76 0.591 72.0 6.46 62 7.63e 9 40.7
## 4 Germa… 3.85e12 3.71 0.507 0.762 81.1 3.81 76 2.68e11 48.8
## 5 Spain 1.28e12 2.63 -0.323 0.734 59.8 15.5 68 8.13e10 47
## 6 France 2.63e12 0.560 0.476 0.771 57.8 8.01 66 2.24e11 60.7
## # ℹ 3 more variables: `FDI in summ` <dbl>, wage <dbl>, export <dbl>
y <- data$fdi
x1 <- data$reserves
x2 <- data$trade
x3 <- data$inflation
dat <- data.frame(y,x1,x2,x3)
head(dat)
## y x1 x2 x3
## 1 1.4083576 42544629265 44.03953 0.8469055
## 2 1.2122067 39403734630 30.14814 42.0000000
## 3 1.1879039 7633754110 72.01787 2.7598095
## 4 3.7119907 268408603349 81.10855 0.5066899
## 5 2.6325211 81287702461 59.76872 -0.3227530
## 6 0.5598317 224236417868 57.76743 0.4764989
The data were obtained from the World Bank website and refer to the year 2020.
\(y\) - foreign direct investment inflows, measured as a percentage of GDP.
\(x_1\) - gold reserves of the country (in US dollars).
\(x_2\) - trade openness of the country: the share of GDP accounted for by foreign trade, measured as a percentage of GDP.
\(x_3\) - inflation rate of the country, measured in percent.
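As an optional first look (a sketch, not part of the original output), basic summary statistics of the four working variables can be printed before plotting:
# Optional: five-number summaries and means of y, x1, x2, x3
summary(dat)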
pairs(dat, lower.panel = NULL)
Preliminary conclusions based on the scatterplot matrix:
There is a weak positive linear relationship between \(y\) and \(x_1\), a clear positive linear relationship between \(y\) and \(x_2\), and a weak negative linear relationship between \(y\) and \(x_3\).
Relationships between the predictors:
There is almost no noticeable linear relationship between \(x_1\) and \(x_2\), a negative relationship between \(x_1\) and \(x_3\), and a weak negative linear relationship between \(x_2\) and \(x_3\).
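As an optional visual aid (a sketch, not in the original workflow), the same scatterplot matrix can be drawn with lowess smoothing curves, which makes the trends described above easier to see:
# Optional: scatterplot matrix with lowess curves in the lower panels
pairs(dat, lower.panel = panel.smooth, upper.panel = NULL)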
r <- cor(dat)
round(r,2)
## y x1 x2 x3
## y 1.00 0.10 0.98 -0.16
## x1 0.10 1.00 0.02 -0.35
## x2 0.98 0.02 1.00 -0.20
## x3 -0.16 -0.35 -0.20 1.00
Conclusions based on the correlation analysis:
Between \(y\) and \(x_1\), \(r = 0.10\): a weak positive linear relationship between foreign direct investment and gold reserves. Between \(y\) and \(x_2\), \(r = 0.98\): a strong positive linear relationship between foreign direct investment and trade openness. Between \(y\) and \(x_3\), \(r = -0.16\): a weak negative linear relationship between foreign direct investment and the inflation rate.
Relationships between the predictors:
Between \(x_1\) and \(x_2\), \(r = 0.02\): almost no linear relationship. Between \(x_1\) and \(x_3\), \(r = -0.35\): a moderate negative linear relationship. Between \(x_2\) and \(x_3\), \(r = -0.20\): a weak negative linear relationship.
Among the predictors, \(x_2\) is highly positively correlated with the dependent variable \(y\), indicating that trade openness explains the volume of foreign direct investment inflows very well.
initModel <- lm(y~x1+x2+x3)
summary(initModel)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8873 -1.1105 0.1750 0.8436 2.8185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.694e+00 8.623e-01 -5.443 0.000409 ***
## x1 2.357e-12 1.272e-12 1.853 0.096816 .
## x2 1.055e-01 5.537e-03 19.061 1.39e-08 ***
## x3 5.419e-02 4.289e-02 1.263 0.238224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 9 degrees of freedom
## Multiple R-squared: 0.9765, Adjusted R-squared: 0.9687
## F-statistic: 124.7 on 3 and 9 DF, p-value: 1.196e-07
It is necessary to check the predictors for multicollinearity, i.e. to determine whether there is a strong relationship among them. If such a relationship exists, it may reduce the quality of the model, because correlated predictors explain overlapping variation and make the individual coefficient estimates unstable.
To check for multicollinearity, we use the Variance Inflation Factor (VIF). If a predictor's VIF is below 5, there is no problem; if it exceeds 5, multicollinearity is present. In that case, among the highly correlated predictors, the one more weakly correlated with the dependent variable \(y\) should be removed and the model re-estimated.
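For intuition, the VIF of a single predictor can also be reproduced by hand from its definition, \(VIF_j = 1/(1-R_j^2)\), where \(R_j^2\) comes from regressing that predictor on the remaining ones. The sketch below does this for \(x_1\); the result should match the value returned by car::vif() further down.
# Sketch: VIF of x1 computed from its definition
r2_x1 <- summary(lm(x1 ~ x2 + x3, data = dat))$r.squared
1 / (1 - r2_x1)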
library(car)
## Loading required package: carData
vif(initModel)
## x1 x2 x3
## 1.139016 1.042864 1.184368
Conclusion:
According to the VIF analysis, all VIF values are below 5, which means there is no multicollinearity problem among the predictors. Therefore, all three predictors can be used in the model.
model <- lm(y~x1+x2+x3); model
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Coefficients:
## (Intercept) x1 x2 x3
## -4.694e+00 2.357e-12 1.055e-01 5.419e-02
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8873 -1.1105 0.1750 0.8436 2.8185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.694e+00 8.623e-01 -5.443 0.000409 ***
## x1 2.357e-12 1.272e-12 1.853 0.096816 .
## x2 1.055e-01 5.537e-03 19.061 1.39e-08 ***
## x3 5.419e-02 4.289e-02 1.263 0.238224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 9 degrees of freedom
## Multiple R-squared: 0.9765, Adjusted R-squared: 0.9687
## F-statistic: 124.7 on 3 and 9 DF, p-value: 1.196e-07
The regression model equation is as follows:
\(y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3\)
\(y = -4.694 + (2.357\times10^{-12})x_1 + 0.1055 x_2 + 0.05419 x_3\)
The economic meaning of the regression equation:
\(b_0 = -4.694\) - the average FDI level when all predictors are zero is negative.
\(b_1 = 2.357\times10^{-12}\) - holding the other factors constant, a one-dollar increase in gold reserves (\(x_1\)) raises FDI inflows by only \(2.357\times10^{-12}\) percentage points, an almost negligible positive effect.
\(b_2 = 0.1055\) - holding the other factors constant, a one-percentage-point increase in trade openness (\(x_2\)) raises FDI inflows by about 0.106 percentage points.
\(b_3 = 0.05419\) - holding the other factors constant, a one-percentage-point increase in inflation (\(x_3\)) raises FDI inflows by about 0.054 percentage points, so the partial relationship is positive (even though the simple correlation between \(y\) and \(x_3\) was slightly negative).
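To illustrate how the fitted equation is used, the sketch below plugs purely hypothetical values (50 billion dollars of reserves, trade openness of 60% of GDP, 5% inflation; these numbers are not from the data set) into the model:
# Sketch: evaluating the fitted equation at hypothetical predictor values
b <- coef(model)
b["(Intercept)"] + b["x1"] * 5e10 + b["x2"] * 60 + b["x3"] * 5
# the same fitted value via predict()
predict(model, newdata = data.frame(x1 = 5e10, x2 = 60, x3 = 5))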
To evaluate the goodness of fit of the regression model, we use the decomposition of the total variability of the dependent variable \(y\) into explained and unexplained components.
SSR, SSE, SST
Regression Sum of Squares (SSR) represents the part of the total variation explained by the model: \[SSR=\sum_{i=1}^{n} (\widehat y_i - \overline y)^2\]
Error (Residual) Sum of Squares (SSE) represents the unexplained part of the total variation: \[SSE=\sum_{i=1}^{n} (y_i - \widehat y_i)^2\]
Total Sum of Squares (SST) represents the total variation in the dependent variable: \[SST=\sum_{i=1}^{n} (y_i - \overline y)^2\]
SSE <- sum(residuals(model)^2);SSE
## [1] 23.04161
SSR <- sum((predict(model)-mean(y))^2);SSR
## [1] 957.7148
SST <- sum((mean(y)-y)^2);SST
## [1] 980.7564
#R-squared:
Rsq <- (SSR/SST);Rsq
## [1] 0.9765063
# We check whether the calculated **R-squared** value matches the value given in the **summary** output.
As we can see from the calculations above, the computed \(R^2 = 0.9765\) matches the Multiple R-squared value reported in the summary output.
Degrees of freedom: df1 and df2
n <- 13 # Sample size
k <- 3 # Number of predictors
df1 <- 3
df2 <- (n-k-1);df2
## [1] 9
Objective:
To test whether the overall regression model is statistically
significant — that is, whether at least one of the predictors has a
significant effect on the dependent variable, based on the sample
coefficient of determination.
Formulation of hypotheses:
\[H_0: \rho^2 = 0\]
\[H_1: \rho^2 \ne 0\]
Equivalently:
\[H_0:\beta_1=\beta_2=\beta_3=0\] \[H_1: \text{at least one } \beta_i \ne 0\]
We take the significance level as \(\alpha = 0.1\), i.e., a 90% confidence level.
alpha <- 0.1
\[F = \frac{\frac {SSR}{k}}{\frac {SSE}{n-k-1}} = \frac {MSR}{MSE}\]
F <- (SSR/k)/(SSE/(n-k-1)); F
## [1] 124.6938
Critical F-value
qf(1-alpha,k, n-k-1)
## [1] 2.812863
qt(1-alpha/2, n-k-1) # two-sided t critical value at alpha = 0.1 for the coefficient t-tests
## [1] 1.833113
\[F=124.6938 > F_{cr} = 2.813\]
Conclusion:
Since \(F = 124.6938 > F_{crit} =
2.813\), we reject the null hypothesis at the 10% significance
level. This indicates that the model is statistically significant, and
foreign direct investment inflows can be explained by changes in gold
reserves, trade openness, and inflation rate. With 90% confidence, the
model can be considered suitable for prediction.
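As an optional cross-check (not in the original workflow), the p-value implied by the computed F statistic can be obtained directly with pf(); it should agree with the p-value reported in the summary output (1.196e-07).
# Sketch: p-value of the observed F statistic under H0
pf(F, df1 = k, df2 = n - k - 1, lower.tail = FALSE)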
summary(model)
##
## Call:
## lm(formula = y ~ x1 + x2 + x3)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8873 -1.1105 0.1750 0.8436 2.8185
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.694e+00 8.623e-01 -5.443 0.000409 ***
## x1 2.357e-12 1.272e-12 1.853 0.096816 .
## x2 1.055e-01 5.537e-03 19.061 1.39e-08 ***
## x3 5.419e-02 4.289e-02 1.263 0.238224
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.6 on 9 degrees of freedom
## Multiple R-squared: 0.9765, Adjusted R-squared: 0.9687
## F-statistic: 124.7 on 3 and 9 DF, p-value: 1.196e-07
Formulation of hypotheses (for each coefficient individually):
\[H_0:\beta_i=0\] \[H_1: \beta_i \ne 0\]
From the summary we can see that for the predictors \(x_1\) and \(x_2\) the \(p\text{-value} < \alpha = 0.1\), so each of them has a statistically significant relationship with \(y\). For \(x_3\), however, the \(p\text{-value} = 0.24 > \alpha = 0.1\), indicating that the inflation rate is not significantly related to foreign direct investment at the 90% confidence level (it could only be retained at a significance level above 0.24), and including it does not meaningfully improve the model. Therefore, we exclude the variable \(x_3\) and rebuild the model.
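An equivalent way to justify dropping \(x_3\) (a sketch, not part of the original analysis) is a partial F-test comparing the models with and without it; when a single term is dropped, its p-value coincides with the t-test p-value for \(x_3\) in the full model (about 0.238).
# Sketch: partial F-test for excluding x3 from the full model
anova(lm(y ~ x1 + x2), model)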
model2 <- lm(y~x1+x2); model2
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Coefficients:
## (Intercept) x1 x2
## -4.070e+00 1.797e-12 1.041e-01
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2558 -1.1520 0.3321 0.8156 2.6143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.070e+00 7.277e-01 -5.593 0.00023 ***
## x1 1.797e-12 1.227e-12 1.465 0.17375
## x2 1.041e-01 5.583e-03 18.653 4.24e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.647 on 10 degrees of freedom
## Multiple R-squared: 0.9723, Adjusted R-squared: 0.9668
## F-statistic: 175.8 on 2 and 10 DF, p-value: 1.619e-08
New regression model equation:
\(y = -4.070 + (1.797\times10^{-12})x_1 + 0.1041 x_2\)
The F-test checks the overall linear relationship; the t-test checks each predictor individually.
t-test
The variable \(x_2\) passes the t-test, since its \(p\text{-value} = 4.24 \times 10^{-9} < 0.1\), meaning that \(x_2\) has a statistically significant effect on \(y\). For \(x_1\), the \(p\text{-value} = 0.174\) exceeds 0.1 but is below 0.2, so \(x_1\) is retained only at the more lenient significance level \(\alpha = 0.2\) used below.
F-test
\[H_0: \rho^2 = 0\]
\[H_1: \rho^2 \ne 0\]
n2 <- 13 # Sample size
k2 <- 2 # Number of predictors
alpha2 <- 0.2
SSE2 <- sum(residuals(model2)^2); SSE # note: this echoes SSE of the full model; SSE2 is stored for use below
## [1] 23.04161
SSR2 <- sum((predict(model2)-mean(y))^2); SSR # likewise echoes SSR of the full model
## [1] 957.7148
SST2 <- sum((mean(y)-y)^2); SST # SST is unchanged, since it depends only on y
## [1] 980.7564
#R-squared:
Rsq2 <- (SSR2/SST2);Rsq2
## [1] 0.9723403
Fst2 <- ((SSR2/k2)/(SSE2/(n2-k2-1)));Fst2
## [1] 175.7681
#F critical
Fcr2 <- qf(1-alpha2,k2,n2-k2-1);Fcr2
## [1] 1.898648
\(F_{st2} = 175.77 > F_{cr2} = 1.899\), so we reject the null hypothesis \(H_0\) in favour of \(H_1\): the model is statistically significant overall. The Multiple R-squared is 97.2% and the Adjusted R-squared is 96.7%, which indicates that the model explains a large proportion of the variation in \(y\), although there may still be room to improve the model slightly by adding further explanatory variables.
Conclusion:
To keep \(x_1\) in the model and avoid reducing it to a simple regression on \(x_2\) alone, we set the significance level to 0.2, since the p-value of \(x_1\) (0.174) exceeds 0.1. Under these conditions, we proceed at the 80% confidence level.
\(S_e = 1.647\) — the residual standard error, which measures the dispersion of the observed values around the fitted regression. It shows, on average, how far the observed values deviate from the values predicted by the model. This statistic is also used when comparing models: a smaller \(S_e\) indicates a better fit.
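As a sketch of where this number comes from, \(S_e\) can be reproduced from its definition, \(S_e = \sqrt{SSE/(n-k-1)}\), using the quantities already computed for the second model:
# Sketch: residual standard error from its definition; should match the 1.647
# reported by summary(model2)
sqrt(SSE2 / (n2 - k2 - 1))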
confint(model2, level=0.8)
## 10 % 90 %
## (Intercept) -5.068539e+00 -3.071411e+00
## x1 1.133745e-13 3.480490e-12
## x2 9.647903e-02 1.118012e-01
summary(model2)
##
## Call:
## lm(formula = y ~ x1 + x2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.2558 -1.1520 0.3321 0.8156 2.6143
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4.070e+00 7.277e-01 -5.593 0.00023 ***
## x1 1.797e-12 1.227e-12 1.465 0.17375
## x2 1.041e-01 5.583e-03 18.653 4.24e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.647 on 10 degrees of freedom
## Multiple R-squared: 0.9723, Adjusted R-squared: 0.9668
## F-statistic: 175.8 on 2 and 10 DF, p-value: 1.619e-08
Conclusion:
With 80% confidence, \(b_0\) lies between −5.068539 and −3.071411, \(b_1\) between 1.133745 × 10⁻¹³ and 3.480490 × 10⁻¹², and \(b_2\) between 0.09647903 and 0.1118012.
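For illustration, one of these intervals can be reproduced by hand from the usual formula, estimate \(\pm\ t_{0.90,\,10}\) × standard error (a sketch; it should match the confint() row for \(x_2\)):
# Sketch: 80% confidence interval for b2 computed manually
est <- summary(model2)$coefficients["x2", "Estimate"]
se <- summary(model2)$coefficients["x2", "Std. Error"]
est + c(-1, 1) * qt(0.90, df = 10) * se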
Histogram
If the histogram of the residuals has a roughly bell-shaped form, the residuals can be considered approximately normally distributed.
hist(residuals(model2)) # histogram of the residuals of model2
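An optional variant of this plot (a sketch, not in the original output) rescales the histogram to densities and overlays a normal curve with the residuals' own mean and standard deviation, which makes the comparison with normality more direct:
# Sketch: density-scale histogram of the residuals with a normal curve overlaid
res2 <- residuals(model2)
hist(res2, freq = FALSE, main = "Histogram of residuals", xlab = "Residuals")
curve(dnorm(x, mean = mean(res2), sd = sd(res2)), add = TRUE)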
Conclusion:
From the histogram, the residuals do not form a perfectly bell-shaped
distribution, but they are centered around zero with few large
deviations. Therefore, the residuals can be considered approximately
normally distributed, though not perfectly normal.
Q-Q Plot
The histogram and QQ plot are visual methods for assessing whether the residuals follow a normal distribution. If most of the points on the QQ plot lie on or very close to the reference line, the residuals can be considered approximately normally distributed.
Bmodel <- lm(y~x1+x2)
fit<-predict(Bmodel); fit
## 1 2 3 4 5 6
## 0.592756842 -0.859537751 3.443692531 4.858991830 2.300416215 2.348870196
## 7 8 9 10 11 12
## 33.339077564 1.694914631 2.465249419 -0.499536676 1.788863014 -0.005744327
## 13
## 2.452479720
res<-residuals(Bmodel); res
## 1 2 3 4 5 6 7
## 0.8156007 2.0717445 -2.2557887 -1.1470011 0.3321049 -1.7890385 0.7167702
## 8 9 10 11 12 13
## -0.4744065 -1.3775423 1.2122476 -1.1519840 2.6142750 0.4330180
qqnorm(res)
qqline(res)
Conclusion:
The residuals are distributed approximately along the straight line,
indicating that they can be considered approximately normally
distributed.
All plots
plot(Bmodel)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Conclusion:
The residuals mostly lie within acceptable limits, but observation 7
shows high leverage and could influence the model. Overall, the model
seems stable, but influential points should be checked.
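To check this numerically rather than only visually (a sketch, not part of the original analysis), the leverage (hat) values and Cook's distances of the observations can be inspected; a clearly larger value for observation 7 would confirm the impression from the plots.
# Sketch: influence measures for each observation
round(hatvalues(Bmodel), 3)
round(cooks.distance(Bmodel), 3)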
The main purpose of conducting the Durbin–Watson test is to determine whether there is autocorrelation among the residuals.
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
#dwtest
dwtest(Bmodel) #Test for independence of residuals
##
## Durbin-Watson test
##
## data: Bmodel
## DW = 2.5058, p-value = 0.8378
## alternative hypothesis: true autocorrelation is greater than 0
Conclusion:
Based on the Durbin–Watson test results, DW = 2.5058 and \(p\text{-value} = 0.8378 > \alpha = 0.2\), so there is no evidence of autocorrelation among the residuals. A DW value near 2 indicates independence of the residuals, and values in the range of 1.5 to 2.5 are typically acceptable in practice.
Purpose of the test
The Jarque–Bera test is performed using the fBasics package and checks whether the skewness and kurtosis of the residuals conform to those of a normal distribution. The null hypothesis (\(H_0\)) of the Jarque–Bera test states that the residuals are normally distributed, i.e. their skewness is zero and their excess kurtosis is zero. If the \(p\text{-value} > \alpha\), we fail to reject \(H_0\), and we can therefore conclude that the residuals are normally distributed.
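For intuition, the quantities behind the test can be reproduced by hand (a sketch): sample skewness, excess kurtosis, and the Jarque–Bera statistic \(JB = \frac{n}{6}\left(S^2 + \frac{K^2}{4}\right)\), which should come out close to the 0.64 reported below.
# Sketch: skewness, excess kurtosis and the JB statistic from their definitions
n_res <- length(res)
m2 <- mean((res - mean(res))^2)
skew <- mean((res - mean(res))^3) / m2^(3/2)
exkurt <- mean((res - mean(res))^4) / m2^2 - 3
c(skewness = skew, excess_kurtosis = exkurt, JB = n_res/6 * (skew^2 + exkurt^2/4))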
library(fBasics)
##
## Attaching package: 'fBasics'
## The following object is masked from 'package:car':
##
## densityPlot
jarqueberaTest(res)
##
## Title:
## Jarque-Bera Normality Test
##
## Test Results:
## STATISTIC:
## X-squared: 0.64
## P VALUE:
## Asymptotic p Value: 0.7262
Conclusion:
\[p\text{-value} = 0.7262 > \alpha = 0.2\]
This means we fail to reject the null hypothesis \(H_0\): the residuals can be considered normally distributed.
Based on the collected data, the effects of gold reserves, trade openness, and the inflation rate on foreign direct investment (FDI) inflows (as a share of GDP) were analyzed. Initially, three predictors were included in the model; according to the t-test results, the inflation rate (\(x_3\)) was statistically insignificant and was therefore excluded from the final model. Of the remaining variables, trade openness is highly significant, while gold reserves are significant only at the 80% confidence level. The analysis shows that trade openness is the strongest determinant of FDI inflows, and the final model was found to be valid and suitable for forecasting purposes.
Economic implication: To enhance the inflow of foreign direct investment, countries should focus primarily on increasing their level of trade openness. Additionally, expanding national gold reserves can further strengthen investor confidence and contribute to greater investment inflows.