Introduction

Our research is based on different factors that are believed have an impact on GDP growth rate in developing countries such as the internet cover, corruption index, openness, women rights, electricity or power index, existence of armed conflicts, and percentage of paved roads. In this research, we are going to test the following hypotheses, and using panel data model.

Hypotheses to test:

  1. Observance of women rights are significant factors for GDP growth.
  2. Openness to the foreign markets positively affects the GDP growth.

In this document we will analyze counties’ GDP growth rate using panel data model.Panel data models provide information on individual behavior, both across individuals and over time. In other words, there is a time dimension and cross sectional dimension. We will follow these steps:

  1. Unit root test for all variables.
  2. Checking if it is pooling data or panel data.
  3. Hausman’s test to decide between fixed or random effect method
  4. Creating the model
  5. Tests for autocorrelation and heteroscedastic (likelihood ratio)
  6. Checking if error term is normal
  7. Analyzing and evaluating the model using accuracy statistic. We will then compare the results to those of similar literature.

Literature review

Economists have used both theory and empirical research to explain the cause of economic growth. People like Solow, Swann, and Romer have provided theoretical frameworks on which most later works are based.

[5] According to Robert Barro (1996) studies of a panel of 100 countries from 1960 to 1990 to find the factors that affected the economic growth of countries, the growth rate of real per capita GDP was associated with maintenance of the rule of law, smaller government consumption, longer life expectancy, more male secondary and higher levels of schooling, lower fertility rates, higher levels of investment, the level of democracy, a lower inflation rate, and openness to trade. Whereas the prevalence of traps such as conflicts or wars, rent seeking on natural resources, dependence on only one neighboring country, and lack of the rule of law hinder GDP growth. (Collier, 2007). Other economists like Caves (1971) found that there was a positive correlation between the productivity of a multinational enterprise and labor productivity in domestic firms in the same industry. He claimed that this was a result of competition and continuous improvement brought by foreign investment to the domestic market. Foreign direct investment benefits a host country through added employment, new technology and transfer of knowledge. Also, it causes an increase in the volume of domestic investment (Borensztein, De Gregorio, and Lee 1998).

Data

The source data, available Here, contains various development indicators for 135+countries between years 1980 and 2008. We have selected some variables which we believed could directly influence a country’s GDP growth rate. The dependent variable to be regressed is the GDP growth rate. We have found the data from World Bank’s development indicators for the years mentioned above.

The description of variables is given in figure 1.

figure 1: data description

Data Preparation

Install and load the necessary packages

requiredPackages = c("aTSA","verification","olsrr","DescTools","caret","tibble","purrr","corrplot","corrplot","dbplyr","dplyr","readr", "ggplot2", "glmnet","plm", "Formula","stargazer", "MASS","sandwich", "zoo","car", "lmtest","aod","dplyr","DescTools","PerformanceAnalytics","dummies","lmtest","fBasics","splines","akima")

for(i in requiredPackages){if(!require(i,character.only = TRUE)) install.packages(i)}
for(i in requiredPackages){if(!require(i,character.only = TRUE)) library(i,character.only = TRUE)}

Load the data.

data_countries <- read.csv("CountriesDataTrue.csv", header = TRUE)
head(data_countries,6) # the top 6 rows
##   ï..Country Year gdpGrowthRate R.D  openness electricityIndex internetCover
## 1     Angola 1980     2.2087276   . 0.5963767         64.42574             0
## 2     Angola 1981    -0.1324745   . 0.6993147         62.25248             0
## 3     Angola 1982     1.2464611   . 0.5534043         64.08428             0
## 4     Angola 1983     1.7701183   . 0.5312528         67.18925             0
## 5     Angola 1984     1.7846867   . 0.5857161         55.51948             0
## 6     Angola 1985     2.3214330   . 0.5402753         64.72874             0
##     roads corruptionIndex womenRights armedConflict
## 1 11.8110            0.33           4            -2
## 2 10.9276            1.74           5            -2
## 3 12.0430            1.53           4            -2
## 4  9.6743            1.74           5            -2
## 5 12.3564            1.94           4            -2
## 6 12.3244            1.87           5            -2
str(data_countries)
## 'data.frame':    3829 obs. of  11 variables:
##  $ ï..Country      : chr  "Angola" "Angola" "Angola" "Angola" ...
##  $ Year            : int  1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
##  $ gdpGrowthRate   : num  2.209 -0.132 1.246 1.77 1.785 ...
##  $ R.D             : chr  "." "." "." "." ...
##  $ openness        : num  0.596 0.699 0.553 0.531 0.586 ...
##  $ electricityIndex: num  64.4 62.3 64.1 67.2 55.5 ...
##  $ internetCover   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ roads           : num  11.81 10.93 12.04 9.67 12.36 ...
##  $ corruptionIndex : num  0.33 1.74 1.53 1.74 1.94 1.87 1.81 1.93 1.7 1.86 ...
##  $ womenRights     : int  4 5 4 5 4 5 5 5 5 5 ...
##  $ armedConflict   : int  -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...

Exploratory Data Analysis

Below we will do some exploratory data analysis, hoping to get the general picture of the data.

Finding the missing entries

We notice that column R.D has many entries filled with “.” Let’s see how much of the total column is missing.

sum(data_countries$R.D== ".")/length(data_countries$R.D)
## [1] 0.2953774

About 30% of R.D column is missing. Therefore, we should remove the column from the analysis.

data_countries = data_countries[,-4] # omit the R&D column

Finding the outliers

Just like the missing values, it is important to check and handle the outliers appropriately for the better model accuracy.

Let’s start with analyzing basic statistics.

summary(data_countries)
##   ï..Country             Year      gdpGrowthRate        openness     
##  Length:3829        Min.   :1980   Min.   :-50.248   Min.   :0.0000  
##  Class :character   1st Qu.:1987   1st Qu.:  1.522   1st Qu.:0.3371  
##  Mode  :character   Median :1994   Median :  3.730   Median :0.5070  
##                     Mean   :1994   Mean   :  3.615   Mean   :0.6046  
##                     3rd Qu.:2001   3rd Qu.:  6.063   3rd Qu.:0.7207  
##                     Max.   :2008   Max.   : 35.224   Max.   :9.8665  
##                     NA's   :1      NA's   :1                         
##  electricityIndex  internetCover        roads        corruptionIndex 
##  Min.   :    0.0   Min.   : 0.000   Min.   :  0.00   Min.   : 0.000  
##  1st Qu.:  586.7   1st Qu.: 0.000   1st Qu.: 17.55   1st Qu.: 2.710  
##  Median : 1368.5   Median : 0.010   Median : 37.98   Median : 3.660  
##  Mean   : 2985.4   Mean   : 6.254   Mean   : 47.79   Mean   : 4.329  
##  3rd Qu.: 3981.2   3rd Qu.: 2.910   3rd Qu.: 80.60   3rd Qu.: 5.410  
##  Max.   :36852.5   Max.   :90.000   Max.   :100.00   Max.   :10.000  
##                                                                      
##   womenRights   armedConflict    
##  Min.   :0.00   Min.   :-2.0000  
##  1st Qu.:3.00   1st Qu.: 0.0000  
##  Median :4.00   Median : 0.0000  
##  Mean   :3.98   Mean   :-0.2142  
##  3rd Qu.:5.00   3rd Qu.: 0.0000  
##  Max.   :9.00   Max.   : 0.0000  
## 
# glimpse(data_countries)

Box plots of the variables can indicate the outliers if any.

library(reshape)
meltData <- melt(data_countries)
p <- ggplot(meltData, aes(factor(variable), value)) 
p + geom_boxplot() + facet_wrap(~variable, scale="free")

As we can see, variables “gdpGrowthRate”, “openness”, “womenRights” and “armedConflict” have outliers. We will remove the instances with those values in the next stage.

gdpGrowthRate outliers

outlier_gdpGrowthRate <- boxplot(data_countries$gdpGrowthRate,data = data_countries)$out

openness outliers

outlier_openness <- boxplot(data_countries$openness,data = data_countries)$out

womenRights outliers

outlier_womenRights <- boxplot(data_countries$womenRights,data = data_countries)$out

armedConflicts outliers

Since variable “armedConflicts” has only 3 levels, it is better not to remove any of them.

Remove the outliers.

data_countries <- data_countries[!((data_countries$gdpGrowthRate %in% outlier_gdpGrowthRate) |(data_countries$openness %in% outlier_openness) |(data_countries$womenRights %in% outlier_womenRights)),]

A closer look at the dependent variable

After cleaning the data, it is time to take a closer look at the dependent variable, gdpGrowthRate. Let’s plot it first.

ggplot(data_countries,
       aes(x = gdpGrowthRate)) +
  geom_histogram(fill = "blue",
                 bins = 100) +
  theme_bw()

The dependent variable is nearly symmetric, making the variable easier to model. Therefore,there is no need for log transformation.

Variable selection

Omit the variables with 1 level.

Variables experiencing no change across the observations should be excluded from the model. Let’s find out if we have any such variable.

sapply(data_countries, 
       function(x) 
         unique(x) %>% 
         length()) %>% 
  sort()
##    armedConflict      womenRights             Year       ï..Country 
##                3                9               30              131 
##  corruptionIndex    internetCover    gdpGrowthRate            roads 
##              736              966             1685             2628 
##         openness electricityIndex 
##             3411             3411

Another way to check this is:

vars_selected <-names(data_countries)
var_to_remove <- nearZeroVar(data_countries, 
                                      names = TRUE)

var_to_remove
## character(0)

We keep all variables because none has a zero/near zero variance.

Remove the collinear (strongly correlated) independent variables

Correlated features in general don’t improve models (although it depends on the specifics of the problem like the number of variables and the degree of correlation). The interpretation of effect of individual predictor variables are not reliable when multicolinearity is present. It is important, therefore, to exclude them from the model.

data_countries_correlations <- cor(data_countries[,-c(1,2)],
    use = "pairwise.complete.obs")

data_countries_cor_order <- 
  data_countries_correlations[,"gdpGrowthRate"] %>% 
  sort(decreasing = TRUE) %>% # sort the correlations with TEY in decreasing order
  names()

Plotting the correlations

# using the 30 most correlated variables
corrplot.mixed(data_countries_correlations[data_countries_cor_order, 
                                   data_countries_cor_order],
               upper = "square",
               lower = "number",
               tl.col = "black",
               tl.pos = "lt")

We see that there are variables highly correlated to one another. Next step should be finding the potential candidates to be excluded from the model

vars_to_remove <- findCorrelation(data_countries_correlations[-1,-1],
                cutoff = 0.9, # threshold
                names = TRUE)

vars_selected <- names(data_countries)[
  !names(data_countries) %in% vars_to_remove
]
data_countries <- data_countries[,vars_selected]

No variable has been removed, since all correlations are less than the 0.9 threshold.

Omit variables which are linear combinations of one another.

Variables that are linear combinations of other variables should be removed. They do not add anything, but only make everything a little harder.

( findLinearCombos(data_countries[,-c(1,2,3)]) ->
    data_countries_linearCombos )
## $linearCombos
## list()
## 
## $remove
## NULL

We do not have variables that are linear combinations of each other.

Modeling

Testing for unit roots/stationarity

It is necessary to make all variables stationary to avoid the Heteroscedasticity issues. The augmented Dickey-Fuller - If p-value < 0.05 then no unit roots present.

Augmented Dickey-Fuller test results show that the variables are stationary.

gdpGrowthRate

adf.test(data_countries$gdpGrowthRate)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag    ADF p.value
##  [1,]   0 -22.48    0.01
##  [2,]   1 -15.22    0.01
##  [3,]   2 -11.71    0.01
##  [4,]   3 -10.34    0.01
##  [5,]   4  -9.00    0.01
##  [6,]   5  -7.81    0.01
##  [7,]   6  -6.76    0.01
##  [8,]   7  -6.04    0.01
##  [9,]   8  -5.61    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -38.5    0.01
##  [2,]   1 -28.0    0.01
##  [3,]   2 -22.7    0.01
##  [4,]   3 -20.9    0.01
##  [5,]   4 -18.9    0.01
##  [6,]   5 -17.0    0.01
##  [7,]   6 -15.0    0.01
##  [8,]   7 -13.7    0.01
##  [9,]   8 -13.0    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -38.6    0.01
##  [2,]   1 -28.1    0.01
##  [3,]   2 -22.7    0.01
##  [4,]   3 -21.0    0.01
##  [5,]   4 -19.0    0.01
##  [6,]   5 -17.0    0.01
##  [7,]   6 -15.1    0.01
##  [8,]   7 -13.8    0.01
##  [9,]   8 -13.0    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

openness

adf.test(data_countries$openness)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -6.12    0.01
##  [2,]   1 -5.67    0.01
##  [3,]   2 -5.30    0.01
##  [4,]   3 -5.03    0.01
##  [5,]   4 -5.00    0.01
##  [6,]   5 -4.88    0.01
##  [7,]   6 -4.72    0.01
##  [8,]   7 -4.77    0.01
##  [9,]   8 -4.68    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -14.1    0.01
##  [2,]   1 -13.3    0.01
##  [3,]   2 -12.7    0.01
##  [4,]   3 -12.3    0.01
##  [5,]   4 -12.4    0.01
##  [6,]   5 -12.3    0.01
##  [7,]   6 -12.2    0.01
##  [8,]   7 -12.5    0.01
##  [9,]   8 -12.6    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -14.1    0.01
##  [2,]   1 -13.3    0.01
##  [3,]   2 -12.7    0.01
##  [4,]   3 -12.3    0.01
##  [5,]   4 -12.4    0.01
##  [6,]   5 -12.4    0.01
##  [7,]   6 -12.2    0.01
##  [8,]   7 -12.6    0.01
##  [9,]   8 -12.6    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

electricityIndex

adf.test(data_countries$electricityIndex)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -7.66    0.01
##  [2,]   1 -7.45    0.01
##  [3,]   2 -7.28    0.01
##  [4,]   3 -7.25    0.01
##  [5,]   4 -7.35    0.01
##  [6,]   5 -7.50    0.01
##  [7,]   6 -7.47    0.01
##  [8,]   7 -7.46    0.01
##  [9,]   8 -7.43    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -9.46    0.01
##  [2,]   1 -9.25    0.01
##  [3,]   2 -9.08    0.01
##  [4,]   3 -9.08    0.01
##  [5,]   4 -9.25    0.01
##  [6,]   5 -9.48    0.01
##  [7,]   6 -9.49    0.01
##  [8,]   7 -9.52    0.01
##  [9,]   8 -9.54    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -9.46    0.01
##  [2,]   1 -9.25    0.01
##  [3,]   2 -9.08    0.01
##  [4,]   3 -9.08    0.01
##  [5,]   4 -9.24    0.01
##  [6,]   5 -9.48    0.01
##  [7,]   6 -9.49    0.01
##  [8,]   7 -9.52    0.01
##  [9,]   8 -9.54    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

internetCover

adf.test(data_countries$internetCover)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -15.2    0.01
##  [2,]   1 -15.6    0.01
##  [3,]   2 -16.1    0.01
##  [4,]   3 -16.0    0.01
##  [5,]   4 -16.5    0.01
##  [6,]   5 -16.6    0.01
##  [7,]   6 -15.9    0.01
##  [8,]   7 -15.2    0.01
##  [9,]   8 -14.9    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -16.4    0.01
##  [2,]   1 -16.9    0.01
##  [3,]   2 -17.6    0.01
##  [4,]   3 -17.7    0.01
##  [5,]   4 -18.4    0.01
##  [6,]   5 -18.7    0.01
##  [7,]   6 -18.1    0.01
##  [8,]   7 -17.5    0.01
##  [9,]   8 -17.3    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -16.4    0.01
##  [2,]   1 -16.9    0.01
##  [3,]   2 -17.6    0.01
##  [4,]   3 -17.7    0.01
##  [5,]   4 -18.4    0.01
##  [6,]   5 -18.7    0.01
##  [7,]   6 -18.1    0.01
##  [8,]   7 -17.5    0.01
##  [9,]   8 -17.3    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

roads

adf.test(data_countries$roads)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -4.60    0.01
##  [2,]   1 -4.44    0.01
##  [3,]   2 -4.43    0.01
##  [4,]   3 -4.40    0.01
##  [5,]   4 -4.46    0.01
##  [6,]   5 -4.53    0.01
##  [7,]   6 -4.43    0.01
##  [8,]   7 -4.42    0.01
##  [9,]   8 -4.40    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -8.14    0.01
##  [2,]   1 -7.92    0.01
##  [3,]   2 -7.95    0.01
##  [4,]   3 -7.95    0.01
##  [5,]   4 -8.10    0.01
##  [6,]   5 -8.28    0.01
##  [7,]   6 -8.16    0.01
##  [8,]   7 -8.20    0.01
##  [9,]   8 -8.23    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -8.15    0.01
##  [2,]   1 -7.93    0.01
##  [3,]   2 -7.97    0.01
##  [4,]   3 -7.96    0.01
##  [5,]   4 -8.12    0.01
##  [6,]   5 -8.30    0.01
##  [7,]   6 -8.18    0.01
##  [8,]   7 -8.22    0.01
##  [9,]   8 -8.25    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

corruptionIndex

adf.test(data_countries$corruptionIndex)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -4.24    0.01
##  [2,]   1 -4.06    0.01
##  [3,]   2 -3.94    0.01
##  [4,]   3 -3.87    0.01
##  [5,]   4 -3.84    0.01
##  [6,]   5 -3.78    0.01
##  [7,]   6 -3.73    0.01
##  [8,]   7 -3.71    0.01
##  [9,]   8 -3.68    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -9.55    0.01
##  [2,]   1 -9.18    0.01
##  [3,]   2 -9.01    0.01
##  [4,]   3 -8.92    0.01
##  [5,]   4 -8.93    0.01
##  [6,]   5 -8.87    0.01
##  [7,]   6 -8.84    0.01
##  [8,]   7 -8.88    0.01
##  [9,]   8 -8.91    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -9.56    0.01
##  [2,]   1 -9.18    0.01
##  [3,]   2 -9.02    0.01
##  [4,]   3 -8.92    0.01
##  [5,]   4 -8.93    0.01
##  [6,]   5 -8.87    0.01
##  [7,]   6 -8.84    0.01
##  [8,]   7 -8.88    0.01
##  [9,]   8 -8.91    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

womenRights

adf.test(data_countries$womenRights)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -7.19    0.01
##  [2,]   1 -6.14    0.01
##  [3,]   2 -4.99    0.01
##  [4,]   3 -4.67    0.01
##  [5,]   4 -4.68    0.01
##  [6,]   5 -4.63    0.01
##  [7,]   6 -4.69    0.01
##  [8,]   7 -4.58    0.01
##  [9,]   8 -4.45    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -16.5    0.01
##  [2,]   1 -14.4    0.01
##  [3,]   2 -11.8    0.01
##  [4,]   3 -11.2    0.01
##  [5,]   4 -11.4    0.01
##  [6,]   5 -11.5    0.01
##  [7,]   6 -11.8    0.01
##  [8,]   7 -11.7    0.01
##  [9,]   8 -11.6    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -16.5    0.01
##  [2,]   1 -14.4    0.01
##  [3,]   2 -11.9    0.01
##  [4,]   3 -11.3    0.01
##  [5,]   4 -11.5    0.01
##  [6,]   5 -11.5    0.01
##  [7,]   6 -11.9    0.01
##  [8,]   7 -11.8    0.01
##  [9,]   8 -11.6    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

armedConflict

adf.test(data_countries$armedConflict)
## Augmented Dickey-Fuller Test 
## alternative: stationary 
##  
## Type 1: no drift no trend 
##       lag   ADF p.value
##  [1,]   0 -17.7    0.01
##  [2,]   1 -14.0    0.01
##  [3,]   2 -12.9    0.01
##  [4,]   3 -12.0    0.01
##  [5,]   4 -11.3    0.01
##  [6,]   5 -11.1    0.01
##  [7,]   6 -11.1    0.01
##  [8,]   7 -11.1    0.01
##  [9,]   8 -10.9    0.01
## Type 2: with drift no trend 
##       lag   ADF p.value
##  [1,]   0 -19.3    0.01
##  [2,]   1 -15.4    0.01
##  [3,]   2 -14.2    0.01
##  [4,]   3 -13.3    0.01
##  [5,]   4 -12.5    0.01
##  [6,]   5 -12.4    0.01
##  [7,]   6 -12.4    0.01
##  [8,]   7 -12.5    0.01
##  [9,]   8 -12.3    0.01
## Type 3: with drift and trend 
##       lag   ADF p.value
##  [1,]   0 -19.3    0.01
##  [2,]   1 -15.4    0.01
##  [3,]   2 -14.2    0.01
##  [4,]   3 -13.3    0.01
##  [5,]   4 -12.5    0.01
##  [6,]   5 -12.4    0.01
##  [7,]   6 -12.4    0.01
##  [8,]   7 -12.5    0.01
##  [9,]   8 -12.3    0.01
## ---- 
## Note: in fact, p.value = 0.01 means p.value <= 0.01

Fixed or random?

Fixed effect models are used when the interest is in analyzing the impact of variables that vary over time. To find out if fixed or random model better fits out data, we use Hausman’s test. Hausman’s test is based on the comparison of two sets of estimates:the fixed and the random effects models:

names(data_countries)[names(data_countries) == "ï..Country"] <- "country"
fixed <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, 
             index=c("country", "Year"), model="within")
random <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, 
             index=c("country", "Year"), model="random")

Hausman’s test

phtest(fixed, random)
## 
##  Hausman Test
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +  ...
## chisq = 57.858, df = 7, p-value = 4.036e-10
## alternative hypothesis: one model is inconsistent

Since the p -value is below the 5% threshold, we have to reject the null hypothesis that random model should be used. We continue with the fixed model into the further steps.

Tests of poolability

We can check if the model can be reduced to a simple regression with no panel effect using tests for poolability. pFtest computes F tests of effects based on the comparison of the within and the pooling model.

OLS <-plm(gdpGrowthRate~openness+electricityIndex+ electricityIndex + internetCover + roads + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, 
             index=c("country", "Year"), model="pooling")
pFtest(fixed, OLS)
## 
##  F test for individual effects
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +  ...
## F = 7.2633, df1 = 129, df2 = 3273, p-value < 2.2e-16
## alternative hypothesis: significant effects

We reject the null hypothesis that there is no panel effect.

Individual and|or time effects?

To test if individual effect, time effect or both should be considered in the model, we can use pFtest().

# individual effect 
individual <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "individual",
             index=c("country", "Year"), model="within")

# time effect
time <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "time",
             index=c("country", "Year"), model="within")

# individual and time effect
individual_time <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "twoways",
             index=c("country", "Year"), model="within")

Below pFtest() is used to compare the base model against each of them.

Individual effect

Testing Individual effects. The null is that no Individual effects are needed

pFtest(individual, fixed)
## 
##  F test for individual effects
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +  ...
## F = NaN, df1 = 0, df2 = 3273, p-value = NA
## alternative hypothesis: significant effects

Time effect

Testing time-fixed effects. The null is that no time-fixed effects are needed

pFtest(time, fixed)
## 
##  F test for time effects
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +  ...
## F = 5.4096, df1 = -101, df2 = 3374, p-value = NA
## alternative hypothesis: significant effects

Individual and time effect

Testing individual and time effects. The null is that no individual and time effects are needed

pFtest(individual_time, fixed)
## 
##  F test for twoways effects
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +  ...
## F = 10.325, df1 = 28, df2 = 3245, p-value < 2.2e-16
## alternative hypothesis: significant effects

Sincep-value <5%, we reject that there is individual and time effect.

Testing for serial correlation

Breusch–Godfrey Test for Panel Models can be used to test if there is autocorrelation in the residuals of the model. The null hypothesis H0: there is no serial correlation of any order up to p. 

pbgtest(fixed) #Breusch–Godfrey Test
## 
##  Breusch-Godfrey/Wooldridge test for serial correlation in panel models
## 
## data:  gdpGrowthRate ~ openness + electricityIndex + internetCover +     roads + corruptionIndex + womenRights + armedConflict
## chisq = 190.72, df = 5, p-value < 2.2e-16
## alternative hypothesis: serial correlation in idiosyncratic errors

We reject the null that there is no serial correlation. We will need to use a robust method that handles autocorrelation.

Testing for heteroskedasticity

Breusch-Pagan test is used for testing if the residual has heteroskedasticity.

The null hypothesis for the Breusch-Pagan test is homoscedasticity. If p-value < 0.05 then the residual has heteroskedasticity.

bptest(fixed, studentize=T) #Breusch-Pagan Test
## 
##  studentized Breusch-Pagan test
## 
## data:  fixed
## BP = 49.639, df = 7, p-value = 1.7e-08

Since the p-value is below 5%, the null hypothesis is rejected. The residual has heteroskedasticity.

Controlling for heteroskedasticity and autocorrelation:

Since our model experiences both heteroskedasticity and autocorrelation, it is necessary that we use robust covariance matrix to account for it. The recommended parameters are applied in the code below.

(coeff<- coeftest(fixed, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group")))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value Pr(>|t|)    
## openness          1.7060e+00  3.6635e-01  4.6568 3.34e-06 ***
## electricityIndex  1.4531e-04  5.6513e-05  2.5712  0.01018 *  
## internetCover    -8.9998e-04  5.1010e-03 -0.1764  0.85996    
## roads             7.4108e-03  6.8824e-03  1.0768  0.28166    
## corruptionIndex  -8.4418e-02  8.8273e-02 -0.9563  0.33898    
## womenRights      -5.5695e-02  4.1350e-02 -1.3469  0.17810    
## armedConflict     2.4287e-01  1.4406e-01  1.6859  0.09190 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

General to specific approach

Our last selected model fixed shows that some of the variables are insignificant. We will follow the general to specific approach of removing the most insignificant variable at a time until all variables are significant.

model_1 <- fixed
coeftest(model_1, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value Pr(>|t|)    
## openness          1.7060e+00  3.6635e-01  4.6568 3.34e-06 ***
## electricityIndex  1.4531e-04  5.6513e-05  2.5712  0.01018 *  
## internetCover    -8.9998e-04  5.1010e-03 -0.1764  0.85996    
## roads             7.4108e-03  6.8824e-03  1.0768  0.28166    
## corruptionIndex  -8.4418e-02  8.8273e-02 -0.9563  0.33898    
## womenRights      -5.5695e-02  4.1350e-02 -1.3469  0.17810    
## armedConflict     2.4287e-01  1.4406e-01  1.6859  0.09190 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing internetCover

model_2 <-plm(gdpGrowthRate~openness+electricityIndex + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, 
             index=c("country", "Year"), model="within")

coeftest(model_2, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value  Pr(>|t|)    
## openness          1.6887e+00  3.6635e-01  4.6094 4.192e-06 ***
## electricityIndex  1.3867e-04  5.6513e-05  2.4538   0.01419 *  
## roads             7.3718e-03  6.8824e-03  1.0711   0.28420    
## corruptionIndex  -8.5237e-02  8.8273e-02 -0.9656   0.33431    
## womenRights      -5.5933e-02  4.1350e-02 -1.3527   0.17626    
## armedConflict     2.4364e-01  1.4406e-01  1.6912   0.09089 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing corruptionIndex

model_3 <-plm(gdpGrowthRate~openness+electricityIndex + roads + womenRights + armedConflict, data=data_countries,index=c("country", "Year"), model="within")

coeftest(model_3, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value  Pr(>|t|)    
## openness          1.7057e+00  3.6635e-01  4.6559 3.354e-06 ***
## electricityIndex  1.4117e-04  5.6513e-05  2.4981   0.01254 *  
## roads             7.6505e-03  6.8824e-03  1.1116   0.26639    
## womenRights      -5.6258e-02  4.1350e-02 -1.3605   0.17376    
## armedConflict     2.4778e-01  1.4406e-01  1.7200   0.08553 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing roads

model_4 <-plm(gdpGrowthRate~openness+electricityIndex + womenRights + armedConflict, data=data_countries,index=c("country", "Year"), model="within")

coeftest(model_4, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                     Estimate  Std. Error t value Pr(>|t|)    
## openness          1.7420e+00  3.6635e-01  4.7550 2.07e-06 ***
## electricityIndex  1.4992e-04  5.6513e-05  2.6529 0.008018 ** 
## womenRights      -5.1531e-02  4.1350e-02 -1.2462 0.212780    
## armedConflict     2.4265e-01  1.4406e-01  1.6844 0.092197 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After removing womenRights

model_5 <-plm(gdpGrowthRate~openness+electricityIndex + armedConflict, data=data_countries,index=c("country", "Year"), model="within")

coeftest(model_5, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                    Estimate Std. Error t value  Pr(>|t|)    
## openness         1.6500e+00 3.6635e-01  4.5040 6.903e-06 ***
## electricityIndex 1.4757e-04 5.6513e-05  2.6113  0.009062 ** 
## armedConflict    2.4352e-01 1.4406e-01  1.6904  0.091044 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_5)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = gdpGrowthRate ~ openness + electricityIndex + armedConflict, 
##     data = data_countries, model = "within", index = c("country", 
##         "Year"))
## 
## Unbalanced Panel: n = 130, T = 5-29, N = 3410
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -10.36182  -1.54941   0.20244   1.88606  10.27775 
## 
## Coefficients:
##                    Estimate Std. Error t-value  Pr(>|t|)    
## openness         1.6500e+00 3.6663e-01  4.5005 7.016e-06 ***
## electricityIndex 1.4757e-04 5.2087e-05  2.8332  0.004637 ** 
## armedConflict    2.4352e-01 1.4740e-01  1.6521  0.098613 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    29085
## Residual Sum of Squares: 28755
## R-Squared:      0.011346
## Adj. R-Squared: -0.028478
## F-statistic: 12.5353 on 3 and 3277 DF, p-value: 3.791e-08

Testing for non-linear relationships

Using scatter plots of the dependent variable against each of the independent variables which are kept to the last stage, we can check if there is a non linear relationship and compare the results if any.

Before that, however, we need to aggregate the data by country to be able to take a single data point from each. The mean is to be used for aggregation.

data_countries %>%
    group_by(country) %>% 
    summarise_each(funs(mean)) -> data_countries_agg

openness

ggplot(data_countries_agg, aes(x=openness, y=gdpGrowthRate)) +
  geom_line()

No clear relationship

electricityIndex

ggplot(data_countries_agg, aes(x=electricityIndex, y=gdpGrowthRate)) +
  geom_line()

> No clear relationship

armedConflict

ggplot(data_countries_agg, aes(x=armedConflict, y=gdpGrowthRate)) +
  geom_line()

> No clear relationship

Non linear relationships 2

We do not see any obvious non linear relationship between the tested variables. Since we believe opening the market to the foreign markets can accelerate the growth rate more than linearly, we decided to test the second order of openness.

data_countries$sq_openness <- (data_countries$openness)^2
model_6 <-plm(gdpGrowthRate~sq_openness+electricityIndex + armedConflict, data=data_countries,index=c("country", "Year"), model="within")

coeftest(model_6, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
## 
## t test of coefficients:
## 
##                    Estimate Std. Error t value Pr(>|t|)   
## electricityIndex 1.5080e-04 5.6513e-05  2.6685 0.007657 **
## armedConflict    2.7371e-01 1.4406e-01  1.9000 0.057520 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_6)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = gdpGrowthRate ~ sq_openness + electricityIndex + 
##     armedConflict, data = data_countries, model = "within", index = c("country", 
##     "Year"))
## 
## Unbalanced Panel: n = 130, T = 5-29, N = 3410
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -10.38274  -1.54090   0.21082   1.89785  10.24891 
## 
## Coefficients:
##                    Estimate Std. Error t-value  Pr(>|t|)    
## sq_openness      1.1536e+00 2.7135e-01  4.2515 2.182e-05 ***
## electricityIndex 1.5080e-04 5.2058e-05  2.8968  0.003794 ** 
## armedConflict    2.7371e-01 1.4685e-01  1.8639  0.062425 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    29085
## Residual Sum of Squares: 28774
## R-Squared:      0.010692
## Adj. R-Squared: -0.029158
## F-statistic: 11.8052 on 3 and 3277 DF, p-value: 1.0887e-07

Results

stargazer(model_1, model_2, model_3, model_4, model_5, model_6, type="text", df=FALSE)
## 
## ===========================================================================
##                                     Dependent variable:                    
##                  ----------------------------------------------------------
##                                        gdpGrowthRate                       
##                    (1)       (2)       (3)       (4)       (5)       (6)   
## ---------------------------------------------------------------------------
## openness         1.706*** 1.689***  1.706***  1.742***  1.650***           
##                  (0.390)   (0.376)   (0.376)   (0.374)   (0.367)           
##                                                                            
## sq_openness                                                       1.154*** 
##                                                                    (0.271) 
##                                                                            
## electricityIndex 0.0001** 0.0001*** 0.0001*** 0.0001*** 0.0001*** 0.0002***
##                  (0.0001) (0.0001)  (0.0001)  (0.0001)  (0.0001)  (0.0001) 
##                                                                            
## internetCover     -0.001                                                   
##                  (0.005)                                                   
##                                                                            
## roads             0.007     0.007     0.008                                
##                  (0.007)   (0.007)   (0.007)                               
##                                                                            
## corruptionIndex   -0.084   -0.085                                          
##                  (0.084)   (0.083)                                         
##                                                                            
## womenRights       -0.056   -0.056    -0.056    -0.052                      
##                  (0.042)   (0.042)   (0.042)   (0.042)                     
##                                                                            
## armedConflict     0.243*   0.244*    0.248*    0.243*    0.244*    0.274*  
##                  (0.148)   (0.148)   (0.147)   (0.147)   (0.147)   (0.147) 
##                                                                            
## ---------------------------------------------------------------------------
## Observations      3,410     3,410     3,410     3,410     3,410     3,410  
## R2                0.012     0.012     0.012     0.012     0.011     0.011  
## Adjusted R2       -0.029   -0.028    -0.028    -0.028    -0.028    -0.029  
## F Statistic      5.905*** 6.886***  8.055***  9.782***  12.535*** 11.805***
## ===========================================================================
## Note:                                           *p<0.1; **p<0.05; ***p<0.01

Residual Diagnostic

Comparing the last two models with all significant variables, we see that model 5 has better adjusted R squared value. Let’s take model 5 as out final model and test the residuals of it.

Residual plot for model 5.

residuals_model5 <- as.data.frame(residuals(model_5) )
names(residuals_model5) <- "residual"

ggplot(residuals_model5, aes(x=residual)) + 
  geom_histogram(aes(y=..density..), colour="black", fill="white")+
  geom_density(alpha=.2, fill="#FF6666") + 
  labs(title="Histogram of residual")

Residual normality test

We can test the normality of error term using Jarque Bera test. It assumes the null hypothesis as the data are from a normal distribution.

library(akima)
jbTest(residuals(model_5))
## 
## Title:
##  Jarque - Bera Normality Test
## 
## Test Results:
##   PARAMETER:
##     Sample Size: 3410
##   STATISTIC:
##     LM: 117.404
##     ALM: 118.109
##   P VALUE:
##     Asymptotic: < 2.2e-16 
## 
## Description:
##  Thu Jun 03 23:57:02 2021 by user: tesfa

We see that residuals are not normally distributed, however,from observation of the residual plot above, it definitely looks like a normal distribution. We agreed to ignore this test result.

Findings

The chosen model can be summarized by the following empirical formula:

\[ gdpGrowthRate = 1.65*openness + 1.47* 10^-4*electricityIndex + 0.243*armedConflict \] This is inline with our expectation except that some variables like womenRights, corruptionIndex were found out to be insignificant. Accordingly, we can draw relations like:

  1. As openness increases by 1 unit, the gdpGrowthRate increases by 1.65 ceteris paribus.
  2. For every additional 10,000 MW of electrical energy a country produces, the gdp grows by 1.47 units ceteris paribus.
  3. As armed conflict index increares by 1, the country’s gdpGrowthRate decreases by 0.243. (remember that “armedConflict” values are negative)

The R^2 statistics for the chosen model is barely above 1%. The model explains very little of the variations in gdpGrowthRate.