Abstract
This document develops a panel data model of countries’ development indicators. The purpose of this research was to identify the factors that most strongly determine GDP growth. First, the dataset is analyzed briefly. Then the best-fitting model is selected on the basis of standard test results, and the chosen model is evaluated numerically and empirically. The resulting model shows that the GDP growth rate depends strongly on a country’s openness to the world and on the presence or absence of active armed conflict. Some variables, such as the corruption index, were surprisingly found insignificant; however, we recognize that this could be an indication of a weak model.
Our research considers factors believed to affect the GDP growth rate in developing countries, such as internet coverage, the corruption index, openness, women’s rights, an electricity or power index, the existence of armed conflicts, and the percentage of paved roads. In this research, we test the following hypotheses using a panel data model.
Hypotheses to test:
In this document we will analyze countries’ GDP growth rates using a panel data model. Panel data models provide information on individual behavior both across individuals and over time; in other words, the data have a time dimension and a cross-sectional dimension. We will follow these steps: a brief exploratory analysis of the data, selection of the best-fitting panel model based on standard test results, and a numerical and empirical evaluation of the chosen model.
Economists have used both theory and empirical research to explain the causes of economic growth. Solow, Swan, and Romer provided the theoretical frameworks on which most later work is based.
[5] Robert Barro (1996) studied a panel of 100 countries from 1960 to 1990 to identify the factors that affected their economic growth. He found that the growth rate of real per capita GDP was associated with maintenance of the rule of law, smaller government consumption, longer life expectancy, more male secondary and higher schooling, lower fertility rates, higher levels of investment, the level of democracy, a lower inflation rate, and openness to trade. Conversely, the prevalence of traps such as conflicts or wars, rent seeking on natural resources, dependence on a single neighboring country, and lack of the rule of law hinders GDP growth (Collier, 2007). Other economists, such as Caves (1971), found a positive correlation between the productivity of a multinational enterprise and labor productivity in domestic firms in the same industry; he attributed this to the competition and continuous improvement that foreign investment brings to the domestic market. Foreign direct investment benefits a host country through added employment, new technology, and the transfer of knowledge, and it also increases the volume of domestic investment (Borensztein, De Gregorio, and Lee 1998).
The source data, available Here, contain various development indicators for more than 135 countries between 1980 and 2008. We selected variables that we believed could directly influence a country’s GDP growth rate; the dependent variable to be regressed is the GDP growth rate. The data come from the World Bank’s development indicators for the years mentioned above.
The description of the variables is given in Figure 1.
Figure 1: data description
Install and load the necessary packages
requiredPackages <- c("aTSA", "verification", "olsrr", "DescTools", "caret", "tibble", "purrr",
                      "corrplot", "dbplyr", "dplyr", "readr", "ggplot2", "glmnet", "plm", "Formula",
                      "stargazer", "MASS", "sandwich", "zoo", "car", "lmtest", "aod",
                      "PerformanceAnalytics", "dummies", "fBasics", "splines", "akima")
# Install any missing packages, then load them all
for(i in requiredPackages){if(!require(i, character.only = TRUE)) install.packages(i); library(i, character.only = TRUE)}
Load the data.
data_countries <- read.csv("CountriesDataTrue.csv", header = TRUE)
head(data_countries,6) # the top 6 rows
## ï..Country Year gdpGrowthRate R.D openness electricityIndex internetCover
## 1 Angola 1980 2.2087276 . 0.5963767 64.42574 0
## 2 Angola 1981 -0.1324745 . 0.6993147 62.25248 0
## 3 Angola 1982 1.2464611 . 0.5534043 64.08428 0
## 4 Angola 1983 1.7701183 . 0.5312528 67.18925 0
## 5 Angola 1984 1.7846867 . 0.5857161 55.51948 0
## 6 Angola 1985 2.3214330 . 0.5402753 64.72874 0
## roads corruptionIndex womenRights armedConflict
## 1 11.8110 0.33 4 -2
## 2 10.9276 1.74 5 -2
## 3 12.0430 1.53 4 -2
## 4 9.6743 1.74 5 -2
## 5 12.3564 1.94 4 -2
## 6 12.3244 1.87 5 -2
str(data_countries)
## 'data.frame': 3829 obs. of 11 variables:
## $ ï..Country : chr "Angola" "Angola" "Angola" "Angola" ...
## $ Year : int 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 ...
## $ gdpGrowthRate : num 2.209 -0.132 1.246 1.77 1.785 ...
## $ R.D : chr "." "." "." "." ...
## $ openness : num 0.596 0.699 0.553 0.531 0.586 ...
## $ electricityIndex: num 64.4 62.3 64.1 67.2 55.5 ...
## $ internetCover : num 0 0 0 0 0 0 0 0 0 0 ...
## $ roads : num 11.81 10.93 12.04 9.67 12.36 ...
## $ corruptionIndex : num 0.33 1.74 1.53 1.74 1.94 1.87 1.81 1.93 1.7 1.86 ...
## $ womenRights : int 4 5 4 5 4 5 5 5 5 5 ...
## $ armedConflict : int -2 -2 -2 -2 -2 -2 -2 -2 -2 -2 ...
Below we perform some exploratory data analysis to get a general picture of the data.
We notice that the column R.D has many entries filled with “.”. Let’s see what fraction of the column is missing.
sum(data_countries$R.D== ".")/length(data_countries$R.D)
## [1] 0.2953774
About 30% of the R.D column is missing. Therefore, we remove the column from the analysis.
data_countries = data_countries[,-4] # omit the R&D column
Just like missing values, outliers need to be identified and handled appropriately for better model accuracy.
Let’s start by analyzing basic statistics.
summary(data_countries)
## ï..Country Year gdpGrowthRate openness
## Length:3829 Min. :1980 Min. :-50.248 Min. :0.0000
## Class :character 1st Qu.:1987 1st Qu.: 1.522 1st Qu.:0.3371
## Mode :character Median :1994 Median : 3.730 Median :0.5070
## Mean :1994 Mean : 3.615 Mean :0.6046
## 3rd Qu.:2001 3rd Qu.: 6.063 3rd Qu.:0.7207
## Max. :2008 Max. : 35.224 Max. :9.8665
## NA's :1 NA's :1
## electricityIndex internetCover roads corruptionIndex
## Min. : 0.0 Min. : 0.000 Min. : 0.00 Min. : 0.000
## 1st Qu.: 586.7 1st Qu.: 0.000 1st Qu.: 17.55 1st Qu.: 2.710
## Median : 1368.5 Median : 0.010 Median : 37.98 Median : 3.660
## Mean : 2985.4 Mean : 6.254 Mean : 47.79 Mean : 4.329
## 3rd Qu.: 3981.2 3rd Qu.: 2.910 3rd Qu.: 80.60 3rd Qu.: 5.410
## Max. :36852.5 Max. :90.000 Max. :100.00 Max. :10.000
##
## womenRights armedConflict
## Min. :0.00 Min. :-2.0000
## 1st Qu.:3.00 1st Qu.: 0.0000
## Median :4.00 Median : 0.0000
## Mean :3.98 Mean :-0.2142
## 3rd Qu.:5.00 3rd Qu.: 0.0000
## Max. :9.00 Max. : 0.0000
##
# glimpse(data_countries)
Box plots of the variables can reveal outliers, if any.
library(reshape)
meltData <- melt(data_countries)
p <- ggplot(meltData, aes(factor(variable), value))
p + geom_boxplot() + facet_wrap(~variable, scale="free")
As we can see, the variables “gdpGrowthRate”, “openness”, “womenRights”, and “armedConflict” have outliers. We will remove the affected observations in the next stage.
outlier_gdpGrowthRate <- boxplot(data_countries$gdpGrowthRate,data = data_countries)$out
outlier_openness <- boxplot(data_countries$openness,data = data_countries)$out
outlier_womenRights <- boxplot(data_countries$womenRights,data = data_countries)$out
Since the variable “armedConflict” has only 3 levels, it is better not to remove any of its values.
data_countries <- data_countries[!((data_countries$gdpGrowthRate %in% outlier_gdpGrowthRate) |(data_countries$openness %in% outlier_openness) |(data_countries$womenRights %in% outlier_womenRights)),]
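For reference, the same flagging can be done without drawing plots. Below is a minimal sketch of the (approximately) equivalent rule; note that boxplot() uses hinges, so the results can differ marginally from quantile()-based quartiles.
# Sketch: flag values more than k * IQR beyond the quartiles (k = 1.5 matches boxplot's default)
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE)
  x < q[1] - k * diff(q) | x > q[2] + k * diff(q)
}
sum(iqr_outliers(data_countries$gdpGrowthRate), na.rm = TRUE) # count of flagged rows after cleaning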
After cleaning the data, it is time to take a closer look at the dependent variable, gdpGrowthRate. Let’s plot it first.
ggplot(data_countries,
aes(x = gdpGrowthRate)) +
geom_histogram(fill = "blue",
bins = 100) +
theme_bw()
The dependent variable is nearly symmetric, which makes it easier to model. Therefore, there is no need for a log transformation.
Variables that show no variation across observations should be excluded from the model. Let’s find out whether we have any such variable.
sapply(data_countries,
function(x)
unique(x) %>%
length()) %>%
sort()
## armedConflict womenRights Year ï..Country
## 3 9 30 131
## corruptionIndex internetCover gdpGrowthRate roads
## 736 966 1685 2628
## openness electricityIndex
## 3411 3411
Another way to check this is:
vars_selected <-names(data_countries)
var_to_remove <- nearZeroVar(data_countries,
names = TRUE)
var_to_remove
## character(0)
We keep all variables because none has a zero/near zero variance.
Variables that are linear combinations of other variables should be removed: they add no new information and only complicate the estimation.
( findLinearCombos(data_countries[,-c(1,2,3)]) ->
data_countries_linearCombos )
## $linearCombos
## list()
##
## $remove
## NULL
We do not have variables that are linear combinations of each other.
All variables should be stationary to avoid spurious regression results. We use the augmented Dickey-Fuller (ADF) test, whose null hypothesis is the presence of a unit root: if the p-value is below 0.05, we reject the null and treat the series as stationary.
The augmented Dickey-Fuller test results below show that all variables are stationary.
adf.test(data_countries$gdpGrowthRate)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -22.48 0.01
## [2,] 1 -15.22 0.01
## [3,] 2 -11.71 0.01
## [4,] 3 -10.34 0.01
## [5,] 4 -9.00 0.01
## [6,] 5 -7.81 0.01
## [7,] 6 -6.76 0.01
## [8,] 7 -6.04 0.01
## [9,] 8 -5.61 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -38.5 0.01
## [2,] 1 -28.0 0.01
## [3,] 2 -22.7 0.01
## [4,] 3 -20.9 0.01
## [5,] 4 -18.9 0.01
## [6,] 5 -17.0 0.01
## [7,] 6 -15.0 0.01
## [8,] 7 -13.7 0.01
## [9,] 8 -13.0 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -38.6 0.01
## [2,] 1 -28.1 0.01
## [3,] 2 -22.7 0.01
## [4,] 3 -21.0 0.01
## [5,] 4 -19.0 0.01
## [6,] 5 -17.0 0.01
## [7,] 6 -15.1 0.01
## [8,] 7 -13.8 0.01
## [9,] 8 -13.0 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$openness)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -6.12 0.01
## [2,] 1 -5.67 0.01
## [3,] 2 -5.30 0.01
## [4,] 3 -5.03 0.01
## [5,] 4 -5.00 0.01
## [6,] 5 -4.88 0.01
## [7,] 6 -4.72 0.01
## [8,] 7 -4.77 0.01
## [9,] 8 -4.68 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -14.1 0.01
## [2,] 1 -13.3 0.01
## [3,] 2 -12.7 0.01
## [4,] 3 -12.3 0.01
## [5,] 4 -12.4 0.01
## [6,] 5 -12.3 0.01
## [7,] 6 -12.2 0.01
## [8,] 7 -12.5 0.01
## [9,] 8 -12.6 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -14.1 0.01
## [2,] 1 -13.3 0.01
## [3,] 2 -12.7 0.01
## [4,] 3 -12.3 0.01
## [5,] 4 -12.4 0.01
## [6,] 5 -12.4 0.01
## [7,] 6 -12.2 0.01
## [8,] 7 -12.6 0.01
## [9,] 8 -12.6 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$electricityIndex)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -7.66 0.01
## [2,] 1 -7.45 0.01
## [3,] 2 -7.28 0.01
## [4,] 3 -7.25 0.01
## [5,] 4 -7.35 0.01
## [6,] 5 -7.50 0.01
## [7,] 6 -7.47 0.01
## [8,] 7 -7.46 0.01
## [9,] 8 -7.43 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -9.46 0.01
## [2,] 1 -9.25 0.01
## [3,] 2 -9.08 0.01
## [4,] 3 -9.08 0.01
## [5,] 4 -9.25 0.01
## [6,] 5 -9.48 0.01
## [7,] 6 -9.49 0.01
## [8,] 7 -9.52 0.01
## [9,] 8 -9.54 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -9.46 0.01
## [2,] 1 -9.25 0.01
## [3,] 2 -9.08 0.01
## [4,] 3 -9.08 0.01
## [5,] 4 -9.24 0.01
## [6,] 5 -9.48 0.01
## [7,] 6 -9.49 0.01
## [8,] 7 -9.52 0.01
## [9,] 8 -9.54 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$internetCover)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -15.2 0.01
## [2,] 1 -15.6 0.01
## [3,] 2 -16.1 0.01
## [4,] 3 -16.0 0.01
## [5,] 4 -16.5 0.01
## [6,] 5 -16.6 0.01
## [7,] 6 -15.9 0.01
## [8,] 7 -15.2 0.01
## [9,] 8 -14.9 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -16.4 0.01
## [2,] 1 -16.9 0.01
## [3,] 2 -17.6 0.01
## [4,] 3 -17.7 0.01
## [5,] 4 -18.4 0.01
## [6,] 5 -18.7 0.01
## [7,] 6 -18.1 0.01
## [8,] 7 -17.5 0.01
## [9,] 8 -17.3 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -16.4 0.01
## [2,] 1 -16.9 0.01
## [3,] 2 -17.6 0.01
## [4,] 3 -17.7 0.01
## [5,] 4 -18.4 0.01
## [6,] 5 -18.7 0.01
## [7,] 6 -18.1 0.01
## [8,] 7 -17.5 0.01
## [9,] 8 -17.3 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$roads)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -4.60 0.01
## [2,] 1 -4.44 0.01
## [3,] 2 -4.43 0.01
## [4,] 3 -4.40 0.01
## [5,] 4 -4.46 0.01
## [6,] 5 -4.53 0.01
## [7,] 6 -4.43 0.01
## [8,] 7 -4.42 0.01
## [9,] 8 -4.40 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -8.14 0.01
## [2,] 1 -7.92 0.01
## [3,] 2 -7.95 0.01
## [4,] 3 -7.95 0.01
## [5,] 4 -8.10 0.01
## [6,] 5 -8.28 0.01
## [7,] 6 -8.16 0.01
## [8,] 7 -8.20 0.01
## [9,] 8 -8.23 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -8.15 0.01
## [2,] 1 -7.93 0.01
## [3,] 2 -7.97 0.01
## [4,] 3 -7.96 0.01
## [5,] 4 -8.12 0.01
## [6,] 5 -8.30 0.01
## [7,] 6 -8.18 0.01
## [8,] 7 -8.22 0.01
## [9,] 8 -8.25 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$corruptionIndex)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -4.24 0.01
## [2,] 1 -4.06 0.01
## [3,] 2 -3.94 0.01
## [4,] 3 -3.87 0.01
## [5,] 4 -3.84 0.01
## [6,] 5 -3.78 0.01
## [7,] 6 -3.73 0.01
## [8,] 7 -3.71 0.01
## [9,] 8 -3.68 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -9.55 0.01
## [2,] 1 -9.18 0.01
## [3,] 2 -9.01 0.01
## [4,] 3 -8.92 0.01
## [5,] 4 -8.93 0.01
## [6,] 5 -8.87 0.01
## [7,] 6 -8.84 0.01
## [8,] 7 -8.88 0.01
## [9,] 8 -8.91 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -9.56 0.01
## [2,] 1 -9.18 0.01
## [3,] 2 -9.02 0.01
## [4,] 3 -8.92 0.01
## [5,] 4 -8.93 0.01
## [6,] 5 -8.87 0.01
## [7,] 6 -8.84 0.01
## [8,] 7 -8.88 0.01
## [9,] 8 -8.91 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$womenRights)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -7.19 0.01
## [2,] 1 -6.14 0.01
## [3,] 2 -4.99 0.01
## [4,] 3 -4.67 0.01
## [5,] 4 -4.68 0.01
## [6,] 5 -4.63 0.01
## [7,] 6 -4.69 0.01
## [8,] 7 -4.58 0.01
## [9,] 8 -4.45 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -16.5 0.01
## [2,] 1 -14.4 0.01
## [3,] 2 -11.8 0.01
## [4,] 3 -11.2 0.01
## [5,] 4 -11.4 0.01
## [6,] 5 -11.5 0.01
## [7,] 6 -11.8 0.01
## [8,] 7 -11.7 0.01
## [9,] 8 -11.6 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -16.5 0.01
## [2,] 1 -14.4 0.01
## [3,] 2 -11.9 0.01
## [4,] 3 -11.3 0.01
## [5,] 4 -11.5 0.01
## [6,] 5 -11.5 0.01
## [7,] 6 -11.9 0.01
## [8,] 7 -11.8 0.01
## [9,] 8 -11.6 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
adf.test(data_countries$armedConflict)
## Augmented Dickey-Fuller Test
## alternative: stationary
##
## Type 1: no drift no trend
## lag ADF p.value
## [1,] 0 -17.7 0.01
## [2,] 1 -14.0 0.01
## [3,] 2 -12.9 0.01
## [4,] 3 -12.0 0.01
## [5,] 4 -11.3 0.01
## [6,] 5 -11.1 0.01
## [7,] 6 -11.1 0.01
## [8,] 7 -11.1 0.01
## [9,] 8 -10.9 0.01
## Type 2: with drift no trend
## lag ADF p.value
## [1,] 0 -19.3 0.01
## [2,] 1 -15.4 0.01
## [3,] 2 -14.2 0.01
## [4,] 3 -13.3 0.01
## [5,] 4 -12.5 0.01
## [6,] 5 -12.4 0.01
## [7,] 6 -12.4 0.01
## [8,] 7 -12.5 0.01
## [9,] 8 -12.3 0.01
## Type 3: with drift and trend
## lag ADF p.value
## [1,] 0 -19.3 0.01
## [2,] 1 -15.4 0.01
## [3,] 2 -14.2 0.01
## [4,] 3 -13.3 0.01
## [5,] 4 -12.5 0.01
## [6,] 5 -12.4 0.01
## [7,] 6 -12.4 0.01
## [8,] 7 -12.5 0.01
## [9,] 8 -12.3 0.01
## ----
## Note: in fact, p.value = 0.01 means p.value <= 0.01
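For completeness, the same conclusion can be reached more compactly. The sketch below assumes the tseries package (not part of the list above) is installed; its adf.test() returns an htest object, so the p-values can be collected into one named vector rather than printing a full table per variable.
# Sketch: one ADF p-value per numeric variable (the tseries:: prefix avoids masking by aTSA)
num_vars <- c("gdpGrowthRate", "openness", "electricityIndex", "internetCover",
              "roads", "corruptionIndex", "womenRights", "armedConflict")
sapply(num_vars, function(v) tseries::adf.test(na.omit(data_countries[[v]]))$p.value)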
Fixed effects models are used when the interest is in analyzing the impact of variables that vary over time. To find out whether a fixed or a random effects model better fits our data, we use the Hausman test, which compares two sets of estimates: those of the fixed and the random effects models.
names(data_countries)[names(data_countries) == "ï..Country"] <- "country"
fixed <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries,
index=c("country", "Year"), model="within")
random <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries,
index=c("country", "Year"), model="random")
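Before testing, it can be useful to inspect the panel structure itself. A short sketch using plm’s pdata.frame() and pdim():
# Sketch: number of countries, years observed per country, and whether the panel is balanced
panel <- pdata.frame(data_countries, index = c("country", "Year"))
pdim(panel)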
Hausman’s test
phtest(fixed, random)
##
## Hausman Test
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + ...
## chisq = 57.858, df = 7, p-value = 4.036e-10
## alternative hypothesis: one model is inconsistent
Since the p-value is below the 5% threshold, we reject the null hypothesis that the random effects model is consistent (and thus preferred), and we continue with the fixed effects model in the following steps.
We can check whether the model can be reduced to a simple regression with no panel effects using a poolability test. pFtest() computes an F test of effects based on the comparison of the within and the pooling models.
OLS <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries,
index=c("country", "Year"), model="pooling")
pFtest(fixed, OLS)
##
## F test for individual effects
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + ...
## F = 7.2633, df1 = 129, df2 = 3273, p-value < 2.2e-16
## alternative hypothesis: significant effects
We reject the null hypothesis that there is no panel effect.
To test if individual effect, time effect or both should be considered in the model, we can use pFtest().
# individual effect
individual <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "individual",
index=c("country", "Year"), model="within")
# time effect
time <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "time",
index=c("country", "Year"), model="within")
# individual and time effect
individual_time <-plm(gdpGrowthRate~openness+electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict, data=data_countries, effect = "twoways",
index=c("country", "Year"), model="within")
Below pFtest() is used to compare the base model against each of them.
Testing individual effects; the null is that no individual effects are needed. (Note that the baseline fixed model already uses effect = "individual" by default, so this comparison is degenerate and returns NaN.)
pFtest(individual, fixed)
##
## F test for individual effects
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + ...
## F = NaN, df1 = 0, df2 = 3273, p-value = NA
## alternative hypothesis: significant effects
Testing time-fixed effects; the null is that no time-fixed effects are needed. (The time-effects model is not nested within the individual-effects baseline, which is why this comparison yields a negative df1 and no p-value.)
pFtest(time, fixed)
##
## F test for time effects
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + ...
## F = 5.4096, df1 = -101, df2 = 3374, p-value = NA
## alternative hypothesis: significant effects
Testing individual and time effects. The null is that no individual and time effects are needed
pFtest(individual_time, fixed)
##
## F test for twoways effects
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + ...
## F = 10.325, df1 = 28, df2 = 3245, p-value < 2.2e-16
## alternative hypothesis: significant effects
Since the p-value is below 5%, we reject the null hypothesis of no individual and time effects; both effects are supported by the data.
The Breusch–Godfrey test for panel models can be used to check whether there is autocorrelation in the residuals of the model. The null hypothesis H0 is that there is no serial correlation of any order up to p.
pbgtest(fixed) #Breusch–Godfrey Test
##
## Breusch-Godfrey/Wooldridge test for serial correlation in panel models
##
## data: gdpGrowthRate ~ openness + electricityIndex + internetCover + roads + corruptionIndex + womenRights + armedConflict
## chisq = 190.72, df = 5, p-value < 2.2e-16
## alternative hypothesis: serial correlation in idiosyncratic errors
We reject the null that there is no serial correlation. We will need to use a robust method that handles autocorrelation.
The Breusch-Pagan test is used to check whether the residuals are heteroskedastic.
Its null hypothesis is homoskedasticity: if the p-value is below 0.05, the residuals are heteroskedastic.
bptest(fixed, studentize=T) #Breusch-Pagan Test
##
## studentized Breusch-Pagan test
##
## data: fixed
## BP = 49.639, df = 7, p-value = 1.7e-08
Since the p-value is below 5%, the null hypothesis is rejected: the residuals are heteroskedastic.
Since our model exhibits both heteroskedasticity and autocorrelation, we use a robust covariance matrix to account for them. The corresponding parameters are applied in the code below.
(coeff<- coeftest(fixed, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group")))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.7060e+00 3.6635e-01 4.6568 3.34e-06 ***
## electricityIndex 1.4531e-04 5.6513e-05 2.5712 0.01018 *
## internetCover -8.9998e-04 5.1010e-03 -0.1764 0.85996
## roads 7.4108e-03 6.8824e-03 1.0768 0.28166
## corruptionIndex -8.4418e-02 8.8273e-02 -0.9563 0.33898
## womenRights -5.5695e-02 4.1350e-02 -1.3469 0.17810
## armedConflict 2.4287e-01 1.4406e-01 1.6859 0.09190 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
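One caveat on the estimator choice: in plm, vcovHC() with method = "white1" corrects for heteroskedasticity but assumes no serial correlation, whereas the default "arellano" method is also robust to serial correlation within groups. A sketch of that alternative:
# Sketch: Arellano clustered covariance, robust to heteroskedasticity and to
# arbitrary serial correlation within each country
coeftest(fixed, vcov. = vcovHC(fixed, method = "arellano", type = "HC0", cluster = "group"))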
Our selected model, fixed, shows that some of the variables are insignificant. We follow a general-to-specific approach, removing the least significant variable (the one with the highest p-value) one at a time until all remaining variables are significant.
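For reference, this elimination procedure can also be automated. The sketch below assumes a 10% significance threshold (the level at which armedConflict is retained below) and recomputes the robust covariance from each reduced model.
# Sketch of the general-to-specific loop: refit, drop the variable with the
# highest robust p-value, stop once every remaining variable is significant
vars <- c("openness", "electricityIndex", "internetCover", "roads",
          "corruptionIndex", "womenRights", "armedConflict")
repeat {
  f <- as.formula(paste("gdpGrowthRate ~", paste(vars, collapse = " + ")))
  mod <- plm(f, data = data_countries, index = c("country", "Year"), model = "within")
  ct <- coeftest(mod, vcov. = vcovHC(mod, method = "white1", type = "HC0", cluster = "group"))
  pv <- ct[, "Pr(>|t|)"]
  if (max(pv) < 0.10) break
  vars <- setdiff(vars, names(which.max(pv)))
}
names(coef(mod)) # variables retained by the automated search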
model_1 <- fixed
coeftest(model_1, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.7060e+00 3.6635e-01 4.6568 3.34e-06 ***
## electricityIndex 1.4531e-04 5.6513e-05 2.5712 0.01018 *
## internetCover -8.9998e-04 5.1010e-03 -0.1764 0.85996
## roads 7.4108e-03 6.8824e-03 1.0768 0.28166
## corruptionIndex -8.4418e-02 8.8273e-02 -0.9563 0.33898
## womenRights -5.5695e-02 4.1350e-02 -1.3469 0.17810
## armedConflict 2.4287e-01 1.4406e-01 1.6859 0.09190 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing internetCover
model_2 <-plm(gdpGrowthRate~openness+electricityIndex + roads + corruptionIndex + womenRights + armedConflict, data=data_countries,
index=c("country", "Year"), model="within")
coeftest(model_2, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.6887e+00 3.6635e-01 4.6094 4.192e-06 ***
## electricityIndex 1.3867e-04 5.6513e-05 2.4538 0.01419 *
## roads 7.3718e-03 6.8824e-03 1.0711 0.28420
## corruptionIndex -8.5237e-02 8.8273e-02 -0.9656 0.33431
## womenRights -5.5933e-02 4.1350e-02 -1.3527 0.17626
## armedConflict 2.4364e-01 1.4406e-01 1.6912 0.09089 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing corruptionIndex
model_3 <-plm(gdpGrowthRate~openness+electricityIndex + roads + womenRights + armedConflict, data=data_countries,index=c("country", "Year"), model="within")
coeftest(model_3, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.7057e+00 3.6635e-01 4.6559 3.354e-06 ***
## electricityIndex 1.4117e-04 5.6513e-05 2.4981 0.01254 *
## roads 7.6505e-03 6.8824e-03 1.1116 0.26639
## womenRights -5.6258e-02 4.1350e-02 -1.3605 0.17376
## armedConflict 2.4778e-01 1.4406e-01 1.7200 0.08553 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing roads
model_4 <-plm(gdpGrowthRate~openness+electricityIndex + womenRights + armedConflict, data=data_countries,index=c("country", "Year"), model="within")
coeftest(model_4, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.7420e+00 3.6635e-01 4.7550 2.07e-06 ***
## electricityIndex 1.4992e-04 5.6513e-05 2.6529 0.008018 **
## womenRights -5.1531e-02 4.1350e-02 -1.2462 0.212780
## armedConflict 2.4265e-01 1.4406e-01 1.6844 0.092197 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After removing womenRights
model_5 <-plm(gdpGrowthRate~openness+electricityIndex + armedConflict, data=data_countries,index=c("country", "Year"), model="within")
coeftest(model_5, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## openness 1.6500e+00 3.6635e-01 4.5040 6.903e-06 ***
## electricityIndex 1.4757e-04 5.6513e-05 2.6113 0.009062 **
## armedConflict 2.4352e-01 1.4406e-01 1.6904 0.091044 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_5)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = gdpGrowthRate ~ openness + electricityIndex + armedConflict,
## data = data_countries, model = "within", index = c("country",
## "Year"))
##
## Unbalanced Panel: n = 130, T = 5-29, N = 3410
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -10.36182 -1.54941 0.20244 1.88606 10.27775
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## openness 1.6500e+00 3.6663e-01 4.5005 7.016e-06 ***
## electricityIndex 1.4757e-04 5.2087e-05 2.8332 0.004637 **
## armedConflict 2.4352e-01 1.4740e-01 1.6521 0.098613 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 29085
## Residual Sum of Squares: 28755
## R-Squared: 0.011346
## Adj. R-Squared: -0.028478
## F-statistic: 12.5353 on 3 and 3277 DF, p-value: 3.791e-08
Using plots of the dependent variable against each of the independent variables retained to the last stage, we can check whether there is a non-linear relationship.
Before that, however, we need to aggregate the data by country so that each country contributes a single data point. The mean is used for aggregation.
data_countries %>%
group_by(country) %>%
summarise(across(everything(), mean)) -> data_countries_agg # summarise_each() is deprecated; across() is the modern equivalent
ggplot(data_countries_agg, aes(x=openness, y=gdpGrowthRate)) +
geom_line()
No clear relationship
ggplot(data_countries_agg, aes(x=electricityIndex, y=gdpGrowthRate)) +
geom_line()
No clear relationship
ggplot(data_countries_agg, aes(x=armedConflict, y=gdpGrowthRate)) +
geom_line()
No clear relationship
We do not see any obvious non-linear relationship for the tested variables. Still, since we believe that opening the market to foreign trade can accelerate the growth rate more than linearly, we decided to test the second order (square) of openness.
data_countries$sq_openness <- (data_countries$openness)^2
model_6 <-plm(gdpGrowthRate~sq_openness+electricityIndex + armedConflict, data=data_countries,index=c("country", "Year"), model="within")
coeftest(model_6, vcov.=vcovHC(fixed, method="white1", type="HC0", cluster="group"))
##
## t test of coefficients:
##
## Estimate Std. Error t value Pr(>|t|)
## electricityIndex 1.5080e-04 5.6513e-05 2.6685 0.007657 **
## armedConflict 2.7371e-01 1.4406e-01 1.9000 0.057520 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
summary(model_6)
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = gdpGrowthRate ~ sq_openness + electricityIndex +
## armedConflict, data = data_countries, model = "within", index = c("country",
## "Year"))
##
## Unbalanced Panel: n = 130, T = 5-29, N = 3410
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -10.38274 -1.54090 0.21082 1.89785 10.24891
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## sq_openness 1.1536e+00 2.7135e-01 4.2515 2.182e-05 ***
## electricityIndex 1.5080e-04 5.2058e-05 2.8968 0.003794 **
## armedConflict 2.7371e-01 1.4685e-01 1.8639 0.062425 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 29085
## Residual Sum of Squares: 28774
## R-Squared: 0.010692
## Adj. R-Squared: -0.029158
## F-statistic: 11.8052 on 3 and 3277 DF, p-value: 1.0887e-07
stargazer(model_1, model_2, model_3, model_4, model_5, model_6, type="text", df=FALSE)
##
## ===========================================================================
## Dependent variable:
## ----------------------------------------------------------
## gdpGrowthRate
## (1) (2) (3) (4) (5) (6)
## ---------------------------------------------------------------------------
## openness 1.706*** 1.689*** 1.706*** 1.742*** 1.650***
## (0.390) (0.376) (0.376) (0.374) (0.367)
##
## sq_openness 1.154***
## (0.271)
##
## electricityIndex 0.0001** 0.0001*** 0.0001*** 0.0001*** 0.0001*** 0.0002***
## (0.0001) (0.0001) (0.0001) (0.0001) (0.0001) (0.0001)
##
## internetCover -0.001
## (0.005)
##
## roads 0.007 0.007 0.008
## (0.007) (0.007) (0.007)
##
## corruptionIndex -0.084 -0.085
## (0.084) (0.083)
##
## womenRights -0.056 -0.056 -0.056 -0.052
## (0.042) (0.042) (0.042) (0.042)
##
## armedConflict 0.243* 0.244* 0.248* 0.243* 0.244* 0.274*
## (0.148) (0.148) (0.147) (0.147) (0.147) (0.147)
##
## ---------------------------------------------------------------------------
## Observations 3,410 3,410 3,410 3,410 3,410 3,410
## R2 0.012 0.012 0.012 0.012 0.011 0.011
## Adjusted R2 -0.029 -0.028 -0.028 -0.028 -0.028 -0.029
## F Statistic 5.905*** 6.886*** 8.055*** 9.782*** 12.535*** 11.805***
## ===========================================================================
## Note: *p<0.1; **p<0.05; ***p<0.01
Comparing the last two models, in both of which the retained variables are significant, we see that model 5 has the better adjusted R-squared value. Let’s take model 5 as our final model and examine its residuals.
Residual plot for model 5.
residuals_model5 <- as.data.frame(residuals(model_5) )
names(residuals_model5) <- "residual"
ggplot(residuals_model5, aes(x=residual)) +
geom_histogram(aes(y=..density..), colour="black", fill="white")+
geom_density(alpha=.2, fill="#FF6666") +
labs(title="Histogram of residual")
We can test the normality of the error term using the Jarque-Bera test, whose null hypothesis is that the data come from a normal distribution.
library(fBasics) # jbTest() is provided by fBasics
jbTest(residuals(model_5))
##
## Title:
## Jarque - Bera Normality Test
##
## Test Results:
## PARAMETER:
## Sample Size: 3410
## STATISTIC:
## LM: 117.404
## ALM: 118.109
## P VALUE:
## Asymptotic: < 2.2e-16
##
## Description:
## Thu Jun 03 23:57:02 2021 by user: tesfa
The test indicates that the residuals are not normally distributed. However, the residual plot above looks approximately normal, and with a sample this large the Jarque-Bera test rejects even small departures from normality, so we agreed to set this result aside.
The chosen model can be summarized by the following empirical formula:
\[ \widehat{gdpGrowthRate} = 1.65 \cdot openness + 1.47 \times 10^{-4} \cdot electricityIndex + 0.243 \cdot armedConflict \] This is in line with our expectations, except that some variables, such as womenRights and corruptionIndex, turned out to be insignificant.
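As a quick illustration, the fitted coefficients can be read directly from model_5 and translated into effect sizes; a small sketch:
# Hypothetical illustration: predicted change in the GDP growth rate (percentage points)
# for a 0.1 increase in openness, holding the other regressors fixed
b <- coef(model_5)
unname(b["openness"]) * 0.1 # about 1.65 * 0.1 = 0.165 percentage points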
The R^2 statistic for the chosen model is barely above 1%, so the model explains very little of the variation in gdpGrowthRate.