Inequality and Economic Growth

Income inequality, commonly measured by the Gini coefficient, is the unevenness of the distribution of income across a population, and it results from several factors (e.g., education, globalization, labor markets, and wealth concentration). Given its growing relevance in today’s society, various studies have examined the relationship between income inequality and other economic variables, such as economic growth, defined as the annual increase in output per capita. This research has produced contrasting results. More importantly, few studies examine Central America in isolation, and those that do tend to focus on political instability and other non-quantifiable factors rather than on statistics and data. This, together with an interest in finding a clearer answer on the relationship between income inequality and economic growth, motivates the research question of this study: What is the impact of income inequality on the economic growth of Central America?

pwt100 <- read.csv("/Users/lourdescortes/Desktop/Marcos Libros/Job Search/data.csv")

The CSV file above was obtained by modifying the Excel file in the repository corresponding to "Load Penn World Tables dataset".

I added the desired columns via simple VBA operations in Excel and from some external data sources. The externally added data points were the Gini coefficient data and the GNI data from World Bank Data.

To obtain the same data points directly in R, see the code below. Remember to then add the Gini coefficient data and the GNI data from World Bank Data.

# alternative way of loading the data using R
library("xlsx")  # provides read.xlsx()
# read the third sheet of the workbook
pwt100 <- read.xlsx("/Users/lourdescortes/Desktop/Marcos Libros/Job Search/pwt100_sub.xlsx",3)

Now we should clean our data set. Normally we would decide on the variables to use after trying a few models. However, since I will use a variation of the model proposed by Forbes, K. J. in the paper “A Reassessment of the Relationship between Inequality and Growth” (2000), I will use the variables in the following model:

$$avg.growth_{t,i} = B_0 + B_1 GINI_{t-1,i} + B_2 Income_{t-1,i} + B_3 HumanCapital_{t-1,i} + B_4 Workers_{t-1,i} + \epsilon$$

for country i at time t {i,t = 1, 2, 3, …}. Thus I will delete the unnecessary columns. The model used here differs from the one proposed in that paper in that we use average annual growth instead of year-on-year economic growth, to minimize the effect of short-run recessions when studying the behavior of economic growth over a long period of time. Moreover, I will use log(GNI) as a proxy for income instead of any other measurement, and total labor force instead of the number of employed as a measurement of workers.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

build_data <- pwt100
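# Clean the data before EDA
# treat zeros as missing values
build_data[build_data==0] <- NA
# keep country, year, avg_t, GINI_T_1, income_t_1, hc_t_1 and workers_t_1
build_data_use <- build_data[ c(4,6,57:61) ]
# drop rows with any missing value
build_data_use <- build_data_use[complete.cases(build_data_use),]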
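The cleaning step above selects columns by position, which silently breaks if the column order of the CSV changes. A minimal sketch of the same three operations with columns selected by name follows; the toy data frame, its values, and the names `toy`, `keep` and `toy_use` are hypothetical and only stand in for `pwt100`.

```r
# Toy stand-in for pwt100 (values are made up); 0 marks a missing entry.
toy <- data.frame(
  country  = c("Panama", "Panama", "Honduras"),
  year     = c(1991, 1992, 1991),
  avg_t    = c(250, 0, 310),
  GINI_T_1 = c(0.58, 0.57, 0.55),
  extra    = c(1, 2, 3)            # a column the model does not need
)

keep <- c("country", "year", "avg_t", "GINI_T_1")  # select by name, not position

toy[toy == 0] <- NA                              # treat zeros as missing
toy_use <- toy[, keep]                           # keep only the model variables
toy_use <- toy_use[complete.cases(toy_use), ]    # drop rows with any NA

toy_use
```

In this toy example the second Panama row is dropped because its `avg_t` was recorded as 0.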

The code below does virtually the same with the original Penn World Tables dataset. The only difference, as mentioned earlier, is that the CSV file we work with here has extra columns, computed from the original dataset, that hold the average annual growth for country i at time t {i,t = 1, 2, 3, …}. We also make sure that all the other variables of interest for country i correspond to time t-1 (i.e., the potential predictors in the row for year t hold their values from year t-1), since we want to uncover the effect that the previous state of the economic variables has on the future state of economic growth. The code below performs only the cleaning, not this lagging step.

library(dplyr)
# getting data from the wanted countries
central_america <- c("Panama","Costa Rica","El Salvador","Nicaragua","Honduras", "Guatemala")
data_pwt100 <- filter(pwt100, country %in% central_america)
# deleting unwanted cols
data_pwt100 <- data_pwt100[ -c(1,3,5:9,12:18,20:46,48:52)]
# deleting unwanted rows (just want years from 1990 onwards)
data_pwt <- filter(data_pwt100, year %in% 1990:2020)
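The lagging and averaging step described above is done in Excel, so its exact formula is not shown. The sketch below is one construction that reproduces the first few Costa Rica values printed by `str(build_data)` further down; it is an assumption, not the author's spreadsheet logic. In particular, measuring growth as the change in `rgdpna` since the first year divided by the number of years in the window, dividing `hc` by 4, and the helper name `lag1` are all guesses made for illustration.

```r
# Toy panel built from the first Costa Rica values shown by str(build_data);
# column names follow the dataset, the construction itself is an assumption.
toy <- data.frame(
  country = "Costa Rica",
  year    = 1990:1994,
  rgdpna  = c(28256, 28895, 31553, 33789, 35305),
  GINI    = c(43.2, 45.0, 43.9, 44.4, 45.1),
  hc      = c(2.25, 2.27, 2.29, 2.31, 2.33)
)

# Hypothetical helper: shift a vector one period, so row t holds the value from t-1.
lag1 <- function(v) c(NA, head(v, -1))

# Sort so that lags run in time order within each country.
toy <- toy[order(toy$country, toy$year), ]

# Average annual change in real GDP since each country's first year,
# divided by the number of years in the window (assumed definition).
base_gdp  <- ave(toy$rgdpna, toy$country, FUN = function(v) v[1])
n_years   <- ave(toy$year,   toy$country, FUN = function(v) v - v[1] + 1)
toy$avg_t <- ifelse(n_years > 1, (toy$rgdpna - base_gdp) / n_years, NA)

# Predictors in the row for year t take their value from year t-1.
toy$GINI_T_1 <- ave(toy$GINI / 100, toy$country, FUN = lag1)  # rescaled to 0-1
toy$hc_t_1   <- ave(toy$hc / 4,     toy$country, FUN = lag1)  # rescaling by 4 is assumed

print(toy)
```

Grouping with `ave()` keeps the lag from leaking across countries once more than one country is present in the panel.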

Now we will perform EDA by plotting histograms and scatterplots of each numeric variable in the model-dataset. This lets us identify influential points and anticipate potential problems with multicollinearity and with the normality assumption. We will also split the dataset into a test-dataset (1990-1994) and a model-dataset (1995-2019).

str(build_data)

## 'data.frame':    180 obs. of  61 variables:
##  $ X.1          : int  2911 2912 2913 2914 2915 2916 2917 2918 2919 2920 ...
##  $ X            : int  2910 2911 2912 2913 2914 2915 2916 2917 2918 2919 ...
##  $ countrycode  : chr  "CRI" "CRI" "CRI" "CRI" ...
##  $ country      : chr  "Costa Rica" "Costa Rica" "Costa Rica" "Costa Rica" ...
##  $ currency_unit: chr  "Costa Rican Colon" "Costa Rican Colon" "Costa Rican Colon" "Costa Rican Colon" ...
##  $ year         : int  1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
##  $ rgdpe        : num  24934 25669 28308 30610 32439 ...
##  $ rgdpo        : num  26685 27175 30037 32134 33639 ...
##  $ pop          : num  3.12 3.2 3.29 3.37 3.46 ...
##  $ emp          : num  1.08 1.11 1.16 1.2 1.24 ...
##  $ avh          : num  2364 2278 2365 2352 2376 ...
##  $ hc           : num  2.25 2.27 2.29 2.31 2.33 ...
##  $ ccon         : num  23661 23448 25503 27731 29150 ...
##  $ cda          : num  27162 26534 29535 32440 33945 ...
##  $ cgdpe        : num  24650 25356 27980 30229 31945 ...
##  $ cgdpo        : num  25623 26289 28764 30744 32418 ...
##  $ cn           : num  50398 52283 55270 59021 63045 ...
##  $ ck           : num  0.00126 0.00127 0.00133 0.00139 0.00145 ...
##  $ ctfp         : num  0.725 0.74 0.728 0.742 0.73 ...
##  $ cwtfp        : num  0.76 0.744 0.742 0.772 0.754 ...
##  $ rgdpna       : num  28256 28895 31553 33789 35305 ...
##  $ rconna       : num  28025 27758 30158 32801 34339 ...
##  $ rdana        : num  31125 30089 33742 37091 38757 ...
##  $ rnna         : num  68950 70779 74078 78060 81945 ...
##  $ rkna         : num  0.279 0.287 0.305 0.325 0.343 ...
##  $ rtfpna       : num  0.853 0.866 0.875 0.893 0.884 ...
##  $ rwtfpna      : num  0.903 0.866 0.899 0.941 0.932 ...
##  $ labsh        : num  0.575 0.575 0.575 0.575 0.575 ...
##  $ irr          : num  0.149 0.144 0.155 0.158 0.158 ...
##  $ delta        : num  0.0491 0.0497 0.0501 0.051 0.0516 ...
##  $ xr           : num  91.6 122.4 134.5 142.2 157.1 ...
##  $ pl_con       : num  0.276 0.263 0.284 0.296 0.305 ...
##  $ pl_da        : num  0.295 0.283 0.305 0.315 0.327 ...
##  $ pl_gdpo      : num  0.283 0.273 0.297 0.31 0.322 ...
##  $ i_cig        : chr  "Interpolated" "Interpolated" "Interpolated" "Interpolated" ...
##  $ i_xm         : chr  "Benchmark" "Benchmark" "Benchmark" "Benchmark" ...
##  $ i_xr         : chr  "Market-based" "Market-based" "Market-based" "Market-based" ...
##  $ i_outlier    : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ i_irr        : chr  "Regular" "Regular" "Regular" "Regular" ...
##  $ cor_exp      : num  NA NA NA NA NA NA NA NA NA NA ...
##  $ statcap      : num  NA NA NA NA NA ...
##  $ csh_c        : num  0.766 0.739 0.741 0.759 0.759 ...
##  $ csh_i        : num  0.137 0.117 0.14 0.153 0.148 ...
##  $ csh_g        : num  0.157 0.153 0.146 0.143 0.14 ...
##  $ csh_x        : num  0.12 0.124 0.133 0.132 0.144 ...
##  $ csh_m        : num  -0.192 -0.172 -0.216 -0.219 -0.205 ...
##  $ csh_r        : num  0.012 0.0382 0.0565 0.0314 0.0145 ...
##  $ pl_c         : num  0.278 0.269 0.292 0.3 0.305 ...
##  $ pl_i         : num  0.422 0.433 0.437 0.431 0.457 ...
##  $ pl_g         : num  0.263 0.233 0.244 0.276 0.304 ...
##  $ pl_x         : num  0.474 0.498 0.479 0.479 0.476 ...
##  $ pl_m         : num  0.465 0.495 0.448 0.437 0.454 ...
##  $ pl_n         : num  0.278 0.27 0.289 0.299 0.305 ...
##  $ pl_k         : num  1.06 1.01 1.1 1.1 1.07 ...
##  $ GINI         : num  43.2 45 43.9 44.4 45.1 44.5 45.4 44 45 46.4 ...
##  $ logGNI       : num  9.16 9.17 9.24 9.28 9.31 ...
##  $ avg_t        : num  NA 320 1099 1383 1410 ...
##  $ GINI_T_1     : num  NA 0.432 0.45 0.439 0.444 0.451 0.445 0.454 0.44 0.45 ...
##  $ income_t_1   : num  NA 9.16 9.17 9.24 9.28 ...
##  $ hc_t_1       : num  NA 0.563 0.568 0.573 0.578 ...
##  $ workers_t_1  : num  1.08 1.11 1.16 1.2 1.24 ...

par(mfrow=c(2,3))
hist(build_data_use$avg_t, main="Average rGDP Growth", xlab="rGDP (mil. 2017 USD)")
hist(build_data_use$GINI_T_1, main="GINI Index", xlab="Gini Index (0-1)")
hist(build_data_use$income_t_1, main="Log of GNI", xlab="Gross National Income (mil. 2017 USD)")
hist(build_data_use$hc_t_1, main="Human Capital Index", xlab="Human Capital Index (0-1)")
hist(build_data_use$workers_t_1, main="Total Work Force", xlab="Workers (mil.)")

par(mfrow=c(2,2))

plot(build_data_use$GINI_T_1, build_data_use$avg_t, main="Average rGDP Growth vs. GINI Index", xlab="GINI Index", ylab="Average rGDP Growth")
plot(build_data_use$income_t_1, build_data_use$avg_t, main="Average rGDP Growth vs. Income", xlab="Income", ylab="Average rGDP Growth")
plot(build_data_use$hc_t_1, build_data_use$avg_t, main="Average rGDP Growth vs. Human Capital", xlab="Human Capital Index", ylab="Average rGDP Growth")
plot(build_data_use$workers_t_1, build_data_use$avg_t, main="Average rGDP Growth vs. Total Work Force", xlab="Total Work Force", ylab="Average rGDP Growth")

model_data <- build_data_use[ c(5:29, 34:58, 63:112 ,117:141, 146:170), ]
# start at row 1 and stop at row 4: row 5 already belongs to model_data
test_data <- build_data_use[ c(1:4, 30:33, 59:62, 113:116, 142:145), ]
summary(model_data)

##    country               year          avg_t           GINI_T_1     
##  Length:150         Min.   :1995   Min.   : 208.9   Min.   :0.3800  
##  Class :character   1st Qu.:2001   1st Qu.: 943.0   1st Qu.:0.4720  
##  Mode  :character   Median :2007   Median :1276.6   Median :0.4965  
##                     Mean   :2007   Mean   :1488.3   Mean   :0.5023  
##                     3rd Qu.:2013   3rd Qu.:2106.5   3rd Qu.:0.5397  
##                     Max.   :2019   Max.   :3439.5   Max.   :0.5910  
##    income_t_1         hc_t_1        workers_t_1    
##  Min.   : 7.946   Min.   :0.3911   Min.   :0.8667  
##  1st Qu.: 8.477   1st Qu.:0.4669   1st Qu.:1.7102  
##  Median : 8.836   Median :0.5168   Median :2.2092  
##  Mean   : 8.964   Mean   :0.5386   Mean   :2.5624  
##  3rd Qu.: 9.454   3rd Qu.:0.6266   3rd Qu.:2.8945  
##  Max.   :10.269   Max.   :0.7182   Max.   :7.0934
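Selecting rows by position depends on the row order left by `complete.cases()`. A minimal sketch of an alternative, assuming the split is meant to be purely by year, filters on the `year` column instead. The toy data frame and the names `toy`, `model_toy`, `test_toy` and `split_year` are hypothetical.

```r
# Toy stand-in for build_data_use: two countries, one row per year (values are made up).
toy <- data.frame(
  country = rep(c("Costa Rica", "Panama"), each = 6),
  year    = rep(1991:1996, times = 2),
  avg_t   = seq(100, 1200, by = 100)
)

split_year <- 1995  # first year of the model period, as in the text

model_toy <- toy[toy$year >= split_year, ]  # 1995 onwards: used to fit the model
test_toy  <- toy[toy$year <  split_year, ]  # before 1995: held out

# The two subsets should partition the data with no shared rows.
stopifnot(nrow(model_toy) + nrow(test_toy) == nrow(toy))
stopifnot(length(intersect(rownames(model_toy), rownames(test_toy))) == 0)

table(test_toy$country)
```

The two `stopifnot()` calls guard against a row ending up in both subsets, which is easy to miss with hand-written index ranges.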

We see skews in the histograms and possible curvature in the scatterplots, indicating that we may have issues with non-linearity and non-normality. To verify this, let’s build a model and look at its residual plots.

In accordance with economic theory, we do see a linear trend between income, human capital, work force and economic growth. However, we will not act on this, as according to the literature there is virtually no way to model rGDP without these variables.

We will now proceed to build our model.

mod <- lm(avg_t ~ GINI_T_1 + income_t_1 + hc_t_1 + workers_t_1, data=model_data)
summary(mod)$coefficients

##                Estimate Std. Error    t value     Pr(>|t|)
## (Intercept) -12929.5165  379.33853 -34.084375 4.178616e-71
## GINI_T_1      3116.1794  376.09948   8.285519 7.190364e-14
## income_t_1    1468.6313   55.83924  26.301063 4.702824e-57
## hc_t_1       -2071.1250  421.08834  -4.918505 2.335383e-06
## workers_t_1    313.6305   17.26795  18.162584 3.509264e-39

confint(mod)

##                   2.5 %      97.5 %
## (Intercept) -13679.2638 -12179.7693
## GINI_T_1      2372.8340   3859.5249
## income_t_1    1358.2673   1578.9953
## hc_t_1       -2903.3891  -1238.8609
## workers_t_1    279.5011    347.7599

r <- resid(mod)

# first check condition 1 and 2
pairs(model_data[4:7])

plot(model_data$avg_t ~ fitted(mod), main="Avg Annual Economic Growth versus Fitted Values", xlab="Fitted Value", ylab="Avg Annual Economic Growth")
abline(a = 0, b = 1)
lines(lowess(model_data$avg_t ~ fitted(mod)), lty=2)

# make all residual plots
par(mfrow=c(2,3))

plot(r ~ fitted(mod), xlab="Fitted", ylab="Residuals")
plot(r ~ model_data$GINI_T_1, xlab="GINI Index", ylab="Residuals")
plot(r ~ model_data$hc_t_1, xlab="Human Capital Index", ylab="Residuals")
plot(r ~ model_data$income_t_1, xlab="Income", ylab="Residuals")
plot(r ~ model_data$workers_t_1, xlab="Labor Force", ylab="Residuals")

qqnorm(r)
qqline(r)

The model summary shows that all coefficients have small p-values for the t-test; hence we carry on with the same predictors, as all of them are statistically significant.

The residual plots show that the uncorrelated-errors assumption does not hold for any variable, the constant-variance assumption does not hold for the Gini index and labor force, and linearity holds for every variable. The QQ plot also tells us we have a potential issue with the normality assumption (as our histograms suggested). Therefore, we plot the response against the fitted values and a pairwise plot of all predictors to check conditions 1 and 2. Both conditions hold, so we proceed to apply Box-Cox to our model to address the violated assumptions.
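The assumption checks above are read off plots. As a rough numeric complement, the sketch below computes one simple indicator per assumption on simulated data; it is not part of the original analysis, the variable names are hypothetical, and the indicators are crude stand-ins for formal tests.

```r
set.seed(1)  # simulated data, for illustration only

n <- 150
x <- runif(n, 1, 10)
y <- (2 + 0.5 * x + rnorm(n, sd = 0.3))^2   # linear in x only after a square root

fit <- lm(y ~ x)
res <- rstandard(fit)

# Normality: Shapiro-Wilk test on standardized residuals
# (a small p-value suggests non-normal residuals).
sw <- shapiro.test(res)

# Uncorrelated errors: lag-1 sample autocorrelation of residuals in row order.
lag1_acf <- acf(res, lag.max = 1, plot = FALSE)$acf[2]

# Constant variance: correlation between |residuals| and fitted values
# (values far from 0 hint at heteroscedasticity).
spread_cor <- cor(abs(res), fitted(fit))

c(shapiro_p = sw$p.value, lag1_acf = lag1_acf, spread_cor = spread_cor)
```

For panel data such as this study's, the row-order autocorrelation is only meaningful if rows are sorted by country and year.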

#install.packages("car")
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

boxCox(mod, family="bcnPower")

# finally, Box-Cox predictors and response together
p <- powerTransform(cbind(model_data[,3], model_data[,4], model_data[,5], model_data[,6], model_data[,7])~ 1)
summary(p)

## bcPower Transformations to Multinormality 
##    Est Power Rounded Pwr Wald Lwr Bnd Wald Upr Bnd
## Y1    0.3848        0.50       0.2384       0.5312
## Y2    0.0424        1.00      -1.3100       1.3947
## Y3    1.8908        1.00       0.3158       3.4658
## Y4    1.6412        1.00       0.6914       2.5910
## Y5    0.2200        0.33       0.0192       0.4209
## 
## Likelihood ratio test that transformation parameters are equal to 0
##  (all log transformations)
##                                    LRT df       pval
## LR test, lambda = (0 0 0 0 0) 48.98844  5 2.2314e-09
## 
## Likelihood ratio test that no transformations are needed
##                                    LRT df       pval
## LR test, lambda = (1 1 1 1 1) 197.7874  5 < 2.22e-16
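For reference, the `bcPower` family named in the output above is the standard Box-Cox power family. Written out for a positive variable y and power λ (symbols chosen here for illustration):

```latex
$$
y^{(\lambda)} =
\begin{cases}
\dfrac{y^{\lambda} - 1}{\lambda}, & \lambda \neq 0 \\[6pt]
\log y, & \lambda = 0
\end{cases}
$$
```

For λ ≠ 0 this is an affine function of y^λ, so fitting a linear model with the plain power (a square root for λ = 0.5, for example) gives the same fitted relationship up to a rescaling of the coefficients.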

The hypothesis tests tell us to apply transformations to both predictors and response, which gives the following equation with rounded powers:

$$\sqrt{avg.growth_{t,i}} = B_0 + B_1 GINI_{t-1,i} + B_2 Income_{t-1,i} + B_3 HumanCapital_{t-1,i} + B_4\sqrt[3]{Workers_{t-1,i}} + \epsilon$$

We will now check what these transformations do to our model compared with our untransformed model:

mod2 <- lm(I((avg_t)^0.5) ~ I((GINI_T_1)) + I((income_t_1)) + I((hc_t_1)) + I((workers_t_1)^0.33), data=model_data)
summary(mod2)

## 
## Call:
## lm(formula = I((avg_t)^0.5) ~ I((GINI_T_1)) + I((income_t_1)) + 
##     I((hc_t_1)) + I((workers_t_1)^0.33), data = model_data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -4.085 -1.227  0.130  1.491  4.005 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           -187.7854     3.9739 -47.255  < 2e-16 ***
## I((GINI_T_1))           44.1277     3.6690  12.027  < 2e-16 ***
## I((income_t_1))         20.7252     0.5218  39.722  < 2e-16 ***
## I((hc_t_1))            -31.6696     4.0231  -7.872 7.45e-13 ***
## I((workers_t_1)^0.33)   25.6427     1.0696  23.975  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.997 on 145 degrees of freedom
## Multiple R-squared:  0.9629, Adjusted R-squared:  0.9619 
## F-statistic: 940.4 on 4 and 145 DF,  p-value: < 2.2e-16

summary(mod)

## 
## Call:
## lm(formula = avg_t ~ GINI_T_1 + income_t_1 + hc_t_1 + workers_t_1, 
##     data = model_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -444.47 -122.86   -3.05   96.40  668.40 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12929.52     379.34 -34.084  < 2e-16 ***
## GINI_T_1      3116.18     376.10   8.286 7.19e-14 ***
## income_t_1    1468.63      55.84  26.301  < 2e-16 ***
## hc_t_1       -2071.12     421.09  -4.919 2.34e-06 ***
## workers_t_1    313.63      17.27  18.163  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 209.5 on 145 degrees of freedom
## Multiple R-squared:  0.9305, Adjusted R-squared:  0.9286 
## F-statistic: 485.3 on 4 and 145 DF,  p-value: < 2.2e-16

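The transformed model fits better (adjusted R-squared rises from 0.9286 to 0.9619, although the two responses are on different scales, so the comparison is only indicative) and all the predictors are still significant. Hence, we continue by checking the model assumptions of our transformed model: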

data <- model_data %>%
  mutate(avg_t = (avg_t)^0.5, income_t_1 = (income_t_1), hc_t_1 = (hc_t_1), workers_t_1 = (workers_t_1)^0.33, GINI_T_1 = (GINI_T_1))

mod <- lm(avg_t ~ GINI_T_1 +income_t_1 + hc_t_1 + workers_t_1, data=data)

pairs(data[4:7])

plot(data$avg_t ~ fitted(mod), main="Y vs Fitted Values", xlab="Fitted", ylab="Avg Annual Economic Growth")
lines(lowess(data$avg_t ~ fitted(mod)), lty=2)
abline(a = 0, b = 1)

# make all residual plots
par(mfrow=c(3,4))
plot(rstandard(mod)~fitted(mod), xlab="fitted", ylab="Residuals")
for(i in c(4:7)){
  plot(rstandard(mod)~data[,i], xlab=names(data)[i], ylab="Residuals")
}


qqnorm(rstandard(mod))
qqline(rstandard(mod))
vif(mod)

##    GINI_T_1  income_t_1      hc_t_1 workers_t_1 
##    1.134519    3.383691    4.891233    1.889661

Linearity holds, the uncorrelated-errors assumption holds better than in the untransformed model, constant variance is satisfied, and normality holds better than before. We don’t worry about multicollinearity, as all our VIFs are less than 5. As our dataset doesn’t have any more potential predictors after cleaning, we skip AIC-based stepwise selection and go directly to testing our model with the test-dataset.
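To make the VIF threshold concrete: the variance inflation factor of predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing that predictor on all the others. The sketch below computes it by hand on simulated predictors; `manual_vif` is a hypothetical helper written for this illustration, not a function from `car`.

```r
set.seed(2)  # simulated predictors, for illustration only

n  <- 150
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.5)   # deliberately correlated with x1
x3 <- rnorm(n)                        # unrelated to the others
X  <- data.frame(x1, x2, x3)

# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from the
# regression of predictor j on all remaining predictors.
manual_vif <- function(X) {
  sapply(names(X), function(j) {
    f  <- reformulate(setdiff(names(X), j), response = j)
    r2 <- summary(lm(f, data = X))$r.squared
    1 / (1 - r2)
  })
}

round(manual_vif(X), 2)
```

A VIF of 5 corresponds to R_j^2 = 0.8, meaning 80% of that predictor's variance is explained by the other predictors.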

# start at row 1 and stop at row 4: row 5 already belongs to model_data
test_data <- build_data_use[ c(1:4, 30:33, 59:62, 113:116, 142:145), ]
mod3 <- lm(I((avg_t)^0.5) ~ I((GINI_T_1)) + I((income_t_1)) + I((hc_t_1)) + I((workers_t_1)^0.33), data=test_data)
summary(mod2)
summary(mod3)

data1 <- test_data %>%
  mutate(avg_t = (avg_t)^0.5, income_t_1 = (income_t_1), hc_t_1 = (hc_t_1), workers_t_1 = (workers_t_1)^0.33, GINI_T_1 = (GINI_T_1))

mod3 <- lm(avg_t ~ GINI_T_1 +income_t_1 + hc_t_1 + workers_t_1, data=data1)

pairs(data1[4:7])
plot(data1$avg_t ~ fitted(mod3), main="Y vs Fitted Values", xlab="Fitted", ylab="Avg Annual Economic Growth")
lines(lowess(data1$avg_t ~ fitted(mod3)), lty=2)
abline(a = 0, b = 1)



# make all residual plots
par(mfrow=c(3,4))
plot(rstandard(mod3)~fitted(mod3), xlab="fitted", ylab="Residuals")
for(i in c(4:7)){
  plot(rstandard(mod3)~data1[,i], xlab=names(data1)[i], ylab="Residuals")
}


qqnorm(rstandard(mod3))
qqline(rstandard(mod3))
vif(mod3)
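The validation above refits the transformed model on the test rows and compares coefficients. A complementary check, not part of the original analysis, is to keep the model fitted on the training rows and measure how well it predicts the held-out rows. The sketch below does this on simulated data; `make_data`, `rmse`, `train` and `test` are hypothetical names.

```r
set.seed(3)  # simulated data, for illustration only

# Hypothetical generator: the response is linear in x on the square-root scale.
make_data <- function(n) {
  x <- runif(n, 1, 10)
  data.frame(x = x, y = (2 + 0.5 * x + rnorm(n, sd = 0.3))^2)
}
train <- make_data(150)   # plays the role of model_data
test  <- make_data(20)    # plays the role of test_data

# Fit on the training rows only, on the square-root scale.
fit <- lm(sqrt(y) ~ x, data = train)

# Root mean squared error on the square-root scale, in and out of sample.
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
rmse_train <- rmse(sqrt(train$y), fitted(fit))
rmse_test  <- rmse(sqrt(test$y),  predict(fit, newdata = test))

c(train = rmse_train, test = rmse_test)
```

A test error much larger than the training error would point to the same conclusion as diverging coefficients, without requiring a second fit on a small sample.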

Our model fails its test. The coefficients are vastly different from those of the model built with the model-data, even though the assumptions hold. This suggests that we overfitted the model and that the transformations chosen using the model-data only worked for that data. The main reason is that the test-dataset is very different from our train-dataset: between 1990 and 1994 economic growth in Central America boomed and then slowed down significantly while the other variables increased at their normal rates (I found out about this after this project). However, not much can be done to solve this, as economic figures for Central America prior to 1990 are extrapolated and scarce.

This means there are many influential points in our test-dataset, which also contributed to that failure. Furthermore, the test-dataset is small in econometric terms, but there isn’t much to do about this, as there is limited information on the economic variables of Central American countries prior to 1990. Another limitation is the multicollinearity present in the model. When creating a model that predicts economic growth, some multicollinearity is expected, as most economic variables are intercorrelated. These limitations are left uncorrected, as they are out of my control and changing the data points to better fit my model would be unethical.

However, it should be noted that both models show a positive relationship between income inequality and economic growth in Central America. Nevertheless, more research has to be done to arrive at a definitive answer on the matter.