This study examines the relationship between income inequality and several factors that may affect it. There are several ways of defining and measuring inequality; in this paper, income inequality is the object of analysis and the Gini index is used to measure it. The goal is to understand which factors are related to income inequality and how they are related. Some factors show a positive relationship with inequality and others a negative one. A multiple linear regression model is fitted using data from the World Bank, with a dataset created for the year 2013. The Gini coefficient data in this analysis are also provided by the World Bank.
The explanatory variables are basic development indicators such as GDP per capita, maternal mortality ratio, labor force, unemployment, gross savings and life expectancy. The regression analysis shows that models explaining part of income inequality can be found, but that exact predictions cannot be made. The variables in the final model are GDP per capita, labor force and gross enrollment ratio for primary education. Statistical inference tests supported the significance of these variables, but the significance of unemployment was not supported. Further studies should split countries by level of development and introduce new explanatory variables.
The Gini index, or Gini coefficient, is a statistical measure of distribution developed by the Italian statistician Corrado Gini in 1912. It is usually defined mathematically from the Lorenz curve, which plots the proportion of the total income of the population (y axis) that is cumulatively earned by the bottom x% of the population. It is often used as a gauge of economic inequality, measuring the distribution of income or wealth among a population. The coefficient ranges from 0 (perfect equality) to 1 (perfect inequality), and is sometimes expressed as a percentage between 0 and 100. Values over 1 are theoretically possible when negative income or wealth is present, so using the coefficient to measure wealth inequality requires that no one has negative net wealth. It is also commonly used to measure the discriminatory power of rating systems in credit risk management. The Gini coefficient is an example of a summary statistic: it compresses a broader array of statistical information into a single figure. The purpose of studying the GINI index is to present information about inequality. I am studying the GINI index because this paper is submitted as part of an academic assignment.
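The Lorenz-curve definition above can be made concrete with a small R sketch. This is only an illustration on made-up income vectors, not the World Bank's estimation procedure; the function name `gini` and the example data are hypothetical, and the sketch assumes non-negative incomes.

```r
# Gini coefficient from a vector of non-negative incomes (illustrative only),
# computed as 1 - 2 * (area under the Lorenz curve).
gini <- function(income) {
  stopifnot(is.numeric(income), all(income >= 0), sum(income) > 0)
  x <- sort(income)
  n <- length(x)
  # Lorenz curve points: cumulative population share vs. cumulative income share
  pop_share <- c(0, seq_len(n) / n)
  inc_share <- c(0, cumsum(x) / sum(x))
  # Trapezoid rule for the area under the Lorenz curve
  area <- sum(diff(pop_share) * (head(inc_share, -1) + tail(inc_share, -1)) / 2)
  1 - 2 * area
}

equal_society   <- gini(rep(10, 5))          # everyone earns the same
unequal_society <- gini(c(0, 0, 0, 0, 100))  # one person earns everything
print(c(equal = equal_society, unequal = unequal_society))
```

With only five people the most unequal case gives (n - 1)/n = 0.8 rather than 1; the upper bound of 1 is approached as the population grows.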
The data used for this analysis are provided by the World Bank. They are based on primary household survey data obtained from government statistical agencies and World Bank country departments. The GINI data set includes 264 observations in total, covering the years 1960-2017. The World Development Indicators are available on these topics: Agriculture and Food Security, Climate Change, Economic Growth, Education, Energy and Extractives, Environment and Natural Resources, Financial Sector Development, Gender, Macroeconomic Vulnerability and Debt, Poverty, Private Sector Development, Public Sector Management, Social Development, Social Protection and Labor, Trade and Urban Development.
For this study the explanatory variables are chosen from some of these topics. I decided to run the analysis for the year 2013 because this is the most recent year for which the data are available. While creating the data set, I included all the countries for which 2013 GINI index data are available. I then selected about 26 explanatory variables that I expected to be good indicators for predicting the GINI index. Because of the unavailability of data, however, I could include only 14 variables in the final data set. The data manipulation was done in R and Excel.
Countries in Dataset: Argentina, Austria, Belarus, Belgium, Bolivia, Brazil, Bulgaria, Burundi, Canada, Chile, Colombia, Costa Rica, Croatia, Cyprus, Czech Republic, Denmark, Dominican Republic, Ecuador, El Salvador, Estonia, Fiji, Finland, France, Georgia, Germany, Greece, Honduras, Hungary, Iceland, Indonesia, Ireland, Italy, Kazakhstan, Kyrgyz Republic, Latvia, Lithuania, Luxembourg, Moldova, Netherlands, Norway, Pakistan, Panama, Peru, Poland, Portugal, Russian Federation, Rwanda, Serbia, Slovak Republic, Slovenia, Spain, Sweden, Switzerland, Tajikistan, Thailand, Turkey, Ukraine, United Kingdom, United States and Uruguay.
Discussion of 10 explanatory Variables:
1) Government expenditure on education, total (% of GDP): It is reasonable to assume that an increase in government expenditure on education leads to better access to public education, especially among poor households. This raises the education level among the poor and improves their access to higher-paying jobs, thus reducing income inequality.
2) Life expectancy at birth, total (years): Life expectancy at birth is an indicator of access to health care and of average per capita income in a country. Although it does not affect inequality directly, factors that increase life expectancy tend to decrease overall income inequality. In fact, a decrease in income inequality has been shown to increase life expectancy in the lowest quintiles of income.
3) Labor force, total: The share of the labor force in the general population should, in theory, be related to income inequality. For example, in a country where only 10% of the population is in the labor force, income inequality will be very high, because the Lorenz curve stays at zero until the 90% mark. As the labor force share increases, this inequality should decrease, assuming unemployment as a share of the labor force remains the same. Note that the indicator used here is the total labor force (a count of people), not a share of the population.
4) Unemployment, total (% of total labor force): As the unemployment rate goes up, competition for each job increases, leading to depressed wages across the board. Since lower-wage jobs generally make up a larger proportion of jobs than higher-paying ones, this leads to a higher number of people with low wages. The mean income is then expected to decrease more than the median income, leading to higher income inequality.
5) GDP per capita (current US$): GDP per capita is a standard way to compare gross domestic product between countries of different sizes. Some countries have enormous total economic output simply because they have so many people, so dividing GDP by population gives a more accurate picture.
6) Population ages 15-64 (% of total): An increase in the working-age population generally leads to a larger labor force and a higher number of people who are employed. This is expected to decrease income inequality.
7) Rural Population: In contrast to the urban population, the rural population has only a limited range of job options. This leads to a larger percentage of the population with similar incomes, which is expected to decrease income inequality.
8) Gross enrollment ratio, primary, both sexes (%): In countries where enrollment in primary education is low, the percentage of the population in low-paying jobs is likely to be higher. This leads to higher income inequality.
9) Maternal mortality ratio: The maternal mortality ratio is a measure of access to health care and of income. Higher maternal mortality is an indicator of poverty in the population, and thus of income inequality.
10) Inflation, consumer prices (annual %): Inflation decreases the value of the money a person has. If a person’s income is not rising at or above the inflation rate, their net worth goes down over time unless their expenditure also decreases (which is not practical in most cases). Since a person with a higher income tends to have more than one source of income (e.g. stocks, real estate) compared to a person with a lower income, inflation does not affect them as much. This leads to an increase in income inequality, and the relationship is likely linear if not logarithmic.
Variable Selection: For variable selection, I first checked collinearity. If two explanatory variables are highly correlated with each other, they cause problems in a multivariable analysis because they explain almost the same variability in the outcome. For this reason I dropped “Income Share by Lowest 20%” and “Income Share by Highest 20%”. Next, I used manual stepwise backward elimination: start with all the predictors in the model, remove the predictor with the highest p-value greater than αcrit, refit the model, and repeat until all p-values are less than αcrit. Following this method, I removed the variables Tax, PopulationAge, GrossSavings, AccessElectricity, Unemployment, Immunization, RuralPopulation, MaternalMortality, LifeExpectancy and Inflation, in this order.
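The two selection steps described above (a collinearity check followed by backward elimination) can be sketched in R as follows. This is a minimal sketch, not the code used for this paper: it runs on R's built-in `mtcars` data as a stand-in for the World Bank data set, the helper name `backward_eliminate` is hypothetical, and it assumes that all predictors are numeric.

```r
# Manual backward elimination (illustrative sketch on stand-in data):
# drop the predictor with the largest p-value, refit, and repeat until
# every remaining p-value is below alpha_crit (or one predictor is left).
backward_eliminate <- function(data, response, alpha_crit = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    fit <- lm(reformulate(predictors, response = response), data = data)
    coefs <- summary(fit)$coefficients
    p_values <- coefs[-1, "Pr(>|t|)", drop = TRUE]  # skip the intercept row
    names(p_values) <- rownames(coefs)[-1]
    worst <- which.max(p_values)
    if (p_values[worst] < alpha_crit || length(predictors) == 1) break
    predictors <- setdiff(predictors, names(p_values)[worst])
  }
  fit
}

# Stand-in data: mpg plays the role of GINI, the rest are candidate predictors.
cars <- mtcars[, c("mpg", "wt", "hp", "qsec", "drat")]

# Step 1: pairwise correlations between predictors, to spot collinearity.
print(round(cor(cars[, -1]), 2))

# Step 2: backward elimination at alpha_crit = 0.05.
final_fit <- backward_eliminate(cars, response = "mpg")
print(summary(final_fit)$coefficients)
```

Because the p-values are recomputed after every removal, the order in which variables drop out depends on the data and can differ from an ordering based on the full model alone.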
Figure 1: Individual scatter plots for the variables selected for the model
Figure 2: Scatter plot matrix
Regression Assumptions:
1) Fixed x and measurement error: The data used in the analysis are collected from a reliable source (the World Bank), so I assume that the x values are fixed and measured without substantial error.
2) Linearity: The data appear to be reasonably well modeled by a linear relationship between y and x, and the points appear to be randomly spread out about the line, with no discernible non-linear trends. Looking at the “Residuals vs Fitted” plot (Figure 3), we see that the red line is essentially flat. This tells us that there is no discernible non-linear trend in the residuals. The spread of the residuals is examined separately under assumption 4 below.
3) E[ε] = 0: If the mean of the residuals is zero (or very close to zero), then this assumption holds for our model.
## [1] -4.401861e-17
From above result, since the mean of residuals is approximately zero, this assumption holds true for this model.
4) Homoscedasticity of residuals or equal variance:
Ideally, residuals are randomly scattered around 0 (the horizontal line), giving a relatively even distribution. Heteroscedasticity is indicated when the residuals are not evenly scattered around the line. It can take many forms, such as a bow-tie or fan shape. In our case a definite pattern is noticeable, so there is some heteroscedasticity and this assumption is not fully met.
Figure 4
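The visual checks of assumptions 3 and 4 can be complemented with a numerical check. The sketch below is not part of the original analysis: it implements Koenker's studentized Breusch-Pagan test in base R (a technique swapped in here for illustration) and applies it to synthetic data whose error spread grows with x. The helper name `bp_test` and the data are hypothetical.

```r
# Studentized (Koenker) Breusch-Pagan test in base R, illustrative sketch.
# Regress the squared residuals on the model's predictors; n * R^2 of that
# auxiliary regression is approximately chi-square under equal variance.
bp_test <- function(fit) {
  e2 <- residuals(fit)^2
  X <- model.matrix(fit)              # includes the intercept column
  aux <- lm.fit(X, e2)                # auxiliary regression of e^2 on X
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  df <- ncol(X) - 1                   # number of predictors (intercept excluded)
  stat <- length(e2) * r2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Synthetic, deterministic example: error size is proportional to x,
# which produces a fan shape in a residuals-vs-fitted plot.
x <- seq(1, 10, length.out = 200)
y <- 2 + 3 * x + x * rep(c(-1, 1), 100)
fit <- lm(y ~ x)

resid_mean <- mean(residuals(fit))    # assumption 3: should be essentially 0
bp <- bp_test(fit)                    # assumption 4: small p-value flags unequal variance
print(resid_mean)
print(bp)
```

A small p-value from this test would point in the same direction as a fan or bow-tie pattern in the residual plot.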
Figure 5: Normal Q-Q Plots
Model Evaluation:
Treating the model assumptions as approximately satisfied, with the caveat that some heteroscedasticity was noted above, we want to determine how good our model is. Below are the summary and ANOVA results for our model.
##
## Call:
## lm(formula = GINI ~ . - (IncomeHigh20 + IncomeLow20 + Tax + PopulationAge +
##     GrossSavings + AccessElectricity + UnEMP + Immunization +
##     RuralPopulation + MaternalMortality + LifeExpectancy + Inflation),
##     data = finalDataNew)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.0364  -4.9001   0.1125   2.8116  15.0979
##
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)       -1.658e+00  1.205e+01  -0.138  0.89104
## GDP               -1.022e-04  3.567e-05  -2.865  0.00586 **
## LaborForce         6.205e-08  2.944e-08   2.108  0.03954 *
## PrimaryEnrollment  3.735e-01  1.121e-01   3.332  0.00153 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.767 on 56 degrees of freedom
## Multiple R-squared: 0.3336, Adjusted R-squared: 0.2979
## F-statistic: 9.346 on 3 and 56 DF, p-value: 4.184e-05
| Model | r.squared | adj.r.squared |
|---|---|---|
| GINI vs. GDP | 0.1614 | 0.147 |
| GINI vs. GDP, LaborForce | 0.2015 | 0.1735 |
| GINI vs. GDP, LaborForce, PrimaryEnrollment | 0.3336 | 0.2979 |
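The adjusted R-squared values in the table follow from the R-squared values through the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1). The short R sketch below recomputes them from the rounded R-squared values in the table, with n = 60 countries; the helper name `adj_r2` is hypothetical, and small differences in the last digit can come from rounding.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2 <- c(GDP = 0.1614,
        GDP_LaborForce = 0.2015,
        GDP_LaborForce_PrimaryEnrollment = 0.3336)
p <- 1:3                       # number of predictors in each nested model
adj <- adj_r2(r2, n = 60, p = p)
print(round(adj, 4))
```

The penalty for each added predictor is visible here: R-squared rises by about 0.13 when PrimaryEnrollment is added, and the adjusted value rises by slightly less.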
First, we perform the overall F-test on the model as follows. α = 0.05
hypotheses:
H0: βGDP = βLaborForce = βPrimaryEnrollment = 0
Ha: at least one slope is not zero
test statistic:
Fc = (1284.13/3)/(2564.74/56) = 428.04/45.799
Fc = 9.346
with p = 3 and n − (p + 1) = 60 − 4 = 56 degrees of freedom
p-value = 4.184*10^-5 < α = 0.05
conclusion: reject the null hypothesis; at least one of the predictors is useful for explaining GINI, so the model as a whole is statistically useful.
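The arithmetic of the overall F-test can be checked directly in R from the two sums of squares (1284.13 for the regression and 2564.74 for the residuals). This is a small verification sketch; the variable names are my own.

```r
# Overall F-test recomputed from the regression and residual sums of squares.
ssr <- 1284.13   # regression (model) sum of squares
sse <- 2564.74   # residual (error) sum of squares
p <- 3           # number of predictors
n <- 60          # number of countries

msr <- ssr / p
mse <- sse / (n - p - 1)
f_stat <- msr / mse
p_value <- pf(f_stat, df1 = p, df2 = n - p - 1, lower.tail = FALSE)

r_squared <- ssr / (ssr + sse)   # should agree with Multiple R-squared
resid_se <- sqrt(mse)            # should agree with Residual standard error

print(c(F = f_stat, p.value = p_value, R2 = r_squared, resid.SE = resid_se))
```

The result agrees with the F-statistic (9.346 on 3 and 56 DF), the multiple R-squared (0.3336) and the residual standard error (6.767) in the summary output.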
Next we perform the t-tests.
Testing βGDP: α = 0.05
H0: βGDP = 0 (assuming LaborForce and PrimaryEnrollment are already in the model)
Ha: βGDP ≠ 0
The test statistic is t = -2.865, which has a t distribution with n − (p + 1) = 60 − 4 = 56 degrees of freedom.
Here the p-value is 0.00586 < 0.05. So we reject the null hypothesis and conclude that GDP is statistically significant when LaborForce and PrimaryEnrollment are already in the model.
Testing βPrimaryEnrollment: α = 0.05
H0: βPrimaryEnrollment = 0 (assuming LaborForce and GDP are already in the model)
Ha: βPrimaryEnrollment ≠ 0
The test statistic is t = 3.332, which has a t distribution with n − (p + 1) = 60 − 4 = 56 degrees of freedom.
Here the p-value is 0.00153 < 0.05. So we reject the null hypothesis and conclude that PrimaryEnrollment is statistically significant when LaborForce and GDP are already in the model.
Testing βLaborForce: α = 0.05
H0: βLaborForce = 0 (assuming PrimaryEnrollment and GDP are already in the model)
Ha: βLaborForce ≠ 0
The test statistic is t = 2.108, which has a t distribution with n − (p + 1) = 60 − 4 = 56 degrees of freedom.
Here the p-value is 0.03954 < 0.05. So we reject the null hypothesis and conclude that LaborForce is statistically significant when PrimaryEnrollment and GDP are already in the model.
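The three t-tests can be reproduced from the estimates and standard errors in the coefficient table. The R sketch below uses the rounded values printed in the summary output, so the last digits may differ slightly; the data frame and column names are my own.

```r
# t statistics and two-sided p-values from the printed coefficient table.
coefs <- data.frame(
  term      = c("GDP", "LaborForce", "PrimaryEnrollment"),
  estimate  = c(-1.022e-04, 6.205e-08, 3.735e-01),
  std_error = c( 3.567e-05, 2.944e-08, 1.121e-01)
)
df_resid <- 56                      # n - (p + 1) = 60 - 4
alpha <- 0.05

coefs$t_value <- coefs$estimate / coefs$std_error
coefs$p_value <- 2 * pt(abs(coefs$t_value), df = df_resid, lower.tail = FALSE)
coefs$significant <- coefs$p_value < alpha

print(coefs)
```

Each t statistic is simply the estimate divided by its standard error, and each p-value is the two-sided tail area of the t distribution with 56 degrees of freedom.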
I used the backward elimination method to choose the variables for the model. This method starts by fitting a model with all the variables of interest. The least significant variable, judged by its p-value, is then dropped, as long as it is not significant at the chosen critical level. I continued by successively refitting reduced models and applying the same rule until all remaining variables were statistically significant. Because the variables were selected this way, it is expected that the t-tests find every variable in the final model statistically significant, and the reported p-values should be read with that in mind.
The partial F-test is a common method for comparing nested linear regression models, where a “nested” model is a reduced model containing a subset of the variables. Since all the variables in our model are statistically significant, I did not drop or add any further variables, and so I did not need to perform the partial F-test.
Final regression model:
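\(\hat y = -1.6582664 - 1.022017 \times 10^{-4}\, x_1 + 6.2053864 \times 10^{-8}\, x_2 + 0.3735227\, x_3\)
where \(\hat y\) is the predicted GINI index (%), \(x_1\) is GDP per capita (current US$), \(x_2\) is the total labor force (number of people) and \(x_3\) is the gross enrollment ratio in primary education (%).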
Interpretation of Model:
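y-intercept: If GDP, LaborForce and PrimaryEnrollment are all 0, the predicted GINI index is -1.6582664. This value has no practical meaning, since a negative GINI index is not possible and no country has zero values for these indicators.
Slope for X1: If GDP per capita increases by 1 US$, holding LaborForce and PrimaryEnrollment constant, the predicted GINI index decreases by roughly 1.022017 × 10^(-4) points (about 0.10 points per 1,000 US$).
Slope for X2: If LaborForce increases by 1 person, holding GDP and PrimaryEnrollment constant, the predicted GINI index increases by roughly 6.2053864 × 10^(-8) points (about 0.06 points per million people).
Slope for X3: If PrimaryEnrollment increases by 1 percentage point, holding GDP and LaborForce constant, the predicted GINI index increases by roughly 0.3735227 points.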
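To make the interpretation concrete, the fitted equation can be wrapped in a small R function and evaluated for a hypothetical country. The function name `predict_gini` and the input values (GDP per capita of 10,000 US$, a labor force of 5 million, primary enrollment of 100%) are made up for illustration and do not correspond to any country in the data set.

```r
# Point prediction from the final fitted model (coefficients from the paper).
predict_gini <- function(gdp_per_capita, labor_force, primary_enrollment) {
  -1.6582664 +
    -1.022017e-4 * gdp_per_capita +
    6.2053864e-8 * labor_force +
    0.3735227 * primary_enrollment
}

# Hypothetical country: inputs are illustrative, not taken from the data.
baseline <- predict_gini(10000, 5e6, 100)

# Effect of 1,000 US$ more GDP per capita, other inputs held constant.
richer <- predict_gini(11000, 5e6, 100)
gdp_effect <- richer - baseline

print(c(baseline = baseline, gdp_effect = gdp_effect))
```

With a residual standard error of 6.767, any single prediction of this kind carries an uncertainty of several GINI points, which is why the model is better suited to describing associations than to exact prediction.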
Usefulness: Our regression analysis estimates the association between the GINI index and GDP, labor force and primary enrollment. It does not establish cause and effect, and with an adjusted R-squared of 0.2979 the model explains only about 30% of the variation in the GINI index. Higher GDP per capita is associated with lower income inequality, that is, a lower GINI index (negative coefficient). Contrary to the expectations stated in the discussion of the variables, a larger labor force and a higher gross enrollment ratio in primary education are both associated with a higher GINI index in this model (positive coefficients).
Future Improvements: The analysis could be improved in many ways, mostly in terms of the data. With this sort of data, an exact model for prediction is often very hard to find. It could be more interesting to have a model with several indicators, e.g. quintiles of income, which should improve the prediction since quintiles are part of calculating the Gini coefficient. Some variables had to be excluded from the analysis because of the low number of observations, and the analysis would have been better if these variables could have been included. It would also have been preferable if the quality of the data were the same for all variables.
https://en.wikipedia.org/wiki/Gini_coefficient
https://www.investopedia.com/terms/g/gini-index.asp
For variable selection method: http://www.biostat.jhsph.edu/~iruczins/teaching/jf/ch10.pdf