The goal of this project is to further explore the relationship between poverty and inequality across countries. World Bank World Development Indicators data are used to examine several development measures and to better understand the dynamics of income, income share, poverty, and inequality in low, middle, and high income countries.

1 Summary Statistics

The following summary statistics can be observed. Of note, the mean share of income held by the top twenty percent is 43.6%, while the bottom twenty percent hold 6.8% on average. In fact, the maximum share of income held by the bottom twenty percent (10.1%) is about 24 percentage points lower than the minimum share held by the top twenty percent (34.0%).

##                    Mean      Median        Min          Max           SD
## Gini       3.635789e+01       35.10    24.8000         54.6 7.594494e+00
## b20        6.806579e+00        6.95     3.1000         10.1 1.878250e+00
## t20        4.356316e+01       41.60    34.0000         60.3 6.183351e+00
## Population 2.389171e+08 10856018.50 11225.0000 7424282488.0 9.063624e+08
## Income     2.020937e+04    12762.32   794.6046     117335.6 2.056949e+04
## Poverty    6.495402e+00        1.20     0.0000         70.8 1.384794e+01
##            X25th.Percentile Coefficient.of.Variation
## Gini                 31.125                0.2088816
## b20                   5.550                0.2759462
## t20                  38.975                0.1419399
## Population      3026305.750                3.7936267
## Income             4783.459                1.0178193
## Poverty               0.200                2.1319610

Average income inequality, measured by the Gini index, is about 36.4 on the World Bank's 0 to 100 scale (roughly 0.36 as a coefficient).

What proportion of the sample falls into each income category? Since I chose to divide the sample into thirds, the proportions are roughly equal. There is a drawback to this method: two countries on either side of a cut point may have very similar incomes, yet one falls into the lower category.

## 
##      High       Low    Middle 
## 0.3319328 0.3361345 0.3319328
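
The tertile split and its drawback at the margins can be illustrated with a minimal sketch. The incomes below are hypothetical values chosen for illustration, not the report's data; the grouping rule mirrors the quantile() and nested ifelse() calls in the appendix.

```r
# Hypothetical per-capita incomes (PPP dollars); NOT the report's data.
income <- c(800, 2500, 4700, 4800, 9000, 12700, 12900, 20000, 45000)

# Tertile cut points, computed the same way as in the appendix.
cuts <- quantile(income, probs = c(1/3, 2/3), na.rm = TRUE)

# Same nested ifelse() rule used to build Income_Group.
group <- ifelse(income <= cuts[1], "Low",
                ifelse(income <= cuts[2], "Middle", "High"))

# Each group holds one third of the sample by construction.
print(prop.table(table(group)))

# Neighbouring countries can straddle a cut point: 4700 vs 4800
# and 12700 vs 12900 land in different groups.
print(data.frame(income, group))
```

In this toy sample, incomes of 4,700 and 4,800 fall into different groups even though they differ by only about 2%, which is the margin problem described above.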

The mean Gini for the ‘high income’ countries is 33.1.

## [1] 33.1027

We see here that high income countries have the lowest levels of inequality, while middle income countries have slightly higher inequality on average than low income countries.

2 Inequality, Poverty, and Income

Gini, the measure of inequality, is strongly negatively correlated (-0.94) with the share of income held by the bottom twenty percent of the population. A country with a higher Gini index (higher inequality) will have a lower share of income held by the bottom twenty percent. This makes sense, as a society with more equal incomes will have a larger share of income in the bottom twenty percent. There is also a weak positive correlation (0.27) between Income in a country (per capita, PPP) and the share of income held by the bottom twenty percent, so higher income countries tend to have a somewhat higher income share held by the lowest 20%. Poverty is, as expected, negatively correlated with income, though only moderately (-0.39). Higher inequality also correlates with higher poverty (0.34).

##               Gini        b20     Income    Poverty
## Gini     1.0000000 -0.9394969 -0.3751602  0.3418114
## b20     -0.9394969  1.0000000  0.2724326 -0.2070425
## Income  -0.3751602  0.2724326  1.0000000 -0.3871782
## Poverty  0.3418114 -0.2070425 -0.3871782  1.0000000

3 Inequality Variation

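Gini and Income covary negatively. The covariance is expressed in the product of the two variables' units, so its magnitude is hard to interpret on its own; the corresponding correlation in the matrix above is -0.38.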

## [1] "Interquartile Range for Inequality (Gini): 10.175"
## [1] "Variance in Inequality (Gini): 57.6763368421053"
## [1] "Covariance between Inequality (Gini) and Income:"
## [1] -57241.75
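
Why the covariance is so large in absolute terms while the correlation is modest can be shown with a short sketch. The data are simulated for illustration only, and the variable names (income_usd, gini) are assumptions, not objects from the report.

```r
# Simulated data for illustration only; these are NOT the report's values.
set.seed(42)
income_usd <- exp(rnorm(100, mean = 9.5, sd = 1))   # right-skewed, like income
gini <- 66 - 3 * log(income_usd) + rnorm(100, sd = 5)

# Covariance depends on the units of both variables.
cov_usd <- cov(gini, income_usd)
cov_thousands <- cov(gini, income_usd / 1000)  # same data, income in $1000s

# Correlation divides by both standard deviations, so the units cancel.
cor_usd <- cor(gini, income_usd)
cor_thousands <- cor(gini, income_usd / 1000)

print(c(cov_usd = cov_usd, cov_thousands = cov_thousands))
print(c(cor_usd = cor_usd, cor_thousands = cor_thousands))

# Correlation equals covariance over the product of the standard deviations.
print(all.equal(cor_usd, cov_usd / (sd(gini) * sd(income_usd))))
```

Rescaling income from dollars to thousands of dollars shrinks the covariance by a factor of 1,000 but leaves the correlation unchanged, which is why the correlation is the easier number to compare across variable pairs.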

4 Middle Income Countries

The following are summary statistics for the designated ‘middle income’ countries only. The mean Gini is slightly higher for this group (39.6) than for all countries (36.4). Income for these countries ranges from about $7,700 to about $20,600 per capita in PPP terms. The poverty headcount ratio averages 3.6%.

## Summary Statistics for Inequality:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   25.00   33.70   39.80   39.62   45.70   54.60      50
## Summary Statistics for Poverty:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   0.000   0.850   1.900   3.606   4.450  29.200      44
## Summary Statistics for Income:
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7705   10886   12821   13514   15788   20579

5 Poverty and Inequality

These charts follow the intuition from the correlation matrix. A higher poverty headcount corresponds to higher levels of inequality among the countries observed.

More unequal countries have lower shares of income in the bottom 20% of the population. The trend appears clearly in the chart.

6 Regression Models

6.1 Income & Poverty

When regressing Poverty on Income, we see a negative coefficient that is statistically significant at the 1% level. The R-squared shows income explains about 15% of the variation in Poverty (adjusted R-squared 0.138), using the headcount measure at $1.90 per day in PPP terms. Higher income means lower poverty, which makes intuitive sense.

6.2 Income & Inequality

When regressing Gini as a function of Income, we see a negative coefficient that is statistically significant at the 1% level. The R-squared shows income explains about 14% of the variation in Gini across countries. Higher income countries have lower Gini, and thus lower inequality overall.

Let’s look at the residual plots to see whether the linear regression assumptions hold.

There seem to be some outliers and some potential heteroskedasticity. Since income distributions tend to be right-skewed, let’s run the regression again using the log of income and examine the results.

After log-transforming income, the residuals versus fitted values plot shows a more even spread, and homoskedasticity appears more plausible.
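
The effect of the log transformation can be sketched on simulated data. This is an illustration under stated assumptions, not a re-run of the report's regression: the incomes are drawn from a lognormal distribution, and Gini is generated as linear in log income, so the log specification is correct by construction.

```r
# Simulated data for illustration only; NOT the report's sample.
set.seed(1)
n <- 300
income <- exp(rnorm(n, mean = 9.5, sd = 1))      # right-skewed incomes
gini <- 66 - 3 * log(income) + rnorm(n, sd = 3)  # linear in log(income)

# The transformation pulls in the long right tail of income.
print(c(mean = mean(income), median = median(income)))
print(c(mean_log = mean(log(income)), median_log = median(log(income))))

fit_level <- lm(gini ~ income)       # income in levels
fit_log <- lm(gini ~ log(income))    # income in logs

# Compare fit across the two specifications.
r2 <- c(level = summary(fit_level)$r.squared,
        log = summary(fit_log)$r.squared)
print(r2)
```

Calling plot(fit_level) and plot(fit_log) on these fits would give the same four diagnostic panels used in the report, which makes the difference in residual patterns visible.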

6.3 Inequality & Poverty

When regressing Poverty as a function of Gini, we see a positive coefficient that is statistically significant at the 1% level. The R-squared shows Gini explains about 12% of the variation in Poverty across countries. Countries with higher Gini scores (more unequal countries) have higher poverty rates.

6.4 Model Comparison

Omitted variable bias and correlation among the variables should also be kept in mind. There is also potential for reverse causality, with the dependent variables having some impact on the explanatory variables.

## 
## ============================================================
##                                    Dependent variable:      
##                               ------------------------------
##                                Poverty     Gini     Poverty 
##                                  (1)        (2)       (3)   
## ------------------------------------------------------------
## Income                        -0.0002***                    
##                                (0.0001)                     
##                                                             
## log(Income)                              -3.032***          
##                                           (0.887)           
##                                                             
## Gini                                               0.567*** 
##                                                     (0.181) 
##                                                             
## Constant                      11.216***  66.147*** -15.684**
##                                (2.196)    (8.753)   (6.731) 
##                                                             
## ------------------------------------------------------------
## Observations                      76        76        76    
## R2                              0.150      0.136     0.117  
## Adjusted R2                     0.138      0.125     0.105  
## Residual Std. Error (df = 74)   11.696     7.105    11.921  
## F Statistic (df = 1; 74)      13.049***  11.684*** 9.790*** 
## ============================================================
## Note:                            *p<0.1; **p<0.05; ***p<0.01
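
As a cross-check on the table, in a regression with a single explanatory variable the R-squared equals the squared correlation between the two variables. Assuming the correlation matrix in Section 2 was computed on the same 76 complete observations (the appendix code suggests both use merged_df2), columns (1) and (3) can be verified by hand. The log(Income) coefficient in column (2) can likewise be read as the change in Gini associated with a doubling of income:

```latex
\begin{align*}
R^2_{(1)} &= r_{\text{Poverty},\,\text{Income}}^2 = (-0.3872)^2 \approx 0.150 \\
R^2_{(3)} &= r_{\text{Poverty},\,\text{Gini}}^2 = (0.3418)^2 \approx 0.117 \\
\Delta\widehat{\text{Gini}}_{\text{income doubles}} &= \hat{\beta}_{\log(\text{Income})} \ln 2 = -3.032 \times 0.693 \approx -2.10
\end{align*}
```

Under this reading, a country with twice the per capita income of another is predicted to have a Gini index about 2.1 points lower.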

6.5 Income & Poverty at the Country Level

This regression examines the poverty headcount ratio as a function of the dummy variable for country income level we created earlier.

The coefficient for the Low income group is 25.742, and it is statistically significant at the 1% level. This means the estimated poverty rate in low income countries is about 25.7 percentage points higher than in the reference category, which is high income countries in this case. The coefficient for the Middle income group (2.648) is not statistically significant.

## 
## ===============================================
##                         Dependent variable:    
##                     ---------------------------
##                               Poverty          
## -----------------------------------------------
## Income_GroupLow              25.742***         
##                               (3.376)          
##                                                
## Income_GroupMiddle             2.648           
##                               (2.349)          
##                                                
## Constant                       0.538           
##                               (1.557)          
##                                                
## -----------------------------------------------
## Observations                    76             
## R2                             0.450           
## Adjusted R2                    0.435           
## Residual Std. Error       9.471 (df = 73)      
## F Statistic           29.879*** (df = 2; 73)   
## ===============================================
## Note:               *p<0.1; **p<0.05; ***p<0.01
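
How the coefficients map onto group averages can be shown with a small sketch. The data frame below is hypothetical (nine made-up countries), with column names chosen to match the report's variables. When a regression contains only a categorical variable, the intercept is the mean of the reference group and each coefficient is the difference between that group's mean and the reference mean.

```r
# Hypothetical poverty rates for nine countries; NOT the report's data.
toy <- data.frame(
  Poverty = c(0.2, 0.5, 0.8, 20, 25, 30, 2, 3, 4),
  Income_Group = rep(c("High", "Low", "Middle"), each = 3)
)

# lm() converts the character column to a factor with alphabetical levels,
# so "High" becomes the reference category.
fit <- lm(Poverty ~ Income_Group, data = toy)
print(coef(fit))

# Group means, for comparison with the coefficients.
group_means <- tapply(toy$Poverty, toy$Income_Group, mean)
print(group_means)

# Intercept = mean of High; each coefficient = group mean minus High mean.
print(group_means[["Low"]] - group_means[["High"]])
print(group_means[["Middle"]] - group_means[["High"]])
```

Applying the same logic to the table above, the implied average poverty rates in the regression sample are 0.538 for high income countries, 0.538 + 2.648 = 3.186 for middle income countries, and 0.538 + 25.742 = 26.280 for low income countries.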

7 Appendix

rm(list = ls()) # clear the workspace
gc()            # free unused memory
cat("\f")       # clear the console
packages <- c("readr", #open csv
              "readxl", #open excel file
              "psych", # quick summary stats for data exploration,
              "stargazer", #summary stats for sharing,
              "tidyverse", # data manipulation like selecting variables,
              "corrplot", # correlation plots
              "ggplot2", # graphing
              "ggcorrplot", # correlation plot
              "gridExtra", #overlay plots
              "data.table", # reshape for graphing 
              "car", #vif
              "prettydoc", # html output
              "visdat", # visualize missing variables
              "glmnet", # lasso/ridge
              "caret", # confusion matrix
              "MASS", #step AIC
              "plm", # fixed effects demeaned regression
              "lmtest" # test regression coefficients
)
for (i in seq_along(packages)) {
  if (!packages[i] %in% rownames(installed.packages())) {
    install.packages(packages[i]
                     , repos = "http://cran.rstudio.com/"
                     , dependencies = TRUE
    )
  }
  library(packages[i], character.only = TRUE)
}

rm(packages)
setwd("/Users/matthewcolantonio/Desktop/poverty_inequality/")
wdi <- read_excel("Data_Extract_From_World_Development_Indicators.xlsx")
gini <- read_excel("Data_Extract_From_Poverty_and_Equity.xlsx")



gini.only <- subset(gini, gini$`Series Code` == "SI.POV.GINI")

describe(gini.only)


t20.only <- subset(gini, gini$`Series Code` == "SI.DST.05TH.20")

describe(t20.only)


b20.only <- subset(gini, gini$`Series Code` == "SI.DST.FRST.20")

describe(b20.only)
pop.only <- subset(gini, gini$`Series Code` == "SP.POP.TOTL")
describe(pop.only)


pg190.only <- subset(gini, gini$`Series Code` == "SI.POV.GAPS")

describe(pg190.only)
h190.only <- subset(gini, gini$`Series Code` == "SI.POV.DDAY")

describe(h190.only)
percap.only <- subset(wdi, wdi$`Series Code` == "NY.GDP.PCAP.PP.KD")
# Assuming you have the data frames 'gini.only', 'b20.only', 't20.only', and 'pop.only'

# Merge the data frames by 'Country Name'
merged_df <- merge(gini.only, b20.only, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, t20.only, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, pop.only, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, percap.only, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, h190.only, by = "Country Name", all = TRUE)
# View the merged data frame
print(merged_df)



# Select the desired columns from each data frame
gini_subset <- gini.only[c("Country Name", "2016 [YR2016]")]
b20_subset <- b20.only[c("Country Name", "2016 [YR2016]")]
t20_subset <- t20.only[c("Country Name", "2016 [YR2016]")]
pop_subset <- pop.only[c("Country Name", "2016 [YR2016]")]
percap_subset <- percap.only[c("Country Name", "2016 [YR2016]")]
h190_subset <- h190.only[c("Country Name", "2016 [YR2016]")]
# Merge the subsetted data frames by 'Country Name'
merged_df <- merge(gini_subset, b20_subset, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, t20_subset, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, pop_subset, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, percap_subset, by = "Country Name", all = TRUE)
merged_df <- merge(merged_df, h190_subset, by = "Country Name", all = TRUE)


colnames(merged_df)[2:7] <- c('Gini', 'b20', 't20', 'Population', "Income", "Poverty")

merged_df$Gini <- as.numeric(merged_df$Gini)
merged_df$b20 <- as.numeric(merged_df$b20)
merged_df$t20 <- as.numeric(merged_df$t20)
merged_df$Population <- as.numeric(merged_df$Population)
merged_df$Income <- as.numeric(merged_df$Income)
merged_df$Poverty <- as.numeric(merged_df$Poverty)
# Calculate the quantiles for dividing into thirds
quantiles <- quantile(merged_df$Income, probs = c(1/3, 2/3), na.rm = TRUE)

# Create a new column 'Income_Group' with dummy variables
merged_df$Income_Group <- ifelse(merged_df$Income <= quantiles[1], "Low",
                                 ifelse(merged_df$Income <= quantiles[2], "Middle", "High"))



summary_stats <- data.frame(
  Mean = colMeans(merged_df[, 2:7], na.rm = TRUE),
  Median = apply(merged_df[, 2:7], 2, median, na.rm = TRUE),
  Min = apply(merged_df[, 2:7], 2, min, na.rm = TRUE),
  Max = apply(merged_df[, 2:7], 2, max, na.rm = TRUE),
  SD = apply(merged_df[, 2:7], 2, sd, na.rm = TRUE),
  `25th Percentile` = apply(merged_df[, 2:7], 2, quantile, probs = 0.25, na.rm = TRUE),
  `Coefficient of Variation` = apply(merged_df[, 2:7], 2, function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE))
)

print(summary_stats)
prop.table(table(merged_df$Income_Group))
high_income_df <- subset(merged_df, Income_Group == "High")

# mean Gini value for high-income countries
mean_gini <- mean(high_income_df$Gini, na.rm = TRUE)

print(mean_gini)

merged_df2 <- na.omit(merged_df)
ggplot(merged_df2, aes(x = reorder(`Country Name`, Gini), y = Gini)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(x = "Country", y = "Gini") +
  ggtitle("Gini by Country") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
ggplot(merged_df2, aes(x = Income_Group, y = Gini)) +
  geom_boxplot(fill = "steelblue") +
  labs(x = "Income Group", y = "Gini") +
  ggtitle("Gini by Income Group") +
  theme_bw()
cor(merged_df2[, c("Gini", "b20", "Income", "Poverty")])
mycorr<- cor(merged_df2[, c("Gini", "b20", "Income", "Poverty")])

ggcorrplot(mycorr, hc.order = TRUE, 
           type = "lower", lab = TRUE, 
           lab_size = 1.5, method = "square",
           tl.cex = 7.5,
           pch = 4,
           colors = c("#6D9EC1", "white", "#E46726"), 
           title = "Correlation Plot Matrix")
# Compute interquartile range
IQR_inequality <- IQR(merged_df2$Gini, na.rm = TRUE)

# Compute variance
var_inequality <- var(merged_df2$Gini, na.rm = TRUE)

# Compute covariance with other variables
covariance <- cov(merged_df2$Gini, merged_df2$Income, use = "complete.obs")

# Print the results
print(paste("Interquartile Range for Inequality (Gini):", IQR_inequality))
print(paste("Variance in Inequality (Gini):", var_inequality))
print("Covariance between Inequality (Gini) and Income:")
print(covariance)

# Subset the dataframe for middle income countries
middle_income_df <- subset(merged_df, Income_Group == "Middle")

# Calculate the five-number summary for inequality, poverty, and income
summary_inequality <- summary(middle_income_df$Gini)
summary_poverty <- summary(middle_income_df$Poverty)
summary_income <- summary(middle_income_df$Income)


cat("Summary Statistics for Inequality:\n")
print(summary_inequality)
cat("\n")

# Display a message indicating the summary statistics are for poverty
cat("Summary Statistics for Poverty:\n")
print(summary_poverty)
cat("\n")

# Display a message indicating the summary statistics are for income
cat("Summary Statistics for Income:\n")
print(summary_income)
cat("\n")
# log(0) is -Inf, so keep only countries with a positive poverty headcount
ggplot(subset(merged_df, Poverty > 0), aes(x = log(Poverty), y = log(Gini))) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("log Poverty (Headcount Ratio)") +
  ylab("log Gini") +
  ggtitle("Scatterplot of Poverty and Gini")
ggplot(merged_df, aes(x = b20, y = Gini)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  xlab("Income Share, Bottom 20%") +
  ylab("Gini") +
  ggtitle("Scatterplot of Income in Bottom 20% and Gini")
reg1 <- lm(data = merged_df2, Poverty ~ Income)
summary(reg1)
reg2 <- lm(data = merged_df2, Gini ~ Income)
summary(reg2)
par(mfrow = c(2, 2))
plot(reg2)
reg2.1 <- lm(data = merged_df2, Gini ~ log(Income))
summary(reg2.1)
par(mfrow = c(2, 2))
plot(reg2.1)
reg3 <- lm(data = merged_df2, Poverty ~ Gini)
summary(reg3)
stargazer(reg1, reg2.1, reg3, type ='text', digits = 3)
reg4 <- lm(data = merged_df2, Poverty ~ Income_Group)
stargazer(reg4, type = 'text', digits = 3)