Introduction

Databases at the World Bank are crucial instruments for supporting management decisions and supplying key statistical data for Bank operations. The use of globally recognised standards and norms produces a consistent and trustworthy source of data. Poverty is defined as a state in which an individual or a group lacks the financial means to meet fundamental human requirements for a basic standard of living, for example when earnings from work are insufficient to cover necessities. This report walks through the data cleaning and transformation steps needed before plotting poverty and inequality trends.

Question

  • How do poverty, measured at national poverty lines, and inequality change over time?

Required Libraries

library(ggplot2)
library(gridExtra)
library(dplyr)
library(reshape2)

Dataset Description

Data Source: http://databank.worldbank.org/data/download/PovStats_csv.zip

PovData <- read.csv("PovStatsData.csv", stringsAsFactors = FALSE)

The data contains 10730 observations (one per country and statistic) of 50 variables (identifier columns such as the country and indicator names, plus yearly values covering more than 40 years).

Data Cleaning and Transformation

Dealing with missing data

We must first determine how much data is missing before drawing inferences and conclusions from any dataset.

#proportion of missing (NA) values in a vector
ProportionNA <- function(x){
        mean(is.na(x))}

ProportionData <- sapply(PovData[,5:49],ProportionNA)

#get the years, make them numerics & remove the X
yearlist <- as.numeric( 
        gsub("X","",
             names(PovData)[5:49]
             )
        )  
#put them in a combined Data Frame 
ProportionDF <- cbind.data.frame(yearlist, ProportionData) 
str(ProportionDF)
## 'data.frame':    45 obs. of  2 variables:
##  $ yearlist      : num  1974 1975 1976 1977 1978 ...
##  $ ProportionData: num  0.98 0.98 0.983 0.982 0.981 ...
graph1 <- ggplot(aes(x = yearlist, y= ProportionData), data = ProportionDF) + 
  geom_point() + 
  geom_smooth(method = "loess")  + 
  theme_bw() + 
  labs(x = "Year", y = "Proportion of Data Missing") + 
  ggtitle("Proportion of Missing Data")+
  coord_cartesian(ylim = c(.5,1))
graph1 
## `geom_smooth()` using formula 'y ~ x'

Starting in the late 1970s, the proportion of missing data fell steadily, settling at around 70% between 2005 and 2012. Since then, data coverage has dropped back to levels last seen in the mid-1990s.
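As a quick sanity check on the missing-data calculation, the sketch below applies the same helper to a small made-up data frame and compares it with the vectorised base R form `colMeans(is.na(...))`. The `toy` data frame and its values are assumptions for illustration only; they are not taken from the World Bank file.

```r
# Toy stand-in for the year columns of PovData (values are invented).
toy <- data.frame(
  X1974 = c(NA, NA, 1.2, NA),
  X1975 = c(0.5, NA, NA, 2.0),
  X1976 = c(3.1, 4.2, NA, 0.7)
)

# Same helper as in the analysis: share of NA entries in a vector.
ProportionNA <- function(x) mean(is.na(x))

# Column-wise proportion of missing values, computed two ways.
by_sapply   <- sapply(toy, ProportionNA)
by_colMeans <- colMeans(is.na(toy))

# Strip the "X" prefix that read.csv adds to numeric column names.
toy_years <- as.numeric(gsub("X", "", names(toy)))

print(data.frame(year = toy_years, proportion_missing = by_sapply))
```

Both forms should give the same result; `colMeans(is.na(...))` is simply a more compact way to write it when every selected column is a year column.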

I want to focus on the GINI index and poverty rates, which are among the 58 distinct indicators. Because there are so many poverty indicators, I’ve chosen two that have relatively little missing data:

  1. Poverty headcount ratio at $5.50 a day (2011 PPP) (% of population)
  2. Poverty headcount ratio at national poverty lines (% of population)

indicators_unique <- unique(PovData$Indicator.Name)
#6th indicator is GINI Index, a World Bank estimator of inequality 

PovGini <- PovData[PovData$Indicator.Name == indicators_unique[6],] #GINI 
PovNpl <- PovData[PovData$Indicator.Name == indicators_unique[45],] #national poverty lines
PovDay <- PovData[PovData$Indicator.Name == indicators_unique[44],] #$5.50 a day

#Note Poverty rates should not be zero, replacing with NAs 
PovNpl[PovNpl == 0] <- NA
PovDay[PovDay == 0] <- NA


GiniNA <- sapply(PovGini[,5:46],ProportionNA) #1974:1980 have no data
#70% of the data is missing 

NplNA <- sapply(PovNpl[,5:46],ProportionNA) #1974:1983, 1986 have no data
#80% of the data is missing

DayNA <- sapply(PovDay[,5:46],ProportionNA) # 1974:1983,1986 have no data
#85% of the data is missing
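The subsetting and zero-replacement steps above can be tried out on a toy data frame. This is only a sketch: the `toy` table and its values are invented, and matching the indicator by a fragment of its name (rather than by position in `indicators_unique`) is an alternative I am assuming would be more robust if the order of indicators in the file ever changes.

```r
# Toy stand-in for PovData; all values are invented for illustration.
toy <- data.frame(
  Country.Name   = c("A", "A", "B", "B"),
  Indicator.Name = c(
    "GINI index (World Bank estimate)",
    "Poverty headcount ratio at national poverty lines (% of population)",
    "GINI index (World Bank estimate)",
    "Poverty headcount ratio at national poverty lines (% of population)"
  ),
  X2010 = c(35.1, 0, 41.0, 22.4),
  X2011 = c(NA, 18.3, 40.2, 0),
  X2012 = c(36.0, NA, 39.8, 21.0),
  stringsAsFactors = FALSE
)

# Select rows by matching part of the indicator name instead of its position.
toy_npl <- toy[grepl("national poverty lines", toy$Indicator.Name, fixed = TRUE), ]

# Recode zeros as missing; cells that are already NA stay NA.
toy_npl[toy_npl == 0] <- NA

print(toy_npl)
```

After the replacement, each zero in the poverty rows has become `NA`, while the non-zero values and the pre-existing `NA` are unchanged.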

For the years 1987 to 2018, the data was trimmed to just three indicators: the Gini index, the poverty headcount ratio at national poverty lines, and the poverty headcount ratio at $5.50 a day.

#adds a category marker column to each data frame
#and keeps only columns 18:49, i.e. the years 1987:2018
framemaker <- function(df, factorname){
        cbind.data.frame(df[,18:49], as.factor(rep(factorname, nrow(df))))
}

pov_new_data <- rbind.data.frame(framemaker(PovGini, "Gini"),
                         framemaker(PovNpl, "National Poverty Lines"),
                         framemaker(PovDay, "$5.50 a Day"))
colnames(pov_new_data) <- c(1987:2018,"Category")



pov_new_data <- melt(pov_new_data, id = "Category")
colnames(pov_new_data) <- c("Category","Year","Value")

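#median and mean per category and year, ignoring missing values
summarise_data <- summarise(group_by(pov_new_data, Category, Year),
                            median = median(Value, na.rm = TRUE),
                            mean = mean(Value, na.rm = TRUE),
                            .groups = "drop")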
head(summarise_data)
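To make the wide-to-long reshaping and the grouped summary easier to follow, here is a minimal sketch on a made-up table with two year columns. The `wide` data frame and its values are assumptions for illustration; the calls are `melt()` from reshape2 and `group_by()`/`summarise()` from dplyr, used in the same spirit as the steps above.

```r
library(dplyr)
library(reshape2)

# Toy wide table: one row per country, one column per year (values invented).
wide <- data.frame(
  `2010` = c(30, NA, 50, 10),
  `2011` = c(28, 40, NA, 12),
  Category = factor(c("Gini", "Gini", "Poverty", "Poverty")),
  check.names = FALSE
)

# Wide to long: one row per (Category, Year) observation.
long <- melt(wide, id.vars = "Category",
             variable.name = "Year", value.name = "Value")

# Median and mean per category and year, ignoring missing values.
summary_tbl <- long %>%
  group_by(Category, Year) %>%
  summarise(median = median(Value, na.rm = TRUE),
            mean   = mean(Value, na.rm = TRUE),
            .groups = "drop")

print(summary_tbl)
```

Note that `melt()` stores the year as a factor, which is why the plotting code converts it with `as.numeric(as.character(Year))` before using it on the x-axis.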

Final Visualization

meanplot <- ggplot(aes(x = as.numeric(as.character(Year)), y= mean,
                       color = Category), 
                   data = summarise_data) + 
        geom_point() + geom_smooth(method = "loess") + 
        theme_bw() + xlab("Year") + ylab("Mean Value") + 
        labs(title = "Poverty is Decreasing, but Inequality Remains Constant",
             subtitle = "Gini Coefficient of Inequality \n Poverty headcount ratio at national poverty lines and $5.50 a day (% of population)")

meanplot 

Since 1987, the poverty headcount ratio (as a percentage of population) has fallen. Inequality, by contrast, has persisted for over 25 years, with a GINI coefficient in the range of 0.35 to 0.45.