Databases at the World Bank are crucial instruments for supporting critical management decisions and supplying key statistical data for Bank operations. The use of globally recognised standards and norms produces a consistent and trustworthy source of data. Poverty is a state in which an individual or a group lacks the financial means and necessities for a basic quality of living; in other words, earnings from work are insufficient to meet fundamental human requirements. Here we walk through the data cleaning and transformation steps needed before drawing a few poverty plots.
library(ggplot2)
library(gridExtra)
library(dplyr)
library(reshape2)
Data Source: http://databank.worldbank.org/data/download/PovStats_csv.zip
PovData <- read.csv("PovStatsData.csv", stringsAsFactors = FALSE)
The data contains 10730 observations (country x indicator) of 50 variables: identifier columns such as the country and indicator names, plus 45 yearly columns covering 1974 to 2018.
We must first determine how much data is missing before drawing inferences and conclusions from any dataset.
#identification of the proportion of missing NA values
ProportionNA <- function(x){
mean(is.na(x))}
ProportionData <- sapply(PovData[,5:49],ProportionNA)
#get the years, make them numerics & remove the X
yearlist <- as.numeric(
gsub("X","",
names(PovData)[5:49]
)
)
#put them in a combined Data Frame
ProportionDF <- cbind.data.frame(yearlist, ProportionData)
str(ProportionDF)
## 'data.frame': 45 obs. of 2 variables:
## $ yearlist : num 1974 1975 1976 1977 1978 ...
## $ ProportionData: num 0.98 0.98 0.983 0.982 0.981 ...
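The missing-data calculation above can be checked on a small example. This is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from PovStatsData.csv); it uses `colMeans(is.na())`, which should give the same result as applying `ProportionNA` column by column.

```r
# Toy stand-in for the year columns of PovData (values are made up)
toy <- data.frame(
  X1974 = c(NA, NA, 1.2, NA),
  X1975 = c(NA, 3.4, 1.1, NA),
  X1976 = c(2.0, 3.1, NA, 0.9)
)

# Share of NA values per column; equivalent to sapply(toy, ProportionNA)
prop_missing <- colMeans(is.na(toy))

# Strip the leading "X" that read.csv() puts in front of numeric column names
years <- as.numeric(sub("^X", "", names(toy)))

# Combine into one data frame, ready for plotting
missing_df <- data.frame(year = years, prop_missing = unname(prop_missing))
print(missing_df)
```

Anchoring the pattern with `^X` removes only the leading character, which is a little safer than replacing every "X" in the name.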
graph1 <- ggplot(aes(x = yearlist, y= ProportionData), data = ProportionDF) +
geom_point() +
geom_smooth(method = "loess") +
theme_bw() +
labs(x = "Year", y = "Proportion of Data Missing") +
ggtitle("Proportion of Missing Data")+
coord_cartesian(ylim = c(.5,1))
graph1
## `geom_smooth()` using formula 'y ~ x'
Starting in the late 1970s, the World Bank steadily reduced the proportion of missing data, from about 98% to around 70% between 2005 and 2012. Since then, data coverage has slipped back to levels last seen in the mid-1990s.
I'd like to focus on the Gini index and poverty rates. Of the 58 distinct indicators, I've chosen three that have relatively little missing data: the Gini index and two poverty headcount ratios (at national poverty lines and at $5.50 a day).
indicators_unique <- unique(PovData$Indicator.Name)
#6th indicator is GINI Index, a World Bank estimator of inequality
PovGini <- PovData[PovData$Indicator.Name == indicators_unique[6],] #GINI
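#45th indicator: poverty headcount ratio at national poverty lines
PovNpl <- PovData[PovData$Indicator.Name == indicators_unique[45],]
#44th indicator: poverty headcount ratio at $5.50 a day
PovDay <- PovData[PovData$Indicator.Name == indicators_unique[44],]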
#Note Poverty rates should not be zero, replacing with NAs
PovNpl[PovNpl == 0] <- NA
PovDay[PovDay == 0] <- NA
GiniNA <- sapply(PovGini[,5:46],ProportionNA) #1974:1980 have no data
#70% of the data is missing
NplNA <- sapply(PovNpl[,5:46],ProportionNA) #1974:1983, 1986 have no data
#80% of the data is missing
DayNA <- sapply(PovDay[,5:46],ProportionNA) # 1974:1983,1986 have no data
#85% of the data is missing
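Selecting indicators by their position in `unique()` depends on the row order of the file. One possible alternative is to match on the indicator name and to restrict the zero-to-NA replacement to the year columns. The sketch below runs on a made-up data frame; the indicator names and values in `toy` are assumptions for illustration and should be checked against `indicators_unique` before use.

```r
# Toy stand-in for PovData; indicator names and values are assumed, not read from the file
toy <- data.frame(
  Country.Name   = c("A", "A", "B", "B"),
  Indicator.Name = c("GINI index (World Bank estimate)",
                     "Poverty headcount ratio at national poverty lines (% of population)",
                     "GINI index (World Bank estimate)",
                     "Poverty headcount ratio at national poverty lines (% of population)"),
  X2010 = c(35.1, 0, 41.0, 22.5),
  X2011 = c(NA, 18.3, 40.2, 0),
  stringsAsFactors = FALSE
)

# Select rows by matching the indicator name instead of its position in unique()
toy_gini <- toy[grepl("GINI", toy$Indicator.Name, fixed = TRUE), ]
toy_npl  <- toy[grepl("national poverty lines", toy$Indicator.Name, fixed = TRUE), ]

# Replace zeros with NA in the year columns only, leaving the text columns untouched
year_cols <- grep("^X[0-9]{4}$", names(toy_npl), value = TRUE)
toy_npl[year_cols] <- lapply(toy_npl[year_cols],
                             function(v) replace(v, which(v == 0), NA))

# Proportion of missing values per year after the replacement
sapply(toy_npl[year_cols], function(v) mean(is.na(v)))
```

Using `which(v == 0)` skips values that are already NA, so existing gaps are left as they are.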
The data was then trimmed to the years 1987 to 2018 and to three indicators: Gini, the poverty headcount ratio at national poverty lines, and the poverty headcount ratio at $5.50 a day.
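Because the trimming step selects columns by position, it helps to confirm which years those positions correspond to. The sketch below assumes a layout of four identifier columns, then `X1974` to `X2018`, then one empty trailing column; apart from `Indicator.Name`, the identifier names and the trailing column are assumptions and not confirmed by the file.

```r
# Assumed column layout (hypothetical apart from Indicator.Name and the year range)
col_names <- c("Country.Name", "Country.Code", "Indicator.Name", "Indicator.Code",
               paste0("X", 1974:2018), "X")

# Positions of the first and last year to keep
first_col <- match("X1987", col_names)
last_col  <- match("X2018", col_names)

# Column range and number of years it covers
c(first_col, last_col)
length(first_col:last_col)
```

On the real data, `match("X1987", names(PovData))` would give the starting position directly, without counting columns by hand.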
framemaker <- function(df, factorname){
#keeps only columns 18:49 (years 1987:2018) and adds a category marker
cbind.data.frame(df[,18:49],as.factor(rep(factorname,nrow(df))))
}
pov_new_data <- rbind.data.frame(framemaker(PovGini, "Gini"),
framemaker(PovNpl, "National Poverty Lines"),
framemaker(PovDay, "$5.50 a Day"))
colnames(pov_new_data) <- c(1987:2018,"Category")
pov_new_data <- melt(pov_new_data, id = "Category")
colnames(pov_new_data) <- c("Category","Year","Value")
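#funs() is deprecated in recent dplyr versions; a named list gives the same median and mean columns
summarise_data <- summarise_all(group_by(pov_new_data, Category, Year),
list(median = median, mean = mean),
na.rm = TRUE)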
head(summarise_data)
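The reshape and summary steps can be tried on a small table first. This is a minimal sketch with made-up values (`wide`, `long`, and `toy_summary` are hypothetical names); it uses `melt()` from reshape2 as above, and plain `summarise()` as one possible alternative to `summarise_all()`.

```r
library(dplyr)
library(reshape2)

# Toy wide table: one column per year plus a category marker (values are made up)
wide <- data.frame(
  `1987` = c(40, 44, 30, NA),
  `1988` = c(41, NA, 28, 26),
  Category = factor(c("Gini", "Gini", "Poverty", "Poverty")),
  check.names = FALSE
)

# Wide to long: one row per Category/Year observation
long <- melt(wide, id.vars = "Category",
             variable.name = "Year", value.name = "Value")

# Median and mean per Category and Year, ignoring missing values
toy_summary <- long %>%
  group_by(Category, Year) %>%
  summarise(median_value = median(Value, na.rm = TRUE),
            mean_value   = mean(Value, na.rm = TRUE),
            .groups = "drop")

toy_summary
```

The output columns are named `median_value` and `mean_value` here, so a plot built on this summary would map `y = mean_value` rather than `y = mean`.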
meanplot <- ggplot(aes(x = as.numeric(as.character(Year)), y= mean,
color = Category),
data = summarise_data) +
geom_point() + geom_smooth(method = "loess") +
theme_bw() + xlab("Year") + ylab("Mean Value") +
labs(title = "Poverty is Decreasing, but Inequality Remains Constant",
subtitle = "Gini Coefficient of Inequality \n Poverty headcount ratio at national poverty lines and $5.50 a day (% of population)")
meanplot
Since 1987, the poverty headcount ratio (as a percentage of population) has fallen. Inequality, with a Gini coefficient between 0.35 and 0.45, has persisted for over 25 years.