In this exercise, World Development Indicators (WDI) data from the World Bank are used to calculate the Human Development Index (HDI) according to the United Nations methodology. The HDI is a useful tool for economists to gauge the development status of different countries using more than GDP or other measures focused solely on income.
There are some limitations with data availability, especially in terms of education indicators for the World Bank dataset.
We want to evaluate indicators to calculate the Human Development Index. We are particularly interested in life expectancy, income (using GNI per capita), and some measure of education.
For education, youth and adult literacy rates were used instead of expected and mean years of schooling, because the WDI dataset has limited education-related indicators.
The calculations used for the indexes come from the UN’s HDI methodology.
\[ Index = \frac{ActualValue - MinValue}{MaxValue - MinValue} \]
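Life expectancy has a maximum of 85 years and a minimum of 20 years.
GNI per capita has a maximum of 75,000 and a minimum of 100.
The literacy-based education index uses a maximum of 100 and a minimum of 38, as in the appendix code.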
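As a small worked example of the index formula, the sketch below rescales one life expectancy value and one GNI per capita value with the bounds listed above. The helper name `dimension_index` and the input values are hypothetical and chosen only for illustration; they are not taken from the WDI data.

```r
# Hypothetical helper: rescale a raw indicator to a dimension index
dimension_index <- function(value, min_value, max_value) {
  (value - min_value) / (max_value - min_value)
}

# Illustrative inputs (not WDI values)
life_index <- dimension_index(72.5, min_value = 20, max_value = 85)      # 52.5 / 65
gni_index  <- dimension_index(15080, min_value = 100, max_value = 75000) # 14980 / 74900 = 0.2

print(round(life_index, 3))
print(round(gni_index, 3))
```

Note that the UN technical notes apply this formula to the natural log of GNI per capita, whereas this report rescales GNI in levels; the sketch follows the report.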
The HDI is the geometric mean of the three indices:
\[ HDI_i = (I_{health} \cdot I_{education} \cdot I_{income})^\frac{1}{3} \]
where \(i\) refers to the individual country and the different \(I\) values represent the calculated indices for health (life expectancy), education (literacy), and income (GNI per capita).
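The geometric mean can be sketched in a few lines of R. The function name `hdi_from_indices` and the three index values are assumptions made for illustration. The second calculation uses the equivalent log form, exp of the mean of the logs, which is how the appendix code computes it.

```r
# Hypothetical helper: geometric mean of the three dimension indices
hdi_from_indices <- function(health, education, income) {
  (health * education * income)^(1 / 3)
}

# Illustrative index values (not computed from WDI data)
example_hdi <- hdi_from_indices(health = 0.8, education = 0.9, income = 0.5)

# Equivalent log form: exp(mean(log(indices)))
example_hdi_log <- exp(mean(log(c(0.8, 0.9, 0.5))))

print(all.equal(example_hdi, example_hdi_log))
```

Because the indices are multiplied, a low value in one dimension pulls the HDI down more than it would under an arithmetic mean, and an index of zero gives an HDI of zero.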
The calculated HDI using World Bank data differs slightly from the UN calculations. The obvious reason for this is the different education measure used. Of the World Bank education indicators, literacy rate had the fewest missing values and was used in place of the UN’s measures of years of schooling. This gave different values. In addition, the standard deviation and the minimum and maximum of the calculated HDI were higher.
This exercise shows that the correlation between HDI and GDP per capita is noticeably higher when GDP is log-transformed. The correlation between log GDP and HDI is 0.956, compared with 0.884 for untransformed GDP.
One of the reasons for this is that log-transforming GDP makes its distribution closer to symmetric. We can see the GDP measure is skewed to the right. Taking the log reduces the skew, so the relationship between GDP and HDI is closer to linear and the correlation captures it better.
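The effect of the log transformation on skew can be illustrated with simulated data. The sketch below draws right-skewed values with `rlnorm` and compares the sample skewness before and after taking logs. The data are simulated purely for illustration and are not WDI values, and `sample_skewness` is a hypothetical helper.

```r
set.seed(42)

# Simulated right-skewed "GDP per capita" values (not WDI data)
sim_gdp <- rlnorm(1000, meanlog = 9, sdlog = 1)

# Hypothetical helper: moment-based sample skewness
sample_skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

skew_raw <- sample_skewness(sim_gdp)       # strongly positive
skew_log <- sample_skewness(log(sim_gdp))  # close to zero

print(c(raw = skew_raw, log = skew_log))
```

With the actual data, `describe()` from the psych package, which the appendix already loads, reports a skew column that can be compared for `gdp` and `log(gdp)` in the same way.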
The following plot shows that calculated HDI is highly correlated with log GDP. More specifically, countries with higher GDP per capita have higher HDI values.
The maximum and minimum values the UN uses to create the HDI indices can impact the final HDI calculation. Initially, the UN measures were used, except for the education measure created using WDI data. Now, the true maximum and minimum values for the WDI data will be used and the difference will be observed.
I may be making an error in calculation, but it appears that using the different minimum and maximum values does not change the results (see the appendix for code). However, that does not mean that differences in minimum and maximum values across samples could not have major impacts on HDI calculations.
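To make the comparison concrete, the sketch below rescales a small vector of life expectancy values twice: once with the fixed bounds of 20 and 85, and once with the sample's own minimum and maximum. The helper `rescale_sample` and the input values are assumptions for illustration only. Missing values are skipped with `na.rm = TRUE`, since `min()` and `max()` otherwise return `NA` as soon as one value is missing.

```r
# Hypothetical helper: min-max rescaling using the sample's own bounds
rescale_sample <- function(x) {
  bounds <- range(x, na.rm = TRUE)  # c(min, max), ignoring missing values
  (x - bounds[1]) / (bounds[2] - bounds[1])
}

# Illustrative life expectancy values with one missing entry (not WDI data)
life_expectancy <- c(54.2, 61.0, NA, 72.8, 84.6)

life_fixed  <- (life_expectancy - 20) / 65         # fixed bounds: 20 and 85
life_sample <- rescale_sample(life_expectancy)     # sample bounds: 54.2 and 84.6

print(round(cbind(life_fixed, life_sample), 3))
```

With sample bounds, the lowest observation always maps to 0 and the highest to 1. Because the HDI is a geometric mean, a country with an index of 0 in any dimension then receives an HDI of 0, which is one way the choice of bounds can matter.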
The current Human Development Index has many great qualities. By focusing on more than just conventional economic indicators, such as GDP, HDI gives a better picture of the lives and economic circumstances across countries. The three main categories of life, education, and income are what I think make the current measure valuable. Future improvements can be made by including more variables in each of the bucketed categories to give an even more complete picture.
Potential additions to the life category include food expenditures, which could give more information on food security, an important indicator for low-income countries. It would also be useful to include some measure of government effectiveness and stability. This may be difficult to measure, but it could serve as a fourth indicator in a country’s HDI calculation.
Of course, data collection and availability are serious limitations, especially for government measures, as some authoritarian governments or administrations involved in conflict may be reluctant to share information.
rm(list = ls())
gc()
cat("\f")
packages <- c("readr", #open csv
"psych", # quick summary stats for data exploration,
"stargazer", #summary stats for sharing,
"tidyverse", # data manipulation like selecting variables,
"corrplot", # correlation plots
"ggplot2", # graphing
"ggcorrplot", # correlation plot
"gridExtra", #overlay plots
"data.table", # reshape for graphing
"car", #vif
"prettydoc", # html output
"visdat", # visualize missing variables
"glmnet", # lasso/ridge
"caret", # confusion matrix
"MASS", #step AIC
"plm", # fixed effects demeaned regression
"lmtest" # test regression coefficients
)
for (i in 1:length(packages)) {
if (!packages[i] %in% rownames(installed.packages())) {
install.packages(packages[i]
, repos = "http://cran.rstudio.com/"
, dependencies = TRUE
)
}
library(packages[i], character.only = TRUE)
}
rm(packages)
setwd("/Users/matthewcolantonio/Desktop/hdi/")
wdi <- read_csv("WDIData.csv")
wdi.2 <- wdi[,c(1:4, 65:67)]
wdi.3 <-wdi.2[wdi.2$`Indicator Code` %in% c("NY.GNP.PCAP.CD", "SP.DYN.LE00.IN", "SE.ADT.LITR.ZS", "SE.ADT.1524.LT.ZS", "NY.GDP.PCAP.PP.CD"), ]
# life expectancy
wdi.3$life <- NA
# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'SP.DYN.LE00.IN')
# Check if a matching row is found
if (length(row_index) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Perform the calculation: (value - 20) / 65
result <- (wdi.3[row_index, column_index] - 20) / 65
# Assign the result to the 'life' column
wdi.3[row_index, "life"] <- result
} else {
print("No matching row found.")
}
# Create a new column 'gni' and initialize with NA values
wdi.3$gni <- NA
# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GNP.PCAP.CD')
# Check if a matching row is found
if (length(row_index) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Perform the calculation: (value - 100) / 74900
result <- (wdi.3[row_index, column_index] - 100) / 74900
# Assign the result to the 'gni' column
wdi.3[row_index, "gni"] <- result
} else {
print("No matching row found.")
}
# education
# Create a new column 'edu' and initialize with NA values
wdi.3$edu <- NA
# Find the row indices where the conditions match
row_index_1 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.LITR.ZS')
row_index_2 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.1524.LT.ZS')
# Check if matching rows are found for both indicator codes
if (length(row_index_1) > 0 && length(row_index_2) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Retrieve the values for each indicator code
values_1 <- wdi.3[row_index_1, column_index]
values_2 <- wdi.3[row_index_2, column_index]
# Calculate the simple average of the values
edu <- (values_1 + values_2) / 2
# Assign the average values to the 'edu' column for the corresponding rows
wdi.3[row_index_1, "edu"] <- edu
wdi.3[row_index_2, "edu"] <- edu
} else {
print("No matching rows found for one or both indicator codes.")
}
wdi.3$edu <- ((wdi.3$edu- 38) / 62)
# Create a new column 'hdi' and initialize with NA values
wdi.3$hdi <- NA
# Iterate over each unique country name
for (country in unique(wdi.3$`Country Name`)) {
# Subset the data for the current country
subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
# Check that the index columns exist
if (all(c("gni", "life", "edu") %in% colnames(subset_data))) {
# Each index sits on its own indicator row, so take the first non-missing value
gni <- subset_data$gni[!is.na(subset_data$gni)][1]
life <- subset_data$life[!is.na(subset_data$life)][1]
edu <- subset_data$edu[!is.na(subset_data$edu)][1]
# Compute HDI only when all three indices are available
if (!anyNA(c(gni, life, edu))) {
# Geometric mean of 'gni', 'life', and 'edu'
hdi <- exp((log(gni) + log(life) + log(edu)) / 3)
# Assign the calculated hdi value to the 'hdi' column for the current country
wdi.3$hdi[wdi.3$`Country Name` == country] <- hdi
}
}
}
wdi.3$gdp <- NA
# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GDP.PCAP.PP.CD')
# Check if a matching row is found
if (length(row_index) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Perform the calculation
result <- wdi.3[row_index, column_index]
# Assign the result to the 'gdp' column
wdi.3[row_index, "gdp"] <- result
} else {
print("No matching row found.")
}
unique_countries <- unique(wdi.3$`Country Name`)
wdi.4 <- data.frame(
CountryName = character(length(unique_countries)),
life = numeric(length(unique_countries)),
edu = numeric(length(unique_countries)),
gni = numeric(length(unique_countries)),
gdp = numeric(length(unique_countries))
)
# Iterate over each unique country name
for (i in 1:length(unique_countries)) {
country <- unique_countries[i]
# Subset the data for the current country
subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
# Check if data exists for all variables
if (all(c("life", "edu", "gni", "gdp") %in% colnames(subset_data))) {
# Assign the values to the new data frame
wdi.4[i, "CountryName"] <- country
wdi.4[i, "life"] <- subset_data$life[!is.na(subset_data$life)][1]
wdi.4[i, "edu"] <- subset_data$edu[!is.na(subset_data$edu)][1]
wdi.4[i, "gni"] <- subset_data$gni[!is.na(subset_data$gni)][1]
wdi.4[i, "gdp"] <- subset_data$gdp[!is.na(subset_data$gdp)][1]
}
}
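# Geometric mean of the indices; with na.rm = TRUE a country missing one index
# still gets a value computed from the remaining ones
wdi.4$hdi <- apply(wdi.4[, c("gni", "life", "edu")], 1, function(x) exp(mean(log(x), na.rm = TRUE)))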
ggplot(wdi.4, aes(x = CountryName, y = hdi)) +
geom_point() +
labs(x = "Country", y = "HDI") +
ggtitle("Scatter Plot of HDI by Country")
ggplot(wdi.4, aes(x = CountryName, y = hdi)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Country", y = "HDI") +
ggtitle("Bar Plot of HDI by Country") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
hdi <- read_csv("hdi_UN.csv")
describe(wdi.4$hdi)
describe(hdi$`Human Development Index (HDI)`)
ggplot(hdi, aes(x = Country, y = `Human Development Index (HDI)`)) +
geom_bar(stat = "identity", fill = "blue") +
labs(x = "Country", y = "HDI") +
ggtitle("Bar Plot of HDI by Country") +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
CountryName <- hdi$Country
wdi.5 <- na.omit(wdi.4)
cor(wdi.5$hdi, wdi.5$gdp)
cor(wdi.5$hdi, log(wdi.5$gdp))
# GDP per capita (PPP), no transformation
ggplot(data = wdi.4, mapping = aes(x = gdp)) +
geom_histogram(color = '#6D9EC1') +
labs(title = "GDP PPP Distribution", x = "GDP")
# GDP per capita (PPP), log-transformed
ggplot(data = wdi.4, mapping = aes(x = log(gdp))) +
geom_histogram(color = '#6D9EC1') +
labs(title = "GDP PPP Distribution, log-transformed", x = "Log GDP")
ggplot(wdi.4, aes(x = log(gdp), y = hdi)) +
geom_point() +
labs(x = "Log GDP", y = "HDI") +
ggtitle("GDP and HDI")
wdi.3$life2 <- NA
# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'SP.DYN.LE00.IN')
# Check if a matching row is found
if (length(row_index) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Retrieve the values for the matching row
values <- wdi.3[row_index, column_index]
# Perform the calculation: subtract the sample minimum and divide by the range (max - min), ignoring missing values
result <- (values - min(values, na.rm = TRUE)) / (max(values, na.rm = TRUE) - min(values, na.rm = TRUE))
# Assign the result to the 'life2' column
wdi.3[row_index, "life2"] <- result
} else {
print("No matching row found.")
}
wdi.3$gni2 <- NA
# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GNP.PCAP.CD')
# Check if a matching row is found
if (length(row_index) > 0) {
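# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Retrieve the GNI values for the matching rows
values <- wdi.3[row_index, column_index]
# Perform the calculation: subtract the sample minimum and divide by the range (max - min), ignoring missing values
result <- (values - min(values, na.rm = TRUE)) / (max(values, na.rm = TRUE) - min(values, na.rm = TRUE))
# Assign the result to the 'gni2' column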
wdi.3[row_index, "gni2"] <- result
} else {
print("No matching row found.")
}
# education
# Create a new column 'edu' and initialize with NA values
wdi.3$edu <- NA
# Find the row indices where the conditions match
row_index_1 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.LITR.ZS')
row_index_2 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.1524.LT.ZS')
# Check if matching rows are found for both indicator codes
if (length(row_index_1) > 0 && length(row_index_2) > 0) {
# Get the column index for the year 2020
column_index <- which(colnames(wdi.3) == "2020")
# Retrieve the values for each indicator code
values_1 <- wdi.3[row_index_1, column_index]
values_2 <- wdi.3[row_index_2, column_index]
# Calculate the simple average of the values
edu <- (values_1 + values_2) / 2
# Assign the average values to the 'edu' column for the corresponding rows
wdi.3[row_index_1, "edu"] <- edu
wdi.3[row_index_2, "edu"] <- edu
} else {
print("No matching rows found for one or both indicator codes.")
}
wdi.3$edu <- ((wdi.3$edu- 38) / 62)
# Create a new column 'hdi2' and initialize with NA values
wdi.3$hdi2 <- NA
# Iterate over each unique country name
for (country in unique(wdi.3$`Country Name`)) {
# Subset the data for the current country
subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
# Check that the rescaled index columns exist
if (all(c("gni2", "life2", "edu") %in% colnames(subset_data))) {
# Each index sits on its own indicator row, so take the first non-missing value
gni2 <- subset_data$gni2[!is.na(subset_data$gni2)][1]
life2 <- subset_data$life2[!is.na(subset_data$life2)][1]
edu <- subset_data$edu[!is.na(subset_data$edu)][1]
# Compute HDI only when all three indices are available
if (!anyNA(c(gni2, life2, edu))) {
hdi2 <- exp((log(gni2) + log(life2) + log(edu)) / 3)
# Assign the calculated value to the 'hdi2' column for the current country
wdi.3$hdi2[wdi.3$`Country Name` == country] <- hdi2
}
}
}
unique_countries <- unique(wdi.3$`Country Name`)
wdi.6 <- data.frame(
CountryName = character(length(unique_countries)),
life2 = numeric(length(unique_countries)),
edu = numeric(length(unique_countries)),
gni2 = numeric(length(unique_countries)),
gdp = numeric(length(unique_countries))
)
# Iterate over each unique country name
for (i in 1:length(unique_countries)) {
country <- unique_countries[i]
# Subset the data for the current country
subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
# Check if data exists for all variables
if (all(c("life2", "edu", "gni2", "gdp") %in% colnames(subset_data))) {
# Assign the values to the new data frame
wdi.6[i, "CountryName"] <- country
wdi.6[i, "life2"] <- subset_data$life2[!is.na(subset_data$life2)][1]
wdi.6[i, "edu"] <- subset_data$edu[!is.na(subset_data$edu)][1]
wdi.6[i, "gni2"] <- subset_data$gni2[!is.na(subset_data$gni2)][1]
wdi.6[i, "gdp"] <- subset_data$gdp[!is.na(subset_data$gdp)][1]
}
}
wdi.6$hdi <- apply(wdi.6[, c("gni2", "life2", "edu")], 1, function(x) exp(mean(log(x), na.rm = TRUE)))
describe(wdi.6$hdi)
describe(wdi.4$hdi)
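The row-index bookkeeping above can also be avoided by reshaping the long indicator table to one row per country before computing the indices. The following is a minimal sketch of that alternative, not the code used for the reported results. It uses base R `reshape()` on a made-up toy table; the column names `country`, `code`, and `value` are assumptions, and a single literacy indicator stands in for the average of the adult and youth rates.

```r
# Toy long-format table mimicking the WDI layout (values are made up)
long <- data.frame(
  country = rep(c("A", "B"), each = 3),
  code    = rep(c("SP.DYN.LE00.IN", "NY.GNP.PCAP.CD", "SE.ADT.LITR.ZS"), times = 2),
  value   = c(70, 5000, 90, 80, 40000, 99),
  stringsAsFactors = FALSE
)

# One row per country, one column per indicator code
wide <- reshape(long, idvar = "country", timevar = "code", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Dimension indices with the fixed bounds used in this report
wide$life <- (wide[["SP.DYN.LE00.IN"]] - 20) / 65
wide$gni  <- (wide[["NY.GNP.PCAP.CD"]] - 100) / 74900
wide$edu  <- (wide[["SE.ADT.LITR.ZS"]] - 38) / 62

# Geometric mean; a missing index gives NA instead of a partial HDI
wide$hdi <- (wide$life * wide$gni * wide$edu)^(1 / 3)

print(wide[, c("country", "life", "gni", "edu", "hdi")])
```

In this layout a missing index propagates to a missing HDI, so no country receives a value based on fewer than three dimensions.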