1 Introduction

In this exercise, World Development Indicators (WDI) data from the World Bank are used to calculate the Human Development Index (HDI) according to the United Nations methodology. HDI is a useful tool for economists to gauge the development status of different countries using more than GDP or other measures focused solely on income.

Data availability is a limitation, especially for education indicators in the World Bank dataset.

2 HDI Index

We want to evaluate indicators to calculate the Human Development Index. We are particularly interested in life expectancy, income (using GNI per capita), and some measure of education.

For education, youth and adult literacy rates were used instead of expected and mean years of schooling, because the WDI dataset has limited education-related indicators.
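As a rough illustration of how the two literacy rates can be combined into a single education index, the sketch below averages them and rescales the result with a minimum of 38 and a maximum of 100, the values used in the appendix code. The function name `education_index` and the sample rates are hypothetical and not taken from the WDI data.

```r
# Hypothetical helper: combine adult and youth literacy (both in percent)
# into a single education index.
education_index <- function(adult_lit, youth_lit,
                            min_value = 38, max_value = 100) {
  avg_lit <- (adult_lit + youth_lit) / 2            # simple average of the two rates
  (avg_lit - min_value) / (max_value - min_value)   # min-max rescaling
}

# Made-up rates for two illustrative countries
education_index(adult_lit = c(90, 60), youth_lit = c(98, 78))
```

A country whose average literacy rate is below 38 would receive a negative index under these assumed bounds, so the choice of minimum matters for low-literacy countries.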

2.1 HDI Calculations

The calculations used for the indexes come from the UN’s HDI methodology.

\[ Index = \frac{ActualValue - MinValue}{MaxValue - MinValue} \]

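Life expectancy maximum is 85 years and minimum is 20 years.

GNI per capita maximum is 75,000 and minimum is 100.

For the literacy-based education measure, the appendix code uses a maximum of 100 and a minimum of 38.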
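The index formula and the bounds above can be sketched as a small function. This is a minimal illustration: the name `dimension_index` and the input values are hypothetical, chosen only to make the arithmetic easy to follow.

```r
# Hypothetical helper implementing Index = (actual - min) / (max - min)
dimension_index <- function(actual, min_value, max_value) {
  (actual - min_value) / (max_value - min_value)
}

# Made-up example values, not taken from the WDI data
life_index <- dimension_index(72.5, min_value = 20, max_value = 85)
gni_index  <- dimension_index(15080, min_value = 100, max_value = 75000)

life_index  # 52.5 / 65
gni_index   # 14980 / 74900 = 0.2
```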

The HDI is the geometric mean of the three indices:

\[ HDI_i = (I_{health} \cdot I_{education} \cdot I_{income})^\frac{1}{3} \]

where \(i\) refers to the individual country and the different \(I\) values represent the calculated indices for health (life expectancy), education (literacy), and income (GNI per capita).
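To make the geometric mean concrete, here is a short sketch for a single country. The function name `hdi_from_indices` and the three index values are made up for illustration.

```r
# Hypothetical helper: geometric mean of the three dimension indices
hdi_from_indices <- function(i_health, i_education, i_income) {
  (i_health * i_education * i_income)^(1 / 3)
}

# Made-up indices for one country
hdi_geometric <- hdi_from_indices(0.8, 0.9, 0.2)

# Equivalent log form, which is how the appendix code computes it
hdi_log_form <- exp(mean(log(c(0.8, 0.9, 0.2))))

# Arithmetic mean of the same indices, for comparison
hdi_arithmetic <- mean(c(0.8, 0.9, 0.2))

c(geometric = hdi_geometric, log_form = hdi_log_form, arithmetic = hdi_arithmetic)
```

The geometric mean comes out below the arithmetic mean here because one index is much lower than the others, so a weak dimension is not fully offset by strong ones.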

3 UN vs World Bank

The HDI calculated from World Bank data differs slightly from the UN’s published values. The most obvious reason is the different education measure. Among the World Bank education indicators, literacy rate had the fewest missing values, so it was used in place of the UN’s years-of-schooling measures, and this changes the resulting index values. In addition, the standard deviation, minimum, and maximum of the calculated HDI were higher.
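One way to set the two series side by side is to merge them by country and compare their spread. The sketch below uses base R on made-up numbers; the data frames `calculated` and `un_values`, their column names, and every value are hypothetical and are not results from this exercise.

```r
# Made-up values for illustration only
calculated <- data.frame(Country = c("A", "B", "C", "D"),
                         hdi_wb  = c(0.45, 0.62, 0.80, 0.93))
un_values  <- data.frame(Country = c("A", "B", "C", "E"),
                         hdi_un  = c(0.50, 0.65, 0.78, 0.90))

# Keep only countries present in both sources
both <- merge(calculated, un_values, by = "Country")

# Side-by-side spread and range
summary_table <- data.frame(
  source = c("World Bank based", "UN"),
  sd     = c(sd(both$hdi_wb), sd(both$hdi_un)),
  min    = c(min(both$hdi_wb), min(both$hdi_un)),
  max    = c(max(both$hdi_wb), max(both$hdi_un))
)
summary_table

# Per-country gap between the two measures
both$gap <- both$hdi_wb - both$hdi_un
both
```

Restricting the comparison to countries present in both sources keeps differences in coverage from being mistaken for differences in method.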

3.1 GDP PPP vs ln GDP PPP

This exercise shows that the correlation between HDI and GDP per capita (PPP) is noticeably higher when GDP is log-transformed. The correlation between log GDP and HDI is 0.956, compared with 0.884 between untransformed GDP and HDI.

One reason is that GDP is skewed to the right, and taking the log reduces that skew. With the skew reduced, the relationship between GDP and HDI is closer to a straight line, which is what the correlation coefficient measures.
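The effect of the log transformation on the correlation can be reproduced on synthetic data. In the sketch below, `gdp` and `hdi` are made-up vectors, with HDI constructed to rise linearly in log GDP; the numbers are assumptions for illustration and are unrelated to the 0.956 and 0.884 reported above.

```r
# Made-up data: GDP per capita spread evenly on the log scale,
# with HDI constructed to rise linearly in log GDP.
gdp <- exp(seq(log(800), log(100000), length.out = 50))
hdi <- 0.30 + 0.13 * (log(gdp) - log(800))

# Right skew in levels: the mean sits above the median
mean(gdp) > median(gdp)

# Pearson correlation measures linear association, so it is higher
# on the scale where the relationship is a straight line
cor_levels <- cor(hdi, gdp)
cor_logs   <- cor(hdi, log(gdp))
cor_logs > cor_levels
```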

3.2 Plotting HDI and GDP

The following plot shows that calculated HDI is highly correlated with log GDP. More specifically, countries with higher GDP per capita have higher HDI values.

4 Addressing Max and Min Values

The maximum and minimum values the UN uses to create the HDI indices can impact the final HDI calculation. Initially, the UN measures were used, except for the education measure created using WDI data. Now, the true maximum and minimum values for the WDI data will be used and the difference will be observed.

I may be making an error in calculation, but it appears that using the different minimum and maximum values makes little difference to the calculated HDI (see the appendix for code). However, that does not mean that differences in minimum and maximum values across samples could not have major impacts on HDI calculations.
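The difference between the two approaches can be sketched on a small vector. The function `minmax_index` and the life expectancy values below are hypothetical; the fixed bounds of 20 and 85 are the ones stated earlier for life expectancy.

```r
# Hypothetical helper: min-max index using either fixed bounds or,
# by default, the observed sample minimum and maximum.
minmax_index <- function(x,
                         min_value = min(x, na.rm = TRUE),
                         max_value = max(x, na.rm = TRUE)) {
  (x - min_value) / (max_value - min_value)
}

# Made-up life expectancies, including one missing value
life <- c(54, 63, 71, NA, 84)

fixed_bounds  <- minmax_index(life, min_value = 20, max_value = 85)
sample_bounds <- minmax_index(life)

# With sample bounds the lowest country scores 0 and the highest 1,
# which stretches the index relative to the fixed bounds
range(fixed_bounds, na.rm = TRUE)
range(sample_bounds, na.rm = TRUE)
```

Two details are worth noting. Without `na.rm = TRUE`, a single missing value would turn the sample minimum and maximum, and therefore every index, into `NA`. And a country sitting at the sample minimum gets an index of 0, which pulls its geometric-mean HDI to 0 regardless of the other two dimensions.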

5 Improving the Development Indicator

The current Human Development Index has many great qualities. By focusing on more than just conventional economic indicators, such as GDP, HDI gives a better picture of the lives and economic circumstances of people across countries. The three main categories of life, education, and income are what I think make the current measure valuable. Future improvements can be made by including more variables in each of these categories to give an even more complete picture.

Potential additions to the life category include food expenditures, which could give more information on food security, an important indicator for low-income countries. It would also be useful to include some measure of government effectiveness and stability. This may be difficult to measure, but it could be a fourth index in a country’s HDI calculation.

Of course, data collection and availability are serious limitations, especially for government measures, as some authoritarian governments or administrations involved in conflict may be reluctant to share information.

6 Appendix

rm(list = ls()) # clear the workspace
gc()            # free unused memory
cat("\f")       # clear the console
packages <- c("readr", #open csv
              "psych", # quick summary stats for data exploration,
              "stargazer", #summary stats for sharing,
              "tidyverse", # data manipulation like selecting variables,
              "corrplot", # correlation plots
              "ggplot2", # graphing
              "ggcorrplot", # correlation plot
              "gridExtra", #overlay plots
              "data.table", # reshape for graphing 
              "car", #vif
              "prettydoc", # html output
              "visdat", # visualize missing variables
              "glmnet", # lasso/ridge
              "caret", # confusion matrix
              "MASS", #step AIC
              "plm", # fixed effects demeaned regression
              "lmtest" # test regression coefficients
)

# Install any missing package, then load it
for (pkg in packages) {
  if (!pkg %in% rownames(installed.packages())) {
    install.packages(pkg
                     , repos = "https://cran.rstudio.com/"
                     , dependencies = TRUE
    )
  }
  library(pkg, character.only = TRUE)
}

rm(packages)
setwd("/Users/matthewcolantonio/Desktop/hdi/")
wdi <- read_csv("WDIData.csv")
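# Keep the four identifier columns plus three year columns, selected by position
# (the code below expects a column named "2020" among them)
wdi.2 <- wdi[,c(1:4, 65:67)] 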

wdi.3 <-wdi.2[wdi.2$`Indicator Code` %in% c("NY.GNP.PCAP.CD", "SP.DYN.LE00.IN", "SE.ADT.LITR.ZS", "SE.ADT.1524.LT.ZS", "NY.GDP.PCAP.PP.CD"), ]
# life expectancy 

wdi.3$life <- NA

# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'SP.DYN.LE00.IN')

# Check if a matching row is found
if (length(row_index) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Perform the calculation: (value - 20) / 65
  result <- (wdi.3[row_index, column_index] - 20) / 65

  # Assign the result to the 'life' column
  wdi.3[row_index, "life"] <- result
} else {
  print("No matching row found.")
}



# Create a new column 'gni' and initialize with NA values
wdi.3$gni <- NA

# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GNP.PCAP.CD')

# Check if a matching row is found
if (length(row_index) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Perform the calculation: (value - 100) / 74900
  result <- (wdi.3[row_index, column_index] - 100) / 74900

  # Assign the result to the 'gni' column
  wdi.3[row_index, "gni"] <- result
} else {
  print("No matching row found.")
}

# education

# Create a new column 'edu' and initialize with NA values
wdi.3$edu <- NA

# Find the row indices where the conditions match
row_index_1 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.LITR.ZS')
row_index_2 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.1524.LT.ZS')

# Check if matching rows are found for both indicator codes
if (length(row_index_1) > 0 && length(row_index_2) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Retrieve the values for each indicator code
  values_1 <- wdi.3[row_index_1, column_index]
  values_2 <- wdi.3[row_index_2, column_index]

  # Calculate the simple average of the values
  edu <- (values_1 + values_2) / 2

  # Assign the average values to the 'edu' column for the corresponding rows
  wdi.3[row_index_1, "edu"] <- edu
  wdi.3[row_index_2, "edu"] <- edu
} else {
  print("No matching rows found for one or both indicator codes.")
}


wdi.3$edu <- ((wdi.3$edu- 38) / 62)



# Create a new column 'hdi' and initialize with NA values
wdi.3$hdi <- NA

# Helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Iterate over each unique country name
for (country in unique(wdi.3$`Country Name`)) {
  # Rows belonging to the current country (one row per indicator)
  country_rows <- wdi.3$`Country Name` == country

  # Each index sits on its own indicator row, so a whole-column NA check
  # would always fail; take the first non-missing value instead
  gni <- first_non_na(wdi.3$gni[country_rows])
  life <- first_non_na(wdi.3$life[country_rows])
  edu <- first_non_na(wdi.3$edu[country_rows])

  # Calculate the geometric mean only when all three indices are available
  if (!anyNA(c(gni, life, edu))) {
    wdi.3$hdi[country_rows] <- exp((log(gni) + log(life) + log(edu)) / 3)
  }
}


wdi.3$gdp <- NA

# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GDP.PCAP.PP.CD')

# Check if a matching row is found
if (length(row_index) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Perform the calculation
  result <- wdi.3[row_index, column_index]

  # Assign the result to the 'gdp' column
  wdi.3[row_index, "gdp"] <- result
} else {
  print("No matching row found.")
}



unique_countries <- unique(wdi.3$`Country Name`)
wdi.4 <- data.frame(
  CountryName = character(length(unique_countries)),
  life = numeric(length(unique_countries)),
  edu = numeric(length(unique_countries)),
  gni = numeric(length(unique_countries)),
  gdp = numeric(length(unique_countries))
)

# Iterate over each unique country name
for (i in 1:length(unique_countries)) {
  country <- unique_countries[i]
  
  # Subset the data for the current country
  subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
  
  # Check if data exists for all variables
  if (all(c("life", "edu", "gni", "gdp") %in% colnames(subset_data))) {
    # Assign the values to the new data frame
    wdi.4[i, "CountryName"] <- country
    wdi.4[i, "life"] <- subset_data$life[!is.na(subset_data$life)][1]
    wdi.4[i, "edu"] <- subset_data$edu[!is.na(subset_data$edu)][1]
    wdi.4[i, "gni"] <- subset_data$gni[!is.na(subset_data$gni)][1]
    wdi.4[i, "gdp"] <- subset_data$gdp[!is.na(subset_data$gdp)][1]
  }
}


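# Geometric mean of the indices. Note: with na.rm = TRUE a country missing one
# index still gets a value, computed from the indices that are available
wdi.4$hdi <- apply(wdi.4[, c("gni", "life", "edu")], 1, function(x) exp(mean(log(x), na.rm = TRUE)))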

ggplot(wdi.4, aes(x = CountryName, y = hdi)) +
  geom_point() +
  labs(x = "Country", y = "HDI") +
  ggtitle("Scatter Plot of HDI by Country")

ggplot(wdi.4, aes(x = CountryName, y = hdi)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Country", y = "HDI") +
  ggtitle("Bar Plot of HDI by Country") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
hdi <- read_csv("hdi_UN.csv")

describe(wdi.4$hdi)
describe(hdi$`Human Development Index (HDI)`)

ggplot(hdi, aes(x = Country, y = `Human Development Index (HDI)`)) +
  geom_bar(stat = "identity", fill = "blue") +
  labs(x = "Country", y = "HDI") +
  ggtitle("Bar Plot of HDI by Country") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

CountryName <- hdi$Country

wdi.5 <- na.omit(wdi.4)

cor(wdi.5$hdi, wdi.5$gdp)
cor(wdi.5$hdi, log(wdi.5$gdp))
# GDP per capita (PPP), no transformation
ggplot(data = wdi.4, mapping = aes(x = gdp)) +
  geom_histogram(color = '#6D9EC1') +
  labs(title = "GDP PPP Distribution", x = "GDP")

# GDP per capita (PPP), log-transformed
ggplot(data = wdi.4, mapping = aes(x = log(gdp))) +
  geom_histogram(color = '#6D9EC1') +
  labs(title = "GDP PPP Distribution, log-transformed", x = "Log GDP")
ggplot(wdi.4, aes(x = log(gdp), y = hdi)) +
  geom_point() +
  labs(x = "Log GDP", y = "HDI") +
  ggtitle("GDP and HDI")

wdi.3$life2 <- NA

# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'SP.DYN.LE00.IN')

# Check if a matching row is found
if (length(row_index) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Retrieve the values for the matching row
  values <- wdi.3[row_index, column_index]

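  # Perform the calculation: subtract the observed minimum and divide by the observed range,
  # ignoring missing values so that a single NA does not make every result NA
  result <- (values - min(values, na.rm = TRUE)) / (max(values, na.rm = TRUE) - min(values, na.rm = TRUE))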

  # Assign the result to the 'life2' column
  wdi.3[row_index, "life2"] <- result
} else {
  print("No matching row found.")
}


wdi.3$gni2 <- NA

# Find the row index where the condition matches
row_index <- which(wdi.3[,"Indicator Code"] == 'NY.GNP.PCAP.CD')

# Check if a matching row is found
if (length(row_index) > 0) {
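  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Retrieve the GNI values for the matching rows
  # (otherwise 'values' still holds the life expectancy data from above)
  values <- wdi.3[row_index, column_index]

  # Perform the calculation: subtract the observed minimum and divide by the observed range
  result <- (values - min(values, na.rm = TRUE)) / (max(values, na.rm = TRUE) - min(values, na.rm = TRUE))

  # Assign the result to the 'gni2' column
  wdi.3[row_index, "gni2"] <- result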
} else {
  print("No matching row found.")
}

# education

# Create a new column 'edu' and initialize with NA values
wdi.3$edu <- NA

# Find the row indices where the conditions match
row_index_1 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.LITR.ZS')
row_index_2 <- which(wdi.3[,"Indicator Code"] == 'SE.ADT.1524.LT.ZS')

# Check if matching rows are found for both indicator codes
if (length(row_index_1) > 0 && length(row_index_2) > 0) {
  # Get the column index for the year 2020
  column_index <- which(colnames(wdi.3) == "2020")

  # Retrieve the values for each indicator code
  values_1 <- wdi.3[row_index_1, column_index]
  values_2 <- wdi.3[row_index_2, column_index]

  # Calculate the simple average of the values
  edu <- (values_1 + values_2) / 2

  # Assign the average values to the 'edu' column for the corresponding rows
  wdi.3[row_index_1, "edu"] <- edu
  wdi.3[row_index_2, "edu"] <- edu
} else {
  print("No matching rows found for one or both indicator codes.")
}
wdi.3$edu <- ((wdi.3$edu- 38) / 62)

# Create a new column 'hdi2' and initialize with NA values
wdi.3$hdi2 <- NA

# Helper: first non-missing value of a vector (NA if there is none)
first_non_na <- function(x) x[!is.na(x)][1]

# Iterate over each unique country name
for (country in unique(wdi.3$`Country Name`)) {
  # Rows belonging to the current country (one row per indicator)
  country_rows <- wdi.3$`Country Name` == country

  # Use the indices built from the observed minimum and maximum
  gni2 <- first_non_na(wdi.3$gni2[country_rows])
  life2 <- first_non_na(wdi.3$life2[country_rows])
  edu <- first_non_na(wdi.3$edu[country_rows])

  # Calculate the geometric mean only when all three indices are available
  if (!anyNA(c(gni2, life2, edu))) {
    # Assign the calculated value to the 'hdi2' column for the current country
    wdi.3$hdi2[country_rows] <- exp((log(gni2) + log(life2) + log(edu)) / 3)
  }
}


unique_countries <- unique(wdi.3$`Country Name`)
wdi.6 <- data.frame(
  CountryName = character(length(unique_countries)),
  life2 = numeric(length(unique_countries)),
  edu = numeric(length(unique_countries)),
  gni2 = numeric(length(unique_countries)),
  gdp = numeric(length(unique_countries))
)

# Iterate over each unique country name
for (i in 1:length(unique_countries)) {
  country <- unique_countries[i]
  
  # Subset the data for the current country
  subset_data <- subset(wdi.3, wdi.3$`Country Name` == country)
  
  # Check if data exists for all variables
  if (all(c("life2", "edu", "gni2", "gdp") %in% colnames(subset_data))) {
    # Assign the values to the new data frame
    wdi.6[i, "CountryName"] <- country
    wdi.6[i, "life2"] <- subset_data$life2[!is.na(subset_data$life2)][1]
    wdi.6[i, "edu"] <- subset_data$edu[!is.na(subset_data$edu)][1]
    wdi.6[i, "gni2"] <- subset_data$gni2[!is.na(subset_data$gni2)][1]
    wdi.6[i, "gdp"] <- subset_data$gdp[!is.na(subset_data$gdp)][1]
  }
}


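# Geometric mean of the indices based on observed minimum and maximum values
# (na.rm = TRUE: missing indices are skipped rather than producing NA)
wdi.6$hdi <- apply(wdi.6[, c("gni2", "life2", "edu")], 1, function(x) exp(mean(log(x), na.rm = TRUE)))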
describe(wdi.6$hdi)
describe(wdi.4$hdi)
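
The row-by-row loops in this appendix can also be expressed by reshaping the long indicator table to one row per country first. The following is a minimal sketch in base R on a made-up two-country table, not a replacement for the code above: the data frame `long`, its column names, and all values are hypothetical, and the bounds are the ones used earlier in this appendix.

```r
# Made-up long-format data in the spirit of the WDI file
long <- data.frame(
  country = rep(c("A", "B"), each = 4),
  code    = rep(c("SP.DYN.LE00.IN", "NY.GNP.PCAP.CD",
                  "SE.ADT.LITR.ZS", "SE.ADT.1524.LT.ZS"), times = 2),
  value   = c(72.5, 15080, 90, 98,
              59.0,  1598, 60, 78)
)

# One row per country, one column per indicator
wide <- reshape(long, idvar = "country", timevar = "code", direction = "wide")
names(wide) <- sub("^value\\.", "", names(wide))

# Dimension indices with the bounds used in this appendix
wide$life <- (wide[["SP.DYN.LE00.IN"]] - 20) / 65
wide$gni  <- (wide[["NY.GNP.PCAP.CD"]] - 100) / 74900
wide$edu  <- ((wide[["SE.ADT.LITR.ZS"]] + wide[["SE.ADT.1524.LT.ZS"]]) / 2 - 38) / 62

# Geometric mean of the three indices; NA if any component is missing
wide$hdi <- exp(rowMeans(log(wide[, c("life", "gni", "edu")])))
wide[, c("country", "life", "gni", "edu", "hdi")]
```

With one row per country, each index is an ordinary column operation, so there is no need to track which indicator row holds which value.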