World Development Indicators - Final Project for Data Wrangling @ UC

Synopsis

This project compares countries against each other on a range of development indicators. Most of the study focuses on how three of the world's largest economies (the USA, India and China) fare against each other on these measures. A brief study of unemployment across nations is also included.

I made use of the exploratory data analysis functionality of the R tidyverse package to draw the following insights. China has a higher balance of trade and very high GDP growth, but its CO2 emission rate is also on the rise. It needs to adopt a more environmentally friendly development model.

The USA has low and fairly constant GDP growth, indicating that the economy is more or less saturated. It has a trade deficit, which has been improving over the years. The US is also cutting down on its carbon emissions, possibly shifting towards an environmentally sustainable economy.

India has a very high GDP growth rate but increasing CO2 emissions and a growing trade deficit. It needs to adopt more economically and environmentally sustainable development models.

Packages required

I have used the following packages for doing this project.

tidyverse package for tidying data and creating visualizations
DT package for printing condensed tables

library(tidyverse) # for tidying data and creating visualizations
library(DT) # for printing condensed tables

Data Description

The dataset that I am using is the World Development Indicators dataset published by the World Bank. It contains time series data on various economic indicators for over a hundred countries around the world. The World Bank has compiled the data from officially recognized international sources, which makes it a reliable source of data on development indicators.

This dataset is updated quarterly and covers data from 1960 to 2016. The dataset was last updated on 17th November 2016. Missing values are coded as NA.

Various indicators that are listed in the original dataset can be found here

A list of countries for which data is collected can be found here

The dataset contains multiple tables. Only the WDI_Data table has been used for this project.

Variable Description

Country Name: The name of the country or region

Country Code: Abbreviated code to identify a country or region

Indicator Name: Name of the measured development indicator

Indicator Code: Abbreviated code for the indicator

1960 to 2015: One column per year from 1960 to 2015, holding the corresponding indicator value for each row

Data Preparation

Importing the data

The original WDI_Data table is over 500MB in size, with data for over a thousand indicators. Therefore, I picked a few indicators for the years 1960 to 2015 and hosted them here on GitHub. This is the data used in this study.

#Importing data from Github
wdi_url <- "https://raw.githubusercontent.com/suchith91/wdi/master/WDI_Data_Selected.csv"
wdi_data <- read.csv(wdi_url) %>% as_tibble()
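A side note on the import step: by default, read.csv() passes column names through make.names(), which replaces spaces with dots and prefixes names that start with a digit with 'X'. The sketch below illustrates this on a small inline CSV with made-up values (it is not the project data) and shows that check.names = FALSE keeps the header as written. It is only an illustration of the behaviour, not a change to the import above.

```r
# Inline CSV with made-up values, standing in for the real file
csv_text <- 'Country Name,1960,1961
A,1.5,2.5
B,3.5,4.5'

# Default behaviour: header names are passed through make.names()
df_default <- read.csv(text = csv_text)
names(df_default)
# "Country.Name" "X1960" "X1961"

# check.names = FALSE keeps the header exactly as written in the file
df_raw <- read.csv(text = csv_text, check.names = FALSE)
names(df_raw)
# "Country Name" "1960" "1961"
```

With the raw names, year columns have to be quoted with backticks (for example `1960`) in later code, which is one reason to keep the default and strip the 'X' afterwards.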

Cleaning the data

The data was originally in an untidy format. I performed the following actions on the data to convert it to tidy format.

a) When I imported the data from GitHub, a letter ‘X’ was added in front of the column names that were numeric, i.e. columns 1960 to 2015. I first used the gather function to collapse all year columns into a single column called ‘YearVar’ and their values into another column called ‘Percentage’.

b) The ‘YearVar’ column now held values in the format X1960, X2015 etc. So, I used the ‘separate’ function to split it after the first character into two columns, ‘Junk’ and ‘Year’. The ‘Junk’ column has X in all fields and the ‘Year’ column holds the years.

c) The indicators for a particular year and a particular country were now spread across different observations. I wanted to make them a single observation with indicator names as columns. To achieve this, I first dropped the columns ‘Junk’ and ‘Indicator.Code’ from the dataset using the select function.

d) I then used the spread function to turn the indicator names into column names, with their corresponding values for each year as the field values of these columns.

e) The column names were too long, as the ‘Indicator.Name’ field had long descriptions of the indicators. So, I gave these columns shorter names using the rename function.

f) Finally, I removed the columns ‘Poverty’ and ‘Youth.Literacy’ from the table, since I was not analyzing the trends for those indicators.

# Data cleaning
wdi <- wdi_data %>%
  # (a) collapse the year columns into key/value pairs
  gather(`X1960`:`X2015`, key = "YearVar", value = "Percentage") %>%
  # (b) split e.g. "X1960" after the first character into "X" and 1960
  separate(YearVar, into = c("Junk", "Year"), sep = 1, convert = TRUE) %>%
  # (c) drop the columns that are no longer needed
  select(-Junk, -Indicator.Code) %>%
  # (d) one column per indicator
  spread(Indicator.Name, Percentage) %>%
  # (e) shorter column names
  rename(CO2.Emissions = `CO2 emissions (metric tons per capita)`,
         Exports = `Exports of goods and services (% of GDP)`,
         Forest.Area = `Forest area (% of land area)`,
         GDP.Growth = `GDP growth (annual %)`,
         Imports = `Imports of goods and services (% of GDP)`,
         Poverty = `Poverty headcount ratio at national poverty lines (% of population)`,
         Unemployment = `Unemployment, total (% of total labor force) (national estimate)`,
         Youth.Literacy = `Youth literacy rate, population 15-24 years, both sexes (%)`) %>%
  # (f) drop the indicators that are not analyzed
  select(-Poverty, -Youth.Literacy)

datatable(wdi, caption = 'FINAL DATASET IN TIDY FORMAT')
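In tidyr 1.0.0 and later, gather and spread are superseded by pivot_longer and pivot_wider. As a hedged sketch only, the same reshaping could look as follows on a toy table. The tibble `toy`, its values and its two-letter codes are made up for illustration and merely mimic the layout of wdi_data. The names_prefix argument strips the leading 'X', so no 'Junk' column is needed.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for wdi_data (all values are made up for illustration)
toy <- tibble(
  Country.Name   = c("A", "A", "B", "B"),
  Country.Code   = c("AAA", "AAA", "BBB", "BBB"),
  Indicator.Name = c("GDP growth (annual %)",
                     "Imports of goods and services (% of GDP)",
                     "GDP growth (annual %)",
                     "Imports of goods and services (% of GDP)"),
  Indicator.Code = c("G", "I", "G", "I"),
  X1960 = c(1, 10, 2, 20),
  X1961 = c(3, 30, 4, 40)
)

toy_tidy <- toy %>%
  # Drop the code first so it does not act as an extra identifier later
  select(-Indicator.Code) %>%
  # One row per country, indicator and year; names_prefix removes the "X"
  pivot_longer(X1960:X1961, names_to = "Year", names_prefix = "X",
               values_to = "Percentage") %>%
  mutate(Year = as.integer(Year)) %>%
  # One column per indicator
  pivot_wider(names_from = Indicator.Name, values_from = Percentage) %>%
  rename(GDP.Growth = `GDP growth (annual %)`,
         Imports = `Imports of goods and services (% of GDP)`)

toy_tidy
```

The result has one row per country and year, with GDP.Growth and Imports as separate columns, which is the same shape as the tidy table produced above.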

Variable Description and Summary Statistics

# Generating summary statistics for inline code
min_year <- min(wdi$Year, na.rm = TRUE)
max_year <- max(wdi$Year, na.rm = TRUE)
mean_year <- mean(wdi$Year, na.rm = TRUE)
miss_year <- sum(is.na(wdi$Year))

min_CO2 <- min(wdi$CO2.Emissions, na.rm = TRUE)
max_CO2 <- max(wdi$CO2.Emissions, na.rm = TRUE)
mean_CO2 <- mean(wdi$CO2.Emissions, na.rm = TRUE)
miss_CO2 <- sum(is.na(wdi$CO2.Emissions))

min_exp <- min(wdi$Exports, na.rm = TRUE)
max_exp <- max(wdi$Exports, na.rm = TRUE)
mean_exp <- mean(wdi$Exports, na.rm = TRUE)
miss_exp <- sum(is.na(wdi$Exports))

min_for <- min(wdi$Forest.Area, na.rm = TRUE)
max_for <- max(wdi$Forest.Area, na.rm = TRUE)
mean_for <- mean(wdi$Forest.Area, na.rm = TRUE)
miss_for <- sum(is.na(wdi$Forest.Area))

min_gdp <- min(wdi$GDP.Growth, na.rm = TRUE)
max_gdp <- max(wdi$GDP.Growth, na.rm = TRUE)
mean_gdp <- mean(wdi$GDP.Growth, na.rm = TRUE)
miss_gdp <- sum(is.na(wdi$GDP.Growth))

min_imp <- min(wdi$Imports, na.rm = TRUE)
max_imp <- max(wdi$Imports, na.rm = TRUE)
mean_imp <- mean(wdi$Imports, na.rm = TRUE)
miss_imp <- sum(is.na(wdi$Imports))

min_ump <- min(wdi$Unemployment, na.rm = TRUE)
max_ump <- max(wdi$Unemployment, na.rm = TRUE)
mean_ump <- mean(wdi$Unemployment, na.rm = TRUE)
miss_ump <- sum(is.na(wdi$Unemployment))
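The four statistics above are computed the same way for every column, so they could also be produced by one small helper. The sketch below is an alternative under stated assumptions, not the code used for this report: `describe_column` is a hypothetical helper name, and `toy` is a made-up stand-in for wdi.

```r
# Hypothetical helper: min, max, mean and NA count for one numeric vector
describe_column <- function(x) {
  c(min     = min(x, na.rm = TRUE),
    max     = max(x, na.rm = TRUE),
    mean    = mean(x, na.rm = TRUE),
    missing = sum(is.na(x)))
}

# Toy data standing in for wdi (values are made up)
toy <- data.frame(
  Year          = c(1960, 1961, 1962, 1963),
  CO2.Emissions = c(1.0, NA, 3.0, 5.0),
  Unemployment  = c(NA, NA, 4.0, 6.0)
)

# Matrix with one row per statistic and one column per variable
stats <- sapply(toy, describe_column)
stats
```

Individual values can then be looked up by name, for example stats["mean", "CO2.Emissions"], which is convenient for inline code.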

Since the final dataset has some new columns, I have provided variable descriptions and some summary statistics for those columns below.

Year

Description: Represents the year in which the observation was recorded

Minimum value = 1960

Maximum value = 2015

Average value = 1987.5

Missing values = 0

CO2.Emissions

Description: CO2 emissions in metric tons per capita.

Minimum value = -0.0203

Maximum value = 99.8

Average value = 4.2342192

Missing values = 2877

Exports

Description: Exports of goods and services as percentage of GDP.

Minimum value = 0.00538

Maximum value = 230

Average value = 33.3705208

Missing values = 4441

Forest.Area

Description: Forest area as percentage of land area.

Minimum value = 0

Maximum value = 98.9

Average value = 31.8434439

Missing values = 8152

GDP.Growth

Description: Annual growth rate of GDP calculated as percentage.

Minimum value = -64

Maximum value = 190

Average value = 3.9406544

Missing values = 3987

Imports

Description: Imports of goods and services as percentage of GDP.

Minimum value = 0

Maximum value = 425

Average value = 38.5534421

Missing values = 4441

Unemployment

Description: Unemployment as a percentage of total labor force

Minimum value = 0

Maximum value = 59.5

Average value = 8.7052664

Missing values = 10992

Exploratory Data Analysis

Carbon dioxide emission comparison

CO2 emission is an indicator of the sustainability of an economy's development model. The lower the carbon footprint, the more sustainable the development model. Here, I used ggplot to compare the per capita CO2 emissions of the USA, India and China.

# CO2 emission comp
wdi %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
  ggplot() + geom_line(mapping = aes(Year, CO2.Emissions, 
                                     group=Country.Name, color=Country.Name)) +
  labs(y = 'CO2 emissions (metric tons per capita)') +
  ggtitle('CO2 Emission comparison of US, India and China')

As seen from the plot, per capita carbon dioxide emission is much higher in the US than in India and China. However, since the population of the US is much smaller than that of the other two nations, per capita figures alone do not tell us whether there is a significant difference in the total CO2 emissions of these countries.

Another interesting trend is that per capita CO2 emission is decreasing for the US and slightly increasing for India, while there is a sharp increase for China after the turn of the millennium. This suggests that developing countries still need to take the necessary steps and adopt sustainable development models.
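A note on filtering for several countries at once: comparing a column against a vector with `==` recycles the shorter vector element by element, so a row only matches when its code happens to line up with the recycled position. `%in%` tests membership instead. The small sketch below uses a made-up vector of codes to show the difference.

```r
# Made-up vector of country codes, deliberately not in a repeating order
codes <- c("USA", "IND", "CHN", "USA", "IND", "CHN", "CHN", "USA", "IND")

# `==` compares against USA, IND, CHN, USA, IND, CHN, ... position by position
n_equal <- sum(codes == c("USA", "IND", "CHN"))
n_equal
# 6: the last three elements are misaligned and are missed

# `%in%` asks, for each element, whether it is one of the three codes
n_in <- sum(codes %in% c("USA", "IND", "CHN"))
n_in
# 9
```

No warning is raised in the first case because the length of `codes` is a multiple of three, which makes the dropped rows easy to overlook.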

GDP growth comparison

GDP growth is an indicator of how fast a country is advancing economically. Here, I plotted the GDP growth indicator against Year for the US, India and China to see the historical trend of GDP growth in these economies.

#GDP Growth comp
wdi %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
  ggplot() + geom_line(mapping = aes(Year, GDP.Growth, 
                                     group=Country.Name, color=Country.Name)) +
  labs(y = 'GDP growth (annual %)') +
  ggtitle('GDP growth comparison of US, India and China')

As seen from the graph, India has the highest GDP growth among the three in recent years. The GDP growth of the US has been more or less constant since the 1980s. China has historically shown alternating rises and falls in its GDP growth, going below zero more than once.

Trade deficit comparison

Balance of trade is the difference between exports and imports in an economy. If exports are higher than imports, there is a positive balance of trade, and if exports are lower than imports, there is a negative balance of trade, or a trade deficit.
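As a small worked example of this definition, the sketch below computes the balance for two hypothetical countries. The figures are made up and are not taken from the dataset. Because exports and imports are both expressed as a percentage of GDP, the difference is the balance of trade as a percentage of GDP.

```r
# Made-up figures (% of GDP), for illustration only
trade <- data.frame(
  Country = c("A", "B"),
  Exports = c(22.5, 13.0),
  Imports = c(18.5, 16.5)
)

# Balance of trade = exports - imports
trade$Balance <- trade$Exports - trade$Imports

# A negative balance is a trade deficit
trade$Status <- ifelse(trade$Balance >= 0, "surplus", "deficit")
trade
```

Country A ends up with a surplus of 4.0% of GDP and country B with a deficit of 3.5% of GDP.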

I introduced a new column to calculate the balance of trade for the US, India and China. Then I produced a table with all available balance of trade values for these three countries after 2010.

Bot <- wdi %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(Country.Code %in% c('USA', 'IND', 'CHN'), Year > 2010) %>%
  mutate(`Balance of trade` = Exports - Imports) %>%
  select(Country.Name, `Balance of trade`, Year) %>%
  na.omit() %>%
  arrange(desc(Year), desc(`Balance of trade`))

datatable(Bot)

I then plotted the historic trend of balance of trade for these countries.

#Trade deficit comp
wdi %>%
  # %in% tests membership; == would recycle the vector and silently drop rows
  filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
  mutate(`Balance of trade (% of GDP)` = Exports - Imports) %>%
  # stat is not an aesthetic, so it does not belong inside aes()
  ggplot() + geom_line(mapping = aes(Year, `Balance of trade (% of GDP)`,
                                     group=Country.Name, color=Country.Name)) +
  ggtitle('Balance of trade comparison of US, India and China')

As seen from the plot, India and the US have a trade deficit, while China has had a positive balance of trade since the 1990s. The US has been moving towards a better balance of trade since the late 2000s. India has an alarming trade deficit, which is increasing year after year.

Countries with lowest and highest unemployment in 2014

Here, I created two tables: the five countries with the lowest unemployment and the five countries with the highest unemployment in 2014.

#Unemployment comparison
low <- wdi %>%
  filter(Year == 2014) %>%
  select(Country.Name, Unemployment) %>%
  na.omit() %>%
  arrange(Unemployment) %>%
  head(5)

high <- wdi %>%
  filter(Year == 2014) %>%
  select(Country.Name, Unemployment) %>%
  na.omit() %>%
  arrange(desc(Unemployment)) %>%
  head(5)

datatable(low, caption = 'Top 5 countries with lowest unemployment in 2014')

datatable(high, caption = 'Top 5 countries with highest unemployment in 2014')
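The two tables differ only in sort direction, so the shared filtering can be done once. The following is a sketch of one possible alternative, assuming dplyr 1.0.0 or later for slice_min and slice_max. The tibble `toy` and its values are made up and stand in for wdi.

```r
library(dplyr)

# Toy data standing in for wdi (values are made up)
toy <- tibble(
  Country.Name = c("A", "B", "C", "D"),
  Year         = c(2014, 2014, 2014, 2014),
  Unemployment = c(7.5, NA, 2.0, 12.0)
)

# Shared preparation: one year, no missing unemployment values
year_2014 <- toy %>%
  filter(Year == 2014, !is.na(Unemployment)) %>%
  select(Country.Name, Unemployment)

# with_ties = FALSE guarantees exactly n rows even when values are tied
low  <- slice_min(year_2014, Unemployment, n = 2, with_ties = FALSE)
high <- slice_max(year_2014, Unemployment, n = 2, with_ties = FALSE)
```

Here n = 2 is used only because the toy table is small; n = 5 would correspond to the tables above.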

Summary

A comparative study of the performance of the USA, India and China based on CO2 emissions, GDP growth and balance of trade was performed using the exploratory data analysis functionality of the R tidyverse package. Most of the comparisons were made by observing the historical trends of various indicators for these three countries. China has a higher balance of trade and very high GDP growth, but its CO2 emission rate is also on the rise. It needs to adopt a more environmentally friendly development model. The USA has low and fairly constant GDP growth, indicating that the economy is more or less saturated. It has a trade deficit, which has been improving over the years. The US is also cutting down on its carbon emissions, possibly shifting towards an environmentally sustainable economy. India has a very high GDP growth rate but increasing CO2 emissions and a growing trade deficit. It needs to adopt more economically and environmentally sustainable development models.

A brief study of unemployment across nations was also performed. In 2014, Belarus had the lowest unemployment and Macedonia had the highest.

Suchith Rajasekharan

December 9, 2016
