This project compares countries on the basis of various development indicators. Most of the study focuses on how three of the world's largest economies, the USA, India and China, fare against each other on these indicators. A brief study of unemployment across nations is also included.
I made use of the exploratory data analysis functionality of the R tidyverse package to arrive at the following insights. China has a higher balance of trade and very high GDP growth, but its CO2 emission rate is also on the rise. It needs to adopt a more environmentally friendly development model.
The USA has low and fairly constant GDP growth, indicating that the economy is more or less saturated. It has a trade deficit, which has been improving over the years. The US is also cutting down on its carbon emissions, possibly shifting towards an environmentally sustainable economy.
India has a very high GDP growth rate, but its CO2 emissions and trade deficit are both increasing. It needs to adopt more economically and environmentally sustainable development models.
I have used the following packages for doing this project.
tidyverse package for tidying data and creating visualizations
DT package for printing condensed tables
library(tidyverse) # for tidying data and creating visualizations
library(DT) # for printing condensed tables
The dataset that I am using is the World Development Indicators dataset published by the World Bank. It contains time series data on various economic indicators for over a hundred countries around the world. The World Bank has compiled the data from officially recognized international sources, which makes it one of the more reliable sources of data on development indicators.
This dataset is updated quarterly and covers data from 1960 to 2016. The dataset was last updated on 17th November 2016. Missing values are coded as NA.
Various indicators that are listed in the original dataset can be found here
A list of countries for which data is collected can be found here
The dataset has multiple tables in it. Only WDI_Data table has been used for this project.
Country Name: The name of the country or region
Country Code: Abbreviated code to identify a country or region
Indicator Name: Name of the measured development indicator
Indicator Code: Abbreviated code for the indicator
1960 to 2015: One column per year, each holding the value of the indicator for that row's country and year
The original WDI_Data table is over 500MB in size, with data for over a thousand indicators. Therefore, I picked a few indicators for the years 1960 to 2015 and hosted the selection here on GitHub. This is the data that is used in this study.
#Importing data from Github
wdi_url <- "https://raw.githubusercontent.com/suchith91/wdi/master/WDI_Data_Selected.csv"
wdi_data <- read.csv(wdi_url) %>% as_tibble()
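One thing to be aware of with read.csv(): by default it converts column headers into syntactically valid R names, so a header such as 1960 becomes X1960 and a space becomes a dot. The sketch below illustrates this with a small made-up CSV string (the values are not from the WDI data) and shows that check.names = FALSE keeps the headers as written. It is only an illustration of the behaviour, not a change to the import above.

```r
# Made-up two-row CSV standing in for the real file (illustrative values only)
csv_text <- "Country Name,1960,1961\nA,1.5,2.5\nB,3.0,4.0"

# Default behaviour: headers are converted to syntactically valid R names
default_names <- names(read.csv(text = csv_text))

# check.names = FALSE keeps the headers exactly as they appear in the file
raw_names <- names(read.csv(text = csv_text, check.names = FALSE))

print(default_names)
print(raw_names)
```

With the default setting, the year columns can still be referred to without backticks, at the cost of having to strip the X prefix later.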
The data was originally in an untidy format. I performed the following actions on the data to convert it to a tidy format.
a) When I imported the data from GitHub, a letter ‘X’ was added in front of the column names that were numeric, i.e. columns 1960 to 2015. I first used the gather function to collapse all year columns into a single column called ‘YearVar’ and the values in them into another column called ‘Percentage’.
b) The ‘YearVar’ column now held values in the format X1960, X2015 etc. So, I used the ‘separate’ function to split them after the first character into two columns, ‘Junk’ and ‘Year’. The ‘Junk’ column has X in all fields and the ‘Year’ column has the years in it.
c) The indicators for a particular year and particular country were now spread across different observations. I wanted to make them a single observation with indicator names as columns. To achieve this, I first dropped the columns ‘Junk’ and ‘Indicator.Code’ from the dataset using the select function.
d) I then used the spread function to turn the indicator names into column names, with their corresponding values for each year as the field values for these columns.
e) The column names were too long as the ‘Indicator.Name’ field had long descriptions of indicators. So, I gave these columns shorter names using the rename function.
f) Finally, I removed the columns, ‘Poverty’ and ‘Youth.Literacy’ from the table since I was not analyzing the trends for those indicators.
# Data cleaning
wdi <- wdi_data %>%
gather(`X1960`:`X2015`, key = 'YearVar', value = "Percentage") %>%
separate(YearVar, into = c("Junk", "Year"), sep=1, convert=TRUE) %>%
select(-(Junk),-(Indicator.Code)) %>%
spread(Indicator.Name, Percentage) %>%
rename(CO2.Emissions = `CO2 emissions (metric tons per capita)`,
Exports = `Exports of goods and services (% of GDP)`,
Forest.Area = `Forest area (% of land area)`,
GDP.Growth = `GDP growth (annual %)`,
Imports = `Imports of goods and services (% of GDP)`,
Poverty = `Poverty headcount ratio at national poverty lines (% of population)`,
Unemployment = `Unemployment, total (% of total labor force) (national estimate)`,
Youth.Literacy = `Youth literacy rate, population 15-24 years, both sexes (%)`) %>%
select(-(Poverty),-(Youth.Literacy))
datatable(wdi, caption = 'FINAL DATASET IN TIDY FORMAT')
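In tidyr 1.0.0 and later, gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below shows how the same reshaping steps might look with those functions. It runs on a tiny made-up table (toy, with invented values and only two indicators) rather than on the real data, so the column selection and names are assumptions for illustration only.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for wdi_data: made-up values, two indicators, two years
toy <- tibble(
  Country.Name   = c("A", "A", "B", "B"),
  Country.Code   = c("AAA", "AAA", "BBB", "BBB"),
  Indicator.Name = c("GDP growth (annual %)", "Forest area (% of land area)",
                     "GDP growth (annual %)", "Forest area (% of land area)"),
  Indicator.Code = c("G", "F", "G", "F"),
  X1960 = c(1, 10, 2, 20),
  X1961 = c(3, 30, 4, 40)
)

toy_tidy <- toy %>%
  select(-Indicator.Code) %>%
  # Collapse the year columns into Year/Value pairs, dropping the "X" prefix
  pivot_longer(c(X1960, X1961), names_to = "Year",
               names_prefix = "X", values_to = "Value") %>%
  mutate(Year = as.integer(Year)) %>%
  # One column per indicator, one row per country and year
  pivot_wider(names_from = Indicator.Name, values_from = Value) %>%
  rename(GDP.Growth  = `GDP growth (annual %)`,
         Forest.Area = `Forest area (% of land area)`)

print(toy_tidy)
```

Here names_prefix removes the X in the same step, so no separate ‘Junk’ column is needed.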
# Generating summary statistics for inline code
min_year <- min(wdi$Year, na.rm = TRUE)
max_year <- max(wdi$Year, na.rm = TRUE)
mean_year <- mean(wdi$Year, na.rm = TRUE)
miss_year <- sum(is.na(wdi$Year))
min_CO2 <- min(wdi$CO2.Emissions, na.rm = TRUE)
max_CO2 <- max(wdi$CO2.Emissions, na.rm = TRUE)
mean_CO2 <- mean(wdi$CO2.Emissions, na.rm = TRUE)
miss_CO2 <- sum(is.na(wdi$CO2.Emissions))
min_exp <- min(wdi$Exports, na.rm = TRUE)
max_exp <- max(wdi$Exports, na.rm = TRUE)
mean_exp <- mean(wdi$Exports, na.rm = TRUE)
miss_exp <- sum(is.na(wdi$Exports))
min_for <- min(wdi$Forest.Area, na.rm = TRUE)
max_for <- max(wdi$Forest.Area, na.rm = TRUE)
mean_for <- mean(wdi$Forest.Area, na.rm = TRUE)
miss_for <- sum(is.na(wdi$Forest.Area))
min_gdp <- min(wdi$GDP.Growth, na.rm = TRUE)
max_gdp <- max(wdi$GDP.Growth, na.rm = TRUE)
mean_gdp <- mean(wdi$GDP.Growth, na.rm = TRUE)
miss_gdp <- sum(is.na(wdi$GDP.Growth))
min_imp <- min(wdi$Imports, na.rm = TRUE)
max_imp <- max(wdi$Imports, na.rm = TRUE)
mean_imp <- mean(wdi$Imports, na.rm = TRUE)
miss_imp <- sum(is.na(wdi$Imports))
min_ump <- min(wdi$Unemployment, na.rm = TRUE)
max_ump <- max(wdi$Unemployment, na.rm = TRUE)
mean_ump <- mean(wdi$Unemployment, na.rm = TRUE)
miss_ump <- sum(is.na(wdi$Unemployment))
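The same four statistics per column can also be computed in one pass with summarise() and across() (available in dplyr 1.0.0 and later). The following is a minimal sketch on a made-up two-column table; the name toy and its values are invented for illustration, and the long output layout is just one possible choice.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for two of the wdi columns (made-up values)
toy <- tibble(
  CO2.Emissions = c(1, NA, 3),
  GDP.Growth    = c(2, 4, NA)
)

toy_summary <- toy %>%
  # Apply each statistic to every column; results are named <column>_<statistic>
  summarise(across(everything(),
                   list(min  = ~ min(.x, na.rm = TRUE),
                        max  = ~ max(.x, na.rm = TRUE),
                        mean = ~ mean(.x, na.rm = TRUE),
                        miss = ~ sum(is.na(.x))))) %>%
  # Reshape to one row per variable and statistic
  pivot_longer(everything(),
               names_to = c("variable", "statistic"),
               names_sep = "_",
               values_to = "value")

print(toy_summary)
```

Splitting on the underscore works here because none of the column names contain one; a different separator would be needed otherwise.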
Since the final dataset has some new columns, I have provided a variable description and some summary statistics for those columns below.
Year
Description: Represents the year in which the observation is recorded
Minimum value = 1960
Maximum value = 2015
Average value = 1987.5
Missing values = 0
CO2.Emissions
Description: CO2 emissions in metric tons per capita.
Minimum value = -0.0203
Maximum value = 99.8
Average value = 4.2342192
Missing values = 2877
Exports
Description: Exports of goods and services as percentage of GDP.
Minimum value = 0.00538
Maximum value = 230
Average value = 33.3705208
Missing values = 4441
Forest.Area
Description: Forest area as percentage of land area.
Minimum value = 0
Maximum value = 98.9
Average value = 31.8434439
Missing values = 8152
GDP.Growth
Description: Annual growth rate of GDP calculated as percentage.
Minimum value = -64
Maximum value = 190
Average value = 3.9406544
Missing values = 3987
Imports
Description: Imports of goods and services as percentage of GDP.
Minimum value = 0
Maximum value = 425
Average value = 38.5534421
Missing values = 4441
Unemployment
Description: Unemployment as a percentage of total labor force
Minimum value = 0
Maximum value = 59.5
Average value = 8.7052664
Missing values = 10992
CO2 emission is an indicator of the sustainability of the development model of an economy. The lower the carbon footprint, the more sustainable the development model. Here, I used ggplot to compare the per capita CO2 emissions of the USA, India and China.
# CO2 emission comp
wdi %>%
# %in% tests membership; == would recycle the vector and silently drop rows
filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
ggplot() + geom_line(mapping = aes(Year, CO2.Emissions,
group=Country.Name, color=Country.Name)) +
labs(y = 'CO2 emissions (metric tons per capita)') +
ggtitle(paste('CO2 Emission comparison of US, India and China'))
As seen from the plot, per capita carbon dioxide emission is much higher in the US than in India and China. However, since the population of the US is not comparable to that of the other two nations, we cannot infer from this plot alone whether there is a significant difference in the total CO2 emissions of these countries.
Another interesting trend is that per capita CO2 emission is decreasing for the US and slightly increasing for India, while there is a sharp increase for China after the turn of the millennium. This suggests that developing countries still need to take the necessary steps and adopt sustainable development models.
GDP growth is an indicator of how fast a country is advancing economically. Here, I plotted the GDP growth indicator against Year for the US, India and China to see the historical trend of GDP growth in these economies.
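When filtering a table down to several country codes, the choice between == and %in% matters. The == operator compares element by element and recycles the shorter vector, so rows are kept only where the positions happen to line up, while %in% keeps every row whose code is in the set. The sketch below shows the difference on a made-up six-row table (toy is a hypothetical name, not part of the project data).

```r
library(dplyr)

# Toy table: two years for each of three countries
toy <- tibble(
  Country.Code = rep(c("CHN", "IND", "USA"), each = 2),
  Year         = rep(c(2000, 2001), times = 3)
)

# Element-wise comparison against a recycled vector: rows are lost
with_equals <- toy %>% filter(Country.Code == c("USA", "IND", "CHN"))

# Membership test: all rows for the three countries are kept
with_in <- toy %>% filter(Country.Code %in% c("USA", "IND", "CHN"))

print(nrow(with_equals))
print(nrow(with_in))
```

On real data the == form usually still returns some rows for each country, which makes the missing observations easy to overlook in a line plot.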
#GDP Growth comp
wdi %>%
filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
ggplot() + geom_line(mapping = aes(Year, GDP.Growth,
group=Country.Name, color=Country.Name)) +
labs(y = 'GDP growth (annual %)') +
ggtitle(paste('GDP growth comparison of US, India and China'))
As seen from the graph, India has the highest GDP growth among the three in recent years. The GDP growth of the US has been more or less constant since the 1980s. China has historically shown alternating rises and falls in its GDP growth, going below zero more than once.
Balance of trade is the difference between exports and imports in an economy. If exports are higher than imports, there is a positive balance of trade, and if exports are lower than imports, there is a negative balance of trade, or a trade deficit.
I introduced a new column to calculate the balance of trade for the US, India and China. Then I produced a table with all available balance of trade values for these three countries after 2010.
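As a small worked example of this definition, the sketch below computes the balance of trade for two fictional countries and labels the sign. The numbers and the column names Balance and Position are made up for illustration; both inputs are expressed as a percentage of GDP, so the result is as well.

```r
library(dplyr)

# Made-up exports and imports, both as % of GDP
toy <- tibble(
  Country.Name = c("A", "B"),
  Exports      = c(30, 20),
  Imports      = c(25, 28)
)

toy_bot <- toy %>%
  mutate(Balance  = Exports - Imports,  # positive: surplus, negative: deficit
         Position = if_else(Balance >= 0, "surplus", "deficit"))

print(toy_bot)
```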
Bot <- wdi %>%
filter(Country.Code %in% c('USA', 'IND', 'CHN'), Year > 2010 ) %>%
mutate( `Balance of trade` = Exports - Imports) %>%
select(Country.Name,`Balance of trade`, Year ) %>%
na.omit() %>%
arrange(desc(Year), desc(`Balance of trade`))
datatable(Bot)
I then plotted the historic trend of balance of trade for these countries.
#Trade deficit comp
wdi %>%
filter(Country.Code %in% c('USA', 'IND', 'CHN')) %>%
mutate( `Balance of trade (% of GDP)` = Exports - Imports) %>%
ggplot() + geom_line(mapping = aes(Year, `Balance of trade (% of GDP)`,
group=Country.Name, color=Country.Name)) +
ggtitle(paste('Balance of trade comparison of US, India and China'))
As seen from the plot, India and the US have a trade deficit, while China has had a positive balance of trade since the 1990s. The balance of trade of the US has been improving since the late 2000s. India has an alarming trade deficit which is increasing year after year.
Here, I created two tables of the five countries with the lowest and the highest unemployment in 2014.
#Unemployment comparison
low <- wdi %>%
filter(Year == 2014) %>%
select(Country.Name, Unemployment) %>%
na.omit() %>%
arrange(Unemployment) %>%
head(5)
high <- wdi %>%
filter(Year == 2014) %>%
select(Country.Name, Unemployment) %>%
na.omit() %>%
arrange(desc(Unemployment)) %>%
head(5)
datatable(low, caption = 'Top 5 countries with lowest unemployment in 2014')
datatable(high, caption = 'Top 5 countries with highest unemployment in 2014')
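The arrange() plus head() pattern can also be written with slice_min() and slice_max(), available in dplyr 1.0.0 and later. The sketch below uses a made-up four-country table (toy, with invented values) and picks the two lowest and two highest rates instead of five, purely to keep the example small.

```r
library(dplyr)

# Toy stand-in for the 2014 unemployment data (made-up values)
toy <- tibble(
  Country.Name = c("A", "B", "C", "D"),
  Year         = 2014,
  Unemployment = c(5.0, NA, 1.2, 20.3)
)

# Drop missing values once, then reuse the result for both tables
toy_2014 <- toy %>%
  filter(Year == 2014, !is.na(Unemployment)) %>%
  select(Country.Name, Unemployment)

lowest  <- toy_2014 %>% slice_min(Unemployment, n = 2, with_ties = FALSE)
highest <- toy_2014 %>% slice_max(Unemployment, n = 2, with_ties = FALSE)

print(lowest)
print(highest)
```

Setting with_ties = FALSE guarantees exactly n rows, which matches the behaviour of head().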
A comparative study of the performance of the USA, India and China based on CO2 emissions, GDP growth and balance of trade was performed using the exploratory data analysis functionality of the R tidyverse package. Most of the comparisons were made by observing the historical trends of various indicators for these three countries. China has a higher balance of trade and very high GDP growth, but its CO2 emission rate is also on the rise. It needs to adopt a more environmentally friendly development model. The USA has low and fairly constant GDP growth, indicating that the economy is more or less saturated. It has a trade deficit, which has been improving over the years. The US is also cutting down on its carbon emissions, possibly shifting towards an environmentally sustainable economy. India has a very high GDP growth rate, but its CO2 emissions and trade deficit are both increasing. It needs to adopt more economically and environmentally sustainable development models.
A brief study of unemployment across nations was also performed. Among the countries with available data, Belarus had the lowest unemployment and Macedonia the highest in 2014.