We are looking at data collected by the United Nations in html tables found here.
The data consists of information about a country, starting with general information and containing indicators related to economics, social and environmental factors.
Our task is to select the information of interest from the document and transform it in order to facilitate analysis. We will focus our attention on economic data and we will build upon the dataset with data from other countries for a more interesting comparison.
Load required libraries
library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)A helper function for displaying tables
showtable <- function(data, title) {
#kable(data, format = "latex", caption = title) %>%
# kable_styling(bootstrap_options = c("striped", "hover", "condensed"), latex_options = "scale_down")
kable(data, caption = title) %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed"), latex_options = "scale_down")
}The data is originally found here here. The data was manually copied and pasted into an excel file which can be downloaded from this github page.
We start by reading in the .xlsx file (manually downloaded) from a github link and placed into the working directory.
Here is a quick look at the data. We find 3 tables of different indicators more or less formatted with 2 columns for years 2004, 2010, 2018. Year 2010 is used as a basis for CPI calculations.
Raw Data
The dataset actually has variables as rows and values of variables as column headers. While not helpful visually in this case, the tidy format requires that each indicator be placed in a column with arbitrary observation indexes. Since our table contains more variables (indicators) than observations (value of that indicator for a particular year), we will select a few of the indicators to narrow our analysis.
First we slice the dataset into the sections of interest by index (assuming this is consistent across datasets for each countries) and create separate tables for further study. We also want to add a country column to be able to expand the analysis to more countries.
country <- "USA"
# a step to simplify our downstream spread and gather operations
# we also add a new column called Country and place it at the first column position
untidy <- untidy %>%
rename("IndicatorName" = Region, "2004" = ...2, "2010" = `Northern America`, "2018" = ...4, ) %>%
mutate(Country = country) %>%
select(Country, 1:4, everything())
# slice the relevant portion of the dataset by index, assuming this is consistent
# rename the columns for clarity
clean_geninfo <- untidy %>% slice(1:8) %>%
select(IndicatorName, "2010") %>%
rename("Value" = "2010")
# slice into different types of indicator tables
clean_econ <- untidy %>% slice(11:28)
clean_social <- untidy %>% slice(31:46)
clean_envinf <- untidy %>% slice(49:60)
clean_econ$IndicatorName <- str_replace_all(clean_econ$IndicatorName, "US\\$", replacement = "USD")| Country | IndicatorName | 2004 | 2010 | 2018 |
|---|---|---|---|---|
| USA | GDP: Gross domestic product (million current USD) | 13 093 726 | 14 964 372 | 18 624 475d |
| USA | GDP growth rate (annual %, const. 2010 prices) | 3.3 | 2.5 | 1.5d |
| USA | GDP per capita (current USD) | 44 366.0 | 48 485.0 | 57 808.0d |
| USA | Economy: Agriculturee,f (% of Gross Value Added) | 1 | 1.1000000000000001 | 1d |
| USA | Economy: Industrye,f (% of Gross Value Added) | 21.5 | 20.2 | 19.2d |
| USA | Economy: Services and other activitye,f (% of GVA) | 77.5 | 78.8 | 79.9d |
| USA | Employment in agricultureg (% of employed) | 1.6 | 1.6 | 1.6 |
| USA | Employment in industryg (% of employed) | 21 | 18.5 | 18.8 |
| USA | Employment in servicesg (% employed) | 77.400000000000006 | 79.900000000000006 | 79.599999999999994 |
| USA | Unemployment rate (% of labour force) | 5.0999999999999996 | 9.6 | 4.3g |
| USA | Labour force participation rateg (female/male pop. %) | 58.2 / 72.2 | 57.5 / 69.9 | 55.5 / 68.0 |
| USA | CPI: Consumer Price Indexh (2010=100) | 90 | 100 | 112c |
| USA | Agricultural production index (2004-2006=100) | 100 | 106 | 117d |
| USA | Index of industrial production (2005=100) | 100 | 96i | 108i,j |
| USA | International trade: exportsk (million current USD) | 904 339 | 1 278 099 | 1 546 069c |
| USA | International trade: importsk (million current USD) | 1 732 321 | 1 968 260 | 2 408 395c |
| USA | International trade: balancek (million current USD) |
|
|
|
| USA | Balance of payments, current account (million USD) |
|
|
|
The two other tables are extracted.
| Country | IndicatorName | 2004 | 2010 | 2018 |
|---|---|---|---|---|
| USA | Population growth ratel (average annual %) | 0.9 | 0.9 | 0.7b |
| USA | Urban population (% of total population) | 79.900000000000006 | 80.8 | 82.3 |
| USA | Urban population growth ratel (average annual %) | 1.1000000000000001 | 1.1000000000000001 | 0.9b |
| USA | Fertility rate, totall (live births per woman) | 2 | 2 | 1.9b |
| USA | Life expectancy at birthl (females/males, years) | 79.7 / 74.5 | 80.6 / 75.6 | 81.2 / 76.5b |
| USA | Population age distribution (0-14/60+ years old, %) | 20.9 / 16.7 | 20.2 / 18.4 | 18.8 / 22.0a |
| USA | International migrant stock (000/% of total pop.) | 39 258.3 / 13.3 | 44 183.6 / 14.3 | 49 777.0 / 15.3c |
| USA | Refugees and others of concern to UNHCR 000 | 549.2m | 270.9m | 971.5n,c |
| USA | Infant mortality ratel (per 1 000 live births) | 7 | 6.8 | 6b |
| USA | Health: Current expenditure (% of GDP) | 14.5 | 16.399999999999999 | 16.8b |
| USA | Health: Physicians (per 1 000 pop.) | 2.7o | 2.4 | 2.6j |
| USA | Education: Government expenditure (% of GDP) | … | 5.4 | 5j |
| USA | Education: Primary gross enrol. ratio (f/m per 100 pop.) | 98.1 / 99.2 | 99.2 / 100.1 | 99.4 / 99.2b |
| USA | Education: Secondary gross enrol. ratio (f/m per 100 pop.) | 95.5 / 93.6 | 93.6 / 92.5 | 97.7 / 96.7b |
| USA | Intentional homicide rate (per 100 000 pop.) | 5.7 | 4.8 | 5.4d |
| USA | Seats held by women in National Parliaments (%) | 14.9 | 16.8 | 19.5 |
| Country | IndicatorName | 2004 | 2010 | 2018 |
|---|---|---|---|---|
| USA | Individuals using the Internet (per 100 inhabitants) | 68g | 71.7p | 76.2g,d |
| USA | Research & Development expenditureq (% of GDP) | 2.5 | 2.7 | 2.8r,b |
| USA | Threatened species (number) | 1 143o | 1 152 | 1 513c |
| USA | Forested area (% of land area) | 33.299999999999997 | 33.700000000000003 | 33.9g,b |
| USA | CO2 emission estimatess (million tons/tons per capita) | 5 789.7 / 19.3 | 5 395.5 / 17.2 | 5 254.3 / 16.2j |
| USA | Energy production, primaryt (Petajoules) | 68 124 | 71 882 | 84 007b |
| USA | Energy supply per capitat (Gigajoules) | 325 | 297 | 282b |
| USA | Tourist/visitor arrivals at national borders 000 | 49 206 | 60 010 | 75 608u,d |
| USA | Pop. using improved drinking water (urban/rural, %) | 99.5 / 96.7 | 99.4 / 97.6 | 99.4 / 98.2b |
| USA | Pop. using improved sanitation facilities (urban/rural, %) | 99.9 / 99.6 | 100.0 / 99.9 | 100.0 / 100.0b |
| USA | Net Official Development Assist. disbursedv (% of GNI) | 0.23 | 0.2 | 0.18r,c |
We notice that the values have occurences of lowercase letters. These are annotations/notes from the original dataset which we will ignore. These letters and formatting inconsistencies will be dealt with shortly.
From here onwards, only the Economic Indicator table will be tidied and shown since it will be the focus of our analysis.
Let’s pick out indicators of interest: indicators related to GDP, unemployment and inflation (CPI).
cleaner_econ <- clean_econ %>% filter(grepl("GDP|Unemployment|CPI", IndicatorName))
showtable(cleaner_econ, "Selected Indicators")| Country | IndicatorName | 2004 | 2010 | 2018 |
|---|---|---|---|---|
| USA | GDP: Gross domestic product (million current USD) | 13 093 726 | 14 964 372 | 18 624 475d |
| USA | GDP growth rate (annual %, const. 2010 prices) | 3.3 | 2.5 | 1.5d |
| USA | GDP per capita (current USD) | 44 366.0 | 48 485.0 | 57 808.0d |
| USA | Unemployment rate (% of labour force) | 5.0999999999999996 | 9.6 | 4.3g |
| USA | CPI: Consumer Price Indexh (2010=100) | 90 | 100 | 112c |
To make the table in a tidy format, we must gather the years in the column headers which are actually values of a variable Year. We also realize that the indicators are actually variables and the years are the observations. To address this, we spread the indicators as columns.
tidy_econ <- cleaner_econ %>%
gather("2004":"2018", key = Year, value = IndicatorValue) %>%
spread(IndicatorName, IndicatorValue)Standardize the column headers
# store the column names, remove the annotation h from the CPI column
headers <- colnames(tidy_econ)
headers <- gsub("Indexh", "Index", headers)
names(tidy_econ) <- headersClean the values with a helper function
# a function to clean the values by elimanting the annotations (lowercase letters) and spaces
clean_values <- function(data) {
clean <- apply(data, 2, function(x) gsub("[a-z],?", "", x))
clean <- apply(clean, 2, function(x) gsub(" ", "", x))
clean <- apply(clean, 2, function(x) as.numeric(x))
return(clean)
}
# clean the values
tidy_econ[3:7] <- clean_values(tidy_econ[3:7])| Country | Year | CPI: Consumer Price Index (2010=100) | GDP growth rate (annual %, const. 2010 prices) | GDP per capita (current USD) | GDP: Gross domestic product (million current USD) | Unemployment rate (% of labour force) |
|---|---|---|---|---|---|---|
| USA | 2004 | 90 | 3.3 | 44366 | 13093726 | 5.1 |
| USA | 2010 | 100 | 2.5 | 48485 | 14964372 | 9.6 |
| USA | 2018 | 112 | 1.5 | 57808 | 18624475 | 4.3 |
Viewed in this manner, the dataset is rather small. To make it more interesting, we build upon with data from other countries. We encapsulate all the steps above in the function split_clean() (not shown).
We download data from other countries to supplement our analysis. All additional files are found on Github in the form “unorg_countrydata.xlsx”
usa_econ <- split_clean("unorg_usdata.xlsx", "USA")
china_econ <- split_clean("unorg_chinadata.xlsx", "China")
japan_econ <- split_clean("unorg_japandata.xlsx", "Japan")
germany_econ <- split_clean("unorg_germanydata.xlsx", "Germany")We then combine the country datasets.
country_econ <- rbind(usa_econ, china_econ, japan_econ, germany_econ)
showtable(country_econ, "Economic Indicators")| Country | Year | CPI: Consumer Price Index (2010=100) | GDP growth rate (annual %, const. 2010 prices) | GDP per capita (current USD) | GDP: Gross domestic product (million current USD) | Unemployment rate (% of labour force) |
|---|---|---|---|---|---|---|
| USA | 2004 | 90 | 3.3 | 44366 | 13093726 | 5.1 |
| USA | 2010 | 100 | 2.5 | 48485 | 14964372 | 9.6 |
| USA | 2018 | 112 | 1.5 | 57808 | 18624475 | 4.3 |
| China | 2004 | NA | 11.4 | 1747 | 2308800 | 4.1 |
| China | 2010 | 100 | 10.6 | 4461 | 6066351 | 4.2 |
| China | 2018 | 104 | 7.3 | 7993 | 11218281 | 4.7 |
| Japan | 2004 | 100 | 1.7 | 37054 | 4755410 | 4.4 |
| Japan | 2010 | 100 | 4.2 | 44341 | 5700098 | 5.1 |
| Japan | 2018 | 104 | 1.0 | 38640 | 4936212 | 2.6 |
| Germany | 2004 | 92 | 0.7 | 35035 | 2861339 | 11.2 |
| Germany | 2010 | 100 | 4.1 | 42241 | 3417095 | 7.0 |
| Germany | 2018 | 109 | 1.9 | 42456 | 3477796 | 3.6 |
We can now once more gather the indicators into a single column to have the data in long format which will help with analysis and graphics. We rename the columns for ease of reference.
names(country_econ) <- c("Country", "Year", "CPI", "GDPGrowth", "GDPCapita", "GDP", "UnEmp")
country_econ <- country_econ %>%
gather("CPI":"UnEmp", key = Indicator, value = IndicatorValue)Let’s take a look at 15 random oberservations from this table. We see how the format above is more intelligible.
| Country | Year | Indicator | IndicatorValue |
|---|---|---|---|
| China | 2018 | GDPCapita | 7993.0 |
| USA | 2018 | GDPGrowth | 1.5 |
| Germany | 2018 | GDPCapita | 42456.0 |
| Japan | 2018 | GDPGrowth | 1.0 |
| China | 2010 | GDP | 6066351.0 |
| USA | 2010 | GDPGrowth | 2.5 |
| USA | 2004 | GDP | 13093726.0 |
| Germany | 2004 | GDP | 2861339.0 |
| China | 2010 | GDPGrowth | 10.6 |
| Germany | 2018 | GDPGrowth | 1.9 |
| Japan | 2010 | GDPCapita | 44341.0 |
| Japan | 2010 | CPI | 100.0 |
| Japan | 2004 | UnEmp | 4.4 |
| Germany | 2004 | UnEmp | 11.2 |
| USA | 2004 | CPI | 90.0 |
Here we take a closer look at the GDPGrowth data where we observe declining growth from all of these major world economies.
data <- country_econ %>% filter(Indicator == "GDPGrowth")
ggplot(data, aes(x=Year, y=IndicatorValue, group=Country)) +
geom_line(linetype="dashed", aes(color=Country)) +
geom_point(aes(color=Country)) +
theme(legend.position="top") +
ylab("GDP Growth %") +
ggtitle("GDP Growth")We then expand our outlook to more economic indicators. Note that, China’s CPI value is missing for 2004 and coincides with Japan’s line after 2010.
ggplot(country_econ, aes(x=Year, y=IndicatorValue, group=Country)) +
geom_line(linetype="dashed", aes(color=Country)) +
geom_point(aes(color=Country)) +
theme(legend.position="top") +
facet_wrap(~Indicator, scales="free_y") +
ggtitle("Economic Indicators")The plots above reveal some long term trends. In some cases like GDPGrowth or UnEmp, we can see the high impact of the financial crisis on a macro level if we place the event around the 2010 mark. More data granularity might be interesting here. We can see China’s high growth as a developing country and its decline like the other countries. Its large population accounts for the lower GDP per Capita.
This visualization also allows us the test the validity of the Phillips Curve which relates inflation (CPI: Consumer Price Index) and the rate of unemployment in an inverse relationship. The relationship appears true except in the case of the United States during the years 2004-2010, unemployment rose greatly even though inflation remained nearly steady. This would need to be confirmed with more data points for the years between 2004 and 2018. With significantly more data going back even further, we have the potential to spot some significant macro trends.
While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset originally containing 4 tables and varying format
- Tables with year values in column headers
- Inconsistent column headers across multiple country data
- Values with annotations affixed to the data
- Long column names which make referecing more tedious
We also encapsulated the data wrangling steps shown above in a function split_clean() for re-use with data from other countries. This function can be built upon or modified to tidy the tables of the other indicators, social and environmental & infrastructure. Combining this with the general info table would allow for much more extensive future analysis.