Project 2A - UN Data

We are looking at data collected by the United Nations in html tables found here.

The data consists of information about a country, starting with general information and containing indicators related to economics, social and environmental factors.

Our task is to select the information of interest from the document and transform it in order to facilitate analysis. We will focus our attention on economic data and we will build upon the dataset with data from other countries for a more interesting comparison.

Data Import

The data is originally found here here. The data was manually copied and pasted into an excel file which can be downloaded from this github page.

We start by reading in the .xlsx file (manually downloaded) from a github link and placed into the working directory.

Data Evaluation

Here is a quick look at the data. We find 3 tables of different indicators more or less formatted with 2 columns for years 2004, 2010, 2018. Year 2010 is used as a basis for CPI calculations.

Raw Data

Raw Data

Data Transformation

The dataset actually has variables as rows and values of variables as column headers. While not helpful visually in this case, the tidy format requires that each indicator be placed in a column with arbitrary observation indexes. Since our table contains more variables (indicators) than observations (value of that indicator for a particular year), we will select a few of the indicators to narrow our analysis.

First we slice the dataset into the sections of interest by index (assuming this is consistent across datasets for each countries) and create separate tables for further study. We also want to add a country column to be able to expand the analysis to more countries.

Economic Indicators
Country IndicatorName 2004 2010 2018
USA GDP: Gross domestic product (million current USD) 13 093 726 14 964 372 18 624 475d
USA GDP growth rate (annual %, const. 2010 prices) 3.3 2.5 1.5d
USA GDP per capita (current USD) 44 366.0 48 485.0 57 808.0d
USA Economy: Agriculturee,f (% of Gross Value Added) 1 1.1000000000000001 1d
USA Economy: Industrye,f (% of Gross Value Added) 21.5 20.2 19.2d
USA Economy: Services and other activitye,f (% of GVA) 77.5 78.8 79.9d
USA Employment in agricultureg (% of employed) 1.6 1.6 1.6
USA Employment in industryg (% of employed) 21 18.5 18.8
USA Employment in servicesg (% employed) 77.400000000000006 79.900000000000006 79.599999999999994
USA Unemployment rate (% of labour force) 5.0999999999999996 9.6 4.3g
USA Labour force participation rateg (female/male pop. %) 58.2 / 72.2 57.5 / 69.9 55.5 / 68.0
USA CPI: Consumer Price Indexh (2010=100) 90 100 112c
USA Agricultural production index (2004-2006=100) 100 106 117d
USA Index of industrial production (2005=100) 100 96i 108i,j
USA International trade: exportsk (million current USD) 904 339 1 278 099 1 546 069c
USA International trade: importsk (million current USD) 1 732 321 1 968 260 2 408 395c
USA International trade: balancek (million current USD)
  • 827 981
  • 690 161
  • 862 326c
USA Balance of payments, current account (million USD)
  • 745 246
  • 430 702
  • 466 248c

The two other tables are extracted.

Social Indicators
Country IndicatorName 2004 2010 2018
USA Population growth ratel (average annual %) 0.9 0.9 0.7b
USA Urban population (% of total population) 79.900000000000006 80.8 82.3
USA Urban population growth ratel (average annual %) 1.1000000000000001 1.1000000000000001 0.9b
USA Fertility rate, totall (live births per woman) 2 2 1.9b
USA Life expectancy at birthl (females/males, years) 79.7 / 74.5 80.6 / 75.6 81.2 / 76.5b
USA Population age distribution (0-14/60+ years old, %) 20.9 / 16.7 20.2 / 18.4 18.8 / 22.0a
USA International migrant stock (000/% of total pop.) 39 258.3 / 13.3 44 183.6 / 14.3 49 777.0 / 15.3c
USA Refugees and others of concern to UNHCR 000 549.2m 270.9m 971.5n,c
USA Infant mortality ratel (per 1 000 live births) 7 6.8 6b
USA Health: Current expenditure (% of GDP) 14.5 16.399999999999999 16.8b
USA Health: Physicians (per 1 000 pop.) 2.7o 2.4 2.6j
USA Education: Government expenditure (% of GDP) 5.4 5j
USA Education: Primary gross enrol. ratio (f/m per 100 pop.) 98.1 / 99.2 99.2 / 100.1 99.4 / 99.2b
USA Education: Secondary gross enrol. ratio (f/m per 100 pop.) 95.5 / 93.6 93.6 / 92.5 97.7 / 96.7b
USA Intentional homicide rate (per 100 000 pop.) 5.7 4.8 5.4d
USA Seats held by women in National Parliaments (%) 14.9 16.8 19.5
Environment and Infrastructure Indicators
Country IndicatorName 2004 2010 2018
USA Individuals using the Internet (per 100 inhabitants) 68g 71.7p 76.2g,d
USA Research & Development expenditureq (% of GDP) 2.5 2.7 2.8r,b
USA Threatened species (number) 1 143o 1 152 1 513c
USA Forested area (% of land area) 33.299999999999997 33.700000000000003 33.9g,b
USA CO2 emission estimatess (million tons/tons per capita) 5 789.7 / 19.3 5 395.5 / 17.2 5 254.3 / 16.2j
USA Energy production, primaryt (Petajoules) 68 124 71 882 84 007b
USA Energy supply per capitat (Gigajoules) 325 297 282b
USA Tourist/visitor arrivals at national borders 000 49 206 60 010 75 608u,d
USA Pop. using improved drinking water (urban/rural, %) 99.5 / 96.7 99.4 / 97.6 99.4 / 98.2b
USA Pop. using improved sanitation facilities (urban/rural, %) 99.9 / 99.6 100.0 / 99.9 100.0 / 100.0b
USA Net Official Development Assist. disbursedv (% of GNI) 0.23 0.2 0.18r,c

We notice that the values have occurences of lowercase letters. These are annotations/notes from the original dataset which we will ignore. These letters and formatting inconsistencies will be dealt with shortly.

From here onwards, only the Economic Indicator table will be tidied and shown since it will be the focus of our analysis.

Let’s pick out indicators of interest: indicators related to GDP, unemployment and inflation (CPI).

Selected Indicators
Country IndicatorName 2004 2010 2018
USA GDP: Gross domestic product (million current USD) 13 093 726 14 964 372 18 624 475d
USA GDP growth rate (annual %, const. 2010 prices) 3.3 2.5 1.5d
USA GDP per capita (current USD) 44 366.0 48 485.0 57 808.0d
USA Unemployment rate (% of labour force) 5.0999999999999996 9.6 4.3g
USA CPI: Consumer Price Indexh (2010=100) 90 100 112c

To make the table in a tidy format, we must gather the years in the column headers which are actually values of a variable Year. We also realize that the indicators are actually variables and the years are the observations. To address this, we spread the indicators as columns.

Standardize the column headers

Clean the values with a helper function

Tidy Economic Indicator Table
Country Year CPI: Consumer Price Index (2010=100) GDP growth rate (annual %, const. 2010 prices) GDP per capita (current USD) GDP: Gross domestic product (million current USD) Unemployment rate (% of labour force)
USA 2004 90 3.3 44366 13093726 5.1
USA 2010 100 2.5 48485 14964372 9.6
USA 2018 112 1.5 57808 18624475 4.3

Viewed in this manner, the dataset is rather small. To make it more interesting, we build upon with data from other countries. We encapsulate all the steps above in the function split_clean() (not shown).

We download data from other countries to supplement our analysis. All additional files are found on Github in the form “unorg_countrydata.xlsx”

We then combine the country datasets.

Economic Indicators
Country Year CPI: Consumer Price Index (2010=100) GDP growth rate (annual %, const. 2010 prices) GDP per capita (current USD) GDP: Gross domestic product (million current USD) Unemployment rate (% of labour force)
USA 2004 90 3.3 44366 13093726 5.1
USA 2010 100 2.5 48485 14964372 9.6
USA 2018 112 1.5 57808 18624475 4.3
China 2004 NA 11.4 1747 2308800 4.1
China 2010 100 10.6 4461 6066351 4.2
China 2018 104 7.3 7993 11218281 4.7
Japan 2004 100 1.7 37054 4755410 4.4
Japan 2010 100 4.2 44341 5700098 5.1
Japan 2018 104 1.0 38640 4936212 2.6
Germany 2004 92 0.7 35035 2861339 11.2
Germany 2010 100 4.1 42241 3417095 7.0
Germany 2018 109 1.9 42456 3477796 3.6

We can now once more gather the indicators into a single column to have the data in long format which will help with analysis and graphics. We rename the columns for ease of reference.

Let’s take a look at 15 random oberservations from this table. We see how the format above is more intelligible.

Long Format Dataset
Country Year Indicator IndicatorValue
China 2018 GDPCapita 7993.0
USA 2018 GDPGrowth 1.5
Germany 2018 GDPCapita 42456.0
Japan 2018 GDPGrowth 1.0
China 2010 GDP 6066351.0
USA 2010 GDPGrowth 2.5
USA 2004 GDP 13093726.0
Germany 2004 GDP 2861339.0
China 2010 GDPGrowth 10.6
Germany 2018 GDPGrowth 1.9
Japan 2010 GDPCapita 44341.0
Japan 2010 CPI 100.0
Japan 2004 UnEmp 4.4
Germany 2004 UnEmp 11.2
USA 2004 CPI 90.0

Data Analysis & Visualization

Here we take a closer look at the GDPGrowth data where we observe declining growth from all of these major world economies.

We then expand our outlook to more economic indicators. Note that, China’s CPI value is missing for 2004 and coincides with Japan’s line after 2010.

The plots above reveal some long term trends. In some cases like GDPGrowth or UnEmp, we can see the high impact of the financial crisis on a macro level if we place the event around the 2010 mark. More data granularity might be interesting here. We can see China’s high growth as a developing country and its decline like the other countries. Its large population accounts for the lower GDP per Capita.

This visualization also allows us the test the validity of the Phillips Curve which relates inflation (CPI: Consumer Price Index) and the rate of unemployment in an inverse relationship. The relationship appears true except in the case of the United States during the years 2004-2010, unemployment rose greatly even though inflation remained nearly steady. This would need to be confirmed with more data points for the years between 2004 and 2018. With significantly more data going back even further, we have the potential to spot some significant macro trends.

Conclusion

While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset originally containing 4 tables and varying format
- Tables with year values in column headers
- Inconsistent column headers across multiple country data
- Values with annotations affixed to the data
- Long column names which make referecing more tedious

We also encapsulated the data wrangling steps shown above in a function split_clean() for re-use with data from other countries. This function can be built upon or modified to tidy the tables of the other indicators, social and environmental & infrastructure. Combining this with the general info table would allow for much more extensive future analysis.