data607-illien-project2a

Libraries

Load required libraries

library(tidyverse) #readr, dplyr, tidyr, stringr, tibble, ggplot2
library(knitr)
library(scales)
library(kableExtra)
library(readxl)

A helper function for displaying tables

showtable <- function(data, title) {
  #kable(data, format = "latex", caption = title) %>%
  # kable_styling(bootstrap_options = c("striped", "hover", "condensed"), latex_options = "scale_down")
  kable(data, caption = title) %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed"), latex_options = "scale_down")
}

Data Import

The data is originally found here here. The data was manually copied and pasted into an excel file which can be downloaded from this github page.

We start by reading in the .xlsx file (manually downloaded) from a github link and placed into the working directory.

untidy <- read_xlsx("unorg_usdata.xlsx")

Data Evaluation

Here is a quick look at the data. We find 3 tables of different indicators more or less formatted with 2 columns for years 2004, 2010, 2018. Year 2010 is used as a basis for CPI calculations.

Raw Data

Data Transformation

The dataset actually has variables as rows and values of variables as column headers. While not helpful visually in this case, the tidy format requires that each indicator be placed in a column with arbitrary observation indexes. Since our table contains more variables (indicators) than observations (value of that indicator for a particular year), we will select a few of the indicators to narrow our analysis.

First we slice the dataset into the sections of interest by index (assuming this is consistent across datasets for each countries) and create separate tables for further study. We also want to add a country column to be able to expand the analysis to more countries.

country <- "USA"

# a step to simplify our downstream spread and gather operations
# we also add a new column called Country and place it at the first column position
untidy <- untidy %>% 
  rename("IndicatorName" = Region, "2004" = ...2, "2010" = `Northern America`, "2018" = ...4, ) %>%
  mutate(Country = country) %>%
  select(Country, 1:4, everything())

# slice the relevant portion of the dataset by index, assuming this is consistent
# rename the columns for clarity
clean_geninfo <- untidy %>% slice(1:8) %>% 
  select(IndicatorName, "2010") %>% 
  rename("Value" = "2010")

# slice into different types of indicator tables
clean_econ <- untidy %>% slice(11:28)
clean_social <- untidy %>% slice(31:46)
clean_envinf <- untidy %>% slice(49:60)

clean_econ$IndicatorName <- str_replace_all(clean_econ$IndicatorName, "US\\$", replacement = "USD")

showtable(clean_econ, "Economic Indicators")

Economic Indicators
Country	IndicatorName	2004	2010	2018
USA	GDP: Gross domestic product (million current USD)	13 093 726	14 964 372	18 624 475d
USA	GDP growth rate (annual %, const. 2010 prices)	3.3	2.5	1.5d
USA	GDP per capita (current USD)	44 366.0	48 485.0	57 808.0d
USA	Economy: Agriculturee,f (% of Gross Value Added)	1	1.1000000000000001	1d
USA	Economy: Industrye,f (% of Gross Value Added)	21.5	20.2	19.2d
USA	Economy: Services and other activitye,f (% of GVA)	77.5	78.8	79.9d
USA	Employment in agricultureg (% of employed)	1.6	1.6	1.6
USA	Employment in industryg (% of employed)	21	18.5	18.8
USA	Employment in servicesg (% employed)	77.400000000000006	79.900000000000006	79.599999999999994
USA	Unemployment rate (% of labour force)	5.0999999999999996	9.6	4.3g
USA	Labour force participation rateg (female/male pop. %)	58.2 / 72.2	57.5 / 69.9	55.5 / 68.0
USA	CPI: Consumer Price Indexh (2010=100)	90	100	112c
USA	Agricultural production index (2004-2006=100)	100	106	117d
USA	Index of industrial production (2005=100)	100	96i	108i,j
USA	International trade: exportsk (million current USD)	904 339	1 278 099	1 546 069c
USA	International trade: importsk (million current USD)	1 732 321	1 968 260	2 408 395c
USA	International trade: balancek (million current USD)	827 981	690 161	862 326c
USA	Balance of payments, current account (million USD)	745 246	430 702	466 248c

The two other tables are extracted.

showtable(clean_social, "Social Indicators")

Social Indicators
Country	IndicatorName	2004	2010	2018
USA	Population growth ratel (average annual %)	0.9	0.9	0.7b
USA	Urban population (% of total population)	79.900000000000006	80.8	82.3
USA	Urban population growth ratel (average annual %)	1.1000000000000001	1.1000000000000001	0.9b
USA	Fertility rate, totall (live births per woman)	2	2	1.9b
USA	Life expectancy at birthl (females/males, years)	79.7 / 74.5	80.6 / 75.6	81.2 / 76.5b
USA	Population age distribution (0-14/60+ years old, %)	20.9 / 16.7	20.2 / 18.4	18.8 / 22.0a
USA	International migrant stock (000/% of total pop.)	39 258.3 / 13.3	44 183.6 / 14.3	49 777.0 / 15.3c
USA	Refugees and others of concern to UNHCR 000	549.2m	270.9m	971.5n,c
USA	Infant mortality ratel (per 1 000 live births)	7	6.8	6b
USA	Health: Current expenditure (% of GDP)	14.5	16.399999999999999	16.8b
USA	Health: Physicians (per 1 000 pop.)	2.7o	2.4	2.6j
USA	Education: Government expenditure (% of GDP)	…	5.4	5j
USA	Education: Primary gross enrol. ratio (f/m per 100 pop.)	98.1 / 99.2	99.2 / 100.1	99.4 / 99.2b
USA	Education: Secondary gross enrol. ratio (f/m per 100 pop.)	95.5 / 93.6	93.6 / 92.5	97.7 / 96.7b
USA	Intentional homicide rate (per 100 000 pop.)	5.7	4.8	5.4d
USA	Seats held by women in National Parliaments (%)	14.9	16.8	19.5

showtable(clean_envinf, "Environment and Infrastructure Indicators")

Environment and Infrastructure Indicators
Country	IndicatorName	2004	2010	2018
USA	Individuals using the Internet (per 100 inhabitants)	68g	71.7p	76.2g,d
USA	Research & Development expenditureq (% of GDP)	2.5	2.7	2.8r,b
USA	Threatened species (number)	1 143o	1 152	1 513c
USA	Forested area (% of land area)	33.299999999999997	33.700000000000003	33.9g,b
USA	CO2 emission estimatess (million tons/tons per capita)	5 789.7 / 19.3	5 395.5 / 17.2	5 254.3 / 16.2j
USA	Energy production, primaryt (Petajoules)	68 124	71 882	84 007b
USA	Energy supply per capitat (Gigajoules)	325	297	282b
USA	Tourist/visitor arrivals at national borders 000	49 206	60 010	75 608u,d
USA	Pop. using improved drinking water (urban/rural, %)	99.5 / 96.7	99.4 / 97.6	99.4 / 98.2b
USA	Pop. using improved sanitation facilities (urban/rural, %)	99.9 / 99.6	100.0 / 99.9	100.0 / 100.0b
USA	Net Official Development Assist. disbursedv (% of GNI)	0.23	0.2	0.18r,c

We notice that the values have occurences of lowercase letters. These are annotations/notes from the original dataset which we will ignore. These letters and formatting inconsistencies will be dealt with shortly.

From here onwards, only the Economic Indicator table will be tidied and shown since it will be the focus of our analysis.

Let’s pick out indicators of interest: indicators related to GDP, unemployment and inflation (CPI).

cleaner_econ <- clean_econ %>% filter(grepl("GDP|Unemployment|CPI", IndicatorName)) 
showtable(cleaner_econ, "Selected Indicators")

Selected Indicators
Country	IndicatorName	2004	2010	2018
USA	GDP: Gross domestic product (million current USD)	13 093 726	14 964 372	18 624 475d
USA	GDP growth rate (annual %, const. 2010 prices)	3.3	2.5	1.5d
USA	GDP per capita (current USD)	44 366.0	48 485.0	57 808.0d
USA	Unemployment rate (% of labour force)	5.0999999999999996	9.6	4.3g
USA	CPI: Consumer Price Indexh (2010=100)	90	100	112c

To make the table in a tidy format, we must gather the years in the column headers which are actually values of a variable Year. We also realize that the indicators are actually variables and the years are the observations. To address this, we spread the indicators as columns.

tidy_econ <- cleaner_econ %>% 
  gather("2004":"2018", key = Year, value = IndicatorValue) %>% 
  spread(IndicatorName, IndicatorValue)

Standardize the column headers

# store the column names, remove the annotation h from the CPI column
headers <- colnames(tidy_econ)
headers <- gsub("Indexh", "Index", headers)
names(tidy_econ) <- headers

Clean the values with a helper function

# a function to clean the values by elimanting the annotations (lowercase letters) and spaces
clean_values <- function(data) {
  clean <- apply(data, 2, function(x) gsub("[a-z],?", "", x))
  clean <- apply(clean, 2, function(x) gsub(" ", "", x))
  clean <- apply(clean, 2, function(x) as.numeric(x))
  return(clean)
}

# clean the values
tidy_econ[3:7] <- clean_values(tidy_econ[3:7])

showtable(tidy_econ, "Tidy Economic Indicator Table")

Tidy Economic Indicator Table
Country	Year	CPI: Consumer Price Index (2010=100)	GDP growth rate (annual %, const. 2010 prices)	GDP per capita (current USD)	GDP: Gross domestic product (million current USD)	Unemployment rate (% of labour force)
USA	2004	90	3.3	44366	13093726	5.1
USA	2010	100	2.5	48485	14964372	9.6
USA	2018	112	1.5	57808	18624475	4.3

Viewed in this manner, the dataset is rather small. To make it more interesting, we build upon with data from other countries. We encapsulate all the steps above in the function split_clean() (not shown).

We download data from other countries to supplement our analysis. All additional files are found on Github in the form “unorg_countrydata.xlsx”

usa_econ <- split_clean("unorg_usdata.xlsx", "USA")
china_econ <- split_clean("unorg_chinadata.xlsx", "China")
japan_econ <- split_clean("unorg_japandata.xlsx", "Japan")
germany_econ <- split_clean("unorg_germanydata.xlsx", "Germany")

We then combine the country datasets.

country_econ <- rbind(usa_econ, china_econ, japan_econ, germany_econ)
showtable(country_econ, "Economic Indicators")

Economic Indicators
Country	Year	CPI: Consumer Price Index (2010=100)	GDP growth rate (annual %, const. 2010 prices)	GDP per capita (current USD)	GDP: Gross domestic product (million current USD)	Unemployment rate (% of labour force)
USA	2004	90	3.3	44366	13093726	5.1
USA	2010	100	2.5	48485	14964372	9.6
USA	2018	112	1.5	57808	18624475	4.3
China	2004	NA	11.4	1747	2308800	4.1
China	2010	100	10.6	4461	6066351	4.2
China	2018	104	7.3	7993	11218281	4.7
Japan	2004	100	1.7	37054	4755410	4.4
Japan	2010	100	4.2	44341	5700098	5.1
Japan	2018	104	1.0	38640	4936212	2.6
Germany	2004	92	0.7	35035	2861339	11.2
Germany	2010	100	4.1	42241	3417095	7.0
Germany	2018	109	1.9	42456	3477796	3.6

We can now once more gather the indicators into a single column to have the data in long format which will help with analysis and graphics. We rename the columns for ease of reference.

names(country_econ) <- c("Country", "Year", "CPI", "GDPGrowth", "GDPCapita", "GDP", "UnEmp")

country_econ <- country_econ %>% 
  gather("CPI":"UnEmp", key = Indicator, value = IndicatorValue)

Let’s take a look at 15 random oberservations from this table. We see how the format above is more intelligible.

showtable(sample_n(country_econ, 15), "Long Format Dataset")

Long Format Dataset
Country	Year	Indicator	IndicatorValue
China	2018	GDPCapita	7993.0
USA	2018	GDPGrowth	1.5
Germany	2018	GDPCapita	42456.0
Japan	2018	GDPGrowth	1.0
China	2010	GDP	6066351.0
USA	2010	GDPGrowth	2.5
USA	2004	GDP	13093726.0
Germany	2004	GDP	2861339.0
China	2010	GDPGrowth	10.6
Germany	2018	GDPGrowth	1.9
Japan	2010	GDPCapita	44341.0
Japan	2010	CPI	100.0
Japan	2004	UnEmp	4.4
Germany	2004	UnEmp	11.2
USA	2004	CPI	90.0

Data Analysis & Visualization

Here we take a closer look at the GDPGrowth data where we observe declining growth from all of these major world economies.

data <- country_econ %>% filter(Indicator == "GDPGrowth")

ggplot(data, aes(x=Year, y=IndicatorValue, group=Country)) +
  geom_line(linetype="dashed", aes(color=Country)) +
  geom_point(aes(color=Country)) +
  theme(legend.position="top") +
  ylab("GDP Growth %") +
  ggtitle("GDP Growth")

We then expand our outlook to more economic indicators. Note that, China’s CPI value is missing for 2004 and coincides with Japan’s line after 2010.

ggplot(country_econ, aes(x=Year, y=IndicatorValue, group=Country)) +
  geom_line(linetype="dashed", aes(color=Country)) +
  geom_point(aes(color=Country)) +
  theme(legend.position="top") +
  facet_wrap(~Indicator, scales="free_y") +
  ggtitle("Economic Indicators")

The plots above reveal some long term trends. In some cases like GDPGrowth or UnEmp, we can see the high impact of the financial crisis on a macro level if we place the event around the 2010 mark. More data granularity might be interesting here. We can see China’s high growth as a developing country and its decline like the other countries. Its large population accounts for the lower GDP per Capita.

This visualization also allows us the test the validity of the Phillips Curve which relates inflation (CPI: Consumer Price Index) and the rate of unemployment in an inverse relationship. The relationship appears true except in the case of the United States during the years 2004-2010, unemployment rose greatly even though inflation remained nearly steady. This would need to be confirmed with more data points for the years between 2004 and 2018. With significantly more data going back even further, we have the potential to spot some significant macro trends.

Conclusion

While working with this dataset, we encourated a few forms of “un-tidyness”:
- A dataset originally containing 4 tables and varying format
- Tables with year values in column headers
- Inconsistent column headers across multiple country data
- Values with annotations affixed to the data
- Long column names which make referecing more tedious

We also encapsulated the data wrangling steps shown above in a function split_clean() for re-use with data from other countries. This function can be built upon or modified to tidy the tables of the other indicators, social and environmental & infrastructure. Combining this with the general info table would allow for much more extensive future analysis.