Summary

This document shows the reproducible steps which were used to process data from the World Bank’s development indicators for use in Tableau, as part of the Coursera - UC Davis Data Visualization with Tableau Specialization Capstone course. The data can be accessed at the following address: http://databank.worldbank.org/data/download/WDI_csv.zip

Loading libraries

library(readr)
library(data.table)
library(tidyr)

Loading data

mydata <- read_csv("~/data visualisation/capstone/WDI_csv/WDIData.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country Name` = col_character(),
##   `Country Code` = col_character(),
##   `Indicator Name` = col_character(),
##   `Indicator Code` = col_character(),
##   X65 = col_logical()
## )
## See spec(...) for full column specifications.

Transforming Column Names for better compatibility with R

Originally, the column names have spaces in them or they are year numbers, this is problematic, so we change them, putting _ instead of spaces and Y in front of the years, and removing the last column which is not useful.

mydata<- as.data.table(mydata)

years <- 1960:2019
ynames <- paste0("Y", years)
mydata[,"X65":= NULL]
names(mydata) <- c("Country_Name", "Country_Code", "Indicator_Name", "Indicator_Code", ynames)

Choosing Indicators

possibleindicators <- mydata[, unique(Indicator_Name)]
ghgindic <- grep("[Cc][Oo]2|greenhouse|ghg", possibleindicators)
keptGHG<- possibleindicators[ghgindic][c(3,6,7,8,13,14,15,18,31)]
educindicators <- grep("[Ee]ducation|[Ss]chool|[Ll]ower [Ss]econdary", possibleindicators)
keptedu <- possibleindicators[educindicators][65]
keptpop <- "Population, total"
healthindic <- grep("[Ll]ife [Ee]xpect", possibleindicators)
kepthealth<- possibleindicators[healthindic][3]
gdpindic <- grep("GDP|gdp", possibleindicators)
keptgdp <- possibleindicators[gdpindic][c(27,34,39)]
energyindic <- grep("[Ee]lectric|[Ee]nerg", possibleindicators)
keptenergy <- possibleindicators[energyindic][c(1, 6, 7, 8, 10:18, 26)]

Workflow for cleaning the data:

  1. Take out all data that is not from countries with over 1M population
    The reason for this is that the data includes lots of non-countries, categories such as Arab World, OECD countries, Lower Middle Income. In many cases these categories overlap with each other. There are also a lot of really small countries. These clutter the scatterplots and are often not reproducible models economically (the US or China cannot reduce their emissions while maintaining their way of life by taking inspiration from Lichtenstein or Macao).
alllisted <- mydata[, unique(Country_Name)]
actualcountries<- alllisted[48:264]
bigcountries <- mydata[Country_Name %in% actualcountries & Indicator_Name == "Population, total" & Y2010>1e6, unique(Country_Name)]
keptdata <- mydata[Country_Name %in% bigcountries]
  1. Keep only the indicators we are going to look at
allkept <- c(keptedu, keptenergy, keptgdp, keptGHG, kepthealth, keptpop)
keptdata <- keptdata[Indicator_Name %in% allkept]
  1. Consider only the years 2010-2016

For emissions of CO2, the latest available data is for 2014, and for all greenhouse gases, the last available data is from 2012. Some indicators, however, are not collected / reported on each year by all countries. Therefore, we consider the maximum of available values for the range of dates 2010-2016 for these indicators.

keptdata <- keptdata[,.(Country_Name, Indicator_Name, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)]
  1. Compute the max indicator for those years. Compare the number of missing values from the max indicator with the one from the 2014; if it’s similar, we keep 2014 and only impute missing values from the max. If not, we keep the max for all countries.
keptdata[, maxind := max(Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016, na.rm=TRUE), by=.(Country_Name, Indicator_Name)]
keptdata[maxind<0, maxind:=NA]
missing2014 <- keptdata[, sum(is.na(Y2014)), by=.(Indicator_Name)]
missingall <- keptdata[,sum(is.na(maxind)), by=.(Indicator_Name)]
missingcompare <- missing2014[missingall, on="Indicator_Name"]
names(missingcompare)[2:3]<- c("miss_2014", "miss_all")
missingcompare[, missvaldiff:= miss_2014- miss_all]
missingcompare[, missval_choice := ifelse(missvaldiff<=0, "do nothing", ifelse(missvaldiff<10, "impute", "replace"))]
missingcompare
##                                                                                                                        Indicator_Name
##  1:                                                                                           Access to electricity (% of population)
##  2:                                                                            Alternative and nuclear energy (% of total energy use)
##  3:                                                                                            CO2 emissions (kg per 2010 US$ of GDP)
##  4:                                                                                                                CO2 emissions (kt)
##  5:                                                                                            CO2 emissions (metric tons per capita)
##  6:                                            CO2 emissions from electricity and heat production, total (% of total fuel combustion)
##  7:                                         CO2 emissions from manufacturing industries and construction (% of total fuel combustion)
##  8: CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)
##  9:                          CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion)
## 10:                                                                         CO2 emissions from transport (% of total fuel combustion)
## 11:                                                                            CO2 intensity (kg per kg of oil equivalent energy use)
## 12:                                                                                       Electric power consumption (kWh per capita)
## 13:                                                                 Electric power transmission and distribution losses (% of output)
## 14:                                                                             Electricity production from coal sources (% of total)
## 15:                                                                    Electricity production from hydroelectric sources (% of total)
## 16:                                                                      Electricity production from natural gas sources (% of total)
## 17:                                                                          Electricity production from nuclear sources (% of total)
## 18:                                                                              Electricity production from oil sources (% of total)
## 19:                                                                Electricity production from oil, gas and coal sources (% of total)
## 20:                                               Electricity production from renewable sources, excluding hydroelectric (% of total)
## 21:                                                                                       Fossil fuel energy consumption (% of total)
## 22:                                                                                                           GDP (constant 2010 US$)
## 23:                                                                                                GDP per capita (constant 2010 US$)
## 24:                                                                               GDP per capita, PPP (constant 2017 international $)
## 25:                                                                                           Life expectancy at birth, total (years)
## 26:                                                                  Lower secondary completion rate, total (% of relevant age group)
## 27:                                                                                                                 Population, total
## 28:                                                                             Total greenhouse gas emissions (kt of CO2 equivalent)
##                                                                                                                        Indicator_Name
##     miss_2014 miss_all missvaldiff missval_choice
##  1:         0        0           0     do nothing
##  2:        30       24           6         impute
##  3:         7        5           2         impute
##  4:         3        2           1         impute
##  5:         4        2           2         impute
##  6:        24       24           0     do nothing
##  7:        24       24           0     do nothing
##  8:        24       25          -1     do nothing
##  9:        24       24           0     do nothing
## 10:        24       24           0     do nothing
## 11:        31       25           6         impute
## 12:        25       24           1         impute
## 13:        25       25           0     do nothing
## 14:        24       24           0     do nothing
## 15:        24       24           0     do nothing
## 16:        24       24           0     do nothing
## 17:        24       24           0     do nothing
## 18:        24       24           0     do nothing
## 19:        24       24           0     do nothing
## 20:        24       24           0     do nothing
## 21:        31       25           6         impute
## 22:         4        3           1         impute
## 23:         4        3           1         impute
## 24:         8        8           0     do nothing
## 25:         0        0           0     do nothing
## 26:        58       20          38        replace
## 27:         1        0           1         impute
## 28:       158        8         150        replace
##     miss_2014 miss_all missvaldiff missval_choice

We can see that for most indicators, nothing needs to be done, a few (those where the difference of missing values is less than 10) can have the values imputed from the maximum in the time range without distorting the overall data, and two indicators need to take the maximum from the time frame in order to be valid.

replace_indicators <- missingcompare[missval_choice=="replace", unique(Indicator_Name)]
impute_indicators <- missingcompare[missval_choice=="impute", unique(Indicator_Name)]
keptdata[Indicator_Name %in% replace_indicators, Y2014:=maxind, by=.(Country_Name)]
keptdata[Indicator_Name %in% impute_indicators & is.na(Y2014), Y2014:=maxind] 
  1. Reshape the data so that the format is Tableau friendly

The kind of data as time series is not Tableau friendly, we reshape it using Hadley Wickham’s tidy data principles.

nonpivotdata <- keptdata[, c(1,2,7)]
library(tidyr)
shapeddata <- pivot_wider(nonpivotdata, names_from = "Indicator_Name", values_from = "Y2014")
head(shapeddata)
##   Country_Name Access to electricity (% of population)
## 1  Afghanistan                                89.50000
## 2      Albania                               100.00000
## 3      Algeria                                99.84082
## 4       Angola                                32.00000
## 5    Argentina                               100.00000
## 6      Armenia                                99.95728
##   Alternative and nuclear energy (% of total energy use)
## 1                                                     NA
## 2                                            24.54805021
## 3                                             0.03010604
## 4                                             2.54884002
## 5                                             5.50150364
## 6                                            25.71826413
##   CO2 emissions (kg per 2010 US$ of GDP) CO2 emissions (kt)
## 1                              0.5036294           9809.225
## 2                              0.4483627           5716.853
## 3                              0.7944380         145400.217
## 4                              0.3357378          34763.160
## 5                              0.4598181         204024.546
## 6                              0.4971488           5529.836
##   CO2 emissions (metric tons per capita)
## 1                              0.2939464
## 2                              1.9787633
## 3                              3.7355201
## 4                              1.2903068
## 5                              4.7815078
## 6                              1.8987194
##   CO2 emissions from electricity and heat production, total (% of total fuel combustion)
## 1                                                                                     NA
## 2                                                                               2.912621
## 3                                                                              38.826975
## 4                                                                              19.170984
## 5                                                                              38.038563
## 6                                                                              29.693487
##   CO2 emissions from manufacturing industries and construction (% of total fuel combustion)
## 1                                                                                        NA
## 2                                                                                 22.087379
## 3                                                                                  8.338079
## 4                                                                                  7.772021
## 5                                                                                 16.870225
## 6                                                                                 11.111111
##   CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)
## 1                                                                                                                                NA
## 2                                                                                                                         6.5533981
## 3                                                                                                                         2.3672008
## 4                                                                                                                         0.3626943
## 5                                                                                                                         6.4653604
## 6                                                                                                                         0.3831418
##   CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion)
## 1                                                                                                       NA
## 2                                                                                                 8.495146
## 3                                                                                                15.179370
## 4                                                                                                28.704663
## 5                                                                                                14.463905
## 6                                                                                                31.034483
##   CO2 emissions from transport (% of total fuel combustion)
## 1                                                        NA
## 2                                                  59.95146
## 3                                                  35.28024
## 4                                                  43.98964
## 5                                                  24.16714
## 6                                                  27.77778
##   CO2 intensity (kg per kg of oil equivalent energy use)
## 1                                                     NA
## 2                                               2.447584
## 3                                               2.813876
## 4                                               2.369234
## 5                                               2.355512
## 6                                               1.868984
##   Electric power consumption (kWh per capita)
## 1                                          NA
## 2                                   2309.3665
## 3                                   1362.8719
## 4                                    312.2288
## 5                                   3074.7021
## 6                                   1961.6104
##   Electric power transmission and distribution losses (% of output)
## 1                                                                NA
## 2                                                          23.68755
## 3                                                          17.12742
## 4                                                          11.26582
## 5                                                          14.65738
## 6                                                          11.98710
##   Electricity production from coal sources (% of total)
## 1                                                    NA
## 2                                              0.000000
## 3                                              0.000000
## 4                                              0.000000
## 5                                              2.217284
## 6                                              0.000000
##   Electricity production from hydroelectric sources (% of total)
## 1                                                             NA
## 2                                                      100.00000
## 3                                                        0.39538
## 4                                                       53.17511
## 5                                                       29.60794
## 6                                                       25.70323
##   Electricity production from natural gas sources (% of total)
## 1                                                           NA
## 2                                                      0.00000
## 3                                                     98.61150
## 4                                                      0.00000
## 5                                                     48.19968
## 6                                                     42.43871
##   Electricity production from nuclear sources (% of total)
## 1                                                       NA
## 2                                                 0.000000
## 3                                                 0.000000
## 4                                                 0.000000
## 5                                                 4.166727
## 6                                                31.806452
##   Electricity production from oil sources (% of total)
## 1                                                   NA
## 2                                            0.0000000
## 3                                            0.9931198
## 4                                           46.8248945
## 5                                           13.8401066
## 6                                            0.0000000
##   Electricity production from oil, gas and coal sources (% of total)
## 1                                                                 NA
## 2                                                            0.00000
## 3                                                           99.60462
## 4                                                           46.82489
## 5                                                           64.25707
## 6                                                           42.43871
##   Electricity production from renewable sources, excluding hydroelectric (% of total)
## 1                                                                                  NA
## 2                                                                           0.0000000
## 3                                                                           0.0000000
## 4                                                                           0.0000000
## 5                                                                           1.9682645
## 6                                                                           0.0516129
##   Fossil fuel energy consumption (% of total) GDP (constant 2010 US$)
## 1                                          NA             19477070578
## 2                                    61.42180             12750510498
## 3                                    99.97792            183022745561
## 4                                    48.30559            103542597662
## 5                                    87.72241            443707073929
## 6                                    74.56187             11123101341
##   GDP per capita (constant 2010 US$)
## 1                           583.6562
## 2                          4413.3096
## 3                          4702.0917
## 4                          3843.1982
## 5                         10398.6940
## 6                          3819.2178
##   GDP per capita, PPP (constant 2017 international $)
## 1                                            2249.254
## 2                                           11877.462
## 3                                           11329.490
## 4                                            8239.832
## 5                                           23550.305
## 6                                           11019.839
##   Life expectancy at birth, total (years)
## 1                                  62.966
## 2                                  77.813
## 3                                  75.878
## 4                                  58.776
## 5                                  75.913
## 6                                  74.273
##   Lower secondary completion rate, total (% of relevant age group)
## 1                                                         50.69407
## 2                                                         99.15678
## 3                                                        114.63423
## 4                                                         20.73882
## 5                                                         90.11586
## 6                                                         96.92009
##   Population, total Total greenhouse gas emissions (kt of CO2 equivalent)
## 1          33370794                                             18168.860
## 2           2889104                                              8898.641
## 3          38923687                                            176471.227
## 4          26941779                                             41657.165
## 5          42669500                                            380295.320
## 6           2912403                                             12417.736