This document shows the reproducible steps which were used to process data from the World Bank’s development indicators for use in Tableau, as part of the Coursera - UC Davis Data Visualization with Tableau Specialization Capstone course. The data can be accessed at the following address: http://databank.worldbank.org/data/download/WDI_csv.zip
library(readr)
library(data.table)
library(tidyr)
mydata <- read_csv("~/data visualisation/capstone/WDI_csv/WDIData.csv")
## Warning: Missing column names filled in: 'X65' [65]
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country Name` = col_character(),
## `Country Code` = col_character(),
## `Indicator Name` = col_character(),
## `Indicator Code` = col_character(),
## X65 = col_logical()
## )
## See spec(...) for full column specifications.
Originally, the column names have spaces in them or they are year numbers, this is problematic, so we change them, putting _ instead of spaces and Y in front of the years, and removing the last column which is not useful.
mydata<- as.data.table(mydata)
years <- 1960:2019
ynames <- paste0("Y", years)
mydata[,"X65":= NULL]
names(mydata) <- c("Country_Name", "Country_Code", "Indicator_Name", "Indicator_Code", ynames)
possibleindicators <- mydata[, unique(Indicator_Name)]
ghgindic <- grep("[Cc][Oo]2|greenhouse|ghg", possibleindicators)
keptGHG<- possibleindicators[ghgindic][c(3,6,7,8,13,14,15,18,31)]
educindicators <- grep("[Ee]ducation|[Ss]chool|[Ll]ower [Ss]econdary", possibleindicators)
keptedu <- possibleindicators[educindicators][65]
keptpop <- "Population, total"
healthindic <- grep("[Ll]ife [Ee]xpect", possibleindicators)
kepthealth<- possibleindicators[healthindic][3]
gdpindic <- grep("GDP|gdp", possibleindicators)
keptgdp <- possibleindicators[gdpindic][c(27,34,39)]
energyindic <- grep("[Ee]lectric|[Ee]nerg", possibleindicators)
keptenergy <- possibleindicators[energyindic][c(1, 6, 7, 8, 10:18, 26)]
alllisted <- mydata[, unique(Country_Name)]
actualcountries<- alllisted[48:264]
bigcountries <- mydata[Country_Name %in% actualcountries & Indicator_Name == "Population, total" & Y2010>1e6, unique(Country_Name)]
keptdata <- mydata[Country_Name %in% bigcountries]
allkept <- c(keptedu, keptenergy, keptgdp, keptGHG, kepthealth, keptpop)
keptdata <- keptdata[Indicator_Name %in% allkept]
For emissions of CO2, the latest available data is for 2014, and for all greenhouse gases, the last available data is from 2012. Some indicators, however, are not collected / reported on each year by all countries. Therefore, we consider the maximum of available values for the range of dates 2010-2016 for these indicators.
keptdata <- keptdata[,.(Country_Name, Indicator_Name, Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016)]
keptdata[, maxind := max(Y2010, Y2011, Y2012, Y2013, Y2014, Y2015, Y2016, na.rm=TRUE), by=.(Country_Name, Indicator_Name)]
keptdata[maxind<0, maxind:=NA]
missing2014 <- keptdata[, sum(is.na(Y2014)), by=.(Indicator_Name)]
missingall <- keptdata[,sum(is.na(maxind)), by=.(Indicator_Name)]
missingcompare <- missing2014[missingall, on="Indicator_Name"]
names(missingcompare)[2:3]<- c("miss_2014", "miss_all")
missingcompare[, missvaldiff:= miss_2014- miss_all]
missingcompare[, missval_choice := ifelse(missvaldiff<=0, "do nothing", ifelse(missvaldiff<10, "impute", "replace"))]
missingcompare
## Indicator_Name
## 1: Access to electricity (% of population)
## 2: Alternative and nuclear energy (% of total energy use)
## 3: CO2 emissions (kg per 2010 US$ of GDP)
## 4: CO2 emissions (kt)
## 5: CO2 emissions (metric tons per capita)
## 6: CO2 emissions from electricity and heat production, total (% of total fuel combustion)
## 7: CO2 emissions from manufacturing industries and construction (% of total fuel combustion)
## 8: CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)
## 9: CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion)
## 10: CO2 emissions from transport (% of total fuel combustion)
## 11: CO2 intensity (kg per kg of oil equivalent energy use)
## 12: Electric power consumption (kWh per capita)
## 13: Electric power transmission and distribution losses (% of output)
## 14: Electricity production from coal sources (% of total)
## 15: Electricity production from hydroelectric sources (% of total)
## 16: Electricity production from natural gas sources (% of total)
## 17: Electricity production from nuclear sources (% of total)
## 18: Electricity production from oil sources (% of total)
## 19: Electricity production from oil, gas and coal sources (% of total)
## 20: Electricity production from renewable sources, excluding hydroelectric (% of total)
## 21: Fossil fuel energy consumption (% of total)
## 22: GDP (constant 2010 US$)
## 23: GDP per capita (constant 2010 US$)
## 24: GDP per capita, PPP (constant 2017 international $)
## 25: Life expectancy at birth, total (years)
## 26: Lower secondary completion rate, total (% of relevant age group)
## 27: Population, total
## 28: Total greenhouse gas emissions (kt of CO2 equivalent)
## Indicator_Name
## miss_2014 miss_all missvaldiff missval_choice
## 1: 0 0 0 do nothing
## 2: 30 24 6 impute
## 3: 7 5 2 impute
## 4: 3 2 1 impute
## 5: 4 2 2 impute
## 6: 24 24 0 do nothing
## 7: 24 24 0 do nothing
## 8: 24 25 -1 do nothing
## 9: 24 24 0 do nothing
## 10: 24 24 0 do nothing
## 11: 31 25 6 impute
## 12: 25 24 1 impute
## 13: 25 25 0 do nothing
## 14: 24 24 0 do nothing
## 15: 24 24 0 do nothing
## 16: 24 24 0 do nothing
## 17: 24 24 0 do nothing
## 18: 24 24 0 do nothing
## 19: 24 24 0 do nothing
## 20: 24 24 0 do nothing
## 21: 31 25 6 impute
## 22: 4 3 1 impute
## 23: 4 3 1 impute
## 24: 8 8 0 do nothing
## 25: 0 0 0 do nothing
## 26: 58 20 38 replace
## 27: 1 0 1 impute
## 28: 158 8 150 replace
## miss_2014 miss_all missvaldiff missval_choice
We can see that for most indicators, nothing needs to be done, a few (those where the difference of missing values is less than 10) can have the values imputed from the maximum in the time range without distorting the overall data, and two indicators need to take the maximum from the time frame in order to be valid.
replace_indicators <- missingcompare[missval_choice=="replace", unique(Indicator_Name)]
impute_indicators <- missingcompare[missval_choice=="impute", unique(Indicator_Name)]
keptdata[Indicator_Name %in% replace_indicators, Y2014:=maxind, by=.(Country_Name)]
keptdata[Indicator_Name %in% impute_indicators & is.na(Y2014), Y2014:=maxind]
The kind of data as time series is not Tableau friendly, we reshape it using Hadley Wickham’s tidy data principles.
nonpivotdata <- keptdata[, c(1,2,7)]
library(tidyr)
shapeddata <- pivot_wider(nonpivotdata, names_from = "Indicator_Name", values_from = "Y2014")
head(shapeddata)
## Country_Name Access to electricity (% of population)
## 1 Afghanistan 89.50000
## 2 Albania 100.00000
## 3 Algeria 99.84082
## 4 Angola 32.00000
## 5 Argentina 100.00000
## 6 Armenia 99.95728
## Alternative and nuclear energy (% of total energy use)
## 1 NA
## 2 24.54805021
## 3 0.03010604
## 4 2.54884002
## 5 5.50150364
## 6 25.71826413
## CO2 emissions (kg per 2010 US$ of GDP) CO2 emissions (kt)
## 1 0.5036294 9809.225
## 2 0.4483627 5716.853
## 3 0.7944380 145400.217
## 4 0.3357378 34763.160
## 5 0.4598181 204024.546
## 6 0.4971488 5529.836
## CO2 emissions (metric tons per capita)
## 1 0.2939464
## 2 1.9787633
## 3 3.7355201
## 4 1.2903068
## 5 4.7815078
## 6 1.8987194
## CO2 emissions from electricity and heat production, total (% of total fuel combustion)
## 1 NA
## 2 2.912621
## 3 38.826975
## 4 19.170984
## 5 38.038563
## 6 29.693487
## CO2 emissions from manufacturing industries and construction (% of total fuel combustion)
## 1 NA
## 2 22.087379
## 3 8.338079
## 4 7.772021
## 5 16.870225
## 6 11.111111
## CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)
## 1 NA
## 2 6.5533981
## 3 2.3672008
## 4 0.3626943
## 5 6.4653604
## 6 0.3831418
## CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion)
## 1 NA
## 2 8.495146
## 3 15.179370
## 4 28.704663
## 5 14.463905
## 6 31.034483
## CO2 emissions from transport (% of total fuel combustion)
## 1 NA
## 2 59.95146
## 3 35.28024
## 4 43.98964
## 5 24.16714
## 6 27.77778
## CO2 intensity (kg per kg of oil equivalent energy use)
## 1 NA
## 2 2.447584
## 3 2.813876
## 4 2.369234
## 5 2.355512
## 6 1.868984
## Electric power consumption (kWh per capita)
## 1 NA
## 2 2309.3665
## 3 1362.8719
## 4 312.2288
## 5 3074.7021
## 6 1961.6104
## Electric power transmission and distribution losses (% of output)
## 1 NA
## 2 23.68755
## 3 17.12742
## 4 11.26582
## 5 14.65738
## 6 11.98710
## Electricity production from coal sources (% of total)
## 1 NA
## 2 0.000000
## 3 0.000000
## 4 0.000000
## 5 2.217284
## 6 0.000000
## Electricity production from hydroelectric sources (% of total)
## 1 NA
## 2 100.00000
## 3 0.39538
## 4 53.17511
## 5 29.60794
## 6 25.70323
## Electricity production from natural gas sources (% of total)
## 1 NA
## 2 0.00000
## 3 98.61150
## 4 0.00000
## 5 48.19968
## 6 42.43871
## Electricity production from nuclear sources (% of total)
## 1 NA
## 2 0.000000
## 3 0.000000
## 4 0.000000
## 5 4.166727
## 6 31.806452
## Electricity production from oil sources (% of total)
## 1 NA
## 2 0.0000000
## 3 0.9931198
## 4 46.8248945
## 5 13.8401066
## 6 0.0000000
## Electricity production from oil, gas and coal sources (% of total)
## 1 NA
## 2 0.00000
## 3 99.60462
## 4 46.82489
## 5 64.25707
## 6 42.43871
## Electricity production from renewable sources, excluding hydroelectric (% of total)
## 1 NA
## 2 0.0000000
## 3 0.0000000
## 4 0.0000000
## 5 1.9682645
## 6 0.0516129
## Fossil fuel energy consumption (% of total) GDP (constant 2010 US$)
## 1 NA 19477070578
## 2 61.42180 12750510498
## 3 99.97792 183022745561
## 4 48.30559 103542597662
## 5 87.72241 443707073929
## 6 74.56187 11123101341
## GDP per capita (constant 2010 US$)
## 1 583.6562
## 2 4413.3096
## 3 4702.0917
## 4 3843.1982
## 5 10398.6940
## 6 3819.2178
## GDP per capita, PPP (constant 2017 international $)
## 1 2249.254
## 2 11877.462
## 3 11329.490
## 4 8239.832
## 5 23550.305
## 6 11019.839
## Life expectancy at birth, total (years)
## 1 62.966
## 2 77.813
## 3 75.878
## 4 58.776
## 5 75.913
## 6 74.273
## Lower secondary completion rate, total (% of relevant age group)
## 1 50.69407
## 2 99.15678
## 3 114.63423
## 4 20.73882
## 5 90.11586
## 6 96.92009
## Population, total Total greenhouse gas emissions (kt of CO2 equivalent)
## 1 33370794 18168.860
## 2 2889104 8898.641
## 3 38923687 176471.227
## 4 26941779 41657.165
## 5 42669500 380295.320
## 6 2912403 12417.736