Transforming Wide Datasets

Assignment

Three different untidy (or “wide”) datasets were selected in order to practice different methods of data preparation.

Dataset 1: World Governmental Indicators

This dataset from the World Bank (source) reports “aggregate and individual governance indicators for over 200 countries and territories over the period 1996–2020, for six dimensions of governance:”

Voice and Accountability
Political Stability and Absence of Violence/Terrorism
Government Effectiveness
Regulatory Quality
Rule of Law
Control of Corruption

The source file is an Excel workbook (.xlsx) with multiple tabs, and for this exercise we’ll select one tab to focus on - Control of Corruption.

There is good deal of header text on this tab, and a ‘multiindex’ of column data pairing years with six numeric measures (Estimate, StdErr, NumSrc, Rank, Lower and Upper) resulting in 134 total columns.

Tab from wgidataset.xlsx

Data Prep

The ReadXL library had some trouble reading in an .xlsx from a remote URL, so we’ll use Curl to download a local copy first. Then we’ll read in the data from the ControlofCorruption tab, skipping all twelve rows of header information.

library(curl)
library(readxl)

# import excel sheet and skip header rows
#curl_download('http://info.worldbank.org/governance/wgi/Home/downLoadFile?fileName=wgidataset.xlsx', 
#              'data/wgi/wgidataset.xlsx') # first run only

df_raw <- read_excel('data/wgi/wgidataset.xlsx', 
                     sheet='ControlofCorruption', skip=12)

sample of df_raw
…1	…2	1996…3	1996…4
Country/Territory	Code	Estimate	StdErr
Aruba	ABW	#N/A	#N/A
Andorra	ADO	1.3181432485580444	0.48088869452476501
Afghanistan	AFG	-1.2917047739028931	0.34050697088241577
Angola	AGO	-1.1677018404006958	0.26207658648490906

We’ll want to create a pivot_longer of all this column data, but first we need to handle this multi-index in order to get a single row for each Country, Year and Measure.

Unlike Python’s Pandas, there is not much native support for multi-index dataframes in R. One approach would be to grab the two rows containing Year and Measure, and fuse them into individual column names for a new dataframe that we can then pivot out:

# convert the column names and first row into 2 vectors
df_col_1 <- colnames(df_raw) # colnames() returns a vector
df_col_2 <- as.character(slice(df_raw,1)) # slice() returns a df, so make vector

# get rid of the auto numbering in the column names
df_col_1 <- str_remove(df_col_1, '\\..*$')

# now join each element of the two vectors with paste0 (no separators)
df_cols <- paste0(df_col_1, df_col_2)

# preview
df_cols[1:8]

## [1] "Country/Territory" "Code"              "1996Estimate"     
## [4] "1996StdErr"        "1996NumSrc"        "1996Rank"         
## [7] "1996Lower"         "1996Upper"

# make new df with our new column names and the values
df <- df_raw[-1,]
names(df) <- df_cols

new df
Country/Territory	Code	1996Estimate	1996StdErr	1996NumSrc
Aruba	ABW	#N/A	#N/A	#N/A
Andorra	ADO	1.3181432485580444	0.48088869452476501	1
Afghanistan	AFG	-1.2917047739028931	0.34050697088241577	2
Angola	AGO	-1.1677018404006958	0.26207658648490906	4
Anguilla	AIA	#N/A	#N/A	#N/A

A bit of data cleanup before we go further. There are numerous instances of the string #N/A throughout this dataset, but we’ll need to replace those with actual NA values that R will recognize as NULL instead of character strings.

One way to apply this across all columns at once is to define a function that uses str_replace and use mutate(across(everything(), .func)) to execute.

(Note: We could have done this after the following step instead, when we only have one column to deal with, but it’s good practice..)

# replace all the instances of string '#N/A' with actual NA
nona <- function(s) {str_replace(s,'#N/A','NA')}
df <- df %>% mutate(across(everything(), nona))

# convert all available columns to numeric type
df <- type_convert(df)

formatted df
Country/Territory	Code	1996Estimate	1996StdErr	1996NumSrc
Aruba	ABW	NA	NA	NA
Andorra	ADO	1.318143	0.4808887	1
Afghanistan	AFG	-1.291705	0.3405070	2
Angola	AGO	-1.167702	0.2620766	4
Anguilla	AIA	NA	NA	NA

Now for the pivot!

The pivot_longer function has a parameter names_pattern that accepts regex capturing groups to split column names during the pivot.

In this case we’ll create two capture groups .. one for the four-digit year, and one for the Measure name:

# we can use the names_pattern parameter of pivot_longer to separate Year and Category into separate columns .. pretty useful!
df <- df %>% pivot_longer(
 cols = `1996Estimate`:`2020Upper`,
 names_to = c('Year','Measure'),
 names_pattern = '(.{4})(.*)',
 values_to = 'Value'
)

pivoted df
Country/Territory	Code	Year	Measure	Value
Aruba	ABW	2007	StdErr	0.4162902
Aruba	ABW	2007	NumSrc	1.0000000
Aruba	ABW	2007	Rank	88.8349533
Aruba	ABW	2007	Lower	72.3300934
Aruba	ABW	2007	Upper	93.6893234
Aruba	ABW	2008	Estimate	1.2886552

Analysis

We now have a tidy, well-formatted dataset with each row representing a single observation of a given Year, Country and Measure. Let’s demonstrate with a couple of simple graphs:

df_rank_2020 <- df %>% 
  filter(Measure=='Rank', Year==2020) %>% 
  filter(Value >= 90) %>%
  arrange(desc(Value))

df_rank_2020 %>% ggplot(aes(x=reorder(Code, -Value), y=Value)) +
  geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title='Top Countries', x='Year')

df_rank_annual <- df %>% 
  filter(Measure=='Rank', Code=='AFG') %>% 
  arrange(Year)

df_rank_annual %>% ggplot(aes(x=Year, y=Value)) +
  geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title='Afghanistan')

Dataset 2: League of Legends Champions

Video game developer Riot Games publishes League of Legends game data and assets for use by third-party developers. We’ll be looking at the top-level dataset (source) for the main game characters (or “Champions”), which is made available in JSON format.

champions.json

Data Prep

We’ll be using two additional libraries to parse HTML and work efficiently with JSON-formatted data: rvest and tidyjson.

library(tidyjson)
library(rvest)

# get json from webpage with rvest read_html
page <- read_html('https://ddragon.leagueoflegends.com/cdn/11.19.1/data/en_US/champion.json')

# parse the html with rvest, return just the contents of the <body> element
json <- page %>% html_elements("body") %>% html_text()

# check out the structure
json %>% gather_object %>% json_types

## # A tbl_json: 4 x 4 tibble with a "JSON" attribute
##   ..JSON                  document.id name    type  
##   <chr>                         <int> <chr>   <fct> 
## 1 "\"champion\""                    1 type    string
## 2 "\"standAloneComp..."             1 format  string
## 3 "\"11.19.1\""                     1 version string
## 4 "{\"Aatrox\":{\"ver..."           1 data    object

The tricky part here is, if we were to use tidyjson at the root of this JSON document, we wouldn’t get the information we need.

What we actually need is the fourth node (helpfully named “data”), which itself is a series of nested JSON nodes that contain all the character information.

We’ll use the enter_object and gather_object functions to extract the content of this “data” node, spread_all to create a table of all the json elements, and then convert back into a standard R dataframe:

# get only the 'data' node from the json and spread_all into a 
# json table, then convert back to regular dataframe
champions <- json %>% 
  enter_object(data) %>% 
  gather_object %>% 
  spread_all %>% 
  as_data_frame.tbl_json %>%
  select(!blurb) # delete this column for demo .. very long strings

formatted df
document.id	name	version	id	key	name.2	title	partype	info.attack	info.defense	info.magic	info.difficulty	image.full	image.sprite	image.group	image.x	image.w	image.h	stats.hp	stats.hpperlevel	stats.mp	stats.mpperlevel	stats.movespeed	stats.armor	stats.armorperlevel	stats.spellblock	stats.spellblockperlevel	stats.attackrange	stats.hpregen	stats.hpregenperlevel	stats.mpregen	stats.mpregenperlevel	stats.attackdamage	stats.attackdamageperlevel	stats.attackspeedperlevel	stats.attackspeed
1	Aatrox	11.19.1	Aatrox	266	Aatrox	the Darkin Blade	Blood Well	8	4	3	4	Aatrox.png	champion0.png	champion	0	48	48	580	90	0	0	345	38	3.25	32	1.25	175	3.00	1.00	0.000	0.00	60	5.00	2.500	0.651
1	Ahri	11.19.1	Ahri	103	Ahri	the Nine-Tailed Fox	Mana	3	4	8	5	Ahri.png	champion0.png	champion	48	48	48	526	92	418	25	330	21	3.50	30	0.50	550	5.50	0.60	8.000	0.80	53	3.00	2.000	0.668
1	Akali	11.19.1	Akali	84	Akali	the Rogue Assassin	Energy	5	3	8	7	Akali.png	champion0.png	champion	96	48	48	500	105	200	0	345	23	3.50	37	1.25	125	9.00	0.90	50.000	0.00	62	3.30	3.200	0.625
1	Akshan	11.19.1	Akshan	166	Akshan	the Rogue Sentinel	Mana	0	0	0	0	Akshan.png	champion0.png	champion	144	48	48	560	90	350	40	330	26	3.00	30	0.50	500	3.75	0.65	8.175	0.70	52	3.50	4.000	0.638
1	Alistar	11.19.1	Alistar	12	Alistar	the Minotaur	Mana	6	9	5	7	Alistar.png	champion0.png	champion	192	48	48	600	106	350	40	330	44	3.50	32	1.25	125	8.50	0.85	8.500	0.80	62	3.75	2.125	0.625
1	Amumu	11.19.1	Amumu	32	Amumu	the Sad Mummy	Mana	2	6	8	3	Amumu.png	champion0.png	champion	240	48	48	615	75	285	40	335	30	3.50	32	1.25	125	9.00	0.85	7.380	0.53	53	3.80	2.180	0.736

Analysis

tidyjson has expanded all the information from the “data” node into a well-formatted dataframe, and has automatically prepended any sub-node names such as “info” and stats” to make it easy to differentiate.

Let’s examine the results with a couple of simple graphs:

df <- champions %>% 
  select(c('name','stats.hp','stats.mp','stats.movespeed','stats.armor'))

ggplot(df) +
  aes(x = stats.hp) +
  geom_histogram(binwidth = 10, color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  labs(title='Hit Points')

ggplot(df) +
  aes(x = stats.mp) +
  geom_histogram(binwidth=50, color = "#112446", fill="#ffffff") +
  theme_minimal() +
  labs(title='Mana Points')

Dataset 3: Global Infant Mortality Rates

This dataset published by the World Bank (source) tracks the mortality rate of children under 5, measured as deaths per 1,000 live births. The data are broken down by country, region and year from 1960 to 2019.

Data Prep

This dataset includes three separate files, one with the Mortality data by country and year, and two files with additional metadata.

For this exercise we will load the Mortality data file and the Region metadata file, then transform and join the two together for well-formatted and informative dataset.

df_mort <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv', skip=4)
df_meta <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/Metadata_Country_API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv')

# cleanup
df_mort <- df_mort %>% 
  select(!('2020':'...66')) %>% # drop the last 2 columns, no data
  select(!('Indicator Name':'Indicator Code')) %>% # drop these cols, not needed
  rename(Name = 'Country Name', Code = 'Country Code') # renaming

df_meta <- df_meta %>%
  select(!SpecialNotes) %>% # drop column 'SpecialNotes', not needed
  select(!last_col()) # drop last column, no data

One additional step on the regional metadata .. we will use the conditional statement ifelse to impute whether the observation pertains to a Country or a Region, and to impute the value from the TableName column where Region is NA.

(This is due to a very large number of ‘secondary’ Region categories, and the likelihood of wanting to look at Country-only and Region-only data in any further analysis.)

# impute missing Region values
df_meta <- df_meta %>% 
  mutate(`Record Type` = ifelse(is.na(Region),'Region','Country')) %>%
  mutate(Region = ifelse(is.na(Region), TableName, Region)) %>% 
  rename(Code = 'Country Code') %>%
  select(!TableName)

Now, we pivot the Mortality table with pivot_longer so that each row corresponds to a single Country and Year. We then left_join the Region metadata and perform a little cleanup.

# pivot
df_mort <- df_mort %>% 
  pivot_longer(!c('Name','Code'), 
               names_to = 'Year', 
               values_to = 'Mortality Rate')

# join
df <- left_join(df_mort, df_meta, by='Code')

# correct Mortality Rate to actual rate (/1000)
df <- df %>% mutate(`Mortality Rate` = `Mortality Rate`/1000)

final df (sample)
Name	Code	Year	Mortality Rate	Region
Africa Eastern and Southern	AFE	1969	0.2159640	Africa Eastern and Southern
Africa Eastern and Southern	AFE	1970	0.2146804	Africa Eastern and Southern
Africa Eastern and Southern	AFE	1971	0.2132826	Africa Eastern and Southern
Africa Eastern and Southern	AFE	1972	0.2118596	Africa Eastern and Southern
Africa Eastern and Southern	AFE	1973	0.2104980	Africa Eastern and Southern
Africa Eastern and Southern	AFE	1974	0.2091949	Africa Eastern and Southern

To aid in analysis, we’ll filter two subtables - one with Country data only, and one with Region data only.

Let’s also take a look at the unique names in each table to compare the “main” geo-based Regions in the Countries table, with a much larger number of “secondary” Regions in the Region table:

# analysis
df_regions <- df %>% filter(`Record Type` == 'Region')
df_countries <- df %>% filter(`Record Type` == 'Country')

# unique regions in the country table 
unique(df_countries$Region)

## [1] "Latin America & Caribbean"  "South Asia"                
## [3] "Sub-Saharan Africa"         "Europe & Central Asia"     
## [5] "Middle East & North Africa" "East Asia & Pacific"       
## [7] "North America"

# unique regions in the regions table
unique(df_regions$Name)

##  [1] "Africa Eastern and Southern"                         
##  [2] "Africa Western and Central"                          
##  [3] "Arab World"                                          
##  [4] "Central Europe and the Baltics"                      
##  [5] "Caribbean small states"                              
##  [6] "East Asia & Pacific (excluding high income)"         
##  [7] "Early-demographic dividend"                          
##  [8] "East Asia & Pacific"                                 
##  [9] "Europe & Central Asia (excluding high income)"       
## [10] "Europe & Central Asia"                               
## [11] "Euro area"                                           
## [12] "European Union"                                      
## [13] "Fragile and conflict affected situations"            
## [14] "High income"                                         
## [15] "Heavily indebted poor countries (HIPC)"              
## [16] "IBRD only"                                           
## [17] "IDA & IBRD total"                                    
## [18] "IDA total"                                           
## [19] "IDA blend"                                           
## [20] "IDA only"                                            
## [21] "Latin America & Caribbean (excluding high income)"   
## [22] "Latin America & Caribbean"                           
## [23] "Least developed countries: UN classification"        
## [24] "Low income"                                          
## [25] "Lower middle income"                                 
## [26] "Low & middle income"                                 
## [27] "Late-demographic dividend"                           
## [28] "Middle East & North Africa"                          
## [29] "Middle income"                                       
## [30] "Middle East & North Africa (excluding high income)"  
## [31] "North America"                                       
## [32] "OECD members"                                        
## [33] "Other small states"                                  
## [34] "Pre-demographic dividend"                            
## [35] "Pacific island small states"                         
## [36] "Post-demographic dividend"                           
## [37] "South Asia"                                          
## [38] "Sub-Saharan Africa (excluding high income)"          
## [39] "Sub-Saharan Africa"                                  
## [40] "Small states"                                        
## [41] "East Asia & Pacific (IDA & IBRD countries)"          
## [42] "Europe & Central Asia (IDA & IBRD countries)"        
## [43] "Latin America & the Caribbean (IDA & IBRD countries)"
## [44] "Middle East & North Africa (IDA & IBRD countries)"   
## [45] "South Asia (IDA & IBRD)"                             
## [46] "Sub-Saharan Africa (IDA & IBRD countries)"           
## [47] "Upper middle income"                                 
## [48] "World"

Analysis

We should have a good foundation now for further analysis.

Let’s demonstrate with a line graph of Mortality Rates by country, from the Latin America and Caribbean region. Right away, we can see that while child mortality has been steadily dropping, there are a few alarming outliers where the trend has reversed in recent years, or where catastrophic events have had a significant impact (such as the 2010 Haiti earthquake, seen in the top line below.)

df_countries_latam <- df_countries %>%
  filter(Region == 'Latin America & Caribbean')

ggplot(df_countries_latam) +
  aes(x = Year, y = `Mortality Rate`, group = Name, label=Name) +
  geom_line() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))