Transforming Wide Datasets

Assignment

Three different untidy (or “wide”) datasets were selected in order to practice different methods of data preparation.

Dataset 1: World Governmental Indicators

This dataset from the World Bank (source) reports “aggregate and individual governance indicators for over 200 countries and territories over the period 1996–2020, for six dimensions of governance:”

  • Voice and Accountability
  • Political Stability and Absence of Violence/Terrorism
  • Government Effectiveness
  • Regulatory Quality
  • Rule of Law
  • Control of Corruption

The source file is an Excel workbook (.xlsx) with multiple tabs, and for this exercise we’ll select one tab to focus on - Control of Corruption.

There is good deal of header text on this tab, and a ‘multiindex’ of column data pairing years with six numeric measures (Estimate, StdErr, NumSrc, Rank, Lower and Upper) resulting in 134 total columns.

Tab from wgidataset.xlsx

Data Prep

The ReadXL library had some trouble reading in an .xlsx from a remote URL, so we’ll use Curl to download a local copy first. Then we’ll read in the data from the ControlofCorruption tab, skipping all twelve rows of header information.

library(curl)
library(readxl)

# import excel sheet and skip header rows
#curl_download('http://info.worldbank.org/governance/wgi/Home/downLoadFile?fileName=wgidataset.xlsx', 
#              'data/wgi/wgidataset.xlsx') # first run only

df_raw <- read_excel('data/wgi/wgidataset.xlsx', 
                     sheet='ControlofCorruption', skip=12)
sample of df_raw
…1 …2 1996…3 1996…4
Country/Territory Code Estimate StdErr
Aruba ABW #N/A #N/A
Andorra ADO 1.3181432485580444 0.48088869452476501
Afghanistan AFG -1.2917047739028931 0.34050697088241577
Angola AGO -1.1677018404006958 0.26207658648490906

We’ll want to create a pivot_longer of all this column data, but first we need to handle this multi-index in order to get a single row for each Country, Year and Measure.

Unlike Python’s Pandas, there is not much native support for multi-index dataframes in R. One approach would be to grab the two rows containing Year and Measure, and fuse them into individual column names for a new dataframe that we can then pivot out:

# convert the column names and first row into 2 vectors
df_col_1 <- colnames(df_raw) # colnames() returns a vector
df_col_2 <- as.character(slice(df_raw,1)) # slice() returns a df, so make vector

# get rid of the auto numbering in the column names
df_col_1 <- str_remove(df_col_1, '\\..*$')

# now join each element of the two vectors with paste0 (no separators)
df_cols <- paste0(df_col_1, df_col_2)

# preview
df_cols[1:8]
## [1] "Country/Territory" "Code"              "1996Estimate"     
## [4] "1996StdErr"        "1996NumSrc"        "1996Rank"         
## [7] "1996Lower"         "1996Upper"
# make new df with our new column names and the values
df <- df_raw[-1,]
names(df) <- df_cols
new df
Country/Territory Code 1996Estimate 1996StdErr 1996NumSrc
Aruba ABW #N/A #N/A #N/A
Andorra ADO 1.3181432485580444 0.48088869452476501 1
Afghanistan AFG -1.2917047739028931 0.34050697088241577 2
Angola AGO -1.1677018404006958 0.26207658648490906 4
Anguilla AIA #N/A #N/A #N/A

A bit of data cleanup before we go further. There are numerous instances of the string #N/A throughout this dataset, but we’ll need to replace those with actual NA values that R will recognize as NULL instead of character strings.

One way to apply this across all columns at once is to define a function that uses str_replace and use mutate(across(everything(), .func)) to execute.

(Note: We could have done this after the following step instead, when we only have one column to deal with, but it’s good practice..)

# replace all the instances of string '#N/A' with actual NA
nona <- function(s) {str_replace(s,'#N/A','NA')}
df <- df %>% mutate(across(everything(), nona))

# convert all available columns to numeric type
df <- type_convert(df)
formatted df
Country/Territory Code 1996Estimate 1996StdErr 1996NumSrc
Aruba ABW NA NA NA
Andorra ADO 1.318143 0.4808887 1
Afghanistan AFG -1.291705 0.3405070 2
Angola AGO -1.167702 0.2620766 4
Anguilla AIA NA NA NA

Now for the pivot!

The pivot_longer function has a parameter names_pattern that accepts regex capturing groups to split column names during the pivot.

In this case we’ll create two capture groups .. one for the four-digit year, and one for the Measure name:

# we can use the names_pattern parameter of pivot_longer to separate Year and Category into separate columns .. pretty useful!
df <- df %>% pivot_longer(
 cols = `1996Estimate`:`2020Upper`,
 names_to = c('Year','Measure'),
 names_pattern = '(.{4})(.*)',
 values_to = 'Value'
)
pivoted df
Country/Territory Code Year Measure Value
Aruba ABW 2007 StdErr 0.4162902
Aruba ABW 2007 NumSrc 1.0000000
Aruba ABW 2007 Rank 88.8349533
Aruba ABW 2007 Lower 72.3300934
Aruba ABW 2007 Upper 93.6893234
Aruba ABW 2008 Estimate 1.2886552

Analysis

We now have a tidy, well-formatted dataset with each row representing a single observation of a given Year, Country and Measure. Let’s demonstrate with a couple of simple graphs:

df_rank_2020 <- df %>% 
  filter(Measure=='Rank', Year==2020) %>% 
  filter(Value >= 90) %>%
  arrange(desc(Value))

df_rank_2020 %>% ggplot(aes(x=reorder(Code, -Value), y=Value)) +
  geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) +
  labs(title='Top Countries', x='Year')

df_rank_annual <- df %>% 
  filter(Measure=='Rank', Code=='AFG') %>% 
  arrange(Year)

df_rank_annual %>% ggplot(aes(x=Year, y=Value)) +
  geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  theme(axis.text.x = element_text(angle = 90)) + 
  labs(title='Afghanistan')


Dataset 2: League of Legends Champions

Video game developer Riot Games publishes League of Legends game data and assets for use by third-party developers. We’ll be looking at the top-level dataset (source) for the main game characters (or “Champions”), which is made available in JSON format.

champions.json

Data Prep

We’ll be using two additional libraries to parse HTML and work efficiently with JSON-formatted data: rvest and tidyjson.

library(tidyjson)
library(rvest)

# get json from webpage with rvest read_html
page <- read_html('https://ddragon.leagueoflegends.com/cdn/11.19.1/data/en_US/champion.json')

# parse the html with rvest, return just the contents of the <body> element
json <- page %>% html_elements("body") %>% html_text()

# check out the structure
json %>% gather_object %>% json_types
## # A tbl_json: 4 x 4 tibble with a "JSON" attribute
##   ..JSON                  document.id name    type  
##   <chr>                         <int> <chr>   <fct> 
## 1 "\"champion\""                    1 type    string
## 2 "\"standAloneComp..."             1 format  string
## 3 "\"11.19.1\""                     1 version string
## 4 "{\"Aatrox\":{\"ver..."           1 data    object

The tricky part here is, if we were to use tidyjson at the root of this JSON document, we wouldn’t get the information we need.

What we actually need is the fourth node (helpfully named “data”), which itself is a series of nested JSON nodes that contain all the character information.

We’ll use the enter_object and gather_object functions to extract the content of this “data” node, spread_all to create a table of all the json elements, and then convert back into a standard R dataframe:

# get only the 'data' node from the json and spread_all into a 
# json table, then convert back to regular dataframe
champions <- json %>% 
  enter_object(data) %>% 
  gather_object %>% 
  spread_all %>% 
  as_data_frame.tbl_json %>%
  select(!blurb) # delete this column for demo .. very long strings
formatted df
document.id name version id key name.2 title partype info.attack info.defense info.magic info.difficulty image.full image.sprite image.group image.x image.y image.w image.h stats.hp stats.hpperlevel stats.mp stats.mpperlevel stats.movespeed stats.armor stats.armorperlevel stats.spellblock stats.spellblockperlevel stats.attackrange stats.hpregen stats.hpregenperlevel stats.mpregen stats.mpregenperlevel stats.crit stats.critperlevel stats.attackdamage stats.attackdamageperlevel stats.attackspeedperlevel stats.attackspeed
1 Aatrox 11.19.1 Aatrox 266 Aatrox the Darkin Blade Blood Well 8 4 3 4 Aatrox.png champion0.png champion 0 0 48 48 580 90 0 0 345 38 3.25 32 1.25 175 3.00 1.00 0.000 0.00 0 0 60 5.00 2.500 0.651
1 Ahri 11.19.1 Ahri 103 Ahri the Nine-Tailed Fox Mana 3 4 8 5 Ahri.png champion0.png champion 48 0 48 48 526 92 418 25 330 21 3.50 30 0.50 550 5.50 0.60 8.000 0.80 0 0 53 3.00 2.000 0.668
1 Akali 11.19.1 Akali 84 Akali the Rogue Assassin Energy 5 3 8 7 Akali.png champion0.png champion 96 0 48 48 500 105 200 0 345 23 3.50 37 1.25 125 9.00 0.90 50.000 0.00 0 0 62 3.30 3.200 0.625
1 Akshan 11.19.1 Akshan 166 Akshan the Rogue Sentinel Mana 0 0 0 0 Akshan.png champion0.png champion 144 0 48 48 560 90 350 40 330 26 3.00 30 0.50 500 3.75 0.65 8.175 0.70 0 0 52 3.50 4.000 0.638
1 Alistar 11.19.1 Alistar 12 Alistar the Minotaur Mana 6 9 5 7 Alistar.png champion0.png champion 192 0 48 48 600 106 350 40 330 44 3.50 32 1.25 125 8.50 0.85 8.500 0.80 0 0 62 3.75 2.125 0.625
1 Amumu 11.19.1 Amumu 32 Amumu the Sad Mummy Mana 2 6 8 3 Amumu.png champion0.png champion 240 0 48 48 615 75 285 40 335 30 3.50 32 1.25 125 9.00 0.85 7.380 0.53 0 0 53 3.80 2.180 0.736

Analysis

tidyjson has expanded all the information from the “data” node into a well-formatted dataframe, and has automatically prepended any sub-node names such as “info” and stats” to make it easy to differentiate.

Let’s examine the results with a couple of simple graphs:

df <- champions %>% 
  select(c('name','stats.hp','stats.mp','stats.movespeed','stats.armor'))

ggplot(df) +
  aes(x = stats.hp) +
  geom_histogram(binwidth = 10, color = "#112446", fill="#ffffff") +
  theme_minimal() + 
  labs(title='Hit Points')

ggplot(df) +
  aes(x = stats.mp) +
  geom_histogram(binwidth=50, color = "#112446", fill="#ffffff") +
  theme_minimal() +
  labs(title='Mana Points')


Dataset 3: Global Infant Mortality Rates

This dataset published by the World Bank (source) tracks the mortality rate of children under 5, measured as deaths per 1,000 live births. The data are broken down by country, region and year from 1960 to 2019.

Data Prep

This dataset includes three separate files, one with the Mortality data by country and year, and two files with additional metadata.

For this exercise we will load the Mortality data file and the Region metadata file, then transform and join the two together for well-formatted and informative dataset.

df_mort <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv', skip=4)
df_meta <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/Metadata_Country_API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv')

# cleanup
df_mort <- df_mort %>% 
  select(!('2020':'...66')) %>% # drop the last 2 columns, no data
  select(!('Indicator Name':'Indicator Code')) %>% # drop these cols, not needed
  rename(Name = 'Country Name', Code = 'Country Code') # renaming

df_meta <- df_meta %>%
  select(!SpecialNotes) %>% # drop column 'SpecialNotes', not needed
  select(!last_col()) # drop last column, no data

One additional step on the regional metadata .. we will use the conditional statement ifelse to impute whether the observation pertains to a Country or a Region, and to impute the value from the TableName column where Region is NA.

(This is due to a very large number of ‘secondary’ Region categories, and the likelihood of wanting to look at Country-only and Region-only data in any further analysis.)

# impute missing Region values
df_meta <- df_meta %>% 
  mutate(`Record Type` = ifelse(is.na(Region),'Region','Country')) %>%
  mutate(Region = ifelse(is.na(Region), TableName, Region)) %>% 
  rename(Code = 'Country Code') %>%
  select(!TableName)

Now, we pivot the Mortality table with pivot_longer so that each row corresponds to a single Country and Year. We then left_join the Region metadata and perform a little cleanup.

# pivot
df_mort <- df_mort %>% 
  pivot_longer(!c('Name','Code'), 
               names_to = 'Year', 
               values_to = 'Mortality Rate')

# join
df <- left_join(df_mort, df_meta, by='Code')

# correct Mortality Rate to actual rate (/1000)
df <- df %>% mutate(`Mortality Rate` = `Mortality Rate`/1000)
final df (sample)
Name Code Year Mortality Rate Region
Africa Eastern and Southern AFE 1969 0.2159640 Africa Eastern and Southern
Africa Eastern and Southern AFE 1970 0.2146804 Africa Eastern and Southern
Africa Eastern and Southern AFE 1971 0.2132826 Africa Eastern and Southern
Africa Eastern and Southern AFE 1972 0.2118596 Africa Eastern and Southern
Africa Eastern and Southern AFE 1973 0.2104980 Africa Eastern and Southern
Africa Eastern and Southern AFE 1974 0.2091949 Africa Eastern and Southern

To aid in analysis, we’ll filter two subtables - one with Country data only, and one with Region data only.

Let’s also take a look at the unique names in each table to compare the “main” geo-based Regions in the Countries table, with a much larger number of “secondary” Regions in the Region table:

# analysis
df_regions <- df %>% filter(`Record Type` == 'Region')
df_countries <- df %>% filter(`Record Type` == 'Country')

# unique regions in the country table 
unique(df_countries$Region)
## [1] "Latin America & Caribbean"  "South Asia"                
## [3] "Sub-Saharan Africa"         "Europe & Central Asia"     
## [5] "Middle East & North Africa" "East Asia & Pacific"       
## [7] "North America"
# unique regions in the regions table
unique(df_regions$Name)
##  [1] "Africa Eastern and Southern"                         
##  [2] "Africa Western and Central"                          
##  [3] "Arab World"                                          
##  [4] "Central Europe and the Baltics"                      
##  [5] "Caribbean small states"                              
##  [6] "East Asia & Pacific (excluding high income)"         
##  [7] "Early-demographic dividend"                          
##  [8] "East Asia & Pacific"                                 
##  [9] "Europe & Central Asia (excluding high income)"       
## [10] "Europe & Central Asia"                               
## [11] "Euro area"                                           
## [12] "European Union"                                      
## [13] "Fragile and conflict affected situations"            
## [14] "High income"                                         
## [15] "Heavily indebted poor countries (HIPC)"              
## [16] "IBRD only"                                           
## [17] "IDA & IBRD total"                                    
## [18] "IDA total"                                           
## [19] "IDA blend"                                           
## [20] "IDA only"                                            
## [21] "Latin America & Caribbean (excluding high income)"   
## [22] "Latin America & Caribbean"                           
## [23] "Least developed countries: UN classification"        
## [24] "Low income"                                          
## [25] "Lower middle income"                                 
## [26] "Low & middle income"                                 
## [27] "Late-demographic dividend"                           
## [28] "Middle East & North Africa"                          
## [29] "Middle income"                                       
## [30] "Middle East & North Africa (excluding high income)"  
## [31] "North America"                                       
## [32] "OECD members"                                        
## [33] "Other small states"                                  
## [34] "Pre-demographic dividend"                            
## [35] "Pacific island small states"                         
## [36] "Post-demographic dividend"                           
## [37] "South Asia"                                          
## [38] "Sub-Saharan Africa (excluding high income)"          
## [39] "Sub-Saharan Africa"                                  
## [40] "Small states"                                        
## [41] "East Asia & Pacific (IDA & IBRD countries)"          
## [42] "Europe & Central Asia (IDA & IBRD countries)"        
## [43] "Latin America & the Caribbean (IDA & IBRD countries)"
## [44] "Middle East & North Africa (IDA & IBRD countries)"   
## [45] "South Asia (IDA & IBRD)"                             
## [46] "Sub-Saharan Africa (IDA & IBRD countries)"           
## [47] "Upper middle income"                                 
## [48] "World"

Analysis

We should have a good foundation now for further analysis.

Let’s demonstrate with a line graph of Mortality Rates by country, from the Latin America and Caribbean region. Right away, we can see that while child mortality has been steadily dropping, there are a few alarming outliers where the trend has reversed in recent years, or where catastrophic events have had a significant impact (such as the 2010 Haiti earthquake, seen in the top line below.)

df_countries_latam <- df_countries %>%
  filter(Region == 'Latin America & Caribbean')

ggplot(df_countries_latam) +
  aes(x = Year, y = `Mortality Rate`, group = Name, label=Name) +
  geom_line() +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90))