Transforming Wide Datasets
Assignment
Three different untidy (or “wide”) datasets were selected in order to practice different methods of data preparation.
Dataset 1: World Governmental Indicators
This dataset from the World Bank (source) reports “aggregate and individual governance indicators for over 200 countries and territories over the period 1996–2020, for six dimensions of governance:”
- Voice and Accountability
- Political Stability and Absence of Violence/Terrorism
- Government Effectiveness
- Regulatory Quality
- Rule of Law
- Control of Corruption
The source file is an Excel workbook (.xlsx) with multiple tabs, and for this exercise we’ll select one tab to focus on - Control of Corruption.
There is good deal of header text on this tab, and a ‘multiindex’ of column data pairing years with six numeric measures (Estimate, StdErr, NumSrc, Rank, Lower and Upper) resulting in 134 total columns.
Tab from wgidataset.xlsx
Data Prep
The ReadXL library had some trouble reading in an .xlsx from a remote URL, so we’ll use Curl to download a local copy first. Then we’ll read in the data from the ControlofCorruption tab, skipping all twelve rows of header information.
library(curl)
library(readxl)
# import excel sheet and skip header rows
#curl_download('http://info.worldbank.org/governance/wgi/Home/downLoadFile?fileName=wgidataset.xlsx',
# 'data/wgi/wgidataset.xlsx') # first run only
df_raw <- read_excel('data/wgi/wgidataset.xlsx',
sheet='ControlofCorruption', skip=12)| …1 | …2 | 1996…3 | 1996…4 |
|---|---|---|---|
| Country/Territory | Code | Estimate | StdErr |
| Aruba | ABW | #N/A | #N/A |
| Andorra | ADO | 1.3181432485580444 | 0.48088869452476501 |
| Afghanistan | AFG | -1.2917047739028931 | 0.34050697088241577 |
| Angola | AGO | -1.1677018404006958 | 0.26207658648490906 |
We’ll want to create a pivot_longer of all this column data, but first we need to handle this multi-index in order to get a single row for each Country, Year and Measure.
Unlike Python’s Pandas, there is not much native support for multi-index dataframes in R. One approach would be to grab the two rows containing Year and Measure, and fuse them into individual column names for a new dataframe that we can then pivot out:
# convert the column names and first row into 2 vectors
df_col_1 <- colnames(df_raw) # colnames() returns a vector
df_col_2 <- as.character(slice(df_raw,1)) # slice() returns a df, so make vector
# get rid of the auto numbering in the column names
df_col_1 <- str_remove(df_col_1, '\\..*$')
# now join each element of the two vectors with paste0 (no separators)
df_cols <- paste0(df_col_1, df_col_2)
# preview
df_cols[1:8]## [1] "Country/Territory" "Code" "1996Estimate"
## [4] "1996StdErr" "1996NumSrc" "1996Rank"
## [7] "1996Lower" "1996Upper"
# make new df with our new column names and the values
df <- df_raw[-1,]
names(df) <- df_cols| Country/Territory | Code | 1996Estimate | 1996StdErr | 1996NumSrc |
|---|---|---|---|---|
| Aruba | ABW | #N/A | #N/A | #N/A |
| Andorra | ADO | 1.3181432485580444 | 0.48088869452476501 | 1 |
| Afghanistan | AFG | -1.2917047739028931 | 0.34050697088241577 | 2 |
| Angola | AGO | -1.1677018404006958 | 0.26207658648490906 | 4 |
| Anguilla | AIA | #N/A | #N/A | #N/A |
A bit of data cleanup before we go further. There are numerous instances of the string #N/A throughout this dataset, but we’ll need to replace those with actual NA values that R will recognize as NULL instead of character strings.
One way to apply this across all columns at once is to define a function that uses str_replace and use mutate(across(everything(), .func)) to execute.
(Note: We could have done this after the following step instead, when we only have one column to deal with, but it’s good practice..)
# replace all the instances of string '#N/A' with actual NA
nona <- function(s) {str_replace(s,'#N/A','NA')}
df <- df %>% mutate(across(everything(), nona))
# convert all available columns to numeric type
df <- type_convert(df)| Country/Territory | Code | 1996Estimate | 1996StdErr | 1996NumSrc |
|---|---|---|---|---|
| Aruba | ABW | NA | NA | NA |
| Andorra | ADO | 1.318143 | 0.4808887 | 1 |
| Afghanistan | AFG | -1.291705 | 0.3405070 | 2 |
| Angola | AGO | -1.167702 | 0.2620766 | 4 |
| Anguilla | AIA | NA | NA | NA |
Now for the pivot!
The pivot_longer function has a parameter names_pattern that accepts regex capturing groups to split column names during the pivot.
In this case we’ll create two capture groups .. one for the four-digit year, and one for the Measure name:
# we can use the names_pattern parameter of pivot_longer to separate Year and Category into separate columns .. pretty useful!
df <- df %>% pivot_longer(
cols = `1996Estimate`:`2020Upper`,
names_to = c('Year','Measure'),
names_pattern = '(.{4})(.*)',
values_to = 'Value'
)| Country/Territory | Code | Year | Measure | Value |
|---|---|---|---|---|
| Aruba | ABW | 2007 | StdErr | 0.4162902 |
| Aruba | ABW | 2007 | NumSrc | 1.0000000 |
| Aruba | ABW | 2007 | Rank | 88.8349533 |
| Aruba | ABW | 2007 | Lower | 72.3300934 |
| Aruba | ABW | 2007 | Upper | 93.6893234 |
| Aruba | ABW | 2008 | Estimate | 1.2886552 |
Analysis
We now have a tidy, well-formatted dataset with each row representing a single observation of a given Year, Country and Measure. Let’s demonstrate with a couple of simple graphs:
df_rank_2020 <- df %>%
filter(Measure=='Rank', Year==2020) %>%
filter(Value >= 90) %>%
arrange(desc(Value))
df_rank_2020 %>% ggplot(aes(x=reorder(Code, -Value), y=Value)) +
geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Top Countries', x='Year')
df_rank_annual <- df %>%
filter(Measure=='Rank', Code=='AFG') %>%
arrange(Year)
df_rank_annual %>% ggplot(aes(x=Year, y=Value)) +
geom_bar(stat='identity', color = "#112446", fill="#ffffff") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90)) +
labs(title='Afghanistan')Dataset 2: League of Legends Champions
Video game developer Riot Games publishes League of Legends game data and assets for use by third-party developers. We’ll be looking at the top-level dataset (source) for the main game characters (or “Champions”), which is made available in JSON format.
champions.json
Data Prep
We’ll be using two additional libraries to parse HTML and work efficiently with JSON-formatted data: rvest and tidyjson.
library(tidyjson)
library(rvest)
# get json from webpage with rvest read_html
page <- read_html('https://ddragon.leagueoflegends.com/cdn/11.19.1/data/en_US/champion.json')
# parse the html with rvest, return just the contents of the <body> element
json <- page %>% html_elements("body") %>% html_text()
# check out the structure
json %>% gather_object %>% json_types## # A tbl_json: 4 x 4 tibble with a "JSON" attribute
## ..JSON document.id name type
## <chr> <int> <chr> <fct>
## 1 "\"champion\"" 1 type string
## 2 "\"standAloneComp..." 1 format string
## 3 "\"11.19.1\"" 1 version string
## 4 "{\"Aatrox\":{\"ver..." 1 data object
The tricky part here is, if we were to use tidyjson at the root of this JSON document, we wouldn’t get the information we need.
What we actually need is the fourth node (helpfully named “data”), which itself is a series of nested JSON nodes that contain all the character information.
We’ll use the enter_object and gather_object functions to extract the content of this “data” node, spread_all to create a table of all the json elements, and then convert back into a standard R dataframe:
# get only the 'data' node from the json and spread_all into a
# json table, then convert back to regular dataframe
champions <- json %>%
enter_object(data) %>%
gather_object %>%
spread_all %>%
as_data_frame.tbl_json %>%
select(!blurb) # delete this column for demo .. very long strings| document.id | name | version | id | key | name.2 | title | partype | info.attack | info.defense | info.magic | info.difficulty | image.full | image.sprite | image.group | image.x | image.y | image.w | image.h | stats.hp | stats.hpperlevel | stats.mp | stats.mpperlevel | stats.movespeed | stats.armor | stats.armorperlevel | stats.spellblock | stats.spellblockperlevel | stats.attackrange | stats.hpregen | stats.hpregenperlevel | stats.mpregen | stats.mpregenperlevel | stats.crit | stats.critperlevel | stats.attackdamage | stats.attackdamageperlevel | stats.attackspeedperlevel | stats.attackspeed |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Aatrox | 11.19.1 | Aatrox | 266 | Aatrox | the Darkin Blade | Blood Well | 8 | 4 | 3 | 4 | Aatrox.png | champion0.png | champion | 0 | 0 | 48 | 48 | 580 | 90 | 0 | 0 | 345 | 38 | 3.25 | 32 | 1.25 | 175 | 3.00 | 1.00 | 0.000 | 0.00 | 0 | 0 | 60 | 5.00 | 2.500 | 0.651 |
| 1 | Ahri | 11.19.1 | Ahri | 103 | Ahri | the Nine-Tailed Fox | Mana | 3 | 4 | 8 | 5 | Ahri.png | champion0.png | champion | 48 | 0 | 48 | 48 | 526 | 92 | 418 | 25 | 330 | 21 | 3.50 | 30 | 0.50 | 550 | 5.50 | 0.60 | 8.000 | 0.80 | 0 | 0 | 53 | 3.00 | 2.000 | 0.668 |
| 1 | Akali | 11.19.1 | Akali | 84 | Akali | the Rogue Assassin | Energy | 5 | 3 | 8 | 7 | Akali.png | champion0.png | champion | 96 | 0 | 48 | 48 | 500 | 105 | 200 | 0 | 345 | 23 | 3.50 | 37 | 1.25 | 125 | 9.00 | 0.90 | 50.000 | 0.00 | 0 | 0 | 62 | 3.30 | 3.200 | 0.625 |
| 1 | Akshan | 11.19.1 | Akshan | 166 | Akshan | the Rogue Sentinel | Mana | 0 | 0 | 0 | 0 | Akshan.png | champion0.png | champion | 144 | 0 | 48 | 48 | 560 | 90 | 350 | 40 | 330 | 26 | 3.00 | 30 | 0.50 | 500 | 3.75 | 0.65 | 8.175 | 0.70 | 0 | 0 | 52 | 3.50 | 4.000 | 0.638 |
| 1 | Alistar | 11.19.1 | Alistar | 12 | Alistar | the Minotaur | Mana | 6 | 9 | 5 | 7 | Alistar.png | champion0.png | champion | 192 | 0 | 48 | 48 | 600 | 106 | 350 | 40 | 330 | 44 | 3.50 | 32 | 1.25 | 125 | 8.50 | 0.85 | 8.500 | 0.80 | 0 | 0 | 62 | 3.75 | 2.125 | 0.625 |
| 1 | Amumu | 11.19.1 | Amumu | 32 | Amumu | the Sad Mummy | Mana | 2 | 6 | 8 | 3 | Amumu.png | champion0.png | champion | 240 | 0 | 48 | 48 | 615 | 75 | 285 | 40 | 335 | 30 | 3.50 | 32 | 1.25 | 125 | 9.00 | 0.85 | 7.380 | 0.53 | 0 | 0 | 53 | 3.80 | 2.180 | 0.736 |
Analysis
tidyjson has expanded all the information from the “data” node into a well-formatted dataframe, and has automatically prepended any sub-node names such as “info” and stats” to make it easy to differentiate.
Let’s examine the results with a couple of simple graphs:
df <- champions %>%
select(c('name','stats.hp','stats.mp','stats.movespeed','stats.armor'))
ggplot(df) +
aes(x = stats.hp) +
geom_histogram(binwidth = 10, color = "#112446", fill="#ffffff") +
theme_minimal() +
labs(title='Hit Points')
ggplot(df) +
aes(x = stats.mp) +
geom_histogram(binwidth=50, color = "#112446", fill="#ffffff") +
theme_minimal() +
labs(title='Mana Points')Dataset 3: Global Infant Mortality Rates
This dataset published by the World Bank (source) tracks the mortality rate of children under 5, measured as deaths per 1,000 live births. The data are broken down by country, region and year from 1960 to 2019.
Data Prep
This dataset includes three separate files, one with the Mortality data by country and year, and two files with additional metadata.
For this exercise we will load the Mortality data file and the Region metadata file, then transform and join the two together for well-formatted and informative dataset.
df_mort <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv', skip=4)
df_meta <- read_csv('https://raw.githubusercontent.com/jefedigital/cuny-data-607-projects/main/wide-datasets/data/worldbank/Metadata_Country_API_SH.DYN.MORT_DS2_en_csv_v2_3012069.csv')
# cleanup
df_mort <- df_mort %>%
select(!('2020':'...66')) %>% # drop the last 2 columns, no data
select(!('Indicator Name':'Indicator Code')) %>% # drop these cols, not needed
rename(Name = 'Country Name', Code = 'Country Code') # renaming
df_meta <- df_meta %>%
select(!SpecialNotes) %>% # drop column 'SpecialNotes', not needed
select(!last_col()) # drop last column, no dataOne additional step on the regional metadata .. we will use the conditional statement ifelse to impute whether the observation pertains to a Country or a Region, and to impute the value from the TableName column where Region is NA.
(This is due to a very large number of ‘secondary’ Region categories, and the likelihood of wanting to look at Country-only and Region-only data in any further analysis.)
# impute missing Region values
df_meta <- df_meta %>%
mutate(`Record Type` = ifelse(is.na(Region),'Region','Country')) %>%
mutate(Region = ifelse(is.na(Region), TableName, Region)) %>%
rename(Code = 'Country Code') %>%
select(!TableName)Now, we pivot the Mortality table with pivot_longer so that each row corresponds to a single Country and Year. We then left_join the Region metadata and perform a little cleanup.
# pivot
df_mort <- df_mort %>%
pivot_longer(!c('Name','Code'),
names_to = 'Year',
values_to = 'Mortality Rate')
# join
df <- left_join(df_mort, df_meta, by='Code')
# correct Mortality Rate to actual rate (/1000)
df <- df %>% mutate(`Mortality Rate` = `Mortality Rate`/1000)| Name | Code | Year | Mortality Rate | Region |
|---|---|---|---|---|
| Africa Eastern and Southern | AFE | 1969 | 0.2159640 | Africa Eastern and Southern |
| Africa Eastern and Southern | AFE | 1970 | 0.2146804 | Africa Eastern and Southern |
| Africa Eastern and Southern | AFE | 1971 | 0.2132826 | Africa Eastern and Southern |
| Africa Eastern and Southern | AFE | 1972 | 0.2118596 | Africa Eastern and Southern |
| Africa Eastern and Southern | AFE | 1973 | 0.2104980 | Africa Eastern and Southern |
| Africa Eastern and Southern | AFE | 1974 | 0.2091949 | Africa Eastern and Southern |
To aid in analysis, we’ll filter two subtables - one with Country data only, and one with Region data only.
Let’s also take a look at the unique names in each table to compare the “main” geo-based Regions in the Countries table, with a much larger number of “secondary” Regions in the Region table:
# analysis
df_regions <- df %>% filter(`Record Type` == 'Region')
df_countries <- df %>% filter(`Record Type` == 'Country')
# unique regions in the country table
unique(df_countries$Region)## [1] "Latin America & Caribbean" "South Asia"
## [3] "Sub-Saharan Africa" "Europe & Central Asia"
## [5] "Middle East & North Africa" "East Asia & Pacific"
## [7] "North America"
# unique regions in the regions table
unique(df_regions$Name)## [1] "Africa Eastern and Southern"
## [2] "Africa Western and Central"
## [3] "Arab World"
## [4] "Central Europe and the Baltics"
## [5] "Caribbean small states"
## [6] "East Asia & Pacific (excluding high income)"
## [7] "Early-demographic dividend"
## [8] "East Asia & Pacific"
## [9] "Europe & Central Asia (excluding high income)"
## [10] "Europe & Central Asia"
## [11] "Euro area"
## [12] "European Union"
## [13] "Fragile and conflict affected situations"
## [14] "High income"
## [15] "Heavily indebted poor countries (HIPC)"
## [16] "IBRD only"
## [17] "IDA & IBRD total"
## [18] "IDA total"
## [19] "IDA blend"
## [20] "IDA only"
## [21] "Latin America & Caribbean (excluding high income)"
## [22] "Latin America & Caribbean"
## [23] "Least developed countries: UN classification"
## [24] "Low income"
## [25] "Lower middle income"
## [26] "Low & middle income"
## [27] "Late-demographic dividend"
## [28] "Middle East & North Africa"
## [29] "Middle income"
## [30] "Middle East & North Africa (excluding high income)"
## [31] "North America"
## [32] "OECD members"
## [33] "Other small states"
## [34] "Pre-demographic dividend"
## [35] "Pacific island small states"
## [36] "Post-demographic dividend"
## [37] "South Asia"
## [38] "Sub-Saharan Africa (excluding high income)"
## [39] "Sub-Saharan Africa"
## [40] "Small states"
## [41] "East Asia & Pacific (IDA & IBRD countries)"
## [42] "Europe & Central Asia (IDA & IBRD countries)"
## [43] "Latin America & the Caribbean (IDA & IBRD countries)"
## [44] "Middle East & North Africa (IDA & IBRD countries)"
## [45] "South Asia (IDA & IBRD)"
## [46] "Sub-Saharan Africa (IDA & IBRD countries)"
## [47] "Upper middle income"
## [48] "World"
Analysis
We should have a good foundation now for further analysis.
Let’s demonstrate with a line graph of Mortality Rates by country, from the Latin America and Caribbean region. Right away, we can see that while child mortality has been steadily dropping, there are a few alarming outliers where the trend has reversed in recent years, or where catastrophic events have had a significant impact (such as the 2010 Haiti earthquake, seen in the top line below.)
df_countries_latam <- df_countries %>%
filter(Region == 'Latin America & Caribbean')
ggplot(df_countries_latam) +
aes(x = Year, y = `Mortality Rate`, group = Name, label=Name) +
geom_line() +
theme_minimal() +
theme(axis.text.x = element_text(angle = 90))