1 Introduction

This project is our entry for the 2020 RStudio Table Contest, which was announced on September 15th and closes on October 31st.

Our data science team (JITeam) at the Jonglei Institute of Technology (the first South Sudanese online educational platform aspiring to train the next generation of South Sudanese data scientists and data analysts for free) is thrilled to participate in this exciting contest. JITeam comprises three members: Alier Reng, Head of the Data Science Program and President of the Jonglei Institute of Technology; Luka Chol Awan, TA and student; and Nazrul Islam, student.

Our team elected to showcase the gt package’s elegant features using South Sudan’s 2008 Census Dataset obtained here. Instead of creating just a table, we opted to create tutorials to help other data science enthusiasts, aspiring data scientists, and data analysts learn to implement the gt package in their data science projects.

South Sudan is the world’s youngest country; it gained its independence from Sudan in 2011. According to Wikipedia, South Sudan has a population of 10.98 million; however, the 2008 census dataset we’re using for this contest records a population of 8.26 million.

Below is the map of South Sudan.

South Sudan Map (Credits: Wikipedia)

2 Details About the Contest

We obtained the information below about the 2020 RStudio Table Contest here.

2.0.1 Contest Judging Criteria

Tables will be judged based on technical merit, artistic design, and quality of documentation. We recognize that some tables may excel in only one category and others in more than one or all categories. Honorable mentions will be awarded with this in mind.

We are working with maintainers of many of the R community’s most popular R packages for building tables, including Yihui Xie of DT, Rich Iannone of gt, Greg Lin of reactable, David Gohel of flextable, David Hugh-Jones of huxtable, and Hao Zhu of kableExtra. Many of these maintainers will help review submissions built with their packages.

2.0.2 Requirements

A submission must include all code and data used to replicate your entry. This may be a fully knitted R Markdown document with code (for example published to RPubs or shinyapps.io), a repository, or rstudio.cloud project.

A submission can use any table-making package available in R, not just the ones mentioned above.

Submission Types - We are looking for three types of table submissions:

  1. Single Table Example: This may highlight interesting structuring of content, useful and tricky features – for example, enabling interaction – or serve as an example of a common table popular in a specific field. Be sure to document your code for clarity.

  2. Tutorials: It’s all about teaching us how to craft an excellent table or understand a package’s features. This may include several tables and narrative.

  3. Other: For submissions that do not easily fit into one of the types above.

Category - Given that tables have different features and purposes, we’d also like you to further categorize the submission table. There are four categories, static-HTML, interactive-HTML, static-print, and interactive-Shiny. Simply choose the one that best fits your table.

You can submit your entry for the contest by filling the form at rstd.io/table-contest-2020. The form will generate a post on RStudio Community, which you can then edit further at a later date. You may make multiple entries.

The deadline for submissions is October 31st, 2020, at midnight Pacific Time.

3 Tabulating 2008 South Sudan Census Dataset With the gt Package

3.1 Loading the packages

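Here we load only two packages: tidyverse and gt. (The import step below also calls vroom through the vroom:: prefix, so that package must be installed as well.)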

library(tidyverse)
library(gt)

3.2 Importing the data

In this section, we’re using the vroom package; however, we could also have used the readr package.

# Import the data
ss_2008_census_data_raw <- vroom::vroom("00_Data/ss_2008_census_data_raw.csv")

# View the first 5 rows
slice_head(ss_2008_census_data_raw, n = 5)
## # A tibble: 5 x 10
##   Region `Region Name` `Region - Regio… Variable `Variable Name` Age  
##   <chr>  <chr>         <chr>            <chr>    <chr>           <chr>
## 1 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To… KN.C1
## 2 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To… KN.C2
## 3 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To… KN.C3
## 4 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To… KN.C4
## 5 KN.A2  Upper Nile    SS-NU            KN.B2    Population, To… KN.C5
## # … with 4 more variables: `Age Name` <chr>, Scale <chr>, Units <chr>,
## #   `2008` <dbl>
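
For readers who prefer readr, the import could be sketched roughly as follows. This is only an illustration: the temporary file and its two rows are hypothetical stand-ins (they are not taken from the census file) so that the snippet runs on its own. In the tutorial, the real path "00_Data/ss_2008_census_data_raw.csv" would be passed instead.

```r
library(readr)

# Hypothetical stand-in file so the sketch runs without the census CSV
tmp_csv <- tempfile(fileext = ".csv")
writeLines(
  c("Region Name,Age Name,2008",
    "Toy State,0 to 4,100",
    "Toy State,5 to 9,120"),
  tmp_csv
)

# read_csv() guesses column types, much like vroom() does;
# col_types = cols() keeps the guessing but silences the message
census_raw_readr <- read_csv(tmp_csv, col_types = cols())

# Non-syntactic names such as `2008` are kept and need backticks later on
census_raw_readr
```

Either reader returns a tibble, so the wrangling steps that follow should not depend on which one is used.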

Below, we see that the last three rows consist almost entirely of NAs (one empty row followed by two rows of source and download notes). These rows do not add any value to our analyses, so we’ll delete them in the following section.

# View the last 10 rows
slice_tail(ss_2008_census_data_raw, n = 10)
## # A tibble: 10 x 10
##    Region `Region Name` `Region - Regio… Variable `Variable Name` Age  
##    <chr>  <chr>         <chr>            <chr>    <chr>           <chr>
##  1 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C9
##  2 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  3 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  4 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  5 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  6 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  7 KN.A11 Eastern Equa… SS-EE            KN.B8    Population, Fe… KN.C…
##  8 <NA>   <NA>          <NA>             <NA>     <NA>            <NA> 
##  9 Sourc… National Bur… <NA>             <NA>     <NA>            <NA> 
## 10 Downl… http://south… <NA>             <NA>     <NA>            <NA> 
## # … with 4 more variables: `Age Name` <chr>, Scale <chr>, Units <chr>,
## #   `2008` <dbl>
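
One way to isolate such footer rows before deleting them is to filter on a column that every real record fills in. The sketch below uses a small toy tibble (made-up values, with "note 1" and "note 2" standing in for the truncated footer text) and assumes that the Variable column is never missing for genuine census rows. The tutorial itself drops these rows later through the Gender column.

```r
library(dplyr)

# Toy stand-in for the raw data: two data rows, one empty row, two note rows
toy_raw <- tibble(
  Region   = c("KN.A2", "KN.A11", NA, "note 1", "note 2"),
  Variable = c("KN.B2", "KN.B8", NA, NA, NA),
  `2008`   = c(100, 200, NA, NA, NA)
)

# Rows without a Variable code are footer rows rather than census records
footer_rows <- toy_raw %>% filter(is.na(Variable))
toy_clean   <- toy_raw %>% filter(!is.na(Variable))

nrow(footer_rows)
nrow(toy_clean)
```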

3.3 Wrangling the data

Now that we’ve imported our dataset and have inspected the first and the last few rows, we will wrangle the data to make it tidy.

# Subset the data
ss_2008_census_data_tbl <- ss_2008_census_data_raw %>% 
  
  # Select only the desired columns
  select(State          = `Region Name`, 
         Category       = `Variable Name`,
         `Age Category` = `Age Name`, 
         population     = `2008`) %>% 
  
  # Split the Category column
  separate(Category,
           into = c("Pop.", "Gender", "Other"),
           sep  = " ") %>% 
  
  # Delete Pop. and Other columns
  select(-Pop., -Other) %>% 
  
  # Drop the NA rows (via the Gender column) and the "Total" rows for gender and age
  filter(!is.na(Gender),
         Gender         != "Total",
         `Age Category` != "Total") %>% 
  
  # Manually collapsing factor levels with fct_collapse()
  mutate(
    `Age Category` = fct_collapse(`Age Category`,
      `0-19`       = c("0 to 4", "5 to 9", "10 to 14", "15 to 19"),
      `20-34`      = c( "20 to 24", "25 to 29", "30 to 34"),
      `35-49`      = c("35 to 39", "40 to 44", "45 to 49"),
      `50-64`      = c( "50 to 54", "55 to 59", "60 to 64"),
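      # Note: the printed output below labels this group "65+", which suggests the raw
      # label may not be "65 +"; check levels() if this line appears to have no effect
      `>= 65`      = "65 +")) %>% 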
  
  # Group by state, gender, and age category, then summarize
  # (.groups = "drop" already returns an ungrouped tibble)
  group_by(State, Gender, `Age Category`) %>% 
  summarize(Population = sum(population),
            .groups    = "drop") %>% 
  
  # Add the region column
  mutate(Region = case_when(
    State %in% c("Central Equatoria", "Eastern Equatoria", "Western Equatoria")          ~ "Equatoria",
    State %in% c("Warrap", "Western Bahr el Ghazal", "Northern Bahr el Ghazal", "Lakes") ~ "Bahr el Ghazal",
    TRUE ~ "Upper Nile"),
    
    # Place this column before the State column
         .before = "State")
  
# View the first 15 rows
ss_2008_census_data_tbl %>% slice_head(n = 15)
## # A tibble: 15 x 5
##    Region    State             Gender `Age Category` Population
##    <chr>     <chr>             <chr>  <fct>               <dbl>
##  1 Equatoria Central Equatoria Female 0-19               283092
##  2 Equatoria Central Equatoria Female 20-34              139942
##  3 Equatoria Central Equatoria Female 35-49               66745
##  4 Equatoria Central Equatoria Female 50-64               23460
##  5 Equatoria Central Equatoria Female 65+                  8596
##  6 Equatoria Central Equatoria Male   0-19               308935
##  7 Equatoria Central Equatoria Male   20-34              153332
##  8 Equatoria Central Equatoria Male   35-49               79238
##  9 Equatoria Central Equatoria Male   50-64               28808
## 10 Equatoria Central Equatoria Male   65+                 11409
## 11 Equatoria Eastern Equatoria Female 0-19               243642
## 12 Equatoria Eastern Equatoria Female 20-34              111079
## 13 Equatoria Eastern Equatoria Female 35-49               57120
## 14 Equatoria Eastern Equatoria Female 50-64               20496
## 15 Equatoria Eastern Equatoria Female 65+                  8637
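
The least familiar step in the pipeline above is probably fct_collapse(), so here is a minimal, self-contained sketch of what it does. The factor below is a toy vector that reuses some of the five-year labels from the wrangling code; it is not the census data.

```r
library(forcats)

# Toy factor with one observation per five-year age band
age_raw <- factor(c("0 to 4", "5 to 9", "10 to 14", "15 to 19",
                    "20 to 24", "25 to 29", "30 to 34"))

# Each new level (left) absorbs the old levels listed on the right
age_grouped <- fct_collapse(
  age_raw,
  `0-19`  = c("0 to 4", "5 to 9", "10 to 14", "15 to 19"),
  `20-34` = c("20 to 24", "25 to 29", "30 to 34")
)

levels(age_grouped)
table(age_grouped)
```

Because the collapsed values share a level, a later group_by() and summarize() adds their populations together.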

3.4 Tabulating the Data With the gt Package

In this section, we’ll tabulate the dataset and place the results in two separate tabs using R Markdown’s .tabset option, which saves space by showing the outputs as tabs instead of stacking them down the page.

3.4.1 Population by State

In this section, we’ll tabulate only the states’ total populations (in persons).

# Subset the dataset to extract the state totals
state_pop_gt <- ss_2008_census_data_tbl %>% 
  
  # Group by region and state columns; summarize
  group_by(Region, State) %>% 
  summarize(Population = sum(Population),
            .groups = "drop") %>% 
  
  # Arrange the data in descending order by population
  arrange(desc(Population)) %>% 
  
  # Exclude the region
  select(-Region) %>% 
  
  # Initialize a gt table
  gt() %>% 
  
  # Add a spanner above the Population column
  tab_spanner(
    label   = "State Population in Descending Order",
    columns = 2) %>% 
  
  # Add a title and a subtitle
  tab_header(
    title    = "South Sudan 2008 Population by State",
    subtitle = "Jonglei State has the largest population") %>% 
  
  # Add a grand total row
  grand_summary_rows(
    columns   = 2,
    fns       = list(Total = ~sum(.)),
    formatter = fmt_number
  ) %>% 
  
  # Add the background styling - highlight the greatest state population with a green color
  tab_style(
    style             = list(
      cell_fill(color = "#4caf50"),
      cell_text(color = "white")
    ),
    locations         = cells_body(
      columns         = vars(Population),
      rows            = Population == max(Population))) %>% 
  
  # Add the background styling - with ten states the median falls between the two middle values,
  # so highlight both of them (Eastern Equatoria and Northern Bahr el Ghazal) in orange
  tab_style(
    style             = list(
      cell_fill(color = "#ff8c00"),
      cell_text(color = "white")
    ),
    locations         = cells_body(
      columns         = vars(Population),
      rows            = Population %in% c(720898, 906161))) %>% 
  
    # Add the background styling - highlight the minimum state population with a red color
  tab_style(
    style             = list(
      cell_fill(color = "#DC6140"),
      cell_text(color = "white")
    ),
    locations         = cells_body(
      columns         = vars(Population),
      rows            = Population == min(Population))) %>% 
  
  # Apply a gray background to the header
  tab_options(
    heading.background.color = "gray"
  ) %>% 
  
  # Add a footnote and the source information
  tab_footnote(
    footnote  = "gt Tutorials by JITeam, Jonglei Institute of Technology (www.jongleiinstitute.com)",
    locations = cells_column_labels(
      columns = 2)
  ) %>% 
  
  tab_source_note(
    source_note = "Data source: South Sudan Data Portal"
  )
  
# Display the table
state_pop_gt  

South Sudan 2008 Population by State
Jonglei State has the largest population

State                     State Population in Descending Order
                          Population¹
Jonglei                   1358602
Central Equatoria         1103557
Warrap                    972928
Upper Nile                964353
Eastern Equatoria         906161
Northern Bahr el Ghazal   720898
Lakes                     695730
Western Equatoria         619029
Unity                     585801
Western Bahr el Ghazal    333431
Total                     8,260,490.00

Data source: South Sudan Data Portal

¹ gt Tutorials by JITeam, Jonglei Institute of Technology (www.jongleiinstitute.com)
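
The two orange rows are hard-coded in the table code, so it may help to see how they could be derived instead. The base R sketch below retypes the ten state totals from the table above and picks out the middle values. The object names (state_pop, middle_pop) are ours and do not appear in the tutorial code.

```r
# State totals copied from the table above
state_pop <- c(
  "Jonglei" = 1358602, "Central Equatoria" = 1103557, "Warrap" = 972928,
  "Upper Nile" = 964353, "Eastern Equatoria" = 906161,
  "Northern Bahr el Ghazal" = 720898, "Lakes" = 695730,
  "Western Equatoria" = 619029, "Unity" = 585801,
  "Western Bahr el Ghazal" = 333431
)

sorted_pop <- sort(state_pop)
n <- length(sorted_pop)

# With an even count the median lies between the two middle values
middle_idx <- if (n %% 2 == 0) c(n / 2, n / 2 + 1) else (n + 1) / 2
middle_pop <- sorted_pop[middle_idx]

middle_pop          # the two states highlighted in orange
median(state_pop)   # the midpoint between them
sum(state_pop)      # should agree with the Total row
```

Under this approach, a condition along the lines of `Population %in% middle_pop` could stand in for the literal values, which would keep the highlighting correct if the data changed.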

3.4.2 Population by State and Gender

In this section, we’ll tabulate South Sudan’s 2008 population by state and gender. Further, we’ll use the colors of the flag of South Sudan (red, black, green, blue, and orange in place of yellow) to illustrate how to apply different background colors to a gt table.

# Subset the dataset
ss_2008_census_gt_1 <- ss_2008_census_data_tbl %>% 
  
  # Pivot the data
  pivot_wider(
      names_from  = `Age Category`,
      values_from = Population) %>% 
  
  # Arrange by region, then by the 0-19 population in descending order
  arrange(Region, desc(`0-19`)) %>% 
  
  # Exclude the region from the table
  select(-Region) %>%
  
  # Initialize a gt table
  gt() %>% 
  
  # Add a title and a subtitle
  tab_header(
    title    = "South Sudan 2008 Population by Gender and State",
    subtitle = "Population by Gender and Age Groups") %>% 
  
  # Create row groups by region (row numbers follow the sort order above)
  # Bahr el Ghazal Region
  tab_row_group(
    group = "Bahr el Ghazal",
    rows  = 1:8) %>% 
  
  # Equatoria Region
  tab_row_group(
    group = "Equatoria",
    rows  = 9:14) %>% 
  
  # Upper Nile Region
  tab_row_group(
    group = "Upper Nile",
    rows  = 15:20) %>% 
  
   # Add the spanners
  tab_spanner(
     label   = "Population by Gender & Age Category",
     columns = 2:7) %>% 
  
  tab_spanner(
     label   = "States by Former Regions",
     columns = 1) %>% 
  
  # Add the row grand summaries
  grand_summary_rows(
     columns   = 3:7,
     fns       = list(Totals = ~ sum(.)),
     formatter = fmt_number) %>% 
  
  # Style the table
  tab_options(heading.background.color       = "#ff8c00",
              column_labels.background.color = "gray") %>% 
  
  # The table has seven columns (State, Gender, and five age groups),
  # so the age-group columns are 3 to 7

  # Upper Nile Region (black)
  tab_style(
    style             = list(
      cell_fill(color = "black"),
      cell_text(color = "white")),
    locations         = cells_body(
      columns         = 3:7,
      rows            = 15:20)) %>% 
  
  # Equatoria Region (red)
  tab_style(
    style             = list(
      cell_fill(color = "#DC6140"),
      cell_text(color = "white")),
    locations         = cells_body(
      columns         = 3:7,
      rows            = 9:14)) %>% 
  
  # Bahr el Ghazal Region (green)
  tab_style(
    style             = list(
      cell_fill(color = "#4caf50"),
      cell_text(color = "white")),
    locations         = cells_body(
      columns         = 3:7,
      rows            = 1:8)) %>% 
  
  # Gender column (blue)
  tab_style(
    style             = list(
      cell_fill(color = "#5077E0"),
      cell_text(color = "white")),
    locations         = cells_body(
      columns         = 2,
      rows            = 1:20)) %>% 
  
  # Add the footnote and source information
   tab_footnote(
     footnote  = "`gt` Tutorials by JITeam, The Jonglei Institute of Technology (www.jongleiinstitute.com)",
     locations = cells_column_labels(
       columns = 2:7)) %>% 
  
  tab_source_note(
    source_note = "Data source: South Sudan Data Portal"
  )
  
# Display the table
ss_2008_census_gt_1

South Sudan 2008 Population by Gender and State
Population by Gender and Age Groups

States by Former Regions  Population by Gender & Age Category
State                     Gender¹  0-19¹         20-34¹        35-49¹        50-64¹        65+¹

Upper Nile
Jonglei                   Male     419182        157319        90925         44243         22658
Jonglei                   Female   329048        164193        87198         31452         12384
Upper Nile                Male     294848        113552        70681         30603         15746
Upper Nile                Female   237435        108924        60058         22362         10144
Unity                     Male     179616        62313         34091         15228         8999
Unity                     Female   163798        66837         33267         13851         7801

Equatoria
Central Equatoria         Male     308935        153332        79238         28808         11409
Central Equatoria         Female   283092        139942        66745         23460         8596
Eastern Equatoria         Male     274404        99862         55139         23254         12528
Eastern Equatoria         Female   243642        111079        57120         20496         8637
Western Equatoria         Male     162324        77197         47857         19524         11541
Western Equatoria         Female   148059        83592         45314         16252         7369

Bahr el Ghazal
Warrap                    Male     275805        94888         63010         24686         12345
Warrap                    Female   273397        127170        66936         24066         10625
Northern Bahr el Ghazal   Male     204291        63709         45635         21132         13523
Northern Bahr el Ghazal   Female   200375        89179         48861         21608         12585
Lakes                     Male     198581        87219         49536         20444         10100
Lakes                     Female   176918        86832         42932         16772         6396
Western Bahr el Ghazal    Male     92265         45326         26307         8971          4171
Western Bahr el Ghazal    Female   83151         41467         20767         7479          3527

Totals                             4,549,166.00  1,973,932.00  1,091,617.00  434,691.00    211,084.00

Data source: South Sudan Data Portal

¹ `gt` Tutorials by JITeam, The Jonglei Institute of Technology (www.jongleiinstitute.com)
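
The reshaping behind this table is done by pivot_wider(), which turns each age group into its own column. The sketch below shows that step on four Central Equatoria rows copied from the wrangled output earlier in the tutorial, and then cross-checks the Totals row against the grand total of the first table. The object names (long_tbl, wide_tbl, age_totals) are ours, chosen for illustration.

```r
library(tidyr)

# Four rows in long format, copied from the wrangled output shown earlier
long_tbl <- data.frame(
  State          = "Central Equatoria",
  Gender         = c("Female", "Female", "Male", "Male"),
  `Age Category` = c("0-19", "20-34", "0-19", "20-34"),
  Population     = c(283092, 139942, 308935, 153332),
  check.names    = FALSE
)

# One row per State/Gender pair, one column per age group
wide_tbl <- pivot_wider(
  long_tbl,
  names_from  = `Age Category`,
  values_from = Population
)

wide_tbl

# Cross-check: the age-group totals in the Totals row add up to the
# grand total reported in the state table
age_totals <- c(4549166, 1973932, 1091617, 434691, 211084)
sum(age_totals)
```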

4 Closing Remarks

In this article, we’ve demonstrated how to wrangle data with the tidyverse (dplyr, tidyr, and forcats), and we have shown step by step how to tabulate the data with the gt package. The 2020 RStudio Table Contest asks entrants to choose any R table package and highlight its prominent features, so our team decided to write a tutorial with two tables so that others may benefit from our project.

5 Acknowledgements

We thank RStudio for allowing us to showcase our R skills in the form of the gt tutorials. We hope that our work will benefit other aspiring data scientists, data analysts, data enthusiasts, and everyone else who wants to learn R programming, particularly the gt package.

By the same token, we thank both DataCamp and Business Science University, for without their amazing courses and tutorials, we would not have been able to complete this project.

Lastly, I would like to thank both Luka Awan and Nazrul Islam for teaming up with me on this project to represent the Jonglei Institute of Technology. This is our first competition, and we hope to participate in more in the future.

Thank you once again, RStudio, for the opportunity. We hope that your users and learners will find this work beneficial.

Kind regards,

Alier Reng, Head of the Data Science Program & President

Luka Chol D’Awan, TA & Student

Nazrul Islam, Student