Stylin’ your markdowns

Styling a Markdown document

Check out Yan Holtz’s “Pimp my RMD: a few tips for R Markdown” for some good examples.

Picking themes

Themes style your R Markdown output. They are HTML and CSS styles that let you present your data in different ways. To use one, install the theme package the way you install any other, then set options in the YAML. I haven’t tried anything other than HTML output; other formats, such as PDF, require additional software on your computer.

Here’s the YAML for this document:

---
title: "Stylin' your markdowns"
date: "2021-11-10"
output:
  rmdformats::readthedown:
    self_contained: true
    code_download: true
    toc_depth: 4
    df_print: paged
    code_folding: hide
---

Theme options

There are a lot of options, coming from theme packages. In general:

  • A regular R Markdown document can use most of the free Bootswatch themes.
  • The package prettydoc re-creates themes that are default on GitHub. They’re very fast to knit, but not very visually appealing.
  • The package rmdformats has several well-designed themes, including this one, but they have fewer options in terms of tables of contents, colors, etc.
  • The package tufte re-creates Edward Tufte’s very distinctive handout and book style - it takes forever to knit, but creates a beautiful PDF book.
  • The package bookdown creates book-style websites, like the one at our R Study Guide; blogdown creates full websites.
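
As a minimal sketch of the install step, the following checks which of the theme packages named above are already on your machine. The names theme_pkgs, is_installed and to_install are only illustrations, and the install line is commented out so nothing gets installed until you choose to run it.

```r
# Theme packages mentioned in the list above
theme_pkgs <- c("rmdformats", "prettydoc", "tufte", "bookdown", "blogdown")

# requireNamespace() returns FALSE (instead of stopping) when a package is absent
is_installed <- vapply(theme_pkgs, requireNamespace, logical(1), quietly = TRUE)

# The packages that still need installing
to_install <- theme_pkgs[!is_installed]
print(to_install)

# Uncomment to install whatever is missing:
# install.packages(to_install)
```

You only need to do this once per machine; after that, the YAML alone picks the theme.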

The setup chunk for this document (shown under “The setup chunk” below) has several options you probably want to change from the defaults. Normally you’d hide all of this from your audience by changing the chunk’s first line, right after the three backticks, back to the default:

      {r setup, include=FALSE}

Different themes have different options for your document. For example, the standard themes have an option called code folding which lets you put a “hide” and “show” button on each code chunk you use.

You might choose to use different themes depending on the audience. Usually, the theme will just ignore any options it doesn’t have. (It sometimes throws an error instead, like when you try toc_float: true on an rmdformats theme.)
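
For comparison, here is a rough sketch of the same kind of options on a standard theme, written as an R call instead of YAML. It assumes the rmarkdown package is installed. The theme flatly is one of the Bootswatch themes bundled with it, and my_report.Rmd is a made-up file name, so the render line is left commented out.

```r
library(rmarkdown)

# A standard html_document format: Bootswatch theme, folded code,
# and a floating table of contents (an option the standard themes support)
standard_format <- html_document(
  theme = "flatly",
  code_folding = "hide",
  toc = TRUE,
  toc_float = TRUE,
  df_print = "paged"
)

# Knit a document with these settings (hypothetical file name):
# render("my_report.Rmd", output_format = standard_format)
```

Each argument corresponds to a key you could put under html_document: in the YAML instead.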

After installation

Once you’ve installed a theme, it will be available in your New File menu as a template.

There is one thing that’s really confusing in this: some of these templates require you to create a folder to hold all of the related files. However, if you keep the “self-contained” selection in the YAML, it’s not required. I usually have to move these files back to my default directory, or the paths don’t work.


The setup chunk

When you use a theme, there will be a default set of setup options. For good tips on how to deal with markdown chunks, the YAML and errors, look at Piping Hot Data’s gifs on “Chunk options” and “YAML errors”.

Here’s how I usually change mine:

# You need these first to set up some of these options. 
library(knitr)
library(rmdformats)

## Global options
options(max.print = 75)
opts_chunk$set(echo=TRUE,      # TRUE shows the code chunks in the output; FALSE hides them
               cache=FALSE,    # generally change to FALSE - you don't need cache for simple documents
               prompt=FALSE,
               tidy=FALSE,     # change to FALSE - it generates an annoying error otherwise
               comment=NA,
               message=FALSE,
               warning=TRUE)   # consider changing to TRUE to see things you might not notice otherwise



opts_knit$set(width=75)


# You might set some other defaults here for packages you usually use. 
# 
#
# And add in packages you almost always use: 
# 
library(tidyverse)
library(lubridate)
library(janitor)
library(forcats) # working with factors
library(scales) # turning numbers into something readable

# here are the table packages we're going to use
library(reactable)
library(gt)
library(formattable)
library (DT)

Styling tables

Now that you’re set up, I’ll show you some things you can do with your output of tables that will make it easier to work with. First, I want to make a data frame that is good for this kind of thing: It should be less than about 1,000 rows, and should have a mix of types of data and groups.

Pretty tables

There are two types of display tables:

  • Interactive tables, which allow sorting, searching and even changing data. These are good for sharing with teammates and exploring your data more.

  • Nice looking static tables, which are good for formal results. They look more like a printed government report.

Why formatted numbers matter

As you’ve found, one of the biggest annoyances about R is the inability to “format” numbers in a default way. In Excel, once you told the program that you wanted to see dollar signs and commas, it always showed them to you that way. In R, we have to do this either by formatting a table or by converting numbers to text (since “$” and “,” aren’t numbers!).

For static tables, it doesn’t matter whether numbers stay as numbers – once they’re in the sort order you want, they can be presented as text fields. But if you want people to be able to play around with the result, you need something that formats numbers into something readable but keeps the underlying information with the proper type. It’s the difference between what you see and what the computer sees that matters.
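
A small sketch of why that difference matters, using base R's format() rather than any of the table packages (the values are made up). Once the numbers become text, sorting follows the characters, not the amounts.

```r
amounts <- c(1500, 20000, 350)

# Converting to text makes the values readable but changes their type
as_text <- paste0("$", format(amounts, big.mark = ",", trim = TRUE))
class(as_text)   # "character"

# Numeric sort: smallest amount first
sort(amounts)    # 350 1500 20000

# Text sort compares character by character, so "$20,000" lands before "$350"
sort(as_text, method = "radix")   # "$1,500" "$20,000" "$350"
```

This is why interactive tables should get the raw numbers plus a display format, so that sorting still works on the real values.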


Table packages

A full examination of the different table packages is on the R for the Rest of Us blog, “How to Make Beautiful Tables in R”. It’s a little out of date, but it gives a good overview of the options.

While I’m at it, I’ve created a little data frame with my understanding of which features each package has, and then styled them using the reactable library.

library(reactable)

table_packages <- tribble (
  ~"pkg", ~"fmt_nums", ~"interactive", ~"nested", ~"sparklines", ~"positioning",
  "DT (datatables)", TRUE, TRUE, FALSE, FALSE, FALSE,
  "formattable", TRUE, FALSE, TRUE, TRUE,  FALSE,
  "gt", TRUE, FALSE, TRUE, TRUE,  TRUE,
  "kableExtra", FALSE, FALSE, TRUE, TRUE, TRUE,
  "reactable", TRUE, TRUE, TRUE, TRUE, TRUE
)

(Press the “Code” button here if you want to see how words were replaced with symbols)

# Show a check mark for TRUE and an X for FALSE
check_mark <- function (value) {
  if (value) "\u2713" else "\u2718"
}

reactable ( table_packages,
            fullWidth = FALSE, compact=TRUE, width=500,
            defaultColDef = colDef (align="center", maxWidth=75),
            theme = reactableTheme ( color="gray",
                                     style=list (fontFamily = "Work Sans, sans-serif", fontSize="70%")),
            columns = list (
                 pkg = colDef (minWidth=100, align="left", name = "Package name"),
                 fmt_nums = colDef (name = "Formatted #s?", cell = check_mark),
                 interactive = colDef (name = "Interactive?", cell = check_mark),
                 nested = colDef (name = "Nested tables?", cell = check_mark),
                 sparklines = colDef (name = "Sparklines?", cell = check_mark),
                 positioning = colDef (name = "Positioning / Fixed headers?", cell = check_mark)
            )
)

None of these packages is good with very large tables. The DT package will error out if the table is too big. Unfortunately, the kableExtra package will just keep printing – all of the rows – until it runs out of memory. You sometimes have to restart your machine if this happens, so be careful not to give it too much room to kill itself.
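
One way to protect yourself is to cap the number of rows before a data frame ever reaches a table function. This is only a sketch: cap_rows is a hypothetical helper, not part of any of these packages, and the 1,000-row default simply echoes the rough size suggested earlier.

```r
# Hypothetical helper: keep at most max_rows rows before building a display table
cap_rows <- function(df, max_rows = 1000) {
  if (nrow(df) > max_rows) {
    warning("Showing the first ", max_rows, " of ", nrow(df), " rows")
    df <- head(df, max_rows)
  }
  df
}

# Made-up data that is too long to print in full
big_df <- data.frame(id = 1:5000, value = rnorm(5000))

small_df <- suppressWarnings(cap_rows(big_df))
nrow(small_df)   # 1000
```

You would then pass small_df, not big_df, to whichever table package you are using.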


Ranking and sorting

We’ll use a piece of the PPP data we’ve been working with as an example. In this case, we’ll work on displaying the names of lenders in some reasonable order – how often they show up in Arizona loans.

Summarizing the data

I’ve loaded some PPP loan data from Arizona as an example:

load ( url ("https://github.com/cronkitedata/rstudyguide/blob/master/data/az_ppp_zipcodes.Rda?raw=true"))
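
If you are not sure what an .Rda file put into your environment, note that load() returns the names of the objects it restored. The sketch below demonstrates this with a temporary file and two made-up objects (demo_a and demo_b), so it runs without a network connection.

```r
# Save two made-up objects to a temporary .Rda file
tmp <- tempfile(fileext = ".Rda")
demo_a <- data.frame(x = 1:3)
demo_b <- "hello"
save(demo_a, demo_b, file = tmp)
rm(demo_a, demo_b)

# load() restores the objects under their saved names and
# (invisibly) returns those names as a character vector
restored <- load(tmp)
restored   # "demo_a" "demo_b"
```

The same pattern works with the url() call above: assign the result of load() to a variable and print it.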

Lender ranks

In our final table, we want to smush together all of the small lenders into an “Other” group. But first we have to see what a good cutoff might be. A cumulative distribution, which shows what total percent of all loans were done by the top companies, helps with that. I’ll use the gt package to show you how to make a better looking table at the same time. gt is good for static tables because it’s relatively simple code with powerful nesting and formatting options. But it’s not so great if you want a large table that you can sort and filter.

lender_ranks <- 
  az_ppp_zip %>%
  group_by (lender) %>%
  summarise (lender_num  =  n() ,
             lender_zips = n_distinct (census_zip) , 
             lender_amt = sum (initial_amt)
  ) %>%
  # a new function: min_rank, to provide the lowest number rank to the highest number of loans.
  mutate (lender_rank = min_rank ( desc(lender_num)))
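
To see what min_rank does with ties, here is a tiny example on made-up loan counts (it assumes dplyr is installed). Tied values share the lowest rank available, and the next rank is skipped.

```r
library(dplyr)

# Made-up loan counts for four lenders; two are tied for the most loans
loan_counts <- c(10, 50, 50, 3)

# desc() flips the order so the largest count gets rank 1
ranks <- min_rank(desc(loan_counts))
ranks   # 3 1 1 4
```

Both lenders with 50 loans get rank 1, so no lender gets rank 2.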

gt table

How big is big?

First, I create a dataset that groups the lenders into chunks – top 50, top 100, top 200, and everything smaller than the top 200 based on the number of loans given in Arizona. This uses a few things that you haven’t seen, starting with the function cut, which we’ll use on the rank of the lender. It creates a factor variable that uses numbers internally to sort / arrange but shows you the words you specify in the labels= argument.

lender_sizes <-
  lender_ranks %>%
  # group the lenders into categories by the number of loans they have
  mutate ( lender_group = cut (lender_rank, breaks= c( -Inf, 50, 100, 200, Inf), 
                               labels=c( " Top 50", "51-100", "101-200", "Over 200") )) %>%
  group_by (lender_group) %>%
  # new : the "across" operator does the same thing to a group of variables, in this case summing them.
  summarise  (across ( c( lender_num, lender_amt), sum ), 
              avg_lender_zipcodes = mean( lender_zips)) %>%
  # This is a new function: cumsum, which accumulates the totals as it goes.
  mutate ( cum_loans = cumsum ( lender_amt / sum(lender_amt))) 
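
Here is the same idea on a toy data frame, as a sketch in base R so you can see each step without the PPP data. The ranks and amounts are invented; only the breaks and labels match the code above.

```r
# Made-up lender ranks and dollar amounts
toy_ranks <- data.frame(
  lender_rank = c(1, 50, 51, 100, 150, 201),
  lender_amt  = c(400, 300, 150, 50, 60, 40)
)

# cut() puts each rank into a labeled bin; each bin includes its upper break
toy_ranks$lender_group <- cut(
  toy_ranks$lender_rank,
  breaks = c(-Inf, 50, 100, 200, Inf),
  labels = c("Top 50", "51-100", "101-200", "Over 200")
)
as.integer(toy_ranks$lender_group)   # 1 1 2 2 3 4 (the numbers behind the labels)

# Sum the amounts within each group, then accumulate the shares
group_totals <- aggregate(lender_amt ~ lender_group, data = toy_ranks, FUN = sum)
group_totals$cum_share <- cumsum(group_totals$lender_amt / sum(group_totals$lender_amt))
group_totals
```

Because lender_group is a factor, the groups come out in the order of the labels, not alphabetically, which is what makes the running total meaningful.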

Now we can display it with the gt library, which is relatively easy to understand and works well on small, informative static tables.

# now we get to tell R how to make  a table
lender_sizes %>% 
 gt ( rowname_col = "lender_group") %>%
  tab_header ( title="Percent of loans in Arizona by Lender Size") %>%
  cols_label ( lender_num="# of loans", 
                    lender_amt = "$ loaned \n (in 1000s)", 
                    avg_lender_zipcodes = "Avg. # of zip codes", 
                    cum_loans = "Cum % by group") %>%
  fmt_number (columns = c(2,4) , decimals=0, sep_mark=",") %>%
  fmt_currency (columns = 3, decimals=0, scale_by = 1/1000) %>%
  fmt_percent ( columns=5, decimals=1) %>%
  tab_options ( table.font.size = "80%", 
                table.font.color = "slategray")  

Percent of loans in Arizona by Lender Size

             # of loans   $ loaned (in 1000s)   Avg. # of zip codes   Cum % by group
Top 50           97,889            $8,721,633                   154            81.0%
51-100            8,736            $1,017,653                    70            90.5%
101-200           3,214              $588,440                    18            96.0%
Over 200          2,350              $434,013                     2           100.0%

This suggests that collapsing all of the lenders outside the top 100 into an “Other” group won’t lose us much in the way of comparisons, since the top 100 account for about 90% of the dollars loaned. (It could, however, remove some interesting information for small markets like Native American lands or border towns with small populations.)
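
A minimal sketch of that collapsing step, in base R on invented lenders. The names toy_lenders, rank_cutoff and lender_label are assumptions for illustration; with the real data you would apply the same test to lender_rank in lender_ranks.

```r
# Made-up lenders with their ranks and loan counts
toy_lenders <- data.frame(
  lender      = c("Bank A", "Bank B", "Credit Union C", "Bank D"),
  lender_rank = c(1, 2, 150, 300),
  lender_num  = c(900, 700, 12, 3),
  stringsAsFactors = FALSE
)

rank_cutoff <- 100

# Keep the name for lenders inside the cutoff; everyone else becomes "Other"
toy_lenders$lender_label <- ifelse(
  toy_lenders$lender_rank <= rank_cutoff,
  toy_lenders$lender,
  "Other"
)

# Add up the loans under the new labels
collapsed <- aggregate(lender_num ~ lender_label, data = toy_lenders, FUN = sum)
collapsed
```

The two small lenders end up in a single “Other” row with 15 loans between them.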


reactable

The reactable package seems to be the one that a lot of people are using now instead of DT. It can handle somewhat larger tables than DT (datatables), and has more options for calculating subtotals and for styling. Here’s an example printing out some information on zip codes from Arizona:

sticky_style <- list(position = "sticky", left = 0, background = "#fff", zIndex = 1,
                     borderRight = "1px solid #eee")

az_by_zipcode %>%
  select (zcta, zipcode_city,  zcta_ethnic, tot_pop, usps_businesses,  median_inc_2018) %>%
  reactable (
    searchable=TRUE,
    defaultPageSize = 5,
    columns = list (
      zcta = colDef(name="Zip code", style=sticky_style, headerStyle=sticky_style), 
      zipcode_city = colDef(name="City"), 
      zcta_ethnic = colDef (name="Ethnicity"),
      tot_pop = colDef(name="Population", format=colFormat(separators=TRUE, digits=0)), 
      usps_businesses = colDef(name= "# businesses", format=colFormat( separators=TRUE)),
      median_inc_2018 = colDef(name="Median income", format=colFormat(prefix="$", separators=TRUE))
    )
  )
