But, why?

Ok, fair question. Well, one day I was building a lesson in Python and I had just about had it with how ugly the default pandas DataFrame output looked in RMarkdown. Here’s an example of a dataframe using the popular Python package, pandas:

##                      a         b         c         d
## first second                                        
## bar   one     0.354431  0.729289  0.326212  0.334261
##       two     0.216039  0.031685 -1.688434  0.118721
## baz   one     1.310318 -0.861194 -0.744905  0.672157
##       two    -1.224141 -0.587877 -1.360120  0.811736
## foo   one     0.510815  2.387444 -0.467747 -0.215802
##       two    -0.580599  0.451983 -1.088561  0.395173
## qux   one    -1.271253  0.119420  1.536242 -0.992219
##       two    -0.938276  0.649429 -0.870941 -0.976980

Looking at this, I thought to myself: we have all these amazing R packages for rendering pretty dataframes in RMarkdown, so why not use them? I know what you’re thinking: “But you can’t do that, since all these great R packages only accept R dataframes!”

Yes, fair point. But, what if I made an R dataframe “look” like a Python dataframe so I can still use these packages?

In this post, I talk about how I was able to use {reticulate} to mimic a Python DataFrame, and the {kableExtra} and {formattable} R packages to give some love to Python DataFrames rendered in RMarkdown documents. In the end, we will end up with something like this (hover over the rows!):

| first | second | a          | b          | c          | d          |
|-------|--------|------------|------------|------------|------------|
| bar   | one    | 0.3544306  | 0.7292887  | 0.3262122  | 0.3342614  |
|       | two    | 0.2160393  | 0.0316852  | -1.6884338 | 0.1187213  |
| baz   | one    | 1.3103177  | -0.8611936 | -0.7449052 | 0.6721570  |
|       | two    | -1.2241410 | -0.5878771 | -1.3601202 | 0.8117365  |
| foo   | one    | 0.5108148  | 2.3874436  | -0.4677472 | -0.2158018 |
|       | two    | -0.5805991 | 0.4519828  | -1.0885607 | 0.3951728  |
| qux   | one    | -1.2712526 | 0.1194200  | 1.5362417  | -0.9922192 |
|       | two    | -0.9382758 | 0.6494289  | -0.8709414 | -0.9769801 |

Keep in mind, the solutions I present in this post are largely dirty hacks, but it is truly amazing how far you can go with these packages, so let’s begin! I’ll likely be brief, highlighting the crucial parts to making this work but if you want a closer look, all the data and images can be found here.

Does it … roll? 👀

For a “good enough” visual representation of a pandas DataFrame, I thought about some differences that need to be handled:

  • rows are 0-indexed rather than 1-indexed
  • a DataFrame can carry an Index (or MultiIndex) of row labels, which changes how the table is displayed
  • missing values show up as NaN instead of NA
  • numerics print with more decimal places by default
  • datetimes print in pandas’ own format

Of course, there are probably plenty of other differences beyond these, but I decided to handle these for now. 😄

I will use the raw weather.csv dataset from the nycflights13 package to illustrate how Python dataframes look different when compared to R.

Here’s the R dataframe:

nycflights13::weather
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## #   pressure <dbl>, visib <dbl>, time_hour <dttm>

Let’s import Python libraries pandas and numpy as well as set up our weather dataframe called df in Python with US/Eastern timezone:

import pandas as pd
import numpy as np
df = pd.read_csv(r.weather_data_path, parse_dates=['time_hour'])
df.time_hour = pd.to_datetime(df.time_hour.dt.tz_convert('US/Eastern'))
df
##       origin  year  month  ...  pressure  visib                 time_hour
## 0        EWR  2013      1  ...    1012.0   10.0 2013-01-01 01:00:00-05:00
## 1        EWR  2013      1  ...    1012.3   10.0 2013-01-01 02:00:00-05:00
## 2        EWR  2013      1  ...    1012.5   10.0 2013-01-01 03:00:00-05:00
## 3        EWR  2013      1  ...    1012.2   10.0 2013-01-01 04:00:00-05:00
## 4        EWR  2013      1  ...    1011.9   10.0 2013-01-01 05:00:00-05:00
## ...      ...   ...    ...  ...       ...    ...                       ...
## 26110    LGA  2013     12  ...    1017.1   10.0 2013-12-30 14:00:00-05:00
## 26111    LGA  2013     12  ...    1018.8   10.0 2013-12-30 15:00:00-05:00
## 26112    LGA  2013     12  ...    1019.5   10.0 2013-12-30 16:00:00-05:00
## 26113    LGA  2013     12  ...    1019.9   10.0 2013-12-30 17:00:00-05:00
## 26114    LGA  2013     12  ...    1020.9   10.0 2013-12-30 18:00:00-05:00
## 
## [26115 rows x 15 columns]

Ok, we can definitely see the 0-indexing difference in Python’s output. We also see that R’s dataframe output is a lot more informative on the column types. How about some more details on the columns?

R:

glimpse(nycflights13::weather)
## Rows: 26,115
## Columns: 15
## $ origin     <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "E…
## $ year       <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ month      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day        <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour       <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18,…
## $ temp       <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39…
## $ dewp       <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28…
## $ humid      <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62…
## $ wind_dir   <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780,…
## $ wind_gust  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20…
## $ precip     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure   <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 10…
## $ visib      <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ time_hour  <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:0…

Python:

df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 26115 entries, 0 to 26114
## Data columns (total 15 columns):
##  #   Column      Non-Null Count  Dtype                     
## ---  ------      --------------  -----                     
##  0   origin      26115 non-null  object                    
##  1   year        26115 non-null  int64                     
##  2   month       26115 non-null  int64                     
##  3   day         26115 non-null  int64                     
##  4   hour        26115 non-null  int64                     
##  5   temp        26114 non-null  float64                   
##  6   dewp        26114 non-null  float64                   
##  7   humid       26114 non-null  float64                   
##  8   wind_dir    25655 non-null  float64                   
##  9   wind_speed  26111 non-null  float64                   
##  10  wind_gust   5337 non-null   float64                   
##  11  precip      26115 non-null  float64                   
##  12  pressure    23386 non-null  float64                   
##  13  visib       26115 non-null  float64                   
##  14  time_hour   26115 non-null  datetime64[ns, US/Eastern]
## dtypes: datetime64[ns, US/Eastern](1), float64(9), int64(4), object(1)
## memory usage: 3.0+ MB

From above, we can see that the core data types are pretty much the same, and I will show how reticulate helps us with those. But we’ll quickly come across an interesting difference in visual output with grouped dataframes. So, before we get to our table-making fun, we need to briefly explain pandas’ Index and MultiIndex.

Enter Index and MultiIndex

In pandas, there is a data structure called an Index, which you can kind of think of as the key column for your table. To me, the Index is one of the most confusing data structures, yet it is a core part of the DataFrame and Series data structures (similar to R’s data.frame and vector, respectively), and it gives us powerful indexing semantics. More on Index here.

This is all pretty vague, so let’s run through an example with our weather dataset. What’s the index of our dataframe df?

df.index
## RangeIndex(start=0, stop=26115, step=1)

It’s a RangeIndex, which is very similar to how we normally index R dataframes using numbers.
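
For comparison, R’s default “index” is just the sequential row numbering you get for free. A quick aside (not part of the original post), shown via base row names:

head(rownames(as.data.frame(nycflights13::weather)))
## [1] "1" "2" "3" "4" "5" "6"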

But now, let’s say we want to index our weather dataframe by time (time_hour) and then by airport (origin):

df.set_index(['time_hour', 'origin'])
##                                   year  month  day  ...  precip  pressure  visib
## time_hour                 origin                    ...                         
## 2013-01-01 01:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.0   10.0
## 2013-01-01 02:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.3   10.0
## 2013-01-01 03:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.5   10.0
## 2013-01-01 04:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.2   10.0
## 2013-01-01 05:00:00-05:00 EWR     2013      1    1  ...     0.0    1011.9   10.0
## ...                                ...    ...  ...  ...     ...       ...    ...
## 2013-12-30 14:00:00-05:00 LGA     2013     12   30  ...     0.0    1017.1   10.0
## 2013-12-30 15:00:00-05:00 LGA     2013     12   30  ...     0.0    1018.8   10.0
## 2013-12-30 16:00:00-05:00 LGA     2013     12   30  ...     0.0    1019.5   10.0
## 2013-12-30 17:00:00-05:00 LGA     2013     12   30  ...     0.0    1019.9   10.0
## 2013-12-30 18:00:00-05:00 LGA     2013     12   30  ...     0.0    1020.9   10.0
## 
## [26115 rows x 13 columns]

Ok, that looks a bit different! The time_hour and origin look like they belong together and sit all the way to the left, and the rest of the actual columns are “raised”. But it’s only when we sort the data by time_hour that we actually see our dataframe has essentially been grouped by these two columns:

df.set_index(['time_hour', 'origin']).sort_values(by='time_hour')
##                                   year  month  day  ...  precip  pressure  visib
## time_hour                 origin                    ...                         
## 2013-01-01 01:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.0   10.0
##                           JFK     2013      1    1  ...     0.0    1012.6   10.0
##                           LGA     2013      1    1  ...     0.0    1011.9   10.0
## 2013-01-01 02:00:00-05:00 LGA     2013      1    1  ...     0.0    1011.5   10.0
##                           EWR     2013      1    1  ...     0.0    1012.3   10.0
## ...                                ...    ...  ...  ...     ...       ...    ...
## 2013-12-30 17:00:00-05:00 LGA     2013     12   30  ...     0.0    1019.9   10.0
##                           EWR     2013     12   30  ...     0.0    1020.5   10.0
## 2013-12-30 18:00:00-05:00 JFK     2013     12   30  ...     0.0    1020.9   10.0
##                           EWR     2013     12   30  ...     0.0    1021.1   10.0
##                           LGA     2013     12   30  ...     0.0    1020.9   10.0
## 
## [26115 rows x 13 columns]

In other words, this is no longer a regular dataframe but one that seems to have multiple Index levels into our table. To explain, let’s assign the new dataframe back to df and print its index:

# set index and return resulting DataFrame
df = df.set_index(['time_hour', 'origin']).sort_values(by='time_hour')
# index is now a MultiIndex
df.index
## MultiIndex([('2013-01-01 01:00:00-05:00', 'EWR'),
##             ('2013-01-01 01:00:00-05:00', 'JFK'),
##             ('2013-01-01 01:00:00-05:00', 'LGA'),
##             ('2013-01-01 02:00:00-05:00', 'LGA'),
##             ('2013-01-01 02:00:00-05:00', 'EWR'),
##             ('2013-01-01 02:00:00-05:00', 'JFK'),
##             ('2013-01-01 03:00:00-05:00', 'JFK'),
##             ('2013-01-01 03:00:00-05:00', 'EWR'),
##             ('2013-01-01 03:00:00-05:00', 'LGA'),
##             ('2013-01-01 04:00:00-05:00', 'JFK'),
##             ...
##             ('2013-12-30 15:00:00-05:00', 'JFK'),
##             ('2013-12-30 16:00:00-05:00', 'JFK'),
##             ('2013-12-30 16:00:00-05:00', 'EWR'),
##             ('2013-12-30 16:00:00-05:00', 'LGA'),
##             ('2013-12-30 17:00:00-05:00', 'JFK'),
##             ('2013-12-30 17:00:00-05:00', 'LGA'),
##             ('2013-12-30 17:00:00-05:00', 'EWR'),
##             ('2013-12-30 18:00:00-05:00', 'JFK'),
##             ('2013-12-30 18:00:00-05:00', 'EWR'),
##             ('2013-12-30 18:00:00-05:00', 'LGA')],
##            names=['time_hour', 'origin'], length=26115)

As you can see, we now have a MultiIndex for df where the names of the 2 levels are time_hour and origin.

The problem: We don’t really have a direct equivalent to Index or MultiIndex in R. We subset R dataframe rows using numerical indices (e.g. df[1:2, ]) or by character via row.names (e.g. mtcars['Volvo 142E', ]). However, the Index and MultiIndex in Python are a lot more complex because they also allow “hierarchical indexing”, meaning we can have many levels of subsetting rows (and columns).
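
As a quick reminder, those two single-level R subsetting forms look like this (plain base R, shown only for contrast):

# subset rows by numeric position
mtcars[1:2, ]
# subset rows by character row name
mtcars["Volvo 142E", ]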

For example, re-indexing our dataframe df now allows us to perform complex queries using just labels. Here, we index into the rows for January 2013 at JFK using loc:

df.loc[('January, 2013', 'JFK'), ]
##                                   year  month  day  ...  precip  pressure  visib
## time_hour                 origin                    ...                         
## 2013-01-01 01:00:00-05:00 JFK     2013      1    1  ...     0.0    1012.6   10.0
## 2013-01-01 02:00:00-05:00 JFK     2013      1    1  ...     0.0    1012.4   10.0
## 2013-01-01 03:00:00-05:00 JFK     2013      1    1  ...     0.0    1012.7   10.0
## 2013-01-01 04:00:00-05:00 JFK     2013      1    1  ...     0.0    1012.6   10.0
## 2013-01-01 05:00:00-05:00 JFK     2013      1    1  ...     0.0    1012.1   10.0
## ...                                ...    ...  ...  ...     ...       ...    ...
## 2013-01-31 19:00:00-05:00 JFK     2013      1   31  ...     0.0    1006.7   10.0
## 2013-01-31 20:00:00-05:00 JFK     2013      1   31  ...     0.0    1007.8    8.0
## 2013-01-31 21:00:00-05:00 JFK     2013      1   31  ...     0.0    1008.4   10.0
## 2013-01-31 22:00:00-05:00 JFK     2013      1   31  ...     0.0    1008.9   10.0
## 2013-01-31 23:00:00-05:00 JFK     2013      1   31  ...     0.0    1009.4   10.0
## 
## [742 rows x 13 columns]

Neat! Note how there are two levels of Index here, which together form a MultiIndex. Why do we need this? Well, because we now have a hierarchy of Index levels, we can use a single subset operation via loc on the dataframe df to pull the rows that correspond to 'January, 2013' on the time_hour level and then 'JFK' on the origin level. This is quite powerful, and it is the basis of pandas groupby operations.
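
For comparison, here is a rough dplyr analogue of that loc slice (an approximation shown only for contrast, not part of the original workflow):

library(dplyr)

# filter on the two "index" columns instead of slicing a MultiIndex
nycflights13::weather %>%
  filter(origin == "JFK", year == 2013, month == 1)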

Now, let’s see how we can use reticulate, kableExtra, and formattable to make an R data.frame look like this pandas.DataFrame.

Setting the scene with reticulate

The first thing I wanted to do was “Wizard of Oz” the basic data types and their formatting, i.e. make an R dataframe look Pythonic. 🐍

Wizard of Oz tricks

This is where we go a bit crazy by simulating a Python dataframe with an R dataframe using the reticulate and formattable packages. I wrote a utility function because this became a very useful piece of code:

python_df <- function(pydf) {
  # if dataframe has a MultiIndex, reset index to turn them into regular columns
  # the first element is the Python class of the object
  if (identical(class(pydf$index)[[1]], "pandas.core.indexes.multi.MultiIndex")) {
    rdf <- pydf$reset_index()
  } else {
    rdf <- reticulate::py_to_r(pydf)
  }
  # some formatting
  # - leave date as is
  # - reduce decimal places for numerics
  # - missing values -> NaN
  rdf <- rdf %>%
    mutate(across(where(lubridate::is.POSIXct), as.character)) %>%
    mutate(across(where(is.numeric), ~ as.numeric(formattable::digits(.x, 8)))) %>%
    mutate(across(everything(), ~ ifelse(is.na(.x), "NaN", .x)))
  # 0-indexing for row index hehe
  rownames(rdf) <- as.numeric(rownames(rdf)) - 1
  rdf
}

This python_df function (there’s a short usage sketch right after this list):

  1. Extracts information about the Python dataframe
  2. Converts it to an R dataframe
  3. Formats the table for certain data types
  4. 0-indexes the rownames
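
Here’s roughly how it gets called (a sketch; in the post this presumably happens inside the kable_pandas() helper shown further down):

# pass in the pandas object returned by reticulate::py_eval (more on this below)
pydf <- reticulate::py_eval("df.head()")
rdf  <- python_df(pydf)
rdf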

I want to highlight the trickiest bits here, (1) and (2). When I call this utility function, I pass in an “unconverted” Python object via reticulate::py_eval:

reticulate::py_eval("df.head()")
##                                   year  month  day  ...  precip  pressure  visib
## time_hour                 origin                    ...                         
## 2013-01-01 01:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.0   10.0
##                           JFK     2013      1    1  ...     0.0    1012.6   10.0
##                           LGA     2013      1    1  ...     0.0    1011.9   10.0
## 2013-01-01 02:00:00-05:00 LGA     2013      1    1  ...     0.0    1011.5   10.0
##                           EWR     2013      1    1  ...     0.0    1012.3   10.0
## 
## [5 rows x 13 columns]

If we look at the structure of this object with str, notice how we get the Python object back, the result of df.head():

str(reticulate::py_eval("df.head()"))
##                                   year  month  day  ...  precip  pressure  visib
## time_hour                 origin                    ...                         
## 2013-01-01 01:00:00-05:00 EWR     2013      1    1  ...     0.0    1012.0   10.0
##                           JFK     2013      1    1  ...     0.0    1012.6   10.0
##                           LGA     2013      1    1  ...     0.0    1011.9   10.0
## 2013-01-01 02:00:00-05:00 LGA     2013      1    1  ...     0.0    1011.5   10.0
##                           EWR     2013      1    1  ...     0.0    1012.3   10.0
## 
## [5 rows x 13 columns]

This is key because we can keep working with the Python object itself, which is what lets us check whether or not we have a MultiIndex on our hands:

# if dataframe has a MultiIndex, reset index to turn them into regular columns
# the first element is the Python class of the object
if (identical(class(pydf$index)[[1]], "pandas.core.indexes.multi.MultiIndex")) {
  rdf <- pydf$reset_index()
} else {
  rdf <- reticulate::py_to_r(pydf)
}

We check the dataframe’s Index with class(pydf$index):

class(pydf$index)
## [1] "pandas.core.indexes.multi.MultiIndex" "pandas.core.indexes.base.Index"       "pandas.core.base.IndexOpsMixin"      
## [4] "pandas.core.base.PandasObject"        "pandas.core.accessor.DirNamesMixin"   "python.builtin.object"

The list here reflects the chain of base classes that MultiIndex inherits from, ending with the common ancestor class object (end of list). The first item is the most specific class of our Python dataframe’s index, which tells us we have a MultiIndex.
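
Since class() returns this whole chain, an equivalent (and arguably tidier) way to write the check is with inherits() — a small convenience sketch, not from the original post:

# TRUE if the pandas object's index is a MultiIndex
is_multiindex <- function(pydf) {
  inherits(pydf$index, "pandas.core.indexes.multi.MultiIndex")
}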

We could try to convert the Python DataFrame with a MultiIndex directly with reticulate::py_to_r, but then we lose those special time_hour and origin columns. (If any reticulate expert knows why, let me know!)

To account for this edge case, we instead call pydf$reset_index(), which does a two-step dance of:

  • turning the previous MultiIndex columns into regular columns, and
  • returning the result converted to an R data.frame, where time_hour and origin are back to being regular columns

pydf$reset_index()
##             time_hour origin year month day hour  temp  dewp humid wind_dir
## 0 2013-01-01 01:00:00    EWR 2013     1   1    1 39.02 26.06 59.37      270
## 1 2013-01-01 01:00:00    JFK 2013     1   1    1 39.02 26.06 59.37      260
## 2 2013-01-01 01:00:00    LGA 2013     1   1    1 39.92 26.06 57.33      260
## 3 2013-01-01 02:00:00    LGA 2013     1   1    2 41.00 26.06 54.97      260
## 4 2013-01-01 02:00:00    EWR 2013     1   1    2 39.02 26.96 61.63      250
##   wind_speed wind_gust precip pressure visib
## 0   10.35702       NaN      0   1012.0    10
## 1   12.65858       NaN      0   1012.6    10
## 2   13.80936   23.0156      0   1011.9    10
## 3   17.26170  25.31716      0   1011.5    10
## 4    8.05546       NaN      0   1012.3    10

The time_hour and origin are just regular columns now. If the dataframe does not have a MultiIndex, then we just convert it directly using reticulate::py_to_r(pydf).

Formatting

The next few tricks are somewhat self-explanatory, but they lean on the very handy dplyr::mutate + dplyr::across combination:

# - leave date as is
# - reduce decimal places for numerics
# - missing values -> NaN
rdf <- rdf %>%
  mutate(across(where(lubridate::is.POSIXct), as.character)) %>%
  mutate(across(where(is.numeric), ~ as.numeric(formattable::digits(.x, 8)))) %>%
  mutate(across(everything(), ~ ifelse(is.na(.x), "NaN", .x)))

We format the R dataframe by (a toy demonstration follows this list):

  • Leaving Python’s date format as is by converting datetimes with as.character
  • Formatting numerics to 8 decimal places, like the default pandas look
  • Converting all missing values to NaN
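
To see those three steps in isolation, here’s the same pipeline run on a toy one-row data.frame (the toy data is made up purely for illustration):

library(dplyr)

toy <- data.frame(
  time_hour  = as.POSIXct("2013-01-01 01:00:00", tz = "US/Eastern"),
  wind_speed = 10.35702,
  wind_gust  = NA_real_
)

toy %>%
  mutate(across(where(lubridate::is.POSIXct), as.character)) %>%                    # datetime -> character
  mutate(across(where(is.numeric), ~ as.numeric(formattable::digits(.x, 8)))) %>%   # fix decimal places
  mutate(across(everything(), ~ ifelse(is.na(.x), "NaN", .x)))                      # NA -> "NaN"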

Phew! Now, we are ready to throw that df into kableExtra!

Pretty printing the Python table with kableExtra

Now that we have our R pretending-to-be-a-Python-dataframe ready, let’s talk about how kableExtra helps us with pretty printing.

Bolding MultiIndex columns

I won’t go over the setup part in much detail, but I want to highlight the “trick” to making it look like the Python dataframe: the regular columns appear “raised” while the MultiIndex columns sit at a normal height. In another utility function I wrote, I have this code:

# base kbl
setup_kbl <- rdf %>%
  # this retains color for the row Index columns
  dplyr::rename_with(
    function(x) kableExtra::cell_spec(x, "html", color = "black"),
    dplyr::any_of(row_index_cols)
  ) %>%
  kableExtra::kbl(align = "l", escape = F, row.names = show_rownames)

Essentially, row_index_cols are the MultiIndex columns, so this builds the base kableExtra::kbl() while keeping those column labels black. I also added a show_rownames boolean flag to display the row.names in case we just have a regular DataFrame, like this one:

kable_pandas(
  df = reticulate::py_eval("df_basic"),
  show_rownames = TRUE
)

|   | a          | b          | c          | d          |
|---|------------|------------|------------|------------|
| 0 | -0.4908300 | -0.8255482 | -2.0548656 | 1.2268369  |
| 1 | 0.3119496  | 1.0971422  | -0.0945923 | -3.0433091 |
| 2 | 1.0828095  | -2.3672907 | -0.1230130 | 1.6701767  |
| 3 | 2.4927998  | 0.0082783  | 0.5557527  | -0.7990277 |
| 4 | 0.5229924  | 0.1447302  | 0.5393493  | -0.8739172 |
| 5 | 0.1818034  | -0.1608687 | 0.3773270  | -0.8089840 |
| 6 | -2.6097903 | -0.7996391 | 0.4129209  | 0.6234479  |
| 7 | -0.4755228 | -1.1611904 | -0.2758276 | 0.0222106  |

In this more basic form, we can see the Python 0-indexing!

Raise the regular columns

For my next trick, I “raised” the regular columns by coloring the original header row white and adding a new header row above it, like so:

final_kbl <-
  setup_kbl %>%
  kableExtra::row_spec(0, color = "white") %>%
  kableExtra::add_header_above(
    c(" " = idx_column_space, column_space),
    align = "c",
    bold = T,
    line = F
  ) %>%
  kableExtra::collapse_rows(columns = 1, valign = "top")

Here, idx_column_space is the number of columns to take up with " ", because I want empty space above the MultiIndex columns. As for column_space, it’s a named vector of key-value pairs giving each column name and the number of header cells that column should take up.
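
For the weather example, those inputs could be built with something like this sketch (the variable names follow this post, but the exact helper in my original code may differ; rdf is assumed to be the converted data.frame from python_df()):

row_index_cols   <- c("time_hour", "origin")                  # the MultiIndex levels
regular_cols     <- setdiff(names(rdf), row_index_cols)       # the "raised" columns
idx_column_space <- length(row_index_cols)                    # blank header space above the index columns
column_space     <- setNames(rep(1, length(regular_cols)), regular_cols)  # each regular column spans one header cell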

Display multirow cells

Remember how we did the pydf$reset_index() from before for our Python DataFrame with a MultiIndex? Well, that produced repeating values in our time_hour column. So, one last trick to display this properly is to use kableExtra::collapse_rows(columns = 1, valign = "top") to white out repetitions of the first column and align each value to stick to the top of its cell.
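
Here’s a minimal standalone demo of the collapse_rows() idea on a toy table (the toy data is made up for illustration):

library(kableExtra)

data.frame(
  time_hour = rep(c("2013-01-01 01:00:00", "2013-01-01 02:00:00"), each = 3),
  origin    = c("EWR", "JFK", "LGA", "LGA", "EWR", "JFK"),
  temp      = c(39.02, 39.02, 39.92, 41.00, 39.02, 39.02)
) %>%
  kbl() %>%
  collapse_rows(columns = 1, valign = "top")

Behold, the final output (with rows chopped for brevity):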

kable_pandas(
  df = reticulate::py_eval("df")
)

| time_hour           | origin | year | month | day | hour | temp  | dewp  | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib |
|---------------------|--------|------|-------|-----|------|-------|-------|-------|----------|------------|-----------|--------|----------|-------|
| 2013-01-01 01:00:00 | EWR    | 2013 | 1     | 1   | 1    | 39.02 | 26.06 | 59.37 | 270      | 10.35702   | NaN       | 0      | 1012.0   | 10    |
|                     | JFK    | 2013 | 1     | 1   | 1    | 39.02 | 26.06 | 59.37 | 260      | 12.65858   | NaN       | 0      | 1012.6   | 10    |
|                     | LGA    | 2013 | 1     | 1   | 1    | 39.92 | 26.06 | 57.33 | 260      | 13.80936   | 23.0156   | 0      | 1011.9   | 10    |
| 2013-01-01 02:00:00 | LGA    | 2013 | 1     | 1   | 2    | 41.00 | 26.06 | 54.97 | 260      | 17.26170   | 25.31716  | 0      | 1011.5   | 10    |
|                     | EWR    | 2013 | 1     | 1   | 2    | 39.02 | 26.96 | 61.63 | 250      | 8.05546    | NaN       | 0      | 1012.3   | 10    |
|                     | JFK    | 2013 | 1     | 1   | 2    | 39.02 | 26.06 | 59.37 | 270      | 11.50780   | NaN       | 0      | 1012.4   | 10    |
| 2013-01-01 03:00:00 | JFK    | 2013 | 1     | 1   | 3    | 39.92 | 26.96 | 59.50 | 260      | 14.96014   | NaN       | 0      | 1012.7   | 10    |
|                     | EWR    | 2013 | 1     | 1   | 3    | 39.02 | 28.04 | 64.43 | 240      | 11.50780   | NaN       | 0      | 1012.5   | 10    |
|                     | LGA    | 2013 | 1     | 1   | 3    | 41.00 | 26.06 | 54.97 | 260      | 16.11092   | 24.16638  | 0      | 1012.0   | 10    |
| 2013-01-01 04:00:00 | JFK    | 2013 | 1     | 1   | 4    | 39.92 | 28.04 | 62.21 | 250      | 17.26170   | NaN       | 0      | 1012.6   | 10    |
|                     | LGA    | 2013 | 1     | 1   | 4    | 41.00 | 26.06 | 54.97 | 260      | 17.26170   | 25.31716  | 0      | 1011.8   | 10    |
|                     | EWR    | 2013 | 1     | 1   | 4    | 39.92 | 28.04 | 62.21 | 250      | 12.65858   | NaN       | 0      | 1012.2   | 10    |

Try hovering over the rows to see an extra visual aid that we’re looking at a MultiIndex! I don’t know about you but that looks a lot better than the default output to me. 😏

Bonuses! 🎁

If you use kableExtra::kable_paper() you can actually get a pretty decent-looking render in the RStudio IDE!!! 😮
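
For example, something like this (a sketch, assuming the final_kbl object built above):

final_kbl %>%
  kableExtra::kable_paper()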

Closing thoughts

I hope you have enjoyed this brief tutorial on how to pretty print Python dataframes with RMarkdown. If you have any questions or suggestions, or spot any mistakes, please let me know! It should be clear that we are just scratching the surface of making Python dataframes prettier in the RMarkdown universe.

Using kableExtra was neat because it had just the right features and flexibility to pull this off! In particular, kableExtra::add_header_above, which adds arbitrary levels of header rows, is nice for displaying MultiIndex levels (for rows or columns), and kableExtra::collapse_rows is useful for multirow cells with repeating values in a given column. The only caveat with this package is that it doesn’t seem to perform well with thousands of rows, but I believe this is a common bottleneck for most table-making packages out there.

One could go even further with this crazy exercise by adding column type information to the headers, just like {DT} does, for example. Or maybe even make an interactive version with {reactable}? 😉 Another promising package is {gt}, which offers the stub area for things like a MultiIndex.

References