reticulate and kableExtraOk, fair question. Well, one day I was building a lesson in Python and I just had it with how ugly the default pandas DataFrame output looked on RMarkdown. Here’s an example of a dataframe using the popular Python package, pandas:
## a b c d
## first second
## bar one 0.354431 0.729289 0.326212 0.334261
## two 0.216039 0.031685 -1.688434 0.118721
## baz one 1.310318 -0.861194 -0.744905 0.672157
## two -1.224141 -0.587877 -1.360120 0.811736
## foo one 0.510815 2.387444 -0.467747 -0.215802
## two -0.580599 0.451983 -1.088561 0.395173
## qux one -1.271253 0.119420 1.536242 -0.992219
## two -0.938276 0.649429 -0.870941 -0.976980
Looking at this, I thought to myself we have all these amazing R packages for RMarkdown to render pretty dataframes so why not use them? I know what you’re thinking: “But, you can’t do that since all these great R packages only accept R dataframes!”
Yes, fair point. But, what if I made an R dataframe “look” like a Python dataframe so I can still use these packages?
In this post, I talk about how I was able to use {reticulate} to mimic a Python DataFrame, and {kableExtra} together with {formattable} R packages to give some love to Python DataFrames rendering in RMarkdown documents. In the end, we will end up with something like this (hover over the rows!):
| first | second | a | b | c | d |
|---|---|---|---|---|---|
| bar | one | 0.3544306 | 0.7292887 | 0.3262122 | 0.3342614 |
| two | 0.2160393 | 0.0316852 | -1.6884338 | 0.1187213 | |
| baz | one | 1.3103177 | -0.8611936 | -0.7449052 | 0.6721570 |
| two | -1.2241410 | -0.5878771 | -1.3601202 | 0.8117365 | |
| foo | one | 0.5108148 | 2.3874436 | -0.4677472 | -0.2158018 |
| two | -0.5805991 | 0.4519828 | -1.0885607 | 0.3951728 | |
| qux | one | -1.2712526 | 0.1194200 | 1.5362417 | -0.9922192 |
| two | -0.9382758 | 0.6494289 | -0.8709414 | -0.9769801 |
Keep in mind, the solutions I present in this post are largely dirty hacks, but it is truly amazing how far you can go with these packages, so let’s begin! I’ll likely be brief, highlighting the crucial parts to making this work but if you want a closer look, all the data and images can be found here.
For a “good enough” visual representation of a Python DataFrame in pandas, I thought about some differences that need to be handled:
MultiIndex data structureOf course, there are probably plenty of other differences beyond these, but I decided to handle these for now. 😄
I will use the raw weather.csv dataset from the nycflights13 package to illustrate how Python dataframes look different when compared to R.
Here’s the R dataframe:
nycflights13::weather
## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # … with 26,105 more rows, and 5 more variables: wind_gust <dbl>, precip <dbl>,
## # pressure <dbl>, visib <dbl>, time_hour <dttm>
Let’s import Python libraries pandas and numpy as well as set up our weather dataframe called df in Python with US/Eastern timezone:
import pandas as pd
import numpy as np
df = pd.read_csv(r.weather_data_path, parse_dates=['time_hour'])
df.time_hour = pd.to_datetime(df.time_hour.dt.tz_convert('US/Eastern'))
df
## origin year month ... pressure visib time_hour
## 0 EWR 2013 1 ... 1012.0 10.0 2013-01-01 01:00:00-05:00
## 1 EWR 2013 1 ... 1012.3 10.0 2013-01-01 02:00:00-05:00
## 2 EWR 2013 1 ... 1012.5 10.0 2013-01-01 03:00:00-05:00
## 3 EWR 2013 1 ... 1012.2 10.0 2013-01-01 04:00:00-05:00
## 4 EWR 2013 1 ... 1011.9 10.0 2013-01-01 05:00:00-05:00
## ... ... ... ... ... ... ... ...
## 26110 LGA 2013 12 ... 1017.1 10.0 2013-12-30 14:00:00-05:00
## 26111 LGA 2013 12 ... 1018.8 10.0 2013-12-30 15:00:00-05:00
## 26112 LGA 2013 12 ... 1019.5 10.0 2013-12-30 16:00:00-05:00
## 26113 LGA 2013 12 ... 1019.9 10.0 2013-12-30 17:00:00-05:00
## 26114 LGA 2013 12 ... 1020.9 10.0 2013-12-30 18:00:00-05:00
##
## [26115 rows x 15 columns]
Ok, we can definitely see the 0-indexing difference in Python’s output. We also see that R’s dataframe output is a lot more informative on the column types. How about some more details on the columns?
R:
glimpse(nycflights13::weather)
## Rows: 26,115
## Columns: 15
## $ origin <chr> "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "EWR", "E…
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1…
## $ hour <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 14, 15, 16, 17, 18,…
## $ temp <dbl> 39.02, 39.02, 39.02, 39.92, 39.02, 37.94, 39.02, 39.92, 39…
## $ dewp <dbl> 26.06, 26.96, 28.04, 28.04, 28.04, 28.04, 28.04, 28.04, 28…
## $ humid <dbl> 59.37, 61.63, 64.43, 62.21, 64.43, 67.21, 64.43, 62.21, 62…
## $ wind_dir <dbl> 270, 250, 240, 250, 260, 240, 240, 250, 260, 260, 260, 330…
## $ wind_speed <dbl> 10.35702, 8.05546, 11.50780, 12.65858, 12.65858, 11.50780,…
## $ wind_gust <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 20…
## $ precip <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ pressure <dbl> 1012.0, 1012.3, 1012.5, 1012.2, 1011.9, 1012.4, 1012.2, 10…
## $ visib <dbl> 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10…
## $ time_hour <dttm> 2013-01-01 01:00:00, 2013-01-01 02:00:00, 2013-01-01 03:0…
Python:
df.info()
## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 26115 entries, 0 to 26114
## Data columns (total 15 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 origin 26115 non-null object
## 1 year 26115 non-null int64
## 2 month 26115 non-null int64
## 3 day 26115 non-null int64
## 4 hour 26115 non-null int64
## 5 temp 26114 non-null float64
## 6 dewp 26114 non-null float64
## 7 humid 26114 non-null float64
## 8 wind_dir 25655 non-null float64
## 9 wind_speed 26111 non-null float64
## 10 wind_gust 5337 non-null float64
## 11 precip 26115 non-null float64
## 12 pressure 23386 non-null float64
## 13 visib 26115 non-null float64
## 14 time_hour 26115 non-null datetime64[ns, US/Eastern]
## dtypes: datetime64[ns, US/Eastern](1), float64(9), int64(4), object(1)
## memory usage: 3.0+ MB
From above, we can see that the core data types are pretty much the same. I will show how reticulate will help us with those. But, we’ll quickly come across an interesting difference in visual output with grouped dataframes. So, before we get to our table making fun, we need to briefly explain Python’s Index and MultiIndex.
Index and MultiIndexIn pandas, there is a data structure called an Index, which you can kind of think of as the key column for your table. To me, the Index is one of the most confusing data structures, yet it is a core part of the DataFrame and Series data structures (similar to R data.frame and vector, respectively), and it allows us powerful indexing semantics. More on Index here.
This is all pretty vague, so let’s run through an example with our weather dataset. What’s the index of our dataframe df?
df.index
## RangeIndex(start=0, stop=26115, step=1)
It’s a RangeIndex, which is very similar to how we normally index R dataframes using numbers.
But now, let’s say we want to reshape our weather dataframe by time (time_hour) and then by airport (origin):
df.set_index(['time_hour', 'origin'])
## year month day ... precip pressure visib
## time_hour origin ...
## 2013-01-01 01:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.0 10.0
## 2013-01-01 02:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.3 10.0
## 2013-01-01 03:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.5 10.0
## 2013-01-01 04:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.2 10.0
## 2013-01-01 05:00:00-05:00 EWR 2013 1 1 ... 0.0 1011.9 10.0
## ... ... ... ... ... ... ... ...
## 2013-12-30 14:00:00-05:00 LGA 2013 12 30 ... 0.0 1017.1 10.0
## 2013-12-30 15:00:00-05:00 LGA 2013 12 30 ... 0.0 1018.8 10.0
## 2013-12-30 16:00:00-05:00 LGA 2013 12 30 ... 0.0 1019.5 10.0
## 2013-12-30 17:00:00-05:00 LGA 2013 12 30 ... 0.0 1019.9 10.0
## 2013-12-30 18:00:00-05:00 LGA 2013 12 30 ... 0.0 1020.9 10.0
##
## [26115 rows x 13 columns]
Ok, that looks a bit different! The time_hour and origin look like they belong together and are all the way to the left, and the rest of the actual columns are “raised”. But, it’s only when we sort the data by time_hour, that we actually see that our dataframe has been essentially grouped by these two columns:
df.set_index(['time_hour', 'origin']).sort_values(by='time_hour')
## year month day ... precip pressure visib
## time_hour origin ...
## 2013-01-01 01:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.0 10.0
## JFK 2013 1 1 ... 0.0 1012.6 10.0
## LGA 2013 1 1 ... 0.0 1011.9 10.0
## 2013-01-01 02:00:00-05:00 LGA 2013 1 1 ... 0.0 1011.5 10.0
## EWR 2013 1 1 ... 0.0 1012.3 10.0
## ... ... ... ... ... ... ... ...
## 2013-12-30 17:00:00-05:00 LGA 2013 12 30 ... 0.0 1019.9 10.0
## EWR 2013 12 30 ... 0.0 1020.5 10.0
## 2013-12-30 18:00:00-05:00 JFK 2013 12 30 ... 0.0 1020.9 10.0
## EWR 2013 12 30 ... 0.0 1021.1 10.0
## LGA 2013 12 30 ... 0.0 1020.9 10.0
##
## [26115 rows x 13 columns]
In other words, this is no longer a regular dataframe but one that seems to have multiple Index levels into our table. To explain, let’s set the new dataframe to df and print the index:
# set index and return resulting DataFrame
df = df.set_index(['time_hour', 'origin']).sort_values(by='time_hour')
# index is now a MultiIndex
df.index
## MultiIndex([('2013-01-01 01:00:00-05:00', 'EWR'),
## ('2013-01-01 01:00:00-05:00', 'JFK'),
## ('2013-01-01 01:00:00-05:00', 'LGA'),
## ('2013-01-01 02:00:00-05:00', 'LGA'),
## ('2013-01-01 02:00:00-05:00', 'EWR'),
## ('2013-01-01 02:00:00-05:00', 'JFK'),
## ('2013-01-01 03:00:00-05:00', 'JFK'),
## ('2013-01-01 03:00:00-05:00', 'EWR'),
## ('2013-01-01 03:00:00-05:00', 'LGA'),
## ('2013-01-01 04:00:00-05:00', 'JFK'),
## ...
## ('2013-12-30 15:00:00-05:00', 'JFK'),
## ('2013-12-30 16:00:00-05:00', 'JFK'),
## ('2013-12-30 16:00:00-05:00', 'EWR'),
## ('2013-12-30 16:00:00-05:00', 'LGA'),
## ('2013-12-30 17:00:00-05:00', 'JFK'),
## ('2013-12-30 17:00:00-05:00', 'LGA'),
## ('2013-12-30 17:00:00-05:00', 'EWR'),
## ('2013-12-30 18:00:00-05:00', 'JFK'),
## ('2013-12-30 18:00:00-05:00', 'EWR'),
## ('2013-12-30 18:00:00-05:00', 'LGA')],
## names=['time_hour', 'origin'], length=26115)
As you can see, we now have a MultiIndex for df where the names of the 2 levels are time_hour and origin.
The problem: We don’t really have a direct equivalent to Index or MultiIndex in R. We subset R dataframe rows using numerical indices (e.g. df[1:2]) or by character via row.names (e.g. mtcars['Volvo 142E', ]). However, the Index and MultiIndex in Python is a lot more complex because it also allows “heirarchical indexing”, meaning we can have many levels of subsetting rows (and columns).
For example, reshaping our dataframe df now allows us to perform complex queries using just labels. Here, we index into rows for January for JFK using loc:
df.loc[('January, 2013', 'JFK'), ]
## year month day ... precip pressure visib
## time_hour origin ...
## 2013-01-01 01:00:00-05:00 JFK 2013 1 1 ... 0.0 1012.6 10.0
## 2013-01-01 02:00:00-05:00 JFK 2013 1 1 ... 0.0 1012.4 10.0
## 2013-01-01 03:00:00-05:00 JFK 2013 1 1 ... 0.0 1012.7 10.0
## 2013-01-01 04:00:00-05:00 JFK 2013 1 1 ... 0.0 1012.6 10.0
## 2013-01-01 05:00:00-05:00 JFK 2013 1 1 ... 0.0 1012.1 10.0
## ... ... ... ... ... ... ... ...
## 2013-01-31 19:00:00-05:00 JFK 2013 1 31 ... 0.0 1006.7 10.0
## 2013-01-31 20:00:00-05:00 JFK 2013 1 31 ... 0.0 1007.8 8.0
## 2013-01-31 21:00:00-05:00 JFK 2013 1 31 ... 0.0 1008.4 10.0
## 2013-01-31 22:00:00-05:00 JFK 2013 1 31 ... 0.0 1008.9 10.0
## 2013-01-31 23:00:00-05:00 JFK 2013 1 31 ... 0.0 1009.4 10.0
##
## [742 rows x 13 columns]
Neat! Note how there are two levels of Index here, which forms a MultiIndex. Why do we need this? Well, because we now have a heirarchy of Index levels, we can use a subset operation via loc on a dataframe df to subset rows which corresponds to 'January, 2013' on the time_hour level, then the 'JFK' on the origin level. This is quite powerful, and is the basis of pandas.group_by operations.
Now, let’s see how we can use reticulate, kableExtra, and formattable to make an R data.frame look like this pandas.DataFrame.
The first thing I wanted to do was “Wizard of Oz” the basic data types and their formatting, i.e. make an R dataframe look Pythonic. 🐍
This is where we go a bit crazy by simulating a Python dataframe with an R dataframe using the reticulate and formattable packages. I wrote a utility function because this became a very useful piece of code:
python_df <- function(pydf) {
# if dataframe has a MultiIndex, reset index to turn them into regular columns
# the first element is the Python class of the object
if (identical(class(pydf$index)[[1]], "pandas.core.indexes.multi.MultiIndex")) {
rdf <- pydf$reset_index()
} else {
rdf <- reticulate::py_to_r(pydf)
}
# some formatting
# - leave date as is
# - reduce decimal places for numerics
# - missing values -> NaN
rdf <- rdf %>%
mutate(across(where(lubridate::is.POSIXct), as.character)) %>%
mutate(across(where(is.numeric), ~ as.numeric(formattable::digits(.x, 8)))) %>%
mutate(across(everything(), ~ ifelse(is.na(.x), "NaN", .x)))
# 0-indexing for row index hehe
rownames(rdf) <- as.numeric(rownames(rdf)) - 1
rdf
}
This python_df function:
I want highlight the trickiest bits here with (1) and (2). When I call this utility function I pass in an “unconverted” Python object via reticulate::py_eval:
reticulate::py_eval("df.head()")
## year month day ... precip pressure visib
## time_hour origin ...
## 2013-01-01 01:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.0 10.0
## JFK 2013 1 1 ... 0.0 1012.6 10.0
## LGA 2013 1 1 ... 0.0 1011.9 10.0
## 2013-01-01 02:00:00-05:00 LGA 2013 1 1 ... 0.0 1011.5 10.0
## EWR 2013 1 1 ... 0.0 1012.3 10.0
##
## [5 rows x 13 columns]
If we look at the structure of this object with str, notice how we get the Python object back, the result of df.head():
str(reticulate::py_eval("df.head()"))
## year month day ... precip pressure visib
## time_hour origin ...
## 2013-01-01 01:00:00-05:00 EWR 2013 1 1 ... 0.0 1012.0 10.0
## JFK 2013 1 1 ... 0.0 1012.6 10.0
## LGA 2013 1 1 ... 0.0 1011.9 10.0
## 2013-01-01 02:00:00-05:00 LGA 2013 1 1 ... 0.0 1011.5 10.0
## EWR 2013 1 1 ... 0.0 1012.3 10.0
##
## [5 rows x 13 columns]
This is key because we can keep using this Python object that helps us achieve the check of whether or not we have a MultiIndex on our hands:
# if dataframe has a MultiIndex, reset index to turn them into regular columns
# the first element is the Python class of the object
if (identical(class(pydf$index)[[1]], "pandas.core.indexes.multi.MultiIndex")) {
rdf <- pydf$reset_index()
} else {
rdf <- reticulate::py_to_r(pydf)
}
We check the dataframe’s Index with class(pydf$index):
class(pydf$index)
1] "pandas.core.indexes.multi.MultiIndex" "pandas.core.indexes.base.Index" "pandas.core.base.IndexOpsMixin"
[4] "pandas.core.base.PandasObject" "pandas.core.accessor.DirNamesMixin" "python.builtin.object"
The list here is typically the order of base classes that was inherited by the MultiIndex, with a common ancestor class of object (end of list). The first item is our Python dataframe’s Index class which indicates we have a MultiIndex.
We could try to convert the Python DataFrame with a MultiIndex directly with a reticulate::py_to_r but, we lose those special time_hour and origin columns. (If any reticulate expert knows why, let me know!)
To account for this edge case, we instead execute pydf$reset_index() that does a 2-step dance of:
MultiIndex columns into regular columns, andtime_hour and origin are back to being regularspydf$reset_index()
## time_hour origin year month day hour temp dewp humid wind_dir
## 0 2013-01-01 01:00:00 EWR 2013 1 1 1 39.02 26.06 59.37 270
## 1 2013-01-01 01:00:00 JFK 2013 1 1 1 39.02 26.06 59.37 260
## 2 2013-01-01 01:00:00 LGA 2013 1 1 1 39.92 26.06 57.33 260
## 3 2013-01-01 02:00:00 LGA 2013 1 1 2 41.00 26.06 54.97 260
## 4 2013-01-01 02:00:00 EWR 2013 1 1 2 39.02 26.96 61.63 250
## wind_speed wind_gust precip pressure visib
## 0 10.35702 NaN 0 1012.0 10
## 1 12.65858 NaN 0 1012.6 10
## 2 13.80936 23.0156 0 1011.9 10
## 3 17.26170 25.31716 0 1011.5 10
## 4 8.05546 NaN 0 1012.3 10
The time_hour and origin are just regular columns now. If the dataframe does not have a MultiIndex, then we just convert it directly using reticulate::py_to_r(pydf).
The next few tricks are somewhat self-explanatory but using some very handy dplyr::mutate + dplyr::across functions:
# - leave date as is
# - reduce decimal places for numerics
# - missing values -> NaN
rdf <- rdf %>%
mutate(across(where(lubridate::is.POSIXct), as.character)) %>%
mutate(across(where(is.numeric), ~ as.numeric(formattable::digits(.x, 8)))) %>%
mutate(across(everything(), ~ ifelse(is.na(.x), "NaN", .x)))
We format the R dataframe by:
as.characterNaNsPhew! Now, we are ready to throw that df into kableExtra!
kableExtraNow that we have our R pretending-to-be-a-Python-dataframe ready, let’s talk about how kableExtra helps us for pretty printing.
I won’t go over the setup part in much detail, but I want to highlight the “trick” to making it look like the Python dataframe had raised regular columns with MultiIndex columns at a normal height. In another utiliy function I wrote, I have this code:
# base kbl
setup_kbl <- rdf %>%
# this retains color for the row Index columns
dplyr::rename_with(
function(x) kableExtra::cell_spec(x, "html", color = "black"),
dplyr::any_of(row_index_cols)
) %>%
kableExtra::kbl(align = "l", escape = F, row.names = show_rownames)
Essentially, row_index_cols are the MultiIndex columns, so this is making the first base kableExtra::kbl by making keeping those columns black. I also added a show_rownames boolean flag to show the row.names in case we just have a regular DataFrame like this one:
kable_pandas(
df = reticulate::py_eval("df_basic"),
show_rownames = TRUE
)
| a | b | c | d | |
|---|---|---|---|---|
| 0 | -0.4908300 | -0.8255482 | -2.0548656 | 1.2268369 |
| 1 | 0.3119496 | 1.0971422 | -0.0945923 | -3.0433091 |
| 2 | 1.0828095 | -2.3672907 | -0.1230130 | 1.6701767 |
| 3 | 2.4927998 | 0.0082783 | 0.5557527 | -0.7990277 |
| 4 | 0.5229924 | 0.1447302 | 0.5393493 | -0.8739172 |
| 5 | 0.1818034 | -0.1608687 | 0.3773270 | -0.8089840 |
| 6 | -2.6097903 | -0.7996391 | 0.4129209 | 0.6234479 |
| 7 | -0.4755228 | -1.1611904 | -0.2758276 | 0.0222106 |
In this more basic form, we can see the Python 0-indexing!
For my next trick, I “raised” the regular columns by coloring the regular columns white, and adding a header row like so:
final_kbl <-
setup_kbl %>%
kableExtra::row_spec(0, color = "white") %>%
kableExtra::add_header_above(
c(" " = idx_column_space, column_space),
align = "c",
bold = T,
line = F
) %>%
kableExtra::collapse_rows(columns = 1, valign = "top")
Here, the idx_column_space is the amount of columns to take up with " " because I want empty space above the MultiIndex columns. As for the column_space, it’s a named list that provides key value pairs for the column name and the space the column should take up.
Remember how we did the pydf$reset_index() from before for our Python DataFrame that has a MultiIndex? Well, that produced repeating values for our time_hour column. So, one last trick to display this properly is to use kableExtra::collapse_rows(columns = 1, valign = "top") to white out repititions of the first column, and align it to stick to the top of the cell. Behold, the final output (with rows chopped for brevity):
kable_pandas(
df = reticulate::py_eval("df")
)
| time_hour | origin | year | month | day | hour | temp | dewp | humid | wind_dir | wind_speed | wind_gust | precip | pressure | visib |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2013-01-01 01:00:00 | EWR | 2013 | 1 | 1 | 1 | 39.02 | 26.06 | 59.37 | 270 | 10.35702 | NaN | 0 | 1012.0 | 10 |
| JFK | 2013 | 1 | 1 | 1 | 39.02 | 26.06 | 59.37 | 260 | 12.65858 | NaN | 0 | 1012.6 | 10 | |
| LGA | 2013 | 1 | 1 | 1 | 39.92 | 26.06 | 57.33 | 260 | 13.80936 | 23.0156 | 0 | 1011.9 | 10 | |
| 2013-01-01 02:00:00 | LGA | 2013 | 1 | 1 | 2 | 41.00 | 26.06 | 54.97 | 260 | 17.26170 | 25.31716 | 0 | 1011.5 | 10 |
| EWR | 2013 | 1 | 1 | 2 | 39.02 | 26.96 | 61.63 | 250 | 8.05546 | NaN | 0 | 1012.3 | 10 | |
| JFK | 2013 | 1 | 1 | 2 | 39.02 | 26.06 | 59.37 | 270 | 11.50780 | NaN | 0 | 1012.4 | 10 | |
| 2013-01-01 03:00:00 | JFK | 2013 | 1 | 1 | 3 | 39.92 | 26.96 | 59.50 | 260 | 14.96014 | NaN | 0 | 1012.7 | 10 |
| EWR | 2013 | 1 | 1 | 3 | 39.02 | 28.04 | 64.43 | 240 | 11.50780 | NaN | 0 | 1012.5 | 10 | |
| LGA | 2013 | 1 | 1 | 3 | 41.00 | 26.06 | 54.97 | 260 | 16.11092 | 24.16638 | 0 | 1012.0 | 10 | |
| 2013-01-01 04:00:00 | JFK | 2013 | 1 | 1 | 4 | 39.92 | 28.04 | 62.21 | 250 | 17.26170 | NaN | 0 | 1012.6 | 10 |
| LGA | 2013 | 1 | 1 | 4 | 41.00 | 26.06 | 54.97 | 260 | 17.26170 | 25.31716 | 0 | 1011.8 | 10 | |
| EWR | 2013 | 1 | 1 | 4 | 39.92 | 28.04 | 62.21 | 250 | 12.65858 | NaN | 0 | 1012.2 | 10 |
Try hovering over the rows to see an extra visual aid that we’re looking at a MultiIndex! I don’t know about you but that looks a lot better than the default output to me. 😏
If you use the kable::kable_paper you can actually get a pretty decent looking render in the RStudio IDE!!! 😮
I hope you have enjoyed this brief tutorial of how to pretty print Python dataframes with RMarkdown. If you have any questions, suggestions or spot mistakes, please let me know! It should be clear that we are just scratching the surface with making Python dataframes prettier in the RMarkdown universe.
Using kableExtra was neat because it had just the right features to allow flexibility to pull this off! In particular, the kableExtra::add_header_above to add arbitrary levels of header rows is nice for displaying MultiIndex levels (for rows or columns). The kableExtra::collapse_rows is useful for multirow cells for repeating values in a given column. The only caveat is with this package is that it seems to not perform well for 1000s of rows but I believe this is a common bottleneck for most table making packages out there.
One could go even further with this crazy exercise by adding in column type information to the columns just like {DT}, for example. Or, maybe even making an interactive version with {reactable}? 😉 Another promising package is {gt}, which offers the stub area for things like MultiIndex.