Data Wrangling Titans

tidyverse vs. pandas

Author

Published

March 12, 2025

Intro

In our world, there are three certainties: death, taxes, and data that needs cleaning. According to various analyses, the data wrangling process before data can be properly analyzed typically takes anywhere from 45 to 80% (based on different reasearches) of the time spent on a given project. 45-80%!!! Why is this the case? Beyond the inherent nature of the work and the fact that data is imperfect, a significant role is also played by the environment. Some people furrow their brows and delve into complex transformations in Excel, Power Pivot, or try to click through all the Alteryx flow objects just to understand where that number came from. Others, like true people of culture, sift through lines of source code—usually in high-level programming languages such as R or Python.

In this short article, I decided to focus on the latter, specifically on the two most popular frameworks in these languages, and measure which environment is more sexy in data analysis—and why!

Data

The first step will simply be downloading the files. Since this process is very similar in both languages, I decided not to describe it in detail and used only R—I hope this is not a faux pas.

pacman::p_load(
    tidyverse,      # Data wrangling packages
    WDI,            # Fetching data from WorldData API
    knitr           # Neat .md table
)   

if (!file.exists("top30gdp.csv")) {
    # In case file is not avaiable in local env., use WD API
    top30gdp <- c(
    'US', 'CN', 'JP', 'DE', 'IN', 'GB', 'FR', 'IT',
    'CA', 'KR', 'RU', 'AU', 'BR', 'ES', 'MX', 'ID',
    'NL', 'SA', 'TR', 'CH', 'NG', 'PK', 'BE', 'AR',
    'PL', 'ZA', 'EG', 'TH', 'MY', 'SG'
    )
    df <- WDI(indicator='NY.GDP.PCAP.KD', country=top30gdp, start=1960, end=2024)
    write_csv(df, "top30gdp.csv")
    print("Downloaded!")
} else {
    # Otherwise - read local .csv
    df <- read_csv("top30gdp.csv")
    print("Loaded from local env.!")
}

[1] "Loaded from local env.!"

# Show data as a .md table
head(df) |> kable()

country	iso2c	iso3c	year	NY.GDP.PCAP.KD
Argentina	AR	ARG	2023	12933.25
Argentina	AR	ARG	2022	13182.79
Argentina	AR	ARG	2021	12549.28
Argentina	AR	ARG	2020	11393.05
Argentina	AR	ARG	2019	12706.40
Argentina	AR	ARG	2018	13058.33

The data used for this framework comparison is in a tabular format, specifically a tall table. It contains GDP per capita values for 30 countries that I have subjectively selected as the most interesting, covering the years 1960–2024. If data for a given country is unavailable, the corresponding value is missing.

Scripts

Now, we can move on to the comparison of both languages. Let’s start with the so-called big picture, which is a side-by-side comparison of code blocks from both frameworks, before diving into a more detailed description. :)

df |> 
select(
    country, 
    year,
    "gdp" = `NY.GDP.PCAP.KD`
) |> filter(year > 1990) |>
mutate(gdp_meandv = gdp - mean(gdp, na.rm = T), .by = 'year') |> 
nest(data = c(year, gdp, gdp_meandv)) |> 
mutate(
    model = map(data, ~lm(gdp_meandv ~ year, data = .x)),
    model_summary = map(model, tidy)
) |> 
unnest(model_summary) |> select(1:5) |> 
pivot_wider(names_from = "term", values_from = "estimate") |> 
mutate(is_above_average = year > mean(year)) |> 
filter(is_above_average) |> 
unnest(data, names_sep = '_') |>
ggplot(aes(x = data_year, y = data_gdp_meandv)) +  
    geom_point(color = 'black') + 
    geom_abline(
        aes(intercept = `(Intercept)`, slope = year),
        show.legend = F, color = 'red'
    ) + 
    facet_wrap(~country, scales = 'free_y') +
    labs(
        title = "Countries with the biggest GDP PCAP",
        subtitle = "linear model trendline",
        x = "Year",
        y = "GDP PCAP"
    )

df = (
    df
    .filter(['country', 'year', 'NY.GDP.PCAP.KD'])
    .rename(columns={'NY.GDP.PCAP.KD': 'gdp'})
    .dropna()
    .query('year > 1990')
)

df = df.groupby('year', group_keys=False).apply(
    lambda x: x.assign(gdp_meandv=x['gdp'] -
            x['gdp'].mean())
)

def fit_linear_model(group):
    X = sm.add_constant(group['year'])  
    y = group['gdp_meandv']
    model = sm.OLS(y, X).fit()
    return model.params

models = df.groupby('country', group_keys=False).apply(fit_linear_model)
df['gdp_meandv_pred'] = df.apply(
    lambda row: models.loc[row['country'], 'const'] +
        models.loc[row['country'], 'year'] * row['year'], axis=1
)

countires_above_mean = models.loc[models['year'] > models['year'].mean()].index.values
df = df[df['country'].isin(countires_above_mean)]

g = sns.FacetGrid(df, col="country", col_wrap=3, sharey=False, height=4)
g.map_dataframe(sns.scatterplot, x="year", y="gdp_meandv", color="black")  
g.map_dataframe(sns.lineplot, x="year", y="gdp_meandv_pred", color="red")  
g.set_axis_labels("Year", "GDP per capita mean deviation")

plt.show()

To evaluate the code, I will use several objective metrics, such as: - the number of characters,

the number of lines of code,
the number of unique functions used
the average number of named arguments per function.
the data wrangling and visualization performance

Additionally, I will include subjective assessments of syntax and overall readability. Of course, even the so-called objective metrics can be considered somewhat subjective, as the problem can be approached in multiple ways using different methods and functions. Nevertheless, I have made every effort to approach this analysis as rigorously as possible, writing the code in a way that, to the best of my knowledge, is optimal. If there are any shortcomings—my apologies!

Comparison

Objective metrics

Let’s start with a number of characters. In this competition, R probably wins, having about 15% fewer characters in the code. Of course, there is still room for experimentation, such as specifying column positions instead of full names, shortening certain function calls, and many other options. I opted for readability and best practices.

1. Comparison of number of characters for both frameworks.

On the other hand, in terms of the number of lines of code, both languages perform very similarly. For Standard Code (i.e., code formatted in the way I use daily), Python slightly leads with 1 line of code less. However, due to the lack of a neat way to chain the entire code with the pipe operator, it requires 6 additional lines to separate the different stages. In the case of maximum compression, R wins with 11 lines of code — though I wouldn’t recommend trying this at home. Code chunks in this format are available in the Horror-code section.

2. Comparison of number of lines for both frameworks.

The time has come for, in my opinion, the most interesting statistic – the number of unique functions and arguments, divided into named and unnamed arguments (keyword). I’ve always struggled with remembering the names of various functions, arguments, and other syntactical dependencies. And this is where R comes in. As seen in the comparison, 11 different functions versus as many as 17! In the case of named arguments, the difference is more than two-fold! Of course, for the sake of balance, the number of unnamed arguments favors Python – 10 to 14.

However, the way R reduces the number of names you need to remember compared to Python is remarkable to me and highlights the beauty of tidyverse as a framework for data manipulation

3. Comparison of number of unique funcstions and arguments passed for both frameworks.

Let’s get to the meat of it - performance. In both cases, it’s at an extremely high level, and the competition is very close, with a slight advantage for R. With larger datasets or datasets with different properties, using other functions, these values would likely fluctuate. Nevertheless, R wins, drawing with Python only in 1 out of 10 attempts.

4.1 Comparison of performance of data wrangling.

Moving on to the graphs, the result is devastating for seaborn and matplotlib. ggplot2 is hundreds of times more efficient. Adding to that the incredible grammar of graphics, it overwhelms its Python competitors with how it combines elegance with performance.

4.2 Comparison of performance of plotting.

In my interpretation, the result is 4:1 in favor of R, because when it comes to lines of code, it’s more of a tie; Python’s code is slightly longer visually, but after removing the blank lines, it’s actually a bit more compact. On the other hand, in the remaining categories, R unquestionably wins

R Phenomenen - subjective assessment

Python is a far more versatile language, with applications extending beyond data analysis into numerous other fields. In contrast, R was literally designed for data analysis. At first glance, minor syntactic differences might seem insignificant, but they can have a profound impact on usability.

One such example is how R allows positional arguments to be passed last, whereas Python requires them at the beginning of a function. This small detail enables seamless data frame passing via the |> operator while allowing an unlimited number of filtering conditions in filter() or new column definitions in mutate(). The implications of this seemingly minor design choice are enormous.

For instance, in R, the following code applies multiple filters without explicitly naming the arguments:

df <- df |> 
    mutate(Sales = qty * Price, SalesProfit = Sales - Cost) |>
    filter(Sales <= 0)

The data frame (df) is piped in as an explicit first argument, while the mutating rules are passed as positional arguments at the end. This makes the syntax incredibly fluid and readable.

In contrast, Python requires explicit argument naming, making the equivalent code more verbose:

df = (
    df
    .assign(Sales=lambda x: x['qty'] * x['Price'], SalesProfit=lambda x: x['Sales'] - x['Cost'])
    .query('Sales <= 0')
)

While both approaches achieve the same result, R’s syntax feels more natural, especially when chaining multiple operations. This design choice significantly contributes to R’s clarity and ease of use—its syntax could hardly be more intuitive!

In this example, we can observe another key difference: Python’s workflow often requires creating anonymous functions (lambda). This happens because assign() operates on the DataFrame in its current state and does not update it dynamically within the same operation. As a result, newly created columns are not immediately accessible, requiring lambda as a workaround to reference them. In contrast, R applies transformations sequentially, making newly created columns instantly available for subsequent operations. This eliminates the need for additional workarounds and ensures a smoother workflow. However, in certain cases, this behavior can also be a drawback—especially when unintended modifications propagate through the pipeline. Nonetheless, its advantages in terms of clarity and efficiency remain significant.

Horror-code

The hell of compact code – no space, no understanding; just a pure madness.

R

df |>  select(country, year,"gdp" = `NY.GDP.PCAP.KD`) |> filter(year > 1990) |>
mutate(gdp_meandv = gdp - mean(gdp, na.rm = T), .by = 'year') |> 
nest(data = c(year, gdp, gdp_meandv)) |> 
mutate(model = map(data, ~lm(gdp_meandv ~ year, data = .x)),model_summary = map(model, tidy)) |> 
unnest(model_summary) |> select(1:5) |>  pivot_wider(names_from = "term", values_from = "estimate") |> 
mutate(is_above_average = year > mean(year)) |> filter(is_above_average) |> 
unnest(data, names_sep = '_') |> ggplot(aes(x = data_year, y = data_gdp_meandv)) +  
    geom_point(color = 'black') + 
    geom_abline(aes(intercept = `(Intercept)`, slope = year),show.legend = F, color = 'red') + 
    facet_wrap(~country, scales = 'free_y') +
    labs(title = "Countries with the biggest GDP PCAP",subtitle = "linear model trendline",x = "Year",y = "GDP PCAP")

Python

df = (df.filter(['country', 'year', 'NY.GDP.PCAP.KD']).rename(columns={'NY.GDP.PCAP.KD': 'gdp'}).dropna().query('year > 1990'))
df = df.groupby('year', group_keys=False).apply(lambda x: x.assign(gdp_meandv=x['gdp'] -x['gdp'].mean()))
def fit_linear_model(group):
    X, y = sm.add_constant(group['year']), group['gdp_meandv']  
    model = sm.OLS(y, X).fit()
    return model.params
models = df.groupby('country', group_keys=False).apply(fit_linear_model)
df['gdp_meandv_pred'] = df.apply(lambda row: models.loc[row['country'], 'const'] + models.loc[row['country'], 'year'] * row['year'], axis=1)
countires_above_mean = models.loc[models['year'] > models['year'].mean()].index.values
df = df[df['country'].isin(countires_above_mean)]
g = sns.FacetGrid(df, col="country", col_wrap=3, sharey=False, height=4)
g.map_dataframe(sns.scatterplot, x="year", y="gdp_meandv", color="black")  
g.map_dataframe(sns.lineplot, x="year", y="gdp_meandv_pred", color="red")  
g.set_axis_labels("Year", "GDP per capita mean deviation")
plt.show()

Ok, and what?

So, we’ve looked at the data wrangling giants—R’s tidyverse and Python’s pandas—and had a side-by-side comparison of their capabilities. But now, let’s get real. We’ve seen which framework shines and where, but the decision which one will match your requirements better is up to you!

Beyond `tidyverse` and `pandas`

If you’re working with larger datasets and are tired of the typical R or Python limitations, duckdb might just be your new best friend. This lightning-fast SQL engine is embedded directly into both R and Python, allowing you to perform in-memory SQL queries without setting up a separate database system. It’s super useful for data processing that’s too heavy for traditional memory-based manipulations, especially when working with dplyr in R or pandas in Python. Why not leverage the power of SQL for some operations where it really excels?

On top of that, both environments have endless possibilities to get faster and more efficient. If you’re using R, perhaps it’s time to explore data.table, which rivals tidyverse in terms of speed and memory efficiency. In Python, there’s modin, which gives pandas a performance boost by parallelizing operations across all your cores. If you’re really pushing your dataset limits, dask can also help scale your operations to distributed computing with ease.

But it’s not just about speed and efficiency. Think about data pipelines. With Python, tools like Luigi or Airflow help you streamline your processes, while in R, you’ve got drake—a great way to build reproducible workflows. Let’s not forget about Spark—if you’re moving from a small dataset to something truly massive, both R and Python have interfaces that’ll let you tap into the distributed computing power of Apache Spark.

So, whether you’re sticking with your trusty tidyverse or pandas, or exploring some of the other tools that expand the potential of these environments, there’s no shortage of ways to improve the data wrangling process. Go ahead, push the boundaries—your data’s waiting!