{tidypolars}
Tidypolars is a data frame library built on top of the blazingly fast polars library, providing a tidy interface for R users familiar with the tidyverse.
Use Cases
Here are some common use cases:
- Data Manipulation:
  - Tidypolars provides a tidyverse-like interface for data manipulation tasks. You can filter rows, select columns, arrange data, and create new variables using familiar syntax.
  - Example: Filtering rows based on conditions, creating new columns, and summarizing data.
- Performance Optimization:
  - Polars, the underlying library, is designed for speed and efficiency. Tidypolars leverages this performance to handle large datasets efficiently.
  - Use it when you need to process large data frames quickly.
- Joining Data Frames:
  - Tidypolars supports various join operations (inner, outer, left, right) to combine data frames.
  - Example: Merging two data frames based on common keys.
- Aggregations and Grouping:
  - You can group data by one or more columns and perform aggregations (sum, mean, count, etc.) within each group.
  - Useful for summarizing data at different levels.
- Window Functions:
  - Tidypolars allows you to apply window functions (rolling calculations) to data frames.
  - Example: Calculating moving averages, cumulative sums, or ranking within partitions.
- Efficient Data Processing:
  - If you work with large datasets and need performance gains, tidypolars is a great choice.
  - It’s especially useful when you want to maintain a tidy data workflow.
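The difference between an aggregation and a window function in the list above is easy to blur, so here is a small plain-Python sketch of what each one computes. It deliberately does not use tidypolars or polars; the helper names (group_sum, cumulative_sum_within, moving_average) and the toy columns z and x are made up for illustration only.

```python
from collections import defaultdict

# Toy rows; the column names (z, x) are invented for this illustration.
rows = [
    {"z": "a", "x": 1},
    {"z": "a", "x": 2},
    {"z": "b", "x": 10},
    {"z": "a", "x": 3},
    {"z": "b", "x": 20},
]


def group_sum(rows, key, value):
    """Aggregation: collapse each group to a single total."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += row[value]
    return dict(totals)


def cumulative_sum_within(rows, key, value):
    """Window function: running total per group, one result per input row."""
    running = defaultdict(int)
    out = []
    for row in rows:
        running[row[key]] += row[value]
        out.append(running[row[key]])
    return out


def moving_average(values, window):
    """Rolling mean over the last `window` values; None until the window is full."""
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            chunk = values[i + 1 - window : i + 1]
            out.append(sum(chunk) / window)
    return out


totals = group_sum(rows, "z", "x")                 # one value per group
running = cumulative_sum_within(rows, "z", "x")    # one value per row
rolling = moving_average([row["x"] for row in rows], window=2)
```

The aggregation returns one value per group, while both window-style helpers return one value per input row. A data frame library performs the same calculations column-wise and much faster.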
Installation
The R package tidypolars is installed from R-universe (https://etiennebacher.r-universe.dev).
Let’s explore the differences between polars and dplyr:
- Column Referencing:
  - dplyr allows column references without quotation marks due to non-standard evaluation (NSE). It captures expressions passed as arguments, making the syntax more user-friendly.
  - In polars, column references typically need to be explicit: either quoted column names or expressions built with a constructor such as polars.col().
- Performance:
  - polars is heavily optimized for performance. Users can expect significant speed improvements (orders of magnitude) compared to dplyr for large datasets (>500MB).
  - Automatic optimization in polars can further boost performance for complex queries.
- Function Names:
  - Both packages have similar function names (e.g., filter()), but polars consistently uses snake case verbs for intuitive inputs and outputs.
  - dplyr relies on NSE, while polars adheres to standard evaluation.
Remember that polars is relatively new in the Python world, and it’s worth exploring its capabilities!
The Python package tidypolars (a separate project from the R package of the same name) lets you work with data frames using methods that resemble tidyverse functions. For example:
import tidypolars as tp
from tidypolars import col, desc

df = tp.Tibble(x=range(3), y=range(3, 6), z=['a', 'a', 'b'])

result = (
    df
    .select('x', 'y', 'z')
    .filter(col('x') < 4, col('y') > 1)
    .arrange(desc('z'), 'x')
    .mutate(double_x=col('x') * 2, x_plus_y=col('x') + col('y'))
)

# Resulting data frame:
#    x  y  z  double_x  x_plus_y
# 0  2  5  b         4         7
# 1  0  3  a         0         3
# 2  1  4  a         2         5
Remember that in tidypolars, column names must be wrapped in col() for certain methods like .filter(), .mutate(), and .summarize(). Grouping by columns is also straightforward using the by argument¹. You can install the R package tidypolars from R-universe on Windows, macOS, or Linux²; the Python package used in the example above is distributed on PyPI.
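To see why an expression such as col('x') < 4 works under standard evaluation, here is a toy plain-Python sketch. It assumes nothing about how tidypolars or polars are implemented: the Col class and the filter_rows and mutate helpers below are hypothetical stand-ins. The point is only that col() returns an object describing a computation, which the data frame method evaluates later against the data.

```python
import operator


class Col:
    """Hypothetical stand-in for a column expression; not the tidypolars API."""

    def __init__(self, name, fn=None):
        self.name = name
        # fn maps a row (a dict) to a value; by default it looks the column up.
        self.fn = fn or (lambda row: row[name])

    def _binary(self, other, op):
        # Accept either another Col or a plain constant on the right-hand side.
        rhs = other.fn if isinstance(other, Col) else (lambda row: other)
        return Col(self.name, lambda row: op(self.fn(row), rhs(row)))

    def __lt__(self, other):
        return self._binary(other, operator.lt)

    def __gt__(self, other):
        return self._binary(other, operator.gt)

    def __add__(self, other):
        return self._binary(other, operator.add)

    def __mul__(self, other):
        return self._binary(other, operator.mul)


def col(name):
    return Col(name)


def filter_rows(rows, *predicates):
    """Keep rows for which every deferred predicate evaluates to True."""
    return [row for row in rows if all(p.fn(row) for p in predicates)]


def mutate(rows, **new_columns):
    """Return copies of the rows with each new column computed from its expression."""
    return [
        {**row, **{name: expr.fn(row) for name, expr in new_columns.items()}}
        for row in rows
    ]


rows = [{"x": 0, "y": 3}, {"x": 1, "y": 4}, {"x": 2, "y": 5}]

predicate = col("x") < 2  # nothing is compared yet; this is a Col object
kept = filter_rows(rows, predicate, col("y") > 1)
result = mutate(kept, double_x=col("x") * 2, x_plus_y=col("x") + col("y"))
```

Because Python evaluates arguments before the call, a bare x < 2 would fail or compare an unrelated variable. Wrapping the name in col() defers the comparison until the rows are available, which is the role NSE plays implicitly in dplyr.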
References
- GitHub - markfairbanks/tidypolars: Tidy interface to polars. https://github.com/markfairbanks/tidypolars.
- Get the Power of Polars with the Syntax of the Tidyverse • tidypolars. https://www.tidypolars.etiennebacher.com/.
- tidypolars · PyPI. https://pypi.org/project/tidypolars/.
- etiennebacher R-universe. https://etiennebacher.r-universe.dev.
- markfairbanks/tidypolars README. https://github.com/markfairbanks/tidypolars/tree/dd839b890a8c9daee54efa4eaaf3dc766c07d4a0/README.md.
- Tidy Data Manipulation: dplyr vs polars – Tidily. https://blog.tidy-intelligence.com/posts/dplyr-vs-polars/.
- An Introduction to Polars from R. https://cran.r-universe.dev/polars/doc/polars.html.
- polars’ Rgonomic Patterns | Emily Riederer. https://www.emilyriederer.com/post/py-rgo-polars/.
- difference between plyr::mutate and dplyr::mutate - Stack Overflow. https://stackoverflow.com/questions/28812512/difference-between-plyrmutate-and-dplyrmutate.
Edit Notes
- Citrix Build on 25th July 2024
- UX improvements on 22nd August 2024 (Citrix)