Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.6! Sparklyr is an R package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using familiar tools in R. The R language is widely used by data scientists and statisticians around the world and is known for its advanced capabilities in statistical computing and graphics.
In version 1.6, sparklyr adds a variety of improvements. Highlights include:
- Sparklyr now has an R interface for Power Iteration Clustering
- Power Iteration Clustering is a scalable and efficient graph clustering algorithm. It finds a low-dimensional embedding of a dataset by applying truncated power iterations to a normalized pairwise similarity matrix of all data points, then runs the k-means algorithm on the embedded representation.
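As a quick sketch of what the new interface looks like, the example below clusters a small toy similarity graph on a local Spark connection. The function name `ml_power_iteration()` and the default edge-column names (`src`, `dst`, `weight`) follow the sparklyr reference documentation; consult `?ml_power_iteration` in your installed version for the exact signature.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# A toy similarity graph: six nodes forming two loosely connected groups.
# Each row is a weighted edge between two node ids.
edges <- copy_to(sc, data.frame(
  src    = c(0, 0, 1, 3, 3, 4),
  dst    = c(1, 2, 2, 4, 5, 5),
  weight = c(1, 1, 1, 1, 1, 1)
))

# Ask for 2 clusters; the result maps each node id to a cluster assignment.
clusters <- ml_power_iteration(edges, k = 2, max_iter = 20)
clusters

spark_disconnect(sc)
```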
- Support for approximate weighted quantiles in `sdf_quantile()` and `ft_quantile_discretizer()`
- Sparklyr 1.6 features a generalized version of the Greenwald-Khanna algorithm that takes weights of sample data into account when approximating quantiles of a large number of data points.
- Like its unweighted counterpart, the weighted Greenwald-Khanna algorithm can run in a distributed fashion across multiple Spark worker nodes: each worker summarizes one or more partitions of a Spark dataframe in parallel, and the quantile summaries from all partitions are then merged efficiently. The merged result can be used to approximate weighted quantiles of the dataset, with a fixed upper bound on the relative error of all approximations.
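To illustrate, the sketch below computes weighted quantiles on a small dataframe where one point carries most of the weight, pulling the upper quantiles toward it. The `weight.column` parameter name is taken from the sparklyr 1.6 reference; see `?sdf_quantile` for details.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# Five values; the last one carries ten times the weight of the others.
sdf <- copy_to(sc, data.frame(
  value  = c(1, 2, 3, 4, 5),
  weight = c(1, 1, 1, 1, 10)
))

# Weighted approximate quantiles (weight.column is new in sparklyr 1.6).
# The heavy weight on value = 5 shifts the median and upper quartile upward
# compared to the unweighted result.
sdf_quantile(
  sdf,
  column = "value",
  probabilities = c(0.25, 0.5, 0.75),
  weight.column = "weight"
)

spark_disconnect(sc)
```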
- `spark_write_rds()` was implemented to support exporting all partitions of a Spark dataframe in parallel into RDS (version 2) files. This functionality was designed and built to avoid high memory pressure on the Spark driver node when collecting large Spark dataframes.
- RDS files will be written to the default file system of the Spark instance (i.e., local file if the Spark instance is running locally, or a distributed file system such as HDFS if the Spark instance is deployed over a cluster of machines).
- The resulting RDS files, once downloaded onto the local file system, should be deserialized into R dataframes using `collect_from_rds()` (which calls `readRDS()` internally and also performs some important post-processing steps to support timestamp columns, date columns, and struct columns properly in R).
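A minimal end-to-end sketch of this export path is shown below. The `{partitionId}` substitution in `dest_uri` follows the sparklyr 1.6 reference documentation (each partition's id is interpolated into its output file name); check `?spark_write_rds` for the exact interface in your version.

```r
library(sparklyr)

sc <- spark_connect(master = "local")

# A larger Spark dataframe that we would rather not collect() through
# the driver node all at once.
large_sdf <- sdf_len(sc, 1e6)

# Export each partition in parallel to its own RDS (version 2) file.
# With a local Spark instance the files land on the local file system;
# on a cluster they would go to the cluster's default file system (e.g., HDFS).
spark_write_rds(
  large_sdf,
  dest_uri = paste0("file://", tempdir(), "/part-{partitionId}.rds")
)

# Once a file is available locally, deserialize it with collect_from_rds(),
# which wraps readRDS() and post-processes timestamp, date, and struct columns.
df <- collect_from_rds(file.path(tempdir(), "part-0.rds"))

spark_disconnect(sc)
```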
- Dplyr-related improvements:
- Dplyr verbs such as `select`, `mutate`, and `summarize` can now operate on a set of Spark dataframe columns specified by `where()` predicates (e.g., `sdf %>% select(where(is.numeric))` and `sdf %>% summarize(across(starts_with("Petal"), mean))`)
- Sparklyr 1.6 adds support for `if_all()` and `if_any()` on Spark dataframes
- Dbplyr integration in sparklyr has been revised substantially to be compatible with both dbplyr edition 1 and edition 2 APIs
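Putting these dplyr improvements together, a short sketch on the familiar `iris` dataset might look like the following (a local Spark connection is assumed):

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")
iris_sdf <- copy_to(sc, iris)

# Select columns matching a where() predicate.
iris_sdf %>% select(where(is.numeric))

# Summarize across columns selected by name prefix.
iris_sdf %>% summarize(across(starts_with("Petal"), mean))

# Keep rows where every numeric column is positive (if_any() is analogous,
# keeping rows where at least one column satisfies the predicate).
iris_sdf %>% filter(if_all(where(is.numeric), ~ .x > 0))

spark_disconnect(sc)
```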
As usual, there is strong support for sparklyr from our fantastic open-source community! In chronological order, we thank the following individuals for making their pull requests part of sparklyr 1.6:
To learn more about the sparklyr 1.6 release, check out the full release notes. Want to get involved with sparklyr? Be sure to join the sparklyr-Announce and sparklyr-Technical-Discuss mailing lists to join the community and stay connected on the latest updates.
Congratulations to the sparklyr team and we look forward to continued growth and success as part of the LF AI & Data Foundation! To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.
Sparklyr Key Links
LF AI & Data Resources