

Sparklyr 1.5 Release Now Available!

December 14, 2020

Sparklyr, an LF AI & Data Foundation Incubation Project, has released version 1.5! Sparklyr is an R package that lets you analyze data in Apache Spark, the well-known engine for big data processing, while using the familiar tools of R. R is widely used by data scientists and statisticians around the world and is known for its advanced features in statistical computing and graphics.
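
For readers who have not used sparklyr before, here is a minimal sketch of that workflow; the local Spark connection and the `mtcars` example data are illustrative choices, not part of this release:

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (any supported Spark master URL works)
sc <- spark_connect(master = "local")

# Copy an R dataframe into Spark, then analyze it with familiar dplyr verbs
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)

mtcars_tbl %>%
  group_by(cyl) %>%
  summarize(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  arrange(cyl) %>%
  collect()

spark_disconnect(sc)
```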

In version 1.5, sparklyr adds a variety of improvements. Highlights include:

  • A large amount of user feedback was addressed in this release, especially feedback related to the `dplyr` interface of `sparklyr`. Spark dataframes now work with a larger number of dplyr verbs in the same way that R dataframes do.
  • There are four useful additions to the `sdf_*` family of functions (see the first sketch following this list).
    • As the name suggests, functions starting with the prefix `sdf_` in `sparklyr` are ones that interface with Spark dataframes.
    • `sdf_expand_grid()` performs the equivalent of `expand.grid()` with Spark dataframes
    • `sdf_partition_sizes()` computes partition size(s) of a Spark dataframe efficiently
    • `sdf_unnest_longer()` and `sdf_unnest_wider()` are Spark equivalents of `tidyr::unnest_longer()` and `tidyr::unnest_wider()`. They can be used to transform a struct column within a Spark dataframe.
      • `sdf_unnest_longer()` transforms fields in a struct column into new rows
      • `sdf_unnest_wider()` transforms fields in a struct column into new columns
  • The default non-Arrow-based serialization format in `sparklyr` used to be CSV. Starting with `sparklyr` 1.5 it is RDS (more detailed context can be found here); a sketch of what this enables follows this list.
    • Correctness issues that were previously hard to fix with CSV serialization were resolved easily with the new RDS format.
    • There was some performance improvement from RDS serialization too.
    • One can now import binary columns from an R dataframe to Spark with RDS serialization.
    • RDS serialization also facilitated a reduction of serialization overhead in the Spark-based foreach parallel backend.
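
To make the new `sdf_*` additions more concrete, here is a hedged sketch of how they can be used; the connection, table names, column names, and data are made up for illustration, and `struct()` below is simply passed through to Spark SQL by sparklyr's dplyr translation:

```r
library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# sdf_expand_grid(): the Spark analogue of expand.grid() -- a Cartesian product
# of the supplied variables, materialized as a Spark dataframe
grid_sdf <- sdf_expand_grid(sc, letter = c("a", "b"), number = 1:3)

# sdf_partition_sizes(): row count of each partition of a Spark dataframe
sdf_partition_sizes(grid_sdf)

# Build a small Spark dataframe with a struct column `pt`
pts <- copy_to(sc, data.frame(id = 1:2, x = c(1, 3), y = c(2, 4)), name = "pts_tbl") %>%
  mutate(pt = struct(x, y)) %>%
  select(id, pt)

# sdf_unnest_longer(): one output row per field of the struct column
pts %>% sdf_unnest_longer(pt)

# sdf_unnest_wider(): one output column per field of the struct column
pts %>% sdf_unnest_wider(pt)

spark_disconnect(sc)
```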
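
Similarly, here is a small sketch of two things the RDS-based serialization path enables, namely importing a binary (raw) column into Spark and using the Spark-based `foreach` backend; the data, table name, and connection are again illustrative:

```r
library(sparklyr)
library(foreach)

sc <- spark_connect(master = "local")

# With RDS serialization (the default non-Arrow path as of sparklyr 1.5), a
# list-of-raw column in an R dataframe can be imported into Spark as a binary column
binary_df <- data.frame(id = 1:2)
binary_df$payload <- list(
  serialize("hello", connection = NULL),
  serialize("world", connection = NULL)
)
binary_sdf <- copy_to(sc, binary_df, name = "binary_tbl", overwrite = TRUE)

# The Spark-based foreach parallel backend also benefits from the lower
# serialization overhead of the RDS format
registerDoSpark(sc)
results <- foreach(i = 1:4) %dopar% sqrt(i)

spark_disconnect(sc)
```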

As usual, there is strong support for `sparklyr` from our fantastic open-source community! In chronological order, we thank the following individuals for making their pull requests part of `sparklyr` 1.5:

To learn more about the sparklyr 1.5.0 release, check out the full release notes. Want to get involved with sparklyr? Be sure to join the sparklyr-Announce and sparklyr Technical-Discuss mailing lists to connect with the community and stay up to date on the latest developments.

Congratulations to the sparklyr team and we look forward to continued growth and success as part of the LF AI & Data Foundation! To learn about hosting an open source project with us, visit the LF AI & Data Foundation website.


Author

  • Andrew Bringaze

    Andrew Bringaze is the senior developer for The Linux Foundation. With over 10 years of experience, his focus is on open source code, WordPress, React, and site security.
