The availability of Apache Spark 1.6 was announced today. The release continues the major theme of making Spark easier to use for data applications, said Reynold Xin, co-founder of Databricks, the company behind Spark, an open source processing engine built around speed, ease of use, and sophisticated analytics. The news follows the announcement of the Apache Spark 1.6 preview package in Databricks roughly a month and a half ago, in mid-November 2015.
“There are lots of improvements and new features - in fact over 1,000 - in this Spark release,” said Xin.
In addition, this release marks a milestone in community development. With Spark 1.6, the number of contributing developers has surpassed 1,000, double the 500 contributors at the end of 2014.
4 Key Highlights in Spark 1.6
The Dataset API, automatic memory configuration, major improvements in Spark Streaming, and persistence of machine learning pipelines are among the many notable enhancements in the latest Spark release, said Xin.
- According to Xin, the DataFrame API is the largest API addition to Spark since its inception. “Since we released it earlier this year, we have gotten a lot of feedback and one of the main ones is the lack of support for compile-time type safety. The Dataset API is an extension of the DataFrame API that supports static typing and user functions that run directly on existing JVM types. When compared with the traditional RDD API, Datasets should provide better memory management as well as in the long run better performance.”
- Moreover, said Xin, considerable time and effort has been spent on creating an automatic memory manager so users no longer need to tune the size of different memory regions in Spark. “Instead, Spark at runtime will automatically grow and shrink regions according to the needs of the executing application. For many applications, this will mean a significant increase in available memory that can be used for operators such as joins and aggregations,” he said.
- In addition, the new release provides 10X performance improvements in Spark Streaming through better state management. “State management is an important function in streaming applications, often used to maintain aggregations or session information. We have re-implemented Spark Streaming's state management functionality with a smarter way to track ‘deltas.’ This has resulted in order of magnitude performance improvements in many workloads.”
- Many machine learning applications leverage Spark's ML pipeline feature to construct learning pipelines, added Xin. “In the past, if the application wanted to store the pipeline outside the application, it needed to implement its custom persistence model. In Spark 1.6, the pipeline API offers functionality to save and reload pipelines from a previous state. This is very useful in reducing application boilerplate code as well as significantly reducing computation time when training very large models.”
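To illustrate the compile-time type safety Xin describes, here is a minimal Scala sketch of the Dataset API introduced in Spark 1.6. The `Person` case class is illustrative, and a running SparkContext `sc` (e.g. from spark-shell) is assumed:

```scala
import org.apache.spark.sql.SQLContext

// Illustrative domain type; any case class works.
case class Person(name: String, age: Long)

val sqlContext = new SQLContext(sc) // `sc` provided by spark-shell
import sqlContext.implicits._

// A Dataset carries a static type, so field access is checked at compile time.
val people = Seq(Person("Ann", 34), Person("Ben", 12)).toDS()

// `p` is a Person here; a typo like `p.agee` fails compilation,
// whereas the equivalent untyped DataFrame expression would fail only at runtime.
val adults = people.filter(p => p.age >= 18)
```

Because the lambda in `filter` runs directly on the JVM type `Person`, no conversion layer sits between user code and the data, which is what enables the memory and performance benefits Xin mentions.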
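The automatic memory manager replaces the fixed-size regions previously tuned by hand. Under Spark 1.6, the boundary between execution and storage memory moves at runtime, and the settings below (shown with their 1.6 defaults) only set the overall envelope; applications that depended on the old behavior can opt out:

```
# Fraction of the heap shared by execution and storage (the unified region)
spark.memory.fraction         0.75
# Portion of the unified region protected for cached blocks
spark.memory.storageFraction  0.5
# Set to true to restore the pre-1.6 static memory regions
spark.memory.useLegacyMode    false
```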
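The reworked state tracking surfaces in Spark 1.6 as the `mapWithState` operation on key-value DStreams, which touches only the keys that receive data in a batch rather than rescanning all state. A sketch of a running word count, assuming an existing StreamingContext and a DStream `words` of `(String, Int)` pairs:

```scala
import org.apache.spark.streaming.{State, StateSpec}

// Called once per key that receives data in a batch; untouched keys
// are not rescanned, which is where the delta-tracking speedup comes from.
val updateCount = (word: String, one: Option[Int], state: State[Int]) => {
  val sum = one.getOrElse(0) + state.getOption.getOrElse(0)
  state.update(sum)
  (word, sum) // emitted downstream
}

val runningCounts = words.mapWithState(StateSpec.function(updateCount))
```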
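The pipeline persistence added in 1.6 amounts to a pair of save/load calls on pipelines and fitted models. A minimal sketch, assuming a pipeline `pipeline` and training DataFrame `training` assembled elsewhere, and an illustrative output path:

```scala
import org.apache.spark.ml.{Pipeline, PipelineModel}

// Fit the assumed pipeline to produce a model.
val model: PipelineModel = pipeline.fit(training)

// Persist the fitted pipeline outside the application...
model.save("/tmp/spark-pipeline-model") // illustrative path

// ...and reload it later, e.g. in a separate serving job, without retraining.
val reloaded = PipelineModel.load("/tmp/spark-pipeline-model")
```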
For more information on the Spark 1.6 release, read the blog post by Michael Armbrust, Patrick Wendell, and Reynold Xin titled “Announcing Spark 1.6.”
Access the Spark 1.6 release notes here.