At Strata + Hadoop World, Pentaho announced five new improvements, including SQL on Spark, to help enterprises overcome big data complexity, skills shortages, and integration challenges in complex enterprise environments.
According to Donna Prlich, senior vice president of Product Management, Product Marketing & Solutions at Pentaho, the enhancements are part of Pentaho's mission to help make big data projects operational and deliver value by strengthening and supporting analytic data pipelines. The goal, she noted, is to let enterprises focus on their big data deployments by reducing the complexity and time involved in data preparation through the use of technologies such as Spark and Kafka in the big data ecosystem.
The expansion of the existing Spark integration in the Pentaho platform will help customers lower the skills barrier for Spark, since data analysts can now query and process Spark data via Pentaho Data Integration (PDI) using SQL on Spark. Expanded PDI orchestration for Spark Streaming, Spark SQL, and Spark machine learning (Spark MLlib and Spark ML) will also support developers who use multiple Spark libraries, enabling them to coordinate, schedule, reuse, and manage Spark applications in data pipelines more easily. In addition, PDI orchestration of Spark applications written in Python will benefit developers writing Spark applications in this popular language, helping them integrate Spark apps into larger data-driven processes.
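To make the idea concrete, the following is a minimal PySpark sketch, not Pentaho's implementation, of the kind of plain-SQL query that a tool like PDI's SQL on Spark could submit on an analyst's behalf; the Spark session settings and the small sales dataset are hypothetical.

```python
from pyspark.sql import SparkSession

# Minimal, self-contained sketch: the kind of Spark SQL query a tool
# such as PDI's SQL on Spark could run on an analyst's behalf.
# The app name and the "sales" data below are illustrative only.
spark = SparkSession.builder.appName("sql-on-spark-sketch").getOrCreate()

# Hypothetical sample data standing in for a table registered in Spark.
sales = spark.createDataFrame(
    [("EMEA", 1200.0), ("EMEA", 800.0), ("APAC", 950.0)],
    ["region", "amount"],
)
sales.createOrReplaceTempView("sales")

# Plain SQL, so the analyst needs no Scala or Java Spark programming.
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region"
)
totals.show()

spark.stop()
```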
In another enhancement, Pentaho has added more than 30 PDI transformation steps, including operations related to Hadoop, HBase, JSON, XML, Vertica, Greenplum, and other big data sources. Pentaho's metadata injection capability, designed to onboard multiple data types faster, allows data engineers to generate PDI transformations at runtime instead of hand-coding one for each data source.
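Pentaho's metadata injection works through PDI template transformations; purely as a language-neutral illustration of the underlying idea, the hypothetical Python sketch below generates a transformation from field-mapping metadata at runtime rather than hand-coding one per source.

```python
import csv
from typing import Dict, List

def build_transform(field_map: Dict[str, str]):
    """Return a row-transform function generated from metadata.

    field_map maps source column names to target column names -- the
    'injected' metadata that would otherwise be hand-coded once per
    data source.
    """
    def transform(row: Dict[str, str]) -> Dict[str, str]:
        return {target: row[source] for source, target in field_map.items()}
    return transform

def onboard(path: str, field_map: Dict[str, str]) -> List[Dict[str, str]]:
    """Apply a metadata-driven transform to every record in a CSV file."""
    transform = build_transform(field_map)
    with open(path, newline="") as f:
        return [transform(row) for row in csv.DictReader(f)]

# Hypothetical usage: each new source supplies only its metadata.
# rows = onboard("customers.csv", {"cust_nm": "name", "cust_id": "id"})
```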
Pentaho has also expanded its Hadoop data security integration to promote better big data governance and protect clusters from intruders. The enhancements include tighter Kerberos integration for secure multi-user authentication and Apache Sentry integration to enforce rules that control access to specific Hadoop data assets.
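As a rough illustration of how such a secured pipeline is typically exercised, the hypothetical Python sketch below acquires a Kerberos ticket from a keytab via the standard kinit tool before any cluster access; the principal and keytab path are assumptions, and Sentry's access rules would then govern what the authenticated identity may touch.

```python
import subprocess

# Illustrative sketch of the multi-user authentication flow Kerberos
# enables: obtain a ticket from a keytab before a job touches the
# cluster. The principal and keytab path here are hypothetical.
PRINCIPAL = "etl_user@EXAMPLE.COM"
KEYTAB = "/etc/security/keytabs/etl_user.keytab"

def kerberos_login(principal: str, keytab: str) -> None:
    """Acquire a Kerberos ticket non-interactively via kinit."""
    subprocess.run(["kinit", "-kt", keytab, principal], check=True)

kerberos_login(PRINCIPAL, KEYTAB)
# Subsequent Hadoop or Spark client calls in this session authenticate
# as etl_user; Sentry rules then decide which Hadoop data assets that
# identity may read or write.
```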
Pentaho now also provides enterprise-level support for sending data to and receiving data from Kafka, facilitating continuous data processing use cases in PDI, and supports output of files in the Avro and Parquet formats in PDI, both of which are popular for storing data in Hadoop in big data onboarding use cases.
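For a sense of the kind of continuous pipeline this targets, here is a minimal Spark Structured Streaming sketch in Python, again illustrative rather than Pentaho's mechanism, that reads records from a Kafka topic and lands them as Parquet files; the broker address, topic, and paths are hypothetical, and the Spark Kafka connector package is assumed to be on the classpath.

```python
from pyspark.sql import SparkSession

# Hedged sketch of a continuous-processing pipeline: read a stream
# from Kafka and land it in Parquet on HDFS. Broker address, topic,
# and paths are hypothetical; the spark-sql-kafka connector package
# must be available on the classpath.
spark = SparkSession.builder.appName("kafka-to-parquet-sketch").getOrCreate()

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "events")
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
)

query = (
    events.writeStream.format("parquet")
    .option("path", "hdfs:///data/events")
    .option("checkpointLocation", "hdfs:///checkpoints/events")
    .start()
)
query.awaitTermination()
```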
More information about Pentaho's big data enhancements is available from Pentaho.