Big data has become a big topic for 2012. It's not only the size of data, but also the complexity of its formats and the speed of its delivery, that is starting to exceed the capabilities of traditional data management technologies, requiring new or specialized technologies just to manage it. In recent years, the democratization of analytics and business intelligence (BI) solutions has become a major driving force for data warehousing, resulting in the use of self-service data marts. One major implication of big data is that users will no longer be able to put all useful information into a single data warehouse. The logical data warehouse, which brings together information from multiple sources as needed, is replacing the single-data-warehouse model. Combined with the fact that enterprise IT departments continue to move toward distributed computing environments, the need for IT process automation to execute the integration and movement of data between these disparate sources is greater than ever.
The Real-Time Data Warehouse
Data warehouses are no longer simply the attics of the business computing world. The use of business intelligence applications is growing exponentially, making such data repositories an essential part of daily business life.
Compounding the need is the democratization of BI applications. Employees at all levels, from sales to production to human resources, use BI applications daily. Add in the new forms of data flooding in from radio-frequency identification (RFID) readers, web services applications, mobile devices, cloud-based sources and more, along with the fact that data latency requirements have dropped (below one hour in 60% of cases and under one minute in another 35%), and it's clear just how important near real-time data warehouse updates have become.
It's estimated that up to 90% of all Global 2000 companies have now established a link between their data warehouse and at least one mission-critical application used to produce revenues or control costs. Without current and accurate information, it's impossible to produce actionable insights at the pace of modern business.
Big data only adds another dimension to the BI/data warehousing environment. Big data processing is inherently parallel: data is split across multiple stores and processed across multiple servers. Virtualization of both computing and storage goes hand in hand with that. The idea behind data virtualization is to let users query across different data sources in near real time.
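To make the idea concrete, here is a minimal sketch of a federated query, with all connection strings, table names and file paths hypothetical: results from a relational store and a flat-file feed are combined in memory rather than first being consolidated into one warehouse.

```python
# Minimal sketch of data virtualization: query two separate stores
# and join the results in near real time, instead of consolidating
# everything into a single warehouse first.
# Hostnames, table names, and file paths are hypothetical.
import pandas as pd
import sqlalchemy

# Source 1: an operational relational database.
engine = sqlalchemy.create_engine("postgresql://user:pass@oltp-host/sales")
orders = pd.read_sql("SELECT customer_id, amount FROM orders", engine)

# Source 2: a flat-file extract delivered by an external feed.
customers = pd.read_csv("/feeds/crm/customers.csv")  # customer_id, region

# Federated "query": join across the two sources in memory.
revenue_by_region = (
    orders.merge(customers, on="customer_id")
          .groupby("region")["amount"]
          .sum()
)
print(revenue_by_region)
```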
The Problem
Accessing and integrating large amounts of data from sources inside and outside the firewall, and even from within the cloud, is complex and requires that many elements work together to deliver the accurate and timely information that supports the right strategic and tactical business decisions. In many cases, complex data pathways and their associated moving parts break, dependencies go unmet, and systems go offline, all without visibility into each connection and process step. As a result, reports are incorrect, late, or both.
While many tools exist to automate the processes used to update data stores, most have significant limitations. For example, a DBMS has job scheduling capabilities, as do major operating systems, including UNIX and Windows. But DBMS options typically focus on data maintenance only. And while operating system tools are convenient, their workflows are limited to tasks occurring within a particular server or operating environment.
Many of the leading enterprise data warehousing and BI solutions also come equipped with job scheduling and automation capabilities, but again, these are typically limited to scheduling on just that system, leaving IT to rely on error-prone and time-consuming scripting to pass data and manage dependencies across the vast array of BI and data warehousing solutions.
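To see why such scripting is brittle, consider a minimal sketch of the kind of glue script this entails (paths, commands and addresses are hypothetical): every dependency check, timeout and alert must be hand-coded and hand-maintained.

```python
# Sketch of the hand-rolled "glue" that workload automation replaces.
# Paths, commands, and addresses are hypothetical; every dependency,
# timeout, and alert must be coded and maintained by hand.
import os
import time
import subprocess
import smtplib
from email.message import EmailMessage

EXTRACT = "/staging/nightly_extract.csv"

# Poll for the upstream extract; there is no real dependency
# awareness, just a timeout guessed by the script's author.
deadline = time.time() + 2 * 60 * 60  # give up after two hours
while not os.path.exists(EXTRACT):
    if time.time() > deadline:
        msg = EmailMessage()
        msg["Subject"] = "Nightly load FAILED: extract never arrived"
        msg["From"], msg["To"] = "etl@example.com", "ops@example.com"
        smtplib.SMTP("localhost").send_message(msg)
        raise SystemExit(1)
    time.sleep(60)

# Kick off the load; error handling is again ad hoc.
subprocess.run(["/opt/etl/load_warehouse.sh", EXTRACT], check=True)
```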
Workload Automation Is the Answer
To simplify the deployment of data warehouse computing tasks, IT automation has become a cornerstone, allowing IT operations to efficiently integrate and manage key resources in their data warehousing environments for improved data quality and reporting. Workload automation solutions accomplish this by providing a single automation engine that integrates data pathways into automated, repeatable, schedulable processes and delivers a high degree of control over every step in the BI/data warehousing process. A good workload automation solution allows for the simple creation of these workflows, letting IT operations reduce or even eliminate the custom scripting and manual intervention traditionally required to execute these processes.
A workload automation solution should also suit your particular IT environment, providing a robust scheduling platform with cross-platform, cross-application support and real-time interaction for centralized control of data integration, ETL and data warehouse processes. Whether you operate Linux, UNIX, or Windows; rely heavily on Java, web services, Oracle or SQL Server; use OpenPGP for encryption and decryption tasks; or use Secure Shell (SSH) to protect data exchanged between systems, your scheduling application should be robust enough to handle it all. This should also include direct integration with many, if not all, of the leading data warehousing and BI solutions on the market today.
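As an illustration of what one such step might involve under the hood, here is a minimal sketch using the paramiko and python-gnupg libraries, with hosts, key files and paths all hypothetical: an encrypted extract is pulled over SSH/SFTP and decrypted with OpenPGP before loading.

```python
# Sketch of a secure transfer-and-decrypt step: fetch an encrypted
# extract over SSH/SFTP, then decrypt it with OpenPGP.
# Hosts, credentials, and paths are hypothetical.
import paramiko   # SSH/SFTP client
import gnupg      # python-gnupg wrapper around GnuPG

# Pull the encrypted file from a remote server over SFTP.
client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
client.connect("feeds.example.com", username="etl",
               key_filename="/etc/etl/id_rsa")
sftp = client.open_sftp()
sftp.get("/outbound/market_data.csv.gpg", "/staging/market_data.csv.gpg")
sftp.close()
client.close()

# Decrypt with OpenPGP before loading into the warehouse.
gpg = gnupg.GPG()
with open("/staging/market_data.csv.gpg", "rb") as encrypted:
    result = gpg.decrypt_file(encrypted, output="/staging/market_data.csv")
if not result.ok:
    raise RuntimeError(f"Decryption failed: {result.status}")
```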
Any good workload automation solution should make it easy to author both simple and complex workflows without scripting, and should provide the tracking and alerting capabilities that increase efficiency, reduce errors and support enterprise objectives, freeing IT staff to focus on innovation and efficiency.
For example, take a financial services organization that requires daily, updated reports from financial markets across the globe. A robust workload automation solution could allow for the creation of a workflow that automates the entire data warehousing/BI process, from data download to dispersal of finished reports to end users. The workflow could be triggered automatically when local data marts receive files, followed by an ETL procedure that loads the data into the warehouse, delivery to a BI solution for analysis and report generation, and finally formatting and email distribution of the completed reports to financial analysts worldwide, as sketched below.
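Here is a minimal sketch of that pipeline, with all script names, hosts and addresses hypothetical; a real workload automation tool would express the same steps declaratively, with built-in triggering, retries and alerting rather than custom code.

```python
# Sketch of the financial-reporting pipeline described above:
# file trigger -> ETL load -> report generation -> email distribution.
# Script names, hosts, and addresses are hypothetical.
import glob
import subprocess
import smtplib
from email.message import EmailMessage

def run_pipeline() -> None:
    # 1. Trigger: fire only when the overnight market files have landed.
    files = glob.glob("/marts/inbound/market_*.csv")
    if not files:
        return  # nothing to do yet

    # 2. ETL: load the files into the data warehouse.
    subprocess.run(["/opt/etl/load_warehouse.sh", *files], check=True)

    # 3. BI: generate the daily report from the refreshed warehouse.
    subprocess.run(["/opt/bi/build_report.sh",
                    "--output", "/reports/daily.pdf"], check=True)

    # 4. Distribution: email the finished report to analysts.
    msg = EmailMessage()
    msg["Subject"] = "Daily market report"
    msg["From"], msg["To"] = "reports@example.com", "analysts@example.com"
    with open("/reports/daily.pdf", "rb") as report:
        msg.add_attachment(report.read(), maintype="application",
                           subtype="pdf", filename="daily.pdf")
    smtplib.SMTP("localhost").send_message(msg)

if __name__ == "__main__":
    run_pipeline()
```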
The demands on today's data warehouses are more stringent and more time-sensitive than ever before. With a strong workload automation solution at work, it's possible to accommodate these demands while extending human and computing resources, and even improving return on infrastructure. Whether you have a workload automation solution in place or are considering adding one, leverage its capabilities wisely and you'll find your goal of a near real-time data warehouse becoming a reality.
The results are real and clear:
- Improved data quality through scheduling capabilities that ensure the completion and validity of upstream and downstream jobs
- Better reporting and faster time to insight due to tight integration between reporting packages and data warehousing solutions
- Reduced reliance on custom scripting to avoid constant updates and manual errors
- Faster, more reliable development of end-to-end workflows
About the author:
Colin Beasty is a product marketing manager with Advanced Systems Concepts and is responsible for ActiveBatch Enterprise Job Scheduling and Workload Automation product messaging and positioning.