What has traditionally made data a hard problem is precisely the issue of accessing, preparing, and producing it for machine and, ultimately, for human consumption. What makes this a much harder problem in the age of big data is that the information we’re consuming is vectored to us from so many different directions.
The data integration (DI) status quo is predicated on a model of data-at-rest. The designated final destination for data-at-rest is (and, at least for the foreseeable future, will remain) the data warehouse (DW). Traditionally, data of a certain type was vectored to the DW from more or less predictable directions—viz., OLTP systems or flat files—and at the more or less predictable velocities circumscribed by the limitations of the batch model. Thanks to big data, this is no longer the case.

Granted, the term “big data” is empty, hyperbolic, and insufficient; granted, there’s at least as much big data hype as big data substance. But still, as a phenomenon, big data at once describes 1) the technological capacity to ingest, store, manage, synthesize, and make use of information to an unprecedented degree and 2) the cultural capacity to imaginatively conceive of and meaningfully interact with information in fundamentally different ways.

One consequence of this has been the emergence of a new DI model that doesn’t so much aim to supplant as to enrich the status quo ante. In addition to data-at-rest, the new DI model is able to accommodate data-in-motion—i.e., data as it streams and data as it pulses: from the logs or events generated by sensors or other periodic signalers to the signatures or anomalies that are concomitant with aperiodic events such as fraud, impending failure, or service disruption.
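To make the contrast concrete, here is a minimal, purely illustrative sketch in ordinary Python. No particular vendor's tooling is assumed, and every name in it is invented: a bounded batch extract stands in for data-at-rest, while a generator of sensor readings stands in for data-in-motion, with anomalies flagged as they pulse by.

```python
# Illustrative only: the same downstream "load" is fed two ways --
# a batch extract of data-at-rest and a consumer of data-in-motion.
# All function and field names here are hypothetical.

import random
import time
from typing import Dict, Iterable, Iterator


def batch_extract(rows: Iterable[Dict]) -> list:
    """Data-at-rest: pull a finite, known set of records (e.g., an OLTP extract)."""
    return [dict(row, loaded_by="nightly_batch") for row in rows]


def sensor_stream(n_events: int = 10) -> Iterator[Dict]:
    """Data-in-motion: an unbounded (here, truncated) feed of sensor readings."""
    for i in range(n_events):
        yield {"sensor_id": i % 3, "reading": random.gauss(50, 10), "ts": time.time()}


def stream_consumer(events: Iterator[Dict], threshold: float = 70.0) -> None:
    """React to each pulse as it arrives; flag aperiodic anomalies immediately."""
    for event in events:
        if event["reading"] > threshold:
            print(f"anomaly: sensor {event['sensor_id']} read {event['reading']:.1f}")
        # otherwise the event might be aggregated, archived, or discarded


if __name__ == "__main__":
    warehouse = batch_extract([{"order_id": 1, "amount": 25.0}])  # periodic, bounded
    stream_consumer(sensor_stream())                              # continuous, unbounded
```

The point is not the code but the shape of the two feeds: one is finite and arrives on a schedule; the other is open-ended and has to be acted on as it arrives.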
Needless to say, comparatively little of this information is vectoring in from conventional OLTP systems. And that—as poet Robert Frost might put it—makes all the difference.
Big Data - Beyond Description
We’re used to thinking of data in terms of the predicates we attach to it. Now as ever, we want and need to access, integrate, and deliver data from traditional “structured” sources such as OLTP DBMSs or flat and CSV files. Increasingly, however, we’re alert to, or intrigued by, the value of the information we believe to be locked into “multi-structured” or so-called “unstructured” data, too. (Examples of the former include log files and event messages; the latter is usually a kitchen-sink category that encompasses virtually any data type.)

Even if we put aside the philosophical problem of structure as such (semantics is structure; schema is structure; a file type is structure), we’re confronted with the fact that data integration practices and methods must and will differ for each of these different “types.” The kinds of operations and transformations we use to prepare and restructure the normalized data we extract from OLTP systems for business intelligence (BI) reporting and analysis will prove insufficient (or simply inapposite) when brought to bear against these other types of data. The problem of accessing, preparing, and delivering unconventional types of data from unconventional types of sources—as well as of making this data available to a new class of unconventional consumers—requires new methods and practices, to say nothing of new (or at least complementary) tools.
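As a rough illustration of how preparation differs by “type,” the following sketch (standard-library Python; the column names, log format, and ticket payload are invented for this example) touches each category in turn: a normalized CSV extract, a multi-structured log line, and a free-text payload from which only a crude signal is pulled.

```python
# A hedged sketch of how preparation differs by "type" of data; the field
# names, log format, and payload below are invented for illustration only.

import csv
import io
import json
import re

# "Structured": a normalized OLTP-style extract -- columns are known in advance.
csv_extract = "customer_id,order_total\n42,199.99\n"
orders = [
    {"customer_id": int(r["customer_id"]), "order_total": float(r["order_total"])}
    for r in csv.DictReader(io.StringIO(csv_extract))
]

# "Multi-structured": a web-server-style log line -- structure must be teased out.
log_line = '203.0.113.7 - - [10/Oct/2024:13:55:36] "GET /cart HTTP/1.1" 500'
pattern = r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<req>[^"]+)" (?P<status>\d+)'
match = re.match(pattern, log_line)
event = match.groupdict() if match else {}

# "Unstructured"-leaning: a free-text payload; only a crude signal is extracted.
ticket = json.loads('{"id": 7, "body": "Checkout keeps failing with an error 500"}')
ticket["mentions_failure"] = "fail" in ticket["body"].lower()

print(orders, event, ticket, sep="\n")
```

Each of the three blocks above demands a different kind of preparation, which is the practical force of the argument: the transformations that serve the first case do little for the second and nothing for the third.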