Technology changes, sometimes fairly fast. In the world of database products, trends seem to pop up quickly, like prairie dogs testing the air. Pundits have proclaimed the death of relational database management systems in favor of such fashions as object-oriented, XML, columnar, Hadoop, and anything else, as each new opportunity arises. In an effort to pique customer interest in these latest trends, the pundits seek to establish a clamor for the latest and greatest by proclaiming each new item as offering faster, cheaper, and better successes.
Needless to say, RDBMSs have not dried up and vanished. And regardless of the marketing jargon, the platform, the vendor product purchased, or the open source utility downloaded, the one thing that remains unaltered is that, in order to extract value from data, the data must be understood. The individuals within any organization who actually comprehend the data, the data structures, and all the exceptions to the usual rules are considered critical resources.
Unstructured data is a loaded term. One can say that unstructured data is simply content that is not formally documented. However, even “formally documented” is a rather fuzzy term. For our discussion, “not formally documented” means that no relational database tables have been designed, defined, and populated to support the actual or posited implementation of an algorithmic process that somehow or other manipulates that content and places things in distinct containers.
One may speak of analysis and “big data” with products such as MapReduce, but one cannot leverage insights and expose subtle interrelationships without knowing exactly which bytes to pull out of the heap and what values contained within those bytes designate something of interest. The magic that makes any analysis work is in understanding what all this unstructured content means and how it all relates, or at least minimally recognizing the scattered bits that are of utility to the analytical purpose at hand.
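As a minimal sketch of that point, consider a map/reduce-style count over a few raw text records. Everything here is an illustrative assumption rather than a real system: the pipe-delimited layout, the sample records, the "status=LATE" marker, and the names map_line, reduce_counts, and late_counts. The count works only because the mapper encodes the developer's knowledge of which field, and which value within it, is of interest.

```python
from collections import Counter
from functools import reduce

# Hypothetical raw records: timestamp | customer id | free-text note.
# Nothing in the data declares this layout; the developer has to know it.
RAW_LINES = [
    "2013-01-05|C001|called about invoice; status=LATE",
    "2013-01-06|C002|status=OK, no action",
    "2013-01-07|C001|follow-up, status=LATE",
    "garbage line with no delimiters",
]


def map_line(line):
    """Emit (customer_id, 1) for a record flagged LATE, otherwise nothing."""
    parts = line.split("|")
    if len(parts) != 3:
        # An exception to the usual rule: skip malformed records.
        return []
    _, customer_id, note = parts
    # Knowing that "status=LATE" inside the note is the value of interest
    # is the comprehension that no tool supplies on its own.
    return [(customer_id, 1)] if "status=LATE" in note else []


def reduce_counts(totals, pair):
    """Fold one (key, count) pair into the running totals."""
    key, count = pair
    totals[key] += count
    return totals


def late_counts(lines):
    """Run the map step over every line, then reduce the emitted pairs."""
    pairs = (pair for line in lines for pair in map_line(line))
    return dict(reduce(reduce_counts, pairs, Counter()))


if __name__ == "__main__":
    print(late_counts(RAW_LINES))
```

If the assumed layout or the marker string were wrong, the code would still run and simply report nothing useful, which is the risk the paragraph above describes.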
After acquiring such comprehension, one may design the steps necessary to evaluate these data points and strategize on how to multithread things in order to obtain the evaluation results at lightning-fast speeds. While our “unstructured data” is unstructured, and is not “formally documented,” it still must be “informally documented.” Even “informal documentation” is imprecise terminology. The “unstructured data” must be clear within the mind of the developer, and then outlined, albeit obliquely, via the instantiated lines of code forming the processes to be executed. Those developers may write something else down that helps to describe things, but only maybe. One never knows now, does one?
Quantitative analysts, or quants as they are often called, may be employed to apply formal statistical analysis practices within an organization. Sometimes these individuals end up being the only ones who “know” the content of these unstructured data stores. This silo of knowledge becomes a limiting factor, because the default state then becomes that only these quants can perform ad hoc queries against this data.
Indeed, a search engine is both a valid use of data and a thing of value for its purpose of finding documents related to some idea. And yes, search engines have been designed to perform a very specific and involved analysis of very large quantities of often unstructured data. However, finding other kinds of statistically valid inferences requires other kinds of evaluations. Those additional inferences and evaluations do not appear of themselves.
To do those other kinds of evaluations, one needs to understand the evaluation process, the content, and even the structure of the data input for evaluation, regardless of whether that incoming data is “structured” or “unstructured.” Invoking the term “big data” or acquiring a NoSQL tool does not circumvent the need for comprehension.
About the author:
Todd Schraml has more than 20 years of IT management, project development, business analysis, and database design experience across many industries from telecommunications to healthcare. He can be reached at TWSchraml@gmail.com.