As a simple example, suppose the raw text contained this sentence: “President Ford drove a Ford.” If the general context were about motor cars, then Ford would be interpreted to be an automobile. If the general context were about the history of presidents of the U.S., then Ford would be interpreted to be a reference to a former president.
A New Technology - Textual Disambiguation - Determines Context for Unstructured Data
The other type of context is specific context. Specific context can be derived in many different ways: from the structure of a word, from the text surrounding a word, from the placement of words in proximity to each other, and so forth. There is a new technology called “textual disambiguation” that allows the context of raw unstructured text to be specifically determined. In addition, textual disambiguation places the output of its processing in a standard database format so that classical analytical tools can be used.
At the end of textual disambiguation, analytical processing can be done on the raw unstructured text that has now been disambiguated.
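To make the “President Ford drove a Ford” example concrete, here is a minimal sketch of how specific context — the words immediately surrounding an ambiguous term — can resolve its meaning and emit flat rows suitable for a relational table. The cue lists and the `disambiguate` function are hypothetical illustrations, not Inmon's actual Textual ETL technology.

```python
# Toy cue lists (assumptions for illustration only).
TITLE_CUES = {"president"}                         # "President Ford" -> the person
ARTICLE_CUES = {"a", "the", "his", "her", "new"}   # "a Ford" -> the automobile

def disambiguate(text):
    """Return (position, surface_form, sense) rows for each 'Ford' --
    a flat shape that could be loaded into a standard database table."""
    words = text.replace(".", " ").split()
    rows = []
    for i, w in enumerate(words):
        if w.lower() != "ford":
            continue
        prev = words[i - 1].lower() if i > 0 else ""
        if prev in TITLE_CUES:
            sense = "person"
        elif prev in ARTICLE_CUES:
            sense = "automobile"
        else:
            sense = "unknown"
        rows.append((i, w, sense))
    return rows

print(disambiguate("President Ford drove a Ford."))
# -> [(1, 'Ford', 'person'), (4, 'Ford', 'automobile')]
```

Once the text has been reduced to rows like these, classical analytical tools can query it the same way they query any other structured data.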
The Value of Determining Context of Unstructured Data
The determination of the context of unstructured data opens the door to many types of processing that previously were impossible. For example, corporations can now:
Read, understand, and analyze their corporate contracts en masse. Prior to textual disambiguation, it was not possible to look at contracts and other documents collectively.
Analyze medical records. For all the work done in the creation of EMRs (electronic medical records), there is still much narrative in a medical record. The ability to understand narrative and restructure that narrative into a form and format that can be analyzed automatically is a powerful improvement over the techniques used today.
Analyze emails. Today, after an email is read, it is placed on a huge trash heap and is never seen again. There is, however, much valuable information in most corporations’ emails. By using textual disambiguation, the organization can start to determine what important information is passing through its hands.
Analyze and capture call center data. Today, most corporations look at and analyze only a sampling of their call center conversations. With big data and textual disambiguation, now corporations can capture and analyze all of their call center conversations.
Analyze warranty claims data. While a warranty claim is certainly important to the customer who has made the claim, warranty analysis is equally important to the manufacturer to understand what manufacturing processes need to be improved. By being able to automatically capture and analyze warranty data and to put the results in a database, the manufacturer can benefit mightily.
And the list goes on and on. This short list is merely the tip of the tip of the iceberg when it comes to the advantages of being able to capture and analyze unstructured data. Note that with standard structured processing, none of these opportunities have come to fruition.
Some Architectural Considerations of Managing Big Data Through Textual Disambiguation
One of the architectural considerations of managing big data through textual disambiguation technology is that raw data on a big data platform cannot be analyzed in a sophisticated manner. In order to set the stage for sophisticated analysis, the designer must take the unstructured text from big data, pass the text through textual disambiguation, and then return the text to big data. In passing through textual disambiguation, the raw text is transformed into disambiguated text; the text written back into big data has had its context determined.
Once the context of the unstructured text has been determined, it can then be used for sophisticated analytical processing.
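The round trip described above — raw text out of big data, through disambiguation, and back in — can be sketched as follows. The in-memory dict standing in for the big data store, the key naming, and the placeholder `disambiguate` function are all assumptions for illustration; a real engine would resolve each ambiguous term rather than tag everything as unknown.

```python
def disambiguate(raw):
    # Placeholder for a real disambiguation engine: it would determine
    # the context of each term in the raw text.
    return [{"term": w, "context": "unknown"} for w in raw.split()]

# A stand-in for a big data store (in reality, a distributed file system).
big_data_store = {
    "raw/doc1.txt": "President Ford drove a Ford.",
}

def disambiguation_pass(store):
    """Write disambiguated records back into the store. The raw text
    stays in place, so some duplication of data is expected."""
    for key in [k for k in store if k.startswith("raw/")]:
        out_key = "disambiguated/" + key[len("raw/"):]
        store[out_key] = disambiguate(store[key])

disambiguation_pass(big_data_store)
print(sorted(big_data_store))
# -> ['disambiguated/doc1.txt', 'raw/doc1.txt']
```

Note that the store ends up holding both the raw and the disambiguated copies — exactly the duplication discussed in the next section, which the cheap big data infrastructure is designed to absorb.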
What’s Ahead in Data Warehousing
The argument can be made that the process of disambiguating the raw text and then rewriting it to big data in a disambiguated state increases the amount of data in the environment. Such an observation is absolutely true. However, given that big data is cheap and that the big data infrastructure is designed to handle large volumes of data, it should be of little concern that there is some degree of duplication of data after raw text passes through the disambiguation process. Only after big data has been disambiguated is the big data store fit to be called a data warehouse. And once the big data is disambiguated, it makes a really valuable and really innovative addition to the analytical data warehouse environment.
Big data has much potential. But unlocking that potential is going to be a real challenge. Textual disambiguation promises to be as profound as data warehousing once was. Textual disambiguation is still in its infancy, but then again, everything was once in its infancy. However, the early seeds sown in textual disambiguation are bearing some most interesting fruit.
About the author:
W. H. Inmon—the “father of the data warehouse”—has written 52 books published in nine languages. Inmon speaks at conferences regularly. His latest adventure is the building of Textual ETL—textual disambiguation—technology that reads raw text and allows raw text to be analyzed. Textual disambiguation is used to create business value from big data. Inmon was named by ComputerWorld as one of the 10 most influential people in the history of the computer profession, and lives in Castle Rock, Colo.