Unstructured Data: Overcoming Challenges to Reap the Rewards

By Peter Coppola

Jun 5, 2015

The popularity of social media, online shopping, online gaming, and smart devices, along with the massive growth of the Internet of Things (IoT), is creating tremendous amounts of data, much of which is unstructured. Unstructured data presents many challenges: it is hard to manage, datasets can be extremely large, and it has no predefined schema. Traditional tools such as relational database management systems (RDBMS) were not architected to store and retrieve unstructured data. Still, enterprises and service providers that manage to tame and mine unstructured data will be able to drive true business transformation based on the new insights it provides.

Distributed systems solve many of the challenges related to storing and retrieving unstructured data, including data consistency, performance, and availability. They can also deliver compelling business benefits, giving organizations the ability to collect, analyze, and gain insights from previously unconnected and unanalyzed data.

Why So Much Data?

Data is doubling in size every two years, and the total amount of data will reach 44 ZB (zettabytes) by 2020, with 80% of that unstructured, according to IDC Research. Sources, types, and volumes of data have changed in ways that we could not have imagined just a few years ago. Twitter, Facebook, online shopping, online gaming, and other everyday personal and business activities are creating a tremendous amount of data, much of it unstructured. We are also seeing more and more technology based on sensors communicating over the internet to applications, through what is commonly referred to as the Internet of Things (IoT).
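To make the doubling claim concrete, the short Python sketch below works backward from the 44 ZB figure under the assumption that the two-year doubling rate holds steadily. The function name and the back-projected years are illustrative only; they are arithmetic on the figures quoted above, not additional IDC data.

```python
def projected_zettabytes(year, anchor_year=2020, anchor_zb=44.0, doubling_years=2.0):
    """Volume implied by steady doubling, anchored at 44 ZB in 2020."""
    return anchor_zb * 2 ** ((year - anchor_year) / doubling_years)


UNSTRUCTURED_SHARE = 0.80  # the 80% share quoted above

for year in (2014, 2016, 2018, 2020):
    total = projected_zettabytes(year)
    unstructured = total * UNSTRUCTURED_SHARE
    print(f"{year}: {total:5.1f} ZB total, {unstructured:5.1f} ZB unstructured")
```

Under these assumptions, each two-year step halves the total when reading backward from 2020, which is why storage plans sized for today's volume tend to be outgrown quickly.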

When many people think of the IoT, they think of personal fitness monitors such as Fitbits, connected refrigerators, and security systems. However, the range of connected devices includes everything from home gas meters to airplane engines and from weather stations to pharmacy shelves, with millions of sensors placed all over the globe and applications creating tens of millions of data points daily. The innovations in IoT, and the related datasets, are expected to grow exponentially over the next few years.

Unstructured datasets can be extremely large and may be analyzed computationally to reveal patterns, trends, and associations, especially relating to human behavior and interactions. This includes geospatial tracking, web clickstreams, social media content, chat logs, and search data, along with any other data that doesn’t easily fit into a spreadsheet or relational database schema. The wider the range of unstructured data, the more likely it is that the data may lead to insightful analyses or correlations. More accurate analyses, of course, can lead to more confident decisions, and better decisions can mean greater operational efficiencies, increased productivity, cost reductions, and reduced risk.
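As a rough illustration of why such data resists a fixed schema, the Python sketch below parses three events and measures how often each field appears. The events and their field names are invented for this example and do not come from any real system.

```python
import json

# Hypothetical events; every field name here is made up for illustration.
raw_events = [
    '{"type": "click", "url": "/cart", "user": "u1"}',
    '{"type": "chat", "user": "u2", "text": "where is my order?"}',
    '{"type": "gps", "device": "d9", "lat": 40.7, "lon": -74.0}',
]

events = [json.loads(line) for line in raw_events]


def field_coverage(records):
    """Map each field name to the fraction of records that carry it."""
    counts = {}
    for record in records:
        for key in record:
            counts[key] = counts.get(key, 0) + 1
    return {key: n / len(records) for key, n in counts.items()}


coverage = field_coverage(events)
for field, share in sorted(coverage.items()):
    print(f"{field:8s} present in {share:.0%} of events")
```

Forcing these three events into one table would require seven columns, most of them empty in any given row. A store that accepts each record as-is avoids that redesign every time a new kind of event appears.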

Seeing the Light—Reaping the Rewards of Big Data

How do companies benefit from working with unstructured data? Correlating seemingly disparate data points, and adjusting business strategy and execution accordingly, can generate sizable gains in productivity and revenues. Consider a large retail grocer that, drawing on weather data, determined that certain atmospheric conditions drive people to buy some items over others. On still days with highs of less than 80 degrees, people respond well to berry ads and specials, buying three times as many as usual. On warm, dry days with high winds, people favor steak, but if the wind dies and the temperature rises, they go for burgers. Aligning beef ads with the shift in weather has increased sales by 18%. Without the data stored and made readily available through a NoSQL database, these realizations, and the resulting revenue gains, would never have come to light.

Scalability, Performance, and Global Availability

Most applications and databases work well at small scale. However, those working with vast amounts of unstructured data need the ability to scale up, down, out, and in. To scale vertically (or scale up) means to add resources to a single node in a system, typically by adding faster CPUs or more memory to a single computer. To scale horizontally (or scale out) means to add more nodes to a cluster, such as adding a new computer (typically commodity hardware) to a distributed system. Not all distributed systems, NoSQL databases included, are alike. Effective NoSQL databases must be able to scale both out and in, as well as up and down, predictably and reliably.
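One common way to make scaling out and in predictable is consistent hashing, sketched below in Python. This is a minimal illustration and not the method of any particular database; the HashRing class, the node names, and the choice of 64 ring positions per node are all assumptions made for the example.

```python
import bisect
import hashlib


def point(value):
    """Place a string on the ring as a large integer."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes  # ring positions per node, to even out the load
        self.ring = []        # sorted list of (position, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        """Scale out: place the new node's positions on the ring."""
        for i in range(self.vnodes):
            bisect.insort(self.ring, (point(f"{node}#{i}"), node))

    def remove_node(self, node):
        """Scale in: drop every position owned by the node."""
        self.ring = [entry for entry in self.ring if entry[1] != node]

    def node_for(self, key):
        # First ring position at or after the key's position, wrapping around.
        index = bisect.bisect_right(self.ring, (point(key),))
        return self.ring[index % len(self.ring)][1]


keys = [f"key-{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # add a fourth machine to the cluster
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

The point of the design is that adding a node relocates only the keys the new node takes over, rather than reshuffling the whole dataset, and removing the node hands those keys back to their previous owners.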

Data Consistency

One of the important differences and key advantages of NoSQL systems is the concept of relaxed consistency. This relaxed consistency, known as eventual consistency, means that the system will always respond to a request, but it may not respond with the most recent version of an object. The CAP theorem describes a natural tension, and the resulting trade-offs, between three core operational capabilities in distributed systems and database infrastructure: consistency, availability, and partition tolerance.
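The behavior can be sketched with two in-memory replicas in Python. This is a simplified model that assumes each write carries a version number and that the highest version wins; production systems generally use richer conflict-resolution schemes, and the Replica class and anti_entropy function are hypothetical names.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (version, value)

    def write(self, key, value, version):
        """Accept the write only if it is newer than what is already stored."""
        current = self.store.get(key)
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def read(self, key):
        entry = self.store.get(key)
        return None if entry is None else entry[1]


def anti_entropy(replicas):
    """Background sync: copy the newest version of each key to every replica."""
    newest = {}
    for replica in replicas:
        for key, (version, value) in replica.store.items():
            if key not in newest or version > newest[key][0]:
                newest[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in newest.items():
            replica.write(key, value, version)


a, b = Replica("a"), Replica("b")
for replica in (a, b):
    replica.write("cart", ["milk"], version=1)

a.write("cart", ["milk", "berries"], version=2)  # update reaches only replica a

stale = b.read("cart")  # b still answers, but with the older value
anti_entropy([a, b])    # replicas exchange state in the background
fresh = b.read("cart")  # b has now converged on the newer value

print("before sync:", stale)
print("after sync: ", fresh)
```

Replica b never refuses the read. It simply answers with what it has, and the replicas agree once the background sync has run.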

At first reading, the CAP theorem seems to imply that distributed systems cannot deliver perfect consistency, availability, and partition tolerance all at the same time. However, by relaxing the requirement for perfect consistency to allow for eventual consistency, distributed system parameters can be tuned to meet particular application requirements.
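One widely used set of tuning parameters, assumed here for illustration since the discussion above does not name specific ones, is the replica count N together with the number of replicas that must acknowledge a read (R) and a write (W). The Python sketch below shows the trade-off; quorum_profile is a hypothetical helper, not part of any library.

```python
def quorum_profile(n, r, w):
    """Summarize the consistency and availability trade-off for N, R, and W."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return {
        # If R + W > N, every read set shares a replica with the last write set.
        "read_overlaps_latest_write": r + w > n,
        # Replicas that can be unreachable while requests still succeed.
        "read_failures_tolerated": n - r,
        "write_failures_tolerated": n - w,
    }


for r, w in [(1, 1), (2, 2), (3, 1)]:
    print(f"N=3 R={r} W={w}:", quorum_profile(3, r, w))
```

Lower values of R and W favor availability and speed. Raising them until R + W exceeds N buys stronger consistency at the cost of tolerating fewer unreachable replicas.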

Distributed systems extend beyond the realm of write once, read many found in RDBMS methodologies. In the world of relational databases, strong consistency has reigned as a requirement. In the new world of distributed systems, users must weigh carefully whether an application truly requires strong consistency or can work with eventual consistency.

Data Gravity

As data accumulates (builds mass), there is a greater likelihood that additional services and applications will be attracted to the data. Services and applications can have their own gravity, but data is the most massive and dense, so it has the most gravity. Data, if large enough, can be virtually impossible to move.

“Data gravity” is a term coined by Dave McCrory, CTO of Basho, to describe how the greater mass of data draws services and applications to it. The closer services and applications are to the data that they access, the higher the throughput and the lower the latency. In turn, those applications and services will become more reliant on high throughput and low latency. The requirement for applications to have high throughput and low latency increases the need for data to be located close to applications and services.

This is important because the customer experience matters. When access is slow, application usage drops along with productivity. When end users are closer to data operations, they receive better response times and a better end-user experience. The data locality capabilities of distributed systems enable data operations close to end users. Distributed systems must be designed for low latency and high throughput to ensure a great user experience across the globe.
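The effect of distance on response time can be estimated with a little physics. The Python sketch below assumes light in optical fiber covers roughly 200 km per millisecond and ignores routing, queuing, and processing delays, so the result is a lower bound rather than a measurement. The replica names and coordinates are hypothetical.

```python
import math

FIBER_KM_PER_MS = 200.0  # assumption: about two-thirds of the speed of light
EARTH_RADIUS_KM = 6371.0

# Hypothetical replica sites as (latitude, longitude) in degrees.
REPLICAS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-southeast": (1.35, 103.8),
}


def great_circle_km(a, b):
    """Haversine distance between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def min_round_trip_ms(a, b):
    """Best-case round trip: out and back over the shortest path in fiber."""
    return 2 * great_circle_km(a, b) / FIBER_KM_PER_MS


def nearest_replica(user):
    return min(REPLICAS, key=lambda name: great_circle_km(user, REPLICAS[name]))


paris = (48.86, 2.35)
for name, site in REPLICAS.items():
    print(f"{name:13s} at least {min_round_trip_ms(paris, site):6.1f} ms round trip")
print("nearest replica:", nearest_replica(paris))
```

No amount of tuning removes this floor, which is why serving a user from a nearby copy of the data matters more as applications become sensitive to latency.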

Managing Unstructured Data Is Critical for Modern Business

Massive increases in data, greatly fueled by the IoT, are pouring into our networks and systems, altering the nature of how we collect and analyze data. Distributed systems solve the challenges posed by enormous amounts of unstructured data. By providing scalability, global availability, fault tolerance, performance, and operational simplicity, distributed systems enable the business benefits that come from storing and retrieving unstructured data. The IoT and the flood of unstructured data are forcing companies, whole industries, and markets to evolve. This evolution can either be an opportunity for growth or render obsolete the companies that can’t keep up.
