Companies are scrambling to learn all the ways they can slice, dice, and mine big data coming in from across the enterprise and across the web. But with the rise of big data, which can mean hundreds of terabytes or even petabytes, comes the challenge of where and how all of this information will be stored. For many organizations, current storage systems (disks, tapes, virtual tapes, cloud storage, in-memory systems) are not ready for the onslaught, industry experts say. New methodologies and technologies coming on the scene may help address this challenge. But one thing is certain: Whether organizations manage their data in their internal data centers or in the cloud, a lot more storage is going to be needed. As Jared Rosoff, director of customer engagement with 10gen, puts it: “Big data means we need ‘big storage.’”
Big data is proving to be a tall order for storage. “Over the next decade, the number of servers worldwide — both virtual and physical — will grow 10-fold, the amount of information managed by enterprise data centers 50-fold, and the number of files the data center will manage, 75-fold,” Teresa Worth, senior product marketing manager with Seagate Technology, tells DBTA. “The big question for any business that stores, manages, and delivers data is: ‘Can we keep up?’”
As companies recognize and build on the opportunities seen with big data, there will be even greater pressure on IT departments to keep up with big storage. “Support for analytics and business intelligence itself will be a major contributor to storage growth,” says Amy Price, senior consultant with Dell’s Storage Products Group. “Even an organization doing simple analytics today will want to keep all the data. They know they’ll come up with better questions and models to answer them down the road, making historical data invaluable.”
Along with business initiatives to capture and maintain transaction and analytic data, regulations, policies, and mandates are also driving the need for big storage, especially since data must be preserved to meet auditing requirements and in the face of potential litigation. “The Petabyte Challenge: 2011 Database Growth Survey,” conducted by Unisphere Research among 611 data managers who are members of the Independent Oracle Users Group (IOUG), finds that more than one-third of companies now store data in their archive systems for more than 7 years, either because of company policy or compliance mandates. In fact, 12% of respondents say they simply hang on to all data “forever.”
This creates a number of challenges, of course. Seven out of 10 respondents in the IOUG survey say this has resulted in a need for more hardware resources. Close to half also cite the increased complexity of managing data that needs to be saved for years, possibly decades, yet still be accessible on very short notice.
The problem, says Steve Wojtowecz, vice president of storage software development at IBM, is “businesses are turning into data hoarders and spending too much time and money collecting useless or bad data. Companies are hesitant to delete any data (and many times duplicate data) due to the fear of needing specific data down the line for business analytics or compliance purposes.”
Big data requires enterprises to budget for, purchase, and manage more storage and drives. “These drives must be bigger and faster to satisfy customer expectations of receiving data instantly, however, wherever, and whenever they want it,” Worth explains. Yet big storage needs to be more than simply throwing more disk drives at growing data volumes, Worth and other storage industry experts argue. The trouble, as the recent IOUG survey finds, is that this is exactly what companies are doing. Half of the survey respondents say data growth is currently outpacing storage capacity, and two out of three say they normally react to performance issues by upgrading their server hardware and processors. A majority, 53%, are upgrading or expanding their storage systems in response to big data.
Such an emphasis on storage systems has its price. “Consider this—in virtualized environments today, over 60% of the cost of the IT infrastructure is in storage,” says Ed Lee, lead architect for Tintri, formerly principal systems architect at Data Domain and an original member of the RAID team at the University of California at Berkeley. “Storage is the most expensive component to purchase, the most costly to maintain, and the source of the most difficult management and performance problems. In other words, the storage is the bottleneck.”
The need for high availability increases the costs and workloads of storage systems. “A fair amount of computation has to be done to aggregate unstructured data into a data store,” Scott Metzger, vice president of analytics for Violin Memory, tells DBTA. “Tools such as Hadoop have been built to tackle this variety but require significant technical depth to work at a lower level to derive characteristics. If these tools and the information they store need to be highly available, a lot of hardware is needed.”
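As a concrete illustration of the kind of aggregation work Metzger is describing, the sketch below shows a Hadoop Streaming mapper and reducer that roll unstructured log lines up into counts per event type. It is a minimal sketch, not any vendor's tooling; the log layout, the assumption that the third token is the event type, and the script itself are illustrative.

```python
#!/usr/bin/env python
"""A minimal sketch of a Hadoop Streaming mapper/reducer pair that aggregates
unstructured log lines by an assumed event-type field. The log layout and
field position are illustrative assumptions, not taken from the article."""
import sys


def mapper():
    # Emit key<TAB>1 for each well-formed line; skip anything malformed.
    for line in sys.stdin:
        parts = line.split()
        if len(parts) >= 3:            # assumed: third token is the event type
            print(f"{parts[2]}\t1")


def reducer():
    # Hadoop sorts mapper output by key before the reduce phase,
    # so equal keys arrive contiguously and can be summed in one pass.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key and current_key is not None:
            print(f"{current_key}\t{count}")
            count = 0
        current_key = key
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")


if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```

Hadoop Streaming runs scripts like this over stdin and stdout across the cluster, which is where the hardware footprint Metzger mentions comes in: keeping both the input data and the job infrastructure highly available multiplies the servers and disks involved.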
These challenges are beyond the scope of traditional storage technologies. “The performance of traditional storage technologies makes it difficult for applications to meet user demands, especially at scale,” according to Bobby Patrick, chief marketing officer for Basho Technologies. “Storing this data at scale is cost-prohibitive.”
“Right now, most big data is scattered all over the enterprise,” Mike Spindler, practice manager for Datalink, tells DBTA. “It can often be a mish-mash of systems, like a variety of servers and disk that have been patched together. One example is in genome research or life sciences. A consortium of universities is involved in this. Yet, if you walked into one of those before they had bought some type of centralized storage to hold their big data, you might have found a large collection of Best Buy 2TB USB drives plugged in together!”
With vast amounts of data spread across the enterprise, “new issues emerge when aggregating data for analysis or managing and updating the internal storage capabilities of the corporation,” agrees David Meadows, managing director of discovery consulting at Kroll Ontrack. “Adding storage capabilities and upgrading the current environment is a difficult process to stay on top of, and can become extremely expensive,” he tells DBTA.
Legacy Storage
While many enterprises still rely on traditional storage systems based on RAID and replication, “they are beginning to confront the fact that traditional storage solutions are bound to fail when challenged to store massive amounts of unstructured data,” says Russ Kennedy, vice president of product strategy, marketing, and customer solutions at Cleversafe. “Current data storage systems based on RAID arrays were not designed to scale to increased levels of data growth. Most IT organizations using RAID for big data storage incur additional costs to copy their data two or three times to protect it from inevitable data loss.”
Also at issue is the long-term viability of digital storage technologies. “Digital storage can in many ways be more perishable than paper,” Wojtowecz says. “Disks corrode, bits rot, and hardware becomes obsolete. This raises the real concern of a digital dark age, in which storage techniques and formats created today may no longer be viable as the technology originally used becomes antiquated. We’ve seen this happen; take the floppy disk, for example. It was a storage tool so ubiquitous that people still click on its enduring icon to save their Word, presentation, or spreadsheet documents, yet most Millennials have never seen one in person.”
What happened to the storage area network, or SAN, which, like cloud, was based on a virtual pool of resources pulled from any and all parts of the enterprise or beyond? “We see the SAN, particularly the Fibre Channel SAN, as the original private cloud architecture,” Len Rosenthal, vice president of marketing for Virtual Instruments, tells DBTA. “It was the first attempt to separate computing resources from storage resources, and was wildly successful because the SAN enabled storage devices (data) to be virtually connected to every server in the SAN with centralized management and provisioning. Today, when IT organizations deploy business-critical applications in what is now known as a private or internal cloud, it is nearly always deployed with Fibre Channel storage arrays, as they offer proven performance and reliability in even the largest data centers in the world.”
Not everyone agrees. “SAN is not going away entirely, but in many cases it’s being replaced by local storage,” Rosoff says. “One of the benefits of horizontal scalability is that you can leverage the local disks in each server, rather than centralized storage. This reduces costs associated with the SAN gear itself as well as optimizes latency by moving storage closer to the database and compute nodes that need it.”
Smart Storage Strategies
While storage technology continues to follow its own path of Moore’s Law — growing denser and capable of handling more data every year — smarter storage management strategies are called for. “Data is all over — stored on spinning disk, stored on flash, stored on SSD, stored on tape,” says David Chapa, chief technology evangelist for Quantum. “The bigger question is: How does the data get to all the tiers of storage — and more importantly the right tiers of storage at the right time in the lifecycle of big data?”
As data grows, the reflex reaction by most organizations is to buy and install more disk storage, the IOUG survey finds. Smart approaches are on the horizon but still only prevalent among a minority of companies. Close to one-third now embrace tiered storage strategies, and only one out of five is putting information lifecycle strategies into place to better and more cost-effectively manage their data.
Effective approaches require intelligent management — and strategies such as cloud computing may come into play, when companies decide they need a better way to manage and share resources such as data and storage. “Administrators need to put data into tiers, with the frequently accessed data in high-performance storage and other data in slower, cheaper storage,” says Denny LeCompte, vice president of product management for SolarWinds. “Some cloud proponents would say that all data needs to be highly accessible, but that can be cost-prohibitive.”
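As a rough sketch of the tiering logic LeCompte describes, the snippet below assigns data to a storage tier based on how recently it was accessed. The tier names and age thresholds are illustrative assumptions, not a prescription from any of the vendors quoted here.

```python
"""A minimal sketch of an access-based tiering policy: recently touched data
sits on fast storage, colder data on cheaper tiers. Tier names and age
thresholds below are illustrative assumptions."""
from datetime import datetime, timedelta
from typing import Optional

# Ordered hot-to-cold; None means "no upper age limit" (the catch-all tier).
TIERS = [
    ("flash", timedelta(days=7)),    # accessed within the last week
    ("sata",  timedelta(days=90)),   # accessed within the last quarter
    ("tape",  None),                 # everything older
]


def choose_tier(last_accessed: datetime, now: Optional[datetime] = None) -> str:
    """Return the tier an object should live on, given its last access time."""
    age = (now or datetime.utcnow()) - last_accessed
    for tier, max_age in TIERS:
        if max_age is None or age <= max_age:
            return tier
    return TIERS[-1][0]


# Example: data last read 30 days ago lands on the mid-cost tier.
print(choose_tier(datetime.utcnow() - timedelta(days=30)))  # -> "sata"
```

In a real deployment this decision is usually made by automated tiering software rather than hand-written scripts, but the policy it enforces takes essentially this shape.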
Companies “have to figure out a couple of things in parallel,” says Dell's Price. “They have to choose a storage solution that will work with the data analytics and business intelligence tools that they want to bring to bear on big data. Then, they have to pick a storage approach that will work with the other software tools they want to use.”
Finally, backup and restore strategies are another area impacted by the pressure to develop effective big storage strategies. “What data growth is forcing IT to do is move away from traditional, one-size-fits-all approaches to data protection and instead look at implementing tiered recovery strategies,” according to Greg Davoll, director of product marketing and data protection for Quest Software. This tiered recovery strategy is based on determining “the value of a given set of data or services to the organization.” He suggests that “more current technologies be deployed in targeted parts of the environment to ensure that mission-critical applications and databases can be rapidly restored, whereas more traditional approaches can be used to back up data that’s less critical.”
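A tiered recovery strategy can be expressed as a simple mapping from data class to protection method. The sketch below is one illustrative way to capture Davoll's idea; the class names, methods, and recovery-point (RPO) and recovery-time (RTO) targets are assumptions, not Quest Software's actual policy.

```python
"""A minimal sketch of a tiered recovery policy: protection is sized to the
value of the data. All class names, methods, and targets are illustrative."""

RECOVERY_TIERS = {
    "mission_critical":   {"method": "continuous replication plus snapshots", "rpo": "minutes",  "rto": "minutes"},
    "business_important": {"method": "disk-based backup with deduplication",  "rpo": "hours",    "rto": "hours"},
    "archival":           {"method": "traditional nightly backup to tape",    "rpo": "24 hours", "rto": "days"},
}


def protection_for(data_class: str) -> dict:
    """Look up the protection policy for a data set's assigned class."""
    return RECOVERY_TIERS.get(data_class, RECOVERY_TIERS["archival"])


print(protection_for("mission_critical")["method"])
```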
Emerging Storage Technologies
There is a range of alternatives to traditional storage media and platforms (tape, disk, or even the storage area network) that promise greater capacity to handle growing data loads. These include Serial ATA (SATA) disk drives, solid state drives, flash memory, object stores, and cloud-based storage. Hu Yoshida, chief technology officer for Hitachi Data Systems, tells DBTA that new developments are leading to a “convergence” of server, storage, networks, and applications from across the enterprise within new approaches and technologies. “The exponential growth of unstructured data and mobile applications has created the requirement for larger file systems and scalable block storage systems,” he points out. “Beyond this, big data requires the integration and management of file, block, and object data. Convergence will translate into greater storage efficiencies by eliminating three major costs: backup for data protection; extracting, transforming, and loading for data analysis; and managing silos of file, block, and object data.”
SATA Disk Drives
“SATA disk capacities are growing from 2TB to 3TB, with 4TB SATA drives likely just around the corner,” Datalink's Spindler tells DBTA. “The only challenge with SATA disk is its data I/O, or disk access speed. You can write things to SATA disk quickly, but you can’t read them back very fast, because SATA random read performance is slower. Despite its access speed limitations, SATA still has a significant place in big data environments.”
Solid State Drives
Solid state drives are another technology that shows promise for big storage implementations, and many companies are making new purchases in this area. “We definitely see an uptake of SSD for I/O-intensive applications, more so than, say, 18 months ago,” according to Kris Domich, principal data center consultant for Dimension Data Americas. “The bigger trend lately seems to be more usage of lower-cost, high-density disk. This is primarily due to clients realizing that it is possible to provide the performance, capacity, and reliability needed for most of their data using lower-cost disk drives.”
Solid state drives “are making inroads within the data center but are still primarily used for application acceleration, data caching, and web or database indexing,” Worth agrees. “Data centers are investing in tools such as automated tiering to better align the type of data storage to the requirements of the data — in terms of how often the data is accessed, or how important the data is to business goals.”
Flash Memory
Flash memory is another form of media that will help enable the move to big storage, says Violin Memory’s Metzger. “Flash memory arrays can collocate shared storage volumes to support data being stored in a variety of data models, providing low-latency access between core relational databases and big data stores to support real-time services such as decision support, analytics, and risk management.”
Flash memory “delivers a more than 100-times increase in input/output operations per second, and a 100-times reduction in latency, relative to hard drives,” says Dr. John Busch, co-founder and CTO of Schooner Information Technology. He expects rapid adoption of flash memory going forward, as it “will store an increasingly large portion of the big data for mission-critical applications, and will also play a key role in data center storage hierarchies and in cloud storage.”
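A quick back-of-the-envelope calculation shows how figures on the order of Busch's "100 times" claims arise from commonly cited device characteristics. The numbers below are illustrative assumptions, not measurements from the article.

```python
"""Back-of-the-envelope comparison of flash and hard drive performance,
using commonly quoted, illustrative device characteristics."""
hdd_iops, hdd_latency_ms = 150, 10.0        # typical 7,200 RPM hard drive
flash_iops, flash_latency_ms = 20_000, 0.1  # typical flash storage device

print(f"IOPS gain:         ~{flash_iops / hdd_iops:.0f}x")              # ~133x
print(f"Latency reduction: ~{hdd_latency_ms / flash_latency_ms:.0f}x")  # ~100x
```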
Cloud
Newer options, such as the cloud, offer opportunities to move forward, but the technology is not yet mature, Rosoff points out. “There’s a way to go before this technology is fully mature and ready for arbitrary workloads,” he says. “The movement to cloud gives new levels of scale, but often the performance of that cloud storage can be unpredictable. Also, it’s often difficult or impossible in the cloud to gain access to newer storage media like flash and SSD. So we still see many high-performance applications being deployed on bare metal. But if you have a workload that is not sensitive to fluctuations in storage performance, then cloud-based storage is appropriate.”
“A growing ambition by large enterprises is to operate their own private storage cloud, with the same characteristics and benefits underlying a public storage cloud, such as Amazon S3,” says Patrick. “A strategy we call ‘S3 without AWS.’ Companies are tapping into this market by offering an S3-equivalent solution that enterprises can deploy in their own data centers and also gain control and flexibility benefits. Long-term, I believe enterprises will utilize a combination of public and private storage clouds determined by access and latency requirements, security, and cost.”
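In practice, the “S3 without AWS” pattern means the application keeps using ordinary S3 client code while the endpoint points at an S3-compatible service running inside the company's own data center. The sketch below, using the boto3 library purely for illustration, shows the idea; the endpoint URL, bucket, key, and credentials are placeholders.

```python
"""A minimal sketch of the "S3 without AWS" pattern: standard S3 client code
pointed at an S3-compatible endpoint in a private data center. The endpoint
URL, bucket, key, and credentials are placeholders."""
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="https://storage.internal.example.com",  # private S3-compatible endpoint
    aws_access_key_id="LOCAL_ACCESS_KEY",
    aws_secret_access_key="LOCAL_SECRET_KEY",
)

# Application code looks exactly as it would against the public Amazon S3 service.
s3.put_object(Bucket="analytics-archive", Key="2011/q4/events.json", Body=b"{}")
obj = s3.get_object(Bucket="analytics-archive", Key="2011/q4/events.json")
print(obj["Body"].read())
```

Because the client-side code is unchanged, moving a workload between a public storage cloud and a private one becomes largely a matter of swapping the endpoint and credentials, which is the flexibility Patrick is pointing to.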