Big data is a commonly used term for large, unstructured or semi-structured datasets that rely on cost-effective, horizontally scalable software and hardware to manage and access the data. The amount of data that organizations analyze continues to grow; by some estimates, roughly 2.5 exabytes of data are generated each day. The workflow of acquiring, storing, and performing analytics on that data has created a very large ecosystem of products that work together to address application requirements.
When implementing a solution to get the most out of the data, it is important to set users’ expectations regarding analytics and reporting capabilities, as well as the throughput and response time of the solution. In today’s fast-paced environment, interactive users of NoSQL applications have come to expect low and predictable response times from the systems they use.
Modern interactive applications often use NoSQL technologies due to performance, scalability, and cost considerations. The response time, or latency, of an operation is often the most important performance metric and is typically measured in milliseconds. When a user performs an inquiry, whether through an on-premises application or through a web interface, the response needs to be fast and predictable. It is important to design a system that offers not just a fast response, but a predictable and repeatable response time.
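Predictability is easier to reason about when latency is measured as a distribution rather than a single average. The sketch below is a minimal, product-agnostic example: `run_query` is a stand-in for whatever operation the application issues against the database, and the percentile reporting is one simple way to expose jitter that an average would hide.

```python
import time
import statistics

def measure_latency(run_query, iterations=1000):
    """Time repeated calls to run_query and summarize the latency distribution.

    run_query is a stand-in for whatever operation the application issues
    against the database (a key lookup, a small range scan, etc.).
    """
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        run_query()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    samples_ms.sort()
    p50 = statistics.median(samples_ms)
    p99 = samples_ms[int(0.99 * (len(samples_ms) - 1))]
    # A large gap between the median and the 99th percentile is a sign of
    # unpredictable response times, even when the average looks acceptable.
    print(f"p50={p50:.2f} ms  p99={p99:.2f} ms  max={samples_ms[-1]:.2f} ms")
```

Running such a probe at startup and again after hours or days of use is a simple way to compare the tail of the distribution over time, not just the median.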
End-to-end predictable performance of an application depends on a number of factors. The design of the database software is perhaps the most important, along with the architecture of the end-user application. In addition, as the amount of data grows, overall system performance should remain predictable if the solution was designed and implemented with user expectations in mind. An area that tends to be ignored in many systems is what happens over time, even when the data does not grow.
When selecting a system that will be used for long periods of time, as opposed to a simple, short-lived benchmark, there are questions that should be posed and understood before committing to a solution, including:
- Does the system performance improve or decline as the data sets get larger?
- Does the system performance slow down when the number of users grows?
- Is performance better at startup than over the next hours and days?
- Does the performance vary widely during the course of use?
- Is there an optimal setup for both the hardware and software?
- Which components of the system (CPU, memory, networking, storage) might introduce variability and undermine predictability?
Overall performance predictability relies on both the hardware and the software. On the hardware side, performance can be made more predictable by providing sufficient processing capability and by sizing the network and storage devices appropriately. Assuming the software is designed in a client-server model with a networking component, a typical system has a number of components that must work together in a predictable manner.
Figure 1 – Single server
- Starting at the center of the system are the cores in the CPU. A software system should be designed not to oversubscribe the computational capacity of the CPUs, so that processes and threads do not have to compete for CPU resources; otherwise, contention occurs and performance jitters. One common way to avoid this is to size worker pools to the available core count, as shown in the sketch following these bullets.
Figure 2 - Multiple servers with networking
- Networking is also critical for determining and maintaining performance. A network that is oversubscribed will give unpredictable performance, as messages must compete for wire bandwidth. InfiniBand is an excellent choice for server-to-server communication.
- Storage devices are an equally critical component of a NoSQL implementation, and the hard disk drive is typically the slowest subsystem in the entire environment. NoSQL software is designed to be a very fast ingest engine: its main purpose is to store tremendous amounts of information quickly, which inevitably requires persistent storage. A design that properly distributes the I/O is therefore critical for overall performance. In the figures below, A is an unbalanced network and I/O system, while B is balanced, because the data transferred between the different servers in B is evenly distributed.
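As a concrete illustration of the CPU point above, the following sketch caps a process pool at the core count reported by the operating system instead of spawning one worker per request. The pool-sizing heuristic and the `handle_request` function are illustrative assumptions, not part of any particular NoSQL product.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def handle_request(payload):
    """Stand-in for CPU-bound per-request work (parsing, compression, etc.)."""
    return sum(ord(c) for c in payload)

# Cap the pool at the core count so workers do not contend for CPU time.
# Oversubscribing (for example, hundreds of busy workers on 16 cores) forces
# the scheduler to interleave them, which shows up as latency jitter.
core_count = os.cpu_count() or 1

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=core_count) as pool:
        results = list(pool.map(handle_request, [f"request-{i}" for i in range(1000)]))
```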
As mentioned, a critical aspect of a system that must store and retrieve massive amounts of data is distributing the data over multiple storage devices (hard disk drives or solid-state drives). By using an efficient algorithm to distribute writes across multiple servers, each containing multiple drives, performance not only increases but becomes more predictable as well. For example, even at the SATA III interface limit of roughly 600 MB/s, writing 4 TB of data to a single disk would take close to two hours; spreading the writes over 4 devices on 4 different machines (to avoid internal bus contention) would cut that to roughly a quarter of the time, or about half an hour. Of course, these are theoretical maximums that are unlikely to be achieved in practice; however, the example illustrates the benefit of distributing data capture over multiple disks.
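A minimal sketch of one such distribution scheme follows, assuming a fixed fleet of servers each holding a fixed set of drives; the server and drive counts and the `write_block` stub are illustrative, not taken from any specific product. Each record key is hashed to pick a server and a drive on that server, so writes spread roughly evenly across all devices.

```python
import hashlib

SERVERS = ["node1", "node2", "node3", "node4"]   # assumed 4-server cluster
DRIVES_PER_SERVER = 4                            # assumed drives per server

def placement(key: str):
    """Map a record key to a (server, drive) pair using a stable hash."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    server = SERVERS[digest % len(SERVERS)]
    drive = (digest // len(SERVERS)) % DRIVES_PER_SERVER
    return server, drive

def write_block(key: str, data: bytes):
    """Stub: a real system would ship the data to the chosen server and drive."""
    server, drive = placement(key)
    print(f"{key} -> {server}, drive {drive} ({len(data)} bytes)")

# With a reasonably uniform hash, incoming data spreads across all 16 drives,
# so writing 4 TB puts roughly 250 GB on each drive instead of placing the
# full load on a single device.
for i in range(8):
    write_block(f"record-{i}", b"x" * 1024)
```

A simple modulo hash like this assumes a static cluster; systems that add or remove nodes typically use consistent hashing or range partitioning so that a topology change does not reshuffle every key.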