If there is a "bottom line" to measuring the effectiveness of your big-data applications, it's arguably performance: how quickly those apps can finish the jobs they run. Consider Spark. Spark is designed for in-memory processing across a wide range of data processing scenarios. Data scientists use Spark to build and verify models; data engineers use it to build data pipelines. In both scenarios, Spark gains performance by caching the results of operations that are repeated over and over, then discarding those caches once the computation is done.
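As a minimal PySpark sketch of that pattern (the dataset path and column names here are hypothetical), an intermediate result that several actions reuse can be cached once and released when the work is finished:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-example").getOrCreate()

# Hypothetical input dataset
events = spark.read.parquet("/data/events.parquet")

# Cache the filtered result that multiple downstream actions will reuse
recent = events.filter(events.ts > "2017-01-01").cache()

# Both actions read from the in-memory cache instead of recomputing from the source
recent.count()
recent.groupBy("user_id").count().show()

# Discard the cached blocks once the computation is done
recent.unpersist()

spark.stop()
```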