Since its beginning as a project aimed at building a better web search engine for Yahoo – inspired by Google’s well-known MapReduce paper – Hadoop has grown to occupy the center of the big data marketplace. From data offloading to preprocessing, Hadoop is not only enabling the analysis of new data sources amongst a growing legion of enterprise users; it is changing the economics of data. Alongside this momentum is a budding ecosystem of Hadoop-related solutions, from open source projects like Spark, Hive and Drill, to commercial products offered on-premises and in the cloud. These new technologies are solving real-world big data challenges today.
Tuesday, May 22: 9:00 a.m. - 9:45 a.m.
We, of course, will never know everything. But with the arrival of Big Data, machine learning, data interoperability, and all-to-all connections, our machines are changing the long-settled basics of what we know, how we know, and what we do with what we know. Our old—ancient—strategy was to find ways to narrow knowledge down to what our 3-pound brains could manage. Now it’s cheaper to include it all than to try to filter it on the way in. But in connecting all those tiny datapoints, we are finding that the world is far more complex, delicately balanced, and unruly than we’d imagined. This is leading us to switch our fundamental strategies from preparing to unanticipating, from explaining to optimizing, from looking for causality to increasing interoperability. The risks are legion, as we have all been told over and over. But the change is epochal, and the opportunities are transformative.
David Weinberger, Harvard metaLAB and Harvard Berkman Klein Center
Tuesday, May 22: 9:45 a.m. - 10:00 a.m.
Only a small fraction of global firms increase productivity year after year, according to the Organisation for Economic Co-operation and Development (OECD). Creating and using unique stocks of data capital is one of the key tactics these firms use to widen their lead. Come learn how two new ideas—data trade and data liquidity—can help all companies, not just superstars, take advantage of the data revolution, and hear examples of firms already putting these ideas into practice.
Paul Sonderegger, Senior Data Strategist, Oracle
Tuesday, May 22: 10:45 a.m. - 11:45 a.m.
The expanding array of data, data types, and data management systems is making the enterprise data landscape more complicated. The challenge is to find the right balance between data access and data management.
10:45 a.m. - 11:45 a.m.
We are now in the Big Data era, thanks to an explosion in the volume, velocity, and variety of data. We are also now in the post-relational era, thanks to a proliferation of options for handling Big Data more naturally and efficiently than relational database management systems (RDBMS). That’s not to say that we’re done with RDBMS; rather, that Big Data is better handled by technologies such as Hadoop, HBase, Cassandra, and MongoDB, which provide scale-out, massively parallel processing (MPP) architectures. This presentation discusses the rise of Hadoop and other MPP technologies and where they fit into an enterprise architecture in the Big Data era.
David Teplow, Founder & CEO, Integra Technology Consulting
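As a minimal illustration of the scale-out, MPP-style processing model shared by the technologies named above, consider the following PySpark sketch of a distributed aggregation; the input path and column names are illustrative assumptions, not part of the session.

```python
# A minimal, hypothetical PySpark job: a distributed aggregation that
# scales out across a cluster rather than up on a single machine.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("scale-out-aggregation").getOrCreate()

# Illustrative input path; any large dataset in a splittable format works.
events = spark.read.parquet("hdfs:///data/events")

# groupBy/agg is planned as a massively parallel job: each executor
# aggregates its own partitions, then partial results are shuffled
# and merged.
daily = (
    events
    .groupBy(F.to_date("event_time").alias("day"))
    .agg(
        F.count("*").alias("events"),
        F.approx_count_distinct("user_id").alias("unique_users"),
    )
)

daily.orderBy("day").show()
spark.stop()
```

Because each node does local work before partial results are merged, adding nodes adds throughput, which is the essence of the scale-out argument.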
10:45 a.m. - 11:45 a.m.
This comprehensive overview of SQL engines on Big Data focuses on low latency. SQL has been with us for more than 40 years and Big Data technologies for about 10 years. Both are here to stay. Pal covers how SQL engines are architected for processing structured, unstructured, and streaming data and the concepts behind them. He also covers the rapidly evolving landscape and the innovations happening in the space, including OLAP on Big Data, probabilistic SQL engines such as BlinkDB, HTAP-based solutions such as NuoDB, solutions that use GPUs with 40,000 cores to build massively parallel SQL engines for querying large-scale datasets at low latency, and the TPC Benchmark 2.0 for evaluating the performance of SQL engines on Big Data.
Sumit Pal, Big Data and Data Science Architect, Independent Consultant; author of SQL on Big Data: Technology, Architecture, and Innovation (Apress)
Tuesday, May 22: 12:00 p.m. - 12:45 p.m.
The concept of a data lake that encompasses data of all types is highly appealing. Before diving in, it is important to consider the key attributes of a successful data lake and the products and processes that make it possible.
12:00 p.m. - 12:45 p.m.
Collaboration and support for diverse analytical workloads are the two key goals when designing a data lake. Modern data lakes contain an incredible variety of datasets, varying in size, format, quality, and update frequency. The only way to manage this complexity is to enable collaboration, which not only promotes reuse but also enables the network effect that helps solve some of the vexing problems of quality and reusability. Given the scale and complexity of the data, moving it outside of the lake is not only impractical but also expensive, so the data lake needs to support diverse needs and the resulting diverse workloads.
Mukund Deshpande, VP, Data Analytics, Accelerite
12:00 p.m. - 12:45 p.m.
Only with a rich, interactive semantic layer, based on knowledge graph technology and situated at the heart of the data lake, can organizations hope to deliver true on-demand access to all of their data, answers, and insights, woven together as an enterprise information fabric.
Sean Martin, CTO, Cambridge Semantics
Tuesday, May 22: 2:00 p.m. - 2:45 p.m.
Cutting-edge Big Data technologies are easily accessible in the cloud today. However, overcoming integration challenges and operationalizing, securing, governing, and enabling self-service usage in the cloud can still be vexing concerns, just as they are on-premises.
2:00 p.m. - 2:45 p.m.
Database characteristics that impact query performance for BI and analytic use cases include the use of columnar structures, parallelization of operations, memory optimizations, and scaling to high numbers of concurrent users. Maguire also covers the requirements for handling updates for real-time analytics.
Walt Maguire, VP Systems Engineering, Actian
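To make the columnar point concrete, here is a small, self-contained Python sketch, not tied to any product named above: an analytic query that needs only one column scans far less data in a column-oriented layout than in a row-oriented one.

```python
# Illustrative comparison of row-oriented vs. column-oriented layouts
# for an analytic query that aggregates a single column.
rows = [
    {"order_id": i, "region": "EMEA" if i % 2 else "AMER", "amount": float(i)}
    for i in range(1_000_000)
]

# Row store: every full row is visited, even though we need only "amount".
total_row_store = sum(row["amount"] for row in rows)

# Column store: each column lives in its own contiguous array, so the
# query scans only the one column it needs (and such arrays typically
# compress better, too).
columns = {
    "order_id": [r["order_id"] for r in rows],
    "region": [r["region"] for r in rows],
    "amount": [r["amount"] for r in rows],
}
total_column_store = sum(columns["amount"])

assert total_row_store == total_column_store
```

The same logic is why columnar structures, combined with parallelized operations, dominate BI and analytic query performance.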
2:00 p.m. - 2:45 p.m.
What three important questions should business leaders ask the next time they need to make a technology decision for a data monetization project? Get guidance from Joseph deBuzna.
Joseph deBuzna, VP Field Engineering, HVR
Tuesday, May 22: 3:15 p.m. - 4:00 p.m.
Big Data requires processing on a massive scale. Newer open source technologies such as Spark can enable Big Data processing for use cases that were previously unimaginable.
3:15 p.m. - 4:00 p.m.
Outbrain is the world’s largest discovery platform, bringing personalized and relevant content to audiences while helping publishers understand their audiences through data. Outbrain uses a multiple-stage machine learning workflow over Spark to deliver personalized content recommendations to hundreds of millions of monthly users. This talk covers Outbrain’s journey toward solutions that compromise on neither scale nor model complexity, as well as the design of a dynamic framework that shortens the cycle between research and production. It also covers the different stages of the framework, including important takeaway lessons for data scientists as well as software engineers.
Shaked Bar, Tech Lead & Algorithm Engineer, Outbrain
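As a rough sketch of what a multistage machine learning workflow over Spark can look like, the pipeline below chains feature-engineering stages with a model; the stages, column names, and model choice are hypothetical illustrations, not Outbrain's actual pipeline.

```python
# Hypothetical multistage Spark ML pipeline: feature engineering stages
# chained with a model, so research and production share one artifact.
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler

indexer = StringIndexer(inputCol="content_category", outputCol="category_idx")
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])
assembler = VectorAssembler(
    inputCols=["category_vec", "user_ctr", "dwell_time"],
    outputCol="features",
)
model = LogisticRegression(featuresCol="features", labelCol="clicked")

pipeline = Pipeline(stages=[indexer, encoder, assembler, model])

# The fitted PipelineModel bundles every stage, which is one way to
# shorten the cycle between research and production: the same object
# trained offline can be shipped to score candidates online.
# fitted = pipeline.fit(training_df)          # training_df: illustrative
# scored = fitted.transform(candidate_items)  # rank by click probability
```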
Tuesday, May 22: 4:15 p.m. - 5:00 p.m.
Walt Maguire introduces analytic case studies, including one from Craig Strong, chief technology and product officer at Hubble, who describes how Hubble provides real-time corporate performance management (CPM) through high-speed analytics dashboards. Hubble’s dashboards draw from hybrid data sources to support ad hoc query and analysis of near-real-time corporate performance. Maguire also presents the results of Hubble’s performance tests, which compared Actian Vector against a selection of databases including SQL Server, MemSQL, SAP, Presto, Spark, and Redshift, along with results from recent scaled, cloud-based database tests and the factors to consider in such testing, including configuration, query complexity, database size, and concurrency.
Walt Maguire, VP Systems Engineering, Actian