The Elephant is coming back to NYC ...
Since its beginning as a project aimed at building a better web search engine for Yahoo – inspired by Google’s well-known MapReduce paper – Hadoop has grown to occupy the center of the big data marketplace. From data offloading to preprocessing, Hadoop is not only enabling the analysis of new data sources among a growing legion of enterprise users; it is also changing the economics of data. Alongside this momentum is a budding ecosystem of Hadoop-related solutions, from open source projects like Spark, Hive, and Drill to commercial products offered on-premises and in the cloud. These new technologies are solving real-world big data challenges today.
Whether your organization is currently considering Hadoop and Hadoop-related solutions or already using them in production, Hadoop Day is your opportunity to connect with the experts in New York City and expand your knowledge base. This unique event has all the bases covered:
- Enterprise Use Cases Today
- Architecting a Scalable Hadoop Platform
- Building Hadoop Applications
- Data Warehouse Optimization with Hadoop
- Troubleshooting Hadoop Performance Issues
- Data Science with Hadoop
- Machine Learning with Spark
- Data Analysis with Hive & Pig
- Taking Advantage of SQL-on-Hadoop Solutions
- Running Hadoop in the Cloud
- Securing Data in Hadoop
- Optimizing ETL with Hadoop
- Diving into the Data Lake
Tuesday, May 10, 2016
CONTINENTAL BREAKFAST
8:00 a.m. - 9:00 a.m.
WELCOME & KEYNOTE - How Statistics (And a Little Public Data) Can Change a City
9:00 a.m. - 9:45 a.m.
The creator of I Quant NY, a data science and policy blog that focuses on insights drawn from New York City's public data, Ben Wellington, advocates the analysis of open data to affect policy. The Open Data movement is growing, and governments are releasing vast amounts of data to the public. As citizens push for more transparency, it is fair to ask what we can actually do to derive actionable insights from this data. How can this data help us improve the cities we live and work in, whether we are policymakers, businesses, or residents? Wellington explores how he's used his blog and some simple data science techniques to make changes in New York City. He discusses best practices for data science in the policy space, explores how storytelling is an important aspect of data science, and highlights the various data-driven interactions he's had with city agencies. He contends that data science need not use complicated math: It's often more about curiosity and the questions we ask than the complexity of the equations we use.
Ben Wellington, I Quant NY
Rethink Data Management: From On-Premises to Cloud
9:45 a.m. - 10:00 a.m.
It is undeniable that data continues to power business growth, competitive advantage, and customer experience. In the midst of a monumental transformation fueled by social, mobile, and cloud, business and IT leaders alike are rethinking their roles and how they manage and deploy technology to accelerate business growth. Leveraging Oracle Database 12c, customers can rethink data management and gracefully evolve architectures from on-premises to the Cloud. Learn about the latest data management trends and how transforming to the Cloud can help organizations innovate faster, improve time to market, and stay ahead of the pack.
Nicholas Chandra, VP, Cloud Computing Success, Oracle
COFFEE BREAK in the Data Solutions Showcase
10:00 a.m. - 10:45 a.m.
H101: Unleashing the Power of Hadoop
10:45 a.m. - 11:45 a.m.
Data analytics has emerged as the must-have strategy of organizations around the world, helping them understand customers and markets and predict shifts before they happen. At the center of the new Big Data movement is the Hadoop framework, which provides an efficient file system and related ecosystem of solutions to store and analyze big datasets. Find out how to make the power of Hadoop work for you.
Harnessing the Hadoop Ecosystem
People who are new to Big Data lack a big-picture view of how end-to-end solutions are actually constructed. Non-adopters are confronted with a vast amount of disparate information with no understanding of how to use the underlying tools. As a result, they are left with an incomplete understanding of how Hadoop may be used to solve their problems. This session addresses that knowledge and experience gap.
James Casaletto, Principal Solutions Architect, Professional Services, MapR
HBase Data Model—The Ultimate Model on Hadoop
Hadoop's HDFS has many limitations on its own, and HBase, the Database of Hadoop, helps overcome these issues. HBase is a NoSQL (nonrelational) database and an Apache project: a column-oriented database management system that runs on top of HDFS, is modeled after Google’s BigTable, and is suited to hosting very large tables of sparse, semi-structured data. Attend this session to learn more.
Tassos Sarbanes, Mathematician / Data Scientist, Investment Banking, Credit Suisse / City University of New York
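As a rough illustration of the column-family model this session covers, here is a minimal Python sketch, assuming a local HBase Thrift gateway and the third-party happybase client; the table, column-family, and row-key names are hypothetical, not anything prescribed by the speaker.

```python
# A minimal sketch of HBase's column-family data model, assuming a local
# Thrift gateway and the happybase client; table and column names are hypothetical.
import happybase

connection = happybase.Connection('localhost')  # HBase Thrift server

# Create a table with two column families; individual columns inside a family
# are created on the fly, which is what suits HBase to sparse, semi-structured data.
connection.create_table('web_events', {'meta': dict(), 'metrics': dict()})

table = connection.table('web_events')

# Rows are keyed by an arbitrary byte string; each cell is family:qualifier -> value.
table.put(b'user123#2016-05-10', {
    b'meta:url': b'/hadoop-day',
    b'metrics:clicks': b'3',
})

# Read the row back; columns that were never written simply do not exist (no NULL padding).
print(table.row(b'user123#2016-05-10'))
```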
H102: Querying Hadoop and NoSQL Data Stores
12:00 p.m. - 12:45 p.m.
The pressure is growing for organizations to react faster to changing opportunities and risks by using data to improve decision making. Hadoop and NoSQL data stores provide new options for organizing and aggregating data in all its forms. Find out what you need to know about querying Hadoop and NoSQL data stores.
De-Siloing Data Using Apache Drill
Study after study shows that data scientists spend 50–90% of their time gathering and preparing data. In many large organizations, this problem is exacerbated by data being stored on a variety of systems, with different structures and architectures. Apache Drill is a relatively new tool that can help solve this difficult problem by allowing analysts and data scientists to query disparate datasets in-place using standard ANSI SQL without having to define complex schemata or rebuild their entire data infrastructure. This session introduces the audience to Apache Drill and presents a case study of how Drill can be used to query a variety of data sources.
Jair Aguirre, Lead Data Scientist, Booz Allen Hamilton
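To make the idea of in-place, schema-free querying concrete, here is a short Python sketch that submits an ANSI SQL statement to Drill's REST endpoint with the requests library; the host, file paths, and column names are hypothetical, and the CSV join assumes the dfs storage plugin is configured to read column headers.

```python
# A sketch of querying disparate data in place with Apache Drill via its REST API.
# Host, workspace paths, and column names are hypothetical.
import requests

query = """
    SELECT c.name, SUM(CAST(o.amount AS DOUBLE)) AS total
    FROM dfs.`/data/raw/orders.json` o
    JOIN dfs.`/data/reference/customers.csv` c
      ON o.customer_id = c.customer_id
    GROUP BY c.name
"""

resp = requests.post(
    'http://localhost:8047/query.json',            # Drill web/REST endpoint
    json={'queryType': 'SQL', 'query': query},
)
for row in resp.json().get('rows', []):            # each row comes back as a dict
    print(row)
```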
ATTENDEE LUNCH in the Data Solutions Showcase
12:45 p.m. - 2:00 p.m.
H103: Harnessing Big Data With Spark
2:00 p.m. - 2:45 p.m.
Apache Spark, an engine for large-scale data processing, can complement Hadoop, but it can also be deployed without it. Find out what Spark offers and why it is gaining ground in the world of Big Data.
Apache Spark and Effective Machine Learning
The introduction of Hadoop MapReduce (MR) allowed the application of algorithms to data of unprecedented scale using systems built from cheap commodity hardware. However, MR is slow, significantly curtailing its applicability to advanced iterative machine learning (ML) algorithms, which frequently need to be run multiple times in order to train effectively and find optimal parameters. Spark changed this: by providing speedups of 100X or more, it fundamentally introduced the possibility of applying ML to Big Data and extracting meaningful insights in actionable timeframes. This presentation provides an overview of the framework Alpine Data developed and presents results for a variety of well-known datasets, illustrating how the software can significantly reduce the repetitive, trial-and-error nature of today's data science and shorten the time to an effective model.
Lawrence Spracklen, VP, Engineering, Alpine Data
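The speedup the abstract refers to comes largely from keeping data in memory across an algorithm's iterations rather than re-reading it from disk on every MapReduce pass. The following PySpark sketch illustrates that pattern with standard Spark ML (it is not Alpine Data's framework); the input path and column names are hypothetical.

```python
# A generic PySpark sketch of why iterative ML benefits from Spark: the training
# data is cached in memory once and reused across the optimizer's passes, instead
# of being re-read from disk on each iteration as with MapReduce.
# Input path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("iterative-ml-sketch").getOrCreate()

df = spark.read.parquet("/data/churn.parquet")
features = VectorAssembler(inputCols=["tenure", "usage", "spend"],
                           outputCol="features").transform(df)
features.cache()  # keep the training set in memory for repeated passes

# maxIter controls how many optimization passes reuse the cached data.
model = LogisticRegression(featuresCol="features", labelCol="label",
                           maxIter=50).fit(features)
print(model.coefficients)

spark.stop()
```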
COFFEE BREAK in the Data Solutions Showcase
2:45 p.m. - 3:15 p.m.
H104: Building Hadoop Applications
3:15 p.m. - 4:00 p.m.
Whether Hadoop becomes the de facto management platform of the future for Big Data or simply a key component in a hybrid architecture comprising numerous technologies, by now it is certain that Hadoop, along with its larger ecosystem, is no fly-by-night technology. Find out the key issues involved in leveraging Hadoop for Big Data applications.
Building Scalable Machine Learning Applications on Apache Spark
There is a growing demand in the industry for highly scalable data processing platforms that are built around simplicity of architecture, compatibility, and robustness. Apache Spark is one such example and one of the most exciting emerging technologies in today's computing landscape. It is finding many exciting use cases, one of them being machine learning. In this session, a brief description of the Apache Spark architecture will be presented, along with a case study of a machine learning algorithm that demonstrates the architecture's versatility.
Abhik Roy, Engineering Principal, Wells Fargo
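As a generic illustration of the kind of workflow such a case study might involve (not the speaker's actual example), the sketch below chains feature extraction and a classifier into a single Spark ML Pipeline; the toy data and column names are invented.

```python
# A generic sketch (not the speaker's case study) of composing an ML workflow
# as a Spark ML Pipeline; the toy data and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

train = spark.createDataFrame(
    [("spark makes iterative ml fast", 1.0),
     ("completely unrelated text", 0.0)],
    ["text", "label"],
)

# Each stage is a reusable component; the same fitted pipeline runs unchanged
# on a laptop or a cluster.
pipeline = Pipeline(stages=[
    Tokenizer(inputCol="text", outputCol="words"),
    HashingTF(inputCol="words", outputCol="features"),
    LogisticRegression(maxIter=10),
])
model = pipeline.fit(train)
model.transform(train).select("text", "prediction").show()

spark.stop()
```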
H105: The Great Data Lake Roundtable
4:15 p.m. - 5:00 p.m.
The concept of the data lake is intriguing for Big Data because it allows data to be retained in its original format so it can be used now and in the future by different users for different purposes. But what are the issues that need to be considered in terms of data lake governance, regulatory compliance, security, and access, as well as data cleansing and validation to make sure data is accurate and up-to-date?
Anne Buff, Business Solutions Manager, SAS Best Practices, SAS Institute
Abhik Roy, Engineering Principal, Wells Fargo
Tassos Sarbanes, Mathematician / Data Scientist, Investment Banking, Credit Suisse / City University of New York
NETWORKING RECEPTION in the Data Solutions Showcase
5:00 p.m. - 6:00 p.m.