The need for greater capacity and storage while managing costs is an all-too-familiar challenge within IT departments, especially as businesses strive to digitally transform themselves into insights-driven enterprises.
The cloud is becoming an increasingly attractive location for data lakes, for many of the same reasons organizations are moving transactional and analytical workloads off premises in general.
DBTA recently held a webinar featuring Clive Bearman, director of marketing strategy, Attunity, and Paul Nelson, innovation lead, Accenture Applied Intelligence, who discussed how to succeed with data lakes in the cloud.
There are a few challenges when it comes to maintaining a data lake, Bearman explained:
- Data lakes store raw data, and their business value is entirely determined by the skills of data lake users.
- Many technologies used to implement data lakes are new and lack the information management capabilities that organizations normally take for granted.
- Without data lineage within data lakes, each user must collect, assemble, and refine data independently to drive meaningful business insights.
Moving beyond the data lake status quo calls for new capabilities: automated data pipelines; an integrated platform built for data lakes, cloud, and streaming; effective and ubiquitous metadata; and a multi-zone data methodology (sketched below).
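To illustrate that last requirement, here is a minimal sketch of a multi-zone layout in which data is promoted from raw to refined to curated; the zone, source, and dataset names are hypothetical conventions, not anything prescribed in the webinar.

```python
# Minimal sketch of a multi-zone data lake layout (all names hypothetical).
# Data is promoted raw -> refined -> curated, so each zone's quality
# guarantees are explicit to downstream users.

ZONES = ("raw", "refined", "curated")

def zone_path(zone: str, source: str, dataset: str, ingest_date: str) -> str:
    """Build a conventional object-store key for a dataset in a given zone."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return f"{zone}/{source}/{dataset}/ingest_date={ingest_date}/"

# e.g. zone_path("raw", "crm", "orders", "2019-04-01")
#  -> "raw/crm/orders/ingest_date=2019-04-01/"
```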
Attunity has a data lake solution that offers benefits including:
- Reducing the time and cost to onboard new data
- Continuously updating fresh, analytics-ready data while ensuring transactional integrity
- Adapting to changes in sources and targets
- Onboarding new sources faster
- Future-proofing against rapidly changing enterprise architectures
Nelson believes in a different approach, making security of the data lake priority number one. Client data substantially increases security risk, so additional security controls are needed (a brief sketch follows this list):
- Encryption at rest
- Administrator monitoring
- Data transfer monitoring
- Security key management
- Security audits
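As a concrete example of the first and fourth controls, the following is a minimal sketch of enforcing encryption at rest under a managed key, assuming an AWS S3-based lake; the bucket name and KMS key alias are hypothetical.

```python
# Sketch: enforce default encryption at rest for one data lake zone,
# assuming AWS S3; the bucket name and KMS key alias are hypothetical.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_encryption(
    Bucket="acme-data-lake-raw",  # hypothetical bucket
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                # Key management: objects are encrypted under a customer-managed
                # KMS key, so rotation and access audits happen in one place.
                "KMSMasterKeyID": "alias/acme-data-lake",  # hypothetical alias
            }
        }]
    },
)
```

Centralizing encryption on a customer-managed key also supports the other bullets: key usage can be monitored and audited alongside administrator activity.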
In addition, search, real-time data, and data lineage are cornerstones of a successful data lake, Nelson said; a brief sketch follows each of the lists below.
Search:
- Distributes a 360-degree “single source of truth”
- Ad-hoc, end-user analytics over billions of records
- Analytics over real-time data with high-volume writes
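To make those bullets concrete, here is a minimal sketch of an ad-hoc query over a customer-360 index, assuming Elasticsearch and its 8.x Python client; the host, index, and field names are hypothetical.

```python
# Sketch: ad-hoc, end-user analytics over a search index, assuming
# Elasticsearch (8.x Python client); index and field names are hypothetical.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Query the 360-degree customer view for one segment, aggregating by region.
resp = es.search(
    index="customer-360",  # hypothetical "single source of truth" index
    query={"match": {"segment": "enterprise"}},
    aggs={"by_region": {"terms": {"field": "region"}}},
    size=10,
)
for hit in resp["hits"]["hits"]:
    print(hit["_source"])
```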
Real-Time Data:
- Typically requires Kafka queues (or similar)
- Processed with Spark Streaming or Flink
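A minimal sketch of that pipeline, assuming Spark Structured Streaming reading from a Kafka topic and landing raw events in the lake; the broker, topic, and paths are hypothetical.

```python
# Sketch: Kafka -> Spark Structured Streaming -> raw zone of the lake.
# Requires the spark-sql-kafka connector package on the Spark classpath.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("lake-ingest").getOrCreate()

# Read a stream of change events from Kafka.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # hypothetical broker
    .option("subscribe", "lake-changes")               # hypothetical topic
    .load()
)

# Land the decoded events in the lake's raw zone as a continuous query.
query = (
    events.selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
    .writeStream
    .format("parquet")
    .option("path", "s3a://acme-data-lake/raw/lake-changes/")  # hypothetical
    .option("checkpointLocation", "s3a://acme-data-lake/_checkpoints/lake-changes/")
    .start()
)
query.awaitTermination()
```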
Data Lineage:
- Tracing data usage from source to destination
- A “universal metadata repository” combines lineage from many systems into a universal representation
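To illustrate what such a repository might record, here is a minimal in-memory sketch that stores lineage edges from multiple systems and traces a destination back to its sources; the schema and identifiers are hypothetical, not Accenture's actual design.

```python
# Sketch: a toy "universal metadata repository" of lineage edges;
# all system names and URIs are hypothetical.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class LineageEdge:
    source: str       # e.g. "oracle://crm/orders"
    destination: str  # e.g. "s3://lake/raw/orders"
    system: str       # the tool that moved the data, e.g. "replicate"

@dataclass
class MetadataRepository:
    edges: list = field(default_factory=list)

    def record(self, edge: LineageEdge) -> None:
        self.edges.append(edge)

    def trace(self, destination: str) -> list:
        """Walk upstream from a destination back to its original sources."""
        lineage = []
        for e in (e for e in self.edges if e.destination == destination):
            lineage.extend(self.trace(e.source))  # recurse toward the source
            lineage.append(e)
        return lineage

repo = MetadataRepository()
repo.record(LineageEdge("oracle://crm/orders", "s3://lake/raw/orders", "replicate"))
repo.record(LineageEdge("s3://lake/raw/orders", "s3://lake/curated/orders", "spark"))
print([(e.source, e.destination) for e in repo.trace("s3://lake/curated/orders")])
```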
An archived on-demand replay of this webinar is available here.