How to stop a data lake turning into a data swamp
Monday, 10 November, 2014
The concept of the data lake is emerging as a popular way to organise and build the next generation of systems to master new big data challenges.
An effective metadata layer is critical to the success of any data lake project. Organisations that fail to implement a metadata layer correctly risk their data lake turning into a ‘data swamp’.
A data lake refers to a repository that brings together the many different and varied types of data available, such as video, sound, structured data, documents and sensor data. It provides a way to store this data together in a single place at a low cost. Distributed file systems such as Hadoop's HDFS often serve this purpose.
Many organisations are beginning to understand the benefits of a data lake. For example, companies can store a large amount of data in the data lake until they have time to explore its value and, should it prove to have value, to decide how best to incorporate it into the way they do business.
A second benefit is that the data is still available if the need arises to re-examine a data source. This often occurs when a new analytical technology comes to market that facilitates previously intractable analyses, or a new algorithm is developed that extracts value from data more accurately.
However, there is some uncertainty as to how to initiate a data lake project. This is exacerbated by the requirement to demonstrate a return on investment in platforms like Hadoop.
Some companies have created Hadoop ‘sandpits’ to experiment with the technology. As a result, organisations tend to see Hadoop in one of two ways: either as a bespoke tool for data scientists, out of reach of the mainstream users trying to make decisions; or as a data lake available to mainstream business users.
Organisations that adopt a data lake tend to build a ‘sandpit’ environment quickly, prove the concept and then implement it without further analysis. This can lead to failure just like any other IT project, and the result is referred to as a data swamp. The usual cause is that internal resources are scarce, only a very few people understand the technology and support services are virtually non-existent.
Another cause of failure is that the data pumped into the data lake is not handled correctly. The data lake needs help to make the data worthwhile for the wider enterprise. This is where an effective metadata management layer is vital.
The Hadoop platform excels at ingesting data quickly. Without an effective data management layer, however, it can turn that data into an unmanageable mess just as quickly.
A leading example of a metadata management technology is Loom from Revelytix (recently acquired by Teradata). Loom is unique in that it employs a semantic approach to automatically determine the data and its contents, and is the only complete metadata management toolset for Hadoop in the market. It offers capabilities for data profiling, reporting and governance. There are four key components:
1. Extensible metadata registry. The metadata layer must be flexible and extensible, so it can be customised to the organisation’s needs.
2. Activescan. This technology automates much of the work of ingesting and cataloguing new data in the data lake. New data types can also be catered for as organisations grow their data lake.
3. Data transformations and lineage. It lets analysts track the lineage of all transformations in Hadoop. Inputs and outputs of Hive processes executed through Loom are automatically recorded and users can report on data imports/exports and advanced analytics.
4. Open REST API. This provides integration with other platforms and tools across the enterprise. Businesses can use the API to register and access metadata, or to reach the underlying data directly from analytics tools and applications.
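To illustrate what a metadata layer of this kind does, here is a minimal sketch in Python of a registry that records datasets with organisation-specific attributes and traces transformation lineage back to raw sources. All class and field names are hypothetical illustrations of the general technique, not Loom’s actual API.

```python
# Minimal sketch of an extensible metadata registry with lineage tracking.
# All names are hypothetical illustrations, not Loom's actual API.

class MetadataRegistry:
    def __init__(self):
        self.datasets = {}   # name -> metadata dict (extensible: any keys allowed)
        self.lineage = {}    # output dataset -> list of input dataset names

    def register(self, name, **metadata):
        """Register a dataset with arbitrary, organisation-specific metadata."""
        self.datasets[name] = metadata

    def record_transformation(self, inputs, output, **metadata):
        """Record that `output` was derived from `inputs` (e.g. by a Hive job)."""
        self.register(output, **metadata)
        self.lineage[output] = list(inputs)

    def trace_lineage(self, name):
        """Walk back through recorded transformations to the raw sources."""
        inputs = self.lineage.get(name, [])
        if not inputs:
            return [name]                 # a raw source with no recorded parents
        sources = []
        for parent in inputs:
            sources.extend(self.trace_lineage(parent))
        return sources

registry = MetadataRegistry()
registry.register("raw_clickstream", format="json", source="web servers")
registry.register("raw_sensors", format="csv", source="factory floor")
registry.record_transformation(
    ["raw_clickstream", "raw_sensors"], "daily_summary", format="parquet")

print(registry.trace_lineage("daily_summary"))
# ['raw_clickstream', 'raw_sensors']
```

Because every transformation is recorded at the moment it runs, analysts can later answer “where did this table come from?” without reverse-engineering the jobs that produced it.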
Data lakes and enterprise data warehouses
Big data was the impetus for the creation of data lakes. They were developed to complement enterprise data warehouses, working together to create a logical data warehouse.
A data lake provides a location for ‘extract-transform-load’, the process of preparing data for analysis in a data warehouse. The data lake lets big data be distilled into a form that can be utilised more broadly via the data warehouse.
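To make the ETL role concrete, the following toy Python sketch distils raw, semi-structured records landed in the lake into the uniform rows a warehouse expects. The field names and cleaning rules are invented for illustration.

```python
# Toy extract-transform-load sketch: raw, semi-structured records in the
# lake are distilled into uniform rows suitable for a data warehouse.
# Field names and cleaning rules are invented for illustration.

raw_records = [                       # "extract": raw events landed in the lake
    {"ts": "2014-11-10T09:00:00", "amount": "19.99", "region": "nsw"},
    {"ts": "2014-11-10T09:05:00", "amount": "5.00"},            # missing region
    {"ts": "2014-11-10T09:07:00", "amount": "bad-value", "region": "vic"},
]

def transform(record):
    """Clean one record; return None to reject rows the warehouse can't use."""
    try:
        amount = float(record["amount"])
    except ValueError:
        return None                    # reject unparseable amounts
    return {
        "timestamp": record["ts"],
        "amount": amount,
        "region": record.get("region", "unknown").upper(),
    }

# "load": only the clean rows make it into the warehouse table
warehouse_rows = [row for row in (transform(r) for r in raw_records) if row]

print(len(warehouse_rows))           # 2 rows loaded; 1 rejected
print(warehouse_rows[0]["region"])   # NSW
```

The point of doing this work in the lake is that the rejected and raw records are not lost: they remain in cheap storage, available for re-processing if the cleaning rules or the analytical needs change.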
As companies create hybrid systems that combine data lakes and enterprise data warehouses, it is critical that the data is secured, access is controlled, compliance regulations are adhered to, activity is tracked via an audit trail and data is managed through its lifecycle.
These hybrid, unified systems let users ask questions that can be answered by more data and more analytics with less effort.
In the long term it seems likely that the location of data and the analytics used to answer questions will be hidden from end users, whose job is simply to ask questions of the data. The logical data warehouse, made up of an enterprise data warehouse, a data lake and a discovery platform that facilitates analytics across the architecture, will determine what data and what analytics to use to answer those questions.
How to build a data lake
According to Teradata, there are four key stages to building a data lake:
1. Handling data at scale. Understanding what kind of data the business deals with as well as how to acquire it and transform it at scale is essential. It may not be possible to conduct sophisticated analytics at this stage.
2. Building transformation and analytics. Once the business understands how to handle data, it can begin to transform and analyse it. At this point organisations can start building applications and combining capabilities from the enterprise data warehouse and the data lake itself.
3. Disseminating data and analytics. Now the business can begin making data and analytics available to more employees.
4. Adding enterprise capabilities. Few companies have reached this level of maturity but it is the highest stage of the data lake. As big data grows, this level will become more common.
More information about best practices for implementing a data lake is available in a downloadable guide.