The hype about data lakes, big data and all the great data generated by the IIoT (Industrial Internet of Things) keeps growing. There are valid use cases for data lakes in manufacturing operations management and the Smart Manufacturing IT framework, but companies should know more about the realities in this arena. Gartner recently published a report titled “Three Architecture Styles for a Useful Data Lake” [1] that raises these concerns. We need to make sure they don’t stay buried in that report, and instead bring them to the front of the conversation.
First, what are data lakes?
A data lake is a collection of storage instances of various data assets, stored redundantly in a near-exact copy of the source format. The general idea is to store data separately from its source domains for future processing and analysis by different applications in the enterprise.
A data lake is a concept, not a technology. There are multiple types and purposes for data lakes. There could be more than one data lake in an IT architecture. Data lakes may add contextual metadata to the raw data. They may have a short or long shelf life for the data depending on the purpose for each data lake.
Examples of the use of data lake concepts range from self-service intelligence to ad-hoc data science analysis to advanced analytical alerts on new data trends. However, the data lake is not the new data warehouse. It is a place to search for and discover additional insights; it is not optimized for daily or weekly reporting of metrics to the organization.
Second, what are the misconceptions about data lakes?
There are many valid reasons to consider data lakes in the information technology landscape for future manufacturing systems. However, they should not be viewed as a silver bullet answer to the complexity of architecting and integrating enterprise platforms. They should be viewed as another tool in the IT arsenal, to be used where appropriate, and approached with an understanding of the following myths.
Myth: A data lake strategy does not include ETL tools
Key to the data lake concept is the separation of independent data storage and computing software. The methods of moving data into and out of the data lake can include tools like ETL (extract, transform, load) and ESB (enterprise service bus).
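As a minimal sketch of the ETL pattern mentioned above (the field names, source format and landing format here are all hypothetical, not taken from any specific product): extract rows from a source-system export, transform them into a normalized shape, and load them as newline-delimited JSON, a common landing format for data lakes.

```python
import csv
import io
import json

def extract(source_csv: str) -> list[dict]:
    """Extract: read rows from a (hypothetical) CSV export of a source system."""
    return list(csv.DictReader(io.StringIO(source_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """Transform: normalize field names and coerce types before loading."""
    return [{"machine_id": r["Machine"], "temp_c": float(r["TempC"])} for r in rows]

def load(rows: list[dict]) -> str:
    """Load: serialize to newline-delimited JSON for landing in the lake."""
    return "\n".join(json.dumps(r) for r in rows)

source = "Machine,TempC\nM-101,72.5\nM-102,68.0\n"
landed = load(transform(extract(source)))
```

In practice each stage would be a tool or service rather than an in-process function, but the separation of concerns is the same.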
Myth: A data lake has to be Hadoop
Hadoop is the most popular platform for data lakes. However, it is possible to use other types of relational, NoSQL and object data storage tools. Some of the factors to consider in this decision include data volume, velocity, variety, and scalability.
Myth: Data lakes are inexpensive to implement
Even though the cost of data storage continues to decrease, the costs of networks, integration, backup, security and support can add up quickly and significantly. Open source software and cloud storage options can help with cost, but they should not be viewed as free.
Myth: Get all the data you can into the data lake. No need to worry about data modeling.
Gartner predicts that by 2018, 90% of data lakes will be rendered useless as they become overwhelmed by the volume of data captured for uncertain use cases. The concept of garbage-in, garbage-out applies to the data lake. Data hoarded without a preplanned purpose, or without proper contextual or lineage information, will create a lot of noise that can obstruct finding the important signals in the data.
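One common mitigation for the lineage problem described above is to wrap each raw record with contextual metadata at ingest time. The sketch below illustrates the idea; the envelope fields (`source_system`, `ingested_at`, `schema_hint`) are illustrative names, not a standard.

```python
from datetime import datetime, timezone

def ingest(raw_record: dict, source_system: str) -> dict:
    """Wrap a raw record with contextual and lineage metadata at ingest
    time, so it can still be interpreted long after it lands in the lake."""
    return {
        "payload": raw_record,                     # untouched source data
        "source_system": source_system,            # lineage: where it came from
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "schema_hint": sorted(raw_record.keys()),  # aids later discovery
    }

record = ingest({"machine": "M-101", "temp_c": 72.5}, "historian-a")
```

Even this small amount of context turns an anonymous blob into something a future analyst can locate, date and trace back to its source.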
Myth: A data lake creates a single source of truth
Data collected in a data lake is usually schema-on-read, which means that its structure is determined through discovery at the time the data is analyzed. The data lake must remain flexible to be inclusive of new data sources. Data lakes apply soft governance approaches to avoid anarchy, but they lack the hard governance a system of record requires. The data lake should not be considered a system of record.
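The schema-on-read idea can be shown in a few lines: heterogeneous raw records sit in the lake with no enforced structure, and each consumer imposes its own schema, including type coercion, only when it reads the data. The record shapes below are hypothetical examples.

```python
import json

# Raw, heterogeneous records as they might sit in a lake: no enforced schema,
# inconsistent types, optional fields.
raw_lines = [
    '{"machine": "M-101", "temp": "72.5"}',
    '{"machine": "M-102", "temp": 68.0, "operator": "kim"}',
]

def read_with_schema(lines):
    """Schema-on-read: structure is imposed by the consumer at analysis
    time, not by the lake at write time."""
    for line in lines:
        rec = json.loads(line)
        yield {
            "machine": str(rec["machine"]),
            "temp_c": float(rec["temp"]),  # coerce types at read time
        }

readings = list(read_with_schema(raw_lines))
```

A different consumer could read the same raw lines with a different schema, which is exactly why the lake cannot serve as a single source of truth.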
Myth: Anyone can use a data lake
It is not trivial to do data discovery and model development in the data lake. Organizations will need staff skilled in statistics, machine learning and data science to reap benefits from data lake efforts.
Myth: A data lake is a data integration method
The data pipelines in a data lake scheme should be viewed as additional to, not in lieu of, the data integration schemes linking the organization’s business systems. It is possible to leverage data integration messages as one of the sources for a data lake, but the data lake’s purposes should always be secondary, and should not interfere with the queuing and synchronization required for system integration.
Myth: If we build a data lake, people will use it
The implementation of technology should always have specific use cases and a business purpose in mind. If we start with a need, the system will be used. If we start with technology and the need is not clear, there is a big risk that no one will fish in that data lake.
References
[1] “Three Architecture Styles for a Useful Data Lake”, S. Sicular, Gartner, 2016