Demystifying the Data Lake - Part 1

If you are new to the underlying technology that supports analytics, or just unfamiliar with some of the key concepts associated with data lakes, this blog post is for you. It is intended to be a short primer for both technical and non-technical audiences.

If you are a technologist that has been working with data lakes or data warehouses for a while, you might want to skip it.

This is the first of two posts, with Part 2 focused on some of the more technical details.

What Is A Data Lake?

The term data lake carries a lot of baggage: it is both a hot tech buzzword and somewhat loosely defined. Early on, there was a strong focus on the underlying technology (Apache Hadoop) as the defining characteristic of data lakes, muddying the waters for the uninitiated. Fortunately, it isn’t that difficult to put together a working definition of a data lake that can be broadly understood.

At the core, data lakes are a means to an end. They are a mechanism for letting analytics within an organization happen efficiently, flexibly, and with the necessary controls in place. Data lakes work by bringing information of all kinds together in a central repository, surrounded by tools that let users prepare data and conduct analysis.

Analysis Without Centralized Data is Hard

It is easy to imagine a scenario in which data is spread out across multiple systems and you have a burning business question that you need to quickly answer.

For example, you might wonder how account manager interactions with customers influence renewal rates in a subscription-based business.

To find out, you might need to combine data from a customer relationship management (CRM) system like Salesforce.com with data from an internal system that manages customer subscription data, and perhaps external data that tells you something about the market conditions in a particular geography where a customer is located.

This immediately surfaces a number of issues:

  • You have to combine data from multiple sources and that might not be easy. You will probably need to ask several different people to get the data for you and put it into a usable format like a spreadsheet. Once you have that data, how can you be sure that the customers in the CRM system accurately match the customers in the internal subscription system?

  • You won’t be sure of all of the data you need up front. This makes it hard to explore the data freely and improvise as you learn more. You might have to go “back to the well” asking for more data and waiting for requests to be fulfilled.

  • You haven’t built anything that delivers lasting value beyond answering the original question for a single point in time.
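To make the first issue concrete, here is a minimal Python sketch of what matching customers across two systems can look like. Everything in it is an assumption made for illustration: the records, the field names (account_name, customer, manager_calls, renewed), and the simple name-cleaning rule. Real matching is usually messier and often relies on shared IDs rather than names.

```python
# Hypothetical sample records; field names and values are made up for illustration.
crm_accounts = [
    {"account_name": "Acme Corp.", "manager_calls": 12},
    {"account_name": "Globex, Inc", "manager_calls": 3},
    {"account_name": "Initech", "manager_calls": 7},
]
subscriptions = [
    {"customer": "ACME CORP", "renewed": True},
    {"customer": "globex inc", "renewed": False},
    {"customer": "Umbrella LLC", "renewed": True},
]


def normalize(name):
    """Lowercase a name and drop punctuation so near-identical spellings match."""
    kept = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(kept.split())


def match_customers(crm_rows, subscription_rows):
    """Join the two sources on the normalized name; return matches and leftovers."""
    by_name = {normalize(row["customer"]): row for row in subscription_rows}
    matched, unmatched = [], []
    for row in crm_rows:
        sub = by_name.get(normalize(row["account_name"]))
        if sub is None:
            # No subscription record found for this CRM account.
            unmatched.append(row["account_name"])
        else:
            # Combine the fields from both systems into one record.
            matched.append({**row, **sub})
    return matched, unmatched


matched, unmatched = match_customers(crm_accounts, subscriptions)
print(len(matched), "matched; unmatched:", unmatched)
```

Even in this toy case, one CRM account has no counterpart in the subscription data, and someone has to decide what to do about it. That is the kind of work you would rather do once, in a shared place, than redo for every new question.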

If that data were all in one place, the benefits would be clear:

  • You can have confidence that the data has already been married together correctly with customers in one system matched up to customers in another.

  • You don’t need to repeat the effort of integrating the data each time you run the analysis, or when you realize you forgot a required piece of data the first time.

  • You can refresh the analysis as new data comes in, creating lasting value and the means to monitor progress as you act on insights from the initial analysis.

One way to realize these benefits is by building a data lake as a shared, central repository for analytics data.

Data Lakes Improve on An Existing Model

Data lakes, to some extent, are the latest in a series of technology approaches for bringing data together in a central location for analysis. Starting decades ago with dedicated reporting databases that evolved into data warehouses, companies have long been building central repositories for analytics data. These older approaches are now being supplemented, and in some cases subsumed, by data lakes.

However, data lakes go beyond simply centralizing the data and introduce some marked improvements over previous approaches in two areas:

  • Flexibility – Data lakes allow organizations to freely combine information of different types in one repository and run a variety of different types of analyses against that single repository. Data lakes often support both conventional business analytics and more advanced techniques like machine learning. They also can allow unstructured data like documents and images to be analyzed alongside tabular data like sales transactions.

  • Scalability – Data lakes are also more flexible in the way that they store data and use processing power. This allows organizations to both run a wider variety of analytics and run those analytics over larger data sets without breaking the bank.

While we won’t be diving into the details here, Part 2 of this series will provide more detail about how this is achieved from a technical perspective.

…But There are Tradeoffs

This added capability is not free. The technologies needed to build a data lake are complex and less proven than more traditional database environments.

The very flexibility that data lakes provide makes them more difficult to implement and manage successfully. Data lakes have in some cases been treated like dumping grounds for data, with little overarching organization and even less guidance for users trying to leverage the data for analysis. This leads to redundant data, poor security, and little payoff for the effort.

Data Lakes Need to Balance Flexibility and Control

Succeeding with a data lake requires a management layer that provides a few core services to help your IT team contain the potential chaos:

  • Data Catalog - Users need to be able to find the data they need. Your data lake should have a catalog that helps users find information and also acts as the backbone for other capabilities like security, job management, and logging.

  • Security and Access Control - The data lake should also provide security that limits individual users’ access to specific files or tables.

  • Data Integration - Using a consistent set of approaches and tools to build data pipelines to populate the data lake reduces potential redundancy and creates opportunities for reuse.

  • Job Management and Scheduling - Once pipelines are built, they generally need to be composed into larger workflows and scheduled.

  • Monitoring and Notifications - The lake needs robust monitoring and notifications that will inform administrators and even end users when something has gone wrong.

  • Detailed Logging and Activity Tracking - Finally, detailed tracking of events within the data lake is needed to ensure that when things do go wrong, there is a way to diagnose the underlying issue. Detailed activity tracking can also help with meeting compliance and audit requirements.
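For readers who think in code, here is a rough Python sketch of how three of these services (catalog, access control, and activity tracking) can hang off the same backbone. It is a toy under stated assumptions, not a description of how any particular product works: the Catalog and CatalogEntry classes, the group names, and the storage locations are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """One data set registered in the lake (hypothetical structure)."""
    name: str
    location: str
    description: str
    allowed_groups: set = field(default_factory=set)


class Catalog:
    """Toy catalog that also enforces access and records activity."""

    def __init__(self):
        self._entries = {}
        self.activity_log = []  # (user, data set, outcome) tuples

    def register(self, entry):
        self._entries[entry.name] = entry

    def search(self, keyword):
        """Return names of data sets whose name or description mentions keyword."""
        keyword = keyword.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if keyword in e.name.lower() or keyword in e.description.lower()
        )

    def open_table(self, user, groups, name):
        """Return the data set's location if the user belongs to an allowed group."""
        entry = self._entries[name]
        allowed = bool(entry.allowed_groups & set(groups))
        # Every attempt is logged, whether or not it succeeds.
        self.activity_log.append((user, name, "granted" if allowed else "denied"))
        if not allowed:
            raise PermissionError(f"{user} may not read {name}")
        return entry.location


catalog = Catalog()
catalog.register(CatalogEntry(
    "crm_accounts", "s3://example-lake/crm/accounts",
    "Customer accounts from the CRM", {"sales", "analytics"},
))
catalog.register(CatalogEntry(
    "subscriptions", "s3://example-lake/billing/subscriptions",
    "Customer subscription renewals", {"finance"},
))

print(catalog.search("renewal"))
print(catalog.open_table("dana", ["analytics"], "crm_accounts"))
```

The design point is that the catalog entry is the single place where a data set's description, location, and permissions live, so finding data, securing it, and auditing its use all draw on the same record.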

These capabilities can be built by selecting best-of-breed tools in each area and integrating them, or by leveraging packaged offerings from providers like Silectis.

Building a data lake with a strong management layer can mean the difference between actually impeding useful analytics efforts and building a sustainable foundation that can serve your company into the future.

Silectis Can Help

Data lakes can be difficult and expensive to set up, requiring the integration of multiple technologies, each demanding specialized skills.

Our Magpie platform brings all of the elements of the data lake together in one package, making it easier for your team to get up and running with a scalable data lake infrastructure.

Magpie lives fully in the cloud, reducing the overall cost of getting started and eliminating the need for your team to set up and manage a complex infrastructure. Our consulting services help you launch your data lake faster and start analyzing data.

In Part 2, we focus on shedding light on some of the technical details of data lake implementation. You can find it here.

To learn more about Magpie, click here.

To learn more about the key challenges organizations face in building agile analytics, and strategies for success, download our latest white paper.

Demetrios Kotsikopoulos is the CEO of Silectis. 
You can find him on LinkedIn and Twitter.