To better understand data lakes, it is useful to think about how they evolved and to examine some of the earlier architectures used to support data management for analytics.
This blog post describes the main architectural approaches that have been used to support analytics in the enterprise and how they have changed over time.
By reflecting on the specific problems each approach set out to solve and how it addressed the shortcomings of its predecessors, we can see where each still applies today and bring the strengths of more modern architectures into sharper relief.
Reporting Against Source Systems
Many enterprises started out directly running reports from the source databases backing the underlying applications. This is the most obvious way of pulling metrics out of a system and can meet basic needs.
This approach to some extent treats reporting and analytics as an extension of application requirements and can be adequate for simple metrics collection, operational reporting, and basic analysis.
Reporting from the source database has the advantage of not requiring any additional technologies to support basic needs. However, this approach also has some readily apparent weaknesses:
Production Impact – Reporting workload can impact production application performance. Anything beyond very basic operational reporting could result in substantial impacts to performance. This can be mitigated by doing things like mirroring databases, but this introduces additional operational challenges.
Difficult “Cross-Application” Reporting – Combining data from multiple systems is not supported unless you are willing to set up multiple point-to-point database connections.
Poor Analytics Performance – The highly normalized designs of application systems often perform poorly when used for reporting. This is because it may be necessary to join data across a large number of tables to generate a result.
Poor Usability – The same issues with the organization of data that cause performance bottlenecks also make the data difficult to use. Having to figure out complex join logic on each analysis is time consuming and error prone.
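The join burden described above can be sketched with a toy example. This is a hypothetical illustration (all table and column names are invented) using Python's built-in sqlite3: even a simple "revenue by region" report against a normalized application schema requires joining four tables.

```python
import sqlite3

# Hypothetical normalized application schema: a basic
# "revenue by region" report requires joining four tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE regions     (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customers   (customer_id INTEGER PRIMARY KEY, region_id INTEGER);
CREATE TABLE orders      (order_id INTEGER PRIMARY KEY, customer_id INTEGER);
CREATE TABLE order_items (order_id INTEGER, amount REAL);

INSERT INTO regions VALUES (1, 'East'), (2, 'West');
INSERT INTO customers VALUES (10, 1), (11, 2);
INSERT INTO orders VALUES (100, 10), (101, 11);
INSERT INTO order_items VALUES (100, 25.0), (100, 10.0), (101, 40.0);
""")

rows = con.execute("""
    SELECT r.name, SUM(oi.amount) AS revenue
    FROM order_items oi
    JOIN orders o    ON o.order_id = oi.order_id
    JOIN customers c ON c.customer_id = o.customer_id
    JOIN regions r   ON r.region_id = c.region_id
    GROUP BY r.name
    ORDER BY r.name
""").fetchall()
print(rows)  # [('East', 35.0), ('West', 40.0)]
```

An analyst has to discover and rewrite this join chain for every new question, which is exactly the usability and performance problem described above.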
Dedicated Reporting Databases
Organizations very quickly realized that running reports directly against the databases that also supported their core transaction processing was a recipe for disaster. As noted above, it introduced load in an environment where response time was critical and made it difficult to combine data across sources.
The obvious remedy in this case is the creation of separate reporting databases, sometimes simply mirroring the structure of the source, but more often reflecting the structure of the specific reports that needed to be produced.
This has worked reasonably well in places where reporting needs were well understood, narrow in scope, and unchanging. The main advantages of this approach included the following:
Isolating production systems from the impacts of reporting workloads.
Providing a central hub for data integrations to support analysis.
Allowing for denormalization or the use of specialized databases to improve performance.
Simplifying use of the database by building structures meant to support reporting.
Unfortunately, this approach is not sustainable for more complex organizations. While it improves on reporting directly from the source data, it still has some drawbacks:
Redundant data – The same data might be copied over several times in order to serve slightly different requirements. This makes it difficult to know which tables to use in analytics and makes maintenance more difficult.
Lack of flexibility – Building a new report often requires building a new table because these systems often evolve organically based on a specific reporting requirement. Data needed for a new report may not be available in the reporting database.
Difficulty integrating changes to source systems – Because the same data is often living in multiple reporting tables, changes to upstream systems might necessitate changes to a large number of target tables.
Data Warehouses
Eventually, there was a realization that data needed to be modeled specifically for analytics, both for performance reasons and for reasons of usability. This led to the emergence of the data warehouse, typically supported by a dimensional model, as the leading approach to storing data for analytics.
While the core concepts that underlie data warehousing emerged in the 1960s and 70s, the approach went mainstream in the late 80s and early 90s as relational databases became the standard for storing enterprise data and vendors began selling tools and data stores specifically in support of warehousing.
In data warehouses, the approach was more top-down and planned than with a typical reporting database where the design might be driven by the specific output of a report. The goal was to bring all of the key enterprise data together, ensure that it was accurate, and structure it such that it was easy to explore either using SQL queries or using an emerging class of BI tools that supported both canned reports and ad hoc exploration.
The dimensional models often supporting data warehouses are optimized for the traditional business analytics most organizations need. They are optimized for rolling up and drilling down along dimensions like time, geography, product category, etc. and aggregating meaningful business metrics. This is a very natural approach to evaluating business performance and the endurance of the dimensional modeling approach is a testament to how accurately it reflects the reality of mainstream business analytics needs.
The core strengths of the data warehouse model include the following:
The data model is no longer aimed at a specific report. Data warehouses are well engineered to handle more ad hoc data exploration. Dimensions and facts are modeled generically to serve a broad range of analysis.
Reduced data redundancy makes the model less brittle. Upstream changes may affect only a small number of tables, making changes easier to integrate.
Dimensional models and their implementation as star schemas within databases can support better performance than more flattened models, particularly when using analytics database platforms optimized for them.
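The roll-up and drill-down behavior of a star schema can be sketched with the same toy approach (a hypothetical sketch; the fact and dimension names are invented): a single fact table keyed to flat dimension tables, so each analysis is one join per dimension plus a GROUP BY.

```python
import sqlite3

# Hypothetical star schema: one fact table, flat dimension tables.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT, name TEXT);
CREATE TABLE fact_sales  (date_key INTEGER, product_key INTEGER, amount REAL);

INSERT INTO dim_date VALUES (1, 2020, 1), (2, 2020, 2);
INSERT INTO dim_product VALUES (1, 'Toys', 'Kite'), (2, 'Books', 'Atlas');
INSERT INTO fact_sales VALUES (1, 1, 50.0), (2, 1, 30.0), (2, 2, 20.0);
""")

# Roll up: total sales by category and year (one join per dimension).
rollup = con.execute("""
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.year
    ORDER BY p.category
""").fetchall()
print(rollup)  # [('Books', 2020, 20.0), ('Toys', 2020, 80.0)]

# Drill down: add month to the grouping to see the detail.
drill = con.execute("""
    SELECT p.category, d.year, d.month, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_date d    ON d.date_key = f.date_key
    JOIN dim_product p ON p.product_key = f.product_key
    GROUP BY p.category, d.year, d.month
    ORDER BY p.category, d.month
""").fetchall()
print(drill)
```

Note how drilling down is just a change to the GROUP BY clause rather than a new join chain, which is what makes dimensional models so well suited to ad hoc exploration and to BI tools.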
Data warehouses delivered unprecedented flexibility for data exploration when they were introduced. However, they present their own challenges:
Optimized toward aggregation and filtering – Dimensional models are not suitable for all kinds of workloads. They are built for aggregation and filtering. Certain kinds of advanced analytics are better served by other structures.
Require up front modeling of data – Data needs to be modeled before it can be used. The schema needs to be defined before data can be ingested and explored. This introduces additional latency to the integration of data into the warehouse.
Limited to structured data – Data warehouses and their supporting technologies are built around tabular data that fits well into structured columns. Unstructured data like images, large blocks of text, and complex nested structures are not well-supported and are often treated as an afterthought.
Data Lakes
One of the most recent architectural styles for analytics to emerge is the data lake. The emergence of the Apache Hadoop and Spark projects, cloud infrastructure as a service, and distributed, scalable object stores has enabled a new class of analytics architecture. The separation of compute from storage and the ability to support schema on read allow for far more flexibility than the traditional data warehouse.
In a previous series of blog posts, I provided a more detailed walkthrough of data lake architecture and the related technologies. The advantages of this approach are described in detail there and include the ability to impose a schema on read, the ability to handle both structured and unstructured data, and the ability to support a broad range of advanced analytics beyond simple aggregation and filtering.
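Schema on read can be sketched in a few lines of Python. This is a minimal, hypothetical illustration (the event records and field names are invented): raw records land in the lake untouched, and each consumer projects them onto whatever structure its analysis needs at read time, rather than conforming to a schema defined before ingestion.

```python
import json
import io

# Raw, semi-structured events as they might land in a data lake.
# No schema was imposed at write time; records vary in shape.
raw = io.StringIO("\n".join([
    '{"user": "a", "event": "click", "ts": 1, "meta": {"page": "/home"}}',
    '{"user": "b", "event": "view", "ts": 2}',
    '{"user": "a", "event": "purchase", "ts": 3, "amount": 9.99}',
]))

def read_with_schema(lines, fields):
    """Impose a schema at read time: project each raw record onto
    the fields this particular analysis cares about."""
    for line in lines:
        rec = json.loads(line)
        yield tuple(rec.get(f) for f in fields)

# One consumer wants (user, event); a different consumer could
# re-read the same raw data with a different field list, with no
# upfront modeling and no ingestion-time schema changes.
events = list(read_with_schema(raw, ("user", "event")))
print(events)  # [('a', 'click'), ('b', 'view'), ('a', 'purchase')]
```

Contrast this with the warehouse model, where the missing `amount` field on two of the three records would have to be resolved before the data could be loaded at all.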
These expanded capabilities do come at a cost:
Technology complexity – The technology stack for data lakes is newer than the other approaches described and is still evolving relatively rapidly. This makes it difficult to select technologies and difficult to find skilled resources to operate those technologies.
Operational challenges – Data lake architecture is more distributed and has fewer constraints on the format and scale of data stored. This makes it relatively easy to build silos within a larger data lake, create redundant copies of data, and otherwise undermine the organization of data. The lack of a standard approach to cataloging data in the data lake contributes to the chaos.
Less mature security infrastructure – Because of increased technology complexity, the lack of integration across the data lake stack, and the fact that data lakes evolved primarily to serve data science workloads, security is sometimes treated as an afterthought. Organizations that are more security-minded are often challenged to formulate a cohesive approach.
It is possible to manage around these challenges today, and platforms are evolving to make end-to-end implementation easier. Our cloud data lake platform, Magpie, is built from the ground up to address these issues and make the advanced capabilities of data lakes available to a broader range of organizations.
Given this steady evolution, where are things going next? We can speculate based on the weaknesses of data lakes as they currently exist and the shortcomings of previous models that haven’t yet been addressed.
Data lake platforms will continue to simplify the deployment and management of data lakes. Hiding the complexity of the technology stack without sacrificing the power and flexibility of data lakes will allow more enterprises to adopt the technology over time.
Governance tools for data lakes will take on increasing importance. This is reflected in the industry broadly and is an area where we have built substantial features into our platform.
Data warehouse platforms will seek to adopt the best features of data lakes. This is reflected in the fact that many cloud-based systems, like Redshift and Snowflake, now support external tables stored in arbitrary formats.
All of these architectures are likely to persist side by side, often within a single organization. They each offer tradeoffs between functionality, manageability, and performance, and each is useful under the right conditions.
To learn more about our data lake platform, Magpie, click here.