Data Lake Storage: Choosing the Right Tool for the Job

Overview

One of the things we often wrestle with in building data lakes is how best to architect the infrastructure to support different analytical use cases, and more specifically, which storage mechanism might yield the best performance.

One of the virtues of data lakes is that you can build hybrid approaches to storing and processing information that allow you to play to the strengths of particular types of storage without being tied to them exclusively. This also gives you a broad range of choices. While these choices bring flexibility, they can also lead to more complexity. Choosing the right set of “good enough” storage formats for your data lake can make implementation a whole lot easier.

This post lays out what we see as some common use cases and their implications for data storage. It illustrates the diversity of technical approaches to storing data for analytics and how they compare to one another. I have also included a small, informal benchmark to further demonstrate the point.

The Contenders

There are a number of different types of storage that you can integrate into your data lake or broader analytics architecture. These include storage layers that are largely independent of compute capacity, as well as approaches that integrate the two and may store data in a proprietary format.

The entries below describe some of these approaches, each with example technologies.

Distributed File System/Object Store – Text File

The simplest form of storage beyond local disk storage. Distributed file systems typically provide fault tolerance, redundancy, and the ability to read and write data in parallel from clustered nodes. In this case, data is stored in text files, either in a tabular, delimited format or in a more structured format like JSON.

For the purposes of this discussion, we are going to treat object stores like AWS S3 that present an HDFS-like interface as part of this category.

Examples: HDFS/CSV, HDFS/JSON

Distributed File System/Object Store – Column-oriented

Flat text files can be effective for certain use cases, but more often than not, a structured file format with schema information is a better choice. In particular, compressed, column-oriented formats with partitioning tend to perform very well for a range of analytical applications. Data retrieval and aggregation are accelerated because only the required columns are retrieved and data in the same column is contiguous.

Examples: HDFS/Parquet, HDFS/ORC

Distributed File System – Row-oriented

In some cases, column-oriented stores may not be effective: for example, streaming applications, or applications where there is a good deal of variability in the schema of written objects. In these cases, a binary, compressed format is still useful.

Examples: HDFS/Avro

RDBMS – Column-oriented

Similar to the compressed column formats described above, but embedded in a fully functional SQL database, where a proprietary storage format can further improve performance.

Examples: Redshift, Vertica, Snowflake

RDBMS – Row-oriented

A conventional relational database, typically used as the backend data store for applications, but also suitable for analytics at a moderate scale.

Examples: PostgreSQL, MySQL, Oracle, MS SQL Server

NoSQL

Key-value and document stores that typically provide fast writes and retrievals of data. Analytical performance can vary significantly from one architecture to another.

Examples: MongoDB, ElasticSearch, Cassandra

In Memory

These are data stores that manage information in memory on the server or cluster, providing very high performance for queries by eliminating the storage I/O bottleneck. There are also a variety of caching mechanisms, like the one built into Spark, that can provide similar performance once data has been initially loaded.

Examples: MemSQL, MonetDB, various caching mechanisms

Local File System

The native file system for the server being used. This may also include network-attached storage and external storage arrays, but it does not include distributed systems like HDFS.

Examples: ext4, NTFS

There are other types of data stores that we could add, for example graph databases, but the intent here is to be representative rather than exhaustive. We may expand on this topic later and take a more comprehensive look.
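To make the row-oriented versus column-oriented distinction above concrete, here is a minimal Python sketch. In-memory lists stand in for the on-disk layout; this is purely illustrative and not how any real format is implemented:

```python
# The same three-row trip table stored two ways (illustrative only).

# Row-oriented: one tuple per record, fields interleaved.
rows = [(1, 2.5, 236),   # (trip_id, trip_distance, pickup_zone)
        (2, 0.8, 170),
        (3, 4.1, 236)]

# Column-oriented: one sequence per column, values contiguous.
columns = {"trip_id": [1, 2, 3],
           "trip_distance": [2.5, 0.8, 4.1],
           "pickup_zone": [236, 170, 236]}

# With rows, every record must be visited to read a single field.
avg_from_rows = sum(r[1] for r in rows) / len(rows)

# With columns, only the one relevant column is scanned, which is
# why columnar formats accelerate aggregations over a few columns.
dist = columns["trip_distance"]
avg_from_cols = sum(dist) / len(dist)

assert avg_from_rows == avg_from_cols
```

Both layouts yield the same answer; the difference is how much data must be touched to get it, which matters once tables have many columns and millions of rows.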

Strengths and Weaknesses

As stated above, each of these storage approaches has different strengths and weaknesses. They may all have a place in your broader analytics architecture, but they need to be applied in the right way.

The use cases below map some of the specific workloads that may be present in your data lake to good, good enough, and bad storage structures.

Feature Engineering/Model Training (large data set)

The process of building machine learning and AI models to improve decision-making. In this case, with a focus on large data sets.

Good: DFS/Compressed Columns, NoSQL, DFS/Compressed Rows
Good enough: Anything that allows parallel access by multiple cluster nodes
Bad: Anything that doesn’t allow parallel access

Model Training (small data set)

For the training of models on smaller data sets, simple files within the local file system of the computer being used will usually yield the best performance. If you have “small” data, a simple file accessed from tools like RStudio or Jupyter can be adequate.

Good: Local file system
Good enough: DFS/CSV, DFS/Row-oriented, NoSQL, Relational DB, Column Store
Bad: NA – depending on the size of data, it may not really matter

Traditional BI – Aggregation and filtering – large scale, low concurrency, not real-time

Reporting and ad hoc exploration against larger datasets, usually with a small number of concurrent users, perhaps dozens.

Good: DFS/Compressed Columns, Columnar Databases
Good enough: In-memory Databases, NoSQL – Search Engine-based, RDBMS – Row-oriented
Bad: DFS/Row-oriented, DFS/Flat files

Traditional BI – Aggregation and filtering – moderate scale, high concurrency, real-time/streamed

Reporting and ad hoc exploration against moderate-sized data sets, but with a high number of users (hundreds or thousands) accessing the information concurrently. This is often the case when reporting is embedded within applications.

Good: In-memory Databases, NoSQL – Search Engine-based
Good enough: DFS/Compressed Column-oriented, Columnar Databases
Bad: DFS/Flat files

A Practical Perspective

Pragmatically speaking, you will need to deal with some of these formats just given their prevalence in both source systems and analytics infrastructure. The following are likely to play some role in your implementation, whether as storage formats within the data lake or as sources or targets for data.

  • Text Files – Information will often come into the data lake in the form of delimited text, JSON, or other similar formats. As discussed above, text formats are seldom the best choice for analysis, so you should generally convert to a compressed format like ORC or Parquet.

  • Relational Database Sources – Almost every enterprise has one or more relational databases that are likely to be sources for data that needs to be fed into analytics. Column-stores like Redshift or Snowflake are really a category of relational database and will often be a target for refined information coming from the data lake.

  • Compressed Column-oriented Formats – These formats are the workhorse of most data lakes. They provide reasonable performance under a variety of workloads and are space-efficient. Either Parquet or ORC is likely to play a role in your data lake.

Other types of storage should be considered on a case-by-case basis.

  • Compressed Row-oriented Formats – You may need to deal with the Avro format in particular if you want to ingest and use streaming data from Kafka.

  • In-memory Databases - While caching within your data lake platform is likely, you should only use a dedicated in-memory caching infrastructure or full-fledged in-memory database if you have a specific use case that calls for it. Examples include real-time analytics against streaming data and high-concurrency applications as described above.

  • NoSQL Stores - NoSQL databases and search-index-based platforms like ElasticSearch can be useful for rapidly generating metrics over large sets of data. The performance characteristics and features of these platforms vary widely, so it is best to fully understand your specific needs and choose accordingly.

Data lakes are well-positioned to handle a mixed workload. To a large extent, compressed, binary formats stored in HDFS or in a distributed object store that allows parallel access, like AWS S3, can be fairly good for a very broad range of applications, as reflected in the “Good Enough” category above. Keep in mind that sometimes, good enough is great.
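A toy illustration of why converting incoming text to a compressed, column-oriented layout pays off: grouping the same values by column before compressing typically shrinks much better than compressing row-by-row JSON text, because similar values end up adjacent. This is a pure-Python sketch with made-up data, not how Parquet or ORC actually encode anything:

```python
import gzip
import json
import random

random.seed(7)

# A small synthetic trip table (all values made up for illustration).
trips = [{"vendor": "CMT",
          "distance": round(random.uniform(0.5, 30.0), 2),
          "zone": random.choice([236, 170, 41])}
         for _ in range(1000)]

# Row-oriented text layout: one JSON object per line, as data
# often arrives in the lake.
text_blob = "\n".join(json.dumps(t) for t in trips).encode()
compressed_text = gzip.compress(text_blob)

# Toy columnar layout: each column's values are grouped together
# and compressed as a unit, so repetitive columns (a constant
# vendor, a handful of zone ids) collapse to almost nothing.
compressed_columns = b"".join(
    gzip.compress(json.dumps([t[col] for t in trips]).encode())
    for col in ("vendor", "distance", "zone"))

print(len(compressed_text), len(compressed_columns))
```

On this sample, the column-grouped bytes come out noticeably smaller than the compressed line-delimited text, and real columnar formats add type-aware encodings and per-column statistics on top of this basic idea.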

Benchmarking Analytics Performance

To drive home the point that different storage approaches can substantially impact performance for specific use cases, we tested the analytics performance of three different data storage approaches for some simple processing.

The Data

We conducted the benchmark using public data about New York City taxi trips during 2017 and 2018. This data was sourced from the New York Taxi and Limousine Commission’s public trip record data - https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page. There were two tables used in the analysis:

  • taxi_zones – This is a lookup table that provides attributes about different taxi zones within New York City including a name and latitude and longitude for the centroid of the zone. This dataset is 236 rows.

  • yellowtrip_data – This is a listing of taxi trips in New York City over the course of two years. This table has approximately 108M rows.

While these are relatively small datasets by Spark standards, they are big enough to illustrate some of the performance differences on a small cluster. They also represent a fairly typical analytics scenario: data from a large fact table representing a series of events is aggregated using dimensional attributes from a lookup table.

The Storage Formats

We took a representative set of storage formats that we support within our Magpie platform.

  • Parquet Format in S3, Snappy Compression – As described above, Parquet is often the “default” storage format for Spark.

  • ORC Format in S3, Snappy Compression – ORC is the predominant format used in Hive-based implementations. Recent versions of Spark have substantially improved ORC handling.

  • Avro Format in S3, Snappy Compression – Avro is row-oriented (unlike Parquet and ORC) and cannot selectively return only relevant columns or accelerate aggregation. It is the de facto standard format in the Kafka streaming ecosystem.

  • JSON Format in S3, GZIP Compression – We also included a JSON format file to illustrate just how badly text files perform for this kind of analytical processing.

  • JSON Format Cached in Memory – We also used Spark’s caching mechanism to cache the JSON version of the table in memory. Our expectation would be that any of the storage formats would perform equivalently once cached in memory. JSON, taking about two minutes to warm up the cache on the initial run of a query, is probably slower than other approaches in this one respect but should otherwise be comparable.

The Analysis

We ran three distinct kinds of workloads to illustrate the differences between formats in different scenarios.

  • Join with Aggregation – Summarizing metrics by attributes. In this case, summarizing average trip distance and trip counts for every combination of pickup and drop-off taxi zones.

  • Join and Filter without Aggregation – In this case, pickup and drop-off locations are joined to the core trip table and the results are filtered. The purpose of these queries is to measure how the performance advantage of the column-oriented formats is impacted when their main advantages, fast aggregation and column selection, are not used. Two different queries are used because we discovered that the filter criteria significantly impacted relative performance. Results for both are included.

  • Feature extraction/transformation – We also ran a standard type of feature extraction process against the tables. In this case, we simply scaled some numerical variables present in the table. This often needs to be done to ensure that differences in scale between variables do not unduly impact modeling results. It involves scanning all of the rows of the table.
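The scaling step above amounts to something like the following standalone sketch, using standard (z-score) scaling in plain Python; the benchmark itself ran this kind of transformation through Spark against the full table:

```python
import statistics

# Illustrative numeric column standing in for trip_distance.
trip_distance = [1.2, 0.5, 3.8, 10.4, 2.2]

# Standard scaling: subtract the mean and divide by the standard
# deviation so the variable has mean 0 and unit variance.
mean = statistics.fmean(trip_distance)
stdev = statistics.pstdev(trip_distance)
scaled = [(x - mean) / stdev for x in trip_distance]

# Computing the statistics and then applying them each require a
# full pass over the data, which is why this workload scans every row.
print([round(v, 2) for v in scaled])
```

Note that both passes touch every row but only the columns being scaled, which is why columnar formats still help here, just less dramatically than for selective aggregations.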

This benchmark was run on a single node Magpie cluster with 16 cores and 64GB of RAM. The files were all stored in S3 using either Snappy or GZIP compression. Magpie is our data lake platform powered by Apache Spark. Performance in our platform is roughly comparable to performance measured in a vanilla Spark environment.

The Results

The charts below summarize the results of the performance testing. Below each chart is a more detailed evaluation of the results.

Join with Aggregation

The query used to generate these results aggregates data by pickup and dropoff location.

 
SELECT pulocationid,
       tz1.zone                      AS pickup_zone,
       dolocationid,
       tz2.zone                      AS dropoff_zone,
       Sum(trip_distance) / Count(*) AS avg_distance,
       Count(*)                      AS trip_count
FROM   yellow_tripdata_orc yt
       JOIN taxi_zones_orc tz1
         ON tz1.zone_id = yt.pulocationid
       JOIN taxi_zones_orc tz2
         ON tz2.zone_id = yt.dolocationid
GROUP  BY pulocationid,
          dolocationid,
          pickup_zone,
          dropoff_zone 
 


In this case, the results are predictable, with the columnar formats performing best among the file formats, Avro performing slightly worse, and JSON performing significantly worse. As we will see in the other cases, cached data performed best overall.
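For readers less familiar with SQL, the join-and-aggregate logic of this query can be expressed in a few lines of plain Python over a tiny in-memory sample. The rows and zone names here are illustrative stand-ins, not the benchmark data:

```python
from collections import defaultdict

# Tiny stand-ins for the two tables; all rows are made up.
taxi_zones = {236: "Upper East Side North",
              170: "Murray Hill",
              41: "Central Harlem"}
trips = [(236, 170, 2.5),   # (pulocationid, dolocationid, trip_distance)
         (236, 170, 3.5),
         (41, 236, 7.9)]

# GROUP BY the pickup/dropoff pair, accumulating distance and count.
totals = defaultdict(lambda: [0.0, 0])
for pu, do, dist in trips:
    totals[(pu, do)][0] += dist
    totals[(pu, do)][1] += 1

# SUM(trip_distance) / COUNT(*) per pair, joined to zone names.
result = {(taxi_zones[pu], taxi_zones[do]): (dist_sum / count, count)
          for (pu, do), (dist_sum, count) in totals.items()}

print(result[("Upper East Side North", "Murray Hill")])  # (3.0, 2)
```

The columnar formats win at this workload precisely because only three columns of the wide trip table are needed to build these accumulators.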

Join with Filtering and No Aggregation - Query 1

For this test, the query joins both trip start and destination information, but only filters on date criteria in the yellow_tripdata table. The query used is shown below.

 
SELECT tz1.zone AS pickup_zone,
       tz2.zone AS dropoff_zone,
       yt.*
FROM   yellow_tripdata_json yt
       JOIN taxi_zones_json tz1
         ON tz1.zone_id = yt.pulocationid
       JOIN taxi_zones_json tz2
         ON tz2.zone_id = yt.dolocationid
-- Date filter on the trip table (illustrative criteria)
WHERE  yt.tpep_pickup_datetime >= '2017-01-01'
       AND yt.tpep_pickup_datetime < '2017-02-01'
 

In this test, ORC performed significantly better than Parquet. This could be due to the specific approach ORC uses for full row retrieval and its indexing mechanisms. More importantly, it illustrates how relative performance can vary based on workload.

A surprising result here is that JSON significantly outperforms Avro for this particular query. This could simply be because of the additional overhead of processing Avro’s more complex format. When we introduced more complex filter criteria, Avro tended to perform better.

Join with Filtering and No Aggregation - Query 2

The second query, shown below, differs in that it includes filter criteria for both the start and destination of the trip in the taxi_zones table. This slows down performance overall and changes the relative performance across formats.

 
SELECT tz1.zone_id,
       tz1.borough,
       tz1.centroid_lat,
       tz1.centroid_lon,
       tz1.zone AS pickup_zone,
       tz2.zone AS dropoff_zone,
       yt.*
FROM   yellow_tripdata yt
       JOIN taxi_zones tz1
         ON tz1.zone_id = yt.pulocationid
       JOIN taxi_zones tz2
         ON tz2.zone_id = yt.dolocationid
WHERE  tz1.zone_id = 234
       AND tz2.zone_id = 170 
 

In this case, the performance is more in line with what we would expect. Avro outperforms JSON significantly, but still lags relative to the column-oriented formats. Interestingly, Parquet is faster than ORC in this case. This may imply that Parquet does better with smaller result sets and more complex filters than ORC. More testing would be required to confirm that finding.

Feature Engineering

In the feature extraction example described above, we applied a basic scaling transformation to all of the numerical variables in the core yellow_tripdata table. This type of transformation typically requires a full traversal of the table. The results show a less stark difference in performance across storage types, suggesting that the individual column selectivity of ORC and Parquet drives some benefit, but that computation, rather than retrieval, is the bigger bottleneck. The exception is JSON, which still may be suffering from its less efficient storage and the need to parse each line as it is processed.

Key Takeaways

There are a few observations that we can make based on our test runs.

  • Performance for Parquet and ORC is roughly equivalent, or at least in the same neighborhood. There are subtle differences depending on the size of data being aggregated, the level of aggregation, the selectivity of filters, and the total number of columns being pulled. Spark tends to favor Parquet and Hive tends to favor ORC, so it may be advantageous to stick with what is most widely used in your platform.

  • As expected, JSON is slow relative to other approaches and this just reinforces the fact that text formats are useful for information exchange, but that they should be converted to a compressed format as quickly as possible for analysis.

  • Avro sits in the middle in terms of performance and should be considered based on the specific use case. For example, if a lot of schema variability across rows is expected and can’t be foreseen ahead of time, Avro provides greater flexibility than Parquet or ORC, which depend on a largely consistent schema to drive their performance. When all columns are being selected and there is no aggregation, performance can be comparable to column-oriented approaches. Surprisingly, in our testing, Avro performed worse than even a JSON text file in one specific case, showing once again that performance can vary significantly depending on the workload.

  • The tables used in this benchmark are relatively narrow. For wider tables that may have hundreds of columns, the difference between columnar and row-oriented approaches will be more pronounced. This is intuitive given the underlying technology and is consistent with what I have observed in practice.

  • In-memory approaches can be very useful if your data is actually small enough to fit in memory. In-memory data performed best in all cases once the cache was warm. The difference at this scale, however, was not substantial enough to warrant a blanket recommendation to cache as much as possible.

  • We were using S3 for storage in this case (a “default” option for many implementations on AWS). The performance characteristics will be different in a true HDFS cluster where there is less storage I/O overhead.

Note: this is a very informal test and really just shows how much of a difference some simple choices about format can make. This should not be seen as a comprehensive comparison or used to drive decision-making. Your best bet is to experiment with your own data in your own environment to get to an optimal approach. However, this can be used as a backgrounder to get you up to speed on some of the differences.

Conclusions

A modern analytics architecture will require some or all of these different storage types based either on specific requirements or on the existing technology stack within that particular organization and how source systems are set up. 

Anticipate using a diverse range of storage types. At a minimum, expect to include Parquet or ORC in your environment. You will end up connecting to relational database sources in many cases, and you may end up using formats like Avro to retrieve and store streamed data.

Most importantly, the type of storage that is best for each use case will likely be specific to your environment. Experimentation will allow you to learn the best option in your particular environment.

To learn more about our data lake platform, Magpie, click here.

Demetrios Kotsikopoulos is the CEO of Silectis. 
You can find him on LinkedIn and Twitter.