Data lineage can be a tremendously useful tool for data engineering and analytics, but it is often treated as an afterthought, both because it is challenging to implement and because it has not been broadly available within organizations. Many practitioners have never had access to data lineage information and may not know what they are … [Read more...] about Knowing Your Data Starts with Data Lineage
This blog post covers Apache Spark basics and teaches readers why optimizing Spark scripts is important and how to do it for both memory and runtime efficiency. It is best suited for data analysts and data scientists looking for information on optimizing existing Spark workflows or creating new ones.
INTRODUCTION: WHAT IS APACHE … [Read more...] about Apache Spark: A Beginner’s Guide to Optimizing Spark Scripts
In this blog post, we explore how a data analyst can know when to trust data and when to be skeptical. Using the data integrity features in your modern data engineering platform should help you sort this out quickly and move on to empowering decision-making across the company.
DATA INTEGRITY: JUST THE FACTS
People say you can trust data. “Let … [Read more...] about Data Integrity Issues Holding You Back?
Overview: Build Your Data Architecture to Enable Use Cases
One of the things that we often wrestle with in building out data lake architecture is how to best lay out the infrastructure to support different analytical use cases, and more specifically, what storage mechanism might yield the best performance.
One of the virtues of data lakes is … [Read more...] about Data Lake Architecture Guide: Choosing the Right Storage Tool
EXAMINING NYC TRANSPORTATION DATA THROUGH MAGPIE’S ONE-CLICK RAPID DATA PROFILING
In our first data-centric blog post, we provide a step-by-step introduction to the immediate value generated by Magpie’s ability to show users what is in a dataset before analysis begins. Below, we use publicly available data from New York City's Open Data … [Read more...] about Data Profiling: A Step-by-Step Introduction
The previous post in this series established a non-technical basis for understanding data lakes and why they might be important. This post takes a more technical look at data lakes. It is not a detailed technical primer so much as a review of the architectural dynamics that underlie data lakes and some of their … [Read more...] about Diving In to the Technology Behind Data Lakes