Notebooks Are Taking Over Conventional Data Tools

This blog post takes a look at how a popular user interface has gained traction as the go-to front end for data professionals and data companies alike. Notebooks can make data teams more productive by hiding cumbersome configuration. How will the notebook evolve to meet the demands of the tech world?

DATA PEOPLE LOVE NOTEBOOKS

If you have been working in data science, you might have noticed the booming popularity of the notebook interface. Notebooks are client-server applications that run in a web browser. Underneath is a computational engine called a kernel that executes the code contained in the document.
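To make the kernel idea concrete, here is a toy sketch in Python. It is not how Jupyter or Zeppelin implement kernels (real kernels run as separate processes and talk to the front end over a messaging protocol); the ToyKernel class and its run_cell method are names made up for this illustration. The point it shows is simply that a kernel keeps one long-lived namespace, so a variable defined in one cell is still there when the next cell runs.

```python
import contextlib
import io


class ToyKernel:
    """Hypothetical stand-in for a notebook kernel: one namespace shared by every cell."""

    def __init__(self):
        self.namespace = {}        # variables persist here between cells
        self.execution_count = 0   # incremented each time a cell runs

    def run_cell(self, source):
        """Execute one cell's source and return (execution_count, captured stdout)."""
        self.execution_count += 1
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return self.execution_count, buffer.getvalue()


kernel = ToyKernel()
kernel.run_cell("rows = [3, 1, 2]")                     # first cell defines a variable
count, output = kernel.run_cell("print(sorted(rows))")  # second cell can still see it
```

The second cell works only because the first one already ran in the same namespace, which is exactly the kind of hidden state that makes notebooks both convenient and easy to misuse.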

Notebooks are becoming the default vehicle for exploration, collaboration, and presentation amongst data science practitioners and scientific researchers. The ubiquity of the friendly notebook interface has led SaaS companies to use notebooks as the face of their products.

In this blog post, we will talk a bit about how we got here, why notebooks are so powerful, and what might be next in store for notebooks.

A NOTEWORTHY JOURNEY

The first and last miles of data science are about discovery and communication, respectively. Teams need to be agile, and sharing knowledge needs to be seamless. Not too long ago, analytics workflows lived in the command line, integrated development environments (IDEs), or point-and-click SaaS tools.

The need for a computing environment to support interactive data science was met by popular open-source projects like Jupyter and Apache Zeppelin. Voilà! A user interface that allows users to annotate computational narratives, produce data visualizations inline with code, and share knowledge effortlessly.

Many organizations have begun swapping out older tools to make way for the notebook interface. Jupyter, for example, boasts institutional partners like Netflix, JP Morgan, and Bloomberg. GitHub has native rendering of notebooks. The Atlantic declared the scientific paper to be obsolete. Google provides access to GPU servers, commonly used for machine learning, through a notebook interface.

CONFIGURING ENVIRONMENTS IS HARD

Perhaps the most powerful feature is that notebooks run on top of preconfigured kernels. This affords a range of skill levels access to high-powered, cutting-edge technology.

A personal aside.

I cut my teeth as a data analyst using SQL workbenches and user-friendly IDEs like RStudio. These tools typically pointed at relational databases (i.e., no-frills tables made up of rows and columns). Connecting these types of tools was a light lift, and any gap in my skills could be readily bridged by my colleagues.

As my career advanced, so did the complexity of my projects. I became solely responsible for configuring my team’s environment. Configuration was often the most difficult part of using more powerful and exotic big data systems.

Which package manager does this machine use? Stack Overflow search: “jdbc error message 202”. Now paste this key into that dot file… Wait, I need our system administrator to run a sudo command. Okay… let me undo whatever Anaconda just did because the summer intern convinced me to try virtualenv.


Once we got up and running, I wasn’t sure how I got there.

Organizations are also trying to overcome this skills gap. With the proliferation of data science programs and coding bootcamps, the market for data science talent has never been better. But many data professionals entering the workforce today don’t have experience connecting their preferred data science tool to a real-life data source. As a result, data access can be a major drag on productivity. The challenge is only exacerbated by the trend of organizations looking to adopt more sophisticated and specialized technology.

THE GROWING NEED FOR BEST PRACTICES

Notebooks support data access for a range of users, including business analysts, data scientists, and software engineers. However, the current generation of notebooks still has room for improvement.

Some improvements are simple. For instance, when users run code blocks out of order, notebooks can become confusing and lose the reproducibility users have come to love. In response, many notebook tools have inline counts that annotate the order in which code blocks were run.
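Those execution counts are stored in the notebook file itself, so they can also be checked by a script. The sketch below assumes the Jupyter notebook file layout (a JSON document with a "cells" list, where code cells carry an "execution_count"); the out_of_order_cells function is a hypothetical helper written for this post, not part of any notebook tool.

```python
import json


def out_of_order_cells(notebook):
    """Return indices of code cells whose execution count breaks top-to-bottom order."""
    flagged = []
    last_count = 0
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "code":
            continue  # markdown and raw cells have no execution count
        count = cell.get("execution_count")
        if count is None:
            continue  # the cell was never run
        if count <= last_count:
            flagged.append(index)
        last_count = max(last_count, count)
    return flagged


# A tiny made-up notebook: the last cell was run before the cell above it.
example = json.loads("""
{"cells": [
  {"cell_type": "code", "execution_count": 1, "source": "x = 1", "outputs": [], "metadata": {}},
  {"cell_type": "markdown", "source": "Scale x up.", "metadata": {}},
  {"cell_type": "code", "execution_count": 3, "source": "x = x * 10", "outputs": [], "metadata": {}},
  {"cell_type": "code", "execution_count": 2, "source": "print(x)", "outputs": [], "metadata": {}}
], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
""")
flagged = out_of_order_cells(example)
```

Here the final cell printed x before the cell above it multiplied x by ten, so its saved output no longer matches what a reader would get by running the notebook from top to bottom. A check like this could run before a notebook is shared.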

Some improvements will require more thoughtful answers. Without a standard way to apply version control, there isn’t an obvious fit for notebooks in the broader software development process. While this may seem trivial for now, organizations are putting an increased focus on DataOps. Data teams will need to make sure their tools do not impede higher standards.

Bringing notebooks into the fold will require the adoption of new features and best practices. Luckily, notebook fans have been studiously building and trying out new tools, add-ons, and extensions. Developers are already creating methods for code reviews and git-like version control.
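One common practice along these lines is to clear outputs before committing a notebook, so that diffs show changes to the code rather than to rendered results. The sketch below is in the spirit of community tools such as nbstripout but is not taken from any of them. It again assumes the Jupyter JSON file layout, and strip_outputs is a hypothetical helper name.

```python
import copy
import json


def strip_outputs(notebook):
    """Return a copy of the notebook with code-cell outputs and execution counts cleared."""
    cleaned = copy.deepcopy(notebook)  # leave the caller's notebook untouched
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned


# A tiny made-up notebook with one executed code cell.
notebook = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 7,
            "metadata": {},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": "42\n"}],
            "source": "print(6 * 7)",
        }
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

cleaned = strip_outputs(notebook)
# Sorted keys and fixed indentation keep the serialized text stable between commits.
serialized = json.dumps(cleaned, indent=1, sort_keys=True)
```

The trade-off is that the committed file no longer shows results, so teams that rely on rendered notebooks for presentation may prefer to keep a separate, executed copy.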

Our data lake platform, Magpie, uses a notebook interface to give users easy access to their organization’s central analytics hub. To make life easier, we give users the ability to set up reproducible pipelines as jobs. This feature effectively stores and automates the logic you develop in your notebooks, while eliminating some of the dangers that come with using notebooks directly for recurring pipelines.

CONCLUSION

Before a clear winner emerges, one can only speculate how organizations will accommodate notebooks in a mature development setting. The pull is likely to come from both directions. That is, more data professionals will grow accustomed to the ease and transparency of notebooks, while organizations will get smarter about DataOps and how data teams can effectively deliver insight.

Until open source catches up, enterprise data tools like Magpie may lead the charge in introducing features and best practices to data-driven users and organizations hungry for what comes next.

To learn more about our data lake platform, Magpie, click here.

Brendan Freehart is a Data Engineer at Silectis.
You can find him on LinkedIn.