The toolchain for data science is incredibly large today, perhaps twice the size of the traditional software development toolchain, because we need tools and processes for both code and data.
A project lifecycle has three important components:
- Experimentation
- Review and Versioning
- Execution
In data science projects, all of these components apply to code, models, and data.
The data science community is still catching up on standardizing the processes and tools around data and model versioning. One thing the field has done quite well, though, is build great experimentation environments.
Jupyter notebooks and their competing flavors are the lifeline of data scientists.
We tend to spend almost all of our time in these notebook environments whether it is to prepare datasets, train models, evaluate results, or perform data analysis.
Notebooks are great for quick experimentation, but traditionally they have had a couple of challenges:
- Notebooks are not directly executable.
- Git-based tools don't support code review of notebooks out of the box.
Due to these challenges, the common development cycle involves experimenting in notebooks and deploying through the painstaking process of converting notebooks to scripts. Converting a notebook to a script once would be fine, but that is rarely the case: notebooks keep evolving, and every change means another conversion.
While ML modeling is notebook-first, Spark and PySpark are script-first, which means the majority of standard tooling today is built to work well with scripts. This definitely works, but it unnecessarily increases the workload of full-stack data science teams, who have to keep juggling between experimenting in notebooks and deploying code as scripts.
At Draup, the data science team is full of super users, and we found that this juggling was a point of friction for us. Although we were comfortable converting our code to scripts, we took a bold step and asked ourselves, "What if?"
We started looking for success or failure stories from organizations that had put notebooks into production. We couldn't find failure stories (few teams publish those), but one very promising success story came from Netflix, where it worked remarkably well. This gave us a ray of hope and propelled us down a rarely trodden path.
Our requirements were the following:
- The system should provide an excellent experimentation environment.
- The system should provide an easy code review process.
- The system should require minimal to no code changes to deploy a cleaned-up experiment notebook.
- The system should allow passing arguments to jobs.
- The system should allow configuration of Spark jobs.
Let’s cover the core components of our process in detail.
AWS EMR (Elastic MapReduce) provides a hosted Jupyter notebook environment as part of its offering. EMR allows users to create compute clusters and provides ways to customize them. The notebook editors are separate entities that can be attached to a running cluster, and users can run PySpark jobs through them.
Jupyter notebooks use the sparkmagic kernel to communicate with a Spark cluster and run PySpark jobs.
In summary, Jupyter notebooks can submit jobs to the Spark cluster using livy + sparkmagic.
Overall, EMR notebooks give us a familiar, comfortable Jupyter environment in which to run PySpark code and perform our analysis.
Review and Versioning
EMR notebooks can sync with multiple git repositories, which gives us a gateway into code review and versioning. However, while this works well for scripts, GitHub doesn't provide proper tools for reviewing notebooks: it treats them as plain text files, which makes them almost impossible to review. Thankfully, we came across an amazing tool called reviewnb.com
Used by several AI-first organizations, it was a no-brainer for us to give it a shot, and we loved it at first sight.
This tool promises just one thing: code review of Jupyter notebooks, and it does it exceptionally well.
The onboarding experience was great and the tool is very simple to use. There is no going back now!
Execution

This is the component that required a lot of effort and courage to push through. As mentioned earlier in this post, we didn't find many people excited about executing notebooks apart from Netflix, which translates directly into limited tooling.
To be fully transparent: we first used the EMR notebook execution API to run our notebooks programmatically in production and thought we had won the battle, but we later found it had some major shortcomings:
- We couldn't configure the Spark parameters at runtime.
Sparkmagic provides the %%configure magic command to configure the Spark runtime right before the PySpark app starts.
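For example, a cell like the following at the top of a notebook asks Livy for specific resources; the values here are illustrative, not our production settings:

```
%%configure -f
{
    "driverMemory": "4g",
    "executorMemory": "8g",
    "executorCores": 4,
    "conf": { "spark.dynamicAllocation.enabled": "true" }
}
```

The `-f` flag forces sparkmagic to drop any existing session and start a new one with this configuration.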
Notebook execution behavior
The default PySpark notebook execution behavior was inconsistent with our requirements: by default, an EMR notebook execution did not FAIL when one or more cells failed. This behavior was a dealbreaker for us, and we knew things had to change.
We investigated EMR notebook execution, and one thing became quite obvious to us: it uses papermill under the hood.
Papermill provides a CLI as well as a python module to parameterize notebooks and execute them.
A simple notebook like the one below
can be executed with several parameters using the papermill Python module or the command line.
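For instance (the notebook names and parameter values here are hypothetical), the command-line form passes one `-p name value` pair per parameter:

```
papermill input_nb.ipynb output_nb.ipynb -p run_date 2021-01-01 -p env prod
```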
Our system also adds a few more parameters for logging purposes, but the final output looks something like the image below.
In summary, papermill inserts a new cell in the notebook and executes it based on the kernel information present in the notebook itself.
Injecting the PySpark app configuration
A Jupyter notebook is actually a JSON file with a defined format, so we can easily manipulate its contents. To inject a dynamic configuration, we insert the configuration parameters, adhering to the Livy standards, right before executing the notebook. The final notebook looks like the below image.
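A minimal sketch of this injection step, assuming a Livy-style configuration dict (the setting values and the notebook skeleton below are illustrative):

```python
import json

# Livy-style Spark settings to inject; the values are illustrative.
SPARK_CONFIG = {
    "driverMemory": "4g",
    "executorMemory": "8g",
    "conf": {"spark.sql.shuffle.partitions": "200"},
}

def insert_configure_cell(notebook, config):
    """Prepend a %%configure cell so sparkmagic applies `config`
    before the Spark session starts."""
    cell = {
        "cell_type": "code",
        "execution_count": None,
        "metadata": {},
        "outputs": [],
        "source": ["%%configure -f\n", json.dumps(config)],
    }
    notebook["cells"].insert(0, cell)
    return notebook

# A minimal notebook skeleton; in practice this comes from the .ipynb file.
nb = {"cells": [], "metadata": {}, "nbformat": 4, "nbformat_minor": 5}
nb = insert_configure_cell(nb, SPARK_CONFIG)
```

Because the notebook is just JSON, the modified document can be written straight back to disk (or S3) and handed to papermill unchanged.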
To summarize the steps:
- Get the raw notebook.
- Insert notebook arguments using papermill.
- Insert the livy-spark configuration as the first cell.
- Execute using papermill.
Cluster Setup and Execution
The cluster needs to be configured to allow notebook execution through papermill. We do this by leveraging EMR steps, which let us run shell commands on the master node of the cluster.
We use the following bash command to set up a cluster.
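In outline, the step script bootstraps papermill on the master node; the exact packages and pinned versions are deployment-specific, so treat this as a sketch rather than our actual command:

```sh
#!/bin/bash
# EMR step: runs as a shell command on the master node.
# Install papermill plus the Jupyter pieces it needs to locate kernels.
sudo python3 -m pip install papermill ipykernel jupyter_client
```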
Here we also update the sparkmagic configuration file to ensure that our application fails on a notebook cell failure instead of carrying on silently.
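A sketch of that configuration tweak in Python, assuming sparkmagic's fail-fast flags (`all_errors_are_fatal`, `shutdown_session_on_spark_statement_errors`); verify the flag names against your installed sparkmagic version:

```python
import json
from pathlib import Path

def harden_sparkmagic_config(path):
    """Turn on sparkmagic's fail-fast flags so a failing Spark statement
    fails the whole notebook run. Flag names are an assumption and should
    be checked against the installed sparkmagic release."""
    cfg = json.loads(path.read_text()) if path.exists() else {}
    cfg["all_errors_are_fatal"] = True
    cfg["shutdown_session_on_spark_statement_errors"] = True
    path.write_text(json.dumps(cfg, indent=2))
    return cfg

# On an EMR master node the file usually lives at ~/.sparkmagic/config.json.
```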
Finally, once the papermill environment is set up on the master node, we can run the below Python file as a step to execute our production notebooks.
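A simplified version of what such a step script can look like; the helper names and paths are ours for illustration, but papermill's `-p name value` CLI flag is real:

```python
import subprocess

def build_papermill_cmd(input_nb, output_nb, params):
    """Translate an input notebook, an output location, and a dict of
    parameters into a papermill CLI invocation."""
    cmd = ["papermill", input_nb, output_nb]
    for name, value in params.items():
        cmd += ["-p", name, str(value)]
    return cmd

def run_notebook(input_nb, output_nb, params):
    """Run one production notebook; check=True raises CalledProcessError
    if papermill exits non-zero, i.e. if any cell fails."""
    subprocess.run(build_papermill_cmd(input_nb, output_nb, params), check=True)
```

Combined with the fail-fast sparkmagic configuration, a non-zero papermill exit code is enough to fail the whole EMR step.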
What did we achieve?
Developer productivity: The data science team no longer has to convert every notebook to a script.
We go from experimentation notebook to production notebook with little to no change.
Some utility code still lives in scripts, but that code rarely changes.
No more code handoff: Earlier, we handed developed modules off to our ETL team, which was a time-consuming process, so development and deployment times were naturally high. Now we have achieved separation of concerns, and the entire process is owned by the data science group; other teams just call our APIs to run jobs.
Executed notebooks as logs: An executed notebook is a one-stop shop for information about the runtime config, input parameters, source code, runtime errors, warnings, and so on. Executed notebooks are stored on AWS S3, and through commuter we get a neat, browsable interface for inspecting them.
We leveraged several open-source frameworks, including papermill, PySpark, sparkmagic, and Jupyter, to create a process that lets us reliably execute notebooks in production, reducing the delay from experimentation to production and improving developer productivity.
Read more about Data Science at Draup