Dataframes and the proliferation of bad code in data science

Soumendra P
7 min read · Nov 28, 2020

Originally published at https://databiryani.com on November 28, 2020. You can sign up for the dataBiryani newsletter at https://databiryani.com/.

Overusing dataframes and similar data structures gives rise to most of the bad code we have seen data scientists write today.

Why is good code important?

Developing a solution with machine learning is equal parts science and software engineering. The science of it is important, certainly, but your success will depend equally, if not more, on how you engineer your solution. Given the abundance of libraries that implement state-of-the-art algorithms for easy consumption, and the data (and pre-trained models) to go with them, your ability to engineer your code well is the critical component of your success.

Between two somewhat comparable data scientists, the one with better software engineering skills is more likely to succeed in a machine learning project than the one with better machine learning skills.

The reasons are twofold. First, the barrier to entry for building effective models keeps getting lower, as high-level libraries make it increasingly easy to solve a wide variety of problems in machine learning and deep learning. State-of-the-art results become available (as pre-trained models) within weeks of publication. There is always a tutorial around the corner showing you how to tackle your ML/DL problems with freely available tools.

Second, the data science culture is a mess, leading to a focus on model performance rather than real-life performance (which can be measured only after deployment). For data scientists, the impetus is on getting the experiments right, quickly, whatever it takes to achieve that! But the value from data science/machine learning projects comes from their deployment, not modeling experiments.

A more effective data scientist is the one who deploys more often, not the one with better model scores.

The culture around this is changing though, and we hope to be a part of that change.

Why do data scientists write bad code?

In this issue, we focus on what we see as the core reasons, drawn from years of working with and educating data scientists.

Primarily, we think the problem arises from how the Cult of Kaggle (not in the way you think) promoted a habit of State Mismanagement, which results in Poorly abstracted codebases hurting data science projects.

Cult of Kaggle

Kaggle is a competition platform for machine learning and deep learning, and a lot of people complain that Kaggle is not real data science: that it sets up the wrong expectations with its super-clean datasets neatly divided into train/test sets, freeing the data scientist to focus only on the modeling problem and never worry about data issues.

While this is a valid criticism, we believe that there is a place and utility for such a platform. Given the fantastic community that has come up around Kaggle, it is now one of the most popular platforms for new (and old) data scientists to hone their craft.

The problem is the outsized role Kaggle plays in the education of data scientists. Given a new problem, most will rush to those train/test sets (possibly with a validation set thrown in) and try to get a model that “works” in the empirical risk minimization sense (on the test set, of course).

This results in code that has bad abstractions. This code is usually propped up with all sorts of hacky engineering practices and put into production.

But if we had to put our finger on one bad abstraction that is pervasive in the data science (engineering) universe and responsible for much of the bad code, it would be state mismanagement.

State Mismanagement

DataFrames are one of the most ubiquitous and useful data structures, no matter what stack you are using, and they are also a good litmus test for whether a codebase is likely to be bad.

If most of the functions in your codebase accept dataframes as input and return dataframes as output, you are likely not modeling the entities correctly and are writing hacky code to patch things over.

What is state mismanagement?

If you are coming from a traditional programming background, having done webapps or backends or any of the myriad things we do when we code, you’ll be used to creating some kind of abstraction to model the entities involved in your code. A customer, a purchase, a geographic location — any of these could be an entity of interest given the problem we have.

With OOP, you may have created classes to represent and model the state/behaviour of those entities. With FP, you may have focused more on the state and the transitions those entities go through.
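As a minimal sketch of what we mean (the Customer entity and its fields here are made up for illustration), the abstraction can be as small as a dataclass with explicit state transitions:

```python
from dataclasses import dataclass, replace
from datetime import date


# A hypothetical entity, modeled explicitly instead of living as
# anonymous columns inside a dataframe.
@dataclass(frozen=True)
class Customer:
    customer_id: str
    signup_date: date
    is_active: bool

    def deactivate(self) -> "Customer":
        # FP-flavoured transition: return a new state instead of mutating this one.
        return replace(self, is_active=False)
```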

By state mismanagement, we refer to the situation when we don’t create these abstractions.

Why is state management important?

These abstractions are very useful. We can write tests against them. We can use them to debug our code. We can persist them and inspect them later when something goes wrong. These abstractions also make code readable: we can follow the states and transitions of entities to understand what is happening. Where the language allows it, we can also use the types of these abstractions to reason about and debug our code.
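For example, a test against the hypothetical Customer sketched above takes a few lines of plain pytest, with no fixture dataframes to assemble:

```python
from datetime import date


def test_deactivate_is_a_clean_transition():
    # Customer is the hypothetical dataclass from the sketch above.
    customer = Customer(customer_id="c-42", signup_date=date(2020, 11, 28), is_active=True)
    deactivated = customer.deactivate()

    assert deactivated.is_active is False
    assert deactivated.customer_id == customer.customer_id  # identity is preserved
```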

State mismanagement with DataFrames

DataFrames list all observations of an experiment as rows and attributes of those observations as columns. They are similar to the tables in databases we are used to, but most dataframes in a data science project will typically combine many tables (each table typically represents an entity) into a single object (which will represent/hide states of many entities together).

When we don’t explicitly model different entities, our ability to reason about and debug/inspect them is diminished significantly. We can’t write tests against functions that do things with those entities and their states. Most of our functions will accept dataframes as input and return dataframes as output, and will typically be in violation of the SOLID principles.
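Here is a minimal sketch of the contrast (the column names, entities, and the 1000 threshold are all hypothetical):

```python
from dataclasses import dataclass
from typing import List

import pandas as pd


# Dataframe-in, dataframe-out: the entities are implicit in column names,
# the signature tells us nothing, and a meaningful unit test is hard to write.
def enrich(df: pd.DataFrame) -> pd.DataFrame:
    df["is_high_value"] = df["total_purchase_amount"] > 1000
    return df


# The same logic against an explicit entity: small, readable, and easy to test.
@dataclass(frozen=True)
class Purchase:
    customer_id: str
    amount: float


def is_high_value(purchases: List[Purchase], threshold: float = 1000.0) -> bool:
    return sum(p.amount for p in purchases) > threshold
```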

There are, of course, situations where dataframes are the correct solution. But someone not used to thinking about hidden entities and their states may not use them optimally.

The habit of including the states of many entities together in a single dataframe is the most prevalent reason for bad code that we have seen.

When we talk about the poor standard of software engineering in data science, we talk a lot about peripheral issues, but incorrect and leaky abstractions are the main reason why most of the codebases we have seen suck. If you don’t even model your data adequately, how do you become one with the data?

Models doing well in training and poorly in production, whether due to drift or underspecification, is a real problem. Without the necessary abstractions in place, it becomes very difficult to monitor, detect, and debug issues when they arise.

If you like what you have read so far, you may consider subscribing to our newsletter.

Poorly abstracted codebases

When we say poor abstractions result in poor code, lack of SOLID-ness is mostly what we mean. A lot of the hacky software engineering practices we have seen in the wild arise as a consequence.

Consequences

  • Deploying models: The pattern for consuming data in production is different from training. Abstractions created to explore/test/validate data during training can mostly be reused in deployment, and their absence hurts greatly.
  • Writing tests: It is hard to write meaningful tests if models (data structures) for the entities in the data are not available. This is why we are not seeing more tests in ML even though everyone is talking about them.
  • Monitoring models: So your credit default model is starting to perform poorly. Is this across the board, or only for certain segments of certain entities (geocode, user-acquisition-channel cohort, or month)? It is hard to answer this quickly, in real time, while your deployed models are falling apart (and losing money), if you have not spent the time and effort to model the data and set up appropriate tests/monitoring.
  • Reproducibility: Reproducibility can come in many forms. You may be unable to reproduce older models during monthly updates (data and hyperparameter versioning is another issue entirely), or the same deep learning-based model may produce different embeddings on a CPU-based deployment server when moved from a GPU-based training server. Without correct abstractions to debug with, you’ll be left wondering what is going wrong.
  • Commenting and reading code: Poorly written code with a lot of dataframes needs a lot of documentation, and anyone who has worked in a live codebase knows that good comment coverage is a moving goalpost (moving as the code changes).

(Note: If you are working with a Python stack, type annotation is your friend. If you have to use a lot of Any or DataFrame types, you are probably doing it wrong. In a correctly architected codebase, type annotations can replace most forms of commenting and still make reading the codebase a breeze.)
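To illustrate with hypothetical signatures (building on the Customer and Purchase sketches above), compare how much each one tells the reader before a single comment is written:

```python
from typing import Any, List

import pandas as pd


# Says almost nothing: which columns go in, what comes out, what does config hold?
def score(df: pd.DataFrame, config: Any) -> pd.DataFrame:
    ...


# Names the entities involved and what the function produces.
def default_probability(customer: Customer, purchases: List[Purchase]) -> float:
    ...
```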

How to do better

As @seanjtaylor said, there are no hard problems, only slow iterations.

If you have an existing codebase that you think suffers from the issues we outline, start by refactoring small chunks of it (larger chunks mean slower iterations, and you’ll learn more slowly). If you are starting a new codebase, identify and model your entities, and become one with your data before you start modeling.

And if you already kaggle, don’t stop. Just recognize that the things in Kaggle that give you dopamine hits should not be the same things that give you dopamine hits when you work on a production system. Hack that dopamine!

Acknowledgments

We are heavily indebted to our colleagues at Difference Engine for the thoughtful conversations and prompts. Particularly, this essay would not have been possible without the insightful observations of Anshul Khandelwal. Thanks to Afzal Sayed for reading an early draft of this post and providing valuable feedback.

If you liked this essay, we hope that you’ll spread the word around. You can sign up for the dataBiryani newsletter at https://databiryani.com/.
