All About Pyjanitor’s Method Chaining Functionality, And Why It’s Useful

# Introduction

Working intensively with data in Python teaches all of us an important lesson: data cleaning usually doesn’t feel much like performing data science, but rather like acting as a digital janitor. Here’s what it takes in most use cases: loading a dataset, discovering many column names are messy, coming across missing values, and ending up with plenty of temporary data variables, only the last of them containing your final, clean dataset.

Pyjanitor provides a cleaner approach to carry these steps out. This library can be used alongside the notion of method chaining to transform otherwise arduous data cleaning processes into pipelines that look elegant, efficient, and readable.

This article demystifies method chaining and shows how to apply it with Pyjanitor for data cleaning.

# Understanding Method Chaining

Method chaining is nothing new in programming; it is a well-established coding pattern. It consists of calling multiple methods in sequence on an object, all in one statement. You don’t need to reassign a variable after each step, because each method returns an object on which the next method in the chain is called, and so on.

The following example illustrates the concept at its core. Observe how we would apply several simple modifications to a small piece of text (a string) using “standard” Python:

text = "  Hello World!  "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")

The resulting value in text will be: "hello python!".

Now, with method chaining, the same process would look like:

text = "  Hello World!  "
cleaned_text = text.strip().lower().replace("world", "python")

Notice that the logical flow of operations applied goes from left to right: all in a single, unified chain of thought!
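
Chaining works here because every string method returns a new string, which in turn exposes the next method. A minimal sketch with a hypothetical `TextCleaner` class (the class and its methods are invented for illustration and belong to no library) makes the mechanism explicit: each method returns a fresh instance, so the calls can be strung together.

```python
class TextCleaner:
    """Hypothetical wrapper used only to illustrate method chaining."""

    def __init__(self, text):
        self.text = text

    def strip(self):
        # Return a new TextCleaner so another method can be called on the result
        return TextCleaner(self.text.strip())

    def lower(self):
        return TextCleaner(self.text.lower())

    def replace(self, old, new):
        return TextCleaner(self.text.replace(old, new))


cleaned = TextCleaner("  Hello World!  ").strip().lower().replace("world", "python")
print(cleaned.text)  # hello python!
```

If any of these methods returned `None` instead, the chain would break at the following call, which is why chain-friendly APIs always return an object.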

If that makes sense, you already understand the notion of method chaining. Let’s now translate the idea to data science with Pandas. A standard multi-step cleaning routine on a dataframe typically looks like this without chaining:

# Traditional, step-by-step Pandas approach
import pandas as pd

df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])
df = df.drop_duplicates()

As we will see shortly, method chaining lets us build a single pipeline in which the dataframe operations are wrapped in parentheses. We no longer need intermediate variables holding non-final dataframes, which makes for cleaner, more bug-resilient code. On top of that, Pyjanitor makes this process seamless.
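
Before turning to Pyjanitor, the step-by-step snippet above can be rewritten as a plain Pandas chain. This is a sketch under stated assumptions: an in-memory CSV string stands in for data.csv, its columns (`ID`, `Full Name`) are invented for illustration, and `rename` with a callable replaces the direct assignment to `df.columns`, since an assignment statement cannot sit inside a chain.

```python
import io

import pandas as pd

# Hypothetical stand-in for data.csv so the sketch runs on its own
csv_text = "ID,Full Name\n1,Alice\n1,Alice\n,Bob\n2,Carol\n"

df = (
    pd.read_csv(io.StringIO(csv_text))
    .rename(columns=lambda c: c.lower().replace(" ", "_"))  # tidy column names
    .dropna(subset=["id"])                                  # drop rows without an id
    .drop_duplicates()                                      # drop repeated rows
)

print(df)
```

The wrapping parentheses are what allow each step to go on its own line, with a comment next to it.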

# Entering Pyjanitor: Application Example

Pandas itself offers native support for method chaining to some extent. However, some of its essential functionality was not designed strictly with this pattern in mind. This is a core motivation behind Pyjanitor, which is based on a nearly-namesake R package: janitor.
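
One native hook Pandas does provide for keeping custom steps inside a chain is `DataFrame.pipe`, which passes the dataframe to any function that returns a dataframe. The sketch below uses a hypothetical helper, `standardize_columns`, and made-up column names purely for illustration.

```python
import pandas as pd


def standardize_columns(frame):
    """Hypothetical helper: lowercase column names and replace spaces with underscores."""
    return frame.rename(columns=lambda c: c.strip().lower().replace(" ", "_"))


raw = pd.DataFrame({"Customer ID": [1, 2, 2], "Order Total": [10.0, 20.0, 20.0]})

tidy = (
    raw
    .pipe(standardize_columns)  # custom function slotted into the chain
    .drop_duplicates()
)

print(tidy.columns.tolist())
```

This works, but every custom step needs its own helper function and a `.pipe()` call. Pyjanitor removes that friction by exposing such cleaning steps directly as dataframe methods.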

In essence, Pyjanitor can be framed as an extension for Pandas that brings a pack of custom data-cleaning processes in a method chaining-friendly fashion. Its application programming interface (API) employs a suite of intuitive method names, such as clean_names(), rename_column(), and remove_empty(), that take code expressiveness to a whole new level. Besides, Pyjanitor relies entirely on free, open-source tools and runs seamlessly in cloud and notebook environments, such as Google Colab.

Let’s fully understand how method chaining in Pyjanitor is applied, through an example in which we first create a small, synthetic dataset that looks intentionally messy, and put it into a Pandas DataFrame object.

IMPORTANT: to avoid common, yet somewhat dreadful, errors caused by incompatible library versions, make sure you have the latest available versions of both Pandas and Pyjanitor by running !pip install --upgrade pyjanitor pandas first (the leading ! is for notebook cells; omit it in a terminal).

import numpy as np
import pandas as pd
import janitor  # importing Pyjanitor registers its methods on Pandas DataFrames

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    '  Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")

Now we define a Pyjanitor method chain that applies a series of processing steps to both the column names and the data itself:

cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE getting them mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names=['age'],               # CAUTION: after previous steps, assume lowercase name: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)

print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)

The above code is self-explanatory, with inline comments explaining each method called at every step of the chain.

This is the output of our example, which compares the original messy data with the cleaned version:

--- Messy Original Data ---
  First Name    Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0       Alice       Smith  25.0    1998-01-01       50000        NaN
1         Bob       Jones   NaN    1995-05-05       60000        NaN
2     Charlie       Brown  30.0    1993-08-08       70000        NaN
3       Alice       Smith  25.0    1998-01-01       50000        NaN
4         NaN         Doe  40.0    1983-12-12       80000        NaN 

--- Cleaned Pyjanitor Data ---
  first_name_ _last_name   age date_of_birth  salary  salary_k
0       Alice      Smith  25.0    1998-01-01   50000      50.0
1         Bob      Jones  27.5    1995-05-05   60000      60.0
2     Charlie      Brown  30.0    1993-08-08   70000      70.0
4         NaN        Doe  40.0    1983-12-12   80000      80.0
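
One subtlety in step 5 is worth spelling out: the median is read from the raw `df`, which still contains the duplicated Alice row, so the imputed age is 27.5. If the intent is to impute from the deduplicated data instead, one option is to compute the median inside the chain with a lambda. The following is a Pandas-only sketch of that variation, not the pipeline above; the `people` frame reuses the example's names and ages with already-clean column names.

```python
import numpy as np
import pandas as pd

# Same names and ages as the example above, already using clean column names
people = pd.DataFrame({
    "first_name": ["Alice", "Bob", "Charlie", "Alice", None],
    "age": [25, np.nan, 30, 25, 40],
})

raw_median = people["age"].median()  # still counts the duplicate Alice row

imputed = (
    people
    .drop_duplicates()
    # The lambda receives the deduplicated frame, so the median reflects it
    .assign(age=lambda d: d["age"].fillna(d["age"].median()))
)

print(raw_median)               # 27.5
print(imputed["age"].tolist())  # [25.0, 30.0, 30.0, 40.0]
```

Which median is appropriate depends on the dataset. The point is that a lambda inside the chain sees the dataframe as it stands at that step, whereas a value computed from `df` outside the chain does not.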

# Wrapping Up

Throughout this article, we have learned how to use the Pyjanitor library to apply method chaining and simplify otherwise arduous data cleaning processes. This makes the code cleaner, more expressive, and, in a manner of speaking, self-documenting, so that other developers or your future self can read the pipeline and easily understand what is going on in this journey from raw to ready dataset.

Great job!

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
