Anonymizing Production Data for Data Science with Mimesis

# Introduction

Production data is typically subject to notable privacy and compliance constraints. For this reason, anonymizing such data becomes critical in virtually every real-world data science project involving the launch of a data-driven product, service, or solution.

Mimesis is an open-source Python library that stands out for generating realistic “fake” data quickly. It runs locally and is free to use. This article shows how to use it to anonymize sensitive production data, through a step-by-step example you can easily try in your IDE or a notebook environment.

# Step-by-Step Procedure

If you are new to Mimesis, you may need to install it in your Python environment first:

pip install mimesis

Remember to add ! at the beginning of the pip command if you are working in a Google Colab notebook or a similar environment.

Now we are ready to start! We will consider a scenario built around a software product’s tier-based subscription system. For simplicity, we will create a small toy dataset by hand, containing data about customers and their subscription type. Some of its columns hold highly sensitive data, as you can observe below:

import pandas as pd

# Creation of a mock "production" customer dataset
production_data = {
    'user_id': [101, 102, 103, 104],
    'real_name': ['Alice Smith', 'Bob Jones', 'Charlie Brown', 'Diana Prince'],
    'email': ['alice.smith@corp.com', 'bjones@startup.io', 'cbrown@domain.org', 'diana@amazon.com'],
    'phone': ['555-0100', '555-0101', '555-0102', '555-0103'],
    'subscription_tier': ['Premium', 'Basic', 'Basic', 'Enterprise']
}

df = pd.DataFrame(production_data)
print("--- Original Sensitive Data ---")
print(df.head())

While subscription tiers are not necessarily sensitive in our example, user names, emails, and phone numbers are. Mimesis organizes its generators into providers, each one covering a particular kind of data. Since our records describe people, we can import and use the Person class: a provider that, given a locale such as English and a random seed, generates fake substitutes for real, sensitive personal data:

from mimesis import Person
from mimesis.locales import Locale

# Initializing a Person provider for the English locale, with a fixed seed
person = Person(locale=Locale.EN, seed=42)

From this point onwards, anonymizing personally identifiable information (PII) is quite simple. All it takes is overwriting the sensitive columns we choose with freshly generated data from the Person provider. For each column, we build a list with one generated value per row of the DataFrame, calling the Mimesis method that matches the attribute in question:

# 1. Replacing real names with fake, realistic names
df['real_name'] = [person.full_name() for _ in range(len(df))]

# 2. Replacing real emails with fake ones
df['email'] = [person.email() for _ in range(len(df))]

# 3. Replacing real phone numbers
df['phone'] = [person.telephone() for _ in range(len(df))]

# 4. Renaming the column to reflect that it is no longer the real name
df.rename(columns={'real_name': 'anon_name'}, inplace=True)

Notice above how Mimesis’ Person class provides dedicated methods for generating full names, emails, and telephone numbers, among others. The name column is also renamed, to reflect that the values it now holds are anonymized rather than real.

We can now verify the results by printing the transformed DataFrame. The sensitive PII fields have been overwritten with realistic-looking synthetic data, while the dataset structure and the information needed for downstream analyses, such as subscription_tier, remain intact.

print("\n--- Anonymized Data for Data Science Analyses ---")
print(df.head())

Output:

--- Anonymized Data for Data Science Analyses ---
   user_id         anon_name                    email            phone  \
0      101    Anthony Reilly    archived1911@duck.com     +13312271333   
1      102           Kai Day    suspect2087@yahoo.com  +1-205-759-3586   
2      103  Cleveland Osborn     urgent1912@yahoo.com     +13691067988   
3      104       Zack Holder  johnson1881@example.com  +1-574-481-3676   

  subscription_tier  
0           Premium  
1             Basic  
2             Basic  
3        Enterprise
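Beyond inspecting the printout, the same check can be done programmatically. The sketch below is one possible approach, not part of Mimesis: count_unchanged is a hypothetical helper, and the two small frames reuse the first two rows of the original data and of the output shown above. It assumes you kept a copy of the original data to compare against:

```python
import pandas as pd


def count_unchanged(original, anonymized, column_pairs):
    """For each (before, after) column pair, count rows whose value did not change."""
    return {
        before: int((original[before].to_numpy() == anonymized[after].to_numpy()).sum())
        for before, after in column_pairs.items()
    }


# First two rows of the original toy dataset (PII columns plus the tier)
original = pd.DataFrame({
    "real_name": ["Alice Smith", "Bob Jones"],
    "email": ["alice.smith@corp.com", "bjones@startup.io"],
    "subscription_tier": ["Premium", "Basic"],
})

# First two rows of the anonymized output shown above
anonymized = pd.DataFrame({
    "anon_name": ["Anthony Reilly", "Kai Day"],
    "email": ["archived1911@duck.com", "suspect2087@yahoo.com"],
    "subscription_tier": ["Premium", "Basic"],
})

# Number of PII values that survived anonymization, per original column
leaks = count_unchanged(original, anonymized, {"real_name": "anon_name", "email": "email"})
print(leaks)  # {'real_name': 0, 'email': 0}

# Non-sensitive column should be identical before and after
tiers_intact = original["subscription_tier"].equals(anonymized["subscription_tier"])
print(tiers_intact)  # True
```

A count above zero for any PII column would mean that a generated value happened to equal the real one, or that the column was never replaced.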

Fantastic! We have just applied a few simple steps to anonymize several sensitive data fields typically found in real-world, production data science projects and analyses — all for free, thanks to Mimesis being open-source.

To finish, here are some best practices and observations about the anonymization process we just covered:

- We replaced the columns directly in the DataFrame. Depending on your context, consider whether this is the right approach, or whether you should store the new information in a separate DataFrame if there is a risk of losing the original data.
- Mimesis generates values in the format each field expects (a name looks like a name, an email like an email), so generated data matches the expected data types.
- Seeding keeps the generated information consistent across different runs, which facilitates reproducibility.

# Wrapping Up

In this article, we have shown how to use Mimesis, a powerful Python library for generating fake data, to transform a sensitive production dataset into a version whose names, emails, and phone numbers no longer belong to real people, so it can be used for further analysis without exposing that PII.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
