
    Building a Simple Data Quality DSL in Python

December 1, 2025

    Image by Author


Table of Contents

• # Introduction
• # Why Build a DSL?
• # Creating a Sample Dataset
• # Writing the Validation Logic
    • // Creating the Rule Class
    • // Adding Multiple Rules
    • // Building Readable Conditions
• # Building a Data Validator for the Sample Dataset
• # Adding Cross-Column Validations
• # Handling Date Validation
• # Writing Decorator Integration Patterns
• # Extending the Pattern
• # Conclusion

    # Introduction

Data validation code in Python is often a pain to maintain. Business rules get buried in nested if statements, validation logic mixes with error handling, and adding new checks means sifting through procedural functions to find the right place to insert code. Yes, there are data validation frameworks you can use, but here we'll focus on building something super simple yet useful in plain Python.

    Let’s write a simple Domain-Specific Language (DSL) of sorts by creating a vocabulary specifically for data validation. Instead of writing generic Python code, you build specialized functions and classes that express validation rules in terms that match how you think about the problem.

    For data validation, this means rules that read like business requirements: “customer ages must be between 18 and 120” or “email addresses must contain an @ symbol and should have a valid domain.” You’d like the DSL to handle the mechanics of checking data and reporting violations, while you focus on expressing what valid data looks like. The result is validation logic that’s readable, easy to maintain and test, and simple to extend. So, let’s start coding!

    🔗 Link to the code on GitHub

# Why Build a DSL?

    Consider validating customer data with Python:

    def validate_customers(df):
        errors = []
        if df['customer_id'].duplicated().any():
            errors.append("Duplicate IDs")
        if (df['age'] < 0).any():
            errors.append("Negative ages")
        if not df['email'].str.contains('@').all():
            errors.append("Invalid emails")
        return errors


    This approach hardcodes validation logic, mixes business rules with error handling, and becomes unmaintainable as rules multiply. Instead, we’re looking to write a DSL that separates concerns and creates reusable validation components.

    Instead of writing procedural validation functions, a DSL lets you express rules that read like business requirements:

    # Traditional approach
    if df['age'].min() < 0 or df['age'].max() > 120:
        raise ValueError("Invalid ages found")
    
    # DSL approach  
    validator.add_rule(Rule("Valid ages", between('age', 0, 120), "Ages must be 0-120"))


    The DSL approach separates what you’re validating (business rules) from how violations are handled (error reporting). This makes validation logic testable, reusable, and readable by non-programmers.

# Creating a Sample Dataset

Start by spinning up a realistic sample of e-commerce customer data containing common quality issues:

    import pandas as pd
    
    customers = pd.DataFrame({
        'customer_id': [101, 102, 103, 103, 105],
        'email': ['john@gmail.com', 'invalid-email', '', 'sarah@yahoo.com', 'mike@domain.co'],
        'age': [25, -5, 35, 200, 28],
        'total_spent': [250.50, 1200.00, 0.00, -50.00, 899.99],
        'join_date': ['2023-01-15', '2023-13-45', '2023-02-20', '2023-02-20', '']
    }) # Note: 2023-13-45 is an intentionally malformed date.


    This dataset has duplicate customer IDs, invalid email formats, impossible ages, negative spending amounts, and malformed dates. That should work pretty well for testing validation rules.

# Writing the Validation Logic

    // Creating the Rule Class

    Let’s start by writing a simple Rule class that wraps validation logic:

    class Rule:
        def __init__(self, name, condition, error_msg):
            self.name = name
            self.condition = condition
            self.error_msg = error_msg
        
        def check(self, df):
            # The condition function returns True for VALID rows.
            # We use ~ (bitwise NOT) to select the rows that VIOLATE the condition.
            violations = df[~self.condition(df)]
            if not violations.empty:
                return {
                    'rule': self.name,
                    'message': self.error_msg,
                    'violations': len(violations),
                    'sample_rows': violations.head(3).index.tolist()
                }
            return None


    The condition parameter accepts any function that takes a DataFrame and returns a boolean Series indicating valid rows. The tilde operator (~) inverts this Boolean Series to identify violations. When violations exist, the check method returns detailed information including the rule name, error message, violation count, and sample row indices for debugging.

    This design separates validation logic from error reporting. The condition function focuses purely on the business rule while the Rule class handles error details consistently.
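
Before wiring up multiple rules, it's worth sanity-checking a single Rule directly against the sample data. A minimal check, assuming the customers DataFrame from above is in scope:

no_negative_ages = Rule(
    "Non-negative ages",
    lambda df: df['age'] >= 0,
    "Ages cannot be negative"
)

print(no_negative_ages.check(customers))
# {'rule': 'Non-negative ages', 'message': 'Ages cannot be negative',
#  'violations': 1, 'sample_rows': [1]}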


    // Adding Multiple Rules

    Next, let’s code up a DataValidator class that manages collections of rules:

    class DataValidator:
        def __init__(self):
            self.rules = []
        
        def add_rule(self, rule):
            self.rules.append(rule)
            return self # Enables method chaining
        
        def validate(self, df):
            results = []
            for rule in self.rules:
                violation = rule.check(df)
                if violation:
                    results.append(violation)
            return results


    The add_rule method returns self to enable method chaining. The validate method executes all rules independently and collects violation reports. This approach ensures one failing rule doesn’t prevent others from running.
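
Because add_rule returns self, rules can be chained fluently. A throwaway sketch (demo_validator is just an illustration, reusing the customers DataFrame):

demo_validator = (
    DataValidator()
    .add_rule(Rule("Non-negative ages", lambda df: df['age'] >= 0, "Ages cannot be negative"))
    .add_rule(Rule("Non-empty emails", lambda df: df['email'] != '', "Emails must not be empty"))
)

print(len(demo_validator.validate(customers)))  # 2 — both rules find violations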


    // Building Readable Conditions

    Recall that when instantiating an object of the Rule class, we also need a condition function. This can be any function that takes in a DataFrame and returns a Boolean Series. While simple lambda functions work, they aren’t very easy to read. So let’s write helper functions to create a readable validation vocabulary:

    def not_null(column):
        return lambda df: df[column].notna()
    
    def unique_values(column):
        return lambda df: ~df.duplicated(subset=[column], keep=False)
    
    def between(column, min_val, max_val):
        return lambda df: df[column].between(min_val, max_val)


    Each helper function returns a lambda that works with pandas Boolean operations.

    • The not_null helper uses pandas’ notna() method to identify non-null values.
    • The unique_values helper uses duplicated(..., keep=False) with a subset parameter to flag all duplicate occurrences, ensuring a more accurate violation count.
    • The between helper uses the pandas between() method which handles range checks automatically.
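
Each helper is a factory that returns a Series-producing function, so you can inspect what a condition yields by calling it directly on the sample DataFrame:

age_check = between('age', 0, 120)
print(age_check(customers))
# 0     True
# 1    False   (-5 is out of range)
# 2     True
# 3    False   (200 is out of range)
# 4     True
# Name: age, dtype: bool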

    For pattern matching, regular expressions become straightforward:

def matches_pattern(column, pattern):
    # pandas' str.match accepts the regex string directly; no separate re import needed
    return lambda df: df[column].str.match(pattern, na=False)


    The na=False parameter ensures missing values are treated as validation failures rather than matches, which is typically the desired behavior for required fields.
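
To see why that matters, consider a tiny hypothetical frame with a genuinely missing value (None becomes NaN in pandas). Without na=False, str.match would return NaN for that row instead of a clean failure:

test = pd.DataFrame({'email': ['a@b.co', None]})
print(matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$')(test))
# 0     True
# 1    False   (missing value counts as a validation failure)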

# Building a Data Validator for the Sample Dataset

    Let’s now build a validator for the customer dataset to see how this DSL works:

    validator = DataValidator()
    
    validator.add_rule(Rule(
       "Unique customer IDs", 
       unique_values('customer_id'),
       "Customer IDs must be unique across all records"
    ))
    
    validator.add_rule(Rule(
       "Valid email format",
       matches_pattern('email', r'^[^@\s]+@[^@\s]+\.[^@\s]+$'),
       "Email addresses must contain @ symbol and domain"
    ))
    
    validator.add_rule(Rule(
       "Reasonable customer age",
       between('age', 13, 120),
       "Customer age must be between 13 and 120 years"
    ))
    
    validator.add_rule(Rule(
       "Non-negative spending",
       lambda df: df['total_spent'] >= 0,
       "Total spending amount cannot be negative"
    ))


    Each rule follows the same pattern: a descriptive name, a validation condition, and an error message.

    • The first rule uses the unique_values helper function to check for duplicate customer IDs.
    • The second rule applies regular expression pattern matching to validate email formats. The pattern requires at least one character before and after the @ symbol, plus a domain extension.
    • The third rule uses the between helper for range validation, setting reasonable age limits for customers.
    • The final rule uses a lambda function for an inline condition checking that total_spent values are non-negative.

    Notice how each rule reads almost like a business requirement. The validator collects these rules and can execute them all against any DataFrame with matching column names:

issues = validator.validate(customers)

print("Validation Results:")
for issue in issues:
    print(f"❌ Rule: {issue['rule']}")
    print(f"   Problem: {issue['message']}")
    print(f"   Violations: {issue['violations']}")
    print(f"   Affected rows: {issue['sample_rows']}")
    print()


    The output clearly identifies specific problems and their locations in the dataset, making debugging straightforward. For the sample data, you’ll get the following output:

    Validation Results:
    ❌ Rule: Unique customer IDs
       Problem: Customer IDs must be unique across all records
       Violations: 2
       Affected rows: [2, 3]
    
❌ Rule: Valid email format
   Problem: Email addresses must contain @ symbol and domain
   Violations: 2
   Affected rows: [1, 2]
    
    ❌ Rule: Reasonable customer age
       Problem: Customer age must be between 13 and 120 years
       Violations: 2
       Affected rows: [1, 3]
    
    ❌ Rule: Non-negative spending
       Problem: Total spending amount cannot be negative
       Violations: 1
       Affected rows: [3]

# Adding Cross-Column Validations

Real business rules often involve relationships between columns. Custom condition functions handle this kind of cross-column logic:

    def high_spender_email_required(df):
        high_spenders = df['total_spent'] > 500
        has_valid_email = df['email'].str.contains('@', na=False)
        # Passes if: (Not a high spender) OR (Has a valid email)
        return ~high_spenders | has_valid_email
    
    validator.add_rule(Rule(
        "High Spenders Need Valid Email",
        high_spender_email_required,
        "Customers spending over $500 must have valid email addresses"
    ))


    This rule uses Boolean logic where high-spending customers must have valid emails, but low spenders can have missing contact information. The expression ~high_spenders | has_valid_email translates to “not a high spender OR has valid email,” which allows low spenders to pass validation regardless of email status.
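
If this implication pattern recurs, you could fold it into a small generic helper. The requires name below is introduced purely for illustration — it's not part of the DSL built above:

def requires(antecedent, consequent):
    # Logical implication: rows matching the antecedent must also match the consequent
    return lambda df: ~antecedent(df) | consequent(df)

# Equivalent to the hand-written rule above
high_spender_email_required = requires(
    lambda df: df['total_spent'] > 500,
    lambda df: df['email'].str.contains('@', na=False),
)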

# Handling Date Validation

    Date validation requires careful handling since date parsing can fail:

    def valid_date_format(column, date_format="%Y-%m-%d"):
        def check_dates(df):
            # pd.to_datetime with errors="coerce" turns invalid dates into NaT (Not a Time)
            parsed_dates = pd.to_datetime(df[column], format=date_format, errors="coerce")
            # A row is valid if the original value is not null AND the parsed date is not NaT
            return df[column].notna() & parsed_dates.notna()
        return check_dates
    
    validator.add_rule(Rule(
        "Valid Join Dates",
        valid_date_format('join_date'),
        "Join dates must follow YYYY-MM-DD format"
    ))


The validation passes only when the original value is non-null and the parsed date is valid (i.e., not NaT). No try/except block is needed here: errors="coerce" handles malformed strings gracefully by converting them to NaT, which parsed_dates.notna() then flags as invalid.
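
You can watch the coercion at work on the sample join_date column — both the impossible date and the empty string come back as NaT:

print(pd.to_datetime(customers['join_date'], format="%Y-%m-%d", errors="coerce"))
# 0   2023-01-15
# 1          NaT   (month 13, day 45 can't be parsed)
# 2   2023-02-20
# 3   2023-02-20
# 4          NaT   (empty string)
# Name: join_date, dtype: datetime64[ns]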

# Writing Decorator Integration Patterns

    For production pipelines, you can write decorator patterns that provide clean integration:

from functools import wraps

def validate_dataframe(validator):
    def decorator(func):
        @wraps(func)  # Preserve the wrapped function's name and docstring
        def wrapper(df, *args, **kwargs):
            issues = validator.validate(df)
            if issues:
                error_details = [f"{issue['rule']}: {issue['violations']} violations" for issue in issues]
                raise ValueError(f"Data validation failed: {'; '.join(error_details)}")
            return func(df, *args, **kwargs)
        return wrapper
    return decorator

# Reuse the validator instance built earlier for the customer dataset
@validate_dataframe(validator)
def process_customer_data(df):
    return df.groupby('age').agg({'total_spent': 'sum'})


This decorator ensures data passes validation before processing begins, preventing corrupted data from propagating through the pipeline. Because the decorator takes a validator as an argument, any validator instance in scope can guard any processing function, and the errors it raises name the specific rules that failed.
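
Since the sample customers DataFrame fails several rules, calling the decorated function raises immediately — a quick way to see the guard in action:

try:
    process_customer_data(customers)
except ValueError as e:
    print(e)
# Data validation failed: Unique customer IDs: 2 violations; Valid email format: 2 violations; ...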

# Extending the Pattern

    You can extend the DSL to include other validation rules as needed:

    # Statistical outlier detection
    def within_standard_deviations(column, std_devs=3):
        # Valid if absolute difference from mean is within N standard deviations
        return lambda df: abs(df[column] - df[column].mean()) <= std_devs * df[column].std()
    
    # Referential integrity across datasets
    def foreign_key_exists(column, reference_df, reference_column):
        # Valid if value in column is present in the reference_column of the reference_df
        return lambda df: df[column].isin(reference_df[reference_column])
    
    # Custom business logic
    def profit_margin_reasonable(df):
        # Ensures 0 <= margin <= 1
        margin = (df['revenue'] - df['cost']) / df['revenue']
        return (margin >= 0) & (margin <= 1)


Each extension is just another composable condition: a function that takes a DataFrame and returns a Boolean Series, so it plugs into Rule and DataValidator unchanged.
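
For instance, the referential-integrity helper slots in like any other condition. Here's a hypothetical orders table checked against the customers DataFrame we already have:

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': [101, 102, 999]   # 999 doesn't exist in customers
})

order_validator = DataValidator()
order_validator.add_rule(Rule(
    "Known customers",
    foreign_key_exists('customer_id', customers, 'customer_id'),
    "Orders must reference an existing customer"
))

print(order_validator.validate(orders))
# [{'rule': 'Known customers', 'message': 'Orders must reference an existing customer',
#   'violations': 1, 'sample_rows': [2]}]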

Here's a final end-to-end example of the DSL on fresh sample data, assuming the classes and helper functions live in a module called data_quality_dsl:

    import pandas as pd
    from data_quality_dsl import DataValidator, Rule, unique_values, between, matches_pattern
    
    # Sample data
    df = pd.DataFrame({
        'user_id': [1, 2, 2, 3],
        'email': ['user@test.com', 'invalid', 'user@real.com', ''],
        'age': [25, -5, 30, 150]
    })
    
    # Build validator
    validator = DataValidator()
    validator.add_rule(Rule("Unique users", unique_values('user_id'), "User IDs must be unique"))
    validator.add_rule(Rule("Valid emails", matches_pattern('email', r'^[^@]+@[^@]+\.[^@]+$'), "Invalid email format"))
    validator.add_rule(Rule("Reasonable ages", between('age', 0, 120), "Age must be 0-120"))
    
    # Run validation
    issues = validator.validate(df)
    for issue in issues:
        print(f"❌ {issue['rule']}: {issue['violations']} violations")
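
Each rule catches exactly two bad rows in this toy frame, so the loop prints:

❌ Unique users: 2 violations
❌ Valid emails: 2 violations
❌ Reasonable ages: 2 violations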

# Conclusion

    This DSL, although simple, works because it aligns with how data professionals think about validation. Rules express business logic in easy-to-understand requirements while allowing us to use pandas for both performance and flexibility.

    The separation of concerns makes validation logic testable and maintainable. This approach requires no external dependencies beyond pandas and introduces no learning curve for those already familiar with pandas operations.

    This is something I worked on over a couple of evening coding sprints and several cups of coffee (of course!). But you can use this version as a starting point and build something much cooler. Happy coding!

    Bala Priya C is a developer and technical writer from India. She likes working at the intersection of math, programming, data science, and content creation. Her areas of interest and expertise include DevOps, data science, and natural language processing. She enjoys reading, writing, coding, and coffee! Currently, she’s working on learning and sharing her knowledge with the developer community by authoring tutorials, how-to guides, opinion pieces, and more. Bala also creates engaging resource overviews and coding tutorials.


