
    Guardrails for LLMs: Measuring AI ‘Hallucination’ and Verbosity

    By gvfx00@gmail.com | May 12, 2026 | 6 Mins Read



     

    Table of Contents

    • Introduction
    • Setting a Complexity Budget with Textstat
    • Implementing the LangChain Pipeline
    • Wrapping Up
    # Introduction

     
    Large language models (LLMs) have a taste for “flowery”, often overly verbose language in their responses. Ask a simple question, and chances are you will be flooded with paragraphs of overly detailed, enthusiastic, and complex prose. This behavior is rooted in their training: they are optimized to be as helpful and conversational as possible.

    Unfortunately, verbosity is more than a stylistic nuisance: it arguably correlates with increased odds of a more serious issue, hallucinations. The more words a response contains, the higher the chance of drifting from grounded knowledge into outright fabrication.

    In short, robust guardrails are needed to address this two-sided problem, starting with verbosity checks. This article shows how to use the Textstat Python library to measure readability and detect overly complex responses before they reach the end user, forcing the model to refine its answer.

     

    # Setting a Complexity Budget with Textstat

     
    The Textstat Python library computes scores such as the automated readability index (ARI), which estimates the school grade level needed to understand a piece of text, such as a model response. If this complexity metric exceeds a budget or threshold, say 10.0, roughly a 10th-grade reading level, a re-prompting loop can be triggered automatically to request a more concise, simpler response. This strategy not only curbs flowery language but may also reduce hallucination risk, since the model is pushed to stick to core facts.
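Under the hood, the ARI is a simple formula: 4.71 × (characters/words) + 0.5 × (words/sentences) - 21.43. Here is a minimal hand-rolled sketch of that formula (Textstat's implementation differs in tokenization details, so scores will not match it exactly):

```python
import re

def ari(text: str) -> float:
    """Automated Readability Index:
    4.71 * (chars/words) + 0.5 * (words/sentences) - 21.43."""
    words = re.findall(r"[A-Za-z0-9']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not words or not sentences:
        return 0.0
    chars = sum(len(w) for w in words)
    return 4.71 * (chars / len(words)) + 0.5 * (len(words) / len(sentences)) - 21.43

simple = "The cat sat on the mat. It was warm."
complex_text = (
    "The multifaceted ramifications of computational linguistics "
    "necessitate extraordinarily comprehensive interdisciplinary collaboration."
)
print(ari(simple))        # well below a 10.0 budget
print(ari(complex_text))  # well above a 10.0 budget
```

Short, plain sentences score low; long words in long sentences push the grade level up fast, which is exactly what the complexity budget exploits.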

     

    # Implementing the LangChain Pipeline

     
    Let’s implement the strategy described above as a LangChain pipeline that can easily be run in a Google Colab notebook. You will need a Hugging Face API token, obtainable for free at https://huggingface.co/settings/tokens. In Colab’s left-hand menu, click the “Secrets” icon (it looks like a key) and create a new secret named HF_TOKEN. Paste the generated API token into the “Value” field, and you are all set.

    To start, install the necessary libraries:

    !pip install textstat langchain_huggingface langchain_community

     

    The following code is Google Colab-specific; adjust it accordingly if you are working in a different environment. It retrieves the stored API token:

    from google.colab import userdata
    
    # Obtain Hugging Face API token saved in your Colab session's Secrets
    HF_TOKEN = userdata.get('HF_TOKEN')
    
    # Verify token recovery
    if not HF_TOKEN:
        print("WARNING: The token 'HF_TOKEN' wasn't found. This may cause errors.")
    else:
        print("Hugging Face Token loaded successfully.")
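Outside Colab, a common equivalent is to read the token from an environment variable. The helper below is a hypothetical sketch of that approach (`get_hf_token` is not part of any library):

```python
import os

def get_hf_token(env_var: str = "HF_TOKEN"):
    # Hypothetical helper: read the token from an environment variable
    # instead of Colab's userdata store. Returns None if it is not set.
    token = os.environ.get(env_var)
    if not token:
        print(f"WARNING: environment variable '{env_var}' is not set.")
    return token

HF_TOKEN = get_hf_token()
```

You can set the variable beforehand with `export HF_TOKEN=...` in your shell, keeping the token out of your source code.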

     

    The next block of code does two things. First, it sets up components for local text generation with a pre-trained Hugging Face model — specifically distilgpt2. Then, the model is wrapped for use in a LangChain pipeline.

    import textstat
    import torch
    from langchain_core.prompts import PromptTemplate
    # Importing necessary classes for local Hugging Face pipelines
    from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
    from langchain_huggingface import HuggingFacePipeline
    
    # Initializing a free-tier, local-friendly LLM for text generation
    model_id = "distilgpt2"
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    
    # Creating a text-generation pipeline
    pipe = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=100,
        device=0 if torch.cuda.is_available() else -1  # GPU if available, else CPU
    )
    
    # Wrapping the pipeline in HuggingFacePipeline
    llm = HuggingFacePipeline(pipeline=pipe)

     

    Our core mechanism for measuring and managing verbosity comes next. The following function summarizes the text passed to it (assumed to be an LLM’s response) and tries to ensure the summary does not exceed a complexity threshold. Note that, with an appropriate prompt template, generation models like distilgpt2 can produce summaries, although their quality may not match that of heavier, summarization-focused models. We chose this model because it runs reliably in a constrained local environment.

    def safe_summarize(text_input, complexity_budget=10.0):
        print("\n--- Starting Summary Process ---")
        print(f"Input text length: {len(text_input)} characters")
        print(f"Target complexity budget (ARI score): {complexity_budget}")
    
        # Step 1: Initial Summary Generation
        print("Generating initial comprehensive summary...")
        base_prompt = PromptTemplate.from_template(
            "Provide a comprehensive summary of the following: {text}"
        )
        chain = base_prompt | llm
        summary = chain.invoke({"text": text_input})
        print("Initial Summary generated:")
        print("-------------------------")
        print(summary)
        print("-------------------------")
    
        # Step 2: Measure Readability
        ari_score = textstat.automated_readability_index(summary)
        print(f"Initial ARI Score: {ari_score:.2f}")
    
        # Step 3: Enforce Complexity Budget
        if ari_score > complexity_budget:
            print("Budget exceeded! Initial summary is too complex.")
            print("Triggering simplification guardrail...")
            simplification_prompt = PromptTemplate.from_template(
                "The following text is too verbose. Rewrite it concisely "
                "using simple vocabulary, stripping away flowery language:\n\n{text}"
            )
            simplify_chain = simplification_prompt | llm
            simplified_summary = simplify_chain.invoke({"text": summary})
    
            new_ari = textstat.automated_readability_index(simplified_summary)
            print("Simplified Summary generated:")
            print("-------------------------")
            print(simplified_summary)
            print("-------------------------")
            print(f"Revised ARI Score: {new_ari:.2f}")
            summary = simplified_summary
        else:
            print("Initial summary is within complexity budget. No simplification needed.")
    
        print("--- Summary Process Finished ---")
        return summary

     

    Notice also in the code above that the ARI score is recomputed after simplification, so you can verify whether the guardrail actually reduced the text’s complexity.

    The final part of the code example tests the function defined previously, passing sample text and a complexity budget of 10.0, and printing the final results.

    # 1. Providing some highly verbose, complex sample text
    sample_text = """
    The inextricably intertwined permutations of cognitive computational arrays within the 
    realm of Large Language Models often precipitate a cascade of unnecessarily labyrinthine 
    lexical structures. This propensity for circumlocution, whilst seemingly indicative of 
    profound erudition, frequently obfuscates the foundational semantic payload, thereby 
    rendering the generated discourse significantly less accessible to the quintessential layperson.
    """
    
    # 2. Calling the function
    print("Running summarizer pipeline...\n")
    final_output = safe_summarize(sample_text, complexity_budget=10.0)
    
    # 3. Printing the final result
    print("\n--- Final Guardrailed Summary ---")
    print(final_output)

     

    The resulting printed messages may be quite lengthy, but you will see a modest decrease in the ARI score after calling the pre-trained model for simplification. Do not expect miraculous results, though: the chosen model, while lightweight, is not great at summarizing text, so the reduction is small. You can try other models such as google/flan-t5-small to see how they perform; be aware that it is a sequence-to-sequence model, so you would use the "text2text-generation" pipeline task instead of "text-generation".
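The guardrail above retries only once. A natural generalization is to re-prompt up to a fixed number of times until the budget is met. The sketch below is hypothetical: `generate` stands in for any LLM call and `score` for any readability metric (such as Textstat's ARI), so it can be wired into the LangChain chains shown earlier:

```python
def enforce_budget(generate, score, text, budget=10.0, max_retries=3):
    # generate(prompt) -> str stands in for any LLM call;
    # score(text) -> float stands in for any readability metric.
    response = generate(f"Provide a comprehensive summary of the following: {text}")
    for _ in range(max_retries):
        if score(response) <= budget:
            break  # within budget: stop re-prompting
        response = generate(
            "The following text is too verbose. Rewrite it concisely "
            f"using simple vocabulary:\n\n{response}"
        )
    return response

# Toy demo with stubs: each re-prompt yields a progressively simpler draft.
responses = iter(["very complex draft", "simpler draft", "plain draft"])
scores = {"very complex draft": 18.0, "simpler draft": 12.0, "plain draft": 8.0}
result = enforce_budget(lambda prompt: next(responses), scores.get, "some input text")
print(result)  # plain draft
```

Capping the retries matters: a weak model may never reach the budget, and an unbounded loop would burn tokens indefinitely.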

     

    # Wrapping Up

     
    This article showed how to measure and control overly verbose LLM responses by calling an auxiliary model to summarize them and checking their complexity before approval. In many scenarios, hallucinations are a byproduct of high verbosity. While the implementation here focuses on assessing verbosity, dedicated checks also exist for measuring hallucinations — such as semantic consistency checks, natural language inference (NLI) cross-encoders, and LLM-as-a-judge approaches.
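To give a flavor of what a semantic consistency check might look like, here is a deliberately crude sketch: sample several answers to the same question and flag low mutual word overlap as a hallucination warning sign. Production systems would use NLI models or embeddings rather than raw word overlap; everything below is an illustrative assumption, not an established API:

```python
def jaccard(a: str, b: str) -> float:
    # Word-level Jaccard similarity between two strings.
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

def consistency_score(samples):
    # Average pairwise overlap across sampled answers; a low value suggests
    # the model gives different answers each time, i.e. it may be fabricating.
    pairs = [(i, j) for i in range(len(samples)) for j in range(i + 1, len(samples))]
    return sum(jaccard(samples[i], samples[j]) for i, j in pairs) / len(pairs)

consistent = ["paris is the capital of france"] * 3
inconsistent = [
    "paris is the capital",
    "the moon is made of cheese",
    "blue whales fly south",
]
print(consistency_score(consistent))    # 1.0
print(consistency_score(inconsistent))  # much lower
```

The same budget-and-retry pattern used for verbosity applies here: if the consistency score falls below a threshold, re-prompt or refuse rather than return the answer.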
     
     

    Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning & LLMs. He trains and guides others in harnessing AI in the real world.
