    LangSmith vs. Langfuse vs. Arize Compared

    By gvfx00@gmail.com | June 3, 2026 | 11 min read


    Your AI agent works great in testing. Then you ship it, and something breaks. A tool call loops forever. A retrieval step returns garbage and costs spike. And you have no idea why.

    That’s the agent observability problem. If you’re building with LLMs, you need to solve it before production, not after. This post breaks down three of the most-used observability tools: LangSmith, Langfuse, and Arize. We’ll set each one up, trace the same agent, and compare what you actually get.

    What is Agent Observability?

    Traditional application monitoring tracks requests, errors, and latency, but that is not enough for agents.

    An agent may call multiple tools in sequence, and each LLM step has its own prompt, token usage, latency, and potential failure point. A single failed retrieval or tool call can lead to an incorrect final response.

    Agent observability captures the full execution graph: every step, decision, LLM input and output, tool call, arguments, results, token usage, latency, and evaluation score. Without this visibility, debugging agent behavior becomes guesswork.
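
    To make the idea of an execution graph concrete, here is a minimal sketch of a trace tree in plain Python. It is not how any of the three tools stores traces; the Span class, its field names, and the example numbers are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One step in an agent run: the run itself, an LLM call, or a tool call."""

    name: str
    kind: str  # "agent", "llm", or "tool" (labels chosen for this sketch)
    latency_ms: float = 0.0
    tokens: int = 0
    error: Optional[str] = None
    children: List["Span"] = field(default_factory=list)

    def walk(self):
        """Yield this span and every descendant, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()


def summarize(root: Span) -> dict:
    """Roll the tree up into the kind of numbers a trace view displays."""
    spans = list(root.walk())
    return {
        "steps": len(spans),
        "total_tokens": sum(s.tokens for s in spans),
        "tool_calls": [s.name for s in spans if s.kind == "tool"],
        "failed_steps": [s.name for s in spans if s.error],
    }


# Hypothetical run: one LLM planning step, one failing tool call, one answer.
run = Span(
    name="support_agent",
    kind="agent",
    latency_ms=1840.0,
    children=[
        Span("plan", "llm", latency_ms=620.0, tokens=410),
        Span("get_order_status", "tool", latency_ms=35.0, error="timeout"),
        Span("answer", "llm", latency_ms=1100.0, tokens=290),
    ],
)

summary = summarize(run)
print(summary)
```

    With a structure like this, the failed tool call is visible as its own node instead of being hidden behind a plausible-looking final answer.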

    Setting Up the Test Agent

    We will use a very simple LangChain agent to compare the three tools. The agent receives a question from the user, retrieves relevant context, and calls one or more tools to produce an answer.

    First, install the required libraries. Going by the imports in the code below, the base agent needs langchain, langchain-openai, and python-dotenv; each demo lists its own extra packages in its docstring.

    Let’s look at the base agent with two tools (search_docs and get_order_status). It is the shared baseline for comparing the three observability tools. Save it as agent_base.py, since the demos import from that module.

    """
    Base agent used across all three observability demos.
    
    Swap the OPENAI_API_KEY env var or call build_agent() from any demo file.
    """
    
    import os
    
    from dotenv import load_dotenv
    from langchain.agents import AgentExecutor, create_openai_tools_agent
    from langchain.tools import tool
    from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
    from langchain_openai import ChatOpenAI
    
    load_dotenv()
    
    
    @tool
    def search_docs(query: str) -> str:
        """Search internal docs for relevant information."""
        # Simulated retrieval — swap with your actual vector store
        docs = {
            "refund": (
                "Refunds are processed within 5-7 business days. "
                "Items must be returned within 30 days."
            ),
            "shipping": (
                "Standard shipping takes 3-5 business days. "
                "Express is 1-2 days."
            ),
            "account": (
                "You can reset your password via the login page. "
                "Contact support for account issues."
            ),
        }
    
        for keyword, content in docs.items():
            if keyword in query.lower():
                return content
    
        return f"Found general docs related to: {query}"
    
    
    @tool
    def get_order_status(order_id: str) -> str:
        """Look up the status of an order by ID."""
        # Simulated order lookup
        statuses = {
            "ORD-001": "Shipped — expected delivery 2026-05-30",
            "ORD-002": "Processing — not yet shipped",
            "ORD-003": "Delivered on 2026-05-25",
        }
    
        return statuses.get(
            order_id,
            f"Order {order_id} not found in the system.",
        )
    
    
    def build_agent() -> AgentExecutor:
        llm = ChatOpenAI(
            model="gpt-4o",
            temperature=0,
            api_key=os.environ["OPENAI_API_KEY"],
        )
    
        tools = [search_docs, get_order_status]
    
        prompt = ChatPromptTemplate.from_messages(
            [
                (
                    "system",
                    "You are a helpful customer support assistant. "
                    "Use tools when needed.",
                ),
                ("user", "{input}"),
                MessagesPlaceholder(variable_name="agent_scratchpad"),
            ]
        )
    
        agent = create_openai_tools_agent(llm, tools, prompt)
    
        return AgentExecutor(
            agent=agent,
            tools=tools,
            verbose=False,
        )
    
    
    TEST_QUESTIONS = [
        "What are the refund policies?",
        "What is the status of order ORD-002?",
        "How long does shipping take?",
    ]
    
    
    if __name__ == "__main__":
        executor = build_agent()
    
        for question in TEST_QUESTIONS:
            print(f"\nQ: {question}")
    
            result = executor.invoke({"input": question})
    
            print(f"A: {result['output']}")

    This gives us one agent that can be reused with each observability tool. The first one we will explore is LangSmith.

    LangSmith: Native LangChain Tracing

    LangSmith is built by the LangChain team. If you are already using LangChain, integration is quick and easy.

    """
    LangSmith observability demo.
    
    Setup:
    
    pip install langsmith
    
    Set LANGCHAIN_API_KEY in your .env file.
    
    How it works:
    
    LangSmith hooks into LangChain's callback system via env vars, so no code
    changes are needed beyond the two os.environ lines below.
    """
    
    import os
    
    from dotenv import load_dotenv
    
    from agent_base import TEST_QUESTIONS, build_agent
    
    load_dotenv()
    
    # Enable LangSmith tracing. These two vars are all you need.
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGCHAIN_PROJECT"] = "agent-observability-demo"
    
    # LANGCHAIN_API_KEY must be set in your .env or environment.
    
    
    def run_with_metadata(
        executor,
        question: str,
        user_id: str = "demo-user",
    ):
        """Run the agent and attach per-run metadata via config."""
        return executor.invoke(
            {"input": question},
            config={
                "metadata": {
                    "user_id": user_id,
                    "source": "langsmith_demo",
                },
                # Optional: tag runs for filtering in the dashboard.
                "tags": ["observability-blog", "demo"],
            },
        )
    
    
    def main():
        print("=== LangSmith Demo ===")
        print("Traces will appear at: https://smith.langchain.com")
        print(f"Project: {os.environ['LANGCHAIN_PROJECT']}\n")
    
        executor = build_agent()
    
        for question in TEST_QUESTIONS:
            print(f"Q: {question}")
    
            result = run_with_metadata(executor, question)
    
            print(f"A: {result['output']}\n")
    
        print("Done. Open LangSmith to inspect the full trace tree for each run.")
    
    
    if __name__ == "__main__":
        main()

    LangSmith hooks into LangChain’s callback system automatically, so you don’t need decorators or wrappers; each run simply appears in your project dashboard.

    What you’ll see on the dashboard: 

    LangSmith’s trace view shows the full agent execution tree, from the initial call to tool use, LLM responses, and final output. Each node includes inputs, outputs, and latency.

    You can tag runs, add metadata, filter by outcome, save runs as datasets, and run evaluations. This is useful when improving prompts or retrieval logic.
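
    The dataset-and-evaluation loop can be sketched without any SDK. The example below is plain Python, not the LangSmith API: DATASET, contains_expected, and fake_agent are hypothetical names, and fake_agent stands in for executor.invoke(...)["output"] so the sketch runs offline.

```python
from typing import Callable, Dict, List

# Hypothetical dataset: each input is paired with a phrase the answer should contain.
DATASET: List[Dict[str, str]] = [
    {"input": "What are the refund policies?", "expected": "5-7 business days"},
    {"input": "How long does shipping take?", "expected": "3-5 business days"},
    {"input": "What is the status of order ORD-002?", "expected": "Processing"},
]


def contains_expected(output: str, expected: str) -> float:
    """Score 1.0 if the expected phrase appears in the output, else 0.0."""
    return 1.0 if expected.lower() in output.lower() else 0.0


def evaluate(answer_fn: Callable[[str], str], dataset: List[Dict[str, str]]) -> dict:
    """Run answer_fn over every example and report per-example and mean scores."""
    scores = [
        contains_expected(answer_fn(example["input"]), example["expected"])
        for example in dataset
    ]
    return {"scores": scores, "mean": sum(scores) / len(scores)}


def fake_agent(question: str) -> str:
    """Offline stand-in for the real agent; it deliberately fails on orders."""
    if "refund" in question.lower():
        return "Refunds are processed within 5-7 business days."
    if "shipping" in question.lower():
        return "Standard shipping takes 3-5 business days."
    return "I could not find that order."


report = evaluate(fake_agent, DATASET)
print(report)
```

    Rerunning the same dataset after every prompt or retrieval change shows whether the change helped or quietly broke a case that used to pass.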

    The prompt playground is another strong feature. You can open any trace, edit the prompt inline, and rerun it to debug poor LLM performance.

    LangSmith’s limitations appear at scale. The free tier has caps, and integration takes more effort if you are not using LangChain, though OpenTelemetry is supported.

    Langfuse: Open Source and Framework-Agnostic

    Langfuse is the open-source alternative here. You can either host it on your own server or use their cloud service. It also integrates with a wide range of frameworks and SDKs, including LangChain, LlamaIndex, and the raw OpenAI API.

    # Read this Doc-string for installing the dependencies and their setup 
    """
    Langfuse observability demo.
    
    Setup:
    
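    pip install langfuse
    
    Note: the imports and calls below (langfuse.callback.CallbackHandler,
    lf.score, handler.get_trace_id) appear to match the v2 Python SDK. Newer SDK
    releases are understood to have changed these names, so pin the version or
    check the current Langfuse docs if an import fails.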
    
    Set LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY in your .env file.
    
    LANGFUSE_HOST defaults to https://cloud.langfuse.com; override for self-hosted.
    
    Key differences from LangSmith:
    
    - Callback handler is passed per-invoke for more explicit control.
    - Native session grouping for multi-turn conversations.
    - You can score any trace after the fact via the Langfuse client.
    """
    
    import os
    
    from dotenv import load_dotenv
    from langfuse import Langfuse
    from langfuse.callback import CallbackHandler
    
    from agent_base import TEST_QUESTIONS, build_agent
    
    load_dotenv()
    
    
    def build_handler(
        session_id: str,
        user_id: str = "demo-user",
    ) -> CallbackHandler:
        return CallbackHandler(
            public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
            secret_key=os.environ["LANGFUSE_SECRET_KEY"],
            host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
            session_id=session_id,
            user_id=user_id,
            metadata={"source": "langfuse_demo"},
            tags=["observability-blog", "demo"],
        )
    
    
    def score_trace(
        trace_id: str,
        score: float,
        comment: str = "",
    ):
        """Add a correctness score to a trace after reviewing the output."""
        lf = Langfuse(
            public_key=os.environ["LANGFUSE_PUBLIC_KEY"],
            secret_key=os.environ["LANGFUSE_SECRET_KEY"],
            host=os.getenv("LANGFUSE_HOST", "https://cloud.langfuse.com"),
        )
    
        lf.score(
            trace_id=trace_id,
            name="correctness",
            value=score,
            comment=comment,
        )
    
        lf.flush()
    
        print(f"Scored trace {trace_id}: {score}")
    
    
    def run_single_session(
        executor,
        session_id: str,
    ):
        """Run all test questions in a single session so they're linked in the UI."""
        handler = build_handler(session_id=session_id)
        trace_ids = []
    
        for question in TEST_QUESTIONS:
            print(f"Q: {question}")
    
            result = executor.invoke(
                {"input": question},
                config={"callbacks": [handler]},
            )
    
            print(f"A: {result['output']}\n")
    
            # handler.get_trace_id() returns the trace ID for the last run.
            trace_ids.append(handler.get_trace_id())
    
        # Flush ensures traces are sent before the process exits.
        # This is critical in batch jobs.
        handler.flush()
    
        return trace_ids
    
    
    def main():
        print("=== Langfuse Demo ===")
        print(f"Dashboard: {os.getenv('LANGFUSE_HOST', 'https://cloud.langfuse.com')}\n")
    
        executor = build_agent()
        session_id = "demo-session-001"
    
        trace_ids = run_single_session(executor, session_id)
    
        # Example: programmatically score the first trace.
        if trace_ids and trace_ids[0]:
            print("\nScoring first trace as an example:")
            score_trace(trace_ids[0], score=0.9, comment="Answer was accurate")
    
        print(f"\nDone. Find all runs under session '{session_id}' in your Langfuse dashboard.")
    
    
    if __name__ == "__main__":
        main()

    You pass the callback handler on every run, which is a bit more explicit than LangSmith but gives you more flexibility: you can assign user IDs, session IDs, and custom metadata on the handler you pass to each invocation.

    Evaluation Workflow

    Langfuse also has a good evaluation workflow: you can add scores after a trace has completed.

    from langfuse import Langfuse
    
    lf = Langfuse()
    
    # Score a specific trace by ID.
    lf.score(
        trace_id="trace-abc123",
        name="correctness",
        value=0.9,
        comment="Answer was accurate but slightly verbose",
    )

    This works alongside human review: your team scores individual responses, and the scores roll up into aggregated evaluation metrics over time.
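
    As a rough illustration of what aggregated evaluation metrics means, the sketch below averages score records by name and flags low-scoring traces for review. It is plain Python, not the Langfuse client; the records are hypothetical and only mirror the trace_id, name, and value arguments passed to lf.score above, and the 0.6 threshold is an arbitrary assumption.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score records, shaped like the arguments passed to lf.score.
scores = [
    {"trace_id": "trace-abc123", "name": "correctness", "value": 0.9},
    {"trace_id": "trace-def456", "name": "correctness", "value": 0.5},
    {"trace_id": "trace-abc123", "name": "helpfulness", "value": 1.0},
]


def aggregate(records):
    """Return the mean value per score name, rounded to three decimals."""
    by_name = defaultdict(list)
    for record in records:
        by_name[record["name"]].append(record["value"])
    return {name: round(mean(values), 3) for name, values in by_name.items()}


def needs_review(records, threshold=0.6):
    """List trace IDs that received any score below the threshold."""
    return sorted({r["trace_id"] for r in records if r["value"] < threshold})


averages = aggregate(scores)
flagged = needs_review(scores)
print(averages, flagged)
```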

    Langfuse also groups traces into sessions. Every trace that shares a session ID is linked in the UI, so you can follow an entire multi-turn conversation in one place.
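
    Session grouping is easy to picture as a group-by on the session ID. The sketch below shows the idea in plain Python; the record fields (trace_id, session_id, turn, input) are assumptions for illustration and not the Langfuse data model.

```python
from collections import defaultdict

# Hypothetical trace records, deliberately listed out of order.
traces = [
    {"trace_id": "t3", "session_id": "demo-session-001", "turn": 2,
     "input": "How long does shipping take?"},
    {"trace_id": "t1", "session_id": "demo-session-001", "turn": 1,
     "input": "What are the refund policies?"},
    {"trace_id": "t2", "session_id": "demo-session-002", "turn": 1,
     "input": "What is the status of order ORD-002?"},
]


def group_by_session(records):
    """Return {session_id: [trace_id, ...]} with each session in turn order."""
    sessions = defaultdict(list)
    for record in sorted(records, key=lambda r: r["turn"]):
        sessions[record["session_id"]].append(record["trace_id"])
    return dict(sessions)


conversations = group_by_session(traces)
print(conversations)
```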

    Arize: Production-Grade ML Observability

    Initially developed as a platform for monitoring conventional machine learning models, Arize can now observe both language models and agents. Its original focus on helping teams run models in production at scale still shows.

    Utilizing OpenInference

    Arize uses the OpenInference standard for its trace semantics and OpenTelemetry for instrumentation. Setup takes more boilerplate than the other two tools.

    # Read this Doc-string for installing the dependencies and their setup 
    """
    Arize observability demo.
    
    Setup:
    
    pip install arize-otel openinference-instrumentation-langchain
    
    Set ARIZE_SPACE_ID and ARIZE_API_KEY in your .env file.
    
    Key differences from the others:
    
    - Uses OpenTelemetry under the hood, so it integrates with existing OTel stacks.
    - Instrumentation is global like LangSmith, not per-invoke like Langfuse.
    - Best-in-class production monitoring: drift detection, cohort analysis, alerting.
    - Phoenix (the arize-phoenix package) is the free local sibling for development use.
    """
    
    import os
    
    from arize.otel import register
    from dotenv import load_dotenv
    from openinference.instrumentation.langchain import LangChainInstrumentor
    
    from agent_base import TEST_QUESTIONS, build_agent
    
    load_dotenv()
    
    
    def setup_arize_tracing():
        """Register Arize as the OTel tracer provider and instrument LangChain globally."""
        tracer_provider = register(
            space_id=os.environ["ARIZE_SPACE_ID"],
            api_key=os.environ["ARIZE_API_KEY"],
            project_name="agent-observability-demo",
        )
    
        LangChainInstrumentor().instrument(tracer_provider=tracer_provider)
    
        return tracer_provider
    
    
    def run_with_attributes(
        executor,
        question: str,
        user_segment: str = "standard",
    ):
        """Run the agent and attach span attributes for cohort analysis in Arize."""
        from opentelemetry import trace
    
        tracer = trace.get_tracer(__name__)
    
        with tracer.start_as_current_span("agent_run") as span:
            span.set_attribute("user.segment", user_segment)
            span.set_attribute("query.text", question)
            span.set_attribute("demo.source", "arize_demo")
    
            result = executor.invoke({"input": question})
    
            span.set_attribute("response.text", result["output"])
    
            return result
    
    
    def main():
        print("=== Arize Demo ===")
        print("Traces will appear at: https://app.arize.com")
        print("Project: agent-observability-demo\n")
    
        setup_arize_tracing()
    
        executor = build_agent()
    
        # Simulate two user segments to demonstrate cohort analysis in Arize.
        segments = ["premium", "standard", "standard"]
    
        for question, segment in zip(TEST_QUESTIONS, segments):
            print(f"Q: {question} [segment={segment}]")
    
            result = run_with_attributes(
                executor,
                question,
                user_segment=segment,
            )
    
            print(f"A: {result['output']}\n")
    
        print("Done. In Arize, use the cohort filter to compare premium vs standard responses.")
        print("Set up monitors on the Arize dashboard to alert on response quality drift.")
    
    
    if __name__ == "__main__":
        main()

    The instrumentation is global, like LangSmith’s, but it plugs into OpenTelemetry’s broader tracing framework. That means Arize traces can sit alongside your organization’s existing OpenTelemetry-based observability stack (e.g., Jaeger or Grafana), regardless of which agent framework you use.
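
    The demo above tags each run with a user.segment span attribute. To show what a cohort comparison over that attribute could look like, here is a plain-Python sketch. It does not call Arize or OpenTelemetry; the exported span records, their latency_ms and error fields, and the numbers are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical exported spans; "user.segment" mirrors the attribute set in the demo.
spans = [
    {"user.segment": "premium", "latency_ms": 900.0, "error": False},
    {"user.segment": "standard", "latency_ms": 1500.0, "error": False},
    {"user.segment": "standard", "latency_ms": 2100.0, "error": True},
]


def cohort_report(records, key="user.segment"):
    """Group spans by an attribute and compute per-cohort run count, latency, errors."""
    cohorts = defaultdict(list)
    for record in records:
        cohorts[record[key]].append(record)
    return {
        name: {
            "runs": len(rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
            "error_rate": sum(r["error"] for r in rows) / len(rows),
        }
        for name, rows in cohorts.items()
    }


report = cohort_report(spans)
print(report)
```

    A monitor is then just a rule over numbers like these, for example alerting when one cohort's error rate moves away from its usual level.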

    Which Should You Pick for Agent Observability?

    To be honest, there is no single right tool for every use case; it depends on where you are in the development cycle and what your team needs.

    | Feature | LangSmith | Langfuse | Arize |
    | --- | --- | --- | --- |
    | Setup complexity | Minimal (2 env vars) | Low (callback handler) | Most boilerplate |
    | Framework support | LangChain-native; others via OTel | Any framework | Any framework via OTel |
    | Self-hosting | Limited | First-class (Docker Compose) | Phoenix only (local dev) |
    | Trace visualization | Excellent tree view | Good, session-linked | Good, OTel-standard |
    | Evaluation / scoring | Dataset + playground | Session-level human scores | Rubric-based evals |
    | Production monitoring | Basic | Basic | Drift, alerting, cohorts |
    | Multi-turn / sessions | Thread-level | Native session grouping | Trace-level only |
    | Open source | Proprietary | Fully open source | Phoenix is OSS; platform isn’t |
    | Free tier | Limited traces/month | Generous (self-host = unlimited) | Limited |
    | Best for | LangChain dev & iteration | Data ownership + any framework | Production-scale monitoring |

    • Use LangSmith if you are building with LangChain and want the fastest setup for prompt debugging and iteration.
    • Use Langfuse if you need self-hosting, stronger data ownership, multi-framework support, or session-level tracking for conversational agents.
    • Use Arize when your agent is moving into production and you need monitoring, drift detection, cohorts, and alerts.

    Conclusion

    Agent observability is one of those things you only regret skipping after something goes wrong in production. Tracing an agent run after the fact, without any instrumentation, is like debugging a distributed system with print statements.

    All three tools covered here are production-ready. They each have a free path in. And they each take under 30 minutes to integrate with a LangChain agent. There’s no good reason to ship an unobservable agent anymore.

    Pick the tool that fits your current stage. Add scoring early, even informally. And when your agent starts doing something weird at 2am, you’ll be glad you did. 


    Riya Bansal

    Data Science Trainee at Analytics Vidhya
    I am currently working as a Data Science Trainee at Analytics Vidhya, where I focus on building data-driven solutions and applying AI/ML techniques to solve real-world business problems. My work allows me to explore advanced analytics, machine learning, and AI applications that empower organizations to make smarter, evidence-based decisions.
    With a strong foundation in computer science, software development, and data analytics, I am passionate about leveraging AI to create impactful, scalable solutions that bridge the gap between technology and business.