    Google Stax: Testing Models and Prompts Against Your Own Criteria

By gvfx00@gmail.com | March 10, 2026 | 9 min read



    Image by Author

     

    Table of Contents

    • # Introduction
    • # Understanding Google Stax
        • // Exploring Key Capabilities
    • # Moving Beyond Standard Benchmarks
    • # Getting Started With Stax
        • // Step 1: Adding An API Key
        • // Step 2: Creating An Evaluation Project
        • // Step 3: Building Your Dataset
    • # Evaluating AI Outputs
        • // Conducting Manual Evaluation
        • // Performing Automated Evaluation With Autoraters
        • // Leveraging Custom Evaluators
    • # Exploring Practical Use Cases
        • // Reviewing Use Case 1: Customer Support Chatbot
        • // Reviewing Use Case 2: Content Summarization Tool
    • # Interpreting Results
    • # Implementing Best Practices For Effective Evaluations
    • # Answering Frequently Asked Questions
    • # Conclusion

    # Introduction

     
If you’re building applications with large language models (LLMs), you’ve probably experienced this scenario: you change a prompt, run it a few times, and the output feels better. But is it actually better? Without objective metrics, you are stuck in what the industry now calls “vibe testing”: making decisions based on intuition rather than data.

    The challenge comes from a fundamental characteristic of AI models: uncertainty. Unlike traditional software, where the same input always produces the same output, LLMs can generate different responses to similar prompts. This makes conventional unit testing ineffective and leaves developers guessing whether their changes truly improved performance.

Enter Google Stax, an experimental toolkit from Google DeepMind and Google Labs designed to bring rigor to AI evaluation. In this article, we look at how Stax lets developers and data scientists test models and prompts against their own custom criteria, replacing subjective judgment with repeatable, data-driven decisions.

     

    # Understanding Google Stax

     
    Stax is a developer tool that simplifies the evaluation of generative AI models and applications. Think of it as a testing framework specifically built for the unique challenges of working with LLMs.

    At its core, Stax solves a simple but critical problem: how do you know if one model or prompt is better than another for your specific use case? Rather than relying on general criteria that may not reflect your application’s needs, Stax lets you define what “good” means for your project and measure against those standards.

     

    // Exploring Key Capabilities

    • Define your own success criteria beyond generic metrics like fluency and safety
    • Test different prompts across multiple models side by side
    • Make data-driven decisions by visualizing performance metrics, including quality, latency, and token usage
    • Run assessments at scale using your own datasets

    Stax is flexible, supporting not only Google’s Gemini models but also OpenAI’s GPT, Anthropic’s Claude, Mistral, and others through API integrations.

     

    # Moving Beyond Standard Benchmarks

     
General AI benchmarks serve an important purpose, such as tracking model progress at a high level. However, they often fail to reflect domain-specific requirements. A model that excels at open-domain reasoning might perform poorly on specialized tasks like:

    • Compliance-focused summarization
    • Legal document analysis
    • Enterprise-specific Q&A
    • Brand-voice adherence

    The gap between general benchmarks and real-world applications is where Stax provides value. It enables you to evaluate AI systems based on your data and your criteria, not abstract global scores.

     

    # Getting Started With Stax

     

    // Step 1: Adding An API Key

    To generate model outputs and run evaluations, you’ll need to add an API key. Stax recommends starting with a Gemini API key, as the built-in evaluators use it by default, though you can configure them to use other models. You can add your first key during onboarding or later in Settings.

    For comparing multiple providers, add keys for each model you want to test; this enables parallel comparison without switching tools.

     


    Getting an API key

     

    // Step 2: Creating An Evaluation Project

    Projects are the central workspace in Stax. Each project corresponds to a single evaluation experiment, for example, testing a new system prompt or comparing two models.

    You’ll choose between two project types:
     

    | Project Type | Best For |
    | --- | --- |
    | Single Model | Baselining performance or testing an iteration of a model or system prompt |
    | Side-by-Side | Directly comparing two different models or prompts head-to-head on the same dataset |

     


    Figure 1: A side-by-side comparison flowchart showing two models receiving the same input prompts and their outputs flowing into an evaluator that produces comparison metrics

     

    // Step 3: Building Your Dataset

    A solid evaluation starts with data that is accurate and reflects your real-world use cases. Stax offers two primary methods to achieve this:

     
    Option A: Adding Data Manually in the Prompt Playground

    If you don’t have an existing dataset, build one from scratch:

    • Select the model(s) you want to test
    • Set a system prompt (optional) to define the AI’s role
    • Add user prompts that represent real user inputs
    • Provide human ratings (optional) to create baseline quality scores

    Each input, output, and rating automatically saves as a test case.

     
    Option B: Uploading an Existing Dataset

    For teams with production data, upload CSV files directly. If your dataset doesn’t include model outputs, click “Generate Outputs” and select a model to generate them.

    Best practice: Include edge cases and conflicting examples in your dataset to ensure comprehensive testing.
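As a sketch of what such a dataset file might look like, the snippet below builds a small CSV of test cases in Python. The column names and rows are illustrative assumptions, not a documented Stax schema.

```python
import csv
import io

# Illustrative test cases: a user prompt, a model output, and an
# optional human rating, matching the workflow described above.
rows = [
    {"user_prompt": "How do I reset my password?",
     "model_output": "Go to Settings > Security and click Reset.",
     "human_rating": 4},
    {"user_prompt": "Cancel my subscription.",
     "model_output": "Visit Billing and choose Cancel plan.",
     "human_rating": 5},
]

def to_csv(rows):
    """Serialize test cases to CSV text for upload."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=["user_prompt", "model_output", "human_rating"])
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(rows).splitlines()[0])  # header row
```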

     

    # Evaluating AI Outputs

     

    // Conducting Manual Evaluation

    You can provide human ratings on individual outputs directly in the playground or on the project benchmark. While human evaluation is considered the “gold standard,” it’s slow, expensive, and difficult to scale.

     

    // Performing Automated Evaluation With Autoraters

    To score many outputs at once, Stax uses LLM-as-judge evaluation, where a powerful AI model assesses another model’s outputs based on your criteria.

    Stax includes preloaded evaluators for common metrics:

    • Fluency
    • Factual consistency
    • Safety
    • Instruction following
    • Conciseness
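The LLM-as-judge idea can be sketched in a few lines. Here `call_judge_model` is a hypothetical placeholder for whatever model client you use (Gemini, OpenAI, and so on); this is not a Stax API, just the shape of the technique.

```python
import re

# Judge prompt template: states the criterion and asks for a
# machine-parseable score line.
JUDGE_PROMPT = """You are a strict evaluator.
Criterion: {criterion}
Output to assess:
{output}
Reply with a line "SCORE: <1-5>" and a one-sentence rationale."""

def score_output(output, criterion, call_judge_model):
    """Ask a judge model to score an output; return the 1-5 score or None."""
    reply = call_judge_model(
        JUDGE_PROMPT.format(criterion=criterion, output=output))
    match = re.search(r"SCORE:\s*([1-5])", reply)
    return int(match.group(1)) if match else None

# Stubbed judge, for demonstration only.
fake_judge = lambda prompt: "SCORE: 4\nThe answer is fluent and on-topic."
print(score_output("Paris is the capital of France.", "fluency", fake_judge))  # 4
```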

     


    The Stax evaluation interface showing a column of model outputs with adjacent score columns from various evaluators, plus a “Run Evaluation” button

     

    // Leveraging Custom Evaluators

    While preloaded evaluators provide an excellent starting point, building custom evaluators is the best way to measure what matters for your specific use case.

    Custom evaluators let you define specific criteria like:

    • “Is the response helpful but not overly familiar?”
    • “Does the output contain any personally identifiable information (PII)?”
    • “Does the generated code follow our internal style guide?”
    • “Is the brand voice consistent with our guidelines?”

    To build a custom evaluator: define clear criteria, write a prompt for the judge model that includes a scoring checklist, and test it against a small sample of manually rated outputs to confirm alignment.
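That alignment step can be made concrete: score a handful of outputs with your autorater, then measure how often it agrees with your human ratings. A minimal sketch, with made-up ratings:

```python
# Human ratings vs. autorater scores for the same five outputs
# (invented numbers, for illustration only).
human = [5, 3, 4, 2, 5]
judge = [5, 2, 4, 4, 5]

def agreement_rate(human, judge, tolerance=1):
    """Fraction of items where the judge is within `tolerance` of the human."""
    hits = sum(abs(h - j) <= tolerance for h, j in zip(human, judge))
    return hits / len(human)

print(agreement_rate(human, judge))  # 0.8, i.e. 4 of 5 within one point
```

If the rate is low, refine the judge prompt or checklist before trusting it at scale.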

     

    # Exploring Practical Use Cases

     

    // Reviewing Use Case 1: Customer Support Chatbot

    Imagine that you are building a customer support chatbot. Your requirements might include the following:

    • Professional tone
    • Accurate answers based on your knowledge base
    • No hallucinations
    • Resolution of common issues within three exchanges

    With Stax, you would:

    • Upload a dataset of real customer queries
    • Generate responses from different models (or different prompt versions)
    • Create a custom evaluator that scores for professionalism and accuracy
    • Compare results side-by-side to select the best performer
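The side-by-side comparison in the last step might be tallied like this: for each query, compare the evaluator scores of the two prompt versions and compute a win rate over the decided rows. All scores here are invented for illustration.

```python
# Per-query evaluator scores for prompt versions A and B (made up).
rows = [
    {"query": "Where is my order?", "score_a": 4, "score_b": 5},
    {"query": "Refund policy?",     "score_a": 5, "score_b": 3},
    {"query": "Reset password",     "score_a": 3, "score_b": 5},
]

def win_rate(rows):
    """B's share of wins among rows where the scores differ."""
    wins_b = sum(r["score_b"] > r["score_a"] for r in rows)
    ties = sum(r["score_b"] == r["score_a"] for r in rows)
    decided = len(rows) - ties
    return wins_b / decided if decided else 0.0

print(win_rate(rows))  # B wins 2 of 3 decided rows
```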

     

    // Reviewing Use Case 2: Content Summarization Tool

    For a news summarization application, you care about:

    • Conciseness (summaries under 100 words)
    • Factual consistency with the original article
    • Preservation of key information

    Using Stax’s pre-built Summarization Quality evaluator gives you immediate metrics, while custom evaluators can enforce specific length constraints or brand voice requirements.
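A length constraint like “summaries under 100 words” is deterministic, so it can be checked without a judge model at all. A minimal sketch:

```python
def under_word_limit(summary, limit=100):
    """Pass/fail check: is the summary under `limit` words?"""
    return len(summary.split()) < limit

short = "The council approved the new transit budget on Tuesday."
too_long = "word " * 120  # 120 words, over the limit

print(under_word_limit(short))     # True
print(under_word_limit(too_long))  # False
```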

     


    Figure 2: A visual of the Stax Flywheel showing three stages: Experiment (test prompts/models), Evaluate (run evaluators), and Analyze (review metrics and decide)

     

    # Interpreting Results

     
    Once evaluations are complete, Stax adds new columns to your dataset showing scores and rationales for every output. The Project Metrics section provides an aggregated view of:

    • Human ratings
    • Average evaluator scores
    • Inference latency
    • Token counts

    Use this quantitative data to:

    • Compare iterations: Does Prompt A consistently outperform Prompt B?
    • Choose between models: Is the faster model worth the slight drop in quality?
    • Track progress: Are your optimizations actually improving performance?
    • Identify failures: Which inputs consistently produce poor outputs?
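The kind of aggregation such a metrics view performs can be approximated offline: given per-output scores and latencies (made up here), compute per-model means and compare.

```python
from statistics import mean

# Per-output evaluation results for two models (invented numbers).
results = [
    {"model": "A", "score": 4, "latency_ms": 310},
    {"model": "A", "score": 5, "latency_ms": 290},
    {"model": "B", "score": 5, "latency_ms": 540},
    {"model": "B", "score": 5, "latency_ms": 560},
]

def summarize(results):
    """Mean quality score and mean latency per model."""
    summary = {}
    for model in {r["model"] for r in results}:
        rows = [r for r in results if r["model"] == model]
        summary[model] = {
            "mean_score": mean(r["score"] for r in rows),
            "mean_latency_ms": mean(r["latency_ms"] for r in rows),
        }
    return summary

print(summarize(results))
```

In this toy example, model B scores higher but is noticeably slower, exactly the trade-off the questions above ask you to weigh.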

     


    Figure 3: A dashboard view showing bar charts comparing two models across multiple metrics (quality score, latency, cost)

     

    # Implementing Best Practices For Effective Evaluations

     

    1. Start Small, Then Scale: You don’t need hundreds of test cases to get value. An evaluation set with just ten high-quality prompts is far more valuable than relying on vibe testing alone. Start with a focused set and expand as you learn.
    2. Create Regression Tests: Your evaluations should include tests that protect existing quality. For example, “always output valid JSON” or “never include competitor names.” These prevent new changes from breaking what already works.
    3. Build Challenge Sets: Create datasets targeting areas where you want your AI to improve. If your model struggles with complex reasoning, build a challenge set specifically for that capability.
    4. Don’t Abandon Human Review: While automated evaluation scales well, having your team use your AI product remains crucial for building intuition. Use Stax to capture compelling examples from human testing and incorporate them into your formal evaluation datasets.
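Regression tests like the examples in point 2 (“always output valid JSON”, “never include competitor names”) can be written as plain functions run over every new batch of outputs. The competitor names below are hypothetical.

```python
import json

COMPETITORS = {"acmecorp", "globex"}  # hypothetical names, for illustration

def is_valid_json(text):
    """Regression check: the output must parse as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def mentions_competitor(text):
    """Regression check: the output must not name a competitor."""
    lowered = text.lower()
    return any(name in lowered for name in COMPETITORS)

print(is_valid_json('{"status": "ok"}'))      # True
print(is_valid_json('{"status": ok}'))        # False
print(mentions_competitor("Try Globex Pro"))  # True
```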

     

    # Answering Frequently Asked Questions

     

    1. What is Google Stax? Stax is a developer tool from Google for evaluating LLM-powered applications. It helps you test models and prompts against your own criteria rather than relying on general benchmarks.
    2. How does Stax AI work? Stax uses an “LLM-as-judge” approach where you define evaluation criteria, and an AI model scores outputs based on those criteria. You can use pre-built evaluators or create custom ones.
    3. Which tool from Google lets individuals build their own machine learning models? While Stax focuses on evaluation rather than model creation, it works alongside other Google AI tools. For building and training models, you’d typically use TensorFlow or Vertex AI. Stax then helps you evaluate those models’ performance.
    4. What is Google’s equivalent of ChatGPT? Google’s primary conversational AI is Gemini (formerly Bard). Stax can help you test and optimize prompts for Gemini and compare its performance against other models.
    5. Can I train AI on my own data? Stax doesn’t train models; it evaluates them. However, you can use your own data as test cases to evaluate pre-trained models. For training custom models on your data, you’d use tools like Vertex AI.

     

    # Conclusion

     
The era of vibe testing is ending. As AI moves from experimental demos to production systems, rigorous evaluation becomes essential. Google Stax provides the framework to define what “good” means for your unique use case and the tools to measure it systematically.

    By replacing subjective judgments with repeatable, data-driven evaluations, Stax helps you:

    • Ship AI features with confidence
    • Make informed decisions about model selection
    • Iterate faster on prompts and system instructions
    • Build AI products that reliably meet user needs

    Whether you’re a beginner data scientist or an experienced ML engineer, adopting structured evaluation practices will transform how you build with AI. Start small, define what matters for your application, and let data guide your decisions.

    Ready to move beyond vibe testing? Visit stax.withgoogle.com to explore the tool and join the community of developers building better AI applications.

     


    Shittu Olumide is a software engineer and technical writer passionate about leveraging cutting-edge technologies to craft compelling narratives, with a keen eye for detail and a knack for simplifying complex concepts. You can also find Shittu on Twitter.


