    A Hands-On Test of Google’s Newest AI

    February 20, 2026


    Just three months after the release of its state-of-the-art model Gemini 3 Pro, Google DeepMind is back with its latest iteration: Gemini 3.1 Pro.

    A radical upgrade in capabilities and safety, the Gemini 3.1 Pro model strives to be accessible and operable by everyone. Regardless of your preferences, platform, or purchasing power, the model has something to offer every user.

    In this article, I test the capabilities of Gemini 3.1 Pro and elaborate on its key features. From how to access Gemini 3.1 Pro to its benchmark results, everything about this new model is covered.

    Table of Contents

    • Gemini 3.1 Pro: What’s new?
    • Hands-On: Let’s have some fun
      • Task 1: Multi-Step Logical Reasoning
      • Task 2: Code Generation & Refactoring
      • Task 3: Long-Context Analytical Synthesis
    • How to access Gemini 3.1 Pro?
    • Benchmarks
    • Conclusion: Powerful and Accessible
    • Frequently Asked Questions

    Gemini 3.1 Pro: What’s new?

    Gemini 3.1 Pro is the latest member of the Gemini model family. As usual, the model comes with an astounding number of features and improvements over its predecessors. Some of the most noticeable ones are:

    • 1 Million Context Window: Maintains the industry-leading 1 million token input capacity, allowing it to process over 1,500 pages of text or entire code repositories in a single prompt.
    • Advanced Reasoning Performance: It delivers more than double the reasoning performance of Gemini 3 Pro, scoring 77.1% on the ARC-AGI-2 benchmark. 
    • Enhanced Agentic Reliability: Specifically optimized for autonomous workflows, including a dedicated API endpoint (gemini-3.1-pro-preview-customtools) for high-precision tool orchestration and bash execution.
    • Pricing: The cost per token of the latest model is the same as that of its predecessor, so those accustomed to the Pro variant are effectively getting a free upgrade.
    (Image: Gemini 3.1 Pro model card)
    • Advanced Vibe Coding: The model handles visual coding exceptionally well. It can generate website-ready, animated SVGs purely through code, meaning crisp scaling and tiny file sizes.
    • Hallucinations: Gemini 3.1 Pro tackles the hallucination problem head-on, reducing its hallucination rate from 88% to 50% on the AA-Omniscience (Knowledge and Hallucination) benchmark.
    • Granular Thinking: The model adds more granularity to the thinking option offered by its predecessor. Users can now choose between high, medium, and low thinking levels, summarized in the table below; a minimal API sketch follows the table.
    | Thinking Level | Gemini 3.1 Pro | Gemini 3 Pro | Gemini 3 Flash | Description |
    | --- | --- | --- | --- | --- |
    | Minimal | Not supported | Not supported | Supported | Matches the no-thinking setting for most queries; the model may still think minimally for complex coding tasks. Minimizes latency for chat or high-throughput applications. |
    | Low | Supported | Supported | Supported | Minimizes latency and cost. Best for simple instruction following or high-throughput applications. |
    | Medium | Supported | Not supported | Supported | Balanced reasoning for most tasks. |
    | High | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Supported (Default, Dynamic) | Maximizes reasoning depth. May increase latency, but outputs are more carefully reasoned. |
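
    To make this concrete, here is a minimal sketch of selecting a thinking level through the API. It assumes the google-genai Python SDK and that its ThinkingConfig accepts a thinking_level field for Gemini 3.x models; the field name and the "gemini-3.1-pro" model id string are assumptions, not details taken from the model card above.

    import os

    # Minimal sketch, assuming the google-genai SDK. The thinking_level
    # field and the "gemini-3.1-pro" model id are assumptions; check the
    # SDK docs for the exact names before relying on this.
    from google import genai
    from google.genai import types

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    response = client.models.generate_content(
        model="gemini-3.1-pro",
        contents="Briefly explain the latency vs. reasoning-depth trade-off.",
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level="low"),
        ),
    )
    print(response.text)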

    Hands-On: Let’s have some fun

    All the talk in the world wouldn’t amount to anything if the performance fell flat in practice. To evaluate Gemini 3.1 Pro properly, I tested it across three categories:

    1. Complex reasoning
    2. Code generation & debugging
    3. Long-context synthesis

    Task 1: Multi-Step Logical Reasoning

    What this tests: Chain-of-thought reasoning, constraint handling, and hallucination resistance.

    Prompt: 

    “You are given the following scenario:

    Five analysts — A, B, C, D, and E — are assigned to three projects: Alpha, Beta, and Gamma.

    Rules:

    1. Each project must have at least one analyst.
    2. A cannot work with C.
    3. B must be assigned to the same project as D.
    4. E cannot be on Alpha.
    5. No project can have more than three analysts.

    Question: List all valid assignment combinations. Show your reasoning clearly and ensure no rule is violated.”

    Response:

    Gemini 3.1 Pro handled constraint-heavy logic without collapsing into contradictions, which is where most models stumble. The consistency and clarity in enumerating valid combinations showed serious reasoning depth.
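
    The model’s full enumeration is too long to reproduce here, but it is mechanically checkable. The short brute-force script below (my own illustrative check, not the model’s output) enumerates all 3^5 candidate assignments and filters them by the five rules:

    from itertools import product

    ANALYSTS = "ABCDE"
    PROJECTS = ("Alpha", "Beta", "Gamma")

    def valid(assign):
        """Check one analyst -> project mapping against the five rules."""
        counts = {p: 0 for p in PROJECTS}
        for p in assign.values():
            counts[p] += 1
        return (
            all(c >= 1 for c in counts.values())      # Rule 1: no empty project
            and assign["A"] != assign["C"]            # Rule 2: A cannot work with C
            and assign["B"] == assign["D"]            # Rule 3: B with D
            and assign["E"] != "Alpha"                # Rule 4: E not on Alpha
            and all(c <= 3 for c in counts.values())  # Rule 5: at most 3 per project
        )

    solutions = [
        dict(zip(ANALYSTS, combo))
        for combo in product(PROJECTS, repeat=5)
        if valid(dict(zip(ANALYSTS, combo)))
    ]
    print(len(solutions), "valid assignments")
    for s in solutions:
        print(s)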

    Task 2: Code Generation & Refactoring

    What this tests: Structured thinking, edge-case awareness, code correctness, and production realism.

    Prompt:

    “You are given the following Python function:

    def process_orders(orders):
        total = 0
        for order in orders:
            if order["status"] == "completed":
                total += order["amount"]
        return total

    Refactor this function to:

    1. Handle missing keys safely.
    2. Ignore negative amounts.
    3. Log skipped invalid entries.
    4. Be production-ready with type hints and docstrings.
    5. Include unit tests for edge cases.

    Return clean, well-structured Python code only.”

    Response:

    import logging
    import unittest
    from typing import List, Dict, Any
    
    logger = logging.getLogger(__name__)
    
    def process_orders(orders: List[Dict[str, Any]]) -> float:
        """
        Process a list of order dictionaries and calculate
        the total amount of completed orders.
        """
        total: float = 0.0
    
        if not isinstance(orders, list):
            logger.error("'orders' must be a list.")
            return total
    
        for index, order in enumerate(orders):
            if not isinstance(order, dict):
                logger.warning("Skipping index %d: Not a dictionary.", index)
                continue
    
            if order.get("status") != "completed":
                continue
    
            if "amount" not in order:
                logger.warning("Skipping index %d: Missing 'amount'.", index)
                continue
    
            amount = order["amount"]
    
            if not isinstance(amount, (int, float)) or isinstance(amount, bool):
                logger.warning("Skipping index %d: 'amount' must be numeric.", index)
                continue
    
            if amount < 0:
                logger.warning("Skipping index %d: Negative amount (%.2f).", index, amount)
                continue
    
            total += float(amount)
    
        return total
    
    
    class TestProcessOrders(unittest.TestCase):
    
        def setUp(self):
            logging.disable(logging.WARNING)
    
        def tearDown(self):
            logging.disable(logging.NOTSET)
    
        def test_valid_completed_orders(self):
            orders = [
                {"status": "completed", "amount": 100},
                {"status": "completed", "amount": 50.5}
            ]
            self.assertEqual(process_orders(orders), 150.5)
    
        def test_ignore_pending_orders(self):
            orders = [
                {"status": "pending", "amount": 100},
                {"status": "completed", "amount": 50}
            ]
            self.assertEqual(process_orders(orders), 50.0)
    
        def test_missing_keys_skipped(self):
            orders = [
                {"amount": 100},
                {"status": "completed"},
                {"status": "completed", "amount": 20}
            ]
            self.assertEqual(process_orders(orders), 20.0)
    
        def test_negative_amounts_ignored(self):
            orders = [
                {"status": "completed", "amount": -10},
                {"status": "completed", "amount": 3}
            ]
            self.assertEqual(process_orders(orders), 3.0)


    if __name__ == "__main__":
        unittest.main()

    The refactored code felt production-aware, not toy-level. It anticipated edge cases, enforced type safety, and included meaningful tests. This is the kind of output that actually respects real-world development standards.

    Task 3: Long-Context Analytical Synthesis

    What this tests: Information compression, structured summarization, and reasoning across context.

    Prompt:

    “Below is a synthetic business report:

    Company: NovaGrid AI

    2022 Revenue: $12M
    2023 Revenue: $28M
    2024 Revenue: $46M

    Customer churn increased from 4% to 11% in 2024.
    R&D spending increased by 70% in 2024.
    Operating margin dropped from 18% to 9%.
    Enterprise customers grew by 40%.
    SMB customers declined by 22%.
    Cloud infrastructure costs doubled.

    Task:

    1. Diagnose the most likely root causes of margin decline.
    2. Identify strategic risks.
    3. Recommend 3 data-backed actions.
    4. Present your answer in a structured executive memo format.”

    Response:

    It connected financial signals, operational shifts, and strategic risks into a coherent executive narrative. The ability to diagnose margin pressure while balancing growth signals shows strong business reasoning. It read like something a sharp strategy consultant would draft, not a generic summary.
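
    As a quick check on the numbers the model was reasoning over, this small illustrative script (mine, not part of the model’s output) reproduces the report’s headline arithmetic:

    # Reproduce the synthetic report's year-over-year growth figures.
    revenue = {2022: 12, 2023: 28, 2024: 46}  # revenue in $M

    for year in (2023, 2024):
        prev, curr = revenue[year - 1], revenue[year]
        print(f"{year}: ${curr}M revenue, YoY growth {(curr - prev) / prev:.0%}")

    # Diverging signals the model had to reconcile in its diagnosis:
    print("Churn: 4% -> 11% (+7 pts); operating margin: 18% -> 9% (-9 pts)")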

    Note: I didn’t use the standard “create a dashboard” task, as most recent models, such as Sonnet 4.6 and Kimi K 2.5, can build one with ease; it wouldn’t offer much of a challenge to a model this capable.

    How to access Gemini 3.1 Pro? 

    Unlike the previous Pro models, Gemini 3.1 Pro is freely accessible to all users on the platform of their choice.

    Now that you’ve made up your mind about using Gemini 3.1 Pro, let’s see how you can access the model. 

    1. Gemini Web UI: Free and Gemini Advanced users now have 3.1 Pro available in the model selector.
    2. API: Available via Google AI Studio for developers (models/Gemini-3.1-pro). Pricing is summarized in the table below, and a minimal first call is sketched after this list.
    | Model | Base Input Tokens | 5m Cache Writes | 1h Cache Writes | Cache Hits & Refreshes | Output Tokens |
    | --- | --- | --- | --- | --- | --- |
    | Gemini 3.1 Pro (≤200K tokens) | $2 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $12 / 1M tokens |
    | Gemini 3.1 Pro (>200K tokens) | $4 / 1M tokens | ~$0.20–$0.40 / 1M tokens | ~$4.50 / 1M tokens per hour of storage | Not formally documented | $18 / 1M tokens |
    3. Cloud Platforms: Being rolled out to NotebookLM, Google Cloud’s Vertex AI, and Microsoft Foundry.
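
    For developers starting out, a first call might look like this minimal sketch, assuming the google-genai Python SDK. The lowercase "gemini-3.1-pro" id is an assumption based on the listing above (models/Gemini-3.1-pro); verify the exact id in AI Studio.

    import os

    # Minimal sketch of a first API call via Google AI Studio,
    # assuming the google-genai SDK; verify the exact model id.
    from google import genai

    client = genai.Client(api_key=os.environ["GEMINI_API_KEY"])

    response = client.models.generate_content(
        model="gemini-3.1-pro",
        contents="In one paragraph, what is Gemini 3.1 Pro best suited for?",
    )
    print(response.text)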

    Benchmarks

    To quantify how good this model is, let’s look at the benchmarks.

    There is a lot to decipher here, but the most astounding improvement of all is certainly in abstract reasoning puzzles.

    Let me put things into perspective: Gemini 3 Pro released with an ARC-AGI-2 score of 31.1%. That was the highest score at the time and was considered a breakthrough by LLM standards. Fast forward just three months, and that score has been eclipsed by its own successor at 77.1%, more than double the figure!

    This is the rapid pace at which AI models are improving. 

    If you’re unfamiliar with what these benchmarks test, read this article: AI Benchmarks. 

    Conclusion: Powerful and Accessible

    Gemini 3.1 Pro proves it’s more than a flashy multimodal model. Across reasoning, code, and analytical synthesis, it demonstrates real capability with production relevance. It’s not flawless and still demands structured prompting and human oversight. But as a frontier model embedded in Google’s ecosystem, it’s powerful, competitive, and absolutely worth serious evaluation.

    Frequently Asked Questions

    Q1. What is Gemini 3.1 Pro designed for?

    A. It is built for advanced reasoning, long-context processing, multimodal understanding, and production-grade AI applications.

    Q2. How can developers access Gemini 3.1 Pro?

    A. Developers can access it via Google AI Studio for prototyping or Vertex AI for scalable, enterprise deployments.

    Q3. Is Gemini 3.1 Pro reliable for high-stakes tasks?

    A. It performs strongly but still requires structured prompting and human oversight to ensure accuracy and reduce hallucinations.


    Vasu Deo Sankrityayan

    I specialize in reviewing and refining AI-driven research, technical documentation, and content related to emerging AI technologies. My experience spans AI model training, data analysis, and information retrieval, allowing me to craft content that is both technically accurate and accessible.
