TL;DR
If you want architects, operators, and leadership aligned, you need a topology mental model that starts with VCF objects and only then maps to your physical sites.
- The hierarchy you should standardize on is Fleet -> Instance -> Domain -> Cluster.
- Your topology decision is mostly about:
- How many instances you deploy and how they map to sites and regions.
- How many fleets you operate as governance, identity, and operational boundaries.
- Three practical deployment postures:
- Single site: fastest path, smallest blast radius, simplest networking.
- Two sites in one region: stretched clusters, stronger site resilience, tighter latency constraints.
- Multi-region: multiple instances, DR-oriented operating model, more dependencies and more change control.
- VCF 9.0 GA code levels referenced in this post (component set and build numbers):
- SDDC Manager 9.0.0.0 build 24703748
- vCenter 9.0.0.0 build 24755230
- ESX 9.0.0.0 build 24755229
- NSX 9.0.0.0 build 24733065
- VCF Operations 9.0.0.0 build 24695812
- VCF Automation 9.0.0.0 build 24701403
- VCF Identity Broker 9.0.0.0 build 24695128
- Note: the BOM for this release also calls out VCF Installer 9.0.1.0 build 24962180 as required to deploy all VCF 9.0.0.0 components.
Table of Contents
- Scenario
- Scope and version alignment
- Core concepts: mapping physical topology to fleets and instances
- Decision criteria you should agree on up front
- Challenge: choose your deployment posture
- Architecture tradeoff matrix
- One private cloud vs multiple fleets
- Identity and SSO boundary patterns
- Failure domain analysis
- Day-0, day-1, day-2 map by topology
- Who owns what
- Operational runbook snapshot
- Troubleshooting workflow
- Anti-patterns
- Best practices
- Summary and takeaways
- Conclusion
Scenario
You are about to deploy VCF 9.0 GA greenfield and you need a shared language for:
- What VCF is actually managing.
- Where you draw governance boundaries vs infrastructure boundaries.
- What changes when you move from a single site to stretched sites to multiple regions.
- How identity scope and fleet count become day-0 decisions with long-tail operational consequences.
Scope and version alignment
This post assumes:
- VCF 9.0.0.0 GA terminology and workflows.
- Greenfield deployment using VCF Installer.
- You deploy both VCF Operations and VCF Automation from day-1, even if you phase consumption later.
Version compatibility matrix
Use this as your “what are we talking about” anchor in architecture reviews and CAB meetings.
| Component | Version | Build |
|---|---|---|
| SDDC Manager | 9.0.0.0 | 24703748 |
| vCenter | 9.0.0.0 | 24755230 |
| ESX | 9.0.0.0 | 24755229 |
| NSX | 9.0.0.0 | 24733065 |
| VCF Operations | 9.0.0.0 | 24695812 |
| VCF Automation | 9.0.0.0 | 24701403 |
| VCF Identity Broker | 9.0.0.0 | 24695128 |
| VCF Installer | 9.0.1.0 | 24962180 |
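If you script environment validation, a minimal sketch like the one below can compare deployed build numbers against this BOM. Only the expected builds come from the table above; the component inventory in `deployed` is a placeholder you would populate from your own environment, not output from any VCF API.

```python
# Minimal sketch: compare deployed build numbers against the VCF 9.0.0.0 BOM above.
# The "deployed" dict is a placeholder; populate it from your own inventory exports.
EXPECTED_BUILDS = {
    "SDDC Manager": "24703748",
    "vCenter": "24755230",
    "ESX": "24755229",
    "NSX": "24733065",
    "VCF Operations": "24695812",
    "VCF Automation": "24701403",
    "VCF Identity Broker": "24695128",
    "VCF Installer": "24962180",
}

deployed = {
    "SDDC Manager": "24703748",  # example values, replace with real inventory data
    "vCenter": "24755230",
}

for component, expected in EXPECTED_BUILDS.items():
    actual = deployed.get(component)
    if actual is None:
        print(f"{component}: not reported by inventory")
    elif actual != expected:
        print(f"{component}: build {actual} does not match BOM build {expected}")
    else:
        print(f"{component}: OK ({actual})")
```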
Core concepts: mapping physical topology to fleets and instances
The physical words that matter
You will hear these terms used casually. Align them to your constraints:
- Site: a self-contained fault domain. Power, cooling, ToR switches, upstream routing, and physical security usually correlate here.
- Region: one or more sites close enough to stay within synchronous replication latencies. Crossing regions is typically a disaster recovery process, not an HA event.
The VCF objects you should use in every design discussion
- Fleet: your shared governance and shared platform services boundary. This is where you centralize fleet services like operations and automation.
- Instance: a discrete VCF deployment footprint. Each instance contains its own management domain and workload domains.
- Domain: the lifecycle and isolation boundary. You patch and evolve domains independently.
- Management domain: hosts instance management components.
- VI workload domain(s): run consumer workloads.
Practical rule:
- If someone says “we need another vCenter,” you force the conversation back to domain and instance.
- If someone says “we need separation,” you ask whether they mean governance separation (fleet) or workload isolation (domain).
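If it helps to make the vocabulary concrete in design docs, here is a minimal sketch of the Fleet -> Instance -> Domain -> Cluster hierarchy as plain Python dataclasses. The type names and the example topology are illustrative only, not a VCF API or schema.

```python
from dataclasses import dataclass, field

# Illustrative object model only: these are not VCF API types,
# just a way to make Fleet -> Instance -> Domain -> Cluster concrete.

@dataclass
class Cluster:
    name: str
    host_count: int

@dataclass
class Domain:
    name: str
    kind: str                      # "management" or "workload"
    clusters: list[Cluster] = field(default_factory=list)

@dataclass
class Instance:
    name: str
    site: str                      # physical mapping: which site or region hosts it
    domains: list[Domain] = field(default_factory=list)

@dataclass
class Fleet:
    name: str                      # governance, identity, and shared-services boundary
    instances: list[Instance] = field(default_factory=list)

# Example: single-site posture (one fleet, one instance, management plus one workload domain)
fleet = Fleet(
    name="corp-fleet-01",
    instances=[
        Instance(
            name="instance-a",
            site="site-1",
            domains=[
                Domain("mgmt-domain", "management", [Cluster("mgmt-cl01", 4)]),
                Domain("wld-01", "workload", [Cluster("wld01-cl01", 8)]),
            ],
        )
    ],
)
print(f"{fleet.name}: {len(fleet.instances)} instance(s)")
```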
Decision criteria you should agree on up front
These are design-time decisions that are expensive to reverse later.
Design-time decision criteria
- Availability objective
- Host failure only
- Rack failure
- Site or availability zone failure
- Region failure
- Latency and network fabric capability
- ESX host to ESX host latency within clusters
- Stretched VLAN and L2 adjacency requirements
- Fleet-wide connectivity constraints between instances
- Isolation objective
- Logical isolation only
- Physical isolation by cluster or domain
- Regulated tenant isolation that requires separate identity and change control
- Operating model maturity
- Do you have a platform team that can own fleet services and identity lifecycle?
- Do you have standardized change windows and patch discipline?
Day-2 reality check questions
- Can you support fleet services as shared dependencies?
- Can you operationalize backup schedules, certificate lifecycle, and password rotation consistently?
- Can you troubleshoot cross-site failures without escalating everything to vendors?
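One way to keep these criteria explicit in a design review is to encode them as a simple mapping. The function below is a sketch of the decision logic described in this post, not a sizing or supportability tool; the 10 ms figure and the posture labels are assumptions you should adapt to your own constraints.

```python
# Illustrative decision sketch only: encodes the criteria above, not an official tool.
def recommend_posture(availability_objective: str,
                      regions: int,
                      inter_site_latency_ms: float | None = None) -> str:
    """Map high-level design criteria to one of the three postures in this post."""
    if regions > 1 or availability_objective == "region":
        return "multi-region (multiple instances, DR-oriented operating model)"
    if availability_objective == "site":
        # Stretched designs assume a low-latency inter-site fabric; the 10 ms figure
        # is an assumption for illustration, not a documented VCF limit.
        if inter_site_latency_ms is not None and inter_site_latency_ms > 10:
            return "two sites requested, but latency likely too high for stretching"
        return "two sites in one region (stretched constructs)"
    return "single site (one fleet, one instance)"

print(recommend_posture("site", regions=1, inter_site_latency_ms=3.2))
```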
Challenge: choose your deployment posture
You need a deployment posture that matches your physical topology without creating a governance model you cannot operate.
Solution A: Single site
This is your default starting posture unless you have a clear availability driver.
What it looks like
- One fleet
- One instance
- One management domain
- One or more workload domains
Design intent
- Optimize for time-to-value and operational simplicity.
- Keep latency and networking requirements straightforward.
Operational implications
- You can still scale within the site by adding:
- More clusters to domains
- More workload domains
- Potentially more instances if you need isolation at the instance boundary
- Your DR posture becomes a separate conversation, usually backup/restore first, then replication and orchestration.
Solution B: Two sites in one region
This is the “site resilience” posture. It usually assumes stretched constructs and tighter network constraints.
What it looks like
- One fleet
- One instance stretched across two sites in the same region
- A management domain designed for high availability across the two sites
- Workload domains separated from management
Design intent
- Tolerate a full site or availability zone failure for management and potentially workloads.
- Reduce downtime for site-local incidents.
Hard constraints you must respect
- Stretched designs require disciplined network engineering. VCF calls out maximum latency thresholds for:
- ESX hosts within a vSphere cluster.
- Hosts running NSX Edge nodes within the same NSX Edge cluster.
- vSAN witness connectivity in stretched designs.
- You also inherit “stretched gateway” and “first-hop gateway HA” problems that surface in real outages whether you planned for them or not.
Operational implications
- Your patch workflow needs to understand site affinity and failover capacity.
- Your network troubleshooting becomes as important as your vSphere troubleshooting.
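Because the latency thresholds called out above are hard constraints, measure them before bring-up and keep re-measuring afterwards. The sketch below pings a list of peer hosts from a Linux jump host and compares average round-trip times against configurable limits; the threshold values and hostnames are placeholders, so take the real maximums from the official VCF 9.0 documentation for your design.

```python
import re
import subprocess

# Illustrative pre-flight check: measure RTT to peer hosts and compare against limits.
# The thresholds below are placeholders; use the documented VCF 9.0 latency maximums
# for stretched clusters, NSX Edge clusters, and vSAN witness traffic in your design.
THRESHOLDS_MS = {
    "cross-site-esx": 5.0,    # assumption for illustration
    "vsan-witness": 200.0,    # assumption for illustration
}

def average_rtt_ms(host: str, count: int = 5) -> float:
    """Return the average RTT in ms using the Linux ping utility."""
    out = subprocess.run(
        ["ping", "-c", str(count), "-q", host],
        capture_output=True, text=True, check=True,
    ).stdout
    # Linux ping summary looks like: rtt min/avg/max/mdev = 0.321/0.456/0.612/0.090 ms
    match = re.search(r"= [\d.]+/([\d.]+)/", out)
    if not match:
        raise RuntimeError(f"could not parse ping output for {host}")
    return float(match.group(1))

checks = [
    ("esx-site2-01.example.local", "cross-site-esx"),
    ("vsan-witness.example.local", "vsan-witness"),
]
for host, kind in checks:
    rtt = average_rtt_ms(host)
    limit = THRESHOLDS_MS[kind]
    status = "OK" if rtt <= limit else "EXCEEDS LIMIT"
    print(f"{host}: avg {rtt:.2f} ms vs {limit} ms ({kind}) -> {status}")
```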
Solution C: Multi-region
This is the “DR-oriented operating model” posture. You typically deploy multiple instances, each aligned to a region.
What it looks like
- One fleet
- Multiple instances
- Each instance has its own management domain and workload domains
- Regions are connected with a cross-region network for centralized management and visibility
Design intent
- Separate failure domains by region.
- Enable recovery workflows that survive region-level events, given adequate capacity and replication strategy.
Operational implications
- You introduce an explicit dependency chain:
- Fleet services live somewhere (commonly the first instance), and other instances rely on cross-region connectivity to reach them.
- Change control becomes multi-region by default:
- Certificates, identity, and patching must be coordinated across locations.
Architecture tradeoff matrix
Use this in design boards to stop circular debates.
| Attribute | Single site | Two sites in one region | Multi-region |
|---|---|---|---|
| Primary goal | Simplicity | Site resilience | Region separation and DR posture |
| Typical instance count | 1 | 1 | 2+ |
| Network complexity | Low | High | Medium to high |
| Latency sensitivity | Medium | High | Medium |
| Fleet service dependency | Local | Local but stretched | Cross-region dependency |
| Operational overhead | Low | High | High |
| Cost drivers | Host count, storage | Stretched fabric, witness, failover capacity | Duplicate capacity, replication, bandwidth |
Cost model snapshot
This is not pricing. It is what actually moves your bill of materials.
- Single site
- Cheapest fleet service hosting footprint.
- Lowest network engineering cost.
- Two sites in one region
- You pay for:
- Higher-quality inter-site links
- Stretched VLAN support
- Additional failover capacity (because you are engineering for a site loss)
- Multi-region
- You pay for:
- Duplicate management footprints per region
- Data replication and orchestration tooling
- Higher operational toil unless you automate day-2 heavily
One private cloud vs multiple fleets
Treat “private cloud” as your organizational wrapper. VCF objects start at fleet.
When one fleet is enough
Choose one fleet when:
- You want centralized observability and automation.
- You can accept shared governance services across multiple instances.
- You want a standard operating model across locations.
When you should operate multiple fleets
Choose multiple fleets when you need:
- Separate identity providers or separate SSO boundaries for regulated isolation.
- Independent change windows and patch schedules.
- Hard blast radius separation for fleet services.
Practical framing:
- Fleet separation is about governance, identity scope, and shared service blast radius.
- Domain separation is about workload isolation and lifecycle independence.
Identity and SSO boundary patterns
Identity is not a “later” decision. It is a boundary decision.
Challenge: unify access or isolate tenants
You need a model that matches your org and compliance posture.
Solution A: Fleet-wide SSO
Use this when:
- You want one set of credentials and SSO across all components in the fleet.
- You can tolerate that an identity broker impact affects the fleet.
Operational reality:
- This is powerful for operator experience, but it increases the blast radius of identity outages.
Solution B: Cross-instance SSO
Use this when:
- You want shared identity across a subset of instances, not necessarily all.
- You want more control over blast radius than a single fleet-wide configuration.
Solution C: Single instance SSO boundaries
Use this when:
- You need regulated or tenant isolation.
- You need different identity providers or different authentication policies per instance.
- You want to localize identity outages.
Embedded vs appliance identity broker
Treat this as a scaling and availability decision:
- Embedded identity broker is simpler but inherits dependency on instance components.
- Appliance identity broker adds overhead but improves availability and scale.
- Design constraint worth calling out: there is a maximum number of instances that can connect to a single identity broker deployment.
Failure domain analysis
This is where topology turns into real operational outcomes.
Failure domains you should model
| Failure | What breaks | What keeps running | Typical owner response |
|---|---|---|---|
| Fleet services unavailable | Central observability, centralized automation, fleet management workflows, and optionally SSO experience | Existing workloads and instance-level management planes continue to run | Platform team restores fleet services, validates integrations |
| Instance management domain impaired | Domain lifecycle actions, some instance operations | Workloads may still run, but you lose safe lifecycle control and may lose some vCenter or NSX management functions depending on the failure | VI admin + platform team coordinate recovery |
| Workload domain impaired | Workloads in that domain | Other domains and instances continue | VI admin + app teams execute workload recovery runbooks |
Practical RTO and RPO examples you can use as starting targets
These are real-world starting points, not vendor commitments:
- Fleet services (VCF Operations, VCF Automation, identity broker)
- RTO: 2 to 8 hours depending on how automated your restore is
- RPO: 24 hours is a common baseline when backups are daily
- Instance management domain components
- RTO: 1 to 4 hours if you have clean backups and documented restore procedures
- RPO: 24 hours baseline, tighter if you replicate critical config
- Workload domains and applications
- RTO and RPO are application-specific and often require orchestration tooling
If you cannot state these targets, you should at least agree on the priority order:
- Identity and authentication
- SDDC Manager and lifecycle
- vCenter and NSX management
- Workload recovery
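A low-effort way to make these targets actionable is to record them next to the priority order, so restore runbooks get sequenced instead of improvised. The sketch below captures the same starting targets from this section as plain data; the numbers remain illustrative baselines, not commitments.

```python
# Starting recovery targets from this section, captured as data for runbook sequencing.
# These are illustrative baselines, not vendor commitments; replace with your own targets.
RECOVERY_PRIORITY = [
    {"tier": 1, "scope": "Identity and authentication", "rto_hours": (2, 8), "rpo_hours": 24},
    {"tier": 2, "scope": "SDDC Manager and lifecycle",   "rto_hours": (1, 4), "rpo_hours": 24},
    {"tier": 3, "scope": "vCenter and NSX management",   "rto_hours": (1, 4), "rpo_hours": 24},
    {"tier": 4, "scope": "Workload recovery",            "rto_hours": None,   "rpo_hours": None},  # app-specific
]

for item in sorted(RECOVERY_PRIORITY, key=lambda i: i["tier"]):
    rto = item["rto_hours"]
    rto_text = f"{rto[0]}-{rto[1]} h" if rto else "application-specific"
    print(f"Tier {item['tier']}: {item['scope']} (target RTO {rto_text})")
```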
Day-0, day-1, day-2 map by topology
Day-0: decisions you should lock
These apply to all topologies:
- Fleet count and naming standard.
- Instance to site or region mapping.
- Domain strategy:
- Management domain is not where you run business workloads.
- Workload domains align to lifecycle and isolation needs.
- Network and IP plan:
- Treat subnet sizing as irreversible planning, not something you “fix later” (see the address-plan sketch after this list).
- Allocate address space with room for expansion.
- Identity model:
- Fleet-wide vs instance-level boundaries.
- Corporate IdP integration and MFA policy alignment.
- Certificate authority and certificate lifecycle plan.
- Backup targets and backup schedule owners.
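For the network and IP plan, a quick way to sanity-check that the address space leaves room for expansion is to carve it up programmatically before committing it to the deployment workbook. The supernet, block sizes, and block names below are assumptions for illustration, not VCF requirements.

```python
import ipaddress

# Illustrative IP planning sketch: carve a supernet into fixed-size blocks and keep
# explicit headroom. The supernet and prefix lengths are assumptions, not VCF rules.
SUPERNET = ipaddress.ip_network("10.40.0.0/16")
blocks = list(SUPERNET.subnets(new_prefix=20))   # sixteen /20 blocks to hand out

plan = ["instance-a-mgmt", "instance-a-wld01", "instance-a-nsx", "instance-a-expansion"]
allocations = dict(zip(plan, blocks))            # everything not allocated stays reserved

for name, block in allocations.items():
    # subdivide each block further per network type (management, vMotion, vSAN, TEP, ...)
    first_subnets = list(block.subnets(new_prefix=24))[:4]
    print(f"{name:22s} {block}  e.g. {', '.join(str(s) for s in first_subnets)}")

print(f"unallocated /20 blocks kept for growth: {len(blocks) - len(allocations)}")
```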
Day-1: bring-up sequence that fits the object model
A typical greenfield flow looks like:
- Deploy VCF Installer appliance.
- Start new fleet deployment and create the first instance.
- License and stand up fleet services:
- VCF Operations
- VCF Automation
- Deploy identity broker and configure VCF SSO with your directory (a directory connectivity pre-check sketch follows this list).
- Create workload domain(s) and establish network connectivity patterns.
- Stand up VCF Automation constructs for consumption.
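Before the identity broker step, it can save a support call to verify that the directory you plan to federate is reachable and accepts a bind from the intended service account. The sketch below uses the third-party ldap3 library; the hostname, bind DN, password, and the LDAPS-on-636 assumption are all placeholders for your environment.

```python
from ldap3 import Server, Connection, ALL

# Illustrative pre-check for the identity/SSO step: confirm the corporate directory is
# reachable and the intended bind account works. Hostname, DN, and password are placeholders.
LDAP_HOST = "ldaps.example.local"
BIND_DN = "cn=svc-vcf-sso,ou=service-accounts,dc=example,dc=local"
BIND_PASSWORD = "replace-me"

server = Server(LDAP_HOST, port=636, use_ssl=True, get_info=ALL)
conn = Connection(server, user=BIND_DN, password=BIND_PASSWORD)

if conn.bind():
    print(f"bind OK as {BIND_DN}")
    conn.unbind()
else:
    print(f"bind failed: {conn.result}")
```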
Day-2: operations you should operationalize early
- Patch and lifecycle:
- Domain-based upgrades and maintenance windows.
- Explicit rollback plans when upgrading shared fleet services.
- Backup and restore:
- SFTP backup targets for management components.
- Backup schedules for fleet services and for instance components.
- Security lifecycle:
- Password rotation and account management.
- Certificate replacement and renewal.
- Expansion:
- Add workload domains, clusters, and potentially additional instances.
Who owns what
Use this to prevent “everyone owns it, so no one owns it.”
| Capability | Platform team | VI admin | App and platform teams |
|---|---|---|---|
| Fleet services lifecycle | Own | Consult | Informed |
| VCF Operations configuration and alerts | Own | Consult | Informed |
| VCF Automation provider setup | Own | Consult | Informed |
| Identity broker and SSO model | Own | Consult | Informed |
| Instance bring-up and health | Own | Own | Informed |
| SDDC Manager operations | Consult | Own | Informed |
| vCenter and NSX in management domain | Consult | Own | Informed |
| Workload domain creation and lifecycle | Consult | Own | Informed |
| Workload provisioning via automation | Own the platform | Consult | Own consumption |
| Application deployment and runtime | Informed | Consult | Own |
Operational runbook snapshot
Keep this as a living page in your ops wiki.
Weekly
- Review fleet service health and integrations.
- Validate that all instances are reporting metrics and logs.
- Confirm certificate expiration windows and rotation queue.
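For the certificate expiration check, the sketch below pulls the certificate from each platform endpoint and reports days to expiry. It uses the third-party cryptography library, and `ssl.get_server_certificate` does not validate the chain, so it also works against self-signed or internal-CA certificates; the endpoint list and warning threshold are placeholders.

```python
import ssl
from datetime import datetime, timezone
from cryptography import x509

# Illustrative weekly check: report days until certificate expiry for platform endpoints.
# Endpoint names are placeholders for your own management FQDNs.
ENDPOINTS = [
    ("sddc-manager.example.local", 443),
    ("vcf-ops.example.local", 443),
    ("vcf-auto.example.local", 443),
]

WARN_DAYS = 45  # rotation-queue threshold, pick your own

for host, port in ENDPOINTS:
    pem = ssl.get_server_certificate((host, port))
    cert = x509.load_pem_x509_certificate(pem.encode())
    # newer cryptography versions expose a timezone-aware not_valid_after_utc
    not_after = getattr(cert, "not_valid_after_utc", None) \
        or cert.not_valid_after.replace(tzinfo=timezone.utc)
    days_left = (not_after - datetime.now(timezone.utc)).days
    flag = "ROTATE SOON" if days_left < WARN_DAYS else "OK"
    print(f"{host}: expires {not_after:%Y-%m-%d} ({days_left} days) -> {flag}")
```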
Monthly
- Execute backup restore tests for:
- Fleet services
- Instance management components
- Review capacity and “failover capacity” assumptions for your topology.
Quarterly
- Patch at domain boundaries, not by ad hoc component upgrades.
- Re-validate cross-site network latency and packet loss.
- Run a tabletop exercise:
- Fleet services outage
- Instance outage
- Site outage
Validation checklist
Use UI and workflow validation before you declare success:
- In VCF Operations, confirm each VCF instance is visible and healthy.
- Confirm your automation provider and tenant access paths work with the chosen identity model.
- Confirm backups are running and stored off the platform.
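To turn “confirm backups are running” into something checkable, the sketch below lists the newest file on the SFTP backup target and flags it if it is stale. It uses the third-party paramiko library; the host, credentials, path, and 26-hour freshness window are placeholders for your own backup target and schedule.

```python
import time
import paramiko

# Illustrative backup-recency check against an SFTP backup target.
# Host, credentials, and path are placeholders; adapt to your own target and retention.
SFTP_HOST = "backup01.example.local"
SFTP_USER = "vcf-backup"
SFTP_PASSWORD = "replace-me"
BACKUP_PATH = "/backups/vcf"
MAX_AGE_HOURS = 26   # expect daily backups, allow a little slack

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())   # pin host keys in real use
client.connect(SFTP_HOST, username=SFTP_USER, password=SFTP_PASSWORD)
sftp = client.open_sftp()

entries = sorted(sftp.listdir_attr(BACKUP_PATH), key=lambda e: e.st_mtime, reverse=True)
if not entries:
    print(f"no files found under {BACKUP_PATH}")
else:
    newest = entries[0]
    age_hours = (time.time() - newest.st_mtime) / 3600
    status = "OK" if age_hours <= MAX_AGE_HOURS else "STALE"
    print(f"newest backup: {newest.filename} ({age_hours:.1f} h old) -> {status}")

sftp.close()
client.close()
```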
Troubleshooting workflow
When something breaks, troubleshoot by boundary.
- Provisioning failures
- Check VCF Automation health and its integration to VCF Operations.
- Validate identity provider connectivity and token issuance.
- Validate network connectivity between fleet services and target instance.
- Instance lifecycle failures
- Inspect SDDC Manager alarms and recent change history.
- Validate domain health and vCenter availability.
- Check for drift from out-of-band changes.
- Cross-site weirdness
- Start with latency and MTU validation.
- Validate gateway HA behavior for stretched segments.
- Confirm site affinity rules for critical components.
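For the latency and MTU starting point, a quick MTU probe is to send do-not-fragment pings at the payload size your fabric should carry end to end. The sketch below assumes a Linux jump host and a jumbo-frame (9000 byte) expectation; the target hostnames and the MTU value are assumptions, so adjust them for your environment.

```python
import subprocess

# Illustrative MTU probe from a Linux host: send do-not-fragment pings sized for the MTU
# you expect end to end. Targets and the 9000-byte jumbo assumption are placeholders.
TARGETS = ["esx-site2-01.example.local", "nsx-edge-02.example.local"]
EXPECTED_MTU = 9000
PAYLOAD = EXPECTED_MTU - 28   # 20-byte IP header + 8-byte ICMP header

for host in TARGETS:
    result = subprocess.run(
        ["ping", "-M", "do", "-s", str(PAYLOAD), "-c", "3", host],
        capture_output=True, text=True,
    )
    status = "MTU OK" if result.returncode == 0 else "FRAGMENTATION OR LOSS"
    print(f"{host}: {status} at {EXPECTED_MTU} bytes")
```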
Anti-patterns
Avoid these early and you remove a lot of future toil.
- Treating a two-site stretched design like “just two data centers”.
- Using a single fleet across regulated tenants when you actually need separate identity and change boundaries.
- Running meaningful workloads in the management domain because it was “available”.
- Designing IP space too tightly and assuming you can resize later.
- Assuming multi-region means “active-active” without defining replication, orchestration, and capacity for failover.
Best practices
- Standardize vocabulary in writing:
- Fleet -> Instance -> Domain -> Cluster
- Keep fleet services highly available and backed up like any other Tier 0 platform.
- Make identity a design board item, not an implementation checkbox.
- Use domains as your lifecycle boundary:
- Patch domains, validate domains, roll back at domain boundaries.
- Write failure-mode runbooks for:
- Fleet services down
- Instance down
- Site down
- Region down
Summary and takeaways
- Your topology posture is an operating model decision, not just an architecture diagram.
- A two-site, single-region posture usually increases availability, but it also increases network and day-2 complexity.
- Multi-region usually improves fault domain separation, but it introduces cross-region dependencies for fleet services unless you deliberately isolate with multiple fleets.
- Decide identity scope and fleet count at day-0. The cost of changing later is always higher than the cost of deciding carefully now.
Conclusion
VCF 9.0 GA becomes easier to design and operate when you treat fleet, instance, and domain as explicit boundaries and then map them to site and region realities. Pick the simplest topology that meets your availability and isolation goals, and invest early in day-2 practices for identity, backups, lifecycle, and change control.
Sources
VMware Cloud Foundation 9.0 and later Documentation: https://techdocs.broadcom.com/us/en/vmware-cis/vcf/vcf-9-0-and-later/9-0.html
