
Search-Based Testing (SBT) and the Curse of Dimensionality

Validating autonomous vehicles (AVs) isn't just about racking up miles anymore—the potential scenarios are vast, and brute-force testing just doesn't scale. Search-Based Testing (SBT) gives us a way to efficiently explore these massive logical scenario spaces and find the concrete cases that actually matter.

📚 Series Context: In Part 1, we explored why traditional mileage-based validation doesn't scale for autonomous vehicles—requiring billions of kilometers to achieve statistical confidence. This article presents the first puzzle piece of the solution: how Search-Based Testing intelligently samples vast scenario spaces to find critical cases within a single logical scenario.

Scenario Hierarchy and ODD Context

Every AV operates within an Operational Design Domain (ODD), defining the conditions under which the system is intended to operate—roads, traffic rules, weather, and environmental constraints. Within this ODD, individual situations are captured as scenarios at different abstraction levels:

  • Abstract: Human-readable description (e.g., "uncontrolled intersection with cross traffic").
  • Logical: Parameter ranges that define the abstract scenario (e.g., vehicle speeds, positions, accelerations).
  • Concrete: Specific parameter assignments executable in simulation.

In this article, the intersection scenario we use as an example is one logical scenario within its ODD.
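To make the abstraction levels tangible, here is a minimal Python sketch of a logical scenario and one concrete scenario sampled from it. The parameter names and ranges are illustrative assumptions, not values from any standard:

```python
import random

# Logical scenario: parameter *ranges* for the intersection example
# (illustrative values, not from any standard).
logical_scenario = {
    "ego.v0":  (0.0, 15.0),    # initial velocity [m/s]
    "ego.p0":  (-50.0, -10.0), # initial position along the approach [m]
    "npc1.v0": (0.0, 15.0),
    "npc1.p0": (-50.0, -10.0),
    "npc1.a0": (-3.0, 2.0),    # constant acceleration [m/s^2]
    # ... analogous entries for npc2 and npc3
}

def sample_concrete(logical, rng=random):
    """Draw one executable concrete scenario from the logical ranges."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in logical.items()}

concrete_scenario = sample_concrete(logical_scenario)
print(concrete_scenario)  # specific parameter values, ready for simulation
```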

The Curse of Dimensionality: 4-Car Intersection Example

Let’s look at a standard unsignalized intersection with four vehicles (ego + three traffic participants). We exclude the ego vehicle's acceleration from the parameter set (the planner under test controls it) but vary its initial state. This gives us:

  • Ego Vehicle: Initial velocity (v0) and Initial position (p0) [2 parameters]
  • 3 Traffic Participants: Initial velocity (v0), Initial position (p0), and Acceleration (a0) each [3 × 3 = 9 parameters]

That gives us an 11-dimensional space to cover. With a coarse discretization of 10 steps per parameter, brute-force simulation would require:

10¹¹ = 100,000,000,000 concrete scenarios

And this toy setup is aggressively simplified. We have not modeled:

  • steering
  • driver models
  • sensor perception
  • priority rules
  • traffic lights
  • occlusions
  • friction / weather
  • vehicle dynamics
  • turn maneuvers
  • pedestrians
  • and many more...

Yet even this minimal scenario spans a continuous parameter space so large that a naive brute-force sweep is computationally infeasible.

Assuming an optimistic simulation engine running 100 simulations per second, the runtime would be:

100,000,000,000 scenarios / 100 sims/sec = 1,000,000,000 sec ≈ 31.7 years!

Note: Achieving 100 simulations per second is only feasible with lightweight, low-fidelity simulations. In reality, each scenario would take much longer to simulate.
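The arithmetic behind these figures is easy to verify with a quick back-of-the-envelope check:

```python
# Back-of-the-envelope check of the numbers above.
params = 2 + 3 * 3           # ego (v0, p0) + three NPCs (v0, p0, a0)
steps = 10                   # coarse discretization per parameter
scenarios = steps ** params  # 10**11 = 100,000,000,000

sims_per_sec = 100           # optimistic low-fidelity engine
seconds = scenarios / sims_per_sec
years = seconds / (3600 * 24 * 365)
print(f"{scenarios:,} scenarios -> {years:.1f} years")  # ~31.7 years
```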

Business Impact

Cloud simulation at scale (1000+ cores) represents significant ongoing costs. Every 10× reduction in required scenarios translates directly to infrastructure savings and faster time-to-market — critical when OEM release windows are measured in quarters, not years.

Even under optimistic assumptions, brute-force evaluation of all possible combinations doesn't scale.

Why Brute-Force Fails

Running every single variation is a massive waste of resources on uninteresting scenarios where the AV performs well. The critical events — collisions, near-misses, and edge cases — occupy only small subregions of the parameter space. Identifying them requires a method that focuses exploration where it matters.

The solution: instead of evaluating everything, evaluate only what's likely to matter.

Why Search-Based Testing Works

SBT treats testing like an optimization problem, typically using genetic algorithms, Bayesian optimization, or surrogate models, guided by a Key Performance Indicator (KPI). For example:

  • KPI: Minimum bounding box distance during an intersection crossing.
  • Objective: Minimize this distance to uncover near-misses or collisions.
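As a sketch of such a KPI, the snippet below computes a minimum-clearance value from two sampled trajectories. The circular-footprint approximation and the radius value are simplifying assumptions; production code would measure the distance between oriented bounding boxes:

```python
import math

VEHICLE_RADIUS = 1.5  # crude circular footprint [m], an assumption

def min_distance_kpi(ego_traj, other_traj):
    """Minimum clearance over the crossing, given trajectories as lists
    of (x, y) center positions at a fixed timestep; <= 0 means contact."""
    return min(
        math.dist(e, o) - 2 * VEHICLE_RADIUS
        for e, o in zip(ego_traj, other_traj)
    )

# The search then *minimizes* this KPI to surface near-misses/collisions.
```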

Iterative Process of SBT:

  1. Coarse Sampling: Sample initial points across the logical scenario space.
  2. KPI Evaluation: Run simulations and compute the KPI for each scenario.
  3. Surrogate Model Training: Build a fast approximation (e.g., Gaussian Process, neural network) of the KPI function based on evaluated samples. This model predicts KPI values without running expensive simulations.
  4. Region Refinement: Use the surrogate model to identify promising regions and focus subsequent samples where the KPI indicates potential critical events.
  5. Continuous Improvement: As new samples are evaluated, the surrogate model is continuously retrained and refined, improving prediction accuracy in critical regions while maintaining computational efficiency.
  6. Repeat: Iterate between surrogate updates and targeted sampling until convergence or computational budget is exhausted.
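A compact sketch of this loop, assuming scikit-learn for the Gaussian Process surrogate and a toy two-parameter stand-in for the expensive simulator. The critical-region location and the lower-confidence-bound acquisition rule are illustrative choices:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def simulate_kpi(x):
    """Stand-in for an expensive simulation returning a min-distance KPI.
    The critical region (low KPI) hides around (0.7, 0.3)."""
    return float(np.linalg.norm(x - np.array([0.7, 0.3])))

# 1. Coarse sampling across the (normalized) logical scenario space.
X = rng.uniform(0, 1, size=(20, 2))
# 2. KPI evaluation via simulation.
y = np.array([simulate_kpi(x) for x in X])

budget = 60
while len(X) < budget:
    # 3./5. (Re)train the surrogate on everything evaluated so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    # 4. Region refinement: score many cheap candidates and pick the one
    #    the surrogate predicts is most critical (lowest KPI), with an
    #    uncertainty bonus to keep exploring (lower confidence bound).
    cand = rng.uniform(0, 1, size=(500, 2))
    mu, sigma = gp.predict(cand, return_std=True)
    x_next = cand[np.argmin(mu - 1.0 * sigma)]
    # 6. Run the expensive simulation only for the chosen candidate.
    X = np.vstack([X, x_next])
    y = np.append(y, simulate_kpi(x_next))

print("Most critical scenario found:", X[np.argmin(y)], "KPI:", y.min())
```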

Surrogate models are the secret weapon of production SBT: instead of running expensive simulations for every candidate scenario, the search algorithm queries a fast approximation to eliminate obviously uninteresting regions. Only the most promising candidates get full simulation treatment, reducing evaluations potentially by multiple orders of magnitude. That said, building a good surrogate is easier said than done — getting the training set right is half the battle.

Illustration of SBT vs Full Grid Sampling:

[Figure: full grid sampling (240 evaluations) vs SBT (~15 targeted evaluations) converging on the critical event region.]

Recap: The SBT Approach

  • The Problem: 11-dimensional intersection scenario = 10¹¹ test cases ≈ 31.7 years of simulation time
  • The Solution: Search-Based Testing focuses evaluation on critical regions rather than exhaustive enumeration
  • Key Enabler: Surrogate models + optimization algorithms guide the search toward scenarios that matter
  • What SBT Solves: Efficient sampling within a single logical scenario — not scenario generation, prioritization, or ODD coverage
  • Business Value: Faster iteration cycles, ISO 21448/UL 4600 compliance-ready, measurable resource optimization

The Critical Role of KPI Selection

The KPI is the compass guiding the search. For the simple intersection scenario above, minimum bounding box distance works well, but real-world scenarios are rarely this straightforward—different scenario types demand different metrics:

  • KPI Sensitivity: Time-to-collision (TTC) vs. distance-to-collision yield fundamentally different critical scenario sets—TTC prioritizes velocity mismatches, while distance emphasizes spatial proximity.
  • Multi-Objective KPIs: Combine multiple safety concerns (e.g., collision risk + passenger comfort + regulatory compliance) into a weighted composite metric, enabling the search to balance competing objectives.
  • Domain Expertise Required: Each logical scenario type (lane change, merge, pedestrian crossing) needs carefully designed KPIs that capture the relevant safety properties.
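As a sketch of a multi-objective composite, the function below blends three sub-metrics into one scalar. The weights, normalization constants, and sub-metric choices are illustrative assumptions that a real project would calibrate per scenario type:

```python
def composite_kpi(min_distance_m, min_ttc_s, max_jerk_ms3,
                  w_dist=0.5, w_ttc=0.4, w_comfort=0.1):
    """Lower = more critical. Each term is normalized to roughly [0, 1]."""
    dist_term = min(min_distance_m / 10.0, 1.0)   # clearance, saturates at 10 m
    ttc_term = min(min_ttc_s / 5.0, 1.0)          # TTC, saturates at 5 s
    comfort_term = 1.0 - min(max_jerk_ms3 / 10.0, 1.0)  # high jerk is bad
    return w_dist * dist_term + w_ttc * ttc_term + w_comfort * comfort_term

print(composite_kpi(min_distance_m=1.2, min_ttc_s=0.8, max_jerk_ms3=7.0))
```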

KPI Design Principles: What Works and What Fails

Designing KPIs is where theory meets reality. The textbooks won't tell you this, but the most common failure mode isn't mathematical—it's choosing metrics that optimize for something you don't actually care about.

Monotonicity and Gradients:

  • Poor Choice: Binary crash/no-crash flag (discontinuous). The search algorithm has no gradient to follow near the safety boundary—it's blindly guessing.
  • Better Choice: Minimum distance during trajectory (continuous). Provides smooth gradient that guides the search toward critical boundaries, even from safe initial conditions.
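A toy one-parameter example makes the difference tangible. The closing-gap model below is a deliberate oversimplification, but it shows why the continuous KPI gives the search a direction while the binary flag does not:

```python
def min_distance(initial_gap_m):
    """Continuous KPI: smallest separation reached (toy model)."""
    return max(initial_gap_m - 8.0, 0.0)  # gaps under 8 m lead to contact

def crash_flag(initial_gap_m):
    """Binary KPI: 1 = crash, 0 = safe. Flat almost everywhere."""
    return 1 if min_distance(initial_gap_m) == 0.0 else 0

for gap in (20, 15, 10, 9, 8):
    print(gap, min_distance(gap), crash_flag(gap))
# min_distance shrinks smoothly toward the boundary, so the search can
# follow it; crash_flag stays 0 until the boundary and offers no direction.
```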

Other Critical Considerations:

Beyond continuity, effective KPIs must be resistant to exploitation—a metric that only measures time-to-collision can be gamed by scenarios where vehicles never enter the intersection. Teams often don't discover this until weeks into a test campaign, when they realize their "critical scenarios" are just clever ways to avoid the intersection entirely.

KPIs should also capture temporal dynamics over entire trajectories, not just static spatial snapshots. Where possible, leverage formal safety frameworks like RSS (Responsibility-Sensitive Safety) to provide regulation-aligned, interpretable metrics.
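Since RSS is mentioned above, here is a small sketch of its safe longitudinal distance rule, which can serve as a regulation-aligned KPI term. The reaction time and acceleration bounds below are illustrative values, not calibrated constants:

```python
def rss_safe_longitudinal_gap(v_rear, v_front, rho=0.5,
                              a_accel_max=3.0, a_brake_min=4.0,
                              a_brake_max=8.0):
    """Minimum gap [m] the rear vehicle must keep (Shalev-Shwartz et al.),
    clamped at zero. rho is the reaction time [s]."""
    v_rho = v_rear + rho * a_accel_max  # rear speed after the reaction time
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + v_rho ** 2 / (2 * a_brake_min)
         - v_front ** 2 / (2 * a_brake_max))
    return max(d, 0.0)

# KPI idea: actual_gap - rss_safe_longitudinal_gap(...); negative = violation.
print(rss_safe_longitudinal_gap(v_rear=20.0, v_front=15.0))
```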

Scenario Type       | Weak KPI                  | Strong KPI
--------------------|---------------------------|--------------------------------------
Intersection        | Binary crash flag         | Min bounding-box distance + TTC
Lane Change         | Lateral offset only       | Time-integrated TTC + jerk comfort
Pedestrian Crossing | Final separation distance | Max deceleration + stopping margin
Highway Merge       | Closest approach          | RSS gap compliance + merge completion

Interactive Demonstration

In the following simplified intersection with two vehicles, you can play with initial velocities using grid resolutions from 20–80 steps. The simulation uses realistic bounding box collision detection, where a crash occurs when the vehicle bounding boxes actually touch or overlap. Two strategies are compared:

  • Full grid sweep (exhaustive brute-force sampling)
  • SBT refinement (adaptive sampling guided by the KPI)

Interactive Lab: Brute Force vs SBT

[Interactive demo: replayable simulation over the velocity grid, with live counters for simulations run, critical scenarios found, and their ratio; each cell is classified Safe, Near Miss, or Crash.]

Implementation Note: This demo uses adaptive binary refinement in a 2D space. Production implementations leverage Bayesian optimization, surrogate models, and genetic algorithms to achieve orders of magnitude better performance in high-dimensional spaces. Success depends on choosing appropriate KPIs and search strategies. The key insight: efficiency gains compound exponentially with dimensionality.
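For reference, a hedged sketch of the kind of adaptive binary refinement the demo uses: seed a coarse grid, then subdivide only the cells whose corners straddle the safe/critical boundary. The toy `is_critical` oracle and the grid bounds are assumptions standing in for the simulator:

```python
def is_critical(v_ego, v_npc):
    """Toy oracle standing in for a full simulation: a small crash
    pocket around matched arrival speeds."""
    return abs(v_ego - v_npc) < 1.0 and 8.0 < v_ego < 12.0

def refine(x0, x1, y0, y1, depth, evals):
    """Evaluate cell corners; subdivide only if they disagree."""
    corners = [(x0, y0), (x0, y1), (x1, y0), (x1, y1)]
    for c in corners:
        if c not in evals:
            evals[c] = is_critical(*c)  # one "simulation" per new corner
    if depth > 0 and len({evals[c] for c in corners}) > 1:
        xm, ym = (x0 + x1) / 2, (y0 + y1) / 2
        for xa, xb in ((x0, xm), (xm, x1)):
            for ya, yb in ((y0, ym), (ym, y1)):
                refine(xa, xb, ya, yb, depth - 1, evals)

# Seed with a coarse 8x8 grid over both initial velocities [0, 20] m/s,
# then refine up to three levels deep near the boundary.
n = 8
xs = [20.0 * i / n for i in range(n + 1)]
evals = {}
for i in range(n):
    for j in range(n):
        refine(xs[i], xs[i + 1], xs[j], xs[j + 1], depth=3, evals=evals)

print(len(evals), "evaluations vs", (n * 2**3 + 1) ** 2, "for the full grid")
```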

Understanding SBT Trade-offs

While SBT dramatically reduces computational cost, it's important to understand its limitations:

Core Limitation: SBT identifies local optima in the logical scenario space. Multiple runs with different initial conditions or alternative search strategies may be required to cover the full range of interesting behaviors.

  • False Negatives: SBT may miss critical scenarios in unexplored regions of the parameter space. The search converges on local optima, potentially overlooking isolated critical events elsewhere.
  • Multiple Search Runs: To improve coverage, practitioners often run multiple searches with different initial sampling strategies or random seeds.
  • KPI Dependency: Search effectiveness depends heavily on choosing appropriate KPIs. A poorly chosen KPI can mislead the search away from truly critical scenarios.
  • Computational Overhead: The search algorithm itself introduces overhead—evaluating gradients, managing the search state, and deciding where to sample next.

The hardest part? Knowing when you've sampled enough. There's no universal stopping criterion—it's part science, part judgment, part organizational risk tolerance.
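One common mitigation for the local-optima problem is simply to restart the search several times and pool the findings. A minimal sketch, where `run_sbt` is a dummy stand-in for one full search (such as the surrogate loop sketched earlier):

```python
import random

def run_sbt(seed):
    """Dummy stand-in for one full SBT run; each seed explores from
    different initial samples and may land in a different local optimum."""
    rng = random.Random(seed)
    scenario = {"ego.v0": rng.uniform(0, 15), "npc1.v0": rng.uniform(0, 15)}
    return {"scenario": scenario, "kpi": rng.uniform(0, 5)}  # dummy KPI

def multi_restart(seeds=range(5)):
    """Pool findings from independent restarts. Keeping every distinct
    local optimum matters: separate basins often correspond to
    different failure modes, not duplicates."""
    return sorted((run_sbt(s) for s in seeds), key=lambda f: f["kpi"])

for finding in multi_restart():
    print(f"KPI {finding['kpi']:.2f}  <- {finding['scenario']}")
```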

Regulatory Context

ISO 21448 (SOTIF) and UL 4600 require demonstrating thorough scenario space exploration. SBT's KPI-driven methodology provides auditable coverage metrics and traceability that support safety case arguments—answering the regulator question: "How do you know you tested the right things?"

Operational Integration

SBT isn't a replacement for existing simulation platforms—it's an orchestration layer that sits on top of your existing validation infrastructure, intelligently selecting which scenarios to test.

Role of SBT in the Validation Pipeline

SBT reduces the computational effort within a single logical scenario, but this is only one component of the AV validation pipeline. Hundreds of millions of scenarios across the ODD still need to be generated, prioritized, and tested. Subsequent articles in this series will cover those stages and expand the V&V toolchain.

References

  1. ISO 21448:2022 — Road vehicles — Safety of the intended functionality (SOTIF).
  2. Operational Design Domain (Wikipedia). Link
  3. Scenario abstraction levels (functional → logical → concrete). Link
  4. PEGASUS project: Scenario-based testing methodology. Link
Kaveh Rahnema

V&V Expert for ADAS & Autonomous Driving with 7+ years at Robert Bosch GmbH.

Connect on LinkedIn