NVIDIA's 8B Orchestrator Model Surpasses GPT-5 in Efficiency and Cost

NVIDIA Research has introduced Orchestrator, an 8-billion-parameter (8B) model designed to manage and optimize the use of various AI tools, demonstrating enhanced accuracy and cost-effectiveness compared to larger models. The model achieved a 37.1% score on the Humanity's Last Exam (HLE) benchmark, outperforming GPT-5's 35.1%, while reducing operational costs by approximately 70%.
Orchestrator also showed superior performance on the τ2-Bench and FRAMES benchmarks while maintaining lower costs. The research indicates that smaller, fine-tuned models can effectively direct larger models and tools, yielding more efficient and adaptable AI systems.
Orchestrator's Approach to Tool Coordination
Traditional approaches to AI problem-solving often rely on a single, powerful large language model (LLM) to perform all tasks, including calling basic tools like search and code interpreters. This can lead to high computational costs and challenges in achieving simultaneous accuracy, affordability, and control, particularly in complex reasoning tasks such as those found in HLE. Attempts to use LLMs as "schedulers" to assign tasks have frequently resulted in the scheduler defaulting back to the most powerful model for a majority of requests.
Orchestrator addresses this by decoupling intelligence from a single model. Instead, it functions as a lightweight scheduling hub that coordinates a diverse set of specialized tools and models. This composite system aims to optimize resource allocation and task execution.
Training and Architecture
NVIDIA's Orchestrator model is trained using reinforcement learning (RL) to prioritize locally deployed models based on user preferences. The RL reward function incorporates three main components: the correctness of the answer, operational efficiency (cost and latency), and alignment with user-defined tool preferences. This multi-objective optimization allows Orchestrator to balance performance, cost, and user alignment.
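A minimal sketch of such a multi-objective reward, combining the three components described: answer correctness, operational efficiency, and alignment with user tool preferences. The weights, function name, and normalization here are illustrative assumptions, not NVIDIA's actual formulation.

```python
def orchestrator_reward(correct: bool, cost: float, latency: float,
                        preferred_tool_ratio: float,
                        w_acc: float = 1.0, w_eff: float = 0.3,
                        w_pref: float = 0.2) -> float:
    """Combine correctness, efficiency, and user-preference terms
    into a single scalar reward (weights are hypothetical)."""
    accuracy_term = 1.0 if correct else 0.0
    # Efficiency penalty grows with (normalized) cost and latency.
    efficiency_term = -(cost + latency)
    # Fraction of tool calls matching user-preferred tools, in [0, 1].
    preference_term = preferred_tool_ratio
    return (w_acc * accuracy_term
            + w_eff * efficiency_term
            + w_pref * preference_term)
```

Tuning the relative weights is what lets the policy trade accuracy against cost, latency, and preference adherence.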
The model employs a multi-turn execution mechanism, utilizing Chain-of-Thought (CoT) to analyze current states and plan subsequent tool calls. It then executes these calls within an environment (e.g., mathematical derivation, code execution) and processes the results in an iterative loop.
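The loop described above can be sketched as follows; the class and dictionary shapes (`plan`, `final_answer`, the tool registry) are hypothetical placeholders for whatever interface the actual system uses.

```python
def run_episode(orchestrator, tools: dict, task: str, max_turns: int = 8):
    """Iteratively reason about the state, call a tool, and fold the
    result back in, until the orchestrator emits a final answer."""
    state = {"task": task, "history": []}
    for _ in range(max_turns):
        # CoT step: the model analyzes the current state and plans an action,
        # e.g. {"tool": "code_exec", "args": {...}}.
        plan = orchestrator.plan(state)
        if plan["tool"] == "final_answer":
            return plan["args"]["answer"]
        # Execute the chosen tool in its environment
        # (mathematical derivation, code execution, search, ...).
        result = tools[plan["tool"]](**plan["args"])
        # Append the observation so the next planning turn can use it.
        state["history"].append((plan, result))
    return None  # turn budget exhausted without a final answer
```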
To support this RL training, the research team developed ToolScale, a large-scale, verifiable multi-turn tool-calling synthetic dataset. ToolScale automatically generates simulated environments across 10 domains, including finance, healthcare, and aviation, and creates 430,000 tasks with manually labeled optimal tool-calling trajectories. Each task undergoes triple verification to ensure execution correctness, process fidelity, and operational completeness.
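The triple-verification gate might look like the following conjunction of independent checks; the predicate names are hypothetical stand-ins for the paper's actual verifiers.

```python
def triple_verify(trajectory, checks: dict) -> bool:
    """Accept a synthetic trajectory only if all three checks pass:
    execution correctness, process fidelity, operational completeness."""
    return (checks["execution_correct"](trajectory)       # tool calls run and return expected results
            and checks["process_faithful"](trajectory)    # steps match the labeled optimal trajectory
            and checks["operations_complete"](trajectory))  # no required operation is missing
```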
Performance and Cost Efficiency
On the HLE benchmark, Orchestrator achieved 37.1% accuracy at a cost of 9.2 cents per query, roughly 30% of GPT-5's cost. On τ2-Bench, a function-calling benchmark, it achieved 80.2% correctness, with only about 40% of steps requiring calls to GPT-5. On the factual-reasoning benchmark FRAMES, it scored 76.3% while cutting latency to 8.2 minutes, or 41% of GPT-5's.
The model's efficiency stems from its strategic division of labor. Orchestrator calls low-cost tools, such as local retrieval, Math-7B, and Qwen-32B, as needed, reserving GPT-5 for critical steps and averaging just 1.95 GPT-5 calls per question. In contrast, when GPT-5 itself acted as the scheduler, a problem typically required 5.23 calls to GPT-5-mini. This selective use of resources is key to Orchestrator's cost reduction.
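This division of labor amounts to a cost-aware routing policy: prefer the cheapest tool able to handle a step, escalating to GPT-5 only when nothing cheaper suffices. The sketch below illustrates the idea; the capability scores and per-call prices are invented for the example.

```python
# Hypothetical tool table: capability scores and prices are illustrative.
TOOLS = [
    {"name": "local_retrieval", "capability": 1, "cost_cents": 0.01},
    {"name": "Math-7B",         "capability": 2, "cost_cents": 0.05},
    {"name": "Qwen-32B",        "capability": 3, "cost_cents": 0.40},
    {"name": "GPT-5",           "capability": 5, "cost_cents": 3.00},
]

def route(step_difficulty: int) -> str:
    """Return the cheapest tool whose capability covers the step;
    only the hardest steps fall through to GPT-5."""
    able = [t for t in TOOLS if t["capability"] >= step_difficulty]
    return min(able, key=lambda t: t["cost_cents"])["name"]
```

In Orchestrator's case this policy is learned via RL rather than hand-coded, but the effect is the same: most steps resolve on cheap local tools, and only critical ones incur a GPT-5 call.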
Orchestrator also demonstrates strong generalization capabilities, maintaining stable performance when interacting with unseen models (e.g., Gemma-3-27B, Codestral-22B) or new pricing strategies. This suggests that the model learns abstract strategies for tool capabilities and cost-benefit trade-offs rather than overfitting to specific configurations. It also shows an ability to satisfy user preferences, indicating customizable and interpretable tool scheduling.
Towards Composite AI Systems
The development of Orchestrator represents a step toward composite AI systems that integrate multiple models and tools. This paradigm offers potential advantages in safety, speed, and cost efficiency compared to monolithic LLM architectures. By separating decision-making from execution, Orchestrator points to a new pathway for building practical AGI systems that are efficient, controllable, and scalable, and suggests that small language models could become central to scalable agent AI.