NVIDIA Unveils Nemotron 3 Open-Source Model Family, Featuring Hybrid Mamba-Transformer Architecture

Victor Zhang

NVIDIA announced the release of its Nemotron 3 family of open-source models, with the Nano version immediately available and Super and Ultra variants planned for the first half of 2026. This move signals NVIDIA's expansion beyond its traditional role as a hardware provider into the development of large language models.

For years, NVIDIA has been a dominant supplier of AI hardware, often described as "selling shovels" to companies developing AI models. The introduction of Nemotron 3, however, marks a strategic shift: the family combines the Mamba state-space architecture with Transformer attention in a hybrid design, adds Mixture of Experts (MoE) routing, and supports a 1-million-token context window.

Nemotron 3 Architecture and Family Overview

The Nemotron 3 series is designed to cover a range of applications, from edge devices to cloud supercomputers. The family includes:

  • Nemotron 3 Nano (released): This version has 30 billion total parameters, of which approximately 3 billion are active during inference. It is optimized for efficient inference and edge computing, and can run on consumer-grade GPUs and high-end laptops. Nano uses the hybrid Mamba-Transformer MoE architecture described below, which NVIDIA states delivers four times the throughput of Nemotron 2 Nano, and is designed for agent tasks requiring rapid responses.

  • Nemotron 3 Super (expected H1 2026): With an estimated 100 billion parameters (10 billion active), this model targets enterprise applications and multi-agent collaboration, balancing performance and cost. It is expected to utilize advanced Latent MoE technology.

  • Nemotron 3 Ultra (expected H1 2026): The flagship model, projected to have 500 billion parameters (50 billion active), is aimed at complex inference, scientific research, and deep planning tasks. NVIDIA intends for Ultra to compete with closed-source models at the GPT-5 level.

Hybrid Mamba-Transformer Design

Nemotron 3 integrates Mamba (a state-space model), Transformer attention, and MoE technologies. The Mamba architecture addresses the computational and memory demands of long input sequences, which are a bottleneck for traditional Transformer models: whereas attention compares every token with every other, Mamba compresses the sequence into a fixed-size memory state, giving complexity that is linear in sequence length and correspondingly faster inference.
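To make the scaling difference concrete, the sketch below (plain NumPy, purely illustrative; the real Mamba layer uses input-dependent parameters and a hardware-efficient parallel scan) contrasts a fixed-size recurrent state update with standard self-attention: the recurrence makes one pass over the sequence, while attention compares every token with every other.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal state-space recurrence. The state h has a FIXED size,
    so time and memory grow linearly with sequence length, O(T)."""
    h = np.zeros(A.shape[0])
    outputs = []
    for x_t in x:                      # single pass over the sequence
        h = A @ h + B * x_t            # history compressed into h
        outputs.append(C @ h)          # emit from the compressed state
    return np.array(outputs)

def self_attention(x):
    """Toy single-head attention. Every token attends to every other
    token, so cost and memory grow quadratically, O(T^2)."""
    scores = x @ x.T / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x
```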

However, Mamba has limitations in complex logical reasoning or precise recall tasks. NVIDIA's solution is a hybrid approach:

  • Mamba layers: Handle extensive context information and long-term memory, optimizing VRAM usage.

  • Transformer layers: Inserted at key points in the stack to handle high-level logical reasoning and precise recall.

This hybrid design allows Nemotron 3 Nano to achieve a 1-million-token context window while delivering inference four times faster than a pure Transformer model of comparable size.
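Conceptually, the resulting stack is mostly Mamba layers with attention layers interleaved at intervals. The layout below is a hypothetical sketch; NVIDIA has not published the exact ratio or placement here, so the 1-in-6 pattern is an assumption for illustration only.

```python
def build_hybrid_stack(n_layers=24, attention_every=6):
    """Hypothetical hybrid layout: cheap Mamba layers carry most of the
    long-context mixing, with a Transformer attention layer inserted
    periodically for precise recall and reasoning. The 1-in-6 ratio is
    illustrative, not NVIDIA's published configuration."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(build_hybrid_stack())
# ['mamba', 'mamba', 'mamba', 'mamba', 'mamba', 'attention', 'mamba', ...]
```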

The MoE architecture divides Nemotron 3 Nano's 30 billion parameters across 128 "experts" and activates only a small subset of them for each input token, in contrast to traditional dense models, which engage every parameter on every forward pass.
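The sketch below shows the routing idea in the abstract. The 128-expert count comes from the article; the gating scheme, top-k value, and toy linear experts are illustrative assumptions, not Nemotron's actual router.

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Toy Mixture-of-Experts step for a single token: score all
    experts, run only the top-k, and mix their outputs. Active
    parameters are thus a small fraction of total parameters
    (roughly 3B active of 30B total in Nemotron 3 Nano)."""
    logits = gate_w @ x                       # one router score per expert
    top_k = np.argsort(logits)[-k:]           # indices of the k best experts
    weights = np.exp(logits[top_k] - logits[top_k].max())
    weights /= weights.sum()                  # softmax over chosen experts
    return sum(w * experts[i](x) for w, i in zip(weights, top_k))

rng = np.random.default_rng(0)
d = 16
# 128 tiny "experts" (the article's expert count); each is just a
# random linear map in this sketch.
experts = [lambda x, W=rng.normal(size=(d, d)): W @ x for _ in range(128)]
gate_w = rng.normal(size=(128, d))
y = moe_forward(rng.normal(size=d), gate_w, experts)
```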

NVFP4 and Hardware Integration

NVIDIA's Nemotron 3 Super and Ultra models will use the NVFP4 format for training and inference, a 4-bit floating-point format natively supported by NVIDIA's Blackwell GPU architecture. NVFP4 reduces model size by roughly 3.5x compared to FP16 or BF16 while maintaining precision through a two-level scaling technique. This suggests that future 500-billion-parameter models could run within the VRAM budget of today's 100-billion-parameter models, but it ties that efficiency to Blackwell graphics cards.
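A back-of-the-envelope check shows where a ~3.5x figure can come from: 4-bit weights alone would be 4x smaller than 16-bit, and the per-block scale factors of a two-level scheme claw back part of that saving. The block size and scale width below are assumptions for illustration, not NVIDIA's published NVFP4 specification.

```python
def weight_memory_gb(n_params, bits_per_weight, block_size=None, scale_bits=0):
    """Approximate weight-storage footprint in GB. If block_size is
    set, one scale of scale_bits is stored per block; the second,
    per-tensor scale of a two-level scheme is negligible and ignored."""
    bits = n_params * bits_per_weight
    if block_size:
        bits += (n_params / block_size) * scale_bits
    return bits / 8 / 1e9

n = 500e9  # projected Nemotron 3 Ultra total parameter count
print(f"BF16:  {weight_memory_gb(n, 16):.0f} GB")        # ~1000 GB
# Assumption: 4-bit weights plus one 8-bit scale per 16-element block.
print(f"NVFP4: {weight_memory_gb(n, 4, 16, 8):.0f} GB")  # ~281 GB, ~3.5x smaller
```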

NVIDIA has also launched the "NeMo Gym" reinforcement learning laboratory and open-sourced its training data to provide developers with tools for building AI agents.

Strategic Implications

NVIDIA's entry into open-source model development with Nemotron 3 is viewed as a strategy to define future AI standards. By promoting the Mamba architecture, which benefits from NVIDIA's hardware optimization expertise, the company aims to encourage developers to build ecosystems around its GPUs. The NVFP4 format further reinforces this by tying advanced model performance to NVIDIA's Blackwell hardware.

This strategy aims to create a "closed-loop open ecosystem": model weights are open-sourced, but optimal performance is achieved within NVIDIA's full stack, including its Blackwell GPUs, NVLink, NVFP4, CUDA, NeMo, TensorRT, and NIM (NVIDIA Inference Microservices).

According to information reviewed by toolmesh.ai, Nemotron 3 Nano (30B-A3B) currently ranks 120th on the text leaderboard with a score of 1328 and 47th among open-source models.