Google Gemini Deep Research Agent: New Benchmarks & API

Google has launched its new Gemini Deep Research Agent, along with two companion releases: the open-source DeepSearchQA benchmark and the Interactions API. This release follows the introduction of OpenAI's GPT-5.2.

Lukas Haas, a product manager at Google DeepMind, stated on social media that the Gemini Deep Research Agent scored 46.4% on Humanity's Last Exam (HLE) and 66.1% on Google's new DeepSearchQA benchmark. Haas also noted that its performance on BrowseComp was comparable to GPT-5 Pro, but at a significantly lower cost.

Deep Research Capabilities

The Gemini Deep Research Agent is designed for long-running context gathering and synthesis tasks. It is built on the Gemini 3 Pro model, which has been trained to minimize hallucinations and improve report quality for complex assignments. The agent uses multi-step reinforcement learning to navigate information environments.

In testing, Gemini Deep Research scored 46.4% on the Humanity's Last Exam (HLE) test set, 66.1% on DeepSearchQA, and 59.2% on BrowseComp. The system employs an iterative research planning mechanism: it formulates queries, analyzes the results, identifies knowledge gaps, and conducts further searches. It also features enhanced web search capabilities for detailed data extraction and is optimized for generating research reports efficiently. Unlike traditional chatbots, Deep Research functions as a long-running system for complex, non-real-time tasks.
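The iterative loop described above can be illustrated with a minimal Python sketch. It is not Google's implementation: the function research_loop, the toy corpus, and the toy_search stub are all hypothetical names introduced here, and a real agent would call a web search tool and a language model instead of a dictionary lookup.

```python
from typing import Callable, Dict, Set


def research_loop(
    topics: Set[str],
    search: Callable[[str], Dict[str, str]],
    max_rounds: int = 5,
) -> Dict[str, str]:
    """Query every open topic, record findings, then recompute the gaps."""
    findings: Dict[str, str] = {}
    gaps = set(topics)
    for _ in range(max_rounds):
        if not gaps:
            break  # nothing left to research
        for topic in sorted(gaps):
            # One search may answer several topics at once.
            findings.update(search(topic))
        # Knowledge gaps are the topics that still have no finding.
        gaps = {t for t in topics if t not in findings}
    return findings


# Hypothetical toy corpus standing in for the web (assumption, not real data).
CORPUS = {
    "battery cost": {"battery cost": "falling", "supply chain": "concentrated"},
    "policy": {"policy": "subsidies expanding"},
}


def toy_search(query: str) -> Dict[str, str]:
    return CORPUS.get(query, {})


report = research_loop({"battery cost", "supply chain", "policy"}, toy_search)
print(report)
```

The max_rounds cap is a design choice in this sketch: it keeps the loop from running forever when a gap can never be filled.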

The agent's workflow simulates human expert cognitive behavior, involving planning, execution, reasoning, and reporting. For broad instructions, it uses a "step-back prompting" technique to break down problems into sub-dimensional research paths, such as technological maturity, supply chain issues, policy environments, and competitor analysis. This planning process is dynamic, allowing the system to modify its research plan and add new branches for exploration if new concepts emerge during initial searches.
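As a rough illustration of dynamic planning, the sketch below treats the research plan as a queue of paths that grows whenever exploring one path surfaces a new concept. The function explore_plan, the discover callback, and the example concept "rare-earth export controls" are hypothetical and chosen only for illustration; the source does not describe how the agent represents its plan internally.

```python
from collections import deque
from typing import Callable, Iterable, List


def explore_plan(
    initial_paths: Iterable[str],
    discover: Callable[[str], List[str]],
    max_paths: int = 20,
) -> List[str]:
    """Visit research paths breadth-first, adding branches as concepts emerge."""
    queue = deque(initial_paths)
    seen = set(queue)
    visited: List[str] = []
    while queue and len(visited) < max_paths:
        path = queue.popleft()
        visited.append(path)
        for concept in discover(path):
            if concept not in seen:  # only branch on concepts not yet planned
                seen.add(concept)
                queue.append(concept)
    return visited


# The four sub-dimensions named in the text form the initial plan.
plan = [
    "technological maturity",
    "supply chain issues",
    "policy environments",
    "competitor analysis",
]

# Hypothetical discovery: one path reveals a concept that was not in the plan.
NEW_CONCEPTS = {"supply chain issues": ["rare-earth export controls"]}

order = explore_plan(plan, lambda p: NEW_CONCEPTS.get(p, []))
print(order)
```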

DeepSearchQA Benchmark

Google developed DeepSearchQA as a benchmark to evaluate the performance of deep research agents in multi-step information retrieval tasks. It contains 900 manually designed causal chain tasks across 17 domains, where each step depends on prior analysis. DeepSearchQA assesses research completeness by requiring agents to generate comprehensive answer sets, and it also measures research precision and information recall.
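Precision and recall over an answer set can be computed as in the following sketch. The source does not give DeepSearchQA's exact scoring formula, so the set-based definitions, the F1 combination, and the name score_answer_set are assumptions made for illustration, as are the placeholder answers.

```python
from typing import Iterable, Tuple


def score_answer_set(
    predicted: Iterable[str], expected: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) for two sets of answers."""
    # Normalize case and whitespace so trivially different strings match.
    pred = {a.strip().lower() for a in predicted}
    gold = {a.strip().lower() for a in expected}
    hits = len(pred & gold)
    precision = hits / len(pred) if pred else 0.0  # share of answers that are right
    recall = hits / len(gold) if gold else 0.0     # share of right answers found
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder answers: two of four predictions are correct, one is missed.
precision, recall, f1 = score_answer_set(["A", "B", "C", "D"], ["a", "b", "e"])
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

An agent that returns only one correct answer would score perfect precision but low recall, which is why a completeness-oriented benchmark needs both measures.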

Google's internal evaluations showed that agent performance improved significantly when the agent was allowed more search and reasoning steps. The benchmark therefore also serves as a diagnostic tool for measuring how much additional thinking time improves results.

Interactions API for Agent Development

The Interactions API provides interfaces for agent application development, managing complex context for interleaved messages, chain-of-thought processes, tool calls, and their state information. It includes a built-in Gemini Deep Research Agent, with plans to expand built-in agents and allow developers to integrate custom agents.

The API offers a single RESTful endpoint for interacting with models and agents. It extends the core functionality of generateContent with features such as optional server-side state for history management, an explainable and composable data model for agent history, background execution for long-running inference loops, and remote Model Context Protocol (MCP) tool support.

The Interactions API aims to shift AI application development from a "stateless request-response" model to a "stateful agent interaction" model. It introduces server-side state management, allowing Google's servers to maintain context, tool call results, and the agent's internal thought state for a session. This enables developers to directly call Google's built-in agents, such as the Deep Research agent, and integrate them into enterprise software.
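The difference between the two models can be shown with a toy in-memory server. This sketch does not reproduce the real Interactions API: the class ToyInteractionServer, the create method, the previous_id parameter, and the response fields are hypothetical stand-ins that only demonstrate the idea of the server, rather than the client, holding conversation history.

```python
import uuid
from typing import Dict, List, Optional


class ToyInteractionServer:
    """Hypothetical server that stores each interaction's history by ID."""

    def __init__(self) -> None:
        self._history: Dict[str, List[str]] = {}

    def create(self, user_input: str, previous_id: Optional[str] = None) -> dict:
        # Stateful path: look up earlier turns by ID instead of having
        # the client resend the full transcript.
        prior = self._history.get(previous_id, []) if previous_id else []
        turns = prior + [user_input]
        new_id = uuid.uuid4().hex
        self._history[new_id] = turns
        return {"id": new_id, "context_turns": len(turns)}


server = ToyInteractionServer()

# First call: no prior state, so the server sees one turn.
first = server.create("Survey the solid-state battery market.")

# Follow-up: the client sends only the new message plus the previous ID.
second = server.create("Focus on the supply chain.", previous_id=first["id"])

# Stateless call for comparison: without an ID the context starts over.
fresh = server.create("Focus on the supply chain.")

print(first["context_turns"], second["context_turns"], fresh["context_turns"])
```

In the stateless model the client would have to attach every earlier message to each request; here the follow-up request carries one message and one identifier.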

DeepMind's UK Government Partnerships

Google DeepMind is collaborating with the UK government on AI governance initiatives, leveraging Deep Research and its underlying technologies. This partnership addresses public administration challenges, including the UK's housing crisis and planning inefficiencies.

Project Extract for Urban Planning

The UK's urban planning system faces challenges due to fragmented data, with many historical archives existing in paper or scanned formats. DeepMind partnered with the UK government's AI incubator (i.AI) to develop the Extract tool. This geospatial intelligence system, based on Gemini's multimodal reasoning, aims to digitize and process planning documents.

Extract uses Gemini's vision-language capabilities to interpret low-quality scanned documents, recognizing printed text and handwritten annotations and identifying dates with 94% accuracy. It can also interpret visual symbols on maps, distinguishing features such as property boundaries from drainage ditches. The system uses computer vision tools such as OpenCV and the Segment Anything Model (SAM) to extract geographical polygons from raster images, achieving an intersection over union (IoU) of 90% against reference shapes.
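For readers unfamiliar with the metric, IoU is the area shared by two shapes divided by the area covered by either. The sketch below computes it on pixel sets in plain Python. The helper names rect and iou and the example shapes are illustrative assumptions, not part of Extract.

```python
from typing import Set, Tuple

Pixel = Tuple[int, int]


def rect(top: int, left: int, height: int, width: int) -> Set[Pixel]:
    """Return the set of (row, col) pixels covered by a rectangle."""
    return {
        (r, c)
        for r in range(top, top + height)
        for c in range(left, left + width)
    }


def iou(a: Set[Pixel], b: Set[Pixel]) -> float:
    """Intersection over union of two pixel sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


reference = rect(0, 0, 10, 10)  # 100 pixels: the hand-drawn ground truth
extracted = rect(0, 0, 10, 9)   # 90 pixels: the extraction misses one column

print(iou(extracted, reference))  # 90 shared pixels / 100 total pixels
```

An IoU of 1.0 means the extracted polygon matches the reference exactly, and 0.0 means the two do not overlap at all.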

The tool employs the LoFTR algorithm to align historical maps with modern satellite maps, finding common feature points and calculating transformation matrices to accurately map historical data to current digital coordinates. This automation reduces the processing time for complex planning documents from an average of two hours to between 40 seconds and three minutes. Extract is currently being piloted in four areas, including Westminster and Hillingdon, with a planned nationwide rollout by spring 2026. This initiative aims to create a unified digital planning database to support the UK government's goal of building 1.5 million new homes.
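To make the transformation step concrete, the sketch below fits a 2D affine transform to three matched point pairs and uses it to map a further point. This is a simplified assumption: the source does not say which transform model Extract uses, and a real pipeline would typically fit many LoFTR matches with a robust estimator rather than solve exactly from three. The function names and coordinates are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def det3(m: Sequence[Sequence[float]]) -> float:
    """Determinant of a 3x3 matrix."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def solve3(m: List[List[float]], rhs: List[float]) -> List[float]:
    """Solve a 3x3 linear system with Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("source points are collinear; transform is undefined")
    solution = []
    for col in range(3):
        replaced = [row[:] for row in m]
        for r in range(3):
            replaced[r][col] = rhs[r]
        solution.append(det3(replaced) / d)
    return solution


def fit_affine(src: List[Point], dst: List[Point]) -> List[List[float]]:
    """Return [[a, b, tx], [c, d, ty]] mapping each src point onto its dst point."""
    m = [[x, y, 1.0] for x, y in src]
    row_x = solve3(m, [p[0] for p in dst])
    row_y = solve3(m, [p[1] for p in dst])
    return [row_x, row_y]


def apply_affine(t: List[List[float]], point: Point) -> Point:
    x, y = point
    return (
        t[0][0] * x + t[0][1] * y + t[0][2],
        t[1][0] * x + t[1][1] * y + t[1][2],
    )


# Hypothetical matches: historical-map pixels and their modern coordinates.
historical = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
modern = [(10.0, 20.0), (12.0, 20.0), (10.0, 22.0)]

transform = fit_affine(historical, modern)
mapped = apply_affine(transform, (3.0, 4.0))
print(transform, mapped)
```

Once the matrix is known, every polygon vertex extracted from the historical map can be passed through apply_affine to place it in the modern coordinate system.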

Scientific Infrastructure and National Security

DeepMind is also collaborating with the UK government to accelerate scientific discovery through AI. In 2026, DeepMind plans to establish its first automated AI science laboratory in the UK. This laboratory will use a closed-loop system driven by Gemini and GNoME (Graph Networks for Materials Exploration) to design new crystal structures, predict their stability, and synthesize them using automated robotic platforms. Experimental results will be fed back to the AI in real time to refine its predictions, with the aim of shortening the discovery cycle for new materials.

In the security domain, DeepMind has partnered with the UK AI Security Institute to deploy cyber defense tools based on Deep Research technology. These include "Big Sleep" (formerly Project Naptime), an agent that uses large language models to identify vulnerabilities in codebases, and "CodeMender," which automatically generates repair patches. This automated discovery-repair loop aims to establish a real-time "digital immune system" for the UK's Critical National Infrastructure.