Hugging Face Enhances Dataset Streaming for Large-Scale AI Training

Dr. Aurora Chen
[Image: Abstract visualization of data streaming and processing, representing Hugging Face's enhanced dataset streaming for large-scale AI training.]

Hugging Face has introduced significant optimizations to its dataset streaming capabilities, enabling users to access and process large datasets without downloading them first. The update allows training to begin immediately on terabyte-scale datasets, addressing common issues such as disk space limitations and rate-limiting errors caused by excessive requests.

The enhancements are designed to improve efficiency and speed, particularly in high-concurrency environments. According to Hugging Face, the optimized streaming system reduces startup requests by a factor of 100, accelerates data-file parsing tenfold, and doubles sample throughput. In certain configurations, such as a 64×H100 setup with 256 concurrent workers, streaming load speeds can exceed those of local SSDs.

Streamlined Access and Performance Improvements

The core improvement allows users to stream datasets by passing streaming=True to the load_dataset function, maintaining compatibility with existing code. This eliminates the need for complex configurations or local storage for datasets hosted on Hugging Face.

Previously, large-scale model training often required pre-downloading data or utilizing cloud storage solutions like S3. Early attempts at direct streaming from the Hugging Face Hub encountered challenges, including IP blocks triggered by "request storms" that occurred when each DataLoader worker initialized the dataset independently.

Hugging Face states that its optimization efforts focused on two main stages: startup and streaming load. During the startup phase, a persistent data file cache now allows all DataLoader workers to share a file list, reducing network requests. The initial worker retrieves the file list, and subsequent workers access it from a local cache. Additionally, the logic for parsing file lists has been streamlined, bundling multiple API calls to reduce latency.
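The shared file-list cache can be illustrated with a hypothetical sketch (the cache path and the stubbed Hub call are illustrative, not the library's internals):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical sketch of a persistent, shared file-list cache. The first
# worker to need the list resolves it (stubbed here instead of a real Hub
# API call) and writes it to a cache file; later workers read the cached
# copy instead of each issuing their own network requests.

CACHE = Path(tempfile.gettempdir()) / "file_list_cache.json"

def fetch_file_list_from_hub():
    # Stand-in for the expensive network call listing a dataset's files.
    return ["data/train-00000.parquet", "data/train-00001.parquet"]

def get_file_list():
    if CACHE.exists():                      # cache hit: no network request
        return json.loads(CACHE.read_text())
    files = fetch_file_list_from_hub()      # cache miss: one request total
    CACHE.write_text(json.dumps(files))
    return files

first = get_file_list()   # fills the cache on first use
second = get_file_list()  # served from the cache
print(first == second)    # True
```

In the real library, all DataLoader workers in a job resolve to the same cached list, which is what collapses the per-worker request storm into a single lookup.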

For the streaming load phase, two key features were added: Parquet data prefetching and a configurable buffering mechanism. Parquet prefetching allows the datasets library to pre-load subsequent data blocks in the background while the model processes the current block. This aims to keep the data pipeline continuously supplied, preventing GPU idle time. Advanced users can configure buffer parameters, including prefetch quantity and block size, to optimize I/O based on hardware and network conditions.
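The prefetching idea can be sketched conceptually (this is not the datasets-internal code): a background thread keeps a bounded buffer of upcoming blocks filled while the consumer processes the current one, and the buffer depth plays the role of the "prefetch quantity" knob.

```python
import queue
import threading

def prefetched(blocks, buffer_size=2):
    """Yield items from `blocks`, prefetching up to `buffer_size` ahead."""
    buf = queue.Queue(maxsize=buffer_size)  # bounded: limits memory use
    SENTINEL = object()

    def producer():
        for block in blocks:
            buf.put(block)    # blocks when the buffer is full
        buf.put(SENTINEL)     # signal end of stream

    threading.Thread(target=producer, daemon=True).start()
    while (item := buf.get()) is not SENTINEL:
        yield item

consumed = list(prefetched(range(5), buffer_size=2))
print(consumed)  # [0, 1, 2, 3, 4]
```

Because the producer fills the buffer while the consumer works, slow fetches overlap with computation instead of stalling it, which is the same reason Parquet prefetching keeps GPUs busy.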

Xet Technology and Custom Streaming

Hugging Face attributes part of its streaming performance to the underlying Xet storage system, which employs deduplication for faster uploads and downloads. Xet's Parquet Content Defined Chunking (CDC) further accelerates data transfer by identifying and skipping duplicate data within Parquet files.
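The deduplication idea behind content-defined chunking can be shown with a toy sketch (real systems such as Xet use rolling hashes and tuned parameters; the boundary rule below is deliberately simplistic):

```python
import hashlib

# Toy content-defined chunking: boundaries depend on the bytes themselves,
# so identical regions always chunk identically and duplicate chunks can
# be stored once and referenced many times.

def chunks(data: bytes, mask: int = 0x0F):
    start = 0
    for i, byte in enumerate(data):
        # Declare a boundary when the byte matches a bit pattern.
        if byte & mask == mask and i > start:
            yield data[start : i + 1]
            start = i + 1
    if start < len(data):
        yield data[start:]

def dedup(data: bytes):
    store = {}  # chunk hash -> chunk bytes, stored once
    refs = []   # the file as an ordered list of chunk hashes
    for c in chunks(data):
        h = hashlib.sha256(c).hexdigest()
        store.setdefault(h, c)
        refs.append(h)
    return store, refs

a = b"hello world, hello world, hello world"
store, refs = dedup(a + a)  # the repeat adds references, not new chunks
print(len(refs), len(store))
```

The repeated half of the input produces only new references, not new stored chunks, which is why a transfer that skips already-known chunks moves far less data.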

For users requiring more control or working with unsupported data formats, Hugging Face provides custom streaming capabilities through HfFileSystem in the huggingface_hub library. This allows efficient reading of remote dataset files and reuses cached results to minimize network requests when enumerating data files.
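A minimal sketch of that approach, using the fsspec-compatible HfFileSystem API (the repo id is a placeholder, and the network-touching calls are left commented):

```python
from huggingface_hub import HfFileSystem

# HfFileSystem exposes Hub repositories through a filesystem-like,
# fsspec-compatible interface: glob/ls to enumerate files, open to read.
fs = HfFileSystem()

def list_parquet_files(repo_id: str):
    # Enumerate Parquet files in a dataset repo; the filesystem object
    # caches listings, so repeated enumeration avoids extra requests.
    return fs.glob(f"datasets/{repo_id}/**/*.parquet")

def read_head(path: str, n: int = 1024) -> bytes:
    # Stream only the first n bytes of a remote file over HTTP.
    with fs.open(path, "rb") as f:
        return f.read(n)

# Example usage (requires network and a real repo id):
# files = list_parquet_files("username/my-dataset")
# head = read_head(files[0])
```

Because reads are ranged and lazy, custom loaders built this way can stream arbitrary file formats without downloading whole files.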

Hugging Face has implemented these optimizations in its own projects, such as the training of nanoVLM models. The company reports that direct streaming now outpaces its cluster's multi-layer disk system and nearly matches local SSD performance, eliminating the previous requirement to copy data to local SSDs, a process that could take several hours.

The updated features are integrated into the datasets and huggingface_hub libraries, and users can access them by upgrading to the latest versions. To demonstrate the new capabilities, Hugging Face has released FineVisionMax, a unified and pre-shuffled dataset for training vision-language models (VLMs).
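Upgrading both libraries is a one-line setup step (standard pip usage; version pins are not required by the source):

```shell
pip install --upgrade datasets huggingface_hub
```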