Keling Unveils O1: A Unified Multimodal AI for Video and Image Editing

A shift in human-computer interaction is emerging with Keling O1, a new unified tool for video and image generation and editing. The platform integrates these tasks into a single interface, streamlining workflows for creators.
Highlights
Keling O1 is a large multimodal video model offering reference-to-video generation, text-to-video generation, and advanced editing features, including the ability to define start and end frames, add or delete content, and repaint styles. The system covers the full pipeline from initial generation to final modification.
For developers, the model accepts multimodal input: images, videos, subject references, and text. Edits are specified in natural language, eliminating the need for traditional masks or keyframes. By leveraging multi-view subjects and reference materials, O1 aims to keep the characteristics of characters, props, and scenes consistent across different shots, preserving visual continuity in the footage.
The platform allows references and instructions to be combined freely, enabling complex operations such as camera movements, character actions, and shot extensions. It can generate narrative shots of roughly 3 to 10 seconds, giving flexible control over rhythm and shot length.
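As a rough illustration of what such a multimodal request might look like, here is a minimal sketch against a hypothetical REST endpoint; the URL, field names, and duration_seconds parameter are all assumptions, not the documented Keling O1 API.

```python
import requests

# Hypothetical endpoint and field names -- the real Keling O1 API may differ.
API_URL = "https://api.example.com/v1/video/edit"

payload = {
    "prompt": "Have the character turn toward the camera, then slowly dolly out",
    "inputs": {
        "video": "uploads/source_clip.mp4",     # clip to edit
        "images": ["uploads/costume_ref.png"],  # visual reference material
        "subjects": ["subject_id_123"],         # stored subject profile
    },
    "duration_seconds": 5,  # narrative shots run roughly 3-10 s
}

response = requests.post(API_URL, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["task_id"])
```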
Under the Hood
The O1 interface adds a prominent new icon on the left which, once activated, consolidates most functionality. The prompt input box at the bottom integrates the various options: basic operations such as aspect ratio adjustment sit beneath it, while capsule buttons along its top select the type of input content. For instance, selecting "image subject reference" displays input fields for video, image, and subject; users can also opt for text-only operation. When working with start and end frames, the input fields change to "start frame" and "end frame," and the prompt must carry the corresponding labels.
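For illustration only, a start-and-end-frame prompt might carry labels along these lines (the exact label wording the UI expects is an assumption):

```python
# Illustrative prompt for start/end frame mode; the exact label wording
# is an assumption, not confirmed Keling syntax.
prompt = (
    "Start frame: a mug of coffee steaming on the desk. "
    "End frame: the mug empty and tipped over. "
    "Connect the two with a slow push-in."
)
```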
Text-based video editing involves uploading a video and using the "@" symbol in the prompt to reference specific materials. This allows for modifications such as altering clothing or adding accessories. The system demonstrates the ability to transfer mouth shapes and movements, suggesting potential applications in digital human modeling.
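A hypothetical example of the "@" convention; the handle names and materials mapping are illustrative, not actual Keling identifiers:

```python
# "@" handles point at uploaded materials; names are illustrative.
materials = {
    "@clip": "uploads/interview_take.mp4",
    "@jacket": "uploads/red_jacket.png",
}

prompt = (
    "In @clip, replace the speaker's coat with the jacket from @jacket "
    "and add round glasses, keeping mouth shapes and head movement unchanged."
)
```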
Image referencing is another core feature. Users can upload single or multiple images to guide modifications, particularly when specific environmental or character details are difficult to describe textually. The system can differentiate between direct background modification and transitional effects. Detailed descriptions of background movement or foreground elements contribute to more realistic outcomes. For example, adding vines in front of a character can result in the system adjusting lighting on the character's face and body to match the new environment, while background elements continue to move naturally.
A notable technique involves iterative image referencing: modifying a video with one image, then using the newly modified video for further adjustments. This approach offers enhanced control over the editing process.
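Sketched as a loop, assuming a hypothetical submit_edit() helper that wraps an edit endpoint like the one above, the iterative workflow might look like this:

```python
import requests

API_URL = "https://api.example.com/v1/video/edit"  # hypothetical endpoint

def submit_edit(video_path: str, reference_image: str, prompt: str) -> str:
    """Submit one edit pass; returns the path/URL of the edited video."""
    resp = requests.post(API_URL, json={
        "prompt": prompt,
        "inputs": {"video": video_path, "images": [reference_image]},
    }, timeout=300)
    resp.raise_for_status()
    return resp.json()["output_video"]

current_video = "uploads/base_clip.mp4"
steps = [
    ("refs/forest_background.png", "Move the scene into this forest at dusk"),
    ("refs/lantern_prop.png", "Put this lantern in the character's left hand"),
]
for image, prompt in steps:
    # Each pass edits the output of the previous pass, giving finer
    # control than packing everything into one request.
    current_video = submit_edit(current_video, image, prompt)
```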
Engineering Notes
The "subject" feature in O1 allows users to create and store character profiles. Once a subject is created, it can be selected directly for future use without repeated uploads. The platform offers a selection of built-in subjects, and users can create their own by uploading multiple images from different angles. This multi-angle input significantly improves consistency for characters, props, and scenes in video generation. Multiple subjects can be stacked, which is particularly beneficial for professional content creation where character and scene consistency is paramount.
For instance, a user can transform themselves into a specific character subject and add props, even in complex indoor environments with foreground and background elements. The system integrates these elements realistically: props move in sync with the body and lighting adjustments look natural. The subject feature has significant implications for e-commerce, where products must be displayed consistently from multiple angles. After uploading four images to create a product subject, a user can generate videos with wide circular camera movements while the product remains stable and retains fine detail, such as scratches or signs of use.
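Under the same hypothetical API, the product-subject workflow could be sketched as: create the subject once from four angles, then reuse its ID for generation.

```python
import requests

BASE = "https://api.example.com/v1"  # hypothetical base URL

# Create a product subject from four angles; multi-angle input is what
# drives cross-shot consistency. Field names are illustrative.
subject = requests.post(f"{BASE}/subjects", json={
    "name": "leather-backpack",
    "images": [
        "uploads/backpack_front.jpg",
        "uploads/backpack_back.jpg",
        "uploads/backpack_left.jpg",
        "uploads/backpack_right.jpg",
    ],
}, timeout=60).json()

# Reuse the stored subject without re-uploading the images.
requests.post(f"{BASE}/video/generate", json={
    "subjects": [subject["id"]],
    "prompt": "Slow 360-degree orbit around the backpack on a studio table, "
              "preserving the scratches on the leather",
    "duration_seconds": 8,
}, timeout=120)
```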
O1 also supports generating videos solely from a subject. Users can set a generation duration of up to 10 seconds per video, with inspiration points deducted according to length. This makes the feature cost-effective for video-agent products and lightweight display scenarios. Additionally, O1 can change the style of a video directly, converting it to styles such as felt, anime, or 8-bit pixel art from a simple prompt, a process that was previously resource-intensive.
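Since the restyling is prompt-driven, converting one clip into several styles reduces to one request per style; again, the endpoint and field names below are assumptions:

```python
import requests

BASE = "https://api.example.com/v1"  # hypothetical base URL

# One restyling request per target style; the repaint is prompt-driven.
for style in ("felt", "anime", "8-bit pixel art"):
    requests.post(f"{BASE}/video/edit", json={
        "inputs": {"video": "uploads/source_clip.mp4"},
        "prompt": f"Repaint the entire video in a {style} style",
    }, timeout=300)
```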
The model retains support for generating videos from start and end frames. Combining video editing with start and end frames can produce complex special effects. For example, a mouse in a user's hand could be transformed into a code dragon, and then the final frame of that video, along with the code dragon image, could be used to generate a new video with a natural transition into a larger scene.
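One plausible way to chain such shots is to extract the final frame with OpenCV (real library calls) and feed it back as the start frame of the next generation; the generation endpoint and its fields remain hypothetical:

```python
import cv2
import requests

# Grab the final frame of the first clip, then use it as the start frame
# of the next shot so the transition reads as continuous.
cap = cv2.VideoCapture("outputs/dragon_clip.mp4")
cap.set(cv2.CAP_PROP_POS_FRAMES, cap.get(cv2.CAP_PROP_FRAME_COUNT) - 1)
ok, last_frame = cap.read()
cap.release()
assert ok, "could not read the final frame"
cv2.imwrite("frames/dragon_last.png", last_frame)

# The reference image keeps the dragon's look consistent as the new shot
# opens on the extracted frame and expands into the larger scene.
requests.post("https://api.example.com/v1/video/generate", json={
    "start_frame": "frames/dragon_last.png",
    "images": ["refs/code_dragon.png"],
    "prompt": "Start frame: the code dragon resting in an open hand. "
              "It takes flight and soars into a vast glowing cityscape.",
}, timeout=300)
```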
What Comes Next
Beyond video, O1 also supports image editing. After switching to "image" mode, users can upload multiple images, add subjects, and perform edits. Multi-image referencing handles complex scenarios such as combining human and animated characters while maintaining scene consistency and character expressions. Images can also be mixed with subjects to improve consistency, for example dressing a user in a specific costume within an office setting.
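Image mode would follow the same request pattern; a hypothetical sketch mixing multiple reference images with a stored subject:

```python
import requests

# Hypothetical image-mode request mixing reference images with a stored
# subject; the endpoint and field names are illustrative assumptions.
requests.post("https://api.example.com/v1/image/edit", json={
    "inputs": {
        "images": ["uploads/office_photo.jpg", "uploads/cartoon_character.png"],
        "subjects": ["subject_id_costume"],
    },
    "prompt": "Place the animated character beside the person at the desk, "
              "matching the office lighting and keeping both expressions",
}, timeout=120)
```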
The video field is mirroring the developmental trajectory of the image field, with continuous advancements in inference capabilities, world knowledge, and editing functionalities. The recent release of the Keling Video O1 model demonstrates significant progress in this domain.