Researchers at T-Bank AI Research have unveiled VL-DAC, a method for training visual-language models in simulators that promises to cut the cost and time of retraining on real-world data. The approach speeds up learning for models that analyze images or interfaces, execute tasks step by step, and evaluate the outcomes. After training, models improved at web navigation, spatial orientation, and route planning, capabilities applicable across banking, robotics, gaming, and other industries.
Today's visual-language models excel at recognizing and describing objects and interfaces but struggle with sequential actions, such as navigating a website or selecting products. These tasks require a model to track its previous actions and judge whether each new step brings it closer to the desired outcome. Training for them in real environments has traditionally been costly and labor-intensive.
VL-DAC allows models to be trained in low-cost synthetic environments, where errors have minimal consequences. In experiments, models trained with VL-DAC needed 1.5 to 2.5 times fewer steps and GPU hours to complete tasks than with earlier methods such as RL4VLM, while achieving higher task accuracy.
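To make the "1.5 to 2.5 times fewer" figure concrete, the short sketch below converts it into absolute numbers. The 100 GPU-hour baseline is a hypothetical value chosen only for illustration and does not come from the researchers; the function name is likewise an assumption.

```python
def reduced_cost(baseline: float, factor: float) -> float:
    """Cost after an 'N times fewer' reduction: the baseline divided by N."""
    if factor <= 0:
        raise ValueError("factor must be positive")
    return baseline / factor


# Hypothetical baseline, chosen only to make the arithmetic concrete.
baseline_gpu_hours = 100.0

smaller_saving = reduced_cost(baseline_gpu_hours, 1.5)  # weakest reported reduction
larger_saving = reduced_cost(baseline_gpu_hours, 2.5)   # strongest reported reduction

print(f"{larger_saving:.1f} to {smaller_saving:.1f} GPU hours")
```

Under that assumed baseline, the reported range corresponds to roughly 40 to 67 GPU hours instead of 100, that is, a saving of about one third to three fifths.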
The researchers utilized multiple types of simulators, each targeting specific skills: MiniWorld for navigation and route planning, Gym-Cards for logical and arithmetic tasks, ALFWorld for everyday problem-solving, and WebShop for interacting with online store interfaces.
In their experiments, retraining the Qwen2-VL-7B model with VL-DAC yielded an improvement of more than 50% in goal completion in interactive environments, 5% in spatial planning, and 2% in web navigation.
A key feature of VL-DAC is that it separates learning to execute actions from learning to assess the value of those actions, which addresses an earlier problem where the two signals interfered with each other. The method trains actions at the token level and evaluates their utility at the step level.
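The separation described above can be illustrated with a minimal sketch, which is not the researchers' implementation. It assumes a PPO-style clipped objective for the action tokens and a squared-error objective for the step value; the article confirms neither choice, and all function and parameter names here are hypothetical. The point it shows is only the granularity: the action loss is computed per token, the value loss once per environment step, and the two are kept as separate quantities.

```python
import math


def token_level_policy_loss(new_logps, old_logps, advantage, clip_eps=0.2):
    """Clipped policy loss averaged over the action tokens of one step.

    Assumption: a PPO-style clipped ratio, with every token of the step
    sharing the step's scalar advantage.
    """
    if not new_logps or len(new_logps) != len(old_logps):
        raise ValueError("need equally sized, non-empty log-prob lists")
    losses = []
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)
        clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
        losses.append(-min(ratio * advantage, clipped * advantage))
    return sum(losses) / len(losses)


def step_level_value_loss(predicted_value, observed_return):
    """Squared error on a single value estimate for the whole step."""
    return 0.5 * (predicted_value - observed_return) ** 2


# One environment step whose action was emitted as three tokens.
old_logps = [-1.0, -2.0, -0.5]
new_logps = [-1.0, -2.0, -0.5]  # unchanged policy, so every ratio is 1

policy_loss = token_level_policy_loss(new_logps, old_logps, advantage=2.0)
value_loss = step_level_value_loss(predicted_value=0.5, observed_return=1.5)

# The two losses are reported separately rather than summed into one
# objective, mirroring the separation the article describes.
print(policy_loss, value_loss)
```

In a real training loop each loss would drive its own update, so that errors in the value estimate do not distort the action-token gradients; how the method wires this up in practice is not described in the article.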
Moreover, the diversity of simulators plays a critical role in broadening a model's skill set. Each simulator enhances distinct abilities, allowing models to better transfer learned experiences to real-world tasks and even improve general task performance without needing additional training on labeled data.
The method is particularly useful where a task requires a predefined sequence of actions, as in banking and insurance, where models can help fill out forms or compare products. Its potential extends to gaming, robotics, retail, industry, and logistics, for tasks such as object recognition, route construction, or movement planning within spaces.
Daniel Gavrilov, head of the fundamental AI research laboratory at T-Bank, said the findings show that reinforcement learning in simulators could serve as a faster, cheaper, yet equally accurate alternative to conventional methods. He likened the training process to a gym, where each simulator targets specific skills and together they strengthen the model's overall capabilities. Looking forward, the team aims to test the approach in more complex three-dimensional environments and in scenarios where models must plan a series of actions in advance.
The method could give companies a cost-effective way to develop advanced AI capabilities, potentially shifting the competitive landscape in sectors that rely on AI technology.