Microsoft Researchers Warn AI Models Are Not Ready for Complex Tasks

Researchers at Microsoft have raised concerns about how well advanced AI models handle lengthy, multi-step tasks. In testing, top models such as Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4 lost a significant amount of data, averaging a 25% reduction in the content of documents they were assigned to process autonomously. The finding comes from a benchmark named DELEGATE-52, designed to simulate workflows across 52 professional fields, including software development, music notation, and crystallography.

In the evaluation, the models were assessed on their ability to preserve document integrity over 20 processing cycles, against a readiness threshold of 98%. The models fared better on programming tasks but struggled significantly with natural language processing: document integrity dropped below 80% in more than 80% of the task combinations tested. Even the best-performing model, Google's Gemini 3.1 Pro, met the readiness criterion in only 11 of the 52 domains examined.
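The article does not spell out how integrity is scored, but the setup suggests comparing each cycle's output against the original document. Below is a minimal sketch of how such a check might work, assuming integrity means the fraction of the original content still recoverable after each pass; the similarity metric, the `lossy_model` stand-in, and the thresholds are illustrative assumptions, not details from the benchmark itself.

```python
import difflib

READINESS_THRESHOLD = 0.98  # assumed reading of the 98% criterion
CYCLES = 20                 # number of processing cycles in the benchmark

def integrity(original: str, current: str) -> float:
    """Score how much of the original document survives, using
    longest-common-subsequence similarity (an illustrative metric)."""
    return difflib.SequenceMatcher(None, original, current).ratio()

def evaluate(document: str, process) -> list[float]:
    """Apply `process` (a stand-in for one autonomous model pass)
    repeatedly, recording integrity against the original each cycle."""
    scores = []
    current = document
    for _ in range(CYCLES):
        current = process(current)
        scores.append(integrity(document, current))
    return scores

# Hypothetical "model" that silently drops the last 1% of the text per pass.
lossy_model = lambda text: text[: int(len(text) * 0.99)]
history = evaluate("lorem ipsum " * 500, lossy_model)
print(f"final integrity: {history[-1]:.1%}")
print("ready" if history[-1] >= READINESS_THRESHOLD else "not ready")
```

Even a modest 1% loss per cycle compounds to roughly 18% missing content after 20 cycles, which illustrates how a model can look reliable on a single pass yet fall far below a 98% bar over a long workflow.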

Errors did not accumulate gradually; they appeared abruptly, with a single interaction capable of costing 10 to 30 points of score. More capable models tried to avoid minor errors by deferring their handling to later stages, but the problems persisted. Moreover, when the models operated with tool access in an agent-controlled environment, their performance declined by an average of 6% by the end of the processing cycle.

According to the researchers, users must remain vigilant when delegating tasks to AI systems: current models are suited to autonomous operation only in narrowly defined areas. Still, the benchmark's authors acknowledged the rapid progress of large language models (LLMs), noting that OpenAI's models, for example, improved their score from 14.7% to 71.5% over a span of 16 months.

The findings underscore the need for continued oversight and development in the AI space, as market leaders and challengers alike must work around these limitations to make AI technologies more reliable and effective.
