Agents are red-hot right now. Advances in the field arrive so quickly that keeping up with everything is nearly impossible.
This survey may help clarify many issues:
A research team from East China Normal University and Donghua University has published “A Survey on the Optimization of Large Language Model-based Agents,” offering the first comprehensive review and analysis of LLM agent optimization strategies from a systematic perspective.
The paper categorizes existing methods into two main types: parameter-driven optimization and parameter-free optimization.
The former includes supervised fine-tuning, reinforcement learning (such as PPO, DPO), and hybrid strategies combining fine-tuning with RL, focusing on key modules such as trajectory data construction, reward function design, and optimization algorithms.
The latter involves optimizing agent behavior without modifying model parameters through methods like prompt engineering, external tool invocation, and knowledge retrieval.

In addition, the authors compiled mainstream agent fine-tuning and evaluation datasets and reviewed representative practices of LLM agents in various application fields such as healthcare, science, finance, and programming.
Finally, the research team summarized the key challenges facing agents today and future research directions.

Why Do We Need to Specifically Optimize LLM Agents?
In recent years, large language models such as GPT-4, PaLM, and DeepSeek have not only excelled in language understanding and generation but also demonstrated extraordinary capabilities in reasoning, planning, and complex decision-making.
As a result, a growing number of researchers are using LLMs as agents, exploring their potential for automated decision-making and as a path toward artificial general intelligence.
Unlike traditional reinforcement learning agents, LLM agents do not rely on explicit reward functions; instead, they complete complex tasks through natural language instructions, prompt templates, and in-context learning (ICL).
This “text-driven” agent paradigm exhibits high flexibility and generalization capabilities, enabling cross-task understanding of human intent, execution of multi-step operations, and decision-making in dynamic environments.
Currently, researchers have attempted to improve performance through task decomposition, self-reflection, memory enhancement, and multi-agent collaboration, with application scenarios covering software development, mathematical reasoning, embodied intelligence, web navigation, and other fields.
It is worth noting that LLMs are trained for next-token prediction, an objective that was not designed for agent tasks involving long-term planning and interactive learning.
This leads to several challenges when using LLMs as agents:
- Insufficient long-horizon planning and multi-step reasoning capabilities, leading to cumulative errors in complex tasks;
- Lack of persistent memory mechanisms, making it difficult to reflect and optimize based on historical experience;
- Limited adaptability to new environments, struggling to dynamically respond to changing scenarios.
In particular, open-source LLMs generally lag behind closed-source models like GPT-4 in agent tasks. The high cost and opacity of closed-source models have made optimizing open-source LLMs to enhance agent capabilities a key research need.
Existing surveys either focus solely on large model optimization or discuss only local agent capabilities (such as planning, memory, or role-playing), failing to treat “LLM agent optimization” as an independent and systematic research direction for in-depth exploration.
The research team fills this gap by conducting the first systematic review centered on “optimization technologies for LLM-based agents,” constructing a unified framework, summarizing methodological paths, and comparing the pros, cons, and applicable contexts of different techniques.
Parameter-Driven Optimization of LLM Agents
The authors divide parameter-driven optimization into three directions.
Optimization Based on Conventional Fine-Tuning
The first direction is optimization based on conventional fine-tuning.
It proceeds in two major steps: constructing high-quality trajectory data for agent tasks, then using these trajectories to fine-tune the agent.

First is data acquisition and generation.
Constructing high-quality trajectory data begins with the acquisition and generation of initial data. This requires not only a diverse set of trajectories but also sufficient alignment with target tasks to ensure effective learning.
The authors categorize mainstream methods into the following four types:
- Expert-annotated data: Manually designed by human experts, offering high quality and strong alignment; it serves as the gold standard for fine-tuning. However, due to high labor costs and scalability issues, it is often used as a supplementary dataset of high-quality examples.
- Data automatically generated by powerful LLMs: Utilizes large models like GPT-4 combined with ReAct or Chain-of-Thought (CoT) strategies to generate trajectories. It is efficient and suitable for large-scale construction but relies heavily on large models, leading to issues such as high costs and bias propagation.
- Data from autonomous agent exploration: Generated by open-source models interacting autonomously with the environment. It is low-cost and helps break free from closed-source dependencies. The drawback is limited exploration capability, requiring subsequent filtering mechanisms to remove low-quality data.
- Data generated through multi-agent collaboration: Multiple agents collaborate to complete complex task workflows, enhancing data diversity and interaction complexity. However, this approach increases system design complexity, posing challenges in stability and resource costs.
Second is data evaluation and filtering.
Since the quality of generated trajectory data varies significantly, evaluating and filtering the data has become an indispensable step.
The authors categorize mainstream methods into three types:
- Environment-based evaluation: These methods rely on external feedback such as task success or environmental rewards to judge trajectory quality. They are easy to implement and highly automated. However, their drawback is that the feedback signals are too coarse-grained, focusing only on final outcomes and failing to detect implicit errors in the reasoning chain.
- Human- or rule-based evaluation: This approach employs preset rules (such as task completion rate, answer consistency, diversity, etc.) or expert manual review for more fine-grained quality control. It offers strong adaptability and high accuracy but requires significant human involvement and complex design.
- Model-based evaluation: Leveraging powerful LLMs (e.g., GPT-4) to automatically score and analyze trajectories allows for multi-dimensional assessment across relevance, accuracy, and completeness, building an automated quality evaluation framework. The downside is that the evaluation itself relies on models, which may introduce new biases.
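As a rough illustration of how environment-based and rule-based checks might be combined, the sketch below filters a small list of trajectories. The `Trajectory` fields, the `max_steps` threshold, and the "no immediately repeated action" rule are assumptions made for this example, not criteria taken from the survey.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trajectory:
    task: str
    actions: List[str]
    success: bool  # environment-based signal: did the task finish correctly?

def passes_rules(traj: Trajectory, max_steps: int = 10) -> bool:
    """Hypothetical rule-based checks: bounded length, no immediate repeats."""
    if not traj.actions or len(traj.actions) > max_steps:
        return False
    return all(a != b for a, b in zip(traj.actions, traj.actions[1:]))

def filter_trajectories(trajs: List[Trajectory]) -> List[Trajectory]:
    """Keep trajectories that succeed in the environment and pass the rules."""
    return [t for t in trajs if t.success and passes_rules(t)]

candidates = [
    Trajectory("book flight", ["search", "select", "pay"], success=True),
    Trajectory("book flight", ["search", "search", "pay"], success=True),  # repeated action
    Trajectory("book flight", ["search", "select"], success=False),        # failed in environment
]
kept = filter_trajectories(candidates)
print([t.actions for t in kept])
```

A model-based evaluator could be added as a third predicate in `filter_trajectories`; it is left out here because it would require a call to an external LLM.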
Next is the utilization of low-quality samples.
Beyond acquiring high-quality data, it is also worthwhile to repurpose low-quality trajectories rather than discard them.
Current mainstream strategies include:
- Contrastive utilization: By comparing correct and incorrect samples, the model can more clearly identify which behaviors are effective.
- Error-correction methods: Identifying and correcting failed trajectories transforms them into learnable data, thereby improving training quality.
- Direct use of error samples: Instead of correction, failing cases are used directly to train the model, enhancing its fault tolerance when facing erroneous situations.
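Contrastive utilization can be pictured as pairing successful and failed trajectories for the same task. The following is a minimal sketch under that reading; the dictionary keys (`task`, `actions`, `success`) and the all-pairs strategy are illustrative assumptions.

```python
from collections import defaultdict
from itertools import product

def build_contrastive_pairs(trajectories):
    """Pair each successful trajectory with each failed one for the same task."""
    by_task = defaultdict(lambda: {"good": [], "bad": []})
    for traj in trajectories:
        bucket = "good" if traj["success"] else "bad"
        by_task[traj["task"]][bucket].append(traj["actions"])
    pairs = []
    for task, groups in by_task.items():
        for chosen, rejected in product(groups["good"], groups["bad"]):
            pairs.append({"task": task, "chosen": chosen, "rejected": rejected})
    return pairs

trajectories = [
    {"task": "find price", "actions": ["search", "open", "read"], "success": True},
    {"task": "find price", "actions": ["search", "guess"], "success": False},
    {"task": "send email", "actions": ["draft"], "success": False},  # no success to pair with
]
pairs = build_contrastive_pairs(trajectories)
print(pairs)
```

Tasks with only failures produce no pairs in this sketch; those are the cases where error-correction or direct use of error samples would apply instead.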
After constructing high-quality trajectory data, the next step is the critical fine-tuning phase.
Fine-tuning is what allows open-source large models to truly adapt to agent tasks by learning planning, reasoning, and interaction, which makes it an indispensable step in optimizing LLM agents.
Notably, fine-tuning solely on agent task trajectories may weaken the general capabilities of LLMs.
Therefore, most approaches train on a mix of general instruction data and agent trajectories, aiming to enhance agent execution capabilities while preserving foundational language skills.
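One simple way to picture this mixing is to sample a training set with a fixed share of agent trajectories. The sketch below assumes a hypothetical `agent_ratio` knob and sampling with replacement; the survey does not prescribe a particular ratio or sampling scheme.

```python
import random

def mix_datasets(general, agent, agent_ratio=0.5, size=None, seed=0):
    """Build a training set in which about `agent_ratio` of examples are agent trajectories."""
    rng = random.Random(seed)
    size = size or (len(general) + len(agent))
    n_agent = round(size * agent_ratio)
    n_general = size - n_agent
    # Sampling with replacement lets either source be up- or down-weighted.
    mixed = rng.choices(agent, k=n_agent) + rng.choices(general, k=n_general)
    rng.shuffle(mixed)
    return mixed

general_data = [("general", i) for i in range(100)]
agent_data = [("agent", i) for i in range(20)]
mixed = mix_datasets(general_data, agent_data, agent_ratio=0.3, size=50)
n_agent_examples = sum(1 for source, _ in mixed if source == "agent")
print(len(mixed), n_agent_examples)
```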
The authors divide existing fine-tuning methods into three major categories:
- Standard SFT: The most common method, which optimizes the model’s full parameters using high-quality instruction-output pairs or trajectory data, achieving the best alignment with target tasks. Additionally, behavior cloning in imitation learning essentially falls under this category, emphasizing learning decision strategies from expert trajectories.
- Parameter-Efficient Fine-Tuning (e.g., LoRA/QLoRA): Only a small number of parameters are updated while other weights remain frozen, significantly reducing VRAM and computational overhead. This is particularly common when fine-tuning large-model agents. Although training costs are lower than for full-parameter fine-tuning, performance is often comparable and in some cases better.
- Custom Fine-Tuning Strategies: Methods designed for specific tasks, such as mixing general instructions with trajectory data or introducing additional constraints (e.g., regularization) to improve generalization and stability. These methods offer greater flexibility and are suitable for complex or scarce task scenarios.
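To make the savings of LoRA-style methods concrete, the arithmetic below counts trainable parameters for a single weight matrix. LoRA freezes a `d_out x d_in` matrix and learns two low-rank factors of shapes `d_out x rank` and `rank x d_in`; the 4096 x 4096 layer size and rank 8 are illustrative values, not figures from the survey.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the low-rank factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of trainable parameters relative to the full frozen weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)

full = 4096 * 4096                                # 16,777,216 frozen weights
lora = lora_trainable_params(4096, 4096, rank=8)  # 65,536 trainable weights
print(full, lora, f"{lora_fraction(4096, 4096, 8):.4%}")
```

For this one layer the trainable share is 1/256 of the full matrix, which is where the memory and compute savings come from.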

Optimization Based on Reinforcement Learning
Compared to traditional fine-tuning methods, reinforcement learning provides a more proactive learning path for agents.
It enables models to go beyond mere “imitation,” allowing them to explore behaviors in the environment, receive rewards and penalties, and dynamically adjust strategies, truly achieving growth through trial and error.
The authors categorize current RL optimization approaches into: reward-function-based optimization and preference-alignment-based optimization.

First, let’s discuss reward-function-based optimization.
In reinforcement learning optimization, the reward function acts like a conductor's baton for the agent, guiding the model to continuously improve its strategy. By establishing clear standards for what counts as doing well versus doing poorly, agents can learn more precisely and robustly through interaction.
The authors classify current methods into three types based on reward sources:
- Environment-based rewards: Scoring is directly based on whether the task is completed. This approach is simple, intuitive, and highly automated. However, it often focuses only on the final outcome, ignoring the quality of intermediate steps.
- Model-based rewards: Trajectories are evaluated by LLMs or auxiliary models, making them suitable for scenarios with sparse environmental feedback as they can provide more detailed signals. However, effectiveness depends on the quality of the evaluation model.
- Custom reward functions: Researchers design multi-dimensional rewards based on specific task needs, assessing not only completion rates but also strategy stability and collaboration efficiency. While flexible and powerful, these methods have high design costs and are difficult to generalize.
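A custom reward might combine the environment's outcome signal with penalties on intermediate behavior. The sketch below is one hypothetical shaping; the penalty terms and their weights (`step_cost`, `invalid_cost`) are assumptions chosen only to show the idea.

```python
def shaped_reward(success: bool, n_steps: int, n_invalid_actions: int,
                  step_cost: float = 0.01, invalid_cost: float = 0.1) -> float:
    """Outcome reward minus small penalties for long or invalid trajectories."""
    outcome = 1.0 if success else 0.0  # environment-based component
    return outcome - step_cost * n_steps - invalid_cost * n_invalid_actions

r_short = shaped_reward(success=True, n_steps=5, n_invalid_actions=0)
r_long = shaped_reward(success=True, n_steps=20, n_invalid_actions=2)
r_fail = shaped_reward(success=False, n_steps=5, n_invalid_actions=0)
print(r_short, r_long, r_fail)
```

Two successful trajectories now receive different scores depending on how they reached the goal, which is the extra signal a pure success/failure reward cannot provide.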

Next, let’s look at preference-alignment-based optimization.
Compared to traditional RL training based on reward functions, preference alignment offers a more direct and lightweight optimization path.
It no longer relies on cumbersome reward modeling; instead, it teaches the agent “which behaviors are more favored by humans.”
Its representative method is DPO, a simpler offline reinforcement learning method that trains directly on “positive-negative contrast” samples built from human or expert preferences.
Based on the primary sources of preference data, the authors categorize these optimization approaches into two types:
- Expert/Human Preference Data: Positive and negative samples are constructed based on expert demonstrations or human annotations (high-quality vs. erroneous trajectories). While high in quality, this method is difficult to scale and has limited coverage.
- Task or Environment Feedback: Preference pairs are automatically constructed from task performance metrics (such as success rates or scores). This approach is suitable for dynamic task scenarios but relies on the rational design of feedback mechanisms.
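For readers who want to see what a preference pair turns into during training, here is the DPO loss for a single pair as it is commonly written, computed in plain Python. The inputs are sequence log-probabilities under the trained policy and a frozen reference model; the numeric values and `beta=0.1` are made up for illustration.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, given sequence log-probabilities."""
    # How much more the policy prefers the chosen trajectory over the
    # rejected one, measured relative to the frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    x = beta * margin
    # -log(sigmoid(x)) computed as softplus(-x) in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))

neutral = dpo_loss(-6.0, -6.0, -6.0, -6.0)   # policy equals reference
aligned = dpo_loss(-5.0, -9.0, -6.0, -6.0)   # policy favors the chosen trajectory
reversed_ = dpo_loss(-9.0, -5.0, -6.0, -6.0) # policy favors the rejected trajectory
print(neutral, aligned, reversed_)
```

The loss falls when the policy raises the chosen trajectory's likelihood relative to the rejected one, so no separate reward model is needed.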

In summary, preference alignment methods are efficient to train and simple to deploy, but they strongly depend on the quality and coverage of preference data, making them suitable for task scenarios with clear structures and feedback.
In contrast, reward function-based methods are better suited for complex and changing environments, though they come at a higher cost.
Hybrid Parameter Fine-Tuning Methods
Single optimization methods each have shortcomings: conventional fine-tuning is stable and efficient but lacks dynamic adaptability, while Reinforcement Learning (RL) is flexible and powerful yet incurs significant computational overhead.
Consequently, an increasing number of studies are exploring hybrid fine-tuning strategies that combine the advantages of both to build more robust LLM agents.
These works primarily fall into two categories:
First, sequential two-stage training.
This is currently the mainstream approach, following a “Supervised Fine-Tuning (SFT) first, then RL” strategy.
- Stage 1: Behavioral Cloning Fine-Tuning (SFT): The model is first trained on expert trajectories or curated data to establish foundational capabilities;
- Stage 2: Reinforcement Learning Optimization (PPO / DPO): The model’s policy is fine-tuned based on environment or preference feedback.
Second, alternating optimization.
This involves introducing an iterative alternation mechanism, switching back and forth between SFT and RL over multiple rounds to achieve granular improvements.
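The difference between the two hybrid recipes is mostly one of scheduling. The toy sketch below makes that explicit; `sft_phase` and `rl_phase` are placeholders that only count updates, standing in for real training loops that are outside the scope of this example.

```python
from typing import Dict

Model = Dict[str, int]  # toy stand-in for model weights

def sft_phase(model: Model) -> Model:
    """Placeholder for supervised fine-tuning on expert trajectories."""
    return {**model, "sft_updates": model.get("sft_updates", 0) + 1}

def rl_phase(model: Model) -> Model:
    """Placeholder for RL or preference optimization (e.g. PPO or DPO)."""
    return {**model, "rl_updates": model.get("rl_updates", 0) + 1}

def hybrid_train(model: Model, rounds: int = 1) -> Model:
    """rounds=1 is the sequential two-stage recipe; rounds>1 alternates SFT and RL."""
    for _ in range(rounds):
        model = sft_phase(model)
        model = rl_phase(model)
    return model

two_stage = hybrid_train({}, rounds=1)
alternating = hybrid_train({}, rounds=3)
print(two_stage, alternating)
```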
Parameter-Free LLM Agent Optimization
Compared to parameter fine-tuning, parameter-free optimization methods do not involve updating model weights. Instead, they demonstrate strong potential in resource-constrained or lightweight deployment scenarios by adjusting prompts, context, and external information structures.
The authors categorize these into five core strategies:
Category 1: Experience-Based Optimization.
Through memory modules or historical trajectories, agents “learn to review,” extracting strategies from past successes and failures to enhance long-term adaptability.
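A minimal sketch of the idea, assuming lessons are stored as short text notes and retrieved by simple word overlap. The `ExperienceMemory` class and the prompt layout are hypothetical; real systems typically use embeddings and richer trajectory records.

```python
class ExperienceMemory:
    """Stores short lessons from past episodes and retrieves them by word overlap."""

    def __init__(self):
        self.entries = []  # list of (task description, lesson)

    def add(self, task: str, lesson: str) -> None:
        self.entries.append((task, lesson))

    def retrieve(self, task: str, k: int = 2):
        query = set(task.lower().split())
        scored = [(len(query & set(t.lower().split())), lesson) for t, lesson in self.entries]
        scored = [s for s in scored if s[0] > 0]  # drop unrelated lessons
        scored.sort(key=lambda s: s[0], reverse=True)
        return [lesson for _, lesson in scored[:k]]

def build_prompt(task: str, memory: ExperienceMemory) -> str:
    lessons = memory.retrieve(task)
    notes = "\n".join(f"- {lesson}" for lesson in lessons) or "- (none)"
    return f"Lessons from earlier attempts:\n{notes}\n\nTask: {task}"

memory = ExperienceMemory()
memory.add("book a flight to Paris", "Confirm the travel date before paying.")
memory.add("summarize PDF reports", "Read the table of contents first.")
prompt = build_prompt("book a flight to Tokyo", memory)
print(prompt)
```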
Category 2: Feedback-Based Optimization.
Agents continuously correct their behavior through self-reflection or external evaluation, forming an iterative loop; other methods optimize global instruction structures via meta-prompts to improve generalization capabilities.
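The iterative loop described here can be sketched as follows. `toy_act` and `toy_evaluate` are stand-ins invented for this example; in practice `act` would call an LLM with the accumulated feedback in its prompt, and `evaluate` would be a critic model, a test suite, or the environment.

```python
def reflect_and_retry(task, act, evaluate, max_rounds=3):
    """Run the agent and feed evaluator feedback into the next attempt."""
    feedback_history = []
    answer = None
    for _ in range(max_rounds):
        answer = act(task, feedback_history)
        ok, feedback = evaluate(answer)
        if ok:
            break
        feedback_history.append(feedback)
    return answer, feedback_history

def toy_act(task, feedback_history):
    # Only answers correctly once it has seen at least one piece of feedback.
    return "4" if feedback_history else "5"

def toy_evaluate(answer):
    return (answer == "4", "The sum is wrong; recompute 2 + 2.")

answer, history = reflect_and_retry("What is 2 + 2?", toy_act, toy_evaluate)
print(answer, history)
```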
Category 3: Tool-Based Optimization.
This enables agents to learn how to use tools (such as search engines, calculators, and APIs) to enhance execution capabilities. Some methods optimize tool-calling strategies, while others train agents to construct more efficient task-to-tool pathways.
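At its simplest, tool use means parsing a structured call from the model's output and routing it to a function. The sketch below assumes the model emits a JSON object with `tool` and `input` keys; that format, the `TOOLS` registry, and both toy tools are hypothetical.

```python
import json
import operator

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def calculator(expression: str) -> str:
    """Evaluate a binary expression written as 'a <op> b' with spaces."""
    left, op, right = expression.split()
    return str(OPS[op](float(left), float(right)))

def word_count(text: str) -> str:
    return str(len(text.split()))

TOOLS = {"calculator": calculator, "word_count": word_count}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching tool."""
    call = json.loads(model_output)
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return f"error: unknown tool {call['tool']!r}"
    return tool(call["input"])

# Pretend the LLM produced this string in response to "What is 12 * 7?"
observation = dispatch('{"tool": "calculator", "input": "12 * 7"}')
print(observation)
```

The returned observation would normally be appended to the agent's context so the model can use it in its next step.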
Category 4: RAG-Based Optimization.
By combining retrieval with generation, this approach enhances the reasoning process by retrieving information in real-time from databases or knowledge bases, making it particularly suitable for knowledge-intensive tasks and rapidly changing scenarios.
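The retrieve-then-generate pattern can be sketched without any external service. Here retrieval is plain word overlap over an in-memory list, which is an assumption made to keep the example self-contained; production systems usually rely on vector search, and the prompt wording is likewise illustrative.

```python
import re

def tokens(text: str) -> set:
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query: str, documents: list, k: int = 1) -> list:
    """Rank documents by the number of words they share with the query."""
    q = tokens(query)
    ranked = sorted(documents, key=lambda d: len(q & tokens(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query: str, documents: list, k: int = 1) -> str:
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

docs = [
    "The refund window for online orders is 30 days.",
    "Support is available on weekdays from 9 to 5.",
]
rag_prompt = build_rag_prompt("How long is the refund window?", docs)
print(rag_prompt)
```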
Category 5: Multi-Agent Collaboration Optimization.
Multiple LLM agents collaborate to complete tasks, achieving synergistic intelligence greater than the sum of its parts (1+1>2) through role division, information sharing, and feedback mechanisms.

Parameter-free optimization makes LLM agents “smarter,” more “adaptable,” and “lighter” without modifying the model itself.
Datasets and Benchmarks
The authors divide data and benchmarks into two main categories: those used for evaluation and those used for fine-tuning.
Evaluation tasks are divided into two types.
The first type consists of general evaluation tasks, categorized by domain, such as mathematical reasoning, question answering (QA), multimodal tasks, programming, etc.

The second type comprises multi-task evaluation benchmarks. These assess LLM-based agents across various tasks, testing their ability to generalize and adapt to different domains.

Agent fine-tuning datasets are built specifically for fine-tuning agents, with the aim of enhancing the capabilities of LLM agents across different tasks and environments.

Applications
As optimization methods continue to mature, LLM-based agents are emerging in various real-world scenarios, gradually moving from laboratories into practical applications in fields such as healthcare, science, finance, and programming.

Challenges and Future Directions
Data Bias Issues
Agents rely heavily on data quality; however, there is often a mismatch between the distribution of pre-training data and fine-tuning trajectories.
This, combined with the potential biases introduced by LLM self-generation and evaluation, can lead to performance instability.
Future research could explore methods such as bias testing, adversarial training, and knowledge boundary assessment to build a more robust data foundation.
Algorithmic Efficiency and Adaptability
Current reinforcement learning and fine-tuning methods struggle with high costs and poor effectiveness when facing sparse rewards, large action spaces, and multi-step interactions.
A key future focus will be enhancing the multi-turn capabilities of lightweight methods like DPO, or exploring hybrid training approaches combining RL and SFT, meta-learning, and self-supervised learning.
Difficulty in Cross-Task and Cross-Domain Transfer
Many methods perform well on single tasks but often fail in new environments or real-world scenarios.
There is a need to develop stronger generalization mechanisms, such as task distribution alignment, domain adaptation, and multi-task joint training, to improve model transferability and adaptability.
Lack of Unified Evaluation Standards
Agent evaluations use different metrics for different tasks (e.g., mathematical reasoning, web navigation, embodied AI), making cross-comparison difficult.
Establishing a unified evaluation benchmark that incorporates new dimensions such as reasoning complexity, adaptability, and preference scoring will drive agent research toward more systematic and comparable development.
Absence of Parameter-Driven Multi-Agent Optimization
Current multi-agent strategies largely rely on frozen LLMs, lacking joint parameter training mechanisms, which limits the development of collaborative intelligence.
Future work should explore joint fine-tuning for multiple agents, reward-sharing mechanisms, and hierarchical control strategies to enhance overall system capabilities and collaboration levels.
arXiv link:
https://arxiv.org/abs/2503.12434
GitHub link:
https://github.com/YoungDubbyDu/LLM-Agent-Optimization