By Shawn Tan

Mastering Agent Evaluation: Strategies for Measuring and Enhancing Performance

In our previous post, we delved into the fundamental aspects of building AI agents and the distinct capabilities that have emerged through the integration of Large Language Models (LLMs). Here, we'll explore the evaluation of these agents' performance as we think about deploying them in a production environment.

If you're developing your own agent, you likely need it to be efficient and reliable enough to trust with your business requirements. But how do you measure its success and improve its performance over time?

Measuring Success

Before we can trust an AI system or agent to successfully do its job, we need a way to measure and improve its success. To do so, we can look at two broad categories of performance metrics: task performance and goal performance.

Task Performance: Is it performing individual tasks well? Examples include:

  1. Picking the right tool: Can the system choose the best tool(s) for the job?

  2. Fixing errors: Can it correct its own errors and recover well?

  3. Accurate Reasoning: Can it effectively think through problems and reason through them?

  4. Learning from new information: Can it observe its own performance and adapt to changing environments?

Goal Performance: Is it achieving its domain-specific business objectives? Examples include:

  1. In Customer Service: Can it reduce the time taken to resolve support requests?

  2. In Financial Markets: Can it effectively maximize trading revenue while controlling risk?

  3. In Transportation: Can it safely arrive at destinations?

Task performance evaluates the effectiveness of individual steps or tasks, whereas goal performance assesses the overall outcome.
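To make the distinction concrete, here is a minimal sketch of how the two metric families might be computed from agent logs. The log fields and example values are illustrative assumptions, not a standard schema.

```python
# Task metrics score individual steps; goal metrics score whole episodes.
# The dictionaries below stand in for whatever logging your agent produces.

def tool_selection_accuracy(steps):
    """Task metric: fraction of steps where the agent picked the expected tool."""
    correct = sum(1 for s in steps if s["chosen_tool"] == s["expected_tool"])
    return correct / len(steps)

def mean_resolution_time(episodes):
    """Goal metric: average time (in minutes) to resolve a support request."""
    return sum(e["resolution_minutes"] for e in episodes) / len(episodes)

steps = [
    {"chosen_tool": "search", "expected_tool": "search"},
    {"chosen_tool": "calculator", "expected_tool": "search"},
    {"chosen_tool": "calculator", "expected_tool": "calculator"},
    {"chosen_tool": "search", "expected_tool": "search"},
]
episodes = [{"resolution_minutes": 12}, {"resolution_minutes": 8}]

print(tool_selection_accuracy(steps))  # 0.75
print(mean_resolution_time(episodes))  # 10.0
```

Tracking both families side by side helps diagnose failures: strong task metrics with weak goal metrics often point to a planning or orchestration problem rather than a tooling one.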

Ways to Improve Performance

Given that agents rely on LLMs for core functionality, we focus on improving the performance of LLM generation. There are three common ways to improve LLM performance (in order of least to most effort):

  1. Prompt Engineering - improve the structure of prompts so that models can understand them better

  2. Retrieval-Augmented Generation (RAG) - retrieve relevant information from a database and inject it into the prompt

  3. Fine-tuning - improve the LLM itself by making minor adjustments to its internal parameters with extra training data

Think of these techniques as tools that are part of an application builder’s toolkit. They improve LLM performance in different ways, and should be used in combination to build a production-level application.

For each method, we will give a short run-down of how to do it and when it is typically used.

Prompt Engineering

A comprehensive repository of resources can be found here.

Prompt engineering is more of an art than a science, depending heavily on the creativity and experimentation of prompt engineers (or AI engineers, data scientists, etc.) in crafting various ways to interact with Large Language Models (LLMs). While a well-designed prompt may or may not lead to impressive responses, an inadequate prompt is certain to yield unsatisfactory outcomes.

To elicit the best possible answers, it's crucial to tailor the structure of prompts to the specific challenges and models at hand. For instance, complex, indirect questions might benefit from the "Chain-of-Thought" prompting technique, which prompts the model to lay out its reasoning steps explicitly, enhancing the clarity and depth of its responses. Prompt engineering is typically the first strategy used when it comes to improving LLM performance.
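As an illustration, a Chain-of-Thought prompt can be as simple as a template that asks the model to show its reasoning before answering. The wording below is an illustrative assumption; sending the prompt to an actual model API is omitted.

```python
# Sketch of a Chain-of-Thought prompt template. The exact phrasing is one of
# many that work; the key idea is to request explicit reasoning steps.

def build_cot_prompt(question: str) -> str:
    return (
        "Answer the question below. Think step by step and show your "
        "reasoning before giving the final answer.\n\n"
        f"Question: {question}\n"
        "Reasoning:"
    )

prompt = build_cot_prompt(
    "A train leaves at 9:40 and arrives at 11:05. How long is the trip?"
)
print(prompt)
```

In practice you would pass the resulting string to your LLM of choice and, if needed, parse the final answer out of the reasoning text.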

Benefits of prompt engineering include:

  • Easy and Fast to Start: It's user-friendly and doesn't require deep technical knowledge. Anyone can start experimenting with different prompts quickly.

  • Immediate Feedback: You can quickly see the results of your changes. Modify your prompt, send it to the model, and receive instant answers, making it great for quick improvements.

But, prompt engineering quickly hits its limits:

  • Bounded Knowledge: Prompts that demand information beyond the model's training data cannot be fulfilled, and it's often unclear whether the LLM holds the necessary knowledge.

  • Not Always Transferable: Prompts that work well for one problem may not be effective for another. This lack of generalizability can be a challenge.

  • Context Window Limits: While current high-performance models offer very large context windows, they can still fall short in situations where the overall input is intricate and lengthy.

In summary, while prompt engineering is a straightforward and quick way to interact with LLMs, it's not a one-size-fits-all solution and comes with limitations in scalability and applicability to diverse problems.

Retrieval Augmented Generation (RAG)

Some useful resources for RAG: datacamp RAG tutorial, advanced RAG papers

RAG combines the power of LLMs with the ability to fetch external information from databases or the internet in real-time, thereby enriching its responses. This approach broadens the capabilities of LLMs, enabling them to retrieve and assimilate precise and up-to-date information beyond their pre-trained knowledge base.

The process involves two key steps:

  1. Retrieve relevant documents or data snippets from external sources based on the input prompt

  2. Use the retrieved information to augment prompts sent to the LLM to improve the final response

The effectiveness of RAG hinges on the quality of both the retrieval mechanism of the system and the generation capabilities of the underlying LLM. This approach is particularly useful in scenarios where the required information for a query might be too recent or too specialized for a standalone LLM to have learned during its training phase. For example, some chatbots might require recent customer data that was not in the LLM training data, while medical or legal use cases might require factual information with no room for errors.
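The two steps above can be sketched end to end. This toy version ranks documents by simple word overlap; a production system would use vector embeddings and a dedicated vector store, and the documents here are illustrative assumptions.

```python
# Minimal RAG sketch: (1) retrieve the most relevant document for a query,
# (2) inject it into the prompt sent to the LLM. Word-overlap scoring stands
# in for real embedding-based similarity search.

DOCS = [
    "Refund requests are processed within 5 business days.",
    "Premium subscribers get 24/7 phone support.",
    "Passwords must be reset every 90 days.",
]

def retrieve(query, docs, k=1):
    """Step 1: rank documents by word overlap with the query, return top k."""
    q_words = set(query.lower().split())
    ranked = sorted(
        docs,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def augment(query, docs):
    """Step 2: build a prompt that grounds the LLM in the retrieved context."""
    context = "\n".join(retrieve(query, docs))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(augment("How long do refund requests take?", DOCS))
```

The final string would then be sent to the LLM; instructing the model to answer "using only the context" is a common way to reduce hallucination.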

Benefits of RAG include:

  • Up-to-date Insights: Access to latest information, going beyond the LLM's training cut-off date.

  • Enhanced accuracy: Highly relevant and correct responses by leveraging domain-specific data sources.

  • Expanded expertise: Ability to manage a broader spectrum of queries with factual or data-driven requirements.

However, RAG also presents its own set of challenges:

  • Response Quality: The effectiveness of its responses hinges greatly on the quality and presence of external data sources.

  • Risk of Delays: The extra retrieval step, which may involve querying large databases or the internet, is resource-intensive and can add latency to responses.

  • Setup Complexity: The intricacies of implementing and integrating with current LLM frameworks can be complicated.


Fine-tuning

Some useful resources for fine-tuning: extensive hands-on examples of fine-tuning

Pre-trained models such as GPT and Llama excel in general tasks due to their training on extensive datasets. However, they may falter with tasks that demand deeper context. This is where it is crucial to fine-tune the models. Fine-tuning refers to a subset of transfer learning methods applicable to neural networks — including LLMs — that entails adjusting a model by retraining it on a smaller, more focused dataset.

These foundational models are typically trained on vast amounts of text data sourced from the internet, leading to challenges when dealing with esoteric content. For instance, legal documents can be perplexing for these models because legal terminology, though similar to everyday language, often carries different meanings. Similarly, older programming languages, which are fading in popularity and thus have less web presence, present another area of difficulty. Fine-tuning a model can be a very powerful technique.
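As a toy numerical analogy for what fine-tuning does, the sketch below starts from "pretrained" parameters and makes small gradient-descent adjustments on a new, focused dataset. Real LLM fine-tuning operates on billions of parameters with frameworks such as Hugging Face Transformers; this one-parameter example only illustrates the idea.

```python
# "Pretrained" model: y = w * x, with w learned on general data.
w = 2.0

# Small domain-specific dataset where the true relationship is y = 2.5 * x.
domain_data = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]

# Continue training briefly on the new data (this is the "fine-tuning" step).
lr = 0.01
for _ in range(200):
    for x, y in domain_data:
        grad = 2 * (w * x - y) * x  # gradient of squared error w.r.t. w
        w -= lr * grad

print(round(w, 2))  # converges toward 2.5
```

The pretrained starting point means only small, cheap updates are needed to adapt to the new domain, which is exactly the appeal of fine-tuning over training from scratch.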

Benefits of fine-tuning include:

  • Minimal data requirement: Fine-tuning requires significantly less data than training an LLM from scratch.

  • Computational efficiency: Utilizing an existing model as a foundation enables the efficient training of a high-performing model with reduced computational resources.

  • Knowledge expansion: Fine-tuning serves as a method for 'teaching' the model new insights through the integration of fresh data.

But not without its challenges:

  • Data Preparation: Similar to all machine learning initiatives, curating and preparing relevant training data poses significant challenges.

  • Hardware Requirements: The size of models and training methodologies demand significant RAM and GPU resources which can be expensive and difficult to acquire.

In summary, enhancing model performance can be achieved through various strategies, each presenting unique challenges and ideal use cases. Choosing the best combination of these methods to optimize performance can feel more like an art than a science, requiring experience and instinct to realize maximum performance gains with minimum effort. These approaches should be considered in tandem, with several likely to be employed concurrently in a production environment.

In the next article, we will delve deeper into model fine-tuning, exploring methods and considerations, common challenges faced, and approaches to autonomously generating training datasets.

We welcome you to share your insights with us, and if you're seeking solutions, we’d love to hear from you. Contact us at
