Summary
Evaluation tells you whether the agent is good enough for a task. Observability tells you why it passed or failed. Production systems need both, because scores without traces are hard to improve and traces without metrics are hard to prioritize. This page covers general evaluation and observability for agent systems, with a dedicated section on why LLM-based agents need a different evaluation mindset than traditional software.
Why It Matters
Agent systems are probabilistic and multi-step. That makes them harder to judge than deterministic software.
- A correct answer may depend on search, tools, or file state.
- The same task can fail for very different reasons.
- A prompt or model change can improve one capability while silently harming another.
In practice, that calls for two loops:
- an evaluation loop for measuring capability
- a diagnostic loop for explaining behavior
Why LLM Evaluation Is Different
Evaluating an agent system is not the same as testing a function. Four properties of LLM-based systems break traditional test assumptions.
Nondeterministic outputs. The same prompt can produce different valid answers across runs. Tests that assert an exact string match will flake. You need tolerance for semantic equivalence, not just syntactic identity. See LLM Foundations for Agent Systems for the underlying model behavior.
Probabilistic success. An agent may succeed on 8 out of 10 runs and fail twice for reasons that are not reproducible. Evaluation shifts from “does it pass?” to “what fraction of runs succeed, and under what conditions?” This means reporting confidence intervals, not just point scores.
Tool-call variance. Two correct trajectories may call tools in different orders, use different argument combinations, or skip steps that another path includes. Structured correctness checks must compare intent and outcome, not just the literal call sequence. See Context Engineering for how tool selection relates to prompt and retrieval design.
Ambiguous criteria. For open-ended tasks (summarization, drafting, creative synthesis), there is often no single correct answer. Multiple responses can be equally valid. This makes binary pass/fail insufficient and pushes you toward rubric-based scoring, pairwise comparison, or human judgment.
Evaluation Approach Map
No single evaluation method covers every task. Match the approach to what you are measuring.
| Approach | When to use | What it checks |
|---|---|---|
| Exact / structured checks | Tool calling, JSON schemas, output format | Correct function selected, correct parameter types, valid schema |
| Reference-based comparison | Retrieval-augmented tasks, factual QA | Output matches a known-good reference (exact, fuzzy, or semantic similarity) |
| Rubric scoring | Open-ended generation, summarization, reasoning | Human or judge model scores output against a multi-point rubric |
| Pairwise comparison | Model or prompt selection | Given two outputs, a human or judge picks the better one |
| LLM-as-a-judge | High-volume automated scoring where human review is too slow | A second LLM evaluates output quality using a structured prompt |
| Human review | High-stakes, novel tasks, calibration of automated methods | Domain expert rates output quality; used to validate and calibrate judges |
| Online telemetry | Production monitoring | Success rate, latency, escalation rate, retries, user feedback signals |
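Because agent success is probabilistic, most of the approaches in this table end up producing a pass rate over repeated runs, not a single verdict. The sketch below shows one way to report that rate with a confidence interval, using the Wilson score interval. The function name `wilson_interval` and the choice of the Wilson method are illustrative assumptions; this page does not prescribe a specific interval.

```python
import math


def wilson_interval(successes: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (hypothetical helper, not from this page).

    z=1.96 corresponds to roughly 95% confidence.
    """
    if runs <= 0:
        raise ValueError("runs must be positive")
    if not 0 <= successes <= runs:
        raise ValueError("successes must be between 0 and runs")

    p = successes / runs
    denom = 1 + z**2 / runs
    center = (p + z**2 / (2 * runs)) / denom
    half_width = z * math.sqrt(p * (1 - p) / runs + z**2 / (4 * runs**2)) / denom
    # Clamp to [0, 1] to absorb floating-point drift at the extremes.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Example: the agent passed 8 of 10 repeated runs of the same task.
low, high = wilson_interval(successes=8, runs=10)
print(f"pass rate 0.80, approx. 95% interval [{low:.2f}, {high:.2f}]")
```

With only 10 runs the interval spans roughly 0.49 to 0.94, which shows why a point score of 80% from a small sample is weak evidence for or against a change.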
Human-in-the-Loop vs LLM-as-a-Judge
Both approaches have a role. Knowing where each is strongest avoids the two common failures: relying on human review that is too slow to catch regressions, or relying on a judge model that is not calibrated to your domain.
Human-in-the-loop is strongest when:
- the task is high-stakes (medical, legal, financial advice)
- criteria are novel or poorly defined and you are still learning what “good” looks like
- you need to calibrate or audit an automated judge
- edge cases are rare but critical (safety, bias, harmful output)
LLM-as-a-judge is strongest when:
- you need to score hundreds or thousands of runs per change
- criteria are well-defined and stable enough to encode in a prompt
- you are running regression tests on every commit or deployment
- the judge model can be validated against a held-out human-labeled set
Whichever mix you use, record:
- which runs were human-reviewed vs judge-scored
- judge agreement rates on calibration samples
- any overrides where a human disagreed with the judge
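The three bookkeeping items above can be computed from a simple per-run record. This is a minimal sketch under assumptions: the `ScoredRun` dataclass, its field names, and the `calibration_report` function are hypothetical, and verdicts are reduced to pass/fail for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoredRun:
    """One evaluated run (hypothetical record shape)."""
    run_id: str
    judge_pass: bool
    human_pass: Optional[bool] = None  # None means no human reviewed this run


def calibration_report(runs: list[ScoredRun]) -> dict:
    """Summarize review coverage, judge/human agreement, and human overrides."""
    reviewed = [r for r in runs if r.human_pass is not None]
    if not reviewed:
        raise ValueError("no human-reviewed runs to calibrate against")

    agreements = sum(r.judge_pass == r.human_pass for r in reviewed)
    return {
        "total_runs": len(runs),
        "human_reviewed": len(reviewed),
        "judge_only": len(runs) - len(reviewed),
        "agreement_rate": agreements / len(reviewed),
        # Runs where the human verdict differed from the judge verdict.
        "overrides": [r.run_id for r in reviewed if r.judge_pass != r.human_pass],
    }


runs = [
    ScoredRun("run-1", judge_pass=True, human_pass=True),
    ScoredRun("run-2", judge_pass=True, human_pass=False),
    ScoredRun("run-3", judge_pass=False, human_pass=False),
    ScoredRun("run-4", judge_pass=True, human_pass=True),
    ScoredRun("run-5", judge_pass=True),  # judge-scored only
]
report = calibration_report(runs)
print(report)
```

Raw agreement is the simplest signal. On a calibration set that is mostly passes, it can look high even when the judge misses most failures, so it is worth reading the override list alongside the rate.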
Mental Model
Think in three layers.
- offline evaluation: benchmark-style checks run on known tasks to compare prompts, models, tools, and policies.
- online evaluation: production signals such as success rate, latency, escalation rate, retries, or human overrides.
- observability: traces, tool logs, state transitions, and artifacts that show what the system actually did.
Different task types also call for different checks:
- Tool use often needs structured correctness checks such as function and parameter matching.
- General assistant tasks often need answer-level correctness plus task-level completion.
- Data generation or synthesis tasks may need comparative review, judge models, or human verification.
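For the tool-use case, function and parameter matching can be sketched as a comparison that ignores call order and argument key order. The call shape (`{"name": ..., "args": {...}}`) and the helper names `normalize_call` and `tool_calls_match` are assumptions for illustration, not a format defined by this page or by any particular framework.

```python
import json
from collections import Counter


def normalize_call(call: dict) -> str:
    """Canonical string for one tool call: function name plus key-sorted arguments."""
    return json.dumps(
        {"name": call["name"], "args": call.get("args", {})},
        sort_keys=True,
    )


def tool_calls_match(expected: list[dict], actual: list[dict], ordered: bool = False) -> bool:
    """Check that the same calls were made with the same arguments.

    By default the order of calls is ignored, so two trajectories that
    call the same tools in a different sequence still match.
    """
    exp = [normalize_call(c) for c in expected]
    act = [normalize_call(c) for c in actual]
    if ordered:
        return exp == act
    return Counter(exp) == Counter(act)


expected = [
    {"name": "search", "args": {"query": "q3 revenue", "limit": 5}},
    {"name": "read_file", "args": {"path": "report.md"}},
]
# Same calls, different order, and argument keys written in a different order.
actual = [
    {"name": "read_file", "args": {"path": "report.md"}},
    {"name": "search", "args": {"limit": 5, "query": "q3 revenue"}},
]

print(tool_calls_match(expected, actual))                # True
print(tool_calls_match(expected, actual, ordered=True))  # False
```

This still requires the same set of calls on both sides. A trajectory that legitimately skips a step would fail here, so a check like this is best paired with an outcome-level check on the final answer or state.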
Architecture Diagram
Framework Landscape
A compact comparison of common evaluation and observability frameworks. This is not exhaustive; it covers the three most frequently referenced in agent-system work.
| Framework | Primary focus | Key strengths | Source |
|---|---|---|---|
| Langfuse | Observability and eval platform | Open-source, trace-first design, supports LLM-as-a-judge scoring, prompt versioning, dataset management | langfuse.com |
| LangSmith | Observability and eval for LangChain ecosystem | Tight LangChain/LangGraph integration, trace visualization, automated evaluation workflows, dataset curation | smith.langchain.com |
| Ragas | RAG-specific evaluation metrics | Metrics for faithfulness, answer relevance, context precision, and context recall; framework-agnostic | docs.ragas.io |
Tool Landscape
The imported reference material highlights three useful evaluation shapes:
- benchmark-style tool-use evaluation, where structured matching checks whether the agent selected the right function and arguments
- general-assistant evaluation, where tasks require multi-step reasoning and broader success judgments
- generation-quality evaluation, where relative comparison or human review is often more useful than one exact metric
For observability, a few logging practices apply regardless of tooling:
- Keep full tool inputs and outputs.
- Preserve failure records rather than collapsing them into generic errors.
- Track step order, retries, and state changes.
- Keep traces readable by both humans and machines.
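One way to satisfy these trace-keeping points is a structured event per tool attempt, written as one JSON object per line. The `TraceEvent` dataclass and all of its field names below are hypothetical; they are not the schema of Langfuse, LangSmith, or any other tool named on this page.

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import Any, Optional


@dataclass
class TraceEvent:
    """One tool attempt within a run (hypothetical schema)."""
    run_id: str
    step: int                      # position in the trajectory
    tool: str
    tool_input: dict               # full input, not a summary
    tool_output: Any = None        # full output when the call succeeded
    error: Optional[dict] = None   # specific failure details, kept as-is
    attempt: int = 1               # 1 for the first try, 2+ for retries
    state_change: dict = field(default_factory=dict)
    timestamp: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize to a single JSON line; non-JSON values fall back to str()."""
        return json.dumps(asdict(self), sort_keys=True, default=str)


events = [
    # First attempt failed: the error type and message are preserved.
    TraceEvent(
        run_id="run-1", step=1, tool="search",
        tool_input={"query": "q3 revenue"},
        error={"type": "TimeoutError", "message": "no response after 10s"},
    ),
    # Retry of the same step succeeded and changed agent state.
    TraceEvent(
        run_id="run-1", step=1, tool="search",
        tool_input={"query": "q3 revenue"},
        tool_output={"hits": 3}, attempt=2,
        state_change={"documents_found": 3},
    ),
]

lines = [event.to_json_line() for event in events]
for line in lines:
    print(line)
```

JSON lines keep each event greppable by a person and parseable by a script, and recording the failed attempt separately from the retry keeps the failure visible even though the step eventually succeeded.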
Tradeoffs
- Offline benchmarks are useful, but they can overfit the system to lab tasks that are cleaner than production reality.
- Online metrics reflect real usage, but they lag and are noisy without good segmentation.
- Judge-model evaluation scales well, but it still needs human calibration.
- Rich traces improve diagnosis, but they create storage, privacy, and review overhead.
- Pairwise comparison is easy for humans but does not give you an absolute score; use it for ranking, not for release gating.
- LLM-as-a-judge is fast but inherits the judge model’s biases; a judge that prefers verbose outputs will systematically over-score wordy generations.
Working rules that follow from these tradeoffs:
- evaluate the capability you are actually changing
- keep traces for both failed and successful runs
- review failure modes before rewriting prompts
- do not ship “tool failed” as the only explanation developers can see
- maintain a calibration set and re-check judge agreement after every model change
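The verbosity bias mentioned in the tradeoffs can be probed cheaply on a calibration set by checking whether judge scores rise with output length. The sketch below uses a plain Pearson correlation; the sample shape (`output`, `judge_score`) and the function names are assumptions for illustration.

```python
import math


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def length_score_correlation(samples: list[dict]) -> float:
    """Correlate output length in words with the judge's score."""
    lengths = [float(len(s["output"].split())) for s in samples]
    scores = [float(s["judge_score"]) for s in samples]
    return pearson(lengths, scores)


# Toy calibration samples: placeholder outputs of increasing length.
samples = [
    {"output": " ".join(["word"] * 2), "judge_score": 2},
    {"output": " ".join(["word"] * 5), "judge_score": 3},
    {"output": " ".join(["word"] * 9), "judge_score": 4},
    {"output": " ".join(["word"] * 14), "judge_score": 5},
]
r = length_score_correlation(samples)
print(f"length vs judge score correlation: {r:.2f}")
```

A strong positive correlation is only a hint, since longer answers can be genuinely better. Comparing the same correlation against human scores on the calibration set helps separate real quality from judge preference.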
Reading Extensions
- LLM Foundations for Agent Systems
- Context Engineering
- Deep Research Agents
- Protocols And Interoperability
- Systems Overview
Update Log
- 2026-05-18: Major revision adding LLM-specific evaluation methodology, approach map, human-in-the-loop vs LLM-as-a-judge, and framework landscape.
- 2026-04-21: Initial repo-native draft based on imported reference material and lab rewrite rules.
