Asako Hayase
Technical Tutorial

Observability and Evaluation for AI Agents: A Practical Guide

This guide walks through a practical observability and evaluation setup with Strands Agents and Arize AX.

October 8, 2025

Observability Foundation: The Four Pillars

Before diving into implementation, understand that effective agent observability requires four components:

The Four Pillars

Tracing: Understanding Agent Behavior

Trace vs Span:

  • Trace: The entire journey (like a complete conversation turn)
  • Span: Individual steps within that journey (like "search memory", "call tool", "generate response")

e.g. Trace: User asks "Recommend a comedy movie"
├── Span 1: Parse user intent
├── Span 2: Search memory for comedy preferences
├── Span 3: Call movie recommendation tool
├── Span 4: Filter results based on preferences
└── Span 5: Generate final response

This visibility is crucial for debugging why agents make certain decisions and where failures occur.
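
To make the trace/span relationship concrete, here is a minimal sketch using the OpenTelemetry Python SDK with a console exporter. The span names mirror the example above and are purely illustrative; they are not the spans Strands Agents actually emits, and the Arize exporter used later in this guide is omitted here.

# Minimal sketch: one trace (a conversation turn) made of nested spans.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor, ConsoleSpanExporter

provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("movie-agent-demo")

with tracer.start_as_current_span("conversation_turn") as turn:  # root span of the trace
    turn.set_attribute("user.query", "Recommend a comedy movie")
    with tracer.start_as_current_span("parse_user_intent"):
        pass  # intent parsing would happen here
    with tracer.start_as_current_span("search_memory"):
        pass  # look up stored comedy preferences
    with tracer.start_as_current_span("call_recommendation_tool"):
        pass  # tool invocation
    with tracer.start_as_current_span("generate_response"):
        pass  # final LLM response

Each nested with block becomes a child span of conversation_turn, which is exactly the parent/child structure the trace tree above visualizes.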

Evaluation Approaches: The Four Quadrants

(Ideas summarized from the DeepLearning.AI short course on Evaluating AI Agents, created in collaboration with Arize AI)

Evaluation techniques fall into four categories based on two key dimensions:

Deterministic vs Non-Deterministic

  • Deterministic: Consistent, repeatable results every time
  • Non-Deterministic: Results may vary between runs

Flexible vs Inflexible Criteria

  • Flexible: Can handle qualitative, subjective assessments
  • Inflexible: Requires quantifiable, specific criteria

In practice, here’s how these play out:

  • Human Labels (Non-Deterministic + Flexible): Best for subjective quality, tone evaluation
  • LLM-as-a-Judge (Non-Deterministic + Flexible): Scalable quality assessment, complex reasoning
  • Code-based (Deterministic + Inflexible): Format validation, exact matching, measurable criteria
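
As a concrete illustration of the deterministic + inflexible quadrant, a code-based evaluator can be a plain function that checks an exact match or a required format. The checks below are a generic sketch, not part of Strands Agents or Arize AX.

import re

def exact_match(output: str, expected: str) -> bool:
    # Deterministic check: the normalized output must equal the expected answer.
    return output.strip().lower() == expected.strip().lower()

def valid_rating_format(output: str) -> bool:
    # Deterministic check: the output must contain a rating like "4/5".
    return re.search(r"\b[1-5]/5\b", output) is not None

print(exact_match("Spirited Away ", "spirited away"))  # True
print(valid_rating_format("I'd give it a 4/5"))        # True

The same inputs always produce the same result, which is what makes this quadrant repeatable; the trade-off is that it cannot judge qualitative properties like tone or helpfulness.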

What Can Go Wrong: Common Evaluation Areas

Agent evaluation covers multiple critical areas where failures commonly occur:

  1. Router Evaluation: Do agents choose the right tools and extract correct parameters?
    1. Function Calling Choice: Did it choose the right tool?
    2. Parameter Extraction: Did it extract correct parameters?
  2. Content Quality: Are responses helpful, accurate, and appropriate?
    1. Hallucinations, factual errors, inappropriate tone
  3. Memory & Context Usage: Do agents use available information effectively?
    1. Ignoring user preferences, failing to use retrieved context (RAG)
  4. Task Completion: Do agents actually accomplish what users asked for?
    1. Overall correctness, following instructions properly

Common failures cascade through your entire system, leading to poor user experiences.
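
For router evaluation in particular, a deterministic check can compare the tool call recorded in a trace against the tool and parameters the scenario expected. The dictionary shape below is hypothetical and only sketches the idea; real span attributes depend on your tracing setup.

def evaluate_router(tool_span: dict, expected_tool: str, expected_params: dict) -> dict:
    # Function calling choice: did the agent pick the expected tool?
    correct_tool = tool_span.get("tool_name") == expected_tool
    # Parameter extraction: every expected key/value must appear in the call.
    actual_params = tool_span.get("parameters", {})
    correct_params = all(actual_params.get(k) == v for k, v in expected_params.items())
    return {"function_calling_choice": correct_tool, "parameter_extraction": correct_params}

result = evaluate_router(
    {"tool_name": "recommend_movies", "parameters": {"genre": "comedy"}},
    expected_tool="recommend_movies",
    expected_params={"genre": "comedy"},
)
print(result)  # {'function_calling_choice': True, 'parameter_extraction': True}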

Five-Step Evaluation Framework

(Reference: Operationalizing Generative AI on Vertex AI using MLOps)

  1. Create Evaluation Dataset: Essential, average, and edge cases
  2. Define Metrics: Objective (accuracy, format) and subjective (tone, helpfulness)
  3. Generate Responses: Run agent systematically against test cases
  4. Run Evaluation: Manual, automated, or hybrid approaches
  5. Interpret Results: Identify patterns, guide improvements
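
Before the full implementation below, here is a compressed sketch of how the five steps fit together. The run_agent and judge_response functions are placeholder stubs for whatever agent and evaluator you plug in, and the scenarios file is assumed to look like the movie_evaluation_scenarios.json shown later.

import json

def run_agent(messages, query):
    # Placeholder for your agent call (e.g., a Strands Agents assistant).
    return "stub response"

def judge_response(response, expected_quality):
    # Placeholder for a code-based or LLM-as-a-Judge evaluator returning 1-5.
    return 3

# Step 1: load the evaluation dataset (essential, average, and edge cases).
with open("movie_evaluation_scenarios.json") as f:
    scenarios = json.load(f)

results = []
for scenario in scenarios:
    # Step 3: generate responses by running the agent against each test case.
    response = run_agent(scenario["input"], scenario["evaluation_query"])
    # Steps 2 and 4: score the response against the defined metric.
    score = judge_response(response, scenario["expected_response_quality"])
    results.append({"scenario_id": scenario["scenario_id"], "score": score})

# Step 5: interpret results and look for patterns across scenarios.
for r in results:
    print(r)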

Implementation: Movie Recommendation Agent

Let's implement tracing and evaluation for a movie recommendation agent I built here: https://www.asakohayase.com/blog/movie-recommendation-memory-strands-agents

If you haven’t installed the project yet, please complete steps 1 and 2 of the “Installation” section in the previous blog. I will use LLM-as-a-Judge for evaluation to enable scalable, consistent assessment of recommendation quality without manual review overhead.

Three Test Scenarios:

Scenario 1: Essential - Positive Title Preference

Conversation Flow:

User: I love Spirited Away
Agent: Rate the title on a scale of 1-5
User: 5 stars
Agent: [Stores 5-star rating for Spirited Away and positive preference for anime & fantasy genres]
User: Recommend something for tonight
Agent: [Should recommend animated fantasy films similar to Spirited Away]

Expected Response Quality: Should recommend animated fantasy films that match the user's expressed love for Spirited Away.

Scenario 2: Average - Positive Genre Preference

Conversation Flow:

User: I love comedies, especially romantic ones
Agent: Rate the genre on a scale of 1-5
User: 5 stars for rom-coms
Agent: [Stores 5-star rating for romantic comedy genre and positive preference for comedy]
User: What's good for date night?
Agent: [Should prioritize romantic comedies based on stored preferences and date night context]

Expected Response Quality: Should prioritize romantic comedies, recognizing both the genre preference and the date night context.

Scenario 3: Edge Case - Negative Title Preference with Contradictory Request

Conversation Flow:

User: I didn't like The Matrix
Agent: Rate the title on a scale of 1-5
User: 2 stars
Agent: [Stores 2-star rating for The Matrix and negative preference for sci-fi action genres]
User: Recommend action movies
Agent: [Must navigate contradiction: user wants action movies but dislikes a sci-fi action film]

Expected Response Quality: Should recommend action movies but avoid The Matrix series. The agent must balance the negative preference with the genre request.

movie_evaluation_scenarios.json
[
  {
    "scenario_id": 1,
    "description": "Essential - positive title preference",
    "input": [
      "I love Spirited Away",
      "5 stars"
    ],
    "evaluation_query": "Recommend something for tonight",
    "expected_response_quality": "Should recommend animated fantasy films"
  },
  {
    "scenario_id": 2,
    "description": "Average - positive genre preference",
    "input": [
      "I love comedies, especially romantic ones",
      "5 stars for rom-coms"
    ],
    "evaluation_query": "What's good for date night?",
    "expected_response_quality": "Should prioritize romantic comedies"
  },
  {
    "scenario_id": 3,
    "description": "Edge case - negative title preference with contradictory request",
    "input": [
      "I didn't like The Matrix",
      "2 stars"
    ],
    "evaluation_query": "Recommend action movies",
    "expected_response_quality": "Should recommend action movies but avoid The Matrix series"
  }
]

Evaluation Dimension: Response accuracy

Arize AX Implementation

I'll send tracing data to Arize AX by converting Strands Agents' native telemetry to OpenInference format (Arize's open standard for LLM observability) using a custom processor and OpenTelemetry.

Tracing:

Prerequisites:

  1. Install dependencies
uv add opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc

2. Save the strands_to_openinference_mapping.py file from the OpenInference-Arize repository to your project directory.

3. Set up cost tracking

  1. Go to Settings > Cost Configs > User Defined Costs in Arize AX
  2. Check whether your models are already listed. If they are, you're done; cost tracking works automatically.
  3. If any of your models are missing, go to User Defined Cost and click Add Model
  4. Enter Model Name, Prompt Tokens, and Completion Tokens
eval_arize_tracing.py
import os
import json
import uuid
from dotenv import load_dotenv
from main import MovieRecommendationAssistant

# OpenTelemetry imports for manual setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from strands_to_openinference_mapping import StrandsToOpenInferenceProcessor

# Load environment variables
load_dotenv()


def run_arize_tracing():
    """Arize AX tracing - official manual OpenTelemetry setup"""

    # Get Arize credentials from .env file
    space_id = os.getenv("ARIZE_SPACE_ID")
    api_key = os.getenv("ARIZE_API_KEY")

    if not space_id or not api_key:
        print("Error: Set ARIZE_SPACE_ID and ARIZE_API_KEY in .env file")
        return

    # Create the Strands to OpenInference processor
    strands_processor = StrandsToOpenInferenceProcessor(debug=True)

    # Create resource with project name
    resource = Resource.create(
        {
            "model_id": "strands-agents-memory-tracing",
            "service.name": "strands-agent-integration",
        }
    )

    # Create tracer provider and add processors
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(strands_processor)

    # Create OTLP exporter for Arize
    otlp_exporter = OTLPSpanExporter(
        endpoint="otlp.arize.com:443",
        headers={"space_id": space_id, "api_key": api_key},
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

    # Set the global tracer provider
    trace.set_tracer_provider(provider)

    # Load test scenarios from JSON file
    with open("movie_evaluation_scenarios.json", "r") as f:
        scenarios = json.load(f)

    for scenario in scenarios:
        print(f"\n{'=' * 60}")
        print(f"SCENARIO {scenario['scenario_id']}: {scenario['description']}")
        print(f"{'=' * 60}")

        # Create fresh assistant with unique UUID for each scenario
        user_id = str(uuid.uuid4())
        assistant = MovieRecommendationAssistant(user_id=user_id)

        print(f"Using user_id: {user_id}")

        # Set STRANDS_AGENT_SYSTEM_PROMPT for the processor
        os.environ["STRANDS_AGENT_SYSTEM_PROMPT"] = assistant.agent.system_prompt

        # Add trace attributes for better organization in Arize
        assistant.agent.trace_attributes = {
            "session.id": f"scenario-{scenario['scenario_id']}-{user_id[:8]}",
            "user.id": user_id,
            "scenario.id": scenario["scenario_id"],
            "arize.tags": [
                "Agent-SDK",
                "Arize-Project",
                "OpenInference-Integration",
            ],
        }

        # Run each input message in the scenario
        for user_input in scenario["input"]:
            print(f"\nUser: {user_input}")
            assistant.agent(user_input)

        # Run evaluation query
        eval_query = scenario["evaluation_query"]
        print(f"\nEvaluation Query: {eval_query}")
        assistant.agent(eval_query)

    print(f"\nView traces at: https://app.arize.com")
    print(f"Project: strands-agents-memory-tracing")


if __name__ == "__main__":
    run_arize_tracing()

Run: uv run eval_arize_tracing.py

Key Points:

  • StrandsToOpenInferenceProcessor(debug=True): Converts Strands native telemetry to the OpenInference format that Arize understands
  • Resource.create(): Defines project metadata (model_id = project name in Arize dashboard)
  • TracerProvider: Core OpenTelemetry component that manages all tracing
  • Dual processors:
    • provider.add_span_processor(strands_processor) - First processor converts Strands format to OpenInference
    • provider.add_span_processor(BatchSpanProcessor(otlp_exporter)) - Second processor batches and exports traces to Arize
  • OTLPSpanExporter: Sends traces to otlp.arize.com:443 via gRPC protocol
  • trace.set_tracer_provider(): Sets global tracer so Strands automatically uses it
  • trace_attributes: Added to agent for filtering/grouping (session.id, user.id, arize.tags)

Projects View

Trace Detail View

Attributes Tab

LLM-as-a-Judge:

eval_arize_llm_as_a_judge.py
import os
import json
import uuid
import pandas as pd
from dotenv import load_dotenv
from main import (
    MovieRecommendationAssistant,
)  # Custom assistant for movie recommendations

# Load environment variables from .env file
load_dotenv()

# Try importing Arize dependencies
try:
    from arize.experimental.datasets import ArizeDatasetsClient
    from arize.experimental.datasets.experiments.types import EvaluationResult
    from arize.experimental.datasets.utils.constants import GENERATIVE
    from phoenix.evals import llm_classify, OpenAIModel
    from openinference.instrumentation import suppress_tracing

    ARIZE_AVAILABLE = True
except ImportError as e:
    ARIZE_AVAILABLE = False
    print('Install: uv add "arize[Datasets]" arize-phoenix openai pandas')
    print(f"Error: {e}")


def movie_task(dataset_row):
    """
    Task function executed for each row in the dataset.
    Input: dataset_row containing scenario input messages and evaluation query.
    Output: final response text from the AI assistant.
    """
    try:
        input_messages = json.loads(
            dataset_row.get("input", "[]")
        )  # Load scenario input messages
        eval_query = dataset_row.get(
            "evaluation_query", ""
        )  # Final query for evaluation

        user_id = str(uuid.uuid4())  # Unique ID for this session
        assistant = MovieRecommendationAssistant(user_id=user_id)

        # Execute all input messages in the scenario
        for message in input_messages:
            assistant.agent(message)

        # Execute the final evaluation query
        result = assistant.agent(eval_query)

        # Extract text from the agent's structured message
        response_text = result.message["content"][0]["text"]

        return response_text

    except Exception as e:
        return f"Task failed: {str(e)}"


def quality_evaluator(output, dataset_row):
    """
    Evaluator function to score response quality.
    Input: output from task and dataset_row.
    Output: EvaluationResult containing score, label, and explanation.
    """
    template = """
    Evaluate the quality of movie recommendations based on the conversation context.

    Scenario Description: {description}
    Conversation Input: {input}
    Agent Response: {output}
    Expected Quality: {expected_quality}

    Based on the conversation context, score how well the agent's recommendations
    align with the user's expressed preferences:

    5 = Excellent - Perfect alignment with user preferences
    4 = Good - Strong alignment with minor gaps
    3 = Adequate - Some alignment but could be better
    2 = Poor - Minimal alignment with preferences
    1 = Terrible - No alignment or inappropriate recommendations

    Respond with: 1, 2, 3, 4, or 5
    """

    try:
        description = dataset_row.get("description", "")
        input_data = dataset_row.get("input", "[]")
        expected_quality = dataset_row.get("expected_quality", "")

        df = pd.DataFrame(
            [
                {
                    "description": description,
                    "input": input_data,
                    "output": output,
                    "expected_quality": expected_quality,
                }
            ]
        )

        with suppress_tracing():
            result = llm_classify(
                data=df,
                template=template,
                model=OpenAIModel(model="gpt-4o-mini"),
                rails=["1", "2", "3", "4", "5"],
                provide_explanation=True,
            )

        label = result["label"][0]
        score = int(label)
        explanation = result.get("explanation", [""])[0]

        return EvaluationResult(score=score, label=str(score), explanation=explanation)

    except Exception as e:
        return EvaluationResult(score=0.0, label="0", explanation=f"Failed: {str(e)}")


def run_arize_llm_evaluation():
    """
    Main function to run LLM-as-a-judge experiments.
    Steps:
    1. Initialize Arize client.
    2. Load scenarios from JSON.
    3. For each scenario:
       - Create dataset.
       - Run task to generate output.
       - Run evaluators to score output.
    4. Collect and print results.
    """
    if not ARIZE_AVAILABLE:
        print("Missing Arize dependencies")
        return

    api_key = os.getenv("ARIZE_API_KEY")
    developer_key = os.getenv("ARIZE_DEVELOPER_KEY")
    space_id = os.getenv("ARIZE_SPACE_ID")

    if not (api_key or developer_key) or not space_id:
        print("Missing ARIZE_API_KEY/ARIZE_DEVELOPER_KEY and ARIZE_SPACE_ID")
        return

    try:
        client = (
            ArizeDatasetsClient(developer_key=developer_key)
            if developer_key
            else ArizeDatasetsClient(api_key=api_key)
        )
        print("Arize client initialized")
    except Exception as e:
        print(f"Client init failed: {e}")
        return

    with open("movie_evaluation_scenarios.json", "r") as f:
        scenarios = json.load(f)

    print(f"Running experiments for {len(scenarios)} scenarios...")

    experiment_results = []

    for scenario in scenarios:
        scenario_id = scenario["scenario_id"]
        print(f"\n=== Scenario {scenario_id} ===")

        row = {
            "id": scenario_id,
            "description": scenario["description"],
            "input": json.dumps(scenario["input"]),
            "evaluation_query": scenario["evaluation_query"],
            "expected_quality": scenario["expected_response_quality"],
        }

        single_row_df = pd.DataFrame([row])
        scenario_dataset_name = f"scenario_{scenario_id}_{uuid.uuid4().hex[:8]}"

        # Create dataset for this scenario
        scenario_dataset_id = client.create_dataset(
            space_id=space_id,
            dataset_name=scenario_dataset_name,
            dataset_type=GENERATIVE,
            data=single_row_df,
        )
        print(f"Created dataset: {scenario_dataset_name}")

        # Run experiment: generate output and score it
        result = client.run_experiment(
            space_id=space_id,
            dataset_id=scenario_dataset_id,
            task=movie_task,  # Generates output
            evaluators=[quality_evaluator],  # Score output quality only
            experiment_name=f"scenario_{scenario_id}_{uuid.uuid4().hex[:8]}",
            exit_on_error=False,
        )

        if isinstance(result, tuple):
            experiment_id, results_df = result
            print(f"Scenario {scenario_id} completed! Experiment ID: {experiment_id}")
            if not results_df.empty:
                quality_score = results_df.iloc[0].get(
                    "eval.quality_evaluator.score", "N/A"
                )
                print(f"Quality Score: {quality_score}/5")

            experiment_results.append(
                {
                    "scenario_id": scenario_id,
                    "experiment_id": experiment_id,
                    "results_df": results_df,
                }
            )
        else:
            print(f"Unexpected result type for scenario {scenario_id}: {type(result)}")
            experiment_results.append(
                {
                    "scenario_id": scenario_id,
                    "experiment_id": None,
                    "error": f"Unexpected result: {result}",
                }
            )

        print("-" * 60)

    print(f"All experiments complete: {len(experiment_results)} scenarios processed")
    return experiment_results


if __name__ == "__main__":
    run_arize_llm_evaluation()

Run: uv run eval_arize_llm_as_a_judge.py

Key Points:

  • suppress_tracing(): Skips tracing for LLM judge calls to keep evaluation traces separate from agent traces
  • llm_classify(): Phoenix evals function (from the arize-phoenix package) that runs the LLM evaluation with structured output
  • data=df: Input DataFrame with columns referenced in template
  • rails: Constrains LLM judge to only return specified labels
  • provide_explanation=True: Makes LLM judge explain its reasoning
  • Returns EvaluationResult with score, label, and explanation
  • Separate experiments: Each scenario runs as its own experiment for cleaner organization

Experiment View

Compare Experiments View

The evaluator gave a score of 2 because the agent recommended romantic movies as the user requested in the last message but did not follow the user's previously expressed preference for comedies.


Resources

🚀 Try It Yourself

📚 Learn More

  • DeepLearning.AI 'Evaluating AI Agents': https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
  • Operationalizing Generative AI on Vertex AI using MLOps: https://www.kaggle.com/whitepaper-operationalizing-generative-ai-on-vertex-ai-using-mlops
  • Strands Agents Official Documentation: https://strandsagents.com/latest/
  • Arize AX Official Documentation: https://arize.com/docs/ax