Observability and Evaluation for AI Agents: A Practical Guide
This guide walks through a practical tracing and evaluation implementation with Strands Agents and Arize AX.

Observability Foundation: The Four Pillars
Before diving into implementation, understand that effective agent observability requires four components:

The Four Pillars
Tracing: Understanding Agent Behavior
Trace vs Span:
- Trace: The entire journey (like a complete conversation turn)
- Span: Individual steps within that journey (like "search memory", "call tool", "generate response")
Example trace for the user request "Recommend a comedy movie":
├── Span 1: Parse user intent
├── Span 2: Search memory for comedy preferences
├── Span 3: Call movie recommendation tool
├── Span 4: Filter results based on preferences
└── Span 5: Generate final response
This visibility is crucial for debugging why agents make certain decisions and where failures occur.
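To make the hierarchy concrete, here is a minimal sketch using the raw OpenTelemetry API (the same library the implementation below builds on). The span names are illustrative placeholders rather than the spans Strands actually emits, and it assumes a tracer provider has been configured as shown later in this guide.

from opentelemetry import trace

tracer = trace.get_tracer("movie-agent-demo")

# One trace = one conversation turn; each nested span = one step within it.
with tracer.start_as_current_span("conversation-turn: recommend a comedy"):
    with tracer.start_as_current_span("parse-user-intent"):
        pass  # intent parsing would happen here
    with tracer.start_as_current_span("search-memory-for-comedy-preferences"):
        pass  # memory lookup would happen here
    with tracer.start_as_current_span("call-movie-recommendation-tool"):
        pass  # tool call would happen here
    with tracer.start_as_current_span("generate-final-response"):
        pass  # response generation would happen here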
Evaluation Approaches: The Four Quadrants
(Ideas summarized from the DeepLearning.AI short course on Evaluating AI Agents, created in collaboration with Arize AI)
Evaluation techniques fall into four categories based on two key dimensions:
Deterministic vs Non-Deterministic
- Deterministic: Consistent, repeatable results every time
- Non-Deterministic: Results may vary between runs
Flexible vs Inflexible Criteria
- Flexible: Can handle qualitative, subjective assessments
- Inflexible: Requires quantifiable, specific criteria
In practice, here's how the most common techniques map onto these quadrants:
- Human Labels (Non-Deterministic + Flexible): Best for subjective quality and tone evaluation
- LLM-as-a-Judge (Non-Deterministic + Flexible): Scalable quality assessment, complex reasoning
- Code-based (Deterministic + Inflexible): Format validation, exact matching, measurable criteria (see the sketch after this list)
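For contrast, here is what a code-based (deterministic + inflexible) check might look like for this project: a plain Python function that verifies a measurable criterion and returns the same answer on every run. The function name and heuristic are illustrative, not part of the evaluation pipeline built later in this post.

def avoids_banned_title(response: str, banned_title: str) -> bool:
    """Deterministic check: the recommendation must not mention a title the user disliked."""
    return banned_title.lower() not in response.lower()

# Same input always yields the same result - useful for format and constraint checks,
# but it cannot judge subjective qualities like tone or helpfulness.
print(avoids_banned_title("Tonight you could watch Paddington 2.", "The Matrix"))  # True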
What Can Go Wrong: Common Evaluation Areas
Agent evaluation covers multiple critical areas where failures commonly occur:
- Router Evaluation: Do agents choose the right tools and extract correct parameters? (A minimal tool-choice check appears after this list.)
  - Function Calling Choice: Did the agent choose the right tool?
  - Parameter Extraction: Did it extract the correct parameters?
- Content Quality: Are responses helpful, accurate, and appropriate?
  - Watch for hallucinations, factual errors, and inappropriate tone.
- Memory & Context Usage: Do agents use available information effectively?
  - Watch for ignored user preferences and failure to use retrieved context (RAG).
- Task Completion: Do agents actually accomplish what users asked for?
  - Watch for overall correctness and whether instructions are followed properly.
These failures cascade through your entire system, leading to poor user experiences.
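As a concrete example of router evaluation, the check can be purely deterministic: compare the tool the agent actually called, and the parameters it extracted, against what the test case expects. The function and field names below are assumptions made for the sketch, not part of the Strands API.

def evaluate_router_step(called_tool: str, called_params: dict,
                         expected_tool: str, expected_params: dict) -> dict:
    """Check function-calling choice and parameter extraction for one agent step."""
    return {
        "correct_tool": called_tool == expected_tool,
        "correct_params": all(called_params.get(k) == v for k, v in expected_params.items()),
    }

# Example: the agent routed a date-night query to a (hypothetical) search tool
print(evaluate_router_step(
    called_tool="search_movies",
    called_params={"genre": "romantic comedy"},
    expected_tool="search_movies",
    expected_params={"genre": "romantic comedy"},
))  # {'correct_tool': True, 'correct_params': True}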
Five-Step Evaluation Framework
(Reference: Operationalizing Generative AI on Vertex AI using MLOps)
1. Create Evaluation Dataset: Essential, average, and edge cases
2. Define Metrics: Objective (accuracy, format) and subjective (tone, helpfulness)
3. Generate Responses: Run the agent systematically against test cases
4. Run Evaluation: Manual, automated, or hybrid approaches
5. Interpret Results: Identify patterns, guide improvements
Implementation: Movie Recommendation Agent
Let's implement tracing and evaluation for a movie recommendation agent I built here: https://www.asakohayase.com/blog/movie-recommendation-memory-strands-agents
If you haven't installed the project yet, please complete steps 1 and 2 of the "Installation" section in the previous post. I'll use LLM-as-a-Judge for evaluation because it enables scalable, consistent assessment of recommendation quality without the overhead of manual review.
Three Test Scenarios:
Scenario 1: Essential - Positive Title Preference
Conversation Flow:
User: I love Spirited Away
Agent: Rate the title on a scale of 1-5
User: 5 stars
Agent: [Stores 5-star rating for Spirited Away and positive preference for anime & fantasy genres]
User: Recommend something for tonight
Agent: [Should recommend animated fantasy films similar to Spirited Away]
Expected Response Quality: Should recommend animated fantasy films that match the user's expressed love for Spirited Away.
Scenario 2: Average - Positive Genre Preference
Conversation Flow:
User: I love comedies, especially romantic ones
Agent: Rate the genre on a scale of 1-5
User: 5 stars for rom-coms
Agent: [Stores 5-star rating for romantic comedy genre and positive preference for comedy]
User: What's good for date night?
Agent: [Should prioritize romantic comedies based on stored preferences and date night context]
Expected Response Quality: Should prioritize romantic comedies, recognizing both the genre preference and the date night context.
Scenario 3: Edge Case - Negative Title Preference with Contradictory Request
Conversation Flow:
User: I didn't like The Matrix
Agent: Rate the title on a scale of 1-5
User: 2 stars
Agent: [Stores 2-star rating for The Matrix and negative preference for sci-fi action genres]
User: Recommend action movies
Agent: [Must navigate contradiction: user wants action movies but dislikes a sci-fi action film]
Expected Response Quality: Should recommend action movies but avoid The Matrix series. The agent must balance the negative preference with the genre request.
These three scenarios live in movie_evaluation_scenarios.json, which both scripts below load:

[
  {
    "scenario_id": 1,
    "description": "Essential - positive title preference",
    "input": [
      "I love Spirited Away",
      "5 stars"
    ],
    "evaluation_query": "Recommend something for tonight",
    "expected_response_quality": "Should recommend animated fantasy films"
  },
  {
    "scenario_id": 2,
    "description": "Average - positive genre preference",
    "input": [
      "I love comedies, especially romantic ones",
      "5 stars for rom-coms"
    ],
    "evaluation_query": "What's good for date night?",
    "expected_response_quality": "Should prioritize romantic comedies"
  },
  {
    "scenario_id": 3,
    "description": "Edge case - negative title preference with contradictory request",
    "input": [
      "I didn't like The Matrix",
      "2 stars"
    ],
    "evaluation_query": "Recommend action movies",
    "expected_response_quality": "Should recommend action movies but avoid The Matrix series"
  }
]
Evaluation Dimension: Response accuracy
Arize AX Implementation
I'll send tracing data to Arize AX by converting Strands Agents' native telemetry to OpenInference format (Arize's open standard for LLM observability) using a custom processor and OpenTelemetry.
Tracing:
Prerequisites:
1. Install dependencies:
uv add opentelemetry-sdk opentelemetry-exporter-otlp-proto-grpc
2. Save the strands_to_openinference_mapping.py file from the OpenInference-Arize repository to your project directory.
3. Set up cost tracking:
- In Arize AX, go to Settings > Cost Configs > User Defined Costs
- Check whether your models are already listed. If they are, you're done - cost tracking works automatically.
- Only if a model is missing, go to User Defined Cost, click Add Model, and enter the Model Name, Prompt Tokens, and Completion Tokens
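Both scripts in this post read credentials from a .env file via python-dotenv. A minimal example with placeholder values (the variable names match the scripts below; the agent itself may need additional variables from the previous post's setup):

# Arize credentials (required for both tracing and evaluation)
ARIZE_SPACE_ID=your-space-id
ARIZE_API_KEY=your-api-key
# Optional: used by the LLM-as-a-Judge script if set
ARIZE_DEVELOPER_KEY=your-developer-key
# Needed for the gpt-4o-mini judge in the evaluation script
OPENAI_API_KEY=your-openai-api-key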
import os
import json
import uuid
from dotenv import load_dotenv
from main import MovieRecommendationAssistant

# OpenTelemetry imports for manual setup
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from strands_to_openinference_mapping import StrandsToOpenInferenceProcessor

# Load environment variables
load_dotenv()


def run_arize_tracing():
    """Arize AX tracing - official manual OpenTelemetry setup"""

    # Get Arize credentials from .env file
    space_id = os.getenv("ARIZE_SPACE_ID")
    api_key = os.getenv("ARIZE_API_KEY")

    if not space_id or not api_key:
        print("Error: Set ARIZE_SPACE_ID and ARIZE_API_KEY in .env file")
        return

    # Create the Strands to OpenInference processor
    strands_processor = StrandsToOpenInferenceProcessor(debug=True)

    # Create resource with project name
    resource = Resource.create(
        {
            "model_id": "strands-agents-memory-tracing",
            "service.name": "strands-agent-integration",
        }
    )

    # Create tracer provider and add processors
    provider = TracerProvider(resource=resource)
    provider.add_span_processor(strands_processor)

    # Create OTLP exporter for Arize
    otlp_exporter = OTLPSpanExporter(
        endpoint="otlp.arize.com:443",
        headers={"space_id": space_id, "api_key": api_key},
    )
    provider.add_span_processor(BatchSpanProcessor(otlp_exporter))

    # Set the global tracer provider
    trace.set_tracer_provider(provider)

    # Load test scenarios from JSON file
    with open("movie_evaluation_scenarios.json", "r") as f:
        scenarios = json.load(f)

    for scenario in scenarios:
        print(f"\n{'=' * 60}")
        print(f"SCENARIO {scenario['scenario_id']}: {scenario['description']}")
        print(f"{'=' * 60}")

        # Create fresh assistant with unique UUID for each scenario
        user_id = str(uuid.uuid4())
        assistant = MovieRecommendationAssistant(user_id=user_id)

        print(f"Using user_id: {user_id}")

        # Set STRANDS_AGENT_SYSTEM_PROMPT for the processor
        os.environ["STRANDS_AGENT_SYSTEM_PROMPT"] = assistant.agent.system_prompt

        # Add trace attributes for better organization in Arize
        assistant.agent.trace_attributes = {
            "session.id": f"scenario-{scenario['scenario_id']}-{user_id[:8]}",
            "user.id": user_id,
            "scenario.id": scenario["scenario_id"],
            "arize.tags": [
                "Agent-SDK",
                "Arize-Project",
                "OpenInference-Integration",
            ],
        }

        # Run each input message in the scenario
        for user_input in scenario["input"]:
            print(f"\nUser: {user_input}")
            assistant.agent(user_input)

        # Run evaluation query
        eval_query = scenario["evaluation_query"]
        print(f"\nEvaluation Query: {eval_query}")
        assistant.agent(eval_query)

    print("\nView traces at: https://app.arize.com")
    print("Project: strands-agents-memory-tracing")


if __name__ == "__main__":
    run_arize_tracing()
Save the script as eval_arize_tracing.py and run it with: uv run eval_arize_tracing.py
Key Points:
- StrandsToOpenInferenceProcessor(debug=True): Converts Strands native telemetry to the OpenInference format that Arize understands
- Resource.create(): Defines project metadata (model_id becomes the project name in the Arize dashboard)
- TracerProvider: Core OpenTelemetry component that manages all tracing
- Dual processors:
  - provider.add_span_processor(strands_processor): the first processor converts the Strands format to OpenInference
  - provider.add_span_processor(BatchSpanProcessor(otlp_exporter)): the second processor batches and exports traces to Arize
- OTLPSpanExporter: Sends traces to otlp.arize.com:443 via the gRPC protocol
- trace.set_tracer_provider(): Sets the global tracer provider so Strands automatically uses it
- trace_attributes: Added to the agent for filtering/grouping (session.id, user.id, arize.tags)
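One optional addition, in case traces don't appear because the script exits quickly: BatchSpanProcessor exports asynchronously, so you can flush it explicitly at the end of run_arize_tracing(). force_flush() is part of the OpenTelemetry SDK's TracerProvider; adding it here is my suggestion rather than part of the original script.

# Optional: at the end of run_arize_tracing(), flush any buffered spans before the process exits
provider.force_flush()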

Projects View

Trace Detail View

Attributes Tab
LLM-as-a-Judge:
import os
import json
import uuid
import pandas as pd
from dotenv import load_dotenv
from main import (
    MovieRecommendationAssistant,
)  # Custom assistant for movie recommendations

# Load environment variables from .env file
load_dotenv()

# Try importing Arize dependencies
try:
    from arize.experimental.datasets import ArizeDatasetsClient
    from arize.experimental.datasets.experiments.types import EvaluationResult
    from arize.experimental.datasets.utils.constants import GENERATIVE
    from phoenix.evals import llm_classify, OpenAIModel
    from openinference.instrumentation import suppress_tracing

    ARIZE_AVAILABLE = True
except ImportError as e:
    ARIZE_AVAILABLE = False
    print('Install: uv add "arize[Datasets]" arize-phoenix openai pandas')
    print(f"Error: {e}")


def movie_task(dataset_row):
    """
    Task function executed for each row in the dataset.
    Input: dataset_row containing scenario input messages and evaluation query.
    Output: final response text from the AI assistant.
    """
    try:
        input_messages = json.loads(
            dataset_row.get("input", "[]")
        )  # Load scenario input messages
        eval_query = dataset_row.get(
            "evaluation_query", ""
        )  # Final query for evaluation

        user_id = str(uuid.uuid4())  # Unique ID for this session
        assistant = MovieRecommendationAssistant(user_id=user_id)

        # Execute all input messages in the scenario
        for message in input_messages:
            assistant.agent(message)

        # Execute the final evaluation query
        result = assistant.agent(eval_query)

        # Extract text from the agent's structured message
        response_text = result.message["content"][0]["text"]

        return response_text

    except Exception as e:
        return f"Task failed: {str(e)}"


def quality_evaluator(output, dataset_row):
    """
    Evaluator function to score response quality.
    Input: output from task and dataset_row.
    Output: EvaluationResult containing score, label, and explanation.
    """
    template = """
    Evaluate the quality of movie recommendations based on the conversation context.

    Scenario Description: {description}
    Conversation Input: {input}
    Agent Response: {output}
    Expected Quality: {expected_quality}

    Based on the conversation context, score how well the agent's recommendations
    align with the user's expressed preferences:

    5 = Excellent - Perfect alignment with user preferences
    4 = Good - Strong alignment with minor gaps
    3 = Adequate - Some alignment but could be better
    2 = Poor - Minimal alignment with preferences
    1 = Terrible - No alignment or inappropriate recommendations

    Respond with: 1, 2, 3, 4, or 5
    """

    try:
        description = dataset_row.get("description", "")
        input_data = dataset_row.get("input", "[]")
        expected_quality = dataset_row.get("expected_quality", "")

        df = pd.DataFrame(
            [
                {
                    "description": description,
                    "input": input_data,
                    "output": output,
                    "expected_quality": expected_quality,
                }
            ]
        )

        with suppress_tracing():
            result = llm_classify(
                data=df,
                template=template,
                model=OpenAIModel(model="gpt-4o-mini"),
                rails=["1", "2", "3", "4", "5"],
                provide_explanation=True,
            )

        label = result["label"][0]
        score = int(label)
        explanation = result.get("explanation", [""])[0]

        return EvaluationResult(score=score, label=str(score), explanation=explanation)

    except Exception as e:
        return EvaluationResult(score=0.0, label="0", explanation=f"Failed: {str(e)}")


def run_arize_llm_evaluation():
    """
    Main function to run LLM-as-a-judge experiments.
    Steps:
    1. Initialize Arize client.
    2. Load scenarios from JSON.
    3. For each scenario:
       - Create dataset.
       - Run task to generate output.
       - Run evaluators to score output.
    4. Collect and print results.
    """
    if not ARIZE_AVAILABLE:
        print("Missing Arize dependencies")
        return

    api_key = os.getenv("ARIZE_API_KEY")
    developer_key = os.getenv("ARIZE_DEVELOPER_KEY")
    space_id = os.getenv("ARIZE_SPACE_ID")

    if not (api_key or developer_key) or not space_id:
        print("Missing ARIZE_API_KEY/ARIZE_DEVELOPER_KEY and ARIZE_SPACE_ID")
        return

    try:
        client = (
            ArizeDatasetsClient(developer_key=developer_key)
            if developer_key
            else ArizeDatasetsClient(api_key=api_key)
        )
        print("Arize client initialized")
    except Exception as e:
        print(f"Client init failed: {e}")
        return

    with open("movie_evaluation_scenarios.json", "r") as f:
        scenarios = json.load(f)

    print(f"Running experiments for {len(scenarios)} scenarios...")

    experiment_results = []

    for scenario in scenarios:
        scenario_id = scenario["scenario_id"]
        print(f"\n=== Scenario {scenario_id} ===")

        row = {
            "id": scenario_id,
            "description": scenario["description"],
            "input": json.dumps(scenario["input"]),
            "evaluation_query": scenario["evaluation_query"],
            "expected_quality": scenario["expected_response_quality"],
        }

        single_row_df = pd.DataFrame([row])
        scenario_dataset_name = f"scenario_{scenario_id}_{uuid.uuid4().hex[:8]}"

        # Create dataset for this scenario
        scenario_dataset_id = client.create_dataset(
            space_id=space_id,
            dataset_name=scenario_dataset_name,
            dataset_type=GENERATIVE,
            data=single_row_df,
        )
        print(f"Created dataset: {scenario_dataset_name}")

        # Run experiment: generate output and score it
        result = client.run_experiment(
            space_id=space_id,
            dataset_id=scenario_dataset_id,
            task=movie_task,  # Generates output
            evaluators=[quality_evaluator],  # Score output quality only
            experiment_name=f"scenario_{scenario_id}_{uuid.uuid4().hex[:8]}",
            exit_on_error=False,
        )

        if isinstance(result, tuple):
            experiment_id, results_df = result
            print(f"Scenario {scenario_id} completed! Experiment ID: {experiment_id}")
            if not results_df.empty:
                quality_score = results_df.iloc[0].get(
                    "eval.quality_evaluator.score", "N/A"
                )
                print(f"Quality Score: {quality_score}/5")

            experiment_results.append(
                {
                    "scenario_id": scenario_id,
                    "experiment_id": experiment_id,
                    "results_df": results_df,
                }
            )
        else:
            print(f"Unexpected result type for scenario {scenario_id}: {type(result)}")
            experiment_results.append(
                {
                    "scenario_id": scenario_id,
                    "experiment_id": None,
                    "error": f"Unexpected result: {result}",
                }
            )

        print("-" * 60)

    print(f"All experiments complete: {len(experiment_results)} scenarios processed")
    return experiment_results


if __name__ == "__main__":
    run_arize_llm_evaluation()
Save the script as eval_arize_llm_as_a_judge.py and run it with: uv run eval_arize_llm_as_a_judge.py
Key Points:
- suppress_tracing(): Skips tracing for the LLM judge calls so evaluation traces stay separate from agent traces
- llm_classify(): Phoenix evals function that runs the LLM evaluation with structured output
  - data=df: Input DataFrame with the columns referenced in the template
  - rails: Constrains the LLM judge to return only the specified labels
  - provide_explanation=True: Makes the LLM judge explain its reasoning
- Returns an EvaluationResult with score, label, and explanation
- Separate experiments: Each scenario runs as its own experiment for cleaner organization

Experiment View

Compare Experiments View
Here the evaluator gave a score of 2: the agent recommended romantic movies, as the user requested in the last message, but did not follow the user's previously expressed preference for comedies.
Resources
🚀 Try It Yourself
- GitHub Repository - Complete source code
📚 Learn More
- DeepLearning.ai 'Evaluating AI Agents': https://www.deeplearning.ai/short-courses/evaluating-ai-agents/
- Operationalizing Generative AI on Vertex AI using MLOps: https://www.kaggle.com/whitepaper-operationalizing-generative-ai-on-vertex-ai-using-mlops
- Strands Agents Official Documentation: https://strandsagents.com/latest/
- Arize AX Official Documentation: https://arize.com/docs/ax