MCP Agent 可观测性:监控、追踪和调试 AI Agent 系统
构建完整的 MCP Agent 可观测性体系,包括指标监控、分布式追踪、日志聚合和告警机制的最佳实践。
AI Agent 系统的行为具有不确定性——同样的输入可能产生不同的工具调用序列。这使得传统的监控方法不够用,我们需要专门为 Agent 系统设计的可观测性方案。
Agent 可观测性的三大支柱
指标(Metrics)
数值型的聚合数据,用于监控系统的整体健康状态。
追踪(Traces)
单次请求的完整调用链,用于定位具体问题。
日志(Logs)
详细的事件记录,用于深入分析和审计。
指标设计
关键指标
import { Counter, Histogram, Gauge } from 'prom-client';
// 工具调用计数
const toolCallsTotal = new Counter({
name: 'mcp_tool_calls_total',
help: 'Total tool calls',
labelNames: ['tool', 'server', 'status'],
});
// 工具调用延迟
const toolCallDuration = new Histogram({
name: 'mcp_tool_call_duration_seconds',
help: 'Tool call duration',
labelNames: ['tool', 'server'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 5, 10],
});
// 活跃 Agent 数量
const activeAgents = new Gauge({
name: 'mcp_active_agents',
help: 'Number of active agents',
});
// LLM Token 使用量
const tokenUsage = new Counter({
name: 'mcp_token_usage_total',
help: 'LLM token usage',
labelNames: ['model', 'type'], // type: input/output
});
// Agent 对话轮次
const conversationTurns = new Histogram({
name: 'mcp_conversation_turns',
help: 'Number of turns per conversation',
buckets: [1, 2, 5, 10, 20, 50],
});
指标收集
class MetricsCollector {
recordToolCall(tool: string, server: string, duration: number, success: boolean) {
toolCallsTotal.inc({
tool,
server,
status: success ? 'success' : 'error',
});
toolCallDuration.observe({ tool, server }, duration / 1000);
}
recordTokenUsage(model: string, inputTokens: number, outputTokens: number) {
tokenUsage.inc({ model, type: 'input' }, inputTokens);
tokenUsage.inc({ model, type: 'output' }, outputTokens);
}
recordConversation(turns: number) {
conversationTurns.observe(turns);
}
}
分布式追踪
追踪上下文传播
import { randomUUID } from 'crypto';
interface TraceContext {
traceId: string;
spanId: string;
parentSpanId?: string;
}
class Tracer {
private context: TraceContext | null = null;
startTrace(): TraceContext {
this.context = {
traceId: randomUUID(),
spanId: randomUUID(),
};
return this.context;
}
startSpan(parentContext: TraceContext): TraceContext {
const span: TraceContext = {
traceId: parentContext.traceId,
spanId: randomUUID(),
parentSpanId: parentContext.spanId,
};
this.context = span;
return span;
}
getContext(): TraceContext | null {
return this.context;
}
}
Agent 调用链追踪
interface SpanData {
traceId: string;
spanId: string;
parentSpanId?: string;
operation: string;
startTime: number;
endTime?: number;
attributes: Record<string, any>;
events: Array<{ name: string; timestamp: number; data?: any }>;
}
class AgentTracer {
private spans: SpanData[] = [];
startSpan(traceId: string, parentSpanId: string | undefined, operation: string): string {
const spanId = randomUUID();
this.spans.push({
traceId,
spanId,
parentSpanId,
operation,
startTime: Date.now(),
attributes: {},
events: [],
});
return spanId;
}
endSpan(spanId: string, attributes?: Record<string, any>) {
const span = this.spans.find(s => s.spanId === spanId);
if (span) {
span.endTime = Date.now();
if (attributes) {
span.attributes = { ...span.attributes, ...attributes };
}
}
}
addEvent(spanId: string, name: string, data?: any) {
const span = this.spans.find(s => s.spanId === spanId);
if (span) {
span.events.push({ name, timestamp: Date.now(), data });
}
}
getTrace(traceId: string): SpanData[] {
return this.spans.filter(s => s.traceId === traceId);
}
}
追踪可视化
追踪数据可以导出为 OpenTelemetry 格式,在 Jaeger、Zipkin 等工具中可视化:
function formatTraceForJaeger(spans: SpanData[]) {
return spans.map(span => ({
traceID: span.traceId,
spanID: span.spanId,
parentSpanID: span.parentSpanId || '',
operationName: span.operation,
startTime: span.startTime * 1000, // 微秒
duration: ((span.endTime || Date.now()) - span.startTime) * 1000,
tags: Object.entries(span.attributes).map(([key, value]) => ({
key,
type: 'string',
value: String(value),
})),
logs: span.events.map(e => ({
timestamp: e.timestamp * 1000,
fields: [{ key: 'event', type: 'string', value: e.name }],
})),
}));
}
日志设计
结构化日志
import pino from 'pino';
const logger = pino({
level: 'info',
formatters: {
level(label: string) {
return { level: label };
},
},
});
class AgentLogger {
private traceId: string;
private agentId: string;
constructor(traceId: string, agentId: string) {
this.traceId = traceId;
this.agentId = agentId;
}
info(message: string, data?: any) {
logger.info({
traceId: this.traceId,
agentId: this.agentId,
...data,
}, message);
}
toolCall(toolName: string, serverName: string, args: any) {
logger.info({
traceId: this.traceId,
agentId: this.agentId,
event: 'tool_call',
tool: toolName,
server: serverName,
args,
}, `Tool call: ${toolName}`);
}
toolResult(toolName: string, success: boolean, duration: number) {
logger.info({
traceId: this.traceId,
agentId: this.agentId,
event: 'tool_result',
tool: toolName,
success,
duration,
}, `Tool result: ${toolName} (${success ? 'success' : 'error'})`);
}
llmCall(model: string, inputTokens: number, outputTokens: number) {
logger.info({
traceId: this.traceId,
agentId: this.agentId,
event: 'llm_call',
model,
inputTokens,
outputTokens,
}, `LLM call: ${model}`);
}
}
告警规则
关键告警
# Prometheus 告警规则
groups:
- name: mcp_agent_alerts
rules:
- alert: HighToolErrorRate
expr: rate(mcp_tool_calls_total{status="error"}[5m]) / rate(mcp_tool_calls_total[5m]) > 0.1
for: 5m
labels:
severity: warning
annotations:
summary: "工具调用错误率超过 10%"
- alert: HighToolLatency
expr: histogram_quantile(0.95, rate(mcp_tool_call_duration_seconds_bucket[5m])) > 5
for: 5m
labels:
severity: warning
annotations:
summary: "工具调用 P95 延迟超过 5 秒"
- alert: AgentStuck
expr: mcp_active_agents > 0 and rate(mcp_tool_calls_total[10m]) == 0
for: 10m
labels:
severity: critical
annotations:
summary: "Agent 可能卡住 - 有活跃 Agent 但无工具调用"
仪表盘设计
Grafana 仪表盘
关键面板:
- 总览面板——活跃 Agent 数、总调用次数、错误率、平均延迟
- 工具面板——每个工具的调用次数、延迟分布、错误率
- Agent 面板——每个 Agent 的对话轮次、Token 使用量、完成率
- Server 面板——每个 MCP Server 的连接数、响应时间、可用性
常见问题(FAQ)
追踪数据保留多久?
建议保留 7-30 天。超过这个时间的追踪数据价值有限,但审计日志应保留更长时间。
如何追踪跨 Agent 的调用?
通过 traceId 传播追踪上下文。当一个 Agent 调用另一个 Agent 时,传递相同的 traceId。
可观测性会影响性能吗?
指标收集和日志记录的开销很小(通常 < 1ms)。分布式追踪的开销略高,但通常在 5% 以内。
总结
Agent 系统的可观测性需要专门设计——不同于传统的 Web 服务,Agent 的行为具有不确定性,需要更丰富的追踪和更智能的告警。通过指标、追踪、日志三大支柱的结合,你可以全面掌控 Agent 系统的运行状态。