MCP Agent Observability: Monitoring, Tracing, and Debugging AI Agent Systems

Best practices for building a complete observability stack for MCP Agents, covering metrics, distributed tracing, log aggregation, and alerting.

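AI Agent systems behave non-deterministically: the same input can produce different sequences of tool calls. Traditional monitoring alone is therefore not enough, and we need an observability approach designed specifically for Agent systems.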

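The Three Pillars of Agent Observability

Metrics

Aggregated numeric data, used to monitor the overall health of the system.

Traces

The complete call chain of a single request, used to pinpoint specific problems.

Logs

Detailed event records, used for in-depth analysis and auditing.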

Metric Design

Key Metrics

import { Counter, Histogram, Gauge } from 'prom-client';

// Tool call count
const toolCallsTotal = new Counter({
  name: 'mcp_tool_calls_total',
  help: 'Total tool calls',
  labelNames: ['tool', 'server', 'status'],
});

// Tool call latency (seconds)
const toolCallDuration = new Histogram({
  name: 'mcp_tool_call_duration_seconds',
  help: 'Tool call duration',
  labelNames: ['tool', 'server'],
  buckets: [0.01, 0.05, 0.1, 0.5, 1, 5, 10],
});

// Number of active Agents
const activeAgents = new Gauge({
  name: 'mcp_active_agents',
  help: 'Number of active agents',
});

// LLM token usage
const tokenUsage = new Counter({
  name: 'mcp_token_usage_total',
  help: 'LLM token usage',
  labelNames: ['model', 'type'],  // type: input/output
});

// Agent conversation turns
const conversationTurns = new Histogram({
  name: 'mcp_conversation_turns',
  help: 'Number of turns per conversation',
  buckets: [1, 2, 5, 10, 20, 50],
});

Collecting Metrics

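class MetricsCollector {
  // durationMs is in milliseconds; the histogram is defined in seconds,
  // so the value is converted before it is observed.
  recordToolCall(tool: string, server: string, durationMs: number, success: boolean) {
    toolCallsTotal.inc({
      tool,
      server,
      status: success ? 'success' : 'error',
    });
    toolCallDuration.observe({ tool, server }, durationMs / 1000);
  }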

  recordTokenUsage(model: string, inputTokens: number, outputTokens: number) {
    tokenUsage.inc({ model, type: 'input' }, inputTokens);
    tokenUsage.inc({ model, type: 'output' }, outputTokens);
  }

  recordConversation(turns: number) {
    conversationTurns.observe(turns);
  }
}
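
One way to feed `recordToolCall` is to wrap every tool invocation in a helper that times it and reports success or failure. The sketch below assumes a hypothetical `timeToolCall` helper and a `ToolCallRecorder` interface that `MetricsCollector` satisfies structurally; neither name comes from prom-client or from an MCP SDK.

```typescript
// Anything with a recordToolCall method of this shape works,
// including the MetricsCollector class above.
interface ToolCallRecorder {
  recordToolCall(tool: string, server: string, durationMs: number, success: boolean): void;
}

// Hypothetical helper: runs `call`, measures wall-clock time in milliseconds,
// and records the outcome whether the call resolves or rejects.
async function timeToolCall<T>(
  recorder: ToolCallRecorder,
  tool: string,
  server: string,
  call: () => Promise<T>,
): Promise<T> {
  const start = Date.now();
  try {
    const result = await call();
    recorder.recordToolCall(tool, server, Date.now() - start, true);
    return result;
  } catch (err) {
    recorder.recordToolCall(tool, server, Date.now() - start, false);
    throw err; // rethrow so the caller's error handling is unchanged
  }
}
```

Keeping the timing in one place means individual tools never touch the metrics objects directly, and a failed call is always counted with `status: 'error'`.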

Distributed Tracing

Trace Context Propagation

import { randomUUID } from 'crypto';

interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

class Tracer {
  private context: TraceContext | null = null;

  startTrace(): TraceContext {
    this.context = {
      traceId: randomUUID(),
      spanId: randomUUID(),
    };
    return this.context;
  }

  startSpan(parentContext: TraceContext): TraceContext {
    const span: TraceContext = {
      traceId: parentContext.traceId,
      spanId: randomUUID(),
      parentSpanId: parentContext.spanId,
    };
    this.context = span;
    return span;
  }

  getContext(): TraceContext | null {
    return this.context;
  }
}
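
The `Tracer` above keeps a single mutable `context` field, so two requests handled concurrently in the same process would overwrite each other's context. One possible alternative is Node's `AsyncLocalStorage`, which scopes a value to an async call chain. The sketch below is an assumption about how that could look; `runWithTrace`, `runInChildSpan`, and `currentContext` are hypothetical names.

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';

interface TraceContext {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
}

// One store per process; each run() call scopes a context to its callback
// and to everything that callback awaits.
const traceStorage = new AsyncLocalStorage<TraceContext>();

// Start a new trace and run `fn` inside it.
function runWithTrace<T>(fn: () => T): T {
  const root: TraceContext = { traceId: randomUUID(), spanId: randomUUID() };
  return traceStorage.run(root, fn);
}

// Run `fn` in a child span of whatever context is currently active.
function runInChildSpan<T>(fn: () => T): T {
  const parent = traceStorage.getStore();
  if (!parent) return runWithTrace(fn); // no active trace: start one
  const child: TraceContext = {
    traceId: parent.traceId,
    spanId: randomUUID(),
    parentSpanId: parent.spanId,
  };
  return traceStorage.run(child, fn);
}

// Returns undefined when called outside any trace.
function currentContext(): TraceContext | undefined {
  return traceStorage.getStore();
}
```

With this shape, code deep inside a tool handler can call `currentContext()` without the context being passed through every function signature.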

Tracing the Agent Call Chain

interface SpanData {
  traceId: string;
  spanId: string;
  parentSpanId?: string;
  operation: string;
  startTime: number;
  endTime?: number;
  attributes: Record<string, any>;
  events: Array<{ name: string; timestamp: number; data?: any }>;
}

class AgentTracer {
  private spans: SpanData[] = [];

  startSpan(traceId: string, parentSpanId: string | undefined, operation: string): string {
    const spanId = randomUUID();

    this.spans.push({
      traceId,
      spanId,
      parentSpanId,
      operation,
      startTime: Date.now(),
      attributes: {},
      events: [],
    });

    return spanId;
  }

  endSpan(spanId: string, attributes?: Record<string, any>) {
    const span = this.spans.find(s => s.spanId === spanId);
    if (span) {
      span.endTime = Date.now();
      if (attributes) {
        span.attributes = { ...span.attributes, ...attributes };
      }
    }
  }

  addEvent(spanId: string, name: string, data?: any) {
    const span = this.spans.find(s => s.spanId === spanId);
    if (span) {
      span.events.push({ name, timestamp: Date.now(), data });
    }
  }

  getTrace(traceId: string): SpanData[] {
    return this.spans.filter(s => s.traceId === traceId);
  }
}
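
`getTrace` returns a flat list, while debugging usually needs the parent/child structure. A minimal sketch of rebuilding that tree follows; `buildSpanTree` and `renderTree` are hypothetical helpers, and `SpanLike` is a reduced stand-in for the fields of `SpanData` that they need.

```typescript
interface SpanLike {
  spanId: string;
  parentSpanId?: string;
  operation: string;
  startTime: number;
  endTime?: number;
}

interface SpanNode {
  span: SpanLike;
  children: SpanNode[];
}

// Group spans under their parents. Spans with no parent, or whose parent is
// missing from the list, become roots.
function buildSpanTree(spans: SpanLike[]): SpanNode[] {
  const nodes = new Map<string, SpanNode>();
  spans.forEach(span => nodes.set(span.spanId, { span, children: [] }));

  const roots: SpanNode[] = [];
  nodes.forEach(node => {
    const parentId = node.span.parentSpanId;
    const parent = parentId ? nodes.get(parentId) : undefined;
    (parent ? parent.children : roots).push(node);
  });

  // Order siblings by start time so the output reads chronologically.
  const sortRecursively = (list: SpanNode[]): void => {
    list.sort((a, b) => a.span.startTime - b.span.startTime);
    list.forEach(n => sortRecursively(n.children));
  };
  sortRecursively(roots);
  return roots;
}

// Render one line per span, indented by depth, with the duration in milliseconds.
function renderTree(roots: SpanNode[], depth: number = 0): string[] {
  const lines: string[] = [];
  for (const { span, children } of roots) {
    const duration = span.endTime === undefined ? 'running' : `${span.endTime - span.startTime}ms`;
    lines.push(`${'  '.repeat(depth)}${span.operation} (${duration})`);
    lines.push(...renderTree(children, depth + 1));
  }
  return lines;
}
```

Calling `renderTree(buildSpanTree(tracer.getTrace(traceId)))` gives a quick text view of which tool call or LLM call took the time, before any external tracing backend is involved.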

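Trace Visualization

Trace data can be exported, for example through OpenTelemetry, and visualized in tools such as Jaeger or Zipkin. The function below converts spans into a Jaeger-style JSON shape: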

function formatTraceForJaeger(spans: SpanData[]) {
  return spans.map(span => ({
    traceID: span.traceId,
    spanID: span.spanId,
    parentSpanID: span.parentSpanId || '',
    operationName: span.operation,
    startTime: span.startTime * 1000,  // milliseconds to microseconds
    duration: ((span.endTime || Date.now()) - span.startTime) * 1000,
    tags: Object.entries(span.attributes).map(([key, value]) => ({
      key,
      type: 'string',
      value: String(value),
    })),
    logs: span.events.map(e => ({
      timestamp: e.timestamp * 1000,
      fields: [{ key: 'event', type: 'string', value: e.name }],
    })),
  }));
}

Log Design

Structured Logging

import pino from 'pino';

const logger = pino({
  level: 'info',
  formatters: {
    level(label: string) {
      return { level: label };
    },
  },
});

class AgentLogger {
  private traceId: string;
  private agentId: string;

  constructor(traceId: string, agentId: string) {
    this.traceId = traceId;
    this.agentId = agentId;
  }

  info(message: string, data?: any) {
    logger.info({
      traceId: this.traceId,
      agentId: this.agentId,
      ...data,
    }, message);
  }

  toolCall(toolName: string, serverName: string, args: any) {
    logger.info({
      traceId: this.traceId,
      agentId: this.agentId,
      event: 'tool_call',
      tool: toolName,
      server: serverName,
      args,
    }, `Tool call: ${toolName}`);
  }

  toolResult(toolName: string, success: boolean, duration: number) {
    logger.info({
      traceId: this.traceId,
      agentId: this.agentId,
      event: 'tool_result',
      tool: toolName,
      success,
      duration,
    }, `Tool result: ${toolName} (${success ? 'success' : 'error'})`);
  }

  llmCall(model: string, inputTokens: number, outputTokens: number) {
    logger.info({
      traceId: this.traceId,
      agentId: this.agentId,
      event: 'llm_call',
      model,
      inputTokens,
      outputTokens,
    }, `LLM call: ${model}`);
  }
}
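
Because every record carries `traceId`, logs can be joined back to a trace during analysis. The sketch below assumes the logger writes newline-delimited JSON with `traceId`, `time`, and `msg` fields (pino's default output includes `time` and `msg`); `groupLogsByTrace` is a hypothetical helper.

```typescript
interface LogRecord {
  traceId?: string;
  time?: number;
  msg?: string;
  [key: string]: unknown;
}

// Parse NDJSON lines, skip anything that is not a JSON object or lacks a
// traceId, and return the records of each trace ordered by timestamp.
function groupLogsByTrace(lines: string[]): Map<string, LogRecord[]> {
  const byTrace = new Map<string, LogRecord[]>();
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // tolerate non-JSON noise in the stream
    }
    if (typeof parsed !== 'object' || parsed === null) continue;
    const record = parsed as LogRecord;
    if (typeof record.traceId !== 'string') continue;

    const bucket = byTrace.get(record.traceId) ?? [];
    bucket.push(record);
    byTrace.set(record.traceId, bucket);
  }
  byTrace.forEach(records => records.sort((a, b) => (a.time ?? 0) - (b.time ?? 0)));
  return byTrace;
}
```

In practice a log backend would do this join with a query on `traceId`; the function only illustrates why the field needs to be present on every record.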

Alerting Rules

Key Alerts

# Prometheus alerting rules
groups:
  - name: mcp_agent_alerts
    rules:
      - alert: HighToolErrorRate
        # Aggregate both sides first: without sum(), the division matches series
        # label by label and only pairs error series with themselves.
        expr: sum(rate(mcp_tool_calls_total{status="error"}[5m])) / sum(rate(mcp_tool_calls_total[5m])) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool call error rate above 10%"

      - alert: HighToolLatency
        expr: histogram_quantile(0.95, rate(mcp_tool_call_duration_seconds_bucket[5m])) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Tool call P95 latency above 5 seconds"

      - alert: AgentStuck
        # Aggregate both sides so their label sets match for `and`.
        expr: sum(mcp_active_agents) > 0 and sum(rate(mcp_tool_calls_total[10m])) == 0
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Agent may be stuck: active Agents but no tool calls"
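
The `HighToolLatency` rule relies on `histogram_quantile`, which estimates a percentile from bucket counts rather than from raw samples. The sketch below mirrors the usual linear-interpolation approach to show why the bucket boundaries chosen for `mcp_tool_call_duration_seconds` limit the precision of the estimate. `estimateQuantile` is a hypothetical function written for illustration, not Prometheus code, and it handles edge cases only minimally.

```typescript
interface Bucket {
  le: number;         // upper bound in seconds; use Infinity for the +Inf bucket
  cumulative: number; // number of observations <= le
}

// Estimate the q-quantile (0 <= q <= 1) from cumulative buckets sorted by `le`.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  const total = buckets.length > 0 ? buckets[buckets.length - 1].cumulative : 0;
  if (total === 0) return NaN;

  const rank = q * total;
  const idx = buckets.findIndex(b => b.cumulative >= rank);
  const bucket = buckets[idx];

  // If the rank falls in the +Inf bucket, the best available answer is the
  // highest finite bound.
  if (bucket.le === Infinity) {
    return idx > 0 ? buckets[idx - 1].le : NaN;
  }

  const lowerBound = idx > 0 ? buckets[idx - 1].le : 0;
  const lowerCount = idx > 0 ? buckets[idx - 1].cumulative : 0;
  const inBucket = bucket.cumulative - lowerCount;
  if (inBucket === 0) return lowerBound;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
}

// 100 calls distributed over the buckets from the metric definition.
const sample: Bucket[] = [
  { le: 0.01, cumulative: 0 },
  { le: 0.05, cumulative: 10 },
  { le: 0.1, cumulative: 30 },
  { le: 0.5, cumulative: 60 },
  { le: 1, cumulative: 80 },
  { le: 5, cumulative: 94 },
  { le: 10, cumulative: 99 },
  { le: Infinity, cumulative: 100 },
];
const p95 = estimateQuantile(0.95, sample); // about 6 seconds
```

In this sample the 95th observation lands in the 5 to 10 second bucket, so the estimate is interpolated across a 5-second-wide range. If the alert threshold sits inside a wide bucket, adding boundaries near it would make the estimate more trustworthy.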

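Dashboard Design

Grafana Dashboard

Key panels:

  1. Overview panel: number of active Agents, total calls, error rate, average latency
  2. Tool panel: call count, latency distribution, and error rate for each tool
  3. Agent panel: conversation turns, token usage, and completion rate for each Agent
  4. Server panel: connection count, response time, and availability for each MCP Server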
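
The overview and tool panels can be driven directly by the metrics defined earlier. The PromQL strings below are one possible starting point, written as a TypeScript constant so they could be reused when dashboards are provisioned from code; the `panelQueries` name and the 5-minute window are assumptions. The Agent and Server panels would need metrics or labels that this article does not define (per-Agent identifiers, connection counts), so they are left out.

```typescript
// Candidate PromQL for Grafana panels, keyed by panel and series name.
const panelQueries = {
  overview: {
    activeAgents: 'sum(mcp_active_agents)',
    callsPerSecond: 'sum(rate(mcp_tool_calls_total[5m]))',
    errorRatio:
      'sum(rate(mcp_tool_calls_total{status="error"}[5m])) / sum(rate(mcp_tool_calls_total[5m]))',
    // Average latency = rate of summed durations / rate of observation count.
    avgLatencySeconds:
      'sum(rate(mcp_tool_call_duration_seconds_sum[5m])) / sum(rate(mcp_tool_call_duration_seconds_count[5m]))',
  },
  tools: {
    callsPerSecondByTool: 'sum by (tool) (rate(mcp_tool_calls_total[5m]))',
    // Keep the `le` label when aggregating buckets, or the quantile cannot be computed.
    p95LatencyByTool:
      'histogram_quantile(0.95, sum by (tool, le) (rate(mcp_tool_call_duration_seconds_bucket[5m])))',
    errorRatioByTool:
      'sum by (tool) (rate(mcp_tool_calls_total{status="error"}[5m])) / sum by (tool) (rate(mcp_tool_calls_total[5m]))',
  },
} as const;
```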

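Frequently Asked Questions (FAQ)

How long should trace data be retained?

We recommend keeping trace data for 7 to 30 days. Traces older than that are of limited value, but audit logs should be retained longer.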
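
For the in-memory `AgentTracer` shown earlier, a retention window could be enforced by periodically dropping old spans. A minimal sketch, assuming a hypothetical `pruneSpans` helper and a retention period expressed in days:

```typescript
interface TimedSpan {
  startTime: number; // epoch milliseconds, as produced by Date.now()
  endTime?: number;
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Keep spans that finished (or, if still open, started) within the retention window.
function pruneSpans<T extends TimedSpan>(
  spans: T[],
  retentionDays: number,
  now: number = Date.now(),
): T[] {
  const cutoff = now - retentionDays * MS_PER_DAY;
  return spans.filter(span => (span.endTime ?? span.startTime) >= cutoff);
}
```

A production setup would normally leave retention to the tracing backend; this only matters when spans are held in process memory, where an unbounded array would otherwise keep growing.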

How do I trace calls across Agents?

Propagate the trace context through the traceId. When one Agent calls another, pass along the same traceId.
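
When Agents talk over HTTP, one common carrier for the shared trace ID is the W3C Trace Context `traceparent` header (`version-traceid-parentid-flags`). The sketch below is an assumption about how the trace context from this article could be mapped to that header. The W3C format requires a 32-hex-character trace ID and a 16-hex-character parent ID, so the IDs here are generated with `randomBytes` rather than `randomUUID`. `newIds`, `formatTraceparent`, and `parseTraceparent` are hypothetical names.

```typescript
import { randomBytes } from 'node:crypto';

interface PropagatedContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
}

// Generate IDs in the sizes the traceparent header expects.
function newIds(): PropagatedContext {
  return {
    traceId: randomBytes(16).toString('hex'),
    spanId: randomBytes(8).toString('hex'),
  };
}

// Serialize as version 00 with the "sampled" flag (01) set.
function formatTraceparent(ctx: PropagatedContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-01`;
}

// Returns null for anything that is not a well-formed version-00 header.
function parseTraceparent(header: string): PropagatedContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-[0-9a-f]{2}$/.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const spanId = match[2];
  // All-zero IDs are invalid in the W3C format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId };
}
```

The calling Agent sends `formatTraceparent(...)` with its current span ID; the receiving Agent parses the header, keeps the `traceId`, and uses the received `spanId` as the `parentSpanId` of the first span it creates.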

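Does observability affect performance?

Metric collection and logging add very little overhead (typically < 1 ms). Distributed tracing costs slightly more, but its overhead usually stays within 5%.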

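Summary

Observability for Agent systems needs to be designed deliberately. Unlike a traditional web service, an Agent behaves non-deterministically, so it calls for richer tracing and smarter alerting. By combining the three pillars of metrics, traces, and logs, you can keep a complete picture of how your Agent system is running.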