
Preface

"Observability" has been talked to death over the past couple of years, but the reality in many teams looks like this: Prometheus handles metrics, ELK handles logs, Jaeger handles traces. The three systems operate in isolation, and troubleshooting means hopping back and forth between three UIs.

Last year we started rolling out OpenTelemetry (OTel for short), with the goal of unifying the data collection standard. After the better part of a year of wrangling, we finally got the three pillars (Metrics, Logs, Traces) connected.

This article shares our rollout experience: the architecture, the pitfalls we hit, and the end result.

Why OpenTelemetry

First, the status quo:

+-------------+     +-------------+     +-------------+
|  Prometheus |     |  ELK Stack  |     |   Jaeger    |
+------+------+     +------+------+     +------+------+
       |                   |                   |
       v                   v                   v
  Metrics SDK          Log agents         Tracing SDK
(various exporters) (Filebeat/Fluentd)  (Jaeger client)

The problems are obvious:

  1. Fragmented stack: three collection schemes, three data formats
  2. Broken context: when an alert fires, you can't find the matching logs and traces
  3. High maintenance cost: every language has to integrate three separate SDKs

This is exactly the problem OpenTelemetry sets out to solve, by unifying the collection standard:


+------------------+
|  OpenTelemetry   |
|    Collector     |
+--------+---------+
         |
   unified format
   (OTLP protocol)
         |
+--------+---------+
|     OTel SDK     |
| (Metrics + Logs  |
|  + Traces in one)|
+------------------+

Architecture

Our final architecture:


                       +-----------------+
                       |     Grafana     |
                       |  (unified UI)   |
                       +--------+--------+
                                |
        +---------------+-------+-------+---------------+
        |               |               |               |
        v               v               v               v
  +-----------+   +-----------+   +-----------+   +-----------+
  | Prometheus|   |   Loki    |   |   Tempo   |   |  Jaeger   |
  | (metrics) |   |  (logs)   |   | (traces)  |   | (fallback)|
  +-----------+   +-----------+   +-----------+   +-----------+
        ^               ^               ^               ^
        |               |               |               |
        +---------------+-------+-------+---------------+
                                |
                      +---------+---------+
                      |  OTel Collector   |
                      |  (gateway mode)   |
                      +---------+---------+
                                ^
                                | OTLP
                +---------------+---------------+
                |               |               |
          +-----+-----+   +-----+-----+   +-----+-----+
          | Service A |   | Service B |   | Service C |
          | (OTel SDK)|   | (OTel SDK)|   | (OTel SDK)|
          +-----------+   +-----------+   +-----------+


The core ideas:

  1. Applications integrate the OTel SDK and report data over the OTLP protocol (see the propagation sketch after this list)
  2. The Collector acts as a gateway: it receives, processes, and fans out all data
  3. Backend storage stays swappable, with no lock-in to a particular vendor
  4. Grafana is the single pane of glass, with Metrics/Logs/Traces cross-linked
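
To make point 1 concrete, here is a minimal sketch of how trace context crosses service boundaries once the SDK is wired up. It assumes the otelhttp contrib package (go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp) and a configured tracer provider like the one shown later; service-b is a placeholder hostname.

package main

import (
    "net/http"

    "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/propagation"
)

func main() {
    // The global propagator must be set explicitly; the default is a no-op.
    // W3C tracecontext + baggage are the standard choice.
    otel.SetTextMapPropagator(propagation.NewCompositeTextMapPropagator(
        propagation.TraceContext{}, propagation.Baggage{}))

    // Client side: the instrumented transport injects the current span
    // context into outgoing requests as a traceparent header.
    client := http.Client{Transport: otelhttp.NewTransport(http.DefaultTransport)}

    handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // r.Context() carries the server span started by otelhttp below,
        // so this downstream call joins the same trace.
        req, _ := http.NewRequestWithContext(r.Context(),
            http.MethodGet, "http://service-b:8080/api", nil) // hypothetical peer
        if resp, err := client.Do(req); err == nil {
            resp.Body.Close()
        }
        w.WriteHeader(http.StatusOK)
    })

    // Server side: the wrapper extracts incoming trace context and starts
    // a server span for every request.
    http.ListenAndServe(":8080", otelhttp.NewHandler(handler, "ServiceA"))
}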

Deploying the Collector

The OpenTelemetry Collector is the core component, responsible for receiving, processing, and exporting data.

Docker deployment

# docker-compose.yml
version: '3.8'
services:
  otel-collector:
    image: otel/opentelemetry-collector-contrib:0.92.0
    container_name: otel-collector
    command: ["--config=/etc/otel-collector-config.yaml"]
    volumes:
      - ./otel-collector-config.yaml:/etc/otel-collector-config.yaml
    ports:
      - "4317:4317"   # OTLP gRPC
      - "4318:4318"   # OTLP HTTP
      - "8888:8888"   # the Collector's own metrics
      - "8889:8889"   # Prometheus exporter
    restart: unless-stopped

Collector configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  # Also accept Prometheus format (compatible with existing monitoring)
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 10s
          static_configs:
            - targets: ['localhost:8888']

processors:
  # Batch to reduce network overhead
  batch:
    timeout: 5s
    send_batch_size: 1000
  # Memory limit to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  # Add common attributes
  resource:
    attributes:
      - key: deployment.environment
        value: production
        action: upsert

exporters:
  # Metrics -> Prometheus
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  # Traces -> Tempo
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true
  # Logs -> Loki
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        service.name: "service_name"
        level: "severity"
  # For debugging
  logging:
    verbosity: detailed

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch, resource]
      exporters: [loki]

Application integration

Go services

package main

import (
    "context"
    "log"
    "net/http"
    "time"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
    "go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
    "go.opentelemetry.io/otel/metric"
    sdkmetric "go.opentelemetry.io/otel/sdk/metric"
    "go.opentelemetry.io/otel/sdk/resource"
    "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

func initTracer() (*trace.TracerProvider, error) {
    exporter, err := otlptracegrpc.New(context.Background(),
        otlptracegrpc.WithEndpoint("otel-collector:4317"),
        otlptracegrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    res := resource.NewWithAttributes(semconv.SchemaURL,
        semconv.ServiceName("user-service"),
        semconv.ServiceVersion("1.0.0"),
        attribute.String("environment", "production"),
    )
    tp := trace.NewTracerProvider(
        trace.WithBatcher(exporter),
        trace.WithResource(res),
        trace.WithSampler(trace.TraceIDRatioBased(0.1)), // 10% sampling
    )
    otel.SetTracerProvider(tp)
    return tp, nil
}

func initMeter() (*sdkmetric.MeterProvider, error) {
    exporter, err := otlpmetricgrpc.New(context.Background(),
        otlpmetricgrpc.WithEndpoint("otel-collector:4317"),
        otlpmetricgrpc.WithInsecure(),
    )
    if err != nil {
        return nil, err
    }
    mp := sdkmetric.NewMeterProvider(
        sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter,
            sdkmetric.WithInterval(15*time.Second))),
    )
    otel.SetMeterProvider(mp)
    return mp, nil
}

func main() {
    tp, _ := initTracer()
    defer tp.Shutdown(context.Background())
    mp, _ := initMeter()
    defer mp.Shutdown(context.Background())

    tracer := otel.Tracer("user-service")
    meter := otel.Meter("user-service")

    // Create the instruments
    requestCounter, _ := meter.Int64Counter("http_requests_total")
    requestDuration, _ := meter.Float64Histogram("http_request_duration_seconds")

    http.HandleFunc("/api/user", func(w http.ResponseWriter, r *http.Request) {
        ctx, span := tracer.Start(r.Context(), "GetUser")
        defer span.End()
        start := time.Now()

        // Business logic
        span.SetAttributes(attribute.String("user.id", r.URL.Query().Get("id")))

        // Simulate a database query
        _, dbSpan := tracer.Start(ctx, "DB.Query")
        time.Sleep(50 * time.Millisecond)
        dbSpan.End()

        // Record the metrics
        requestCounter.Add(ctx, 1,
            metric.WithAttributes(attribute.String("method", r.Method)))
        requestDuration.Record(ctx, time.Since(start).Seconds())

        w.Write([]byte(`{"name": "test"}`))
    })

    log.Println("Server starting on :8080")
    http.ListenAndServe(":8080", nil)
}
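
One usage note: instead of hard-coding the endpoint, the OTel SDKs also honor the standard OTEL_EXPORTER_OTLP_ENDPOINT and OTEL_SERVICE_NAME environment variables, so the same binary can be pointed at different Collectors per environment.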

Java services

For Java, the agent approach is more convenient; no code changes required:

# Download the agent
wget https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/download/v2.1.0/opentelemetry-javaagent.jar

# Add the agent flags at startup
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://otel-collector:4317 \
  -Dotel.traces.sampler=traceidratio \
  -Dotel.traces.sampler.arg=0.1 \
  -jar order-service.jar

Auto-instrumentation covers HTTP requests, database calls, Redis, Kafka, and more, out of the box.
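
If a particular integration turns out to be too noisy, the agent also has documented per-instrumentation switches of the form -Dotel.instrumentation.<name>.enabled=false.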

Python services

from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource

resource = Resource.create({"service.name": "payment-service"})

# Configure the tracer
trace_provider = TracerProvider(resource=resource)
trace_exporter = OTLPSpanExporter(endpoint="otel-collector:4317", insecure=True)
trace_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(trace_provider)

# Configure the meter
metric_reader = PeriodicExportingMetricReader(
    OTLPMetricExporter(endpoint="otel-collector:4317", insecure=True),
    export_interval_millis=15000,
)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)

tracer = trace.get_tracer("payment-service")
meter = metrics.get_meter("payment-service")

# Usage
@tracer.start_as_current_span("process_payment")
def process_payment(order_id):
    span = trace.get_current_span()
    span.set_attribute("order.id", order_id)
    # Business logic...
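
Python also offers an agent-like path comparable to Java's: install the opentelemetry-distro package and start the service through the opentelemetry-instrument wrapper, which auto-instruments supported libraries without code changes.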

Correlating Metrics, Logs, and Traces

This is the most valuable part of OTel: correlation across the three pillars.

Injecting the TraceID into logs

import (
    "context"

    "go.opentelemetry.io/otel/trace"
    "go.uber.org/zap"
)

func LogWithTrace(ctx context.Context, logger *zap.Logger) *zap.Logger {
    span := trace.SpanFromContext(ctx)
    if span.SpanContext().IsValid() {
        return logger.With(
            zap.String("trace_id", span.SpanContext().TraceID().String()),
            zap.String("span_id", span.SpanContext().SpanID().String()),
        )
    }
    return logger
}

// Usage
func handleRequest(ctx context.Context) {
    logger := LogWithTrace(ctx, zap.L())
    logger.Info("Processing request", zap.String("user_id", "123"))
}

Once logs carry a trace_id, you can jump straight from a log line to the corresponding trace in Grafana.
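
One practical detail: the Loki derived-field regex configured later in the Grafana section expects JSON logs with a trace_id key. A minimal sketch, assuming zap's production (JSON) encoder:

package main

import "go.uber.org/zap"

func main() {
    // zap.NewProduction uses the JSON encoder, so fields are emitted as
    // "trace_id":"..." -- the shape the Loki derivedFields regex matches.
    logger, _ := zap.NewProduction()
    defer logger.Sync()
    zap.ReplaceGlobals(logger)

    // With LogWithTrace from above, a request log line looks roughly like:
    // {"level":"info","msg":"Processing request","user_id":"123",
    //  "trace_id":"4bf92f3577b34da6a3ce929d0e0e4736","span_id":"00f067aa0ba902b7"}
    zap.L().Info("Processing request", zap.String("user_id", "123"))
}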

Exemplar correlation

Prometheus 2.25+ supports exemplars, which link a metric sample to a TraceID:

// Record the metric with the TraceID attached as an exemplar;
// ctx must carry the active (sampled) span for the SDK to pick it up
requestDuration.Record(ctx, duration,
    metric.WithAttributes(attribute.String("method", "GET")),
)

When a metric looks anomalous in Grafana, you can jump straight to the specific trace. (Note that Prometheus must be started with --enable-feature=exemplar-storage for exemplars to be stored and queryable.)

Grafana configuration

Data source configuration

# grafana/provisioning/datasources/datasources.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    url: http://prometheus:9090
    isDefault: true
  - name: Tempo
    type: tempo
    url: http://tempo:3200
    jsonData:
      tracesToLogs:
        datasourceUid: loki
        tags: ['service.name']
        mappedTags: [{ key: 'service.name', value: 'service_name' }]
        mapTagNamesEnabled: true
  - name: Loki
    type: loki
    url: http://loki:3100
    jsonData:
      derivedFields:
        - datasourceUid: tempo
          matcherRegex: '"trace_id":"(\w+)"'
          name: TraceID
          url: '$${__value.raw}'
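
Note the symmetry in this setup: tracesToLogs on the Tempo data source links spans to the matching Loki log stream (keyed on service.name), while derivedFields on Loki turns the trace_id in a log line into a link back to Tempo, giving two-way navigation. The doubled $$ in $${__value.raw} escapes the $ so provisioning-time variable expansion leaves Grafana's template variable intact.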

The result

With everything wired up, troubleshooting looks like this:

  1. A Prometheus alert fires: some service's P99 latency has spiked
  2. Click the exemplar: jump to the trace of the specific slow request
  3. Inspect the trace in Tempo: a DB query is taking abnormally long
  4. Jump from the trace to the logs: see the exact SQL and the error message

With the whole chain connected end to end, the efficiency gain is enormous.

Production experience

Sampling strategy

Collecting everything is unrealistic; you need to set a sampling rate:

// Head sampling: respect the parent's decision, otherwise sample by trace ID
trace.NewTracerProvider(
    trace.WithSampler(trace.ParentBased(
        trace.TraceIDRatioBased(0.1), // 10% of normal requests
    )),
)

A smarter approach is tail-based sampling in the Collector: the decision is made after the whole trace has arrived, so errors and slow requests can be kept with certainty (a head sampler decides at span start, before the outcome is known):

processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Always keep errors
      - name: errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      # Always keep slow requests
      - name: slow-requests
        type: latency
        latency: {threshold_ms: 1000}
      # Randomly sample the rest
      - name: randomized
        type: probabilistic
        probabilistic: {sampling_percentage: 10}
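
One caveat worth knowing: tail sampling only works when all spans of a trace reach the same Collector instance. If the gateway is scaled horizontally, you need trace-ID-aware routing in front of it; the collector-contrib loadbalancing exporter exists for exactly this.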

Resource control

The Collector itself also needs monitoring and limits:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 2000
    spike_limit_mib: 400

extensions:
  health_check:
    endpoint: :13133
  zpages:
    endpoint: :55679  # debug pages
# Note: extensions only take effect when also listed under service.extensions

Multi-cluster management

We run three Kubernetes clusters, each with its own Collector. To manage them, I used 星空组网 to connect the three clusters' internal networks, so a single Grafana can query data from all of them. Otherwise each cluster would need its own Grafana, and the operational overhead would be too high.

Pitfalls we hit

Pitfall 1: Collector memory blowups

Right after launch the Collector kept OOMing. The cause: the batch processor was accumulating too much data.

Fix: add a memory_limiter and reduce the batch size. (Order matters: memory_limiter should come first in the processor chain so it can push back before batching, which is how the config above is arranged.)

Pitfall 2: Inconsistent SDK versions

Different services were on different OTel SDK versions, which led to differences in the data format.

Fix: standardize on one SDK version, and use the transform processor in the Collector to smooth over the differences.

Pitfall 3: Log volume too large

OTel log collection captures everything by default, and Loki couldn't keep up.

Fix: filter at the application layer and only collect ERROR and above, or use the filter processor in the Collector:

processors:
  filter:
    logs:
      exclude:
        match_type: strict
        severity_texts: ["DEBUG", "INFO"]

Summary

What OpenTelemetry changed for us:

  1. Unified standard: one SDK covers all three pillars
  2. Correlated data: one-click jumps from metrics to traces to logs
  3. Vendor neutrality: the storage backends can be swapped at any time
  4. Active community: official support for all mainstream languages and frameworks

The rollout cost is genuinely not low, but the long-term payoff is clear. Especially when debugging production issues, being able to pinpoint the exact code quickly is a very real efficiency gain.

My advice: new projects should use OTel from day one; legacy projects can migrate gradually, first wiring everything into the Collector, then replacing each service's SDK over time.

