Observability without the bill shock | Tishan David AI & MCP Editorial

The invoice that scales with your success

The first time a Datadog bill doubles, nobody planned it. You shipped a feature, traffic climbed, your trace volume climbed with it, and the line item that read “APM” quietly tracked your growth. That is the whole problem with usage-priced observability: the better your product does, the more you pay to watch it work.

Let me put real numbers on it. Datadog APM lists at US$31 per host per month on an annual commit (Pro is $35, Enterprise $40), or $36 per host on demand. APM cannot be bought alone: every APM host needs a paired Infrastructure plan, so the honest cost of adding tracing to a box is closer to $46 per host per month ($31 for APM plus $15 for Infrastructure Pro). Each host includes 150 GB of ingested spans and 1 million indexed spans at 15-day retention. Go past that and you pay $0.10 per extra GB ingested and $1.70 per million indexed spans. For a ten-service team on twenty hosts, you are at roughly $11,000 a year before a single overage, and overages are exactly what a growth spike produces.
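Here is that arithmetic written out as a rough sketch. The $15 is the Infrastructure share implied by the $46 figure, the 5,000 GB month is a made-up spike for illustration, and the calculation assumes the per-host ingest allowance pools across the fleet.

```latex
\begin{align*}
\text{cost per host per month} &= \$31 + \$15 = \$46 \\
\text{annual base, 20 hosts} &= 20 \times \$46 \times 12 = \$11{,}040 \\
\text{included ingest per month} &= 20 \times 150\ \text{GB} = 3{,}000\ \text{GB} \\
\text{overage in a hypothetical 5{,}000 GB month} &= (5{,}000 - 3{,}000) \times \$0.10 = \$200
\end{align*}
```

The ingest overage alone looks small. Indexed spans at $1.70 per million are billed on top of it, and both meters move with traffic rather than with anything you budgeted.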

None of this is Datadog being greedy. The product is genuinely good. The pricing model is just structurally hostile to a small team whose telemetry volume is volatile and whose budget is not.

OpenTelemetry is the part that makes you portable

The strategic move is to stop instrumenting against a vendor and start instrumenting against OpenTelemetry. OTel is the vendor-neutral standard for traces, metrics, and logs: a set of APIs and SDKs, plus OTLP as the wire protocol. Your application emits OTLP; where that data lands is a routing decision, not a rewrite.

That distinction is the entire escape hatch. If your code speaks OTLP, switching backends is a Collector config change. You can run Datadog and a self-hosted stack side by side during a migration, or send errors to one place and the firehose to another. The lock-in that makes a renewal negotiation painful simply does not form. I have walked a few teams through this exact swap, and the work that survives is the instrumentation — the backend is replaceable. (More of that pattern in my migration write-ups.)
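To make "a Collector config change" concrete, here is a minimal sketch of a Collector sending the same traces to Datadog and to a self-hosted Tempo during a migration. It assumes the contrib distribution of the Collector (which ships the datadog exporter), a DD_API_KEY environment variable, and a Tempo instance reachable at tempo:4317; those names are placeholders, not anything your setup is guaranteed to have.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # applications send OTLP here

processors:
  batch: {}

exporters:
  # Existing vendor backend (assumes the contrib distribution).
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}     # assumed environment variable
  # Self-hosted backend running in parallel during the migration.
  otlp/tempo:
    endpoint: tempo:4317         # hypothetical hostname
    tls:
      insecure: true             # plaintext inside a trusted network only

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, otlp/tempo]
```

Cutting over is then a matter of removing one name from the exporters list. The application code never changes.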

The self-hosted stack that actually holds up

The credible open-source answer is Grafana’s LGTM stack, fed by the OTel Collector or Grafana Alloy:

Tempo for traces. It needs only object storage to run — no hot database to babysit — and it generates RED (rate, error, duration) metrics and service-graph metrics from incoming spans, pushing them to your metrics backend.
Loki for logs. Horizontally scalable, indexes labels rather than full text, and like Tempo leans on cheap object storage.
Mimir (or plain Prometheus) for metrics. Mimir scales to a billion active series with durable object-storage backing; most small teams never outgrow a single Prometheus.
Grafana to query and correlate all three, with trace-to-log and metric-to-trace links wired in.
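As a rough illustration of the Tempo item above, this fragment points trace storage at an object-storage bucket and turns on the metrics generator. Treat it as a sketch: the bucket name, endpoint, WAL path, and Prometheus URL are hypothetical, credentials are omitted, and the layout of the overrides block has changed between Tempo releases, so check it against the version you run.

```yaml
storage:
  trace:
    backend: s3
    s3:
      bucket: tempo-traces                        # hypothetical bucket
      endpoint: s3.ap-southeast-2.amazonaws.com   # assumed region endpoint

metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      # Assumes a Prometheus started with the remote-write receiver enabled.
      - url: http://prometheus:9090/api/v1/write

overrides:
  defaults:
    metrics_generator:
      # span-metrics yields the RED metrics; service-graphs yields the graph edges.
      processors: [service-graphs, span-metrics]
```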

The cost shape inverts. Instead of paying per host and per GB ingested, you pay for a few VMs and an object-storage bucket. Tempo and Loki keeping their bulk in object storage is the line item that matters — S3-class storage runs cents per GB per month, and it does not care whether your trace volume tripled overnight. Your spend tracks infrastructure you control, not a meter someone else reads.

For a team that wants the managed-but-cheap middle ground, Grafana Cloud’s free tier is genuinely usable: 10k metric series, 50 GB logs, 50 GB traces, three users, no credit card. It is a fair place to prototype before you decide whether to self-host. The Sydney crowd I trade notes with mostly start there, then graduate to self-hosted once volume makes the free tier tight — there’s a running comparison on the local-first tooling page.

Sampling is where the bill is won or lost

Self-hosting does not make data free. CPU, memory, and storage are still real. The discipline that keeps a self-hosted stack cheap is the same discipline that would have kept your Datadog bill down: send less, but send the right less.

For traces, that means tail-based sampling in the Collector. Head sampling decides at the start of a trace, blind to how it ends. Tail sampling waits for the whole trace, then keeps it based on what happened — every error, every slow request, and a small percentage of the boring successful ones. You keep the traces you would actually open and drop the ones you never would.

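receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Referenced by the pipeline below, so it has to be declared.
  batch: {}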
  tail_sampling:
    decision_wait: 10s
    num_traces: 50000
    policies:
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp/tempo]

One catch worth knowing before you scale: tail sampling needs every span of a trace to reach the same Collector instance. Past a single Collector, you run two layers — a front tier using the load-balancing exporter to route by trace ID, and a back tier that does the actual sampling. Set decision_wait longer than your slowest expected trace, or you will silently drop long ones, and size num_traces to your in-flight volume or the buffer evicts traces before it judges them.
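The front tier of that two-layer setup could look roughly like this. It uses the loadbalancing exporter from the contrib distribution with a DNS resolver; the hostname otel-sampler-headless is a hypothetical headless service that resolves to every back-tier Collector, and the back tier is assumed to run the tail_sampling pipeline shown earlier.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID          # all spans of one trace go to the same backend
    protocol:
      otlp:
        tls:
          insecure: true          # plaintext inside a trusted network only
    resolver:
      dns:
        hostname: otel-sampler-headless   # hypothetical back-tier service name
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The front tier does no sampling of its own. Its only job is consistent routing, so it can be scaled freely without affecting which traces are kept.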

Cardinality is the metrics version of the same fight

Metrics have no sampling knob; their cost is cardinality — the number of unique label combinations. Each active series costs roughly 1 to 3 KiB in the Prometheus head block. A healthy mid-sized instance sits between 100k and 2 million active series. North of 5 million you have a problem; north of 10 million it is an incident waiting to happen.

The series explosion almost always comes from one mistake: a high-churn value used as a label. user_id, request_id, session, trace_id, a URL with query string — each unique value mints a permanent new series. The fixes, cheapest first:

Drop or rewrite offending labels with metric_relabel_configs at scrape time.
Pre-aggregate with recording rules so dashboards query a small derived series.
Cap each scrape with sample_limit, bound the number of labels per series with label_limit, and cap the number of targets per job with target_limit.
Publish a label allowlist (env, region, service, status_code, method) and a denylist of anything unbounded.
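The first three fixes might be sketched like this in Prometheus configuration. The job name, target, limit values, and metric name http_requests_total are placeholders for illustration, so tune them to your own instance. One caveat: dropping a label can make two series in the same scrape collide, so check that the remaining labels still identify each series uniquely.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: api                  # hypothetical job
    sample_limit: 50000            # fail the scrape if it returns more samples
    label_limit: 30                # fail the scrape if a series has more labels
    target_limit: 200              # fail the job if discovery returns more targets
    static_configs:
      - targets: ["api:9090"]      # hypothetical target
    metric_relabel_configs:
      # Strip unbounded labels before the samples are stored.
      - action: labeldrop
        regex: user_id|request_id|session|trace_id
```

```yaml
# rules.yml (separate rule file)
groups:
  - name: api_aggregates
    rules:
      # Dashboards query this small derived series instead of the raw one.
      - record: job:http_requests:rate5m
        expr: sum by (job, status_code) (rate(http_requests_total[5m]))
```

The limits fail loudly rather than trimming quietly: a scrape that exceeds sample_limit or label_limit is rejected whole. That is usually what you want, because the offending deploy shows up as a failed scrape rather than as a slow memory leak.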

If per-bucket histogram series dominate your count, native histograms collapse a whole distribution into one series with dynamic buckets — a structural fix rather than a trim.

Why it matters

The pitch is not “self-hosting is always cheaper.” Below a handful of services, a managed free tier wins on time saved, and a heavily staffed platform team might rationally pay Datadog to never think about Mimir compaction. The pitch is that OpenTelemetry turns that into a reversible decision. Instrument once against the open standard, and your backend becomes a price negotiation you can actually walk away from.

For a small team in particular, that changes the posture from hoping traffic stays flat to wanting it to grow — because the observability bill no longer grows with it. The teams that get burned are the ones who discover, mid-renewal, that their telemetry only speaks one vendor’s dialect. Don’t be that team.