Observability without the bill shock

Traces, metrics, and logs for a small team — without a SaaS invoice that grows every time traffic does.

By Tishan David · 6 min read

The invoice that scales with your success

The first time a Datadog bill doubles, nobody planned it. You shipped a feature, traffic climbed, your trace volume climbed with it, and the line item that read “APM” quietly tracked your growth. That is the whole problem with usage-priced observability: the better your product does, the more you pay to watch it work.

Let me put real numbers on it. Datadog APM lists at US$31 per host per month on an annual commit (Pro is $35, Enterprise $40), or $48 per host on demand. APM cannot be bought alone — every APM host needs a paired Infrastructure plan, so the honest cost of adding tracing to a box is closer to $46 per host per month. Each host includes 150 GB of ingested spans and 1 million indexed spans at 15-day retention. Go past that and you pay $0.10 per extra GB ingested and $1.70 per million indexed spans. For a ten-service team on twenty hosts, you are at roughly $11,000 a year before a single overage, and overages are exactly what a growth spike produces.
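Spelled out, the arithmetic behind that figure looks like this. It uses only the list prices quoted above. The overage line is my own sketch: the symbols G (GB ingested per month), S (millions of indexed spans per month), and H (hosts) are introduced here for illustration, and I am assuming the per-host allowances pool across all hosts, which you should confirm against your own contract.

```latex
\begin{align*}
\text{base cost} &= 20~\text{hosts} \times \$46~\text{per host per month} \times 12~\text{months} = \$11{,}040~\text{per year} \\
\text{included ingest} &= 20 \times 150~\text{GB} = 3{,}000~\text{GB per month} \\
\text{included indexed spans} &= 20 \times 1~\text{million} = 20~\text{million per month} \\
\text{monthly overage} &= \$0.10 \cdot \max(0,\; G - 150H) \;+\; \$1.70 \cdot \max(0,\; S - H)
\end{align*}
```

The base cost is fixed by host count. The overage term is the part that moves with traffic, which is why it is the hard one to budget.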

None of this is Datadog being greedy. The product is genuinely good. The pricing model is just structurally hostile to a small team whose telemetry volume is volatile and whose budget is not.

OpenTelemetry is the part that makes you portable

The strategic move is to stop instrumenting against a vendor and start instrumenting against OpenTelemetry. OTel is the vendor-neutral standard for traces, metrics, and logs: a specification, a set of SDKs, and a wire protocol, OTLP. Your application emits OTLP; where that data lands is a routing decision, not a rewrite.
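As a rough sketch of what "emits OTLP" means in practice, here is a Docker Compose fragment that points an already-instrumented service at a Collector using the standard OpenTelemetry SDK environment variables. The service name `checkout`, the image, and the `otel-collector` hostname are hypothetical placeholders; only the `OTEL_*` variable names come from the OpenTelemetry specification.

```yaml
# docker-compose.yml (fragment): names and image are illustrative only
services:
  checkout:
    image: example/checkout:latest        # hypothetical image
    environment:
      OTEL_SERVICE_NAME: checkout          # how the service shows up in traces
      OTEL_TRACES_EXPORTER: otlp           # export traces over OTLP
      OTEL_EXPORTER_OTLP_PROTOCOL: grpc
      # Point at the Collector, not at a vendor. Swapping backends
      # later never touches this file.
      OTEL_EXPORTER_OTLP_ENDPOINT: http://otel-collector:4317
```

Nothing in the application's configuration names a backend. That is the property the rest of this article leans on.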

That distinction is the entire escape hatch. If your code speaks OTLP, switching backends is a Collector config change. You can run Datadog and a self-hosted stack side by side during a migration, or send errors to one place and the firehose to another. The lock-in that makes a renewal negotiation painful simply does not form. I have walked a few teams through this exact swap, and the work that survives is the instrumentation — the backend is replaceable. (More of that pattern in my migration write-ups.)
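To show the side-by-side migration, here is a minimal Collector sketch that fans the same traces out to Datadog and to a self-hosted Tempo. It assumes the Collector contrib distribution (the `datadog` exporter is not in the core build), an API key supplied through a `DD_API_KEY` environment variable, and a Tempo instance reachable at `tempo:4317`. Treat it as a starting point, not a production config.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch: {}

exporters:
  # Existing vendor backend, kept alive during the migration.
  datadog:
    api:
      site: datadoghq.com
      key: ${env:DD_API_KEY}
  # Self-hosted backend receiving the same spans.
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true      # assumes a private network; enable TLS otherwise

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [datadog, otlp/tempo]
```

Once the self-hosted side has earned your trust, cutting over means deleting one entry from the `exporters` list.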

The self-hosted stack that actually holds up

The credible open-source answer is Grafana’s LGTM stack, fed by the OTel Collector or Grafana Alloy:

  • Tempo for traces. It needs only object storage to run, with no hot database to babysit, and its optional metrics-generator derives RED (rate, error, duration) metrics and service-graph metrics from incoming spans and remote-writes them to your metrics backend.
  • Loki for logs. Horizontally scalable, indexes labels rather than full text, and like Tempo leans on cheap object storage.
  • Mimir (or plain Prometheus) for metrics. Mimir scales to a billion active series with durable object-storage backing; most small teams never outgrow a single Prometheus.
  • Grafana to query and correlate all three, with trace-to-log and metric-to-trace links wired in.

The cost shape inverts. Instead of paying per host and per GB ingested, you pay for a few VMs and an object-storage bucket. Tempo and Loki keeping their bulk in object storage is the line item that matters — S3-class storage runs cents per GB per month, and it does not care whether your trace volume tripled overnight. Your spend tracks infrastructure you control, not a meter someone else reads.
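For a feel of how little configuration the object-storage part takes, here is a hedged Tempo fragment. The bucket name, region, and Prometheus URL are hypothetical, and the layout of the `overrides` block has changed between Tempo releases, so check the key names against the version you run.

```yaml
# tempo.yaml (fragment): a sketch, not a complete config
storage:
  trace:
    backend: s3
    s3:
      bucket: example-tempo-traces               # hypothetical bucket
      endpoint: s3.ap-southeast-2.amazonaws.com
      region: ap-southeast-2
    wal:
      path: /var/tempo/wal                       # local write-ahead log only

# Optional: derive RED and service-graph metrics from spans.
metrics_generator:
  storage:
    path: /var/tempo/generator/wal
    remote_write:
      - url: http://prometheus:9090/api/v1/write # assumes remote-write receiving is enabled

overrides:
  defaults:
    metrics_generator:
      processors: [service-graphs, span-metrics]
```

If the target is plain Prometheus, it has to be started with `--web.enable-remote-write-receiver` to accept those writes.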

For a team that wants the managed-but-cheap middle ground, Grafana Cloud’s free tier is genuinely usable: 10k metric series, 50 GB logs, 50 GB traces, three users, no credit card. It is a fair place to prototype before you decide whether to self-host. The Sydney crowd I trade notes with mostly start there, then graduate to self-hosted once volume makes the free tier tight — there’s a running comparison on the local-first tooling page.

Sampling is where the bill is won or lost

Self-hosting does not make data free. CPU, memory, and storage are still real. The discipline that keeps a self-hosted stack cheap is the same discipline that would have kept your Datadog bill down: send less, but send the right less.

For traces, that means tail-based sampling in the Collector. Head sampling decides at the start of a trace, blind to how it ends. Tail sampling waits for the whole trace, then keeps it based on what happened — every error, every slow request, and a small percentage of the boring successful ones. You keep the traces you would actually open and drop the ones you never would.

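# Requires the Collector contrib distribution (tail_sampling is not in core).
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  # Buffer each trace, then keep it if ANY policy below matches.
  tail_sampling:
    decision_wait: 10s    # wait this long for a trace's spans before deciding
    num_traces: 50000     # traces held in memory while waiting
    policies:
      # Keep every trace that contains an error.
      - name: keep-errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      # Keep every trace slower than 500 ms.
      - name: keep-slow
        type: latency
        latency:
          threshold_ms: 500
      # Keep 5% of everything else as a baseline.
      - name: sample-the-rest
        type: probabilistic
        probabilistic:
          sampling_percentage: 5
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls:
      insecure: true      # plaintext; fine on a private network only

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]   # sample first, then batch
      exporters: [otlp/tempo]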

One catch worth knowing before you scale: tail sampling needs every span of a trace to reach the same Collector instance. Past a single Collector, you run two layers: a front tier using the load-balancing exporter to route by trace ID, and a back tier that does the actual sampling. Set decision_wait longer than your slowest expected trace, or the sampler decides on a partial trace and can miss the error or slow span that arrives late. Size num_traces to your in-flight volume, or the buffer evicts traces before it judges them.
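The front tier of that two-layer setup can be sketched roughly as follows. It assumes the contrib distribution (for the `loadbalancing` exporter) and a DNS name that resolves to every back-tier Collector. The hostname `otel-sampler-headless` is a hypothetical example, such as a Kubernetes headless service would provide.

```yaml
# Front-tier Collector: no sampling here, only trace-aware routing.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID            # all spans of one trace go to one backend
    protocol:
      otlp:
        tls:
          insecure: true            # assumes a private network
    resolver:
      dns:
        hostname: otel-sampler-headless   # hypothetical; must list all back-tier instances
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

The back tier then runs the tail_sampling pipeline unchanged. Keep in mind that when the set of back-tier instances changes, some in-flight traces get split across instances during the reshuffle.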

Cardinality is the metrics version of the same fight

Metrics have no sampling knob; their cost is cardinality — the number of unique label combinations. Each active series costs roughly 1 to 3 KiB in the Prometheus head block. A healthy mid-sized instance sits between 100k and 2 million active series. North of 5 million you have a problem; north of 10 million it is an incident waiting to happen.
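A back-of-the-envelope version of those numbers, plus an illustration of how a single label blows up the count. The label cardinalities (5 methods, 10 status codes, 10 services, 10,000 user IDs) are made-up figures chosen for the example, not measurements.

```latex
\begin{align*}
\text{head memory} &\approx N_{\text{series}} \times (1~\text{to}~3~\text{KiB}) \\
N_{\text{series}} = 10^{6} &\;\Rightarrow\; \text{roughly } 1~\text{to}~3~\text{GiB} \\[4pt]
\text{series per metric} &\le \prod_{i} \bigl|\text{values of label } i\bigr| \\
\underbrace{5}_{\text{method}} \times \underbrace{10}_{\text{status\_code}} \times \underbrace{10}_{\text{service}} &= 500 \\
500 \times \underbrace{10{,}000}_{\text{user\_id}} &= 5{,}000{,}000
\end{align*}
```

One unbounded label takes a single metric from a rounding error to the five-million mark, where the trouble starts.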

The series explosion almost always comes from one mistake: a high-churn value used as a label. user_id, request_id, session, trace_id, a URL with query string: each unique value mints a new series that lingers in storage for the full retention window. The fixes, cheapest first:

  • Drop or rewrite offending labels with metric_relabel_configs at scrape time.
  • Pre-aggregate with recording rules so dashboards query a small derived series.
  • Cap each scrape with sample_limit, bound the number of labels per series with label_limit, and cap the number of targets per job with target_limit.
  • Publish a label allowlist (env, region, service, status_code, method) and a denylist of anything unbounded.
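The first three fixes above might look something like this in Prometheus configuration. The job name, target, limit values, the `path` label, and the `http_requests_total` metric are all illustrative assumptions, so substitute your own.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: checkout                  # hypothetical job
    sample_limit: 20000                 # fail the scrape if it returns more samples
    label_limit: 30                     # fail the scrape if any series has more labels
    target_limit: 100                   # cap targets discovered for this job
    static_configs:
      - targets: ["checkout:9100"]
    metric_relabel_configs:
      # Drop labels that are unbounded by nature.
      - action: labeldrop
        regex: user_id|request_id|session|trace_id
      # Rewrite: strip the query string from a path label.
      - source_labels: [path]
        regex: '([^?]*)\?.*'
        target_label: path
        replacement: '$1'

# rules.yml (a separate file, loaded via rule_files)
groups:
  - name: http_aggregates
    rules:
      # Dashboards query this small derived series instead of the raw one.
      - record: service:http_requests:rate5m
        expr: sum by (service, status_code) (rate(http_requests_total[5m]))
```

One caution on `labeldrop`: if the dropped label was the only thing telling two series apart, they collide after relabeling and the scrape reports duplicate samples. Fixing the label at the source is still the cleaner long-term answer.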

If per-bucket histogram series dominate your count, native histograms collapse a whole distribution into one series with dynamic buckets — a structural fix rather than a trim.

Why it matters

The pitch is not “self-hosting is always cheaper.” Below a handful of services, a managed free tier wins on time saved, and a heavily staffed platform team might rationally pay Datadog to never think about Mimir compaction. The pitch is that OpenTelemetry turns that into a reversible decision. Instrument once against the open standard, and your backend becomes a price negotiation you can actually walk away from.

For a small team in particular, that changes the posture from hoping traffic stays flat to wanting it to grow — because the observability bill no longer grows with it. The teams that get burned are the ones who discover, mid-renewal, that their telemetry only speaks one vendor’s dialect. Don’t be that team.