Observability without the bill shock
Traces, metrics, and logs for a small team — without a SaaS invoice that grows every time traffic does.
The invoice that scales with your success
The first time a Datadog bill doubles, nobody planned it. You shipped a feature, traffic climbed, your trace volume climbed with it, and the line item that read “APM” quietly tracked your growth. That is the whole problem with usage-priced observability: the better your product does, the more you pay to watch it work.
Let me put real numbers on it. Datadog APM lists at US$31 per host per month on an annual commit (Pro is $35, Enterprise $40), and on-demand rates run as high as $48 per host. APM cannot be bought alone: every APM host needs a paired Infrastructure plan, so the honest cost of adding tracing to a box is closer to $46 per host per month ($31 for APM plus $15 for Infrastructure Pro). Each host includes 150 GB of ingested spans and 1 million indexed spans at 15-day retention. Go past that and you pay $0.10 per extra GB ingested and $1.70 per million indexed spans. For a ten-service team on twenty hosts, that is 20 × $46 × 12, or roughly $11,000 a year before a single overage, and overages are exactly what a growth spike produces.
None of this is Datadog being greedy. The product is genuinely good. The pricing model is just structurally hostile to a small team whose telemetry volume is volatile and whose budget is not.
OpenTelemetry is the part that makes you portable
The strategic move is to stop instrumenting against a vendor and start instrumenting against OpenTelemetry. OTel is the vendor-neutral wire format and SDK for traces, metrics, and logs. Your application emits OTLP; where that data lands is a routing decision, not a rewrite.
That distinction is the entire escape hatch. If your code speaks OTLP, switching backends is a Collector config change. You can run Datadog and a self-hosted stack side by side during a migration, or send errors to one place and the firehose to another. The lock-in that makes a renewal negotiation painful simply does not form. I have walked a few teams through this exact swap, and the work that survives is the instrumentation — the backend is replaceable. (More of that pattern in my migration write-ups.)
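To make the "routing decision" concrete, here is a minimal sketch of a Collector config that sends the same traces to Datadog and to a self-hosted Tempo during a migration. It assumes the contrib distribution (which carries the Datadog exporter), a Tempo instance reachable at the hypothetical address tempo:4317, and an API key supplied through a DD_API_KEY environment variable. Treat it as a starting point, not a production config.

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317   # applications send OTLP here

processors:
  batch: {}

exporters:
  otlp/tempo:
    endpoint: tempo:4317         # hypothetical self-hosted Tempo address
    tls:
      insecure: true             # assumption: private network, no TLS
  datadog:
    api:
      key: ${env:DD_API_KEY}     # assumption: key provided via environment
      site: datadoghq.com

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp/tempo, datadog]   # same spans, two backends
```

Cutting over is then a matter of deleting one name from the exporters list; the application code does not change.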
The self-hosted stack that actually holds up
The credible open-source answer is Grafana’s LGTM stack, fed by the OTel Collector or Grafana Alloy:
- Tempo for traces. It needs only object storage to run, with no hot database to babysit, and its optional metrics-generator derives RED (rate, error, duration) and service-graph metrics from incoming spans, then remote-writes them to your metrics backend.
- Loki for logs. Horizontally scalable, indexes labels rather than full text, and like Tempo leans on cheap object storage.
- Mimir (or plain Prometheus) for metrics. Mimir scales to a billion active series with durable object-storage backing; most small teams never outgrow a single Prometheus.
- Grafana to query and correlate all three, with trace-to-log and metric-to-trace links wired in.
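As a rough illustration of the Tempo bullet above, this is the kind of fragment that turns on span-derived RED and service-graph metrics. It is a sketch under assumptions: the WAL path and the Prometheus URL are placeholders, the target Prometheus must have its remote-write receiver enabled, and the layout of the overrides block has changed between Tempo releases, so check it against the version you run.

```yaml
metrics_generator:
  storage:
    path: /var/tempo/generator/wal        # placeholder local WAL directory
    remote_write:
      - url: http://prometheus:9090/api/v1/write   # hypothetical metrics backend
        send_exemplars: true              # lets Grafana jump from a metric to a trace

overrides:
  defaults:
    metrics_generator:
      # span-metrics yields the RED series; service-graphs yields edge metrics
      processors: [service-graphs, span-metrics]
```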
The cost shape inverts. Instead of paying per host and per GB ingested, you pay for a few VMs and an object-storage bucket. Tempo and Loki keeping their bulk in object storage is the line item that matters — S3-class storage runs cents per GB per month, and it does not care whether your trace volume tripled overnight. Your spend tracks infrastructure you control, not a meter someone else reads.
For a team that wants the managed-but-cheap middle ground, Grafana Cloud’s free tier is genuinely usable: 10k metric series, 50 GB logs, 50 GB traces, three users, no credit card. It is a fair place to prototype before you decide whether to self-host. The Sydney crowd I trade notes with mostly start there, then graduate to self-hosted once volume makes the free tier tight — there’s a running comparison on the local-first tooling page.
Sampling is where the bill is won or lost
Self-hosting does not make data free. CPU, memory, and storage are still real. The discipline that keeps a self-hosted stack cheap is the same discipline that would have kept your Datadog bill down: send less, but send the right less.
For traces, that means tail-based sampling in the Collector. Head sampling decides at the start of a trace, blind to how it ends. Tail sampling waits for the whole trace, then keeps it based on what happened — every error, every slow request, and a small percentage of the boring successful ones. You keep the traces you would actually open and drop the ones you never would.
    receivers:
      otlp:                        # assumed OTLP/gRPC intake; the pipeline below references it
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317

    processors:
      tail_sampling:               # ships in the Collector contrib distribution
        decision_wait: 10s         # how long to buffer a trace before deciding
        num_traces: 50000          # traces held in memory while waiting
        policies:                  # a trace is kept if any policy matches
          - name: keep-errors
            type: status_code
            status_code:
              status_codes: [ERROR]
          - name: keep-slow
            type: latency
            latency:
              threshold_ms: 500
          - name: sample-the-rest
            type: probabilistic
            probabilistic:
              sampling_percentage: 5
      batch: {}                    # default batching; the pipeline below references it

    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true           # plaintext to Tempo; acceptable only on a private network

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [tail_sampling, batch]
          exporters: [otlp/tempo]

One catch worth knowing before you scale: tail sampling needs every span of a trace to reach the same Collector instance. Past a single Collector, you run two layers: a front tier using the load-balancing exporter to route by trace ID, and a back tier that does the actual sampling. Set decision_wait longer than your slowest expected trace, or the sampler decides on a partial trace and the late error or slow span never gets a vote. Size num_traces to your in-flight volume, or the buffer evicts traces before it judges them.
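The front tier of that two-layer layout can be sketched roughly like this. It assumes the contrib distribution, a Kubernetes-style headless Service in front of the sampling Collectors (the hostname here is made up), and plaintext traffic inside the cluster. The back tier is simply the tail-sampling config shown earlier.

```yaml
# Front tier: accepts OTLP from applications and pins each trace to one sampler.
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  loadbalancing:
    routing_key: traceID           # all spans of a trace go to the same backend
    protocol:
      otlp:
        tls:
          insecure: true           # assumption: no TLS between tiers
    resolver:
      dns:
        # hypothetical headless Service that resolves to every sampler pod
        hostname: otel-sampler.observability.svc.cluster.local

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [loadbalancing]
```

Keep the front tier free of sampling or heavy processing. Its only job is consistent routing, so it stays cheap to scale horizontally.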
Cardinality is the metrics version of the same fight
Metrics have no sampling knob; their cost is cardinality — the number of unique label combinations. Each active series costs roughly 1 to 3 KiB in the Prometheus head block. A healthy mid-sized instance sits between 100k and 2 million active series. North of 5 million you have a problem; north of 10 million it is an incident waiting to happen.
The series explosion almost always comes from one mistake: a high-churn value used as a label. A user_id, request_id, session, trace_id, or URL with a query string mints a new series for every unique value, and churned series keep costing memory and disk long after the value that created them is gone. The fixes, cheapest first:
- Drop or rewrite offending labels with metric_relabel_configs at scrape time.
- Pre-aggregate with recording rules so dashboards query a small derived series.
- Cap each scrape with sample_limit, bound the number of labels per series with label_limit, and bound the number of targets per job with target_limit.
- Publish a label allowlist (env, region, service, status_code, method) and a denylist of anything unbounded.
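The first three fixes might look something like the fragments below. This is a sketch, not a drop-in config: the job name, target address, limit values, rule file name, and the http_requests_total metric are all illustrative assumptions.

```yaml
# prometheus.yml (fragment)
scrape_configs:
  - job_name: api                  # hypothetical job
    sample_limit: 50000            # fail the scrape if it returns more samples than this
    label_limit: 30                # fail the scrape if any series carries more labels
    target_limit: 200              # refuse to scrape if discovery yields more targets
    static_configs:
      - targets: ["api:9090"]      # hypothetical target
    metric_relabel_configs:
      # Remove unbounded labels before ingestion; the regex matches label names.
      - action: labeldrop
        regex: "user_id|request_id|session|trace_id"
---
# rules/http.rules.yml (hypothetical rules file)
groups:
  - name: http_aggregates
    rules:
      # Dashboards query this small derived series instead of the raw one.
      - record: service_status:http_requests:rate5m
        expr: sum by (service, status_code) (rate(http_requests_total[5m]))
```

One caveat: dropping a label is only safe when the remaining labels still identify each series uniquely. If two series collapse into the same label set, fix the instrumentation rather than the scrape.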
If per-bucket histogram series dominate your count, native histograms collapse a whole distribution into one series with dynamic buckets — a structural fix rather than a trim.
Why it matters
The pitch is not “self-hosting is always cheaper.” Below a handful of services, a managed free tier wins on time saved, and a heavily staffed platform team might rationally pay Datadog to never think about Mimir compaction. The pitch is that OpenTelemetry turns that into a reversible decision. Instrument once against the open standard, and your backend becomes a price negotiation you can actually walk away from.
For a small team in particular, that changes the posture from hoping traffic stays flat to wanting it to grow — because the observability bill no longer grows with it. The teams that get burned are the ones who discover, mid-renewal, that their telemetry only speaks one vendor’s dialect. Don’t be that team.