An observability engine for the AI era

It's 3 am. Your pager goes off. You reach for Claude Code to understand what happened. The alert triggered on a metric that went off the charts, but why? Which traces and which logs explain it? Your system generates terabytes of logs and traces daily, so the infrastructure team decided to sample the data down to 10%. How do you find the needle in that haystack? Is it even there?

That's the problem we set out to solve. Berserk is the result, and we're now ready to take on early customers who want to be part of the journey.

What Berserk is: a modern, high-performance data store for monitoring and observability data, written from scratch in Rust. It is backed by object storage, queried with KQL, and designed to support AI-enabled workflows. You run it on your own infrastructure: bare metal, your cloud, a sovereign cloud, your laptop. Your S3, your bucket, no lock-in. As a first application, we've built an OpenTelemetry observability solution covering logs, traces, and metrics.

A word on the name. Traditionally, a berserkr is a "bear-shirt", from ber (bear) + serkr (shirt): the shirt that lets you harness the power of the bear. That's the shirt we want you reaching for when the pager goes off at 3 am. Berserk works the way those warriors fought, by brute force, and it invites you to go berserk: throw all the logs, traces, and metrics you want at it and tear straight through. The name also says where we're from: Scandinavia. The product is Berserk; bzrk, the short form, is the command you type.

Why now?

Three things changed at once. The answer to a real question now lives between your signals; the thing exploring your data is increasingly an agent, not a person at a dashboard; and the resilient choice is to own your data and run the engine that reads it yourself.

Data exploration must be able to see across many data sources. A real investigation rarely fits in one store. We are used to splitting logs, traces, and metrics into separate systems with separate query languages, so "correlate them" means switching tabs and joining by eye. The sequence diagram in every tracing tool is a view of a single trace, useful only after you've found the right trace ID.

The questions worth asking cut across signals and systems: why are some requests slow, what changed since yesterday, who is affected by the error, and which service is the common factor? Those are query problems, and they require exactly the join that many observability tools can't express.

The thing exploring your data is increasingly an agent. AI didn't create a new kind of observability. AI created a new thing to observe and a new thing doing the observing.

The systems we use now reason, call tools, and build other systems: an agent run is a trace, a prompt an event, a tool call a span. It is not a new category of data. What's new is the shape of that telemetry. It fans out and runs to thousands of events per developer per day, and carries high-cardinality dimensions on every event (session, prompt, tool, model, cost). That is exactly what a metrics store tells you to drop.

The agent is also a consumer of that same data. A model's context is finite, so it can't read your telemetry whole; it has to query it the same way you would. That makes the query language and the tools around it the only surface that matters: the agent needs good tools.

"We have an AI strategy" is not a reason to skip observability; it's a reason you need more of it. The agent is the new user, and it doesn't want a dashboard: it wants a query language, fast answers, and access to all the data. So observability now has to serve a user who writes queries, not one who clicks.

Resilience is owning your data and running the system yourself. The durable position is the simple one: your telemetry sits in storage you own, read by an engine you run and control on your own infrastructure. That's what survives the things you can't predict: your observability vendor getting acquired, a SaaS bundle quietly turning into something you can't tune or reason about, or the rules about where customer payloads may live shifting under you. That last one is a real and widening undercurrent. Schrems II, the EU Data Act, NIS2, and DORA are Europe's version of it, and teams elsewhere feel the same pull. But it's one instance of a bigger point, not the argument itself: ownership is simply the resilient default.

What Berserk is

Logs, traces, and metrics on one engine with native metric functions (rates, histograms, percentiles) and Grafana dashboards on top. All three are first-class, not an afterthought bolted onto a log tool.
KQL as the query language, structured enough that an agent can write it and a human can trust what it wrote.
Object storage you control. Segments sit on your S3, in your bucket, in your jurisdiction. The on-disk format is ours today; an open-source reader for the row chunks is on the roadmap.
Written in Rust, deployable on your infrastructure: bare metal, your cloud, an EU sovereign cloud, or your laptop.
An engine, not a SaaS. You send OpenTelemetry, we store it on S3, and you query it through Grafana, the built-in Berserk UI for ad-hoc exploration, an agent, or whatever you already use. No proprietary SDK, no lock-in at the instrumentation layer.

The big-picture pieces of the system look like this. The OpenTelemetry application is what ships today. The engine will host other domain adaptations when the time comes (maybe telco, network security, or intelligence), but the engine is what makes any of it possible.

Apps and agents emit telemetry through OpenTelemetry to the ingest pipeline. The query engine reads from S3 and serves Grafana, the dedicated Berserk UI for data exploration, and AI agents such as Claude or Codex, which connect via the bzrk CLI or its MCP server. A control plane sits alongside.

Tearing down the pillars

Here is one query, just to get a feel for what you can do with Berserk. It scans every span and every log for the phrase "internal server error," gathers the traces that contain it, and counts which entry-point service each one came in through:

union otel_logs, otel_traces
| trace-find { search "internal server error" }
| summarize count() by services[0]

Running it on the OpenTelemetry demo app, almost every matching trace entered through one service: the load generator that keeps the demo busy. No real-user errors hiding in there, nothing to chase. That's the loop closing: ask one question across logs and spans, get the answer, move on.

That's a cross-signal join. "Internal server error" usually lives in a log line; the entry-point service lives on a span; the only thing tying them together is the trace they share. In most platforms that means two stores, two query languages, and a hand-rolled join. Here it's one operator. trace-find is the spans-plus-logs join primitive: hand it a structural pattern and it returns the matching traces with their spans and logs gathered together. The pattern can be trivial, such as "any row containing this phrase" above, or precise: trace-find within 15m { resource.service.name == "api-gateway" } >> { body has "OutOfMemory" } finds traces up to 15 minutes long in which an api-gateway span is an ancestor of a log whose body mentions OutOfMemory.
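As a hedged illustration, here is that precise pattern written as a complete query. It reuses the union and summarize steps from the first example. This is a sketch assembled only from syntax shown in this post, not a query we are presenting as tested. It assumes the within and >> forms of trace-find compose with union and summarize exactly as the plain search form does, and that services[0] still names the entry-point service.

```kql
// Sketch only: combines syntax shown elsewhere in this post; not a verified query.
union otel_logs, otel_traces
// Keep traces (at most 15 minutes long) where an api-gateway span
// is an ancestor of a log whose body mentions OutOfMemory.
| trace-find within 15m { resource.service.name == "api-gateway" } >> { body has "OutOfMemory" }
// Count the matching traces by the service they entered through.
| summarize count() by services[0]
```

The only change from the first query is the pattern. The question stays the same shape: which entry points lead to the failure.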

Like trace-find, general joins in Berserk work over time intervals and produce incremental results. trace-find is one of several operators tailored to OpenTelemetry signals; they are part of the OTel adaptation that integrates with the engine.

What we believe

Built for humans and agents. KQL is structured enough that an agent can generate it and a human can trust what it generated; results stream interactively, so neither waits. The engine that stores your AI app's telemetry is queryable by an agent debugging it. More on this soon.

European-owned, and built to last. No venture capital pushing for a quick exit. We know there's more than one path, and we're choosing the longer road. Because Berserk is self-hosted, your telemetry never leaves your environment, which removes a whole class of cross-border and third-party-risk questions before anyone asks them. We've done the procurement legwork most startups defer: documented GDPR data flows, a NIS2 Article 21(2) mapping, ISO 27001 readiness, a pre-filled CAIQ, a DPA, and a published sub-processor list. The full trust framework is public at bzrk.dev/legal/trust. We're honest about the gap: no SOC 2 audit yet, and we don't pretend otherwise.

We built Humio. A decade of hindsight. We kept what was always true about logs: don't force structure onto data that has none, and make ingestion cheap enough that you never ration what you keep. We rebuilt the rest on a modern, object-storage-native architecture. How we did it is a separate post, and we can't wait to share it with you.

Honest about what we are. At this point we're building for stability and scale, not chrome polish. The basics work and work fast; the SaaS bells and whistles aren't here yet, and that's deliberate.

What works today

Live: ingest and query for logs, traces, and metrics; KQL across all three; native metric functions; Grafana on top. Deployment is Kubernetes today, with Docker Compose for a single-node demo; anything else is a more hand-held conversation for now. We've been dogfooding it in production for six months. Any concrete use case will likely hit something we haven't built yet. We'd rather tell you where the edges are than have you find them.

Who this is for

Platform teams at scale, drowning in observability cost. AI companies generating massive telemetry from day one. Regulated industries that can self-host: finance, public sector, healthcare. Places where telemetry must never leave the environment. European teams that hit the sovereignty wall. Operators who've been burned by an acquisition and won't be again. Grafana users who just need a better backend.

And honestly, who it isn't for: teams who want turnkey SaaS magic, or who can't move without a SOC 2 certificate on file today.

Talk to us

We're working with a small number of design partners. This is a conversation, not a sales call. We want to learn what "good" looks like when we're not controlling the surface: what you need from the API, the query language, the integrations.

If you've been here before, you know how to reach us. If you haven't: hello@bzrk.dev.