Metric-gate: filter and pre-aggregate metrics before ingestion into Prometheus
9 Jul 2025

This is about a Prometheus scrape proxy that can filter and aggregate metrics at the source, reducing cardinality before ingestion:
graph LR
prometheus --> metric-gate
subgraph Pod
metric-gate -. localhost .-> target
end
Why?
There are cases when you do not need all the metrics exposed by a target. For example, k8s ingress-nginx exposes 6 Histograms (of 12 buckets each) for each Ingress object. Now suppose you have a k8s cluster with 1k Ingresses, each having 10 Paths defined:
Cardinality: 1000 ingresses * 6 histograms * 12 buckets * 10 paths = 720k metrics
The resulting size of the HTTP response on the /metrics endpoint is 276MB, which is pulled by Prometheus every scrape_interval (default 15s), leading to constant ~40Mbit/s traffic (compressed) on each replica Pod of ingress-nginx.
Sure, metrics could be filtered on the Prometheus side in metric_relabel_configs, but that will not reduce the amount of data being pulled from the target. And aggregation could then be done via recording rules, but you cannot drop already ingested source data afterward.
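For example, a recording rule like the sketch below (group and record names are placeholders) would pre-compute the aggregate, yet the original per-path series would still be scraped and stored:
groups:
  - name: ingress-nginx-aggregation # placeholder group name
    rules:
      - record: metric:sum_without_path # placeholder record name
        expr: sum(metric) without(path)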
In our case, we do not need all the metrics, but ingress-nginx has no configuration option for that. So the idea is to add a proxy in between, which could filter some metrics or labels.
Sidecar
We can add a sidecar container that connects to ingress-nginx via a fast localhost connection and then returns a smaller response to Prometheus (as shown in the picture above). Okay, but what would the configuration for that proxy look like?
As we need to be able to drop both metrics as a whole and individual labels, we need something flexible. I believe metric_relabel_configs is a good fit.
This way, we can reduce cardinality 10x by removing the path label from the ingress-nginx metrics above:
# before
metric{ingress="test", path="/api", ...} 5
metric{ingress="test", path="/ui", ...} 2
# after
metric{ingress="test", ...} 7
That could be done by dropping the label from all metrics:
metric_relabel_configs:
  - action: labeldrop
    regex: path
So, it would work as sum(metric) without(path) in PromQL notation. That should do for Counters and Histograms but makes no sense for Gauges. In the case of ingress-nginx, that should be fine.
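Dropping whole metrics should work the same way via a standard drop action; the regex below is just a placeholder for whichever series are not needed:
metric_relabel_configs:
  - source_labels: [__name__]
    regex: some_unneeded_histogram_.* # placeholder metric name
    action: drop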
Also, we have multiple ingress-nginx replicas, which all serve the same 1k sites. But queries on the Prometheus side are mostly aggregations for a specific site, not for a specific replica. How can we reduce cardinality for replicas as well?
DNS mode
Let’s take the idea from Thanos, which has a DNS-resolving mode for endpoints. Prepending the URL with dns+ leads to resolving the domain name into a list of IPs and then fanning out requests to all of them:
graph LR
prometheus --> metric-gate
metric-gate --> k8s-dns
metric-gate -.-> deployment-pod-a & deployment-pod-b
subgraph Deployment
deployment-pod-a
deployment-pod-b
end
This works well because in k8s it is easy to have a headless Service for that use case:
apiVersion: v1
kind: Service
metadata:
  name: ingress-nginx-controller-metrics
spec:
  clusterIP: None # headless Service
  selector:
    app.kubernetes.io/name: ingress-nginx
  publishNotReadyAddresses: true # try to collect metrics from non-Ready Pods too
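A central metric-gate Deployment could then point at this Service via the dns+ scheme; here is a rough sketch of the container spec, where the image, flags, port 8080, and metrics port 10254 come from the CLI help at the end of this post, and the mount path is an assumption:
containers:
  - name: metric-gate
    image: sepa/metric-gate
    args:
      # dns+ resolves the headless Service name into Pod IPs and fans out requests
      - --upstream=dns+http://ingress-nginx-controller-metrics:10254/metrics
      - --relabel-file=/etc/metric-gate/relabel.yaml # assumed mount path
    ports:
      - containerPort: 8080 # default port for serving the aggregated metrics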
That should work but has some issues:
1. All the metrics are summed from all replicas, so information like process_start_time_seconds, which only makes sense for a single replica, becomes meaningless.
2. There is “automatic availability monitoring” based on the up metric in Prometheus, which can detect when a specific target is down. In this case, it would only provide the status of metric-gate itself, not each ingress-nginx-controller replica.
3. “Counter reset” detection would break in the case of aggregation. Consider this example:
| | t1 | t2 | t3 |
|---|---|---|---|
| metric{instance="a"} | 10 | 10 | 10 |
| metric{instance="b"} | 20 | 0 | 0 |
| sum(rate(metric)) | 0 | 0 | 0 |
| metric aggregated | 30 | 10 | 10 |
| rate(metric) | 0 | 10/15s | 0 |
There are two pods (a and b) serving a counter metric. At point t2, we restart pod b. This works fine in Prometheus (see “Rate then sum”) because rate() is calculated first and detects the counter drop to 0, resulting in the correct sum() output.
Now, if we aggregate those two metrics into one (dropping the instance label), at point t2 the value is 10. For rate(), that means a “counter reset” occurred (the counter value is less than before), so it reads as if the counter dropped to 0 and then increased to 10 within one scrape interval (15s), producing rate=10/15s=0.67/s, which is incorrect.
4. What should we do when we can’t scrape one of the targets discovered via DNS? Returning partial data for the remaining targets would lead to spikes in graphs due to false counter resets. Maybe it is better to fail the whole scrape, so Prometheus graphs use neighboring points. But how many scrapes can we fail in a row?
There is no good solution for #3.
For #4, we can add a configurable timeout for subrequests. Setting it higher than scrape_interval would cause a full scrape failure, while setting it lower would return partial data.
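With the -t / --scrape-timeout flag from the CLI help at the end of this post, that trade-off could look like this (the value is only illustrative):
args:
  # lower than scrape_interval: slow replicas are skipped and partial data is returned;
  # set it higher to fail the whole scrape instead
  - --scrape-timeout=10s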
And the first two could be addressed with the following approach:
Subset mode
Taking our Ingress example again, the docs assign metrics to the following groups:
- Request metrics: the bulk of the series on each replica (~720k). We want to aggregate these across all replicas.
- Nginx process metrics and Controller metrics: 68 metrics in total. These only make sense for each specific replica.
To aggregate the first and directly scrape the second, we can stack sidecar and dns modes together:
graph LR
prometheus --> metric-gate
metric-gate --> k8s-dns
prometheus --> ma & mb
metric-gate -.-> ma & mb
subgraph Deployment
pod-a
pod-b
end
subgraph pod-b
mb["metric-gate-b"] -.localhost.-> tb["target-b"]
end
subgraph pod-a
ma["metric-gate-a"] -.localhost.-> ta["target-a"]
end
For that to work, we need to split metrics from the target into multiple endpoints. Let’s say the usual /metrics for process metrics and /metrics/requests for request metrics. Prometheus would scrape the /metrics endpoint with a small number of series, giving us per-replica data (and a working up metric). Meanwhile, metric-gate scrapes /metrics/requests, aggregates the large request data across all replicas, and only then returns it to Prometheus.
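On the Prometheus side, this could translate into two scrape jobs, roughly as sketched below (job names are placeholders and service-discovery details are omitted):
scrape_configs:
  # per-replica process and controller metrics, scraped directly from each
  # sidecar's /metrics: small responses and a meaningful up metric per replica
  - job_name: ingress-nginx-replicas # placeholder
    metrics_path: /metrics
    kubernetes_sd_configs:
      - role: pod
    # relabel_configs to keep only the ingress-nginx Pods are omitted

  # request metrics, pre-aggregated across all replicas by the central
  # metric-gate that fans out to /metrics/requests on each replica
  - job_name: ingress-nginx-requests # placeholder
    static_configs:
      - targets: ["metric-gate:8080"] # assumed Service name; 8080 is the default port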
Those were my thoughts on how the problem could be approached. But maybe there is a solution already?
Off-the-shelf
- vmagent can perform aggregation to new metric names and then send remote-write to Prometheus.
  - How to relabel metrics back to the original form? Possibly use vmserver instead of vmagent, to scrape the /federate endpoint instead of remote-write, allowing relabeling in the scrape config on the Prometheus side.
- otelcol could work, but needs relabeling on the Prometheus side to have metrics with original names.
- Grafana Alloy is similar to otelcol.
- exporter_aggregator scrapes metrics from a list of Prometheus exporter endpoints and aggregates the values of any metrics with the same name and labels. Similar to dns mode, but with a static list of upstreams.
- Telegraf and Vector cannot aggregate metrics like sum() without(label); they only downsample values.
Implementation
Okay, so “if you want a thing done well, do it yourself”, or as others might call it, “NIH syndrome.” Let’s see what it should look like. It is basically an HTTP proxy that does three things:
- Parse metrics from the target
- Relabel metrics
- Render the memory state to text metrics format
Prometheus packages are available for all three steps and could be reused. But all of the above is on the hot path: just fetching 1.3M lines of metrics from ingress-nginx in a sidecar container takes ~4s:
$ k -n ingress-nginx exec -it ingress-nginx-controller-testing-7889b68f87-dz57l -- sh
/chroot/etc/nginx $ time wget localhost:10254/metrics -O - | wc -l
Connecting to localhost:10254 ([::1]:10254)
writing to stdout
- 100% |*******| 462M 0:00:00 ETA
written to stdout
real 0m 4.64s
user 0m 0.40s
sys 0m 1.10s
1293720
So we are left with ~10s until the scrape timeout, and nothing has been done yet. Let’s check where we can improve the speed. The Prometheus relabel package implementation looks clean and simple. Rendering could be done via the Prometheus client library, but it is too heavyweight: it requires its own data structure and exports metrics with # HELP and # TYPE comments, which I would like to avoid to save bandwidth. It is easier to write a renderer from scratch. Parsing also looks unnecessarily complicated, so I wrote a Go benchmark:
$ make bench
goos: darwin
goarch: arm64
pkg: github.com/sepich/metric-gate
BenchmarkParseLine0-16 232239 5100 ns/op 9184 B/op 110 allocs/op
BenchmarkParseLine1-16 983380 1246 ns/op 3120 B/op 11 allocs/op
BenchmarkParseLine-16 2222438 539.5 ns/op 1456 B/op 8 allocs/op
BenchmarkParseLine2-16 2635765 458.3 ns/op 1408 B/op 6 allocs/op
BenchmarkParseLine3-16 1817930 659.9 ns/op 1832 B/op 26 allocs/op
BenchmarkParseLine4-16 2623164 453.5 ns/op 1408 B/op 6 allocs/op
where:
- ParseLine0 uses Prometheus expfmt.TextToMetricFamilies
- ParseLine1 uses Prometheus textparse.PromParser
- ParseLine is the first attempt at a custom implementation
- ParseLine2 is an AI-improved version for speed (to the point of being unreadable)
- ParseLine3 is an attempt to use raw []byte instead of string to skip UTF-8 decoding, but it turns out Label conversion to string is more expensive
- ParseLine4 is an attempt to use a state machine with only pointers to a slice (the current version)
So, the custom implementation is >2x faster than using the Prometheus library. The actual algorithm matters less than the number of memory allocations per line.
If you’ve read this far, you are probably interested in the result. You can find it here:
https://github.com/sepich/metric-gate
or on Docker Hub:
$ docker run sepa/metric-gate -h
Usage of /metric-gate:
-f, --file string Analyze file for metrics and label cardinality and exit
--log-level string Log level (info, debug) (default "info")
-p, --port int Port to serve aggregated metrics on (default 8080)
--relabel string Contents of yaml file with metric_relabel_configs
--relabel-file string Path to yaml file with metric_relabel_configs (mutually exclusive)
-t, --scrape-timeout duration Timeout for upstream requests (default 15s)
-H, --upstream string Source URL to get metrics from. The scheme may be prefixed with 'dns+' to resolve and aggregate multiple targets (default "http://localhost:10254/metrics")
-v, --version Show version and exit