Prometheus does support push. It's just that it's considered such an antipattern that it's been moved into a separate component (the Pushgateway) that you need to run separately.
Pulling has a few technical benefits, though. For one, only the puller needs to know what's being monitored; the thing being monitored can therefore be exceedingly simple, dumb and passive. StatsD is similarly simple in that clients just fire-and-forget UDP packets at a local daemon, of course, which leads to the next point:
Another benefit is that it gives you finer-grained control over when metrics are gathered, and which ones. Since Prometheus best practices dictate that metrics should be computed at pull time, you can tune collection intervals per metric, and you can do it centrally. And since you only pull from targets you know about, there can't be a rogue agent somewhere spewing out data (i.e. what a sibling comment calls "authoritative sources").
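To make the "centrally controlled" part concrete, here's a rough sketch of what per-job tuning looks like in the Prometheus scrape configuration (the job names and targets are invented):

    scrape_configs:
      - job_name: 'api'              # hot path, scrape often
        scrape_interval: 15s
        static_configs:
          - targets: ['app-1:9100', 'app-2:9100']
      - job_name: 'nightly-batch'    # slow-moving numbers, scrape rarely
        scrape_interval: 5m
        static_configs:
          - targets: ['batch-1:9100']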
But to understand why pull is a better model, you have to understand Google's/Prometheus's "observer/reactor" mindset towards large-scale computing; it's just easier to scale up with this model. Consider an application that implements some kind of REST API. You want metrics for things like the total number of requests served, which you'll sample now and then. You add an endpoint /metrics running on port 9100. Then you tell Prometheus to scrape (pull from) http://example.com:9100/metrics. So far so good.
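As a minimal sketch of what that endpoint can look like with the official Go client library (the metric name and API path are made up for illustration):

    package main

    import (
        "net/http"

        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    // requestsTotal counts requests served by the (hypothetical) REST API.
    var requestsTotal = prometheus.NewCounter(prometheus.CounterOpts{
        Name: "myapp_http_requests_total",
        Help: "Total number of HTTP requests served.",
    })

    func main() {
        prometheus.MustRegister(requestsTotal)

        // The application's own handler just increments the counter as a side effect.
        http.HandleFunc("/api/things", func(w http.ResponseWriter, r *http.Request) {
            requestsTotal.Inc()
            w.Write([]byte("ok"))
        })

        // Expose the current counter values for Prometheus to scrape.
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":9100", nil)
    }

Note that the app never initiates anything towards the monitoring system; it just answers GET /metrics whenever it's asked.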
The beauty of the model arises when you involve a dynamic orchestrator like Kubernetes. Now we're running the app on Kubernetes, which means the app can run on many nodes, across many clusters, at the same time; it will have a lot of different IPs (one per instance) that are completely dynamic. Instead of adding a rule to scrape a specific URL, you tell Prometheus to ask Kubernetes for all services and then use that information to figure out the endpoints. This dynamic discovery means that as you take apps up and down, Prometheus automatically updates its list of endpoints and scrapes them. Equally important, Prometheus goes to the source of the data at any given time. The services are already scaled up; there's no corresponding metrics collection to scale up, other than the internal machinery of Prometheus' scraping system.
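Concretely, the "ask Kubernetes" part is a few lines of scrape configuration; roughly something like this sketch (the opt-in annotation shown is a common convention, not something built into Kubernetes):

    scrape_configs:
      - job_name: 'kubernetes-pods'
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          # only scrape pods that opt in via the annotation
          - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
            action: keep
            regex: "true"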
In other words, Prometheus observes the cluster and reacts to changes in it to reconfigure itself. This isn't exactly new, but it's core to Google's/Prometheus's way of thinking about applications and services, which has subsequently coloured the whole Kubernetes culture. Instead of configuring the chess pieces, you let the board inspect the chess pieces and configure itself. You want the individual, lower-level apps to be as mundane as possible, let the behavioural signals flow upstream, and let the higher-level pieces make decisions.
This dovetails nicely with the observational data model you need for monitoring, anyway: First you collect the data, then you check the data, then you report anomalies within the data. For example, if you're measuring some number that can go critically high, you don't make the application issue a warning if it goes above a threshold; rather, you collect the data from the application as a raw number, then perform calculations (e.g. max over the last N mins, sum over the last N mins, total count, etc.) that you compare against the threshold.
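In Prometheus terms those calculations live centrally in PromQL, not in the app; a sketch, with an invented metric name:

    # highest value of the raw gauge over the last 10 minutes, compared to the threshold here
    max_over_time(queue_depth[10m]) > 100

    # sum of the samples over the same window
    sum_over_time(queue_depth[10m])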
In practice, implementing a metrics endpoint is exceedingly simple, and you get used to "just writing another exporter". I've written a lot of exporters, and while this initially struck me as heavyweight and clunky, my mindset is now that an HTTP listener is actually more lightweight than an "imperative" pusher script.
But why is it simpler for Prometheus to have to query Kube to discover all the endpoints in order to collect the data, versus the endpoints just pushing out to Prometheus?
Obviously endpoints already need to know how to contact all sorts of services they depend on. So it's not like you're "saving" anything by not telling them "PrometheusIP = X".
Let's say you want to cleanly shut down some instances of your endpoint. They are holding connection stats & request counts that you don't want to lose. With push, the endpoint can close its connection handler, finish any outstanding requests, push final stats, and then exit. With pull, are you supposed to just sit and wait until a pull happens before the process can exit?
Because it shifts all the complexity to the monitoring system, making the "agents" really, really dumb. There would have to be more to push than just a single IP (see the sketch after this list):
* Many installations run multiple Prometheus servers for redundancy, so to start, it'd have to be multiple IPs.
* They would also need auth credentials.
* They'd need retry/failure logic with backoff to prevent dogpiling.
* Clients would have to re-resolve the name rather than cache the DNS lookup, so they always reach Prometheus at its current IP.
* If Prometheus moves, every pusher has to be updated.
* Since Prometheus wouldn't know which pushers exist, it couldn't tell a dead pusher from one that never pushed at all. Because Prometheus is pull-based, a failed scrape is an actual, detectable failure, not just an absence of data.
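To make that concrete, here's a rough, entirely hypothetical sketch of just the multi-target/retry/backoff part; the collector URLs are invented, Prometheus has no such push endpoint for regular clients, and this still doesn't cover auth or DNS re-resolution:

    package main

    import (
        "bytes"
        "errors"
        "net/http"
        "time"
    )

    // pushMetrics tries each collector in turn, retrying with exponential
    // backoff so a recovering collector isn't dogpiled.
    func pushMetrics(payload []byte, collectors []string) error {
        client := &http.Client{Timeout: 5 * time.Second}
        for attempt := 0; attempt < 5; attempt++ {
            for _, url := range collectors {
                resp, err := client.Post(url, "text/plain", bytes.NewReader(payload))
                if err == nil && resp.StatusCode < 300 {
                    resp.Body.Close()
                    return nil
                }
                if resp != nil {
                    resp.Body.Close()
                }
            }
            time.Sleep(time.Duration(1<<attempt) * time.Second)
        }
        return errors.New("all collectors unreachable")
    }

    func main() {
        // Invented collector URLs, for illustration only.
        _ = pushMetrics([]byte("myapp_http_requests_total 42\n"),
            []string{"http://metrics-1.example.com/push", "http://metrics-2.example.com/push"})
    }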
There's a lot to be said for Prometheus' principle of baking exporters into individual, completely self-encapsulated programs — as opposed to things like collectd, diamond, Munin, Nagios etc. that collect a lot of stuff into a single, possibly plugin-based, system.
Don't forget, a lot of exporters come with third-party software. You want those programs to have as little config as possible. If I release an open-source app (let's say, a search engine), I can include a /metrics handler, and users who deploy my app can just point their Prometheus at it. It's enticingly simple.
As for graceful shutdown: the default pull interval is 15 seconds, and you can shorten it if you want to lose less at shutdown. Prometheus isn't designed for extremely fine-grained metrics; losing a few requests' worth of samples due to a shutdown shouldn't matter in the big picture. But for metrics that are sensitive, it's easy enough to bake them into some stateful store anyway (Redis or etcd, for example), or to compute them in real time from stateful data (e.g. SQL). For example, if you have some kind of e-commerce order system, it's better if the exporter produces the numbers by issuing a query against the transaction tables, rather than maintaining RAM counters of dollars and cents.
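A sketch of that last point with the Go client; the table, query and metric name are invented, and error handling is elided for brevity:

    package main

    import (
        "database/sql"
        "net/http"

        _ "github.com/lib/pq" // any SQL driver; Postgres chosen arbitrarily
        "github.com/prometheus/client_golang/prometheus"
        "github.com/prometheus/client_golang/prometheus/promhttp"
    )

    func main() {
        db, _ := sql.Open("postgres", "postgres://localhost/shop?sslmode=disable")

        // GaugeFunc runs its callback at scrape time, so the number always
        // comes from the transaction table, not from a counter held in RAM.
        orderCents := prometheus.NewGaugeFunc(prometheus.GaugeOpts{
            Name: "shop_order_total_cents",
            Help: "Sum of all order amounts, queried from the orders table at scrape time.",
        }, func() float64 {
            var cents float64
            db.QueryRow("SELECT COALESCE(SUM(amount_cents), 0) FROM orders").Scan(&cents)
            return cents
        })

        prometheus.MustRegister(orderCents)
        http.Handle("/metrics", promhttp.Handler())
        http.ListenAndServe(":9100", nil)
    }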
You would expose a counter with the total request count. Summing those up across all nodes known to Prometheus gives you the total number of requests currently visible to monitoring. With rate() you can calculate requests per second.
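For example, assuming the counter is exported as http_requests_total (an invented name):

    # total requests across all instances currently visible to monitoring
    sum(http_requests_total)

    # requests per second over the last five minutes
    sum(rate(http_requests_total[5m]))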
But yes, it is possible to miss some requests if a node goes down without Prometheus collecting the latest stats.
But as the parent said, if you need such totals it might be better to store them persistently. Also, I can't think of a scenario where the total number of requests would trigger an alert.
Thank you for your patient and articulate responses in this thread, 'lobster_johnson'. You make an excellent case. For me, this nails it:
>"if you're measuring some number that can go critically high, you don't make the application issue a warning if it goes above a threshold; rather, you collect the data from the application as a raw number, then perform calculations (e.g. max over the last N mins, sum over the last N mins, total count, etc.) that you compare against the threshold."