Counting with Prometheus [I] - Brian Brazil, Robust Perception

By: Cloud Native Computing Foundation - CNCF

Uploaded on 04/11/2017

Counters are one of the two core metric types in Prometheus, allowing for tracking of request rates, error ratios and other key measurements. Learn why they are designed the way they are, how client libraries implement them, and how rate() works.

About Brian Brazil
Brian Brazil is a core developer of Prometheus, and the founder of Robust Perception. He has developed and maintains components and features across the Prometheus ecosystem including the Python and Java clients, and many exporters. He wrote many of the best practices and guidelines for those looking to use Prometheus, and publishes regularly on the Reliable Insights blog.

Comments (7):

By anonymous    2017-09-20

You should use the Count and take a rate() of it on the Prometheus side. The example config file that comes with the JMX exporter already selects the right metric for you.

MeanRate is the average rate per second since the binary started, so it's not very useful. OneMinuteRate, FiveMinuteRate and FifteenMinuteRate are exponential moving averages, so they would also decay over time.
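
To illustrate why a since-start mean hides recent behaviour while a rate taken over the raw count does not, here is a minimal Python sketch. The sample data and the helper names (mean_rate_since_start, windowed_rate) are made up for illustration and are not part of the JMX exporter or of Prometheus.

```python
# Hypothetical counter samples: (seconds since process start, cumulative count).
# The process handles 1 event/s for an hour, then 50 events/s for one minute.
samples = [(t, t) for t in range(0, 3601, 60)]
samples.append((3660, 3600 + 50 * 60))


def mean_rate_since_start(samples):
    """Average rate since the process started, in the spirit of a MeanRate attribute."""
    t, count = samples[-1]
    return count / t


def windowed_rate(samples, window):
    """Rate over the trailing window, computed from the raw cumulative count."""
    end_t, end_count = samples[-1]
    # Oldest sample that still falls inside the window.
    start_t, start_count = next(s for s in samples if s[0] >= end_t - window)
    return (end_count - start_count) / (end_t - start_t)


print(mean_rate_since_start(samples))  # dominated by the quiet first hour
print(windowed_rate(samples, 60))      # reflects the last minute only
```

In this made-up data the since-start mean stays below 2 events/s even though the last minute ran at 50 events/s, which is the kind of information a rate() over the Count preserves.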

https://www.youtube.com/watch?v=67Ulrq6DxwA has more information on various ways counters are handled by different instrumentation/monitoring systems.

By anonymous    2017-09-20

Prometheus is better suited to high-volume than to low-volume events. At low volumes, artifacts of how we keep results accurate on average become visible.

So for example rate(job_failed[15s]) with an increase of 1 over the 15 seconds is 1/15 ≈ 0.067/s. Rounding could make that show as 0.1.
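
The arithmetic above can be checked with a few lines of Python. This is only a sketch: the one-decimal rounding stands in for whatever precision a dashboard happens to display, which is an assumption rather than something the original answer specifies.

```python
increase = 1          # one failure observed in the window
window_seconds = 15   # the [15s] range in rate(job_failed[15s])

per_second = increase / window_seconds   # 1/15, roughly 0.067 per second
rounded = round(per_second, 1)           # a display limited to one decimal place

print(per_second, rounded)
```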

https://www.youtube.com/watch?v=67Ulrq6DxwA goes into more detail as to how this all works.

The short version is what you're doing now is the way to do it.

By anonymous    2017-10-15

Data is not exact; the above samples, for example, aren't exactly aligned to the second. This means that we need to extrapolate a bit when the data doesn't exactly cover the 10s range, which can cause artifacts like this. On average, however, the result will be correct.

Counting with Prometheus goes into this in more detail.

By anonymous    2018-01-07

The algorithm that Prometheus uses for rate() is a little intricate due to handling of issues like alignment and counter resets as explained in Counting with Prometheus.

The short version is to subtract the first value from the last value and divide by the time between them. It's probably easiest to use Prometheus rather than doing this yourself.
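
As a rough illustration of that short version, the following Python sketch computes a per-second rate from a list of counter samples and treats any drop in value as a counter reset. The function name simple_rate is hypothetical, and the sketch deliberately leaves out the alignment and extrapolation handling that makes the real rate() more intricate.

```python
def simple_rate(samples):
    """Per-second rate from (timestamp_seconds, value) counter samples.

    A drop in value is treated as a counter reset: the counter is assumed to
    have restarted from zero, so the value after the drop is counted in full.
    """
    if len(samples) < 2:
        return None  # a rate needs at least two samples
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += (cur - prev) if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return total / elapsed


# Made-up data: the counter resets between t=10 and t=20.
print(simple_rate([(0, 10), (10, 20), (20, 5), (30, 15)]))
```

Without the reset handling, last minus first would give 15 - 10 = 5 here, although the counter actually advanced by 10 + 5 + 10 = 25 over the 30 seconds.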

By anonymous    2018-01-14

I've found that for some graphs I get doubled values from Prometheus where there should be just ones:

[Image: graph with twos above bars]

Query I use:

increase(signups_count[4m])

Scrape interval is set to the recommended maximum of 2 minutes.

If I query the actual data stored:

curl -gs 'localhost:9090/api/v1/query?query=(signups_count[1h])'

"values":[
     [1515721365.194, "579"],
     [1515721485.194, "579"],
     [1515721605.194, "580"],
     [1515721725.194, "580"],
     [1515721845.194, "580"],
     [1515721965.194, "580"],
     [1515722085.194, "580"],
     [1515722205.194, "581"],
     [1515722325.194, "581"],
     [1515722445.194, "581"],
     [1515722565.194, "581"]
],

I see that there were just two increases. And indeed, if I query for these times I see the expected result:

curl -gs 'localhost:9090/api/v1/query_range?step=4m&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'

"values": [
     [1515721965.194, "0"],
     [1515722205.194, "1"],
     [1515722445.194, "0"]
],

But Grafana (and the Prometheus GUI) tends to set a different step in queries, which gives a result that is very unexpected for a person unfamiliar with the internal workings of Prometheus.

curl -gs 'localhost:9090/api/v1/query_range?step=15&query=increase(signups_count[4m])&start=1515721965.194&end=1515722565.194'

... skip ...
 [1515722190.194, "0"],
 [1515722205.194, "1"],
 [1515722220.194, "2"],
 [1515722235.194, "2"],
... skip ...

Knowing that increase() is just syntactic sugar for a specific use case of the rate() function, I guess this is how it is supposed to work given the circumstances.
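
The twos in the 15-second-step output can be reproduced by hand. The Python sketch below is a simplified approximation of the extrapolation step, not the Prometheus source: the function name extrapolated_increase, the 1.1 threshold factor, the half-interval fallback and the inclusive window boundaries are all assumptions made for illustration, and counter resets are ignored. Timestamps are the ones from the question, converted to integer milliseconds.

```python
def extrapolated_increase(samples, eval_ms, range_ms):
    """Approximate increase() over (timestamp_ms, value) samples (assumed logic)."""
    start_ms = eval_ms - range_ms
    # Assumption: samples on either boundary of the window are included.
    window = [(t, v) for t, v in samples if start_ms <= t <= eval_ms]
    if len(window) < 2:
        return None
    (first_t, first_v), (last_t, last_v) = window[0], window[-1]
    raw = last_v - first_v                 # counter resets ignored in this sketch
    sampled = last_t - first_t             # time actually covered by the samples
    avg_gap = sampled / (len(window) - 1)
    threshold = avg_gap * 1.1
    to_start = first_t - start_ms          # uncovered time at the window start
    to_end = eval_ms - last_t              # uncovered time at the window end
    if raw > 0 and first_v >= 0:
        # Do not extrapolate a counter back past the point where it would be zero.
        to_start = min(to_start, sampled * first_v / raw)
    covered = sampled
    covered += to_start if to_start < threshold else avg_gap / 2
    covered += to_end if to_end < threshold else avg_gap / 2
    return raw * covered / sampled


samples = [
    (1515721965194, 580),
    (1515722085194, 580),
    (1515722205194, 581),
    (1515722325194, 581),
]
four_minutes = 4 * 60 * 1000
results = [
    extrapolated_increase(samples, t, four_minutes)
    for t in (1515722190194, 1515722205194, 1515722220194, 1515722235194)
]
print(results)
```

Under these assumptions the four evaluations give 0, 1, 2 and 2, matching the output quoted above. At the later two timestamps the 4-minute window holds only two samples 120 seconds apart, so the observed increase of 1 is scaled up to cover the full 240 seconds.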

How can I avoid such situations? How do I make Prometheus/Grafana show me ones for ones, and twos for twos, most of the time? Other than by increasing the scrape interval (this will be my last resort).

I understand that Prometheus isn't an exact sort of tool, so it is fine with me if I get a correct number most of the time rather than all of the time.

What else am I missing here?
