All right, why don't we go ahead and get started? This is the Prometheus maintainer track. I'm David Ashpole, and I work for Google.
I wear a lot of hats. As you may notice, I have one of the OpenTelemetry maintainer shirts on, but I'm also a Prometheus team member, which is why I'm here giving this talk today. Unfortunately, my co-presenter, who did most of the work on these slides, wasn't able to make it because he's in Europe, so you get to listen to me for half an hour today instead.
All right. If you've been to a
Prometheus talk before, you may
recognize this. Has anyone heard of
Prometheus before? Show of hands.
All right, keep them up if you've used
Prometheus before.
Keep them up if you use Prometheus in
production.
And last but not least, raise your hand if you know what the plural of Prometheus is. It's Promethei. Go listen to that talk, it's really funny, actually.
All right, for those in the audience who haven't heard of Prometheus before: Prometheus was originally developed at SoundCloud, inspired by Borgmon from Google. It's a monitoring and alerting system, and it was the second project to join the CNCF after Kubernetes.
Roughly speaking, Prometheus is broken down into an ecosystem of exporters. Your custom application may have a Prometheus exporter endpoint on it, but there are also common ones like node_exporter or kube-state-metrics that expose metrics in Prometheus format. Then there's the Prometheus server, which is in charge of discovering the endpoints that need to be scraped, doing the actual scraping of that metric data, storing it in a time series database, and making it available for querying and alerting using PromQL, which is Prometheus's query language.
Alerting is very important so that you know whether production is on fire or not. Prometheus is depended on all around the world for making sure that production systems everywhere are available.
Work on Prometheus originally started in 2012, and it has gone through a lot of iterations since then. One of the biggest milestones was the project's graduation, which means the CNCF deems it to have reached a high level of maturity, with strong governance, a large contributor base, and lots of people using it in production.
But we've been busy. If you've been paying attention for the last year or so, Prometheus 3.0 was a big deal and came with a lot of new features. And now, in November 2025, we've most recently shipped the 3.7 release. So that's where we are today.
So what am I going to cover today? We've done a lot of work on OpenTelemetry compatibility. That includes deltas, it includes keeping the original OpenTelemetry names with dots, and even improvements to resource handling and scope handling. I have a couple of updates on native histograms: they're finally stable, hooray, and there's a cool new thing called native histograms with custom buckets. What are those? And finally, there have been some improvements to PromQL. So, without further ado, let's talk about OpenTelemetry.
The Prometheus server supports OTLP at the standard OTLP path, so since 3.0 Prometheus has actually supported pushing metrics to it, which to me is kind of crazy, and you can do that in OTLP format. But we've gotten a lot of feedback, we've done some surveys, and there are a lot of improvements that we've been trying to make to the OpenTelemetry experience of using Prometheus.
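To make that concrete, here's a minimal sketch of pushing OTLP metrics into Prometheus from an OpenTelemetry Collector. The host name is made up, and the flag and path reflect my reading of the current docs, so verify them for your version:

```yaml
# Start Prometheus with the OTLP receiver enabled, e.g.:
#   prometheus --web.enable-otlp-receiver
# Then point a Collector's OTLP/HTTP exporter at the /api/v1/otlp path.
receivers:
  otlp:
    protocols:
      http:

exporters:
  otlphttp/prometheus:
    # Hypothetical host; the exporter appends /v1/metrics automatically.
    endpoint: http://prometheus.example.com:9090/api/v1/otlp

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [otlphttp/prometheus]
```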
First, you may think that Prometheus and OpenTelemetry are enemies or competitors, but that's not actually the case. A while ago Prometheus published a big blog post committing to supporting OpenTelemetry. OpenTelemetry has made many different design decisions, I'll say, and there's definitely been friction between the communities, but we and the CNCF are working hard to make sure the projects work well together and can be used in a variety of mixed ways. It's still a work in progress, though, and sometimes you need to turn on experimental features to have things work as intended right now. So it's not quite there yet, but we've been making a lot of improvements, and there's a nice guide on the Prometheus website telling you exactly what to turn on, what not to, and what some of the trade-offs are that you have to make.
Okay. One of the most requested features since 3.0 is being able to keep your original OpenTelemetry metric names. OpenTelemetry is built around the idea of semantic conventions: a metric should have one name and one name only, and should have exactly this set of labels, that kind of thing. A lot of people had asked to be able to keep that original structure even when sending it to something like Prometheus, where most people expect underscores sprinkled everywhere, unit suffixes, and so on. So we've introduced a new translation strategy option, and one of the options added recently is no translation. There are some gotchas, though, which I'll get into. The translation that happened before this includes changing dots to underscores and adding type and unit suffixes.
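As a rough sketch, the knob lives under the otlp block in prometheus.yml; the option and value names are as I understand the current docs, so double-check before copying:

```yaml
otlp:
  # Default behavior translates dots to underscores and appends type/unit
  # suffixes such as _total and _seconds:
  #   translation_strategy: UnderscoreEscapingWithSuffixes
  # To keep the original OpenTelemetry names, dots and all:
  translation_strategy: NoTranslation
```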
So let's talk about type and unit. Prometheus has always strongly recommended including a type hint in your metric name, like _total, and including the unit in the name as well. So Prometheus metrics are typically named something like http_request_duration_seconds, plus _total, I guess, if it were a counter. But OpenTelemetry doesn't do that, right? OpenTelemetry uses http.request.duration. And if you're looking at that in a YAML file, good luck, hopefully you can remember what unit it was. To partially address that, one feature we've worked on is called type and unit labels, and we do recommend turning it on if you're ingesting OpenTelemetry metrics via the OTLP endpoint.
What this does is add type and unit as labels on the metric, so that if you end up ingesting, say, one metric that's in seconds and a different one in milliseconds, you can actually disambiguate them later at query time. The hope is also that this will be used to provide richer user interface experiences by making type and unit more readily available. But we're still looking for feedback, so if you love it or hate it, make your voice known.
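If I have the flag and label names right (treat them as assumptions and check your version), the feature looks something like this at query time:

```promql
# With --enable-feature=type-and-unit-labels, each series carries its
# metadata as reserved labels, so two series that differ only in unit can
# be told apart in a selector:
{"http.server.request.duration", __type__="histogram", __unit__="s"}
```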
Prometheus is also working on support for OpenTelemetry delta metrics. It's in the super, super early stages. When we ingest delta counters, we mark them as unknown, because they're obviously not the same as traditional Prometheus cumulative counters. But this is going to be an area that I think improves slowly over time, with things like adding start timestamp support, and there's obviously going to be a lot of work on the storage and query layer required for this as well. The very beginnings of this have already started, and if you want to play around with it, I'd say it's in a state where, if you do happen to have delta metrics, they're usable in some cases but have some pretty severe limitations.
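As a rough illustration of why delta samples are already usable in some cases: each sample is itself the increase over its reporting interval, so a plain sum over the window approximates the total increase without the cumulative-counter assumptions baked into rate(). The metric name here is hypothetical:

```promql
# Hypothetical delta counter ingested via OTLP; summing the raw delta
# samples over the window approximates the increase in that window.
sum by (job) (sum_over_time({"app.requests.errors"}[5m]))
```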
Another massive area of feedback has been dealing with OpenTelemetry resource. OpenTelemetry sticks a ton of metadata about where a process is running into resource attributes, and not just one or two resource attributes but sometimes 20 or 50. Dealing with these in the Prometheus data model can be a little tricky. If you promote them all, you've added tons and tons of labels to all of your metrics, it clutters up your UI, and that's not really a great solution. But if you've been following the mapping up until this point, you can also take them and stick them in a different metric called target_info and make everyone write a nice long join query. That also has some problems, so instead of picking one of the two approaches, we've actually been trying to make both approaches a little better supported.
First, if you want to take the approach of promoting resource attributes into your Prometheus labels, we've added better support for that. We now essentially have an allow list and a deny list. If you want to add certain resource attributes whenever they're present, you can use promote resource attributes. And if you want to add all resource attributes except for a few problematic ones, you can do that by promoting all of them and then ignoring specific resource attributes. Separately, there's one special option for the service resource attributes, keep identifying resource attributes, and you can learn more on the website. There are actually some pretty good best practices there, so if you're looking for the copy-paste-and-forget version of this, it's probably right there.
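In prometheus.yml, that ends up looking roughly like this; the allow-list and keep-identifying options are documented today, while the promote-all/ignore pair is newer, so treat the exact names as assumptions and check your version:

```yaml
otlp:
  # Allow list: copy these resource attributes onto every sample whenever
  # they are present.
  promote_resource_attributes:
    - service.instance.id
    - k8s.namespace.name
    - k8s.pod.name

  # Or the deny-list approach: promote everything except the noisy ones.
  # promote_all_resource_attributes: true
  # ignore_resource_attributes:
  #   - process.command_args

  # Keep the identifying service.* attributes around as labels instead of
  # only folding them into job/instance.
  keep_identifying_resource_attributes: true
```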
Second, we do put all the resource attributes in a metric called target_info, but we've gotten a lot of complaints over the past year or two that this is basically unusable for most people. You can see the query you have to write here is quite long compared to the thing you were actually trying to do, which is get your HTTP server duration over the last two minutes. So that's a pain. To try to make that experience a little better, and make the queries more readable and ergonomic, we've introduced the info function. The idea is to make joining with an info metric a simpler concept across Prometheus generally. And you can see it actually does simplify things quite a bit, and it's maybe a little easier to tell what's going on, which is that we're trying to add the Kubernetes cluster name to the metric we're querying. That has to be enabled with the feature flag for experimental PromQL functions.
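The slide isn't reproduced here, but the shape of the two queries is roughly this; metric and label names are illustrative, and info() requires --enable-feature=promql-experimental-functions:

```promql
# The classic approach: a manual many-to-one join against target_info.
rate(http_server_request_duration_seconds_count[2m])
  * on (job, instance) group_left (k8s_cluster_name)
  target_info

# With the experimental info() function: ask for the data label you want
# from the matching info series instead of spelling out the join.
info(
  rate(http_server_request_duration_seconds_count[2m]),
  {k8s_cluster_name=~".+"}
)
```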
And finally for this section, we've added a little better support for OpenTelemetry scope. If you've ever been debugging a metric and wondered, who the heck defined this, that's what OpenTelemetry scope is meant to answer. It usually tells you the package name where the metric is defined, and it also includes the version. It can be helpful for catching regressions: if a metric broke, or if you're seeing, as in the earlier example, one metric in seconds and another in milliseconds, you can figure out who the culprit was that defined their metric without using base units. These are still opt-in; there's a promote scope metadata option you can turn on to get them as labels.
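In config terms that's another opt-in under the otlp block; the option name and the resulting otel_scope_* labels are my understanding of the current docs, so treat this as a sketch:

```yaml
otlp:
  # Adds scope metadata such as otel_scope_name and otel_scope_version as
  # labels on ingested series, so you can see at query time which
  # instrumentation library produced a metric.
  promote_scope_metadata: true
```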
All right, let's talk about native histograms. Native histograms are an awesome feature that comes with Prometheus 3.0. For more details on exactly how they work and why they're awesome, some of the previous talks are probably going to be better than this one. But today I'm going to focus on something called native histograms with custom buckets, which, when I first heard it, sounded kind of bizarre: isn't the whole point of native histograms that they have exponential buckets?
All right, so we'll start with an overview of the classic Prometheus histogram as it exists today. This example has six individual series: four of them have the _bucket suffix with the le label, meaning less than or equal, and each of those series holds the count of observations that were less than or equal to the threshold. Most people are pretty familiar with this. There's also a _sum and a _count series.
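The slide isn't shown here, but a classic histogram in the text exposition format looks something like this, with made-up values:

```text
# Four cumulative buckets plus _sum and _count: six series in total.
http_request_duration_seconds_bucket{le="0.1"}  3
http_request_duration_seconds_bucket{le="0.5"}  7
http_request_duration_seconds_bucket{le="1"}    9
http_request_duration_seconds_bucket{le="+Inf"} 10
http_request_duration_seconds_sum 4.2
http_request_duration_seconds_count 10
```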
Native histograms, on the other hand, even though they're full of information and very dense, oftentimes with something like 100 buckets, are stored as a single data point, a single complex sample. And it turns out that, especially in the TSDB and the query layer, that is actually really, really efficient. So the logical next step is for someone to say, hey, why don't we store all histograms as a complex sample type? Why not take our regular classic histograms, turn them into this complex type, and store and query them that way, so we get all those efficiency gains? That's what was implemented, and it's called native histograms with custom buckets.
There's mostly one catch to this: if you're familiar with native histograms, the buckets aren't individual series anymore, so you can't query for them like that. There's no series in a native histogram with the _bucket suffix, for example. Instead, you access the fields of that complex histogram type using functions on it. So I might use the histogram_count function to get the count of a native histogram. When, or if, you migrate from classic histograms to native histograms with custom buckets, you'll need to change your queries as well, to query using these new native histogram functions.
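Concretely, the before-and-after looks roughly like this, with an illustrative metric name:

```promql
# Classic histogram: quantiles and counts come from the _bucket and
# _count series.
histogram_quantile(0.9, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
rate(http_request_duration_seconds_count[5m])

# Native histogram (including custom-bucket ones): the bare metric name is
# a single complex sample, and you use histogram_* functions on it.
histogram_quantile(0.9, sum(rate(http_request_duration_seconds[5m])))
histogram_count(rate(http_request_duration_seconds[5m]))
```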
The folks who worked on this thought through a lot of different migration cases, and there are a couple of config options you can use to write just the classic ones, both, or just the new native histogram custom bucket versions. So it's definitely something worth trying out, and there are some real benefits to using it.
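As a sketch, the migration knobs live in the scrape config; the option names below are how I understand recent releases, so verify them before relying on this:

```yaml
scrape_configs:
  - job_name: my-app
    # Convert scraped classic histograms into native histograms with
    # custom buckets (NHCB).
    convert_classic_histograms_to_nhcb: true
    # During migration, keep ingesting the classic series as well so
    # existing dashboards and recording rules keep working.
    always_scrape_classic_histograms: true
```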
All right, for fun, let's look at a bunch of the features all in one view. Here we have a metric that's using UTF-8 support, so it's got some dots in it. It's a native histogram, it's being queried on the new Prometheus 3.0 user interface, and it was ingested via the OTLP endpoint. So this is actually showing off a bunch of the cool new features all in one view.
And finally, let's talk about some of the changes to PromQL. These are all new feature additions, so it's a great time to offer feedback on them, play around with them, and find and report bugs. First, this apparently had been open for a while but is finally getting looked at: you can now use plus or minus in durations. So I can express 9 minutes and 2 seconds by writing 9 minutes plus 2 seconds instead of having to write 542 seconds. It doesn't yet apply to the @ and offset modifiers, but hopefully that will come in the future.
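For example, something like the query below; duration expressions are experimental, and I believe they sit behind --enable-feature=promql-duration-expr, though treat that flag name as an assumption:

```promql
# Equivalent to rate(http_requests_total[542s]), just far more readable.
rate(http_requests_total[9m + 2s])
```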
Next, it turns out it's quite expensive to query for the timestamps of metrics, and Prometheus doesn't have a general way to say, I'd like to query for a bunch of timestamps and do math on them. There's a timestamp function, but it has a lot of performance issues, from what I've read. So instead of working towards a generic solution for now, the most common use cases of querying for timestamps have been addressed with new functions: timestamp of min, max, or last over time. If you've encountered this, there are now fixes, but if your use case isn't met, it's probably a good idea to raise your voice too.
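The new functions look roughly like this; the ts_of_* names match my reading of recent release notes, are experimental, and should be double-checked:

```promql
# Unix timestamp (in seconds) of the largest sample value in the last hour,
# e.g. "when did memory usage peak?"
ts_of_max_over_time(process_resident_memory_bytes[1h])

# Timestamp of the most recent sample in the window.
ts_of_last_over_time(up[5m])
```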
All right, finally, I think this is actually really cool. There's now support for a little more control over how extrapolation works in PromQL. There are two new keywords: one is anchored and the other is smoothed.
Look at how PromQL currently deals with missing data. On the left of the slide, you can see what Prometheus does today: if there's missing data, it shows you the two pieces it knows about, and if you measure the increase, for example, it shows you the portions it knows about, which add up to about six in this case. If you use the smoothed keyword, Prometheus does a lot more interpolation. It ignores staleness markers, for example, and just linearly interpolates between the points that it knows. So if you've ever looked at a Prometheus dashboard with intermittent or missing data, you'll just see something like a scatter plot, which isn't very useful. Smoothed is potentially more useful in situations like that, because the query engine will draw smooth lines between all your points and make it a little more usable. It's maybe harder to see that data is missing, but it's probably closer to what the original data was supposed to be.
There is a big catch with this, though: it requires a data point before the query window and a data point after the query window to look correct. That's mostly important for recording rules or alerts, because let's say you had a constant line going across; it doesn't have a point in the future if you're querying all the way up to now, and so you'll see it drop off. That's the case in some other monitoring products as well. But if you're evaluating an alert on data that has dropped off because there's no data in the future yet, you might get the wrong alert value. There's good documentation on exactly how to handle that if you still want to use smoothed in rules. Just be careful using this: it's great for dashboards and probably something you'll want to play around with, but you have to be careful when looking at times really close to the current time, where you might not have the most recent data yet.
Now, let's talk about anchored. Show of hands again: who has ever queried an error metric only to be confused about how you got 3.75 errors in an interval? That's a pretty common piece of confusion. How can you query over a bunch of integers and get a float? That's where the new anchored keyword comes in. It doesn't do any fancy extrapolation, and it doesn't do any interpolation. It just gives you exactly what the change was between the values that are in the time series database. So if the last value is three or four and the next value is nine, it's not going to try to draw lines all the way between them.
This avoids that kind of over-extrapolation; it's generally making fewer assumptions. It can be quite useful for business metric use cases where you don't really want an overestimate, or guessing about what comes next. And if you're querying with increase on integer data, you will only get integers back. In many ways, that can make dashboards on things like error counts much easier to actually understand.
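To make the 3.75 example concrete, this is the kind of query that surprises people today: with default extrapolation it can return fractions even though the underlying counter only ever increments by whole numbers, which is exactly what anchored is meant to avoid. The metric name is illustrative, and I'm not showing the new keyword's syntax here since it's still experimental:

```promql
# Default behavior: increase() extrapolates to the window boundaries, so
# an integer-valued error counter can come back as something like 3.75.
increase(http_server_errors_total[5m])
```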
All right, there have also been some recent changes in governance. We added a big batch of new team members, which was a good process and, I think, well deserved for them. Here are the people who were added recently, and as you can see, there's actually quite a good mix of companies represented, which is very exciting for the project.
We're working on a 2.0 of the governance structure and trying to mimic other CNCF projects more in terms of project structure. Right now Prometheus has just a group of team members, myself included, but we're moving to a model that's more similar to other projects, where there will be contributors, members, maintainers, and then a smaller steering committee that makes the hard decisions.
All right, there's a ton of stuff happening that's going to be part of the next KubeCon talk, in EU or North America next year, and you can be a part of it. There's much to do with delta support. There's OpenMetrics, where we're working on a 2.0, which is extremely exciting. And there's a bunch of other really useful features that we need feedback on, and that we'd also appreciate contributor help with. So if anything excites you, get involved, talk to me afterwards, and I'll try to connect you with the right people.
Slack is also a great place to leave feedback. If you saw a neat thing today that you love, or that you hate, let us know. There's a variety of channels for different topics. All right, I'm happy to take questions now. I think we've got five minutes.
[applause]
Is this turned on? It's on. Okay.
>> Yeah, do you plan to do anything about the initial zero problem?
>> Say that one more time?
>> The initial zero problem: when a metric first comes in, it's not set to zero, and then something gets reported, but when you run the rate or increase function it doesn't return anything because there was no zero.
>> I'm not familiar with that specific problem. I do know that a lot of the weirdness around querying when a metric initially started is, in some respects, addressed by adding created or start timestamps, because then we actually have a point at a timestamp with a zero value. But again, I'm not super familiar with that exactly, so I'm not sure whether that will fix it or not.
>> Okay.
>> Hi, I have a question not related to the topics in the talk, but mainly regarding WAL corruption in Prometheus. I know a lot of these WAL corruption issues have been fixed over various versions, but we still keep hitting them occasionally, and the only way to fix them is to go in and just rm -rf the WAL directory. So is there a plan to add something to promtool, or for Prometheus to automatically handle this, or to provide some tooling around it?
>> I'm unfortunately not an expert in the TSDB. I encourage you to open an issue or look for existing ones; it sounds like a useful feature. Prometheus sometimes will solve problems like that itself, and sometimes it expects an operator or some wrapper, something managing it, to solve it. So I'm not sure, I don't know.
>> Okay, thank you.
>> Yeah, you're welcome.
All right. Thank you everyone for coming. Thank you for coming all this way out to C11.
Prometheus Intro, Deep Dive, and Open Q+A - Owen Williams, Grafana Labs & David Ashpole, Google
As the 2nd oldest project in the CNCF, you have probably heard about Prometheus before. Nevertheless, the project maintainers will give you an introduction from the very beginning, followed by a deep dive into the exciting new features that have been released recently or are in the pipeline. You will learn about many opportunities to use Prometheus, and maybe we can even tempt you to contribute to the project yourself.