Perfect, we are live again. So hello everyone, welcome to the Grafana Campfire community call, October edition. I'm your host Usman, and today we've got something very interesting which happened just a few days ago in Germany, in Munich. The conference was about Prometheus: PromCon, which is really one of the unique conferences where mostly the Prometheus maintainers and developers come together and talk about what's new, what's exciting, and what they're working on. In this call today we're going to discuss what's going on with Prometheus and the community, how Grafana Labs is working with Prometheus and its integration to make the user experience better, and we might also see some cool demos; we have something planned for you today. I will go with the introductions first. My name is Usman, I work at Grafana Labs as a staff developer advocate, and I have been working closely with the open source community. Right now I'm working mostly on documentation and learning journeys, which is fun. Documentation is something we need, I think community users love our documentation, and it's an integral part. Along with me are two very, I would say, specialist and expert guests today. Carl, I think some of you know him already or have seen him, and also Goutham, but let us do their introductions more properly. So I hand over first to you, Carl.
>> Thank you, Usman. Yes, my name is Carl. I'm one of the OGs at Grafana Labs; I've been here for ages, and I recently had my 10-year celebration. I've spent all of my time on the Grafana project. We do so many things at Grafana Labs these days, but I've always been around the Grafana project, and I first got in contact with Prometheus in 2016, when they had their first Prometheus conference in Berlin. That's when I met the OG Prometheus team, and I really like that group of people; it's a very monitoring-first, pragmatic, friendly group. So I've been sticking around that project for quite some time now, not engaged in the Prometheus project itself, but making sure that Grafana works well with Prometheus. And that's how I met Goutham in Munich, I think in 2017.
>> Yeah. Yes.
>> Yeah.
Yeah.
>> Yeah. Hey everyone, I'm Goutham. I'm a Prometheus maintainer, but I haven't been writing a lot of code for Prometheus recently. I've been involved with the Prometheus community since 2016, when I started using Prometheus in the summer of 2016. In 2017 I started contributing to Prometheus as part of an internship, and yeah, eight years later I'm still around, and I recently helped organize PromCon. For my day job I work at Grafana; I've been at Grafana seven years. Well, I thought that was a long time, and then Carl said 10 years. I've worn a lot of different roles and hats: I was doing a lot of engineering around Prometheus, the time series database, and hosted metrics, but I've also worked with the OpenTelemetry community, put on my PM hat for a couple of years, and now I'm back as an engineer. It's been three months since I came back as an engineer, and oh my god, I love it.
>> So that is the secret, that's why you're not writing code these days.
Yeah,
>> Cool. Yeah, I must say one thing about Carl. I actually met Carl for the first time face to face at GrafanaCON, and I still remember we had this roundtable panel discussion, and there were a lot of questions from every direction, and Carl was there and he was such an expert: I know this, I can answer this. And we were like, okay, we should have people like Carl, because when you work that much you know the concepts and the background of why this is working and what we are looking at. So I think both of these guests definitely have expertise in Prometheus but also in Grafana. Let's kick off the discussion. Before we discuss what Prometheus is or what's going on, let's talk about PromCon. Maybe Goutham, or Carl, or maybe Goutham, because I remember you are also an organizer, so you can talk a little bit about PromCon: how was it, what did you do, what did you find interesting, and so on.
>> Yeah. PromCon is the annual Prometheus conference for Prometheus users and maintainers to come together and share what we've been doing, what all the new developments are, and how people are using Prometheus in new and exciting ways. It's an amazing conference; people have been consistently coming to it again and again, and it's actually a small and cozy conference. It's organized by the maintainers themselves, we try to keep the ticket prices low and accessible, and in general the vibe is very chill. It's actually a very special conference for me personally, because I gave my first conference talk in 2017 at the same venue, on the same stage. I was just looking at a YouTube video of that talk and I'm like, wow, I had a lot more hair back then and I looked like a baby. But yeah, it's been happening, I think, since 2015, so it's the 10th PromCon this year, and I think that is very special, yeah.
>> And I think, sorry...
>> Yeah, something I just want to add: I do agree that it is special, and it's special because there are so many experts there, but the atmosphere is also very, very friendly. You can be on stage and be a beginner, and that's fine; we all want to hear from everyone at this conference, and that's what makes it very unique and why I want to keep going there. I would highly suggest it for someone new in this space, or if you've been in this space for a long time. This is one of the best conferences in the monitoring space, except GrafanaCON obviously.
Yeah. No, I totally agree. I was also there at PromCon on day two, and it was pretty exciting to see so many people who are working in the community, and very friendly. And I think, Goutham, correct me if I'm wrong, the Prometheus conference never really took a break or a pause; even back in the COVID days you guys organized virtual conferences, so it has always remained alive. Is that correct?
I think so. I don't know if we did it in both the COVID years, but I'm pretty sure we did one virtual conference. Let me quickly check 2020, 2021... Yeah, it has always happened, from 2016 to 2025. In the footer of the website you can see the links for each of those. So it happened each year.
>> Yeah. And this shows the commitment from the community: they want it, and the organizers say, let's do it. This is where you see the passion from the community; they want to use it, they want to build it and improve it. This is amazing to see, because not so many conferences have a journey where there's a break or a pause and then they continue again. So yeah, that's pretty cool about PromCon. And let's talk about what's new in the Prometheus community: what were the highlights of the Prometheus conference? Over to you guys.
>> Yeah, I mean honestly we didn't have any big announcements, nothing like a Prometheus 4.0, but we launched Prometheus 3.0 a year ago, I think September or October, so roughly a year ago. That was a big release, and we've basically been continuously improving upon it; we kicked off a lot of features and have continued improving them. One thing I want to add is that native histograms are now stable. If you're unfamiliar with Prometheus histograms, they help you understand the distribution of data. They help you know, okay, there were 10,000 requests that took between nine and 10 seconds, there were 5,000 requests that took between one and two seconds, and there were 100,000 requests that took less than 200 milliseconds, so you can understand the distribution of the data. But before native histograms, you had to pick the buckets: you had to say, okay, I'm actually interested in buckets at 0, 200, and 500 milliseconds, a second, and so on. And you pick buckets, and then maybe you do a new release and suddenly these buckets are not relevant anymore; maybe it got faster, maybe things got slower, and this used to cause some problems. Also, when you have a wide distribution, you needed a lot of buckets, and that used to cause other performance problems. With native histograms we fixed a lot of this. You just roughly define what resolution you want, and we automatically scale the buckets up and down based on the distribution of the data. You can have really high resolution, and you can actually render videos as heat maps, as Björn has done. Native histograms solve a lot of the problems that the previous classic histograms used to have. We've been working on this, actively, for two and a half years I think, or more, and they're now finally stable. So we're not going to break any of the APIs, and you can go ahead and start using them.
I think it's super nice. It sounds a little more sophisticated and hard to learn, but it's actually the opposite: classic histograms are harder to operate well and keep working with. So this is a big step up in usability for all of the Prometheus users.
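To make the difference concrete, here is a minimal sketch using the Go client library (client_golang), which has had native histogram support for a while; the metric names, bucket boundaries, and resolution factor below are made up for illustration, and on the server side native histograms historically sat behind a feature flag before becoming stable.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Classic histogram: the bucket boundaries are picked up front. If latencies
// shift after a release, these buckets may no longer match the distribution.
var classicLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "demo_request_duration_seconds_classic",
	Help:    "Request latency with hand-picked buckets.",
	Buckets: []float64{0.2, 0.5, 1, 2, 5, 10},
})

// Native histogram: instead of fixed buckets, declare a resolution (the growth
// factor between buckets); the buckets adapt to the observed distribution.
var nativeLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:                           "demo_request_duration_seconds",
	Help:                           "Request latency as a native histogram.",
	NativeHistogramBucketFactor:    1.1, // roughly 10% relative bucket width
	NativeHistogramMaxBucketNumber: 100,
})

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Observe the same (illustrative) value in both histograms.
		classicLatency.Observe(0.42)
		nativeLatency.Observe(0.42)
		w.Write([]byte("ok"))
	})
	// Expose both metrics on /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```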
I have something else I want to share that we did in Grafana related to Prometheus, and we announced it at PromCon as well. Let's see if I can add this to the stage. So we added, or rather massively improved I would say, the ad hoc filter in Grafana, the ad hoc filter in Grafana with Prometheus.
>> Sorry, Carl, sorry, I'm really interrupting. Can you just zoom in? It looks nice, but if you can.
>> Yeah.
>> Perfect. And please continue.
>> So the ad hoc filter allows you to drill down into any dimension, rather than having to predefine what you want to filter by. With classic template variables in Grafana, you have to decide up front if you want to filter by cluster, alert state, and so on in your query, and then we do string interpolation at query time. Making all of these decisions up front means it's harder for people to share dashboards, and you might also hit edge cases where you didn't think about how you want to filter these metrics up front, and then you have to jump out to Explore or other pages of Grafana. With the new filtering feature in Grafana for Prometheus, we load the labels you can filter by dynamically. When you click on this, we figure out what labels actually exist for the time series in the dashboard, and we only show those as options here, instead of the thousands of labels that were suggested earlier. And when you pick one of these, by the way, all of this works really well with the keyboard as well, so I'm just going to use the keyboard. When you pick one of these, your options are also limited to the values that really exist in the Prometheus data source. In the previous version of ad hoc variables, it pulled all of the different options instead of just the ones that are really relevant for you.
And when you keep drilling down into the time series like this, you also get fewer and fewer options. So this is a way of working more dynamically with dashboards and being able to share dashboards between organizations, because of course if we go into the panel itself and look at the query, it doesn't contain any of the labels that we used to filter; these filters that we apply are rewritten at runtime, and that is the power here: we disconnect the dashboard definition from what you're actually seeing and experiencing right now in Grafana. If you go to the query inspector, which is a great tool for understanding what's actually going on, you'll see that these filters are inserted before the query is sent to Prometheus. So I think this is a big shift for Grafana and Prometheus, because we're now going to be able to share dashboards more easily and with less friction. It's going to be a long journey to backfill a lot of dashboards, but now we have the tools to do that, and I'm very bullish on it, even if it's going to be quite a journey to get there.
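For a rough intuition of what "rewritten at runtime" can mean, here is a small sketch that injects an extra label matcher into a PromQL expression using the Prometheus parser package; this only illustrates the idea and is not Grafana's actual implementation, and the query and label values are made up.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/promql/parser"
)

// addFilter parses a PromQL query and appends an extra label matcher
// (e.g. cluster="eu-west-1") to every vector selector in the expression.
func addFilter(query, name, value string) (string, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return "", err
	}
	matcher := labels.MustNewMatcher(labels.MatchEqual, name, value)
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, ok := node.(*parser.VectorSelector); ok {
			vs.LabelMatchers = append(vs.LabelMatchers, matcher)
		}
		return nil
	})
	return expr.String(), nil
}

func main() {
	// The dashboard query stays generic; the filter is applied on the fly.
	q, err := addFilter(`sum(rate(http_requests_total[5m])) by (handler)`, "cluster", "eu-west-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(q)
	// Prints something like:
	// sum by (handler) (rate(http_requests_total{cluster="eu-west-1"}[5m]))
}
```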
>> And Carl, correct me if I'm wrong, is this feature available already in the latest version? Ah, cool.
>> It's been behind a feature flag for over a year now, but now we've turned it on by default.
>> Perfect.
>> Thanks. Thanks for sharing this. This is really useful, because I remember that filtering out those labels could take time, and making it as easy as this really helps users find what they're looking for. And I really love that you're just using the keyboard; that's the most important part, especially for users who are like, okay, Linux or Mac, that's all we need.
>> Yeah no kickoffs here.
>> Cool. So yeah, thanks for sharing these new features and what's already there. I think the next topic is something about OpenTelemetry.
Yes. Yeah. Basically, we've made a lot of progress, and we also spoke a lot about OpenTelemetry at the conference. After the conference, typically all the developers, maintainers, and people active in the community get together for a developer summit. In person, we basically spend 9 to 5 just discussing different topics, trying to build consensus, hammering out differences, and figuring out the roadmap. And a lot of the discussion has focused around OpenTelemetry over the past year or so. For those of you who are not aware, OpenTelemetry is a new ecosystem of instrumentation libraries and a protocol that helps you instrument your applications. The idea is that there are these mature open-source instrumentation agents and libraries, you instrument your application once, and then you can send the same telemetry to any vendor you want. So if you are on APM vendor one, you don't need to actually change anything in the application; you just change the config of the router or collector, which is managed centrally. You change the config in one place and you're able to fork the writes to another APM or observability solution, and you can compare and then switch easily. You don't need to go and re-instrument thousands of applications, or ask your developers to do that. OTel is gaining a lot of popularity for a very good reason; this resonates, and they built something really cool. Now, Prometheus already has its own ecosystem of instrumentation and collection, and we're seeing OpenTelemetry also gaining adoption, so we're trying to understand and become better at supporting these OpenTelemetry users.
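As a sketch of the "instrument once, change only config" idea, here is a minimal Go program using the OpenTelemetry SDK with the OTLP exporter; the exporter's destination is normally taken from configuration such as the OTEL_EXPORTER_OTLP_ENDPOINT environment variable or a centrally managed collector, so switching or forking backends does not require touching this code. The metric name is made up.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// The exporter reads its target from config/env (e.g. OTEL_EXPORTER_OTLP_ENDPOINT),
	// so switching vendors or forking data is a collector/config change, not a code change.
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)
	defer provider.Shutdown(ctx)
	otel.SetMeterProvider(provider)

	// Instrument once; the same counter can end up in Prometheus, an APM vendor,
	// or both, depending only on where the collector forwards it.
	meter := otel.Meter("example-app")
	requests, err := meter.Int64Counter("app.requests.total")
	if err != nil {
		log.Fatal(err)
	}
	requests.Add(ctx, 1)
}
```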
But as we started doing this, as we started implementing features in Prometheus that make sense for OpenTelemetry, we also had a bit of a conversation: Prometheus made a few choices because we believed they were better, and now with OpenTelemetry we feel like we're walking them back a little. For example, let me quickly share my screen. Yeah, if you can see my screen here: basically, Prometheus is a pull-based monitoring system. Essentially what happens is your applications are instrumented with telemetry, but Prometheus sends a request to fetch those metrics, and every 60 seconds or 15 seconds Prometheus continuously collects all of this telemetry and stores it. Typical monitoring systems before Prometheus were push-based: the applications pushed all their metrics to an endpoint, either ad hoc or on a regular interval. Prometheus was pull-based because we fundamentally believe it's a better way to monitor systems. There are a few advantages; let me quickly also pull up the FAQ doc for Prometheus that explains this.
We have one of these popular FAQ items that asks: would you rather pull or push? And we believe pull is better. Basically, one thing is that your application is always exposing these metrics. So if you want to understand what's happening with your application, you can start curling the endpoint, or run Prometheus on your local machine and collect that telemetry, to understand what's happening with your application. You don't need to depend on a central monitoring system or set up something complex to do this; you can just curl, and I've done this many times to understand how many requests happened, or how many times a certain event happened, or things like that. But I think what is more important and interesting is that pull helps you easily understand whether your target is up or down.
So for example, let's say you're running a simple API server, and there are 10 replicas of this API server serving requests. With push, it's really hard to figure out that one of them is down, or one of them is not starting up, or one of them is having issues. The reason is that if it's having issues, if it's crashing, if it's running out of memory and unable to push, you basically don't know whether the system should have nine replicas or 10 replicas. It could be autoscaling, right? Maybe one of the replicas went down because somebody scaled it down. With a push system, you don't know if the 10th replica is down for a good reason, because it shouldn't be running, or because it's having issues. With pull, we already know there are 10 replicas running: either you configure manually that these are the 10 replicas, or you look at the runtime, like Kubernetes, to understand how many replicas should exist. We know there are 10 replicas, okay, here are their IPs, let me get the metrics from them. Oh no, this 10th replica is crashing and I can't collect the metrics. Prometheus can then easily say, okay, this target is down, it's having issues. It generates a metric called up and sets the value to zero, and then you can easily write an alert that says: up equals zero, send me an alert. So you know that something is down. This kind of health check is really hard to do with a push-based system. And this health check has saved us, at least me being on call, many a time: when we do a new release and the new release is crashing, within a few minutes we get an alert that the new release is having issues. This is hard to replicate with push-based.
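In practice that health check is usually an alerting rule on the synthetic `up` metric (expr: up == 0). As a loose illustration, the sketch below runs the same query against a Prometheus server with the official Go API client; the address is a made-up local example.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical local Prometheus; in a real setup this would be an
	// alerting rule (expr: up == 0) rather than an ad hoc query.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Every scraped target gets a synthetic `up` series: 1 if the scrape
	// succeeded, 0 if Prometheus could not reach it.
	result, warnings, err := promAPI.Query(ctx, `up == 0`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	// Anything returned here is a target that should be running but is not
	// answering scrapes; exactly what you would page on.
	fmt.Println("down targets:", result)
}
```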
>> I'd like to add here as well,
>> Yes, yeah.
>> For me, it really exploded at the same time as Kubernetes, because Kubernetes made it easier for everyone to dynamically schedule workloads, with pods going up and down constantly. With a pull-based system it is easier to catch problems there; if you have a push-based system and Kubernetes, you need to do the extra dance afterwards of calculating whether this should be running or not and comparing that to service discovery. I think this was one of the killer features of Prometheus, to be honest: it aligned with Kubernetes and the rise of containerization at the same time.
Yes, we also used to get a lot of questions early on, like why is this pull-based, because there was no other pull-based system, and we even wrote a blog post in 2016; a lot of the concerns were like, oh, with pull maybe it doesn't scale.
>> And we have this really, really good blog post. One of the interesting things about that blog post is also accidentally DDoSing your monitoring: you could have a bad deploy that just floods your monitoring system with pushes and can take it down. But with pull we know, okay, if this is sending too many requests, or exposing too many metrics, let's just not scrape it. We had a talk at PromCon itself about collecting data with the OTel Collector, and it's really hard to understand whether you're seeing less data because the OTel Collector is having issues due to a DDoS, or because the applications are actually sending less data; that's another thing that pull solves. Now, by 2018, 2019, 2020, as we grew, we didn't get this question anymore; people started understanding that pull might actually scale, pull might actually be better. But now, with OpenTelemetry, people are going back to push, and we were like, okay, what do we do? We thought pull was better, and now we have to support a push-based system. We had this interesting discussion, and I think the conclusion was: we believe pull is better, but we're not going to be on a high horse and say you should always do pull. There are good use cases for push, and there are good reasons people choose push, so we want to support both, push with OpenTelemetry and pull with the Prometheus clients, so that people can make the trade-offs and choices that they want. We also wanted to highlight that we believe pull is better and here are some use cases where it might be better, without saying push is bad or wrong.
Oh, we also have a comment from Basti, who says Nagios was also pull. I was a student in 2017, and Prometheus was my first monitoring system, so I did not know enough about Nagios. So yeah, okay, thank you, Basti. Cool.
>> Cool. I think, Goutham, you actually gave us a complete 101 on push versus pull. This is actually a very common question asked in our community forum, like what method should we use, and understanding it now, and sharing this link, I think will help. I do have a question; I'm not sure if I'm in the right place when it comes to OTel and Prometheus. I have been using Prometheus for a while and I'm pretty comfortable using it, I know how it works, but maybe you can share some guidance: when should someone really use Prometheus, and when should someone say, okay, we should try the OTel approach?
>> That's a really hard question to answer. If you're looking for efficiency and simplicity in instrumenting your applications and adding custom telemetry, I think the Prometheus SDKs and ecosystem are very good at that. However, with OTel you also get really good auto-instrumentation, APM-style agents out of the box, so if you're running Java you can just drop in an agent and you basically instrument the whole app. The other big advantage of OTel is that it also does traces and logs, while Prometheus is focused on metrics. So if you are running a high-scale system and you're really interested in highly efficient and simple-to-use metrics, I would use Prometheus. Prometheus gets you a long way in terms of understanding your system, observing it, monitoring it, and things like that. But if you need traces, if you need some of this auto-instrumentation and ease of use, like auto-magic instrumentation, the OTel ecosystem stands out. So you need to look at your requirements, your apps, your runtimes, and make those choices.
>> Understood. Yeah, that makes sense, because if you are a developer and you want quick access to see what's going on in your application, then OTel might be the best choice, because you don't have to deep dive into setting up all the configuration details. But Prometheus's focus on metrics can give you more insight into what your application performance looks like, where the bottlenecks are, and so on. Thanks. Thanks for sharing this.
Yes, I just want to add: no matter whether you pick the Prometheus client libraries or the OTel client libraries, you can still store all your metrics, even if they're from OpenTelemetry, in Prometheus, and you can have a great experience.
>> Nice.
Nice. And yeah, let's continue. I'm sorry, I also forgot where...
>> I mean, one of the things I wanted to also highlight is that I've been working on this OpenTelemetry-Prometheus compatibility problem for over two and a half years. That was when I had just started putting my PM hat on, and I kind of collected all the problems of storing and using these OpenTelemetry metrics in Prometheus and helped kickstart a Prometheus-OTel working group, just to make sure that people instrumenting with OTel have a great experience with Prometheus. So yeah. Oops, the long link. One second. Where did I put this link now? Huh? Sorry.
>> So, yeah.
>> One second. Yeah.
>> Okay.
>> Yes, found it. Sorry, multiple browser windows and I'm struggling. So, we published a blog post about 18 months ago called Our commitment to OpenTelemetry, where we said we wanted to be the default store for OpenTelemetry metrics, and this is basically a lot of the work we've been doing over the past 18 months to close a lot of the gaps and usability issues with OpenTelemetry metrics in Prometheus. We've done a lot of this. We had two amazing talks on this work at PromCon: one focused on delta temporality, by Fiona, and one focused in general on all the improvements we've made, by Owen and Arbor. The recordings should be up soon. So yeah, that's essentially been 18 months, two years' worth of work trying to bring the ecosystems together, and I'm quite happy with the current state of things.
Nice. I think we got a question from one of our users, and the question is about how Grafana Alloy fits in the mix; it can also scrape an exporter on, say, a host and remote-write to Prometheus.
>> Yeah, that is an excellent question. Essentially, if you look at Alloy, it was the Grafana Agent before and then it became Alloy; this is a natural evolution. Prometheus has a really solid ecosystem for infrastructure monitoring in terms of exporters; you're able to monitor all kinds of systems. There are, I don't know, 7,000 different things listed on one page of all the things you can monitor, including sometimes your TP-Link routers, for example, and their stats, or even your CS:GO data, as somebody just showed, or 3D printer data and stuff like that. That's on the fun side, but even at work we have a really solid ecosystem for infrastructure monitoring: the node exporter and the MySQL exporter help you understand all the different infra components running in your systems. OpenTelemetry, on the other hand, focused on application monitoring first, so they have a really solid hold on the application monitoring side of things. With Alloy we tried to bring the best of both, the OTel support and the mature infrastructure ecosystem of Prometheus, together into one binary. And now, coming to the question of how Alloy fits into push versus pull: as you said, Alloy scrapes, so it pulls metrics and then pushes them to Prometheus. The cool thing about this is that you can push to a central location and still get all the benefits of pull. Alloy also knows that there need to be 10 pods or 10 replicas, it scrapes those, it can see that something is down, set up to zero, and then push to a central location. So you still get your pull-based simplicity and health checks, while also having this central push-based mechanism. So I think Alloy also highlights how this can work at scale.
>> And I think that's a really good middle ground, where you use Alloy to scrape, to get the up metric, and then have one centralized place for all of the metrics. Because one of the downsides of Prometheus, when you get to a certain size, is that you're going to need to run many, many Prometheus servers, and then you need to figure out how to connect all of them. There are different solutions for this: you can use Thanos, you can use Mimir. But I think the end-user experience of having one data source, one canonical place to go look, is important when you scale up your organization, because the more engineers you have, the less they're going to be interested in figuring out which Prometheus server they should go to. So to me, that is definitely the sweet spot: having the agent know whether things should be running or not, and one place to look.
>> Yeah, nice. Thanks, thanks for asking this good question.
Perfect. We can move along; I think we still have some more topics to discuss. Goutham, are we still going to discuss some more of the OpenTelemetry developments from the last...
>> I want to end OpenTelemetry on an interesting note; this was also one of the interesting conversations that we had. OpenTelemetry is traces-first, like it started its roots in traces, and it helps unify traces, metrics, and logs together into kind of one solution, and we said we wanted to be the default OpenTelemetry metrics backend. Now the question was: does a metrics-only backend even align, or vibe, with the OpenTelemetry users? Essentially, by choosing to do just metrics, we've optimized the hell out of this use case; Prometheus is an extremely optimized, extremely efficient metrics store, but there are still people who have enough telemetry, including us at Grafana, to make Prometheus struggle. There's so much telemetry out there, good telemetry, bad telemetry, but telemetry, and with OTel you actually have more metrics, not less, because it provides a lot out of the box. So yeah, you can build a general-purpose store, but it's going to struggle for the metrics use case, at least it's going to struggle more than Prometheus will. And by being a great, extremely efficient metrics backend, you can easily hook up to your log store or trace store and help build good workflows. I think Prometheus can still resonate and stay relevant for OpenTelemetry users, and that is our focus for the next year: understand what the workflows are that people are looking for in a metrics backend and focus, from a UX perspective, on making this happen. Yeah, that's basically everything for OTel.
>> Yeah. No, I think, actually, I had this question in the back of my mind, because when I started my journey as an OpenTelemetry user, I had a Java application, and if I use Prometheus I know what I'm getting, like, okay, there are metrics. But with OTel it was very easy, I could get logs, metrics, and traces, but there was also a lot of telemetry data where I was confused about where it was coming from, or, like, I don't need this right now, it's good for maybe further investigation but doesn't make sense right now. So it's a very interesting point: you have to find the balance and see, okay, do we need all of this telemetry data, or should we focus on the core or essential parts for now?
>> Yeah there is a lot of bad telemetry out
there.
Yeah, and I think at PromCon there was also a lightning talk about how you can remove some of your telemetry or metrics that you do not need, which was also very interesting; if someone knows, oh, these are all the bad metrics which we do not need and they just make CPU and memory usage higher, then just remove them.
Okay. Moving on. So, Carl, do you want to discuss something about what's new in the TSDB, the time series database?
>> I'm actually going to let Goutham take this one.
>> Oh, God. I don't have the technical depth there.
>> Yeah. I mean, honestly, this also is because of OpenTelemetry, as you will see soon, for a little bit. So, Prometheus is a single-node system: you basically log into a system, dump the Prometheus binary on it, run ./prometheus, and Prometheus runs on that node. You can give it more memory, more CPU, and more disk, and scale Prometheus vertically. But Prometheus is not something where you can say, I'm going to have 10 nodes, 10 replicas, and Prometheus is going to automatically rebalance and figure things out between itself. The reason for this is that Prometheus is more than a monitoring system, it's an alerting system, and we need alerting to be as reliable as possible. So Prometheus made this choice: rather than build a distributed system, let's build a very reliable, single-node, extremely efficient, alerting-focused system. That's one of the reasons why Prometheus never chose to become a distributed system; the moment you become a distributed system, you get a lot of problems. With that said, people basically used to run a Prometheus per namespace or per cluster, and then there was this big ecosystem of projects around it, like Cortex, Thanos, Victoria Metrics, and Mimir from Grafana, that let you send all this data from individual Prometheuses to a central system that is actually distributed and scalable. Now you have an extremely reliable alerting system that's sending you all the alerts at the right time even if your remote, network-based distributed system is broken, but you can also query all the metrics together using this ecosystem of projects.
Now, as the data started getting bigger and bigger, these projects are dealing with a lot more scale than a single Prometheus; they're dealing with billions of metrics and billions of active series, and they started running into problems around querying them efficiently. They started storing this in S3 buckets, and the storage that we built for Prometheus assumed SSDs and local disks, so we do a lot of random reads. That's fine for the single-node use case, but it's not fine when you're trying to get a lot of data from S3 or GCS, where it's nicer to read one large chunk rather than thousands of small chunks. So they ran into this problem, and what I really love is that people at four different companies, Shopify, Cloudflare, Grafana, and AWS, came together to solve this common problem for the ecosystem, and they prototyped a Parquet-based system. Parquet is a columnar format, and the new system they've explored and built is very good at optimizing reads if you're storing data in S3.
So now we had this discussion of, hey, does it actually make sense to put this into the single-node Prometheus, and we realized it might actually be slower, because the trade-offs are different, but we might still want to do it. One reason is that for massive-scale Prometheuses it might actually be faster, even if for small Prometheuses it might be slower. But the other thing that resonated is that, because we're using Parquet, it's easy to extend and add new features to the time series database, and that kind of blew my mind. I mean, I worked with Fabian, who created the new TSDB that is being used, and I was one of the first storage maintainers; I maintained it for a year, and we've optimized it heavily, and it works really, really well, and I'm really happy about that. But the problem is we've hand-rolled everything. Adding a new feature is a pain: if you want to add a new feature, you need to change a lot of different things, and people just don't want to deal with this. One of the reasons we want to adopt Parquet, and we're going to experiment with it and see how well it works, is because it's easy to extend and add new features. And a lot of these features are being motivated by OpenTelemetry as well, so I'm actually really, really happy about that. There are also plans to make it a little more native for the single-node use case, where it becomes a lot more efficient. But first we're going to kick off this working group and put Parquet in as an experimental engine in Prometheus, to see how things go and how easy it is to prototype new features and add new things.
>> If this is still an experiment, is there a way for users to test it out? Is there a link, or is it currently an internal project for testing?
It is so experimental that it doesn't exist yet. There was a talk; I'm just going to share my screen again, and I'll also share the link here. There was a talk called Beyond TSDB: unlocking Prometheus with Parquet for modern scale. This talk showed how Cortex, Mimir, and Thanos are using it or exploring it. But now we've decided to put it in Prometheus as an experimental thing, although it's like six months away. So this is a consensus that we've come to in our dev summit, not something that we've already implemented.
Cool.
Nice.
I think there's nothing much more, at least for now, on adding Parquet, so we can move to the next topic, which is very close to me, or close to the community: the Alertmanager and the Prometheus part. I do have some questions as well, but maybe Carl or Goutham, who wants to take the lead on this one, the Alertmanager discussion?
I think it's going to be Goutham again; honestly, he's the expert in this area. What I would like to say here is that I haven't been part of the Prometheus maintainer group, but I feel like Alertmanager always had a struggle finding maintainers who actively work on it, and it seems like that's a shifting trend right now.
>> Yes. Essentially, we had a good run. I mean, Alertmanager always struggled with maintainers, that was true, but we had a good run last year, or like 18 months ago, where we had a couple of maintainers from Grafana helping maintain Alertmanager as well. This was because Grafana's alertmanager is based on the Prometheus Alertmanager, so improving the Prometheus Alertmanager actually helps improve the Grafana alertmanager too. But sadly, priorities shift and people move teams, and Alertmanager kind of lost those super active maintainers; it's been a while since we even had a release of Alertmanager, and this kind of sucks, because Alertmanager is a core component of the Prometheus ecosystem. I just said Prometheus is an alerting system, and without Alertmanager you miss a big part of that. We had this amazing talk by Joel, basically titled Alertmanager has amnesia, should we fix it?, and it's a great talk about some of the issues with Alertmanager and the potential solutions being explored. But what was super interesting for us was that at the end of the talk, people were discussing all the problems in the Q&A session, and after the session ended, people gathered around Joel and had this interesting group therapy session around all the common problems they had with Alertmanager. Now, this is not a great place to be. But as part of this, we also discovered that Cloudflare and Hudson River Trading, two amazingly technical companies, depend on Alertmanager so much that they have internal forks of it. So we brought the maintainers and the people interested in maintaining Alertmanager together, and we are kicking off the Alertmanager working group again with these new community members, to breathe new life into the Alertmanager ecosystem. They're already discussing what they can upstream and how they can collaborate, and there's a lot of good energy there. Again, this is one of the reasons I love open source: this is an open source project, and here are two companies that are like, yeah, we're happy to do this, this is a great project, we are happy to push this forward. I've based my career on Prometheus and open source and a lot of open source stuff, and whenever I see things like this I'm still amazed by how cool open source is. So yeah, for those who have issues with Alertmanager, please be patient; maybe in a few months things will start to get dramatically better. I'm confident of it.
>> Nice. And I have a question, because this is something very close to me, and it also has a history at Grafana Labs. Grafana's alerting has also changed over time; obviously, again, priorities shifted and the team had to work on different areas. But this is a common question in the community: okay, now I have Grafana running with Prometheus as a data source, should I configure my alerts in Grafana, or should I use Prometheus? What is the difference between these two, what are the trade-offs, what are the advantages of using Grafana alerts, and when should you go for Prometheus alerts? If you can maybe explain this.
>> I have an opinion, but I want to hear Carl's opinion first.
Okay. So, if you step back a little bit, they're going to do pretty much the same thing, but depending on your resilience target, you can deploy them a little differently. The Alertmanager's job, and why it's been able to do its job very well without a lot of attention, is that it has a quite simple job: take the alert signals from multiple Prometheus or Grafana instances, decide which instance is the leader in the Alertmanager cluster, and then send notifications to, you know, Grafana, PagerDuty, email, or something like that. So it has a fairly small, simple job and it does it well. And if your Grafana pods or Prometheus servers are starting to have problems, that's fine, they can fail and the Alertmanager should still be running. If that is the level of resilience you're aiming for, then running it as a dedicated service next to Grafana or Prometheus is, in my opinion, a better option. But unless you really want that level of resilience, or, to be honest, that added complexity, then it's fine to use the alertmanager within Grafana. That's how I think about it. Do you have anything to add to that, Goutham?
>> Just one thing. I fully agree: more components is more complexity, and at a certain point the amount of complexity will also break things, like you need to know how to troubleshoot it. So if you want a higher amount of resilience, and you're willing to put in the effort to understand and maintain the complex system, you can have a nice topology of alertmanagers. One thing we see regularly is a three- or four-node Alertmanager cluster, each node in a different data center. Now, I don't know how to do that with Grafana, but it is possible to do it with Alertmanager. The problem is, when two alertmanagers have issues communicating with each other, they might send you two alerts, because Alertmanager optimizes for sending you two alerts rather than zero alerts; being paged twice is better than not being paged at all. So this will give you more resiliency, but it might also make it a little more brittle and a little harder to troubleshoot. But again, these are the trade-offs that you need to look at: how much resiliency do you need, and how much complexity are you willing to put in for it.
Yeah.
>> Yeah. And I think one point to add here is that if you have only a Prometheus data source, then obviously you can decide what you want to do, but if you have other data sources, like MySQL or maybe Postgres or something else, and you are using those data sources in Grafana, then in that case you can use Grafana alerting itself, because then your data source is supported for alerting and you can define the rules for when to generate an alert and when to notify you. So that's the one case where you feel like, okay, now I have it in a centralized place, if you have a more complex system or more services to check and balance.
>> Yeah. And if there's one thing I could change about the Alertmanager project, it would probably be the name, to something like alert grouper or something like that. Alertmanager gets so many people confused; they think it's the alert rule evaluation system, you know. Yeah, it might be a little bit too late for that, but that would be my top feature request.
>> Carl, that will break a lot of things again. You don't want it.
Yeah,
>> Nice.
Um, so we still have about nine minutes left. If we can do a demo; I think we have planned a small demo. So, Goutham, if you may share your screen for that home lab part, which sounds very interesting. It might be even more interesting to actually look at it.
>> Yeah, I'm happy to. So essentially, I became a PM, and one of the reasons I became a PM was because I was an engineer building all these tools for other engineers to use, but I was also using my own tools: I was on call, and I was troubleshooting my issues with the tools I built. And I'm like, ah, I know what works, I know what we need to build, I know what doesn't work, I think I'll be a great PM. And I became a PM, but then immediately I stopped being on call for my tools and I stopped using them in real production scenarios. So I decided to put together all the different random Raspberry Pis I had lying around, hook them up into a home lab, run a few services on it, and just use the products that we built. That was the intention that started this home lab, but it became so much fun that I completely forgot about that. And now, in my home lab, I kind of went a little bit crazy. Let me show you.
So there should be... Okay, I didn't start that dashboard, but if I look at...
>> Can you please zoom in a little bit?
>> Yes, happy to. Happy to. Yeah. Is it better?
>> Yeah.
>> I'm going to look at the Linux nodes. If you look at my fleet of Linux nodes, these are a lot of the different Linux nodes that I'm running. Essentially you can see there's a cluster of ODROIDs from my office mate Ben, there's a new Pi, there's an old NUC, all these random bits and bobs that I've put together, and I now have 13 nodes running in the system, running a lot of different services, but also helping me experiment with a lot of different things. One of the things I also do is run Prometheus on a RISC-V board; there should be a RISC-V system somewhere. StarFive is the RISC-V system, and yeah, it's been up for seven months, which I'm actually surprised by.
But I have Prometheus running on RISC-V that's monitoring a lot of different components, including these metrics that you see here. One of my favorite dashboards, though, is this AirGradient dashboard, which shows the air quality in my office. The Prometheus is running locally and it's connecting to the office AirGradient here, and this is actually quite interesting in winter. In winter, essentially, I have to close all my windows, and then I breathe out all this CO2 and the CO2 concentration keeps rising, so I get paged around 800 ppm to open the window. Now I've silenced all my notifications, but I'm pretty sure I got paged here, and usually then I'm forced to open a window and take a walk and things like that. And what is really cool is this AirGradient: it's an open-source air quality monitor. You install it, it comes with a Prometheus endpoint out of the box, there's no configuration, it's the default, and you just point your Prometheus at it and it can collect all of this data.
The other interesting project that I did was actually inspired by a different project by Ed Welch, where he monitored his Nissan Leaf in Loki, and that kind of inspired me to do this other thing. So I have a Prusa 3D printer that is printing things. Now, I'm not a 3D printing expert; I don't know enough about 3D printing, but it looked really cool, and I've done a little bit of 3D printing in school, so I was like, you know what, I'll go buy a 3D printer. And then the next thing I see, as one of the accessories, is this flame extinguisher: you put it on top of the printer, and if the printer catches fire, it bursts and extinguishes the fire. And that scared me. I was like, whoa, wait, what? 3D printers catch fire? However, this used to happen a lot in the past; nowadays 3D printers don't catch fire, but I still couldn't get myself to leave the printer on and leave the office while it's running unmonitored. So to solve that problem, I came up with a very interesting hack, I would say.
I have a camera hooked up to the printer, and what it does is take a picture every 10 seconds and print it out to stdout. So if you look at it, it's called prile GTM, and I'm just going to select for data, and you can see these are just images, an image printed out as base64, literally; that's what it is. I take a picture, I print it out to stdout, and then I'm sending it to Loki in Grafana Cloud, and I'm using a panel called Base64 to image from Volkov Labs to generate a live stream of my 3D printer.
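The pipeline he describes could be sketched roughly like this; the path and interval are assumptions, not the actual script. A small loop base64-encodes the latest camera frame and prints it to stdout, where a log shipper such as Alloy or Promtail would pick it up and forward it to Loki for the base64-to-image panel to render.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Assumed path where the webcam software drops the latest still frame.
	const snapshotPath = "/tmp/printer-cam/latest.jpg"

	// Every 10 seconds, read the current frame, base64-encode it, and write
	// one line to stdout. A log shipper (Alloy/Promtail) tails stdout and
	// sends each line to Loki, where a base64-to-image panel renders it.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		img, err := os.ReadFile(snapshotPath)
		if err != nil {
			log.Printf("reading snapshot: %v", err)
			continue
		}
		fmt.Println(base64.StdEncoding.EncodeToString(img))
	}
}
```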
So I'm essentially using Loki, the log storage, to store and live-stream my 3D printing. And now I have a program that reads all this data back and creates a nice time-lapse video out of it. This is the other cool, fun project I've been doing on the side, and it kind of blows my mind, the flexibility that Grafana has: shove base64 in and now you have images on a dashboard. I love that. Carl, do you think we can maintain this panel, the base64-to-image panel? Because I have a lot of feature requests.
You know, there are some new stars and I don't know the status of it, but it doesn't seem like a massive undertaking. It depends on your feature requests, but...
>> Because, you know, my screen is usually this wide, and I can't get the panel to be the right dimensions no matter what; I know the dimensions of the picture, but I just can't get the panel into the right ratio. So yeah, I have more feature requests, but yeah.
>> There's also a video panel in Grafana that you can use if you want to live stream a little bit faster, I guess. I mean, I certainly love this solution for its workaround-ness. But, um, yeah.
>> Yeah. Um
>> And I'm really amazed. So you're taking this image as a screenshot, and basically you're sending it to Loki, right? And then using the base64 panel to create the image from the encoded data?
>> Exactly. So all I'm doing is querying Loki, filtering down to just the base64 strings, and then automatically this thing came up, and I'm like, oh my god, Grafana is awesome. Yeah.
>> Prometheus labels as well.
What?
>> I think you can use Prometheus labels as well. I don't know how big they can be.
>> Uh, they can be...
>> The size of the image.
>> Yeah, they can be very, very big. And honestly, I'm taking an image every 10 seconds and writing like 20 kbps of data to Loki, and I've done the math: you can have 70 printers printing non-stop and still fit into the free plan of Grafana Cloud. So if you want a free video streaming solution, you can use Grafana Cloud. The Loki team will hate me, but you can.
>> Nice. And I must say, sorry, Carl, may I go ahead, or you?
>> Oh, yeah. Go ahead. It's fine.
>> Yeah. I just want to say, the other fascinating project is what you did with the air quality monitor. It reminds me that at GrafanaCON this year we had a science fair booth, and we were using this sensor data to capture the air quality, and since it was jam-packed, they did it live, I think, and they opened the windows so fresh air got into the event. So it was a very good use, like, hey, we are actually using it for real, for the whole event. So using this in your home is actually very useful, because in winter, yes, I don't blame you, I have the same thing: close all the windows and CO2 levels can increase.
>> And it increases insanely fast. So I would recommend getting an air quality monitor. Yeah.
>> Yeah. Yeah.
>> Cool. That's all that I had.
>> Okay. Okay, I think we are also on time. I don't think we have any questions; let me just check. Ah, we just have a comment from a user saying thank you: it's been a really core part of Grafana and there's still a lot of development going on. Yeah, I think we can wrap this up. So thank you again, thank you Goutham for joining us, and Carl for being such an amazing admin and host today, because it was the first time Carl had all the superpowers for the show; we may keep it that way for longer, who knows. But yeah, thank you everyone for joining this amazing session of Grafana Campfire, and especially for asking some good questions around Prometheus and different use cases. We learned a lot about pull and push. Check out the video recording, and check out the links as well; I will share them in the video description too. Till then, take care and goodbye.
>> Thank you. Bye.
#prometheus #grafana #opensource

Hosts: Syed Usman Ahmad
Experts: Goutham Veeramachaneni, Prometheus project maintainer and contributor, and Carl Bergquist, who has been actively using Prometheus and Grafana core since its release and continues to contribute to it, will be among us.

Summary: In the October edition of the Grafana Campfire Community Call, host Usman led a discussion about the recent Prometheus conference (PromCon) held in Munich, featuring guests Carl Bergquist and Goutham Veeramachaneni, both experts in Prometheus and Grafana. They shared insights from PromCon, highlighting the friendly atmosphere and the importance of community engagement. Key topics included the stable release of native histograms in Prometheus, improvements to Grafana's ad hoc filtering feature, and the ongoing integration with OpenTelemetry. The conversation also touched on the challenges faced by Alertmanager and the future of Prometheus with potential upgrades to its time series database (TSDB). Goutham demonstrated a personal home lab setup involving monitoring air quality and 3D printing, showcasing the practical applications of Grafana and Prometheus in real-world scenarios. Overall, the call provided a comprehensive overview of advancements in the Prometheus community and the collaborative efforts to enhance user experiences.

Timestamps (key moments from the livestream):
00:00:00 Introductions and overview of the call
00:02:30 Introduction of guests Carl Bergquist and Goutham
00:06:00 Discussion about PromCon and its significance
00:08:30 Highlights from PromCon, including the atmosphere and community involvement
00:11:00 Overview of new features in Prometheus 3.0, including native histograms
00:15:00 Introduction of the new ad hoc filtering feature in Grafana
00:21:00 Discussion about OpenTelemetry and its integration with Prometheus
00:28:00 Explanation of the differences between pull and push monitoring systems
00:35:00 Introduction to Goutham's home lab project and live demo
00:45:00 Wrap-up and final thoughts on alert management and Grafana's alerting system

Join the Meetup Group: https://www.meetup.com/grafana-friends-virtual-meetup-group/events/311564547/
Join the Grafana Official Community: https://community.grafana.com/

We look forward to seeing you 🙂