Perfect, we are live again. So hello everyone, welcome to the Grafana Campfire community call, October edition. I'm your host Usman, and today we've got something very interesting which happened just a few days ago in Germany, in Munich. The conference was about Prometheus: PromCon, which is really one of the unique conferences where mostly the Prometheus maintainers and developers come together and talk about what's new, what's exciting, and what they're working on. In this call today we're going to discuss what's going on with Prometheus and the community, how Grafana Labs is working with Prometheus and its integration to make the user experience better, and we might also see some cool demos; we have something planned for you today. I will go with the introductions first. My name is Usman, I work at Grafana Labs as a staff developer advocate, and I have been working closely with the open source community. Right now I'm working mostly on documentation and learning journeys, which is fun. Documentation is something we need, I think community users love our documentation, and it's an integral part. Along with me are two very, I would say, specialist and expert guests today. Carl, I think some of you know him already or have seen him, and also Goutham, but let us do their introductions more properly. So I hand over first to you, Carl.
>> Thank you, Usman. Yes, my name is Carl. I'm one of the OGs at Grafana Labs; I've been here for ages, and I recently had my 10-year celebration. I've spent all of my time on the Grafana project. We do so many things at Grafana Labs these days, but I've always been around the Grafana project, and I first got in contact with Prometheus in 2016, when they had their first Prometheus conference in Berlin. That's when I met the OG Prometheus team, and I really like that group of people; it's a very monitoring-first, pragmatic, friendly group. So I've been sticking around that project for quite some time now, not engaged in the Prometheus project itself, but making sure that Grafana works well with Prometheus. And that's how I met Goutham in Munich, I think in 2017.
>> Yeah. Yes.
>> Yeah.
Yeah.
>> Yeah. Hey everyone, I'm Goutham. I'm a Prometheus maintainer, but I haven't been writing a lot of code for Prometheus recently. I've been involved with the Prometheus community since 2016, when I started using Prometheus in the summer of 2016. In 2017 I started contributing to Prometheus as part of an internship, and yeah, eight years later I'm still around, and I recently helped organize PromCon. For my day job I work at Grafana; I've been at Grafana seven years. Well, I thought that was a long time, and then Carl said 10 years. I've worn a lot of different roles and hats: I was doing a lot of engineering around Prometheus, the time series database, and hosted metrics, but I've also worked with the OpenTelemetry community, put on my PM hat for a couple of years, and now I'm back as an engineer. It's been three months since I came back as an engineer, and oh my god, I love it.
>> So that is the secret, that's why you're not writing code these days.
Yeah,
>> Cool. Yeah, I must say one thing about Carl. I actually met Carl for the first time face to face at GrafanaCON, and I still remember we had this roundtable panel discussion, and there were a lot of questions from every direction, and Carl was there and he was such an expert: I know this, I can answer this. And we were like, okay, we should have people like Carl, because when you work that much you know the concepts and the background of why this is working and what we are looking at. So I think both of these guests definitely have expertise in Prometheus but also in Grafana. Let's kick off the discussion. Before we discuss what Prometheus is or what's going on, let's talk about PromCon. Maybe Goutham, or Carl, or maybe Goutham, because I remember you are also an organizer, so you can talk a little bit about PromCon: how was it, what did you do, what did you find interesting, and so on.
>> Yeah. PromCon is the annual Prometheus conference for Prometheus users and maintainers to come together and share what we've been doing, what all the new developments are, and how people are using Prometheus in new and exciting ways. It's an amazing conference; people have been consistently coming to it again and again, and it's actually a small and cozy conference. It's organized by the maintainers themselves, we try to keep the ticket prices low and accessible, and in general the vibe is very chill. It's actually a very special conference for me personally, because I gave my first conference talk in 2017 at the same venue, on the same stage. I was just looking at a YouTube video of that talk and I'm like, wow, I had a lot more hair back then and I looked like a baby. But yeah, it's been happening, I think, since 2015, so it's the 10th PromCon this year, and I think that is very special, yeah.
>> And I think, sorry...
>> Yeah, something I just want to add: I do agree that it is special, and it's special because there are so many experts there, but the atmosphere is also very, very friendly. You can be on stage and be a beginner, and that's fine; we all want to hear from everyone at this conference, and that's what makes it very unique and why I want to keep going there. I would highly suggest it for someone new in this space, or if you've been in this space for a long time. This is one of the best conferences in the monitoring space, except GrafanaCON obviously.
Yeah. No, I totally agree. I was also there at PromCon on day two, and it was pretty exciting to see so many people who are working in the community, and very friendly. And I think, Goutham, correct me if I'm wrong, the Prometheus conference never really took a break or a pause; even back in the COVID days you guys organized virtual conferences, so it has always remained alive. Is that correct?
I think so. I don't know if we did it in both the COVID years, but I'm pretty sure we did one virtual conference. Let me quickly check 2020, 2021... Yeah, it has always happened, from 2016 to 2025. In the footer of the website you can see the links for each of those. So it happened each year.
>> Yeah. And this shows the commitment from the community: they want it, and the organizers say, let's do it. This is where you see the passion from the community; they want to use it, they want to build it and improve it. This is amazing to see, because not so many conferences have a journey where there's a break or a pause and then they continue again. So yeah, that's pretty cool about PromCon. And let's talk about what's new in the Prometheus community: what were the highlights of the Prometheus conference? Over to you guys.
>> Yeah, I mean honestly we didn't have any big announcements, nothing like a Prometheus 4.0, but we launched Prometheus 3.0 a year ago, I think September or October, so roughly a year ago. That was a big release, and we've basically been continuously improving upon it; we kicked off a lot of features and have continued improving them. One thing I want to add is that native histograms are now stable. If you're unfamiliar with Prometheus histograms, they help you understand the distribution of data. They help you know, okay, there were 10,000 requests that took between nine and 10 seconds, there were 5,000 requests that took between one and two seconds, and there were 100,000 requests that took less than 200 milliseconds, so you can understand the distribution of the data. But before native histograms, you had to pick the buckets: you had to say, okay, I'm actually interested in buckets at 0, 200, and 500 milliseconds, a second, and so on. And you pick buckets, and then maybe you do a new release and suddenly these buckets are not relevant anymore; maybe it got faster, maybe things got slower, and this used to cause some problems. Also, when you have a wide distribution, you needed a lot of buckets, and that used to cause other performance problems. With native histograms we fixed a lot of this. You just roughly define what resolution you want, and we automatically scale the buckets up and down based on the distribution of the data. You can have really high resolution, and you can actually render videos as heat maps, as Björn has done. Native histograms solve a lot of the problems that the previous classic histograms used to have. We've been working on this, actively, for two and a half years I think, or more, and they're now finally stable. So we're not going to break any of the APIs, and you can go ahead and start using them.
I think it's super nice. It sounds a little more sophisticated and hard to learn, but it's actually the opposite: classic histograms are harder to operate well and keep working with. So this is a big step up in usability for all of the Prometheus users.
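To make the difference concrete, here is a minimal sketch using the Go client library (client_golang), which has had native histogram support for a while; the metric names, bucket boundaries, and resolution factor below are made up for illustration, and on the server side native histograms historically sat behind a feature flag before becoming stable.

```go
package main

import (
	"net/http"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Classic histogram: the bucket boundaries are picked up front. If latencies
// shift after a release, these buckets may no longer match the distribution.
var classicLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "demo_request_duration_seconds_classic",
	Help:    "Request latency with hand-picked buckets.",
	Buckets: []float64{0.2, 0.5, 1, 2, 5, 10},
})

// Native histogram: instead of fixed buckets, declare a resolution (the growth
// factor between buckets); the buckets adapt to the observed distribution.
var nativeLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:                           "demo_request_duration_seconds",
	Help:                           "Request latency as a native histogram.",
	NativeHistogramBucketFactor:    1.1, // roughly 10% relative bucket width
	NativeHistogramMaxBucketNumber: 100,
})

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Observe the same (illustrative) value in both histograms.
		classicLatency.Observe(0.42)
		nativeLatency.Observe(0.42)
		w.Write([]byte("ok"))
	})
	// Expose both metrics on /metrics for Prometheus to scrape.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":8080", nil)
}
```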
I have something else I want to share that we did in Grafana related to Prometheus, and we announced it at PromCon as well. Let's see if I can add this to the stage. So we added, or rather massively improved I would say, the ad hoc filter in Grafana, the ad hoc filter in Grafana with Prometheus.
>> Sorry, Carl, sorry, I'm really interrupting. Can you just zoom in? It looks nice, but if you can.
>> Yeah.
>> Perfect. And please continue.
>> So the ad hoc filter allows you to drill down into any dimension, rather than having to predefine what you want to filter by. With classic template variables in Grafana, you have to decide up front if you want to filter by cluster, alert state, and so on in your query, and then we do string interpolation at query time. Making all of these decisions up front means it's harder for people to share dashboards, and you might also hit edge cases where you didn't think about how you want to filter these metrics up front, and then you have to jump out to Explore or other pages of Grafana. With the new filtering feature in Grafana for Prometheus, we load the labels you can filter by dynamically. When you click on this, we figure out what labels actually exist for the time series in the dashboard, and we only show those as options here, instead of the thousands of labels that were suggested earlier. And when you pick one of these, by the way, all of this works really well with the keyboard as well, so I'm just going to use the keyboard. When you pick one of these, your options are also limited to the values that really exist in the Prometheus data source. In the previous version of ad hoc variables, it pulled all of the different options instead of just the ones that are really relevant for you.
And when you keep drilling down into the time series like this, you also get fewer and fewer options. So this is a way of working more dynamically with dashboards and being able to share dashboards between organizations, because of course if we go into the panel itself and look at the query, it doesn't contain any of the labels that we used to filter; these filters that we apply are rewritten at runtime, and that is the power here: we disconnect the dashboard definition from what you're actually seeing and experiencing right now in Grafana. If you go to the query inspector, which is a great tool for understanding what's actually going on, you'll see that these filters are inserted before the query is sent to Prometheus. So I think this is a big shift for Grafana and Prometheus, because we're now going to be able to share dashboards more easily and with less friction. It's going to be a long journey to backfill a lot of dashboards, but now we have the tools to do that, and I'm very bullish on it, even if it's going to be quite a journey to get there.
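For a rough intuition of what "rewritten at runtime" can mean, here is a small sketch that injects an extra label matcher into a PromQL expression using the Prometheus parser package; this only illustrates the idea and is not Grafana's actual implementation, and the query and label values are made up.

```go
package main

import (
	"fmt"

	"github.com/prometheus/prometheus/model/labels"
	"github.com/prometheus/prometheus/promql/parser"
)

// addFilter parses a PromQL query and appends an extra label matcher
// (e.g. cluster="eu-west-1") to every vector selector in the expression.
func addFilter(query, name, value string) (string, error) {
	expr, err := parser.ParseExpr(query)
	if err != nil {
		return "", err
	}
	matcher := labels.MustNewMatcher(labels.MatchEqual, name, value)
	parser.Inspect(expr, func(node parser.Node, _ []parser.Node) error {
		if vs, ok := node.(*parser.VectorSelector); ok {
			vs.LabelMatchers = append(vs.LabelMatchers, matcher)
		}
		return nil
	})
	return expr.String(), nil
}

func main() {
	// The dashboard query stays generic; the filter is applied on the fly.
	q, err := addFilter(`sum(rate(http_requests_total[5m])) by (handler)`, "cluster", "eu-west-1")
	if err != nil {
		panic(err)
	}
	fmt.Println(q)
	// Prints something like:
	// sum by (handler) (rate(http_requests_total{cluster="eu-west-1"}[5m]))
}
```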
>> And Carl, correct me if I'm wrong, is this feature available already in the latest version? Ah, cool.
>> It's been behind a feature flag for over a year now, but now we've turned it on by default.
>> Perfect.
>> Thanks. Thanks for sharing this. This is really useful, because I remember that filtering out those labels could take time, and making it as easy as this really helps users find what they're looking for. And I really love that you're just using the keyboard; that's the most important part, especially for users who are like, okay, Linux or Mac, that's all we need.
>> Yeah no kickoffs here.
>> Cool. So yeah, thanks for sharing these new features and what's already there. I think the next topic is something about OpenTelemetry.
Yes. Yeah. Basically, we've made a lot of progress, and we also spoke a lot about OpenTelemetry at the conference. After the conference, typically all the developers, maintainers, and people active in the community get together for a developer summit. In person, we basically spend 9 to 5 just discussing different topics, trying to build consensus, hammering out differences, and figuring out the roadmap. And a lot of the discussion has focused around OpenTelemetry over the past year or so. For those of you who are not aware, OpenTelemetry is a new ecosystem of instrumentation libraries and a protocol that helps you instrument your applications. The idea is that there are these mature open-source instrumentation agents and libraries, you instrument your application once, and then you can send the same telemetry to any vendor you want. So if you are on APM vendor one, you don't need to actually change anything in the application; you just change the config of the router or collector, which is managed centrally. You change the config in one place and you're able to fork the writes to another APM or observability solution, and you can compare and then switch easily. You don't need to go and re-instrument thousands of applications, or ask your developers to do that. OTel is gaining a lot of popularity for a very good reason; this resonates, and they built something really cool. Now, Prometheus already has its own ecosystem of instrumentation and collection, and we're seeing OpenTelemetry also gaining adoption, so we're trying to understand and become better at supporting these OpenTelemetry users.
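As a sketch of the "instrument once, change only config" idea, here is a minimal Go program using the OpenTelemetry SDK with the OTLP exporter; the exporter's destination is normally taken from configuration such as the OTEL_EXPORTER_OTLP_ENDPOINT environment variable or a centrally managed collector, so switching or forking backends does not require touching this code. The metric name is made up.

```go
package main

import (
	"context"
	"log"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlpmetric/otlpmetricgrpc"
	sdkmetric "go.opentelemetry.io/otel/sdk/metric"
)

func main() {
	ctx := context.Background()

	// The exporter reads its target from config/env (e.g. OTEL_EXPORTER_OTLP_ENDPOINT),
	// so switching vendors or forking data is a collector/config change, not a code change.
	exporter, err := otlpmetricgrpc.New(ctx)
	if err != nil {
		log.Fatal(err)
	}

	provider := sdkmetric.NewMeterProvider(
		sdkmetric.WithReader(sdkmetric.NewPeriodicReader(exporter)),
	)
	defer provider.Shutdown(ctx)
	otel.SetMeterProvider(provider)

	// Instrument once; the same counter can end up in Prometheus, an APM vendor,
	// or both, depending only on where the collector forwards it.
	meter := otel.Meter("example-app")
	requests, err := meter.Int64Counter("app.requests.total")
	if err != nil {
		log.Fatal(err)
	}
	requests.Add(ctx, 1)
}
```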
But as we started doing this, as we started implementing features in Prometheus that make sense for OpenTelemetry, we also had a bit of a conversation: Prometheus made a few choices because we believed they were better, and now with OpenTelemetry we feel like we're walking them back a little. For example, let me quickly share my screen. Yeah, if you can see my screen here: basically, Prometheus is a pull-based monitoring system. Essentially what happens is your applications are instrumented with telemetry, but Prometheus sends a request to fetch those metrics, and every 60 seconds or 15 seconds Prometheus continuously collects all of this telemetry and stores it. Typical monitoring systems before Prometheus were push-based: the applications pushed all their metrics to an endpoint, either ad hoc or on a regular interval. Prometheus was pull-based because we fundamentally believe it's a better way to monitor systems. There are a few advantages; let me quickly also pull up the FAQ doc for Prometheus that explains this.
We have one of these popular FAQ items that asks: would you rather pull or push? And we believe pull is better. Basically, one thing is that your application is always exposing these metrics. So if you want to understand what's happening with your application, you can start curling the endpoint, or run Prometheus on your local machine and collect that telemetry, to understand what's happening with your application. You don't need to depend on a central monitoring system or set up something complex to do this; you can just curl, and I've done this many times to understand how many requests happened, or how many times a certain event happened, or things like that. But I think what is more important and interesting is that pull helps you easily understand whether your target is up or down.
So for example, let's say you're running a simple API server, and there are 10 replicas of this API server serving requests. With push, it's really hard to figure out that one of them is down, or one of them is not starting up, or one of them is having issues. The reason is that if it's having issues, if it's crashing, if it's running out of memory and unable to push, you basically don't know whether the system should have nine replicas or 10 replicas. It could be autoscaling, right? Maybe one of the replicas went down because somebody scaled it down. With a push system, you don't know if the 10th replica is down for a good reason, because it shouldn't be running, or because it's having issues. With pull, we already know there are 10 replicas running: either you configure manually that these are the 10 replicas, or you look at the runtime, like Kubernetes, to understand how many replicas should exist. We know there are 10 replicas, okay, here are their IPs, let me get the metrics from them. Oh no, this 10th replica is crashing and I can't collect the metrics. Prometheus can then easily say, okay, this target is down, it's having issues. It generates a metric called up and sets the value to zero, and then you can easily write an alert that says: up equals zero, send me an alert. So you know that something is down. This kind of health check is really hard to do with a push-based system. And this health check has saved us, at least me being on call, many a time: when we do a new release and the new release is crashing, within a few minutes we get an alert that the new release is having issues. This is hard to replicate with push-based.
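In practice that health check is usually an alerting rule on the synthetic `up` metric (expr: up == 0). As a loose illustration, the sketch below runs the same query against a Prometheus server with the official Go API client; the address is a made-up local example.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	v1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Hypothetical local Prometheus; in a real setup this would be an
	// alerting rule (expr: up == 0) rather than an ad hoc query.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := v1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Every scraped target gets a synthetic `up` series: 1 if the scrape
	// succeeded, 0 if Prometheus could not reach it.
	result, warnings, err := promAPI.Query(ctx, `up == 0`, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}
	// Anything returned here is a target that should be running but is not
	// answering scrapes; exactly what you would page on.
	fmt.Println("down targets:", result)
}
```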
>> I'd like to add here as well,
>> Yes, yeah.
>> For me, it really exploded at the same time as Kubernetes, because Kubernetes made it easier for everyone to dynamically schedule workloads, with pods going up and down constantly. With a pull-based system it is easier to catch problems there; if you have a push-based system and Kubernetes, you need to do the extra dance afterwards of calculating whether this should be running or not and comparing that to service discovery. I think this was one of the killer features of Prometheus, to be honest: it aligned with Kubernetes and the rise of containerization at the same time.
Yes, we also used to get a lot of questions early on, like why is this pull-based, because there was no other pull-based system, and we even wrote a blog post in 2016; a lot of the concerns were like, oh, with pull maybe it doesn't scale.
>> And we have this really, really good blog post. One of the interesting things about that blog post is also accidentally DDoSing your monitoring: you could have a bad deploy that just floods your monitoring system with pushes and can take it down. But with pull we know, okay, if this is sending too many requests, or exposing too many metrics, let's just not scrape it. We had a talk at PromCon itself about collecting data with the OTel Collector, and it's really hard to understand whether you're seeing less data because the OTel Collector is having issues due to a DDoS, or because the applications are actually sending less data; that's another thing that pull solves. Now, by 2018, 2019, 2020, as we grew, we didn't get this question anymore; people started understanding that pull might actually scale, pull might actually be better. But now, with OpenTelemetry, people are going back to push, and we were like, okay, what do we do? We thought pull was better, and now we have to support a push-based system. We had this interesting discussion, and I think the conclusion was: we believe pull is better, but we're not going to be on a high horse and say you should always do pull. There are good use cases for push, and there are good reasons people choose push, so we want to support both, push with OpenTelemetry and pull with the Prometheus clients, so that people can make the trade-offs and choices that they want. We also wanted to highlight that we believe pull is better and here are some use cases where it might be better, without saying push is bad or wrong.
Oh, we also have a comment from Basti, who says Nagios was also pull. I was a student in 2017, and Prometheus was my first monitoring system, so I did not know enough about Nagios. So yeah, okay, thank you, Basti. Cool.
>> Cool. I think, Goutham, you actually gave us a complete 101 on push versus pull. This is actually a very common question asked in our community forum, like what method should we use, and understanding it now, and sharing this link, I think will help. I do have a question; I'm not sure if I'm in the right place when it comes to OTel and Prometheus. I have been using Prometheus for a while and I'm pretty comfortable using it, I know how it works, but maybe you can share some guidance: when should someone really use Prometheus, and when should someone say, okay, we should try the OTel approach?
>> That's a really hard question to answer. If you're looking for efficiency and simplicity in instrumenting your applications and adding custom telemetry, I think the Prometheus SDKs and ecosystem are very good at that. However, with OTel you also get really good auto-instrumentation, APM-style agents out of the box, so if you're running Java you can just drop in an agent and you basically instrument the whole app. The other big advantage of OTel is that it also does traces and logs, while Prometheus is focused on metrics. So if you are running a high-scale system and you're really interested in highly efficient and simple-to-use metrics, I would use Prometheus. Prometheus gets you a long way in terms of understanding your system, observing it, monitoring it, and things like that. But if you need traces, if you need some of this auto-instrumentation and ease of use, like auto-magic instrumentation, the OTel ecosystem stands out. So you need to look at your requirements, your apps, your runtimes, and make those choices.
>> Understood. Yeah, that makes sense, because if you are a developer and you want quick access to see what's going on in your application, then OTel might be the best choice, because you don't have to deep dive into setting up all the configuration details. But Prometheus's focus on metrics can give you more insight into what your application performance looks like, where the bottlenecks are, and so on. Thanks. Thanks for sharing this.
Yes, I just want to add: no matter whether you pick the Prometheus client libraries or the OTel client libraries, you can still store all your metrics, even if they're from OpenTelemetry, in Prometheus, and you can have a great experience.
>> Nice.
Nice. And yeah, let's continue. I'm sorry, I also forgot where...
>> I mean, one of the things I wanted to also highlight is that I've been working on this OpenTelemetry-Prometheus compatibility problem for over two and a half years. That was when I had just started putting my PM hat on, and I kind of collected all the problems of storing and using these OpenTelemetry metrics in Prometheus and helped kickstart a Prometheus-OTel working group, just to make sure that people instrumenting with OTel have a great experience with Prometheus. So yeah. Oops, the long link. One second. Where did I put this link now? Huh? Sorry.
>> So, yeah.
>> One second. Yeah.
>> Okay.
>> Yes, found it. Sorry, multiple browser windows and I'm struggling. So, we published a blog post about 18 months ago called Our commitment to OpenTelemetry, where we said we wanted to be the default store for OpenTelemetry metrics, and this is basically a lot of the work we've been doing over the past 18 months to close a lot of the gaps and usability issues with OpenTelemetry metrics in Prometheus. We've done a lot of this. We had two amazing talks on this work at PromCon: one focused on delta temporality, by Fiona, and one focused in general on all the improvements we've made, by Owen and Arbor. The recordings should be up soon. So yeah, that's essentially been 18 months, two years' worth of work trying to bring the ecosystems together, and I'm quite happy with the current state of things.
Nice. I think we got a question from one of our users, and the question is about how Grafana Alloy fits in the mix; it can also scrape an exporter on, say, a host and remote-write to Prometheus.
>> Yeah, that is an excellent question. Essentially, if you look at Alloy, it was the Grafana Agent before and then it became Alloy; this is a natural evolution. Prometheus has a really solid ecosystem for infrastructure monitoring in terms of exporters; you're able to monitor all kinds of systems. There are, I don't know, 7,000 different things listed on one page of all the things you can monitor, including sometimes your TP-Link routers, for example, and their stats, or even your CS:GO data, as somebody just showed, or 3D printer data and stuff like that. That's on the fun side, but even at work we have a really solid ecosystem for infrastructure monitoring: the node exporter and the MySQL exporter help you understand all the different infra components running in your systems. OpenTelemetry, on the other hand, focused on application monitoring first, so they have a really solid hold on the application monitoring side of things. With Alloy we tried to bring the best of both, the OTel support and the mature infrastructure ecosystem of Prometheus, together into one binary. And now, coming to the question of how Alloy fits into push versus pull: as you said, Alloy scrapes, so it pulls metrics and then pushes them to Prometheus. The cool thing about this is that you can push to a central location and still get all the benefits of pull. Alloy also knows that there need to be 10 pods or 10 replicas, it scrapes those, it can see that something is down, set up to zero, and then push to a central location. So you still get your pull-based simplicity and health checks, while also having this central push-based mechanism. So I think Alloy also highlights how this can work at scale.
>> And I think that's a really good middle ground, where you use Alloy to scrape, to get the up metric, and then have one centralized place for all of the metrics. Because one of the downsides of Prometheus, when you get to a certain size, is that you're going to need to run many, many Prometheus servers, and then you need to figure out how to connect all of them. There are different solutions for this: you can use Thanos, you can use Mimir. But I think the end-user experience of having one data source, one canonical place to go look, is important when you scale up your organization, because the more engineers you have, the less they're going to be interested in figuring out which Prometheus server they should go to. So to me, that is definitely the sweet spot: having the agent know whether things should be running or not, and one place to look.
>> Yeah, nice. Thanks, thanks for asking this good question.
Perfect. We can move along; I think we still have some more topics to discuss. Goutham, are we still going to discuss some more of the OpenTelemetry developments from the last...
>> I want to end OpenTelemetry on an interesting note; this was also one of the interesting conversations that we had. OpenTelemetry is traces-first, like it started its roots in traces, and it helps unify traces, metrics, and logs together into kind of one solution, and we said we wanted to be the default OpenTelemetry metrics backend. Now the question was: does a metrics-only backend even align, or vibe, with the OpenTelemetry users? Essentially, by choosing to do just metrics, we've optimized the hell out of this use case; Prometheus is an extremely optimized, extremely efficient metrics store, but there are still people who have enough telemetry, including us at Grafana, to make Prometheus struggle. There's so much telemetry out there, good telemetry, bad telemetry, but telemetry, and with OTel you actually have more metrics, not less, because it provides a lot out of the box. So yeah, you can build a general-purpose store, but it's going to struggle for the metrics use case, at least it's going to struggle more than Prometheus will. And by being a great, extremely efficient metrics backend, you can easily hook up to your log store or trace store and help build good workflows. I think Prometheus can still resonate and stay relevant for OpenTelemetry users, and that is our focus for the next year: understand what the workflows are that people are looking for in a metrics backend and focus, from a UX perspective, on making this happen. Yeah, that's basically everything for OTel.
>> Yeah. No, I think, actually, I had this question in the back of my mind, because when I started my journey as an OpenTelemetry user, I had a Java application, and if I use Prometheus I know what I'm getting, like, okay, there are metrics. But with OTel it was very easy, I could get logs, metrics, and traces, but there was also a lot of telemetry data where I was confused about where it was coming from, or, like, I don't need this right now, it's good for maybe further investigation but doesn't make sense right now. So it's a very interesting point: you have to find the balance and see, okay, do we need all of this telemetry data, or should we focus on the core or essential parts for now?
>> Yeah there is a lot of bad telemetry out
there.
Yeah, and I think at PromCon there was also a lightning talk about how you can remove some of your telemetry or metrics that you do not need, which was also very interesting; if someone knows, oh, these are all the bad metrics which we do not need and they just make CPU and memory usage higher, then just remove them.
Okay. Moving on. So, Carl, do you want to discuss something about what's new in the TSDB, the time series database?
>> I'm actually going to let Goutham take this one.
>> Oh, God. I don't have the technical depth there.
>> Yeah. I mean, honestly, this also is because of OpenTelemetry, as you will see soon, for a little bit. So, Prometheus is a single-node system: you basically log into a system, dump the Prometheus binary on it, run ./prometheus, and Prometheus runs on that node. You can give it more memory, more CPU, and more disk, and scale Prometheus vertically. But Prometheus is not something where you can say, I'm going to have 10 nodes, 10 replicas, and Prometheus is going to automatically rebalance and figure things out between itself. The reason for this is that Prometheus is more than a monitoring system, it's an alerting system, and we need alerting to be as reliable as possible. So Prometheus made this choice: rather than build a distributed system, let's build a very reliable, single-node, extremely efficient, alerting-focused system. That's one of the reasons why Prometheus never chose to become a distributed system; the moment you become a distributed system, you get a lot of problems. With that said, people basically used to run a Prometheus per namespace or per cluster, and then there was this big ecosystem of projects around it, like Cortex, Thanos, Victoria Metrics, and Mimir from Grafana, that let you send all this data from individual Prometheuses to a central system that is actually distributed and scalable. Now you have an extremely reliable alerting system that's sending you all the alerts at the right time even if your remote, network-based distributed system is broken, but you can also query all the metrics together using this ecosystem of projects.
Now, as the data started getting bigger and bigger, these projects are dealing with a lot more scale than a single Prometheus; they're dealing with billions of metrics and billions of active series, and they started running into problems around querying them efficiently. They started storing this in S3 buckets, and the storage that we built for Prometheus assumed SSDs and local disks, so we do a lot of random reads. That's fine for the single-node use case, but it's not fine when you're trying to get a lot of data from S3 or GCS, where it's nicer to read one large chunk rather than thousands of small chunks. So they ran into this problem, and what I really love is that people at four different companies, Shopify, Cloudflare, Grafana, and AWS, came together to solve this common problem for the ecosystem, and they prototyped a Parquet-based system. Parquet is a columnar format, and the new system they've explored and built is very good at optimizing reads if you're storing data in S3.
So now we had this discussion of, hey, does it actually make sense to put this into the single-node Prometheus, and we realized it might actually be slower, because the trade-offs are different, but we might still want to do it. One reason is that for massive-scale Prometheuses it might actually be faster, even if for small Prometheuses it might be slower. But the other thing that resonated is that, because we're using Parquet, it's easy to extend and add new features to the time series database, and that kind of blew my mind. I mean, I worked with Fabian, who created the new TSDB that is being used, and I was one of the first storage maintainers; I maintained it for a year, and we've optimized it heavily, and it works really, really well, and I'm really happy about that. But the problem is we've hand-rolled everything. Adding a new feature is a pain: if you want to add a new feature, you need to change a lot of different things, and people just don't want to deal with this. One of the reasons we want to adopt Parquet, and we're going to experiment with it and see how well it works, is because it's easy to extend and add new features. And a lot of these features are being motivated by OpenTelemetry as well, so I'm actually really, really happy about that. There are also plans to make it a little more native for the single-node use case, where it becomes a lot more efficient. But first we're going to kick off this working group and put Parquet in as an experimental engine in Prometheus, to see how things go and how easy it is to prototype new features and add new things.
>> If this is still an experiment, is there a way for users to test it out? Is there a link, or is it currently an internal project for testing?
It is so experimental that it doesn't exist yet. There was a talk; I'm just going to share my screen again, and I'll also share the link here. There was a talk called Beyond TSDB: unlocking Prometheus with Parquet for modern scale. This talk showed how Cortex, Mimir, and Thanos are using it or exploring it. But now we've decided to put it in Prometheus as an experimental thing, although it's like six months away. So this is a consensus that we've come to in our dev summit, not something that we've already implemented.
Cool.
Nice.
I think there's nothing much more, at least for now, on adding Parquet, so we can move to the next topic, which is very close to me, or close to the community: the Alertmanager and the Prometheus part. I do have some questions as well, but maybe Carl or Goutham, who wants to take the lead on this one, the Alertmanager discussion?
I think it's going to be Goutham again; honestly, he's the expert in this area. What I would like to say here is that I haven't been part of the Prometheus maintainer group, but I feel like Alertmanager always had a struggle finding maintainers who actively work on it, and it seems like that's a shifting trend right now.
>> Yes. Essentially, we had a good run. I mean, Alertmanager always struggled with maintainers, that was true, but we had a good run last year, or like 18 months ago, where we had a couple of maintainers from Grafana helping maintain Alertmanager as well. This was because Grafana's alertmanager is based on the Prometheus Alertmanager, so improving the Prometheus Alertmanager actually helps improve the Grafana alertmanager too. But sadly, priorities shift and people move teams, and Alertmanager kind of lost those super active maintainers; it's been a while since we even had a release of Alertmanager, and this kind of sucks, because Alertmanager is a core component of the Prometheus ecosystem. I just said Prometheus is an alerting system, and without Alertmanager you miss a big part of that. We had this amazing talk by Joel, basically titled Alertmanager has amnesia, should we fix it?, and it's a great talk about some of the issues with Alertmanager and the potential solutions being explored. But what was super interesting for us was that at the end of the talk, people were discussing all the problems in the Q&A session, and after the session ended, people gathered around Joel and had this interesting group therapy session around all the common problems they had with Alertmanager. Now, this is not a great place to be. But as part of this, we also discovered that Cloudflare and Hudson River Trading, two amazingly technical companies, depend on Alertmanager so much that they have internal forks of it. So we brought the maintainers and the people interested in maintaining Alertmanager together, and we are kicking off the Alertmanager working group again with these new community members, to breathe new life into the Alertmanager ecosystem. They're already discussing what they can upstream and how they can collaborate, and there's a lot of good energy there. Again, this is one of the reasons I love open source: this is an open source project, and here are two companies that are like, yeah, we're happy to do this, this is a great project, we are happy to push this forward. I've based my career on Prometheus and open source and a lot of open source stuff, and whenever I see things like this I'm still amazed by how cool open source is. So yeah, for those who have issues with Alertmanager, please be patient; maybe in a few months things will start to get dramatically better. I'm confident of it.
>> Nice. And I have a question, because this is something very close to me, and it also has a history at Grafana Labs. Grafana's alerting has also changed over time; obviously, again, priorities shifted and the team had to work on different areas. But this is a common question in the community: okay, now I have Grafana running with Prometheus as a data source, should I configure my alerts in Grafana, or should I use Prometheus? What is the difference between these two, what are the trade-offs, what are the advantages of using Grafana alerts, and when should you go for Prometheus alerts? If you can maybe explain this.
>> I have an opinion, but I want to hear Carl's opinion first.
Okay. So, if you step back a little bit, they're going to do pretty much the same thing, but depending on your resilience target, you can deploy them a little differently. The Alertmanager's job, and why it's been able to do its job very well without a lot of attention, is that it has a quite simple job: take the alert signals from multiple Prometheus or Grafana instances, decide which instance is the leader in the Alertmanager cluster, and then send notifications to, you know, Grafana, PagerDuty, email, or something like that. So it has a fairly small, simple job and it does it well. And if your Grafana pods or Prometheus servers are starting to have problems, that's fine, they can fail and the Alertmanager should still be running. If that is the level of resilience you're aiming for, then running it as a dedicated service next to Grafana or Prometheus is, in my opinion, a better option. But unless you really want that level of resilience, or, to be honest, that added complexity, then it's fine to use the alertmanager within Grafana. That's how I think about it. Do you have anything to add to that, Goutham?
>> Just one thing. I fully agree: more components is more complexity, and at a certain point the amount of complexity will also break things, like you need to know how to troubleshoot it. So if you want a higher amount of resilience, and you're willing to put in the effort to understand and maintain the complex system, you can have a nice topology of alertmanagers. One thing we see regularly is a three- or four-node Alertmanager cluster, each node in a different data center. Now, I don't know how to do that with Grafana, but it is possible to do it with Alertmanager. The problem is, when two alertmanagers have issues communicating with each other, they might send you two alerts, because Alertmanager optimizes for sending you two alerts rather than zero alerts; being paged twice is better than not being paged at all. So this will give you more resiliency, but it might also make it a little more brittle and a little harder to troubleshoot. But again, these are the trade-offs that you need to look at: how much resiliency do you need, and how much complexity are you willing to put in for it.
Yeah.
>> Yeah. And I think one point to add here is that if you have only a Prometheus data source, then obviously you can decide what you want to do, but if you have other data sources, like MySQL or maybe Postgres or something else, and you are using those data sources in Grafana, then in that case you can use Grafana alerting itself, because then your data source is supported for alerting and you can define the rules for when to generate an alert and when to notify you. So that's the one case where you feel like, okay, now I have it in a centralized place, if you have a more complex system or more services to check and balance.
>> Yeah. And if there's one thing I could change about the Alertmanager project, it would probably be the name, to something like alert grouper or something like that. Alertmanager gets so many people confused; they think it's the alert rule evaluation system, you know. Yeah, it might be a little bit too late for that, but that would be my top feature request.
>> Carl, that will break a lot of things again. You don't want it.
Yeah,
>> Nice.
Um, so we still have about nine minutes left. If we can do a demo; I think we have planned a small demo. So, Goutham, if you may share your screen for that home lab part, which sounds very interesting. It might be even more interesting to actually look at it.
>> Yeah, I'm happy to. So essentially, I became a PM, and one of the reasons I became a PM was because I was an engineer building all these tools for other engineers to use, but I was also using my own tools: I was on call, and I was troubleshooting my issues with the tools I built. And I'm like, ah, I know what works, I know what we need to build, I know what doesn't work, I think I'll be a great PM. And I became a PM, but then immediately I stopped being on call for my tools and I stopped using them in real production scenarios. So I decided to put together all the different random Raspberry Pis I had lying around, hook them up into a home lab, run a few services on it, and just use the products that we built. That was the intention that started this home lab, but it became so much fun that I completely forgot about that. And now, in my home lab, I kind of went a little bit crazy. Let me show you.
So there should be... Okay, I didn't start that dashboard, but if I look at...
>> Can you please zoom in a little bit?
>> Yes, happy to. Happy to. Yeah. Is it better?
>> Yeah.
>> I'm going to look at the Linux nodes. If you look at my fleet of Linux nodes, these are a lot of the different Linux nodes that I'm running. Essentially you can see there's a cluster of ODROIDs from my office mate Ben, there's a new Pi, there's an old NUC, all these random bits and bobs that I've put together, and I now have 13 nodes running in the system, running a lot of different services, but also helping me experiment with a lot of different things. One of the things I also do is run Prometheus on a RISC-V board; there should be a RISC-V system somewhere. StarFive is the RISC-V system, and yeah, it's been up for seven months, which I'm actually surprised by.
But I have Prometheus running on RISC-V that's monitoring a lot of different components, including these metrics that you see here. One of my favorite dashboards, though, is this AirGradient dashboard, which shows the air quality in my office. The Prometheus is running locally and it's connecting to the office AirGradient here, and this is actually quite interesting in winter. In winter, essentially, I have to close all my windows, and then I breathe out all this CO2 and the CO2 concentration keeps rising, so I get paged around 800 ppm to open the window. Now I've silenced all my notifications, but I'm pretty sure I got paged here, and usually then I'm forced to open a window and take a walk and things like that. And what is really cool is this AirGradient: it's an open-source air quality monitor. You install it, it comes with a Prometheus endpoint out of the box, there's no configuration, it's the default, and you just point your Prometheus at it and it can collect all of this data.
The other interesting project that I did was actually inspired by a different project by Ed Welch, where he monitored his Nissan Leaf in Loki, and that kind of inspired me to do this other thing. So I have a Prusa 3D printer that is printing things. Now, I'm not a 3D printing expert; I don't know enough about 3D printing, but it looked really cool, and I've done a little bit of 3D printing in school, so I was like, you know what, I'll go buy a 3D printer. And then the next thing I see, as one of the accessories, is this flame extinguisher: you put it on top of the printer, and if the printer catches fire, it bursts and extinguishes the fire. And that scared me. I was like, whoa, wait, what? 3D printers catch fire? However, this used to happen a lot in the past; nowadays 3D printers don't catch fire, but I still couldn't get myself to leave the printer on and leave the office while it's running unmonitored. So to solve that problem, I came up with a very interesting hack, I would say.
I have a camera hooked up to the printer, and what it does is take a picture every 10 seconds and print it out to stdout. So if you look at it, it's called prile GTM, and I'm just going to select for data, and you can see these are just images, an image printed out as base64, literally; that's what it is. I take a picture, I print it out to stdout, and then I'm sending it to Loki in Grafana Cloud, and I'm using a panel called Base64 to image from Volkov Labs to generate a live stream of my 3D printer.
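The pipeline he describes could be sketched roughly like this; the path and interval are assumptions, not the actual script. A small loop base64-encodes the latest camera frame and prints it to stdout, where a log shipper such as Alloy or Promtail would pick it up and forward it to Loki for the base64-to-image panel to render.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"log"
	"os"
	"time"
)

func main() {
	// Assumed path where the webcam software drops the latest still frame.
	const snapshotPath = "/tmp/printer-cam/latest.jpg"

	// Every 10 seconds, read the current frame, base64-encode it, and write
	// one line to stdout. A log shipper (Alloy/Promtail) tails stdout and
	// sends each line to Loki, where a base64-to-image panel renders it.
	ticker := time.NewTicker(10 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		img, err := os.ReadFile(snapshotPath)
		if err != nil {
			log.Printf("reading snapshot: %v", err)
			continue
		}
		fmt.Println(base64.StdEncoding.EncodeToString(img))
	}
}
```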
So I'm essentially using Loki, the log storage, to store and live-stream my 3D printing. And now I have a program that reads all this data back and creates a nice time-lapse video out of it. This is the other cool, fun project I've been doing on the side, and it kind of blows my mind, the flexibility that Grafana has: shove base64 in and now you have images on a dashboard. I love that. Carl, do you think we can maintain this panel, the base64-to-image panel? Because I have a lot of feature requests.
You know, there are some new stars and I don't know the status of it, but it doesn't seem like a massive undertaking. It depends on your feature requests, but...
>> Because, you know, my screen is usually this wide, and I can't get the panel to be the right dimensions no matter what; I know the dimensions of the picture, but I just can't get the panel into the right ratio. So yeah, I have more feature requests, but yeah.
>> There's also a video panel in Grafana that you can use if you want to live stream a little bit faster, I guess. I mean, I certainly love this solution for its workaround-ness. But, um, yeah.
>> Yeah. Um
>> And I'm really amazed. So you're taking this image as a screenshot, and basically you're sending it to Loki, right? And then using the base64 panel to create the image from the encoded data?
>> Exactly. So all I'm doing is querying Loki, filtering down to just the base64 strings, and then automatically this thing came up, and I'm like, oh my god, Grafana is awesome. Yeah.
>> Prometheus labels as well.
What?
>> I think you can use Prometheus labels as well. I don't know how big they can be.
>> Uh, they can be...
>> The size of the image.
>> Yeah, they can be very, very big. And honestly, I'm taking an image every 10 seconds and writing like 20 kbps of data to Loki, and I've done the math: you can have 70 printers printing non-stop and still fit into the free plan of Grafana Cloud. So if you want a free video streaming solution, you can use Grafana Cloud. The Loki team will hate me, but you can.
>> Nice. And I must say, sorry, Carl, may I go ahead, or you?
>> Oh, yeah. Go ahead. It's fine.
>> Yeah. I just want to say, the other fascinating project is what you did with the air quality monitor. It reminds me that at GrafanaCON this year we had a science fair booth, and we were using this sensor data to capture the air quality, and since it was jam-packed, they did it live, I think, and they opened the windows so fresh air got into the event. So it was a very good use, like, hey, we are actually using it for real, for the whole event. So using this in your home is actually very useful, because in winter, yes, I don't blame you, I have the same thing: close all the windows and CO2 levels can increase.
>> And it increases insanely fast. So I would recommend getting an air quality monitor. Yeah.
>> Yeah. Yeah.
>> Cool. That's all that I had.
>> Okay. Okay, I think we are also on time. I don't think we have any questions; let me just check. Ah, we just have a comment from a user saying thank you: it's been a really core part of Grafana and there's still a lot of development going on. Yeah, I think we can wrap this up. So thank you again, thank you Goutham for joining us, and Carl for being such an amazing admin and host today, because it was the first time Carl had all the superpowers for the show; we may keep it that way for longer, who knows. But yeah, thank you everyone for joining this amazing session of Grafana Campfire, and especially for asking some good questions around Prometheus and different use cases. We learned a lot about pull and push. Check out the video recording, and check out the links as well; I will share them in the video description too. Till then, take care and goodbye.
>> Thank you. Bye.
#prometheus #grafana #opensource

Hosts: Syed Usman Ahmad
Experts: Goutham Veeramachaneni, Prometheus project maintainer and contributor, and Carl Bergquist, who has been actively using Prometheus and Grafana core since its release and continues to contribute to it, will be among us.

Summary: In the October edition of the Grafana Campfire Community Call, host Usman led a discussion about the recent Prometheus conference (PromCon) held in Munich, featuring guests Carl Bergquist and Goutham Veeramachaneni, both experts in Prometheus and Grafana. They shared insights from PromCon, highlighting the friendly atmosphere and the importance of community engagement. Key topics included the stable release of native histograms in Prometheus, improvements to Grafana's ad hoc filtering feature, and the ongoing integration with OpenTelemetry. The conversation also touched on the challenges faced by Alertmanager and the future of Prometheus with potential upgrades to its time series database (TSDB). Goutham demonstrated a personal home lab setup involving monitoring air quality and 3D printing, showcasing the practical applications of Grafana and Prometheus in real-world scenarios. Overall, the call provided a comprehensive overview of advancements in the Prometheus community and the collaborative efforts to enhance user experiences.

Timestamps (key moments from the livestream):
00:00:00 Introductions and overview of the call
00:02:30 Introduction of guests Carl Bergquist and Goutham
00:06:00 Discussion about PromCon and its significance
00:08:30 Highlights from PromCon, including the atmosphere and community involvement
00:11:00 Overview of new features in Prometheus 3.0, including native histograms
00:15:00 Introduction of the new ad hoc filtering feature in Grafana
00:21:00 Discussion about OpenTelemetry and its integration with Prometheus
00:28:00 Explanation of the differences between pull and push monitoring systems
00:35:00 Introduction to Goutham's home lab project and live demo
00:45:00 Wrap-up and final thoughts on alert management and Grafana's alerting system

Join the Meetup Group: https://www.meetup.com/grafana-friends-virtual-meetup-group/events/311564547/
Join the Grafana Official Community: https://community.grafana.com/

We look forward to seeing you 🙂