What if your AI systems could improve
themselves? Not metaphorically, but
literally. Imagine a chatbot that can
detect its own bad responses and fix its
own prompts. Or an automation that runs
and breaks and can detect that
something's wrong and repair itself.
These kinds of concepts used to be
science fiction. But now you can build
them yourself. And I'll prove it to you
because I built one in Claude Code and
I'm going to show you how you can do it,
too. And I've been able to use this
concept I'm about to show you in a
variety of different apps with different
applications. Because once you see this
pattern, you'll never build an AI app
the same ever again. Let's dive in. So
to keep this concept as approachable and
accessible to everyone as possible,
we're going to use this app that I put
together, which is just a chatbot. But
technically, it's not just any other
chatbot because like you can see on the
top right hand side, it's a chatbot that
retrains itself based on the
conversations that it has. So whether a
user said, "You're too dry," "This
answer's way too verbose," "This answer
is way too vague," or "It's way too AI," it can
detect and read through conversations
and decide based on a rubric that I've
created for it when it makes sense to
update its own prompt. Now, if you don't
believe me, we'll go into the admin tab
right here, and you'll see that we've
had a total of 60 messages, and we're
currently at the fourth version of the
current prompt. I've yet to intervene
myself on the system prompt. This is the
AI with basically another AI being a
judge reading through the conversations
and deciding what should change and why
it should change. And if we scroll down,
you'll see that we have a concept of
reflection here. And the way we're using
reflection is it will go and check the
last x number of messages. So it could
be the last seven exchanges or 14
messages, the last 10 exchanges or 20
messages. And then we can decide whether
or not we want the chatbot to look at
only the messages it has not looked at
before or if you want to actually change
the logic of the judge itself. We can
make it so that it consistently goes and
looks at the last n number of messages.
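To make that windowing logic concrete, here's a rough supabase-js sketch of how it could work inside an edge function. The table and column names (messages, evaluated_at) are illustrative assumptions on my part, not necessarily the app's real schema.

```typescript
import { createClient } from "@supabase/supabase-js";

// Inside a Supabase edge function (Deno); credentials come from env vars.
const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Fetch the reflection window: either the last n messages outright,
// or only the messages the judge has not evaluated yet.
async function getReflectionWindow(n: number, unevaluatedOnly: boolean) {
  let query = supabase
    .from("messages")
    .select("id, role, content, created_at");

  if (unevaluatedOnly) {
    // Assumes an evaluated_at column stamped after each reflection run.
    query = query.is("evaluated_at", null);
  }

  const { data, error } = await query
    .order("created_at", { ascending: false })
    .limit(n);
  if (error) throw error;
  return (data ?? []).reverse(); // oldest-first when handing it to the judge
}
```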
Now, if I'm already losing you, just
stick with me. I have diagrams. I have
visuals. I really want you to get this
cuz it's super cool. The TL;DR is I can
set this app to look every hour, every
30 minutes, or we can make it every
couple hours to look through all of the
chats and decide is this chatbot still
on track, or have we had enough
conversations where it's clear that the
users are not being well serviced by it
and things need to be updated. So, if we
take a look at the last reflection right
here, you'll see it graded itself two
out of five on completeness, four out of
five on depth, four out of five on tone,
and one out of five on scope.
So, as a result of that, it actually
shows you the conversation that was
included in this evaluation. So, you can
have a second set of eyes if you want to
see if your judge is doing the right
job. And if you want to see the analysis
of why it decided that it needed to
change its own prompt, it walks through
and breaks down in plain English why it
decided from what conversations that it
needed that trigger. But wait, there's
more. In the app itself, as we go from
version to version, if we decide as the
human that even though the AI decided
the system prompt should change, that we
want to revert back to a previous
version, we can always go back and
revert to whatever it is we had before.
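A revert like that can stay append-only under the hood. Here's a minimal sketch, assuming a system_prompts table with content and is_active columns and a version number assigned by the database; those names are mine, not necessarily the app's.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Revert by copying an old version's text into a brand-new active row,
// so no prompt in the history ever gets overwritten.
async function revertToVersion(targetVersion: number) {
  const { data: old, error } = await supabase
    .from("system_prompts")
    .select("content")
    .eq("version", targetVersion)
    .single();
  if (error || !old) throw error ?? new Error("version not found");

  // Deactivate whatever prompt is currently live...
  await supabase
    .from("system_prompts")
    .update({ is_active: false })
    .eq("is_active", true);

  // ...then re-insert the old content as the newest version
  // (assumes the version number is assigned by the database).
  await supabase.from("system_prompts").insert({
    content: old.content,
    is_active: true,
    source: `revert_to_v${targetVersion}`, // hypothetical audit column
  });
}
```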
And on top of that, if you want to audit
all the reflections, you'll notice that
the majority of the time that I ran
this, it passed, meaning it maintained
the system prompt. And usually a system
prompt for a chatbot shouldn't be very
reactive. I literally wrote a prompt so
good by accident that I had to manually
break it by making the rubric impossible
to pass. And since I had a database
behind the scenes, we could click on
anything like show right here. And we
can always go back in time to see why a
particular phase either failed or
passed. And the last tab we have is the
suggestions tab. And this is where the
LLM as a judge gives any form of feedback
or tidbits that you could use to improve
the conversations based on what it's
seen, but it isn't enough evidence to
want to change the underlying system
prompts. So, for example, if we go on
this one, which says use more natural
conversational language, you can go
through and it says the assistant's
response style is very structured and
professional throughout. While
appropriate for the technical audience,
this becomes slightly cold when a user
is in genuine distress. So imagine you
have a little Jira board or a project
manager, an AI project manager. It can
go through and give you advice on how
you can improve without overriding what
you already have. So if I've piqued your
interest so far, let's take a look at
how we built it. Now, big picture.
Usually, especially in the world of Vibe
coding, many apps are built very
linearly where you write a prompt, you
pray it works, you fight with the AI,
you test it, you tweak it, you go back
and forth in this iterative circle. And
even when it's in production, even if
you get to that point where it's ready,
every single time you have an
interaction, you'll go through all the
interactions back and forth to see is it
performing up to par? And then at some
point you'll have a threshold or a user
swearing at you where you decide, okay,
now we have to make a change. This new
world I'm presenting you is a
self-improvement feedback loop where you
create the app in a way where you have
different parts of your database
tracking different pieces of metadata,
data in general, flow of the app, so it
can come up proactively with
suggestions. And if you want it to
implement such suggestions, then it can.
And if you've watched this far and
you're non-technical and you're sweating
already because you can imagine there's
so much technology at play: there are
literally two things I used to build
this. One is Claude Code. That's the
obvious one. The second one is
Supabase, a database. And all I did was
hook up the Supabase MCP server. And
this allows Claude Code free rein to
build a new database, build whatever
functions need to be built, test them,
create new tables, and allow things like
edge functions, which help us create
micro interactions and behaviors
throughout the whole app. And the best
part is, despite my technical
background, all I did to build this was
use natural language prompts. So now to
get more granular, we'll go into teacher
mode and walk through the exact process
of how this is built. So you have Claude
Code that builds and connects
everything. That's step one. Step two is
you want to be able to connect Supabase
via its MCP server. This makes it a lot
easier to have Claude Code go from a plan
to going back and forth and having a
feedback loop with Supabase. So if
something's failing, you don't have to
go back and forth copy-pasting errors
and screenshots to tell Claude Code about
it. There is this seamless
communication. It's like an open phone
line between both services. In
Supabase, we're going to do things like
store the prompts, the responses,
timestamps, the timestamp of the last time
we ran a reflection, the logs associated
with that reflection, the prompts
associated with our rubric that should
persist over time, and the user prompts
that we change from time to time. So,
the biggest part of this process is
really ideating what are all the
components you need to have an effective
feedback loop system. Now, the next
layer is the most important layer, which
is the evaluation layer. How do you
create a way where the AI can go through
some self assessment? Now, in this case,
we have the AI taking care of the chat
and we have another untouched, blank-
slate AI whose sole role is to monitor
the other AI's behavior and basically
give feedback on it. So, as an analogy, if
you've ever had a 9-to-5, you'd be
familiar with performance evaluations.
And typically, an employer would
evaluate you and once in a while you
would be asked to evaluate yourself. In
this case, we care about the latter
where we're giving a rubric where the AI
assesses itself and if it comes to the
honest conclusion without ego that it's
done poorly, then it tries to come up
with a plan on how it can improve. And
the key thing is that this is a feedback
loop. This can keep going on and on,
especially as you have more
conversations or more users in this
case. And to get even more granular, you
have the user ask a question and then
the chatbot responds and then if there's
a trigger to evaluate it, it will look
at the back and forth exchanges and then
it will score and save this to a
database. If nothing needs to be
changed, which should be the status quo,
you should not have a system prompt
changing non-stop. Otherwise, you're
going to have a very unstable app that
is very reactive to nuanced
conversations. If it decides that it
needs to update, then it will
automatically update and it will keep
going in this circle. And if it's not
clear by now, the main benefit is you
essentially have AI meta-prompting
itself, using AI to write and evaluate
its own prompts, which is usually a much
better prompt engineer than you and I.
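If it helps to see the loop as code, here's the rough shape of a single pass. This is only a sketch; every function name here is an illustrative stand-in passed in as a dependency, not the app's actual code.

```typescript
// The moving parts of one reflection pass; all names are illustrative.
type Verdict = {
  scores: Record<string, number>; // e.g. completeness, depth, tone, scope (1-5)
  decision: "pass" | "update";
  reasoning: string;
};

type Deps = {
  reflectionIsDue: () => Promise<boolean>;
  getReflectionWindow: () => Promise<unknown[]>;
  runJudge: (messages: unknown[]) => Promise<Verdict>;
  saveReflectionLog: (v: Verdict) => Promise<void>;
  rewritePrompt: (v: Verdict) => Promise<string>;
  activateNewPromptVersion: (prompt: string) => Promise<void>;
};

// One pass of the self-improvement loop: evaluate, log, and only sometimes update.
async function maybeReflect(deps: Deps) {
  if (!(await deps.reflectionIsDue())) return;     // interval elapsed? enough new messages?

  const window = await deps.getReflectionWindow(); // last N or unevaluated exchanges
  const verdict = await deps.runJudge(window);     // LLM-as-judge scores the rubric

  await deps.saveReflectionLog(verdict);           // always keep the audit trail

  if (verdict.decision === "update") {
    const next = await deps.rewritePrompt(verdict); // judge drafts the new system prompt
    await deps.activateNewPromptVersion(next);      // versioned, so it can be reverted
  }
  // A "pass" should be the status quo; a stable prompt shouldn't churn on every nuance.
}
```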
And not only that, it does give some
more richness to you as the owner of
this application cuz now you can see the
AI's thoughts, you can audit, and you
basically have a thought partner as to
how you can improve the experience of
your users on the app. Now, building the
system is doable, but it needs some
imagination and some back and forth. So,
while I've still maintained all my
chats, where I'll pull out certain
parts to show you and give you that
inspiration for how you could build
something just like this or apply this
concept elsewhere, I'll first walk you
through the entire journey of how we
went from beginning to end and I'll show
you the mega prompt that I started out
with. So, this was built over multiple
sessions. And in the first session, like
most sessions, you want to build the
foundation where you go through the big
picture and try to explain to it, you
know, this is the MCP server that you
want to connect to. This is Supabase.
You're going to use that primarily.
We're going to create different tables.
The goal is to create an experience
analyzer, and then we want to be able to
interchangeably change things in the
database, add new edge functions as
needed. So, I gave Claude Code free rein
on the first pass to build as much as it
wanted and think about any thoughtful
features that made sense for this
self-improving system. By session number
two, I noticed it created its own
features like a cooldown feature where
once the prompt was updated, it would
have a blackout period
where the prompt couldn't self-update
again for 1 hour. Engineering-wise, it
actually makes a lot of sense, but for
my purposes of testing, I had to find a
way to break it. So in session three, we
focused on safety nets. How do we manage
the grading? How do I make sure that
it's not just being nice to itself every
single time it runs a self-assessment?
Like a human sometimes can be when
they're looking for a promotion or to
keep their job, where in the self-
assessment, even if they're not that
great of an employee, they might say, "I
am awesome. I am a 10 out of 10." And in
the other sessions like session four and
onwards, I created this handoff file
where because my conversations were
getting meaty very quickly, I tried to
create this baton pass method where I
could collapse a conversation and where
I could pass it off to the next agent to
continue and go from there. Now, brace
yourself and take a deep breath because
I'm about to spend the next few minutes
breaking down this mega prompt. And
we're not going to read it line by line,
but I'll give you enough of an idea that
you can appreciate what I used as a
foundation for all of this. And don't
worry about having to screenshot to keep
up. I'll make this available to you
along with some other goodies in the
second link in the description below.
So, let's get to it. We start off by
saying, "Build a self-improving chatbot
that answers questions about an AI
consultancy and business AI
transformation. The system uses
Supabase for persistence, a fancy word
for keeping things in memory, aka a
database, and Claude 4.5 Haiku, cheap,
fast, easy to use for both the chat
agent and the self-improvement
reflection loop. Now, this is the part
where I had AI come up with a tech
stack, a series of ideas for Claude Code.
So, one of the most important things is
just feeding it 4.5 haiku and what the
model name is because most of these
models, even if they're trained as of
January 2025, still wouldn't know about
the existence of models like 4.5 haiku.
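So in the prompt I spell the model out explicitly. Here's roughly what the edge-function call can look like; the endpoint, headers, and body follow the standard Anthropic Messages API, but the model ID string is my best guess at the time of writing, so double-check the current model list before copying it.

```typescript
// Pin the exact model so the coding agent can't quietly fall back to an older one.
const MODEL = "claude-haiku-4-5"; // verify against Anthropic's current model list

// Minimal call to the Anthropic Messages API from a Supabase edge function (Deno).
async function callClaude(
  system: string,
  messages: { role: "user" | "assistant"; content: string }[],
): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": Deno.env.get("ANTHROPIC_API_KEY")!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({ model: MODEL, max_tokens: 1024, system, messages }),
  });
  if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`);
  const data = await res.json();
  return data.content[0].text; // the first content block holds the assistant's reply
}
```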
And you'll notice in things like Cursor,
Claude Code, whatever it is you use,
they will always default to an old
version. So, if you ask it to use Gemini,
it'll say Gemini 1.5 or Gemini 2. Same
thing with Claude, it will use Claude
3.5. And next up, I say I have Supabase
MCP enabled. It's helpful before you
even start this session that you would
create and enable the MCP server so it's
connected and ready to go. And this is
the part where I really wanted the AI to
step in and do the heavy lifting. So it
created the database design. It created
a table for users, where it collected
information like ID and email, plus
created-at dates on the messages, which
are super important because we need to
be able to track either X amount of
messages or X amount of time horizon.
And then authentication and sessions, so
multiple chat sessions could be
persisted along with the number of
messages; we had a messages table. And
then it created other tables, like one
for system prompts. That's a no-brainer.
We have to store those over time. And a
reflection log, where we're going to
store the reflections of the AI judge
and any decisions.
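Paraphrased as TypeScript types, the data model looks roughly like this. The column names are my reading of what's described here, not the exact schema Claude Code generated.

```typescript
// Rough shape of the tables; names and columns are a paraphrase, not the real DDL.
interface User {
  id: string;
  email: string;
  created_at: string;
}

interface ChatSession {
  id: string;
  user_id: string;
  created_at: string;
}

interface Message {
  id: string;
  session_id: string;
  role: "user" | "assistant";
  content: string;
  created_at: string;          // lets reflections window by message count or by time
  evaluated_at: string | null; // stamped once the judge has seen it (my assumption)
}

interface SystemPrompt {
  version: number;
  content: string;
  is_active: boolean;
  created_at: string;
}

interface ReflectionLog {
  id: string;
  scores: Record<string, number>; // completeness, depth, tone, scope, etc. (1-5)
  decision: "pass" | "update";
  reasoning: string;
  created_at: string;
}
```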
The next part is the setup of its own
prompt, using AI to create that prompt.
So, meta-prompting on steroids. And this
says, "Create a
well-crafted initial system prompt for
the AI consultancy chatbot. It should,
A, be an expert on AI consultancy, know
about common frameworks, understand
business context," and a series of other
pieces of information. And this is
possibly the most important part where I
had Claude nerd out on what edge
functions might be needed to enable
different behaviors in the app. So we
have a chat handler that is responsible
for receiving the messages, fetching the
conversation, and calling the API,
because Anthropic's API is best called
using something like an edge function.
And then the reflection loop and how
that would work: step one, fetch the
recent messages or exchanges. Step two,
analyze with the
rubric that AI would help create the
first draft of, and we told it to create
some criteria scored from one to five. So:
response completeness, response depth,
tone appropriateness, scope adherence,
missed opportunities. Step three was the
decision framework it would use to do
so. And then step four was when to
decide to update and what the criteria
for that is. And the rest of this looks
at things like the guardrails of the app
itself, how it should evaluate itself,
examples of reflection logs, so it can
basically model after those examples.
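The scoring-plus-decision part of that reflection loop boils down to something like this sketch. The criteria names come from the rubric above; the threshold value and the exact decision rule are my assumptions, not the app's real logic.

```typescript
// Rubric scores the judge model returns, each from 1 to 5.
type RubricScores = {
  completeness: number;
  depth: number;
  tone: number;
  scope_adherence: number;
  missed_opportunities: number;
};

// Decision framework: keep the prompt stable unless the evidence is strong.
function decide(scores: RubricScores, threshold = 3.5): "pass" | "update" {
  const values = Object.values(scores);
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  const worst = Math.min(...values);

  // Update only when the average dips below the threshold or one dimension
  // fails badly; anything else is a "pass" so the system prompt doesn't churn.
  return avg < threshold || worst <= 1 ? "update" : "pass";
}
```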
And then we enter this and go from
there. Now, instead of me going through
back and forth between two different
screens to show you snippets of chats,
I'll walk through the general concepts
of what needed to happen in each session
that was in my Claude Code instance. And
what I'll also do is go back and forth
between the app so I can tell you
exactly where this issue was and how I
decided that we needed to iterate on it.
So this first session, like I said, was
all about foundations. So if I show you
a little teaser here, this MCP Supabase
execute-SQL call. This is Supabase MCP at
its finest, where it doesn't have to ask
me permission to write every single SQL
query. And it's very similar to an
experience you would imagine from
Lovable or Bolt or any one of those
other tools that are browser-based, where
it would ask you permission to execute a
bunch of SQL which even if you have no
idea what it said, it would still
necessitate you to click on and enable
that. In this case, if you're being
experimental and you're going on YOLO
mode on bypass permissions, then this
will keep running, keep executing
different queries, testing those
queries, and this is where you can even
start to apply different features. You
can have multiple agents working on the
same task, each one working on a
different part of the database. So, the
most common error that popped up was I
initially wanted a way to connect a user
to a chat. So, if we go back to the app
itself and I am to log out, I want to
just be able to enter my name, enter my
email, click on start chatting, and it
would remember me from that email as a
unique identifier. And now, if you go
in, you'll see I have all my
conversations. So, we had some initial
problems there. And then the second part
was just establishing the connection to
Supabase, making sure it was working
and making sure that it wasn't going off
the rails. Now, when I initially set
this up, I got a screen where if I
clicked on a new chat and sent a
message, one, I couldn't see the message
that I sent, let alone the fact that
when we received a response, it would
kick me out of the chat. So, many times
when you're vibe coding, you'll have
these unexpected micro behaviors that
happen that you have to account for.
Another thing that I wanted was I wanted
to have some prompts here that give you
suggestions on what you could ask so
that when you click on it, it would also
send that prompt directly. So if I click
on what should I cover in an AI
discovery workshop, this should go. It
has a little loading state and comes
back with a response. The next thing is
the response came back with a series of
hashtags, basically Markdown. And we
needed a way for the UI to render
this. So it looks something like this
where it's well put together. It's
structured. It's easy to read. So just
this scope took us to the end of the
context window of our first session. And
then we moved on to part two. Part two
is where we had a system prompt and I
wanted to see that nothing broke. It seems
like it's functional. It looks like it's
actually looking at the past chats, but
I can't tell if it's actually working
because it just says pass. So then I
tried to manipulate the rubric it came
up with itself, and realized, like I
mentioned, that it had come up with
what's called a cooldown period, where
any time it ran, it would refuse any
form of update to the system prompt
within 30 minutes, even if I had set the
review period to 5 minutes. So, I
basically had the Supabase MCP inject
its own code to
allow me to override this. So, I could
test out different versions or
permutations of the LLM as a judge
prompt. Eventually, we broke it down. We
were good to go. And I saw that if I
switched and played around with the
prompt, it was actually working. Now,
was it working well? Was it being too
nice to itself? That was the next part.
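For what it's worth, the cooldown it invented boils down to something like this check, plus the override flag I had it add so I could keep testing. The table name, column, and minute values here are illustrative, not the generated code.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Refuse to touch the system prompt if it changed within the last N minutes,
// unless the caller explicitly overrides (the "reflect now" path).
async function cooldownActive(cooldownMinutes: number, override = false): Promise<boolean> {
  if (override) return false; // bypass the blackout period when testing

  const { data } = await supabase
    .from("system_prompts")
    .select("created_at")
    .order("created_at", { ascending: false })
    .limit(1)
    .maybeSingle();

  if (!data) return false; // no prompt updates yet, nothing to cool down from
  const minutesSince = (Date.now() - new Date(data.created_at).getTime()) / 60_000;
  return minutesSince < cooldownMinutes;
}
```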
Now, when we go to the admin part of the
app, we now have a very thoughtful set
of settings where we can set the score
threshold. We can set how many messages
to evaluate and whether or not to
evaluate messages that it's seen before
or just net new messages. When it first
came up with this user interface, you
couldn't see any of this. You could just
see this reflection interval that only
had a couple settings like 1 minute, 1
hour, a couple hours, etc. So all of
this stuff lives in Superbase, meaning
the database itself already had the
functionality. All we had to do is tell
Claude Code: make it come to the fore,
let me see it on the front end so I can
manipulate it. And key thing here, make
sure that if I manipulate it on the
front end, it gets propagated or it gets
sent that same change to the database.
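As a sketch, that write-back from the admin UI can be as simple as a single update call, assuming a one-row reflection_settings table (again, my naming, not necessarily what the app uses).

```typescript
import { createClient } from "@supabase/supabase-js";

// Front-end client; replace with your project's URL and public anon key.
const supabase = createClient("https://YOUR-PROJECT.supabase.co", "YOUR_ANON_KEY");

// Persist the admin toggles so the UI state and the database never drift apart.
async function saveReflectionSettings(settings: {
  score_threshold: number;   // e.g. 3.5
  window_size: number;       // how many messages the judge looks at
  unevaluated_only: boolean; // only net-new messages vs. the last N regardless
}) {
  const { error } = await supabase
    .from("reflection_settings")
    .update(settings)
    .eq("id", 1); // single settings row
  if (error) throw error; // surface failures instead of silently faking the UI
}
```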
Don't just fool me and make it look like
I'm doing something on the front end
that isn't actually changing the back
end. So once we had these two toggles,
one to evaluate the last n number of
messages and ideally I could pick what
those messages are or unevaluated
messages, we now had a good pipeline for
reflection. And the next step was
understanding: what if I wanted to
reflect now, like I wanted to run it and
not wait five minutes or even one
minute? So then we added a reflect-now
button that will override any other
setting, so it re-reflects on whatever
number of messages I set it to. And the
goal is that it automatically checks the
system prompt, ignores the cooldown so
the cooldown doesn't block it, and
either generates suggestions or
evaluates and updates the prompt. And
you can see this right here. Here, if I
click on reflect now, it will go through
and reflect on the last system prompt
and the last few messages. In this case,
I have it set to unevaluated messages,
and there have been no new messages, so
it won't really have any form of real
change. And then we have the last 14
messages, but these two settings are
basically going against each other. Oh,
I just forgot: we actually just sent a
message. So we do have two
unevaluated messages. It did evaluate
it. And you can see right here it graded
itself as perfect. And technically what
it came back with was pretty good. So
it's not wrong. But you can see the
value of being able to tinker and design
your app so you can test and tinker and
stress test it and build that and bake
it into the back end itself. So, because
these sessions would drag on and almost
always cap out the context limit (and
you can see right here one example at
85% used), I would create this handoff
document that I called the baton pass,
basically a self-improving-build checklist. It
would go through and maintain context on
what it did, what bugs it encountered,
and basically what phase it was in and
what was completed. It would denote
something as completed with this check
mark emoji. And if there was anything
left or anything to investigate, I could
always refer to the next chat to go
through and see what the latest update
was. So, it's my hacky way of keeping
the most important pieces of context
together because yes, you can use things
like /compact to summarize the
conversation and generate a brand new
session, but many times some micro
behaviors or some really pivotal pieces
of information get left out. And the
overall goal of this was to create one
unified save state where everything
that's changing, every bug that I
encountered, everything that needed
investigation could be in one place. So
I could be the lazy person that I am and
say "refer to," tag the file (@handoff),
"and execute on all the remaining
bugs in the app." And this is a big hack
for memory management, context
management, and overall if you ever
build this project and you want to
replicate it, it's cool to bring this
artifact over to your new
project, take the code, the context,
and it'll help you build other
self-improving systems that much faster.
Now, at this point, we had finished
quite a few of the phases of the core
build. So the first pass of the entire
app was put together. And now this is
the part where I would go in test things
and realize I didn't have everything at
my disposal that I needed to test things
out. So as a good example, if I would go
to the prompts, I wanted some way to see
the system history of all the different
prompts that I had before and ideally
revert back to them. We didn't have
this. It would overwrite the existing
prompt without any form of history. So I
wanted that. So then Supabase MCP had
to listen to those requirements and
create new tables to store that. On the
reflection logs, I would only have the
last reflection. I wanted all of them.
And again, this is something that
existed in the database, but just wasn't
shown on the front end. So this
iterative process is something you can
only understand once you're in it and
you see what's missing from your initial
requirements, which is how this
suggestions tab was born because I
wanted to know how close were we
threshold-wise to not passing the test.
Like what was it that the AI noticed?
Does the AI know what to look for? And
is there a way that I can hide these or
check these off if I've completed them?
So if I say mark as addressed, I can do
that. If I'm unhappy with it, I feel
like it's not a great piece of advice, I
can just hide it entirely. And those are
the extra second, third, fourth order
features that come about when you're
actually testing out the app. And like I
said, I wanted to be able to intervene
and see what is this reflection prompt
that keeps passing with magnificent
colors. Cuz like I said before, one of
the problems was I assumed some
overconfidence. It was rating itself
like it was amazing all the time, and I
felt like it was being too nice to the
other AI. It's good to be stable, but
it's not good to be biased. So then I
created this next tab where I could
audit and see exactly what the prompt
was that was being used as the judge for
the app. You'll notice here, this is an
example of me trying to do the
following, which is create the ruthless
critic so I could purposely break it and
see whether or not it would respond. So
you can see here from the very first
line of the prompt I say you are an
impossibly critical quality assurance
system for an AI consultancy chatbot.
Your standards are unreasonably high.
You are looking for perfection and
perfection does not exist. And I
basically set it up for failure just to
test out that it would work, that it
would actually update the prompt, by
making it impossible to score anything
above a four. So, if I went back to the
dashboard and I set this to 4.5, no
matter how good the conversations were,
it would have to fail if the app was
actually working. And I know I'm pseudo
rambling here, but I wanted to walk you
through the mental model of how to build
an app and how to make sure you can
stress test whether it does what it says
it does. And after this phase of
creating version control, logging
everything, this app is not perfect.
There are still many areas, especially
if there were active users on the
platform, where I could see that it
could fail or would need more
adjustments. But the cool part is now
that we have the understanding of a
self-improving system, you don't just
have to stop here. It's not just about
improving the system prompt, you could
improve the app itself. You could create
a part of this app that would say based
on user behavior and what people are
asking for, maybe come up with and
implement a new feature or at least
draft a new feature that we could add to
this app that's maybe not chat oriented
or it's an area where you can go back
and forth and build something like a
document cuz everyone's asking for XYZ
document. The sky's the limit, and I
wanted to show you this example end to
end at least theoretically and
conceptually so you can apply it to
whatever makes sense for you. Now, if
you enjoyed this video, it took me a
while to put this mad scientist
experiment together and break it down in
a way that I could show you in an easy
way. Now, if you want to build a version
of this app on your end, then I'll give
you the mega prompt I put together in
the second link in the description below
along with a guide and a few other
goodies to help you do that. And if you
want access to this system as-is, carbon
copy, along with a series of other
systems that I'm continually building
for my community, you can check out the
first link in the description below, and
I'll see you in my early AI adopters
community. And last but not least, if
you like videos just like these where I
go very mad scientist and I try to go
against the grain and see what's
possible, please let me know down in the
comments below. It gives me feedback
that I should do more of this and that
you like it. And number two, it helps
the video and helps the channel. I'll
see you in the next one.
Join my AI community: https://bit.ly/earlyaidopters
Get the Mega Prompt + Guide: https://bit.ly/44HeWKB
Book a call: https://bit.ly/markaicoaching

---

What if your AI could improve itself? Not metaphorically - literally. In this video, I show you how to build self-improving AI systems using Claude Code and Supabase. I built a chatbot that detects its own bad responses, grades itself on a rubric, and rewrites its own prompts - all without human intervention. This isn't science fiction anymore. Once you see this pattern, you'll never build an AI app the same way again.

What's inside:
- Live demo of a self-improving chatbot
- The evaluation layer that makes AI judge itself
- How to set up Supabase MCP with Claude Code
- The mega prompt I used to build the entire system
- Safety nets to prevent overconfident self-assessments
- Version control for AI-generated prompts

---

TIMESTAMPS:
00:00 - What if AI could improve itself?
00:36 - Demo: Self-improving chatbot in action
01:10 - Admin panel: Reflection logs and versioning
02:01 - How the evaluation triggers work
03:02 - Version control and audit trails
04:31 - Traditional vs self-improving architecture
05:35 - The only two tools you need
06:06 - Teacher mode: How it's built
07:06 - The evaluation layer explained
08:00 - The feedback loop diagram
09:07 - Building session 1: Foundations
11:07 - The mega prompt breakdown
14:14 - Session 2: Breaking the cooldown
17:00 - Session 3: Admin UI and settings
19:56 - The handoff document hack
21:17 - Testing and stress testing
23:01 - Creating the ruthless critic prompt
24:16 - Beyond prompts: Self-improving features

---

#claudecode #aiautomation #selfimprovingai #supabase #vibecoding #aichatbot #promptengineering #llmasajudge #aitools #buildwithclaude #nocode #aiforbusiness #metaprompting #claudeai #aidevelopment