What if your AI systems could improve
themselves? Not metaphorically, but
literally. Imagine a chatbot that can
detect its own bad responses and fix its
own prompts. Or an automation that runs
and breaks and can detect that
something's wrong and repair itself.
These kinds of concepts used to be
science fiction. But now you can build
them yourself. And I'll prove it to you
because I built one in Claude Code and
I'm going to show you how you can do it,
too. And I've been able to use this
concept I'm about to show you in a
variety of different apps with different
applications. Because once you see this
pattern, you'll never build an AI app
the same ever again. Let's dive in. So
to keep this concept as approachable and
accessible to everyone as possible,
we're going to use this app that I put
together, which is just a chatbot. But
technically, it's not just any other
chatbot because like you can see on the
top right hand side, it's a chatbot that
retrains itself based on the
conversations that it has. So whether a
user said, "You're too dry," "This
answer's way too verbose," "This answer
is way too vague," or "It's way too AI," it can
detect and read through conversations
and decide based on a rubric that I've
created for it when it makes sense to
update its own prompt. Now, if you don't
believe me, we'll go into the admin tab
right here, and you'll see that we've
had a total of 60 messages, and we're
currently at the fourth version of the
current prompt. I've yet to intervene
myself on the system prompt. This is the
AI with basically another AI being a
judge reading through the conversations
and deciding what should change and why
it should change. And if we scroll down,
you'll see that we have a concept of
reflection here. And the way we're using
reflection is it will go and check the
last x number of messages. So it could
be the last seven exchanges or 14
messages, the last 10 exchanges or 20
messages. And then we can decide whether
or not we want the chatbot to look at
only the messages it has not looked at
before or if you want to actually change
the logic of the judge itself. We can
make it so that it consistently goes and
looks at the last n number of messages.
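To make that windowing logic concrete, here's a rough supabase-js sketch of how it could work inside an edge function. The table and column names (messages, evaluated_at) are illustrative assumptions on my part, not necessarily the app's real schema.

```typescript
import { createClient } from "@supabase/supabase-js";

// Inside a Supabase edge function (Deno); credentials come from env vars.
const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Fetch the reflection window: either the last n messages outright,
// or only the messages the judge has not evaluated yet.
async function getReflectionWindow(n: number, unevaluatedOnly: boolean) {
  let query = supabase
    .from("messages")
    .select("id, role, content, created_at");

  if (unevaluatedOnly) {
    // Assumes an evaluated_at column stamped after each reflection run.
    query = query.is("evaluated_at", null);
  }

  const { data, error } = await query
    .order("created_at", { ascending: false })
    .limit(n);
  if (error) throw error;
  return (data ?? []).reverse(); // oldest-first when handing it to the judge
}
```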
Now, if I'm already losing you, just
stick with me. I have diagrams. I have
visuals. I really want you to get this
cuz it's super cool. The TL;DR is I can
set this app to look every hour, every
30 minutes, or we can make it every
couple hours to look through all of the
chats and decide is this chatbot still
on track, or have we had enough
conversations where it's clear that the
users are not being well serviced by it
and things need to be updated. So, if we
take a look at the last reflection right
here, you'll see it graded itself two
out of five on completeness, four out of
five on depth, four out of five on tone,
and one out of five on scope.
So, as a result of that, it actually
shows you the conversation that was
included in this evaluation. So, you can
have a second set of eyes if you want to
see if your judge is doing the right
job. And if you want to see the analysis
of why it decided that it needed to
change its own prompt, it walks through
and breaks down in plain English why it
decided from what conversations that it
needed that trigger. But wait, there's
more. In the app itself, as we go from
version to version, if we decide as the
human that even though the AI decided
the system prompt should change, that we
want to revert back to a previous
version, we can always go back and
revert to whatever it is we had before.
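A revert like that can stay append-only under the hood. Here's a minimal sketch, assuming a system_prompts table with content and is_active columns and a version number assigned by the database; those names are mine, not necessarily the app's.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Revert by copying an old version's text into a brand-new active row,
// so no prompt in the history ever gets overwritten.
async function revertToVersion(targetVersion: number) {
  const { data: old, error } = await supabase
    .from("system_prompts")
    .select("content")
    .eq("version", targetVersion)
    .single();
  if (error || !old) throw error ?? new Error("version not found");

  // Deactivate whatever prompt is currently live...
  await supabase
    .from("system_prompts")
    .update({ is_active: false })
    .eq("is_active", true);

  // ...then re-insert the old content as the newest version
  // (assumes the version number is assigned by the database).
  await supabase.from("system_prompts").insert({
    content: old.content,
    is_active: true,
    source: `revert_to_v${targetVersion}`, // hypothetical audit column
  });
}
```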
And on top of that, if you want to audit
all the reflections, you'll notice that
the majority of the time that I ran
this, it passed, meaning it maintained
the system prompt. And usually a system
prompt for a chatbot shouldn't be very
reactive. I literally wrote a prompt so
good by accident that I had to manually
break it by making the rubric impossible
to pass. And since I had a database
behind the scenes, we could click on
anything like show right here. And we
can always go back in time to see why a
particular phase either failed or
passed. And the last tab we have is the
suggestions tab. And this is where the
LLM as a judge gives any form of feedback
or tidbits that you could use to improve
the conversations based on what it's
seen, but it isn't enough evidence to
want to change the underlying system
prompts. So, for example, if we go on
this one, which says use more natural
conversational language, you can go
through and it says the assistant's
response style is very structured and
professional throughout. While
appropriate for the technical audience,
this becomes slightly cold when a user
is in genuine distress. So imagine you
have a little Jira board or a project
manager, an AI project manager. It can
go through and give you advice on how
you can improve without overriding what
you already have. So if I've piqued your
interest so far, let's take a look at
how we built it. Now, big picture.
Usually, especially in the world of Vibe
coding, many apps are built very
linearly where you write a prompt, you
pray it works, you fight with the AI,
you test it, you tweak it, you go back
and forth in this iterative circle. And
even when it's in production, even if
you get to that point where it's ready,
every single time you have an
interaction, you'll go through all the
interactions back and forth to see is it
performing up to par? And then at some
point you'll have a threshold or a user
swearing at you where you decide, okay,
now we have to make a change. This new
world I'm presenting you is a
self-improvement feedback loop where you
create the app in a way where you have
different parts of your database
tracking different pieces of metadata,
data in general, flow of the app, so it
can come up proactively with
suggestions. And if you want it to
implement such suggestions, then it can.
And if you've watched this far and
you're non-technical and you're sweating
already because you can imagine there's
so much technology at play: there are
literally two things I used to build
this. One is Claude Code. That's the
obvious one. The second one is
Supabase, a database. And all I did was
hook up the Supabase MCP server. And
this allows Claude Code free rein to
build a new database, build whatever
functions need to be built, test them,
create new tables, and allow things like
edge functions, which help us create
micro interactions and behaviors
throughout the whole app. And the best
part is, despite my technical
background, all I did to build this was
use natural language prompts. So now to
get more granular, we'll go into teacher
mode and walk through the exact process
of how this is built. So you have Claude
Code that builds and connects
everything. That's step one. Step two is
you want to be able to connect Supabase
via its MCP server. This makes it a lot
easier to have Claude Code go from a plan
to going back and forth and having a
feedback loop with Supabase. So if
something's failing, you don't have to
go back and forth copy-pasting errors
and screenshots to tell Claude Code about
it. There is this seamless
communication. It's like an open phone
line between both services. In
Supabase, we're going to do things like
store the prompts, the responses,
timestamps, the timestamp of the last time
we ran a reflection, the logs associated
with that reflection, the prompts
associated with our rubric that should
persist over time, and the user prompts
that we change from time to time. So,
the biggest part of this process is
really ideating what are all the
components you need to have an effective
feedback loop system. Now, the next
layer is the most important layer, which
is the evaluation layer. How do you
create a way where the AI can go through
some self assessment? Now, in this case,
we have the AI taking care of the chat
and we have another untouched, blank-
slate AI whose sole role is to monitor
the other AI's behavior and basically
give feedback on it. So, as an analogy, if
you've ever had a 9-to-5, you'd be
familiar with performance evaluations.
And typically, an employer would
evaluate you and once in a while you
would be asked to evaluate yourself. In
this case, we care about the latter
where we're giving a rubric where the AI
assesses itself and if it comes to the
honest conclusion without ego that it's
done poorly, then it tries to come up
with a plan on how it can improve. And
the key thing is that this is a feedback
loop. This can keep going on and on,
especially as you have more
conversations or more users in this
case. And to get even more granular, you
have the user ask a question and then
the chatbot responds and then if there's
a trigger to evaluate it, it will look
at the back and forth exchanges and then
it will score and save this to a
database. If nothing needs to be
changed, which should be the status quo,
you should not have a system prompt
changing non-stop. Otherwise, you're
going to have a very unstable app that
is very reactive to nuanced
conversations. If it decides that it
needs to update, then it will
automatically update and it will keep
going in this circle. And if it's not
clear by now, the main benefit is you
essentially have AI meta-prompting
itself, using AI to write and evaluate
its own prompts, which is usually a much
better prompt engineer than you and I.
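If it helps to see the loop as code, here's the rough shape of a single pass. This is only a sketch; every function name here is an illustrative stand-in passed in as a dependency, not the app's actual code.

```typescript
// The moving parts of one reflection pass; all names are illustrative.
type Verdict = {
  scores: Record<string, number>; // e.g. completeness, depth, tone, scope (1-5)
  decision: "pass" | "update";
  reasoning: string;
};

type Deps = {
  reflectionIsDue: () => Promise<boolean>;
  getReflectionWindow: () => Promise<unknown[]>;
  runJudge: (messages: unknown[]) => Promise<Verdict>;
  saveReflectionLog: (v: Verdict) => Promise<void>;
  rewritePrompt: (v: Verdict) => Promise<string>;
  activateNewPromptVersion: (prompt: string) => Promise<void>;
};

// One pass of the self-improvement loop: evaluate, log, and only sometimes update.
async function maybeReflect(deps: Deps) {
  if (!(await deps.reflectionIsDue())) return;     // interval elapsed? enough new messages?

  const window = await deps.getReflectionWindow(); // last N or unevaluated exchanges
  const verdict = await deps.runJudge(window);     // LLM-as-judge scores the rubric

  await deps.saveReflectionLog(verdict);           // always keep the audit trail

  if (verdict.decision === "update") {
    const next = await deps.rewritePrompt(verdict); // judge drafts the new system prompt
    await deps.activateNewPromptVersion(next);      // versioned, so it can be reverted
  }
  // A "pass" should be the status quo; a stable prompt shouldn't churn on every nuance.
}
```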
And not only that, it does give some
more richness to you as the owner of
this application cuz now you can see the
AI's thoughts, you can audit, and you
basically have a thought partner as to
how you can improve the experience of
your users on the app. Now, building the
system is doable, but it needs some
imagination and some back and forth. So,
while I've still maintained all my
chats, where I'll pull out certain
parts to show you and give you that
inspiration for how you could build
something just like this or apply this
concept elsewhere, I'll first walk you
through the entire journey of how we
went from beginning to end and I'll show
you the mega prompt that I started out
with. So, this was built over multiple
sessions. And in the first session, like
most sessions, you want to build the
foundation where you go through the big
picture and try to explain to it, you
know, this is the MCP server that you
want to connect to. This is Supabase.
You're going to use that primarily.
We're going to create different tables.
The goal is to create an experience
analyzer, and then we want to be able to
interchangeably change things in the
database, add new edge functions as
needed. So, I gave Claude Code free rein
on the first pass to build as much as it
wanted and think about any thoughtful
features that made sense for this
self-improving system. By session number
two, I noticed it created its own
features like a cooldown feature where
once the prompt was updated, it would
have a blackout period
where the prompt couldn't self-update
again for 1 hour. Engineering-wise, it
actually makes a lot of sense, but for
my purposes of testing, I had to find a
way to break it. So in session three, we
focused on safety nets. How do we manage
the grading? How do I make sure that
it's not just being nice to itself every
single time it runs a self-assessment?
Like a human sometimes can be when
they're looking for a promotion or to
keep their job, where in the self-
assessment, even if they're not that
great of an employee, they might say, "I
am awesome. I am a 10 out of 10." And in
the other sessions like session four and
onwards, I created this handoff file
where because my conversations were
getting meaty very quickly, I tried to
create this baton pass method where I
could collapse a conversation and where
I could pass it off to the next agent to
continue and go from there. Now, brace
yourself and take a deep breath because
I'm about to spend the next few minutes
breaking down this mega prompt. And
we're not going to read it line by line,
but I'll give you enough of an idea that
you can appreciate what I used as a
foundation for all of this. And don't
worry about having to screenshot to keep
up. I'll make this available to you
along with some other goodies in the
second link in the description below.
So, let's get to it. We start off by
saying, "Build a self-improving chatbot
that answers questions about an AI
consultancy and business AI
transformation. The system uses
Supabase for persistence, a fancy word
for keeping things in memory, aka a
database, and Claude 4.5 Haiku, cheap,
fast, easy to use for both the chat
agent and the self-improvement
reflection loop. Now, this is the part
where I had AI come up with a tech
stack, a series of ideas for Claude Code.
So, one of the most important things is
just feeding it 4.5 haiku and what the
model name is because most of these
models, even if they're trained as of
January 2025, still wouldn't know about
the existence of models like 4.5 haiku.
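So in the prompt I spell the model out explicitly. Here's roughly what the edge-function call can look like; the endpoint, headers, and body follow the standard Anthropic Messages API, but the model ID string is my best guess at the time of writing, so double-check the current model list before copying it.

```typescript
// Pin the exact model so the coding agent can't quietly fall back to an older one.
const MODEL = "claude-haiku-4-5"; // verify against Anthropic's current model list

// Minimal call to the Anthropic Messages API from a Supabase edge function (Deno).
async function callClaude(
  system: string,
  messages: { role: "user" | "assistant"; content: string }[],
): Promise<string> {
  const res = await fetch("https://api.anthropic.com/v1/messages", {
    method: "POST",
    headers: {
      "x-api-key": Deno.env.get("ANTHROPIC_API_KEY")!,
      "anthropic-version": "2023-06-01",
      "content-type": "application/json",
    },
    body: JSON.stringify({ model: MODEL, max_tokens: 1024, system, messages }),
  });
  if (!res.ok) throw new Error(`Anthropic API error: ${res.status}`);
  const data = await res.json();
  return data.content[0].text; // the first content block holds the assistant's reply
}
```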
And you'll notice in things like Cursor,
Claude Code, whatever it is you use,
they will always default to an old
version. So, if you ask it to use Gemini,
it'll say Gemini 1.5 or Gemini 2. Same
thing with Claude, it will use Claude
3.5. And next up, I say I have Supabase
MCP enabled. It's helpful before you
even start this session that you would
create and enable the MCP server so it's
connected and ready to go. And this is
the part where I really wanted the AI to
step in and do the heavy lifting. So it
created the database design. It created
a table for users, where it collected
information like ID and email, plus
created-at dates on the messages, which
are super important because we need to
be able to track either X amount of
messages or X amount of time horizon.
And then authentication and sessions, so
multiple chat sessions could be
persisted along with the number of
messages; we had a messages table. And
then it created other tables, like one
for system prompts. That's a no-brainer.
We have to store those over time. And a
reflection log, where we're going to
store the reflections of the AI judge
and any decisions.
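Paraphrased as TypeScript types, the data model looks roughly like this. The column names are my reading of what's described here, not the exact schema Claude Code generated.

```typescript
// Rough shape of the tables; names and columns are a paraphrase, not the real DDL.
interface User {
  id: string;
  email: string;
  created_at: string;
}

interface ChatSession {
  id: string;
  user_id: string;
  created_at: string;
}

interface Message {
  id: string;
  session_id: string;
  role: "user" | "assistant";
  content: string;
  created_at: string;          // lets reflections window by message count or by time
  evaluated_at: string | null; // stamped once the judge has seen it (my assumption)
}

interface SystemPrompt {
  version: number;
  content: string;
  is_active: boolean;
  created_at: string;
}

interface ReflectionLog {
  id: string;
  scores: Record<string, number>; // completeness, depth, tone, scope, etc. (1-5)
  decision: "pass" | "update";
  reasoning: string;
  created_at: string;
}
```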
The next part is the setup of its own
prompt, using AI to create that prompt.
So, meta-prompting on steroids. And this
says, "Create a
well-crafted initial system prompt for
the AI consultancy chatbot. It should,
A, be an expert on AI consultancy, know
about common frameworks, understand
business context," and a series of other
pieces of information. And this is
possibly the most important part where I
had Claude nerd out on what edge
functions might be needed to enable
different behaviors in the app. So we
have a chat handler that is responsible
for receiving the messages, fetching the
conversation, and calling the API,
because Anthropic's API is best called
using something like an edge function.
And then the reflection loop and how
that would work: step one, fetch the
recent messages or exchanges. Step two,
analyze with the
rubric that AI would help create the
first draft of, and we told it to create
some criteria scored from one to five. So:
response completeness, response depth,
tone appropriateness, scope adherence,
missed opportunities. Step three was the
decision framework it would use to do
so. And then step four was when to
decide to update and what the criteria
for that is. And the rest of this looks
at things like the guardrails of the app
itself, how it should evaluate itself,
examples of reflection logs, so it can
basically model after those examples.
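The scoring-plus-decision part of that reflection loop boils down to something like this sketch. The criteria names come from the rubric above; the threshold value and the exact decision rule are my assumptions, not the app's real logic.

```typescript
// Rubric scores the judge model returns, each from 1 to 5.
type RubricScores = {
  completeness: number;
  depth: number;
  tone: number;
  scope_adherence: number;
  missed_opportunities: number;
};

// Decision framework: keep the prompt stable unless the evidence is strong.
function decide(scores: RubricScores, threshold = 3.5): "pass" | "update" {
  const values = Object.values(scores);
  const avg = values.reduce((a, b) => a + b, 0) / values.length;
  const worst = Math.min(...values);

  // Update only when the average dips below the threshold or one dimension
  // fails badly; anything else is a "pass" so the system prompt doesn't churn.
  return avg < threshold || worst <= 1 ? "update" : "pass";
}
```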
And then we enter this and go from
there. Now, instead of me going through
back and forth between two different
screens to show you snippets of chats,
I'll walk through the general concepts
of what needed to happen in each session
that was in my Claude Code instance. And
what I'll also do is go back and forth
between the app so I can tell you
exactly where this issue was and how I
decided that we needed to iterate on it.
So this first session, like I said, was
all about foundations. So if I show you
a little teaser here, this MCP Supabase
execute-SQL call. This is Supabase MCP at
its finest, where it doesn't have to ask
me permission to write every single SQL
query. And it's very similar to an
experience you would imagine from
Lovable or Bolt or any one of those
other tools that are browser-based, where
it would ask you permission to execute a
bunch of SQL which even if you have no
idea what it said, it would still
necessitate you to click on and enable
that. In this case, if you're being
experimental and you're going on YOLO
mode on bypass permissions, then this
will keep running, keep executing
different queries, testing those
queries, and this is where you can even
start to apply different features. You
can have multiple agents working on the
same task, each one working on a
different part of the database. So, the
most common error that popped up was I
initially wanted a way to connect a user
to a chat. So, if we go back to the app
itself and I am to log out, I want to
just be able to enter my name, enter my
email, click on start chatting, and it
would remember me from that email as a
unique identifier. And now, if you go
in, you'll see I have all my
conversations. So, we had some initial
problems there. And then the second part
was just establishing the connection to
Supabase, making sure it was working
and making sure that it wasn't going off
the rails. Now, when I initially set
this up, I got a screen where if I
clicked on a new chat and sent a
message, one, I couldn't see the message
that I sent, let alone the fact that
when we received a response, it would
kick me out of the chat. So, many times
when you're vibe coding, you'll have
these unexpected micro behaviors that
happen that you have to account for.
Another thing that I wanted was I wanted
to have some prompts here that give you
suggestions on what you could ask so
that when you click on it, it would also
send that prompt directly. So if I click
on what should I cover in an AI
discovery workshop, this should go. It
has a little loading state and comes
back with a response. The next thing is
the response came back with a series of
hashtags, basically Markdown. And we
needed a way for the UI to render
this. So it looks something like this
where it's well put together. It's
structured. It's easy to read. So just
this scope took us to the end of the
context window of our first session. And
then we moved on to part two. Part two
is where we had a system prompt and I
wanted to see that nothing broke. It seems
like it's functional. It looks like it's
actually looking at the past chats, but
I can't tell if it's actually working
because it just says pass. So then I
tried to manipulate the rubric it came
up with itself, and realized, like I
mentioned, that it had come up with
what's called a cooldown period, where
any time it ran, it would refuse any
form of update to the system prompt
within 30 minutes, even if I had set the
review period to 5 minutes. So, I
basically had the Supabase MCP inject
its own code to
allow me to override this. So, I could
test out different versions or
permutations of the LLM as a judge
prompt. Eventually, we broke it down. We
were good to go. And I saw that if I
switched and played around with the
prompt, it was actually working. Now,
was it working well? Was it being too
nice to itself? That was the next part.
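For what it's worth, the cooldown it invented boils down to something like this check, plus the override flag I had it add so I could keep testing. The table name, column, and minute values here are illustrative, not the generated code.

```typescript
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

// Refuse to touch the system prompt if it changed within the last N minutes,
// unless the caller explicitly overrides (the "reflect now" path).
async function cooldownActive(cooldownMinutes: number, override = false): Promise<boolean> {
  if (override) return false; // bypass the blackout period when testing

  const { data } = await supabase
    .from("system_prompts")
    .select("created_at")
    .order("created_at", { ascending: false })
    .limit(1)
    .maybeSingle();

  if (!data) return false; // no prompt updates yet, nothing to cool down from
  const minutesSince = (Date.now() - new Date(data.created_at).getTime()) / 60_000;
  return minutesSince < cooldownMinutes;
}
```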
Now, when we go to the admin part of the
app, we now have a very thoughtful set
of settings where we can set the score
threshold. We can set how many messages
to evaluate and whether or not to
evaluate messages that it's seen before
or just net new messages. When it first
came up with this user interface, you
couldn't see any of this. You could just
see this reflection interval that only
had a couple settings like 1 minute, 1
hour, a couple hours, etc. So all of
this stuff lives in Superbase, meaning
the database itself already had the
functionality. All we had to do is tell
Claude Code: make it come to the fore,
let me see it on the front end so I can
manipulate it. And key thing here, make
sure that if I manipulate it on the
front end, it gets propagated or it gets
sent that same change to the database.
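As a sketch, that write-back from the admin UI can be as simple as a single update call, assuming a one-row reflection_settings table (again, my naming, not necessarily what the app uses).

```typescript
import { createClient } from "@supabase/supabase-js";

// Front-end client; replace with your project's URL and public anon key.
const supabase = createClient("https://YOUR-PROJECT.supabase.co", "YOUR_ANON_KEY");

// Persist the admin toggles so the UI state and the database never drift apart.
async function saveReflectionSettings(settings: {
  score_threshold: number;   // e.g. 3.5
  window_size: number;       // how many messages the judge looks at
  unevaluated_only: boolean; // only net-new messages vs. the last N regardless
}) {
  const { error } = await supabase
    .from("reflection_settings")
    .update(settings)
    .eq("id", 1); // single settings row
  if (error) throw error; // surface failures instead of silently faking the UI
}
```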
Don't just fool me and make it look like
I'm doing something on the front end
that isn't actually changing the back
end. So once we had these two toggles,
one to evaluate the last n number of
messages and ideally I could pick what
those messages are or unevaluated
messages, we now had a good pipeline for
reflection. And the next step was
understanding: what if I wanted to
reflect now, like I wanted to run it and
not wait five minutes or even one
minute? So then we added a reflect-now
button that will override any other
setting, so it re-reflects on whatever
number of messages I set it to. And the
goal is that it automatically checks the
system prompt, ignores the cooldown so
the cooldown doesn't block it, and
either generates suggestions or
evaluates and updates the prompt. And
you can see this right here. Here, if I
click on reflect now, it will go through
and reflect on the last system prompt
and the last few messages. In this case,
I have it set to unevaluated messages,
and there have been no new messages, so
it won't really have any form of real
change. And then we have the last 14
messages, but these two settings are
basically going against each other. Oh,
I just forgot: we actually just sent a
message. So we do have two
unevaluated messages. It did evaluate
it. And you can see right here it graded
itself as perfect. And technically what
it came back with was pretty good. So
it's not wrong. But you can see the
value of being able to tinker and design
your app so you can test and tinker and
stress test it and build that and bake
it into the back end itself. So, because
these sessions would drag on and almost
always cap out the context limit (and
you can see right here one example at
85% used), I would create this handoff
document that I called the baton pass,
basically a self-improving-build checklist. It
would go through and maintain context on
what it did, what bugs it encountered,
and basically what phase it was in and
what was completed. It would denote
something as completed with this check
mark emoji. And if there was anything
left or anything to investigate, I could
always refer to the next chat to go
through and see what the latest update
was. So, it's my hacky way of keeping
the most important pieces of context
together because yes, you can use things
like /compact to summarize the
conversation and generate a brand new
session, but many times some micro
behaviors or some really pivotal pieces
of information get left out. And the
overall goal of this was to create one
unified save state where everything
that's changing, every bug that I
encountered, everything that needed
investigation could be in one place. So
I could be the lazy person that I am and
say "refer to," tag the file (@handoff),
"and execute on all the remaining
bugs in the app." And this is a big hack
for memory management, context
management, and overall if you ever
build this project and you want to
replicate it, it's cool to bring this
artifact over to your new
project, take the code, the context,
and it'll help you build other
self-improving systems that much faster.
Now, at this point, we had finished
quite a few of the phases of the core
build. So the first pass of the entire
app was put together. And now this is
the part where I would go in test things
and realize I didn't have everything at
my disposal that I needed to test things
out. So as a good example, if I would go
to the prompts, I wanted some way to see
the system history of all the different
prompts that I had before and ideally
revert back to them. We didn't have
this. It would overwrite the existing
prompt without any form of history. So I
wanted that. So then Supabase MCP had
to listen to those requirements and
create new tables to store that. On the
reflection logs, I would only have the
last reflection. I wanted all of them.
And again, this is something that
existed in the database, but just wasn't
shown on the front end. So this
iterative process is something you can
only understand once you're in it and
you see what's missing from your initial
requirements, which is how this
suggestions tab was born because I
wanted to know how close were we
threshold-wise to not passing the test.
Like what was it that the AI noticed?
Does the AI know what to look for? And
is there a way that I can hide these or
check these off if I've completed them?
So if I say mark as addressed, I can do
that. If I'm unhappy with it, I feel
like it's not a great piece of advice, I
can just hide it entirely. And those are
the extra second, third, fourth order
features that come about when you're
actually testing out the app. And like I
said, I wanted to be able to intervene
and see what is this reflection prompt
that keeps passing with magnificent
colors. Cuz like I said before, one of
the problems was I assumed some
overconfidence. It was rating itself
like it was amazing all the time, and I
felt like it was being too nice to the
other AI. It's good to be stable, but
it's not good to be biased. So then I
created this next tab where I could
audit and see exactly what the prompt
was that was being used as the judge for
the app. You'll notice here, this is an
example of me trying to do the
following, which is create the ruthless
critic so I could purposely break it and
see whether or not it would respond. So
you can see here from the very first
line of the prompt I say you are an
impossibly critical quality assurance
system for an AI consultancy chatbot.
Your standards are unreasonably high.
You are looking for perfection and
perfection does not exist. And I
basically set it up for failure just to
test out that it would work, that it
would actually update the prompt, by
making it impossible to score anything
above a four. So, if I went back to the
dashboard and I set this to 4.5, no
matter how good the conversations were,
it would have to fail if the app was
actually working. And I know I'm pseudo
rambling here, but I wanted to walk you
through the mental model of how to build
an app and how to make sure you can
stress test whether it does what it says
it does. And after this phase of
creating version control, logging
everything, this app is not perfect.
There are still many areas, especially
if there were active users on the
platform, where I could see that it
could fail or would need more
adjustments. But the cool part is now
that we have the understanding of a
self-improving system, you don't just
have to stop here. It's not just about
improving the system prompt, you could
improve the app itself. You could create
a part of this app that would say based
on user behavior and what people are
asking for, maybe come up with and
implement a new feature or at least
draft a new feature that we could add to
this app that's maybe not chat oriented
or it's an area where you can go back
and forth and build something like a
document cuz everyone's asking for XYZ
document. The sky's the limit, and I
wanted to show you this example end to
end at least theoretically and
conceptually so you can apply it to
whatever makes sense for you. Now, if
you enjoyed this video, it took me a
while to put this mad scientist
experiment together and break it down in
a way that I could show you in an easy
way. Now, if you want to build a version
of this app on your end, then I'll give
you the mega prompt I put together in
the second link in the description below
along with a guide and a few other
goodies to help you do that. And if you
want access to this system as-is, carbon
copy, along with a series of other
systems that I'm continually building
for my community, you can check out the
first link in the description below, and
I'll see you in my early AI adopters
community. And last but not least, if
you like videos just like these where I
go very mad scientist and I try to go
against the grain and see what's
possible, please let me know down in the
comments below. It gives me feedback
that I should do more of this and that
you like it. And number two, it helps
the video and helps the channel. I'll
see you in the next one.
Join my AI community: https://bit.ly/earlyaidopters
Get the Mega Prompt + Guide: https://bit.ly/44HeWKB
Book a call: https://bit.ly/markaicoaching

---

What if your AI could improve itself? Not metaphorically - literally. In this video, I show you how to build self-improving AI systems using Claude Code and Supabase. I built a chatbot that detects its own bad responses, grades itself on a rubric, and rewrites its own prompts - all without human intervention. This isn't science fiction anymore. Once you see this pattern, you'll never build an AI app the same way again.

What's inside:
- Live demo of a self-improving chatbot
- The evaluation layer that makes AI judge itself
- How to set up Supabase MCP with Claude Code
- The mega prompt I used to build the entire system
- Safety nets to prevent overconfident self-assessments
- Version control for AI-generated prompts

---

TIMESTAMPS:
00:00 - What if AI could improve itself?
00:36 - Demo: Self-improving chatbot in action
01:10 - Admin panel: Reflection logs and versioning
02:01 - How the evaluation triggers work
03:02 - Version control and audit trails
04:31 - Traditional vs self-improving architecture
05:35 - The only two tools you need
06:06 - Teacher mode: How it's built
07:06 - The evaluation layer explained
08:00 - The feedback loop diagram
09:07 - Building session 1: Foundations
11:07 - The mega prompt breakdown
14:14 - Session 2: Breaking the cooldown
17:00 - Session 3: Admin UI and settings
19:56 - The handoff document hack
21:17 - Testing and stress testing
23:01 - Creating the ruthless critic prompt
24:16 - Beyond prompts: Self-improving features

---

#claudecode #aiautomation #selfimprovingai #supabase #vibecoding #aichatbot #promptengineering #llmasajudge #aitools #buildwithclaude #nocode #aiforbusiness #metaprompting #claudeai #aidevelopment