A few days ago, Anthropic released this
article where they have open sourced
their harness for building long-running
agents. A harness is just a
coordination layer on top of coding
agents that allows them to work for
hours and hours on a task without
overwhelming their context window. So
basically splitting work between
different agents and context windows on
the same very very large task. And I've
been fascinated with this idea recently.
And so what I want to do with you right
now is take this Anthropic harness, put
Claude Code into it, and have it work for
24 hours straight and see what results
we get at the end. And I know, I know
this is very experimental. You're not
going to do this kind of thing for a
production application, but I think this
is really exciting to experiment with.
And I really do think that in the near
future, long-running agents are going to
be something we use a lot to kick off
our coding assistants as background
tasks to start an application for us,
like building out a proof of concept, and
then we come in and keep building on top
of it. And I can tell you that after
reading through this article, which I'll
have linked in the description of
course, and trying it myself already,
the strategies here are legit. And like
I said, I've already been experimenting
with and building these kinds of things,
but I'm definitely going to be taking
inspiration from some of the strategies
that Anthropic has here. And I also
created this really neat Excalidraw diagram
to help you grasp everything that goes
into this harness. And it really is
simple overall, which I appreciate quite
a bit. And there are a lot of really
smart ideas to take out of this. Plus,
of course, you don't have to use Claude
Code. This whole harness really is just
a bunch of prompts and files. And so you
could use this with any coding
assistant: Codex, OpenCode, you pick
your poison. So more on this in a bit,
but the application that we're going to
be building over 24 hours is a copy of
claude.ai. So you're probably familiar
with this if you use Claude at all. It
also looks very similar to Claude
desktop. It's just a simple interface
for us to talk to Claude, upload files,
manage projects, and different
conversations. I chose this because
that's the demo they have in the article
that I was showing you earlier. And
whenever I read one of these articles
and they give an example of what they
can build, I always wonder like, is this
actually possible or did they do
something else to make it just look
nicer than it actually turns out? And so
I want to test this article for real by
giving Claude Code 24 hours to build
this thing. And so what you're looking
at right here is what I have from a
previous execution only after a couple
of hours. There are a lot of features
missing here if I were to poke around.
And so we're going to see after 24
hours, can we build this but completely
working with conversations and artifacts
and files and projects, all the features
that we have with Claude as the
application. Now I could go right into
the live demo and we'll do that in a
little bit, but the real value for you
is understanding how this harness works
so that you can take ideas from it to
evolve your own system for AI coding. So
I'll explain this really quick because
it is pretty simple. Overall, it relies
on the concept of test-driven
development, which is very powerful for
AI coding. We define the success
criteria, all of our tests upfront
before we do any of the actual coding.
And so then we're constantly checking
our work against this set of tests that
defines what it means to have a finished
product. It's very cool. And all the
ideas that I have in this diagram are
coming directly from this article and
then also the open source repo for the
harness, which I'll link to below. You
can start using this right away and this
is what we're going to be using for our
live demo in a bit. So with this
harness, everything starts with the app
spec text file. I typically call this a
PRD. It's basically the scope of work
for what you want to build for your MVP.
So this is the primary context that goes
into the first session that our harness
kicks off. And this is the session for
our initializer agent. The sole job of
our initializer agent is to get our
project set up. So when it finishes,
we're not going to have anything
actually working yet, but we're going to
have these four things created to then
go into our coding agents. And we'll get
into this in a little bit. And so the
first thing that our initializer agent
creates is this feature list JSON file.
This has the 200 or more test cases that
need to pass for our application to be
considered complete. And this number is
configurable in the Anthropic harness. I
know that is an insane number, but like
I said, this is very experimental. We
just want to see what it looks like to
have a coding agent run for a very long
time implementing a lot of different
things. And so all of this is just based
on our PRD, breaking it down into very
granular tasks. And then we also create
a script to initialize our project, like
getting the website spun up. We create the
scaffolding, kind of like the boilerplate,
so that it's in place for our
coding agents. And then finally we
initialize a git repository because git
is absolutely crucial for any AI coding
system. So with the initializer agent,
we've covered two out of the three core
artifacts for our harness. We have the
feature list and the initialization
script. The last core artifact here is
our Claude progress file. This is the file
that is updated at the end of every
session, giving a summary of what was
just done. This is how our initializer
agent and then each of our coding agents
are able to communicate with each other
even though we create a brand new
context window between each agent. And
so the way that the initializer agent
works is it sets up all the scaffolding
and then it gives an overview of what
it has set up in this file. Then we go to
the second session in the system which
is the first time we're running a coding
agent. And so it is going to go to the
Claude progress file to understand what was just
done by the initializer. Then it's going
to spin up the website with the init
script. And then it'll read the feature
list to figure out the first feature it
should knock out. And
then it's going to run through this
process to kind of catch itself up to
speed on what we have in the codebase,
implement the next feature, document it,
commit, and then go on to the next
loop of the coding agent. So it gets its
bearings. I often call this priming.
It's going to understand what we already
have, including reading the feature list
to figure out what to build next. It's
going to maybe do some regression
testing, making sure that previous
features are still working. That's a
really important part of this harness as
well. It'll pick the next feature,
implement and test it, and then it'll
update the Claude progress file with a summary
of what it did, and then make a git
commit. So, we have a save state after
each one of our context windows, and
then it's going to loop n number of
times. This is going to go over and over
and over again until all of the test
cases are passing in our feature list
JSON file. And what allows this to go
forever basically is just the fact that
we have this new context window every
time we go into the next coding agent.
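To make that loop concrete, here is a minimal sketch of the outer harness loop in Python. The run_agent_session helper (filled in by a later sketch) and the feature_list.json shape are my own placeholders rather than the repo's actual code, so treat this purely as an illustration of the flow.

```python
import json
from pathlib import Path

PROJECT_DIR = Path("my-claude-clone")              # placeholder project directory
FEATURE_LIST = PROJECT_DIR / "feature_list.json"   # assumed filename for the feature list


def all_features_passing() -> bool:
    """True once every test case in the feature list is marked as passing."""
    features = json.loads(FEATURE_LIST.read_text())
    return all(feature["passes"] for feature in features)


def run_harness(max_sessions: int = 100) -> None:
    # Session 1: the initializer agent sets up the feature list, init script,
    # progress file, and git repo; nothing is actually working yet.
    run_agent_session("prompts/initializer.md")  # hypothetical helper, sketched later

    # Sessions 2..n: each coding agent gets a brand new context window, reads the
    # core artifacts to get its bearings, and knocks out one feature per loop.
    for _ in range(2, max_sessions + 1):
        if all_features_passing():
            break
        run_agent_session("prompts/coding.md")
```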
And we have these core artifacts plus
the process to make it so that even
though we start a new context window
every time, we can quickly catch
ourselves up to speed on what has to be
implemented. Even do a bit of regression
testing before we now go into the next
feature. So each coding agent is very
granular and focused and that's what
makes this process pretty reliable
overall. And there are also some
guardrails that we have in place for
security. There are some validation
tools so the agent can actually use
the browser and verify things visually.
It's a pretty cool process. So we'll get
into that as we start the demo here. Now
if you thought that we're using the
Claude Code CLI for this harness, you my
friend are mistaken. We have true power
and flexibility when we interact with
Claude Code directly in our Python or
TypeScript code. And so for this
demonstration, this harness that we're
going to be using in a little bit here,
we are using the Claude Agent SDK to
create our Claude Code client and
interact with it directly in Python
code. And I really do think that this is
also the direction that we're heading
with coding assistants because it's
really easy to build our own systems
like this harness when we control things
programmatically. And there are a lot of
other AI coding assistants, like Codex
and OpenCode, that are also coming out
with SDKs. And I even included the
Codex SDK along with the Claude Agent
SDK in the remote agentic coding system
that I covered in the live stream on
Saturday, which is very exciting by the
way. But I've done a lot of
experimentation with this. So that's why
I'm saying that I could very easily take
this repository, swap out Claude Code for
Codex, and then still use the exact
same harness, because it is just those
artifact files and then the prompts
that I'll show you later as well. So
with that, let's actually get into the
repository here and spin up our demo.
All right, so I have the repository
cloned locally and going through the
prerequisites here to get everything
ready to run is really, really
straightforward. The only thing that I
tweaked in my local version is I
absolutely do not want to use my
Anthropic API key. That is going to
charge me out the wazoo with the task
running this long. I definitely want to
take advantage of my Claude Max subscription
and use my Claude subscription token.
And so the way that you get that is you
run claude setup-token. It'll walk you
through a little OAuth flow and give you
a token that you can set as the
CLAUDE_CODE_OAUTH_TOKEN environment
variable. So you just have to set that
instead of your Anthropic API key. And
then I changed the code right here, and
I also removed something at the top
of the client function. So that's all
you have to do if you want to replicate
what I set up here. Otherwise, you can
of course use your Anthropic API key.
Just get ready to pay a good amount,
especially if you're using Opus 4.5
for your coding agent.
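If you want to sanity-check that setup in code, here is a tiny, hypothetical snippet (not from the repo) that makes sure the subscription token is what gets picked up rather than an API key. The environment variable name is my assumption based on what claude setup-token produces.

```python
import os
import sys

# Assumption: Claude Code / the Claude Agent SDK reads the subscription token
# from CLAUDE_CODE_OAUTH_TOKEN, which is what `claude setup-token` generates.
if not os.environ.get("CLAUDE_CODE_OAUTH_TOKEN"):
    sys.exit("Set CLAUDE_CODE_OAUTH_TOKEN (run claude setup-token) before starting the harness.")

# Drop any API key from the environment so the run bills against the subscription
# instead of pay-as-you-go API usage.
os.environ.pop("ANTHROPIC_API_KEY", None)
```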
Other than that, though, I just followed
all the steps in the readme here. And so it got to the
point where now I can run this command
right here. So I run the autonomous
agent demo and then all I have to select
is the directory that I want to build
this project in. And so I'm just using
the exact same app spec that came with
this repository because like I said, I
want to clone Claude just like they have
in the article. So they already had this
full prompt created for me, which is
very, very detailed. Obviously, we need
something very detailed in order to
create that feature list JSON file that
has all of these different features and
test cases that we need passing for the
app to be complete. So this is my app
spec. I'm going to send this in as
context for the application build. And
then once I kick things off, that's when
I want to kind of explore the prompts
more with you and also talk about how we
are invoking the claude agent SDK. But
right now, let's go ahead and spin this
off. And this begins the 24 hours for
our demo here. So, autonomous coding
agent demo. We're using Opus 4.5 for our
model. And we are going to start with
our initializer agent. So, this is
session number one, just like I was
showing in the Excalidraw diagram, and we
are using this appspec as our PRD that
outlines all of the features that we
want to build. And so all the tool calls
that we typically see within the Claude
Code CLI, we can see these here. It
doesn't look quite as pretty, but we can
watch it work. And so this is going
to go for a while. The first session
does take 10 to 20 minutes because it
has to generate those 200 very detailed
test cases. So, we have to be patient
and that's why I'm going to take this as
an opportunity to show you around the
prompts more and then we'll come back to
this when we get to the first coding
agent session. So, the big question you
have right now is where do the prompts
come from? Because we're not using the
Claude Code CLI. We're not entering
anything in ourselves or using any kind
of slash command. But that my friend is
the beauty of the Claude Agent SDK,
because within our Python code we are
loading the initializer prompt from
right here and then every session after
that is just going to be loading in this
coding prompt. So they're just markdown
documents just like our rules and
commands that you're typically working
with in your AI coding assistants. And
so for our initializer here, everything
that we're about to go through should
ring a lot of bells because we're just
working with the same process that we
outlined in the Excalidraw diagram. And so we're
giving it context like: you are the
first agent in a long-running autonomous
development process. And so you start by
reading the appspec or the PRD. This
contains the complete specification for
what you need to build. And then based
on that PRD, you're going to create that
massive feature list JSON file that I
showed you briefly earlier. So it's very
very structured for every single feature
that we need to build. We have the
category, the description, the steps to
validate the feature. This is really
cool. And then also just true or false.
Is this currently passing? So whenever a
coding agent knocks out one of these
features, it just goes back in the
feature list JSON and changes this from
false to true. So if I scroll down
to the very bottom here, and this is from an
old run by the way, it would just
change this from false to true. And then
we create that initialization script so
the coding agent can spin up the website
every time we have a new context window.
It initializes the git repository and
then creates the boilerplate for our
application structure. And so then the
coding agents already have at least
something to work on top of even though
nothing's actually working in the app at
this point. And then finally, it creates
that Claude progress file to give a
summary of what it set up. And so this
is an example of the Claude progress file
right here. This is completely wiped and
redone every time we reach the end of
a session. So that is the initializer.
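To make those artifacts concrete, here is a rough Python illustration of how one feature entry might be shaped and updated. The field names (category, description, steps, passes) follow what was just described, but the exact schema in the repo may differ, and the entry itself is invented.

```python
import json
from pathlib import Path

FEATURE_LIST = Path("feature_list.json")

# One entry per test case: a category, a description, the validation steps,
# and a passes flag that starts out false. Illustrative shape only.
example_feature = {
    "category": "conversations",
    "description": "User can rename an existing conversation",
    "steps": [
        "Open an existing conversation",
        "Click the rename control and enter a new title",
        "Verify the sidebar shows the new title after a reload",
    ],
    "passes": False,
}


def mark_feature_passing(description: str) -> None:
    """Flip passes from false to true for one feature, the only field agents may edit."""
    features = json.loads(FEATURE_LIST.read_text())
    for feature in features:
        if feature["description"] == description:
            feature["passes"] = True
    FEATURE_LIST.write_text(json.dumps(features, indent=2))


def progress_report() -> float:
    """Percentage of features currently passing."""
    features = json.loads(FEATURE_LIST.read_text())
    return 100 * sum(f["passes"] for f in features) / len(features)
```

The progress percentage printed after each session later in the video is, at least roughly, the same calculation as progress_report here.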
And if we go look at our terminal right
here, it is still running. So at this
point, we've created our feature set.
Now we're just working on some of these
other things like creating the initial
project directory structure. So, it's
working through everything that we just
saw in the initializer prompt. Now, once
the coding agents start running, that's
when we use our coding prompt. And so,
this one is a little bit more
complicated, but it's still not too bad
overall. And so, we obviously have to
start by getting our bearings because
we're dropping a fresh context window
into an existing project. And so there's
a couple of commands that we want to run
just like understand our PRD. Look at
the feature list to see what we should
build next. Look at the Claude progress file
so we can see what the initializer agent
did if this is our second session.
Otherwise, we're looking at what the
coding agent did in the previous run.
Taking advantage of the git history as
well. Like I said, git is a very crucial
part of our process and our harness
here. And then we're going to start up
the servers with the initialization
script. So going back to our diagram
here, we are reading all of our core
artifacts and taking advantage of those
in our agent loop here. And then we're
going to do some verification. So before
we do anything new, it's going to do a
little bit of regression testing. So
just spot checking here. We look at a
couple of the more recently implemented
features that are marked as true for
passes in the feature list JSON
here. And it's just going to make sure
that those things are still working. And
this is really important because as
we're building out so much code in our
project here, we might be breaking old
things. So I really appreciate that
regression testing is built into this.
Now, if there are any issues that are
found, then we're going to address them.
So go back to the feature JSON, mark it
as false, fix the issue, and then go
through the steps again to verify that
everything is working. And we have the
Puppeteer MCP server attached to the
coding assistant. If we go and look at
how everything is configured here,
we're giving it the Puppeteer MCP
server so it can actually go and verify
that things are working on the website.
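For a rough idea of what attaching that looks like in code, the config below uses the reference @modelcontextprotocol/server-puppeteer package as an assumption; the repo may wire up its browser tooling differently.

```python
# Hypothetical MCP server config handed to the Claude Agent SDK so the agent
# gets browser tools (navigate, click, screenshot) it can call during validation.
puppeteer_mcp_servers = {
    "puppeteer": {
        "command": "npx",
        "args": ["-y", "@modelcontextprotocol/server-puppeteer"],
    }
}
```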
Like this is what we have right now for
our initializer agent running. It
actually spun up the website to make
sure that our little hello world
boilerplate is working. And so as it's
building out the application, it can
even validate things visually, which is
very, very powerful. So we'll see a lot
more of this as it's building things
out. And then to finish off the coding
prompt here, obviously once it does all
of its regression testing, it's going to
choose one feature from our JSON file.
basically the next one
that has passes set to false. And then it's
going to implement it, test it, go back
and mark it as complete after it does
even like the full browser automation
validation with Puppeteer like we just
saw. And we have clear instructions
here. Make sure that you only update
the passes field; you cannot change
anything else, because one thing that
coding assistants do a lot is they get
lazy and say, oh, I did the
first four steps of validation, I don't
need to do the last one, so let me just
remove it. We're making sure we avoid
that by saying you cannot update the
steps, you can only update false to true.
And then of course our very last step is
to make that save state with git, so we
are committing our progress and then
updating the Claude progress file so that the
next coding agent can read through
exactly what was done in the last loop.
And so then we end our session, and the
miscellaneous instructions. That's
really the end of the process that we
have for the coding agent and we're just
going to be running this over and over
and over again. So at this point we've
gone through all the prompting for this
system. It really is simple. We rely on
these core artifacts. We have one prompt
for the initializer and then one prompt
for all of our coding agents. And by the
way, if you want to understand best
practices for security, permissions,
things like that, for defining our
agents in code, this repository is
fantastic for this. I don't want to dive
into the code too much, but it's worth
showing you a little bit how this works
with the Claude Agent SDK. So, when we
create our client, first of all, we pass
in a project directory. The coding agent
is only able to operate in this project
directory, which is already a good layer
of security, right? The file operations
are restricted to the project directory
only. We have our sandbox environment.
And then we have our permissions where
we just accept all edits. So we don't
need human approval for changes.
Obviously that would not work for an
autonomous system. And then we define
the specific commands that Claude Code
is able to run in the Claude Agent SDK.
Reading files, writing files, using the
Puppeteer MCP server for the browser
automation, all that good stuff. And it
even goes so far as having a hook. So
every time we use a tool, if it is
running a bash command, we also have
this entirely separate Python script
that manages the different kinds of bash
commands that we're allowing Claude Code
to run. So it can't do things like
delete directories or work outside of
our current codebase. And so this is
really, really powerful. It gets quite
technical, but if you are more technical
and you want to understand, for example,
how it can end processes without killing
itself, it's definitely worth
diving into this.
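To give a flavor of that guardrail, here is a simplified, standalone version of the kind of bash-command filter described. The blocked commands and paths are examples I made up, and the repo's actual script is more thorough.

```python
import shlex

# Commands we never want the agent to run, and the path prefix it must stay inside.
BLOCKED_COMMANDS = {"rm", "sudo", "shutdown", "reboot"}
ALLOWED_ROOT = "/path/to/my-claude-clone"  # assumption: the project directory


def is_bash_command_allowed(command: str) -> bool:
    """Return False for destructive commands or ones that reach outside the project."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        return False  # unparseable commands are rejected outright
    if not tokens or tokens[0] in BLOCKED_COMMANDS:
        return False
    # Reject absolute paths that point outside the project directory.
    for token in tokens[1:]:
        if token.startswith("/") and not token.startswith(ALLOWED_ROOT):
            return False
    return True
```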
And then, finishing things off for the client here, we have
the MCP server for Puppeteer. The
Puppeteer MCP server, by the way,
takes a while to validate things because
the agent spins up the browser, waits
for things to load, then clicks on a
button and waits for that to load. And
so we go for 24 hours, but it's not like
we're spending millions and millions of
tokens because we are waiting quite a
while for all the browser automation
stuff. But yeah, we have our system
prompt that we can customize as well,
which is really cool. We define our
model, all the allowed tools and the
hooks and things, our current working
directory. There's so much configuration
that we have for the cloud agent SDK,
which is part of why we need something
like this when we have this harness that
we want to control so much. And so when
we actually invoke the agent, if it is
the first run, we're going to get the
initializer prompt. Otherwise, we get
the coding prompt. It's just reading
these in from these markdown documents.
And then we send that into this function
right here, which leverages the client
that I showed you just now, how we
define. So it runs this query here with
the latest message, which this is going
to be the message that's just loaded in
from one of these markdown documents. We
stream out all of the text and tool
calls and we put that out to the
terminal. That's exactly what we're
looking at right here as we have our
autonomous agent running. All right, our
initializer agent has finished and we
have the first commit for our initial
setup. And one thing that this harness
that this application does after every
session is it automatically runs all the
tests to give us a progress update. Now
obviously right here we have a progress
report of 0% because the initializer
agent is only creating the foundation.
it is not responsible for doing any
feature development that would make one
of these tests pass. And so that brings
us to the second session or the first
time we run our coding agent. It is
going to create that secured sandbox
environment and then go through the
exact process that we saw in the coding
agent prompt. So it does its prime here
where it lists out the files, looks at
the PRD and the cloud progress from the
initializer agent, looks at the git log
and the feature list, all these things
that we've already gone over. It spins
up the website with that initialization
script and then we can see it using the
Puppeteer MCP server to visualize
things, even take a screenshot. It's
looking at the API. You just saw it go
there like just very briefly. It was
doing a little bit of work and we can
also see the website more plainly if we
just like visit it in our own browser as
well. So this is my personal browser and
then what we're looking at right here,
there's nothing being shown right now,
but this is the browser that the agent
is currently operating in. So this is
what we have built so far with
the first coding agent run. So it's kind
of just creating the initial user
interface at this point. I'm not going
to click through things right now, but
most of this stuff is probably not
working at all because we have just
gotten started with the process here.
And the first agent is going to be very
granular in what it builds. So all this
output here, I know that it's a little
ugly, but it's pretty much what we see
in the Claude Code CLI. And so it just
goes through the first feature and does
all the testing it needs. And then in a
little bit here, it'll probably update
the Claude progress file and then move on to
our second coding agent session. And so
at this point, I've covered everything
that I want to cover. And so what I'm
going to do here is pause and then next
time you see me, the 24 hours is going
to be up. We're going to come back to
the terminal here, see what session
we're on, see how many tests are
passing, and then we'll also see what
our application looks like. So, I'm
doing this live with you. I have no idea
how it's going to turn out. I've
tested the same thing, but only with a
few hours. That's all I gave it. And so,
yeah, let's come back together and see
how it shapes up. All right, we are now
at the 24-hour mark, and we have gotten
to the 54th coding agent session.
Absolutely crazy. I have no idea how
many tokens I've used for this, but it
is probably a lot. Thank goodness I'm
using my Claude subscription. And we
have 54% of our tests passing at this
point, which after an entire day might
not seem like a very high percentage,
but we have given it a lot of different
features to work through and implement.
So having over 100 of the tests passing
at this point is pretty cool. And going
to our browser here, this is the website
that I have spun up in my own browser,
not the testing one with Puppeteer. It
is pretty impressive everything that we
have built out right now. And honestly,
I don't even know what the last half of
the tests are for because it already
feels like I have a completely
functional clone of claude.ai. And it's
cool. We can go through all of the past
conversations that it generated as it
was verifying things through each of the
loops. We have really nice markdown
formatting. We are able to create these
different HTML pages and even write and
execute code. We have a settings page where
we can change the theme and our default
model and a slider for the max tokens.
This is just so, so feature-rich.
It's a lot more than if you were to
just ask it in a single prompt to make a
clone, because it's not going to build out
all this functionality. Even being able
to see the token count for the responses
and I'll create a new chat here and send
in something myself. So I can see the
number of characters and estimated
tokens for my own prompts. And you can
see here that the UI isn't perfect.
And so I definitely didn't expect this
to be perfect. We still want to come in
and add a human in the loop. But it
still is really cool how much I was able
to build here without laying a finger on
anything. And honestly, who knows? If I
let it go for another day and it goes
through all of the features, maybe this
would be an absolutely perfect
application, because it really is crazy the
kinds of things we can build with Claude
Opus 4.5. And then having a harness like
this to let it do so much validation and
iteration on an application. And also
it's really cool to look through the
feature list JSON as well. So passes is
true for a ton of these different
features now. So I have to scroll all
the way down to the middle to start
seeing the ones that are false for
passes. So now we start to see the next
things that we have to work on. And
these are very very specific scroll bars
and mobile styling and dividers, like
we're getting to the very nitty-gritty
details of making a very complete Claude
clone. And then in the Claude progress file,
we can see what happened in the previous
session and then even an overview of
what happened in sessions before that.
One thing that's really confusing here
is I don't know why it says session 34
when we know that we're on session 54.
So like the harness seems to have veered
off a little bit, but it still is
knocking out feature after feature
pretty quickly. I've been watching the
logs towards the end of the execution,
and one of the things I was really
nervous about is that it would
work really well for the first, you
know, 10-20 sessions, but then
start to hallucinate a lot, go
through features willy-nilly, and
totally mess up the Claude progress
file. But overall it seems to
be very aligned even as we go through
dozens of sessions. So, I got to hand it
to Anthropic here. Overall, I'm very
impressed. Not that I haven't had these
kinds of long-running sessions work before,
but being able to do something really
out of the box with a resource that they
have open sourced is really fantastic.
And so I would encourage you to try this
out yourself. Clone this repository that
again I will have linked in the
description. You can even go within the
app spec here and change what you want
to build. So if you don't just want to
follow along with what they show here
and build the claude clone, you can
build any application that you want, any
kind of backend, any kind of front end,
it works with all of this. You just
probably want to have some kind of user
interface to take advantage of the
Puppeteer MCP server integration. But
otherwise, it's really up to you what
you want to use this harness to create.
And so, I hope that you appreciated this
video and it gave you some ideas for
ways that you can build this kind of
harness into your own AI coding system.
So, if you appreciated this video and
you're curious how you can take these
kinds of ideas and use them to build
your own system for AI coding,
definitely check out the Dynamous Agentic
Coding course that I'll have a link to
in the description and the pin comment.
This is the best resource you'll find on
the internet for learning how to build
reliable and repeatable workflows for AI
coding. So, definitely check it out and
I will see you in the next video.
Ever wondered what the best AI coding tool in the world (Claude Code) could create if you gave it a full 24 hours and tools to validate its own work? Well that's exactly what I do in this video, and I was VERY impressed with the results to say the least. Coding agents usually don't work for 24 hours straight - at some point they just decide they're done and return control back to you. But with this "harness" Anthropic has created (and open sourced!) for long running agents, we can give it a super complex task and have it rip through all the features over hours and hours. Practical? Maybe not yet - it's very experimental. But boy is this fascinating to get a glimpse into the future of agentic coding.

~~~~~~~~~~~~~~~~~~~~~~~~~~

- The Dynamous Agentic Coding Course is now FULLY released - learn how to build reliable and repeatable systems for AI coding: https://dynamous.ai/agentic-coding-course
- Anthropic's Article on Harnesses for Long Running Agents: https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents
- GitHub Repo for the Harness I Use in the Video: https://github.com/anthropics/claude-quickstarts/tree/main/autonomous-coding

~~~~~~~~~~~~~~~~~~~~~~~~~~

00:00 - Introducing Anthropic's Harness for Long Running Agents
01:19 - What I Have in Store for You Today
02:47 - How the Heck does the Harness Work?
08:42 - Setting Up the Coding Agent Harness
10:34 - Kicking Off the 24 Hour Agent Execution!
12:26 - Diving into the Initializer Agent Prompt
13:46 - Understanding the Coding Agent Prompt
17:14 - Security & Flexibility Built into the Harness
19:58 - First Coding Agent Started!
22:18 - Final Results after 24 Hours of Coding NONSTOP
25:29 - This Harness is Legit

~~~~~~~~~~~~~~~~~~~~~~~~~~

Join me as I push the limits of what is possible with AI. I'll be uploading videos weekly - at least every Wednesday at 7:00 PM CDT!