Hi there, this is Christian from LangChain.
Today we're fixing one of the most
frustrating problems in agent
development: tools that fail at the
worst possible moment. Let's be honest,
we've all been in the situation where we
want to demo our shiny new agent, but
then random API calls just fail, or a
third-party integration just throws a
random 500. In a typical agent, a single
tool call failure might cause the entire
result to become garbage. And even in
cases where your model is able to
recover from it, it will cost you a lot
more tool calls and tokens to actually
produce quality results, and that's just
not acceptable in a production
environment. That's exactly why we
developed a tool retry middleware to
counter these types of problems.
In this video, I will show you how to
automatically retry flaky tool calls and
how you can retry only on specific
errors. This will make your agent more
resilient, reliable, and production
ready. Let's dive in.
Now, let's take a look at the
LangChain docs. If you open
docs.langchain.com,
you can find, under the TypeScript
section, all the built-in middlewares that
the langchain package provides. One of
these middlewares is the
tool retry middleware, which you can
import directly from the langchain
package. The middleware allows a couple
of configurations. One is
max retries, which is essentially the
maximum number of times we retry the
tool call. Then you are able to define
the tools that you want to retry. This
can be a list of tool instances or the
tool names themselves.
You can configure a callback that is
called every time a tool call is going
to be retried, and you have a callback
on failure that lets you run a
function whenever all the retries are
exhausted.
Additionally, you can define a
backoff factor, which increases the delay
between retried tool calls. You can define
an initial delay in milliseconds as well
as a max delay in milliseconds, and a
jitter value, which adds randomness to
each retry delay.
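Put together, those timing options produce delays like the following. This is a minimal, self-contained sketch of the standard exponential-backoff-with-jitter formula, not the middleware's actual source; the function and parameter names here are my own:

```typescript
// Compute the delay before retry attempt `attempt` (0-based).
// base = min(initialDelayMs * backoffFactor^attempt, maxDelayMs),
// then widened by up to ±(jitter * base) of randomness.
function computeRetryDelay(
  attempt: number,
  initialDelayMs = 1000,
  backoffFactor = 2,
  maxDelayMs = 8000,
  jitter = 0,
): number {
  const base = Math.min(
    initialDelayMs * Math.pow(backoffFactor, attempt),
    maxDelayMs,
  );
  // jitter = 0.1 means the final delay lands within ±10% of the base delay
  const randomOffset = (Math.random() * 2 - 1) * jitter * base;
  return base + randomOffset;
}
```

With the values used later in the video (factor 2, initial delay 1s, cap 8s, no jitter), attempts 0 through 4 would wait 1000, 2000, 4000, 8000, and 8000 milliseconds.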
Now, if we jump into our agent sandbox,
we have one scenario where we look into
the tool retry behavior. If you look
into the code, we again define a Next.js
endpoint that passes along a message to
our tool retry agent. That tool retry
agent defines a model and a get weather
tool that simulates network failures:
the function fails for the first two
calls and then succeeds on the third
one.
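A flaky tool like that can be simulated with a small call counter. This is a self-contained sketch of the failure pattern described above, not the sandbox's actual code; `NetworkError` and `getWeather` are illustrative names:

```typescript
// Custom error class so a retry predicate can match on it specifically.
class NetworkError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NetworkError";
  }
}

let callCount = 0;

// Simulated weather tool: throws a NetworkError on the first
// two invocations, then succeeds on every call after that.
async function getWeather(city: string): Promise<string> {
  callCount++;
  if (callCount <= 2) {
    throw new NetworkError(`request ${callCount} to weather API failed`);
  }
  return `It is sunny in ${city}.`;
}
```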
So we define our agent: we pass in
the model, we pass in the get weather
tool, and we define our tool retry
middleware with three retries. That means
two failures happen, we retry
again, and then our tool
succeeds. We have a retry-on
function that allows us to define when
we retry the tool, and in this case that
only happens on a network error.
We set our backoff factor to two. So the
first delay is going to be 1 second, the
second retry delay is going to
be 2 seconds, the next one 4
seconds, and so on. Our initial delay for
the first retry will be 1 second, but
we are capping the delay at 8 seconds.
We introduce a little jitter to create
some randomness, and on failure, in
case all retries are exhausted,
we return an error message to the
agent. Now, if we try this out, you're
going to see that the tool call is a
little bit delayed.
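Conceptually, what the middleware does for each tool call is a loop like the following. This is a plain TypeScript sketch of the retry behavior described above, not the middleware's actual implementation, and the helper names are hypothetical:

```typescript
// Retry `fn` up to `maxRetries` additional times, but only when
// `retryOn` says the error is retryable; wait with exponential
// backoff (capped at maxDelayMs) between attempts.
async function retryCall<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  retryOn: (err: unknown) => boolean,
  initialDelayMs = 1000,
  backoffFactor = 2,
  maxDelayMs = 8000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-retryable errors (e.g. bad input) are rethrown immediately,
      // as are errors once the retry budget is exhausted.
      if (!retryOn(err) || attempt === maxRetries) throw err;
      const delay = Math.min(
        initialDelayMs * Math.pow(backoffFactor, attempt),
        maxDelayMs,
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping the flaky weather call from before in `retryCall` with a predicate like `(e) => e instanceof NetworkError` reproduces the scenario in the video: two failed attempts, two backoff waits, then a successful third call.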
It takes a second because it retries, but
then ultimately succeeds in fetching the
weather for both tool calls. So by adding
just a simple tool retry middleware, we
are able to retry tool calls that
randomly fail and make our agent more
stable. All right, that's the tool retry
middleware in action. We took a flaky,
failure-prone weather API, made it fail
on purpose, and watched the agent recover
automatically using retries, exponential
backoff, and jitter. This is exactly the
type of reliability you need once your
agents start interacting with real-world
systems: systems that time out or break
randomly for reasons we may never
understand. This middleware allows your
agent to stay resilient, avoid
crashing workflows, and handle failures
the way real production apps should. No
custom retry loops, no manual error
handling, just a clean, centralized way
to make every tool and agent more
robust. If you want to explore this
example or adapt it to your own tools,
you can check out the full source code
down below. I would love to see what
you're building with it next. See you in
the next video.
Tools fail. APIs time out. Services throw random 500s. If your agent can’t recover, your entire workflow collapses. In this tutorial, Christian Bromann walks through how to use Tool Retry Middleware in LangChainJS to build agents that are truly production-ready.

You'll learn:
✔️ How to automatically retry failed tool calls
✔️ How exponential backoff + jitter prevent cascading failures
✔️ How to retry only for specific error types (e.g., NetworkError)
✔️ How to gracefully handle situations where retries are exhausted
✔️ How to make ANY API-dependent tool more reliable — weather APIs, search endpoints, databases, internal services, and more

We also build a real-world demo where a weather API intentionally fails the first two times, letting you see the middleware recover automatically.

Code Example: https://github.com/christian-bromann/langchat