Hi there, this is Christian from LangChain.
Today we're fixing one of the most
frustrating problems in agent
development: tools that fail at the
worst possible moment. Let's be honest,
we've all been in the situation where we
want to demo our shiny new agent, but
then random API calls just fail, or a
third-party integration just throws a
random 500. In a typical agent, a single
tool call failure might cause the entire
result to become garbage. And even in
cases where your model is able to
recover from it, it will cost you a lot
more tool calls and tokens to actually
produce quality results, and that's just
not acceptable in a production
environment. That's exactly why we
developed a tool retry middleware to
counter these types of problems.
In this video, I will show you how to
automatically retry flaky tool calls and
how you can retry only on specific
errors. This will make your agent more
resilient, reliable, and production
ready. Let's dive in.
Now, let's take a look at the
LangChain docs. If you open
docs.langchain.com,
you can find, under the TypeScript
section, all the built-in middlewares that
the langchain package provides. One of
these middlewares is the
tool retry middleware, which you can
import directly from the langchain
package. The middleware allows a couple
of configurations. One is
max retries, which is essentially the
maximum number of times we retry the
tool call. Then you are able to define
the tools that you want to retry. This
can be a list of tool instances or the
tool names themselves.
You can configure a callback that is
called every time a tool call is going
to be retried, and you have a callback
on failure that lets you run a
function whenever all the retries are
exhausted.
Additionally, you can define a
backoff factor, which increases the delay
between retried tool calls. You can define
an initial delay in milliseconds as well
as a max delay in milliseconds, and a
jitter value, which adds randomness to
each retry delay.
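Put together, those timing options produce delays like the following. This is a minimal, self-contained sketch of the standard exponential-backoff-with-jitter formula, not the middleware's actual source; the function and parameter names here are my own:

```typescript
// Compute the delay before retry attempt `attempt` (0-based).
// base = min(initialDelayMs * backoffFactor^attempt, maxDelayMs),
// then widened by up to ±(jitter * base) of randomness.
function computeRetryDelay(
  attempt: number,
  initialDelayMs = 1000,
  backoffFactor = 2,
  maxDelayMs = 8000,
  jitter = 0,
): number {
  const base = Math.min(
    initialDelayMs * Math.pow(backoffFactor, attempt),
    maxDelayMs,
  );
  // jitter = 0.1 means the final delay lands within ±10% of the base delay
  const randomOffset = (Math.random() * 2 - 1) * jitter * base;
  return base + randomOffset;
}
```

With the values used later in the video (factor 2, initial delay 1s, cap 8s, no jitter), attempts 0 through 4 would wait 1000, 2000, 4000, 8000, and 8000 milliseconds.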
Now, if we jump into our agent sandbox,
we have one scenario where we look into
the tool retry behavior. If you look
into the code, we again define a Next.js
endpoint that passes along a message to
our tool retry agent. That tool retry
agent defines a model and a get weather
tool that simulates network failures:
the function fails for the first two
calls and then succeeds on the third
one.
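A flaky tool like that can be simulated with a small call counter. This is a self-contained sketch of the failure pattern described above, not the sandbox's actual code; `NetworkError` and `getWeather` are illustrative names:

```typescript
// Custom error class so a retry predicate can match on it specifically.
class NetworkError extends Error {
  constructor(message: string) {
    super(message);
    this.name = "NetworkError";
  }
}

let callCount = 0;

// Simulated weather tool: throws a NetworkError on the first
// two invocations, then succeeds on every call after that.
async function getWeather(city: string): Promise<string> {
  callCount++;
  if (callCount <= 2) {
    throw new NetworkError(`request ${callCount} to weather API failed`);
  }
  return `It is sunny in ${city}.`;
}
```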
So we define our agent: we pass in
the model, we pass in the get weather
tool, and we define our tool retry
middleware with three retries. That means
two failures happen, we retry
again, and then our tool
succeeds. We have a retry-on
function that allows us to define when
we retry the tool, and in this case that
only happens on a network error.
We set our backoff factor to two. So the
first delay is going to be 1 second, the
second retry delay is going to
be 2 seconds, the next one 4
seconds, and so on. Our initial delay for
the first retry will be 1 second, but
we are capping the delay at 8 seconds.
We introduce a little jitter to create
some randomness, and on failure, in
case all retries are exhausted,
we return an error message to the
agent. Now, if we try this out, you're
going to see that the tool call is a
little bit delayed.
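Conceptually, what the middleware does for each tool call is a loop like the following. This is a plain TypeScript sketch of the retry behavior described above, not the middleware's actual implementation, and the helper names are hypothetical:

```typescript
// Retry `fn` up to `maxRetries` additional times, but only when
// `retryOn` says the error is retryable; wait with exponential
// backoff (capped at maxDelayMs) between attempts.
async function retryCall<T>(
  fn: () => Promise<T>,
  maxRetries: number,
  retryOn: (err: unknown) => boolean,
  initialDelayMs = 1000,
  backoffFactor = 2,
  maxDelayMs = 8000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      // Non-retryable errors (e.g. bad input) are rethrown immediately,
      // as are errors once the retry budget is exhausted.
      if (!retryOn(err) || attempt === maxRetries) throw err;
      const delay = Math.min(
        initialDelayMs * Math.pow(backoffFactor, attempt),
        maxDelayMs,
      );
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
}
```

Wrapping the flaky weather call from before in `retryCall` with a predicate like `(e) => e instanceof NetworkError` reproduces the scenario in the video: two failed attempts, two backoff waits, then a successful third call.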
It takes a second because it retries, but
then ultimately succeeds in fetching the
weather for both tool calls. So by adding
just a simple tool retry middleware, we
are able to retry tool calls that
randomly fail and make our agent more
stable. All right, that's the tool retry
middleware in action. We took a flaky,
failure-prone weather API, made it fail
on purpose, and watched the agent recover
automatically using retries, exponential
backoff, and jitter. This is exactly the
type of reliability you need once your
agents start interacting with real-world
systems: systems that time out or break
randomly for reasons we may never
understand. This middleware allows your
agent to stay resilient, avoid
crashing workflows, and handle failures
the way real production apps should. No
custom retry loops, no manual error
handling, just a clean, centralized way
to make every tool and agent more
robust. If you want to explore this
example or adapt it to your own tools,
you can check out the full source code
down below. I would love to see what
you're building with it next. See you in
the next video.
Tools fail. APIs time out. Services throw random 500s. If your agent can’t recover, your entire workflow collapses. In this tutorial, Christian Bromann walks through how to use Tool Retry Middleware in LangChainJS to build agents that are truly production-ready.

You'll learn:
✔️ How to automatically retry failed tool calls
✔️ How exponential backoff + jitter prevent cascading failures
✔️ How to retry only for specific error types (e.g., NetworkError)
✔️ How to gracefully handle situations where retries are exhausted
✔️ How to make ANY API-dependent tool more reliable — weather APIs, search endpoints, databases, internal services, and more

We also build a real-world demo where a weather API intentionally fails the first two times, letting you see the middleware recover automatically.

Code Example: https://github.com/christian-bromann/langchat