Voice is one of the most natural ways to interact with AI. And as the models get better, I'm excited about the new use cases and interaction patterns they're going to unlock, especially in industries like education and customer service. It's surprisingly easy to get started building a voice agent, so let's go through that in this video. I'm Tannushri, and I'm going to show you how to build a voice agent, specifically a French tutor, with a framework called Pipecat. I'm going to walk through how it works end to end, and we've also hooked up observability into LangSmith so we can peel back the layers and show you what happens in each step of your voice agent.
So let's start with an overview of how this voice agent works. There are three main steps in the voice agent. There's speech-to-text, or STT. There's the LLM call, which is text in and text out; it's just a regular text-based model. And lastly there's the text-to-speech step, which takes the model's text reply and turns it into audio.
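To make that loop concrete, here's a minimal sketch of one conversational turn in plain Python. The helper functions (transcribe, chat, synthesize) are hypothetical placeholders for whatever STT, LLM, and TTS services you wire in; this is just the shape of a turn, not Pipecat's actual API.

```python
# Conceptual sketch of one voice-agent turn. The three helpers below are
# hypothetical placeholders, not Pipecat APIs.

def transcribe(audio_bytes: bytes) -> str:
    """Speech-to-text: turn the user's audio into a transcript."""
    ...

def chat(history: list[dict], user_text: str) -> str:
    """LLM call: plain text in, plain text out."""
    ...

def synthesize(text: str) -> bytes:
    """Text-to-speech: turn the reply into audio to play back."""
    ...

def one_turn(history: list[dict], audio_bytes: bytes) -> bytes:
    user_text = transcribe(audio_bytes)       # 1. STT
    reply_text = chat(history, user_text)     # 2. LLM (text in, text out)
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": reply_text})
    return synthesize(reply_text)             # 3. TTS
```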
So I'll show you a quick demo of an agent I've built. I'm learning French, and this is a French tutor. Let's give it a whirl, and I can show you what it looks like.
Cool. Let's take a look at the resulting trace in LangSmith so we can see exactly what happened in each step.
All right. What's really nice is that these traces are laid out exactly as I showed in the diagram earlier. You can see there's one turn of this conversation. This is the speech-to-text node. Interestingly enough, it didn't quite understand what I was saying here. I'm using a local model just for the sake of the demo, and that's probably why the transcription step didn't go as expected.
This is the LLM call, and the system prompt helps guide the LLM on what the context is and how I want it to respond. It looks like it saw enough context here: it said it was doing well and asked me how I was doing.
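For a tutor like this, the system prompt doesn't need to be elaborate. The wording below is a hypothetical example of the idea, not the exact prompt from the demo:

```python
# Hypothetical system prompt for the French tutor -- illustrative only,
# not the exact text used in the demo.
SYSTEM_PROMPT = (
    "You are a friendly French tutor on a voice call. "
    "Speak mostly in simple French suited to a beginner, gently correct "
    "mistakes, and keep replies to one or two short sentences since they "
    "will be read aloud by a text-to-speech voice."
)

messages = [{"role": "system", "content": SYSTEM_PROMPT}]
```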
And then finally there are the text-to-speech steps. The reason there are multiple here is that the audio is actually streamed back to me, the user, which is great UX compared to waiting for the entire audio clip.
And so it's pretty cool that you can uncover and see all of the layers here. One thing I've been doing a bunch of testing with is, instead of the local model I'm using, trying out various models and seeing which transcription service works best for my use case. So I'll show you really quick that if I switch from the local model to an OpenAI model directly, the transcription works much better. Let's give it a try.
Okay. And you can see that this transcription step was way better. I have all of the debug logs streaming in here too, and if I pulled up the trace for the new model, it would show the same thing. So let's peel back the layers a little bit and go into how this works. I'm using Pipecat to build this agent. Pipecat is a real-time voice and multimodal open-source framework, and what I've really liked about it is that it's easy to swap out different models. We tested with two speech-to-text models in this demo, and it was really just a line of code to swap them.
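For example, here's roughly what that swap looks like in the script. The class names and import paths are assumptions about Pipecat's service modules (check the Pipecat docs for the version you have installed), but the shape of the change is just one constructor call:

```python
import os

# Before: a locally-run Whisper model (assumed class/module names).
# from pipecat.services.whisper import WhisperSTTService
# stt = WhisperSTTService(model="base")

# After: OpenAI's hosted transcription service -- also an assumed
# class/module name; verify against the Pipecat docs.
from pipecat.services.openai import OpenAISTTService

stt = OpenAISTTService(api_key=os.environ["OPENAI_API_KEY"])
```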
And so really, the core logic of this script is in a couple of places. This is the area in the script where I declare which models I want to use for each step of the pipeline.
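That declaration block looks roughly like the sketch below. Again, the class names, import paths, and constructor arguments are assumptions about Pipecat's service wrappers rather than the demo script verbatim; the point is the pattern, one service object per stage, each independently swappable.

```python
import os

# One service object per pipeline stage (assumed class/module names --
# check the Pipecat docs for your installed version).
from pipecat.services.openai import OpenAISTTService, OpenAILLMService
from pipecat.services.cartesia import CartesiaTTSService

stt = OpenAISTTService(api_key=os.environ["OPENAI_API_KEY"])      # speech to text
llm = OpenAILLMService(api_key=os.environ["OPENAI_API_KEY"],
                       model="gpt-4o-mini")                        # LLM: text in, text out
tts = CartesiaTTSService(api_key=os.environ["CARTESIA_API_KEY"],
                         voice_id="your-voice-id")                 # text to speech
```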
This is the system prompt, which we saw in LangSmith; it gets sent with the LLM call. And then this is the meat of it, where the pipeline gets constructed: it takes audio input from my microphone, runs it through the various steps of the pipeline, and I also have some additional information coming through here that I'll go over. A sketch of this wiring follows.
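Here's the sketch, following Pipecat's general pattern of chaining processors from audio in to audio out. The transport, the context aggregator call, and the exact class names are assumptions, so treat it as the shape of the code rather than the demo script itself.

```python
# Sketch of the pipeline wiring (assumed imports/APIs; not verbatim).
from pipecat.pipeline.pipeline import Pipeline
from pipecat.pipeline.task import PipelineTask

# `transport` is assumed to be one of Pipecat's audio transports (microphone
# in, speakers out); `stt`, `llm`, and `tts` are the services declared above.
messages = [{"role": "system", "content": SYSTEM_PROMPT}]      # tutor prompt from earlier
context_aggregator = llm.create_context_aggregator(messages)   # hypothetical: tracks chat history

pipeline = Pipeline([
    transport.input(),               # audio frames from my microphone
    stt,                             # speech to text
    context_aggregator.user(),       # append the transcript to the conversation
    llm,                             # LLM call: text in, text out
    tts,                             # text to speech
    transport.output(),              # streamed audio back to the speakers
    context_aggregator.assistant(),  # record the assistant's reply
])

task = PipelineTask(pipeline)
```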
So, a couple of things. Namely, I have some span processors here. The reason for these is that I wanted to record the audio of the conversation so that I could upload it along with my LangSmith traces. This is a great best practice when you're tracing voice agents: you want to see the transcription, but having the audio side by side is really helpful too. So I have the full audio of the conversation as well as the audio for each turn, which makes debugging really great. You can also send something like this to an eval pipeline, and it has all the information you need.
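The span processors themselves are small. Here's a hedged sketch of the idea using OpenTelemetry's SpanProcessor hooks: when a turn's span ends, ship the recorded audio for that turn keyed by the trace and span IDs so it can sit next to the LangSmith trace. The two helper functions are hypothetical stand-ins for however you capture and upload the audio; only the SpanProcessor interface itself is standard OpenTelemetry API.

```python
from opentelemetry.sdk.trace import ReadableSpan, SpanProcessor


def audio_path_for(span: ReadableSpan) -> str | None:
    """Hypothetical lookup: map a finished turn span to its recorded WAV file."""
    ...


def upload_turn_audio(trace_id: str, span_id: str, wav_path: str) -> None:
    """Hypothetical helper: upload the turn's audio keyed by trace/span IDs
    so it can be viewed alongside the LangSmith trace."""
    ...


class AudioUploadProcessor(SpanProcessor):
    """Uploads per-turn audio whenever a traced span finishes."""

    def on_start(self, span, parent_context=None) -> None:
        pass  # nothing to do when a span starts

    def on_end(self, span: ReadableSpan) -> None:
        path = audio_path_for(span)
        if path is not None:
            ctx = span.get_span_context()
            upload_turn_audio(format(ctx.trace_id, "032x"),
                              format(ctx.span_id, "016x"),
                              path)

    def shutdown(self) -> None:
        pass

    def force_flush(self, timeout_millis: int = 30_000) -> bool:
        return True
```

A processor like this gets registered on the tracer provider with add_span_processor, alongside the exporter that actually ships spans to LangSmith.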
And then the last big chunk of logic in this app is that I've set up tracing to LangSmith. We use OpenTelemetry to send data from Pipecat to LangSmith, and it's all handled for you with the import.
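Concretely, that setup amounts to pointing a standard OTLP exporter at LangSmith's OpenTelemetry endpoint and handing it to Pipecat's tracing setup. The endpoint and header format below follow LangSmith's documented OTLP ingestion; the setup_tracing helper and the enable_tracing flag are assumptions based on Pipecat's tracing docs, so verify both against the docs link below for your versions.

```python
import os

# Point a standard OTLP exporter at LangSmith's OpenTelemetry endpoint.
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter

exporter = OTLPSpanExporter(
    endpoint="https://api.smith.langchain.com/otel/v1/traces",
    headers={
        "x-api-key": os.environ["LANGSMITH_API_KEY"],
        "Langsmith-Project": "french-tutor",  # project name is illustrative
    },
)

# Hand the exporter to Pipecat's tracing setup. The helper name and the
# `enable_tracing` flag are assumptions -- check the Pipecat/LangSmith docs.
from pipecat.utils.tracing.setup import setup_tracing

setup_tracing(service_name="french-tutor", exporter=exporter)
# ...and later: task = PipelineTask(pipeline, enable_tracing=True)
```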
So that was the demo of building a voice agent. Check out Pipecat and LangSmith and give them a try. I think there are some really fun types of applications to build, so share what you build with us.
Learn how to debug and improve an AI voice agent using LangSmith. We'll walk through tracing conversations, spotting failures, and iterating on your agent. In this demo we use LangChain and Pipecat, an open-source framework for voice and multimodal conversations.
LangChain repo: https://github.com/langchain-ai/langchain
LangChain docs: https://docs.langchain.com/langsmith/trace-with-pipecat
Pipecat repo: https://github.com/pipecat-ai/pipecat