Hey everyone, thanks for joining us for
the next session of our Python and AI
level up series on vector embeddings.
My name is Anna. I'll be your producer
for this session. I'm an event planner
for Reactor joining you from Redmond,
Washington.
Before we start, I do have some quick
housekeeping.
Please take a moment to read our code of
conduct.
We seek to provide a respectful
environment for both our audience and
presenters.
While we absolutely encourage engagement
in the chat, we ask that you please be
mindful of your commentary, remain
professional, and on topic.
Keep an eye on that chat. We'll be
dropping helpful links and checking for
questions for our presenter and
moderators to answer.
Our session is being recorded. It will
be available to view on demand right
here on the Reactor channel.
With that, I'd love to turn it over to
our speaker for today, Pamela. Thanks
for joining.
>> Hello. Hello everyone. Very excited to
be here for the second session in our
Python plus AI series where we show you
all the fundamentals of generative AI
with a focus on using them in Python.
So, we are on our second session where
we're talking about vector embeddings.
We already talked about the basics of
LLMs. If you missed that one, you can
go back on YouTube and watch it when
you've got time. And we still
have seven more sessions after this one.
So, we're covering lots of topics in
this series. If you haven't yet
registered for the series, go and
register now with the link that we'll
post in the chat, because we hope to
see you at all the sessions. Now, they
will all be
recorded. All the slides and all the
code, everything will be available. So
if you do miss anything, you can always
catch up.
So let's go to today's topic about
vector embeddings.
This is a really fun topic. It's
also a little more of a mathy
topic, so there's going to be a
little more math today than in
the other sessions, and a little more
math than Python. But our goal is
that you walk away from this
really feeling like you have a better
intuitive understanding of vector
embeddings, how they work, and when you
might want to use them. So, we're going
to be talking about what they actually
are, what the
similarity spaces look like for
embeddings, how you search with them,
how you can compress them with
different quantization techniques, and
we'll see that across a couple of
different embedding models.
Now you can follow along with everything
I'm doing with this repo here
aka.ms/vector-embedding-demos.
We'll go ahead and post
that in the chat, and that is a public
GitHub repo. So you can go to the repo,
and the easiest way to get started
is to click on the code button and go to
the Codespaces tab. You can see I
actually already have two codespaces,
but when you click on it, what you
should see is a giant button that says
create codespace on main. So you're going to click on
that, and it will create a codespace in
your browser. It's going to look like
VS Code inside your browser, opened up
inside the actual project. You can
see I've got a codespace open over here
that has this project open.
That's what it'll look like when you
open it: you give it a couple of minutes,
it'll open this VS Code, and then
you'll have access
to all of these notebooks, and you should
I'm running um for free um because uh
we're you know using we're using GitHub
models for the vector embedding models
and that you can run for free using a
GitHub account. So hopefully you can
follow along with everything I'm doing
uh if you want to by opening this repo
in GitHub code spaces. If you really
want, you can also open it locally, but
then you will have to do a little more
setup and you will have to read the read
me. Uh so if you do want to open it
locally, go for it. Um but do uh check
out the readme uh to get some tips for
setting up and ask questions in the chat
if you have any.
Okay. All right. So we want to make sure
everyone knows how they can follow
along. And you can always open that
repo later if you just want to watch for
now and don't want to play
around with the code yet.
So now let's talk about what vector
embeddings are.
So at a very high level,
humans think in words, right, and
computers think in numbers, or really,
computers think in binary and bits.
So if we're trying to get computers
to be able to think about human concepts
like words, we need to find a way of
converting those words into something a
computer understands.
So what we do is we create these vectors
that represent
words and phrases and sentences and they
try to represent the meaning of those
sentences but in a numerical form. So a
vector embedding is actually a list of
floating-point numbers. That is
what it is to the computer, and that's
something that a computer can actually
use when it's doing
calculations. When everything is in
vector form, it can do ranking
and sorting and searching, so it is a
way of capturing
the meaning of human
concepts in a numerical form that
computers can understand. So that's what
they are at a really high
level. We're going to go much deeper
into this, right? But they are lists of
numbers that represent words,
phrases, sentences, inputs, concepts,
and we've just figured out a way to
turn them into something that a computer
understands.
Now, why do we actually need vector
embeddings? Like, what are they useful
for? Why are you even here? They're
actually incredibly useful.
Once we have vector embeddings,
first, we can just make much better
searches. Anywhere you have a
search, whether you've got a product
store that has a search or
documentation that has a search, you're going to
improve that search if you add in
vector embeddings and vector
search. Now, on top of that, we
can start building more interesting
things, like chat-based
interfaces for users, because we can take
really ambiguous input from
users and handle any sort of
input now. We can take a user's
entire question, use that to
do a search, and then hand those
results to an LLM. Many people
who are building chat-based interfaces
with LLMs are actually using vector
search behind the scenes in order to
handle these more ambiguous
queries from users.
And once we start adding vector
search, we can
improve things because we can find
things that are similar in meaning. We
can also handle multiple languages more
easily, because a
lot of these vector models actually
understand multiple languages. So
suddenly our search can understand
multiple languages. We also have
vector embedding models that understand
images, so now our search can understand
images. Vector search really can
improve search across the board. So
that's the big use case for
vector embeddings. We are going to see a
few other use cases as well, but
there are just so many
great places where you can use vector
embeddings in order to enhance
retrieval in software.
Okay. So that was at a high level. So
now let's start digging in. So how do we
generate embeddings?
We need to use a vector embedding model.
So a vector embedding model is a special
type of model that is trained
specifically to be able to convert
inputs into a vector. These days it
does use architectures similar to large
language models, but it's
not the same as a large language
model, right? It can use really
similar architecture, like the
transformer, but it has the goal
of creating vector embeddings, of
learning the similarities and
differences between different inputs.
That is its goal when it's being trained,
and that means the training creates
an embedding model that is
really good at understanding which
things are similar and which things are
different. These embedding
models these days are trained on huge
data sets, typically
from the whole internet, and
they're looking across the whole
internet at the similarities and
differences between words. So a
vector embedding model can take an input
and then output a vector, and it can do
that for things it's seen, and once
it's been trained, it can do that for
things it hasn't seen as well. That's
what makes it really
powerful.
Now there are different vector embedding
models out there. One
that you may have heard of, that we've
had for many years, is called word2vec,
and that one knows how to turn a
single word into a vector embedding.
It's one that you can even just
run on your laptop if you want to train
your own word2vec model. It's very easy to
make a word2vec model and
be able to encode words, but the thing is, it
only encodes a single word at a time.
Now these days we have vector embedding
models that can encode really really big
inputs. The new OpenAI
embedding models can encode up to
8,191 tokens, which is huge. That's
a large amount of text,
right? So you could take an entire essay
and say, hey, make a vector embedding
that represents this whole essay. These
models are much more powerful
because they can accept such longer
inputs and come up with a
vector embedding that represents a
really long input. Once we
had these new models that could encode
much bigger inputs, that's when
vector embeddings really took off, when
people realized, whoa, now we can
take any user input,
represent that as a
vector embedding, and do so
much with it.
And the interesting thing with these
models is that they output vectors of
different lengths, right? So remember a
vector is a list of numbers.
Word2vec typically outputs a
list of 300 numbers. OpenAI's
text-embedding-ada-002, their older
model, outputs a list of 1536 numbers.
Now, OpenAI has two newer models:
text-embedding-3-small and
text-embedding-3-large. The 3-small model
has a vector length of 1536, and
3-large has a vector length of 3,072.
And then we also have the Azure AI
Vision model. That one can take image or
text; that's a multimodal embedding
model. We'll talk about that in the
vision session. That one has a vector
length of 1024. So you can see
there are differences in what kind of
input they can take and how long their
vectors are, and then there are also
differences in quality. So we
have a leaderboard here: Hugging Face
has this leaderboard to compare
embedding models. So if you're curious to
see the benchmarks for embedding models,
like if a new embedding model comes
out, you can check out their
embedding benchmark leaderboard and see
how these models are doing. But
generally what you see is that currently
the OpenAI text-embedding-3-large model
is a really good model, ranking
higher than any of the
previous OpenAI ones. So that's the one
that I tend to use whenever I'm
developing applications, because it does
have the highest quality. So if you
go here, this is the leaderboard
and this will show the rankings of
the models here,
and, oh, I haven't visited
in a while, but you can see that
there are some new models here,
ones from Gemini and Qwen, and they look
at different statistics here.
They average across them
and see which
ones are the best.
So you can dig more into that if
you're interested in the benchmarks,
especially if you're looking for
embedding models that work for a specific
language. They do have some benchmarks
for different languages too, which is
nice.
All right. So now let's go ahead and
generate some embeddings. So, we're
going to be focusing on the OpenAI
models since we can use those with
OpenAI.com or Azure or GitHub models.
In this case, I'm using
GitHub Models since I can use that for
free.
And we can create some embeddings. So
I'm going to go to the actual codespace to
do this.
All right. So I'm going to open up the
generate embeddings notebook.
So the first thing I need to do is
create a client. I'm going to use the OpenAI
Python package
and connect to GitHub Models. So this is
the URL for GitHub Models, and my
key is my GitHub token, which is in
the GitHub codespace already, and I'm
going to use the text-embedding-3-small
model here, which has 1536 dimensions.
Next, I'm going to go down here,
and then I can use that OpenAI client to
create a new embedding. So I say, okay,
this is the model I want to use, these
are the dimensions, and here is the
input. I can run this, and it goes off
and creates an output, and you can see
it's 1536 numbers long. We can try
a different input, like "big dog",
or we could try different languages,
like "grande".
And we can also put in gobbledygook.
That's the
interesting thing: with these models,
you can turn any input
into a vector, and it will try to
represent that input. You can
put gobbledygook in there, and it'll
turn into a vector, and that
vector exists in this
multi-dimensional space, right? This vector lives
somewhere inside the vector embedding
space and is closer to some vectors than
other vectors, even though the input is
absolute nonsense. So that's something
to keep in mind with vector embeddings:
garbage in, garbage out.
If you put in nonsense, it
will give you a vector and
at least pretend that it has
some meaning, even if it really
doesn't have any meaning at all. It
will try to come up with some sort
of representation of it, even though
it has very little meaning.
Most of the time we want to use this
with things that actually have
meaning,
so we'll pass in actual
phrases.
Okay. So now we've seen
how we can generate embeddings, and I
have a bunch of embeddings already
precomputed in this workspace, so
that we'll be able to look at a
bunch of vectors
of different inputs without having to
wait for lots
of embedding generation, because it does
take time to generate the embeddings.
About the model dimensions:
this is 1536, and this is text-embedding-3-small.
The model dimension does depend on
the model, so you would need to check
and see, for a particular
model, what dimensions it supports. We
will see later that OpenAI does actually
support using fewer dimensions than the default
for some of their models, and that's
an option, but by default this model is
1536,
and so that's what we're
specifying here. Generally,
embedding models all have their own
similarity space. So we need
to know, when we create embeddings, which
model we used and how many dimensions we
used, because whenever we're going to use
those embeddings going forward, we need
to use the exact same model and the
exact same dimensions.
All right. So now let's talk about
different models. As I
was just saying, it's really
important that we know which model we're
using, because every embedding
model has its own similarity space. It's
like its own brain, and some brains have
different similarity spaces than others,
right? In my brain,
there's a
different similarity between
unicorns and ponies than in somebody
else's brain. So what we need to do
is know which model we're using
and stick with it, so that we're
able to actually compare things. So
here's a comparison between the two
OpenAI models, looking at two
different vectors. For the same word,
I can pass the same word, queen, into
their older model, ada-002, and we do get a
vector of 1536 dimensions, but it's
kind of a weird
vector. If you look at it, most of
the values are really close to
zero, but then there's this value
here that goes down to
about -0.7, right? And this is the weird
thing about this model: every
single vector has a value around -0.7
at dimension 196. It always
has a value that goes really low
like this. So that's a weird thing
about this vector embedding model, that
this is what the numbers tend to
look like: they're mostly around
0.0, and they have a couple of super downward
spikes, right? And I saw that with every
single input that I put into this model,
that it all had this really similar
spike downward here.
Now, the text-embedding-3-small model
also outputs 1536 dimensions,
but the actual values look much
better distributed. Here we
can see the values of these numbers
range between 0.1 and -0.1,
and they're just all
well distributed over that range. There are
no really extreme spikes like that
one. To me, this means this
is a better model. I think
they did a better job training it,
since it doesn't have
any weird artifacts like the older one.
Now, the thing is, when you look at this,
what does this even mean, right? You
can't look at a vector and see the
meaning in it. You
can only really understand how
a similarity space works when you
compare and see what it thinks
is similar. That's how you can
understand a vector similarity space:
by seeing what it thinks is similar
to other things. So let
me open up another notebook here.
So we've got embeddings for
a bunch of words across a few different
models, like ada-002 and text-embedding-3.
And what we want to do with these is
look at
which ones are similar. All right.
So now let's talk about similarity.
As I was saying,
the way to really understand a vector
similarity space is to see what it
thinks is similar to each other.
Now, how do we actually measure
similarity? The most popular
metric, and the one we're going to use
today, is cosine similarity.
So let's see cosine similarity in the
similarity notebook here,
and I'll go ahead and load these
up.
All right, so I'm loading in the
embeddings that I've already generated
for
1,000 words across each of the models.
Now, here's the cosine similarity
function. Cosine similarity is the dot
product of the two vectors
divided by the product of their
magnitudes. So first we calculate the
dot product, then we calculate the
magnitudes, and we divide.
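Here's a minimal NumPy version of that function, matching the formula just described (a sketch, not necessarily the exact notebook code):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    a, b = np.array(a), np.array(b)
    dot_product = np.dot(a, b)
    magnitudes = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot_product / magnitudes)
```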
So here we've got that implemented. And
now we can measure the cosine
similarity between dog and cat, and
we get this number, 0.62. In this case, 0.62
means they're fairly
similar, right? Now here's
what's interesting about cosine
similarity. See how it's dot product
over magnitude?
Now, the magnitude of a vector
is one if it's a unit vector. This
is where I said we were going to
do a little bit of math, right? If you
remember from vector class, a unit vector
is a vector that has a magnitude of one.
So if you have unit vectors and you do
dot product over magnitude, then you're
really just doing the dot product. So
let me actually show you. Let's
print out the magnitude here.
We'll run this, and
you can see the magnitude is basically
one, right? With a
little margin of error here, but the
magnitude is basically one. And that's
because the OpenAI embedding model
vectors are unit vectors. That means
I can actually speed up my calculation
here when I'm using OpenAI models,
because I can delete the magnitude and just
do the dot product. Remember, the
similarity was 0.62. I'm going to run
this. Boom. 0.62. So I
just saved myself a whole calculation,
because I only have to do the
dot product. Now, this only works if
you're working with a model that
produces unit vectors. That's true
of all the OpenAI models. So if you are
using OpenAI models, you can actually
save yourself some time by just
using the dot product and not dividing by
the magnitude. Sometimes this
doesn't matter, because you're using
a vector database, and the vector
database just takes care of
this for you and does the optimization.
But sometimes it does matter. When
I'm using Postgres, Postgres gives
you the option to do cosine similarity
or dot product, and if I'm using the OpenAI
embedding models, then I use the
dot product metric instead. So I think
it's worth knowing that there is a
slight performance improvement that you
can get if you know you're working with
unit vectors.
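As a sketch, the unit-vector shortcut looks like this, reusing the NumPy import from above (the vector variable names are hypothetical):

```python
# Magnitude (L2 norm) of an OpenAI embedding vector is ~1.0,
# so cosine similarity reduces to just the dot product.
def dot_similarity(a: list[float], b: list[float]) -> float:
    return float(np.dot(a, b))

# print(np.linalg.norm(dog_vector))       # ~1.0 (hypothetical vector)
# dot_similarity(dog_vector, cat_vector)  # ~0.62, same as cosine_similarity
```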
Okay. So that was our math
for today.
We do have lots of resources for
anybody who wants to dig deeper into the
different distance metrics, and
it's some pretty interesting
stuff. I do wish I'd paid more attention
to vector math in college now. All
right. So now that we have a way of
calculating similarity,
we're going to go ahead and look for
the most similar words to a given word
using that metric. Basically, we're
going to calculate the similarity from
one word to every single other word in
my collection of words, because I've got
a thousand words. So I'm going to
calculate and see, okay, for the
word dog, which of the thousand words
is closest to it according to cosine
similarity. So for dog, I can look here
and see that, okay, the closest one is
dog. That's good. Dog to itself
gets a similarity of one. Very good. Now
the next one is animal; the
very closest one after dog itself, with
a cosine similarity of 0.88. Now
the one after that is god. So according
to this model, god is actually more
similar to dog than cat is. So I guess
this model is a dog person. This
model really likes dogs. I think
the reason is because of the spelling.
This is the ada-002 model, and I think the
ada-002 model did kind of incorporate
spelling during its training process
in a way. I'm not sure exactly why,
but this is interesting to see,
because you're like, huh, if it
thinks dog is close to god,
how might that affect other
ways that I would use this
model? And then we see also drug and gun
at the bottom here, right? So, I think
once again, drug is probably close in
terms of spelling. Also, this is
only looking at a thousand words, and
maybe there weren't any more similar
words here. But this is what I find
useful: if you're trying to
understand a similarity space, see,
in that similarity space, what
it thinks is similar to
each of the inputs.
Now we'll look at the same thing for
the text-embedding-3-small
model. That one actually has much
more reasonable results: here we've got dog,
then we've got animal. Now, what's
interesting is that animal here has a
similarity score of 0.68, and if you looked
up above, the closest thing was 0.88. So
this is another thing you'll notice:
you shouldn't think of similarity
scores as being absolute,
because they differ so much across
models. In the ada-002 model, 0.88
meant very similar, but in this
model, 0.68 means very similar. So
you really cannot look at a similarity
score on its own and know, oh, that's
a really similar thing. You can
only look at it relative to similarity
scores from the same model. Right? So
here we're seeing, okay, so for this
model, 0.68 is a close score. We
see animal, and then we see that cat
is 0.6, and for this model that is actually a strong
similarity. And then we go all
the way down to baby and door. I
guess dogs are at the door a lot. So
that might be why it thought door was
similar, because a lot of times, when
these models are trained, they're
trained by looking at proximity in
text. So if dog is close to door in a
lot of text, then that can increase the
similarity score, right? Maybe dogs are
hanging out with the babies, like Lady
in Lady and the Tramp, taking care of them,
right? Who knows? But I'd say
this is a much more reasonable set of
similar words than ada-002's. That's
why I have moved everything over to the
new models: I think they just
have a more reasonable similarity space.
All right, cool. So that, I think,
is a really helpful way of
understanding a model.
Now, you can also try to visualize the
model. Here I used this technique
called PCA, principal component
analysis. It tries to turn the 1536
dimensions into just three dimensions
so we can plot it in 3D space. And it's
fun to do this, and it's
cute, and it makes cool visualizations,
but you lose so much information, because
you're literally losing 1533
dimensions of information when we try to
squeeze them down into three dimensions
of space. So it's fun to
make these 3D projections of the
vectors, but I don't actually think it's
that useful, because we lose so much
information. I think it's more
useful to actually just ask, okay,
for a given word, which
concepts are the most similar? But
we like to see pretty graphs.
So, here you go. Here's some pretty
graphs.
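As a sketch of that projection step, something like scikit-learn's PCA can squeeze the vectors down to three dimensions (the random array here is just a stand-in for the real embeddings):

```python
from sklearn.decomposition import PCA
import numpy as np

# Stand-in for the real embeddings: a (n_words, 1536) array.
word_vectors = np.random.rand(1000, 1536)

# Project 1536 dimensions down to 3 so the words can be plotted
# in 3D space; the other 1533 dimensions of information are lost.
pca = PCA(n_components=3)
points_3d = pca.fit_transform(word_vectors)
print(points_3d.shape)  # (1000, 3)
```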
Now, another thing that's useful is
looking at the actual values of the
cosine similarity. We were looking earlier,
and I was showing, okay, here
we saw the closest one was 0.68, and
here the closest one was 0.88. I think
that's useful to look at to give you a
better intuition,
for a particular model, of what kind of
values to expect. For ada-002,
all of the values are between 0.86
and about
0.78, right? They're in a really
tight range. Versus the newer model,
text-embedding-3-small, where we see the
values range between about 0.7 and
0.05. That's a much more
reasonable range. That's another reason
why I think this is a better model:
it just has a better
distribution across the range. With this
one, you could actually look at the
values and say, okay, a
0.2 cosine similarity probably means not
particularly similar, right? So it's
easier to come up with the kind of cut-offs
to say, okay, if it's below
this similarity value, it's
just really not that similar at all.
So I think it's interesting
to look at the range of
similarity values to get a feel for what
you might expect.
So now we have seen cosine similarity
and vector similarity. I see there's lots of
discussion in the chat about the
math behind it, so definitely
dig into that more if you like. Now, there are
some great use cases for just vector
similarity on its own. You can use
it to make recommendation systems. In
the past, if you wanted to make a
recommendation system, you'd often have
to look at lots of user input and
consider what different
users liked, and say, okay,
if lots of users liked this, and
they also liked this, then those
are similar; we're going to
recommend them. That's still a really
great thing to do if you can build a
whole user-based recommendation system.
But you could also now just use vector
embeddings for making recommendation
systems, and it's a lot easier,
because these vector embedding
models have been trained on the internet,
so they have a rough idea of which
things are similar. I know people
who have used this just
to add
recommendations onto their personal blog
or something, right? It's now so
easy to make recommendations
that you can do it for anything; you can
just throw it on there. You
don't have to build your own
recommendation model; you can just use
the embedding models, and then, boom, you
can show: hey, you're on this
piece of content, let's recommend these
other pieces of content based off of
what you're currently on. So that's
definitely a big use case
you can consider for using vector
embeddings. It's a great way to get
started with them.
Another interesting use case is fraud
detection. People use vectors
in order to establish whether a new
input is more similar to a fraudulent
input than to a real input. Now, this you
really need to do very carefully,
because you don't want to accuse
anything of being fraud if it's not
fraud. Spam detection is similar.
So that would be another use case
for vectors. Now, interestingly, when
people are doing fraud detection,
sometimes they use different metrics
other than cosine similarity. Here
today, we're really talking a lot about
cosine similarity, because that's the
metric that people are using for most of
the generative AI use cases
that we're looking at. But for
fraud detection, sometimes people use
other distance metrics as well. So
cosine similarity is a great metric.
It's the one we're showing today because
it's going to be the most useful for
all your generative AI applications. But
just so you know, there are other ways of
measuring the distance between vectors,
and sometimes other metrics
are more appropriate for the scenario,
like with fraud detection.
Okay. So now we can talk about how we
can do vector search, and that's really
what everybody's so excited about in
terms of vector embeddings. The idea
is that once we have all of our data
converted into vectors, we can then
search that data based off any new user
input. We get the user
input, we turn that into a vector using
the same model, and then we use that
vector to search the existing vectors
and say, okay, based off that new user
input, here are the closest vectors that
we have.
We can start off by doing an
exhaustive search. An exhaustive
search means that we take
the input vector, compare its
cosine similarity to every
single other vector, and say, okay, we
found the vector that is closest to this
input.
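A minimal sketch of exhaustive search, building on the `cosine_similarity` function from earlier (the function and variable names are hypothetical, not necessarily the notebook's):

```python
def exhaustive_search(query_vector, vectors, top_n=10):
    """Score the query against every stored vector (exhaustive),
    then return the indices of the top_n most similar ones."""
    scores = [cosine_similarity(query_vector, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]
```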
So let's look at the search
notebook here,
and let me run this. Here
I've got a bunch of embeddings for
Disney movie titles. Let me actually
show you what those embeddings look
like: text-embedding-3-small, 1536
dimensions. Okay, let's
see if I can get it to open the
file. It is a really big file.
Generally, you don't want to store
embeddings in JSON files. I did it
for this notebook because I didn't have
that many, but they do take up
a lot of space. So here I've got a
bunch of embeddings for Disney movie
titles. I took the movie title
and then I generated the embedding.
It's solely based off the title; that's
how I generated the embedding.
That's what I'm loading in here.
So I've loaded all of those in,
and then I'm going to set up my
connection to GitHub Models,
and then I'm going to make a little
wrapper function that can generate an
embedding using the OpenAI SDK.
And then I've got my code here to
compute cosine similarity and do
exhaustive search. Now, as I was saying, we
could actually just remove the
magnitude calculation, since we know
they're unit vectors, but we can also
just leave it in, because that's the
standard way of doing cosine similarity.
And so then we're going to create a new
vector. Here I've got the
input: a toddler-friendly movie about
cats. I have two daughters, and
they often want to watch movies about
cats, and we have Disney+, right? So
here I'm trying to search for a
good movie that we could watch that they
would like. We get the embedding for
that input, then we use that embedding
to search all of our existing vectors.
We say, all right, let's calculate
the cosine similarity for them
all, and let's sort. So let's see
what we get as a result.
Okay. So the top movie is The
Aristocats. Then we have the tiger
movie, then Ratatouille, then Cars 2.
African Cats is somewhere in here. We
have A Goofy Movie; we've got dog
movies. Unfortunately, there are actually
not that many movies about cats on
Disney+. It's unfortunate. But
this is a pretty good
result here. Now we can
put in all kinds of
things, right? We can do
something in Spanish. We can say
"películas sobre leones", which is movies
about lions,
and here we get The Lion King. So
once you're using
vector search, you're able to support so
many more searches, right? And we can
be really casual, like: oh man, my
daughters are screaming
for unicorn stuff, I need to occupy them
right now. Right? This is just a
stream-of-consciousness input from a user,
but a vector embedding model can turn
anything into a vector. And here we can
see the result suggested was Babes in
Toyland. I haven't watched that one
yet. And then Monsters, Inc. and Ice
Princess. So these were
okay responses. I guess
there are no unicorns actually in
this data set. Let's see: kitty, kitty,
dinosaur. Let's try dinosaur. Do
we have any dinosaur
results here? Dinosaur. There we go.
Right. So I had this whole long stream
of consciousness, and it picked out that
dinosaur was a salient
part of that vector, right? And
so it's turned this whole
phrase into a vector that lives
somewhere in that multi-dimensional
space, and fortunately, that vector is
closer to the Dinosaur vector than it is
to any of the other vectors. So
that's what's really powerful
about vectors.
Okay. So there we go. That's
pretty cool. Now, I saw a comment in the
chat that exhaustive search sounds
expensive, and that's true, right? Here
we're only searching something like
500 movies,
and so we're able to do the exhaustive
search, and the exhaustive search is
relatively fast. But when you're
actually using vectors in production,
you're going to have much bigger
databases. You're going to have
thousands of vectors, millions of vectors;
you could even have billions of vectors.
We do actually have customers at Azure
that have billions of vectors in their
database.
So what we can do is use
approximation searches. We need to
use an algorithm known as an
approximate nearest neighbor search, or
ANN. These are all search algorithms
that try to find the best results without
actually having to search exhaustively.
They're trying to find the
highest quality results without having to
look at every possible option, so
they have to use some sort of heuristics
to cut down the search space. Now, the
most popular one these days is HNSW, and
that has really good support across
lots of databases. Postgres
supports it with the pgvector
extension, Azure AI Search supports
it, Chroma DB, Weaviate...
pretty much all the big vector databases
are supporting HNSW,
so that's a great pick. If you
see that as an option, that's
a great pick. Now, Microsoft came up
with a new approximation algorithm
called DiskANN, and we're now using
that for several Azure products. Cosmos
DB has it, Azure SQL has it, and
Azure Postgres has it. So if you're
using Azure Postgres, you can actually
pick between HNSW or DiskANN.
There's also IVFFlat, and that
is supported by Postgres. It's not as
supported by the other ones, because it's
not quite as practically useful for a
lot of production uses of vectors. It's
got some limitations: IVFFlat
works best if you're only going to
build the vector index once and you're
not doing lots of updates. But a lot of
times, if you have a database, it's
because you're going to be updating that
database, right? So you want something
that works well, that can handle lots of
data updates.
And then there's FAISS. FAISS is
something you could use if you just
needed an in-memory index, but it's
not designed for a
persistent database.
We're going to look at HNSW, because
it is the most popular one,
the one a lot of people are using, and
you can use it just in Python. I
wanted to find an example that we could
run in Python without having to set up a
database today, just so you could get a
feel for it. HNSW does work well in
situations where you have an index
that's frequently updated, and it can
scale really well up to large indexes as
well. It's a very well-designed
algorithm, and we can use it for a
lot of our production use cases. So I
actually have that set up in the
notebook here using the hnswlib
package. We declare our
index, we say how many dimensions it's
going to have and that we're going to
use cosine,
and we have some parameters here that
you can tweak in order to
change things like the performance of the
index. And then we can add all the items
to the index.
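Here's a minimal sketch of that setup with the hnswlib package (the parameter values and the random stand-in data are illustrative, not necessarily the notebook's):

```python
import hnswlib
import numpy as np

dimensions = 1536
embeddings = np.random.rand(500, dimensions)  # stand-in for the movie embeddings

# Declare an index that uses cosine distance.
index = hnswlib.Index(space="cosine", dim=dimensions)

# ef_construction and M are the tunable parameters that trade off
# index build time, query speed, and recall quality.
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)

# Add all the items, with integer ids we can map back to titles.
index.add_items(embeddings, ids=np.arange(len(embeddings)))
```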
And basically at this point
the index is set up. Now, usually
you're not going to be the one writing
all this code. Typically, you're going to
use a database like Postgres that's
going to set this up for you behind the
scenes, maybe with some database
commands. But I thought it was
helpful to see that you can do it in
Python.
So now I can get my embedding for the
toddler-friendly movie. We'll say about
dogs, since it seems like there are some
dog people in the chat. We're
going to get that embedding, and then
we're going to do a query. This is a
k-NN query, which says: get us the 10
nearest neighbors
to this vector,
and then we display them.
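Continuing the sketch, the query side looks like this (`get_embedding` and `movie_titles` are hypothetical names for the wrapper function and title list):

```python
# query_embedding: the vector for the user's search phrase.
query_embedding = get_embedding("A toddler friendly movie about dogs")

# k-NN query: the 10 approximate nearest neighbors to this vector.
# With space="cosine", hnswlib returns cosine *distances*,
# so similarity = 1 - distance.
labels, distances = index.knn_query(query_embedding, k=10)
for label, distance in zip(labels[0], distances[0]):
    print(movie_titles[label], 1 - distance)
```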
All right, so we got lots of dog movies,
right? That was fast, and it did a
good job. The idea of HNSW is that it
should get you really similar results to
what you'd get doing exhaustive search.
So let's go try the same thing up
here: a toddler-friendly movie about dogs.
All right, we'll run that. Snow Dogs, 102
Dalmatians, The Fox and the Hound, right?
And what we see is basically the same
results down here. And of course, this
was on a really small index, so we would
expect to see really good results. But
the idea of HNSW is that even as you're
adding more and more vectors, you can
keep getting really high quality
results,
not having to sacrifice
quality, but still getting really good
speed.
All right, so that was vector search. So
what are the use cases for vector
search? First of all, if you have
any sort of search box on your web app
or your website, you can enhance that
search box by adding in vector search,
because once you add in vector
search, you can handle more complex
queries, more ambiguous queries, right?
You can handle multilingual queries.
So it can generally just
improve any sort of search that already
exists.
And then the really big use case with
LLMs is RAG, where we're using an LLM
to answer questions by searching some
data. And we're going to be talking
about that really in depth tomorrow,
so please come to tomorrow's
session, where we're going to be talking
about RAG. But this is really where
vector search has taken off, because
when people talk with LLMs,
they don't use search
queries to talk with LLMs. They ask
questions, right? So we need to be able
to take questions
from users and be able to use those to
do vector searches, right? So,
for example, this is RAG
on top of Azure AI Search that's
searching documents like PDFs. So
this one is doing a vector search of
documents, and we can actually see
the kind of scores that come back
from the search service for the
various chunks. We have
created embeddings of all of these
chunks
from the documents, and we are searching
based off of the embeddings of these
chunks. That way, we can get really
good results and then send
those results to the LLM and get
a good response. So that is RAG,
and that is what we're going to be
talking about for the whole session
tomorrow.
All right. So now, moving on, let's
talk about how we can compress vector
embeddings.
As you've seen, vectors
take up a lot of space. Here
is the title of
the movie, and the actual vector is
really, really long, and this is
actually the small embedding model:
this is 1536. Usually I use the large
model, and that's 3,072.
So once we start using
vectors, we are increasing the amount of
storage space that we need
for our data, and then we're actually
paying for that storage space, and
it affects how well we can
productionize the applications that
we're building. So when you are thinking
about putting your vector search into
production, you may think about how you
can compress the vectors to decrease
your storage size and cost and also
decrease the search latency. And as it
turns out, there are two really
interesting techniques we can use to
compress vectors and not lose that much
quality. We're going to look at
vector quantization, where we take each
number and make it take up less
space, and then dimensionality
reduction, where we take a long
vector and make that vector shorter.
So let's talk about vector quantization.
This is where we start off with the
list of floating-point numbers, right?
That's what we just saw: tons and
tons of floating-point numbers. These are
64-bit floating-point numbers, so they
require a full 64 bits to store
in memory. But what we can do is reduce
them so that they don't require 64 bits.
We can start off with scalar
quantization; that's where we reduce
each number to an integer. And then we
can even go as far as binary
quantization, where we reduce each number
to a single bit. And it's crazy that
that works. So let me go ahead and
open the quantization notebook here.
And here, okay, so we're going
to load in the same ones from before, the
1536-dimension ones.
All right. So we can see these are
currently 64-bit numbers.
Now, for scalar quantization, the approach
that we use
is that we look and
see, okay, what is
the range of these values, what do these
floating-point numbers range between, and
what's kind of the middle
of those values, and we take the range of
those values and we map them to a new
range. For this code here, I'm going
to map the values into the range -128 to
positive 127,
right? So it's going to take all
those floating-point values, and
each of them is going to fall into one
of 256 buckets between
this range. You can see here we
find the global min and max, we
normalize the embeddings,
and then we go and figure out,
between the min and the max,
which number each value is going to become.
So I can run this on The
Little Mermaid, and now we can see that
The Little Mermaid is represented by a
list of integers instead. And integers
require much less storage space than the
64-bit floating-point numbers up
here.
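A minimal sketch of that scalar quantization step (the exact notebook code may differ):

```python
import numpy as np

def scalar_quantize(embedding: np.ndarray) -> np.ndarray:
    """Map each float into one of 256 integer buckets
    in the range -128 to 127."""
    global_min, global_max = embedding.min(), embedding.max()
    normalized = (embedding - global_min) / (global_max - global_min)
    return (normalized * 255 - 128).round().astype(np.int8)
```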
So now we've got that vector.
You can look at
it and see that, once we've mapped it
into integers, all of the values for this
vector ranged
between about -10 and positive 70.
The full possible range is -128 to
positive 127, but it
just depends on where the values fall.
Now, the big question is: okay,
we removed a bunch of
information, right? We went from 64 bits
of information to just an
integer of information. How does that
affect similarity? That's what matters:
are we going to get the same results?
I saw a question like, are
we getting the same results with HNSW?
That's always the question: as
you add on approximation
and compression and all that, how good are
the results? A lot of times, they're
not going to be as good, but a lot of
times they could be good enough, right?
So here we can look and see: okay,
for Moana, with the integers, we see the most
similar movies
are Mulan, The Little Mermaid, and Lilo &
Stitch, and we can compare that to the
original, where we see Mulan, Lilo &
Stitch, The Little Mermaid. So the top
three are in a slightly different
order.
They're in a slightly different order,
but
they're pretty darn close, right? So
given how much performance improvement
we can get out of this, this might be
good enough, right?
But it's something you really have to
decide, and actually run evaluations and
see: okay, once we add in
these performance enhancements, once we add in
this quantization,
is our
performance still good enough? Is
our retrieval quality still good enough?
The next thing is, we can go more
extreme, and we can turn each of the
numbers into bits: just zero or one. For
this one, what we do is we look
at the full range and figure out the
average, the mean, and if a value is less
than that, it becomes zero, and if it's
more than that, it becomes one. Right?
That's what this code here does: it
figures out the mean and then quantizes
each value to either one or zero.
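A sketch of binary quantization in the same style:

```python
def binary_quantize(embedding: np.ndarray) -> np.ndarray:
    """Reduce each value to a single bit: 1 if above the mean, else 0."""
    return (embedding > embedding.mean()).astype(np.uint8)
```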
And so then we can see Moana becomes,
you know, 1 1 0 1 0 1, right?
And it's crazy to think
that we could do this; we're
losing so much information. We went
from 64 bits to one bit. I was
actually working on this notebook with
my mom, and my mom, she's a
mathematician, and she did not think it
was going to work. She was like, "No
way, we're going to lose so much
data." But, you know, let's just look and
see.
All right, so that's what the vector
looks like. When we
look at the most similar movies, we
see Moana, Mulan, The Little Mermaid;
The Princess and the Frog did move up one,
but then we've got Lilo & Stitch here.
So we still see really similar results,
even with one bit to represent that full
64-bit floating-point number. That's what's
really interesting: you
can actually remove so much information
and still retain a lot of the semantic
meaning in that vector.
Oh, and I see there's a comment about
how the similarity numbers are
different. That's true, the similarity
numbers are different. That's why,
as I was saying, usually we
don't want to be looking at
similarity numbers strictly on their own;
we want to look more relatively. So,
relatively, we see that within this
search, Mulan is more
similar than The Little Mermaid. But yeah,
it's true that this one's 0.68 and this
one is 0.54, right? So really different
similarity ranges here. If you were
doing any sort of
similarity threshold cutoff, it
would really depend, in this case,
both on the model and the
space and what sort of quantization you
were doing. So that's a good
observation there.
Okay. So that's crazy, but it worked,
right? And here I've got the actual
effects on similarity between the
float and the binary versions.
There are definitely differences here,
and we can argue about which of the
results are actually better,
but it still did retain a
surprising amount of semantic
information, because
it wasn't just nonsense, right?
It came up with a bunch of similar
movies, just some differences in which
is which.
Now, the cool thing is, if you're
just working with a few
thousand vectors, it's not a big deal, but
if you are doing millions of vectors,
billions of vectors, this can make a
big difference in reducing the size
of your vector index. Here we
have a comparison for Azure AI Search,
which supports quantization, and if
we start off with float32 and
then we go down to bits, we can see that
we get a 97% reduction in the index
size. That's huge, right? Because
that's money that you're actually paying
for. And if you
don't need to use up all that space,
then why use it?
So that's quantization.
And now let's look at the other
technique, which is dimension reduction.
Dimension reduction is a technique we
can only use on models that were
specifically trained to be able to
support it. The OpenAI embedding
models were trained to support MRL,
Matryoshka Representation Learning. Since
they were trained to support MRL,
it means that we can reduce those vectors
to different sizes and still
retain a lot of the same semantic
representation,
but you do have to be really careful to
only use this on the models that support
it. Now, the cool thing is, with the
OpenAI SDK, you can actually just do it in
the SDK itself. When you're
using the model, just pass in the smaller
dimension. So for this one, 3-small
is usually 1536; I can just pass in 256,
and it will do the correct reduction and
give back the new embedding.
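In code, that's just the `dimensions` parameter on the same embeddings call from earlier (a sketch, reusing the client from above):

```python
# For MRL-trained models like text-embedding-3-small, the SDK
# can return a reduced-dimension embedding directly.
response = client.embeddings.create(
    model="text-embedding-3-small",
    dimensions=256,  # reduced from the default 1536
    input="The Little Mermaid",
)
short_vector = response.data[0].embedding
print(len(short_vector))  # 256
```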
So, does this work? I
did the same comparison here. I went
from 1536 down to 256; that's the
minimum that is supported
for reduction on this model. And
what we see is really similar
results, with some differences,
similar to the quantization effects.
The interesting thing is, you can
combine these techniques. You could
start
off with 1536 dimensions, reduce those to 256, and
you could then, even in theory, reduce
those 256 numbers to only be bits instead
of floating-point numbers, and you would
have an index that takes up a tiny
amount of room. But I know you're
probably all freaking out, like, oh my god,
you're definitely going to lose some
quality with that, right?
Definitely, because you're losing
information in all
directions, all dimensions,
right? But there's this really cool hack,
or technique, that you can use, which
is to oversample. What you
can do is, let's say originally you
wanted 10 results, and you were going to
get those 10 results from your
non-compressed index. What you can do is
say, okay, I'm going to ask
for 150 results from my compressed index,
and you can still store the
original vectors in a more
efficient place to store vectors,
not in your actual index, but in a
different spot, and then you can rescore
the results according to your original
vectors. Basically, if you ask for
150 results,
you're going to find that those 150 results
are going to contain the original top 10
best results. So you get those 150, you
then rescore the 150 using the original
vectors that are stored elsewhere, and
then you'll find those top 10. The top
10 will rise out of that. Right? So
this is getting into
use cases that maybe not everybody
needs; maybe you don't all need
this compression. But we do
work with lots of customers that have
lots and lots of vectors, and
this is what we recommend: okay,
use all these compression techniques, but
then use what we call oversampling
with the
original vectors in order to rescore.
I think it's really a cool
technique that we can
use in order to have fast
retrieval from the vector index but
still have high quality results. We
did a whole series about it,
where we did a deeper dive into it, and I
was really impressed to see the results:
you can do all of this, and as
long as you do the oversampling with
rescoring against the originals, you get
fantastic results.
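Here's a rough sketch of that oversample-and-rescore pattern, building on the hnswlib index from earlier (the oversampling factor and variable names are illustrative, not the exact recommended implementation):

```python
def oversampled_search(query_vector, index, original_vectors, top_n=10):
    """Ask the compressed index for many more results than needed,
    then rescore the candidates with the original full-precision
    vectors (stored outside the index) to recover the true top_n."""
    oversample = 15  # 10 * 15 = 150 candidates, as in the example above
    labels, _ = index.knn_query(query_vector, k=top_n * oversample)
    candidates = labels[0]
    # Rescore with the dot product against the original vectors
    # (valid for unit vectors, like the OpenAI embeddings).
    rescored = sorted(
        candidates,
        key=lambda i: np.dot(query_vector, original_vectors[i]),
        reverse=True,
    )
    return rescored[:top_n]
```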
Okay, so we covered a lot today. As I
mentioned, we did go
kind of deeper and nerdier into
the math of it, because vector
embeddings are a bunch of numbers,
so we have
to think about what they actually mean
and how we compute distances
and all that stuff. Hopefully this
gave you a better feel for vector
embeddings. If you have any feedback
about how to make them easier to
understand, let us know.
There are lots more resources here that I
used when I was researching vector
embeddings, and you can get these
resources from the slides
and dig deeper into it. I
also have, let me find it, a blog post.
So I have this blog post here that is
kind of a written version of this
session. And I actually even turned it
into a poster. So if you like posters,
you can have a poster of all of this.
It mentions a few other things, like
the other distance metrics.
So yeah, this was one way
of exploring the amazing world of vector
embeddings, and we will keep using
vector embeddings going forward in the
series, especially tomorrow. So please
come to the RAG session, where we're
actually going to do practical things
with vector embeddings and see how we
can use them for making RAG applications,
which is the most popular use of
vector embeddings. We will also
revisit vector embeddings in the vision
models session on Monday, so we'll get to see
multimodal embeddings, which are really
cool, because you can search by
image and you can search images, and
that's really helpful. So we
will continue to talk about cool ways of
using vector embeddings going forward in
the series. I hope you come back, and
I hope that we get to see you tomorrow.
Let's see. We do have office hours
on Tuesdays. We already had them
yesterday; it was a great office hours.
Lots of you came, and we had lots of
great questions. I don't have office
hours today, but you can still join the
Discord, and there are lots of channels in
the Discord where you can ask
questions. Another place you can ask
questions is on our resources thread.
We're keeping all the
resources in this discussion thread
here, so if you do have any additional
questions, please feel free to add them
to this thread, and I can totally
answer follow-up questions there. And
let's see, what else? We are at
time
right now, so we can't go over
today with questions, but hopefully you
got a lot of questions answered in the
chat, and please do post any more
questions in this thread or in Discord,
or bring them to office hours next week.
All right,
thank you everyone. I hope to see you
tomorrow.
Bye.
Thank you all for joining and thanks
again to our speakers.
This session is part of a series. To
register for future shows and watch past
episodes on demand, you can follow the
link on the screen or in the chat.
We're always looking to improve our
sessions and your experience. If you
have any feedback for us, we would love
to hear what you have to say. You can
find that link on the screen or in the
chat. And we'll see you at the next one.
In our second session of the Python + AI series, we'll dive into a different kind of model: the vector embedding model. A vector embedding is a way to encode a text or image as an array of floating point numbers. Vector embeddings make it possible to perform similarity search on many kinds of content. In this session, we'll explore different vector embedding models, like the OpenAI text-embedding-3 series, with both visualizations and Python code. We'll compare distance metrics, use quantization to reduce vector size, and try out multimodal embedding models. If you'd like to follow along with the live examples, make sure you've got a GitHub account.

This session is a part of a series. Learn more here: https://aka.ms/PythonAI/2
Explore the slides and episode resources: https://aka.ms/pythonai/resources
Check out the demos: https://aka.ms/python-openai-demos

Chapters:
00:08 – Welcome & Housekeeping
01:03 – Introduction to Vector Embeddings
02:24 – Why Vector Embeddings Matter
03:32 – How Embedding Models Work
06:01 – Comparing Embedding Models
10:55 – Generating Embeddings with OpenAI
20:59 – Understanding Similarity Spaces
24:47 – Cosine Similarity Explained
34:02 – Vector Search with Exhaustive Search
40:01 – Approximate Nearest Neighbor (ANN) Search
46:54 – Compressing Embeddings: Quantization
56:17 – Compressing Embeddings: Dimensionality Reduction
59:57 – Oversampling for High-Quality Retrieval
1:00:08 – Wrap-Up & Resources

#MicrosoftReactor #learnconnectbuild
[eventID:26293]