Hey everyone, thanks for joining us for
the next session of our Python and AI
level up series on vector embeddings.
My name is Anna. I'll be your producer
for this session. I'm an event planner
for Reactor joining you from Redmond,
Washington.
Before we start, I do have some quick
housekeeping.
Please take a moment to read our code of
conduct.
We seek to provide a respectful
environment for both our audience and
presenters.
While we absolutely encourage engagement
in the chat, we ask that you please be
mindful of your commentary, remain
professional, and on topic.
Keep an eye on that chat. We'll be
dropping helpful links and checking for
questions for our presenter and
moderators to answer.
Our session is being recorded. It will
be available to view on demand right
here on the Reactor channel.
With that, I'd love to turn it over to
our speaker for today, Pamela. Thanks
for joining.
>> Hello. Hello everyone. Very excited to
be here for the second session in our
Python plus AI series where we show you
all the fundamentals of generative AI
with a focus on using them in Python.
So, we are on our second session where
we're talking about vector embeddings.
We already talked about the basics of
LLMs. If you missed that one, you can
go back on YouTube and watch it when
you've got time. And we still
have seven more sessions after this one.
So, we're covering lots of topics in
this series. If you haven't yet
registered for the series, go and
register now with the link that we'll
post in the chat, because we hope to
see you at all the sessions. Now, they
will all be
recorded. All the slides and all the
code, everything will be available. So
if you do miss anything, you can always
catch up.
So let's go to today's topic about
vector embeddings.
This is a really fun topic. It's
also a little more of a mathy
topic, so there's going to be a
little more math today than in
the other sessions, and a little more
math than Python. But our goal is
that you walk away from this
really feeling like you have a better
intuitive understanding of vector
embeddings, how they work, and when you
might want to use them. So, we're going
to be talking about what they actually
are, what the
similarity spaces look like for
embeddings, how you search with them,
how you can compress them with
different quantization techniques, and
we'll see that across a couple of
different embedding models.
Now you can follow along with everything
I'm doing with this repo here
aka.ms/vector-embedding-demos.
We'll go ahead and post
that in the chat, and that is a public
GitHub repo. So you can go to the repo,
and the easiest way to get started
is to click on the code button and go to
the Codespaces tab. You can see I
actually already have two codespaces,
but when you click on it, what you
should see is a giant button that says
create codespace on main. So you're going to click on
that, and it will create a codespace in
your browser. It's going to look like
VS Code inside your browser, opened up
inside the actual project. You can
see I've got a codespace open over here
that has this project open.
That's what it'll look like when you
open it: you give it a couple of minutes,
it'll open this VS Code, and then
you'll have access
to all of these notebooks, and you should
I'm running um for free um because uh
we're you know using we're using GitHub
models for the vector embedding models
and that you can run for free using a
GitHub account. So hopefully you can
follow along with everything I'm doing
uh if you want to by opening this repo
in GitHub code spaces. If you really
want, you can also open it locally, but
then you will have to do a little more
setup and you will have to read the read
me. Uh so if you do want to open it
locally, go for it. Um but do uh check
out the readme uh to get some tips for
setting up and ask questions in the chat
if you have any.
Okay. All right. So we want to make sure
everyone knows how they can follow
along. And you can always open that
repo later if you just want to watch for
now and don't want to play
around with the code yet.
So now let's talk about what vector
embeddings are.
So at a very high level,
humans think in words, right, and
computers think in numbers, or really,
computers think in binary and bits.
So if we're trying to get computers
to be able to think about human concepts
like words, we need to find a way of
converting those words into something a
computer understands.
So what we do is we create these vectors
that represent
words and phrases and sentences and they
try to represent the meaning of those
sentences but in a numerical form. So a
vector embedding is actually a list of
floating-point numbers. That is
what it is to the computer, and that's
something that a computer can actually
use when it's doing
calculations. When everything is in
vector form, it can do ranking
and sorting and searching, so it is a
way of capturing
the meaning of human
concepts in a numerical form that
computers can understand. So that's what
they are at a really high
level. We're going to go much deeper
into this, right? But they are lists of
numbers that represent words,
phrases, sentences, inputs, concepts,
and we've just figured out a way to
turn them into something that a computer
understands.
Now, why do we actually need vector
embeddings? Like, what are they useful
for? Why are you even here? They're
actually incredibly useful.
Once we have vector embeddings,
first, we can just make much better
searches. Anywhere you have a
search, whether you've got a product
store that has a search or
documentation that has a search, you're going to
improve that search if you add in
vector embeddings and vector
search. Now, on top of that, we
can start building more interesting
things, like chat-based
interfaces for users, because we can take
really ambiguous input from
users and handle any sort of
input now. We can take a user's
entire question, use that to
do a search, and then hand those
results to an LLM. Many people
who are building chat-based interfaces
with LLMs are actually using vector
search behind the scenes in order to
handle these more ambiguous
queries from users.
And once we start adding vector
search, we can
improve things because we can find
things that are similar in meaning. We
can also handle multiple languages more
easily, because a
lot of these vector models actually
understand multiple languages. So
suddenly our search can understand
multiple languages. We also have
vector embedding models that understand
images, so now our search can understand
images. Vector search really can
improve search across the board. So
that's the big use case for
vector embeddings. We are going to see a
few other use cases as well, but
there are just so many
great places where you can use vector
embeddings in order to enhance
retrieval in software.
Okay. So that was at a high level. So
now let's start digging in. So how do we
generate embeddings?
We need to use a vector embedding model.
So a vector embedding model is a special
type of model that is trained
specifically to be able to convert
inputs into a vector. These days it
does use architectures similar to large
language models, but it's
not the same as a large language
model, right? It can use really
similar architecture, like the
transformer, but it has the goal
of creating vector embeddings, of
learning the similarities and
differences between different inputs.
That is its goal when it's being trained,
and that means the training creates
an embedding model that is
really good at understanding which
things are similar and which things are
different. These embedding
models these days are trained on huge
data sets, typically
from the whole internet, and
they're looking across the whole
internet at the similarities and
differences between words. So a
vector embedding model can take an input
and then output a vector, and it can do
that for things it's seen, and once
it's been trained, it can do that for
things it hasn't seen as well. That's
what makes it really
powerful.
Now there are different vector embedding
models out there. One
that you may have heard of, that we've
had for many years, is called word2vec,
and that one knows how to turn a
single word into a vector embedding.
It's one that you can even just
run on your laptop if you want to train
your own word2vec model. It's very easy to
make a word2vec model and
be able to encode words, but the thing is, it
only encodes a single word at a time.
Now these days we have vector embedding
models that can encode really really big
inputs. The new OpenAI
embedding models can encode up to
8,191 tokens, which is huge. That's
a large amount of text,
right? So you could take an entire essay
and say, hey, make a vector embedding
that represents this whole essay. These
models are much more powerful
because they can accept such longer
inputs and come up with a
vector embedding that represents a
really long input. Once we
had these new models that could encode
much bigger inputs, that's when
vector embeddings really took off, when
people realized, whoa, now we can
take any user input,
represent that as a
vector embedding, and do so
much with it.
And the interesting thing with these
models is that they output vectors of
different lengths, right? So remember a
vector is a list of numbers.
Word2vec typically outputs a
list of 300 numbers. OpenAI's
text-embedding-ada-002, their older
model, outputs a list of 1536 numbers.
Now, OpenAI has two newer models:
text-embedding-3-small and
text-embedding-3-large. The 3-small model
has a vector length of 1536, and
3-large has a vector length of 3,072.
And then we also have the Azure AI
Vision model. That one can take image or
text; that's a multimodal embedding
model. We'll talk about that in the
vision session. That one has a vector
length of 1024. So you can see
there are differences in what kind of
input they can take and how long their
vectors are, and then there are also
differences in quality. So we
have a leaderboard here: Hugging Face
has this leaderboard to compare
embedding models. So if you're curious to
see the benchmarks for embedding models,
like if a new embedding model comes
out, you can check out their
embedding benchmark leaderboard and see
how these models are doing. But
generally what you see is that currently
the OpenAI text-embedding-3-large model
is a really good model, ranking
higher than any of the
previous OpenAI ones. So that's the one
that I tend to use whenever I'm
developing applications, because it does
have the highest quality. So if you
go here, this is the leaderboard
and this will show the rankings of
the models here,
and, oh, I haven't visited
in a while, but you can see that
there are some new models here,
ones from Gemini and Qwen, and they look
at different statistics here.
They average across them
and see which
ones are the best.
So you can dig more into that if
you're interested in the benchmarks,
especially if you're looking for
embedding models that work for a specific
language. They do have some benchmarks
for different languages too, which is
nice.
All right. So now let's go ahead and
generate some embeddings. So, we're
going to be focusing on the OpenAI
models since we can use those with
OpenAI.com or Azure or GitHub models.
In this case, I'm using
GitHub Models since I can use that for
free.
And we can create some embeddings. So
I'm going to go to the actual codespace to
do this.
All right. So I'm going to open up the
generate embeddings notebook.
So the first thing I need to do is
create a client. I'm going to use the OpenAI
Python package
and connect to GitHub Models. So this is
the URL for GitHub Models, and my
key is my GitHub token, which is in
the GitHub codespace already, and I'm
going to use the text-embedding-3-small
model here, which has 1536 dimensions.
Next, I'm going to go down here,
and then I can use that OpenAI client to
create a new embedding. So I say, okay,
this is the model I want to use, these
are the dimensions, and here is the
input. I can run this, and it goes off
and creates an output, and you can see
it's 1536 numbers long. We can try
a different input, like "big dog",
or we could try different languages,
like "grande".
And we can also put in gobbledygook.
That's the
interesting thing: with these models,
you can turn any input
into a vector, and it will try to
represent that input. You can
put gobbledygook in there, and it'll
turn into a vector, and that
vector exists in this
multi-dimensional space, right? This vector lives
somewhere inside the vector embedding
space and is closer to some vectors than
other vectors, even though the input is
absolute nonsense. So that's something
to keep in mind with vector embeddings:
garbage in, garbage out.
If you put in nonsense, it
will give you a vector and
at least pretend that it has
some meaning, even if it really
doesn't have any meaning at all. It
will try to come up with some sort
of representation of it, even though
it has very little meaning.
Most of the time we want to use this
with things that actually have
meaning,
so we'll pass in actual
phrases.
Okay. So now we've seen
how we can generate embeddings, and I
have a bunch of embeddings already
precomputed in this workspace, so
that we'll be able to look at a
bunch of vectors
of different inputs without having to
wait for lots
of embedding generation, because it does
take time to generate the embeddings.
About the model dimensions:
this is 1536, and this is text-embedding-3-small.
The model dimension does depend on
the model, so you would need to check
and see, for a particular
model, what dimensions it supports. We
will see later that OpenAI does actually
support using fewer dimensions than the default
for some of their models, and that's
an option, but by default this model is
1536,
and so that's what we're
specifying here. Generally,
embedding models all have their own
similarity space. So we need
to know, when we create embeddings, which
model we used and how many dimensions we
used, because whenever we're going to use
those embeddings going forward, we need
to use the exact same model and the
exact same dimensions.
All right. So now let's talk about
different models. As I
was just saying, it's really
important that we know which model we're
using, because every embedding
model has its own similarity space. It's
like its own brain, and some brains have
different similarity spaces than others,
right? In my brain,
there's a
different similarity between
unicorns and ponies than in somebody
else's brain. So what we need to do
is know which model we're using
and stick with it, so that we're
able to actually compare things. So
here's a comparison between the two
OpenAI models, looking at two
different vectors. For the same word,
I can pass the same word, queen, into
their older model, ada-002, and we do get a
vector of 1536 dimensions, but it's
kind of a weird
vector. If you look at it, most of
the values are really close to
zero, but then there's this value
here that goes down to
about -0.7, right? And this is the weird
thing about this model: every
single vector has a value around -0.7
at dimension 196. It always
has a value that goes really low
like this. So that's a weird thing
about this vector embedding model, that
this is what the numbers tend to
look like: they're mostly around
0.0, and they have a couple of super downward
spikes, right? And I saw that with every
single input that I put into this model,
that it all had this really similar
spike downward here.
Now, the text-embedding-3-small model
also outputs 1536 dimensions,
but the actual values look much
better distributed. Here we
can see the values of these numbers
range between 0.1 and -0.1,
and they're just all
well distributed over that range. There are
no really extreme spikes like that
one. To me, this means this
is a better model. I think
they did a better job training it,
since it doesn't have
any weird artifacts like the older one.
Now, the thing is, when you look at this,
what does this even mean, right? You
can't look at a vector and see the
meaning in it. You
can only really understand how
a similarity space works when you
compare and see what it thinks
is similar. That's how you can
understand a vector similarity space:
by seeing what it thinks is similar
to other things. So let
me open up another notebook here.
So we've got embeddings for
a bunch of words across a few different
models, like ada-002 and text-embedding-3.
And what we want to do with these is
look at
which ones are similar. All right.
So now let's talk about similarity.
As I was saying,
the way to really understand a vector
similarity space is to see what it
thinks is similar to each other.
Now, how do we actually measure
similarity? The most popular
metric, and the one we're going to use
today, is cosine similarity.
So let's see cosine similarity in the
similarity notebook here,
and I'll go ahead and load these
up.
All right, so I'm loading in the
embeddings that I've already generated
for
1,000 words across each of the models.
Now, here's the cosine similarity
function. Cosine similarity is the dot
product of the two vectors
divided by the product of their
magnitudes. So first we calculate the
dot product, then we calculate the
magnitudes, and we divide.
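Here's a minimal NumPy version of that function, matching the formula just described (a sketch, not necessarily the exact notebook code):

```python
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the two magnitudes."""
    a, b = np.array(a), np.array(b)
    dot_product = np.dot(a, b)
    magnitudes = np.linalg.norm(a) * np.linalg.norm(b)
    return float(dot_product / magnitudes)
```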
So here we've got that implemented. And
now we can measure the cosine
similarity between dog and cat, and
we get this number, 0.62. In this case, 0.62
means they're fairly
similar, right? Now here's
what's interesting about cosine
similarity. See how it's dot product
over magnitude?
Now, the magnitude of a vector
is one if it's a unit vector. This
is where I said we were going to
do a little bit of math, right? If you
remember from vector class, a unit vector
is a vector that has a magnitude of one.
So if you have unit vectors and you do
dot product over magnitude, then you're
really just doing the dot product. So
let me actually show you. Let's
print out the magnitude here.
We'll run this, and
you can see the magnitude is basically
one, right? With a
little margin of error here, but the
magnitude is basically one. And that's
because the OpenAI embedding model
vectors are unit vectors. That means
I can actually speed up my calculation
here when I'm using OpenAI models,
because I can delete the magnitude and just
do the dot product. Remember, the
similarity was 0.62. I'm going to run
this. Boom. 0.62. So I
just saved myself a whole calculation,
because I only have to do the
dot product. Now, this only works if
you're working with a model that
produces unit vectors. That's true
of all the OpenAI models. So if you are
using OpenAI models, you can actually
save yourself some time by just
using the dot product and not dividing by
the magnitude. Sometimes this
doesn't matter, because you're using
a vector database, and the vector
database just takes care of
this for you and does the optimization.
But sometimes it does matter. When
I'm using Postgres, Postgres gives
you the option to do cosine similarity
or dot product, and if I'm using the OpenAI
embedding models, then I use the
dot product metric instead. So I think
it's worth knowing that there is a
slight performance improvement that you
can get if you know you're working with
unit vectors.
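As a sketch, the unit-vector shortcut looks like this, reusing the NumPy import from above (the vector variable names are hypothetical):

```python
# Magnitude (L2 norm) of an OpenAI embedding vector is ~1.0,
# so cosine similarity reduces to just the dot product.
def dot_similarity(a: list[float], b: list[float]) -> float:
    return float(np.dot(a, b))

# print(np.linalg.norm(dog_vector))       # ~1.0 (hypothetical vector)
# dot_similarity(dog_vector, cat_vector)  # ~0.62, same as cosine_similarity
```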
Okay. So that was our math
for today.
We do have lots of resources for
anybody who wants to dig deeper into the
different distance metrics, and
it's some pretty interesting
stuff. I do wish I'd paid more attention
to vector math in college now. All
right. So now that we have a way of
calculating similarity,
we're going to go ahead and look for
the most similar words to a given word
using that metric. Basically, we're
going to calculate the similarity from
one word to every single other word in
my collection of words, because I've got
a thousand words. So I'm going to
calculate and see, okay, for the
word dog, which of the thousand words
is closest to it according to cosine
similarity. So for dog, I can look here
and see that, okay, the closest one is
dog. That's good. Dog to itself
gets a similarity of one. Very good. Now
the next one is animal; the
very closest one after dog itself, with
a cosine similarity of 0.88. Now
the one after that is god. So according
to this model, god is actually more
similar to dog than cat is. So I guess
this model is a dog person. This
model really likes dogs. I think
the reason is because of the spelling.
This is the ada-002 model, and I think the
ada-002 model did kind of incorporate
spelling during its training process
in a way. I'm not sure exactly why,
but this is interesting to see,
because you're like, huh, if it
thinks dog is close to god,
how might that affect other
ways that I would use this
model? And then we see also drug and gun
at the bottom here, right? So, I think
once again, drug is probably close in
terms of spelling. Also, this is
only looking at a thousand words, and
maybe there weren't any more similar
words here. But this is what I find
useful: if you're trying to
understand a similarity space, see,
in that similarity space, what
it thinks is similar to
each of the inputs.
Now we'll look at the same thing for
the text-embedding-3-small
model. That one actually has much
more reasonable results: here we've got dog,
then we've got animal. Now, what's
interesting is that animal here has a
similarity score of 0.68, and if you looked
up above, the closest thing was 0.88. So
this is another thing you'll notice:
you shouldn't think of similarity
scores as being absolute,
because they differ so much across
models. In the ada-002 model, 0.88
meant very similar, but in this
model, 0.68 means very similar. So
you really cannot look at a similarity
score on its own and know, oh, that's
a really similar thing. You can
only look at it relative to similarity
scores from the same model. Right? So
here we're seeing, okay, so for this
model, 0.68 is a close score. We
see animal, and then we see that cat
is 0.6, and for this model that is actually a strong
similarity. And then we go all
the way down to baby and door. I
guess dogs are at the door a lot. So
that might be why it thought door was
similar, because a lot of times, when
these models are trained, they're
trained by looking at proximity in
text. So if dog is close to door in a
lot of text, then that can increase the
similarity score, right? Maybe dogs are
hanging out with the babies, like Lady
in Lady and the Tramp, taking care of them,
right? Who knows? But I'd say
this is a much more reasonable set of
similar words than ada-002's. That's
why I have moved everything over to the
new models: I think they just
have a more reasonable similarity space.
All right, cool. So that, I think,
is a really helpful way of
understanding a model.
Now, you can also try to visualize the
model. Here I used this technique
called PCA, principal component
analysis. It tries to turn the 1536
dimensions into just three dimensions
so we can plot it in 3D space. And it's
fun to do this, and it's
cute, and it makes cool visualizations,
but you lose so much information, because
you're literally losing 1533
dimensions of information when we try to
squeeze them down into three dimensions
of space. So it's fun to
make these 3D projections of the
vectors, but I don't actually think it's
that useful, because we lose so much
information. I think it's more
useful to actually just ask, okay,
for a given word, which
concepts are the most similar? But
we like to see pretty graphs.
So, here you go. Here's some pretty
graphs.
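As a sketch of that projection step, something like scikit-learn's PCA can squeeze the vectors down to three dimensions (the random array here is just a stand-in for the real embeddings):

```python
from sklearn.decomposition import PCA
import numpy as np

# Stand-in for the real embeddings: a (n_words, 1536) array.
word_vectors = np.random.rand(1000, 1536)

# Project 1536 dimensions down to 3 so the words can be plotted
# in 3D space; the other 1533 dimensions of information are lost.
pca = PCA(n_components=3)
points_3d = pca.fit_transform(word_vectors)
print(points_3d.shape)  # (1000, 3)
```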
Now, another thing that's useful is
looking at the actual values of the
cosine similarity. We were looking earlier,
and I was showing, okay, here
we saw the closest one was 0.68, and
here the closest one was 0.88. I think
that's useful to look at to give you a
better intuition,
for a particular model, of what kind of
values to expect. For ada-002,
all of the values are between 0.86
and about
0.78, right? They're in a really
tight range. Versus the newer model,
text-embedding-3-small, where we see the
values range between about 0.7 and
0.05. That's a much more
reasonable range. That's another reason
why I think this is a better model:
it just has a better
distribution across the range. With this
one, you could actually look at the
values and say, okay, a
0.2 cosine similarity probably means not
particularly similar, right? So it's
easier to come up with the kind of cut-offs
to say, okay, if it's below
this similarity value, it's
just really not that similar at all.
So I think it's interesting
to look at the range of
similarity values to get a feel for what
you might expect.
So now we have seen cosine similarity
and vector similarity. I see there's lots of
discussion in the chat about the
math behind it, so definitely
dig into that more if you like. Now, there are
some great use cases for just vector
similarity on its own. You can use
it to make recommendation systems. In
the past, if you wanted to make a
recommendation system, you'd often have
to look at lots of user input and
consider what different
users liked, and say, okay,
if lots of users liked this, and
they also liked this, then those
are similar; we're going to
recommend them. That's still a really
great thing to do if you can build a
whole user-based recommendation system.
But you could also now just use vector
embeddings for making recommendation
systems, and it's a lot easier,
because these vector embedding
models have been trained on the internet,
so they have a rough idea of which
things are similar. I know people
who have used this just
to add
recommendations onto their personal blog
or something, right? It's now so
easy to make recommendations
that you can do it for anything; you can
just throw it on there. You
don't have to build your own
recommendation model; you can just use
the embedding models, and then, boom, you
can show: hey, you're on this
piece of content, let's recommend these
other pieces of content based off of
what you're currently on. So that's
definitely a big use case
you can consider for using vector
embeddings. It's a great way to get
started with them.
Another interesting use case is fraud
detection. People use vectors
in order to establish whether a new
input is more similar to a fraudulent
input than to a real input. Now, this you
really need to do very carefully,
because you don't want to accuse
anything of being fraud if it's not
fraud. Spam detection is similar.
So that would be another use case
for vectors. Now, interestingly, when
people are doing fraud detection,
sometimes they use different metrics
other than cosine similarity. Here
today, we're really talking a lot about
cosine similarity, because that's the
metric that people are using for most of
the generative AI use cases
that we're looking at. But for
fraud detection, sometimes people use
other distance metrics as well. So
cosine similarity is a great metric.
It's the one we're showing today because
it's going to be the most useful for
all your generative AI applications. But
just so you know, there are other ways of
measuring the distance between vectors,
and sometimes other metrics
are more appropriate for the scenario,
like with fraud detection.
Okay. So now we can talk about how we
can do vector search, and that's really
what everybody's so excited about in
terms of vector embeddings. The idea
is that once we have all of our data
converted into vectors, we can then
search that data based off any new user
input. We get the user
input, we turn that into a vector using
the same model, and then we use that
vector to search the existing vectors
and say, okay, based off that new user
input, here are the closest vectors that
we have.
We can start off by doing an
exhaustive search. An exhaustive
search means that we take
the input vector, compare its
cosine similarity to every
single other vector, and say, okay, we
found the vector that is closest to this
input.
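A minimal sketch of exhaustive search, building on the `cosine_similarity` function from earlier (the function and variable names are hypothetical, not necessarily the notebook's):

```python
def exhaustive_search(query_vector, vectors, top_n=10):
    """Score the query against every stored vector (exhaustive),
    then return the indices of the top_n most similar ones."""
    scores = [cosine_similarity(query_vector, v) for v in vectors]
    ranked = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_n]
```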
So let's look at the search
notebook here,
and let me run this. Here
I've got a bunch of embeddings for
Disney movie titles. Let me actually
show you what those embeddings look
like: text-embedding-3-small, 1536
dimensions. Okay, let's
see if I can get it to open the
file. It is a really big file.
Generally, you don't want to store
embeddings in JSON files. I did it
for this notebook because I didn't have
that many, but they do take up
a lot of space. So here I've got a
bunch of embeddings for Disney movie
titles. I took the movie title
and then I generated the embedding.
It's solely based off the title; that's
how I generated the embedding.
That's what I'm loading in here.
So I've loaded all of those in,
and then I'm going to set up my
connection to GitHub Models,
and then I'm going to make a little
wrapper function that can generate an
embedding using the OpenAI SDK.
And then I've got my code here to
compute cosine similarity and do
exhaustive search. Now, as I was saying, we
could actually just remove the
magnitude calculation, since we know
they're unit vectors, but we can also
just leave it in, because that's the
standard way of doing cosine similarity.
And so then we're going to create a new
vector. Here I've got the
input: a toddler-friendly movie about
cats. I have two daughters, and
they often want to watch movies about
cats, and we have Disney+, right? So
here I'm trying to search for a
good movie that we could watch that they
would like. We get the embedding for
that input, then we use that embedding
to search all of our existing vectors.
We say, all right, let's calculate
the cosine similarity for them
all, and let's sort. So let's see
what we get as a result.
Okay. So the top movie is The
Aristocats. Then we have the tiger
movie, then Ratatouille, then Cars 2.
African Cats is somewhere in here. We
have A Goofy Movie; we've got dog
movies. Unfortunately, there are actually
not that many movies about cats on
Disney+. It's unfortunate. But
this is a pretty good
result here. Now we can
put in all kinds of
things, right? We can do
something in Spanish. We can say
"películas sobre leones", which is movies
about lions,
and here we get The Lion King. So
once you're using
vector search, you're able to support so
many more searches, right? And we can
be really casual, like: oh man, my
daughters are screaming
for unicorn stuff, I need to occupy them
right now. Right? This is just a
stream-of-consciousness input from a user,
but a vector embedding model can turn
anything into a vector. And here we can
see the result suggested was Babes in
Toyland. I haven't watched that one
yet. And then Monsters, Inc. and Ice
Princess. So these were
okay responses. I guess
there are no unicorns actually in
this data set. Let's see: kitty, kitty,
dinosaur. Let's try dinosaur. Do
we have any dinosaur
results here? Dinosaur. There we go.
Right. So I had this whole long stream
of consciousness, and it picked out that
dinosaur was a salient
part of that vector, right? And
so it's turned this whole
phrase into a vector that lives
somewhere in that multi-dimensional
space, and fortunately, that vector is
closer to the Dinosaur vector than it is
to any of the other vectors. So
that's what's really powerful
about vectors.
Okay. So there we go. That's
pretty cool. Now, I saw a comment in the
chat that exhaustive search sounds
expensive, and that's true, right? Here
we're only searching something like
500 movies,
and so we're able to do the exhaustive
search, and the exhaustive search is
relatively fast. But when you're
actually using vectors in production,
you're going to have much bigger
databases. You're going to have
thousands of vectors, millions of vectors;
you could even have billions of vectors.
We do actually have customers at Azure
that have billions of vectors in their
database.
So what we can do is use
approximation searches. We need to
use an algorithm known as an
approximate nearest neighbor search, or
ANN. These are all search algorithms
that try to find the best results without
actually having to search exhaustively.
They're trying to find the
highest quality results without having to
look at every possible option, so
they have to use some sort of heuristics
to cut down the search space. Now, the
most popular one these days is HNSW, and
that has really good support across
lots of databases. Postgres
supports it with the pgvector
extension, Azure AI Search supports
it, Chroma DB, Weaviate...
pretty much all the big vector databases
are supporting HNSW,
so that's a great pick. If you
see that as an option, that's
a great pick. Now, Microsoft came up
with a new approximation algorithm
called DiskANN, and we're now using
that for several Azure products. Cosmos
DB has it, Azure SQL has it, and
Azure Postgres has it. So if you're
using Azure Postgres, you can actually
pick between HNSW or DiskANN.
There's also IVFFlat, and that
is supported by Postgres. It's not as
supported by the other ones, because it's
not quite as practically useful for a
lot of production uses of vectors. It's
got some limitations: IVFFlat
works best if you're only going to
build the vector index once and you're
not doing lots of updates. But a lot of
times, if you have a database, it's
because you're going to be updating that
database, right? So you want something
that works well, that can handle lots of
data updates.
And then there's FAISS. FAISS is
something you could use if you just
needed an in-memory index, but it's
not designed for a
persistent database.
We're going to look at HNSW, because
it is the most popular one,
the one a lot of people are using, and
you can use it just in Python. I
wanted to find an example that we could
run in Python without having to set up a
database today, just so you could get a
feel for it. HNSW does work well in
situations where you have an index
that's frequently updated, and it can
scale really well up to large indexes as
well. It's a very well-designed
algorithm, and we can use it for a
lot of our production use cases. So I
actually have that set up in the
notebook here using the hnswlib
package. We declare our
index, we say how many dimensions it's
going to have and that we're going to
use cosine,
and we have some parameters here that
you can tweak in order to
change things like the performance of the
index. And then we can add all the items
to the index.
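Here's a minimal sketch of that setup with the hnswlib package (the parameter values and the random stand-in data are illustrative, not necessarily the notebook's):

```python
import hnswlib
import numpy as np

dimensions = 1536
embeddings = np.random.rand(500, dimensions)  # stand-in for the movie embeddings

# Declare an index that uses cosine distance.
index = hnswlib.Index(space="cosine", dim=dimensions)

# ef_construction and M are the tunable parameters that trade off
# index build time, query speed, and recall quality.
index.init_index(max_elements=len(embeddings), ef_construction=200, M=16)

# Add all the items, with integer ids we can map back to titles.
index.add_items(embeddings, ids=np.arange(len(embeddings)))
```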
And basically at this point
the index is set up. Now, usually
you're not going to be the one writing
all this code. Typically, you're going to
use a database like Postgres that's
going to set this up for you behind the
scenes, maybe with some database
commands. But I thought it was
helpful to see that you can do it in
Python.
So now I can get my embedding for the
toddler-friendly movie. We'll say about
dogs, since it seems like there are some
dog people in the chat. We're
going to get that embedding, and then
we're going to do a query. This is a
k-NN query, which says: get us the 10
nearest neighbors
to this vector,
and then we display them.
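Continuing the sketch, the query side looks like this (`get_embedding` and `movie_titles` are hypothetical names for the wrapper function and title list):

```python
# query_embedding: the vector for the user's search phrase.
query_embedding = get_embedding("A toddler friendly movie about dogs")

# k-NN query: the 10 approximate nearest neighbors to this vector.
# With space="cosine", hnswlib returns cosine *distances*,
# so similarity = 1 - distance.
labels, distances = index.knn_query(query_embedding, k=10)
for label, distance in zip(labels[0], distances[0]):
    print(movie_titles[label], 1 - distance)
```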
All right, so we got lots of dog movies,
right? That was fast, and it did a
good job. The idea of HNSW is that it
should get you really similar results to
what you'd get doing exhaustive search.
So let's go try the same thing up
here: a toddler-friendly movie about dogs.
All right, we'll run that. Snow Dogs, 102
Dalmatians, The Fox and the Hound, right?
And what we see is basically the same
results down here. And of course, this
was on a really small index, so we would
expect to see really good results. But
the idea of HNSW is that even as you're
adding more and more vectors, you can
keep getting really high quality
results,
not having to sacrifice
quality, but still getting really good
speed.
All right, so that was vector search. So
what are the use cases for vector
search? First of all, if you have
any sort of search box on your web app
or your website, you can enhance that
search box by adding in vector search,
because once you add in vector
search, you can handle more complex
queries, more ambiguous queries, right?
You can handle multilingual queries.
So it can generally just
improve any sort of search that already
exists.
And then the really big use case with
LLMs is RAG, where we're using an LLM
to answer questions by searching some
data. And we're going to be talking
about that really in depth tomorrow,
so please come to tomorrow's
session, where we're going to be talking
about RAG. But this is really where
vector search has taken off, because
when people talk with LLMs,
they don't use search
queries to talk with LLMs. They ask
questions, right? So we need to be able
to take questions
from users and be able to use those to
do vector searches, right? So,
for example, this is RAG
on top of Azure AI Search that's
searching documents like PDFs. So
this one is doing a vector search of
documents, and we can actually see
the kind of scores that come back
from the search service for the
various chunks. We have
created embeddings of all of these
chunks
from the documents, and we are searching
based off of the embeddings of these
chunks. That way, we can get really
good results and then send
those results to the LLM and get
a good response. So that is RAG,
and that is what we're going to be
talking about for the whole session
tomorrow.
All right. So now, moving on, let's
talk about how we can compress vector
embeddings.
As you've seen, vectors
take up a lot of space. Here
is the title of
the movie, and the actual vector is
really, really long, and this is
actually the small embedding model:
this is 1536. Usually I use the large
model, and that's 3,072.
So once we start using
vectors, we are increasing the amount of
storage space that we need
for our data, and then we're actually
paying for that storage space, and
it affects how well we can
productionize the applications that
we're building. So when you are thinking
about putting your vector search into
production, you may think about how you
can compress the vectors to decrease
your storage size and cost and also
decrease the search latency. And as it
turns out, there are two really
interesting techniques we can use to
compress vectors and not lose that much
quality. We're going to look at
vector quantization, where we take each
number and make it take up less
space, and then dimensionality
reduction, where we take a long
vector and make that vector shorter.
So let's talk about vector quantization.
This is where we start off with the
list of floating-point numbers, right?
That's what we just saw: tons and
tons of floating-point numbers. These are
64-bit floating-point numbers, so they
require a full 64 bits to store
in memory. But what we can do is reduce
them so that they don't require 64 bits.
We can start off with scalar
quantization; that's where we reduce
each number to an integer. And then we
can even go as far as binary
quantization, where we reduce each number
to a single bit. And it's crazy that
that works. So let me go ahead and
open the quantization notebook here.
And here, okay, so we're going
to load in the same ones from before, the
1536-dimension ones.
All right. So we can see these are
currently 64-bit numbers.
Now, for scalar quantization, the approach
that we use
is that we look and
see, okay, what is
the range of these values, what do these
floating-point numbers range between, and
what's kind of the middle
of those values, and we take the range of
those values and we map them to a new
range. For this code here, I'm going
to map the values into the range -128 to
positive 127,
right? So it's going to take all
those floating-point values, and
each of them is going to fall into one
of 256 buckets between
this range. You can see here we
find the global min and max, we
normalize the embeddings,
and then we go and figure out,
between the min and the max,
which number each value is going to become.
So I can run this on The
Little Mermaid, and now we can see that
The Little Mermaid is represented by a
list of integers instead. And integers
require much less storage space than the
64-bit floating-point numbers up
here.
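A minimal sketch of that scalar quantization step (the exact notebook code may differ):

```python
import numpy as np

def scalar_quantize(embedding: np.ndarray) -> np.ndarray:
    """Map each float into one of 256 integer buckets
    in the range -128 to 127."""
    global_min, global_max = embedding.min(), embedding.max()
    normalized = (embedding - global_min) / (global_max - global_min)
    return (normalized * 255 - 128).round().astype(np.int8)
```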
So now we've got that vector.
You can look at
it and see that, once we've mapped it
into integers, all of the values for this
vector ranged
between about -10 and positive 70.
The full possible range is -128 to
positive 127, but it
just depends on where the values fall.
Now, the big question is: okay,
we removed a bunch of
information, right? We went from 64 bits
of information to just an
integer of information. How does that
affect similarity? That's what matters:
are we going to get the same results?
I saw a question like, are
we getting the same results with HNSW?
That's always the question: as
you add on approximation
and compression and all that, how good are
the results? A lot of times, they're
not going to be as good, but a lot of
times they could be good enough, right?
So here we can look and see: okay,
for Moana, with the integers, we see the most
similar movies
are Mulan, The Little Mermaid, and Lilo &
Stitch, and we can compare that to the
original, where we see Mulan, Lilo &
Stitch, The Little Mermaid. So the top
three are in a slightly different
order.
They're in a slightly different order,
but
they're pretty darn close, right? So
given how much performance improvement
we can get out of this, this might be
good enough, right?
But it's something you really have to
decide, and actually run evaluations and
see: okay, once we add in
these performance enhancements, once we add in
this quantization,
is our
performance still good enough? Is
our retrieval quality still good enough?
The next thing is, we can go more
extreme, and we can turn each of the
numbers into bits: just zero or one. For
this one, what we do is we look
at the full range and figure out the
average, the mean, and if a value is less
than that, it becomes zero, and if it's
more than that, it becomes one. Right?
That's what this code here does: it
figures out the mean and then quantizes
each value to either one or zero.
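A sketch of binary quantization in the same style:

```python
def binary_quantize(embedding: np.ndarray) -> np.ndarray:
    """Reduce each value to a single bit: 1 if above the mean, else 0."""
    return (embedding > embedding.mean()).astype(np.uint8)
```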
And so then we can see Moana becomes,
you know, 1 1 0 1 0 1, right?
And it's crazy to think
that we could do this; we're
losing so much information. We went
from 64 bits to one bit. I was
actually working on this notebook with
my mom, and my mom, she's a
mathematician, and she did not think it
was going to work. She was like, "No
way, we're going to lose so much
data." But, you know, let's just look and
see.
All right, so that's what the vector
looks like. When we
look at the most similar movies, we
see Moana, Mulan, The Little Mermaid;
The Princess and the Frog did move up one,
but then we've got Lilo & Stitch here.
So we still see really similar results,
even with one bit to represent that full
64-bit floating-point number. That's what's
really interesting: you
can actually remove so much information
and still retain a lot of the semantic
meaning in that vector.
Oh, and I see there's a comment about
how the similarity numbers are
different. That's true, the similarity
numbers are different. That's why,
as I was saying, usually we
don't want to be looking at
similarity numbers strictly on their own;
we want to look more relatively. So,
relatively, we see that within this
search, Mulan is more
similar than The Little Mermaid. But yeah,
it's true that this one's 0.68 and this
one is 0.54, right? So really different
similarity ranges here. If you were
doing any sort of
similarity threshold cutoff, it
would really depend, in this case,
both on the model and the
space and what sort of quantization you
were doing. So that's a good
observation there.
Okay. So that's crazy, but it worked,
right? And here I've got the actual
effects on similarity between the
float and the binary versions.
There are definitely differences here,
and we can argue about which of the
results are actually better,
but it still did retain a
surprising amount of semantic
information, because
it wasn't just nonsense, right?
It came up with a bunch of similar
movies, just some differences in which
is which.
Now, the cool thing is, if you're
just working with a few
thousand vectors, it's not a big deal, but
if you are doing millions of vectors,
billions of vectors, this can make a
big difference in reducing the size
of your vector index. Here we
have a comparison for Azure AI Search,
which supports quantization, and if
we start off with float32 and
then we go down to bits, we can see that
we get a 97% reduction in the index
size. That's huge, right? Because
that's money that you're actually paying
for. And if you
don't need to use up all that space,
then why use it?
So that's quantization.
And now let's look at the other
technique, which is dimension reduction.
Dimension reduction is a technique we
can only use on models that were
specifically trained to be able to
support it. The OpenAI embedding
models were trained to support MRL,
Matryoshka Representation Learning. Since
they were trained to support MRL,
it means that we can reduce those vectors
to different sizes and still
retain a lot of the same semantic
representation,
but you do have to be really careful to
only use this on the models that support
it. Now, the cool thing is, with the
OpenAI SDK, you can actually just do it in
the SDK itself. When you're
using the model, just pass in the smaller
dimension. So for this one, 3-small
is usually 1536; I can just pass in 256,
and it will do the correct reduction and
give back the new embedding.
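In code, that's just the `dimensions` parameter on the same embeddings call from earlier (a sketch, reusing the client from above):

```python
# For MRL-trained models like text-embedding-3-small, the SDK
# can return a reduced-dimension embedding directly.
response = client.embeddings.create(
    model="text-embedding-3-small",
    dimensions=256,  # reduced from the default 1536
    input="The Little Mermaid",
)
short_vector = response.data[0].embedding
print(len(short_vector))  # 256
```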
So, does this work? I
did the same comparison here. I went
from 1536 down to 256; that's the
minimum that is supported
for reduction on this model. And
what we see is really similar
results, with some differences,
similar to the quantization effects.
The interesting thing is, you can
combine these techniques. You could
start
off with 1536 dimensions, reduce those to 256, and
you could then, even in theory, reduce
those 256 numbers to only be bits instead
of floating-point numbers, and you would
have an index that takes up a tiny
amount of room. But I know you're
probably all freaking out, like, oh my god,
you're definitely going to lose some
quality with that, right?
Definitely, because you're losing
information in all
directions, all dimensions,
right? But there's this really cool hack,
or technique, that you can use, which
is to oversample. What you
can do is, let's say originally you
wanted 10 results, and you were going to
get those 10 results from your
non-compressed index. What you can do is
say, okay, I'm going to ask
for 150 results from my compressed index,
and you can still store the
original vectors in a more
efficient place to store vectors,
not in your actual index, but in a
different spot, and then you can rescore
the results according to your original
vectors. Basically, if you ask for
150 results,
you're going to find that those 150 results
are going to contain the original top 10
best results. So you get those 150, you
then rescore the 150 using the original
vectors that are stored elsewhere, and
then you'll find those top 10. The top
10 will rise out of that. Right? So
this is getting into
use cases that maybe not everybody
needs; maybe you don't all need
this compression. But we do
work with lots of customers that have
lots and lots of vectors, and
this is what we recommend: okay,
use all these compression techniques, but
then use what we call oversampling
with the
original vectors in order to rescore.
I think it's really a cool
technique that we can
use in order to have fast
retrieval from the vector index but
still have high quality results. We
did a whole series about it,
where we did a deeper dive into it, and I
was really impressed to see the results:
you can do all of this, and as
long as you do the oversampling with
rescoring against the originals, you get
fantastic results.
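Here's a rough sketch of that oversample-and-rescore pattern, building on the hnswlib index from earlier (the oversampling factor and variable names are illustrative, not the exact recommended implementation):

```python
def oversampled_search(query_vector, index, original_vectors, top_n=10):
    """Ask the compressed index for many more results than needed,
    then rescore the candidates with the original full-precision
    vectors (stored outside the index) to recover the true top_n."""
    oversample = 15  # 10 * 15 = 150 candidates, as in the example above
    labels, _ = index.knn_query(query_vector, k=top_n * oversample)
    candidates = labels[0]
    # Rescore with the dot product against the original vectors
    # (valid for unit vectors, like the OpenAI embeddings).
    rescored = sorted(
        candidates,
        key=lambda i: np.dot(query_vector, original_vectors[i]),
        reverse=True,
    )
    return rescored[:top_n]
```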
Okay, so we covered a lot today. As I
mentioned, we did go
kind of deeper and nerdier into
the math of it, because vector
embeddings are a bunch of numbers,
so we have
to think about what they actually mean
and how we compute distances
and all that stuff. Hopefully this
gave you a better feel for vector
embeddings. If you have any feedback
about how to make them easier to
understand, let us know.
There are lots more resources here that I
used when I was researching vector
embeddings, and you can get these
resources from the slides
and dig deeper into it. I
also have, let me find it, a blog post.
So I have this blog post here that is
kind of a written version of this
session. And I actually even turned it
into a poster. So if you like posters,
you can have a poster of all of this.
It mentions a few other things, like
the other distance metrics.
So yeah, this was one way
of exploring the amazing world of vector
embeddings, and we will keep using
vector embeddings going forward in the
series, especially tomorrow. So please
come to the RAG session, where we're
actually going to do practical things
with vector embeddings and see how we
can use them for making RAG applications,
which is the most popular use of
vector embeddings. We will also
revisit vector embeddings in the vision
models session on Monday, so we'll get to see
multimodal embeddings, which are really
cool, because you can search by
image and you can search images, and
that's really helpful. So we
will continue to talk about cool ways of
using vector embeddings going forward in
the series. I hope you come back, and
I hope that we get to see you tomorrow.
Let's see. We do have office hours
on Tuesdays. We already had them
yesterday; it was a great office hours.
Lots of you came, and we had lots of
great questions. I don't have office
hours today, but you can still join the
Discord, and there are lots of channels in
the Discord where you can ask
questions. Another place you can ask
questions is on our resources thread.
We're keeping all the
resources in this discussion thread
here, so if you do have any additional
questions, please feel free to add them
to this thread, and I can totally
answer follow-up questions there. And
let's see, what else? We are at
time
right now, so we can't go over
today with questions, but hopefully you
got a lot of questions answered in the
chat, and please do post any more
questions in this thread or in Discord,
or bring them to office hours next week.
All right,
thank you everyone. I hope to see you
tomorrow.
Bye.
Thank you all for joining and thanks
again to our speakers.
This session is part of a series. To
register for future shows and watch past
episodes on demand, you can follow the
link on the screen or in the chat.
We're always looking to improve our
sessions and your experience. If you
have any feedback for us, we would love
to hear what you have to say. You can
find that link on the screen or in the
chat. And we'll see you at the next one.
In our second session of the Python + AI series, we'll dive into a different kind of model: the vector embedding model. A vector embedding is a way to encode a text or image as an array of floating point numbers. Vector embeddings make it possible to perform similarity search on many kinds of content. In this session, we'll explore different vector embedding models, like the OpenAI text-embedding-3 series, with both visualizations and Python code. We'll compare distance metrics, use quantization to reduce vector size, and try out multimodal embedding models. If you'd like to follow along with the live examples, make sure you've got a GitHub account.

This session is a part of a series. Learn more here: https://aka.ms/PythonAI/2
Explore the slides and episode resources: https://aka.ms/pythonai/resources
Check out the demos: https://aka.ms/python-openai-demos

Chapters:
00:08 – Welcome & Housekeeping
01:03 – Introduction to Vector Embeddings
02:24 – Why Vector Embeddings Matter
03:32 – How Embedding Models Work
06:01 – Comparing Embedding Models
10:55 – Generating Embeddings with OpenAI
20:59 – Understanding Similarity Spaces
24:47 – Cosine Similarity Explained
34:02 – Vector Search with Exhaustive Search
40:01 – Approximate Nearest Neighbor (ANN) Search
46:54 – Compressing Embeddings: Quantization
56:17 – Compressing Embeddings: Dimensionality Reduction
59:57 – Oversampling for High-Quality Retrieval
1:00:08 – Wrap-Up & Resources

#MicrosoftReactor #learnconnectbuild
[eventID:26293]