Bi-Directional Vector Embedding (BDVec) | 100% Deterministic & Reversible

BDVec is an original, math-based vector embedder with full reversibility. Check out the code and try it yourself below!

GitHub Repo: [Insert Link Here]
Follow the project: [Insert any other links, Twitter, Website, etc.]

----- HOW THE BDVEC ENGINE WORKS -----

BDVec breaks text into tokens, clusters them, and uses pure math and statistics to map their relationships. Here is the current (WIP) embedding pipeline:

1. SLIDING-WINDOW CO-OCCURRENCE (Identifying Relationships)
Once text is tokenized, the system slides a fixed-width window over the stream of token IDs. It counts how often unordered token pairs appear together within these windows, alongside individual token frequencies.

2. NPMI NORMALIZATION (Handling Noise)
Raw co-occurrence counts are converted into marginal and joint probabilities to compute Normalized Pointwise Mutual Information (NPMI). To suppress noise, negative values are clamped to zero, yielding a symmetric, sparse V x V "friction" or association matrix.

3. TRUNCATED SVD (Spectral Compression)
The sparse NPMI matrix is compressed into a dense V x k token embedding matrix via truncated Singular Value Decomposition (SVD). The top-k left singular vectors are extracted and scaled to preserve geometric fidelity.

4. INFERENCE & MEAN-POOLING
At query time, new text is tokenized and each token's dense vector is looked up in the matrix. The vectors for the tokens in a "hunk" are mean-pooled to produce a single k-dimensional semantic vector representing the entire string.

----- INTERACTIVE BDVEC DEMO -----

The demo processes queries against a specific body of text: a training phase, followed by inference and comparison.

TRAINING ON A SPECIFIC BODY OF TEXT
Train the model on a selected .txt file via the GUI's "Train from File" button or the CLI.
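Steps 1 and 2 can be sketched roughly like this. This is a minimal NumPy illustration, not the project's actual code; the function name, the window size of 5, and the exact probability estimates are assumptions:

```python
import numpy as np
from collections import Counter

def npmi_matrix(token_ids, vocab_size, window=5):
    """Sliding-window co-occurrence counts -> positive-clamped NPMI matrix."""
    pair_counts = Counter()
    token_counts = Counter()
    for i in range(len(token_ids)):
        token_counts[token_ids[i]] += 1
        # count unordered token pairs inside the fixed-width window at i
        for j in range(i + 1, min(i + window, len(token_ids))):
            a, b = sorted((token_ids[i], token_ids[j]))
            if a != b:
                pair_counts[(a, b)] += 1

    total_tokens = sum(token_counts.values())
    total_pairs = sum(pair_counts.values())
    M = np.zeros((vocab_size, vocab_size))
    for (a, b), c in pair_counts.items():
        p_ab = c / total_pairs                    # joint probability
        p_a = token_counts[a] / total_tokens      # marginal probabilities
        p_b = token_counts[b] / total_tokens
        pmi = np.log(p_ab / (p_a * p_b))
        npmi = pmi / -np.log(p_ab)                # normalize PMI
        M[a, b] = M[b, a] = max(npmi, 0.0)        # clamp negatives, keep symmetric
    return M
```

In practice the matrix would be kept sparse (most pairs never co-occur); a dense array is used here only to keep the sketch short.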
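Steps 3 and 4 can then be sketched as follows, assuming a symmetric V x V association matrix M as produced by step 2. Scaling the singular vectors by their singular values is a common choice, but the project's exact scaling, and both function names, are assumptions:

```python
import numpy as np

def embed_tokens(M, k):
    """Compress a symmetric V x V association matrix into a dense V x k
    embedding matrix via truncated SVD (top-k left singular vectors)."""
    U, S, _ = np.linalg.svd(M, full_matrices=False)
    # keep the k strongest spectral directions, scaled by singular values
    return U[:, :k] * S[:k]

def embed_query(token_ids, E):
    """Mean-pool the token vectors of one 'hunk' into a single k-dim vector."""
    return E[token_ids].mean(axis=0)
```

At scale, a sparse solver such as `scipy.sparse.linalg.svds` would replace the dense `np.linalg.svd` call.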
The pipeline analyzes the language and produces two persistent artifacts:
* tokenizer.json: Stores the custom BPE vocabulary and merge rules learned directly from your file.
* embeddings.npy: The dense token embedding matrix generated from your text's co-occurrence patterns.

PROCESSING YOUR QUERY
Any new query is processed using the rules learned from your training text:
* Tokenization: The query is broken into BPE subword tokens using the exact vocabulary in tokenizer.json.
* Chunking: Tokens are split into budget-bounded "hunks" to manage input size.
* Embedding: The system looks up each token's vector in embeddings.npy and mean-pools them into a single semantic vector for your query.

REVERSE LOOKUP (Comparing Query to Text)
To compare your query back to the trained text, the demo uses a Reverse operation:
* It takes the pooled query vector and searches the embedding space for the closest individual token vectors using cosine similarity.
* It outputs the "nearest tokens," showing which words from your training document are semantically closest to your query.
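The Reverse operation described above amounts to a cosine-similarity nearest-neighbor search over the token embedding matrix. A minimal sketch (the function name, `id_to_token` mapping, and `top_n` parameter are illustrative assumptions, not the demo's API):

```python
import numpy as np

def nearest_tokens(query_vec, E, id_to_token, top_n=5):
    """Cosine similarity between a pooled query vector and every row of the
    V x k embedding matrix E; returns the top_n closest vocabulary tokens."""
    norms = np.linalg.norm(E, axis=1) * np.linalg.norm(query_vec)
    sims = E @ query_vec / np.where(norms == 0, 1e-12, norms)  # avoid /0
    top = np.argsort(-sims)[:top_n]                            # highest first
    return [(id_to_token[i], float(sims[i])) for i in top]
```

For a query vector pooled from tokens that dominate one region of the space, the returned tokens are the training-document words whose vectors point in the most similar direction.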