In this lecture on Word and Document Embeddings, we move beyond sparse TF-IDF vectors and learn how modern embeddings represent meaning in dense vector spaces. Using a streaming sample from the Yelp Polarity dataset in Colab, we first recap a leakage-safe TF-IDF baseline so we have a clean benchmark for evaluation. We then build intuition for cosine similarity and explain why the angle between vectors matters when comparing embeddings.

Next, we explore pre-trained word embeddings by checking similarity scores and nearest neighbors, and we train a small Word2Vec model on review text to illustrate domain effects, showing how semantic neighborhoods shift in marketing and service contexts. After that, we construct document embeddings by pooling word vectors, comparing simple mean pooling with TF-IDF-weighted pooling, and we run a mini semantic retrieval demo to find meaningfully similar reviews.

We then move to sentence-transformers as a ready-to-use solution for sentence and document embeddings: we encode thousands of reviews with a progress bar, run semantic search with a natural-language query, and use the embeddings as features for a Logistic Regression classifier, comparing AUC and F1 against the TF-IDF baseline. We also visualize embeddings with PCA and, optionally, UMAP, emphasizing that 2D plots are intuition tools rather than interpretable axes. We close with practical marketing applications, key limitations to keep in mind, and the outputs we save for the next module. Short illustrative code sketches for these steps appear below.

Instructor: Dr. Hyunhwan “Aiden” Lee, CSULB College of Business
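A minimal sketch of the setup step, assuming a streamed sample of Yelp Polarity and a leakage-safe TF-IDF baseline in which the vectorizer is fit on the training split only; the sample size, `max_features`, and other hyperparameters are illustrative choices, not the lecture's exact values.

```python
from itertools import islice

from datasets import load_dataset
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Stream a small sample instead of downloading the full Yelp Polarity dataset.
stream = load_dataset("yelp_polarity", split="train", streaming=True)
sample = list(islice(stream, 2000))
texts = [row["text"] for row in sample]
labels = [row["label"] for row in sample]

X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=42, stratify=labels
)

# Fit TF-IDF on training text only, then transform the test text —
# this separation is what keeps the baseline leakage-safe.
vectorizer = TfidfVectorizer(max_features=20_000, ngram_range=(1, 2))
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train_tfidf, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test_tfidf)[:, 1])
print("TF-IDF baseline AUC:", auc)
```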
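A toy illustration of the cosine-similarity intuition: the angle between vectors, not their length, drives the comparison, so rescaling a vector leaves the score unchanged. The two "direction" vectors are made-up numbers for demonstration.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_food = np.array([0.9, 0.1, 0.2])      # toy "food" direction
v_service = np.array([0.2, 0.9, 0.1])   # toy "service" direction

print(cosine_similarity(v_food, v_food * 10))   # 1.0 — same direction, different length
print(cosine_similarity(v_food, v_service))     # much lower — different directions
```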
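A sketch of training a small Word2Vec model on review text with gensim and inspecting a word's nearest neighbors. The tiny `texts` list, the hyperparameters, and the probe word "service" are illustrative assumptions rather than the lecture's exact settings; in practice the model would be trained on the streamed review sample.

```python
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

texts = [
    "the service was slow but the food was amazing",
    "terrible service, we waited an hour for cold food",
    "friendly staff and quick service, great value",
]
tokenized = [simple_preprocess(t) for t in texts]

w2v = Word2Vec(
    sentences=tokenized,
    vector_size=100,   # dimensionality of the dense word vectors
    window=5,          # context window size
    min_count=1,       # keep rare words in this tiny demo corpus
    workers=2,
    epochs=20,
)

# Nearest neighbors reveal the "semantic neighborhood" learned from this domain.
print(w2v.wv.most_similar("service", topn=5))
```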
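A sketch of pooling word vectors into document embeddings, contrasting plain mean pooling with TF-IDF-weighted pooling. It continues from the Word2Vec sketch above, reusing `w2v` and `tokenized`; the fallback weight for out-of-vocabulary terms is an illustrative choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

def mean_pool(tokens, w2v):
    """Average the word vectors of all in-vocabulary tokens."""
    vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

def tfidf_weighted_pool(tokens, w2v, idf):
    """Average word vectors, weighting each word by its IDF score."""
    vecs, weights = [], []
    for t in tokens:
        if t in w2v.wv:
            vecs.append(w2v.wv[t])
            weights.append(idf.get(t, 1.0))   # fall back to weight 1.0 for unseen terms
    if not vecs:
        return np.zeros(w2v.vector_size)
    return np.average(vecs, axis=0, weights=weights)

docs = [" ".join(toks) for toks in tokenized]
tfidf = TfidfVectorizer().fit(docs)
idf = dict(zip(tfidf.get_feature_names_out(), tfidf.idf_))

doc_vecs_mean = np.vstack([mean_pool(toks, w2v) for toks in tokenized])
doc_vecs_tfidf = np.vstack([tfidf_weighted_pool(toks, w2v, idf) for toks in tokenized])
```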
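A sketch of the sentence-transformers workflow: encode reviews with a progress bar, run a natural-language semantic search, and feed the embeddings to a Logistic Regression classifier. The model name `all-MiniLM-L6-v2`, the query text, and the toy `reviews`/`labels` lists are illustrative assumptions; the lecture uses thousands of reviews and a proper train/test split.

```python
import numpy as np
from sentence_transformers import SentenceTransformer, util
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

reviews = [
    "the pasta was incredible and the staff was welcoming",
    "waited forever, the food arrived cold",
    "great happy hour deals and friendly bartenders",
    "overpriced and bland, would not come back",
]
labels = [1, 0, 1, 0]   # 1 = positive, 0 = negative

model = SentenceTransformer("all-MiniLM-L6-v2")
emb = model.encode(reviews, show_progress_bar=True, convert_to_numpy=True)

# Semantic search: embed a natural-language query and rank reviews by cosine score.
query_emb = model.encode("slow service and cold food", convert_to_numpy=True)
scores = util.cos_sim(query_emb, emb)[0]
print(reviews[int(scores.argmax())])

# Embeddings as features for a simple classifier; this toy split only shows the
# mechanics — use far more data and held-out evaluation in practice.
X_tr, X_te, y_tr, y_te = train_test_split(
    emb, labels, test_size=0.5, random_state=42, stratify=labels
)
clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, proba), "F1:", f1_score(y_te, clf.predict(X_te)))
```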
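A sketch of projecting embeddings to 2D with PCA for visualization, continuing from the sentence-transformers sketch (reusing `emb` and `labels`). As emphasized above, the projected axes are not interpretable features; the plot is only an intuition aid. The optional UMAP lines assume the `umap-learn` package is installed.

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

coords = PCA(n_components=2).fit_transform(emb)

plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap="coolwarm")
plt.title("Review embeddings, PCA projection (intuition only)")
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()

# Optional: UMAP often separates clusters more clearly, if umap-learn is installed.
# import umap
# coords_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(emb)
```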