Ever wondered how massive AI models like GPT-4 and Mixtral work under the hood? In this video, we build a "Mixture of Experts" (MoE) model completely from scratch using PyTorch. We'll start with the basics of a character-level language model, explore the fundamentals of self-attention, and then layer in the sparse MoE components, all while training on a fun dataset of Simpsons scripts. This step-by-step tutorial is perfect for anyone in the AI field looking to gain a deep, intuitive understanding of how modern Transformers are built.

GitHub Repo: https://github.com/rajshah4/makeMoE_simpsons
Open in Colab: https://colab.research.google.com/github/rajshah4/makeMoE_simpsons/blob/main/makeMoE_from_Scratch.ipynb

0:00 - Intro: Let's Build a Mixture of Experts Model!
1:08 - Getting Started with the Code Notebook
2:40 - High-Level Overview of the MoE Architecture
3:54 - Data Loading: The Simpsons Scripts
4:32 - Tokenization: Turning Characters into Numbers
5:56 - Batching and Next-Token Prediction
9:19 - Core Concept: Self-Attention Explained
12:38 - From Attention to Mixture of Experts (MoE)
14:32 - The Router: Top-K Gating for Expert Selection
16:21 - Improving Training with Noisy Top-K Gating
17:29 - Assembling the Full Sparse MoE Block
19:10 - Building and Training the Final Language Model
21:21 - Training the Model and Tracking Experiments
22:37 - Analyzing the Results: From Gibberish to Simpsons Dialogue

━━━━━━━━━━━━━━━━━━━━━━━━━
★ Rajistics Social Media »
● Home Page: http://www.rajivshah.com
● LinkedIn: https://www.linkedin.com/in/rajistics/
● Reddit: https://www.reddit.com/r/rajistics/
━━━━━━━━━━━━━━━━━━━━━━━━━
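To give a flavor of the router covered in the 14:32 and 16:21 chapters, here is a minimal PyTorch sketch of noisy top-k gating. The class and argument names are my own assumptions for illustration; the video's linked notebook contains the actual implementation.

```python
# Minimal sketch of a noisy top-k gating router (illustrative; names and
# shapes are assumptions, not the notebook's exact code).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoisyTopKRouter(nn.Module):
    def __init__(self, n_embed: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(n_embed, num_experts)   # clean routing logits
        self.noise = nn.Linear(n_embed, num_experts)  # learned noise scale

    def forward(self, x):
        # x: (batch, seq_len, n_embed)
        logits = self.gate(x)
        # Add learned Gaussian noise so routing isn't always winner-take-all
        noisy_logits = logits + torch.randn_like(logits) * F.softplus(self.noise(x))
        # Keep only the top-k experts per token; mask the rest to -inf
        top_vals, top_idx = noisy_logits.topk(self.top_k, dim=-1)
        sparse_logits = torch.full_like(noisy_logits, float("-inf")).scatter(-1, top_idx, top_vals)
        # Softmax over the sparse logits gives per-token expert weights
        return F.softmax(sparse_logits, dim=-1), top_idx

# Example: route a batch of 4 sequences of length 8 with 16-dim embeddings
router = NoisyTopKRouter(n_embed=16, num_experts=4, top_k=2)
weights, indices = router(torch.randn(4, 8, 16))
print(weights.shape, indices.shape)  # (4, 8, 4) and (4, 8, 2)
```

The noise term encourages the router to spread tokens across experts during training instead of collapsing onto a few favorites, which is the motivation for the "Improving Training with Noisy Top-K Gating" chapter.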