In this video, I share my thesis presentation on novel methods for comparing audio and text for retrieval. I discuss the limitations of traditional search engines and propose a new approach to searching across multiple modalities without relying on text comparison. By converting both text and audio into vectors called embeddings and calculating their similarity, we aim to improve retrieval accuracy. I cover the baseline system, our proposed method of transforming embeddings into pseudo-images for a vision transformer, and our experimental results. Although the performance was underwhelming, this exploration offers valuable insights and points to future improvements. Special thanks to my team members and supervisors for their support.

00:00 Introduction and Personal Journey
00:31 Traditional Search Engines and Their Limitations
01:22 Thesis Overview and Objectives
02:18 Acknowledgements and Team Contributions
02:50 Understanding Audio-Text Retrieval
06:01 Baseline System Explained
10:45 Proposed Method: Vision Proxy Approach
12:18 Experiment and Results
13:52 Discussion and Future Improvements
15:22 Conclusion and Final Thoughts
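The core idea mentioned above, embedding both modalities as vectors and scoring them by similarity, can be sketched minimally as follows. This is an illustrative example only: the embeddings are made-up placeholder values, not outputs of the actual audio and text encoders from the thesis, and cosine similarity is assumed as the comparison function.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical low-dimensional embeddings for an audio clip and a text caption;
# real systems use encoder outputs with hundreds of dimensions.
audio_emb = np.array([0.2, 0.8, 0.1, 0.5])
text_emb = np.array([0.25, 0.7, 0.0, 0.6])

score = cosine_similarity(audio_emb, text_emb)
```

For retrieval, each audio clip in the collection would be scored against the query text this way, and results returned in order of decreasing similarity.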