
Frontiers of Industry Development Seminar, Talk 1: Operator Fusion for Memory-Bound Compute-Intensive Operators

Posted: 2024-11-22

Talk 1 Title: Operator Fusion for Memory-Bound Compute-Intensive Operators

Time: November 25, 2024, 14:30

Venue: Conference Room E406, School of Computer Science


Talk 2 Title: Scaling New Heights: Transformative Cross-GPU Sampling for Training Billion-Edge Graphs

Time: November 26, 2024, 10:30 AM

Venue: Conference Room E406, School of Computer Science


Talk 3 Title: Accelerating Distributed DLRM Training with Optimized TT Decomposition and Micro-Batching

Time: November 27, 2024, 10:30 AM

Venue: Conference Room B405, School of Computer Science


Speaker: Donglin Yang

Affiliation: NVIDIA

Speaker Bio: Donglin Yang is currently a Deep Learning Software Engineer at NVIDIA, where he focuses on advancing TensorFlow Core and XLA technologies. He received his Bachelor of Science (B.S.) degree in Electrical Engineering from Sun Yat-sen University and earned his Ph.D. from the Computer Science Department at the University of North Carolina at Charlotte in 2022. His research interests span high-performance computing, deep learning compilers, and graph neural networks, among other cutting-edge topics in machine learning and AI. He has made significant contributions to these areas, with his work recognized and published in prestigious conferences and journals, including PPoPP, SC, TPDS, TC, HPDC, and IPDPS. His research combines theoretical rigor with practical applications, pushing the boundaries of computational efficiency and scalability in modern AI systems.


Talk 1 Abstract: Operator fusion is a key technique for improving data locality and alleviating GPU memory bandwidth pressure, yet it often fails to extend to chains of multiple compute-intensive operators, whose computation throughput is already saturated. However, dynamic tensor dimension sizes can render these operators memory-bound, making fused kernels necessary; generating such kernels is hindered by limited search spaces for fusion strategies, redundant memory accesses, and prolonged tuning time, leading to sub-optimal performance and inefficient deployment. This talk will present our work MCFuser, a pioneering framework designed to overcome these obstacles by generating high-performance fused kernels for what we define as memory-bound compute-intensive (MBCI) operator chains. Leveraging high-level tiling expressions to delineate a comprehensive search space, coupled with Directed Acyclic Graph (DAG) analysis to eliminate redundant memory accesses, MCFuser streamlines kernel optimization.
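To make the MBCI notion concrete, below is a minimal Python sketch (not MCFuser code; the ridge point, data type, and shapes are all illustrative assumptions) that estimates the arithmetic intensity of an unfused two-GEMM chain and classifies it against a roofline ridge point. The unfused chain pays for writing and re-reading the intermediate tensor through global memory.

def gemm_flops(m, k, n):
    return 2 * m * k * n  # multiply-accumulate count for an (m,k) x (k,n) GEMM

def chain_intensity(m, k1, k2, n, bytes_per_elem=2):
    """Arithmetic intensity (FLOPs/byte) of D = (A @ B) @ C when unfused:
    the (m, k2) intermediate is written to and re-read from global memory."""
    flops = gemm_flops(m, k1, k2) + gemm_flops(m, k2, n)
    elems = m*k1 + k1*k2 + k2*n + m*n + 2*m*k2  # inputs, output, intermediate x2
    return flops / (elems * bytes_per_elem)

RIDGE = 150  # illustrative fp16 ridge point: roughly peak FLOP/s over HBM bandwidth

for m in (16, 4096):
    ai = chain_intensity(m, k1=4096, k2=4096, n=4096)
    kind = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"m={m:5d}: ~{ai:.0f} FLOPs/byte -> {kind}")

With these toy numbers the same GEMM chain sits at roughly 16 FLOPs/byte when m=16 (memory-bound, so a fused kernel that keeps the intermediate in on-chip memory helps) but far above the ridge when m=4096, illustrating how dynamic dimension sizes move an operator chain in and out of the MBCI regime.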


Talk 2 Abstract: Efficient training of Graph Neural Networks (GNNs) on billion-edge graphs poses significant challenges due to memory constraints and data-transfer bottlenecks, which particularly affect GPU-based sampling. Traditional methods either face severe CPU-GPU data transfer bottlenecks or encounter excessive data shuffling and synchronization overheads in multi-GPU setups. To overcome these challenges in GNN training on large-scale graphs, we introduce HyDRA, a pioneering framework that elevates mini-batch, sampling-based training. HyDRA innovates in multi-GPU memory sharing and multi-node feature retrieval, transforming cross-GPU sampling by seamlessly integrating sampling and data transfer into a single kernel operation. It develops a hybrid pointer-driven data placement technique to enhance neighbor retrieval efficiency, designs a targeted replication strategy for high-degree vertices to reduce communication overhead, and leverages dynamic cross-batch data orchestration with pipelining to minimize redundant data transfers. Evaluated on systems equipped with up to 64 A100 GPUs, HyDRA significantly outperforms current leading methods, achieving 1.4x to 5.3x faster training speeds than DSP and DGL-UVA and demonstrating up to a 42x improvement in multi-GPU scalability. HyDRA sets a new benchmark for high-performance GNN training at large scales.
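As a rough illustration of the placement ideas named in the abstract (not HyDRA's implementation; NUM_GPUS, the degree threshold, and all function names are hypothetical), the Python sketch below replicates high-degree vertices on every GPU, hash-partitions the remaining adjacency lists, and keeps a per-vertex pointer that tells a sampler whether a neighbor list is local or would require a cross-GPU read.

NUM_GPUS = 4
HIGH_DEGREE_THRESHOLD = 1000  # illustrative cutoff for replicating hub vertices

def build_placement(adj):
    """adj: dict mapping vertex id -> list of neighbor ids."""
    replicated = {}                               # present in every GPU's memory
    partitioned = [dict() for _ in range(NUM_GPUS)]
    pointer = {}                                  # vertex -> ("replica", None) or ("partition", gpu)
    for v, nbrs in adj.items():
        if len(nbrs) >= HIGH_DEGREE_THRESHOLD:
            replicated[v] = nbrs                  # replicate hubs to cut communication
            pointer[v] = ("replica", None)
        else:
            gpu = hash(v) % NUM_GPUS              # hash-partition the long tail
            partitioned[gpu][v] = nbrs
            pointer[v] = ("partition", gpu)
    return replicated, partitioned, pointer

def neighbors(v, my_gpu, replicated, partitioned, pointer):
    """Resolve a neighbor list via the placement pointer."""
    kind, gpu = pointer[v]
    if kind == "replica":
        return replicated[v]      # always a local hit, on any GPU
    return partitioned[gpu][v]    # local if gpu == my_gpu, else stands in for a peer (NVLink) read

The point of the sketch is the trade-off: replicating the few high-degree vertices costs memory on every GPU but removes the most frequent remote lookups, while hashing spreads the remaining vertices evenly.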


Talk 3 Abstract: Deep Learning Recommendation Models (DLRMs) are pivotal in various sectors, yet they are hindered by the high memory demands of embedding tables and the significant communication overhead of distributed training environments. Traditional approaches, like Tensor-Train (TT) decomposition, although effective for compressing these tables, introduce substantial computational burdens. Furthermore, existing frameworks for distributed training are inadequate due to their excessive data exchange requirements. This talk presents EcoRec, an advanced library designed to expedite the training of DLRMs through a synergistic integration of TT decomposition and distributed training. EcoRec introduces a novel computation pattern that eliminates redundancy in TT operations, alongside an efficient multiplication pathway, significantly reducing computation time. Additionally, it provides a unique micro-batching technique with sorted indices to decrease memory demands without additional computational costs. EcoRec also features a novel pipeline training system for embedding layers, ensuring balanced data distribution and enhanced communication efficiency.
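To illustrate why TT decomposition compresses embedding tables so aggressively, here is a minimal NumPy sketch (not EcoRec's code; the shapes and ranks are made-up illustrations) that stores a 10M x 128 embedding table as three small TT cores and materializes a single row on demand.

import numpy as np

# A V x D table with V = v1*v2*v3 and D = d1*d2*d3 is stored as three TT cores
# G[k] of shape (r_{k-1}, v_k, d_k, r_k), with boundary ranks r_0 = r_3 = 1.
v_dims, d_dims, ranks = (200, 200, 250), (4, 4, 8), (1, 16, 16, 1)
cores = [np.random.randn(ranks[k], v_dims[k], d_dims[k], ranks[k + 1])
         for k in range(3)]  # ~2.5e5 parameters vs ~1.28e9 for the dense table

def tt_embedding_row(idx):
    """Materialize row `idx` of the implicit (10M x 128) embedding table."""
    # Mixed-radix split of the flat row index across the factored vocab dims.
    i1, rem = divmod(idx, v_dims[1] * v_dims[2])
    i2, i3 = divmod(rem, v_dims[2])
    out = np.ones((1, 1))                        # running (d-prefix, r_k), starting at r_0 = 1
    for core, i in zip(cores, (i1, i2, i3)):
        a = core[:, i, :, :]                     # slice: (r_{k-1}, d_k, r_k)
        r_prev, d_k, r_k = a.shape
        out = out @ a.reshape(r_prev, d_k * r_k)  # contract over r_{k-1}
        out = out.reshape(-1, r_k)                # fold d_k into the output prefix
    return out.ravel()                            # length D = d1*d2*d3 = 128

row = tt_embedding_row(1_234_567)  # the 10M-row table is never materialized
assert row.shape == (np.prod(d_dims),)

The three cores hold roughly 250K parameters in place of 1.28B dense entries, but every lookup now costs a short chain of small matrix multiplies; that per-lookup compute is the overhead that EcoRec's redundancy-eliminating computation pattern and index-sorted micro-batching target.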


Host: Dazhao Cheng