Performance Challenges in Distributed Deep Neural Network Systems

Published: 2018-10-15

Speaker: Cheng-Fu Chou

Time and Date: 15:00, October 15, 2018

Place: Room 308 of Genetics Building, Handan Campus, Fudan University

 

Abstract:

The success of Deep Neural Networks (DNNs) has created significant interest in software tools, hardware architectures, and cloud systems that can meet the huge computational demand of their training jobs. To successfully train neural networks with many hidden layers, efficient algorithms, large datasets, and large-scale data centers all play a fundamental role. When scaling up a single node is not enough, data and computation must be distributed among multiple nodes, e.g., multiple CPUs, GPUs, or even FPGAs with dedicated hardware architectures. In this talk, we first review practical solutions, and their performance issues, for distributing stochastic gradient descent computations at the scale of data centers. We then present initial results from our recent work, which addresses two problems that arise when applying this strategy to large-scale clusters running multiple heterogeneous jobs. First, we propose and validate a queuing model that estimates the throughput of a training job as a function of the number of nodes assigned to it. Second, these throughput estimates are used to explore several classes of scheduling heuristics aimed at reducing response time in a scenario where heterogeneous jobs are continuously submitted to a large-scale cluster. The heuristics are evaluated through extensive simulations of realistic DNN workloads, which also investigate the effects of early termination, a common scenario for DNN training jobs.
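
The abstract does not give the model itself, so the following is only a minimal Python sketch of the general idea: estimate a job's throughput as a function of its node count (here a simple saturating formula with a per-node synchronization cost, assumed purely for illustration, not the queuing model presented in the talk) and feed those estimates to a greedy node-allocation heuristic. All function names, parameters, and the two example jobs are hypothetical.

    # Toy sketch (not the speaker's model): throughput of a data-parallel
    # training job as a function of the number of worker nodes, and a greedy
    # heuristic that splits a fixed node budget across heterogeneous jobs.
    # All names, parameters, and formulas here are illustrative assumptions.

    def estimate_throughput(n_nodes, t_compute=1.0, t_sync_per_node=0.05):
        """Mini-batches/second for one job: each iteration costs t_compute
        seconds of computation plus a synchronization cost that grows with
        the number of participating nodes (communication overhead)."""
        iter_time = t_compute + t_sync_per_node * n_nodes
        return n_nodes / iter_time  # throughput saturates as sync cost grows

    def greedy_allocate(jobs, total_nodes):
        """Assign nodes one at a time to the job with the largest marginal
        throughput gain -- a simple heuristic, not the talk's scheduler."""
        alloc = {name: 0 for name in jobs}
        for _ in range(total_nodes):
            best = max(
                jobs,
                key=lambda name: estimate_throughput(alloc[name] + 1, **jobs[name])
                - estimate_throughput(alloc[name], **jobs[name]),
            )
            alloc[best] += 1
        return alloc

    if __name__ == "__main__":
        # Two hypothetical heterogeneous jobs: a heavy model and a light one.
        jobs = {
            "heavy_cnn_job": {"t_compute": 2.0, "t_sync_per_node": 0.10},
            "small_mlp_job": {"t_compute": 0.5, "t_sync_per_node": 0.02},
        }
        print(greedy_allocate(jobs, total_nodes=16))

The point of the sketch is only the structure described in the abstract: a throughput estimate parameterized by node count drives the scheduling decision for heterogeneous jobs sharing one cluster.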

 

Biography:

Cheng-Fu Chou received the M.S. and Ph.D. degrees from the University of Maryland, College Park, MD, USA, in 1999 and 2002, respectively. After graduation, he joined the Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, where he is currently a Professor. He was a Visiting Scholar at the Department of Computer Science, University of Southern California, Los Angeles, CA, USA, in 2002 and again in 2017-2018. His research interests include distributed machine learning systems, software-defined networking, wireless networks, and their performance evaluation.