Yanghua Peng
ByteDance Inc.
Verified email at cs.hku.hk
Title
Cited by
Year
Optimus: an efficient dynamic resource scheduler for deep learning clusters
Y Peng, Y Bao, Y Chen, C Wu, C Guo
Proceedings of the Thirteenth EuroSys Conference, 1-14, 2018
Cited by: 521, Year: 2018
A generic communication scheduler for distributed DNN training acceleration
Y Peng, Y Zhu, Y Chen, Y Bao, B Yi, C Lan, C Wu, C Guo
Proceedings of the 27th ACM Symposium on Operating Systems Principles, 16-29, 2019
Cited by: 375, Year: 2019
Deep learning-based job placement in distributed machine learning clusters
Y Bao, Y Peng, C Wu
IEEE INFOCOM 2019-IEEE conference on computer communications, 505-513, 2019
Cited by: 158, Year: 2019
Online job scheduling in distributed machine learning clusters
Y Bao, Y Peng, C Wu, Z Li
IEEE INFOCOM 2018-IEEE Conference on Computer Communications, 495-503, 2018
Cited by: 138, Year: 2018
DL2: A deep learning-driven scheduler for deep learning clusters
Y Peng, Y Bao, Y Chen, C Wu, C Meng, W Lin
IEEE Transactions on Parallel and Distributed Systems 32 (8), 1947-1960, 2021
Cited by: 92, Year: 2021
BGL: GPU-Efficient GNN training by optimizing graph data I/O and preprocessing
T Liu, Y Chen, D Li, C Wu, Y Zhu, J He, Y Peng, H Chen, H Chen, C Guo
20th USENIX Symposium on Networked Systems Design and Implementation (NSDI …, 2023
Cited by: 73, Year: 2023
MegaScale: Scaling large language model training to more than 10,000 GPUs
Z Jiang, H Lin, Y Zhong, Q Huang, Y Chen, Z Zhang, Y Peng, X Li, C Xie, ...
21st USENIX Symposium on Networked Systems Design and Implementation (NSDI …, 2024
Cited by: 69, Year: 2024
Preemptive all-reduce scheduling for expediting distributed DNN training
Y Bao, Y Peng, Y Chen, C Wu
IEEE INFOCOM 2020-IEEE Conference on Computer Communications, 626-635, 2020
Cited by: 69, Year: 2020
Multi-resource interleaving for deep learning training
Y Zhao, Y Liu, Y Peng, Y Zhu, X Liu, X Jin
Proceedings of the ACM SIGCOMM 2022 Conference, 428-440, 2022
Cited by: 59, Year: 2022
deTector: a Topology-aware Monitoring System for Data Center Networks
Y Peng, J Yang, C Wu, C Guo, C Hu, Z Li
2017 USENIX Annual Technical Conference (USENIX ATC 17), 55-68, 2017
Cited by: 44, Year: 2017
Elastic parameter server load distribution in deep learning clusters
Y Chen, Y Peng, Y Bao, C Wu, Y Zhu, C Guo
Proceedings of the 11th ACM Symposium on Cloud Computing, 507-521, 2020
Cited by: 42, Year: 2020
Dynamic scaling of virtualized, distributed service chains: A case study of IMS
J Duan, C Wu, F Le, AX Liu, Y Peng
IEEE Journal on Selected Areas in Communications 35 (11), 2501-2511, 2017
Cited by: 38, Year: 2017
Deep learning-based job placement in distributed machine learning clusters with heterogeneous workloads
Y Bao, Y Peng, C Wu
IEEE/ACM Transactions on Networking 31 (2), 634-647, 2022
Cited by: 17, Year: 2022
SP-GNN: Learning structure and position information from graphs
Y Chen, J You, J He, Y Lin, Y Peng, C Wu, Y Zhu
Neural Networks 161, 505-514, 2023
Cited by: 13, Year: 2023
dPRO: A generic performance diagnosis and optimization toolkit for expediting distributed DNN training
H Hu, C Jiang, Y Zhong, Y Peng, C Wu, Y Zhu, H Lin, C Guo
Proceedings of Machine Learning and Systems 4, 623-637, 2022
Cited by: 12, Year: 2022
SAPipe: Staleness-aware pipeline for data parallel DNN training
Y Chen, C Xie, M Ma, J Gu, Y Peng, H Lin, C Wu, Y Zhu
Advances in Neural Information Processing Systems 35, 17981-17993, 2022
Cited by: 10, Year: 2022
LLM-PQ: Serving LLM on Heterogeneous Clusters with Phase-Aware Partition and Adaptive Quantization
J Zhao, B Wan, Y Peng, H Lin, C Wu
arXiv preprint arXiv:2403.01136, 2024
Cited by: 8, Year: 2024
ByteCheckpoint: A Unified Checkpointing System for LLM Development
B Wan, M Han, Y Sheng, Z Lai, M Zhang, J Zhang, Y Peng, H Lin, X Liu, ...
arXiv preprint arXiv:2407.20143, 2024
Cited by: 3, Year: 2024
HybridFlow: A flexible and efficient RLHF framework
G Sheng, C Zhang, Z Ye, X Wu, W Zhang, R Zhang, Y Peng, H Lin, C Wu
arXiv preprint arXiv:2409.19256, 2024
Cited by: 1, Year: 2024
Optimus: Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation
W Feng, Y Chen, S Wang, Y Peng, H Lin, M Yu
arXiv preprint arXiv:2408.03505, 2024
Cited by: 1, Year: 2024