Deep image: Scaling up image recognition R Wu, S Yan, Y Shan, Q Dang, G Sun arXiv preprint arXiv:1501.02876, 2015 | 534 | 2015 |
Evaluating fast algorithms for convolutional neural networks on FPGAs L Lu, Y Liang, Q Xiao, S Yan 2017 IEEE 25th annual international symposium on field-programmable custom …, 2017 | 285 | 2017 |
Exploring heterogeneous algorithms for accelerating deep convolutional neural networks on FPGAs Q Xiao, Y Liang, L Lu, S Yan, YW Tai Proceedings of the 54th Annual Design Automation Conference 2017, 1-6, 2017 | 226 | 2017 |
yaSpMV: Yet another SpMV framework on GPUs S Yan, C Li, Y Zhang, H Zhou ACM SIGPLAN Notices 49 (8), 107-118, 2014 | 185 | 2014 |
Evaluating fast algorithms for convolutional neural networks on FPGAs Y Liang, L Lu, Q Xiao, S Yan IEEE Transactions on Computer-Aided Design of Integrated Circuits and …, 2019 | 150 | 2019 |
StreamScan: fast scan algorithms for GPUs without global barrier synchronization S Yan, G Long, Y Zhang Proceedings of the 18th ACM SIGPLAN symposium on Principles and practice of …, 2013 | 123 | 2013 |
Characterization and prediction of deep learning workloads in large-scale GPU datacenters Q Hu, P Sun, S Yan, Y Wen, T Zhang Proceedings of the International Conference for High Performance Computing …, 2021 | 116 | 2021 |
Optimizing network performance for distributed DNN training on GPU clusters: ImageNet/AlexNet training in 1.5 minutes P Sun, W Feng, R Han, S Yan, Y Wen arXiv preprint arXiv:1902.06855, 2019 | 79 | 2019 |
A coordinated tiling and batching framework for efficient GEMM on GPUs X Li, Y Liang, S Yan, L Jia, Y Li Proceedings of the 24th symposium on principles and practice of parallel …, 2019 | 63 | 2019 |
AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction S Zheng, R Chen, A Wei, Y Jin, Q Han, L Lu, B Wu, X Li, S Yan, Y Liang Proceedings of the 49th Annual International Symposium on Computer …, 2022 | 50 | 2022 |
Towards distributed machine learning in shared clusters: A dynamically-partitioned approach P Sun, Y Wen, NBD Ta, S Yan 2017 IEEE International Conference on Smart Computing (SMARTCOMP), 1-6, 2017 | 50 | 2017 |
GPURoofline: a model for guiding performance optimizations on GPUs H Jia, Y Zhang, G Long, J Xu, S Yan, Y Li Euro-Par 2012 Parallel Processing: 18th International Conference, Euro-Par …, 2012 | 44 | 2012 |
Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs C Li, Y Yang, H Dai, S Yan, F Mueller, H Zhou 2014 IEEE International Symposium on Performance Analysis of Systems and …, 2014 | 42 | 2014 |
Diesel: A dataset-based distributed storage and caching system for large-scale deep learning training L Wang, S Ye, B Yang, Y Lu, H Zhang, S Yan, Q Luo Proceedings of the 49th International Conference on Parallel Processing, 1-11, 2020 | 35 | 2020 |
GradientFlow: Optimizing network performance for large-scale distributed DNN training P Sun, Y Wen, R Han, W Feng, S Yan IEEE Transactions on Big Data 8 (2), 495-507, 2019 | 32 | 2019 |
A survey on efficient inference for large language models Z Zhou, X Ning, K Hong, T Fu, J Xu, S Li, Y Lou, L Wang, Z Yuan, X Li, ... arXiv preprint arXiv:2404.14294, 2024 | 26 | 2024 |
Parallelization and performance optimization on face detection algorithm with OpenCL: A case study W Wang, Y Zhang, S Yan, Y Zhang, H Jia Tsinghua Science and Technology 17 (3), 287-295, 2012 | 24 | 2012 |
Enabling efficient fast convolution algorithms on GPUs via MegaKernels L Jia, Y Liang, X Li, L Lu, S Yan IEEE Transactions on Computers 69 (7), 986-997, 2020 | 23 | 2020 |
Timed dataflow: Reducing communication overhead for distributed machine learning systems P Sun, Y Wen, TNB Duong, S Yan 2016 IEEE 22nd International Conference on Parallel and Distributed Systems …, 2016 | 20 | 2016 |
Chimera: An analytical optimizing framework for effective compute-intensive operators fusion S Zheng, S Chen, P Song, R Chen, X Li, S Yan, D Lin, J Leng, Y Liang 2023 IEEE International Symposium on High-Performance Computer Architecture …, 2023 | 18 | 2023 |