A study of checkpointing in large scale training of deep neural networks
(arXiv.org, 2021-03-29)
Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While significant effort has gone into facilitating distributed ...