    • A study of checkpointing in large scale training of deep neural networks 

      Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M (arXiv.org, 2021-03-29)
      Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While significant effort has been put to facilitate distributed ...
    • Exploring the effects of silent data corruption in distributed deep learning training 

      Rojas, Elvis; Pérez, Diego; Meneses, Esteban (Institute of Electrical and Electronics Engineers (IEEE), 2022-11-02)
      The profound impact of recent developments in artificial intelligence is unquestionable. The applications of deep learning models are everywhere, from advanced natural language processing to highly accurate prediction of ...
    • Understanding soft error sensitivity of deep learning models and frameworks through checkpoint alteration 

      Rojas, Elvis; Pérez, Diego; Calhoun, Jon; Bautista-Gomez, Leonardo; Jones, Terry; Meneses, Esteban (Institute of Electrical and Electronics Engineers (IEEE), 2021-10-13)
      The convergence of artificial intelligence, high-performance computing (HPC), and data science brings unique opportunities for markedly advancing discoveries that leverage synergies across scientific domains. Recently, deep ...