A study of checkpointing in large scale training of deep neural networks

Rojas, Elvis; Kahira, Albert Njoroge; Meneses, Esteban; Bautista-Gomez, Leonardo; Badia, Rosa M

dc.contributor.author	Rojas, Elvis
dc.contributor.author	Kahira, Albert Njoroge
dc.contributor.author	Meneses, Esteban
dc.contributor.author	Bautista-Gomez, Leonardo
dc.contributor.author	Badia, Rosa M
dc.date.accessioned	2023-10-30T19:19:14Z
dc.date.available	2023-10-30T19:19:14Z
dc.date.issued	2021-03-29
dc.identifier.uri	http://hdl.handle.net/11056/26772
dc.description.abstract	Deep learning (DL) applications are increasingly being deployed on HPC systems to leverage the massive parallelism and computing power of those systems. While DL frameworks have put significant effort into facilitating distributed training, fault tolerance has been largely ignored. Checkpoint-restart is a common fault tolerance technique in HPC workloads. In this work, we examine the checkpointing implementation of popular DL platforms. We perform experiments with three state-of-the-art DL frameworks common in HPC (Chainer, PyTorch, and TensorFlow). We evaluate the computational cost of checkpointing, file formats and file sizes, the impact of scale, and deterministic checkpointing. Our evaluation shows some critical differences in checkpoint mechanisms and exposes several bottlenecks in existing checkpointing implementations. We provide discussion points that can aid users in selecting a fault-tolerant framework to use in HPC. We also provide take-away points that framework developers can use to facilitate better checkpointing of DL workloads in HPC.	es_ES
dc.description.sponsorship	Universidad Nacional, Costa Rica	es_ES
dc.language.iso	eng	es_ES
dc.publisher	arXiv.Org	es_ES
dc.rights	Open access	es_ES
dc.rights	Attribution-NonCommercial-NoDerivatives 4.0 International	*
dc.rights.uri	http://creativecommons.org/licenses/by-nc-nd/4.0/	*
dc.subject	APRENDIZAJE PROFUNDO	es_ES
dc.subject	RESILIENCIA	es_ES
dc.subject	REDES NEURONALES	es_ES
dc.subject	COMPUTACIÓN DE ALTO RENDIMIENTO	es_ES
dc.subject	DEEP LEARNING	es_ES
dc.subject	RESILIENCE	es_ES
dc.subject	NEURAL NETWORKS	es_ES
dc.subject	HIGH PERFORMANCE COMPUTING	es_ES
dc.title	A study of checkpointing in large scale training of deep neural networks	es_ES
dc.type	http://purl.org/coar/resource_type/c_816b	es_ES
dc.description.procedence	Sede Regional Brunca, Campus Pérez Zeledón	es_ES
dc.identifier.doi	https://doi.org/10.48550/arXiv.2012.00825

Files in this item

Name:: A Study of Checkpointing.pdf
Size:: 493.0Kb
Format:: PDF

Name:: license_rdf
Size:: 805 bytes
Format:: application/rdf+xml

This item appears in the following Collection(s)

Preprints [2]

Except where otherwise noted, this item's license is described as Open access