Introduction

Training data is the primary and most crucial ingredient that lets machines learn and make predictions. Continual learning is the capability of a model to learn continuously from incoming data; in other words, it enables a Machine Learning (ML) model to autonomously learn and adapt in production as new data keeps arriving.

Generally, training a deep learning model begins with a forward pass, in which the loss is computed, followed by a backward pass, in which the gradients of that loss are computed; these gradients are then pushed to servers and applied as updates. The servers aggregate the updates from all the users and modify the global ML model accordingly. This process repeats many times until the model reaches a target accuracy. Advanced models are huge and demand heavy compute, and as models grow bigger, training remains an expensive affair.
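The loop below is a minimal, purely illustrative sketch of one such round, written in PyTorch with a hypothetical model, loss, and random data: each worker runs a forward and backward pass on its own batch, and a server-style step averages the resulting gradients before updating the shared model.

    import torch
    import torch.nn as nn

    # Hypothetical stand-ins for a real network, loss, and optimizer.
    model = nn.Linear(128, 2)
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    def local_gradients(inputs, targets):
        """One worker: forward pass to compute the loss, backward pass for gradients."""
        optimizer.zero_grad()
        loss_fn(model(inputs), targets).backward()
        return [p.grad.clone() for p in model.parameters()]

    def aggregate_and_step(per_worker_grads):
        """Server-style step: average the workers' gradients, then update the global model."""
        for p, grads in zip(model.parameters(), zip(*per_worker_grads)):
            p.grad = torch.stack(grads).mean(dim=0)
        optimizer.step()

    # One round with two simulated workers on random data.
    batches = [(torch.randn(32, 128), torch.randint(0, 2, (32,))) for _ in range(2)]
    aggregate_and_step([local_gradients(x, y) for x, y in batches])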

Distributed Learning

Distributed training was developed so that research would not be restricted to well-funded labs. Volunteer computing (VC) is popular in domains such as physics and bioinformatics, where people donate the idle time of their smartphones, desktops, and other personal devices to solve computationally complex problems. However, distributed training still has some problems:

  • Distributed training of a single model requires more communication and does not offer a natural way to “resume” failed jobs.
  • Distributed training of neural networks is restricted by the throughput of the parameter servers and by the memory available on the weakest GPU.

DeDLOC

Now, a team of researchers from Yandex, Hugging Face, and others has developed a new method that lets ML models train over the internet more effectively. The new training algorithm is called Distributed Deep Learning in Open Collaborations (DeDLOC).

Data parallelism across GPUs is a common technique, and DeDLOC attempts to combine the best attributes of existing parallelism strategies while reworking the popular distributed training techniques. DeDLOC uses synchronous data-parallel training with fixed hyperparameters, regardless of the number of volunteers. The model is trained with very large batches to compensate for slow communication. According to the researchers, each device accumulates gradients at its own pace until the collaboration reaches the target batch size; once it does, the collaborators exchange their gradients and perform one optimizer step.
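The following is a minimal sketch of that accumulate-then-exchange idea, not the actual DeDLOC implementation: the model, batch sizes, and simulated peers are hypothetical, and the gradient exchange is reduced to an in-process averaging step. Each peer accumulates gradients over its own micro-batches until the collaboration reaches the fixed target batch size, after which the gradients are averaged and a single optimizer step is applied.

    import torch
    import torch.nn as nn

    TARGET_BATCH_SIZE = 4096          # fixed hyperparameter, independent of how many peers join
    model = nn.Linear(128, 2)         # hypothetical stand-in for a real network
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    def accumulate(micro_batches):
        """One peer: accumulate a *sum* of per-sample gradients over its micro-batches."""
        samples = 0
        for inputs, targets in micro_batches:
            # Scale the mean loss by the micro-batch size so p.grad holds per-sample sums.
            (loss_fn(model(inputs), targets) * len(inputs)).backward()
            samples += len(inputs)
        grads = [p.grad.clone() for p in model.parameters()]
        for p in model.parameters():
            p.grad = None             # reset before simulating the next peer
        return grads, samples

    def random_batches(n, bs=256):
        return [(torch.randn(bs, 128), torch.randint(0, 2, (bs,))) for _ in range(n)]

    # A fast peer and a slow peer contribute different numbers of micro-batches.
    peer_grads, total = [], 0
    for micro_batches in (random_batches(6), random_batches(10)):
        grads, samples = accumulate(micro_batches)
        peer_grads.append(grads)
        total += samples

    if total >= TARGET_BATCH_SIZE:    # the collaboration reached the target batch size
        for p, grads in zip(model.parameters(), zip(*peer_grads)):
            p.grad = torch.stack(grads).sum(dim=0) / total   # average gradient over all samples
        optimizer.step()              # one synchronous optimizer step for everyone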

DeDLOC makes it possible for anyone in the ML community to run large-scale distributed pre-training with their friends. Some specific benefits of DeDLOC are:

  • DeDLOC adapts itself to the various network and hardware set-ups of the participants so that data is transferred efficiently.
  • DeDLOC has been tested successfully: Yandex’s team, along with Hugging Face, a professor from the University of Toronto, and others, used the method to pretrain sahajBERT, a model for the Bengali language, with 40 volunteers. On downstream tasks, the model reaches a quality comparable to that of much larger models trained on hundreds of high-tier accelerators.
  • DeDLOC may also prove essential for multilingual NLP: the community for any language can now train its own models without requiring huge computational resources concentrated in one place.

When applied to pretraining ML models, DeDLOC achieves results comparable to those of much larger models that use hundreds of high-tier accelerators.

Conclusion

DeDLOC is the first distributed deep learning training effort of this kind at scale, and the results are encouraging for individual researchers looking to take on ambitious ML training tasks. “The community for any language can train their own models without the need for significant computational resources concentrated in one place,” stated the Hugging Face team.