Version 0.17.0 of Horovod, the distributed deep learning framework, has been released. With the new release, Horovod extends and improves support for machine learning platforms and libraries. The release also contains a new run tool, performance improvements, and minor bug fixes.
Running Horovod training directly with Open MPI gives a lot of flexibility and allows fine-grained control over options and settings. That flexibility, however, comes at the cost of supplying a significant number of parameters and values, even for simple jobs, and a missing or incorrect parameter can prevent Horovod from running at all.
With this release, the command-line utility horovodrun is introduced. The utility is an Open MPI-based wrapper for running Horovod scripts that hides the complexity of invoking Open MPI directly. It automatically detects and sets the required parameters, and can display the underlying MPI command if desired.
Let’s say we have a Horovod script train.py and want to run it on one machine using four GPUs. The horovodrun command would be:
|horovodrun -np 4 -H localhost:4 python train.py|
The -np flag specifies the number of processes, and the -H flag specifies the hosts. If more machines are used, the hosts can be listed separated by commas, for example |-H server1:4,server2:4| together with |-np 8|.
The equivalent Open MPI command would be:
|mpirun -np 4 \
    -H localhost:4 \
    -bind-to none -map-by slot \
    -x NCCL_DEBUG=INFO -x LD_LIBRARY_PATH -x PATH \
    -mca pml ob1 -mca btl ^openib \
    python train.py|
Apache MXNet is a high-performance deep learning framework for building, training, and deploying deep neural networks, with support for distributed training.
Apache MXNet 1.4.1 and 1.5.0 are the releases officially supporting Horovod. Previously, the MXNet 1.4.0 release supported Horovod only on certain operating systems, and users otherwise had to run the master branch version of MXNet to get Horovod support. In addition, a DistributedTrainer object has been introduced to better support the Gluon API and to enable Automatic Mixed Precision (AMP) in MXNet.
MPI-less Horovod alpha
MPI is used extensively in the supercomputing community for high-performance parallel computing, but can be difficult to install and configure for first-time users. This release introduces support for Facebook’s Gloo as an alternative to running Horovod with MPI. Gloo comes included with Horovod, and allows users to run Horovod without requiring MPI to be installed.
For environments that support both MPI and Gloo, users can choose their preferred library at runtime with a single flag to horovodrun:
|$ horovodrun --gloo -np 2 python train.py|
Gloo support is still early in its development, and more features are coming soon, most notably fault tolerance. Stay tuned!
TensorFlow 2.0 support
TensorFlow 2.0 introduces some significant changes, not only when it comes to new features, but also when it comes to the API. The changes include removing redundant APIs and making the API more consistent, with a focus on improving the integration experience. Horovod supports the new TensorFlow 2.0 features and APIs in the latest release.
Intel Machine Learning Scaling Library (MLSL) offers a set of communication features that can benefit distributed training performance, such as asynchronous progress for compute/communication overlap, message prioritization, support for data/model/hybrid parallelism, and the use of multiple background processes for communication.
Horovod supports different communication backends, such as MPI, NCCL, and DDL, and with the latest release, Horovod also supports Intel MLSL. Using MLSL as the communication backend improves both the scalability of communication-bound workloads and the compute/communication ratio.
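The compute/communication overlap that such backends exploit can be pictured with a small framework-independent sketch: as soon as one layer's gradient is ready, its (simulated) allreduce starts in a background thread while computation for the next layer proceeds. This is a conceptual illustration only, not Horovod's or MLSL's actual implementation; all names here are made up for the example.

```python
import threading
import time

def train_step_overlapped(layer_grads, communicate, compute_next):
    """Toy illustration of compute/communication overlap: launch each
    layer's gradient exchange in the background, then keep computing
    while it runs, and only join all exchanges at the end of the step."""
    threads = []
    results = {}
    for name, grad in layer_grads:
        # Start the (simulated) allreduce for this layer asynchronously...
        t = threading.Thread(
            target=lambda n=name, g=grad: results.update({n: communicate(g)})
        )
        t.start()
        threads.append(t)
        # ...and overlap it with the next chunk of computation.
        compute_next()
    for t in threads:
        t.join()
    return results

# Simulated communication: average the gradient across 4 workers.
comm = lambda g: [x / 4.0 for x in g]
out = train_step_overlapped(
    [("fc2", [4.0, 8.0]), ("fc1", [2.0])],
    communicate=comm,
    compute_next=lambda: time.sleep(0.001),
)
# out == {"fc2": [1.0, 2.0], "fc1": [0.5]}
```

In a real backend the "communication" is a network collective and the "computation" is the backward pass, but the scheduling idea is the same.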
Horovod version 0.16.3 contains performance improvements for existing features with the most noteworthy being updates for PyTorch and large clusters.
PyTorch is an open source, Python-based framework built for easy and efficient deep learning. PyTorch can utilize GPUs to accelerate tensor computation, and provides great flexibility and speed.
In the new release of Horovod, performance has been improved for gradient clipping, which is a method used for preventing instabilities caused by gradients with excessively large values.
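Gradient clipping rescales gradients whose combined norm exceeds a threshold, so one unusually large gradient cannot destabilize training. As a framework-independent sketch of the common clip-by-global-norm variant (not Horovod's internal code), using plain Python lists in place of tensors:

```python
import math

def clip_by_global_norm(grads, max_norm):
    """Scale all gradients down uniformly if their combined L2 norm
    exceeds max_norm; otherwise return them unchanged."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    if global_norm <= max_norm:
        return grads
    scale = max_norm / global_norm
    return [[g * scale for g in grad] for grad in grads]

grads = [[3.0, 4.0], [0.0, 12.0]]          # global norm = 13.0
clipped = clip_by_global_norm(grads, 6.5)  # norm exceeds 6.5, so scale by 0.5
print(clipped)  # [[1.5, 2.0], [0.0, 6.0]]
```

Note that all gradients are scaled by the same factor, which preserves the direction of the overall update.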
Large cluster performance
Performance for ultra-large clusters is improved in Horovod 0.16.3. One example of an ultra-large cluster, which takes advantage of this improvement, is Oak Ridge National Laboratory’s Summit supercomputer. Summit has more than 27,000 GPUs and was built to provide computing power to solve large-scale deep learning tasks for which great complexity and high fidelity is required.
In Horovod, network communication is used in two distinct ways. First and foremost, network communication is used to carry out the collective operations to allreduce/allgather/broadcast tensors across workers during training. To drive these operations, network communication is also used for coordination/control to determine tensor readiness across all workers, and subsequently, what collective operations to carry out. With large cases on these systems spanning many hundreds to thousands of GPUs, the coordination/control logic alone can become a severe limiter to obtaining good parallel efficiency.

To alleviate this bottleneck, NVIDIA contributed an improvement to the coordination/control implementation in Horovod to reduce the network communication usage for this phase of operation. In the improved implementation, a caching mechanism is introduced to store tensor metadata that was, in the original implementation, redundantly communicated across the network at each training step. With this change, coordination/control requires as little as a single bit per tensor communicated across the network per training step, instead of several bytes of serialized metadata per tensor.
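The caching idea can be sketched in a few lines: on a tensor's first appearance a worker sends its full metadata and records it in a cache, and on later steps a cache hit is represented by a single bit instead of re-serialized metadata. The sketch below is a simplified illustration of the approach described, not Horovod's actual implementation, and all names in it are invented for the example.

```python
def message_for_step(cache, tensors):
    """Return the control-plane payload a worker would send this step.
    tensors: dict mapping tensor name -> metadata (e.g. dtype/shape),
    assumed stable across steps. Cached tensors cost one bit each;
    new or changed tensors send full metadata once, then get cached."""
    bits, full = [], {}
    for name, meta in tensors.items():
        if cache.get(name) == meta:
            bits.append(name)   # cache hit: a single bit suffices
        else:
            full[name] = meta   # cache miss: send metadata, then cache it
            cache[name] = meta
    return bits, full

cache = {}
tensors = {
    "conv1/grad": ("float32", (64, 3, 7, 7)),
    "fc/grad": ("float32", (1000,)),
}
step1 = message_for_step(cache, tensors)  # first step: all metadata sent
step2 = message_for_step(cache, tensors)  # later steps: cache bits only
print(len(step1[1]), len(step2[1]))  # 2 0
```

From the second step on, the per-tensor cost of the control message drops from several bytes of metadata to a single cache bit, which is what makes the coordination phase scale to thousands of GPUs.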
RDMA support in the provided docker containers
Starting with Horovod version 0.16.4, RDMA support is available in the provided Docker containers, which increases Horovod’s performance.
Previously, it was necessary to build your own Docker image with the appropriate libraries, such as MOFED, to run Horovod with RDMA. That is no longer necessary, as the provided containers now support RDMA.
If you have Mellanox NICs, we recommend that you mount your Mellanox devices (/dev/infiniband) in the container and enable the IPC_LOCK capability for memory registration:
|$ nvidia-docker run -it --network=host -v /mnt/share/ssh:/root/.ssh --cap-add=IPC_LOCK --device=/dev/infiniband horovod:latest|
You need to specify these additional configuration options on primary and secondary workers.
Curious about how Horovod can make your model training faster and more scalable? Check out these new updates and try out the framework for yourself. Be sure to join the Horovod Announce and Horovod Technical-Discuss mailing lists.