Cray and Microsoft accelerate deep learning training to minutes instead of weeks

A team of researchers has accelerated the training of deep learning algorithms on supercomputers. Instead of waiting weeks or months, data scientists can now obtain results within hours or even minutes.
 
With supercomputing architectures and technologies now applied to deep learning frameworks, customers can tackle a whole new class of problems, such as moving from image recognition to video recognition, and from simple speech recognition to natural language processing with context.
 
The team scaled the Microsoft Cognitive Toolkit, an open-source suite that trains deep learning algorithms, to more than 1,000 Nvidia Tesla P100 GPU accelerators on Piz Daint, the Cray XC50 supercomputer at the Swiss National Supercomputing Centre (CSCS).
 
Deep learning problems share algorithmic similarities with applications traditionally run on a massively parallel supercomputer. By optimizing inter-node communication using the Cray® XC™ Aries network and a high performance MPI library, each training job can leverage significantly more compute resources, reducing the time required to train an individual model. 
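
To illustrate why inter-node communication matters for this kind of training, the following minimal sketch simulates, in a single process, the gradient-averaging step that an MPI allreduce would perform across nodes in data-parallel training. It is an illustration under stated assumptions, not the Cray or Microsoft Cognitive Toolkit implementation: the function names (`local_gradient`, `allreduce_mean`, `train`), the toy linear model, and the data are all hypothetical, and no real MPI library is used.

```python
# Toy simulation of data-parallel training: each "worker" computes a gradient
# on its own shard, then an allreduce-style step averages the gradients.
# All names and data here are hypothetical; no real MPI library is used.

def local_gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2.0 * (w * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    """Stand-in for an MPI allreduce: every worker receives the mean."""
    mean = sum(values) / len(values)
    return [mean] * len(values)

def train(data, num_workers, steps=200, lr=0.01):
    """Run synchronous data-parallel gradient descent on equal-sized shards."""
    shard_size = len(data) // num_workers
    shards = [data[i * shard_size:(i + 1) * shard_size] for i in range(num_workers)]
    w = 0.0
    for _ in range(steps):
        # On a real system each worker would compute its gradient in parallel.
        grads = [local_gradient(w, s) for s in shards]
        # Every worker applies the same averaged gradient.
        w -= lr * allreduce_mean(grads)[0]
    return w

# Synthetic data generated from y = 3x
data = [(float(x), 3.0 * x) for x in range(1, 9)]
w_single = train(data, num_workers=1)
w_multi = train(data, num_workers=4)
```

With equal-sized shards, the averaged gradient matches the full-batch gradient, so both runs should converge to roughly the same weight. On a real system the averaging step runs over the network at every iteration, which is why a fast interconnect and MPI library can shorten training time as more nodes are added.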
 
“Cray’s proficiency in performance analysis and profiling, combined with the unique architecture of the XC systems, allowed us to bring deep learning problems to our Piz Daint system and scale them in a way that nobody else has,” said Prof. Dr. Thomas C. Schulthess, director of the Swiss National Supercomputing Centre (CSCS). “What is most exciting is that our researchers and scientists will now be able to use our existing Cray XC supercomputer to take on a new class of deep learning problems that were previously infeasible.”
 
“Applying a supercomputing approach to optimize deep learning workloads represents a powerful breakthrough for training and evaluating deep learning algorithms at scale,” said Dr. Xuedong Huang, distinguished engineer, Microsoft AI and Research. “Our collaboration with Cray and CSCS has demonstrated how the Microsoft Cognitive Toolkit can be used to push the boundaries of deep learning.”
 
This collaboration opens the door for researchers to run larger, more complex, multi-layered deep learning workloads at scale, harnessing the performance of a Cray supercomputer.
 
To simplify building and deploying deep learning environments on supercomputers, Cray is supporting its Cray XC customers with deep learning toolkits, such as the Microsoft Cognitive Toolkit, that let them run deep learning applications at scale on a Cray supercomputer. Fusing high performance computing capability with deep learning is another step forward in Cray’s vision of the convergence of supercomputing and big data.
 
“Only Cray can bring the combination of supercomputing technologies, supercomputing best practices, and expertise in performance optimization to scale deep learning problems,” said Dr. Mark S. Staveley, Cray’s director of deep learning and machine learning. “We are working to unlock possibilities around new approaches and model sizes, turning the dreams and theories of scientists into something real that they can explore. Our collaboration with Microsoft and CSCS is a game changer for what can be accomplished using deep learning.”