04-03-18 | Blog Post
Microsoft Azure support for high performance computing (HPC) workloads (sometimes referred to as Big Compute) has evolved rapidly over the past few years. In this blog post, I will go over the history of and support for HPC in Microsoft Azure, including Infrastructure as a Service (IaaS) solutions, Platform as a Service (PaaS) solutions, and even Software as a Service (SaaS) solutions from Microsoft and/or its partners. I will also cover some of the most important scenarios where Big Compute makes a significant difference in helping organizations across many industries tackle problems that require a great amount of compute power.
HPC is used in many scenarios across multiple industries, including manufacturing, life sciences, chemicals, oil and gas, financial services, insurance, and many others. In manufacturing, engineering applications requiring extensive computations and the ability to render large graphics with great precision (e.g., a 20 GB airplane model image that includes the details of every rivet and part of the plane) were some of the first to be extended and run in the public cloud.
HPC is also used to perform finite element calculations during product design, which helps shorten the product design lifecycle and bring products to market faster. In chemicals and life sciences, HPC applications run extensive simulations in research and development (R&D) that help discover new molecules used in formulating new products, such as plastics in the chemical industry or drugs in life sciences.
In financial services, intensive simulations are needed to evaluate different financial solutions based on risk, among other criteria. In insurance, an industry built on risk analysis, HPC helps run long and extensive Monte Carlo simulations to quantify the risks of certain populations or conditions, on which insurance companies base their premiums and project their costs and profits. Financial institutions use similar simulations to recommend optimal distributions of accumulated savings to their customers.
No matter which industry you consider, public cloud big compute applications tend to fall into a few common categories.
One remarkable example of the power of cloud-based big compute in life sciences is the Human Genome Project, which took 13 years (1990 – 2003) and cost about $1B to sequence the first human genome. Nowadays, with next-generation sequencing enabled by HPC, DNA sequencing can be done in less than a day and costs a few hundred dollars. This is enabled by running massively parallel jobs that analyze millions of DNA fragments at the same time in the cloud.
The public cloud provides the ability to quickly provision highly powerful resources for performing intensive computations, and it makes provisioning such resources automatic on an as-needed basis.
Initially, companies in industries such as automotive and aerospace invested in highly expensive datacenters to run engineering simulations on premises. However, they quickly realized that such datacenters would run out of capacity at times and needed, at least temporarily, to be extended with additional capacity. Investing heavily in expensive resources to cover peak computation requirements that occur only part of the time did not make much business sense. Therefore, these companies looked to the public cloud as an ideal place to borrow the extra capacity when they needed it. This is how bursting applications to the cloud became a reality.
Public cloud providers such as Microsoft responded to this need by improving their cloud technology to support automating the bursting process. To this end, they added more robust hybrid networking as well as support for deploying the most commonly used schedulers in the cloud. These schedulers manage the provisioning and de-provisioning of the HPC cluster nodes required to run such solutions in a hybrid model.
The next step was for cloud providers to add their own HPC services that automate the allocation of resources for applications both on premises and in the cloud, as well as the scheduling of job runs across HPC cluster nodes. Third-party solution providers made their existing or new, innovative solutions available in the public cloud for customers who need HPC capabilities, which provided more choices for customers and further solidified the public cloud as the ideal environment for running HPC workloads.
The diagram below (adapted from the Azure solution architecture documentation) shows a typical application bursting scenario in Azure. This particular solution requires using certain virtual machine (VM) sizes in Azure and a secure hybrid network connection between Azure and the on-premises datacenter (using VPN or ExpressRoute technology), among other components.
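To make the hybrid connectivity piece more concrete, here is a minimal sketch of creating the Azure-side virtual network and VPN gateway with the Azure Python management SDK (azure-mgmt-network). The resource group, names, address ranges, and credential setup are placeholder assumptions, and the exact model fields may differ slightly between SDK versions.

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.network import NetworkManagementClient

# Placeholder credentials and names -- adjust to your subscription and naming conventions.
credentials = ServicePrincipalCredentials(client_id='<app-id>', secret='<secret>', tenant='<tenant-id>')
network_client = NetworkManagementClient(credentials, '<subscription-id>')

RG, LOCATION = 'hpc-burst-rg', 'eastus'

# Virtual network that the burst HPC nodes will live in.
network_client.virtual_networks.create_or_update(
    RG, 'hpc-vnet',
    {'location': LOCATION,
     'address_space': {'address_prefixes': ['10.10.0.0/16']}}).wait()

# Gateway subnet required by the VPN (or ExpressRoute) gateway.
subnet = network_client.subnets.create_or_update(
    RG, 'hpc-vnet', 'GatewaySubnet', {'address_prefix': '10.10.255.0/27'}).result()

# Public IP for the VPN gateway endpoint.
pip = network_client.public_ip_addresses.create_or_update(
    RG, 'hpc-vpn-ip',
    {'location': LOCATION, 'public_ip_allocation_method': 'Dynamic'}).result()

# Route-based VPN gateway; the on-premises side connects to this through a local
# network gateway and connection object (omitted here for brevity).
network_client.virtual_network_gateways.create_or_update(
    RG, 'hpc-vpn-gw',
    {'location': LOCATION,
     'gateway_type': 'Vpn',
     'vpn_type': 'RouteBased',
     'sku': {'name': 'VpnGw1', 'tier': 'VpnGw1'},
     'ip_configurations': [{'name': 'default',
                            'subnet': {'id': subnet.id},
                            'public_ip_address': {'id': pip.id}}]}).wait()
```

An ExpressRoute-based design follows the same general pattern, with an ExpressRoute gateway and circuit taking the place of the VPN gateway.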
There are several options for running HPC workloads on Azure. These range from "do it yourself" solutions to ones readily available in the Azure Marketplace, with other options in between.
The “do it yourself” solutions involve setting up your own HPC cluster in Azure using special VMs or VM scale sets. You can also “lift and shift” an existing on-premises HPC cluster to run it in Azure, or deploy a new cluster in Azure for additional capacity when you need it. Azure provides special VM sizes to support HPC workloads. These include the original A8 and A9 sizes, the N-series (with NVIDIA graphics processing unit, or GPU, support), and the H-series, such as H16r and H16mr, which can connect to a high-throughput backend remote direct memory access (RDMA) network. RDMA is essential for compute-intensive applications, allowing enormous amounts of data to flow between memory on different servers without tying up the server CPUs in the process.
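As an illustration of the “do it yourself” path, here is a minimal sketch of provisioning a single RDMA-capable H16r node with the Azure Python management SDK (azure-mgmt-compute). The resource group, network interface, credentials, and image reference are placeholder assumptions (the CentOS-HPC image shown is one common choice; verify the publisher, offer, and SKU available in your region).

```python
from azure.common.credentials import ServicePrincipalCredentials
from azure.mgmt.compute import ComputeManagementClient

credentials = ServicePrincipalCredentials(client_id='<app-id>', secret='<secret>', tenant='<tenant-id>')
compute_client = ComputeManagementClient(credentials, '<subscription-id>')

RG, LOCATION = 'hpc-cluster-rg', 'eastus'

# Create (or update) a single H16r compute node. A real cluster would loop over
# node names or, better, use a VM scale set or an ARM template.
poller = compute_client.virtual_machines.create_or_update(
    RG, 'hpc-node-01',
    {
        'location': LOCATION,
        'hardware_profile': {'vm_size': 'Standard_H16r'},   # RDMA-capable H-series size
        'storage_profile': {
            'image_reference': {                             # assumed CentOS-HPC image
                'publisher': 'OpenLogic',
                'offer': 'CentOS-HPC',
                'sku': '7.4',
                'version': 'latest'
            }
        },
        'os_profile': {
            'computer_name': 'hpc-node-01',
            'admin_username': 'hpcadmin',
            'admin_password': '<strong-password>'
        },
        'network_profile': {
            # NIC created beforehand in the cluster's virtual network
            'network_interfaces': [{'id': '<nic-resource-id>'}]
        }
    })
vm = poller.result()
print(vm.provisioning_state)
```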
Hybrid solutions extend on-premises clusters to burst additional workloads to Azure. You can use your own workload manager for these scenarios while still taking advantage of the HPC and GPU VMs in Azure.
Microsoft also provides the tools needed to develop your own SaaS solutions using Azure Batch and related services. Third-party providers have many SaaS applications that run on Azure; examples of such vendors include Altair, Cycle Computing, and Rescale. Finally, you can use existing HPC applications in the Azure Marketplace from Microsoft partners. Examples of such applications include Ansys CFD, MATLAB DCS, Autodesk Maya, 3ds Max, Arnold on Azure Batch, Batch AI, and many others.
One might argue that Azure Batch is the Microsoft PaaS implementation of HPC in Azure. Batch can run large-scale parallel and HPC batch jobs very efficiently. With Azure Batch, you do not need your own scheduler to control the nodes in the HPC cluster. Batch will manage the compute nodes (VMs), install the needed applications, and act as the scheduler for the batch jobs that need to run on the nodes. Batch also lets you use the Azure Portal to configure, manage, and monitor jobs running on the different nodes. The figure below shows how easy it is to create a Batch service in the Azure Portal. Once created, the service is ready to use to manage the HPC cluster you point it to.
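To show what working with Batch looks like in code, here is a minimal sketch using the Azure Batch Python SDK (azure-batch): it creates a pool of compute nodes, a job bound to that pool, and a handful of tasks. The account name, key, URL, pool size, VM size, image, and command line are placeholder assumptions; the overall flow (pool, then job, then tasks) is the part that matters.

```python
from azure.batch import BatchServiceClient
from azure.batch import batch_auth
import azure.batch.models as batchmodels

# Placeholder Batch account details, taken from the Azure Portal for a real account.
ACCOUNT_NAME = '<batch-account-name>'
ACCOUNT_KEY = '<batch-account-key>'
ACCOUNT_URL = 'https://<batch-account-name>.<region>.batch.azure.com'

credentials = batch_auth.SharedKeyCredentials(ACCOUNT_NAME, ACCOUNT_KEY)
batch_client = BatchServiceClient(credentials, ACCOUNT_URL)

# 1. Pool: Batch provisions and manages these VMs for you.
batch_client.pool.add(batchmodels.PoolAddParameter(
    id='hpc-pool',
    vm_size='STANDARD_D2_V2',                      # assumed size; HPC sizes also work
    target_dedicated_nodes=4,
    virtual_machine_configuration=batchmodels.VirtualMachineConfiguration(
        image_reference=batchmodels.ImageReference(
            publisher='Canonical', offer='UbuntuServer',
            sku='16.04-LTS', version='latest'),
        node_agent_sku_id='batch.node.ubuntu 16.04')))

# 2. Job: a container for tasks, bound to the pool above.
batch_client.job.add(batchmodels.JobAddParameter(
    id='hpc-job',
    pool_info=batchmodels.PoolInformation(pool_id='hpc-pool')))

# 3. Tasks: Batch schedules these across the pool's nodes -- no external scheduler needed.
tasks = [batchmodels.TaskAddParameter(
             id='task-{}'.format(i),
             command_line='/bin/bash -c "echo processing chunk {}"'.format(i))
         for i in range(10)]
batch_client.task.add_collection('hpc-job', tasks)
```

Batch then schedules the ten tasks across the four nodes and reports their status, which you can monitor in the Azure Portal or query through the same client.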
Microsoft also provides several APIs that can be used. The Batch service API allows applications to issue REST calls to run and manage Azure Batch workloads. The Batch Management API allows for programmatic access to Batch accounts (quota, application packages, and other resources).
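As a small illustration of the management side, the sketch below calls the Azure Resource Manager REST endpoint for a Batch account and reads some of its quota-related properties. The subscription, resource group, account name, bearer token, API version, and property names shown are assumptions to verify against the current Batch Management REST reference.

```python
import requests

SUBSCRIPTION_ID = '<subscription-id>'
RESOURCE_GROUP = '<resource-group>'
ACCOUNT_NAME = '<batch-account-name>'
ACCESS_TOKEN = '<azure-ad-access-token>'   # obtained separately, e.g. via a service principal

# ARM endpoint for the Batch account resource (Batch Management API).
url = ('https://management.azure.com/subscriptions/{}/resourceGroups/{}'
       '/providers/Microsoft.Batch/batchAccounts/{}').format(
           SUBSCRIPTION_ID, RESOURCE_GROUP, ACCOUNT_NAME)

resp = requests.get(url,
                    params={'api-version': '2017-09-01'},   # assumed API version
                    headers={'Authorization': 'Bearer {}'.format(ACCESS_TOKEN)})
resp.raise_for_status()

props = resp.json().get('properties', {})
print('Dedicated core quota:', props.get('dedicatedCoreQuota'))
print('Pool quota:', props.get('poolQuota'))
print('Active job quota:', props.get('activeJobAndJobScheduleQuota'))
```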
I have barely scratched the surface when it comes to big compute capabilities in Azure. Microsoft provides extensive documentation on Azure Batch and the rest of the Azure HPC services. However, I hope this post gives you an overview of the power of running HPC workloads in the cloud and what Microsoft is doing in this area.
If you’re looking to manage your HPC workloads in Azure, we can help. Learn more about our managed Azure services, download our free ebook on cloud computing, or watch our recent webinar about Microsoft Azure.