Case Study

Financial Services HPC Cloud Migration

  • Client: Systemically important European bank with global presence
  • Number of HPC applications: > 1000
  • Architecture: Shared (multi tenant) grid across 8 on prem DCs and 6 cloud regions
  • Workload: > 30 million CPU hours per month

This case study presents the work undertaken by HMx Labs in migrating the HPC workload of a large, systemically important European bank to the cloud. We have included as much detail as our non-disclosure agreements allow. Some aspects of the work cannot be disclosed here, including the identities of the client and the cloud provider.

The Ask

Efficiently transition HPC workloads to the cloud to:

  • meet cloud consumption targets
  • decommission end of life hardware
  • exit existing data centres

Enhance cloud efficiency by:

  • lowering unit cost of compute (price per CPU hour of workload)
  • minimising always on capacity
  • reducing VM start up time
  • using remaining on prem capacity in preference to cloud
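
These efficiency goals are linked. As a minimal sketch, assuming that unit cost means cloud spend divided by the CPU hours of workload actually run, the example below shows how paying for idle always on capacity raises the unit cost even when the price per provisioned CPU hour is unchanged. The function name and all figures are illustrative assumptions, not client data.

```python
def unit_cost(provisioned_cpu_hours, price_per_cpu_hour, workload_cpu_hours):
    """Cloud spend divided by the CPU hours of workload actually run."""
    if workload_cpu_hours <= 0:
        raise ValueError("workload_cpu_hours must be positive")
    spend = provisioned_cpu_hours * price_per_cpu_hour
    return spend / workload_cpu_hours


# Illustrative figures only, not client data.
# Case 1: every provisioned CPU hour runs workload.
fully_used = unit_cost(1_000_000, 0.04, 1_000_000)
# Case 2: same workload, but 25% extra capacity is paid for and sits idle.
with_idle = unit_cost(1_250_000, 0.04, 1_000_000)

print(f"fully used: {fully_used:.3f}, with idle capacity: {with_idle:.3f}")
```

Under this definition, anything that shrinks the gap between provisioned and used CPU hours lowers the unit cost without any change in list price.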

Initial State

Before engaging HMx Labs, the client had signed an agreement with a cloud provider that stipulated a minimum spend commitment. Network connectivity had already been established, and centralised cloud adoption and CISO teams and policies were already in place.

The cloud network configuration followed a fairly standard hub-and-spoke model, with dedicated connectivity (bandwidth varying by region) to the cloud.

  • < 20% of HPC workload on cloud
  • > 30 million CPU hours per month HPC workload
  • > 1000 HPC applications in use

Constraints

The most notable constraints are listed below. The list is not exhaustive.

  • All data stored in the cloud must be encrypted both in transit and at rest, a requirement not mandated for on-premises data. Encryption must use keys generated and maintained by the client, not keys provided by the cloud service.
  • No connectivity can be established from external networks (including cloud) to the on premises network.

Our Assessment

Our involvement began with a comprehensive assessment of the client's grid computing estate. The outcomes of the assessment served several purposes:

  • Establishing and aligning on a baseline of the current state
  • Engaging stakeholders (up to board level)
  • Formulating and obtaining agreement on a migration approach

Cloud migrations of this kind are complex, lengthy and expensive, which makes these steps essential. Without clarity and alignment across all organisational levels on every aspect of the migration, the risk of a costly failure is significant.

Top 100 HPC Applications Monthly CPU Hours

Our assessment covered not only the HPC infrastructure, which serves numerous applications, but also the largest (by workload) and most complex HPC applications themselves. For each of these applications we mapped the overall architecture and dependencies, and identified key constraints, performance requirements, critical bottlenecks and limitations.

In collaboration with the HPC application team, we subsequently established a preferred migration approach along with potential mitigations and alternatives.

While we are unable to share an architecture diagram of one of the client’s applications, the following illustrates a typical financial risk system (the HPC application).

Application Migration

A significant part of the migration involved close collaboration between HMx Labs and multiple application teams to support their moves to the cloud.

Given the diverse cloud proficiency, levels of engagement, and time constraints among the application teams, we tailored our approach to each team accordingly. This ranged from offering high-level architectural guidance to integrating closely within the teams.

In our experience, validating technology solutions inside an organisation such as our client can be approximately 100 times slower than doing so within HMx Labs. We are therefore well placed to validate proof of concept designs using only open-source code (such as COREx) and a meta-analysis of the actual system.

This distinctive approach enables HMx Labs to simulate the HPC application within our own test labs and furnish comprehensive feedback on migration solutions. All of this is accomplished without any proprietary code, data, or intellectual property leaving the client's network.

Cloud Cost Optimisation

Working embedded within the client’s HPC team, and with its close co-operation, we employed various strategies to optimise cloud costs.

VM Selection

Although the client cannot dynamically reallocate capacity to a different VM type in real time, a semi-manual process exists for selecting the most cost-effective virtual machine type and region.

HMx Labs benchmarking data served as input for this process. We also developed new reporting within the internal metrics platform to show the relative costs of operating on various VM types and regions, reducing the manual work involved in VM type and region selection.
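
As a rough illustration of the kind of comparison such reporting supports, the sketch below ranks VM type and region pairs by cost per unit of benchmarked work rather than by hourly price. It is a simplified sketch, not the client's process: the VM names, regions, prices and benchmark scores are all invented, and real selection involves further factors (such as capacity availability and data locality) that are not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VmOption:
    vm_type: str               # hypothetical VM type name
    region: str                # hypothetical region name
    price_per_hour: float      # hourly price of the whole VM
    vcpus: int                 # vCPUs per VM
    tasks_per_cpu_hour: float  # benchmark throughput per vCPU (invented)

    @property
    def cost_per_task(self) -> float:
        """Price paid to complete one benchmark task on this VM."""
        tasks_per_hour = self.vcpus * self.tasks_per_cpu_hour
        return self.price_per_hour / tasks_per_hour


def rank_options(options):
    """Sort options from cheapest to most expensive per benchmark task."""
    return sorted(options, key=lambda option: option.cost_per_task)


# Invented example figures, for illustration only.
options = [
    VmOption("type-a", "region-1", 4.00, 64, 10.0),
    VmOption("type-b", "region-1", 3.00, 64, 6.0),
    VmOption("type-a", "region-2", 3.60, 64, 10.0),
]

ranked = rank_options(options)
for option in ranked:
    print(f"{option.vm_type} {option.region}: {option.cost_per_task:.6f} per task")
```

In this made-up data the VM with the lowest hourly price (type-b) is the most expensive per task, which is why benchmark throughput, and not price alone, needs to feed the selection.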

Reduction of Always On Capacity

Not all HPC applications are able to rely entirely on dynamically scaled cloud capacity. A number of applications require that a minimum number of CPUs remain available at all times, resulting in higher than desired costs. Collaborating with the application teams, HMx Labs successfully reduced the total CPU hours of static capacity required.

VM Start Up Time

As part of the strategy used to reduce always on capacity, we focused on reducing VM start up time. This was achieved by investigating both VM boot times and the mechanism and cloud APIs used to scale capacity.
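
To show why start up time and always on capacity are connected, here is a deliberately simple model, offered as an assumption rather than a description of the client's system: if demand can ramp up at some maximum rate while newly requested VMs are still starting, the always on pool has to absorb that ramp. The function names, ramp rate and start up times below are hypothetical.

```python
import math


def static_cpus_needed(ramp_cpus_per_minute: int, startup_minutes: int) -> int:
    """CPUs that must already be running to absorb demand that arrives
    while newly requested VMs are still starting (simplified model)."""
    return math.ceil(ramp_cpus_per_minute * startup_minutes)


def monthly_static_cpu_hours(static_cpus: int, hours_in_month: int = 730) -> int:
    """CPU hours consumed per month by capacity that is never switched off."""
    return static_cpus * hours_in_month


# Hypothetical figures: demand can grow by 200 CPUs per minute.
before = static_cpus_needed(ramp_cpus_per_minute=200, startup_minutes=10)
after = static_cpus_needed(ramp_cpus_per_minute=200, startup_minutes=3)

print(before, after, monthly_static_cpu_hours(before - after))
```

In this model the static buffer scales linearly with start up time, so every minute removed from VM boot or from the scaling API path reduces the capacity that has to be kept running.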

Conclusion

Our collaboration with the client's HPC and application teams delivered significant progress in cloud migration, optimisation and cost reduction. Through tailored strategies and close attention to detail, we streamlined processes, reduced static capacity requirements and improved overall efficiency.

  • > 70% reduction in unit cost of HPC compute
  • > 90% of HPC workload cloud enabled. Remaining applications enabled with a path to cloud
  • The charts speak for themselves!

CPU Hours Cloud/On-Prem

Total On Prem Machines
