Bottlenecks: we all hate them. They slow down our work and frustrate us. Luckily, there are plenty of solutions available that will allow you to remove them. So what are the typical bottlenecks one would find in DL or ML experiments?

In today’s age of disruptive technologies, I am regularly asked the same question by companies in many verticals: ‘How can I speed up my training times?’

The main issue here is that most organisations are well equipped to manage projects based on established platforms (x86), but as soon as you start looking beyond the mainstream solutions, skillsets become weaker, not just internally but also at your hardware suppliers. GPUs, NVLink, bandwidth considerations, frameworks and GANs are all new things to the typical system administrator or reseller.

Today I am going to look at bandwidth: how to move away from PCIe to something better suited to the new kinds of workloads users need to process, where the bottlenecks are, and how we can do away with them. The current best solution is from NVIDIA and it’s called NVLink.

What is NVLink?

Imagine you are driving down the motorway on the way to a business trip. You are travelling on a 3-lane motorway. Traffic is very heavy and as a result you are travelling slowly. The bottleneck here is that there are not enough lanes for all the cars, hence the slow speeds. Now imagine you had 6 lanes but the same number of cars. Would you be able to travel faster? Of course: the traffic would move faster because you have more bandwidth, and so the bottleneck is removed.

This is, in essence, what NVLink does for your experiments: it removes bottlenecks from the transfer of data between the GPUs. NVLink is a communication system that lets you process more data more quickly because you can move larger amounts of data onto the GPUs. The GPUs can then share this data among all the GPUs in the system.

NVLink was originally introduced on the DGX-1-P variant and was immediately recognised as a game changer. The extra performance it brought to the training process was instantly noticeable and was a major contributing factor to the success of the DGX-1-P. Performance was up to 5x better than with traditional PCIe, which set the scene for future versions of NVLink. Then in 2017, NVIDIA launched the second variant of the DGX series with a new version of NVLink. With NVLink 2, the advantage over PCIe increased to 10x, again offering great savings in training times.

To show this, below are two illustrations: one of the problem and one of the solution.

The image on the left is an 8-GPU system based on x86 and using NVIDIA NVLink technology. As you can see, NVLink massively increases the communication bandwidth between the GPUs, which results in faster training times. But as soon as the workload engages with the CPU, the bottleneck is back, because NVLink is not licensed out to Intel or AMD (green lines indicate where NVLink is used; thinner purple lines show where PCIe is used). This effectively means the bottleneck has simply been moved: GPU-to-GPU bandwidth was the bottleneck, and with NVLink that bottleneck is removed and passed on to the CPU connection. The 32GB/s link is the bottleneck point here (in red).

However, the image on the right highlights the solution: it represents the IBM POWER9 platform. POWER9 is currently one of only a few platforms where NVLink has been integrated directly onto the CPU (the POWER9 chip) itself, as well as on the GPUs. The effect of this is massive. In simple terms, the bottlenecks that can be removed are now fully removed: from the time you start your experiments, you are not limited by bandwidth restrictions on either GPU-to-GPU or CPU-to-GPU communication. 150GB/s is the standard pathway for communications on the POWER9, meaning there is no bottleneck (given current technology).
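To put the two figures in perspective, here is a minimal back-of-the-envelope sketch in Python. It assumes the 32GB/s and 150GB/s figures quoted above are sustained rates, which real systems rarely achieve. The 300GB data volume and the function name `transfer_seconds` are made up purely for illustration.

```python
def transfer_seconds(data_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealised time to move data_gb gigabytes over a link (no latency, no overhead)."""
    if bandwidth_gb_per_s <= 0:
        raise ValueError("bandwidth must be positive")
    return data_gb / bandwidth_gb_per_s


# Figures quoted in the article, treated here as sustained rates (an assumption).
PCIE_GB_PER_S = 32.0
NVLINK_POWER9_GB_PER_S = 150.0

# Hypothetical amount of data moved between CPU and GPU.
data_gb = 300.0

pcie_time = transfer_seconds(data_gb, PCIE_GB_PER_S)
nvlink_time = transfer_seconds(data_gb, NVLINK_POWER9_GB_PER_S)

print(f"PCIe CPU-GPU link:   {pcie_time:.2f} s")
print(f"NVLink CPU-GPU link: {nvlink_time:.2f} s")
print(f"Ratio:               {pcie_time / nvlink_time:.2f}x")
```

The ratio only describes the time spent moving data. How much of it shows up in end-to-end training time depends on how transfer-bound the workload is.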

The effect this can have on training times is a very big deal indeed, because from the moment you have uploaded your dataset onto the system and started training, the system is optimised from a hardware (bandwidth) point of view. Some of the clients Novatech is working with on proofs of concept on the POWER9 platform are reporting incredible speed-ups in training times over systems where NVLink is present on the GPUs but not on the CPU. To drive this point home: in one example, a client saw the same performance from a POWER9 system as from an 8x V100 GPU system with NVLink on the GPUs but not on the CPU. Of course this type of example holds no weight without a detailed case study, which I will post when the PoC is finished.

The POWER9 platform is the successor to the hugely popular POWER8 (Minsky) system. With the launch of POWER9, IBM have taken this great platform to another level in terms of GPU compute performance, scalability (95%), and NVLink on both the POWER9 chip and the GPUs. The system also supports some other cool features:

·        Distributed Deep Learning (DDL): 95% scalability over 256 GPUs.

·        Large Model Support (LMS): no need to reduce the dataset or split the work into batch tasks. You can look at the data holistically to get a deeper insight. If GPU memory is saturated, or above 80% utilised, LMS will benefit your experiments.

·        Open source infrastructure; NVLink between CPU and GPU (3.5 to 4x faster than PCIe).

·        AI Vision lets you automatically build Image Classification and Object Detection models.
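As a rough illustration of what the DDL figure in the list above implies, the sketch below converts a scaling efficiency into an effective speed-up. It assumes "95% scalability" means that N GPUs deliver 0.95 x N times the throughput of a single GPU, which is my reading rather than a definition from IBM. The function name and the 1,000-hour single-GPU job are hypothetical.

```python
def effective_speedup(num_gpus: int, scaling_efficiency: float) -> float:
    """Speed-up over a single GPU, assuming throughput = efficiency * num_gpus."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return num_gpus * scaling_efficiency


# Figures from the list above; their interpretation is an assumption.
speedup = effective_speedup(num_gpus=256, scaling_efficiency=0.95)

# Hypothetical job that would take 1,000 hours on one GPU.
single_gpu_hours = 1000.0
distributed_hours = single_gpu_hours / speedup

print(f"Effective speed-up: {speedup:.1f}x")
print(f"Estimated time:     {distributed_hours:.2f} hours")
```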


Communication is key in the age of technology. Communication amongst your technologies is even more important. Since Titan, a trend has emerged toward heterogeneous node configurations with larger ratios of GPU accelerators per CPU socket, with two or more GPUs per CPU becoming increasingly common as developers continue to expose and leverage the available parallelism in their applications. Although each of the new DoE systems is unique, they share the same fundamental multi-GPU node architecture.

While multi-GPU applications provide a vehicle for scaling single-node performance, they can be constrained by interconnect performance between the GPUs. Developers must overlap data transfers with computation or carefully orchestrate GPU accesses over the PCIe interconnect to maximise performance. However, as GPUs get faster and GPU-to-CPU ratios climb, a higher-performance node integration interconnect is warranted. Enter NVLink.

NVLink was designed as a solution to the challenges that exascale computing created. With an energy-efficient, high-bandwidth design, this interconnect is the next step in accelerating your GPUs, allowing for very fast communication between your CPU and GPU hardware, as well as between the GPUs themselves. NVLink brings data sharing to a new level, up to 10 times faster than a traditional PCIe interconnect.
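The idea of overlapping transfers with computation can be sketched with a simple timing model. This is not GPU code (in practice the overlap is done with asynchronous copies on CUDA streams), just arithmetic showing why overlap helps and why the slower of the two stages sets the pace. It assumes every batch has the same transfer and compute time. All names and numbers are hypothetical.

```python
def serial_time(n_batches: int, transfer_s: float, compute_s: float) -> float:
    """Each batch is transferred, then computed, with no overlap."""
    return n_batches * (transfer_s + compute_s)


def overlapped_time(n_batches: int, transfer_s: float, compute_s: float) -> float:
    """Transfer of batch i+1 runs while batch i is being computed.

    Two-stage pipeline: the first transfer and the last compute cannot be
    hidden, and every step in between costs the slower of the two stages.
    """
    if n_batches == 0:
        return 0.0
    return transfer_s + (n_batches - 1) * max(transfer_s, compute_s) + compute_s


# Hypothetical workload: 100 batches, 0.25 s to transfer, 0.5 s to compute.
n, transfer, compute = 100, 0.25, 0.5
print(f"No overlap:   {serial_time(n, transfer, compute):.2f} s")
print(f"With overlap: {overlapped_time(n, transfer, compute):.2f} s")

# If the transfer is slower than the compute, the link becomes the bottleneck
# and overlap alone cannot hide it; only a faster interconnect can.
print(f"Slow link:    {overlapped_time(n, 1.0, compute):.2f} s")
```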

The result is a dynamic technology that speeds up applications and their performance, producing a new species of boosted, flexible servers for efficient, ultra-fast computing.

All of the benefits I have mentioned above will be eclipsed when TensorFlow starts supporting NVIDIA TensorRT. When this happens, most users will see a far greater performance increase than they are currently experiencing.

I have been discussing two primary platforms with NVLink above, the DGX-1-V and the POWER9, so I am sure people will soon be asking which of the two is the better purchase. Unfortunately, there is no simple answer to this. Both platforms excel in certain areas and would generally be a massive asset to any organisation that purchased them.

From my point of view, I see the DGX-1-V as the T-Rex of the current line-up of systems on offer. Simply put, it has more computational performance than any other platform currently available in its class: 8x V100 GPUs utilising NVLink, producing a total output of 1000 TFLOPS of DL performance. So from a deployment point of view it is very hard to beat. I don’t see anything to rival this currently, until NVIDIA bring something new to the party; I think GTC is coming soon...

The POWER9 system, however, is different from the DGX-1-V. It’s not trying to be a T-Rex and dominate the landscape with raw compute performance; instead it’s a much nimbler, fine-tuned system, with its roots in HPC applications and the scientific community. Whilst the POWER9 would be a good deployment system, I feel it’s slightly better suited as a training / deployment system. The system also offers some great features like LMS, which would bring serious benefits in areas like medical imaging (drug discovery, cancer detection) or with any very large images, for example 30,000 x 30,000 (satellite images of Earth or mapping the universe).

So for me, it's not about which system is the best in terms of raw performance; that's not the point. The point is which system is going to deliver the solution that your company needs. The performance curve that will matter most to you depends massively on the experiments you are running, the type of data you are using, and even the types of frameworks and nets.

It’s also good to know that NVLink is available to you without having to purchase the systems mentioned in this article. If you would like to find out more about this, or the other ML/DL solutions we offer, simply contact me or one of my team.
