The ever-increasing bandwidth and storage performance requirements of media workflows and other performance-intensive use cases have driven the evolution and continuous fine-tuning of every component of storage and network infrastructure. However, improving existing tools has its limits, and sometimes what is needed is a different approach. One such alternative approach to high-speed data transfer is Remote Direct Memory Access (RDMA). This clever technology brings a significant boost in throughput while markedly reducing latency. As a company, we have harnessed the power of RDMA to build super-fast Ethernet-based storage solutions that offer the kind of performance required for 4K, 8K, and DPX workflows.
How does RDMA work?
RDMA unlocks higher transfer speeds by circumventing the data buffers in the operating system. Data can then be transferred directly from the network adapter to the application memory and vice versa.
In a traditional network data path, data needs to pass through the kernel networking stack of both the sender's and the receiver's operating system, including the TCP and IPv4/IPv6 layers down to the device driver. RDMA, on the other hand, bypasses the operating system's kernel and allows the client system to copy data from the memory of the storage server directly into its own. This direct execution of I/O transactions between the network adapter and the application memory avoids unnecessary data copies and frees up the CPU. The result is higher throughput and lower latency, making remote file storage perform similarly to directly attached block storage. Because fewer CPU cycles are spent on network data transfer, more resources are left available for performance-demanding applications.
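The difference between the two data paths can be sketched as a simplified model that only counts buffer copies. This is purely illustrative: the step names are simplifications, not real kernel interfaces, and the copy counts are a conceptual approximation rather than a statement about any particular operating system.

```python
# Illustrative model of the two data paths described above. It counts the
# buffer copies each path performs; it is not a benchmark, and the step
# names are simplifications of what the kernel actually does.

TCP_PATH = [
    "application buffer -> kernel socket buffer",    # CPU copy
    "socket buffer -> TCP/IP segments",              # CPU copy
    "TCP/IP segments -> NIC driver ring buffer",     # CPU copy
]

RDMA_PATH = [
    "application buffer -> NIC (direct DMA, kernel bypassed)",
]

def cpu_copies(path):
    """Count CPU-mediated copies; the NIC's direct DMA step costs no CPU."""
    return sum(1 for step in path if "direct DMA" not in step)

print(f"TCP path:  {len(TCP_PATH)} steps, {cpu_copies(TCP_PATH)} CPU copies")
print(f"RDMA path: {len(RDMA_PATH)} step,  {cpu_copies(RDMA_PATH)} CPU copies")
```

The point of the model is the zero in the last line: with RDMA, the CPU is not involved in moving the payload at all, which is where the latency and CPU-utilization gains come from.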
The Different RDMA Flavours
Firstly, it is important to understand that there are a few different ways to run RDMA. These include iWARP, which uses TCP (plus a few additional layers) for RDMA communication; RoCE (RDMA over Converged Ethernet), which uses UDP; and InfiniBand, which runs on its own dedicated fabric rather than over Ethernet.
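As a quick reference, the layering of these transports can be summarized in a small table. This is only a reference sketch, not working protocol code; the transport names and the RoCE v2 UDP port are taken from the respective specifications.

```python
# Reference table: how each RDMA transport layers onto the network.
# RoCE v1 is included for completeness; it sits directly on Ethernet L2
# and therefore cannot cross IP routers, unlike RoCE v2 and iWARP.

RDMA_TRANSPORTS = {
    "iWARP":      {"transport": "TCP",                 "fabric": "Ethernet",   "ip_routable": True},
    "RoCE v1":    {"transport": "Ethernet L2 frames",  "fabric": "Ethernet",   "ip_routable": False},
    "RoCE v2":    {"transport": "UDP (dest port 4791)", "fabric": "Ethernet",  "ip_routable": True},
    "InfiniBand": {"transport": "native IB transport", "fabric": "InfiniBand", "ip_routable": False},
}

for name, props in RDMA_TRANSPORTS.items():
    print(f"{name:10s} runs over {props['transport']} on a {props['fabric']} fabric")
```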
Which of these technologies is better is an ongoing discussion, and one that Mellanox, a company which focuses heavily on RoCE, addressed in the following published comparison, in which they present RoCE as the solution with lower latency, higher performance, more efficient CPU utilization and wider adoption.
Mellanox Paper: RoCE vs iWARP
Chelsio, a competitor which focuses heavily on iWARP, published the following comparison, in which they highlight iWARP's ease of use, fewer support issues, lower cost, congestion-control mechanisms, scalability and routability.
Chelsio Paper: RoCE vs iWARP
In our experience, Chelsio's Linux drivers can be difficult to install compared to the Mellanox OFED driver package. On the other hand, we find that Chelsio's drivers behave much better under Windows; especially on dual-CPU workstations, they deliver higher performance than Mellanox. Unfortunately, the two RDMA standards are not cross-compatible, which means that a decision needs to be made on which one to use.
How Much Faster is RDMA Compared to TCP?
Mellanox, an established networking equipment manufacturer, has published the following interesting white paper on the benefits of RDMA, in which they compare different performance metrics of TCP against RoCE with SMB Direct. RoCE outperformed TCP in every metric in a 25GbE setup: read and write bandwidth was approximately 30% higher, with more efficient CPU utilization, particularly for read operations, where the results differed by about 50%. This is especially good news for media workflows, which are known to be quite read-heavy and tend to involve CPU-hungry applications.
Mellanox: Benefits of RDMA over Routed Fabrics
Nvidia, which is the new owner of Mellanox, also published a similar test in which they compared TCP to RoCE with NFS over a 100GbE connection. Their findings show 2 to 3 times higher bandwidth with a block size of 128KB, and about 1.5 times higher IOPS with 2KB and 8KB block sizes.
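The relationship underlying such tests is simply bandwidth = IOPS × block size, which is why large blocks are reported as bandwidth and small blocks as IOPS. The sketch below makes that arithmetic explicit; the IOPS figure used is a placeholder chosen to roughly saturate a 100GbE link, not a number from Nvidia's measurements.

```python
# Back-of-the-envelope check of bandwidth = IOPS x block size.
# The 95,000 IOPS figure is an assumed placeholder, not measured data.

def bandwidth_gbps(iops, block_size_bytes):
    """Throughput in gigabits per second for a given IOPS and block size."""
    return iops * block_size_bytes * 8 / 1e9

large = bandwidth_gbps(95_000, 128 * 1024)  # large blocks: bandwidth-bound
small = bandwidth_gbps(95_000, 2 * 1024)    # small blocks: IOPS-bound

print(f"95k IOPS @ 128 KB ~ {large:.1f} Gb/s")  # close to a full 100GbE link
print(f"95k IOPS @ 2 KB   ~ {small:.1f} Gb/s")  # the link is nowhere near full
```

At 128KB blocks, a few tens of thousands of operations per second already fill a 100GbE link, so the bottleneck is raw throughput; at 2KB blocks the same operation rate barely uses the link, so per-operation overhead (and thus the IOPS figure) is what matters.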
RDMA and ELEMENTS
Until recently, Fibre Channel was the only way to achieve the performance required for demanding media workflows. Thanks to our NVMe-powered ELEMENTS BOLT, this is now a thing of the past. With speeds of up to 70 gigabytes per second per chassis and practically instant data access, an Ethernet connection can seem impossibly fast. Adding RDMA to the mix (which we have implemented since 2017) allows us to scale NAS performance and get the most out of flash-based storage like BOLT. With RDMA, the overall scalability of the environment grows, giving your facility access to the most demanding workflows without a large investment in another type of network infrastructure, namely Fibre Channel. We recommend reading our Bulletproof solution for 4K HDR framestack workflows blog to find out just how simple and cost-efficient a high-performance environment can be.
RDMA is a technology that is hugely beneficial to media workflows. Not only does it increase connection throughput by a significant margin, but it also brings lower latency than TCP. By bypassing the operating system's networking stack and allowing the network adapter direct access to application memory, RDMA significantly reduces CPU utilization on both the client and the server side. This is of great value for media use cases, as the freed-up CPU resources can be used by the applications, for example to reduce the time it takes to render a timeline. RDMA is also a great match for flash-based storage, as it helps to solve the transmission bottleneck by making more efficient use of the performance the storage offers.
RDMA can bring clear performance benefits to your storage environment and allow for workflows that would otherwise not be possible.