Remote direct memory access (RDMA) is a technology that allows computers in a network to exchange data in main memory without involving the processor, cache or operating system of either computer. This frees up resources, improving throughput and performance while enabling faster data transfer.
RDMA technology increases the speed of server-to-server data movement through better utilisation of network infrastructure, without CPU intervention: the network adapter transfers data directly to or from application memory without interrupting other parallel operations of the system. RDMA is widely used in enterprise data centres and high-performance computing (HPC) because of its high-throughput, low-latency networking.
This article will enable application developers to start programming RDMA applications even without any prior experience with the technology. Before we start, let's take a brief look at InfiniBand (IB) fabrics, including their features and components.
InfiniBand (IB)
InfiniBand is an open industry-standard specification for data flow between server I/O and inter-server communication. IB supports RDMA and offers high speed, low latency, low CPU overhead, high efficiency and scalability. InfiniBand link speeds range from 10Gbps (SDR) to 56Gbps (FDR) per 4x port.
Components of InfiniBand
Host channel adapter (HCA): This provides an address translation mechanism under the control of the operating system, which allows an application to access the HCA directly. The same address translation mechanism is the means by which an HCA accesses memory on behalf of a user level application. The application refers to virtual addresses, while the HCA has the ability to translate these addresses into physical addresses in order to effect the actual message transfer.
Switches: IB switches are conceptually similar to standard networking switches but are designed to meet IB performance requirements. They implement the flow control of the IB Link Layer to prevent packet dropping and to avoid congestion. They also have adaptive routing capabilities and advanced quality of service. Many switches include a subnet manager, at least one of which is required to configure an IB fabric.
Range extenders: InfiniBand range extension is accomplished by encapsulating the InfiniBand traffic onto the WAN link and extending sufficient buffer credits to ensure full bandwidth across the WAN.
Subnet managers: The IB subnet manager is based on the concept of software defined networking (SDN), which eliminates interconnect complexity and enables the creation of very large scale compute and storage infrastructures. The IB subnet manager assigns local identifiers (LIDs) to each port connected to the InfiniBand fabric, and develops a routing table based on the assigned LIDs.
Installing RDMA
First of all, connect two devices back to back or through a switch. Download and install the latest version of the OFED package from https://www.openfabrics.org/downloads/.
OpenFabrics Enterprise Distribution (OFED) is a package developed and released by the OpenFabrics Alliance (OFA), as a joint effort of many companies that are part of the RDMA scene. It contains the latest upstream software packages (both kernel modules and user-space code) to work with RDMA. This package supports most major Linux distributions and CPU architectures.
Extract the tgz file and change into the directory it creates. The archive name depends on the version you downloaded, so <version> below is just a placeholder:
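[root@localhost]# tar xzf OFED-<version>.tgz
[root@localhost]# cd OFED-<version>

Then type the following command to start the installation: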
[root@localhost]# ./install.pl
Next, choose ‘2’ (Install OFED software).
From the options displayed, choose ‘1’ (OFED modules and basic user level libraries).
OFED packages will now be installed. Reboot the system to complete the installation.
The structure of a typical RDMA application is as follows:
1. Gets the device list
2. Opens the requested device
3. Queries the device’s capabilities
4. Allocates a protection domain
5. Registers a memory region
6. Creates a completion queue
7. Creates a queue pair
8. Brings the queue pair to a ready-to-send state
9. Creates an address vector
10. Posts work requests
11. Polls for completion
12. Cleans up
To identify RDMA-capable devices in your system, type the following command:
[root@localhost]# ibstat
You need to know which medium you plan to use for your RDMA connection: InfiniBand or Ethernet. In the ibstat output, verify that the port state is Active and the physical state is LinkUp.
Getting the device list
ibv_get_device_list() returns a NULL-terminated array of the RDMA devices currently available.
An example of how this is done is given below:
struct ibv_device **dev_list;

dev_list = ibv_get_device_list(NULL);
if (!dev_list)
    exit(1);
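All the snippets in this article assume that <infiniband/verbs.h>, <stdio.h>, <stdlib.h> and <string.h> have been included. ibv_get_device_list() can also report the number of devices it found, which makes it easy to list them; a minimal sketch:

int num_devices;
struct ibv_device **dev_list;

dev_list = ibv_get_device_list(&num_devices);
if (!dev_list)
    exit(1);
for (int i = 0; i < num_devices; ++i)
    printf("RDMA device %d: %s\n", i, ibv_get_device_name(dev_list[i]));

When the list is no longer needed, release it with ibv_free_device_list(dev_list).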
Opening the requested device
ibv_open_device() opens the device and creates a context for further use.
An example is given below:
struct ibv_device **device_list;
struct ibv_context *ctx;

ctx = ibv_open_device(device_list[0]);
if (!ctx) {
    fprintf(stderr, "Error, failed to open the device '%s'\n",
        ibv_get_device_name(device_list[0]));
    return -1;
}
printf("The device '%s' was opened\n", ibv_get_device_name(ctx->device));
Querying the device’s capabilities
ibv_query_device() returns the attributes of the RDMA device that is associated with a context. These attributes are constant and can be cached and used later.
Here is an example:
struct ibv_device_attr device_attr;
int rc;

rc = ibv_query_device(ctx, &device_attr);
if (rc) {
    fprintf(stderr, "Error, failed to query the device '%s' attributes\n",
        ibv_get_device_name(ctx->device));
    return -1;
}
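The returned attributes can be used, for instance, to size later allocations. A small illustrative printout (all fields shown are part of struct ibv_device_attr):

printf("Max QPs: %d, max CQs: %d, max MRs: %d, physical ports: %d\n",
    device_attr.max_qp, device_attr.max_cq,
    device_attr.max_mr, device_attr.phys_port_cnt);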
Allocating a protection domain
ibv_alloc_pd() allocates a protection domain for an RDMA device context.
An example is given below:
struct ibv_context *context;
struct ibv_pd *pd;

pd = ibv_alloc_pd(context);
if (!pd) {
    fprintf(stderr, "Error, ibv_alloc_pd() failed\n");
    return -1;
}
Registering a memory region
ibv_reg_mr() registers a memory region associated with the protection domain to allow the RDMA device to perform read/write operations.
Here is an example:
struct ibv_pd *pd;
struct ibv_mr *mr;

mr = ibv_reg_mr(pd, buf, size, IBV_ACCESS_LOCAL_WRITE);
if (!mr) {
    fprintf(stderr, "Error, ibv_reg_mr() failed\n");
    return -1;
}
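IBV_ACCESS_LOCAL_WRITE only allows the local HCA to write into the buffer (needed for incoming receives). If the remote peer should be able to target the buffer with RDMA read and write operations, the remote access flags must also be requested at registration time, for example:

mr = ibv_reg_mr(pd, buf, size,
    IBV_ACCESS_LOCAL_WRITE |
    IBV_ACCESS_REMOTE_READ |
    IBV_ACCESS_REMOTE_WRITE);

The mr->lkey and mr->rkey values in the returned memory region identify it in local and remote work requests, respectively.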
Creating a completion queue
ibv_create_cq() creates a completion queue for an RDMA device context.
An example is given below:
struct ibv_cq *cq;

/* Arguments: device context, minimum number of CQ entries (100 here),
   user-defined CQ context, completion channel and completion vector */
cq = ibv_create_cq(context, 100, NULL, NULL, 0);
if (!cq) {
    fprintf(stderr, "Error, ibv_create_cq() failed\n");
    return -1;
}
Creating a queue pair
ibv_create_qp() creates a queue pair associated with a protection domain.
An example is given below:
struct ibv_pd *pd;
struct ibv_cq *cq;
struct ibv_qp *qp;
struct ibv_qp_init_attr qp_init_attr;

memset(&qp_init_attr, 0, sizeof(qp_init_attr));
qp_init_attr.send_cq = cq;
qp_init_attr.recv_cq = cq;
qp_init_attr.qp_type = IBV_QPT_RC; /* reliable connected */
qp_init_attr.cap.max_send_wr = 2;
qp_init_attr.cap.max_recv_wr = 2;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;

qp = ibv_create_qp(pd, &qp_init_attr);
if (!qp) {
    fprintf(stderr, "Error, ibv_create_qp() failed\n");
    return -1;
}
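The outline's step 8, bringing the queue pair to a ready-to-send state, is done with ibv_modify_qp(), which moves a newly created QP through the INIT and RTR (ready-to-receive) states to RTS. A minimal sketch for an RC queue pair follows; remote_qpn, remote_lid and remote_psn are assumed to have been exchanged with the peer out of band (over a TCP socket, for example), local_psn is the packet sequence number this side advertised, and the MTU, timeout and retry values are illustrative only:

struct ibv_qp_attr attr;

/* RESET -> INIT */
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_INIT;
attr.pkey_index = 0;
attr.port_num = port;
attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE;
if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_PKEY_INDEX |
                  IBV_QP_PORT | IBV_QP_ACCESS_FLAGS))
    return -1;

/* INIT -> RTR: needs the remote side's address and QP number */
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = IBV_MTU_1024;
attr.dest_qp_num = remote_qpn;
attr.rq_psn = remote_psn;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer = 12;
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = remote_lid;
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num = port;
if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU |
                  IBV_QP_DEST_QPN | IBV_QP_RQ_PSN |
                  IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER))
    return -1;

/* RTR -> RTS: the QP can now post sends */
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTS;
attr.timeout = 14;
attr.retry_cnt = 7;
attr.rnr_retry = 7;
attr.sq_psn = local_psn;
attr.max_rd_atomic = 1;
if (ibv_modify_qp(qp, &attr, IBV_QP_STATE | IBV_QP_TIMEOUT |
                  IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN |
                  IBV_QP_MAX_QP_RD_ATOMIC))
    return -1;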
Creating an address vector
ibv_create_ah() creates an address handle associated with a protection domain. An address handle describes the path to a remote port and is used when posting sends on unreliable datagram (UD) queue pairs.
Here is an example of how this is done:
struct ibv_pd *pd;
struct ibv_ah *ah;
struct ibv_ah_attr ah_attr;

memset(&ah_attr, 0, sizeof(ah_attr));
ah_attr.is_global = 0;      /* LID-based routing within the subnet, no GRH */
ah_attr.dlid = dlid;        /* destination LID */
ah_attr.sl = sl;            /* service level */
ah_attr.src_path_bits = 0;
ah_attr.port_num = port;

ah = ibv_create_ah(pd, &ah_attr);
if (!ah) {
    fprintf(stderr, "Error, ibv_create_ah() failed\n");
    return -1;
}
Posting work requests
ibv_post_send() posts a linked list of work requests to the send queue of a queue pair.
Here is an example:
struct ibv_sge sg;
struct ibv_send_wr wr;
struct ibv_send_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;
wr.opcode = IBV_WR_SEND;
wr.send_flags = IBV_SEND_SIGNALED; /* generate a work completion for this request */

if (ibv_post_send(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_send() failed\n");
    return -1;
}
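On the receiving side, a SEND consumes a receive work request, so the peer must have posted one in advance with the counterpart verb, ibv_post_recv(). A minimal sketch mirroring the send example above, where buf_addr, buf_size and mr are the receiver's own buffer and memory region:

struct ibv_sge sg;
struct ibv_recv_wr wr;
struct ibv_recv_wr *bad_wr;

memset(&sg, 0, sizeof(sg));
sg.addr = (uintptr_t)buf_addr;
sg.length = buf_size;
sg.lkey = mr->lkey;

memset(&wr, 0, sizeof(wr));
wr.wr_id = 0;
wr.sg_list = &sg;
wr.num_sge = 1;

if (ibv_post_recv(qp, &wr, &bad_wr)) {
    fprintf(stderr, "Error, ibv_post_recv() failed\n");
    return -1;
}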
Polling for completion
ibv_poll_cq() polls work completions from a completion queue.
An example is given below:
struct ibv_wc wc;
int num_comp;

do {
    num_comp = ibv_poll_cq(cq, 1, &wc);
} while (num_comp == 0);

if (num_comp < 0) {
    fprintf(stderr, "ibv_poll_cq() failed\n");
    return -1;
}

/* A completion may still carry an error status */
if (wc.status != IBV_WC_SUCCESS) {
    fprintf(stderr, "Failed status %s (%d) for wr_id %d\n",
        ibv_wc_status_str(wc.status), wc.status, (int)wc.wr_id);
    return -1;
}
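This covers step 11 of the outline; step 12, cleaning up, releases the resources in the reverse order of their creation. A minimal sketch, assuming all the objects from the previous sections were created successfully:

ibv_destroy_qp(qp);             /* queue pair */
ibv_destroy_ah(ah);             /* address handle */
ibv_destroy_cq(cq);             /* completion queue */
ibv_dereg_mr(mr);               /* memory region */
ibv_dealloc_pd(pd);             /* protection domain */
ibv_close_device(ctx);          /* device context */
ibv_free_device_list(dev_list); /* device list from the first step */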