IO-TCP: Efficient Content Delivery with SmartNIC

Content Delivery Systems for Video Streaming

Modern content delivery systems consist of a large number of geographically distributed Web or reverse-proxy servers. These systems serve as the basis for many applications such as Web page access and video streaming. Among these applications, video traffic has grown rapidly, accelerated by the COVID-19 pandemic, and now accounts for about 80% of all Internet traffic.

CPU as a Major Source of Bottleneck

This trend is shifting the major bottleneck of today’s content delivery systems from disk I/O to the memory subsystem. Not only can the server exploit sequential disk reads to maximize disk throughput for large-object access such as video download, but the advent of inexpensive large RAM and flash-based disks (e.g., NVMe SSDs) also removes seek-induced limitations. In particular, for I/O-intensive applications like video content delivery, over 70% of CPU cycles are spent on simple I/O operations. To effectively harness the recent advances in I/O devices, the current program structure must reduce its dependency on the CPU and its memory subsystem for I/O operations.

Opportunities with SmartNIC

The key idea of our work is to offload data I/O from the CPU to a programmable I/O device while supporting TCP-based content delivery. Any programmable device that can perform direct disk I/O and network packet I/O can meet this goal, but we use a SmartNIC as it is a convenient place to interact with remote clients. However, naively running the entire server on the SmartNIC would use its resources inefficiently, as its processors and memory are much less powerful than those of the host system.

The architecture of the BlueField SmartNIC

IO-TCP: Offloading TCP data plane to SmartNIC

The key design choice of IO-TCP is to separate the control and data planes of the TCP stack: the CPU stack retains full control of every operation (control plane) while individual I/O operations (data plane) are offloaded to the SmartNIC stack. The rationale is to spare the host CPU the majority of cycles otherwise spent on I/O operations while keeping the SmartNIC stack simple to implement.

The overview of IO-TCP stacks
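To make this split concrete, the sketch below shows one plausible shape for the per-send command that the host-side (control-plane) stack hands to the NIC-side (data-plane) stack. The field layout is purely illustrative and is not IO-TCP's actual host-NIC interface; it only conveys that the host decides what to send while the NIC fetches and transmits the bytes.

    #include <stdint.h>

    /* Illustrative only: one host-to-NIC command for an offloaded send.
     * The host stack keeps all TCP state (sequence numbers, congestion
     * window, retransmission timers) and only tells the NIC stack which
     * bytes of which file to read from disk, packetize, and transmit. */
    struct offload_send_cmd {
        uint32_t conn_id;      /* TCP connection this payload belongs to      */
        uint32_t file_id;      /* handle from an earlier file-open offload    */
        uint64_t file_offset;  /* byte offset of the payload within the file  */
        uint32_t length;       /* number of payload bytes to send             */
        uint32_t tcp_seq;      /* starting sequence number chosen by the host */
        uint8_t  use_tls;      /* nonzero: encrypt on the NIC's crypto engine */
    };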

Control-plane functions typically require complex state management, as their behavior depends on responses from the other end. Data-plane operations refer to all operations that involve data packet preparation and transfer, and they support the implementation of the control-plane functions. IO-TCP offloads only the send-path operations because they are simple, stateless, and easily parallelizable.
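From the application's point of view, the offloaded send path then looks roughly like the sketch below. The offload_open()/offload_write()/offload_close() prototypes and host_send() are hypothetical stand-ins for IO-TCP's API (please see the paper for the real interface); the point is that the host transmits only the small HTTP header itself and delegates the file payload entirely to the NIC stack.

    #include <stddef.h>
    #include <stdio.h>
    #include <sys/types.h>

    /* Assumed prototypes, for illustration only. */
    ssize_t host_send(int sock, const void *buf, size_t len);      /* host TX   */
    int     offload_open(int sock, const char *path);              /* NIC open  */
    ssize_t offload_write(int sock, int fid, off_t off, size_t n); /* NIC send  */
    int     offload_close(int sock, int fid);                      /* NIC close */

    /* Serve a file whose contents never touch host memory: the host emits
     * the HTTP header, then asks the NIC stack to read and send the body. */
    static int serve_file(int sock, const char *path, size_t file_size)
    {
        char hdr[128];
        int n = snprintf(hdr, sizeof(hdr),
                         "HTTP/1.1 200 OK\r\nContent-Length: %zu\r\n\r\n",
                         file_size);
        if (n < 0 || host_send(sock, hdr, (size_t)n) < 0)
            return -1;

        int fid = offload_open(sock, path);             /* opened on the NIC */
        if (fid < 0)
            return -1;
        if (offload_write(sock, fid, 0, file_size) < 0) /* NIC reads + sends */
            return -1;
        return offload_close(sock, fid);
    }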

Performance

The figure below shows that lighttpd on IO-TCP achieves 78.1 Gbps with a single CPU core on the host side for plaintext transfer, which demonstrates that one CPU core is sufficient to handle the control-plane operations for all 1600 clients. In contrast, Linux TCP does not exceed 57 Gbps even with 10 CPU cores, which shows that memory bandwidth is used inefficiently despite a zero-copy API like sendfile().

Comparison of lighttpd throughput over a varying number of CPU cores, serving 500 KB files.
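For reference, the Linux baseline's zero-copy send path boils down to a sendfile(2) loop like the simplified one below (blocking sockets assumed, error handling trimmed). Even without user-space copies, every payload byte still travels from disk into host memory and back out to the NIC, which is where the memory-bandwidth pressure comes from.

    #include <fcntl.h>
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <unistd.h>

    /* Send an entire file over a connected socket with sendfile(2).
     * Zero-copy in the sense that user space never touches the payload,
     * but the kernel still pulls the data from disk into the page cache
     * and pushes it out to the NIC, consuming host memory bandwidth. */
    static int send_whole_file(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t off = 0;
        while (off < st.st_size) {
            ssize_t n = sendfile(sock, fd, &off, (size_t)(st.st_size - off));
            if (n <= 0) {          /* error or no progress: give up */
                close(fd);
                return -1;
            }
        }
        close(fd);
        return 0;
    }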

The next figures compare performance for different file sizes. IO-TCP outperforms Linux TCP by 38% to 51% while using 2x to 10x fewer CPU cores to reach peak performance. For TLS transfer, IO-TCP sees little performance degradation thanks to the dedicated crypto hardware on the NIC. In contrast, Linux TCP achieves only 37.4 Gbps even with 10 CPU cores, as main memory bandwidth becomes the bottleneck.

Plaintext performance for varying file sizes.
TLS performance for varying file sizes.

Please read the paper for more details on the server settings and for additional experiments.

Publications

Rearchitecting the TCP Stack for I/O-Offloaded Content Delivery
Taehyun Kim, Deondre Martin Ng, Junzhi Gong, Youngjin Kwon, Minlan Yu and KyoungSoo Park
In Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’23)
April 2023