P4: Load Balancing

At a conceptual level, load balancing can be seen as follows: take all incoming packets and distribute them among the available resources in the network, typically either network paths or back-end servers. In practice, any approach to load balancing must address several challenges:

  • Tracking the state needed to ensure that the load is balanced, using the limited resources available on network switches.

  • Ensuring flow affinity, that is, ensuring that packets in the same flow traverse the same path or are delivered to the same back-end server.

Hash-Based Load Balancing

A simple and widely-deployed approach to load balancing is to rely on hashing. This avoids all issues having to do with maintaining state, and automatically ensures flow affinity. The idea is to hash incoming packets to one of N buckets, each of which corresponds to a particular forwarding behavior. For example, if the goal is to balance load in the network itself, the buckets could represent the end-to-end paths between the source and the destination. More commonly, however, the buckets represent the set of next hops toward the destination – e.g., Equal Cost Multipath Routing (ECMP) uses this idea.

Example

The following example code shows an implementation of ECMP-style load balancing in P4.

The set_ecmp_select action groups packets into one of 4 buckets by hashing a 5-tuple:

action set_ecmp_select() {
    // Hash the 5-tuple into meta.ecmp_select.
    hash(meta.ecmp_select,
        HashAlgorithm.crc16,
        16w0,                  // base offset for the result
        { hdr.ipv4.srcAddr,
          hdr.ipv4.dstAddr,
          hdr.ipv4.protocol,
          hdr.tcp.srcPort,
          hdr.tcp.dstPort },
        16w4);                 // number of buckets: result is in [0, 4)
}

This action stores the selected group in the ecmp_select metadata field – in this instance, an unsigned integer between 0 and 3 (inclusive).
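
These examples assume a user-defined metadata struct along the following lines; the field name matches the code above, but the exact width is an assumption (anything wide enough to hold the bucket index works):

struct metadata {
    bit<16> ecmp_select;  // ECMP bucket computed by set_ecmp_select
}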

Next, the ecmp_nhop table keys on the group and sets the egress port for the next hop:

action set_nhop(bit<9> port) {
    // Forward the packet out of the given port.
    standard_metadata.egress_spec = port;
}

table ecmp_nhop {
    key = {
        meta.ecmp_select: exact;
    }
    actions = {
        drop;
        set_nhop;
    }
    size = 4;
}
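
For completeness, here is a minimal sketch of an ingress control that ties these pieces together, assuming the v1model architecture and a drop action defined elsewhere; the parser and deparser are omitted:

control MyIngress(inout headers hdr,
                  inout metadata meta,
                  inout standard_metadata_t standard_metadata) {
    /* set_ecmp_select, set_nhop, and ecmp_nhop as defined above */
    apply {
        if (hdr.ipv4.isValid() && hdr.tcp.isValid()) {
            set_ecmp_select();  // hash the 5-tuple into a bucket
            ecmp_nhop.apply();  // map the bucket to an egress port
        }
    }
}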

Load Imbalance

As with any hash-based scheme, collisions are a concern in load-balancing protocols. For example, if several large “elephant” flows get mapped to the same forwarding path, there can be excessive congestion on that path, even though the network has sufficient capacity to route all of the flows. Current approaches to addressing ECMP imbalance typically attempt to adjust the hash functions or the fields being hashed at each hop to eliminate collisions – a manual and error-prone strategy.

Stateful Load Balancing

Another approach to load balancing, which can be implemented using programmable data planes such as P4, is to monitor the load and map flows to paths in a way that avoids excessive congestion on any one path.

This approach was pioneered in the context of data centers by CONGA, which was implemented in a custom chip developed by Cisco. More recent work on HULA showed how to develop a practical load-balancing implementation using P4-programmable switches.

Overview

At a high level, the main conceptual innovation in HULA is to maintain hop-by-hop rather than network-wide state. In particular, rather than having to track utilization information about every path in the network, each switch only has to maintain local information about the utilization of the paths through each of its next hops, which dramatically decreases resource requirements, especially in large networks.

HULA provides two main pieces of functionality:

  • Continuously monitoring the health and utilization of the links in the network topology using special probe packets.

  • Forwarding packets using stateful load balancing applied at the granularity of flowlets.

Probes

The header type for HULA probes can be defined in P4 as follows:

header hula_header {
  bit<24> dst_tor;
  bit<9> path_util;
}

Each top-of-rack (ToR) switch periodically sends a probe, which is multicast to all other switches. Some logic is needed to ensure that probes do not loop, resulting in a packet storm – either by exploiting topological structure (e.g., in data centers) or by relying on a time-to-live field.

The dst_tor field encodes the identifier of the ToR switch that originated the probe, while the path_util field encodes the bottleneck (maximum) link utilization observed along the reverse path from the destination to the current switch.

Each switch maintains two pieces of state, represented concretely as P4 registers, each indexed by destination ToR (see the sketch after this list):

  • best_hop: the next hop along the current best path
  • min_path_util: the bottleneck utilization along the current best path
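
A minimal sketch of how this state might be declared in P4, assuming the v1model architecture; the array size MAX_TORS is an illustrative constant, and the value widths match the probe header above:

const bit<32> MAX_TORS = 512;  // assumed number of ToR switches

register<bit<9>>(MAX_TORS) min_path_util;  // bottleneck utilization of the best path
register<bit<9>>(MAX_TORS) best_hop;       // egress port toward the best path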

Upon receiving a probe, the switch updates this state by comparing the utilization carried in the probe against the values currently stored in its registers.

Flowlets

To ensure flow affinity, HULA relies on the notion of a flowlet: a burst of packets from the same flow that is separated from other bursts by a large time gap. Flowlets have the property that they are relatively easy to track in an efficient data plane implementation. Moreover, because the gap between flowlets exceeds the difference in latency between alternative paths, routing different flowlets of the same flow along different paths does not cause reordering, which would hurt the performance of transport protocols such as TCP.

HULA incorporates standard logic for grouping incoming packets into flowlets (using hashing and timestamps), and maintains a flowlet table that stores forwarding information about currently active flowlets:

+-------------+------------+-------+
| Flowlet_ID  | Next_Hop   | Time  |
+-------------+------------+-------+
| ...         | ...        | ...   |
+-------------+------------+-------+
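
One plausible P4 encoding of this table is a pair of register arrays indexed by a hash of the flow's 5-tuple (so the Flowlet_ID is implicit in the index); the size is an illustrative assumption:

const bit<32> FLOWLET_TABLE_SIZE = 1024;  // assumed number of flowlet slots

register<bit<9>>(FLOWLET_TABLE_SIZE)  flowlet_hop;   // Next_Hop column
register<bit<48>>(FLOWLET_TABLE_SIZE) flowlet_time;  // Time column (last-seen timestamp)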

Forwarding Logic

Using these data structures, the overall logic for HULA can be summarized as follows:

For probe packets: if the path_util field in the probe is less than the currently stored value of min_path_util for its dst_tor, then the switch updates min_path_util to path_util and sets best_hop to the probe's ingress port.
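
In P4, this probe-handling logic might look as follows inside the ingress apply block, using the registers declared earlier; the header instance name hdr.hula is an assumption:

if (hdr.hula.isValid()) {
    // Read the bottleneck utilization of the current best path to dst_tor.
    bit<9> util;
    min_path_util.read(util, (bit<32>) hdr.hula.dst_tor);
    if (hdr.hula.path_util < util) {
        // The probe advertises a better path: adopt it.
        min_path_util.write((bit<32>) hdr.hula.dst_tor, hdr.hula.path_util);
        best_hop.write((bit<32>) hdr.hula.dst_tor,
                       standard_metadata.ingress_port);
    }
}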

For data packets: if the timer for the current flowlet has expired, then set the next hop using the current value of the best_hop register and record it in the flowlet table. Otherwise, set the next hop using the entry already stored in the flowlet table.
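
A corresponding sketch for data packets, reusing the flowlet registers above; the flowlet timeout FLOWLET_GAP and the meta.dst_tor field (assumed to be set by an earlier routing lookup) are illustrative:

const bit<48> FLOWLET_GAP = 48w50000;  // assumed timeout, in timestamp units

// Inside the ingress apply block, for non-probe packets:
bit<32> flowlet_id;
hash(flowlet_id, HashAlgorithm.crc32, 32w0,
     { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol,
       hdr.tcp.srcPort, hdr.tcp.dstPort },
     FLOWLET_TABLE_SIZE);

bit<48> last_seen;
flowlet_time.read(last_seen, flowlet_id);
if (standard_metadata.ingress_global_timestamp - last_seen > FLOWLET_GAP) {
    // Timer expired: start a new flowlet on the current best path.
    bit<9> nhop;
    best_hop.read(nhop, (bit<32>) meta.dst_tor);
    flowlet_hop.write(flowlet_id, nhop);
}
bit<9> out_port;
flowlet_hop.read(out_port, flowlet_id);
standard_metadata.egress_spec = out_port;
flowlet_time.write(flowlet_id, standard_metadata.ingress_global_timestamp);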

Reading

For further details on CONGA and HULA, including an evaluation of their performance on realistic traffic patterns, see the original research papers describing each system.