P4: Load Balancing
At a conceptual level, load balancing can be seen as follows: take all incoming packets and distribute them among the available resources in the network, typically either network paths or back-end servers. In practice, any approach to load balancing must address several challenges:
- Tracking the state needed to ensure that the load is balanced, using the limited resources available on network switches.
- Ensuring flow affinity, that is, that packets in the same flow traverse the same path or are delivered to the same back-end server.
Hash-Based Load Balancing
A simple and widely-deployed approach to load balancing is to rely on hashing. This avoids all issues having to do with maintaining state, and automatically ensures flow affinity. The idea is to hash incoming packets to one of N buckets, each of which corresponds to a particular forwarding behavior. For example, if the goal is to balance load in the network itself, the buckets could represent the end-to-end paths between the source and the destination. More commonly, however, the buckets represent the set of next hops toward the destination – e.g., Equal-Cost Multipath (ECMP) routing uses this idea.
Example
The following example code shows an implementation of ECMP-style load balancing in P4. The set_ecmp_select action groups packets into one of 4 buckets by hashing a 5-tuple:
action set_ecmp_select() {
    // Hash the 5-tuple into one of 4 buckets (0..3), storing the
    // result in the ecmp_select metadata field.
    hash(meta.ecmp_select,
         HashAlgorithm.crc16,
         16w0,
         { hdr.ipv4.srcAddr,
           hdr.ipv4.dstAddr,
           hdr.ipv4.protocol,
           hdr.tcp.srcPort,
           hdr.tcp.dstPort },
         16w4);
}
This action stores the selected group in the ecmp_select metadata field – in this instance, an unsigned integer between 0 and 3 (inclusive).
Next, the ecmp_nhop table keys on the group and sets the egress port for the next hop:
action set_nhop(bit<9> port) {
    // Forward the packet out the given port.
    standard_metadata.egress_spec = port;
}

table ecmp_nhop {
    key = {
        meta.ecmp_select: exact;
    }
    actions = {
        drop;
        set_nhop;
    }
    size = 4;
}
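To exercise this code on the bmv2 reference switch, the control plane must install one entry in ecmp_nhop per bucket. A minimal sketch using simple_switch_CLI, assuming the four next hops are reachable via ports 1 through 4 (the port numbers are illustrative):

table_add ecmp_nhop set_nhop 0 => 1
table_add ecmp_nhop set_nhop 1 => 2
table_add ecmp_nhop set_nhop 2 => 3
table_add ecmp_nhop set_nhop 3 => 4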
Load Imbalance
As with any hash-based scheme, collisions are a concern in load balancing protocols. For example, if several large “elephant” flows get mapped to the same forwarding path, there can be excessive congestion on that path, even though the network has sufficient capacity to route all of the flows. Current approaches to addressing ECMP imbalance typically attempt to adjust the hash functions or the fields being hashed at each hop to eliminate collisions – a manual and error-prone strategy.
Stateful Load Balancing
Another approach to load balancing, which can be implemented on programmable data planes using languages such as P4, is to monitor the load and map flows to paths in a way that avoids excessive congestion on any one path.
This approach was pioneered in the context of data centers by CONGA, and was implemented in a custom chip developed by Cisco. More recent work on HULA showed how to develop a practical load balancing implementation using P4-programmable switches.
Overview
At a high level, the main conceptual innovation in HULA is to maintain hop-by-hop rather than network-wide state. In particular, rather than having to track utilization information about every path in the network, each switch only has to maintain local information about the utilization along each next hop, which dramatically decreases resource requirements, especially in large networks.
HULA provides two main pieces of functionality:
- Continuously monitoring the health and state of the links in the network topology using special probe packets.
- Forwarding packets using stateful load balancing applied at the granularity of flowlets.
Probes
The header type for HULA probes can be defined in P4 as follows:
header hula_header {
    bit<24> dst_tor;
    bit<9>  path_util;
}
Each ToR switch periodically sends a probe, which is multicast to all other switches. Some logic is needed to ensure that probes do not loop, resulting in a packet storm – either by exploiting topological structure (e.g., in data centers) or by relying on a time-to-live field.
The dst_tor field encodes the identifier of the top-of-rack (ToR) switch that originated the probe, while the path_util field encodes the min-max utilization observed along the reverse path from the destination to the switch.
Each switch maintains two pieces of state, represented concretely in P4 registers, each indexed by destination ToR:
- best_hop: the next hop along the current best path
- min_path_util: the current min-max utilization along the best path
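These registers could be declared in P4_16 as follows. This is a minimal sketch assuming the v1model architecture, 9-bit utilization values matching the probe header above, and an assumed constant NUM_TORS giving the number of ToR switches:

const bit<32> NUM_TORS = 256;  // assumed topology size

register<bit<9>>(NUM_TORS) min_path_util;  // best known utilization per dst ToR
register<bit<9>>(NUM_TORS) best_hop;       // egress port of best path per dst ToR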
Upon receipt of a probe, the switch updates its state using its local state and the data carried within the probe itself.
Flowlets
To ensure flow affinity, HULA relies on the notion of a flowlet: a group of packets with related headers not separated by a large time gap. Flowlets have two useful properties: they are relatively easy to track in an efficient data plane implementation, and because consecutive flowlets are separated by a sufficiently large gap, they can be routed along different paths without reordering packets, which would hurt the performance of transport protocols such as TCP.
HULA incorporates standard logic for grouping incoming packets into
flowlets (using hashing and timestamps), and maintains a flowlet
table that stores forwarding information about currently active
flowlets:
+------------+----------+------+
| Flowlet_ID | Next_Hop | Time |
+------------+----------+------+
| ...        | ...      | ...  |
+------------+----------+------+
Forwarding Logic
Using these data structures, the overall logic for HULA can be summarized as follows:
For probe packets: if the path_util field in the packet is less than the currently stored value of min_path_util for the dst_tor, then the switch updates min_path_util to path_util and best_hop to the ingress port.
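A minimal P4_16 sketch of this probe-processing step, using the registers declared above (assuming v1model, and that hdr.hula carries the probe header); this would appear in a control’s apply block:

if (hdr.hula.isValid()) {
    bit<32> idx = (bit<32>)hdr.hula.dst_tor;
    bit<9>  stored_util;
    min_path_util.read(stored_util, idx);
    if (hdr.hula.path_util < stored_util) {
        // The probe advertises a better path: record its utilization
        // and remember the port it arrived on as the best next hop.
        min_path_util.write(idx, hdr.hula.path_util);
        best_hop.write(idx, standard_metadata.ingress_port);
    }
}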
For data packets: if the timer for the current flowlet has expired, then set the next hop using the current value of the best_hop register. Otherwise, set the next hop using the data already stored in the flowlet table.
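A corresponding sketch of the data-packet logic, modeling the flowlet table as a pair of registers indexed by a hash of the 5-tuple. The table size, the timeout constant, and the meta.dst_tor field (the packet’s destination ToR, assumed to be computed earlier in the pipeline) are all illustrative assumptions:

const bit<48> FLOWLET_TIMEOUT = 48w100000;  // assumed gap threshold (microseconds)

register<bit<9>>(1024)  flowlet_hop;   // cached next hop per flowlet slot
register<bit<48>>(1024) flowlet_time;  // last-seen timestamp per flowlet slot

apply {
    // Map the packet to a flowlet slot by hashing its 5-tuple.
    bit<32> fid;
    hash(fid, HashAlgorithm.crc16, 32w0,
         { hdr.ipv4.srcAddr, hdr.ipv4.dstAddr, hdr.ipv4.protocol,
           hdr.tcp.srcPort, hdr.tcp.dstPort },
         32w1024);

    bit<48> last_seen;
    flowlet_time.read(last_seen, fid);

    bit<9> hop;
    if (standard_metadata.ingress_global_timestamp - last_seen > FLOWLET_TIMEOUT) {
        // Gap exceeded: start a new flowlet on the current best path.
        best_hop.read(hop, (bit<32>)meta.dst_tor);
        flowlet_hop.write(fid, hop);
    } else {
        // Same flowlet: reuse the cached next hop to avoid reordering.
        flowlet_hop.read(hop, fid);
    }
    flowlet_time.write(fid, standard_metadata.ingress_global_timestamp);
    standard_metadata.egress_spec = hop;
}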
Reading
For further details on CONGA and HULA, including an evaluation of their performance on realistic traffic patterns, see the original research papers describing each system.