CS138c Project:

Reliable Multicast Protocols for LANs with Unreliable Hardware Multicast

Paul LeMahieu and Lihao Xu

Introduction

Parallel computing on clusters of workstations and personal computers is attractive because it leverages existing hardware and software. To support it, a set of collective communication operations is needed. Among these, multicast is especially important, since it forms the basis for many other collective operations. Multicast on clusters of workstations over local area networks can be made more efficient when the LAN provides a hardware multicast feature, which is often the case. Both distributed and parallel computing need multicast to be reliable at the user level, but a communication medium like a LAN is unreliable: packets may be lost, duplicated, or delivered out of order. Our goal is to develop a simple reliable multicast protocol for the user level on top of an unreliable communication network. We are most interested in results for bus-based LANs, such as Ethernet.

If you are interested, you can see our original project proposal or our original project presentation slides.

Communication Channel Model

We model the network as follows: So, in general, our model is not limited to that of the bus-based LAN. Any network that can provide a true hardware multicast fits our model, and we choose to look at bus-based networks in particular because of their popular use.

Definition of Multicast Reliability

Here we give a weak definition for multicast reliability, i.e., reliability of single source multicast:
At the user level, all packets from a source will arrive at the destination(s) in the same order as they were sent, and the user sees no duplicate packets.
Note that we do not take into account the possibility of node failures. Our assumption is that individual nodes in the system are reliable, and it is only communications that are unreliable. Dealing with node fallibility in addition to unreliable channels is a much more difficult problem.

Design Options for Reliable Multicast

There are two ways to implement a reliable multicast protocol: we can first implement reliable point-to-point communication and then build on that, or we can implement a multicast protocol directly on top of unreliable channels.

Multicast with Reliable Point-to-Point

We can always implement reliable multicasting by first implementing reliable point-to-point communications. We can then multicast via a tree, with the root sending to two or more nodes in the multicast group and instructing each child node to further the multicast by acting as the root of a subset of the original group. Reliability is ensured by the reliability of the point-to-point communications. This is wasteful, however, in the presence of hardware broadcast capabilities.
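The tree-based scheme can be sketched as follows. This is only an illustration in Python, not the project's code: the function name `tree_multicast_sends`, the `fanout` parameter, and the way the group is dealt into subgroups are all assumptions, and each (sender, receiver) pair stands for one reliable point-to-point send.

```python
def tree_multicast_sends(root, members, fanout=2):
    """List the point-to-point sends of a tree multicast (hypothetical sketch).

    Each node hands the data to up to `fanout` children and makes every
    child responsible for a disjoint subset of the remaining members.
    """
    sends = []

    def forward(node, rest):
        if not rest:
            return
        k = min(fanout, len(rest))
        # Deal the remaining members into k subgroups; the first member
        # of each subgroup becomes a child and roots that subgroup.
        for j in range(k):
            subgroup = rest[j::k]
            child, remainder = subgroup[0], subgroup[1:]
            sends.append((node, child))
            forward(child, remainder)

    forward(root, list(members))
    return sends


if __name__ == "__main__":
    for sender, receiver in tree_multicast_sends(0, [1, 2, 3, 4, 5, 6]):
        print(sender, "->", receiver)
```

With a root and six receivers and a fanout of 2, the sketch produces six point-to-point sends, only two of which originate at the root. Each of them occupies the network separately, which is the cost that a hardware broadcast avoids.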

Multicast with Unreliable Hardware Broadcast

To take advantage of the bus-based nature of most LANs, we can use a hardware broadcast to send the data to the group, and then use a point-to-point acknowledgment scheme to ensure the reliability of the multicast. This is a very efficient way to do multicasts, especially since it lets us take advantage of the simple bus nature of networks such as Ethernet.

The benefits of this protocol are not limited to bus-based LANs, however. Any local area network providing hardware multicast should have a protocol layer of this nature. In fact, non-bus-based networks providing hardware multicast will probably perform better since the initial send is concurrent due to hardware (as with a bus-based network), but the point-to-point acknowledges don't suffer from the contention present in a bus-based network. Examples of this kind of LAN are switched Ethernet and ATM switches, which typically provide hardware broadcast as well as contention-free point-to-point. Acknowledgments can be collected in a tree to prevent the source node from being overwhelmed.

With this acknowledge tree, an efficient sliding window scheme can be used for the data packet acknowledges, as with the reliable unicast protocol. The only significant difference is that a packet counts as successfully delivered only after the source gets all acknowledges for the packet from its children in the tree.
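A minimal sketch of such a window at the source is given below, assuming Python and hypothetical names (`SenderWindow`, `on_ack`). Timers, retransmission, and the network itself are left out; the sketch only shows that a packet is marked ACKed once every child has acknowledged it, and that the window slides past the leading ACKed packets.

```python
class SenderWindow:
    """Sending window for the root of an ACK tree (illustrative sketch)."""

    def __init__(self, total_packets, window_size, children):
        self.total = total_packets
        self.size = window_size
        self.children = frozenset(children)
        self.base = 0    # sequence number of the oldest un-ACKed packet
        self.acks = {}   # seq -> set of children that have ACKed it

    def in_window(self):
        """Sequence numbers that may currently be in flight."""
        return range(self.base, min(self.base + self.size, self.total))

    def on_ack(self, seq, child):
        """Record an ACK and return the sequence numbers newly admitted."""
        if seq not in self.in_window() or child not in self.children:
            return []  # stale ACK or unknown sender: ignore
        self.acks.setdefault(seq, set()).add(child)
        old_end = min(self.base + self.size, self.total)
        # Slide past every leading packet that all children have ACKed.
        while self.base < self.total and self.acks.get(self.base) == self.children:
            del self.acks[self.base]
            self.base += 1
        new_end = min(self.base + self.size, self.total)
        return list(range(old_end, new_end))

    def done(self):
        """True once every packet of the message has been ACKed."""
        return self.base == self.total
```

For example, with a window of 2 and children `a` and `b`, ACKs for packet 1 alone do not move the window; once both children have also ACKed packet 0, the window slides by two and packets 2 and 3 become eligible for sending.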

A Reliable Multicast Protocol Design

Since the selective repeat sliding window scheme is an efficient reliable unicast protocol, we base our reliable multicast protocol on this scheme. The protocol works as follows:

  1. Each message has a message head which contains the multicast group members for this message and the size of this message in terms of packets. Each packet of the message has a packet sequence number indicating its order in the message.

  2. An N-ary ACK tree is created for each message according to the designated multicast group. The message source is the root of the tree. Upon receiving the multicast group information (which is contained in the message head), each receiver can find its location (its parent and children) in the acknowledge tree by some simple, deterministic calculations.

  3. The source uses a sending window with some predefined window size to send all the packets in the window. Each packet is associated with a timer of time_out time units. A packet in the window is regarded as ACKed only after the node collects all ACKs from its children. The source slides its sending window by i steps after all the first i packets in the window get ACKed, and i new packets are sent once the window advances.

  4. Upon receiving a time_out and/or a NACK signal for a packet in the sending window, the source resends the packet by multicast.

  5. Each receiver has a window with the same size as the sender's window. Upon receiving a packet, each receiver sends back an ACK to its parent once it receives all ACKs for the given packet from its children in the tree. If it receives a time_out signal for a packet and it has not received the ACKs from all its children, it sends back a NACK to the source (which is the root of the ACK tree). It can also send back a NACK to the source if it receives an ACK from a child for a packet which it itself has not yet received.

  6. A message is regarded as successfully sent by the source only when all its packets have been ACKed.
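Step 2 leaves the "simple, deterministic calculations" unspecified. One possible choice, shown here as an assumption rather than as the layout actually used, is the array layout of a complete N-ary tree: the source goes first, the other members follow in the order listed in the message head, and a node at index i has its parent at index (i - 1) // N. The function name `ack_tree_position` is hypothetical.

```python
def ack_tree_position(group, source, me, order):
    """Return (parent, children) of `me` in an N-ary ACK tree rooted at `source`.

    Assumed layout: the source sits at index 0, followed by the remaining
    group members in the order given in the message head.
    """
    nodes = [source] + [m for m in group if m != source]
    i = nodes.index(me)
    parent = None if i == 0 else nodes[(i - 1) // order]
    first_child = order * i + 1
    children = nodes[first_child:first_child + order]
    return parent, children


if __name__ == "__main__":
    group = ["A", "B", "C", "D", "E", "F"]
    for member in group:
        print(member, ack_tree_position(group, source="C", me=member, order=2))
```

Because every receiver sees the same message head, each one computes the same tree without any further communication. With order 1 the layout degenerates into a linear list, and with an order equal to the number of receivers every receiver is a direct child of the source.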

Here the NACK is only introduced for the efficiency of the protocol. Upon receiving a NACK, the source resends the NACKed packet by multicast instead of unicast based on the assumption that if one receiver didn't receive a packet, it is quite probable that many of the other receivers also didn't receive the packet. Note that we are conservative in our use of NACKs, to prevent NACK implosion at the source. This occurs when the source is overwhelmed by NACK messages. Thus, we don't NACK a previous packet if we simply get an out-of-order packet. We only NACK when we have some information indicating that other nodes did receive a packet that we (or our children) did not receive. This is still not a perfect method, since if a leaf node loses a packet it would still cause one NACK from each level in the tree (above the leaf) to reach the source.
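The conservative NACK policy can be illustrated with a small receiver-side sketch. The class `ReceiverNode` and its method names are hypothetical, and each handler simply returns the control messages the node would emit instead of sending them; windows and timers are omitted.

```python
class ReceiverNode:
    """ACK/NACK decisions of one node in the ACK tree (illustrative sketch)."""

    def __init__(self, children):
        self.children = frozenset(children)
        self.received = set()   # sequence numbers this node has received
        self.child_acks = {}    # seq -> set of children that have ACKed it

    def _maybe_ack(self, seq):
        # ACK to the parent only when this node and all of its children have the packet.
        if seq in self.received and self.child_acks.get(seq, set()) >= self.children:
            return [("ACK", seq)]
        return []

    def on_packet(self, seq):
        # A gap in the sequence numbers alone never triggers a NACK.
        self.received.add(seq)
        return self._maybe_ack(seq)

    def on_child_ack(self, seq, child):
        self.child_acks.setdefault(seq, set()).add(child)
        if seq not in self.received:
            # A child has a packet that this node is missing.
            return [("NACK", seq)]
        return self._maybe_ack(seq)

    def on_timeout(self, seq):
        # NACK only if some child has still not acknowledged the packet.
        if self.child_acks.get(seq, set()) != self.children:
            return [("NACK", seq)]
        return []
```

In this sketch a node that receives packet 3 before packets 0 to 2 stays silent about the gap. It emits a NACK for packet 0 only once a child acknowledges packet 0, or once its timer for that packet expires with child ACKs still outstanding.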

Correctness Proof of the Protocol

Our use of NACKs in the protocol is just for efficiency. It does not affect the correctness of the protocol. We assume in the proof that the receiver can differentiate one message from another during the communication period.

The correctness of the protocol can be proved in the following steps:

  1. If each packet in a given message has a unique sequence number, then each packet can be differentiated from other packets of the same message. This guarantees the receiver can properly order packets.

    Then, all packets will eventually reach all destinations. This is guaranteed by required acknowledgments from every destination, resending after time-outs, and the channel assumption that infinite sending implies eventual delivery. No receiver acknowledges unless it has received the packet, each ACK corresponds to its correct packet (by our unique sequence number assumption), and the sender keeps resending until all ACKs have been received.

  2. If each packet in a given message does not have a unique sequence number, then we must look at two cases:

So, with proper packet sequence numbering, all the packets of a message will be delivered to the user group members in the same order as they were sent from the sender.

Simulation of the Protocol with Mayc

We implemented the above protocol in Mayc (Refer to the source code). Each node has a user process, a sender process, and a receiver process. The sender process gets a message send request from its user process. It then sends the message to the receiver processes of all group members using the sliding window protocol. A receiver process receives packets from a sender process and sends back an ACK or NACK to the suitable sender or receiver process according to its place in the tree. When it detects that a complete message has been received, it passes the message to its user process.

For the delay from sender to receiver, we introduce:

The order of the ACK tree (N) and the size of the sending window are tunable.

Test 1

For the first test we did the following:

Notice that here we have a channel model where the delay for a message is primarily due to time spent on the channel, not local overhead.

Comments on these results:

Test 2

For the second test we did the following:

Notice that here we have a channel model where we've reversed the importance of channel delay and local processing delay. Now the delay in sending a message is primarily due to the time spent in local processing, not time spent transmitting on the channel.

Comments on these results:

Here we once again see results very similar to those of Test 1. By shifting where the delay in packet transmission is incurred, we had hoped to see a stronger relationship between tree order and protocol performance. We believe that to see one we would have to try larger group sizes and/or a greater ratio of Ts/Tc and Tr/Tc.

Test 3

For the third test we did the following:

Some test results are listed in the tables, and part of them are depicted in graphs.

Comments on these results:

Since this test was for a switched type network, the concurrency in acknowledge collection gives us a performance benefit that we really didn't see with tests 1 and 2. As expected, we can see a benefit from having a moderate tree order (trees of order 2, 3, 4) as opposed to extremes: order 1 (which is a linear list) or order 20 (which is just everyone ACKing back to the source).
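The trade-off between tree order and tree depth can be made concrete with a short calculation. It assumes 20 receivers (inferred from the remark that order 20 has everyone ACKing back to the source) and a complete N-ary tree laid out level by level with the source at index 0; both are assumptions, and the function name `ack_tree_depth` is hypothetical.

```python
def ack_tree_depth(num_receivers, order):
    """Number of ACK hops from the deepest receiver up to the source."""
    depth = 0
    i = num_receivers  # index of the last (deepest) receiver; the source is index 0
    while i > 0:
        i = (i - 1) // order  # step to the parent
        depth += 1
    return depth


if __name__ == "__main__":
    depths = {n: ack_tree_depth(20, n) for n in (1, 2, 3, 4, 20)}
    print(depths)  # {1: 20, 2: 4, 3: 3, 4: 2, 20: 1}
```

A low order means few ACKs per node but many sequential hops before the source sees the last ACK; a high order means few hops but many ACKs converging on one node. Under these assumptions, orders 2 to 4 keep both quantities small.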

Another extreme case is a window size of 1. This is the stop-and-wait (or Alternating Bit) scheme, which performs much worse than the selective repeat sliding window scheme, as seen in these test results.

From the test results, all but very small window sizes usually give similar performance. In practice, the window size cannot be too large, since the size of the buffers at the sender and receiver sides is essentially WINDOW_SIZE*PACKET_SIZE.

Also, from the tables, a rather small timer time_out gives better performance, as long as it is larger than the packet round-trip time in the channel, kTc (where k is determined by the height of the ACK tree and the ratios Ts/Tc and Tr/Tc).

Conclusions

A protocol that takes advantage of the inherent low-level broadcast/multicast capability of a network is almost certain to perform better than one that doesn't take this into account. The only concern is the unreliability of this low-level multicast, and the then necessary processing of acknowledges. If we process acknowledges distributively in an acknowledge tree, we can achieve more concurrency (depending on the blocking nature of our network and delay parameters), and distribute the load, since each node (sender or receiver) must only process at most N (the order of the ACK tree) acknowledges per packet. We prevent ACK-implosion at the source node, and achieve concurrency in acknowledge collection, hopefully increasing performance of the whole system. We say hopefully because overall performance is very dependent on transmission delays, local delays, and group sizes.

For non-blocking networks (e.g., switch-based LANs), the ACK tree can further help reduce communication interference (signal collisions) of the acks to different destinations, which gives more concurrency of the ack-collection than blocking networks, such as bus-based LANs. This is an additional benefit of the protocol for non-blocking networks.

In practical applications of the protocol, several parameters such as window size, ACK tree order, and timer time_out should be chosen carefully to match the real channel model, which can be mainly characterized by the channel delay and the ratio of channel delay to local processing overhead. A systematic way of modeling the channel and choosing the related protocol parameters remains to be studied. Our brief experience with this project has shown us that there are many inter-related parameters even in a system as simple as this, and setting them to optimal values is not trivial.

Future Work

References

  1. Afek, Y., et al. Reliable Communication Over Unreliable Channels. Journal of the ACM, Vol. 41, No. 6, November 1994.

  2. Chandran, S. R. and Lin, S. Selective-Repeat-ARQ Schemes for Broadcast Links. IEEE Transactions on Communications, Vol. 40, No. 1, January 1992.

  3. Tanenbaum, A. S. Computer Networks. Prentice-Hall, New Jersey, 1989.