Reliable Multicast Protocols for LANs with Unreliable
Hardware Multicast
Paul LeMahieu and Lihao Xu
Introduction
Parallel computing on clusters of workstations and personal computers
has very high potential, since it leverages existing hardware and
software. To support this, a set of collective communication operations is needed.
Among these, multicast is an important operation since it forms the base for
many other collective communications. Multicast on clusters
of workstations over local area networks can achieve additional efficiency
when the LAN provides a hardware multicast feature, which is often the case.
Reliability of multicast on the user level
is necessary for both distributed and parallel computing, but a communication
medium like a LAN is unreliable. Packets may be lost, duplicated, or
delivered out of order (overtaking).
Our goal is to develop a simple reliable multicast protocol for the user level
given some unreliable communication network. We are most interested in results for
bus-based LANs, such as Ethernet.
If you are interested, you can see our original
project proposal
or our original
project presentation slides.
Communication Channel Model
We model the network as follows:
- If a packet is sent by a source infinitely often, the packet will arrive
at the designated destination(s) eventually.
- The channel may delete arbitrary packets.
- Packets may be delayed in the channel for some time with some distribution
(i.e., overtaking can occur).
- The integrity of a packet's content is guaranteed once it arrives at its
destination(s).
- The channel can concurrently forward messages to multiple destinations,
with no guarantee of reliability.
So, in general, our model is not limited to that of the bus-based LAN.
Any network that can provide a true hardware multicast fits our model,
and we choose to look at bus-based networks in particular because of
their popular use.
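As an illustration only, the channel assumptions above can be captured in a small Python toy model. The class name `UnreliableChannel`, its parameters, and the choice to drop each receiver's copy independently are assumptions made for this sketch, not part of the original simulation.

```python
import random

class UnreliableChannel:
    """Toy channel: may drop packets and reorder them, but never corrupts them."""

    def __init__(self, loss_prob, max_delay, seed=None):
        self.loss_prob = loss_prob      # chance that one receiver's copy is deleted
        self.max_delay = max_delay      # upper bound on the random per-copy delay
        self.rng = random.Random(seed)
        self.in_flight = []             # entries are (arrival_time, destination, packet)

    def multicast(self, now, packet, destinations):
        """Forward one packet to all destinations concurrently, with no guarantees."""
        for dest in destinations:
            if self.rng.random() < self.loss_prob:
                continue                # the channel deleted this copy
            delay = self.rng.uniform(0, self.max_delay)
            self.in_flight.append((now + delay, dest, packet))

    def deliver_until(self, now):
        """Return (destination, packet) pairs that have arrived by time `now`."""
        arrived = sorted((e for e in self.in_flight if e[0] <= now),
                         key=lambda e: e[0])
        self.in_flight = [e for e in self.in_flight if e[0] > now]
        return [(dest, packet) for _, dest, packet in arrived]
```

Because each copy gets its own random delay, a later packet can arrive before an earlier one, which is the overtaking behavior the model allows.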
Definition of Multicast Reliability
Here we give a weak definition for multicast reliability, i.e., reliability of single source multicast:
- At the user level, all packets from a source will arrive at the
destination(s) in the same order as they were sent, and the user
sees no duplicate packets.
Note that we do not take into account the possibility of node failures.
Our assumption is that individual nodes in the system are reliable,
and it is only communications that are unreliable. Dealing with node
fallibility in addition to unreliable channels is a much more difficult
problem.
Design Options for Reliable Multicast
There are two ways to implement a reliable multicast protocol: we
can first implement reliable point-to-point, and then build on that,
or we can implement a multicast protocol directly on top of
unreliable channels.
Multicast with Reliable Point-to-Point
We can always implement reliable multicasting by first implementing
reliable point-to-point communications. We can then multicast via a tree,
with the root sending to two or more nodes in the multicast group and instructing
each child node to further the multicast by acting as the root
of a subset of the original group. Reliability is ensured by
the reliability of the point-to-point communications. This is wasteful,
however, in the presence of hardware broadcast capabilities.
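The tree-based scheme just described can be sketched as follows. This is a minimal sequential illustration: `send(src, dst)` is a hypothetical stand-in for a reliable unicast primitive, and the way the group is split into subsets is an arbitrary choice made for the example.

```python
def tree_multicast(root, group, fanout, send):
    """Forward a message down a tree using only point-to-point sends.

    `group` lists the members that still need the message (excluding `root`).
    `send(src, dst)` is assumed to be a reliable unicast operation.
    """
    if not group:
        return
    # Split the remaining members into at most `fanout` nearly equal subsets.
    k = min(fanout, len(group))
    subsets = [group[i::k] for i in range(k)]
    for subset in subsets:
        child, rest = subset[0], subset[1:]
        send(root, child)                           # root sends to one child ...
        tree_multicast(child, rest, fanout, send)   # ... which acts as root of its subset


# Example: node 0 multicasts to nodes 1..7 with a fan-out of 2.
log = []
tree_multicast(0, list(range(1, 8)), 2, lambda s, d: log.append((s, d)))
```

Every member receives the message exactly once, but the total number of unicast sends equals the group size, which is the cost that hardware broadcast avoids.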
Multicast with Unreliable Hardware Broadcast
To take advantage of the bus-based nature of most LANs, we can use
a hardware broadcast to send the data to the group, and then use
a point-to-point acknowledgment scheme to ensure the reliability of the
multicast. This is a very efficient way to do multicasts, especially
since it lets us take advantage of the simple bus nature of networks such
as Ethernet.
The benefits of this protocol are not limited to bus-based LANs, however.
Any local area network providing hardware multicast should have a
protocol layer of this nature. In fact, non-bus-based networks providing
hardware multicast will probably perform better since the initial send
is concurrent due to hardware (as with a bus-based network), but the
point-to-point acknowledges don't suffer from the contention present in
a bus-based network. Examples of this kind of LAN are switched Ethernet
and ATM switches, which typically provide hardware broadcast as well
as contention-free point-to-point. Acknowledgments can be collected
in a tree to prevent the source node from being overwhelmed.
With this acknowledge tree, some efficient sliding window
scheme can be used for the data packet acknowledges as with the
reliable unicast protocol. The only significant difference is that a packet
is successfully received only after the source gets all acknowledges for the
packet from its children in the tree.
A Reliable Multicast Protocol Design
Since the selective repeat sliding window scheme is an efficient reliable
unicast protocol, we base our reliable multicast protocol on this
scheme. The protocol works as follows:
- Each message has a message head which contains the multicast group
members for this message and the size of this message in terms of packets.
Each packet of the message has a packet sequence number indicating its
order in the message.
- An N-ary ACK tree is created for each message according to
the designated multicast group. The message source is the root of the tree.
Upon receiving the multicast group information (which is contained in the
message head), each receiver can find its location (its parent and children)
in the acknowledge tree by some simple, deterministic calculations.
- The source uses a sending window with some predefined window size to send
all the packets in the window. Each packet is associated with a timer of
time_out time units. A packet in the window is regarded as ACKed only after
the node collects all ACKs from its children. The source slides its
sending window by i steps after all the first i packets
in the window get ACKed, and i new packets are sent once
the window advances.
- Upon receiving a time_out and/or a NACK signal for a packet in the sending
window, the source resends the packet by multicast.
- Each receiver has a window with the same size as the sender's window.
Upon receiving a packet, each receiver sends back an ACK to its parent once
it receives all ACKs for the given packet from its children in the tree.
If it receives a time_out signal for a packet and it has not received
the ACKs from all its children, it sends back a NACK to the source
(which is the root of the ACK tree). It can also send back a NACK to the
source if it receives an ACK from a child for a packet which it itself
has not yet received.
- A message is regarded as successfully sent by the source only when all
its packets have been ACKed.
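The protocol does not spell out the "simple, deterministic calculations" by which a receiver finds its place in the ACK tree. One plausible choice, shown here purely as an assumption, is a heap-style layout over the member list carried in the message head, with the source listed first.

```python
def ack_tree_position(members, me, order):
    """Return (parent, children) of `me` in a heap-style N-ary ACK tree.

    `members` is the group list from the message head, source first.
    The heap layout is an assumption made for this sketch.
    """
    idx = members.index(me)
    parent = None if idx == 0 else members[(idx - 1) // order]
    first_child = order * idx + 1
    children = members[first_child:first_child + order]
    return parent, children


members = ["src", "a", "b", "c", "d", "e", "f"]
print(ack_tree_position(members, "a", 2))   # parent and children of node "a"
```

Since every node applies the same rule to the same member list, all nodes agree on the tree without exchanging any extra messages.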
Here the NACK is only introduced for the efficiency of the protocol. Upon receiving
a NACK, the source resends the NACKed packet by multicast instead of unicast
based on the assumption that if one receiver didn't receive a packet, it is quite
probable that many of the other receivers also didn't receive the packet.
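The source-side rules (a fixed-size window, a packet counting as ACKed only when all children have ACKed it, and a multicast resend on time-out or NACK) might be tracked as in the sketch below. The class and method names are hypothetical, timers are left to the caller, and the node is assumed to have at least one child.

```python
class SenderWindow:
    """Selective-repeat sending window; a packet needs ACKs from all children."""

    def __init__(self, num_packets, win_size, children):
        self.num_packets = num_packets
        self.win_size = win_size
        self.children = set(children)   # assumed non-empty
        self.base = 0                   # oldest packet not yet fully ACKed
        self.acks = {}                  # seq -> set of children that ACKed it

    def window(self):
        """Sequence numbers currently allowed to be outstanding."""
        return range(self.base, min(self.base + self.win_size, self.num_packets))

    def on_ack(self, seq, child):
        """Record an ACK; return the new packets to send if the window slides."""
        if seq not in self.window() or child not in self.children:
            return []                   # stale or unexpected ACK: ignore it
        self.acks.setdefault(seq, set()).add(child)
        old_end = self.base + self.win_size
        # Slide past every leading packet that all children have ACKed.
        while self.base < self.num_packets and self.acks.get(self.base) == self.children:
            del self.acks[self.base]
            self.base += 1
        new_end = self.base + self.win_size
        return list(range(min(old_end, self.num_packets),
                          min(new_end, self.num_packets)))

    def on_timeout_or_nack(self, seq):
        """Return the packet to re-multicast, or None if nothing needs resending."""
        fully_acked = self.acks.get(seq) == self.children
        return seq if seq in self.window() and not fully_acked else None

    def done(self):
        return self.base == self.num_packets
```

Note that a packet ACKed out of order does not move the window by itself; the window advances only when the oldest outstanding packets are complete.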
Note that we are conservative in our use of NACKs, to prevent NACK
implosion at the source. This occurs when the source is overwhelmed by
NACK messages. Thus, we don't NACK a previous packet if we simply get an
out-of-order packet. We only NACK when we have some information
indicating that other nodes did receive a packet that we (or our children)
did not receive. This is still not a perfect method, since if a leaf
node loses a packet it would still cause one NACK from each level
in the tree (above the leaf) to reach the source.
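The receiver-side ACK and NACK rules can be illustrated with the following sketch. It is a simplification under stated assumptions: the receive window and timers are omitted, the names are hypothetical, and each handler returns the control messages to emit as (kind, destination, sequence number) tuples.

```python
class ReceiverNode:
    """ACK/NACK decisions for one receiver in the ACK tree (window omitted)."""

    def __init__(self, children):
        self.children = set(children)
        self.received = set()           # sequence numbers this node holds
        self.child_acks = {}            # seq -> set of children that ACKed it

    def _ack_if_ready(self, seq):
        have_all = self.child_acks.get(seq, set()) == self.children
        if seq in self.received and have_all:
            return [("ACK", "parent", seq)]
        return []

    def on_packet(self, seq):
        """A data packet arrived (duplicates simply trigger the ACK again)."""
        self.received.add(seq)
        return self._ack_if_ready(seq)

    def on_child_ack(self, seq, child):
        self.child_acks.setdefault(seq, set()).add(child)
        if seq not in self.received:
            # A child has a packet that we lack: evidence that we lost it.
            return [("NACK", "source", seq)]
        return self._ack_if_ready(seq)

    def on_timeout(self, seq):
        """Timer fired: NACK only if some child has still not ACKed."""
        if self.child_acks.get(seq, set()) != self.children:
            return [("NACK", "source", seq)]
        return []
```

A node never NACKs merely because packets arrive out of order; it does so only on a time-out with missing child ACKs or when a child's ACK reveals a packet it has not seen.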
Correctness Proof of the Protocol
Our use of NACKs in the protocol is just for efficiency.
It does not affect the correctness of the protocol. We assume in
the proof that the receiver can differentiate one message from
another during the communication period.
The correctness of the protocol can be proved in the following steps:
- If each packet in a given message has a unique
sequence number, then each packet can be differentiated from other
packets of the same message. This guarantees the receiver can properly
order packets.
Then, all packets will eventually reach all destinations.
This is guaranteed by required acknowledgments from every destination,
resending after time-outs, and the channel assumption that infinite
sending implies eventual delivery. No receiver acknowledges unless it
has received the packet, each ACK corresponds to its correct packet
(by our unique sequence number assumption), and the sender keeps resending until all ACKs
have been received.
- If each packet in a given message does not have a unique
sequence number, then we must look at two cases:
- If there is no packet overtaking in the channel, which is true for
bus-type channels, then for a window size of WIN_SIZE, a latest
sequence number ACKed of i (from the point of view of the receiver),
and a next possible sequence number that can arrive at the receiver of j,
the following invariant holds for i and j:
j in [i-WIN_SIZE+1, i+WIN_SIZE]
Thus, as long as the range of the packet sequence numbers is not smaller than
2*WIN_SIZE, the receiver can order the packets properly.
Being able to order the packets properly is equivalent to saying that all
packets can be properly differentiated, and by the same argument as in the
unique-sequence-number case, all receivers will eventually receive all packets.
- If there is finite packet overtaking in the channel, i.e., each packet
can overtake at most OVERTAKE_NUM other packets, then with the above notation, the
following invariant holds for i and j:
j in [i-OVERTAKE_NUM-WIN_SIZE+1, i+WIN_SIZE]
Hence, as long as the range of the packet sequence numbers is not less than
2*WIN_SIZE+OVERTAKE_NUM, the receiver can order the packets properly.
Again, being able to order the packets properly is equivalent to saying that all
packets can be properly differentiated, and by the same argument as in the
unique-sequence-number case, all receivers will eventually receive all packets.
So, with proper packet sequence numbering, all the packets
of a message will be delivered to the user group members in the
same order as they were sent from the sender.
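To make the sequence-number argument concrete, the sketch below recovers the full sequence number j from its value modulo the sequence-number range, using the invariant that j lies in [i-OVERTAKE_NUM-WIN_SIZE+1, i+WIN_SIZE]. The function name and its parameters are introduced here for illustration only.

```python
def unwrap_seq(wire_seq, last_acked, win_size, overtake_num=0, modulus=None):
    """Recover the absolute sequence number j from j % modulus.

    Relies on the invariant
        last_acked - overtake_num - win_size + 1 <= j <= last_acked + win_size,
    which spans 2*win_size + overtake_num consecutive values.
    """
    needed = 2 * win_size + overtake_num
    if modulus is None:
        modulus = needed                # smallest range that still works
    if modulus < needed:
        raise ValueError("sequence number range too small to disambiguate")
    low = last_acked - overtake_num - win_size + 1
    # Smallest j >= low that is congruent to wire_seq modulo `modulus`.
    j = low + (wire_seq - low) % modulus
    if j > last_acked + win_size:
        raise ValueError("sequence number violates the window invariant")
    return j
```

With a range of at least 2*WIN_SIZE+OVERTAKE_NUM, each residue corresponds to at most one candidate inside the invariant interval, so the receiver can always place the packet correctly.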
Simulation of the Protocol with Mayc
We implemented the above protocol in Mayc
(refer to the source code).
Each node has a user process, a sender process, and a receiver process.
The sender process gets
a message send request from its user process. It then sends the message to the
receiver processes of all group members using the sliding window
protocol. A receiver process receives packets from a sender process and sends
back an ACK or NACK to the suitable sender or receiver process according to
its place in the tree. When it detects that a complete message has been received,
it passes the message to its user process.
For the delay from sender to receiver, we introduce:
- Ts: the local processing time for the sender to
prepare a packet for transmission. Note that this time is local,
and will not interfere with other nodes' ability to transmit or receive.
- Tc: the actual time spent for the sender to transmit
a packet on the channel. Note that during this period all other
nodes are blocked from transmitting (in a bus-based network).
In a switch-based network, only those wishing to transmit to the same
destination will be blocked during this time.
- Tr: the local processing time for the receiver to
process a received packet. Note that this time is local,
and will not interfere with other nodes' ability to transmit or receive.
The order of the ACK tree (N) and the size of the sending window are tunable.
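The three delay components can be combined as in the sketch below, written in Python rather than in the language of the original simulation. The second function rests on an assumption that is not stated in the text: on a loss-free bus, each data packet costs one multicast transmission plus one ACK transmission per receiver, and all of these serialize on the channel.

```python
import random

def one_way_delay(ts, tr, tc_range, rng):
    """Loss-free delay for one packet: Ts (prepare) + Tc (channel) + Tr (receive)."""
    return ts + rng.uniform(*tc_range) + tr

def bus_time_per_packet(num_receivers, tc_range, rng):
    """Bus occupancy for one data packet and its ACKs on a loss-free bus.

    Assumption: one multicast plus one ACK per receiver (each receiver ACKs
    once to its parent), with every transmission blocking the shared bus.
    """
    transmissions = 1 + num_receivers
    return sum(rng.uniform(*tc_range) for _ in range(transmissions))


rng = random.Random(0)
print(one_way_delay(5, 5, (10, 30), rng))
print(bus_time_per_packet(20, (10, 30), rng))
```

Under this assumption the number of ACK transmissions per packet does not depend on the tree order, only on the group size; the tree changes who receives the ACKs, not how many cross the bus.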
Test 1
For the first test we did the following:
- Source sends a 300 packet message
- Multicast group size: moderate (20 or 32)
- Channel delay: Tc is uniformly distributed on [10,30]
- Send and receive delay: Tr = Ts = 5
- A bus-based channel model (true broadcast, competing transmissions
block one another)
- The probability of a packet successfully being transmitted is
referred to as prob_no_error.
Notice that here we have a channel model where the delay for a message is
primarily due to time spent on the channel, not local overhead.
- Here's a performance plot as error probability and tree_order vary for
group_size = 32, win_size = 10, and time_out = 100.
- Here's a performance plot as error probability and win_size vary for
group_size = 20, tree_order = 3, and time_out = 100.
- Here's a performance plot as time_out and tree_order vary for
group_size = 32, win_size = 10, and prob_no_error = 0.95.
- Here's a performance plot as time_out and win_size vary for
group_size = 20, tree_order = 3, and prob_no_error = 0.95.
- Here's a performance plot as tree_order and win_size vary for
group_size = 20, time_out = 100, and prob_no_error = 0.95.
Comments on these results:
- Plots vs. error probability: we see that for moderately high probability of
transmission success, multicast time is a weak linear function of
error probability (for reasonable tree_order and win_size).
- Plots vs. time_out: for a fixed probability of error on the channel,
we see transmission time as a weak linear function of time_out value
(for reasonable tree_order and win_size).
- Plot of win_size and tree_order: we see that most of the gains
due to window size and tree order are quickly achieved.
Test 2
For the second test we did the following:
- Source sends a 300 packet message
- Multicast group size: moderate (20 or 32)
- Channel delay: Tc is uniformly distributed on [1,10]
- Send and receive delay: Tr = Ts = 15
- A bus-based channel model (true broadcast, competing transmissions
block one another)
- The probability of a packet successfully being transmitted is
referred to as prob_no_error.
Notice that here we have a channel model where we've reversed
the importance of channel delay and local processing delay. Now
the delay in sending a message is primarily due to the time spent
in local processing, not time spent transmitting on the channel.
- Here's a performance plot as error probability and tree_order vary for
group_size = 32, win_size = 10, and time_out = 100.
- Here's a performance plot as error probability and win_size vary for
group_size = 20, tree_order = 3, and time_out = 100.
- Here's a performance plot as time_out and tree_order vary for
group_size = 32, win_size = 10, and prob_no_error = 0.95.
- Here's a performance plot as time_out and win_size vary for
group_size = 20, tree_order = 3, and prob_no_error = 0.95.
- Here's a performance plot as tree_order and win_size vary for
group_size = 20, time_out = 100, and prob_no_error = 0.95.
Comments on these results:
Here we once again see results very similar to those of Test 1. By shifting
the delay from packet transmission to local processing, we
had hoped to see a stronger relationship between tree order and
protocol performance. We believe that seeing this relationship would require
larger group sizes and/or greater ratios Ts/Tc and Tr/Tc.
Test 3
For the third test we did the following:
- Source sends a 300 packet message
- Multicast group size: 20
- Channel delay: Tc is uniformly distributed on [10,30]
- Send and receive delay: Tr = Ts = 5
- A switch-based channel model (true broadcast, only those competing
to transmit to the same node will block)
Some test results are listed in the tables, and some
of them are depicted in graphs.
Comments on these results:
Since this test was for a switched type network,
the concurrency in acknowledge collection gives us a performance benefit
that we really didn't see with tests 1 and 2.
As expected, we can see a benefit from having a moderate tree order (trees
of order 2, 3, 4) as opposed to extremes: order 1 (which is a linear list)
or order 20 (which is just everyone ACKing back to the source).
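The two extremes can be seen by computing the height of the ACK tree for each order. The sketch below assumes a heap-style tree (source at index 0, node k having parent (k-1)//N) and a group of 20 receivers in addition to the source; both are assumptions, since the text does not fix the tree layout.

```python
def ack_tree_height(num_receivers, order):
    """Number of ACK hops from the deepest receiver up to the source."""
    depth, idx = 0, num_receivers       # index of the last (deepest) node
    while idx > 0:
        idx = (idx - 1) // order        # step to the parent
        depth += 1
    return depth


for n in (1, 2, 3, 4, 20):
    print(n, ack_tree_height(20, n))
```

Order 1 gives a chain of 20 hops, while order 20 gives a single hop but forces the source to handle all 20 ACKs itself; moderate orders keep both the height and the per-node load small.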
Another extreme case is a window size of 1.
This is the stop-and-wait (or Alternating Bit) sliding window scheme, which
performs much worse than the selective repeat sliding window scheme, as
seen in these test results.
From the test results, all but very small window sizes usually give similar performance.
In practice, the window size cannot be too large, since the size of the buffers at
sender and receiver sides is essentially WINDOW_SIZE*PACKET_SIZE.
Also, from the tables, a rather small time_out value gives better performance,
as long as it is larger than the packet round-trip time in the channel, kTc (where k is
determined by the height of the ACK tree and the ratios Ts/Tc and Tr/Tc).
Conclusions
A protocol that takes advantage of the inherent low-level broadcast/multicast
capability of a network is almost certain to perform better than one
that doesn't take this into account. The only concern is the unreliability
of this low-level multicast, and the processing of acknowledges that it
makes necessary. If we process acknowledges distributively in an
acknowledge tree, we can achieve more concurrency (depending on the blocking
nature of our network and delay parameters), and distribute the load, since
each node (sender or
receiver) needs to process at most N (the order of the ACK tree) acknowledges
per packet. We prevent ACK-implosion at the source node,
and achieve concurrency in acknowledge collection, hopefully increasing
performance of the whole system. We say hopefully because overall
performance is very dependent on transmission delays, local delays,
and group sizes.
For non-blocking networks (e.g., switch-based LANs), the ACK tree can further help
reduce communication interference (signal collisions) among
ACKs sent to different destinations, which gives more concurrency in
ACK collection than is possible on blocking networks such as bus-based
LANs. This is an additional benefit of the protocol for non-blocking
networks.
In practical applications of the protocol, several parameters such as window
size, ACK tree order, and timer time_out should be chosen carefully to match
the real channel, which is characterized mainly by the channel delay
and the ratio of channel delay to local processing overhead.
A systematic way of modeling the channel
and choosing the related protocol parameters remains to be studied.
Our brief experience with this project has shown us that there are many
inter-related parameters even in a system as simple as this, and setting
them to optimal values is not trivial.
Future Work
- Theoretical Analysis of the protocol: Even though the
channel is simple, the performance analysis of the protocol is nontrivial,
but this is important for optimization of the protocol. The key measurement
of the performance is the throughput of the system, i.e.,
the number of packets that can be correctly delivered to the user per unit time.
We may wish to optimize this under the constraint of limited sender and receiver
buffer sizes. Performance depends considerably on the physical network being
modeled, so different channel models are interesting topics.
- Optimization of the protocol: Based on a theoretical
analysis, the number of control packets (ACKs, NACKs) and control
signals (time-out interrupts) should be optimized to improve the throughput
of the system, with the correctness of the protocol still guaranteed.
- General Message Ordering: This is a problem that comes
up for any multicast protocol. The protocol discussed here considers
single source ordering only. That is, we guarantee that messages
from the same source are received in the same order by all group members.
Stronger ordering rules are multiple source ordering where messages
from multiple sources multicasting to the same group are received in the
same order by all group members, and multiple group ordering where
messages from possibly different sources sent to overlapping groups are
received in the same order by all members in the intersection of the
groups.
References
- Afek, Y., et al. Reliable Communication Over Unreliable Channels.
Journal of the ACM, Vol. 41, No. 6, November 1994.
- Chandran, S. R. and Lin, S. Selective-Repeat-ARQ Schemes for
Broadcast Links. IEEE Transactions on Communications,
Vol. 40, No. 1, January 1992.
- Tanenbaum, A. S. Computer Networks. Prentice-Hall,
New Jersey, 1989.