scribe: A large-scale and decentralized application-level multicast infrastructure

Abstract

This paper presents Scribe, a scalable application level multicast infrastructure. Scribe support large numbers of groups, with a potentially large number of members per group. Scribe is built on top of Pastry, a generic p2p object location and routing substrate overlayed on the Internet, and leverages Pastry's reliability, self-organization, and locality. Pastry is used to create and manage group and to build efficient multicast trees for the dissemination of message to each group. Scribe provides best-effort reliability guarantees, and we outline how an application can extend Scribe to provide stronger reliability. Simulation result,based on a realistic network topology model, show that Scribe scales across a wide range of group and group size. Also, it balances the load on the nodes while achieving acceptable delay and stress when compared with Internet protocol multicast.

1 Introduction

Network-Level Internet protocol(IP)multicast was proposed over a decade ago. Subsequently, multicast protocols such as scalable reliable multicast protocol(SRM) and reliable message transport protocol(RMTP) have added reliability. However, the use of multicast in application has been limited because of lock of wide scale deployment and the issue of how to track group membership.

In this paper, we present Scribe, a large-scale, decentralized application-level multicast infrastructure built upon Pastry. Scribe provides efficient application-level multicast and is capable if scaling to large number of groups, of multicast source, and members per group.

Scribe are decentralized, that builds a multicast tree, formed by joining the pastry routes from each group member to rendezvous point associated with a group. Membership maintenance and message dissemination in Scribe leverages the robustness, self-organization, locality, and reliability properties of Pastry.

2 Pastry

3 scribe

Scribe is a scalable application-level multicast infrastructure built on top of Pastry. Any Scribe node may create a group, other node can then join the group, or multicast message to all member of the group(provided they have appropriate credential). Scribe provides best-effort delivery of multicast message, and specifies no particular delivery order. However, stronger reliability guarantees and ordered delivery for a group can be built on top of scribe, as outlined in section 3-B. Node can create, send message to, and join many group. Group may have multiple source of multicast message and many members. Scribe can support simultaneously a large numbers of group with a wide range of group sizes, and high rate of membership turnover.

Scribe offers a simple API to its application

create(credentials, groupId) creates a group with groupId. Throughout, the credentials are used for access control.

join(credentials,groupId,MessageHandler) causes the local node to join the group with group Id. All subsequently received multicast messages for that group are passed to the specified message handler.

leave(credentials, group) causes the local node to leave the group with groupId.

multicast(credentials, groupId, message) causes the message to be multicast within the group with groupId.

Scribe uses Pastry to manage group creation, group joining and to build a per-group multicast tree and used to disseminate the messages multicast in the group. Pastry and Scribe are fully decentralized: all decisions are based on local information, and each node has identical capabilities. Each node can act as a multicast source, a root of a multicast tree, a group member, a node within a multicast tree, and any sensible combination of the above. Much of the scalability and reliability of Scribe and Pastry derives from this peer to peer model.

A. Scribe implementation

A scribe system consists of a network of Pastry, where each node run the Scribe application software. The Scribe software on each node provides the forward and deliver methods, which are invoked by Pastry whenever a Scribe message arrives. The pseudo-code for these Scribe methods, simplified for clarity, is shown in Fig3 and 4, respectively.

Recall that the forward method is called whenever a scribe message is routed through a node. The deliver method is called when a Scribe message arrives a t the node with nodeId numerically closest to the message's key, or when a message was addressed to the local node using the Pastry send operation. The possible message types in Scribe are JOIN, CREATE, LEAVE ,and MULTICAST;

The following variables are used in the pseudocode: groups is the set of groups that the local node is aware of, msg.source is nodeId of the message's source node, msg.source is the nodeId of the message's source node, msg.group is groupId of the group, and msg.type is meessage type.

1)group management: Each group has unique groupId. The scribe node with a nodeId numerically closest to the groupId acts as the rendezvous point for the associated group. The rendezvous point is the root of the multicast tree created for the group.

to create a group, a Scribe node asks Pastry to route a CREATE message using the groupId as the key[e.g., route(CREATE,groupId)]. Pastry delivers this messages to the node with the nodeId numerically closest to groupId. The Scribe deliver methods adds the group to the list of groups is already knows about(line3 of Fig.4). It also checks the credentials to ensure that the group can be created, and stores the credentials. This Scribe node becomes the rendezvous point for the group.

The groupId is the hash of the group's textual name concatenated with its creator's name. The hash is computed using a collision resistant Hash function (e.g. SHA-1), which ensures a uniform distribution of groupIds. Since Pastry nodeIds are also uniform distribution of groupIds. Since Pastry nodeIds are also uniformly distributed. this ensure an even distribution of group across Pastry ndoes.

alternatively, we can make the creator of a group be the rendezvous point for the group as follows: a Pastry nodeId can be the hash of the textual name of the node, and a groupId can be the concatenation of the nodeId of the creator and the hash of the textual name of the group. This alternative can improve performance with a good choice of creator: link stress and delay will be lower if the creator send to the group ofen, or is close in the network to other frequent senders or many group members.

In both alternatives, a groupId can be generated by any Scribe node using only textual name of the group and its creator, without the need for an additional naming service. Of cource, proper credentials are necessary to join or multicast messages in the associated group.

2) Membership Management : Scribe creates a multicast tree, rooted at rendezvous point, to disseminate the multicast message in the group. The multicast tree is created using a scheme similar to reverse path forwarding[20]. The tree is formed by joining the Pastry routes from each group member to rendezvous point. Group joining operations are managed in decentralized manner to support large and dynamic membership.

Scribe nodes that are part of a group's multicast tree are called forwarders with respect to the group; they may or may not be members of the group. Each forwarder maintains a children table for the group containing an entry(IP address and nodeId for each of its children table in the multi castree).

When a scribe node wishes to join a group, it asks Pastry to route aJOIN message with the group''s groupId as the key[e.g., route(JOIN,groupId)].This message is routed by Pastry toward the group's rendezvous point. At each node along the route, Pastry invokes Scribe's forward methods. Forward(lines three to seven in Fig.3) checks its list of groups to see if it is currently a forwarder; if so, it accepts the node as a child, adding it to the children table. If the node is not already a forwarder it creates an entry for the group, and adds the source node as a child in the associated children table. It then becomes a forwarder for the group by sending a JOIN message to the next node along the route from the joining node to the rendezvous point. The original message from the source is terminated; This is achieved by setting nextId=null, in line seven of Fig.3.

Fig.5 illustrates the group joining mechanism. The circles represent nodes, and some of the nodes have their nodId shown. For simpliicity b=1, so the prefix is matched one bit at a time. We assume that there is a group with groupId 1100 whose rendezvous point is the node with the same identifier. The node with nodeId 0111 is joining the group. In this example, Pastry routes the JOIN message to node 1001; then the message from 1001 is routed to 1101; finally, the message from 1101 arrives at 1100. This route is indicated by the solid arrows in Fig.5.

Let us assume that nodes 1001 and 1101 are not already forwarders for group 1100. The joining of node 0111 causes the other two nodes along the route to become forwarders for the group, and causes them to add the preceding note in the route to their children tables. Now let us assume that node 0100 decides to join the same group. The route that its JOIN message would take is shown using dot-dash arrows. However, since node 1001 is already a forwarder, it add node 0100 to its children table for the group, and JOIN message is terminated.

When a Scribe node wishes to lave a group, it records locally that it left the group. If there are no other entires in the children table, it send a Leave message to its parent in the multicast tree, as show in lines 9 to 11 in Fig.4. The message proceeds recursively up the multicast tree, util a node is reached that still has entries up the multicast tree a node is reached that still has entries in the children table after removing the departing child.

The properties of Pastry routes ensure that this mechanism produces a tree. There are no loops because the nodeId of the next node in every hop of Pastry route matches a longer prefix of the groupId than the previous node, or matches a prefix with the same length and is numerically closer, or is the nodeId of the root.

rendezvous point is root node

The Membership management management mechanism is efficient for groups with a wide range of memberships, varying one to all Scribe nodes. The list of members of a group is distributed across the nodes in the multicast tree. pastry's randomization properties ensure that the tree is well balanced and that the forwarding load is evenly balanced across the nodes. This balance enables Scribe to support large numbers of group and members per group. Joining request are handled locally in a decentralized fashion. In particular, the rendezvous point dose not handle all joining request.

The locality properties of Pastry ensure that the multicast tree can be used to disseminate message efficiently. The delay to forward a message from the rendezvous point to each group member is small because of the short routes property. Second, the route convergence property ensure that the load imposed on the physical network is small because most message are sent by the nodes close the leaves and the network distance traversed by these message is short. Simulation result quantifying the locality properties of the Scribe multicast tree will be presented in Section 4.

There is a single multicast tree for each group and all multicast sources use the above procedure to multicast message to the group. This allows the rendezvous node to perform access control.

B. reliability

Application that use group multicast may have diverse reliability requirement. Some group may require reliable and ordered delivery of message, whilst others require only best effort delivery. Therefore, Scribe provides only best-effort delivery of message but if offers a framework for applications to implements stronger reliability guarantees. Scribe uses TCP to disseminate message reliably from parents to repair the multicast tree when a forwarder fails.

1) repairing the Multicast Tree: Periodically, each nonleaf node in the tree sends a heartbeat message to its children. multicast message serve as an implicit heartbeat signal avoiding the need for explicit heartbeat message in many cases. A child suspects that its parent is faulty when it fails to receive heartbeat message. Upon detection of the failure of its parent, a node call Pastry to route a JOIN message to the group's Identifier. Pastry will route the message to a new parent, thus, repairing the multicast tree.

For example, in Fig. 5, consider the failure of node 1101.Node 1001 detects the failure of 1101 and uses Pastry to route a JOIN message toward the root through an alternative route (in dicated by the dashed arrows). The message reaches node 1111 who adds 1001 to its children table and, since it is not a for warder, sends a JOIN message toward the root. This causes node 1100 to add 1111 to its children table.

Scribe can also tolerate the failure of multicast tree roots(rendezvous point). The state associated with the rendezvous point, which identifies the group creator and has an access control list, is replicated across the k closest node to the root node in the nodeId space(where a typical value of k is five). It should be noted that these nodes are in the leaf set of the root node. If the root fail, its immediate children detect the failure and join again through Pastry. Pastry route the join message to a new root(the live node with numerically closest nodeId to the groupId), which takes over the role of the rendezvous point. multicast senders like wise discover the new rendezvous point by routing via pastry.

Children table entries are discarded unless they are periodically refreshed by an explicit message from the child, stating its desire to remain in the group.

This tree repair mechanism scales well: fault detection is done by sending messages to a small number of nodes, and recovery from faults is local; only a small number of nodes($O(log_{2^b}N)$) is involved.

2) providing Additional Guarantees: By default, Scribe provides reliable ordered delivery of multicast message only if the TCP connections between the nodes in the multicast tree do not break. For example, if some node in the multicast tree fail, Scribe may fail to deliver message or may deliver them out of order.

Scribe provides a simple mechanism to allow application to implement stronger reliability guarantees. Applications can define the following upcall methods, which are invoked by Scribe.

forwardHandler(msg) is invoked by Scribe before the node forwards a multicast message, msg, to its children in the multicast tree. The method can modify msg before it is forwarded.

faultHandler(msg) is invoked by Scribe when a node suspects that its parent is faulty. The argument is the JOIN message that is sent to repair the tree. The method can modify msg to additional information before it is sent.

For example, an application can implement ordered reliable delivery of multicast messages by defining the upcalls as follows. The forwardHandler is defined such that the root assigns a sequence number to each message and such that recently multicast message are buffered by root and each node in the multi cast tree. Message are retransmitted after the multi cast tree is repaired. The faultHandler add the last sequence number,n, delivered by node to JOIN message and the join Handler retransmits buffered message with sequence numbers above n to the new child. To ensure reliable delivery, the message must be buffered for an amount of time that exceeds the maximal time to repair the multicast tree after a TCP connection breaks.

To tolerate root failures, the root needs to be replicated. For example, one could choose a set of replicas in the leaf set of the root and use an algorithm like paxos to ensure strong consistency.

4 experimental evaluation

5 related work

Overcast: multicast based tree structure, use unicast tunnels. used for video streaming, root node select The optimal transmission path. The disadvantage is that if the upper level node fails, The entire subtree will be affected.

Narada :build a fully connected overlay network, and create minimum spanning tree to optimize the multicast route. extensibility is lower than Scribe.

CAN Multicast: DHT based CAN, every group have a independent CAN overlay network. through CAN structure for find the appropriate point at group member join.

Last modification:February 17, 2025
如果觉得我的文章对你有用,请随意赞赏