Probabilistic Reliable Dissemination in Large-Scale Systems

Probabilistic reliable dissemination in large-scale systems | IEEE Journals & Magazine | IEEE Xplore

abstract

The growth of the Internet raises new challenges for the design of distributed system and application. In the context of group communication protocols, gossip-based schemes have attracted interest as they are scalable, easy to deploy, and resilient to network and process failures. However, traditional gossip-based protocols have two major drawback 1) they rely on each peer having knowledge of the global membership 2) being oblivious to the network topology, they can impose a high load on network links when applied to wide-area setting. In this paper, we provide a theoretical analysis of gossip based protocols which relates their reliability to key system parameters (system size, failure rates, and number of gossip target). The result provide guidelines for the design of practical protocols. In particular, they show how reliability can be maintained while alleviating drawback 1)by providing each peer with only a small subset of the total membership information and drawback 2) by organizing members into a hierarchical structure that reflects their proximity according to some network-related metric. We validate results by simulations and verify that the hierarchical gossip protocols considerably reduces the load on the network compared to the original, nonhierarchical protocol.

Introduction and background

large-scale reliable group communication. Reliable group communication protocols are essential for distributed system and application such as publish/subscribe system, distributed database, consistency management, and distributed failure detection. The growth of the Internet has influenced the scale and the reliability requirement of distributed system. Traditional solution applicable in small-scale setting often do not scale well to very large system sizes.

Network layer multicast protocols like SRM and RMTP work on top of IP multicast and ensure reliability by using positive negative acknowledgments to repair packet losses. However, IP multicast is not currently deployed in the Internet. Consequently, application-level multicast has recently received increasing attention. Centralized or partially centralized approaches proven efficient in local-area network, do not scale well to large group. For instance log-based reliable multicast(LBRM) use logger to provide stable storage and to hand retransmission of missing message; how ever, the amount of information to be stored grows with the number of nodes and loggers could be overloaded. Other protocols, like Scribe and CAN-multicast are efficient and scalable but require the existence of a largescale peer-to-peer routing infrastructure. In contrast, epidemic or gossip-based protocols scales well to large groups, are easy to deploy, and degrade gracefully as the rate of node failure or message loss increases.

Gossip-based probabilistic multicast protocols. there protocols peer to peer interaction model for multicasting a message and are scalable since the load distributed among all participating nodes. they use redundant message and are scalable since the load is distributed among all participating node. They use redundant message to achieve reliability and fault tolerance. This class of protocols has been used for consistency management in replicated databases,failure detection, garbage collection, etc. A recent protocols called pbcast[2] uses them for reliable multicast. In pbcast, notifications are first broadcast using either IP multicast or a randomly generated multicast tree if IP multicast is not available. in addition, each node periodically chooses a random subset of process and send them a digest of the most recent message. Upon receipt of these message, receiver check for missing message and, if needed, solicit retransmission.

a related protocol, combining both push and pull phases, is proposed in[22]. in the push phase, each node receiving a message passes it on as in gossip, but increments a counter attached to the message. When the counter reaches a threshold, receiving nodes don't gossip anymore. In the second phase, nodes which haven't yet received the message send requests to randomly chosen node to pull the message. This is one of the few papers to include a theoretical analysis of the number of gossip messages needed to ensure high probability of reaching everyone. the pull phase is difficult to implement, which motivates us to consider a pure push algorithm in this paper. Our analysis techniques are different and may be of independent interest.

A hybrid version of pbcast and LBRM, called Reliable Probabilistic Multicast (rpcast) [24], consists of three phases.The first phase uses an unreliable IP multicast. During the second phase a pbcast gossiping step is initiated and, if it fails, a third deterministic phase using loggers is invoked.

A refinement of probabilistic gossip that takes network topology into account is Directional Gossip[20]. This is a wide-area protocol in which node favor the choice of low connectivity neighbors as gossip target in an attempt to improve reliability.

Though the above gossip-based approaches have proven scalable, they rely on a nonscalable membership protocol: They assume that the subset of nodes is chosen uniformly among all participating nodes, requiring that each node should know every other node. This assumption limits their applicability in large-scale settings. A protocol which partially addresses this issue is presented in [21], where a connection graph called a Harary graph is constructed. Optimality properties of Harary graphs ensures a good tradeoff between the number of message propagated and the reliability guarantees. However, building such a graph requires global knowledge of membership, and maintaining such a graph structure in the presence of arrivals/departures of nodes might prove difficult.

Contributions. In this paper we consider protocols where each node maintains only a partial view of the membership, which is typically much smaller than the system size. Gossip target are chosen from this partial view. We describe how probabilistic gossip-based algorithms can be modeled using random graphs and provide a rigorous analysis relating the probability of success to the fanout, defined as the number of gossip target per node. We do this in a flat setting, where the partial views provided to each node are chosen uniformly among all group members. We then consider models where nodes are grouped into clusters and partial view are largely restricted to nodes within the same cluster. we give an analysis of reliability as a function of the fanout both within and between clusters. This can be extended to hierarchical model with an arbitrary number of levels. The network load can be reduced by clustering on the basis of proximity int the network.

Simulations conform the analysis in showing that the protocol exhibits even under high failure rate. An implications of this resilience to node failures is that the protocol can provide good support for mobile nodes which may disconnect for nonnegligible periods. In order to evaluate the impact of clustering, we simulate our system on a realistic network topology based on the Georgia-Tech transit-stub model[27]. results show that the hierarchical protocol substantially reduces network load compared to flat gossip. The theoretical analysis in this paper is generic and can be applied to a variety of gossip-based protocols.

Gossip and membership protocols

2.1 gossip protocol

We consider a system composed of n nodes. We assume that there is a membership protocol which provides each node with a randomized partial knowledge of the system; this consists of a lit of node identifiers stored in local subscription list whose size denote by $l$. We will consider specific examples of membership protocols later. A notification (or gossip) message contains an event to disseminate to the whole group. When a node generates a notification event, a gossiping protocol round is initiated. The pseudocode for the gossiping algorithm is presented in Fig1. A node that initiates a notification or receives it for the first time picks k nodes at random from its local list and send them the notification. The number of gossip target,k , is called the fanout. The number k and l could be random variables.

The links between nodes defined by their gossip targets specify an overlay network on top of the existing network topology. We call this overlay network a connection graph in the rest of the paper. In the next section, we derive an expression for the fanout required to achieve a specified probability that a notification reaches every group member. The membership protocol can be tuned to provide members with a partial view, which is some small multiple of this desired fanout. This permits nodes to randomize their choice of gossip targets between successive gossip rounds and reduces the likelihood of a node remaining isolated for long periods.

For the sake of simplicity, we assume that a gossiping round is initiated for each notification. This can easily be modified to be initiated periodically and to send several notifications per gossip message.

2.2 Reliability Requirements

The goal of our protocol is ensure that a notification send by a member of the group reaches all notified members despite transient or permanent failure of other nodes and/or links in the system. Transient failures refer to the temporary inability of a node to receive a message(e.g., due to buffer overflow) or a temporary failure of the network to deliver a message(e.g., due to packet drops). Permanent failures refer to node crashes.

Gossip-based protocols provide probabilistic guarantees of delivery. The parameters of the model can be turned to achieve success probabilities arbitrarily close to 1 so that our approach is comparable to deterministic methods. In this paper, we focus on the relationship between the fanout and reliability of the basic gossip protocol described in the previous section. Additional mechanisms to increase reliability can be easily layered on top of the basic probabilistic protocol. This is out of the scope of this paper.

Finally, note that a node may become isolated either because its identifier present in no local view or because all nodes holding its identifier have either failed or unsubscribed. Such a node has substantial probability of remaining isolated for long period. we describe how to deal with this issue in the context of specific membership protocols below.

2.3 Membership protocols

A variety of protocols can be used to provide each node with a partial view of the group membership. The focus here is not on the details of their implementation, but on the theoretical analysis and simulation results in the next two sections, which show that the memory requirement of these protocols scale well in the system size. The results are applicable to variants of the basic protocols presented below.

2.3.1 Flat membership protocol

server-based protocol consider a set of s servers, to one of which group members have to subscribe when they join a group. Each server manages a subscription list containing all subscriptions it knows about. In this model, each server manages a part of the membership service. The subscription process is distributed among the servers, whereas the membership information is replicated on all servers. Upon receipt of a subscription, a server add the subscriber in its own subscription list. The server integrates the new member in the connection graph of the group. This requires two steps: 1) providing the new member with a partial knowledge of the system and 2) disseminating ,the new member's identify to other nodes. to this end, the server randomly chooses a sub set of l nodes from it's subscription list and send the subset to new member. This subset will constitute its local subscription list and provide it with a uniform randomized partial view of the system. In addition, the server randomly chooses l other node and send them the identifier of new member. This enables the new member to integrated in l other local subscription list and, consequently, in future connections graphs.

As the fanout is related to the number of nodes in the system and the reliability guarantees, servers are in charge of modifying the fanout in response to changes in the number of nodes; such modification are likely to be infrequent since, as we shall see, the fanout needs to be increased by 1 when the number of nodes increases by a factor of e.

The main drawback of this protocol is that, as the number of nodes in the system increases, the load on each server increases linearly. In addition, synchronization between servers is required periodically to ensure that they have a (approximately) consistent view of the group membership. Replicating membership information has the advantage that failure of individual servers can be tolerated.

Isolation could happen in this protocol when a node's identifier is present in no local view but that of its server, for example, because all nodes holding its identifier have either failed or unsubscribed. To overcome this, nodes periodically send heartbeat message according to the same gossip protocol. Missed heartbeat trigger resubscriptions.

Decentralized flat membership protocols The protocols described above rely on servers to manage the membership, though not the dissemination of notifications. The problems of designing a scalable, peer-to-peer membership, though not the dissemination of notifications. The problem of designing a scalable, peer-to-peer membership service is addressed by Lpbcast[8] and Scamp[10],[11].Scamp employs a self-organizing subscription mechanism which automatically provides each node with partial view of the membership of size(c+1)log(n) on average, where n is the number of members and c a design parameter.

2.3.2 Hierarchical Membership Protocol

probabilistic gossip-based protocols are scalable from the nodes point of view. However,their attractive reliability properties derive from a high degree of redundancy. This generates a large number of message, which may be expensive in a wide-area setting.

This drawback can be partially overcome if most messages are sent locally. In the flat membership approch, the local subscription list is composed of nodes located all over the network. We now introduce a hierarchical model where nodes are clustered according to a geographical or network proximity criterion and this is taken into account in providing nodes with a local subscription list composed exclusively of nodes belonging to the same cluster. In addition, a few nodes within each clusters are provided with a remote subscription list consisting of nodes in other clusters. We will see that only a small number of link between clusters are necessary to keep small number of links between clusters. are necessary to keep the system connected.

Henceforth, we distinguish between intercluster and intracluster fanout

  • The intracluster fanout k denotes the number of links each node has with other nodes in the same cluster.
  • The intercluster fanout f denotes the number of remote link each cluster must maintain with nodes outside the cluster. This is the minimum degree of knowledge each server must have of nodes out side its cluster.

The intracluster membership information is contained in the subscription list as previously. In addition a remote list contains the identity of f remote nodes. f random nodes are designated in each cluster to maintain these remote links. One node may be responsible for one interclyster connection only. Fig2 depicts an example of a 24-node system with two different approaches.

The membership protocol has been described for a two-level hierarchy for ease of exposition. but can easily be extended to hierarchy with more levels. In the model described, remote links are equally like to be directed at any other cluster, though some clusters may be closer than others in the chosen metric. in view of the fact, established in the next section, that very few intercluster links are required, we do not expect this to have a significant impact on network load in all but very large system. If required, differences in proximity between cluster can be exploited by more levels to the hierarchy.

The membership protocol can be implemented using a server per cluster which is in charge of attributing intercluster link as well as maintaining the partial views with the cluster. Synchronization is required between servers to exchange random nodes-Ids which will be used to link clusters. A hgierarchical gossiping approach is used in Astrolabe[25] and [15]. In these protocols, the hierarchy of nodes is mapped on the network topology by an administrator and nodes are more likely to gossip within their subhierarchy than remotely. A fully decentralized hierarchical approach to membership has been proposed in Hi-Scamp [12,] which clusters nodes according to a network proximity metric and implements a Scamp [11] protocol within each cluster and between clusters at each level of the hierarchy.

Last modification:February 20, 2025
如果觉得我的文章对你有用,请随意赞赏