pastry: scalable, Decentralized Object Localtion

Pastry: Scalable, Decentralized Object Location, and Routing for Large-Scale Peer-to-Peer Systems | SpringerLink

abstract

Pastry is a scalable, distributed object location and for wide-area p2p applications. Pastry performs application level routing and object location in potentially very large overlay network of nodes connected via the internet. It can be used to support a variety of p2p application, including global data storage, data sharing,group communication and naming.

each node in the pastry network has unique identifier(nodeId). When presented with a message and a key, a pastry node efficiently route the message to the node with a nodeId that is numerically closest to the key, among all currently live pastry node. Each pastry node keeps track of its immediate neighbors in the nodeId space, and notifies applications of new node arrivals, node failures and recoveries. pastry takes in to account network locality; it seek to minimize the distance message travel, according to a to scalar proximity metric like the number of IP routing hops.

pastry is completely decentralized, scalable, and self-organizing; it automatically adapts to the arrival,departure and failure of node. Experimental results obtained with a prototype implementation on an emulated network of up to 100,000 node confirm pastry's scalability and efficiency, its ability to self-organize and adapt to node failures, and its good network locality properties.

1. introduction

p2p internet have many interesting technical aspects like decentralized control, self organization, adaptation and scalability. In p2p network all nodes have identical capabilities and responsibilities and all communication is symmetric.

one of the key problems in large-scale p2p application is provide efficient algorithms for object location and routing within the network. Pastry a generic p2p object localtion and routing scheme, based on a self-organizing network of node connected to Internet. Pastry is decentralized ,fault-resilient, scalable and reliable,pasyry has good route locality properties.

pastry is intended as general substrate for the construction of variety of p2p internet application like global file sharing, file storage, group communication and naming system.

Pastry provides following capability: Each node in the pastry network has a unique numeric identifier(nodeId). then presented with a message and numeric key, a pastry node efficiently route the message to the node with nodeId that is numerically closest to the key, among all currently live pastry nodes. The expected number of routing step is O(log N),where N is number of pastry nodeId the network. The application in route that a message task may perform application-specific computation related to the message.

Pastry seeks to minimize the distance message travel, according to a scalar proximity metric like the number of IP routing hops. Each pastry node keep track of it immediate neighbors in the nodeId space, and notifies applications of new node arrivals, node failures and recoveries. Because nodeIds are randoml, with high probability, the set of nodes with adjacent nodeId is diverse in geography, ownership,jurisdiction. application can leverage this, as Pastry can route to one of k node that a numerically closest to the key. this closest is defined bt "proximity metric"

past is a distributed file system based on distributed hash table

Application use these capabilities in different way. For instance computed as the hash of the file's name as a pastry key for file. Replicas of the file are stored on the k pastry node with nodeId s closest to fileId. We can store file in the searched path. As long as one of the node on these path is alive, the file can be found. Pastry's notifications mechanisms allow PAST to maintain replicas of a file on the k nodes closest to the key, despite node failure and node arrivals, and using only local coordination among node with adjacent nodeIds.

In SCRIBE system, the node that stores the list of subscribers is called a rendez-vous point

rendez-vous is a bridge between subscribers and publishers.

Anther sample application is publish/subscribe system, a list of subscribers is stored on the node with nodeId numerically closest to topicId of a topic, where topicId is a hash of the topic name. That node forms a rendez-vous point for publishers and subscribers. Subscribers send a message via pastry using the topicId as the key; registration is recorded at each node along the path. A publisher send data to the rendez-vous point via pastry, again using the topicId as the key. The rendez-vous point forwards the data along the multicast tree formed by the reverse pathfrom the redez-vous to all subscribers.

2 Design of pastry

Any computer that connected to the internet and runs the pastry node software can act as a pastry node, subject only to application-specific security policies .

Each node in pastry p2p network is assigned 128 bit node identifier(nodeId). The Node is used to indicate a node's position in circular nodeId space, which ranges from 0 to $2^{128}-1$. the nodeId is assigned randomly when a node join the system. It is assumed that nodeIds are generated such that the resulting set of nodeIds is uniformly distributed in 128-bit nodeId space. For instance, nodeIds could be generated by computing a cryptographic hash of the node's piblic key or its IP address.

in pastry route working by match the prefix of the nodeId, every routing operation will match b prefix

L: represent leaf node. every node will maintain the closest L/2 node that bigger current nodeId and L/2 node that smaller current nodeId.

such as current nodeId is 100 and L=10: bigger(101,102,103,104,105) smaller(99,98,97,96,95)

Assuming a network consisting of N nodes, pastry can route to the numerically closest node to given key in less than $log_{2^b}N$ steps under normal operation(b is a configuration parameter with typical value 4). Desite concurrent node failures,eventual delivery is guaranteed unless $|L|/2$ node with adjacent nodeids fail simultaneously(|L| is configuration parameter with a typical value of 16 or 32).in the following we present the pastry scheme.

For the purpose of routing,nodeIds and keysare thought of as sequence of digits with base$2^b$. pastry routes message to the node whose nodeId is numerically closest to the given key. This is accomplished as follows. In each routing step, a node normally forward the message to node those nodeId shares with the key a prefix that is at least one digit(or b bits)longer than the prefix that the key shares with the present node's id. If no such node is known, the message is forwarded to a node whose nodeId shares a prefix with the key as long as the current node, but is numerically closer to the key than the present node's id. To support this routing procedure, each node maintains some routing state.

2.1 pastry Node State

route table have$log_{2^b}N$ row, N is node number in network. B is the number of digits compared each time

Each Pastry node maintains a routing table,a neighborhood set and left set. R(route table ) is organized in to $log_{2^b}N$ row with $2^b-1$entries each. The $2^b-1 $ entries at row n of routing table each refer to a node those nodeId shares the present node's nodeId in the first n digits, but whose n+1th digit has one of the $2^b-1$ possible value other than the n+1th digit in the present node's id.

Each entry in routing table contains the IP address of one of potentially many node whose nodeId have appropriate prefix; in practice, a node chosen close to present node, according to the proximity metric. we will show in section 2.5 that this choice provide good locality properties. If no node is know with a suitable nodeId, then the routing table entry is left empty. The uniform distribution of nodeIds ensures an even population of the nodeId space; thus, on average, only $log_{2^b}N$rows are populated in the routing table.

The choice of b invoolves a trade off between the size of populated portion of the routing table(approximately $log_{2^b}N \times (2^b-1)$entries) and maximum number of hops requirement to route between any pair of nodes $log_{2^b}N$. With a value of b = 4 and $10^6$ nodes, a routing table contains on average 75 entries and the expected number of routing hop is 5,whilst with $10^9$ nodes, the routing table contains on average 105 entries, and the expected number of routing hops in 7.

The neighborhood set M contains the nodeIds and IP addresses of |M| nodes that are closest(according the proximity metric) to the local node. The neighborhood set is not normally used in routing message; it is useful in maintaining locality properties, as discussed in section 2.5. The leaf set L is the set of node with the |L|/2 numerically closest larger nodeIds, and the |L|/2 nodes with numerically closet smaller nodeIds, relative to present node's nodeId. The leaf set is used during the message routing, as described below. Typical value for |L| and |M| are $2^b$ or $2 \times 2^b$

Figure 1 depicts the state of a hypothetical Pastry node with the nodeId 10233102 (base 4), in a system that uses 16 bit nodeIds and a value of b = 2.

neighborhood set is based proximity metric. Perhaps due to the close physical distance.

in routing table

The first row (Row 0) stores nodes that are different from the 0th bit of the current node NodeId.

The second row (Row 1) stores nodes that are the same as the first bit of the current node but different from the next bit.

Similarly, the nth row stores nodes with the same first n bits but different n+1 bits

b means $2^b$-base in this fig. the range of each number is 0-3

2.2 routing

The pastry routing procedure is shown in pseudo code from in Table1. The procedure is executed whenever a message with key D arrives at a node with nodeId A. We begin by defining some notation.

$R^i_l$: the entry in the routing table R at column i,$0 \leq i < 2^b$ and row l, $0 \leq l<[128/b]$.

$L_i$: the i-th closet nodeId in the leaf set L, $-[|L|/2]\leq i\leq [|L|/2]$, where negative/positive indices indicate nodeIds smaller /larger than the present nodeId,respectively.

$D_l$: the value of the l's digit in the key D

shl(A,B): the length of the prefix shared among A and B, in digits.

Given a message, the node first checks to see if the key falls within the range of nodeIds covered by its leaf set(line1). If so message is forwarded directly to the destination node, namely the node in the leaf set whose nodeId is closest to the key(possibly the present node)(lie3)

If the key is not covered by the leaf set, then the routing table is used and the message is forwarded to a node that shares a common prefix with the key by at least one more digit(lines 6-8).in certain cases, it is possible that the appropriate ensure in routing table is empty or the associated node is not reachable(11-14),in which case the message is forwarded to node that shares a prefix with the key at least as long as the local node, and is numerically closer to the key than present node's id. Such a node must be in the leaf set unless the message has already arrived at the node with numerically closest nodeId. And, unless |L|/2 adjacent node in the leaf set have failed simultaneously, at least one of those node must be live.

If we cannot find a node with a longer prefix that matches target, we will choose nodes closer to the target than the current node. in line 14 demonstrated shorten the distance

This simple routing procedure always converges, because each step take the message to a node that either(1) shares a longer prefix with the key than the local node, or(2) shares as long a prefix with, but is numerically closer the key than the local node.

routing performance. it can be show that the expected number of routing step is $log_{2^b}N$ steps, assuming accurate routing table and no recent node failures. Briefly, consider the three case in the routing procedure. If a message is forwarded using routing table(lines 6-8), then the set of nodes whose id have a longer prefix match with key is reduced by a factor of $2^b$, then the set of node whose ids have a longer prefix match with the key is redeced by a factor of $2^b$ in each step, which means the destination is reached in $log_{2^b}N$ steps . If the key is within range of the leaf set(lines 2-3), then the destination node is at most one hop away.

The larger the leaf set, the lower the probability of mismatch occurring. usually leads to an additional forward once.

The third case arises when the key is not covered by the leaf set(i.e., it is still more than one hop away from destination), but there is no routing table entry. Assuming accurate routing tables and no recent node failures, this means that a node with the appropriate prefix does not exits(11-14). The likelihood of this case, given the uniform distribution of nodeIds, depends on |L|. analysis show that with $|L|=2^b$ and $|L|=2 \times 2^b$, the probability that this case arises during a given message transmission is less than .02 and 0.006,respectively. then it happens, no more than one additional routing step results with high probability.

In the event of many simultaneous node failures, the number of routing steps required may be at worst linear in N, while the nodes are updating their state. This is a loose upper bound;in practice, routing performance degrades gradually with the number of recent node failures, as we will show experimentally in Section 3.1. Eventual message delivery is guaranteed unless |L|/2 node with consecutive nodesIds fail simultaneously. Due to the expected diversity of nodes with adjacent nodeIds, and with reasonable choice for |L|(e.g. $2^b$), the probability of such a failure can be made very low.

2.3 pastry API

nodeId = pastryInit(credentials,Application) causes the local node to join an existing pastry network(or start a new one), initialize all relevant state, and return the local node's nodeId. The application-specific credentials contain information needed to authenticate the local node. The application argument is a handle to the application object that provide the pastry node with the procedures to invoke when certain event happend, e.g., a message arrival

route(msg,key) causes pastry to route the given message to the node with nodeId numerically closest to the key, among all live Pastry nodes.

applications layered on top of Pastry must export the following operations:

deliver(msg,key) called by pastry when a message is received and the local node's nodeId is numerically closest to key, among all live node.

forward(msg,key,nextId) called by pastry just before a message is forwarded to node with nodeId=nextId. The application may change the content of the message or the value of nextId. Setting the nextId to NULL terminates the message at the local node.

newLeafs(leafSet) called by Pastry whenever where is a change in the local node's leaf set. This provides the application with an opportunity to adjust application-specific invariants based on the leaf set.

2.4 self-organization and adaptation

In this section, we describe pastry's protocols for handling the arrival and departure of node in the pastry network. We begin with the arrival of a new node that joins the system.

expanding ring is query algorithm, commonly seen in p2p networks or DHT system.

start searching locally and expand to the predefined maximum search range.

Node arrival. When a new node arrives, it need to initialize its state tables, and then inform other nodes of its presence. We assume the new node knows initially about a nearby Pastry node A, according to proximity metric, that is already part of the system. Such node can be located automatically, for instance, using "expanding ring" IP multicast, or be obtained by the system administrator through outside channels.

Let us assume the new node's nodeId is X.(The assignment of nodeIds is application-specific ; typically it is computed as the SHA-1 hash of its IP address or its public key) Node X then asks A to route special "join" message to the existing node Z whose id is numerically closest to X.

In response to receiving the "join "request, nodes A,Z and all node encountered on the path from A to Z send their state tables to X. The new node X inspects this information, may request state from additional nodes, and then initializes the state tables, using a procedure describe below. Finally, X informs any node that need to be aware of its arrival. This procedure ensures that X initializes its state with appropriate value, and that the state in all other affected nodes is updated.

since node A is assumed to be in proximity to the new node X, A's neighborhood set to initialize X's neighborhood set. Moreover, Z has the closest existing nodeId to X, thus its leaf set is the basis for X's leaf set. Next, we consider the routing table, starting at row zero. we consider the most general case,where the nodeIds of A and X share no common prefix. Let $A_i$ denote node A's row of the routing table at level i. Note that the entries in row zero of the routing table are idenpendent of a node's nodeId. Thus, $A_0$ contains appropriate value for $X_0$. Other level of A's routing table are of no use to X, since A's and X's ids share no common prefix .

If X want to find Z(closest node in digital for X). X need know A(A already in the Pastry), use A find X, in this process B is first node along route from A to Z. X obtains appropriate entries for $X_2$ from node C, and so on.

However, appropriate values for $X_1$ can be taken from $B_1$, where B is the first node encountered along the route from A to Z. to see this, observe taht entries in $B_1$ and $X_1$ share the same prefix, because X and B have the same first digit in their nodeId. Similary, X obtains appropriate for $X_2$ from node C, the next node encountered along the route from A to Z, and so on.

Finally, X transmit a copy of its resulting state to each of the nodes found in its neighborhood set ,leaf set, and routing table. those node in turn update their own state based on the information received. One can show that at this stage, the new node X is able to route and receive message, and participate in the pastry network. The total cost for a node join, in terms of the numbers of messages exchanged, is $O(log_{2^b}N)$.The constant is about $3\times 2^b$. Those node un turn update their own state based on the information received. One can show that at this stage, the new node X is able to route and receive message, and participate in Pastry network. The total cost for a node join, in terms of the number of message exchanged, is $O(log_{2^b}N)$.The constant is about $3\times 2^b$

Pastry uses an optimistic approach to controlling concurrent node arrivals and departures. Since the arrival/departure of a node affects only a small number of existing node in system, contention is rare and optimistic approach is appropriate. briefly, whenever a node A provides state information to a node B, it attaches a timestamp to the message. B adjusts its own state based on this information and eventually send an update message to A(e.g. notifying A of its arrival). B attaches the original timestamp,which allows A to check if its state has since changed. in the event that its state has changed, it responds with its updated state and B restarts it operation.

node departure. Nodes in pastry network may fail or depart without warning .In this section, we discuss how the Pastry network handles such departures. A pastry node is considered failed when its immediate neighbors in nodeId space can no longer communicate with the node.

pastry regular check on neighbors, left set and routing are reachable or not. If there is no response within a certain period of time, will be considered a malfunction.

leaf Set

If one of leaf node fail , The closest live node will communicate the Node that keep largest or smallest nodeIds in leaf set, and get that copy leaf, and choose appropriate node to replace the faulty node.

that can be ensure Left set always maintains a certain number of node, to avoid many continuous node fault.

routing table

Fault happened in routing table, Pastry allow message forward to anther node without block the transmission of message.

To fix the invalid node, current node will attempt to:

  • find alternative node from other columns in the same row(i.e. nodes with the same prefix but different suffix)
  • If can't fand appropriate node, attempt find the alternative node from neighborhood set

Neighborhood Set

It will request other neighbors to provides it neighbor list and select a node closer to itself for replacement.

The neighborhood set is not normally used in the routing of message, yet it is important to keep it current, because the set play an important role in exchanging information about nearby node.

2.5 Locality

Pastry's notion of network proximity is based on a scalar proximity metric, such as the number of IP routing hops or geographic distance. It is assumed that the application provides a function that allows each Pastry node to determine the distance of a node with a given IP address to itself. A node with a lower distance value is assumed to be more desirable. An application is expected to implements this functions depending on its choice of a proximity metric, using network service like traceroute or internet subnet maps, and appropriate caching and approximation techniques to minimize overhead.

Pastry adopt scalar proximity metric to measure distance between two node such as

  • IP routing hops
  • geographical distance
  • RTT(Round-Trip Time)
  • bandwidth

these measures need complies with the triangle Inequality. But in actual network environment, IP routing not always complies triangle inequality, so Pastry need special strategy.

For initialization the routing table, not only consider prefix matches of nodeId, but also choose the closest node in physically. For every time the new node join the network, notified node only keep the closest node when the prefix length is the same. This operation can reduce transmission time.

X: The node that newly added to the network

A: X knows the initial Pastry node

The circles around the n-th node along the route from A to Z indicate the average distance of the node's representatives at level n. Note that X lies within each circle.

Shortest distance at the lowest level node, and distance increase with the level.

The first few hops is small hops(optimize locality), latter hops is big hops(accelerate search speed)

2.6 Arbitrary Node failures and Network partitions

Throughout this paper, it is assumed that pastry nodes fail silently. Here ,we briefly discuss how a pastry network could deal with arbitrary node failures, where a failed node continues to be responsive, but behaves incorrectly or even maliciously. The pastry routing scheme as described so far is deterministic. Thus, it is vulnerable to malicious or failed nodes along the route that accept message but do not correctly forward them.

In application where arbitrary node failures must be tolerated, the routing can be randomized. recall that in order to avoid routing loops, a message must always be forwarded to a node that shares a longer prefix with the destination, or shares the same prefix length as current node but numerically closer in the nodeId space than the current node. However, the choice among multiple node that satisfy this criterion can be made randomly. In practice, the probability distribution should be biased towards to the best choice to ensure low average route delay. In the event of malicious or failed node along the path, the query may have to be repeated several times by the client, until a route is chosen that avoids the bad node. furthermore, the protocols for node join and node failure can extended to tolerate misbehaving node. The detail are beyond the scope of this paper.

Another challenge are IP routing anomalies in the Internet that cause IP hosts to be unreachable from certain IP hosts but not other. The pastry routing is tolerant of such anomalies; pastry node are considered live and remain reachable in the overlay network as long as they are able to communication with their immediate neighbors in the nodeId space. However, Pastry's self-organization protocol may cause the creation of multiple,isolated pastry overlay networks during periods of IP routing failures. Because Pastry relies almost exclusively on information exchange with in the overlay network to self-organize, such isolated overlays may persist after full IP connectivity resumes.

Last modification:February 13, 2025
如果觉得我的文章对你有用,请随意赞赏