Fig. 6. Distributed query plan for running example.
data has been received, then propagate the end-of-stream. The process continues until the Group operators are encountered. When these operators receive the end-of-stream, they first output their final aggregate values before propagating the end-of-stream, which gets forwarded to the query initiator.
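The following is a minimal sketch, not the ORCHESTRA implementation, of how an aggregate (Group) operator might realize this behavior: it counts end-of-stream markers from its upstream inputs, flushes its final aggregate values once all inputs are exhausted, and only then forwards the end-of-stream downstream. The sentinel and operator names are assumptions for illustration.

```python
from collections import defaultdict

END_OF_STREAM = object()  # assumed sentinel marker for end-of-stream


class GroupOperator:
    """Illustrative SUM-per-group operator with end-of-stream handling."""

    def __init__(self, num_inputs, downstream):
        self.num_inputs = num_inputs   # number of upstream senders feeding this operator
        self.eos_seen = 0
        self.sums = defaultdict(int)   # running SUM per group key
        self.downstream = downstream   # callable that forwards tuples to the next operator

    def receive(self, msg):
        if msg is END_OF_STREAM:
            self.eos_seen += 1
            if self.eos_seen == self.num_inputs:
                # All data has arrived: output final aggregate values first,
                # then propagate end-of-stream toward the query initiator.
                for key, total in self.sums.items():
                    self.downstream((key, total))
                self.downstream(END_OF_STREAM)
        else:
            key, value = msg
            self.sums[key] += value
```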
C. Handling Node Membership Changes
The major challenge of reliable query processing is how to handle changes to the node set. Recall from Section III that the query initiator takes a snapshot of the routing tables in the system during query initiation. It disseminates this snapshot along with the query plan so all machines will use a consistent assignment between hash values and nodes.
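A sketch of what such a snapshot could look like, under assumed data structures (the paper does not specify the representation): a frozen mapping from hash-key ranges to node addresses that is shipped with the query plan, so every participant resolves a hash value to the same owner for the duration of the query.

```python
import bisect
import hashlib


class RoutingSnapshot:
    """Frozen assignment of hash-key ranges to nodes, taken at query initiation."""

    def __init__(self, range_starts, nodes):
        # range_starts[i] is the first hash value owned by nodes[i];
        # both lists are assumed sorted and of equal length.
        self.range_starts = range_starts
        self.nodes = nodes

    def owner(self, key):
        # Hash the key into a 32-bit space and find the owning range.
        h = int(hashlib.sha1(key.encode()).hexdigest(), 16) % 2**32
        i = bisect.bisect_right(self.range_starts, h) - 1
        return self.nodes[i]


# The initiator would disseminate a snapshot like this with the query plan
# (node names and range boundaries here are purely illustrative):
snapshot = RoutingSnapshot([0, 2**30, 2**31, 3 * 2**30],
                           ["node-A", "node-B", "node-C", "node-D"])
```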
Node arrival. Suppose a node joins the system in the midst of execution. In a DHT, such a change immediately affects the routing of the system, which begins forwarding messages to the new node even though it may not have participated in any prior computation. In principle, one might develop special protocols by which the new node would be "brought up to speed" by its neighbors. However, this becomes quite complex when multiple nodes join at different times. Instead, we let the query complete on its initial set of nodes, and only make use of the new node when a fresh query (with a new routing table snapshot) is invoked. This approach provides simplicity and avoids expensive synchronization.
Node departure/failure. Our use of TCP connections between nodes is generally adequate to detect a total node failure (we assume complete failure rather than incorrect operation) or a network partition. If a sending node (and query operator) drops its connection before sending an end-of-stream message, or a receiving node drops its connection before query completion, then this represents a failure. Additionally, the system performs periodic ping operations in the background to detect a "hung" machine. Clearly, in this case, continuing query computation will result in missing or possibly incorrect answers. This leads us to the problem of recomputation, described in the next subsection.
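As one possible illustration of the background ping mechanism (the interval, timeout, and connection-based probe below are assumptions, not details from the paper), a monitor could periodically probe each peer and hand unresponsive ones to the recovery logic of Section D:

```python
import socket
import time

PING_INTERVAL = 5.0   # seconds between background pings (assumed value)
PING_TIMEOUT = 2.0    # seconds before a peer is considered unresponsive (assumed value)


def ping(host, port):
    """Return True if the peer accepts a TCP connection within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=PING_TIMEOUT):
            return True
    except OSError:
        return False


def monitor(peers, on_failure):
    """Periodically ping peers; report any that stop responding."""
    while True:
        for host, port in list(peers):
            if not ping(host, port):
                on_failure((host, port))   # hand off to recovery (Section D)
                peers.remove((host, port))
        time.sleep(PING_INTERVAL)
```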
D. Recovery from Failure
Our system supports two forms of recovery from failure. One option, upon detecting a node failure, is to terminate and restart any in-process queries. Assuming low failure rates, we will ultimately get the answers this way. This approach is straightforward to implement in ORCHESTRA, since we can detect which queries are still in-flight, in contrast to systems like PIER or Seaweed.
When failures are more common, as in longer-lived queries running on large numbers of nodes, better performance might be obtained by performing incremental recomputation, where we only repeat the computations affected by the failed node, using a different node that has data replicated from the failed one. The key challenge here is that simply recomputing will likely result in the creation of some number of duplicate tuples, which in turn will either lead to duplicate answers or (in many cases) to incorrect aggregate results.
After a failure, any derived state in the system that originated from the failed nodes is likely to be inconsistent, due to propagation and computation delays. We can re-invoke the computation from the failed nodes and then remove duplicate answers, or instead we can remove all state derived from the failed nodes' data before performing the recomputation. We adopt the latter approach due to the difficulty of detecting which tuples are duplicates. As was hinted at previously, this means we must track which intermediate and final results are derived from data processed at one of the failed nodes. We tag each tuple in the system with the set of nodes that have processed it (or any tuple used to create it), and maintain these sets of nodes as the tuples propagate their way through the operator graph. As we validate experimentally, this can be done with minimal overhead.
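A minimal sketch of this tagging scheme, with an assumed tuple representation: every tuple carries the set of node identifiers that processed it or any tuple used to derive it; operators union these sets as results are produced, and after a failure a tuple is considered tainted if its node set intersects the failed set.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class TaggedTuple:
    """A tuple's values together with the set of nodes that contributed to it."""
    values: tuple
    nodes: frozenset = field(default_factory=frozenset)


def derive(values, inputs, processing_node):
    """Build a result tuple from input tuples processed at `processing_node`,
    unioning the node sets of all inputs with the current node."""
    nodes = frozenset({processing_node}).union(*(t.nodes for t in inputs))
    return TaggedTuple(values, nodes)


def is_tainted(t, failed_nodes):
    """True if any failed node contributed to this tuple."""
    return bool(t.nodes & failed_nodes)
```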
We divide incremental recomputation into four stages.
Determine change in assignment of ranges to nodes.
When a node or set of nodes fail, other nodes "inherit" a portion of the hash key space from the failed nodes. The query initiator computes a new routing table from the original one, assigning the ranges owned by the failed nodes to remaining ones. If the failed nodes' data is available on more than one replica, the initiator will evenly divide among them the task of recomputing the missing answers.
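One plausible way to derive the new routing table (the data structures and the round-robin split over a failed node's ranges are assumptions for illustration, not the paper's exact policy):

```python
def reassign(routing, failed, replicas_of):
    """Derive a new routing table after failures.

    routing:     dict mapping each hash range to its owning node
    failed:      set of failed node identifiers
    replicas_of: dict mapping a node to the list of nodes holding replicas of its data
    """
    new_routing = {}
    counters = {}  # round-robin position per failed node
    for hash_range, node in routing.items():
        if node not in failed:
            new_routing[hash_range] = node
            continue
        # Divide the failed node's ranges evenly among its live replicas
        # (assumes at least one live replica exists for each failed node).
        replicas = [r for r in replicas_of[node] if r not in failed]
        i = counters.get(node, 0)
        new_routing[hash_range] = replicas[i % len(replicas)]
        counters[node] = i + 1
    return new_routing
```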
Drop all intermediate results dependent on data from the failed nodes. To prevent duplicate answers, we scan the internal state of all operators and discard any tuples that are tagged as having passed through a failed node (we term these tainted tuples). It is critical that any state not dependent on the failed nodes remains available. This is easy to accomplish with join operator state. For aggregate operators, we partition each group into sub-groups that summarize the effects of all of the tuples for each possible set of contributing nodes, and drop the sub-groups for failed nodes. While the number of sub-groups is exponential in the number of rehashes (for $n$ nodes and $m$ rehashes, $\sum_{k=1}^{m+1} \binom{n}{k}$), this number is typically small; critically, it does not depend on the number of input tuples. Tuples that are in-flight between operators (or crossing the network) must also be filtered in this way.
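An illustrative sketch of the sub-group idea for a SUM aggregate (the class and method names are assumptions): within each group, a partial sum is kept per distinct set of contributing nodes, so that on failure only the sub-groups touching a failed node are dropped while the rest of the aggregate state survives intact.

```python
from collections import defaultdict


class SumWithSubgroups:
    """SUM aggregate whose per-group state is partitioned by contributing node set."""

    def __init__(self):
        # group key -> {frozenset of contributing nodes -> partial sum}
        self.state = defaultdict(lambda: defaultdict(int))

    def add(self, key, value, nodes):
        self.state[key][frozenset(nodes)] += value

    def drop_tainted(self, failed_nodes):
        # Discard only the sub-groups whose contributing nodes include a failed node.
        for key, subgroups in self.state.items():
            for node_set in [s for s in subgroups if s & failed_nodes]:
                del subgroups[node_set]

    def result(self, key):
        return sum(self.state[key].values())
```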
Restart leaf-level operations for the failed nodes' hash key space ranges. We restart leaf-level operations such as table scans, re-producing any data that would have originated at the failed nodes. As the data propagate through the system, they will be re-processed against the data from other nodes,