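Graduation project source text for English-to-Chinese translation: Collaborative Data Sharing Systems, Reliable Storage and…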

Date: 2026-05-08

Reliable Storage and Querying for Collaborative Data Sharing Systems

Nicholas E. Taylor and Zachary G. Ives

Computer and Information Science Department, University of Pennsylvania

Philadelphia, PA, U.S.A.

{netaylor,zives}@cis.upenn.edu

Abstract—The sciences, business confederations, and medicine urgently need infrastructure for sharing data and updates among collaborators’ constantly changing, heterogeneous databases. The ORCHESTRA system addresses these needs by providing data transformation and exchange capabilities across DBMSs, combined with archived storage of all database versions. ORCHESTRA adopts a peer-to-peer architecture in which individual collaborators contribute data and compute resources, but where there may be no dedicated server or compute cluster.

We study how to take the combined resources of ORCHESTRA’s autonomous nodes, as well as PCs from “cloud” services such as Amazon EC2, and provide reliable, cooperative storage and query processing capabilities. We guarantee reliability and correctness as in distributed or cloud DBMSs, while also supporting cross-domain deployments, replication, and transparent failover, as provided by peer-to-peer systems. Our storage and query subsystem supports dozens to hundreds of nodes across different domains, possibly including nodes on cloud services.

Our contributions include (1) a modified data partitioning substrate that combines cluster and peer-to-peer techniques, (2) an efficient implementation of replicated, reliable, versioned storage of relational data, (3) new query processing and indexing techniques over this storage layer, and (4) a mechanism for incrementally recomputing query results that ensures correct, complete, and duplicate-free results in the event of node failure during query execution. We experimentally validate query processing performance, failure detection methods, and the performance benefits of incremental recovery in a prototype implementation.

I. INTRODUCTION

There is a pressing need today in the sciences, medicine, and even business for tools that enable autonomous parties to collaboratively share and edit data, such as information on the genome and its functions, patient records, or component designs shared across multiple teams. Such collaborations are often characterized by diversity across groups, resulting in different data representations and even different beliefs about some data (such as competing hypotheses or diagnoses from the same observations). Data is added and annotated by different participants, and occasionally existing items are revised or corrected; all such changes may need to be propagated to others. To maintain a record across changes, different versions of the data may need to be archived. In these collaborative settings, there is often no single authority, nor global IT group, to manage the infrastructure. Hence, it may be economically or politically infeasible to create centralized services in support of data transformation, change propagation, and archival.

To address these needs, we have been developing the ORCHESTRA collaborative data sharing system (CDSS) [1]. Briefly, ORCHESTRA adopts a peer-to-peer architecture for data sharing, where each individual participant owns a local DBMS with its own preferred schema, makes updates over this DBMS, and periodically publishes updates to others. Then the participant translates others’ published updates to its own schema via schema mappings and imports them. ORCHESTRA especially targets scientific data sharing applications such as those in the life sciences, where data sets are typically in the GB to 10s of GB, and changes are published periodically and primarily consist of new data insertions.
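The publish-and-import cycle described above can be illustrated with a minimal Python sketch. It is not ORCHESTRA's implementation: the `Peer` and `Update` classes, the in-memory `log` standing in for shared published storage, and the representation of a schema mapping as a plain Python function are all assumptions made for illustration, and only insertions are modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
# Hypothetical mapping table: (source peer, source relation) ->
# (target relation, function translating a row into the local schema).
Mappings = Dict[Tuple[str, str], Tuple[str, Callable[[Row], Row]]]


@dataclass
class Update:
    """One published change; this sketch models insertions only."""
    peer: str
    relation: str
    row: Row


@dataclass
class Peer:
    name: str
    # Stand-in for the local DBMS: relation name -> rows in this peer's schema.
    db: Dict[str, List[Row]] = field(default_factory=dict)
    # Local updates made since the last publish.
    pending: List[Update] = field(default_factory=list)
    # How far into the shared log this peer has already imported.
    imported_upto: int = 0

    def insert(self, relation: str, row: Row) -> None:
        """Apply an update locally and remember it for the next publish."""
        self.db.setdefault(relation, []).append(row)
        self.pending.append(Update(self.name, relation, row))

    def publish(self, log: List[Update]) -> None:
        """Append all pending local updates to the shared log."""
        log.extend(self.pending)
        self.pending = []

    def import_updates(self, log: List[Update], mappings: Mappings) -> None:
        """Translate others' newly published updates and apply them locally."""
        for upd in log[self.imported_upto:]:
            if upd.peer == self.name:
                continue  # our own updates are already in the local DBMS
            mapping = mappings.get((upd.peer, upd.relation))
            if mapping is None:
                continue  # no schema mapping covers this relation
            target_relation, translate = mapping
            self.db.setdefault(target_relation, []).append(translate(upd.row))
        self.imported_upto = len(log)


log: List[Update] = []
alice, bob = Peer("alice"), Peer("bob")

alice.insert("Gene", {"gene_id": "g1", "function": "kinase"})
alice.publish(log)

# Bob stores the same information under a different schema.
bob_mappings: Mappings = {
    ("alice", "Gene"): (
        "Annotation",
        lambda r: {"id": r["gene_id"], "role": r["function"]},
    )
}
bob.import_updates(log, bob_mappings)
print(bob.db["Annotation"])
```

Tracking `imported_upto` keeps a repeated import from applying the same published update twice, which matches the periodic, batch-oriented publishing described in the text. Conflict resolution and declarative mappings, which the paper attributes to earlier ORCHESTRA work, are deliberately left out.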

Previous work on ORCHESTRA has developed the upper layers of our system architecture: strategies and algorithms for resolving conflicts [2], and for generating the necessary queries to propagate data and updates across sites or peers [3]. Such work temporarily used a centralized DBMS to handle storage and query processing. In this paper, we complete the picture, with a highly scalable and reliable versioned storage and query processing system for ORCHESTRA, which does not require dedicated server machines. Rather, we employ the existing CDSS nodes, possibly in combination with machines leased as needed from cloud services such as Amazon EC2.

Our goal is to provide the benefits of peer-to-peer architectures [4], [5], [6], [7], [8] (such as support for autonomous domains with no common filesystem, transparent handling of membership changes, and plug-and-play operation), hybridized with the benefits commonly associated with traditional parallel DBMSs and with emerging cloud data management platforms [9], [10], [11], [12] (such as efficient data partitioning, automatic failover and partial recomputation, and guarantees of complete answers). We avoid what we perceive to be the negative aspects of each architecture: the lack of completeness or consistency guarantees in peer-to-peer query systems, and requirements for shared filesystems and centralized administration in the existing cloud data management services (e.g., Google’s GFS [9], Amazon’s S3 [12]).

To accomplish this, we exploit the fact that our system does not need all of the properties provided by existing distributed substrates. Our problem space is less prone to “churn” than a traditional peer-to-peer system like a distributed hash table: we assume that membership in a CDSS, while not completely stable, consists of perhaps dozens to hundreds of participants at academic institutions or corporations, with good bandwidth and relatively stable machines. We support archived storage of data under a batch-oriented update load:
