2008 FAST: Avoiding the Disk Bottleneck in the Data Domain Deduplication File System


(4) with both Summary Vector and Locality Preserved Caching. The results are shown in Table 4.

Table 4: Index and locality reads.

Configuration                                      Exchange: # disk I/Os (% of total)   Engineering: # disk I/Os (% of total)
No Summary Vector, no Locality Preserved Caching   328,613,503  (100.00%)               318,236,712  (100.00%)
Summary Vector only                                274,364,788  (83.49%)                259,135,171  (81.43%)
Locality Preserved Caching only                    57,725,844   (17.57%)                60,358,875   (18.97%)
Summary Vector and Locality Preserved Caching      3,477,129    (1.06%)                 1,257,316    (0.40%)

This table shows the number of disk reads needed to perform index lookups or fetches from the container metadata for the four combinations: with and without the Summary Vector, and with and without Locality Preserved Caching. Without either the Summary Vector or Locality Preserved Caching, there is an index read for every segment. The Summary Vector avoids these reads for most new segments. Locality Preserved Caching avoids index lookups for duplicate segments at the cost of an extra read to fetch a group of segment fingerprints from the container metadata for every cache miss for which the segment is found in the index.

Clearly, the Summary Vector and Locality Preserved Caching combined have produced an astounding reduction in disk reads. The Summary Vector alone reduces the index lookup disk I/Os by about 16.5% and 18.6% for the exchange and engineering data respectively. Locality Preserved Caching alone reduces the index lookup disk I/Os by about 82.4% and 81% respectively. Together they reduce the index lookup disk I/Os by 98.94% and 99.6% respectively.

In general, the Summary Vector is very effective for new data, and Locality Preserved Caching is highly effective for little or moderately changed data. For backup data, the first full backup (the equivalent of seeding) does not contain as many duplicate data segments as subsequent full backups. As a result, the Summary Vector avoids disk I/Os for index lookups during the first full backup, whereas Locality Preserved Caching is highly beneficial for subsequent full backups. This result also suggests that these two datasets exhibit good duplicate locality.
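To make the interplay concrete, the following is a minimal sketch of the lookup path described above; it is our illustration, not the authors' implementation. SummaryVector is a toy Bloom filter (the bit-array size and four hash probes are illustrative), dictionaries stand in for the on-disk index and container metadata, and disk_reads counts the reads that Table 4 tallies.

import hashlib

class SummaryVector:
    """Toy Bloom filter over segment fingerprints (4 hash probes)."""
    def __init__(self, bits=1 << 20):
        self.bits = bits
        self.array = bytearray(bits // 8)

    def _probes(self, fp):
        for i in range(4):
            h = int.from_bytes(hashlib.sha1(fp + bytes([i])).digest()[:8], "big")
            yield h % self.bits

    def add(self, fp):
        for b in self._probes(fp):
            self.array[b // 8] |= 1 << (b % 8)

    def __contains__(self, fp):
        return all(self.array[b // 8] & (1 << (b % 8)) for b in self._probes(fp))

class DedupLookup:
    def __init__(self):
        self.summary = SummaryVector()
        self.cache = {}       # Locality Preserved Caching: fingerprint -> container id
        self.index = {}       # stand-in for the on-disk fingerprint index
        self.containers = {}  # stand-in for container metadata (fingerprint groups)
        self.disk_reads = 0   # the quantity Table 4 counts

    def store(self, fp, cid):
        # Called when a new segment is written into container `cid`.
        self.summary.add(fp)
        self.index[fp] = cid
        self.containers.setdefault(cid, []).append(fp)

    def is_duplicate(self, fp):
        if fp not in self.summary:   # definitive "new": zero disk reads
            return False
        if fp in self.cache:         # duplicate with locality: zero disk reads
            return True
        self.disk_reads += 1         # index lookup goes to disk
        cid = self.index.get(fp)
        if cid is None:
            return False             # Bloom-filter false positive
        self.disk_reads += 1         # fetch the container's fingerprint group
        for g in self.containers[cid]:
            self.cache[g] = cid      # prefetch neighbors: later duplicates hit cache
        return True

# Example: ingesting the same segment twice touches disk only on the second
# lookup (one index read plus one fingerprint-group prefetch).
d = DedupLookup()
fp = hashlib.sha1(b"segment data").digest()
if not d.is_duplicate(fp):
    d.store(fp, cid=1)
assert d.is_duplicate(fp) and d.disk_reads == 2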


5.3 Throughput

To determine the throughput of the deduplication storage system, we used a synthetic dataset driven by client computers. The synthetic dataset was developed to model backup data from multiple backup cycles and multiple backup streams, where each backup stream can be generated on the same or a different client computer. The dataset is made up of synthetic data generated on the fly from one or more backup streams. Each backup stream is made up of an ordered series of synthetic data versions, where each successive version ("generation") is a somewhat modified copy of the preceding generation in the series. The generation-to-generation modifications include data reordering, deletion of existing data, and addition of new data. Single-client backup over time is simulated by writing synthetic data generations from one backup stream to the deduplication storage system in generation order, where significant amounts of data are unchanged day-to-day or week-to-week but small changes continually accumulate. Multi-client backup over time is simulated by writing synthetic data generations from multiple streams to the deduplication system in parallel, each stream in generation order.
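The sketch below illustrates how such a generator can derive generation g+1 from generation g; it is our illustration, not the authors' generator. The region size (CHUNK) and the mutation fractions are illustrative assumptions, chosen so that most bytes survive unchanged from one generation to the next.

import os
import random

CHUNK = 64 * 1024  # illustrative region size, not taken from the paper

def next_generation(prev, reorder_frac=0.02, delete_frac=0.01, add_frac=0.02, rng=None):
    """Return the next generation: a mostly-unchanged copy of `prev`."""
    rng = rng or random.Random(0)
    gen = list(prev)
    # Reorder: swap a small fraction of region pairs.
    for _ in range(int(len(gen) * reorder_frac)):
        i, j = rng.randrange(len(gen)), rng.randrange(len(gen))
        gen[i], gen[j] = gen[j], gen[i]
    # Delete: drop a small fraction of existing regions.
    for _ in range(int(len(gen) * delete_frac)):
        gen.pop(rng.randrange(len(gen)))
    # Add: insert new, previously unseen data at random offsets.
    for _ in range(int(len(prev) * add_frac)):
        gen.insert(rng.randrange(len(gen) + 1), os.urandom(CHUNK))
    return gen

# A client simulates daily full backups by streaming generations in order;
# the data is generated in memory, so no client-side disk I/O is needed.
gen = [os.urandom(CHUNK) for _ in range(256)]  # seed (first full backup)
for day in range(7):
    gen = next_generation(gen)                 # mostly duplicate data
    # write b"".join(gen) to the deduplication system here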

There are two main advantages of using the synthetic dataset. The first is that various compression ratios can be built into the synthetic model, so usages approximating various real-world deployments can be tested easily in house. The second is that one can use relatively inexpensive client computers to generate an arbitrarily large amount of synthetic data in memory, without disk I/Os, and write it in one stream to the deduplication system at more than 100 MB/s. Multiple cheap client computers can combine in multiple streams to saturate the intake of the deduplication system in a switched network environment. We find it both much more costly and technically challenging to accomplish the same feat using traditional backup software, high-end client computers attached to primary storage arrays as backup clients, and high-end servers as media/backup servers. In our experiments, we choose an average per-generation (daily equivalent) global compression ratio of 30 and an average per-generation (daily equivalent) local compression ratio of 2 to 1 for each backup stream. These compression numbers seem plausible given the real-world examples in Section 5.1. We measure throughput for one …
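As a back-of-envelope reading of what those ratios imply (our arithmetic, not a figure from the paper, and assuming the global and local ratios compose multiplicatively into total compression):

# Our arithmetic, not the paper's: physical footprint of one generation
# under the chosen ratios, assuming they compose multiplicatively.
logical = 1 * 2**40   # assume a 1 TiB daily generation (illustrative)
global_ratio = 30     # daily global compression (duplicate elimination)
local_ratio = 2       # local compression of the unique segments
physical = logical / (global_ratio * local_ratio)
print(f"{physical / 2**30:.1f} GiB written per TiB backed up")  # ~17.1 GiB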

