Design and Implementation of an Indexing Subsystem (3)

Published: 2021-06-08

ABSTRACT

The CnX indexing subsystem is a complete index builder for Chinese XML data. It consists mainly of a Chinese-English semantic processing module, an inverted-index building module, and a scoring module that uses the Okapi BM25 algorithm, a probabilistic ranking model. This paper presents the design and implementation of CnX as a multi-threaded subsystem built on a client/server (C/S) architecture. Starting from modern information retrieval technology, it first considers approaches to retrieving XML data and then discusses the requirements analysis, design, and implementation of CnX. XML is a semi-structured data format, so building the inverted index requires deciding how to store the structural information of an XML document. An XML document is structured as a tree made up of element nodes, which can be divided into inner nodes and leaf nodes. Leaf nodes are usually taken to contain text content, while inner nodes usually do not. The text of the leaf nodes can be retrieved by full-text search, just like the content of a plain text file.
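The inner-node/leaf-node distinction can be illustrated with a short sketch. This is not the CnX implementation: the function name `leaf_texts`, the slash-separated path format, and the sample document are assumptions made for illustration, and the standard-library `xml.etree.ElementTree` parser stands in for whatever parser CnX uses.

```python
import xml.etree.ElementTree as ET


def leaf_texts(xml_source):
    """Return (path, text) pairs for every leaf element that carries text.

    An element with no child elements is treated as a leaf; elements with
    children are treated as inner nodes and contribute only to the path.
    """
    root = ET.fromstring(xml_source)
    results = []

    def walk(node, path):
        children = list(node)
        if not children:
            # Leaf node: keep its text content, if any.
            text = (node.text or "").strip()
            if text:
                results.append((path, text))
            return
        # Inner node: descend, extending the structural path.
        for child in children:
            walk(child, f"{path}/{child.tag}")

    walk(root, f"/{root.tag}")
    return results


# Hypothetical sample document.
sample = "<book><title>信息检索</title><chapter><para>倒排索引</para></chapter></book>"
pairs = leaf_texts(sample)
```

Each pair keeps the structural position of the text next to the text itself, which is the information an index over XML has to preserve in addition to the terms.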

Because CnX focuses on Chinese indexing, Chinese sentences must first undergo lexical analysis before full-text retrieval is possible; tag-term pairs are then built according to the structural information of the XML document. Before a virtual document object is built for an XML document, the in-memory tree must be adjusted to a consistent state. By processing this tree repeatedly, CnX finally stores the inverted index in a database system. Once the index is fully built, CnX scores the stored index with the Okapi BM25 algorithm for use by the upper-layer core procedures.
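A minimal sketch of indexing tag-term pairs and scoring them with BM25 follows. Several things here are assumptions rather than details from the paper: the `segment` function is a placeholder that splits on whitespace (a real system would call a Chinese lexical analyzer), the index lives in memory instead of a database, the parameters `k1 = 1.2` and `b = 0.75` are commonly used defaults, and the IDF uses the non-negative variant ln(1 + (N - n + 0.5) / (n + 0.5)) rather than any formula confirmed for CnX.

```python
import math
from collections import Counter, defaultdict

K1, B = 1.2, 0.75  # assumed defaults; the source does not give CnX's values


def segment(text):
    # Placeholder for Chinese lexical analysis: assumes the text has
    # already been segmented into whitespace-separated words.
    return text.split()


def build_index(docs):
    """docs maps doc_id -> list of (tag, text); returns (postings, doc_len)."""
    postings = defaultdict(Counter)  # (tag, term) -> {doc_id: term frequency}
    doc_len = {}                     # doc_id -> number of terms in the document
    for doc_id, fields in docs.items():
        doc_len[doc_id] = 0
        for tag, text in fields:
            for term in segment(text):
                postings[(tag, term)][doc_id] += 1
                doc_len[doc_id] += 1
    return postings, doc_len


def bm25(postings, doc_len, key):
    """Score every document containing the (tag, term) pair `key`."""
    if not doc_len:
        return {}
    n_docs = len(doc_len)
    avgdl = sum(doc_len.values()) / n_docs or 1.0
    plist = postings.get(key, {})
    n_match = len(plist)
    idf = math.log((n_docs - n_match + 0.5) / (n_match + 0.5) + 1.0)
    scores = {}
    for doc_id, tf in plist.items():
        norm = tf + K1 * (1 - B + B * doc_len[doc_id] / avgdl)
        scores[doc_id] = idf * tf * (K1 + 1) / norm
    return scores


# Hypothetical pre-segmented documents.
docs = {
    "d1": [("title", "倒排 索引"), ("para", "索引 构建 索引")],
    "d2": [("para", "中文 分词")],
}
postings, doc_len = build_index(docs)
scores = bm25(postings, doc_len, ("para", "索引"))
```

Keying the postings on the (tag, term) pair rather than on the term alone means that the same word under `title` and under `para` is indexed and scored separately, which is one simple way to carry structural information into the inverted index.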

The CnX indexing subsystem is a complete XML-based information retrieval system, and it plays an important role in building the overall information retrieval system.

Keywords: XML; Chinese Words; Inverted Index; Information Retrieval (IR)
