Design and Implementation of an Indexing Subsystem (3)
Published: 2021-06-08
Design and Implementation of an Indexing Subsystem
ABSTRACT
The CnX indexing subsystem is a complete index builder for Chinese XML data. It consists mainly of a Chinese-English semantic processing module, an inverted-index building module, and a scoring module that uses the Okapi BM25 algorithm, a probabilistic ranking model. This paper presents the design and implementation of CnX, a multi-threaded subsystem built on a client/server (C/S) architecture. Starting from modern information retrieval technology, the paper first considers approaches to retrieving XML data and then discusses the requirements analysis, design, and implementation of CnX. XML is semi-structured data, so building the inverted index requires deciding how to store the structural information of an XML document. An XML document is structured as a tree made up of element nodes, which can be divided into inner nodes and leaf nodes. Leaf nodes are usually considered to contain text content, whereas inner nodes usually do not. The text of the leaf nodes can therefore be retrieved by full-text search, just as with a plain text file.
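As a rough illustration of the tree view described above, the following Python sketch walks an XML document and collects the text of its leaf elements together with the path of tags leading to each one. It is not taken from CnX: the function name `leaf_texts`, the slash-separated path format, and the sample document are all assumptions made for this example.

```python
import xml.etree.ElementTree as ET


def leaf_texts(xml_string):
    """Return a list of (tag_path, text) pairs, one per non-empty leaf element.

    Hypothetical helper for illustration; CnX's own traversal is not shown
    in the source.
    """
    root = ET.fromstring(xml_string)
    results = []

    def walk(node, path):
        current = path + "/" + node.tag
        children = list(node)
        if not children:
            # Leaf node: this is where text content is expected to live.
            text = (node.text or "").strip()
            if text:
                results.append((current, text))
        else:
            # Inner node: carries structure only, so descend further.
            for child in children:
                walk(child, current)

    walk(root, "")
    return results


# Assumed sample document, used only to exercise the helper.
sample = "<book><title>信息检索</title><chapter><para>倒排索引</para></chapter></book>"
pairs = leaf_texts(sample)
```

Keeping the tag path next to each text fragment is one simple way to preserve structural information that a later indexing step can use.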
Because CnX focuses on Chinese indexing, Chinese sentences must first undergo lexical analysis before full-text retrieval; tag-term pairs are then built according to the structural information of the XML document. Before a virtual document object is built for an XML document, the in-memory tree must be adjusted to a consistent state. By repeatedly processing this tree, CnX finally stores the inverted index in a database system. Once the index has been built, CnX scores the stored index with the Okapi BM25 algorithm for use by the upper-level core procedures.
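The tag-term pairing and BM25 scoring described above can be sketched as follows. This is a minimal in-memory sketch under stated assumptions, not the CnX implementation: the text is assumed to be already segmented into terms, the postings are kept in a dictionary rather than a database, the IDF uses a common non-negative variant, and `k1 = 1.2` and `b = 0.75` are widely used defaults rather than values given in the source. The names `build_index` and `bm25` are hypothetical.

```python
import math
from collections import defaultdict


def build_index(docs):
    """Build postings keyed by (tag, term).

    docs maps doc_id -> list of (tag, [terms]) pairs, where the terms are
    assumed to come from an earlier Chinese lexical-analysis step.
    """
    postings = defaultdict(lambda: defaultdict(int))
    doc_len = {}
    for doc_id, fields in docs.items():
        length = 0
        for tag, terms in fields:
            for term in terms:
                postings[(tag, term)][doc_id] += 1  # term frequency
                length += 1
        doc_len[doc_id] = length
    return postings, doc_len


def bm25(postings, doc_len, key, doc_id, k1=1.2, b=0.75):
    """Okapi BM25 score of one (tag, term) key for one document."""
    plist = postings.get(key, {})
    tf = plist.get(doc_id, 0)
    if tf == 0:
        return 0.0
    n_docs = len(doc_len)
    avgdl = sum(doc_len.values()) / n_docs
    df = len(plist)
    # Non-negative IDF variant (assumption; other variants exist).
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    norm = tf + k1 * (1 - b + b * doc_len[doc_id] / avgdl)
    return idf * tf * (k1 + 1) / norm


# Assumed, pre-segmented sample data for illustration only.
docs = {
    "d1": [("title", ["信息", "检索"]), ("para", ["倒排", "索引"])],
    "d2": [("title", ["数据", "库"])],
}
postings, doc_len = build_index(docs)
score = bm25(postings, doc_len, ("title", "信息"), "d1")
```

Keying the postings by tag and term together lets a query restrict a match to a particular element, which is one way to make use of the stored structural information.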
The CnX indexing subsystem is a complete XML-based information retrieval system, and it plays an important role in building the overall information retrieval system.
Key words: XML; Chinese Words; Inverted Index; Information Retrieval (IR)