A formal model for information selection in multi-sentence t(2)

Date: 2026-04-30

Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units wi

ural language but not handled by current methods for reducing redundancy.

2 Formal Model for Information Selection and Packing

Our model for selecting and packing information across multiple text units relies on three components that are specified by each application. First, we assume that there is a finite set $T$ of textual units $t_1, t_2, \ldots, t_n$, a subset of which will form the answer or summary. For most approaches to summarization and question answering, which follow the extraction paradigm, the textual units $t_i$ will be obtained by segmenting the input text(s) at an application-specified granularity level, so each $t_i$ would typically be a sentence or paragraph.

Second, we posit the existence of a finite set $C$ of conceptual units $c_1, c_2, \ldots, c_m$. The conceptual units encode the information that should be present in the output, and they can be defined in different ways according to the task at hand and the priorities of each system. Obviously, defining the appropriate conceptual units is a core problem, akin to feature selection in machine learning: there is no exact definition of what an important concept is that would apply to all tasks. Current summarization systems often represent concepts indirectly via textual features that give high scores to the textual units that contain important information and should be used in the summary, and low scores to those textual units that are unlikely to contain information worth including in the final output. Thus, many summarization approaches use as conceptual units lexical features like tf*idf weighting of words in the input text(s), words used in the titles and section headings of the source documents (Luhn, 1959; Edmundson, 1968), or certain cue phrases like "significant", "important", and "in conclusion" (Kupiec et al., 1995; Teufel and Moens, 1997).

Conceptual units can also be defined out of more basic conceptual units, based on the co-occurrence of important concepts (Barzilay and Elhadad, 1997) or syntactic constraints between representations of concepts (Hatzivassiloglou et al., 2001). Conceptual units do not have to be directly observable as text snippets; they can represent abstract properties that particular text units may or may not satisfy, for example, status as a first sentence in a paragraph or, more generally, position in the source text (Lin and Hovy, 1997). Some summarization systems assume that the importance of a sentence is derivable from a rhetorical representation of the source text (Marcu, 1997), while others leverage information from multiple texts to re-score the importance of conceptual units across all the sources (Hatzivassiloglou et al., 2001).

No matter how these important concepts are defined, different systems use text-observable features that either correspond to the concepts of interest (e.g., words and their frequencies) or point out those text units that potentially contain important concepts (e.g., position or discourse properties of the text unit in the source document). The former class of features can be directly converted to conceptual units in our representation, while the latter can be accounted for by postulating abstract conceptual units associated with a particular status (e.g., first sentence) for a particular textual unit. We assume that each conceptual unit has an associated importance weight $w_i$ that indicates how important unit $c_i$ is to the overall summary or answer.
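As an illustration of the two feature classes just described, the following Python sketch builds a set of conceptual units with importance weights from a list of textual units. It is only a minimal example under stated assumptions: the function name `build_conceptual_units`, the use of raw word frequency as the weight, and the value `first_sentence_weight=2.0` are all hypothetical choices and are not prescribed by the model.

```python
from collections import Counter


def build_conceptual_units(textual_units, first_sentence_weight=2.0):
    """Return a dict mapping each conceptual unit c_i to its weight w_i.

    Assumptions (illustrative only): word concepts are weighted by their
    raw frequency over all textual units, and a single abstract concept
    represents "first sentence" status with a hand-picked weight.
    """
    counts = Counter(
        word for unit in textual_units for word in unit.lower().split()
    )
    # Directly observable features: one conceptual unit per distinct word.
    weights = {("word", word): float(n) for word, n in counts.items()}
    # Abstract conceptual unit: a status a textual unit may or may not have.
    weights[("status", "first_sentence")] = first_sentence_weight
    return weights


units = ["the storm hit the coast", "the coast flooded"]
weights = build_conceptual_units(units)
```

Here each conceptual unit is a `(kind, value)` pair, so word-based and status-based concepts can live in the same set $C$ while remaining distinguishable.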

2.1 A first model: Full correspondence

Having formally defined the sets $T$ and $C$ of textual and conceptual units, the part that remains in order to have the complete picture of the constraints given by the data and summarization approach is the mapping between textual units and conceptual units. This mapping, a function $f: T \times C \to [0, 1]$, tells us how well each conceptual unit is covered by a given textual unit. Presumably, different approaches will assign different coverage scores even for the same sentences and conceptual units, and the consistency and quality of these scores would be one way to determine the success of each competing approach. We first examine the case where the function $f$ is limited to zero or one values, i.e., each textual unit either contains/matches a given conceptual feature or not. This is the case with many simple features, such as words and sentence position. Then, we define the total information covered by any given subset $S$ of $T$ (a proposed summary or answer) as

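$$I(S) = \sum_{i=1}^{m} w_i \cdot \delta_i \qquad (1)$$

where $w_i$ is the weight of the concept $c_i$ and

$$\delta_i = \begin{cases} 1, & \text{if } \exists\, t_j \in S \text{ such that } f(t_j, c_i) = 1 \\ 0, & \text{otherwise} \end{cases}$$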

In other words, the information contained in a summary is the sum of the weights of the conceptual units covered by at least one of the textual units included in the summary.
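The definition above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: it assumes conceptual units are single words, that the binary mapping $f$ is simple word membership (the hypothetical helper `word_coverage`), and that the weights are made-up values.

```python
def word_coverage(textual_unit, concept):
    """Binary f(t_j, c_i): 1 if the word `concept` occurs in the unit, else 0."""
    return 1 if concept in textual_unit.split() else 0


def information_covered(summary, weights, f=word_coverage):
    """I(S): sum of w_i over concepts covered by at least one unit in S."""
    total = 0.0
    for concept, weight in weights.items():
        # delta_i is 1 as soon as any selected unit covers the concept.
        delta = 1 if any(f(unit, concept) == 1 for unit in summary) else 0
        total += weight * delta
    return total


# Hypothetical weights and textual units, for illustration only.
weights = {"storm": 3.0, "coast": 2.0, "flood": 1.0}
t1 = "storm hits coast"
t2 = "coast flood warning"
t3 = "storm storm storm"

score_single = information_covered([t1], weights)
score_complementary = information_covered([t1, t2], weights)
score_redundant = information_covered([t1, t3], weights)
```

Because each concept's weight is counted at most once, adding `t3` to a summary that already contains `t1` leaves the score unchanged, whereas adding `t2` contributes the weight of the one concept not yet covered. This is how the measure accounts for repeated information.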

2.2 Partial correspondence between textual and conceptual units

Depending on the nature of the conceptual units, the assumption of a 0-1 mapping between textual and conceptual units may or may not be practical or even
