A formal model for information selection in multi-sentence t(2)
Published: 2021-06-08
Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units wi
ural language but not handled by current methods for reducing redundancy.
2 Formal Model for Information Selection and Packing
Our model for selecting and packing information across multiple text units relies on three components that are specified by each application. First, we assume that there is a finite set $T$ of textual units $t_1, t_2, \ldots, t_n$, a subset of which will form the answer or summary. For most approaches to summarization and question answering, which follow the extraction paradigm, the textual units $t_i$ will be obtained by segmenting the input text(s) at an application-specified granularity level, so each $t_i$ would typically be a sentence or paragraph.
Second, we posit the existence of a finite set $C$ of conceptual units $c_1, c_2, \ldots, c_m$. The conceptual units encode the information that should be present in the output, and they can be defined in different ways according to the task at hand and the priorities of each system. Obviously, defining the appropriate conceptual units is a core problem, akin to feature selection in machine learning: there is no exact definition of what an important concept is that would apply to all tasks. Current summarization systems often represent concepts indirectly via textual features that give high scores to the textual units that contain important information and should be used in the summary, and low scores to those textual units that are unlikely to contain information worth including in the final output. Thus, many summarization approaches use as conceptual units lexical features like tf*idf weighting of words in the input text(s), words used in the titles and section headings of the source documents (Luhn, 1959; H. P. Edmundson, 1968), or certain cue phrases like significant, important and in conclusion (Kupiec et al., 1995; Teufel and Moens, 1997). Conceptual units can also be defined out of more basic conceptual units, based on the co-occurrence of important concepts (Barzilay and Elhadad, 1997) or syntactic constraints between representations of concepts (Hatzivassiloglou et al., 2001).
Conceptual units do not have to be directly observable as text snippets; they can represent abstract properties that particular text units may or may not satisfy, for example, status as a first sentence in a paragraph or, more generally, position in the source text (Lin and Hovy, 1997). Some summarization systems assume that the importance of a sentence is derivable from a rhetorical representation of the source text (Marcu, 1997), while others leverage information from multiple texts to re-score the importance of conceptual units across all the sources (Hatzivassiloglou et al., 2001).
No matter how these important concepts are defined, different systems use text-observable features that either correspond to the concepts of interest (e.g., words and their frequencies) or point out those text units that potentially contain important concepts (e.g., position or discourse properties of the text unit in the source document). The former class of features can be directly converted to conceptual units in our representation, while the latter can be accounted for by postulating abstract conceptual units associated with a particular status (e.g., first sentence) for a particular textual unit. We assume that each conceptual unit has an associated importance weight $w_i$ that indicates how important unit $c_i$ is to the overall summary or answer.
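The two classes of features described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and not part of the model's definition: the function name `build_coverage`, the `FIRST_SENTENCE` label for an abstract positional unit, the naive whitespace tokenization, the example sentences, and the weight values. Word concepts are matched directly against the text, while the positional concept is attached to a textual unit by its status.

```python
def build_coverage(sentences, weights):
    """Return, for each sentence, the set of conceptual units it covers.

    `weights` maps each conceptual unit to its importance weight.
    Lowercase keys are treated as word concepts; the hypothetical key
    "FIRST_SENTENCE" is an abstract unit tied to position, not to text.
    """
    coverage = []
    for idx, sentence in enumerate(sentences):
        # Naive tokenization (an assumption): split on whitespace,
        # strip surrounding punctuation, lowercase.
        words = {w.strip(".,;:!?").lower() for w in sentence.split()}
        # Directly observable concepts: words present in the sentence.
        covered = {c for c in weights if c in words}
        # Abstract concept: status as the first sentence of the input.
        if idx == 0 and "FIRST_SENTENCE" in weights:
            covered.add("FIRST_SENTENCE")
        coverage.append(covered)
    return coverage


sentences = [
    "The earthquake struck the city at dawn.",
    "Rescue teams reached the city by noon.",
]
# Illustrative weights only; a real system would derive them from data.
weights = {"earthquake": 3.0, "city": 1.0, "rescue": 2.0, "FIRST_SENTENCE": 0.5}
coverage = build_coverage(sentences, weights)
```

Here the first sentence covers the word concepts `earthquake` and `city` plus the abstract positional unit, while the second covers `rescue` and `city`.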
2.1 A first model: Full correspondence
Having formally defined the sets $T$ and $C$ of textual and conceptual units, the part that remains in order to have the complete picture of the constraints given by the data and summarization approach is the mapping between textual units and conceptual units. This mapping, a function $f: T \times C \to [0,1]$, tells us how well each conceptual unit is covered by a given textual unit. Presumably, different approaches will assign different coverage scores even for the same sentences and conceptual units, and the consistency and quality of these scores would be one way to determine the success of each competing approach.
We first examine the case where the function $f$ is limited to zero or one values, i.e., each textual unit either contains/matches a given conceptual feature or not. This is the case with many simple features, such as words and sentence position. Then, we define the total information covered by any given subset $S$ of $T$ (a proposed summary or answer) as
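$$I(S) = \sum_{i=1,\ldots,m} w_i \cdot \delta_i \qquad (1)$$
where $w_i$ is the weight of the concept $c_i$ and
$$\delta_i = \begin{cases} 1, & \text{if } \exists\, t_j \in S \text{ such that } f(t_j, c_i) = 1 \\ 0, & \text{otherwise} \end{cases}$$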
In other words, the information contained in a summary is the sum of the weights of the conceptual units covered by at least one of the textual units included in the summary.
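As a small worked illustration of this definition, the following Python sketch computes the information of a candidate summary under the 0-1 mapping. The function name `information`, the unit labels `t1` to `t3` and `c1` to `c3`, and the weight values are hypothetical; the binary mapping $f$ is represented, as an assumption, by a dictionary from each textual unit to the set of conceptual units it covers.

```python
def information(summary, coverage, weights):
    """Sum the weights of the conceptual units covered by at least one
    textual unit in `summary` (binary coverage only).

    `coverage[t]` is the set of conceptual units c with f(t, c) = 1.
    """
    covered = set()
    for t in summary:
        covered |= coverage[t]  # a unit covered twice is still counted once
    return sum(weights[c] for c in covered)


# Hypothetical weights and coverage, for illustration only.
weights = {"c1": 3.0, "c2": 1.0, "c3": 2.0}
coverage = {
    "t1": {"c1", "c2"},
    "t2": {"c2", "c3"},
    "t3": {"c2"},
}

info_a = information({"t1", "t3"}, coverage, weights)  # c1 and c2 -> 4.0
info_b = information({"t1", "t2"}, coverage, weights)  # c1, c2, c3 -> 6.0
```

Because the covered units are collected in a set, adding `t3` to a summary that already contains `t1` contributes nothing: `c2` is counted once no matter how many selected textual units repeat it, which is how the model accounts for redundancy.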
2.2 Partial correspondence between textual and conceptual units
Depending on the nature of the conceptual units, the assumption of a 0-1 mapping between textual and conceptual units may or may not be practical or even