A formal model for information selection in multi-sentence t(3)
Published: 2021-06-08
Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units wi
feasible. For many relatively simple representations of concepts, this restriction poses no difficulties: the concept is uniquely identified and can be recognized as present or absent in a text passage. However, it is possible that the concepts have some structure and can be decomposed into more elementary conceptual units, or that partial matches between concepts and text are natural. For example, if the conceptual units represent named entities (a common occurrence in list-type long answers), a partial match between a name found in a text and another name is possible; handling these two names as distinct concepts would be inaccurate. Similarly, an event can be represented as a concept with components corresponding to participants, time, location, and action, with only some of these components found in a particular piece of text.
Partial matches between textual and conceptual units introduce a new problem, however: if two textual units partially cover the same concept, it is not apparent to what extent the coverage overlaps. Thus, there are multiple ways to revise equation (1) in order to account for partial matches, depending on how conservative we are about the expected overlap. One such way is to assume that the partial coverages overlap as much as possible, so that a concept is credited only with its best single match (the most conservative assumption), and define the total information in the summary as
$$I(S) = \sum_{i=1}^{m} w_i \cdot \max_{j} f(t_j, c_i) \tag{2}$$
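As a minimal sketch of equation (2), the Python below assumes that the mapping $f$ is stored as a hypothetical nested list `f[j][i]` giving the degree in [0, 1] to which textual unit $t_j$ covers concept $c_i$, and that the maximum runs over the units selected in $S$. The function name, the data layout, and the numbers are illustrative assumptions, not taken from the paper.

```python
def info_max(selected, f, weights):
    """Equation (2): sum over concepts of w_i * max_j f(t_j, c_i), j in S."""
    total = 0.0
    for i, w in enumerate(weights):
        # Credit each concept only with its best single match in S.
        best = max((f[j][i] for j in selected), default=0.0)
        total += w * best
    return total


# Hypothetical toy data: 3 textual units (rows), 2 concepts (columns).
f = [
    [0.5, 0.0],
    [0.25, 1.0],
    [0.5, 0.5],
]
weights = [2.0, 1.0]

print(info_max({0, 1}, f, weights))  # 2.0 * 0.5 + 1.0 * 1.0 = 2.0
print(info_max({0, 2}, f, weights))  # 2.0 * 0.5 + 1.0 * 0.5 = 1.5
```

Under the max operator, adding a second unit that partially covers an already partially covered concept raises the score only if its match is stronger than the existing one.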
An alternative is to consider that $f(t_j, c_i)$ represents the extent of the [0, 1] interval corresponding to concept $c_i$ that $t_j$ covers, and to assume that the coverage is spread over that interval uniformly and independently across textual units. Then the combined coverage of two textual units $t_j$ and $t_k$ is

$$f(t_j, c_i) + f(t_k, c_i) - f(t_j, c_i) \cdot f(t_k, c_i)$$

This operator can be naturally extended to more than two textual units and plugged into equation (2) in place of the max operator, resulting in an equation we will refer to as equation (3). Note that both of these equations reduce to our original formula for information content (equation (1)) if the mapping function $f$ only produces 0 and 1 values.
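One way to read this extension, offered as an assumption rather than the paper's stated form, is to fold the pairwise operator over all selected units. Since a + b − ab = 1 − (1 − a)(1 − b), the fold equals one minus the product of the uncovered fractions. The sketch below assumes a hypothetical nested list `f[j][i]` holding the coverage of concept $c_i$ by textual unit $t_j$; all names and numbers are illustrative.

```python
from functools import reduce


def combine(a, b):
    """Combined coverage of two units under the independence assumption."""
    return a + b - a * b


def info_independent(selected, f, weights):
    """Sum over concepts of w_i times the folded pairwise coverage in S."""
    total = 0.0
    for i, w in enumerate(weights):
        # Fold the pairwise operator over every selected unit, starting at 0.
        coverage = reduce(combine, (f[j][i] for j in selected), 0.0)
        total += w * coverage
    return total


# Hypothetical toy data: 2 textual units (rows), 2 concepts (columns).
f = [[0.5, 0.0], [0.5, 1.0]]
weights = [2.0, 1.0]
print(info_independent([0, 1], f, weights))  # 2.0 * 0.75 + 1.0 * 1.0 = 2.5

# With a 0/1 mapping, each concept counts once, as with the max operator.
binary = [[1, 0], [1, 1]]
print(info_independent([0, 1], binary, weights))  # 2.0 * 1 + 1.0 * 1 = 3.0
```

In the first call, two half-matches of the same concept yield a coverage of 0.75, whereas the max operator would credit only 0.5; in the binary case the two operators agree.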
2.3 Length and textual constraints
We have provided formulae that measure the information covered by a collection of textual units under different mapping constraints. Obviously, we want to maximize this information content. However, this can only sensibly happen when additional constraints on the number or length of the selected textual units are introduced; otherwise, the full set of available textual units would be a solution that yields a maximal value for equations (1)–(3), i.e., $\forall S \subset T,\ I(S) \le I(T)$. We achieve this by assigning a cost $p_i$ to each textual unit $t_i$, $i = 1, \ldots, n$, and defining a function $P$ over a set of textual units that provides the total penalty associated with selecting those textual units as the output. In our abstraction, replacing a textual unit with one or more textual units that provide the same content should only affect the penalty, and it makes sense to assign the same cost to a long sentence as to two sentences produced by splitting the original sentence. Also, a shorter sentence should be preferable to a longer sentence with the same information content. Hence, our operational definitions for $p_i$ and $P$ are
$$p_i = \mathrm{length}(t_i), \qquad P(S) = \sum_{t_i \in S} p_i$$
i.e., the total penalty is equal to the total length of the answer in some basic unit (e.g., words).
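A minimal sketch of this penalty follows, assuming whitespace-separated tokens as the basic unit; the tokenization, the function names, and the example sentences are illustrative choices, not something the paper specifies. It also illustrates the property that motivates the definition: splitting a sentence in two leaves the total cost unchanged.

```python
def cost(unit):
    """p_i = length(t_i), measured in whitespace-separated words."""
    return len(unit.split())


def penalty(selected_units):
    """P(S): total length of the selected textual units."""
    return sum(cost(u) for u in selected_units)


# Hypothetical example: one sentence versus the same text split in two.
whole = ["The storm hit the coast and thousands lost power"]
split = ["The storm hit the coast", "and thousands lost power"]

print(penalty(whole))  # 9
print(penalty(split))  # 9
```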
Note, however, that in the general case the $p_i$'s need not depend solely on length, and the total penalty does not need to be a linear combination of them. The cost function can depend on features other than length, for example, the number of pronouns: the more pronouns used in a textual unit, the higher the risk of dangling references and the higher the price should be. Finding the best cost function is an interesting research problem by itself.

With the introduction of the cost function $P(S)$, our model has two generally competing components. One approach is to set a limit on $P(S)$ and optimize $I(S)$ while keeping $P(S)$ under that limit. This approach is similar to that taken in evaluations that keep the length of the output summary within certain bounds, such as the recent major summarization evaluations in the Document Understanding Conferences from 2001 to the present (Harman and Voorhees, 2001). Another approach would be to combine the two components and assign a composite score to each summary, essentially mandating a specific tradeoff between recall and precision; for example, the total score can be defined as a linear combination of $I(S)$ and $P(S)$, in which case the weights specify the relative importance of coverage and precision/brevity, as well as accounting for scale differences between the two metrics. This approach is similar to the calculation of recall, precision, and F-measure adopted in the recent NIST evaluation of long answers for definitional questions (Voorhees, 2003). In this paper, we will follow the first tactic of maximizing $I(S)$ with a limit on $P(S)$ rather than attempting to solve the thorny issues of