A formal model for information selection in multi-sentence text extraction
Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units, with an associated mapping between these two dimensions.
reducible to the classic set cover problem (given a finite set and a collection of subsets of that set, find the smallest subset of that collection whose members' union is equal to the original set) (Hochbaum, 1997). It follows that more general formulations of the cost function that actually are more realistic for our problem (such as defining the total cost as the sum of the lengths of the selected textual units and allowing the textual units to have different lengths) will also result in an NP-hard problem, as we can reduce these versions to the special case of maximum set coverage.
Nevertheless, the correspondence with maximum set coverage provides a silver lining. Since the problem is known to be NP-hard, properties of simple greedy algorithms have been explored, and a straightforward local maximization method has been proved to give solutions within a known bound of the optimal solution. The greedy algorithm for maximum set coverage is as follows: start with an empty solution S, and iteratively add to S the set T_i that maximizes I(S ∪ T_i). It is provable that this algorithm is the best polynomial approximation algorithm for the problem (Hochbaum, 1997), and that it achieves a solution bounded as follows:
$$ I(\mathrm{OPT}) \;\ge\; I(\mathrm{GREEDY}) \;\ge\; \left(1-\left(1-\tfrac{1}{k}\right)^{k}\right) I(\mathrm{OPT}) \;>\; \left(1-\tfrac{1}{e}\right) I(\mathrm{OPT}) \;\approx\; 0.6321 \times I(\mathrm{OPT}) $$
where I(OPT) is the information content of the optimal summary, I(GREEDY) is the information content of the summary produced by this greedy algorithm, and k is the number of textual units selected.
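As an illustration rather than the authors' original implementation, a minimal sketch of this basic greedy procedure might look as follows, assuming each textual unit is represented by the set of conceptual units it covers and each conceptual unit has an associated weight (the function and variable names are ours):

```python
def greedy_max_coverage(units, weights, k):
    """Greedy maximum set coverage: pick at most k textual units, each given
    as a set of conceptual-unit ids, maximizing the total weight of the
    covered conceptual units (the information content I(S))."""
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for idx, concepts in enumerate(units):
            if idx in selected:
                continue
            # marginal gain: weight of the conceptual units this unit adds
            gain = sum(weights[c] for c in concepts - covered)
            if gain > best_gain:
                best, best_gain = idx, gain
        if best is None:  # no remaining unit adds new information
            break
        selected.append(best)
        covered |= units[best]
    return selected
```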
For the more realistic case where cost is specified as the total length of the summary, and where we try to optimize I(S) with a limit on P(S) (see Section 2.3), we propose two greedy algorithms inspired by the algorithm above. Both our algorithms operate by first calculating a ranking of the textual units in decreasing order of importance. For the first algorithm, which we call the adaptive greedy algorithm, this ranking is identical to the ranking provided by the basic greedy algorithm, i.e., each textual unit receives as its score the increase in I(S) that it generates when added to the output, in the order specified by the basic greedy algorithm. Our second greedy algorithm (dubbed the modified greedy algorithm below) modifies this ranking by prioritizing the conceptual units with the highest individual weight w_i; it ranks first the textual unit that has the highest contribution to I(S) while covering the conceptual unit with the highest individual weight, and then iteratively proceeds with the textual unit that has the highest contribution to I(S) while covering the next most important unaccounted-for conceptual unit.
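The two rankings can be sketched as follows; this is our interpretation of the description above, not the authors' code, and the helper names are ours. Both functions take the same unit/weight representation as the basic greedy sketch above:

```python
def adaptive_greedy_ranking(units, weights):
    """Adaptive greedy: rank textual units in the order the basic greedy
    algorithm would pick them, i.e., by marginal increase in I(S)."""
    ranking, covered = [], set()
    remaining = set(range(len(units)))
    while remaining:
        best = max(remaining,
                   key=lambda i: sum(weights[c] for c in units[i] - covered))
        ranking.append(best)
        covered |= units[best]
        remaining.remove(best)
    return ranking


def modified_greedy_ranking(units, weights):
    """Modified greedy: repeatedly target the heaviest not-yet-covered
    conceptual unit and rank next the textual unit with the highest
    marginal gain among those covering it."""
    ranking, covered = [], set()
    remaining = set(range(len(units)))
    uncovered = {c for u in units for c in u}
    while remaining and uncovered:
        target = max(uncovered, key=lambda c: weights[c])
        candidates = [i for i in remaining if target in units[i]]
        if not candidates:          # no remaining unit covers this concept
            uncovered.remove(target)
            continue
        best = max(candidates,
                   key=lambda i: sum(weights[c] for c in units[i] - covered))
        ranking.append(best)
        covered |= units[best]
        remaining.remove(best)
        uncovered -= units[best]
    return ranking
```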
Given the rankings of textual units, we can then produce an output of a given length by adopting appropriate stopping criteria for when to stop adding textual units (in order according to their ranking) to the output. There is no clear rule for conforming to a specific length (for example, DUC 2001 allowed submitted summaries to go over "a reasonable percentage" of the target length, while DUC 2004 cuts summaries mid-sentence at exactly the target length). As the summary length in DUC is measured in words, in our experiments we extracted the specified number of words out of the top sentences (truncating the last sentence if necessary).
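A possible implementation of this truncation step (our own sketch, not taken from the paper) simply walks the ranked sentences and stops at the word budget:

```python
def truncate_to_length(ranked_sentences, target_words):
    """Concatenate sentences in ranked order, cutting off at exactly
    target_words words and truncating the last sentence if necessary."""
    words = []
    for sentence in ranked_sentences:
        for word in sentence.split():
            if len(words) == target_words:
                return " ".join(words)
            words.append(word)
    return " ".join(words)
```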
5 Experiments
To empirically establish the effectiveness of the presented model we ran experiments comparing evaluation scores on summaries obtained with a baseline algorithm that does not account for redundancy of information and with the two variants of greedy algorithms described in Section 4. We chose summarization as the evaluation task because "ideal" output (prepared by humans) and methods for scoring arbitrary system output were available for this task, but not for evaluating long answers to questions.

Data We chose as our input data the document sets used in the evaluation of multidocument summarization during the Document Understanding Conference (DUC), organized by NIST in 2001 (Harman and Voorhees, 2001). This collection contains 30 test document sets, each containing approximately 10 news stories on different events; document sets vary significantly in their internal coherence. For each document set 12 human-constructed summaries are provided, 3 for each of the target lengths of 50, 100, 200, and 400 words. We selected DUC 2001 because, unlike later DUCs, ideal summaries are available for multiple lengths. We consider sentences as our textual units.
Features In our experiments we used two sets of features (i.e., conceptual units). First, we chose a fairly basic and widely used set of lexical features, namely the list of words present in each input text. We set the weight of each feature to its tf*idf value, taking idf values from http://elib.cs.berkeley.edu/docfreq/.
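A hedged sketch of this feature weighting is shown below; the exact tf*idf variant is not specified in the text, so the smoothing in the idf term and the doc_freq/num_docs inputs (standing in for the external document-frequency tables) are assumptions of ours:

```python
import math
from collections import Counter

def word_feature_weights(input_texts, doc_freq, num_docs):
    """Weight each word feature (conceptual unit) by tf*idf over the whole
    input document set.  doc_freq maps a word to its document frequency in
    an external corpus; num_docs is the size of that corpus."""
    tf = Counter(word for text in input_texts for word in text.lower().split())
    return {word: count * math.log(num_docs / (1 + doc_freq.get(word, 0)))
            for word, count in tf.items()}
```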
Our alternative set of conceptual units was the list of weighted atomic events extracted from the input texts. An atomic event is a triplet consisting of two named entities extracted from a sentence and a connector expressed by a verb or an event-related noun that appears in between these two named entities.
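Schematically, the triplet definition can be pictured as below; this is only an illustration, assuming a named-entity tagger and a list of connector words (verbs and event-related nouns) are available, and it omits the weighting of events:

```python
def atomic_events(tokens, entity_spans, connectors):
    """Return (entity1, connector, entity2) triplets for adjacent pairs of
    named entities, keeping every verb or event-related noun between them.
    entity_spans is a list of (start, end, entity_string) over tokens."""
    events = []
    for (_, e1, ent1), (s2, _, ent2) in zip(entity_spans, entity_spans[1:]):
        for token in tokens[e1:s2]:           # words between the two entities
            if token.lower() in connectors:
                events.append((ent1, token.lower(), ent2))
    return events
```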