A Formal Model for Information Selection in Multi-Sentence Text Extraction

Elena Filatova
Department of Computer Science, Columbia University
New York, NY 10027, USA
filatova@cs.columbia.edu

Vasileios Hatzivassiloglou
Center for Computational Learning Systems, Columbia University
New York, NY 10027, USA
vh@cs.columbia.edu

Abstract

Selecting important information while accounting for repetitions is a hard task for both summarization and question answering. We propose a formal model that represents a collection of documents in a two-dimensional space of textual and conceptual units with an associated mapping between these two dimensions. This representation is then used to describe the task of selecting textual units for a summary or answer as a formal optimization task. We provide approximation algorithms and empirically validate the performance of the proposed model when used with two very different sets of features, words and atomic events.

1 Introduction

Many natural language processing tasks involve the collection and assembling of pieces of information from multiple sources, such as different documents or different parts of a document. Text summarization clearly entails selecting the most salient information (whether generically or for a specific task) and putting it together in a coherent summary. Question answering research has recently started examining the production of multi-sentence answers, where multiple pieces of information are included in the final output.

When the answer or summary consists of multiple separately extracted (or constructed) phrases, sentences, or paragraphs, additional factors influence the selection process. Obviously, each of the selected text snippets should individually be important. However, when many of the competing passages are included in the final output, the issue of information overlap between the parts of the output comes up, and a mechanism for addressing redundancy is needed. Current approaches in both summarization and long answer generation are primarily oriented towards making good decisions for each potential part of the output, rather than examining whether these parts overlap. Most current methods adopt a statistical framework, without full semantic analysis of the selected content passages; this makes the comparison of content across multiple selected text passages hard, and necessarily approximated by the textual similarity of those passages.

Thus, most current summarization or long-answer question-answering systems employ two levels of analysis: a content level, where every textual unit is scored according to the concepts or features it covers, and a textual level, where, before being added to the final output, the textual units deemed to be important are compared to each other and only those that are not too similar to other candidates are included in the final answer or summary. This comparison can be performed purely on the basis of text similarity, or on the basis of shared features that may be the same as the features used to select the candidate text units in the first place.
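To make this two-level scheme concrete, the sketch below illustrates a generic score-then-filter pipeline of the kind described above; it is not the authors' system. Candidates are ranked by the number of features they cover (content level), and a candidate is kept only if it is not too similar to the units already selected (textual level). The feature extractor, the Jaccard similarity measure, and the threshold are all illustrative assumptions.

```python
# A minimal sketch of a two-level (score, then deduplicate) selection pipeline.
# All parameter names and choices here are hypothetical, for illustration only.

def select_two_stage(candidates, extract_features, max_units=5, sim_threshold=0.5):
    """candidates: list of strings; extract_features: str -> set of features."""
    feats = {c: extract_features(c) for c in candidates}

    # Content level: rank candidates by how many features they cover.
    ranked = sorted(candidates, key=lambda c: len(feats[c]), reverse=True)

    def jaccard(a, b):
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # Textual level: keep a candidate only if it is not too similar
    # to any unit already placed in the output.
    selected = []
    for cand in ranked:
        if len(selected) >= max_units:
            break
        if all(jaccard(feats[cand], feats[s]) < sim_threshold for s in selected):
            selected.append(cand)
    return selected
```

Note that in this kind of pipeline, redundancy is handled as a post-hoc filter on pairwise similarity, separately from the scoring step.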

In this paper, we propose a formal model for integrating these two tasks, simultaneously performing the selection of important text passages and the minimization of information overlap between them. We formalize the problem by positing a textual unit space, from which all potential parts of the summary or answer are drawn, a conceptual unit space, which represents the distinct conceptual pieces of information that should be maximally included in the final output, and a mapping between conceptual and textual units. All three components of the model are application- and task-dependent, allowing for different applications to operate on text pieces of different granularity and aim to cover different conceptual features, as appropriate for the task at hand. We cast the problem of selecting the best textual units as an optimization problem over a general scoring function that measures the total coverage of conceptual units by any given set of textual units, and provide general algorithms for obtaining a solution. By integrating redundancy checking into the selection of the textual units, we provide a unified framework for addressing content overlap that does not require external measures of similarity between textual units. We also account for the partial overlap of information between textual units (e.g., a single shared clause), a situation which is common in nat-
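As a rough illustration of this optimization view, the sketch below assumes that the scoring function is the total weight of distinct conceptual units covered by the chosen textual units, i.e., a weighted maximum-coverage objective, and applies the standard greedy approximation. The exact scoring functions and algorithms of the model are developed later in the paper, so the data structures and names here should be read as a hypothetical instantiation rather than the paper's procedure.

```python
# A minimal sketch of selecting textual units to maximize the total weight of
# covered conceptual units (weighted maximum coverage), using the standard
# greedy approximation. Inputs and names are illustrative assumptions.

def greedy_select(textual_units, covers, weight, k):
    """textual_units: iterable of textual-unit ids.
    covers: dict mapping textual unit -> set of conceptual units it covers.
    weight: dict mapping conceptual unit -> importance weight.
    k: number of textual units to extract."""
    selected, covered = [], set()
    for _ in range(k):
        best, best_gain = None, 0.0
        for t in textual_units:
            if t in selected:
                continue
            # Marginal gain: weight of the conceptual units t adds beyond
            # those already covered, so repeated information counts only once.
            gain = sum(weight[c] for c in covers[t] - covered)
            if gain > best_gain:
                best, best_gain = t, gain
        if best is None:  # no remaining unit adds new information
            break
        selected.append(best)
        covered |= covers[best]
    return selected
```

Because each unit's contribution is measured against the conceptual units already covered, redundancy handling is built into the selection criterion itself rather than applied as a separate similarity filter.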
