Topic segmentation with an aspect hidden Markov model(5)
时间:2025-04-27
时间:2025-04-27
We present a novel probabilistic method for partially unsupervised topic segmentation on unstructured text. Previous approaches to this problem utilize the hidden Markov model framework (HMM). The HMM treats a document as mutually independent sets of words
23HMMSEGMENTATION
Figure1:AgraphicalmodelrepresentingthesegmentingHMM
3HMMSegmentation
InthesegmentingHMMframework,anunsegmenteddocumentistreatedasacollec-tionofmutuallyindependentsetsofwords.Themodelpositsthateachsetisprob-abilisticallygeneratedbyahiddentopicvariableinaseries.Transitionprobabilitiesbetweentopicsdeterminethenexthiddenvariableinthesequence.
Asagenerativemodel,theHMMpositsthatadocumentisproducedbythefol-lowingprocess:chooseatopicfromaninitialdistributionoftopics;generateasetofindependentwordsfromadistributionoverwordsassociatedwiththattopic;chooseanothertopic,possiblythesametopicfromadistributionofallowedtransitions;repeatthisprocess.Givenanew,unsegmenteddocument,oneinvertsthisprocessbycalculat-ingthemostlikelysetoftopicswhichgeneratedthe-wordsetsofthegivendocument.Topicbreaksoccuratthepointswherethevalueofthetopicvariableschange.
Moreformally,aresetsofwordsandaregen-eratedbyatopic.Eachdependsonlyonandtheareindependentofeachothergiven.Thisisillustratedinthegraphicalmodelin gure1.Circlesrepre-sentrandomvariablesandarrowsindicatepossiblydependency.Theboxaroundindicatesthatthisrandomvariableisrepeatedtimesforeachtopicvariableintheseries.
TheHMMisparameterizedbyatransitionprobabilitydistributionbetweentopicsandasetoftopic-basedunigramlanguagemodelsforeachpossiblevalueof.Totrainthemodel,asetofsegmentsfromacorpusisclusteredusingthe-meansalgorithm.Aunigramlanguagemodeliscomputedforeachoftheseclustersandanappropriatesmoothingtechniqueisappliedtoaccountforsparsity.Thetransitionprob-
isaparameterwhichisseparatelyabilitydistributionbetweentopicstates
tunedin[6].Wesimplyusenormalizedcountsoftransitionsbetweenclustersinthetrainingsettoestimateit.Notethatthismodelrequiresasegmentedcorpustotrain,butworksinanunsupervisedmannertoclusterthosesegments.
Tosegmentanewdocument,thestreamoftextisdividedintoasequenceof
ofwordseach.TheViterbialgorithm[7],adynamicprogram-observations
mingtechnique,isusedto ndthemostlikelyhiddensequenceoftopicstates
givenanobservedsequenceofwordsets.Topicbreaksoccurwhen.
…… 此处隐藏:285字,全部文档内容请下载后查看。喜欢就下载吧 ……上一篇:工程索道与柔性吊桥
下一篇:人教版六年级数学上册期末试卷2