Abstract The MediaMill TRECVID 2005 Semantic Video Search En(4)

Date: 2026-01-23

UvA-MediaMill team participated in four tasks. For the detection of camera work (runid: A CAM) we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information.

3.1.2 Contextures: Regional Texture Descriptors and their Context

The visual detectors aim to decompose an image in proto-concepts like vegetation, water, fire, sky etc. To achieve this goal, an image is divided up in several overlapping rectangular regions. The regions are uniformly sampled across the image, with a step size of half a region. The region size has to be large enough to assess statistical relevance, and small enough to capture local textures in an image. We utilize a multi-scale approach, using small and large regions. An example of region sampling is displayed in figure 4.
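The region sampling described above can be illustrated with a minimal sketch. It assumes pixel coordinates, drops regions that would extend past the image border, and uses illustrative image and region sizes; the function name `sample_regions` and all numeric values are hypothetical and not taken from the paper.

```python
from typing import List, Tuple

Region = Tuple[int, int, int, int]  # (x, y, width, height), in pixels


def sample_regions(img_w: int, img_h: int, reg_w: int, reg_h: int) -> List[Region]:
    """Uniformly sample overlapping rectangles with a step of half a region.

    Assumption: regions that do not fit entirely inside the image are skipped.
    """
    step_x = max(1, reg_w // 2)  # half a region horizontally
    step_y = max(1, reg_h // 2)  # half a region vertically
    regions: List[Region] = []
    for y in range(0, img_h - reg_h + 1, step_y):
        for x in range(0, img_w - reg_w + 1, step_x):
            regions.append((x, y, reg_w, reg_h))
    return regions


# Multi-scale sampling: one small and one large region size (illustrative values).
small = sample_regions(320, 240, 40, 40)
large = sample_regions(320, 240, 160, 120)
```

With a step of half a region, each interior pixel is covered by up to four regions per scale, which is what makes the rectangles overlap.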

A visual scene is characterized by both global as well as local texture information. For example, a picture with an aircraft in mid air might be described as “sky, with a hole in it”. To model this type of information, we use a proto-concept occurrence histogram where each bin is a proto-concept. The values in the histogram are the similarity responses of each proto-concept annotation, to the regions in the image.

We use the proto-concept occurrence histogram to characterize both global and local texture information. Global information is described by computing an occurrence histogram accumulated over all regions in the image. Local information is taken into account by constructing another occurrence histogram for only the response of the best region. For each proto-concept, or bin, b the accumulated occurrence histogram and the best occurrence histogram are constructed by,

\[
H_{\mathrm{accumulated}}(b) = \sum_{r \in R(im)} \sum_{a \in A(b)} W^2(a, r),
\qquad
H_{\mathrm{best}}(b) = \arg\max_{r \in R(im)} \sum_{a \in A(b)} W^2(a, r),
\]

where R(im) denotes the set of regions in image im, A(b) represents the set of stored annotations for proto-concept b, and W^2 is the Cramér-von Mises statistic as introduced in equation 2.

We denote a proto-concept occurrence histogram as a contexture for that image. We have chosen this name, as our method incorporates texture features in a context. The texture features are given by the use of Wiccest features, using color invariance and natural image statistics. Furthermore, context is taken into account by the combination of both local and global region combinations.

Contextures can be computed for different parameter settings. Specifically, we calculate the contextures at scales σ=1 and σ=3 of the Gaussian filter. Furthermore, we use two different region sizes, with ratios of 1/… and 1/… of the x-dimension and y-dimensions of the image. Moreover, contextures are based on one image, and not based on a shot. To generalize our approach to shot level, we extract 1 frame per second out of the video, and then aggregate the frames that belong to the same shot. We use two ways to aggregate frames: 1) average the contexture responses for all extracted frames in a shot and 2) keep the maximum response of all frames in a shot. This aggregation strategy accounts for information about the whole shot i, and information about accidental frames, which might occur with high camera motion. The combination of all these parameters yields a vector of contextures v_i, containing the final result of the visual analysis.

3.1.3 Textual Analysis

In the textual modality, we aim to learn the association between uttered speech and semantic concepts. A detection system transcribes the speech into text. For the Chinese and Arabic sources we exploit the provided machine translations. The resulting translation is mapped from story level to shot level. From the text we remove the frequently occurring stop words. After stop word removal, we are ready to learn semantics.

To learn the relation between uttered speech and concepts, we connect words to shots. We make this connection within the temporal boundaries of a shot. We derive a lexicon of uttered words that co-occur with ω using the shot-based annotations of the training data. For each concept ω, we learn a separate lexicon, Λ^ω_T, as this uttered word lexicon is specific for that concept. For feature extraction we compare the text associated with each shot with Λ^ω_T. This comparison yields a text vector t_i for shot i, which contains the histogram of the words in association with ω.

3.1.4 Early Fusion

Indexing approaches that rely on early fusion first extract unimodal features of each stream. The extracted features of all streams are combined into a single representation. After combination of unimodal features in a multimodal representation, early fusion methods rely on supervised learning to classify semantic concepts. Early fusion yields a truly multimedia feature representation, since the features are integrated from the start. An added advantage is the requirement of one learning phase only. A disadvantage of the approach is the difficulty of combining features into a common representation. The general scheme for early fusion is illustrated in Fig. 5a.
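The combination step of this scheme can be illustrated with a minimal sketch. It assumes that each stream yields a fixed-length feature vector, that the vectors are combined by simple concatenation, and that the result is L2-normalized; the helper names and the choice of L2 normalization are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length (an assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0 for _ in vec]  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


def early_fusion(streams: Sequence[Sequence[float]]) -> List[float]:
    """Concatenate unimodal feature vectors into one vector, then normalize."""
    fused = [x for stream in streams for x in stream]
    return l2_normalize(fused)


# Hypothetical two-dimensional visual and textual features for one shot.
visual = [3.0, 0.0]
text = [0.0, 4.0]
fused = early_fusion([visual, text])
```

A single supervised classifier would then be trained on the fused vectors, which is why only one learning phase is needed.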

We rely on vector concatenation in the early fusion scheme to obtain a multimodal representation. We concatenate the visual vector v_i with the text vector t_i. After feature normalization, we obtain early fusion vector e_i.

3.1.5 Late Fusion

Indexing approaches that rely on late fusion also start with extraction of unimodal features. In contrast to early fusion, where features are then combined into a multimodal representation, approaches for late fusion learn semantic concepts directly from unimodal features. In general, late fusion focuses on the individual strength of modalities. Unimodal concept detection scores are fused into a multimodal
