Abstract The MediaMill TRECVID 2005 Semantic Video Search En(4)
Date: 2026-01-23
The UvA-MediaMill team participated in four tasks. For the detection of camera work (runid: A CAM), we investigate the benefit of using a tessellation of detectors in combination with supervised learning over a standard approach using global image information.
3.1.2 Contextures: Regional Texture Descriptors and their Context
The visual detectors aim to decompose an image into proto-concepts such as vegetation, water, fire, and sky. To achieve this goal, an image is divided into several overlapping rectangular regions. The regions are uniformly sampled across the image, with a step size of half a region. The region size has to be large enough to assess statistical relevance, and small enough to capture local textures in an image. We utilize a multi-scale approach, using small and large regions. An example of region sampling is displayed in figure 4.
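The sampling scheme described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the region fractions used in the example are hypothetical, and the image is represented only by its width and height.

```python
def sample_regions(width, height, region_w, region_h):
    """Return (x, y, w, h) boxes laid out with a step of half a region."""
    step_x = max(1, region_w // 2)  # half a region horizontally
    step_y = max(1, region_h // 2)  # half a region vertically
    boxes = []
    for y in range(0, height - region_h + 1, step_y):
        for x in range(0, width - region_w + 1, step_x):
            boxes.append((x, y, region_w, region_h))
    return boxes


def multi_scale_regions(width, height, fractions):
    """Sample regions at several sizes, each a fraction of the image size."""
    regions = []
    for fraction in fractions:
        region_w = max(1, int(width * fraction))
        region_h = max(1, int(height * fraction))
        regions.extend(sample_regions(width, height, region_w, region_h))
    return regions


if __name__ == "__main__":
    # Hypothetical 8x8 image; the fractions are illustrative values only.
    boxes = multi_scale_regions(8, 8, fractions=(0.5, 0.25))
    print(len(boxes), boxes[:3])
```

Because the step is half a region, neighbouring boxes overlap by half their width and height, so a local texture that straddles one region boundary is still covered by an adjacent region.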
A visual scene is characterized by both global as well as local texture information. For example, a picture with an aircraft in mid air might be described as “sky, with a hole in it”. To model this type of information, we use a proto-concept occurrence histogram where each bin is a proto-concept. The values in the histogram are the similarity responses of each proto-concept annotation to the regions in the image.
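A minimal sketch of such a histogram is given below, assuming each region and each annotation has been reduced to a single scalar feature. The `toy_similarity` function is a hypothetical stand-in for the similarity measure used in the paper. The sketch fills one bin per proto-concept in two ways: by summing the responses over all regions (global information) and by keeping only the response of the strongest region (local information).

```python
def occurrence_histograms(regions, annotations_by_concept, similarity):
    """Return (accumulated, best) histograms keyed by proto-concept."""
    accumulated, best = {}, {}
    for concept, annotations in annotations_by_concept.items():
        # Response of each region: summed similarity to the concept's annotations.
        responses = [sum(similarity(a, r) for a in annotations) for r in regions]
        accumulated[concept] = sum(responses)        # global: all regions
        best[concept] = max(responses, default=0.0)  # local: strongest region
    return accumulated, best


def toy_similarity(annotation, region):
    """Hypothetical similarity on scalar features; 1.0 means identical."""
    return 1.0 / (1.0 + abs(annotation - region))


if __name__ == "__main__":
    regions = [0.0, 1.0]                              # two toy regions
    annotations = {"sky": [0.0], "water": [1.0, 1.0]}  # toy stored annotations
    acc, best = occurrence_histograms(regions, annotations, toy_similarity)
    print(acc, best)
```

In the “sky, with a hole in it” example, the summed histogram would be dominated by the sky bin, while the strongest-region histogram can still register the single region that contains the aircraft.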
We use the proto-concept occurrence histogram to characterize both global and local texture information. Global information is described by computing an occurrence histogram accumulated over all regions in the image. Local information is taken into account by constructing another occurrence histogram for only the response of the best region. For each proto-concept, or bin, $b$, the accumulated occurrence histogram and the best occurrence histogram are constructed by,
$$
\begin{aligned}
H_{\mathrm{accumulated}}(b) &= \sum_{r \in R(im)} \sum_{a \in A(b)} W^2(a, r), \\
H_{\mathrm{best}}(b) &= \operatorname*{arg\,max}_{r \in R(im)} \sum_{a \in A(b)} W^2(a, r),
\end{aligned}
$$

where $R(im)$ denotes the set of regions in image $im$, $A(b)$ represents the set of stored annotations for proto-concept $b$, and $W^2$ is the Cramér-von Mises statistic as introduced in equation 2.

We denote a proto-concept occurrence histogram as a contexture for that image. We have chosen this name, as our method incorporates texture features in a context. The texture features are given by the use of Wiccest features, using color invariance and natural image statistics. Furthermore, context is taken into account by the combination of both local and global region combinations.

Contextures can be computed for different parameter settings. Specifically, we calculate the contextures at scales $\sigma = 1$ and $\sigma = 3$ of the Gaussian filter. Furthermore, we use two different region sizes, with ratios of 1/… and 1/… of the x-dimension and y-dimension of the image. Moreover, contextures are based on one image, and not based on a shot. To generalize our approach to shot level, we extract 1 frame per second out of the video, and then aggregate the frames that belong to the same shot. We use two ways to aggregate frames: 1) average the contexture responses for all extracted frames in a shot and 2) keep the maximum response of all frames in a shot. This aggregation strategy accounts for information about the whole shot $i$, and information about accidental frames, which might occur with high camera motion. The combination of all these parameters yields a vector of contextures $\vec{v}_i$, containing the final result of the visual analysis.

3.1.3 Textual Analysis

In the textual modality, we aim to learn the association between uttered speech and semantic concepts. A detection system transcribes the speech into text. For the Chinese and Arabic sources we exploit the provided machine translations. The resulting translation is mapped from story level to shot level. From the text we remove the frequently occurring stopwords. After stopword removal, we are ready to learn semantics.

To learn the relation between uttered speech and concepts, we connect words to shots. We make this connection within the temporal boundaries of a shot. We derive a lexicon of uttered words that co-occur with $\omega$ using the shot-based annotations of the training data. For each concept $\omega$, we learn a separate lexicon, $\Lambda^{\omega}_{T}$, as this uttered word lexicon is specific for that concept. For feature extraction we compare the text associated with each shot with $\Lambda^{\omega}_{T}$. This comparison yields a text vector $\vec{t}_i$ for shot $i$, which contains the histogram of the words in association with $\omega$.

3.1.4 Early Fusion
Indexing approaches that rely on early fusion first extract unimodal features of each stream. The extracted features of all streams are combined into a single representation. After combination of unimodal features in a multimodal representation, early fusion methods rely on supervised learning to classify semantic concepts. Early fusion yields a truly multimedia feature representation, since the features are integrated from the start. An added advantage is the requirement of one learning phase only. A disadvantage of the approach is the difficulty of combining features into a common representation. The general scheme for early fusion is illustrated in Fig. 5a.
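As a rough illustration of combining unimodal features into a single representation, the sketch below concatenates a visual and a text feature vector per shot and then rescales each dimension. Concatenation is one simple way to combine streams. The min-max rescaling over the given shots and the name `early_fusion` are assumptions made for illustration, not details taken from the text.

```python
def early_fusion(visual_vectors, text_vectors):
    """Concatenate per-shot visual and text vectors, then min-max normalize."""
    if len(visual_vectors) != len(text_vectors):
        raise ValueError("need one visual and one text vector per shot")
    if not visual_vectors:
        return []

    # Step 1: one concatenated (multimodal) vector per shot.
    fused = [list(v) + list(t) for v, t in zip(visual_vectors, text_vectors)]

    # Step 2: rescale every dimension to [0, 1] over the given shots
    # (assumed normalization; constant dimensions are mapped to 0.0).
    dims = len(fused[0])
    lows = [min(row[d] for row in fused) for d in range(dims)]
    highs = [max(row[d] for row in fused) for d in range(dims)]
    return [
        [(x - lo) / (hi - lo) if hi > lo else 0.0
         for x, lo, hi in zip(row, lows, highs)]
        for row in fused
    ]


if __name__ == "__main__":
    visual = [[0.0, 2.0], [1.0, 4.0]]  # toy visual vectors for two shots
    text = [[5.0], [5.0]]              # toy text vectors for the same shots
    print(early_fusion(visual, text))
```

The fused vectors would then be passed to a single supervised learner, which is what gives early fusion its one learning phase.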
We rely on vector concatenation in the early fusion scheme to obtain a multimodal representation. We concatenate the visual vector $\vec{v}_i$ with the text vector $\vec{t}_i$. After feature normalization, we obtain early fusion vector $\vec{e}_i$.

3.1.5 Late Fusion
Indexing approaches that rely on late fusion also start with extraction of unimodal features. In contrast to early fusion, where features are then combined into a multimodal representation, approaches for late fusion learn semantic concepts directly from unimodal features. In general, late fusion focuses on the individual strength of modalities. Unimodal concept detection scores are fused into a multimodal