Improving Support Vector Clustering with Ensembles
Date: 2025-04-20
Wilfredo J. Puma-Villanueva, George B. Bezerra, Clodoaldo A. M. Lima, Fernando J. Von Zuben
LBiC/DCA/FEEC/Unicamp
C.P. 6101
Campinas/SP, Brazil, 13083-970
+55 19 3788-3885
{wilfredo, bezerra, moraes, vonzuben}@dca.fee.unicamp.br
Abstract: Support Vector Clustering (SVC) is a recently proposed clustering methodology with promising performance for high-dimensional and noisy data sets, and for clusters with arbitrary shape. However, setting the parameters of the SVC algorithm is a challenging task. Instead of searching for a single optimal configuration, our proposal involves the generation, selection, and combination of distinct clustering solutions, leading to a consensus clustering. The purpose is to deal with a wide variety of clustering problems without the need to search for a single, dedicated high-performance solution.
I. PROBLEM STATEMENT
Support Vector Machines (SVM) are high-performance supervised learning machines based on Vapnik's Statistical Learning Theory, and they have subsequently been extended by a number of researchers to deal with clustering problems. The SVM variants are generally competitive with each other, even when they differ in formulation, solution strategy, and/or choice of kernel function. When multiple learning machines are available, there are many theoretical and empirical reasons to implement an ensemble.
Ensembles involve the generation, selection, and linear/nonlinear combination of a set of individual components designed to cope with the same task simultaneously. Diversity among components is typically obtained by varying configuration parameters and/or employing different training procedures, such as bagging and boosting. An ensemble should properly integrate the knowledge embedded in its components, and ensembles have frequently produced more accurate and robust models. The effectiveness of an ensemble depends strongly on the diversity and accuracy of the learning machines taken as components.
For a sample of size N composed of p-dimensional real-valued vectors, clustering is a procedure that divides the vectors into m disjoint groups, such that data points within each group are more similar to each other than to any data point in another group.
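As an illustrative sketch (names and code are our own, not from the paper), the definition above says a hard clustering is a label assignment that induces m disjoint groups which jointly cover the sample:

```python
import numpy as np

def groups_from_labels(X, labels):
    """Partition a sample X (N x p) into disjoint groups: every vector
    receives exactly one label, so the groups cover X without overlap."""
    parts = {k: X[labels == k] for k in np.unique(labels)}
    # Disjoint and exhaustive: group sizes must sum back to N.
    assert sum(len(g) for g in parts.values()) == len(X)
    return parts

# A toy sample of N = 4 vectors in p = 2 dimensions, split into m = 2 groups.
X = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]])
parts = groups_from_labels(X, np.array([0, 0, 1, 1]))
```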
Each clustering procedure may produce diverse solutions depending on its parameter setup. When no a priori knowledge is available, it becomes quite difficult to attest to the consistency of a single solution. Cluster boundaries tend to be fuzzy, and clustering results vary significantly in these transition regions.
The resulting diversity among clustering proposals can be exploited to synthesize an ensemble of clustering solutions. The main aspects to be explored are:
§ Reuse of the knowledge implicit in each clustering solution.
§ Clustering over distributed datasets, in cases where the data cannot be directly shared or pooled because of restrictions on ownership, privacy, or storage.
§ Attribution of a confidence level to each cluster.

The ensemble proposed here will combine partitions produced by SVC (Support Vector Clustering) [1] [2]. SVC maps the data set into a higher-dimensional feature space using a Gaussian kernel and then searches for the minimal enclosing sphere of the mapped points. When the sphere is mapped back to the original data space, it automatically separates the data into clusters. The SVC methodology can generate clusters of arbitrary shape and size. It also has a built-in mechanism for dealing with outliers, making it especially suitable for noisy data sets.
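A minimal sketch of the SVC pipeline just described, under assumptions of our own: the minimal-enclosing-sphere dual is solved only approximately (by Frank-Wolfe iterations on the simplex, rather than the exact QP of [1] [2]), the soft margin for outliers is omitted, and all names and parameter values are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, q):
    # K(x, y) = exp(-q * ||x - y||^2)
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-q * d2)

def svc_sketch(X, q, iters=500, n_seg=10):
    N = len(X)
    K = gaussian_kernel(X, X, q)
    # Dual of the minimal enclosing sphere in feature space (hard margin):
    #   min_beta  beta^T K beta   s.t.  beta >= 0, sum(beta) = 1,
    # solved approximately by Frank-Wolfe iterations on the simplex.
    beta = np.full(N, 1.0 / N)
    for t in range(iters):
        grad = 2.0 * K @ beta
        vertex = np.zeros(N)
        vertex[np.argmin(grad)] = 1.0            # best simplex vertex
        beta += (2.0 / (t + 2.0)) * (vertex - beta)
    c = beta @ K @ beta
    def radius(P):
        # Feature-space distance from the image of each row of P to the
        # sphere center: R(x)^2 = 1 - 2 sum_j beta_j K(x_j, x) + beta^T K beta
        r2 = 1.0 - 2.0 * gaussian_kernel(P, X, q) @ beta + c
        return np.sqrt(np.maximum(r2, 0.0))
    R = radius(X).max()                          # sphere radius: all points inside
    def connected(i, j):
        # Same cluster if the whole segment between i and j maps inside the sphere.
        lam = np.linspace(0.0, 1.0, n_seg)[:, None]
        seg = (1.0 - lam) * X[i] + lam * X[j]
        return bool(np.all(radius(seg) <= R + 1e-3))
    # Clusters are the connected components of the resulting graph.
    labels = np.full(N, -1)
    cluster = 0
    for i in range(N):
        if labels[i] < 0:
            stack, labels[i] = [i], cluster
            while stack:
                a = stack.pop()
                for b in range(N):
                    if labels[b] < 0 and connected(a, b):
                        labels[b] = cluster
                        stack.append(b)
            cluster += 1
    return labels

# Two well-separated blobs; the sketch should recover them as two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)),
               rng.normal(8.0, 0.3, (20, 2))])
labels = svc_sketch(X, q=0.5)
```

Note the characteristic SVC step: cluster membership is decided by a connectivity test in the original space (does the segment between two points stay inside the sphere?), not by distance to a centroid, which is what allows arbitrarily shaped clusters.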
Yang et al. [20] proposed a mechanism to improve the performance of the original SVC [1] by adopting proximity-graph tools at the cluster-assignment stage, thus increasing accuracy and providing scalability to large data sets through a considerable reduction in processing time.
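The idea in [20] can be illustrated independently of SVC: instead of running the expensive pairwise connectivity test on all N(N-1)/2 pairs, test only the edges of a proximity graph (here a k-nearest-neighbor graph, one of several possible choices) and take connected components. In this sketch the pairwise test is an illustrative stand-in (a plain distance threshold rather than SVC's segment-inside-sphere check), and all names are our own.

```python
import numpy as np
from itertools import combinations

def knn_edges(X, k):
    # Undirected edges of a k-nearest-neighbor proximity graph.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nn = np.argsort(d, axis=1)[:, :k]
    return {(min(i, j), max(i, j)) for i in range(len(X)) for j in nn[i]}

def cluster_with_edges(n, edges, same_cluster):
    # Union-find over only the edges that pass the pairwise test;
    # also counts how many pairwise tests were performed.
    parent = list(range(n))
    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a
    checks = 0
    for i, j in sorted(edges):
        checks += 1
        if same_cluster(i, j):
            parent[find(i)] = find(j)
    return [find(i) for i in range(n)], checks

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (15, 2)),
               rng.normal(5.0, 0.3, (15, 2))])
# Stand-in for SVC's segment test: a plain distance threshold.
same = lambda i, j: float(np.linalg.norm(X[i] - X[j])) < 2.0

all_pairs = set(combinations(range(len(X)), 2))
labels_all, checks_all = cluster_with_edges(len(X), all_pairs, same)
labels_knn, checks_knn = cluster_with_edges(len(X), knn_edges(X, 4), same)
```

Both variants recover the same two components here, but the proximity-graph version performs far fewer pairwise tests, which is the source of the speedup reported in [20].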
II. CURRENT RESEARCH
Research involving the combination of multiple clusterings is relatively scarce in the machine learning community. Park et al. [16] adopted several values for the width parameter of the Gaussian kernel in SVC, thus obtaining various adjacency matrices, and then combined them via Spectral Graph Partitioning to obtain one consensus adjacency matrix. The following works, though relevant to the intended application, did not apply kernel-based approaches to perform clustering. Fisher [5] analyzed methods for iteratively improving an initial set of hierarchical clustering solutions. Fayyad et al. [4] obtained multiple approximate k-means solutions in main memory after making a single pass through a database
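The scheme of Park et al. [16] can be sketched in simplified form (assumptions of our own: the kernel-width-varied SVC partitions are given directly as label vectors, each adjacency matrix is a 0/1 co-membership indicator, and the consensus matrix is split by a basic spectral bipartition via the Fiedler vector rather than the full Spectral Graph Partitioning machinery of [16]):

```python
import numpy as np

def coassociation(partitions):
    # Consensus adjacency: entry (i, j) is the fraction of partitions
    # that place items i and j in the same cluster.
    mats = [(p[:, None] == p[None, :]).astype(float) for p in partitions]
    M = np.mean(mats, axis=0)
    np.fill_diagonal(M, 0.0)
    return M

def spectral_bipartition(M):
    # Split by the sign of the Fiedler vector of the Laplacian L = D - M.
    L = np.diag(M.sum(axis=1)) - M
    _, vecs = np.linalg.eigh(L)          # eigenvalues in ascending order
    return (vecs[:, 1] > 0).astype(int)  # eigenvector of 2nd-smallest eigenvalue

# Three partitions of ten items; the second one misplaces item 4.
parts = [np.array([0] * 5 + [1] * 5),
         np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1]),
         np.array([0] * 5 + [1] * 5)]
labels = spectral_bipartition(coassociation(parts))
```

The consensus matrix outvotes the dissenting partition: item 4 co-occurs with items 0-3 in two of the three partitions, so the spectral split assigns it back to the first group.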