Improving Support Vector Clustering with Ensembles
Date: 2025-04-20
Abstract: Support Vector Clustering (SVC) is a recently proposed clustering methodology with promising performance for high-dimensional and noisy data sets, and for clusters with arbitrary shape. However, setting the parameters of the SVC algorithm is a challenge.
and combining these means finding a final set of cluster centers. Fred et al. [6] presented an evidence accumulation framework, wherein multiple runs of k-means, with a much higher value of k than the anticipated number of clusters, were performed on a common data set. Kargupta et al. [8] combined multiple clustering solutions based only on a partial set of features. Johnson and Kargupta [7] presented a feasible approach for combining distributed agglomerative clustering solutions. Kargupta et al. [9] introduced a distributed PCA method for clustering.
III. KEY AVENUES
The research group has been involved with clustering, ensembles, and SVM approaches for a while [3] [10] [11] [12] [13] [14] [15]. Based on the classical ensemble for classification approach, an extension for clustering can be conceived as outlined in Fig. 1. Three phases are involved in the implementation: Generation, Selection and Combination. In what follows, we describe the essence of the proposal.
x SVCi : Particular clustering approach (SVCs with distinct kernels) Ci : Clustering solution produced by SVCi C : Consensus solution
Figure 1. Ensemble of SVCs
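The pipeline of Fig. 1 can be sketched end to end. Since SVC has no implementation in scikit-learn, the sketch below uses SpectralClustering with distinct RBF widths as a stand-in for SVCs with distinct kernels; the names `components` and `solutions` are illustrative, not from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=100, noise=0.05, random_state=0)

# Generation: one component per kernel configuration (the SVC_i of Fig. 1).
# SpectralClustering is only a stand-in for SVC here.
components = [
    SpectralClustering(n_clusters=2, affinity="rbf", gamma=g, random_state=0)
    for g in (0.5, 1.0, 5.0)
]
solutions = [c.fit_predict(X) for c in components]  # the C_i of Fig. 1

# Selection and Combination would then filter and merge the C_i into a
# consensus solution C, using the strategies discussed in the text.
```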
Generation strategy
A high rate of diversity among the components of the ensemble will generally allow improving the final result. Thus, we can generate diversity by choosing among a variety of kernel functions (see Table 1) available in the literature.
A kernel function is a function that represents the inner product in a higher dimensional space, named feature space.
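A minimal sketch of three common kernel candidates (of the kind a table of kernels would list); each k(x, y) equals an inner product ⟨φ(x), φ(y)⟩ in some feature space. The parameter defaults are illustrative.

```python
import numpy as np

def rbf(x, y, gamma=1.0):
    # Gaussian (RBF) kernel: exp(-gamma * ||x - y||^2)
    return np.exp(-gamma * np.sum((x - y) ** 2))

def polynomial(x, y, degree=3, coef0=1.0):
    # polynomial kernel: (<x, y> + c)^d
    return (np.dot(x, y) + coef0) ** degree

def sigmoid(x, y, alpha=0.1, coef0=0.0):
    # sigmoid kernel: tanh(alpha * <x, y> + c)
    return np.tanh(alpha * np.dot(x, y) + coef0)

x = np.array([1.0, 0.0])
y = np.array([1.0, 0.0])
# Any point has RBF similarity 1 with itself: rbf(x, x) == 1.
```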
No other approach in the literature has tried distinct kernel functions in an ensemble of SVCs.

Selection of the clustering solution
This part is still to be properly defined. Selecting appropriate candidates among the clustering solutions is a challenging task, because distinct solutions may present different numbers of clusters, and defining the optimal number of clusters is still an open question. In SVC, after computing the minimum radius of the sphere that contains the whole dataset in feature space, the corresponding number of clusters in the original space is very sensitive to the parameters of the kernel function. Some strategies to estimate the proper number of clusters have been proposed in the literature, such as Akaike's and the Bayesian Information Criterion, and Minimum Message Length. The most used is Minimum Description Length (MDL), originally proposed by Rissanen [18], which has been widely applied in the field of neural networks. Robust Growing Neural Gas, proposed by Qin and Suganthan [17], adopted MDL in its constructive clustering model. However, extending such an approach to deal with SVC seems to be very computationally demanding.

Combination strategy
Strehl and Ghosh [19] presented a rich framework for combining clustering solutions, in terms of an optimization problem, and proposed up to four combination methods: direct and greedy optimization, cluster-based similarity partitioning algorithm (CSPA), hyper-graph partitioning algorithm (HGPA), and meta-clustering algorithm (MCLA).
CSPA: Co-membership of data points in the same cluster is used to establish a measure of pairwise similarity. The induced similarity measure is then used to recluster the data points, yielding a consensus clustering solution.
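A minimal CSPA sketch: the pairwise similarity is the fraction of clustering solutions that place two points in the same cluster (a co-association matrix), and the induced similarity is reclustered; here, for brevity, by taking connected components of the graph where similarity exceeds 0.5. The toy labelings below are illustrative.

```python
import numpy as np

# three clustering solutions over 6 points (rows = C1, C2, C3)
labels = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same partition as C1, labels permuted
])
n = labels.shape[1]

# co-association matrix: fraction of solutions agreeing on each pair
S = np.zeros((n, n))
for row in labels:
    S += (row[:, None] == row[None, :]).astype(float)
S /= len(labels)

# recluster: connected components of the graph with edges where S > 0.5
adj = S > 0.5
consensus = -np.ones(n, dtype=int)
cid = 0
for i in range(n):
    if consensus[i] < 0:
        stack = [i]
        while stack:
            j = stack.pop()
            if consensus[j] < 0:
                consensus[j] = cid
                stack.extend(np.nonzero(adj[j])[0].tolist())
        cid += 1
```

The majority partition {0,1,2} vs {3,4,5} is recovered even though C3 uses permuted label names, since co-association only looks at co-membership, never at label identities.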
HGPA: The objective is to approximate the maximum mutual information criterion with the minimum number of edges to be cut. Basically, the cluster ensemble problem is posed as a partitioning problem of a suitably defined hypergraph having hyperedges that represent clusters.
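The hypergraph construction can be sketched as follows: each cluster from each solution becomes a hyperedge, and the ensemble is encoded as a binary incidence matrix H (points × hyperedges). The minimum-cut partitioning of H would then be delegated to a dedicated hypergraph partitioner (e.g. hMETIS), which is outside this sketch; the toy labelings are illustrative.

```python
import numpy as np

# two clustering solutions over 6 points
labels = np.array([
    [0, 0, 0, 1, 1, 1],
    [0, 0, 1, 1, 1, 1],
])

# one hyperedge (indicator column) per cluster of each solution
edges = []
for row in labels:
    for c in np.unique(row):
        edges.append((row == c).astype(int))
H = np.stack(edges, axis=1)  # shape: (6 points, 4 hyperedges)
```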
MCLA: Groups of clusters, denoted metaclusters, have to be identified and collapsed.
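A minimal MCLA sketch under simplifying assumptions: clusters are treated as point sets, greedily grouped into metaclusters by Jaccard similarity (the 0.5 merge threshold is an arbitrary choice for illustration), and each point is assigned to the metacluster in which it occurs most often.

```python
# clusters from two solutions over 6 points, as point sets
clusters = [
    {0, 1, 2}, {3, 4, 5},   # from solution C1
    {0, 1}, {2, 3, 4, 5},   # from solution C2
]

def jaccard(a, b):
    return len(a & b) / len(a | b)

# greedy metaclustering: merge a cluster into the most similar metacluster
# when Jaccard similarity exceeds 0.5, otherwise start a new metacluster
metas = []
for c in clusters:
    best = max(metas, key=lambda m: jaccard(c, set.union(*m)), default=None)
    if best is not None and jaccard(c, set.union(*best)) > 0.5:
        best.append(c)
    else:
        metas.append([c])

# collapse: each point goes to the metacluster containing it most often
n = 6
consensus = [
    max(range(len(metas)), key=lambda k: sum(p in c for c in metas[k]))
    for p in range(n)
]
```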