Research Highlights No. 11: The Significance of Kappa and F-score in Clustering Evaluation and Clustering Ensemble
  Published: 2025-03-20   Editor: School of Statistics and Mathematics

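Professor Zhang Zhongyuan and his team of doctoral students have published a systematic analysis of the significance of Kappa and F-score in clustering ensemble in Knowledge and Information Systems, a journal rated AAA by our university.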

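Traditional single clustering methods usually rest on specific assumptions about the data distribution and are highly sensitive to hyperparameter settings, initialization conditions, and noise or outliers in the data, so they show poor robustness and stability on datasets with complex structure. By contrast, clustering ensemble, an extension of ensemble learning to the unsupervised setting, has attracted considerable attention for markedly improving the performance and robustness of clustering results. Selective clustering ensemble (SCE) and weighted clustering ensemble (WCE), which select or weight base partitions and clusters according to diversity and stability, improve performance further. Striking a balance between diversity and stability remains challenging, however, and the core difficulty lies in evaluating the quality of base partitions and clusters. Existing evaluation measures such as normalized mutual information (NMI) and its variants suffer from a symmetry problem, a context-meaning problem, and neglect of the importance of small clusters. To address these limitations, the paper proposes a new evaluation method based on Kappa and F-score and introduces a new SCE method that uses Kappa to select informative base partitions and F-score to weight clusters according to their stability. Systematic experimental analysis demonstrates the effectiveness and efficiency of the proposed method. In addition, although NMI is one of the most commonly used clustering evaluation measures, the study provides further evidence that NMI values can be misleading and fail to reflect the actual performance of clustering results accurately, whereas Kappa values are more reliable. Performance analysis of clustering methods should be based on Kappa rather than NMI. The code is available at https://github.com/Jarvisyan/DSKF-matlab.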
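This summary does not state the formulas used in the paper, so the following is only a minimal Python sketch of the two quantities under stated assumptions. Kappa is taken to be Cohen's kappa computed after the best one-to-one matching of cluster labels. The per-cluster F-score is taken to be the best F1 value that a cluster achieves against any cluster of a second partition. The function names `aligned_kappa` and `cluster_f_scores` are hypothetical, and the authors' MATLAB code in the repository may define both quantities differently.

```python
from collections import Counter
from itertools import permutations


def aligned_kappa(a, b):
    """Cohen's kappa between two partitions after label matching.

    Assumption: cluster labels are matched one-to-one so that agreement
    is maximal. The brute-force search over permutations is only
    practical for a handful of clusters.
    """
    n = len(a)
    joint = Counter(zip(a, b))          # co-occurrence counts
    rows, cols = Counter(a), Counter(b)  # cluster sizes
    la, lb = sorted(rows), sorted(cols)
    # Pad the shorter label list so that every label has a partner.
    size = max(len(la), len(lb))
    la += [None] * (size - len(la))
    lb += [None] * (size - len(lb))
    best = max(
        permutations(lb),
        key=lambda p: sum(joint[x, y] for x, y in zip(la, p)),
    )
    p_o = sum(joint[x, y] for x, y in zip(la, best)) / n          # observed
    p_e = sum(rows[x] * cols[y] for x, y in zip(la, best)) / n**2  # by chance
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)


def cluster_f_scores(a, b):
    """Best F1 of each cluster in `a` against any cluster in `b`.

    With precision n_xy/|x| and recall n_xy/|y|, F1 = 2*n_xy/(|x|+|y|).
    """
    joint = Counter(zip(a, b))
    rows, cols = Counter(a), Counter(b)
    return {
        x: max(2 * joint[x, y] / (rows[x] + cols[y]) for y in cols)
        for x in rows
    }


# Two toy partitions of ten items (illustrative data, not from the paper).
base = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
other = [1, 1, 1, 0, 0, 0, 0, 0, 2, 2]
kappa = aligned_kappa(base, other)
weights = cluster_f_scores(base, other)
```

On the toy data, nine of the ten items agree after matching, which gives a kappa of roughly 0.84. The small cluster `2` is reproduced exactly and receives an F-score of 1.0, so a score computed per cluster does not let large clusters drown out small ones.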

Paper title: The significance of Kappa and F-score in clustering ensemble: a comprehensive analysis

Abstract: Clustering ensemble techniques have gained significant attention due to their ability to enhance partition results’ accuracy and robustness. Selective clustering ensemble (SCE) and weighted clustering ensemble (WCE) methods further improve performance by selecting and weighting base partitions or clusters based on their diversity and stability. However, striking a balance between these two factors remains challenging. The primary difficulty lies in evaluating the quality of base partitions and clusters. Existing evaluation criteria, such as normalized mutual information (NMI) and its variants, suffer from inherent flaws, including symmetric problem, context meaning problem, and the disregard for small clusters’ importance. To address these limitations, this paper proposes a novel evaluation method that utilizes kappa and F-score. We introduce a new SCE method that employs kappa to select informative base partitions and utilizes F-score to assign weights to clusters based on their stability. Empirical validation on real datasets demonstrates the effectiveness and efficiency of the proposed approach. The code is available at https://github.com/Jarvisyan/DSKF-matlab.

Written by: Zhang Zhongyuan

Reviewed by: Deng Lu
