基于均矢量相似性的机器学习样本集划分

来源期刊:中南大学学报(自然科学版)2009年第6期

论文作者:陈先来 杨路明

文章页码:1636 - 1641

关键词:均矢量;样本集分割;相似性;机器学习

Key words:mean vector; splitting sample set; similarity; machine learning

摘    要:提出一种基于均矢量相似性的机器学习样本集分割方法(MSSS),根据样本集中每个样本矢量与均矢量之间的余弦相似性,将样本划分成训练集和测试集。为评价MSSS方法性能,分别用随机分割法(RSS)和MSSS方法,按不同比例划分来自UCI的4个数据集,对产生的训练集-测试集进行Hotelling T2检验;另外,采用得到的训练集对分类BP神经网络进行训练,以相应的测试集测试神经网络。研究结果表明:对用RSS划分4个数据集产生的训练集-测试集进行Hotelling T2检验,发现均存在F值超出界值的现象,而MSSS均未出现;使用MSSS训练的神经网络所产生的训练-测试误差差异、准确率差异均比使用RSS训练的神经网络所产生的小,说明用MSSS划分产生的训练集与测试集的一致性比用RSS划分产生的好。

Abstract: MSSS (Mean-similarity-based splitting sample), an algorithm for partitioning machine learning sample set, was presented based on similarity to mean vector. A sample set was splited into training set and test set by cosine similarity of each sample vector to mean vector. Simulation study were set up to evaluate MSSS. Four data sets from UCI were individually split by different proportions with MSSS and randomly splitting sample (RSS). The training set and test set were tested by Hotelling T2. Back propagation neural networks for classification were built up. Training set was used for training networks and test set for testing networks. The result shows that the F value of Hotelling T2 test for RSS might overtop its border, but that for MSSS does not. In contrast with RSS, MSSS has significantly lower error difference between training error and test error and accuracy difference between training accuracy and test accuracy of the network. It can be confirmed that the consistency between training set and test set from MSSS is superior to that from RSS.

基金信息:国家自然科学基金资助项目

有色金属在线官网  |   会议  |   在线投稿  |   购买纸书  |   科技图书馆

中南大学出版社 技术支持 版权声明   电话:0731-88830515 88830516   传真:0731-88710482   Email:administrator@cnnmol.com

互联网出版许可证:(署)网出证(京)字第342号   京ICP备17050991号-6      京公网安备11010802042557号