Optimization of a data cleaning algorithm based on SNM

Source journal: Journal of Central South University (Science and Technology), 2010, No. 6

Authors: 张建中, 方正, 熊拥军, 袁小一

Pages: 2240-2245

Key words: data mining; data cleaning; approximately duplicate records; SNM algorithm

Abstract: The basic sorted-neighborhood method (SNM) is analyzed and its deficiencies are identified. An optimized data cleaning algorithm based on SNM is then proposed. Experiments were carried out on a sample of more than 2 000 bibliographic records collected from the metallurgical and mineral engineering institutional repository of Central South University. The "dirty data" in these records were cleaned, and approximately duplicate records were removed, according to the Dublin Core (DC) standard and related specifications. The results show that, in the same computing environment, the optimized algorithm clearly outperforms SNM in recall, false-identification rate, and run time.
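For readers unfamiliar with the baseline, the following is a minimal sketch of the basic sorted-neighborhood idea only: build a sort key per record, sort, then compare each record with the records that fall inside a sliding window. It is not the optimized algorithm of this paper, whose details are not given in the abstract. The field names (`title`, `author`), the key construction in `make_key`, the averaged `difflib` similarity, and the `window` and `threshold` values are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def make_key(record):
    # Hypothetical sort key: first 3 characters of the author plus the
    # first 5 characters of the title, lowercased, spaces removed.
    author = record["author"].lower().replace(" ", "")
    title = record["title"].lower().replace(" ", "")
    return author[:3] + title[:5]


def similarity(a, b):
    # Assumed similarity measure: mean of per-field string ratios.
    fields = ("title", "author")
    ratios = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in fields
    ]
    return sum(ratios) / len(ratios)


def snm_duplicates(records, window=3, threshold=0.85):
    """Return index pairs (i, j), i < j, flagged as approximate duplicates."""
    # Step 1 and 2: build keys and sort record indices by key.
    order = sorted(range(len(records)), key=lambda i: make_key(records[i]))
    pairs = set()
    # Step 3: compare each record with the previous (window - 1) records
    # in sorted order.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


if __name__ == "__main__":
    sample = [
        {"title": "Data cleaning based on SNM", "author": "Zhang J"},
        {"title": "Mineral processing review", "author": "Li W"},
        {"title": "Data cleaning based on SNM.", "author": "Zhang J."},
        {"title": "Alloy corrosion study", "author": "Chen Q"},
    ]
    print(snm_duplicates(sample))
```

The sketch also shows why the basic method is sensitive to its parameters: duplicates are found only when the sort key places them within `window` positions of each other, so a poorly chosen key or a small window lowers recall, while a large window raises run time.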
