A Bayesian spam filtering algorithm based on the multinomial model and low risk

Source journal: Journal of Central South University (Science and Technology), 2013, No. 7

Authors: LIANG Zhiwen (梁志文), YANG Jinmin (杨金民), LI Yuanqi (李元旗)

Pages: 2787-2792

Key words: mail filtering; feature extraction; probability measurement; multinomial model; risk assessment

Abstract: When existing Bayesian algorithms are applied to spam filtering, the Bernoulli model cannot distinguish the importance of individual text features of a mail, which leads to a low recall rate in mail classification. Legitimate mail also risks being misjudged as spam. To address both problems, a Bayesian spam filtering algorithm based on the multinomial model and low risk is proposed. The multinomial model weights the text features to distinguish their importance in classification, and a low-risk strategy is then applied when comparing the probabilities that a mail falls into the spam class or the normal mail class, reducing the risk of misjudging legitimate mail. Experimental results show that, for different numbers of feature terms, the algorithm effectively improves the precision and recall of mail classification and reduces the risk of misjudging legitimate mail. When filtering mails with a large number of text characters, its performance is stable with little fluctuation.
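The abstract does not give the weighting formula or the exact low-risk rule, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes term-frequency weighting with Laplace smoothing for the multinomial model, and a hypothetical `risk_factor` parameter: a mail is labelled spam only when its spam posterior exceeds its normal-mail posterior by at least that factor. The function names and the toy training mails are illustrative assumptions.

```python
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, label) pairs, label is "spam" or "ham"."""
    counts = {"spam": Counter(), "ham": Counter()}
    n_docs = Counter()
    for tokens, label in docs:
        counts[label].update(tokens)   # term frequencies per class
        n_docs[label] += 1
    vocab = set(counts["spam"]) | set(counts["ham"])
    return counts, n_docs, vocab

def log_posterior(tokens, label, counts, n_docs, vocab):
    """Unnormalised log posterior under a multinomial naive Bayes model."""
    total_terms = sum(counts[label].values())
    score = math.log(n_docs[label] / sum(n_docs.values()))  # class prior
    for term, tf in Counter(tokens).items():
        if term not in vocab:
            continue  # ignore terms never seen in training
        # Laplace-smoothed term probability (assumed smoothing scheme)
        p = (counts[label][term] + 1) / (total_terms + len(vocab))
        score += tf * math.log(p)  # repeated terms count more (multinomial)
    return score

def classify(tokens, model, risk_factor=10.0):
    """Label as spam only if P(spam|mail) > risk_factor * P(ham|mail).

    risk_factor is a hypothetical knob standing in for the low-risk strategy.
    """
    spam = log_posterior(tokens, "spam", *model)
    ham = log_posterior(tokens, "ham", *model)
    return "spam" if spam - ham > math.log(risk_factor) else "ham"

# Toy training set (made up for illustration only)
docs = [
    (["win", "money", "now"], "spam"),
    (["win", "prize", "money"], "spam"),
    (["meeting", "schedule", "now"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train(docs)
```

With this toy model, a mail containing only the token `win` is more likely spam than not, but not by a factor of 10, so `classify(["win"], model)` keeps it as normal mail, while `classify(["win"], model, risk_factor=1.0)` would flag it. Raising `risk_factor` trades spam recall for fewer misjudged legitimate mails.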
