Journal of Shanghai Jiao Tong University (Medical Science)

• Original Article (Basic Research) •

Literature mining for non-coding base sequence

AN Jian-fu, MENG Li-li

  1. Information Center, Renji Hospital, Shanghai Jiaotong University School of Medicine, Shanghai 200127, China
  • Online: 2013-10-28 Published: 2013-10-31
  • About the author: AN Jian-fu (born 1977), male, engineer, PhD; e-mail: anjianfu@163.com.

Abstract:

Objective To improve the recall and precision of literature retrieval on non-coding base sequences using a neural network algorithm. Methods Sample articles were selected from the PubMed database. After the samples were preprocessed, feature terms were selected by the term frequency (TF) × inverse document frequency (IDF) method, and a retrieval model based on the back-propagation (BP) neural network algorithm was built. Results With 100 feature terms selected, precision was 91.49%, recall was 71.23%, the area under the receiver operating characteristic curve (ROC-AUC) was 0.823, specificity was 93.37%, sensitivity was 71.23%, and accuracy was 82.30%. Conclusion Compared with common approaches such as keyword and MeSH term searching, the retrieval model based on the neural network algorithm can retrieve literature related to a particular topic both precisely and comprehensively.
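The reported figures can be cross-checked against one another. The sketch below recomputes precision, accuracy and a single-operating-point AUC from the reported sensitivity and specificity. It assumes a test set with equal numbers of relevant and irrelevant articles; the abstract does not state the class balance, so this is an inference, not the authors' description.

```python
# Reported values from the abstract
sensitivity = 0.7123  # identical to recall
specificity = 0.9337

# ASSUMPTION (not stated in the abstract): balanced test set,
# expressed here as relative class sizes.
n_pos = 1.0
n_neg = 1.0

# Confusion-matrix cells implied by the two rates
tp = sensitivity * n_pos
fn = n_pos - tp
tn = specificity * n_neg
fp = n_neg - tn

precision = tp / (tp + fp)
accuracy = (tp + tn) / (n_pos + n_neg)

# Area under the piecewise-linear ROC curve through
# (0, 0), (1 - specificity, sensitivity), (1, 1)
single_point_auc = (sensitivity + specificity) / 2

print(f"precision        = {precision:.4f}")
print(f"accuracy         = {accuracy:.4f}")
print(f"single-point AUC = {single_point_auc:.4f}")
```

Under this assumption the recomputed precision comes out within rounding of the reported 91.49%, and accuracy and AUC match 82.30% and 0.823, which suggests the figures describe one decision threshold on a roughly balanced sample.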

Key words: non-coding base sequence, neural network, back-propagation algorithm, term frequency × inverse document frequency, literature mining
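The pipeline named in the abstract (TF×IDF feature selection followed by a BP neural network classifier) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, the ranking of terms by summed TF×IDF score, the single hidden layer, and all names and hyperparameters (`select_features`, `train_bp`, `hidden`, `lr`, `epochs`) are assumptions made for the example.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus: label 1 = on-topic, 0 = off-topic
docs = [
    ("noncoding rna sequence regulates gene expression", 1),
    ("long noncoding rna sequence analysis cancer", 1),
    ("microrna noncoding sequence targets identified", 1),
    ("hospital nursing staff workload survey", 0),
    ("surgical outcome after hip replacement", 0),
    ("nursing care quality surgical wards", 0),
]
tokenized = [text.split() for text, _ in docs]
labels = [label for _, label in docs]

def select_features(tokenized, k):
    """Rank terms by TF x IDF summed over the corpus; keep the top k."""
    n = len(tokenized)
    df = Counter(term for doc in tokenized for term in set(doc))
    scores = Counter()
    for doc in tokenized:
        for term, count in Counter(doc).items():
            scores[term] += count * math.log(n / df[term])  # raw-count TF
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

def vectorize(doc, features):
    """Binary presence vector over the selected feature terms."""
    return [1.0 if f in doc else 0.0 for f in features]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
         for row, b in zip(W1, b1)]
    o = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, o

def train_bp(X, y, hidden=3, epochs=500, lr=0.5, seed=0):
    """One-hidden-layer network trained by per-sample back-propagation."""
    rng = random.Random(seed)
    n_in = len(X[0])
    W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(hidden)]
    b1 = [0.0] * hidden
    W2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden)]
    b2 = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            h, o = forward(x, W1, b1, W2, b2)
            # Cross-entropy loss with sigmoid output: output delta is o - t
            d_out = o - t
            # Propagate the error back through the hidden layer
            d_hid = [d_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(hidden)]
            for j in range(hidden):
                W2[j] -= lr * d_out * h[j]
                b1[j] -= lr * d_hid[j]
                for i in range(n_in):
                    W1[j][i] -= lr * d_hid[j] * x[i]
            b2 -= lr * d_out
    return lambda x: forward(x, W1, b1, W2, b2)[1]

features = select_features(tokenized, k=5)  # the paper reports using 100 terms
X = [vectorize(doc, features) for doc in tokenized]
predict = train_bp(X, labels)
predictions = [1 if predict(x) > 0.5 else 0 for x in X]
print(features)
print(predictions)
```

The example scores the same documents it was trained on, so it only shows that the pieces fit together. A real evaluation would hold out test articles, as the reported precision and recall presumably did.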