Journal of Shanghai Jiao Tong University (Medical Science)

• Original Article (Basic Research) •

Construction of an AdaBoost classifier and its identification of harmful non-coding mutations in liver cancer

XU Li-ping1, LI Jia2, FANG Lin2

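  1. Department of General Surgery, Ningbo First Hospital, Ningbo 315010, Zhejiang Province; 2. Department of Thyroid and Breast Surgery, Shanghai Tenth People's Hospital, Tongji University, Shanghai 200072
  • Publication date: 2015-06-28 Release date: 2015-07-30
  • Corresponding author: FANG Lin, E-mail: fanglin_f@126.com.
  • About the author: XU Li-ping (born 1978), male, associate chief physician, master's degree; E-mail: 22434289@qq.com.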

Establishment of an AdaBoost classifier model and evaluation of harmful mutations in non-coding regions of liver cancer cells

XU Li-ping1, LI Jia2, FANG Lin2   

  1. Department of General Surgery, Ningbo First Hospital, Ningbo 315010, China; 2. Department of Thyroid and Breast Surgery, Shanghai Tenth People's Hospital, Tongji University, Shanghai 200072, China
  • Online: 2015-06-28 Published: 2015-07-30

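Abstract:

Objective To build an AdaBoost classifier model, assess the likelihood that mutations in non-coding regions in liver cancer are disease-related, and identify harmful mutations in non-coding regions. Methods A total of 13 108 disease-related non-coding mutations from the Human Gene Mutation Database (HGMD) served as the experimental group, with neutral single nucleotide polymorphisms (SNPs) as controls. An AdaBoost classifier was built from regulatory features of non-coding regions, including conserved regions, evolutionarily conserved RNA structures, highly expressed genes, DNase I hypersensitive sites, transcription factor binding sites, histone modifications, and early-replicating genes, and the value of these features for predicting harmful non-coding mutations was analyzed. A receiver operating characteristic (ROC) curve was constructed from the predicted probabilities, and the corresponding area under the ROC curve (AUCROC) was calculated. The model was validated separately on disease-associated variants from genome-wide association studies (GWAS) and from the ClinVar database. Results In descending order of importance for discriminating disease-related mutations, the features were conserved regions, early-replicating genes, untranslated regions (UTRs), promoters, highly expressed regions, H3K36me3, and conserved transcription factor binding sites, among others. The ROC curve built from the classifier's predicted probabilities had an AUCROC of 0.90. The mean scores of GWAS and ClinVar disease-associated variants were significantly higher than those of neutral SNPs (P<0.05). Conclusion The AdaBoost classifier helps assess the likelihood of harmful mutations in non-coding regions in liver cancer and is a highly accurate prediction tool.

Key words: liver cancer, non-coding region mutation, AdaBoost classifier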

Abstract:

Objective To establish an AdaBoost classifier model, evaluate the likelihood that mutations in non-coding regions of liver cancer cells are disease-related, and identify harmful mutations in non-coding regions. Methods A total of 13 108 disease-related mutations in non-coding regions were selected from the Human Gene Mutation Database (HGMD) as the experimental group, and neutral single nucleotide polymorphisms (SNPs) were used as controls. The AdaBoost classifier was built from regulatory features of non-coding regions, including conserved regions, evolutionarily conserved RNA structures, highly expressed genes, DNase I hypersensitive sites, transcription factor binding sites (TFBSs), histone modifications, and early-replicating genes. The value of these features for predicting harmful mutations in non-coding regions was analyzed. The receiver operating characteristic (ROC) curve of the predicted probabilities was plotted and the area under the ROC curve (AUCROC) was calculated. Disease-associated variants from genome-wide association studies (GWAS) and from the ClinVar database were used to validate the model. Results Ranked by importance for identifying disease-related mutations, the features were conserved regions, early-replicating genes, untranslated regions (UTRs), promoters, highly expressed regions, H3K36me3, and conserved TFBSs. The ROC curve built from the predicted probabilities of the AdaBoost classifier had an AUCROC of 0.90. The average scores of GWAS and ClinVar disease-associated variants were significantly higher than those of neutral SNPs (P<0.05). Conclusion The AdaBoost classifier is helpful for evaluating the likelihood of harmful mutations in non-coding regions of liver cancer cells and is an accurate prediction tool.

Key words: liver cancer, non-coding variant, AdaBoost classifier
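
The workflow summarized in the abstract (boosting over binary regulatory annotations, ranking features, scoring with ROC AUC) can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, annotation rates, sample sizes, and number of boosting rounds are all assumptions chosen for illustration, the data are synthetic, and textbook discrete AdaBoost with one-feature decision stumps stands in for whatever variant and software the study used.

```python
import math
import random

# Hypothetical binary annotations; the names echo the abstract, the data do not.
FEATURES = ["conserved_region", "early_replicating", "utr", "dnase_hs"]

def make_synthetic(n, seed):
    """Toy variants: label 1 = disease-associated, 0 = neutral SNP."""
    rng = random.Random(seed)
    p_pos = [0.7, 0.6, 0.4, 0.5]  # assumed annotation rates among positives
    p_neg = [0.2, 0.3, 0.2, 0.4]  # assumed annotation rates among controls
    X, y = [], []
    for i in range(n):
        label = i % 2  # balanced classes
        rates = p_pos if label else p_neg
        X.append([int(rng.random() < p) for p in rates])
        y.append(label)
    return X, y

def stump(x, j, polarity):
    """Vote +polarity when annotation j is present, -polarity otherwise."""
    return polarity if x[j] else -polarity

def weighted_error(w, X, signs, j, polarity):
    """Total weight of the variants that stump (j, polarity) misclassifies."""
    return sum(wi for wi, xi, si in zip(w, X, signs) if stump(xi, j, polarity) != si)

def fit_adaboost(X, y, n_rounds=20):
    """Discrete AdaBoost over single-feature decision stumps."""
    n = len(X)
    signs = [1 if label else -1 for label in y]
    w = [1.0 / n] * n
    model = []  # entries are (feature index, polarity, alpha)
    for _ in range(n_rounds):
        # Pick the stump with the lowest weighted error.
        err, j, pol = min(
            (weighted_error(w, X, signs, j, pol), j, pol)
            for j in range(len(X[0]))
            for pol in (1, -1)
        )
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((j, pol, alpha))
        # Up-weight misclassified variants, then renormalise.
        w = [wi * math.exp(-alpha * si * stump(xi, j, pol))
             for wi, xi, si in zip(w, X, signs)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def score(model, x):
    """Signed ensemble score; larger means more likely disease-associated."""
    return sum(alpha * stump(x, j, pol) for j, pol, alpha in model)

def feature_importance(model, n_features):
    """Share of total |alpha| assigned to each feature (one simple definition)."""
    totals = [0.0] * n_features
    for j, _, alpha in model:
        totals[j] += abs(alpha)
    norm = sum(totals) or 1.0
    return [t / norm for t in totals]

def roc_auc(scores, labels):
    """AUC as P(positive outscores negative), counting ties as one half."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

X_train, y_train = make_synthetic(400, seed=0)
X_test, y_test = make_synthetic(400, seed=1)
model = fit_adaboost(X_train, y_train)
auc = roc_auc([score(model, x) for x in X_test], y_test)
importance = feature_importance(model, len(FEATURES))
print(f"held-out AUC: {auc:.2f}")
for name, share in sorted(zip(FEATURES, importance), key=lambda t: -t[1]):
    print(f"{name}: {share:.2f}")
```

The AUC here is computed on a separate synthetic test set rather than on the training data, and the importance measure is only one of several reasonable definitions, so its numbers are not comparable with the 0.90 or the feature ranking reported in the abstract.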