Journal of Shanghai Jiao Tong University (Medical Science) ›› 2018, Vol. 38 ›› Issue (9): 1019-. doi: 10.3969/j.issn.1674-8115.2018.09.004

• Original Article · Basic Research •

Construction of a gut microbiota signature for diagnosing colorectal cancer based on metagenomic analysis

ZHANG Xin-yu 1*, ZHANG Jing 2*, ZHU Xiao-qiang 1, CAO Ying-ying 1, CHEN Hao-yan 1

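  1. Department of Gastroenterology, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Institute of Digestive Disease, Shanghai 200001, China; 2. Medical Record Statistics Center, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200001, China
  • Publication date: 2018-09-28 Release date: 2018-10-15
  • Corresponding author: CHEN Hao-yan, E-mail: haoyanchen@shsmu.edu.cn.
  • About the authors: ZHANG Xin-yu (born 1994), female, master's student; E-mail: stella941@126.com. ZHANG Jing (born 1972), female, senior statistician, bachelor's degree; E-mail: 13611793563@126.com. *Co-first authors.
  • Funding:
    National Natural Science Foundation of China (31371273); Shanghai Municipal Education Commission "Young Eastern Scholar" program for institutions of higher learning (QD2015003); Shanghai Municipal Education Commission Gaofeng-Gaoyuan Discipline Construction Program (20161309)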

Bacterial signatures for diagnosis of colorectal cancer by fecal metagenomics analysis

ZHANG Xin-yu1*, ZHANG Jing2*, ZHU Xiao-qiang1, CAO Ying-ying1, CHEN Hao-yan1   

  1. Department of Gastroenterology and Hepatology, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai Institute of Digestive Disease, Shanghai 200001, China; 2. Medical Record Statistics Center, Renji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200001, China
  • Online: 2018-09-28 Published: 2018-10-15
  • Supported by:
    National Natural Science Foundation of China, 31371273; “Youth Eastern Scholar” at Shanghai Institutions of Higher Learning, QD2015003; Shanghai Municipal Education Commission— Gaofeng Clinical Medicine Support, 20161309

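Abstract: Objective · To establish a gut microbiota signature from fecal metagenomic data and to explore a noninvasive method for the screening and diagnosis of colorectal cancer (CRC). Methods · A total of 285 samples were included. Feature bacteria closely associated with CRC were selected with the random forest classification algorithm; diagnostic models for CRC were then built with 6 machine learning classification models and validated internally and externally. Results · Nine feature bacteria closely associated with CRC were first selected and used to build 6 diagnostic models. Of these, the random forest model had the highest accuracy (0.8477); its accuracy was 0.8158 in the internal validation set and 0.7344 in the external validation set, and its area under the receiver operating characteristic (ROC) curve (AUC) in the full set was 0.894. Conclusion · Based on fecal metagenomic data, a signature of 9 bacteria for diagnosing CRC was established with the random forest algorithm; it can effectively distinguish healthy individuals from patients with CRC.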

Keywords: colorectal cancer, diagnosis, gut microbiota, machine learning, random forest

Abstract:

Objective · To construct bacterial signatures by analyzing fecal metagenomics for the screening and diagnosis of colorectal cancer (CRC). Methods · A total of 285 samples were included in the study. Featured bacteria were selected by the random forest algorithm and used to develop diagnostic models for CRC with six different machine learning algorithms; the models were then validated in validation sets. Results · Nine bacteria that differentiated patients with CRC from controls were identified and used to establish 6 models. The best model was the random forest model, with an accuracy of 0.8477 in the training set. Its accuracy in the two test sets was 0.8158 and 0.7344, respectively. The area under the receiver operating characteristic curve (AUC) of the random forest model in the set including all samples was 0.894. Conclusion · Bacterial signatures built with the random forest algorithm can effectively differentiate patients with CRC from controls, which suggests their potential clinical value for the diagnosis of CRC.
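
The two performance measures reported above, accuracy and AUC, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `accuracy` and `roc_auc`, the 0.5 decision threshold, and the toy labels and scores are all assumptions made for illustration, and no data from the study are used. AUC is computed here through its rank interpretation, the probability that a randomly chosen CRC sample receives a higher model score than a randomly chosen control.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as P(score of a case > score of a control); ties count as 0.5."""
    cases = [s for t, s in zip(y_true, scores) if t == 1]
    controls = [s for t, s in zip(y_true, scores) if t == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical toy data: 1 = CRC, 0 = control; scores stand in for
# a classifier's predicted probability of CRC.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
# Assumed decision threshold of 0.5 to turn scores into labels.
toy_preds = [1 if s >= 0.5 else 0 for s in toy_scores]

if __name__ == "__main__":
    print("accuracy:", accuracy(toy_labels, toy_preds))
    print("AUC:", roc_auc(toy_labels, toy_scores))
```

Accuracy depends on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is one reason the two values reported for the same model can differ.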

Key words: colorectal cancer, diagnosis, intestinal bacteria, machine learning, random forest

CLC Number: