Journal of Shanghai Jiao Tong University (Medical Science) ›› 2022, Vol. 42 ›› Issue (7): 911-918. doi: 10.3969/j.issn.1674-8115.2022.07.010

• Original Article · Techniques and Methods •

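Effects of different expression matrices on screening differential lncRNAs

WEI Hao, QIU Jiajun, YAN Jingbin

  1. Shanghai Children's Hospital, Institute of Medical Genetics, Children's Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040
  • Received: 2022-03-24 Accepted: 2022-07-14 Online: 2022-07-28 Published: 2022-09-04
  • Corresponding author: YAN Jingbin E-mail: 1187383951@qq.com; m18917128323@163.com
  • About the author: WEI Hao (born 1996), male, master's student; e-mail: 1187383951@qq.com
  • Funding:
    National Key R&D Program of China (2019YFA0801402); General Program of the National Natural Science Foundation of China (81971421); Shanghai Key Clinical Specialty Project (shslczdzk05705)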

Effects of different expression matrices on screening differential lncRNAs

WEI Hao, QIU Jiajun, YAN Jingbin

  1. Shanghai Children's Hospital, Shanghai Institute of Medical Genetics, Shanghai Jiao Tong University School of Medicine, Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, China
  • Received:2022-03-24 Accepted:2022-07-14 Online:2022-07-28 Published:2022-09-04
  • Contact: YAN Jingbin E-mail:1187383951@qq.com;m18917128323@163.com
  • Supported by:
    National Key R&D Program of China (2019YFA0801402); National Natural Science Foundation of China (81971421); Shanghai Key Clinical Specialty Project (shslczdzk05705)

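Abstract:

Objective·To compare, based on whole transcriptome sequencing data, how well two approaches to differential expression analysis of long non-coding RNA (lncRNA) screen for differentially expressed lncRNAs. Methods·Two whole transcriptome sequencing datasets comprising 10 samples in total were downloaded from the NCBI GEO database. Group A consisted of universal human reference RNA samples and group B of human brain reference RNA samples; every sample contained a series of exogenous synthetic RNAs (spike-in RNA) at known concentrations from the External RNA Control Consortium (ERCC). The processed sequencing data were counted against the annotated reference genomes for mRNA, lncRNA, and total RNA, respectively, yielding three expression matrices, each of which included the spike-in RNA annotation. At P<0.05, the false positive rate and false negative rate of the differential expression results were judged against the true concentrations of the spike-in RNAs in the two groups. The R packages DESeq2 and edgeR were then used to perform between-group differential expression analysis on each expression matrix, and receiver operating characteristic (ROC) curves of the spike-in RNAs were used to show the specificity and accuracy of the analysis for each matrix. The study focused mainly on the difference between the total RNA expression matrix and the lncRNA expression matrix. In addition, differential lncRNA analysis was performed on the total RNA and lncRNA expression matrices of samples within each group, and the distribution of P values was tallied to compare the false positive rates of the matrices. Results·At P<0.05, the false positive and false negative rates for spike-in RNAs between groups A and B were 0.52 and 0.14 with the total RNA expression matrix as background, and 0.30 and 0.17 with the lncRNA expression matrix, so differential analysis with the lncRNA expression matrix had the lower false positive rate. The ordering of the area under the ROC curve (AUC) for spike-in RNAs was essentially the same across the software packages: AUC (total RNA)≈AUC (mRNA)<AUC (lncRNA), indicating that screening differential spike-ins from the lncRNA expression matrix performed better. In the within-group analysis at P<0.05, the lncRNA and total RNA expression matrices yielded 9 and 7 differential lncRNAs in group A and 15 and 17 in group B, respectively, with no significant difference in number between matrices. Conclusion·When differential expression analysis is performed on known lncRNAs in whole transcriptome sequencing data, using an expression matrix that contains only lncRNAs gives higher specificity and accuracy.

Key words: long non-coding RNA, differential expression analysis, spike-in RNA, receiver operating characteristic curve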

Abstract:

Objective·To compare two approaches to differential expression analysis of long non-coding RNA (lncRNA) in terms of how well they screen for differentially expressed lncRNAs in whole transcriptome sequencing data.

Methods·Two whole transcriptome sequencing datasets comprising 10 samples in total were downloaded from the NCBI GEO database. Group A consisted of universal human reference RNA samples, and group B consisted of human brain reference RNA samples. Each sample contained a series of synthetic RNAs (spike-in RNA) at known concentrations from the External RNA Control Consortium (ERCC). The processed sequencing data were counted against the annotated reference genomes for mRNA, lncRNA, and total RNA, respectively, yielding three expression matrices, each of which also contained the spike-in RNA annotation. At P<0.05, the false positive rate and false negative rate of the differential expression results were assessed against the true concentrations of the spike-in RNAs in the two groups. The R packages DESeq2 and edgeR were used to perform between-group differential expression analysis on every expression matrix, and receiver operating characteristic (ROC) curves of the spike-in RNAs were used to show the specificity and sensitivity of the analysis for each matrix. The study focused mainly on the differences between the total RNA expression matrix and the lncRNA expression matrix. Differential lncRNA analysis was then performed within each group on the total RNA and lncRNA expression matrices, and the distribution of P values was tallied to compare the false positive rates of the two matrices.
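The false positive and false negative rates described in the Methods can be illustrated with a minimal Python sketch. It is not the authors' pipeline (the study used R with DESeq2 and edgeR), and the abstract does not spell out the exact definitions, so the sketch assumes the conventional ones: the false positive rate is the share of truly unchanged spike-ins called significant, and the false negative rate is the share of truly differential spike-ins that are missed. The function name `error_rates` and all values below are hypothetical.

```python
def error_rates(p_values, truly_differential, alpha=0.05):
    """Return (false positive rate, false negative rate) at threshold alpha.

    p_values           -- one P value per spike-in from a differential test
    truly_differential -- True if the spike-in's known concentration differs
                          between the two groups, False otherwise
    """
    tp = fp = tn = fn = 0
    for p, is_diff in zip(p_values, truly_differential):
        called = p < alpha  # spike-in is called differentially expressed
        if called and is_diff:
            tp += 1
        elif called and not is_diff:
            fp += 1
        elif not called and is_diff:
            fn += 1
        else:
            tn += 1
    # Assumed definitions: FPR = FP / (FP + TN), FNR = FN / (FN + TP).
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


# Hypothetical toy values for illustration only, not data from the study.
toy_p = [0.001, 0.20, 0.03, 0.60, 0.04, 0.90]
toy_truth = [True, True, True, False, False, False]
fpr, fnr = error_rates(toy_p, toy_truth)
print(f"FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Running the same function on the P values obtained from each expression matrix would give one pair of rates per matrix, which is the kind of comparison the abstract reports.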

Results·At P<0.05, the false positive rate and false negative rate of spike-in RNA between groups A and B were 0.52 and 0.14 when analyzed with the total RNA expression matrix, and 0.30 and 0.17 when analyzed with the lncRNA expression matrix, which indicated that differential analysis with the lncRNA expression matrix had the lower false positive rate. The ordering of the area under the ROC curve (AUC) of spike-in RNA across expression matrices was generally consistent between the two R packages: AUC (total RNA)≈AUC (mRNA)<AUC (lncRNA), which indicated that the lncRNA expression matrix screened differential spike-in RNAs better than the total RNA expression matrix. The intra-group lncRNA differential expression analysis showed that, at P<0.05, there were 9 and 7 differentially expressed lncRNAs in the lncRNA expression matrix and total RNA expression matrix in group A, and 15 and 17 in group B, respectively. The numbers were not significantly different between expression matrices.
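The AUC comparison can be sketched in the same hedged way. The snippet below assumes that spike-ins are ranked by P value and uses the rank-based (Mann-Whitney) form of the AUC: the probability that a randomly chosen truly differential spike-in has a smaller P value than a randomly chosen unchanged one, with ties counted as one half. The abstract does not say how the authors built their ROC curves, so this is an illustration rather than their method; `auc_from_p_values` and the toy P values are hypothetical.

```python
def auc_from_p_values(p_values, truly_differential):
    """AUC for ranking spike-ins by P value (smaller P = stronger call)."""
    pos = [p for p, d in zip(p_values, truly_differential) if d]
    neg = [p for p, d in zip(p_values, truly_differential) if not d]
    if not pos or not neg:
        raise ValueError("need at least one spike-in of each class")
    wins = 0.0
    for p_pos in pos:
        for p_neg in neg:
            if p_pos < p_neg:
                wins += 1.0   # differential spike-in ranked ahead
            elif p_pos == p_neg:
                wins += 0.5   # tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy values for two expression matrices, not data from the study.
toy_truth = [True, True, True, False, False, False]
p_matrix_a = [0.001, 0.20, 0.03, 0.60, 0.04, 0.90]
p_matrix_b = [0.001, 0.02, 0.03, 0.60, 0.40, 0.90]

auc_a = auc_from_p_values(p_matrix_a, toy_truth)
auc_b = auc_from_p_values(p_matrix_b, toy_truth)
print(f"AUC A={auc_a:.3f}, AUC B={auc_b:.3f}")
```

A larger AUC means the test separates truly differential spike-ins from unchanged ones more cleanly, independent of any single P value threshold.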

Conclusion·In the differential expression analysis of known lncRNAs in whole transcriptome sequencing data, analysis with the lncRNA expression matrix has better specificity and sensitivity than analysis with the total RNA expression matrix.

Key words: long non-coding RNA, differential expression analysis, spike-in RNA, receiver operating characteristic curve
