Original Article · Techniques and Methods

Effects of different expression matrices on screening differential long non-coding RNAs

  • WEI Hao,
  • QIU Jiajun,
  • YAN Jingbin
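  • Shanghai Children's Hospital, Shanghai Institute of Medical Genetics, Shanghai Jiao Tong University School of Medicine, Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, China
WEI Hao (born 1996), male, master's student; e-mail: 1187383951@qq.com
YAN Jingbin, e-mail: m18917128323@163.com

Received: 2022-03-24

  Accepted: 2022-07-14

  Published online: 2022-09-04

Funding

National Key R&D Program of China (2019YFA0801402); General Program of the National Natural Science Foundation of China (81971421); Shanghai Key Clinical Specialty Project (shslczdzk05705)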

Effects of different expression matrices on screening differential lncRNAs

  • Hao WEI ,
  • Jiajun QIU ,
  • Jingbin YAN
  • Shanghai Children's Hospital, Shanghai Institute of Medical Genetics, Shanghai Jiao Tong University School of Medicine, Shanghai Key Laboratory of Embryo and Reproduction Engineering, Shanghai 200040, China
YAN Jingbin, E-mail: m18917128323@163.com.

Received date: 2022-03-24

  Accepted date: 2022-07-14

  Online published: 2022-09-04

Supported by

National Key R&D Program of China (2019YFA0801402); National Natural Science Foundation of China (81971421); Shanghai Key Clinical Specialty Project (shslczdzk05705)

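Abstract (Chinese version)

Objective · To compare, on whole transcriptome sequencing data, how well two approaches to differential expression analysis of long non-coding RNA (lncRNA) screen for differential lncRNAs. Methods · Two whole transcriptome sequencing datasets comprising 10 samples in total were downloaded from the NCBI GEO database. Group A consisted of universal human reference RNA samples and group B of human brain reference RNA samples; every sample contained a series of exogenous synthetic RNAs (spike-in RNA) at known concentrations from the External RNA Control Consortium (ERCC). Reads from the processed sequencing data were counted separately against the mRNA, lncRNA, and total RNA annotations of the reference genome, yielding three expression matrices that each included the spike-in RNA annotation. At P<0.05, the false positive rate and false negative rate of the differential expression results were determined from the true spike-in RNA concentrations in each group. The R packages DESeq2 and edgeR were then used to perform between-group differential expression analysis on every expression matrix, and receiver operating characteristic (ROC) curves of the spike-in RNA were used to show the specificity and accuracy of differential expression analysis with each matrix. The study focused mainly on the difference between the total RNA expression matrix and the lncRNA expression matrix. In addition, differential lncRNA analysis was performed within each group on the total RNA and lncRNA expression matrices, and the P value distributions were tallied to compare the false positive rates of the two matrices. Results · At P<0.05, the false positive rate and false negative rate of spike-in RNA between groups A and B were 0.52 and 0.14 with the total RNA expression matrix as background and 0.30 and 0.17 with the lncRNA expression matrix, so differential analysis with the lncRNA expression matrix gave the lower false positive rate. The ordering of the areas under the ROC curve (AUC) for spike-in RNA was essentially the same across software packages, AUC (total RNA) ≈ AUC (mRNA) < AUC (lncRNA), so screening differential spike-ins with the lncRNA expression matrix performed better. In the within-group lncRNA analysis at P<0.05, the lncRNA and total RNA expression matrices yielded 9 and 7 differential lncRNAs in group A and 15 and 17 in group B, respectively; the counts did not differ significantly between matrices. Conclusion · When analyzing differential expression of known lncRNAs in whole transcriptome sequencing data, using an expression matrix that contains only lncRNAs gives higher specificity and accuracy.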

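Cite this article

WEI Hao, QIU Jiajun, YAN Jingbin. Effects of different expression matrices on screening differential long non-coding RNAs[J]. Journal of Shanghai Jiao Tong University (Medical Science), 2022, 42(7): 911-918. DOI: 10.3969/j.issn.1674-8115.2022.07.010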

Abstract

Objective

·To compare the effects of two methods for differential analysis of long non-coding RNA (lncRNA) expression levels on screening differential lncRNAs based on whole transcriptome sequencing data.

Methods

·Two whole transcriptome sequencing datasets comprising 10 samples in total were downloaded from the NCBI GEO database. Group A consisted of universal human reference RNA samples, and group B consisted of human brain reference RNA samples. Each sample contained a series of synthetic RNAs (spike-in RNA) at known concentrations from the External RNA Control Consortium (ERCC). Reads from the processed sequencing data were counted separately against the mRNA, lncRNA, and total RNA annotations of the reference genome, yielding three expression matrices that each contained the spike-in RNA annotation. At P<0.05, the false positive rate and false negative rate of the differential expression results were determined from the true concentrations of spike-in RNA in each group. The R packages DESeq2 and edgeR were used to perform differential expression analysis between groups on all expression matrices, and the receiver operating characteristic (ROC) curve of spike-in RNA was used to show the specificity and sensitivity of differential expression analysis with each matrix. The study focused mainly on the differences between the total RNA expression matrix and the lncRNA expression matrix. Differential lncRNA analysis was also performed within each group on the total RNA and lncRNA expression matrices, and the P value distributions were tallied to compare the false positive rates of the two matrices.
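The error rates described in the Methods can be illustrated with a minimal sketch. It assumes the standard definitions, which the abstract does not spell out: the false positive rate is FP / (FP + TN) over spike-ins whose true concentration is the same in both groups, and the false negative rate is FN / (FN + TP) over spike-ins whose true concentration differs. The function name, spike-in identifiers, and P values below are hypothetical and are not taken from the study.

```python
def spike_in_error_rates(p_values, truly_differential, alpha=0.05):
    """Return (false_positive_rate, false_negative_rate) at threshold alpha.

    p_values: dict mapping spike-in ID -> P value from the differential test.
    truly_differential: dict mapping spike-in ID -> True if the known
        concentrations differ between the two groups, else False.
    """
    tp = fp = tn = fn = 0
    for spike_id, p in p_values.items():
        called = p < alpha  # the test calls this spike-in "differential"
        if truly_differential[spike_id]:
            if called:
                tp += 1
            else:
                fn += 1
        else:
            if called:
                fp += 1
            else:
                tn += 1
    # Guard against empty classes rather than dividing by zero.
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    return fpr, fnr


# Hypothetical toy input: three truly changed and three unchanged spike-ins.
toy_p = {"s1": 0.001, "s2": 0.20, "s3": 0.03, "s4": 0.60, "s5": 0.04, "s6": 0.50}
toy_truth = {"s1": True, "s2": True, "s3": False, "s4": False, "s5": True, "s6": False}

fpr, fnr = spike_in_error_rates(toy_p, toy_truth)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Running the same function once per expression matrix (total RNA, mRNA, lncRNA) on the same spike-in truth table would give directly comparable rates.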

Results

·At P<0.05, the false positive rate and false negative rate of spike-in RNA between groups A and B were 0.52 and 0.14 when analyzed with the total RNA expression matrix, and 0.30 and 0.17 when analyzed with the lncRNA expression matrix, which indicated that the false positive rate of differential analysis using the lncRNA expression matrix was lower. The ordering of the areas under the curve (AUC) of spike-in RNA across expression matrices was generally consistent between the R packages: AUC (total RNA)≈AUC (mRNA)<AUC (lncRNA), which indicated that the lncRNA expression matrix screened differential spike-ins better than the total RNA expression matrix. The intra-group lncRNA differential expression analysis showed that, at P<0.05, there were 9 and 7 differentially expressed lncRNAs in the lncRNA expression matrix and total RNA expression matrix in group A, and 15 and 17 in group B, respectively. The numbers were not significantly different between expression matrices.
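The AUC comparison can be sketched in the same way. The example below is only an illustration under stated assumptions: it ranks spike-ins by negated P value (smaller P means stronger evidence) and computes the AUC as the probability that a truly differential spike-in outranks a non-differential one, with ties counted as one half. How the study built its ROC curves is not stated in the abstract, and all names and numbers here are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    scores: list of numbers, higher = more evidence of differential expression.
    labels: list of booleans, True = truly differential spike-in.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))


# Hypothetical P values for the same six spike-ins under two expression matrices.
truth = [True, True, False, False, True, False]
p_total = [0.001, 0.20, 0.03, 0.60, 0.04, 0.50]
p_lnc = [0.001, 0.02, 0.30, 0.60, 0.01, 0.50]

auc_total = roc_auc([-p for p in p_total], truth)
auc_lnc = roc_auc([-p for p in p_lnc], truth)
print(f"AUC(total RNA) = {auc_total:.3f}, AUC(lncRNA) = {auc_lnc:.3f}")
```

The pairwise form is quadratic in the number of spike-ins, which is acceptable for a control set of this kind; a rank-based formula would be preferable for large gene sets.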

Conclusion

·In the differential expression analysis of known lncRNAs in whole transcriptome sequencing data, analysis with the lncRNA expression matrix achieves better specificity and sensitivity than analysis with the total RNA expression matrix.
