[Original] An exploration of duplicates in mRNA-seq data

健明 2021-07-14

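To remove or not to remove, that is the question.

Forum discussions

Well-known bioinformatics forums have long carried discussions of this question. See the Biostars discussions: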

Duplicated reads in RNA-Seq Experiment
Read duplicates
Duplicate reads in RNAseq

and this seqanswers thread and the other threads it links to.

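In general, duplicate reads in mRNA-seq data do not need to be removed, because there is no way to tell whether a duplicate arose from PCR during library preparation or from genuinely high expression of the gene. Computationally, read duplicates are defined via their mapping position, which does not distinguish PCR- from natural duplicates that are bound to occur for highly transcribed RNAs.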
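
To make the "defined via their mapping position" point concrete, here is a minimal Python sketch of position-based duplicate marking. It is only an illustration under simplifying assumptions: the reads are toy single-end records of (chromosome, start, strand), and the function name `mark_position_duplicates` is hypothetical rather than taken from any real tool. Real duplicate markers also consider mate positions and clipping.

```python
from collections import Counter


def mark_position_duplicates(reads):
    """Flag every read after the first one seen at a (chrom, start, strand) key."""
    seen = Counter()
    flags = []
    for chrom, start, strand in reads:
        key = (chrom, start, strand)
        # A read is called a duplicate only because its key was seen before;
        # nothing here can tell a PCR copy from an independent fragment.
        flags.append(seen[key] > 0)
        seen[key] += 1
    return flags


# Toy alignments (hypothetical): three reads share one start and strand.
reads = [
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),
    ("chr1", 1000, "+"),
    ("chr1", 1000, "-"),
    ("chr2", 500, "+"),
]
flags = mark_position_duplicates(reads)
dup_rate = sum(flags) / len(flags)
print(flags, dup_rate)
```

Whether the three reads at chr1:1000(+) are PCR copies or three independent fragments of a highly expressed transcript, the position key treats them identically, which is exactly the ambiguity described above.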

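However, some special cases still deserve attention: if the duplicate rate is unusually high, for example, you need to investigate carefully where the problem lies.

Literature search

In 2017 a tool was published specifically to investigate duplication in sequencing data, in a paper titled A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments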

analysis of exome datasets prepared using the Nextera library preparation method indicated that 45–50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias.
analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70–95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples.

The tool is available at: https://github.com/vibansal/PCRduplicates

As early as 2015, a paper compared different library preparation and sequencing technologies, including Smart-Seq, TruSeq and UMI-seq, under the title: The impact of amplification on differential expression analyses by RNA-seq. Its conclusion: Consequently, the computational removal of duplicates does improve neither accuracy nor precision and can actually worsen the power and the False Discovery Rate (FDR) for differential gene expression.

Duplication should be assessed from the BAM file

FastQC’s duplication plot is based on an assumption of (relatively) even sampling over the available sequence space. For library types with significant enrichment (such as RNA-Seq) this assumption falls down, and for highly expressed genes you should expect to see high levels of duplication even in libraries with no PCR amplification duplication because eventually you run out of places to put new reads. Any reasonably well covered RNA-Seq library will trigger the duplication flag in its report.
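
The "run out of places to put new reads" argument can be checked with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that every read from a transcript picks its start position uniformly and independently among `n_positions` possible starts, with no PCR at all. The function name and the 2000-position transcript are hypothetical.

```python
def expected_duplicate_fraction(n_reads, n_positions):
    """Expected fraction of reads that reuse an already-occupied start position.

    Assumes uniform, independent start positions and no PCR amplification,
    so every duplicate counted here is a "natural" one.
    """
    if n_reads == 0:
        return 0.0
    # Expected number of distinct start positions hit by n_reads draws.
    expected_distinct = n_positions * (1.0 - (1.0 - 1.0 / n_positions) ** n_reads)
    return 1.0 - expected_distinct / n_reads


# A hypothetical transcript offering 2000 possible start positions.
for n in (100, 1000, 10000, 100000):
    print(n, round(expected_duplicate_fraction(n, 2000), 3))
```

Under these assumptions the duplicate fraction climbs towards 1 as coverage of the transcript grows, even though no PCR duplicate exists in the model.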

You therefore can’t use sequence level analyses such as FastQC to look effectively at whether you have a duplication problem in RNA-Seq data. The right way to look at this is to map the data to your reference and then look at the relationship between read density and duplication in your samples. There is a nice package called dupRadar which can do this from the command line, and it’s now also built into the SeqMonk graphical analysis program.

What you’re looking for is either a universally (or mostly) low level of duplication, or if you have places with high duplication then there should be a strong relationship between the density of reads and the level of duplication. If you find consistently high duplication without high read density then you have a problem.
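
The check described above can be sketched in Python as a comparison of per-gene read density against per-gene duplication rate. This is not how dupRadar itself works; it is a simplified illustration with fixed cutoffs, and the gene names, counts, and the thresholds `max_rpk` and `min_dup_rate` are all hypothetical.

```python
def per_gene_duplication(genes):
    """Map gene -> (reads per kilobase, duplication rate) from toy counts.

    `genes` maps a name to (length_bp, total_reads, duplicate_reads).
    """
    stats = {}
    for name, (length_bp, total_reads, duplicate_reads) in genes.items():
        rpk = total_reads / (length_bp / 1000.0)
        dup_rate = duplicate_reads / total_reads if total_reads else 0.0
        stats[name] = (rpk, dup_rate)
    return stats


def flag_suspicious(stats, max_rpk=50.0, min_dup_rate=0.5):
    """Return genes with high duplication despite low read density."""
    return sorted(
        name
        for name, (rpk, dup_rate) in stats.items()
        if rpk < max_rpk and dup_rate > min_dup_rate
    )


# Toy per-gene counts (hypothetical).
genes = {
    "geneA": (2000, 40000, 36000),  # dense and highly duplicated: expected
    "geneB": (3000, 60, 3),         # sparse, few duplicates: fine
    "geneC": (4000, 80, 60),        # sparse but highly duplicated: suspicious
}
stats = per_gene_duplication(genes)
suspicious = flag_suspicious(stats)
print(suspicious)
```

Only the sparse but heavily duplicated gene is flagged; the highly expressed gene is left alone because its duplication is explained by its read density.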

It’s probably also worth mentioning that even if you do have some problems with technical duplication, then deduplicating your data isn’t a magic fix and will cause you problems which are different, but often just as bad as the duplication.

The duplication percentage that FastQC reports from FASTQ files is therefore of little reference value.