比对本质是是很简单的了,各种mapping工具层出不穷,我们一般常用的就是BWA和bowtie了,我这里就挑选bowtie2吧,反正别人已经做好了各种工具效果差异的比较,我们直接用就好了,代码如下: ## step5 : alignment to hg19/ using bowtie2 to do alignment ## ~/biosoft/bowtie/bowtie2-2.2.9/bowtie2-build ~/biosoft/bowtie/hg19_index /hg19.fa ~/biosoft/bowtie/hg19_index/hg19 ## cat >run_bowtie2.sh ls *.fastq | while read id ; do echo $id #~/biosoft/bowtie/bowtie2-2.2.9/bowtie2 -p 8 -x ~/biosoft/bowtie/hg19_index/hg19 -U $id -S ${id%%.*}.sam 2>${id%%.*}.align.log; #samtools view -bhS -q 30 ${id%%.*}.sam > ${id%%.*}.bam ## -F 1548 https://broadinstitute./picard/explain-flags.html # -F 0x4 remove the reads that didn't match samtools sort ${id%%.*}.bam ${id%%.*}.sort ## prefix for the output # samtools view -bhS a.sam | samtools sort -o - ./ > a.bam samtools index ${id%%.*}.sorted.bam done
这个索引~/biosoft/bowtie/hg19_index/hg19需要自己提取建立好,见前文 初步比对的sam文件到底该如何过滤,我查了很多文章都没有给出个子丑寅卯,各执一词,我也没办法给大家一个标准,反正我测试了好几种,看起来call peaks的差异不大,就是得不到文章给出的那些结果!! 一般来说,初步比对的sam文件只能选取unique mapping的结果,所以我用了#samtools view -bhS -q 30,但是结果并没什么改变,有人说是peak caller这些工具本身就会做这件事,所以取决于你下游分析所选择的工具。 给大家看比对的日志吧: SRR1042593.fastq 16902907 reads; of these: 16902907 (100.00%) were unpaired; of these: 667998 (3.95%) aligned 0 times 12467095 (73.76%) aligned exactly 1 time 3767814 (22.29%) aligned >1 times 96.05% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042594.fastq 60609833 reads; of these: 60609833 (100.00%) were unpaired; of these: 9165487 (15.12%) aligned 0 times 39360173 (64.94%) aligned exactly 1 time 12084173 (19.94%) aligned >1 times 84.88% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042595.fastq 14603295 reads; of these: 14603295 (100.00%) were unpaired; of these: 918028 (6.29%) aligned 0 times 10403045 (71.24%) aligned exactly 1 time 3282222 (22.48%) aligned >1 times 93.71% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042596.fastq 65911151 reads; of these: 65911151 (100.00%) were unpaired; of these: 10561790 (16.02%) aligned 0 times 42271498 (64.13%) aligned exactly 1 time 13077863 (19.84%) aligned >1 times 83.98% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042597.fastq 22210858 reads; of these: 22210858 (100.00%) were unpaired; of these: 1779568 (8.01%) aligned 0 times 15815218 (71.20%) aligned exactly 1 time 4616072 (20.78%) aligned >1 times 91.99% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042598.fastq 58068816 reads; of these: 58068816 (100.00%) were unpaired; of these: 8433671 (14.52%) aligned 0 times 37527468 (64.63%) aligned exactly 1 time 12107677 (20.85%) aligned >1 times 85.48% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042599.fastq 24019489 reads; of these: 24019489 (100.00%) were unpaired; of these: 1411095 (5.87%) aligned 0 times 17528479 (72.98%) aligned exactly 1 time 5079915 (21.15%) aligned >1 times 94.13% overall alignment rate [samopen] SAM header is present: 93 sequences. SRR1042600.fastq 76361026 reads; of these: 76361026 (100.00%) were unpaired; of these: 8442054 (11.06%) aligned 0 times 50918615 (66.68%) aligned exactly 1 time 17000357 (22.26%) aligned >1 times 88.94% overall alignment rate [samopen] SAM header is present: 93 sequences.
可以看到比对非常成功!!!我这里就不用表格的形式来展现了,毕竟我又不是给客户写报告,大家就将就着看吧。
|