split_libraries.py (QIIME)

BIOINFO_J 2017-04-07


Example (the full path at the start of the command was needed to run the script):

/usr/lib/qiime/bin/split_libraries.py -m map.txt -f mtt.fna -q mtt.qual -o lib1 -r -l 150 -b variable_length -M 6

split_libraries.py

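-m Mapping file (see the earlier blog post for details)

-f FASTA file; separate multiple files with commas

-q Quality (.qual) file; separate multiple files with commas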

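-r Remove sequences that cannot be assigned to a sample. Note that the QIIME 1.9 documentation marks this flag as deprecated: unassigned reads are dropped by default, and --retain_unassigned_reads keeps them.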

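-l Minimum sequence length; default 200. Shorter sequences are discarded.

-L Maximum sequence length; default 1000. Longer sequences are discarded.

-t Calculate sequence length after the primers and barcodes have been removed; default False, i.e. length is measured before trimming.

-s Minimum average quality score allowed in a read; default 25. Reads whose average score is below this value are discarded.

-k Keep the primer

-B Keep the barcode

-b Barcode type: hamming_8, golay_12, variable_length (will disable any barcode correction if variable_length set), or the barcode length as a number (e.g. -b 4 means a barcode of length 4). Default golay_12.

-e Maximum number of barcode errors allowed; default 1.5

-c Disable the search for the nearest matching barcode

-a Maximum number of ambiguous bases allowed; default 6

-H --max-homopolymer, maximum homopolymer run length; default 6

-M Maximum number of primer mismatches allowed; default 0

-o Output directory

-n Seq id to use for the first sequence [default: 1]

--retain_unassigned_reads Keep reads that could not be assigned to any sample; by default they are not kept.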

-w, --qual_score_window

Enable sliding window test of quality scores. If the average score of a continuous set of w nucleotides falls below the threshold (see -s for default), the sequence is discarded. A good value would be 50. 0 (zero) means no filtering. Must pass a .qual file (see -q parameter) if this functionality is enabled. Default behavior for this function is to truncate the sequence at the beginning of the poor quality window, and test for minimal length (-l parameter) of the resulting sequence. [default: 0]
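The sliding-window test described for -w can be sketched in a few lines of Python. This is an illustration of the idea only, not QIIME's own code; the function names `truncate_at_bad_window` and `passes_after_truncation` are made up for this example.

```python
from typing import List, Optional, Tuple


def truncate_at_bad_window(scores: List[int], window: int, min_avg: float) -> Optional[int]:
    """Return the start index of the first low-quality window, or None.

    Hypothetical helper: slides a window of `window` scores along the read
    and reports the first window whose mean score is below `min_avg`.
    """
    if window <= 0 or len(scores) < window:
        return None  # filtering disabled (0) or read shorter than one window
    window_sum = sum(scores[:window])
    for start in range(len(scores) - window + 1):
        if start > 0:
            # Slide by one position: add the newest score, drop the oldest.
            window_sum += scores[start + window - 1] - scores[start - 1]
        if window_sum / window < min_avg:
            return start
    return None


def passes_after_truncation(scores: List[int], window: int,
                            min_avg: float, min_len: int) -> Tuple[int, bool]:
    """Truncate at the bad window, then test the remaining length (as with -l)."""
    cut = truncate_at_bad_window(scores, window, min_avg)
    kept = len(scores) if cut is None else cut
    return kept, kept >= min_len
```

For a read with ten scores of 30 followed by ten scores of 10, a window of 5 and a threshold of 25, the first failing window starts at index 7, so only the first 7 bases would survive truncation.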

-g, --discard_bad_windows

If the qual_score_window option (-w) is enabled, this will override the default truncation behavior and discard any sequences where a bad window is found. [default: False]

-p, --disable_primers

Disable primer usage when demultiplexing. Should be enabled for unusual circumstances, such as analyzing Sanger sequence data generated with different primers. [default: False]

-z, --reverse_primers

Enable removal of the reverse primer and any subsequent sequence from the end of each read. To enable this, there has to be a “ReversePrimer” column in the mapping file. Primers are required to be in IUPAC format and written in the 5’ to 3’ direction. Valid options are ‘disable’, ‘truncate_only’, and ‘truncate_remove’. ‘truncate_only’ will remove the primer and subsequent sequence data from the output read and will not alter output of sequences where the primer cannot be found. ‘truncate_remove’ will flag sequences where the primer cannot be found to not be written and will record the quantity of such failed sequences in the log file. [default: disable]

--reverse_primer_mismatches

Set number of allowed mismatches for reverse primers (option -z). [default: 0]

-d, --record_qual_scores

Enables recording of quality scores for all sequences that are recorded. If this option is enabled, a file named seqs_filtered.qual will be created in the output directory, and will contain the same sequence IDs in the seqs.fna file and sequence quality scores matching the bases present in the seqs.fna file. [default: False]

-i, --median_length_filtering

Disables minimum and maximum sequence length filtering, and instead calculates the median sequence length and filters the sequences based upon the number of median absolute deviations specified by this parameter. Any sequences with lengths outside the number of deviations will be removed. [default: None]
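As a rough illustration of what filtering by median absolute deviation (MAD) means, the sketch below keeps only lengths within a given number of MADs of the median. It is an assumption about the general technique, not a copy of QIIME's implementation, and `median_length_filter` is a hypothetical name.

```python
from statistics import median


def median_length_filter(lengths, num_deviations):
    """Split lengths into (kept, removed) using a MAD-based cutoff.

    Hypothetical helper: a length is kept when it lies within
    `num_deviations` median absolute deviations of the median length.
    """
    med = median(lengths)
    mad = median(abs(n - med) for n in lengths)
    low = med - num_deviations * mad
    high = med + num_deviations * mad
    kept = [n for n in lengths if low <= n <= high]
    removed = [n for n in lengths if not (low <= n <= high)]
    return kept, removed
```

With lengths 250, 252, 248, 251, 249, 120 and 400, the median is 250 and the MAD is 2, so a cutoff of 3 deviations keeps lengths between 244 and 256 and removes 120 and 400.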

-j, --added_demultiplex_field

Use -j to add a field to use in the mapping file as an additional demultiplexing option to the barcode. All combinations of barcodes and the values in these fields must be unique. The fields must contain values that can be parsed from the fasta labels such as “plate=R_2008_12_09”. In this case, “plate” would be the column header and “R_2008_12_09” would be the field data (minus quotes) in the mapping file. To use the run prefix from the fasta label, such as “>FLP3FBN01ELBSX”, where “FLP3FBN01” is generated from the run ID, use “-j run_prefix” and set the run prefix to be used as the data under the column header “run_prefix”. [default: None]
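To make the "parsed from the fasta labels" part concrete, here is a small sketch of how key=value fields and a run prefix could be pulled out of a label. Both helpers are hypothetical, and the 9-character prefix length is only inferred from the “FLP3FBN01” example above; QIIME's own parsing may differ.

```python
def parse_label_fields(label):
    """Collect key=value tokens from a FASTA label line into a dict."""
    fields = {}
    for token in label.lstrip(">").split():
        if "=" in token:
            key, value = token.split("=", 1)
            fields[key] = value
    return fields


def run_prefix(label, prefix_length=9):
    """Return the leading characters of the read ID (assumed run prefix)."""
    read_id = label.lstrip(">").split()[0]
    return read_id[:prefix_length]
```

For the label “>FLP3FBN01ELBSX plate=R_2008_12_09”, `parse_label_fields` gives a “plate” entry with the value “R_2008_12_09”, and `run_prefix` gives “FLP3FBN01”.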

-x, --truncate_ambi_bases

Enable to truncate at the first “N” character encountered in the sequences. This will disable testing for ambiguous bases (-a option) [default: False]
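The effect of truncating at the first “N” can be pictured with the short sketch below. It assumes the “N” itself is dropped along with everything after it, which the option text does not state explicitly; `truncate_at_first_n` is a hypothetical name.

```python
def truncate_at_first_n(seq):
    """Return the part of `seq` before the first N (case-insensitive)."""
    pos = seq.upper().find("N")
    return seq if pos == -1 else seq[:pos]
```

A read with no “N” comes back unchanged; the truncated read would then still have to pass the minimum length check.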

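Output files:

.fna file (seqs.fna): each sequence name contains the sample ID from the mapping file

histograms.txt: the number of sequences at each length

split_library_log.txt: summary of the quality filtering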
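Because each sequence name carries the sample ID, reads per sample can be counted directly from the .fna output. The sketch below assumes labels of the form “>SampleID_N ...” (sample ID, underscore, running number); check a few lines of your own file before relying on it. `reads_per_sample` is a hypothetical helper.

```python
from collections import Counter


def reads_per_sample(fasta_lines):
    """Count sequences per sample from FASTA label lines.

    Assumes labels look like '>SampleID_N ...', so the sample ID is
    everything before the last underscore of the first token.
    """
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            seq_id = line[1:].split()[0]
            sample_id = seq_id.rsplit("_", 1)[0]
            counts[sample_id] += 1
    return counts
```

An open file handle can be passed in place of the list of lines, for example the seqs.fna file in the output directory.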

1. If there are several samples, as long as their barcodes in the mapping file are different, the files can be passed together:

split_libraries.py -m Mapping_File.txt -f 1.TCA.454Reads.fna,2.TCA.454Reads.fna -q 1.TCA.454Reads.qual,2.TCA.454Reads.qual -o Split_Library_Output_comma_separated/

Alternatively, merge all the sequences into one file first and then process it.

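2. If the data come from two separate sequencing runs that share barcodes (and so the same sample IDs), running the script on each run independently numbers the sequences the same way in both runs, so reads from different runs end up with identical sequence IDs. Use -n on the second run to start numbering at a higher value: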

split_libraries.py -m Mapping_File.txt -f 1.TCA.454Reads.fna -q 1.TCA.454Reads.qual -o Split_Library_Run1_Output/

split_libraries.py -m Mapping_File.txt -f 2.TCA.454Reads.fna -q 2.TCA.454Reads.qual -o Split_Library_Run2_Output/ -n 2000000

cat Split_Library_Run1_Output/seqs.fna Split_Library_Run2_Output/seqs.fna > Combined_seqs.fna

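-n is followed by the starting sequence number; this value should be larger than the total number of sequences produced by the previous run.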
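Instead of guessing a large value such as 2000000, the next safe value for -n can be derived by counting the sequences already written by the first run. This is a minimal sketch; `next_start_index` is a hypothetical helper, and it assumes the first run numbered its sequences consecutively from `first_index`.

```python
def next_start_index(fasta_lines, first_index=1):
    """Return the smallest -n value that cannot collide with an earlier run.

    Counts label lines (starting with '>') and adds them to the index
    the earlier run started from.
    """
    n_seqs = sum(1 for line in fasta_lines if line.startswith(">"))
    return first_index + n_seqs
```

For a first run that wrote three sequences starting at 1, the function returns 4. In practice the lines would come from the first run's seqs.fna, for example by opening Split_Library_Run1_Output/seqs.fna and passing the file handle.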

References:

http:///scripts/split_libraries.html