How to Use Phrap

2011/10/02

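Phrap ("phragment assembly program", or "phil's revised assembly program") is part of the phred/phrap software package developed by Phil Green and Brent Ewing at the Department of Molecular Biotechnology, University of Washington. It is mainly used to assemble shotgun sequences.
Main features:
Allows use of the full-length read (not just the high-quality part)
Uses quality information during assembly to improve accuracy
Builds the contig sequence from the highest-quality parts of the reads
Provides extensive assembly information to help troubleshoot misassemblies and similar problems (including quality values for the contig sequence)
Can handle fairly large data sets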

Download
phrap can be requested directly from the author by email: phg@u.washington.edu

Installation
Upload the phrap archive to your local Linux/Unix compute server.

Unpack it:
gzip -d phrap.tar.gz
tar -xvf phrap.tar

Compile the source:
Type make at the command line:
$ make

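If the compiler does not recognize -O2, change the line CFLAGS= -O2 in the makefile to CFLAGS= -O, delete the *.o files, and recompile.
If the data set contains more than 64,000 sequences, or any sequence longer than 64,000 bp, you need phrap.manyreads or phrap.longreads instead. These two programs are built with: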
$ make manyreads
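
Before choosing a build, it can help to check whether a data set actually crosses either 64,000 limit. The following is a minimal Python sketch, not part of phrap: the function names are made up for illustration, and it assumes a plain FASTA file in which every record starts with a ">" line.

```python
LIMIT = 64_000  # threshold quoted in the text above


def fasta_stats(lines):
    """Return (number of reads, length of the longest read) for FASTA lines."""
    n_reads = 0
    longest = 0
    current = 0
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new record starts: close the previous one.
            longest = max(longest, current)
            n_reads += 1
            current = 0
        elif line:
            current += len(line)
    longest = max(longest, current)  # close the last record
    return n_reads, longest


def needs_special_build(n_reads, longest, limit=LIMIT):
    """True when the read count or the longest read exceeds the limit."""
    return n_reads > limit or longest > limit


# Small in-memory demo; for a real file, pass an open file handle instead.
demo = [">read1", "acgtacgt", "acg", ">read2", "ttga"]
n_reads, longest = fasta_stats(demo)
print(n_reads, "reads; longest read is", longest, "bp")
print("special build needed:", needs_special_build(n_reads, longest))
```

For a real data set, something like `fasta_stats(open("pp.seq.screen"))` would give the two numbers to compare against the limit.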
Usage
Command line:
phrap [sequence file] -new_ace > phrap.out
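
If you run phrap from a script rather than by hand, the same command line can be assembled in Python. This is only a sketch: `build_phrap_command` and `run_phrap` are hypothetical helper names, it assumes a `phrap` executable is on your PATH, and it uses only the -new_ace and -minmatch options listed in this post.

```python
import subprocess


def build_phrap_command(seq_file, minmatch=None, ace=True):
    """Build the argument list for: phrap [sequence file] -new_ace ..."""
    cmd = ["phrap", seq_file]
    if ace:
        cmd.append("-new_ace")
    if minmatch is not None:
        cmd += ["-minmatch", str(minmatch)]
    return cmd


def run_phrap(seq_file, out_file="phrap.out", **options):
    """Run phrap and save its standard output, like `> phrap.out` in the shell."""
    with open(out_file, "w") as out:
        subprocess.run(build_phrap_command(seq_file, **options),
                       stdout=out, check=True)


# Show the command without running it.
print(" ".join(build_phrap_command("pp.seq.screen")))
```

Calling `run_phrap("pp.seq.screen")` would then do the same as the shell command above, provided phrap is installed.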
Input
Nucleotide sequences in FASTA format, for example pp.seq.screen:
>10_A8-9.ab1
gtgctctggtctctgctcctttcccctaagcaatagtaggcagaatcaac
aaaaacaaccccttctcccctccctacctggggaacagagccaatgagac
aggctcaggaacagggcaccagcacctgcactcaccattcaatctcttta
ggctcacggtccttcagaagctcttgtacctcctgccgacagcgctcctg
gtattccgggtgctttgcaaggtggtacaggacccaggagagaccactgg
cccccataaaaagtcacagtacctctgagggctcttgagtctaatctgag
acagtctctgaagattcatcctctttccagaaacccaagcccatcttgct
ctcctagaaacctttctataaaaaaaaaaaaan
>11_A8-9_R.ab1
gggagaggcggagctctggtccttgtcatctaagctgtgtggattgatcg
cctagaacctccctatctaccctccctacctggggaacagagccaatgag
aaaggctcaggaacagggcaccagcacctgcactcaccattcaatctctt
taggctcacggtccttcagaagctcttgtacctcctgccgacagcgctcc
caacttcttcccatcttcatcctggagagaaggcaataaccccccacccc
cacccccataaaaagtcacagtacctctgagggctcttgagtctaatctg
agacagtctctgaagattcatcctctttccagaaacccaagcccatcttg
ctctccagaacccttcttaaa
>15_A8-9.ab1
aagactggcagnggatctctgcatctagtcacctaagctatagctggtag
actcgaccaaaacaaccctttctaccctccctacctggggaacagagcca
atgagacaggctcaggaacag
… …

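If you have a quality file, it must be in the same directory as the sequence file and be named [sequence file name].qual. For example, if the sequence file is pp.seq.screen, the quality file must be named pp.seq.screen.qual. The quality file is not given on the command line. Its records must correspond one-to-one with those in the sequence file, in the same order and with the same number of bases.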
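
Because a mismatch between the two files is easy to introduce, a quick consistency check before running phrap can save time. Here is a minimal Python sketch; the helper names are hypothetical, and it assumes the quality file is FASTA-like, with a ">" header per read followed by whitespace-separated integer quality values.

```python
def record_sizes(lines, count):
    """Return [(name, size)] in file order; `count` measures one data line."""
    records = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            fields = line[1:].split()
            records.append([fields[0] if fields else "", 0])
        elif line and records:
            records[-1][1] += count(line)
    return [(name, size) for name, size in records]


def check_pair(seq_lines, qual_lines):
    """Return a list of problems; an empty list means the files agree."""
    seqs = record_sizes(seq_lines, len)  # bases per read
    quals = record_sizes(qual_lines, lambda line: len(line.split()))
    problems = []
    if len(seqs) != len(quals):
        problems.append(f"{len(seqs)} sequences but {len(quals)} quality records")
    for i, ((sname, slen), (qname, qlen)) in enumerate(zip(seqs, quals), 1):
        if sname != qname:
            problems.append(f"record {i}: name {sname!r} vs {qname!r}")
        elif slen != qlen:
            problems.append(f"{sname}: {slen} bases but {qlen} quality values")
    return problems


# In-memory demo: the second read has 2 bases but only 1 quality value.
seq = [">r1", "acgt", ">r2", "ac"]
qual = [">r1", "20 20 15 30", ">r2", "10"]
print(check_pair(seq, qual))
```

With real files you would pass open handles, for example `check_pair(open("pp.seq.screen"), open("pp.seq.screen.qual"))`.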

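Output
Besides the screen output, phrap writes a set of files to the working directory:
*.contigs: the assembled contig sequences, in FASTA format. This includes single-read contigs (reads that have some alignment to other contigs but do not meet the criteria for being joined).
*.contigs.qual: the quality file for the contigs, in FASTA-style format, recording the base quality of every contig.
*.singlets: reads that have no overlap with any other read, in FASTA format.
*.log and *.problems: of little use to most users.
*.ace: produced only with -new_ace or -old_ace; needed to view the assembly in consed.
*.view: produced only with -view; needed to view the assembly in phrapview.
In addition, phrap prints to the screen. This output can be redirected to a file, for example with phrap > phrap.out, and describes the composition of each contig.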
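
To see at a glance which of these files a run produced, a short Python sketch like the one below can help. It assumes that each output name is the input file name with the suffix appended (for example pp.seq.screen.contigs); the helper names are hypothetical.

```python
import os

# Suffixes taken from the list above.
SUFFIXES = [".contigs", ".contigs.qual", ".singlets",
            ".log", ".problems", ".ace", ".view"]


def expected_outputs(seq_file):
    """Output file names, assuming phrap appends each suffix to the input name."""
    return [seq_file + suffix for suffix in SUFFIXES]


def existing_outputs(seq_file):
    """Subset of the expected output files that are present on disk."""
    return [path for path in expected_outputs(seq_file) if os.path.exists(path)]


for path in expected_outputs("pp.seq.screen"):
    print(path)
```

Remember that the .ace and .view files will only be present if the corresponding options were given.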

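Parameters
See the phrap documentation for the full parameter list.
Parameters and default values:
1. Alignment scoring
-penalty -2 Mismatch (substitution) penalty
-gap_init penalty-2 Gap initiation penalty (default is the -penalty value minus 2)
-gap_ext penalty-1 Gap extension penalty (default is the -penalty value minus 1)
-ins_gap_ext gap_ext Insertion gap extension penalty
-del_gap_ext gap_ext Deletion gap extension penalty
-matrix [None] Score matrix
-raw * Use raw Smith-Waterman scores only
2. Banded search
-minmatch 14 Minimum match length
-maxmatch 30 Maximum match length
-max_group_size 20 Group size limit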
-word_raw * Use raw rather than complexity-adjusted word length, in testing against minmatch (N.B. maxmatch always refer to raw lengths).
-bandwidth 14 1/2 band width for banded SWAT searches (full width is 2 times bandwidth + 1).
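3. Filtering of matches
-minscore 30 Minimum alignment score
-vector_bound 80 Number of potential vector bases at the start of each read
Special cases of -masklevel:
-masklevel 0 Report only the single highest-scoring alignment
-masklevel 100 Report any match whose domain is not completely contained within a higher-scoring match
-masklevel 101 Report all alignments
4. Input
-default_qual 15 Default base quality when no quality file is present
-subclone_delim . Delimiter character for subclone names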
-n_delim 1 Indicates which occurrence of the subclone delimiter character denotes the end of the subclone name
-group_delim _ Group name delimiter: Character used to indicate end of that part of the read name that corresponds to the group name (relevant only if option -preassemble is used);
-trim_start 0 Number of bases to remove from the start of each read
5. Assembly
-forcelevel 0 Relaxes stringency to varying degree during final contig merge pass.
-bypasslevel 1 Controls treatment of inconsistent reads in merge.
-maxgap 30 Maximum permitted size of an unmatched region in merging contigs, during first (most stringent) merging pass.
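-repeat_stringency .95 Controls the stringency of matches
-revise_greedy * Break the assembly at weak joins and try to reattach the pieces
-shatter_greedy * Break weak joins without trying to reattach the pieces
-preassemble * Assemble the reads within each group first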
-force_high * Causes edited high-quality discrepancies to be ignored during final contig merge pass.
6. Consensus sequence construction
-node_seg 8 Minimum segment size (for purposes of traversing weighted directed graph).
-node_space 4 Spacing between nodes (in weighted directed graph).
7. Output
-tags * Tag selected lines in the standard output, to facilitate parsing.
-screen * when the -old_ace or -new_ace option is specified (see below), this option causes parts of the read sequences that consist of phrap-inferred sequencing vector and chimeric segments to be replaced by X's in the .ace file.
-old_ace * Produce an ace file in the old format
-new_ace * Produce an ace file in the new format
-ace * Same as -new_ace
-view * Produce a ".view" file for use with phrapview
-qual_show 20 Cutoff for flagging "low_quality" regions in contig sequence and "high quality" discrepancies between read and contig.
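-print_extraneous_matches * Print information on non-local alignments between contigs
8. Miscellaneous
-retain_duplicates * Keep exact duplicate sequences instead of discarding them
-max_subclone_size 5000 Maximum subclone size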
-trim_penalty -2 Penalty used for identifying degenerate sequence at beginning & end of read.
-trim_score 20 Minimum score for identifying degenerate sequence at beginning & end of read.
-trim_qual 13 Quality value that defines the high-quality part of a read
-confirm_length 8 Minimum size of confirming segment.
-confirm_trim 1 Amount by which confirming segments are trimmed at edges.
-confirm_penalty -5 Penalty used in aligning against "confirming" reads.
-confirm_score 30 Minimum alignment score for a read to be allowed to "confirm" part of another read.
-indexwordsize 10 Size of indexing (hashing) words, used in finding word matches between sequences.
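
Several of the defaults above (-default_qual 15, -trim_qual 13, -qual_show 20) are base quality values. Assuming these follow the usual phred scale, where Q = -10 * log10(error probability), the small Python sketch below converts them into error probabilities so the thresholds are easier to interpret. The function name is only illustrative.

```python
def error_probability(q):
    """Phred-scale quality -> probability that the base call is wrong."""
    return 10 ** (-q / 10)


for name, q in [("-trim_qual", 13), ("-default_qual", 15), ("-qual_show", 20)]:
    print(f"{name} {q}: error probability about {error_probability(q):.4f}")
```

On this scale a quality of 20 corresponds to roughly one error in 100 bases, and a quality of 10 to one in 10.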

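Runtime problems
Out of memory:
This is indicated when the program terminates early with the following error message:
FATAL ERROR: REQUESTED MEMORY UNAVAILABLE
Long run times:
Try increasing the value of -minmatch.
Notes on using phrap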

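Data volume and data characteristics
As a rule, keep the number of reads below 150,000.
If coverage is not very high and there are few repeats, phrap can assemble up to about 500,000 reads.
If coverage is very high (several tens-fold or more) or there are many repeats, phrap has great difficulty.

Assembly strategies for special data
For non-finishing projects with repeats, you can use read depth statistics to discard highly repetitive reads and assemble only the reads from unique regions (the RePS approach).
The same strategy can be used in difficult finishing projects to ensure correctness, with the gaps then filled by other methods.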
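
To make the idea of depth-based filtering concrete, here is a deliberately simplified Python sketch. It is not the RePS implementation: it just counts in how many reads each k-mer occurs and drops reads that consist mostly of high-depth k-mers. The function names, the values of k, max_depth and max_fraction are all assumptions chosen for illustration, and reverse complements are ignored.

```python
from collections import Counter


def kmers(seq, k):
    """All overlapping k-mers of a sequence, lower-cased."""
    seq = seq.lower()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def filter_repetitive_reads(reads, k=20, max_depth=50, max_fraction=0.5):
    """reads: dict of name -> sequence. Returns the names of the reads kept."""
    depth = Counter()
    for seq in reads.values():
        depth.update(set(kmers(seq, k)))  # count each k-mer once per read
    kept = []
    for name, seq in reads.items():
        words = kmers(seq, k)
        if not words:  # read shorter than k: nothing to judge, keep it
            kept.append(name)
            continue
        repeated = sum(1 for w in words if depth[w] > max_depth)
        if repeated / len(words) <= max_fraction:
            kept.append(name)
    return kept


# Toy demo with tiny parameters: three identical "repeat" reads, one unique read.
reads = {"u1": "acgtacgg", "r1": "ttttttttt", "r2": "ttttttttt", "r3": "ttttttttt"}
print(filter_repetitive_reads(reads, k=4, max_depth=2))
```

The reads that survive the filter would then be written back to a FASTA file and assembled with phrap as usual.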

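The phrap.out file records how the reads were assembled into contigs, including their positions and orientations. This information can be extracted and saved to a contig.list file.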
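
The exact layout of phrap.out is not shown in this post, so the sketch below takes a different route to the same information: it reads the .ace file written with -new_ace, whose CO (contig) and AF (read placement) lines record each read's contig, orientation (U or C) and padded start position. The tab-separated contig.list layout and the function names are assumptions made for illustration.

```python
def contig_list_from_ace(lines):
    """Return [(contig, read, orientation, start)] from new-format ACE lines."""
    rows = []
    contig = None
    for line in lines:
        fields = line.split()
        if len(fields) == 6 and fields[0] == "CO":
            # CO <contig name> <bases> <reads> <segments> <U or C>
            contig = fields[1]
        elif len(fields) == 4 and fields[0] == "AF" and contig is not None:
            # AF <read name> <U or C> <padded start position in the contig>
            rows.append((contig, fields[1], fields[2], int(fields[3])))
    return rows


def format_contig_list(rows):
    """One tab-separated line per read."""
    return "\n".join("\t".join(str(value) for value in row) for row in rows)


# Tiny hand-made ACE fragment for demonstration only.
ace = [
    "AS 1 2",
    "",
    "CO Contig1 12 2 1 U",
    "acgtacgtacgt",
    "",
    "AF 10_A8-9.ab1 U 1",
    "AF 11_A8-9_R.ab1 C -3",
    "BS 1 12 10_A8-9.ab1",
]
print(format_contig_list(contig_list_from_ace(ace)))
```

For a real assembly, pass an open handle to the .ace file and write the formatted text to contig.list.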
