Velvet and Oases For Transcriptome Assembly(基于转录组进行拼接)

评论4,736

首先要有Linux环境的服务器或PC,推荐最低需求:
64bit Linux System (Rethat, CentOS, Fedora), 4 core CPU processors, 2.26GHz, basically 16GB RAM, 20GB HD storage space.
Environment: GCC/G++ 4.3.5, PERL v5.8, Python 2.6, JAVA 6(JDK/JRE)

选择软件有:

  1. Velvet+Oases;
  2. ABySS+TransABySS;
  3. Newbler(cDNA assembler)+iAssembler(CAP3)

前两个主要用于Illumina和ABi SOLiD的短序列RNA-seq组装,第三个用于454较长一些的序列(cDNA)组装,后面我们只介绍第一种。

下载,安装:

http://www.ebi.ac.uk/~zerbino/velvet/

建议下载用git(http://git.or.cz/ )
下载方法:安装好git后

获取velvet最新版本:(以下所有your-path是你要安装软件存放的目录)

$mkdir -p /your-path/velvet

$cd /your-path/velvet

$git clone git://github.com/dzerbino/velvet.git

获取oases最新版本:

$cd /your-path/velvet

$git clone git://github.com/dzerbino/oases.git

编译velvet:

$cd /your-path/velvet/velvet

$./update_velvet.sh

如果需要用大的kmer值(默认是31,这里假设需要75,根据需求,一般kmer值决定组装片段大小,但也受内存限制,16GB节点建议是31kmer以下,72GB建议45kmer以下,512GB建议75kmer),则可以按下面方法编译:
$make MAXKMERLENGTH=75

编译oases:

$cd /your-path/velvet/oases/

$make ‘VELVET_DIR=/your-path/velvet/velvet’

如果需要所有用户可使用,即在有root权限时,运行以下命令即可:

$cd /usr/local/bin

$ln -s /your-path/velvet/velvet/velveth && ln -s /your-path/velvet/velvet/velvetg && ln -s /your-path/velvet/oases/oases

$chmod 555 velveth velvetg oases
就可以了,如果没有有两种办法运行:

1). 每次运行用绝对路径运行,如:

$/your-path/velvet/velvet/velveth

$/your-path/velvet/velvet/velvetg

$/your-path/velvet/oases/oases

2). 用编辑~/.bashrc,写入以下三行:

alias velveth=”/your-path/velvet/velvet/velveth”

alias velvetg=”/your-path/velvet/velvet/velvetg”

alias oases=”/your-path/velvet/oases/oases”
安装就好了,运行方法可以直接运行velveth, velvetg, oases就看到相关参数。

(第一步)hashing
Velveth用法:

velveth – simple hashing program
Version 1.1.04
Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 2
MAXKMERLENGTH = 75
Usage:
./velveth directory hash_length {[-file_format][-read_type] filename1 [filename2 ...]} {…} [options]
directory : directory name for output files
hash_length : EITHER an odd integer (if even, it will be decremented) <= 75 (if above, will be reduced)
: OR: m,M,s where m and M are odd integers (if not, they will be decremented) with m < M ILLUMINA-57021F:8:1:1106:3817#0/1
AGATATTTGAGATTTATGGATTAGGAAAGGAGGTTGTTGACCTAGCATTTTGTGGGAGTG
>ILLUMINA-57021F:8:1:1106:3817#0/2
AATTTTTGACACACTCCCACAAAATGCTAGGTCAACAACCTCCTTTCCTAATCCATAAAT
>ILLUMINA-57021F:8:1:1123:6213#0/1 score: 0 left_trim: 0
GACTGTCCCTGTTAATCATTACTCCGATCCCGAAGGCCAACAGAATAGGACCGAAATCCT
>ILLUMINA-57021F:8:1:1123:6213#0/2 score: 2 left_trim: 0
AGCATGGGATAACATCATAGGATTTCGGGCCTATTCTGTTGGCCTTCGGGATCGGAGTAA

这个在velvet的安装目录里有两个脚本可供做这样事情,shuffleSequences_fastq.pl, shuffleSequences_fasta.pl;
如果是mate pairs序列,由于方向是 RF; 而不是—>|<— FR,所以mate pairs的序列全部要反转互补一下。

(5)如果RNASeq是strand specific测序的话,可以选择-strand_specific参数;

输出:
你所指定的一个文件夹名称(例子中的Assem),这个文件夹请勿改动,要用于velvetg和oases使用。

其余可参考velvet的说明手册:http://www.ebi.ac.uk/~zerbino/velvet/Manual.pdf

注:最新版的velvet1.1.04 支持多线程跑,如果内存足够,又不知道设置多大kmer合适,可以考虑用MetaVelveth(velvet/contrib/MetaVelvet-v0.3.1),或设置kmer值时,用

$velveth output_directory/ m,M,s (..data files..)

m<=k<=m, s是步长;如31,45,2;就从31,33,35,…,45之间顺次设置

Velvet还有很多第三方程序和脚本,用于优化组装,见安装velvet中的velvet/contrib目录,有兴趣可对手册参照使用。

(第二步)de Bruijn graph assembly
Velvetg用法

velvetg – de Bruijn graph construction, error removal and repeat resolution
Version 1.1.04
Copyright 2007, 2008 Daniel Zerbino (zerbino@ebi.ac.uk)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 2
MAXKMERLENGTH = 75
Usage:
./velvetg directory [options]
directory : working directory name
Standard options:
-cov_cutoff : removal of low coverage nodes AFTER tour bus or allow the system to infer it
(default: no removal)
-ins_length : expected distance between two paired end reads (default: no read pairing)
-read_trkg : tracking of short read positions in assembly (default: no tracking)
-min_contig_lgth : minimum contig length exported to contigs.fa file (default: hash length * 2)
-amos_file : export assembly to AMOS file (default: no export)
-exp_cov : expected coverage of unique regions or allow the system to infer it
(default: no long or paired-end read resolution)
-long_cov_cutoff : removal of nodes with low long-read coverage AFTER tour bus
(default: no removal)
Advanced options:
-ins_length2 : expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)
-ins_length_long : expected distance between two long paired-end reads (default: no read pairing)
-ins_length*_sd : est. standard deviation of respective dataset (default: 10% of corresponding length)
[replace '*' by nothing, '2' or '_long' as necessary]
-scaffolding : scaffolding of contigs used paired end information (default: on)
-max_branch_length : maximum length in base pair of bubble (default: 100)
-max_divergence : maximum divergence rate between two branches in a bubble (default: 0.2)
-max_gap_count : maximum number of gaps allowed in the alignment of the two branches of a bubble (default: 3)
-min_pair_count : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 5)
-max_coverage : removal of high coverage nodes AFTER tour bus (default: no removal)
-long_mult_cutoff : minimum number of long reads required to merge contigs (default: 2)
-unused_reads : export unused reads in UnusedReads.fa file (default: no)
-alignments : export a summary of contig alignment to the reference sequences (default: no)
-exportFiltered : export the long nodes which were eliminated by the coverage filters (default: no)
-clean : remove all the intermediary files which are useless for recalculation (default : no)
-very_clean : remove all the intermediary files (no recalculation possible) (default: no)
-paired_exp_fraction : remove all the paired end connections which less than the specified fraction of the expected count (default: 0.1)
-shortMatePaired* : for mate-pair libraries, indicate that the library might be contaminated with paired-end reads (default no)
Output:
directory/contigs.fa : fasta file of contigs longer than twice hash length
directory/stats.txt : stats file (tab-spaced) useful for determining appropriate coverage cutoff
directory/LastGraph : special formatted file with all the information on the final graph
directory/velvet_asm.afg : (if requested) AMOS compatible assembly file

如上:velvetg有很多参数,对于RNA-seq中有用的有以下几个
输入:velveth运行指定的那个目录如例子中的Assem

参数选择:

  1.  cov_cutoff 最低覆盖度,去除低频度序列,用于纠错;
  2. ins_length 插入片段长度,如500bp insert,则设置500,如果在velveth输入有多个插入片段的libraries,可以通过-ins_length2和-ins_length_long设置;
  3. ins_length_sd 插入片段分布的标准差值,通过比对或建库信息获取,一般500bp以内的在5~30以内;
  4. exp_cov 期望覆盖度,如果不确定可以设置为auto,这里设置的是浮点数,如果准确设置对组装有所贡献;
  5. min_pair_count 用于搭建scaffolds(设置了-scaffolding yes)最少的pairs数,如设置3,即如果有3对pairs搭上相同的两个contigs(方向插入片段正确),则可认为这两个contigs可以构建为scaffolds;
  6. max_coverage 最大覆盖度,用于去除高频序列,主要是低复杂度的重复序列,或polyA和adapters序列。

其他参数参考手册。

输出:contigs.fa这个即为组装的contigs序列,没有转录本信息。

(第三步)transcriptome assembling
oases – De novo transcriptome assembler for the Velvet package
Version 0.1.18
Copyright 2009,2010 Daniel Zerbino (dzerbino@soe.ucsc.edu)
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compilation settings:
CATEGORIES = 20
MAXKMERLENGTH = 75
Usage:
./oases directory [options]
directory : working directory name
Standard options:
-ins_length2 : expected distance between two paired-end reads in the second short-read dataset (default: no read pairing)
-ins_length_long : expected distance between two long paired-end reads (default: no read pairing)
-ins_length*_sd : est. standard deviation of respective dataset (default: 10% of corresponding length)
[replace '*' by nothing, '2' or '_long' as necessary]
-unused_reads : export unused reads in UnusedReads.fa file (default: no)
-amos_file : export assembly to AMOS file (default: no export)
-alignments : export a summary of contig alignment to the reference sequences (default: no)
–help : this help message
Advanced options:
-cov_cutoff : removal of low coverage nodes AFTER tour bus or allow the system to infer it (default: 3)
-min_pair_count : minimum number of paired end connections to justify the scaffolding of two long contigs (default: 4)
-min_trans_lgth : Minimum length of output transcripts (default: hash-length)
-paired_cutoff : minimum ratio allowed between the numbers of observed and estimated connecting read pairs
Must be part of the open interval ]0,1[ (default: 0.1)
-scaffolding :Allow gaps in transcripts (default: yes)
-degree_cutoff : Maximum allowed degree on either end of a contigg to consider it ‘unique’ (default: 3)
Output:
directory/transcripts.fa
directory/splicing_events.txt
directory/contig-ordering.txt

Oases是最后一步用于组装transcripts,并报告splicing events(splicing isoforms)
输入:velveth运行指定的那个目录如例子中的Assem
参数选择:跟velvetg过程相似
输出:
1) transcripts.fa
FASTA格式,如下:
>Locus_1_Transcript_1/10_Confidence_0.111
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCTT
TCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGGCTGGAGATCGG
AAGAGCGGTTCAGCAGGAA
>Locus_1_Transcript_2/10_Confidence_0.222
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCTT
TCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGGAAGATAATAAA
TTTGAAAGCCTTAACTTTAATATGTTGAACTTGGAATCTAAGATGGCTGAAAATGTGGGC
AAAGGTTCTTTCAGAAGTGGAGTTTACAATATGAATGCAGTGTACCAGAAAACCAACTTA
AATGGTATTGGAAATCTTACTGGTAGCAAGTATAGCGGTAACAATCAGAATACCAAAGAC
CCCAACAACAACAGCAATAACAACATTAACAACATTGAGAACAGTGCAAATAATGCTGTT
GACAAAAGGTTCAAGACTCTGCCTGCAACAGAGACGCTTCCACGGAATGAGGTTCTTGGA
GGATACATCTTTGTCTGTAACAATGACACTATGCAGGAAGATTTGAAGCGCCAGCTATTT
GGTTTACCTCCAAGATATAGAGATTCTGTTCGGGCAATAACACCTGGCTTACCTCTCTTT
CTTTACAACTACACAACTCACCAGTTGCATGGTATATTTGAGGCAGCTAGTTTTGGGGGT
TCTAACATTGATCCTACTGCTTGGGAAGACAAAAAATGTAAAGGCGAGTCGAGACTGGAG
ATCGGAAGAGCGGTTCAGC

其中Locus_1是指gene locus;
1/10指这个基因locus有10个转录本;
Confidence是指置信度,一般转录本越多,置信度越低,具体可参考:
What do the confidence scores mean?

The confidence scores assigned by Oases are a heuristic measure that expresses the uniqueness of a transcript in a locus. In the case of loci with only one or two possible transcripts, they/it are assigned a confidence of 1. In the case of more complex loci, if a transcript of a locus shares the majority of the contigs associated with a locus than the Confidence of the transcript is high (close to 1), whereas a Confidence Score close to 0 indicates that only few of the contigs of a locus are part of the transcript.

2) splicing_events.txt

>Locus_1_Node_30005
CACAGTTGAGCAGATAAAAAAGGGATTTGCAAGATGGAGAGCATCCAAAGCTTTTGGCAA
TTGGGTGATGAGCTCCGAGGACAATCAAAAGTCTCAGAGGATCATAAGTGGTTAATGGCT
GCCTCGAAATTGGCTGAGCAGACAAGGTCAAAG

>Locus_1_Node_30006
TGCCTCGAAATTGGCTGAGCAGACAAGGTCAAAGGGGGAACGCATGAATAATCTTGATCT
TTCAAAAGGTCTGTCAGAATTAAGGCCAAGGGAGAAGATTGGGTTTCAGG

报告每个gene Locus中由哪些contigs组成,用 AStalavista nomenclature*(参考文献Sammeth M, Foissac S, Guigó R. A general definition and nomenclature for alternative splicing events. PLoS Comput Biol. 2008 Aug 8;4(8):e1000147. )方式来描述alternative splicing

3) contig-ordering.txt

>Locus_1_Node_120550
ATCGGAAGAGCGGTTCAG
>Locus_1_Node_120121
ACTGGAG
>Locus_1_Transcript_1/10_Confidence_0.111
30005:153-(0)->30006:229-(0)->129011:259
>Locus_1_Transcript_2/10_Confidence_0.222
30005:153-(0)->30006:229-(0)->1:767-(0)->92621:768-(0)->92622:773-(0)->94084:799

一种混合的FASTA格式,用来报告contigs的线性关系:
$contig_id:$cumulative_length-($distance_to_next_contig)->
在这里,累计长度是描述转录本从5′端向3′端的组装起来contigs的总长度,从这个结果可以找到在转录本序列来自于哪些contigs。

更多请参考:http://www.ebi.ac.uk/~zerbino/oases/Manual.txt

由于版本不兼容问题,如果不急于得到结果,或者不是必须使用多线程,建议安装旧的版本:

http://www.ebi.ac.uk/~zerbino/velvet/velvet_1.0.18.tgz

http://www.ebi.ac.uk/~zerbino/oases/oases_0.1.18.tgz

另外也可以参考Adrian Reich的README, Velvet and Oases (transcriptome): https://sites.google.com/a/brown.edu/bioinformatics-in-biomed/velvet-and-oases-transcriptome

发表评论

匿名网友