Installing and Using PASA

2014/01/26

1. Introduction to PASA

PASA, acronym for Program to Assemble Spliced Alignments, is a eukaryotic genome annotation tool that exploits spliced alignments of expressed transcript sequences to automatically model gene structures, and to maintain gene structure annotation consistent with the most recently available experimental sequence data. PASA also identifies and classifies all splicing variations supported by the transcript alignments.

Note:
Combine genome and Trinity de novo RNA-Seq assemblies to generate a comprehensive transcript database.

2. Preparing to use PASA

2.1 Preparing the MySQL database

Create two users: one with read-only privileges and one with all privileges.

mysql> GRANT SELECT ON *.* TO 'pasa'@'%' IDENTIFIED BY '123456';
mysql> GRANT ALL ON *.* TO 'chenlianfu'@'%' IDENTIFIED BY '123456';
mysql> FLUSH PRIVILEGES;
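
The GRANT ... IDENTIFIED BY form used above is not accepted by MySQL 8.0 and later, where the user has to be created first. A minimal sketch for such servers follows; the helper name pasa_user_sql is hypothetical, and it assumes the user name and password contain no single quotes. It only prints the SQL, so the statements can be reviewed before being piped into the mysql client.

```shell
# Hypothetical helper: print SQL that creates one MySQL user and grants it
# a privilege level on all databases (SELECT for read-only, ALL for full).
# Assumption: the user name and password contain no single quotes.
pasa_user_sql() {
    # usage: pasa_user_sql USER PASSWORD PRIVILEGE
    cat <<EOF
CREATE USER IF NOT EXISTS '$1'@'%' IDENTIFIED BY '$2';
GRANT $3 ON *.* TO '$1'@'%';
EOF
}
```

For example, `pasa_user_sql pasa 'a-better-password' SELECT | mysql -u root -p` would create the read-only user, and the same call with ALL would create the read-write user.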

2.2 Installing Perl modules

# cpan
cpan[1]> install DBD::mysql
cpan[1]> install GD
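
Before moving on, it can help to confirm that both modules actually load. The sketch below is one way to check; the function names have_perl_module and report_perl_modules are made up for this example, and it assumes the perl on PATH is the one PASA will use.

```shell
# Return success if the named Perl module can be loaded by the perl on PATH.
have_perl_module() {
    perl -M"$1" -e 1 >/dev/null 2>&1
}

# Print one "ok" or "missing" line per module name given as an argument.
report_perl_modules() {
    for module in "$@"; do
        if have_perl_module "$module"; then
            echo "ok      $module"
        else
            echo "missing $module"
        fi
    done
}
```

Running `report_perl_modules DBD::mysql GD` prints one status line for each of the two modules.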

2.3 Installing GMAP

$ wget http://research-pub.gene.com/gmap/src/gmap-gsnap-2013-03-31.v5.tar.gz
$ tar zxvf gmap-gsnap-2013-03-31.v5.tar.gz
$ cd gmap-2013-03-31
$ ./configure --prefix=$PWD
$ make -j 8
$ make install

2.4 Installing BLAT

$ wget http://hgwdev.cse.ucsc.edu/~kent/src/blatSrc35.zip
$ unzip blatSrc35.zip
$ cd blatSrc
$ MACHTYPE=x86_64
$ export MACHTYPE
$ mkdir -p ~/bin/x86_64
$ make -j 8
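
The BLAT build appears to place its binaries in ~/bin/$MACHTYPE, which is why that directory is created before running make. Note that bash presets MACHTYPE to a longer value such as x86_64-pc-linux-gnu, so it has to be overwritten with the short form. A small sketch, with the hypothetical helper name blat_bin_dir, that derives the short machine type from `uname -m` and prints the expected install directory:

```shell
# Print the directory the BLAT build is assumed to install into:
# $HOME/bin/<machine type>. The machine type is the first argument,
# or the output of `uname -m` (for example x86_64) when none is given.
blat_bin_dir() {
    printf '%s/bin/%s\n' "$HOME" "${1:-$(uname -m)}"
}
```

A possible use is `MACHTYPE=$(uname -m); export MACHTYPE; mkdir -p "$(blat_bin_dir)"` before the build and `PATH="$(blat_bin_dir):$PATH"` afterwards, so that blat can be found on PATH.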

2.5 Installing FASTA

$ wget http://faculty.virginia.edu/wrpearson/fasta/fasta3/CURRENT.tar.gz
$ tar zxvf CURRENT.tar.gz
$ cd fasta-35.4.12
$ cd src
$ make -f ../make/Makefile.linux_sse2 all
$ cd ..
$ ln -s $PWD/bin/fasta35 ~/bin/fasta

2.6 Installing PASA

$ wget http://kaz.dl.sourceforge.net/project/pasa/PASA2-r20130425beta.tgz
$ tar zxvf PASA2-r20130425beta.tgz
$ cd PASA2-r20130425beta/
$ make -j 8

2.7 Installing GD

libgd must be installed before GD:

$ wget https://bitbucket.org/libgd/gd-libgd/get/93368566388c.zip
$ unzip 93368566388c.zip
$ cd libgd-gd-libgd-93368566388c
$ ./bootstrap.sh
$ ./configure
$ make -j 8
$ sudo make install
$ gdlib-config

Then install GD:

$ wget http://search.cpan.org/CPAN/authors/id/L/LD/LDS/GD-2.49.tar.gz
$ tar zxvf GD-2.49.tar.gz
$ cd GD-2.49
$ perl Makefile.PL
$ make -j 8
$ sudo make install

GD is installed so that PASA results can be viewed through a web page.

2.8 Configuring PASA

2.8.1. Edit the PASA configuration file $PASAHOME/pasa_conf/conf.txt

$ cp $PASAHOME/pasa_conf/pasa.CONFIG.template $PASAHOME/pasa_conf/conf.txt
$ vim $PASAHOME/pasa_conf/conf.txt

2.8.2. Settings to change in this file:

PASA_ADMIN_EMAIL=(your email address)
MYSQLSERVER=(your mysql server name)   An IP address cannot be used here.
MYSQL_RO_USER=(mysql read-only username)
MYSQL_RO_PASSWORD=(mysql read-only password)
MYSQL_RW_USER=(mysql all privileges username)
MYSQL_RW_PASSWORD=(mysql all privileges password)
BASE_PASA_URL=http://server_name/pasa/cgi-bin/
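
Instead of editing each of these keys by hand in vim, the changes can be scripted. The sketch below is an illustration only: the function name set_conf_key is made up, and it assumes each setting sits on its own line in plain KEY=VALUE form with no trailing comment. It rewrites the line whose key matches and appends the key if it is absent.

```shell
# Hypothetical helper: set KEY=VALUE in a PASA-style configuration file.
# Assumption: settings are plain KEY=VALUE lines, one per line.
set_conf_key() {
    # usage: set_conf_key FILE KEY VALUE
    conf_file=$1
    tmp_file=$(mktemp) || return 1
    # Pass key and value through the environment so awk does not
    # interpret backslashes or slashes in them.
    CONF_KEY=$2 CONF_VALUE=$3 awk -F= '
        $1 == ENVIRON["CONF_KEY"] {
            print ENVIRON["CONF_KEY"] "=" ENVIRON["CONF_VALUE"]
            seen = 1
            next
        }
        { print }
        END { if (!seen) print ENVIRON["CONF_KEY"] "=" ENVIRON["CONF_VALUE"] }
    ' "$conf_file" > "$tmp_file" &&
        cat "$tmp_file" > "$conf_file"   # keeps the original file permissions
    status=$?
    rm -f "$tmp_file"
    return $status
}
```

For example, `set_conf_key "$PASAHOME/pasa_conf/conf.txt" MYSQL_RO_USER pasa` would set the read-only user name.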

2.8.3. Edit the httpd configuration file, then restart httpd:

# vim /etc/httpd/conf/httpd.conf
# /etc/init.d/httpd restart

Add the following lines to /etc/httpd/conf/httpd.conf:

ScriptAlias /pasa "$PASAHOME"
<Directory "$PASAHOME">
        Options MultiViews ExecCGI
        AllowOverride None
        Order allow,deny
        Allow from all
</Directory>
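
httpd does not expand a shell-style $PASAHOME written literally in its configuration, so the lines above presumably need the real installation path in its place. A minimal sketch that prints the same block with the path filled in; the function name pasa_httpd_snippet is hypothetical, and the path is assumed to contain no double quotes.

```shell
# Hypothetical helper: print the httpd block from this section with the
# PASA installation directory (first argument) substituted in.
pasa_httpd_snippet() {
    cat <<EOF
ScriptAlias /pasa "$1"
<Directory "$1">
        Options MultiViews ExecCGI
        AllowOverride None
        Order allow,deny
        Allow from all
</Directory>
EOF
}
```

Running `pasa_httpd_snippet "$PASAHOME"` prints a block that can be reviewed and then appended to httpd.conf as root.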

2.9 Cleaning the transcript sequences [optional, requires seqclean to be installed]

Download the two contaminant databases, which are FASTA files.

$ cd $PASAHOME/seqclean
$ tar zxf seqclean.tar.gz
$ cd seqclean
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec -O UniVec.fasta
$ wget ftp://ftp.ncbi.nih.gov/pub/UniVec/UniVec_Core -O UniVec_Core.fasta
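
With a vector database in place, the cleaning step itself can be sketched as below. The `seqclean <transcripts> -v <vectors>` form is how seqclean is commonly invoked for PASA; the wrapper name clean_transcripts and any file names passed to it are assumptions for illustration. The wrapper only adds a check that both inputs exist and are non-empty before seqclean is started.

```shell
# Hypothetical wrapper: verify both input files, then run seqclean on the
# transcripts using the given vector/contaminant FASTA file.
clean_transcripts() {
    # usage: clean_transcripts TRANSCRIPTS_FASTA VECTOR_FASTA
    for input in "$1" "$2"; do
        if [ ! -s "$input" ]; then
            echo "clean_transcripts: missing or empty file: $input" >&2
            return 1
        fi
    done
    seqclean "$1" -v "$2"
}
```

A call such as `clean_transcripts transcripts.fasta UniVec.fasta` is expected to leave the trimmed sequences and a .cln report next to the input file.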

UniVec_Core includes only oligonucleotides and vectors consisting of bacterial, phage, viral, yeast or synthetic sequences. Vectors that include sequences of mammalian origin are excluded.

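3. Using the main PASA program

The main PASA program is $PASAHOME/scripts/Launch_PASA_pipeline.pl. Its options are as follows:

* marks a required option

-c <filename> *
Alignment configuration file. You can copy $PASAHOME/pasa_conf/pasa.alignAssembly.Template.txt and change only MYSQLDB in it to the name of the MySQL database you want to use.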

####################

spliced alignment settings:
--ALIGNERS <string>
Alignment software. The available aligners are gmap and blat; both can be used together by passing 'gmap,blat'.

-N <int> default: 1
max number of top scoring alignments

--MAX_INTRON_LENGTH | -I <int> default: 100000
max intron length parameter passed to GMAP or BLAT

--IMPORT_CUSTOM_ALIGNMENTS_GFF3 <filename>
only using the alignments supplied in the corresponding GFF3 file.

--cufflinks_gtf <filename>
incorporate cufflinks-generated transcripts

####################

actions
-C
 flag, create MYSQL database
-R
 flag, run alignment/assembly pipeline.
-A
 compare to annotated genes.
--ALT_SPLICE
 flag, run alternative splicing analysis

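-R aligns the transcripts; -A compares against and updates an existing GFF3 annotation file. The two options cannot be used in the same run, and each one needs its own configuration file passed with -c.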

####################

input files

-g <filename> *
 genome sequence FASTA file

-t <filename> *
 transcript db

-f <filename>
 file containing a list of fl-cdna accessions.

--TDN <filename>
 file containing a list of accessions corresponding to Trinity
 (full) de novo assemblies (not genome-guided)

####################

polyAdenylation site identification ** highly recommended **
-T
 flag, transcript db was trimmed using the TGI seqclean tool.
-u <filename>
 value, transcript db containing untrimmed sequences (input to seqclean). A file with a .cln extension, generated by seqclean, should also exist.

####################

Jump-starting or prematurely terminating
-x
 flag, print cmds only, don't process anything. (useful to get indices for -s or -e opts below)
-s <int>
 pipeline index to start running at (avoid rerunning searches).
-e <int>
 pipeline index where to stop running, and do not execute this entry.

####################

Misc:
--TRANSDECODER
 flag, run transdecoder to identify candidate full-length coding
 transcripts
--CPU <int> default: 2
 multithreading
-d flag, Debug 
-h flag, print this option menu and quit
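
Putting the options above together, a first alignment and assembly run can be sketched as a small wrapper. The flags are the ones listed in this section; the function name run_pasa_alignment, the DRY_RUN variable, and the file names in the usage line are assumptions made for this example. With DRY_RUN=1 the wrapper only prints the command line so that it can be checked first.

```shell
# Hypothetical wrapper around Launch_PASA_pipeline.pl: create the MySQL
# database (-C) and run the alignment/assembly pipeline (-R) with both
# aligners. Assumes PASAHOME points at the PASA installation.
run_pasa_alignment() {
    # usage: run_pasa_alignment CONFIG GENOME_FASTA TRANSCRIPTS_FASTA [CPU]
    config=$1 genome=$2 transcripts=$3 cpu=${4:-2}
    set -- "$PASAHOME/scripts/Launch_PASA_pipeline.pl" \
        -c "$config" -C -R \
        -g "$genome" -t "$transcripts" \
        --ALIGNERS gmap,blat --CPU "$cpu"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"    # show the command without running it
    else
        "$@"
    fi
}
```

For example, `DRY_RUN=1 run_pasa_alignment alignAssembly.config genome.fasta transcripts.fasta 8` prints the full command; dropping DRY_RUN executes it.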