Installing and using splign, NCBI's splice-boundary alignment tool, on Linux


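splign is an NCBI tool that aligns cDNA to genomic sequence, which makes it easy to locate each exon of a cDNA. Installation on Windows is trivial: download the binary and run it. The Linux build depends on a few extra packages, so this post walks through installing and using splign on Linux (usage on Windows is the same as on Linux).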

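First download the build that matches your system. My machine runs 64-bit Ubuntu, so I downloaded the Linux x64 build. Decompress it with "gunzip splign", make it executable with "chmod +x splign", and try running "./splign". It will usually fail with "splign: error while loading shared libraries: libpcre.so.0: cannot open shared object file: No such file or directory". The shared library "libpcre.so.0" is missing, so download and install the pcre package (pcre-8.21).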
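Before installing anything, it can help to list every shared library the binary is missing, not just the first one the loader complains about. The sketch below wraps the standard "ldd" command in a small helper; the name missing_libs is hypothetical and used only for illustration.

```shell
# missing_libs BINARY: print the shared libraries BINARY needs but the
# dynamic loader cannot find. Prints nothing when every dependency resolves.
# (Hypothetical helper name, for illustration only.)
missing_libs() {
    ldd "$1" 2>/dev/null | grep "not found" || true
}

# Usage:
#   missing_libs ./splign
```

If the helper prints a line mentioning libpcre.so.0, the pcre installation described next is what resolves it.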

1. Before building pcre, check that "gcc" is installed; without it nothing will compile. On Ubuntu it can be installed directly:

sudo apt-get install gcc

2. Unpack pcre

gunzip pcre-8.21.tar.gz
tar xvf pcre-8.21.tar

3. Configure

cd pcre-8.21
./configure --prefix=/usr/local/pcre-8.21 --libdir=/usr/local/lib/pcre --includedir=/usr/local/include/pcre

4. Compile

make

If the shell reports that make cannot be found, it is not installed yet; install it with "sudo apt-get install make".

5. Install

sudo make install

6. Verify

ls /usr/local # should contain a pcre-8.21 directory
ls /usr/local/lib # should contain a pcre directory
ls /usr/local/include # should contain a pcre directory

7. Register the library with the loader cache

Append /usr/local/lib/pcre as the last line of /etc/ld.so.conf, then run ldconfig as root. If the shell cannot find ldconfig, locate it with "whereis":

whereis ldconfig
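Step 7 can also be scripted. The sketch below defines a hypothetical helper, add_lib_dir (not part of any package), that appends a directory to a loader configuration file only if that exact line is not already present, so running it twice is harmless. It assumes the library was installed to /usr/local/lib/pcre as configured in step 3, and that the commented commands are run as root.

```shell
# add_lib_dir CONF DIR: append DIR to CONF unless an identical line exists.
# (Hypothetical helper name, for illustration only.)
add_lib_dir() {
    grep -qxF "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# Run as root:
#   add_lib_dir /etc/ld.so.conf /usr/local/lib/pcre
#   ldconfig                     # rebuild the loader cache
#   ldconfig -p | grep libpcre   # the new library should now be listed
```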

With these in place splign is ready to run. The rest of this post covers how to use it.

1. For a single pair of sequences, run it directly:

./splign -query cDNA.fa -subj genome.fa

2. Batch processing of many sequences takes several steps. The example below uses Arabidopsis thaliana.

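(1) Obtain all cDNA sequences on Arabidopsis thaliana chromosome 1 in FASTA format and save them as "chr1_cDNA.fa". Download the chromosome 1 genomic sequence and save it as "chr1_genome.fa". Put both files in a separate directory named "chr1".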
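A quick sanity check on the downloads is to count the FASTA records in each file. This is a minimal sketch; the helper name count_fasta_records is hypothetical, and it simply counts header lines that begin with ">".

```shell
# count_fasta_records FILE: print the number of FASTA header lines (">...").
# (Hypothetical helper name, for illustration only.)
count_fasta_records() {
    grep -c '^>' "$1"
}

# Usage:
#   count_fasta_records chr1/chr1_cDNA.fa     # number of cDNA sequences
#   count_fasta_records chr1/chr1_genome.fa   # a single chromosome should give 1
```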

(2) Build the LDS index so that splign can look up your sequences quickly

./splign -mklds chr1

(3) Format the sequences to be aligned as BLAST databases

formatdb -pF -oT -i chr1/chr1_cDNA.fa
formatdb -pF -oT -i chr1/chr1_genome.fa

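(4) Run the alignment

You can use either "compart" (a separate download) or the "megablast" program from BLAST. Of the two commands below, the first uses compart and the second uses megablast; run only one of them.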

compart -qdb chr1/chr1_cDNA.fa -sdb chr1/chr1_genome.fa > chr1.cpt
megablast -i chr1/chr1_cDNA.fa -d chr1/chr1_genome.fa -F "m D;R" -D 3 | grep -v "^#" | sort -k 2,2 -k 1,1 -T temp_dir > chr1.hit
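The megablast pipeline strips comment lines and sorts the hits by the second column (subject) and then the first (query). If a hit file comes from somewhere else, "sort -c" can confirm it is already in that order before it is handed to splign. This is a sketch; the helper name hits_sorted is hypothetical, and it should be run under the same locale that produced the file.

```shell
# hits_sorted FILE: succeed if FILE is already ordered by column 2, then
# column 1, using the same keys as the sort in the megablast pipeline.
# (Hypothetical helper name, for illustration only.)
hits_sorted() {
    sort -c -k 2,2 -k 1,1 "$1" 2>/dev/null
}

# Usage:
#   hits_sorted chr1.hit && echo "sorted" || echo "not sorted"
```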

(5) Run splign on the alignment results, using the index built earlier

If you aligned with compart:

splign -ldsdir chr1 -comps chr1.cpt >chr1.splign

If you aligned with megablast:

splign -ldsdir chr1 -hits chr1.hit >chr1.splign

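chr1.splign lists each exon's position on the chromosome together with its splice boundaries. Part of the output is shown below; for more detail on splign, see its official documentation.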

+1  NM_099983.2     chr1  1  283  1     283   3631   3913   <exon>GT    M283
+1  NM_099983.2     chr1  1  281  284   564   3996   4276   AG<exon>GT  M281
+1  NM_099983.2     chr1  1  120  565   684   4486   4605   AG<exon>GT  M120
+1  NM_099983.2     chr1  1  390  685   1074  4706   5095   AG<exon>GT  M390
+1  NM_099983.2     chr1  1  153  1075  1227  5174   5326   AG<exon>GT  M153
+1  NM_099983.2     chr1  1  461  1228  1688  5439   5899   AG<exon>AA  M461
+2  NM_001197952.1  chr1  1  234  2434  2667  26543  26776  AG<exon>GT  M234
+2  NM_001197952.1  chr1  1  151  2668  2818  26862  27012  AG<exon>GT  M151

The fields are:

Field: Definition
1. Compartment (or model) ID: Numeric ID preceded by a plus or minus sign indicating query orientation.
2. Query: Query (cDNA) sequence identifier.
3. Subject: Subject (genomic) sequence identifier.
4. Identity: The number of matches divided by the length of the alignment, if the segment is an exon. Dash, if the segment is unaligned.
5. Length: The length of the alignment, if the segment is an exon. Dash, if the segment is unaligned.
6. Query start: Starting coordinate on the query.
7. Query stop: Ending coordinate on the query. When aligned in antisense, this coordinate is less than the query starting coordinate.
8. Subject start: Starting coordinate on the subject sequence, if the segment is an exon. Dash, if the segment is unaligned.
9. Subject stop: Ending coordinate on the subject sequence, if the segment is an exon. If the subject sequence is aligned in minus strand, this coordinate is less than the subject start. Dash, if the segment is unaligned.
10. Type: '<exon>' if the segment is an exon, or '<GAP>' if the segment was left unaligned. For exons, the acceptor is specified to the left of '<' and the donor is specified to the right of '>'. For unaligned segments, L-, M- or R- may precede 'GAP' to indicate location on the query.
11. Alignment transcript: Full details of the alignment, as a string composed of the characters 'M', 'R', 'I' and 'D', where each character corresponds to an elementary command (Match, Replace, Insert or Delete) needed to transform the query segment into the subject segment. The string is encoded with RLE.
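Given these fields, a short awk script can summarize the output per model. The sketch below counts exons and sums their aligned lengths (field 5) for each compartment ID and query. It rests on two assumptions: that the output file is tab-delimited, and that exon rows carry the literal text "<exon>" in field 10. The helper name exon_summary is hypothetical.

```shell
# exon_summary FILE: for each (compartment ID, query) pair, print the number
# of exon segments and their total aligned length, in order of first appearance.
# Assumes tab-delimited input with "<exon>" in field 10 for exon rows.
# (Hypothetical helper name, for illustration only.)
exon_summary() {
    awk -F'\t' '
        $10 ~ /<exon>/ {
            key = $1 "\t" $2
            if (!(key in n)) order[++k] = key   # remember first-seen order
            n[key]++                            # exon count
            len[key] += $5                      # total aligned length
        }
        END {
            for (i = 1; i <= k; i++)
                printf "%s\t%d\t%d\n", order[i], n[order[i]], len[order[i]]
        }
    ' "$1"
}

# Usage:
#   exon_summary chr1.splign
```

Each output line has four tab-separated columns: compartment ID, query, exon count, and total aligned length. Unaligned segments are skipped because their type field does not contain "<exon>".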

Original source: http://bioinformation.cn/?p=398
