Finding Repeat Sequences in a Genome

2013/10/04

1.1 Classification of repeats

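Repeats in a genome fall into two classes according to their sequence features: tandem repeats, and interspersed repeats that are scattered across the genome. The second class consists mainly of transposable elements (TEs).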

The first class, tandem repeats, includes microsatellites, also called simple sequence repeats (repeat unit of 1-6 bases), and minisatellites (a longer repeat unit of 10-60 bases).
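To make the 1-6 base repeat unit concrete, here is a minimal sketch that finds perfect microsatellites with a regular expression. The function name `find_microsatellites` and the `min_copies` threshold are assumptions made for illustration. Dedicated tools such as TRF also tolerate mismatches and indels, which this sketch does not.

```python
import re

def find_microsatellites(seq, min_unit=1, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for perfect tandem repeats in seq.

    Only exact copies are detected; the thresholds are illustrative
    assumptions, not values taken from any published tool.
    """
    # The lazy quantifier prefers the shortest repeating unit,
    # so "ACACACAC" is reported as (AC)x4 rather than (ACAC)x2.
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for m in pattern.finditer(seq.upper()):
        unit = m.group(1)
        copies = len(m.group(0)) // len(unit)
        hits.append((m.start(), unit, copies))
    return hits

print(find_microsatellites("TTGACACACACAGG"))
```

Start positions are 0-based. Raising `max_unit` to 60 would extend the same idea to minisatellite-sized units, at the cost of slower matching.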

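TEs come in two types: class-I TEs transpose through an RNA-mediated "copy and paste" mechanism, while class-II TEs transpose through a DNA-mediated "cut and paste" mechanism. The former are called retroelements, the latter DNA transposons.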

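Class-I TEs consist mainly of LTR (long terminal repeat) elements, part of whose sequence may be protein-coding. The remaining class-I elements, the non-LTR elements, comprise two subclasses: LINEs (long interspersed nuclear elements) and SINEs (short interspersed nuclear elements). The former may be protein-coding; the latter are not.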

Class-II TEs also include a subclass called MITEs (miniature inverted-repeat transposable elements). These are DNA-based transposable elements that nevertheless transpose through a "copy and paste" mechanism (Wicker et al., 2007).

1.2 Software for identifying repeats

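Different kinds of repeats call for different software. Identification methods fall into three groups: library-based, signature-based (relying on the characteristic structure of a repeat), and de novo prediction. The literature lists many programs: http://www.nature.com/hdy/journal/v104/n6/fig_tab/hdy2009165t1.html#figure-title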

1.2.1 Library-based software

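Library-based programs require a library that contains consensus sequences of repeats from many different species, and they identify repeats by similarity search against that library. This approach can identify every kind of repeat. The classic and most popular program is RepeatMasker. CENSOR and MASKERAID can be used to improve RepeatMasker results. GREEDIER (Li et al., 2008) is another program for identifying repeats in genomes; its paper reports good performance, with slightly higher sensitivity than RepeatMasker but a repeat identification rate of only about half that of RepeatMasker.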

1.2.2 Signature-based software

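Signature-based methods are used mainly to identify TEs.
1. LTR retrotransposons: LTR_STRUC (McCarthy and McDonald, 2003), LTR_PAR (Kalyanaraman and Aluru, 2006), FIND_LTR (Rho et al., 2007), RETROTECTOR (Sperber et al., 2007), LTR_FINDER (Xu and Wang, 2007) and LTRHARVEST (Ellinghaus et al., 2008). In the literature, LTR_STRUC has the highest sensitivity of these programs (96%) but an LTR identification rate of only 30%, while LTRHARVEST has the highest identification rate (42%) with a sensitivity of 67%. Overall, the authors recommend LTRHARVEST first and FIND_LTR second (sensitivity 83%, identification rate 37%).
2. Non-LTR retrotransposons: TSDFINDER (Szak et al., 2002), SINEDR (Tu et al., 2004) and RTANALYZER (Lucier et al., 2007). The first validates L1 insertions detected by RepeatMasker. The second detects SINEs flanked by TSDs (target site duplications: short duplicated sequences created on both sides of a repeat when it inserts into the genome). The third detects L1 retrotransposons by scoring features such as TSDs, poly(A) tails and the 5′ endonuclease site.
3. MITEs: FINDMITE (Tu, 2001), TRANSPO (Santiago et al., 2002), MITE Analysis Kit (MAK) (Yang and Hall, 2003) and MITE Uncovering SysTem (MUST) (Chen et al., 2009). In the literature, the authors report that the first program failed with errors and that the third could not be downloaded. The second cannot find new MITEs, so the most recent program, the fourth, looks like the best choice.
4. Helitrons: HELITRONFINDER (Du et al., 2008), written to find helitrons in the maize genome (helitrons have been found in both animals and plants).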
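The TSD signature mentioned in the list above can be illustrated with a small sketch: given the coordinates of a candidate insertion, look for the longest identical sequence immediately to its left and right. The function name `find_tsd` and the length limits are hypothetical, and exact matching is a simplification. This is not how TSDFINDER, SINEDR or RTANALYZER are implemented.

```python
def find_tsd(seq, start, end, min_len=2, max_len=20):
    """Return the longest exact target site duplication flanking seq[start:end].

    start and end are 0-based, end-exclusive coordinates of the candidate
    element. Returns None when no flanking duplication of at least
    min_len bases exists. Length limits are illustrative assumptions.
    """
    seq = seq.upper()
    for k in range(max_len, min_len - 1, -1):
        # Skip lengths that would run off either end of the sequence.
        if start - k < 0 or end + k > len(seq):
            continue
        left = seq[start - k:start]
        right = seq[end:end + k]
        if left == right:
            return left
    return None

# Toy example: an 8-base "element" flanked by the duplication ATTCG.
toy = "GG" + "ATTCG" + "CCCCCCCC" + "ATTCG" + "TT"
print(find_tsd(toy, 7, 15))
```

A scoring approach would extend this by allowing mismatches in the flanks and by combining the TSD with other features, such as a poly(A) tail.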

1.2.3 De novo prediction software

1.2.3.1 Self-comparison approaches

These approaches compare a sequence against itself, using tools such as BLAST or PALS, to find repeats. Programs include REPEAT PATTERN TOOLKIT (Agarwal and States, 1994), RECON (Bao and Eddy, 2002), PILER (Edgar and Myers, 2005) and the BLASTER suite (used in Quesneville et al., 2005). Of these, RECON is the most widely used.

1.2.3.2 K-mer and spaced seed approaches

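A k-mer of a given length that occurs many times can be identified as repetitive sequence. A spaced seed is a variant of the k-mer that tolerates some differences within the seed.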

Programs include REPUTER (Kurtz and Schleiermacher, 1999), VMATCH (Kurtz, unpublished), REPEAT-MATCH (Delcher et al., 1999), MER-ENGINE (Healy et al., 2003), FORREPEATS (Lefebvre et al., 2003), REAS (Li et al., 2005), REPEATSCOUT (Price et al., 2005), RAP (Campagna et al., 2005), REPSEEK (Achaz et al., 2007), TALLYMER (Kurtz et al., 2008) and P-CLOUDS (Gu et al., 2008).
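The k-mer and spaced seed ideas in this section can be sketched in a few lines. The function names and the `mask` notation ("1" for a position that must match, "0" for a position that is ignored) are assumptions chosen for illustration. The programs listed above use far more efficient index structures than a plain counter.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def repeated_kmers(seq, k, min_count=2):
    """Return the k-mers that occur at least min_count times."""
    return {kmer: n for kmer, n in count_kmers(seq, k).items() if n >= min_count}

def count_spaced_seeds(seq, mask):
    """Count windows of len(mask), comparing only positions where mask is '1'."""
    seq = seq.upper()
    keep = [i for i, c in enumerate(mask) if c == "1"]
    width = len(mask)
    return Counter(
        "".join(seq[i + j] for j in keep)
        for i in range(len(seq) - width + 1)
    )

print(repeated_kmers("ACGTACGTAC", 4))
print(count_spaced_seeds("ACGTACCTAC", "1101")["ACT"])
```

In the second call, the windows ACGT and ACCT differ only at the ignored third position, so the spaced seed counts them together even though neither exact 4-mer is repeated.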

1.2.4 Other repeat identification software

Other programs identify repeats that are not TEs. Tandem Repeats Finder (TRF) (Benson, 1999), Tandem Repeat Occurrence Locator (TROLL) (Castelo et al., 2002), MREPS (Kolpakov et al., 2003), TRAP (Sobreira et al., 2006) and Optimized Moving Window Spectral Analysis (OMWSA) (Du et al., 2007) were developed specifically to detect tandem repeats. The Inverted Repeat Finder (IRF) program (Warburton et al., 2004) was designed to search for inverted repeats.

1.3 Pipelines that integrate several programs

REPEATMODELER pipeline (Smit, unpublished http://www.repeatmasker.org/RepeatModeler.html) includes the programs RECON, REPEATSCOUT, REPEATMASKER and TRF. It uses the output of the RECON and REPEATSCOUT programs to build, refine and classify consensus models of putative interspersed repeats.

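The same literature also describes many other special-purpose pipelines. The REPEATMODELER pipeline is currently the most mainstream tool for identifying repeats in genomes, and the rest of this post describes how to use it.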

2. Installing and using RepeatModeler

2.1 Installation

RepeatMasker and RepeatModeler are software from ISB (Institute for Systems Biology). ISB is located in the South Lake Union neighborhood.

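According to the RepeatMasker documentation, installing and running it requires Perl 5.8.0 or later, a sequence search engine, TRF and a repeat database. Several search engines can be installed, but only one is used per run. The options are Cross_match, RMBlast, HMMER and ABBlast/WUBlast.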

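According to the RepeatModeler documentation, installing and running it requires Perl 5.8.8 or later, RepeatMasker, a repeat database, RECON, RepeatScout, TRF and a sequence search engine. Several search engines can be installed, but only one is used per run. The options are RMBlast and ABBlast/WUBlast.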

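Once the required software is installed, run the configure step for RepeatMasker and RepeatModeler and enter the paths to the corresponding programs when prompted. That completes the installation.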

2.2 Using RepeatModeler

2.2.1 Building a library from the genome sequence with RepeatModeler

$ $RepeatModelerHome/BuildDatabase -name species \
  -engine ncbi species.genome.fasta
$ $RepeatModelerHome/RepeatModeler -database species

This produces a directory named RM_[PID].[DATE], e.g. "RM_5098.MonMar141305172005". The file "consensi.fa.classified" in that directory is the library, which serves as input to RepeatMasker.
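Before passing the library to RepeatMasker, it can help to check that it exists and to see which repeat classes it contains. The sketch below is only an illustration. The function names are hypothetical, and the assumption that each FASTA header has the form `>name#Class/Family` should be verified against your own consensi.fa.classified file.

```python
from collections import Counter
from pathlib import Path

def latest_library(workdir="."):
    """Return the library path inside the newest RM_* run directory, or None."""
    run_dirs = [p for p in Path(workdir).glob("RM_*") if p.is_dir()]
    if not run_dirs:
        return None
    newest = max(run_dirs, key=lambda p: p.stat().st_mtime)
    lib = newest / "consensi.fa.classified"
    return lib if lib.is_file() else None

def tally_classes(fasta_text):
    """Count consensus sequences per class label.

    Assumes headers look like '>name#Class/Family ...'; headers without
    '#' are counted as 'Unclassified'.
    """
    counts = Counter()
    for line in fasta_text.splitlines():
        if not line.startswith(">"):
            continue
        fields = line[1:].split()
        name = fields[0] if fields else ""
        label = name.split("#", 1)[1] if "#" in name else "Unclassified"
        counts[label] += 1
    return counts

lib = latest_library(".")
if lib is not None:
    print(lib, dict(tally_classes(lib.read_text())))
```

Picking the newest directory by modification time is a convenience for repeated runs. If several runs exist, passing the directory name explicitly is the safer choice.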

2.2.2 Masking and quantifying repeats with RepeatMasker

$ cp RM_5098.MonMar141305172005/consensi.fa.classified .
$ mkdir Repeat_result
$ $RepeatMaskerHome/RepeatMasker -pa 8 \
  -e ncbi -lib consensi.fa.classified \
  -dir Repeat_result -gff species.genome.fasta

The result files are written to the Repeat_result directory.
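Because the run above requests GFF output with -gff, the annotations can be summarized with a short script. The sketch below relies only on the generic GFF layout (tab-separated columns, with the 1-based inclusive start and end in columns 4 and 5) and merges overlapping annotations so that no base is counted twice. The function name `masked_bases` is hypothetical, and the file path in the trailing comment is an assumption about how the output is named.

```python
def masked_bases(gff_text):
    """Return {sequence id: number of bases covered by at least one annotation}."""
    intervals = {}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and GFF header/comment lines
        cols = line.split("\t")
        if len(cols) < 5:
            continue  # not a feature line
        intervals.setdefault(cols[0], []).append((int(cols[3]), int(cols[4])))

    totals = {}
    for seqid, ivs in intervals.items():
        ivs.sort()
        covered = 0
        cur_start, cur_end = ivs[0]
        for start, end in ivs[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent: extend the current block.
                cur_end = max(cur_end, end)
            else:
                covered += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        covered += cur_end - cur_start + 1
        totals[seqid] = covered
    return totals

# Example use (the path is an assumption about the output file name):
# with open("Repeat_result/species.genome.fasta.out.gff") as fh:
#     print(masked_bases(fh.read()))
```

Dividing each total by the length of the corresponding sequence gives a rough repeat fraction, which can be compared with the summary that RepeatMasker itself reports.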

Original source: http://www.hzaumycology.com/chenlianfu_blog/?p=1747

YULUO 1
2016/04/14 21:14:23 1F
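Hello, I followed your instructions to find repeats with RepeatModeler, but I keep getting an error. Also, could you send me the latest RepeatMasker library? Thank you.
Error message: FATAL: xdf_db_fopen failed code 22 (identifier index does not exist) on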
database “hmi”: the file “hmi.xni” was not found. To create a
sequence identifier index, execute xdformat with the -I option or
re-index the existing database (faster) using the -X option.
RepeatModeler::sampleFromDB() Could not obtain sequence gi|245 from the database!