Tools for Protein Annotation

2011/12/02

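As the number of known proteins grows, classifying and annotating them has become a very active research topic. Here the collection of all protein sequences is called the nr (non-redundant) database. Similarity between sequences in nr is uneven: pairwise BLAST comparisons show that some sequences are highly similar while others share no detectable similarity at all. Similar sequences are grouped together and combined into a multiple sequence alignment. Different algorithms or models are then used to analyze and annotate the alignment, and proteins are classified on the basis of those annotations. When a new protein appears, the same algorithms and models are used to annotate or classify it.

The main databases and tools, and the methods they are based on: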
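The grouping step described above can be sketched in a few lines. This is only a toy illustration: `percent_identity` is a hypothetical stand-in for a real BLAST comparison (it assumes equal-length, already aligned sequences), the 0.8 threshold is arbitrary, and the sequences are made up.

```python
from itertools import combinations


def percent_identity(a: str, b: str) -> float:
    """Fraction of identical positions between two equal-length sequences.

    Stand-in for a real pairwise alignment score such as one from BLAST.
    """
    if len(a) != len(b):
        raise ValueError("toy identity needs equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_by_identity(seqs: dict, threshold: float = 0.7) -> list:
    """Single-linkage clustering: link every pair at or above the threshold."""
    parent = {name: name for name in seqs}

    def find(name: str) -> str:
        # Follow parent links to the representative, shortening the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(seqs, 2):
        if percent_identity(seqs[a], seqs[b]) >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for name in seqs:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Made-up sequences: three close relatives and one unrelated protein.
toy = {
    "p1": "MKVLAAGIVG",
    "p2": "MKVLAAGLVG",
    "p3": "MKVIAAGLVG",
    "q1": "GSWTHPQFEK",
}
clusters = cluster_by_identity(toy, threshold=0.8)
```

Each resulting group would then be the input to a multiple sequence alignment, from which a model (a PSSM, an HMM, a pattern) is built.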

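InterProScan	Integrated resource (combines several member databases)
CDD	Position-specific scoring matrices (PSSMs)
Pfam	Based on hidden Markov models (HMMs)
SMART	Protein domains
SUPERFAMILY	Based on hidden Markov models
TIGRFAMs	Based on hidden Markov models
PROSITE	Based on regular-expression patterns
profile	Database based on sequence profiles
PRINTS	Based on protein fingerprints
BLOCKS	Based on protein sequence blocks
HMMER	HMM-based homology search program
HHsearch	Detection of remote protein homologues
RPS-BLAST	PSSM-based search used with CDD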
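To make the "regular-expression patterns" entry for PROSITE concrete, here is a minimal sketch that converts a PROSITE-style pattern into a Python regular expression and scans a sequence with it. The function names are hypothetical, the test sequence is made up, and the converter only covers the common syntax elements (`x`, `[...]`, `{...}`, `(n)` or `(n,m)` repeats, and `<` or `>` anchors outside brackets). The pattern `N-{P}-[ST]-{P}` is the N-glycosylation site pattern.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into Python regex syntax."""
    regex = pattern.strip().rstrip(".")       # patterns end with a period
    regex = regex.replace("-", "")            # '-' only separates elements
    regex = regex.replace("x", ".")           # x = any residue
    regex = regex.replace("{", "[^").replace("}", "]")  # {P} = anything but P
    regex = regex.replace("(", "{").replace(")", "}")   # x(2,4) = 2 to 4 times
    regex = regex.replace("<", "^").replace(">", "$")   # terminus anchors
    return regex


def find_sites(pattern: str, sequence: str) -> list:
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # A lookahead keeps overlapping matches from being skipped.
    regex = re.compile("(?=(" + prosite_to_regex(pattern) + "))")
    return [m.start() for m in regex.finditer(sequence)]


sites = find_sites("N-{P}-[ST]-{P}.", "MKNGSAANPTL")
```

In the made-up sequence, the first asparagine matches, while the second one is followed by a proline and is therefore rejected by the `{P}` element.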

Introduction to the main data resources

InterProScan protein sequence analysis & classification

http://www.ebi.ac.uk/interpro/

InterPro is an integrated database of predictive protein “signatures” used for the classification and automatic annotation of proteins and genomes. InterPro classifies sequences at superfamily, family and subfamily levels, predicting the occurrence of functional domains, repeats and important sites. InterPro adds in-depth annotation, including GO terms, to the protein signatures.

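In short, InterPro is an integrated database for classifying and automatically annotating proteins and genomes. It assigns sequences to superfamilies, families and subfamilies, and predicts functional domains, repeats and important sites. It also attaches in-depth annotation, such as GO terms, to its protein signatures.

CDD: Conserved Domain Database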

http://www.ncbi.nlm.nih.gov/cdd

CDD is a protein annotation resource that consists of a collection of well-annotated multiple sequence alignment models for ancient domains and full-length proteins. These are available as position-specific score matrices (PSSMs) for fast identification of conserved domains in protein sequences via RPS-BLAST. CDD content includes NCBI-curated domains, which use 3D-structure information to explicitly define domain boundaries and provide insights into sequence/structure/function relationships, as well as domain models imported from a number of external source databases (Pfam, SMART, COG, PRK, TIGRFAM).

In short, CDD is a protein annotation resource built from well-annotated multiple sequence alignment models of conserved domains and full-length proteins. The models are stored as position-specific scoring matrices (PSSMs), which RPS-BLAST uses to identify conserved domains in protein sequences quickly. CDD includes NCBI-curated domains whose boundaries are defined explicitly from 3D-structure information, linking sequence, structure and function, together with domain models imported from external databases (Pfam, SMART, COG, PRK, TIGRFAM).
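The idea behind a PSSM can be illustrated with a small sketch: count residues in each column of a gap-free alignment, convert the counts to log-odds scores, and slide the matrix along a query. This is not how CDD or RPS-BLAST are implemented. The uniform background, the pseudocount of 1, the function names and the toy alignment are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment: list, pseudocount: float = 1.0) -> list:
    """Per-column log-odds scores against a uniform background (assumed)."""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    background = 1.0 / len(AMINO_ACIDS)
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for col in range(length):
        counts = Counter(row[col] for row in alignment)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pssm


def best_window(pssm: list, sequence: str) -> tuple:
    """Score every window of the sequence; return (best score, 0-based start)."""
    width = len(pssm)
    scored = [
        (sum(pssm[i][sequence[start + i]] for i in range(width)), start)
        for start in range(len(sequence) - width + 1)
    ]
    return max(scored)


# Made-up, gap-free alignment of a four-residue motif.
toy_alignment = ["GKST", "GKTT", "GKSS", "AKST"]
pssm = build_pssm(toy_alignment)
score, start = best_window(pssm, "MMAGKSTLLL")
```

Residues that are frequent in a column get positive scores and rare ones get negative scores, so the window covering `GKST` in the query scores highest.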

SMART: Simple Modular Architecture Research Tool

http://smart.embl-heidelberg.de/

Our main goal in providing this tool is to allow automatic identification and annotation of domains in user-supplied protein sequences.

In short, SMART automatically identifies and annotates domains in protein sequences supplied by the user.

TIGRFAMs

http://www.jcvi.org/cgi-bin/tigrfams/index.cgi

TIGRFAMs is a resource consisting of curated multiple sequence alignments, Hidden Markov Models (HMMs) for protein sequence classification, and associated information designed to support automated annotation of (mostly prokaryotic) proteins. Starting with release 10.0, TIGRFAMs models use HMMER3, which provides excellent search speed as well as exquisite search sensitivity. See the “TIGRFAMs Complete Listing” page to review the accession, protein name, model type, and EC number (if assigned) of all models. TIGRFAMs is a member database in InterPro. The HMM libraries and supporting files are available to download and use for free from our FTP site.

In short, TIGRFAMs collects curated multiple sequence alignments, HMMs and associated information to support automated protein annotation. Since release 10.0 its models use HMMER3, which offers both fast searches and high sensitivity. TIGRFAMs is also a member database of InterPro.

SUPERFAMILY

http://supfam.cs.bris.ac.uk/SUPERFAMILY/

SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes. The SUPERFAMILY annotation is based on a collection of hidden Markov models, which represent structural protein domains at the SCOP superfamily level. A superfamily groups together domains which have an evolutionary relationship. The annotation is produced by scanning protein sequences from over 1,700 completely sequenced genomes against the hidden Markov models.

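In short, SUPERFAMILY is a database of structural and functional annotation for proteins and genomes. Its annotation is based on HMMs that represent structural domains at the SCOP superfamily level, where a superfamily groups domains that share an evolutionary relationship. The annotation covers proteins from more than 1,700 completely sequenced genomes.

Databases included in InterPro: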

T.K. Attwood and A. Mitchell. PRINTS database
L. Bougueleret, L. Cerutti, E. de Castro, N. Hulo, C. Rivoire, C. Sigrist and I. Xenarios. PROSITE and HAMAP databases
D. Kahn and T. Laurent. ProDom and PRIAM databases
A. Bateman, P. Coggill. Pfam database
P. Bork and I. Letunic. SMART database
J. Gough. SUPERFAMILY database
D. Haft. TIGRFAMs database
D. Natale, H. Huang, R. Mazumder and C.H. Wu. PIRSF database
C. Orengo and C. Yeats. Gene3D database
P.D. Thomas. Panther database
E. Kelly and N. Mulder. InterPro database (South Africa)

HHsearch is a software suite for detecting remote homologues of proteins and generating high-quality alignments for homology modeling and function prediction.

HMMER is used to search sequence databases for homologs of protein sequences, and to make protein sequence alignments.
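As a rough sketch of how HMMER might be driven from a script, the code below builds an `hmmscan` command line (HMMER3's program for searching a sequence against a profile HMM library) and reads the first columns of its `--tblout` table. The file names are hypothetical, the E-value cutoff is arbitrary, and the HMM library is assumed to have been prepared with `hmmpress` beforehand. The column positions used in `parse_tblout` follow the documented table layout, but should be checked against the HMMER version in use.

```python
import shutil
import subprocess


def build_hmmscan_command(hmm_db, fasta, tblout, evalue=1e-5, cpus=1):
    """Assemble an hmmscan call: options first, then <hmmdb> <seqfile>."""
    return [
        "hmmscan",
        "--tblout", tblout,   # per-sequence hits as a whitespace table
        "-E", str(evalue),    # report hits at or below this E-value
        "--cpu", str(cpus),
        hmm_db,
        fasta,
    ]


def run_hmmscan(hmm_db, fasta, tblout, **options):
    """Run hmmscan if it is installed; raise a clear error otherwise."""
    if shutil.which("hmmscan") is None:
        raise FileNotFoundError("hmmscan not found on PATH")
    command = build_hmmscan_command(hmm_db, fasta, tblout, **options)
    subprocess.run(command, check=True, stdout=subprocess.DEVNULL)


def parse_tblout(lines):
    """Extract model name, accession, query, E-value and score per hit."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        hits.append({
            "target": fields[0],
            "accession": fields[1],
            "query": fields[2],
            "evalue": float(fields[4]),  # full-sequence E-value
            "score": float(fields[5]),   # full-sequence bit score
        })
    return hits


# Hypothetical file names; nothing is executed here.
command = build_hmmscan_command("Pfam-A.hmm", "query.fasta", "hits.tbl")
```

Calling `run_hmmscan("Pfam-A.hmm", "query.fasta", "hits.tbl")` and then passing the lines of `hits.tbl` to `parse_tblout` would give one dictionary per reported domain family.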