The Bioperl Sequence Object in Detail (Bioperl HOWTO Translation 7)


Sequence Objects


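Earlier parts referred to sequence objects many times and showed some ways to create and use them. This part describes in detail what a sequence object can do.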

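The table below lists the methods of the sequence object (a "method" is an object-oriented programming concept; see the earlier parts). The RETURNS column gives the value the object returns when the method is called. Some methods, such as seq(), can both get and set a value. For example, to get the sequence from an existing sequence object: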

$sequence_as_string = $seq_obj->seq;

You can also set the sequence yourself:

$seq_obj->seq("MMTYDFFFFVVNNNNPPPPAAAW");

Table 1: Sequence Object Methods
NAME | RETURNS | EXAMPLE | NOTE
accession_number | identifier | $acc = $so->accession_number | get or set an identifier
alphabet | alphabet | $so->alphabet('dna') | get or set the alphabet ('dna', 'rna', 'protein')
authority | authority, if available | $so->authority("FlyBase") | get or set the organization
desc | description | $so->desc("Example 1") | get or set a description
display_id | identifier | $so->display_id("NP_123456") | get or set an identifier
division | division, if available (e.g. PRI) | $div = $so->division | get division (e.g. "PRI")
get_dates | array of dates, if available | @dates = $so->get_dates | get dates
get_secondary_accessions | array of secondary accessions, if available | @accs = $so->get_secondary_accessions | get other identifiers
is_circular | Boolean | if ($so->is_circular) { # } | get or set
keywords | keywords, if available | @array = $so->keywords | get or set keywords
length | length, a number | $len = $so->length | get the length
molecule | molecule type, if available | $type = $so->molecule | get molecule (e.g. "RNA", "DNA")
namespace | namespace, if available | $so->namespace("Private") | get or set the name space
new | Sequence object | $so = Bio::Seq->new(-seq => "MPQRAS") | create a new one, see Bio::Seq for more
pid | pid, if available | $pid = $so->pid | get pid
primary_id | identifier | $so->primary_id(12345) | get or set an identifier
revcom | Sequence object | $so2 = $so1->revcom | Reverse complement
seq | sequence string | $seq = $so->seq | get or set the sequence
seq_version | version, if available | $so->seq_version("1") | get or set a version
species | Species object | $species_obj = $so->species | See Bio::Species for more
subseq | sequence string | $string = $seq_obj->subseq(10,40) | Arguments are start and end
translate | protein Sequence object | $prot_obj = $dna_obj->translate | See the Bioperl Tutorial for more
trunc | Sequence object | $so2 = $so1->trunc(10,40) | Arguments are start and end
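
As a minimal sketch of a few Table 1 methods, the script below builds a small DNA sequence object in memory and calls length, subseq, trunc, revcom and translate on it. The 12-base sequence and the identifier example1 are made up for illustration; they do not come from the HOWTO.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical 12-base sequence: ATG GCC AAG TAA (Met, Ala, Lys, stop).
my $seq_obj = Bio::Seq->new(
    -seq        => 'ATGGCCAAGTAA',
    -display_id => 'example1',
    -alphabet   => 'dna',
);

my $length  = $seq_obj->length;         # number of residues
my $first   = $seq_obj->subseq(1, 3);   # plain string; positions are 1-based
my $trunc   = $seq_obj->trunc(4, 9);    # new sequence object
my $revcom  = $seq_obj->revcom;         # new sequence object
my $protein = $seq_obj->translate;      # new protein sequence object

print "length:    $length\n";
print "subseq:    $first\n";
print "trunc:     ", $trunc->seq, "\n";
print "revcom:    ", $revcom->seq, "\n";
print "translate: ", $protein->seq, "\n";
```

Note the difference between subseq, which returns a string, and trunc, which returns a new sequence object that has to be asked for its seq.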

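Note that some of the methods listed above, such as molecule and division, only return something useful when the sequence object actually holds a corresponding value, and some sequence formats do not carry this information. So before relying on a method, make sure you know the input sequence file and what it contains.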
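
One way to guard against this, sketched below, is to ask the object whether it supports a method before calling it. In Bioperl the richer fields such as division and molecule are provided by Bio::Seq::RichSeq objects, which is what the GenBank and SwissProt parsers normally create, while a plain Bio::Seq does not have them. The sketch uses Perl's built-in can; the helper name value_or_default is made up for illustration.

```perl
use strict;
use warnings;
use Bio::Seq;

# Hypothetical helper: call $method on $obj only if the object supports it
# and the result is defined and non-empty; otherwise return $default.
sub value_or_default {
    my ($obj, $method, $default) = @_;
    return $default unless $obj->can($method);
    my $value = $obj->$method;
    return (defined $value && length $value) ? $value : $default;
}

# A plain in-memory Bio::Seq carries no GenBank-style division or molecule.
my $seq_obj = Bio::Seq->new(-seq => 'ATGGCCAAGTAA', -display_id => 'example1');

for my $method (qw(display_id division molecule)) {
    printf "%-10s %s\n", $method, value_or_default($seq_obj, $method, 'n/a');
}
```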

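Some further methods deal with sequence annotation. That topic is somewhat outside the scope of this part; see the Feature-Annotation HOWTO for details. The table below lists the relevant methods.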

Table 2: Feature and Annotation Methods
NAME | RETURNS | NOTE
get_SeqFeatures | array of SeqFeature objects |
get_all_SeqFeatures | array of SeqFeature objects | array includes sub-features
remove_SeqFeatures | array of SeqFeatures removed |
feature_count | number of SeqFeature objects |
add_SeqFeature | |
annotation | array of Annotation objects | get or set
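
To illustrate some of the Table 2 methods without needing an annotated file, the sketch below attaches one hand-made feature to an in-memory sequence and reads it back. The sequence, the coordinates and the gene tag value are invented for the example; Bio::SeqFeature::Generic is Bioperl's general-purpose feature class.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;

my $seq_obj = Bio::Seq->new(-seq => 'ATGGCCAAGTAA', -display_id => 'example1');

# Hypothetical CDS feature spanning the whole 12-base sequence.
my $feature = Bio::SeqFeature::Generic->new(
    -start       => 1,
    -end         => 12,
    -strand      => 1,
    -primary_tag => 'CDS',
    -tag         => { gene => 'example_gene' },
);
$seq_obj->add_SeqFeature($feature);

print "feature_count: ", $seq_obj->feature_count, "\n";

# Walk the top-level features and print type, location and gene tag.
for my $feat ($seq_obj->get_SeqFeatures) {
    my ($gene) = $feat->has_tag('gene') ? $feat->get_tag_values('gene') : ('');
    printf "%s %d..%d gene=%s\n",
        $feat->primary_tag, $feat->start, $feat->end, $gene;
}
```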

Examples

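Next, let's see how the methods above are used, and what they return for sequence objects obtained from different sources. First, here is how to fetch a record from GenBank and create a sequence object from it: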

use Bio::DB::GenBank;

# Fetch the record from GenBank over the network by accession number
my $db_obj  = Bio::DB::GenBank->new;
my $seq_obj = $db_obj->get_Seq_by_acc("J01673");

Or read it from an existing local GenBank file:

use Bio::SeqIO;

# Read the first (here the only) record from a local GenBank file
my $seqio_obj = Bio::SeqIO->new(-file => "J01673.gb", -format => "genbank" );
my $seq_obj   = $seqio_obj->next_seq;

The GenBank file looks like this:

LOCUS       ECORHO                  1880 bp    DNA     linear   BCT 26-APR-1993
DEFINITION  E.coli rho gene coding for transcription termination factor.
ACCESSION   J01673 J01674
VERSION     J01673.1  GI:147605
KEYWORDS    attenuator; leader peptide; rho gene; transcription terminator.
SOURCE      Escherichia coli
ORGANISM  Escherichia coli
                  Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales;
                  Enterobacteriaceae; Escherichia.
REFERENCE   1  (bases 1 to 1880)
AUTHORS   Brown,S., Albrechtsen,B., Pedersen,S. and Klemm,P.
TITLE     Localization and regulation of the structural gene for
             transcription-termination factor rho of Escherichia coli
JOURNAL   J. Mol. Biol. 162 (2), 283-298 (1982)
MEDLINE   83138788
PUBMED   6219230
REFERENCE   2  (bases 1 to 1880)
AUTHORS   Pinkham,J.L. and Platt,T.
TITLE     The nucleotide sequence of the rho gene of E. coli K-12
JOURNAL   Nucleic Acids Res. 11 (11), 3531-3545 (1983)
MEDLINE   83220759
PUBMED   6304634
COMMENT      Original source text: Escherichia coli (strain K-12) DNA.
                      A clean copy of the sequence for [2] was kindly provided by
                      J.L.Pinkham and T.Platt.
FEATURES       Location/Qualifiers
     source      1..1880
                     /organism="Escherichia coli"
                     /mol_type="genomic DNA"
                     /strain="K-12"
                     /db_xref="taxon:562"
     mRNA       212..>1880
                     /product="rho mRNA"
     CDS          282..383
                     /note="rho operon leader peptide"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA24531.1"
                     /db_xref="GI:147606"
                     /translation="MRSEQISGSSLNPSCRFSSAYSPVTRQRKDMSR"
     gene         468..1727
                     /gene="rho"
     CDS          468..1727
                     /gene="rho"
                     /note="transcription termination factor"
                     /codon_start=1
                     /transl_table=11
                     /protein_id="AAA24532.1"
                     /db_xref="GI:147607"
                     /translation="MNLTELKNTPVSELITLGENMGLENLARMRKQDIIFAILKQHAK
                     SGEDIFGDGVLEILQDGFGFLRSADSSYLAGPDDIYVSPSQIRRFNLRTGDTISGKIR
                     PPKEGERYFALLKVNEVNFDKPENARNKILFENLTPLHANSRLRMERGNGSTEDLTAR
                     VLDLASPIGRGQRGLIVAPPKAGKTMLLQNIAQSIAYNHPDCVLMVLLIDERPEEVTE
                     MQRLVKGEVVASTFDEPASRHVQVAEMVIEKAKRLVEHKKDVIILLDSITRLARAYNT
                     VVPASGKVLTGGVDANALHRPKRFFGAARNVEEGGSLTIIATALIDTGSKMDEVIYEE
                     FKGTGNMELHLSRKIAEKRVFPAIDYNRSGTRKEELLTTQEELQKMWILRKIIHPMGE
                     IDAMEFLINKLAMTKTNDDFFEMMKRS"
ORIGIN      15 bp upstream from HhaI site.
        1 aaccctagca ctgcgccgaa atatggcatc cgtggtatcc cgactctgct gctgttcaaa
      61 aacggtgaag tggcggcaac caaagtgggt gcactgtcta aaggtcagtt gaaagagttc

                                  ...deleted...  

  1801 tgggcatgtt aggaaaattc ctggaatttg ctggcatgtt atgcaatttg catatcaaat
  1861 ggttaatttt tgcacaggac
//

Either way you get the same sequence object. The table below lists the methods available on this object and the values they return.

Table 3: Values from the Sequence object (Genbank)
METHOD | RETURNS
display_id | ECORHO
desc | E.coli rho gene coding for transcription termination factor.
display_name | ECORHO
accession | J01673
primary_id | 147605
seq_version | 1
keywords | attenuator; leader peptide; rho gene; transcription terminator
is_circular |
namespace |
authority |
length | 1880
seq | AACCCT…ACAGGAC
division | BCT
molecule | DNA
get_dates | 26-APR-1993
get_secondary_accessions | J01674
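
Most of the values in Table 3 can be printed with a short loop, sketched below under the assumption that the record shown above has been saved as J01673.gb in the current directory (the file name used earlier). The helper seq_values is made up for this example; it skips methods the object does not support, shows undefined values as empty strings, and joins list results with spaces.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: collect method => value pairs from a sequence object.
sub seq_values {
    my ($seq_obj) = @_;
    my %value;

    # Methods that return a single value.
    for my $method (qw(display_id desc accession_number primary_id
                       seq_version length division molecule)) {
        next unless $seq_obj->can($method);
        my $v = $seq_obj->$method;
        $value{$method} = defined $v ? $v : '';
    }

    # Methods that return a list.
    for my $method (qw(get_dates get_secondary_accessions)) {
        next unless $seq_obj->can($method);
        $value{$method} = join ' ', $seq_obj->$method;
    }
    return \%value;
}

my $file = 'J01673.gb';    # assumed local copy of the record above
if (-e $file) {
    my $seqio_obj = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    my $values    = seq_values($seqio_obj->next_seq);
    printf "%-26s %s\n", $_, $values->{$_} for sort keys %$values;
}
else {
    warn "$file not found; save the GenBank record first\n";
}
```

The same helper should work for a SwissProt file if the format argument is changed to 'swiss'.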

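A few notes are in order. First, much of the information in the record is not returned. All of this "missing" information relates to sequence annotation; see the Feature and Annotation HOWTO. Second, some methods return empty values, for example namespace and authority. The reason is that there is not yet a generally accepted format or agreed name for the corresponding information; once there is, the authors may rewrite the code. (Translator's note: the authors probably built the structure first, without any matching content. For now these methods return nothing useful here and can be ignored.) Finally, you may ask how each piece of information in the record is matched to a method. In general, since there is no common standard, the authors of the code use their own judgment to give each piece of information a sensible name and map it to a method. (The translation of this last sentence may not be accurate.)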

Now take a FASTA file as input (still the same sequence). The FASTA format, shown below, is very simple compared with GenBank:

>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAAATATGGCATCCGTGGTATCCCGACTCTGCTGCTGTTCAAAAACGGTGAAG
TGGCGGCAACCAAAGTGGGTGCACTGTCTAAAGGTCAGTTGAAAGAGTTCCTCGACGCTAACCTGGCGTA

                        ...deleted...

ACGTGTTTACGTGGCGTTTTGCTTTTATATCTGTAATCTTAATGCCGCGCTGGGCATGTTAGGAAAATTC
CTGGAATTTGCTGGCATGTTATGCAATTTGCATATCAAATGGTTAATTTTTGCACAGGAC

The values returned:

Table 4: Values from the Sequence object (Fasta)
METHOD | RETURNS
display_id | gi|147605|gb|J01673.1|ECORHO
desc | E.coli rho gene coding for transcription termination factor
display_name | gi|147605|gb|J01673.1|ECORHO
accession | unknown
primary_id | gi|147605|gb|J01673.1|ECORHO
is_circular |
namespace |
authority |
length | 1880
seq | AACCCT…ACAGGAC
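
The display_id and desc values for a FASTA record can be checked without a file on disk. The sketch below feeds a two-line FASTA record to Bio::SeqIO through an in-memory filehandle. The header is the one shown above, while the sequence is cut to its first 20 bases, so length will not be 1880 here.

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Header copied from the FASTA record above; sequence cut to 20 bases.
my $fasta = <<'END_FASTA';
>gi|147605|gb|J01673.1|ECORHO E.coli rho gene coding for transcription termination factor
AACCCTAGCACTGCGCCGAA
END_FASTA

# Open a read-only filehandle on the string instead of a file.
open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my $seqio_obj = Bio::SeqIO->new(-fh => $fh, -format => 'fasta');
my $seq_obj   = $seqio_obj->next_seq;

print "display_id: ", $seq_obj->display_id, "\n";
print "desc:       ", $seq_obj->desc, "\n";
print "length:     ", $seq_obj->length, "\n";
```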

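Compared with what the GenBank file gives, some information is missing, such as seq_version. In addition, some methods, display_id for example, return a different value. The reason is that the rules the GenBank server follows when converting GenBank format to FASTA differ from the rules the SwissProt server follows when converting SwissProt format to FASTA. Without a unified standard, the authors of the code generally map each piece of information to a method according to their own understanding. Bioperl could follow one particular convention, such as the one GenBank uses, but the Bioperl authors decided not to follow any conversion convention that comes from a single organization.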
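
Because Bioperl does not interpret the NCBI-style identifier, a script that needs the accession from such a FASTA header has to split the identifier itself. The sketch below does this in plain Perl for the header layout shown above (gi|number|database|accession.version|locus). The function name parse_ncbi_id is made up, and other header layouts would need different handling.

```perl
use strict;
use warnings;

# Hypothetical helper: split an NCBI-style id of the form
# gi|<gi number>|<db>|<accession>.<version>|<locus> into named parts.
# Returns an empty list if the id does not have this layout.
sub parse_ncbi_id {
    my ($id) = @_;
    my ($tag, $gi, $db, $acc_ver, $locus) = split /\|/, $id;
    return unless defined $tag && $tag eq 'gi' && defined $acc_ver;
    my ($accession, $version) = split /\./, $acc_ver, 2;
    return (
        gi        => $gi,
        database  => $db,
        accession => $accession,
        version   => $version,
        locus     => $locus,
    );
}

my %parts = parse_ncbi_id('gi|147605|gb|J01673.1|ECORHO');
print "$_: $parts{$_}\n" for sort keys %parts;
```

In a real script the argument would typically be the display_id of a sequence object read from the FASTA file.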

Next, a SwissProt file as input.

ID   A2S3_RAT STANDARD; PRT; 913 AA.
AC   Q8R2H7; Q8R2H6; Q8R4G3;
DT   28-FEB-2003 (Rel. 41, Created)
DE   Amyotrophic lateral sclerosis 2 chromosomal region candidate gene
DE   protein 3 homolog (GABA-A receptor interacting factor-1) (GRIF-1) (O-
DE   GlcNAc transferase-interacting protein of 98 kDa).
GN   ALS2CR3 OR GRIF1 OR OIP98.
OS   Rattus norvegicus (Rat).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
OC   Mammalia; Eutheria; Rodentia; Sciurognathi; Muridae; Murinae; Rattus.
OX   NCBI_TaxID=10116;
RN   [1]
RP   SEQUENCE FROM N.A. (ISOFORMS 1 AND 2), SUBCELLULAR LOCATION, AND
RP   INTERACTION WITH GABA-A RECEPTOR.
RC   TISSUE=Brain;
RX   MEDLINE=22162448; PubMed=12034717;
RA   Beck M., Brickley K., Wilkinson H.L., Sharma S., Smith M.,
RA   Chazot P.L., Pollard S., Stephenson F.A.;
RT   "Identification, molecular cloning, and characterization of a novel
RT   GABAA receptor-associated protein, GRIF-1.";
RL   J. Biol. Chem. 277:30079-30090(2002).
RN   [2]
RP   REVISIONS TO 579 AND 595-596, AND VARIANTS VAL-609 AND PRO-820.
RA   Stephenson F.A.;
RL   Submitted (FEB-2003) to the EMBL/GenBank/DDBJ databases.
RN   [3]
RP   SEQUENCE FROM N.A. (ISOFORM 3), INTERACTION WITH O-GLCNAC TRANSFERASE,
RP   AND O-GLYCOSYLATION.
RC   STRAIN=Sprague-Dawley; TISSUE=Brain;
RX   MEDLINE=22464403; PubMed=12435728;
RA   Iyer S.P.N., Akimoto Y., Hart G.W.;
RT   "Identification and cloning of a novel family of coiled-coil domain
RT   proteins that interact with O-GlcNAc transferase.";
RL   J. Biol. Chem. 278:5399-5409(2003).
CC   -!- SUBUNIT: Interacts with GABA-A receptor and O-GlcNac transferase.
CC   -!- SUBCELLULAR LOCATION: Cytoplasmic.
CC   -!- ALTERNATIVE PRODUCTS:
CC       Event=Alternative splicing; Named isoforms=3;
CC       Name=1; Synonyms=GRIF-1a;
CC         IsoId=Q8R2H7-1; Sequence=Displayed;
CC       Name=2; Synonyms=GRIF-1b;
CC         IsoId=Q8R2H7-2; Sequence=VSP_003786, VSP_003787;
CC       Name=3;
CC         IsoId=Q8R2H7-3; Sequence=VSP_003788;
CC   -!- PTM: O-glycosylated.
CC   -!- SIMILARITY: TO HUMAN OIP106.
DR   EMBL; AJ288898; CAC81785.2; -.
DR   EMBL; AJ288898; CAC81786.2; -.
DR   EMBL; AF474163; AAL84588.1; -.
DR   GO; GO:0005737; C:cytoplasm; IEP.
DR   GO; GO:0005634; C:nucleus; IDA.
DR   GO; GO:0005886; C:plasma membrane; IEP.
DR   GO; GO:0006357; P:regulation of transcription from Pol II pro...; IDA.
DR   InterPro; IPR006933; HAP1_N.
DR   Pfam; PF04849; HAP1_N; 1.
KW   Coiled coil; Alternative splicing; Polymorphism.
FT   DOMAIN      134    355       COILED COIL (POTENTIAL).
FT   VARSPLIC    653    672       VATSNPGKCLSFTNSTFTFT -> ALVSHHCPVEAVRAVHP
FT                                TRL (in isoform 2).
FT                                /FTId=VSP_003786.
FT   VARSPLIC    673    913       Missing (in isoform 2).
FT                                /FTId=VSP_003787.
FT   VARSPLIC    620    687       VQQPLQLEQKPAPPPPVTGIFLPPMTSAGGPVSVATSNPGK
FT                                CLSFTNSTFTFTTCRILHPSDITQVTP -> GSAASSTGAE
FT                                ACTTPASNGYLPAAHDLSRGTSL (in isoform 3).
FT                                /FTId=VSP_003788.
FT   VARIANT     609    609       E -> V.
FT   VARIANT     820    820       S -> P.
SQ   SEQUENCE   913 AA;  101638 MW;  D0E135DBEC30C28C CRC64;
     MSLSQNAIFK SQTGEENLMS SNHRDSESIT DVCSNEDLPE VELVNLLEEQ LPQYKLRVDS
     LFLYENQDWS QSSHQQQDAS ETLSPVLAEE TFRYMILGTD RVEQMTKTYN DIDMVTHLLA
                             ...deleted...
     GIARVVKTPV PRENGKSREA EMGLQKPDSA VYLNSGGSLL GGLRRNQSLP VMMGSFGAPV
     CTTSPKMGIL KED
//

The corresponding return values are shown in the table below:

Table 5: Values from the Sequence object (Swissprot)
METHOD | RETURNS
display_id | A2S3_RAT
desc | Amyotrophic lateral … protein of 98 kDa).
display_name | A2S3_RAT
accession | Q8R2H7
is_circular |
namespace |
authority |
seq_version |
keywords | Coiled coil; Alternative splicing; Polymorphism
length | 913
seq | MSLSQ…ILKED
division | RAT
get_dates | 28-FEB-2003 (Rel. 41, Created)
get_secondary_accessions | Q8R2H6 Q8R4G3

As with GenBank, see the Feature and Annotation HOWTO for how to get at the sequence annotation.
