Identification of Pseudogenes


 By Seth Axen (11/18/08)


A pseudogene is a genetic sequence which is nonfunctional. What distinguishes a pseudogene from any other noncoding DNA sequence is that typically, a pseudogene will align well to a known protein as seen on BLAST, CD, and/or Pfam while appearing to be a legitimate coding sequence during structural annotation. Pseudogenes are thought to be formed by two mechanisms in prokaryotes:

  1. Duplicated pseudogenes are sequences which were formed through a duplication of a functional gene, followed by mutagenesis to remove functionality.
  2. Disabled pseudogenes are the original sequence of a functional gene which has been disabled through mutagenesis so that the microbe no longer contains a functional copy of the gene.

For the sake of identification in genomics, an open reading frame (ORF) is annotated as a pseudogene if it meets one of the following criteria:

  1. The sequence is interrupted by more than one stop codon or frameshift, so that it aligns to a truncated Pfam domain covering less than 30% of the predicted profile.
  2. The sequence is split by another ORF.
  3. The sequence is missing key residues known to be required for functionality.

The first case is identified as described in the “Criterion 1” section below. The second case is identified through somewhat more complicated methods, which are given in “Criterion 2” below. The third and final case requires a new online resource, as shown in “Criterion 3” below.

While annotating, keep in mind that there is some disagreement in the scientific community as to the technical definition of a pseudogene, and no consensus has yet been reached. Because of this confusion, many professional annotators improperly annotate some hypothetical proteins as pseudogenes without substantial evidence that the ORFs are, in fact, nonfunctional genes. As a student annotator, it is easy to fall into the same trap. For example, in a pilot program at UCLA, student annotators annotated a large number of features over three quarters. Of the features annotated, 16 were predicted by the annotators to be pseudogenes. When the methods below were applied, none of the 16 was confirmed to be a pseudogene. This example also demonstrates that true pseudogenes are rare.

It should also be noted that the three criteria given above identify pseudogenes only from a theoretical genomics perspective. Confirmation in a wet lab is still required before a sequence can be considered a true pseudogene. In fact, several sequences which have been annotated as pseudogenes have been shown to be functional in wet-lab experiments. The first and most famous example of a functional pseudogene was identified in 2003 by Hirotsune et al. in mice (1). They found that an untranslated RNA form of the pseudogene had a critical role in the regulation of the original copy of the gene. Since their research was published, several other examples of functional pseudogenes have been presented, causing many scientists to question the assumption that pseudogenes are prime examples of “junk DNA.”

New Resource: ScanProsite

URL:

Prosite is a curated database of multiple sequence alignments of motifs used to identify the domains and families of protein sequences. These alignments are known as profiles and are similar to Pfams or COGs. In annotation, Prosite is generally used to identify an open reading frame as a pseudogene by verifying the absence of catalytic residues. In addition, Prosite will occasionally match a searched sequence to a pattern solely because it contains a signature or the necessary residues present in a domain or family, even though the rest of the sequence may not align to the domain. This aids in the identification of proteins that are also present in distantly related microbes.
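Prosite patterns are written in a compact syntax in which elements are joined by hyphens, `x` stands for any residue, square brackets list allowed residues, curly braces list forbidden residues, and a number in parentheses gives a repeat count. The following is a minimal sketch, not part of the original tutorial, of how such a pattern could be translated into a Python regular expression to see why a sequence does or does not match. The function name `prosite_to_regex` and the example pattern and sequences are illustrative assumptions; ScanProsite itself should still be used for actual annotation.

```python
import re

# One pattern element: x, a single residue, [allowed] or {forbidden},
# optionally followed by a repeat count such as (3) or (2,4).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regex string (sketch)."""
    pattern = pattern.strip().rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):  # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):    # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":
            core = "."
        elif residue.startswith("{"):
            core = "[^" + residue[1:-1] + "]"
        else:
            core = residue
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


# Illustrative pattern and sequences (assumptions, not real annotation data).
regex = prosite_to_regex("[AC]-x-V-x(4)-{ED}")
print(regex)
print(bool(re.search(regex, "MKAGVAAAAKLL")))  # last position is K, allowed
print(bool(re.search(regex, "MKAGVAAAAELL")))  # last position is E, forbidden
```

A single substitution at a constrained position is enough to make the match fail, which is the kind of change Criterion 3 looks for.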


Note: This tutorial was designed for Mozilla Firefox on any Windows operating system. You may use any browser or operating system that you prefer. However, some of the following steps may not work as described under other conditions.


Criterion 1

  1. Navigate to the Pfam database at
  2. Search the amino acid sequence of your ORF against the Pfam database as explained in the Pfam SOP.
  3. On the results page, note the domain graphic. If this is a pseudogene of Criterion 1, then, the domain graphic will show the last domain as truncated and running to the end of the sequence.
  4. Scroll down to the data table and observe the row for that sequence. Calculate the length of the HMM covered by the sequence by subtracting the “From” value from the “To” value under the HMM columns and adding 1.
  5. To determine the total length of the HMM, navigate to the “Curations & models” tab, scroll down to the “HMM information” section, and record the value under “Model length.”
  6. Divide the value calculated in step 4 by the value recorded in step 5 and multiply the result by 100 to obtain the percent coverage of the HMM by the entered sequence. If this value is less than 30% and research of the literature indicates that the domain is necessary for protein functionality, then the protein is a pseudogene meeting Criterion 1.
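The arithmetic in steps 4 to 6 can be sketched as a short calculation. This is only an illustration: the function names and the example numbers are assumptions, and the 30% threshold is the one stated in Criterion 1. A result below the threshold is not sufficient by itself, since the literature must also indicate that the domain is necessary for function.

```python
def hmm_coverage_percent(hmm_from: int, hmm_to: int, model_length: int) -> float:
    """Percent of the Pfam HMM covered by the aligned sequence.

    hmm_from and hmm_to are the "From" and "To" values under the HMM columns;
    model_length is the "Model length" value from the "HMM information" section.
    """
    if model_length <= 0:
        raise ValueError("model_length must be positive")
    if hmm_to < hmm_from:
        raise ValueError("hmm_to must not be smaller than hmm_from")
    covered = hmm_to - hmm_from + 1  # step 4: To - From + 1
    return 100.0 * covered / model_length  # step 6


def below_criterion1_threshold(hmm_from: int, hmm_to: int, model_length: int,
                               threshold: float = 30.0) -> bool:
    """True if the coverage is under the Criterion 1 cutoff (30% by default)."""
    return hmm_coverage_percent(hmm_from, hmm_to, model_length) < threshold


# Illustrative values only: HMM positions 1 to 45 of a 200-position model.
print(hmm_coverage_percent(1, 45, 200))
print(below_criterion1_threshold(1, 45, 200))
```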


Criterion 2

  1. Navigate to the Pfam database at
  2. Search the amino acid sequence of your ORF against the Pfam database as explained in the Pfam SOP.
  3. On the results page, note the domain graphic. If an inserted ORF maintains its reading frame and its stop codon is intact, a pseudogene of Criterion 2 will usually show one or more domains after a truncated domain of the predicted
  4. On imgACT, through the Lab Notebook for the gene in question, access the Gene Details page on IMG. Under “Evidence for Function Prediction,” click on “Sequence Viewer for Alternate ORF Search.” On the “Sequence Viewer” page, change the value in the “bp downstream” box from “+0” to a number such as “+100.” Press “Submit.” Scroll down to view the flanking nucleotide sequence, which is colored green. Copy the flanking DNA sequence and paste it at the end of a nucleotide FASTA file for the gene. Run a Pfam search on this nucleotide sequence at
  5. On the results page, the domain graphic will most likely appear as below. Although the inserted ORF will not necessarily contain a domain identified by Pfam, the domain from the original protein will be visibly fragmented. The important point is that the second half of the fragmented domain is present in the flanking DNA. If this is the case, then the feature is a pseudogene.
  6. While this method seems rather simple, remember that we assumed that the stop codon of the inserted ORF was intact. If it was not, the first Pfam search would have produced the domain graphic described in Step 5, and the subsequent steps would be unnecessary. We also assumed that the inserted ORF maintained its reading frame. If it did not, we would expect either that a premature stop codon would be introduced, or that the inserted ORF would not appear on the domain graphic while the portion of the protein after it may or may not. In this case, verifying that the protein is a pseudogene meeting Criterion 2 becomes a fairly complicated process, and any subsequent methods will be designed based on the first domain graphic obtained.
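The copy-and-paste part of step 4 (appending the downstream flanking DNA to the gene's nucleotide sequence) can also be done with a few lines of code. The sketch below is not part of the original procedure; the function name `extended_fasta`, the header text, and the example sequences are hypothetical. It only builds the FASTA text, which would then be searched against Pfam by hand as in step 4.

```python
def _clean(dna: str) -> str:
    """Remove whitespace, upper-case, and check for unexpected characters."""
    sequence = "".join(dna.split()).upper()
    unexpected = set(sequence) - set("ACGTN")
    if unexpected:
        raise ValueError(f"unexpected characters in DNA: {sorted(unexpected)}")
    return sequence


def extended_fasta(header: str, gene_dna: str, downstream_flank: str,
                   width: int = 60) -> str:
    """Return a FASTA record holding the gene followed by its downstream flank."""
    combined = _clean(gene_dna) + _clean(downstream_flank)
    lines = [combined[i:i + width] for i in range(0, len(combined), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# Hypothetical example: a short gene fragment plus copied flanking sequence.
record = extended_fasta("my_gene plus downstream flank",
                        "ATGAAACCCGGG",
                        "ttt aaa ccc taa")
print(record, end="")
```

Cleaning the pasted flank first guards against stray spaces or line breaks picked up when copying from the Sequence Viewer page.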


Criterion 3

  1. Navigate to the ScanProsite tool on Prosite at
  2. Enter the amino acid sequence of the protein in question into the box under “Sequence(s) to be scanned.” Uncheck the “Exclude motifs with a high probability of occurrence” box, check the “Show low level score” box, and click “START THE SCAN.”
  3. Before beginning analysis on the ScanProsite results page, observe the sections on the page and compare them with the example page given below. As Prosite is not as robust as CD or Pfam for identification of domains, it will generally only be used to identify pseudogenes of Criterion 3. Therefore, subsequent analysis of the results page will only be concerned with this process.
  4. There are two important notations to watch for on the ScanProsite results page. The first is an underlined series of residues in the protein sequence listed below a hit, as seen below. These correspond to predicted features in the domain that are deemed to be necessary. Because this is a prediction, its purpose is not well defined, but the presence of the underlined portion should be recorded anyway. Notice the condition given in the bottom right-hand corner of the box containing the hit. If you are using Firefox as your web browser, placing the cursor over the condition will place a green highlight over the residues that meet that condition.
  5. The second notation to look for is the presence of red colored residues corresponding to necessary components in the active site of the domain as seen below. Note that if you place the cursor over the condition here as well, the corresponding residue is also highlighted green.
  6. When any one of the amino acids in an active site or a predicted feature does not meet the condition required for assumed functionality, Prosite will no longer underline or color the residues but will change the condition section to say [incomplete group: <consensus sequence>: false]. If amino acids are separated on the linear peptide but are both members of the same active site in the folded protein, the necessary residues still present will have a condition section saying [incomplete group: <consensus sequence>: true]. Placing the cursor over these conditions will sometimes, but not always, highlight the altered residue(s) in grey in the domain’s sequence in the same box. However, it will always highlight the altered residue(s) in the complete sequence at the top of the page. When scrolling up, take care that the cursor does not hover over any of the other hits or those hits
  7. If a condition is not met, that provides strong evidence that the protein is a pseudogene. However, one more step must be taken to confirm that exceptions to this condition are rare or absent. Click on the PS***** identification number hyperlink in the title line for the profile for which the condition is not met. This
  8. On the profile page, be sure to scan the “Description” section toward the top for any key information concerning structure or catalytic activity. Scroll down to the “Technical” section and look for a data box that contains a grey graphic portraying the profile. Above this graphic is a box with a blue background that usually contains information concerning the condition in the profile. If multiple other sequences are detected in Swiss-Prot, indicating that there are multiple exceptions to the condition, then you cannot yet conclude that the feature is a pseudogene. If the data and the box give few or no exceptions to the condition, then it is safe to conclude that the condition is absolutely necessary for functionality of the protein, and the feature may be considered a pseudogene.
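The idea behind Criterion 3, that every residue required for function must be present at its expected position, can be illustrated with a small check. This is only a sketch of the reasoning and does not reproduce how ScanProsite evaluates its conditions. The function name `unmet_conditions`, the toy sequences, and the required positions are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def unmet_conditions(sequence: str,
                     required: Dict[int, str]) -> List[Tuple[int, str, Optional[str]]]:
    """List the required positions that the sequence fails to satisfy.

    required maps a 1-based position to a string of allowed residues,
    for example {3: "H", 5: "DE"}. Each failure is reported as
    (position, allowed residues, residue found or None if out of range).
    """
    failures = []
    for position, allowed in sorted(required.items()):
        if 1 <= position <= len(sequence):
            found: Optional[str] = sequence[position - 1]
        else:
            found = None  # the sequence is too short to contain this position
        if found is None or found not in allowed:
            failures.append((position, allowed, found))
    return failures


# Toy example (hypothetical): position 3 must be H, position 5 must be D or E.
conditions = {3: "H", 5: "DE"}
print(unmet_conditions("MKHAD", conditions))  # all conditions met
print(unmet_conditions("MKQAD", conditions))  # H replaced by Q at position 3
```

An empty list corresponds to a hit in which all conditions are met. A non-empty list corresponds to an unmet condition, which, as in steps 7 and 8, still has to be weighed against known exceptions before the feature is called a pseudogene.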


1. Hirotsune et al. Nature. 2003 May 1;423(6935):91-6.