Glossary of technical terms in genome assembly

2011/10/21

RNA sequencing

(RNA-seq). An experimental protocol that uses next-generation sequencing technologies to sequence the RNA molecules within a biological sample in an effort to determine the primary sequence and relative abundance of each RNA.

Sequencing depth

The average number of reads representing a given nucleotide in the reconstructed sequence. A 10× sequencing depth means that each nucleotide of the transcript was sequenced, on average, ten times.
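As a rough illustration of the arithmetic, average depth can be estimated as the total number of sequenced bases divided by the length of the reconstructed sequence. The function name and the example numbers below are made up for illustration.

```python
def average_depth(read_lengths, reference_length):
    """Average depth = total sequenced bases / length of the reconstructed sequence."""
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    return sum(read_lengths) / reference_length


# Hypothetical example: 200 reads of 100 bases covering a 2,000-base transcript.
depth = average_depth([100] * 200, 2000)
print(depth)  # 10.0, i.e. 10x depth
```

This assumes every read base maps to the transcript; in practice only aligned bases would be counted.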

Paired-end protocol

A library construction and sequencing strategy in which both ends of a DNA fragment are sequenced to produce pairs of reads (mate pairs).

Contigs

An abbreviation of "contiguous sequences": contiguous stretches of DNA assembled from shorter overlapping sequence reads.

Low-complexity reads

Short DNA sequences composed of stretches of homopolymer nucleotides or simple sequence repeats.

Quality scores

An integer that encodes the estimated probability that a given base call in a nucleic acid sequence is correct; higher scores indicate more reliable base calls.
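The glossary does not name a particular scale. Assuming the widely used Phred scale, a score Q corresponds to an error probability P = 10^(-Q/10), so the probability that the base is correct is 1 - P. The sketch below also assumes the Phred+33 ASCII encoding commonly found in FASTQ files; all function names are illustrative.

```python
import math


def phred_to_error_prob(q):
    """Error probability for a Phred-scaled quality score (assumed scale)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred-scaled quality score for an error probability 0 < p <= 1."""
    return -10 * math.log10(p)


def decode_phred33(quality_string):
    """Decode a quality string, assuming the Phred+33 ASCII offset."""
    return [ord(char) - 33 for char in quality_string]


scores = decode_phred33("I5!")
print(scores)                               # [40, 20, 0]
print(1 - phred_to_error_prob(scores[1]))   # probability that a Q20 base is correct
```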

k-mer frequency

The number of times that each k-mer (that is, a short oligonucleotide of length k) appears in a set of DNA sequences.
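A minimal sketch of k-mer counting over a set of sequences, using a sliding window of length k. The function name is an illustrative choice, and the sketch counts k-mers on the given strand only (no reverse complements).

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every k-mer (substring of length k) across the input sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


counts = kmer_counts(["ACGTACG"], 3)
print(counts["ACG"])  # 2
```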

Splice-aware aligner

A program designed to align cDNA reads to a genome, allowing for the large gaps that introns create where a read spans an exon-exon junction.

Traversing

A method for systematically visiting all nodes in a mathematical graph.
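One common way to traverse a graph is depth-first search. The sketch below assumes the graph is stored as a dictionary mapping each node to a list of its neighbours (an illustrative representation) and visits every node reachable from a chosen start node.

```python
def depth_first_order(graph, start):
    """Return nodes reachable from `start` in depth-first visiting order."""
    visited = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        order.append(node)
        # Push neighbours in reverse sorted order so that the
        # alphabetically first neighbour is visited first.
        for neighbour in sorted(graph.get(node, ()), reverse=True):
            if neighbour not in visited:
                stack.append(neighbour)
    return order


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(depth_first_order(graph, "A"))  # ['A', 'B', 'D', 'C']
```

Nodes that cannot be reached from the start node are not visited; covering a disconnected graph would require restarting from each unvisited node.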

Seed-and-extend aligners

Aligners that use a two-step strategy: they first build a hash table containing the locations of each k-mer (seed) within the reference genome, and then extend these seeds in both directions to find the best alignment (or alignments) for each read.
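A simplified sketch of the seed-and-extend idea. It assumes ungapped extension scored by mismatch count, which is far cruder than what real aligners do; the function names and the `max_mismatches` parameter are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash table mapping each k-mer (seed) to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Return (reference_start, mismatches) for the best ungapped alignment, or None."""
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            # Extend the seed hit in both directions to cover the whole read.
            start = pos - offset
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            mismatches = sum(1 for a, b in zip(read, window) if a != b)
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best


reference = "GATTACAGATTTCCA"
index = build_seed_index(reference, 4)
print(align_read("TACAGCTT", reference, index, 4))  # (3, 1)
```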

Burrows–Wheeler transform

(BWT). A reversible transform that reorders the characters within a sequence so that identical characters tend to cluster together, which allows for better data compression. Many short-read aligners implement this transform in order to use less memory when aligning reads to a genome.
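To make the reordering concrete, here is a naive sketch of the transform built from sorted rotations, together with its inverse. It assumes a `$` terminator that sorts before all other characters; real aligners use much more memory-efficient constructions than this.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform: last column of the sorted rotations."""
    if terminator in text:
        raise ValueError("text must not contain the terminator")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Rebuild the original text by repeatedly prepending and re-sorting columns."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(transformed[i] + table[i] for i in range(len(transformed)))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))                 # annb$aa
print(inverse_bwt(bwt("GATTACA")))   # GATTACA
```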

Parallel computing

A computer programming model for distributing data processing across multiple processors, so that multiple tasks can be carried out simultaneously.

Trans-spliced genes

Genes whose transcripts are created by the splicing together of two precursor mRNAs to form a single mature mRNA.

De Bruijn graph

A directed mathematical graph in which each node represents a sequence of k letters (a k-mer). Two nodes are connected if shifting one sequence by one character produces an exact overlap of k-1 characters with the other.
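A minimal sketch of building such a graph from reads, with k-mers as nodes. It assumes the convention of adding an edge only between k-mers that occur next to each other in some read, so that every edge corresponds to an exact k-1 overlap; the function name is illustrative.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer node to the set of k-mers that follow it with a k-1 overlap."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            left = read[i:i + k]           # current k-mer
            right = read[i + 1:i + k + 1]  # same window shifted by one character
            graph[left].add(right)
    return graph


graph = de_bruijn_graph(["ACGTA"], 3)
print(dict(graph))  # {'ACG': {'CGT'}, 'CGT': {'GTA'}}
```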

Greedily assembling

The use of an algorithm that joins overlapping reads together by making a series of locally optimal choices. This strategy usually leads to a globally suboptimal solution.
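A toy sketch of greedy assembly: at each step the two sequences with the longest suffix-prefix overlap are merged, without reconsidering earlier merges. It assumes error-free reads from a single strand, and the `min_overlap` threshold and function names are hypothetical.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair with the largest overlap (the locally optimal choice)."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # no remaining overlaps of sufficient length
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCGGA"]))  # ['ATGGCGTACCGGA']
```

Because each merge is final, repeats in the underlying sequence can lead this kind of procedure to join reads incorrectly.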

N50 size

The contig length N such that half of all assembled bases reside in contigs of length N or longer.
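A short sketch of the calculation: sort contig lengths from longest to shortest and accumulate them until at least half of the total assembled bases is reached. The function name and example lengths are illustrative.

```python
def n50(contig_lengths):
    """Length of the contig at which the running total reaches half of all bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contigs given")


# Total is 300 bases; the two longest contigs (80 + 70) already cover 150.
print(n50([80, 70, 50, 40, 30, 20, 10]))  # 70
```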

RACE

An experimental protocol termed Rapid Amplification of cDNA Ends, which is used to determine the start and end points of gene transcription.

Cloud computing

The abstraction of underlying hardware architectures (for example, servers, storage and networking) to a shared pool of computing resources that can be readily provisioned and released.