Glossary terms in genome assembly
RNA sequencing (RNA-seq)
An experimental protocol that uses next-generation sequencing technologies to sequence the RNA molecules in a biological sample in order to determine the primary sequence and relative abundance of each RNA.
Sequencing depth
The average number of reads representing a given nucleotide in the reconstructed sequence. A 10× sequencing depth means that each nucleotide of the transcript was sequenced, on average, ten times.
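As a rough illustration of the arithmetic behind this definition, the sketch below divides the total number of sequenced bases by the transcript length. The function name `mean_depth` is hypothetical, and the sketch assumes that every read base aligns within the transcript.

```python
def mean_depth(read_lengths, transcript_length):
    """Average number of reads covering each nucleotide of a transcript.

    Assumes every base of every read maps inside the transcript.
    """
    if transcript_length <= 0:
        raise ValueError("transcript_length must be positive")
    return sum(read_lengths) / transcript_length


# 200 reads of 100 nt each over a 2,000 nt transcript give a 10x depth.
print(mean_depth([100] * 200, 2000))  # 10.0
```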
Paired-end sequencing
A library construction and sequencing strategy in which both ends of a DNA fragment are sequenced to produce pairs of reads (mate pairs).
Contigs
An abbreviation for contiguous sequences, used to denote contiguous pieces of DNA assembled from shorter overlapping sequence reads.
Low-complexity sequences
Short DNA sequences composed of stretches of homopolymer nucleotides or simple sequence repeats.
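Quality score
An integer representing the estimated probability that a given base in a nucleic acid sequence is correct. On the widely used Phred scale, a score Q corresponds to an error probability of 10^(-Q/10), so higher scores indicate more reliable base calls.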
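A minimal sketch of how such an integer relates to a probability, assuming the Phred scale (Q = -10 log10 of the error probability) and the Phred+33 ASCII encoding commonly used in FASTQ files. Both are assumptions not stated in the definition, and the function names are hypothetical.

```python
import math


def phred_to_error_prob(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Phred score (rounded to an integer) for error probability p."""
    return round(-10 * math.log10(p))


def decode_fastq_quality(quality_string, offset=33):
    """Convert a FASTQ quality string to integer scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality_string]
```

Under these assumptions, a score of 20 corresponds to one expected error in 100 base calls, and a score of 30 to one in 1,000.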
k-mer frequency
The number of times that each k-mer (that is, a short oligonucleotide of length k) appears in a set of DNA sequences.
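Counting of this kind can be sketched with a sliding window over each sequence. The function name `kmer_counts` is hypothetical, and the sketch counts k-mers on the given strand only (no reverse complements).

```python
from collections import Counter


def kmer_counts(sequences, k):
    """Count every length-k substring across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of width k along the sequence.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


counts = kmer_counts(["ACGTACG", "CGTA"], 3)
print(counts["ACG"])  # 2
```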
Splice-aware aligner
A program that is designed to align cDNA reads to a genome.
Graph traversal
A method for systematically visiting all nodes in a mathematical graph.
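One common way to visit nodes systematically is a depth-first search, sketched below with an explicit stack. The graph representation (a dictionary mapping each node to a list of its successors) is an assumption made for illustration, and the sketch visits only the nodes reachable from the chosen start node.

```python
def depth_first_order(graph, start):
    """Return nodes reachable from start in depth-first visiting order."""
    visited = []
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        visited.append(node)
        # Push neighbours in reverse so the first listed is visited first.
        for neighbour in reversed(graph.get(node, [])):
            if neighbour not in seen:
                stack.append(neighbour)
    return visited


graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(depth_first_order(graph, "A"))  # ['A', 'B', 'D', 'C']
```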
Seed-and-extend aligners
Aligners that first build a hash table containing the location of each k-mer (seed) within the reference genome, and then extend these seeds in both directions to find the best alignment (or alignments) for each read.
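The two steps described here (index the reference k-mers, then extend each seed hit) can be sketched as follows. This is a simplified illustration, not the method of any particular aligner: it allows mismatches but no gaps, scores a placement by mismatch count only, and all function names are hypothetical.

```python
from collections import defaultdict


def build_seed_index(reference, k):
    """Hash table mapping each k-mer to its start positions in the reference."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def seed_and_extend(read, reference, index, k):
    """Return (reference_start, mismatches) of the best ungapped placement.

    Returns None if no seed from the read occurs in the reference.
    """
    best = None
    for offset in range(len(read) - k + 1):
        seed = read[offset:offset + k]
        for pos in index.get(seed, []):
            # Extend in both directions by placing the whole read.
            start = pos - offset
            end = start + len(read)
            if start < 0 or end > len(reference):
                continue
            mismatches = sum(a != b for a, b in zip(read, reference[start:end]))
            if best is None or mismatches < best[1]:
                best = (start, mismatches)
    return best


reference = "TTGACCGTAAGCTTAGGCA"
index = build_seed_index(reference, 4)
print(seed_and_extend("CGTAAGGT", reference, index, 4))  # (5, 1)
```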
Burrows–Wheeler transform (BWT)
A reversible transform that reorders the characters within a sequence so that it can be compressed more effectively. Many short-read aligners implement this transform to use less memory when aligning reads to a genome.
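The reordering can be illustrated with a naive version of the transform that sorts all rotations of the sequence and keeps the last column. This is a teaching sketch only: it uses quadratic memory, whereas practical aligners rely on suffix-array-based indexes. The `$` terminator is an assumption; it must not occur in the input and must sort before every other character.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler transform via sorted rotations."""
    if terminator in text:
        raise ValueError("terminator must not occur in the text")
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed, terminator="$"):
    """Recover the original text by repeatedly prepending and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(char + row for char, row in zip(transformed, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]


print(bwt("banana"))          # annb$aa
print(inverse_bwt("annb$aa"))  # banana
```

Note how the transformed string groups identical characters into runs, which is what makes it easier to compress.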
Parallel computing
A computer programming model for distributing data processing across multiple processors, so that multiple tasks can be carried out simultaneously.
Trans-spliced genes
Genes whose transcripts are created by the splicing together of two precursor mRNAs to form a single mature mRNA.
De Bruijn graph
A directed mathematical graph that uses a sequence of letters of length k to represent nodes. Pairs of nodes are connected if shifting a sequence by one character creates an exact k–1 overlap between the two sequences.
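A minimal sketch of this construction, following the definition above: each distinct k-mer found in the reads becomes a node, and a directed edge joins two nodes when the last k–1 characters of the first equal the first k–1 characters of the second. The function name and the adjacency-set representation are hypothetical choices for illustration.

```python
from collections import defaultdict


def de_bruijn_graph(reads, k):
    """Map each k-mer node to the set of k-mers it overlaps by k-1 characters."""
    kmers = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmers.add(read[i:i + k])

    # Group nodes by their (k-1)-character prefix for fast successor lookup.
    by_prefix = defaultdict(set)
    for kmer in kmers:
        by_prefix[kmer[:-1]].add(kmer)

    graph = {kmer: set() for kmer in kmers}
    for kmer in kmers:
        # Shifting by one character leaves the (k-1)-character suffix.
        for successor in by_prefix.get(kmer[1:], ()):
            graph[kmer].add(successor)
    return graph


graph = de_bruijn_graph(["ACGTC"], 3)
print(graph["ACG"])  # {'CGT'}
```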
Greedy assembly
The use of an algorithm that joins overlapping reads together by making a series of locally optimal choices. This strategy usually leads to a globally suboptimal solution.
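The strategy can be sketched as repeatedly merging the pair of sequences with the longest exact suffix-prefix overlap. This is a toy illustration under simplifying assumptions (exact overlaps only, no reverse complements, no handling of reads contained within other reads); the function names and the `min_overlap` parameter are hypothetical.

```python
def overlap_length(left, right, min_overlap=1):
    """Length of the longest suffix of left that equals a prefix of right."""
    for size in range(min(len(left), len(right)), min_overlap - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Merge the best-overlapping pair until no overlap of min_overlap remains."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while True:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                size = overlap_length(left, right, min_overlap)
                if size > best_size:
                    best_size, best_i, best_j = size, i, j
        if best_size == 0:
            return contigs
        # Locally optimal choice: commit to the largest overlap found.
        merged = contigs[best_i] + contigs[best_j][best_size:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


print(greedy_assemble(["ACGTAC", "TACGGA", "GGATTC"]))  # ['ACGTACGGATTC']
```

Because each merge is committed as soon as it is chosen, an early merge across a repeat cannot be undone, which is one way such a procedure ends up with a globally suboptimal result.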
N50
The contig length such that half of all assembled bases reside in contigs of this length or longer.
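This statistic, commonly called N50, can be computed by sorting contigs from longest to shortest and accumulating their lengths until half of the assembled bases are covered. The function name below is a hypothetical choice for illustration.

```python
def n50(contig_lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must not be empty")


# Total is 290 bases; the two longest contigs (80 + 70 = 150) pass the halfway mark.
print(n50([80, 70, 50, 40, 30, 20]))  # 70
```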
RACE
Rapid amplification of cDNA ends: an experimental protocol used to determine the start and end points of gene transcription.
Cloud computing
The abstraction of underlying hardware architectures (for example, servers, storage and networking) into a shared pool of computing resources that can be readily provisioned and released.