# String Graph of a Genome

Regular readers are familiar with our explanation of de Bruijn graphs through the following steps -

ii) We explained that, in the error-free world, de Bruijn graph constructed from short reads sampled from a genome looks identical to the de Bruijn graph of a genome. So, the genome assembly algorithm essentially consists of two steps – constructing de Bruijn graph from short reads (easy, but RAM consuming) and finding the genome sequence from the de Bruijn graph of a genome sequence (difficult, especially for repeat regions).

We will present String Graph assemblers using similar steps. In this post, we will define string graph for a genome. Please study the following figure carefully.

Top part of the above image shows a genome sequence. Yellow regions marked with 1-5 are unique segments, whereas red and purple regions are repetitive segments each appearing twice in the genome. The string graph for the genome is shown in the bottom figure. The graph has seven nodes consisting of five unique regions and two repetitive regions. Multiple appearances of the same repeat all collapse into the same node.

The connectivity between the nodes (edges) follows the same order as the genome sequence. As an example, red repeat region comes after yellow segment 1 in the genome. Therefore, 3′ end of node ‘yellow segment 1′ is connected to 5′ end of red node. It is straightforward to derive the remaining edges based on the genome.

If you remember how de Bruijn graph of a genome looked like, you will notice the following differences.

i) The nodes of de Bruijn graph of a genome consisted of short k-mers of fixed size, whereas nodes of string graph are longer sequences with variable sizes. Therefore, the string graph needs far fewer nodes (and RAM) than de Bruijn graphs.

ii) Typically, the k-mers of de Bruijn graph were shorter than reads. Therefore, de Bruijn graphs were constructed by spitting reads into short subunits. In contrast, nodes of string graph are likely to be longer than NGS read sizes, and thus string graphs are constructed by joining overlapping reads.