Newbler output V: the 454ContigScaffolds.txt and 454ScaffoldContigs.fna


In the post on what is new in newbler version 2.6, I introduced the -scaffold option. Briefly, with this option, instances (i.e. the consensus sequence) of repeats are placed in gaps. As I mentioned, setting -scaffold results in two extra files. In this post, I will explain these in detail.

454ContigScaffolds.txt and its relation to the 454Scaffolds.txt file
Both these files are in the AGP format, see my earlier post on the 454Scaffolds.txt file. The examples for this post are based on a bacterial genome data set (shotgun and paired end 454 reads), assembled using the -scaffold flag (and newbler 2.6).
The 454Scaffolds.txt file looks different from the one produced by an assembly without the -scaffold flag:

scaffold00001   1       4543    1       W       sctg_0001_0001  1       4543    +
scaffold00001   4544    5465    2       N       922     fragment        yes
scaffold00001   5466    6758    3       W       sctg_0001_0002  1       1293    +
scaffold00001   6759    6868    4       N       110     fragment        yes
scaffold00001   6869    75179   5       W       sctg_0001_0003  1       68311   +
scaffold00001   75180   75497   6       N       318     fragment        yes
scaffold00001   75498   91133   7       W       sctg_0001_0004  1       15636   +
scaffold00001   91134   91476   8       N       343     fragment        yes
scaffold00001   91477   151573  9       W       sctg_0001_0005  1       60097   +
scaffold00001   151574  154675  10      N       3102    fragment        yes
scaffold00001   154676  220143  11      W       sctg_0001_0006  1       65468   +
scaffold00001   220144  220163  12      N       20      fragment        yes
scaffold00001   220164  221487  13      W       sctg_0001_0007  1       1324    +
scaffold00001   221488  222941  14      N       1454    fragment        yes

Instead of ‘contigXXXXX’ in the 6th column, there are names of the form sctg_XXXX_YYYY. ‘sctg’ stands for ‘ScaffoldContig’, see below. ‘sctg_0001’ stands for scaffold 1, while the following ‘_0001’ stands for the first contig in this scaffold. So, the 20th contig in scaffold 13 would be sctg_0013_0020. The 454Scaffolds.txt file has one line per ScaffoldContig, each followed by one line for a gap.
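To make the naming scheme concrete, here is a minimal Python sketch that builds and splits such names. The function names format_sctg and parse_sctg are my own (hypothetical), and the four-digit zero padding is an assumption based only on the names seen in the example above.

```python
import re

def format_sctg(scaffold, contig):
    """Build a ScaffoldContig name from scaffold and contig numbers.

    Assumes four-digit zero padding, as in the example rows.
    """
    return f"sctg_{scaffold:04d}_{contig:04d}"

def parse_sctg(name):
    """Split a ScaffoldContig name into (scaffold number, contig number)."""
    match = re.fullmatch(r"sctg_(\d+)_(\d+)", name)
    if match is None:
        raise ValueError(f"not a ScaffoldContig name: {name!r}")
    return int(match.group(1)), int(match.group(2))

print(format_sctg(13, 20))           # the 20th contig in scaffold 13
print(parse_sctg("sctg_0001_0007"))  # scaffold 1, 7th contig
```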

In the new 454ContigScaffolds.txt file, the corresponding region of scaffold 1 looks like this:

scaffold00001   1       4543    1       W       contig00001     1       4543    +
scaffold00001   4544    5465    2       N       922     fragment        yes
scaffold00001   5466    6758    3       W       contig00002     1       1293    +
scaffold00001   6759    6868    4       N       110     fragment        yes
scaffold00001   6869    75179   5       W       contig00003     1       68311   +
scaffold00001   75180   75497   6       N       318     fragment        yes
scaffold00001   75498   91133   7       W       contig00004     1       15636   +
scaffold00001   91134   91476   8       N       343     fragment        yes
scaffold00001   91477   117498  9       W       contig00005     1       26022   +
scaffold00001   117499  117527  10      W       contig00006     1       29      +
scaffold00001   117528  117914  11      W       contig00007     1       387     +
scaffold00001   117915  117970  12      W       contig00008     1       56      +
scaffold00001   117971  118037  13      W       contig00009     1       67      +
scaffold00001   118038  149720  14      W       contig00010     1       31683   +
scaffold00001   149721  151573  15      W       contig00011     1       1853    +
scaffold00001   151574  154675  16      N       3102    fragment        yes
scaffold00001   154676  158800  17      W       contig00012     1       4125    +
scaffold00001   158801  158926  18      W       contig00013     1       126     +
scaffold00001   158927  158951  19      W       contig00014     1       25      +
scaffold00001   158952  159192  20      W       contig00015     1       241     +
scaffold00001   159193  159225  21      W       contig00016     1       33      +
scaffold00001   159226  159843  22      W       contig00017     1       618     +
scaffold00001   159844  159969  23      W       contig00013     1       126     +
scaffold00001   159970  159994  24      W       contig00014     1       25      +
scaffold00001   159995  160235  25      W       contig00015     1       241     +
scaffold00001   160236  160268  26      W       contig00016     1       33      +
scaffold00001   160269  206731  27      W       contig00018     1       46463   +
scaffold00001   206732  207126  28      W       contig00019     1       395     +
scaffold00001   207127  207156  29      W       contig00020     1       30      +
scaffold00001   207157  220143  30      W       contig00021     1       12987   +
scaffold00001   220144  220163  31      N       20      fragment        yes
scaffold00001   220164  221487  32      W       contig00022     1       1324    +
scaffold00001   221488  222941  33      N       1454    fragment        yes

Note how there are many contigs between gaps! A careful comparison tells us that:
sctg_0001_0001 is contig00001 (4543 bp)
sctg_0001_0002 is contig00002 (1293 bp)
sctg_0001_0003 is contig00003 (68311 bp)
sctg_0001_0004 is contig00004 (15636 bp)
sctg_0001_0005 consists of contig00005 – contig00011 (these contigs are 60097 bp all together)
sctg_0001_0006 consists of contig00012 – contig00021 (these contigs are 65468 bp all together)
sctg_0001_0007 is contig00022 (1324 bp)

This shows how the -scaffold option works: repeat contigs are placed in gaps, and so-called ‘ScaffoldContigs’ are formed by concatenating the contigs that are now next to each other without gaps in between. The 454ContigScaffolds.txt file shows which contigs are placed where, while the 454Scaffolds.txt file shows the scaffolds as they are built up out of ScaffoldContigs.
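The comparison between the two files can also be done with a few lines of Python. What follows is only a sketch under some assumptions: the rows of one scaffold are consecutive and in order, columns are whitespace-separated as in the excerpts, and the ScaffoldContig numbering is reconstructed simply by counting runs of contig (W) rows between gap (N) rows. The function name scaffold_contigs is hypothetical, and this is not necessarily how newbler itself builds the names.

```python
from itertools import groupby

# First rows of scaffold 1 from the 454ContigScaffolds.txt excerpt
EXAMPLE = """\
scaffold00001   1       4543    1       W       contig00001     1       4543    +
scaffold00001   4544    5465    2       N       922     fragment        yes
scaffold00001   5466    6758    3       W       contig00002     1       1293    +
scaffold00001   6759    6868    4       N       110     fragment        yes
scaffold00001   6869    75179   5       W       contig00003     1       68311   +
scaffold00001   75180   75497   6       N       318     fragment        yes
scaffold00001   75498   91133   7       W       contig00004     1       15636   +
scaffold00001   91134   91476   8       N       343     fragment        yes
scaffold00001   91477   117498  9       W       contig00005     1       26022   +
scaffold00001   117499  117527  10      W       contig00006     1       29      +
scaffold00001   117528  117914  11      W       contig00007     1       387     +
scaffold00001   117915  117970  12      W       contig00008     1       56      +
scaffold00001   117971  118037  13      W       contig00009     1       67      +
scaffold00001   118038  149720  14      W       contig00010     1       31683   +
scaffold00001   149721  151573  15      W       contig00011     1       1853    +
scaffold00001   151574  154675  16      N       3102    fragment        yes
"""

def scaffold_contigs(agp_lines):
    """Group consecutive contig (W) rows between gap (N) rows.

    Returns a list of (name, [contig ids], total length) tuples.
    """
    rows = [line.split() for line in agp_lines if line.strip()]
    result = []
    # Handle each scaffold (column 1) separately
    for scaffold, scaf_rows in groupby(rows, key=lambda r: r[0]):
        scaf_no = int(scaffold.replace("scaffold", ""))
        index = 0
        # Column 5 is the component type: 'W' = contig, 'N' = gap
        for comp_type, run in groupby(scaf_rows, key=lambda r: r[4]):
            if comp_type != "W":
                continue
            run = list(run)
            index += 1
            name = f"sctg_{scaf_no:04d}_{index:04d}"
            contigs = [r[5] for r in run]
            # Columns 7 and 8 are the start and end within the contig
            length = sum(int(r[7]) - int(r[6]) + 1 for r in run)
            result.append((name, contigs, length))
    return result

for name, contigs, length in scaffold_contigs(EXAMPLE.splitlines()):
    print(name, length, contigs[0], "-", contigs[-1])
```

For the fifth group this adds up contig00005 through contig00011 to 60097 bp, the same length that 454Scaffolds.txt reports for sctg_0001_0005.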

If we now add the per-contig depth (from the 454ContigGraph.txt file) to the contigs that make up the ScaffoldContigs, we get:

For sctg_0001_0005:
contig    length    depth
contig00005     26022   39.3
contig00006     29      267.6
contig00007     387     352.3
contig00008     56      272.8
contig00009     67      203.2
contig00010     31683   41.0
contig00011     1853    26.0

So, we have a long 26 kb contig of ‘normal’ depth (around 40x), followed by four short contigs of quite high depth (203-352x), and then another long contig of almost 32 kb, again of ‘normal’ depth. This looks like four repeat contigs in between long single-copy contigs. Finally, there is a 1.9 kb contig of somewhat lower depth, which I cannot really explain…

For sctg_0001_0006:
contig    length    depth
contig00012     4125    34.1
contig00013     126     75.0
contig00014     25      203.8
contig00015     241     119.3
contig00016     33      79.2
contig00017     618     42.3
contig00013     126     75.0
contig00014     25      203.8
contig00015     241     119.3
contig00016     33      79.2
contig00018     46463   38.8
contig00019     395     180.6
contig00020     30      116.3
contig00021     12987   37.4

Here, there are four longer contigs, of 4 kb, 0.6 kb, 46.5 kb and 13 kb, of ‘normal’ depth (34-42x), with shorter contigs in between, most of them with high depth (75-204x). Unsurprisingly, a quick blast identified contigs 13 and 15 as being part of putative transposases, proteins known to be present in multiple copies in bacterial genomes…
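The same eyeballing of depths can be sketched in Python. The numbers are the ones from the table for sctg_0001_0006 (each contig listed once). The cut-offs are purely my own assumptions for illustration: contigs of at least 500 bp define the ‘normal’ depth (their median), and anything at 1.5 times that depth or more is flagged as a possible repeat. The function name flag_repeats is hypothetical as well.

```python
from statistics import median

# (contig, length in bp, depth) for sctg_0001_0006, each contig once
DEPTHS = [
    ("contig00012", 4125, 34.1),
    ("contig00013", 126, 75.0),
    ("contig00014", 25, 203.8),
    ("contig00015", 241, 119.3),
    ("contig00016", 33, 79.2),
    ("contig00017", 618, 42.3),
    ("contig00018", 46463, 38.8),
    ("contig00019", 395, 180.6),
    ("contig00020", 30, 116.3),
    ("contig00021", 12987, 37.4),
]

def flag_repeats(rows, min_len=500, factor=1.5):
    """Flag contigs whose depth is well above the 'normal' depth.

    min_len and factor are arbitrary illustrative thresholds,
    not values used by newbler.
    """
    # 'Normal' depth: median depth of the longer contigs
    baseline = median(depth for _, length, depth in rows if length >= min_len)
    flagged = {}
    for name, length, depth in rows:
        if depth >= factor * baseline:
            # Depth relative to baseline, a rough hint at copy number
            flagged[name] = round(depth / baseline, 1)
    return baseline, flagged

baseline, flagged = flag_repeats(DEPTHS)
print(f"baseline depth: {baseline:.1f}x")
for name, ratio in flagged.items():
    print(name, f"{ratio}x baseline")
```

With these thresholds, all six short contigs are flagged and the four longer ones are not. The ratio is only a rough indication; short contigs have noisy depth estimates.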

454ScaffoldContigs.fna and .qual files
These simply list the sequences and quality scores of the ScaffoldContigs as listed in the 454Scaffolds.txt file.

In conclusion, 454 has tried to offer more complete scaffolds by placing repeats in gaps where possible.
