Newbler output VI: the ‘status’ files

The files that are the topic of this post are all tables, i.e. tab separated text files. The ‘status’ files describe what happened with all the reads and the paired end halves, while the AlignmentInfo file summarizes the contig alignments.

Because these files are tabular, they are easy to parse with Perl, Python or, my favorite, awk.

1) 454TrimStatus.txt

Accno   Trimpoints Used Used Trimmed Length     Orig Trimpoints Orig Trimmed Length     Raw Length
ERGMJHS01CYVHW  5-78    74      5-98    94      100
ERGMJHS01D6IHL  5-116   112     5-116   112     161
ERGMJHS01DYTX5  5-127   123     5-127   123     173
ERGMJHS01DYDH0  5-78    74      5-78    74      124
ERGMJHS01ECEGM  5-256   252     5-256   252     271
ERGMJHS01CRQ8D  5-272   268     5-272   268     273
ERGMJHS01ECMVT  5-260   256     5-260   256     270
ERGMJHS01EZ7VU  5-41    37      5-61    57      62
ERGMJHS01ERDXB  5-207   203     5-207   203     252

This file describes what (trimmed) part of the read was considered for alignment. The columns describe:

  • Accno: the unique read ID, where the first 7 characters describe the unique run ID, followed by the lane number, followed by the encoded x and y coordinates of the read on the PicoTiterPlate.
  • Trimpoints Used: the start and end position of the part of the read newbler used. Most of the time, the start will be position 5, as the first four bases of every read comprise the key sequence that identifies the read as a sample read (as opposed to control reads, which have different key sequences). Also, in contrast to traditional Sanger reads, read quality is usually high from the very first bases read after the sequencing primer. When MIDs (454’s Multiplex IDentifiers) or other tags/barcodes have been used during library generation, and the reads were split according to the tag (which removes the tag from the read), the starting position will be correspondingly higher.
  • Used Trimmed Length: the length of the part of the read newbler used
  • Orig Trimpoints: the start and end positions of the trimmed read as it was given to newbler. These positions are the result of the trimming steps of the signal processing software (thanks to Steven Sullivan for pointing out that the original trimpoints come from the signal processing steps, not from image processing)
  • Orig Trimmed Length: the corresponding original trimmed length
  • Raw Length: the length of the read as it was before image processing

Comparing the Used Trimmed Length with the Orig Trimmed Length shows that for some reads, newbler trims even further than the signal processing software. Also, the usable part of a read can get shorter when the ‘trimming database’ option (-vt) was used during assembly, for example to remove vector/adaptor/primer sequences.
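The comparison of the two length columns could be sketched in Python as below. This is only an illustration: it assumes the file is tab-separated with the columns in the order shown in the header above, the function name `extra_trimmed` is made up for this example, and the sample rows are copied from the table in this post (a real run would open 454TrimStatus.txt instead).

```python
import io

# Sample rows copied from the table above; tab-separated layout is assumed.
SAMPLE = (
    "Accno\tTrimpoints Used\tUsed Trimmed Length\t"
    "Orig Trimpoints\tOrig Trimmed Length\tRaw Length\n"
    "ERGMJHS01CYVHW\t5-78\t74\t5-98\t94\t100\n"
    "ERGMJHS01D6IHL\t5-116\t112\t5-116\t112\t161\n"
    "ERGMJHS01EZ7VU\t5-41\t37\t5-61\t57\t62\n"
)


def extra_trimmed(handle):
    """Yield (accno, bases_removed) for reads newbler trimmed further."""
    next(handle)  # skip the header line
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # ignore incomplete lines
        used_len, orig_len = int(fields[2]), int(fields[4])
        if used_len < orig_len:
            yield fields[0], orig_len - used_len


result = list(extra_trimmed(io.StringIO(SAMPLE)))
print(result)
```

For a real file, the same function can be called with `open("454TrimStatus.txt")` in place of the `io.StringIO` sample.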

Another section of the same file:

FQL5QBG02GX6EQ_left     5-171   167     5-296   292     299
FQL5QBG02GX6EQ_right    217-296 80      5-296   292     299
FQL5QBG02GUPVF  255-255 1       5-255   251     265
FQL5QBG02IFXSU_left     5-173   169     5-305   301     308
FQL5QBG02IFXSU_right    219-305 87      5-305   301     308
FQL5QBG02GXQUO  29-268  240     5-268   264     268
FQL5QBG02JS960  5-270   266     5-275   271     304
FQL5QBG02H0VJ7_left     5-145   141     5-238   234     259
FQL5QBG02H0VJ7_right    190-238 49      5-238   234     259
FQL5QBG02HASXU  62-304  243     5-304   300     313

Here, some of the reads have ‘_left’ or ‘_right’ added at the end of the read ID (Accno). This indicates that the read was a paired end read (the linker sequence was detected in the read), and for this file, these reads get split into their constituent left and right halves. Note that, for example, for read FQL5QBG02GX6EQ, the position of the linker sequence can be determined from the trimpoints: from position 172 (following the last position of the left part) to 216 (just before the starting position of the right part). Also note that some reads of the same run are not paired end reads. These reads either lack the linker altogether (a result of the paired end library generation procedure), or have too few bases (fewer than 20) on one side of the linker to give two mappable read halves. These reads are used as normal shotgun reads.
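The linker arithmetic described above can be written out as a small Python sketch. The helper names `parse_span` and `linker_span` are hypothetical, and the trimpoints are taken from the FQL5QBG02GX6EQ rows shown earlier.

```python
def parse_span(trimpoints):
    """Turn a trimpoints string such as '5-171' into (start, end) integers."""
    start, end = trimpoints.split("-")
    return int(start), int(end)


def linker_span(left_trimpoints, right_trimpoints):
    """Return (first, last) position of the region between the two halves."""
    _, left_end = parse_span(left_trimpoints)
    right_start, _ = parse_span(right_trimpoints)
    return left_end + 1, right_start - 1


# Trimpoints of FQL5QBG02GX6EQ_left and FQL5QBG02GX6EQ_right
span = linker_span("5-171", "217-296")
print(span)  # (172, 216)
```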

2) 454ReadStatus.txt

Accno   Read Status     5' Contig       5' Position     5' Strand       3' Contig       3' Position     3' Strand
ERGMJHS01CYVHW  Assembled       contig00011     610     +       contig00011     685     -
ERGMJHS01CJOXV  PartiallyAssembled      contig00115     8069    -       contig00115     7943    +
ERGMJHS01DYDH0  Singleton
ERGMJHS01EZ7VU  Repeat
ERGMJHS01A8MP3  Outlier
FQL5QBG02GDUSS_left     Assembled       contig00106     3130    +       contig00106     3242    -
FQL5QBG02GDUSS_right    Assembled       contig00106     5787    -       contig00106     5759    +

This file describes where reads ended up after assembly was complete. For paired end reads, the ‘fate’ of each half is reported on a separate line. Columns are:

  • Accno: the unique read ID
  • Read Status: this can be
    - Assembled: the read was placed in one or more contigs
    - PartiallyAssembled: only part of the read was used for making contigs
    - Singleton: there was no (significant) overlap between this read and all the others
    - Repeat: the read was most likely derived from a repeated part of the genome. More technically: more than 70% of a read’s seeds (see this post) hit to more than 70 other reads.
    - Outlier: a problematic read, e.g. a chimeric read
    - TooShort: the trimmed portion of the read was below the length threshold. This minimum can be set with the -minlen flag during assembly. When it is not set, and no paired end reads are included, it is 50 bases; for an assembly with paired ends, it is 20 bases (if I’m not mistaken).
  • 5′ Contig, 5′ Position, 5′ Strand: the contig and the position in it where the 5’ end of the read’s alignment begins, and the orientation of the read relative to the contig (‘+’ or ‘-’ for forward and reverse strand, respectively)
  • 3′ Contig, 3′ Position, 3′ Strand: similar for the 3’ end of the read’s alignment

Note that only the start and end of each read’s alignment are shown. Due to the way newbler builds contigs, the middle of a read could be aligned within one or even several other contigs. It follows that this file cannot be used to determine all the reads that were used to build a contig, or all the contigs that a read is a part of.
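A quick way to get an overview of this file is to tally the Read Status column. The Python sketch below assumes tab-separated columns in the order of the header above; the function name `status_counts` is made up for this example and the sample rows are copied from the table in this post.

```python
import io
from collections import Counter

# Sample rows copied from the table above; tab-separated layout is assumed.
SAMPLE = (
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand\t"
    "3' Contig\t3' Position\t3' Strand\n"
    "ERGMJHS01CYVHW\tAssembled\tcontig00011\t610\t+\tcontig00011\t685\t-\n"
    "ERGMJHS01CJOXV\tPartiallyAssembled\tcontig00115\t8069\t-\tcontig00115\t7943\t+\n"
    "ERGMJHS01DYDH0\tSingleton\n"
    "ERGMJHS01EZ7VU\tRepeat\n"
    "ERGMJHS01A8MP3\tOutlier\n"
    "FQL5QBG02GDUSS_left\tAssembled\tcontig00106\t3130\t+\tcontig00106\t3242\t-\n"
    "FQL5QBG02GDUSS_right\tAssembled\tcontig00106\t5787\t-\tcontig00106\t5759\t+\n"
)


def status_counts(handle):
    """Count how many reads (or paired end halves) have each Read Status."""
    next(handle)  # skip the header line
    counts = Counter()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            counts[fields[1]] += 1
    return counts


counts = status_counts(io.StringIO(SAMPLE))
for status, n in counts.most_common():
    print(status, n)
```

Note that paired end halves are counted separately here, since each half has its own line in the file.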

3) 454PairStatus.txt

Template        Status  Distance        Left Contig     Left Pos        Left Dir        Right Contig    Right Pos       Right Dir       Left Distance   RightDistance
FQL5QBG02GDUSS  SameContig      2657    contig00106     3130    +       contig00106     5787    -
FQL5QBG02GRUHY  Link    1366    contig00208     267     -       contig00207     3298    +       267     1099
FQL5QBG02HRDSS  OneUnmapped     -       Unmapped                        contig00017     10630   -
FQL5QBG02FS0NM  BothUnmapped    -       Unmapped                        Unmapped
FQL5QBG02IIB8R  MultiplyMapped  -       Repeat                  contig00173     207     -
FQL5QBG02IJDOE  FalsePair       -       contig00015     72252   +       contig01166     7528    -

This file describes, for each paired end read, how it ended up in the assembly. Columns are:

  • Template: the read ID
  • Status: this can be:
    - SameContig: both halves of the paired end read mapped to (or, for long enough halves, were assembled into) the same contig with a consistent orientation (i.e. the halves ‘point towards each other’ as paired end halves should). These reads have been used to determine the library insert size.
    - Link: the reads mapped to different contigs, close enough to the ends of these contigs so that they could be used to link the contigs together into a scaffold.
    - OneUnmapped: only one of the halves was mapped, the other not
    - BothUnmapped: neither the left nor the right half was mapped
    - MultiplyMapped: one or both of the halves mapped to multiple contigs (repeated reads)
    - FalsePair: both halves were mapped, but either they mapped to the same contig with incorrect orientation, or the distance between the halves was outside the accepted range for the library.
  • So, of all these status categories, only the ones marked as ‘Link’ were actually used for scaffolding…
  • Distance:
    - for reads that map to the same contig: the distance between the halves
    - for reads that Link contigs into scaffolds: the sum of the distances from the position of each half to the end of the contig. So, the total distance between the halves for these pairs would be the distance mentioned in the 454PairStatus.txt file, plus the gap distance between the contigs. This distance then should be consistent with the paired end library insert size.
  • Left Contig, Left Pos, Left Dir: the contig ID, position (of the 5’ end) and orientation (‘+’ or ‘-’ for forward and reverse strand, respectively) of the mapped left half. Left Contig can also be marked as ‘Unmapped’ or ‘Repeat’
  • Right Contig, Right Pos, Right Dir: similar for the right half. Note that ‘position’ here refers to the position of the 3’ end of the right half.
  • Left Distance: for reads that ‘Link’ contigs only: the distance from the 5’ end of the left half, to the end of the contig
  • Right Distance: for reads that ‘Link’ contigs only: the distance from the 3’ end of the right half, to the end of the contig

From this, it follows logically that for reads marked as ‘Link’, the sum of the Left Distance and Right Distance columns is the same as the number listed in the Distance column (column 3). In the example above, 267 + 1099 = 1366.
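This relationship is easy to check programmatically. The Python sketch below assumes tab-separated columns in the order of the header above, and only looks at rows whose Status is ‘Link’; the function name `link_pairs` is made up for this example and the sample rows are copied from the table in this post.

```python
import io

# Sample rows copied from the table above; tab-separated layout is assumed.
SAMPLE = (
    "Template\tStatus\tDistance\tLeft Contig\tLeft Pos\tLeft Dir\t"
    "Right Contig\tRight Pos\tRight Dir\tLeft Distance\tRight Distance\n"
    "FQL5QBG02GDUSS\tSameContig\t2657\tcontig00106\t3130\t+\tcontig00106\t5787\t-\n"
    "FQL5QBG02GRUHY\tLink\t1366\tcontig00208\t267\t-\tcontig00207\t3298\t+\t267\t1099\n"
)


def link_pairs(handle):
    """Yield (template, left_contig, right_contig, distance, consistent)."""
    next(handle)  # skip the header line
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11 or fields[1] != "Link":
            continue  # only 'Link' rows carry the two extra distance columns
        distance = int(fields[2])
        left_dist, right_dist = int(fields[9]), int(fields[10])
        consistent = distance == left_dist + right_dist
        yield fields[0], fields[3], fields[6], distance, consistent


links = list(link_pairs(io.StringIO(SAMPLE)))
print(links)
```

The same loop could be extended to collect all pairs linking a given pair of contigs, which is the information used for scaffolding.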

For pair halves marked as ‘Repeat’, the mapping information is not reported in this file. It is possible to obtain the mapping results by adding the -pair or -pairt flags during assembly. This will result in the 454TagPairAlign.txt file, which describes all alignments of pair halves shorter than 50 bases (these are not assembled, but mapped to contigs afterwards, see my first post). The file can either report all alignments (-pair), or a tabulated summary (-pairt).

4) 454AlignmentInfo.tsv

Position        Consensus       Quality Score   Unique Depth    Align Depth     Signal  StdDeviation
>contig00001    1
1       C       64      24      29      0.99    0.08
2       T       64      24      29      0.94    0.10
3       C       64      24      29      0.91    0.07
4       A       64      24      29      1.93    0.10
5       A       64      24      29      1.93    0.10
6       T       64      24      29      1.03    0.08
7       A       64      23      28      0.95    0.09
8       T       64      23      28      1.93    0.08
9       T       64      23      28      1.93    0.08
10      A       64      22      27      0.99    0.08

This file gives a consensus alignment overview for each position in each contig. Normally, this file is only present in the output when the project contains fewer than 4 million reads and less than 40 Mb of total assembled contig length. For larger assemblies, adding -info to the command line will output this file.

The information for each contig starts with a line giving the contig ID, e.g. >contig00001. The number that follows is always ‘1’ for assemblies (but can be different for mapping projects, perhaps the subject of a future post…)
Columns are:

  • Position: position in the contig
  • Consensus: consensus contig nucleotide (base) at this position
  • Quality Score: consensus contig quality score at this position
  • Unique Depth: the number of reads that align to (cover) the position, restricted to unique reads only (a significant proportion of 454 reads are duplicates as a result of two beads being present in the same microreactor during emulsion PCR).
  • Align Depth: the number of all reads that align to the position (including duplicates)
  • Signal, StdDeviation: the average flow signal and the corresponding standard deviation for the flows at that position. Note that for stretches of identical bases, these numbers are identical (as 454 sequencing basically reads homopolymer lengths), e.g. see positions 4 and 5.
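As an illustration of how this file could be summarized, the Python sketch below computes the mean Unique Depth and Align Depth per contig. It assumes tab-separated columns in the order of the header above and contig lines starting with ‘>’; the function name `mean_depths` is made up for this example and the sample rows are copied from the table in this post.

```python
import io

# Sample rows copied from the table above; tab-separated layout is assumed.
SAMPLE = (
    "Position\tConsensus\tQuality Score\tUnique Depth\tAlign Depth\t"
    "Signal\tStdDeviation\n"
    ">contig00001\t1\n"
    "1\tC\t64\t24\t29\t0.99\t0.08\n"
    "2\tT\t64\t24\t29\t0.94\t0.10\n"
    "3\tC\t64\t24\t29\t0.91\t0.07\n"
    "4\tA\t64\t24\t29\t1.93\t0.10\n"
    "5\tA\t64\t24\t29\t1.93\t0.10\n"
    "6\tT\t64\t24\t29\t1.03\t0.08\n"
    "7\tA\t64\t23\t28\t0.95\t0.09\n"
    "8\tT\t64\t23\t28\t1.93\t0.08\n"
    "9\tT\t64\t23\t28\t1.93\t0.08\n"
    "10\tA\t64\t22\t27\t0.99\t0.08\n"
)


def mean_depths(handle):
    """Return {contig: (mean unique depth, mean align depth)}."""
    totals = {}  # contig -> [number of positions, unique sum, align sum]
    contig = None
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if fields[0].startswith(">"):
            contig = fields[0][1:]  # drop the leading '>'
            totals[contig] = [0, 0, 0]
        elif contig is not None and len(fields) >= 5 and fields[0].isdigit():
            entry = totals[contig]
            entry[0] += 1
            entry[1] += int(fields[3])
            entry[2] += int(fields[4])
    return {c: (u / n, a / n) for c, (n, u, a) in totals.items() if n}


depths = mean_depths(io.StringIO(SAMPLE))
print(depths)  # {'contig00001': (23.5, 28.5)}
```

The header line is skipped automatically, because no contig has been seen yet when it is read.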

In closing, with these last four posts, I have described the most important output files, and the ones that are usually present by default. With a little programming skill, one should be able to distill all the necessary information from a newbler assembly using these files.

 
