This post is about how to start up newbler for de novo assembly projects. I will describe setting up newbler using the command line. Most of the options I will mention are also available through the GUI version, but I will not describe how to use them here.
For a description of the progress that newbler reports during assembly, please check this post. For a description of the different output files, these are described in a series of previous posts.
1) default newbler on one or more files
This is the most simple way of running newbler: just provide it with one sff file. It will generate a folder called along the lines of P_yyyy_mm_dd_hh_min_sec_runAssembly and put all output in there. If you want to have control over the name of this folder, use
runAssembly -o project1 /data/sff/EYV886410.sff
-o describes the name of the folder newbler will provide all output in, in this case ‘project1’
Using more files is most efficiently done by having them all in the same folder and writing
runAssembly -o project1 /data/sff/*.sff
It is also possible to provide several files in different folders:
runAssembly -o project1 /data/sff/EYV886410.sff /data/reads/FQL5QBG02.sff
2) Keeping track of different types (‘libraries’) of reads
When you have reads coming from different libraries, newbler can keep track of which library the sff files belong to. This is most useful in the situation where you have a mix of shotgun reads and paired-end reads, with more than one file per paired end library, and/or paired end libraries with different insert sizes.
During assembly, newbler will calculate the average pair distance for paired-end reads that map to the same contig, and use this average to estimate gap sizes between contigs, see the post on how newbler works.
If you provide all files at once, newbler will consider each input sff file containing paired-end reads separately (i.e. as coming from a separate library), and calculate the average distance for reads from each file. Using the setup described below will result in newbler using files from the same paired-end librarytogether, and calculating the average distance for the reads on a per-library basis. The average distances are also reported in the 454NewblerMetrics.txt file.
For this to work, we need to set up the project with several commands
addRun -lib shotgun project2 /data/sff/EY*
addRun -lib 3kb_PE project2 /data/sff/EZ5TR6A01.sff
addRun -lib 3kb_PE project2 /data/sff/FQL5QBG02.sff
addRun -lib 8kb_PE project2 /data/sff/ GD4EG4306_8kb.sff
The newAssembly command sets up the project, creates a new folder called ‘project2’
addRun is used to add runs, with the -lib option to give a name to the library. Here, we added a bunch of shotgun containing sff files, two files from the same 3kb paired-end library (called ‘3kb_PE’), and one file from an 8kb library (called ‘8kb_PE’). Note the order of the arguments here: addRun – options – projectdirname – sff files.
Note that for shotgun reads, there is not really a need to indicate the library, newbler does not use this information. It might help you to keep track of where files come from, though…
runProject starts the actual assembly
In your terminal, this will look something like this:
bash-3.2$ newAssembly project2
Created assembly project directory project2
bash-3.2$ addRun -lib shotgun project2 /data/sff/EY*
10 read files successfully added.
bash-3.2$ addRun -lib 3kb_PE project2 /data/sff/EZ5TR6A01.sff
1 read file successfully added.
bash-3.2$ addRun -lib 3kb_PE project2 /data/sff/FQL5QBG02.sff
1 read file successfully added.
bash-3.2$ addRun -lib 8kb_PE project2 /data/sff/GD4EG4306_8kb.sff
1 read file successfully added.
bash-3.2$ runProject project2
<lots of output>
Now, in the 454NewblerMetrics.txt file, you should get library insert size average for the two PE libraries.
NOTE: newbler will check each sff file for the presence of reads with the PE linker sequence. If at 25% of the first 500 reads have this linker, the reads will be treated as coming from a paired-end library. If your file has fewer paired end reads, you can force newbler to treat it as a paired-end file by adding the –p parameter:
addRun -p -lib 3kb_PE project2 /data/cyano_sff/FQL5QBG02.sff
At the end of assemblies set up using newAssembly, addRun and runProject, the folder structure is different than for the first example.
In the output folder (‘project2’) there is
- an (empty) file called 454Project.xml
- the folder called sff with the symlinks to the sff files used
- a folder called assembly with the usual output, and in addition the 454AssemblyProject.xml (this xml file contains description of all input files and parameters). This folder also contains a number of hidden files (some quite large) that newbler used to store information regarding the assembly.
It is, in fact, possible to add reads to a finished assembly, by invoking addRun another time, and then running the project again (runProject), thereby only aligning the added reads, so-called incremental assembly. However, I would recommend always starting from the beginning. If you, on the other hand, want to change aspects of the output of a finished assembly, simply invoke runProject again with a parameter that, for example, generates an extra file (see below for an overview of such parameters). Newbler will then not recomputed the entire assembly, just generate the new output.
What follows are some parameters that are useful for influencing the assembly process.
3) Minimum overlap length (-ml) and minimum overlap identity (-mi)
Although the developers of newbler optimized parameters to get the best assembly in most cases, sometimes it is useful to adjust these two. They control the stringency of the alignment of reads at the beginning of the assembly. Increasing this stringency works well when:
- the sample has low-frequency contamination DNA. In this case, only the reads that ‘belong together’ assemble together into contigs, without accidentally ‘pulling in’ contaminating reads. For example, for the bacterial assemblies based on contaminated read datasets described in this (my)paper, increasing mi and ml yielded better results for the genomes of interest.
- the read dataset is low in coverage relative to the genome size. When the total number of bases in the read dataset is less than, say 10 times the genome size (less than 10x coverage), at default settings, newbler sometimes tries to find/force overlaps between reads that are not really there, slowing down the alignment phase. Increasing the stringency will speed up the assembly.
An example set up for adjusting -mi and -ml would be:
runAssembly -o project1 -mi 96 -ml 60 /data/sff/EYV886410.sff /data/reads/ FQL5QBG02.sff
Or, in the case of newAssembly, addRun and runProject, use:
runProject -mi 96 -ml 60 project2
I recommend testing -mi from 90 (the default) to 98 (very stringent) and -ml from 40 (the default) to 100 or more (very stringent).
4) Large datasets and parallelization
For large genomes (say, over 100 Mb) with a good coverage in terms of bases input, using -large will use some shortcuts to speed up (read: ‘make possible to complete’) the assembly. The cost is more contigs and more reads determined to be singletons. Also, using a multi-cpu machine and setting the number of cpus available to newbler with the -cpu parameter will result in a significant speed increase. During alignment, newbler will spread the task over all cpus. For those in the known, this parallelization is threaded, not MPI based, meaning that the cpus need to be on the same core/machine. The default number of cpus newbler will use is one. When more than one cpu is used, rerunning the exact same assembly will result in small differences in the resulting number of contigs.
Examples of use:
runAssembly -o project1 -large -cpu 8 /data/sff/EYV886410.sff
runProject -large -cpu 8 project2
5) Restricting the reads used during assembly
Sometimes, reads have adaptor sequences at the beginning and/or end, e.g. as a results of the sample prep procedure. Alternatively, the dataset could contain reads that should not be included. For example when a BAC is sequenced, there is usually some E. coli host DNA present in the sample, which will result in reads not coming from the BAC. Newbler allows for removal of such sequences by:
- for adaptors: using -vt and a (multi) fasta vector trimming file with the sequences to be removed; newbler will check for a match at the ends of reads and exclude matching bases for assembly.
- for contaminating reads: using -vs and a (multi) fasta screening file with the sequences to be checked against (e.g. the E. coli genome); newbler will check entire reads against the screening sequences and exclude matching reads for assembly.
runAssembly -o project1 -vt /data/seq/adaptors.fasta /data/sff/EYV886410.sff /data/reads/ FQL5QBG02.sff
runProject -vs /data/seq/e_coli_genome.fasta project2
Alternatively, it is possible to provide either a list of reads that should be used for the assembly, excluding the other reads that might be present in the input, or, to provide a list of reads to exclude from the total set of reads provided. The ‘list’ should contain the read identifiers (14 characters). Use -fi to provide the include list, and -fe for the exclude list.
Finally, it is possible to exclude reads below a certain length by specifying the -minlen parameter, which can be set from 15 to 45 bases:
runAssembly -o project1 -minlen 45 /data/sff/EYV886410.sff
runProject -minlen 45 project2
6) Parameters influencing the output newbler will generate
All of the parameters mentioned here need to be added to the runAssembly or runProject commands. ‘#’ indicates that a number needs to be added after the parameter.
-a # and -l #
These determine the minimum length of contigs in the 454AllContigs.fna and 454LargeContigs.fna files, respectively. Defaults are 100 for -a, and 500 for -l.
For large projects (more than 4 million reads, more than 40 MB genome), the output of the 454AlignmentInfo.tsv file is suppressed. Use -info to generate this file in this case.
Setting this flag suppresses all output. This can be useful to quickly test if your assembly will finish, or in the case of incremental assemblies (see above), saving time and generating all output only at the very end.
These flags control the output of one or more 454Contigs.ace file(s). The ‘ace’ file is a file format for describing assemblies, i.e. all alignments and contig consensus sequences, however, without scaffolding information. For newbler assemblies, it is the only output file containing the actual read alignments. The ace file originated as the output of the phrap assembly program. Other assembly programs sometimes can generate it as well (newbler is one of them) or there are converter scripts between the different formats. Somewhere in this document is a description of the ace file format. The ace file can be used for assembly visualization in programs such as consed (see below), or Hawkeye. While the -ace flag generates one ace file containing all contigs, using -acedir generates a subdirectory with a separate ace file for each contig.
Consed is an assembly viewing, editing and finishing tool. It takes the ace file as input and allows for visualization of the assembly. Using the -consed flag will generate, besides the ace file, a set of folders and files that (in theory) should allow for editing the newbler assembly in consed.
In contrast to other assembly programs, newbler splits reads that that are aligned at the end of a contig, and continue in another contig, thus allowing for reads to be placed in more than one contig. The -rip flag ensures that reads are placed in one contig only (I assume where the majority of it’s bases are aligned).
As described above, the runAssembly program output has a different folder structure than the runProject-based assembly. The latter has the original folder structure, with additional hidden files. When using runAssembly, setting the -nrm flag will keep the folder structure the same as for assemblies started with runProject. Logically, this flag cannot be used with runProject.
Finally, there are many more parameters to choose from, but I think you will get very far using the ones described here. Maybe I’ll spend another post on the remaining parameters and options…