fastq2fasta by sed

  • 1,664

I've been having fun with sed this morning. I'm testing using Illumina data as input to Newbler, just to see how it works. As the data is paired-end, it requires some pre-processing following the guidelines for Sanger reads in the manual for gsAssembler (A.K.A newbler). Yesterday I did this using the method proposed in the manual with a final cleanup using sed so the procedure was fastq-->fasta-->perl script to make .acc file --> use fnafile to produce fasta file --> final tidy using sed. That was all messy and horribly slow. So today I decided to do all of those steps using sed.

Here's the end result:

1) Convert fastq2fasta using sed

sed '/^@/!d;s//>/;N' asp5_leaf_read1.fq > asp5_leaf_read1.fna
sed '/^@/!d;s//>/;N' asp5_leaf_read2.fq > asp5_leaf_read2.fna

2) Format fasta header so newbler can match the pairs up

sed -e 's/>\(.*\)#0\/\(.*\)/>\1 template=\1 dir=\2 library=asp5/' -e 's/dir=1/dir=f/' asp5_leaf_read1.fna > asp5_leaf_read1_newbler.fna
sed -e 's/>\(.*\)#0\/\(.*\)/>\1 template=\1 dir=\2 library=asp5/' -e 's/dir=2/dir=r/' asp5_leaf_read2.fna > asp5_leaf_read1_newbler.fna

3) Produce qual files with read names matching the fasta files

sed '/^+/!d;s//>/;N' asp5_leaf_read1.fq | sed 's/>\(.*\)#0\/\(.*\)/>\1/' > asp5_leaf_read1_newbler.qual
sed '/^+/!d;s//>/;N' asp5_leaf_read2.fq | sed 's/>\(.*\)#0\/\(.*\)/>\1/' > asp5_leaf_read2_newbler.qual

Those qual values will then need rescaling to phred33 .... still to do. Steps 1 and 2 can also be piped if you don't want to keep the unmodified .fna files (but I did - for testing Inchworm). I also need to check whether to reverse compliment the reads for either paired-end or mate-pair data.

Those steps took about 15 mins to complete. The attempt yesterday took over six hours. Thank you sed!

  • 本文由 发表
  • 网站部分文章源自互联网,若未正确标注来源,请联系管理员更新。文章转载,请务必保留本文链接



评论:2   其中:访客  2   博主  0
    • Dan B

      I was hoping for a Unix command, so I would avoid having to go through a perl script; I suppose I could do a ‘tr’.
      What about the first two questions?

        • ybzhao

          hi, ‘asp5′ and ‘leaf’ is just a filename without special means.

          As for the first question, I did not understand it well. I am sorry.