# Interpreting the SAM file produced by bowtie


@HD VN:1.0 SO:unsorted

@SQ SN:chr1 LN:249250621

@SQ SN:chr2 LN:243199373

@PG ID:Bowtie VN:1.0.0 CL:"bowtie genome/hg19 -q reads/SRR3101251.fastq -m 1 -p 4 -S"

SRR3101251.1 0 chr19 9486878 255 49M * 0 0 NTACTCCCACTACTCTCAGATTCAAGCAATCCTCCCACCCTAGCCCACC #1=DDDFFHHHHHIHHIJJJHIJIIJIHIFHJIIJJJJJJJIIJJJJJJ XA:i:1 MD:Z:0A48 NM:i:1

SRR3101251.5 16 chr2 240279787 255 49M * 0 0 CCTGAATCCATCAGAGCAGCCGGGCTGTGACACTCACTGTCATGATGTT JIJJIHIIIIJJJJJJJJJGHJJJJIIHJHICJIGCHHHHHFFFFFCCC XA:i:0 MD:Z:49 NM:i:0

SRR3101251.6 4 * 0 0 * * 0 0 NATTCCCACCTATGAGTGAGAATATGCGGTGTTTGGTTTTTTGTTCTTG #1=DDDFFHHHHHJJJGHIJJJJJJJJJJCGGIIJJIIJJJIJHJIIJJ XM:i:1

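A SAM file has two parts: the header section (annotation lines that start with @) and the alignment section (one alignment record per line).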
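The split between the two parts can be sketched in a few lines of Python. This is only an illustration: the function names `split_sam` and `reference_lengths` are made up for this example, and the sample lines are the ones shown above with the tabs between fields restored.

```python
def split_sam(lines):
    """Split SAM text into header lines and tab-split alignment records."""
    header, records = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("@"):
            header.append(line)               # header section
        else:
            records.append(line.split("\t"))  # alignment section
    return header, records


def reference_lengths(header):
    """Map each @SQ reference name (SN) to its length (LN)."""
    lengths = {}
    for line in header:
        fields = line.split("\t")
        if fields[0] != "@SQ":
            continue
        tags = dict(field.split(":", 1) for field in fields[1:])
        lengths[tags["SN"]] = int(tags["LN"])
    return lengths


# Lines taken from the example above, with tabs restored between fields.
sample = [
    "@HD\tVN:1.0\tSO:unsorted",
    "@SQ\tSN:chr1\tLN:249250621",
    "@SQ\tSN:chr2\tLN:243199373",
    "\t".join([
        "SRR3101251.6", "4", "*", "0", "0", "*", "*", "0", "0",
        "NATTCCCACCTATGAGTGAGAATATGCGGTGTTTGGTTTTTTGTTCTTG",
        "#1=DDDFFHHHHHJJJGHIJJJJJJJJJJCGGIIJJIIJJJIJHJIIJJ",
        "XM:i:1",
    ]),
]

header, records = split_sam(sample)
print(reference_lengths(header))  # {'chr1': 249250621, 'chr2': 243199373}
print(len(records[0]))            # 12 (11 mandatory fields + 1 optional)
```

Testing the first character is enough here because a read name (QNAME) may not contain @, so an alignment line never starts with it.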


SAM format

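SAM is a standard format for sequence alignments, defined by Sanger. It is a TAB-delimited text format, used mainly to represent the result of mapping sequencing reads to a genome, although it can also represent arbitrary multiple alignments.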

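Problems that SAM has to handle well:

* a single sequence that aligns to the reference genome in several segments;

* representing an unlimited amount of structured information, including alignment details such as mismatches, deletions, and insertions.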

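Each alignment line has 11 mandatory fields:

* QNAME: name of the query template;

* FLAG: bitwise flag, a numeric summary of how the template mapped. Each value stands for one condition, and the field holds the sum of the values of all conditions that apply;

* RNAME: name of the reference sequence. If @SQ SN entries are defined in the header, RNAME must match one of them. For unmapped reads it is '*';

* POS: leftmost position of the alignment, counted from 1. For unmapped reads it is 0;

* MAPQ: mapping quality;

* CIGAR: Compact Idiosyncratic Gapped Alignment Report, a compact description of the alignment relative to the reference, written as a series of number-plus-letter operations in order. For example, 3S6M1P1I4M means: the first 3 bases are soft-clipped (excluded from the alignment but still present in SEQ), 6 bases are aligned, 1 padding position follows (a gap in the padded reference), 1 base is inserted, and the last 4 bases are aligned;

* RNEXT: name of the reference sequence that the next segment aligns to. It is '*' when there is no other segment, and '=' when it is the same as RNAME;

* PNEXT: position at which the next segment aligns, or 0 if unavailable;

* TLEN: template length. It is positive for the leftmost segment, negative for the rightmost one, and has no defined sign for segments in the middle. It is 0 for single-segment templates or when unavailable;

* SEQ: sequence of the segment, or '*' if the sequence is not stored. The lengths of the M/I/S/=/X operations in CIGAR must add up to the length of SEQ;

* QUAL: base qualities, encoded as in FASTQ.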
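The CIGAR rule in the SEQ entry (M/I/S/=/X must add up to the sequence length) can be checked mechanically. Below is a minimal Python sketch. The helper names `parse_cigar`, `query_length` and `reference_length` are hypothetical, and the split into query-consuming and reference-consuming operations follows the SAM specification.

```python
import re

CIGAR_FULL = re.compile(r"(\d+[MIDNSHP=X])+")
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

QUERY_OPS = set("MIS=X")      # operations that consume bases of SEQ
REFERENCE_OPS = set("MDN=X")  # operations that consume reference positions


def parse_cigar(cigar):
    """Return [(length, op), ...]; '*' (CIGAR unavailable) gives []."""
    if cigar == "*":
        return []
    if not CIGAR_FULL.fullmatch(cigar):
        raise ValueError(f"malformed CIGAR: {cigar!r}")
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def query_length(ops):
    """Number of SEQ bases described by the CIGAR."""
    return sum(n for n, op in ops if op in QUERY_OPS)


def reference_length(ops):
    """Number of reference positions covered by the alignment."""
    return sum(n for n, op in ops if op in REFERENCE_OPS)


ops = parse_cigar("3S6M1P1I4M")
print(ops)                    # [(3, 'S'), (6, 'M'), (1, 'P'), (1, 'I'), (4, 'M')]
print(query_length(ops))      # 14 = 3 + 6 + 1 + 4
print(reference_length(ops))  # 10 = 6 + 4

# First record of the example: POS 9486878, CIGAR 49M, 49-base SEQ.
seq = "NTACTCCCACTACTCTCAGATTCAAGCAATCCTCCCACCCTAGCCCACC"
ops = parse_cigar("49M")
assert query_length(ops) == len(seq)
end = 9486878 + reference_length(ops) - 1
print(end)                    # 9486926, the last reference position covered
```

The padding operation P consumes neither query nor reference, which is why it does not appear in either sum.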

Key terms:

* reference

* segment

* template (the DNA/RNA molecule of which one or more parts were sequenced; its segments are what get aligned)

* alignment

* seq

For more details, see:

The SAM specification: http://samtools.sourceforge.net/SAM1.pdf

The CIGAR concept: http://asia.ensembl.org/common/Help/Glossary?db=core

Perl module: http://search.cpan.org/~lds/Bio-SamTools/lib/Bio/DB/Sam.pm

SAM bowtie output

Following is a brief description of the SAM format as output by bowtie when the -S/--sam option is specified. For more details, see the SAM format specification.

When -S/--sam is specified, bowtie prints a SAM header with @HD, @SQ and @PG lines. When one or more --sam-RG arguments are specified, bowtie will also print an @RG line that includes all user-specified --sam-RG tokens separated by tabs.

Each subsequent line corresponds to a read or an alignment. Each line is a collection of at least 12 fields separated by tabs; from left to right, the fields are:

1. Name of read that aligned

2. Sum of all applicable flags. Flags relevant to Bowtie are:

   * 1: The read is one of a pair
   * 2: The alignment is one end of a proper paired-end alignment
   * 4: The read has no reported alignments
   * 8: The read is one of a pair and has no reported alignments
   * 16: The alignment is to the reverse reference strand
   * 32: The other mate in the paired-end alignment is aligned to the reverse reference strand
   * 64: The read is the first (#1) mate in a pair
   * 128: The read is the second (#2) mate in a pair

   Thus, an unpaired read that aligns to the reverse reference strand will have flag 16. A paired-end read that is the first mate in a proper pair and aligns to the reverse reference strand will have flag 83 (= 64 + 16 + 2 + 1).

3. Name of reference sequence where alignment occurs, or ordinal ID if no name was provided

4. 1-based offset into the forward reference strand where leftmost character of the alignment occurs

5. Mapping quality

6. CIGAR string representation of alignment

7. Name of reference sequence where mate's alignment occurs. Set to = if the mate's reference sequence is the same as this alignment's, or * if there is no mate.

8. 1-based offset into the forward reference strand where leftmost character of the mate's alignment occurs. Offset is 0 if there is no mate.

9. Inferred insert size. Size is negative if the mate's alignment occurs upstream of this alignment. Size is 0 if there is no mate.

10. Read sequence (reverse-complemented if aligned to the reverse strand)

11. ASCII-encoded read qualities (reverse-complemented if the read aligned to the reverse strand). The encoded quality values are on the Phred quality scale and the encoding is ASCII-offset by 33 (ASCII char !), similarly to a FASTQ file.

12. Optional fields. Fields are tab-separated. For descriptions of all possible optional fields, see the SAM format specification. bowtie outputs some of these optional fields for each alignment, depending on the type of the alignment:

`NM:i:<N>` Aligned read has an edit distance of `<N>`.

`CM:i:<N>` Aligned read has an edit distance of `<N>` in colorspace. This field is present in addition to the NM field in -C/--color mode, but is omitted otherwise.

`MD:Z:<S>` For aligned reads, `<S>` is a string representation of the mismatched reference bases in the alignment. See SAM format specification for details. For colorspace alignments, `<S>` describes the decoded nucleotide alignment, not the colorspace alignment.

`XA:i:<N>` Aligned read belongs to stratum `<N>`. See Strata for definition.

`XM:i:<N>` For a read with no reported alignments, `<N>` is 0 if the read had no alignments. If -m was specified and the read's alignments were suppressed because the -m ceiling was exceeded, `<N>` equals the -m ceiling + 1, to indicate that there were at least that many valid alignments (but all were suppressed). In -M mode, if the alignment was randomly selected because the -M ceiling was exceeded, `<N>` equals the -M ceiling + 1, to indicate that there were at least that many valid alignments (of which one was reported at random).
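To make the field descriptions concrete, here is a small Python sketch that decodes the flag sum, the optional TAG:TYPE:VALUE fields, and the Phred+33 qualities of the records shown at the top of this post. The function names are invented for this example, and the optional-field parser is a simplification that only converts the integer type `i` and leaves every other type as a string.

```python
# Flag values as listed in field 2 above.
FLAGS = [
    (1, "read is one of a pair"),
    (2, "alignment is one end of a proper paired-end alignment"),
    (4, "read has no reported alignments"),
    (8, "read is one of a pair and has no reported alignments"),
    (16, "alignment is to the reverse reference strand"),
    (32, "other mate is aligned to the reverse reference strand"),
    (64, "read is the first (#1) mate in a pair"),
    (128, "read is the second (#2) mate in a pair"),
]


def decode_flag(flag):
    """Return the individual flag values whose sum is `flag`."""
    return [value for value, _ in FLAGS if flag & value]


def parse_optional(fields):
    """Turn ['NM:i:1', 'MD:Z:0A48'] into {'NM': 1, 'MD': '0A48'}."""
    tags = {}
    for field in fields:
        tag, typ, value = field.split(":", 2)
        tags[tag] = int(value) if typ == "i" else value
    return tags


def phred_scores(qual):
    """Decode an ASCII-offset-33 quality string into Phred scores."""
    return [ord(char) - 33 for char in qual]


print(decode_flag(83))  # [1, 2, 16, 64]
print(decode_flag(16))  # [16]
print(decode_flag(4))   # [4]

# Optional fields of the first record in the example.
print(parse_optional(["XA:i:1", "MD:Z:0A48", "NM:i:1"]))
# {'XA': 1, 'MD': '0A48', 'NM': 1}

# First four quality characters of the first record.
print(phred_scores("#1=D"))  # [2, 16, 28, 35]
```

In the first record, MD:Z:0A48 says that the first aligned position is a mismatch against reference base A, followed by 48 matches, which agrees with NM:i:1 and with the read starting with N.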

• Copyright notice: this article was submitted by wangyunpeng_bio and published on 2016/07/30