De novo transcriptome pipeline, including full transcript annotation

2016/01/14 Source: 生信菜鸟团

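A reader once asked how to do RNA-seq analysis for a species that has no reference genome or transcriptome. That question is too broad for me, and I honestly have no hands-on experience with it. But I did once read a paper that described a very thorough de novo transcriptome assembly and annotation pipeline, so I have excerpted its bioinformatics processing section to share here: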

The paper is "RNA-seq analysis for plant carnivory gene discovery in Nepenthes × ventrata", by researchers in Malaysia. It is so short that it startled me.

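Journal: FRONTIERS IN PLANT SCIENCE. Publication frequency: not listed. SCI impact factor (2014): 3.948. I suspect this journal's impact factor will keep rising. The same experimental design and workflow yielded two papers: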

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4707257/

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4778577/

Both used the same sequencing strategy: Illumina HiSeq 2500 sequencing platform, paired-end reads of 125 bp.

Data processing pipeline:

Trimmomatic → Trinity (v2.0.6) → TransDecoder → Trinotate (v2.0.0)
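A minimal sketch of how the first three stages of this chain might be driven from Python. It only builds the command lines and does not run anything. The file names, thread count, memory limit, jar path, adapter file and the specific trimming steps (ILLUMINACLIP, SLIDINGWINDOW:4:25, MINLEN:36) are my own placeholders, not settings reported in the paper; check every flag against the versions you install. Trinotate is left out because it is a multi-step procedure rather than a single command.

```python
import shlex


def build_pipeline(left, right, threads=8, max_memory="50G",
                   trimmomatic_jar="trimmomatic.jar", adapters="adapters.fa"):
    """Return argument lists for trimming, assembly and ORF prediction.

    All paths and parameter values are illustrative placeholders.
    """
    trimmed_left, trimmed_right = "trimmed_1P.fq.gz", "trimmed_2P.fq.gz"

    # Trimmomatic paired-end mode: two inputs, four outputs, then trimming steps.
    trim = [
        "java", "-jar", trimmomatic_jar, "PE", "-threads", str(threads),
        left, right,
        trimmed_left, "trimmed_1U.fq.gz",
        trimmed_right, "trimmed_2U.fq.gz",
        f"ILLUMINACLIP:{adapters}:2:30:10",  # adapter removal (assumed settings)
        "SLIDINGWINDOW:4:25",                # assumed way to apply a Phred 25 cutoff
        "MINLEN:36",
    ]

    # Trinity de novo assembly of the surviving read pairs.
    assemble = [
        "Trinity", "--seqType", "fq",
        "--left", trimmed_left, "--right", trimmed_right,
        "--CPU", str(threads), "--max_memory", max_memory,
        "--output", "trinity_out",
    ]

    # TransDecoder: find long ORFs, then predict likely coding regions.
    assembly = "trinity_out/Trinity.fasta"
    long_orfs = ["TransDecoder.LongOrfs", "-t", assembly]
    predict = ["TransDecoder.Predict", "-t", assembly]

    return [trim, assemble, long_orfs, predict]


if __name__ == "__main__":
    for cmd in build_pipeline("sample_1.fq.gz", "sample_2.fq.gz"):
        print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the tools are installed and the placeholder paths are replaced.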

I have usage notes for all of these tools on my blog.

The technical details follow:

Raw reads from all three data sets were filtered to remove adapter sequences with the sequence pre-processing tool Trimmomatic [2]. High quality Illumina raw reads with phred score ≥ 25 were kept for assembly. De novo assembly of these processed reads was performed with Trinity (v2.0.6) [3]. Statistics of the assembly are shown in Table 1.
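The excerpt does not say how the "phred score ≥ 25" criterion was applied (per base, per window, or per read). As an illustration only, here is a small sketch that decodes Phred+33 quality strings and keeps reads whose mean quality is at least 25. The per-read mean is my assumption, and Trimmomatic itself trims rather than filtering this way.

```python
def phred_scores(quality_line, offset=33):
    """Decode a FASTQ quality string into integer Phred scores (Phred+33 by default)."""
    return [ord(ch) - offset for ch in quality_line]


def mean_phred(quality_line, offset=33):
    """Average Phred score of one read; 0.0 for an empty quality string."""
    scores = phred_scores(quality_line, offset)
    return sum(scores) / len(scores) if scores else 0.0


def filter_fastq(records, min_mean_quality=25):
    """Yield (header, sequence, quality) tuples whose mean Phred score passes the cutoff."""
    for header, sequence, quality in records:
        if mean_phred(quality) >= min_mean_quality:
            yield header, sequence, quality


# Toy reads made up for this example, not real data.
reads = [
    ("@read1", "ACGT", "IIII"),  # 'I' encodes Phred 40
    ("@read2", "ACGT", "5555"),  # '5' encodes Phred 20
]
kept = list(filter_fastq(reads))
print([header for header, _, _ in kept])  # prints ['@read1']
```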

Protein coding sequences of unique transcripts were analyzed via Transdecoder version v2.0.1 as a part of Trinity analysis pipeline. Standard Trinotate (v2.0.0) annotation pipeline (https://trinotate.github.io/) was carried out to annotate the assembled unique transcripts against Swissprot [4], Pfam [5], eggNOG [6], Gene Ontology [7], SignalP [8], and Rnammer [9]. Summary of the annotation is shown in Table 2.
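Annotating against Swissprot largely comes down to a BLAST search followed by picking the best hit for each transcript. The sketch below parses BLAST tabular output (`-outfmt 6`, standard 12 columns) and keeps the highest-bitscore subject per query. The query and subject IDs in the example are invented for illustration, and this is not Trinotate's own loading code.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(handle):
    """Map each query ID to its (subject ID, bitscore) pair with the highest bitscore."""
    best = {}
    reader = csv.DictReader(handle, fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > current[1]:
            best[row["qseqid"]] = (row["sseqid"], score)
    return best


# Invented rows: two hits for tx1, one hit for tx2.
example = io.StringIO(
    "tx1\thitA\t80.0\t100\t20\t0\t1\t100\t1\t100\t1e-40\t150.0\n"
    "tx1\thitB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t210.0\n"
    "tx2\thitC\t70.0\t90\t27\t0\t1\t90\t1\t90\t1e-20\t95.5\n"
)
hits = best_hits(example)
print(hits)
```

For a real run, pass an open file handle such as `open("blastx.outfmt6")` instead of the in-memory example.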

So the key is to learn the following tools:

Trinotate: http://trinotate.github.io
Trinity (includes support for expression and DE analysis using RSEM and Bioconductor): http://trinityrnaseq.github.io/ Note: Trinity is not absolutely required. It is possible to use Trinotate with other sources of transcript data as long as suitable inputs are available.
TransDecoder, for predicting coding regions in transcripts: http://transdecoder.github.io
sqlite (required for database integration): http://www.sqlite.org/
NCBI BLAST+, for BLAST database homology search: http://www.ncbi.nlm.nih.gov/books/NBK52640/
HMMER/PFAM, for protein domain identification: http://hmmer.janelia.org/download.html
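To show how BLAST+ and HMMER fit in, here is a hedged sketch that builds the homology-search commands whose outputs an annotation step would consume. The database names (`uniprot_sprot.pep`, `Pfam-A.hmm`), input and output file names, thread count and E-value cutoff are placeholders I picked. The protein database is assumed to have been formatted with `makeblastdb` and the HMM file with `hmmpress` beforehand.

```python
def homology_search_commands(transcripts="Trinity.fasta",
                             peptides="Trinity.fasta.transdecoder.pep",
                             swissprot_db="uniprot_sprot.pep",
                             pfam_hmm="Pfam-A.hmm",
                             threads=8):
    """Build BLAST+ and HMMER argument lists; every path is an illustrative placeholder."""
    # Options shared by both BLAST searches: tabular output, one target per query.
    common = [
        "-db", swissprot_db,
        "-num_threads", str(threads),
        "-max_target_seqs", "1",
        "-outfmt", "6",
        "-evalue", "1e-3",
    ]
    return {
        # Translated nucleotide query against the protein database.
        "blastx": ["blastx", "-query", transcripts, *common, "-out", "blastx.outfmt6"],
        # Predicted peptides against the same protein database.
        "blastp": ["blastp", "-query", peptides, *common, "-out", "blastp.outfmt6"],
        # Protein domain search; --domtblout writes the per-domain table.
        "hmmscan": ["hmmscan", "--cpu", str(threads),
                    "--domtblout", "pfam.domtblout", pfam_hmm, peptides],
    }


if __name__ == "__main__":
    for name, cmd in homology_search_commands().items():
        print(name, "->", " ".join(cmd))
```

Returning argument lists rather than shell strings keeps the sketch free of quoting problems and lets each command be handed directly to `subprocess.run`.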

The data are all available for download and are well suited for practice:

Transcriptome profile of N. × ventrata was generated from the polyA-enriched cDNA libraries prepared from total RNA extracted from its pitcher. The short reads were filtered, processed, assembled and analyzed as described in the next section. Raw data for this project were deposited at SRA database with the accession numbers SRX1389337 (http://www.ncbi.nlm.nih.gov/sra/SRX1389337) for day 0 control, SRX1389392 (http://www.ncbi.nlm.nih.gov/sra/SRX1389392) for day 3 longevity experiment, and SRX1389395 (http://www.ncbi.nlm.nih.gov/sra/SRX1389395) for day 3 chitin-treatment experiment.

Transcriptome profile of N. ampullaria was generated from the polyA-enriched cDNA libraries prepared from total RNA extracted from its pitcher. The short reads were filtered, processed, assembled, and analyzed as described in the next section. Raw data for this project were deposited at SRA database with the accession numbers SRX1400303 (http://www.ncbi.nlm.nih.gov/sra/SRX1400303) for day 0 control, SRX1400308 (http://www.ncbi.nlm.nih.gov/sra/SRX1400308) for day 3 longevity experiment, and SRX1400311 (http://www.ncbi.nlm.nih.gov/sra/SRX1400311) for day 3 fluid protein depletion experiment. Assembled transcriptome fasta sequences can be accessed at http://gohlab.researchfrontier.org/public-datasets/Nepenthes-ampullaria-Trinity-gohlab.fasta.
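Once an assembled FASTA file like this is in hand, a natural first check is a set of basic assembly statistics. The sketch below computes transcript count, total length and N50 from FASTA text. It is a generic calculation written for illustration, not the script the authors used, and the toy sequences are made up.

```python
def read_fasta_lengths(lines):
    """Return the length of every sequence in an iterable of FASTA lines."""
    lengths = []
    current = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that sequences of length >= L cover at least half the total bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Toy FASTA text; for a real file use: with open("assembly.fasta") as fh: ...
toy = ">a\nACGT\nAC\n>b\nACG\n".splitlines()
lengths = read_fasta_lengths(toy)
print(len(lengths), sum(lengths), n50(lengths))
```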

Original post: http://www.bio-info-trainee.com/1789.html
