DEE2 Pipeline Info

Overview

The DEE2 pipeline is written in bash and uses open-source tools in a Docker container to analyse, filter and process the data. Our data processing procedure entails: (1) download from NCBI SRA; (2) diagnosis of sequence format; (3) sequence quality trimming and adapter clipping; (4) alignment to genome and transcriptome; and (5) assignment of reads to genes and transcripts. The Docker image is available from Docker Hub. More information regarding the data processing method is available at the GitHub repo and in the original publication. Below are the versions and major parameters used in the pipeline.

Software versions

Software versions and parameters used in the pipeline.
Software, version	Purpose	Parameters (SE)	Parameters (PE)
Aspera client, v3.5.4	Rapid download of sequence data	ascp -l 500m -O 33001 -T -i $ID $URL .
SRA toolkit, v2.8.2	Validate downloaded SRA files	vdb-validate $SRA
	diagnose single or paired end	fastq-dump -X 4000 --split-files $SRA
	dump fastq	(see parallel-fastq-dump below)
FastQC, v0.11.5	Diagnose basespace / colorspace, quality encoding, read length from 4000 reads	fastqc $FQ1	fastqc $FQ2
parallel-fastq-dump, v0.6.3	Rapid decompression of sequence data from .sra files	parallel-fastq-dump --threads $THREADS --outdir . --split-files --defline-qual + -s ${SRR}.sra
Skewer, v0.2.2	3’ quality trimming	skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1	skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1 $FQ2
	Adapter clipping	skewer -l 18 -t $THREADS -x $ADAPTER -o $SRR $FQ1	skewer -l 18 -t $THREADS -x $ADAPTER1 -y $ADAPTER2 -o $SRR $FQ1 $FQ2
	5’ trimming	skewer -m ap --cut $CLIP_NUM,$CLIP_NUM -l 18 -k inf -t $THREADS $FQ1	skewer -m ap --cut $R1_CLIP_NUM,$R2_CLIP_NUM -l 18 -k inf -t $THREADS $FQ1 $FQ2
Minion, v13-100	3’ adapter detection	minion search-adapter -i $FQ1	minion search-adapter -i $FQ2
Bowtie2, v2.3.2	Adapter contamination detection	bowtie2 -f -x $BT2_REF -S /dev/stdout $ADAPTER
FASTX-Toolkit, v0.0.14	Progressive 5’ trimming	fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ1	fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ2
STAR, v020201	Gene-level mapping, diagnose strandedness	STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1	STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1 $FQ2
Kallisto, v0.43.1	Transcript-level mapping	kallisto quant $KALLISTO_STRAND_PARAMETER --single -l 100 -s 20 -t $THREADS -o . -i $KAL_REF $FQ1	kallisto quant $KALLISTO_STRAND_PARAMETER -t $THREADS -o . -i $KAL_REF $FQ1 $FQ2
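
The table lists separate single-end (SE) and paired-end (PE) commands, chosen after a short fastq-dump --split-files test dump. The sketch below shows one way such a layout check could be written in bash. It is not taken from the DEE2 source: the function name detect_layout and its arguments are hypothetical, and it assumes the usual fastq-dump naming of ${SRR}_1.fastq and ${SRR}_2.fastq. Runs that carry extra technical reads (for example a third file) are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DEE2): infer the read layout of a run
# from the files that `fastq-dump --split-files` leaves in a directory.
# Usage: detect_layout DIR SRR
detect_layout() {
  local dir="$1" srr="$2"
  if [ -s "$dir/${srr}_1.fastq" ] && [ -s "$dir/${srr}_2.fastq" ]; then
    # Two non-empty mate files: treat the run as paired-end.
    echo "PE"
  elif [ -s "$dir/${srr}_1.fastq" ]; then
    # Only the first file exists and is non-empty: treat as single-end.
    echo "SE"
  else
    # Nothing usable was dumped; signal failure to the caller.
    echo "UNKNOWN"
    return 1
  fi
}
```

A caller might then branch on the result, for example running the SE column of the table when detect_layout . "$SRR" prints SE and the PE column otherwise.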

Reference genome information

The compendium relies on reference genome sequence and annotation information provided by Ensembl Genomes.

Species	Genome Reference Sequence and Annotation
Arabidopsis thaliana	Ensembl Plants release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Caenorhabditis elegans	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Drosophila melanogaster	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Danio rerio	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Escherichia coli	Ensembl release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Homo sapiens	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Mus musculus	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Oryza sativa	Ensembl Plants release 59 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Rattus norvegicus	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Saccharomyces cerevisiae	Ensembl release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Zea mays	Ensembl Plants release 59 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)

Understanding quality metrics

The philosophy behind DEE2 is to process and provide as much of the data available on SRA as possible, with minimal filtering. We provide detailed quality metrics for each run so that users can perform their own filtering procedures. A description of each quality metric is provided on the GitHub page.
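
As an illustration of user-side filtering, the sketch below keeps only the runs whose metric meets a minimum value. The file layout is an assumption made for this example, not the DEE2 format: a tab-separated table with one header line, the run accession in column 1 and a single numeric metric in column 2. The function name filter_runs is likewise hypothetical; consult the metric descriptions on the GitHub page for the real field names.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DEE2): print the accessions of runs
# whose metric is at least MIN_VALUE.
# Assumed input: tab-separated, header on line 1, accession in column 1,
# numeric metric in column 2. This layout is illustrative only.
# Usage: filter_runs METRICS_TSV MIN_VALUE
filter_runs() {
  local tsv="$1" min="$2"
  # Skip the header (NR > 1) and force a numeric comparison with "+ 0".
  awk -F'\t' -v min="$min" 'NR > 1 && ($2 + 0) >= (min + 0) { print $1 }' "$tsv"
}
```

The threshold is left to the caller on purpose, since a suitable cut-off depends on the metric and on the downstream analysis.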

Update cycle

The data repository is updated on an irregular basis, depending on available compute resources and time.
