Overview
The DEE2 pipeline is written in bash and uses open-source tools in a Docker container to analyse, filter and process the data. Data processing comprises five steps: (1) download from NCBI SRA; (2) diagnosis of sequence format; (3) sequence quality trimming and adapter clipping; (4) alignment to the genome and transcriptome; and (5) assignment of reads to genes and transcripts. The Docker image is available from Docker Hub. More information on the data processing method is available in the GitHub repo and in the original publication. The versions and major parameters used in the pipeline are listed below.
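As an illustration of step (2), the sketch below shows one way a run could be classified as single-end or paired-end after a small test dump with `fastq-dump --split-files`, which writes `<accession>_1.fastq` and, for paired data, `<accession>_2.fastq`. The function name `detect_layout` and its return labels are hypothetical and are not taken from the DEE2 code.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DEE2): classify a run as SE or PE
# based on which files a prior `fastq-dump --split-files` call produced.
# Usage: detect_layout <accession> [directory]
detect_layout() {
  local srr="$1" dir="${2:-.}"
  if [ -s "$dir/${srr}_2.fastq" ]; then
    echo "PE"        # a non-empty second mate file implies paired-end
  elif [ -s "$dir/${srr}_1.fastq" ]; then
    echo "SE"        # only the first file exists: single-end
  else
    echo "UNKNOWN"   # nothing was dumped, or the files are empty
  fi
}
```

The result can then select between the SE and PE forms of the downstream commands.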
Software versions
Software versions and parameters used in the pipeline.

| Software, version | Purpose | Parameter (SE) | Parameter (PE) |
|---|---|---|---|
| Aspera client, v3.5.4 | Rapid download of sequence data | ascp -l 500m -O 33001 -T -i $ID $URL . | Same as SE |
| SRA toolkit, v2.8.2 | Validate downloaded SRA files | vdb-validate $SRA | Same as SE |
| | Diagnose single or paired end | fastq-dump -X 4000 --split-files $SRA | Same as SE |
| | Dump FASTQ | (see parallel-fastq-dump below) | Same as SE |
| FastQC, v0.11.5 | Diagnose basespace / colorspace, quality encoding, read length from 4000 reads | fastqc $FQ1 | fastqc $FQ2 |
| parallel-fastq-dump, 0.6.3 | Rapid decompression of sequence data from .sra files | parallel-fastq-dump --threads $THREADS --outdir . --split-files --defline-qual + -s ${SRR}.sra | Same as SE |
| Skewer, v0.2.2 | 3’ quality trimming | skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1 | skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1 $FQ2 |
| | Adapter clipping | skewer -l 18 -t $THREADS -x $ADAPTER -o $SRR $FQ1 | skewer -l 18 -t $THREADS -x $ADAPTER1 -y $ADAPTER2 -o $SRR $FQ1 $FQ2 |
| | 5’ trimming | skewer -m ap --cut $CLIP_NUM,$CLIP_NUM -l 18 -k inf -t $THREADS $FQ1 | skewer -m ap --cut $R1_CLIP_NUM,$R2_CLIP_NUM -l 18 -k inf -t $THREADS $FQ1 $FQ2 |
| Minion, v13-100 | 3’ adapter detection | minion search-adapter -i $FQ1 | minion search-adapter -i $FQ2 |
| Bowtie2, v2.3.2 | Adapter contamination detection | bowtie2 -f -x $BT2_REF -S /dev/stdout $ADAPTER | Same as SE |
| FASTX-Toolkit, v0.0.14 | Progressive 5’ trimming | fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ1 | fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ2 |
| STAR v020201 | Gene-level mapping, diagnose strandedness | STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1 | STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1 $FQ2 |
| Kallisto, v0.43.1 | Transcript-level mapping | kallisto quant $KALLISTO_STRAND_PARAMETER --single -l 100 -s 20 -t $THREADS -o . -i $KAL_REF $FQ1 | kallisto quant $KALLISTO_STRAND_PARAMETER -t $THREADS -o . -i $KAL_REF $FQ1 $FQ2 |
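The table lists STAR with `--quantMode GeneCounts` as the step that diagnoses strandedness, and Kallisto as taking a `$KALLISTO_STRAND_PARAMETER`. The sketch below shows one plausible way to connect the two. STAR writes `ReadsPerGene.out.tab`, whose columns are gene ID, unstranded counts, counts for reads on the gene's strand, and counts for reads on the opposite strand, and whose first four rows are summary rows. The function name `diagnose_strand` and the five-fold threshold are assumptions made for illustration; the rule DEE2 actually applies may differ.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DEE2): infer library strandedness from
# STAR's ReadsPerGene.out.tab and print a matching kallisto strand flag.
# Usage: diagnose_strand <ReadsPerGene.out.tab> [fold-threshold]
diagnose_strand() {
  local tab="$1" fold="${2:-5}"   # fold threshold is an assumed value
  awk -v fold="$fold" '
    NR > 4 { fwd += $3; rev += $4 }      # skip the four N_* summary rows
    END {
      if (fwd > fold * rev)      print "--fr-stranded"
      else if (rev > fold * fwd) print "--rf-stranded"
      else                       print ""   # treat as unstranded
    }' "$tab"
}
```

The output could be assigned to `KALLISTO_STRAND_PARAMETER`, for example `KALLISTO_STRAND_PARAMETER=$(diagnose_strand ReadsPerGene.out.tab)`, leaving it empty for unstranded libraries.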
Reference genome information
The compendium relies on reference genome sequence and annotation information provided by Ensembl Genomes.
Understanding quality metrics
The philosophy behind DEE2 is to process and provide as much of the data available on SRA as possible, with minimal filtering. We provide detailed quality metrics for each run so that users can apply their own filtering procedures. A description of each quality metric is provided on the GitHub page.
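As a sketch of user-side filtering, the function below keeps only the rows of a tab-separated metadata file whose quality summary begins with `PASS`. The column name `QC_summary`, the `PASS` label, and the function name `filter_pass` are assumptions made for illustration; the GitHub page remains the reference for the actual metric names and values.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of DEE2): keep the header plus the rows
# whose QC column starts with "PASS" in a tab-separated metadata file.
# Usage: filter_pass <metadata.tsv> [column-name]
filter_pass() {
  local tsv="$1" col="${2:-QC_summary}"   # column name is an assumption
  awk -F'\t' -v col="$col" '
    NR == 1 {                              # locate the QC column by name
      for (i = 1; i <= NF; i++) if ($i == col) c = i
      print; next
    }
    c && $c ~ /^PASS/                      # print passing rows only
  ' "$tsv"
}
```

If the named column is not found, only the header is printed, which makes a wrong column name easy to spot.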
Update cycle
The data repository is updated on an irregular basis, depending on available compute resources and time.