About DEE2

Jump to:
* Mission
* Who we are
* Acknowledgements
* News
* Data processing
* Reference genome information
* QC metrics
* Update Schedule

Our mission

The goal of DEE2 is to make large-scale gene expression data sets accessible to bioinformaticians, biologists and students alike. Using open-source bioinformatics tools and computational resources provided by our academic partners, we process many thousands of public RNA-seq data sets from a variety of organisms and make them freely accessible under the GNU General Public License v3.0.

Who we are

This compendium is maintained by Dr Mark Ziemann (Burnet Institute/Deakin University) and Antony Kaspi (WEHI). We value your feedback, so feel free to contact us by email (mark.ziemann[at]gmail.com) or raise an issue on our GitHub repo.

Acknowledgements

We acknowledge significant support from the following:

Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).

Burnet Institute, an independent Australian research institute focused on global health equality, which provides computational and salary support.

Deakin University, a Geelong- and Melbourne-based university, which supports DEE2 through collaboration, computing and infrastructure. Particular support is provided by the School of Life and Environmental Sciences.

National Computational Infrastructure Australia, providing access to high performance computing facilities including the GADI supercomputer. NCI is an Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).

We also thank previous contributors including The Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE), Deakin eResearch, Monash eResearch Centre and the IT department of Baker Heart and Diabetes Institute.

This research would not have been possible without facilities and help from NCBI. In particular, we acknowledge support from SRA and GEO for curating and hosting these data.

Latest News

25th June 2025 - From now on, the bulk data dumps will be provided in HDF5 format, which is better suited to most downstream users and occupies less space. Visit the bulk data page for more info. Thanks to A/Prof Alexey Sergushichev for the suggestion.

5th June 2025 - Thanks to Dr Wen-Dar Lin and his team, we have received RNA-seq data for several new plant species including Brachypodium distachyon, Glycine max (soya), Hordeum vulgare (barley), Populus trichocarpa (poplar, black cottonwood), Sorghum bicolor (sorghum), Solanum lycopersicum (tomato), Solanum tuberosum (potato), Triticum aestivum (common wheat) and Vitis vinifera (grapevine). Over the next few weeks, we will start uploading the data to the server and making it available. Stay tuned for updates.

22nd Jan 2025 - Big news! We added two new species to DEE2, rice (Oryza sativa) and maize (Zea mays), two staple crops that together feed billions of people. This was only possible with a huge contribution from Dr Wen-Dar Lin's team at the Institute of Plant and Microbial Biology. Further updates to the data processing pipeline have been made and are described on the GenomeSpot blog.

10th Sep 2024 - We have just completed an operating system upgrade, so we should be good for another 2 years.

Visit the news archive.

Data processing

Our data processing procedure entails:

-Download from NCBI SRA

-Diagnose sequence format

-Sequence quality trimming and adapter clipping

-Alignment to genome and transcriptome

-Assignment of reads to genes and transcripts
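
As an illustration of the "Diagnose sequence format" step, one way to tell single-end from paired-end runs is to count the FASTQ files that a small `fastq-dump --split-files` test dump produces. The sketch below is not taken from the DEE2 pipeline: the function name `diagnose_layout`, the assumed file naming (`<run>_<n>.fastq`) and the UNKNOWN fallback are choices made for this example only.

```shell
#!/bin/sh
# Hypothetical helper (not part of DEE2): classify a run as single-end (SE)
# or paired-end (PE) by counting the FASTQ files left in a directory by a
# test dump such as:
#   fastq-dump -X 4000 --split-files "$SRA"
# Usage: diagnose_layout <directory> <run accession>
diagnose_layout() {
    dir=$1
    srr=$2
    n=0
    for f in "$dir"/"$srr"_[0-9]*.fastq; do
        # An unmatched glob stays literal, so confirm the file really exists.
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    case $n in
        1) echo SE ;;
        2) echo PE ;;
        *) echo UNKNOWN ;;   # zero files, or extra technical/barcode reads
    esac
}
```

For example, `diagnose_layout . SRR0000001` (a made-up accession) would print PE if both `SRR0000001_1.fastq` and `SRR0000001_2.fastq` are present in the current directory.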

More information regarding the data processing method is available at the GitHub repo. Below are the versions and major parameters used in the pipeline.

Software versions and parameters used in the pipeline.
Software, version	Purpose	Parameters, single-end (SE)	Parameters, paired-end (PE)
Aspera client, v3.5.4	Rapid download of sequence data	ascp -l 500m -O 33001 -T -i $ID $URL .
SRA toolkit, v2.8.2	Validate downloaded SRA files	vdb-validate $SRA
	Diagnose single- or paired-end layout	fastq-dump -X 4000 --split-files $SRA
	Dump FASTQ	(see parallel-fastq-dump below)
FastQC, v0.11.5	Diagnose basespace / colorspace, quality encoding, read length from 4000 reads	fastqc $FQ1	fastqc $FQ2
parallel-fastq-dump, v0.6.3	Rapid decompression of sequence data from .sra files	parallel-fastq-dump --threads $THREADS --outdir . --split-files --defline-qual + -s ${SRR}.sra
Skewer, v0.2.2	3’ quality trimming	skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1	skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1 $FQ2
	Adapter clipping	skewer -l 18 -t $THREADS -x $ADAPTER -o $SRR $FQ1	skewer -l 18 -t $THREADS -x $ADAPTER1 -y $ADAPTER2 -o $SRR $FQ1 $FQ2
	5’ trimming	skewer -m ap --cut $CLIP_NUM,$CLIP_NUM -l 18 -k inf -t $THREADS $FQ1	skewer -m ap --cut $R1_CLIP_NUM,$R2_CLIP_NUM -l 18 -k inf -t $THREADS $FQ1 $FQ2
Minion, v13-100	3’ adapter detection	minion search-adapter -i $FQ1	minion search-adapter -i $FQ2
Bowtie2, v2.3.2	Adapter contamination detection	bowtie2 -f -x $BT2_REF -S /dev/stdout $ADAPTER
FASTX-Toolkit, v0.0.14	Progressive 5’ trimming	fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ1	fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ2
STAR, v020201	Gene-level mapping, diagnose strandedness	STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1	STAR --runThreadN $THREADS --quantMode GeneCounts --genomeLoad LoadAndKeep --outSAMtype None --genomeDir $STAR_DIR --readFilesIn=$FQ1 $FQ2
Kallisto, v0.43.1	Transcript-level mapping	kallisto quant $KALLISTO_STRAND_PARAMETER --single -l 100 -s 20 -t $THREADS -o . -i $KAL_REF $FQ1	kallisto quant $KALLISTO_STRAND_PARAMETER -t $THREADS -o . -i $KAL_REF $FQ1 $FQ2
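
The table lists STAR as the tool used to diagnose strandedness and passes a `$KALLISTO_STRAND_PARAMETER` to Kallisto, but does not show how one leads to the other. The sketch below is one plausible way to connect them, not the pipeline's confirmed logic. It relies on the `ReadsPerGene.out.tab` file that STAR writes with `--quantMode GeneCounts`, in which column 3 holds counts for read 1 on the RNA strand and column 4 holds counts for the opposite strand. The function name `star_strand_param`, the `RATIO` cut-off of 5 and the mapping to Kallisto's `--fr-stranded` and `--rf-stranded` flags are assumptions made for this example.

```shell
#!/bin/sh
# Hypothetical helper (not part of DEE2): guess library strandedness from
# STAR's ReadsPerGene.out.tab and print a matching kallisto strand flag.
# Columns: 1 gene ID, 2 unstranded counts, 3 counts with read 1 on the RNA
# strand, 4 counts with read 1 on the opposite strand.
# The first four rows (N_unmapped, N_multimapping, N_noFeature, N_ambiguous)
# are summary rows and are skipped.
# RATIO is an assumed cut-off, not a value taken from the DEE2 pipeline.
# Usage: star_strand_param <ReadsPerGene.out.tab>
star_strand_param() {
    awk -v ratio="${RATIO:-5}" '
        NR > 4 { fwd += $3; rev += $4 }
        END {
            if (fwd > ratio * rev)      print "--fr-stranded"
            else if (rev > ratio * fwd) print "--rf-stranded"
            else                        print ""   # treat as unstranded
        }' "$1"
}
```

A caller could then set `KALLISTO_STRAND_PARAMETER=$(star_strand_param ReadsPerGene.out.tab)`; an empty value leaves the `kallisto quant` command in its default unstranded mode.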

Reference genome information

The compendium relies on reference genome sequence and annotation information provided by Ensembl Genomes.

Species	Genome Reference Sequence and Annotation
Arabidopsis thaliana	Ensembl Plants release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Caenorhabditis elegans	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Drosophila melanogaster	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Danio rerio	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Escherichia coli	Ensembl release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Homo sapiens	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Mus musculus	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Oryza sativa	Ensembl Plants release 59 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Rattus norvegicus	Ensembl release 90 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Saccharomyces cerevisiae	Ensembl release 36 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)
Zea mays	Ensembl Plants release 59 Genome sequence (fasta) Gene annotation set (GTF) cDNA sequences (fasta)

About the quality metrics

A description of each of the quality metrics is provided on the GitHub page.

Update schedule

We update the compendium on a quarterly basis. Upon release of an updated genome build, we intend to update the data for that organism if resources are available. We will keep an archived version based on the previous build for bulk download only. Gene annotation sets will not be updated independently of the genome build.