About DEE2

Jump to:
* Mission
* Who we are
* Acknowledgements
* News
* Data processing
* Reference genome information
* QC metrics
* Update Schedule

Our mission

The goal of DEE2 is to make large-scale gene expression data sets accessible to bioinformaticians, biologists and students alike. Using open-source bioinformatics tools and computational resources provided by our academic partners, we process many thousands of public RNA-seq data sets from a variety of organisms and make them freely accessible under the GNU General Public License v3.0.


Who we are

This compendium is maintained by Dr Mark Ziemann (Deakin University) and Antony Kaspi (WEHI). We value your feedback, so feel free to contact us by email (mark.ziemann[at]gmail.com) or raise an issue on our GitHub repo.


Acknowledgements

A project of this size cannot be done by a single person or team, so we acknowledge support from the following:

Epigenetics in Human Health and Disease Laboratory

Nectar Research Cloud, a collaborative Australian research platform supported by the National Collaborative Research Infrastructure Strategy (NCRIS).

The Multi-modal Australian ScienceS Imaging and Visualisation Environment (MASSIVE).

Deakin eResearch, Monash eResearch Centre and the IT department of Baker Heart and Diabetes Institute.

This research would not have been possible without facilities and help from NCBI. In particular, we acknowledge support from SRA and GEO for curating and hosting these data.


Latest News

11th May 2021 - We have just completed a migration of the DEE2 webserver from an older Intel Xeon system to a new AMD Epyc system.

3rd Oct 2020 - We have curated a number of human disease gene signatures that you can use in enrichment analysis. More information here: http://dee2.io/genesets.html.

1st May 2020 - Due to popular demand we have made available md5 checksums for the bulk data dumps. The file name is "checksums.md5". These will be updated with each addition of new datasets. Link: http://dee2.io/mx/
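
As a rough illustration of how the checksum file might be used after a bulk download, the sketch below assumes that "checksums.md5" follows the standard md5sum format (hash, two spaces, file name) and that it sits in the same directory as the downloaded files. Neither assumption is confirmed on this page, and the function name verify_dumps is hypothetical.

```shell
# Sketch: verify downloaded DEE2 bulk dumps against checksums.md5.
# Assumption: checksums.md5 is in standard md5sum format
# ("<hash>  <filename>") and lives next to the downloaded files.

verify_dumps() {
  local dir="${1:-.}"
  # Run in a subshell so the caller's working directory is unchanged.
  # md5sum -c exits non-zero if any listed file is missing or altered.
  ( cd "$dir" && md5sum -c checksums.md5 )
}
```

For example, `verify_dumps ~/dee2_dumps && echo "all files intact"`. Note that plain `md5sum -c` also reports a failure for any file that is listed in the checksum file but was not downloaded.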

12th March 2020 - At 11pm today we recorded 1,000,043 SRA run datasets uploaded to DEE2 servers. We expect the 1 millionth DEE2 dataset to become publicly available within the next 1-2 weeks.

9th January 2020 - Happy New Year! Here is a new blog post describing recent updates and high priority issues for the next few months.
* Data processing - a work in progress
* Getting a new source of metadata - completed
* Future areas for improvement
* The problem of studies missing runs
* Increasing queue efficiency
* Making the R package more efficient
* Website metadata search improvements
* Call for compute resources

5th November 2019 - The webserver has had a few hiccups recently, but normal operation has been restored. In the next few weeks we will try to complete the outstanding mouse and human datasets and then move to a new source of metadata.

Visit the news archive.


Data processing

Our data processing procedure entails:

  1. Download from NCBI SRA

  2. Diagnose sequence format

  3. Sequence quality trimming and adapter clipping

  4. Alignment to genome and transcriptome

  5. Assignment of reads to genes and transcripts
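
Step 2 can be illustrated with a small sketch based on the `fastq-dump -X 4000 --split-files` call listed in the parameter table. It assumes that this call writes `<SRR>_1.fastq` for single-end runs and additionally `<SRR>_2.fastq` for paired-end runs. The helper name detect_layout is hypothetical, and the actual DEE2 pipeline may diagnose the layout differently.

```shell
# Sketch: decide between single-end (SE) and paired-end (PE) processing
# from the files left behind by `fastq-dump -X 4000 --split-files $SRA`.
# Assumption: output files are named <SRR>_1.fastq and <SRR>_2.fastq.

detect_layout() {
  local srr="$1" dir="${2:-.}"
  if [ -s "$dir/${srr}_1.fastq" ] && [ -s "$dir/${srr}_2.fastq" ]; then
    echo "PE"        # both mates present and non-empty
  elif [ -s "$dir/${srr}_1.fastq" ]; then
    echo "SE"        # only the first read file exists
  else
    echo "UNKNOWN"   # nothing usable was dumped
    return 1
  fi
}
```

A caller could then branch on the result, for example `LAYOUT=$(detect_layout "$SRR")`, and choose the SE or PE form of each downstream command.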

More information regarding the data processing method is available at the GitHub repo. Below are the versions and major parameters used in the pipeline.

Software versions and parameters used in the pipeline (SE: single-end, PE: paired-end).

Aspera client, v3.5.4
  Purpose: Rapid download of sequence data
  SE and PE: ascp -l 500m -O 33001 -T -i $ID $URL .

SRA toolkit, v2.8.2
  Purpose: Validate downloaded SRA files
  SE and PE: vdb-validate $SRA
  Purpose: Diagnose single or paired end
  SE and PE: fastq-dump -X 4000 --split-files $SRA
  Purpose: Dump fastq
  SE and PE: see parallel-fastq-dump below

FastQC, v0.11.5
  Purpose: Diagnose basespace / colorspace, quality encoding, read length from 4000 reads
  SE: fastqc $FQ1
  PE: fastqc $FQ2

parallel-fastq-dump, 0.6.3
  Purpose: Rapid decompression of sequence data from .sra files
  SE and PE: parallel-fastq-dump --threads $THREADS --outdir . --split-files --defline-qual + -s ${SRR}.sra

Skewer, v0.2.2
  Purpose: 3’ quality trimming
  SE: skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1
  PE: skewer -l 18 -q 10 -k inf -t $THREADS -o $SRR $FQ1 $FQ2
  Purpose: Adapter clipping
  SE: skewer -l 18 -t $THREADS -x $ADAPTER -o $SRR $FQ1
  PE: skewer -l 18 -t $THREADS -x $ADAPTER1 -y $ADAPTER2 -o $SRR $FQ1 $FQ2
  Purpose: 5’ trimming
  SE: skewer -m ap --cut $CLIP_NUM,$CLIP_NUM -l 18 -k inf -t $THREADS $FQ1
  PE: skewer -m ap --cut $R1_CLIP_NUM,$R2_CLIP_NUM -l 18 -k inf -t $THREADS $FQ1 $FQ2

Minion, v13-100
  Purpose: 3’ adapter detection
  SE: minion search-adapter -i $FQ1
  PE: minion search-adapter -i $FQ2

Bowtie2, v2.3.2
  Purpose: Adapter contamination detection
  SE and PE: bowtie2 -f -x $BT2_REF -S /dev/stdout $ADAPTER

FASTX-Toolkit, v0.0.14
  Purpose: Progressive 5’ trimming
  SE: fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ1
  PE: fastx_trimmer -f {5,9,13,21} -m 18 -Q 33 -i $FQ2

STAR, v020201
  Purpose: Gene-level mapping, diagnose strandedness
  SE: STAR --runThreadN $THREADS --quantMode GeneCounts \
      --genomeLoad LoadAndKeep --outSAMtype None \
      --genomeDir $STAR_DIR --readFilesIn=$FQ1
  PE: STAR --runThreadN $THREADS --quantMode GeneCounts \
      --genomeLoad LoadAndKeep --outSAMtype None \
      --genomeDir $STAR_DIR --readFilesIn=$FQ1 $FQ2

Kallisto, v0.43.1
  Purpose: Transcript-level mapping
  SE: kallisto quant $KALLISTO_STRAND_PARAMETER \
      --single -l 100 -s 20 -t $THREADS -o . \
      -i $KAL_REF $FQ1
  PE: kallisto quant $KALLISTO_STRAND_PARAMETER \
      -t $THREADS -o . -i $KAL_REF $FQ1 $FQ2
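
For several tools above, the SE and PE forms differ only in whether a second FASTQ file is passed. The following sketch shows one way a wrapper script might select between the two forms of the Skewer 3’ quality-trimming command. It is a dry run that only prints the command. The helper name build_skewer_qc, the `<SRR>_1.fastq` / `<SRR>_2.fastq` file naming and the accession SRR0000001 are illustrative assumptions, not taken from the DEE2 pipeline.

```shell
# Sketch: print the SE or PE form of the Skewer quality-trimming command.
# Dry run only: nothing is executed, so Skewer need not be installed.
# build_skewer_qc and the file naming scheme are hypothetical.

build_skewer_qc() {
  local layout="$1" srr="$2" threads="$3"
  # Options shared by both layouts, as listed in the table above.
  local cmd="skewer -l 18 -q 10 -k inf -t $threads -o $srr ${srr}_1.fastq"
  if [ "$layout" = "PE" ]; then
    # Paired-end runs pass the second mate file as well.
    cmd="$cmd ${srr}_2.fastq"
  fi
  printf '%s\n' "$cmd"
}

# SRR0000001 is a placeholder accession used only for illustration.
build_skewer_qc SE SRR0000001 4
build_skewer_qc PE SRR0000001 4
```

Printing rather than executing keeps the example safe to run anywhere. A real wrapper would also need to handle file names containing spaces, which this string-based sketch does not.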

Reference genome information

The compendium relies on reference genome sequence and annotation information provided by Ensembl Genomes.

Species, with genome reference sequence and annotation:

Arabidopsis thaliana, Ensembl release 36
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Caenorhabditis elegans, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Drosophila melanogaster, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Danio rerio, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Escherichia coli, Ensembl release 36
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Homo sapiens, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Mus musculus, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Rattus norvegicus, Ensembl release 90
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)

Saccharomyces cerevisiae, Ensembl release 36
* Genome sequence (fasta)
* Gene annotation set (GTF)
* cDNA sequences (fasta)


About the quality metrics

A description of each of the quality metrics is provided on the GitHub page here.


Update schedule

We update the compendium fortnightly. Upon release of an updated genome build, we intend to update the data for that organism within a year, keeping a previously archived version for bulk download only. Gene annotation sets will not be updated independently of the genome build.