DEE2 Bulk Data

Bulk transcriptome data files

Bulk data dumps for each species are available for download over HTTP. The data are in HDF5 format, with three HDF5 files per species. STAR counts are in the file with the 'se.h5' suffix and are always integers. Kallisto estimated counts are in the file with the 'ke.h5' suffix and are usually floating-point numbers. QC metrics are in the file with the 'qc.h5' suffix.

HDF5 is well suited to storing and accessing large and complex datasets. Each HDF5 file contains three objects. "Bigmatrix" is the main data object and contains the gene expression data. In this matrix, the rows represent SRA runs and the columns represent genes. The other two objects hold the column names and the row names. HDF5 allows random access, so you can read selected rows or columns without loading the entire object into memory.

The checksums.md5 file is provided so that you can check that your downloads completed without corruption.
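
The check can be scripted. Below is a minimal Python sketch using only the standard library. It assumes that checksums.md5 follows the usual md5sum layout (one "checksum filename" pair per line), which is not confirmed here, and the function names are illustrative only.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse md5sum-style lines into a {filename: checksum} dict.

    Assumption: each line looks like '<md5>  <filename>' (md5sum layout).
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, name = line.split(maxsplit=1)
        # md5sum marks binary mode with a leading '*' on the file name
        expected[name.lstrip("*")] = checksum.lower()
    return expected


def verify_downloads(checksum_file, directory="."):
    """Return {filename: True/False} for listed files present in directory.

    Files listed in the checksum file but not downloaded are skipped.
    """
    expected = parse_checksums(Path(checksum_file).read_text())
    results = {}
    for name, checksum in expected.items():
        target = Path(directory) / name
        if target.exists():
            results[name] = md5_of_file(target) == checksum
    return results
```

Calling verify_downloads("checksums.md5", ".") in the download directory returns a dictionary; any file mapped to False should be downloaded again.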

For working with HDF5 data, I suggest the rhdf5 R package from Bioconductor.
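
For readers who work in Python rather than R, the random access described above can be sketched with the h5py package (a different library from rhdf5, used here only for illustration). The dataset name "Bigmatrix" is taken from the description above; the exact object names and their capitalisation inside a given file are an assumption, so list the objects first. The function names are hypothetical.

```python
import h5py


def list_objects(h5_path):
    """Return {name: shape} for every top-level dataset in the file."""
    with h5py.File(h5_path, "r") as handle:
        return {
            name: obj.shape
            for name, obj in handle.items()
            if isinstance(obj, h5py.Dataset)
        }


def read_runs(h5_path, row_indices, dataset="Bigmatrix"):
    """Read selected rows (SRA runs) without loading the whole matrix.

    Assumption: the matrix is stored under the name given by `dataset`.
    Rows are returned in increasing index order, because h5py requires
    sorted, non-repeating indices for this kind of selection.
    """
    rows = sorted({int(i) for i in row_indices})
    with h5py.File(h5_path, "r") as handle:
        return handle[dataset][rows, :]
```

Only the requested rows are read from disk, so memory use scales with the selection rather than with the full matrix.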

Bulk metadata files

There are three types of metadata files. Files with the "metadata.tsv" suffix are organized by individual SRR run and contain the corresponding SRA experiment, SRA project, experiment title and GEO information (if available). Files with the "metadata.tsv.cut" suffix are similar, but contain a reduced set of columns. Files with the "srp.tsv" suffix contain the metadata of SRA projects, including the abstract of each study.
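
As an illustration of how the run-level metadata might be used, the Python sketch below reads a tab-separated metadata file and collects the runs belonging to one SRA project. It assumes the file has a header row, and the column names "SRP_accession" and "SRR_accession" are placeholders that should be checked against the header of the actual file.

```python
import csv


def read_metadata(path):
    """Read a tab-separated metadata file into a list of dicts.

    Assumption: the first line of the file is a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


def runs_for_project(rows, project,
                     project_column="SRP_accession",
                     run_column="SRR_accession"):
    """Return the run accessions whose project column equals `project`.

    The default column names are hypothetical; pass the real ones
    after inspecting the file header.
    """
    return [row[run_column] for row in rows
            if row.get(project_column) == project]
```

The resulting list of run accessions can then be matched against the row names stored in the HDF5 files to select the corresponding expression data.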

Working with bulk transcriptome data files

A tutorial for working with DEE2's HDF5 files is currently on the GenomeSpot blog.
