Bulk transcriptome data files
Bulk data dumps for each species are available via HTTP here.
The data are in HDF5 format.
There are three HDF5 files for each species.
STAR counts have the suffix 'se.h5'. STAR counts will always be integers.
Kallisto estimated counts have the suffix 'ke.h5' and are usually
floating-point numbers.
QC metrics are available in the file with the 'qc.h5' suffix.
HDF5 is well suited to storing and accessing large and complex datasets.
Each of the HDF5 files contains three objects.
"Bigmatrix" is the main data object and contains the gene expression data.
In this matrix, the rows represent SRA runs and the columns represent genes.
The other two objects are the column names and row names.
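To see what a downloaded file actually contains, it can help to list its objects before reading anything. The following is a minimal Python sketch using the h5py library, offered as one possible approach rather than the recommended workflow. The function name describe_hdf5 is hypothetical, and the exact object names stored in the files are not assumed here; the function simply reports whatever it finds.

```python
import h5py


def describe_hdf5(path):
    """Return {object_name: (shape, dtype)} for every dataset in an HDF5 file."""
    summary = {}

    def record(name, obj):
        # visititems walks both groups and datasets; keep datasets only.
        if isinstance(obj, h5py.Dataset):
            summary[name] = (obj.shape, str(obj.dtype))

    # Opening the file read-only does not load any matrix data into memory.
    with h5py.File(path, "r") as f:
        f.visititems(record)
    return summary
```

For example, calling describe_hdf5 on a STAR counts file (say a hypothetical "myspecies_se.h5") should list three entries: the main matrix plus the row and column names, with the matrix shape given as (runs, genes).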
HDF5 allows random access, so you can read part of an object without loading all of it into memory.
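As an illustration of random access, the sketch below reads a block of rows (SRA runs) from the main matrix while leaving the rest on disk. It uses Python with h5py. The function name read_runs is hypothetical, and the default dataset name "bigmatrix" is an assumption about the on-disk spelling of the "Bigmatrix" object; check the names in your own file and pass the right one if it differs.

```python
import h5py


def read_runs(path, start, stop, dataset="bigmatrix"):
    """Read rows [start, stop) of the expression matrix as a NumPy array.

    The dataset name is an assumption; pass the actual name if it differs.
    """
    with h5py.File(path, "r") as f:
        matrix = f[dataset]           # a handle only; no data is read yet
        return matrix[start:stop, :]  # only the requested rows are read from disk
```

Because rows are runs and columns are genes, read_runs(path, 0, 100) would return the counts for the first 100 runs across all genes. Slicing on the second axis instead (matrix[:, j]) would pull out a single gene across all runs.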
The checksums.md5 file is provided so that you can check that your downloads
completed correctly.
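One way to run that check is sketched below in Python, using only the standard library. It assumes checksums.md5 follows the usual md5sum layout (one "hash filename" pair per line) and that the downloaded files sit in the same directory as the checksum file; both are assumptions, not details confirmed here. The function names md5_of and verify_downloads are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(checksum_file):
    """Return {filename: True, False, or None}; None means the file is absent.

    Assumes md5sum-style lines: '<hex digest>  <filename>'.
    """
    checksum_path = Path(checksum_file)
    results = {}
    for line in checksum_path.read_text().splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        expected, name = parts
        name = name.strip().lstrip("*")  # md5sum marks binary mode with '*'
        target = checksum_path.parent / name
        if target.exists():
            results[name] = md5_of(target) == expected.lower()
        else:
            results[name] = None
    return results
```

Any entry that comes back False points to a corrupted or incomplete download that should be fetched again. On systems with GNU coreutils, running md5sum -c checksums.md5 in the download directory performs the same comparison.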
For working with HDF5 data in R, I suggest the rhdf5 package from Bioconductor.