In addition to providing the largest publicly available set of processed RNA-seq data, we are trialing the sharing of gene signatures. These gene signatures have been specifically curated to help characterise human diseases including diabetes, heart disease, epilepsy and viral infection (SARS/MERS/SARS-CoV-2). These gene signatures are useful for enrichment analysis with tools such as GSEA and mitch. They are in GMT format.
| Disease focus (number of sets) | Gene symbols | Ensembl IDs |
|---|---|---|
| Heart disease (50) | 116K | 264K |
| SARS, MERS and SARS-CoV-2 (68) | 120K | 284K |
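GMT is a plain-text, tab-separated format: each line holds a set name, a description, and then the member genes. For readers who want to inspect a library outside GSEA or mitch, the following is a minimal Python sketch of a reader. The function name `parse_gmt` and the toy set names are illustrative assumptions, not part of the DEE2 tooling.

```python
from typing import Dict, Iterable, List


def parse_gmt(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Parse GMT-formatted lines into {set name: [gene identifiers]}.

    Each GMT line is tab-separated: name, description, gene1, gene2, ...
    Lines with fewer than three fields (for example blank lines) are skipped.
    """
    gene_sets: Dict[str, List[str]] = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue
        name, _description, *genes = fields
        # Drop empty strings left behind by trailing tabs.
        gene_sets[name] = [gene for gene in genes if gene]
    return gene_sets


# Toy in-memory example; the set names and genes are made up for illustration.
toy_lines = [
    "example_set_up\tNA\tTP53\tBRCA1\tMYC\n",
    "example_set_down\tNA\tGAPDH\tACTB\n",
]
toy_sets = parse_gmt(toy_lines)
```

An open file handle can be passed directly, since iterating over it yields lines: `with open(path) as handle: sets = parse_gmt(handle)`, where `path` points to a downloaded GMT file.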
The code used was deposited to GitHub. To create these gene sets, our curation team searched for GEO series that are related to our main disease focuses and satisfy a number of criteria:
1. The study is relevant to the disease.
2. The study involves samples that can be compared, for example control and disease groups.
3. The experiment is replicated, meaning n>2 for each condition after QC filtering.
4. The data are available at DEE2 and pass QC filtering. QC "FAIL" data sets are excluded.
For the studies that fit the criteria, the contrasts are defined in a markdown file. Each entry gives the SRA project ID (SRP), the contrast name, and the SRA experiment IDs (SRXs) of the control and case datasets:
SRP: Contrast name: Control group name; SRX(control1),SRX(control2),SRX(controlN): Case group name; SRX(case1),SRX(case2),SRX(caseN)
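A line in this layout can be split mechanically: colons separate the four fields, a semicolon separates a group name from its SRX list, and commas separate the SRX accessions. The Python sketch below illustrates that reading of the format. It is not the parser from the repository (which is written in R), and the names `Contrast` and `parse_contrast` are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Contrast(NamedTuple):
    srp: str
    name: str
    control_name: str
    control_srx: List[str]
    case_name: str
    case_srx: List[str]


def _parse_group(field: str) -> Tuple[str, List[str]]:
    """Split 'Group name; SRX1,SRX2' into the name and a list of accessions."""
    name, _, ids = field.partition(";")
    return name.strip(), [srx.strip() for srx in ids.split(",") if srx.strip()]


def parse_contrast(line: str) -> Contrast:
    """Parse one contrast definition line into its components."""
    parts = line.strip().split(":")
    if len(parts) < 4:
        raise ValueError(f"expected at least 4 colon-separated fields: {line!r}")
    # The first field is the SRP and the last two are the groups; anything in
    # between is treated as the contrast name (assumption: this tolerates a
    # colon inside the name).
    srp = parts[0].strip()
    name = ":".join(parts[1:-2]).strip()
    control_name, control_srx = _parse_group(parts[-2])
    case_name, case_srx = _parse_group(parts[-1])
    return Contrast(srp, name, control_name, control_srx, case_name, case_srx)


# The SRP233503 entry from this page, used as sample input.
example = (
    "SRP233503:Genes differentially regulated by free fatty acids in HUVECs cells:"
    "Ctrl; SRX7228895,SRX7228896,SRX7228897:"
    "FFA; SRX7228898,SRX7228899,SRX7228900"
)
contrast = parse_contrast(example)
```

Here `contrast.control_srx` and `contrast.case_srx` each hold three accessions, which is the information the pipeline needs to assign samples to groups.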
For example:

SRP233503:Genes differentially regulated by free fatty acids in HUVECs cells:Ctrl; SRX7228895,SRX7228896,SRX7228897:FFA; SRX7228898,SRX7228899,SRX7228900

Next, the contrasts are analysed using a purpose-built pipeline in R (version 4.0.2). In the "de_analysis" folder there are Rmd files that can be used to regenerate the gene set library for each theme. These Rmd files use a common set of functions for differential analysis, which are found in the "de_functions.R" file.

First, the contrast information is parsed so that the control and case data are identified. Next, the gene expression count data are obtained from DEE2 using the "getDEE2" package. Any SRA runs with a QC summary of "FAIL" are excluded. The remaining datasets are aggregated from runs to experiments. Genes with fewer than 10 reads per sample on average are excluded, and then samples with fewer than 1000 expressed genes are removed. If fewer than two replicates per group remain after this, the contrast is discarded. DESeq2 (version 1.28.1) is then used for differential expression analysis. Genes with FDR<0.05 are divided into up- and down-regulated groups. Sets with fewer than 10 genes are discarded. The remaining sets are aggregated and written in GMT format.
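The last steps of the pipeline (splitting significant genes by direction, discarding small sets, and formatting GMT lines) can be sketched as follows. This is an illustrative Python sketch, not the R code from "de_functions.R". The function names, the input dictionary of per-gene results, the set-naming scheme and the "NA" description field are all assumptions made for the example. Only the two thresholds come from the text.

```python
from typing import Dict, List, Tuple

FDR_CUTOFF = 0.05   # significance threshold stated in the text
MIN_SET_SIZE = 10   # sets with fewer genes are discarded, as stated in the text

# Hypothetical input shape: gene identifier -> (log2 fold change, FDR).
Results = Dict[str, Tuple[float, float]]


def split_by_direction(results: Results) -> Tuple[List[str], List[str]]:
    """Return (up, down) lists of genes with FDR below the cutoff."""
    up = sorted(g for g, (lfc, fdr) in results.items() if fdr < FDR_CUTOFF and lfc > 0)
    down = sorted(g for g, (lfc, fdr) in results.items() if fdr < FDR_CUTOFF and lfc < 0)
    return up, down


def build_gmt_lines(srp: str, contrast_name: str, results: Results) -> List[str]:
    """Format the up and down sets of one contrast as GMT lines.

    Each line is: set name, description, then genes, all tab-separated.
    A direction with fewer than MIN_SET_SIZE genes produces no line.
    """
    up, down = split_by_direction(results)
    lines = []
    for direction, genes in (("up", up), ("down", down)):
        if len(genes) < MIN_SET_SIZE:
            continue
        set_name = f"{srp} {contrast_name} {direction}"  # assumed naming scheme
        lines.append("\t".join([set_name, "NA"] + genes))
    return lines
```

Aggregating a library then amounts to concatenating the lines from every contrast and writing them to a single file, one set per line.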
Aaron Kovacs, Chelsia Sritharan, Haris Lekovic and Mark Ziemann. If you would like to contribute, get in touch via GitHub.