DEE2 gene signatures

**Under construction**

In addition to providing the largest publicly available set of processed RNA-seq data, we are trialing the sharing of gene signatures. These gene signatures have been specifically curated to help characterise human diseases including diabetes, heart disease, epilepsy and viral infection (SARS/MERS/SARS-CoV-2). These gene signatures are useful for enrichment analysis with tools such as GSEA and mitch. They are in GMT format.
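As a rough illustration of the GMT layout (one gene set per line, tab-separated: set name, a description field, then the gene identifiers), here is a minimal Python sketch of a reader. The function name `parse_gmt` and the example set names and genes are made up for illustration; they are not taken from the DEE2 files.

```python
def parse_gmt(text):
    """Parse GMT-formatted text into a dict of set name -> list of gene IDs."""
    gene_sets = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # a usable line needs a name, a description and >= 1 gene
        name = fields[0]
        # fields[1] is the free-text description column, ignored here
        gene_sets[name] = [g for g in fields[2:] if g]
    return gene_sets


# Hypothetical two-set example (names and genes are placeholders).
example = (
    "demo_contrast_up\tna\tGENE1\tGENE2\tGENE3\n"
    "demo_contrast_down\tna\tGENE4\tGENE5\n"
)
sets = parse_gmt(example)
print(sorted(sets))
```

The resulting dictionary can then be passed to whatever enrichment tool is in use, after converting it to that tool's expected input.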

Download
Disease focus (Number of sets)	Gene symbols	Ensembl IDs
Epilepsy (12)	20K	44K
Diabetes (132)	312K	716K
Heart disease (50)	116K	264K
SARS, MERS and SARS-CoV-2 (68)	120K	284K

Set analysis

We analysed these sets; see the data here. To check whether the gene sets overlap unreasonably with existing sets in MSigDB, we calculated the Jaccard index of each set against every set in MSigDB (link). The highest Jaccard index was 0.39, between a SARS-CoV-2 set and an MSigDB set.
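For reference, the Jaccard index of two sets is the size of their intersection divided by the size of their union. The Python sketch below shows the calculation and a best-match search against a reference collection; the helper names and the toy gene lists are assumptions for illustration, not the code used for the analysis above.

```python
def jaccard(a, b):
    """Jaccard index: |A intersect B| / |A union B|, defined as 0.0 for two empty sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def max_jaccard(query, reference_sets):
    """Return (name, score) of the reference set most similar to `query`."""
    best_name, best_score = None, 0.0
    for name, genes in reference_sets.items():
        score = jaccard(query, genes)
        if score > best_score:
            best_name, best_score = name, score
    return best_name, best_score


# Toy example with placeholder gene names.
reference = {"ref_a": ["G1", "G2"], "ref_b": ["G2", "G3", "G4"]}
print(max_jaccard(["G2", "G3"], reference))
```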

Method summary

The code used has been deposited on GitHub. To create these gene sets, our curation team searched for GEO series related to our main disease focuses that satisfied the following criteria:

1. The study is relevant to the disease.

2. The study involves samples that can be compared, for example control and disease groups.

3. The experiment is replicated, meaning n>2 for each condition after QC filtering.

4. The data is available at DEE2 and passes QC filtering. QC "FAIL" data sets are excluded.

For studies that meet the criteria, each contrast is recorded in a markdown file using the SRA project ID (SRP) and the SRA experiment IDs (SRXs) of the control and case datasets, in the following format:

SRP: Contrast name: Control group name; SRX(control1),SRX(control2),SRX(controlN): Case group name; SRX(case1),SRX(case2),SRX(case3)

For example:

SRP233503:Genes differentially regulated by free fatty acids in HUVECs cells:Ctrl; SRX7228895,SRX7228896,SRX7228897:FFA; SRX7228898,SRX7228899,SRX7228900
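To make the layout of a contrast line concrete, the following Python sketch splits one into its parts. It assumes the SRP field and the group fields contain no stray colons, and it accepts either a comma or a semicolon between SRX accessions. The function names and the returned dictionary keys are hypothetical; the actual parsing is done in R by the project's own functions.

```python
import re


def parse_group(field):
    """Split 'Label; SRX1,SRX2' into (label, [SRX1, SRX2])."""
    label, ids = field.split(";", 1)
    accessions = [s.strip() for s in re.split(r"[,;]", ids) if s.strip()]
    return label.strip(), accessions


def parse_contrast(line):
    """Parse 'SRP:Contrast name:Control; SRX,...:Case; SRX,...' into a dict."""
    srp, rest = line.strip().split(":", 1)
    # Split from the right so the last two fields are the two groups.
    name, control_field, case_field = rest.rsplit(":", 2)
    control_label, control_ids = parse_group(control_field)
    case_label, case_ids = parse_group(case_field)
    return {
        "srp": srp.strip(),
        "name": name.strip(),
        "control": {"label": control_label, "srx": control_ids},
        "case": {"label": case_label, "srx": case_ids},
    }


line = (
    "SRP233503:Genes differentially regulated by free fatty acids in HUVECs cells:"
    "Ctrl; SRX7228895,SRX7228896,SRX7228897:FFA; SRX7228898,SRX7228899,SRX7228900"
)
contrast = parse_contrast(line)
print(contrast["srp"], contrast["case"]["label"], len(contrast["control"]["srx"]))
```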

Next, the contrasts are analysed using a purpose-built pipeline in R (version 4.0.2). The "de_analysis" folder contains Rmd files that can be used to regenerate the gene set library for each theme. These Rmd files share a common set of differential analysis functions, found in the "de_functions.R" file. First, the contrast information is parsed to identify the control and case data. Next, the gene expression count data are obtained from DEE2 using the "getDEE2" package. Any SRA runs with a QC summary of "FAIL" are excluded, and the remaining data are aggregated from runs to experiments. Genes with fewer than 10 reads per sample on average are excluded, then samples with fewer than 1000 expressed genes are removed. If fewer than two replicates per group remain, the contrast is discarded. DESeq2 (version 1.28.1) is then used for differential expression analysis. Genes with FDR<0.05 are divided into up- and down-regulated sets, and sets with fewer than 10 genes are discarded. Finally, the sets are aggregated and written in GMT format.
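The filtering and set-building steps described above can be sketched in Python as follows. This is only an illustration of the thresholds, not the R pipeline itself: the differential expression step is left out, "expressed" is assumed to mean a non-zero count, and the function names, the "_up"/"_down" suffixes and the "na" description are hypothetical.

```python
def keep_genes(counts, min_mean=10):
    """counts: dict gene -> list of per-sample counts. Keep genes with mean >= min_mean."""
    return {g: row for g, row in counts.items()
            if row and sum(row) / len(row) >= min_mean}


def keep_samples(counts, n_samples, min_expressed=1000):
    """Indices of samples with at least min_expressed genes with a non-zero count
    (assumption: 'expressed' means count > 0)."""
    kept = []
    for j in range(n_samples):
        expressed = sum(1 for row in counts.values() if row[j] > 0)
        if expressed >= min_expressed:
            kept.append(j)
    return kept


def enough_replicates(group_sizes, min_reps=2):
    """A contrast is kept only if every group has at least min_reps samples."""
    return all(n >= min_reps for n in group_sizes)


def split_signature(results, name, fdr_cutoff=0.05, min_size=10):
    """results: iterable of (gene, log2_fold_change, fdr) from a DE analysis.
    Returns up/down sets, dropping any set smaller than min_size."""
    results = list(results)
    up = [g for g, lfc, fdr in results if fdr < fdr_cutoff and lfc > 0]
    down = [g for g, lfc, fdr in results if fdr < fdr_cutoff and lfc < 0]
    sets = {name + "_up": up, name + "_down": down}
    return {k: v for k, v in sets.items() if len(v) >= min_size}


def to_gmt(gene_sets, description="na"):
    """Serialise sets as GMT text: name, description, then genes, tab-separated."""
    return "".join("\t".join([name, description] + genes) + "\n"
                   for name, genes in gene_sets.items())


# Placeholder DE results: ten significant up genes, one significant down gene.
demo = [("G%d" % i, 1.0, 0.01) for i in range(10)] + [("H1", -1.0, 0.01)]
signature = split_signature(demo, "demo")
print(to_gmt(signature), end="")
```

In this toy run the down-regulated set has a single gene, so it is dropped by the minimum-size rule and only the up-regulated set is written.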

Contributors

Aaron Kovacs, Chelsia Sritharan, Haris Lekovic and Mark Ziemann. If you would like to contribute, get in touch via GitHub.