Abstract

RummaGEO

The Gene Expression Omnibus (GEO) is a major open biomedical research repository for transcriptomics and other omics datasets. It currently contains millions of gene expression samples from tens of thousands of studies collected by biomedical research laboratories around the world. While users of the GEO repository can search the metadata describing studies and samples to locate relevant studies, there is currently no method or resource that facilitates global search of GEO at the data level. To address this shortcoming, we developed RummaGEO, a web server application that enables gene expression signature search against all human and mouse RNA-seq studies deposited into GEO. To enable such a search engine, we performed offline automatic identification of conditions from uniformly aligned GEO studies available from ARCHS4, and then computed differential expression signatures to extract gene sets from these signatures. Overall, RummaGEO provides an unprecedented resource for the biomedical research community, enabling hypothesis generation for many future studies.

Methods

We considered any GEO study aligned by ARCHS4 with at least three samples per condition and at least six samples in total. Studies with more than 50 samples were discarded because such studies typically contain patient data that is not amenable to a simple signature computation that compares two conditions. Samples were grouped using metadata provided by the GEO study. Specifically, K-means clustering of an embedding of the concatenated sample title, characteristic_ch1, and source_ch1 fields was used to classify conditions. To create condition titles, the words common to all samples in each condition were retained.

Limma voom was used to compute differential expression signatures for each condition against all other conditions within each study. Additionally, we first attempted to identify control conditions based on the metadata and a discrete list of keywords that describe control conditions, for example, “wildtype”, “ctrl”, or “DMSO”. If such terms were identified, the samples labeled with them were compared to all other condition groups. Up and down gene sets were extracted from each signature from the genes with an adjusted p-value of less than 0.05. If fewer than five genes met this threshold, the gene set was discarded. If more than 2000 genes met this threshold, the threshold was lowered stepwise to 0.01, 0.005, and then 0.001 until fewer than 2000 genes were retained.

Finally, we calculated a data-level confidence score for the condition groups: a silhouette score based on a PCA of the normalized expression data, where a value of 1 indicates perfect clustering and -1 indicates poor clustering.
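The filtering and gene set extraction rules above can be illustrated with a minimal Python sketch. This is not the RummaGEO source code: every function name, the three-item keyword list, and the input layout (a signature as a mapping from gene to log fold change and adjusted p-value) are assumptions made for illustration. The sketch also assumes that the size limits apply to each direction separately, and that a set still holding 2000 or more genes at the strictest cutoff is kept; the description above does not specify either point.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical keyword list; the text names only these three examples.
CONTROL_KEYWORDS = ("wildtype", "ctrl", "dmso")
# Adjusted p-value cutoffs, tried from loosest to strictest.
THRESHOLDS = (0.05, 0.01, 0.005, 0.001)


def is_eligible(condition_sizes: Sequence[int]) -> bool:
    """Study filter: at least 3 samples per condition, 6 to 50 samples in total."""
    if not condition_sizes:
        return False
    total = sum(condition_sizes)
    return min(condition_sizes) >= 3 and 6 <= total <= 50


def condition_title(sample_descriptions: Sequence[str]) -> str:
    """Keep the words shared by every sample, in the order of the first sample."""
    if not sample_descriptions:
        return ""
    word_lists = [d.lower().split() for d in sample_descriptions]
    shared = set(word_lists[0]).intersection(*word_lists[1:])
    # dict.fromkeys removes repeated words while preserving order.
    return " ".join(w for w in dict.fromkeys(word_lists[0]) if w in shared)


def find_control(titles: Dict[str, str]) -> Optional[str]:
    """Return the id of the first condition whose title contains a control keyword."""
    for condition_id, title in titles.items():
        if any(keyword in title.lower() for keyword in CONTROL_KEYWORDS):
            return condition_id
    return None


def extract_gene_sets(
    signature: Dict[str, Tuple[float, float]],
    min_size: int = 5,
    max_size: int = 2000,
) -> Dict[str, Optional[List[str]]]:
    """Split a signature into up and down gene sets.

    `signature` maps gene -> (log fold change, adjusted p-value).
    A direction maps to None when fewer than `min_size` genes pass.
    """
    directions = (("up", lambda lfc: lfc > 0), ("dn", lambda lfc: lfc < 0))
    result: Dict[str, Optional[List[str]]] = {}
    for name, in_direction in directions:
        genes: List[str] = []
        for cutoff in THRESHOLDS:
            genes = [
                gene
                for gene, (lfc, adj_p) in signature.items()
                if in_direction(lfc) and adj_p < cutoff
            ]
            if len(genes) < max_size:
                break  # small enough, stop tightening the cutoff
        result[name] = sorted(genes) if len(genes) >= min_size else None
    return result
```

For example, a signature with 2500 up-regulated genes at an adjusted p-value of 0.04 and ten at 0.0005 would pass only the ten genes, because the 0.05 cutoff retains too many genes and the 0.01 cutoff is tried next.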


For more information about using RummaGEO, please refer to the User Manual.


This database is updated with new releases of ARCHS4.


This site is programmatically accessible via a GraphQL API.


RummaGEO is actively being developed by the Ma'ayan Lab.


Please acknowledge RummaGEO in your publications by citing the following reference:

Marino, G.B., Clarke, D.J.B., Lachmann, A., Deng, E.Z., and Ma’ayan, A. RummaGEO: Automatic mining of human and mouse gene sets from GEO. Patterns 5(10), 101072 (October 11, 2024).

https://www.cell.com/patterns/fulltext/S2666-3899(24)00231-9


RummaGEO is protected under the "BSD Source Code Attribution" license (see below).


Copyright (c) 2024, Ma'ayan Lab, Icahn School of Medicine at Mount Sinai

All rights reserved.


Redistribution and use of this software in source and binary forms, with or without modification, are permitted provided that the following conditions are met:


* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

* Neither the name of "Ma'ayan Lab, Icahn School of Medicine at Mount Sinai" nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission from Ma'ayan Lab, Icahn School of Medicine at Mount Sinai.


THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.