Abstract 2P
Background
NCBI's Gene Expression Omnibus (GEO) is a major repository for high-throughput transcriptomic datasets. It currently contains approximately 7,000,000 transcriptomic profiles spread across more than 200,000 datasets, of which around 50,000 are related to cancer. Secondary analysis of these datasets holds vast potential to unlock new biological understanding and shape future clinical study designs. However, the high heterogeneity of the data and the limited browsing features of the repository’s website pose significant challenges, particularly in oncology research.
Methods
Here, we introduce a solution that uses a tagging approach to characterize GEO datasets. It focuses on the clinical description (metadata) of the sample transcriptomic profiles included in the datasets, and detects multiple criteria (e.g., patient vs. cell line, donor type, overall survival, cancer type). The approach combines natural language processing techniques (e.g., named entity recognition and normalization) with machine learning classifiers and rule-based classifiers developed in collaboration with clinical and molecular oncology experts.
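As an illustration of the rule-based part of such a tagging approach, the following is a minimal sketch of a "patient vs. cell line" tagger applied to free-text sample metadata. The keyword patterns, the field names, and the `tag_sample` function are hypothetical and chosen only for this example; the actual rules and models are not described in this abstract.

```python
import re

# Hypothetical keyword rules, for illustration only.
# The rules actually used in the study are not given in the abstract.
CELL_LINE_PATTERN = re.compile(
    r"\b(cell line|hela|mcf-?7|a549|passage)\b", re.IGNORECASE
)
PATIENT_PATTERN = re.compile(
    r"\b(patient|biopsy|tumou?r tissue|resection|donor)\b", re.IGNORECASE
)


def tag_sample(metadata: dict) -> str:
    """Tag one sample as 'patient', 'cell_line', or 'unknown'.

    `metadata` maps (assumed) field names to free-text values.
    """
    # Concatenate all free-text fields into a single searchable string.
    text = " ".join(str(value) for value in metadata.values())
    has_cell_line = bool(CELL_LINE_PATTERN.search(text))
    has_patient = bool(PATIENT_PATTERN.search(text))

    # Only commit to a tag when exactly one rule family matches;
    # conflicting or missing evidence is left for a downstream classifier.
    if has_cell_line and not has_patient:
        return "cell_line"
    if has_patient and not has_cell_line:
        return "patient"
    return "unknown"


if __name__ == "__main__":
    example = {"source_name": "MCF-7 cell line", "characteristics": "passage 12"}
    print(tag_sample(example))
```

In this sketch, samples with conflicting evidence (e.g., a patient-derived cell line) are deliberately left as "unknown", which is one plausible place for a machine learning classifier to take over from the rules.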
Results
The tagging models demonstrate high performance, with Area Under the Receiver Operating Characteristic Curve (AUC) values exceeding 0.95, 0.80, and 0.90 for identifying patient-derived samples, donor type, and overall survival information, respectively. Our cancer type classifier achieves a weighted average F1 score exceeding 0.90 across 21 cancer histologies. Finally, a user-friendly dashboard offers insights into GEO's cancer content, its evolution over time, and breakdowns by features such as technology, platform, and cancer type.
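For readers less familiar with the metrics reported above, the following sketch shows how a weighted average F1 score and an AUC can be computed from predictions. It uses toy labels and scores invented for this example (not study data), and the function names `weighted_f1` and `roc_auc` are our own; the abstract does not state which software was used for evaluation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


def roc_auc(labels, scores):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    y_true = ["breast", "breast", "lung", "lung", "lung", "colon"]
    y_pred = ["breast", "lung", "lung", "lung", "lung", "colon"]
    print(round(weighted_f1(y_true, y_pred), 4))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1]))
```

The weighting by class support means that frequent cancer types contribute more to the reported F1 than rare ones, which is worth keeping in mind when interpreting a single score across 21 histologies.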
Conclusions
Our tagging approach substantially enhances the exploration and annotation of GEO datasets, thus facilitating the secondary analysis of cancer-related data. Coupled with dataset aggregation, this solution is not only useful for dealing with scarce data (e.g., in the context of rare cancers) but also scalable to abundant data sources.
Clinical trial identification
Editorial acknowledgement
Legal entity responsible for the study
Epigene Labs.
Funding
Epigene Labs.
Disclosure
V. Bernu, C. Lescure, H. Brull Corretger, E. Fox, C. Marijon, C. Petit, A. Behdenna: Financial Interests, Personal, Stocks/Shares, Stock options: Epigene Labs. P. Dhillon: Financial Interests, Personal, Invited Speaker, Stock options: Epigene Labs. A. Nordor: Financial Interests, Personal, Stocks/Shares, Stock options: Epigene Labs; Financial Interests, Personal, Stocks/Shares, Founder stocks: Epigene Labs.