Introduction

Oncoscape is a data visualization platform that empowers researchers to discover novel patterns and relationships between clinical and molecular data. Through a suite of interoperable tools, Oncoscape offers a unique and intuitive approach to hypothesis refinement. For more detailed information, please see the GitHub repository page.

Data Content

This section describes the raw data that Oncoscape uses. The section Data Provenance explains how the raw data have been processed to fit our visualization model.

Clinical Data

Data Sources

The Genomic Data Commons (GDC) Data Portal from the National Institutes of Health (NIH) provides the compiled, annotated clinical data.

GDC clinical data

Collection | Description
events | clinical events collection organized by patient
patient | patient collection for each disease type
drug | chemo or other medicine administration records
newtumor | new tumor event records, where available
othermalignancy | other malignancy records, where available
radiation | radiation administration records
followup | follow-up records, where available
newtumor-followup | follow-up records for the new tumor events, where available
samplemap | sample-patient mapping collection

Molecular Data

Data Sources

UCSC Xena provides compiled, annotated, and normalized molecular datasets from various platforms and multiple institutes.

UCSC xena hub

UCSC xena github

SARC GEO Data

GEO | Title | Publication | Samples | Platform
GSE12102 | Overcoming resistance to conventional drugs in Ewing’s sarcoma and identification of molecular predictors of outcome | Scotlandi K, Remondini D, Castellani G, Manara MC et al. Overcoming resistance to conventional drugs in Ewing sarcoma and identification of molecular predictors of outcome. J Clin Oncol 2009 May 1;27(13):2209-16. PMID: 19307502 | 37 | Affy U133Plus2.0
GSE16102 | Gene expression profiles of canine and human osteosarcoma | Scotlandi K, Remondini D, Castellani G, Manara MC et al. Overcoming resistance to conventional drugs in Ewing sarcoma and identification of molecular predictors of outcome. J Clin Oncol 2009 May 1;27(13):2209-16. PMID: 19307502 | 57 | Affy U133A
GSE20196 | Gene expression profile of poorly differentiated synovial sarcoma | Nakayama R, Mitani S, Nakagawa T, Hasegawa T et al. Gene expression profiling of synovial sarcoma: distinct signature of poorly differentiated type. Am J Surg Pathol 2010 Nov;34(11):1599-607. PMID: 20975339 | 34 | Affy U133Plus2.0
GSE21050 | Expression data from Complex genetics sarcomas (cohort 1 and 2) | Chibon F, Lagarde P, Salas S, Pérot G et al. Validated prediction of clinical outcome in sarcomas and multiple types of cancer on the basis of a gene expression signature related to genome complexity. Nat Med 2010 Jul;16(7):781-7. PMID: 20581836 | 310 | Affy U133Plus2.0
GSE21122 | Whole-transcript expression data for soft-tissue sarcoma tumors and control normal fat specimens | Barretina J, Taylor BS, Banerji S, Ramos AH et al. Subtype-specific genomic alterations define new targets for soft-tissue sarcoma therapy. Nat Genet 2010 Aug;42(8):715-21. PMID: 20601955 | 158 | Affy U133A
GSE23980 | Expression data from human soft tissue sarcomas with complex genomics | Gibault L, Ferreira C, Pérot G, Audebourg A et al. From PTEN loss of expression to RICTOR role in smooth muscle differentiation: complex involvement of the mTOR pathway in leiomyosarcomas and pleomorphic sarcomas. Mod Pathol 2012 Feb;25(2):197-211. PMID: 22080063 | 171 | Affy U133Plus2.0
GSE30929 | Whole-transcript expression data for liposarcoma | Gobble RM, Qin LX, Brill ER, Angeles CV et al. Expression profiling of liposarcoma yields a multigene predictor of patient outcome and identifies genes that contribute to liposarcomagenesis. Cancer Res 2011 Apr 1;71(7):2697-705. PMID: 21335544 | 140 | Affy U133A
GSE6481 | Gene Expression Analysis of Soft Tissue Sarcomas: Characterization & Reclassification of Malignant Fibrous Histiocytoma | Nakayama R, Nemoto T, Takahashi H, Ohta T et al. Gene expression analysis of soft tissue sarcomas: characterization and reclassification of malignant fibrous histiocytoma. Mod Pathol 2007 Jul;20(7):749-59. PMID: 17464315 | 105 | Affy U133A

Data Type

Type | Annotation
expr | Expression data, including mRNA and microRNA expression data and reverse phase protein array (RPPA) data
mut | Non-synonymous mutations represented as strings in this collection
mut01 | Non-synonymous mutations represented as binary values in this collection
meth | DNA methylation data
meth_thd | Thresholded DNA methylation data
cnv | DNA copy-number data represented as GISTIC scores
cnv_thd | Thresholded DNA copy-number data represented as GISTIC scores

Schema

Schema Type | Annotation
chr_sample | Collections of this schema have chromosomal location information as the key for each record, which is a list of values with samples as keys.
hugo_sample | Collections of this schema have HUGO gene symbols as the key for each record, which is a list of values with samples as keys.
sample_pos | Collections of this schema have samples as the key for each record, which is a list of positions.
methoprobe_sample | Collections of this schema have methylation probes as the key for each record, which is a list of values with samples as keys.
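
As a rough illustration of the hugo_sample layout, the Python sketch below models one record as a gene key plus a mapping from sample to value, and shows how such records could be pivoted to a per-sample view. The field names `id` and `data`, the sample names, and the values are assumptions made for this example; they are not taken from the Oncoscape database.

```python
# Hypothetical record: the field names "id" and "data" and all values are
# illustrative assumptions, not the actual Oncoscape field names.
hugo_sample_record = {
    "id": "TP53",                                 # HUGO gene symbol (record key)
    "data": {"sample-A": 1.8, "sample-B": -0.4},  # values keyed by sample
}


def value_for(record, sample):
    """Return the value stored for one sample, or None if it is absent."""
    return record["data"].get(sample)


def to_sample_major(records):
    """Pivot gene-keyed records into {sample: {gene: value}}."""
    by_sample = {}
    for record in records:
        for sample, value in record["data"].items():
            by_sample.setdefault(sample, {})[record["id"]] = value
    return by_sample
```

The same shape would apply to chr_sample and methoprobe_sample records, with a chromosomal location or a methylation probe in place of the gene symbol.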

Gene Sets

Name | Description | Genes
TCGA GBM Classifiers | Gene expression-based molecular classification of GBM subtypes (Proneural, Neural, Classical, Mesenchymal) | 840
Glioma Markers | Genes recurrently impacted in TCGA gliomas | 545
TCGA Pancan Mutated | Significantly mutated genes according to the TCGA PANCAN working group (syn1750331) identified by both MuSiC and MutSig | 73
Oncoplex Vogelstein | Combined set from the Oncoplex gene panel and driver genes described in Vogelstein, Science 2013. | 274
Oncoplex | A sequencing panel that detects mutations in genes related to cancer treatment, prognosis, and diagnosis. | 263
OSCC Expression Markers | Differentially expressed probe set comparing normal oral tissue to oral squamous cell carcinoma | 109
Breast PAM50 | Gene expression based subtype predictor for subtypes luminal A, luminal B, HER2-enriched, and basal-like | 50
Breast Tumor Intrinsic Classifier | Meta analysis of available breast cancer gene expression datasets grouping LumA, LumB, Basal-like, HER2+/ER-, and Normal Breast-like tumor subtypes | 1232
FoundationOne Heme | FoundationOne® Heme is designed to analyze and interpret sequence information for somatically altered genes in human hematologic malignancies (leukemias, lymphomas, and myelomas), and sarcomas. Genes included in this assay encode known or likely targets of therapies, either approved or in clinical trials, or otherwise known drivers of oncogenesis. | 593
TCGA Sarcoma alterations | Frequently mutated genes in soft-tissue sarcoma subtypes | 21
Leiomyosarcoma molecular subtypes | Three molecular subtypes of leiomyosarcoma were confirmed in 2 publicly available datasets. Subtype I LMS is associated with good outcome in extrauterine LMS while subtype II LMS is associated with poor prognosis in both uterine and extrauterine LMS. A subset of the biomarkers is used here, based on the genes mentioned in the publication. | 13
Sarcoma markers | Compiled from multiple publications on different sarcoma subtypes | 48
Sarcoma markers Heme | Compiled from multiple publications on different sarcoma subtypes & intersected with FoundationOne Heme | 24
Sarcoma markers Oncoplex | Compiled from multiple publications on different sarcoma subtypes & intersected with Oncoplex | 19
Sarcoma CINSARC | Prognostic gene expression signature, complexity index in sarcomas (CINSARC), composed of 67 genes related to mitosis and chromosome management and established by genomic and expression profiling of a training set of 183 sarcomas | 67

Data Provenance

This section explains how the raw data were processed into new data models that fit the Oncoscape visualization tools. The section Data Content explains the source and type of the raw data.

Pipeline

Data Processing Pipeline

Oncoscape Interface

We use a lookup collection as the hand-off from data generation to data utilization. The lookup collection organizes the data by disease, and each disease has two subcategories: clinical and molecular. Each document in the lookup collection describes how all of the collections for one disease are organized. Besides the organization of the actual data collections, each document also contains a metadata section.

An example lookup document

The processed raw data are stored under ‘calculated’ and ‘edges’ in each document. ‘calculated’ is explained in Cluster, while ‘edges’ is explained in Network.

Cluster

The cluster collections are generated for two Oncoscape tools, ‘Markers and Patients’ and ‘PCA’. There are two schemas for cluster collections: multidimensional scaling (MDS) and principal component analysis (PCA).

MDS

cnv_thd and mut01 were combined, and a distance matrix was calculated to best represent the similarity of individual samples in N-dimensional space.
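
The first step of this process, combining the two data types and computing pairwise distances, could be sketched as follows in Python. The source does not state which distance metric or MDS variant the pipeline uses, so the Euclidean metric, the function names, and the sample data here are assumptions for illustration only.

```python
import math


def combine(cnv_thd, mut01):
    """Concatenate each sample's cnv_thd values with its mut01 flags."""
    return {sample: list(cnv_thd[sample]) + list(mut01[sample])
            for sample in cnv_thd}


def distance_matrix(features):
    """Pairwise Euclidean distances between samples.

    The Euclidean metric is an assumption; the source does not name one.
    Returns the sorted sample names and a square matrix in that order.
    """
    samples = sorted(features)
    matrix = [[math.dist(features[a], features[b]) for b in samples]
              for a in samples]
    return samples, matrix


# Hypothetical input: two genes per sample in each data type.
cnv_thd = {"s1": [2, 0], "s2": [-1, 0], "s3": [2, 1]}
mut01 = {"s1": [1, 0], "s2": [0, 0], "s3": [1, 0]}
samples, dist = distance_matrix(combine(cnv_thd, mut01))
```

An MDS routine would then take `dist` as input and return low-dimensional coordinates for each sample.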

mds collection schema

PCA

The collections of each data class are processed separately, and distance matrices are generated to best represent the similarity of individual samples in N-dimensional space.
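
To show how PCA scores and loadings relate to each other, here is a minimal NumPy sketch of standard PCA via singular value decomposition. It is a textbook formulation and not necessarily how the Oncoscape pipeline computes its collections; the function name, the example matrix, and the choice of two components are assumptions.

```python
import numpy as np


def pca(matrix, n_components=2):
    """Return (scores, loadings) for a samples-by-genes matrix."""
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)  # center each gene (column)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]  # one row per sample
    loadings = vt[:n_components].T                   # one row per gene
    return scores, loadings


# Hypothetical expression matrix: 4 samples (rows) by 3 genes (columns).
expr = np.array([[1.0, 2.0, 0.5],
                 [2.0, 4.1, 0.4],
                 [3.0, 6.2, 0.6],
                 [4.0, 7.9, 0.5]])
scores, loadings = pca(expr)
```

In this formulation the scores give each sample's coordinates on the principal components, and the loadings give each gene's contribution to those components.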

pca scores collection schema

pca loading schema

Collection Organization

The name of each derived collection encodes the method, the geneset, and the data class, as listed below.

Methods: PCA, MDS
Genesets: All Genes, TCGA Pancan Mutated, Oncoplex, Glioma Markers, Oncoplex Vogelstein, TCGA GBM Classifiers
Data Class: RNA, Protein, CNV/Mut01, Mut01, CNV

Network

The network collections are generated for the Oncoscape tool ‘Markers and Patients’. There are three schemas for network collections: edges, patient degrees, and gene degrees.

Edges

Edges define the connections between one patient sample and one gene. The value of each record is an integer from -2 to +2. If a gene in a patient sample has a copy number variation, the value is -2 (deletion), -1 (loss), +1 (gain), or +2 (amplification). If the gene in that patient sample has no copy number variation but does have a point mutation, the value is ‘0’. Otherwise, no record is stored.
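
The encoding rule above can be expressed as a small Python function. This is a sketch of the rule as described in this section, not the pipeline's actual code; the function name and argument names are hypothetical.

```python
def edge_value(cnv, mutated):
    """Return the edge value for one gene in one patient sample.

    cnv: thresholded copy-number call, one of -2, -1, 0, 1, 2
    mutated: True if the gene carries a point mutation in this sample
    Returns None when no edge record should be stored.
    """
    if cnv not in (-2, -1, 0, 1, 2):
        raise ValueError(f"unexpected copy-number call: {cnv!r}")
    if cnv != 0:
        return cnv   # a copy number variation takes precedence
    if mutated:
        return 0     # point mutation without a copy number variation
    return None      # unaltered: no record
```

For example, `edge_value(2, True)` yields 2 because the amplification takes precedence over the mutation, while `edge_value(0, False)` yields `None` and produces no edge.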

edges schema

Patient degrees

For a given dataset, patient degrees record the number of altered genes (by copy number variation or point mutation) for each patient.

ptdegree schema

Gene degrees

For a given dataset, gene degrees record the number of patients in whom this gene is altered (by either copy number variation or point mutation).
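
Both degree counts can be derived from a list of edges. The Python sketch below assumes each edge is a (patient, gene, value) tuple, which is a representation chosen for this example rather than the stored edges schema; the patient IDs and genes are made up.

```python
from collections import Counter


def degrees(edges):
    """Count altered genes per patient and altered patients per gene.

    edges: iterable of (patient, gene, value) tuples (hypothetical layout).
    """
    # Count each patient-gene pair once, whatever its edge value.
    pairs = {(patient, gene) for patient, gene, _ in edges}
    patient_degree = Counter(patient for patient, _ in pairs)
    gene_degree = Counter(gene for _, gene in pairs)
    return patient_degree, gene_degree


# Hypothetical edges: P1 has two altered genes, EGFR is altered in two patients.
edges = [("P1", "EGFR", 2), ("P1", "TP53", 0), ("P2", "EGFR", -1)]
patient_degree, gene_degree = degrees(edges)
```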

genedegree schema

Data Access

Oncoscape provides an API service that follows the conventional RESTful structure. Data are secured and exposed through the Kong API gateway. Privacy is managed at the collection level. You can access the public datasets by appending ‘apikey=password’ to the request.

Example to access one collection from browser

HTTP Request

Collections are accessible at the host: http://dev.oncoscape.sttrcancer.io/api/

Each Oncoscape API endpoint is a unique URL, and every endpoint points to a unique collection. More details follow on the organization of the Oncoscape Mongo database and the collections, which are organized by disease type.

GET http://dev.oncoscape.sttrcancer.io/api/gbm_patient_tcga_clinical/?q=&apikey=password

Query Collection from Browser

HTTP Request

Filter by gender and race and only show the selected fields

GET http://dev.oncoscape.sttrcancer.io/api/gbm_patient_tcga_clinical/?q={"gender":"MALE", "race":"WHITE","$fields":["gender","race","patient_ID"],"$skip":5,"$limit":2}&apikey=password

only show gender, race and patient_ID

"$fields":["gender","race","patient_ID"]

skip the first five records

"$skip":5

limit the final output to two records.

"$limit":2
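
Outside the browser, the same request URL can be assembled programmatically. The Python sketch below uses only the standard library; the helper name `build_query_url` and its parameters are hypothetical, and it assumes the server accepts the percent-encoded form of the JSON query that `urlencode` produces (a browser applies similar encoding automatically).

```python
import json
from urllib.parse import urlencode

BASE_URL = "http://dev.oncoscape.sttrcancer.io/api/"


def build_query_url(collection, filters=None, fields=None,
                    skip=None, limit=None, apikey="password"):
    """Build a query URL for one collection (hypothetical helper)."""
    query = dict(filters or {})
    if fields is not None:
        query["$fields"] = list(fields)   # only return these fields
    if skip is not None:
        query["$skip"] = skip             # skip the first N records
    if limit is not None:
        query["$limit"] = limit           # cap the number of records
    q = json.dumps(query) if query else ""
    return f"{BASE_URL}{collection}/?" + urlencode({"q": q, "apikey": apikey})


url = build_query_url(
    "gbm_patient_tcga_clinical",
    filters={"gender": "MALE", "race": "WHITE"},
    fields=["gender", "race", "patient_ID"],
    skip=5,
    limit=2,
)
```

Calling `build_query_url("gbm_patient_tcga_clinical")` with no filters gives the unfiltered form with an empty `q` parameter.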

Explore the Oncoscape Database

Data Explorer is an interactive web application to explore the clinical collections in the database.