Welcome back, RSG friends! The sun is still shining here in Germany, so we are taking advantage of the warm weather to fill the June newsletter of the Regional Student Group of Germany from the ISCB Student Council. For this month, we have selected three papers related to multi-omics, an emerging discipline in bioinformatics where several sources of omics data are integrated to address biological problems and questions. In the first part, we will discuss common methods that are used to address biological problems and predict phenotypic outcomes such as cancer. In the second part, we will talk about a popular method that integrates omics data based on network reconstruction in order to prioritize cancer genes. Finally, we will discuss a multi-omics data integration method based on Deep Learning to predict survival in liver cancer. Stay tuned, and see you next month! Yours:
RSG Germany (ISCB Student Council) – Tommaso Andreani, Ilkay Başak Uysal, Neetika Nath, Nikos Papadopoulos, Yvonne Gladbach
More is Better: Recent Progress in Multi-Omics Data Integration Methods
Medical research, especially precision medicine, has been revolutionized by high-throughput technologies and multi-omics data integration. The advent of high-throughput sequencing allowed scientists to connect transcript abundance to other molecular phenotypes, such as proteins and metabolites. Out of all these layers of cellular information coupled with omics technologies rose the field of integrative multi-omics. With a special focus on precision medicine, many software tools have been developed with the goal of improving a particular clinical outcome prediction.
In order to overcome the limitation of just one level of information (e.g. transcript abundance), other measurements such as copy number variation, DNA methylation, and miRNA expression are combined with clinical data such as race, tumor stage, relapse, and treatment response. This allows scientists to form a comprehensive picture of the disease and gain insights into the underlying biology. There are several ways to integrate these different layers.
In unsupervised data integration, methods accept input data sets without labeled response variables. They can then categorize biological profiles and so cluster samples into different subgroups.
Contrary to this exploratory analysis, supervised data integration methods consider the phenotype as the label of the samples, e.g. disease or normal. They then use machine learning approaches to select the input features that best represent and predict the labels. Among supervised data integration methods, the most representative are network-based, multi-kernel and multi-step models.
Semi-supervised data integration lies somewhere between the supervised and unsupervised approaches, taking labeled and unlabeled samples to develop the learning algorithm. Through the relationship of labeled samples, unknown samples can be assigned by building sample-wise similarity networks.
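The idea of assigning unknown samples through a sample-wise similarity network can be illustrated with a small label propagation sketch in Python. This is not taken from any of the reviewed tools: the function names, the Gaussian similarity, the `gamma` parameter and the toy feature vectors are all assumptions chosen for illustration.

```python
import math


def rbf_similarity(a, b, gamma=1.0):
    """Gaussian (RBF) similarity between two feature vectors (assumed kernel)."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-gamma * sq_dist)


def propagate_labels(features, labels, n_iter=50, gamma=1.0):
    """Spread known labels over a sample-wise similarity network.

    features: one feature vector per sample (e.g. concatenated omics layers).
    labels:   class name for labeled samples, None for unlabeled ones.
    Returns a predicted class name for every sample.
    """
    classes = sorted({lab for lab in labels if lab is not None})
    n = len(features)
    # Similarity network: weight of the edge between samples i and j.
    weights = [[rbf_similarity(features[i], features[j], gamma) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    # Class scores: one-hot for labeled samples, zeros for unlabeled ones.
    scores = [[1.0 if labels[i] == c else 0.0 for c in classes] for i in range(n)]
    for _ in range(n_iter):
        new_scores = []
        for i in range(n):
            if labels[i] is not None:
                new_scores.append(scores[i])  # known labels stay fixed
                continue
            total = sum(weights[i]) or 1.0  # guard against isolated samples
            # Unlabeled sample: weighted average of its neighbours' scores.
            new_scores.append([
                sum(weights[i][j] * scores[j][k] for j in range(n)) / total
                for k in range(len(classes))
            ])
        scores = new_scores
    return [classes[max(range(len(classes)), key=lambda k: row[k])]
            for row in scores]


# Toy data: two well-separated groups with one labeled sample each.
features = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
            [3.0, 3.1], [3.2, 2.9], [2.9, 3.0]]
labels = ["normal", None, None, "tumor", None, None]
predicted = propagate_labels(features, labels)
print(predicted)
```

Each unlabeled sample ends up with the label of the group it is most similar to, which is the intuition behind similarity-network approaches.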
Many integrative methods process the different layers independently in the initial stage of data integration. More recently, state-of-the-art tools have begun to consider the interactions between the different omics layers. One important challenge of multi-omics approaches is therefore to identify the internal relationships among the integrated layers for each omics data set. At the same time, the integration method itself is the limiting factor.
Network-based integration of multi-omics data for prioritizing cancer genes
It is common practice in precision medicine to integrate multi-omics data to take advantage of multi-layered information. Incorporating epigenetic changes or miRNA differential expression can enhance our understanding of cancer. Often, the focus of cancer studies is on the genetic aberrations and/or epigenetic changes that provoke direct interactions (e.g. mutations to transcription factors), but seldom on changes due to gene-gene interaction. In this work, the authors propose a method called NetICS (Network-based Integration of Multi-omics Data), which prioritizes genes by their mediator effect in order to understand the mechanisms behind cancer progression due to genetic interactions. The authors suggest that genetic interactions funnel the effect of multiple genetic and epigenetic changes into a few mediator genes that are involved in cancer progression.
In this method, a graph is built by integrating different types of aberration events (such as somatic mutation or copy number variation) with differential expression data on the transcriptome and proteome level for every tumor sample. Thus, the algorithm constructs a directed functional interaction network, where nodes represent transcripts and edges hold information on the variety of interaction types at different cellular levels (including (de)phosphorylation, expression/repression, and activation/inhibition). Non-interactive genes are then removed from further analysis in order to simplify the results and interpretation. From the remaining subgraph, NetICS calculates the priority of genes by diffusing the aberration and gene expression scores over the interaction network. Finally, gene ranking is performed for each sample by using a robust rank aggregation technique. The top-ranked genes are considered to be mediator genes.
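To make the diffusion step more concrete, here is a minimal sketch in Python using a random walk with restart on a toy directed network. It is not the NetICS implementation: the function name `diffuse`, the restart probability, the toy gene names, and the choice to diffuse aberration scores along the edges, expression scores against them, and multiply the two results are all assumptions made for illustration.

```python
def diffuse(edges, seed_scores, restart=0.4, n_iter=100):
    """Random walk with restart over a directed interaction network.

    edges:       list of (source, target) gene pairs.
    seed_scores: dict gene -> initial score (e.g. 1.0 for an aberrant gene).
    restart:     probability of jumping back to the seed genes at each step.
    Returns a dict gene -> diffused score; the scores sum to 1.
    """
    genes = sorted({g for edge in edges for g in edge} | set(seed_scores))
    out_edges = {g: [] for g in genes}
    for src, dst in edges:
        out_edges[src].append(dst)
    total = sum(seed_scores.values())
    seed = {g: seed_scores.get(g, 0.0) / total for g in genes}
    scores = dict(seed)
    for _ in range(n_iter):
        nxt = {g: restart * seed[g] for g in genes}
        for src in genes:
            flow = (1 - restart) * scores[src]
            targets = out_edges[src]
            if not targets:
                # Dead end: hand the score back to the seed genes.
                for g in genes:
                    nxt[g] += flow * seed[g]
                continue
            for dst in targets:
                nxt[dst] += flow / len(targets)
        scores = nxt
    return scores


# Toy network: aberrant genes A and B act on M, which regulates X and Y.
edges = [("A", "M"), ("B", "M"), ("M", "X"), ("M", "Y"), ("C", "Z")]
# Aberration scores flow along the edges ...
forward = diffuse(edges, {"A": 1.0, "B": 1.0})
# ... and differential expression scores flow against them.
backward = diffuse([(dst, src) for src, dst in edges], {"X": 1.0, "Y": 1.0})
# A gene scores highly only if it is reachable from both sides.
mediator = {g: forward[g] * backward[g] for g in forward}
ranking = sorted(mediator, key=mediator.get, reverse=True)
print(ranking[0])  # prints M
```

In this toy example, gene M sits downstream of both aberrant genes and upstream of both differentially expressed genes, so it is ranked first, while the unrelated genes C and Z receive a score of zero.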
Such a framework can provide a better understanding of patient-specific aberrations affecting the same gene targets in different ways. However, this method is limited to examining the genetic effects present in the interaction network. Since a network diffusion strategy is implemented, the results are biased towards highly connected genes. Nonetheless, this is an important step towards the integration of multi-omics data for the investigation of genetic alterations.
Deep Learning–Based Multi-Omics Integration Robustly Predicts Survival in Liver Cancer
Hepatocellular carcinoma (HCC) is the most prevalent type of liver cancer and its 5-year survival rate is less than 32%. Moreover, the high level of heterogeneity along with the complex etiologic factors makes the prognosis very challenging. In addition, treatment strategies in HCC are very limited and prediction models so far don’t consider the survival rates of the patients. This emphasizes the importance of developing tools which can predict patient survival.
In order to understand the heterogeneity of HCC, many researchers have worked on methods which use different types of data such as mRNA or miRNA expression, copy number variation (CNV), and DNA methylation. Most of these studies did not consider the survival of the patients during the process of subtyping. Instead, survival was used to evaluate the clinical significance of these subtypes. As a result, some of the reported subtypes are redundant in terms of survival differences. Considering the need for new approaches which take into account the survival of the patients and use multi-omics data, the authors developed a deep learning (DL) computational framework on multi-omics HCC datasets.
The authors used an autoencoder framework, which aims to reconstruct the original input using combinations of nonlinear functions that can then be used as new features to represent the dataset. These algorithms have already been proven to be efficient approaches to produce features linked to clinical outcomes. One very important characteristic of autoencoders is that their transformations tend to cluster together genes that share a biological pathway, thus making them suitable for interpreting biological functions.
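The autoencoder idea can be sketched with a tiny one-hidden-layer network written in plain Python. This is only an illustration of the general technique, not the authors' model: the layer size, learning rate, number of epochs, the `train_autoencoder` name and the toy data are all assumptions.

```python
import math
import random


def train_autoencoder(data, n_hidden=2, lr=0.05, epochs=300, seed=0):
    """Train a one-hidden-layer autoencoder with stochastic gradient descent.

    data: list of equal-length feature vectors (one per sample).
    Returns (encode, losses): encode maps a sample to its hidden features,
    losses holds the mean squared reconstruction error of each epoch.
    """
    rng = random.Random(seed)
    n_in = len(data[0])
    w_enc = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    b_enc = [0.0] * n_hidden
    w_dec = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_in)]
    b_dec = [0.0] * n_in

    def encode(x):
        # Nonlinear (tanh) combination of the inputs: the new features.
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(w_enc, b_enc)]

    def decode(h):
        # Linear reconstruction of the input from the hidden features.
        return [sum(w * hi for w, hi in zip(row, h)) + b
                for row, b in zip(w_dec, b_dec)]

    losses = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x in data:
            h = encode(x)
            err = [xh - xi for xh, xi in zip(decode(h), x)]
            epoch_loss += sum(e * e for e in err) / n_in
            # Backpropagate the error to the hidden layer before updating
            # the decoder (constant factors are folded into lr).
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(n_in)) * (1 - h[j] ** 2)
                      for j in range(n_hidden)]
            for i in range(n_in):
                for j in range(n_hidden):
                    w_dec[i][j] -= lr * err[i] * h[j]
                b_dec[i] -= lr * err[i]
            for j in range(n_hidden):
                for k in range(n_in):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
                b_enc[j] -= lr * grad_h[j]
        losses.append(epoch_loss / len(data))
    return encode, losses


# Toy "omics" profiles: four features, two groups of samples.
data = [[1.0, 0.9, -1.0, -0.8], [0.9, 1.0, -0.9, -1.0],
        [-1.0, -0.9, 1.0, 0.9], [-0.8, -1.0, 0.9, 1.0]]
encode, losses = train_autoencoder(data)
features = [encode(x) for x in data]  # compressed representation per sample
print(losses[0], losses[-1])
```

The hidden-layer values returned by `encode` play the role of the new features; in a survival-oriented workflow these would then be filtered and clustered, steps that are not shown here.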
A DL-based survival-sensitive model was constructed from the data of 360 HCC patients, comprising RNA sequencing (RNA-Seq), miRNA sequencing (miRNA-Seq), and methylation data from The Cancer Genome Atlas (TCGA). The method performed as well as the state of the art when the input included both genomic and clinical data.
This DL-based model provides two optimal subgroups of patients with significant survival differences and good model fitness. The more aggressive cancer subtype is associated with frequent TP53 inactivation mutations, higher expression of stemness markers (KRT19 and EPCAM) and tumor marker BIRC5, as well as activated Wnt and Akt signaling pathways. The multi-omics model was subsequently validated on five external datasets of various omics types: LIRI-JP cohort, NCI cohort, Chinese cohort, E-TABM-36 cohort, and Hawaiian cohort.
This work presents a novel application of DL to identify multi-omics features linked to the differential survival of patients with HCC. Given its robustness over multiple cohorts, it can be reasonably expected that this workflow will be useful for HCC prognosis prediction.
The model was able to identify two cancer subtypes from the molecular information and was robust across multiple datasets. Moreover, it showed strong predictive performance without the need for clinical features. Functional analysis of these two subtypes showed that gene expression signatures (KRT19, EPCAM, and BIRC5) and Wnt signaling pathways are highly associated with poor survival. In summary, the reported survival-sensitive subtype model is significant for both HCC prognosis prediction and therapeutic intervention.
Some of the challenges the authors faced were the absence of cluster label information in the original reports they based the model on and the lack of survival data in some of the cases.
In conclusion, the authors managed to create a model which is robust to noise, is able to extract meaningful features and can reflect both linear and non-linear relationships. The model performed consistently across different datasets, something rare for a multi-omics approach. It could integrate multiple omics data and had a sophisticated strategy to combine multiple features. The model may also be expanded to incorporate pathway information and account for overlapping genes.