July 2018-Multi Omics (Machine Learning)

July 2018

Introduction

Welcome back, RSG friends! We are sorry for the delay: we have all been very busy this month, but that did not stop us from putting out our July newsletter. This month we have selected three papers related to multi-omics and network topology, a branch of bioinformatics that aims to link several sources of omics data to find meaningful patterns in biological data. The first paper presents a method to “fuse” several networks from different layers of omics information, the second is useful for those studying gene expression in co-expression networks, and the last applies gene regulatory network inference to single-cell data. Stay tuned, and see you next month! Yours:

RSG Germany (ISCB Student Council) – Tommaso Andreani, Ilkay Başak Uysal, Neetika Nath, Nikos Papadopoulos, Yvonne Gladbach

Similarity network fusion for aggregating data types on a genomic scale

The last decades were marked by an explosion of data availability in biology. Phenotypic and molecular measurements have made it possible to interconnect the measured elements of complex biological systems, such as those involved in cancer or human development. One of the fields emerging from this data explosion is genomics. DNA, RNA and epigenetic information such as methylation can be integrated in order to identify meaningful patterns and extract novel information.

However, this integration is challenging. The signal-to-noise ratio of each measurement, already low after normalization, is further diluted when it is combined with measurements of a different nature. One strategy to avoid this problem is to analyze each data type independently, but integrating the results afterwards usually leads to inconclusive findings. Other approaches preselect a set of features known to be important for the system under investigation, but this introduces bias and can lead to results that are neither useful nor novel.

Similarity network fusion (SNF) solves these problems by constructing networks of samples (e.g., patients) for each available data type (DNA, RNA, methylation) and then efficiently fusing them into one network that represents the full spectrum of the underlying data. SNF computes a sample-similarity network for each data type and integrates these networks into a single similarity network using a nonlinear combination method.

The five main steps of SNF are: 1) collecting data measurements such as mRNA, DNA methylation and miRNA from the same samples, 2) computing the patient similarity matrices, where nodes are samples (patients or cells) and weighted edges represent pairwise sample similarities for each molecular phenotype, 3) building the patient similarity networks, 4) iteratively fusing the networks and 5) constructing the final fused patient similarity network.
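To make the steps more concrete, here is a minimal sketch of the idea in plain Python. It is not the authors' implementation: the function names (affinity, knn_kernel, fuse), the single global distance scale, the parameter defaults and the toy data are all our own assumptions, and a real analysis would use an array library and the published SNF package. It only illustrates how per-data-type similarity matrices can be diffused into each other and averaged.

```python
from math import exp, sqrt

def affinity(samples, sigma=0.5):
    """Sample-by-sample similarity for one data type (one feature list per sample)."""
    n = len(samples)
    dist = [[sqrt(sum((a - b) ** 2 for a, b in zip(samples[i], samples[j])))
             for j in range(n)] for i in range(n)]
    # Assumption: one global distance scale (the mean distance), for brevity.
    scale = sum(map(sum, dist)) / (n * n) or 1.0
    return [[exp(-(d ** 2) / (2 * (sigma * scale) ** 2)) for d in row] for row in dist]

def row_normalize(m):
    out = []
    for row in m:
        total = sum(row)
        out.append([v / total for v in row])
    return out

def knn_kernel(w, k):
    """Keep each sample's k strongest similarities (itself included), then normalize."""
    out = []
    for row in w:
        keep = set(sorted(range(len(row)), key=lambda j: -row[j])[:k])
        out.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return row_normalize(out)

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def mean_matrix(ms):
    n = len(ms[0])
    return [[sum(m[i][j] for m in ms) / len(ms) for j in range(n)] for i in range(n)]

def fuse(data_types, k=3, iterations=10):
    """Fuse two or more data types measured on the same samples."""
    w = [affinity(x) for x in data_types]
    p = [row_normalize(m) for m in w]   # full similarity network per data type
    s = [knn_kernel(m, k) for m in w]   # sparse local (nearest-neighbour) network
    for _ in range(iterations):
        # Each network is updated by diffusing the average of the OTHER networks
        # through its own local neighbourhood structure.
        p = [row_normalize(matmul(matmul(s[v], mean_matrix(p[:v] + p[v + 1:])),
                                  transpose(s[v])))
             for v in range(len(p))]
    fused = mean_matrix(p)
    return mean_matrix([fused, transpose(fused)])  # symmetrize the result

# Made-up toy data: six samples forming two groups, seen in two data types.
mrna = [[0.0, 0.1], [0.1, 0.0], [0.0, 0.0], [5.0, 5.1], [5.1, 5.0], [5.0, 5.0]]
methylation = [[1.0, 1.1], [1.1, 1.0], [1.0, 1.0], [4.0, 4.1], [4.1, 4.0], [4.0, 4.0]]
fused_network = fuse([mrna, methylation], k=3)
```

In the fused network, samples from the same group end up with a much higher similarity than samples from different groups, which is the property that downstream clustering of patients relies on.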

Wang, Bo, et al. “Similarity network fusion for aggregating data types on a genomic scale.” Nature methods 11.3 (2014): 333.

Gene co-expression network connectivity is an important determinant of selective constraint

In this work, Mähler et al. use a characteristic feature of biological networks, their scale-free structure, to examine the relationship between co-expression network structure and the selective processes that maintain natural variation. In a scale-free network the node degrees follow a power-law distribution: a small number of nodes (genes or proteins) have a high degree, i.e. many connections, whereas the majority of nodes have low connectivity. Previous studies focused on the genomic location of variation and showed that polymorphisms located outside protein-coding regions are contributing factors to phenotypic change.
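For readers who prefer a formula, the scale-free property is usually written as a power law for the degree distribution. The symbols below are our own notation, not taken from the paper: P(k) is the fraction of nodes with k connections and gamma is a positive exponent describing how quickly that fraction falls off.

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
```

Because P(k) decays only polynomially, a few highly connected hub genes coexist with a large majority of genes that have only a handful of connections.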

On this biological basis, and using bioinformatics tools, Mähler et al. investigate the relationship between sequence variation and the co-expression network, in order to study not only where variants are located but also the selective pressure acting on the aspen (P. tremula) genome. First, they performed eQTL mapping to explore the genetic architecture of gene expression variation among genotypes, and found that SNPs were often located close to the transcription start site (TSS) and to the stop codon. They then classified the variant positions using sequence ontology terms and found that UTRs have the highest density of local SNPs, followed by flanking regions and introns. Next, to examine the evolutionary history of the genome, they constructed the co-expression network and established a correlation between network connectivity and rates of sequence evolution. Finally, combining the results of the eQTL and co-expression network analyses, they found that non-core genes, defined as genes with few connections, contain the greatest density of SNPs, which keeps the effect of mutations to a minimum. These non-core genes sit at the periphery of the network, again suggesting a minimal effect on the core genes. Such buffering would hold even if all genes in the network were equally exposed to the same evolutionary history, which contrasts with the idea that natural selection prevents the accumulation of mutations within specific genes.
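The notion of "connectivity" used here can be illustrated with a small sketch. We assume a very simple network construction, linking two genes when the absolute Pearson correlation of their expression profiles passes a threshold; this is a stand-in chosen for brevity, not the inference procedure of the paper, and the gene names, expression values and the 0.8 threshold are made up.

```python
from math import sqrt

def pearson(a, b):
    """Pearson correlation between two equally long expression profiles."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    sd_a = sqrt(sum((x - mean_a) ** 2 for x in a))
    sd_b = sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (sd_a * sd_b)

def connectivity(expression, threshold=0.8):
    """Number of co-expression partners per gene (its degree in the network)."""
    genes = list(expression)
    degree = {g: 0 for g in genes}
    for i, g in enumerate(genes):
        for h in genes[i + 1:]:
            # Assumption: an unsigned network, so strong negative
            # correlation also counts as a connection.
            if abs(pearson(expression[g], expression[h])) >= threshold:
                degree[g] += 1
                degree[h] += 1
    return degree

# Made-up expression values across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.0, 4.0, 6.0, 8.0, 10.5],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.2],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}
degree = connectivity(expression)
core = [g for g, d in degree.items() if d > 0]
periphery = [g for g, d in degree.items() if d == 0]
```

In this toy case geneA, geneB and geneC form a tightly co-expressed core, while geneD has no partners and would sit at the periphery, the part of the network where the study finds the highest SNP density.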

All in all, the authors show in a European aspen population how the co-expression network can be buffered against large perturbations and adaptation by tolerating an accumulation of mutations within the network periphery.

Their work shows how to combine evolutionary history with network topology to understand the underlying evolutionary behaviour. Nonetheless, we need to acknowledge that the P. tremula genome is not yet fully annotated. This work demonstrates that by combining bioinformatics and biology one can uncover the biological principles hidden in such co-expression networks and advance our understanding.

Mähler, Niklas, et al. “Gene co-expression network connectivity is an important determinant of selective constraint.” PLoS genetics 13.4 (2017): e1006402.

Gene Regulatory Network Inference from Single-Cell Data Using Multivariate Information Measures

The development of cells and organisms is driven by finely tuned spatial and temporal gene expression, and their ability to adjust gene expression levels is what allows them to respond to environmental and physiological input. Inferring gene regulatory networks (GRNs, where genes are nodes and edges represent a regulatory relationship between them) facilitates the study of the precisely controlled patterns of gene expression that are so essential for reproduction and survival. Additionally, GRNs are quickly becoming the tool of choice to dissect the molecular contributions to complex diseases.

The increasing availability of high-throughput single-cell expression data offers unprecedented opportunities for GRN inference, something largely unexplored to date. In this paper, researchers from Imperial College London present PIDC (Partial Information Decomposition and Context), an algorithm that uses multivariate information theory to find genes that have a regulatory relationship in single-cell data.

The authors adopt partial information decomposition (PID), a non-negative decomposition of mutual information (MI) that allows three variables to be examined at once. PID quantifies the influence of a “source set” of two variables X and Y on a target variable Z. In PID, the MI between the sources and the target is composed of four terms: the synergy (information provided only by knowing X and Y together), the two unique contributions (information about Z provided only by X and not by Y, and vice versa) and the redundancy (information about Z that can be provided by either X or Y alone).
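A small sketch may help to see how the four terms relate to ordinary mutual information. The text above does not say how redundancy is measured, so we assume the minimum-specific-information redundancy of Williams and Beer here; the function names are ours, and the inputs must already be discrete (continuous expression values would have to be binned first, a step not shown).

```python
from collections import Counter
from math import log2

def mutual_information(pairs):
    """I(A;B) in bits, estimated from a list of (a, b) observations."""
    n = len(pairs)
    joint = Counter(pairs)
    count_a = Counter(a for a, _ in pairs)
    count_b = Counter(b for _, b in pairs)
    return sum((c / n) * log2(c * n / (count_a[a] * count_b[b]))
               for (a, b), c in joint.items())

def specific_information(source, target, z):
    """Information the source carries about one particular target outcome z."""
    n_z = target.count(z)
    p_z = n_z / len(target)
    source_counts = Counter(source)
    joint_counts = Counter(s for s, t in zip(source, target) if t == z)
    total = 0.0
    for s, c in joint_counts.items():
        p_s_given_z = c / n_z
        p_z_given_s = c / source_counts[s]
        total += p_s_given_z * log2(p_z_given_s / p_z)
    return total

def pid(x, y, z):
    """Decompose I(Z; X,Y) into redundancy, two unique terms and synergy."""
    n = len(z)
    # Assumed redundancy measure: expected minimum specific information.
    redundancy = sum((c / n) * min(specific_information(x, z, v),
                                   specific_information(y, z, v))
                     for v, c in Counter(z).items())
    i_x = mutual_information(list(zip(x, z)))
    i_y = mutual_information(list(zip(y, z)))
    i_xy = mutual_information(list(zip(zip(x, y), z)))
    return {
        "redundancy": redundancy,
        "unique_x": i_x - redundancy,
        "unique_y": i_y - redundancy,
        "synergy": i_xy - i_x - i_y + redundancy,
    }

# Toy example: Z is the XOR of X and Y, so neither source alone says anything.
x = [0, 0, 1, 1]
y = [0, 1, 0, 1]
z = [0, 1, 1, 0]
xor_terms = pid(x, y, z)
```

For the XOR example all of the one bit of information about Z is synergy, while setting Z equal to X would instead put the whole bit into the unique contribution of X. By construction the four terms always sum to the MI between the source pair and the target.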

The relative value of these terms for every gene triplet creates a characteristic pattern, a signature, that depends on the topology. Finding these patterns in the real data will indicate gene triplets with the corresponding topologies. Two of those are of particular interest: unconnected (three genes that do not interact with each other) and single-edge (two genes connected, one not), since they account for more than 90% of all triplets in real GRNs.

The PIDC algorithm exploits this fact: for every pair of genes X, Y it goes over all triplets X, Y, Z and calculates the proportion of the PID that is captured by the unique information (proportional unique contribution, PUC). If X and Y are connected, then for most of the triplets X, Y, Z the ratio of the unique information to the total PID will be high. If they are unconnected, the same ratio will be low. Finally, PIDC considers the PUC distributions for each gene and keeps the significant interactions, rather than the ones that surpass a global PUC cutoff.
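The last step, judging each score in the context of the genes involved instead of against one global cutoff, can be sketched as follows. This is a simplification under our own assumptions: the PUC scores are taken as given (and are made up here), each gene's score distribution is summarized by a plain empirical cumulative fraction, and the function name edge_confidence is hypothetical.

```python
def edge_confidence(puc, genes):
    """Score each gene pair relative to the PUC scores of both genes involved.

    puc maps frozenset({gene_a, gene_b}) to the PUC score of that pair.
    """
    def scores_for(gene):
        return [puc[frozenset((gene, other))] for other in genes if other != gene]

    def cumulative_fraction(values, x):
        # Fraction of this gene's scores that are no larger than x.
        return sum(v <= x for v in values) / len(values)

    confidence = {}
    for pair, score in puc.items():
        a, b = tuple(pair)
        confidence[pair] = (cumulative_fraction(scores_for(a), score)
                            + cumulative_fraction(scores_for(b), score))
    return confidence

# Made-up PUC scores: gene A scores highly with every other gene.
genes = ["A", "B", "C", "D"]
puc = {
    frozenset(("A", "B")): 0.9,
    frozenset(("A", "C")): 0.8,
    frozenset(("A", "D")): 0.85,
    frozenset(("B", "C")): 0.5,
    frozenset(("B", "D")): 0.1,
    frozenset(("C", "D")): 0.1,
}
confidence = edge_confidence(puc, genes)
```

With a global cutoff of, say, 0.6 every pair involving A would be kept and the B-C pair dropped. Judged in context, the B-C score of 0.5 is as notable for B and C as the A-C score of 0.8 is for A and C, so both pairs receive the same confidence.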

In a benchmark on simulated data, PIDC compares favorably to the state of the art, and shows improved performance on datasets with more data points. On real data, PIDC reconstructs networks that are considerably clearer than those obtained with simple expression correlation or pairwise MI, and identifies the interactions of key genes and regulators. PIDC demonstrates the value of using higher-order information measures as well as information about the network context, and illustrates that the large sample sizes and transcriptional variability of modern single-cell analyses can be used as advantages in GRN reconstruction.

Chan, Thalia E., Michael PH Stumpf, and Ann C. Babtie. “Gene regulatory network inference from single-cell data using multivariate information measures.” Cell systems 5.3 (2017): 251-267.