Newsletter

November 2018

Introduction

Welcome back, RSG friends! It has been a while since we last updated our newsletter, but now it is time to restart with some exciting news. Our members Niko, Yvonne and Neetika were busy attending this summer's bioinformatics and computational biology conferences around the world. For this reason, we would like to summarize what we experienced at these conferences in Chicago and Athens. Have fun with our latest updates and stay tuned. Your RSG friends:

RSG Germany (ISCB Student Council) – Tommaso Andreani, Ilkay Başak Uysal, Neetika Nath, Nikos Papadopoulos, Yvonne Gladbach

A Glimpse of ISMB vs. ECCB

The International Society for Computational Biology (ISCB), a society for researchers in bioinformatics, hosts 11 conferences every year. This year I had the great chance to attend two ISCB conferences, each with a student council symposium as a pre-meeting: Intelligent Systems for Molecular Biology (ISMB) in Chicago and the European Conference on Computational Biology (ECCB) in Athens, a biennial meeting.

The focus of ISMB, the flagship conference of the ISCB, which takes place on the North American continent, is the development and application of advanced computational methods for biological problems. It covers a variety of special interest communities (COSIs). With this huge variety and the COSIs running in parallel, the conference attracts many researchers and students, who can each find a fitting niche for listening to experts in their field.

Pre-meeting events such as workshops and tutorials covered single-cell RNA-Seq, machine learning (ML), integrative omics analysis, regulatory interactions, deep learning and visualization, as well as bioinformatics education. Highlights of the Student Council Symposium (SCS) were the keynotes by Lucia Peixoto and Philip E. Bourne. Additionally, 11 student and postdoc talks were presented, as well as posters from early-stage researchers.

The main focus of ECCB is advances in computational biology and their application to problems in molecular biology. This conference is one of the three sister conferences of ISMB and takes place on the European continent. In contrast to ISMB, ECCB was organized around a more general theme. Workshops and tutorials covered topics similar to those at ISMB.

By contrast, the European Student Council Symposium (ESCS) was organized in the same fashion as the one at ISMB, with the following highlights: 11 student and postdoc talks, 6 flash talks, and posters from several early-stage researchers. The two enlightening keynote speakers were Julio Saez-Rodriguez and Anna Zhukova, who also stayed for a roundtable discussion on the crucially important topic “Bioethical aspects in Bioinformatics” with the experts Yves Moreau and Mahsa Shabani.

The student council symposia as well as the main conferences provide a great opportunity for students to present their work to an international audience, build a network within the computational biology community and develop important soft skills in an environment that fosters exchange of ideas and knowledge.

The first difference that I would like to discuss here is the conference size. ISMB was a large conference; compared to it, ECCB was small-scale. Instead of 9 parallel sessions, ECCB consisted of 3. As for presentation types, the set-up of the poster presentations followed the theme of the corresponding conference.

Both conferences are mainly about advances and developments in computational biology and their applications in molecular biology. I did notice quite a few differences between the questions I was asked by the different audiences. While at ISMB I got more practical feedback on my project, the questions at ECCB focused a bit more on the theory and the underlying biology. Depending on your own preferences, you should choose your conference carefully or consider a more specialized one. Personally, I prefer the joint ISMB/ECCB meeting; in the years without a joint meeting, I will look for specialized conferences on my PhD topic.

© Yvonne

Neural Network history and challenges at ECCB in Athens

There are undeniable trends in every field, and Bioinformatics is no different. Artificial neural networks (NNs) became popular in the ‘90s, fueled by the development and rapid growth of databases that contained heaps of raw data but only a limited amount of annotated data. After training on the annotated data, feed-forward NNs could be used to predict the features they learned on new data, alleviating the need for costly and time-consuming experiments. NNs were applied, with considerable success, to predict (among others) secondary structure, solvent accessibility, and transmembrane regions.

In the late ‘90s and early 2000s NNs were applied to make predictions for every problem under the sun – from protein 3D structure to helix-helix contacts, toxicity of organic compounds, and spectral properties of the green fluorescent protein. However, the technology seemed to stagnate; NNs could not solve overly complex problems, and it seemed dedicated modelling was needed for these.

This changed in the early 2010s with the emergence of so-called “deep neural networks”, which combined clever improvements of the basic ideas behind NNs with parallelized code that took advantage of improved hardware. These new NNs perform comparably to or better than humans at many tasks (such as image classification). Additionally, software like TensorFlow is making the deployment of deep NNs easier than ever before. This combination makes deep NNs very attractive to scientists and has led to wide adoption for classification and prediction tasks in Bioinformatics.

The trend was clearly visible at the recent ECCB conference in Athens. Four talks and about twenty-five posters were explicitly dedicated to deep NNs, and many more used or referenced them. Here, we provide a brief summary of DLPRB, a method that uses deep learning to predict protein-RNA binding.

Protein-RNA binding is usually studied via CLIP experiments, which produce binary outcomes (protein binds to RNA or not). These are characterized by a high noise-to-signal ratio, which makes learning binding preferences a difficult challenge. The most common model-based approaches have used the expectation-maximization algorithm to find sequence motifs that are more likely to bind to the protein of interest.

DLPRB (Deep Learning for Prediction of RNA-Binding) proposes a convolutional NN (CNN) and a recurrent NN (RNN) to predict RNA-protein binding. The input to the network is a matrix M with L columns, one per position of the RNA sequence, and d rows, which hold the one-hot encoding of the sequence together with the predicted secondary structure (5 structural contexts are considered per nucleotide).
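
To make the shape of M concrete, here is a minimal sketch in Python with NumPy. It is not the authors' code: the function name, the A/C/G/U row order, and the choice to represent the 5 structural contexts as per-position probabilities are all assumptions made for illustration.

```python
import numpy as np

NUCLEOTIDES = "ACGU"  # assumed row order for the one-hot part
N_STRUCT = 5          # structural contexts per nucleotide, as stated above


def build_input_matrix(seq, struct_probs):
    """Stack the one-hot sequence rows on top of the structure rows.

    seq: RNA string of length L over A, C, G, U.
    struct_probs: array of shape (N_STRUCT, L); hypothetical predicted
        probability of each structural context at each position.
    Returns M with shape (d, L), where d = 4 + N_STRUCT.
    """
    length = len(seq)
    struct_probs = np.asarray(struct_probs, dtype=float)
    if struct_probs.shape != (N_STRUCT, length):
        raise ValueError("struct_probs must have shape (N_STRUCT, len(seq))")
    one_hot = np.zeros((len(NUCLEOTIDES), length))
    for pos, nt in enumerate(seq):
        one_hot[NUCLEOTIDES.index(nt), pos] = 1.0
    return np.vstack([one_hot, struct_probs])


# Toy example: 4 nucleotides with a uniform (made-up) structure prediction.
seq = "GAUC"
struct = np.full((N_STRUCT, len(seq)), 1.0 / N_STRUCT)
M = build_input_matrix(seq, struct)
print(M.shape)  # d rows, L columns
```

With this layout, each column of M describes one sequence position: which nucleotide it is and how likely it is to sit in each structural context.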

Matrix M is fed into the CNN. M is convolved with filters of various sizes, and the maxima of the resulting filtered submatrices are pooled together. A fully connected layer computes a weighted sum of these maximum values, yielding the predicted binding intensity.
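
The convolution, max-pooling and weighted-sum steps can be sketched in a few lines of NumPy. This is only an illustration of the general idea with hand-picked toy numbers; the function names, filter shapes and weights are hypothetical and do not reflect the actual DLPRB configuration, and no training is shown.

```python
import numpy as np


def conv_max_pool(M, filters):
    """Slide each filter along the sequence axis and keep its maximum response.

    M: input matrix of shape (d, L).
    filters: list of arrays of shape (d, k); the width k may differ per filter
        and is assumed to be at most L.
    Returns one pooled value per filter.
    """
    _, length = M.shape
    pooled = []
    for W in filters:
        k = W.shape[1]
        responses = [np.sum(M[:, i:i + k] * W) for i in range(length - k + 1)]
        pooled.append(max(responses))
    return np.array(pooled)


def predict_intensity(M, filters, weights, bias=0.0):
    """Fully connected layer: weighted sum of the pooled maxima."""
    return float(conv_max_pool(M, filters) @ weights + bias)


# Toy input with d = 2 rows and L = 4 positions.
M = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])
filters = [
    np.array([[1.0, 0.0], [0.0, 1.0]]),  # width-2 filter
    np.array([[1.0], [0.0]]),            # width-1 filter
]
weights = np.array([0.5, 1.0])

print(conv_max_pool(M, filters))
print(predict_intensity(M, filters, weights))
```

Because only the maximum response of each filter is kept, the prediction does not depend on where along the sequence a motif-like pattern occurs, only on how strongly it matches.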

The RNN architecture works slightly differently. The RNN consists of two directed networks (forward and backward), each consisting of L long short-term memory units (LSTMs, which have been applied with considerable success to time series data; here they can be used to discover long-term dependencies between distal sequence positions). The matrix M is input sequentially, both forward and backward. The output of each network is then pooled and processed to produce a binding intensity prediction.
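
A bidirectional LSTM pass over M can be sketched as follows. This is a rough sketch under several assumptions, not the DLPRB implementation: a single standard LSTM cell is unrolled over the L positions, max-pooling over positions stands in for the unspecified "pooled and processed" step, and all weights are random placeholders instead of trained parameters.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_pass(M, W, U, b):
    """Run one LSTM over the columns of M (shape (d, L)).

    W: (4h, d), U: (4h, h), b: (4h,), with the gates stacked in the order
    input, forget, output, candidate. Returns hidden states of shape (L, h).
    """
    h_size = U.shape[1]
    h = np.zeros(h_size)
    c = np.zeros(h_size)
    states = []
    for x in M.T:  # one sequence position at a time
        z = W @ x + U @ h + b
        i, f, o = (sigmoid(z[j * h_size:(j + 1) * h_size]) for j in range(3))
        g = np.tanh(z[3 * h_size:])
        c = f * c + i * g      # update the cell memory
        h = o * np.tanh(c)     # expose part of it as the hidden state
        states.append(h)
    return np.array(states)


def bidirectional_predict(M, fwd, bwd, w_out):
    """Read M forward and backward, pool each direction, combine linearly."""
    h_fwd = lstm_pass(M, *fwd)
    h_bwd = lstm_pass(M[:, ::-1], *bwd)  # reversed sequence order
    pooled = np.concatenate([h_fwd.max(axis=0), h_bwd.max(axis=0)])
    return float(pooled @ w_out)


# Toy dimensions and random placeholder parameters (all hypothetical).
rng = np.random.default_rng(0)
d, L, hidden = 9, 6, 4
M = rng.random((d, L))


def random_params():
    return (rng.normal(size=(4 * hidden, d)),
            rng.normal(size=(4 * hidden, hidden)),
            np.zeros(4 * hidden))


fwd, bwd = random_params(), random_params()
w_out = rng.normal(size=2 * hidden)
print(bidirectional_predict(M, fwd, bwd, w_out))
```

Reading the sequence in both directions lets the hidden state at each position carry information from nucleotides on either side of it, which is what allows dependencies between distal positions to be picked up.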

On a dataset containing 244 CLIP experiments with known binding intensities, both architectures showed considerable improvement over the state of the art, a trend that holds for in vivo experiments. The authors also confirmed that RNA structure information plays an important role in predicting binding affinity. Withholding the structure information lowered the prediction accuracy, and supplying true instead of predicted RNA structures improved method performance. The authors attribute their method’s improvement upon other deep learning approaches to the use of RNA structure and the configuration of their convolution filters.

© Niko

This article leans heavily on https://link.springer.com/referenceworkentry/10.1007%2F978-3-540-92910-9_18 for the history of NNs in Bioinformatics.

Biomarker identification methods at the ECCB

Clinical biomarkers are a cornerstone of a better healthcare system, and better identification of biomarkers will improve disease diagnosis and suggest novel strategies for the stratification of high-risk patients. Given the clinical significance of already known candidate biomarkers, newly identified disease-based biomarkers hold great promise for personalized medicine, especially for disease diagnosis and prognosis. Nowadays, with the availability of large-scale genomic data, it has become possible to identify disease-specific biomarkers that can be translated into clinical practice. Simplistically, there are two strategies for biomarker identification: “hypothesis-directed” (HD) and “data-directed” (DD). The HD strategy was popular and often implemented in traditional research. It identifies candidate biomarkers by comparing biomolecules between two different conditions. This strategy is simple and straightforward, but it fails to acknowledge the multifactorial behavior of disease and includes only a limited number of samples that are specific to the question. Therefore, there is an unmet need to develop computational algorithms for diagnosis, prognosis and therapeutics, because such algorithms can identify complex patterns in disease-related datasets using the plethora of emerging data from “-omics” technologies (MS, next-generation sequencing (NGS), microarrays, etc.) in the public domain. This topic was discussed from different angles at ECCB 2018, showing the interest in and progress of bioinformatics research.

Often, as a first step, in-house and independent methods are developed for biomarker identification; these are specific to the disease, the research center and the available data types. The sheer volume of candidate genes returned by high-throughput studies makes experimental validation a daunting task. Prioritizing candidate disease markers is therefore necessary in order to maximize the information gained from these high-throughput experiments. Owing to innovative developments in informatics and analytical technologies, and the integration of biological approaches, it is now possible to expand on identified biomarkers to understand the systems-level effects of a certain disease. In this respect, Antoranz A et al. (1) presented at the conference their in-house method, which focuses on investigating the biological mechanisms of identified biomarkers. The method has the potential to establish a relationship between the biomarkers and their mechanisms in order to gain a systems-level understanding of the molecular pathomechanisms. Another aspect is prioritizing genes with respect to a disease, which could lead to identifying the candidate gene. A method called Semantic Disease Gene Embeddings (SmuDGE) was presented by Alshahrani M (2) at the conference. This method was developed to predict gene-disease associations by including knowledge from phenotype associations for any gene connected in an interaction network and by providing disease-specific priorities. Moreover, I learned at the conference that biological questions related to diseases are not limited to academic research: an analytics startup called e-NIOS also focuses its work on the identification and interpretation of biomarkers. e-NIOS develops a cloud-based platform that leverages machine learning and physiology to improve and automate the interpretation of potential DD biomarkers. The platform is suitable for interpreting molecular markers across different conditions, physiological states and datasets.

In the end, I think this conference is a great platform for gaining insight into the progress of a particular field of research. For bioinformatics, ECCB and ISCB are the best platforms for young scientists to come together and build up new and innovative ideas. As always, this year's ECCB covered a variety of topics, including methods developed for disease marker interpretation based on protein, gene and genome datasets.

(1) Antoranz A, Sakellaropoulos T, Saez-Rodriguez J, Alexopoulos LG. Mechanism-based biomarker discovery. Drug Discov Today. 2017;22(8):1209–15.

(2) Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. bioRxiv 311449. 2018.

© Neetika