Research Areas

  •   Metaheuristics
  •   Machine Learning
  •   HPC for Bioinformatics

Optimization and Metaheuristics - Metaheuristics combine basic heuristic methods in a higher-level framework that guides the search for solutions, so that the search space of a broad range of optimization problems can be explored efficiently and effectively. The main goal is to find an acceptable solution in an acceptable time. Most metaheuristics balance local improvement (exploitation) with strategies that avoid being trapped in local optima (exploration).

In Bioinformatics, several problems still lack a computational method that can guarantee a minimum solution quality in a feasible time. This is because the rules that govern biochemical processes and relations are only partially known, which makes it harder to design efficient computational strategies. Since such problems are classified as NP-Complete or NP-Hard, computational techniques that can cope with this complexity are needed.

Metaheuristics are among the most common and powerful techniques used in this case. They do not guarantee an optimal solution, but they give a good approximation with limited computational effort.
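The interplay between exploitation and exploration described above can be illustrated with a minimal sketch. It uses simulated annealing, chosen here only because it is compact; it is not presented as the group's own method. The objective function (a one-dimensional Rastrigin function), the Gaussian neighborhood, and all parameter values are assumptions made for illustration.

```python
import math
import random


def simulated_annealing(cost, start, neighbor, steps=5000,
                        t_start=1.0, t_end=1e-3, seed=0):
    """Minimize `cost` starting from `start`; returns (best_solution, best_cost)."""
    rng = random.Random(seed)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    for step in range(steps):
        # Geometric cooling: the temperature falls from t_start to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        candidate = neighbor(current, rng)
        candidate_cost = cost(candidate)
        delta = candidate_cost - current_cost
        # Exploitation: improvements are always accepted.
        # Exploration: a worse move is accepted with probability exp(-delta / T),
        # which lets the search escape local optima while T is still high.
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


def rastrigin_1d(x):
    """Toy multimodal objective (assumption): global minimum 0 at x = 0."""
    return x * x - 10.0 * math.cos(2.0 * math.pi * x) + 10.0


def gaussian_step(x, rng):
    """Hypothetical neighborhood: a small Gaussian perturbation of x."""
    return x + rng.gauss(0.0, 0.5)


if __name__ == "__main__":
    solution, value = simulated_annealing(rastrigin_1d, 4.3, gaussian_step)
    print(f"best x = {solution:.4f}, cost = {value:.4f}")
```

The returned solution is never worse than the starting point, but, as with any metaheuristic, there is no guarantee that it is the global optimum.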

Keywords - memetic algorithms; distributed meta-heuristics; population-based metaheuristics; evolutionary computation

Machine Learning - Machine Learning is a research area of computer science that deals with the development of algorithms capable of learning from data sets that are usually large and complex. These algorithms can infer input-output relationships without explicitly assuming a pre-determined model. The learning is mainly focused on the automatic discovery of predictive models and the detection of patterns, relationships, and dependencies in the data, allowing the extraction of implicit information that could hardly be detected through manual analysis.

Machine learning can deal efficiently with non-linear, noisy environments and handle missing data properly. For these reasons, machine learning algorithms are among the most widely used to integrate and analyze omics data. There are three learning paradigms: supervised, unsupervised, and semi-supervised. Supervised learning is a process in which the model is parameterized by using a set of observations, each associated with a known outcome (label). In contrast, in unsupervised learning one does not have access to the labels; it can be viewed as the task of "spontaneously" finding patterns and structures in the input data. The third paradigm, semi-supervised learning, combines the two by using both labeled and unlabeled observations.
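The difference between the supervised and unsupervised paradigms can be sketched on a toy data set. The example below is an illustration under stated assumptions, not a description of the group's tools: the two-dimensional samples and the "control"/"treated" labels are made up, the supervised model is a simple nearest-centroid classifier, and the unsupervised model is k-means (Lloyd's algorithm).

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    return tuple(sum(axis) / len(points) for axis in zip(*points))


def fit_nearest_centroid(samples, labels):
    """Supervised: the known labels decide which samples form each class centroid."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return {label: centroid(members) for label, members in groups.items()}


def predict(model, sample):
    """Return the label whose centroid is closest to `sample`."""
    return min(model, key=lambda label: math.dist(model[label], sample))


def kmeans(samples, initial_centers, iterations=20):
    """Unsupervised: Lloyd's algorithm groups the samples without using labels."""
    centers = list(initial_centers)
    assignment = []
    for _ in range(iterations):
        # Assign every sample to its nearest current center.
        assignment = [
            min(range(len(centers)), key=lambda k: math.dist(centers[k], s))
            for s in samples
        ]
        # Move each center to the mean of the samples assigned to it.
        for k in range(len(centers)):
            members = [s for s, a in zip(samples, assignment) if a == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = centroid(members)
    return centers, assignment


# Hypothetical measurements for six samples (two features each).
samples = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (4.8, 5.1), (5.3, 4.9)]
labels = ["control", "control", "control", "treated", "treated", "treated"]

model = fit_nearest_centroid(samples, labels)                  # uses the labels
centers, clusters = kmeans(samples, [samples[0], samples[3]])  # ignores them
```

In the supervised case the output is a named class for a new observation, for example `predict(model, (4.5, 4.7))`; in the unsupervised case the output is only a cluster index, and interpreting what each cluster means is left to the analyst.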

Keywords - feature selection; dimensionality reduction; interpretability in machine learning models; predictive models; neuroevolution; learning heuristics.

HPC for Bioinformatics - Depending on the problem, metaheuristics and machine learning approaches can be computationally expensive, so that only small problem instances can be solved. To overcome this limitation, parallel models of metaheuristics and machine learning algorithms can be developed, which make it possible to explore a larger number of plausible solutions. There are several ways to address the lack of computing power for bioinformatics: (1) developing new and faster heuristic algorithms (metaheuristics) that reduce the computational space for the most time-consuming tasks; (2) incorporating these algorithms into specialized chips; and (3) the most promising option, parallel computing.

Parallel computing still requires new paradigms to harness the additional processing power for Bioinformatics. A recent trend in Structural Bioinformatics is to move algorithms from traditional single-core processors to multi-core processors and further to many-core or massively multi-core processors. Data-parallel computations with high arithmetic intensity, such as those found in Bioinformatics problems, can attain maximum performance on Graphics Processing Units (GPUs). In such cases, when the algorithm can be parallelized effectively, there is a significant speedup.
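How much speedup "effective parallelization" can deliver is commonly estimated with Amdahl's law, which bounds the gain by the fraction of the runtime that remains serial. The sketch below is a generic illustration; the parallel fractions and worker counts are hypothetical and are not measurements of any algorithm or machine mentioned here.

```python
def amdahl_speedup(parallel_fraction, workers):
    """Upper bound on speedup when `parallel_fraction` of the runtime
    is spread perfectly over `workers` processing units (Amdahl's law)."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical workloads: 50%, 95% and 99.9% of the runtime parallelizable.
    for fraction in (0.5, 0.95, 0.999):
        for workers in (8, 64, 1024):
            speedup = amdahl_speedup(fraction, workers)
            print(f"p={fraction:<5} workers={workers:<5} speedup={speedup:7.2f}")
```

The bound makes the point of the paragraph above concrete: with a parallel fraction of 0.95 the speedup can never exceed 20, however many cores are added, so many-core hardware pays off mainly for computations that are almost entirely data-parallel.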

Keywords - GPU computing; massively parallel computing; CUDA.

  •   Systems Biology
  •   Protein Structures
  •   Gene Expression Data
  •   Data Analysis

Systems Biology - Identification of targets of interest derived from large-scale biological data, creation of interaction networks between different types of molecules, prospecting for possible new drugs, elucidation of molecular mechanisms, and evolutionary comparison between biochemical pathways: these and much more can be investigated through Systems Biology.

Our group has experience applying various Systems Biology tools to different organisms, such as humans, mice, bacteria, plants, and fungi. Likewise, we have experience in understanding the effects of molecules and toxic compounds on biological systems and in prospecting for potential drugs. Systems Biology has been increasingly employed in many types of work due to its flexibility and broad applicability across all areas of Molecular Biology.

Keywords - Biological Networks; Biochemical Pathways; Molecular Mechanisms; Interactomes; Large-Scale Biological Data.

Simulation and Modelling of Protein Structures - A protein of interest does not always have its structure available. Moreover, even when available, that structure does not explain how that molecule behaves when in a cellular environment or bound to other molecules, such as proteins or chemical compounds.

Our group has a track record in applying and creating new tools to predict protein structures efficiently. Likewise, we have experience in modeling and simulating proteins and protein complexes, and in understanding the behavior of proteins bound to chemical compounds. The simulation of proteins bound to drugs or other compounds is directly linked to an effective reduction in the cost of selecting new drugs of interest. Similarly, such simulations can be used to understand the behavior of different molecules in solution and to optimize biotechnological processes.

Keywords - Protein Structure Prediction; Protein Complexes; Molecular Docking; Molecular Dynamics; Biotechnological Processes.

Analysis of Gene Expression Data - Most genes in an organism are expressed as RNA molecules, and these can be of different types (e.g., mRNA, miRNA, lncRNA). Thus, understanding an organism's gene expression profiles is vital to explore possible molecular targets of biological and biotechnological interest. Similarly, these analyses are crucial to understanding how organisms or tissues respond to different conditions.

We are experienced in analyzing gene expression data, such as microarray and RNA-seq data, in multiple species. We analyze data from any microarray platform and devise tools for the analysis of expression data. Concerning RNA-seq, we are experienced in analyzing data from the Illumina platform.

Keywords - Microarrays; RNA-Seq; SNPs; SNVs; Isoforms; lncRNA.

Machine Learning and Biological Data Analysis - The use of artificial intelligence, particularly machine learning, creates new opportunities for analyzing large volumes of biological data. These techniques make it possible to identify patterns that are often difficult to detect by other approaches, and they play a key role in understanding and solving complex problems in agriculture, livestock, the extractive industry, health, and security.

Machine Learning techniques can be used to analyze genomic data, seeking to identify new biomarkers with diagnostic or prognostic value, or potential therapeutic targets for the treatment of diseases. Similarly, they can be used to detect SNPs and SNVs of interest in a population. The same techniques can be used, for example, in agriculture to discover unknown metabolic pathways and defense mechanisms, and their regulation, in the study of plant-pathogen interactions. They can also be applied to the discovery of genetic variants with potential applications in animal genetic improvement.

Our laboratory has years of experience creating new machine learning algorithms and applying these techniques to different problems of biological and biotechnological interest.

Keywords - Data Science; Big Data Analytics; Translational Data Science; Feature Selection; Feature Extraction; Predictive Models; Data Mining.

Tools and Datasets

Science is moving towards greater openness, covering not just data but also publications, computer code, and workflows. The SBCB Lab is committed to open science and free access to tools and datasets. Over the last few years, we have developed several tools, libraries, and datasets.

Tools:



Publications

  • Journals
  • Proceedings
  • Book Chapter



Total of 43 publications




Laboratory Facilities

Scientific discoveries are closely linked to technological development. The Structural Bioinformatics and Computational Biology Laboratory (SBCB Lab) maintains and constantly expands cutting-edge facilities that enable students and scientists to carry out their research. The SBCB Lab also has access to international High Performance Computing infrastructure through cooperation projects with France, Chile, and Germany. The SBCB Lab is also a member of the National Research Infrastructure Platform MCTI.

SBCB Server - Batch System

The computing resources of the SBCB Lab are accessible via a batch system, and users only have direct access to the login node. Access to the compute nodes is only possible through Slurm. The system has 160 CPU cores, 2 TB of RAM, 360 TB of disk space, and 12,544 CUDA cores. The official Slurm documentation can be found at https://slurm.schedmd.com.
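Since compute nodes are reached only through Slurm, work is typically described in a batch script and handed to the scheduler. The sketch below generates such a script from Python. The `#SBATCH` options used are standard Slurm options, but the job name, the resource values, the command, and the availability of GPUs through `--gres` on this particular system are assumptions; the limits and partitions actually configured on the SBCB server are not stated here.

```python
def render_sbatch_script(command, job_name="example", cpus=4, memory="8G",
                         time_limit="01:00:00", gpus=0):
    """Return the text of a Slurm batch script that runs `command`.

    All default values are placeholders chosen for illustration.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --cpus-per-task={cpus}",
        f"#SBATCH --mem={memory}",
        f"#SBATCH --time={time_limit}",
        # %x expands to the job name and %j to the job ID.
        "#SBATCH --output=%x-%j.out",
    ]
    if gpus > 0:
        # Generic resource request; assumes GPUs are exposed as a "gpu" gres.
        lines.append(f"#SBATCH --gres=gpu:{gpus}")
    lines += ["", command, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical job: a training script that needs one GPU.
    script = render_sbatch_script("python3 train.py", job_name="demo", gpus=1)
    with open("job.sh", "w", encoding="utf-8") as handle:
        handle.write(script)
    print(script)
```

The resulting file would then be submitted from the login node with `sbatch job.sh`, and the state of the queue can be inspected with `squeue`.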

SBCB Server - Jupyter Lab

Jupyter enables interactive computing and is an alternative to accessing High-Performance Computing resources via SSH. It allows different programming languages and runtimes to be used within a web-based environment. While the front end runs in the browser on the client, the commands are executed on the remote systems. The SBCB Jupyter platform runs on 48 cores (Intel Xeon E5-2650V4), 768 GB of memory, 5120 CUDA cores / 640 Tensor Cores (Titan V GPU), and 200 TB of workspace. Detailed documentation of the Jupyter project can be found at https://jupyter.readthedocs.io.

SBCB Server - RStudio Server

RStudio Server provides a browser-based interface to R running on the remote server, bringing the power and productivity of the RStudio IDE to server-based deployments of R. The SBCB RStudio Server runs on 64 cores (Intel Xeon Silver 4216), 1 TB of memory, and 200 TB of workspace. Detailed documentation of the RStudio Server project can be found at https://docs.posit.co/ide/server-pro/1.1.463.

Hardware Overview:

External Infrastructure:

Computing node Dayhoff - Center for Biotechnology - UFRGS - Brazil

Computing node Dayhoff - Center for Biotechnology - Supermicro 4023S-TRT, 2 AMD EPYC 7502 CPUs, 2.5 GHz, 128 threads - 512 GB DDR4 ECC - 50 TB HDD. NVIDIA Tesla T4: Turing architecture, 2560 CUDA cores, 320 Tensor Cores, 16 GB GDDR6, 320+ GB/s total memory bandwidth, 254.4.

HoreKa Green - Steinbuch Centre for Computing - Karlsruhe Institute of Technology - Germany

Through the cooperation with the Karlsruhe Institute of Technology (Germany), the Laboratory has access to the HoreKa Green supercomputer. It can provide a computing power of more than 17 PetaFLOPS, or 17 quadrillion computing operations per second, which corresponds to the performance of more than 150,000 laptops. HoreKa is an innovative hybrid system with nearly 60,000 Intel processor cores, more than 220 terabytes of main memory, and 668 NVIDIA A100 GPUs.