MCTI/CNPq/CT-Biotec nº 30/2022

DSBA: Data Science for Biotechnology Applications
Solving large-scale challenges using explainable machine learning, metaheuristics, and high-performance computing
January 2023 to December 2026
  •   Animal Health
  •   Personalized Medicine
  •   Phenotype Prediction
  •   Protein Phenotype Insights

Animal Health:

Bioinformatics tools that identify the molecular profiles of bacteria enable precise and efficient disease diagnosis. They also foster a deeper understanding of bacterial genetic diversity and support well-informed clinical decision-making. In the field of animal health, researchers focus on bacteria of the genus Brucella, which cause a disease known as brucellosis. This disease, also called Malta fever or undulant fever, affects a wide range of mammals. It is zoonotic and cosmopolitan, posing a significant risk to public health and causing substantial economic losses. Brucellosis can cause symptoms ranging from cold-like signs to complications in the nervous system, musculoskeletal system, and heart. In canines (infected by B. canis), the signs are nonspecific, as in humans, but reproductive failures and joint problems related to this bacterium are commonly diagnosed. Because the clinical signs are so diverse, diagnosing brucellosis in humans and animals is a significant challenge, and underdiagnosis contributes to the spread of infection. Despite this, few genomic studies of different strains of B. canis have been conducted so far. There is therefore a demand for more information, such as virulence factors, antimicrobial resistance genes, and the evolutionary profile of the pathogen, which can greatly inform government public health responses as well as the storage and comparison of data about this agent.

On the experimental front of this project, team members recently sequenced 20 B. canis genomes using two sequencing technologies (one for short reads and one for long reads). These genomes, along with 60 public genomes of B. canis and 160 public genomes of B. suis, make up the data used to address this biological problem. The data will be analyzed with the computational tools developed in this proposal to identify species-specific genetic variations that can serve as diagnostic markers for brucellosis. Interpretable machine learning algorithms will be employed to create a genotypic profile of virulent strains and to differentiate species based on their phenotypic differences and antimicrobial susceptibility profiles.
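As a rough illustration of what "species-specific genetic variations" can mean in practice, the sketch below keeps only the variants present in every genome of a target species and absent from every other genome. It is a minimal example under stated assumptions, not the project's pipeline: the function name, the genome labels, and the variant identifiers (v1, v2, ...) are all hypothetical toy values.

```python
from typing import Dict, List, Set


def species_specific_variants(
    variants_by_genome: Dict[str, Set[str]],
    species_by_genome: Dict[str, str],
    target: str,
) -> List[str]:
    """Return variants found in every target-species genome and in no other genome."""
    target_sets = [
        v for g, v in variants_by_genome.items() if species_by_genome[g] == target
    ]
    other_sets = [
        v for g, v in variants_by_genome.items() if species_by_genome[g] != target
    ]
    if not target_sets:
        return []
    shared = set.intersection(*target_sets)      # present in all target genomes
    seen_elsewhere = set().union(*other_sets)    # present in any non-target genome
    return sorted(shared - seen_elsewhere)


# Toy data (hypothetical): variant calls per genome and a species label per genome.
variants_by_genome = {
    "canis_1": {"v1", "v2", "v3"},
    "canis_2": {"v1", "v3", "v4"},
    "suis_1": {"v3", "v5"},
    "suis_2": {"v3", "v4", "v5"},
}
species_by_genome = {
    "canis_1": "B. canis",
    "canis_2": "B. canis",
    "suis_1": "B. suis",
    "suis_2": "B. suis",
}

canis_markers = species_specific_variants(variants_by_genome, species_by_genome, "B. canis")
suis_markers = species_specific_variants(variants_by_genome, species_by_genome, "B. suis")
print(canis_markers, suis_markers)
```

A strict all-or-none rule like this is fragile with real sequencing data, so a practical analysis would more likely tolerate a few exceptions or rank variants with a statistical or machine learning model.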

Personalized Medicine:

The use of bioinformatics in personalized medicine, particularly in oncology, offers several advantages. By integrating clinical, genetic, and genomic data, bioinformatics allows a more comprehensive understanding of individual patient characteristics, contributing to more accurate diagnoses and better-informed prognoses. Gene expression analysis and biomarker identification enable treatment personalization, adapting therapeutic strategies to the specific genetic features of each tumor. Additionally, applying machine learning algorithms to complex data drives the discovery of non-obvious patterns and relationships, advancing the identification of potential drugs and therapeutic targets. Bioinformatics thus plays a crucial role in the transition from conventional to personalized and predictive medicine, offering substantial benefits for early diagnosis, effective treatment, and better overall management of cancer patients. The specific objectives of this project are to select tumor biomarkers and to identify inhibitors that are potential drug candidates.

Tumor Biomarkers and Therapeutic Targets

This project centers on integrating machine learning and heuristic search methods to identify tumor biomarkers, aiming to discover potential therapeutic targets. The goal is to develop efficient machine learning techniques to handle the complexity of large-scale biological data, contributing to the identification of biomarkers with diagnostic and prognostic value in different types of cancer. The project seeks not only scientific advancements but also clinical applicability, enhancing diagnostics, prognostics, and therapeutic planning.

Drug Design

The project proposes the use of machine learning and heuristic search methods in drug selection to combat chemotherapy resistance in cancer treatments, focusing on resistance to the chemotherapy drug cisplatin. The research aims to identify potential inhibitors of DNA polymerase eta (POLH), which is associated with cisplatin resistance, through virtual screening methods enhanced by machine learning techniques. The identified inhibitors will undergo in silico testing through molecular dynamics (MD) simulations, and new machine learning algorithms will be evaluated to reduce the need for bench experiments. The objective is to accelerate identification, decrease drug development costs, and optimize the drug selection process to address challenges in chemoresistant cancers.

Phenotype Prediction:

Genotype-to-phenotype prediction is a crucial field in contemporary genetics, with important applications in forensic science and anthropological genetics. The search for markers capable of predicting externally visible characteristics (EVCs) has shown promise. Predictors of skin, eye, and hair color from DNA, such as HIrisPlex-S, have been proposed with relative success. However, HIrisPlex-S has demonstrated low predictive power for Latin Americans. To improve performance, the diverse nature of human populations must be considered along with methodological challenges in the search for genetic markers (SNPs) and in the development of predictors. This project aims to construct predictors for skin, eye, and hair color. We have a sample of 6,987 individuals and 651,871 SNPs from five Latin American countries (Brazil, Chile, Colombia, Mexico, and Peru) obtained through CANDELA (Consortium for the Analysis of the Diversity and Evolution of Latin America). The project's objectives are to develop a global classifier for externally visible characteristics (eye, skin, and hair color) for populations in these five countries, and to generate specific classifiers for each population in the sampled countries.
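Before any classifier can be trained on SNP data, genotypes have to be turned into numbers. One common choice, shown here as a minimal sketch and not necessarily the encoding this project uses, is additive coding: each SNP becomes the count (0, 1, or 2) of a chosen effect allele. The SNP identifiers, effect alleles, and genotypes below are hypothetical toy values.

```python
from typing import Dict, List, Optional


def encode_additive(genotype: str, effect_allele: str) -> Optional[int]:
    """Count copies of the effect allele in a two-letter genotype such as 'AG'.

    Returns None when the genotype is missing or malformed (for example 'NN').
    """
    genotype = genotype.upper()
    if len(genotype) != 2 or any(allele not in "ACGT" for allele in genotype):
        return None
    return genotype.count(effect_allele.upper())


def encode_individual(
    genotypes: Dict[str, str], effect_alleles: Dict[str, str]
) -> List[Optional[int]]:
    """Encode one individual as a feature vector, with SNPs in sorted-id order."""
    return [
        encode_additive(genotypes.get(snp, ""), allele)
        for snp, allele in sorted(effect_alleles.items())
    ]


# Hypothetical SNP panel: identifier -> effect allele.
effect_alleles = {"snp_a": "G", "snp_b": "T", "snp_c": "A"}

# Hypothetical genotypes for one individual; 'NN' marks a missing call.
individual = {"snp_a": "AG", "snp_b": "TT", "snp_c": "NN"}

features = encode_individual(individual, effect_alleles)
print(features)
```

Missing calls are kept as None so that the choice of imputation or filtering stays explicit in a later step.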

Protein Phenotype Insights:

ProteinPhenotypeInsights (ProPhIn) is a Python package designed to assist evolutionary biologists in understanding the relationships between candidate proteins and categorical phenotypes in related species. ProPhIn employs machine learning techniques to unravel genotype-phenotype relationships, focusing on interpretability and the generation of visualization tools.

The program encodes missense variations in a way that facilitates the understanding of potential epistatic interactions and selects those more likely to impact the phenotype in each species. Visualization tools for variations aid in interpreting statistical associations, making it easier to assess the biological plausibility of findings. ProPhIn also evaluates the two-dimensional distribution of species and conducts network analyses, aiding in the understanding of overall data behavior.
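The text does not specify how ProPhIn encodes missense variations, so the following is only a generic sketch of one way such an encoding could look, not ProPhIn's actual method or API. It one-hot encodes the residues observed at variable sites of an aligned protein, giving one binary feature per (position, residue) pair. The function names and the toy alignment are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def variable_sites(alignment: Dict[str, str]) -> List[int]:
    """Return 0-based alignment columns where more than one residue is observed."""
    seqs = list(alignment.values())
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("aligned sequences must share one length")
    return [i for i in range(length) if len({s[i] for s in seqs}) > 1]


def one_hot_variations(
    alignment: Dict[str, str],
) -> Tuple[List[str], Dict[str, List[int]]]:
    """One-hot encode residues at variable sites.

    Feature names use 1-based positions, e.g. '2K' means residue K at position 2.
    """
    sites = variable_sites(alignment)
    features = [
        (i, residue)
        for i in sites
        for residue in sorted({s[i] for s in alignment.values()})
    ]
    names = [f"{i + 1}{residue}" for i, residue in features]
    rows = {
        species: [int(seq[i] == residue) for i, residue in features]
        for species, seq in alignment.items()
    }
    return names, rows


# Toy alignment (hypothetical sequences, not real protein data).
alignment = {
    "species_a": "MKTA",
    "species_b": "MRTA",
    "species_c": "MKTS",
}

names, rows = one_hot_variations(alignment)
print(names)
for species, row in rows.items():
    print(species, row)
```

Binary features of this kind are easy to inspect one by one, and products of two features can stand in for pairwise interactions between sites when looking for possible epistasis.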

We tested the software using the candidate genes OXT, OXTR, and LNPEP, and the phenotypes social monogamy, paternal care, and litter size in primates. Our research group has been studying these phenotypes and genes for about 10 years. When executing ProPhIn on our database, the program identified 83.3% of the sites indicated by previous studies from our team as potentially important due to their positions in molecules or signs of positive selection and/or coevolution between OXT, OXTR, and LNPEP. Some of the identified variations have already had their significance validated by functional studies. The program also discovered new variations, potentially capable of explaining phenotypes in species less studied by our research group.


This project aims to develop new bioinformatics tools based on Machine Learning methods (supervised and unsupervised), heuristic search methods, and high-performance computing to explore high-dimensional data in problems of scientific and economic interest in the area of human and animal health. We will develop: (i) algorithms based on adaptive and multiobjective metaheuristics; (ii) multimodal metaheuristics; (iii) time series-based metaheuristics; (iv) combinatorial optimization; (v) interpretable machine learning methods; (vi) algorithms for feature extraction and selection; and (vii) combination of interpretability methods aiming at building general-purpose strategies that contribute to the analysis of large data with complex structure...

Researchers

Dr. Márcio Dorn - Coordinator
Center for Biotechnology
Institute of Informatics - UFRGS - Brazil

Dr. Maria Cátira Bortolini
Institute of Biosciences
Department of Genetics - UFRGS - Brazil

Dr. Bruno Iochins Grisci
Center for Biotechnology
Institute of Informatics - UFRGS - Brazil

Dr. Manuel Escalona
Post Doc - INF/UFRGS - Brazil

Dr. Franciele Maboni Siqueira
Center for Biotechnology
Faculty of Veterinary - UFRGS - Brazil

Dr. Hugo Verli
Center for Biotechnology
Institute of Biosciences - UFRGS - Brazil

Dr. Juliana Silva Bernardes
LCQB/UPMC - France

Dr. Manuel Villalobos-Cid
DIINF/USACH - Chile

Dr. Mario Inostroza-Ponta
DIINF/USACH - Chile

Graduate Students/Collaborators

Mateus Boiani
Ph.D. Candidate
PPGC/INF/UFRGS - Brazil

Gabriela Flores Gonçalves
Ph.D. Candidate
PPGBM/IB/UFRGS - Brazil

Cauê Scotti Luciano Rocha
ITI - EC/UFRGS - Brazil

Éderson S. M. Pinto
Ph.D. Candidate
PPGC/INF/UFRGS - Brazil

Lorenzo C. C. Novo
ITI - BTC/UFRGS - Brazil

Bruna Oliveira Missaggia
Ph.D. Candidate
PPGBM/IB/UFRGS - Brazil

Gabriel Dominico
Ph.D. Candidate
PPGC/INF/UFRGS - Brazil

Scholarship Students

Cauê Scotti Luciano Rocha
ITI - EC/UFRGS - Brazil

Lorenzo C. C. Novo
ITI - BTC/UFRGS - Brazil

Bruna Oliveira Missaggia
DTI-B CNPq
UFRGS/UFPA - Brazil

Oscar V. C. Alegría
DTI-B CNPq - UFRGS/UFPA
PPGBM/IB/UFRGS - Brazil

Tools and Datasets

Publications