Algoritmos de estimação de distribuição para predição ab initio de estruturas de proteínas

Bonetti, Daniel Rodrigo Ferraz

doi:10.11606/T.55.2015.tde-03082015-193613

Home

Facilities

Doctoral Thesis

DOI

https://doi.org/10.11606/T.55.2015.tde-03082015-193613

Document

Doctoral Thesis

Author

Bonetti, Daniel Rodrigo Ferraz (Catálogo USP)

Full name

Daniel Rodrigo Ferraz Bonetti

E-mail

Institute/School/College

Instituto de Ciências Matemáticas e de Computação

Knowledge Area

Computer Science and Computational Mathematics

Date of Defense

2015-03-05

Published

São Carlos, 2015

Supervisor

Delbem, Alexandre Cláudio Botazzo (Catálogo USP)

Committee

Delbem, Alexandre Cláudio Botazzo (President)
Delgado, Myriam Regattieri de Biase da Silva
Einbeck, Jochen
Oliveira, Paulo Sérgio Lopes de
Silva, Fernando Luis Barroso da

Title in Portuguese

Algoritmos de estimação de distribuição para predição ab initio de estruturas de proteínas

Keywords in Portuguese

Ab initio
Algoritmos de estimação de distribuição
Energia de Van der Waals
Modelo probabilístico
Predição de estruturas de proteínas

Abstract in Portuguese

As proteínas são moléculas que desempenham funções essenciais para a vida. Para entender a função de uma proteína é preciso conhecer sua estrutura tridimensional. No entanto, encontrar a estrutura da proteína pode ser um processo caro e demorado, exigindo profissionais altamente qualificados. Neste sentido, métodos computacionais têm sido investigados buscando predizer a estrutura de uma proteína a partir de uma sequência de aminoácidos. Em geral, tais métodos computacionais utilizam conhecimentos de estruturas de proteínas já determinadas por métodos experimentais, para tentar predizer proteínas com estrutura desconhecida. Embora métodos computacionais como, por exemplo, o Rosetta, I-Tasser e Quark tenham apresentado sucesso em suas predições, são apenas capazes de produzir estruturas significativamente semelhantes às já determinadas experimentalmente. Com isso, por utilizarem conhecimento a priori de outras estruturas pode haver certa tendência em suas predições. Buscando elaborar um algoritmo eficiente para Predição de Estruturas de Proteínas livre de tendência foi desenvolvido um Algoritmo de Estimação de Distribuição (EDA) específico para esse problema, com modelagens full-atom e algoritmos ab initio. O fato do algoritmo proposto ser ab initio é mais interessante para aplicação envolvendo proteínas com baixa similaridade, com relação às estruturas já conhecidas. Três tipos de modelos probabilísticos foram desenvolvidos: univariado, bivariado e hierárquico. O univariado trata o aspecto de multi-modalidade de uma variável, o bivariado trata os ângulos diedrais (Φ Ψ) de um mesmo aminoácido como variáveis correlacionadas. O hierárquico divide o problema em subproblemas e tenta tratá-los separadamente. Os resultados desta pesquisa mostraram que é possível obter melhores resultados quando considerado a relação bivariada (Φ Ψ). O hierárquico também mostrou melhorias nos resultados obtidos, principalmente para proteínas com mais de 50 resíduos. Além disso, foi realiza uma comparação com algumas heurísticas da literatura, como: Busca Aleatória, Monte Carlo, Algoritmo Genético e Evolução Diferencial. Os resultados mostraram que mesmo uma metaheurística pouco eficiente, como a Busca Aleatória, pode encontrar a solução correta, porém utilizando muito conhecimento a priori (predição que pode ser tendenciosa). Por outro lado, o algoritmo proposto neste trabalho foi capaz de obter a estrutura da proteína esperada sem utilizar conhecimento a priori, caracterizando uma predição puramente ab initio (livre de tendência).

Title in English

Estimation of distribution algorithms for ab initio protein structure prediction

Keywords in English

Ab initio
Estimation of distibutions algorithms
Probabilistic model
Protein structure prediction
Van der Waals energy

Abstract in English

Proteins are molecules that perform critical roles in the living organism and they are essential for their lifes. To understand the function of a protein, its 3D structure should be known. However, to find the protein structure is an expensive and a time-consuming task, requiring highly skilled professionals. Aiming to overcome such a limitation, computational methods for Protein Structure Prediction (PSP) have been investigated, in order to predict the protein structure from its amino acid sequence. Most of computational methods require knowledge from already determined structures from experimental methods in order to predict an unknown protein. Although computational methods such as Rosetta, I-Tasser and Quark have showed success in their predictions, they are only capable to predict quite similar structures to already known proteins obtained experimentally. The use of such a prior knowledge in the predictions of Rosetta, I-Tasser and Quark may lead to biased predictions. In order to develop a computational algorithm for PSP free of bias, we developed an Estimation of Distribution Algorithm applied to PSP with full-atom and ab initio model. A computational algorithm with ab initio model is mainly interesting when dealing with proteins with low similarity with the known proteins. In this work, we developed an Estimation of Distribution Algorithm with three probabilistic models: univariate, bivariate and hierarchical. The univariate deals with multi-modality of the distribution of the data of a single variable. The bivariate treats the dihedral angles (Proteins are molecules that perform critical roles in the living organism and they are essential for their lifes. To understand the function of a protein, its 3D structure should be known. However, to find the protein structure is an expensive and a time-consuming task, requiring highly skilled professionals. Aiming to overcome such a limitation, computational methods for Protein Structure Prediction (PSP) have been investigated, in order to predict the protein structure from its amino acid sequence. Most of computational methods require knowledge from already determined structures from experimental methods in order to predict an unknown protein. Although computational methods such as Rosetta, I-Tasser and Quark have showed success in their predictions, they are only capable to predict quite similar structures to already known proteins obtained experimentally. The use of such a prior knowledge in the predictions of Rosetta, I-Tasser and Quark may lead to biased predictions. In order to develop a computational algorithm for PSP free of bias, we developed an Estimation of Distribution Algorithm applied to PSP with full-atom and ab initio model. A computational algorithm with ab initio model is mainly interesting when dealing with proteins with low similarity with the known proteins. In this work, we developed an Estimation of Distribution Algorithm with three probabilistic models: univariate, bivariate and hierarchical. The univariate deals with multi-modality of the distribution of the data of a single variable. The bivariate treats the dihedral angles (Φ Ψ) within an amino acid as correlated variables. The hierarchical approach splits the original problem into subproblems and attempts to treat these problems in a separated manner. The experiments show that, indeed, it is possible to achieve better results when modeling the correlation (Φ Ψ). The hierarchical model also showed that is possible to improve the quality of results, mainly for proteins above 50 residues. Besides, we compared our proposed techniques among other metaheuristics from literatures such as: Random Walk, Monte Carlo, Genetic Algorithm and Differential Evolution. The results show that even a less efficient metaheuristic such as Random Walk managed to find the correct structure, however using many prior knowledge (prediction that may be biased). On the other hand, our proposed EDA for PSP was able to find the correct structure with no prior knowledge at all, so we can call this prediction as pure ab initio (biased-free).

WARNING - Viewing this document is conditioned on your acceptance of the following terms of use:
This document is only for private use for research and teaching activities. Reproduction for commercial use is forbidden. This rights cover the whole data about this document as well as its contents. Any uses or copies of this document in whole or in part must include the author's name.

tese_Bonetti_revisada.pdf (16.37 Mbytes)

Publishing Date

2015-08-03

Derived works

WARNING: Learn what derived works are clicking here.