ConfID Tutorial


Read the ConfID documentation page: http://sbcb.inf.ufrgs.br/confid

Hey there! This is a tutorial for using ConfID, a tool for conformational characterization of drug-like molecules. You can download all tutorial inputs from the official GitHub repository: https://github.com/sbcblab/confid/tree/master/Tutorials. This includes all the needed molecular dynamics simulation data.

To find out more about ConfID installation procedures, read the file INSTALL.txt in your ConfID distribution. Here, we will assume you have successfully aliased ConfID (again: read INSTALL.txt).

NOTE: This tutorial also uses XMGRACE for graph visualization. On Debian or Ubuntu, the xmgrace program is provided by the grace package, which you can install by typing:

$ sudo apt-get install grace

1 - INTRODUCTION


The conformational space of molecules can be vast and difficult to assess. Experimental techniques, such as X-ray crystallography and NMR, are usually employed, but they are often limited by their solid-state nature or their sensitivity, respectively. Computationally, force field parameters, search algorithms, and geometry sampling also contribute to the challenge of identifying conformational populations and their derived properties. In this sense, molecular dynamics (MD) simulations have been employed to assess the conformational landscape of small molecules while accounting for solvation effects and simulating time-dependent properties. Still, identification and quantification of conformational populations sampled in MD simulations are usually relegated to clustering algorithms, which can be insensitive to small conformational changes, or tackled by manual effort, which can be both time-consuming and error-prone.

With this in mind, we have designed ConfID to infer the most relevant conformations from highly sampled transition intermediates, which can be a challenge for both experimental and computational techniques.

In this tutorial, we are going to analyze PIK75, a GSK3β inhibitor, and its analog called ANA:



ConfID is based on the dihedral flexibility of a given molecule sampled throughout molecular dynamics simulations. For PIK75 and ANA, the dihedrals used for the analyses are marked with green arrows in the figure above.

Beforehand, we simulated both PIK75 and ANA separately in water for 1 μs using GROMACS and calculated the dihedral fluctuations (DIH*.aver.xvg) and the dihedral distributions over the trajectory (DIH*.dist.xvg) using the following command:

$ gmx angle -f trajectory.xtc -n dihedrals.ndx -type dihedral -ov DIH*.aver.xvg -od DIH*.dist.xvg

ConfID uses only dihedral fluctuations through time and their overall frequency distribution as input. Therefore, it does not discriminate which simulation software is used.
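
Since the inputs are plain-text XVG files, it can help to inspect them before running ConfID, especially when they come from another MD package. The snippet below is a minimal sketch of reading such a file in Python, assuming the usual XVG layout in which lines starting with "#" or "@" carry comments and plot directives and the remaining lines are whitespace-separated numbers. The function name parse_xvg and the sample data are made up for illustration and are not part of ConfID.

```python
from typing import List, Tuple


def parse_xvg(text: str) -> List[Tuple[float, ...]]:
    """Return the numeric rows of an XVG file, skipping metadata lines.

    Lines starting with '#' (comments), '@' (Grace directives) or
    '&' (data set separators) are ignored, as are blank lines.
    """
    rows = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped[0] in "#@&":
            continue
        rows.append(tuple(float(field) for field in stripped.split()))
    return rows


# Hypothetical excerpt of a dihedral time series (time in ps, angle in degrees).
example = """\
# This is a comment
@    title "Average Angle"
@    xaxis  label "Time (ps)"
0.0   -62.5
10.0  -58.1
20.0  171.3
"""

rows = parse_xvg(example)
times = [row[0] for row in rows]
angles = [row[1] for row in rows]
print(times)   # [0.0, 10.0, 20.0]
print(angles)  # [-62.5, -58.1, 171.3]
```

For a real file, pass the result of open("DIH3.aver.xvg").read() to the function; the file name here is only an example.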

NOTE: the determination of which dihedrals are relevant for conformational sampling and which are not (symmetrical torsions, for instance) is entirely delegated to the user.

NOTE2: since the default time unit for GROMACS is the picosecond (ps), the DIH*.aver.xvg files in this tutorial are written in ps, and ps should be taken as the time unit throughout.

2 - PREPARING FILES


In this tutorial, all dihedral fluctuation and distribution files are already placed in a specific directory, along with the files input.inp, containing the input file names, and config, containing the ConfID parameters. In both the ANA and PIK75 folders, you should find the files:

  • DIH*.aver.xvg and DIH*.dist.xvg
  • input.inp
  • config
Please note that file names used in input.inp can be modified, but the order “distribution, fluctuation” must be maintained.



ConfID configuration parameters can be set in the config file.



A brief explanation of each parameter:

RESULTS_FOLDER: specifies the directory in which output files should be saved.
DIH_POP_FOLDER: specifies the directory in which output .xvg files should be saved.
NETWORK_FOLDER: specifies the directory in which output network files should be saved.
TIME_STATS_FOLDER: specifies the directory in which output transition files should be saved.
SHOW_Z: a flag that determines if spurious regions (Z) should be represented in the results. They will be used in the internal calculations nevertheless. If this is True, please consider setting PLOT_NETWORK to False, as plotting the chart may become too slow.
NETWORK_CUTOFF: the smallest transition frequency required for an edge to appear in the networks. If equal to 0.0, all edges are considered. If this cutoff is too small, please consider setting PLOT_NETWORK to False, as plotting the chart may become too slow.
PLOT_NETWORK: if True, network figures for the transitions will be created using the graphviz library. Network text files are created regardless of this setting.
CONVERGENCE_CUTOFF: the smallest population frequency at the end of the simulation required for the convergence file for that population to be generated. If equal to 0.0, all populations will be represented, but for a large number of dihedral angles, this can take a while.
FACTOR_PEAK: a factor that sets the constraint for peak selection. Larger values loosen the constraint.
FACTOR_VALLEY: a factor that sets the constraint for valley selection. Lower values loosen the constraint.
TIME_DEPENDENT_STATS: a flag that determines if the statistics of the time spent in each population should be computed.
DATA_1: list of functions that should be used as the x-axis of the charts of the statistics of the time spent in each population and how the report should be ordered.
DATA_2: list of functions that should be used as the y-axis of the charts of the statistics of the time spent in each population and how the report should be ordered.

More information can be found in the config_help.txt file.

NOTE: among all configuration parameters, FACTOR_PEAK and FACTOR_VALLEY must be handled with caution. Their values can affect the determination of dihedral populations and spurious regions (Z). Users must check whether the values used are suited to their case.

3 - RUNNING CONFID


First, access the ANA directory:

$ cd Tutorial/ANA/

Please note the input.inp and config files there. Then, run ConfID by simply typing:

$ confID input.inp config

A successful run should end smoothly with a classic “Finished”.

4 - ANALYZING RESULTS


A lot of information will be printed on the screen. First, the config parameters will be printed, followed by the input files designated in input.inp and, finally, the analysis itself. A successful run should create four new folders: Dihedral_Populations, Networks, Time_Dependent_Stats, and Populations.



4.1 - Determination of Dihedral Populations



To check whether ConfID was able to determine dihedral populations correctly, we must look at the graphs written to the folder “Dihedral_Populations”. Let’s take a look at dihedral DIH3 with the following command:

$ xmgrace Dihedral_Populations/DIH3.dist*

It should look something like the graph below:



By detecting distribution peaks (in ice blue) and valley regions (in red), ConfID identifies dihedral populations for all torsions set in input.inp. It is important to mention that the FACTOR_PEAK and FACTOR_VALLEY values are crucial to the accurate identification of peaks and valleys. For well-defined populations (as is the case for DIH3), the default values should work well.
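
To build some intuition for what peaks and valleys mean on a periodic dihedral distribution, here is a toy sketch in Python. It is not ConfID's algorithm and does not use FACTOR_PEAK or FACTOR_VALLEY: it simply marks bins that are strictly higher (peaks) or strictly lower (valleys) than both neighbors, wrapping around at ±180 degrees. The function name local_extrema and the 12-bin histogram are hypothetical.

```python
from typing import List, Tuple


def local_extrema(density: List[float]) -> Tuple[List[int], List[int]]:
    """Return (peak_indices, valley_indices) of a periodic histogram.

    The first and last bins are treated as neighbors, because a
    dihedral angle of -180 degrees is the same as +180 degrees.
    """
    n = len(density)
    peaks, valleys = [], []
    for i in range(n):
        left = density[(i - 1) % n]
        right = density[(i + 1) % n]
        if density[i] > left and density[i] > right:
            peaks.append(i)
        elif density[i] < left and density[i] < right:
            valleys.append(i)
    return peaks, valleys


# Hypothetical distribution: 12 bins of 30 degrees each, starting at -180.
density = [0.01, 0.05, 0.20, 0.06, 0.02, 0.00,
           0.03, 0.10, 0.30, 0.15, 0.06, 0.02]

peaks, valleys = local_extrema(density)
# Convert bin indices to bin centers in degrees.
print([-180 + 30 * i + 15 for i in peaks])    # [-105, 75]
print([-180 + 30 * i + 15 for i in valleys])  # [-165, -15]
```

In this toy example, the two peaks separated by two valleys would correspond to two dihedral populations. Real distributions are noisier, which is why tunable selection criteria are needed.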

NOTE: We STRONGLY suggest that users test different sets of FACTOR_PEAK and FACTOR_VALLEY to find the most suitable values for their case. Information regarding the identified dihedral populations is written in “Populations/DIHEDRAL_REGIONS.txt”.

4.2 - Determination of Conformational Populations



One of the main features of ConfID is its capacity to characterize conformational populations of a drug-like molecule throughout a trajectory. After determining dihedral populations, conformational populations are printed on the screen and saved in “Populations/CONFORMATIONAL_POPULATIONS.txt”, as shown in the figure below:



For ANA, only two conformational populations were identified: P#1 and P#2, with relative frequencies of 55% and 43.6%, respectively. In total, only about 1% of all frames fell within the spurious Z region, leaving 98.6% (55% + 43.6%) of all frames to be considered valid.
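
Conceptually, a conformational population is a combination of dihedral populations that occurs in the same frame. The Python sketch below illustrates that idea under two assumptions that are ours, not taken from the ConfID source: each frame has already been given one region label per dihedral, and a frame counts as spurious if any of its dihedrals falls in a Z region. The function name conformer_frequencies and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def conformer_frequencies(
    frame_labels: List[Tuple[str, ...]]
) -> Tuple[Dict[Tuple[str, ...], float], float]:
    """Return (frequency of each label combination, fraction of Z frames).

    frame_labels holds one tuple per frame with one region label per
    dihedral. Assumption: a frame is spurious if any label is 'Z'.
    """
    n_frames = len(frame_labels)
    valid = [labels for labels in frame_labels if "Z" not in labels]
    counts = Counter(valid)
    # Frequencies are relative to ALL frames, most populated first.
    freqs = {conf: count / n_frames for conf, count in counts.most_common()}
    z_fraction = (n_frames - len(valid)) / n_frames
    return freqs, z_fraction


# Hypothetical trajectory of 10 frames with two dihedrals each.
frames = [("A", "B")] * 5 + [("A", "C")] * 4 + [("Z", "B")]

freqs, z_fraction = conformer_frequencies(frames)
print(freqs)       # {('A', 'B'): 0.5, ('A', 'C'): 0.4}
print(z_fraction)  # 0.1
```

The valid frequencies and the Z fraction add up to 1, which is the same bookkeeping as in the ANA report above.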

NOTE: Different sets of FACTOR_PEAK and FACTOR_VALLEY can directly affect the size of the Z region. Choose them with caution.

Within the “Populations” folder, a “Convergence” folder is also written. In there, ConfID saves the sampling evolution throughout time (Populations/Convergence/Sampling_Evolution.xvg) and the conformational frequency evolution throughout time (Populations/Convergence/Freq_Conf-*.xvg).

Let’s take a look at the Sampling Evolution:

$ xmgrace Populations/Convergence/Sampling_Evolution.xvg



The graph suggests that the two ANA conformers were already sampled at the beginning of the trajectory. Now, let’s take a look at the conformational frequency evolution by typing:

$ xmgrace Populations/Convergence/Freq_Conf-*.xvg



This graph shows that, although P#1 and P#2 conformers of ANA were sampled right at the beginning of the simulation, their relative frequencies took more time to converge.

In practical terms: if the ANA simulation were only 100 ns long, it would have led the user to the erroneous assumption that P#2 is more abundant in solution than P#1. With ConfID, this convergence evaluation is easily made and can be used in automated procedures to check whether an MD simulation of a drug-like compound has converged.
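
The effect described above can be reproduced with a few lines of Python. This is a sketch under the assumption that the frequency evolution is a running (cumulative) frequency, that is, the fraction of frames assigned to a conformer up to each point in time; we have not checked how ConfID computes its Freq_Conf files. The function name running_frequency and the label sequence are made up.

```python
from typing import List


def running_frequency(labels: List[str], target: str) -> List[float]:
    """Fraction of frames assigned to `target`, up to and including each frame."""
    frequencies = []
    hits = 0
    for n_seen, label in enumerate(labels, start=1):
        if label == target:
            hits += 1
        frequencies.append(hits / n_seen)
    return frequencies


# Hypothetical conformer label per frame.
labels = ["P2", "P2", "P1", "P2", "P1", "P1", "P1", "P1"]

p1 = running_frequency(labels, "P1")
p2 = running_frequency(labels, "P2")
print([round(f, 3) for f in p1])  # [0.0, 0.0, 0.333, 0.25, 0.4, 0.5, 0.571, 0.625]
print([round(f, 3) for f in p2])  # [1.0, 1.0, 0.667, 0.75, 0.6, 0.5, 0.429, 0.375]
```

Stopping after the fourth frame would suggest that P2 dominates (0.75 against 0.25), while the full sequence shows the opposite (0.375 against 0.625).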

Aside from the “Populations/Convergence/” folder, ConfID also creates a “Populations/CONF_Frames/” folder, in which a collection of frames related to the identified conformers is saved. This feature is handy when tracking chemical properties or behaviors that are exclusive to each conformer.

4.3 - Conformational Ensemble Topology



ConfID also tracks transition events between different conformational populations throughout the trajectory. These events are compiled into a topological network using graph theory, in which conformational populations are represented by nodes, while transition events are represented by directed edges.
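
As an illustration of the node-and-edge idea, the Python sketch below counts directed transitions between consecutive frames of a labeled trajectory. It is a simplified stand-in, not ConfID's implementation: it assumes every frame already carries a valid conformer label and says nothing about how spurious (Z) frames are treated. The function name count_transitions and the labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def count_transitions(labels: List[str]) -> Dict[Tuple[str, str], int]:
    """Count directed edges (source, destination) between consecutive frames.

    Frames that stay in the same population produce no edge.
    """
    edges = Counter()
    for previous, current in zip(labels, labels[1:]):
        if previous != current:
            edges[(previous, current)] += 1
    return dict(edges)


# Hypothetical conformer label per frame.
labels = ["P1", "P2", "P1", "P2", "P3", "P1"]

edges = count_transitions(labels)
for (source, destination), count in sorted(edges.items()):
    print(f"{source} -> {destination}: {count}")
# P1 -> P2: 2
# P2 -> P1: 1
# P2 -> P3: 1
# P3 -> P1: 1
```

Note that the edges are directional: P1 -> P2 and P2 -> P1 are counted separately.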

By default, ConfID will save all transition events into “Networks/CONF_network.txt”. If the PLOT_NETWORK parameter in config is set to True, ConfID will also plot the network into “Networks/CONF_network.pdf”, which is a fast way to get a rough overview of the data.



Since ANA has only two conformational populations, the resulting two-node network is quite straightforward to read. However, more flexible molecules might yield a more complex conformational ensemble topology. For those cases, and to give the user more freedom, ConfID saves “Networks/CONF_network.gml” by default, which can be manipulated in Cytoscape or other systems biology software to make better-presented figures.

Analogously, all features implemented for conformational transitions in ConfID are also available for transitions between the dihedral populations of each torsion. Therefore, for each torsion set in input.inp, ConfID tracks transition events and writes all of the network output file types mentioned above.

NOTE: The NETWORK_CUTOFF parameter in config is the minimum value of node frequency to be saved in *gml, *txt, and *pdf files.

4.4 - Time-dependent statistics



ConfID also tracks transition events throughout the trajectory between different dihedral populations and between different conformational populations. This information is saved in the “Time_Dependent_Stats” folder.

In “Time_Dependent_Stats/CONF_Transitions.txt”, ConfID saves every transition event and the frames in which it was detected. In “Time_Dependent_Stats/CONF_Transitions.xvg”, ConfID plots transitions as a function of time, where 1 means a transition has occurred and 0 means no occurrence. It should look like the image below:



In “Time_Dependent_Stats/CONF.transitions-Time_Stats-sumXaver.txt,” ConfID saves the time-dependent properties chosen in the config of each conformational population identified. All available properties are:

- sum: total time in a population
- max: maximum time spent in a population without leaving
- min: minimum time spent in a population without leaving
- aver: average time spent in a population without leaving
- std: standard deviation of the average time
- median: median time spent in a population without leaving
- count: the number of transition events entering this population

NOTE: The time-dependent properties calculated by ConfID are sensitive to the frequency of frames used to write dihedral fluctuations as a function of time (files “DIH*.aver.xvg”). The higher the frequency of saved frames, the higher the time resolution of transition events and the more reliable the properties calculated by ConfID. The lower the frequency of saved frames, the less reliable the calculated properties.
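
To make the listed properties concrete, here is a minimal Python sketch that computes them from a labeled trajectory. It rests on assumptions of ours rather than on the ConfID source: each uninterrupted run of frames in one population counts as one stay, count is taken as the number of stays, std is the population standard deviation, and dt is the time between saved frames. The function name residence_stats and the input data are hypothetical.

```python
import statistics
from itertools import groupby
from typing import Dict, List


def residence_stats(labels: List[str], dt: float) -> Dict[str, Dict[str, float]]:
    """Residence-time statistics per population.

    labels: population label of each saved frame, in time order.
    dt: time between saved frames (ps); this is also the time resolution.
    """
    stays: Dict[str, List[float]] = {}
    # groupby collapses consecutive identical labels into one stay.
    for label, run in groupby(labels):
        stays.setdefault(label, []).append(sum(1 for _ in run) * dt)

    stats = {}
    for label, times in stays.items():
        stats[label] = {
            "sum": sum(times),
            "max": max(times),
            "min": min(times),
            "aver": statistics.mean(times),
            "std": statistics.pstdev(times),  # assumption: population std
            "median": statistics.median(times),
            "count": len(times),              # assumption: number of stays
        }
    return stats


# Hypothetical trajectory saved every 10 ps.
# P1 stays: 30 ps and 10 ps. P2 stays: 20 ps and 40 ps.
labels = ["P1"] * 3 + ["P2"] * 2 + ["P1"] * 1 + ["P2"] * 4

stats = residence_stats(labels, dt=10.0)
for name in ("P1", "P2"):
    print(name, stats[name])
```

No stay can be resolved below dt, which is why a sparser frame output makes these numbers less reliable.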

As shown in the image below, P#1 has a slightly higher average time than P#2 and a higher standard deviation (396 ± 1562 ps against 321 ± 910 ps).



ConfID also plots an XY graph with one time-dependent property on each axis, as can be seen in “Time_Dependent_Stats/CONF.transitions-Time_Stats-sumXaver.png” and below. Users can set each property in the config file (DATA_1 and DATA_2).

5 - COMPARING ANA AND PIK75


After a complete characterization of ANA, you should be able to run ConfID for PIK75 and evaluate the results by yourself.

For the PIK75 compound, four conformational populations were identified, as shown in the image below (file “PIK75/Populations/CONFORMATIONAL_POPULATIONS.txt”). Please note that the only difference between ANA and PIK75 is a single substituent, an amino group (-NH2) versus a nitro group (-NO2), but this is enough for PIK75 to sample different conformations in solution.



If we take a look at the conformational ensemble topology plotted by ConfID (PIK75/Networks/CONF_network.pdf), we can see the most relevant conformational populations and their transition frequencies. Since the frequency of P#4 is below 0.01, it is not shown.



Furthermore, we can compare the sampling convergence of PIK75 with ANA by plotting both “Sampling_Evolution.xvg” files together. The graph should look similar to the one below:



It is essential to mention that, in both cases, the simulations of PIK75 and ANA sampled all possible conformers very quickly. However, by comparing the frequency evolution of each conformer throughout time, it is possible to infer that the conformational ensemble of PIK75 reached an equilibrium around 300 ns.
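
Reading an equilibration time off a plot can be complemented by a simple numerical check. The Python sketch below uses one possible criterion, chosen by us for illustration and not taken from ConfID: the first point after which the frequency stays within a given tolerance of its final value. The function name convergence_index, the tolerance, and the data are hypothetical.

```python
from typing import List


def convergence_index(series: List[float], tolerance: float) -> int:
    """Index of the first point from which all later values stay within
    `tolerance` of the final value of the series."""
    final = series[-1]
    index = 0
    for i, value in enumerate(series):
        if abs(value - final) > tolerance:
            # Still outside the band here, so convergence starts later.
            index = i + 1
    return index


# Hypothetical frequency evolution of one conformer, sampled every 50 ns.
times_ns = [50, 100, 150, 200, 250, 300, 350]
frequency = [0.20, 0.35, 0.48, 0.52, 0.50, 0.51, 0.50]

i = convergence_index(frequency, tolerance=0.03)
print(times_ns[i])  # 150
```

Because this criterion is measured against the final value, it can only say that the frequency is stable over the simulated time, not that longer sampling would leave it unchanged.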



Note the frequency evolution of ANA's conformers P#1 (in blue) and P#2 (in green), and of PIK75's conformers P#1 (in blue), P#2 (in green) and P#3 (in yellow). By looking at the dihedral combinations of each conformer, it is possible to notice that P#1 and P#2 of ANA and PIK75 are conformationally analogous.

With this, you have successfully completed your first analysis with ConfID! If you still have questions, try reading the helper files that were downloaded with the code, the ConfID documentation page, the published paper, or contact us directly through the e-mail confidcontact@gmail.com. Have a nice "ConfIDent" analysis! =)