Previously, each Kohonen network was trained with information from only one molecule, in most cases one molecular surface. The maps thus generated were replications of one molecular surface; the objects mapped into the individual neurons were points from that surface. Now, work will be reported in which entire datasets of molecules are sent into a single Kohonen network. Each object mapped into a neuron then consists of an entire molecule. Various representations of molecules have been employed, as detailed in the following sections.
The analysis of a dataset of objects by learning methods, be it by statistical or pattern recognition methods or by neural networks, requires the objects to be represented by the same number of variables. If the objects are molecules, one has to come up with the same number of descriptors irrespective of the size of the molecule and the number of atoms in it. In the following, the transformation of a molecular structure by autocorrelation is used to obtain a representation by a fixed number of variables. It will be shown that autocorrelation allows molecules to be considered with different degrees of sophistication, starting with the constitution (the topology) of a molecule, through the 3D structure, all the way to representations of molecular surfaces. In addition, a variety of physico-chemical properties of the atoms or of the molecular surfaces can be considered. The choice within this hierarchy of representations is mainly dictated by the size of the datasets to be studied: large datasets of molecules call for rapid encoding schemes, while smaller ones allow a more detailed consideration of molecules.
The idea of using autocorrelation for the transformation of the constitution of a molecule into a fixed-length representation was introduced by Moreau and Broto. A certain property, pk, of an atom i is correlated with the same property of atom j, and these products are summed over all atom pairs having a certain topological distance d. This gives one element of a topological autocorrelation function A(pk,d) of this property pk:

$$A(p_k, d) = \sum_{i<j} \delta(d_{ij}, d)\, p_k(i)\, p_k(j)$$

where dij is the topological distance between atoms i and j, and δ(dij,d) is 1 if dij equals d and 0 otherwise.
The following properties were calculated by previously published empirical methods for all atoms of a molecule: sigma charge, qσ, total charge, qtot, sigma electronegativity, χσ, pi electronegativity, χπ, lone-pair electronegativity, χLP, and atom polarizability, α. In addition to those six electronic variables, the identity function (i.e. each atom being represented by the number 1) was used in Eq. 5 to account just for the connectivity of the atoms in the molecule.
The autocorrelation of these seven variables was calculated for seven topological distances (number of intervening bonds) from two to eight. The basic assumption, thus, was that interactions of atoms beyond eight bonds can be neglected. With seven variables and seven distances, an autocorrelation vector of dimension 49 was obtained for each molecule, irrespective of its size or number of atoms. The hydrogen atoms were not considered in the calculation of the autocorrelation vector.
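The construction of such a vector can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the molecule is given as a hypothetical adjacency list of its non-hydrogen atoms, the property values are supplied by the caller, and each atom pair is counted once. The function names are ours, not those of any published program.

```python
from collections import deque

def topological_distances(adjacency):
    """All-pairs bond counts, found by breadth-first search from every atom."""
    n = len(adjacency)
    dist = [[None] * n for _ in range(n)]
    for start in range(n):
        dist[start][start] = 0
        queue = deque([start])
        while queue:
            a = queue.popleft()
            for b in adjacency[a]:
                if dist[start][b] is None:
                    dist[start][b] = dist[start][a] + 1
                    queue.append(b)
    return dist

def autocorrelation_vector(adjacency, properties, distances=range(2, 9)):
    """Concatenate A(p_k, d) for every property k and every distance d."""
    dist = topological_distances(adjacency)
    n = len(adjacency)
    vector = []
    for values in properties:
        for d in distances:
            total = 0.0
            for i in range(n):
                for j in range(i + 1, n):  # each atom pair counted once
                    if dist[i][j] == d:
                        total += values[i] * values[j]
            vector.append(total)
    return vector

# Hypothetical example: the six carbon atoms of n-hexane as a simple chain.
hexane = [[1], [0, 2], [1, 3], [2, 4], [3, 5], [4]]
identity = [1.0] * 6  # identity function: every atom is represented by 1
vec = autocorrelation_vector(hexane, [identity])
```

With the identity function, each element simply counts the atom pairs at the given distance. Passing seven property lists instead of one yields the 49-dimensional vector described above, whatever the number of atoms.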
In order to investigate the potential of topological autocorrelation functions for the distinction of biological activity, a dataset of 112 dopamine agonists (DPA) and 60 benzodiazepine agonists (BDA) was studied. A Kohonen network of size 10 x 7 was used to project these 172 compounds from the 49-dimensional space spanned by these autocorrelation vectors into two dimensions.
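For readers unfamiliar with the method, the following is a minimal sketch of Kohonen training, not the program used in the work reported here. It assumes a rectangular map, a Gaussian neighbourhood, and linearly decaying learning rate and radius; all parameter values and the toy data are hypothetical.

```python
import math
import random

def winner(weights, x):
    """Return (row, col) of the neuron whose weight vector is closest to x."""
    best, best_d2 = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d2 = sum((wk - xk) ** 2 for wk, xk in zip(w, x))
            if d2 < best_d2:
                best, best_d2 = (r, c), d2
    return best

def train_som(data, rows, cols, epochs=20, seed=0):
    """Train a rectangular map; rate and radius shrink linearly (assumed schedule)."""
    rng = random.Random(seed)
    dim = len(data[0])
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # one shuffled pass over the data
            frac = step / total
            rate = 0.5 * (1.0 - frac)
            radius = max(1.0, 0.5 * max(rows, cols) * (1.0 - frac))
            wr, wc = winner(weights, x)
            for r in range(rows):
                for c in range(cols):
                    grid_d2 = (r - wr) ** 2 + (c - wc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * radius ** 2))
                    w = weights[r][c]
                    for k in range(dim):
                        # pull the neuron towards x, weighted by map proximity
                        w[k] += rate * h * (x[k] - w[k])
            step += 1
    return weights

# Hypothetical toy data: two groups of 3-dimensional descriptors.
toy = [[0.1, 0.0, 0.2], [0.0, 0.1, 0.1], [0.9, 1.0, 0.8], [1.0, 0.9, 1.0]]
som = train_som(toy, rows=3, cols=3, epochs=10)
positions = [winner(som, x) for x in toy]
```

After training, each compound is assigned to its winning neuron, and the map is drawn by marking each neuron with the class of the compounds it received. For the dataset above, `data` would hold 172 vectors of length 49 and the map would have 10 x 7 neurons.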
The two types of compounds, DPA and BDA, were nearly completely separated in the Kohonen map, underscoring the potential of this molecular representation to model biological activity. To put this capability to a more severe test, this dataset of 112 DPA and 60 BDA compounds was mixed with the entire catalog of a chemical supplier (Janssen Chimica catalog, version 1989) consisting of 8323 commercially available compounds that comprise a wide range of structures, from alkanes to triphenylmethane dyestuffs.
Fig. 12. Kohonen map of 40 x 30 neurons obtained by training with 112 dopamine agonists (DPA), 60 benzodiazepine agonists (BDA) and 8323 commercially available compounds. Only the types of compounds mapped into the individual neurons are indicated. Black identifies DPA, light gray BDA and dark gray the compounds of unknown activity. Empty neurons are shown in white; the two neurons marked by a black frame indicate conflicts where both DPA and BDA are mapped into the same neuron.
The map of Fig. 12 shows that both DPA and BDA occupy only limited areas of the overall map. Furthermore, the areas of DPA and BDA are quite well separated from each other: only one neuron with BDA intrudes into the domain of DPA, and only two neurons with conflicts, containing both DPA and BDA, occur. With the results obtained here, the search for new active compounds or new lead structures can be restricted to a smaller area of the entire chemical space. This opens the way for searching for compounds with a desired biological activity and for discovering new lead structures in large databases of compounds. Closer analysis of the mapping provides interesting insights that are further discussed in the original publication.
Ligands and proteins interact through their molecular surfaces; representations of molecular surfaces therefore have to be sought in the endeavour to understand biological activity. Again, we are under the restriction of having to represent molecular surfaces of different size, and again, autocorrelation was employed to achieve this goal. First, a set of randomly distributed points on the molecular surface has to be generated. Then, all distances between the surface points are calculated and sorted into preset intervals:

$$A(d) = \frac{1}{L} \sum_{i<j} p(i)\, p(j), \qquad d_{ij} \in [d_l, d_u]$$

where p(i) and p(j) are the property values at points i and j, respectively; dij is the distance between the points i and j; and L is the total number of distances in the interval [dl,du] represented by d. For a series of distance intervals with different lower and upper bounds, dl and du, a vector of autocorrelation coefficients is obtained. It is a condensed representation of the distribution of property p on the molecular surface.
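The binning of point pairs can be illustrated by a short Python sketch. It assumes that the surface points and their property values are already available; the generation of the surface is not shown. It also treats each interval as half-open, which is our choice for the illustration. The default bounds correspond to twelve intervals of 1 Å between 1 and 13 Å.

```python
import math

def spatial_autocorrelation(points, values, lower=1.0, upper=13.0, width=1.0):
    """Mean product p(i)*p(j) of all point pairs falling into each distance bin."""
    n_bins = int(round((upper - lower) / width))
    sums = [0.0] * n_bins
    counts = [0] * n_bins  # L: number of distances per interval
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            d = math.dist(points[i], points[j])
            if lower <= d < upper:  # half-open bins: an assumption of this sketch
                b = min(int((d - lower) / width), n_bins - 1)
                sums[b] += values[i] * values[j]
                counts[b] += 1
    # Bins without any pair are reported as 0.0
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Hypothetical surface points (in Å) with hypothetical property values.
pts = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.5, 0.0)]
vals = [1.0, 2.0, 3.0]
coeffs = spatial_autocorrelation(pts, vals)
```

The length of the result depends only on the chosen intervals, not on the number of surface points, which is what makes surfaces of different size comparable.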
The affinity of 31 steroids binding to the corticosteroid binding globulin (CBG) receptor was modelled based on spatial autocorrelation coefficients of the molecular electrostatic potential as descriptor. A vector of twelve autocorrelation coefficients corresponding to twelve distance intervals of 1 Å width between 1 and 13 Å was determined for each steroid using Eq. 5. This set of descriptors was then investigated by two different methods. First, the 12-dimensional descriptor space was projected into a plane using a Kohonen neural network in order to visualize it. Second, the descriptors were used to quantitatively model CBG activity.
The projection into a Kohonen map was performed by training a network that consisted of 15 x 15 neurons. The resulting mapping into two dimensions is shown in Fig. 13, with black, dark gray and light gray squares representing steroids of high, medium and low binding affinity, respectively. The Kohonen network used had the topology of a torus, i.e. the neurons at the left and right sides of the network are directly connected, as are the neurons at the upper and lower sides. Therefore, it was possible to arrange four identical copies of the resulting Kohonen map like tiles in Fig. 13, in order to better represent the clusters formed by the steroids. The compounds of high, medium and low activity form three clearly perceivable clusters in the Kohonen map, as indicated in Fig. 13. Only one compound, a steroid of medium activity, is not grouped together with compounds of the same activity class, but is instead surrounded by highly active compounds.
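The two consequences of the toroidal topology, wrap-around neighbourhood distances and the possibility of tiling, can be made concrete with a small sketch. The function names and the Euclidean grid metric are assumptions made for the illustration.

```python
import math

def torus_distance(a, b, rows, cols):
    """Grid distance between neurons a and b when opposite edges are joined."""
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    dr = min(dr, rows - dr)  # going around the vertical seam may be shorter
    dc = min(dc, cols - dc)  # likewise for the horizontal seam
    return math.hypot(dr, dc)

def tile_map(grid, times=2):
    """Repeat a map 'times' times in both directions, like tiles."""
    return [list(row) * times for _ in range(times) for row in grid]

corner_to_corner = torus_distance((0, 0), (14, 14), rows=15, cols=15)
tiled = tile_map([[1, 2], [3, 4]])
```

On a 15 x 15 torus, opposite corners are diagonal neighbours, so a cluster that appears split across the edges of a single map shows up as one contiguous group in the tiled representation.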
Fig. 13. Four-fold replication of the Kohonen map of the steroid dataset. The three clusters of compounds of high (black squares), medium (asterisks), and low activity (squares with crosses) are highlighted.
The ability of the Kohonen network to distinguish between compounds belonging to different activity classes shows that the autocorrelation descriptor fulfills one of the prerequisites for a successful quantitative analysis: compounds that are similar to each other in the descriptor space exhibit similar biological activity. Therefore, we were encouraged to quantitatively model the binding affinity with a feed-forward neural network trained by back-propagation using the autocorrelation vector as descriptor.
In order to estimate the predictive power of the approach, cross-validation following the leave-one-out scheme was performed. Although the predicted values show the correct trend, the quality of the predictions is not satisfactory. In particular, one outlier can be identified: the very same compound that already stood out in the Kohonen map. After omitting this compound from the dataset and repeating the cross-validation, a much better predictive power is achieved, as reflected by a cross-validated r² of 0.84. Thus, a much better model was obtained than with the widely used CoMFA method, which had only led to a cross-validated r² of 0.65 with 21 compounds.
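The leave-one-out scheme itself is independent of the model. The sketch below assumes the commonly used definition of the cross-validated r² as 1 - PRESS/SS, where PRESS is the sum of squared prediction errors and SS is the sum of squared deviations of the observed values from their mean. For brevity, a one-variable least-squares line stands in for the feed-forward network; this substitution and all names and data are ours.

```python
def loo_q2(xs, ys, fit, predict):
    """Leave-one-out cross-validated r^2, computed as 1 - PRESS / SS."""
    press = 0.0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]  # all objects except the i-th
        train_y = ys[:i] + ys[i + 1:]
        model = fit(train_x, train_y)
        press += (ys[i] - predict(model, xs[i])) ** 2
    mean_y = sum(ys) / len(ys)
    ss = sum((y - mean_y) ** 2 for y in ys)
    return 1.0 - press / ss

def fit_line(xs, ys):
    """Least-squares line; a stand-in for the neural network model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def predict_line(model, x):
    slope, intercept = model
    return slope * x + intercept

# Hypothetical data: an exact line, and the same trend with one outlier.
q2_clean = loo_q2([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], fit_line, predict_line)
q2_outlier = loo_q2([0.0, 1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 2.0, 3.0, 10.0],
                    fit_line, predict_line)
```

A single poorly predicted object contributes its full squared error to PRESS, which is why removing one outlier can change the cross-validated r² substantially in a dataset of about 30 compounds.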
The methods introduced in the previous sections have the advantage that they allow for a rapid visualization of high-dimensional descriptor spaces. The importance of this feature has increased with the advent of the large compound collections that can be generated by combinatorial chemistry and related techniques. Small datasets comprising tens or hundreds of compounds can be analyzed using almost any method without reaching the limits of currently available computer hardware, whereas special techniques are needed for handling datasets of hundreds of thousands of compounds. To demonstrate the merits of Kohonen networks and spatial autocorrelation descriptors in handling large datasets, we analyzed three combinatorial libraries that together comprise more than 87000 compounds.
Rebek et al. published the synthesis of two combinatorial libraries of semi-rigid compounds that were prepared by condensing a rigid central molecule functionalized by four acid chloride groups with a set of 19 different L-amino acids. This process is summarized in Fig. 14. In addition to the two published libraries, we included a third, hypothetical library with adamantane as the central molecule in our study.
Fig. 14. Preparation of the xanthene and the cubane libraries.
A Kohonen network with 50 x 50 neurons was trained with the combined descriptors of the xanthene and the cubane libraries, each molecule being represented by 12 autocorrelation values calculated from the electrostatic potential on the molecular surface. The resulting map is shown in Fig. 15a. The neurons are colored according to the most frequent central molecule that is mapped into them. All 2500 neurons of the map are occupied. The compounds of the cubane library form a cluster in the center of the map that is separated from the compounds of the xanthene library. The neural network thus separates the two libraries quite well: they cover different parts of the Kohonen map, and it can be concluded that they come from different parts of the chemical space. Consequently, they are markedly different and both worth considering in a screening program.
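Coloring each neuron by its most frequent central molecule amounts to a majority vote per neuron. A minimal sketch, assuming that the winning neuron of every compound has already been determined and that ties are broken by order of first appearance:

```python
from collections import Counter, defaultdict

def color_neurons(assignments):
    """Map each occupied neuron to the most frequent label among its compounds.

    'assignments' is an iterable of (neuron, label) pairs, one per compound.
    """
    per_neuron = defaultdict(Counter)
    for neuron, label in assignments:
        per_neuron[neuron][label] += 1
    return {neuron: counts.most_common(1)[0][0]
            for neuron, counts in per_neuron.items()}

# Hypothetical assignments of four compounds to two neurons.
hits = [((0, 0), "xanthene"), ((0, 0), "xanthene"),
        ((0, 0), "cubane"), ((1, 2), "cubane")]
colors = color_neurons(hits)
```

Note that such a map hides minority members of a neuron, so the visual separation of two libraries can look cleaner than the underlying assignments.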
Fig. 15. Kohonen map of (a) the combined xanthene and cubane libraries and (b) the combined xanthene, cubane and adamantane libraries.
In a second experiment, we trained the same network with the combined dataset of all three libraries. This resulted in the Kohonen map shown in Fig. 15b. Again, a distinct cluster that is clearly separated from the xanthene derivatives can be seen in the center of the map. The cubane and adamantane derivatives, on the other hand, cannot be distinguished by the neural network. They are tightly mixed in the central cluster, even more than can be concluded from Fig. 15b, as 92% of the cubane and adamantane compounds are mapped into common neurons. The cubane and adamantane libraries thus cover the same part of the chemical space; they are so similar to each other that considering both of them in a screening program is a waste of both resources and time. The xanthene library is evidently different from the other two libraries. Therefore, the xanthene library and either the cubane or the adamantane library should be used for screening.
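An overlap figure of this kind can be obtained in several ways. The sketch below shows one plausible reading, which is our assumption rather than the published procedure: the fraction of compounds from both libraries that fall into neurons occupied by members of both.

```python
def shared_fraction(neurons_a, neurons_b):
    """Fraction of all compounds that sit in neurons hit by both libraries.

    Each argument lists the winning neuron of every compound of one library.
    """
    common = set(neurons_a) & set(neurons_b)
    in_common = sum(1 for n in neurons_a if n in common)
    in_common += sum(1 for n in neurons_b if n in common)
    return in_common / (len(neurons_a) + len(neurons_b))

# Hypothetical winning neurons for two small libraries.
lib_a = [(0, 0), (0, 0), (1, 1)]
lib_b = [(0, 0), (2, 2)]
overlap = shared_fraction(lib_a, lib_b)
```

In this toy case, three of the five compounds share a neuron with a compound of the other library, giving a value of 0.6.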
Rebek et al. used their libraries to screen for novel trypsin inhibitors. Only the xanthene library showed significant trypsin inhibition, so they concentrated further efforts on this library. In the next round of screening, they divided the xanthene library into six sublibraries by using subsets of only 15 amino acids for the generation of the libraries. These subsets were generated by omitting three amino acids in turn from a set of 18 amino acids. This process resulted in sublibraries of 25425 compounds each, which were tested for their trypsin inhibition. To study the diversity of the six sublibraries, we first trained a network with the complete xanthene dataset, resulting in a map with all neurons occupied. Into this map, we then sent the compounds of the different sublibraries, obtaining altogether six different maps, one for each sublibrary (Fig. 16).
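The sublibrary size can be checked with a short calculation. The sketch assumes that the four attachment positions of the xanthene core are related by a twofold symmetry, so that a sequence of four amino acids and its reverse describe the same compound. This is our assumption; it is used here because it reproduces the figure of 25425 for 15 amino acids. The amino acid labels and the grouping of the omitted triples are hypothetical.

```python
from itertools import product

def xanthene_library_size(n):
    """Sequences of length 4 over n building blocks, counted up to reversal."""
    # n**4 sequences in total, n**2 of which are palindromes
    return (n ** 4 + n ** 2) // 2

def count_by_enumeration(n):
    """Brute-force check of the closed form, feasible for small n."""
    seen = set()
    for seq in product(range(n), repeat=4):
        seen.add(min(seq, seq[::-1]))  # canonical form of seq and its reverse
    return len(seen)

acids = [f"aa{i:02d}" for i in range(1, 19)]  # 18 hypothetical labels
# Omit three amino acids in turn (consecutive triples, chosen arbitrarily).
sublibraries = [acids[:i] + acids[i + 3:] for i in range(0, 18, 3)]
sizes = [xanthene_library_size(len(s)) for s in sublibraries]
```

Omitting three of 18 amino acids in turn gives six subsets of 15, and each subset yields (15^4 + 15^2)/2 = 25425 compounds under the stated symmetry assumption.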
Fig. 16. Kohonen maps of a network trained with the entire xanthene library. Neurons occupied by the different sublibraries are shown in black, unoccupied neurons in white.
The six maps show remarkable differences: some of them are nearly completely filled, while others exhibit large white areas representing neurons into which no compound was mapped. The larger these white areas are, the less the corresponding sublibrary covers the chemical space of the original xanthene library. The omission of the basic or acidic amino acids, for example, has led to a decreased diversity, as shown by the large number of empty neurons. On the other hand, the omission of the larger alkyl amino acids or of the OH- and S-substituted amino acids from the xanthene library does not lead to a marked decrease in diversity, as there are only small white areas in the corresponding maps.
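The visual impression of white areas can be turned into a number by counting occupied neurons. The following sketch assumes that the winning neuron of each sublibrary compound on the map trained with the full library is known; the coverage measure and the text rendering are illustrative choices of ours.

```python
def coverage(winning_neurons, rows, cols):
    """Fraction of the map's neurons that receive at least one compound."""
    return len(set(winning_neurons)) / (rows * cols)

def occupancy_map(winning_neurons, rows, cols):
    """Text rendering of the map: '#' for occupied neurons, '.' for empty ones."""
    occupied = set(winning_neurons)
    return ["".join("#" if (r, c) in occupied else "." for c in range(cols))
            for r in range(rows)]

# Hypothetical winning neurons of three compounds on a 2 x 3 map.
hits = [(0, 0), (0, 0), (1, 2)]
fraction = coverage(hits, rows=2, cols=3)
picture = occupancy_map(hits, rows=2, cols=3)
```

A sublibrary with coverage close to 1 spans nearly the same region of descriptor space as the full library, whereas a low value corresponds to the large white areas described above.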