
8. Modeling

We have already seen that even in the case of classification a neural network returns values between zero and one, that is, a continuum of values. It is also possible, however, to train a neural network with real expectation values and to use the output values in their real magnitudes, just as one might normally calculate the value of a function from a series of variables. This task of using data about an object (compound, reaction, spectrum) and deriving other characteristics of the object from them has commonly come to be called "modeling", and is what we shall use the term to mean in the following sections. The neural network therefore assumes the task of using input variables (data) from an object to find the value of some dependent characteristic (or even several of them). A neural network offers the advantage of not requiring an explicit formulation of the relationship as a mathematical equation. It expresses this relationship implicitly in the connections between the individual neurons.

8.1. HPLC Analysis

A simple example shall serve to represent the potential of this technique in analytical chemistry.
In the HPLC analysis of Spanish wines, the dependence of the separation of the components (expressed as the selectivity factor SF) on the ethanol content (10, 20, or 30%) and the pH value (5.0, 5.5, or 5.6) of the liquid phase was determined. These nine experimental points were fitted to a quadratic equation with standard modeling techniques [57]. The result is given in Equation (r), where x1 = % ethanol and x2 = pH value. This functional relationship is also represented in Figure 40 in the form of lines of the same selectivity factor.

(r)

Fig. 40. HPLC analysis of Spanish wines: Shown is the dependence of the selectivity factor SF on the ethanol content x1, and on the pH value x2 of the liquid phase. The curved arrows highlight the maximum (left) and minimum SF (bottom).

The same nine experimental data points were used to train a neural network with two input units (one for the ethanol content and one for the pH value), six neurons in a hidden layer, and one output neuron (for the selectivity factor); training was carried out with the back-propagation algorithm [58].
Values for the ethanol content and the pH were fed into the network and the results entered into the diagram in Figure 41. Here too, just as in Figure 40, lines of the same selectivity factor were drawn in. A comparison between Figures 40 and 41 shows that both standard modeling and neural network techniques arrive at quite similar results. In particular, the positions for the minimum and the maximum values of the selectivity factor are very similar.

Fig. 41. The results obtained from a neural network (shown at the top) on the selectivity of the HPLC analysis of Spanish wines: dependence of SF on the ethanol content and on the pH value. See also Figure 40.

The advantage of the neural network is clear: in statistical modeling the mathematical form of the functional relationship (here a quadratic equation) had to be given explicitly. With the neural network this is not necessary; it finds the relationship by itself, implicitly, by assigning appropriate weights.
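The contrast between the two approaches can be illustrated with a minimal sketch. It assumes a 3 x 3 grid of coded factor levels and a synthetic response; these are hypothetical stand-ins, not the measured selectivity factors, and the names quadratic_design and train_network are invented for this illustration. The quadratic model has to be written out term by term, whereas the 2-6-1 network is only given the data. Bias terms are included here as a common convention; the original study's exact network settings are not known from the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def quadratic_design(X):
    """Explicit quadratic model: 1, x1, x2, x1^2, x2^2, x1*x2."""
    x1, x2 = X[:, 0], X[:, 1]
    return np.column_stack([np.ones(len(X)), x1, x2, x1**2, x2**2, x1 * x2])

def train_network(X, y, n_hidden=6, epochs=5000, lr=0.5, seed=0):
    """Full-batch back-propagation for a network with one hidden layer."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, n_hidden)
    b2 = 0.0
    losses = []
    for _ in range(epochs):
        # forward pass
        h = sigmoid(X @ W1 + b1)
        out = sigmoid(h @ W2 + b2)
        err = out - y
        losses.append(float(np.mean(err ** 2)))
        # backward pass: gradients of the mean squared error
        d_out = err * out * (1.0 - out) * (2.0 / len(y))
        d_h = np.outer(d_out, W2) * h * (1.0 - h)
        W2 -= lr * (h.T @ d_out)
        b2 -= lr * d_out.sum()
        W1 -= lr * (X.T @ d_h)
        b1 -= lr * d_h.sum(axis=0)

    def predict(X_new):
        return sigmoid(sigmoid(X_new @ W1 + b1) @ W2 + b2)

    return predict, losses

# Hypothetical 3 x 3 design in coded units (0 = lowest level, 1 = highest level)
levels = np.array([0.0, 0.5, 1.0])
X = np.array([[a, b] for a in levels for b in levels])

# Synthetic response scaled into (0, 1); NOT the data of the HPLC study
y = 0.3 + 0.4 * X[:, 0] - 0.2 * X[:, 1] + 0.3 * X[:, 0] * X[:, 1]

# Statistical modeling: the functional form is fixed in advance
coef, *_ = np.linalg.lstsq(quadratic_design(X), y, rcond=None)

# Neural network: only the data and the architecture are supplied
predict, losses = train_network(X, y)
```

Evaluating predict and the fitted quadratic on a finer grid of the two inputs would give the kind of contour comparison described for Figures 40 and 41.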

8.2. Quantitative Structure-Activity Relationships (QSAR)

The search for quantitative structure-activity relationships (QSARs) is one of the most important application areas for modeling techniques. A great deal of trouble and effort has been invested, especially in the prediction of pharmacological and biological data. It is therefore all the more surprising that as yet very few studies have been published which employ neural networks to provide quantitative relationships between structure and biological activity [59][60]. A typical study shall briefly be described here.
For this study a data set that had already been investigated with statistical modeling techniques (a multilinear regression analysis) was deliberately chosen in order to compare the performance of a neural network against a standard method from the field of QSAR. The data set comprised 39 para-quinones, some of them anticarcinogenic (Scheme 2).

Scheme 2. para-Benzoquinones containing two aziridine substituents, a class of compounds which includes some anticarcinogenic members. R1 and R2 are, for instance, CH3, C6H5, etc.

The influence of the substituents R1 and R2 was described by six physicochemical parameters: the contribution of R1 or of both substituents to the molar refractivity index (MR1 and MR1,2), their contribution to the hydrophobicity (π1 and π1,2), the substituent constant of the field effect (F), and that of the resonance effect (R). Accordingly, six input units were required to describe the influences of the substituents (Fig. 42).
The neural network was intended to yield the minimum effective dosage of the medication for a single injection. This minimum effective dosage is given by the amount of substance (as lg 1/c) that leads to a 40% increase in length of life. A single neuron was present in order to output the value of lg 1/c. A hidden layer of 12 neurons completed the network architecture (Fig. 42).

Fig. 42. Neural network for predicting the anticarcinogenic activity of para-benzoquinones.

Thirty-five benzoquinones were used to train the multilayered network with the back-propagation algorithm. The values for lg 1/c obtained from the network were compared with the results calculated from an equation determined by multilinear regression analysis. In 17 cases the results from the neural network were better, for 6 more or less as good, and for 12 worse. In other words, the results from the neural network are somewhat better overall. Nevertheless this problem can be solved quite well by a linear method that leaves little room for improvement by a neural network. In the case of QSAR problems containing nonlinear relationships more might be gained by using neural networks.
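To get a rough sense of how far a 17-to-12 split lies from an even outcome, one can apply a two-sided sign test to the counts quoted above. This is a sketch added for illustration and not an analysis reported in the study; it assumes that the six ties carry no information and that the compounds can be treated as independent cases. The function name sign_test_p_value is hypothetical.

```python
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided sign test: probability of a split at least this uneven
    if each non-tied case were equally likely to favor either method."""
    n = wins + losses
    k = max(wins, losses)
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2.0 * upper_tail)

# Counts from the text: network better in 17 cases, worse in 12 (6 ties dropped)
p = sign_test_p_value(17, 12)
print(f"two-sided sign test p-value: {p:.2f}")
```

A large p-value here would indicate that the count of wins alone does not separate the two methods clearly, which fits the remark that the linear method leaves little room for improvement.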

8.3. Chemical Reactivity

Whereas in Section 7.1 we were satisfied with a yes/no decision on chemical reactivity (does a bond break easily in a polar manner or not), here we wish to make a quantitative statement about the behavior of a chemical reaction.
The electrophilic aromatic substitution of monosubstituted benzene derivatives can lead in principle to three isomers: ortho, meta, and para products (Scheme 3).

Scheme 3. Isomeric distribution in electrophilic aromatic substitution.

The dependence of the isomer distribution on the nature of the substituents X is a classical problem in organic chemistry. Basically the substituents may be divided into two classes: electron-donating substituents (inductive or mesomeric), which prefer to direct into o- and p-positions, and mesomeric electron acceptors which steer towards the m-position. The factors that determine the o/p ratio are both steric and electrostatic in nature.
In one study [61] into the product ratios for the nitration of monosubstituted benzene derivatives, the amounts of ortho and para products were combined. Accordingly, one output neuron was used for the o + p content and a second for the percentage of m-isomer. As already mentioned, the distribution of product is determined by the nature of the substituent, especially by the electronic effects which it produces. In order to represent this, two codings of the input information were tested. In the first attempt the partial atomic charges on the six carbon atoms of the benzene ring as they are calculated in the semi-empirical quantum mechanical program suite MOPAC [62] according to Mulliken population analysis were used as the six inputs. In addition an intermediate layer with 10 hidden neurons brought the total number of weights to be determined to 6 x 10 + 10 x 2 = 80.
As an alternative method the structure of the substituent was represented directly in the form of a connection table. This had dimensions of 5 x 5; each row contained first the atomic number of the atom in question then the index of the atom itself followed by the index of the adjacent atom further away from the ring, the order of this bond, and the formal charge on the atom. For each atom of the substituent except a hydrogen atom a new row was used. The first row reflects the situation at the atom which is directly bonded to the ring. With each successive row a further step is taken into the substituent (cf. Fig. 43). If the substituent had fewer than five non-hydrogen atoms then the rest of the 25 entries were filled with zeros. If it had more than five heavier atoms, those atoms which were further away from the point of attachment of the substituent on the ring (atom 1) were left out. Figure 43 explains this coding by connection table for the example of acetanilide. For this coding 5 x 5 = 25 input units were required, the intermediate layer had five neurons, and the network thus 25 x 5 + 5 x 2 = 135 weights.

 

                 relevant bond:
atomic number    reference atom    adjacent atom    bond order    charge
7                2                 1                1             0
6                3                 2                1             0
8                4                 3                2             0
6                5                 3                1             0
0                0                 0                0             0

Fig. 43. Example illustrating the representation of monosubstituted benzene derivatives by a connection table of the substituent.
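The coding described above can be sketched in a few lines. The rows for acetanilide are taken from Figure 43; the function names encode_substituent and count_weights are hypothetical, and the sketch assumes that the rows are already listed in order of increasing distance from the ring, so that truncation drops the most remote atoms. Weights are counted without bias terms, as in the text.

```python
import numpy as np

def encode_substituent(rows, max_atoms=5):
    """Flatten a substituent connection table into a fixed-length input vector.

    Each row is (atomic number, reference atom, adjacent atom, bond order, charge).
    Missing rows are zero-filled; rows beyond max_atoms are discarded.
    """
    table = np.zeros((max_atoms, 5))
    for i, row in enumerate(rows[:max_atoms]):
        table[i] = row
    return table.ravel()

def count_weights(layer_sizes):
    """Number of connections between consecutive layers (no bias terms)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

# Substituent of acetanilide (NH-CO-CH3), rows as in Figure 43
acetanilide = [
    (7, 2, 1, 1, 0),  # N, attached to ring atom 1
    (6, 3, 2, 1, 0),  # carbonyl C
    (8, 4, 3, 2, 0),  # O, double bond
    (6, 5, 3, 1, 0),  # methyl C
]

x = encode_substituent(acetanilide)        # 25 input values, last row all zeros
n_table = count_weights([25, 5, 2])        # connection-table network
n_charge = count_weights([6, 10, 2])       # partial-charge network
print(x.reshape(5, 5))
print(n_table, n_charge)
```

The two counts reproduce the 135 and 80 weights quoted for the connection-table and charge-based networks, respectively.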

The network was trained with 32 monosubstituted benzene derivatives and the back-propagation algorithm; it was then tested with 13 further benzene derivatives. In this example the enormous number of 100000 epochs or training cycles was necessary before the error in the training data set had been reduced to a sufficient degree.
Of the two forms of coding with their networks, the second version - that is, the input of the substituent by means of a connection table - clearly produced the better results. Only the percentages of the meta products (Table 2) are necessary to judge the quality of the results.

Table 2. Results of attempts at predicting the amount of meta-product in the nitration of monosubstituted benzene derivatives. The size of the error in the prediction is quoted [%].

Method                               Training data (32 compounds)    Test data (13 compounds)
network based on charge              5.2                             19.8
network based on connection table    0.3                             12.1
CAMEO                                18.0                            22.6
estimate by chemists                 -                               14.7

With the connection-table coding, the training data set could be learned down to an average error of 0.3% for the m-isomer content. For the network trained on the charge values the average error was 5.2%. With the test data of 13 compounds that the network had not seen before, the connection-table coding gave an average error of 12.1% in predicting the percentage of m-product. For the atomic charge representation the error was noticeably higher at 19.8%.
The results from these two neural networks were compared to values produced by CAMEO [63], an expert system for predicting reactions. They were in every respect better than those produced by CAMEO. Finally the 13 monosubstituted benzene derivatives were given to three organic chemists to predict the expected percentage of m-product from the nitration. The values they gave were averaged; this gave an error of 14.7%. The chemists were thus better than the neural network with charge-coding and better than CAMEO, but were beaten by the neural network with the connection-table input!
Though the results for predicting the product ratios from the nitration of monosubstituted benzene derivatives with a neural network based on the coding of the substituents by a 5 x 5 connection table are very encouraging, they do deserve more detailed comment. First, it is not surprising that the coding by the partial charges on the six ring atoms was rather unconvincing. Apart from the known inadequacies of Mulliken population analysis, the ground-state charge distribution is only one of the factors (not even the most important electronic effect) that influence the product ratios in electrophilic aromatic substitution. For these reasons the charge values cannot represent the benzene derivatives sufficiently well to explain the product ratios from nitration.
As satisfactory as its results may be, the coding of the benzene derivatives by a 5 x 5 connection table cannot provide an explanation for the effects responsible for regioselectivity, nor for the influence of the reaction conditions. Nor can it explain the product ratios from di- and polysubstituted benzene derivatives. To do this one must choose a different representation for the benzene compounds, which is indeed feasible [64]. The substituent should not be described globally. Rather its influence on the individual positions on the aromatic ring should be represented directly: for each position on the ring a value for the resonance effect, the local electrostatic potential, and the steric effect should be given. Because this represents the influence of a substituent at every single position on the ring, we can extend inferences from monosubstituted benzene derivatives to di- and polysubstituted compounds. Then we can make predictions about isomer distributions in further substitution. In addition, it is also possible to take account of the influence of the medium by providing a further input unit for the concentration of sulfuric acid. The network is trained with this alongside the descriptors for the influences of the substituents [64].

8.4. Summary

Modeling tasks are very common in many branches of chemistry. This opens up a wide area for the application of neural networks. At present this area is dominated almost exclusively by multilayered networks and the back-propagation algorithm. This need not be the case. Other neural network architectures, especially the counterpropagation algorithm [27], can certainly be used just as well for creating models.
The deployment of neural networks for modeling (that is, for predicting some characteristic of an object from a series of other parameters or measurements from that object) should always be considered carefully in relation to the use of statistical methods. If one has a relatively clear idea which variables influence the feature sought, and if there is some largely linear relationship between them, then traditional methods such as multilinear regression analysis clearly offer advantages: these methods require less computing time, they give measures for the quality of the discovered relationship, and, above all, the equation derived from the statistical modeling allows a clear interpretation of the individual effects on the feature sought.
Neural networks should be used, however, if it is presumed that nonlinear relationships exist between the dependent and the independent variables, and if it is not possible to specify exactly which parameters influence the characteristic under investigation.
Whether the choice falls on statistical methods or neural networks, the success of a study depends crucially on the selection of the data set, on the representation of the information, and on the methods employed to validate the results. Moreover, when implementing neural networks, the following points are of crucial importance:

- the selection of a homogenous data set for training (e.g., by using experimental design techniques or a Kohonen network - see Section 10.1)
- the partitioning of the data set into training and test data sets
- the selection of suitable parameters as input data to describe the objects.
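The second point, partitioning into training and test sets, can be sketched as follows. The sizes match the nitration study of Section 8.3 (32 training and 13 test compounds), but the purely random assignment and the function name split_indices are assumptions of this illustration; the text does not say how the compounds were actually assigned, and a designed selection may well be preferable.

```python
import numpy as np

def split_indices(n_objects, n_train, seed=0):
    """Randomly partition object indices into disjoint training and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_objects)
    return np.sort(order[:n_train]), np.sort(order[n_train:])

# 45 compounds in total: 32 for training, 13 held back for testing
train_idx, test_idx = split_indices(45, 32)
print(len(train_idx), len(test_idx))
```

Only the training indices are used to adjust the weights; the error on the held-back objects is then the figure to report when judging predictive quality.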



Johann.Gasteiger@chemie.uni-erlangen.de