
7. Classification

Classification problems are one of the most common areas of application for neural networks. An object is first characterized by various measurements and on that basis assigned to a specific category. Alternatively it may be found not to have a particular characteristic or not to belong to the class in question. Thus the output data are binary in nature: a feature is either present or it is not; the object either belongs to a certain class or it does not. The input data that characterize the object can be binary in nature but can also be real values (measured data). Classification is a traditional domain for the use of statistical and pattern-recognition methods. Neural networks have the advantage that they can still be applied in cases where the relationships between the object data and the classes to which they are to be assigned are highly complex. Neural networks are suitable even for relationships that cannot or can only barely be expressed in terms of explicit equations. In the following sections we show examples from several branches of chemistry where either a single condition is to be diagnosed (Section 7.1) or a single category is to be chosen from a series of others (Section 7.5), or alternatively where an object is to be assigned to several classes simultaneously (Sections 7.2-7.4).

7.1. Chemical Reactivity

The chemist derives his knowledge about the reactivity of bonds and functional groups from a great many observations of individual reactions. How can this process be transferred to a neural network? Let us look at the problem of polar bond breaking (Scheme 1), the first step in many organic reactions [29].

Scheme 1. Polar bond cleavage.

The reactivity is characterized very roughly in this case: namely, whether a bond is easy or hard to break heterolytically. A single output neuron therefore suffices; it is set to one if the bond is easily broken and to zero if the bond is difficult to break. Polar bond breaking must still be characterized by parameters that influence the process. For this purpose a number of energetic and electronic effects were used: the bond dissociation energy (BDE), the difference in total charge Δq_tot, the difference in π-charge Δq_π, the difference in σ-electronegativity Δχ_σ, the σ-polarity Q_σ, the bond polarizability α_b, and the degree of resonance stabilization R± of the charges that arise from polar bond breaking. Values for these quantities were calculated by empirical methods [30]-[34]. These seven parameters require seven input units, into which the (real) values of the individual quantities are fed. A hidden layer of three neurons completes the architecture for this study (Fig. 30).

Fig. 30. Architecture and input parameters for a neural network to predict the breaking of bonds. For details see text.
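To make the architecture concrete, the following is a minimal sketch (not the code of the original study) of such a 7-3-1 feed-forward network in Python; the weight initialization and the descriptor values in the example call are invented purely for illustration.

```python
# Minimal sketch of a 7-3-1 feed-forward network for bond-break classification:
# seven real-valued bond descriptors in, one sigmoid output interpreted as the
# estimated probability that the bond breaks easily. Illustrative only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
W_hidden = rng.normal(scale=0.1, size=(3, 7))   # 7 inputs -> 3 hidden neurons
b_hidden = np.zeros(3)
W_out = rng.normal(scale=0.1, size=(1, 3))      # 3 hidden -> 1 output neuron
b_out = np.zeros(1)

def predict(bond_descriptors):
    """bond_descriptors: the seven parameters (BDE, Δq_tot, Δq_π, Δχ_σ, Q_σ,
    α_b, R±), scaled to comparable ranges before being fed into the net."""
    h = sigmoid(W_hidden @ bond_descriptors + b_hidden)
    return sigmoid(W_out @ h + b_out)[0]        # near 1: easy to break; near 0: difficult

# Example call with made-up descriptor values:
print(predict(np.array([0.3, 0.8, 0.1, 0.5, 0.4, 0.6, 0.7])))
```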

A data set was created that consisted of 29 aliphatic compounds containing 385 bonds. As each bond may be broken heterolytically in two directions (see Scheme 1), there was a total of 770 possible bond breaks. From these, 149 heterolyses of single bonds were chosen; 64 of them were used to train the net by the back-propagation algorithm, and the remaining 85 bond breaks were used to test the net (the division into training and test data sets is explained in Section 10.1). Figure 31 shows a selection of molecules from the data set, in which the bonds classified as breakable or unbreakable have been marked.

Fig. 31. Selection of structures from the training set indicating which bonds are easy (arrow) or difficult (arrow crossed through) to break heterolytically. The direction of the arrows indicates to which atom the electron pair is shifted in bond cleavage (that is, which atom receives the negative charge).

After 1300 cycles (epochs) the net had correctly learned all 64 bond breaks from the training data set. Then the bond breaks from the test data set were passed through the already trained neural net. These 85 bond breaks, about which the net had hitherto received no information, were also classified correctly. The division of the bonds into those easily and those not easily broken was predicted by the net exactly the same way as the chemist had determined them. The net had therefore learned the relationship between electronic and energetic parameters and polar bond-breaking.
The net was then ready to be applied to compounds that were contained neither in the training nor the test data sets. It even made correct predictions about bond types which contained atoms that were not used at all in training. In Figure 32 the predicted reactive bonds of a structure which was not used to train the net are shown.

Fig. 32. Bond cleavages predicted by the neural network for a structure that was not used in training. The direction of the arrows indicates the shift of the electron pairs; the values give the predicted probability of heterolysis.

The dissociation of a bromide ion and of a thiol group, both from allylic positions, was found to be especially favorable, as was the removal of a proton from the central allylic position and from the thiol group. The allylic positions at the ends of the system were estimated to be less acidic, whereas the position at which the bromine atom can still act in an inductively stabilizing way is assigned a higher acidity. Thus all results correspond closely to chemical experience.
It is remarkable that the reactivity of the SH group was correctly judged, even though the training data set did not contain a single structure with a sulfur atom. This is possible because the electronic and energetic parameters used as inputs express the influence of an atom in general form through the empirical methods used to calculate them. Thus even a type of atom that does not occur in the current training set can be taken into consideration, provided that it is covered by the calculation procedures.
This neural network is able to predict which bonds can easily be broken in a polar manner for a wide range of aliphatic structures.

7.2. Process Control

For many chemical processes the relationship between process data and control parameters can be represented, if at all, only by nonlinear equations. It is therefore very hard to model these processes and to predict their behavior. It comes as no surprise, then, that neural networks are being used intensively for tasks in process control [35]-[41]. These tasks include not only the classification of certain events (yes/no decisions), but also the modeling of control parameters (prediction of a real value).
An example in which the task is to choose between membership of different classes will serve to illustrate the possible applications [36]. The goal was to derive individual failure modes from six different sensor values for a continuously stirred reaction vessel in which an exothermic reaction was in progress.
The following measurements were taken (cf. Fig. 33): 1) the outlet concentration of the educt Ce, 2) the reactor temperature Tr, 3) the volume of the reactor contents Vr, 4) the outlet flow-rate FRp, 5) the temperature of the cooling water Tc, and 6) the flow-rate of the cooling water FRc. These six parameters were to be used as "symptoms" to diagnose several failure modes of the reactor. A malfunction can be caused by an incorrect inlet concentration of the educt Ce0, inlet temperature Te, or inlet flow-rate FRe. If any of these quantities deviates by more than 5 percent from its normal value, a failure has occurred in the reactor.

Fig. 33. Diagram of a reactor tank showing the six measured values (Ce, Tr, Vr, FRp, Tc, FRc) and the state variables that give rise to a failure mode (Ce0, Te, FRe).

Each one of the failure modes affects almost all the symptoms, that is, all six measured values, so that a specific failure mode cannot be diagnosed directly from a single measurement. Moreover, the failure modes may occur not only singly but also simultaneously, and the effects of simultaneous failure modes may compensate each other in the sensor values or, conversely, amplify each other synergistically.
Which network architecture was chosen in this case? Because six real measured values are to be input, six input units are required. Likewise, six output neurons were chosen: for each of the three critical inlet parameters (Ce0, Te, and FRe), one neuron for a deviation above the normal value and one for a deviation below it. Thus if any of the six failure modes occurs, the corresponding output neuron should be activated. Five neurons in a hidden layer complete this multilayered net (Fig. 34).

Fig. 34. Neural network for diagnosing failure modes of the chemical reactor depicted in Figure 33.

Accordingly, 6 x 5 + 5 x 6 = 60 weights had to be determined for this neural network. Twelve individual failure modes were deliberately produced, and the network was trained with the back-propagation algorithm on the sensor data measured for them. The network so trained was able to identify sensor data from the reactor running normally (data that were not used in the training process) as undisturbed behavior. Furthermore, when four multiple failure modes were produced, the neural net was likewise able to derive them correctly from the sensor data taken at the time.
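As a hedged illustration of how the six output activations might be turned into a diagnosis (the failure-mode labels, the threshold, and the example values are assumptions for illustration, not taken from [36]):

```python
# Sketch: interpret the six sigmoid outputs of the 6-5-6 diagnosis network as
# independent failure-mode indicators; several may be active simultaneously.
import numpy as np

FAILURE_MODES = ["Ce0 high", "Ce0 low", "Te high", "Te low", "FRe high", "FRe low"]

def diagnose(outputs, threshold=0.5):
    """outputs: the six activation values of the output layer (between 0 and 1)."""
    return [mode for mode, y in zip(FAILURE_MODES, outputs) if y > threshold]

# Example with invented output values: two failure modes reported at once.
print(diagnose(np.array([0.93, 0.04, 0.10, 0.71, 0.08, 0.02])))
# -> ['Ce0 high', 'Te low']
```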
Neural networks will certainly become very important in the area of process control. It should also be possible to take a neural network that has been trained on a specific process, hard-code this onto a chip, and build this chip into the control process.

7.3. The Relationship between Structure and Infrared Spectra

In the examples from Sections 7.1 and 7.2, rather simple neural networks with relatively few weights were used. For the following application a considerably larger network with almost 10000 weights was developed.
Modern structure elucidation is based on spectroscopic methods. However, as the relationships between the structure of an organic compound and its spectroscopic data are too complex to be captured by simple equations, there are a multitude of empirical rules. The enormous amount of spectroscopic data thus available fulfills an important precondition for the training of neural networks. The first steps to store the relationships between structure and spectroscopic data in neural networks have already been taken. Nevertheless, as we shall see, there is still much development to be done in this area.
Munk et al. [42] investigated to what extent a neural network could draw conclusions about the substructures contained in a particular compound from its infrared spectrum. The range of an infrared spectrum between 400 and 3960 cm⁻¹ was divided into 256 intervals, and each interval was assigned to an input unit. If an absorption band fell within such an interval, its intensity was fed into the corresponding input unit. The neural network had 36 output neurons, each of which was responsible for one of 36 different functional units (primary alcohol, phenol, tertiary amine, ester, etc.). If a functional unit was present in the compound under investigation, the corresponding neuron received the value one, otherwise the value zero. Furthermore, an intermediate layer of 34 hidden neurons was used, so that 256 x 34 + 34 x 36 = 9928 weights had to be determined for this multilayered network. The basic procedure and the architecture of the neural network are sketched in Figure 35.

Fig. 35. Neural network to learn the relationships between the infrared spectrum of a compound and the substructures present in it.
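The spectral coding described above can be sketched as follows; the function name and the choice of keeping the strongest band per interval are assumptions made only for illustration.

```python
# Sketch of the input encoding: the region 400-3960 cm^-1 is divided into 256
# equal intervals, and each interval holds the intensity of any band in it.
import numpy as np

LOW, HIGH, N_BINS = 400.0, 3960.0, 256
BIN_WIDTH = (HIGH - LOW) / N_BINS              # ~13.9 cm^-1 per interval

def encode_ir_spectrum(bands):
    """bands: list of (wavenumber in cm^-1, relative intensity) tuples."""
    x = np.zeros(N_BINS)
    for wavenumber, intensity in bands:
        if LOW <= wavenumber < HIGH:
            i = int((wavenumber - LOW) // BIN_WIDTH)
            x[i] = max(x[i], intensity)        # keep the strongest band per interval
    return x                                   # 256-element input vector

x = encode_ir_spectrum([(1710.0, 0.9), (2950.0, 0.4), (3350.0, 0.7)])
print(x.shape)   # (256,)
```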

To determine the weights of the neural network, 2499 infrared spectra together with their structures, broken down into their functional units, were learned with the back-propagation algorithm. Then 416 spectra were used to test the predictive ability of the net. A single cycle through all spectra required 10 min of CPU time on a VAX 3500; for a training session with many cycles (called epochs; typically 100 were necessary) a Cray supercomputer was used.
For each functional group the quality of the results was measured by a number called the A50 value. This value represents the precision at 50% information return, that is, the precision with which a functional group can be ascertained when the threshold is set to the mean of the distribution curve of the output values.
Figure 36 shows a typical result, in this case for primary alcohols. The threshold here lay at an output value of 0.86. At this value, 132 of the 265 primary alcohols contained in the training set are correctly identified, but 34 other compounds are incorrectly classified as primary alcohols as well. The A50 value for this group is therefore 132/(132 + 34) = 79.5%. This value was still deemed good. Thirty of the 36 functional groups could be predicted with similar or better precision.

Fig. 36. Percent distribution of the output values Y of the neural network for primary alcohols. The solid line represents compounds which are primary alcohols, the shaded line all other compounds. The mean of the output values for primary alcohols is 0.86.
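A minimal sketch of how an A50-type figure could be computed is given below; it uses the median of the positive-class outputs as the 50%-recall threshold, whereas the study itself used the mean of the distribution, so this is only an approximation of the published procedure.

```python
# Sketch: precision at ~50% information return for one functional group.
import numpy as np

def a50(outputs_pos, outputs_neg):
    """outputs_pos: network outputs for compounds containing the group;
    outputs_neg: outputs for all other compounds."""
    threshold = np.median(outputs_pos)          # recovers about half of the true positives
    tp = np.sum(outputs_pos >= threshold)       # correctly recognized compounds
    fp = np.sum(outputs_neg >= threshold)       # false assignments
    return 100.0 * tp / (tp + fp)

# With 132 true and 34 false assignments this corresponds to 132/(132+34) = 79.5%.
```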

The results from this network, which contained a hidden layer, were compared with the classification capabilities of a network without a hidden layer [43]. The hidden layer brought about a considerable improvement in the results.
This experiment has, of course, not solved the problem of unraveling the relationships between infrared spectrum and structure. For the most part, the work concentrated on a few important functional groups, and the molecular skeleton was ignored. Even for those groups the predictions were only of moderate quality; recognizing 132 out of 265 primary alcohols along with 34 false assignments is disappointing. If the threshold is set even higher, one can determine with a high degree of certainty whether a functional group is present or absent, but a large range of compounds remains for which no reliable prediction can be made. In an automatic structure-elucidation system, however, predictions of this kind allow a considerable reduction of the search space. Herein lies the value of these results.
This is not the last word on the relationship between structure and infrared data. Further experiments should attempt to classify vibrations of the skeleton. This would, however, necessitate a change in the way structures are coded.

7.4. The Relationship between Structure and Mass Spectra

The relationships between mass spectra and structure are even more complex than those between infrared spectra and structure. Nevertheless, this problem also has already been approached with neural networks.
In this case, too, a multilayered network containing a hidden layer was trained with the back-propagation algorithm [44]. The mass spectra were described by 493 features; these included the logarithms of the intensities of the peaks between m/z 40 and 219, the logarithms of the neutral losses between m/z 0 and 179, autocorrelation sums, modulo-14 values, etc. The values of these 493 spectral characteristics were fed into the same number of input units.
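A rough sketch of some of these features follows; the exact definitions in [44] (scaling, the autocorrelation sums, and the remaining features) differ in detail, so the code only illustrates the kind of encoding involved.

```python
# Sketch of a subset of mass-spectral features: log peak intensities, logs of
# neutral losses, and modulo-14 intensity sums. Illustrative only.
import math
from collections import defaultdict

def ms_features(peaks, parent_mass):
    """peaks: dict mapping integer m/z to intensity; parent_mass: molecular-ion m/z."""
    f = []
    # log intensities of the peaks at m/z 40..219
    f += [math.log1p(peaks.get(mz, 0.0)) for mz in range(40, 220)]
    # log intensities of neutral losses 0..179 (peak at parent_mass - loss)
    f += [math.log1p(peaks.get(parent_mass - loss, 0.0)) for loss in range(0, 180)]
    # modulo-14 sums: total intensity in each residue class of m/z mod 14
    mod14 = defaultdict(float)
    for mz, inten in peaks.items():
        mod14[mz % 14] += inten
    f += [mod14[r] for r in range(14)]
    return f   # 180 + 180 + 14 = 374 of the 493 features are sketched here
```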
Here, too, the structure of an organic compound was characterized by 36 substructures, which, however, differed in part from those used in the study on infrared spectra. Thirty-six output neurons were needed for this. As the number of neurons in the hidden layer was 80, 493 x 80 + 80 x 36 = 42320 weights had to be determined.
Correspondingly larger data sets were investigated: 31926 mass spectra for training and 12671 for testing. With such large data sets and a network with so many weights, the learning process (again with the back-propagation algorithm) required a great deal of time. One epoch (that is, one pass of all 32000 spectra through the net) took 6 h on an HP 9000/370 or SUN-4 workstation. Typically 50 epochs were needed; thus training alone required two weeks of computing time on a high-performance workstation.
The results of the classification from the fully trained neural network, MSnet, were compared with results from STIRS [45]. STIRS, from the group led by McLafferty, is a powerful expert system for determining the presence of functional groups from mass spectra.
The classification results from MSnet were somewhat better than those from STIRS. MSnet offers a few additional advantages, however: 1) A probability value can be given for the assignment of a compound to a particular class. 2) Not only the presence but also the absence of a functional group can be diagnosed. 3) The computing time required for queries to MSnet is two orders of magnitude lower than for STIRS.
The following point must be emphasized: Even though the training of a neural network may require a great deal of CPU time, a fully trained neural network can make predictions in minimal time.
In order to satisfy as general a requirement as representing or learning the relationship between molecular structure and spectroscopic data for the entire domain of organic chemistry, we must confront a fundamental problem: the statistical distribution of the data. For example, the data set of 32000 compounds contains 33 phthalic acid esters, which give a very characteristic peak at m/z 149. However, most of the spectra that have a peak at m/z 149 happen not to be phthalic acid esters, because there are only very few of them in the total set. As a consequence, phthalic acid esters are not recognized.
In this paper [44] an interesting attempt is made to overcome this general problem: a hierarchy of neural networks is proposed (Fig. 37). A preliminary network first undertakes the partitioning according to the most important functional groups, while specialized neural networks carry out further refinements of the compound classes. Thus a special network was developed which divided compounds containing the O-C=O group into 22 subclasses (saturated esters, aromatic esters, lactones, anhydrides etc.).

Fig. 37. Hierarchy of neural networks for deriving substructures from mass spectra.
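In code, such a hierarchy might be organized roughly as follows; the predict() interface, the class names, and the threshold are assumptions used only to illustrate the dispatching idea.

```python
# Sketch of a two-level hierarchy: a general network proposes broad substructure
# classes, and specialized networks refine only those classes they cover.
def hierarchical_classify(spectrum, general_net, specialist_nets, threshold=0.5):
    """general_net.predict(spectrum) returns {class_name: probability};
    specialist_nets maps a class name to a refining network with the same interface."""
    result = {}
    for cls, p in general_net.predict(spectrum).items():
        if p < threshold:
            continue                                  # class not indicated
        if cls in specialist_nets:                    # e.g. "O-C=O" -> 22 subclasses
            result[cls] = specialist_nets[cls].predict(spectrum)
        else:
            result[cls] = p
    return result
```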

This idea of using a hierarchy of neural networks could prove very useful in other problem areas.

7.5. Secondary Structure of Proteins

In the example from the previous section we were still dealing with a rather simple network architecture. We now move on to describe cases with a rather extensive coding of the input data. The neural network involved is correspondingly complex, with a large number of weights.
To gain a deeper insight into the physiological characteristics of proteins one must know their secondary structure. For this reason there has been no scarcity of attempts to derive the secondary structure of proteins from their primary structure (that is, from the sequence of amino acids). Chou and Fasman [46] introduced a method, still much used today, for deciding from the sequence of amino acids which secondary structure the individual parts of a protein will assume. This procedure is able to predict with 50-53% accuracy whether an individual amino acid in a protein takes part in an α-helix, a β-sheet, or an irregular coiled structure [47]. Over the past few years a series of papers have appeared in quick succession [48]-[56] on applications of neural networks to predict the secondary or even tertiary structure of particular sections of a protein from the sequence of amino acids. Our intention is to demonstrate the general procedure using the work of Qian and Sejnowski [48] as an example. Most of the other investigations [49]-[56] chose a very similar strategy.
Both the Chou and Fasman method [46] and the application of neural networks are based on the assumption that the amino acid (AA) itself and its immediate surroundings (that is, the amino acids immediately before and after it in the sequence) decide which secondary structure it will adopt.
In order to take the dependence of the secondary structure on the sequence into account, the amino acid under examination is fed into the net together with the six amino acids preceding it and the six following it in the sequence. Thus from the amino acid sequence a "window" of 13 amino acids is extracted each time. This window is moved along the entire amino acid sequence in steps of one AA, so that each individual AA in turn finds itself at the center.

Fig. 38. Section ("window") of 13 amino acids from a protein sequence. The window helps to determine the secondary structure in which the amino acid in question (in this case valine) finds itself.

How are the individual amino acids coded? For each AA in the window of 13 AAs a bit vector of length 21 is used. For each of the 20 naturally occurring amino acids a particular position in the bit vector is reserved (e.g., the 14th bit is set to 1 when the AA is proline). A final bit is required to mark that the window extends beyond the beginning or end of the protein and thus that there are no more AAs at one of its ends. Thirteen window positions with 21 bits each give 13 x 21 = 273 input units altogether; each unit feeds only one bit, a zero or a one, into the network.
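A sketch of this coding scheme is shown below; the ordering of the amino-acid alphabet and the example sequence are arbitrary assumptions made for illustration.

```python
# Sketch: encode a window of 13 residues, each as a 21-bit vector
# (20 amino acids plus one "no residue" bit), giving 273 input values.
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"          # the 20 natural amino acids (one-letter codes)
WINDOW = 13
HALF = WINDOW // 2                            # 6 residues on each side of the center

def encode_window(sequence, center):
    bits = np.zeros((WINDOW, 21), dtype=np.uint8)
    for k in range(WINDOW):
        pos = center - HALF + k
        if 0 <= pos < len(sequence):
            bits[k, AMINO_ACIDS.index(sequence[pos])] = 1
        else:
            bits[k, 20] = 1                   # window extends past the chain end
    return bits.reshape(-1)                   # 13 * 21 = 273 input values

x = encode_window("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ", center=10)
print(x.shape)   # (273,)
```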
The network had three output neurons: one for the presence of an α-helix, one for a β-sheet, and one for a coiled structure. After experiments with between 0 and 80 hidden neurons, a hidden layer of 40 neurons was chosen as optimal, which meant that 273 x 40 + 40 x 3 = 11040 weights had to be determined. Again the back-propagation learning algorithm was used. The overall architecture is pictured in Figure 39.

Fig. 39. Neural network for deriving the secondary structure of a protein from the sequence of amino acids.

The network was trained with 106 proteins containing a total of 18105 amino acids. The efficacy of the net was tested with 15 additional proteins, which comprised 3250 AAs in total. A prediction accuracy of 62.7 % was achieved.
Thus one could say with 62.7% certainty whether an amino acid was part of an α-helix, a β-sheet, or a coiled structure. This is a noticeable improvement over traditional methods for predicting secondary structure but still leaves something to be desired. It is therefore understandable that this area is being researched so actively [48]-[56].

7.6. Summary

These applications from very different branches of chemistry serve to underline the broad range of possibilities which neural networks offer for classification. In every one of the examples portrayed, a multilayered network was used that was trained by the back-propagation learning process.
The number of neurons in the hidden layer is usually determined by a systematic series of experiments. With too few neurons a problem cannot be learned correctly; the larger the number of neurons, the smaller the error in learning, but the longer the training times. Too many neurons and too long a training period can lead to another problem: overtraining. This means that a neural net has been trained for so long that, although it can reproduce the training set without error, it gives poor predictions when faced with new data. Multilayered networks usually have many weights, and therefore many degrees of freedom in adapting to a data set. This presents the danger of falling into an erroneous local minimum during training, which reduces the ability to make predictions for new data.
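A common safeguard against overtraining is to monitor the error on data held out from training and to stop when it no longer improves; a minimal sketch of this idea follows, in which the step and evaluate callables stand in for whatever training code is actually used (they are assumptions, not part of any of the cited studies).

```python
# Sketch of early stopping: stop training once the validation error stops improving.
def train_with_early_stopping(step, evaluate, max_epochs=1000, patience=20):
    """step(): run one training epoch; evaluate(): return the current validation error."""
    best_error, best_epoch = float("inf"), 0
    for epoch in range(max_epochs):
        step()
        err = evaluate()
        if err < best_error:                    # validation error still improving
            best_error, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:    # no improvement for `patience` epochs
            break
    return best_epoch, best_error

# Toy usage with a synthetic error curve that starts to rise again after epoch 30:
errors = iter([1.0 / (e + 1) + max(0, e - 30) * 0.01 for e in range(1000)])
print(train_with_early_stopping(step=lambda: None, evaluate=lambda: next(errors)))
```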
The complexity of the neural net (mainly the number of weights and the number of training data used) can bring with it a large increase in the training time. One should not be discouraged too much by long training times, however, because ideally a network needs to be trained only once. Once it has finished learning, predictions on new data can be made very rapidly, because these data only have to be passed through the fully trained net once.
In the case of classification problems it is desirable to obtain values of one or zero in the output neurons, so that an object either belongs unambiguously to a category or does not. In reality, values between zero and one are obtained, and the decision whether an object belongs to a category is made on the basis of threshold values (e.g., above 0.8 or below 0.2). These numeric output values can also be interpreted as probabilities and as such may be used in a decision-support system.
The numerical output clearly points to the transition to modeling problems, which are the subject of the next section. In modeling tasks it is desirable to obtain function values (i.e., real data). These can be produced by transforming the values between zero and one through mathematical functions.
In summary, the most important task when implementing a neural network is to find a suitable representation for the input and output data. The hierarchical ordering of multiple networks (Section 7.4) and the movable window from Section 7.5 demonstrate that the technology sets no limits to the imagination.
