
Substructure Searches

What is a Substructure Search?

Substructure searches provide an additional method to search for suitable starting materials in a catalog of chemicals. The focus of a substructure search is the design of the synthesis of a combinatorial library for a lead structure (see Tutorial B), whereas similarity searches play an important role during the design of a synthesis for a single target structure (see Tutorial A). Thus, both methods complement each other depending on the application of WODCA.
A substructure search in a catalog of chemicals is useful if someone is interested in all available starting materials containing a certain structure fragment (the substructure). In contrast to similarity searches the user has to define this substructure by himself/herself. The definition of a substructure query allows the specification of open sites or atom lists for a certain position in a chemical structure. While a substructure search always analyses the entire molecular structure of a compound from the catalog of chemicals, a similarity search considers only the largest fragment which results from the application of a similarity criterion to the compound from the catalog of chemicals.
Substructure searches are implemented in the WODCA system as an external tool (CACTVS Substructure Search).

Essentials of a Substructure Search

After a representative of a lead structure of a combinatorial library has been disconnected to give a set of precursors during the retrosynthetic analysis with WODCA, a substructure search can provide the structural variation for each of these precursors. Thus, substructure searches are useful to find a series of representatives of certain classes of compounds in a catalog of available starting materials that can act as precursors for the synthesis of an entire library of compounds.
The following list describes some typical features of a substructure search in general terms:

How to Define a Substructure

In the following, the various features provided for the definition of a substructure query are explained.

Open Sites

An open site allows any type of atom to be attached to a position of the query structure. For example, if a carbon atom within a substructure query carries three substituents as well as one open site, the fourth substituent of the carbon atom can be any element (including hydrogen). Further examples are given in Figure 1.

Figure 1: Definition of various substructure queries with open sites. On the right-hand side some examples for hits and non-hits are given

Atom Lists

It is possible to define a list of atom types instead of a single atom for a certain position of the substructure query. Such atom lists define a set of atoms which are allowed (positive atom list) or which are forbidden (negative atom list) for a certain position of the substructure during the substructure search. Furthermore, it is also possible to allow any atom type at a certain position of a substructure (Figure 2).

Figure 2: Definition of various substructure queries with atom lists. On the right-hand side some examples for hits and non-hits are given

Specifications for Atoms

Different specifications can be defined for each atom of a substructure query. If none of these specifications is set, the default settings are used, in which none of the specifications is considered during the substructure search. The following list shows some examples of settings for atom specifications:

Specifications for Bonds

In order to specify bonds during the definition of a substructure query the following settings are provided:
If none of these specifications is set, the default settings are used in which each specification is not considered during the substructure search.

Bond Order

In many cases, the definition of open sites on atoms linked by a single bond influences the bond order of this bond as well, since two open sites on adjacent atoms can be combined to form a double bond. The treatment of bond orders during a substructure search can be controlled in the CACTVS Substructure Search tool. Depending on the setting of the switch Bond Order in the main window of the CACTVS Substructure Search tool, three cases have to be distinguished. In case A, the switch Bond Order is activated (default setting), which means that bond orders in the substructure query are considered; in cases B and C the switch is disabled.

Figure 3: Influence of bond orders and open sites on a substructure search. Bond orders are considered.

Case A: In Figure 3, the atoms directly connected to the bond marked in bold are all defined. In other words, there are no open sites at the atoms which form the bond. In this case, the exact bond order is considered during the superimposition of the substructure query and the compounds from the catalog of starting materials. As the result of the substructure search shows, the bond marked in bold has to be a single bond. Thus, 1-methyl-cyclohexene (3) and 5-methyl-cyclohexadiene (4) are not considered as hits. Only cyclohexane derivatives like 4-methyl-cyclohexanol (1) or 2-methyl-cyclohexanone (2) are found by the substructure search.

Figure 4: Influence of bond orders and open sites on a substructure search. Bond orders are not considered.

Case B: Now, the switch for the consideration of bond orders is disabled. Since two open sites at adjacent atoms linked by a single bond represent both a saturated single bond and an unsaturated double bond, 3-methyl-1-cyclohexene (1) is found as well as 3,5-dimethyl-2-cyclohexen-1-one (2) by the substructure search (Figure 4). Since the number of hydrogen atoms on the bond marked in bold is defined, 1-methyl-cyclohexene (3) and toluene (4) are excluded from the list of hits.

Figure 5: Influence of bond orders and open sites on a substructure search. Bond orders are not considered.

Case C: The substructure query in Figure 5 contains open sites on the atoms of the bond marked in bold. Thus, the result of the substructure search is completely different since all possible combinations of superimpositions of single bonds, double bonds, and aromatic bonds are allowed.

Global Specifications

The previous discussion has shown how the global setting for the switch Bond Order influences the results of a substructure search. Other global settings for the treatment of tautomers and chiral compounds are also provided and will be explained later (see section Option Panel).

How to Start the CACTVS Substructure Search Tool

A substructure search is initiated from the main window of WODCA. By clicking with the left mouse button on the command Substructure Search in the searches menu the external CACTVS tool for substructure searches is started (Figure 6). The current compound and the information which catalog of chemicals is currently loaded are automatically transferred to the tool for substructure searches.

Figure 6: CACTVS Substructure Search tool for the application of substructure searches

Definition of a Substructure Query

Figure 7 shows all important window elements of the graphical user interface of the CACTVS Substructure Search tool for the definition of a substructure query. Each window element is explained in the following.

Figure 7: Window elements of the CACTVS Substructure Search tool

The Molecule Canvas

The molecule canvas (Figure 7, A) is the most important window element of the CACTVS Substructure Search tool. It allows the definition of a substructure query which is then displayed in this window area. Initially, a molecular structure is always displayed with all hydrogen atoms to indicate that there are no open sites already defined. By changing this molecular structure the substructure query can easily be defined.
Definition of open sites. The easiest way to define an open site on a certain atom of the query structure is to delete a hydrogen atom or another terminal atom which is linked to this atom. Thus, each atom which is deleted from the query structure represents an open site. Before an atom can be removed it is necessary to switch the canvas mode to eraser mode (Figure 7, B). In eraser mode, an atom can be deleted by a single click with the left mouse button on its element symbol. The atom is then removed and an open site is defined at its former position. For example, the nitrogen atom and a bond of the ring system are marked by white arrows (see Figure 7). After removing the hydrogen atoms from the nitrogen atom and from the two carbon atoms of the indicated bond, the nitrogen atom carries one open site and both carbon atoms of the bond carry two open sites.
Not only hydrogen atoms can be removed by a single mouse click but also any other kind of atom in the structure query, for example carbon atoms (1st column, Figure 8). It is also possible to remove bonds (2nd column, Figure 8) or entire atom groups (3rd column, Figure 8) from the structure query. If a single mouse click on the center atom of an atom group (e.g. carbon in the methylene group) is performed, the center atom as well as all hydrogen atoms of this group will be removed (3rd column, Figure 8). Each of these operations defines additional open sites in the direct neighborhood of the elements deleted. The numbers in Figure 8 indicate the number of open sites on each atom.

Figure 8: Definition of open sites by deleting atoms, bonds, and groups from the molecular structure. The numbers indicate the numbers of open sites on the corresponding atom.

Canvas mode. The canvas mode can be controlled by two buttons in the upper right corner of the graphical user interface of the CACTVS Substructure Search tool (see Figure 7, B). The button marked with a pencil activates the drawing mode. The button with the rubber icon is used to switch the canvas to eraser mode. If the CACTVS Substructure Search tool is started for the first time, it is automatically switched to eraser mode (default setting).
Eraser mode. In eraser mode it is possible to remove atoms, bonds, or entire groups by a single mouse click from the molecular structure displayed in the molecule canvas. The eraser mode is useful for the modification of a query structure, e. g. for the definition of open sites.

Drawing mode. In drawing mode it is possible to draw atoms, to link atoms, and to change the bond order of existing bonds. Thus, the drawing mode can be used for the definition of a new substructure query or for the modification of an existing one.

Drawing of atoms and bonds. Before a new atom can be created, an atom type has to be chosen from the element panel in the main window of the CACTVS Substructure Search tool (Figure 7, C). To draw a new atom which is linked to an existing one, click with the left mouse button on an atom and keep the mouse button pressed. A grid is displayed around the atom which shows the positions at which an additional atom can be placed (Figure 9: (1)). Move the mouse pointer to one of the grid items and release the mouse button. A new atom is created which is linked by a single bond to the atom considered (Figure 9: (2)).

Figure 9: Drawing a new atom

Changing the bond order. If the drawing mode of the molecule canvas is activated, the bond order can be increased by one by a single click with the left mouse button on the bond. A single bond is then changed to a double bond, a double bond is changed to a triple bond, and a triple bond is returned to a single bond. Each change is made only if it is allowed by the valence rules.
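The click behavior for bond orders can be illustrated with a minimal Python sketch. It is not the tool's actual code: the function name next_bond_order and the simplified free-valence check are assumptions made for illustration, and the sketch assumes that a refused change leaves the bond as it was.

```python
def next_bond_order(order: int, free_valences_a: int, free_valences_b: int) -> int:
    """Return the bond order after one left click in drawing mode.

    order            -- current bond order (1, 2 or 3)
    free_valences_a  -- unused valences on the first atom of the bond
    free_valences_b  -- unused valences on the second atom of the bond

    The free-valence counts are a simplified stand-in for the valence
    rules of the tool (assumption).
    """
    if order not in (1, 2, 3):
        raise ValueError("bond order must be 1, 2 or 3")
    if order == 3:
        # A triple bond is returned to a single bond.
        return 1
    # Raising the order by one needs one more valence on each atom.
    if free_valences_a >= 1 and free_valences_b >= 1:
        return order + 1
    # Change refused: the bond is left unchanged (assumed behavior).
    return order
```

Calling next_bond_order(1, 1, 1) gives 2, whereas next_bond_order(1, 0, 1) leaves the bond at 1 because the first atom has no valence left.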
Definition of stereochemistry. A bond can be represented as a solid line, as a broken wedge (in both directions), as a solid wedge (in both directions), and as a broken line. These different representations of a bond define the stereochemistry of the corresponding atoms. The representation of a bond can be changed by clicking with the right mouse button onto the bond. The representation changes then from one mode to the next mode.
Changing an atom type. The element panel contains a list of frequently used atom types which can be selected either for the drawing of new atoms or for the modification of existing atoms. Select an element symbol from the element panel and click on an atom of the query structure to change the atom type to the selected one.
Other atom types in the element panel. If atom types are needed which are not contained in the element panel, click on the button with the PSE icon (the button at the bottom of the element panel). The PSE panel is then displayed. Select the required atom type and it will become part of the element panel.
Changing the grid. The grid panel (Figure 7, D) is used for switching the grid to another geometry. The default setting for the grid is a hexagonal geometry. That means, if someone follows the grid items suggested during the drawing of a molecular structure a cyclohexane structure can be created (see Figure 10).

Figure 10: Drawing atoms and bonds within a hexagonal grid and other grids provided

Alternative grid geometries are provided by the grid panel which is shown in Figure 11. The grid geometry can be changed by a single mouse click with the left mouse button.

Figure 11: The grid panel

The SMILES Panel

The SMILES panel of the CACTVS Substructure Search tool (Figure 6 and Figure 7, E) represents the substructure query from the molecule canvas in SMILES notation (further information about the SMILES notation is available at: http://www.daylight.com/smiles/smiles-intro.html). If the substructure query is modified in the molecule canvas, the SMILES panel is updated immediately. Furthermore, it is possible to define or modify the substructure query directly in the SMILES panel. After clicking in the entry field of the SMILES panel, characters of the SMILES string can be added or deleted. The input is terminated by pressing the Return key on the keyboard. The substructure query is then immediately updated in the molecule canvas. Be aware of the following feature of the SMILES panel: each atom symbol of the SMILES string is interpreted as if it were enclosed in square brackets, e.g. the SMILES string 'CCC' is interpreted as '[C][C][C]' when the input is finished. That means that no hydrogen atoms are added automatically to any atom of the structure. Thus, use the explicit definition of hydrogen atoms in SMILES notation to add hydrogen atoms at a certain position of the structure. Click on the Add H button to the right of the entry field to add hydrogen atoms to all atoms with free sites. The Del H button is used to delete all hydrogen atoms from the structure query.
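The way the SMILES panel reads 'CCC' as '[C][C][C]' can be mimicked with a short Python sketch. It is only an illustration under stated assumptions: the function name bracket_atoms is hypothetical, only the atoms of the SMILES organic subset are handled, and whether the tool treats aromatic lowercase symbols in the same way is not confirmed by this tutorial.

```python
import re

# One token is either an existing bracket atom, a two-letter organic atom,
# a one-letter organic atom (aliphatic or aromatic), or any other single
# character (bond symbols, ring-closure digits, parentheses).
_TOKEN = re.compile(r"\[[^\]]*\]|Cl|Br|[BCNOPSFI]|[bcnops]|.")
_ATOM = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")


def bracket_atoms(smiles: str) -> str:
    """Wrap every unbracketed atom symbol in square brackets.

    In SMILES a bracket atom carries only the hydrogen atoms written
    inside the brackets, so '[C]' is a carbon without hydrogen atoms.
    """
    parts = []
    for token in _TOKEN.findall(smiles):
        if _ATOM.fullmatch(token):
            parts.append("[" + token + "]")
        else:
            # Bracket atoms, bonds, digits and parentheses stay as typed.
            parts.append(token)
    return "".join(parts)
```

With this sketch, bracket_atoms('CCC') returns '[C][C][C]', while an atom that is already written with brackets, such as '[NH2]', keeps its explicitly defined hydrogen atoms.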

The Option Panel

The option panel (Figure 7, F) provides some global settings to control the result of a substructure search.
Tautomers. If this option is set, the CACTVS Substructure Search tool tries to consider all tautomeric forms of a query during the substructure search. Since it is quite difficult to determine exactly the tautomeric form of a substructure query containing open sites or atom lists, the result of such a search may sometimes be surprising.
Stereochemistry. If the substructure query contains chiral centers the Stereo option at the Option panel can be used. If this option is set, all stereo centers and their descriptors are determined and then considered in the substructure search. Furthermore, the descriptors for each bond (E/Z) are derived (if possible). During the search only structures with stereo descriptors matching the query are found as hits.
Bond order. In the default settings of the CACTVS Substructure Search tool, this option is already set. Thus, during the substructure search the bond orders are considered (see section Bond Order).
Overlap. It is possible to define more than one separate substructure query in the molecule canvas. During a substructure search only those compounds from the catalog of chemicals matching each of these substructure fragments are considered as hits. If the Overlap option is set, such multiple substructure queries are allowed to overlap during the superimposition with the compounds from the catalog of chemicals.
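The effect of the Overlap option can be sketched in Python as follows. This is an illustration, not the algorithm used by the tool: it assumes that "overlap" means that two query fragments are mapped onto at least one common atom of the catalog compound, and the function name fragments_match and its input format are hypothetical.

```python
from itertools import product


def fragments_match(matches_per_fragment, allow_overlap):
    """Decide whether a compound is a hit for a multi-fragment query.

    matches_per_fragment -- one list per query fragment; each list holds
        sets of atom indices of the catalog compound onto which the
        fragment can be superimposed
    allow_overlap -- state of the Overlap option
    """
    # Every fragment has to be found at least once.
    if any(not matches for matches in matches_per_fragment):
        return False
    if allow_overlap:
        return True
    # Without overlap, one mapping per fragment must exist such that
    # no atom of the compound is used by two fragments.
    for combination in product(*matches_per_fragment):
        used_atoms = set().union(*combination)
        if len(used_atoms) == sum(len(m) for m in combination):
            return True
    return False
```

For example, if the first fragment maps only onto atoms {0, 1} and the second only onto atoms {1, 2}, the compound is a hit with Overlap set and a non-hit otherwise.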

The Command Panel

Performing a substructure search. After a substructure query is completely defined, the Search button in the command panel can be pressed to start the substructure search (Figure 7, G). The substructure search is then performed in the catalog of chemicals which is currently loaded in WODCA. At the end of a search the message at the Match List icon is replaced by the number of hits found during the search.
Location of Search. It is possible to perform the substructure search not only in a catalog of chemicals but also in a match list.

One of the following items can be selected in the option menu of the command panel: Search in Catalog (default setting), Search in Last Match List, and Exclude from Last Match List.

The search options Search in Last Match List and Exclude from Last Match List are useful if the result of a substructure search is to be restricted by revising the substructure query and repeating the substructure search only in a match list.
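The three search locations can be pictured as simple list filters. The following Python sketch is an assumption about their meaning rather than a description of the tool: in particular, Exclude from Last Match List is taken to remove the compounds matching the new query from the last match list. The function name run_search, the location keywords, and the toy predicate in the usage example are hypothetical.

```python
def run_search(matches, catalog, last_match_list, location="catalog"):
    """Return the new match list for the chosen search location.

    matches -- function that returns True if a compound contains the
               substructure query
    """
    if location == "catalog":
        # Search in Catalog (default setting)
        return [c for c in catalog if matches(c)]
    if location == "last_match_list":
        # Search in Last Match List: keep only hits of the revised query
        return [c for c in last_match_list if matches(c)]
    if location == "exclude_from_last_match_list":
        # Exclude from Last Match List: drop hits of the revised query
        return [c for c in last_match_list if not matches(c)]
    raise ValueError("unknown search location: " + location)


# Toy usage: a compound "matches" if its SMILES string contains an N.
# This stands in for a real substructure match.
contains_nitrogen = lambda smiles: "N" in smiles
catalog = ["CCN", "CCO", "CCCl", "NCC=O"]
first_hits = run_search(contains_nitrogen, catalog, [])
```

Repeating a search in the last match list never enlarges the list, which is why these two options are suited to narrowing down a previous result.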

The Menu Bar

The menu bar of the CACTVS Substructure Search tool contains a File menu and an Edit menu. In the following all functions of these menus are explained.
File Menu. The File menu contains read and write commands for the substructure query specified by the user.

If the Read Query ... entry is selected, a dialog box appears which contains a list of files and subdirectories in the current directory. Select a file by a single click with the left mouse button and press the Read button to load the corresponding structure query into the CACTVS Substructure Search tool. If a molecule file contains more than one molecule entry it is possible to switch between each of these entries by the cascaded Buffer menu (see below).
In order to save the current substructure query use the Write Query ... option. The substructure is then saved with all its open sites, attributes, and specifications. Since a special file format for substructures (*.cbin) is used, the file cannot be read by WODCA.

If a molecule file with more than one structure is read into the CACTVS Substructure Search tool (as described above), the cascaded Buffer menu controls which structure is loaded into the molecule canvas.


Edit Menu. The Edit menu provides some global operations for the substructure query displayed in the molecule canvas:

  • The Undo command neutralizes the last operation performed in the molecule canvas (e.g. drawing or deleting of atoms and bonds, setting of specifications for atoms and bonds).
  • The current substructure query can be removed by the Clear Canvas / Delete Query command.
  • The Add All Hydrogen Atoms command adds hydrogen atoms to each atom with open valences. Thus, after this operation has been applied to the substructure query the structure carries no open sites anymore. This function is identical to that of the Add H button in the SMILES panel.
  • The Delete All Hydrogen Atoms command removes each hydrogen atom of the current structure query in the molecule canvas. On positions where a hydrogen atom has been removed an open site is created. This function is identical to that of the Del H button in the SMILES panel.
  • The Beautify command recalculates the coordinates of the current substructure query and replots the structure in the molecule canvas.

Further Specifications

As described before, it is possible to provide further specifications for atoms and bonds of the substructure query.

Specifications for Atoms

If an atom of the substructure query in the molecule canvas is selected by a double click with the left mouse button on its element symbol, a window called Flags and Search Specs for Atom appears that allows one to modify the atom selected. At this state of development the design of the window is not quite finished. Thus, some window elements and their functions are not usable in the current version. In Figure 12 the most important window elements are marked by a box. In the following, a short description for each of these elements is given.

Figure 12: The pop-up window for the specification of atom types. The numbers correspond to the explanation in the text below

Atom type. If the atom type is set to Element (1), the atom type that was originally drawn at this position of the query structure is used during the substructure search. Element is the default setting for the atom type. If the atom type is switched to List (2), the user can define a list of elements instead of a single atom type for a certain position of the substructure query. The list of elements in the entry field has to be input as element symbols separated by spaces. As described in the chapter How to Define a Substructure, it is possible either to define a positive atom list (atoms allowed at a certain position) or to define a negative atom list (atoms excluded from a certain position). For example, the atom list 'C N O' defines that one of the elements carbon, nitrogen, or oxygen has to be at a certain position of the molecular structure for it to be a hit in a substructure search. By the same token, if these atom types have to be excluded from a certain position of the molecular structure, each element of the list has to carry an exclamation mark as a prefix, e.g. '!C !N !O'. Furthermore, some predefined atom lists are provided which can be used as templates (3): Any defines that there has to be an atom at a certain position but it can be of any kind. Thus, the list contains the entire periodic table. HDonor and HAcceptor define lists of atoms which are expected to be either a donor or an acceptor of a hydrogen atom, respectively. The option Insulator is quite similar to the Any atom type but it does not allow the atom to be part of a conjugated electron system.
Atom environment. A range for the number of hydrogen atoms on a certain atom can be defined in the entry field Hydrogen Count (4). For example, if an atom should carry one or two hydrogen atoms, the range '1-2' has to be input in the entry field. In the same manner, the valence range (5) and the number of ligands without hydrogen atoms (6) or including hydrogen atoms can be defined. If an atom should be part of a ring system, a range for the ring size can be specified (7). If the atom should be part of a chain, the Chain button of the button panel (8) has to be pressed. If the Cyclic button is pressed, the ring size is defined to be greater than or equal to three. The default setting of the button panel is Don't Care. Thus, an atom can be located either in a ring system or in a chain. The Ligand Fuzz entry field has no function yet.
Search flags. A number of search flags for an atom can be specified (9). If the Not-to-match flag is activated, the atom considered is excluded from the superimposition of the query structure with the compounds from the catalog of chemicals during the substructure search. Additional flags allow the specification that an atom is part of an aliphatic structure fragment or of an aromatic system. The Match Stereo flag enforces that the stereochemistry at an atom is considered in a substructure search. If the Match Charge flag is set, the charge of an atom at a certain position has to be the same in both the query structure and the compounds from the catalog of chemicals. The Unsaturated flag is useful if an atom carries open sites but should not match saturated atoms. This flag is not useful in cases where the atom is already saturated. The Must Map flag has no function yet.
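The atom list notation can be illustrated with a small Python sketch that checks a single element symbol against a list such as 'C N O' or '!C !N !O'. The function names parse_atom_list and atom_matches are hypothetical, and the treatment of a list that mixes positive and negative entries is an assumption, since the tutorial does not describe that case.

```python
def parse_atom_list(spec):
    """Split an atom list such as 'C N O' or '!C !N !O' into the sets of
    allowed and forbidden element symbols."""
    allowed, forbidden = set(), set()
    for token in spec.split():
        if token.startswith("!"):
            forbidden.add(token[1:])  # negative entry: element excluded
        else:
            allowed.add(token)        # positive entry: element permitted
    return allowed, forbidden


def atom_matches(element, spec):
    """Return True if the element is acceptable at the query position."""
    allowed, forbidden = parse_atom_list(spec)
    if element in forbidden:
        return False
    # A purely negative list accepts every element that is not excluded.
    return not allowed or element in allowed
```

With these definitions, sulfur is rejected by the positive list 'C N O' but accepted by the negative list '!C !N !O'.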
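Range entries such as '1-2' can be read as a closed interval. The Python sketch below shows one way to interpret such an entry; the function names parse_range and in_range are hypothetical, and accepting a single number such as '3' as the range 3 to 3 is an assumption that the tutorial does not confirm.

```python
def parse_range(spec):
    """Convert a range entry such as '1-2' into a (low, high) tuple."""
    text = spec.strip()
    if "-" in text:
        low_text, high_text = text.split("-", 1)
        low, high = int(low_text), int(high_text)
    else:
        # Assumption: a single number stands for an exact value.
        low = high = int(text)
    if low > high:
        raise ValueError("lower bound exceeds upper bound: " + spec)
    return low, high


def in_range(value, spec):
    """Return True if the value lies within the range, bounds included."""
    low, high = parse_range(spec)
    return low <= value <= high
```

An atom with one hydrogen atom satisfies the entry '1-2' (in_range(1, '1-2') is true), whereas an atom with three hydrogen atoms does not.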
The Set button (10) is used to transmit all specifications that have been defined in the window to the atom considered. After pressing the Set button the window is closed. The Cancel button (11) closes the window without any changes of the query structure.

Specifications for Bonds

If a bond of the substructure query in the molecule canvas is selected by a double click with the left mouse button on a bond, a window called Flags and Search Specs of Bond appears that allows one to modify the bond selected. At this state of development the design of the window is not quite finished. Thus, some window elements and their functions are not usable in the current version. In Figure 13 the most important window elements are marked by a box. In the following a short description for each of these elements is given.

Figure 13: The pop-up window for the specification of bond types. The numbers correspond to the explanation in the text below

Bond order. A range for the bond order can be set in this entry field, e.g. '1-2'. The default setting is 'as entered' (1) which corresponds to the bond order which is defined in the molecule canvas during the drawing of the substructure query.
Bond environment. If the bond should be located in a ring system, a range for the ring size can be defined in the Ring Sizes entry field (2). If the bond should be part of a chain, the Chain button of the button panel (3) has to be used. If the Cyclic button is pressed, the ring size is defined to be greater than or equal to three. The default setting of the button panel is Don't Care. Thus, a bond can be part of either a ring system or a chain.
Search flags. A bond can be set to be aliphatic or aromatic (4). If the Stereo flag (5) is set, the stereo descriptor of this bond has to be identical to that of the corresponding bond in the hits found by the substructure search. The Single/Aro option as well as the Double/Aro option (6) is useful if the global setting Bond Order in the option panel is switched off. If the RingMatch flag (7) is activated, the considered bond of the substructure query and the corresponding bond in the compound from the catalog of chemicals have to be part of the same number of rings. Thus, a bond in cyclohexane cannot be superimposed onto the central bond of decalin, since the former bond belongs to a single ring whereas the latter belongs to both rings of decalin.
The Set button (8) is used to transmit all specifications that have been defined in the window to the bond considered. After pressing the Set button the window is closed. The Cancel button (9) closes the window without any change of the query structure.
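The RingMatch comparison of cyclohexane and decalin described above can be reproduced with a short Python sketch. It is an illustration only: rings are supplied by hand as lists of atom indices instead of being perceived from a structure, and the function names ring_bonds, ring_count, and ring_match as well as the atom numbering are assumptions.

```python
def ring_bonds(ring):
    """Return the bonds of a ring given as an ordered list of atom indices."""
    size = len(ring)
    return {frozenset((ring[i], ring[(i + 1) % size])) for i in range(size)}


def ring_count(bond, rings):
    """Count the rings that contain the bond (a pair of atom indices)."""
    key = frozenset(bond)
    return sum(1 for ring in rings if key in ring_bonds(ring))


def ring_match(query_bond, query_rings, target_bond, target_rings):
    """RingMatch test: both bonds must belong to the same number of rings."""
    return ring_count(query_bond, query_rings) == ring_count(target_bond, target_rings)


# Cyclohexane: one six-membered ring.
cyclohexane = [[0, 1, 2, 3, 4, 5]]
# Decalin: two six-membered rings that share the central bond 0-5.
decalin = [[0, 1, 2, 3, 4, 5], [0, 5, 6, 7, 8, 9]]
```

Here ring_count((0, 5), decalin) is 2 while every bond of cyclohexane has a count of 1, so a cyclohexane bond with the RingMatch flag set can only be mapped onto the non-central bonds of decalin.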

How to Search for a Class of Compounds in a Catalog of Chemicals

Before a substructure search can be performed, a query compound has to be defined and a catalog of chemicals has to be loaded into WODCA. If a query compound is already saved in a CTX structure file, it can be loaded into WODCA with the help of the file menu. Otherwise, it can directly be exported from the CACTVS Molecule Editor into WODCA. The file menu is also useful to load a catalog of chemicals. Whether a catalog of chemicals is already loaded into WODCA, or not, is indicated by the Catalog icon in WODCA's Information Area.
The next step is to click with the left mouse button on the command Substructure Search ... in the searches menu. The window of the CACTVS Substructure Search tool appears with the current compound displayed in the molecule canvas. Before a substructure search can be started the substructure query has to be defined (see also How to Define a Substructure). After pressing the Search button on the right-hand side at the bottom of the window, WODCA searches in the catalog of chemicals for compounds that contain the substructure of the query. If a substructure search was successful, WODCA lists all the compounds found in the WODCA console and the Match List icon in WODCA's Information Area indicates the number of hits. A single mouse click on the Match List icon opens the CACTVS Match List Browser to view the molecular structures of the hits.


Last change: 2000-06-27
Webmaster: matthias.pfoertner@chemie.uni-erlangen.de