Reoptimization of MDL Keys for Use in Drug Discovery
Joseph L. Durant,* Burton A. Leland, Douglas R. Henry, and James
G. Nourse
MDL Information Systems, 14600 Catalina Street, San Leandro,
California 94577
Received December 17, 2001
Abstract:
For a number of years MDL products have exposed both 166
bit and 960 bit keysets based on 2D descriptors. These keysets were
originally constructed and optimized for substructure searching. We report
on improvements in the performance of MDL keysets which are reoptimized for
use in molecular similarity. Classification performance for a test data set
of 957 compounds was increased from 0.65 for the 166 bit keyset and 0.67 for
the 960 bit keyset to 0.71 for a surprisal S/N pruned keyset containing 208
bits and 0.71 for a genetic algorithm optimized keyset containing 548 bits.
We present an overview of the underlying technology supporting the
definition of descriptors and the encoding of these descriptors into
keysets. This technology allows definition of descriptors as combinations of
atom properties, bond properties, and atomic neighborhoods at various
topological separations as well as supporting a number of custom
descriptors. These descriptors can then be used to set one or more bits in a
keyset. We constructed various keysets and optimized their performance in
clustering bioactive substances. Performance was measured using methodology
developed by Briem and Lessel. “Directed pruning” was carried out by
eliminating bits from the keysets on the basis of random selection, values
of the surprisal of the bit, or values of the surprisal S/N ratio of the
bit. The random pruning experiment highlighted the insensitivity of keyset
performance for keyset lengths of more than 1000 bits. Contrary to initial
expectations, pruning on the basis of the surprisal values of the various
bits resulted in keysets which underperformed those resulting from random
pruning. In contrast, pruning on the basis of the surprisal S/N ratio was
found to yield keysets which performed better than those resulting from
random pruning. We also explored the use of genetic algorithms in the
selection of optimal keysets. Once more the performance was only a weak
function of keyset size, and the optimizations failed to identify a single
globally optimal keyset. Instead multiple, equally optimal keysets could be
produced which had relatively low overlap of the descriptors they encoded.
You can download
the complete paper here.
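The pruning experiments in the abstract rank the bits of a keyset and discard some of them. The following is a minimal sketch of random pruning and surprisal-based pruning, not the authors' implementation. It assumes that surprisal is the standard self-information, -log2(p), where p is the fraction of compounds that set the bit. The abstract does not give the paper's exact definition, does not say which end of the ranking is discarded, and does not define the surprisal S/N ratio, so that ratio is left out. All function names and the toy fingerprints are hypothetical.

```python
import math
import random


def bit_surprisal(fingerprints, n_bits):
    """Surprisal log2(1/p) per bit, where p is the fraction of
    fingerprints that set the bit (assumed definition).

    Each fingerprint is a set of 0-based bit indices.
    """
    counts = [0] * n_bits
    for fp in fingerprints:
        for i in fp:
            counts[i] += 1
    total = len(fingerprints)
    # A bit that is never set has p = 0; report infinite surprisal for it.
    return [math.log2(total / c) if c else math.inf for c in counts]


def prune_random(n_bits, keep, seed=0):
    """Keep `keep` bit indices chosen uniformly at random."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(n_bits), keep))


def prune_by_score(scores, keep, drop_highest=True):
    """Keep `keep` bit indices after dropping the highest-scoring
    (or lowest-scoring) bits. Which end to drop is a free choice here."""
    # `order` lists bits in the order they would be dropped.
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=drop_highest)
    return sorted(order[len(scores) - keep:])


# Toy data: four compounds, four bits.
toy_fingerprints = [{0, 1}, {0, 2}, {0, 1, 3}, {0}]
toy_scores = bit_surprisal(toy_fingerprints, n_bits=4)
kept_bits = prune_by_score(toy_scores, keep=2)
```

In this toy set, bit 0 is set by every compound and so carries zero surprisal, while bits 2 and 3 are each set once and carry the most. Swapping `toy_scores` for another per-bit score lets the same `prune_by_score` helper serve any other ranking criterion.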
To generate the keys, modify your database using REXEC:
open DATABASE dbname/dbname TNAME dbname;
alter database drop TEXT2DKEYS;
alter database add TEXT2DKEYS NOSUBSET/SUBSET;
close database;
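If the same commands have to be issued for several databases, the command text can be generated rather than typed. The sketch below only fills in the template shown above and does not call REXEC itself. It assumes that the three occurrences of dbname stand for the same name and that the slash in NOSUBSET/SUBSET marks a choice between two options; the source does not explain what either option does. The function name `rexec_key_script` is hypothetical.

```python
def rexec_key_script(dbname, subset=False):
    """Return the REXEC command text for regenerating TEXT2DKEYS.

    Assumption: exactly one of NOSUBSET or SUBSET is given on the
    `alter database add` line.
    """
    mode = "SUBSET" if subset else "NOSUBSET"
    return "\n".join([
        f"open DATABASE {dbname}/{dbname} TNAME {dbname};",
        "alter database drop TEXT2DKEYS;",
        f"alter database add TEXT2DKEYS {mode};",
        "close database;",
    ])


# "mydb" is a placeholder database name.
script_text = rexec_key_script("mydb")
```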
The List of Searchable Keys
When you search a molecule database, you can use 166 numbered keys. If you do
not recognize the description of a specific key, it may be written in a special
notation called query language. The searchable keys are as follows:
| Key | Description |
| 1 | ISOTOPE |
| 2 | 103 < ATOMIC NO. < 256 |
| 3 | GROUP IVA,VA,VIA PERIODS 4-6 (GE..) |
| 4 | ACTINIDE |
| 5 | GROUP IIIB,IVB (SC..) |
| 6 | LANTHANIDE |
| 7 | GROUP VB,VIB,VIIB (V..) |
| 8 | QAAA@1 |
| 9 | GROUP VIII (FE..) |
| 10 | GROUP IIA (ALKALINE EARTH) |
| 11 | 4M RING |
| 12 | GROUP IB,IIB (CU..) |
| 13 | ON(C)C |
| 14 | S-S |
| 15 | OC(O)O |
| 16 | QAA@1 |
| 17 | CTC |
| 18 | GROUP IIIA (B..) |
| 19 | 7M RING |
| 20 | Si |
| 21 | C=C(Q)Q |
| 22 | 3M RING |
| 23 | NC(O)O |
| 24 | N-O |
| 25 | NC(N)N |
| 26 | C$=C($A)$A |
| 27 | I |
| 28 | QCH2Q |
| 29 | P |
| 30 | CQ(C)(C)A |
| 31 | QX |
| 32 | CSN |
| 33 | NS |
| 34 | CH2=A |
| 35 | GROUP IA (ALKALI METAL) |
| 36 | S HETEROCYCLE |
| 37 | NC(O)N |
| 38 | NC(C)N |
| 39 | OS(O)O |
| 40 | S-O |
| 41 | CTN |
| 42 | F |
| 43 | QHAQH |
| 44 | OTHER |
| 45 | C=CN |
| 46 | BR |
| 47 | SAN |
| 48 | OQ(O)O |
| 49 | CHARGE |
| 50 | C=C(C)C |
| 51 | CSO |
| 52 | NN |
| 53 | QHAAAQH |
| 54 | QHAAQH |
| 55 | OSO |
| 56 | ON(O)C |
| 57 | O HETEROCYCLE |
| 58 | QSQ |
| 59 | Snot%A%A |
| 60 | S=O |
| 61 | AS(A)A |
| 62 | A$A!A$A |
| 63 | N=O |
| 64 | A$A!S |
| 65 | C%N |
| 66 | CC(C)(C)A |
| 67 | QS |
| 68 | QHQH (&..) |
| 69 | QQH |
| 70 | QNQ |
| 71 | NO |
| 72 | OAAO |
| 73 | S=A |
| 74 | CH3ACH3 |
| 75 | A!N$A |
| 76 | C=C(A)A |
| 77 | NAN |
| 78 | C=N |
| 79 | NAAN |
| 80 | NAAAN |
| 81 | SA(A)A |
| 82 | ACH2QH |
| 83 | QAAAA@1 |
| 84 | NH2 |
| 85 | CN(C)C |
| 86 | CH2QCH2 |
| 87 | X!A$A |
| 88 | S |
| 89 | OAAAO |
| 90 | QHAACH2A |
| 91 | QHAAACH2A |
| 92 | OC(N)C |
| 93 | QCH3 |
| 94 | QN |
| 95 | NAAO |
| 96 | 5M RING |
| 97 | NAAAO |
| 98 | QAAAAA@1 |
| 99 | C=C |
| 100 | ACH2N |
| 101 | 8M RING OR LARGER |
| 102 | QO |
| 103 | CL |
| 104 | QHACH2A |
| 105 | A$A($A)$A |
| 106 | QA(Q)Q |
| 107 | XA(A)A |
| 108 | CH3AAACH2A |
| 109 | ACH2O |
| 110 | NCO |
| 111 | NACH2A |
| 112 | AA(A)(A)A |
| 113 | Onot%A%A |
| 114 | CH3CH2A |
| 115 | CH3ACH2A |
| 116 | CH3AACH2A |
| 117 | NAO |
| 118 | ACH2CH2A>1 |
| 119 | N=A |
| 120 | HETEROCYCLIC ATOM>1 (&..) |
| 121 | N HETEROCYCLE |
| 122 | AN(A)A |
| 123 | OCO |
| 124 | QQ |
| 125 | AROMATIC RING>1 |
| 126 | A!O!A |
| 127 | A$A!O>1 (&..) |
| 128 | ACH2AAACH2A |
| 129 | ACH2AACH2A |
| 130 | QQ>1 (&..) |
| 131 | QH>1 |
| 132 | OACH2A |
| 133 | A$A!N |
| 134 | X (HALOGEN) |
| 135 | Nnot%A%A |
| 136 | O=A>1 |
| 137 | HETEROCYCLE |
| 138 | QCH2A>1 (&..) |
| 139 | OH |
| 140 | O>3 (&..) |
| 141 | CH3>2 (&..) |
| 142 | N>1 |
| 143 | A$A!O |
| 144 | Anot%A%Anot%A |
| 145 | 6M RING>1 |
| 146 | O>2 |
| 147 | ACH2CH2A |
| 148 | AQ(A)A |
| 149 | CH3>1 |
| 150 | A!A$A!A |
| 151 | NH |
| 152 | OC(C)C |
| 153 | QCH2A |
| 154 | C=O |
| 155 | A!CH2!A |
| 156 | NA(A)A |
| 157 | C-O |
| 158 | C-N |
| 159 | O>1 |
| 160 | CH3 |
| 161 | N |
| 162 | AROMATIC |
| 163 | 6M RING |
| 164 | O |
| 165 | RING |
| 166 | FRAGMENTS |
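A keyset built from this list can be treated as a set of key numbers, which makes it easy to read back the descriptions and to compare two molecules. The following is a minimal sketch under stated assumptions: the dictionary holds only a hand-picked excerpt of the list above, the two example key sets are invented for illustration and were not generated from real structures, and the Tanimoto coefficient is used as a common similarity measure for bit keysets, which the text above does not name. The function names are hypothetical.

```python
# Excerpt of the key list above (key number -> description).
KEY_DESCRIPTIONS = {
    103: "CL",
    139: "OH",
    162: "AROMATIC",
    163: "6M RING",
    164: "O",
    165: "RING",
}


def describe(keys):
    """Pair each set key with its description, or '?' if not in the excerpt."""
    return [(k, KEY_DESCRIPTIONS.get(k, "?")) for k in sorted(keys)]


def to_bits(keys, n_bits=166):
    """Expand key numbers (1-based) into a list of 0/1 flags of length n_bits."""
    bits = [0] * n_bits
    for k in keys:
        bits[k - 1] = 1
    return bits


def tanimoto(a, b):
    """Shared keys divided by keys set in either keyset.

    Returns 0.0 when both keysets are empty (a convention chosen here).
    """
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Invented key sets, for illustration only.
keys_a = {139, 162, 163, 164, 165}
keys_b = {103, 162, 163, 165}
similarity = tanimoto(keys_a, keys_b)  # 3 shared keys out of 6 distinct keys
```

Storing key numbers rather than a fixed-length bit list keeps the example independent of keyset length, so the same helpers apply to a pruned keyset as long as the key numbering is preserved.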