What are the general capabilities of Name to Structure?
General capabilities of Name>Struct
Name>Struct is designed to be as complete, accurate, and fast as possible so that it can be used with confidence to interpret one name or a million.
Does Name>Struct interpret IUPAC names?
Does Name>Struct interpret CAS names?
"Yes" and "yes".
...however, this isn't really a good question for one very practical reason: many -- and perhaps most -- chemical names in actual use violate the nomenclature rules published by those organizations. IUPAC names and CAS names represent only a small fraction of the chemical names that are actually being used. Name>Struct does interpret IUPAC and CAS names, but it also recognizes many types of nomenclature usage (and misusage) that are discouraged or even forbidden by the published rules.
How much nomenclature does it recognize?
Name>Struct recognizes and correctly interprets:
- > 90% of organic and biochemical nomenclature published by IUPAC, IUBMB, and CAS
- > 60% of inorganic nomenclature published by IUPAC and CAS
...however, this also isn't really a good question. Published nomenclature recommendations are designed to be all-encompassing: IUPAC has as many rules describing the naming of amine oxides as it does for the naming of alcohols, even though the latter are vastly more common in practical use. The recommendations not supported by Name>Struct are, without exception, obscure.
What nomenclature procedures are supported?
Basically, "all of them". IUPAC offers this list of general nomenclature procedures:
- Substitutive names describe the replacement of a hydrogen with some other ligand. This is by far the most common nomenclature procedure, and includes names that list substituents as prefixes (1,2-dichloroethane), as suffixes (ethanol), and in combination (1,2-dichloroethanol).
- Functional class names have internal spaces that separate one or more parent structures from a functional class modifier. Possibly the most common examples are esters (methyl acetate), but also include names like methyl iodide and methyl ethyl ketone. This procedure is much less common than it used to be, as names like "ethanol" are gaining popularity over the equivalent "ethyl alcohol".
- Replacement names describe the replacement of individual carbon atoms by heteroatoms. These are most commonly seen in ring systems like "1,2,4-thiadiazole", and are occasionally seen in a number of other circumstances.
- Conjunctive names are strictly limited to cases where a ring system is connected directly to the tail end of a chain, as in "benzenepropanol". Names of this sort are more common in collections of highly systematic names, and especially collections of CAS names. Since this procedure is not well known, it is rarely seen in names generated by average chemists.
- Additive names are composed of multiple components, none of which loses any atoms, including hydrogen. There are many different types of additive nomenclature, but salts (copper acetate, sodium chloride) are particularly well known.
- Multiplicative names describe structures in which a parent structure is repeated two or more times. Names of this sort are generally limited to very symmetric compounds such as "2,2'-oxydiethanol". As with conjunctive names, multiplicative names are most commonly seen from CAS and rarely seen in names created by average chemists.
Name>Struct handles all of the above procedures more or less completely. That is, Name>Struct may fail to interpret any given name, but when it does, the cause is a particular name fragment that it does not recognize (as in "3-unknownyl-2-chloro-propanol") rather than an inability to understand the principles of, say, a substitutive name.
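The failure mode described above, where a single unrecognized fragment blocks an otherwise well-formed name, can be illustrated with a toy sketch. This is not how Name>Struct works internally: the lexicon, the locant pattern, and the function names `can_segment` and `unrecognized_fragments` are all hypothetical and chosen only for illustration.

```python
import re
from typing import List

# Hypothetical, deliberately tiny lexicon of name fragments.
# It is an assumption for this sketch, not Name>Struct's vocabulary.
LEXICON = {"di", "chloro", "methan", "ethan", "propan", "ol"}

# Locants such as "3" or "1,2".
LOCANT = re.compile(r"\d+(,\d+)*")


def can_segment(part: str) -> bool:
    """Return True if `part` splits entirely into lexicon fragments."""
    # reachable[i] is True when part[:i] can be built from lexicon fragments.
    reachable = [True] + [False] * len(part)
    for end in range(1, len(part) + 1):
        reachable[end] = any(
            reachable[start] and part[start:end] in LEXICON
            for start in range(end)
        )
    return reachable[len(part)]


def unrecognized_fragments(name: str) -> List[str]:
    """List the hyphen-separated parts that are neither locants nor segmentable."""
    parts = [p for p in name.lower().split("-") if p]
    return [
        p for p in parts
        if not LOCANT.fullmatch(p) and not can_segment(p)
    ]
```

With this toy lexicon, `unrecognized_fragments("3-unknownyl-2-chloro-propanol")` returns `["unknownyl"]`, while `unrecognized_fragments("1,2-dichloroethanol")` returns an empty list. The structure of the name is understood in both cases; only the vocabulary is missing in the first.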
The only general procedure that Name>Struct fails to support completely is subtractive nomenclature.
- Subtractive names describe structures where one or more atoms are removed from a parent structure. The great majority of uses of subtractive nomenclature involve the "deoxy-" prefix for carbohydrates, nucleic acids, and amino acids, and the "nor-" prefix for natural products -- and Name>Struct does interpret those prefixes correctly. Technically, however, anything could be named using subtractive nomenclature. Methane could be named as "de(phenyl)toluene" or even recursively as "de(phenyl)(phenyl(de(phenyl)toluene))". Name>Struct does not attempt to support subtractive nomenclature in the general case, but the general case is extremely obscure and rarely encountered.
How accurate is it?
Accuracy is an extremely important question, especially when batch-converting thousands (or hundreds of thousands) of names. It's very important that you are able to trust the output of any algorithm designed to run without supervision, and with Name>Struct, you can. In our extensive testing of many databases, including our own ChemFinder/Webserver and ChemACX, as well as many user-provided databases, we have found that the structures produced by Name>Struct are
- >99% accurate
It would be nice if we could claim to be 100% accurate, but that's never going to be realistic. The last percent includes a lot of names that are ambiguous in a variety of ways. Name>Struct is designed to interpret names in the most common and reasonable way possible. That's usually the appropriate thing to do, but if someone intends to use a name in an unusual or unreasonable way, the structure generated by Name>Struct won't match the structure that was intended (although it likely will match a name that could have been intended). Rather than arguing the correct behavior for these cases, we're simply not claiming more than 99% accuracy.
OK, how many names does it recognize?
We have looked at many different collections of chemical names from varied sources including chemical vendor catalogs, reference works, and published literature. No matter what source we examine, we have found that Name>Struct consistently interprets
- 70-90% of names that are actually found in real-life usage
In rare cases we have seen Name>Struct interpret as much as 95% or as little as 35% of a given source, but there were good explanations for those outliers. In general, the 70-90% figure should be seen in most circumstances. The remaining 10-30% of the names generally correspond to known limitations of Name>Struct, and most of those limitations are insurmountable. Even with an unlimited amount of effort, it will never be possible to generate structures for anywhere close to 100% of names that are currently being used.
It's possibly worth pointing out that Name>Struct can interpret literally an infinite number of names, but that's not a very useful observation, since it depends on the creation of trivially boring infinite series of related names:
By the same argument, it becomes obvious that Name>Struct will also fail to interpret an infinite number of names. Suppose that "foobaranol" represents some structure that Name>Struct cannot interpret. Accordingly, it will fail to generate structures for all of the following names as well:
So it's not really useful to ask "how many", but "what fraction of" is an excellent question with a good answer.
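As a rough worked example of the "what fraction of" question, the figures quoted above (70-90% of names interpreted, structures more than 99% accurate) can be turned into expected counts for a batch. The function name `estimate_batch` and its parameters are hypothetical, and the result is only a back-of-the-envelope estimate, not a guarantee for any particular source.

```python
def estimate_batch(n_names, coverage_pct=(70, 90), min_accuracy_pct=99):
    """Estimate outcomes for a batch of names from the quoted percentages.

    Integer arithmetic is used so the counts are exact whole numbers.
    """
    low_pct, high_pct = coverage_pct
    interpreted_low = n_names * low_pct // 100
    interpreted_high = n_names * high_pct // 100
    # Upper bound on wrong structures: the error rate applied to the
    # largest number of interpreted names.
    max_wrong = interpreted_high * (100 - min_accuracy_pct) // 100
    return {
        "interpreted_low": interpreted_low,
        "interpreted_high": interpreted_high,
        "uninterpreted_low": n_names - interpreted_high,
        "uninterpreted_high": n_names - interpreted_low,
        "max_wrong_structures": max_wrong,
    }


result = estimate_batch(100_000)
```

For a batch of 100,000 names this gives between 70,000 and 90,000 interpreted names, between 10,000 and 30,000 names with no structure, and at most about 900 structures that differ from what was intended.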
How fast is it?
In short, very. Specifically, the batch version of Name>Struct can process roughly:
- >30,000 names/minute
Of course, the actual speed will depend on the processor speed of the machine that is used to run the software, but that is a realistic speed for most modern machines (i.e. most machines produced in the last few years, or with at least a 2.5 GHz / Pentium 4 processor). That works out to less than 2 milliseconds per name, on average. In other words, it is possible to convert over a million names in the span of an hour. We know of only a few data sources with more than a million names, including the CAS and Beilstein databases themselves.
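The arithmetic behind those figures can be checked with a short sketch. The only input taken from the text is the rate of 30,000 names per minute, treated here as a lower bound; the helper name `minutes_to_convert` is hypothetical.

```python
NAMES_PER_MINUTE = 30_000  # lower bound quoted in the text

# 60,000 ms in a minute divided by the rate gives the time per name.
ms_per_name = 60_000 / NAMES_PER_MINUTE

# Names processed in one hour at that rate.
names_per_hour = NAMES_PER_MINUTE * 60


def minutes_to_convert(n_names, rate=NAMES_PER_MINUTE):
    """Minutes needed to convert `n_names` at `rate` names per minute."""
    return n_names / rate
```

At exactly 30,000 names per minute, `ms_per_name` is 2.0 and `names_per_hour` is 1,800,000, so a million names take a little over 33 minutes. Any rate above that bound shortens these times accordingly.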