Gnarus Systems


Cat-SAR Details


“…but the wise little Alice was not going to do that in a hurry. ‘No, I’ll look first,’ she said, ‘and see whether it’s marked “poison” or not’…”

BACKGROUND: Computational and Predictive Toxicology

Benfenati and Gini [1] described modern structure-activity relationship (SAR) and quantitative SAR (QSAR) methods as typically involving three parts: 1) the chemical part, 2) the biological part (i.e., biological activity), and 3) the methodology for relating parts 1 and 2. In practice, the chemical and biological parts are combined into a “learning set” for SAR and QSAR analyses. The main premise for SAR and QSAR methods is that recurring and identifiable attributes of chemicals are associated with, or responsible for, particular biological effects and that these can be discovered by various mathematical, artificial intelligence, or statistical analyses of the learning sets.

Briefly, computer assisted drug discovery and predictive toxicology methods have continued to evolve since their inception by Hansch [2]. The application of the linear free energy relationship concept allowed medicinal chemists to optimize the biological activity of congeneric (similarly shaped) sets of compounds. The birth of the SAR and Quantitative Structure Activity Relationship (QSAR) paradigm allowed for the rational design of therapeutic agents and the accurate prediction of toxicants. Classical methods of SAR/QSAR are still appropriate for small series of similar compounds for which calculated and/or experimentally derived descriptors (i.e., chemical-physical or quantum mechanical) are available.

The application of three-dimensional molecular modeling, with exhaustive conformational analysis, to large libraries of compounds is labor and time intensive. Moreover, aside from cost considerations, three-dimensional QSAR methods are typically not appropriate for structurally diverse sets of chemicals. The reason is that QSAR models are essentially built on the premise that the compounds being analyzed act through a common mechanism (e.g., a receptor and a set of analog ligands). As such, one of the major difficulties encountered in SAR/QSAR analyses of chemical carcinogens (and other high-end human toxicity endpoints) was the structurally diverse nature of the chemicals tested for adverse health effects. This diversity is largely due to the large number of environmental, industrial, and consumer compounds in use and the limited number of chemicals tested for toxicity (see below).

The Computer Automated Structure Evaluation (CASE) expert system was developed over 20 years ago by Rosenkranz and Klopman to address these difficulties [3, 4]. This program was one of the first developed to efficiently and rapidly analyze large numbers of diverse compounds to identify structural features, termed biophores, that were associated with a particular biological activity. Briefly, biophores are statistically significant features derived from a learning set of diverse chemical structures that ultimately relate chemical structure to biological activity. Biophores often refer to functionalities whose biological effect is due to a common mechanism of action (e.g., for drug design, these substructures are “pharmacophores” and for carcinogenesis they are often called “structural alerts”).

In terms of carcinogenesis, we have reported predictive and mechanistically insightful SAR models based on Carcinogenic Potency Database (CPDB) analyses of mice [5] and rats [6] using the CASE/MULTICASE SAR expert system (MCASE). The best rat and mouse SAR models from these studies, respectively, had a concordance between experimental and SAR-predicted values of 71 and 78%, sensitivity of 69 and 77%, and specificity of 73 and 78% [5, 6]. These models, while being predictive, also provided insight into the structural underpinnings for species-specific carcinogenesis, since many, though not all, of the readily explainable biophores corresponded with the genotoxic or electrophilic paradigm of carcinogenesis [7].

MCASE and MDL-QSAR models have also recently been developed for a set of 1540 compounds tested for rodent carcinogenicity as compiled by the FDA [8]. In this case, Contrera et al. reported a concordance of 66 and 69%, a sensitivity of 61 and 63%, and a specificity of 71 and 75%, respectively, for MCASE and MDL-QSAR [8]. Many others have also modeled chemical carcinogens with varying degrees of success, and the utility and application of some important toxicologically focused predictive methods have been reviewed in depth elsewhere [9-11].


Based on the rationale of the MCASE program and with advances in computing and chemoinformatics, Cunningham has developed the categorical-SAR (Cat-SAR) program [12-14]. The Cat-SAR program is a computationally based SAR expert system that was developed to associate 2-dimensional chemical fragments with active or inactive compounds in a learning set. However, unlike other 2-dimensional modeling approaches including MCASE, Cat-SAR is transparent, does not include proprietary computer code, and leaves model parameter selection under the control of the user. As such, the approach is sharable and allows unrestricted user scrutiny, intervention, and model optimization throughout the SAR modeling process.

Cat-SAR Expert System Learning Sets: Cat-SAR models are built through a comparison of descriptors found amongst two designated biological activity categories of compounds in the model’s learning set. For example, the categories for carcinogenesis are carcinogens and non-carcinogens. A set of rules is often used for the inclusion of chemicals in the learning sets. Organic salts are included as the free base. Mixtures are mostly not included, though in the case of some simple mixtures and technical grade preparations, specific components can be included as the major or active entity. Metals are not included, although in some instances metallo-organic compounds may be included as the organic part. Polymers may be included as representative monomers or dimers.


Cat-SAR Model Descriptors: Chemical Fragments: As mentioned, the cat-SAR expert system develops models based on two-dimensional chemical fragments [12, 14]. Each chemical in the learning set is divided into all possible fragments between (for example) three and seven heavy atoms in size, taking into account atom types, bond types, and atomic connections. A compound-fragment matrix is then computed for cat-SAR analysis in which the rows are intact chemicals and the columns are molecular fragments. Thus, for each chemical, all of its fragments are tabulated across the table row, and for each fragment, all chemicals that contain it are tabulated down the table column.
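The compound-fragment matrix described above can be illustrated with a minimal Python sketch. It assumes the fragments have already been enumerated by some cheminformatics tool; the compound names, the fragment strings, and the function name build_matrix are hypothetical placeholders, and this is not the cat-SAR program's own code.

```python
# Hypothetical learning set: each compound maps to the set of fragments
# (written here as SMILES-like strings) that a fragmenter found in it.
compound_fragments = {
    "compound_A": {"c1ccccc1", "N=O", "CCN"},
    "compound_B": {"c1ccccc1", "CCO"},
    "compound_C": {"N=O", "CCN", "CCO"},
}


def build_matrix(compound_fragments):
    """Return (compounds, fragments, matrix), where matrix[i][j] is 1 if
    compound i contains fragment j and 0 otherwise."""
    compounds = sorted(compound_fragments)
    fragments = sorted(set().union(*compound_fragments.values()))
    matrix = [
        [1 if frag in compound_fragments[name] else 0 for frag in fragments]
        for name in compounds
    ]
    return compounds, fragments, matrix


compounds, fragments, matrix = build_matrix(compound_fragments)

# Reading across a row lists one compound's fragments; reading down a
# column lists the compounds that share one fragment.
for name, row in zip(compounds, matrix):
    print(name, row)
```

A dense list of lists is used only for readability; a real learning set with thousands of fragments would more likely call for a sparse representation.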

Cat-SAR Model Development: To ascertain an association between chemical descriptors (i.e., fragments) and a chemical’s activity (or inactivity), a set of rules is used to distinguish “important” from “unimportant” descriptors. The first selection rule (the Number Rule) is the number of chemicals in the learning set that possess each particular descriptor. The second selection rule (the Proportion Rule) is the proportion of active or inactive chemicals among those that possess the particular descriptor.
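As a rough illustration of how the two selection rules might act together, the following Python sketch filters a table of per-fragment counts. The threshold values, the counts, and the function name select_fragments are assumptions chosen for the example; the text above leaves the actual parameter values to the user.

```python
def select_fragments(fragment_counts, min_compounds=3, min_proportion=0.8):
    """fragment_counts maps fragment -> (n_active, n_inactive).

    A fragment is kept when it occurs in at least min_compounds compounds
    (Number Rule) and at least min_proportion of those compounds fall in a
    single activity category (Proportion Rule). Both thresholds are
    illustrative assumptions, not published cat-SAR settings.
    """
    selected = {}
    for frag, (n_active, n_inactive) in fragment_counts.items():
        total = n_active + n_inactive
        if total < min_compounds:
            continue  # fails the Number Rule
        if max(n_active, n_inactive) / total >= min_proportion:
            selected[frag] = (n_active, n_inactive)
    return selected


# Hypothetical counts of active and inactive compounds per fragment.
counts = {
    "N=O": (9, 1),        # mostly active: kept
    "CCO": (5, 5),        # evenly split: fails the Proportion Rule
    "CCN": (2, 0),        # too rare: fails the Number Rule
    "c1ccccc1": (1, 11),  # mostly inactive: kept
}
model = select_fragments(counts)
print(sorted(model))
```

Note that fragments associated mainly with inactive compounds are retained as well, since the rules are stated for either activity category.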

Model Validation and Application: The resulting list of important fragments can then be used for mechanistic analysis, to predict the activity of an unknown compound, and to validate or test the predictivity of the model. To predict the activity of an untested compound, the cat-SAR program determines which, if any, descriptors from the model’s pool of significant descriptors the untested compound contains. If none are present, no prediction of activity is made for the compound (i.e., there are no default predictions of activity). If one or more descriptors are present, the number of active and inactive compounds associated with each descriptor is determined. The probability of activity or inactivity is then calculated based on the total number of active and inactive compounds that went into deriving each of the descriptors.
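One plausible reading of this prediction step is sketched below in Python: the active and inactive counts behind every matched fragment are pooled, and the fraction of actives is reported as the probability of activity. The pooling scheme, the example model, and the function name predict_activity are assumptions made for illustration; the cat-SAR program may weight descriptors differently.

```python
def predict_activity(compound_fragments, model):
    """model maps each significant fragment -> (n_active, n_inactive).

    Returns the pooled fraction of active compounds behind the fragments
    the compound shares with the model, or None when it shares none
    (no default prediction is made).
    """
    matched = [model[f] for f in compound_fragments if f in model]
    if not matched:
        return None
    active = sum(a for a, _ in matched)
    inactive = sum(i for _, i in matched)
    return active / (active + inactive)


# Hypothetical model of significant fragments.
model = {"N=O": (9, 1), "c1ccccc1": (1, 11)}

print(predict_activity({"N=O", "CCC"}, model))       # one matching fragment
print(predict_activity({"N=O", "c1ccccc1"}, model))  # two matching fragments
print(predict_activity({"CCC"}, model))              # no match: no prediction
```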

A self-fit and two validation analyses are typically conducted for each model. For the self-fit analysis, after a model is developed, the model is used to predict the activity of the chemicals in its learning set in order to ascertain whether or not the model is generally capable of fitting the data. A leave-one-out (LOO) validation is also used wherein each chemical, one at a time, is removed from the total fragment set and an n-1 model is derived. Using the same criteria described above for predictions, the activity of the removed chemical is then predicted using the n-1 model. Additionally, with selected models, a leave-many-out (LMO) validation is conducted wherein 1000-10,000 randomly selected sets of 2.5% to 10% of the chemicals are removed from the total descriptor set, the n-x% model is derived, and the activity of the removed chemicals is then predicted using the n-x% model.
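The leave-one-out procedure can be sketched in a few lines of Python. The toy learning set, the rule thresholds, and the helper names build_model, predict, and leave_one_out are all hypothetical, and the model-building and prediction steps are simplified stand-ins for the corresponding cat-SAR steps.

```python
def build_model(learning_set, min_compounds=2, min_proportion=0.75):
    """learning_set is a list of (fragments, is_active) pairs. Returns
    fragment -> (n_active, n_inactive) for fragments passing both rules."""
    counts = {}
    for frags, is_active in learning_set:
        for f in frags:
            a, i = counts.get(f, (0, 0))
            counts[f] = (a + 1, i) if is_active else (a, i + 1)
    return {
        f: (a, i)
        for f, (a, i) in counts.items()
        if a + i >= min_compounds and max(a, i) / (a + i) >= min_proportion
    }


def predict(frags, model):
    """Pooled fraction of actives behind matched fragments, or None."""
    matched = [model[f] for f in frags if f in model]
    if not matched:
        return None
    return sum(a for a, _ in matched) / sum(a + i for a, i in matched)


def leave_one_out(learning_set, **rules):
    """Predict each compound from a model built on the other n-1."""
    results = []
    for k, (frags, is_active) in enumerate(learning_set):
        rest = learning_set[:k] + learning_set[k + 1:]
        results.append((is_active, predict(frags, build_model(rest, **rules))))
    return results


# Hypothetical learning set: (fragments, experimentally active?).
learning_set = [
    ({"N=O", "CCN"}, True),
    ({"N=O", "CCO"}, True),
    ({"N=O"}, True),
    ({"CCO", "CO"}, False),
    ({"CO", "CCN"}, False),
    ({"CO"}, False),
    ({"CCC"}, True),  # unique fragment: no prediction once it is held out
]
loo_results = leave_one_out(learning_set)
for is_active, probability in loo_results:
    print(is_active, probability)
```

A leave-many-out run would follow the same pattern, holding out a random subset of compounds on each of many repetitions instead of a single compound.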

Calculated concordance, sensitivity, and specificity between experimental and predicted values are then used to judge the predictivity of the models.

To consider concordance, sensitivity, and specificity, each compound’s probabilistic activity value is converted back to an active or inactive category value using a cut-off point derived from the LOO validations [6]. Depending on the application of the model, the cut-off point can be adjusted wherein a model with the best overall concordance can be selected (i.e., a most predictive model), one with near equal sensitivity and specificity (i.e., a balanced model), or one with high sensitivity (i.e., a risk-averse model). For these exercises, the cut-off point will typically be selected for models that have high overall concordance and near equal sensitivity and specificity.
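For illustration, the three statistics and a simple cut-off search can be written as the Python sketch below. Concordance is taken as the fraction of all predicted compounds called correctly, sensitivity as the fraction of experimental actives called active, and specificity as the fraction of experimental inactives called inactive. The selection criterion for a balanced cut-off (smallest gap between sensitivity and specificity, ties broken by higher concordance), the candidate cut-offs, and the example probabilities are assumptions, not the procedure used in [6].

```python
def classification_metrics(results, cutoff):
    """results is a list of (is_active, probability) pairs for compounds that
    received a prediction. A probability at or above the cutoff is called
    active. Returns (concordance, sensitivity, specificity)."""
    tp = fn = tn = fp = 0
    for is_active, probability in results:
        called_active = probability >= cutoff
        if is_active and called_active:
            tp += 1
        elif is_active:
            fn += 1
        elif called_active:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one active and one inactive compound")
    concordance = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return concordance, sensitivity, specificity


def balanced_cutoff(results, candidates):
    """Pick the candidate with the smallest sensitivity/specificity gap,
    breaking ties in favor of higher concordance (an assumed criterion)."""
    def score(cutoff):
        conc, sens, spec = classification_metrics(results, cutoff)
        return (abs(sens - spec), -conc)
    return min(candidates, key=score)


# Hypothetical LOO output: (experimentally active?, predicted probability).
results = [
    (True, 0.9), (True, 0.7), (True, 0.4),
    (False, 0.6), (False, 0.3), (False, 0.1),
]
best = balanced_cutoff(results, [0.2, 0.5, 0.8])
print(best, classification_metrics(results, best))
```

A risk-averse model would instead favor a lower cut-off, accepting more false positives in exchange for higher sensitivity.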


In general, SAR technology has been successfully applied to a number of human and environmental health endpoints including, but not limited to, cancer, genotoxicity, developmental toxicity, and chemical allergy. A more detailed list of cat-SAR models is included below in the Materials and Methods section, and a brief description of the application of SAR to one specific human health effect, cancer, is now provided.


In the case of chemical carcinogenesis, the rodent cancer bioassay has proven to be a useful tool in our understanding of the causes and cures of the disease. There are sufficient physiological, biochemical, and metabolic similarities between rodents and humans that allow a high probability that carcinogenicity results obtained in rodents will be predictive of similar effects in humans [15-17]. A complete 2-year cancer bioassay as conducted by the National Toxicology Program (NTP) including planning, evaluation, and review takes about five years to complete, costs between $2-4 million, and uses 400 animals [18]. There are currently 538 technical reports by the NTP for rodent carcinogenicity using its standardized bioassay [19]. Concurrent with the NTP, the Carcinogenic Potency Database (CPDB) analyzes and consolidates the world’s diverse literature and NTP reports of chronic long-term animal cancer bioassays into a single resource [20]. To date, analyses of 6540 experiments on 1547 chemicals are available on the CPDB’s web site [21] and the Environmental Protection Agency’s (EPA’s) Distributed Structure-Searchable Toxicity (DSSTox) Database Network [22].

However, although over 1500 chemicals have been tested at great expense for cancer in rodents, there are 75,000 industrial chemicals on the Toxic Substances Control Act’s Chemical Substance Inventory [23], the National Institute of Environmental Health Sciences estimates that there are over 80,000 chemicals registered for use in the United States [24], and there are also 125,000 chemicals on the European Chemicals Agency list [25]. From a public health perspective, it is evident that not all chemicals in use today will be tested in vivo for carcinogenesis, making the application of various predictive modeling techniques, including SAR, logical. From a medical perspective, given the relationships between the causes and cures of cancer and the cost of each cancer bioassay, using SAR modeling to generate testable hypotheses for cancer mechanisms, develop novel anticancer drugs, and prescreen non-cancer drugs for inadvertent carcinogenic activity is also logical. To these points, computational SAR models have gained recent acceptance in the regulatory community for both human health [26] and ecological endpoints [27], the Food and Drug Administration’s (FDA’s) Center for Drug Evaluation and Research (CDER) is actively using several SAR expert systems to explore their application in drug evaluations [8, 28, 29], and the EPA has established a center for computational toxicology [30].


Consulting Service

Gnarus Systems offers predictive toxicological assessment of chemicals, SAR model development, and virtual screening on an ad hoc basis.

Direct Access

For larger projects, Gnarus Systems can provide users with secure cloud-based access to the cat-SAR expert system along with validated predictive models.

Complete System

Similar to "Direct Access" with the added advantage that users can update or alter existing models as well as produce models with user-specific data. In this case, users can develop, validate, and use models based on private or proprietary data.