|
Data pertinent to drug discovery typifies complex, high-dimensionality data. Regardless of whether
one considers the chemical or the biological domain, the entities themselves are extraordinarily
complex, and their interactions are even more so. "Chemistry space" is itself huge, and descriptors
capable of usefully discriminating molecular functionality frequently have dimensionalities from
10^3 to 10^7 or greater.
There are several approaches to dealing with high-dimensionality data. These include methods that
explicitly ignore interactions among variables, ranking individual inputs by their information content
and creating models using only the most information-rich inputs. Others explicitly reduce the
dimensionality of the data and look at interactions among a limited number of variables, often on a
statistical basis.
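
The cost of ranking inputs one at a time can be illustrated with a minimal sketch. The toy data set, the XOR target, and the helper name `mutual_information` are assumptions chosen for illustration only; they are not taken from any particular drug-discovery method. Each input scored alone carries no information about the target, yet two of the inputs taken together determine it completely.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


# Toy data (hypothetical): every combination of three binary inputs.
rows = list(product([0, 1], repeat=3))
# Assumed target: depends on inputs 0 and 1 only through their interaction (XOR).
target = [a ^ b for a, b, _ in rows]

columns = list(zip(*rows))

# Univariate ranking: score each input in isolation.
scores = [mutual_information(col, target) for col in columns]

# Joint score for inputs 0 and 1 considered together.
pair_score = mutual_information(list(zip(columns[0], columns[1])), target)

print("individual scores:", scores)
print("joint score for inputs 0 and 1:", pair_score)
```

In this constructed case a ranking-based filter would have no reason to prefer inputs 0 and 1 over the irrelevant input 2, because all three score the same individually.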
Still other methods are "greedy," testing input variable interactions and either accepting or rejecting
them. Once rejected, a candidate is never revisited for possible recombination with other, as yet
unconsidered, inputs, so the resulting solutions may be sub-optimal.
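
A small sketch shows how a single-pass greedy procedure can miss a useful combination. It assumes a simple accept-or-reject rule based on information gain. The function names (`subset_score`, `greedy_one_pass`), the gain threshold, and the toy XOR data are hypothetical and stand in for whatever scoring a real method would use.

```python
from collections import Counter
from itertools import product
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    joint, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )


def subset_score(columns, subset, target):
    """Information that the chosen inputs, taken jointly, carry about the target."""
    if not subset:
        return 0.0
    combined = list(zip(*(columns[i] for i in subset)))
    return mutual_information(combined, target)


def greedy_one_pass(columns, target, min_gain=1e-9):
    """Test each input once; accept it only if it improves the current score.

    Rejected inputs are never reconsidered, which is the behavior being illustrated.
    """
    accepted, rejected, best = [], [], 0.0
    for i in range(len(columns)):
        score = subset_score(columns, accepted + [i], target)
        if score - best > min_gain:
            accepted.append(i)
            best = score
        else:
            rejected.append(i)
    return accepted, rejected, best


# Toy data (hypothetical): three binary inputs, target is the XOR of inputs 0 and 1.
rows = list(product([0, 1], repeat=3))
target = [a ^ b for a, b, _ in rows]
columns = list(zip(*rows))

accepted, rejected, greedy_best = greedy_one_pass(columns, target)
pair_best = subset_score(columns, [0, 1], target)

print("greedy accepted:", accepted, "rejected:", rejected, "score:", greedy_best)
print("score of inputs 0 and 1 together:", pair_best)
```

Here input 0 is rejected before input 1 has been considered, so the pair that fully determines the target is never evaluated.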
Beyond the problems inherent in the analysis of high-dimensionality data, there is another equally
important consideration. Often, data are available from multiple domains. Typically these data are analyzed
in a domain-specific way: chemistry data are analyzed in isolation from biological data, and vice versa.
We believe that this approach ignores the extent to which analysis in one domain can usefully inform the
analysis in another. This is another example of interactions among inputs, but one that is frequently
overlooked.
|