Theory of Operation
QTube is based upon a high-throughput computational representation of
list-mode data created by Cira Discovery Sciences, called Cytometric
Fingerprinting. The following section is excerpted from Rogers WT,
Moser AR, Holyst HA, Mohler E, Bantly A, Moore J, Cytometric
Fingerprinting: Quantitative Characterization of Multivariate
Distributions, submitted to Cytometry A (2007).
Cytometric Fingerprinting Overview
The objective of Cytometric Fingerprinting (CF) is to represent the
information in cytometric list-mode data in a form that enables
quantitative comparison among samples. These fingerprints are
capable of capturing and encoding the full multivariate correlations of
complex, high dimensional cytometric data. This representation is
particularly useful when cell populations are not clearly delineated by
optimized assays and the distribution of events in the multi-parameter
space is not bimodal.
Each event in list-mode data is described by a vector of coordinates in
a multidimensional space. Thus, a complete mathematical
description of a sample is the multivariate probability distribution
function defining the density of events in this space. This
distribution may be approximated by dividing the space into small
volume elements, counting the number of events in each volume element,
and normalizing the count by the total number of events in the
sample. In the limit of an infinite number of events, the regions
may be made infinitesimally small yielding the true probability
distribution function. Of course, it is impossible to collect an
infinite number of events, so the question of interest is, how does one
accurately estimate the true probability distribution from a finite
sample of events? Equally important, how does one represent this
approximation of the multivariate probability density function in a
form amenable to comparing disparate samples?
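The counting procedure described above can be sketched in a few lines. This is a minimal illustration with NumPy, not the QTube implementation: the function name binned_density, the synthetic Gaussian events, and the choice of 8 bins per axis are all assumptions made for the example.

```python
import numpy as np

def binned_density(events, bins_per_axis, ranges):
    """Approximate the event distribution by the fraction of events
    that fall in each fixed-width volume element."""
    counts, edges = np.histogramdd(events, bins=bins_per_axis, range=ranges)
    fractions = counts / len(events)   # normalize by the total number of events
    return fractions, edges

# Synthetic stand-in for three-parameter list-mode data (assumption).
rng = np.random.default_rng(0)
events = rng.normal(size=(10_000, 3))
ranges = [(-4.0, 4.0)] * 3

fractions, edges = binned_density(events, bins_per_axis=8, ranges=ranges)
empty_share = np.mean(fractions == 0)  # share of volume elements with no events
print(fractions.shape, round(float(fractions.sum()), 4), round(float(empty_share), 3))
```

Even with only three parameters and 512 volume elements, a noticeable share of the elements is empty, which is the finite-sample problem the following paragraphs address.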
The first question, that of estimating the probability density function
(PDF) from a finite sample of events, has been a subject of research
since the early 20th century (1). The most common non-parametric
means of estimating a PDF is a histogram where space is divided into
equal width bins. For a complex (rapidly changing) PDF, one would
like to choose small bins in order to accurately track the variation
with respect to independent variables (low bias). On the other
hand, one would like to choose bins of sufficient size to contain a
large number of events in order to estimate the value of the density
within a bin with high accuracy (low variance). This trade-off
between number of bins and bin size is the classic bias-variance
dilemma. For one independent variable and reasonably sized
datasets, it is not difficult to balance the bias-variance
requirements. However, for multidimensional data the curse of
dimensionality is a severe limitation. Choosing bins of fixed
width gives control over bias, but the problem of empty (or highly
populated) bins, depending on the nature of the distribution, means
that there is no control over the variance. An alternative approach is
to control the variance by choosing bins that contain equal numbers of
events. (This strategy is particularly useful if the ultimate
goal is to utilize bin event densities as features in classification
since the measurement accuracy for each feature should be the
same.) In the case of univariate data, there is a unique set of bin
boundaries that accomplishes this goal (2). For multivariate
data, however, there is not a unique solution. While this indeterminacy
might seem like a disadvantage, in fact, it creates an opportunity to
find a specific set of bin boundaries that does a superior job of
reducing bias.
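For the univariate case, the contrast between fixed-width bins and equal-population bins can be illustrated as follows. This is a sketch under assumptions: the skewed synthetic data, the bin count of 16, and the helper name equal_population_edges are chosen only for the example.

```python
import numpy as np

def equal_population_edges(values, n_bins):
    """Bin boundaries at evenly spaced quantiles, so that each bin
    holds approximately the same number of events."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))

# Skewed synthetic data (assumption), where fixed-width bins behave poorly.
rng = np.random.default_rng(1)
values = rng.lognormal(mean=0.0, sigma=1.0, size=8_000)

width_edges = np.linspace(values.min(), values.max(), 17)   # 16 equal-width bins
pop_edges = equal_population_edges(values, 16)               # 16 equal-population bins

width_counts, _ = np.histogram(values, bins=width_edges)
pop_counts, _ = np.histogram(values, bins=pop_edges)

print("equal width:     ", width_counts.min(), "to", width_counts.max(), "events per bin")
print("equal population:", pop_counts.min(), "to", pop_counts.max(), "events per bin")
```

The equal-width bins range from nearly empty to heavily populated, while the quantile-based bins each hold about the same number of events, so every bin's density estimate has comparable statistical accuracy.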
Other methods of representing and analyzing multidimensional flow
cytometry data have been developed (3-6). The one most closely
related to the present work is Probability Binning (PB)
(7). PB represents a multidimensional probability distribution as
a set of bins defining regions of the multidimensional space. The
boundaries of these bins are chosen so that approximately equal numbers
of events lie in each bin. Bins are found by selecting a
coordinate dimension, determining the median in that coordinate, and
dividing the data at the median value. In PB, the axis selection
is made by calculating the variance of the data in the parent bin for
each of the original coordinate dimensions and choosing the one
dimension having the largest variance. Although the decision is
made on the basis of the variance in each dimension, the split is not
necessarily along the optimal direction since the direction of maximum
variance may not coincide with one of the coordinate axes.
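The PB procedure as described in the paragraph above (pick the coordinate with the largest variance in the parent bin, split at its median, recurse) might be sketched like this. It is an illustration of the description given here, not the published PB code; the function name, the bit-encoded bin labels, and the synthetic data are assumptions.

```python
import numpy as np

def probability_binning(events, depth):
    """Assign each event to one of 2**depth bins by recursive median splits.

    At each step the parent bin is split at the median of the original
    coordinate that has the largest variance within that bin.
    """
    labels = np.zeros(len(events), dtype=int)

    def split(index, level):
        if level == depth or len(index) < 2:
            return
        subset = events[index]
        axis = np.argmax(subset.var(axis=0))        # coordinate with largest variance
        threshold = np.median(subset[:, axis])
        upper = subset[:, axis] > threshold
        # Record the side of this split as one bit of the bin label.
        labels[index[upper]] |= 1 << (depth - level - 1)
        split(index[~upper], level + 1)
        split(index[upper], level + 1)

    split(np.arange(len(events)), 0)
    return labels

# Synthetic three-parameter data with unequal spread per axis (assumption).
rng = np.random.default_rng(2)
events = rng.normal(size=(4_096, 3)) * [3.0, 1.0, 0.5]
labels = probability_binning(events, depth=4)
counts = np.bincount(labels, minlength=16)
print(counts)
```

Every split is made perpendicular to one of the original coordinate axes, which is the limitation noted above when the direction of maximum variance lies between axes.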
CF differs from PB in three important ways:
(i) CF forms bins by
splitting the data in the direction of maximum variance rather than
along the original coordinate axes. This involves first
determining the direction of maximum variance and then rotating the
data space such that the principal coordinate axis lies in the
direction of maximum variance.
(ii) CF creates a hierarchical, multi-resolution
representation of the data. This is done by retaining and utilizing
information for bins at each level of recursion.
(iii) CF utilizes the binned data to develop a
fingerprint that is a one-dimensional representation embodying the
information contained in the multi-resolution, multidimensional
representation.
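Point (i) can be illustrated with a single split. The sketch below finds the direction of maximum variance as the leading eigenvector of the covariance matrix and splits at the median of the projection onto it. The use of an eigendecomposition, the function names, and the correlated two-parameter test data are assumptions for illustration; the source does not specify how the direction is computed.

```python
import numpy as np

def principal_direction(events):
    """Unit vector along the direction of maximum variance of the events."""
    covariance = np.cov(events, rowvar=False)
    eigenvalues, eigenvectors = np.linalg.eigh(covariance)   # ascending eigenvalues
    return eigenvectors[:, -1]

def split_along_principal_direction(events):
    """Split the events at the median of their projection onto the
    direction of maximum variance."""
    direction = principal_direction(events)
    projection = (events - events.mean(axis=0)) @ direction
    threshold = np.median(projection)
    return projection > threshold, direction

# Two strongly correlated parameters (assumption): the direction of
# maximum variance lies between the coordinate axes.
rng = np.random.default_rng(3)
x = rng.normal(size=2_000)
events = np.column_stack([x, x + 0.3 * rng.normal(size=2_000)])

upper, direction = split_along_principal_direction(events)
projected_variance = np.var(events @ direction)
axis_variance = events.var(axis=0).max()
print(direction, projected_variance, axis_variance)
```

For data like this, the variance along the computed direction exceeds the variance along either original axis, so a median split in that direction separates the events more effectively than an axis-aligned split.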
Additionally, CF includes novel algorithms for finding and representing
bins from one data set and utilizing this bin representation to process
a second data set. It also includes a novel method of forming a
differential fingerprint that represents the degree of dissimilarity of
a given instance to two or more classes of instances.
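One way to picture how bins found on one data set can be reused on a second, and how the multi-resolution bins can be flattened into a one-dimensional vector, is sketched below. The storage format (a dict of direction and threshold per node), the function names, and the choice to concatenate per-level event fractions are all assumptions for illustration; the excerpt does not specify CF's actual bin representation or fingerprint layout.

```python
import numpy as np

def fit_bins(events, depth):
    """Learn a binary tree of (direction, threshold) splits from one data set.

    Each node splits its events at the median of their projection onto the
    direction of maximum variance. Keys are (level, bin index).
    """
    model = {}

    def fit(subset, level, bin_index):
        if level == depth:
            return
        covariance = np.cov(subset, rowvar=False)
        _, eigenvectors = np.linalg.eigh(covariance)
        direction = eigenvectors[:, -1]
        projection = subset @ direction
        threshold = np.median(projection)
        model[(level, bin_index)] = (direction, threshold)
        upper = projection > threshold
        fit(subset[~upper], level + 1, 2 * bin_index)
        fit(subset[upper], level + 1, 2 * bin_index + 1)

    fit(events, 0, 0)
    return model

def apply_bins(model, events, depth):
    """Route events through the stored splits; return bin indices per level."""
    bin_index = np.zeros(len(events), dtype=int)
    levels = []
    for level in range(depth):
        upper = np.zeros(len(events), dtype=bool)
        for node in np.unique(bin_index):
            direction, threshold = model[(level, int(node))]
            members = bin_index == node
            upper[members] = events[members] @ direction > threshold
        bin_index = 2 * bin_index + upper
        levels.append(bin_index.copy())
    return levels

def fingerprint(levels):
    """Concatenate the per-level bin fractions into one 1-D vector."""
    parts = [np.bincount(idx, minlength=2 ** (k + 1)) / len(idx)
             for k, idx in enumerate(levels)]
    return np.concatenate(parts)

# Synthetic reference and shifted sample data (assumption).
rng = np.random.default_rng(4)
scale = [3.0, 1.0, 0.5]
reference = rng.normal(size=(4_096, 3)) * scale
sample = rng.normal(size=(2_000, 3)) * scale + [1.5, 0.0, 0.0]

model = fit_bins(reference, depth=3)
fp_ref = fingerprint(apply_bins(model, reference, depth=3))
fp_new = fingerprint(apply_bins(model, sample, depth=3))
print(np.round(fp_ref, 3))
print(np.round(fp_new, 3))
```

The reference data fill the bins evenly at every level, while the shifted sample does not, so the two vectors can be compared element by element.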
Cytometric Fingerprints in QTube
Basically, QTube computes fairly low-resolution fingerprints, based
only on selected parameters common to a panel of tubes. It does
not use the hierarchical nature of CF, but only the highest-resolution
portion of the fingerprint. We then represent the fingerprint as
deviations from the expected value and compute metrics from this
representation. These are the numbers that appear in the QTube
report. Simple, really.
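As a rough picture of "deviations from the expected value", the sketch below assumes equally populated reference bins, so the expected fraction of events in each of n bins is 1/n. The relative-deviation formula and the root-mean-square summary are hypothetical choices made for illustration; the metrics QTube actually reports are not specified here.

```python
import numpy as np

def deviation_fingerprint(bin_index, n_bins):
    """Relative deviation of each bin's event fraction from the
    expected value 1 / n_bins (assumes equal-population reference bins)."""
    observed = np.bincount(bin_index, minlength=n_bins) / len(bin_index)
    expected = 1.0 / n_bins
    return (observed - expected) / expected

def rms_deviation(deviations):
    """Hypothetical summary metric: root-mean-square deviation across bins."""
    return float(np.sqrt(np.mean(deviations ** 2)))

# Bin indices for two made-up tubes with 8 bins and 800 events each.
uniform = np.repeat(np.arange(8), 100)   # every bin holds exactly 1/8 of the events
skewed = np.concatenate([np.zeros(450, dtype=int), np.repeat(np.arange(1, 8), 50)])

uniform_dev = deviation_fingerprint(uniform, 8)
skewed_dev = deviation_fingerprint(skewed, 8)
print(rms_deviation(uniform_dev), rms_deviation(skewed_dev))
```

A tube whose events fill the bins as expected scores zero, and the score grows as events pile up in some bins at the expense of others.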
References
1. Sturges, H.A. (1926). Journal of the American Statistical Association 21, 65-66.
2. Roederer, M., Treister, A., Moore, W. & Herzenberg, L.A. (2001). Cytometry 45, 37-46.
3. Murphy, R.F. (1985). Cytometry 6, 302-309.
4. Robinson, J.P., Durack, G. & Kelley, S. (1991). Cytometry 12, 82-90.
5. Robinson, J.P., Ragheb, K., Lawler, G., Kelley, S. & Durack, G. (1992). Cytometry 13, 75-82.
6. Lugli, E., Pinti, M., Nasi, M., Troiano, L., Ferraresi, R., Mussi, C., Salvioli, G., Patsekin, V., Robinson, J.P., Durante, C. et al. (2007). Cytometry A 71, 334-344.
7. Roederer, M., Moore, W., Treister, A., Hardy, R.R. & Herzenberg, L.A. (2001). Cytometry 45, 47-55.