Three-dimensional cluster resolution for guiding automatic chemometric model optimization.

PMID 23200385


A three-dimensional extension of a previously developed metric termed cluster resolution is presented. The cluster resolution metric considers confidence ellipses (here three-dimensional confidence ellipsoids) around clusters of points in principal component or latent variable space. Cluster resolution is defined as the maximum confidence limit at which the confidence ellipses do not overlap, and it can serve to guide automated variable selection processes. Previously, this metric has been used to guide variable selection in a two-dimensional projection of data. In this study, the metric is refined to simultaneously consider the shapes of clusters of points in a three-dimensional space. We couple it with selectivity ratio-based variable ranking and a combined backward elimination/forward selection strategy to demonstrate its use for the automated optimization of a six-class PCA model of gasoline by vendor and octane rating. Within-class variability was artificially increased through evaporative weathering and intentional contamination of samples, making the optimization more challenging. Our approach was successful in identifying a small subset of variables (644) from the raw GC-MS chromatographic data, which comprised ≈ 2 × 10^6 variables per sample. In the final model there was clear separation between all classes. Computational time for this completely automated variable selection was 36 h, slower than solving the same problem using three two-dimensional projections, but yielding an overall better model. By simultaneously considering three dimensions instead of only two at a time, the resulting overall cluster resolution was improved.
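
The definition above (the largest confidence limit at which two ellipsoids do not overlap) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes each cluster is summarized by a mean vector and a covariance matrix in a three-dimensional score space, and that the confidence ellipsoid at level p is the set of points whose squared Mahalanobis distance is at most the chi-square quantile with 3 degrees of freedom. The abstract does not state which distribution the study used, so the chi-square scaling is an assumption. The function names `chi2_cdf_3df`, `critical_distance`, and `cluster_resolution` are hypothetical.

```python
import math
import numpy as np


def chi2_cdf_3df(x):
    """Closed-form chi-square CDF for 3 degrees of freedom."""
    return (math.erf(math.sqrt(x / 2.0))
            - math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


def critical_distance(mu1, cov1, mu2, cov2, iters=80):
    """Squared Mahalanobis radius at which the two ellipsoids first touch.

    The touching point minimizes max(d1, d2), where d1 and d2 are the squared
    Mahalanobis distances to each cluster. It lies on the curve of minimizers
    of t*d1 + (1 - t)*d2 for t in (0, 1), at the t where d1 == d2.
    """
    mu1 = np.asarray(mu1, dtype=float)
    mu2 = np.asarray(mu2, dtype=float)
    p1 = np.linalg.inv(np.asarray(cov1, dtype=float))  # precision matrices
    p2 = np.linalg.inv(np.asarray(cov2, dtype=float))

    def distances(t):
        # Minimizer of the weighted sum t*d1 + (1 - t)*d2.
        x = np.linalg.solve(t * p1 + (1.0 - t) * p2,
                            t * p1 @ mu1 + (1.0 - t) * p2 @ mu2)
        d1 = (x - mu1) @ p1 @ (x - mu1)
        d2 = (x - mu2) @ p2 @ (x - mu2)
        return d1, d2

    lo, hi = 0.0, 1.0  # at t=0 the point is mu2 (d1 > d2); at t=1 it is mu1
    for _ in range(iters):  # bisection on the sign of d1 - d2
        t = 0.5 * (lo + hi)
        d1, d2 = distances(t)
        if d1 > d2:
            lo = t
        else:
            hi = t
    d1, d2 = distances(0.5 * (lo + hi))
    return float(max(d1, d2))


def cluster_resolution(mu1, cov1, mu2, cov2):
    """Largest confidence level at which the two ellipsoids do not overlap
    (under the chi-square assumption stated above)."""
    return chi2_cdf_3df(critical_distance(mu1, cov1, mu2, cov2))


if __name__ == "__main__":
    # Two unit-variance spherical clusters whose centers are 4 units apart.
    res = cluster_resolution([0, 0, 0], np.eye(3), [4, 0, 0], np.eye(3))
    print(f"pairwise cluster resolution: {res:.4f}")
```

In a variable selection loop of the kind the abstract describes, a score like this would be recomputed after each candidate variable is added or removed, and the change kept only if the score improves. How the pairwise values are combined into an overall figure for six classes is not specified in the abstract and is left out of the sketch.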
