Metabolomics has emerged as a promising research field, particularly in personalized medicine, where it has been used for diagnostic, predictive and prognostic purposes. Metabolomics measures concentrations of small molecules in fluids such as blood or urine, which reflect genetic alterations, protein changes, disease outcomes, or environmental influences. Typical projection-based methods used to analyze metabolomics data, such as Partial Least Squares (PLS) regression, do not assume an underlying stochastic model for the data. As a result, this popular algorithmic modelling approach lacks formal statistical inference methods and the corresponding sample size methods. In addition, the data are often filtered prior to PLS modelling to eliminate metabolites that are not differentially abundant. To determine sample size estimates in this challenging setting, we focus on the area under the receiver operating characteristic (ROC) curve (AUC) as the downstream parameter of interest. Simulation studies explore the impact of key metabolomics data and modelling features on sample size results. Two cancer data sets used to develop diagnostic predictors illustrate our approach.
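As a rough illustration of simulation-based sample size reasoning with the AUC as the target parameter, the following Python sketch estimates how the spread of the empirical AUC shrinks as the number of subjects per group grows. It is not the authors' method: it omits PLS modelling and metabolite filtering entirely and instead assumes a single normally distributed predictor score whose mean is shifted by `effect_size` in cases. The function names, the candidate sample sizes, and the target standard deviation are all hypothetical choices made for illustration.

```python
import random
import statistics


def auc(scores_pos, scores_neg):
    """Empirical AUC via the Mann-Whitney statistic (ties count as 1/2)."""
    wins = 0.0
    for p in scores_pos:
        for n in scores_neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


def simulate_auc_spread(n_per_group, effect_size, n_sims=200, seed=0):
    """Mean and SD of the empirical AUC over repeated simulated studies.

    Assumption: one predictor score per subject, N(effect_size, 1) in cases
    and N(0, 1) in controls. This stands in for a fitted model's score.
    """
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_sims):
        cases = [rng.gauss(effect_size, 1.0) for _ in range(n_per_group)]
        controls = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        aucs.append(auc(cases, controls))
    return statistics.mean(aucs), statistics.stdev(aucs)


def smallest_n(effect_size, target_sd, candidates=(10, 20, 40, 80, 160)):
    """First candidate group size whose simulated AUC SD meets target_sd."""
    for n in candidates:
        _, sd = simulate_auc_spread(n, effect_size)
        if sd <= target_sd:
            return n
    return None  # no candidate was large enough


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        mean_auc, sd_auc = simulate_auc_spread(n, effect_size=1.0)
        print(f"n per group = {n:3d}  mean AUC = {mean_auc:.3f}  SD = {sd_auc:.3f}")
```

A fuller simulation along the lines described above would replace the single simulated score with predictions from a PLS model fitted to filtered, multivariate metabolite data, so that the effect of filtering and model features on the AUC could be examined.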

PIs: Wang Y (MSc student), Kopciuk K
Co-Is: Weljie A, Bathe O
Funding Source: NSERC, ACRI