High-dimensional data are everywhere, from finance to biology to climatology, and the statistical and computational methods for analysing them are developing at a rapid pace. Metabolomics has emerged as a promising research field in the post-genomic era, and multiplexed, high-throughput metabolomic platforms typically measure tens to hundreds of metabolites per sample. As with other high-dimensional data, metabolomic data exhibit the p >> n problem; that is, the number of independent samples or patients is substantially smaller than the number of measured metabolites. These data also suffer from high, and often spurious, collinearity among the metabolites, which can lead to overfitting and model misidentification. High-dimensional statistical inference is nevertheless possible because of sparsity: the regression function is assumed to lie on a low-dimensional manifold, so many model parameters are exactly zero. Variable selection is therefore critical for improving estimation accuracy and model interpretability and for reducing computational cost.

In this project, we will compare the performance of four variable (feature) selection methods currently used with high-dimensional metabolomic data: t-tests comparing group means (with a liberal p-value threshold of 0.3), and three penalized likelihood methods, namely the lasso (L1 penalty), the elastic net (a linear combination of L1 and L2 penalties), and the smoothly clipped absolute deviation (SCAD) penalty. Performance will be assessed on simulated data, with the area under the receiver-operating characteristic curve used to quantify the predictive value of the selected models. Real data sets (pancreatic and colorectal cancer) will also be used to illustrate these selection methods. The results of the project will be critical to substantially improving model selection strategies for metabolomic data.
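For reference, the three penalties have the following standard forms; the notation here (tuning parameters lambda, lambda_1, lambda_2 and a) is the usual one from the penalized likelihood literature and is supplied only for illustration. The SCAD penalty is most compactly written through its derivative, as in Fan and Li (2001):

\[
\text{lasso: } p_\lambda(\beta) = \lambda \sum_{j=1}^{p} |\beta_j|,
\qquad
\text{elastic net: } p_{\lambda_1,\lambda_2}(\beta) = \lambda_1 \sum_{j=1}^{p} |\beta_j| + \lambda_2 \sum_{j=1}^{p} \beta_j^2,
\]
\[
\text{SCAD: } p'_\lambda(\theta) = \lambda \left\{ I(\theta \le \lambda) + \frac{(a\lambda - \theta)_+}{(a-1)\lambda}\, I(\theta > \lambda) \right\},
\qquad \theta > 0,\; a > 2.
\]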
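To make the planned comparison concrete, below is a minimal, illustrative sketch in Python; it assumes numpy, scipy, and scikit-learn and is not the project's actual pipeline. Scikit-learn does not implement the SCAD penalty, so the sketch runs the t-test screen, the lasso, and the elastic net, and marks where a SCAD solver would slot in. All simulation settings (n = 100, p = 500, 10 signal metabolites) are arbitrary choices for illustration.

    import numpy as np
    from scipy import stats
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)

    # Simulate p >> n data: n = 100 samples, p = 500 metabolites,
    # the first k = 10 of which carry signal.
    n, p, k = 100, 500, 10
    X = rng.normal(size=(n, p)) + 0.5 * rng.normal(size=(n, 1))  # shared factor -> collinearity
    beta = np.zeros(p)
    beta[:k] = 1.0
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta)))

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

    # 1) Univariate t-test screen with the liberal p < 0.3 cut-off.
    pvals = np.array([stats.ttest_ind(X_tr[y_tr == 0, j], X_tr[y_tr == 1, j]).pvalue
                      for j in range(p)])
    selections = {"t-test screen": pvals < 0.3}

    # 2) Lasso (L1) and 3) elastic net (L1 + L2) via penalized logistic
    #    regression; nonzero coefficients define the selected metabolites.
    lasso = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
    enet = LogisticRegression(penalty="elasticnet", solver="saga", C=0.1,
                              l1_ratio=0.5, max_iter=5000)
    for name, model in [("lasso", lasso), ("elastic net", enet)]:
        model.fit(X_tr, y_tr)
        selections[name] = model.coef_.ravel() != 0
    # 4) SCAD would slot in here; it needs a specialized solver outside scikit-learn.

    # Refit a logistic model on each selected subset and score it by test-set AUC.
    for name, sel in selections.items():
        if not sel.any():
            continue
        refit = LogisticRegression(max_iter=5000).fit(X_tr[:, sel], y_tr)
        auc = roc_auc_score(y_te, refit.predict_proba(X_te[:, sel])[:, 1])
        print(f"{name}: {int(sel.sum())} metabolites selected, test AUC = {auc:.2f}")

Refitting on each selected subset is one common way to keep the comparison about selection quality rather than shrinkage, mirroring how the AUC criterion above evaluates the selected models rather than the penalized fits themselves.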

Investigators: Danny Lu (BSc student); Alex de Leon, PhD; Karen Kopciuk, PhD