Bayesian nonparametric priors for genomic variant discoveries

Species sampling models are popular Bayesian nonparametric priors introduced by Pitman (1996) to model a population of animals composed of different species with unknown proportions. Their predictive structure has been widely investigated, and they have proved to be a powerful tool for prediction. In this talk we focus on feature sampling models, which generalize species sampling models by allowing each individual to belong to more than one species, now called features. Predictive inference in this setting is relevant in many applied contexts, for example in genomics, where the goal is to predict the number of hitherto unseen features (genomic variants). We investigate two classes of Bayesian nonparametric priors for feature sampling models and shed light on their behaviour in terms of predictive inference. We first focus on the popular class of completely random measures (CRMs), which includes the three-parameter Beta process, and we show that, for fixed prior parameters, all CRMs lead to a Poisson posterior distribution for the number of unseen features, which depends on the sampling information only through the sample size. To enrich this predictive structure, we then investigate the class of Scaled Process (SP) priors (James et al., 2015). In particular, we introduce the Stable-Beta Scaled Process (SB-SP) prior and show that it enriches the posterior distribution of the number of unseen features arising under CRM priors while maintaining analytical tractability and interpretability. Specifically, the SB-SP prior leads to a negative Binomial posterior distribution for the number of unseen features, which depends on the sampling information through both the sample size and the number of distinct features. The proposed approach turns out to be simple and computationally efficient.
We apply our Bayesian nonparametric (BNP) proposal to synthetic data and to real cancer genomic data, showing that: i) it outperforms the most popular parametric and nonparametric competitors in terms of estimation accuracy; ii) it provides improved coverage compared with a BNP approach under CRM priors.
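
As a rough illustration of why a negative Binomial posterior can give wider intervals than a Poisson one, the minimal sketch below compares 95% central intervals for two distributions with the same mean. The mean of 50 and the shape parameter r = 10 are hypothetical values chosen for illustration only; they are not the posterior parameters of the CRM or SB-SP models, which this abstract does not state.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability mass at k, computed on the log scale for stability."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))


def neg_binomial_pmf(k, r, p):
    """Negative Binomial mass at k (failures before r successes, success prob p)."""
    log_coef = math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
    return math.exp(log_coef + r * math.log(p) + k * math.log(1.0 - p))


def central_interval(pmf, level=0.95, max_k=100_000):
    """Smallest central interval [lower, upper] with equal tail mass outside it."""
    tail = (1.0 - level) / 2.0
    cdf, lower = 0.0, None
    for k in range(max_k):
        cdf += pmf(k)
        if lower is None and cdf >= tail:
            lower = k
        if cdf >= 1.0 - tail:
            return lower, k
    raise ValueError("max_k too small for the requested level")


def compare_intervals(mean, r):
    """Intervals for a Poisson and a negative Binomial sharing the same mean."""
    # Negative Binomial mean is r * (1 - p) / p, so p = r / (r + mean).
    p = r / (r + mean)
    poisson_iv = central_interval(lambda k: poisson_pmf(k, mean))
    nb_iv = central_interval(lambda k: neg_binomial_pmf(k, r, p))
    return poisson_iv, nb_iv


# Hypothetical values, for illustration only.
poisson_iv, nb_iv = compare_intervals(mean=50.0, r=10.0)
print("Poisson 95% interval:          ", poisson_iv)
print("negative Binomial 95% interval:", nb_iv)
```

With equal means, the negative Binomial has variance mean / p, which exceeds the Poisson variance (equal to the mean), so its interval is wider.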

Authors: Tamara Broderick and Stefano Favaro