1 Introduction
Many real-world phenomena are best described by computer simulations. Such simulators often implement a stochastic generative process, which is based on a mechanistic model and parametrized by $\theta$. In practice, these simulators are used to generate samples of observations $x$, but the density $p(x|\theta)$ is only defined implicitly through the simulation code. Often, the generative process involves latent variables $z$ and the density

\[ p(x|\theta) = \int \mathrm{d}z \, p(x, z|\theta) \tag{1} \]
is intractable because of the integral over a large (and possibly highly structured) latent space. Without a tractable likelihood, statistical inference on the parameters $\theta$ given observed data is challenging. This problem has prompted the development of likelihood-free inference methods such as Approximate Bayesian Computation [1, 2, 3, 4] and neural density or neural density ratio estimation algorithms [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23]. Nearly all of these established methods treat the simulator as a black box and only use its capability to generate samples for specified values of $\theta$.
In Refs. [24, 25, 26] a new paradigm was introduced that exploits additional information that can be extracted from the simulation. In particular, within the simulation, where the latent variables $z$ are available, it is often possible to extract the joint likelihood ratio

\[ r(x, z|\theta_0, \theta_1) = \frac{p(x, z|\theta_0)}{p(x, z|\theta_1)} \tag{2} \]

and the joint score

\[ t(x, z|\theta_0) = \nabla_\theta \log p(x, z|\theta) \Big|_{\theta_0}, \tag{3} \]

which are conditioned on the latent variables $z$ corresponding to a particular sample $x$.
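As a concrete illustration, consider a toy simulator with a single Gaussian latent variable, in which the joint likelihood ratio and joint score can be computed alongside each sample. The model and all function names here are hypothetical, chosen for illustration only; they are not the simulator used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta, n):
    """Toy simulator: latent z ~ N(theta, 1), observed x = z + Gaussian noise."""
    z = rng.normal(theta, 1.0, size=n)
    x = z + rng.normal(0.0, 0.5, size=n)
    return x, z

def joint_log_ratio(z, theta0, theta1):
    """log p(x, z | theta0) - log p(x, z | theta1); the p(x | z) factor cancels."""
    return -0.5 * ((z - theta0) ** 2 - (z - theta1) ** 2)

def joint_score(z, theta0):
    """d/dtheta log p(x, z | theta), evaluated at theta0, for this Gaussian model."""
    return z - theta0

# "Mining gold": each simulated sample comes with its augmented data.
x, z = simulate(0.0, 1000)
r_joint = np.exp(joint_log_ratio(z, 0.0, 1.0))
t_joint = joint_score(z, 0.0)
```

In this toy model the marginal likelihood happens to be tractable as well, which is what would make it useful for validating a surrogate; in realistic simulators only the joint quantities are accessible, because they depend on $z$ rather than on the intractable integral over the latent space.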
It was then shown that certain loss functionals, which depend on the joint likelihood ratio and the joint score, are minimized by the likelihood ratio

\[ r(x|\theta_0, \theta_1) = \frac{p(x|\theta_0)}{p(x|\theta_1)}, \tag{4} \]

an otherwise intractable quantity. This motivates a family of new techniques for likelihood-free inference in which the joint likelihood ratio and joint score are used as training data for neural networks. These networks serve as surrogate models for the intractable likelihood or likelihood ratio. Experiments showed these new methods to be more sample-efficient than previously established neural density and neural density ratio estimation techniques. The authors of Refs. [24, 25, 26] coined the term “mining gold” for the process of extracting the joint likelihood ratio and joint score from the simulator: while the augmented data require some effort to extract, they are extremely valuable.
While the loss functionals originally proposed in Refs. [24, 25, 26] have the correct minima, they are not necessarily the most sample efficient. In particular, the proposed mean squared error (MSE) losses are often dominated by a few samples with large joint likelihood ratios. Here we extend and improve that original work with two new algorithms for likelihood-free inference. The key improvement is a set of new loss functions, which use an improved estimator for the cross entropy based on the joint likelihood ratio and joint score.
2 Cross-entropy estimation with augmented data
Consider the problem of estimating the likelihood ratio $r(x|\theta_0, \theta_1)$ based on samples $x \sim p(x|\theta_0)$, labeled with $y = 0$; samples $x \sim p(x|\theta_1)$, labeled $y = 1$; and the joint likelihood ratio $r(x, z|\theta_0, \theta_1)$ and joint score $t(x, z|\theta_0)$.

The familiar binary cross-entropy loss functional is defined as

\[ L_{XE}[\hat s] = -\int \mathrm{d}x \, \Big( p(x, y=0) \log\big(1 - \hat s(x)\big) + p(x, y=1) \log \hat s(x) \Big). \tag{5} \]

For balanced samples ($p(y=0) = p(y=1) = 1/2$) we have

\[ p(x, y=0) = \tfrac{1}{2} \, p(x|\theta_0) \tag{6} \]
\[ p(x, y=1) = \tfrac{1}{2} \, p(x|\theta_1), \tag{7} \]

which allows us to rewrite Eq. (5) as

\[ L_{XE}[\hat s] = -\frac{1}{2} \int \mathrm{d}x \, \Big( p(x|\theta_0) \log\big(1 - \hat s(x)\big) + p(x|\theta_1) \log \hat s(x) \Big). \tag{8} \]

It is straightforward to show that this loss functional is minimized by

\[ s^*(x) = \frac{p(x|\theta_1)}{p(x|\theta_0) + p(x|\theta_1)} = \frac{1}{1 + r(x|\theta_0, \theta_1)}. \tag{9} \]
To use the cross entropy to train a surrogate model with a finite number of samples, we need a tractable estimator for the cross entropy. The standard estimator, as used for instance in the Carl inference method [9], is given by

\[ \hat L_{XE} = -\frac{1}{N} \sum_{(x_i, y_i)} \Big( (1 - y_i) \log\big(1 - \hat s(x_i)\big) + y_i \log \hat s(x_i) \Big). \tag{10} \]

The labels $y_i$ act as an unbiased, but high-variance, estimator of $s^*(x_i)$. In the limit of infinite samples, this estimator therefore has the correct minimum of Eq. (9), but for finite sample sizes it may suffer from high variance. With the availability of the joint likelihood ratio from the simulator, the quantities $s(x_i, z_i) = 1 / (1 + r(x_i, z_i|\theta_0, \theta_1))$ are tractable and we can define the alternative estimator

\[ \hat L_{XE}' = -\frac{1}{N} \sum_{(x_i, z_i)} \Big( \big(1 - s(x_i, z_i)\big) \log\big(1 - \hat s(x_i)\big) + s(x_i, z_i) \log \hat s(x_i) \Big). \tag{11} \]

By using the exact $s(x_i, z_i)$ rather than the labels $y_i$, the samples drawn according to $p(x, z|\theta_0)$ also provide information about the second term in the loss function, and vice versa. By minimizing the loss function we get an estimator $\hat s(x)$ and thus a likelihood ratio estimator

\[ \hat r(x|\theta_0, \theta_1) = \frac{1 - \hat s(x)}{\hat s(x)}. \tag{12} \]
This defines the Alice inference method (short for “approximate likelihood with improved cross-entropy estimator”), which consists of mining the joint likelihood ratio from the simulator, training a neural network on the improved cross-entropy estimator in Eq. (11), and using this surrogate model for statistical inference on $\theta$.
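A minimal numpy sketch may clarify the difference between the two estimators. The function names are ours, and the labeling convention ($y = 1$ for samples from the second hypothesis) follows the balanced setup above:

```python
import numpy as np

def xent_standard(s_hat, y):
    """Standard cross-entropy estimator (Eq. 10): hard labels y in {0, 1}."""
    return -np.mean((1 - y) * np.log(1 - s_hat) + y * np.log(s_hat))

def xent_improved(s_hat, r_joint):
    """Improved (Alice) estimator (Eq. 11): the joint likelihood ratio
    replaces the noisy labels via s(x, z) = 1 / (1 + r(x, z))."""
    s_joint = 1.0 / (1.0 + r_joint)
    return -np.mean((1 - s_joint) * np.log(1 - s_hat) + s_joint * np.log(s_hat))

def ratio_from_s(s_hat):
    """Likelihood ratio estimator from the decision function (Eq. 12)."""
    return (1.0 - s_hat) / s_hat
```

When the joint ratio is degenerate (so that $s(x, z)$ is 0 or 1), the improved estimator reduces to the standard one; in between, every sample contributes to both terms of the loss, which is the source of the variance reduction.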
It is to be expected that a likelihood ratio estimator based on the Alice estimator for the cross entropy should outperform the Carl method, which is based on the standard cross-entropy estimator in Eq. (10). The more interesting question is how it stacks up against the Rolr technique introduced in Refs. [24, 25, 26], in which the loss function

\[ L_{ROLR}[\hat r] = \frac{1}{N} \sum_{(x_i, z_i, y_i)} \bigg( y_i \, \Big| r(x_i, z_i|\theta_0, \theta_1) - \hat r(x_i|\theta_0, \theta_1) \Big|^2 + (1 - y_i) \, \Big| \frac{1}{r(x_i, z_i|\theta_0, \theta_1)} - \frac{1}{\hat r(x_i|\theta_0, \theta_1)} \Big|^2 \bigg) \tag{13} \]

is minimized. In the limit of infinite samples it is minimized by $\hat r(x) = r(x|\theta_0, \theta_1)$. But here each event only contributes to either the squared error on $r$ or to that on $1/r$, which might lead to a higher variance.
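For comparison, here is a sketch of the Rolr squared-error loss under this form of Eq. (13), with $y = 1$ marking samples from the second hypothesis (the function name is ours):

```python
import numpy as np

def rolr_loss(r_hat, r_joint, y):
    """Squared error on r for samples with y = 1 and on 1/r for samples
    with y = 0; each event enters exactly one of the two terms."""
    return np.mean(
        y * (r_joint - r_hat) ** 2
        + (1 - y) * (1.0 / r_joint - 1.0 / r_hat) ** 2
    )
```

Note the contrast with the cross-entropy losses: a single sample with a very large joint likelihood ratio contributes quadratically here, which is why the MSE losses can be dominated by a few such samples.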
In analogy to the Cascal and Rascal methods of Refs. [24, 25, 26], we can define an additional inference method which uses the joint score, i.e. an additional piece of information that describes the local (tangential) behavior of the likelihood function. If a parameterized likelihood ratio estimator $\hat r(x|\theta, \theta_1)$ is implemented with a differentiable architecture such as a neural network, we can calculate the gradient of the output with respect to $\theta$ and thus the corresponding score

\[ \hat t(x|\theta_0) = \nabla_\theta \log \hat r(x|\theta, \theta_1) \Big|_{\theta_0} \tag{14} \]

of the estimator. For a perfect $\hat r$ (or equivalently $\hat s$) estimator, this corresponding score will also minimize the squared error loss with respect to the joint score $t(x, z|\theta_0)$, which can be extracted from the simulator [24, 25, 26]. Turning this argument around, we can use the joint score to guide the training of the estimator. This is the idea behind the Alices technique (short for “approximate likelihood with improved cross-entropy estimator and score”), which is based on the loss function

\[ L_{ALICES}[\hat s] = \hat L_{XE}' + \alpha \, \frac{1}{N} \sum_{(x_i, z_i, y_i)} (1 - y_i) \, \Big| t(x_i, z_i|\theta_0) - \hat t(x_i|\theta_0) \Big|^2. \tag{15} \]

The factor $(1 - y_i)$, which restricts the score term to samples drawn from $p(x, z|\theta_0)$, is necessary to guarantee the correct minimum of the squared error on the score. The hyperparameter $\alpha$ weights the two terms in the loss function. This loss is the natural extension of the Cascal loss function, but we expect it to reduce the variance compared to the Cascal approach for finite sample size. An interesting question is how it performs compared to the Rascal approach, which similarly augments the Rolr loss in Eq. (13) with the score term.
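Putting the pieces together, a sketch of the combined loss follows. It assumes the estimator's score `t_hat` has already been computed, for instance by automatic differentiation of the surrogate with respect to $\theta$; the function name and the default weight are ours:

```python
import numpy as np

def alices_loss(s_hat, r_joint, t_hat, t_joint, y, alpha=1.0):
    """Improved cross entropy (Eq. 11) plus a weighted squared error on the
    score (Eq. 15). The (1 - y) factor restricts the score term to samples
    drawn from the theta_0 hypothesis."""
    s_joint = 1.0 / (1.0 + r_joint)
    xent = -np.mean((1 - s_joint) * np.log(1 - s_hat) + s_joint * np.log(s_hat))
    score_mse = np.mean((1 - y) * (t_joint - t_hat) ** 2)
    return xent + alpha * score_mse
```

For a surrogate whose score matches the joint score on every sample, the second term vanishes and the loss reduces to the Alice loss; the weight trades off the two sources of training signal.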
3 Experiments
We experiment with the new methods on the particle physics problem introduced in Refs. [24, 25]. In this real-world problem, the outcome of proton-proton collisions is characterized by 42 observables, from which likelihood ratios and confidence limits on two model parameters are derived. We first consider an idealized setting that neglects the detector response and in which the likelihood function is tractable; this provides us with a ground truth that can be used to evaluate the performance of the algorithms. For a detailed description of the setup, see Ref. [25].
In addition to the Carl, Rolr, Cascal, and Rascal techniques described above, we also compare to the Sally and Sallino methods. Sally and Sallino approximate a statistical model that is accurate in the neighborhood of a reference parameter point. These methods are very sample efficient, but make approximations that limit their asymptotic performance.
Except for the new loss functions, we used the same architectures and hyperparameters as in Ref. [25]. In particular, we use fully connected networks with five hidden layers of 100 units each and the same activation functions for both approaches. For Alices we use a value of the hyperparameter $\alpha$ that was found to give good performance for the closely related Cascal method [25].
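A forward pass of such a surrogate architecture can be sketched as follows. This is an illustration only, with the activation choice as our assumption; in practice the network would be built and trained in a deep learning framework:

```python
import numpy as np

rng = np.random.default_rng(1)

def init_mlp(n_in, n_hidden=100, n_layers=5):
    """Fully connected network: five hidden layers of 100 units each."""
    sizes = [n_in] + [n_hidden] * n_layers + [1]
    return [
        (rng.normal(0.0, np.sqrt(1.0 / m), size=(m, n)), np.zeros(n))
        for m, n in zip(sizes[:-1], sizes[1:])
    ]

def forward(params, x):
    """Forward pass ending in a sigmoid, so the output is a decision
    function s_hat in (0, 1), as required by the cross-entropy losses."""
    h = x
    for W, b in params[:-1]:
        h = np.tanh(h @ W + b)  # tanh is our assumption for the activation
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(h @ W + b)))

params = init_mlp(n_in=2)
s_hat = forward(params, rng.normal(size=(8, 2)))
```

The sigmoid output keeps $\hat s$ strictly inside $(0, 1)$, so the logarithms in the cross-entropy losses and the ratio $(1 - \hat s)/\hat s$ are always well defined.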
Table 1: Expected mean squared error on the log likelihood ratio for the Histogram, Carl, Rolr, Cascal, Rascal, Alice, Alices, Sally, and Sallino strategies, evaluated for three training sample sizes.
Table 1 and Fig. 1 show the quality of the likelihood ratio estimates for training samples of various sizes and compare the new methods to the inference techniques presented in Ref. [25]. As a performance metric we use the expected mean squared error on the log likelihood ratio, as defined in Ref. [25].
Unsurprisingly, the Alice and Rolr methods clearly outperform Carl, which does not have access to the joint likelihood ratio. More significantly, we find that Alice outperforms Rolr, which does have access to the joint likelihood ratio. We conjecture that this improvement can be attributed to the lower variance of the cross-entropy loss compared to the squared error. More surprisingly, the Alice method also outperforms the Rascal method for larger training sample sizes, even though Alice does not have access to the joint score.

For smaller training sample sizes the Alices method outperforms the Alice method, which is not surprising given the additional information available during training. For larger training sample sizes, the variance of the score term actually degrades the performance of Alices compared to Alice. We did not tune the hyperparameter $\alpha$ as a function of the training sample size; such tuning should ensure that Alices performs at least as well as Alice. We leave a systematic tuning of the $\alpha$ parameter and an analysis of the sources of variance in this approach for future work.
Figure 2 shows expected exclusion contours at different confidence levels on the two parameters, assuming 36 observed events. The methods are trained on the full training samples. The left panel shows contours constructed based on asymptotic properties of the profile likelihood ratio test statistic. While methods such as Rascal are generally very accurate, with this construction they can sometimes lead to overly optimistic exclusion contours, visible as tighter bounds than the “truth” contour. We find that switching to Alice and Alices reduces this issue, but does not entirely solve it.

The right panel of Fig. 2 shows exclusion contours based on frequentist confidence intervals calibrated with toy experiments. This Neyman construction guarantees coverage: while the limits from any approach may be worse than the optimal limits, they will never be overly optimistic. As test statistic we use the likelihood ratio with respect to the maximum-likelihood estimator, which explains why the contours are generally stronger than in the left panel. We find that both Alice and Alices, like the Rascal and Cascal methods of Refs. [24, 25, 26], lead to limits that are virtually indistinguishable from the ideal limits based on the true likelihood ratio.

Finally, in Fig. 3 we show similar expected exclusion contours, but in a more realistic setup in which the parton shower and detector effects are described with approximate smearing functions, which makes the true likelihood intractable. In this situation, we cannot compare the likelihood ratio estimators to the ground truth. Instead, we show the expected contours based on the Neyman construction, similar to the right panel of Fig. 2. In the left panel we show results for limited training samples. In this setup, Alices allows for strong limits, comparable to Rascal and slightly better than Alice. The right panel demonstrates that with the full training sample the results of Rascal, Alice, and Alices are indistinguishable.
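The toy-based calibration underlying the Neyman construction can be sketched generically: for each tested parameter point, the critical value of the test statistic is the appropriate quantile of its distribution over toy experiments generated at that point. The names and the chi-squared toy distribution below are ours, for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def critical_value(toy_statistics, confidence_level=0.95):
    """Quantile of the test statistic over toys generated at the tested point."""
    return np.quantile(toy_statistics, confidence_level)

def excluded(observed_statistic, toy_statistics, confidence_level=0.95):
    """A point is excluded when the observed statistic exceeds the critical value."""
    return observed_statistic > critical_value(toy_statistics, confidence_level)

# Toy example with a chi-squared-like statistic: by construction, the tested
# point is wrongly excluded in only ~5% of repetitions, guaranteeing coverage.
toys = rng.chisquare(df=2, size=10000)
```

Coverage holds regardless of how good the likelihood ratio surrogate inside the test statistic is; a poor surrogate only weakens the resulting limits, it never makes them overly optimistic.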
4 Conclusions
In this work, we have extended recently developed inference techniques for the setting in which the likelihood is only implicitly defined through a stochastic generative model or simulator. By exploiting the joint likelihood ratio that can be extracted from the simulator, we introduced an improved cross-entropy estimator. This estimator is used to define two new likelihood-free inference techniques: Alice and Alices.
Our experiments comparing Alice and Alices with the other recently developed techniques indicate that they are significantly more sample efficient than the Rolr, Cascal, and Rascal techniques. We attribute this to the lower variance of the improved cross-entropy estimator. For smaller training sample sizes, the Sally and Sallino techniques still offer advantages.
We note that it is possible to use a hybrid of the traditional cross entropy of Eq. (5) and the improved cross entropy of Eq. (11). This would be useful in situations where one does not have access to the joint ratio, either for practical reasons or because some training samples come from real data instead of a simulation. Furthermore, we note that the improved cross-entropy estimator of Alice and Alices can be extended from the binary setting to one where samples are generated from multiple parameter points, provided the joint likelihood ratio for all pairs of parameter points is available. These joint likelihood ratios provide the necessary ingredient for importance sampling beyond the binary setting considered here.
The ubiquity of simulators and other implicit models indicates there is enormous potential for likelihoodfree inference techniques. The use of augmented data improves the sample efficiency of these techniques significantly, and these results motivate further study of variance reduction techniques that leverage this augmented data.
Acknowledgments
JB, KC, and GL are grateful for the support of the Moore-Sloan data science environment at NYU. KC and GL were supported through the NSF grants ACI-1450310 and PHY-1505463. JP was partially supported by the Scientific and Technological Center of Valparaíso (CCTVal) under Fondecyt grant BASAL FB0821. This work was supported in part through the NYU IT High Performance Computing resources, services, and staff expertise.
References
 [1] D. B. Rubin: ‘Bayesianly justifiable and relevant frequency calculations for the applied statistician’. Ann. Statist. 12 (4), p. 1151, 1984. URL https://doi.org/10.1214/aos/1176346785.
 [2] M. A. Beaumont, W. Zhang, and D. J. Balding: ‘Approximate Bayesian computation in population genetics’. Genetics 162 (4), p. 2025, 2002.
 [3] J. Alsing, B. Wandelt, and S. Feeney: ‘Massive optimal data compression and density estimation for scalable, likelihood-free inference in cosmology’, 2018. arXiv:1801.01497.
 [4] T. Charnock, G. Lavaux, and B. D. Wandelt: ‘Automatic physical inference with information maximizing neural networks’. Phys. Rev. D97 (8), p. 083004, 2018. arXiv:1802.03537.

 [5] T. Kanamori, S. Hido, and M. Sugiyama: ‘A least-squares approach to direct importance estimation’. Journal of Machine Learning Research 10 (Jul), p. 1391, 2009.
 [6] Y. Fan, D. J. Nott, and S. A. Sisson: ‘Approximate Bayesian Computation via Regression Density Estimation’. ArXiv e-prints, 2012. arXiv:1212.1479.
 [7] L. Dinh, D. Krueger, and Y. Bengio: ‘NICE: Non-linear Independent Components Estimation’. ArXiv e-prints, 2014. arXiv:1410.8516.
 [8] D. Jimenez Rezende and S. Mohamed: ‘Variational Inference with Normalizing Flows’. ArXiv e-prints, 2015. arXiv:1505.05770.
 [9] K. Cranmer, J. Pavez, and G. Louppe: ‘Approximating Likelihood Ratios with Calibrated Discriminative Classifiers’, 2015. arXiv:1506.02169.
 [10] K. Cranmer and G. Louppe: ‘Unifying generative models and exact likelihood-free inference with conditional bijections’. J. Brief Ideas, 2016.
 [11] L. Dinh, J. Sohl-Dickstein, and S. Bengio: ‘Density estimation using Real NVP’. ArXiv e-prints, 2016. arXiv:1605.08803.
 [12] G. Papamakarios and I. Murray: ‘Fast ε-free inference of simulation models with Bayesian conditional density estimation’. In ‘Advances in Neural Information Processing Systems’, p. 1028–1036, 2016.
 [13] B. Paige and F. Wood: ‘Inference Networks for Sequential Monte Carlo in Graphical Models’. ArXiv e-prints, 2016. arXiv:1602.06701.
 [14] R. Dutta, J. Corander, S. Kaski, and M. U. Gutmann: ‘Likelihood-free inference by ratio estimation’. ArXiv e-prints, 2016. arXiv:1611.10242.
 [15] B. Uria, M.-A. Côté, K. Gregor, I. Murray, and H. Larochelle: ‘Neural Autoregressive Distribution Estimation’. ArXiv e-prints, 2016. arXiv:1605.02226.
 [16] A. van den Oord, S. Dieleman, H. Zen, et al.: ‘WaveNet: A Generative Model for Raw Audio’. ArXiv e-prints, 2016. arXiv:1609.03499.
 [17] A. van den Oord, N. Kalchbrenner, O. Vinyals, L. Espeholt, A. Graves, and K. Kavukcuoglu: ‘Conditional Image Generation with PixelCNN Decoders’. ArXiv e-prints, 2016. arXiv:1606.05328.
 [18] A. van den Oord, N. Kalchbrenner, and K. Kavukcuoglu: ‘Pixel Recurrent Neural Networks’. ArXiv e-prints, 2016. arXiv:1601.06759.
 [19] M. U. Gutmann, R. Dutta, S. Kaski, and J. Corander: ‘Likelihood-free inference via classification’. Statistics and Computing, p. 1–15, 2017.
 [20] D. Tran, R. Ranganath, and D. M. Blei: ‘Hierarchical Implicit Models and Likelihood-Free Variational Inference’. ArXiv e-prints, 2017. arXiv:1702.08896.
 [21] G. Louppe and K. Cranmer: ‘Adversarial Variational Optimization of Non-Differentiable Simulators’. ArXiv e-prints, 2017. arXiv:1707.07113.
 [22] G. Papamakarios, T. Pavlakou, and I. Murray: ‘Masked Autoregressive Flow for Density Estimation’. ArXiv e-prints, 2017. arXiv:1705.07057.
 [23] G. Papamakarios, D. C. Sterratt, and I. Murray: ‘Sequential Neural Likelihood: Fast Likelihood-free Inference with Autoregressive Flows’. ArXiv e-prints, 2018. arXiv:1805.07226.
 [24] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘Constraining Effective Field Theories with Machine Learning’ , 2018. arXiv:1805.00013.
 [25] J. Brehmer, K. Cranmer, G. Louppe, and J. Pavez: ‘A Guide to Constraining Effective Field Theories with Machine Learning’ , 2018. arXiv:1805.00020.
 [26] J. Brehmer, G. Louppe, J. Pavez, and K. Cranmer: ‘Mining gold from implicit models to improve likelihoodfree inference’ , 2018. arXiv:1805.12244.