One of the first things a scientist hears about statistics is that there are two different approaches: frequentism and Bayesianism. Despite their importance, many scientific researchers never have the opportunity to learn the distinctions between them and the different practical approaches that result. The purpose of this post is to synthesize the philosophical and pragmatic aspects of the frequentist and Bayesian approaches, so that scientists like myself might be better prepared to understand the types of data analysis people do. I'll start by addressing the philosophical distinctions between the views, and from there move to a discussion of how these ideas are applied in practice, with some Python code snippets demonstrating the difference between the approaches.

Frequentism vs. Bayesianism: a Philosophical Debate

Fundamentally, the disagreement between frequentists and Bayesians concerns the definition of probability. For frequentists, probability only has meaning in terms of a limiting case of repeated measurements. That is, if I measure the photon flux $F$ from a given star, probability statements about $F$ refer to the spread of results I would obtain over many (hypothetical) repetitions of that measurement: the true flux is a fixed quantity, and probabilities describe the frequencies of measurement outcomes.

For Bayesians, the concept of probability is extended to cover degrees of certainty about statements. Say a Bayesian claims to measure the flux $F$ of a star with some probability $P(F)$: that probability encodes the Bayesian's state of knowledge about the value of $F$ given the data at hand, rather than the long-run frequency of a repeated experiment.

The surprising thing is that this arguably subtle difference in philosophy leads, in practice, to vastly different approaches to the statistical analysis of data. Below I will give a few practical examples of the differences in approach, along with associated Python code to demonstrate the practical aspects of the resulting methods.

Frequentist and Bayesian Approaches in Practice: Counting Photons

Here we'll take a look at an extremely simple problem, and compare the frequentist and Bayesian approaches to solving it. There's necessarily a bit of mathematical formalism involved, but I won't go into too much depth or discuss too many of the subtleties. If you want to go deeper, you might consider — please excuse the shameless plug — taking a look at chapters 4-5 of our textbook.

The Problem: Simple Photon Counts

Imagine that we point our telescope to the sky, and observe the light coming from a single star. For the time being, we'll assume that the star's true flux is constant with time, i.e. that it has a fixed value $F_{\rm true}$. We'll also assume that a series of $N$ measurements is performed, where the $i^{\rm th}$ measurement reports an observed flux $F_i$ with an error $e_i$. The question is: given this set of measurements $D = \{F_i, e_i\}$, what is our best estimate of the true flux $F_{\rm true}$?

(Gratuitous aside on measurement errors: we'll make the reasonable assumption that the errors are Gaussian. In a frequentist perspective, $e_i$ describes the spread of measured values that would result from many repetitions of the $i^{\rm th}$ measurement; in a Bayesian perspective, $e_i$ describes the width of the probability distribution expressing our knowledge of that particular measurement.)

Here we'll use Python to generate some toy data to demonstrate the two approaches to the problem. Because the measurements are number counts, a Poisson distribution is a good approximation to the measurement process:
In [1]:
# Generating some simple photon count data
import numpy as np
from scipy import stats
np.random.seed(1)  # for repeatability

F_true = 1000  # true flux, say number of photons measured in 1 second
N = 50  # number of measurements
F = stats.poisson(F_true).rvs(N)  # N measurements of the flux
e = np.sqrt(F)  # errors on Poisson counts estimated via square root

Now let's make a simple visualization of the "measured" data:
In [2]:
%matplotlib inline
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.errorbar(F, np.arange(N), xerr=e, fmt='ok', ecolor='gray', alpha=0.5)
ax.vlines([F_true], 0, N, linewidth=5, alpha=0.2)
ax.set_xlabel("Flux"); ax.set_ylabel("measurement number");

These measurements each have a different error $e_i$, which we estimated above via the square root of the observed counts, as is appropriate for Poisson statistics. Let's take a look at the frequentist and Bayesian approaches to solving this.

Frequentist Approach to Simple Photon Counts

We'll start with the classical frequentist maximum likelihood approach. Given a single observation $D_i = (F_i, e_i)$, we can compute the probability of the measurement given the true flux $F_{\rm true}$, under our assumption of Gaussian errors:

$$P(D_i\,|\,F_{\rm true}) = \frac{1}{\sqrt{2\pi e_i^2}} \exp\left[\frac{-(F_i - F_{\rm true})^2}{2 e_i^2}\right]$$

This should be read "the probability of $D_i$ given $F_{\rm true}$": it is a normal distribution centered on $F_{\rm true}$ with width $e_i$.

We construct the likelihood function by computing the product of the probabilities for each data point:

$$\mathcal{L}(D\,|\,F_{\rm true}) = \prod_{i=1}^{N} P(D_i\,|\,F_{\rm true})$$

Here $D = \{D_i\}$ represents the entire set of measurements. Because the value of the likelihood can become very small, it is often more convenient to work with the log-likelihood:

$$\log\mathcal{L}(D\,|\,F_{\rm true}) = -\frac{1}{2}\sum_{i=1}^{N}\left[\log(2\pi e_i^2) + \frac{(F_i - F_{\rm true})^2}{e_i^2}\right]$$

What we'd like to do is determine the value of $F_{\rm true}$ that maximizes this likelihood. For this simple problem the maximization can be done analytically, by setting $d\log\mathcal{L}/dF_{\rm true} = 0$; the result is a weighted mean of the observations:

$$F_{\rm est} = \frac{\sum_i w_i F_i}{\sum_i w_i};\qquad w_i = 1/e_i^2$$

Notice that in the special case of all errors $e_i$ being equal, this reduces to

$$F_{\rm est} = \frac{1}{N}\sum_{i=1}^{N} F_i.$$

That is, in agreement with intuition, $F_{\rm est}$ is simply the mean of the observed data when the errors are equal.

We can go further and ask what the error of our estimate is. In the frequentist approach, this can be accomplished by fitting a Gaussian approximation to the likelihood curve at maximum; in this simple case this can also be solved analytically. It can be shown that the standard deviation of this Gaussian approximation is:

$$\sigma_{\rm est} = \left(\sum_{i=1}^{N} w_i\right)^{-1/2}$$

These results are fairly simple calculations; let's evaluate them for our toy dataset:
In [3]:
w = 1. / e ** 2
print("""
F_true = {0}
F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements)
""".format(F_true, (w * F).sum() / w.sum(), w.sum() ** -0.5, N))

F_true = 1000
F_est = 998 +/- 4 (based on 50 measurements)

We find that for 50 measurements of the flux, our estimate has an error of about 0.4% and is consistent with the input value.

Bayesian Approach to Simple Photon Counts

The Bayesian approach, as you might expect, begins and ends with probabilities. It recognizes that what we fundamentally want to compute is our knowledge of the parameters in question, i.e. in this case the probability $P(F_{\rm true}\,|\,D)$ of the true flux given the observed data. Note that this formulation of the problem is fundamentally contrary to the frequentist philosophy, which says that probabilities have no meaning for fixed model parameters like $F_{\rm true}$.

To compute this result, Bayesians next apply Bayes' Theorem, a fundamental law of probability:

$$P(F_{\rm true}\,|\,D) = \frac{P(D\,|\,F_{\rm true})\,P(F_{\rm true})}{P(D)}$$

Though Bayes' theorem is where Bayesians get their name, it is not this law itself that is controversial, but the Bayesian interpretation of probability implied by the term $P(F_{\rm true}\,|\,D)$. Let's take a look at each of the terms in this expression:

- $P(F_{\rm true}\,|\,D)$: the posterior, the probability of the model parameter given the data; this is the quantity we want to compute.
- $P(D\,|\,F_{\rm true})$: the likelihood, which is proportional to the $\mathcal{L}(D\,|\,F_{\rm true})$ used in the frequentist approach above.
- $P(F_{\rm true})$: the model prior, which encodes what we knew about the model before seeing the data.
- $P(D)$: the probability of the data, which in practice amounts to a normalization term.
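To make these terms concrete, here is a quick illustration that is not part of the original notebook: a brute-force evaluation of the unnormalized log-posterior on a grid of candidate $F_{\rm true}$ values, using the F and e arrays generated earlier and taking the prior to be flat (a choice discussed just below). Up to the grid resolution, its peak should coincide with the frequentist weighted mean computed above:

# A sketch (not from the original post): evaluate the unnormalized log-posterior
# of F_true on a grid, assuming a flat prior and the F, e arrays defined above.
F_grid = np.linspace(950, 1050, 2001)  # candidate values of F_true
log_post = -0.5 * np.sum(np.log(2 * np.pi * e[:, None] ** 2)
                         + (F[:, None] - F_grid) ** 2 / e[:, None] ** 2, axis=0)
print("grid peak:     ", F_grid[np.argmax(log_post)])
print("weighted mean: ", (F / e ** 2).sum() / (1. / e ** 2).sum())

For a one-parameter problem a grid evaluation like this is perfectly adequate; the MCMC machinery introduced below becomes valuable once the model has more dimensions.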
If we set the prior to a flat prior, $P(F_{\rm true}) \propto 1$, then the posterior is directly proportional to the likelihood, and the Bayesian probability is maximized at precisely the same value as the frequentist result! So despite the philosophical differences, we see that (for this simple problem at least) the Bayesian and frequentist point estimates are equivalent.

But What About the Prior?

You'll notice that I glossed over something here: the prior, $P(F_{\rm true})$. A frequentist will point out that the prior is problematic when no true prior information is available. Though it might seem straightforward to use a noninformative prior like the flat prior mentioned above, there are some surprising subtleties involved. It turns out that in many situations, a truly noninformative prior does not exist! Frequentists point out that the subjective choice of a prior which necessarily biases your result has no place in statistical data analysis.

A Bayesian would counter that frequentism doesn't solve this problem, but simply skirts the question. Frequentism can often be viewed as simply a special case of the Bayesian approach for some (implicit) choice of the prior: a Bayesian would say that it's better to make this implicit choice explicit, even if the choice might include some subjectivity.

Photon Counts: the Bayesian approach

Leaving these philosophical debates aside for the time being, let's address how Bayesian results are generally computed in practice. For a one-parameter problem like the one considered here, it's as simple as computing the posterior probability $P(F_{\rm true}\,|\,D)$ directly as a function of $F_{\rm true}$. But as models gain parameters, this direct approach becomes increasingly intractable, which is why Bayesian calculations in practice often rely on sampling methods such as Markov Chain Monte Carlo (MCMC).

I won't go into the details of the theory of MCMC here. Instead I'll show a practical example of applying an MCMC approach using Dan Foreman-Mackey's excellent emcee package. Keep in mind here that the goal is to generate a set of points drawn from the posterior probability distribution, and to use those points to determine the answer we seek.

To perform this MCMC, we start by defining Python functions for the prior $P(F_{\rm true})$, the likelihood $P(D\,|\,F_{\rm true})$, and the posterior $P(F_{\rm true}\,|\,D)$, noting that none of these need to be properly normalized. For numerical reasons we work with the logarithm of each:
In [4]:
def log_prior(theta):
    return 1  # flat prior

def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * e ** 2)
                         + (F - theta[0]) ** 2 / e ** 2)

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

Now we set up the problem, including generating some random starting guesses for the multiple chains of points.
In [5]:
ndim = 1       # number of parameters in the model
nwalkers = 50  # number of MCMC walkers
nburn = 1000   # "burn-in" period to let chains stabilize
nsteps = 2000  # number of MCMC steps to take

# we'll start at random locations between 0 and 2000
starting_guesses = 2000 * np.random.rand(nwalkers, ndim)

import emcee
sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].ravel()  # discard burn-in points

If this all worked correctly, the array sample should contain 50,000 points drawn from the posterior (50 walkers, each contributing 1000 post-burn-in steps). Let's plot them and check:
In [6]:
# plot a histogram of the sample
plt.hist(sample, bins=50, histtype="stepfilled", alpha=0.3, density=True)  # (normed=True in older matplotlib)

# plot a best-fit Gaussian
F_fit = np.linspace(975, 1025)
pdf = stats.norm(np.mean(sample), np.std(sample)).pdf(F_fit)
plt.plot(F_fit, pdf, '-k')
plt.xlabel("F"); plt.ylabel("P(F)")
Out[6]:
<matplotlib.text.Text at 0x1075c7510>

We end up with a sample of points drawn from the (normal) posterior distribution. The mean and standard deviation of this posterior are the counterpart of the frequentist maximum likelihood estimate and its error computed above:
In [7]:
print(""" F_true = {0} F_est = {1:.0f} +/- {2:.0f} (based on {3} measurements) """.format(F_true, np.mean(sample), np.std(sample), N)) F_true = 1000 F_est = 998 +/- 4 (based on 50 measurements) We see that as expected for this simple problem, the Bayesian approach yields the same result as the frequentist approach! Discussion?Now, you might come away with the impression that the Bayesian method is unnecessarily complicated, and in this case it certainly is. Using an Affine Invariant Markov Chain Monte Carlo Ensemble sampler to characterize a one-dimensional normal distribution is a bit like using the Death Star to destroy a beach ball, but I did this here because it demonstrates an approach that can scale to complicated posteriors in many, many dimensions, and can provide nice results in more complicated situations where an analytic likelihood approach is not possible. As a side note, you might also have noticed one little sleight of hand: at the end, we use a frequentist approach to characterize our posterior samples! When we computed the sample mean and standard deviation above, we were employing a distinctly frequentist technique to characterize the posterior distribution. The pure Bayesian result for a problem like this would be to report the posterior distribution itself (i.e. its representative sample), and leave it at that. That is, in pure Bayesianism the answer to a question is not a single number with error bars; the answer is the posterior distribution over the model parameters! Adding a Dimension: Exploring a more sophisticated model?Let's briefly take a look at a more complicated situation, and compare the frequentist and Bayesian results yet again. Above we assumed that the star was static: now let's assume that we're looking at an object which we suspect has some stochastic variation — that is, it varies with time, but in an unpredictable way (a Quasar is a good example of such an object). We'll propose a simple 2-parameter Gaussian model for this object: Now, we'll again consider
In [8]:
np.random.seed(42)  # for reproducibility
N = 100  # we'll use more samples for the more complicated model
mu_true, sigma_true = 1000, 15  # stochastic flux model

F_true = stats.norm(mu_true, sigma_true).rvs(N)  # (unknown) true flux
F = stats.poisson(F_true).rvs()  # observed flux: true flux plus Poisson errors
e = np.sqrt(F)  # root-N error, as above

Varying Photon Counts: The Frequentist Approach

The resulting likelihood is the convolution of the intrinsic distribution with the error distribution, so we have

$$\mathcal{L}(D\,|\,\mu,\sigma) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi(\sigma^2 + e_i^2)}} \exp\left[\frac{-(F_i - \mu)^2}{2(\sigma^2 + e_i^2)}\right]$$

Analogously to above, we can analytically maximize this likelihood to find the best estimate for $\mu$; the result is again a weighted mean, this time with weights $w_i = 1/(\sigma^2 + e_i^2)$. And here we have a problem: the optimal value of $\mu$ depends on the optimal value of $\sigma$, and there is no simple closed-form solution for the pair, so the analytic approach used above no longer applies. Nevertheless, we can use numerical optimization techniques to determine the maximum likelihood value. Here we'll use the optimization routines available within SciPy's optimize submodule:
In [9]:
def log_likelihood(theta, F, e):
    return -0.5 * np.sum(np.log(2 * np.pi * (theta[1] ** 2 + e ** 2))
                         + (F - theta[0]) ** 2 / (theta[1] ** 2 + e ** 2))

# maximize likelihood <--> minimize negative likelihood
def neg_log_likelihood(theta, F, e):
    return -log_likelihood(theta, F, e)

from scipy import optimize
theta_guess = [900, 5]
theta_est = optimize.fmin(neg_log_likelihood, theta_guess, args=(F, e))
print("""
Maximum likelihood estimate for {0} data points:
    mu={theta[0]:.0f}, sigma={theta[1]:.0f}
""".format(N, theta=theta_est))

Optimization terminated successfully.
         Current function value: 502.839505
         Iterations: 58
         Function evaluations: 114

Maximum likelihood estimate for 100 data points:
    mu=999, sigma=19

This maximum likelihood value gives our best estimate of the parameters $\mu$ and $\sigma$ governing our model of the source. But this is only half the answer: we also need to quantify the uncertainty of these estimates.

There are several approaches to determining errors in a frequentist paradigm. We could, as above, fit a normal approximation to the likelihood near its maximum and report the covariance matrix (here we'd have to do this numerically rather than analytically). Alternatively, we could compute statistics like $\chi^2$ and use standard tests to determine confidence limits, or use a resampling method such as the jackknife or the bootstrap, which refit the model to many randomized resamplings of the data in order to estimate the uncertainty of the result.

All of these would be valid techniques to use, but each comes with its own assumptions and subtleties. Here, for simplicity, we'll use the basic bootstrap resampler found in the astroML package:
In [10]:
from astroML.resample import bootstrap

def fit_samples(sample):
    # sample is an array of size [n_bootstraps, n_samples]
    # compute the maximum likelihood for each bootstrap.
    return np.array([optimize.fmin(neg_log_likelihood, theta_guess,
                                   args=(F, np.sqrt(F)), disp=0)
                     for F in sample])

samples = bootstrap(F, 1000, fit_samples)  # 1000 bootstrap resamplings

Now, in a similar manner to what we did above for the MCMC Bayesian posterior, we'll compute the sample mean and standard deviation to determine the errors on the parameters.
In [11]:
mu_samp = samples[:, 0]
sig_samp = abs(samples[:, 1])

print(" mu    = {0:.0f} +/- {1:.0f}".format(mu_samp.mean(), mu_samp.std()))
print(" sigma = {0:.0f} +/- {1:.0f}".format(sig_samp.mean(), sig_samp.std()))

 mu    = 999 +/- 4
 sigma = 18 +/- 5

I should note that there is a huge literature on the details of bootstrap resampling, and there are definitely some subtleties of the approach that I am glossing over here. One obvious piece is that there is potential for errors to be correlated or non-Gaussian, neither of which is reflected by simply finding the mean and standard deviation of each model parameter. Nevertheless, I trust that this gives the basic idea of the frequentist approach to this problem.

Varying Photon Counts: The Bayesian Approach

The Bayesian approach to this problem is almost exactly the same as it was in the previous problem, and we can set it up by slightly modifying the above code.
In [12]:
def log_prior(theta):
    # sigma needs to be positive.
    if theta[1] <= 0:
        return -np.inf
    else:
        return 0

def log_posterior(theta, F, e):
    return log_prior(theta) + log_likelihood(theta, F, e)

# same setup as above:
ndim, nwalkers = 2, 50
nsteps, nburn = 2000, 1000

starting_guesses = np.random.rand(nwalkers, ndim)
starting_guesses[:, 0] *= 2000  # start mu between 0 and 2000
starting_guesses[:, 1] *= 20    # start sigma between 0 and 20

sampler = emcee.EnsembleSampler(nwalkers, ndim, log_posterior, args=[F, e])
sampler.run_mcmc(starting_guesses, nsteps)

sample = sampler.chain  # shape = (nwalkers, nsteps, ndim)
sample = sampler.chain[:, nburn:, :].reshape(-1, 2)

Now that we have the samples, we'll use a convenience routine from astroML to plot the traces and the contours representing one and two standard deviations:
In [13]:
from astroML.plotting import plot_mcmc

fig = plt.figure()
ax = plot_mcmc(sample.T, fig=fig, labels=[r'$\mu$', r'$\sigma$'], colors='k')
ax[0].plot(sample[:, 0], sample[:, 1], ',k', alpha=0.1)
ax[0].plot([mu_true], [sigma_true], 'o', color='red', ms=10);

The red dot indicates ground truth (from our problem setup), and the contours indicate one and two standard deviations (68% and 95% confidence levels). In other words, based on this analysis we are 68% confident that the model lies within the inner contour, and 95% confident that the model lies within the outer contour. Note here that, as with the frequentist result above, the recovered parameters are consistent with the input values.

The other thing to notice is that this posterior is definitely not Gaussian: this can be seen by the lack of symmetry in the vertical direction. That means that the Gaussian approximation used within the frequentist approach may not reflect the true uncertainties in the result. This isn't an issue with frequentism itself (i.e. there are certainly ways to account for non-Gaussianity within the frequentist paradigm), but the vast majority of commonly applied frequentist techniques make the explicit or implicit assumption of Gaussianity of the distribution. Bayesian approaches generally don't require such assumptions.

(Side note on priors: there are good arguments that a flat prior on a scale parameter like $\sigma$ is not truly uninformative, and that a prior flat in $\log\sigma$ is a more appropriate choice here; I'm glossing over that subtlety for the purposes of this post.)
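Returning to the non-Gaussianity point above: one natural way to summarize such a posterior without assuming Gaussianity is with percentiles of the samples. The following snippet is not part of the original notebook; it is a minimal sketch that assumes the two-column sample array produced by the MCMC run above, and reports median values with a 68% credible interval from the 16th and 84th percentiles:

# A sketch (not from the original post): summarize the non-Gaussian posterior
# with percentiles. Assumes `sample` has shape (n_samples, 2), columns (mu, sigma).
mu_lo, mu_med, mu_hi = np.percentile(sample[:, 0], [16, 50, 84])
sig_lo, sig_med, sig_hi = np.percentile(sample[:, 1], [16, 50, 84])
print("mu    = {0:.1f} +{1:.1f} / -{2:.1f}".format(mu_med, mu_hi - mu_med, mu_med - mu_lo))
print("sigma = {0:.1f} +{1:.1f} / -{2:.1f}".format(sig_med, sig_hi - sig_med, sig_med - sig_lo))

Reporting asymmetric intervals like these preserves some of the skewness of the posterior that a single mean and standard deviation would hide.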
Conclusion

I hope I've been able to convey through this post how philosophical differences underlying frequentism and Bayesianism lead to fundamentally different approaches to simple problems, which nonetheless can often yield similar or even identical results. To summarize the differences:

- Frequentists define probability as the limiting frequency of repeated measurements; model parameters are fixed quantities, and probability statements describe the data.
- Bayesians define probability as a degree of certainty; the observed data are treated as fixed, and the model parameters are described by probability distributions.

In simple problems, the two approaches can yield similar results. As data and models grow in complexity, however, the two approaches can diverge greatly. In a followup post, I plan to show an example or two of these more complicated situations. Stay tuned!

This post was written entirely in the IPython notebook. You can download this notebook, or see a static view here.