What Is Bayesian/Frequentist Inference? (2012) (normaldeviate.wordpress.com)
74 points by spekcular on Oct 10, 2022 | 31 comments


This blog is by Larry Wasserman, so I think his advice should be taken seriously. I agree that there are uses for both philosophies, and that statisticians should be pragmatic rather than dogmatic.

My issue is that his advice is most useful for statisticians working in the abstract, but it doesn’t really help people working with real data. Scientists and data analysts just want to know how to analyze their data, and this does not help them. I know that stats isn’t a cookbook, but we could put some guard rails down that help practitioners with their problems.


>Scientists and data analysts just want to know how to analyze their data, and this does not help them.

I'm a computational biologist who uses Bayesian and frequentist approaches depending on what I'm trying to achieve. This article was very helpful for making explicit a distinction I had not recognized before. With most of science (in my field) being done by people with Doctorates of Philosophy, I think it is reasonable to expect them to understand the underlying concepts of the math they are using. But I'm anachronistic in wanting science to have a bit more natural philosophy rather than just "shut up and calculate" (https://en.wikiquote.org/wiki/Shut_up_and_calculate).


Guardrails for stats are something that I've put a lot of thought towards. The fundamental challenge is that statistics is operating in a world of incomplete information. Statistical measures are almost never monotone with respect to new information, and so any new piece of context might completely invalidate an analysis. Going beyond the abstract requires an intimate knowledge of the domain being analyzed, and the limitations of statistical methods as applied to that domain. "Guardrails for stats" have to be domain- and even dataset-specific.


The problem I see with guard rails is that it's very hard to know if you're doing statistics right, due to its nature.

Inference is sometimes hard enough on its own (I sometimes use computational methods in addition to theory just to double-check my results), but that's just the innermost layer.

Outside of that you have to define appropriate and efficient samples, which is more difficult. You have to know what population you're actually interested in, which is less obvious than it sounds, and on top of that you have to pick an experimental/observational method that minimises error and ideally lets you quantify it -- extremely hard in most practical cases.

Add to that the fact that the outcome of statistical analysis might often be, "well, we still don't know anything meaningful!" But if you say that, someone else will sound more confident and guess who people will listen to?

----

The way out is not guard rails, it's much better training from earlier ages. This stuff is hard and we are not born with intuition for it. We need lots of practise.

I still don't get why there's so much analysis and calculus in our curricula -- those are problems we can solve with numerical (sometimes statistical) methods. We ought to replace at least half of that with more probability, statistical inference, and experimental design.


Guardrails is actually an apt term.

Just because you train people not to peer over the edge doesn't mean there's no need for safeguards. (Death Star laser operating platform, I'm looking at you!)

And just because there's safeguards doesn't mean you stop warning people not to peer over the edge.

The implicit assumption in your text is that if there's a guardrail, people will be leaning all day. Whereas my experience is that if there isn't, people will try that stuff regardless, and keep falling off the platform.

Obligatory: https://www.youtube.com/watch?v=9bSZXucTH4A


Dunno whether I agree with this. I agree that both are acceptable ways to do statistics. However:

1. Bayesian stats is an approach that tends to make model assumptions fairly explicit, whereas in frequentist approaches many assumptions are fairly implicit (Normal distribution of data, etc.).

2. I would consider myself a Bayesian, but I am sceptical about too much mention of esoteric terminology like "belief". Bayesian probabilities are probabilities following the Kolmogorov axioms, which are also the foundation of frequentist stats.

For decades, Bayesian inference was impractical because one had to resort to sampling methods and (a) computational power was insufficient and (b) we didn't have algorithms like the No-U-Turn Sampler (NUTS).

Both aspects are 'solved', so why is Bayesianism not universally adopted? Of course it still has a reputational disadvantage, but I think more importantly it's because:

* frequentist methods are good enough for the purposes of publishing research [*]

* for some problems we really have a hard time assembling Bayesian graphs

* some inference methods (e.g. the Kalman filter) can be seen as either frequentist or Bayesian

As a Bayesian I am amazed at how well frequentism can work, even when the 'traditional' way of applying it contradicts the derivations of founding fathers like Fisher and Pearson. It's almost as if we have an evolutionary process at play.

[*] That is, if you use p-Values as publication thresholds
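The sampling-methods point can be made concrete. Here is a minimal Metropolis sampler (the simple ancestor of NUTS, not NUTS itself) for a coin's bias under a uniform prior; the data (7 heads in 10 tosses) are hypothetical:

```python
import math
import random

random.seed(0)

HEADS, TOSSES = 7, 10  # hypothetical data

def log_posterior(theta):
    """Log of (Binomial likelihood x uniform prior) up to a constant."""
    if not 0.0 < theta < 1.0:
        return -math.inf
    return HEADS * math.log(theta) + (TOSSES - HEADS) * math.log(1.0 - theta)

def metropolis(n_samples=20000, step=0.1):
    theta = 0.5
    samples = []
    for _ in range(n_samples):
        proposal = theta + random.gauss(0.0, step)
        # Accept with probability min(1, posterior ratio)
        if math.log(random.random()) < log_posterior(proposal) - log_posterior(theta):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis()
post_mean = sum(samples) / len(samples)
# The conjugate answer here is Beta(8, 4), whose mean is 8/12 ~ 0.667,
# so the sampler's estimate should land close to that.
print(post_mean)
```

The point of the historical remark above: even this toy chain needs tens of thousands of likelihood evaluations, which is why Bayesian computation was painful before cheap compute and better samplers.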


Scientific publishing has largely gone off the rails, thanks in no small part to the frequentist p-value obsession. It is not good enough, people just use it anyway.

I think most people want to avoid the dance of picking a prior, that is why frequentism is still so widespread.


Fully agree with every single word in this article. Particularly the bit about "identity statistics".

Also, regarding failure of notation: I've been arguing for a while that the notation we use for probability is highly problematic and effectively egregious abuse of notation. And not just the whole "belief vs frequency of hypotheticals", but even the simple fact of what it represents. Does p(x) denote an event? The whole distribution? The distribution over a particular space?

An apologist might rightly point out that p(x) is actually shorthand for p(x=X), where x is the event and X is the distribution. But even this screams confusion.

For me the ideal notation would have been something that makes it explicit that the probability describes a set of 'sampling events' from a 'pool', e.g. Prob_{x}(X), where x is the set of events, and X is the nature of the distribution (i.e. a function which returns frequencies/beliefs over a domain).

And probability density functions would be denoted as actual derivatives, e.g. d/dx P_x (Cum.Normal)
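For a standard Normal, the proposal above might be rendered along these lines (my own LaTeX rendering of the commenter's idea, using the event set and distribution explicitly, with the density as a genuine derivative of the CDF):

```latex
% Proposed notation: subscript = set of sampling events, argument = distribution.
\[
  \mathrm{Prob}_{\{X \le x\}}\bigl(\mathcal{N}(0,1)\bigr) = \Phi(x),
  \qquad
  \frac{d}{dx}\,\mathrm{Prob}_{\{X \le x\}}\bigl(\mathcal{N}(0,1)\bigr)
  = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}.
\]
```

This makes "density = derivative of cumulative probability" explicit instead of overloading p(x) for both.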


The difference between Bayesian and Frequentist is in the interpretation of randomness. In Bayesian statistics 'randomness' is not a property of nature but a description of our knowledge.

What's randomness in a coin toss? If we had all the information we could perfectly predict the result of a toss. But if we know nothing then at most we can say is that both outcomes are equally probable.

Another example: if you had no idea who the next presidential winner will be between two candidates, then saying it's 50-50 is an accurate description of your knowledge.

If anyone is more interested, I would refer you to [1]. There, probability theory is interpreted as an extension of logic. Very interesting stuff.

[1] http://www.med.mcgill.ca/epidemiology/hanley/bios601/Gaussia...


> Another example: if you had no idea who the next presidential winner will be between two candidates, then saying it's 50-50 is an accurate description of your knowledge.

How do you capture the difference between having no idea and having the absolute best possible idea that anyone has or can have, with that idea being that the probability is 50-50?

    50-50 = no idea, zero certainty
    50-50 = expert analysis with as high degree of certainty as possible of a very, very close race
These two things are expressed as being equal? Feels like something important has been lost.
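In a full Bayesian treatment the distinction isn't actually lost: both states of knowledge have mean 0.5, but they differ in spread. A sketch using Beta distributions over the win probability (the "expert" pseudo-counts of 500 are hypothetical):

```python
def beta_mean_var(a, b):
    """Mean and variance of a Beta(a, b) distribution."""
    mean = a / (a + b)
    var = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, var

# "No idea": a flat Beta(1, 1) over the win probability.
m_flat, v_flat = beta_mean_var(1, 1)

# "Expert, very close race": a concentrated Beta(500, 500).
m_expert, v_expert = beta_mean_var(500, 500)

print(m_flat, m_expert)          # both means are 0.5
print(v_flat, v_expert)          # but the variances differ by orders of magnitude
```

Only the point estimate collapses the two to "50-50"; the distributions themselves remain distinguishable.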


>>> What's randomness in a coin toss? If we had all the information we could perfectly predict the result of a toss. But if we know nothing then at most we can say is that both outcomes are equally probable.

That's because you know it's a coin toss. If it was something else like whether a seed will germinate or not, I wouldn't assume equal probability.

Admittedly, this is something that's always puzzled me about Bayesian statistics, though I'm not sure it's fundamental.


>> What's randomness in a coin toss? If we had all the information we could perfectly predict the result of a toss. But if we know nothing then at most we can say is that both outcomes are equally probable.

Perhaps I'm being dense or simply don't understand, but how is this possible? What is "all the information"? Isn't it at most likely that the outcome is 50/50?


Each coin-toss is a deterministic physical process governed by laws of motion. If we had perfect information about the motion of all components in the system (hand, coin, air, floor, etc.), then we could, in principle, perfectly predict the outcome of every toss. Each individual toss would have a 100% probability of its predicted outcome.

Since we typically lack that information, we are stuck with the 50/50 prior distribution.


This is very debatable if you throw QM into the mix. From all we know, we cannot predict everything with a 100% success rate -- QM cannot be explained by a hidden-variables model.


Argh - yes it can. Bell explicitly gave a pretty trivial one in one of his early papers. QM cannot be explained by a non-contextual hidden-variables model. (The simplest physical version of contextuality is locality - so, as Bell showed, you cannot explain quantum theory with a local hidden-variables model.)


Let's not break causality by observing non-local hidden variables.


Presumably talking to the wind here as this thread is old, but nonlocal hidden variables do not break causality! This can be seen explicitly in Bell's trivial model, where the state of a system is just the regular quantum state plus a uniform random number between 0 and 1 (hidden). Or put it this way, if this model "breaks causality" then so does regular non-relativistic quantum theory.


This falls apart in higher dimensions, but in the example given in the article the two answers only differ because they have different priors. If you repeat the bayesian analysis using the prior \theta ~ N(0, x), and let x go to infinity, then you approach the frequentist answer.
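The flat-prior limit is easy to check numerically. For a Normal mean with known variance, the conjugate posterior mean under a N(0, x) prior tends to the sample mean (the frequentist MLE) as x grows; the numbers below are hypothetical:

```python
# Conjugate normal model: data ~ N(theta, sigma2), prior theta ~ N(0, prior_var).
def posterior_mean(xbar, n, sigma2, prior_var):
    """Posterior mean of theta: precision-weighted average of prior mean 0 and xbar."""
    precision = 1.0 / prior_var + n / sigma2
    return (n * xbar / sigma2) / precision

xbar, n, sigma2 = 2.3, 10, 4.0  # hypothetical sample mean, sample size, known variance

for prior_var in (1.0, 100.0, 1e6):
    print(prior_var, posterior_mean(xbar, n, sigma2, prior_var))
# As prior_var grows, the posterior mean approaches xbar = 2.3,
# i.e. the frequentist answer; tight priors shrink it toward 0.
```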

In my opinion, https://stats.stackexchange.com/questions/2272/whats-the-dif... is a better explanation of the difference between confidence and credible intervals.

Editing to add more commentary to my link:

If you've read the link, one of the principal objections of the frequentist is "What if the jar is type B? Then your interval will be wrong 80% of the time, and only correct 20% of the time!"

This is because, if you look at the original numbers, jar B has all types of cookies, and therefore any single draw from jar B can "look like" any other jar, and because other jars have more concentrated cookie types, they are "more likely" answers for each potential sample.

This issue also comes up with the frequentist analysis! If you look at the confidence intervals, they all say "This jar could be jar B". Instead of being bad at detecting jar B, they are good at always considering jar B no matter the evidence – but it's the same uncertainty.

The bayesian version of this objection is "when you pull a cookie with 3 chips your interval is only correct 41% of the time". This is because if we get jar A, we'll probably draw a 2-chip cookie, so it's outside of our confidence interval.

But note that we probably don't really care about the confidence interval or credibility interval. It's basically a hack to take a probabilistic problem and turn our answer into black and white. To say "these hypotheses are valid and these are invalid".

But this is statistics! If you just take the Bayesian approach, and throw out the need to create an arbitrary interval, you can just stop at the table titled P(Jar|Chips). That's all the information you need. If you draw a N-chip cookie, you can use that table to update your P(Chips_2) for a second draw, and you'll get a concrete probabilistic answer. Yes, you have to assume a prior. But frequentist statistics literally can't answer this question! Without a prior, there's no way to turn P(Chips | Jar) into a P(Jar | Chips) to update on, so you can't track your evidence to get better predictions. You just sit there saying "well, my interval meets the criterion even in the worst case".
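The table-based update described above can be sketched directly. The jar contents here are hypothetical stand-ins, not the exact numbers from the linked post:

```python
from fractions import Fraction as F

# Hypothetical jars: P(chips | jar) for cookies with 0..3 chips.
likelihood = {
    "A": [F(0), F(3, 4), F(1, 4), F(0)],
    "B": [F(1, 4), F(1, 4), F(1, 4), F(1, 4)],  # jar B has all cookie types
    "C": [F(0), F(0), F(1, 2), F(1, 2)],
}
prior = {jar: F(1, 3) for jar in likelihood}  # jars equally likely a priori

def update(prior, chips):
    """Bayes rule: P(jar | chips) is proportional to P(chips | jar) * P(jar)."""
    unnorm = {jar: likelihood[jar][chips] * p for jar, p in prior.items()}
    total = sum(unnorm.values())
    return {jar: w / total for jar, w in unnorm.items()}

posterior = update(prior, 2)  # observed a 2-chip cookie on the first draw
# Posterior predictive: P(0 chips on second draw | first draw had 2 chips)
predictive = sum(posterior[jar] * likelihood[jar][0] for jar in posterior)
print(posterior)
print(predictive)
```

This is the "stop at the table titled P(Jar|Chips)" step: the posterior feeds directly into a predictive probability for the next draw, with no interval construction needed.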


Honestly, I think Wasserman does a better job. The cookie interval example gets the fact that frequentists require uniform coverage properties with respect to the unknown parameter right. But the "when you pull a cookie with 3 chips your interval is only correct 41% of the time" thing isn't really an essential Bayesian vs. Frequentist issue. As Wasserman notes, coverage is a minimal requirement for something being a confidence interval; we usually also construct them to avoid obvious deficiencies. All this objection shows is that the particular procedure in the example might not be the best one; it's not an argument that can be applied to all CIs in general. (And it's clear, if you play with the numbers, that the example can be improved.)

Also, regarding "But note that we probably don't really care about the confidence interval or credibility interval." Many times we do - giving a point estimate and an associated quantification of its uncertainty is one of the most basic statistical tasks.

Further, it's somewhat misleading to critique frequentists by saying they don't give probabilities for P(Jar | Chips), because in the frequentist setup the jar is a fixed and unknown parameter, not stochastic. For the two-cookie setting, it's trivial to generalize the construction in the Stats.SE post, so saying "frequentists can't track evidence to get better predictions" is simply wrong.


"giving a point estimate and an associated quantification of its uncertainty is one of the most basic statistical tasks". My argument is that this is only true in a frequentist framing. A bayesian framing would ask, "why do you need a point estimate when you have the posterior?" In the cookie-jar case, what do you actually need to do in the real world that a confidence interval of jars helps you with? Do you win a prize if you guess the jar right? Do you win a prize if you guess the next cookie right? Do you need to select a strategy up front which gives at least a 70% chance of being right about the jar over iterations where the jar is fixed but the data varies? In the last case, a Bayesian directly computes a confidence interval. But in other cases, a confidence interval is just throwing away data for no reason.

Frequentists are fundamentally unable to answer "Given that I drew a cookie with 2 chips on the first draw, what is the chance I draw a cookie with 0 chips on my second draw?". That question requires a prior; any solution frequentists find is just implicitly assigning a prior to the distribution of jars.


Well, I am not really interested in cookie jars. But I am interested in, for example, particle physics. There we need simple ways to communicate point estimates and the associated uncertainties for various parameters of nature. Intervals are a convenient way to do this. Frequentist confidence intervals have the virtue that they will cover the true parameter at the nominal rate. Bayesian credible intervals in general have no such guarantee. In many cases we can find Bayesian-inspired interval estimators that have good coverage properties. But in some cases there is an irreconcilable conflict. And there you have to choose between long-run correctness and being Bayesian.

You might claim that we should do away with intervals and just report posteriors for all physical quantities. This complicates matters slightly without solving the problem. If the true parameters, when ultimately known, consistently end up in very low density regions of the probability distribution (such as far in the tails), we would regard our uncertainty estimates as poor. Again, it is not hard to construct examples where Bayesian methods have poor coverage properties in this sense.

(Also, a minor point: Regarding "Do you need to select a strategy up front which gives at least a 70% chance of being right about the jar over iterations where the jar is fixed but the data varies?", one does not need to fix the jar for frequentist methods to have good guarantees. See Wasserman's simulation with the median.)

Re: "Given that I drew a cookie with 2 chips on the first draw, what is the chance I draw a cookie with 0 chips on my second draw?", this is not a question about estimating an unknown parameter of a distribution, so it's not statistical in the sense Wasserman is talking about. It's just an elementary probability question that requires knowing something about the jars to answer. Both frequentist and Bayesian statisticians agree on the validity of Bayes rule (and hence how to answer this question once the relevant information is known or assumed); where they differ is on how to conceptualize and estimate unknown parameters of probability distributions.
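Coverage in the long-run sense described above is easy to check by simulation. A sketch for the textbook Normal-mean interval with known sigma (parameters hypothetical):

```python
import random

random.seed(1)

def covers(mu=0.0, sigma=1.0, n=20, z=1.96):
    """Draw one sample; check whether the 95% CI for the mean covers mu."""
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    xbar = sum(xs) / n
    half = z * sigma / n ** 0.5
    return xbar - half <= mu <= xbar + half

trials = 2000
coverage = sum(covers() for _ in range(trials)) / trials
print(coverage)  # empirical coverage should sit near the nominal 0.95
```

This is the frequentist guarantee in operational form: over repeated samples with the parameter fixed, the interval traps the truth at (approximately) the advertised rate.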


> I am interested in, for example, particle physics. There we need need simple ways to communicate point estimates and the associated uncertainties for various parameters of nature.

Okay, but what do you _do_ with a confidence interval once you have it? It's just an abstract object that can't be used to take your knowledge and make better predictions about the future. If I tell you "This new particle decays with a half life of 28 years with a 95% confidence interval of +/- 5 years", can you take that information and use it to estimate the age of an object that started with 236 particles and now has 182 particles?
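For what it's worth, a confidence interval can be pushed through that calculation naively. Using N(t) = N0 * 2^(-t/T), solving for t and plugging in the interval endpoints induces an interval for the age (this ignores counting uncertainty in 236 and 182, and the half-life numbers are the hypothetical ones from the question):

```python
import math

def age(n0, n, half_life):
    """Invert N(t) = n0 * 2**(-t / half_life) for t."""
    return half_life * math.log(n0 / n, 2)

n0, n = 236, 182
center = age(n0, n, 28.0)               # point estimate of the age
low, high = age(n0, n, 23.0), age(n0, n, 33.0)  # induced by the CI endpoints
print(center, (low, high))
```

Whether that induced interval inherits the 95% coverage guarantee is exactly the kind of question this subthread is arguing about.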

> this is not a question about estimating an unknown parameter of a distribution, so it's not statistical in the sense Wasserman is talking about

And a frequentist confidence interval doesn't answer a question about how you should update your knowledge so you can make better predictions in the future, so it's not statistical in the sense bayesians talk about.


The blog post talks about inference, not prediction, so I find it odd you keep bringing up prediction tasks. There are interesting questions and differences here, but it is very much not the subject of the post.

A standard frequentist tool for making predictions is the prediction interval. This is the appropriate comparison point for Bayesian prediction methods, and exactly the same issues arise as in the comparison of confidence intervals to credible intervals (or posteriors). Namely, frequentist prediction intervals have guaranteed error control, while Bayesian predictions generally do not. So in certain cases you have to choose between being right most of time about your predictions, and being Bayesian.


Discussed at the time:

What Is Bayesian/Frequentist Inference? - https://news.ycombinator.com/item?id=4800449 - Nov 2012 (27 comments)


In the end I believe Bayesian inference is more straightforward to implement and understand, if you can afford computational sampling of the posterior. So I think, at least in physics, there is a shift towards Bayesian approaches.


Whilst I am successful in my career and use probability, statistics, and inference on a regular basis, I simply cannot understand what's being discussed here.

I don't even want to understand it. Just like quantum, half the argument seems to be a mismatch between mental models and actual reality.


> half the argument seems to be a mismatch between mental models and actual reality.

Which half seems to be a mismatch to you? The Bayesian half or the frequentist one?


Every time I've tried to understand the entire argument it just raises more questions for me. For example, as it was first introduced to me: frequentists simply count frequencies observed in nature, compute stats on them, and then build inferential models using those stats without assuming any complex underlying distribution. Bayesians, meanwhile, count frequencies, apply a prior correction (say, adding a pseudocount of one for every unobserved possible event, or any other way of assuming the generative process has a distribution that we've previously estimated), compute some stats, then build models from that.

However, after I was told that, I've seen several other arguments that quickly dive into the distribution of the underlying events (I've heard that frequentists assume one type while Bayesians assume another). Other folks just give the example of the base rate fallacy.

Throughout all of this I've realized: I don't understand stats at all. I came to the scientific world with a view much more like physics: there is a microscopic event system (a particle simulation, or whatever) that we are observing, but due to limitations, we can only make macroscopic observations, which represent biased aggregations of the underlying microscopic event system. We can figure out those biases and use the aggregate data to build predictive models of the underlying systems- without ever really knowing the true details of the microscopic model.

From what I can tell, everything about what physicists do to model the world mentally is more Bayesian than Frequentist, if I understand what the hell people mean when they argue about it. However, as I said, every time I look at the arguments, I realize I don't understand stats, while I understand the physics approach which seems to be fairly obvious.
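The "pseudocount" description above corresponds to Laplace (add-one) smoothing. A sketch contrasting it with raw relative frequencies, using hypothetical counts:

```python
def mle(counts):
    """Frequentist-style estimate: raw relative frequencies."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def laplace(counts):
    """Bayesian-flavored estimate: add a pseudocount of one per outcome
    (the posterior mean under a uniform Dirichlet prior)."""
    total = sum(counts.values()) + len(counts)
    return {k: (v + 1) / total for k, v in counts.items()}

# Hypothetical observations: the outcome "c" was never seen.
counts = {"a": 6, "b": 4, "c": 0}
print(mle(counts))      # assigns "c" probability exactly 0
print(laplace(counts))  # assigns "c" a small but nonzero probability
```

The difference matters most for unseen events: the raw frequency estimate rules them out entirely, while the prior-corrected one keeps them possible.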


> For example as I was first introduced to it, frequentists simple count frequencies [...]. While Bayesians count frequencies [...]

I think that it is a bad way to explain differences. The good way is to look into the history of approaches and to see how they are different.

The history is illuminating. Frequentists started with card games, trying to figure out winning strategies. So they were drawn to frequencies: they invented combinatorics to calculate frequencies, and later came up with game theory. Of course that is not the whole story. While initially they took frequencies as given or inferred them with math, they also encountered problems where it was impossible to calculate frequencies by combinatorics, so they defined probability as the limit of an empirical frequency as the number of samples approaches infinity, and now they deal with the impracticality of that infinity, using p-values or whatever to decide whether they should get more samples or already have enough.

Thomas Bayes came from the other side. He started with a task where he had hypotheses and tried to choose between them based on evidence. He was a priest, and he was unsure whether we should believe in miracles given the reports of eyewitnesses. So he dealt with belief: he quantified it and found a procedure for updating belief given a piece of evidence.

So, generally speaking, Thomas Bayes started with the problem that frequentists saw as a side issue. Frequentists sought to use probabilities to win a game, without bothering much about where those frequencies came from; Bayes sought to infer a belief (or probabilities as degrees of belief) from evidence, without bothering much about what to do with the resulting belief. (Stop being a priest? I don't know what his plan was, and I suspect he had none; it was pure curiosity.)

And hence comes the ideological difference between them. Frequentists see probability as a property of the Universe; Bayesians see probability as a property of an observer, a property of their imperfect model of the Universe. Bayesians bring the model explicitly into the picture, and so they can consciously think about improving it. Frequentists can think of a model too, but they lack the vocabulary; it is a missing part of their picture, and it is hard for them to pinpoint it.

It really has something in common with quantum mechanics, where people debated for decades whether uncertainty is the way the Universe works or just our imperfect way of describing it.

But these are ideological differences. To see practical differences one needs to dive into practical problems and see how the different approaches work there. Mostly people learn the frequentist approach in an undergraduate course and then learn Bayesianism on a bunch of problems that are easy with Bayesianism and very hard or impossible to tackle with frequencies. You can try "Think Bayes"[1] if you like, or read Judea Pearl's "The Book of Why"[2]. He invented modern Bayesianism, starting from the ideas of Thomas Bayes. "The Book of Why" is more about his next invention (causation), but he talks about Bayesianism there too.

[1] https://greenteapress.com/wp/think-bayes/

[2] https://www.amazon.com/Book-Why-Science-Cause-Effect-ebook/d...


Did you read Wasserman's article? There's a difference between a frequentist/Bayesian interpretation of probability and a frequentist/Bayesian method of inference. I don't have to take a position on what probability "really means" to use either kind of inference method. (As the article says, the true difference has a lot more to do with wanting guaranteed coverage...)


If it's just a perspective on how to analyze probabilities, then physicists already knew how to derive equations for macroscopic observables and combinatorics, in ways that correspond to reality as we understand it. Thanks for the long writeup, but you basically just confirmed I wasn't missing anything fundamental.



