What is the intuition behind the beta distribution?


476

Disclaimer: I'm not a statistician but a software engineer. Most of my knowledge of statistics comes from self-education, so I still have many gaps in understanding concepts that may seem trivial to other people here. I would therefore be very thankful if answers contained fewer specific terms and more explanation. Imagine that you are talking to your grandma :)

I'm trying to get a grasp of the nature of the beta distribution: what it is used for and how to interpret it in each case. If we were talking about, say, the normal distribution, one could describe it as the arrival time of a train: most often it arrives right on time, somewhat less often it is 1 minute early or 1 minute late, and very rarely it arrives 20 minutes away from the mean. The uniform distribution describes, in particular, the chance of each ticket in a lottery. The binomial distribution can be described with coin flips, and so on. But is there such an intuitive explanation for the beta distribution?

Let's say $\alpha = .99$ and $\beta = .5$. The beta distribution $B(\alpha, \beta)$ in this case looks like this (generated in R):

[plot: Beta(0.99, 0.5) density, generated in R]
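(A plot like that can presumably be reproduced in R with something along these lines; the exact call is not shown in the question.)

curve(dbeta(x, 0.99, 0.5), from = 0.01, to = 0.99, ylab = "density")
# with beta = 0.5 < 1 the density grows without bound as x approaches 1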

But what does it actually mean? The y-axis is obviously a probability density, but what is on the x-axis?

I would highly appreciate any explanation, with this example or any other.

49

A Beta distribution is used to model things that have a limited range, like 0 to 1.

Examples are the probability of success in an experiment having only two outcomes, like success and failure. If you do a limited number of experiments, and some are successful, you can represent what that tells you by a beta distribution.

Another example is order statistics. For example, if you generate several (say 4) uniform 0,1 random numbers, and sort them, what is the distribution of the 3rd one?

I use them to understand software performance diagnosis by sampling. If you stop a program at random $n$ times, and $s$ of those times you see it doing something you could actually get rid of, and $s>1$, then the fraction of time to be saved by doing so is represented by $Beta(s+1, (n-s)+1)$, and the speedup factor has a BetaPrime distribution.
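For example (a minimal sketch with made-up numbers, not the answerer's): if $n = 50$ random stops were taken and $s = 10$ of them caught the removable work, the $Beta(s+1, (n-s)+1)$ distribution above gives a credible range for the fraction of time that could be saved:

n <- 50   # random stops of the program
s <- 10   # stops that caught the removable work
qbeta(c(0.025, 0.5, 0.975), s + 1, n - s + 1)   # 95% interval roughly 0.11 to 0.34, median about 0.21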

More about that...


675

The short version is that the Beta distribution can be understood as representing a distribution of probabilities, that is, it represents all the possible values of a probability when we don't know what that probability is. Here is my favorite intuitive explanation of this:

Anyone who follows baseball is familiar with batting averages—simply the number of times a player gets a base hit divided by the number of times he goes up at bat (so it's just a percentage between 0 and 1). .266 is in general considered an average batting average, while .300 is considered an excellent one.

Imagine we have a baseball player, and we want to predict what his season-long batting average will be. You might say we can just use his batting average so far- but this will be a very poor measure at the start of a season! If a player goes up to bat once and gets a single, his batting average is briefly 1.000, while if he strikes out, his batting average is 0.000. It doesn't get much better if you go up to bat five or six times- you could get a lucky streak and get an average of 1.000, or an unlucky streak and get an average of 0, neither of which are a remotely good predictor of how you will bat that season.

Why is your batting average in the first few hits not a good predictor of your eventual batting average? When a player's first at-bat is a strikeout, why does no one predict that he'll never get a hit all season? Because we're going in with prior expectations. We know that in history, most batting averages over a season have hovered between something like .215 and .360, with some extremely rare exceptions on either side. We know that if a player gets a few strikeouts in a row at the start, that might indicate he'll end up a bit worse than average, but we know he probably won't deviate from that range.

Given our batting average problem, which can be represented with a binomial distribution (a series of successes and failures), the best way to represent these prior expectations (what we in statistics just call a prior) is with the Beta distribution- it's saying, before we've seen the player take his first swing, what we roughly expect his batting average to be. The domain of the Beta distribution is (0, 1), just like a probability, so we already know we're on the right track, but the appropriateness of the Beta for this task goes far beyond that.

We expect that the player's season-long batting average will be most likely around .27, but that it could reasonably range from .21 to .35. This can be represented with a Beta distribution with parameters $\alpha=81$ and $\beta=219$:

curve(dbeta(x, 81, 219))

[plot: Beta(81, 219) density]

I came up with these parameters for two reasons:

  • The mean is $\frac{\alpha}{\alpha+\beta}=\frac{81}{81+219}=.270$
  • As you can see in the plot, this distribution lies almost entirely within (.2, .35)- the reasonable range for a batting average.

You asked what the x axis represents in a beta distribution density plot—here it represents his batting average. Thus notice that in this case, not only is the y-axis a probability (or more precisely a probability density), but the x-axis is as well (batting average is just a probability of a hit, after all)! The Beta distribution is representing a probability distribution of probabilities.

But here's why the Beta distribution is so appropriate. Imagine the player gets a single hit. His record for the season is now 1 hit; 1 at bat. We have to then update our probabilities- we want to shift this entire curve over just a bit to reflect our new information. While the math for proving this is a bit involved (it's shown here), the result is very simple. The new Beta distribution will be:

$\mbox{Beta}(\alpha_0+\mbox{hits}, \beta_0+\mbox{misses})$

Where $\alpha_0$ and $\beta_0$ are the parameters we started with- that is, 81 and 219. Thus, in this case, $\alpha$ has increased by 1 (his one hit), while $\beta$ has not increased at all (no misses yet). That means our new distribution is $\mbox{Beta}(81+1, 219)$, or:

curve(dbeta(x, 82, 219))

[plot: Beta(82, 219) density]

Notice that it has barely changed at all- the change is indeed invisible to the naked eye! (That's because one hit doesn't really mean anything).

However, the more the player hits over the course of the season, the more the curve will shift to accommodate the new evidence, and furthermore the more it will narrow based on the fact that we have more proof. Let's say halfway through the season he has been up to bat 300 times, hitting 100 out of those times. The new distribution would be $\mbox{Beta}(81+100, 219+200)$, or:

curve(dbeta(x, 81+100, 219+200))

[plot: Beta(181, 419) density]

Notice the curve is now both thinner and shifted to the right (higher batting average) than it used to be- we have a better sense of what the player's batting average is.

One of the most interesting outputs of this formula is the expected value of the resulting Beta distribution, which is basically your new estimate. Recall that the expected value of the Beta distribution is $\frac{\alpha}{\alpha+\beta}$. Thus, after 100 hits of 300 real at-bats, the expected value of the new Beta distribution is $\frac{81+100}{81+100+219+200}\approx.302$- notice that it is lower than the naive estimate of $\frac{100}{100+200}=.333$, but higher than the estimate you started the season with ($\frac{81}{81+219}=.270$). You might notice that this formula is equivalent to adding a "head start" to the number of hits and non-hits of a player: you're saying "start him off in the season with 81 hits and 219 non-hits on his record".
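These three estimates are easy to check in R (same numbers as above):

alpha0 <- 81; beta0 <- 219                          # prior: 81 hits, 219 misses
hits <- 100; misses <- 200                          # observed halfway through the season
hits / (hits + misses)                              # naive estimate, 0.333
alpha0 / (alpha0 + beta0)                           # prior mean, 0.270
(alpha0 + hits) / (alpha0 + hits + beta0 + misses)  # posterior mean, about 0.302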

Thus, the Beta distribution is best for representing a probabilistic distribution of probabilities: the case where we don't know what a probability is in advance, but we have some reasonable guesses.


31

There are two principal motivations:

First, the beta distribution is the conjugate prior to the Bernoulli distribution. That means that if you have an unknown probability like the bias of a coin that you are estimating by repeated coin flips, then the likelihood induced on the unknown bias by a sequence of coin flips is beta-distributed.

Second, a consequence of the beta distribution being an exponential family is that it is the maximum entropy distribution for a set of sufficient statistics. In the beta distribution's case these statistics are $\log(x)$ and $\log(1-x)$ for $x$ in $[0,1]$. That means that if you only keep the average measurement of these sufficient statistics for a set of samples $x_1, \dots, x_n$, the minimum assumption you can make about the distribution of the samples is that it is beta-distributed.
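As a small numerical check of that statement (my own sketch, not part of the answer): for beta samples the averages of $\log(x)$ and $\log(1-x)$ converge to $\psi(\alpha)-\psi(\alpha+\beta)$ and $\psi(\beta)-\psi(\alpha+\beta)$, where $\psi$ is the digamma function, so keeping only those two averages is enough to pin down $\alpha$ and $\beta$:

set.seed(1)
a <- 2; b <- 5
x <- rbeta(1e5, a, b)
c(mean(log(x)),     digamma(a) - digamma(a + b))   # both about -1.45
c(mean(log(1 - x)), digamma(b) - digamma(a + b))   # both about -0.37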

The beta distribution is not special for generally modeling things over [0,1] since many distributions can be truncated to that support and are more applicable in many cases.


45

The Beta distribution also appears as an order statistic for a random sample of independent uniform distributions on $(0,1)$.

Precisely, let $U_1$, $\ldots$, $U_n$ be $n$ independent random variables, each having the uniform distribution on $(0,1)$. Denote by $U_{(1)}$, $\ldots$, $U_{(n)}$ the order statistics of the random sample $(U_1, \ldots, U_n)$, defined by sorting the values of $U_1$, $\ldots$, $U_n$ in increasing order. In particular $U_{(1)}=\min(U_i)$ and $U_{(n)}=\max(U_i)$. Then one can show that $U_{(k)} \sim \textrm{Beta}(k, n+1-k)$ for every $k=1,\ldots,n$.
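A quick simulation check of this claim (my own sketch): the 3rd order statistic of $n = 4$ uniforms should follow $\textrm{Beta}(3, 2)$:

set.seed(42)
n <- 4; k <- 3
u_k <- replicate(1e4, sort(runif(n))[k])    # 3rd smallest of 4 uniforms, repeated many times
quantile(u_k, c(0.25, 0.5, 0.75))           # simulated quartiles
qbeta(c(0.25, 0.5, 0.75), k, n + 1 - k)     # theoretical quartiles of Beta(3, 2)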

This result shows that beta distributions appear naturally in mathematics, and it has some interesting applications.


26

[plot: simulated histograms with the Beta(10, 2) density (red, 9 good of 10 ratings) and the Beta(401, 101) density (orange, 400 good of 500 ratings) overlaid]

Let's assume a seller on some e-commerce web-site receives 500 ratings of which 400 are good and 100 are bad.

We think of this as the result of a Bernoulli experiment of length 500 which led to 400 successes (1 = good) while the underlying probability $p$ is unknown.

The seller's naive quality in terms of ratings is 80%, because 0.8 = 400 / 500. But we don't know the "true" quality in terms of ratings.

Theoretically, a seller with a "true" quality of $p=77\%$ might also have ended up with 400 good ratings out of 500.

The pointy bar plot in the picture shows, for each assumed "true" $p$, how often it happened in a simulation that 400 of 500 ratings were good. The bar plot is the density histogram of the simulation results.

And as you can see - the density curve of the beta distribution for $\alpha=400+1$ and $\beta=100+1$ (orange) tightly surrounds the bar chart (the density of the histogram for the simulation).

So the beta distribution essentially gives the probability density over the success probability $p$ of a Bernoulli experiment, given the outcome of the experiment.

library(ggplot2)

# 90% positive of 10 ratings
o1 <- 9
o0 <- 1
M <- 100
N <- 100000

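# For each candidate "true" probability prob on the grid 0, 1/M, ..., 1, simulate N
# experiments of o1 + o0 ratings each and count how often exactly o1 of them are good.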
m <- sapply(0:M/M,function(prob)rbinom(N,o1+o0,prob))
v <- colSums(m==o1)
df_sim1 <- data.frame(p=rep(0:M/M,v))
df_beta1 <- data.frame(p=0:M/M, y=dbeta(0:M/M,o1+1,o0+1))

# 80% positive of 500 ratings
o1 <- 400
o0 <- 100
M <- 100
N <- 100000

m <- sapply(0:M/M,function(prob)rbinom(N,o1+o0,prob))
v <- colSums(m==o1)
df_sim2 <- data.frame(p=rep(0:M/M,v))
df_beta2 <- data.frame(p=0:M/M, y=dbeta(0:M/M,o1+1,o0+1))

ggplot(data=df_sim1,aes(p)) +
    scale_x_continuous(breaks=0:10/10) +

    geom_histogram(aes(y=..density..,fill=..density..),
        binwidth=0.01, origin=-.005, colour=I("gray")) +
    geom_line(data=df_beta1 ,aes(p,y),colour=I("red"),size=2,alpha=.5) +

    geom_histogram(data=df_sim2, aes(y=..density..,fill=..density..),
        binwidth=0.01, origin=-.005, colour=I("gray")) +
    geom_line(data=df_beta2,aes(p,y),colour=I("orange"),size=2,alpha=.5)

http://www.joyofdata.de/blog/an-intuitive-interpretation-of-the-beta-distribution/


2

The beta distribution is very useful when you are working with particle-size distributions. It is not the right choice, however, when you want to model a grain-size distribution; that case is better served by the Tanh distribution, $F(X) = \tanh((x/p)^n)$, which is not bounded on the right.

By the way, what happens if you produce a size distribution from a microscopic observation, so that you have a particle distribution by number, but your aim is to work with a volume distribution? It is almost mandatory that the original number distribution be bounded on the right. Then the transformation is more consistent, because you can be sure that no mode, median or mean size of the new volume distribution falls outside the interval you are working in. Besides, you avoid the Greenland-Africa effect.

The transformation is very easy if you have regular shapes, i.e., a sphere or a prism: you just add three units to the $\alpha$ parameter of the number-based beta distribution and you get the volume distribution.
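A hedged numerical check of that rule (my own sketch, not from the answer): for spheres, volume scales as diameter cubed, so weighting a number-based $\mathrm{Beta}(\alpha, \beta)$ density by $x^3$ and renormalizing gives exactly $\mathrm{Beta}(\alpha+3, \beta)$:

a <- 2; b <- 5                                   # number-based size distribution Beta(a, b)
x <- seq(0.01, 0.99, by = 0.01)
# weight by x^3 (sphere volume ~ diameter^3) and renormalize with the exact constant
volume_density <- x^3 * dbeta(x, a, b) / exp(lbeta(a + 3, b) - lbeta(a, b))
max(abs(volume_density - dbeta(x, a + 3, b)))    # essentially 0: it equals Beta(a + 3, b)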


5

My intuition says that it "weighs" both the current proportion of success, $x$, and the current proportion of failure, $(1-x)$: $f(x;\alpha,\beta) = \text{constant}\cdot x^{\alpha-1}(1-x)^{\beta-1}$, where the constant is $1/B(\alpha,\beta)$. The $\alpha$ acts like a "weight" for the contribution of successes, and the $\beta$ like a "weight" for the contribution of failures. You have a two-dimensional parameter space (one dimension for the successes' contribution and one for the failures'), which makes it kind of difficult to think about and understand.


9

So far the preponderance of answers has covered the rationale for beta RVs arising as the prior for a sample proportion, and one clever answer has related beta RVs to order statistics.

Beta distributions also arise from a simple relationship between two $\mathrm{Gamma}(k_i, 1)$ RVs, $i = 1, 2$; call them $X$ and $Y$. Then $X/(X+Y)$ has a $\mathrm{Beta}(k_1, k_2)$ distribution.
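A quick simulation check (my own sketch), with $k_1 = 2$ and $k_2 = 5$, so that $X/(X+Y)$ should follow $\mathrm{Beta}(2, 5)$:

set.seed(1)
k1 <- 2; k2 <- 5
x <- rgamma(1e5, shape = k1, rate = 1)
y <- rgamma(1e5, shape = k2, rate = 1)
r <- x / (x + y)
c(mean(r), k1 / (k1 + k2))                  # both about 0.286
c(quantile(r, 0.5), qbeta(0.5, k1, k2))     # medians agree, about 0.26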

Gamma RVs already have their rationale in modeling arrival times for independent events, so I will not address that since it is not your question. But a "fraction of time" spent completing one of two tasks performed in sequence naturally lends itself to a Beta distribution.


3

In the cited example the parameters are $\alpha = 81$ and $\beta = 219$, based on the prior year [81 hits in 300 at bats, i.e. $\alpha = 81$ and $\beta = 300 - 81 = 219$].

I don't know what they call the prior assumption of 81 hits and 219 outs but in English, that's the a priori assumption.

Notice how, as the season progresses, the curve shifts left or right and the modal probability shifts with it, but there is still a whole curve rather than a single point.

I wonder if the Law of Large Numbers eventually takes hold and drives the batting average back to .270.

To guesstimate $\alpha$ and $\beta$ in general, one would take the complete number of prior occurrences (at bats) and the known batting average, obtain the total hits (that is $\alpha$) and the grand total minus the hits (that is $\beta$), and voilà, you have your parameters. Then work the additional data in as shown.


-2

I think there is NO intuition behind the beta distribution! It is just a very flexible distribution with a FIXED range! And for integer $a$ and $b$ it is even easy to deal with. Also, many special cases of the beta have a natural meaning of their own, like the uniform distribution. So if the data needs to be modeled like this, or with slightly more flexibility, then the beta is a very good choice.


0

In another question concerning the beta distribution the following intuition behind beta is provided:

In other words the beta distribution can be seen as the distribution of probabilities in the center of a jittered distribution.

For details please checkout the full answer at https://stats.stackexchange.com/a/429754/142758


3

Most of the answers here seem to cover two approaches: Bayesian and the order statistic. I'd like to add a viewpoint from the binomial, which I think is the easiest to grasp.

The intuition for a beta distribution comes into play when we look at it from the lens of the binomial distribution.

[figure: the binomial distribution compared with the beta distribution]

The difference between the binomial and the beta is that the former models the number of occurrences ($x$), while the latter models the probability ($p$) itself. In other words, the probability is a parameter in the binomial; in the beta, the probability is a random variable.

Interpretation of $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$

You can think of $\alpha-1$ as the number of successes and $\beta-1$ as the number of failures, just like the $x$ and $n-x$ terms in the binomial. You can choose the $\alpha$ and $\beta$ parameters however you think they are supposed to be. If you think the probability of success is very high, let's say 90%, set 90 for $\alpha$ and 10 for $\beta$. If you think otherwise, 90 for $\beta$ and 10 for $\alpha$.

As $\alpha$ becomes larger (more successful events), the bulk of the probability distribution will shift towards the right, whereas an increase in $\beta$ moves the distribution towards the left (more failures). Also, the distribution will narrow if both $\alpha$ and $\beta$ increase, for we are more certain.

The Intuition behind the shapes

The PDF of Beta distribution can be U-shaped with asymptotic ends, bell-shaped, strictly increasing/decreasing or even straight lines. As you change $\alpha$ or $\beta$, the shape of the distribution changes.
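For example (a sketch of my own, with parameter values chosen only to illustrate each shape):

x <- seq(0.001, 0.999, length.out = 500)
plot(x, dbeta(x, 8, 2), type = "l", col = "blue", ylab = "density")  # skewed bell, bulk near 0.8
lines(x, dbeta(x, 2, 2), col = "darkgreen")                          # symmetric bell around 0.5
lines(x, dbeta(x, 1, 1), col = "gray")                               # uniform: a flat line
lines(x, dbeta(x, 2, 1), col = "purple")                             # straight increasing line
lines(x, dbeta(x, 0.5, 0.5), col = "red")                            # U-shape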

a. Bell-shape

[plot: bell-shaped beta densities, with the Beta(8, 2) PDF in blue]

Notice that the graph of the PDF with $\alpha = 8$ and $\beta = 2$ is in blue, not in red. The x-axis is the probability of success. The PDF of a beta distribution is approximately normal if $\alpha +\beta$ is large enough and $\alpha$ & $\beta$ are approximately equal.

b. Straight Lines

[plot: beta densities that form straight lines]

The beta PDF can be a straight line too.

c. U-shape

[plot: U-shaped beta densities with $\alpha < 1$ and $\beta < 1$]

When $\alpha <1$, $\beta<1$, the PDF of the Beta is U-shaped.

The Intuition behind specific shapes

Why would Beta(2,2) be bell-shaped?

If you think of $\alpha-1$ as the number of successes and $\beta-1$ as the number of failures, Beta(2,2) means you got 1 success and 1 failure. So it makes sense that the density is highest at a success probability of 0.5.

Also, Beta(1,1) would mean you got zero heads and zero tails. Then your guess about the probability of success should be the same throughout [0, 1]. The horizontal straight line confirms it.

What’s the intuition for Beta(0.5, 0.5)?

Why is it U-shaped? What does it mean to have negative (-0.5) heads and tails? I don’t have an answer for this one yet. I even asked this on Stack Exchange but haven’t gotten a response yet. If you have a good idea about the U-shaped Beta, please let me know!


0

There are already so many awesome answers here, but I'd like to share with you how I interpret the "probabilistic distribution of probabilities" as @David Robinson described in the accepted answer and add some complementary points using some very simple illustrations and derivations.

Imagine this: we have a coin and flip it in the following three scenarios: 1) toss it five times and get TTTTT (five tails, zero heads); 2) use the same coin, toss it five times again and get HTTHH (three heads and two tails); 3) take the same coin, toss it ten times and get THHTHHTHTH (six heads and four tails).

Then three issues arise: a) we don't have a strategy for guessing the probability before the first flip; b) in scenario 1, the probability we would estimate for getting a head on the 6th toss is zero, i.e. impossible, which seems unrealistic (a black-swan event); c) in scenarios 2 and 3 the (relative) probabilities of getting a head next time are both $0.6$, even though we know the confidence is higher in scenario 3. Therefore it is not enough to estimate the probability of a coin toss with a single point estimate and no prior information; instead, we need a prior before we toss the coin and a probability distribution at each time step in the three cases above.

The beta distribution $\text{Beta}(\theta|\alpha_H, \alpha_T)$ can address all three problems, where $\theta \in [0, 1]$ represents the probability of heads, $\alpha_H$ the number of times heads occur and $\alpha_T$ the number of times tails occur.


For issue a, we can assume before flipping the coin that heads and tails are equally likely, either by using a point probability and saying that the chance of heads is 50%, or by employing the beta distribution and setting the prior to $\text{Beta}(\theta|1, 1)$ (equivalent to the uniform distribution), meaning two virtual tosses (we can treat the hyperparameters (1, 1) as pseudocounts) in which we have observed one head and one tail (as depicted below).

p = seq(0,1, length=100)
plot(p, dbeta(p, 1, 1), ylab="dbeta(p, 1, 1)", type ="l", col="blue")

[plot: $\text{Beta}(\theta|1, 1)$, a flat density over [0, 1]]

In fact we can bridge the two methods by the following derivation:

$\begin{align*} E[\text{Beta}(\theta|\alpha_H, \alpha_T)] &= \int_0^1 \theta P(\theta|\alpha_H, \alpha_T) d\theta \hspace{2.15cm}\text{the numerator/normalization is a constant}\\ &=\dfrac{\int_0^1 \theta \{ \theta^{\alpha_H-1} (1-\theta)^{\alpha_T-1}\}\ d\theta}{B(\alpha_H,\alpha_T)}\hspace{.75cm} \text{definition of Beta; the numerator is a constant} \\ &= \dfrac{B(\alpha_H+1,\alpha_T)}{B(\alpha_H,\alpha_T)} \hspace{3cm}\text{$\theta \theta^{\alpha_H-1}=\theta^{\alpha_H}$} \\ &= \dfrac{\Gamma(\alpha_H+1) \Gamma(\alpha_T)}{\Gamma(\alpha_H+\alpha_T+1)} \dfrac{\Gamma(\alpha_H+\alpha_T)}{\Gamma(\alpha_H)\Gamma(\alpha_T)} \\ &= \dfrac{\alpha_H}{\alpha_H+\alpha_T} \end{align*}$

We see that the expectation $\frac{1}{1+1}=50\%$ is just equal to the point probability, and we can also view the point probability as a single point drawn from the beta distribution (the beta distribution spreads its belief over all probabilities in [0, 1], whereas the point estimate puts all of its belief on the single value 50%).


For issue b, we can calculate the posterior as follows after observing the data $\mathcal{D}$ of $N$ tosses (here $N = 5$: $N_T=5$ and $N_H=0$).

$\begin{align*} \text{Beta}(\theta|\mathcal{D}, \alpha_H, \alpha_T) &\propto P(\mathcal{D}|\theta,\alpha_H, \alpha_T)P(\theta|\alpha_H, \alpha_T) \hspace{.47cm}\text{likelihood $\times$ prior}\\ &= P(\mathcal{D}|\theta) P(\theta|\alpha_H, \alpha_T) \hspace{2cm} \text{as depicted below}\\ &\propto \theta^{N_H} (1-\theta)^{N_T} \cdot \theta^{\alpha_H-1} (1-\theta)^{\alpha_T-1} \\ &= \theta^{N_H+\alpha_H-1} (1-\theta)^{N_T+\alpha_T-1} \\ &= \text{Beta}(\theta|\alpha_H+N_H, \alpha_T+N_T) \end{align*}$

[graphical model: prior and evidence; $\mathcal{D}$, $\alpha_H$ and $\alpha_T$ are independent given $\theta$]

We can plug in the prior and N observations and get $\text{Beta}(\theta|1+0, 1+5)$

p = seq(0,1, length=100)
plot(p, dbeta(p, 1+0, 1+5), ylab="dbeta(p, 1+0, 1+5)", type ="l", col="blue")

[plot: $\text{Beta}(\theta|1+0, 1+5)$]

We see that in this distribution over all probabilities of getting a head, the density is high over the low probabilities but is nowhere zero, and the expectation is $E[\text{Beta}(\theta|1+0, 1+5)] = \frac{1+0}{1+0+1+5}$ (this is Laplace smoothing, or additive smoothing) rather than 0/impossible (issue b).


For issue c, we can calculate the two posteriors (along the same lines as the derivation above) and compare them (again with the uniform prior). When we get three heads and two tails we get $\text{Beta}(\theta|\mathcal{D}, \alpha_H, \alpha_T)=\text{Beta}(\theta|1+3, 1+2)$

p = seq(0,1, length=100)
plot(p, dbeta(p, 1+3, 1+2), ylab="dbeta(p, 1+3, 1+2)", type ="l", col="blue")

[plot: $\text{Beta}(\theta|4, 3)$]

When we get six heads and four tails we get $\text{Beta}(\theta|\mathcal{D}, \alpha_H, \alpha_T)=\text{Beta}(\theta|1+6, 1+4)$

p = seq(0,1, length=100)
plot(p, dbeta(p, 1+6, 1+4), ylab="dbeta(p, 1+6, 1+4)", type ="l", col="blue")

[plot: $\text{Beta}(\theta|7, 5)$]

We can calculate their expectations ($\frac{1+3}{1+3+1+2} = 0.571$ and $\frac{1+6}{1+6+1+4} = 0.583$; if we ignore the prior, $\frac{3}{3+2} = \frac{6}{6+4}$), but we can see that the second curve is taller and narrower (more confident). The denominator of the expectation can be interpreted as a measure of confidence: the more evidence (virtual or real) we have, the more confident the posterior and the taller and narrower the curve of the beta distribution. If instead we used only a point estimate, as in issue c, that information would simply be lost.

References:
1. https://math.stackexchange.com/a/497599/351322
2. Section 17.3.1.3 of Probabilistic Graphical Models: Principles and Techniques