What is Bayesian inference?

A reminder on conditional probabilities

$Pr (A ∣ B)$ : Probability of A given B
The ordering matters: $Pr (A ∣ B)$ is not the same as $Pr (B ∣ A)$ .
$Pr (A ∣ B) = \frac{Pr (A \cap B)}{Pr (B)}$

Screening for vampirism

The chance of the test being positive given you are a vampire is $Pr (+ | vampire) = 0.90$ (sensitivity).
The chance of a negative test given you are mortal is $Pr (- | mortal) = 0.95$ (specificity).

What is the question?

From the perspective of the test: Given a person is a vampire, what is the probability that the test is positive? $Pr (+ | vampire) = 0.90$ .
From the perspective of a person: Given that the test is positive, what is the probability that this person is a vampire? $Pr (vampire | +) = ?$
Assume that vampires are rare, and represent only $0.1 %$ of the population. This means that $Pr (vampire) = 0.001$ .

What is the answer? Bayes’ theorem to the rescue!

$Pr (vampire | +) = \frac{Pr (vampire and +)}{Pr (+)}$
$Pr (vampire and +) = Pr (vampire) Pr (+ | vampire) = 0.0009$
$Pr (+) = 0.0009 + 0.04995 = 0.05085$
$Pr (vampire | +) = 0.0009 / 0.05085 = 0.02$

$Pr (vampire | +) = \frac{Pr (+ | vampire) Pr (vampire)}{Pr (+)}$

let’s have an example

Screening for vampirism

Suppose the diagnostic test has the same sensitivity and specificity but vampirism is more common: $10 %$ of the population is vampire.
What is the probability that a person is a vampire, given that the test is positive?

Solution

The probability that a person is a vampire, given that the test is positive

$Pr (+ | vampire) = 0.9$
$Pr (- | mortal) = 0.95$
$Pr (vampire) = 0.1$

$\begin{aligned} Pr (+) & = Pr (+ | vampire) Pr (vampire) + Pr (+ | mortal) Pr (mortal) \\ = 0.9 * 0.1 + 0.05 * 0.9 \\ = 0.135 \end{aligned}$

$\begin{aligned} Pr (vampire | +) & = Pr (+ | vampire) Pr (vampire) / Pr (+) \\ = 0.9 * 0.1 / 0.135 \end{aligned}$

Bayes’ theorem

A theorem about conditional probabilities.
$Pr (B ∣ A) = \frac{Pr (A ∣ B) Pr (B)}{Pr (A)}$

Easy to mess up with letters. Might be easier to remember when written like this:

$Pr (hypothesis ∣ data) = \frac{Pr (data ∣ hypothesis) Pr (hypothesis)}{Pr (data)}$

The “hypothesis” is typically something unobserved or unknown. It’s what you want to learn about using the data.
For regression models, the “hypothesis” is a parameter (intercept, slopes or error terms).
Bayes theorem tells you the probability of the hypothesis given the data.

Frequentist versus Bayesian

Typical stats problems involve estimating parameter $θ$ with available data.
The frequentist approach (maximum likelihood estimation – MLE) assumes that the parameters are fixed, but have unknown values to be estimated.
Classical estimates generally provide a point estimate of the parameter of interest.
The Bayesian approach assumes that the parameters are not fixed but have some fixed unknown distribution - a distribution for the parameter.

What is the Bayesian approach?

The approach is based upon the idea that the experimenter begins with some prior beliefs about the system.
And then updates these beliefs on the basis of observed data.
This updating procedure is based upon the Bayes’ Theorem:

$Pr (A ∣ B) = \frac{Pr (B ∣ A) Pr (A)}{Pr (B)}$

Schematically if $A = θ$ and $B = data$ , then
The Bayes’ theorem

$Pr (A ∣ B) = \frac{Pr (B ∣ A) Pr (A)}{Pr (B)}$

Translates into:

$Pr (θ ∣ data) = \frac{Pr (data ∣ θ) Pr (θ)}{Pr (data)}$

Bayes’ theorem

$Pr (θ ∣ data) = \frac{Pr (data ∣ θ) Pr (θ)}{Pr (data)}$

$P o s t e r i o r d i s t r i b u t i o n$ : Represents what you know after having seen the data. The basis for inference, a distribution, possibly multivariate if more than one parameter ( $θ$ ).
$L i k e l i h o o d$ : We know that guy from before, same as in the MLE approach.
$P r i o r d i s t r i b u t i o n$ : Represents what you know before seeing the data. The source of much discussion about the Bayesian approach.
$Pr (data) = \int L (data ∣ θ) Pr (θ) d θ$ : Possibly high-dimensional integral, difficult if not impossible to calculate. This is one of the reasons why we need simulation (MCMC) methods - more soon.

$\underset{Posterior}{\underset{⏟}{P (θ ∣ y)}} = \frac{\overset{Likelihood}{\overset{⏞}{P (y ∣ θ)}} \cdot \overset{Prior}{\overset{⏞}{P (θ)}}}{\underset{Normalizing Constant}{\underset{⏟}{P (y)}}}$

Likelihood

Usually, when talking about probability distributions, we assume that we know the parameter values.
In the real world, it is usually the other way around.

A question of interest might be for example:

We have observed 3 births by a female during her 10 breeding attempts. What does this tell us about the true probability of getting a successful breeding attempt from this female? For the population?

We don’t know what the probability of a birth is.
But we can calculate the probability of getting our data for different values:

dbinom(x=3,size=10,prob=0.1)
#> [1] 0.05739563

We don’t know what the probability of a birth is.
But we can see what the probability of getting our data would be for different values:

dbinom(x=3,size=10,prob=0.9)
#> [1] 8.748e-06

We don’t know what the probability of a birth is.
But we can see what the probability of getting our data would be for different values:


dbinom(x=3,size=10,prob=0.25)
#> [1] 0.2502823

dbinom(x=3,size=10,prob=0.1)
#> [1] 0.05739563
dbinom(x=3,size=10,prob=0.9)
#> [1] 8.748e-06
dbinom(x=3,size=10,prob=0.25)
#> [1] 0.2502823

So we would be more likely to observe 3 births if the probability is 0.25 than 0.1 or 0.9.

The likelihood

This reasoning is so common in statistics that it has a special name:
The likelihood is the probability of observing the data under a certain model.
The data are known, we usually consider the likelihood as a function of the model parameters $θ_{1}, θ_{2}, \dots, θ_{p}$

$L = P (θ_{1}, θ_{2}, \dots, θ_{p} ∣ data)$

A Formula for the Likelihood Function

Let $f (x; θ)$ denote the pdf of a random variable $X$ with associated parameter $θ$ . Suppose $X_{1}, X_{2}, \dots, X_{n}$ are random samples from this distribution, and $x_{1}, x_{2}, \dots, x_{n}$ are the corresponding observed values.

$L (θ ∣ x_{1}, x_{2}, \dots, x_{n}) = f (x_{1}; θ) f (x_{2}; θ) \dots f (x_{n}; θ) = \prod_{i = 1}^{n} f (x_{i}; θ) .$

Likelihood functions in R

We may create a function to calculate a likelihood:

lik.fun <- function(parameter){
  ll <- dbinom(x=3, size=10, prob=parameter)
  return(ll)
}

lik.fun(0.3)
#> [1] 0.2668279

lik.fun(0.6)
#> [1] 0.04246733

Maximize the likelihood (3 successes ot of 10 attempts)

The maximum of the likelihood is at value $0.3$

Maximum likelihood estimation

There is always a set of parameters that gives you the highest likelihood of observing the data, and this is the MLE.
These can be calculated using:
- Trial and error (not efficient!).
- Compute the maximum of a function by hand (rarely doable in practice).
- An iterative optimization algorithm: ?optim in R.

By hand:

compute MLE of $p$ from $Y \sim Bin (N = 10, p)$ with $k = 3$ successes

$P (Y = k) = (\binom{k}{N}) p^{k} (1 - p)^{N - k} = L (p)$ .
$\log (L (p)) = cte + k \log (p) + (N - k) \log (1 - p)$ .
We are searching for the maximum of $L$ , or equivalently that of $\log (L)$ .
Compute derivate w.r.t. $p$ : $\frac{d \log (L)}{d p} = \frac{k}{p} - \frac{(N - k)}{(1 - p)}$ .
Then solve $\frac{d \log (L)}{d p} = 0$ ; the MLE is $\hat{p} = \frac{k}{N} = \frac{3}{10} = 0.3$ .
Here, the MLE is the proportion of observed successes.

Using a computer:

MLE of $p$ from $Y \sim Bin (N = 10, p)$ with $k = 3$ successes

lik.fun <- function(parameter) dbinom(x=3, size=10, prob=parameter)
# ?optimize
optimize(lik.fun,c(0,1),maximum=TRUE)
#> $maximum
#> [1] 0.3000157
#> 
#> $objective
#> [1] 0.2668279

Use optim when the number of parameters is $> 1$ .

The Bayesian approach

The Bayesian starts off with a prior.
Now, the one thing we know about $θ$ is that is a continuous random variable and that it lies between zero and one.
Thus, a suitable prior distribution might be the Beta defined on $[0, 1]$ .
What is the Beta distribution?

What is the Beta distribution?

$q (θ ∣ α, β) = \frac{1}{Beta (α, β)} θ^{α - 1} (1 - θ)^{θ - 1}$

with $Beta (α, β) = \frac{Γ (α) Γ (β)}{Γ (α + β)}$ and $Γ (n) = (n - 1)!$

Beta Beta Conjugate Prior

We assume a priori that (let $θ = p$ )

$P r i o r : p \sim B e t a (α, β)$ so that $Pr (p) = p^{α - 1} (1 - p)^{β - 1}$

also :

$L i k e l i h o o d : X | p \sim B i n o m i a l (n, p)$

Then we have:

$L (P) = (\binom{n}{k}) p^{k} (1 - p)^{n - k}$

$P (p) = \frac{p^{α - 1} (1 - p)^{β - 1}}{B e t a (α, β)}$

$\begin{aligned} P r (p ∣ k) & \propto (\binom{n}{k}) p^{k} (1 - p)^{n - k} p^{a - 1} (1 - p)^{β - 1} \\ \propto p^{(α + k) - 1} (1 - p)^{(β + n - k) - 1} \end{aligned}$

That is:

$p ∣ k \sim B e t a (α + k, β + n - k)$

Take a Beta prior with a Binomial likelihood, you get a Beta posterior (conjugacy)

Elicitation

Often, one has a belief about the distribution of one’s data. You may think that your data come from a binomial distribution and in that case you typically know the $n$ , the number of trials but you usually do not know $p$ , the probability of success. Or you may think that your data come from a normal distribution. But you do not know the mean $μ$ or the standard deviation $σ$ of the normal. Beside to knowing the distribution of one’s data, you may also have beliefs about the unknown $p$ in the binomial or the unknown mean $μ$ in the normal.

Bayesians express their belief in terms of personal probabilities. These personal probabilities encapsulate everything a Bayesian knows or believes about the problem. But these beliefs must obey the laws of probability, and be consistent with everything else the Bayesian knows.

So a Bayesian will seek to express their belief about the value of $p$ through a probability distribution, and a very flexible family of distributions for this purpose is the beta family. A member of the beta family is specified by two parameters, $α$ and $β$ ; we denote this as $p \sim beta (α, β)$ . The probability density function is

$f (p) = \frac{Γ (α + β)}{Γ (α) Γ (β)} p^{α - 1} (1 - p)^{β - 1}$ where $0 \leq p \leq 1, α > 0, β > 0$ , and $Γ$ is a factorial:

$Γ (n) = (n - 1)! = (n - 1) \times (n - 2) \times \dots \times 1$ Here is a summary of the key ideas:

Bayesians express their uncertainty through probability distributions.
One can think about the situation and self-elicit a probability distribution that approximately reflects his/her personal probability.
One’s personal probability should change according Bayes’ rule, as new data are observed.
The beta family of distribution can describe a wide range of prior beliefs.

Conjugacy

Next, let’s introduce the concept of conjugacy in Bayesian statistics.

Suppose we have the prior beliefs about the data as below:

Binomial distribution $Bin (n, p)$ with $n$ known and $p$ unknown
Prior belief about $p$ is $beta (α, β)$

Then we observe $x$ success in $n$ trials, and it turns out the Bayes’ rule implies that our new belief about the probability density of $p$ is also the beta distribution, but with different parameters. In mathematical terms,

$p | x \sim beta (α + x, β + n - x) .$

This is an example of conjugacy. Conjugacy occurs when the posterior distribution is in the same family of probability density functions as the prior belief, but with new parameter values, which have been updated to reflect what we have learned from the data.

Why are the beta binomial families conjugate? Here is a mathematical explanation.

Recall the discrete form of the Bayes’ rule:

$P (A_{i} | B) = \frac{P (B | A_{i}) P (A_{i})}{\sum_{j = 1}^{n} P (B | A_{j}) P (A_{j})}$

However, this formula does not apply to continuous random variables, such as the $p$ which follows a beta distribution, because the denominator sums over all possible values (must be finitely many) of the random variable.

But the good news is that the $p$ has a finite range – it can take any value only between 0 and 1. Hence we can perform integration, which is a generalization of the summation. The Bayes’ rule can also be written in continuous form as:

$π^{*} (p | x) = \frac{P (x | p) π (p)}{\int_{0}^{1} P (x | p) π (p) d p} .$

This is analogus to the discrete form, since the integral in the denominator will also be equal to some constant, just like a summation. This constant ensures that the total area under the curve, i.e. the posterior density function, equals 1.

Note that in the numerator, the first term, $P (x | p)$ , is the data likelihood – the probability of observing the data given a specific value of $p$ . The second term, $π (p)$ , is the probability density function that reflects the prior belief about $p$ .

In the beta-binomial case, we have $P (x | p) = Bin (n, p)$ and $π (p) = beta (α, β)$ .

Plugging in these distributions, we get

\begin{aligned} π^{*} (p | x) & = \frac{1}{some number} \times P (x | p) π (p) \\ = \frac{1}{some number} [(\begin{array}{c} n \\ x \end{array}) p^{x} (1 - p)^{n - x}] [\frac{Γ (α + β)}{Γ (α) Γ (β)} p^{α - 1} (1 - p)^{β - 1}] \\ = \frac{Γ (α + β + n)}{Γ (α + x) Γ (β + n - x)} \times p^{α + x - 1} (1 - p)^{β + n - x - 1} \end{aligned}

Let $α^{*} = α + x$ and $β^{*} = β + n - x$ , and we get

$π^{*} (p | x) = beta (α^{*}, β^{*}) = beta (α + x, β + n - x),$

same as the posterior formula in Equation.

We can recognize the posterior distribution from the numerator $p^{α + x - 1}$ and $(1 - p)^{β + n - x - 1}$ . Everything else are just constants, and they must take the unique value, which is needed to ensure that the area under the curve between 0 and 1 equals 1. So they have to take the values of the beta, which has parameters $α + x$ and $β + n - x$ .

This is a cute trick. We can find the answer without doing the integral simply by looking at the form of the numerator.

Without conjugacy, one has to do the integral. Often, the integral is impossible to evaluate. That obstacle is the primary reason that most statistical theory in the 20th century was not Bayesian. The situation did not change until modern computing allowed researchers to compute integrals numerically.

Three Conjugate Families

In this section, the three conjugate families are beta-binomial, gamma-Poisson, and normal-normal pairs. Each of them has its own applications in everyday life.

The Gamma-Poisson Conjugate Families

The Poisson distribution has a single parameter $λ$ , and it is denoted as $X \sim Pois (λ)$ with $λ > 0$ . The probability mass function is

$P (X = k) = \frac{λ^{k}}{k!} e^{- λ} for k = 0, 1, \dots,$

where $k! = k \times (k - 1) \times \dots \times 1$ . This gives the probability of observing a random variable equal to $k$ .

Note that $λ$ is both the mean and the variance of the Poisson random variable. It is obvious that $λ$ must be greater than zero, because it represents the mean number of counts, and the variance should be greater than zero (except for constants, which have zero variance).

The gamma family is flexible, and Figure below illustrates a wide range of gamma shapes.

The probability density function for the gamma is indexed by shape $k$ and scale $θ$ , denoted as $Gamma (k, θ)$ with $k, θ > 0$ . The mathematical form of the density is

$f (x) = \frac{1}{Γ (k) θ^{k}} x^{k - 1} e^{- x / θ}$ where

$Γ (z) = \int_{0}^{\infty} x^{z - 1} e^{- x} d x .$ $Γ (z)$ , the gamma function, is simply a constant that ensures the area under curve between 0 and 1 sums to 1, just like in the beta probability distribution case of Equation @ref(eq:beta). A special case is that $Γ (n) = (n - 1)!$ when $n$ is a positive integer.

However, some books parameterize the gamma distribution in a slightly different way with shape $α = k$ and rate (inverse scale) $β = 1 / θ$ :

$f (x) = \frac{β^{α}}{Γ (α)} x^{α - 1} e^{- β x}$

For this example, we use the $k$ - $θ$ parameterization, but you should always check which parameterization is being used. For example, $R$ uses the $α$ - $β$ parameterization by default.
In the the later material we find that using the rate parameterization is more convenient.

For the gamma Poisson conjugate family, suppose we observed data $x_{1}, x_{2}, \dots, x_{n}$ that follow a Poisson distribution.Then similar to the previous section, we would recognize the kernel of the gamma when using the gamma-Poisson family. The posterior $Gamma (k^{*}, θ^{*})$ has parameters

$k^{*} = k + \sum_{i = 1}^{n} x_{i} and θ^{*} = \frac{θ}{(n θ + 1)} .$

The Normal-Normal Conjugate Families

There are other conjugate families, and one is the normal-normal pair. If your data come from a normal distribution with known variance $σ^{2}$ but unknown mean $μ$ , and if your prior on the mean $μ$ , has a normal distribution with self-elicited mean $ν$ and self-elicited variance $τ^{2}$ , then your posterior density for the mean, after seeing a sample of size $n$ with sample mean $\bar{x}$ , is also normal. In mathematical notation, we have

$\begin{aligned} x | μ & \sim N (μ, σ^{2}) \\ μ & \sim N (ν, τ^{2}) \end{aligned}$

As a practical matter, one often does not know $σ^{2}$ , the standard deviation of the normal from which the data. But there are cases in which it is reasonable to treat the $σ^{2}$ as known.

For the normal-normal conjugate families, assume the prior on the unknown mean follows a normal distribution, i.e. $μ \sim N (ν, τ^{2})$ . We also assume that the data $x_{1}, x_{2}, \dots, x_{n}$ are independent and come from a normal with variance $σ^{2}$ .

Then the posterior distribution of $μ$ is also normal, with mean as a weighted average of the prior mean and the sample mean. We have

$μ | x_{1}, x_{2}, \dots, x_{n} \sim N (ν^{*}, τ^{* 2}),$

where

$ν^{*} = \frac{ν σ^{2} + n \bar{x} τ^{2}}{σ^{2} + n τ^{2}} and τ^{*} = \sqrt{\frac{σ^{2} τ^{2}}{σ^{2} + n τ^{2}}} .$

Example: Numerical Example

Suppose we have a sample of size $n = 10$ with sample $\overset{―}{X} = 5$ , known variance $σ^{2} = 4$ , prior mean $μ_{0} = 3$ and prior variance $τ^{2} = 2$ . The posterior mean and variance are:

Credible and Confidence Intervals

credible intervals, the Bayesian alternative to confidence intervals. Confidence intervals,are the frequentist way to express uncertainty about an estimate of a population mean, a population proportion or some other parameter.

A confidence interval has the form of an upper and lower bound.

$L, U = pe \pm se \times cv$

where $L$ , $U$ are the lower bound and upper bound of the confidence interval respectively, $pe$ represents “point estimates”, $se$ is the standard error, and $cv$ is the critical value.

Most importantly, the interpretation of a 95% confidence interval on the mean is that “95% of similarly constructed intervals will contain the true mean”, not “the probability that true mean lies between $L$ and $U$ is 0.95”.

The reason for this frequentist wording is that a frequentist may not express his uncertainty as a probability. The true mean is either within the interval or not, so the probability is zero or one. The problem is that the frequentist does not know which is the case.

On the other hand, Bayesians have no such qualms. It is fine for us to say that “the probability that the true mean is contained within a given interval is 0.95”. To distinguish our intervals from confidence intervals, we call them credible intervals.

Definitin 1

A confidence interval of X% for a parameter is an interval $(a, b)$ generated by a repeated sampling procedure has probability X% of containing the true value of the parameter, for all possible values of the parameter.

Definition 2

Say you performed a statistical analysis to compare the efficacy of a public policy between two groups and you obtain a difference between the mean of these groups.You can express this difference as a confidence interval. Often we choose 95% confidence.This means that $95 studies out of 100$ , that uses the $same sample size and target population$ , performing the $same statistical test$ ,will expect to find a result of the mean difference between groups inside the confidence interval.