Foundations of Probability Theory for Health Researchers
Probability, Discrete, Bayesian Statistics, Statistics for Health, Biostatistics
Basic concepts of Probability
Sample Space and Random Events
Both deterministic and stochastic phenomena drive everyday life.
A deterministic phenomenon (process or experiment) always produces the same outcome each time it is repeated under the same conditions.
A random phenomenon (process or experiment) is characterized by conditions under which the result cannot be determined with certainty before it occurs; that is, one of several possible outcomes is observed each time the process or experiment is repeated. For example, when a coin is tossed, the outcome is either heads (H) or tails (T), but it is unknown before the coin is tossed.
The sample space Ω is defined as the set of all possible outcomes of a random experiment. For example, if we roll a 6-sided die, the sample space is the set of the six possible outcomes, Ω = {1, 2, 3, 4, 5, 6} (Figure 1).
For each experiment, two events always exist:
the sample space, Ω, which comprises all possible outcomes.
the empty set, ∅, which contains no outcomes and is called the impossible event.
The sample space, Ω, is the set of all possible outcomes of an experiment.

An outcome, ω, is a result from an experiment or observation.

An event, A, is a collection of one or more outcomes from an experiment or observation.
Operations of events using set theory and Venn diagrams
Union of Events: A∪B
The union of the events A and B, denoted by A∪B, is the collection of all outcomes that are in A or in B or in both of them, and it is also an event. It will occur if either A or B occurs (the symbol ∪ is equivalent to the OR operator).
Example
In the experiment of rolling a die, let’s consider the events A = “the number rolled is even” and B = “the number rolled is less than three”.
A <- c(2, 4, 6)  # outcomes of event A: even numbers
B <- c(1, 2)     # outcomes of event B: numbers less than three
union(A, B)
[1] 2 4 6 1
Intersection of Events: A∩B
The intersection of A and B, denoted by A∩B, consists of all outcomes that are in both A and B (the symbol ∩ is equivalent to the AND operator). That is, the events A and B must occur simultaneously.
Example
# A = {2, 4, 6}
# B = {1, 2}
intersect(A, B)
[1] 2
Complement Events
The complement of an event A, denoted by A′ (or Aᶜ), consists of all outcomes of the sample space Ω that are not in A; it is also an event and it occurs when A does not occur. For example, the complement of the union of A and B, denoted by (A∪B)′, consists of all outcomes that are in neither A nor B.
Example
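A minimal check in R, using setdiff() to take the complement relative to the sample space (A and B as defined above):

Omega <- 1:6         # sample space of a six-sided die
AuB <- union(A, B)   # A ∪ B
setdiff(Omega, AuB)  # complement of A ∪ B: outcomes in neither A nor B
[1] 3 5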
Mutually exclusive events
The events A and B are mutually exclusive (also known as incompatible or disjoint) if they cannot occur simultaneously. This means that they do not share any outcomes and A∩B = ∅.
For example, in the die-rolling experiment, let's consider the events A = “the number rolled is even” and C = “the number rolled is odd”. The events A and C cannot both occur on the same roll, so they are mutually exclusive and A∩C = ∅.
Example
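In R, we can verify that A and C share no outcomes:

A <- c(2, 4, 6)  # even numbers
C <- c(1, 3, 5)  # odd numbers
intersect(A, C)  # no common outcomes: A and C are mutually exclusive
numeric(0)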
Probability
The concept of probability is used in everyday life to express the likelihood that a random event will or will not occur. The first step towards determining the probability of an event is to establish a number of basic rules that capture the meaning of probability. The probability of an event should fulfill the three axioms defined by Kolmogorov:

1. The probability of an event A is a non-negative number, P(A) ≥ 0.
2. The probability of the sample space is one, P(Ω) = 1.
3. If A and B are mutually exclusive events, then P(A∪B) = P(A) + P(B).
Definition of Probability
A. Theoretical probability (theoretical approach)
Theoretical probability describes the behavior we expect to happen if we give a precise description of the experiment (but without conducting any experiments). Theoretically, we can list all the equally probable outcomes of an experiment and determine how many of them are favorable for the event A to occur. Then, the probability of the event A is defined as:

$$P(A) = \frac{\text{number of outcomes favorable to } A}{\text{total number of possible outcomes}}$$

Note that Equation 1 only works for experiments that are considered “fair”; this means that there must be no bias involved so that all outcomes are equally likely to occur.
Example 1
What is the theoretical probability of rolling the number “5” when we roll a six-sided fair die once?
The theoretical probability is:

$$P(5) = \frac{1}{6} \approx 0.167$$

This is because only one outcome (die showing 5) is favorable out of the six equally likely outcomes (die showing 1, 2, 3, 4, 5, or 6).
Example 2
What is the probability of rolling either a “5” or a “6” when we roll a six-sided fair die once?
The theoretical probability is:

$$P(5 \cup 6) = \frac{2}{6} = \frac{1}{3} \approx 0.333$$

This is because two outcomes (die showing 5 or 6) are favorable out of the six equally likely outcomes (die showing 1, 2, 3, 4, 5, or 6).

We can also use the probability axioms. The probability of rolling a 5 is 1/6 and the probability of rolling a 6 is also 1/6. We cannot roll a 5 and a 6 at the same time (these events are mutually exclusive), so:

$$P(5 \cup 6) = P(5) + P(6) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$$
B. Experimental probability (frequentist approach)
The experimental probability is based on data from repetitions of the same experiment. According to this approach, the probability of an event A, denoted by P(A), is the relative frequency of occurrence of the event over a total number of experiments:

$$P(A) = \frac{\text{number of times } A \text{ occurred}}{\text{total number of repetitions of the experiment}}$$
Yet, this definition seems less clear, as it does not specify the exact interpretation of “repetitions of the same experiment” [@finetti2008e].
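We can illustrate the frequentist idea with a short simulation in R (the seed and the number of rolls are arbitrary choices):

# Estimate P(rolling a 5) as a relative frequency over many die rolls
set.seed(123)
rolls <- sample(1:6, size = 10000, replace = TRUE)
mean(rolls == 5)  # close to the theoretical value 1/6 ≈ 0.167

As the number of repetitions grows, the relative frequency stabilizes around the theoretical probability.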
The following properties are useful to assign and manipulate event probabilities.
The Conditional Probability
The conditional probability of A given B, denoted by P(A|B), is the probability that event A occurs given that event B has occurred. The following formula defines the conditional probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$

or

$$P(B|A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) > 0$$
Example
Suppose we roll two fair six-sided dice. What is the probability that the first roll is a 3, given that the sum of two rolls is 8?
The sample space of the experiment consists of all ordered pairs of numbers from 1 to 6. That is, Ω = {(1, 1), (1, 2),… , (1, 6), (2, 1),… , (6, 6)}.
It is useful to define the following two events:
A = {The first roll shows 3, and the second any number}.
B = {The sum of two rolls is 8}.
We are interested in finding the conditional probability P(A|B).
- Event A (the first roll shows 3, and the second any number) is given by outcomes A = {(3,1), (3,2), (3,3), (3,4), (3,5), (3, 6)}.
Therefore, the probability of event A is:

$$P(A) = \frac{6}{36} = \frac{1}{6}$$
- Event B (the sum of two rolls is 8) is given by outcomes B = {(2,6), (3,5), (4,4), (5,3), (6,2)}.

Therefore, the probability of event B to occur is:

$$P(B) = \frac{5}{36}$$
- Also, the event A∩B occurs if the first die shows 3 and the sum is 8, which can clearly occur only if the sequence (3, 5) takes place:

| 1st roll | 2 | 3 | 4 | 5 | 6 |
|----------|---|---|---|---|---|
| 2nd roll | 6 | 5 | 4 | 3 | 2 |
| Sum      | 8 | 8 | 8 | 8 | 8 |

Thus, the probability of the intersection of the two events is P(A∩B) = 1/36.
- Finally, according to the definition of conditional probability (Equation 6), the probability of interest is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/36}{5/36} = \frac{1}{5} = 0.2$$
Therefore, the “knowledge” that the sum of two rolls is 8 has updated the probability of A from P(A) = 1/6 = 0.167 to P(A|B) = 1/5 = 0.2.
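We can verify this result in R by enumerating all 36 equally likely outcomes:

# All ordered pairs of outcomes of two dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
A <- rolls$die1 == 3                # first roll shows 3
B <- rolls$die1 + rolls$die2 == 8   # sum of the two rolls is 8
mean(A & B) / mean(B)               # P(A|B) = P(A∩B)/P(B)
[1] 0.2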
Bayes’ theorem
Bayes’ theorem is based on this concept of “revisiting probability” when new information is available.
Equation 5 states that:

$$P(A \cap B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

Now, replacing P(A ∩ B) with P(B|A) · P(A) in Equation 6, we get Bayes' theorem:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

where P(A) and P(B) are the probabilities of events A and B (with P(B) > 0), and P(B|A) is the conditional probability of B given A.
Example
We are interested in calculating the probability of developing lung cancer if a person smokes tobacco for a long time, P(Cancer|Smoker).
Suppose that 8% of the population has lung cancer, P(Cancer) = 0.08, and 30% of the population are chronic smokers, P(Smoker) = 0.30. Also, suppose that we know that 60% of all people who have lung cancer are smokers, P(Smoker|Cancer) = 0.6.
Using Bayes' theorem we have:

$$P(\text{Cancer}|\text{Smoker}) = \frac{P(\text{Smoker}|\text{Cancer}) \cdot P(\text{Cancer})}{P(\text{Smoker})} = \frac{0.6 \times 0.08}{0.30} = 0.16$$

Therefore, the probability that a chronic smoker develops lung cancer is 0.16 or 16%.
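The same calculation in R:

p_cancer <- 0.08              # P(Cancer)
p_smoker <- 0.30              # P(Smoker)
p_smoker_cancer <- 0.60       # P(Smoker|Cancer)
p_smoker_cancer * p_cancer / p_smoker   # P(Cancer|Smoker)
[1] 0.16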
Bayes' Theorem

Partitions of Sample Spaces

A partition of a sample space Ω is a collection of events A₁, A₂, …, Aₖ that are mutually exclusive (Aᵢ ∩ Aⱼ = ∅ for every i ≠ j) and exhaustive (A₁ ∪ A₂ ∪ … ∪ Aₖ = Ω).

Statement of Bayes' Theorem

Let A₁, A₂, …, Aₖ be a partition of Ω with P(Aᵢ) > 0 for every i. Then, for any event B with P(B) > 0:

$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B|A_j)\,P(A_j)}$$
Independence of events
If the knowledge of occurrence of an event does not influence the occurrence of another event, the two events are called independent. In fact, if A and B are independent, then the conditional probabilities are P(A|B) = P(A), i.e. the occurrence of B has no influence on the occurrence of A, and P(B|A) = P(B), i.e. the occurrence of A has no influence on the occurrence of B. Consider, for example, rolling two dice consecutively: the outcome of the first die is independent of the outcome of the second die.
Two events A and B are said to be independent if:

$$P(A \cap B) = P(A) \cdot P(B)$$

This is known as the Multiplication Rule of Probability for independent events and follows directly from Equation 5 because P(A|B) = P(A).
Example
Determine the probability of obtaining two 3s when rolling two six-sided fair dice consecutively. This event can be decomposed into two events:

- A = {die 1 shows 3, and die 2 shows any number}.
- B = {die 1 shows any number, and die 2 shows 3}.

We can state that the two events A and B are independent by nature, since each event involves a different die, which has no knowledge of the outcome of the other one. The event of interest is A ∩ B, and the definition of the probability of two independent events leads to:

$$P(A \cap B) = P(A) \cdot P(B) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
This result can be verified by a direct count of all possible outcomes in the roll of two dice, and the fact that there is only one combination out of 36 that gives rise to two consecutive 3s.
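A direct count in R confirms this:

rolls <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 outcomes
mean(rolls$die1 == 3 & rolls$die2 == 3)        # 1/36
[1] 0.02777778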
Random variables and Probability Distributions
A random variable assigns a numerical quantity to every possible outcome of a random phenomenon and may be:
- discrete if it takes either a finite number of values or a countably infinite sequence of possible values
- continuous if it takes any value in some interval on the real numbers
Discrete Probability Distributions
The probability distribution of a discrete random variable X is defined by the probability mass function (pmf) as:

$$P(x) = P(X = x)$$

where P(X = x) is the probability that the random variable X takes the value x.

The pmf has two properties:

- P(x) ≥ 0 for every value x
- $\sum_{x} P(x) = 1$

Additionally, the cumulative distribution function (cdf) gives the probability that the random variable X is less than or equal to x and is usually denoted as F(x):

$$F(x) = P(X \leq x) = \sum_{t \leq x} P(t)$$

where the sum takes place for all the values t that are less than or equal to x.
When dealing with a random variable, it is common to calculate three important summary statistics: the expected value, variance and standard deviation.
Expected Value
The expected value or mean, denoted as E(X) or μ, is defined as the weighted average of the values that X can take on, with each possible value being weighted by its respective probability, P(x):

$$\mu = E(X) = \sum_{x} x \cdot P(x)$$
Variance
We can also define the variance, denoted as σ² or Var(X), as the weighted average of the squared deviations of X from its mean μ:

$$\sigma^2 = Var(X) = E[(X-\mu)^2] = \sum_{x} (x-\mu)^2 \, P(x)$$

There is an easier, equivalent form of this formula:

$$\sigma^2 = E(X^2) - \mu^2 = \sum_{x} x^2 \, P(x) - \mu^2$$
Standard deviation
The standard deviation is the square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$
Bernoulli distribution
A random experiment with two possible outcomes, generally referred to as success (x = 1) and failure (x = 0), is called a Bernoulli trial.
Let X be a binary random variable of a Bernoulli trial which takes the value 1 (success) with probability p and 0 (failure) with probability 1−p. The distribution of the X variable is called the Bernoulli distribution with parameter p, denoted as X ∼ Bernoulli(p).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = p^x (1-p)^{1-x}, \qquad x \in \{0, 1\}$$

which can also be written as:

$$P(X = 1) = p \qquad \text{and} \qquad P(X = 0) = 1 - p$$

- The cumulative distribution function (cdf) of X is given by:

$$F(x) = \begin{cases} 0, & x < 0 \\ 1-p, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$

The random variable X can take either value 0 or value 1. If x < 0, then F(x) = 0 because X cannot take values less than 0; if 0 ≤ x < 1, then F(x) = P(X = 0) = 1 − p; and if x ≥ 1, then F(x) = P(X = 0) + P(X = 1) = 1.
The expected value of a random variable X with Bernoulli(p) distribution is:

$$\mu = E(X) = 0 \cdot (1-p) + 1 \cdot p = p$$

The variance is:

$$\sigma^2 = Var(X) = p(1-p)$$

and the standard deviation is:

$$\sigma = \sqrt{p(1-p)}$$
Example
Let X be a random variable representing the result of a surgical procedure, where X = 1 if the surgery was successful and X = 0 if it was unsuccessful. Suppose that the probability of success is 0.7; then X follows a Bernoulli distribution with parameter p = 0.7, that is, X ∼ Bernoulli(0.7).
Find the main characteristics of this distribution.
- The pmf for this distribution is:

$$P(X = x) = 0.7^x \, 0.3^{1-x}, \qquad x \in \{0, 1\}$$

According to Equation 17 we have:

| X    | 0   | 1   |
|------|-----|-----|
| P(X) | 0.3 | 0.7 |
We can plot the pmf for visualizing the distribution of the two outcomes (Figure 6).
- The cdf for this distribution is:

$$F(x) = \begin{cases} 0, & x < 0 \\ 0.3, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$

- The mean is μ = p = 0.7 and the variance is σ² = p(1−p) = 0.7 × 0.3 = 0.21.
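Since a Bernoulli trial is a binomial experiment with a single trial, we can sketch this distribution in R with dbinom() and size = 1:

# Bernoulli(0.7) probabilities for x = 0 (failure) and x = 1 (success)
dbinom(0:1, size = 1, prob = 0.7)
[1] 0.3 0.7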
Binomial distribution
The binomial probability distribution can be used for modeling the number of times a particular event occurs (successes) in a sequence of n repeated and independent Bernoulli trials, each with probability p of success. The binomial setting is characterized by the following conditions:

- There is a fixed number, n, of repeated Bernoulli trials.
- The n trials are all independent. That is, knowing the result of one trial does not change the probability we assign to other trials.
- Both the probability of success, p, and the probability of failure, 1−p, are constant throughout the trials.
Let X be a random variable that indicates the number of successes in n independent Bernoulli trials. If the random variable X satisfies the binomial setting, it follows the binomial distribution with parameters n and p, denoted as X ∼ Binomial(n, p).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = \binom{n}{x} \, p^x (1-p)^{n-x}$$

where x = 0, 1, …, n and $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ is the binomial coefficient, which counts the number of ways to choose x successes out of n trials.
Note that $n! = n \times (n-1) \times \dots \times 2 \times 1$ and, by definition, $0! = 1$.
- The cumulative distribution function (cdf) of X is given by:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \binom{n}{t} \, p^t (1-p)^{n-t}$$
The mean of a random variable X with Binomial(n, p) distribution is:

$$\mu = E(X) = np$$

the variance is:

$$\sigma^2 = np(1-p)$$

and the standard deviation is:

$$\sigma = \sqrt{np(1-p)}$$
Example
Let the random variable X be the number of successful surgical procedures and suppose that a new surgery method is successful 70% of the time (p = 0.7). If the results of 10 surgeries are randomly sampled, then X follows a Binomial distribution, X ∼ Binomial(10, 0.7). Find the main characteristics of this distribution.
- So, the pmf for this distribution is:

$$P(X = x) = \binom{10}{x} \, 0.7^x \, 0.3^{10-x}, \qquad x = 0, 1, \dots, 10$$

The pmf of the Binomial(10, 0.7) distribution specifies the probability of 0 through 10 successful surgical procedures.

According to Equation 23, we can compute the probability P(X = x) for each value x = 0, 1, …, 10.
Binomial Distribution Functions in R
In R, we can use the following functions:

- dbinom(x, n, p) calculates the probability of exactly x successes out of n trials, P(X = x).
- pbinom(x, n, p) calculates the probability of at most x successes out of n trials, P(X ≤ x).
- rbinom(m, n, p) randomly samples m values from the Binomial(n, p) distribution (with replacement).
- qbinom(q, n, p) computes the qth quantile of the Binomial(n, p) distribution.
We can easily compute the above probabilities using the dbinom()
function in R:
dbinom(0:10, size = 10, prob = 0.7)
[1] 0.0000059049 0.0001377810 0.0014467005 0.0090016920 0.0367569090
[6] 0.1029193452 0.2001209490 0.2668279320 0.2334744405 0.1210608210
[11] 0.0282475249
We can plot the pmf for visualizing the distribution (Figure 8).
- The cdf for this distribution is:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \binom{10}{t} \, 0.7^t \, 0.3^{10-t}$$
In R, we can calculate the cumulative probabilities for all the possible outcomes using the pbinom() function as follows:
# find the cumulative probabilities
pbinom(0:10, size = 10, prob = 0.7)
[1] 0.0000059049 0.0001436859 0.0015903864 0.0105920784 0.0473489874
[6] 0.1502683326 0.3503892816 0.6172172136 0.8506916541 0.9717524751
[11] 1.0000000000
The cdf for this distribution is shown below (Figure 9):
- The mean is μ = np = 10 × 0.7 = 7 successful surgeries and the variance is σ² = np(1−p) = 10 × 0.7 × 0.3 = 2.1.
Let's calculate the probability of having more than 8 successful surgical procedures out of a total of 10. Therefore, we want to calculate the probability P(X > 8):

$$P(X > 8) = P(X = 9) + P(X = 10)$$
In R, we can calculate the probabilities P(X = 9) and P(X = 10) by applying the function dbinom()
and adding the results:
p9 <- dbinom(9, size=10, prob=0.7)
p9
[1] 0.1210608
p10 <- dbinom(10, size=10, prob=0.7)
p10
[1] 0.02824752
p9 + p10
[1] 0.1493083
Of note, another way to find the above probability is to calculate 1 − P(X ≤ 8):
1 - pbinom(8, size=10, prob=0.7)
[1] 0.1493083
The Geometric Distribution
If we independently repeat a Bernoulli trial with probability of success p until the first success occurs, the number of trials needed, X, follows the geometric distribution with parameter p. Its probability mass function is:

$$P(X = x) = (1-p)^{x-1} \, p, \qquad x = 1, 2, 3, \dots$$
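For illustration, the probability that the first successful surgery (with p = 0.7) occurs on the third attempt is P(X = 3) = 0.3² × 0.7 = 0.063. Note that R's dgeom() counts the number of failures before the first success rather than the total number of trials:

# P(first success on trial 3) = P(exactly 2 failures before the first success)
dgeom(2, prob = 0.7)
[1] 0.063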
Poisson distribution
While a random variable with a Binomial distribution describes a count variable (e.g., number of successful surgeries), its range is restricted to whole numbers from 0 to n. For example, in a set of 10 surgical procedures (n = 10), the number of successful surgeries cannot surpass 10.
Now, let's suppose that we are interested in the number of successful surgeries per month in a particular specialty within a hospital. Theoretically, in this case, it is possible for the values to extend indefinitely without a predetermined upper limit.
Therefore, using the Poisson distribution, we can estimate the probability of observing a certain number of successful surgeries in a given month. The Poisson setting is characterized by the following conditions:
- The events (occurrences) are counted within a fixed interval of time or space. The interval should be well-defined and consistent.
- Each event is assumed to be independent of the others. The occurrence of one event does not affect the probability of another event happening.
- The probability of an event occurring remains consistent throughout the interval.
Let X be a random variable that indicates the number of events (occurrences) that happen within a fixed interval of time or space. If the events occur independently and at a constant average rate λ per interval, then X follows the Poisson distribution with parameter λ, denoted as X ∼ Poisson(λ).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = \frac{\lambda^x \, e^{-\lambda}}{x!}$$

where x = 0, 1, …, +∞ and λ > 0.

- The cumulative distribution function (cdf) of X is given by:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \frac{\lambda^t \, e^{-\lambda}}{t!}$$
The mean and variance of a random variable that follows the Poisson(λ) distribution are the same and equal to λ:

- μ = λ
- σ² = λ
Example
Let X be a random variable of the number of successful heart transplant surgeries per week in a specialized cardiac center. We assume that the average rate of successful surgeries per week is 2.5 (λ = 2.5). Find the main characteristics of this distribution.

- According to Equation 25, the probability mass function (pmf) of X is:

$$P(X = x) = \frac{2.5^x \, e^{-2.5}}{x!}, \qquad x = 0, 1, 2, \dots$$
The resulting probability table is:
| X    | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7    | 8     | … |
|------|-------|-------|-------|-------|-------|-------|-------|------|-------|---|
| P(X) | 0.082 | 0.205 | 0.257 | 0.214 | 0.134 | 0.067 | 0.028 | 0.01 | 0.003 | … |
We can compute the above probabilities using the dpois()
function in R:
dpois(0:8, lambda = 2.5)
[1] 0.082084999 0.205212497 0.256515621 0.213763017 0.133601886 0.066800943
[7] 0.027833726 0.009940617 0.003106443
We can also plot the pmf for visualizing the distribution (Figure 10).
For this example, the probability of no successful heart transplant surgeries in a week is P(X = 0) = 0.082, while the probability of exactly two successful surgeries per week is P(X = 2) = 0.257.
- The cdf for this distribution gives the cumulative probabilities P(X ≤ x). In R, we can calculate them for all the possible outcomes using the ppois() function as follows:
# find the cumulative probabilities
ppois(0:8, lambda = 2.5)
[1] 0.0820850 0.2872975 0.5438131 0.7575761 0.8911780 0.9579790 0.9858127
[8] 0.9957533 0.9988597
The cdf for this distribution is shown below (Figure 11):
- The mean and variance of this variable are μ = 2.5 and σ² = 2.5, respectively.
Let's calculate the probability of up to four successful heart transplant surgeries per week. Therefore, we want to calculate the cumulative probability:
P(X ≤ 4) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 0.082 + 0.205 + 0.257 + 0.214 + 0.134 = 0.892.
In R, we can calculate this probability by applying the function ppois()
:
ppois(4, lambda = 2.5)
[1] 0.891178
So, the probability of up to four successful heart transplant surgeries per week in the specialized cardiac center is approximately 0.89 or 89%.
Probability distributions for continuous outcomes
Unlike discrete random variables, which have a probability mass function (pmf) that assigns probabilities to individual values, continuous random variables have a probability density function (pdf), denoted as f(x), which satisfies the following properties:

- f(x) ≥ 0 for all x
- the total area under the density curve equals 1, $\int_{-\infty}^{+\infty} f(x)\,dx = 1$
In this case, we are interested in the probability that the value of the random variable X falls within a specific interval from x₁ to x₂, which equals the area under the density curve between these two values:

$$P(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2} f(x)\,dx$$
The graphical representation of the probability density function is referred to as a density plot (Figure 12). In this plot, the x-axis represents the possible values of the variable X, while the y-axis represents the probability density.
Additionally, from the pdf we can find the cumulative probability by calculating the area from −∞ to a specific value x₀:

$$F(x_0) = P(X \leq x_0) = \int_{-\infty}^{x_0} f(x)\,dx$$
The probability of a certain point value in X is zero, and the area under the probability density curve of the interval (−∞, +∞) should be 1.
The expected value, variance, and standard deviation for a continuous random variable X are as follows:
Expected Value
The expected value or mean, denoted as E(X) or μ, is calculated by integrating over the entire range of possible values:

$$\mu = E(X) = \int_{-\infty}^{+\infty} x \, f(x)\,dx$$
Variance
We can also calculate the variance of the variable X:

$$\sigma^2 = Var(X) = \int_{-\infty}^{+\infty} (x-\mu)^2 \, f(x)\,dx$$
Standard deviation
The standard deviation, $\sigma = \sqrt{\sigma^2}$, is often preferred over the variance because it is in the same units as the random variable.
Uniform distribution
The simplest continuous probability distribution is the uniform distribution.
Let X be a continuous random variable that follows the uniform distribution with parameters the minimum value a and the maximum value b, denoted as X ∼ Uniform(a, b).

- The probability density function (pdf) of X is given by:

$$f(x) = \begin{cases} \frac{1}{b-a}, & a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}$$

- The cumulative distribution function (cdf) of X is:

$$F(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x > b \end{cases}$$

The mean of X is given by:

$$\mu = \frac{a+b}{2}$$

The variance of X is given by:

$$\sigma^2 = \frac{(b-a)^2}{12}$$
The uniform distribution, with a=0 and b=1, is highly useful as a random number generator. Let’s see an example of simple randomization in a clinical trial.
Example
Let X be a random variable that follows the uniform distribution Uniform(0, 1). Find the main characteristics of this distribution.
Then utilize this distribution to randomize 100 individuals between treatments A and B in a clinical trial.
NOTE: The simple randomization with the Uniform(0,1) distribution ensures that each individual in the study has an equal chance of being assigned to either treatment group.
- The pdf for this distribution is:

$$f(x) = \begin{cases} 1, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$$

- The cdf for this distribution is:

$$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$

The mean and variance of a random variable that follows the Uniform(0, 1) distribution are:

$$\mu = \frac{0+1}{2} = 0.5 \qquad \text{and} \qquad \sigma^2 = \frac{(1-0)^2}{12} = \frac{1}{12} \approx 0.083$$
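As a quick numerical check of these values, we can integrate in R using the built-in integrate() and dunif() functions:

# E(X) = ∫ x f(x) dx and Var(X) = ∫ (x − μ)² f(x) dx for Uniform(0, 1)
integrate(function(x) x * dunif(x), lower = 0, upper = 1)$value           # mean: 0.5
integrate(function(x) (x - 0.5)^2 * dunif(x), lower = 0, upper = 1)$value # variance: 1/12 ≈ 0.083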
Next, we use the Uniform(0, 1) to randomize individuals between treatments A and B in a clinical trial:
# Set seed for reproducibility
set.seed(235)
# Define the sample size of the clinical trial
N <- 100
# Create a data frame to store the information
data <- data.frame(id = paste("id", 1:N, sep = ""), trt = NA)
# Simulate 100 uniform random variables from (0-1)
x <- runif(N)
# Display the first 10 values
x[1:10]
[1] 0.98373365 0.79328881 0.60725184 0.12093957 0.24990178 0.52147467
[7] 0.96265857 0.92370588 0.45026693 0.07862284
# Make treatment assignments, if x < 0.5 treatment A else B
data$trt <- ifelse(x < 0.5, "A", "B")
# Display the first few rows of the data frame
head(data, 10)
| id   | trt |
|------|-----|
| id1  | B   |
| id2  | B   |
| id3  | B   |
| id4  | A   |
| id5  | A   |
| id6  | B   |
| id7  | B   |
| id8  | B   |
| id9  | A   |
| id10 | A   |
# Display the counts and proportions of each treatment
table(data$trt)
A B
48 52
prop.table(table(data$trt))
A B
0.48 0.52
Normal distribution
A normal distribution, also known as a Gaussian distribution, is a fundamental concept in statistics and probability theory and is defined by two parameters: the mean (μ) and the standard deviation (σ) (see ?@sec-normal).
- The probability density function (pdf) of X ∼ Normal(μ, σ²) is given by:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where −∞ < x < +∞, and e ≈ 2.718 and π ≈ 3.142 are mathematical constants.

- The cumulative distribution function (cdf) of X accumulates probability from negative infinity up to the value x₀, which in interval notation is P(−∞ < X ≤ x₀):

$$F(x_0) = P(X \leq x_0) = \int_{-\infty}^{x_0} f(x)\,dx$$
Example
Let’s say that in a population the random variable of height, X, for adult people approximates a normal distribution with a mean μ = 170 cm and a standard deviation σ = 10 cm.
The pdf for this distribution is shown below (Figure 16):
The Figure 17 illustrates the normal cumulative distribution function. Note that continuous variables generate a smooth curve, while discrete variables produce a stepped line plot.
Let's assume that we want to calculate the area under the curve between 160 cm and 180 cm, that is, the probability P(160 ≤ X ≤ 180).

Using the properties of integrals we have:

$$P(160 \leq X \leq 180) = P(X \leq 180) - P(X \leq 160) = F(180) - F(160)$$

Therefore, one way to find the area under the curve between 160 cm and 180 cm is to calculate the cdf at each of these values and then find the difference between them:
- Let's calculate the F(180) = P(X ≤ 180):
pnorm(180, mean = 170, sd = 10)
[1] 0.8413447
- Similarly, we can calculate the F(160) = P(X ≤ 160):
pnorm(160, mean = 170, sd = 10)
[1] 0.1586553
Finally, we subtract the two values (shaded blue areas) as follows:
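pnorm(180, mean = 170, sd = 10) - pnorm(160, mean = 170, sd = 10)
[1] 0.6826895

Therefore, P(160 ≤ X ≤ 180) ≈ 0.68; about 68% of adults in this population have heights between 160 cm and 180 cm.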
Standard Normal distribution
If X is a random variable with a normal distribution having a mean of μ and a standard deviation of σ, we can standardize it by subtracting the mean and dividing by the standard deviation:

$$z = \frac{X - \mu}{\sigma}$$

The z (often called z-score) is a random variable that has a Standard Normal distribution, also called a z-distribution, i.e. a special normal distribution where μ = 0 and σ = 1.
Z-scores are commonly used in medical settings to assess how an individual’s measurement compares to the average value of the entire population.
Example
Let's assume that the diastolic blood pressure distribution among men has a normal distribution with mean 80 mmHg and standard deviation 15 mmHg. If an individual's diastolic blood pressure is recorded as 110 mmHg, by how many standard deviations does it differ from the population mean?
The z-score (Equation 33) is:

$$z = \frac{X - \mu}{\sigma} = \frac{110 - 80}{15} = 2$$
So this person has a diastolic blood pressure that is 2 standard deviations above the population mean.
To find the area under the curve between two z-scores, z₁ and z₂, we calculate the difference between the two cumulative probabilities:

$$P(z_1 \leq Z \leq z_2) = F(z_2) - F(z_1)$$

For example, let's calculate the area under the curve shown in Figure 22.

In R, we can easily calculate the above area using the cdf of the normal distribution:
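As a sketch, taking z₁ = −2 and z₂ = 2 (illustrative values reusing the z-score from the example above; the exact z-scores shown in Figure 22 are assumed here):

pnorm(2) - pnorm(-2)
[1] 0.9544997

So approximately 95% of the area under the standard normal curve lies within two standard deviations of the mean.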
Chi-square distribution
The chi-square distribution arises in various contexts, such as the chi-square test for independence in ?@sec-chi_square.
Let Z₁, Z₂, …, Zₖ be k independent random variables, each following the standard Normal distribution. Then the sum of their squares follows the chi-square distribution with k degrees of freedom:

$$X = Z_1^2 + Z_2^2 + \dots + Z_k^2 \sim \chi^2(k)$$

Therefore, the chi-square distribution is determined by a single parameter, k, the degrees of freedom.

The mean and variance of a random variable that follows the chi-square distribution with k degrees of freedom are μ = k and σ² = 2k, respectively.

The shape of the chi-square distribution depends on the degrees of freedom, k: it is highly right-skewed for small k and becomes more symmetric as k increases.
NOTE: When the degrees of freedom are large, the chi-square distribution approximates a normal distribution.
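For instance, we can obtain the familiar 95% critical value of the chi-square distribution with one degree of freedom in R:

# 95th percentile of the chi-square distribution with df = 1
qchisq(0.95, df = 1)
[1] 3.841459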
Appendix: Properties of Random Variables
In Introduction to Random Variables we discovered the following properties for a random variable X.
Properties of Discrete Random Variables
For a discrete random variable X with pmf P(x) and cdf F(x):

- 0 ≤ P(x) ≤ 1 for all x.
- $\sum_{x} P(x) = 1$.
- F(x) = P(X ≤ x) for all x.
- F is nondecreasing.
Properties of Continuous Random Variables
For a continuous random variable X with pdf f(x) and cdf F(x):

- f(x) ≥ 0 for all x.
- The cdf F is an antiderivative of the pdf f: $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
- The pdf f is the derivative of the cdf F: f(x) = F′(x).
- 0 ≤ F(x) ≤ 1 for all x.
- $\int_{-\infty}^{+\infty} f(x)\,dx = 1$.
- F is nondecreasing.