Foundations of Probability Theory for Health Researchers
Probability, Discrete, Bayesian Statistics, Statistics for Health, Biostatistics
Basic concepts of Probability
Sample Space and Random Events
Both deterministic and stochastic phenomena drive everyday life.
A deterministic phenomenon (process or experiment) always produces the same outcome each time it is repeated under the same conditions.
A random phenomenon (process or experiment) is characterized by conditions under which the result cannot be determined with certainty before it occurs; that is, one of several possible outcomes is observed each time the process or experiment is repeated. For example, when a coin is tossed, the outcome is either heads (H) or tails (T), but it is unknown before the coin is tossed.
The sample space Ω is defined as the set of all possible outcomes of a random experiment. For example, if we roll a 6-sided die, the sample space is the set of the six possible outcomes, Ω = {1, 2, 3, 4, 5, 6} (Figure 1).
For each experiment, two events always exist:
the sample space, Ω, which comprises all possible outcomes.
the empty set, ∅, which contains no outcomes and is called the impossible event.
The sample space, Ω, is the set of all possible outcomes of an experiment.

An outcome, ω, is a result from an experiment or observation.

An event, A, is a collection of one or more outcomes from an experiment or observation.
Operations of events using set theory and Venn diagrams
Union of Events: A∪B
The union of the events A and B, denoted by A∪B, is the collection of all outcomes that are in A or in B or in both of them, and it is also an event. It will occur if either A or B occurs (the symbol ∪ is equivalent to the OR operator).
Example
In the experiment of rolling a die, let’s consider the events A = “the number rolled is even” and B = “the number rolled is less than three”.
A <- c(2, 4, 6)  # outcomes of event A: even numbers
B <- c(1, 2)     # outcomes of event B: numbers less than three
union(A, B)
[1] 2 4 6 1
Intersection of Events: A∩B
The intersection of A and B, denoted by A∩B, consists of all outcomes that are in both A and B (the symbol ∩ is equivalent to the AND operator). That is, the events A and B must occur simultaneously.
Example
# A = {2, 4, 6}
# B = {1, 2}
intersect(A, B)
[1] 2
Complement Events
The complement of an event A, denoted by A′ (or Aᶜ), consists of all outcomes of the sample space Ω that are not in A; it is also an event and it occurs when A does not occur. For example, the complement of the union of A and B, denoted by (A∪B)′, consists of all outcomes that are in neither A nor B.
Example
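A minimal check in R, using setdiff() to take the complement relative to the sample space (A and B as defined above):

Omega <- 1:6         # sample space of a six-sided die
AuB <- union(A, B)   # A ∪ B
setdiff(Omega, AuB)  # complement of A ∪ B: outcomes in neither A nor B
[1] 3 5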
Mutually exclusive events
The events A and B are mutually exclusive (also known as incompatible or disjoint) if they cannot occur simultaneously. This means that they do not share any outcomes and A∩B = ∅.
For example, in the die-rolling experiment, let's consider the events A = “the number rolled is even” and C = “the number rolled is odd”. The events A and C cannot both occur on the same roll, so they are mutually exclusive and A∩C = ∅.
Example
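In R, we can verify that A and C share no outcomes:

A <- c(2, 4, 6)  # even numbers
C <- c(1, 3, 5)  # odd numbers
intersect(A, C)  # no common outcomes: A and C are mutually exclusive
numeric(0)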
Probability
The concept of probability is used in everyday life to express the likelihood that a random event will or will not occur. The first step towards determining the probability of an event is to establish a number of basic rules that capture the meaning of probability. The probability of an event should fulfill the three axioms defined by Kolmogorov:

1. The probability of an event A is a non-negative number, P(A) ≥ 0.
2. The probability of the sample space is one, P(Ω) = 1.
3. If A and B are mutually exclusive events, then P(A∪B) = P(A) + P(B).
Definition of Probability
A. Theoretical probability (theoretical approach)
Theoretical probability describes the behavior we expect to happen if we give a precise description of the experiment (but without conducting any experiments). Theoretically, we can list all the equally probable outcomes of an experiment and determine how many of them are favorable for the event A to occur. Then, the probability of the event A is defined as:

$$P(A) = \frac{\text{number of outcomes favorable to } A}{\text{total number of possible outcomes}}$$

Note that Equation 1 only works for experiments that are considered “fair”; this means that there must be no bias involved so that all outcomes are equally likely to occur.
Example 1
What is the theoretical probability of rolling the number “5” when we roll a six-sided fair die once?
The theoretical probability is:

$$P(5) = \frac{1}{6} \approx 0.167$$

This is because only one outcome (die showing 5) is favorable out of the six equally likely outcomes (die showing 1, 2, 3, 4, 5, or 6).
Example 2
What is the probability of rolling either a “5” or a “6” when we roll a six-sided fair die once?
The theoretical probability is:

$$P(5 \cup 6) = \frac{2}{6} = \frac{1}{3} \approx 0.333$$

This is because two outcomes (die showing 5 or 6) are favorable out of the six equally likely outcomes (die showing 1, 2, 3, 4, 5, or 6).

We can also use the probability axioms. The probability of rolling a 5 is 1/6 and the probability of rolling a 6 is also 1/6. We cannot roll a 5 and a 6 at the same time (these events are mutually exclusive), so:

$$P(5 \cup 6) = P(5) + P(6) = \frac{1}{6} + \frac{1}{6} = \frac{2}{6} = \frac{1}{3}$$
B. Experimental probability (frequentist approach)
The experimental probability is based on data from repetitions of the same experiment. According to this approach, the probability of an event A, denoted by P(A), is the relative frequency of occurrence of the event over a total number of experiments:

$$P(A) = \frac{\text{number of times } A \text{ occurred}}{\text{total number of repetitions of the experiment}}$$
Yet, this definition seems less clear, as it does not specify the exact interpretation of “repetitions of the same experiment” [@finetti2008e].
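We can illustrate the frequentist idea with a short simulation in R (the seed and the number of rolls are arbitrary choices):

# Estimate P(rolling a 5) as a relative frequency over many die rolls
set.seed(123)
rolls <- sample(1:6, size = 10000, replace = TRUE)
mean(rolls == 5)  # close to the theoretical value 1/6 ≈ 0.167

As the number of repetitions grows, the relative frequency stabilizes around the theoretical probability.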
The following properties are useful to assign and manipulate event probabilities.
The Conditional Probability
The conditional probability of A given B, denoted by P(A|B), is the probability that event A occurs given that event B has occurred. The following formula defines the conditional probability:

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \qquad P(B) > 0$$

or

$$P(B|A) = \frac{P(A \cap B)}{P(A)}, \qquad P(A) > 0$$
Example
Suppose we roll two fair six-sided dice. What is the probability that the first roll is a 3, given that the sum of two rolls is 8?
The sample space of the experiment consists of all ordered pairs of numbers from 1 to 6. That is, Ω = {(1, 1), (1, 2),… , (1, 6), (2, 1),… , (6, 6)}.
It is useful to define the following two events:
A = {The first roll shows 3, and the second any number}.
B = {The sum of two rolls is 8}.
We are interested in finding the conditional probability P(A|B).
- Event A (the first roll shows 3, and the second any number) is given by outcomes A = {(3,1), (3,2), (3,3), (3,4), (3,5), (3, 6)}.
Therefore, the probability of event A is:

$$P(A) = \frac{6}{36} = \frac{1}{6}$$
- Event B (the sum of two rolls is 8) is given by outcomes B = {(2,6), (3,5), (4,4), (5,3), (6,2)}.

Therefore, the probability of event B to occur is:

$$P(B) = \frac{5}{36}$$
- Also, the event A∩B occurs if the first die shows 3 and the sum is 8, which can clearly occur only if the sequence (3, 5) takes place:

| 1st roll | 2 | 3 | 4 | 5 | 6 |
|----------|---|---|---|---|---|
| 2nd roll | 6 | 5 | 4 | 3 | 2 |
| Sum      | 8 | 8 | 8 | 8 | 8 |

Thus, the probability of the intersection of the two events is P(A∩B) = 1/36.
- Finally, according to the definition of conditional probability (Equation 6), the probability of interest is:

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{1/36}{5/36} = \frac{1}{5} = 0.2$$
Therefore, the “knowledge” that the sum of two rolls is 8 has updated the probability of A from P(A) = 1/6 = 0.167 to P(A|B) = 1/5 = 0.2.
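We can verify this result in R by enumerating all 36 equally likely outcomes:

# All ordered pairs of outcomes of two dice
rolls <- expand.grid(die1 = 1:6, die2 = 1:6)
A <- rolls$die1 == 3                # first roll shows 3
B <- rolls$die1 + rolls$die2 == 8   # sum of the two rolls is 8
mean(A & B) / mean(B)               # P(A|B) = P(A∩B)/P(B)
[1] 0.2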
Bayes’ theorem
Bayes’ theorem is based on this concept of “revisiting probability” when new information is available.
Equation 5 states that:

$$P(A \cap B) = P(A|B) \cdot P(B) = P(B|A) \cdot P(A)$$

Now, replacing P(A ∩ B) with P(B|A) · P(A) in Equation 6, we get Bayes' theorem:

$$P(A|B) = \frac{P(B|A) \cdot P(A)}{P(B)}$$

where P(A) and P(B) are the probabilities of events A and B (with P(B) > 0), and P(B|A) is the conditional probability of B given A.
Example
We are interested in calculating the probability of developing lung cancer if a person smokes tobacco for a long time, P(Cancer|Smoker).
Suppose that 8% of the population has lung cancer, P(Cancer) = 0.08, and 30% of the population are chronic smokers, P(Smoker) = 0.30. Also, suppose that we know that 60% of all people who have lung cancer are smokers, P(Smoker|Cancer) = 0.6.
Using Bayes' theorem we have:

$$P(\text{Cancer}|\text{Smoker}) = \frac{P(\text{Smoker}|\text{Cancer}) \cdot P(\text{Cancer})}{P(\text{Smoker})} = \frac{0.6 \times 0.08}{0.30} = 0.16$$

Therefore, the probability that a chronic smoker develops lung cancer is 0.16 or 16%.
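The same calculation in R:

p_cancer <- 0.08              # P(Cancer)
p_smoker <- 0.30              # P(Smoker)
p_smoker_cancer <- 0.60       # P(Smoker|Cancer)
p_smoker_cancer * p_cancer / p_smoker   # P(Cancer|Smoker)
[1] 0.16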
Bayes' Theorem

Partitions of Sample Spaces

A partition of a sample space Ω is a collection of events A₁, A₂, …, Aₖ that are mutually exclusive (Aᵢ ∩ Aⱼ = ∅ for every i ≠ j) and exhaustive (A₁ ∪ A₂ ∪ … ∪ Aₖ = Ω).

Statement of Bayes' Theorem

Let A₁, A₂, …, Aₖ be a partition of Ω with P(Aᵢ) > 0 for every i. Then, for any event B with P(B) > 0:

$$P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B|A_j)\,P(A_j)}$$
Independence of events
If the knowledge of occurrence of an event does not influence the occurrence of another event, the two events are called independent. In fact, if A and B are independent, then the conditional probabilities are P(A|B) = P(A), i.e. the occurrence of B has no influence on the occurrence of A, and P(B|A) = P(B), i.e. the occurrence of A has no influence on the occurrence of B. Consider, for example, rolling two dice consecutively: the outcome of the first die is independent of the outcome of the second die.
Two events A and B are said to be independent if:

$$P(A \cap B) = P(A) \cdot P(B)$$

This is known as the Multiplication Rule of Probability for independent events and follows directly from Equation 5 because P(A|B) = P(A).
Example
Determine the probability of obtaining two 3s when rolling two six-sided fair dice consecutively. This event can be decomposed into two events:

- A = {die 1 shows 3, and die 2 shows any number}.
- B = {die 1 shows any number, and die 2 shows 3}.

We can state that the two events A and B are independent by nature, since each event involves a different die, which has no knowledge of the outcome of the other one. The event of interest is A ∩ B, and the definition of the probability of two independent events leads to:

$$P(A \cap B) = P(A) \cdot P(B) = \frac{1}{6} \times \frac{1}{6} = \frac{1}{36}$$
This result can be verified by a direct count of all possible outcomes in the roll of two dice, and the fact that there is only one combination out of 36 that gives rise to two consecutive 3s.
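A direct count in R confirms this:

rolls <- expand.grid(die1 = 1:6, die2 = 1:6)   # all 36 outcomes
mean(rolls$die1 == 3 & rolls$die2 == 3)        # 1/36
[1] 0.02777778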
Random variables and Probability Distributions
A random variable assigns a numerical quantity to every possible outcome of a random phenomenon and may be:
- discrete if it takes either a finite number of values or a countably infinite sequence of possible values
- continuous if it takes any value in some interval on the real numbers
Discrete Probability Distributions
The probability distribution of a discrete random variable X is defined by the probability mass function (pmf) as:

$$P(x) = P(X = x)$$

where P(X = x) is the probability that the random variable X takes the value x.

The pmf has two properties:

- P(x) ≥ 0 for every value x
- $\sum_{x} P(x) = 1$

Additionally, the cumulative distribution function (cdf) gives the probability that the random variable X is less than or equal to x and is usually denoted as F(x):

$$F(x) = P(X \leq x) = \sum_{t \leq x} P(t)$$

where the sum takes place for all the values t that are less than or equal to x.
When dealing with a random variable, it is common to calculate three important summary statistics: the expected value, variance and standard deviation.
Expected Value
The expected value or mean, denoted as E(X) or μ, is defined as the weighted average of the values that X can take on, with each possible value being weighted by its respective probability, P(x):

$$\mu = E(X) = \sum_{x} x \cdot P(x)$$
Variance
We can also define the variance, denoted as σ² or Var(X), as the weighted average of the squared deviations of X from its mean μ:

$$\sigma^2 = Var(X) = E[(X-\mu)^2] = \sum_{x} (x-\mu)^2 \, P(x)$$

There is an easier, equivalent form of this formula:

$$\sigma^2 = E(X^2) - \mu^2 = \sum_{x} x^2 \, P(x) - \mu^2$$
Standard deviation
The standard deviation is the square root of the variance:

$$\sigma = \sqrt{\sigma^2}$$
Bernoulli distribution
A random experiment with two possible outcomes, generally referred to as success (x = 1) and failure (x = 0), is called a Bernoulli trial.
Let X be a binary random variable of a Bernoulli trial which takes the value 1 (success) with probability p and 0 (failure) with probability 1−p. The distribution of the X variable is called the Bernoulli distribution with parameter p, denoted as X ∼ Bernoulli(p).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = p^x (1-p)^{1-x}, \qquad x \in \{0, 1\}$$

which can also be written as:

$$P(X = 1) = p \qquad \text{and} \qquad P(X = 0) = 1 - p$$

- The cumulative distribution function (cdf) of X is given by:

$$F(x) = \begin{cases} 0, & x < 0 \\ 1-p, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$

The random variable X can take either value 0 or value 1. If x < 0, then F(x) = 0 because X cannot take values less than 0; if 0 ≤ x < 1, then F(x) = P(X = 0) = 1 − p; and if x ≥ 1, then F(x) = P(X = 0) + P(X = 1) = 1.
The expected value of a random variable X with Bernoulli(p) distribution is:

$$\mu = E(X) = 0 \cdot (1-p) + 1 \cdot p = p$$

The variance is:

$$\sigma^2 = Var(X) = p(1-p)$$

and the standard deviation is:

$$\sigma = \sqrt{p(1-p)}$$
Example
Let X be a random variable representing the result of a surgical procedure, where X = 1 if the surgery was successful and X = 0 if it was unsuccessful. Suppose that the probability of success is 0.7; then X follows a Bernoulli distribution with parameter p = 0.7, that is, X ∼ Bernoulli(0.7).
Find the main characteristics of this distribution.
- The pmf for this distribution is:

$$P(X = x) = 0.7^x \, 0.3^{1-x}, \qquad x \in \{0, 1\}$$

According to Equation 17 we have:

| X    | 0   | 1   |
|------|-----|-----|
| P(X) | 0.3 | 0.7 |
We can plot the pmf for visualizing the distribution of the two outcomes (Figure 6).
- The cdf for this distribution is:

$$F(x) = \begin{cases} 0, & x < 0 \\ 0.3, & 0 \leq x < 1 \\ 1, & x \geq 1 \end{cases}$$

- The mean is μ = p = 0.7 and the variance is σ² = p(1−p) = 0.7 × 0.3 = 0.21.
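Since a Bernoulli trial is a binomial experiment with a single trial, we can sketch this distribution in R with dbinom() and size = 1:

# Bernoulli(0.7) probabilities for x = 0 (failure) and x = 1 (success)
dbinom(0:1, size = 1, prob = 0.7)
[1] 0.3 0.7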
Binomial distribution
The binomial probability distribution can be used for modeling the number of times a particular event occurs (successes) in a sequence of n repeated and independent Bernoulli trials, each with probability p of success. The binomial setting is characterized by the following conditions:

- There is a fixed number, n, of repeated Bernoulli trials.
- The n trials are all independent. That is, knowing the result of one trial does not change the probability we assign to other trials.
- Both the probability of success, p, and the probability of failure, 1−p, are constant throughout the trials.
Let X be a random variable that indicates the number of successes in n independent Bernoulli trials. If the random variable X satisfies the binomial setting, it follows the binomial distribution with parameters n and p, denoted as X ∼ Binomial(n, p).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = \binom{n}{x} \, p^x (1-p)^{n-x}$$

where x = 0, 1, …, n and $\binom{n}{x} = \frac{n!}{x!\,(n-x)!}$ is the binomial coefficient, which counts the number of ways to choose x successes out of n trials.
Note that $n! = n \times (n-1) \times \dots \times 2 \times 1$ and, by definition, $0! = 1$.
- The cumulative distribution function (cdf) of X is given by:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \binom{n}{t} \, p^t (1-p)^{n-t}$$
The mean of a random variable X with Binomial(n, p) distribution is:

$$\mu = E(X) = np$$

the variance is:

$$\sigma^2 = np(1-p)$$

and the standard deviation is:

$$\sigma = \sqrt{np(1-p)}$$
Example
Let the random variable X be the number of successful surgical procedures and suppose that a new surgery method is successful 70% of the time (p = 0.7). If the results of 10 surgeries are randomly sampled, then X follows a Binomial distribution, X ∼ Binomial(10, 0.7). Find the main characteristics of this distribution.
- So, the pmf for this distribution is:

$$P(X = x) = \binom{10}{x} \, 0.7^x \, 0.3^{10-x}, \qquad x = 0, 1, \dots, 10$$

The pmf of the Binomial(10, 0.7) distribution specifies the probability of 0 through 10 successful surgical procedures.

According to Equation 23, we can compute the probability P(X = x) for each value x = 0, 1, …, 10.
Binomial Distribution Functions in R
In R, we can use the following functions:

- dbinom(x, n, p) calculates the probability of exactly x successes out of n trials, P(X = x).
- pbinom(x, n, p) calculates the probability of at most x successes out of n trials, P(X ≤ x).
- rbinom(m, n, p) randomly samples m values from the Binomial(n, p) distribution (with replacement).
- qbinom(q, n, p) computes the qth quantile of the Binomial(n, p) distribution.
We can easily compute the above probabilities using the dbinom()
function in R:
dbinom(0:10, size = 10, prob = 0.7)
[1] 0.0000059049 0.0001377810 0.0014467005 0.0090016920 0.0367569090
[6] 0.1029193452 0.2001209490 0.2668279320 0.2334744405 0.1210608210
[11] 0.0282475249
We can plot the pmf for visualizing the distribution (Figure 8).
- The cdf for this distribution is:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \binom{10}{t} \, 0.7^t \, 0.3^{10-t}$$
In R, we can calculate the cumulative probabilities for all the possible outcomes using the pbinom() function as follows:
# find the cumulative probabilities
pbinom(0:10, size = 10, prob = 0.7)
[1] 0.0000059049 0.0001436859 0.0015903864 0.0105920784 0.0473489874
[6] 0.1502683326 0.3503892816 0.6172172136 0.8506916541 0.9717524751
[11] 1.0000000000
The cdf for this distribution is shown below (Figure 9):
- The mean is μ = np = 10 × 0.7 = 7 successful surgeries and the variance is σ² = np(1−p) = 10 × 0.7 × 0.3 = 2.1.
Let's calculate the probability of having more than 8 successful surgical procedures out of a total of 10. Therefore, we want to calculate the probability P(X > 8):

$$P(X > 8) = P(X = 9) + P(X = 10)$$
In R, we can calculate the probabilities P(X = 9) and P(X = 10) by applying the function dbinom()
and adding the results:
p9 <- dbinom(9, size=10, prob=0.7)
p9
[1] 0.1210608
p10 <- dbinom(10, size=10, prob=0.7)
p10
[1] 0.02824752
p9 + p10
[1] 0.1493083
Of note, another way to find the above probability is to calculate 1 − P(X ≤ 8):
1 - pbinom(8, size=10, prob=0.7)
[1] 0.1493083
The Geometric Distribution
If we independently repeat a Bernoulli trial with probability of success p until the first success occurs, the number of trials needed, X, follows the geometric distribution with parameter p. Its probability mass function is:

$$P(X = x) = (1-p)^{x-1} \, p, \qquad x = 1, 2, 3, \dots$$
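For illustration, the probability that the first successful surgery (with p = 0.7) occurs on the third attempt is P(X = 3) = 0.3² × 0.7 = 0.063. Note that R's dgeom() counts the number of failures before the first success rather than the total number of trials:

# P(first success on trial 3) = P(exactly 2 failures before the first success)
dgeom(2, prob = 0.7)
[1] 0.063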
Poisson distribution
While a random variable with a Binomial distribution describes a count variable (e.g., number of successful surgeries), its range is restricted to whole numbers from 0 to n. For example, in a set of 10 surgical procedures (n = 10), the number of successful surgeries cannot surpass 10.
Now, let's suppose that we are interested in the number of successful surgeries per month in a particular specialty within a hospital. Theoretically, in this case, it is possible for the values to extend indefinitely without a predetermined upper limit.
Therefore, using the Poisson distribution, we can estimate the probability of observing a certain number of successful surgeries in a given month. The Poisson setting is characterized by the following conditions:
- The events (occurrences) are counted within a fixed interval of time or space. The interval should be well-defined and consistent.
- Each event is assumed to be independent of the others. The occurrence of one event does not affect the probability of another event happening.
- The probability of an event occurring remains consistent throughout the interval.
Let X be a random variable that indicates the number of events (occurrences) that happen within a fixed interval of time or space. If the events occur independently and at a constant average rate λ per interval, then X follows the Poisson distribution with parameter λ, denoted as X ∼ Poisson(λ).
- The probability mass function (pmf) of X is given by:

$$P(X = x) = \frac{\lambda^x \, e^{-\lambda}}{x!}$$

where x = 0, 1, …, +∞ and λ > 0.

- The cumulative distribution function (cdf) of X is given by:

$$F(x) = P(X \leq x) = \sum_{t=0}^{\lfloor x \rfloor} \frac{\lambda^t \, e^{-\lambda}}{t!}$$
The mean and variance of a random variable that follows the Poisson(λ) distribution are the same and equal to λ:

- μ = λ
- σ² = λ
Example
Let X be a random variable of the number of successful heart transplant surgeries per week in a specialized cardiac center. We assume that the average rate of successful surgeries per week is 2.5 (λ = 2.5). Find the main characteristics of this distribution.

- According to Equation 25, the probability mass function (pmf) of X is:

$$P(X = x) = \frac{2.5^x \, e^{-2.5}}{x!}, \qquad x = 0, 1, 2, \dots$$
The resulting probability table is:
| X    | 0     | 1     | 2     | 3     | 4     | 5     | 6     | 7    | 8     | … |
|------|-------|-------|-------|-------|-------|-------|-------|------|-------|---|
| P(X) | 0.082 | 0.205 | 0.257 | 0.214 | 0.134 | 0.067 | 0.028 | 0.01 | 0.003 | … |
We can compute the above probabilities using the dpois()
function in R:
dpois(0:8, lambda = 2.5)
[1] 0.082084999 0.205212497 0.256515621 0.213763017 0.133601886 0.066800943
[7] 0.027833726 0.009940617 0.003106443
We can also plot the pmf for visualizing the distribution (Figure 10).
For this example, the probability of no successful heart transplant surgeries in a week is P(X = 0) = 0.082, while the probability of exactly two successful surgeries per week is P(X = 2) = 0.257.
- The cdf for this distribution gives the cumulative probabilities P(X ≤ x). In R, we can calculate them for all the possible outcomes using the ppois() function as follows:
# find the cumulative probabilities
ppois(0:8, lambda = 2.5)
[1] 0.0820850 0.2872975 0.5438131 0.7575761 0.8911780 0.9579790 0.9858127
[8] 0.9957533 0.9988597
The cdf for this distribution is shown below (Figure 11):
- The mean and variance of this variable are μ = 2.5 and σ² = 2.5, respectively.
Let's calculate the probability of up to four successful heart transplant surgeries per week. Therefore, we want to calculate the cumulative probability:
P(X ≤ 4) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) = 0.082 + 0.205 + 0.257 + 0.214 + 0.134 = 0.892.
In R, we can calculate this probability by applying the function ppois()
:
ppois(4, lambda = 2.5)
[1] 0.891178
So, the probability of up to four successful heart transplant surgeries per week in the specialized cardiac center is approximately 0.89 or 89%.
Probability distributions for continuous outcomes
Unlike discrete random variables, which have a probability mass function (pmf) that assigns probabilities to individual values, continuous random variables have a probability density function (pdf), denoted as f(x), which satisfies the following properties:

- f(x) ≥ 0 for all x
- the total area under the density curve equals 1, $\int_{-\infty}^{+\infty} f(x)\,dx = 1$
In this case, we are interested in the probability that the value of the random variable X falls within a specific interval from x₁ to x₂, which equals the area under the density curve between these two values:

$$P(x_1 \leq X \leq x_2) = \int_{x_1}^{x_2} f(x)\,dx$$
The graphical representation of the probability density function is referred to as a density plot (Figure 12). In this plot, the x-axis represents the possible values of the variable X, while the y-axis represents the probability density.
Additionally, from the pdf we can find the cumulative probability by calculating the area from −∞ to a specific value x₀:

$$F(x_0) = P(X \leq x_0) = \int_{-\infty}^{x_0} f(x)\,dx$$
The probability of a certain point value in X is zero, and the area under the probability density curve of the interval (−∞, +∞) should be 1.
The expected value, variance, and standard deviation for a continuous random variable X are as follows:
Expected Value
The expected value or mean, denoted as E(X) or μ, is calculated by integrating over the entire range of possible values:

$$\mu = E(X) = \int_{-\infty}^{+\infty} x \, f(x)\,dx$$
Variance
We can also calculate the variance of the variable X:

$$\sigma^2 = Var(X) = \int_{-\infty}^{+\infty} (x-\mu)^2 \, f(x)\,dx$$
Standard deviation
The standard deviation, $\sigma = \sqrt{\sigma^2}$, is often preferred over the variance because it is in the same units as the random variable.
Uniform distribution
The simplest continuous probability distribution is the uniform distribution.
Let X be a continuous random variable that follows the uniform distribution with parameters the minimum value a and the maximum value b, denoted as X ∼ Uniform(a, b).

- The probability density function (pdf) of X is given by:

$$f(x) = \begin{cases} \frac{1}{b-a}, & a \leq x \leq b \\ 0, & \text{otherwise} \end{cases}$$

- The cumulative distribution function (cdf) of X is:

$$F(x) = \begin{cases} 0, & x < a \\ \frac{x-a}{b-a}, & a \leq x \leq b \\ 1, & x > b \end{cases}$$

The mean of X is given by:

$$\mu = \frac{a+b}{2}$$

The variance of X is given by:

$$\sigma^2 = \frac{(b-a)^2}{12}$$
The uniform distribution, with a=0 and b=1, is highly useful as a random number generator. Let’s see an example of simple randomization in a clinical trial.
Example
Let X be a random variable that follows the uniform distribution Uniform(0, 1). Find the main characteristics of this distribution.
Then utilize this distribution to randomize 100 individuals between treatments A and B in a clinical trial.
NOTE: The simple randomization with the Uniform(0,1) distribution ensures that each individual in the study has an equal chance of being assigned to either treatment group.
- The pdf for this distribution is:

$$f(x) = \begin{cases} 1, & 0 \leq x \leq 1 \\ 0, & \text{otherwise} \end{cases}$$

- The cdf for this distribution is:

$$F(x) = \begin{cases} 0, & x < 0 \\ x, & 0 \leq x \leq 1 \\ 1, & x > 1 \end{cases}$$

The mean and variance of a random variable that follows the Uniform(0, 1) distribution are:

$$\mu = \frac{0+1}{2} = 0.5 \qquad \text{and} \qquad \sigma^2 = \frac{(1-0)^2}{12} = \frac{1}{12} \approx 0.083$$
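As a quick numerical check of these values, we can integrate in R using the built-in integrate() and dunif() functions:

# E(X) = ∫ x f(x) dx and Var(X) = ∫ (x − μ)² f(x) dx for Uniform(0, 1)
integrate(function(x) x * dunif(x), lower = 0, upper = 1)$value           # mean: 0.5
integrate(function(x) (x - 0.5)^2 * dunif(x), lower = 0, upper = 1)$value # variance: 1/12 ≈ 0.083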
Next, we use the Uniform(0, 1) to randomize individuals between treatments A and B in a clinical trial:
# Set seed for reproducibility
set.seed(235)
# Define the sample size of the clinical trial
N <- 100
# Create a data frame to store the information
data <- data.frame(id = paste("id", 1:N, sep = ""), trt = NA)
# Simulate 100 uniform random variables from (0-1)
x <- runif(N)
# Display the first 10 values
x[1:10]
[1] 0.98373365 0.79328881 0.60725184 0.12093957 0.24990178 0.52147467
[7] 0.96265857 0.92370588 0.45026693 0.07862284
# Make treatment assignments, if x < 0.5 treatment A else B
data$trt <- ifelse(x < 0.5, "A", "B")
# Display the first few rows of the data frame
head(data, 10)
| id   | trt |
|------|-----|
| id1  | B   |
| id2  | B   |
| id3  | B   |
| id4  | A   |
| id5  | A   |
| id6  | B   |
| id7  | B   |
| id8  | B   |
| id9  | A   |
| id10 | A   |
# Display the counts and proportions of each treatment
table(data$trt)
A B
48 52
prop.table(table(data$trt))
A B
0.48 0.52
Normal distribution
A normal distribution, also known as a Gaussian distribution, is a fundamental concept in statistics and probability theory and is defined by two parameters: the mean (μ) and the standard deviation (σ) (see ?@sec-normal).
- The probability density function (pdf) of X ∼ Normal(μ, σ²) is given by:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where −∞ < x < +∞, and e ≈ 2.718 and π ≈ 3.142 are mathematical constants.

- The cumulative distribution function (cdf) of X accumulates probability from negative infinity up to the value x₀, which in interval notation is P(−∞ < X ≤ x₀):

$$F(x_0) = P(X \leq x_0) = \int_{-\infty}^{x_0} f(x)\,dx$$
Example
Let’s say that in a population the random variable of height, X, for adult people approximates a normal distribution with a mean μ = 170 cm and a standard deviation σ = 10 cm.
The pdf for this distribution is shown below (Figure 16):
The Figure 17 illustrates the normal cumulative distribution function. Note that continuous variables generate a smooth curve, while discrete variables produce a stepped line plot.
Let's assume that we want to calculate the area under the curve between 160 cm and 180 cm, that is, the probability P(160 ≤ X ≤ 180).

Using the properties of integrals we have:

$$P(160 \leq X \leq 180) = P(X \leq 180) - P(X \leq 160) = F(180) - F(160)$$

Therefore, one way to find the area under the curve between 160 cm and 180 cm is to calculate the cdf at each of these values and then find the difference between them:
- Let's calculate the F(180) = P(X ≤ 180):
pnorm(180, mean = 170, sd = 10)
[1] 0.8413447
- Similarly, we can calculate the F(160) = P(X ≤ 160):
pnorm(160, mean = 170, sd = 10)
[1] 0.1586553
Finally, we subtract the two values (shaded blue areas) as follows:
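pnorm(180, mean = 170, sd = 10) - pnorm(160, mean = 170, sd = 10)
[1] 0.6826895

Therefore, P(160 ≤ X ≤ 180) ≈ 0.68; about 68% of adults in this population have heights between 160 cm and 180 cm.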
Standard Normal distribution
If X is a random variable with a normal distribution having a mean of μ and a standard deviation of σ, we can standardize it by subtracting the mean and dividing by the standard deviation:

$$z = \frac{X - \mu}{\sigma}$$

The z (often called z-score) is a random variable that has a Standard Normal distribution, also called a z-distribution, i.e. a special normal distribution where μ = 0 and σ = 1.
Z-scores are commonly used in medical settings to assess how an individual’s measurement compares to the average value of the entire population.
Example
Let's assume that the diastolic blood pressure distribution among men has a normal distribution with mean 80 mmHg and standard deviation 15 mmHg. If an individual's diastolic blood pressure is recorded as 110 mmHg, by how many standard deviations does it differ from the population mean?
The z-score (Equation 33) is:

$$z = \frac{X - \mu}{\sigma} = \frac{110 - 80}{15} = 2$$
So this person has a diastolic blood pressure that is 2 standard deviations above the population mean.
To find the area under the curve between two z-scores, z₁ and z₂, we calculate the difference between the two cumulative probabilities:

$$P(z_1 \leq Z \leq z_2) = F(z_2) - F(z_1)$$

For example, let's calculate the area under the curve shown in Figure 22.

In R, we can easily calculate the above area using the cdf of the normal distribution:
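As a sketch, taking z₁ = −2 and z₂ = 2 (illustrative values reusing the z-score from the example above; the exact z-scores shown in Figure 22 are assumed here):

pnorm(2) - pnorm(-2)
[1] 0.9544997

So approximately 95% of the area under the standard normal curve lies within two standard deviations of the mean.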
Chi-square distribution
The chi-square distribution arises in various contexts, such as the chi-square test for independence in ?@sec-chi_square.
Let Z₁, Z₂, …, Zₖ be k independent random variables, each following the standard Normal distribution. Then the sum of their squares follows the chi-square distribution with k degrees of freedom:

$$X = Z_1^2 + Z_2^2 + \dots + Z_k^2 \sim \chi^2(k)$$

Therefore, the chi-square distribution is determined by a single parameter, k, the degrees of freedom.

The mean and variance of a random variable that follows the chi-square distribution with k degrees of freedom are μ = k and σ² = 2k, respectively.

The shape of the chi-square distribution depends on the degrees of freedom, k: it is highly right-skewed for small k and becomes more symmetric as k increases.
NOTE: When the degrees of freedom are large, the chi-square distribution approximates a normal distribution.
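For instance, we can obtain the familiar 95% critical value of the chi-square distribution with one degree of freedom in R:

# 95th percentile of the chi-square distribution with df = 1
qchisq(0.95, df = 1)
[1] 3.841459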
Appendix: Properties of Random Variables
In Introduction to Random Variables we discovered the following properties for a random variable X.
Properties of Discrete Random Variables
For a discrete random variable X with pmf P(x) and cdf F(x):

- 0 ≤ P(x) ≤ 1 for all x.
- $\sum_{x} P(x) = 1$.
- F(x) = P(X ≤ x) for all x.
- F is nondecreasing.
Properties of Continuous Random Variables
For a continuous random variable X with pdf f(x) and cdf F(x):

- f(x) ≥ 0 for all x.
- The cdf F is an antiderivative of the pdf f: $F(x) = \int_{-\infty}^{x} f(t)\,dt$.
- The pdf f is the derivative of the cdf F: f(x) = F′(x).
- 0 ≤ F(x) ≤ 1 for all x.
- $\int_{-\infty}^{+\infty} f(x)\,dx = 1$.
- F is nondecreasing.