p <- seq(0, 1, 0.01) # values of p on x-axis
like.binom <- choose(100,55)* p^55 * (1-p)^45 # values of L(p)
cv <- choose(100,55) * (0.55)^55 * (1-0.55)^45 # likelihood value at p = 0.55
plot(p, like.binom, # plot p and likelihood on x and y axes
type = "l", # connect plotted points with a curve
ylab = "L(p)", # y-axis label
xlab = "p", # x-axis label
main = "Plot of Likelihood Function") # main label
points(x = 0.55, y = cv, cex = 2, pch = 20, col = "tomato") # point at max
axis(1, at = c(0.55), labels = "theta = 0.55", col.axis = "tomato", pos = 0.0015, cex.axis = 1.5) # marking MLE estimate
abline(v = 0.55, col = "tomato", lwd = 2, lty = 2) # marking MLE estimate

Mathematics for Health Researchers 2
What is the most likely value for the unknown parameter \(\theta\) given we know a random sample of values \(x_1, x_2, \ldots x_n\)?
The likelihood function \[\color{dodgerblue}{L(\theta)= L( \theta \mid x_1, x_2, \ldots x_n)}\] gives the likelihood of the parameter \(\theta\) given the observed sample data. A maximum likelihood estimate (MLE), denoted \(\color{dodgerblue}{\mathbf{\hat{\theta}_{\rm MLE}}}\), is the value of \(\theta\) that gives the maximum value of the likelihood function \(L(\theta)\).
A Formula for the Likelihood Function
Let \(f(x; \theta)\) denote the pdf of a random variable \(X\) with associated parameter \(\theta\). Suppose \(X_1, X_2, \ldots , X_n\) are random samples from this distribution, and \(x_1, x_2, \ldots , x_n\) are the corresponding observed values.
\[\color{dodgerblue}{\boxed{L(\theta \mid x_1, x_2, \ldots , x_n) = f(x_1; \theta) f(x_2; \theta) \ldots f(x_n; \theta) = \prod_{i=1}^n f(x_i; \theta).}}\]
In the formula for the likelihood function, the values \(x_1, x_2, \ldots x_n\) are fixed values, and the parameter \(\theta\) is the variable in the likelihood function. We consider what happens to the value of the \(L(\theta)\) when we vary the value of \(\theta\). The MLE \(\hat{\theta}_{\rm{MLE}}\) is the value of \(\theta\) that gives the maximum value of \(L(\theta)\).
Steps for Finding MLE
Steps for finding MLE, \(\hat{\theta}_{\rm MLE}\):
- Find a formula for the likelihood function.
\[L(\theta \mid x_1, x_2, \ldots , x_n) = f(x_1; \theta) f(x_2; \theta) \ldots f(x_n; \theta) = \prod_{i=1}^n f(x_i; \theta)\]
- Maximize the likelihood function.
- Take the derivative of \(L\) with respect to \(\theta\).
- Find critical points of \(L\) where \(\frac{dL}{d\theta}=0\) (or is undefined).
- Evaluate \(L\) at each critical point and identify the MLE.
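These steps can also be carried out numerically. Below is a minimal sketch, reusing the binomial likelihood from the plot above (55 successes in 100 trials), that maximizes \(L(p)\) with R's built-in `optimize()`:

```r
# Likelihood for p given 55 successes in 100 trials (binomial example above)
like.binom <- function(p) choose(100, 55) * p^55 * (1 - p)^45

# Numerically maximize L(p) over the interval (0, 1)
mle <- optimize(like.binom, interval = c(0, 1), maximum = TRUE)
mle$maximum  # approximately 0.55, the sample proportion
```

The numeric maximizer agrees with the analytic MLE \(\hat{p} = 55/100\).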
Using the Log-Likelihood Function
Logarithmic functions such as \(y = \ln{x}\) are increasing functions. The larger the input \(x\), the larger the output \(y = \ln{x}\) becomes. Thus, the value of \(\theta\) that gives the maximum value of \(L(\theta)\) will also correspond to the value of \(\theta\) that gives the maximum value of the function \(y = \ln{(L(\theta))}\), and vice versa:
The value of \(\theta\) that maximizes the function \(y=\ln{(L(\theta))}\) is the value of \(\theta\) that maximizes \(L(\theta)\).
We call the natural log of the likelihood function, \(\color{dodgerblue}{y=\ln{(L(\theta))}}\), the log-likelihood function.
In statistics, the term “log” usually means “natural log”. The notation \(\log{()}\) is often used to denote a natural log instead of using \(\ln{()}\). This can be confusing since you may have previously learned \(\log{()}\) implies “log base 10”. Similarly, in R:
- The function log(x) is the natural log of x.
- The function log10(x) is the log base 10 of x.
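A quick check in the R console confirms the distinction:

```r
log(exp(1))  # natural log: returns 1
log10(100)   # log base 10: returns 2
log(100)     # natural log of 100, approximately 4.6052
```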
Why Maximize \(y=\ln{(L(\theta))}\) Instead of \(L(\theta)\)?
Consider the likelihood function from [Question 7],
\[L({\color{tomato}\lambda}) = {\color{tomato}\lambda}^n e^{- {\color{tomato}\lambda} \sum_i x_i}.\] To find the critical values, we first need to find an expression for the derivative \(\frac{d L}{d \lambda}\).
- We need to apply the product rule.
- We need to apply the chain rule to compute the derivative of \(e^{- {\color{tomato}\lambda} \sum_i x_i}\).
- After finding an expression for the derivative, we would then need to solve a complicated equation.
- We can use key properties of the natural log to help make the differentiation easier!
Useful Properties of the Natural Log
The four properties of natural logs listed below will be helpful to recall when working with log-likelihood functions.
\(\ln{(A \cdot B)} = \ln{A} + \ln{B}\)
\(\ln{\left( \frac{A}{B} \right)} = \ln{A} - \ln{B}\)
\(\ln{(A^k)} = k \ln{A}\)
\(\ln{e^k} = k\)
Likelihood functions are by definition a product of functions and often involve \(e\). Taking the natural log of the likelihood function converts a product to a sum. It is much easier to take the derivative of sums than products!
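We can verify numerically that the likelihood and log-likelihood share the same maximizer. A short sketch using the binomial example from earlier in these notes:

```r
# Binomial likelihood (55 successes in 100 trials) and its log-likelihood
L    <- function(p) choose(100, 55) * p^55 * (1 - p)^45
logL <- function(p) log(choose(100, 55)) + 55 * log(p) + 45 * log(1 - p)

optimize(L,    c(0.01, 0.99), maximum = TRUE)$maximum  # approximately 0.55
optimize(logL, c(0.01, 0.99), maximum = TRUE)$maximum  # same maximizer
```

The maximum *values* differ (one is the log of the other), but the location of the maximum, which is all the MLE cares about, is identical.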
Steps for Finding MLE Using a Log-Likelihood Function
Steps for finding MLE, \(\hat{\theta}_{\rm MLE}\), using a log-likelihood function:
- Find a formula for the likelihood function.
\[L(\theta \mid x_1, x_2, \ldots , x_n) = f(x_1; \theta) f(x_2; \theta) \ldots f(x_n; \theta) = \prod_{i=1}^n f(x_i; \theta)\]
- Apply the natural log to \(L(\theta)\) to derive the log-likelihood function \(y = \ln{(L(\theta))}\). Simplify using properties of the natural log before moving to the next step.
- Maximize the log-likelihood function.
- Take the derivative of \(y=\ln{(L(\theta))}\) with respect to \(\theta\).
- Find critical points of the log-likelihood function where \(\frac{dy}{d\theta}=0\) (or is undefined).
- Evaluate the log-likelihood function \(y=\ln{(L(\theta))}\) at each critical point and identify the MLE.
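For the exponential likelihood \(L(\lambda) = \lambda^n e^{-\lambda \sum_i x_i}\) discussed above, these steps lead to the closed form \(\hat{\lambda}_{\rm MLE} = 1/\overline{x}\). A simulated check (the sample below is illustrative, generated with true rate \(\lambda = 2\)):

```r
set.seed(15)
x <- rexp(50, rate = 2)  # simulated sample, true lambda = 2

# Log-likelihood: n*log(lambda) - lambda*sum(x)
loglik <- function(lambda) length(x) * log(lambda) - lambda * sum(x)

optimize(loglik, c(0.01, 10), maximum = TRUE)$maximum  # numeric MLE
1 / mean(x)                                            # closed-form MLE, 1/xbar
```

The two values agree up to the tolerance of `optimize()`.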
Properties of Estimators
An estimator, denoted \(\color{dodgerblue}{\hat{\theta}}\), is a formula or rule that we use to estimate the value of an unknown population parameter \(\theta\). For a single parameter \(\theta\), there are many (possibly infinitely many) different estimators \(\hat{\theta}\) from which to choose. We have investigated two particularly useful methods in depth: maximum likelihood estimation (MLE) and the method of moments.
We can think of different estimators, \(\hat{\theta}\), as different paths attempting to arrive at the same destination, the value of \(\theta\). Different statisticians might prefer different paths, so which path is optimal? Sometimes different methods lead to the same result, and sometimes they differ. When they differ, how do we decide which estimate is best?
The sampling distributions of the results of the four dart-throwing methods are plotted in Figure 2. The location of the population parameter (the center of the dart board) is indicated by the dashed red line. The mean of the sampling distribution is indicated by the solid blue vertical line. Match each of the distributions labeled A-D below to one of the four dart boards displayed in [Question 1].
Bias of an Estimator
No matter what formula we choose as an estimator, the estimate we obtain will vary from sample to sample. We would like an estimator to be, on average, equal to the parameter it is estimating. The bias of an estimator \(\hat{\theta}\) for parameter \(\theta\) is defined as the difference between the average (expected) value of the estimator and the parameter \(\theta\),
\[{\large \color{dodgerblue}{\boxed{ \mbox{Bias} = E(\hat{\theta}) - \theta.}}}\]
- \(\hat{\theta}\) is an unbiased estimator if \(\color{dodgerblue}{\mbox{Bias} = E(\hat{\theta}) - \theta =0}\).
- If the bias is positive, then on average \(\hat{\theta}\) gives an overestimate for \(\theta\).
- If the bias is negative, then on average \(\hat{\theta}\) gives an underestimate for \(\theta\).
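We can estimate the bias of an estimator by simulation. The sketch below (values are illustrative, with normal samples of size \(n = 10\) and true \(\sigma^2 = 4\)) compares the variance estimator with divisor \(n\) to the usual sample variance with divisor \(n-1\):

```r
set.seed(15)
sigma2 <- 4  # true variance

# Sampling distributions of two variance estimators (n = 10, normal samples)
est.n  <- replicate(10000, {x <- rnorm(10, mean = 0, sd = 2); mean((x - mean(x))^2)})
est.n1 <- replicate(10000, {x <- rnorm(10, mean = 0, sd = 2); var(x)})

mean(est.n)  - sigma2  # negative: the divisor-n estimator underestimates sigma^2
mean(est.n1) - sigma2  # near 0: the divisor n-1 estimator is unbiased
```

Note R's `var()` uses the divisor \(n-1\), consistent with Theorem 15.2 below.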
Prove the following statement:
If \(X_1\), \(X_2\), \(\ldots\) , \(X_n\) are independently and identically distributed random variables with \(E(X_i) = \mu\) and \(\mbox{Var}(X_i) = \sigma^2\), then
\[{\color{dodgerblue}{\boxed{E \bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = (n-1)\sigma^2.}}}\]
Use the result of Theorem 15.1 that states the following:
If \(X_1\), \(X_2\), \(\ldots\) , \(X_n\) are independently and identically distributed random variables with \(\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i\), then
\[\boxed{ E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n E \big[ X_i^2 \big] - n E \big[ \overline{X}^2 \big]}\]
Proof:
We first apply Theorem 15.1 to begin simplifying the expected value of the sum of the squared deviations,
\[E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n {\color{dodgerblue}{ E \big[ X_i^2 \big]}} - n {\color{tomato}{E \big[ \overline{X}^2 \big]}}\]
Next we simplify using properties of random variables and summations as follows,
\[\begin{aligned} E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] &= \sum_{i=1}^n {\color{dodgerblue}{ E \big[ X_i^2 \big]}} - n {\color{tomato}{E \big[ \overline{X}^2 \big]}} & \mbox{by Theorem 15.1}\\ &= \sum_{i=1}^n \bigg( {\color{dodgerblue}{ \mbox{Var} \big[ X_i \big] + \left( E \big[ X_i \big] \right)^2 }} \bigg) - n \left( {\color{tomato}{\mbox{Var} \big[ \overline{X} \big] + \left( E \big[ \overline{X} \big]\right)^2}} \right) & \mbox{Justification 1 ??}\\ &= \sum_{i=1}^n {\color{dodgerblue}{ \left( \sigma^2 + \mu^2 \right)}} - n \left( \mbox{Var} \big[ \overline{X} \big] + \left( E \big[ \overline{X} \big]\right)^2 \right) & \mbox{Justification 2 ??}\\ &= \sum_{i=1}^n\left( \sigma^2 + \mu^2 \right) - n \left( {\color{tomato}{\frac{\sigma^2}{n}}} + \left( {\color{tomato}{ \mu }}\right)^2 \right) & \mbox{Justification 3 ??} \\ &= {\color{dodgerblue}{n(\sigma^2 + \mu^2)}} - \sigma^2 - n\mu^2 & \mbox{Justification 4 ??}\\ &= (n-1) \sigma^2. & \mbox{Algebraically simplify} \end{aligned}\]
Precision of Estimators
Let \(\hat{\theta}\) be an estimator for a parameter \(\theta\). We can measure how precise \(\hat{\theta}\) is by considering how “spread out” the estimates are when we select many random samples (each of size \(n\)) and calculate an estimate \(\hat{\theta}\) from each. The variance of the sampling distribution, \(\mbox{Var}(\hat{\theta})\), measures the variability in estimates due to the uncertainty in random sampling. The standard error of \(\hat{\theta}\), the standard deviation of the sampling distribution for \(\hat{\theta}\), is also commonly used.
- In some cases, we can use theory from probability to derive a formula for \(\mbox{Var}(\hat{\theta})\).
- We can also approximate \(\mbox{Var}(\hat{\theta})\) by creating a sampling distribution through simulations.
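The simulation approach can be sketched in a few lines of R. Here (an illustrative setup, not from the notes) we approximate \(\mbox{Var}(\overline{X})\) for exponential samples of size \(n = 25\) with rate 2, where theory gives \(\sigma^2/n = (1/2)^2/25 = 0.01\):

```r
set.seed(15)

# Approximate the sampling distribution of the sample mean (n = 25)
xbars <- replicate(10000, mean(rexp(25, rate = 2)))

var(xbars)    # simulated Var(theta-hat)
(1/2)^2 / 25  # theoretical sigma^2 / n = 0.01 for comparison
```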
Efficiency of Unbiased Estimators
If \(\hat{\theta}_1\) and \(\hat{\theta}_2\) are both unbiased estimators of \(\theta\), then \(\hat{\theta}_1\) is said to be more efficient than \(\hat{\theta}_2\) if \({\color{dodgerblue}{\mbox{Var} ( \hat{\theta}_1) < \mbox{Var} (\hat{\theta}_2)}}\). For example, in [Question 5] we show the usual sample mean \(\hat{\mu}_1=\overline{X}\) is a more efficient estimator than the weighted mean \(\hat{\mu}_2\).
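The weighted mean from [Question 5] is not reproduced here, so as a stand-in illustration we can compare two other unbiased estimators of the center of a symmetric normal distribution, the sample mean and the sample median:

```r
set.seed(15)

# Two unbiased estimators of mu for normal data (n = 25, mu = 5, sigma = 2)
means   <- replicate(10000, mean(rnorm(25, mean = 5, sd = 2)))
medians <- replicate(10000, median(rnorm(25, mean = 5, sd = 2)))

var(means)    # smaller variance
var(medians)  # larger variance: the sample mean is more efficient here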
Mean Squared Error
We have explored bias and variability of estimators. It is not always possible or reasonable to use an unbiased estimator. Moreover, in some cases an estimator with a little bit of bias and very little variability might be preferred over an unbiased estimator that has a lot of variability. Choosing which estimator is preferred often involves a trade-off between bias and variability.
The Mean Squared Error (MSE) of an estimator \(\hat{\theta}\) measures the average squared distance between the estimator and the parameter \(\theta\),
\[{\color{dodgerblue}{\mbox{MSE} \big[ \hat{\theta} \big] = E \big[ (\hat{\theta}-\theta)^2 \big]}}.\]
- The MSE is a criterion that takes into account both the bias and variability of an estimator!
- In the Appendix we prove Theorem 15.3 that gives the relation of the MSE to the variance and bias:
\[\boxed{\large {\color{dodgerblue}{ \mbox{MSE} \big[ \hat{\theta} \big] }} = {\color{tomato}{\mbox{Var} \big[ \hat{\theta} \big]}} + {\color{mediumseagreen}{\left( \mbox{Bias}(\hat{\theta}) \right)^2. }}}\]
- In the special case where \(\hat{\theta}\) is an unbiased estimator, then \(\mbox{MSE} \big[ \hat{\theta} \big] =\mbox{Var} \big[ \hat{\theta} \big]\).
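The decomposition in Theorem 15.3 can be checked by simulation. A sketch (illustrative values: normal samples of size \(n = 10\), true \(\sigma^2 = 4\)) using the biased divisor-\(n\) variance estimator:

```r
set.seed(15)
sigma2 <- 4  # true parameter

# Sampling distribution of the (biased) divisor-n variance estimator, n = 10
est <- replicate(10000, {x <- rnorm(10, mean = 0, sd = 2); mean((x - mean(x))^2)})

mse  <- mean((est - sigma2)^2)  # MSE by definition
bias <- mean(est) - sigma2
var(est) + bias^2               # matches mse up to simulation error
```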
Appendix: Proofs for Theorems
Theorem 15.1
If \(X_1\), \(X_2\), \(\ldots\) , \(X_n\) are independently and identically distributed random variables, then
\[E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n E \big[ X_i^2 \big] - n E \big[ \overline{X}^2\big] .\]
Proof of Theorem 15.1
Let \(X_1\), \(X_2\), \(\ldots\) , \(X_n\) be independently and identically distributed random variables with \(\displaystyle \overline{X} = \frac{1}{n}\sum_{i=1}^n X_i\). The following properties are used in the proof that follows.
- The linearity of the expected value of random variables gives
\[ E \bigg[ \sum_{i=1}^n X_i \bigg] = \sum_{i=1}^n E \lbrack X_i \rbrack \quad \mbox{and} \quad E \bigg[ \sum_{i=1}^n X_i^2 \bigg] = \sum_{i=1}^n E \lbrack X_i^2 \rbrack. \tag{1}\]
- Recall properties of summation:
\[\sum_{i=1}^n (c a_i) = c \big( \sum_{i=1}^n a_i \big) \quad \mbox{and} \quad \sum_{i=1}^n c = nc. \tag{2}\]
We first expand the summation \(\sum_{i=1}^n (X_i - \overline{X})^2\) inside the expected value
\[\begin{aligned} E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] &= E \bigg[ (X_1 - \overline{X})^2 + (X_2 - \overline{X})^2 + \ldots + (X_n - \overline{X})^2 \bigg] \\ &= E \bigg[ (X_1^2 - 2X_1\overline{X} + \overline{X}^2) +(X_2^2 - 2X_2\overline{X} + \overline{X}^2) + \ldots + (X_n^2 - 2X_n\overline{X} + \overline{X}^2) \bigg] \end{aligned}\]
Regrouping terms, using the linearity of the expected value and properties of summation stated above, we have
\[E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n E(X_i^2) + E \bigg[ - 2 \overline{X}\left( {\color{tomato}{\sum_{i=1}^n X_i}} \right) + n \overline{X}^2 \bigg].\]
Since \(\overline{X} = \frac{1}{n} \sum_{i=1}^n X_i\), we have \({\color{tomato}{\sum_{i=1}^n X_i = n \overline{X}}}\), and therefore
\[\begin{aligned} E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] &= \sum_{i=1}^n E(X_i^2) + E \bigg[ - 2 \overline{X}\left( {\color{tomato}{n\overline{X}}} \right) + n \overline{X}^2 \bigg] \\ &= \sum_{i=1}^n E(X_i^2) + E \bigg[ -n \overline{X}^2\bigg]\\ &= \sum_{i=1}^n E(X_i^2) - n E \big[ \overline{X}^2\big] .\\ \end{aligned}\]
This concludes the proof!
Theorem 15.2
If \(X_1\), \(X_2\), \(\ldots\) , \(X_n\) are independently and identically distributed random variables with \(E(X_i) = \mu\) and \(\mbox{Var}(X_i) = \sigma^2\), then \(\displaystyle E \bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = (n-1)\sigma^2\).
Proof of Theorem 15.2
We first apply Theorem 15.1 to begin simplifying the expected value of the sum of the squared deviations.
\[E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n {\color{dodgerblue}{ E \big[ X_i^2 \big]}} - n {\color{tomato}{E \big[ \overline{X}^2 \big]}}\]
From the variance property \(\mbox{Var}(Y) = E(Y^2) - \big( E(Y) \big)^2\), we know for any random variable \(Y\), we have \(E(Y^2) = \mbox{Var}(Y) + \big( E(Y) \big)^2\). Applying this property to each \(X_i\) and \(\overline{X}\) (which is a linear combination of random variables), we have
\[E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] = \sum_{i=1}^n \bigg( {\color{dodgerblue}{ \mbox{Var} \big[ X_i \big] + \left( E \big[ X_i \big] \right)^2 }} \bigg) - n \left( {\color{tomato}{\mbox{Var} \big[ \overline{X} \big] + \left( E \big[ \overline{X} \big]\right)^2}} \right)\]
From the first summation on the right side, we have \({\color{dodgerblue}{ \mbox{Var} \big[ X_i \big] + \left( E \big[ X_i \big] \right)^2 = \sigma^2 + \mu^2}}\). Recall from the Central Limit Theorem for Means, we have \(\mbox{Var} \big[ \overline{X} \big] = \frac{\sigma^2}{n}\) and \(E \big[ \overline{X} \big] = \mu\), and thus we have \({\color{tomato}{\mbox{Var} \big[ \overline{X} \big] + \left( E \big[ \overline{X} \big]\right)^2 =\frac{\sigma^2}{n} + \mu^2 }}\). Thus, we have
\[\begin{aligned} E\bigg[ \sum_{i=1}^n (X_i - \overline{X})^2 \bigg] &= \sum_{i=1}^n {\color{dodgerblue}{ \left( \sigma^2 + \mu^2 \right)}} - n \left( {\color{tomato}{\frac{\sigma^2}{n}}} + {\color{tomato}{ \mu^2}} \right) \\ &= n(\sigma^2 + \mu^2) - \sigma^2 - n\mu^2 \\ &= (n-1) \sigma^2. \end{aligned}\]
This concludes our proof!
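Theorem 15.2 can also be verified by simulation. A minimal sketch (illustrative values: \(n = 10\), normal samples with \(\sigma = 3\), so \((n-1)\sigma^2 = 81\)):

```r
set.seed(15)
n <- 10; sigma2 <- 9

# Average of sum((x - xbar)^2) over many simulated samples
ss <- replicate(10000, {x <- rnorm(n, mean = 3, sd = 3); sum((x - mean(x))^2)})

mean(ss)          # approximately (n-1) * sigma^2
(n - 1) * sigma2  # 81
```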
Theorem 15.3
Let \(\hat{\theta}\) be an estimator for parameter \(\theta\). The mean squared error (MSE) is
\[\mbox{MSE} \big[ \hat{\theta} \big] = \mbox{Var} \big[\hat{\theta} \big] + \left( \mbox{Bias} \big[ \hat{\theta}\big] \right)^2.\]
Proof of Theorem 15.3
We begin with the definition, add and subtract \({\color{tomato}{E \big[ \hat{\theta} \big]}}\) inside the expected value, regroup terms inside the expected value, and finally break up the linear combination inside the expected value to get the result below.
\[\begin{aligned} \mbox{MSE} \big[ \hat{\theta} \big] &= E \big[ (\hat{\theta}-\theta)^2 \big] \\ &= E \bigg[ \left( \hat{\theta} {\color{tomato}{- E \big[ \hat{\theta} \big] + E \big[ \hat{\theta} \big]}} -\theta \right)^2 \bigg] \\ &= E \bigg[ \left( {\color{tomato}{(\hat{\theta} - E \big[ \hat{\theta} \big])}} + {\color{dodgerblue}{(E \big[ \hat{\theta} \big] -\theta)}} \right)^2 \bigg] \\ &= E \bigg[ {\color{tomato}{(\hat{\theta} - E \big[ \hat{\theta} \big])}}^2 + 2{\color{tomato}{(\hat{\theta} - E \big[ \hat{\theta} \big])}}{\color{dodgerblue}{(E \big[ \hat{\theta} \big] -\theta)}} + {\color{dodgerblue}{(E \big[ \hat{\theta} \big] -\theta)}}^2 \bigg] \\ &= E \bigg[ (\hat{\theta} - E \big[ \hat{\theta} \big])^2 \bigg] + 2 E \bigg[(\hat{\theta} - E \big[ \hat{\theta} \big]) (E \big[ \hat{\theta} \big] -\theta) \bigg] + E \bigg[ (E \big[ \hat{\theta} \big] -\theta)^2 \bigg] \end{aligned}\]
By definition of the variance, we have
\[{\color{tomato}{E \bigg[ (\hat{\theta} - E \big[ \hat{\theta} \big])^2 \bigg] = \mbox{Var} \big[ \hat{\theta} \big]}}. \tag{3}\]
By definition of bias of an estimator, we have
\[{\color{dodgerblue}{E \bigg[ (E \big[ \hat{\theta} \big] -\theta)^2 \bigg] = E \bigg[ \big( \mbox{Bias}(\hat{\theta}) \big)^2 \bigg] = \bigg( \mbox{Bias}(\hat{\theta}) \bigg)^2}}. \tag{4}\]
- Note \(\mbox{Bias}(\hat{\theta})\) is a constant value which might be unknown, but it is not a random variable.
- The expected value of a constant is the value of the constant, thus \(E \bigg[ \big( \mbox{Bias}(\hat{\theta}) \big)^2 \bigg] = \bigg( \mbox{Bias}(\hat{\theta}) \bigg)^2\).
Finally, the middle (cross) term is zero. Since \(E \big[ \hat{\theta} \big]\) and \(\theta\) are constants, we can expand the product and apply linearity of the expected value:
\[{\color{mediumseagreen}{ \begin{aligned} E \bigg[ (\hat{\theta} - E \big[ \hat{\theta} \big]) (E \big[ \hat{\theta} \big] -\theta) \bigg] &= E \bigg[ \hat{\theta} \cdot E \big[ \hat{\theta} \big] - \hat{\theta} \cdot \theta - \bigg( E \big[ \hat{\theta} \big] \bigg)^2 + E \big[ \hat{\theta} \big] \cdot \theta \bigg] \\ &= E \big[ \hat{\theta} \big] E \big[ \hat{\theta} \big] - \theta \, E \big[ \hat{\theta} \big] - \bigg( E \big[ \hat{\theta} \big] \bigg)^2 + \theta \, E \big[ \hat{\theta} \big] \\ &= 0. \end{aligned} }} \tag{5}\]
Using the results in (Equation 3), (Equation 4), and (Equation 5), we have
\[\begin{aligned} \mbox{MSE} \big[ \hat{\theta} \big] &= {\color{tomato}{E \bigg[ (\hat{\theta} - E \big[ \hat{\theta} \big])^2 \bigg] }} + 2 {\color{mediumseagreen}{ E \bigg[(\hat{\theta} - E \big[ \hat{\theta} \big]) (E \big[ \hat{\theta} \big] -\theta) \bigg] }} + {\color{dodgerblue}{E \bigg[ (E \big[ \hat{\theta} \big] -\theta)^2 \bigg] }} \\ &= {\color{tomato}{\mbox{Var} \big[ \hat{\theta} \big]}} + 2 \cdot {\color{mediumseagreen}{0}} + {\color{dodgerblue}{\left( \mbox{Bias}(\hat{\theta}) \right)^2}} \\ &= \mbox{Var} \big[ \hat{\theta} \big] + \left( \mbox{Bias}(\hat{\theta}) \right)^2. \end{aligned}\]
This completes our proof!