A Guide to Statistical Practice in R and Stata for Public Health Researchers

Authors

Bongani Ncube

University of the Witwatersrand (Biostatistician)

Published

16 March 2025

Library setup

About me

I am an MSc research fellow at the Wits School of Public Health. I have a background in advanced mathematics and statistics.

About the presentation

I compiled these notes while completing Biostatistics for Health Researchers 1.

Example datasets used in this exercise

Important terms

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information

  1. Inferential statistics is a branch of statistics that makes the use of various analytical tools to draw inferences about the population data from sample data.

  2. A Nominal Scale is a measurement scale in which numbers serve as “tags” or “labels” only, to identify or classify an object. This measurement normally deals with non-numeric (qualitative) variables or with numbers that carry no quantitative value.

  3. Snowball sampling is a non-probability sampling method where new units are recruited by other units to form part of the sample.

  4. A critical value serves as a boundary within the sampling distribution of a test statistic. These values play a crucial role in both hypothesis tests and confidence intervals.

  5. Non-probability sampling is a sampling method that uses non-random criteria

  6. A one-tailed test is a statistical test in which the critical area of a distribution is one-sided so that it is either greater than or less than a certain value, but not both.

  7. The confidence interval is an estimate of the amount of uncertainty associated with a sample, computed from the statistics of the observed data

  8. Covariance is a measure of the joint variability of two random variables.

  9. The interval scale is a quantitative measurement scale where there is order, the difference between the two variables is meaningful and equal, and the presence of zero is arbitrary.

  10. Probability sampling refers to the process in which each and every element of the population has a known, non-zero chance of being included in the sample; for instance, simple random sampling is a method of probability sampling.

  1. ANOVA is a method of assessing the differences between sample means; for instance, to test the difference in mean salary between people with degrees, diplomas, master's degrees and PhDs, one would perform an ANOVA.

  2. A significance test is a formal procedure for testing properties of population distributions and can be used to test the difference between a single sample value and a fixed value; for instance, if we know the population mean, we may wish to test whether a particular sample mean is significantly different from it.

  3. A statistic is a numerical value computed from data in a sample, e.g. the sample variance.

  4. A parameter, on the other hand, is a numerical value calculated from values in the whole population, e.g. the population variance.

  5. The null hypothesis in a significance test is the assertion that no difference exists.

  6. The alternative hypothesis is the assertion that a significant difference exists; it is accepted when the null hypothesis is rejected.

  7. The chi-square statistic arises when we wish to compare a set of observed frequencies with a set of theoretical frequencies; it is also a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies.

  8. One-way ANOVA compares the means of two or more groups in order to determine whether there is statistical evidence that the associated population means are significantly different.

  9. Test for independence - in tests of independence, two variables (usually nominal) are involved, and the test is used to answer whether the two variables are dependent on each other, i.e. associated. For example, we may wish to test whether HIV status is associated with whether someone is poor or not in a given dataset.

Probability

Principles

Here are three rules that come up all the time.

  • $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$. This rule generalizes to $\Pr(A \cup B \cup C) = \Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr(A \cap C) - \Pr(B \cap C) + \Pr(A \cap B \cap C)$.

  • $\Pr(A \mid B) = \dfrac{\Pr(A \cap B)}{\Pr(B)}$

  • If A and B are independent, $\Pr(A \cap B) = \Pr(A)\Pr(B)$, and $\Pr(A \mid B) = \Pr(A)$.

Bayes theorem - Application :Diagnostics

  • D = “Disease is present”

  • $D^c$ = “Disease is not present”

  • $T^+$ = “Positive test result”

  • $T^-$ = “Negative test result”

  • $P(T^+|D)$ = sensitivity (true positive rate)

  • $P(T^+|D^c)$ = false positive rate

  • $P(T^-|D)$ = false negative rate

  • $P(T^-|D^c)$ = specificity (true negative rate)

$PPV = P(D|T^+)$ and $NPV = P(D^c|T^-)$ are the positive and negative predictive values, and $P(D)$ is the prior probability (prevalence) of disease.
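Bayes' theorem links these quantities: $P(D|T^+) = \dfrac{P(T^+|D)\,P(D)}{P(T^+|D)\,P(D) + P(T^+|D^c)\,P(D^c)}$. A minimal R sketch of the calculation (the prevalence, sensitivity and specificity values below are made-up numbers for illustration):

# Bayes' theorem for diagnostic testing (illustrative values, not from the text)
prev <- 0.05   # P(D), prevalence of disease
sens <- 0.90   # P(T+ | D), sensitivity
spec <- 0.95   # P(T- | Dc), specificity

# P(D | T+) = P(T+ | D) P(D) / [P(T+ | D) P(D) + P(T+ | Dc) P(Dc)]
ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))

# P(Dc | T-) = P(T- | Dc) P(Dc) / [P(T- | Dc) P(Dc) + P(T- | D) P(D)]
npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)

c(PPV = ppv, NPV = npv)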

  • If your model predicts a patient as 1 (positive) and they belong to category 1 (positive) in reality we call this a true positive.

  • If your model predicts a patient as 0 (negative) and they belong to category 1 (positive) in reality we call this a false negative.

  • If your model predicts a patient as 1 (positive) and they belong to category 0 (negative) in reality we call this a false positive.

  • If your model predicts a patient as 0 (negative) and they belong to category 0 (negative) in reality we call this a true negative.

🎓 Precision: $\dfrac{TP}{TP+FP}$, defined as the proportion of predicted positives that are actually positive. Also called positive predictive value.

🎓 Recall: $\dfrac{TP}{TP+FN}$, defined as the proportion of positive results out of the number of samples which were actually positive. Also known as sensitivity.

🎓 Specificity: $\dfrac{TN}{TN+FP}$, defined as the proportion of negative results out of the number of samples which were actually negative.

🎓 Accuracy: $\dfrac{TP+TN}{TP+TN+FP+FN}$, the percentage of labels predicted accurately for a sample.

🎓 F Measure: a weighted average of the precision and recall, with best being 1 and worst being 0.
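A minimal R sketch computing these metrics from confusion-matrix counts (the TP, FP, TN, FN values are made up for illustration):

# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; TN <- 45; FN <- 5

precision   <- TP / (TP + FP)                    # positive predictive value
recall      <- TP / (TP + FN)                    # sensitivity
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
f1          <- 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

round(c(precision = precision, recall = recall,
        specificity = specificity, accuracy = accuracy, F1 = f1), 3)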

Multiplication Rule. If an experiment consists of k steps with $|S_1|, |S_2|, \dots, |S_k|$ possible outcomes at each step, then $|S| = |S_1| \times |S_2| \times \cdots \times |S_k|$.

How many outcomes are possible from a sequence of 4 coin flips and 2 rolls of a die? $|S| = |S_1||S_2|\cdots|S_6| = 2 \cdot 2 \cdot 2 \cdot 2 \cdot 6 \cdot 6 = 576$.
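A one-line check in R:

# 4 coin flips (2 outcomes each) and 2 die rolls (6 outcomes each)
prod(c(2, 2, 2, 2, 6, 6))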

Statistical inference

  • Make inferences (an interpretation) about the true parameter value β based on our estimator/estimate
  • Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.

Testing does not

  • Confirm with 100% certainty that a hypothesis is true
  • Confirm with 100% certainty that a hypothesis is false
  • Tell you how to interpret the estimated value (economic vs. practical vs. statistical significance)

Hypothesis: a statement that translates a research objective into a value (or set of values) in which the population parameter should or should not lie.

  • Null hypothesis ($H_0$): A statement about the population parameter that we take to be true unless the data provide substantial evidence against it.
    • Can be either a single value (ex: $H_0: \beta = 0$) or a set of values (ex: $H_0: \beta_1 \ge 0$)
    • Will generally be the value you would not like the population parameter to be (subjective)
      • $H_0: \beta_1 = 0$ means you would like to see a non-zero coefficient
      • $H_0: \beta_1 \ge 0$ means you would like to see a negative effect
    • “Test of significance” refers to the two-sided test: $H_0: \beta_j = 0$
  • Alternative hypothesis ($H_a$ or $H_1$) (research hypothesis): All other possible values that the population parameter may take if the null hypothesis does not hold.

Type I Error

Error made when H0 is rejected when, in fact, H0 is true.
The probability of committing a Type I error is α (known as level of significance of the test)

Type I error (α): probability of rejecting H0 when it is true.

Legal analogy: In U.S. law, a defendant is presumed to be “innocent until proven guilty”.
If the null hypothesis is that a person is innocent, the Type I error is the probability that you conclude the person is guilty when he is innocent.


Type II Error

Type II error level (β): probability that you fail to reject the null hypothesis when it is false.

In the legal analogy, this is the probability that you fail to find the person guilty when he or she is guilty.

Error made when $H_0$ is not rejected when, in fact, $H_1$ is true.
The probability of committing a Type II error is β; the power of the test is 1 − β.

Random sample of size n: A collection of n independent random variables taken from the distribution X, each with the same distribution as X.

Sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Sample Median

$\tilde{x}$ = the middle observation in a sample of observations ordered from smallest to largest (or vice versa).

If n is odd, $\tilde{x}$ is the middle observation;
if n is even, $\tilde{x}$ is the average of the two middle observations.

Sample variance: $$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2}{n(n-1)}$$

Sample standard deviation: $S = \sqrt{S^2}$

Sample proportion: $\hat{p} = \dfrac{X}{n} = \dfrac{\text{number in the sample with trait}}{\text{sample size}}$

$$\widehat{p_1 - p_2} = \hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2} = \frac{n_2 X_1 - n_1 X_2}{n_1 n_2}$$

Estimators
Point Estimator
θ^ is a statistic used to approximate a population parameter θ


Point estimate
The numerical value assumed by θ^ when evaluated for a given sample


Unbiased estimator
If E(θ^)=θ, then θ^ is an unbiased estimator for θ

  1. $\bar{X}$ is an unbiased estimator for $\mu$
  2. $S^2$ is an unbiased estimator for $\sigma^2$
  3. $\hat{p}$ is an unbiased estimator for $p$
  4. $\hat{p}_1 - \hat{p}_2$ is an unbiased estimator for $p_1 - p_2$
  5. $\bar{X}_1 - \bar{X}_2$ is an unbiased estimator for $\mu_1 - \mu_2$

Note: S is a biased estimator for σ

Distribution of the sample mean

If $\bar{X}$ is the sample mean based on a random sample of size n drawn from a normal distribution X with mean μ and standard deviation σ, then $\bar{X}$ is normally distributed, with mean $\mu_{\bar{X}} = \mu$ and variance $\sigma^2_{\bar{X}} = Var(\bar{X}) = \sigma^2/n$. The standard error of the mean is therefore $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
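A minimal simulation sketch of this result (the mean, standard deviation and sample size below are arbitrary illustrative values):

# Simulate the sampling distribution of the mean for X ~ N(mu, sigma^2)
set.seed(123)
mu <- 50; sigma <- 10; n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)   # close to mu
sd(sample_means)     # close to sigma / sqrt(n)
sigma / sqrt(n)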

One Sample Inference

$$Y_i \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$$

i.i.d. stands for “independent and identically distributed”.

Hence, we have the following model:

$Y_i = \mu + \epsilon_i$ where

  • $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
  • $E(Y_i) = \mu$
  • $Var(Y_i) = \sigma^2$
  • $\bar{y} \sim N(\mu, \sigma^2/n)$


The Mean

When $\sigma^2$ is estimated by $s^2$, then

$$\frac{\bar{y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

Then, a $100(1-\alpha)\%$ confidence interval for $\mu$ is obtained from:

$$1 - \alpha = P\left(-t_{\alpha/2;n-1} \le \frac{\bar{y} - \mu}{s/\sqrt{n}} \le t_{\alpha/2;n-1}\right) = P\left(\bar{y} - t_{\alpha/2;n-1}\, s/\sqrt{n} \le \mu \le \bar{y} + t_{\alpha/2;n-1}\, s/\sqrt{n}\right)$$

And the interval is

$$\bar{y} \pm t_{\alpha/2;n-1}\, \frac{s}{\sqrt{n}}$$

where $s/\sqrt{n}$ is the standard error of $\bar{y}$.

If the experiment were repeated many times, $100(1-\alpha)\%$ of these intervals would contain $\mu$.
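A minimal R sketch of this interval (using arbitrary illustrative data):

# 95% t-based confidence interval for the mean, computed by hand
set.seed(1)
y <- rnorm(20, mean = 10, sd = 3)   # illustrative data

n     <- length(y)
ybar  <- mean(y)
se    <- sd(y) / sqrt(n)
tcrit <- qt(0.975, df = n - 1)

c(lower = ybar - tcrit * se, upper = ybar + tcrit * se)

# t.test() gives the same interval
t.test(y)$conf.int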

| Case | $100(1-\alpha)\%$ Confidence Interval | Sample Size (confidence $1-\alpha$, error $d$) | Test Statistic |
|---|---|---|---|
| $\sigma^2$ known, X normal (or $n \ge 25$) | $\bar{X} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2 \sigma^2}{d^2}$ | $z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ |
| $\sigma^2$ unknown, X normal (or $n \ge 25$) | $\bar{X} \pm t_{\alpha/2}\dfrac{s}{\sqrt{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2 s^2}{d^2}$ | $t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$ |

Example of one sample t.test

Patients suffering from clinical depression:

| Patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Days | 51 | 41 | 62 | 33 | 28 | 43 | 37 | 44 |

$\sum X = 339$, $\bar{X} = 42.375$

Hypothesis Testing

$H_0: \mu = 43$ versus $H_1: \mu \ne 43$

$$T = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

$$T = \frac{42.375 - 43}{2/\sqrt{8}} = -0.8838835$$

Since it is a two-tailed test, we split the significance level across the two tails:

$$t_{8-1}\left(\tfrac{5\%}{2}\right) = t_{7}(0.025) = 2.365$$

Conclusion

Since $|T| = 0.884 < t_{7}(0.025) = 2.365$, the test statistic does not fall in the rejection region; hence we fail to reject the null hypothesis and conclude that the mean for this group of patients is not significantly different from 43.
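A sketch of the same test with R's built-in t.test() (note that t.test() estimates the standard deviation from the sample, so its t statistic will differ from a calculation that treats the standard deviation as known):

days <- c(51, 41, 62, 33, 28, 43, 37, 44)

# Two-sided one-sample t-test of H0: mu = 43
t.test(days, mu = 43, alternative = "two.sided")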

For Difference of Means ($\mu_1 - \mu_2$), Independent Samples

| Case | $100(1-\alpha)\%$ Confidence Interval | Test Statistic | Notes |
|---|---|---|---|
| $\sigma^2$ known | $\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$ | $z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ | |
| $\sigma^2$ unknown, variances assumed EQUAL | $\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ | $t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_p^2\left(1/n_1 + 1/n_2\right)}}$ | Pooled variance: $s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$; degrees of freedom: $\gamma = n_1 + n_2 - 2$ |
| $\sigma^2$ unknown, variances assumed UNEQUAL | $\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ | $t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ | Degrees of freedom: $\gamma = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$ |

two sample T-test

When you have a single explanatory variable that is qualitative and has only two levels, you can run a Student's t-test to test for a difference in the mean of the two levels. If appropriate for your data, you can choose to test a unilateral (one-sided) hypothesis; that is, you can test the more specific assumption that one level has a higher mean than the other, rather than simply that the means differ. Note that the robustness of this test increases with sample size and is higher when the groups have equal sizes.

For the t-test, the t statistic used to find the p-value is calculated as: $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where

$\bar{x}_1$ and $\bar{x}_2$ are the means of the response variable y for groups 1 and 2, respectively,
$s_1^2$ and $s_2^2$ are the variances of the response variable y for groups 1 and 2, respectively,
$n_1$ and $n_2$ are the sample sizes of groups 1 and 2, respectively.

NB: this applies if the variances are not equal

For equal variances we use the formula in the table above:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{where the pooled variance is} \quad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$

Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.
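A quick R sketch of this equivalence (the data, `y` and `group`, are made-up illustrative values): the pooled two-sample t-test and a one-way ANOVA with two levels give the same p-value, and the F statistic equals t².

set.seed(42)
y     <- c(rnorm(15, mean = 10), rnorm(15, mean = 12))
group <- factor(rep(c("A", "B"), each = 15))

t_out <- t.test(y ~ group, var.equal = TRUE)   # pooled two-sample t-test
a_out <- anova(lm(y ~ group))                  # one-way ANOVA with 2 levels

t_out$statistic^2                      # equals the ANOVA F statistic
a_out$`F value`[1]
c(t_out$p.value, a_out$`Pr(>F)`[1])    # identical p-values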

Assumptions

If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.

  1. Normality of data
    As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed, but have reasonably symmetrical distributions, a mean which is close to the centre of the distribution, and only one mode (highest point in the frequency histogram) then a t-test will still work as long as the sample is sufficiently large (rule of thumb ~30 observations). If the data is heavily skewed, then we may need a very large sample before a t-test works. In such cases, an alternate non-parametric test should be used.
  2. Homoscedasticity
    Another important assumption of the two-sample t-test is that the variance of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If population variances are unequal, then the probability of a Type I error is greater than α.
    The robustness of the t-test increases with sample size and is higher when groups have equal sizes.
    We can test for a difference in variances between two populations by asking: what is the probability of drawing two samples from populations with identical variances and having the two sample variances be as different as $s_1^2$ and $s_2^2$ are?
    To do so, we use the variance ratio test (i.e. an F-test), as sketched below.
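A minimal sketch of the F-test in R (illustrative data):

# F-test (variance ratio test) for equality of two variances
set.seed(7)
x1 <- rnorm(20, sd = 2)
x2 <- rnorm(20, sd = 3)

var.test(x1, x2)   # H0: the two population variances are equal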

Violation of assumptions

If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, you can transform your data (log or square root for example) or use the non-parametric equivalent of t-test, the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.

Compare Reading marks by gender
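The dataset for this comparison is not shown here, so the sketch below uses a small made-up data frame (`reading_dat`, `marks` and `gender` are hypothetical names and values) to illustrate the call:

# Hypothetical reading marks for two gender groups (made-up values)
reading_dat <- data.frame(
  marks  = c(62, 55, 70, 48, 66, 59, 73, 51, 64, 58),
  gender = rep(c("Male", "Female"), each = 5)
)

# Welch two-sample t-test of mean reading marks by gender
t.test(marks ~ gender, data = reading_dat)

# Pooled-variance version, if the equal-variance assumption is reasonable
t.test(marks ~ gender, data = reading_dat, var.equal = TRUE)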

For Difference of Means ($\mu_1 - \mu_2$), Paired Samples ($D = X - Y$)

$100(1-\alpha)\%$ Confidence Interval

$$\bar{D} \pm t_{\alpha/2}\frac{s_d}{\sqrt{n}}$$

Hypothesis Testing Test Statistic

$$t = \frac{\bar{D} - D_0}{s_d/\sqrt{n}}$$
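A minimal paired t-test sketch in R (illustrative before/after measurements on the same individuals):

set.seed(11)
before <- rnorm(12, mean = 120, sd = 10)
after  <- before - rnorm(12, mean = 4, sd = 5)   # second measurement on the same subjects

t.test(before, after, paired = TRUE)   # equivalent to t.test(before - after, mu = 0)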

Difference of Two Proportions

Mean

$$\hat{p}_1 - \hat{p}_2$$

Variance

$$\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$$

$100(1-\alpha)\%$ Confidence Interval

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Sample Sizes, Confidence $1-\alpha$, Error d
(Prior estimates for $\hat{p}_1, \hat{p}_2$)

$$n \ge \frac{z_{\alpha/2}^2\left[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)\right]}{d^2}$$

(No prior estimates for $\hat{p}$)

$$n \ge \frac{z_{\alpha/2}^2}{2d^2}$$

Hypothesis Testing - Test Statistics

Null value $(p_1 - p_2)_0$:

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)_0}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$$

Null value $(p_1 - p_2)_0 = 0$:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$

(21.9-16.7)/sqrt(20.38889*(2/10))
[1] 2.575085
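The two-proportion z-test can be run in R with prop.test(); a minimal sketch with made-up counts (x = successes, n = sample sizes):

# H0: p1 - p2 = 0, using the pooled estimate of p
prop.test(x = c(45, 30), n = c(100, 100), correct = FALSE)

# The same z statistic by hand
p1 <- 45/100; p2 <- 30/100
p  <- (45 + 30) / (100 + 100)
z  <- (p1 - p2) / sqrt(p * (1 - p) * (1/100 + 1/100))
z^2   # equals the X-squared value reported by prop.test()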

Single Variance

$$1-\alpha = P\left(\chi^2_{1-\alpha/2;n-1} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2;n-1}\right) = P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right)$$

and a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is:

$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2;n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2;n-1}}\right)$$ Confidence limits for $\sigma$ are obtained by computing the positive square roots of these limits.

Equivalently,

$100(1-\alpha)\%$ Confidence Interval

$$L_1 = \frac{(n-1)s^2}{\chi^2_{\alpha/2}}, \qquad L_2 = \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

Hypothesis Testing Test Statistic

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
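A minimal R sketch of the confidence interval for σ² (illustrative data; the null value σ₀ = 2 is made up):

set.seed(3)
x <- rnorm(30, mean = 0, sd = 2)   # illustrative sample
n <- length(x); s2 <- var(x); alpha <- 0.05

# 95% CI for sigma^2: ((n-1)s^2 / chi2_{alpha/2}, (n-1)s^2 / chi2_{1-alpha/2})
lower <- (n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1)
upper <- (n - 1) * s2 / qchisq(alpha/2, df = n - 1)
c(lower = lower, upper = upper)

# Chi-square test statistic for H0: sigma^2 = sigma0^2
(n - 1) * s2 / 2^2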

Single Proportion (p)

| $100(1-\alpha)\%$ Confidence Interval | Sample Size (prior estimate for $\hat{p}$) | Sample Size (no prior estimate for $\hat{p}$) | Test Statistic |
|---|---|---|---|
| $\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{d^2}$ | $n \ge \dfrac{z_{\alpha/2}^2}{4d^2}$ | $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$ |

Power

Formally, power (for the test of the mean) is given by:

$$\pi(\mu) = 1 - \beta = P(\text{test rejects } H_0 \mid \mu)$$
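In R, power for a t-test of the mean can be explored with power.t.test(); a quick sketch with illustrative values:

# Power to detect a difference (delta) of 5 with sd = 10, n = 30, alpha = 0.05
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")

# Or solve for the sample size needed for 80% power
power.t.test(power = 0.80, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample")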

Chi-square Test for Associations

Contingency tables

When we work with categorical data in research, we analyze response data when there are either no predictor variables or all of the predictor variables are also categorical.

Pearson Chi-square Test for independence

  • The chi-square independence test tests whether observed joint frequency counts $O_{ij}$ differ from the expected frequency counts $E_{ij}$ under the independence model (the model of independent explanatory variables, $\pi_{ij} = \pi_{i+}\pi_{+j}$). $H_0$ is $O_{ij} = E_{ij}$.
  • It determines whether an association exists, that is, whether the two variables are dependent on each other. Sometimes, $H_0$ represents the model whose validity is to be tested. Contrast this with the conventional formulation of $H_0$ as the hypothesis that is to be disproved. The goal in this case is not to disprove the model, but to see whether the data are consistent with the model and whether any deviation can be attributed to chance.
  • These tests do not measure the strength of an association.
  • These tests depend on and reflect the sample size: double the sample size by copying each observation and you double the $\chi^2$ statistic, even though the strength of the association does not change.
  • The Pearson chi-square test is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (the large-sample p-values are not appropriate).
  • When the sample size is small, exact p-values can be calculated (this is prohibitive for large samples); calculation of the exact p-values assumes that the column totals and row totals are fixed.
    $$X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_j = p_j n$ and $E_j = \pi_j n$. The sampling distribution of $X^2$ approaches $\chi^2_{J-1}$ as the sample size $n \to \infty$.

Question (Taken from Zimbabwe Open University)

A ZOU Regional Coordinator observed that, at one weekend school, the 150 students who turned up were categorized by gender and programme as shown in the table below.

| Gender | Physcology | Development studies | R & A Management | Total |
|---|---|---|---|---|
| Male | 46 | 29 | 28 | 103 |
| Female | 27 | 14 | 6 | 47 |
| Total | 73 | 43 | 34 | 150 |

Record the data in R:

dat_o <- matrix(
  c(46,27,29,14,28,6), 
  nrow = 2,
  dimnames = list(Gender = c("Male", "Female"),
    study = c("Physcology", "Development studies","R & A Management"))
)

view the observed values in a neat table

dat_o %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("row", "col")) %>%
  flextable::flextable() %>%
  flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
  flextable::autofit()

The contingency table for the data is given above.

Look at the Observed proportions

prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("col")) %>%
  flextable::flextable() %>%
  flextable::colformat_num(j = c(2, 3, 4), digits = 2) %>%
  flextable::autofit()

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

$$E_{11} = \frac{73 \times 103}{150} = 50.12667$$

$$E_{12} = \frac{43 \times 103}{150} = 29.52667$$

$$E_{13} = \frac{34 \times 103}{150} = 23.34667$$

$$E_{21} = \frac{73 \times 47}{150} = 22.87333$$

$$E_{22} = \frac{43 \times 47}{150} = 13.47333$$

$$E_{23} = \frac{34 \times 47}{150} = 10.65333$$

Doing this in R yields:

expected <- matrix(rowSums(dat_o)) %*% t(matrix(colSums(dat_o))) / sum(dat_o)
expected
         [,1]     [,2]     [,3]
[1,] 50.12667 29.52667 23.34667
[2,] 22.87333 13.47333 10.65333

Define the hypothesis

  • H0 : There is no association between gender and student programme choice
  • H1 : There is an association between gender and student programme choice
  1. The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Calculating $\chi^2$:

$$\chi^2 = \frac{(46-50.12667)^2}{50.12667} + \frac{(29-29.52667)^2}{29.52667} + \frac{(28-23.34667)^2}{23.34667} + \frac{(27-22.87333)^2}{22.87333} + \frac{(14-13.47333)^2}{13.47333} + \frac{(6-10.65333)^2}{10.65333}$$

$\chi^2 = 4.07425$, and in R we can compute it as:

vitc_e <- rowSums(dat_o) %o% colSums(dat_o) / sum(dat_o)  # expected counts: (row total x column total) / n
X2 <- sum((dat_o - vitc_e)^2 / vitc_e)
print(X2)
[1] 4.074251

The degrees of freedom is

vitc_dof <- (nrow(dat_o) - 1) * (ncol(dat_o) - 1)
print(vitc_dof)
[1] 2

$$\chi^2_{0.05}\big((2-1)(3-1)\big) = \chi^2_{0.05}(2) = 5.99$$

  • conclusion

Since $\chi^2 = 4.07425 < 5.99$, we fail to reject $H_0$ and conclude that there is no evidence of an association between gender and student programme choice.

In R we can compute this p-value manually:

pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
[1] 0.130403

Conclusion

Since the p-value (0.13) is greater than 0.05, we fail to reject $H_0$ and conclude that there is no evidence of an association between the two variables.

The deviance statistic is

$$G^2 = 2\sum_{i}\sum_{j} O_{ij}\log\left(\frac{O_{ij}}{E_{ij}}\right)$$

G2 <- 2 * sum(dat_o * log(dat_o / vitc_e))  # deviance (likelihood-ratio) statistic
print(G2)
[1] 4.371267

$X^2$ and $G^2$ increase with the disagreement between the saturated-model proportions $p_{ij}$ and the independence-model proportions $\pi_{ij}$.

The associated p-value for the deviance statistic is

pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
[1] 0.1124065

Rather than doing all of this manual labour:

  • R has a built-in function called chisq.test(), which takes a table (or matrix) object.

recall that

dat_o
        study
Gender   Physcology Development studies R & A Management
  Male           46                  29               28
  Female         27                  14                6
vitc_chisq_test <- chisq.test(dat_o, correct = FALSE)
print(vitc_chisq_test)

    Pearson's Chi-squared test

data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

For 2x2 tables, the Yates continuity correction (the default, correct = TRUE) yields more conservative p-values.

chisq.test(dat_o)

    Pearson's Chi-squared test

data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

These p-values do not provide evidence against the independence model.

Here is the chi-square test applied to the above data. Recall this data set is 2x3, so the degrees of freedom are $(2-1)(3-1) = 2$. The Yates continuity correction does not apply to tables other than 2x2, so the correct = TRUE/FALSE argument has no effect in chisq.test().

Expected frequencies from the output

vitc_chisq_test$expected
        study
Gender   Physcology Development studies R & A Management
  Male     50.12667            29.52667         23.34667
  Female   22.87333            13.47333         10.65333

Helmet injury vs wearing a helmet

The 793 observations, cross-classified by head injury and helmet use, are:

| | Helmet | No helmet | Total |
|---|---|---|---|
| Head injury | 17 | 218 | 235 |
| No head injury | 130 | 428 | 558 |
| Total | 147 | 646 | 793 |

$$P(\text{Helmet}) = \frac{147}{793}$$

$$P(\text{Injury}) = \frac{235}{793}$$

$$P(\text{Helmet} \cap \text{Injury}) = \frac{17}{793}$$

Testing the Hypothesis

Define the hypothesis

  • H0: Suffering a head injury is not associated with wearing a helmet
  • H1: There is association between wearing a helmet and suffering a head injury.

Decide the level of significance

  • at 5% level of significance

Calculate the test statistic

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

(235*147)/(793)
[1] 43.56242

$$E_{11} = \frac{235 \times 147}{793} = 43.56242$$

(235* 646)/(793)
[1] 191.4376

$$E_{12} = \frac{235 \times 646}{793} = 191.4376$$

(558* 147)/(793)
[1] 103.4376

$$E_{21} = \frac{558 \times 147}{793} = 103.4376$$

(558 * 646)/(793)
[1] 454.5624

$$E_{22} = \frac{558 \times 646}{793} = 454.5624$$

  1. The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Calculating $\chi^2$:

$$\chi^2_{\text{uncorrected}} = \frac{(17-43.6)^2}{43.6} + \frac{(218-191.4)^2}{191.4} + \frac{(130-103.4)^2}{103.4} + \frac{(428-454.6)^2}{454.6}$$

(((17-43.6)^2)/(43.6))+(((218-191.4)^2)/(191.4))+(((130-103.4)^2)/(103.4))+(((428-454.6)^2)/(454.6))
[1] 28.32459

Chi-Squared distribution

Random variable X is distributed $X \sim \chi^2_k$ if

$$f(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$$

with $E(X) = k$ and $Var(X) = 2k$.

For our example

$(r-1)(c-1) = (2-1)(2-1) = 1$

options(scipen = 999)
dof <- 1
x2 <- 28.32459
(p_value <- pchisq(q = x2, df = dof, lower.tail = FALSE))
[1] 0.0000001025845

Or simulate this by taking the mean of 10,000 random trials.

mean(rchisq(n = 10000, df = dof) >= x2)
[1] 0
# Plot the upper-tail probability P(X > x) for x between 25 and 30,
# shading the region beyond the observed statistic x2 = 28.32
data.frame(x = 250:300 / 10) %>%
  mutate(density = pchisq(x, df = dof, lower.tail = FALSE),   # upper-tail probability at x
         cdf = if_else(x > x2, density, as.numeric(NA))) %>%  # only shade where x exceeds x2
  ggplot() +
  geom_line(aes(x = x, y = density)) +
  geom_area(aes(x = x, y = cdf), alpha = 0.3) +
  theme_minimal() +
  labs(title = "P(X^2 > 28.3) when X ~ ChiSq(1)")

How large would X2 have to be to conclude the observed were not in agreement with expectations?

qchisq(p = .05, df = 1, lower.tail = FALSE)
[1] 3.841459
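The same result can be obtained with chisq.test(); a sketch using the observed helmet/injury counts from the table above (the uncorrected statistic is approximately 28, in line with the manual calculation):

helmet <- matrix(c(17, 130, 218, 428), nrow = 2,
                 dimnames = list(Injury = c("Yes", "No"),
                                 Helmet = c("Yes", "No")))

chisq.test(helmet, correct = FALSE)   # uncorrected Pearson chi-square
chisq.test(helmet)                    # with the Yates continuity correction (2x2 table)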

Is type of therapy associated with improvement?

Contingency table for the data:

| Variable | Improvement | No improvement | Total |
|---|---|---|---|
| Therapy X | 28 | 16 | 44 |
| Therapy Y | 37 | 9 | 46 |
| Total | 65 | 25 | 90 |

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

$$E_{11} = \frac{65 \times 44}{90} = 31.77778$$

$$E_{12} = \frac{25 \times 44}{90} = 12.22222$$

$$E_{21} = \frac{65 \times 46}{90} = 33.22222 \qquad E_{22} = \frac{25 \times 46}{90} = 12.77778$$

  • H0 : The two variables are independent
  • H1 : The two variables are dependent

The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Summary table

| Observed (O) | Expected (E) | (O − E)² | (O − E)²/E |
|---|---|---|---|
| 28 | 31.77778 | 14.27177 | 0.4491114 |
| 16 | 12.22222 | 14.27177 | 1.167693 |
| 37 | 33.22222 | 14.27328 | 0.4296335 |
| 9 | 12.77778 | 14.27162 | 1.116909 |

$\chi^2 = 3.16334$ and

$$\chi^2_{0.05}\big((2-1)(2-1)\big) = \chi^2_{0.05}(1) = 3.84$$

  • conclusion

Since the calculated value (3.16) is less than the critical value (3.84), we fail to reject the null hypothesis and conclude that there is no evidence that the two variables are associated (i.e. the data are consistent with independence).
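A sketch of the same test using chisq.test() (the uncorrected statistic matches the manual value of about 3.16, with p ≈ 0.08):

therapy <- matrix(c(28, 37, 16, 9), nrow = 2,
                  dimnames = list(Therapy = c("X", "Y"),
                                  Outcome = c("Improvement", "No improvement")))

chisq.test(therapy, correct = FALSE)   # Pearson chi-square without continuity correction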

Examples in R and Stata

Shapiro-Wilk Test

The Shapiro-Wilk test is a test of whether a random variable is normally distributed. It uses the test statistic

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$ where $x_{(i)}$ is the $i$th smallest number in the sample and the coefficients $a_i$ are calculated as $(m^T V^{-1})/C$, where $m$ is the vector of expected values of standard normal order statistics, $V$ is their covariance matrix, and $C$ is a normalizing constant (a vector norm).

The null hypothesis $H_0$ is that the data are normally distributed, so a small p-value is evidence against normality. As an example, in the output below Shapiro-Wilk strongly rejects normality for 100 draws from a binomial distribution with size 10, while the evidence against normality is weaker for 100 draws from a binomial with size 30, which is closer to normal (although the p-value of 0.025 shown here would still lead to rejection at the 5% level).

shapiro.test(rbinom(100, 10, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 10, 0.3)
## W = 0.93847, p-value = 0.0001557
shapiro.test(rbinom(100, 30, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 30, 0.3)
## W = 0.97063, p-value = 0.02469