A Guide to Statistical Practice in R and Stata for Public Health Researchers

Authors

Bongani Ncube

University of the Witwatersrand (Biostatistician)

Published

16 March 2025

Library setup

About me

I am an MSc research fellow at the Wits School of Public Health. I have a background in advanced mathematics and statistics.

About the presentation

I compiled these notes while completing Biostatistics for Health Researchers 1.

Example datasets used in this exercise

Important terms

A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information

  1. Inferential statistics is a branch of statistics that makes the use of various analytical tools to draw inferences about the population data from sample data.

  2. A Nominal Scale is a measurement scale in which numbers serve as “tags” or “labels” only, to identify or classify an object. This measurement normally deals with non-numeric (qualitative) variables or with numbers that carry no quantitative value.

  3. Snowball sampling is a non-probability sampling method where new units are recruited by other units to form part of the sample.

  4. A critical value serves as a boundary within the sampling distribution of a test statistic. These values play a crucial role in both hypothesis tests and confidence intervals.

  5. Non-probability sampling is a sampling method that uses non-random criteria

  6. A one-tailed test is a statistical test in which the critical area of a distribution is one-sided so that it is either greater than or less than a certain value, but not both.

  7. The confidence interval is an estimate of the amount of uncertainty associated with a sample, computed from the statistics of the observed data

  8. Covariance is a measure of the joint variability of two random variables.

  9. The interval scale is a quantitative measurement scale where there is order, the difference between the two variables is meaningful and equal, and the presence of zero is arbitrary.

  10. Probability sampling refers to the process in which each and every element of the population has a known, non-zero chance of being included in the sample; for instance, simple random sampling is a method of probability sampling.

  1. ANOVA is a method of assessing the differences between sample means; for instance, to test the difference in mean salary between people with degrees, diplomas, master's degrees and PhDs, one would perform an ANOVA.

  2. A significance test is a formal procedure for testing properties of population distributions and can be used to test the difference between a single sample value and a fixed value; for instance, if we know the population mean, we may wish to test whether a particular sample mean is significantly different from it.

  3. A statistic is a numerical value computed from data in a sample, e.g. the sample variance.

  4. A parameter, on the other hand, is a numerical value calculated from values in the whole population, e.g. the population variance.

  5. The null hypothesis in a significance test is the assertion that no difference exists.

  6. The alternative hypothesis is the assertion that a significant difference exists; it is accepted when the null hypothesis is rejected.

  7. The chi-square statistic arises when we wish to compare a set of observed frequencies with a set of theoretical frequencies; it is also a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies.

  8. One-way ANOVA compares the means of two or more groups in order to determine whether there is statistical evidence that the associated population means are significantly different.

  9. Test for independence - in tests of independence, two variables (usually nominal) are involved, and the test is used to answer whether the two variables are dependent on each other, i.e. associated. For example, we may wish to test whether HIV status is associated with whether someone is poor or not in a given dataset.

Probability

Principles

Here are three rules that come up all the time.

  • $\Pr(A \cup B) = \Pr(A) + \Pr(B) - \Pr(A \cap B)$. This rule generalizes to $\Pr(A \cup B \cup C) = \Pr(A) + \Pr(B) + \Pr(C) - \Pr(A \cap B) - \Pr(A \cap C) - \Pr(B \cap C) + \Pr(A \cap B \cap C)$.

  • $\Pr(A \mid B) = \dfrac{\Pr(A \cap B)}{\Pr(B)}$

  • If A and B are independent, $\Pr(A \cap B) = \Pr(A)\Pr(B)$, and $\Pr(A \mid B) = \Pr(A)$.

Bayes theorem - Application :Diagnostics

  • D = “Disease is present”

  • $D^c$ = “Disease is not present”

  • $T^+$ = “Positive test result”

  • $T^-$ = “Negative test result”

  • $P(T^+|D)$ = sensitivity (true positive rate)

  • $P(T^+|D^c)$ = false positive rate

  • $P(T^-|D)$ = false negative rate

  • $P(T^-|D^c)$ = specificity (true negative rate)

$PPV = P(D|T^+)$ and $NPV = P(D^c|T^-)$ are the positive and negative predictive values, and $P(D)$ is the prior probability (prevalence) of disease.
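Bayes' theorem links these quantities: $P(D|T^+) = \dfrac{P(T^+|D)\,P(D)}{P(T^+|D)\,P(D) + P(T^+|D^c)\,P(D^c)}$. A minimal R sketch of the calculation (the prevalence, sensitivity and specificity values below are made-up numbers for illustration):

# Bayes' theorem for diagnostic testing (illustrative values, not from the text)
prev <- 0.05   # P(D), prevalence of disease
sens <- 0.90   # P(T+ | D), sensitivity
spec <- 0.95   # P(T- | Dc), specificity

# P(D | T+) = P(T+ | D) P(D) / [P(T+ | D) P(D) + P(T+ | Dc) P(Dc)]
ppv <- (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))

# P(Dc | T-) = P(T- | Dc) P(Dc) / [P(T- | Dc) P(Dc) + P(T- | D) P(D)]
npv <- (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)

c(PPV = ppv, NPV = npv)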

  • If your model predicts a patient as 1 (positive) and they belong to category 1 (positive) in reality we call this a true positive.

  • If your model predicts a patient as 0 (negative) and they belong to category 1 (positive) in reality we call this a false negative.

  • If your model predicts a patient as 1 (positive) and they belong to category 0 (negative) in reality we call this a false positive.

  • If your model predicts a patient as 0 (negative) and they belong to category 0 (negative) in reality we call this a true negative.

🎓 Precision: $\dfrac{TP}{TP+FP}$, defined as the proportion of predicted positives that are actually positive. Also called positive predictive value.

🎓 Recall: $\dfrac{TP}{TP+FN}$, defined as the proportion of positive results out of the number of samples which were actually positive. Also known as sensitivity.

🎓 Specificity: $\dfrac{TN}{TN+FP}$, defined as the proportion of negative results out of the number of samples which were actually negative.

🎓 Accuracy: $\dfrac{TP+TN}{TP+TN+FP+FN}$, the percentage of labels predicted accurately for a sample.

🎓 F Measure: a weighted average of the precision and recall, with best being 1 and worst being 0.
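A minimal R sketch computing these metrics from confusion-matrix counts (the TP, FP, TN, FN values are made up for illustration):

# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; TN <- 45; FN <- 5

precision   <- TP / (TP + FP)                    # positive predictive value
recall      <- TP / (TP + FN)                    # sensitivity
specificity <- TN / (TN + FP)
accuracy    <- (TP + TN) / (TP + TN + FP + FN)
f1          <- 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall

round(c(precision = precision, recall = recall,
        specificity = specificity, accuracy = accuracy, F1 = f1), 3)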

Multiplication Rule. If an experiment consists of k steps with $|S_1|, |S_2|, \dots, |S_k|$ possible outcomes at each step, then $|S| = |S_1| \times |S_2| \times \cdots \times |S_k|$.

How many outcomes are possible from a sequence of 4 coin flips and 2 rolls of a die? $|S| = |S_1||S_2|\cdots|S_6| = 2 \cdot 2 \cdot 2 \cdot 2 \cdot 6 \cdot 6 = 576$.
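A one-line check in R:

# 4 coin flips (2 outcomes each) and 2 die rolls (6 outcomes each)
prod(c(2, 2, 2, 2, 6, 6))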

Statistical inference

  • Make inferences (an interpretation) about the true parameter value β based on our estimator/estimate
  • Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.

Testing does not

  • Confirm with 100% certainty that a hypothesis is true
  • Confirm with 100% certainty that a hypothesis is false
  • Tell you how to interpret the estimated value (economic vs. practical vs. statistical significance)

Hypothesis: a statement that translates a research objective into a value (or set of values) in which the population parameter should or should not lie.

  • Null hypothesis ($H_0$): A statement about the population parameter that we take to be true unless the data provide substantial evidence against it.
    • Can be either a single value (ex: $H_0: \beta = 0$) or a set of values (ex: $H_0: \beta_1 \ge 0$)
    • Will generally be the value you would not like the population parameter to be (subjective)
      • $H_0: \beta_1 = 0$ means you would like to see a non-zero coefficient
      • $H_0: \beta_1 \ge 0$ means you would like to see a negative effect
    • “Test of significance” refers to the two-sided test: $H_0: \beta_j = 0$
  • Alternative hypothesis ($H_a$ or $H_1$) (research hypothesis): All other possible values that the population parameter may take if the null hypothesis does not hold.

Type I Error

Error made when H0 is rejected when, in fact, H0 is true.
The probability of committing a Type I error is α (known as level of significance of the test)

Type I error (α): probability of rejecting H0 when it is true.

Legal analogy: In U.S. law, a defendant is presumed to be “innocent until proven guilty”.
If the null hypothesis is that a person is innocent, the Type I error is the probability that you conclude the person is guilty when he is innocent.


Type II Error

Type II error level (β): probability that you fail to reject the null hypothesis when it is false.

In the legal analogy, this is the probability that you fail to find the person guilty when he or she is guilty.

Error made when $H_0$ is not rejected when, in fact, $H_1$ is true.
The probability of committing a Type II error is β; the power of the test is 1 − β.

Random sample of size n: A collection of n independent random variables taken from the distribution X, each with the same distribution as X.

Sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Sample Median

$\tilde{x}$ = the middle observation in a sample of observations ordered from smallest to largest (or vice versa).

If n is odd, $\tilde{x}$ is the middle observation;
if n is even, $\tilde{x}$ is the average of the two middle observations.

Sample variance: $$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1} = \frac{n\sum_{i=1}^{n}X_i^2 - \left(\sum_{i=1}^{n}X_i\right)^2}{n(n-1)}$$

Sample standard deviation: $S = \sqrt{S^2}$

Sample proportion: $\hat{p} = \dfrac{X}{n} = \dfrac{\text{number in the sample with trait}}{\text{sample size}}$

$$\widehat{p_1 - p_2} = \hat{p}_1 - \hat{p}_2 = \frac{X_1}{n_1} - \frac{X_2}{n_2} = \frac{n_2 X_1 - n_1 X_2}{n_1 n_2}$$

Estimators
Point Estimator
θ^ is a statistic used to approximate a population parameter θ


Point estimate
The numerical value assumed by θ^ when evaluated for a given sample


Unbiased estimator
If E(θ^)=θ, then θ^ is an unbiased estimator for θ

  1. $\bar{X}$ is an unbiased estimator for $\mu$
  2. $S^2$ is an unbiased estimator for $\sigma^2$
  3. $\hat{p}$ is an unbiased estimator for $p$
  4. $\hat{p}_1 - \hat{p}_2$ is an unbiased estimator for $p_1 - p_2$
  5. $\bar{X}_1 - \bar{X}_2$ is an unbiased estimator for $\mu_1 - \mu_2$

Note: S is a biased estimator for σ

Distribution of the sample mean

If $\bar{X}$ is the sample mean based on a random sample of size n drawn from a normal distribution X with mean μ and standard deviation σ, then $\bar{X}$ is normally distributed, with mean $\mu_{\bar{X}} = \mu$ and variance $\sigma^2_{\bar{X}} = Var(\bar{X}) = \sigma^2/n$. The standard error of the mean is therefore $\sigma_{\bar{X}} = \sigma/\sqrt{n}$.
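A minimal simulation sketch of this result (the mean, standard deviation and sample size below are arbitrary illustrative values):

# Simulate the sampling distribution of the mean for X ~ N(mu, sigma^2)
set.seed(123)
mu <- 50; sigma <- 10; n <- 25

sample_means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

mean(sample_means)   # close to mu
sd(sample_means)     # close to sigma / sqrt(n)
sigma / sqrt(n)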

One Sample Inference

$$Y_i \overset{i.i.d.}{\sim} N(\mu, \sigma^2)$$

i.i.d. stands for “independent and identically distributed”.

Hence, we have the following model:

$Y_i = \mu + \epsilon_i$ where

  • $\epsilon_i \overset{iid}{\sim} N(0, \sigma^2)$
  • $E(Y_i) = \mu$
  • $Var(Y_i) = \sigma^2$
  • $\bar{y} \sim N(\mu, \sigma^2/n)$


The Mean

When $\sigma^2$ is estimated by $s^2$, then

$$\frac{\bar{y} - \mu}{s/\sqrt{n}} \sim t_{n-1}$$

Then, a $100(1-\alpha)\%$ confidence interval for $\mu$ is obtained from:

$$1 - \alpha = P\left(-t_{\alpha/2;n-1} \le \frac{\bar{y} - \mu}{s/\sqrt{n}} \le t_{\alpha/2;n-1}\right) = P\left(\bar{y} - t_{\alpha/2;n-1}\, s/\sqrt{n} \le \mu \le \bar{y} + t_{\alpha/2;n-1}\, s/\sqrt{n}\right)$$

And the interval is

$$\bar{y} \pm t_{\alpha/2;n-1}\, \frac{s}{\sqrt{n}}$$

where $s/\sqrt{n}$ is the standard error of $\bar{y}$.

If the experiment were repeated many times, $100(1-\alpha)\%$ of these intervals would contain $\mu$.
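A minimal R sketch of this interval (using arbitrary illustrative data):

# 95% t-based confidence interval for the mean, computed by hand
set.seed(1)
y <- rnorm(20, mean = 10, sd = 3)   # illustrative data

n     <- length(y)
ybar  <- mean(y)
se    <- sd(y) / sqrt(n)
tcrit <- qt(0.975, df = n - 1)

c(lower = ybar - tcrit * se, upper = ybar + tcrit * se)

# t.test() gives the same interval
t.test(y)$conf.int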

| Case | $100(1-\alpha)\%$ Confidence Interval | Sample Size (confidence $1-\alpha$, error $d$) | Test Statistic |
|---|---|---|---|
| $\sigma^2$ known, X normal (or $n \ge 25$) | $\bar{X} \pm z_{\alpha/2}\dfrac{\sigma}{\sqrt{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2 \sigma^2}{d^2}$ | $z = \dfrac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$ |
| $\sigma^2$ unknown, X normal (or $n \ge 25$) | $\bar{X} \pm t_{\alpha/2}\dfrac{s}{\sqrt{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2 s^2}{d^2}$ | $t = \dfrac{\bar{X} - \mu_0}{s/\sqrt{n}}$ |

Example of one sample t.test

Patients suffering from clinical depression:

| Patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| Days | 51 | 41 | 62 | 33 | 28 | 43 | 37 | 44 |

$\sum X = 339$, $\bar{X} = 42.375$

Hypothesis Testing

$H_0: \mu = 43$ versus $H_1: \mu \ne 43$

$$T = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}$$

$$T = \frac{42.375 - 43}{2/\sqrt{8}} = -0.8838835$$

Since it is a two-tailed test, we split the significance level across the two tails:

$$t_{8-1}\left(\tfrac{5\%}{2}\right) = t_{7}(0.025) = 2.365$$

Conclusion

Since $|T| = 0.884 < t_{7}(0.025) = 2.365$, the test statistic does not fall in the rejection region; hence we fail to reject the null hypothesis and conclude that the mean for this group of patients is not significantly different from 43.
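A sketch of the same test with R's built-in t.test() (note that t.test() estimates the standard deviation from the sample, so its t statistic will differ from a calculation that treats the standard deviation as known):

days <- c(51, 41, 62, 33, 28, 43, 37, 44)

# Two-sided one-sample t-test of H0: mu = 43
t.test(days, mu = 43, alternative = "two.sided")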

For Difference of Means ($\mu_1 - \mu_2$), Independent Samples

| Case | $100(1-\alpha)\%$ Confidence Interval | Test Statistic | Notes |
|---|---|---|---|
| $\sigma^2$ known | $\bar{X}_1 - \bar{X}_2 \pm z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$ | $z = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$ | |
| $\sigma^2$ unknown, variances assumed EQUAL | $\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}$ | $t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_p^2\left(1/n_1 + 1/n_2\right)}}$ | Pooled variance: $s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$; degrees of freedom: $\gamma = n_1 + n_2 - 2$ |
| $\sigma^2$ unknown, variances assumed UNEQUAL | $\bar{X}_1 - \bar{X}_2 \pm t_{\alpha/2}\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}$ | $t = \dfrac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$ | Degrees of freedom: $\gamma = \dfrac{\left(s_1^2/n_1 + s_2^2/n_2\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$ |

two sample T-test

When you have a single explanatory variable that is qualitative and has only two levels, you can run a Student's t-test to test for a difference in the mean of the two levels. If appropriate for your data, you can choose to test a unilateral (one-sided) hypothesis; that is, you can test the more specific assumption that one level has a higher mean than the other, rather than simply that the means differ. Note that the robustness of this test increases with sample size and is higher when the groups have equal sizes.

For the t-test, the t statistic used to find the p-value is calculated as: $$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

where

$\bar{x}_1$ and $\bar{x}_2$ are the means of the response variable y for groups 1 and 2, respectively,
$s_1^2$ and $s_2^2$ are the variances of the response variable y for groups 1 and 2, respectively,
$n_1$ and $n_2$ are the sample sizes of groups 1 and 2, respectively.

NB: this applies if the variances are not equal

For equal variances we use the formula in the table above:

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)_0}{\sqrt{s_p^2\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}} \quad \text{where the pooled variance is} \quad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$$

Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.
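A quick R sketch of this equivalence (the data, `y` and `group`, are made-up illustrative values): the pooled two-sample t-test and a one-way ANOVA with two levels give the same p-value, and the F statistic equals t².

set.seed(42)
y     <- c(rnorm(15, mean = 10), rnorm(15, mean = 12))
group <- factor(rep(c("A", "B"), each = 15))

t_out <- t.test(y ~ group, var.equal = TRUE)   # pooled two-sample t-test
a_out <- anova(lm(y ~ group))                  # one-way ANOVA with 2 levels

t_out$statistic^2                      # equals the ANOVA F statistic
a_out$`F value`[1]
c(t_out$p.value, a_out$`Pr(>F)`[1])    # identical p-values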

Assumptions

If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.

  1. Normality of data
    As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed, but have reasonably symmetrical distributions, a mean which is close to the centre of the distribution, and only one mode (highest point in the frequency histogram) then a t-test will still work as long as the sample is sufficiently large (rule of thumb ~30 observations). If the data is heavily skewed, then we may need a very large sample before a t-test works. In such cases, an alternate non-parametric test should be used.
  2. Homoscedasticity
    Another important assumption of the two-sample t-test is that the variance of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If population variances are unequal, then the probability of a Type I error is greater than α.
    The robustness of the t-test increases with sample size and is higher when groups have equal sizes.
    We can test for a difference in variances between two populations by asking: what is the probability of drawing two samples from populations with identical variances and having the two sample variances be as different as $s_1^2$ and $s_2^2$ are?
    To do so, we use the variance ratio test (i.e. an F-test), as sketched below.
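A minimal sketch of the F-test in R (illustrative data):

# F-test (variance ratio test) for equality of two variances
set.seed(7)
x1 <- rnorm(20, sd = 2)
x2 <- rnorm(20, sd = 3)

var.test(x1, x2)   # H0: the two population variances are equal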

Violation of assumptions

If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, you can transform your data (log or square root for example) or use the non-parametric equivalent of t-test, the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.

Compare Reading marks by gender
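The dataset for this comparison is not shown here, so the sketch below uses a small made-up data frame (`reading_dat`, `marks` and `gender` are hypothetical names and values) to illustrate the call:

# Hypothetical reading marks for two gender groups (made-up values)
reading_dat <- data.frame(
  marks  = c(62, 55, 70, 48, 66, 59, 73, 51, 64, 58),
  gender = rep(c("Male", "Female"), each = 5)
)

# Welch two-sample t-test of mean reading marks by gender
t.test(marks ~ gender, data = reading_dat)

# Pooled-variance version, if the equal-variance assumption is reasonable
t.test(marks ~ gender, data = reading_dat, var.equal = TRUE)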

For Difference of Means ($\mu_1 - \mu_2$), Paired Samples ($D = X - Y$)

$100(1-\alpha)\%$ Confidence Interval

$$\bar{D} \pm t_{\alpha/2}\frac{s_d}{\sqrt{n}}$$

Hypothesis Testing Test Statistic

$$t = \frac{\bar{D} - D_0}{s_d/\sqrt{n}}$$
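A minimal paired t-test sketch in R (illustrative before/after measurements on the same individuals):

set.seed(11)
before <- rnorm(12, mean = 120, sd = 10)
after  <- before - rnorm(12, mean = 4, sd = 5)   # second measurement on the same subjects

t.test(before, after, paired = TRUE)   # equivalent to t.test(before - after, mu = 0)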

Difference of Two Proportions

Mean

$$\hat{p}_1 - \hat{p}_2$$

Variance

$$\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$$

$100(1-\alpha)\%$ Confidence Interval

$$\hat{p}_1 - \hat{p}_2 \pm z_{\alpha/2}\sqrt{\frac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \frac{\hat{p}_2(1-\hat{p}_2)}{n_2}}$$

Sample Sizes, Confidence $1-\alpha$, Error d
(Prior estimates for $\hat{p}_1, \hat{p}_2$)

$$n \ge \frac{z_{\alpha/2}^2\left[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)\right]}{d^2}$$

(No prior estimates for $\hat{p}$)

$$n \ge \frac{z_{\alpha/2}^2}{2d^2}$$

Hypothesis Testing - Test Statistics

Null value $(p_1 - p_2)_0$:

$$z = \frac{(\hat{p}_1 - \hat{p}_2) - (p_1 - p_2)_0}{\sqrt{\dfrac{\hat{p}_1(1-\hat{p}_1)}{n_1} + \dfrac{\hat{p}_2(1-\hat{p}_2)}{n_2}}}$$

Null value $(p_1 - p_2)_0 = 0$:

$$z = \frac{\hat{p}_1 - \hat{p}_2}{\sqrt{\hat{p}(1-\hat{p})\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$

where

$$\hat{p} = \frac{x_1 + x_2}{n_1 + n_2} = \frac{n_1\hat{p}_1 + n_2\hat{p}_2}{n_1 + n_2}$$

(21.9-16.7)/sqrt(20.38889*(2/10))
[1] 2.575085
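The two-proportion z-test can be run in R with prop.test(); a minimal sketch with made-up counts (x = successes, n = sample sizes):

# H0: p1 - p2 = 0, using the pooled estimate of p
prop.test(x = c(45, 30), n = c(100, 100), correct = FALSE)

# The same z statistic by hand
p1 <- 45/100; p2 <- 30/100
p  <- (45 + 30) / (100 + 100)
z  <- (p1 - p2) / sqrt(p * (1 - p) * (1/100 + 1/100))
z^2   # equals the X-squared value reported by prop.test()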

Single Variance

$$1-\alpha = P\left(\chi^2_{1-\alpha/2;n-1} \le \frac{(n-1)s^2}{\sigma^2} \le \chi^2_{\alpha/2;n-1}\right) = P\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2}} \le \sigma^2 \le \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}\right)$$

and a $100(1-\alpha)\%$ confidence interval for $\sigma^2$ is:

$$\left(\frac{(n-1)s^2}{\chi^2_{\alpha/2;n-1}},\ \frac{(n-1)s^2}{\chi^2_{1-\alpha/2;n-1}}\right)$$ Confidence limits for $\sigma$ are obtained by computing the positive square roots of these limits.

Equivalently,

$100(1-\alpha)\%$ Confidence Interval

$$L_1 = \frac{(n-1)s^2}{\chi^2_{\alpha/2}}, \qquad L_2 = \frac{(n-1)s^2}{\chi^2_{1-\alpha/2}}$$

Hypothesis Testing Test Statistic

$$\chi^2 = \frac{(n-1)s^2}{\sigma_0^2}$$
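A minimal R sketch of the confidence interval for σ² (illustrative data; the null value σ₀ = 2 is made up):

set.seed(3)
x <- rnorm(30, mean = 0, sd = 2)   # illustrative sample
n <- length(x); s2 <- var(x); alpha <- 0.05

# 95% CI for sigma^2: ((n-1)s^2 / chi2_{alpha/2}, (n-1)s^2 / chi2_{1-alpha/2})
lower <- (n - 1) * s2 / qchisq(1 - alpha/2, df = n - 1)
upper <- (n - 1) * s2 / qchisq(alpha/2, df = n - 1)
c(lower = lower, upper = upper)

# Chi-square test statistic for H0: sigma^2 = sigma0^2
(n - 1) * s2 / 2^2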

Single Proportion (p)

| $100(1-\alpha)\%$ Confidence Interval | Sample Size (prior estimate for $\hat{p}$) | Sample Size (no prior estimate for $\hat{p}$) | Test Statistic |
|---|---|---|---|
| $\hat{p} \pm z_{\alpha/2}\sqrt{\dfrac{\hat{p}(1-\hat{p})}{n}}$ | $n \ge \dfrac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{d^2}$ | $n \ge \dfrac{z_{\alpha/2}^2}{4d^2}$ | $z = \dfrac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$ |

Power

Formally, power (for the test of the mean) is given by:

$$\pi(\mu) = 1 - \beta = P(\text{test rejects } H_0 \mid \mu)$$
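In R, power for a t-test of the mean can be explored with power.t.test(); a quick sketch with illustrative values:

# Power to detect a difference (delta) of 5 with sd = 10, n = 30, alpha = 0.05
power.t.test(n = 30, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")

# Or solve for the sample size needed for 80% power
power.t.test(power = 0.80, delta = 5, sd = 10, sig.level = 0.05,
             type = "one.sample")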

Chi-square Test for Associations

Contingency tables

When we work with categorical data in research, we analyze response data when there are either no predictor variables or all of the predictor variables are also categorical.

Pearson Chi-square Test for independence

  • The chi-square independence test tests whether observed joint frequency counts $O_{ij}$ differ from the expected frequency counts $E_{ij}$ under the independence model (the model of independent explanatory variables, $\pi_{ij} = \pi_{i+}\pi_{+j}$). $H_0$ is $O_{ij} = E_{ij}$.
  • It determines whether an association exists, that is, whether the two variables are dependent on each other. Sometimes, $H_0$ represents the model whose validity is to be tested. Contrast this with the conventional formulation of $H_0$ as the hypothesis that is to be disproved. The goal in this case is not to disprove the model, but to see whether the data are consistent with the model and whether any deviation can be attributed to chance.
  • These tests do not measure the strength of an association.
  • These tests depend on and reflect the sample size: double the sample size by copying each observation and you double the $\chi^2$ statistic, even though the strength of the association does not change.
  • The Pearson chi-square test is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (the large-sample p-values are not appropriate).
  • When the sample size is small, exact p-values can be calculated (this is prohibitive for large samples); calculation of the exact p-values assumes that the column totals and row totals are fixed.
    $$X^2 = \sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where $O_j = p_j n$ and $E_j = \pi_j n$. The sampling distribution of $X^2$ approaches $\chi^2_{J-1}$ as the sample size $n \to \infty$.

Question (Taken from Zimbabwe Open University)

A ZOU Regional Coordinator observed that, at one weekend school, the 150 students who turned up were categorized by gender and programme as shown in the table below.

| Gender | Physcology | Development studies | R & A Management | Total |
|---|---|---|---|---|
| Male | 46 | 29 | 28 | 103 |
| Female | 27 | 14 | 6 | 47 |
| Total | 73 | 43 | 34 | 150 |

Record the data in R:

dat_o <- matrix(
  c(46,27,29,14,28,6), 
  nrow = 2,
  dimnames = list(Gender = c("Male", "Female"),
    study = c("Physcology", "Development studies","R & A Management"))
)

view the observed values in a neat table

dat_o %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("row", "col")) %>%
  flextable::flextable() %>%
  flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
  flextable::autofit()

The contingency table for the data is given above.

Look at the Observed proportions

prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("col")) %>%
  flextable::flextable() %>%
  flextable::colformat_num(j = c(2, 3, 4), digits = 2) %>%
  flextable::autofit()

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

$$E_{11} = \frac{73 \times 103}{150} = 50.12667$$

$$E_{12} = \frac{43 \times 103}{150} = 29.52667$$

$$E_{13} = \frac{34 \times 103}{150} = 23.34667$$

$$E_{21} = \frac{73 \times 47}{150} = 22.87333$$

$$E_{22} = \frac{43 \times 47}{150} = 13.47333$$

$$E_{23} = \frac{34 \times 47}{150} = 10.65333$$

Doing this in R yields:

expected <- matrix(rowSums(dat_o)) %*% t(matrix(colSums(dat_o))) / sum(dat_o)
expected
         [,1]     [,2]     [,3]
[1,] 50.12667 29.52667 23.34667
[2,] 22.87333 13.47333 10.65333

Define the hypothesis

  • H0 : There is no association between gender and student programme choice
  • H1 : There is an association between gender and student programme choice
  1. The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Calculating $\chi^2$:

$$\chi^2 = \frac{(46-50.12667)^2}{50.12667} + \frac{(29-29.52667)^2}{29.52667} + \frac{(28-23.34667)^2}{23.34667} + \frac{(27-22.87333)^2}{22.87333} + \frac{(14-13.47333)^2}{13.47333} + \frac{(6-10.65333)^2}{10.65333}$$

$\chi^2 = 4.07425$, and in R we can compute it as:

vitc_e <- rowSums(dat_o) %o% colSums(dat_o) / sum(dat_o)  # expected counts: (row total x column total) / n
X2 <- sum((dat_o - vitc_e)^2 / vitc_e)
print(X2)
[1] 4.074251

The degrees of freedom is

vitc_dof <- (nrow(dat_o) - 1) * (ncol(dat_o) - 1)
print(vitc_dof)
[1] 2

$$\chi^2_{0.05}\big((2-1)(3-1)\big) = \chi^2_{0.05}(2) = 5.99$$

  • conclusion

Since $\chi^2 = 4.07425 < 5.99$, we fail to reject $H_0$ and conclude that there is no evidence of an association between gender and student programme choice.

In R we can compute this p-value manually:

pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
[1] 0.130403

Conclusion

Since the p-value (0.13) is greater than 0.05, we fail to reject $H_0$ and conclude that there is no evidence of an association between the two variables.

The deviance statistic is

$$G^2 = 2\sum_{i}\sum_{j} O_{ij}\log\left(\frac{O_{ij}}{E_{ij}}\right)$$

G2 <- 2 * sum(dat_o * log(dat_o / vitc_e))  # deviance (likelihood-ratio) statistic
print(G2)
[1] 4.371267

$X^2$ and $G^2$ increase with the disagreement between the saturated-model proportions $p_{ij}$ and the independence-model proportions $\pi_{ij}$.

The associated p-value for the deviance statistic is

pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
[1] 0.1124065

Rather than doing all of this manual labour:

  • R has a built-in function called chisq.test(), which takes a table (or matrix) object.

recall that

dat_o
        study
Gender   Physcology Development studies R & A Management
  Male           46                  29               28
  Female         27                  14                6
vitc_chisq_test <- chisq.test(dat_o, correct = FALSE)
print(vitc_chisq_test)

    Pearson's Chi-squared test

data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

For 2x2 tables, the Yates continuity correction (the default, correct = TRUE) yields more conservative p-values.

chisq.test(dat_o)

    Pearson's Chi-squared test

data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

These p-values do not provide evidence against the independence model.

Here is the chi-square test applied to the above data. Recall this data set is 2x3, so the degrees of freedom are $(2-1)(3-1) = 2$. The Yates continuity correction does not apply to tables other than 2x2, so the correct = TRUE/FALSE argument has no effect in chisq.test().

Expected frequencies from the output

vitc_chisq_test$expected
        study
Gender   Physcology Development studies R & A Management
  Male     50.12667            29.52667         23.34667
  Female   22.87333            13.47333         10.65333

Helmet injury vs wearing a helmet

The 793 observations, cross-classified by head injury and helmet use, are:

| | Helmet | No helmet | Total |
|---|---|---|---|
| Head injury | 17 | 218 | 235 |
| No head injury | 130 | 428 | 558 |
| Total | 147 | 646 | 793 |

$$P(\text{Helmet}) = \frac{147}{793}$$

$$P(\text{Injury}) = \frac{235}{793}$$

$$P(\text{Helmet} \cap \text{Injury}) = \frac{17}{793}$$

Testing the Hypothesis

Define the hypothesis

  • H0: Suffering a head injury is not associated with wearing a helmet
  • H1: There is association between wearing a helmet and suffering a head injury.

Decide the level of significance

  • at 5% level of significance

Calculate the test statistic

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

(235*147)/(793)
[1] 43.56242

$$E_{11} = \frac{235 \times 147}{793} = 43.56242$$

(235* 646)/(793)
[1] 191.4376

$$E_{12} = \frac{235 \times 646}{793} = 191.4376$$

(558* 147)/(793)
[1] 103.4376

$$E_{21} = \frac{558 \times 147}{793} = 103.4376$$

(558 * 646)/(793)
[1] 454.5624

$$E_{22} = \frac{558 \times 646}{793} = 454.5624$$

  1. The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Calculating $\chi^2$:

$$\chi^2_{\text{uncorrected}} = \frac{(17-43.6)^2}{43.6} + \frac{(218-191.4)^2}{191.4} + \frac{(130-103.4)^2}{103.4} + \frac{(428-454.6)^2}{454.6}$$

(((17-43.6)^2)/(43.6))+(((218-191.4)^2)/(191.4))+(((130-103.4)^2)/(103.4))+(((428-454.6)^2)/(454.6))
[1] 28.32459

Chi-Squared distribution

Random variable X is distributed $X \sim \chi^2_k$ if

$$f(x) = \frac{1}{2^{k/2}\Gamma(k/2)}\, x^{k/2 - 1} e^{-x/2}$$

with $E(X) = k$ and $Var(X) = 2k$.

For our example

$(r-1)(c-1) = (2-1)(2-1) = 1$

options(scipen = 999)
dof <- 1
x2 <- 28.32459
(p_value <- pchisq(q = x2, df = dof, lower.tail = FALSE))
[1] 0.0000001025845

Or simulate this by taking the mean of 10,000 random trials.

mean(rchisq(n = 10000, df = dof) >= x2)
[1] 0
# Plot the upper-tail probability P(X > x) for x between 25 and 30,
# shading the region beyond the observed statistic x2 = 28.32
data.frame(x = 250:300 / 10) %>%
  mutate(density = pchisq(x, df = dof, lower.tail = FALSE),   # upper-tail probability at x
         cdf = if_else(x > x2, density, as.numeric(NA))) %>%  # only shade where x exceeds x2
  ggplot() +
  geom_line(aes(x = x, y = density)) +
  geom_area(aes(x = x, y = cdf), alpha = 0.3) +
  theme_minimal() +
  labs(title = "P(X^2 > 28.3) when X ~ ChiSq(1)")

How large would X2 have to be to conclude the observed were not in agreement with expectations?

qchisq(p = .05, df = 1, lower.tail = FALSE)
[1] 3.841459
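The same result can be obtained with chisq.test(); a sketch using the observed helmet/injury counts from the table above (the uncorrected statistic is approximately 28, in line with the manual calculation):

helmet <- matrix(c(17, 130, 218, 428), nrow = 2,
                 dimnames = list(Injury = c("Yes", "No"),
                                 Helmet = c("Yes", "No")))

chisq.test(helmet, correct = FALSE)   # uncorrected Pearson chi-square
chisq.test(helmet)                    # with the Yates continuity correction (2x2 table)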

Is type of therapy associated with improvement?

Contingency table for the data:

| Variable | Improvement | No improvement | Total |
|---|---|---|---|
| Therapy X | 28 | 16 | 44 |
| Therapy Y | 37 | 9 | 46 |
| Total | 65 | 25 | 90 |

Calculating expected frequencies

$$E_{ij} = \frac{O_{i+} \times O_{+j}}{n}$$

$$E_{11} = \frac{65 \times 44}{90} = 31.77778$$

$$E_{12} = \frac{25 \times 44}{90} = 12.22222$$

$$E_{21} = \frac{65 \times 46}{90} = 33.22222 \qquad E_{22} = \frac{25 \times 46}{90} = 12.77778$$

  • H0 : The two variables are independent
  • H1 : The two variables are dependent

The test statistic is calculated as follows: $$\chi^2 = \sum_{j=1}^{c}\sum_{i=1}^{r}\frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05}\big((r-1)(c-1)\big)$

Summary table

| Observed (O) | Expected (E) | (O − E)² | (O − E)²/E |
|---|---|---|---|
| 28 | 31.77778 | 14.27177 | 0.4491114 |
| 16 | 12.22222 | 14.27177 | 1.167693 |
| 37 | 33.22222 | 14.27328 | 0.4296335 |
| 9 | 12.77778 | 14.27162 | 1.116909 |

$\chi^2 = 3.16334$ and

$$\chi^2_{0.05}\big((2-1)(2-1)\big) = \chi^2_{0.05}(1) = 3.84$$

  • conclusion

Since the calculated value (3.16) is less than the critical value (3.84), we fail to reject the null hypothesis and conclude that there is no evidence that the two variables are associated (i.e. the data are consistent with independence).
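A sketch of the same test using chisq.test() (the uncorrected statistic matches the manual value of about 3.16, with p ≈ 0.08):

therapy <- matrix(c(28, 37, 16, 9), nrow = 2,
                  dimnames = list(Therapy = c("X", "Y"),
                                  Outcome = c("Improvement", "No improvement")))

chisq.test(therapy, correct = FALSE)   # Pearson chi-square without continuity correction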

Examples in R and Stata

Shapiro-Wilk Test

The Shapiro-Wilk test is a test of whether a random variable is normally distributed. It uses the test statistic

$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}$$ where $x_{(i)}$ is the $i$th smallest number in the sample and the coefficients $a_i$ are calculated as $(m^T V^{-1})/C$, where $m$ is the vector of expected values of standard normal order statistics, $V$ is their covariance matrix, and $C$ is a normalizing constant (a vector norm).

The null hypothesis $H_0$ is that the data are normally distributed, so a small p-value is evidence against normality. As an example, in the output below Shapiro-Wilk strongly rejects normality for 100 draws from a binomial distribution with size 10, while the evidence against normality is weaker for 100 draws from a binomial with size 30, which is closer to normal (although the p-value of 0.025 shown here would still lead to rejection at the 5% level).

shapiro.test(rbinom(100, 10, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 10, 0.3)
## W = 0.93847, p-value = 0.0001557
shapiro.test(rbinom(100, 30, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 30, 0.3)
## W = 0.97063, p-value = 0.02469