Analysis of Variance in R and Stata for Public Health Researchers
Biostatistics for Health Researchers
Author
Bongani Ncube
University of the Witwatersrand (Biostatistician)
Published
16 March 2025
Limitations of conducting multiple t-tests
Why can’t t-tests be used to compare each mean with every other mean (i.e., conducting multiple t-tests) when there are more than two means?
Time-consuming: the analysis becomes increasingly time-consuming as the number of means grows
Inflates the overall significance level: each additional test increases the likelihood of finding a significant result by chance alone, leading to a high risk of false positives
Loss of power: each pairwise t-test uses only two groups' data, so the error variance is estimated less precisely, making true differences between means harder to detect, especially with small sample sizes
Interpretation challenges: numerous pairwise comparisons are difficult to interpret and to synthesise into coherent findings
These limitations can be effectively addressed by using ANOVA to compare means across multiple groups
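To see why the overall significance level inflates: with m independent comparisons each tested at α = 0.05, the probability of at least one false positive is 1 − (1 − α)^m. A quick R sketch (the choice of five groups is purely illustrative):

```r
# Family-wise error rate when running m independent pairwise t-tests,
# each at alpha = 0.05 (figures are illustrative)
alpha <- 0.05
m <- choose(5, 2)              # 5 groups -> 10 pairwise comparisons
fwer <- 1 - (1 - alpha)^m      # P(at least one false positive)
round(fwer, 3)                 # about 0.40
```

So with five groups the chance of a spurious "significant" result is roughly 40%, not 5%.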
Assumptions
Independence of observations: Ensure that the observations within each group are independent
Normality: Check whether the dependent variable (DV) is approximately normally distributed for each category of the independent variable (IV)
Create histograms or Q-Q plots for each group
Shapiro-Wilk test of normality (Best for small sample size)
Null hypothesis (H0): the data is normally distributed
Alternative hypothesis (H1): The data is not normally distributed
If p-value < 0.05, reject H0 (data are not normal)
If p-value >= 0.05, fail to reject H0 (data are consistent with normality)
D’Agostino–Pearson test (sktest in Stata; best for large samples)
Hypotheses and decision rule are the same as above
The test can be unreliable if observations in one or more groups come from a highly non-normal distribution
Homogeneity of variances: Check whether the variances of the groups are approximately equal (More important than the normality test)
Create boxplots for each group
Levene’s test
Bartlett’s test (part of the standard output of the oneway command in Stata)
Null Hypothesis (H0): The variances across all groups are equal (homoscedasticity)
Alternative Hypothesis (H1): At least one group has a variance that is significantly different (heteroscedasticity)
If this assumption does not hold, we should consider using the non-parametric alternative (Kruskal-Wallis test) or transforming our response variable
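The assumption checks above can be run in a few lines of base R. A sketch on made-up data (the group labels, means, and sample sizes are invented for illustration):

```r
# Hypothetical data: a response `count` measured in three groups
set.seed(42)
dat <- data.frame(
  group = gl(3, 20, labels = c("A", "B", "C")),
  count = rnorm(60, mean = rep(c(55, 60, 68), each = 20), sd = 6)
)

# Normality within each group (Shapiro-Wilk; H0: data are normal)
by(dat$count, dat$group, shapiro.test)

# Homogeneity of variances (Bartlett; H0: equal variances)
bartlett.test(count ~ group, data = dat)
```

Levene's test (car::leveneTest()) is a common alternative to Bartlett's when normality is in doubt.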
Analysis Steps
Conduct exploratory data analysis- Graphically and numerically, e.g. mean, SD, min, max, for each group
Verify that the assumptions of ANOVA are met
Formulate hypotheses
Calculate test statistic
Determine critical value
Decision making
Post-hoc analysis (if needed)
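The first step, exploratory analysis, can be done numerically per group with base R. A sketch on made-up data (three groups with arbitrary means):

```r
# Step 1 (EDA): n, mean, SD, min, max for each group
set.seed(3)
dat <- data.frame(
  group = gl(3, 10, labels = c("A", "B", "C")),
  y     = rnorm(30, mean = rep(c(10, 12, 15), each = 10), sd = 2)
)
aggregate(y ~ group, data = dat,
          FUN = function(v) c(n = length(v), mean = mean(v),
                              sd = sd(v), min = min(v), max = max(v)))
```

Pair this with side-by-side boxplots (boxplot(y ~ group, data = dat)) for the graphical half of the exploration.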
Theory behind Analysis of Variance (ANOVA)
ANOVA uses the same underlying machinery as linear regression, but it looks at the problem from a slightly different angle. It can be more useful with qualitative (categorical) variables and designed experiments.
Experimental Design
Factor: explanatory or predictor variable to be studied in an investigation
Treatment (or Factor Level): “value” of a factor applied to the experimental unit
Experimental Unit: person, animal, piece of material, etc. that is subjected to treatment(s) and provides a response
Single Factor Experiment: one explanatory variable considered
Multifactor Experiment: more than one explanatory variable
Classification Factor: A factor that is not under the control of the experimenter (observational data)
Experimental Factor: assigned by the experimenter
Basics of experimental design:
Choices that a statistician has to make:
set of treatments
set of experimental units
treatment assignment (selection bias)
measurement (measurement bias, blind experiments)
Advancements in experimental design:
Factorial Experiments:
consider multiple factors at the same time (interaction)
Replication: repetition of experiment
assess mean squared error
control over precision of experiment (power)
Randomization
Before R.A. Fisher (1900s), treatments were assigned systematically or subjectively
randomization: assign treatments to experimental units at random, which averages out systematic effects that cannot be controlled by the investigator
Local control: Blocking or Stratification
Reduce experimental errors and increase power by placing restrictions on the randomization of treatments to experimental units.
Randomization may also eliminate correlations due to time and space.
Completely Randomized Design (CRD)
Treatment factor A with \(a\ge2\) treatment levels. Experimental units are randomly assigned to each treatment. The number of experimental units in each group can be
equal (balanced): n
unequal (unbalanced): \(n_i\) for the i-th group (i = 1,…,a).
The total sample size is \(N=\sum_{i=1}^{a}n_i\)
Possible assignments of units to treatments are \(k=\frac{N!}{n_1!n_2!...n_a!}\)
Each has probability 1/k of being selected. Each experimental unit is measured with a response \(Y_{ij}\), in which j denotes unit and i denotes treatment.
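For concreteness, with the balanced design used later in this chapter (N = 15 units, a = 3 groups of 5), the number of possible assignments works out as:

```r
# Number of distinct assignments of N = 15 units to a = 3 groups of 5:
# k = N! / (n1! n2! n3!)
k <- factorial(15) / (factorial(5)^3)
k        # 756756 possible assignments
1 / k    # probability of any one particular assignment under randomization
```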
| Treatment   | 1                  | 2                  | … | a                  |
|-------------|--------------------|--------------------|---|--------------------|
|             | \(Y_{11}\)         | \(Y_{21}\)         | … | \(Y_{a1}\)         |
|             | \(Y_{12}\)         | \(Y_{22}\)         | … | \(Y_{a2}\)         |
|             | ⋮                  | ⋮                  |   | ⋮                  |
|             | \(Y_{1n_1}\)       | \(Y_{2n_2}\)       | … | \(Y_{an_a}\)       |
| Sample Mean | \(\bar{Y_{1.}}\)   | \(\bar{Y_{2.}}\)   | … | \(\bar{Y_{a.}}\)   |
| Sample SD   | \(s_1\)            | \(s_2\)            | … | \(s_a\)            |
where \(\bar{Y_{i.}}=\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\)
And the grand mean is \(\bar{Y_{..}}=\frac{1}{N}\sum_{i}\sum_{j}Y_{ij}\)
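These quantities are one-liners in R. A sketch on a small made-up data set (the numbers are invented to keep the arithmetic easy to follow):

```r
# Treatment means \bar{Y}_{i.} and grand mean \bar{Y}_{..}
dat <- data.frame(
  trt = gl(3, 4, labels = c("1", "2", "3")),
  y   = c(6, 8, 7, 9,   4, 5, 6, 5,   10, 12, 11, 13)
)
(ybar_i     <- tapply(dat$y, dat$trt, mean))  # per-treatment means
(ybar_grand <- mean(dat$y))                   # grand mean over all N
```

With a balanced design the grand mean also equals the simple average of the treatment means.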
Single Factor Fixed Effects Model
also known as Single Factor (One-Way) ANOVA or ANOVA Type I model.
Partitioning the Variance
The total variability of the \(Y_{ij}\) observation can be measured as the deviation of \(Y_{ij}\) around the overall mean \(\bar{Y_{..}}\): \(Y_{ij} - \bar{Y_{..}}\)
This can be rewritten as: \[
\begin{split}
Y_{ij} - \bar{Y_{..}}&=Y_{ij} - \bar{Y_{..}} + \bar{Y_{i.}} - \bar{Y_{i.}} \\
&= (\bar{Y_{i.}}-\bar{Y_{..}})+(Y_{ij}-\bar{Y_{i.}})
\end{split}
\] where
the first term is the between treatment differences (i.e., the deviation of the treatment mean from the overall mean)
the second term is within treatment differences (i.e., the deviation of the observation around its treatment mean)
Squaring and summing over all observations gives the decomposition \(SSTO = SSTR + SSE\).
We lose one d.f. for the total corrected \(SSTO\) (leaving \(N-1\)) because the grand mean is estimated (\(\sum_{i}\sum_{j}(Y_{ij} - \bar{Y_{..}})=0\))
Similarly, \(SSTR\) has \(a-1\) d.f., since \(\sum_{i}n_i(\bar{Y_{i.}}-\bar{Y_{..}})=0\), and \(SSE\) has the remaining \(N-a\)
Accordingly, \(MSTR= \frac{SSTR}{a-1}\) and \(MSE=\frac{SSE}{N-a}\)
Use the one-way ANOVA test to compare the mean response of a continuous dependent variable among the levels of a factor variable (if you only have two levels, use the independent-samples t-test). The observations must be independent, meaning no observation can influence another (e.g., the same participant must not appear in different groups, and participants must not interact with each other to produce the observed outcome).
The ANOVA method decomposes the deviation of observation \(Y_{ij}\) around the overall mean \(\bar{Y}_{..}\) into two parts: the deviation of the observations around their treatment means, \(SSE\), and the deviation of the treatment means around the overall mean, \(SSTR\). The ratio of their mean squares, \(F = \frac{SSTR/(k-1)}{SSE/(N-k)} = \frac{MSTR}{MSE}\), follows an F-distribution with \(k-1\) numerator dof and \(N-k\) denominator dof. The more of the observed variance captured by the treatments, the larger \(F\) is, and the less likely that the null hypothesis, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), is true.
Compare the F-statistic to the F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom
The F test does not indicate which populations cause the rejection of \(H_0\). For this, use one of the post-hoc tests: Tukey, Fisher’s Least Significant Difference (LSD), Bonferroni, Scheffe, or Dunnett.
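In R, the overall F test and the Tukey post-hoc comparisons come from aov() and TukeyHSD(). A sketch on simulated data whose group means loosely mimic the drug example below (the data themselves are made up):

```r
# One-way ANOVA followed by Tukey's post-hoc test
set.seed(1)
dat <- data.frame(
  drug  = gl(3, 5, labels = c("A", "B", "C")),
  count = rnorm(15, mean = rep(c(68, 60, 55), each = 5), sd = 6)
)
fit <- aov(count ~ drug, data = dat)
summary(fit)    # overall F test with 2 and 12 degrees of freedom
TukeyHSD(fit)   # all pairwise differences, with family-wise adjustment
```

TukeyHSD() reports an adjusted p-value and confidence interval for each pair, so the family-wise error rate stays at the nominal level.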
ANOVA returns reliable results if the following conditions hold:
there should be no significant outliers within the factor levels;
the response variable should be approximately normally distributed within each factor level; and
the response variable variances within the factor levels should be equal.
Here is an example where you might use an ANOVA test.
Example
An experiment was performed to test the effects of drugs A and B on the lymphocyte count in mice by comparing A, B and a placebo (inactive substance) C. Fifteen mice were randomly assigned into three groups of five mice each, and the groups were given A, B and placebo C, respectively. The lymphocyte counts (in hundreds per cubic mm of blood) are summarised below:
use drug
* Assess the skewness of the count variable by drug type, or use "sum, detail" as above
tabstat count, by(drug_type) stat(n mean sd min q max)
egen skew = skew(count), by(drug_type)
tabdisp drug_type, c(skew)
Data are presented as mean ± standard deviation. Lymphocyte count decreased from Drug A (n = 5, 68 ± 9.3) to Drug B (n = 5, 60 ± 6.4) to Placebo C (n = 5, 55 ± 3.5).
The one-way ANOVA indicates that mean lymphocyte count differed significantly across drug groups, F(2, 12) = 4.62, p = .032.
Drug_aov <- aov(`lymphocyte count` ~ Drug, data = DrugTest)
Drug_anova <- anova(Drug_aov)
Drug_anova %>%
  tidy() %>%
  flextable() %>%
  set_table_properties(width = 0.8, layout = "autofit") %>%
  colformat_num(j = c(3, 4, 5), digits = 1) %>%
  colformat_num(j = 6, digits = 4) %>%
  set_caption("Results of ANOVA for Drug vs lymphocyte count")
use drug
* fit one-way ANOVA and obtain additional information about count by drug type
oneway count drug_type, tabulate
| Summary of count
drug_type | Mean Std. dev. Freq.
------------+------------------------------------
1 | 68 9.3005376 5
2 | 60 6.363961 5
3 | 55 3.5355339 5
------------+------------------------------------
Total | 61 8.4006802 15
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 430 2 215 4.62 0.0325
Within groups 558 12 46.5
------------------------------------------------------------------------
Total 988 14 70.5714286
Bartlett's equal-variances test: chi2(2) = 2.9923 Prob>chi2 = 0.224
use drug
* if the equal-variance assumption is in doubt, fit robust one-way ANOVA alternatives
ssc install robnova
robnova count drug_type
Outcome variable was count and predictor variable was drug_type
Sum of Squares Model = 430.0000
Sum of Squares Residual = 558.0000
Sum of Squares Total = 988.0000
R-squared = 0.435223
-----------------------------------------------------
Test | F df1 df2 p
-----------------+-----------------------------------
Brown-Forsythe's | 4.6237 2 8.3892 0.0443
Fisher's | 4.6237 2 12.0000 0.0325
Welch's | 4.3961 2 6.9782 0.0581
-----------------------------------------------------
Total number of observations used was 15.
table_group |>
  summarise(
    SSW = (5 - 1) * sum(`$(n_i-1)S^2_{i}$`),
    MSW = SSW / (15 - 3),
    "$\\bar{X}_{grand}$" = sum(`$\\bar{X}_i$`),
    N = sum(n)
  ) |>
  knitr::kable()
| SSW | MSW  | \(\bar{X}_{grand}\) | N  |
|-----|------|---------------------|----|
| 558 | 46.5 | 183                 | 15 |
SSB AND MSB
(SSB <- 988 - 558)
[1] 430
(MSB <- SSB / (3 - 1))
[1] 215
(F_stat <- 215 / 46.5)
[1] 4.623656
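The hand-calculated sums of squares reproduce the F statistic, and pf() gives the matching p-value:

```r
# F statistic and p-value from the between/within sums of squares above
MSB <- 430 / (3 - 1)     # between-groups mean square = 215
MSW <- 558 / (15 - 3)    # within-groups mean square  = 46.5
F_stat <- MSB / MSW
p_val  <- pf(F_stat, df1 = 2, df2 = 12, lower.tail = FALSE)
round(F_stat, 6)   # 4.623656
round(p_val, 4)    # 0.0325, matching the Stata oneway output
```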
Checking Conditions
The ANOVA test applies when the dependent variable is continuous, the independent variable is categorical, and the observations are independent within groups. Independent means the observations should be from a random sample, or from an experiment using random assignment. Each group’s size should be less than 10% of its population size. The groups must also be independent of each other (non-paired, and non-repeated measures). Additionally, there are three conditions related to the data distribution.
No outliers. There should be no significant outliers in the groups. Outliers exert a large influence on the mean and standard deviation. Test with a box plot. If it fails this condition, you might be able to drop the outliers or transform the data. Otherwise you’ll need to switch to a non-parametric test.
Normality. Each group’s values should be nearly normally distributed (“nearly” because ANOVA is considered robust to the normality assumption). This condition is especially important with small sample sizes. Test with the Q-Q plots or the Shapiro-Wilk test for normality. If the data is very non-normal, you might be able to transform your response variable, or use a nonparametric test such as Kruskal-Wallis.
Equal Variances. The group variances should be roughly equal. This condition is especially important when sample sizes differ. The IQR of the box plot is a good way to visually assess this condition. A rule of thumb is that the largest sample standard deviation should be less than twice the smallest. More formal homogeneity-of-variance tests include the Bartlett test and the Levene test. If the variances are very different, use a Games-Howell post hoc test for multiple comparisons instead of the Tukey post hoc test.
Outliers
Assess outliers with a box plot. The whiskers extend no further than 1.5 × IQR from the upper and lower hinges; any observations beyond the whiskers are outliers and are plotted individually.
There were no outliers in the data, as assessed by inspection of a boxplot.
Our data passed this check, but suppose they had not. There are generally three reasons for outliers: data entry errors, measurement errors, and genuinely unusual values. If the problem is data entry, fix it. If it is a measurement error, discard it. If the value is genuine, you have some options.
Kruskal-Wallis H test. It is a non-parametric test. Be careful here, because it is not quite testing the same \(H_0\).
Transform the dependent variable. Don’t do this unless the data is also non-normal. It also has the downside of making interpretation more difficult.
Leave it in if it doesn’t affect the conclusion (compared to taking it out).
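Outliers can also be flagged programmatically with the same 1.5 × IQR rule the box plot uses. A base-R sketch on made-up data with one deliberately planted extreme value:

```r
# Flag observations beyond the whisker fences (1.5 * IQR from the hinges)
set.seed(7)
x <- c(rnorm(20, mean = 50, sd = 5), 90)  # 90 is a planted outlier
out <- boxplot.stats(x)$out               # values the box plot would flag
out
```

In grouped data, apply the same idea per group, e.g. by(dat$y, dat$group, function(v) boxplot.stats(v)$out).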
use drug
* Shapiro–Wilk W test by drug type
bysort drug_type: swilk count
-> drug_type = 1
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.80367 2.318 1.361 0.08678
-------------------------------------------------------------------------------
-> drug_type = 2
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.66397 3.967 2.663 0.00387
-------------------------------------------------------------------------------
-> drug_type = 3
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.91538 0.999 -0.001 0.50058
In drug groups 1 and 3, count was normally distributed, as assessed by Shapiro–Wilk’s test (p > .05) and inspection of Q-Q plots. Note, however, that the test flags group 2 as non-normal (p = .004), so a Q-Q plot for that group should also be inspected before relying on the F test.
Had the data not been normally distributed, you would have three options:
transform the dependent variable;
use a non-parametric test such as Kruskal-Wallis; or
carry on regardless.
Transformations will generally only work when the distribution of scores in all groups has the same shape. They also have the drawback of making the data less interpretable.
You can also choose to carry on regardless. ANOVA is considered “robust” to normality violations.
Equal Variances
The equality of sample variances condition is less critical when sample sizes are similar among the groups. One rule of thumb is that no group’s standard deviation should be more than double that of any other.
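Applying this rule of thumb to the group standard deviations reported in the summary table above (9.30, 6.36, 3.54):

```r
# Rule of thumb: largest group SD should be less than twice the smallest
sds <- c(A = 9.30, B = 6.36, C = 3.54)   # SDs from the drug example output
ratio <- max(sds) / min(sds)
round(ratio, 2)   # 2.63 -> the rule of thumb is violated
```

The ratio exceeds 2, so the informal rule raises a flag even though Bartlett's test above did not reject equal variances (p = 0.224); with only five mice per group, the formal tests have little power, which is why both checks are worth doing.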
use drug
* Levene's and Brown-Forsythe tests
graph box count, over(drug_type) ylabel(50(10)80, labsize(small)) title("lymphocyte count by drug type")
quietly graph export box.svg, replace
robvar count, by(drug_type)
Now that the conditions are checked, conduct a post-hoc test to see which groups differ. Here is the Tukey test. Placebo C and Drug A were significantly different.
use drug
* pairwise comparison using the Tukey test (requires equal sample sizes in all groups, which we have)
pwmean count, over(drug_type) mcompare(tukey) effects cimeans
The statistical tests for the model conditions (e.g., Bartlett’s test for homogeneity) are often too sensitive. ANOVA is robust to small violations of the conditions. However, heterogeneity is a common problem in ANOVA. Transforming the response variable can often remove the heterogeneity. Finding the correct transformation can be challenging, but the Box-Cox procedure can help. The MASS::boxcox() function calculates the profile log-likelihoods for a power transformation of the response variable \(Y^\lambda\).
| \(\lambda\) | \(Y^\lambda\) | Transformation      |
|-------------|---------------|---------------------|
| 2           | \(Y^2\)       | Square              |
| 1           | \(Y^1\)       | (no transformation) |
| .5          | \(Y^{.5}\)    | Square Root         |
| 0           | \(\ln(Y)\)    | Log                 |
| -.5         | \(Y^{-.5}\)   | Inverse Square Root |
| -1          | \(Y^{-1}\)    | Inverse             |
The Box-Cox procedure does not recommend any particular transformation of the data in this case.
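For reference, a minimal MASS::boxcox() sketch on made-up, right-skewed positive data (the variables here are hypothetical, not the drug example):

```r
# Box-Cox profile log-likelihood for a power transformation of Y
library(MASS)
set.seed(2)
y <- rexp(60, rate = 1 / 10)   # positive, right-skewed response
g <- gl(3, 20)                 # three arbitrary groups
bc <- boxcox(lm(y ~ g), plotit = FALSE)
(lambda_hat <- bc$x[which.max(bc$y)])  # lambda maximizing the log-likelihood
```

In practice, pick a nearby interpretable \(\lambda\) from the table above (e.g., 0 for a log transform) rather than the exact maximizer, and transform only if the confidence interval for \(\lambda\) excludes 1.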