Analysis of Variance in R and Stata for Public Health Researchers
Biostatistics for Health Researchers
Author
Bongani Ncube
University of the Witwatersrand (Biostatistician)
Published
16 March 2025
Limitations of conducting multiple t-tests
Why can’t t-tests be used to compare each mean with every other mean (i.e., conducting multiple t-tests) when there are more than two means?
Time-consuming: the analysis becomes increasingly time-consuming as the number of means grows
Inflates the overall significance level: each additional test increases the likelihood of finding a significant result by chance alone, leading to a high risk of false positives
Loss of power: each pairwise t-test uses only two groups' data, so the error variance is estimated less precisely, making true differences between means harder to detect, especially with small sample sizes
Interpretation challenges: numerous pairwise comparisons are difficult to interpret and to synthesise into coherent findings
These limitations can be effectively addressed by using ANOVA to compare means across multiple groups
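To see why the overall significance level inflates: with m independent comparisons each tested at α = 0.05, the probability of at least one false positive is 1 − (1 − α)^m. A quick R sketch (the choice of five groups is purely illustrative):

```r
# Family-wise error rate when running m independent pairwise t-tests,
# each at alpha = 0.05 (figures are illustrative)
alpha <- 0.05
m <- choose(5, 2)              # 5 groups -> 10 pairwise comparisons
fwer <- 1 - (1 - alpha)^m      # P(at least one false positive)
round(fwer, 3)                 # about 0.40
```

So with five groups the chance of a spurious "significant" result is roughly 40%, not 5%.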
Assumptions
Independence of observations: Ensure that the observations within each group are independent
Normality: Check whether the dependent variable (DV) is approximately normally distributed for each category of the independent variable (IV)
Create histograms or Q-Q plots for each group
Shapiro-Wilk test of normality (Best for small sample size)
Null hypothesis (H0): the data is normally distributed
Alternative hypothesis (H1): The data is not normally distributed
If p-value < 0.05, reject H0 (data are not normal)
If p-value >= 0.05, fail to reject H0 (data are consistent with normality)
D’Agostino–Pearson test (sktest in Stata; best for large samples)
Hypotheses and decision rule are the same as above
The test can be unreliable if observations in one or more groups come from a highly non-normal distribution
Homogeneity of variances: Check whether the variances of the groups are approximately equal (More important than the normality test)
Create boxplots for each group
Levene’s test
Bartlett’s test (part of the standard output of the oneway command in Stata)
Null Hypothesis (H0): The variances across all groups are equal (homoscedasticity)
Alternative Hypothesis (H1): At least one group has a variance that is significantly different (heteroscedasticity)
If this assumption does not hold, we should consider using the non-parametric alternative (Kruskal-Wallis test) or transforming our response variable
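The assumption checks above can be run in a few lines of base R. A sketch on made-up data (the group labels, means, and sample sizes are invented for illustration):

```r
# Hypothetical data: a response `count` measured in three groups
set.seed(42)
dat <- data.frame(
  group = gl(3, 20, labels = c("A", "B", "C")),
  count = rnorm(60, mean = rep(c(55, 60, 68), each = 20), sd = 6)
)

# Normality within each group (Shapiro-Wilk; H0: data are normal)
by(dat$count, dat$group, shapiro.test)

# Homogeneity of variances (Bartlett; H0: equal variances)
bartlett.test(count ~ group, data = dat)
```

Levene's test (car::leveneTest()) is a common alternative to Bartlett's when normality is in doubt.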
Analysis Steps
Conduct exploratory data analysis- Graphically and numerically, e.g. mean, SD, min, max, for each group
Verify that the assumptions of ANOVA are met
Formulate hypotheses
Calculate test statistic
Determine critical value
Decision making
Post-hoc analysis (if needed)
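The first step, exploratory analysis, can be done numerically per group with base R. A sketch on made-up data (three groups with arbitrary means):

```r
# Step 1 (EDA): n, mean, SD, min, max for each group
set.seed(3)
dat <- data.frame(
  group = gl(3, 10, labels = c("A", "B", "C")),
  y     = rnorm(30, mean = rep(c(10, 12, 15), each = 10), sd = 2)
)
aggregate(y ~ group, data = dat,
          FUN = function(v) c(n = length(v), mean = mean(v),
                              sd = sd(v), min = min(v), max = max(v)))
```

Pair this with side-by-side boxplots (boxplot(y ~ group, data = dat)) for the graphical half of the exploration.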
Theory behind Analysis of Variance (ANOVA)
ANOVA uses the same underlying machinery as linear regression, but it looks at the problem from a slightly different angle. It can be more useful with qualitative (categorical) variables and designed experiments.
Experimental Design
Factor: explanatory or predictor variable to be studied in an investigation
Treatment (or Factor Level): “value” of a factor applied to the experimental unit
Experimental Unit: person, animal, piece of material, etc. that is subjected to treatment(s) and provides a response
Single Factor Experiment: one explanatory variable considered
Multifactor Experiment: more than one explanatory variable
Classification Factor: A factor that is not under the control of the experimenter (observational data)
Experimental Factor: assigned by the experimenter
Basics of experimental design:
Choices that a statistician has to make:
set of treatments
set of experimental units
treatment assignment (selection bias)
measurement (measurement bias, blind experiments)
Advancements in experimental design:
Factorial Experiments:
consider multiple factors at the same time (interaction)
Replication: repetition of experiment
assess mean squared error
control over precision of experiment (power)
Randomization
Before R.A. Fisher (1900s), treatments were assigned systematically or subjectively
randomization: assign treatments to experimental units at random, which averages out systematic effects that cannot be controlled by the investigator
Local control: Blocking or Stratification
Reduce experimental errors and increase power by placing restrictions on the randomization of treatments to experimental units.
Randomization may also eliminate correlations due to time and space.
Completely Randomized Design (CRD)
Treatment factor A with \(a\ge2\) treatment levels. Experimental units are randomly assigned to each treatment. The number of experimental units in each group can be
equal (balanced): n
unequal (unbalanced): \(n_i\) for the i-th group (i = 1,…,a).
The total sample size is \(N=\sum_{i=1}^{a}n_i\)
Possible assignments of units to treatments are \(k=\frac{N!}{n_1!n_2!...n_a!}\)
Each has probability 1/k of being selected. Each experimental unit is measured with a response \(Y_{ij}\), in which j denotes unit and i denotes treatment.
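For concreteness, with the balanced design used later in this chapter (N = 15 units, a = 3 groups of 5), the number of possible assignments works out as:

```r
# Number of distinct assignments of N = 15 units to a = 3 groups of 5:
# k = N! / (n1! n2! n3!)
k <- factorial(15) / (factorial(5)^3)
k        # 756756 possible assignments
1 / k    # probability of any one particular assignment under randomization
```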
| Treatment   | 1                  | 2                  | … | a                  |
|-------------|--------------------|--------------------|---|--------------------|
|             | \(Y_{11}\)         | \(Y_{21}\)         | … | \(Y_{a1}\)         |
|             | \(Y_{12}\)         | \(Y_{22}\)         | … | \(Y_{a2}\)         |
|             | ⋮                  | ⋮                  |   | ⋮                  |
|             | \(Y_{1n_1}\)       | \(Y_{2n_2}\)       | … | \(Y_{an_a}\)       |
| Sample Mean | \(\bar{Y_{1.}}\)   | \(\bar{Y_{2.}}\)   | … | \(\bar{Y_{a.}}\)   |
| Sample SD   | \(s_1\)            | \(s_2\)            | … | \(s_a\)            |
where \(\bar{Y_{i.}}=\frac{1}{n_i}\sum_{j=1}^{n_i}Y_{ij}\)
And the grand mean is \(\bar{Y_{..}}=\frac{1}{N}\sum_{i}\sum_{j}Y_{ij}\)
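These quantities are one-liners in R. A sketch on a small made-up data set (the numbers are invented to keep the arithmetic easy to follow):

```r
# Treatment means \bar{Y}_{i.} and grand mean \bar{Y}_{..}
dat <- data.frame(
  trt = gl(3, 4, labels = c("1", "2", "3")),
  y   = c(6, 8, 7, 9,   4, 5, 6, 5,   10, 12, 11, 13)
)
(ybar_i     <- tapply(dat$y, dat$trt, mean))  # per-treatment means
(ybar_grand <- mean(dat$y))                   # grand mean over all N
```

With a balanced design the grand mean also equals the simple average of the treatment means.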
Single Factor Fixed Effects Model
also known as Single Factor (One-Way) ANOVA or ANOVA Type I model.
Partitioning the Variance
The total variability of the \(Y_{ij}\) observation can be measured as the deviation of \(Y_{ij}\) around the overall mean \(\bar{Y_{..}}\): \(Y_{ij} - \bar{Y_{..}}\)
This can be rewritten as: \[
\begin{split}
Y_{ij} - \bar{Y_{..}}&=Y_{ij} - \bar{Y_{..}} + \bar{Y_{i.}} - \bar{Y_{i.}} \\
&= (\bar{Y_{i.}}-\bar{Y_{..}})+(Y_{ij}-\bar{Y_{i.}})
\end{split}
\] where
the first term is the between treatment differences (i.e., the deviation of the treatment mean from the overall mean)
the second term is within treatment differences (i.e., the deviation of the observation around its treatment mean)
Squaring and summing over all observations gives the decomposition \(SSTO = SSTR + SSE\).
We lose one d.f. for the total corrected \(SSTO\) (leaving \(N-1\)) because the grand mean is estimated (\(\sum_{i}\sum_{j}(Y_{ij} - \bar{Y_{..}})=0\))
Similarly, \(SSTR\) has \(a-1\) d.f., since \(\sum_{i}n_i(\bar{Y_{i.}}-\bar{Y_{..}})=0\), and \(SSE\) has the remaining \(N-a\)
Accordingly, \(MSTR= \frac{SSTR}{a-1}\) and \(MSE=\frac{SSE}{N-a}\)
Use the one-way ANOVA test to compare the mean response of a continuous dependent variable among the levels of a factor variable (if you only have two levels, use the independent-samples t-test). The observations must be independent, meaning no observation can influence another (e.g., the same participant must not appear in different groups, and participants must not interact with each other to produce the observed outcome).
The ANOVA method decomposes the deviation of observation \(Y_{ij}\) around the overall mean \(\bar{Y}_{..}\) into two parts: the deviation of the observations around their treatment means, \(SSE\), and the deviation of the treatment means around the overall mean, \(SSTR\). The ratio of their mean squares, \(F = \frac{SSTR/(k-1)}{SSE/(N-k)} = \frac{MSTR}{MSE}\), follows an F-distribution with \(k-1\) numerator dof and \(N-k\) denominator dof. The more of the observed variance captured by the treatments, the larger \(F\) is, and the less likely that the null hypothesis, \(H_0: \mu_1 = \mu_2 = \cdots = \mu_k\), is true.
Compare the F-statistic to the F-distribution with \(k-1\) numerator degrees of freedom and \(N-k\) denominator degrees of freedom
The F test does not indicate which populations cause the rejection of \(H_0\). For this, use one of the post-hoc tests: Tukey, Fisher’s Least Significant Difference (LSD), Bonferroni, Scheffe, or Dunnett.
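In R, the overall F test and the Tukey post-hoc comparisons come from aov() and TukeyHSD(). A sketch on simulated data whose group means loosely mimic the drug example below (the data themselves are made up):

```r
# One-way ANOVA followed by Tukey's post-hoc test
set.seed(1)
dat <- data.frame(
  drug  = gl(3, 5, labels = c("A", "B", "C")),
  count = rnorm(15, mean = rep(c(68, 60, 55), each = 5), sd = 6)
)
fit <- aov(count ~ drug, data = dat)
summary(fit)    # overall F test with 2 and 12 degrees of freedom
TukeyHSD(fit)   # all pairwise differences, with family-wise adjustment
```

TukeyHSD() reports an adjusted p-value and confidence interval for each pair, so the family-wise error rate stays at the nominal level.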
ANOVA returns reliable results if the following conditions hold:
there should be no significant outliers within the factor levels;
the response variable should be approximately normally distributed within each factor level; and
the response variable variances within the factor levels should be equal.
Here is an example where you might use an ANOVA test.
Example
An experiment was performed to test the effects of drugs A and B on the lymphocyte count in mice by comparing A, B and a placebo (inactive substance) C. Fifteen mice were randomly assigned into three groups of five mice each, and the groups were given A, B and placebo C, respectively. The lymphocyte counts (in hundreds per cubic mm of blood) are summarised below:
use drug
* Assess the skewness of the count variable by drug type, or use "sum, detail" as above
tabstat count, by(drug_type) stat(n mean sd min q max)
egen skew = skew(count), by(drug_type)
tabdisp drug_type, c(skew)
Data are presented as mean ± standard deviation. Lymphocyte count decreased from Drug A (n = 5, 68 ± 9.3) to Drug B (n = 5, 60 ± 6.4) to Placebo C (n = 5, 55 ± 3.5).
The one-way ANOVA indicates that mean lymphocyte count differed significantly across drug groups, F(2, 12) = 4.62, p = .032.
Drug_aov <- aov(`lymphocyte count` ~ Drug, data = DrugTest)
Drug_anova <- anova(Drug_aov)
Drug_anova %>%
  tidy() %>%
  flextable() %>%
  set_table_properties(width = 0.8, layout = "autofit") %>%
  colformat_num(j = c(3, 4, 5), digits = 1) %>%
  colformat_num(j = 6, digits = 4) %>%
  set_caption("Results of ANOVA for Drug vs lymphocyte count")
use drug
* fit one-way ANOVA and obtain additional information about count by drug type
oneway count drug_type, tabulate
| Summary of count
drug_type | Mean Std. dev. Freq.
------------+------------------------------------
1 | 68 9.3005376 5
2 | 60 6.363961 5
3 | 55 3.5355339 5
------------+------------------------------------
Total | 61 8.4006802 15
Analysis of variance
Source SS df MS F Prob > F
------------------------------------------------------------------------
Between groups 430 2 215 4.62 0.0325
Within groups 558 12 46.5
------------------------------------------------------------------------
Total 988 14 70.5714286
Bartlett's equal-variances test: chi2(2) = 2.9923 Prob>chi2 = 0.224
use drug
* if the equal-variance assumption is in doubt, fit robust one-way ANOVA alternatives
ssc install robnova
robnova count drug_type
Outcome variable was count and predictor variable was drug_type
Sum of Squares Model = 430.0000
Sum of Squares Residual = 558.0000
Sum of Squares Total = 988.0000
R-squared = 0.435223
-----------------------------------------------------
Test | F df1 df2 p
-----------------+-----------------------------------
Brown-Forsythe's | 4.6237 2 8.3892 0.0443
Fisher's | 4.6237 2 12.0000 0.0325
Welch's | 4.3961 2 6.9782 0.0581
-----------------------------------------------------
Total number of observations used was 15.
table_group |>
  summarise(
    SSW = (5 - 1) * sum(`$(n_i-1)S^2_{i}$`),
    MSW = SSW / (15 - 3),
    "$\\bar{X}_{grand}$" = sum(`$\\bar{X}_i$`),
    N = sum(n)
  ) |>
  knitr::kable()
| SSW | MSW  | \(\bar{X}_{grand}\) | N  |
|-----|------|---------------------|----|
| 558 | 46.5 | 183                 | 15 |
SSB AND MSB
(SSB <- 988 - 558)
[1] 430
(MSB <- SSB / (3 - 1))
[1] 215
(F_stat <- 215 / 46.5)
[1] 4.623656
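The hand-calculated sums of squares reproduce the F statistic, and pf() gives the matching p-value:

```r
# F statistic and p-value from the between/within sums of squares above
MSB <- 430 / (3 - 1)     # between-groups mean square = 215
MSW <- 558 / (15 - 3)    # within-groups mean square  = 46.5
F_stat <- MSB / MSW
p_val  <- pf(F_stat, df1 = 2, df2 = 12, lower.tail = FALSE)
round(F_stat, 6)   # 4.623656
round(p_val, 4)    # 0.0325, matching the Stata oneway output
```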
Checking Conditions
The ANOVA test applies when the dependent variable is continuous, the independent variable is categorical, and the observations are independent within groups. Independent means the observations should be from a random sample, or from an experiment using random assignment. Each group’s size should be less than 10% of its population size. The groups must also be independent of each other (non-paired, and non-repeated measures). Additionally, there are three conditions related to the data distribution.
No outliers. There should be no significant outliers in the groups. Outliers exert a large influence on the mean and standard deviation. Test with a box plot. If it fails this condition, you might be able to drop the outliers or transform the data. Otherwise you’ll need to switch to a non-parametric test.
Normality. Each group’s values should be nearly normally distributed (“nearly” because ANOVA is considered robust to the normality assumption). This condition is especially important with small sample sizes. Test with the Q-Q plots or the Shapiro-Wilk test for normality. If the data is very non-normal, you might be able to transform your response variable, or use a nonparametric test such as Kruskal-Wallis.
Equal Variances. The group variances should be roughly equal. This condition is especially important when sample sizes differ. The IQR of the box plot is a good way to visually assess this condition. A rule of thumb is that the largest sample standard deviation should be less than twice the smallest. More formal homogeneity-of-variance tests include the Bartlett test and the Levene test. If the variances are very different, use a Games-Howell post hoc test for multiple comparisons instead of the Tukey post hoc test.
Outliers
Assess outliers with a box plot. The whiskers extend no further than 1.5 × IQR from the upper and lower hinges; any observations beyond the whiskers are outliers and are plotted individually.
There were no outliers in the data, as assessed by inspection of a boxplot.
Our data passed this check, but suppose they had not. There are generally three reasons for outliers: data entry errors, measurement errors, and genuinely unusual values. If the problem is data entry, fix it. If it is a measurement error, discard it. If the value is genuine, you have some options.
Kruskal-Wallis H test. It is a non-parametric test. Be careful here, because it is not quite testing the same \(H_0\).
Transform the dependent variable. Don’t do this unless the data is also non-normal. It also has the downside of making interpretation more difficult.
Leave it in if it doesn’t affect the conclusion (compared to taking it out).
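Outliers can also be flagged programmatically with the same 1.5 × IQR rule the box plot uses. A base-R sketch on made-up data with one deliberately planted extreme value:

```r
# Flag observations beyond the whisker fences (1.5 * IQR from the hinges)
set.seed(7)
x <- c(rnorm(20, mean = 50, sd = 5), 90)  # 90 is a planted outlier
out <- boxplot.stats(x)$out               # values the box plot would flag
out
```

In grouped data, apply the same idea per group, e.g. by(dat$y, dat$group, function(v) boxplot.stats(v)$out).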
use drug
* Shapiro–Wilk W test by drug type
bysort drug_type: swilk count
-> drug_type = 1
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.80367 2.318 1.361 0.08678
-------------------------------------------------------------------------------
-> drug_type = 2
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.66397 3.967 2.663 0.00387
-------------------------------------------------------------------------------
-> drug_type = 3
Shapiro–Wilk W test for normal data
Variable | Obs W V z Prob>z
-------------+------------------------------------------------------
count | 5 0.91538 0.999 -0.001 0.50058
In drug groups 1 and 3, count was normally distributed, as assessed by Shapiro–Wilk’s test (p > .05) and inspection of Q-Q plots. Note, however, that the test flags group 2 as non-normal (p = .004), so a Q-Q plot for that group should also be inspected before relying on the F test.
Had the data not been normally distributed, you would have three options:
transform the dependent variable;
use a non-parametric test such as Kruskal-Wallis; or
carry on regardless.
Transformations will generally only work when the distribution of scores in all groups has the same shape. They also have the drawback of making the data less interpretable.
You can also choose to carry on regardless. ANOVA is considered “robust” to normality violations.
Equal Variances
The equality of sample variances condition is less critical when sample sizes are similar among the groups. One rule of thumb is that no group’s standard deviation should be more than double that of any other.
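Applying this rule of thumb to the group standard deviations reported in the summary table above (9.30, 6.36, 3.54):

```r
# Rule of thumb: largest group SD should be less than twice the smallest
sds <- c(A = 9.30, B = 6.36, C = 3.54)   # SDs from the drug example output
ratio <- max(sds) / min(sds)
round(ratio, 2)   # 2.63 -> the rule of thumb is violated
```

The ratio exceeds 2, so the informal rule raises a flag even though Bartlett's test above did not reject equal variances (p = 0.224); with only five mice per group, the formal tests have little power, which is why both checks are worth doing.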
use drug
* Levene's and Brown-Forsythe tests
graph box count, over(drug_type) ylabel(50(10)80, labsize(small)) title("lymphocyte count by drug type")
quietly graph export box.svg, replace
robvar count, by(drug_type)
Now that the conditions are checked, conduct a post-hoc test to see which groups differ. Here is the Tukey test. Placebo C and Drug A were significantly different.
use drug
* pairwise comparison using the Tukey test (requires equal sample sizes in all groups, which we have)
pwmean count, over(drug_type) mcompare(tukey) effects cimeans
The statistical tests for the model conditions (e.g., Bartlett’s test for homogeneity) are often too sensitive. ANOVA is robust to small violations of the conditions. However, heterogeneity is a common problem in ANOVA. Transforming the response variable can often remove the heterogeneity. Finding the correct transformation can be challenging, but the Box-Cox procedure can help. The MASS::boxcox() function calculates the profile log-likelihoods for a power transformation of the response variable \(Y^\lambda\).
| \(\lambda\) | \(Y^\lambda\) | Transformation      |
|-------------|---------------|---------------------|
| 2           | \(Y^2\)       | Square              |
| 1           | \(Y^1\)       | (no transformation) |
| .5          | \(Y^{.5}\)    | Square Root         |
| 0           | \(\ln(Y)\)    | Log                 |
| -.5         | \(Y^{-.5}\)   | Inverse Square Root |
| -1          | \(Y^{-1}\)    | Inverse             |
The Box-Cox procedure does not recommend any particular transformation of the data in this case.
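For reference, a minimal MASS::boxcox() sketch on made-up, right-skewed positive data (the variables here are hypothetical, not the drug example):

```r
# Box-Cox profile log-likelihood for a power transformation of Y
library(MASS)
set.seed(2)
y <- rexp(60, rate = 1 / 10)   # positive, right-skewed response
g <- gl(3, 20)                 # three arbitrary groups
bc <- boxcox(lm(y ~ g), plotit = FALSE)
(lambda_hat <- bc$x[which.max(bc$y)])  # lambda maximizing the log-likelihood
```

In practice, pick a nearby interpretable \(\lambda\) from the table above (e.g., 0 for a log transform) rather than the exact maximizer, and transform only if the confidence interval for \(\lambda\) excludes 1.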