A Guide to Statistical Practice in R and Stata for Public Health Researchers
Library setup
About me
I am an MSc research fellow at the Wits School of Public Health. I have a background in advanced mathematics and statistics.
About the presentation
I am compiling these notes while finishing the course Biostatistics for Health Researchers 1.
Example datasets used in this exercise
use "birthweight1.dta", clear
list in 1/5

  1. | studno | numfoll | lastweek | weekadj | mothage | sector  | nopregs  |
     |      1 |       7 |       18 |      19 |      30 |      D  |       6  |
     |----------------------------------------------------------------------|
     | ncalive | regmerg |       dob |    sex |     bwgp | delmerg | magegp |
     |       5 |       3 | 27 Apr 94 |   male | 2.5-2.99 |       3 |    30+ |
     |----------------------------------------------------------------------|
     | nca2 | month | bimonth | bweight | sectnum |  propinf | ss | nopreg2 |
     |   5+ |     4 | mar-apr |     2.9 |       D |  .137931 | AA |      6+ |
     |----------------------------------------------------------------------|
     | ss2 | bwgp2 |  sect2 | unborn2 | unborn | excl  | season  | episode  |
     |  AA | >=2.5 | inland |     <=1 |    <=1 |    0  |    dry  |       0  |
     |----------------------+-----------------------------------------------|
     |       infect2        |        bweight2        |        od_igg        |
     |             0        |            3.13        |         1.712        |
     +----------------------------------------------------------------------+
     +----------------------------------------------------------------------+
  2. | studno | numfoll | lastweek | weekadj | mothage | sector  | nopregs  |
     |      2 |       7 |       18 |      19 |      28 |      C  |       4  |
     |----------------------------------------------------------------------|
     | ncalive | regmerg |       dob |    sex |     bwgp | delmerg | magegp |
     |       3 |       3 | 02 May 94 | female |   3-3.49 |       3 |  20-29 |
     |----------------------------------------------------------------------|
     | nca2 | month | bimonth | bweight | sectnum |  propinf | ss | nopreg2 |
     |    3 |     5 | may-jun |     3.1 |       C | .1724138 | AA |     4/5 |
     |----------------------------------------------------------------------|
     | ss2 | bwgp2 |  sect2 | unborn2 | unborn | excl  | season  | episode  |
     |  AA | >=2.5 | inland |     <=1 |    <=1 |    0  |    wet  |       0  |
     |----------------------+-----------------------------------------------|
     |       infect2        |        bweight2        |        od_igg        |
     |             0        |            3.18        |         1.322        |
     +----------------------------------------------------------------------+
     +----------------------------------------------------------------------+
  3. | studno | numfoll | lastweek | weekadj | mothage | sector  | nopregs  |
     |      4 |       6 |       18 |      19 |      40 |      D  |      11  |
     |----------------------------------------------------------------------|
     | ncalive | regmerg |       dob |    sex |     bwgp | delmerg | magegp |
     |       6 |       3 | 30 Jun 94 |   male |   3-3.49 |       3 |    30+ |
     |----------------------------------------------------------------------|
     | nca2 | month | bimonth | bweight | sectnum |  propinf | ss | nopreg2 |
     |   5+ |     6 | may-jun |       3 |       D |        0 | AA |      6+ |
     |----------------------------------------------------------------------|
     | ss2 | bwgp2 |  sect2 | unborn2 | unborn | excl  | season  | episode  |
     |  AA | >=2.5 | inland |      3+ |      5 |    0  |    wet  |       0  |
     |----------------------+-----------------------------------------------|
     |       infect2        |        bweight2        |        od_igg        |
     |             0        |            3.23        |           .71        |
     +----------------------------------------------------------------------+
     +----------------------------------------------------------------------+
  4. | studno | numfoll | lastweek | weekadj | mothage | sector  | nopregs  |
     |      5 |       7 |       18 |      19 |      35 |      A  |       5  |
     |----------------------------------------------------------------------|
     | ncalive | regmerg |       dob |    sex |     bwgp | delmerg | magegp |
     |       4 |       3 | 09 Jun 94 | female |   3-3.49 |       3 |    30+ |
     |----------------------------------------------------------------------|
     | nca2 | month | bimonth | bweight | sectnum |  propinf | ss | nopreg2 |
     |    4 |     6 | may-jun |     3.2 |   A-sea | .2727273 | AA |     4/5 |
     |----------------------------------------------------------------------|
     | ss2 | bwgp2 |  sect2 | unborn2 | unborn | excl  | season  | episode  |
     |  AA | >=2.5 |    sea |     <=1 |    <=1 |    0  |    wet  |       1  |
     |----------------------+-----------------------------------------------|
     |       infect2        |        bweight2        |        od_igg        |
     |             1        |             3.1        |        1.8935        |
     +----------------------------------------------------------------------+
     +----------------------------------------------------------------------+
  5. | studno | numfoll | lastweek | weekadj | mothage | sector  | nopregs  |
     |      6 |       4 |        6 |       7 |      23 |      D  |       2  |
     |----------------------------------------------------------------------|
     | ncalive | regmerg |       dob |    sex |     bwgp | delmerg | magegp |
     |       1 |       3 | 07 May 94 | female |   2-2.49 |       3 |  20-29 |
     |----------------------------------------------------------------------|
     | nca2 | month | bimonth | bweight | sectnum |  propinf | ss | nopreg2 |
     |    1 |     5 | may-jun |     2.3 |       D |      .25 | AA |     2/3 |
     |----------------------------------------------------------------------|
     | ss2 | bwgp2 |  sect2 | unborn2 | unborn | excl  | season  | episode  |
     |  AA |  <2.5 | inland |     <=1 |    <=1 |    0  |    wet  |       1  |
     |----------------------+-----------------------------------------------|
     |       infect2        |        bweight2        |        od_igg        |
     |             1        |            2.54        |          1.81        |
     +----------------------------------------------------------------------+

High school and beyond data (first 5 rows):

| id | female | race | ses | schtyp | prog | read | write | math | science | socst | 
|---|---|---|---|---|---|---|---|---|---|---|
| 70 | 0 | 4 | 1 | 1 | 1 | 57 | 52 | 41 | 47 | 57 | 
| 121 | 1 | 4 | 2 | 1 | 3 | 68 | 59 | 53 | 63 | 61 | 
| 86 | 0 | 4 | 3 | 1 | 1 | 44 | 33 | 54 | 58 | 31 | 
| 141 | 0 | 4 | 3 | 1 | 3 | 63 | 44 | 47 | 53 | 56 | 
| 172 | 0 | 4 | 2 | 1 | 2 | 47 | 52 | 57 | 53 | 61 | 
Important terms
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features from a collection of information
- Inferential statistics is a branch of statistics that makes the use of various analytical tools to draw inferences about the population data from sample data. 
- A Nominal Scale is a measurement scale in which numbers serve as “tags” or “labels” only, to identify or classify an object. This measurement normally deals only with non-numeric (qualitative) variables, or with numbers that carry no quantitative value. 
- Snowball sampling is a non-probability sampling method where new units are recruited by other units to form part of the sample. 
- A critical value serves as a boundary within the sampling distribution of a test statistic. These values play a crucial role in both hypothesis tests and confidence intervals. 
- Non-probability sampling is a sampling method that uses non-random criteria 
- A one-tailed test is a statistical test in which the critical area of a distribution is one-sided so that it is either greater than or less than a certain value, but not both. 
- The confidence interval is an estimate of the amount of uncertainty associated with a sample, computed from the statistics of the observed data 
- Covariance is a measure of the joint variability of two random variables. 
- The interval scale is a quantitative measurement scale where there is order, the difference between the two variables is meaningful and equal, and the presence of zero is arbitrary. 
- Probability sampling refers to the process in which each element of the population has a known, non-zero chance of being selected into the sample; simple random sampling, for instance, is a method of probability sampling that gives every element an equal chance. 
- ANOVA is a method of assessing the differences between sample means. For instance, to test the difference in mean salary between people with degrees, diplomas, masters and PhDs, one would perform an ANOVA. 
- A significance test is a formal procedure for testing properties of population distributions. It can be used to test the difference between a single sample value and a fixed value, for instance when we know the population mean and wish to test whether a particular sample mean is significantly different from it. 
- A statistic is a numerical value computed from data in a sample, e.g. the sample variance. 
- A parameter, on the other hand, is a numerical value calculated from values in the whole population, e.g. the population variance. 
- The null hypothesis in a significance test is the assertion that no difference exists. 
- The alternative hypothesis is the assertion that a significant difference exists; it is accepted when the null hypothesis is rejected. 
- The chi-square statistic arises when we wish to compare a set of observed frequencies with a set of theoretical frequencies; it is also a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies. 
- One-way ANOVA compares the means of two or more groups in order to determine whether there is statistical evidence that the associated population means differ. 
- Test for independence - in tests of independence, two variables are involved, usually nominal variables, and the test answers whether the two variables are dependent on each other, that is, whether they are associated. For example, we may wish to test whether HIV status is associated with whether someone is poor. 
Probability
Principles
Here are three rules that come up all the time.
- If A and B are independent, then P(A ∩ B) = P(A) P(B). 
Bayes theorem - Application :Diagnostics
- D = “Disease is present” 
- P(D) is the probability of disease
- If your model predicts a patient as 1 (positive) and they belong to category 1 (positive) in reality, we call this a true positive.
- If your model predicts a patient as 0 (negative) and they belong to category 1 (positive) in reality, we call this a false negative.
- If your model predicts a patient as 1 (positive) and they belong to category 0 (negative) in reality, we call this a false positive.
- If your model predicts a patient as 0 (negative) and they belong to category 0 (negative) in reality, we call this a true negative.
🎓 Precision: TP / (TP + FP), the proportion of predicted positives that are truly positive.
🎓 Recall: TP / (TP + FN), the proportion of true positives that are detected; also called sensitivity.
🎓 Specificity: TN / (TN + FP), the proportion of true negatives that are correctly classified.
🎓 Accuracy: (TP + TN) / (TP + FN + FP + TN), the proportion of all predictions that are correct.
🎓 F Measure: A weighted average of the precision and recall, with best being 1 and worst being 0.
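The metrics above can be computed directly from confusion-matrix counts. Here is a minimal sketch in R; the counts are made up for illustration and do not come from the course data:

```r
# Hypothetical confusion-matrix counts (illustrative only)
tp <- 40; fn <- 10   # true positives, false negatives
fp <- 5;  tn <- 45   # false positives, true negatives

precision   <- tp / (tp + fp)                  # of predicted positives, how many are real
recall      <- tp / (tp + fn)                  # sensitivity: of real positives, how many found
specificity <- tn / (tn + fp)                  # of real negatives, how many found
accuracy    <- (tp + tn) / (tp + fn + fp + tn) # share of all predictions that are correct
f_measure   <- 2 * precision * recall / (precision + recall)

round(c(precision = precision, recall = recall, specificity = specificity,
        accuracy = accuracy, F = f_measure), 3)
```

Each metric answers a different question, which is why a single number (e.g. accuracy) is rarely enough on its own.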
Multiplication Rule. 
How many outcomes are possible from a sequence of 4 coin flips and 2 rolls of a die? Each coin flip has 2 outcomes and each die roll has 6, so by the multiplication rule there are 2⁴ × 6² = 16 × 36 = 576 possible outcomes. 
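The multiplication-rule count can be checked in one line of R:

```r
# 4 coin flips (2 outcomes each) and 2 die rolls (6 outcomes each)
n_outcomes <- 2^4 * 6^2
n_outcomes
# [1] 576
```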
Statistical inference
- Make inferences (an interpretation) about the true parameter value 
- Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.
Testing does not
- Confirm with 100% certainty that a hypothesis is true
- Confirm with 100% certainty that a hypothesis is false
- Tell you how to interpret the estimated value (economic vs. practical vs. statistical significance)
Hypothesis: translate an objective into a statement specifying a value (or set of values) in which our population parameter should/should not lie.

Null hypothesis (H0):

- Can be either a single value (e.g. H0: μ = μ0) or a set of values (e.g. H0: μ ≤ μ0)
- Will generally be the value you would not like the population parameter to be (subjective)

Alternative hypothesis (H1):

- Can be either a single value or a set of values (e.g. H1: μ ≠ μ0)
- “Test of significance” refers to the two-sided test: H1: μ ≠ μ0
Type I Error

Error made when H0 is rejected although H0 is true.

The probability of committing a Type I error is α, the significance level of the test.

Legal analogy: In U.S. law, a defendant is presumed to be “innocent until proven guilty”.

If the null hypothesis is that a person is innocent, the Type I error is the probability that you conclude the person is guilty when he is innocent.

Type II Error

Error made when we fail to reject H0 although H0 is false.

The probability of committing a Type II error is β (the Type II error level).

In the legal analogy, this is the probability that you fail to find the person guilty when he or she is guilty.
Random sample of size n: a collection of n independent random variables taken from the distribution of X, each with the same distribution as X.

Sample mean: x̄ = (1/n) Σ xᵢ

Sample median:

- If n is odd, the middle value of the ordered sample
- If n is even, the average of the two middle values of the ordered sample

Sample variance: s² = Σ (xᵢ − x̄)² / (n − 1)

Sample standard deviation: s = √s²

Sample proportion: p̂ = x/n, where x is the number of “successes” in the sample
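These summary statistics can be checked by hand in R. As a sketch, using the eight patient values that appear later in the one-sample t-test example:

```r
x <- c(51, 41, 62, 33, 28, 43, 37, 44)  # a sample of size n = 8

mean(x)        # sample mean: sum(x) / n
median(x)      # n is even, so the average of the two middle ordered values
var(x)         # sample variance, with the n - 1 denominator
sd(x)          # square root of the sample variance
mean(x > 40)   # a sample proportion: the share of values above 40
```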
Estimators

Point estimator: a statistic θ̂ (a function of the sample) used to estimate an unknown population parameter θ.

Point estimate: the numerical value assumed by θ̂ when it is evaluated on the observed sample.

Unbiased estimator: θ̂ is an unbiased estimator of θ if E(θ̂) = θ.

- The sample mean x̄ is an unbiased estimator of μ.
- The sample variance s² (with the n − 1 denominator) is an unbiased estimator of σ².
Distribution of the sample mean

If X₁, …, Xₙ are i.i.d. with mean μ and variance σ², then E(x̄) = μ and Var(x̄) = σ²/n; if in addition X is normally distributed, x̄ ~ N(μ, σ²/n).
One Sample Inference
i.i.d. stands for “independent and identically distributed”
Hence, we have the following model: X₁, …, Xₙ i.i.d. with mean μ and variance σ².

The Mean

When σ is known, the standard error of x̄ is σ/√n.

Then, a (1 − α)100% confidence interval for μ can be built around x̄.

And the interval is

x̄ ± z₍α/2₎ σ/√n  (σ known),  and  x̄ ± t₍α/2, n−1₎ s/√n  (σ unknown).

If the experiment were repeated many times, (1 − α)100% of the intervals constructed this way would contain the true mean μ.
| | Confidence Interval | Sample Size for Margin of Error E | Hypothesis Testing Test Statistic |
|---|---|---|---|
| When σ known | x̄ ± z₍α/2₎ σ/√n | n = (z₍α/2₎ σ / E)² | z = (x̄ − μ0)/(σ/√n) |
| When σ unknown | x̄ ± t₍α/2, n−1₎ s/√n | | t = (x̄ − μ0)/(s/√n) |
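The σ-unknown interval in the table can be computed by hand and checked against `t.test()`. A sketch, using a small made-up sample:

```r
x <- c(51, 41, 62, 33, 28, 43, 37, 44)  # illustrative sample
n <- length(x)
alpha <- 0.05

# Manual t-based 95% interval: xbar +/- t_{alpha/2, n-1} * s / sqrt(n)
xbar <- mean(x)
se   <- sd(x) / sqrt(n)
ci   <- xbar + c(-1, 1) * qt(1 - alpha / 2, df = n - 1) * se

ci
t.test(x)$conf.int   # should match the manual interval
```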
Example of one sample t.test
| patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 
|---|---|---|---|---|---|---|---|---|
| Days | 51 | 41 | 62 | 33 | 28 | 43 | 37 | 44 | 
Hypothesis Testing
Since it is a two-tailed test, we divide the significance level α by 2.
Conclusion
Since the calculated |t| exceeds the critical value, we reject H0; otherwise we fail to reject it.
The single (one-sample) t-test tests whether the mean is equal to, greater than, or less than a specified value. The value is usually based on theoretical considerations or previous research.
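A sketch of the one-sample t-test on the eight patient values above. The null value actually used in the original worked example is not shown in the notes, so μ0 = 45 days here is a hypothetical value chosen purely for illustration:

```r
days <- c(51, 41, 62, 33, 28, 43, 37, 44)

# mu = 45 is a hypothetical null value, for illustration only
t.test(days, mu = 45, alternative = "two.sided")

# The same t statistic by hand: (xbar - mu0) / (s / sqrt(n))
(mean(days) - 45) / (sd(days) / sqrt(length(days)))  # about -0.70
```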
use "birthweight1.dta", clear
ttest bweight == 3.25
One-sample t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
 bweight |     141    3.007518    .0353594    .4198693     2.93761    3.077425
------------------------------------------------------------------------------
    mean = mean(bweight)                                          t =  -6.8577
H0: mean = 3.25                                  Degrees of freedom =      140
   Ha: mean < 3.25             Ha: mean != 3.25               Ha: mean > 3.25
 Pr(T < t) = 0.0000         Pr(|T| > |t|) = 0.0000          Pr(T > t) = 1.0000

- mean = 50
t.test(school$write,mu=50,alternative = "two.sided" )
    One Sample t-test
data:  school$write
t = 4.1403, df = 199, p-value = 0.00005121
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
 51.45332 54.09668
sample estimates:
mean of x 
   52.775 

- mean < 50
t.test(school$write,mu=50,alternative = "less" )
    One Sample t-test
data:  school$write
t = 4.1403, df = 199, p-value = 1
alternative hypothesis: true mean is less than 50
95 percent confidence interval:
    -Inf 53.8826
sample estimates:
mean of x 
   52.775 

- mean > 50
t.test(school$write,mu=50,alternative = "greater" )
    One Sample t-test
data:  school$write
t = 4.1403, df = 199, p-value = 0.0000256
alternative hypothesis: true mean is greater than 50
95 percent confidence interval:
 51.6674     Inf
sample estimates:
mean of x 
   52.775 

Hypothesis Testing
(52.775-50)/(9.478586/sqrt(200))
[1] 4.140325

Since it is a two-tailed test, we divide the significance level by 2.
Conclusion
- We conclude that the mean is statistically significantly different from 50.
For Difference of Means (μ1 − μ2)

| | Hypothesis Testing Test Statistic |
|---|---|
| When σ1, σ2 known | z = (x̄1 − x̄2) / √(σ1²/n1 + σ2²/n2) |
| When σ1 = σ2, unknown | t = (x̄1 − x̄2) / (sp √(1/n1 + 1/n2)); Pooled Variance: sp² = [(n1 − 1)s1² + (n2 − 1)s2²]/(n1 + n2 − 2); Degrees of Freedom: n1 + n2 − 2 |
| When σ1 ≠ σ2, unknown | t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2); Degrees of Freedom: Welch–Satterthwaite approximation |
Two-sample t-test
When you have a single explanatory variable which is qualitative and has only two levels, you can run a Student’s t-test to test for a difference in the mean of the two levels. If appropriate for your data, you can choose to test a unilateral hypothesis. This means that you can test the more specific assumption that one level has a higher mean than the other, rather than that they simply have different means. Note that the robustness of this test increases with sample size and is higher when groups have equal sizes.
For the t-test, the t statistic used to find the p-value is calculated as: 

t = (x̄1 − x̄2) / √(s1²/n1 + s2²/n2)

where s1², s2² are the sample variances and n1, n2 the sample sizes of the two groups.
NB: this applies if the variances are not equal (Welch’s t-test); for equal variances we use the pooled-variance formula in the table above.
Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.
Assumptions
If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.
- Normality of data
 As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed, but have reasonably symmetrical distributions, a mean which is close to the centre of the distribution, and only one mode (highest point in the frequency histogram), then a t-test will still work as long as the sample is sufficiently large (rule of thumb ~30 observations). If the data are heavily skewed, then we may need a very large sample before a t-test works. In such cases, an alternative non-parametric test should be used.
- Homoscedasticity
 Another important assumption of the two-sample t-test is that the variance of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If population variances are unequal, then the probability of a Type I error is greater than α.
 The robustness of the t-test increases with sample size and is higher when groups have equal sizes.
 We can test for a difference in variances between two populations by asking: what is the probability of taking two samples from two populations having identical variances and having the two sample variances be as different as ours are?
 To do so, we must do the variance ratio test (i.e. an F-test).
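The variance-ratio test is available in R as `var.test()`. A minimal sketch on two made-up samples (the data are simulated, not from the course datasets):

```r
set.seed(1)  # reproducible simulated data, for illustration only
g1 <- rnorm(30, mean = 10, sd = 2)
g2 <- rnorm(30, mean = 10, sd = 2)

# H0: the two population variances are equal (ratio of variances = 1)
var.test(g1, g2)

# The F statistic is simply the ratio of the two sample variances
var(g1) / var(g2)
```

A large p-value here is consistent with equal variances, supporting use of the pooled-variance t-test.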
Violation of assumptions
If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, you can transform your data (log or square root for example) or use the non-parametric equivalent of t-test, the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.
use "high_school.dta", clear
ttest write == read
(highschool and beyond (200 cases))
Paired t test
------------------------------------------------------------------------------
Variable |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
   write |     200      52.775    .6702372    9.478586    51.45332    54.09668
    read |     200       52.23    .7249921    10.25294    50.80035    53.65965
---------+--------------------------------------------------------------------
    diff |     200        .545    .6283822    8.886666   -.6941424    1.784142
------------------------------------------------------------------------------
     mean(diff) = mean(write - read)                              t =   0.8673
 H0: mean(diff) = 0                              Degrees of freedom =      199
 Ha: mean(diff) < 0           Ha: mean(diff) != 0           Ha: mean(diff) > 0
 Pr(T < t) = 0.8066         Pr(|T| > |t|) = 0.3868          Pr(T > t) = 0.1934

use "wintfda.dta", clear
ttest rhknow, by(ephsex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. err.   Std. dev.   [95% conf. interval]
---------+--------------------------------------------------------------------
       0 |      82    7.829268    .2578516    2.334946    7.316224    8.342312
       1 |     121    8.917355    .2159567    2.375524    8.489776    9.344935
---------+--------------------------------------------------------------------
Combined |     203    8.477833    .1693948    2.413504    8.143824    8.811841
---------+--------------------------------------------------------------------
    diff |           -1.088087    .3374608               -1.753505   -.4226695
------------------------------------------------------------------------------
    diff = mean(0) - mean(1)                                      t =  -3.2243
H0: diff = 0                                     Degrees of freedom =      201
    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 0.0007         Pr(|T| > |t|) = 0.0015          Pr(T > t) = 0.9993

t.test(school$write, school$read, alternative = "two.sided", var.equal = FALSE)
    Welch Two Sample t-test
data:  school$write and school$read
t = 0.55199, df = 395.57, p-value = 0.5813
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.396081  2.486081
sample estimates:
mean of x mean of y 
   52.775    52.230 

Compare reading marks by gender
use "high_school.dta", clear
ttest read, by(female)

t.test(read~female, data=school, alternative="two.sided", var.equal = TRUE)
    Two Sample t-test
data:  read by female
t = 0.74801, df = 198, p-value = 0.4553
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
 -1.783998  3.964459
sample estimates:
mean in group 0 mean in group 1 
       52.82418        51.73394 

Define the hypothesis
- the means are given in the output tables 
Define significance level
- determine at 5% level
((91 - 1)*10.50671^2 + (109 - 1)*10.05783^2)/(109 + 91 - 2)
[1] 105.3559
(52.82418 - 51.73394)/sqrt(105.3559*(1/91 + 1/109))
[1] 0.7480173

Difference of Two Proportions

Mean: p̂1 − p̂2 has mean p1 − p2

Variance: p1(1 − p1)/n1 + p2(1 − p2)/n2

Sample Sizes, Confidence:

- (Prior estimates for p1 and p2): n = z₍α/2₎² [p̂1(1 − p̂1) + p̂2(1 − p̂2)] / E²
- (No prior estimates for p1 and p2): n = z₍α/2₎² / (2E²), the worst case with p1 = p2 = 0.5

Hypothesis Testing - Test Statistics:

- Null value p1 − p2 = 0: z = (p̂1 − p̂2) / √( p̄(1 − p̄)(1/n1 + 1/n2) )
- Null value p1 − p2 = d0 ≠ 0: z = (p̂1 − p̂2 − d0) / √( p̂1(1 − p̂1)/n1 + p̂2(1 − p̂2)/n2 )

where p̄ = (x1 + x2)/(n1 + n2) is the pooled sample proportion.
(21.9 - 16.7)/sqrt(20.38889*(2/10))
[1] 2.575085

Single Variance
A (1 − α)100% confidence interval for σ² is ( (n − 1)s² / χ²₍α/2, n−1₎ , (n − 1)s² / χ²₍1−α/2, n−1₎ ).

Equivalently, taking square roots of the endpoints gives a confidence interval for σ.
| Confidence Interval | Sample Size (prior estimate for p) | Sample Size (no prior estimate for p) | Hypothesis Testing Test Statistic | 
|---|---|---|---|
| p̂ ± z₍α/2₎ √(p̂(1 − p̂)/n) | n = z₍α/2₎² p̂(1 − p̂) / E² | n = z₍α/2₎² (0.25) / E² | z = (p̂ − p0) / √(p0(1 − p0)/n) |
Power

Power is the probability of rejecting H0 when H0 is false, i.e. 1 − β. Formally, power (for the two-sided test of the mean) at a true mean μ1 is given by:

Power = P( |x̄ − μ0| / (σ/√n) > z₍α/2₎ | μ = μ1 )
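In R, power calculations for t-tests are available via `power.t.test()`. A sketch with made-up, illustrative numbers (a shift of half a standard deviation, n = 30, α = 0.05):

```r
# Power to detect a shift of 0.5 sd with n = 30 (illustrative values)
power.t.test(n = 30, delta = 0.5, sd = 1, sig.level = 0.05,
             type = "one.sample", alternative = "two.sided")

# Or solve for the sample size needed to reach 80% power for the same shift
power.t.test(delta = 0.5, sd = 1, sig.level = 0.05, power = 0.80,
             type = "one.sample")$n
```

Leaving exactly one of `n`, `delta`, `sd`, `sig.level`, `power` unspecified tells the function which quantity to solve for.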
Chi-square Test for Associations
Contingency tables
When we work with categorical data in research, we analyse response data when there are either no predictor variables, or all of the predictor variables are also categorical.
Pearson Chi-square Test for independence
- The chi-square independence test tests whether observed joint frequency counts differ from the counts expected under independence; it determines whether an association exists, that is, whether the two variables are dependent on each other.

- These tests do not measure the strength of an association.

- These tests depend on and reflect the sample size: double the sample size by copying each observation, and the chi-square statistic doubles.

- The [Pearson Chi-square Test] is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (large-sample p-values not appropriate).

- When the sample size is small, exact p-values can be calculated (this is prohibitive for large samples); calculation of the exact p-values assumes that the column totals and row totals are fixed.

The test statistic is χ² = Σ (O − E)²/E, where O are the observed cell frequencies and E = (row total × column total)/grand total are the expected cell frequencies.
Question (Taken from Zimbabwe Open University)
A ZOU Regional Coordinator observed that on one weekend school, 150 students who turned up were categorized by gender and programme as shown in
| Gender | Physcology | Development studies | R & A Management | Total | 
|---|---|---|---|---|
| Male | 46 | 29 | 28 | 103 | 
| Female | 27 | 14 | 6 | 47 | 
| Total | 73 | 43 | 34 | 150 | 
record the data in R
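The code that created `dat_o` is not shown in the notes; here is a minimal sketch of how the observed table could be entered. The dimension names mirror the printed output that appears later, including the original spelling “Physcology”:

```r
# Sketch of entering the 2x3 observed table (construction code not shown in the notes)
dat_o <- matrix(c(46, 29, 28,
                  27, 14,  6),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("Male", "Female"),
                                study  = c("Physcology", "Development studies",
                                           "R & A Management")))
dat_o
```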
view the observed values in a neat table
dat_o %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("row", "col")) %>%
  flextable::flextable() %>%
  flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
  flextable::autofit()

The contingency table for the data is given above.
Look at the Observed proportions
prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
  janitor::adorn_totals(where = c("col")) %>%
  flextable::flextable() %>%
  flextable::colformat_num(j = c(2, 3, 4), digits = 2) %>%
  flextable::autofit()

Calculating expected frequencies
Doing this in R yields
Define the hypothesis

- H0: gender and programme of study are independent (no association)
- H1: gender and programme of study are not independent

- The test statistic is calculated as χ² = Σ (O − E)²/E. 

We reject the null hypothesis at the 5% level of significance if χ² > χ²₍0.05, 2₎ = 5.991.

Calculating χ² in R, we can do:
vitc_e <- sum(dat_o) * prop.table(dat_o, 1) * prop.table(dat_o, 2)
X2 <- sum((dat_o - vitc_e)^2 / vitc_e)
print(X2)
[1] 4.074251

The degrees of freedom are (r − 1)(c − 1) = (2 − 1)(3 − 1) = 2.
- Conclusion

Since 4.074 < 5.991, the test statistic does not exceed the critical value, so we fail to reject the null hypothesis at the 5% level.
In R we can do this manually:

vitc_dof <- (2 - 1) * (3 - 1)  # degrees of freedom; not defined earlier in the notes
pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
[1] 0.130403

Conclusion
Since the p-value is greater than 0.05, we conclude that there is no evidence of an association between the two variables.
The deviance statistic is G² = 2 Σ O ln(O/E):

G2 <- 2 * sum(dat_o * log(dat_o / vitc_e))

The associated p-value for the deviance statistic is

pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
[1] 0.1124065

Other than doing the manual labor,
- R has a built-in function called chisq.test() and it takes in a table object
recall that
dat_o
        study
Gender   Physcology Development studies R & A Management
  Male           46                  29               28
  Female         27                  14                6

vitc_chisq_test <- chisq.test(dat_o, correct = FALSE)
print(vitc_chisq_test)
    Pearson's Chi-squared test
data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

The Yates correction yields more conservative p-values.
chisq.test(dat_o)
    Pearson's Chi-squared test
data:  dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304

These p-values are not evidence against the independence model.
Here is the chi-square test applied to the above data. Recall this data set is 2x3, so the degrees of freedom are (2 − 1)(3 − 1) = 2, and correct = TRUE has no effect in chisq.test() because the Yates correction applies only to 2x2 tables.
Expected frequencies from the output
vitc_chisq_test$expected
        study
Gender   Physcology Development studies R & A Management
  Male     50.12667            29.52667         23.34667
  Female   22.87333            13.47333         10.65333

Head injury vs wearing a helmet
Testing the Hypothesis
Define the hypothesis

- H0: the two variables are independent (no association)
- H1: the two variables are not independent

Decide the level of significance

- at the 5% level of significance

Calculate the test statistic

Calculating expected frequencies:

(235*147)/(793)
[1] 43.56242
(235*646)/(793)
[1] 191.4376
(558*147)/(793)
[1] 103.4376
(558*646)/(793)
[1] 454.5624

- The test statistic is calculated as χ² = Σ (O − E)²/E. 

We reject the null hypothesis at the 5% level of significance if χ² > χ²₍0.05, 1₎ = 3.841.

Calculating χ²:
(((17-43.6)^2)/(43.6)) + (((218-191.4)^2)/(191.4)) + (((130-103.4)^2)/(103.4)) + (((428-454.6)^2)/(454.6))
[1] 28.32459

Chi-Squared distribution
A chi-squared random variable X² ~ χ²(k) with k degrees of freedom has mean k and variance 2k.

For our example, k = (2 − 1)(2 − 1) = 1:
options(scipen = 999)
dof <- 1
x2 <- 28.32459
(p_value <- pchisq(q = x2, df = dof, lower.tail = FALSE))
[1] 0.0000001025845

Or simulate this by taking the mean of 10,000 random trials.
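A minimal sketch of that simulation: draw 10,000 chi-squared(1) variates and take the share that exceed the observed statistic (the seed is an arbitrary choice for reproducibility):

```r
set.seed(42)  # arbitrary seed, for reproducibility of the illustration
x2 <- 28.32459

# Share of 10,000 chi-squared(1) draws exceeding the observed statistic;
# since the true tail probability is about 1e-7, essentially none will
mean(rchisq(10000, df = 1) > x2)
```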
data.frame(x = 250:300 / 10) %>%
  mutate(density = pchisq(x, df = dof, lower.tail = FALSE),
         cdf = if_else(x > x2, density, as.numeric(NA))) %>%
  ggplot() +
  geom_line(aes(x = x, y = density)) +
  geom_area(aes(x = x, y = cdf), alpha = 0.3) +
  theme_minimal() +
  labs(title = "P(X^2 > 28.3) when X ~ ChiSq(1)")

How large would the test statistic need to be to reach significance at the 5% level?
qchisq(p = .05, df = 1, lower.tail = FALSE)
[1] 3.841459

Is the type of therapy associated with improvement?
| variable | Improvement | no improvement | Total | 
|---|---|---|---|
| therapy x | 28 | 16 | 44 | 
| therapy y | 37 | 9 | 46 | 
| Total | 65 | 25 | 90 | 
Calculating expected frequencies

- E11 = 44 × 65/90 = 31.78, E12 = 44 × 25/90 = 12.22
- E21 = 46 × 65/90 = 33.22, E22 = 46 × 25/90 = 12.78

The test statistic is calculated as χ² = Σ (O − E)²/E. 
We reject the null hypothesis at the 5% level of significance if χ² > χ²₍0.05, 1₎ = 3.841.
summary table
| Observed (O) | Expected (E) | (O−E)² | [(O−E)²]/E | 
|---|---|---|---|
| 28 | 31.77778 | 14.27177 | 0.4491114 | 
| 16 | 12.22222 | 14.27177 | 1.167693 | 
| 37 | 33.22222 | 14.27328 | 0.4296335 | 
| 9 | 12.77778 | 14.27162 | 1.116909 | 
- conclusion
Since the calculated value (χ² ≈ 3.16) is less than the critical value (3.841), we fail to reject the null hypothesis and conclude that the two variables are independent.
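The hand calculation can be checked against `chisq.test()`; `correct = FALSE` reproduces the uncorrected statistic computed above (by default, R applies the Yates continuity correction to 2x2 tables):

```r
# The therapy-by-outcome table from the example
therapy <- matrix(c(28, 16,
                    37,  9),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(therapy = c("x", "y"),
                                  outcome = c("improvement", "no improvement")))

# correct = FALSE matches the manual chi-square of about 3.16 with df = 1
chisq.test(therapy, correct = FALSE)
```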
Examples in R and Stata
use "high_school.dta", clear
list id female race ses schtyp prog in 1/5
use "high_school.dta", clear
tab ses prog, chi expected exact cchi2

Observed frequencies
school |> 
  janitor::tabyl(ses,prog) |> 
  janitor::adorn_totals(where = c("row", "col")) %>%
  flextable::flextable() %>%
  flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
  flextable::autofit()

tab <- table(school$ses, school$prog)  # tab was not defined earlier in the notes
vitc_e <- sum(tab) * prop.table(tab, 1) * prop.table(tab, 2)
X2 <- sum((tab - vitc_e)^2 / vitc_e)
print(X2)
[1] 16.60444

Shapiro-Wilk Test
The Shapiro-Wilk test is a test of whether a random variable is normally distributed. It uses the test statistic W = (Σ aᵢ x₍ᵢ₎)² / Σ (xᵢ − x̄)², where x₍ᵢ₎ are the ordered sample values and the aᵢ are constants derived from the expected order statistics of the normal distribution.
The test uses H0: the sample comes from a normal distribution; small p-values are evidence against normality.
shapiro.test(rbinom(100, 10, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 10, 0.3)
## W = 0.93847, p-value = 0.0001557
shapiro.test(rbinom(100, 30, .3))
## 
##  Shapiro-Wilk normality test
## 
## data:  rbinom(100, 30, 0.3)
## W = 0.97063, p-value = 0.02469