A Guide to Statistical Practice in R and Stata for Public Health Researchers
Library setup
About me
I am an MSc research fellow at the Wits School of Public Health, with a background in advanced mathematics and statistics.
About the presentation
I am compiling these notes while finishing my Biostatistics for Health Researchers 1 module.
Example datasets used in this exercise
use "birthweight1.dta", clear
list in 1/5
1. | studno | numfoll | lastweek | weekadj | mothage | sector | nopregs |
| 1 | 7 | 18 | 19 | 30 | D | 6 |
|----------------------------------------------------------------------|
| ncalive | regmerg | dob | sex | bwgp | delmerg | magegp |
| 5 | 3 | 27 Apr 94 | male | 2.5-2.99 | 3 | 30+ |
|----------------------------------------------------------------------|
| nca2 | month | bimonth | bweight | sectnum | propinf | ss | nopreg2 |
| 5+ | 4 | mar-apr | 2.9 | D | .137931 | AA | 6+ |
|----------------------------------------------------------------------|
| ss2 | bwgp2 | sect2 | unborn2 | unborn | excl | season | episode |
| AA | >=2.5 | inland | <=1 | <=1 | 0 | dry | 0 |
|----------------------+-----------------------------------------------|
| infect2 | bweight2 | od_igg |
| 0 | 3.13 | 1.712 |
+----------------------------------------------------------------------+
+----------------------------------------------------------------------+
2. | studno | numfoll | lastweek | weekadj | mothage | sector | nopregs |
| 2 | 7 | 18 | 19 | 28 | C | 4 |
|----------------------------------------------------------------------|
| ncalive | regmerg | dob | sex | bwgp | delmerg | magegp |
| 3 | 3 | 02 May 94 | female | 3-3.49 | 3 | 20-29 |
|----------------------------------------------------------------------|
| nca2 | month | bimonth | bweight | sectnum | propinf | ss | nopreg2 |
| 3 | 5 | may-jun | 3.1 | C | .1724138 | AA | 4/5 |
|----------------------------------------------------------------------|
| ss2 | bwgp2 | sect2 | unborn2 | unborn | excl | season | episode |
| AA | >=2.5 | inland | <=1 | <=1 | 0 | wet | 0 |
|----------------------+-----------------------------------------------|
| infect2 | bweight2 | od_igg |
| 0 | 3.18 | 1.322 |
+----------------------------------------------------------------------+
+----------------------------------------------------------------------+
3. | studno | numfoll | lastweek | weekadj | mothage | sector | nopregs |
| 4 | 6 | 18 | 19 | 40 | D | 11 |
|----------------------------------------------------------------------|
| ncalive | regmerg | dob | sex | bwgp | delmerg | magegp |
| 6 | 3 | 30 Jun 94 | male | 3-3.49 | 3 | 30+ |
|----------------------------------------------------------------------|
| nca2 | month | bimonth | bweight | sectnum | propinf | ss | nopreg2 |
| 5+ | 6 | may-jun | 3 | D | 0 | AA | 6+ |
|----------------------------------------------------------------------|
| ss2 | bwgp2 | sect2 | unborn2 | unborn | excl | season | episode |
| AA | >=2.5 | inland | 3+ | 5 | 0 | wet | 0 |
|----------------------+-----------------------------------------------|
| infect2 | bweight2 | od_igg |
| 0 | 3.23 | .71 |
+----------------------------------------------------------------------+
+----------------------------------------------------------------------+
4. | studno | numfoll | lastweek | weekadj | mothage | sector | nopregs |
| 5 | 7 | 18 | 19 | 35 | A | 5 |
|----------------------------------------------------------------------|
| ncalive | regmerg | dob | sex | bwgp | delmerg | magegp |
| 4 | 3 | 09 Jun 94 | female | 3-3.49 | 3 | 30+ |
|----------------------------------------------------------------------|
| nca2 | month | bimonth | bweight | sectnum | propinf | ss | nopreg2 |
| 4 | 6 | may-jun | 3.2 | A-sea | .2727273 | AA | 4/5 |
|----------------------------------------------------------------------|
| ss2 | bwgp2 | sect2 | unborn2 | unborn | excl | season | episode |
| AA | >=2.5 | sea | <=1 | <=1 | 0 | wet | 1 |
|----------------------+-----------------------------------------------|
| infect2 | bweight2 | od_igg |
| 1 | 3.1 | 1.8935 |
+----------------------------------------------------------------------+
+----------------------------------------------------------------------+
5. | studno | numfoll | lastweek | weekadj | mothage | sector | nopregs |
| 6 | 4 | 6 | 7 | 23 | D | 2 |
|----------------------------------------------------------------------|
| ncalive | regmerg | dob | sex | bwgp | delmerg | magegp |
| 1 | 3 | 07 May 94 | female | 2-2.49 | 3 | 20-29 |
|----------------------------------------------------------------------|
| nca2 | month | bimonth | bweight | sectnum | propinf | ss | nopreg2 |
| 1 | 5 | may-jun | 2.3 | D | .25 | AA | 2/3 |
|----------------------------------------------------------------------|
| ss2 | bwgp2 | sect2 | unborn2 | unborn | excl | season | episode |
| AA | <2.5 | inland | <=1 | <=1 | 0 | wet | 1 |
|----------------------+-----------------------------------------------|
| infect2 | bweight2 | od_igg |
| 1 | 2.54 | 1.81 |
+----------------------------------------------------------------------+
id | female | race | ses | schtyp | prog | read | write | math | science | socst |
---|---|---|---|---|---|---|---|---|---|---|
70 | 0 | 4 | 1 | 1 | 1 | 57 | 52 | 41 | 47 | 57 |
121 | 1 | 4 | 2 | 1 | 3 | 68 | 59 | 53 | 63 | 61 |
86 | 0 | 4 | 3 | 1 | 1 | 44 | 33 | 54 | 58 | 31 |
141 | 0 | 4 | 3 | 1 | 3 | 63 | 44 | 47 | 53 | 56 |
172 | 0 | 4 | 2 | 1 | 2 | 47 | 52 | 57 | 53 | 61 |
Important terms
A descriptive statistic is a summary statistic that quantitatively describes or summarizes features of a collection of information.
Inferential statistics is a branch of statistics that uses various analytical tools to draw inferences about a population from sample data.
A nominal scale is a measurement scale in which numbers serve only as "tags" or "labels" to identify or classify an object. It normally deals with non-numeric (qualitative) variables, or with numbers that carry no quantitative value.
Snowball sampling is a non-probability sampling method where new units are recruited by other units to form part of the sample.
A critical value serves as a boundary within the sampling distribution of a test statistic. These values play a crucial role in both hypothesis tests and confidence intervals.
Non-probability sampling is a sampling method that uses non-random criteria
A one-tailed test is a statistical test in which the critical area of a distribution is one-sided so that it is either greater than or less than a certain value, but not both.
The confidence interval is an estimate of the amount of uncertainty associated with a sample, computed from the statistics of the observed data
Covariance is a measure of the joint variability of two random variables.
The interval scale is a quantitative measurement scale where there is order, the difference between the two variables is meaningful and equal, and the presence of zero is arbitrary.
Probability sampling refers to the process in which every element of the population has an equal chance of being included in the sample; for instance, simple random sampling is a method of probability sampling.
ANOVA is a method of assessing the differences between sample means; for instance, to test the difference in mean salary between people with degrees, diplomas, master's degrees and PhDs, one would perform an ANOVA.
A significance test is a formal procedure for testing properties of population distributions. It can be used to test the difference between a single sample value and a fixed value; for instance, if we know the population mean, we may wish to test whether a particular sample mean is significantly different from it.
A statistic is a numerical value computed based on data from a sample e.g sample variance
A parameter, on the other hand, is a numerical value calculated from the values in the whole population, e.g. the population variance.
The null hypothesis in a significance test is the assertion that no difference exists.
The alternative hypothesis is the assertion that a significant difference exists; it is accepted when the null hypothesis is rejected.
Chi-square arises in statistics when we wish to compare a set of observed frequencies with a set of theoretical frequencies; it is also a descriptive measure of the magnitude of the discrepancies between the observed and expected frequencies.
One-way ANOVA compares the means of two or more groups in order to determine whether there is statistical evidence that the associated population means are significantly different.
Test for independence: in tests of independence, two variables are involved, usually nominal variables, and the test answers whether the two variables are dependent on (i.e. associated with) each other. For example, we may wish to test whether HIV status is associated with whether someone is poor or not in a given dataset.
Probability
Principles
Here are three rules that come up all the time.
- Complement rule: $P(A^c) = 1 - P(A)$.
- Addition rule: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$. This rule generalizes to more than two events (inclusion-exclusion).
- If A and B are independent, $P(A \cap B) = P(A)\,P(B)$, and $P(A \mid B) = P(A)$.
Bayes theorem - Application :Diagnostics
$D$ = "disease is present"; $\bar{D}$ = "disease is not present"; $T^+$ = "positive test result"; $T^-$ = "negative test result".
$P(T^+ \mid D)$ = sensitivity (true positive rate); $P(T^+ \mid \bar{D})$ = false positive rate; $P(T^- \mid D)$ = false negative rate; $P(T^- \mid \bar{D})$ = specificity (true negative rate).
$P(D)$ is the probability (prevalence) of disease. Bayes' theorem then gives the positive predictive value: $P(D \mid T^+) = \dfrac{P(T^+ \mid D)\,P(D)}{P(T^+ \mid D)\,P(D) + P(T^+ \mid \bar{D})\,P(\bar{D})}$.
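As a numerical sketch of Bayes' theorem for diagnostics (the sensitivity, specificity, and prevalence values below are illustrative assumptions, not figures from these notes; plain Python is used only so the arithmetic runs without extra packages):

```python
# Positive predictive value P(D | T+) via Bayes' theorem.
# All three inputs are assumed, illustrative values.
sens = 0.90   # P(T+ | D): sensitivity
spec = 0.95   # P(T- | not D): specificity
prev = 0.02   # P(D): prevalence of disease

# P(D | T+) = P(T+|D) P(D) / [ P(T+|D) P(D) + P(T+|not D) P(not D) ]
ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
print(round(ppv, 3))  # about 0.269
```

Even with a fairly accurate test, a low prevalence makes the positive predictive value modest; this is the usual motivation for the calculation.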
If your model predicts a patient as 1 (positive) and they belong to category 1 (positive) in reality, we call this a true positive. If your model predicts a patient as 0 (negative) and they belong to category 1 (positive) in reality, we call this a false negative. If your model predicts a patient as 1 (positive) and they belong to category 0 (negative) in reality, we call this a false positive. If your model predicts a patient as 0 (negative) and they belong to category 0 (negative) in reality, we call this a true negative.
🎓 Precision: $\dfrac{TP}{TP + FP}$, the proportion of predicted positives that are truly positive.
🎓 Recall (sensitivity): $\dfrac{TP}{TP + FN}$, the proportion of actual positives that are correctly identified.
🎓 Specificity: $\dfrac{TN}{TN + FP}$, the proportion of actual negatives that are correctly identified.
🎓 Accuracy: $\dfrac{TP + TN}{TP + TN + FP + FN}$, the proportion of all predictions that are correct.
🎓 F Measure: a weighted average of the precision and recall, $F_1 = \dfrac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$, with best being 1 and worst being 0.
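The five metrics above can be computed directly from the four confusion-matrix counts. A minimal sketch (the counts below are made up purely for illustration):

```python
# Hypothetical confusion-matrix counts (illustrative only).
TP, FP, FN, TN = 40, 10, 5, 45

precision   = TP / (TP + FP)                    # predicted positives that are correct
recall      = TP / (TP + FN)                    # sensitivity
specificity = TN / (TN + FP)                    # true negative rate
accuracy    = (TP + TN) / (TP + TN + FP + FN)   # all correct predictions
f1 = 2 * precision * recall / (precision + recall)
print(round(precision, 3), round(recall, 3), round(specificity, 3),
      round(accuracy, 3), round(f1, 3))
```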
Multiplication Rule.
How many outcomes are possible from a sequence of 4 coin flips and 2 rolls of a die? Each flip has 2 outcomes and each roll has 6, so by the multiplication rule there are $2^4 \times 6^2 = 16 \times 36 = 576$ possible outcomes.
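A one-line check of the count, with Python used just for the arithmetic:

```python
# Each coin flip has 2 outcomes, each die roll has 6; independent stages multiply.
outcomes = 2**4 * 6**2
print(outcomes)  # 576
```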
Statistical inference
- Make inferences (an interpretation) about the true parameter value based on our estimator/estimate.
- Test whether our underlying assumptions (about the true population parameters, random variables, or model specification) hold true.
Testing does not
- Confirm with 100% certainty that a hypothesis is true
- Confirm with 100% certainty that a hypothesis is false
- Tell you how to interpret the estimate's value (economic vs. practical vs. statistical significance)
Hypothesis: translate an objective into a statement specifying a value (or set of values) in which the population parameter should, or should not, lie.
- Null hypothesis ($H_0$): a statement about the population parameter that we take to be true, for which we would need the data to provide substantial evidence against it.
  - Can be either a single value (e.g. $H_0{:}\ \beta = 0$) or a set of values (e.g. $H_0{:}\ \beta \geq 0$).
  - Will generally be the value you would not like the population parameter to be (subjective): $H_0{:}\ \beta = 0$ means you would like to see a non-zero coefficient, while $H_0{:}\ \beta \geq 0$ means you would like to see a negative effect.
  - "Test of significance" refers to the two-sided test $H_0{:}\ \beta = 0$ vs $H_1{:}\ \beta \neq 0$.
- Alternative hypothesis ($H_1$ or $H_a$) (research hypothesis): all other possible values that the population parameter may take if the null hypothesis does not hold.
Type I Error
Error made when we reject $H_0$ although $H_0$ is in fact true.
The probability of committing a Type I error is $\alpha$, the significance level of the test.
Legal analogy: In U.S. law, a defendant is presumed to be “innocent until proven guilty”.
If the null hypothesis is that a person is innocent, the Type I error is the probability that you conclude the person is guilty when he is innocent.
Type II Error
Error made when we fail to reject $H_0$ although $H_0$ is in fact false.
The probability of committing a Type II error is $\beta$; the power of the test is $1 - \beta$.
In the legal analogy, this is the probability that you fail to find the person guilty when he or she is guilty.
Random sample of size n: A collection of n independent random variables taken from the distribution X, each with the same distribution as X.
Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Sample median: if $n$ is odd, the middle ordered observation $x_{((n+1)/2)}$; if $n$ is even, the average of the two middle ordered observations.
Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Sample standard deviation: $s = \sqrt{s^2}$
Sample proportion: $\hat{p} = x/n$, where $x$ is the number of "successes" in the sample.
Estimators
Point estimator: a statistic $\hat{\theta}$, computed from the sample, used to estimate a population parameter $\theta$.
Point estimate: the numerical value assumed by $\hat{\theta}$ once the sample has been observed.
Unbiased estimator: $\hat{\theta}$ is an unbiased estimator for $\theta$ if $E[\hat{\theta}] = \theta$.
- $\bar{X}$ is an unbiased estimator for $\mu$
- $S^2$ is an unbiased estimator for $\sigma^2$
- $\hat{p}$ is an unbiased estimator for $p$
- $\bar{X}_1 - \bar{X}_2$ is an unbiased estimator for $\mu_1 - \mu_2$
- $\hat{p}_1 - \hat{p}_2$ is an unbiased estimator for $p_1 - p_2$

Note: $S$ is not an unbiased estimator of $\sigma$.
Distribution of the sample mean
If $X_1, \dots, X_n$ is a random sample from a distribution with mean $\mu$ and variance $\sigma^2$, then $E[\bar{X}] = \mu$ and $\operatorname{Var}(\bar{X}) = \sigma^2/n$; if the $X_i$ are normal (or $n$ is large, by the central limit theorem), $\bar{X} \sim N(\mu, \sigma^2/n)$.
One Sample Inference
i.i.d. stands for "independent and identically distributed".
Hence, we have the following model: $X_1, \dots, X_n \overset{\text{i.i.d.}}{\sim} N(\mu, \sigma^2)$.
The Mean
When $\sigma$ is known, $\bar{X} \sim N(\mu, \sigma^2/n)$. Then a $(1-\alpha)100\%$ confidence interval for $\mu$ is $\bar{x} \pm z_{\alpha/2}\,\dfrac{\sigma}{\sqrt{n}}$, the sample size needed for margin of error $E$ is $n = \left(z_{\alpha/2}\,\sigma/E\right)^2$, and the test statistic is $z = \dfrac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$.
When $\sigma$ is unknown, replace $\sigma$ by $s$ and use the $t$ distribution with $n-1$ degrees of freedom: the interval is $\bar{x} \pm t_{n-1,\,\alpha/2}\,\dfrac{s}{\sqrt{n}}$ and the test statistic is $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$.
If the experiment were repeated many times, $(1-\alpha)100\%$ of such intervals would contain the true mean $\mu$.
Example of one sample t.test
patient | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
---|---|---|---|---|---|---|---|---|
Days | 51 | 41 | 62 | 33 | 28 | 43 | 37 | 44 |
Hypothesis Testing
Test $H_0{:}\ \mu = \mu_0$ against $H_1{:}\ \mu \neq \mu_0$ using $t = \dfrac{\bar{x} - \mu_0}{s/\sqrt{n}}$ with $n - 1 = 7$ degrees of freedom. Since it is a two-tailed test, we divide the significance level $\alpha$ by 2 when finding the critical value.
Conclusion
Reject $H_0$ if the computed $|t|$ exceeds the critical value $t_{7,\,\alpha/2}$.
The single (one-sample) t-test assesses whether the population mean is equal to (or less/greater than) a specified value; the value is usually based on theoretical considerations or previous research.
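The one-sample t statistic can be reproduced from summary statistics alone. A sketch in plain Python (used here only so the arithmetic runs without packages), taking the birthweight figures reported in the Stata output below (n = 141, mean 3.007518, SD 0.4198693, null value 3.25):

```python
import math

# Summary statistics from the Stata one-sample t test of bweight in these notes.
n, xbar, s, mu0 = 141, 3.007518, 0.4198693, 3.25

# t = (sample mean - null value) / standard error
t = (xbar - mu0) / (s / math.sqrt(n))
print(round(t, 3))  # about -6.858, matching Stata's t = -6.8577
```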
use "birthweight1.dta", clear
ttest bweight ==3.25
One-sample t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
bweight | 141 3.007518 .0353594 .4198693 2.93761 3.077425
------------------------------------------------------------------------------
mean = mean(bweight) t = -6.8577
H0: mean = 3.25 Degrees of freedom = 140
Ha: mean < 3.25 Ha: mean != 3.25 Ha: mean > 3.25
Pr(T < t) = 0.0000 Pr(|T| > |t|) = 0.0000 Pr(T > t) = 1.0000
mean ≠ 50
t.test(school$write,mu=50,alternative = "two.sided" )
One Sample t-test
data: school$write
t = 4.1403, df = 199, p-value = 0.00005121
alternative hypothesis: true mean is not equal to 50
95 percent confidence interval:
51.45332 54.09668
sample estimates:
mean of x
52.775
mean < 50
t.test(school$write,mu=50,alternative = "less" )
One Sample t-test
data: school$write
t = 4.1403, df = 199, p-value = 1
alternative hypothesis: true mean is less than 50
95 percent confidence interval:
-Inf 53.8826
sample estimates:
mean of x
52.775
mean > 50
t.test(school$write,mu=50,alternative = "greater" )
One Sample t-test
data: school$write
t = 4.1403, df = 199, p-value = 0.0000256
alternative hypothesis: true mean is greater than 50
95 percent confidence interval:
51.6674 Inf
sample estimates:
mean of x
52.775
Hypothesis Testing
(52.775-50)/(9.478586/sqrt(200))
[1] 4.140325
Since it is a two-tailed test, we divide the significance level $\alpha$ by 2 when finding the critical value.
Conclusion
- We conclude that the mean writing score is statistically significantly different from 50.
For Difference of Means ($\mu_1 - \mu_2$), Independent Samples
- When $\sigma_1$ and $\sigma_2$ are known: test statistic $z = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}}$.
- When the variances are unknown but assumed equal: $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{s_p\sqrt{1/n_1 + 1/n_2}}$, with pooled variance $s_p^2 = \dfrac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1 + n_2 - 2}$ and degrees of freedom $n_1 + n_2 - 2$.
- When the variances are unknown and unequal (Welch): $t = \dfrac{(\bar{x}_1 - \bar{x}_2) - D_0}{\sqrt{s_1^2/n_1 + s_2^2/n_2}}$, with degrees of freedom from the Welch-Satterthwaite approximation.
two sample T-test
When you have a single explanatory variable that is qualitative with only two levels, you can run a Student's t-test to test for a difference in the means of the two levels. If appropriate for your data, you can choose to test a unilateral (one-sided) hypothesis: the more specific assumption that one level has a higher mean than the other, rather than simply that the means differ.
For the t-test, the t statistic used to find the p-value is calculated as:
$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$
where $\bar{x}_i$, $s_i^2$ and $n_i$ are the mean, variance and size of sample $i$.
NB: this applies if the variances are not equal; for equal variances we use the pooled-variance formula shown above.
Note that the t-test is mathematically equivalent to a one-way ANOVA with 2 levels.
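A numeric check of the pooled (equal-variance) formula, in plain Python, using the reading-score-by-gender summary statistics reported later in these notes (group 0: n = 91, mean 52.82418, SD 10.50671; group 1: n = 109, mean 51.73394, SD 10.05783):

```python
import math

# Summary statistics from the reading-scores-by-gender example in these notes.
n1, m1, s1 = 91, 52.82418, 10.50671
n2, m2, s2 = 109, 51.73394, 10.05783

# Pooled variance, then the equal-variance two-sample t statistic.
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
t = (m1 - m2) / math.sqrt(sp2 * (1 / n1 + 1 / n2))
print(round(sp2, 4), round(t, 3))  # 105.3559 and 0.748, matching the R output
```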
Assumptions
If the assumptions of the t-test are not met, the test can give misleading results. Here are some important things to note when testing the assumptions of a t-test.
- Normality of data: As with simple linear regression, the residuals need to be normally distributed. If the data are not normally distributed but have a reasonably symmetrical distribution, a mean close to the centre of the distribution, and only one mode (highest point in the frequency histogram), then a t-test will still work as long as the sample is sufficiently large (rule of thumb: ~30 observations). If the data are heavily skewed, we may need a very large sample before a t-test works; in such cases, an alternative non-parametric test should be used.
- Homoscedasticity: Another important assumption of the two-sample t-test is that the variances of your two samples are equal. This allows you to calculate a pooled variance, which in turn is used to calculate the standard error. If the population variances are unequal, then the probability of a Type I error is greater than α.
The robustness of the t-test increases with sample size and is higher when groups have equal sizes.
We can test for a difference in variances among two populations and ask what is the probability of taking two samples from two populations having identical variances and having the two sample variances be as different as $s_1^2$ and $s_2^2$ are.
To do so, we must do the variance ratio test (i.e. an F-test).
Violation of assumptions
If variances between groups are not equal, it is possible to use corrections, like the Welch correction. If assumptions cannot be respected, you can transform your data (log or square root for example) or use the non-parametric equivalent of t-test, the Mann-Whitney test. Finally, if the two groups are not independent (e.g. measurements on the same individual at 2 different years), you should use a Paired t-test.
use "high_school.dta", clear
ttest write ==read
(highschool and beyond (200 cases))
Paired t test
------------------------------------------------------------------------------
Variable | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
write | 200 52.775 .6702372 9.478586 51.45332 54.09668
read | 200 52.23 .7249921 10.25294 50.80035 53.65965
---------+--------------------------------------------------------------------
diff | 200 .545 .6283822 8.886666 -.6941424 1.784142
------------------------------------------------------------------------------
mean(diff) = mean(write - read) t = 0.8673
H0: mean(diff) = 0 Degrees of freedom = 199
Ha: mean(diff) < 0 Ha: mean(diff) != 0 Ha: mean(diff) > 0
Pr(T < t) = 0.8066 Pr(|T| > |t|) = 0.3868 Pr(T > t) = 0.1934
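The paired t statistic is simply a one-sample t on the differences. A quick check in plain Python, using the diff row of the Stata output above (n = 200, mean difference 0.545, SD 8.886666):

```python
import math

# Summary of the write - read differences from the Stata paired t test above.
n, dbar, sd_d = 200, 0.545, 8.886666

# Paired t = mean(diff) / SE(diff)
t = dbar / (sd_d / math.sqrt(n))
print(round(t, 4))  # 0.8673, matching Stata
```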
use "wintfda.dta", clear
ttest rhknow , by(ephsex)
Two-sample t test with equal variances
------------------------------------------------------------------------------
Group | Obs Mean Std. err. Std. dev. [95% conf. interval]
---------+--------------------------------------------------------------------
0 | 82 7.829268 .2578516 2.334946 7.316224 8.342312
1 | 121 8.917355 .2159567 2.375524 8.489776 9.344935
---------+--------------------------------------------------------------------
Combined | 203 8.477833 .1693948 2.413504 8.143824 8.811841
---------+--------------------------------------------------------------------
diff | -1.088087 .3374608 -1.753505 -.4226695
------------------------------------------------------------------------------
diff = mean(0) - mean(1) t = -3.2243
H0: diff = 0 Degrees of freedom = 201
Ha: diff < 0 Ha: diff != 0 Ha: diff > 0
Pr(T < t) = 0.0007 Pr(|T| > |t|) = 0.0015 Pr(T > t) = 0.9993
t.test(school$write,school$read , alternative = "two.sided" , var.equal = FALSE)
Welch Two Sample t-test
data: school$write and school$read
t = 0.55199, df = 395.57, p-value = 0.5813
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-1.396081 2.486081
sample estimates:
mean of x mean of y
52.775 52.230
Compare Reading marks by gender
use "high_school.dta", clear
ttest read, by(female)
t.test(read~female, data=school, alternative="two.sided" , var.equal = TRUE)
Two Sample t-test
data: read by female
t = 0.74801, df = 198, p-value = 0.4553
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.783998 3.964459
sample estimates:
mean in group 0 mean in group 1
52.82418 51.73394
Define the hypothesis
- $H_0{:}\ \mu_0 = \mu_1$ vs $H_1{:}\ \mu_0 \neq \mu_1$; the group means are given in the output tables.
Define the significance level
- test at the 5% level
((91 -1)*10.50671^2 + (109-1)*10.05783^2)/(109 + 91 -2)
[1] 105.3559
(52.82418-51.73394)/sqrt(105.3559*(1/91 +1/109))
[1] 0.7480173
For Difference of Means ($\mu_D$), Paired Samples ($D = X - Y$)
Hypothesis testing test statistic: $t = \dfrac{\bar{d} - \mu_{D,0}}{s_d/\sqrt{n}}$ with $n - 1$ degrees of freedom, where $\bar{d}$ and $s_d$ are the mean and standard deviation of the differences.
Difference of Two Proportions
Mean: $\hat{p}_1 - \hat{p}_2$.
Variance: $\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2}$.
Sample sizes, confidence: $n = \dfrac{z_{\alpha/2}^2\,[\hat{p}_1(1-\hat{p}_1) + \hat{p}_2(1-\hat{p}_2)]}{E^2}$ (prior estimates for $p_1, p_2$); $n = \dfrac{z_{\alpha/2}^2}{2E^2}$ (no prior estimates).
Hypothesis testing test statistics: with null value 0, $z = \dfrac{\hat{p}_1 - \hat{p}_2}{\sqrt{\bar{p}(1-\bar{p})(1/n_1 + 1/n_2)}}$, where $\bar{p}$ is the pooled proportion; with null value $D_0 \neq 0$, $z = \dfrac{(\hat{p}_1 - \hat{p}_2) - D_0}{\sqrt{\hat{p}_1(1-\hat{p}_1)/n_1 + \hat{p}_2(1-\hat{p}_2)/n_2}}$.
Worked arithmetic (a pooled two-sample t statistic with $s_p^2 = 20.38889$ and $n_1 = n_2 = 10$):
(21.9-16.7)/sqrt(20.38889*(2/10))
[1] 2.575085
Single Variance
The test statistic is $\chi^2 = \dfrac{(n-1)s^2}{\sigma_0^2}$ with $n-1$ degrees of freedom, and a $(1-\alpha)100\%$ confidence interval for $\sigma^2$ is $\left(\dfrac{(n-1)s^2}{\chi^2_{n-1,\,\alpha/2}},\ \dfrac{(n-1)s^2}{\chi^2_{n-1,\,1-\alpha/2}}\right)$.
Equivalently, taking square roots gives a confidence interval for $\sigma$.
Single Proportion (p)
Confidence interval: $\hat{p} \pm z_{\alpha/2}\sqrt{\hat{p}(1-\hat{p})/n}$.
Sample size for confidence: $n = \dfrac{z_{\alpha/2}^2\,\hat{p}(1-\hat{p})}{E^2}$ (with a prior estimate for $p$); $n = \dfrac{z_{\alpha/2}^2}{4E^2}$ (no prior estimate for $p$).
Hypothesis testing test statistic: $z = \dfrac{\hat{p} - p_0}{\sqrt{p_0(1-p_0)/n}}$.
Power
Formally, power (for the test of the mean) is given by $1 - \beta = P(\text{reject } H_0 \mid \mu = \mu_1)$; for a two-sided $z$ test this is approximately $\Phi\!\left(-z_{\alpha/2} + \dfrac{|\mu_1 - \mu_0|\sqrt{n}}{\sigma}\right)$.
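As a numeric sketch of a power calculation for a two-sided z test of the mean (all inputs below, including the effect size 0.5, sigma 1, and n = 25, are illustrative assumptions, not values from these notes):

```python
import math

def phi(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# Illustrative inputs (assumptions): null mean, true mean, sd, sample size.
mu0, mu1, sigma, n = 0.0, 0.5, 1.0, 25
z_crit = 1.96  # two-sided 5% critical value

delta = abs(mu1 - mu0) * math.sqrt(n) / sigma  # standardized effect = 2.5
power = phi(-z_crit + delta) + phi(-z_crit - delta)
print(round(power, 3))  # about 0.705
```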
Chi-square Test for Associations
Contingency tables
When we work with categorical data in research, we analyse response data for which there are either no predictor variables, or all of the predictor variables are also categorical.
Pearson Chi-square Test for independence
- The chi-square independence test tests whether observed joint frequency counts $O_{ij}$ differ from expected frequency counts $E_{ij}$ under the independence model (the model in which the two variables are independent); here $H_0$ is the independence model.
- It determines whether an association exists, i.e. whether the two variables are dependent on each other.
  + Sometimes $H_0$ represents the model whose validity is to be tested. Contrast this with the conventional formulation of $H_0$ as the hypothesis that is to be disproved. The goal in this case is not to disprove the model, but to see whether the data are consistent with the model and whether any deviation can be attributed to chance.
- These tests do not measure the strength of an association.
- These tests depend on and reflect the sample size: doubling the sample size by copying each observation doubles the $\chi^2$ statistic even though the strength of the association does not change.
- The [Pearson Chi-square Test] is not appropriate when more than about 20% of the cells have an expected cell frequency of less than 5 (large-sample p-values not appropriate).
- When the sample size is small the exact p-values can be calculated (this is prohibitive for large samples); calculation of the exact p-values assumes that the column totals and row totals are fixed.
$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$
where $O_{ij}$ is the observed count and $E_{ij} = \dfrac{\text{row}_i \text{ total} \times \text{column}_j \text{ total}}{n}$ is the expected count in cell $(i, j)$; the statistic has $(r-1)(c-1)$ degrees of freedom.
Question (Taken from Zimbabwe Open University)
A ZOU Regional Coordinator observed that at one weekend school, the 150 students who turned up were categorized by gender and programme as shown in the table below.
Gender | Physcology | Development studies | R & A Management | Total |
---|---|---|---|---|
Male | 46 | 29 | 28 | 103 |
Female | 27 | 14 | 6 | 47 |
Total | 73 | 43 | 34 | 150 |
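The expected counts and chi-square statistic for this table can be cross-checked with a few lines of plain Python (the observed counts come straight from the table above):

```python
# Observed counts from the gender x programme table above.
obs = [[46, 29, 28],   # Male
       [27, 14, 6]]    # Female

row_tot = [sum(r) for r in obs]          # 103, 47
col_tot = [sum(c) for c in zip(*obs)]    # 73, 43, 34
n = sum(row_tot)                         # 150

# Expected count = row total x column total / grand total.
exp = [[row_tot[i] * col_tot[j] / n for j in range(3)] for i in range(2)]
x2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
         for i in range(2) for j in range(3))
print(round(exp[0][0], 4), round(x2, 4))  # 50.1267 and 4.0743
```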
record the data in R

dat_o <- matrix(c(46, 29, 28,
                  27, 14, 6),
                nrow = 2, byrow = TRUE,
                dimnames = list(Gender = c("Male", "Female"),
                                study = c("Physcology", "Development studies",
                                          "R & A Management"))) # observed counts
view the observed values in a neat table
dat_o %>% data.frame() %>% rownames_to_column(var = " ") %>%
janitor::adorn_totals(where = c("row", "col")) %>%
flextable::flextable() %>%
flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
flextable::autofit()
The contingency table for the data is given above.
Look at the Observed proportions
prop.table(dat_o, margin = 1) %>% data.frame() %>% rownames_to_column(var = " ") %>%
janitor::adorn_totals(where = c("col")) %>%
flextable::flextable() %>%
flextable::colformat_num(j = c(2, 3, 4), digits = 2) %>%
flextable::autofit()
Calculating expected frequencies: under independence, $E_{ij} = \dfrac{\text{row}_i \text{ total} \times \text{column}_j \text{ total}}{n}$. Doing this in R yields:
Define the hypothesis
- $H_0$: There is no association between gender and student programme choice
- $H_1$: There is an association between gender and student programme choice
- The test statistic is calculated as follows
We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05,\,2} = 5.99$.
Calculating $\chi^2$ in R:
vitc_e <- outer(rowSums(dat_o), colSums(dat_o)) / sum(dat_o) # expected counts
vitc_dof <- (nrow(dat_o) - 1) * (ncol(dat_o) - 1) # degrees of freedom
X2 <- sum((dat_o - vitc_e)^2 / vitc_e)
print(X2)
[1] 4.074251
The degrees of freedom are $(r-1)(c-1) = (2-1)(3-1) = 2$.
- Conclusion
Since the calculated value $\chi^2 = 4.07$ is less than the critical value $\chi^2_{0.05,\,2} = 5.99$, we fail to reject the null hypothesis.
in R we can do this manually
pchisq(q = X2, df = vitc_dof, lower.tail = FALSE)
[1] 0.130403
Conclusion
Since the p-value (0.13) is greater than 0.05, we fail to reject $H_0$: there is no evidence of an association between the two variables.
The deviance (likelihood-ratio) statistic is $G^2 = 2\sum_{i,j} O_{ij}\ln(O_{ij}/E_{ij})$. The associated p-value for the deviance statistic is:
pchisq(q = G2, df = vitc_dof, lower.tail = FALSE)
[1] 0.1124065
Rather than doing the manual labour, R has a built-in function called chisq.test() which takes a table (or matrix) object. Recall that dat_o is:
dat_o
study
Gender Physcology Development studies R & A Management
Male 46 29 28
Female 27 14 6
vitc_chisq_test <- chisq.test(dat_o, correct = FALSE)
print(vitc_chisq_test)
Pearson's Chi-squared test
data: dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304
The Yates correction yields more conservative p-values.
chisq.test(dat_o)
Pearson's Chi-squared test
data: dat_o
X-squared = 4.0743, df = 2, p-value = 0.1304
These p-values give no evidence for rejecting the independence model.
Recall this data set is 2x3, so the degrees of freedom are (2-1)(3-1) = 2. Yates' continuity correction applies only to 2x2 tables, so the correct argument has no effect in chisq.test() here, which is why both calls give the same result.
Expected frequencies from the output
vitc_chisq_test$expected
study
Gender Physcology Development studies R & A Management
Male 50.12667 29.52667 23.34667
Female 22.87333 13.47333 10.65333
Head injury vs wearing a helmet
Testing the Hypothesis
Define the hypothesis
- $H_0$: Suffering a head injury is not associated with wearing a helmet
- $H_1$: There is an association between wearing a helmet and suffering a head injury.
Decide the level of significance
- at 5% level of significance
Calculate the test statistic
The observed counts with their margins (recovered from the calculations below) are:

variable | Head injury | No head injury | Total |
---|---|---|---|
Helmet | 17 | 218 | 235 |
No helmet | 130 | 428 | 558 |
Total | 147 | 646 | 793 |

Calculating expected frequencies, $E_{ij} = \dfrac{\text{row}_i \text{ total} \times \text{column}_j \text{ total}}{n}$:
(235*147)/(793)
[1] 43.56242
(235* 646)/(793)
[1] 191.4376
(558* 147)/(793)
[1] 103.4376
(558 * 646)/(793)
[1] 454.5624
- The test statistic is calculated as follows
We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05,\,1} = 3.841$.
Calculating $\chi^2$ in R:
(((17-43.6)^2)/(43.6))+(((218-191.4)^2)/(191.4))+(((130-103.4)^2)/(103.4))+(((428-454.6)^2)/(454.6))
[1] 28.32459
Chi-Squared distribution
Under $H_0$, the test statistic behaves as a chi-squared random variable with $(r-1)(c-1)$ degrees of freedom. For our $2 \times 2$ example, $df = (2-1)(2-1) = 1$:
options(scipen = 999)
dof <- 1
x2 <- 28.32459
(p_value <- pchisq(q = x2, df = dof, lower.tail = FALSE))
[1] 0.0000001025845
Or simulate this by taking the mean of 10,000 random trials; here we instead visualise the tail probability:
data.frame(x = 250:300 / 10) %>%
mutate(density = pchisq(x, df = dof, lower.tail = FALSE),
cdf = if_else(x > x2, density, as.numeric(NA))) %>%
ggplot() +
geom_line(aes(x = x, y = density)) +
geom_area(aes(x = x, y = cdf), alpha = 0.3) +
theme_minimal() +
labs(title = "P(X^2 > 28.3) when X ~ ChiSq(1)")
How large would $\chi^2$ need to be to reach significance at the 5% level with 1 degree of freedom?
qchisq(p = .05, df = 1, lower.tail = FALSE)
[1] 3.841459
Is the type of therapy associated with improvement?
variable | Improvement | No improvement | Total |
---|---|---|---|
therapy x | 28 | 16 | 44 |
therapy y | 37 | 9 | 46 |
Total | 65 | 25 | 90 |
Calculating expected frequencies
- $H_0$: The two variables are independent
- $H_1$: The two variables are dependent
The test statistic is calculated as follows
We reject the null hypothesis at the 5% level of significance if $\chi^2 > \chi^2_{0.05,\,1} = 3.841$.
summary table
Observed (O) | Expected (E) | (O-E)^2 | [(O-E)^2]/E |
---|---|---|---|
28 | 31.77778 | 14.27177 | 0.4491114 |
16 | 12.22222 | 14.27177 | 1.167693 |
37 | 33.22222 | 14.27328 | 0.4296335 |
9 | 12.77778 | 14.27162 | 1.116909 |
- Conclusion
Since the calculated value ($\chi^2 \approx 3.16$) is less than the critical value (3.841), we fail to reject the null hypothesis and conclude that the two variables are independent.
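The summary table can be verified in a few lines of plain Python (observed counts taken from the therapy table above):

```python
# Observed counts from the therapy x improvement table above.
obs = [[28, 16],   # therapy x
       [37, 9]]    # therapy y

row_tot = [sum(r) for r in obs]          # 44, 46
col_tot = [sum(c) for c in zip(*obs)]    # 65, 25
n = sum(row_tot)                         # 90

# Expected count = row total x column total / grand total.
exp = [[row_tot[i] * col_tot[j] / n for j in range(2)] for i in range(2)]
x2 = sum((obs[i][j] - exp[i][j]) ** 2 / exp[i][j]
         for i in range(2) for j in range(2))
print(round(x2, 3))  # about 3.163, below the 3.841 critical value at df = 1
```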
Examples in R and Stata
use "high_school.dta", clear
list id female race ses schtyp prog in 1/5
use "high_school.dta", clear
tab ses prog , chi expected exact cchi2
Observed frequencies
school |>
janitor::tabyl(ses,prog) |>
janitor::adorn_totals(where = c("row", "col")) %>%
flextable::flextable() %>%
flextable::colformat_int(j = c(2, 3, 4), big.mark = ",") %>%
flextable::autofit()
tab <- table(school$ses, school$prog) # observed counts
vitc_e <- outer(rowSums(tab), colSums(tab)) / sum(tab) # expected counts
X2 <- sum((tab - vitc_e)^2 / vitc_e)
print(X2)
[1] 16.60444
Shapiro-Wilk Test
The Shapiro-Wilk test is a test of whether a random variable is normally distributed. It uses the test statistic
$$W = \frac{\left(\sum_{i=1}^{n} a_i x_{(i)}\right)^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2},$$
where $x_{(i)}$ are the ordered sample values and the $a_i$ are constants derived from the expected values and covariances of normal order statistics. The test rejects normality for small values of $W$ (small p-values). The examples below apply it to binomial samples:
shapiro.test(rbinom(100, 10, .3))
##
## Shapiro-Wilk normality test
##
## data: rbinom(100, 10, 0.3)
## W = 0.93847, p-value = 0.0001557
shapiro.test(rbinom(100, 30, .3))
##
## Shapiro-Wilk normality test
##
## data: rbinom(100, 30, 0.3)
## W = 0.97063, p-value = 0.02469