Multivariate Statistics for Data Science Notes

Published

11 January 2024

Multivariate Methods

Let $y_1, \dots, y_p$ be possibly correlated random variables with means $\mu_1, \dots, \mu_p$.

$$\mathbf{y} = (y_1, \dots, y_p)'$$

$$E(\mathbf{y}) = (\mu_1, \dots, \mu_p)'$$

Let $\sigma_{ij} = cov(y_i, y_j)$ for $i, j = 1, \dots, p$

$$\Sigma = (\sigma_{ij}) = \begin{pmatrix}
\sigma_{11} & \sigma_{12} & \dots & \sigma_{1p} \\
\sigma_{21} & \sigma_{22} & \dots & \sigma_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\sigma_{p1} & \sigma_{p2} & \dots & \sigma_{pp}
\end{pmatrix}$$

where $\Sigma$ (symmetric) is the variance-covariance or dispersion matrix

Let $\mathbf{u}_{p \times 1}$ and $\mathbf{v}_{q \times 1}$ be random vectors with means $\mu_u$ and $\mu_v$. Then

$$\Sigma_{uv} = cov(\mathbf{u}, \mathbf{v}) = E[(\mathbf{u} - \mu_u)(\mathbf{v} - \mu_v)']$$

in which $\Sigma_{uv} \neq \Sigma_{vu}$ in general, but $\Sigma_{uv} = \Sigma_{vu}'$


Properties of Covariance Matrices

  1. Symmetric: $\Sigma = \Sigma'$
  2. Non-negative definite: $\mathbf{a}'\Sigma\mathbf{a} \ge 0$ for any $\mathbf{a} \in R^p$, which is equivalent to the eigenvalues of $\Sigma$ satisfying $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_p \ge 0$
  3. $|\Sigma| = \lambda_1 \lambda_2 \dots \lambda_p \ge 0$ (generalized variance; the bigger this number is, the more variation there is)
  4. $trace(\Sigma) = tr(\Sigma) = \lambda_1 + \dots + \lambda_p = \sigma_{11} + \dots + \sigma_{pp}$ = sum of the variances (total variance)

Note:

  • $\Sigma$ is typically required to be positive definite, which means all eigenvalues are positive and $\Sigma$ has an inverse $\Sigma^{-1}$ such that $\Sigma^{-1}\Sigma = \mathbf{I}_{p \times p} = \Sigma\Sigma^{-1}$
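As a quick numerical check, the four properties above can be verified for a small covariance matrix (a sketch using NumPy; the matrix values are made up for illustration):

```python
import numpy as np

# A hypothetical 3x3 covariance matrix (symmetric, diagonally dominant -> positive definite)
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 2.0, 0.3],
                  [0.5, 0.3, 1.0]])

eigvals = np.linalg.eigvalsh(Sigma)   # all non-negative for a valid covariance matrix
gen_var = np.linalg.det(Sigma)        # generalized variance = product of eigenvalues
total_var = np.trace(Sigma)           # total variance = sum of eigenvalues
```

The determinant and trace agreeing with the product and sum of eigenvalues is exactly properties 3 and 4.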


Correlation Matrices

$$\rho_{ij} = \frac{\sigma_{ij}}{\sqrt{\sigma_{ii}\sigma_{jj}}}$$

$$\mathbf{R} = \begin{pmatrix}
\rho_{11} & \rho_{12} & \dots & \rho_{1p} \\
\rho_{21} & \rho_{22} & \dots & \rho_{2p} \\
\vdots & \vdots & \ddots & \vdots \\
\rho_{p1} & \rho_{p2} & \dots & \rho_{pp}
\end{pmatrix}$$

where $\rho_{ij}$ is the correlation and $\rho_{ii} = 1$ for all $i$

Alternatively,

$$\mathbf{R} = [diag(\Sigma)]^{-1/2} \Sigma [diag(\Sigma)]^{-1/2}$$

where $diag(\Sigma)$ is the matrix which has the $\sigma_{ii}$'s on the diagonal and 0's elsewhere,

and $\mathbf{A}^{1/2}$ (the square root of a symmetric matrix) is a symmetric matrix such that $\mathbf{A} = \mathbf{A}^{1/2}\mathbf{A}^{1/2}$
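The matrix form of $\mathbf{R}$ above can be sketched directly (a made-up $2 \times 2$ covariance matrix; $\sigma_{11} = 4$, $\sigma_{22} = 9$, so the correlation should be $1/(2 \cdot 3) = 1/6$):

```python
import numpy as np

Sigma = np.array([[4.0, 1.0],
                  [1.0, 9.0]])

# [diag(Sigma)]^{-1/2}: reciprocal square roots of the variances on the diagonal
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
R = D_inv_sqrt @ Sigma @ D_inv_sqrt   # correlation matrix
```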

Equalities

Let

  • $\mathbf{x}$ and $\mathbf{y}$ be random vectors with means $\mu_x$ and $\mu_y$ and variance-covariance matrices $\Sigma_x$ and $\Sigma_y$.

  • A and B be matrices of constants and c and d be vectors of constants

Then

  • $E(\mathbf{Ay} + \mathbf{c}) = \mathbf{A}\mu_y + \mathbf{c}$

  • $var(\mathbf{Ay} + \mathbf{c}) = \mathbf{A} \, var(\mathbf{y}) \, \mathbf{A}' = \mathbf{A}\Sigma_y\mathbf{A}'$

  • $cov(\mathbf{Ay} + \mathbf{c}, \mathbf{By} + \mathbf{d}) = \mathbf{A}\Sigma_y\mathbf{B}'$

  • $E(\mathbf{Ay} + \mathbf{Bx} + \mathbf{c}) = \mathbf{A}\mu_y + \mathbf{B}\mu_x + \mathbf{c}$

  • $var(\mathbf{Ay} + \mathbf{Bx} + \mathbf{c}) = \mathbf{A}\Sigma_y\mathbf{A}' + \mathbf{B}\Sigma_x\mathbf{B}' + \mathbf{A}\Sigma_{yx}\mathbf{B}' + \mathbf{B}\Sigma_{xy}\mathbf{A}'$
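The first two identities can be checked by Monte Carlo (a sketch with arbitrary made-up values for $\mathbf{A}$, $\mathbf{c}$, $\mu_y$, and $\Sigma_y$; the empirical moments of $\mathbf{Ay} + \mathbf{c}$ should approximate the formulas):

```python
import numpy as np

rng = np.random.default_rng(0)
mu_y = np.array([1.0, -2.0])
Sigma_y = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])
c = np.array([3.0, 0.0])

y = rng.multivariate_normal(mu_y, Sigma_y, size=200_000)
z = y @ A.T + c                    # z = A y + c, applied row-wise

emp_mean = z.mean(axis=0)          # should approximate A mu_y + c
emp_cov = np.cov(z, rowvar=False)  # should approximate A Sigma_y A'
```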

Multivariate Normal Distribution

Let $\mathbf{y}$ be a multivariate normal (MVN) random variable with mean $\mu$ and variance-covariance matrix $\Sigma$. Then the density of $\mathbf{y}$ is

$$f(\mathbf{y}) = \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y} - \mu)'\Sigma^{-1}(\mathbf{y} - \mu)\right)$$

$$\mathbf{y} \sim N_p(\mu, \Sigma)$$
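As a sanity check, the density formula can be evaluated by hand and compared against `scipy.stats.multivariate_normal` (a sketch with made-up $\mu$, $\Sigma$, and evaluation point):

```python
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
y = np.array([0.5, 0.5])
p = len(mu)

diff = y - mu
quad = diff @ np.linalg.inv(Sigma) @ diff   # (y - mu)' Sigma^{-1} (y - mu)
dens = np.exp(-0.5 * quad) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))
```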

Properties of MVN

  • Let $\mathbf{A}_{r \times p}$ be a fixed matrix. Then $\mathbf{Ay} \sim N_r(\mathbf{A}\mu, \mathbf{A}\Sigma\mathbf{A}')$. We need $r \le p$ and all rows of $\mathbf{A}$ to be linearly independent to guarantee that $\mathbf{A}\Sigma\mathbf{A}'$ is non-singular.

  • Let $\mathbf{G}$ be a matrix such that $\Sigma^{-1} = \mathbf{G}'\mathbf{G}$. Then $\mathbf{Gy} \sim N_p(\mathbf{G}\mu, \mathbf{I})$ and $\mathbf{G}(\mathbf{y} - \mu) \sim N_p(\mathbf{0}, \mathbf{I})$

  • Any fixed linear combination of $y_1, \dots, y_p$, say $\mathbf{c}'\mathbf{y}$, follows $\mathbf{c}'\mathbf{y} \sim N_1(\mathbf{c}'\mu, \mathbf{c}'\Sigma\mathbf{c})$

  • Define a partition $[\mathbf{y}_1', \mathbf{y}_2']'$ where

    • $\mathbf{y}_1$ is $p_1 \times 1$

    • $\mathbf{y}_2$ is $p_2 \times 1$,

    • $p_1 + p_2 = p$

    • $p_1, p_2 \ge 1$. Then

$$\begin{pmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{pmatrix} \sim N\left(\begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}\right)$$

  • The marginal distributions of $\mathbf{y}_1$ and $\mathbf{y}_2$ are $\mathbf{y}_1 \sim N_{p_1}(\mu_1, \Sigma_{11})$ and $\mathbf{y}_2 \sim N_{p_2}(\mu_2, \Sigma_{22})$

  • Individual components $y_1, \dots, y_p$ are all normally distributed: $y_i \sim N_1(\mu_i, \sigma_{ii})$

  • The conditional distribution of $\mathbf{y}_1$ given $\mathbf{y}_2$ is normal

    • $\mathbf{y}_1 | \mathbf{y}_2 \sim N_{p_1}(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(\mathbf{y}_2 - \mu_2), \ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$

      • In this formula, we see that if we know (have info about) $\mathbf{y}_2$, we can re-weight $\mathbf{y}_1$'s mean, and the variance is reduced because knowing $\mathbf{y}_2$ tells us more about $\mathbf{y}_1$
    • which is analogous to $\mathbf{y}_2 | \mathbf{y}_1$. And $\mathbf{y}_1$ and $\mathbf{y}_2$ are independently distributed if and only if $\Sigma_{12} = \mathbf{0}$

  • If $\mathbf{y} \sim N(\mu, \Sigma)$ and $\Sigma$ is positive definite, then $(\mathbf{y} - \mu)'\Sigma^{-1}(\mathbf{y} - \mu) \sim \chi^2_{(p)}$

  • If $\mathbf{y}_i$ are independent $N_p(\mu_i, \Sigma_i)$ random variables, then for fixed matrices $\mathbf{A}_{i (m \times p)}$, $\sum_{i=1}^k \mathbf{A}_i\mathbf{y}_i \sim N_m\left(\sum_{i=1}^k \mathbf{A}_i\mu_i, \sum_{i=1}^k \mathbf{A}_i\Sigma_i\mathbf{A}_i'\right)$
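The conditional mean and covariance formulas above can be sketched in code (made-up $3 \times 3$ covariance matrix, partitioned so $\mathbf{y}_1$ is the first component and $\mathbf{y}_2$ the last two):

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition blocks: Sigma_11 (1x1), Sigma_12 (1x2), Sigma_21 (2x1), Sigma_22 (2x2)
S11, S12 = Sigma[:1, :1], Sigma[:1, 1:]
S21, S22 = Sigma[1:, :1], Sigma[1:, 1:]
y2 = np.array([1.5, 1.0])   # hypothetical observed value of y2

# mu_1 + Sigma_12 Sigma_22^{-1} (y2 - mu_2)
cond_mean = mu[:1] + S12 @ np.linalg.solve(S22, y2 - mu[1:])
# Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21 (smaller than Sigma_11: variance is reduced)
cond_cov = S11 - S12 @ np.linalg.solve(S22, S21)
```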

Multiple Regression

$$\begin{pmatrix} Y \\ \mathbf{x} \end{pmatrix} \sim N_{p+1}\left(\begin{bmatrix} \mu_Y \\ \mu_x \end{bmatrix}, \begin{bmatrix} \sigma^2_Y & \Sigma_{yx}' \\ \Sigma_{yx} & \Sigma_{xx} \end{bmatrix}\right)$$

The conditional distribution of $Y$ given $\mathbf{x}$ follows a univariate normal distribution with

$$E(Y|\mathbf{x}) = \mu_Y + \Sigma_{yx}'\Sigma_{xx}^{-1}(\mathbf{x} - \mu_x) = \mu_Y - \Sigma_{yx}'\Sigma_{xx}^{-1}\mu_x + \Sigma_{yx}'\Sigma_{xx}^{-1}\mathbf{x} = \beta_0 + \beta'\mathbf{x}$$

$$var(Y|\mathbf{x}) = \sigma^2_Y - \Sigma_{yx}'\Sigma_{xx}^{-1}\Sigma_{yx}$$

where $\beta = (\beta_1, \dots, \beta_p)' = \Sigma_{xx}^{-1}\Sigma_{yx}$ (analogous to $(\mathbf{x}'\mathbf{x})^{-1}\mathbf{x}'\mathbf{y}$ in least squares, but not the same if we consider $Y_i$ and $\mathbf{x}_i$, $i = 1, \dots, n$ and use the empirical covariance formulas)
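The regression coefficients and conditional variance can be recovered from the joint covariance structure (a sketch with made-up population values; $\Sigma_{yx}$ here is the $p \times 1$ vector of covariances between $\mathbf{x}$ and $Y$):

```python
import numpy as np

mu_Y, mu_x = 2.0, np.array([1.0, 0.0])
var_Y = 4.0
Sigma_yx = np.array([1.0, 0.5])          # cov(x, Y), made up
Sigma_xx = np.array([[2.0, 0.3],
                     [0.3, 1.0]])

beta = np.linalg.solve(Sigma_xx, Sigma_yx)                      # Sigma_xx^{-1} Sigma_yx
beta0 = mu_Y - Sigma_yx @ np.linalg.solve(Sigma_xx, mu_x)       # intercept
cond_var = var_Y - Sigma_yx @ np.linalg.solve(Sigma_xx, Sigma_yx)
```

Note `cond_var` is strictly smaller than `var_Y` whenever $\Sigma_{yx} \neq \mathbf{0}$: conditioning on $\mathbf{x}$ explains part of the variance.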


Samples from Multivariate Normal Populations

Take a random sample of size $n$, $\mathbf{y}_1, \dots, \mathbf{y}_n$, from $N_p(\mu, \Sigma)$. Then

  • Since $\mathbf{y}_1, \dots, \mathbf{y}_n$ are iid, their sample mean $\bar{\mathbf{y}} = \sum_{i=1}^n \mathbf{y}_i/n \sim N_p(\mu, \Sigma/n)$; that is, $\bar{\mathbf{y}}$ is an unbiased estimator of $\mu$

  • The $p \times p$ sample variance-covariance matrix $\mathbf{S}$ is $$\mathbf{S} = \frac{1}{n-1}\sum_{i=1}^n (\mathbf{y}_i - \bar{\mathbf{y}})(\mathbf{y}_i - \bar{\mathbf{y}})' = \frac{1}{n-1}\left(\sum_{i=1}^n \mathbf{y}_i\mathbf{y}_i' - n\bar{\mathbf{y}}\bar{\mathbf{y}}'\right)$$

    • where $\mathbf{S}$ is symmetric, an unbiased estimator of $\Sigma$, and has $p(p+1)/2$ distinct random elements.
  • $(n-1)\mathbf{S} \sim W_p(n-1, \Sigma)$, a Wishart distribution with $n-1$ degrees of freedom and expectation $(n-1)\Sigma$. The Wishart distribution is a multivariate extension of the chi-squared distribution.

  • $\bar{\mathbf{y}}$ and $\mathbf{S}$ are independent

  • $\bar{\mathbf{y}}$ and $\mathbf{S}$ are sufficient statistics. (All of the info in the data about $\mu$ and $\Sigma$ is contained in $\bar{\mathbf{y}}$ and $\mathbf{S}$, regardless of sample size.)
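A sketch computing $\bar{\mathbf{y}}$ and $\mathbf{S}$ on simulated data and confirming the two algebraic forms of $\mathbf{S}$ above agree (dimensions and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)   # n x p data matrix

ybar = Y.mean(axis=0)
centered = Y - ybar
S = centered.T @ centered / (n - 1)                    # sum of outer products form
S_alt = (Y.T @ Y - n * np.outer(ybar, ybar)) / (n - 1) # computational form
```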


Large Sample Properties

Let $\mathbf{y}_1, \dots, \mathbf{y}_n$ be a random sample from some population with mean $\mu$ and variance-covariance matrix $\Sigma$.

  • $\bar{\mathbf{y}}$ is a consistent estimator for $\mu$

  • $\mathbf{S}$ is a consistent estimator for $\Sigma$

  • Multivariate Central Limit Theorem: similar to the univariate case, $\sqrt{n}(\bar{\mathbf{y}} - \mu) \dot{\sim} N_p(\mathbf{0}, \Sigma)$ when $n$ is large relative to $p$ (say, $n \ge 25p$), which is equivalent to $\bar{\mathbf{y}} \dot{\sim} N_p(\mu, \Sigma/n)$

  • Wald's Theorem: $n(\bar{\mathbf{y}} - \mu)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mu) \dot{\sim} \chi^2_{(p)}$ when $n$ is large relative to $p$.


Maximum Likelihood Estimation for MVN

Suppose iid $\mathbf{y}_1, \dots, \mathbf{y}_n \sim N_p(\mu, \Sigma)$. The likelihood function for the data is

$$L(\mu, \Sigma) = \prod_{j=1}^n \frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \exp\left(-\frac{1}{2}(\mathbf{y}_j - \mu)'\Sigma^{-1}(\mathbf{y}_j - \mu)\right) = \frac{1}{(2\pi)^{np/2}|\Sigma|^{n/2}} \exp\left(-\frac{1}{2}\sum_{j=1}^n (\mathbf{y}_j - \mu)'\Sigma^{-1}(\mathbf{y}_j - \mu)\right)$$

Then, the MLEs are

$$\hat{\mu} = \bar{\mathbf{y}}$$

$$\hat{\Sigma} = \frac{n-1}{n}\mathbf{S}$$

using derivatives of the log of the likelihood function with respect to $\mu$ and $\Sigma$
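A small sketch of the rescaling: the MLE $\hat{\Sigma}$ equals $\mathbf{S}$ with its $n-1$ divisor replaced by $n$, i.e., the plain average of the centered outer products (simulated data, arbitrary seed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 40, 2
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)                        # divides by n - 1 (unbiased S)
Sigma_hat = (n - 1) / n * S                        # MLE: rescale S
Sigma_hat_direct = (Y - ybar).T @ (Y - ybar) / n   # MLE computed directly (divide by n)
```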


Properties of MLEs

  • Invariance: If $\hat{\theta}$ is the MLE of $\theta$, then the MLE of $h(\theta)$ is $h(\hat{\theta})$ for any function $h(\cdot)$

  • Consistency: MLEs are consistent estimators, but they are usually biased

  • Efficiency: MLEs are efficient estimators (no other estimator has a smaller variance for large samples)

  • Asymptotic normality: Suppose that $\hat{\theta}_n$ is the MLE for $\theta$ based upon $n$ independent observations. Then $\hat{\theta}_n \dot{\sim} N(\theta, \mathbf{H}^{-1})$

    • $\mathbf{H}$ is the Fisher information matrix, which contains the expected values of the second partial derivatives of the log-likelihood function. The $(i,j)$th element of $\mathbf{H}$ is $-E\left(\frac{\partial^2 l(\theta)}{\partial\theta_i \partial\theta_j}\right)$

    • we can estimate $\mathbf{H}$ by finding the form determined above and evaluating it at $\theta = \hat{\theta}_n$

  • Likelihood ratio testing: for some null hypothesis $H_0$, we can form a likelihood ratio test

    • The statistic is $$\Lambda = \frac{\max_{H_0} L(\mu, \Sigma | \mathbf{Y})}{\max L(\mu, \Sigma | \mathbf{Y})}$$

    • For large $n$, $-2\log\Lambda \sim \chi^2_{(v)}$, where $v$ is the number of parameters in the unrestricted space minus the number of parameters under $H_0$


Test of Multivariate Normality

  • Check univariate normality for each trait ($X$) separately

    • Can check [Normality Assessment]

    • The good thing is that if any univariate trait is not normal, then the joint distribution is not normal: if a joint multivariate distribution is normal, then each marginal distribution has to be normal.

    • However, marginal normality of all traits does not imply joint MVN

    • So we can easily rule out multivariate normality this way, but we cannot prove it

  • Mardia’s tests for multivariate normality

    • Multivariate skewness is $$\beta_{1,p} = E[(\mathbf{y} - \mu)'\Sigma^{-1}(\mathbf{x} - \mu)]^3$$

    • where $\mathbf{x}$ and $\mathbf{y}$ are independent but have the same distribution (note: $\beta$ here is not a regression coefficient)

    • Multivariate kurtosis is defined as

    • $$\beta_{2,p} = E[(\mathbf{y} - \mu)'\Sigma^{-1}(\mathbf{y} - \mu)]^2$$

    • For the MVN distribution, we have $\beta_{1,p} = 0$ and $\beta_{2,p} = p(p+2)$

    • For a sample of size $n$, we can estimate

      $$\hat{\beta}_{1,p} = \frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n g_{ij}^3$$

      $$\hat{\beta}_{2,p} = \frac{1}{n}\sum_{i=1}^n g_{ii}^2$$

      • where $g_{ij} = (\mathbf{y}_i - \bar{\mathbf{y}})'\mathbf{S}^{-1}(\mathbf{y}_j - \bar{\mathbf{y}})$. Note: $g_{ii} = d_i^2$, where $d_i^2$ is the Mahalanobis distance
    • [@MARDIA_1970] shows that for large $n$,

      $$\kappa_1 = \frac{n\hat{\beta}_{1,p}}{6} \dot{\sim} \chi^2_{p(p+1)(p+2)/6}$$

      $$\kappa_2 = \frac{\hat{\beta}_{2,p} - p(p+2)}{\sqrt{8p(p+2)/n}} \sim N(0,1)$$

      • Hence, we can use $\kappa_1$ and $\kappa_2$ to test the null hypothesis of MVN.

      • When the data are non-normal, normal-theory tests on the mean are sensitive to $\beta_{1,p}$, while tests on the covariance are sensitive to $\beta_{2,p}$

  • Alternatively, Doornik-Hansen test for multivariate normality [@doornik2008]

  • Chi-square Q-Q plot

    • Let $\mathbf{y}_i, i = 1, \dots, n$ be a random sample from $N_p(\mu, \Sigma)$

    • Then $\mathbf{z}_i = \Sigma^{-1/2}(\mathbf{y}_i - \mu), i = 1, \dots, n$ are iid $N_p(\mathbf{0}, \mathbf{I})$. Thus, $d_i^2 = \mathbf{z}_i'\mathbf{z}_i \sim \chi^2_p, i = 1, \dots, n$

    • Plot the ordered $d_i^2$ values against the quantiles of the $\chi^2_p$ distribution. When normality holds, the plot should approximately resemble a straight line passing through the origin at a 45-degree angle

    • It requires a large sample size (i.e., it is sensitive to sample size). Even if we generate data from a MVN, the tail of the chi-square Q-Q plot can still be out of line.

  • If the data are not normal, we can

    • ignore it

    • use nonparametric methods

    • use models based upon an approximate distribution (e.g., GLMM)

    • try performing a transformation
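Mardia's sample skewness and kurtosis, and the $\kappa_1$, $\kappa_2$ statistics defined above, can be sketched as follows (simulated MVN data with arbitrary seed; note this sketch uses the unbiased $\mathbf{S}$ rather than the MLE $\hat{\Sigma}$, a minor convention difference):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 200, 2
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)
C = Y - ybar
G = C @ np.linalg.solve(S, C.T)   # G[i, j] = g_ij = (y_i - ybar)' S^{-1} (y_j - ybar)

b1 = (G ** 3).sum() / n ** 2      # skewness estimate; approx. 0 under MVN
b2 = (np.diag(G) ** 2).sum() / n  # kurtosis estimate; approx. p(p + 2) under MVN

kappa1 = n * b1 / 6               # approx. chi-square, df = p(p+1)(p+2)/6
kappa2 = (b2 - p * (p + 2)) / np.sqrt(8 * p * (p + 2) / n)   # approx. N(0, 1)
```

Large values of $\kappa_1$, or $|\kappa_2|$ far from 0, are evidence against multivariate normality.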


Mean Vector Inference

In the univariate normal distribution, we test $H_0: \mu = \mu_0$ by using

$$T = \frac{\bar{y} - \mu_0}{s/\sqrt{n}} \sim t_{n-1}$$

under the null hypothesis. And reject the null if $|T|$ is large relative to $t_{(1-\alpha/2, n-1)}$, because it means that seeing a value as large as what we observed is rare if the null is true.

Equivalently,

$$T^2 = \frac{(\bar{y} - \mu_0)^2}{s^2/n} = n(\bar{y} - \mu_0)(s^2)^{-1}(\bar{y} - \mu_0) \sim f_{(1, n-1)}$$

Natural Multivariate Generalization

$$H_0: \mu = \mu_0 \quad \text{vs.} \quad H_a: \mu \neq \mu_0$$

Define Hotelling's $T^2$ by

$$T^2 = n(\bar{\mathbf{y}} - \mu_0)'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mu_0)$$

which can be viewed as a generalized distance between $\bar{\mathbf{y}}$ and $\mu_0$.

Under the assumption of normality,

$$F = \frac{n-p}{(n-1)p}T^2 \sim f_{(p, n-p)}$$

and we reject the null hypothesis when $F > f_{(1-\alpha, p, n-p)}$

  • The $T^2$ test is invariant to changes in measurement units.

    • If $\mathbf{z} = \mathbf{Cy} + \mathbf{d}$, where $\mathbf{C}$ and $\mathbf{d}$ do not depend on $\mathbf{y}$, then $T^2(\mathbf{z}) = T^2(\mathbf{y})$
  • The $T^2$ test can be derived as a likelihood ratio test of $H_0: \mu = \mu_0$
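The one-sample $T^2$ test above can be sketched end to end (simulated data generated under the null, with made-up dimensions and $\mu_0 = \mathbf{0}$):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(4)
n, p = 30, 3
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
mu0 = np.zeros(p)

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)
diff = ybar - mu0
T2 = n * diff @ np.linalg.solve(S, diff)   # Hotelling's T^2

F = (n - p) / ((n - 1) * p) * T2           # F-scaled statistic
p_value = f_dist.sf(F, p, n - p)           # upper-tail probability of f(p, n-p)
```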


Confidence Intervals

Confidence Region

An “exact” $100(1-\alpha)\%$ confidence region for $\mu$ is the set of all vectors $\mathbf{v}$ which are “close enough” to the observed mean vector $\bar{\mathbf{y}}$ to satisfy

$$n(\bar{\mathbf{y}} - \mathbf{v})'\mathbf{S}^{-1}(\bar{\mathbf{y}} - \mathbf{v}) \le \frac{(n-1)p}{n-p}f_{(1-\alpha, p, n-p)}$$

  • These $\mathbf{v}$ are just the mean vectors that are not rejected by the $T^2$ test when $\bar{\mathbf{y}}$ is observed.

With $p = 2$ parameters the confidence region is an ellipse; for larger $p$ it is a “hyper-ellipsoid”.

The region consists of all $\mu_0$ vectors for which the $T^2$ test would not reject $H_0$ at significance level $\alpha$.

Even though the confidence region better assesses the joint knowledge concerning plausible values of $\mu$, people typically also want confidence statements about the individual component means. We'd like all of the separate confidence statements to hold simultaneously with a specified high probability. Simultaneous confidence intervals guard against any single statement being incorrect.


Simultaneous Confidence Statements
  • Intervals based on a rectangular confidence region, obtained by projecting the previous region onto the coordinate axes:

$$\bar{y}_i \pm \sqrt{\frac{(n-1)p}{n-p}f_{(1-\alpha, p, n-p)}}\sqrt{\frac{s_{ii}}{n}}$$

for all $i = 1, \dots, p$.

The implied confidence region is conservative: it has coverage of at least $100(1-\alpha)\%$.

More generally, simultaneous $100(1-\alpha)\%$ confidence intervals for all linear combinations $\mathbf{a}'\mu$ of the elements of the mean vector are given by

$$\mathbf{a}'\bar{\mathbf{y}} \pm \sqrt{\frac{(n-1)p}{n-p}f_{(1-\alpha, p, n-p)}}\sqrt{\frac{\mathbf{a}'\mathbf{Sa}}{n}}$$

  • This works for any arbitrary linear combination $\mathbf{a}'\mu = a_1\mu_1 + \dots + a_p\mu_p$, which is a projection onto the axis in the direction of $\mathbf{a}$

  • These intervals have the property that the probability that at least one such interval does not contain the appropriate $\mathbf{a}'\mu$ is no more than $\alpha$

  • These types of intervals can be used for “data snooping” (like [Scheffe])


One μ at a time
  • One-at-a-time confidence intervals:

$$\bar{y}_i \pm t_{(1-\alpha/2, n-1)}\sqrt{\frac{s_{ii}}{n}}$$

  • Each of these intervals has a probability of $1-\alpha$ of covering the appropriate $\mu_i$

  • But they ignore the covariance structure of the $p$ variables

  • If we only care about $k$ simultaneous intervals, we can use the “one at a time” method with the [Bonferroni] correction.

  • This method gets more conservative as the number of intervals $k$ increases.
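A sketch comparing the half-widths of the three interval types above for each component mean (simulated data, made-up dimensions; Bonferroni with $k = p$). With these settings the one-at-a-time intervals are the narrowest and the $T^2$ projection intervals the widest:

```python
import numpy as np
from scipy.stats import f as f_dist, t as t_dist

rng = np.random.default_rng(5)
n, p, alpha = 40, 3, 0.05
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
se = np.sqrt(np.diag(np.cov(Y, rowvar=False)) / n)   # standard error of each mean

# Half-widths of the three interval types for each component mean
t2_hw = np.sqrt((n - 1) * p / (n - p) * f_dist.ppf(1 - alpha, p, n - p)) * se
bonf_hw = t_dist.ppf(1 - alpha / (2 * p), n - 1) * se   # Bonferroni, k = p intervals
one_hw = t_dist.ppf(1 - alpha / 2, n - 1) * se          # one-at-a-time
```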


General Hypothesis Testing

One-sample Tests

$$H_0: \mathbf{C}\mu = \mathbf{0}$$

where

  • $\mathbf{C}$ is a $c \times p$ matrix of rank $c$, where $c \le p$

We can test this hypothesis using the statistic

$$F = \frac{n-c}{(n-1)c}T^2$$

which follows $f_{(c, n-c)}$ under $H_0$, where $T^2 = n(\mathbf{C}\bar{\mathbf{y}})'(\mathbf{CSC}')^{-1}(\mathbf{C}\bar{\mathbf{y}})$

Example:

$$H_0: \mu_1 = \mu_2 = \dots = \mu_p$$

Equivalently,

$$\begin{aligned} \mu_1 - \mu_2 &= 0 \\ &\vdots \\ \mu_{p-1} - \mu_p &= 0 \end{aligned}$$

a total of $p-1$ comparisons. Hence, we have $\mathbf{C}$ as the $(p-1) \times p$ matrix

$$\mathbf{C} = \begin{pmatrix}
1 & -1 & 0 & \dots & 0 \\
0 & 1 & -1 & \dots & 0 \\
\vdots & & \ddots & \ddots & \vdots \\
0 & 0 & \dots & 1 & -1
\end{pmatrix}$$

with number of rows $c = p - 1$.

Equivalently, we can also compare all of the other means to the first mean. Then, we test $\mu_1 - \mu_2 = 0, \mu_1 - \mu_3 = 0, \dots, \mu_1 - \mu_p = 0$, and the $(p-1) \times p$ matrix $\mathbf{C}$ is

$$\mathbf{C} = \begin{pmatrix}
1 & -1 & 0 & \dots & 0 \\
1 & 0 & -1 & \dots & 0 \\
\vdots & & & \ddots & \vdots \\
1 & 0 & 0 & \dots & -1
\end{pmatrix}$$

The value of $T^2$ is invariant to these equivalent choices of $\mathbf{C}$.

This is often used for repeated measures designs, where each subject receives each treatment once over successive periods of time (all treatments are administered to each unit).
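The contrast test above can be sketched in code: build the successive-difference matrix $\mathbf{C}$ and run the $T^2$ test of $H_0: \mu_1 = \dots = \mu_p$ (simulated repeated measures generated under the null, made-up dimensions):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(8)
n, p = 20, 4
Y = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)

# Successive-difference contrasts: rows (1,-1,0,0), (0,1,-1,0), (0,0,1,-1)
C = np.eye(p - 1, p) - np.eye(p - 1, p, k=1)
c = p - 1

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)
Cy = C @ ybar
T2 = n * Cy @ np.linalg.solve(C @ S @ C.T, Cy)
F = (n - c) / ((n - 1) * c) * T2
p_value = f_dist.sf(F, c, n - c)
```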


Example:

Let $y_{ij}$ be the response from subject $i$ at time $j$, for $i = 1, \dots, n$, $j = 1, \dots, T$. In this case, $\mathbf{y}_i = (y_{i1}, \dots, y_{iT})', i = 1, \dots, n$ are a random sample from $N_T(\mu, \Sigma)$.

Let $n = 8$ subjects and $T = 6$. We are interested in $\mu_1, \dots, \mu_6$

$$H_0: \mu_1 = \mu_2 = \dots = \mu_6$$

Equivalently,

$$\mu_1 - \mu_2 = 0, \quad \mu_2 - \mu_3 = 0, \quad \dots, \quad \mu_5 - \mu_6 = 0$$

We can also test orthogonal polynomials. For 4 equally spaced time points, to test, for example, the null hypothesis that the quadratic and cubic effects are jointly equal to 0, we would define $\mathbf{C}$ from the quadratic and cubic contrasts:

$$\mathbf{C} = \begin{pmatrix}
1 & -1 & -1 & 1 \\
-1 & 3 & -3 & 1
\end{pmatrix}$$

Two-Sample Tests

Consider the analogous two sample multivariate tests.

Example: we have data on two independent random samples, one sample from each of two populations

$$\mathbf{y}_{1i} \sim N_p(\mu_1, \Sigma), \quad \mathbf{y}_{2j} \sim N_p(\mu_2, \Sigma)$$

We assume

  • normality

  • equal variance-covariance matrices

  • independent random samples

We can summarize our data using the sufficient statistics $\bar{\mathbf{y}}_1, \mathbf{S}_1, \bar{\mathbf{y}}_2, \mathbf{S}_2$ with respective sample sizes $n_1, n_2$.

Since we assume that $\Sigma_1 = \Sigma_2 = \Sigma$, compute a pooled estimate of the variance-covariance matrix on $n_1 + n_2 - 2$ df:

$$\mathbf{S} = \frac{(n_1 - 1)\mathbf{S}_1 + (n_2 - 1)\mathbf{S}_2}{(n_1 - 1) + (n_2 - 1)}$$

$$H_0: \mu_1 = \mu_2 \quad \text{vs.} \quad H_a: \mu_1 \neq \mu_2$$

At least one element of the mean vectors is different

We use

  • $\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2$ to estimate $\mu_1 - \mu_2$

  • $\mathbf{S}$ to estimate $\Sigma$

    Note: because we assume the two samples are independent, there is no covariance between them

    $$cov(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = var(\bar{\mathbf{y}}_1) + var(\bar{\mathbf{y}}_2) = \frac{\Sigma_1}{n_1} + \frac{\Sigma_2}{n_2} = \Sigma\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

Reject $H_0$ if

$$T^2 = (\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\left\{\mathbf{S}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)\right\}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{S}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) > \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1}f_{(1-\alpha, p, n_1 + n_2 - p - 1)}$$

or equivalently, if

$$F = \frac{n_1 + n_2 - p - 1}{(n_1 + n_2 - 2)p}T^2 > f_{(1-\alpha, p, n_1 + n_2 - p - 1)}$$

A $100(1-\alpha)\%$ confidence region for $\mu_1 - \mu_2$ consists of all vectors $\delta$ which satisfy

$$\frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2 - \delta)'\mathbf{S}^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2 - \delta) \le \frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1}f_{(1-\alpha, p, n_1 + n_2 - p - 1)}$$

The simultaneous confidence intervals for all linear combinations of $\mu_1 - \mu_2$ have the form

$$\mathbf{a}'(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \pm \sqrt{\frac{(n_1 + n_2 - 2)p}{n_1 + n_2 - p - 1}f_{(1-\alpha, p, n_1 + n_2 - p - 1)}} \times \sqrt{\mathbf{a}'\mathbf{Sa}\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$

Bonferroni intervals, for $k$ combinations, are

$$(\bar{y}_{1i} - \bar{y}_{2i}) \pm t_{(1-\alpha/2k, \, n_1 + n_2 - 2)}\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right)s_{ii}}$$
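The two-sample $T^2$ test with the pooled covariance estimate can be sketched on simulated data (two groups generated under the null, made-up sample sizes):

```python
import numpy as np
from scipy.stats import f as f_dist

rng = np.random.default_rng(6)
p, n1, n2 = 2, 25, 30
Y1 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n1)
Y2 = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n2)

ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
S1, S2 = np.cov(Y1, rowvar=False), np.cov(Y2, rowvar=False)
S = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)   # pooled covariance estimate

diff = ybar1 - ybar2
T2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(S, diff)
F = (n1 + n2 - p - 1) / ((n1 + n2 - 2) * p) * T2
p_value = f_dist.sf(F, p, n1 + n2 - p - 1)
```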

Model Assumptions

If the model assumptions are not met:

  • Unequal Covariance Matrices

    • If $n_1 = n_2$ (large samples), there is little effect on the Type I error rate and power of the two-sample test

    • If $n_1 > n_2$ and the eigenvalues of $\Sigma_1\Sigma_2^{-1}$ are less than 1, the Type I error level is inflated

    • If $n_1 > n_2$ and some eigenvalues of $\Sigma_1\Sigma_2^{-1}$ are greater than 1, the Type I error rate is too small, leading to a reduction in power

  • Samples Not Normal

    • The Type I error level of the two-sample $T^2$ test isn't much affected by moderate departures from normality if the two populations being sampled have similar distributions

    • The one-sample $T^2$ test is much more sensitive to lack of normality, especially when the distribution is skewed.

    • Intuitively, a single sample's distribution shows its non-normality directly, but the distribution of the difference between two similar distributions will not be as sensitive.

    • Solutions:

      • Transform to make the data more normal

      • For large samples, use the $\chi^2$ (Wald) test, which does not require normal populations, equal sample sizes, or equal variance-covariance matrices

        • For $H_0: \mu_1 - \mu_2 = \mathbf{0}$, use $$(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\left(\frac{1}{n_1}\mathbf{S}_1 + \frac{1}{n_2}\mathbf{S}_2\right)^{-1}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2) \dot{\sim} \chi^2_{(p)}$$


Equal Covariance Matrices Tests

With independent random samples from $k$ populations of $p$-dimensional vectors, we compute the sample covariance matrix for each, $\mathbf{S}_i$, where $i = 1, \dots, k$.

$$H_0: \Sigma_1 = \Sigma_2 = \dots = \Sigma_k = \Sigma \quad H_a: \text{at least 2 are different}$$

Assuming $H_0$ is true, we would use a pooled estimate of the common covariance matrix $\Sigma$:

$$\mathbf{S} = \frac{\sum_{i=1}^k (n_i - 1)\mathbf{S}_i}{\sum_{i=1}^k (n_i - 1)}$$

with $\sum_{i=1}^k (n_i - 1)$ degrees of freedom.


Bartlett’s Test

(a modification of the likelihood ratio test). Define

$$N = \sum_{i=1}^k n_i$$

and (note: $|\cdot|$ denotes determinants here, not absolute values)

$$M = (N - k)\log|\mathbf{S}| - \sum_{i=1}^k (n_i - 1)\log|\mathbf{S}_i|$$

$$C_1 = 1 - \frac{2p^2 + 3p - 1}{6(p+1)(k-1)}\left\{\sum_{i=1}^k \frac{1}{n_i - 1} - \frac{1}{N - k}\right\}$$

  • Reject $H_0$ when $MC_1 > \chi^2_{1-\alpha, (k-1)p(p+1)/2}$

  • If not all samples are from normal populations, $MC_1$ has a distribution which is often shifted to the right of the nominal $\chi^2$ distribution, which means $H_0$ is often rejected even when it is true (the Type I error level is inflated). Hence, it is better to test univariate normality first, and then multivariate normality, before you do Bartlett's test.
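A sketch of the test with $k = 2$ simulated groups sharing a common covariance (made-up sample sizes; group data generated under $H_0$):

```python
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(7)
p, ns = 2, [30, 35]
samples = [rng.multivariate_normal(np.zeros(p), np.eye(p), size=n) for n in ns]
k, N = len(ns), sum(ns)

S_list = [np.cov(Y, rowvar=False) for Y in samples]
S_pool = sum((n - 1) * S for n, S in zip(ns, S_list)) / (N - k)   # pooled estimate

M = (N - k) * np.log(np.linalg.det(S_pool)) \
    - sum((n - 1) * np.log(np.linalg.det(S)) for n, S in zip(ns, S_list))
C1 = 1 - (2 * p**2 + 3 * p - 1) / (6 * (p + 1) * (k - 1)) \
    * (sum(1 / (n - 1) for n in ns) - 1 / (N - k))

df = (k - 1) * p * (p + 1) / 2
reject = M * C1 > chi2.ppf(0.95, df)   # test at alpha = 0.05
```

$M \ge 0$ always holds here, since $\log|\cdot|$ is concave and $\mathbf{S}$ is a weighted average of the $\mathbf{S}_i$.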


Two-Sample Repeated Measurements

  • Define $\mathbf{y}_{hi} = (y_{hi1}, \dots, y_{hit})'$ to be the observations from the $i$-th subject in the $h$-th group for times 1 through $t$

  • Assume that $\mathbf{y}_{11}, \dots, \mathbf{y}_{1n_1}$ are iid $N_t(\mu_1, \Sigma)$ and that $\mathbf{y}_{21}, \dots, \mathbf{y}_{2n_2}$ are iid $N_t(\mu_2, \Sigma)$

  • $H_0: \mathbf{C}(\mu_1 - \mu_2) = \mathbf{0}_c$, where $\mathbf{C}$ is a $c \times t$ matrix of rank $c$, where $c \le t$

  • The test statistic has the form

$$T^2 = \frac{n_1 n_2}{n_1 + n_2}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)'\mathbf{C}'(\mathbf{CSC}')^{-1}\mathbf{C}(\bar{\mathbf{y}}_1 - \bar{\mathbf{y}}_2)$$

where $\mathbf{S}$ is the pooled covariance estimate. Then,

$$F = \frac{n_1 + n_2 - c - 1}{(n_1 + n_2 - 2)c}T^2 \sim f_{(c, n_1 + n_2 - c - 1)}$$

when $H_0$ is true.

If the null hypothesis $H_0: \mu_1 = \mu_2$ is rejected, a weaker hypothesis is that the profiles for the two groups are parallel:

$$\begin{aligned} \mu_{11} - \mu_{21} &= \mu_{12} - \mu_{22} \\ &\vdots \\ \mu_{1,t-1} - \mu_{2,t-1} &= \mu_{1t} - \mu_{2t} \end{aligned}$$

The null hypothesis is then $H_0: \mathbf{C}(\mu_1 - \mu_2) = \mathbf{0}_c$, where $c = t - 1$ and

$$\mathbf{C} = \begin{pmatrix}
1 & -1 & 0 & \dots & 0 & 0 \\
0 & 1 & -1 & \dots & 0 & 0 \\
\vdots & & \ddots & \ddots & & \vdots \\
0 & 0 & 0 & \dots & 1 & -1
\end{pmatrix}_{(t-1) \times t}$$