Understanding Classification Techniques
Data Science for Everyone
Discriminant Analysis
Suppose we have two or more different populations from which observations could come. Discriminant analysis seeks to determine which of the possible populations an observation comes from, while making as few mistakes as possible.
This is an alternative to logistic regression, with the following advantages:

- When there is clear separation between classes, the parameter estimates for the logistic regression model can be surprisingly unstable, while discriminant approaches do not suffer from this problem.
- If $X$ is approximately normal in each of the classes and the sample size is small, then discriminant approaches can be more accurate than logistic regression.
Notation
Similar to MANOVA, let $\mathbf{y}_{j1}, \mathbf{y}_{j2}, \dots, \mathbf{y}_{jn_j}$ be a random sample from population $j$, for $j = 1, \dots, h$. Let $f_j(\cdot)$ denote the density of population $j$.

- Assume that each observation is from one of $h$ possible populations.
- We want to form a discriminant rule that will allocate an observation $\mathbf{y}$ to population $j$ when $\mathbf{y}$ is in fact from this population.
Known Populations
The maximum likelihood discriminant rule for assigning an observation $\mathbf{y}$ to one of the $h$ populations is: allocate $\mathbf{y}$ to the population that gives the largest likelihood to $\mathbf{y}$.

Consider the likelihood for a single observation $\mathbf{y}$: under population $j$ it is simply the density of population $j$ evaluated at $\mathbf{y}$, $L_j(\mathbf{y}) = f_j(\mathbf{y})$. Since the rule compares these likelihoods, we allocate $\mathbf{y}$ to population $j$ when $L_j(\mathbf{y}) = \max_i L_i(\mathbf{y})$.
Consider a simple univariate example. Suppose we have data from one of two binomial populations.
- The first population has $n_1$ trials with success probability $p_1$.
- The second population has $n_2$ trials with success probability $p_2$.

To which population would we assign an observation $y$?

Note: we compare the binomial likelihoods $L_1(y)$ and $L_2(y)$, where $L_j(y) = \binom{n_j}{y} p_j^y (1 - p_j)^{n_j - y}$ is the binomial likelihood under population $j$. In this example the second population gives the larger likelihood; hence, we choose the second population.
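Since the specific trial counts and success probabilities from the example are not reproduced above, the following sketch uses hypothetical values (`n1`, `p1`, `n2`, `p2`, and the observed count `y` are all placeholders) just to illustrate comparing the two binomial likelihoods:

```python
from scipy.stats import binom

# Hypothetical populations (placeholder values, not the ones from the example above)
n1, p1 = 10, 0.5   # population 1: 10 trials, success probability 0.5
n2, p2 = 10, 0.7   # population 2: 10 trials, success probability 0.7
y = 7              # hypothetical observed number of successes

# Maximum likelihood discriminant rule: pick the population with the larger likelihood
L1 = binom.pmf(y, n1, p1)
L2 = binom.pmf(y, n2, p2)
print(f"L1 = {L1:.4f}, L2 = {L2:.4f}")
print("Assign to population", 1 if L1 > L2 else 2)
```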
Another example
We have 2 normal populations with different means and variances:

- First population: $N(\mu_1, \sigma_1^2)$
- Second population: $N(\mu_2, \sigma_2^2)$

The likelihood for a single observation $y$ under population $j$ is

$$L_j(y) = \frac{1}{\sqrt{2\pi \sigma_j^2}} \exp\left\{ -\frac{(y - \mu_j)^2}{2\sigma_j^2} \right\}.$$

Consider the likelihood ratio rule based on

$$\Lambda = \frac{L_1(y)}{L_2(y)} = \frac{\sigma_2}{\sigma_1} \exp\left\{ -\frac{(y - \mu_1)^2}{2\sigma_1^2} + \frac{(y - \mu_2)^2}{2\sigma_2^2} \right\}.$$

Hence, we classify into

- population 1 if $\Lambda > 1$
- population 2 if $\Lambda < 1$
- for ties ($\Lambda = 1$), flip a coin
Another way to think about it: taking logs of the likelihood ratio, we classify into population 1 if the "standardized distance" of $y$ from $\mu_1$ is small enough relative to the standardized distance of $y$ from $\mu_2$, i.e., if

$$\frac{(y - \mu_1)^2}{\sigma_1^2} - \frac{(y - \mu_2)^2}{\sigma_2^2} < 2 \log \frac{\sigma_2}{\sigma_1}.$$

(Significant simplification occurs in the special case where $\sigma_1^2 = \sigma_2^2 = \sigma^2$.) Thus, we classify into population 1 if

$$(y - \mu_1)^2 < (y - \mu_2)^2,$$

or, expanding and canceling the common $y^2$ term,

$$-2 y \mu_1 + \mu_1^2 < -2 y \mu_2 + \mu_2^2,$$

and rearranging gives the discriminant quantity

$$-2 y (\mu_1 - \mu_2) + (\mu_1^2 - \mu_2^2).$$

Thus, we classify into population 1 if this is less than 0 (and into population 2 otherwise).

The discriminant classification rule is linear in $y$ in this case.
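A minimal sketch of this linear rule for the equal-variance case, with hypothetical means chosen only for illustration:

```python
import numpy as np

def classify_univariate(y, mu1, mu2):
    """Equal-variance case: classify to population 1 if
    -2*y*(mu1 - mu2) + (mu1**2 - mu2**2) < 0, i.e. if y is closer to mu1."""
    score = -2 * y * (mu1 - mu2) + (mu1**2 - mu2**2)
    return 1 if score < 0 else 2

# Hypothetical means (assumed values, midpoint is 1.5)
mu1, mu2 = 0.0, 3.0
for y in [0.5, 1.4, 1.6, 4.0]:
    print(y, "->", classify_univariate(y, mu1, mu2))
```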
Multivariate Expansion
Suppose that there are 2 multivariate normal populations, $N_p(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $N_p(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. Comparing the log-likelihoods leads to the quantity

$$(\mathbf{y} - \boldsymbol{\mu}_1)' \boldsymbol{\Sigma}_1^{-1} (\mathbf{y} - \boldsymbol{\mu}_1) - (\mathbf{y} - \boldsymbol{\mu}_2)' \boldsymbol{\Sigma}_2^{-1} (\mathbf{y} - \boldsymbol{\mu}_2) + \log \frac{|\boldsymbol{\Sigma}_1|}{|\boldsymbol{\Sigma}_2|}.$$

Again, we classify into population 1 if this is less than 0; otherwise, population 2. Like the univariate case with unequal variances, this is a quadratic discriminant rule.
And if the covariance matrices are equal ($\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$), the rule reduces to: classify into population 1 if

$$(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \boldsymbol{\Sigma}^{-1} \mathbf{y} - \frac{1}{2} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 + \boldsymbol{\mu}_2) > 0.$$

This linear discriminant rule is also referred to as Fisher's linear discriminant function.
By assuming the covariance matrices are equal, we assume that the shape and orientation of the two populations must be the same (which can be a strong restriction).

In other words, each variable can have a different mean across populations, but the same variance.
- Note: The LDA Bayes decision boundary is linear, so a quadratic decision boundary might lead to better classification in some settings. Moreover, it is the assumption of the same variance/covariance matrix across all classes for the Gaussian densities that imposes the linear rule. If we instead allow the predictors in each class to follow a multivariate normal distribution with class-specific mean vectors and variance/covariance matrices, we get Quadratic Discriminant Analysis (QDA). But then there are more parameters to estimate (which gives more flexibility than LDA) at the cost of more variance (the bias-variance tradeoff).
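As a rough illustration of this tradeoff, the sketch below fits LDA and QDA on simulated Gaussian data with class-specific covariance matrices (all data and settings are made up for illustration), assuming scikit-learn is available:

```python
import numpy as np
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

rng = np.random.default_rng(0)

# Simulate two classes with different covariance matrices (QDA's setting)
n = 500
X1 = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 1.0]], n)
X2 = rng.multivariate_normal([1.5, 1.5], [[1.0, -0.6], [-0.6, 2.0]], n)
X = np.vstack([X1, X2])
y = np.repeat([0, 1], n)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# Training accuracy; with unequal covariances QDA's quadratic boundary can help
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```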
When the two populations are multivariate normal with a common covariance matrix, the probability of misclassifying an observation from population 1 into population 2 under this rule is

$$\Phi\left(-\frac{\Delta}{2}\right), \qquad \text{where } \Delta^2 = (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)' \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$$

and $\Phi$ is the standard normal cdf.
Suppose there are more than two populations; the maximum likelihood rule generalizes directly: allocate $\mathbf{y}$ to the population with the largest likelihood $L_j(\mathbf{y})$.
Bayes Discriminant Rules
If we know that population $j$ has prior probability $\pi_j$ (with $\pi_j > 0$ and $\sum_j \pi_j = 1$), the Bayes discriminant rule allocates an observation $\mathbf{y}$ to the population for which

$$\pi_j L_j(\mathbf{y})$$

is largest (equivalently, the population with the largest posterior probability).

Note:

- The maximum likelihood discriminant rule is a special case of the Bayes discriminant rule, where it sets all the priors equal, $\pi_j = 1/h$.
Optimal Properties of Bayes Discriminant Rules
Let $p_{ii}$ be the probability of correctly assigning an observation from population $i$. Then one rule (with probabilities $p_{ii}$) is as good as another rule (with probabilities $p_{ii}'$) if

$$p_{ii} \ge p_{ii}' \quad \text{for all } i.$$

The first rule is better than the alternative if, in addition, $p_{ii} > p_{ii}'$ for at least one $i$. A rule for which there is no better alternative is called admissible.
Bayes Discriminant Rules are admissible
If we utilize prior probabilities, then we can form the posterior probability of a correct allocation, $\sum_i \pi_i p_{ii}$.
Bayes Discriminant Rules have the largest possible posterior probability of correct allocation with respect to the prior
These properties show that the Bayes discriminant rule is our best approach.
Unequal Cost
We want to consider the cost of misallocation. Define $c(i|j)$ to be the cost associated with allocating a member of population $j$ to population $i$.

- Assume that $c(i|j) > 0$ for all $i \ne j$, and $c(i|j) = 0$ if $i = j$.

We could determine the expected amount of loss for an observation allocated to population $i$ as

$$E(\text{loss}) = \sum_j c(i|j)\, p_{ij},$$

where the $p_{ij}$ are the probabilities of allocating an observation from population $j$ into population $i$. We want to minimize the amount of loss expected for our rule. Using Bayes discrimination, allocate $\mathbf{y}$ to the population $j$ which minimizes

$$\sum_{i \ne j} c(j|i)\, \pi_i\, L_i(\mathbf{y}).$$

We could assign equal prior probabilities to each group and get a maximum-likelihood-type rule; here, we would allocate $\mathbf{y}$ to the population $j$ which minimizes

$$\sum_{i \ne j} c(j|i)\, L_i(\mathbf{y}).$$
Example:
Two binomial populations, each of size 10, with success probabilities $p_1$ and $p_2$, and the prior probability of being in the first population is 0.9.

However, suppose the cost of inappropriately allocating into the first population is 1 and the cost of incorrectly allocating into the second population is 5. In this case, we pick population 1 over population 2.
In general, we consider two allocation regions, $R_1$ and $R_2$, where

$$R_1 = \left\{ \mathbf{y} : \frac{L_1(\mathbf{y})}{L_2(\mathbf{y})} \ge \frac{c(1|2)}{c(2|1)} \cdot \frac{\pi_2}{\pi_1} \right\}$$

and $R_2$ is its complement; observations in $R_1$ are allocated to population 1 and those in $R_2$ to population 2.
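Because the success probabilities in the example are not reproduced above, the sketch below uses hypothetical binomial parameters together with the stated prior (0.9) and costs (1 and 5) to show how the minimum-expected-cost rule compares the two allocations:

```python
from scipy.stats import binom

# Priors and costs as stated in the example
pi1, pi2 = 0.9, 0.1
c_1_given_2 = 1   # cost of allocating to population 1 an observation really from population 2
c_2_given_1 = 5   # cost of allocating to population 2 an observation really from population 1

# Hypothetical binomial populations (success probabilities are placeholders)
n, p1, p2 = 10, 0.5, 0.7
y = 7  # hypothetical observation

L1 = binom.pmf(y, n, p1)
L2 = binom.pmf(y, n, p2)

# Expected cost of each allocation; allocate to the population with the smaller cost
cost_alloc_1 = c_1_given_2 * pi2 * L2   # loss incurred only if y is really from population 2
cost_alloc_2 = c_2_given_1 * pi1 * L1   # loss incurred only if y is really from population 1
print("Expected cost if allocated to 1:", round(cost_alloc_1, 4))
print("Expected cost if allocated to 2:", round(cost_alloc_2, 4))
print("Allocate to population", 1 if cost_alloc_1 < cost_alloc_2 else 2)
```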
Discrimination Under Estimation
Suppose we know the form of the distributions for the populations of interest, but we still have to estimate the parameters.
Example:
we know the distributions are multivariate normal, but we have to estimate the means and variances
The maximum likelihood discriminant rule allocates an observation $\mathbf{y}$ to population $j$ when

$$\hat{L}_j(\mathbf{y}) = \max_i \hat{L}_i(\mathbf{y}),$$

where $\hat{L}_i$ is the likelihood with the maximum likelihood estimates substituted for the unknown parameters.

For instance, suppose we have 2 multivariate normal populations with distinct means but a common variance-covariance matrix. The MLEs for $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ are the sample mean vectors $\bar{\mathbf{y}}_1$ and $\bar{\mathbf{y}}_2$, and the common $\boldsymbol{\Sigma}$ is estimated by pooling the two samples.

Thus, an estimated discriminant rule can be formed by substituting these sample values for the population values.
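A minimal sketch of this plug-in rule for two multivariate normal populations with a common covariance matrix, using sample means and a pooled covariance estimate (simulated data, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated samples from two populations with a common covariance matrix
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.4], [0.4, 1.0]])
Y1 = rng.multivariate_normal(mu1, Sigma, 100)
Y2 = rng.multivariate_normal(mu2, Sigma, 120)

# Plug-in estimates: sample means and pooled covariance
ybar1, ybar2 = Y1.mean(axis=0), Y2.mean(axis=0)
S1, S2 = np.cov(Y1, rowvar=False), np.cov(Y2, rowvar=False)
n1, n2 = len(Y1), len(Y2)
S_pooled = ((n1 - 1) * S1 + (n2 - 1) * S2) / (n1 + n2 - 2)

def classify(y):
    """Estimated Fisher linear discriminant rule: population 1 if the
    discriminant score is positive, population 2 otherwise."""
    a = np.linalg.solve(S_pooled, ybar1 - ybar2)     # S^-1 (ybar1 - ybar2)
    score = a @ y - 0.5 * a @ (ybar1 + ybar2)
    return 1 if score > 0 else 2

print(classify(np.array([0.1, -0.2])))  # expected: 1
print(classify(np.array([2.2, 1.1])))   # expected: 2
```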
Naive Bayes
The challenge with classification using Bayes' theorem is that we don't know the (true) class densities $f_k(\mathbf{x})$, while LDA and QDA make strong multivariate normality assumptions to deal with this.

Naive Bayes makes only one assumption: within the $k$-th class, the $p$ predictors are independent, i.e., for $k = 1, \dots, K$,

$$f_k(\mathbf{x}) = f_{k1}(x_1) \times f_{k2}(x_2) \times \dots \times f_{kp}(x_p),$$

where $f_{kj}$ is the density of the $j$-th predictor in class $k$.

This assumption allows us to work with the joint density without having to model the dependence between the predictors. The (naive) assumption can be unrealistic, but the method still works well in cases where the sample size $n$ is not large relative to the number of features $p$.
With this assumption, we have

$$P(Y = k \mid X = \mathbf{x}) = \frac{\pi_k \, f_{k1}(x_1) \times \dots \times f_{kp}(x_p)}{\sum_{l=1}^{K} \pi_l \, f_{l1}(x_1) \times \dots \times f_{lp}(x_p)},$$

so we only need to estimate the one-dimensional density functions $f_{kj}$:

- When $X_j$ is quantitative, we can assume it has a univariate normal distribution (with independence), which is more restrictive than QDA because it assumes the predictors are independent (e.g., a diagonal covariance matrix); see the sketch after this list.
- When $X_j$ is quantitative, we can instead use a kernel density estimator [Kernel Methods], which is a smoothed histogram.
- When $X_j$ is qualitative, we count the proportion of training observations for the $j$-th predictor corresponding to each class.
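A quick sketch of the normal-density option using scikit-learn's `GaussianNB`, which assumes a univariate normal for each predictor within each class (simulated data, for illustration only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Simulated data: within each class, predictors are independent normals
n = 300
X0 = rng.normal(loc=[0, 0, 0], scale=[1, 2, 1], size=(n, 3))
X1 = rng.normal(loc=[1, 1, 2], scale=[1, 2, 1], size=(n, 3))
X = np.vstack([X0, X1])
y = np.repeat([0, 1], n)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
nb = GaussianNB().fit(X_train, y_train)
print("Test accuracy:", nb.score(X_test, y_test))
print("Class priors estimated from the data:", nb.class_prior_)
```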
Comparison of Classification Methods
Assume we have $K$ classes and class $K$ is the baseline (following James, Witten, Hastie, and Tibshirani). We compare the log odds of each class relative to the $K$-th class.
Logistic Regression

$$\log \left( \frac{P(Y = k \mid X = \mathbf{x})}{P(Y = K \mid X = \mathbf{x})} \right) = \beta_{k0} + \sum_{j=1}^{p} \beta_{kj} x_j$$

LDA

$$\log \left( \frac{P(Y = k \mid X = \mathbf{x})}{P(Y = K \mid X = \mathbf{x})} \right) = a_k + \sum_{j=1}^{p} b_{kj} x_j,$$

where $a_k$ and $b_{kj}$ are functions of the prior probabilities $\pi_k, \pi_K$, the class means, and the common covariance matrix $\boldsymbol{\Sigma}$.

Similar to logistic regression, LDA assumes the log odds is linear in $\mathbf{x}$.

Even though they appear to have the same form, the parameters in logistic regression are estimated by maximum likelihood, whereas the LDA linear parameters are specified by the prior and the normal distributions.

We expect LDA to outperform logistic regression when the normality assumption (approximately) holds, and logistic regression to perform better when it does not.

QDA

$$\log \left( \frac{P(Y = k \mid X = \mathbf{x})}{P(Y = K \mid X = \mathbf{x})} \right) = a_k + \sum_{j=1}^{p} b_{kj} x_j + \sum_{j=1}^{p} \sum_{l=1}^{p} c_{kjl} x_j x_l,$$

where $a_k$, $b_{kj}$, and $c_{kjl}$ are functions of the prior probabilities, the class means, and the class-specific covariance matrices.

Naive Bayes

$$\log \left( \frac{P(Y = k \mid X = \mathbf{x})}{P(Y = K \mid X = \mathbf{x})} \right) = a_k + \sum_{j=1}^{p} g_{kj}(x_j),$$

where $a_k = \log(\pi_k / \pi_K)$ and $g_{kj}(x_j) = \log\left( f_{kj}(x_j) / f_{Kj}(x_j) \right)$, so the log odds is additive in the predictors.
Summary
- LDA is a special case of QDA (with $\boldsymbol{\Sigma}_k = \boldsymbol{\Sigma}$ for all $k$).
- LDA is robust when it comes to high dimensions.
- Any classifier with a linear decision boundary is a special case of naive Bayes with $g_{kj}(x_j) = b_{kj} x_j$, which means LDA is a special case of naive Bayes. LDA assumes that the features are normally distributed with a common within-class covariance matrix, and naive Bayes assumes independence of the features.
- Naive Bayes is also a special case of LDA with $\boldsymbol{\Sigma}$ restricted to a diagonal matrix with diagonal entries $\sigma_j^2$ (another notation: $\operatorname{diag}(\sigma_1^2, \dots, \sigma_p^2)$), assuming each $f_{kj}$ is a normal density.
- QDA and naive Bayes are not special cases of each other. In principle, naive Bayes can produce a more flexible fit through the choice of $g_{kj}$, but it is restricted to a purely additive fit, whereas QDA includes multiplicative terms of the form $c_{kjl} x_j x_l$.
- None of these methods uniformly dominates the others: the choice of method depends on the true distribution of the predictors in each of the $K$ classes, as well as $n$ and $p$ (i.e., the bias-variance tradeoff).
Compared to the non-parametric method KNN:

- KNN would outperform both LDA and logistic regression when the decision boundary is highly nonlinear, but it cannot tell us which predictors are most important, and it requires many observations.
- KNN is also limited in high dimensions due to the curse of dimensionality.
- Since QDA is a special type of nonlinear decision boundary (quadratic), it can be considered a compromise between the linear methods and KNN classification. QDA can perform well with fewer training observations than KNN, but it is not as flexible.
From simulation:
| True decision boundary | Best performance |
|---|---|
| Linear | LDA + Logistic regression |
| Moderately nonlinear | QDA + Naive Bayes |
| Highly nonlinear (many training observations, p not large) | KNN |
- Like linear regression, we can also introduce flexibility by including transformed features (e.g., $X^2$, $X^3$, $X_1 \times X_2$), as sketched below.
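As a rough sketch of that idea, we can feed polynomial transformations of the predictors into a linear classifier (assuming scikit-learn; data simulated for illustration):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(3)

# Simulated data with a quadratic (nonlinear) true boundary
X = rng.uniform(-2, 2, size=(400, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 2).astype(int)

linear = LogisticRegression().fit(X, y)
quadratic = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression()).fit(X, y)

print("Linear features accuracy:   ", linear.score(X, y))
print("Quadratic features accuracy:", quadratic.score(X, y))
```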
Probabilities of Misclassification
When the distributions are exactly known, we can determine the misclassification probabilities exactly. However, when we need to estimate the population parameters, we also have to estimate the probability of misclassification.
- Naive method: plug the parameter estimates into the formulas for the misclassification probabilities to obtain estimates of those probabilities. This will tend to be optimistic when the number of samples in one or more populations is small.
- Resubstitution method: use the proportion of the samples from population $i$ that would be allocated to another population as an estimate of the misclassification probability. This is also optimistic when the number of samples is small.
- Jack-knife estimates: the above two methods use the same observations both to estimate the parameters and to estimate the misclassification probabilities of the resulting discriminant rule. Alternatively, we determine the discriminant rule based upon all of the data except the $k$-th observation from the $j$-th population, then determine whether the $k$-th observation would be misclassified under this rule. We perform this process for all $n_j$ observations in population $j$; an estimate of the misclassification probability is the fraction of observations that were misclassified. We then repeat the process for the other populations. This method is more reliable than the others, but it is also more computationally intensive.
Cross-Validation
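The jack-knife idea above is leave-one-out cross-validation; more generally, cross-validation estimates the misclassification rate by repeatedly holding out part of the data, fitting the rule on the rest, and scoring the held-out observations. A minimal sketch, assuming scikit-learn and simulated data:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (60, 2)), rng.normal(1.5, 1, (60, 2))])
y = np.repeat([0, 1], 60)

# Leave-one-out (jack-knife style) estimate of the error rate
acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=LeaveOneOut())
print("Estimated misclassification probability:", 1 - acc.mean())
```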
Summary
Consider the group-specific densities $f_j(\mathbf{x})$ for group $j$. Assuming equal misclassification costs, the Bayes posterior probability of membership in group $j$ is

$$p(j \mid \mathbf{x}) = \frac{\pi_j \, f_j(\mathbf{x})}{\sum_{k=1}^{h} \pi_k \, f_k(\mathbf{x})},$$

where there are $h$ groups. We then classify $\mathbf{x}$ into the group for which this probability of membership is largest.

Alternatively, we can write this in terms of a generalized squared distance formulation,

$$D_j^2(\mathbf{x}) = d_j^2(\mathbf{x}) + g_1(j) + g_2(j),$$

where

- $d_j^2(\mathbf{x}) = (\mathbf{x} - \bar{\mathbf{x}}_j)' \mathbf{S}_j^{-1} (\mathbf{x} - \bar{\mathbf{x}}_j)$ is the squared Mahalanobis distance from $\mathbf{x}$ to the centroid of group $j$, using the within-group covariance matrix $\mathbf{S}_j$ if the within-group covariance matrices are not equal, or the pooled estimate $\mathbf{S}_p$ if a pooled covariance estimate is appropriate;
- $g_1(j) = \log |\mathbf{S}_j|$ if the within-group covariance matrices are not equal, and $0$ if a pooled covariance estimate is used;
- $g_2(j) = -2 \log(\pi_j)$ if the prior probabilities are not equal, and $0$ if they are all equal.

Then, the posterior probability of belonging to group $j$ is

$$p(j \mid \mathbf{x}) = \frac{\exp\left\{-\frac{1}{2} D_j^2(\mathbf{x})\right\}}{\sum_{k=1}^{h} \exp\left\{-\frac{1}{2} D_k^2(\mathbf{x})\right\}},$$

and we classify $\mathbf{x}$ into the group with the largest posterior probability (equivalently, the smallest generalized squared distance).
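A small sketch of this computation for the pooled-covariance case with unequal priors; the centroids, pooled covariance, and priors below are hypothetical values used only to illustrate the formulas:

```python
import numpy as np

def posterior_probs(x, means, S_pooled, priors):
    """Posterior membership probabilities from generalized squared distances,
    using a pooled covariance (so the log|S_j| term drops out)."""
    Sinv = np.linalg.inv(S_pooled)
    D2 = []
    for mean, prior in zip(means, priors):
        d2 = (x - mean) @ Sinv @ (x - mean)    # squared Mahalanobis distance
        D2.append(d2 - 2 * np.log(prior))      # generalized squared distance
    D2 = np.array(D2)
    w = np.exp(-0.5 * (D2 - D2.min()))         # subtract the min for numerical stability
    return w / w.sum()

# Hypothetical group centroids, pooled covariance, and priors
means = [np.array([0.0, 0.0]), np.array([2.0, 1.0]), np.array([0.0, 3.0])]
S_pooled = np.array([[1.0, 0.2], [0.2, 1.0]])
priors = [0.5, 0.3, 0.2]

probs = posterior_probs(np.array([1.0, 0.5]), means, S_pooled, priors)
print(probs, "-> classify into group", probs.argmax() + 1)
```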
Assessing Classification Performance
For binary classification, we use the confusion matrix:
|  | Predicted: − or Null | Predicted: + or Non-null | Total |
|---|---|---|---|
| True: − or Null | True Neg. (TN) | False Pos. (FP) | N |
| True: + or Non-null | False Neg. (FN) | True Pos. (TP) | P |
| Total | N\* | P\* |  |
And Table 4.6 from [@james2013]:

| Name | Definition | Synonyms |
|---|---|---|
| False Pos. rate | FP/N | Type I error, 1 − specificity |
| True Pos. rate | TP/P | 1 − Type II error, power, sensitivity, recall |
| Pos. Pred. value | TP/P\* | Precision, 1 − false discovery proportion |
| Neg. Pred. value | TN/N\* |  |
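A short sketch computing these quantities from a confusion matrix, assuming scikit-learn and made-up labels:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (0 = negative/null, 1 = positive)
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("False positive rate (FP/N):", fp / (fp + tn))
print("True positive rate  (TP/P):", tp / (tp + fn))
print("Precision          (TP/P*):", tp / (tp + fp))
print("Neg. pred. value   (TN/N*):", tn / (tn + fn))
```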
The ROC curve (Receiver Operating Characteristic) is a graphical comparison between sensitivity (the true positive rate) and specificity (where 1 − specificity is the false positive rate):

- y-axis: true positive rate
- x-axis: false positive rate

as we change the threshold for classifying an observation as positive from 0 to 1.

The AUC (area under the ROC curve) ideally equals 1; a classifier that performs no better than pure chance has AUC = 0.5.
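A minimal ROC/AUC sketch using predicted probabilities from LDA on simulated data (assuming scikit-learn; plotting the curve is optional and omitted here):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import roc_curve, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(5)
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(1, 1, (200, 2))])
y = np.repeat([0, 1], 200)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)
scores = lda.predict_proba(X_test)[:, 1]          # P(class = 1) for each test point

fpr, tpr, thresholds = roc_curve(y_test, scores)  # sweep the classification threshold
print("AUC:", roc_auc_score(y_test, scores))      # 1 is ideal, 0.5 is pure chance
```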