Generative learning algorithms

Generative learning algorithms study record

Categories: Statistics Estimation, Optimization, MLE

Author: Creo Hsia

Published: November 14, 2024

There are two philosophies in classification problems:

  1. Decision Boundary Approach (e.g., Logistic Regression, Perceptron Algorithm):
    • This approach focuses on finding a decision boundary, typically a straight line, that separates two classes (e.g., elephants and dogs) in the feature space.
    • A model is trained on a given dataset, and once the boundary is established, a new input is classified by checking which side of the boundary it falls on.
    • The decision-making process involves a comparison of the position of the new data point relative to the learned boundary.
  2. Model-Based Approach:
    • Instead of finding a direct boundary, this approach builds separate models for each class (e.g., a model of what elephants look like and a model of what dogs look like).
    • When classifying a new input, the model compares how closely the input resembles each class model.
    • The final classification is based on which model the new input matches more closely, indicating which class it is most similar to.

In essence, while the first approach relies on establishing a single boundary to differentiate between classes, the second approach builds individual class models and uses similarity matching for classification.

Algorithms that aim to learn p(y|x) directly, like logistic regression, or those that map inputs X directly to labels (e.g., the perceptron algorithm), are known as discriminative learning algorithms. In contrast, the focus here is on algorithms that model p(x|y) and p(y) instead. These algorithms are called generative learning algorithms.

After modeling \(p(y)\) (called the class prior) and \(p(x|y)\), we can use Bayes' rule to derive the posterior distribution of \(y\) given \(x\): \[ p(y \mid x)=\frac{p(x \mid y) p(y)}{p(x)}. \] The denominator is given by \(p(x)=p(x\mid y=1)p(y=1)+p(x|y=0)p(y=0)\), so it too can be expressed in terms of the quantities \(p(x|y)\) and \(p(y)\) that we’ve learned. In fact, if we are computing \(p(y|x)\) only to make a prediction, we don’t need to calculate the denominator at all.
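Concretely, since the denominator \(p(x)\) does not depend on \(y\), the prediction can be written as \[ \arg\max_y p(y \mid x)=\arg\max_y \frac{p(x \mid y) p(y)}{p(x)}=\arg\max_y p(x \mid y) p(y). \]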

1 Gaussian discriminant analysis

1.1 The Gaussian discriminant analysis model

When we have a classification problem in which the input features \(x\) are continuous-valued random variables, we can then use the Gaussian Discriminant Analysis (GDA) model, which models \(p(x|y)\) using a multivariate normal distribution. \[ \begin{aligned} y & \sim \operatorname{Bernoulli}(\phi) \\ x \mid y=0 & \sim \mathcal{N}\left(\mu_0, \Sigma\right) \\ x \mid y=1 & \sim \mathcal{N}\left(\mu_1, \Sigma\right) \end{aligned} \] The log-likelihood of the data is given by \[ \begin{aligned} \ell\left(\phi, \mu_0, \mu_1, \Sigma\right) & =\log \prod_{i=1}^n p\left(x^{(i)}, y^{(i)} ; \phi, \mu_0, \mu_1, \Sigma\right) \\ & =\log \prod_{i=1}^n p\left(x^{(i)} \mid y^{(i)} ; \mu_0, \mu_1, \Sigma\right) p\left(y^{(i)} ; \phi\right) . \end{aligned} \] By maximizing \(\ell\) with respect to the parameters, we find the maximum likelihood estimate of the parameters to be: \[ \begin{aligned} \phi & =\frac{1}{n} \sum_{i=1}^n 1\left\{y^{(i)}=1\right\} \\ \mu_0 & =\frac{\sum_{i=1}^n 1\left\{y^{(i)}=0\right\} x^{(i)}}{\sum_{i=1}^n 1\left\{y^{(i)}=0\right\}} \\ \mu_1 & =\frac{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\} x^{(i)}}{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\}} \\ \Sigma & =\frac{1}{n} \sum_{i=1}^n\left(x^{(i)}-\mu_{y^{(i)}}\right)\left(x^{(i)}-\mu_{y^{(i)}}\right)^T \end{aligned} \] Then we have

\[\begin{aligned} p(x|y=1)p(y=1)&=\phi_1\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1)\right), \\ p(x|y=0)p(y=0)&=\phi_0\frac{1}{(2 \pi)^{d / 2}|\Sigma|^{1 / 2}} \exp \left(-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0)\right), \end{aligned}\]

where \(\phi_1=\phi=p(y=1)\) and \(\phi_0=1-\phi=p(y=0)\),

and hence

\[\begin{aligned} \log p(x|y=1)p(y=1)=&\log \phi_1-\frac{1}{2}(x-\mu_1)^T \Sigma^{-1}(x-\mu_1)+C,\\ \log p(x|y=0)p(y=0)=&\log \phi_0-\frac{1}{2}(x-\mu_0)^T \Sigma^{-1}(x-\mu_0)+C,\\ \end{aligned}\]

where \(C=-\log\left((2\pi)^{d/2}|\Sigma|^{1/2}\right)\) is the same constant in both lines.

We assign \(x\) to \(y=1\) if \[ \log p(x|y=1)p(y=1)-\log p(x|y=0)p(y=0)>0. \] That is, \[ \begin{aligned} &\log(\phi_1 /\phi_0)-\frac{1}{2}\left((x-\mu_1)^T \Sigma^{-1}(x-\mu_1)-(x-\mu_0)^T \Sigma^{-1}(x-\mu_0) \right)\\ =&\log(\phi_1 /\phi_0)-\frac{1}{2}\left(x-\mu_1+x-\mu_0 \right)^T\Sigma^{-1} \left(\mu_0-\mu_1 \right)\\ =&\log(\phi_1 /\phi_0)+\left(x-\frac{\mu_1+\mu_0}{2} \right)^T\Sigma^{-1} \left(\mu_1-\mu_0 \right)>0. \end{aligned} \] This is a linear function of \(x\). (One can show that if the two classes have different covariance matrices, the decision boundary is a quadratic function of \(x\).)
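A minimal NumPy sketch of the closed-form estimates and the resulting linear decision rule (the function and variable names are my own, not from the notes):

```python
import numpy as np

def fit_gda(X, y):
    """Closed-form MLE for GDA with a shared covariance matrix.
    X: (n, d) feature matrix; y: (n,) array of 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    n, d = X.shape
    phi = np.mean(y == 1)
    mu0 = X[y == 0].mean(axis=0)
    mu1 = X[y == 1].mean(axis=0)
    # Center each point at the mean of its own class, then average outer products.
    centered = X - np.where((y == 1)[:, None], mu1, mu0)
    Sigma = centered.T @ centered / n
    return phi, mu0, mu1, Sigma

def predict_gda(X, phi, mu0, mu1, Sigma):
    """Assign y=1 when log p(x|y=1)p(y=1) - log p(x|y=0)p(y=0) > 0,
    which is linear in x when the covariance is shared."""
    Sinv = np.linalg.inv(Sigma)
    score = np.log(phi / (1 - phi)) + (np.asarray(X, dtype=float) - (mu1 + mu0) / 2) @ Sinv @ (mu1 - mu0)
    return (score > 0).astype(int)
```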

1.2 Discussion: GDA and logistic regression

If we view the quantity \(p\left(y=1 \mid x ; \phi, \mu_0, \mu_1, \Sigma\right)\) as a function of \(x\), we’ll find that it can be expressed in the form: \[ \begin{aligned} p(y=1 \mid x)=&\frac{p(x \mid y=1) p(y=1)}{p(x\mid y=1)p(y=1)+p(x|y=0)p(y=0)}\\ =&\frac{1}{1+\frac{p(x|y=0)p(y=0)}{p(x\mid y=1)p(y=1)}}\\ =&\frac{1}{1+\exp \left(-\theta^T x\right)} \end{aligned} \] where \(\theta\) is some appropriate function of \(\phi, \Sigma, \mu_0, \mu_1\) (here we augment \(x\) with \(x_0=1\)).
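For the shared-covariance model above, one explicit parameterization (keeping the intercept separate rather than folding it into \(x_0=1\)) follows directly from the linear discriminant derived earlier: \[ p(y=1 \mid x)=\frac{1}{1+\exp\left(-\left(\theta^T x+\theta_0\right)\right)},\qquad \theta=\Sigma^{-1}(\mu_1-\mu_0),\quad \theta_0=\log\frac{\phi}{1-\phi}-\frac{1}{2}\left(\mu_1^T\Sigma^{-1}\mu_1-\mu_0^T\Sigma^{-1}\mu_0\right). \]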

GDA and logistic regression will, in general, give different decision boundaries when trained on the same dataset. Which is better?

We just argued that if \(p(x \mid y)\) is multivariate Gaussian (with shared \(\Sigma\)), then \(p(y \mid x)\) necessarily follows a logistic function. The converse, however, is not true; i.e., \(p(y \mid x)\) being a logistic function does not imply that \(p(x \mid y)\) is multivariate Gaussian.

This shows that GDA makes stronger modeling assumptions about the data than logistic regression does. It turns out that when these modeling assumptions are correct, GDA will find better fits to the data and is a better model.

In contrast, by making significantly weaker assumptions, logistic regression is also more robust and less sensitive to incorrect modeling assumptions.

2 Naive Bayes

For our motivating example, consider building an email spam filter using machine learning. Classifying emails is one example of a broader set of problems called text classification.

Let’s say we have a training set (a set of emails labeled as spam or non-spam). We first specify the features \(x_j\) used to represent an email.

We will represent an email via a feature vector whose length is equal to the number of words in the dictionary. Specifically, if an email contains the \(j\)-th word of the dictionary, then we will set \(x_j = 1\), otherwise, we let \(x_j = 0\).

Instead of using all English words, it is common practice to use only words found in the training set as features to reduce computational and space requirements. This method also allows for the inclusion of unique words not found in dictionaries. Additionally, very high-frequency words (stop words) like “the,” “of,” and “and” are often excluded, as they appear in many documents and provide little value in distinguishing spam from non-spam emails.

The set of words encoded into the feature vector is called the vocabulary, so the dimension of x is equal to the size of the vocabulary.
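As a small illustrative sketch of building the vocabulary and the binary feature vector (the tokenizer and stop-word list here are simplified assumptions, not part of the notes):

```python
import re

STOP_WORDS = {"the", "of", "and", "a", "to"}  # illustrative stop-word list

def build_vocabulary(training_emails):
    """Collect the distinct non-stop words that appear in the training set."""
    vocab = sorted({w for email in training_emails
                      for w in re.findall(r"[a-z]+", email.lower())
                      if w not in STOP_WORDS})
    return {word: j for j, word in enumerate(vocab)}

def to_feature_vector(email, vocab):
    """Binary vector: x_j = 1 iff the j-th vocabulary word occurs in the email."""
    x = [0] * len(vocab)
    for w in re.findall(r"[a-z]+", email.lower()):
        if w in vocab:
            x[vocab[w]] = 1
    return x
```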

To model \(p(x|y)\), we will therefore make a very strong assumption. We will assume that the \(x_i\)’s are conditionally independent given \(y\). This assumption is called the Naive Bayes (NB) assumption, and the resulting algorithm is called the Naive Bayes classifier.

We do not model \(x\) directly with a single multinomial distribution over all possible feature vectors, because the vocabulary is high-dimensional: with 5000 words, \(x\) can take \(2^{5000}\) values, so such a model would need \(2^{5000}-1\) parameters.

The independence is assumed conditionally on \(y\); it does not mean that the \(x_j\)’s are marginally independent.

We now have \[ \begin{aligned} & p\left(x_1, \ldots, x_{d} \mid y\right) \\ & \quad=p\left(x_1 \mid y\right) p\left(x_2 \mid y, x_1\right) p\left(x_3 \mid y, x_1, x_2\right) \cdots p\left(x_{d} \mid y, x_1, \ldots, x_{d-1}\right) \\ & \quad=p\left(x_1 \mid y\right) p\left(x_2 \mid y\right) p\left(x_3 \mid y\right) \cdots p\left(x_{d} \mid y\right) \\ & \quad=\prod_{j=1}^d p\left(x_j \mid y\right), \end{aligned} \] where the first equality is the chain rule of probability and the second uses the NB assumption. Our model is parameterized by \(\phi_{j \mid y=1}=p\left(x_j=1 \mid y=1\right)\), \(\phi_{j \mid y=0}=p\left(x_j=1 \mid y=0\right)\), and \(\phi_y=p(y=1)\). As usual, given a training set \(\left\{\left(x^{(i)}, y^{(i)}\right) ; i=1, \ldots, n\right\}\), we can write down the joint likelihood of the data: \[ \mathcal{L}\left(\phi_y, \phi_{j \mid y=0}, \phi_{j \mid y=1}\right)=\prod_{i=1}^n p\left(x^{(i)}, y^{(i)}\right). \] Maximizing this with respect to \(\phi_y, \phi_{j \mid y=0}\), and \(\phi_{j \mid y=1}\) gives the maximum likelihood estimates:

\[ \begin{aligned} \phi_{j \mid y=1} & =\frac{\sum_{i=1}^n 1\left\{x_j^{(i)}=1 \wedge y^{(i)}=1\right\}}{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\}} \\ \phi_{j \mid y=0} & =\frac{\sum_{i=1}^n 1\left\{x_j^{(i)}=1 \wedge y^{(i)}=0\right\}}{\sum_{i=1}^n 1\left\{y^{(i)}=0\right\}} \\ \phi_y & =\frac{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\}}{n} \end{aligned} \]

In the equations above, the "\(\wedge\)" symbol means "and." The parameters have a very natural interpretation: for instance, \(\phi_{j \mid y=1}\) is simply the fraction of the spam (\(y=1\)) emails in which word \(j\) appears.

Having fit all these parameters, to make a prediction on a new example with features \(x\), we simply calculate \[ \begin{aligned} p(y=1 \mid x) & =\frac{p(x \mid y=1) p(y=1)}{p(x)} \\ & =\frac{\left(\prod_{j=1}^d p\left(x_j \mid y=1\right)\right) p(y=1)}{\left(\prod_{j=1}^d p\left(x_j \mid y=1\right)\right) p(y=1)+\left(\prod_{j=1}^d p\left(x_j \mid y=0\right)\right) p(y=0)}. \end{aligned} \] The generalization to the case where \(x_j\) takes more than two values is straightforward.
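A minimal NumPy sketch of these estimates and the prediction rule (function and variable names are my own; this version omits the Laplace smoothing discussed next):

```python
import numpy as np

def fit_naive_bayes(X, y):
    """MLE for Bernoulli Naive Bayes.
    X: (n, d) binary feature matrix; y: (n,) array of 0/1 labels."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    phi_y = np.mean(y == 1)
    phi_j_y1 = X[y == 1].mean(axis=0)   # p(x_j = 1 | y = 1)
    phi_j_y0 = X[y == 0].mean(axis=0)   # p(x_j = 1 | y = 0)
    return phi_y, phi_j_y0, phi_j_y1

def predict_proba(X, phi_y, phi_j_y0, phi_j_y1):
    """p(y=1 | x) via Bayes rule, computed with log-probabilities for stability."""
    X = np.asarray(X, dtype=float)
    log_p1 = (np.log(phi_y)
              + X @ np.log(phi_j_y1) + (1 - X) @ np.log(1 - phi_j_y1))
    log_p0 = (np.log(1 - phi_y)
              + X @ np.log(phi_j_y0) + (1 - X) @ np.log(1 - phi_j_y0))
    return 1 / (1 + np.exp(log_p0 - log_p1))
```

Note that if any \(\phi_{j \mid y}\) comes out exactly 0 or 1, the logarithms blow up; this is precisely the problem Laplace smoothing addresses below.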

2.1 Laplace smoothing

It is statistically a bad idea to estimate the probability of some event to be zero just because you haven’t seen it before in your finite training set.

Take the problem of estimating the mean of a multinomial random variable \(z\) taking values in \(\{1, \ldots, k\}\). We can parameterize our multinomial with \(\phi_j=p(z=j)\). Given a set of \(n\) independent observations, the maximum likelihood estimates are given by \[ \phi_j=\frac{\sum_{i=1}^n 1\left\{z^{(i)}=j\right\}}{n} . \] To avoid some of the \(\phi_j\)’s ending up as zero, we can use Laplace smoothing: \[ \phi_j=\frac{1+\sum_{i=1}^n 1\left\{z^{(i)}=j\right\}}{k+n} . \] Here, we’ve added \(1\) to the numerator and \(k\) to the denominator (\(\sum_{j=1}^k\phi_j=1\) still holds).
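As a tiny sketch of the smoothed estimate for the multinomial example above (assuming \(z^{(i)} \in \{1,\ldots,k\}\); the function name is my own):

```python
import numpy as np

def laplace_smoothed_phi(z, k):
    """phi_j = (1 + count of z == j) / (k + n), for j = 1, ..., k."""
    n = len(z)
    counts = np.bincount(np.asarray(z) - 1, minlength=k)  # z takes values 1..k
    return (1 + counts) / (k + n)
```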

2.2 Event models for text classification

While Naive Bayes as we’ve presented it will work well for many classification problems, for text classification, there is a related model that does even better.

To describe this model, we will use a different notation and set of features for representing emails.

  • Each word in an email is represented by \(x_j\) , where \(j\) denotes the position of the word within the email.
  • \(x_j\) is an integer that corresponds to the identity of the word in the vocabulary, which means each word in the vocabulary is assigned a unique number from 1 to \(|V|\) (where \(|V|\) is the total number of unique words in the vocabulary).

We assume

  • the way an email is generated is via a random process in which spam/non-spam is first determined (according to p(y)).
  • the sender of the email writes the email by first generating \(x_1\) from some multinomial distribution over words \((p(x_1|y))\).
  • the second word \(x_2\) is chosen independently of \(x_1\) but from the same multinomial distribution
  • and similarly for \(x_3\), \(x_4 \ldots\)

Thus, the overall probability of a message with \(d\) words is given by \[ p(y) \prod_{j=1}^d p\left(x_j \mid y\right). \] Now, \(x_j \mid y\) follows a multinomial distribution over \(\{1, \ldots, |V|\}\), rather than a Bernoulli distribution.

The parameters for our new model are \(\phi_y=p(y=1)\), \(\phi_{k \mid y=1}=p\left(x_j=k \mid y=1\right)\) (for any \(j\)), and \(\phi_{k \mid y=0}=p\left(x_j=k \mid y=0\right)\). Note that we have assumed that \(p\left(x_j \mid y\right)\) is the same for all values of \(j\) (i.e., the distribution according to which a word is generated does not depend on its position \(j\)).

If we are given a training set, the likelihood of the data is given by \[ \begin{aligned} \mathcal{L}\left(\phi_y, \phi_{k \mid y=0}, \phi_{k \mid y=1}\right) & =\prod_{i=1}^n p\left(x^{(i)}, y^{(i)}\right) \\ & =\prod_{i=1}^n\left(\prod_{j=1}^{d_i} p\left(x_j^{(i)} \mid y ; \phi_{k \mid y=0}, \phi_{k \mid y=1}\right)\right) p\left(y^{(i)} ; \phi_y\right) . \end{aligned} \] Maximizing this yields the maximum likelihood estimates of the parameters: \[ \begin{aligned} \phi_{k \mid y=1} & =\frac{\sum_{i=1}^n \sum_{j=1}^{d_i} 1\left\{x_j^{(i)}=k \wedge y^{(i)}=1\right\}}{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\} d_i} \\ \phi_{k \mid y=0} & =\frac{\sum_{i=1}^n \sum_{j=1}^{d_i} 1\left\{x_j^{(i)}=k \wedge y^{(i)}=0\right\}}{\sum_{i=1}^n 1\left\{y^{(i)}=0\right\} d_i} \\ \phi_y & =\frac{\sum_{i=1}^n 1\left\{y^{(i)}=1\right\}}{n} . \end{aligned} \] If we were to apply Laplace smoothing (which is needed in practice for good performance), we obtain: \[ \begin{aligned} & \phi_{k \mid y=1}=\frac{1+\sum_{i=1}^n \sum_{j=1}^{d_i} 1\left\{x_j^{(i)}=k \wedge y^{(i)}=1\right\}}{|V|+\sum_{i=1}^n 1\left\{y^{(i)}=1\right\} d_i} \\ & \phi_{k \mid y=0}=\frac{1+\sum_{i=1}^n \sum_{j=1}^{d_i} 1\left\{x_j^{(i)}=k \wedge y^{(i)}=0\right\}}{|V|+\sum_{i=1}^n 1\left\{y^{(i)}=0\right\} d_i} \end{aligned} \]
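A minimal NumPy sketch of training and prediction for this event model with Laplace smoothing (here word indices run from 0 to \(|V|-1\) rather than 1 to \(|V|\), and the function names are my own):

```python
import numpy as np

def fit_multinomial_nb(docs, labels, V):
    """Multinomial event model with Laplace smoothing.
    docs: list of word-index lists (each index in 0..V-1); labels: list of 0/1."""
    phi_y = np.mean(np.asarray(labels) == 1)
    word_counts = np.zeros((2, V))   # word_counts[c, k]: count of word k in class c
    total_words = np.zeros(2)        # total number of word positions in class c
    for doc, y in zip(docs, labels):
        word_counts[y] += np.bincount(doc, minlength=V)
        total_words[y] += len(doc)
    phi_k_y = (1 + word_counts) / (V + total_words)[:, None]
    return phi_y, phi_k_y

def predict(doc, phi_y, phi_k_y):
    """Compare log p(x|y=1)p(y=1) with log p(x|y=0)p(y=0)."""
    counts = np.bincount(doc, minlength=phi_k_y.shape[1])
    log_p1 = np.log(phi_y) + counts @ np.log(phi_k_y[1])
    log_p0 = np.log(1 - phi_y) + counts @ np.log(phi_k_y[0])
    return int(log_p1 > log_p0)
```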