55 Model Based Prediction

55.1 Basic Idea

  1. Assume the data follow a probabilistic model
  2. Use Bayes’ theorem to identify optimal classifiers

Pros:

  • Can take advantage of structure of the data
  • May be computationally convenient
  • Are reasonably accurate on real problems

Cons:

  • Make additional assumptions about the data
  • When the model is incorrect, you may get reduced accuracy

55.2 Model Based Approach

  1. Our goal is to build a parametric model for the conditional distribution \(Pr(Y = k|X = x)\)
  2. A typical approach is to apply Bayes' theorem:

\[Pr(Y = k|X = x) = \frac{Pr(X = x|Y = k)~Pr(Y = k)}{\sum_{l=1}^K ~ Pr(X = x|Y = l) ~ Pr(Y = l)}\] \[Pr(Y = k|X = x) = \frac{f_k(x)\pi_k}{\sum_{l=1}^K f_l(x)\pi_l}\] Where:

  • \(Pr(Y = k|X = x)\) is the probability that the outcome \(Y\) belongs to class \(k\), given that the predictor variables take the value \(X = x\)
  • \(f_k(x)\) is the model for the distribution of the predictors \(x\) within class \(k\)
  • \(\pi_k\) is the prior probability of class \(k\)
  1. Typically prior probabilities \(\pi_k\) are set in advance.
  2. A common choice for \(f_k(x)\) is a Gaussian distribution: \[f_k(x) = \frac{1}{\sigma_k \sqrt{2\pi}}e^{-\frac{(x-\mu_k)^2}{2\sigma_k^2}}\]
  3. Estimate the parameters \((\mu_k, \sigma_k^2)\) from the data.
  4. Once we have estimated these parameters, we assign an observation to the class with the highest value of \(Pr(Y = k|X = x)\), as sketched in the example below.
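
To make these steps concrete, here is a minimal sketch in Python (not part of the original notes; the toy data, function names, and single-predictor setup are all illustrative). It estimates \((\mu_k, \sigma_k^2)\) and \(\pi_k\) from labelled training data, then applies Bayes' theorem to classify a new value:

```python
import numpy as np

def fit_gaussian_classes(x, y):
    """Estimate mu_k, sigma_k and the prior pi_k for every class k."""
    params = {}
    for k in np.unique(y):
        xk = x[y == k]
        params[k] = {
            "mu": xk.mean(),
            "sigma": xk.std(ddof=1),
            "pi": len(xk) / len(x),   # prior: observed class frequency
        }
    return params

def gaussian_density(x, mu, sigma):
    """f_k(x): one-dimensional Gaussian density."""
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def posterior(x_new, params):
    """Pr(Y = k | X = x) via Bayes' theorem."""
    unnormalised = {k: p["pi"] * gaussian_density(x_new, p["mu"], p["sigma"])
                    for k, p in params.items()}
    total = sum(unnormalised.values())
    return {k: v / total for k, v in unnormalised.items()}

# Toy data: class 0 centred near 0, class 1 centred near 3
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
y = np.array([0] * 100 + [1] * 100)

params = fit_gaussian_classes(x, y)
post = posterior(1.2, params)
print(post, max(post, key=post.get))   # classify as the class with highest posterior
```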

55.3 Classifying Using the Model

A range of models use this approach:

  • Linear discriminant analysis assumes \(f_k(x)\) is a multivariate Gaussian with the same covariance matrix for every class.
  • Quadratic discriminant analysis assumes \(f_k(x)\) is a multivariate Gaussian with a different covariance matrix for each class.
  • Model based prediction assumes more complicated forms for the covariance matrix.
  • Naive Bayes assumes independence between the features when building the model.
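
As a brief illustration (assuming scikit-learn is available; the iris data and five-fold cross-validation are arbitrary choices, not from the original notes), the three approaches can be fit and compared on the same data:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

models = {
    "LDA (shared covariance)": LinearDiscriminantAnalysis(),
    "QDA (class-specific covariance)": QuadraticDiscriminantAnalysis(),
    "Naive Bayes (independent features)": GaussianNB(),
}

for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```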

55.4 Why Linear Discriminant Analysis?

For two classes \(k\) and \(j\) whose Gaussian densities share a common covariance matrix \(\Sigma\), the log ratio of the posterior probabilities is linear in \(x\), which is why the decision boundaries are linear:

\[\log \frac{Pr(Y = k|X = x)}{Pr(Y = j|X = x)} = \log \frac{f_k(x)}{f_j(x)} + \log \frac{\pi_k}{\pi_j}\] \[= \log \frac{\pi_k}{\pi_j} - \frac{1}{2}(\mu_k + \mu_j)^T \Sigma^{-1}(\mu_k - \mu_j) + x^T \Sigma^{-1}(\mu_k - \mu_j)\]
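
A quick numerical check of this identity, using made-up parameters (a sketch, not from the original notes):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up parameters: shared covariance, two class means, two priors
Sigma = np.array([[2.0, 0.3], [0.3, 1.0]])
mu_k, mu_j = np.array([1.0, 2.0]), np.array([-1.0, 0.5])
pi_k, pi_j = 0.6, 0.4
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([0.7, -0.2])

# Left-hand side: log f_k(x)/f_j(x) + log pi_k/pi_j
lhs = (multivariate_normal.logpdf(x, mean=mu_k, cov=Sigma)
       - multivariate_normal.logpdf(x, mean=mu_j, cov=Sigma)
       + np.log(pi_k / pi_j))

# Right-hand side: the closed form from the derivation above
rhs = (np.log(pi_k / pi_j)
       - 0.5 * (mu_k + mu_j) @ Sigma_inv @ (mu_k - mu_j)
       + x @ Sigma_inv @ (mu_k - mu_j))

print(lhs, rhs)   # the two values agree; rhs is linear in x
```

The quadratic term in \(x\) cancels only because the covariance matrix is shared; with class-specific covariances (QDA) it remains, giving quadratic boundaries.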

55.5 Decision Boundaries

In the figure below, each set of rings represents the contours of a Gaussian distribution corresponding to a different class. The decision boundaries pass through the points where the prior-weighted class densities intersect, i.e. where two classes have equal posterior probability.
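
As a one-dimensional sketch of this idea (all parameter values are made up, not from the original notes), the boundary between two classes can be found numerically as the point where the prior-weighted densities are equal:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

# Made-up one-dimensional example: two Gaussian classes with a common sigma
mu0, mu1, sigma = 0.0, 3.0, 1.0
pi0, pi1 = 0.6, 0.4

# The decision boundary is where the prior-weighted densities intersect
def density_gap(x):
    return pi0 * norm.pdf(x, mu0, sigma) - pi1 * norm.pdf(x, mu1, sigma)

boundary = brentq(density_gap, mu0, mu1)   # root lies between the two means
print(f"Decision boundary at x = {boundary:.3f}")
```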

55.6 Discriminant Function

\[\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^T \Sigma^{-1} \mu_k + \log(\pi_k)\] Where:

  • \(\mu_k\) is the mean of the class \(k\) for all features

  • \(\Sigma^{-1}\) is the inverse of the common covariance matrix

  • \(\pi_k\) is the prior probability of class \(k\)

  • An observation is assigned to the class that maximises \(\delta_k(x)\)

  • Decide on the class based on \(\hat{Y}(x) = \arg\max_k ~~\delta_k(x)\)

  • The parameters are usually estimated by maximum likelihood; a minimal sketch of the whole procedure follows below.
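
Here is that sketch in Python, assuming a shared covariance matrix estimated by pooling the per-class covariances (the function names and toy data are illustrative, not from the original notes):

```python
import numpy as np

def lda_fit(X, y):
    """Estimate class means, priors and the pooled (shared) covariance matrix."""
    classes = np.unique(y)
    means = {k: X[y == k].mean(axis=0) for k in classes}
    priors = {k: np.mean(y == k) for k in classes}
    # Pooled within-class covariance (maximum-likelihood style estimate)
    pooled = sum(np.cov(X[y == k].T, bias=True) * np.sum(y == k) for k in classes) / len(y)
    return means, priors, np.linalg.inv(pooled), classes

def discriminant(x, mu, Sigma_inv, prior):
    """delta_k(x) = x^T Sigma^{-1} mu_k - 0.5 mu_k^T Sigma^{-1} mu_k + log(pi_k)."""
    return x @ Sigma_inv @ mu - 0.5 * mu @ Sigma_inv @ mu + np.log(prior)

def lda_predict(x, means, priors, Sigma_inv, classes):
    scores = {k: discriminant(x, means[k], Sigma_inv, priors[k]) for k in classes}
    return max(scores, key=scores.get)   # argmax_k delta_k(x)

# Toy two-class data in two dimensions
rng = np.random.default_rng(2)
X = np.vstack([rng.normal([0, 0], 1, (50, 2)), rng.normal([2, 2], 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)

means, priors, Sigma_inv, classes = lda_fit(X, y)
print(lda_predict(np.array([1.8, 1.5]), means, priors, Sigma_inv, classes))
```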

55.7 Naive Bayes

Suppose we have many predictors; we would want to model \(Pr(Y = k|X_1, ..., X_m)\).

We could use Bayes' theorem to get:

\[Pr(Y = k|X_1, ..., X_m) = \frac{\pi_k Pr(X_1, ..., X_m|Y = k)}{\sum_{l=1}^K Pr(X_1, ..., X_m|Y = l) \pi_l}\] \[\propto~~\pi_k Pr(X_1, ..., X_m|Y = k)\] Expanding the numerator with the chain rule gives:

\[\pi_k Pr(X_1, ..., X_m|Y = k) = \pi_k Pr(X_1|Y = k) Pr(X_2, ..., X_m|X_1, Y = k)\] \[= \pi_k Pr(X_1|Y = k) Pr(X_2|X_1, Y = k) Pr(X_3, ..., X_m|X_1, X_2, Y = k)\] \[= \pi_k Pr(X_1|Y = k) Pr(X_2|X_1, Y = k) ... Pr(X_m|X_1, ..., X_{m-1}, Y = k)\] If we assume the features are independent within each class (the "naive" assumption), this simplifies to: \[\approx \pi_k Pr(X_1|Y = k) Pr(X_2|Y = k) ... Pr(X_m|Y = k)\]
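
A minimal Gaussian naive Bayes sketch built directly on this factorisation, where each feature gets its own one-dimensional Gaussian per class (names and data are illustrative, not from the original notes):

```python
import numpy as np
from scipy.stats import norm

def naive_bayes_fit(X, y):
    """Per class: prior pi_k plus a mean and sd for each feature (independence assumption)."""
    params = {}
    for k in np.unique(y):
        Xk = X[y == k]
        params[k] = {
            "pi": len(Xk) / len(X),
            "mu": Xk.mean(axis=0),
            "sigma": Xk.std(axis=0, ddof=1),
        }
    return params

def naive_bayes_predict(x, params):
    """Score each class by log(pi_k) + sum_j log Pr(X_j = x_j | Y = k)."""
    scores = {}
    for k, p in params.items():
        scores[k] = np.log(p["pi"]) + np.sum(norm.logpdf(x, p["mu"], p["sigma"]))
    return max(scores, key=scores.get)

# Toy data with two features per class
rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 1, (60, 2)), rng.normal([3, 1], 1, (60, 2))])
y = np.array([0] * 60 + [1] * 60)

params = naive_bayes_fit(X, y)
print(naive_bayes_predict(np.array([2.5, 0.8]), params))   # expected: class 1
```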