In this blog post, we will explore the importance of stochastic models in the context of unsupervised learning. We will start with k-means clustering, which deterministically clusters points based on a simple distance heuristic, and build up to Expectation Maximization (EM), which can use any parametrized probability distribution to cluster data.

k-means clustering

k-means clustering is a method used to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean. Given a set of observations (x_1, x_2, \ldots, x_n), we want to partition them into k sets S = \{ S_1, S_2, \ldots , S_k \}. We do this by minimizing the intra-cluster squared Euclidean distance:

\min_S \sum_{i=1}^k \sum_{x \in S_i} \| x - \mu_i \|^2_2

where \mu_i = \dfrac{1}{|S_i|} \sum_{x \in S_i} x is the mean of the points in S_i.

To give a minimal example of k-means clustering, we are going to generate data according to x \sim \mathcal{N}(\mu, I), where I is the identity matrix. Then we’re going to cluster the data with sklearn.cluster.KMeans.
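Below is a minimal sketch of this setup. The specific means, seed, and number of points are placeholders and may differ from the values used in the accompanying notebook; since k-means labels clusters in an arbitrary order, we match cluster indices to the true labels before computing accuracy.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility

# Hypothetical cluster means; the exact values in the post may differ.
means = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
n_per_cluster = 200

# Sample x ~ N(mu, I) for each cluster.
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(n_per_cluster, 2)) for m in means])
y_true = np.repeat(np.arange(len(means)), n_per_cluster)

kmeans = KMeans(n_clusters=len(means), n_init=10, random_state=0).fit(X)
y_pred = kmeans.labels_

# Match the arbitrary cluster indices to the true labels (Hungarian
# assignment on the confusion matrix) before computing accuracy.
conf = np.zeros((len(means), len(means)), dtype=int)
for t, p in zip(y_true, y_pred):
    conf[t, p] += 1
row, col = linear_sum_assignment(-conf)
accuracy = conf[row, col].sum() / len(y_true)
print(f"accuracy: {accuracy:.1%}")
```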

We obtained an accuracy of 96.8%, which is pretty good! Due to its use of Euclidean distance, k-means actually makes an assumption about the underlying data distribution: that it is spherical. A spherical distribution is a multivariate distribution that is symmetric in all directions: its density depends only on the distance from the center, not on the direction. An isotropic Gaussian distribution is an example of that, and it is exactly what we used to generate our data!

We can actually easily plot the covariance of these classes:
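One possible way to do this with matplotlib, reusing X, y_true, and means from the snippet above (the exact plotting style in the post may differ), is to draw an ellipse from the empirical covariance of each class:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Ellipse

def plot_covariance(ax, points, n_std=2.0, **kwargs):
    """Draw an ellipse covering roughly n_std standard deviations
    of the empirical covariance of one cluster's points."""
    mean = points.mean(axis=0)
    cov = np.cov(points, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)          # eigenvalues in ascending order
    angle = np.degrees(np.arctan2(eigvecs[1, -1], eigvecs[0, -1]))
    width, height = 2 * n_std * np.sqrt(eigvals[::-1])
    ax.add_patch(Ellipse(mean, width, height, angle=angle, fill=False, **kwargs))

fig, ax = plt.subplots()
for k in range(len(means)):
    pts = X[y_true == k]
    ax.scatter(pts[:, 0], pts[:, 1], s=5)
    plot_covariance(ax, pts, edgecolor="black")
plt.show()
```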

However, if our data doesn’t follow a spherical distribution, k-means clustering does less well. Let’s now generate some new data according to

x \sim \mathcal{N}(\mu, \Sigma) where \Sigma \neq I.
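As a rough sketch, reusing rng, means, and n_per_cluster from the earlier snippet (the covariance values here are made up for illustration):

```python
# Hypothetical non-identity covariance; the actual values in the post may differ.
cov = np.array([[3.0, 2.0],
                [2.0, 2.0]])

# Sample x ~ N(mu, Sigma) for each cluster with a shared, elongated covariance.
X_aniso = np.vstack([
    rng.multivariate_normal(mean=m, cov=cov, size=n_per_cluster) for m in means
])
```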

Reapplying k-means to this new data reveals very different results, with an accuracy of only 46.5%.

Visualizing the reach of each cluster reveals exactly why k-means isn’t performing well:

No wonder k-means does poorly. Can we drop the assumption of a spherical distribution and just plug in any probability distribution p(x) we want? Of course we can, as long as it is parametrized, p(x;\theta), so that we have something concrete to solve for.

Since we want our distribution to describe the true data as well as possible, we maximize the expected likelihood of the data x under the distribution parameters \theta:

\max_\theta \mathbb{E} \big[ p(x | \theta) \big]

Since we are dealing with probabilities, it is more convenient to work in log space, so we instead maximize the expected log-likelihood:

\max_\theta \mathbb{E} \big[ \log p(x | \theta) \big]

This technique is called Expectation Maximization (EM), and the optimization can be carried out with any optimization approach. If the problem is simple enough, we can even find an analytical solution. For most practical applications, however, we use Maximum a posteriori estimation (MAP) or simply gradient-based ascent with learning rate \alpha:

\theta \leftarrow \theta + \alpha \nabla_\theta \mathbb{E} \big[ \log p(x | \theta) \big]

EM has two main benefits over k-means clustering:

  • Richer data models that can capture more complex distributions.
  • Soft clustering, which computes the probability of each point belonging to a specific cluster.

If we decide to use a normal distribution, p(x; \theta) := \mathcal{N}(x; \mu, \Sigma), then the EM updates simply become:

\mu \leftarrow \mu + \alpha \big( \Sigma^{-1} (x-\mu) \big)

\Sigma \leftarrow \Sigma + \dfrac{\alpha}{2} \big( \Sigma^{-1} (x-\mu) (x-\mu)^T \Sigma^{-1} - \Sigma^{-1} \big)
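For reference, these updates are just gradient-ascent steps on the Gaussian log-density, using the standard gradients

\nabla_\mu \log \mathcal{N}(x; \mu, \Sigma) = \Sigma^{-1} (x - \mu)

\nabla_\Sigma \log \mathcal{N}(x; \mu, \Sigma) = \dfrac{1}{2} \big( \Sigma^{-1} (x - \mu) (x - \mu)^T \Sigma^{-1} - \Sigma^{-1} \big)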

Let’s see this in action!
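Here is a minimal sketch of one such gradient step, assuming soft responsibilities computed from the current parameters, uniform mixing between clusters, and a small fixed learning rate; the exact implementation in the accompanying notebook may differ.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gradient_step(X, mus, sigmas, alpha=1e-3):
    """One gradient-ascent step on the expected log-likelihood,
    following the update equations above (one Gaussian per cluster)."""
    K = len(mus)
    # E-step: soft responsibilities p(k | x) for every point.
    densities = np.stack(
        [multivariate_normal(mus[k], sigmas[k]).pdf(X) for k in range(K)], axis=1
    )
    resp = densities / densities.sum(axis=1, keepdims=True)

    # M-step (gradient version): nudge each cluster's parameters.
    for k in range(K):
        inv = np.linalg.inv(sigmas[k])
        diff = X - mus[k]                # shape (n, d)
        w = resp[:, k][:, None]          # responsibility weights, shape (n, 1)
        grad_mu = ((w * diff) @ inv).sum(axis=0)
        outer = (w[:, :, None] * diff[:, :, None] * diff[:, None, :]).sum(axis=0)
        grad_sigma = 0.5 * (inv @ outer @ inv - resp[:, k].sum() * inv)
        mus[k] = mus[k] + alpha * grad_mu
        # A small alpha keeps the covariance symmetric positive definite in practice.
        sigmas[k] = sigmas[k] + alpha * grad_sigma
    return mus, sigmas, resp
```

Iterating this step from reasonable initial parameters (for example, the k-means centroids with identity covariances) gradually fits one Gaussian per cluster, and the responsibilities give us the soft clustering mentioned above.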

Our more expressive model netted us a much higher accuracy of 89.2%! Let’s now circle back: what if we fix the covariance matrix to \Sigma = I instead of learning it? Suddenly, our update equation becomes even simpler:

\mu \leftarrow \mu + \alpha (x - \mu)

This update equation is almost exactly the update equation we can use for k-means clustering (which actually only averages over the points within each cluster). As such, k-means clustering can be seen as a special case of Expectation Maximization (EM).

Note that in practice, k-means doesn’t use gradient descent/ascent but instead estimates the means directly from the data points. You can view these as two different optimization strategies for solving the same problem.
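A minimal sketch of that direct estimation (one classic Lloyd iteration, assuming mus is a (k, d) array of current means and every cluster keeps at least one point):

```python
import numpy as np

def kmeans_step(X, mus):
    """One k-means (Lloyd) iteration: hard-assign each point to its nearest
    mean, then re-estimate every mean directly from its assigned points."""
    dists = np.linalg.norm(X[:, None, :] - mus[None, :, :], axis=-1)  # (n, k)
    assignments = dists.argmin(axis=1)
    # Closed-form mean update instead of a gradient step.
    new_mus = np.array([X[assignments == k].mean(axis=0) for k in range(len(mus))])
    return new_mus, assignments
```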

If we restrict our previous model to have \Sigma = I, the reach of each cluster becomes spherical, just like with k-means.

These probabilistic models are much more than classifiers. Since we are learning a probability distribution, we can now sample from it: x \sim p(x). They are effectively generative models that can both fit data and generate new artificial data!
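As a small sketch of this, assuming mus and sigmas are the per-cluster parameters fitted above and, for simplicity, uniform mixing weights between clusters:

```python
import numpy as np

rng = np.random.default_rng(0)

K = len(mus)
components = rng.integers(low=0, high=K, size=500)  # pick a cluster for each new sample
samples = np.vstack([
    rng.multivariate_normal(mean=mus[k], cov=sigmas[k]) for k in components
])
```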

Takeaways

  • k-means is a simple algorithm that assumes that all data follows a spherical distribution.
  • We can parametrize each cluster with a probability distribution, which gives us more expressiveness.
  • We can then optimize this with Expectation Maximization (EM).

Probabilistic models with EM allow us to:

  • Use more complex models that can capture non-spherical distributions.
  • Soft classify each data point with individual probabilities of belonging to a class.
  • Generate new artificial data.

Like all things, EM isn’t perfect and has a few limitations:

  • It is prone to getting stuck in local maxima, and the solution heavily depends on the initial values. Can you think of any techniques we studied in this course to alleviate this?
  • It still assumes that each class is uni-modal. This can be addressed by combining multiple Gaussians into a weighted sum to form more complex distributions, which is known as a Gaussian Mixture Model (GMM).
  • Heavy reliance on the assumed underlying model. While EM is more expressive than k-means clustering, it still works on an assumption about an underlying distribution, which might not be correct. Non-parametric approaches such as Kernel Density Estimation and Gaussian Processes are alternatives that tackle this issue.

This blog post also comes with an accompanying notebook where you can recreate and play with our results. Notebook