Summary

The blog post explores the complexities of non-linear manifold learning, an advanced technique for deciphering complex, intertwined patterns in high-dimensional data such as images or videos. Manifold learning serves as a method for non-linear dimensionality reduction, highlighting the capability to project high-dimensional data onto lower-dimensional spaces, preserving essential features and relationships. Through various algorithms such as Sammon Mapping, Isomap, Laplacian Eigenmaps, t-SNE, and UMAP, the blog post delves into how each algorithm uniquely addresses the challenge of revealing hidden structures within data, which traditional linear methods might overlook. The conclusion emphasizes the importance of experimenting with different algorithms, fine-tuning hyperparameters, and visualizing results to unlock deep insights hidden within complex datasets.

In this era of machine learning and data analysis, the quest to understand complex relationships within high-dimensional data like images or videos is not simple and often requires techniques beyond simple linear ones. The patterns are complex, twisted and intertwined, defying the simplicity of straight lines. This is where non-linear manifold learning algorithms step in. They unravel the hidden structures within our data, revealing relationships that conventional linear methods might overlook. In this article, we’ll embark on a journey through some of the prominent non-linear manifold algorithms, each with its quirks and capabilities.

Introduction

Manifold learning is an approach to non-linear dimensionality reduction. According to Wikipedia, nonlinear dimensionality reduction[1] refers to various techniques aimed at projecting high-dimensional data onto lower-dimensional latent manifolds. Essentially, a latent manifold represents a curved subspace housing an embedding of items, where the proximity of similar items is preserved. This transformation not only facilitates the visualization of data in a reduced-dimensional space but also allows the mapping from higher to lower dimensions to be used for downstream machine-learning tasks.

Image borrowed from Wikipedia[1].

Consider you’re handed the above dataset featuring images of the letter ‘A’, where each square represents a training example. Your mission: assess the readability of these images. Using a straightforward linear regression model by directly feeding in pixel values might not be the wisest move. With over 1,000 dimensions (1,024 pixel values for a 32 × 32 image), it’s a complex landscape to navigate. Yet, a closer inspection reveals a simple truth: these images essentially vary in two dimensions, angle and size. Imagine if we could distil these features; our problem would transform, leaving us contending with just two features instead of an overwhelming array.

Now, how do Nonlinear dimensionality reduction (NLDR) techniques step in? Picture them as skilled sculptors, chiselling away correlated information (like the shared ‘A’ theme) to unveil the essential details of angle and size. Take a glance at the adjacent image, showcasing samples from this dataset. Witness how Manifold Sculpting, a Nonlinear Dimensionality Reduction algorithm, adeptly projects the data into a streamlined two-dimensional space.

A natural question arises: what happens if we use PCA on this data? PCA, or Principal Component Analysis, thrives in finding structure within high-dimensional data by condensing the dataset onto the principal components that capture the maximum variance. However, PCA assumes this structure lies on a flat, linear plane. But in our case, the data likely forms a curved manifold, not a flat plane. So, if we try to force it onto a two-dimensional plane using PCA, the results, as illustrated below, wouldn’t be ideal (crowding effect). The data points appear scattered and don’t reveal the underlying structure based on angle and size, hindering our ability to accurately assess readability.

Image borrowed from Wikipedia[1].

In this article, we will focus on five different non-linear manifold learning algorithms:

  1. Sammon Mapping
  2. Isomap (Isometric Mapping)
  3. Laplacian Eigenmaps
  4. t-SNE
  5. UMAP

Sammon Mapping

Sammon Mapping [2] was proposed in 1969 by J.W. Sammon, and it is arguably the first nonlinear method proposed for manifold learning. Sammon Mapping is also known by other names such as Sammon’s Non-Linear Mapping or Nonlinear Mapping (NLM). Before we try to understand how it works, let us first understand MDS.

Multi-Dimensional Scaling (MDS) is a technique for visualizing the degree of similarity among individual points within a dataset. The underlying idea is to optimize the arrangement of the projected points so that points that are far apart in the original dimension remain well separated in the projected dimension, and points that are close in the original dimension stay close in the reduced dimension. Consider the following example, in which points xi in a three-dimensional space are projected as points zi in a two-dimensional space. The associated cost function, or error function f(z), essentially quantifies the disparity between the distances among all pairs of points in the lower dimension (zi) and their counterparts in the higher dimension (xi). Once this cost function is defined, optimization techniques such as gradient descent can be employed to fine-tune the z values and minimize the discrepancies between the projected and original distances.

Image borrowed from UBC CPSC-340 slides[3].
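
To make this concrete, here is a minimal sketch of metric MDS using sklearn.manifold.MDS on a small subset of the digits dataset (the dataset is chosen purely for illustration). The optimizer iteratively adjusts the low-dimensional points z to reduce the stress, i.e. the mismatch between pairwise distances before and after projection.

from sklearn.datasets import load_digits
from sklearn.manifold import MDS

# A small subset of the 64-dimensional digits dataset as our points x_i
X, _ = load_digits(return_X_y=True)
X = X[:200]

# Metric MDS: find 2-D points z_i whose pairwise distances approximate
# the pairwise distances between the x_i by iteratively minimizing the stress
mds = MDS(n_components=2, random_state=0)
Z = mds.fit_transform(X)

Z.shape      # (200, 2)
mds.stress_  # final value of the stress (cost) function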

Now that we have an idea of MDS, the challenge with the traditional MDS cost function is that it focuses heavily on large distances. This leads to a crowding effect, as with PCA; we will discuss this effect in more detail later. Sammon Mapping attempts to solve this problem by using a weighted MDS. If we denote the distance between the ith and jth objects in the original space by d*ij and the distance between their projections by dij, then Sammon’s stress, or Sammon error function, aims to minimize the error E given below.

Image borrowed from Wikipedia[2].
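
Since the error function above is shown only as an image, here is an illustrative NumPy sketch of Sammon’s stress, assuming distinct points so that all d*ij > 0. Note how each squared difference is divided by the original distance d*ij, so errors on small distances are penalised relatively more than in plain MDS.

import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_high, Z_low):
    """Illustrative Sammon stress between original points and their projections."""
    d_star = pdist(X_high)  # original pairwise distances d*_ij (i < j)
    d = pdist(Z_low)        # projected pairwise distances d_ij (i < j)
    # E = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij
    return np.sum((d_star - d) ** 2 / d_star) / np.sum(d_star)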

Example

In the example below, we have a three-dimensional dataset consisting of eight sinusoids on the surface of a cylinder. If you try to visualize it, you can easily see that the entire dataset lies on a 2-manifold. If you project it using a linear technique like PCA, the individual sinusoids intersect and the overall topology is not preserved. If we apply Sammon mapping instead, there are fewer intersections between the sinusoids. Although the Sammon mapping result (bottom right) is not perfect, it is still better than PCA.

Image borrowed from Henderson, P. (1997) Sammon mapping[5].

Code Sample

Python doesn’t have good built-in support for Sammon Mapping, but Tom Pollard’s repository[14] provides a solid implementation. Here is sample code showing how to use this package.

import pandas as pd
from sammon import sammon

df = pd.read_csv('{your_dataset}.csv')

# sammon(...) wants a matrix of samples, so take the
# first seven columns as a NumPy array
X = df[df.columns[:7]].to_numpy()

# By default, sammon returns a 2-dim embedding y and the error E
[y, E] = sammon(X)

Further Reading

  1. Original paper for further reading: “A Nonlinear Mapping for Data Structure Analysis” [6].
  2. Explore a comprehensive Sammon Mapping article covering its purpose, algorithm, and code example: https://data-farmers.github.io/2019-06-10-sammon-mapping/.

Isomap

Isomap was proposed by Tenenbaum, de Silva and Langford in 2000[4]. It tries to find a lower-dimensional embedding which preserves the geodesic distances between all points. If you look at the Swiss roll below in three dimensions, the Euclidean distance captures the shortest straight-line distance between two points while ignoring the manifold. The geodesic distance instead captures the distance along the manifold. This distance is used in the actual algorithm to learn the mapping.

Image borrowed from UBC CPSC-340 slides[3].

Algorithm

The idea is to approximate the geodesic distance by hopping along the manifold; a minimal code sketch follows the list below.

  1. For each point, find its neighbours (points that are close enough in the original high-dimensional space) and join them with an edge.
  2. The newly constructed structure is a weighted graph, with edge weights equal to the distances between the points.
  3. Calculate the geodesic distance as the shortest path from one point to another in this graph.
  4. Run the MDS algorithm and get the new mapping.
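
To make these steps concrete, here is a rough sketch that assembles the same pipeline from standard building blocks: a k-nearest-neighbour graph, graph shortest paths as geodesic distances, and metric MDS on the precomputed distance matrix. It is meant to mirror the steps above, not to replace sklearn.manifold.Isomap shown later; the Swiss roll data and the choice of 10 neighbours are just for illustration.

from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph
from sklearn.manifold import MDS
from scipy.sparse.csgraph import shortest_path

X, _ = make_swiss_roll(n_samples=500, random_state=0)

# Steps 1-2: weighted k-NN graph, edge weights = Euclidean distances
graph = kneighbors_graph(X, n_neighbors=10, mode='distance')

# Step 3: geodesic distances = shortest paths in the graph
# (assumes the graph is connected, otherwise some distances are infinite)
geodesic = shortest_path(graph, method='D', directed=False)

# Step 4: metric MDS on the precomputed geodesic distances
Z = MDS(n_components=2, dissimilarity='precomputed',
        random_state=0).fit_transform(geodesic)
Z.shape  # (500, 2)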

Below is an image showing points arranged in an S shape and how, when projected using Isomap, they are all mapped separately with no crowding effect. It shows step by step how the algorithm works on this data.

Image borrowed from UBC CPSC-340 slides[3]

Example

In the example below we have a Swiss roll. PCA doesn’t do a good job of projecting it, as points of different colours overlap with each other (crowding effect). Isomap does a good job: it learns to unwrap the roll and project it correctly.

Image borrowed from UBC CPSC-340 slides[3]

Code Sample

Sklearn has good support for Isomap and one can use sklearn.manifold.Isomap[16] directly. A good hyperparameter to play around with is the number of neighbours used to create the weighted graph.

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

# Load the digits dataset: 1,797 samples with 64 features each
X, _ = load_digits(return_X_y=True)
X.shape  # (1797, 64)

# Project the first 100 samples onto a 2-dimensional embedding
embedding = Isomap(n_components=2)
X_transformed = embedding.fit_transform(X[:100])
X_transformed.shape  # (100, 2)

Further Reading

  1. Original paper for further reading: A Global Geometric Framework for Nonlinear Dimensionality Reduction [7].
  2. Refer to the dedicated article on Isomap, delving into topics such as Geodesic distance, Isometric mapping, algorithmic insights, and drawbacks: https://medium.com/data-science-in-your-pocket/dimension-reduction-using-isomap-72ead0411dec.

Laplacian Eigenmaps

Laplacian Eigenmaps is yet another technique similar to Isomap in the sense that we again create a weighted graph. It was developed by Mikhail Belkin and Partha Niyogi in 2003 [8]. The technique uses spectral methods to perform dimensionality reduction and is mostly used in image segmentation, clustering, and visualization of manifold structures.

Algorithm

The idea is to model the unknown manifold M, on which the higher-dimensional points lie, by constructing a graph G in which nearby data points are connected by an edge. We then construct the graph Laplacian L on G, compute its spectrum (eigenvalues) and the corresponding eigenvectors, and finally use those eigenvectors to construct the embedding in the new space (a rough code sketch follows the numbered steps below).

  1. Construct a weighted graph by choosing a distance function such as the Euclidean distance.
  2. Assign the edge weights either by setting them all to a constant 1 or by using some other function such as a heat kernel.
  3. Compute the eigenvalues and eigenvectors of the generalized eigenvector problem Lf = λDf, where D is the diagonal weight matrix whose entries are the row sums of the weight matrix W, and L = D - W is the graph Laplacian.
  4. Lastly, we get the embedding from the eigenvectors.
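
Here is a rough NumPy/SciPy sketch of these steps using the constant weights from step 2 (all edge weights equal to 1); the digits subset and the choice of 10 neighbours are just for illustration, and sklearn’s SpectralEmbedding shown later is the practical way to do this.

import numpy as np
from scipy.linalg import eigh
from sklearn.datasets import load_digits
from sklearn.neighbors import kneighbors_graph

X, _ = load_digits(return_X_y=True)
X = X[:200]

# Steps 1-2: k-NN graph with constant weights (w_ij = 1), symmetrised
W = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()
W = np.maximum(W, W.T)

# Step 3: weight/degree matrix D and graph Laplacian L = D - W,
# then solve the generalized eigenvector problem L f = lambda D f
D = np.diag(W.sum(axis=1))
L = D - W
eigvals, eigvecs = eigh(L, D)

# Step 4: skip the trivial constant eigenvector (eigenvalue 0 for a
# connected graph) and use the next two eigenvectors as the 2-D embedding
Z = eigvecs[:, 1:3]
Z.shape  # (200, 2)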

Here is a toy example showing how the algorithm works. We have 4 points in two dimensions, and we want to project them onto one dimension.

Image borrowed from PyData Berlin 2018 talk on Laplacian Eigenmaps[9].

Example

Here is an image showing Spectral Embedding applied to a sphere dataset. One can play with this example here[21].

Image borrowed from PyData Berlin 2018 talk on Laplacian Eigenmaps[9].

Code

Sklearn has good support for Laplacian Eigenmaps and one can use sklearn.manifold.SpectralEmbedding[17] directly. A good hyperparameter to play around with is how the affinity matrix is created.

from sklearn.datasets import load_digits
from sklearn.manifold import SpectralEmbedding

# Load the digits dataset: 1,797 samples with 64 features each
X, _ = load_digits(return_X_y=True)
X.shape  # (1797, 64)

# Project the first 100 samples onto a 2-dimensional embedding
embedding = SpectralEmbedding(n_components=2)
X_transformed = embedding.fit_transform(X[:100])
X_transformed.shape  # (100, 2)

Further Reading

  1. Original paper for further reading: Laplacian eigenmaps for dimensionality reduction and data representation[8]
  2. Access the PyData Berlin 2018 talk on Laplacian Eigenmaps through this webpage: https://juanitorduz.github.io/laplacian_eigenmaps_dim_red/. The page includes slides and a video that comprehensively addresses basics, motivation using a toy example, the algorithm, and a scikit-learn use case for the above algorithm.

t-SNE

t-SNE was developed by Laurens van der Maaten and Geoffrey Hinton in 2008 [11]. t-SNE is predominantly used for visualization, especially in high-dimensional data visualization like gene expression data or deep learning activations. t-SNE preserves the local similarity structure of the data and models each high-dimensional object by a two or three-dimensional point in such a way that similar objects are modelled by nearby points and dissimilar objects are modelled by distant points with high probability.

Algorithm

The algorithm for t-SNE is as follows (a rough sketch of the affinity computations appears after the figure):

  1. For each data point xi, centre a Gaussian distribution over it and measure the density of all other points xj under this Gaussian, as shown in the figure below.
  2. Calculate the conditional probability pij, representing the likelihood that xi would choose xj as its neighbour if neighbours were selected in proportion to their density under the Gaussian distribution.
  3. In the low-dimensional space, repeat the same process by centring a kernel over each point yi and measuring the density of all other points yj under a Student t-distribution, which gives the corresponding probabilities qij.
  4. Minimize the difference between the pij values in the high-dimensional space and the qij values in the low-dimensional map. The divergence measure used for this purpose is the Kullback–Leibler divergence.

Image borrowed from slides on Dimensionality reduction with t-SNE (2018)[22].
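
As a rough illustration of these steps, here is how the high-dimensional affinities p (Gaussian), the low-dimensional affinities q (Student t with one degree of freedom) and their Kullback–Leibler divergence could be computed. For simplicity it uses one global sigma, whereas t-SNE actually chooses a per-point sigma from the perplexity and then optimises the map points y by gradient descent on this divergence.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def tsne_affinities(X, Y, sigma=1.0):
    """Illustrative p (high-dimensional) and q (low-dimensional) affinities."""
    D_high = squareform(pdist(X, 'sqeuclidean'))  # squared distances between x_i
    D_low = squareform(pdist(Y, 'sqeuclidean'))   # squared distances between y_i

    # Steps 1-2: Gaussian affinities in the original space
    P = np.exp(-D_high / (2 * sigma ** 2))
    np.fill_diagonal(P, 0.0)
    P /= P.sum()

    # Step 3: Student-t affinities in the low-dimensional map
    Q = 1.0 / (1.0 + D_low)
    np.fill_diagonal(Q, 0.0)
    Q /= Q.sum()

    # Step 4: KL divergence that t-SNE minimises with respect to Y
    kl = np.sum(P * np.log(np.maximum(P, 1e-12) / np.maximum(Q, 1e-12)))
    return P, Q, kl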

Things to know:

  1. t-SNE doesn’t always produce similar output on successive runs.
  2. The number of iterations to converge depends upon the datasets.
  3. Cluster sizes or distances between clusters in a t-SNE plot mean nothing.

Example

There is a pretty good interactive article on Distill[15] where one can play with t-SNE.

Below is an example of the MNIST dataset and what the 2-dimensional projection looks like for Sammon Mapping, Isomap and t-SNE.

Image borrowed from UBC CPSC-340 slides[3].

Code

Sklearn has really good support for t-SNE and one can use sklearn.manifold.TSNE[18] directly. There are a lot of hyperparameters to play around with here, namely perplexity, learning rate, maximum number of iterations, etc. Perplexity reflects our opinion of the density of the data; a good idea is to vary it between 5 and 50.

import numpy as np
from sklearn.manifold import TSNE

# Toy example: embed four 3-dimensional points into 2 dimensions
# (perplexity must be smaller than the number of samples)
X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])
X_embedded = TSNE(n_components=2, learning_rate='auto',
                  init='random', perplexity=3).fit_transform(X)
X_embedded.shape  # (4, 2)

Further Reading

  1. Original paper for further reading: Visualizing data using t-SNE [11].
  2. Explore an insightful article on t-SNE, providing an intuitive explanation of the algorithm, and its functioning with a practical example, and highlighting its practical utility: https://towardsdatascience.com/t-sne-clearly-explained-d84c537f53a

UMAP

UMAP stands for Uniform Manifold Approximation and Projection. It is a recent algorithm developed by Leland McInnes, John Healy and James Melville in 2018 [13]. It can be used for data visualization (such as gene expression data), as a preprocessing step for machine learning, and for EDA to understand the inherent structure of data. UMAP starts by approximating the data manifold in the high-dimensional space using a fuzzy topological representation. This means that instead of creating a strict k-nearest neighbours graph, UMAP allows for a more nuanced connection between data points. The local structure is, however, captured similarly to t-SNE, by comparing distances in the original space to a distribution.

Advantages of UMAP over t-SNE.

  1. t-SNE always starts with a random initialization, whereas UMAP starts with a spectral embedding, so every time we run UMAP on a specific dataset we start from the same low-dimensional graph. As a result, t-SNE can produce different visualizations across runs or perplexity settings, while UMAP is more consistent.
  2. In each iteration, UMAP moves just one point or a subset of points, while t-SNE nudges every single point a little bit; this is a big part of why UMAP tends to be faster than t-SNE, especially for larger datasets. UMAP also gives more control over how tightly packed the low-dimensional points end up.
  3. UMAP retains more of the global structure than t-SNE, while t-SNE is predominantly focused on preserving local clusters.
  4. While primarily used for visualization, UMAP can be used as a general non-linear dimensionality reduction method for other tasks as well.

Example

Below is a UMAP projection of a 3D woolly mammoth skeleton (50k points, 10k shown) into 2 dimensions, using the parameters n_neighbors = 15 and min_dist = 0.1.

Image borrowed from pair-code-github[12]

Code

One can install the Python package umap-learn[19] to work with UMAP. An important thing to know is that the choice of hyperparameters really matters; one can learn about the different hyperparameters in this article[12] from Google.

# pip install umap-learn
import umap
from sklearn.datasets import load_digits

# n_neighbors=15 and min_dist=0.1 are the library defaults
digits = load_digits()
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(digits.data)

Further Reading

  1. Original paper for further reading: UMAP: Uniform manifold approximation and projection for dimension reduction [13].
  2. Gain insights into UMAP theory, its parameters, and algorithmic nuances with this informative article from Google: https://pair-code.github.io/understanding-umap/.

Conclusion

As we conclude our journey through the wacky world of non-linear manifold learning, it’s essential to consider practical tips for implementing these algorithms effectively. A few of them are listed here[20].

Here is a table presenting a comparison of various algorithms, outlining their pros and cons, along with suitable use cases:

| Algorithm | Pros | Cons | When to use |
|---|---|---|---|
| Sammon Mapping | Fast | Sensitive to noise; prone to local minima | Good for small datasets |
| Isomap | Can handle large datasets | Sensitive to noise and outliers | Good for datasets with intrinsic geometry |
| Laplacian Eigenmaps | Good at preserving local structure; handles non-linear data | Sensitive to noise | Good for datasets with intrinsic geometry |
| t-SNE | Effective for visualizing high-dimensional data | Slow | Good for datasets with complex non-linear structure |
| UMAP | Fast; balances global and local structure; can handle large datasets | Requires careful hyperparameter tuning | Good for datasets with complex non-linear structure |

Table generated from ChatGPT and Bard

While there’s no single “one-size-fits-all” solution, understanding the unique characteristics of each algorithm is important. Here are some key steps to success:

  1. Embrace experimentation. Try different algorithms based on your data and goals. Explore popular options like Isomap, t-SNE, and UMAP, each with its strengths and weaknesses.
  2. Fine-tune with hyperparameters. These parameters control the behaviour of the algorithm. Experiment with different values to find the optimal configuration for your specific dataset.
  3. Visualize for comprehension. Once you have the results, visualize the low-dimensional representation created by the algorithm. This helps you identify patterns and gain meaningful insights hidden within the data.
  4. Iterate and improve. Remember, the process is iterative. Based on your visualizations and understanding, refine your choices of algorithm and hyperparameters, and repeat the steps to continuously improve your results.

Remember, while non-linear manifold learning algorithms are powerful tools, their success ultimately lies in your ability to understand their capabilities and limitations and approach them strategically.

References

  1. Wikipedia contributors. (2024, February 20). Nonlinear dimensionality reduction. In Wikipedia, The Free Encyclopedia. Retrieved 19:51, March 5, 2024, from https://en.wikipedia.org/w/index.php?title=Nonlinear_dimensionality_reduction&oldid=1209176789
  2. Wikipedia contributors. (2023, March 28). Sammon mapping. In Wikipedia, The Free Encyclopedia. Retrieved 18:40, March 6, 2024, from https://en.wikipedia.org/w/index.php?title=Sammon_mapping&oldid=1146979695
  3. https://ubc-cs.github.io/cpsc340/lectures/L27.pdf
  4. Wikipedia contributors. (2024, January 4). Isomap. In Wikipedia, The Free Encyclopedia. Retrieved 20:20, March 6, 2024, from https://en.wikipedia.org/w/index.php?title=Isomap&oldid=1193584106
  5. Henderson, P. (1997). Sammon mapping. Pattern Recognition Letters, 18(11-13), 1307-1316.
  6. Sammon, J. W. (1969). A nonlinear mapping for data structure analysis. IEEE Transactions on Computers, C-18(5), 401-409.
  7. Tenenbaum, J. B., Silva, V. D., & Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. science, 290(5500), 2319-2323.
  8. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6), 1373-1396.
  9. https://juanitorduz.github.io/documents/orduz_pydata2018.pdf
  10. https://juanitorduz.github.io/laplacian_eigenmaps_dim_red/
  11. Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research, 9(11).
  12. https://pair-code.github.io/understanding-umap/
  13. McInnes, L., Healy, J., & Melville, J. (2018). Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426.
  14. https://github.com/tompollard/sammon
  15. https://distill.pub/2016/misread-tsne/
  16. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.Isomap.html#sklearn.manifold.Isomap
  17. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.SpectralEmbedding.html#sklearn.manifold.SpectralEmbedding
  18. https://scikit-learn.org/stable/modules/generated/sklearn.manifold.TSNE.html#sklearn.manifold.TSNE
  19. https://umap-learn.readthedocs.io/en/latest/
  20. https://scikit-learn.org/stable/modules/manifold.html#tips-on-practical-use
  21. https://scikit-learn.org/stable/auto_examples/manifold/plot_manifold_sphere.html
  22. https://seoulai.com/presentations/t-SNE.pdf