Summary

This blog post explores the importance of evaluating features after dimensionality reduction, highlighting how these methods can mitigate issues like overfitting and reduce computational costs, while emphasizing the need to ensure the retained features are informative. It details model-specific techniques for assessing feature quality, including evaluating PCA through explained variance and scree plots and ICA through kurtosis, alongside general methods such as calculating reconstruction error. It also covers model-agnostic techniques, such as visualization through scatter and t-SNE plots and evaluation of clustering performance. Finally, it suggests further evaluations using neural network models to assess the impact of dimensionality reduction on model performance and explainability, underscoring the breadth of methods available for understanding and optimizing the feature space.

Introduction 

Dimensionality reduction can be a critical preprocessing step that transforms a dataset’s high-dimensional input features into a much lower-dimensional latent space. It brings multiple benefits during model training, including avoiding the curse of dimensionality, reducing the risk of overfitting, and lowering computational costs. However, how can we ensure that the features obtained after dimensionality reduction are informative and help the model learn the essential patterns? In the following post, we introduce some practical techniques to help you assess the quality of the features produced by dimensionality reduction.

1). Model-Specific Techniques

Among dimensionality reduction methods, linear techniques such as Principal Component Analysis (PCA) and Independent Component Analysis (ICA) have long been popular for their simplicity and interpretability. Based on the mechanism of each method, specific checks can be applied to evaluate the features it produces.

Evaluate PCA

The key intuition of the PCA algorithm is to extract the principal components (directions) of the original features that capture the most variance in the data by performing an eigendecomposition of the data’s covariance matrix. The resulting eigenvectors indicate the directions, and the corresponding eigenvalues represent the variance captured along them. The eigenvectors corresponding to the largest eigenvalues are selected as the principal components for feature projection. [1]

  • Variance Explained / Scree Plot: Given how PCA works, the most important criterion is how much variance is explained by each principal component. This helps you decide how many directions to keep in the transformed features. A scree plot further helps you visualize the “elbow” point beyond which the remaining components contribute little to capturing the data’s variance, so the components before the elbow should be retained. A step-by-step tutorial for creating a scree plot in Python can be found in [2], and a minimal sketch is shown below.
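As a quick illustration, here is a minimal sketch using scikit-learn; the iris dataset and the choice to standardize the features first are assumptions made purely for the example:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load a small example dataset and standardize it (PCA is scale-sensitive).
X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components so we can inspect every eigenvalue.
pca = PCA().fit(X_scaled)

# explained_variance_ratio_ gives the fraction of variance per component.
ratios = pca.explained_variance_ratio_
components = np.arange(1, len(ratios) + 1)

plt.plot(components, ratios, "o-", label="per-component")
plt.plot(components, np.cumsum(ratios), "s--", label="cumulative")
plt.xlabel("Principal component")
plt.ylabel("Explained variance ratio")
plt.title("Scree plot")
plt.legend()
plt.show()
```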

Evaluate ICA

The key idea behind ICA is to recover independent source features from a mixture of observed features. By the central limit theorem, a mixture of several independent sources tends to look more Gaussian than any individual source, so each underlying source is expected to be less Gaussian than the observed mixtures. The goal is therefore to find the least Gaussian directions within the original feature space.

  • Kurtosis: Kurtosis is a measure of the “tailedness” of a probability distribution. An excess kurtosis (kurtosis minus 3) of 0 indicates a Gaussian distribution, while positive and negative values correspond to distributions that are more heavy-tailed or light-tailed than the Gaussian, respectively [3]. SciPy provides a test for whether a random variable is normally distributed [4]; the statistic reported there is the Fisher (excess) kurtosis, for which 0 indicates a Gaussian distribution. A small sketch follows below.
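For instance, a minimal sketch with SciPy; the synthetic Gaussian and Laplace samples are only for illustration:

```python
import numpy as np
from scipy.stats import kurtosis, kurtosistest

rng = np.random.default_rng(0)
gaussian = rng.normal(size=5000)       # Fisher kurtosis should be close to 0
heavy_tailed = rng.laplace(size=5000)  # Fisher kurtosis should be positive

# scipy's kurtosis() returns the Fisher (excess) kurtosis by default,
# so 0 corresponds to a Gaussian distribution.
print(kurtosis(gaussian), kurtosis(heavy_tailed))

# kurtosistest() checks whether the kurtosis is consistent with a normal
# distribution; a small p-value suggests the variable is non-Gaussian,
# which is exactly what ICA looks for.
print(kurtosistest(heavy_tailed))
```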

General techniques for linear methods

  • Calculate reconstruction error: Linear dimensionality reduction methods transform the original feature space into a low-dimensional space through matrix multiplication, so the low-dimensional features can be mapped back to the original dimension by another matrix multiplication. We can therefore calculate the reconstruction error between the original features and the reconstructed features in the original dimension to evaluate how well the low-dimensional space preserves the information.

Take PCA as an example: suppose Z = XV, where X is the original feature matrix, V is the matrix of selected eigenvectors, and Z is the resulting low-dimensional feature matrix. We can reconstruct the features as X̂ = ZVᵀ (where Vᵀ is the transpose of V) and then calculate the Mean-Squared Error (MSE) between X and X̂. A similar process can be applied to ICA and Random Projection (RP). In scikit-learn, the inverse mapping is available through the pca.inverse_transform function [5].

Note that if you scaled the original features, you should reverse the feature scaling before calculating the reconstruction error against the raw features.
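Here is a minimal sketch of the reconstruction-error computation with scikit-learn, including the reversal of feature scaling; the iris dataset and the choice of two components are assumptions for the example:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)

# Scale, reduce, then map back to the original dimensionality.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

pca = PCA(n_components=2)
Z = pca.fit_transform(X_scaled)          # low-dimensional features
X_scaled_hat = pca.inverse_transform(Z)  # back to the original dimension

# Undo the feature scaling before comparing against the raw features.
X_hat = scaler.inverse_transform(X_scaled_hat)

mse = np.mean((X - X_hat) ** 2)
print(f"Reconstruction MSE: {mse:.4f}")
```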

2). Model-Agnostic Techniques

The methods above are tied to a specific dimensionality reduction method, or to linear methods in general, but more general techniques can also be applied to evaluate the effect of dimensionality reduction. In this section, we cover some practical ways to check feature quality regardless of which dimensionality reduction method was used.

Visualization

Visualization is usually the most direct and effective first step for inspecting the resulting low-dimensional feature space. The ability to visualize the data is itself regarded as a major advantage of dimensionality reduction, since it helps people see and understand the data’s structure. There are several ways to visualize the space, as follows.

  • Scatter plot: After reducing the features to 2D or 3D with a dimensionality reduction technique, we can simply use a scatter plot to visualize the low-dimensional features. Within the plot, we can investigate the clustering behavior of the features, whether there are outliers, the correlation between features, and so on, to assess how well they preserve the information essential for classification.
  • t-SNE plot: We can also use techniques like t-SNE in addition to a plain scatter plot. t-SNE is itself a non-linear dimensionality reduction method that can preserve non-linear relations in the data, so it can complement linear methods like PCA by capturing additional non-linearity and providing better visualizations for interpreting the resulting low-dimensional feature space. In particular, a t-SNE plot can reveal the data’s structure and clustering behavior. Two caveats: the visualization is sensitive to hyperparameter tuning (notably the perplexity), and the algorithm is randomized, so different runs can produce different layouts. A similar method is UMAP [6]. A tutorial for t-SNE plots in Python can be found in [7]; the code first transforms the original features with the t-SNE model and then plots the resulting low-dimensional space with a scatter plot.

As an example, a t-SNE plot of the MNIST dataset clearly shows the clustering behavior of the different digit labels.
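A minimal sketch that produces such a plot, assuming scikit-learn’s load_digits (a small MNIST-style dataset) as a convenient stand-in:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# load_digits is a small MNIST-style dataset (8x8 handwritten digits).
X, y = load_digits(return_X_y=True)

# Perplexity and the random seed both noticeably change the layout,
# so it is worth trying a few settings.
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_2d = tsne.fit_transform(X)

plt.scatter(X_2d[:, 0], X_2d[:, 1], c=y, cmap="tab10", s=5)
plt.colorbar(label="digit label")
plt.title("t-SNE of the digits dataset")
plt.show()
```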

Clustering performance

One of the key properties of a feature space is how well it supports clustering for the downstream prediction task. Just like what we look for in the visualizations, we can explicitly analyze the clustering behavior with a clustering algorithm and corresponding evaluation metrics. A good low-dimensional space should preserve the correct clustering structure.

  • First, choose a clustering algorithm, such as K-means or DBSCAN, and cluster the low-dimensional feature space.
  • Second, evaluate the resulting clusters with measures like the Silhouette score, which measures how similar a sample is to its own cluster compared to other clusters [8]. The higher the value, the better the clustering: a value near 1 indicates well-separated clusters, a value near 0 indicates overlapping clusters, and a value near -1 indicates that samples have been assigned to the wrong cluster. You can use the Python function sklearn.metrics.silhouette_score to calculate it [9].
  • If labels are available, check how well the resulting clusters agree with the ground-truth labels (for example, with a metric such as the adjusted Rand index). A minimal sketch of this workflow is shown below.
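A minimal sketch of this workflow with scikit-learn; the digits dataset, PCA with 10 components, and K-means with 10 clusters are all assumptions made for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, adjusted_rand_score

X, y = load_digits(return_X_y=True)

# Reduce the dimensionality, then cluster the low-dimensional features.
Z = PCA(n_components=10, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(Z)

# Internal quality: silhouette score in the low-dimensional space.
print("Silhouette:", silhouette_score(Z, labels))

# External check against the ground-truth labels, if available.
print("Adjusted Rand index:", adjusted_rand_score(y, labels))
```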

3). Further Evaluation Using NN Models

The methods in the previous two sections focus on evaluating the resulting low-dimensional feature space directly. In practice, the ultimate goal of dimensionality reduction is to aid the model’s performance. Therefore, an indirect assessment is to evaluate feature quality and understand the feature space with the help of downstream Neural Network (NN) models.

Model performance

Intuitively, if the low-dimensional features preserve all the key patterns for the prediction task and remove irrelevant or redundant features, then the performance of a model trained on them should be as good as or better than that of a model trained on the raw input space.
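One simple sanity check is to train the same downstream model on both the raw and the reduced features and compare test performance. Below is a hedged sketch with scikit-learn; the digits dataset, the small MLP, and the choice of 20 principal components are assumptions for illustration:

```python
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

def test_accuracy(with_pca: bool) -> float:
    # Build an identical pipeline with or without the PCA step.
    steps = [StandardScaler()]
    if with_pca:
        steps.append(PCA(n_components=20))
    steps.append(MLPClassifier(hidden_layer_sizes=(64,), max_iter=1000, random_state=0))
    return make_pipeline(*steps).fit(X_train, y_train).score(X_test, y_test)

print("Raw features:", test_accuracy(with_pca=False))
print("PCA features:", test_accuracy(with_pca=True))
```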

Explainability of the model 

Beyond checking performance, more in-depth investigation can be done to understand how the NN model uses the low-dimensional feature space. This family of methods has gained much attention in modern ML/DL under the names Explainable Artificial Intelligence (XAI) or Explainable Machine Learning (XML) [10, 11]. NNs are often described as “black box” models, making it hard to understand the mechanism behind their inferences. However, for high-stakes applications like medical prediction or scientific discovery, it is important to understand why the model makes a given prediction. In particular, people would like to know which features contribute most to the final prediction. In the same way, we can use XAI methods to interpret how the features in the low-dimensional space influence model performance. The following are some practical methods that are often used to interpret tabular data; if you are interested in this field, there are also more advanced techniques designed for image data or graph data for your further reading.

  • Local Interpretable Model-Agnostic Explanations (LIME) [12]: The key idea of LIME is to perturb the input data and observe how the predictions change. Although the model may be complex globally, we can interpret it locally through a linear approximation. The linear model fitted to the perturbed data serves as the explanation and illustrates the model’s local decision boundary. Below is an example on the iris dataset with a random forest classifier [13].
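A minimal sketch with the lime package; the train/test split and the number of trees are assumptions for illustration:

```python
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0
)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

explainer = LimeTabularExplainer(
    X_train,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    discretize_continuous=True,
)

# Explain one test instance by fitting an interpretable surrogate
# to perturbed copies of it.
exp = explainer.explain_instance(X_test[0], rf.predict_proba, num_features=4)
print(exp.as_list())        # (feature condition, weight) pairs
# exp.show_in_notebook()    # renders the probability / importance / value panels
```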

In LIME’s visual output, the left-most column shows the prediction probabilities, the middle column shows the feature importance for each class, and the right-most column is a table of the raw feature values.

Some other interpretation methods, such as SHapley Additive exPlanations (SHAP) and Partial Dependence Plots (PDP), were introduced in our previous post (https://sites.gatech.edu/omscs7641/2024/02/07/introduction-to-classification-model-comparison-methods/).

Further Techniques

  • For Image Data: Grad-CAM uses the model’s gradients to determine which features most influence the prediction, and the result can be rendered as a heat map highlighting the critical regions of the image. If you are interested, further works can be found in this survey article [14].
  • For Graph Data: Similarly, the key to interpreting graph models is to identify the subgraph that drives the prediction. Many methods use causality to extract the invariant subgraph that determines the data label. Some representative works can be found in this survey article [15].

Conclusion

This post covered several families of methods for evaluating and interpreting the low-dimensional feature space produced by dimensionality reduction. These methods are useful tools for building your analysis of dimensionality reduction algorithms and connect naturally with clustering algorithms.

References

[1]. Shlens, Jonathon. “A tutorial on principal component analysis.” arXiv preprint arXiv:1404.1100 (2014).

[2]. Zach. “How to Create a Scree Plot in Python (Step-by-Step).” Statology, 18 Sept. 2021, www.statology.org/scree-plot-python/.

[3]. Turney, Shaun. “What Is Kurtosis?: Definition, Examples & Formula.” Scribbr, 29 Jan. 2024, www.scribbr.com/statistics/kurtosis/.

[4]. “scipy.stats.kurtosistest.” SciPy v1.12.0 Manual, docs.scipy.org/doc/scipy/reference/generated/scipy.stats.kurtosistest.html.

[5]. “sklearn.decomposition.PCA.” scikit-learn documentation, scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html.

[6]. McInnes, Leland, John Healy, and James Melville. “Umap: Uniform manifold approximation and projection for dimension reduction.” arXiv preprint arXiv:1802.03426 (2018).

[7]. Violante, Andre. “An Introduction to t-SNE with Python Example.” Medium, 31 Aug. 2018, medium.com/@violante.andre/an-introduction-to-t-sne-with-python-example-47e6ae7dc58f.

[8]. “Silhouette (Clustering).” Wikipedia, Wikimedia Foundation, 25 Dec. 2023, en.wikipedia.org/wiki/Silhouette_(clustering).

[9]. “sklearn.metrics.silhouette_score.” scikit-learn documentation, scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html.

[10]. Arrieta, Alejandro Barredo, et al. “Explainable Artificial Intelligence (XAI): Concepts, taxonomies, opportunities and challenges toward responsible AI.” Information fusion 58 (2020): 82-115.

[11]. “Explainable Artificial Intelligence.” Wikipedia, Wikimedia Foundation, 7 Mar. 2024, en.wikipedia.org/wiki/Explainable_artificial_intelligence.

[12]. Ribeiro, Marco Tulio, Sameer Singh, and Carlos Guestrin. “‘Why should I trust you?’: Explaining the predictions of any classifier.” Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2016.

[13]. Marcotcr. “marcotcr/lime: Explaining the Predictions of Any Machine Learning Classifier.” GitHub, github.com/marcotcr/lime.

[14]. Das, Arun, and Paul Rad. “Opportunities and challenges in explainable artificial intelligence (xai): A survey.” arXiv preprint arXiv:2006.11371 (2020).

[15]. Yuan, Hao, et al. “Explainability in graph neural networks: A taxonomic survey.” IEEE transactions on pattern analysis and machine intelligence 45.5 (2022): 5782-5799.