Summary
Hyperparameter tuning is the process of finding the combination of hyperparameters that yields the best performance from a machine learning model, and it can be framed as an optimization problem. This tutorial briefly formulates the hyperparameter tuning problem, describes several tuning methods, and walks through a short scikit-learn demonstration of those methods using an SVM classifier on the iris dataset.
Introduction
Hyperparameter tuning is the process of finding the best hyperparameter values to use for a machine learning model. There are several methods for hyperparameter tuning, such as Grid Search, Random Search, and Bayesian Search. Grid Search performs an exhaustive search over a user-defined discrete hyperparameter space [1, 3]. Random Search improves upon Grid Search by sampling hyperparameter values at random from distributions defined over the hyperparameter space [1, 3]. Finally, Bayesian Search attempts to improve search performance by trading off exploration and exploitation using Bayesian optimization [1, 2, 5]. This tutorial connects concepts from both A1 and A2 by 1) setting up hyperparameter tuning as an optimization problem, 2) describing how Grid Search, Random Search, and Bayesian Search try to solve the problem, and 3) providing a short demonstration with scikit-learn and scikit-optimize on an SVM classifier.
Hyperparameter Tuning as an Optimization Problem
When starting out in machine learning, optimization and hyperparameter tuning might not seem to be connected. However, by considering an objective function that can be maximized or minimized with respect to a set of hyperparameters, hyperparameter tuning can be restated mathematically as an optimization problem. Eq. 1 shows the objective function measured as the expected loss given a dataset D, learning algorithm A, and loss function L over a set of hyperparameters H; h* is the optimal set of hyperparameters, found by minimizing the objective function [1, 4].
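Written out, one common form of Eq. 1, consistent with the description above and the formulation in [1, 4], is:

h^{*} = \arg\min_{h \in H} \; \mathbb{E}_{D}\!\left[\, L\!\left(A_{h}, D\right) \,\right]    (1)

where A_h denotes the learning algorithm A configured with the hyperparameters h, and the expectation is taken over the data D.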
In this tutorial, the learning algorithm is an SVM classifier, with classification metrics such as accuracy, recall, precision, and f1-score used in place of a loss function. Because the goal is to maximize the selected classification metric, the problem becomes a maximization problem instead. For each tuning algorithm, the hyperparameters are drawn from the kernel type, the regularization parameter C, the kernel coefficient γ, and the degree of the polynomial kernel; the corresponding SVC constructor arguments are sketched below.
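As a point of reference, these hyperparameters map onto scikit-learn's SVC constructor as follows (the values shown are the library defaults, used here purely for illustration, not tuned choices):

# Illustrative only: SVC arguments corresponding to the hyperparameters above
from sklearn.svm import SVC

svc = SVC(kernel='rbf',    # kernel type: 'linear', 'poly', 'rbf', ...
          C=1.0,           # regularization parameter C
          gamma='scale',   # kernel coefficient (gamma)
          degree=3)        # degree, used only by the polynomial kernel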
Grid Search
Grid Search is the first hyperparameter tuning algorithm to discuss. The algorithm exhaustively evaluates every possible combination of hyperparameter values in a user-defined grid. For the SVM classifier in this tutorial, if 3 different values are chosen for each of C and γ, Grid Search loops through 3*3 = 9 possible combinations of C and γ; a minimal sketch of this enumeration appears below. As the number of hyperparameters and the number of values per hyperparameter grow, the grid becomes very large, a consequence of the "curse of dimensionality" that affects all exhaustive search algorithms [1], and Grid Search becomes correspondingly more computationally expensive. Figure 1 shows an example of an exhaustive grid space for 2 hyperparameters with 5 chosen values each [1, 3].
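The sketch below enumerates such a 3 x 3 grid with scikit-learn's ParameterGrid; the values for C and gamma are illustrative and not the grid used later in the tutorial.

# Sketch: enumerate a 3 x 3 grid over C and gamma (9 combinations in total)
from sklearn.model_selection import ParameterGrid

toy_grid = {'C': [0.1, 1.0, 10.0], 'gamma': [0.01, 0.1, 1.0]}
for params in ParameterGrid(toy_grid):
    print(params)   # e.g. {'C': 0.1, 'gamma': 0.01}, ..., 9 lines in total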
Grid Search requires prior knowledge of reasonable hyperparameter values, because every candidate combination in the grid is evaluated. For example, a user with prior experience with SVMs can tune C and γ over a narrow range of linearly or logarithmically spaced values. Although Grid Search can be computationally expensive, its independent evaluations are straightforward to parallelize. Grid Search produces good results as long as the grid is kept to a narrow range of values that can be searched efficiently [1, 3].
Randomized Search
Random Search reduces the computational expense of exhaustively searching the hyperparameter space by randomly sampling hyperparameter values from specified distributions. Figure 2 shows a possible set of sampled values after running Random Search 9 times. Rather than relying on user-defined discrete values, the sampled values often come from continuous distributions [1, 3].
Figure 3 shows an example of how Random Search selects values from a distribution; note that all four selected points are drawn at random. Like Grid Search, Random Search benefits from parallelization. However, because of its stochastic nature, the algorithm may need to keep sampling for a long time over a fairly large range of the hyperparameter space. Even so, Random Search can outperform Grid Search in cases where many different values need to be tested [1]. Since the algorithm can still be computationally expensive, the recommendation is to use Random Search in the earlier stages of hyperparameter tuning [1]. For example, users with little experience with SVMs likely do not know good ranges for C and γ; a randomized search can help narrow down a promising range once its best parameters are found. A small sketch of sampling from such distributions is shown below.
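This sketch draws candidate values for C and gamma from scipy.stats distributions; the ranges are illustrative and differ from the distributions used later in the tutorial.

# Sketch: sample 9 candidate values each for C and gamma from continuous distributions
from scipy.stats import uniform, loguniform

seed = 0
C_samples = uniform(0, 4).rvs(size=9, random_state=seed)             # uniform on [0, 4)
gamma_samples = loguniform(1e-3, 1e1).rvs(size=9, random_state=seed)  # log-uniform on [1e-3, 1e1]
print(C_samples)
print(gamma_samples)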
Bayesian Search
Bayesian Search attempts to improve upon Grid Search and Random Search by guiding its choice of hyperparameters with Bayesian optimization. Bayesian statistics has been used in many machine learning algorithms, such as expectation-maximization (EM) and the Naive Bayes classifier. Bayesian Search is another use case for Bayesian methods, in which hyperparameters are chosen to maximize the maximum likelihood (MLE) or maximum a posteriori (MAP) estimates [5]. Figure 4 shows a possible set of hyperparameters that can be used to maximize the MLE or MAP estimates of the observations.
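Concretely, in the Bayesian optimization formulation of [5], the next candidate is typically chosen by maximizing an acquisition function over a probabilistic surrogate model of the objective. Adapting that notation to a candidate hyperparameter setting h, and assuming the objective f is being maximized, the common expected-improvement acquisition function takes the form:

\mathrm{EI}_{n}(h) = \mathbb{E}_{n}\!\left[\max\!\big(f(h) - f_{n}^{*},\, 0\big)\right]

where f_n^* is the best objective value observed after n evaluations and \mathbb{E}_n denotes the expectation under the surrogate model's posterior given those n observations.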
Because each new candidate is informed by all previous evaluations, Bayesian Search can improve its next choice of hyperparameter values and typically needs fewer calls to the objective function, moving toward hyperparameters that improve the MLE or MAP estimates [5]. However, depending on the implementation, the sequential nature of the optimization may not parallelize well, and Bayesian Search can consume far more computational resources than Grid Search or Random Search [1]. Bayesian Search is well suited to expensive objective functions that cannot be represented or solved easily by other optimization methods, which is why hyperparameter tuning is a natural fit for this approach [5]. A minimal sketch of the optimization loop with scikit-optimize is shown below.
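The sketch below uses scikit-optimize's gp_minimize to tune only C and γ for an RBF-kernel SVC on the iris dataset, with 5-fold cross-validated accuracy as the objective; the ranges and n_calls value are illustrative rather than the settings used later in the tutorial.

# Sketch: Bayesian optimization of C and gamma with a Gaussian-process surrogate
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from skopt import gp_minimize

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

def objective(params):
    C, gamma = params
    score = cross_val_score(SVC(kernel='rbf', C=C, gamma=gamma), X, y, cv=5).mean()
    return -score  # gp_minimize minimizes, so negate the accuracy

search_space = [(1e-3, 1e3, 'log-uniform'),   # C
                (1e-4, 1e1, 'log-uniform')]   # gamma
result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
print('Best (C, gamma):', result.x)
print('Best CV accuracy:', -result.fun)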
Hyperparameter Tuning Python Tutorial
This tutorial focuses on a simple run of an SVM with the scikit-learn and scikit-optimize implementations of Grid Search, Random Search, and Bayesian Search. The dataset chosen is the iris dataset, which contains 150 samples, 4 features, and 3 class labels. Each class is equally represented, so the dataset is perfectly balanced [3]. The code below starts by importing the necessary libraries, classes, and functions.
# Import necessary classes and functions
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import classification_report
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from skopt import BayesSearchCV
import numpy as np
from scipy.stats import uniform
from time import time
After importing the libraries, the dataset is loaded and stored in the X and y variables. A 70:30 train-test split is chosen with a single fixed random seed. Since SVMs are sensitive to the magnitudes of the numerical features, a standard scaler is used to scale the data before the training and prediction steps. The code below shows the functions and classes used for data loading and scaling.
# Load dataset, split it into train and test sets, and scale X_train and X_test for the SVC algorithm
X, y = load_iris(return_X_y=True)
random_state = 0
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.7, shuffle=True, random_state=random_state)
std_scaler = StandardScaler()
X_train_scaled = std_scaler.fit_transform(X_train)
X_test_scaled = std_scaler.transform(X_test)
For each search algorithm, the corresponding class and its search space are created to fit on the training data and predict on the test data. Notice that for Grid Search a narrow range of C and γ is chosen, illustrating that Grid Search works best with narrower search criteria. Random Search and Bayesian Search use larger ranges of values under the assumption that the user has no intuition about which values will work for the given dataset. The code below shows the chosen search spaces. Each algorithm evaluates 81 hyperparameter combinations so that the algorithms can be compared over a similar number of candidates, and each is parallelized (n_jobs=4) for faster run times.
# Grid Search search space
param_grid = {
    'C': np.logspace(-1, 1, 3),
    'gamma': np.logspace(-1, 1, 3),
    'degree': np.linspace(1, 8, 3, dtype=np.int64),
    'kernel': ['linear', 'poly', 'rbf']
}

# Random Search search space
param_distributions = {
    'C': uniform(0, 4),
    'gamma': uniform(0, 4),
    'degree': [1, 5, 8],
    'kernel': ['linear', 'poly', 'rbf']
}

# Bayesian Search search space
search_spaces = {
    'C': (1e-5, 1e5, 'log-uniform'),
    'gamma': (1e-5, 1e2, 'log-uniform'),
    'degree': (1, 8),
    'kernel': ['linear', 'poly', 'rbf']
}

# Instantiate algorithms with 5-fold stratified cross-validation
grid_opt = GridSearchCV(
    SVC(),
    param_grid=param_grid,
    cv=5,
    n_jobs=4
)
randomized_opt = RandomizedSearchCV(
    SVC(),
    param_distributions=param_distributions,
    n_iter=81,
    cv=5,
    n_jobs=4
)
bayes_opt = BayesSearchCV(
    SVC(),
    search_spaces=search_spaces,
    n_iter=81,
    cv=5,
    n_jobs=4
)
The code below fits each search algorithm and then prints its run time, best parameters, and classification report on the test set. The classification reports are discussed briefly in the results section, followed by some concluding remarks.
# Fit, time, and evaluate Grid Search
start = time()
grid_opt.fit(X_train_scaled, y_train)
run_time = time() - start
print(f'Grid Search Time (s): {run_time}')
print(f'Best Params: {grid_opt.best_params_}')
y_pred = grid_opt.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Fit, time, and evaluate Randomized Search
start = time()
randomized_opt.fit(X_train_scaled, y_train)
run_time = time() - start
print(f'Randomized Search Time (s): {run_time}')
print(f'Best Params: {randomized_opt.best_params_}')
y_pred = randomized_opt.predict(X_test_scaled)
print(classification_report(y_test, y_pred))

# Fit, time, and evaluate Bayesian Search
start = time()
bayes_opt.fit(X_train_scaled, y_train)
run_time = time() - start
print(f'Bayesian Search Time (s): {run_time}')
print(f'Best Params: {bayes_opt.best_params_}')
y_pred = bayes_opt.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
Results & Discussion
The classification reports below for the multiclass iris dataset are identical across all three search algorithms. However, notice the best parameters. For Grid Search, a degree-1 (effectively linear) polynomial kernel with C and γ both at 10.0 is sufficient to classify well (accuracy, precision, and recall are all high). For Random Search, the best kernel was RBF with a C of 1.45 and a γ of 0.188; the degree does not matter here since the best estimator used an RBF kernel rather than the polynomial kernel. For Bayesian Search, the best kernel was polynomial with a degree of 3, C at 2.26e-5, and γ at 100.0. The best parameters differ markedly across the three algorithms, meaning each search algorithm arrived at a different hypothesis that can classify the iris dataset well.
This suggests that the dataset is easy to model with several different SVM hypotheses. Plotting learning and validation curves alongside hyperparameter tuning is usually desirable for discussing the bias-variance and overfitting-underfitting trade-offs. These curves were omitted here because the tutorial is written from the point of view of a user new to scikit-learn and scikit-optimize, emphasizing a quick-start guide to how hyperparameter tuning works with these three algorithms.
Another notable result is that Bayesian Search runs much longer (in seconds) than Grid Search (second longest) or Random Search. A likely reason for the longer computation time is that Bayesian methods carry extra overhead per iteration: the surrogate probability model must be updated and queried before each new candidate is proposed, and candidates are suggested largely sequentially [5]. Grid Search exhaustively searches the hyperparameter space but does not have the same per-iteration computational requirements as Bayesian Search. Finally, Random Search is the fastest here; sampling without guidance or exhaustive enumeration adds little overhead, though its run time can vary with the random seed.
Grid Search Time (s): 0.7962002754211426
Best Params: {'C': 10.0, 'degree': 1, 'gamma': 10.0, 'kernel': 'poly'}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Randomized Search Time (s): 0.12950968742370605
Best Params: {'C': 1.4514262033840444, 'degree': 5, 'gamma': 0.1881736303575372, 'kernel': 'rbf'}
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45

Bayesian Search Time (s): 58.95056343078613
Best Params: OrderedDict([('C', 2.256584259459779e-05), ('degree', 3), ('gamma', 100.0), ('kernel', 'poly')])
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        16
           1       1.00      0.94      0.97        18
           2       0.92      1.00      0.96        11

    accuracy                           0.98        45
   macro avg       0.97      0.98      0.98        45
weighted avg       0.98      0.98      0.98        45
Concluding Remarks
The iris dataset is very small and balanced, so it is not representative of most data that would be collected in practice. Each tuned search algorithm found a different SVM hypothesis that worked well, and all of the models achieved the same values for the chosen classification metrics; the run times, however, differed considerably between Bayesian Search and the other two algorithms. The results show that each search algorithm (much like other kinds of search algorithms) explores hyperparameter combinations that maximize the classification metrics, so its output can be treated as the solution to an optimization problem. Tuning methods of this kind are commonly applied to supervised learning algorithms before performing classification or regression tasks. A1 focuses on supervised techniques that can help improve classification or regression scores, while A2 focuses on randomized optimization methods for solving common optimization problems. Hyperparameter tuning mixes concepts from both assignments and provides some insight into how different optimization methods work in practice. Machine learning is an interesting field in which optimization appears in seemingly unrelated topics, and further exploration of how optimization is used in unsupervised, semi-supervised, and reinforcement learning is encouraged as students delve deeper into CS 7641.
References
[1] T. Yu and H. Zhu, "Hyper-Parameter Optimization: A Review of Algorithms and Applications," arXiv, 2020.
[2] “scikit-optimize: sequential model-based optimization in Python — scikit-optimize 0.8.1 documentation.” Accessed: Feb. 15, 2024. [Online]. Available: https://scikit-optimize.github.io/stable/
[3] “scikit-learn: machine learning in Python — scikit-learn 1.4.1 documentation.” Accessed: Feb. 15, 2024. [Online]. Available: https://scikit-learn.org/stable/
[4] “AutoML | AutoML: Methods, Systems, Challenges (first book on AutoML).” Accessed: Feb. 15, 2024. [Online]. Available: https://www.automl.org/book/
[5] P. I. Frazier, “A Tutorial on Bayesian Optimization.” arXiv, Jul. 08, 2018. doi: 10.48550/arXiv.1807.02811.