Summary
Random Search Algorithms significantly enhance machine learning optimization, excelling in complex or limited-resource scenarios by offering an efficient alternative to traditional methods like grid search or gradient descent. The discussion focuses on Randomized Hill Climbing, Simulated Annealing, and Parallel Recombinative Simulated Annealing, alongside conventional tuning techniques such as grid and random search. Through a case study on SVM model tuning with the Wine dataset, random search algorithms are shown to surpass traditional methods in simplicity, speed, and performance. The analysis underscores the superiority of random search approaches in navigating non-differentiable, discontinuous, or irregular optimization landscapes.
Introduction
Local Search Algorithms play a crucial role in Machine Learning by addressing a wide range of optimization problems, as noted by Solis and Wets [1]. These algorithms are especially useful for tasks like hyperparameter optimization or optimizing loss functions. Search algorithms are particularly beneficial in situations where computational resources are limited or the problem is complex, such as with neural networks. In certain instances, loss functions might exhibit characteristics such as lack of smooth differentiability, presence of discontinuities, or a highly irregular search space. These features can render traditional optimization methods like grid search or gradient descent less efficient. Random search methods, therefore, become a preferred alternative, offering a practical solution when exhaustive grid search is too costly and gradient descent is not applicable. In this course, we’ve explored various optimization techniques, including Randomized Hill Climbing, Simulated Annealing, Genetic Algorithms, and MIMIC (developed by Professor Charles Isbell!). I’d like to focus on three specific methods: Randomized Hill Climbing, Simulated Annealing, and Parallel Recombinative Simulated Annealing [8], alongside traditional tuning methods like grid search and random search.
Definitions
Randomized Hill Climbing (RHC)
Let S be the set of all possible solutions, and f: S -> R be the objective function to be maximized. Starting from an initial solution s in S, RHC generates a neighbor s’ by making a random adjustment to s. If f(s’) > f(s), then s is updated to s’. This process repeats until no improvement is found. [3]
Simulated Annealing (SA)
Let T be a temperature greater than 0, controlling the probability of accepting worse solutions. For each iteration, a neighbor s’ of the current solution s is considered. If f(s’) is greater than f(s), then s’ is accepted. Otherwise, s’ is accepted with a probability p equal to e raised to the power of negative (f(s) minus f(s’)) divided by T. Over time, T is decreased according to a schedule, reducing the likelihood of accepting worse solutions. [4]
Parallel recombinative simulated annealing (PRSA)
Parallel Recombinative Simulated Annealing (PRSA) is an advanced optimization algorithm that combines the principles of Simulated Annealing (SA) with concepts from genetic algorithms, particularly the recombination (or crossover) operation, in a parallel computing environment. This hybrid approach is designed to efficiently explore and exploit a solution space to find optimal or near-optimal solutions to complex optimization problems. [8]
Comparisons
I plan to conduct a comparative analysis of four algorithms used for hyperparameter tuning: Grid Search, Random Search, Randomized Hill Climbing (RHC), Simulated Annealing (SA), and Parallel Recombinative Simulated Annealing (PRSA). Gradient-free methods are often preferred for hyperparameter tuning because hyperparameters can be discrete or categorical and lack gradients, leading to the widespread use of Grid Search and Random Search in tuning contexts. For this case study, I’ll use the Wine dataset and SVM model tuning as examples. The evaluation of these algorithms will focus on three main criteria: simplicity of implementation, speed, and performance.
Grid Search, Random Search and RHC
Grid search is arguably the most popular method for hyperparameter tuning in machine learning. Grid search exhaustively tests all possible combinations of hyperparameters within a user-defined grid, whereas random search selects hyperparameters randomly from a specified space. Grid search guarantees finding the optimal hyperparameters if the solution lies within the grid. However, Grid search becomes impractical as the number of hyperparameters increases, leading to an exponential growth in runtime. In contrast, random search can explore a broader range of values for each hyperparameter, making it more suitable when dealing with limited computational resources and a large number of hyperparameters due to its flexibility.
Randomized Hill Climbing (RHC) might seem similar to random search name-wise but operates differently. After selecting an initial point randomly, RHC iteratively explores the neighborhood rather than making another random choice. This approach makes RHC more effective in smaller search spaces, where it can efficiently find an optimum. On the other hand, random search is better suited for high-dimensional spaces where the relationships between hyperparameters and objective functions are unknown.
In this case study, evaluating the relationship between hyperparameters and objective functions is challenging (refer fig. 2 & fig 3) due to the significant computational time required to train the SVM model across a grid of hyperparameters. This complexity highlights the need for selecting the most appropriate tuning method based on the specific requirements and constraints of the task at hand.
Given the extensive range of hyperparameter values we need to explore, random search is likely to yield better results compared to grid search, thanks to random search’s superior ability to explore a broad range. Random search also tends to be more efficient because it can be easily parallelized, with each trial being independent of the others.
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(SVC(), param_grid, cv=5, verbose=1)
grid_search.fit(X_train_scaled, y_train)
from sklearn.model_selection import RandomizedSearchCV
random_search = RandomizedSearchCV(SVC(), param_distributions=param_dist, n_iter=15, cv=2, verbose=1, random_state=42)
random_search.fit(X_train_scaled, y_train)
# Fitting 5 folds for each of 6 candidates, totalling 30 fits
# Best parameters (Grid Search): {'C': 100, 'gamma': 1}
# Best score (Grid Search): 0.7459007352941176
# Fitting 2 folds for each of 15 candidates, totalling 30 fits
# Best parameters (Random Search): {'C': 0.7523742884534856, 'gamma': 0.45627721822084705}
# Best score (Random Search): 0.7568637519561816
Indeed, our results confirm this hypothesis, with random search outperforming grid search in accuracy.
Regarding Randomized Hill Climbing (RHC), this method might converge more quickly to a solution if RHC starts near a good set of hyperparameters, due to its iterative approach to improvement. However, RHC is also more prone to getting stuck in local optima. But, since we’re only exploring two parameters (C and gamma), local search strategies like RHC can be more effective. Local search strategies actively seek to enhance performance without excessively draining computational resources. In fact, RHC achieved an accuracy of 0.79, surpassing both grid search and random search, which suggests that local search algorithms hold promise for hyperparameter tuning.
Simulated annealing (SA) and RHC
While SA (Simulated Annealing) and RHC (Randomized Hill Climbing) can both be seen as generalized hill-climbing models, they differ in a significant way. SA allows for the replacement of the current solution with a worse neighbor based on a probability p, which changes over time. Initially, p is high, accepting most worse solutions, but p decreases over time, making the acceptance of worse solutions rare. This approach illustrates a key trade-off in machine learning between exploration and exploitation. The choice between SA and RHC depends on whether we prioritize exploration. For example, SA is generally more suitable for complex and rugged solution spaces. The need to achieve a global optimum also plays a role; sometimes, a local optimum with acceptable performance is sufficient, reducing the necessity for a global search.
In our experiment, SA was slower than RHC without restarts due to the extensive exploration involved. SA indeeds yield better performance, which aligns with our expectations given its strong ability to escape local minima. However, the efficiency and convergence rate of SA can be enhanced with more refined algorithms like Very Fast Simulated Reannealing (VFSR) [2], which introduces a more aggressive cooling schedule and utilizes Levy flights for a balanced mix of long and short moves in the solution space. While we won’t delve into the specifics of VFSR, it’s important to note that SA’s shortcomings can be mitigated by adjusting certain aspects of the algorithm, making it more effective under the right conditions.
Parallel Recombinative Simulated Annealing
Simulated Annealing (SA) has a proven convergence mechanism based on its cooling schedule, which allows for control over convergence by adjusting the cooling rate. This convergence mechanism, highlighted in the foundational 1983 paper by S. Kirkpatrick, C.D. Gelatt, and M.P. Vecchi [4], relies on an ideally slow cooling process that brings the system to thermal equilibrium at each temperature stage, ensuring the algorithm can thoroughly explore the solution space over time. On the other hand, Genetic Algorithms (GAs) lack this straightforward mechanism, making their convergence control more nuanced due to the complex interactions of selection, crossover, and mutation processes, without a single controlling parameter. Nonetheless, GAs can make up for SA’s slower runtime and limited parallelism by evaluating and evolving multiple candidate solutions simultaneously, which can be distributed across multiple processors or nodes to accelerate computation. [8]
We can actually merge the strengths of both approaches. Parallel Recombinative Simulated Annealing (PRSA) is designed to address SA’s speed limitations by integrating the parallelism of genetic algorithms with SA’s reliable convergence properties. PRSA operates through several key steps [8]:
- Initialization: Start with a sufficiently high initial temperature and randomly initialize a population of solutions.
- Recombination and Mutation: For half the population size (n/2 times), perform the following steps:
a. Randomly select two parents from the population.
b. Generate two children by applying a recombination operator (such as crossover) followed by a mutation operator to the parents.
c. Conduct Boltzmann trials between the parents and their children, replacing the parents with the winners of these trials. - Cooling: Periodically lower the temperature according to a predefined cooling schedule.
- Repeat: Continue the process of generating new populations from the current population by repeating step 2 at the updated temperature until a termination condition is met (e.g., a sufficiently low temperature or a maximum number of iterations).
This approach allows PRSA to leverage the best of both worlds, combining the methodical convergence of SA with the efficiency and scalability of GAs for improved performance across complex optimization tasks. Here is the pseudocode:
// Pseudocode for Parallel Recombinative Simulated Annealing (PRSA) for SVM Hyperparameter Optimization:
Input: param_space, population_size, generations, initial_temp, cooling_rate
Output: best_score, best_params
Function PRSA(param_space, population_size, generations, initial_temp, cooling_rate)
population <- GenerateInitialPopulation(population_size, param_space)
temperature <- initial_temp
for generation in 1 to generations do
new_population <- []
for i in 1 to floor(population_size / 2) do
Select parent1, parent2 randomly from population
child1, child2 <- RecombineAndMutate(parent1, parent2, param_space)
parent1_score <- EvaluateModel(parent1)
parent2_score <- EvaluateModel(parent2)
child1_score <- EvaluateModel(child1)
child2_score <- EvaluateModel(child2)
if BoltzmannTrial(parent1_score, child1_score, temperature) then
Add child1 to new_population
else
Add parent1 to new_population
endif
if BoltzmannTrial(parent2_score, child2_score, temperature) then
Add child2 to new_population
else
Add parent2 to new_population
endif
done
population <- new_population
temperature <- temperature * cooling_rate
done
best_params <- Max(population by EvaluateModel)
best_score <- EvaluateModel(best_params)
return best_score, best_params
End Function
After we applied PRSA to the tuning hyperparameters for SVM model part the results and the comparison are as follow:
Best score (PRSA): 0.7893903186274509
Best parameters (PRSA): C=10, gamma=0.1
Due to time constraints, we did not implement mutation, and therefore, we are not fully leveraging the capabilities of Genetic Algorithms. However, thanks to the “implicit parallelism” [8] inherent in Genetic Algorithms, which comes from optimizing a large population instead of just a single solution, the runtime of our algorithm is significantly shorter. Yet, the benefits of parallelism are not as pronounced in this case because we are only optimizing two hyperparameters and have defined only a limited search space due to time constraints. Given more time, we could harness the full potential of combining the thorough convergence towards possibly achieving a global optimum with the efficient parallelism of Genetic Algorithms.
Conclusion
In this tutorial, we embark on a journey to tune two hyperparameters of SVM using commonly used approaches such as Grid Search and Random Search, in addition to the local search algorithms we’ve introduced in this class, such as RHC (Randomized Hill Climbing) and SA (Simulated Annealing). We also explore a recombinative algorithm, PRSA (Parallel Recombinative Simulated Annealing), that combines the strong convergence attribute of SA with the easy parallelism exhibited by genetic algorithms (GA). We demonstrate the robust performance of local search algorithms, which iteratively search for optimal hyperparameters, rather than relying solely on predetermined parameters in a grid or random choices. If you need to tune hyperparameters and are willing to trade some runtime for better performance, you might find the local search approaches introduced in this class particularly beneficial for achieving optimal results :).
References
- Solis, F. J., & Wets, R. J.-B. (1981). Minimization by Random Search Techniques. Mathematics of Operations Research, 6(1), 19-30. https://doi.org/10.1287/moor.6.1.19
- Ingber, L. (1989). Very fast simulated re-annealing. Mathematical and Computer Modelling, 12(8), 967–973. DOI: 10.1016/0895-7177(89)90202-1.
- Holland, J. H. (1975). Adaptation in Natural and Artificial Systems. University of Michigan Press.
- Kirkpatrick, S., Gelatt, C. D., & Vecchi, M. P. (1983). Optimization by Simulated Annealing. Science, 220(4598), 671-680. https://doi.org/10.1126/science.220.4598.671
- Goldberg, D. E. (1989). Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley.
- De Bonet, J. S., Isbell, C. L., & Viola, P. (1997). MIMIC: Finding Optima by Estimating Probability Densities. Advances in Neural Information Processing Systems, 9.
- Pilario, Karl Ezra, Cao, Yi, and Shafiee, Mahmoud (2020). A Kernel Design Approach to Improve Kernel Subspace Identification. IEEE Transactions on Industrial Electronics, PP, 1–1. DOI: 10.1109/TIE.2020.2996142.
- Mahfoud, Samir W., and Goldberg, David E. (1995). Parallel recombinative simulated annealing: A genetic algorithm. Parallel Computing, 21(1), 1–28.