Bayesian Optimization to the Rescue
When developing software solutions, and more specifically data science solutions, time is always a limited resource. The next deadline or milestone approaches every day without exception, and different strategies are used to deliver quickly. In data science, the training time of models is a significant and hard-to-avoid cost. Here is how Bayesian optimization can avoid training models unnecessarily.
Let’s take a step back. In a software solution involving machine learning, usually only one model will be deployed in the end. Finding this model, however, involves testing different approaches, different models, different hyperparameters, and so on. It should be noted that tuning a model often takes as much time as, or more time than, fitting it once its hyperparameters are known. Indeed, the best configuration is not known beforehand: finding it requires training and validating different models in order to estimate their respective generalization errors. This iterative process converges on a final model, which nevertheless comes with no guarantee of optimality. Each iteration involves training a model that will never be used in production, and for some models the training time is counted in hours or even days. In this context, the options are countless while the number of trials is limited. It is a dilemma between exploitation and exploration.
Ideally, several models and model configurations would be tested in an exploration phase, then the best candidates would be refined in an exploitation phase, keeping the most efficient model. Too little exploration risks ignoring potentially better models, while too little exploitation risks ignoring incremental improvements that are readily available. The question is when exploration should prevail over exploitation, and vice versa.
[Image: a cartoon often used to demonstrate that you should never give up. It also illustrates the dilemma between exploitation and exploration, with the exploitation strategy highlighted.]
Babysitting
The simplest approach to this problem is to manually configure the hyperparameters of a model, start the training, wait for the results of this configuration, choose the next trial based on the results obtained, and repeat until no more time is available. This approach is known as babysitting.
It has pedagogical advantages, since it builds an in-depth understanding of the different levers of each model family. It also makes it possible to decide at each iteration whether exploitation or exploration is the better strategy. However, it is not viable in a business context since, as the name suggests, it requires constant supervision: a lot of time is spent monitoring the training process, which adds no value to the project.
Moreover, since each trial is started manually based on previous results, any result that arrives while the data scientist is unavailable leaves the machine idle until the data scientist returns. This may seem minimal, but counting nights and days off, many hours of computing time are lost.
Grid and Random Search
A more sophisticated approach is to automate the launch of the training process over a list of given configuration combinations, commonly called a Grid Search. A similar approach is to define a range of possible configurations and sample configurations randomly from that range; this is called a Random Search. Both are sketched below.
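For reference, here is a minimal sketch of both approaches using scikit-learn; the random forest and its hyperparameters are assumptions that mirror the example later in the article:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid Search: every combination in an explicit list is tried.
param_grid = {"n_estimators": [100, 500, 1000], "min_samples_split": [5, 25, 50]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid,
                           scoring="neg_log_loss", cv=5)

# Random Search: configurations are sampled at random from the given ranges.
param_dist = {"n_estimators": randint(100, 1000), "min_samples_split": randint(5, 50)}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20,
                                   scoring="neg_log_loss", cv=5, random_state=0)

# grid_search.fit(X, y) / random_search.fit(X, y) would launch the searches on a
# dataset (X, y). Neither uses earlier results to choose the next configuration.
```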
While these two methods free the data scientist from overseeing the training process and explore the hyperparameter space more exhaustively, they have other weaknesses. Both Grid Search and Random Search can spend a lot of time on configurations with little potential, so the computation time gained by automating part of the process is lost on unnecessary trials. In addition, they suffer from a phenomenon well known in data science, the curse of dimensionality: as the number of hyperparameters to configure grows, the number of possible combinations explodes, making it impossible to run enough trials to get a good picture of the direction of improvement.
Finally, whereas in the babysitting approach the data scientist takes previous iterations into account when choosing the next trial, these two approaches do not. They are therefore oriented towards exploration but abandon the exploitation aspect.
Bayesian Optimization
Ideally, a data scientist could provide a range of possible model configurations, and an algorithm would use the results of previous iterations to determine the most promising configuration to try next.
This is what Bayesian optimization does, by means of a Gaussian process. A given configuration is considered promising for one of two reasons: it lies far from all previously tested configurations (its outcome is highly uncertain), or it lies near an efficient configuration (it is likely to perform well too). By combining these two criteria, Bayesian optimization reduces uncertainty by exploring little-visited regions while exploiting regions near well-performing configurations. The combination is expressed by an acquisition function. Several acquisition functions exist but, at a high level, they are all combinations of the two criteria above. The exploration-exploitation dilemma is therefore intrinsically taken into account by this approach. Note that it is not a solution to the curse of dimensionality: with a large configuration space, exploring it remains an issue.
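To make the idea concrete, here is a minimal sketch of one common acquisition function, Upper Confidence Bound. It is an illustration, not the library's internal code: the score adds the Gaussian process's posterior mean (exploitation) to its posterior standard deviation scaled by a kappa factor (exploration):

```python
import numpy as np

def upper_confidence_bound(mu, sigma, kappa=2.576):
    # mu: Gaussian process posterior mean at each candidate (exploitation term)
    # sigma: posterior standard deviation at each candidate (exploration term)
    # kappa: trade-off factor; larger values favor unexplored regions
    return mu + kappa * sigma

# The candidate with the highest acquisition score is tried next.
mu = np.array([0.80, 0.75, 0.60])      # predicted performance
sigma = np.array([0.01, 0.05, 0.20])   # uncertainty of the prediction
print(np.argmax(upper_confidence_bound(mu, sigma)))  # -> 2, the most uncertain one
```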
Application in Python
BayesianOptimization is a Python library that implements Bayesian optimization algorithms. It is well documented, and I recommend taking a look at it.
Basic usage
BayesianOptimization is one of those libraries that pays off with very little work. All we need is a black-box function that returns a numeric value (a float) and bounds for each hyperparameter.
In the following example, the black-box function to optimize computes the average negative log loss across the cross-validation splits.
The n_estimators hyperparameter ranges from 100 to 1000, and the min_samples_split hyperparameter ranges from 5 to 50.
The Bayesian optimizer (BayesianOptimization) is initialized with the black-box function and the hyperparameter bounds. We then call maximize, specifying the number of initial random iterations (init_points) and the number of Bayesian optimization iterations (n_iter).
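A minimal sketch of the whole example, assuming a random forest classifier and a synthetic dataset for illustration:

```python
from bayes_opt import BayesianOptimization
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Illustrative dataset; in practice this would be the project's data.
X, y = make_classification(n_samples=500, random_state=42)

def black_box_function(n_estimators, min_samples_split):
    # Average negative log loss across the cross-validation splits.
    # scikit-learn's 'neg_log_loss' is already negated, so higher is better,
    # which matches the library's convention of maximizing the target.
    model = RandomForestClassifier(
        n_estimators=int(n_estimators),            # the optimizer passes floats
        min_samples_split=int(min_samples_split),  # so cast to int for sklearn
        random_state=42,
    )
    return cross_val_score(model, X, y, cv=5, scoring="neg_log_loss").mean()

# Hyperparameter bounds from the text: 100 to 1000 and 5 to 50.
pbounds = {"n_estimators": (100, 1000), "min_samples_split": (5, 50)}

optimizer = BayesianOptimization(f=black_box_function, pbounds=pbounds, random_state=1)
optimizer.maximize(init_points=5, n_iter=10)

print(optimizer.max)  # best target value and the configuration that reached it
```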
Some more advanced features
Even though the library offers a high level of abstraction, it also provides more advanced features, such as writing and reading logs and a choice of acquisition functions.
Logs
To write the results to a log, simply initialize a JSONLogger and subscribe it to the optimizer's OPTIMIZATION_STEP event. Each trial carried out during the optimization is then recorded in the log.
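A sketch of the logging setup, following the observer pattern documented by the library; the file path is illustrative and optimizer is the instance built above:

```python
from bayes_opt.logger import JSONLogger
from bayes_opt.event import Events

logger = JSONLogger(path="./logs.json")
optimizer.subscribe(Events.OPTIMIZATION_STEP, logger)

# From this point on, every trial is appended to ./logs.json.
optimizer.maximize(init_points=0, n_iter=5)
```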
To read the results of previous trials back in, BayesianOptimization offers the load_logs function, which lets you continue a hyperparameter search from where previous searches left off.
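A sketch of resuming a search from saved logs; black_box_function and pbounds are reused from the earlier example:

```python
from bayes_opt.util import load_logs

new_optimizer = BayesianOptimization(
    f=black_box_function, pbounds=pbounds, random_state=7
)

# Replay the recorded trials so the Gaussian process starts already informed.
load_logs(new_optimizer, logs=["./logs.json"])
new_optimizer.maximize(init_points=0, n_iter=5)
```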
Acquisition Functions
As mentioned above, several acquisition functions exist. Those implemented in the library are Upper Confidence Bound ('ucb'), Expected Improvement ('ei'), and Probability of Improvement ('poi'). A resource recommended by the author of the library, which covers the differences between these functions, is listed in the resources below [2].
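As a sketch, in the 1.x versions of the library the acquisition function and its parameters could be passed directly to maximize; newer releases may expose this differently, so check the version you use:

```python
# Upper Confidence Bound: kappa trades off exploration (high) vs. exploitation (low).
optimizer.maximize(init_points=5, n_iter=10, acq="ucb", kappa=2.576)

# Expected Improvement and Probability of Improvement use xi in a similar role.
optimizer.maximize(init_points=0, n_iter=10, acq="ei", xi=0.01)
```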
Conclusion
In short, different approaches to the exploration-exploitation dilemma exist. The babysitting approach is formative but hard to sustain in a business context. Grid Search and Random Search offer interesting exploration capabilities but neglect the exploitation aspect. Finally, Bayesian optimization offers an approach that, in a business context, automates part of the hyperparameter search, avoids wasted time, and ultimately helps deliver a high-performing solution on schedule.
Resources
[1] Fernando Nogueira. (2014–). Bayesian Optimization: Open source constrained global optimization tool for Python.
[2] Fernando Nogueira. (2013). Machine learning — Bayesian optimization and multi-armed bandits. Retrieved 18/02/2020 from https://www.youtube.com/watch?v=vz3D36VXefI&index=10&list=PLE6Wd9FR--EdyJ5lbFl8UuGjecvVw66F6
[3] Snoek, J., Larochelle, H., & Adams, R. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems (pp. 2951–2959).