In the sections on Feature engineering and Predicting we learned how
to train both the feature engineering algorithm and the ML algorithm
used for prediction in the getML engine. So, we basically have all
ingredients to get a data science project started, right? Well,
that’s maybe true but there are lots of parameters
XGBoostRegressor all do have their own
settings. We strive to provide you with as much information as
possible in both the documentation and the corresponding doc
strings. But tuning all these parameters manually can be a quite
tedious task. To relieve you from these difficulties, we also provide an
automated hyperparameter optimization.
The most relevant parameters of these classes can be chosen to
constitute individual dimensions of a parameter space. For each
parameter, a lower and upper bound has to be provided and the
hyperparameter optimization will search the space within these
bounds. This will be done iteratively by drawing a specific parameter
combination, overwriting the corresponding parameters in a base model,
and then fitting and scoring it. The algorithm used to draw from the
parameter space is represented by the different classes of
LatinHypercubeSearch are purely statistical
use prior knowledge obtained from evaluations of previous parameter
combinations to select the next one.
Bayesian hyperparameter optimization¶
What’s the incentive behind using a Bayesian hyperparameter optimization anyway and how does it work?
Our overall goal is to get the best hyperparameter combination in
order to perform the best prediction possible. To rephrase it in
mathematical terms, we want to minimize the negative log-likelihood of
an objective function representing the performance of the feature
engineering algorithm, the feature selector, and predictor (measured
using a particular score) given a set of data. But the surface of this
negative log-likelihood is not convex and contains many local
minima. Thus, we need to use a global optimization scheme. First of
all, we sample random points in the parameter space, evaluate the
objective functions at all those sites, and, finally, start a
well-known and tested local optimization routine, e.g. Nelder-Mead, at
the best-performing combination. The initial point in our parameter
space used to start the optimization from will be the parameters of
the provided model - either the ones you chose manually or the default
ones in the
RelboostModel constructors, which we will call
the base model from here on.
But on top of having local minima, the objective function has a far worse property: It is very expensive to evaluate. Local optimization algorithms can easily require over one hundred iterations to converge which usually is not an issue (e.g. minimizing the negative log-likelihood of a distribution function on, let’s say, 1000 data points only takes about 10ms on modern computers). But if evaluating the objective function involves performing a multi-stage fit of various machine learning algorithms to a large amount of data, each iteration can take minutes or even longer. In such a scenario, even the simple task of performing a local minimization very quickly becomes computationally infeasible.
This is where Bayesian hyperparameter optimization comes in. Its idea is to not fit the negative log-likelihood of the objective function directly but, instead, to approximate it with a surrogate model - the Gaussian process [Rasmussen06] - and to fit the approximation instead. By doing so we trade evaluation time - since the surrogate is much more cheap to evaluate - in for accuracy - since we are only dealing with an approximation of the real objective function.
The first part of our global optimization scheme (sampling the parameter space), becomes a lot more crucial since not just the quality of the starting points for the local optimization but, even more important, the approximation by the Gaussian process depends on the number and distribution of previous evaluations. Without a good coverage of the parameter space, the Gaussian process will not resemble its target properly and the results of the local optimization will be poorly.
Fortunately, we can do better than simply drawing random parameter combinations, fitting a single Gaussian process afterwards, and returning its minimum. The second core idea of the Bayesian hyperparameter optimization is to redesign the global optimization for better and more efficient performance. The local optimization will be replaced by an iterative scheme in which the surrogate is fitted to all previous parameter combinations and used to find the most promising combination to evaluate next. As a measure of quality for the next point to evaluate - also called acquisition function in the Bayesian optimization community - we use the expected information (EI) [Villemonteix09]. It measures how much improvement with respect to the current minimum of the negative log-likelihood function is expected when evaluating a particular additional point given all previous evaluations. An immense benefit of using the maximum of the EI (or other acquisition functions) calculated for the Gaussian process over the raw minimum of the surrogate is that they provide a trade-off between exploration and exploitation of the parameter space and are thus able to efficiently fit the objective function “on their own”. The EI itself we also have to optimize using a global scheme throughout the whole parameter space. But since this is done on top of the Gaussian process, it is quite fast.
To summarize, the optimization starts by drawing a number of points
within the parameter space at random and using them to fit and score
the model. Next, a Gaussian process is fitted to all parameter
combinations/calculated score pairs. Using the most likely position
of the optimal parameter combination can be determined by optimizing
the EI. The model corresponding to this combination is fitted and
scored by the getML engine and, again, a Gaussian process is fitted to
all evaluations. This procedure will be continued until the maximum
number of iterations - provided in
GaussianHyperparameterSearch as input
n_iter - is reached. As a result you get a list of
all fitted variants of the base model as well as their calculated
scores. Note, however, that the final parameter combination calculated
based on the models in the provided list will not be returned. Since
the EI is a trade-off between exploration and exploitation, the last
combination does not have to be the optimal one and we only keep those
models we know the performance (score) of.
The algorithm occasionally evaluates quite a number of samples from the hyperparameter space (sometimes several dozen) in what appears to be the global minimum while in reality it got stuck in a local one. The possibility for such an event to happen is particularly high for high-dimensional spaces or/and too short burn-in periods. In time the algorithm will be able to escape the local minima and approach the global one. But instead of increasing the number of surrogate evaluation, we recommend to perform a more thorough burn-in period instead.