Hyperparameter optimization

In the sections on Feature engineering and Predicting we learned how to train both the feature engineering algorithm and the ML algorithm used for prediction in the getML engine. So, we basically have all the ingredients to get a data science project started, right? That may be true, but there are lots of parameters involved. MultirelModel, RelboostModel, LinearRegression, LogisticRegression, XGBoostClassifier, and XGBoostRegressor all have their own settings. We strive to provide you with as much information as possible in both the documentation and the corresponding doc strings. But tuning all these parameters manually can be quite a tedious task. To relieve you of this burden, we also provide an automated hyperparameter optimization.

The most relevant parameters of these classes can be chosen to constitute individual dimensions of a parameter space. For each parameter, a lower and an upper bound have to be provided, and the hyperparameter optimization will search the space within these bounds. This is done iteratively: a specific parameter combination is drawn, the corresponding parameters in a base model are overwritten, and the resulting model is fitted and scored. The algorithm used to draw from the parameter space is determined by the particular class of hyperopt you choose. While RandomSearch and LatinHypercubeSearch are purely statistical approaches, GaussianHyperparameterSearch uses the knowledge gained from evaluating previous parameter combinations to select the next one.
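
To make the mechanics concrete, here is a minimal sketch in plain Python of drawing bounded parameter combinations and overwriting a base model with them. The parameter names, bounds, and the fit_and_score stand-in are purely illustrative and not part of the getML API:

```python
import random

# Illustrative bounds for two hypothetical dimensions of the parameter space;
# the names and ranges are examples, not getML's full set of tunable parameters.
bounds = {
    "num_features": (10, 200),   # integer dimension
    "shrinkage": (0.01, 0.3),    # continuous dimension
}

# Parameters of the base model; the bounded dimensions are overwritten on each draw.
base_params = {"num_features": 50, "shrinkage": 0.1, "loss_function": "SquareLoss"}


def draw_combination(bounds):
    """Draw one parameter combination uniformly at random within the given bounds."""
    return {
        name: random.randint(low, high) if isinstance(low, int) else random.uniform(low, high)
        for name, (low, high) in bounds.items()
    }


def fit_and_score(params):
    """Hypothetical stand-in for fitting and scoring one model variant in the engine."""
    return (params["num_features"] - 120) ** 2 * params["shrinkage"]


evaluations = []
for _ in range(20):
    params = {**base_params, **draw_combination(bounds)}  # overwrite the base model
    evaluations.append((params, fit_and_score(params)))

print(min(evaluations, key=lambda pair: pair[1]))
```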

Bayesian hyperparameter optimization

What’s the incentive behind using a Bayesian hyperparameter optimization anyway and how does it work?

Our overall goal is to find the hyperparameter combination that yields the best possible prediction. In mathematical terms, we want to minimize the negative log-likelihood of an objective function representing the performance of the feature engineering algorithm, the feature selector, and the predictor (measured using a particular score) given a set of data. But the surface of this negative log-likelihood is not convex and contains many local minima, so we need a global optimization scheme: first, sample random points in the parameter space, evaluate the objective function at all of them, and, finally, start a well-known and well-tested local optimization routine, e.g. Nelder-Mead, at the best-performing combination. The optimization starts from the parameters of the model you provide - either the ones you chose manually or the defaults of the MultirelModel or RelboostModel constructors - which we will call the base model from here on.
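
As a toy illustration of this scheme, with a cheap analytic function standing in for the real objective, the random sampling and the subsequent Nelder-Mead refinement could look like this in SciPy:

```python
import numpy as np
from scipy.optimize import minimize

# Toy stand-in for the expensive objective; in practice every call would trigger
# a full fit-and-score run of the pipeline in the getML engine.
def objective(x):
    return np.sin(3.0 * x[0]) + (x[0] - 0.6) ** 2 + (x[1] - 0.3) ** 2

bounds = np.array([[0.0, 1.0], [0.0, 1.0]])  # lower/upper bound per dimension
rng = np.random.default_rng(0)

# 1) Sample random points in the parameter space and evaluate the objective there.
samples = rng.uniform(bounds[:, 0], bounds[:, 1], size=(30, 2))
values = np.array([objective(x) for x in samples])

# 2) Start a local optimization (Nelder-Mead) at the best-performing combination.
best_start = samples[np.argmin(values)]
result = minimize(objective, best_start, method="Nelder-Mead")
print(result.x, result.fun)
```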

But on top of having local minima, the objective function has a far worse property: it is very expensive to evaluate. Local optimization algorithms can easily require over one hundred iterations to converge, which is usually not an issue (e.g. minimizing the negative log-likelihood of a distribution function on, say, 1000 data points only takes about 10 ms on modern computers). But if evaluating the objective function involves a multi-stage fit of various machine learning algorithms to a large amount of data, each iteration can take minutes or even longer. In such a scenario, even the simple task of performing a local minimization quickly becomes computationally infeasible.

This is where Bayesian hyperparameter optimization comes in. The idea is not to optimize the negative log-likelihood of the objective function directly but to approximate it with a surrogate model - a Gaussian process [Rasmussen06] - and to work with that approximation instead. In doing so we trade accuracy - since we are only dealing with an approximation of the real objective function - for evaluation time - since the surrogate is much cheaper to evaluate.
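
To illustrate the idea - this is a conceptual sketch using scikit-learn, not getML's internal implementation - a Gaussian process surrogate can be fitted to a handful of evaluated combinations and then queried cheaply anywhere in the parameter space:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# A handful of (parameter combination, objective value) pairs, e.g. from the
# initial sampling phase. One-dimensional here to keep the sketch readable.
X = np.array([[0.1], [0.35], [0.5], [0.8]])
y = np.array([1.2, 0.7, 0.9, 1.5])

# Fit the surrogate once; evaluating it is orders of magnitude cheaper than
# refitting the real pipeline for every candidate combination.
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True).fit(X, y)

# Cheap approximate evaluations (posterior mean and uncertainty) anywhere in the space.
grid = np.linspace(0.0, 1.0, 200).reshape(-1, 1)
mean, std = gp.predict(grid, return_std=True)
```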

The first part of our global optimization scheme - sampling the parameter space - becomes much more important, since not just the quality of the starting points for the local optimization but, even more importantly, the quality of the Gaussian process approximation depends on the number and distribution of previous evaluations. Without good coverage of the parameter space, the Gaussian process will not resemble its target properly and the results of the local optimization will be poor.
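
For instance, a Latin hypercube design - the idea behind LatinHypercubeSearch - spreads a fixed budget of burn-in points evenly across all dimensions. Here is a small sketch using SciPy's quasi-Monte Carlo module, again purely illustrative:

```python
from scipy.stats import qmc

# A Latin hypercube design spreads the burn-in points evenly across every
# dimension, giving better coverage than plain uniform sampling of the same size.
sampler = qmc.LatinHypercube(d=2, seed=0)
unit_sample = sampler.random(n=30)                        # points in the unit cube
sample = qmc.scale(unit_sample, [10, 0.01], [200, 0.3])   # rescale to the parameter bounds
```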

Fortunately, we can do better than simply drawing random parameter combinations, fitting a single Gaussian process afterwards, and returning its minimum. The second core idea of Bayesian hyperparameter optimization is to redesign the global optimization for better and more efficient performance. The local optimization is replaced by an iterative scheme in which the surrogate is fitted to all previous parameter combinations and used to find the most promising combination to evaluate next. As a measure of quality for the next point to evaluate - called the acquisition function in the Bayesian optimization community - we use the expected information (EI) [Villemonteix09]. It measures how much improvement with respect to the current minimum of the negative log-likelihood is expected when evaluating a particular additional point, given all previous evaluations. A major benefit of using the maximum of the EI (or another acquisition function) calculated on the Gaussian process, rather than the raw minimum of the surrogate, is that acquisition functions provide a trade-off between exploration and exploitation of the parameter space and are thus able to fit the objective function efficiently “on their own”. The EI itself also has to be optimized using a global scheme over the whole parameter space, but since this is done on top of the Gaussian process, this step is quite fast.
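
As an illustration of how an acquisition function balances these two goals, the closely related expected improvement - a standard acquisition function in Bayesian optimization - can be computed from the surrogate's posterior mean and standard deviation as follows. This is a sketch of the general concept, not the EI computation used inside the engine:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mean, std, best_value):
    """Expected improvement (for minimization) of candidate points.

    `mean` and `std` are the surrogate's posterior mean and standard deviation
    at the candidates (e.g. from gp.predict(..., return_std=True)); `best_value`
    is the lowest objective value observed so far.
    """
    std = np.maximum(std, 1e-12)            # guard against division by zero
    z = (best_value - mean) / std
    # Large where the surrogate predicts a low value (exploitation)
    # or is very uncertain (exploration).
    return (best_value - mean) * norm.cdf(z) + std * norm.pdf(z)
```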

To summarize, the optimization starts by drawing a number of points within the parameter space at random and using them to fit and score the model. Next, a Gaussian process is fitted to all pairs of parameter combinations and calculated scores. Using this surrogate, the most likely position of the optimal parameter combination is determined by optimizing the EI. The model corresponding to this combination is fitted and scored by the getML engine and, again, a Gaussian process is fitted to all evaluations. This procedure continues until the maximum number of iterations - provided to GaussianHyperparameterSearch via the input argument n_iter - is reached. As a result you get a list of all fitted variants of the base model as well as their calculated scores. Note, however, that no additional parameter combination calculated from the models in this list will be returned: since the EI trades off exploration against exploitation, the last combination does not have to be the optimal one, and we only keep those models whose performance (score) we know.
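
The following self-contained toy loop mimics this procedure - random burn-in, surrogate fit, acquisition maximization, evaluation of the most promising combination - in plain Python with scikit-learn and SciPy. It is a conceptual sketch only; the actual search is performed inside the getML engine:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)
bounds = np.array([[0.0, 1.0], [0.0, 1.0]])

def objective(x):
    """Toy stand-in for fitting and scoring one model variant in the engine."""
    return np.sin(3.0 * x[0]) + (x[0] - 0.6) ** 2 + (x[1] - 0.3) ** 2

# Burn-in: random draws to cover the parameter space.
X = rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 2))
y = np.array([objective(x) for x in X])

n_iter = 20
for _ in range(n_iter):
    # Fit the surrogate to all parameter combination / score pairs seen so far.
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)

    # Maximize the acquisition function over a cheap set of random candidates.
    candidates = rng.uniform(bounds[:, 0], bounds[:, 1], size=(2000, 2))
    mean, std = gp.predict(candidates, return_std=True)
    std = np.maximum(std, 1e-12)
    z = (y.min() - mean) / std
    ei = (y.min() - mean) * norm.cdf(z) + std * norm.pdf(z)
    x_next = candidates[np.argmax(ei)]

    # Evaluate the expensive objective at the most promising combination.
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next))

# Every evaluated combination and its score is kept, not just the last draw.
best_params, best_score = X[np.argmin(y)], y.min()
```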

The algorithm occasionally evaluates quite a number of samples from the hyperparameter space (sometimes several dozen) in what appears to be the global minimum but is in fact a local one. The probability of this happening is particularly high for high-dimensional parameter spaces and/or too short burn-in periods. Given enough time, the algorithm will escape the local minimum and approach the global one. But instead of increasing the number of surrogate evaluations, we recommend performing a more thorough burn-in.

Rasmussen06

Carl Edward Rasmussen and Christopher K. I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006

Villemonteix09

Julien Villemonteix, Emmanuel Vazquez, and Eric Walter, An informational approach to the global optimization of expensive-to-evaluate functions, Journal of Global Optimization, 2009