Interruptible optimization runs with checkpoints

Christian Schell, May 2018. Reformatted by Holger Nahrstaedt, 2020.

Problem statement

Optimization runs can take a very long time and may even run for multiple days. If the process has to be interrupted for some reason, the results are irreversibly lost and the routine has to start over from the beginning.

With the help of the callbacks.CheckpointSaver callback, the optimizer’s current state can be saved after each iteration, so the run can be restarted from that point at any time.

This is useful, for example,

  • if you don’t know how long the process will take and cannot hog computational resources forever

  • if there might be system failures due to shaky infrastructure (or colleagues…)

  • if you want to adjust some parameters and continue with the already obtained results
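To make the idea of a per-iteration checkpoint concrete, here is a minimal sketch of a callback that writes the latest result to disk every time it is called. The class name `PickleCheckpoint` is hypothetical, and this is not how callbacks.CheckpointSaver is implemented; it only assumes that a callback is a callable that receives the current result object after each evaluation. Writing to a temporary file and renaming it is a design choice of this sketch, so that an interruption in the middle of a write cannot leave a truncated checkpoint behind.

```python
import os
import pickle
import tempfile


class PickleCheckpoint:
    """Hypothetical stand-in for a checkpoint callback (not skopt's own class)."""

    def __init__(self, path):
        self.path = path

    def __call__(self, res):
        # Write to a temporary file in the target directory first, then
        # rename it over the checkpoint, so the old file stays valid until
        # the new one is complete.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(res, f)
        os.replace(tmp_path, self.path)


# Usage with plain dictionaries standing in for the optimizer's result object
with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "checkpoint.pkl")
    saver = PickleCheckpoint(path)
    saver({"x_iters": [[-20.0]], "func_vals": [-0.05]})               # iteration 1
    saver({"x_iters": [[-20.0], [5.9]], "func_vals": [-0.05, -0.08]})  # iteration 2
    with open(path, "rb") as f:
        latest = pickle.load(f)
    print(len(latest["x_iters"]), "evaluations in the latest checkpoint")
```

Only the most recent state is kept, which is all that is needed to resume.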

print(__doc__)
import os
import sys

import numpy as np

np.random.seed(777)  # fix the seed so the noisy objective is reproducible

Simple example

We will use pretty much the same optimization problem as in the Bayesian optimization with skopt notebook. Additionally, we instantiate callbacks.CheckpointSaver and pass it to the minimizer:

from skopt import gp_minimize
from skopt import callbacks
from skopt.callbacks import CheckpointSaver

noise_level = 0.1


def obj_fun(x, noise_level=noise_level):
    # Noisy 1-D objective: a sine wave damped away from zero, plus Gaussian noise
    return (np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2))
            + np.random.randn() * noise_level)

# Extra keyword arguments (here compress=9) are passed on to `skopt.dump`
checkpoint_saver = CheckpointSaver("./checkpoint.pkl", compress=9)

gp_minimize(obj_fun,            # the function to minimize
            [(-20.0, 20.0)],    # the bounds on each dimension of x
            x0=[-20.],          # the starting point
            acq_func="LCB",     # the acquisition function (optional)
            n_calls=10,         # number of evaluations of f including at x0
            n_random_starts=3,  # the number of random initial points
            callback=[checkpoint_saver],
            # a list of callbacks including the checkpoint saver
            random_state=777)

Out:

         fun: -0.17524445239614728
   func_vals: array([-0.04682088, -0.08228249, -0.00653801, -0.07133619,  0.09063509,
       0.07662367,  0.08260541, -0.13236828, -0.17524445,  0.10024491])
      models: [GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735)]
random_state: RandomState(MT19937) at 0x7F4692427340
       space: Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')])
       specs: {'args': {'func': <function obj_fun at 0x7f4693f12dc0>, 'dimensions': Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')]), 'base_estimator': GaussianProcessRegressor(kernel=1**2 * Matern(length_scale=1, nu=2.5),
                        n_restarts_optimizer=2, noise='gaussian',
                        normalize_y=True, random_state=655685735), 'n_calls': 10, 'n_random_starts': 3, 'n_initial_points': 10, 'initial_point_generator': 'random', 'acq_func': 'LCB', 'acq_optimizer': 'auto', 'x0': [-20.0], 'y0': None, 'random_state': RandomState(MT19937) at 0x7F4692427340, 'verbose': False, 'callback': [<skopt.callbacks.CheckpointSaver object at 0x7f4691bede20>], 'n_points': 10000, 'n_restarts_optimizer': 5, 'xi': 0.01, 'kappa': 1.96, 'n_jobs': 1, 'model_queue_size': None}, 'function': 'base_minimize'}
           x: [-18.660711608231072]
     x_iters: [[-20.0], [5.857990176187936], [-11.97095004855501], [5.450171667295798], [10.52421848474863], [-17.111120867645933], [7.251301457257323], [-19.167098803897993], [-18.660711608231072], [-18.284297234995442]]

Now let’s assume this did not finish at once but took a long time: you started it on Friday night, went out for the weekend and now, on Monday morning, you’re eager to see the results. However, instead of the notebook server you see only a blank page, and your colleague Garry tells you that he had an update scheduled for Sunday noon. Who doesn’t like updates?

gp_minimize did not finish, and there is no res variable with the actual results!

Restoring the last checkpoint

Luckily, we employed the callbacks.CheckpointSaver and can now restore the latest result with skopt.load (see Store and load skopt optimization results for more information on that).

from skopt import load

res = load('./checkpoint.pkl')

res.fun

Out:

-0.17524445239614728
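With the checkpoint restored, the search can be continued by handing the points already evaluated back to the minimizer as x0 and y0. The following is a minimal sketch under stated assumptions: the helper names `warm_start_points` and `resume` are hypothetical, and `n_random_starts=0` is a choice made here so that all new evaluations are model-guided; it mirrors the keyword used in the call above. The skopt imports sit inside `resume` only so that the small helper can be tried without the library installed.

```python
from types import SimpleNamespace


def warm_start_points(res):
    """Return the evaluated points and their objective values as plain lists."""
    x0 = [list(x) for x in res.x_iters]
    y0 = [float(y) for y in res.func_vals]
    return x0, y0


def resume(obj_fun, dimensions, checkpoint_path, n_calls=10):
    """Sketch: reload a checkpoint and continue the search from its points."""
    # Imported lazily so warm_start_points() works without skopt installed.
    from skopt import gp_minimize, load
    from skopt.callbacks import CheckpointSaver

    x0, y0 = warm_start_points(load(checkpoint_path))
    return gp_minimize(obj_fun,
                       dimensions,
                       x0=x0,               # points already examined
                       y0=y0,               # objective values observed at x0
                       acq_func="LCB",
                       n_calls=n_calls,
                       n_random_starts=0,   # assumption: no new random points
                       callback=[CheckpointSaver(checkpoint_path, compress=9)],
                       random_state=777)


# Demonstration of the helper with a stand-in for a restored result
fake_res = SimpleNamespace(x_iters=[[-20.0], [5.86]], func_vals=[-0.047, -0.082])
print(warm_start_points(fake_res))
```

For the example above, the continuation would be started with `resume(obj_fun, [(-20.0, 20.0)], "./checkpoint.pkl")`; passing the checkpoint callback again keeps later iterations protected as well.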

Possible problems

  • changes in search space: You can use this technique to interrupt the search, tune the search space and continue the optimization. Note that the optimizers will complain if x0 contains parameter values not covered by the dimension definitions, so in many cases shrinking the search space will not work without deleting the offending runs from x0 and y0.

  • see Store and load skopt optimization results for more information on how the results get saved and possible caveats
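The first point above, dropping earlier runs that fall outside a shrunken search space, can be sketched as follows. The helper `filter_to_bounds` is hypothetical and assumes purely real-valued dimensions given as (low, high) pairs with inclusive bounds; categorical or integer dimensions would need their own membership check. The sample values are the rounded x_iters and func_vals from the output above, and the new bounds (-20.0, 0.0) are chosen only for illustration.

```python
def filter_to_bounds(x_iters, func_vals, bounds):
    """Keep only the evaluations whose coordinates all lie within `bounds`."""
    kept_x, kept_y = [], []
    for x, y in zip(x_iters, func_vals):
        inside = all(low <= xi <= high for xi, (low, high) in zip(x, bounds))
        if inside:
            kept_x.append(list(x))
            kept_y.append(float(y))
    return kept_x, kept_y


# Evaluations from the run above (rounded)
x_iters = [[-20.0], [5.858], [-11.971], [5.450], [10.524],
           [-17.111], [7.251], [-19.167], [-18.661], [-18.284]]
func_vals = [-0.04682088, -0.08228249, -0.00653801, -0.07133619, 0.09063509,
             0.07662367, 0.08260541, -0.13236828, -0.17524445, 0.10024491]

# Shrink the search space to the negative half of the original interval
new_bounds = [(-20.0, 0.0)]
x0, y0 = filter_to_bounds(x_iters, func_vals, new_bounds)
print(len(x0), "of", len(x_iters), "points remain inside the new bounds")
```

The filtered x0 and y0 can then be passed to the minimizer together with the new dimension definitions.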
