Interruptible optimization runs with checkpoints

Christian Schell, Mai 2018 Reformatted by Holger Nahrstaedt 2020

Problem statement

Optimization runs can take a very long time and even run for multiple days. If for some reason the process has to be interrupted results are irreversibly lost, and the routine has to start over from the beginning.

With the help of the callbacks.CheckpointSaver callback the optimizer’s current state can be saved after each iteration, allowing to restart from that point at any time.

This is useful, for example,

  • if you don’t know how long the process will take and cannot hog computational resources forever

  • if there might be system failures due to shaky infrastructure (or colleagues…)

  • if you want to adjust some parameters and continue with the already obtained results

print(__doc__)
import sys
import numpy as np
np.random.seed(777)
import os

# The followings are hacks to allow sphinx-gallery to run the example.
sys.path.insert(0, os.getcwd())
main_dir = os.path.basename(sys.modules['__main__'].__file__)
IS_RUN_WITH_SPHINX_GALLERY = main_dir != os.getcwd()

Simple example

We will use pretty much the same optimization problem as in the Bayesian optimization with skopt notebook. Additionally we will instantiate the callbacks.CheckpointSaver and pass it to the minimizer:

from skopt import gp_minimize
from skopt import callbacks
from skopt.callbacks import CheckpointSaver

noise_level = 0.1

if IS_RUN_WITH_SPHINX_GALLERY:
    # When this example is run with sphinx gallery, it breaks the pickling
    # capacity for multiprocessing backend so we have to modify the way we
    # define our functions. This has nothing to do with the example.
    from utils import obj_fun
else:
    def obj_fun(x, noise_level=noise_level):
        return np.sin(5 * x[0]) * (1 - np.tanh(x[0] ** 2)) + np.random.randn() * noise_level

checkpoint_saver = CheckpointSaver("./checkpoint.pkl", compress=9) # keyword arguments will be passed to `skopt.dump`

gp_minimize(obj_fun,                       # the function to minimize
              [(-20.0, 20.0)],             # the bounds on each dimension of x
              x0=[-20.],                     # the starting point
              acq_func="LCB",              # the acquisition function (optional)
              n_calls=10,                   # the number of evaluations of f including at x0
              n_random_starts=0,           # the number of random initialization points
              callback=[checkpoint_saver], # a list of callbacks including the checkpoint saver
              random_state=777);

Out:

/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "
/home/circleci/project/skopt/optimizer/optimizer.py:409: UserWarning: The objective has been evaluated at this point before.
  warnings.warn("The objective has been evaluated "

          fun: -0.17524445239614728
    func_vals: array([-0.04682088, -0.08228249, -0.00653801, -0.07133619,  0.09063509,
        0.07662367,  0.08260541, -0.13236828, -0.17524445,  0.10024491])
       models: [GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5) + WhiteKernel(noise_level=1),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735)]
 random_state: RandomState(MT19937) at 0x7F8322CE7B40
        space: Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')])
        specs: {'args': {'func': <function obj_fun at 0x7f8320d43d30>, 'dimensions': Space([Real(low=-20.0, high=20.0, prior='uniform', transform='normalize')]), 'base_estimator': GaussianProcessRegressor(alpha=1e-10, copy_X_train=True,
                         kernel=1**2 * Matern(length_scale=1, nu=2.5),
                         n_restarts_optimizer=2, noise='gaussian',
                         normalize_y=True, optimizer='fmin_l_bfgs_b',
                         random_state=655685735), 'n_calls': 10, 'n_random_starts': 0, 'acq_func': 'LCB', 'acq_optimizer': 'auto', 'x0': [-20.0], 'y0': None, 'random_state': RandomState(MT19937) at 0x7F8322CE7B40, 'verbose': False, 'callback': [<skopt.callbacks.CheckpointSaver object at 0x7f831a509100>], 'n_points': 10000, 'n_restarts_optimizer': 5, 'xi': 0.01, 'kappa': 1.96, 'n_jobs': 1, 'model_queue_size': None}, 'function': 'base_minimize'}
            x: [20.0]
      x_iters: [[-20.0], [20.0], [20.0], [-20.0], [-20.0], [20.0], [-20.0], [20.0], [20.0], [20.0]]

Now let’s assume this did not finish at once but took some long time: you started this on Friday night, went out for the weekend and now, Monday morning, you’re eager to see the results. However, instead of the notebook server you only see a blank page and your colleague Garry tells you that he had had an update scheduled for Sunday noon – who doesn’t like updates?

gp_minimize did not finish, and there is no res variable with the actual results!

Restoring the last checkpoint

Luckily we employed the callbacks.CheckpointSaver and can now restore the latest result with skopt.load (see Store and load skopt optimization results for more information on that)

from skopt import load

res = load('./checkpoint.pkl')

res.fun

Out:

-0.17524445239614728

Possible problems

  • changes in search space: You can use this technique to interrupt the search, tune the search space and continue the optimization. Note that the optimizers will complain if x0 contains parameter values not covered by the dimension definitions, so in many cases shrinking the search space will not work without deleting the offending runs from x0 and y0.

  • see Store and load skopt optimization results

for more information on how the results get saved and possible caveats

Total running time of the script: ( 0 minutes 3.680 seconds)

Estimated memory usage: 8 MB

Gallery generated by Sphinx-Gallery