Fast and Scalable Hyperparameter Tuning and Cross-validation in AWS SageMaker | by João Pereira | Mar, 2023 | Towards Data Science

Fast and Scalable Hyperparameter Tuning and Cross-validation in AWS SageMaker | by João Pereira | Mar, 2023 | Towards Data Science

Fast and Scalable Hyperparameter Tuning and Cross-validation in AWS SageMaker

This article shares a recipe to speeding up to 60% your hyperparameter tuning with cross-validation in SageMaker Pipelines leveraging SageMaker Managed Warm Pools. By using Warm Pools, the runtime of a Tuning step with 120 sequential jobs is reduced from 10h to 4h.

Improving and evaluating the performance of a machine learning model often requires a variety of ingredients. Hyperparameter tuning and cross-validation are 2 such ingredients. The first finds the best version of a model, while the second estimates how a model will generalize to unseen data. These steps, combined, introduce computing challenges as they require training and validating a model multiple times, in parallel and/or in sequence.

What this article is about…

We will consider Bayesian optimization for hyperparameter tuning that leverages the scores of the hyperparameter combinations already tested to choose the hyperparameter set to test in the next round. We will use k-fold cross-validation to score each combination of hyperparameters, in which the splits are as follows:

The full dataset is partitioned into ? validation folds, the model trained on ?-1 folds, and validated on its corresponding held-out fold. The overall score is the average over the individual validation scores obtained for each validation fold.

1. What are Warm Pools?

Whenever a training job is launched in AWS, the provisioned instance takes roughly 3min to bootstrap before the training script is executed. This startup time adds up when running multiple jobs sequentially, which is the case when performing hyperparameter tuning using a Bayesian optimization strategy. Here, dozens or even hundreds of jobs are run in sequence leading to a significant total time that can be on par with or even higher than the actual execution times of the scripts.

SageMaker Managed Warm Pools make it possible to retain training infrastructure after a job is completed for a desired number of seconds, enabling saving the instance startup time for every subsequent job.

Enabling Warm Pools is straightforward. You simply add an extra parameter (keep_alive_period_in_seconds) when creating a training job in SageMaker:

If you want to learn more about SageMaker Managed Warm Pools, here is the documentation:

Now that we know what are Warm Pools, in Section 2 we are going to dive deep into how to leverage them to speed-up the overall runtime of a SageMaker Pipeline that includes hyperparameter tuning with cross-validation.

2. End-to-end SageMaker Pipeline

The following figure depicts an end-to-end SageMaker Pipeline that performs hyperparameter tuning with cross-validation.

We will create the pipeline using the SageMaker Python SDK, which is an open-source library that simplifies the process of training, tuning, and deploying machine learning models in AWS SageMaker. The pipeline steps in the diagram are summarized as follows:

Please find detailed documentation on how to implement these steps on the SageMaker Developer Guide.

3. What happens inside the Tuning step?

Let’s now dig deeper into the pipeline step 2 that iteratively tries and cross-validates multiple hyperparameter combinations in parallel and in sequence. The solution is represented in the following diagram:

The solution relies on SageMaker Automatic Model Tuning to create and orchestrate the training jobs that test multiple hyperparameter combinations. The Automatic Model Tuning job can be launched using the HyperparameterTuner available in the SageMaker Python SDK. It creates MxN hyperparameter tuning training jobs, M of which are run in parallel over N sequential rounds that progressively search for the best hyperparameters. Each of these jobs launches and monitors a set of K cross-validation jobs. At each tuning round, MxK instances in a Warm Pool are retained for the next round. In the subsequent rounds there is no instance startup time.

SageMaker’s HyperparameterTuner already makes use of Warm Pools as announced on the AWS News Blog. However, the cross-validation training jobs that are created in each tuning job — that cross-validate a specific combination of hyperparameters — have to be manually created and monitored, and the provisioned instances are not kept in a Warm Pool. Each hyperparameter tuning training job will only finish when all the underlying cross-validation training jobs have completed.

To bring the architecture above to life and enable Warm Pools for all training jobs, we need to create three main scripts:,, and

Tip: Add a small delay, i.e., a few seconds, between the calls to the SageMaker APIs that create and monitor the training jobs to prevent the“Rate Exceeded” error, as in the example:

Tip: Disable the debugger profiler when launching your SageMaker training jobs. These profiler instances will be as many as the training instances and can make the overall cost increase significantly. You can do so by simply setting disable_profiler=Truein the Estimator definition.

Tip: Write a general-purpose training.pyscript and reuse it for training the model on cross-validation sets and for training the final model with the optimal hyperparameters on the full training set.

To control each parallel cross-validation set of jobs, as well as to compute a final validation metric for each specific hyperparameter combination tested, there are several custom functions that have to be implemented inside the script. This example provides good inspiration, even though it does not enable Warm Pools or Lambda.

This content was originally published here.