Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III)

Data Science Cloud Infrastructure

A step-by-step guide to deploying a Data Science platform on AWS with open-source software

Photo by Chris Ried on Unsplash

In our previous post, we configured Amazon ECR to push a Docker image to AWS and configured an S3 bucket to write the output of our Data Science experiments.

In this final post, we’ll show you how to use Ploomber and Soopervisor to create grids of experiments that you can run in parallel on AWS Batch, and how to request resources dynamically (CPUs, RAM, and GPUs).

Hi! My name is Eduardo, and I like writing about all things MLOps. If you want to keep up to date with my content, follow me on Medium or Twitter. Thanks for reading!

This is what our architecture looks like:

Platform’s architecture. Image by author.

Authenticating with the aws CLI

We’ll be using the aws CLI again to configure the infrastructure, so ensure you’re authenticated and have enough permissions:

https://medium.com/media/955c78a009bc1f59c3485e3f83edf6ca/href
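If you prefer to run this check from Python, here is a minimal sketch. It assumes the aws CLI is on your PATH and shells out to aws sts get-caller-identity, which reports the account and ARN of the credentials in use. The helper names (parse_caller_identity, get_caller_identity) and the sample response are made up for illustration.

```python
import json
import subprocess


def parse_caller_identity(raw_json):
    """Extract the account ID and ARN from the CLI's JSON response."""
    data = json.loads(raw_json)
    return {"account": data["Account"], "arn": data["Arn"]}


def get_caller_identity():
    """Run `aws sts get-caller-identity` and parse its output.

    Raises subprocess.CalledProcessError if the CLI reports an error
    (for example, missing or expired credentials).
    """
    result = subprocess.run(
        ["aws", "sts", "get-caller-identity", "--output", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_caller_identity(result.stdout)


# Made-up response, shaped like the CLI's output, to show the parsing step
sample = json.dumps({
    "UserId": "AIDAEXAMPLE",
    "Account": "123456789012",
    "Arn": "arn:aws:iam::123456789012:user/example",
})
identity = parse_caller_identity(sample)
print(identity["account"], identity["arn"])
```

Call get_caller_identity() on your machine to run the real check. Note that a valid identity only confirms you’re authenticated; it doesn’t tell you whether the identity has the permissions needed for ECR, S3, and AWS Batch.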

Checking Docker

We’ll be using Docker for this part, so ensure it’s up and running:

https://medium.com/media/1e9456bfd16efadafaf6ba3ef9dd30a6/href
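The same check can be scripted. The sketch below assumes that a successful docker info call means the daemon is reachable; docker_is_running is a hypothetical helper name, not part of any library.

```python
import subprocess


def docker_is_running(timeout_seconds=30):
    """Return True if `docker info` exits successfully, False otherwise."""
    try:
        result = subprocess.run(
            ["docker", "info"],
            capture_output=True,
            timeout=timeout_seconds,
        )
    except (OSError, subprocess.TimeoutExpired):
        # Docker is not installed (or not executable), or it did not answer in time
        return False
    return result.returncode == 0


print("Docker is running" if docker_is_running() else "Docker is not available")
```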

Creating an Amazon ECR repository

First, let’s create another ECR repository to host our Docker image:

https://medium.com/media/e0a61688ca89682dc894269d3374e9f1/href

https://medium.com/media/2c15047d322d19f3d9fddacc27ccea28/href

Output:

https://medium.com/media/a0e1fee91bf93897138146fb829974cf/href

Assign the REPOSITORY variable to the output of the previous command:

https://medium.com/media/18ce3ac2a29aa2e29c815539b7d6aa54/href
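ECR repository URIs follow the pattern <account-id>.dkr.ecr.<region>.amazonaws.com/<name>. As a sketch, assuming you saved the JSON printed by aws ecr create-repository, the hypothetical helper below pulls out the repositoryUri field, which is the value REPOSITORY should hold. The sample response is made up and only includes the fields used here.

```python
import json


def repository_uri(create_repository_json):
    """Return the repositoryUri field from an `aws ecr create-repository` response."""
    response = json.loads(create_repository_json)
    return response["repository"]["repositoryUri"]


# Made-up response; a real one contains more fields
sample_response = json.dumps({
    "repository": {
        "repositoryName": "example-repo",
        "repositoryUri": "123456789012.dkr.ecr.us-east-1.amazonaws.com/example-repo",
    }
})

REPOSITORY = repository_uri(sample_response)
print(REPOSITORY)
```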

Getting sample code

We’ll now get a sample project. First, let’s install the required packages.

Note: We recommend you install them in a virtual environment.

https://medium.com/media/761b202d975649cbf5cdc88d103054df/href

Download the example in the grid directory:

https://medium.com/media/783f8b07f44a346a83986bf89f8d0883/href

Output:

https://medium.com/media/9b02512f64d3b9a7f8272cb37604ce0c/href

This downloaded a full project:

https://medium.com/media/70e8140492fc71e3d9efe27ba19b7f68/href

Output:

https://medium.com/media/4749770a531b4841f23ae966cc8b6ec3/href

The example we downloaded prepares some data and trains a dozen Machine Learning models in parallel. Here’s a graphical representation:

Graphical representation of our workflow. Image by author.

Let’s look at the pipeline.yaml file, which specifies the tasks in our workflow:

https://medium.com/media/ec3e2b97f25c71d45bab7efdbace790f/href

Output:

https://medium.com/media/cbb5b42fa145834ebc9ba40c739dc763/href

The pipeline.yaml file is one of the interfaces Ploomber offers for describing computational workflows (you can also declare them in Python).

The tasks section contains five entries, one per task. The first four are Python functions that process some input data (tasks.raw.get, tasks.features.sepal, tasks.features.petal, tasks.features.features), and the last one is a script that fits a model (scripts/fit.py).

Note the last entry is longer because it’s a grid task: it’ll use the same script and execute it multiple times with different parameters. In total, the script will be executed 12 times, but this could be a larger number.
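To build some intuition for how a grid turns into 12 runs, here is a small sketch of the expansion as a Cartesian product of parameter values. The parameter names and values below are hypothetical (the real ones live in the example’s pipeline.yaml), and this is an illustration of the idea rather than Ploomber’s implementation.

```python
from itertools import product

# Hypothetical grid: 2 models x 3 sizes x 2 criteria = 12 combinations
grid = {
    "model": ["random-forest", "extra-trees"],
    "n_estimators": [1, 3, 5],
    "criterion": ["gini", "entropy"],
}

# One dict of parameters per experiment (Cartesian product of all values)
experiments = [dict(zip(grid, values)) for values in product(*grid.values())]

print(len(experiments))
for params in experiments[:3]:
    print(params)
```

Adding one more value to any list multiplies the number of runs, which is why grids grow quickly and benefit from running in parallel.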

To learn more about the pipeline.yaml file and Ploomber, check our documentation.

Let’s now configure AWS Batch as our cloud environment (Kubernetes, SLURM, and Airflow are supported as well).

Configuring the project to run on AWS

https://medium.com/media/a8a180cc524908372bab693d95bc5574/href

Output:

========================= Loading DAG =========================
No pipeline.aws-env.yaml found, looking for pipeline.yaml instead Found /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/pipeline.yaml.
Loading...
Adding /Users/Edu/dev/ploomber.io/raw/ds-platform-part-iii/grid/aws-env/Dockerfile...
============================= Done ============================
Fill in the configuration in the 'aws-env' section in soopervisor.yaml then submit to AWS Batch with: soopervisor export aws-env
Environment added, to export it:
$ soopervisor export aws-env
To force execution of all tasks:
$ soopervisor export aws-env --mode force

There are a few extra things we need to configure. To simplify the setup, we created a script that automates these tasks based on your AWS infrastructure. Let’s download it:

https://medium.com/media/2b2fecadcc30f79dafe72f76293afd77/href

Output:

https://medium.com/media/7060015a0b47afd033402e334e005619/href

Now, set the values for the AWS Batch job queue and artifacts bucket you want to use. (If in doubt, you might want to revisit the previous tutorials: Part I and Part II.)

https://medium.com/media/15298e614c361455f7fe5f36e6a3bbd4/href

Let’s generate the configuration file that specifies the job queue to use and the ECR repository to upload our code to:

https://medium.com/media/9da768980d459690ccd2b17c9e7c182a/href

Output:

https://medium.com/media/c46147ea6d2e63c80f5a184363682abf/href

Now, let’s specify the S3 client so the outputs of the pipeline are uploaded to the bucket:

https://medium.com/media/46fffd61bfdced3e72ae8562eab1d5d7/href

Output:

https://medium.com/media/945d78180b8bc021303fa711e6f84f01/href

Modify the pipeline.yaml so it uses the client we created in the step above:

https://medium.com/media/f42f3475d8a014b01a61f8a990228dd7/href

Upload the project to the ECR repository

Let’s upload our project to ECR:

https://medium.com/media/0cc594df42a370edbea0c58f683fad22/href

https://medium.com/media/43d22ca6ec03917a76508b273a69d2b3/href

Output:

https://medium.com/media/7bf1caf4925d87ef0131094a09f246bc/href

Ensure boto3 is installed as part of our project, since we need it to upload to S3:

https://medium.com/media/ca79c958caf6c2d06569242fab34016e/href

Execute jobs in AWS Batch

We’re now ready to schedule our workflow! Let’s use the soopervisor export command to build the Docker image, push it to ECR and schedule the jobs on AWS Batch:

https://medium.com/media/c39ba641acca559a1457669ff1575ac7/href

You can monitor execution in the AWS Batch console or with the following command, which retrieves the status of the fit-random-forest-1-gini task (change the job name to check a different task):

https://medium.com/media/bbc11fdb7a2afb96b7dd460775e35640/href

Output:

https://medium.com/media/dbb2930011aebaefc0986a79c8e5b03a/href
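With a dozen jobs, checking them one by one gets tedious. As a sketch, assuming you saved the JSON returned by aws batch describe-jobs (which contains a jobs list where each entry has a jobName and a status such as RUNNABLE, RUNNING, SUCCEEDED, or FAILED), the hypothetical helper below counts jobs per status. The sample response and the fit-example-* job names are made up.

```python
import json
from collections import Counter


def summarize_statuses(describe_jobs_json):
    """Count jobs per status in an `aws batch describe-jobs` response."""
    jobs = json.loads(describe_jobs_json)["jobs"]
    return Counter(job["status"] for job in jobs)


# Made-up response with only the fields used here
sample = json.dumps({
    "jobs": [
        {"jobName": "fit-random-forest-1-gini", "status": "SUCCEEDED"},
        {"jobName": "fit-example-a", "status": "RUNNING"},
        {"jobName": "fit-example-b", "status": "RUNNING"},
    ]
})

summary = summarize_statuses(sample)
print(dict(summary))
```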

After a few minutes, all tasks should be executed!

Checking output

Let’s check the outputs in the S3 bucket:

https://medium.com/media/7a3d096dcc4a397aeceb6ec7a4730460/href

Output:

https://medium.com/media/399b41c6183e600b1f50126550f23adf/href

You can see there’s a combination of .pickle files (the trained models), .csv (processed data), and .html (reports generated from the training script).
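If the listing is long, grouping the keys by extension makes it easier to find the models, the data, and the reports. This is a minimal sketch with hypothetical keys; the real ones depend on the products declared in your pipeline.yaml.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_extension(keys):
    """Map each file extension to the list of S3 keys that end with it."""
    groups = defaultdict(list)
    for key in keys:
        groups[PurePosixPath(key).suffix].append(key)
    return dict(groups)


# Hypothetical keys, for illustration only
keys = [
    "outputs/model-a.pickle",
    "outputs/features.csv",
    "outputs/report-a.html",
    "outputs/model-b.pickle",
]

groups = group_by_extension(keys)
for extension in sorted(groups):
    print(extension, groups[extension])
```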

Let’s download one of the reports:

https://medium.com/media/e95c504275bcc3d8824676332e17373a/href

Output:

https://medium.com/media/06ac9535ce797583b7387257f19ae5ff/href

Open report.html and you’ll see the outputs of the training script!

Requesting more resources

Let’s take a look at the grid/soopervisor.yaml file, which configures the cloud environment:

https://medium.com/media/9ea2fc66cf96dc39795349f30b18b3a9/href

Output:

https://medium.com/media/99d98819d66de8e4f65bb1f1325d4670/href

The soopervisor.yaml file specifies the backend to use (aws-batch), the default resources ({memory: 16384, vcpus: 8}), the job queue, the region, and the repository.

To override the defaults, we can add a new section that specifies per-task resources:

https://medium.com/media/d6210103631a3516e1e52cdfe6222ed8/href
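Conceptually, per-task resources work as overrides on top of the defaults. The sketch below illustrates that idea only; it is not Soopervisor’s implementation. The defaults come from the values quoted above, while the fit-* pattern, the override values, the gpu key, and the first-match rule are assumptions made for the example.

```python
from fnmatch import fnmatch

# Defaults quoted in the text: {memory: 16384, vcpus: 8}
DEFAULT_RESOURCES = {"memory": 16384, "vcpus": 8}

# Hypothetical overrides, keyed by a pattern that matches task names
TASK_RESOURCES = {
    "fit-*": {"memory": 32768, "vcpus": 16, "gpu": 1},
}


def resources_for(task_name):
    """Return the defaults, updated by the first pattern that matches the task."""
    resources = dict(DEFAULT_RESOURCES)
    for pattern, override in TASK_RESOURCES.items():
        if fnmatch(task_name, pattern):
            resources.update(override)
            break
    return resources


print(resources_for("fit-random-forest-1-gini"))
print(resources_for("get"))
```

In this sketch, every training task gets more memory, more CPUs, and a GPU, while the data preparation tasks keep the cheaper defaults.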

Closing remarks

In this final part, we showed how to create multi-step workflows, and how to parametrize a script to create a grid of experiments that can run in parallel. Now you have a scalable infrastructure to run Data Science and Machine Learning experiments!

If you need help customizing the infrastructure or want to share your feedback, please join our !

To keep up to date with our content, follow us on Twitter or LinkedIn, or subscribe to our newsletter!

Epilogue: Cleaning up the infrastructure

Here’s the command you need to run to delete the ECR repository we created in this post. To delete all the infrastructure, revisit the previous tutorials.

https://medium.com/media/6eba1b4b9d3e59d86b2b81848103ff30/href

Output:

https://medium.com/media/5bbe2a465b3502b7b251bacd90e9959a/href

Deploying a Data Science Platform on AWS: Parallelizing Experiments (Part III) was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.