Training and Deploying Your PyTorch Model in the Cloud with Azure ML

You’ve been training your PyTorch models on your machine, and getting by just fine. Why would you want to train and deploy them in the cloud? Training in the cloud will allow you to handle larger ML models and datasets than you could train on your development machine. And deploying your model in the cloud will allow your system to scale to many more inference requests than a development machine could handle. In short, moving your code to the cloud will open up a new world of possibilities by scaling up the hard work you’ve already done.

The good news is that moving your PyTorch models to the cloud using Azure ML is fairly straightforward. In this article, I will show you how to train and deploy a simple Fashion MNIST model in the cloud. The steps you’ll see here are the same regardless of the complexity of your PyTorch model, so by the end of this article you’ll be well prepared to apply them to your own work.

You can find the project associated with this post on GitHub, including complete instructions on how to run it.

Step 1: Train and test your PyTorch model locally

You’re probably already done with this step. I added it here anyway because I can’t emphasize enough that your model should be working as expected before you move it to the cloud. You’ll be so much more efficient this way — iterating on code is quick when you do it locally, but a training roundtrip to the cloud takes much longer! If your dataset is too large to train locally, use a portion of your data, and then add the full dataset right before moving to the cloud.
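One way to carve out a smaller portion of a large dataset for local iteration is to sample a fixed set of indices once and reuse it across runs. The sketch below uses only the standard library; the helper name `sample_indices`, the seed, and the 10% fraction are illustrative assumptions and are not taken from the accompanying project.

```python
import random


def sample_indices(dataset_size: int, fraction: float, seed: int = 0) -> list[int]:
    """Return a reproducible, sorted random sample of dataset indices."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    count = max(1, int(dataset_size * fraction))
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    return sorted(rng.sample(range(dataset_size), count))


# Example: keep 10% of a 60,000-image training split.
indices = sample_indices(60_000, 0.10)
print(len(indices))
```

In a PyTorch project, a list like this can be passed to `torch.utils.data.Subset(dataset, indices)` to get a smaller dataset with the same interface as the full one.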

Since we’re using Fashion MNIST, which consists of only 70,000 images, we can train our model locally using the full dataset. If you’re not familiar with the Fashion MNIST dataset and how to write the PyTorch code to train a classifier for this data, you can read this post for more information. Below you can see the training code for our scenario, which can be found in the train.py file of the accompanying GitHub project:

As you can see, we use the MLflow open source framework to save the model and log metrics. Using MLflow is not a requirement when using Azure ML — logging using any other logging framework and saving your model using PyTorch would work perfectly fine. However, MLflow logging displays nicely in Azure ML, and saving a model using the MLflow format simplifies Azure ML deployment, so I’ve been embracing this framework more and more lately.

Our PyTorch neural network can be found in the neural_network.py file of the GitHub project, and it can be seen below:

Our neural network is pretty simple: we have an input layer that takes 28 x 28 pixels as input (the size of a single Fashion MNIST image), a hidden layer with 20 nodes followed by ReLU activation functions, and an output layer with 10 nodes, one for each clothing item. This allows us to input an image of size 28×28, and get back a vector of 10 values, with the highest output value revealing our prediction!
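To make the description above concrete, here is a minimal sketch of a network with that shape, together with the step that turns the 10 output values into a clothing label. The class name `NeuralNetwork`, the attribute names, and the random stand-in images are assumptions for illustration; the project's neural_network.py may be organized differently.

```python
import torch
from torch import nn

# Fashion MNIST class names, in label order (0-9).
LABELS = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
          "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]


class NeuralNetwork(nn.Module):
    """28x28 input -> 20 hidden units with ReLU -> 10 output scores."""

    def __init__(self) -> None:
        super().__init__()
        self.sequence = nn.Sequential(
            nn.Flatten(),            # (N, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, 20),  # input layer -> hidden layer
            nn.ReLU(),
            nn.Linear(20, 10),       # hidden layer -> one score per class
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.sequence(x)


model = NeuralNetwork()
images = torch.rand(2, 28, 28)       # two random stand-in "images"
with torch.no_grad():
    scores = model(images)           # shape: (2, 10)

# The index of the highest score is the predicted class.
predictions = [LABELS[i] for i in scores.argmax(dim=1).tolist()]
print(scores.shape, predictions)
```

Because this model is untrained, the printed labels are arbitrary; the point is only the shape of the data flowing through the layers.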

The PyTorch code we use to fit and evaluate our model can be found in the utils_train_nn.py file of our project. Here’s the code:

This file contains fairly generic PyTorch code to train and evaluate a model, using a concise API. Depending on your scenario, you may be able to reuse it as is, and simply call “fit” to train your model for a single epoch, and “evaluate” to evaluate it. You can look back at the train.py file to see how I use this API.
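For readers who want a feel for what such an API might look like, here is a rough sketch of a `fit` function that trains for one epoch and an `evaluate` function that measures loss and accuracy. The signatures, the return values, and the synthetic data are assumptions made so the example runs on its own; they are not copied from utils_train_nn.py.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset


def fit(dataloader, model, loss_fn, optimizer, device="cpu"):
    """Train the model for a single epoch; return (average loss, accuracy)."""
    model.train()
    total_loss, correct, seen = 0.0, 0, 0
    for x, y in dataloader:
        x, y = x.to(device), y.to(device)
        scores = model(x)
        loss = loss_fn(scores, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * len(y)  # undo the per-batch mean
        correct += (scores.argmax(dim=1) == y).sum().item()
        seen += len(y)
    return total_loss / seen, correct / seen


def evaluate(dataloader, model, loss_fn, device="cpu"):
    """Compute (average loss, accuracy) without updating any weights."""
    model.eval()
    total_loss, correct, seen = 0.0, 0, 0
    with torch.no_grad():
        for x, y in dataloader:
            x, y = x.to(device), y.to(device)
            scores = model(x)
            total_loss += loss_fn(scores, y).item() * len(y)
            correct += (scores.argmax(dim=1) == y).sum().item()
            seen += len(y)
    return total_loss / seen, correct / seen


# Tiny synthetic stand-in for Fashion MNIST so the sketch runs anywhere.
data = TensorDataset(torch.rand(64, 28 * 28), torch.randint(0, 10, (64,)))
loader = DataLoader(data, batch_size=16)
model = nn.Sequential(nn.Linear(28 * 28, 20), nn.ReLU(), nn.Linear(20, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(2):
    train_loss, train_acc = fit(loader, model, loss_fn, optimizer)
    print(f"epoch {epoch}: loss={train_loss:.4f} accuracy={train_acc:.2f}")
test_loss, test_acc = evaluate(loader, model, loss_fn)
```

Keeping the epoch loop outside of `fit` leaves the caller free to log metrics or save checkpoints between epochs.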

To train the model you just need to execute the “train.py” file, optionally passing in a path for the data and another path for the model. Since the data is small and the neural network is simple, the model should take just a few minutes to train on your development machine.
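Optional paths like these are commonly handled with `argparse`. The sketch below shows one plausible way to do it; the flag names `--data_dir` and `--model_dir` and their default values are assumptions, so check train.py in the project for the actual names.

```python
import argparse


def parse_args(argv=None) -> argparse.Namespace:
    """Parse the optional data and model paths, falling back to defaults."""
    parser = argparse.ArgumentParser(description="Train a Fashion MNIST classifier.")
    parser.add_argument("--data_dir", default="data",
                        help="folder where the dataset is stored")
    parser.add_argument("--model_dir", default="model",
                        help="folder where the trained model is saved")
    return parser.parse_args(argv)


# No flags given, so both defaults apply.
args = parse_args([])
print(args.data_dir, args.model_dir)

# Both paths overridden, as they would be when the script runs in the cloud.
custom = parse_args(["--data_dir", "/tmp/fashion", "--model_dir", "/tmp/model"])
print(custom.data_dir, custom.model_dir)
```

Making the paths configurable is what lets the same script run unchanged on your machine and in a cloud job, where the data and output locations are supplied by the platform.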

You could write some more PyTorch code to test your model. But in this project I decided to use MLflow’s CLI instead, which enables me to test my model without writing any extra code. Here’s the command I execute:

In this command, “model-uri” refers to the path to our saved model, and “input-path” refers to a file containing the pixel values of two test images. I’m using a JSON file in this case, but MLflow also supports CSV.
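If you prefer to drive the MLflow CLI from a script, the arguments described above can be assembled in Python. This is a sketch under assumptions: the helper name `build_predict_command` and the two example paths are hypothetical, and the code only builds and prints the command rather than running it.

```python
import shlex


def build_predict_command(model_uri: str, input_path: str,
                          content_type: str = "json") -> list[str]:
    """Assemble an `mlflow models predict` invocation as an argument list."""
    if content_type not in ("json", "csv"):
        raise ValueError("content_type must be 'json' or 'csv'")
    return [
        "mlflow", "models", "predict",
        "--model-uri", model_uri,        # path to the saved MLflow model
        "--input-path", input_path,      # file holding the test images
        "--content-type", content_type,  # format of the input file
    ]


# Hypothetical paths; substitute the ones used in your project.
command = build_predict_command("model", "test_data/images.json")
print(shlex.join(command))
```

Passing the list to `subprocess.run(command, check=True)` would execute it. Depending on your MLflow version, you may need additional flags, for example to control how the model's environment is created.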

You should get predictions for the two images that are similar to the following:

Great! You’ve trained and tested your model locally, and are now ready to move your work to the cloud!

Step 2: Train your model in the cloud

In order to train your model in the cloud using Azure ML, you’ll need to create the following Azure ML entities: a compute cluster, a data asset, an environment, a command job, and a model.

The command job uses your compute cluster, data, and environment, and it runs your training code in the cloud. Once it finishes execution, it produces a trained model, which you can register as an Azure ML resource and download locally. You can get a good overview of the entities available in Azure ML in this blog post. The code to create these entities can be found in the job.py file of the current post’s project, or below:

There are actually three different ways to create these resources: from your terminal, using the Azure ML CLI and YAML configuration files (which I cover in a separate post); with a low-code approach, using the Azure ML studio; or by writing code with the Azure ML Python SDK, which is the method I show in this post. Regardless of which option you choose, you can visualize the results of your work in the Azure ML studio. From the left navigation pane of the studio, you can click on “Compute,” “Data,” “Environments,” “Jobs,” and “Models” to see the entities you just created!
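As a rough illustration of the Python SDK route, the sketch below defines and submits a command job using the `azure-ai-ml` package. It assumes that a compute cluster, a data asset, and an environment already exist in your workspace; every name in it (`cluster-cpu`, `data-fashion-mnist`, `environment-fashion-mnist`, the `src` folder, and the `--data_dir` and `--model_dir` flags) is a placeholder rather than a value taken from the project's job.py.

```python
def build_train_command(input_name: str = "fashion_mnist",
                        output_name: str = "model") -> str:
    """Command line the job runs; Azure ML resolves the ${{...}} placeholders."""
    return (
        "python train.py"
        " --data_dir ${{inputs." + input_name + "}}"
        " --model_dir ${{outputs." + output_name + "}}"
    )


def submit_training_job():
    """Submit the training job and stream its logs (requires an Azure ML workspace)."""
    # Imported lazily so the helper above works without the Azure packages installed.
    from azure.ai.ml import Input, MLClient, Output, command
    from azure.identity import DefaultAzureCredential

    # Reads subscription, resource group, and workspace from a local config.json.
    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    job = command(
        code="src",  # assumed folder containing train.py
        command=build_train_command(),
        inputs={"fashion_mnist": Input(type="uri_folder",
                                       path="azureml:data-fashion-mnist@latest")},
        outputs={"model": Output(type="mlflow_model")},
        environment="environment-fashion-mnist@latest",
        compute="cluster-cpu",
    )
    job = ml_client.jobs.create_or_update(job)
    ml_client.jobs.stream(job.name)  # wait for completion, echoing the job's logs
    return job


print(build_train_command())
```

Calling `submit_training_job()` only works against a real workspace with valid credentials, which is why the example stops at printing the command string.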

Step 3: Deploy your model to the cloud

Now that you have a trained model registered in the cloud, let’s look at how you can deploy it. This will enable you and your users to make predictions using the trained model from anywhere, at scale. In order to deploy with Azure ML, we’ll need to create two resources: an endpoint and a deployment.

The code that creates these resources can be found in the endpoint.py file of the accompanying project. You can also see it below:

This code creates the endpoint and deployment, and invokes the endpoint by using a JSON file with Fashion MNIST images as input. You should obtain a prediction similar to what you saw locally.
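The flow just described could be sketched with the `azure-ai-ml` package roughly as follows. Treat it as an outline under assumptions, not as the project's endpoint.py: the endpoint name prefix, the deployment name `blue`, the instance type, and the function names are all placeholders, and the model is assumed to be registered in MLflow format so that no scoring script is needed.

```python
import uuid


def unique_endpoint_name(prefix: str = "endpoint-fashion") -> str:
    """Endpoint names must be unique within an Azure region, so add a random suffix."""
    return f"{prefix}-{uuid.uuid4().hex[:8]}"


def deploy_and_invoke(model_id: str, request_file: str) -> str:
    """Create an endpoint and deployment, then score a JSON file (requires Azure)."""
    # Imported lazily so the helper above works without the Azure packages installed.
    from azure.ai.ml import MLClient
    from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint
    from azure.identity import DefaultAzureCredential

    ml_client = MLClient.from_config(credential=DefaultAzureCredential())

    endpoint = ManagedOnlineEndpoint(name=unique_endpoint_name(), auth_mode="key")
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()

    deployment = ManagedOnlineDeployment(
        name="blue",
        endpoint_name=endpoint.name,
        model=model_id,  # e.g. "azureml:<model-name>:<version>"
        instance_type="Standard_DS4_v2",
        instance_count=1,
    )
    ml_client.online_deployments.begin_create_or_update(deployment).result()

    # Route all traffic to the new deployment.
    endpoint.traffic = {"blue": 100}
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()

    return ml_client.online_endpoints.invoke(
        endpoint_name=endpoint.name, request_file=request_file
    )


print(unique_endpoint_name())
```

An endpoint created this way keeps running until it is removed; with the same client, `ml_client.online_endpoints.begin_delete(name=...)` deletes it.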

Before you set your work aside, remember to delete your endpoint if you’re not planning on using it, to avoid getting charged. You can do this by running the delete_endpoint.py file in the project, or by going to the “Endpoints” section in the Azure ML Studio.

And that’s all there is to it! You now have the knowledge you need to train and deploy your machine learning models at scale, in the cloud!

Read my next post on Medium, about training a PyTorch model using Azure ML components and pipelines.

To learn more about Azure ML and other AI/ML topics, check out my machine learning blog.