Evaluate ML Models with Azure Machine Learning’s Responsible AI Insights

In December 2021, we introduced the Responsible AI dashboard, a comprehensive experience that brings together several mature Responsible AI tools: data explorer (to proactively identify whether there is sufficient data representation for the variety of data subgroups), fairness assessment (to assess and identify your model’s group fairness issues), model interpretability (to understand how features are impacting your model predictions), error analysis (to easily identify error distributions across your data cohorts), and counterfactual and causal inference analysis (to empower you to make responsible model-driven and data-driven decisions). The dashboard aims to address the issues of Responsible AI tool discoverability and fragmentation by enabling:

The Responsible AI dashboard is now integrated and generally available with the Azure Machine Learning platform, enabling our cloud customers to use a variety of experiences (via CLI, SDK, and no-code UI wizard) to generate Responsible AI dashboards for their machine learning models, enhancing their model debugging and understanding processes.

The Responsible AI scorecard, now in public preview, is a reporting feature that can also be generated in Azure Machine Learning to create and share reports surfacing key data characteristics as well as model performance and fairness insights. The scorecard helps contextualize the model and data health insights for both technical and non-technical audiences, bringing stakeholders along and assisting in compliance reviews.

Walkthrough of the Responsible AI dashboard

In this article, we will walk through a scenario in which a linear regression model is used for the hypothetical purpose of determining developer access to a GPT-2 model published for a limited group of users. In the following sections, we will dive deeper into how the Responsible AI dashboard can be used to debug the data and model and inform better decision making. The regression model is trained on a historical dataset of programmers who were scored from 0 to 10 based on characteristics such as age, geographical region, operating system, employer, style of coding, and so on. If the model predicts a score of 7 to 10, the programmer is allowed access. A sample of the synthetic data is below.

| First name | Last name | Score (target) | Style | YOE | IDE | Programming language | Location | Number of GitHub repos contributed to | Employer | OS | Job title | Age |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Bryan | Ray | 8 | spaces | 16 | Emacs | R | Antarctica | 2 | Snapchat | MacOS | Principal Engineer | 32 |
| Donovan | Lucero | 3 | tabs | 9 | pyCharm | Swift | Antarctica | 2 | Instagram | Linux | Distinguished Engineer | 35 |
| Dean | Hurley | 1 | tabs | 7 | XCode | C# | Antarctica | 0 | Uber | MacOS | Senior Engineer | 32 |
| Nathan | Weaver | 6 | spaces | 15 | Visual Studio | R | Antarctica | 0 | Amazon | Linux | Principal Engineer | 32 |
| Raelyn | Sloan | 5 | tabs | 7 | Eclipse | Java | Antarctica | 0 | Twitter | Windows | SWE 2 | 33.1 |
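
The access rule in this scenario can be sketched in a few lines of Python. This is only an illustration: the rows are copied from the sample above, the function name `grants_access` is hypothetical, and the historical target score stands in for the model's prediction to keep the example small.

```python
# Sample rows copied from the synthetic data above (only the columns used here).
SAMPLE = [
    {"name": "Bryan Ray", "score": 8, "style": "spaces"},
    {"name": "Donovan Lucero", "score": 3, "style": "tabs"},
    {"name": "Dean Hurley", "score": 1, "style": "tabs"},
    {"name": "Nathan Weaver", "score": 6, "style": "spaces"},
    {"name": "Raelyn Sloan", "score": 5, "style": "tabs"},
]

ACCESS_MIN, ACCESS_MAX = 7, 10  # accepted score range from the scenario


def grants_access(score: float) -> bool:
    """Return True when a score falls in the accepted 7 to 10 range."""
    return ACCESS_MIN <= score <= ACCESS_MAX


# In the real scenario the model's predicted score would be passed in;
# the historical target score is used here only for illustration.
granted = [row["name"] for row in SAMPLE if grants_access(row["score"])]
print(granted)
```

Even in this tiny sample, only one of the five programmers would clear the threshold, which is why the errors the model makes near that boundary matter.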

Essentially, this model is allocating opportunity across different developers. So, we should take a closer look at this model to identify what kind of errors it’s making, diagnose what is causing those errors, and use those insights to improve the model. After uncovering those evaluation insights on our model, we can share them via the Responsible AI scorecard with other stakeholders who also want to ensure the app’s transparency and robustness and build trust with our end users.

The Responsible AI dashboard can be generated via a code-first CLI v2 and SDK v2 experience or a no-code method via Azure Machine Learning’s studio UI.

Generating a Responsible AI dashboard

Using Python with the Azure Machine Learning SDKv2

An Azure Machine Learning training pipeline job can be configured and executed remotely from a Python notebook using the Azure Machine Learning SDKv2. Once you have trained and registered your model, you can create a Responsible AI dashboard by selecting the components you would like to activate in the dashboard, specifying the inputs and outputs of each component, and creating a component job for each of them. The components available by default in all Azure Machine Learning workspaces are:

With our components defined, we can assemble our pipeline job and submit it to Azure Machine Learning. Model performance and fairness disparity metrics along with dataset explorer are automatically generated for your Responsible AI dashboard.
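
To make the ordering of these jobs concrete, here is a minimal sketch that models the steps as a dependency graph and derives a valid execution order. It deliberately does not use the Azure Machine Learning SDK: the step names are hypothetical labels, and the assumption that every analysis step depends only on the registered model is a simplification for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the set of steps it depends on.
# These are labels for illustration, not Azure Machine Learning component names.
dependencies = {
    "train_model": set(),
    "register_model": {"train_model"},
    "error_analysis": {"register_model"},
    "counterfactual": {"register_model"},
    "explanation": {"register_model"},
    "causal": {"register_model"},
}

# static_order() yields every step after all of its dependencies.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

The analysis steps have no dependencies on each other in this sketch, so they could run in any order (or in parallel) once the model is registered.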

Using YAML with the Azure Machine Learning CLIv2

Alternatively, we can define this job in a YAML file and use the Azure Machine Learning CLIv2 to automate creating the Responsible AI dashboard as part of your MLOps workflow. In the YAML file, we specify all the jobs that we want to kick off: training the model, registering the model, and then creating the Responsible AI dashboard. We then execute the job with a single line from the CLI.

Read more about how to create the Responsible AI dashboard with Python and YAML in SDKv2/CLIv2.

Using no-code guided UI wizard in Azure Machine Learning studio

Finally, we can create this job without leaving the Azure Machine Learning studio at all with a no-code wizard experience. If we go to our list of registered models, we first select the model we want to generate Responsible AI insights for, click on the “Responsible AI” tab, and click the “Create Responsible AI insights > Create dashboard” button.

First, pick the train and test datasets that were used to train and test your model.

For this scenario, we will be choosing regression to match our model.

For the Responsible AI dashboard components that we’re interested in, we can choose either the debugging profile or the real-life interventions profile.

We’ll move forward with model debugging and customize the dashboard to include error analysis, counterfactual analysis, and model explanation. For error analysis, we can choose up to two features to pre-generate an error heat map for. For counterfactual analysis, we’re interested in seeing a diverse set of examples (let’s say 10 examples for each datapoint) in which features are automatically perturbed just enough for the datapoint to receive a score of 7 to 10. We can even control which features are perturbed if we don’t want certain features to be changed.

Once that all looks good, we can move on to the final step to configure our experiment. We can name our job that will generate our Responsible AI dashboard, and either select an existing experiment to kick off the job in or create a new one. We’ll create a new one with the necessary resources and hit ‘Create’ and kick off the job.

With that, we can jump into the Azure Machine Learning studio to see if the job has been successfully completed and we can see the resulting Responsible AI dashboard for our model showing up.

Read more about how to create the Responsible AI dashboard with no-code UI wizard in Azure Machine Learning studio.

Viewing the Responsible AI dashboard

The Responsible AI dashboard is a dynamic and interactive interface to investigate your model and data built on a host of open-sourced state-of-the-art technology. You can view your dashboard(s) by navigating to the registered model you have generated a Responsible AI dashboard for. Clicking on the Responsible AI tab will take you to your dashboards.

You can connect a compute resource from your workspace to the dashboard to access all of its features, such as retraining error trees, recalculating probabilities, and generating insights in real time.

The different components of the Responsible AI dashboard are designed such that they can easily communicate with each other. You can create cohorts of your data to slice and dice your analysis and interactively pass cohorts and insights from one component to another for deep-dive investigations. You can hide the different components you’ve generated for the dashboard in the “dashboard configuration” or add them back by clicking the blue “plus” icon.

We first look at our error tree, which tells us where most of our errors lie. It seems that our model made the greatest number of errors for programmers living in Antarctica who don’t program in C, PHP, or Swift and don’t contribute that often to GitHub repos. We can easily save this as a new cohort to investigate later; in the meantime, it will show up as a “Temporary cohort” in the subsequent components.
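
The cohort found by the error tree is essentially a filter over the data. The sketch below expresses that filter in plain Python and measures how much of the total error falls inside it. Everything here is an assumption for illustration: the records are made up, `MAX_REPOS` is a hypothetical threshold for "don't contribute that often", and "share of absolute error" is a stand-in metric rather than the one the dashboard reports.

```python
# Toy records: (location, language, repos, true_score, predicted_score).
# All values are made up for illustration.
records = [
    ("Antarctica", "R", 0, 6, 2),
    ("Antarctica", "Java", 0, 5, 9),
    ("Antarctica", "Swift", 2, 3, 3),
    ("Europe", "R", 4, 8, 8),
    ("Asia", "Python", 3, 7, 6),
]

EXCLUDED_LANGUAGES = {"C", "PHP", "Swift"}
MAX_REPOS = 1  # hypothetical threshold for "don't contribute that often"


def in_cohort(location, language, repos):
    """Mirror the error-tree path described above as a simple filter."""
    return (
        location == "Antarctica"
        and language not in EXCLUDED_LANGUAGES
        and repos <= MAX_REPOS
    )


abs_errors = [abs(true - pred) for *_, true, pred in records]
cohort_errors = [
    abs(true - pred)
    for loc, lang, repos, true, pred in records
    if in_cohort(loc, lang, repos)
]

# Share of the total absolute error that falls inside the cohort.
error_share = sum(cohort_errors) / sum(abs_errors)
print(f"cohort size: {len(cohort_errors)}, error share: {error_share:.2f}")
```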

When looking at our model overview, we can get a high-level view of the model prediction distribution to help build intuition for the next steps in model debugging. In the “Feature cohorts” tab, we can also see fairness metrics in the second table. The two rows display the differences and ratios of the performance metrics shown in the columns of the first table. For example, we see that there is a huge disparity between those who use spaces and those who use tabs, with a difference in mean absolute error of 659.563.
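
As a rough illustration of what a difference and a ratio of a performance metric mean, the sketch below computes mean absolute error per coding style and then compares the cohorts. The data is made up, and defining the difference as maximum minus minimum and the ratio as minimum over maximum is an assumption here, not a statement of how the dashboard computes its values.

```python
from collections import defaultdict

# Toy (style, true_score, predicted_score) triples; values are made up.
rows = [
    ("spaces", 8, 8), ("spaces", 6, 7), ("spaces", 9, 8),
    ("tabs", 3, 6), ("tabs", 1, 4), ("tabs", 5, 2),
]

# Collect absolute errors per cohort.
errors = defaultdict(list)
for style, true, pred in rows:
    errors[style].append(abs(true - pred))

# Mean absolute error for each cohort.
mae = {style: sum(errs) / len(errs) for style, errs in errors.items()}

# Disparity between the best and worst cohort (assumed definitions).
difference = max(mae.values()) - min(mae.values())
ratio = min(mae.values()) / max(mae.values())

print(mae)
print(f"difference: {difference:.3f}, ratio: {ratio:.3f}")
```

A difference near 0 and a ratio near 1 would indicate that the model performs similarly for both groups; the further the values drift from those, the larger the disparity.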

We can use the data explorer to see if the feature distribution in our dataset is skewed. A skewed distribution can cause a model to incorrectly predict datapoints belonging to an underrepresented group or to be optimized along an inappropriate metric. If we bin our x-axis by the ground truth scores a programmer can get (where 7-10 is the accepted range) and look at the style, we see a highly skewed distribution: programmers who use tabs are scored lower and programmers who use spaces are scored higher.
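
The same kind of check can be approximated outside the dashboard with a simple cross-tabulation. This is a minimal sketch on made-up data; the two bins (below 7, and 7 to 10) follow the accepted range in the scenario, and the helper name `score_bin` is hypothetical.

```python
from collections import Counter

# Toy (style, true_score) pairs; values are made up for illustration.
data = [
    ("spaces", 8), ("spaces", 9), ("spaces", 7), ("spaces", 6),
    ("tabs", 3), ("tabs", 1), ("tabs", 5), ("tabs", 8),
]


def score_bin(score):
    """Bin a score as below the accepted range ("0-6") or inside it ("7-10")."""
    return "7-10" if score >= 7 else "0-6"


# Count datapoints for every (style, bin) combination.
counts = Counter((style, score_bin(score)) for style, score in data)

for style in ("spaces", "tabs"):
    total = sum(n for (s, _), n in counts.items() if s == style)
    accepted = counts[(style, "7-10")]
    print(f"{style}: {accepted}/{total} in the accepted range")
```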

Additionally, since we know our model made the most errors for those living in Antarctica, when we investigate location, we see a highly skewed distribution of programmers living in Antarctica who were scored lower. This means that our model will unfairly favor those who use spaces and do not live in Antarctica when providing access to the application we built.

Coming down to our aggregate feature importance, we can see which features were the most important to the model’s predictions overall. Style (tabs or spaces) is by far the most important, followed by operating system and then programming language. If we click into style, we can see that using ‘spaces’ has a positive feature importance and ‘tabs’ has a negative feature importance, showing us that ‘spaces’ is what contributes to a higher score.

We can also look at two specific programmers who got a low and a high score. Row 35 has a high score and uses spaces, and row 2 has a low score and uses tabs. When we look at the individual feature importance of each programmer’s features, we can see that ‘spaces’ contributed positively to Row 35’s high score, while ‘tabs’ contributed negatively, pushing Row 2 towards a lower score.
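
For intuition about individual versus aggregate feature importance, consider a plain linear model: each feature's contribution to one prediction can be taken as its weight times how far the feature value sits from the dataset average, and the aggregate importance as the mean absolute contribution. The sketch below uses that simplification with hypothetical weights and made-up rows; it is a stand-in for, not a reproduction of, the explainers behind the dashboard.

```python
# Hypothetical weights of a linear model; not taken from the article's model.
weights = {"uses_spaces": 4.0, "yoe": 0.1}

# Toy dataset; uses_spaces is 1 for spaces and 0 for tabs.
dataset = [
    {"uses_spaces": 1, "yoe": 16},
    {"uses_spaces": 0, "yoe": 9},
    {"uses_spaces": 0, "yoe": 7},
    {"uses_spaces": 1, "yoe": 15},
]

# Average value of each feature across the dataset.
means = {f: sum(row[f] for row in dataset) / len(dataset) for f in weights}


def local_importance(row):
    """Per-feature contribution to one prediction, relative to the average row."""
    return {f: weights[f] * (row[f] - means[f]) for f in weights}


def aggregate_importance(rows):
    """Mean absolute contribution of each feature across all rows."""
    contribs = [local_importance(row) for row in rows]
    return {f: sum(abs(c[f]) for c in contribs) / len(rows) for f in weights}


print(local_importance(dataset[0]))  # a programmer who uses spaces
print(local_importance(dataset[1]))  # a programmer who uses tabs
print(aggregate_importance(dataset))
```

In this toy setup the spaces user gets a positive contribution from style and the tabs user a negative one of the same size, mirroring the pattern described for Row 35 and Row 2.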

We can take a deeper look with counterfactual what-if examples. When selecting someone whose prediction falls below the 7 to 10 range, we can see what bare minimum changes to their features would lead to much higher predictions. In this programmer’s case, one recommended change would be switching their style to spaces.
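
The idea of a "bare minimum change" can be sketched as a brute-force search: try changing one feature at a time, then two, and stop at the first level where some change moves the prediction into the desired range. The scoring function, candidate values, and function names below are all hypothetical, and this exhaustive search is only a stand-in for the counterfactual generator used by the dashboard.

```python
from itertools import combinations, product


def predict(row):
    """Hypothetical scoring model standing in for the trained regressor."""
    score = 2.0
    score += 5.0 if row["style"] == "spaces" else 0.0
    score += 1.0 if row["os"] == "Linux" else 0.0
    score += 0.1 * row["repos"]
    return score


# Candidate values for each feature that may be varied (assumed).
CANDIDATES = {
    "style": ["tabs", "spaces"],
    "os": ["Linux", "MacOS", "Windows"],
    "repos": [0, 2, 5],
}


def counterfactuals(row, features_to_vary, low=7.0, high=10.0):
    """Return the smallest sets of feature changes that reach [low, high]."""
    for n_changes in range(1, len(features_to_vary) + 1):
        found = []
        for subset in combinations(features_to_vary, n_changes):
            for values in product(*(CANDIDATES[f] for f in subset)):
                changes = dict(zip(subset, values))
                # Skip candidates that leave a chosen feature unchanged.
                if any(row[f] == v for f, v in changes.items()):
                    continue
                if low <= predict({**row, **changes}) <= high:
                    found.append(changes)
        if found:
            return found
    return []


programmer = {"style": "tabs", "os": "MacOS", "repos": 0}
print(predict(programmer))
print(counterfactuals(programmer, ["style", "os", "repos"]))
```

Restricting `features_to_vary` plays the same role as controlling which features may be perturbed: if style is excluded for this toy programmer, no combination of the remaining features reaches the accepted range.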

Finally, if we want to use purely historical data to identify the features that have the most direct effect on our outcome of interest (in this case, the score), we can use causal analysis. Here, we want to understand the causal effect of years of experience and the number of GitHub repos a programmer has contributed to on the score. The aggregate causal effects show that, on average across the whole dataset, increasing the number of GitHub repos by 1 increases the score by 0.095, whereas increasing years of experience by 1 barely changes the score.
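
To see why a causal estimate can differ from a plain correlation, the sketch below builds a made-up dataset in which one more GitHub repo raises the score by exactly 0.095, while years of experience influences both the repo count and the score. It then recovers the effect by "partialling out" years of experience: regress both the treatment and the outcome on the confounder and fit the residuals against each other. This is a simplified, linear version of the residual-on-residual idea; no claim is made that it matches the estimator used by the dashboard, and all data and coefficients other than 0.095 are invented.

```python
def ols_fit(x, y):
    """Slope and intercept of a one-variable least-squares fit."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var = sum((a - mean_x) ** 2 for a in x)
    slope = cov / var
    return slope, mean_y - slope * mean_x


def residuals(x, y):
    """What is left of y after removing its linear dependence on x."""
    slope, intercept = ols_fit(x, y)
    return [b - (slope * a + intercept) for a, b in zip(x, y)]


# Made-up data: years of experience affects both repos and score.
yoe = [1, 2, 3, 4, 5, 6]
repos = [0, 2, 1, 3, 2, 5]
score = [1.0 + 0.2 * x + 0.095 * t for x, t in zip(yoe, repos)]

# Naive estimate: ignores the confounder and overstates the effect.
naive, _ = ols_fit(repos, score)

# Partialled-out estimate: remove the influence of yoe from both sides first.
effect, _ = ols_fit(residuals(yoe, repos), residuals(yoe, score))

print(f"naive slope: {naive:.3f}, adjusted effect: {effect:.3f}")
```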

However, if we want to look at individual programmers and perturb those values and see the outcome of specific treatments to years of experience, we can see that for some programmers, increasing the years of experience does cause the score to increase by a bit.

Additionally, the treatment policy tab can help us decide what overall treatment policy to take to maximize real-world impact on our score. We can see the best future interventions to apply to certain segments of our programmer population to see the biggest boost in scores overall.

And if you can only focus on 10 programmers to reach out to, you can see a ranked list of top k programmers who would gain the most from either increasing or decreasing the number of GitHub repos.
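
Ranking the top k programmers can be sketched as sorting by the size of each person's estimated effect and recommending the direction that yields a gain. The per-programmer effects, identifiers, and function names below are hypothetical; in practice the estimates would come from the causal analysis.

```python
import heapq

# Hypothetical estimated change in score from one more GitHub repo, per programmer.
effects = {"p01": 0.31, "p02": -0.40, "p03": 0.12, "p04": 0.44, "p05": 0.02}


def recommend(effect):
    """Pick the direction of change that would raise the score."""
    return "increase" if effect > 0 else "decrease"


def top_k(effects, k):
    """Return (programmer, recommended treatment, expected gain) for the k largest gains."""
    ranked = heapq.nlargest(k, effects.items(), key=lambda item: abs(item[1]))
    return [(pid, recommend(effect), abs(effect)) for pid, effect in ranked]


for pid, treatment, gain in top_k(effects, 3):
    print(f"{pid}: {treatment} repos, expected gain {gain:.2f}")
```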

Read the UI overview of how to use the different charts and visualizations of the Responsible AI dashboard.

Next steps

Learn more about the RAI dashboard and scorecard in the Microsoft documentation and generate them today to boost justified trust and appropriate reliance in your AI-driven processes.


In the past year, our teams across the globe have joined forces to release the very first one-stop-shop dashboard for easy implementation of responsible AI in practice, making these efforts available to the community as open source and as part of the Azure Machine Learning ecosystem. We acknowledge their great efforts and are excited to see how you use this tool in your AI lifecycle.

Azure Machine Learning:

Microsoft Research:

About the author: Minsoo Thigpen is a Product Manager at Microsoft Azure Machine Learning designing and building out Responsible AI tools for data scientists. She has bachelor’s degrees in Applied Mathematics and Painting from Brown University and Rhode Island School of Design (RISD). Coming from an interdisciplinary background with experience in building machine learning models and products, analyzing data, and designing UX, she is always finding work at the intersection of AI/ML, design, and social sciences to empower data and ML practitioners to work ethically and responsibly end-to-end.
