Each test entry below is described by its Name, Category, Description, Why it matters, and Configuration.
Name: Average Confidence
Category: Model Performance
Description: This test checks the average confidence of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. The “confidence” of a prediction for classification tasks is defined as the distance between the probability of the predicted class (the argmax over the prediction vector) and 1. We average this metric across all predictions.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly. Since labels are often unavailable in a production setting, this metric can serve as a useful proxy for model performance.
Configuration: By default, this test runs if predictions are specified (no labels required).
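A minimal sketch of the average-confidence computation described above, assuming a NumPy array of per-class predicted probabilities; the function name is illustrative:

```python
import numpy as np

def average_confidence(probs: np.ndarray) -> float:
    """Average "confidence" as defined above: the distance between the
    probability of the predicted (argmax) class and 1, averaged over predictions."""
    top = probs.max(axis=1)           # probability of the predicted class
    return float(np.mean(1.0 - top))  # distance from 1, averaged

# Compare the metric between a reference and an evaluation set.
ref_probs = np.array([[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]])
eval_probs = np.array([[0.55, 0.45], [0.6, 0.4], [0.52, 0.48]])
print(average_confidence(ref_probs), average_confidence(eval_probs))
```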
Name: Average Thresholded Confidence
Category: Model Performance
Description: This test checks the average thresholded confidence (ATC) of the model predictions between the reference and evaluation sets to see if the metric has experienced significant degradation. ATC is a method for estimating the accuracy of unlabeled examples, taken from this paper: https://arxiv.org/abs/2201.04234. The threshold is first computed on the reference set: we pick a confidence threshold such that the percentage of datapoints whose max predicted probability is less than the threshold is roughly equal to the error rate of the model (1 - accuracy) on the reference set. We then apply this threshold to the evaluation set: the predicted accuracy is the percentage of datapoints with max predicted probability greater than this threshold.
Why it matters: During production, factors like distribution shift may cause model performance to decrease significantly. Since labels are often unavailable in a production setting, this metric can serve as a useful proxy for model performance.
Configuration: By default, this test runs if predictions and labels are specified in the reference set and predictions are specified in the eval set (no labels required).
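A minimal sketch of the ATC procedure described above, assuming NumPy arrays of class probabilities and integer labels; function names are illustrative:

```python
import numpy as np

def atc_threshold(ref_probs: np.ndarray, ref_labels: np.ndarray) -> float:
    """Pick a confidence threshold on the reference set so that the fraction of
    points with max probability below it roughly matches the reference error rate."""
    max_prob = ref_probs.max(axis=1)
    accuracy = np.mean(ref_probs.argmax(axis=1) == ref_labels)
    error_rate = 1.0 - accuracy
    # The error_rate-quantile leaves roughly error_rate of the points below it.
    return float(np.quantile(max_prob, error_rate))

def atc_predicted_accuracy(eval_probs: np.ndarray, threshold: float) -> float:
    """Estimated accuracy on unlabeled data: share of points above the threshold."""
    return float(np.mean(eval_probs.max(axis=1) > threshold))
```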
Name: Calibration Comparison
Category: Model Performance
Description: This test checks that the reference and evaluation sets have sufficiently similar calibration curves, as measured by the Mean Squared Error (MSE) between the two curves. The calibration curve is a line plot where the x-axis is the average predicted probability and the y-axis is the observed proportion of positives. The curve of an ideally calibrated model is thus the straight line y = x from (0, 0) to (1, 1).
Why it matters: Knowing how well-calibrated your model is can help you better interpret and act upon model outputs, and can even be an indicator of generalization. A greater difference between reference and evaluation curves could indicate a lack of generalizability. In addition, a change in calibration could indicate that decision-making or thresholding conducted upstream needs to change, as it is behaving differently on held-out data.
Configuration: By default, this test runs over the predictions and labels.
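A minimal sketch of comparing calibration curves by MSE, using fixed-width probability bins as an assumption (the product's exact binning may differ); inputs are NumPy arrays of binary labels and predicted probabilities:

```python
import numpy as np

def calibration_curve_fixed_bins(y_true: np.ndarray, y_prob: np.ndarray, n_bins: int = 10) -> np.ndarray:
    """Observed fraction of positives per fixed-width probability bin (NaN for empty bins)."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    idx = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    frac_pos = np.full(n_bins, np.nan)
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            frac_pos[b] = y_true[mask].mean()
    return frac_pos

def calibration_mse(ref_curve: np.ndarray, eval_curve: np.ndarray) -> float:
    """MSE between the two curves over bins populated in both sets."""
    ok = ~np.isnan(ref_curve) & ~np.isnan(eval_curve)
    return float(np.mean((ref_curve[ok] - eval_curve[ok]) ** 2))
```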
Name: Precision
Category: Model Performance
Description: This test checks the Precision metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Precision metric with the below thresholds set for the absolute and degradation tests.

Name: Mean-Squared-Log Error (MSLE)
Category: Model Performance
Description: This test checks the Mean-Squared-Log Error (MSLE) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Mean-Squared-Log Error (MSLE) metric with the below thresholds set for the absolute and degradation tests.

Name: Macro Precision
Category: Model Performance
Description: This test checks the Macro Precision metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Macro Precision metric with the below thresholds set for the absolute and degradation tests.

Name: BERT Score
Category: Model Performance
Description: This test checks the BERT Score metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the BERT Score metric with the below thresholds set for the absolute and degradation tests.

Name: Multiclass Accuracy
Category: Model Performance
Description: This test checks the Multiclass Accuracy metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Multiclass Accuracy metric with the below thresholds set for the absolute and degradation tests.

Name: Mean-Absolute Error (MAE)
Category: Model Performance
Description: This test checks the Mean-Absolute Error (MAE) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Mean-Absolute Error (MAE) metric with the below thresholds set for the absolute and degradation tests.

Name: Prediction Variance
Category: Model Performance
Description: This test checks the Prediction Variance metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Prediction Variance metric with the below thresholds set for the absolute and degradation tests.

Name: F1
Category: Model Performance
Description: This test checks the F1 metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the F1 metric with the below thresholds set for the absolute and degradation tests.

Name: Mean-Squared Error (MSE)
Category: Model Performance
Description: This test checks the Mean-Squared Error (MSE) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Mean-Squared Error (MSE) metric with the below thresholds set for the absolute and degradation tests.

Name: Average Prediction
Category: Model Performance
Description: This test checks the Average Prediction metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Average Prediction metric with the below thresholds set for the absolute and degradation tests.

Name: Mean Reciprocal Rank (MRR)
Category: Model Performance
Description: This test checks the Mean Reciprocal Rank (MRR) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Mean Reciprocal Rank (MRR) metric with the below thresholds set for the absolute and degradation tests.

Name: Macro F1
Category: Model Performance
Description: This test checks the Macro F1 metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Macro F1 metric with the below thresholds set for the absolute and degradation tests.

Name: METEOR Score
Category: Model Performance
Description: This test checks the METEOR Score metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the METEOR Score metric with the below thresholds set for the absolute and degradation tests.

Name: False Negative Rate
Category: Model Performance
Description: This test checks the False Negative Rate metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the False Negative Rate metric with the below thresholds set for the absolute and degradation tests.

Name: Prediction Variance (Positive Labels)
Category: Model Performance
Description: This test checks the Prediction Variance (Positive Labels) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Prediction Variance (Positive Labels) metric with the below thresholds set for the absolute and degradation tests.

Name: Rank Correlation
Category: Model Performance
Description: This test checks the Rank Correlation metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Rank Correlation metric with the below thresholds set for the absolute and degradation tests.

Name: Recall
Category: Model Performance
Description: This test checks the Recall metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Recall metric with the below thresholds set for the absolute and degradation tests.

Name: Accuracy
Category: Model Performance
Description: This test checks the Accuracy metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Accuracy metric with the below thresholds set for the absolute and degradation tests.

Name: Average Number of Predicted Entities
Category: Model Performance
Description: This test checks the Average Number of Predicted Entities metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Average Number of Predicted Entities metric with the below thresholds set for the absolute and degradation tests.

Name: Flesch-Kincaid Grade Level
Category: Model Performance
Description: This test checks the Flesch-Kincaid Grade Level metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Flesch-Kincaid Grade Level metric with the below thresholds set for the absolute and degradation tests.

Name: Multiclass AUC
Category: Model Performance
Description: This test checks the Multiclass AUC metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Multiclass AUC metric with the below thresholds set for the absolute and degradation tests.

Name: Mean-Absolute Percentage Error (MAPE)
Category: Model Performance
Description: This test checks the Mean-Absolute Percentage Error (MAPE) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Mean-Absolute Percentage Error (MAPE) metric with the below thresholds set for the absolute and degradation tests.

Name: ROUGE Score
Category: Model Performance
Description: This test checks the ROUGE Score metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the ROUGE Score metric with the below thresholds set for the absolute and degradation tests.

Name: False Positive Rate
Category: Model Performance
Description: This test checks the False Positive Rate metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the False Positive Rate metric with the below thresholds set for the absolute and degradation tests.

Name: Average Number of Predicted Boxes
Category: Model Performance
Description: This test checks the Average Number of Predicted Boxes metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Average Number of Predicted Boxes metric with the below thresholds set for the absolute and degradation tests.

Name: Root-Mean-Squared Error (RMSE)
Category: Model Performance
Description: This test checks the Root-Mean-Squared Error (RMSE) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Root-Mean-Squared Error (RMSE) metric with the below thresholds set for the absolute and degradation tests.

Name: Average Rank
Category: Model Performance
Description: This test checks the Average Rank metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Average Rank metric with the below thresholds set for the absolute and degradation tests.

Name: Macro Recall
Category: Model Performance
Description: This test checks the Macro Recall metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Macro Recall metric with the below thresholds set for the absolute and degradation tests.

Name: SBERT Score
Category: Model Performance
Description: This test checks the SBERT Score metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the SBERT Score metric with the below thresholds set for the absolute and degradation tests.

Name: Positive Prediction Rate
Category: Model Performance
Description: This test checks the Positive Prediction Rate metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Positive Prediction Rate metric with the below thresholds set for the absolute and degradation tests.

Name: Normalized Discounted Cumulative Gain (NDCG)
Category: Model Performance
Description: This test checks the Normalized Discounted Cumulative Gain (NDCG) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Normalized Discounted Cumulative Gain (NDCG) metric with the below thresholds set for the absolute and degradation tests.

Name: AUC
Category: Model Performance
Description: This test checks the AUC metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the AUC metric with the below thresholds set for the absolute and degradation tests.

Name: BLEU Score
Category: Model Performance
Description: This test checks the BLEU Score metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the BLEU Score metric with the below thresholds set for the absolute and degradation tests.

Name: Prediction Variance (Negative Labels)
Category: Model Performance
Description: This test checks the Prediction Variance (Negative Labels) metric to see both whether its performance on the evaluation set alone is satisfactory and whether it has degraded from the reference set to the evaluation set. The key detail displays whether the metric has degraded beyond a defined threshold.
Why it matters: During production, factors like distribution shift or a change in k may cause model performance to decrease significantly.
Configuration: By default, this test runs over the Prediction Variance (Negative Labels) metric with the below thresholds set for the absolute and degradation tests.
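All of the metric tests above share the same absolute-threshold and degradation checks. A minimal sketch of that shared pattern, with hypothetical threshold names and semantics:

```python
from dataclasses import dataclass

@dataclass
class MetricTestResult:
    metric_name: str
    absolute_ok: bool
    degradation_ok: bool

def run_metric_test(metric_name, ref_value, eval_value,
                    absolute_threshold, degradation_threshold,
                    higher_is_better=True):
    """Generic absolute + degradation check shared by the metric tests above.
    Threshold names and semantics are illustrative, not the product's exact config."""
    if higher_is_better:
        absolute_ok = eval_value >= absolute_threshold
        degradation_ok = (ref_value - eval_value) <= degradation_threshold
    else:  # error-style metrics such as MSE, MAE, RMSE
        absolute_ok = eval_value <= absolute_threshold
        degradation_ok = (eval_value - ref_value) <= degradation_threshold
    return MetricTestResult(metric_name, absolute_ok, degradation_ok)

# Example: accuracy dropped from 0.91 to 0.84 with a 0.05 allowed degradation.
print(run_metric_test("Accuracy", 0.91, 0.84,
                      absolute_threshold=0.8, degradation_threshold=0.05))
```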
Name: Row-wise Toxic Content
Category: Model Alignment
Description: This test scans the model output on each row in the dataset to check if it contains toxic content. This test uses an external language model to evaluate toxicity.
Why it matters: Generative language models are trained on massive volumes of unfiltered content scraped from the web, which means they can learn to imitate harmful and offensive language. It is important to verify that your model is not responding to user inputs with toxic content.
Configuration: By default, this test runs over all inputs in the evaluation dataset.
Name: Row-wise Factual Inconsistency
Category: Factual Awareness
Description: This test scans the model output on each row in the dataset to check for false or inaccurate statements. This test requires providing a file containing the set of facts specific to your use case that are the most important items for the model to always get right.
Why it matters: Generative language models are trained to match the distribution of text observed in their training data as closely as possible. This means they are prone to generating sequences of words that are highly correlated, semantically similar, and sound coherent together, but that may not be factually consistent, a phenomenon commonly referred to as “hallucination”. It is important that your model outputs factually correct information in general, and especially that it is consistent with the information relevant to your specific application.
Configuration: By default, this test runs over all inputs in the evaluation dataset.
Name: Protected Feature Drift
Category: Bias and Fairness
Description: This test measures the change in the distribution of a feature by comparing the distribution in an evaluation set to a reference set. The test severity is a function of both the degree to which the distribution has changed and the estimated impact the observed drift has had on model performance.
Why it matters: Distribution shift between training and inference can cause degradation in model performance. If the shift is sufficiently large, retraining the model on newer data may be necessary.
Configuration: By default, this test runs over all feature columns with sufficiently many samples in both the reference and evaluation sets.
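The drift statistic is not specified above; a minimal sketch using the Population Stability Index (PSI), one common choice for a categorical feature, purely as an illustrative assumption:

```python
import numpy as np
import pandas as pd

def categorical_psi(ref: pd.Series, evl: pd.Series, eps: float = 1e-6) -> float:
    """Population Stability Index between reference and evaluation category frequencies.
    PSI is just one common drift statistic; the product may use a different measure."""
    cats = sorted(set(ref.unique()) | set(evl.unique()))
    p = ref.value_counts(normalize=True).reindex(cats, fill_value=0).to_numpy() + eps
    q = evl.value_counts(normalize=True).reindex(cats, fill_value=0).to_numpy() + eps
    return float(np.sum((p - q) * np.log(p / q)))

ref = pd.Series(["A"] * 700 + ["B"] * 300)
evl = pd.Series(["A"] * 500 + ["B"] * 500)
print(f"PSI = {categorical_psi(ref, evl):.3f}")  # > 0.2 is often treated as large drift
```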
Name: Demographic Parity (Pos Pred)
Category: Bias and Fairness
Description: This test checks whether the Selection Rate for any subset of a feature performs as well as the best Selection Rate across all subsets of that feature. The Demographic Parity is calculated as the Positive Prediction Rate. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Selection Rate of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subset. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subset divided by the best Positive Prediction Rate across all subsets.
Why it matters: Assessing differences in Selection Rate is an important measure of fairness. It is meant to be used in a setting where we assert that the base Selection Rates between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. Comparing Positive Prediction Rates and Impact Ratios over all subsets can be useful in legal/compliance settings where we want the Selection Rate for any sensitive group to fundamentally be the same as for other groups.
Configuration: By default, the Selection Rate is computed for all protected features. The severity threshold baseline is set to 80% by default, in accordance with the four-fifths rule for adverse impact detection.

Name: Demographic Parity (Avg Pred)
Category: Bias and Fairness
Description: This test checks whether the Average Prediction for any subset of a feature performs as well as the best Average Prediction across all subsets of that feature. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Average Prediction of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subset. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subset divided by the best Positive Prediction Rate across all subsets.
Why it matters: Assessing differences in Average Prediction is an important measure of fairness. It is meant to be used in a setting where we assert that the base Average Predictions between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. Comparing Positive Prediction Rates and Impact Ratios over all subsets can be useful in legal/compliance settings where we want the Average Prediction for any sensitive group to fundamentally be the same as for other groups.
Configuration: By default, the Average Prediction is computed for all protected features. The severity threshold baseline is set to 80% by default, in accordance with the four-fifths rule for adverse impact detection.

Name: Demographic Parity (Avg Rank)
Category: Bias and Fairness
Description: This test checks whether the Average Rank for any subset of a feature performs as well as the best Average Rank across all subsets of that feature. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Average Rank of model predictions within a specific subset is significantly lower than that of other subsets by taking a ratio of the rates. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subset. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subset divided by the best Positive Prediction Rate across all subsets.
Why it matters: Assessing differences in Average Rank is an important measure of fairness. It is meant to be used in a setting where we assert that the base Average Ranks between subgroups should be the same (even if empirically they are different). This contrasts with equality of opportunity or predictive parity tests, which permit classification rates to depend on a sensitive attribute. Comparing Positive Prediction Rates and Impact Ratios over all subsets can be useful in legal/compliance settings where we want the Average Rank for any sensitive group to fundamentally be the same as for other groups.
Configuration: By default, the Average Rank is computed for all protected features. The severity threshold baseline is set to 80% by default, in accordance with the four-fifths rule for adverse impact detection.
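A minimal sketch of the per-subset Positive Prediction Rates and Disparate Impact Ratios described above, with hypothetical column names and the 80% (four-fifths) baseline:

```python
import pandas as pd

def impact_ratios(df: pd.DataFrame, feature: str, pred: str, threshold: float = 0.8):
    """Positive Prediction Rate per subset of `feature` and its ratio to the best subset.
    Column names and the 0/1 `pred` encoding are assumptions for this sketch."""
    rates = df.groupby(feature)[pred].mean()   # positive prediction rate per subset
    ratios = rates / rates.max()               # Disparate Impact Ratio per subset
    flagged = ratios[ratios < threshold]       # subsets below the four-fifths baseline
    return rates, ratios, flagged

df = pd.DataFrame({"group": ["x", "x", "y", "y", "y"], "pred": [1, 1, 1, 0, 0]})
print(impact_ratios(df, "group", "pred"))
```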
Name: Class Imbalance
Category: Bias and Fairness
Description: This test checks whether the training sample size for any subset of a feature is significantly smaller than other subsets of that feature. The test first splits the dataset into the various subset classes within the feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the class imbalance measure of that subset compared to the largest subset exceeds a set threshold.
Why it matters: Assessing class imbalance is an important measure of fairness. Features with low subset sizes can result in the model overfitting those subsets, and hence cause a larger error when those subsets appear in test data. This test can be useful in legal/compliance settings where sufficient data for all subsets of a protected feature is important.
Configuration: By default, class imbalance is tested for all protected features.
Name: Equalized Odds
Category: Bias and Fairness
Description: This test checks for equal true positive and false positive rates over all subsets of each protected feature. The test first splits the dataset into the various subset classes within the feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the true positive and false positive rates of that subset vary significantly compared to the largest subset.
Why it matters: Equalized odds (or disparate mistreatment) is an important measure of fairness in machine learning. Subjects in protected groups may have different true positive rates or false positive rates, which imply that the model may be biased on those protected features. Fulfilling the condition of equalized odds may be a requirement in various legal/compliance settings.
Configuration: By default, equalized odds is tested for all protected features.
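A minimal sketch of the per-subset true/false positive rates this test compares, assuming binary 0/1 labels and predictions and hypothetical column names:

```python
import pandas as pd

def equalized_odds_rates(df: pd.DataFrame, feature: str, label: str, pred: str) -> pd.DataFrame:
    """True positive and false positive rate per subset of a protected feature."""
    rows = {}
    for value, g in df.groupby(feature):
        rows[value] = {
            "tpr": g.loc[g[label] == 1, pred].mean(),  # recall within the subset
            "fpr": g.loc[g[label] == 0, pred].mean(),  # false positive rate within the subset
            "n": len(g),
        }
    return pd.DataFrame(rows).T  # one row per subset; compare against the largest subset
```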
Name: Feature Independence
Category: Bias and Fairness
Description: This test checks the independence of each protected feature from the predicted label class. It runs over categorical protected features and uses the chi-square test of independence to determine feature independence. The test compares the observed data to a model that distributes the data according to the expectation that the variables are independent. Wherever the observed data does not fit this model, the likelihood that the variables are dependent becomes stronger.
Why it matters: A test of independence assesses whether observations consisting of measures on two variables, expressed in a contingency table, are independent of each other. This can be useful when assessing how protected features impact the predicted class and helping with the feature selection process.
Configuration: By default, this test is run over all protected categorical features.
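A minimal sketch of the chi-square test of independence on a protected feature versus the predicted class, using scipy; the example data and column names are illustrative:

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Contingency table of a categorical protected feature vs. the predicted class.
df = pd.DataFrame({"gender": ["f", "f", "m", "m", "f", "m"],
                   "pred_class": [1, 0, 1, 1, 0, 1]})
table = pd.crosstab(df["gender"], df["pred_class"])
chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.2f}, p={p_value:.3f}")  # a small p-value suggests dependence
```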
Name: Predict Protected Features
Category: Bias and Fairness
Description: The Predict Protected Features test works by training a multi-class logistic regression model to infer categorical protected features from unprotected categorical and numerical features. The model is fit to the reference data and scored based on its accuracy over the evaluation data. The unprotected categorical features are one-hot encoded.
Why it matters: In a compliance setting, it may be prohibited to include certain protected features in your training data. However, unprotected features might still provide your model with information about the protected features. If a simple logistic regression model can be trained to accurately predict protected features, your model might have a hidden reliance on protected features, resulting in biased decisions.
Configuration: By default, the selection rate is computed for all protected features.
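A minimal sketch of the proxy check described above: fit a logistic regression on one-hot-encoded unprotected features over the reference data and score its accuracy on the evaluation data. The column lists and function name are assumptions:

```python
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

def protected_feature_leakage(ref_X, ref_protected, eval_X, eval_protected,
                              categorical_cols, numeric_cols):
    """Accuracy of a logistic regression predicting a protected feature from
    unprotected features; a high score suggests hidden reliance on that feature."""
    pre = ColumnTransformer(
        [("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
        remainder="passthrough",  # numeric columns pass through unchanged
    )
    model = make_pipeline(pre, LogisticRegression(max_iter=1000))
    model.fit(ref_X[categorical_cols + numeric_cols], ref_protected)
    return model.score(eval_X[categorical_cols + numeric_cols], eval_protected)
```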
Name: Equal Opportunity (Recall)
Category: Bias and Fairness
Description: The recall test is more popularly referred to as equal opportunity or false negative error rate balance in the fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Recall of model predictions within a specific subset is significantly lower than the model prediction Recall over the entire population.
Why it matters: Having different Recall between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that, out of qualified candidates, the rate at which the model predicts a rejection is similar for groups A and B.
Configuration: By default, Recall is computed over all predictions/labels. Note that we round predictions to 0/1 to compute recall.

Name: Equal Opportunity (Macro Recall)
Category: Bias and Fairness
Description: The recall test is more popularly referred to as equal opportunity or false negative error rate balance in the fairness literature. When transitioning to the multiclass setting, we can use macro recall, which computes the recall of each individual class and then averages these numbers. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the Macro Recall of model predictions within a specific subset is significantly lower than the model prediction Macro Recall over the entire population.
Why it matters: Having different Macro Recall between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. An intuitive example is when the label indicates a positive attribute: if predicting whether to interview a given candidate, make sure that, out of qualified candidates, the rate at which the model predicts an interview is similar for groups A and B.
Configuration: By default, Macro Recall is computed over all predictions/labels. Note that the predicted label is the label with the largest predicted class probability.
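A minimal sketch of the per-subset recall comparison described above, with hypothetical column names; predictions are rounded to 0/1 as noted:

```python
import pandas as pd
from sklearn.metrics import recall_score

def recall_gap_by_group(df: pd.DataFrame, feature: str, label: str, pred: str):
    """Recall per subset of `feature` versus overall recall, and the gap to the
    worst-performing subset. Column names are assumptions for this sketch."""
    overall = recall_score(df[label], df[pred].round().astype(int))
    per_group = {
        value: recall_score(g[label], g[pred].round().astype(int))
        for value, g in df.groupby(feature)
    }
    worst = min(per_group, key=per_group.get)
    return overall, per_group, overall - per_group[worst]
```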
Name: Intersectional Group Fairness (Pos Pred)
Category: Bias and Fairness
Description: This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the positive prediction rate of model predictions within a specific subset is significantly lower than the model positive prediction rate over the entire population. This will expose hidden biases against groups at the intersection of these protected features. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subgroup. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subgroup divided by the best Positive Prediction Rate across all subgroups.
Why it matters: Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.
Configuration: This test runs over unique pairs of categorical protected features.

Name: Intersectional Group Fairness (Avg Pred)
Category: Bias and Fairness
Description: This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the average prediction of model predictions within a specific subset is significantly lower than the model average prediction over the entire population. This will expose hidden biases against groups at the intersection of these protected features. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subgroup. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subgroup divided by the best Positive Prediction Rate across all subgroups.
Why it matters: Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.
Configuration: This test runs over unique pairs of categorical protected features.

Name: Intersectional Group Fairness (Avg Rank)
Category: Bias and Fairness
Description: This test checks whether the model performs equally well across subgroups created from the intersection of protected groups. The test first creates unique pairs of categorical protected features. We then test whether the average rank of model predictions within a specific subset is significantly lower than the model average rank over the entire population. This will expose hidden biases against groups at the intersection of these protected features. Also included in this test is the Impact Ratios tab, which includes a calculation of the Disparate Impact Ratio for each subgroup. The Disparate Impact Ratio is defined as the Positive Prediction Rate for the subgroup divided by the best Positive Prediction Rate across all subgroups.
Why it matters: Most existing work in the fairness literature deals with a binary view of fairness - either a particular group is performing worse or not. This binary categorization misses the important nuance of the fairness field - that biases can often be amplified in subgroups that combine membership from different protected groups, especially if such a subgroup is particularly underrepresented in opportunities historically. The intersectional group fairness test is run over subsets representing this intersection between two protected groups.
Configuration: This test runs over unique pairs of categorical protected features.
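A minimal sketch of computing positive prediction rates and impact ratios over subgroups formed from pairs of protected features; column names are assumptions:

```python
from itertools import combinations

import pandas as pd

def intersectional_impact_ratios(df: pd.DataFrame, protected_cols, pred: str):
    """For every pair of protected features, the positive prediction rate of each
    intersectional subgroup divided by the best subgroup's rate."""
    results = {}
    for a, b in combinations(protected_cols, 2):
        rates = df.groupby([a, b])[pred].mean()  # rate per (a, b) subgroup
        results[(a, b)] = rates / rates.max()    # Disparate Impact Ratio per subgroup
    return results
```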
Name: Predictive Equality (FPR)
Category: Bias and Fairness
Description: The false positive error rate test is also popularly referred to as predictive equality, or equal mis-opportunity, in the fairness literature. This test checks whether the model performs equally well across a given subset of rows as it does across the whole dataset. The key detail displays the performance difference between the lowest performing subset and the overall population. The test first splits the dataset into various subsets depending on the quantiles of a given feature column. If the feature is categorical, the data is split based on the feature values. We then test whether the False Positive Rate of model predictions within a specific subset is significantly higher than the model prediction False Positive Rate over the entire population.
Why it matters: Having a different False Positive Rate between different subgroups is an important indicator of performance bias; in general, bias is an important phenomenon in machine learning that not only has implications for fairness and ethics, but also indicates failures in adequate feature representation and spurious correlation. Unlike demographic parity, this test permits assuming different base label rates but flags differing mistake rates between different subgroups. As an intuitive example, consider the case when the label indicates an undesirable attribute: if predicting whether a person will default on their loan, make sure that for people who didn’t default, the rate at which the model incorrectly predicts positive is similar for groups A and B.
Configuration: By default, False Positive Rate is computed over all predictions/labels. Note that we round predictions to 0/1 to compute false positive rate.
Name: Discrimination By Proxy
Category: Bias and Fairness
Description: This test checks whether any feature is a proxy for a protected feature. It runs over categorical features, using mutual information as a measure of similarity with a protected feature. Mutual information measures any dependencies between two variables.
Why it matters: A common strategy to try to ensure a model is not biased is to remove protected features from the training data entirely so the model cannot learn over them. However, if other features are highly dependent on those features, that could lead to the model effectively still training over those features by proxy.
Configuration: By default, this test is run over all categorical protected columns.
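A minimal sketch of the mutual-information proxy check, scoring each candidate categorical feature against a protected feature; column names are assumptions:

```python
import pandas as pd
from sklearn.metrics import mutual_info_score

def proxy_scores(df: pd.DataFrame, protected_col: str, candidate_cols):
    """Mutual information between a protected feature and each candidate categorical
    feature; higher values suggest the candidate acts as a proxy."""
    return {
        col: mutual_info_score(df[protected_col], df[col])
        for col in candidate_cols
    }

df = pd.DataFrame({"gender": ["f", "m", "f", "m"],
                   "zip_code": ["94110", "10001", "94110", "10001"],
                   "browser": ["a", "a", "b", "b"]})
print(proxy_scores(df, "gender", ["zip_code", "browser"]))
```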
Name: Subset Sensitivity (Pos Pred)
Category: Bias and Fairness
Description: This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Positive Prediction Rate. The test then substitutes this subset into a sample from the original data and calculates the change in Positive Prediction Rate. This test fails if the Positive Prediction Rate changes significantly between the original rows and the rows substituted with the lowest performing subset.
Why it matters: Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.
Configuration: By default, the subset sensitivity is computed for all protected features that are strings.

Name: Subset Sensitivity (Avg Pred)
Category: Bias and Fairness
Description: This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Average Prediction. The test then substitutes this subset into a sample from the original data and calculates the change in Average Prediction. This test fails if the Average Prediction changes significantly between the original rows and the rows substituted with the lowest performing subset.
Why it matters: Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.
Configuration: By default, the subset sensitivity is computed for all protected features that are strings.

Name: Subset Sensitivity (Avg Rank)
Category: Bias and Fairness
Description: This test measures how sensitive the model is to substituting the lowest performing subset of a feature into a sample of data. The test splits the dataset into various subsets based on the feature values and finds the lowest performing subset, based on the lowest Average Rank. The test then substitutes this subset into a sample from the original data and calculates the change in Average Rank. This test fails if the Average Rank changes significantly between the original rows and the rows substituted with the lowest performing subset.
Why it matters: Assessing differences in model output is an important measure of fairness. If the model performs worse because of the value of a protected feature such as race or gender, then this could indicate bias. It can be useful in legal/compliance settings where we fundamentally want the prediction for any protected group to be the same as for other groups.
Configuration: By default, the subset sensitivity is computed for all protected features that are strings.
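A rough sketch of the subset-substitution procedure described above; the `model.predict` interface (returning 0/1 labels), the sample size, and the column handling are all assumptions of this sketch, not the product's implementation:

```python
import pandas as pd

def subset_sensitivity_pos_pred(model, df: pd.DataFrame, feature: str, sample_size: int = 200) -> float:
    """Change in positive prediction rate when the lowest-rate subset's feature value
    is substituted into a random sample of rows."""
    scored = df.copy()
    scored["pred"] = model.predict(df)
    worst_value = scored.groupby(feature)["pred"].mean().idxmin()  # lowest-performing subset

    sample = df.sample(min(sample_size, len(df)), random_state=0)
    original_rate = model.predict(sample).mean()
    substituted = sample.assign(**{feature: worst_value})          # swap in the worst subset's value
    substituted_rate = model.predict(substituted).mean()
    return float(original_rate - substituted_rate)
```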
Name: Gendered Pronoun Distribution
Category: Bias and Fairness
Description: This test checks that both masculine and feminine pronouns are approximately equally likely to be predicted by the fill-mask model for various templates.
Why it matters: Fill-mask models can be tested for gender bias by analyzing predictions for a masked portion of a semantically-bleached template. If a model is significantly more likely to suggest a masculine or feminine pronoun within a sentence relative to its counterpart, it may be learning biased behaviors, which can have important ethical implications.
Configuration: This test runs only on fill-mask model tasks.

Name: Fill Mask Invariance
Category: Bias and Fairness
Description: This test uses templates to check that the word associations of fill-mask models are similar for majority and protected minority groups.
Why it matters: Fill-mask models are vulnerable to significant bias based on the target groups provided in a semantically-bleached template. If a model is significantly more likely to suggest certain attributes within a sentence for one protected group relative to a counterpart, it may be learning biased behaviors, which can have important ethical implications.
Configuration: This test runs only on fill-mask model tasks.
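As a rough illustration of how a fill-mask model's gendered-pronoun preferences can be probed (the model name and template are arbitrary choices for this sketch, not the product's implementation):

```python
from transformers import pipeline

# Compare the fill-mask probabilities assigned to "he" vs. "she" in a
# semantically-bleached template.
fill = pipeline("fill-mask", model="bert-base-uncased")
template = "[MASK] is a doctor."
scores = {d["token_str"]: d["score"] for d in fill(template, targets=["he", "she"])}
print(scores)  # a large gap between the two scores suggests gendered bias
```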
Name: Replace Masculine with Feminine Pronouns
Category: Bias and Fairness
Description: This test measures the robustness of your model to Replace Masculine with Feminine Pronouns transformations. It does this by taking a sample input, swapping all masculine pronouns in the input string to feminine ones, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

Name: Replace Feminine with Masculine Pronouns
Category: Bias and Fairness
Description: This test measures the robustness of your model to Replace Feminine with Masculine Pronouns transformations. It does this by taking a sample input, swapping all feminine pronouns in the input string to masculine ones, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.
Name: Replace Masculine with Feminine Names
Category: Bias and Fairness
Description: This test measures the invariance of your model to swapping gendered names. It does this by taking a sample input, swapping all instances of traditionally masculine names (in the provided list) with a traditionally feminine name, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences should properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.
Configuration: By default, this test runs over a sample of text instances from the evaluation set that contain one or more words from the source list.

Name: Replace Feminine with Masculine Names
Category: Bias and Fairness
Description: This test measures the invariance of your model to swapping gendered names. It does this by taking a sample input, swapping all instances of traditionally feminine names (in the provided list) with a traditionally masculine name, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences should properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.
Configuration: By default, this test runs over a sample of text instances from the evaluation set that contain one or more words from the source list.
Name: Replace Masculine with Plural Pronouns
Category: Bias and Fairness
Description: This test measures the robustness of your model to Replace Masculine with Plural Pronouns transformations. It does this by taking a sample input, swapping all masculine pronouns in the input string to plural ones, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.

Name: Replace Feminine with Plural Pronouns
Category: Bias and Fairness
Description: This test measures the robustness of your model to Replace Feminine with Plural Pronouns transformations. It does this by taking a sample input, swapping all feminine pronouns in the input string to plural ones, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences can have errors from data preprocessing or human input (mistaken or adversarial). It is important that your NLP models are robust to the introduction of such errors.
Configuration: By default, this test runs over a sample of strings from the evaluation set, and it performs this attack on 5% of the words in each input.
Name: Swap High Income with Low Income Countries
Category: Bias and Fairness
Description: This test measures the invariance of your model to country name swap transformations. It does this by taking a sample input, swapping all instances of traditionally high-income countries (in the provided list) with a traditionally low-income country, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences should properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.
Configuration: By default, this test runs over a sample of text instances from the evaluation set that contain one or more words from the source list.

Name: Swap Low Income with High Income Countries
Category: Bias and Fairness
Description: This test measures the invariance of your model to country name swap transformations. It does this by taking a sample input, swapping all instances of traditionally low-income countries (in the provided list) with a traditionally high-income country, and measuring the behavior of the model on the transformed input.
Why it matters: Production natural language input sequences should properly support people of all demographics. It is important that your NLP models are robust to spurious correlations and bias from the data.
Configuration: By default, this test runs over a sample of text instances from the evaluation set that contain one or more words from the source list.
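A minimal sketch of one such swap transformation (masculine-to-feminine pronouns); the word map is illustrative and far from exhaustive, and comparing model outputs on the original versus transformed text is left to the caller:

```python
import re

# Illustrative mapping; a real test would use a curated list of swap targets.
PRONOUN_MAP = {"he": "she", "him": "her", "his": "her", "himself": "herself"}

def swap_pronouns(text: str) -> str:
    """Replace masculine pronouns with feminine ones, preserving capitalization."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = PRONOUN_MAP[word.lower()]
        return swapped.capitalize() if word[0].isupper() else swapped
    pattern = r"\b(" + "|".join(PRONOUN_MAP) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)

original = "He said his keys were his own."
transformed = swap_pronouns(original)
# Run the model on both `original` and `transformed`; a large change in the output
# indicates sensitivity to the swap.
print(transformed)
```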