modeva.TestSuite.compare_slicing_robustness#
- TestSuite.compare_slicing_robustness(features: str, dataset: str = 'test', metric: str = None, method: str = 'uniform', bins: int | Dict = 10, n_estimators: int = 1000, threshold: float | int = None, n_repeats: int = 10, perturb_features: str | Tuple = None, perturb_method: str = 'normal', noise_levels: float | int = 0.1, random_state: int = 0)#
Compares model robustness across different data slices by analyzing performance stability under perturbations.
This function evaluates model robustness by analyzing performance across data slices: it applies perturbations to the specified features and measures the stability of each model's predictions within each segment. Different binning methods and metrics can be chosen, providing flexibility in how the analysis is conducted.
- Parameters:
features (str) – Name of the feature to use for data slicing.
dataset ({"main", "train", "test"}, default="test") – Dataset to analyze.
method ({"uniform", "quantile", "auto-xgb1", "precompute"}, default="uniform") –
Method for binning numerical features:
“uniform”: Equal-width bins
“quantile”: Equal-frequency bins
“auto-xgb1”: XGBoost-based automatic binning
“precompute”: Use pre-defined bin edges
bins (int or dict, default=10) –
Controls binning granularity:
If int: Number of bins for numerical features. For “quantile”, this is the maximum number of bins. For “auto-xgb1”, this sets XGBoost’s max_bin parameter.
If dict: Manual bin specifications for each feature, used only with method=“precompute”. Format: {feature_name: array_of_bin_edges}. Example: {“X0”: [0.1, 0.5, 0.9]}. Note: bins cannot be specified for categorical features.
metric (str, default=None) –
Model performance metric to use.
For classification (default=“AUC”): “ACC”, “AUC”, “F1”, “LogLoss”, and “Brier”
For regression (default=“MSE”): “MSE”, “MAE”, and “R2”
n_estimators (int, default=1000) – The number of estimators for XGBoost, used when method=“auto-xgb1”.
threshold (float or int, default=None) – Threshold for identifying non-robust regions. If not specified, no threshold is applied.
n_repeats (int, default=10) – Number of times to repeat perturbation analysis.
perturb_features (str or tuple, default=None) – Features to perturb. If None, all features are perturbed.
perturb_method ({"normal", "quantile"}, default="normal") – Method for perturbing numerical features.
noise_levels (float or int, default=0.1) – Magnitude of perturbation to apply.
random_state (int, default=0) – Random seed for reproducibility.
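The interplay of method, bins, perturb_method, noise_levels, and n_repeats can be pictured with a stdlib-only sketch (this is not the modeva implementation; slicing_robustness_sketch, the toy model, and the data are invented for illustration): slice a numeric feature into equal-width (“uniform”) segments, perturb it with Gaussian (“normal”) noise n_repeats times, and report the mean per-segment MSE.

```python
import random
import statistics

def slicing_robustness_sketch(x, y, predict, bins=10, noise_level=0.1,
                              n_repeats=10, random_state=0):
    """Mean per-segment MSE of `predict` after Gaussian perturbation of x."""
    rng = random.Random(random_state)
    lo, hi = min(x), max(x)
    width = (hi - lo) / bins or 1.0
    # Assign each sample to an equal-width ("uniform") segment.
    seg = [min(int((v - lo) / width), bins - 1) for v in x]
    # Scale the noise relative to the feature's spread.
    scale = noise_level * statistics.pstdev(x)
    results = {}
    for s in range(bins):
        idx = [i for i, g in enumerate(seg) if g == s]
        if not idx:
            continue
        mses = []
        for _ in range(n_repeats):
            # "normal" perturbation: add Gaussian noise to the feature,
            # then score the model's predictions on the perturbed inputs.
            errs = [(predict(x[i] + rng.gauss(0.0, scale)) - y[i]) ** 2
                    for i in idx]
            mses.append(statistics.fmean(errs))
        results[s] = {"Size": len(idx), "MSE": statistics.fmean(mses)}
    return results

# Toy data: y = 2x, scored with the matching linear "model".
xs = [i / 99 for i in range(100)]
ys = [2 * v for v in xs]
out = slicing_robustness_sketch(xs, ys, predict=lambda v: 2 * v, bins=5)
```

Segments whose MSE rises sharply under perturbation are the non-robust regions that the threshold parameter would flag.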
- Returns:
A container object with the following components:
key: “compare_slicing_robustness”
data: Name of the dataset used
model: List of model names being compared
inputs: Input parameters used for the analysis
value: Dictionary of (“<model_name>”, item) pairs, where each item is a nested dictionary containing the performance metric information (after perturbation) for each segment:
“Feature”: feature name
“Segment”: segment value (categorical) or segment range (numerical)
“Size”: number of samples in this segment
<metric>: perturbed model performance metric value for this segment
“Sample_ID”: sample indices of this segment
“Sample_Dataset”: dataset name, e.g., “train”, “test”, etc.
“Segment_Info”: explicit definition of this segment, similar to “Segment”
“Weak”: boolean indicator showing whether this segment is weak or not
table: DataFrame with comparative performance metrics
options: Dictionary of visualization configurations for a multi-line plot where the x-axis is the selected slicing feature and the y-axis is the performance metric (after perturbation). Run results.plot() to show this plot.
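The nesting of the value component can be pictured as a plain dictionary built from the fields listed above; a minimal hand-built illustration (the model name “M1” and all values are invented):

```python
# Illustrative only: the shape of one per-segment item in `value`,
# keyed by model name, using the fields documented above.
value = {
    "M1": [
        {
            "Feature": "X0",
            "Segment": "[0.1, 0.5)",        # numerical segment range
            "Size": 120,
            "MSE": 0.034,                   # perturbed metric value
            "Sample_ID": [3, 7, 12],        # indices, truncated here
            "Sample_Dataset": "test",
            "Segment_Info": "0.1 <= X0 < 0.5",
            "Weak": False,
        },
    ],
}

# Collect the weak segments across all compared models.
weak = [seg for segs in value.values() for seg in segs if seg["Weak"]]
```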
- Return type:
Examples