modeva.TestSuite.compare_slicing_reliability

TestSuite.compare_slicing_reliability(features: str, train_dataset: str = 'test', test_dataset: str = 'test', test_size: float = 0.5, method: str = 'uniform', bins: int | Dict = 10, n_estimators: int = 1000, threshold: float | int = None, metric: str = 'width', alpha: float = 0.1, max_depth: int = 5, random_state: int = 0)

Compares reliability metrics across different slices of data for multiple models.

The specified feature is used to segment the dataset into slices. For each slice and each model, a reliability metric (prediction interval width or coverage) is computed according to the chosen binning method and parameters.

Parameters:
  • features (str) – Name of the feature to use for slicing the data.

  • train_dataset ({"main", "train", "test"}, default="test") – Dataset to use for training and calibration purposes.

  • test_dataset ({"main", "train", "test"}, default="test") – Dataset to use for evaluation and comparison.

  • test_size (float, default=0.5) – Proportion of data to use as test set when train_dataset equals test_dataset. Must be between 0 and 1.

  • method ({"uniform", "quantile", "auto-xgb1", "precompute"}, default="uniform") –

    Method to use for creating bins:

    • “uniform”: Equal-width bins

    • “quantile”: Equal-frequency bins

    • “auto-xgb1”: XGBoost-based automatic binning

    • “precompute”: Use pre-defined bin edges

  • bins (int or dict, default=10) –

    Controls binning granularity:

    • If int: Number of bins for numerical features. For “quantile”, this is the maximum number of bins. For “auto-xgb1”, this sets XGBoost’s max_bin parameter.

    • If dict: Manual bin specifications for each feature, used only with method=“precompute”. Format: {feature_name: array_of_bin_edges}. Example: {“X0”: [0.1, 0.5, 0.9]}. Note: Bins cannot be specified for categorical features.

  • n_estimators (int, default=1000) – Number of trees for XGBoost when using method=“auto-xgb1”.

  • threshold (float or int, default=None) – Threshold for filtering unreliable regions. If not specified, no threshold is applied.

  • metric ({"width", "coverage"}, default="width") –

    Reliability metric to compute:

    • “width”: Average prediction interval width

    • “coverage”: Average prediction coverage rate

  • alpha (float, default=0.1) – Target coverage level for prediction intervals (between 0 and 1).

  • max_depth (int, default=5) – Maximum tree depth for gradient boosting model (regression tasks only).

  • random_state (int, default=0) – Random seed for reproducibility.
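
As a rough illustration of what the binning and metric options above mean, the following pure-Python sketch builds “uniform” and “quantile” bin edges for one feature, then averages interval width and coverage within each slice. It does not call Modeva: the helper names (uniform_edges, quantile_edges, slice_reliability), the toy data, and the prediction intervals are all hypothetical, and the library's internal computation may differ.

```python
import random


def uniform_edges(values, bins):
    """Equal-width bin edges spanning min(values)..max(values)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]


def quantile_edges(values, bins):
    """Equal-frequency edges; duplicate edges are merged, so `bins` is a maximum."""
    ordered = sorted(values)
    n = len(ordered)
    edges = [ordered[min(n - 1, (i * n) // bins)] for i in range(bins)]
    edges.append(ordered[-1])
    return sorted(set(edges))


def assign_bin(x, edges):
    """Index of the bin containing x; the last bin includes its right edge."""
    for i in range(len(edges) - 2):
        if x < edges[i + 1]:
            return i
    return len(edges) - 2


def slice_reliability(feature, y_true, lower, upper, edges):
    """Per-bin sample count, average interval width, and coverage rate."""
    stats = [{"Size": 0, "width": 0.0, "coverage": 0.0}
             for _ in range(len(edges) - 1)]
    for x, y, lo, hi in zip(feature, y_true, lower, upper):
        s = stats[assign_bin(x, edges)]
        s["Size"] += 1
        s["width"] += hi - lo
        s["coverage"] += 1.0 if lo <= y <= hi else 0.0
    for s in stats:
        if s["Size"]:
            s["width"] /= s["Size"]
            s["coverage"] /= s["Size"]
    return stats


# Toy data (hypothetical): a feature skewed toward 0 and intervals
# centered at 0 whose half-width grows with the feature value.
rng = random.Random(0)
feature = [rng.random() ** 2 for _ in range(200)]
y_true = [rng.gauss(0.0, 1.0) for _ in feature]
lower = [-(1.0 + 2.0 * x) for x in feature]
upper = [1.0 + 2.0 * x for x in feature]

u_edges = uniform_edges(feature, bins=4)
q_edges = quantile_edges(feature, bins=4)
u_stats = slice_reliability(feature, y_true, lower, upper, u_edges)
q_stats = slice_reliability(feature, y_true, lower, upper, q_edges)

for name, stats in (("uniform", u_stats), ("quantile", q_stats)):
    for i, s in enumerate(stats):
        print(name, i, s["Size"], round(s["width"], 3), round(s["coverage"], 3))
```

On skewed data like this, uniform bins tend to have very uneven sizes, while quantile bins hold roughly the same number of samples each, which is one reason to prefer “quantile” when per-slice estimates need comparable sample support.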

Returns:

A container object with the following components:

  • key: “compare_slicing_reliability”

  • data: Name of the dataset used

  • model: List of model names being compared

  • inputs: Input parameters used for the analysis

  • value: Dictionary of (“<model_name>”, item) pairs, where each item is a nested dictionary containing the reliability metric information for each segment:

    • “Feature”: feature name

    • “Segment”: segment value (categorical) or segment range (numerical)

    • “Size”: number of samples in this segment

    • <“metric”>: reliability metric value of this segment

    • “Sample_ID”: sample indices of this segment

    • “Sample_Dataset”: dataset name, e.g., “train”, “test”, etc.

    • “Segment_Info”: explicit definition of this segment, similar to “Segment”

    • “Weak”: boolean indicator showing whether this segment is weak or not

  • table: DataFrame with detailed reliability statistics per slice

  • options: Dictionary of visualization configuration for a multi-line plot where the x-axis is the selected slicing feature and the y-axis is the performance metric gap. Run results.plot() to show this plot.
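
The per-segment fields listed under value can be post-processed with ordinary Python. The sketch below works on a hand-built stand-in rather than a real ValidationResult. It assumes each model maps to a list of per-segment dictionaries, which is only one possible reading of the "nested dictionary" layout described above, so adapt the access pattern to the object you actually receive. The model names, segment labels, and numbers are made up.

```python
# Hypothetical stand-in for the `value` component (metric="width").
# The list-of-dicts layout per model is an assumption, not confirmed
# by the documentation; only the field names come from the docs.
value = {
    "ModelA": [
        {"Feature": "X0", "Segment": "[0.0, 0.5)", "Size": 120,
         "width": 0.8, "Weak": False},
        {"Feature": "X0", "Segment": "[0.5, 1.0]", "Size": 80,
         "width": 1.9, "Weak": True},
    ],
    "ModelB": [
        {"Feature": "X0", "Segment": "[0.0, 0.5)", "Size": 120,
         "width": 0.7, "Weak": False},
        {"Feature": "X0", "Segment": "[0.5, 1.0]", "Size": 80,
         "width": 1.1, "Weak": False},
    ],
}


def weak_segments(value):
    """Map each model name to the labels of its segments flagged as weak."""
    return {
        model: [seg["Segment"] for seg in segments if seg["Weak"]]
        for model, segments in value.items()
    }


def widest_segment(value, metric="width"):
    """Map each model name to the segment with the largest metric value."""
    return {
        model: max(segments, key=lambda seg: seg[metric])["Segment"]
        for model, segments in value.items()
    }


weak = weak_segments(value)
widest = widest_segment(value)
print(weak)
print(widest)
```

Comparing the weak-segment lists side by side shows which models are unreliable in the same region of the slicing feature and which are not.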

Return type:

ValidationResult

Examples

Reliability Analysis (Classification)

Reliability Analysis (Regression)
