modeva.TestSuite.compare_slicing_accuracy

TestSuite.compare_slicing_accuracy(features: str, dataset: str = 'test', metric: str = None, method: str = 'uniform', bins: int | Dict = 10, n_estimators: int = 1000, threshold: float | int = None)

Compares model performance across different data slices based on a specified feature.

This method evaluates multiple models on slices of the dataset defined by the specified feature: it computes the chosen performance metric for each model on each slice and returns a structured result containing the analysis.

Parameters:
  • features (str) – Name of the feature to use for data slicing.

  • dataset ({"main", "train", "test"}, default="test") – The data set to be tested.

  • metric (str, default=None) –

    Model performance metric to use.

    • For classification (default="AUC"): "ACC", "AUC", "F1", "LogLoss", and "Brier"

    • For regression (default="MSE"): "MSE", "MAE", and "R2"

  • method ({"uniform", "quantile", "auto-xgb1", "precompute"}, default="uniform") –

    Method for binning numerical features:

    • "uniform": Equal-width binning

    • "quantile": Equal-frequency binning

    • "auto-xgb1": XGBoost-based automatic binning

    • "precompute": Use pre-specified bin edges

  • bins (int or dict, default=10) –

    Controls binning granularity:

    • If int: Number of bins for numerical features. For "quantile", this is the maximum number of bins. For "auto-xgb1", this sets XGBoost's max_bin parameter.

    • If dict: Manual bin specifications per feature, used only with method="precompute". Format: {feature_name: array_of_bin_edges}, e.g. {"X0": [0.1, 0.5, 0.9]}. Note: bins cannot be specified for categorical features. See the sketch after this parameter list.

  • n_estimators (int, default=1000) – Number of estimators for XGBoost when method="auto-xgb1".

  • threshold (float or int, default=None) – Threshold for filtering slices by the performance metric value. If not specified, no filtering is applied.
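
For illustration, here is a hedged sketch of how method and bins pair up. The TestSuite instance ts and the feature name "X0" are hypothetical, not taken from this page:

    # Hypothetical TestSuite `ts` and feature "X0"; each call pairs one
    # binning method with the matching `bins` value.
    ts.compare_slicing_accuracy(features="X0", method="uniform", bins=10)   # 10 equal-width bins
    ts.compare_slicing_accuracy(features="X0", method="quantile", bins=10)  # at most 10 equal-frequency bins
    ts.compare_slicing_accuracy(features="X0", method="auto-xgb1",
                                bins=256, n_estimators=1000)                # bins sets XGBoost's max_bin
    ts.compare_slicing_accuracy(features="X0", method="precompute",
                                bins={"X0": [0.1, 0.5, 0.9]})               # manual bin edges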

Returns:

A container object with the following components:

  • key: "compare_slicing_accuracy"

  • data: Name of the dataset used

  • model: List of model names being compared

  • inputs: Input parameters used for the analysis

  • value: Dictionary of ("<model_name>", item) pairs, where each item is a list of dicts, one per segment, with the following keys:

    • "Feature": feature name

    • "Segment": segment value (categorical) or segment range (numerical)

    • "Size": number of samples in this segment

    • "<metric>": performance metric value for this segment

    • "Sample_ID": sample indices of this segment

    • "Sample_Dataset": dataset name, e.g., "train", "test", etc.

    • "Segment_Info": explicit definition of this segment, similar to "Segment"

  • table: Pandas DataFrame with detailed slicing statistics

  • options: Dictionary of visualization configuration for a multi-line plot, where the x-axis is the selected slicing feature and the y-axis is the performance metric. Run results.plot() to display this plot.

Return type:

ValidationResult
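
As a hedged illustration of consuming the returned ValidationResult (ts is a TestSuite instance as in the examples below; the model name "LGBM" and metric key "AUC" are hypothetical):

    # Hypothetical model name "LGBM" and metric key "AUC".
    results = ts.compare_slicing_accuracy(features="X0", metric="AUC")
    for record in results.value["LGBM"]:  # one dict per segment
        print(record["Feature"], record["Segment"], record["Size"], record["AUC"])
    print(results.table)  # the same statistics as a pandas DataFrame
    results.plot()        # multi-line plot configured by results.options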

Examples

Sliced Performance (Classification)
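
A minimal sketch, assuming a prepared modeva DataSet ds and two fitted classifiers clf1 and clf2; the TestSuite constructor shown is an assumption, while the compare_slicing_accuracy call follows the signature above:

    from modeva import TestSuite

    # Assumed setup: `ds` is a prepared DataSet and `clf1`/`clf2` are
    # fitted classifiers; the constructor signature is an assumption.
    ts = TestSuite(ds, models=[clf1, clf2])

    # Compare AUC across equal-width bins of a hypothetical feature "X0".
    results = ts.compare_slicing_accuracy(
        features="X0",
        dataset="test",
        metric="AUC",
        method="uniform",
        bins=10,
    )
    print(results.table)  # per-segment AUC for each model
    results.plot()        # x-axis: "X0" bins; y-axis: AUC per model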

Sliced Performance (Regression)
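
A similar sketch for regression under the same assumptions, here with quantile binning and MSE:

    from modeva import TestSuite

    # Assumed setup as above, with fitted regressors `reg1`/`reg2`.
    ts = TestSuite(ds, models=[reg1, reg2])

    # Compare MSE across equal-frequency bins of a hypothetical feature "X1".
    results = ts.compare_slicing_accuracy(
        features="X1",
        dataset="test",
        metric="MSE",
        method="quantile",
        bins=5,
    )
    print(results.table)
    results.plot()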