modeva.TestSuite.compare_reliability#

TestSuite.compare_reliability(train_dataset: str = 'test', test_dataset: str = 'test', test_size: float = 0.5, alpha: float = 0.1, max_depth: int = 5, random_state: int = 0)#

Compares reliability performance of multiple models under data shifts by evaluating prediction intervals/sets.

This function evaluates the prediction intervals/sets of various models by comparing their reliability metrics on specified training and testing datasets. It aggregates results such as average width and coverage for each model, providing insights into their performance under different conditions.

Parameters:
  • train_dataset ({"main", "train", "test"}, default="test") – Dataset used for model training and calibration. Choose from available dataset splits.

  • test_dataset ({"main", "train", "test"}, default="test") – Dataset used for evaluation. Choose from available dataset splits.

  • test_size (float, default=0.5) – Proportion of data to use as test set when train_dataset equals test_dataset. Must be between 0 and 1.

  • alpha (float, default=0.1) – Target miscoverage rate for prediction intervals/sets. Must be between 0 and 1.

  • max_depth (int, default=5) – Maximum depth of the GBM trees used for quantile regression. Only applicable for regression tasks.

  • random_state (int, default=0) – Random seed for reproducible results.
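The alpha and test_size parameters correspond to a split-conformal setup: part of the data calibrates the interval width, and coverage is then measured on the held-out part. The following is a minimal numpy sketch of the two metrics this comparison aggregates (empirical coverage and average width); the toy data and the identity point model are illustrative assumptions, not part of the modeva API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: y = x + Gaussian noise
x = rng.uniform(0, 10, size=2000)
y = x + rng.normal(0, 1, size=2000)

# Split-conformal calibration: hold out half the data (cf. test_size=0.5)
n_cal = 1000
x_cal, y_cal = x[:n_cal], y[:n_cal]
x_test, y_test = x[n_cal:], y[n_cal:]

# Assumed point model f(x) = x; residuals on the calibration split
resid = np.abs(y_cal - x_cal)

alpha = 0.1  # target miscoverage rate, as in the alpha parameter
# Conformal quantile with the finite-sample correction
level = np.ceil((n_cal + 1) * (1 - alpha)) / n_cal
q = np.quantile(resid, level, method="higher")

lower, upper = x_test - q, x_test + q

# The two metrics reported per model in the results table
coverage = np.mean((y_test >= lower) & (y_test <= upper))  # close to 1 - alpha
avg_width = np.mean(upper - lower)                         # equals 2 * q here

print(coverage, avg_width)
```

With alpha=0.1 the empirical coverage should land near 0.9; a model whose intervals achieve that coverage with a smaller average width is the more reliable one under this comparison.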

Returns:

A container object with the following components:

  • key: "compare_reliability"

  • data: Name of the dataset used

  • model: List of model names being compared

  • inputs: Input parameters used for the analysis

  • value: Dictionary keyed by model name ("<model_name>"); each value is itself a dictionary with:

    • "interval": Prediction intervals / sets and related metrics

    • "data_info": The sample indices of reliable and unreliable samples, which can be passed on to a data distribution test, e.g.,

      # `results` is the ValidationResult returned by this method;
      # `ds` is the dataset object providing data_drift_test
      data_results = ds.data_drift_test(**results.value["LGBMRegressor"]["data_info"])
      data_results.plot("summary")
      data_results.plot(("density", "MedInc"))
      
  • table: DataFrame with detailed reliability metrics including average width and coverage for each model

  • options: Dictionary of visualization configurations. Run results.plot() to show all plots, or results.plot(name=xxx) to display a single plot; the following names are available:

    • width: Bar plot comparing empirical widths across models

    • coverage: Bar plot comparing empirical coverage across models

Return type:

ValidationResult

Examples

Reliability Analysis (Classification)

Reliability Analysis (Regression)