modeva.TestSuite.diagnose_reliability#

TestSuite.diagnose_reliability(train_dataset: str = 'test', test_dataset: str = 'test', test_size: float = 0.5, alpha: float = 0.1, max_depth: int = 5, width_threshold: float = 0.1, random_state: int = 0)#

Evaluates model reliability using split conformal prediction.

This method assesses the reliability of model predictions by generating prediction intervals (regression) or prediction sets (classification) using conformal prediction.

Parameters:
  • train_dataset ({"main", "train", "test"}, default="test") – Dataset used for calibrating the conformal prediction model. Not to be confused with the model’s original training set.

  • test_dataset ({"main", "train", "test"}, default="test") – Dataset used for evaluation.

  • test_size (float, default=0.5) – Proportion of data to use for testing when train_dataset == test_dataset. Must be between 0 and 1.

  • alpha (float, default=0.1) – Target miscoverage rate (1 - confidence level). For example, alpha=0.1 aims for 90% coverage.

  • max_depth (int, default=5) – Maximum depth of the gradient boosting trees for regression tasks. Only used when task_type is REGRESSION.

  • width_threshold (float, default=0.1) – Regression only: the proportion of samples with the widest prediction intervals that are flagged as unreliable.

  • random_state (int, default=0) – Random seed for reproducibility.

Returns:

A result object containing:

  • key: “diagnose_reliability”

  • data: Name of the dataset used

  • model: Name of the model used

  • inputs: Input parameters used for the test

  • table: DataFrame with average width and coverage metrics

  • value: Dictionary containing detailed results including:

    • “interval”: Prediction intervals / sets and related metrics

    • “data_info”: The sample indices of reliable and unreliable samples, which can be passed on to data distribution tests, e.g.,

      data_results = ds.data_drift_test(**results.value["data_info"])
      data_results.plot("summary")
      data_results.plot(("density", "MedInc"))
      
  • options: Dictionary of visualization configuration for a plot showing the prediction and prediction interval against the actual response. Run results.plot() to display it.

Return type:

ValidationResult

Notes

For regression tasks:

  • Uses residual quantile regression to calculate prediction intervals

  • The calibration dataset is split 50/50 into training (for fitting quantile regression model) and validation (for calculating the threshold of nonconformity scores) sets

  • Samples with widest prediction intervals are marked as unreliable
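The regression procedure above can be illustrated with a minimal, self-contained split conformal sketch on toy data. This is an illustration of the technique, not modeva's internal implementation: for simplicity it uses absolute residuals with a constant-width interval in place of residual quantile regression, and a least-squares slope as a stand-in model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: y = 2x + Gaussian noise
x = rng.uniform(0, 10, 1000)
y = 2 * x + rng.normal(0, 1, 1000)

# Split the calibration data 50/50, as described above
fit_x, fit_y = x[:500], y[:500]
cal_x, cal_y = x[500:], y[500:]

# Stand-in "model": least-squares slope through the origin
slope = np.sum(fit_x * fit_y) / np.sum(fit_x ** 2)

# Nonconformity score: absolute residual on the held-out calibration half
scores = np.abs(cal_y - slope * cal_x)

# Conformal quantile with the finite-sample correction
alpha = 0.1
n = len(scores)
q_hat = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))

def predict_interval(x_new):
    """Prediction interval targeting ~(1 - alpha) marginal coverage."""
    pred = slope * x_new
    return pred - q_hat, pred + q_hat

# Check empirical coverage on fresh data
test_x = rng.uniform(0, 10, 1000)
test_y = 2 * test_x + rng.normal(0, 1, 1000)
lo, hi = predict_interval(test_x)
coverage = np.mean((test_y >= lo) & (test_y <= hi))
```

With alpha=0.1, the empirical coverage lands near 90%. In the actual method, quantile regression makes interval widths vary across samples, which is what allows the width_threshold fraction of widest-interval samples to be flagged as unreliable.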

For classification tasks:

  • Generates prediction sets: {0}, {1}, {0,1}, or {}

  • Uses nonconformity scores to determine set membership

  • Samples with prediction sets {} or {0,1} are marked as unreliable
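The classification procedure can likewise be sketched on toy data. This is a hedged illustration of conformal prediction sets, not modeva's implementation: it assumes a binary model that outputs a probability for class 1, and uses the common "1 minus the probability of the true class" nonconformity score.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy binary task: the model outputs a calibrated probability p1 for class 1,
# and labels are drawn consistently with it.
p1 = rng.uniform(0, 1, 2000)
y = (rng.uniform(0, 1, 2000) < p1).astype(int)

cal_p, cal_y = p1[:1000], y[:1000]   # calibration half
new_p, new_y = p1[1000:], y[1000:]   # evaluation half

# Nonconformity score: 1 - probability assigned to the true class
scores = 1 - np.where(cal_y == 1, cal_p, 1 - cal_p)

alpha = 0.1
n = len(scores)
q_hat = np.quantile(scores, min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0))

def prediction_set(p):
    """Conformal prediction set: every class whose score is within q_hat."""
    s = set()
    if p <= q_hat:        # nonconformity of class 0 is p(class 1)
        s.add(0)
    if 1 - p <= q_hat:    # nonconformity of class 1 is p(class 0)
        s.add(1)
    return s

sets = [prediction_set(p) for p in new_p]
coverage = np.mean([t in s for t, s in zip(new_y, sets)])

# Empty or two-class sets are ambiguous -> flagged unreliable
unreliable = [s in (set(), {0, 1}) for s in sets]
```

Samples near the decision boundary tend to receive the ambiguous set {0, 1}, so the unreliable flag concentrates exactly where the model is least certain.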

Examples

Reliability Analysis (Classification)

Reliability Analysis (Regression)