Reliability Analysis (Classification)#

This example demonstrates how to analyze model reliability and calibration for classification problems using various methods and metrics.

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, use the following command (for full access, replace the token with your own):
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Import required modules

from modeva import DataSet
from modeva import TestSuite
from modeva.models import MoLGBMClassifier
from modeva.models import MoXGBClassifier
from modeva.testsuite.utils.slicing_utils import get_data_info

Load and prepare dataset

ds = DataSet()
ds.load(name="TaiwanCredit")
ds.scale_numerical(method="minmax")
ds.preprocess()
ds.set_random_split(random_state=0)

Train models

model1 = MoXGBClassifier(max_depth=2)
model1.fit(ds.train_x, ds.train_y)

model2 = MoLGBMClassifier(max_depth=2, verbose=-1, random_state=0)
model2.fit(ds.train_x, ds.train_y.ravel().astype(float))
MoLGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
                 importance_type='split', learning_rate=0.1, max_depth=2,
                 min_child_samples=20, min_child_weight=0.001,
                 min_split_gain=0.0, n_estimators=100, n_jobs=None,
                 num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
                 reg_lambda=0.0, subsample=1.0, subsample_for_bin=200000,
                 subsample_freq=0, verbose=-1)


Basic reliability analysis#

ts = TestSuite(ds, model1)

Since train_dataset == test_dataset, the test data is split into two parts: one for training (calculating the nonconformity scores) and another for evaluation. test_size (0.5) is the proportion of the test data used for training.

results = ts.diagnose_reliability(
    train_dataset="test",
    test_dataset="test",
    test_size=0.5,
    alpha=0.2,
    random_state=0
)
results.table
   Avg.Width  Avg.Coverage
0     0.9407        0.7887
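
To make the Avg.Width and Avg.Coverage metrics concrete, here is a minimal, generic sketch of split conformal prediction for binary classification. It is not Modeva's internal implementation; the score definition, quantile rule, and helper names are illustrative assumptions.

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.2):
    """Build conformal prediction sets from predicted class probabilities.

    cal_probs, test_probs: arrays of shape (n, n_classes); cal_labels: int labels.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level, capped at 1
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # A class enters the prediction set if its score is within the quantile
    return (1.0 - test_probs) <= q

# Illustrative usage with the fitted XGBoost model (cal_* / eval_* stand for the
# two halves of the test split; the names are hypothetical):
# sets = split_conformal_sets(model1.predict_proba(cal_x),
#                             cal_y.ravel().astype(int),
#                             model1.predict_proba(eval_x), alpha=0.2)
# avg_width = sets.sum(axis=1).mean()                          # Avg.Width
# labels = eval_y.ravel().astype(int)
# avg_coverage = sets[np.arange(len(labels)), labels].mean()   # Avg.Coverage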


Analyze data drift between reliable and unreliable samples of the test dataset (obtained from the reliability analysis)

data_results = ds.data_drift_test(
    **results.value["data_info"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)

Draw the PSI values of each feature

data_results.plot("summary")


Draw the density plot of the reliable and unreliable samples against “PAY_1”

data_results.plot(("density", "PAY_1"))
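
To make the PSI distance metric concrete, here is a minimal sketch of the Population Stability Index with uniform binning. The function name and the NumPy-based implementation are assumptions for illustration, not the library's internal code.

import numpy as np

def psi_uniform(reference, comparison, bins=10, eps=1e-6):
    """Population Stability Index between two 1D samples, uniform bins."""
    lo = min(reference.min(), comparison.min())
    hi = max(reference.max(), comparison.max())
    edges = np.linspace(lo, hi, bins + 1)
    # Bin proportions, with a small epsilon so empty bins do not give log(0)
    p = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    q = np.histogram(comparison, bins=edges)[0] / len(comparison) + eps
    return float(np.sum((p - q) * np.log(p / q)))

# e.g. psi_uniform(reliable_samples["PAY_1"].to_numpy(),
#                  unreliable_samples["PAY_1"].to_numpy())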


Slicing reliability#

The features argument specifies the feature to be used for slicing

results = ts.diagnose_slicing_reliability(
    features="PAY_1",
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.plot()


Multiple 1D feature reliability analysis

results = ts.diagnose_slicing_reliability(
    features=(("PAY_1", ), ("EDUCATION",), ("PAY_2", )),
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.table
Feature Segment Size Coverage Threshold Weak
0 PAY_2 [0.80, 0.89] 8 1.0000 0.8965 True
1 PAY_2 [0.71, 0.80) 4 1.0000 0.8965 True
2 PAY_2 [0.62, 0.71) 5 1.0000 0.8965 True
3 PAY_2 [0.18, 0.27) 6 1.0000 0.8965 True
4 PAY_1 [0.50, 0.60) 20 1.0000 0.8965 True
5 PAY_1 [0.70, 0.80) 2 1.0000 0.8965 True
6 PAY_1 [0.80, 0.90) 3 1.0000 0.8965 True
7 PAY_1 [0.90, 1.00] 8 1.0000 0.8965 True
8 PAY_1 [0.10, 0.20) 3339 0.9030 0.8965 True
9 PAY_2 [0.09, 0.18) 3828 0.9002 0.8965 True
10 EDUCATION 2.0 2716 0.8999 0.8965 True
11 EDUCATION 1.0 2232 0.8978 0.8965 True
12 PAY_1 [0.00, 0.10) 1243 0.8946 0.8965 False
13 PAY_1 [0.30, 0.40) 545 0.8936 0.8965 False
14 PAY_2 [0.00, 0.09) 1299 0.8922 0.8965 False
15 EDUCATION 3.0 984 0.8872 0.8965 False
16 PAY_2 [0.27, 0.36) 772 0.8847 0.8965 False
17 PAY_1 [0.20, 0.30) 783 0.8774 0.8965 False
18 PAY_2 [0.44, 0.53) 65 0.8769 0.8965 False
19 EDUCATION 0.0 68 0.8529 0.8965 False
20 PAY_2 [0.53, 0.62) 13 0.8462 0.8965 False
21 PAY_1 [0.40, 0.50) 51 0.8039 0.8965 False
22 PAY_1 [0.60, 0.70) 6 0.6667 0.8965 False
23 PAY_2 [0.36, 0.44) 0 NaN 0.8965 False


Batch mode 1D Slicing (all features by setting features=None)

results = ts.diagnose_slicing_reliability(
    features=None,
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.table
Feature Segment Size Coverage Threshold Weak
0 BILL_AMT3 [0.00, 0.10) 1 1.0 0.8965 True
1 AGE [0.90, 1.00] 1 1.0 0.8965 True
2 PAY_1 [0.50, 0.60) 20 1.0 0.8965 True
3 PAY_1 [0.70, 0.80) 2 1.0 0.8965 True
4 PAY_1 [0.80, 0.90) 3 1.0 0.8965 True
... ... ... ... ... ... ...
204 PAY_5 [0.70, 0.80) 0 NaN 0.8965 False
205 PAY_5 [0.90, 1.00] 0 NaN 0.8965 False
206 PAY_6 [0.20, 0.30) 0 NaN 0.8965 False
207 PAY_6 [0.60, 0.70) 0 NaN 0.8965 False
208 PAY_6 [0.90, 1.00] 0 NaN 0.8965 False

209 rows × 6 columns
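
With 209 segment rows, it helps to summarize which features contain weak slices. A short post-processing sketch, assuming results.table behaves like a pandas DataFrame (as the printed output suggests) and using an illustrative minimum segment size of 30:

df = results.table
# Keep weak segments that are large enough to be meaningful
weak = df[(df["Weak"]) & (df["Size"] >= 30)]
# Lowest observed coverage per feature among its weak segments
print(weak.groupby("Feature")["Coverage"].min().sort_values())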



Draw the coverage plot for a given feature, e.g. PAY_1

results.plot("PAY_1")


Analyze data drift between samples above and below the threshold

data_info = get_data_info(res_value=results.value)
data_results = ds.data_drift_test(
    **data_info["PAY_1"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)
data_results.plot("summary")


Single feature density plot

data_results.plot(("density", "PAY_1"))


2D feature interaction reliability analysis: we can use a pair of features for 2D slicing

results = ts.diagnose_slicing_reliability(
    features=("PAY_1", "EDUCATION"),
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    random_state=0
)
results.table
Feature1 Segment1 Feature2 Segment2 Size Width Threshold Weak
27 PAY_1 [0.60, 0.70) EDUCATION 3.0 1 2.0000 1.1957 True
26 PAY_1 [0.60, 0.70) EDUCATION 2.0 5 1.4000 1.1957 True
22 PAY_1 [0.50, 0.60) EDUCATION 2.0 11 1.2727 1.1957 True
19 PAY_1 [0.40, 0.50) EDUCATION 3.0 11 1.2727 1.1957 True
23 PAY_1 [0.50, 0.60) EDUCATION 3.0 4 1.2500 1.1957 True
3 PAY_1 [0.00, 0.10) EDUCATION 3.0 159 1.2453 1.1957 True
6 PAY_1 [0.10, 0.20) EDUCATION 2.0 1615 1.2111 1.1957 True
4 PAY_1 [0.10, 0.20) EDUCATION 0.0 48 1.2083 1.1957 True
2 PAY_1 [0.00, 0.10) EDUCATION 2.0 425 1.2071 1.1957 True
14 PAY_1 [0.30, 0.40) EDUCATION 2.0 280 1.2036 1.1957 True
21 PAY_1 [0.50, 0.60) EDUCATION 1.0 5 1.2000 1.1957 True
5 PAY_1 [0.10, 0.20) EDUCATION 1.0 1123 1.1995 1.1957 True
7 PAY_1 [0.10, 0.20) EDUCATION 3.0 553 1.1863 1.1957 False
1 PAY_1 [0.00, 0.10) EDUCATION 1.0 647 1.1808 1.1957 False
15 PAY_1 [0.30, 0.40) EDUCATION 3.0 133 1.1805 1.1957 False
9 PAY_1 [0.20, 0.30) EDUCATION 1.0 318 1.1698 1.1957 False
10 PAY_1 [0.20, 0.30) EDUCATION 2.0 338 1.1686 1.1957 False
38 PAY_1 [0.90, 1.00] EDUCATION 2.0 6 1.1667 1.1957 False
13 PAY_1 [0.30, 0.40) EDUCATION 1.0 131 1.1603 1.1957 False
11 PAY_1 [0.20, 0.30) EDUCATION 3.0 120 1.1583 1.1957 False
18 PAY_1 [0.40, 0.50) EDUCATION 2.0 32 1.1562 1.1957 False
8 PAY_1 [0.20, 0.30) EDUCATION 0.0 7 1.1429 1.1957 False
17 PAY_1 [0.40, 0.50) EDUCATION 1.0 8 1.1250 1.1957 False
0 PAY_1 [0.00, 0.10) EDUCATION 0.0 12 1.0833 1.1957 False
30 PAY_1 [0.70, 0.80) EDUCATION 2.0 2 1.0000 1.1957 False
34 PAY_1 [0.80, 0.90) EDUCATION 2.0 2 1.0000 1.1957 False
35 PAY_1 [0.80, 0.90) EDUCATION 3.0 1 1.0000 1.1957 False
12 PAY_1 [0.30, 0.40) EDUCATION 0.0 1 1.0000 1.1957 False
39 PAY_1 [0.90, 1.00] EDUCATION 3.0 2 1.0000 1.1957 False
16 PAY_1 [0.40, 0.50) EDUCATION 0.0 0 NaN 1.1957 False
20 PAY_1 [0.50, 0.60) EDUCATION 0.0 0 NaN 1.1957 False
24 PAY_1 [0.60, 0.70) EDUCATION 0.0 0 NaN 1.1957 False
25 PAY_1 [0.60, 0.70) EDUCATION 1.0 0 NaN 1.1957 False
28 PAY_1 [0.70, 0.80) EDUCATION 0.0 0 NaN 1.1957 False
29 PAY_1 [0.70, 0.80) EDUCATION 1.0 0 NaN 1.1957 False
31 PAY_1 [0.70, 0.80) EDUCATION 3.0 0 NaN 1.1957 False
32 PAY_1 [0.80, 0.90) EDUCATION 0.0 0 NaN 1.1957 False
33 PAY_1 [0.80, 0.90) EDUCATION 1.0 0 NaN 1.1957 False
36 PAY_1 [0.90, 1.00] EDUCATION 0.0 0 NaN 1.1957 False
37 PAY_1 [0.90, 1.00] EDUCATION 1.0 0 NaN 1.1957 False
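
Many of the 2D segments are empty (Size 0, NaN width) because few test samples fall in those regions of the feature grid. When post-processing the table, you may want to drop them first; again a sketch assuming results.table is a pandas DataFrame:

df2 = results.table
non_empty = df2[df2["Size"] > 0]   # drop segments with no test samples
# Weak segments ranked by average prediction-set width
print(non_empty[non_empty["Weak"]].sort_values("Width", ascending=False))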


Model reliability comparison#

tsc = TestSuite(ds, models=[model1, model2])
results = tsc.compare_reliability(
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    alpha=0.1,
    max_depth=5,
    random_state=0
)
results.table
    MoXGBClassifier             MoLGBMClassifier
    Avg.Width  Avg.Coverage     Avg.Width  Avg.Coverage
0      1.1957        0.8965        1.2053        0.8987


Model slicing reliability comparison

results = tsc.compare_slicing_reliability(
    features="PAY_1",
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    alpha=0.1,
    max_depth=5,
    metric="width",
    random_state=0
)
results.plot()


Total running time of the script: (0 minutes 20.849 seconds)