Reliability Analysis (Classification)#

This example demonstrates how to analyze model reliability and calibration for classification problems using various methods and metrics.

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, use the following command (for full access, replace the token with your own):
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Import required modules

from modeva import DataSet
from modeva import TestSuite
from modeva.models import MoLGBMClassifier
from modeva.models import MoXGBClassifier
from modeva.testsuite.utils.slicing_utils import get_data_info

Load and prepare dataset

ds = DataSet()
ds.load(name="TaiwanCredit")
ds.scale_numerical(method="minmax")
ds.preprocess()
ds.set_random_split(random_state=0)

Train models

model1 = MoXGBClassifier(max_depth=2)
model1.fit(ds.train_x, ds.train_y)

model2 = MoLGBMClassifier(max_depth=2, verbose=-1, random_state=0)
model2.fit(ds.train_x, ds.train_y.ravel().astype(float))
MoLGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
                 importance_type='split', learning_rate=0.1, max_depth=2,
                 min_child_samples=20, min_child_weight=0.001,
                 min_split_gain=0.0, n_estimators=100, n_jobs=None,
                 num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
                 reg_lambda=0.0, subsample=1.0, subsample_for_bin=200000,
                 subsample_freq=0, verbose=-1)


Basic reliability analysis#

ts = TestSuite(ds, model1)

Since train_dataset == test_dataset, the test data is split into two parts: one for training (calculating the nonconformity scores) and another for evaluation. test_size (0.5) is the proportion of the test data used for training.

results = ts.diagnose_reliability(
    train_dataset="test",
    test_dataset="test",
    test_size=0.5,
    alpha=0.2,
    random_state=0
)
results.table
   Avg.Width  Avg.Coverage
0     0.9407        0.7887
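
To make the Avg.Width and Avg.Coverage metrics concrete, here is a minimal, generic sketch of split conformal prediction for binary classification. It is not Modeva's internal implementation; the score definition, quantile rule, and helper names are illustrative assumptions.

import numpy as np

def split_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.2):
    """Build conformal prediction sets from predicted class probabilities.

    cal_probs, test_probs: arrays of shape (n, n_classes); cal_labels: int labels.
    """
    n = len(cal_labels)
    # Nonconformity score: 1 minus the probability assigned to the true class
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    # Finite-sample-corrected quantile level, capped at 1
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    # A class enters the prediction set if its score is within the quantile
    return (1.0 - test_probs) <= q

# Illustrative usage with the fitted XGBoost model (cal_* / eval_* stand for the
# two halves of the test split; the names are hypothetical):
# sets = split_conformal_sets(model1.predict_proba(cal_x),
#                             cal_y.ravel().astype(int),
#                             model1.predict_proba(eval_x), alpha=0.2)
# avg_width = sets.sum(axis=1).mean()                          # Avg.Width
# labels = eval_y.ravel().astype(int)
# avg_coverage = sets[np.arange(len(labels)), labels].mean()   # Avg.Coverage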


Analyze data drift between reliable and unreliable samples of the test dataset (obtained from the reliability analysis)

data_results = ds.data_drift_test(
    **results.value["data_info"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)

Draw the PSI values of each feature

data_results.plot("summary")


Draw the density plot of the reliable and unreliable samples against “PAY_1”

data_results.plot(("density", "PAY_1"))
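
To make the PSI distance metric concrete, here is a minimal sketch of the Population Stability Index with uniform binning. The function name and the NumPy-based implementation are assumptions for illustration, not the library's internal code.

import numpy as np

def psi_uniform(reference, comparison, bins=10, eps=1e-6):
    """Population Stability Index between two 1D samples, uniform bins."""
    lo = min(reference.min(), comparison.min())
    hi = max(reference.max(), comparison.max())
    edges = np.linspace(lo, hi, bins + 1)
    # Bin proportions, with a small epsilon so empty bins do not give log(0)
    p = np.histogram(reference, bins=edges)[0] / len(reference) + eps
    q = np.histogram(comparison, bins=edges)[0] / len(comparison) + eps
    return float(np.sum((p - q) * np.log(p / q)))

# e.g. psi_uniform(reliable_samples["PAY_1"].to_numpy(),
#                  unreliable_samples["PAY_1"].to_numpy())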


Slicing reliability#

The features argument specifies the feature to be used for slicing

results = ts.diagnose_slicing_reliability(
    features="PAY_1",
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.plot()


Multiple 1D feature reliability analysis

results = ts.diagnose_slicing_reliability(
    features=(("PAY_1", ), ("EDUCATION",), ("PAY_2", )),
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.table
Feature Segment Size Coverage Threshold Weak
0 PAY_2 [0.80, 0.89] 8 1.0000 0.8965 True
1 PAY_2 [0.71, 0.80) 4 1.0000 0.8965 True
2 PAY_2 [0.62, 0.71) 5 1.0000 0.8965 True
3 PAY_2 [0.18, 0.27) 6 1.0000 0.8965 True
4 PAY_1 [0.50, 0.60) 20 1.0000 0.8965 True
5 PAY_1 [0.70, 0.80) 2 1.0000 0.8965 True
6 PAY_1 [0.80, 0.90) 3 1.0000 0.8965 True
7 PAY_1 [0.90, 1.00] 8 1.0000 0.8965 True
8 PAY_1 [0.10, 0.20) 3339 0.9030 0.8965 True
9 PAY_2 [0.09, 0.18) 3828 0.9002 0.8965 True
10 EDUCATION 2.0 2716 0.8999 0.8965 True
11 EDUCATION 1.0 2232 0.8978 0.8965 True
12 PAY_1 [0.00, 0.10) 1243 0.8946 0.8965 False
13 PAY_1 [0.30, 0.40) 545 0.8936 0.8965 False
14 PAY_2 [0.00, 0.09) 1299 0.8922 0.8965 False
15 EDUCATION 3.0 984 0.8872 0.8965 False
16 PAY_2 [0.27, 0.36) 772 0.8847 0.8965 False
17 PAY_1 [0.20, 0.30) 783 0.8774 0.8965 False
18 PAY_2 [0.44, 0.53) 65 0.8769 0.8965 False
19 EDUCATION 0.0 68 0.8529 0.8965 False
20 PAY_2 [0.53, 0.62) 13 0.8462 0.8965 False
21 PAY_1 [0.40, 0.50) 51 0.8039 0.8965 False
22 PAY_1 [0.60, 0.70) 6 0.6667 0.8965 False
23 PAY_2 [0.36, 0.44) 0 NaN 0.8965 False


Batch mode 1D Slicing (all features by setting features=None)

results = ts.diagnose_slicing_reliability(
    features=None,
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    metric="coverage",
    random_state=0
)
results.table
Feature Segment Size Coverage Threshold Weak
0 BILL_AMT3 [0.00, 0.10) 1 1.0 0.8965 True
1 AGE [0.90, 1.00] 1 1.0 0.8965 True
2 PAY_1 [0.50, 0.60) 20 1.0 0.8965 True
3 PAY_1 [0.70, 0.80) 2 1.0 0.8965 True
4 PAY_1 [0.80, 0.90) 3 1.0 0.8965 True
... ... ... ... ... ... ...
204 PAY_5 [0.70, 0.80) 0 NaN 0.8965 False
205 PAY_5 [0.90, 1.00] 0 NaN 0.8965 False
206 PAY_6 [0.20, 0.30) 0 NaN 0.8965 False
207 PAY_6 [0.60, 0.70) 0 NaN 0.8965 False
208 PAY_6 [0.90, 1.00] 0 NaN 0.8965 False

209 rows × 6 columns
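
With 209 segment rows, it helps to summarize which features contain weak slices. A short post-processing sketch, assuming results.table behaves like a pandas DataFrame (as the printed output suggests) and using an illustrative minimum segment size of 30:

df = results.table
# Keep weak segments that are large enough to be meaningful
weak = df[(df["Weak"]) & (df["Size"] >= 30)]
# Lowest observed coverage per feature among its weak segments
print(weak.groupby("Feature")["Coverage"].min().sort_values())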



Draw the coverage plot for a given feature, e.g. PAY_1

results.plot("PAY_1")


Analyze data drift between samples above and below the threshold

data_info = get_data_info(res_value=results.value)
data_results = ds.data_drift_test(
    **data_info["PAY_1"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)
data_results.plot("summary")


Single feature density plot

data_results.plot(("density", "PAY_1"))


2D feature interaction reliability analysis: we can use a pair of features for 2D slicing

results = ts.diagnose_slicing_reliability(
    features=("PAY_1", "EDUCATION"),
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    random_state=0
)
results.table
Feature1 Segment1 Feature2 Segment2 Size Width Threshold Weak
27 PAY_1 [0.60, 0.70) EDUCATION 3.0 1 2.0000 1.1957 True
26 PAY_1 [0.60, 0.70) EDUCATION 2.0 5 1.4000 1.1957 True
22 PAY_1 [0.50, 0.60) EDUCATION 2.0 11 1.2727 1.1957 True
19 PAY_1 [0.40, 0.50) EDUCATION 3.0 11 1.2727 1.1957 True
23 PAY_1 [0.50, 0.60) EDUCATION 3.0 4 1.2500 1.1957 True
3 PAY_1 [0.00, 0.10) EDUCATION 3.0 159 1.2453 1.1957 True
6 PAY_1 [0.10, 0.20) EDUCATION 2.0 1615 1.2111 1.1957 True
4 PAY_1 [0.10, 0.20) EDUCATION 0.0 48 1.2083 1.1957 True
2 PAY_1 [0.00, 0.10) EDUCATION 2.0 425 1.2071 1.1957 True
14 PAY_1 [0.30, 0.40) EDUCATION 2.0 280 1.2036 1.1957 True
21 PAY_1 [0.50, 0.60) EDUCATION 1.0 5 1.2000 1.1957 True
5 PAY_1 [0.10, 0.20) EDUCATION 1.0 1123 1.1995 1.1957 True
7 PAY_1 [0.10, 0.20) EDUCATION 3.0 553 1.1863 1.1957 False
1 PAY_1 [0.00, 0.10) EDUCATION 1.0 647 1.1808 1.1957 False
15 PAY_1 [0.30, 0.40) EDUCATION 3.0 133 1.1805 1.1957 False
9 PAY_1 [0.20, 0.30) EDUCATION 1.0 318 1.1698 1.1957 False
10 PAY_1 [0.20, 0.30) EDUCATION 2.0 338 1.1686 1.1957 False
38 PAY_1 [0.90, 1.00] EDUCATION 2.0 6 1.1667 1.1957 False
13 PAY_1 [0.30, 0.40) EDUCATION 1.0 131 1.1603 1.1957 False
11 PAY_1 [0.20, 0.30) EDUCATION 3.0 120 1.1583 1.1957 False
18 PAY_1 [0.40, 0.50) EDUCATION 2.0 32 1.1562 1.1957 False
8 PAY_1 [0.20, 0.30) EDUCATION 0.0 7 1.1429 1.1957 False
17 PAY_1 [0.40, 0.50) EDUCATION 1.0 8 1.1250 1.1957 False
0 PAY_1 [0.00, 0.10) EDUCATION 0.0 12 1.0833 1.1957 False
30 PAY_1 [0.70, 0.80) EDUCATION 2.0 2 1.0000 1.1957 False
34 PAY_1 [0.80, 0.90) EDUCATION 2.0 2 1.0000 1.1957 False
35 PAY_1 [0.80, 0.90) EDUCATION 3.0 1 1.0000 1.1957 False
12 PAY_1 [0.30, 0.40) EDUCATION 0.0 1 1.0000 1.1957 False
39 PAY_1 [0.90, 1.00] EDUCATION 3.0 2 1.0000 1.1957 False
16 PAY_1 [0.40, 0.50) EDUCATION 0.0 0 NaN 1.1957 False
20 PAY_1 [0.50, 0.60) EDUCATION 0.0 0 NaN 1.1957 False
24 PAY_1 [0.60, 0.70) EDUCATION 0.0 0 NaN 1.1957 False
25 PAY_1 [0.60, 0.70) EDUCATION 1.0 0 NaN 1.1957 False
28 PAY_1 [0.70, 0.80) EDUCATION 0.0 0 NaN 1.1957 False
29 PAY_1 [0.70, 0.80) EDUCATION 1.0 0 NaN 1.1957 False
31 PAY_1 [0.70, 0.80) EDUCATION 3.0 0 NaN 1.1957 False
32 PAY_1 [0.80, 0.90) EDUCATION 0.0 0 NaN 1.1957 False
33 PAY_1 [0.80, 0.90) EDUCATION 1.0 0 NaN 1.1957 False
36 PAY_1 [0.90, 1.00] EDUCATION 0.0 0 NaN 1.1957 False
37 PAY_1 [0.90, 1.00] EDUCATION 1.0 0 NaN 1.1957 False
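
Many of the 2D segments are empty (Size 0, NaN width) because few test samples fall in those regions of the feature grid. When post-processing the table, you may want to drop them first; again a sketch assuming results.table is a pandas DataFrame:

df2 = results.table
non_empty = df2[df2["Size"] > 0]   # drop segments with no test samples
# Weak segments ranked by average prediction-set width
print(non_empty[non_empty["Weak"]].sort_values("Width", ascending=False))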


Model reliability comparison#

tsc = TestSuite(ds, models=[model1, model2])
results = tsc.compare_reliability(
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    alpha=0.1,
    max_depth=5,
    random_state=0
)
results.table
    MoXGBClassifier             MoLGBMClassifier
    Avg.Width  Avg.Coverage     Avg.Width  Avg.Coverage
0      1.1957        0.8965        1.2053        0.8987


Model slicing reliability comparison

results = tsc.compare_slicing_reliability(
    features="PAY_1",
    train_dataset="train",
    test_dataset="test",
    test_size=0.5,
    alpha=0.1,
    max_depth=5,
    metric="width",
    random_state=0
)
results.plot()


Total running time of the script: (0 minutes 20.849 seconds)