Residual Analysis (Classification)#

Evaluate and interpret model residuals for a binary classification task.
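In a binary classification setting, a residual can be taken as the observed label minus the predicted probability of the positive class, and its absolute value is a natural input for the diagnostics below. A minimal NumPy sketch of this idea (illustrative values; not Modeva's internal definition):

```python
import numpy as np

# Hypothetical labels and predicted probabilities of the positive class
y_true = np.array([1, 0, 1, 0, 1])
p_hat = np.array([0.8, 0.3, 0.4, 0.1, 0.9])

# Raw residuals: observed label minus predicted probability
residuals = y_true - p_hat

# Absolute residuals are commonly used for clustering and diagnostics
abs_residuals = np.abs(residuals)
print(abs_residuals)  # [0.2 0.3 0.6 0.1 0.1]
```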

Installation

# To install the required package, use the following command:
# !pip install modeva

Authentication

# To authenticate, run the following command (replace the token with your own to get full access):
# from modeva.utils.authenticate import authenticate
# authenticate(auth_code='eaaa4301-b140-484c-8e93-f9f633c8bacb')

Import modeva modules

from modeva import DataSet
from modeva import TestSuite
from modeva.models import MoLGBMClassifier

Load TaiwanCredit Dataset

ds = DataSet()
ds.load(name="TaiwanCredit")
ds.set_random_split()
ds.set_target("FlagDefault")

Fit an LGBM model

model = MoLGBMClassifier(name="LGBM-2", max_depth=2, verbose=-1, random_state=0)
model.fit(ds.train_x, ds.train_y.ravel())
MoLGBMClassifier(boosting_type='gbdt', class_weight=None, colsample_bytree=1.0,
                 importance_type='split', learning_rate=0.1, max_depth=2,
                 min_child_samples=20, min_child_weight=0.001,
                 min_split_gain=0.0, n_estimators=100, n_jobs=None,
                 num_leaves=31, objective=None, random_state=0, reg_alpha=0.0,
                 reg_lambda=0.0, subsample=1.0, subsample_for_bin=200000,
                 subsample_freq=0, verbose=-1)


Analyze residual feature importance#

ts = TestSuite(ds, model)
results = ts.diagnose_residual_interpret(dataset="train")
results.plot()


Visualize residuals against a predictor#

results = ts.diagnose_residual_analysis(features="PAY_1", dataset="train")
results.plot()


Visualize residuals against the response variable#

results = ts.diagnose_residual_analysis(features="FlagDefault", dataset="train")
results.plot()


Visualize residuals against the model prediction (predicted probability)#

results = ts.diagnose_residual_analysis(use_prediction=True, dataset="train")
results.plot()


Interpret residuals with an XGB depth-2 model#

results = ts.diagnose_residual_interpret(dataset='test', n_estimators=100, max_depth=2)

XGB-2 feature importance

results.plot("feature_importance")


XGB-2 effect importance

results.plot("effect_importance")


Further interpretation (main effect plot)

ts_residual = results.value["TestSuite"]
ts_residual.interpret_effects("PAY_1", dataset="test").plot()


Further interpretation (local interpretation)

ts_residual.interpret_local_fi(sample_index=20).plot()


Random forest-based residual clustering analysis (absolute residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual",
    metric="AUC",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
       AUC    Size  abs_residual
8   0.7154   546.0        0.4159
1   0.6771   505.0        0.4157
7   0.7873   303.0        0.3967
3   0.5774   395.0        0.3713
0   0.5578   828.0        0.3080
4   0.5983   580.0        0.2636
5   0.6240   397.0        0.2324
6   0.5773   353.0        0.2108
2   0.5649   572.0        0.1948
9   0.6507  1521.0        0.1457
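One way to act on a cluster table like the one above is to flag clusters that are both large and weakly separated, since those are the segments where the model fails most broadly. A small pandas sketch using the numbers above (the 0.60 AUC and 400-row thresholds are illustrative, not Modeva defaults):

```python
import pandas as pd

# Cluster diagnostics copied from the table above
clusters = pd.DataFrame(
    {
        "AUC": [0.7154, 0.6771, 0.7873, 0.5774, 0.5578,
                0.5983, 0.6240, 0.5773, 0.5649, 0.6507],
        "Size": [546, 505, 303, 395, 828, 580, 397, 353, 572, 1521],
        "abs_residual": [0.4159, 0.4157, 0.3967, 0.3713, 0.3080,
                         0.2636, 0.2324, 0.2108, 0.1948, 0.1457],
    },
    index=[8, 1, 7, 3, 0, 4, 5, 6, 2, 9],
)

# Flag clusters that are both large and weakly separated (illustrative thresholds)
weak = clusters[(clusters["AUC"] < 0.60) & (clusters["Size"] > 400)]
print(weak.index.tolist())  # [0, 4, 2]
```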


Residual value for each cluster

results.plot("cluster_residual")


Performance metric for each cluster

results.plot("cluster_performance")


Feature importance of the random forest model

results.plot("feature_importance")


Analyze data drift for a specific cluster

data_results = ds.data_drift_test(
    **results.value["clusters"][0]["data_info"],
    distance_metric="PSI",
    psi_method="uniform",
    psi_bins=10
)
data_results.plot("summary")
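The PSI reported by the drift test compares the binned distribution of each feature within the cluster against a reference sample. A minimal sketch of a uniform-binning PSI (an illustrative re-implementation of the idea behind `psi_method="uniform"`, not Modeva's exact code):

```python
import numpy as np

def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index with uniform bins over the pooled range."""
    lo = min(expected.min(), actual.min())
    hi = max(expected.max(), actual.max())
    edges = np.linspace(lo, hi, bins + 1)
    # Bin fractions for each sample; eps guards against log(0)
    e_frac = np.histogram(expected, bins=edges)[0] / len(expected) + eps
    a_frac = np.histogram(actual, bins=edges)[0] / len(actual) + eps
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))

rng = np.random.default_rng(0)
base = rng.normal(0, 1, 5000)      # reference sample (full dataset)
shifted = rng.normal(0.5, 1, 500)  # hypothetical cluster with mean shift
print(round(psi(base, shifted), 3))
```

A PSI near zero indicates matching distributions; the shifted sample yields a clearly positive value.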


data_results.plot(name=('density', 'PAY_1'))


Random forest-based residual clustering analysis (perturbed absolute residual)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="abs_residual_perturb",
    metric="AUC",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
       AUC    Size  abs_residual_perturb
8   0.6163   402.0                0.4647
0   0.5702   267.0                0.4251
4   0.7641   451.0                0.3942
6   0.5864  1374.0                0.3819
7   0.5843   324.0                0.3788
5   0.5081   207.0                0.3488
9   0.5652   816.0                0.3040
3   0.6097   518.0                0.2948
1   0.6311  1305.0                0.2389
2   0.5840   336.0                0.1384


Random forest-based residual clustering analysis (prediction interval width)#

results = ts.diagnose_residual_cluster(
    dataset="test",
    response_type="pi_width",
    metric="AUC",
    n_clusters=10,
    cluster_method="pam",
    sample_size=2000,
    rf_n_estimators=100,
    rf_max_depth=5,
)
results.table
       AUC    Size  pi_width
3   0.5111    62.0    1.9839
5   0.4744    73.0    1.9589
0   0.7062   436.0    1.9472
6   0.5538    68.0    1.2941
2   0.5236   110.0    1.2000
4   0.4581    94.0    1.1809
7   0.5564    56.0    1.1786
1   0.6386   355.0    1.0873
8   0.5405   109.0    1.0459
9   0.6352  1637.0    1.0006


Compare residual clusters across multiple models#

benchmark = MoLGBMClassifier(name="LGBM-5", max_depth=5, verbose=-1, random_state=0)
benchmark.fit(ds.train_x, ds.train_y.ravel())

tsc = TestSuite(ds, models=[model, benchmark])
results = tsc.compare_residual_cluster(dataset="test")
results.table
    LGBM-2          LGBM-5
       AUC    size     AUC    size
8   0.7154   546.0  0.7014   546.0
1   0.6771   505.0  0.6891   505.0
7   0.7873   303.0  0.7716   303.0
3   0.5774   395.0  0.6202   395.0
0   0.5578   828.0  0.5762   828.0
4   0.5983   580.0  0.5960   580.0
5   0.6240   397.0  0.6133   397.0
6   0.5773   353.0  0.6302   353.0
2   0.5649   572.0  0.6115   572.0
9   0.6507  1521.0  0.6466  1521.0
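To see where the deeper benchmark helps or hurts, one can compute per-cluster AUC deltas from the table above. A small pandas sketch (AUC values copied from the table; column names are the model names used in this example):

```python
import pandas as pd

# Per-cluster AUC for both models, copied from the comparison table above
auc = pd.DataFrame(
    {
        "LGBM-2": [0.7154, 0.6771, 0.7873, 0.5774, 0.5578,
                   0.5983, 0.6240, 0.5773, 0.5649, 0.6507],
        "LGBM-5": [0.7014, 0.6891, 0.7716, 0.6202, 0.5762,
                   0.5960, 0.6133, 0.6302, 0.6115, 0.6466],
    },
    index=[8, 1, 7, 3, 0, 4, 5, 6, 2, 9],
)

# Positive delta: the deeper model improves on that cluster
auc["delta"] = auc["LGBM-5"] - auc["LGBM-2"]
print(auc["delta"].idxmax())  # -> 6, the cluster with the largest gain
```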


results.plot("cluster_performance")


Total running time of the script: (2 minutes 36.199 seconds)

Gallery generated by Sphinx-Gallery